I attended a talk about Apache Spark, a memory-based, real-time computing framework. Spark is a data analytics effort from UC Berkeley.
Search for “The Berkeley Data Analytics Stack: Present and Future” and you will find a clearer description of the framework. In the framework, Spark serves as the computation base, and on top of it:
- There is Shark, which provides a SQL-like environment; a comparable project is Hive, which runs on MapReduce.
- There is GraphX, which is for graph-structured computation; a comparable project is Giraph.
- There is a stream processing component, Spark Streaming.
- There is a machine learning toolkit, MLlib.
In the demonstration, Spark was quite impressive compared to Hive. In particular, after the first run the data is loaded in memory, and subsequent queries are quite fast.
One point worth making is that this doesn’t come for free. Somewhere in the code, you must call the cache() method on the object to indicate that you want to keep it in memory.
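To make the opt-in caching idea concrete, here is a toy sketch in plain Python. It is not Spark code: the `LazyDataset` class, its `load_count` counter, and `slow_loader` are hypothetical names I made up for illustration. The only thing borrowed from the talk is the idea that data is reloaded on every query unless you call `cache()`, and that even then the first query still pays the loading cost.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's API)."""

    def __init__(self, loader):
        self._loader = loader    # function that produces the data, e.g. a disk read
        self._use_cache = False  # caching is opt-in, like the cache() call in the talk
        self._cached = None
        self.load_count = 0      # how many times the loader actually ran

    def cache(self):
        # Only marks the dataset; nothing is loaded until the first query.
        self._use_cache = True
        return self

    def _materialize(self):
        if self._use_cache and self._cached is not None:
            return self._cached          # served from memory
        self.load_count += 1             # expensive path: run the loader again
        data = list(self._loader())
        if self._use_cache:
            self._cached = data          # keep it for later queries
        return data

    def count(self):
        return len(self._materialize())

    def sum(self):
        return sum(self._materialize())


def slow_loader():
    # Stands in for an expensive read from disk.
    return range(1, 6)


# Without cache(): every query reloads the data.
uncached = LazyDataset(slow_loader)
uncached.count()
uncached.sum()

# With cache(): the first query loads, later queries reuse the in-memory copy.
cached = LazyDataset(slow_loader).cache()
cached.count()
cached.sum()

print(uncached.load_count)  # 2
print(cached.load_count)    # 1
```

In this toy model the cached dataset loads once for two queries, while the uncached one loads twice, which mirrors the "slow first run, fast afterwards" behavior from the demonstration.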