Storm vs Spark
Update: additional question about Storm
The question is to compare Spark to Storm (see comments below).
Spark is still based on the idea that, when the existing data volume is huge, it is cheaper to move the process to the data rather than the data to the process. Each node stores (or caches) its dataset, and jobs are submitted to the nodes, so the process moves to the data. It is very similar to Hadoop MapReduce, except that memory storage is used aggressively to avoid I/O, which makes it efficient for iterative algorithms (where the output of one step is the input of the next). Shark is just a query engine built on top of Spark, supporting ad-hoc analytical queries.
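To make the point about iterative algorithms concrete, here is a toy sketch in plain Python (not the real Spark API): the dataset is loaded once and kept in memory, so every iteration reads it from memory instead of re-reading it from disk, which is where a Hadoop-style job would pay repeated I/O costs.

```python
def load_dataset():
    """Stands in for an expensive disk/HDFS read."""
    return list(range(1, 6))

# Pay the I/O cost once; every iteration below reuses the in-memory copy.
cached = load_dataset()

# Iterative refinement: each step's output feeds the next step's input.
weights = 0.0
for step in range(3):
    weights = sum(x + weights for x in cached) / len(cached)

print(weights)  # 9.0
```

Without the `cached` variable, a disk-based engine would re-run `load_dataset()` on every pass of the loop; caching is exactly what Spark's RDD persistence avoids.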
You can see Storm as the complete architectural opposite of Spark. Storm is a distributed streaming engine: each node implements a basic process, and data items flow in and out of a network of interconnected nodes (the opposite of Spark's approach). With Storm, the data moves to the process.
Both frameworks are used to parallelize computations over massive amounts of data.
However, Storm is good at dynamically processing large numbers of small, freshly generated or collected data items (such as computing an aggregation function or real-time analytics on a Twitter stream).
Spark, by contrast, operates on a corpus of existing data (like Hadoop) that has been imported into the Spark cluster; it provides fast scanning thanks to in-memory management and minimizes the total number of I/Os for iterative algorithms.
The Spark Streaming module is comparable to Storm (both are streaming engines), but they work differently. Spark Streaming accumulates batches of data and then submits these batches to the Spark engine as if they were immutable Spark datasets, whereas Storm processes and dispatches items as soon as they are received. I don't know which one is more efficient in terms of throughput; in terms of latency, it is probably Storm.
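The two processing models above can be contrasted with a minimal plain-Python sketch (the function names are illustrative, not the real Spark Streaming or Storm APIs): one accumulates items into fixed-size batches before processing, the other handles each item the moment it arrives.

```python
def micro_batch(stream, batch_size):
    """Spark Streaming style: accumulate a batch, then process it
    as one immutable dataset."""
    results, batch = [], []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            results.append(sum(batch))  # submit the whole batch at once
            batch = []
    return results

def per_item(stream):
    """Storm style: process and emit each item as soon as it is received."""
    return [item * 2 for item in stream]

events = [1, 2, 3, 4, 5, 6]
print(micro_batch(events, 3))  # [6, 15]
print(per_item(events))        # [2, 4, 6, 8, 10, 12]
```

The batching step is where Spark Streaming's extra latency comes from: an item may sit in the buffer until the batch fills, while the per-item model emits a result immediately.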
Here are my two cents: Spark Streaming has a built-in concept of a sliding window, while in Storm you have to maintain the window yourself.
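"Maintaining the window yourself" looks roughly like the following sketch, using a bounded deque the way a hand-written Storm bolt might (the class and method names are hypothetical, not a real Storm API; Spark Streaming gives you an equivalent window operation out of the box).

```python
from collections import deque

class SlidingWindowBolt:
    """Hand-rolled sliding window: keep the last N values and
    aggregate over them on every new item."""

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)  # oldest items fall off

    def execute(self, value):
        self.window.append(value)
        return sum(self.window)  # aggregate over the current window

bolt = SlidingWindowBolt(window_size=3)
outputs = [bolt.execute(v) for v in [1, 2, 3, 4, 5]]
print(outputs)  # [1, 3, 6, 9, 12]
```

The bookkeeping (eviction, aggregation, edge cases before the window fills) is exactly what a built-in windowing primitive saves you from writing.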
**************************************
>> Understanding: the difference between "move the process to the data" and "move the data to the process"
>>> Spark is very similar to Hadoop MapReduce, except that Spark caches datasets in memory to improve performance and to make iterative algorithms more efficient (the output of one step is the input of the next).
==> This is Spark's RDD computation model
>>> Shark is just a query engine built on top of Spark, supporting ad-hoc analytical queries.
>>> Spark is still based on the idea that, when the existing data volume is huge, it is cheaper for the process to go to the data than for the data to go to the process.
>>> In Spark the process goes to the data; in Storm the data flows to the process.
>>> Put another way: in Spark the data waits to be processed, while in Storm the data actively flows to where it is processed.
>>> Storm is the complete architectural opposite of Spark: it is a distributed stream-processing engine in which each node implements a basic process and data items flow in and out across a network of interconnected nodes. So for Storm, the data goes to the process.
>>> Both Spark and Storm can be used for parallel computation over massive volumes of data.
>>> However, Storm excels at dynamically processing large numbers of small generated or collected data items (for example, computing aggregation functions or real-time analytics on a Twitter stream).
==> Storm is suited to real-time analysis of large volumes of changing data
>>> Spark, like Hadoop, applies to a body of existing data that has been imported into the Spark cluster.
>>> Spark provides fast scanning because of its in-memory management, and for iterative algorithms it minimizes the total number of I/Os.
>>> The Spark Streaming module can be compared with Storm (both are stream-processing engines), but they work differently: Spark Streaming accumulates data into batches and then submits those batches to the Spark engine as if they were immutable Spark datasets.
>>> Storm, in contrast, processes and dispatches data as soon as a node receives it.
>>> I don't know which one is more efficient in terms of throughput, but in terms of latency Storm is probably lower.
==> Performance metrics: high throughput and low latency
>>> Spark is coded in Scala; Storm is coded in Clojure.
>>> Spark Streaming has the concept of a sliding window, while in Storm you have to maintain the window yourself.