Amazon Kinesis vs Apache Spark

Overview

Amazon Kinesis

Stacks 798

Followers 604

Votes 9

Apache Spark

Stacks 3.1K

Followers 3.5K

Votes 141

GitHub Stars 42.2K

Forks 28.9K

Amazon Kinesis vs Apache Spark: What are the differences?

Key Differences between Amazon Kinesis and Apache Spark

1. Scalability: Amazon Kinesis is built for real-time streaming data and scales to very high ingest volumes. Apache Spark is a general-purpose distributed computing system that scales processing and analytics across a cluster for both batch and streaming data.

2. Architecture: Amazon Kinesis is a managed cloud service for collecting, processing, and analyzing real-time streaming data. It provides ready-to-use components, such as Kinesis Data Streams and Kinesis Data Firehose, to ingest and process data. Apache Spark is a distributed computing framework that provides a unified analytics engine for big data processing. It offers a high-level API and supports various data sources, including streaming sources.

3. Real-Time Processing: Amazon Kinesis is optimized for real-time scenarios, with near real-time ingestion and analytics of streaming data. It provides features like real-time event data streaming, data transformation, and data aggregation. Apache Spark also supports stream processing, but it is a general-purpose engine rather than a dedicated streaming service. Because it handles both batch and streaming data, it is the more versatile option.

4. Data Processing Capabilities: Amazon Kinesis focuses on data ingestion and processing at scale, with capabilities like data partitioning, record buffering, and automated scaling. It provides built-in integration with other AWS services for data storage and analytics. Apache Spark offers a wider range of data processing capabilities, including batch processing, stream processing, machine learning, graph processing, and SQL queries, backed by a rich set of libraries and APIs.

5. Cost and Pricing Model: Amazon Kinesis has a pay-as-you-go pricing model, where you pay for the resources you use. The pricing is based on the amount of data ingested, stored, and processed. Apache Spark is an open-source project and can be deployed on various infrastructure options, including cloud platforms and on-premises clusters. The cost of using Apache Spark depends on the infrastructure you choose and any additional services you integrate with.

6. Development and Deployment Ease: Amazon Kinesis provides a managed service that abstracts away much of the infrastructure management and setup, making it easy to get started with real-time data processing. It integrates well with other AWS services and provides a simple API for data ingestion and processing. Apache Spark requires more setup and configuration, as it is a distributed computing framework. It offers flexibility in terms of deployment options but may require more expertise to set up and manage a Spark cluster.
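The data partitioning mentioned in point 4 can be illustrated with a short sketch. Kinesis Data Streams hashes each record's partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains that value. The Python below is a minimal illustration only, assuming a stream whose shards split the hash space evenly; `even_shard_ranges` and `shard_for_key` are hypothetical helper names, not part of any AWS SDK.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit hash key


def even_shard_ranges(shard_count):
    """Split the 128-bit hash key space into contiguous, evenly sized ranges.

    Assumption for this sketch: shards are evenly split. A real stream can
    have uneven ranges after shard splits or merges.
    """
    step = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key, ranges):
    """Return the index of the shard whose range contains the key's MD5 hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    for index, (start, end) in enumerate(ranges):
        if start <= hash_key <= end:
            return index
    raise ValueError("hash key outside all shard ranges")


ranges = even_shard_ranges(4)
for key in ["user-1", "user-2", "user-3"]:
    print(key, "-> shard", shard_for_key(key, ranges))
```

Records with the same partition key always land on the same shard, which is what preserves per-key ordering. A skewed key distribution can therefore overload a single shard even when the stream as a whole has spare capacity.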

In summary, Amazon Kinesis is a managed service optimized for real-time streaming data processing, providing scalability, ease of use, and integration with the AWS ecosystem. Apache Spark is a versatile distributed computing framework that supports both batch and streaming data processing with a wide range of capabilities and deployment options.

Advice on Amazon Kinesis, Apache Spark

Nilesh

Technical Architect at Self Employed

Jul 8, 2020

Needs advice on

Elasticsearch

Kafka

We have a Kafka topic containing events of type A and type B. We need to perform an inner join on the two event types using a common field (primary key). The joined events are to be inserted into Elasticsearch.

Usually, type A and type B events with the same key arrive within 15 minutes of each other. In some cases they may be far apart, say 6 hours. Sometimes an event of one type never arrives.

In all cases, we should be able to find joined events instantly after they are joined, and not-joined events within 15 minutes.
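The requirements above amount to a keyed join with two timers: one that reports an unmatched event after 15 minutes, and a longer one that keeps waiting for a late partner. The sketch below is one possible in-memory illustration in plain Python, not a Kafka, Spark, or Kinesis API. The class `KeyedJoiner`, its method names, and the 6-hour retention value are assumptions made for the example; a real deployment would keep this state in a stream processor's fault-tolerant keyed state.

```python
from dataclasses import dataclass

FLUSH_AFTER = 15 * 60      # seconds before an unmatched event is reported
RETAIN_FOR = 6 * 60 * 60   # seconds to keep waiting for a late partner (assumed)


@dataclass
class Pending:
    event_type: str
    payload: dict
    arrived_at: float
    reported_unjoined: bool = False


class KeyedJoiner:
    """Buffers one pending event per key and joins it with its partner type."""

    def __init__(self):
        self.pending = {}

    def on_event(self, key, event_type, payload, now):
        """Return a joined record if the partner is buffered, else buffer the event."""
        waiting = self.pending.get(key)
        if waiting is not None and waiting.event_type != event_type:
            del self.pending[key]
            return {"key": key, waiting.event_type: waiting.payload, event_type: payload}
        # No partner yet. A repeated event of the same type replaces the old one.
        self.pending[key] = Pending(event_type, payload, now)
        return None

    def on_timer(self, now):
        """Report events unmatched for FLUSH_AFTER and drop those past RETAIN_FOR."""
        unjoined = []
        for key, waiting in list(self.pending.items()):
            age = now - waiting.arrived_at
            if age >= FLUSH_AFTER and not waiting.reported_unjoined:
                waiting.reported_unjoined = True
                unjoined.append({"key": key, waiting.event_type: waiting.payload})
            if age >= RETAIN_FOR:
                del self.pending[key]
        return unjoined


joiner = KeyedJoiner()
joiner.on_event("k1", "A", {"x": 1}, now=0)
print(joiner.on_event("k1", "B", {"y": 2}, now=60))    # joined immediately
joiner.on_event("k2", "A", {"x": 3}, now=0)
print(joiner.on_timer(now=900))                        # k2 reported as not joined
print(joiner.on_event("k2", "B", {"y": 4}, now=3600))  # late partner still joins
```

If the sink documents are indexed by the join key, a late join can overwrite the earlier not-joined document; that upsert behaviour is an assumption about how the Elasticsearch index would be set up, not something stated in the question.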

577k views

Detailed Comparison

Amazon Kinesis	Apache Spark
Amazon Kinesis can collect and process hundreds of gigabytes of data per second from hundreds of thousands of sources, allowing you to easily write applications that process information in real-time, from sources such as web site click-streams, marketing and financial information, manufacturing instrumentation and social media, and operational logs and metering data.	Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.
Real-time processing: Amazon Kinesis enables you to collect and analyze information in real time, allowing you to answer questions about the current state of your data, from inventory levels to stock trade frequencies, rather than having to wait for an out-of-date report; Easy to use: You can create a new stream, set the throughput requirements, and start streaming data quickly and easily. Amazon Kinesis automatically provisions and manages the storage required to reliably and durably collect your data stream; High throughput, elastic: Amazon Kinesis seamlessly scales to match the data throughput rate and volume of your data, from megabytes to terabytes per hour. Amazon Kinesis will scale up or down based on your needs; Integrate with Amazon S3, Amazon Redshift, and Amazon DynamoDB: With Amazon Kinesis, you can reliably collect, process, and transform all of your data in real time before delivering it to data stores of your choice, where it can be used by existing or new applications. Connectors enable integration with Amazon S3, Amazon Redshift, and Amazon DynamoDB; Build Kinesis applications: Amazon Kinesis provides developers with client libraries that enable the design and operation of real-time data processing applications. Just add the Amazon Kinesis Client Library to your Java application and it will be notified when new data is available for processing; Low cost: Amazon Kinesis is cost-efficient for workloads of any scale. You can pay as you go, and you’ll only pay for the resources you use. You can get started by provisioning low throughput streams, and only pay a low hourly rate for the throughput you need	Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk; Write applications quickly in Java, Scala or Python; Combine SQL, streaming, and complex analytics; Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, S3
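Setting the throughput requirements of a stream and paying an hourly rate for that throughput comes down to simple arithmetic. The sketch below assumes the per-shard write limits commonly documented for provisioned-mode Kinesis Data Streams (1 MB per second or 1,000 records per second); these figures should be checked against current AWS documentation. The function names are hypothetical, and the hourly rate passed in is a placeholder, not an actual AWS price.

```python
import math

# Assumed per-shard write limits for provisioned-mode Kinesis Data Streams.
# Verify against current AWS documentation before relying on them.
MAX_WRITE_MB_PER_SEC = 1.0
MAX_WRITE_RECORDS_PER_SEC = 1000


def shards_needed(records_per_sec, avg_record_kb):
    """Estimate the shard count from whichever write limit is hit first."""
    write_mb_per_sec = records_per_sec * avg_record_kb / 1024
    by_volume = math.ceil(write_mb_per_sec / MAX_WRITE_MB_PER_SEC)
    by_records = math.ceil(records_per_sec / MAX_WRITE_RECORDS_PER_SEC)
    return max(1, by_volume, by_records)


def hourly_cost(shards, rate_per_shard_hour):
    """Shard-hour cost only; the rate is a caller-supplied placeholder."""
    return shards * rate_per_shard_hour


# Large records: the volume limit dominates.
print(shards_needed(records_per_sec=5000, avg_record_kb=2))
# Small records: the record-count limit dominates.
print(shards_needed(records_per_sec=3000, avg_record_kb=0.1))
# Hypothetical rate of 0.015 per shard-hour, for illustration only.
print(hourly_cost(shards_needed(5000, 2), rate_per_shard_hour=0.015))
```

The estimate ignores read throughput, per-request charges, and retention, so it is a lower bound for capacity planning rather than a bill forecast.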
Statistics
GitHub Stars -	GitHub Stars 42.2K
GitHub Forks -	GitHub Forks 28.9K
Stacks 798	Stacks 3.1K
Followers 604	Followers 3.5K
Votes 9	Votes 141
Pros & Cons
Pros: Scalable (9)	Pros: Open-source (61); Fast and Flexible (48); One platform for every big data problem (8); Great for distributed SQL-like applications (8); Easy to install and to use (6)
Cons: Cost (3)	Cons: Speed (4)

What are some alternatives to Amazon Kinesis, Apache Spark?

Presto

Distributed SQL Query Engine for Big Data

Amazon Athena

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

Apache Flink

Apache Flink is an open source system for fast and versatile data analytics in clusters. Flink supports batch and streaming analytics, in one system. Analytical programs can be written in concise and elegant APIs in Java and Scala.

lakeFS

lakeFS is an open-source data version control system for data lakes. It provides a “Git for data” platform enabling you to implement best practices from software engineering on your data lake, including branching and merging, CI/CD, and production-like dev/test environments.

Druid

Druid is a distributed, column-oriented, real-time analytics data store that is commonly used to power exploratory dashboards in multi-tenant environments. Druid excels as a data warehousing solution for fast aggregate queries on petabyte sized data sets. Druid supports a variety of flexible filters, exact calculations, approximate algorithms, and other useful calculations.

Apache Kylin

Apache Kylin™ is an open source Distributed Analytics Engine designed to provide SQL interface and multi-dimensional analysis (OLAP) on Hadoop/Spark supporting extremely large datasets, originally contributed from eBay Inc.

Splunk

Splunk provides a platform for Operational Intelligence. Customers use it to search, monitor, analyze, and visualize machine data.

Google Cloud Dataflow

Google Cloud Dataflow is a unified programming model and a managed service for developing and executing a wide range of data processing patterns including ETL, batch computation, and continuous computation. Cloud Dataflow frees you from operational tasks like resource management and performance optimization.

Apache Impala

Impala is a modern, open source, MPP SQL query engine for Apache Hadoop. Impala is shipped by Cloudera, MapR, and Amazon. With Impala, you can query data, whether stored in HDFS or Apache HBase – including SELECT, JOIN, and aggregate functions – in real time.