Need advice about which tool to choose? Ask the StackShare community!

Amazon SQS: 2.3K stacks · 2K followers · 171 votes

Apache Spark: 3K stacks · 3.5K followers · 140 votes

Amazon SQS vs Apache Spark: What are the differences?

Introduction

Amazon SQS and Apache Spark are two popular and widely used technologies in the field of distributed computing. While they both serve the purpose of handling large volumes of data, there are several key differences between the two. This article aims to highlight and explain those differences in a concise manner.

Message Broker vs. Distributed Computing Framework: Amazon SQS is a fully managed message queuing service that allows you to decouple and scale microservices, distributed systems, and serverless applications. On the other hand, Apache Spark is an open-source distributed computing framework that provides fast in-memory data processing and analytics capabilities.
Data Processing Paradigm: Amazon SQS focuses on asynchronous message passing and follows a point-to-point queue model: producers send messages to a queue, and consumers poll the queue, process each message, and delete it. (Publish-subscribe fan-out is typically handled by pairing SQS with Amazon SNS.) This decouples sender and receiver systems. Apache Spark, by contrast, operates on batches or streams of data and follows a batch or stream processing paradigm. Its foundational abstraction for processing large datasets is the resilient distributed dataset (RDD), on which the higher-level DataFrame and Dataset APIs are built.
Storage Requirement: In Amazon SQS, messages are stored redundantly within AWS-managed infrastructure, so you do not manage the storage yourself. Messages are retained only for a limited period (up to 14 days), so SQS is not a long-term data store. Apache Spark, in contrast, does not provide its own persistent storage; it typically reads input from and writes results to an external system that you set up, such as the Hadoop Distributed File System (HDFS) or Amazon S3.
Compatibility and Integration: Amazon SQS is a cloud-native service provided by Amazon Web Services (AWS), and it seamlessly integrates with other AWS services like Lambda, EC2, and S3, making it easy to build serverless architectures. Apache Spark, being an open-source technology, can run on various platforms, including AWS, and allows integration with multiple data sources and databases.
Fault Tolerance and Scalability: Amazon SQS provides high fault tolerance by replicating messages across multiple availability zones within a region, ensuring high availability and durability. It also scales automatically to accommodate variable message traffic. Apache Spark, on the other hand, offers fault tolerance through the concept of RDD lineage, allowing the reconstruction of lost data partitions. It provides horizontal scalability by distributing the dataset and computation across a cluster of machines.
Real-time vs. Batch Processing: Amazon SQS delivers messages asynchronously, which makes it suitable for messaging scenarios where the sender does not need the data to be processed immediately; it moves messages between components but performs no computation on them. Apache Spark, in contrast, is designed for both stream and batch processing. It can process streaming data in near real time (in micro-batches by default) and run complex batch analytics on large volumes of data.
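The lineage idea mentioned under fault tolerance can be illustrated with a small sketch. This is plain Python, not Spark's API: the `Partition` class and `map_partition` helper are hypothetical names invented for this example. Each partition records its parent and the transformation that produced it, so lost data can be recomputed rather than restored from a replica.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Partition:
    """A toy stand-in for one partition of a dataset (not Spark's API)."""
    source: Optional[List[int]] = None      # set only for root partitions
    parent: Optional["Partition"] = None    # lineage: where the data came from
    transform: Optional[Callable[[List[int]], List[int]]] = None
    cached: Optional[List[int]] = None      # materialized data, may be lost

    def compute(self) -> List[int]:
        # Serve from the cache while the materialized data is still available.
        if self.cached is not None:
            return self.cached
        if self.parent is None:
            data = list(self.source or [])
        else:
            # Rebuild by replaying the recorded transformation on the parent.
            data = self.transform(self.parent.compute())
        self.cached = data
        return data


def map_partition(parent: Partition, fn: Callable[[int], int]) -> Partition:
    """Derive a new partition; only the recipe is stored, not the data."""
    return Partition(parent=parent, transform=lambda rows: [fn(r) for r in rows])


root = Partition(source=[1, 2, 3])
doubled = map_partition(root, lambda x: x * 2)
plus_one = map_partition(doubled, lambda x: x + 1)

first = plus_one.compute()

# Simulate losing the materialized data on a failed node.
plus_one.cached = None
doubled.cached = None
recovered = plus_one.compute()  # recomputed from lineage
```

In this sketch the recovered result matches the first computation because the transformations are deterministic, which is the property lineage-based recovery relies on.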

In summary, Amazon SQS is a managed message queuing service that offers asynchronous messaging and decoupling of distributed systems, whereas Apache Spark is a distributed computing framework that provides fast data processing and analytics. SQS is suited to moving messages reliably between components, while Spark is suited to processing data in both streaming and batch workloads.

Advice on Amazon SQS and Apache Spark

Pulkit Sapra

Software Engineer · Oct 30, 2020 | 7 upvotes · 462.9K views

Needs advice on Amazon SQS, Kubernetes, and RabbitMQ

Hi! I am creating a scraping system in Django that involves long-running tasks lasting between 1 minute and 1 day. As I am new to message brokers and task queues, I need advice on which architecture to use for my system (Amazon SQS, RabbitMQ, or Celery). The system should autoscale on Kubernetes (K8s) based on the number of pending tasks in the queue.
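Whichever broker is chosen, scaling on pending tasks comes down to a small calculation. Here is a minimal sketch, assuming the queue depth is already available as a number; the function name, parameters, and default limits are hypothetical, and reading the depth from the broker or applying the result to Kubernetes is out of scope.

```python
import math


def desired_replicas(pending_tasks: int, tasks_per_worker: int,
                     min_replicas: int = 0, max_replicas: int = 20) -> int:
    """Return how many workers to run for the current queue depth.

    tasks_per_worker is the backlog one worker is expected to absorb;
    the result is clamped to [min_replicas, max_replicas].
    """
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    wanted = math.ceil(pending_tasks / tasks_per_worker)
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(pending_tasks=12, tasks_per_worker=5))
```

With tasks that can run for up to a day, scaling down should probably wait for workers to finish their current task, since removing a busy worker would discard hours of progress.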

Replies (1)

Anis Zehani

Founder at Odix · Nov 8, 2020 | 1 upvotes · 296.9K views

Recommends Kafka

Hello, I highly recommend Apache Kafka; to me it's the best. You can deploy it in cluster mode inside K8s, so you can have a highly available system (also autoscalable).

Good luck

Meili Triantafyllidi

Software engineer at Digital Science · Sep 24, 2020 | 6 upvotes · 489.3K views

Needs advice on Amazon SQS, RabbitMQ, and ZeroMQ

Hi, we have a ZMQ setup using a push/pull pattern. We are starting to see more traffic, and cases where the service is unavailable or stuck. We want to:
* Not lose messages during service outages
* Safely restart the service without losing messages (ZeroMQ seems to need the socket in the receiver to be closed manually before a restart)

Do you have experience with this setup with ZeroMQ? Would you suggest RabbitMQ or Amazon SQS (we are in AWS setup) instead? Something else?

Thank you for your time

Replies (2)

Shishir Pandey

at Staples · Sep 30, 2020 | 6 upvotes · 351.8K views

Recommends RabbitMQ

ZeroMQ is fast, but you need to build reliability yourself. There are a number of patterns described in the ZeroMQ guide. I have used RabbitMQ before, which gives a lot of functionality out of the box. You can probably use the worker queues example from the tutorial; it can also persist messages in the queue.

I haven't used Amazon SQS before. Another tool you could use is Kafka.

Kevin Deyne

Principal Software Engineer at Accurate Background · Dec 1, 2021 | 5 upvotes · 224.8K views

Recommends RabbitMQ

Both would do the trick, but there are some nuances. We work with both.

From the sound of it, your main focus is "not losing messages". In that case, I would go with RabbitMQ with a high availability policy (ha-mode=all) and a main/retry/error queue pattern.

Push messages to an exchange, which sends them to the main queue. If an error occurs, push the failed message to the retry exchange, which forwards it to the retry queue. Give the retry queue an x-message-ttl and set the main exchange as its dead-letter exchange, so that expired messages flow back to the main queue. If your message has been retried several times, push it to the error exchange, where the message can remain until someone has time to look at it.

This is a very useful and resilient pattern that allows you to never lose messages. With the high availability policy, you make sure that if one of your rabbitmq nodes dies, another can take over and messages are already mirrored to it.

This is not really possible with SQS, because SQS is a lot more focused on throughput and scaling. Combined with SNS it can do interesting things like deduplication of messages and such. That said, one thing core to its design is that messages have a maximum retention time. The idea is that a message that has stayed in an SQS queue for a long time no longer serves a purpose, so it gets removed so as not to tie up listener resources. You can also set up a DLQ here, but it similarly does not hold on to messages forever. Since you seem to depend on messages surviving at all costs, I would suggest that the scaling/throughput benefit of SQS does not outweigh the difference in its approach to messages.
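The main/retry/error flow described above can be sketched as an in-memory simulation. This is plain Python with no broker involved: the deques stand in for RabbitMQ queues, moving the retry queue back to main stands in for the TTL expiring and dead-lettering, and `MAX_RETRIES`, `run_pipeline`, and `flaky_handler` are hypothetical names chosen for the example.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

MAX_RETRIES = 3  # hypothetical limit; the reply only says "several times"


def run_pipeline(messages: List[str],
                 handler: Callable[[str], None]) -> Dict[str, List[str]]:
    main: Deque[dict] = deque({"body": m, "retries": 0} for m in messages)
    retry: Deque[dict] = deque()
    error: List[str] = []
    done: List[str] = []

    while main or retry:
        if not main:
            # Stand-in for the TTL expiring: messages in the retry queue
            # are dead-lettered back to the main queue.
            main.extend(retry)
            retry.clear()
        msg = main.popleft()
        try:
            handler(msg["body"])
            done.append(msg["body"])
        except Exception:
            if msg["retries"] >= MAX_RETRIES:
                error.append(msg["body"])  # parked until someone inspects it
            else:
                msg["retries"] += 1
                retry.append(msg)
    return {"done": done, "error": error}


attempts: Dict[str, int] = {}


def flaky_handler(body: str) -> None:
    attempts[body] = attempts.get(body, 0) + 1
    if body == "poison":
        raise RuntimeError("always fails")
    if body == "flaky" and attempts[body] < 2:
        raise RuntimeError("fails on the first attempt")


result = run_pipeline(["ok", "flaky", "poison"], flaky_handler)
```

No message is dropped in this flow: each one ends up either processed or parked in the error list, which is the property the pattern is meant to provide.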

Nilesh Akhade

Technical Architect at Self Employed · Jul 8, 2020 | 5 upvotes · 565.1K views

Needs advice

We have a Kafka topic containing events of type A and type B. We need to perform an inner join on both types of events using a common field (primary key). The joined events are to be inserted into Elasticsearch.

In the usual case, type A and type B events with the same key arrive within 15 minutes of each other. But in some cases they may be far apart, let's say 6 hours. Sometimes an event of one of the types never arrives.

In all cases, we should be able to find joined events instantly after they are joined and not-joined events within 15 minutes.

Replies (2)

lvhuyen

Jul 9, 2020 | 5 upvotes · 466.4K views

Recommends Elasticsearch

The first solution that came to me is to use upsert to update Elasticsearch:

Use the primary-key as ES document id
Upsert the records to ES as soon as you receive them. As you are using upsert, the 2nd record of the same primary-key will not overwrite the 1st one, but will be merged with it.

Cons: The load on ES will be higher, due to upsert.
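A minimal sketch of the upsert approach, assuming type A and type B events write to different fields of the same document. `build_upsert_body` produces the request body for Elasticsearch's Update API using `doc_as_upsert`; `apply_upsert` is a hypothetical in-memory stand-in that mimics the merge on top-level fields so the example runs without a cluster. The document id and field names are made up.

```python
from typing import Any, Dict


def build_upsert_body(fields: Dict[str, Any]) -> Dict[str, Any]:
    """Body for the Update API: merge `fields` into the existing document,
    or create the document from `fields` if it does not exist yet."""
    return {"doc": fields, "doc_as_upsert": True}


def apply_upsert(index: Dict[str, Dict[str, Any]], doc_id: str,
                 body: Dict[str, Any]) -> None:
    """In-memory stand-in for the update (top-level fields only)."""
    existing = index.setdefault(doc_id, {})
    existing.update(body["doc"])


index: Dict[str, Dict[str, Any]] = {}

# Type A event arrives first and creates the document.
apply_upsert(index, "order-42", build_upsert_body({"a_status": "created"}))
# Type B event with the same primary key is merged into it.
apply_upsert(index, "order-42", build_upsert_body({"b_amount": 99}))
```

A document that contains fields from both event types is a joined record; one that has only one side after 15 minutes is a not-joined record, which can be found with a query on missing fields.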

To use Flink:

Create a KeyedDataStream by the primary-key
In the ProcessFunction, save the first record in a State. At the same time, create a Timer for 15 minutes in the future
When the 2nd record comes, read the 1st record from the State, merge those two, and send out the result, and clear the State and the Timer if it has not fired
When the Timer fires, read the 1st record from the State and send out as the output record.
Have a 2nd Timer of 6 hours (or more) if you are not using Windowing to clean up the State

Pro: a good fit if you already have Flink ingesting this stream. Otherwise, I would just go with the 1st solution.
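The keyed state and timer logic from the Flink steps can be sketched as a plain-Python simulation. This is not Flink's API: `KeyedJoiner` is a hypothetical class, time is passed in explicitly as seconds, and the sketch assumes at most one event of each type per key.

```python
from typing import Dict, List, Tuple

EMIT_TIMEOUT = 15 * 60         # emit an unmatched record after 15 minutes
CLEANUP_TIMEOUT = 6 * 60 * 60  # drop state after 6 hours


class KeyedJoiner:
    def __init__(self) -> None:
        self.state: Dict[str, dict] = {}          # first record per key
        self.emit_timers: Dict[str, int] = {}     # key -> 15-minute deadline
        self.cleanup_timers: Dict[str, int] = {}  # key -> 6-hour deadline
        self.output: List[Tuple[str, dict]] = []

    def on_record(self, key: str, record: dict, now: int) -> None:
        first = self.state.get(key)
        if first is None:
            # First record for this key: store it and register both timers.
            self.state[key] = record
            self.emit_timers[key] = now + EMIT_TIMEOUT
            self.cleanup_timers[key] = now + CLEANUP_TIMEOUT
            return
        # Second record: merge, emit, then clear state and pending timers.
        self.output.append(("joined", {**first, **record}))
        self._clear(key)

    def advance_time(self, now: int) -> None:
        # 15-minute timers: emit the lone record but keep its state,
        # so a late partner can still be joined.
        for key in [k for k, t in self.emit_timers.items() if t <= now]:
            del self.emit_timers[key]
            self.output.append(("unjoined", dict(self.state[key])))
        # 6-hour timers: give up on the partner and free the state.
        for key in [k for k, t in self.cleanup_timers.items() if t <= now]:
            self._clear(key)

    def _clear(self, key: str) -> None:
        self.state.pop(key, None)
        self.emit_timers.pop(key, None)
        self.cleanup_timers.pop(key, None)


joiner = KeyedJoiner()
joiner.on_record("k1", {"a": 1}, now=0)
joiner.on_record("k1", {"b": 2}, now=60)            # partner within 15 minutes
joiner.on_record("k2", {"a": 3}, now=0)
joiner.advance_time(now=EMIT_TIMEOUT)               # k2 emitted as unjoined
joiner.on_record("k2", {"b": 4}, now=2 * 60 * 60)   # late partner still joins
```

Because the sketch keeps state after the 15-minute timer fires, a key can appear in the output twice (first unjoined, later joined), so the sink would need to overwrite by key.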

Averell Huyen Levan – Medium

Akshaya Rawat

Senior Specialist Platform at Publicis Sapient · Sep 4, 2020 | 3 upvotes · 400.2K views

Recommends Apache Spark

Please refer to the "Structured Streaming" feature of Spark, specifically "Stream-stream Joins" at https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#stream-stream-joins . In short, you need to "define watermark delays on both inputs" and "define a constraint on time across the two inputs".

MITHIRIDI PRASANTH

Software Engineer at LightMetrics · May 8, 2020 | 4 upvotes · 296.3K views

Needs advice on Amazon MQ and Amazon SQS

I want to schedule a message. Amazon SQS provides a delay of up to 15 minutes, but I need a delay of several hours.

Example: Let's say Message1 is consumed by consumer A but processing fails inside the consumer. I would want to put it back in a queue and retry after 4 hours. Can I do this in Amazon MQ? I have seen some Amazon MQ videos saying that messages can be scheduled, but I'm not sure how.

Replies (1)

Andres Paredes

Lead Senior Software Engineer at InTouch Technology · Jun 3, 2020 | 1 upvotes · 225.1K views

Recommends Amazon SQS

Mithiridi, I believe you are talking about two different things.
1. If you need to process messages with delays of more than 15 minutes or at specific times, it's not a good idea to use queues, regardless of the tool (SQS, RabbitMQ, or Amazon MQ). You should consider another approach using a scheduled job.
2. For dead-letter queues and retry policies, RabbitMQ, for example, doesn't support your use case: https://medium.com/@kiennguyen88/rabbitmq-delay-retry-schedule-with-dead-letter-exchange-31fb25a440fc I'm not sure whether SNS/SQS supports it; they have a maximum delay for delivery (maxDelayTarget) in seconds, but the number isn't clear. You can check this out: https://docs.aws.amazon.com/sns/latest/dg/sns-message-delivery-retries.html
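To make the 15-minute limit concrete, here is a sketch of a workaround that is sometimes used: the consumer re-enqueues a message with the maximum delay until its due time is reached. The arithmetic below is plain Python with no AWS calls; `deliver_at` is a hypothetical timestamp that would travel with the message, and `next_hop` and `hops_needed` are names invented for this example.

```python
SQS_MAX_DELAY_SECONDS = 900  # SQS caps the per-message delay at 15 minutes


def next_hop(deliver_at: int, now: int) -> int:
    """Delay (seconds) to use when re-enqueueing, or 0 if the message is due."""
    remaining = deliver_at - now
    if remaining <= 0:
        return 0
    return min(SQS_MAX_DELAY_SECONDS, remaining)


def hops_needed(total_delay: int) -> int:
    """Count how many times a message is re-enqueued to cover total_delay."""
    now, hops = 0, 0
    while (delay := next_hop(total_delay, now)) > 0:
        now += delay
        hops += 1
    return hops


four_hours = 4 * 60 * 60
print(hops_needed(four_hours))
```

Each hop is an extra receive and send, so a 4-hour delay costs many round trips per message. That overhead is one reason a scheduled job, as suggested in point 1, may be the simpler design.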


Pros of Amazon SQS

Easy to use, reliable (62 upvotes)
Low cost (40 upvotes)
Simple (28 upvotes)
Doesn't need to maintain it (14 upvotes)
It is Serverless (8 upvotes)
Has a max message size (currently 256K) (4 upvotes)
Triggers Lambda (3 upvotes)
Easy to configure with Terraform (3 upvotes)
Delayed delivery up to 15 mins only (3 upvotes)
Delayed delivery up to 12 hours (3 upvotes)
JMS compliant (1 upvote)
Support for retry and dead letter queue (1 upvote)
D (1 upvote)

Pros of Apache Spark

Open-source (61 upvotes)
Fast and Flexible (48 upvotes)
One platform for every big data problem (8 upvotes)
Great for distributed SQL like applications (8 upvotes)
Easy to install and to use (6 upvotes)
Works well for most Datascience usecases (3 upvotes)
Interactive Query (2 upvotes)
Machine learning library, Streaming in real time (2 upvotes)
In memory Computation (2 upvotes)


Cons of Amazon SQS

Has a max message size (currently 256K) (2 upvotes)
Proprietary (2 upvotes)
Difficult to configure (2 upvotes)
Has a maximum 15 minutes of delayed messages only (1 upvote)

Cons of Apache Spark

Speed (4 upvotes)



What is Amazon SQS?

Transmit any volume of data, at any level of throughput, without losing messages or requiring other services to be always available. With SQS, you can offload the administrative burden of operating and scaling a highly available messaging cluster, while paying a low price for only what you use.

What is Apache Spark?

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.


Blog Posts

Improving Efficiency and Reducing Runtime Using S3 Read Optimi... (Sep 1 2021 at 5:34PM)
Pinterest Flink Deployment Framework (Mar 24 2021 at 12:57PM)
Pinterest Visual Signals Infrastructure: Evolution from Lambda... (Nov 24 2020 at 7:01PM)
Powering Inclusive Search & Recommendations with Our New V... (Aug 26 2020 at 4:42PM)
Empowering Pinterest Data Scientists and Machine Learning Engi... (Jul 9 2020 at 2:41PM)
Powering Pinterest Ads Analytics with Apache Druid (Apr 8 2020 at 5:37PM)
How Sqreen handles 50,000 requests every minute in a write-hea... (Sqreen, Sep 17 2019 at 9:38PM)
Cultivating your Data Lake (Segment, Aug 28 2019 at 3:10AM)

What are some alternatives to Amazon SQS and Apache Spark?

Amazon MQ

Amazon MQ is a managed message broker service for Apache ActiveMQ that makes it easy to set up and operate message brokers in the cloud.

Kafka

Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.

Redis

Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache, and message broker. Redis provides data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes, and streams.

ActiveMQ

Apache ActiveMQ is fast, supports many Cross Language Clients and Protocols, comes with easy to use Enterprise Integration Patterns and many advanced features while fully supporting JMS 1.1 and J2EE 1.4. Apache ActiveMQ is released under the Apache 2.0 License.

Amazon SNS

Amazon Simple Notification Service makes it simple and cost-effective to push to mobile devices such as iPhone, iPad, Android, Kindle Fire, and internet connected smart devices, as well as pushing to other distributed services. Besides pushing cloud notifications directly to mobile devices, SNS can also deliver notifications by SMS text message or email, to Simple Queue Service (SQS) queues, or to any HTTP endpoint.
