
Amazon SQS vs Apache Spark: What are the differences?

Introduction

Amazon SQS and Apache Spark are two widely used technologies in the field of distributed computing. While both handle large volumes of data, there are several key differences between them. This article highlights and explains those differences concisely.

  1. Message Broker vs. Distributed Computing Framework: Amazon SQS is a fully managed message queuing service that allows you to decouple and scale microservices, distributed systems, and serverless applications. On the other hand, Apache Spark is an open-source distributed computing framework that provides fast in-memory data processing and analytics capabilities.

  2. Data Processing Paradigm: Amazon SQS primarily focuses on asynchronous message passing and decouples sender and receiver systems through queues (point-to-point messaging). In contrast, Apache Spark operates on batches or streams of data and follows a batch-processing or stream-processing paradigm, with the resilient distributed dataset (RDD) as its core abstraction for processing large datasets.

  3. Storage Requirement: In Amazon SQS, the messages are stored in a distributed manner within the Amazon managed infrastructure, reducing the overhead of managing the storage yourself. In contrast, Apache Spark requires you to set up a distributed storage system, such as Hadoop Distributed File System (HDFS) or Amazon S3, to store and manage the input data.

  4. Compatibility and Integration: Amazon SQS is a cloud-native service provided by Amazon Web Services (AWS), and it seamlessly integrates with other AWS services like Lambda, EC2, and S3, making it easy to build serverless architectures. Apache Spark, being an open-source technology, can run on various platforms, including AWS, and allows integration with multiple data sources and databases.

  5. Fault Tolerance and Scalability: Amazon SQS provides high fault tolerance by replicating messages across multiple availability zones within a region, ensuring high availability and durability. It also scales automatically to accommodate variable message traffic. Apache Spark, on the other hand, offers fault tolerance through the concept of RDD lineage, allowing the reconstruction of lost data partitions. It provides horizontal scalability by distributing the dataset and computation across a cluster of machines.

  6. Real-time vs. Batch Processing: Amazon SQS handles messages asynchronously, which suits messaging scenarios where each message does not need to be processed the instant it is produced. In contrast, Apache Spark is designed to handle both real-time and batch workloads efficiently: it can process streaming data in near real time as well as run complex batch analytics on large volumes of data.

In summary, Amazon SQS is a managed message queuing service that offers asynchronous messaging and decoupling of distributed systems, whereas Apache Spark is a distributed computing framework that provides fast data processing and analytics capabilities through RDDs. SQS is best suited for asynchronous messaging between decoupled services, while Spark excels at both streaming and batch processing tasks.
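
To make the contrast concrete, here is a minimal, illustrative Python sketch (not a production setup): a message is sent and received through an SQS queue with boto3, and a dataset is aggregated with PySpark. The queue URL, region, and input path are assumptions.

import boto3
from pyspark.sql import SparkSession

# --- Amazon SQS: asynchronous message passing between decoupled services ---
sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # illustrative

sqs.send_message(QueueUrl=queue_url, MessageBody='{"order_id": 42}')
resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=10)
for msg in resp.get("Messages", []):
    print("received:", msg["Body"])
    # Delete after successful processing so the message is not redelivered.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])

# --- Apache Spark: distributed, in-memory processing of a whole dataset ---
spark = SparkSession.builder.appName("sqs-vs-spark-example").getOrCreate()
df = spark.read.json("events/*.json")    # batch input; path is illustrative
df.groupBy("event_type").count().show()  # simple distributed aggregation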

Advice on Amazon SQS and Apache Spark
Pulkit Sapra
Needs advice on Amazon SQS, Kubernetes, and RabbitMQ

Hi! I am creating a scraping system in Django, which involves long-running tasks lasting between 1 minute and 1 day. As I am new to message brokers and task queues, I need advice on which architecture to use for my system (Amazon SQS, RabbitMQ, or Celery). The system should be autoscalable using Kubernetes (K8s) based on the number of pending tasks in the queue.

Replies (1)
Anis Zehani
Recommends Kafka

Hello, I highly recommend Apache Kafka; to me it's the best. You can deploy it in cluster mode inside K8s, so you get a highly available system that is also auto-scalable.

Good luck

Meili Triantafyllidi
Software engineer at Digital Science · 6 upvotes · 489.3K views
Needs advice on Amazon SQS, RabbitMQ, and ZeroMQ

Hi, we have a ZeroMQ setup using the push/pull pattern, and as traffic grows we are starting to see cases where the service is unavailable or stuck. We want to:
  • Not lose messages during service outages
  • Safely restart the service without losing messages (ZeroMQ seems to require manually closing the socket in the receiver before a restart)

Do you have experience with this kind of ZeroMQ setup? Would you suggest RabbitMQ or Amazon SQS (we are on AWS) instead? Something else?

Thank you for your time

Replies (2)
Shishir Pandey
Recommends RabbitMQ

ZeroMQ is fast, but you need to build reliability yourself; there are a number of patterns described in the ZeroMQ guide. I have used RabbitMQ before, which gives a lot of functionality out of the box. You can probably use the worker queues example from the tutorial, and it can also persist messages in the queue.
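
As a rough sketch of that worker-queue setup with message persistence, using the Python pika client (host and queue name are illustrative, not from the reply):

import pika

# A durable queue plus persistent messages: both survive a broker restart.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="task_queue", durable=True)
channel.basic_publish(
    exchange="",
    routing_key="task_queue",
    body=b"scrape job payload",
    properties=pika.BasicProperties(delivery_mode=2),  # 2 = persistent message
)
connection.close()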

I haven't used Amazon SQS before. Another tool you could use is Kafka.

Kevin Deyne
Principal Software Engineer at Accurate Background · 5 upvotes · 224.8K views
Recommends RabbitMQ

Both would do the trick, but there are some nuances. We work with both.

From the sound of it, your main focus is "not losing messages". In that case, I would go with RabbitMQ with a high availability policy (ha-mode=all) and a main/retry/error queue pattern.

Push messages to an exchange, which sends them to the main queue. If an error occurs, push the failed message to the retry exchange, which forwards it to the retry queue. Give the retry queue an x-message-ttl and set the main exchange as its dead-letter exchange, so that expired messages flow back to the main queue for another attempt. If a message has been retried several times, push it to the error exchange, where it can remain until someone has time to look at it.

This is a very useful and resilient pattern that allows you to never lose messages. With the high-availability policy, you make sure that if one of your RabbitMQ nodes dies, another can take over, since messages are already mirrored to it.

This is not really possible with SQS, because SQS is much more focused on throughput and scaling. Combined with SNS it can do interesting things like message deduplication. That said, one thing core to its design is that messages have a maximum retention time: a message that has sat in an SQS queue for a while is assumed to no longer serve a purpose, so it is removed rather than tying up listener resources indefinitely. You can set up a DLQ here as well, but it similarly does not hold onto messages forever. Since you seem to depend on messages surviving at all costs, I would suggest that the scaling/throughput benefits of SQS do not outweigh this difference in how messages are treated.
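
A minimal sketch of that main/retry/error layout with the Python pika client; the exchange and queue names and the 30-second retry TTL are illustrative assumptions, not values from the reply:

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

channel.exchange_declare(exchange="main", exchange_type="direct")
channel.exchange_declare(exchange="retry", exchange_type="direct")
channel.exchange_declare(exchange="error", exchange_type="direct")

# Main queue: consumers read from here.
channel.queue_declare(queue="work.main", durable=True)
channel.queue_bind(queue="work.main", exchange="main", routing_key="work")

# Retry queue: no consumers; messages expire after the TTL and are
# dead-lettered back to the main exchange for another attempt.
channel.queue_declare(
    queue="work.retry",
    durable=True,
    arguments={
        "x-message-ttl": 30000,            # retry delay in milliseconds
        "x-dead-letter-exchange": "main",  # route back to main after expiry
    },
)
channel.queue_bind(queue="work.retry", exchange="retry", routing_key="work")

# Error (parking-lot) queue: messages that exhausted their retries sit here
# until someone has time to inspect them.
channel.queue_declare(queue="work.error", durable=True)
channel.queue_bind(queue="work.error", exchange="error", routing_key="work")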

Nilesh Akhade
Technical Architect at Self Employed · 5 upvotes · 565.1K views

We have a Kafka topic with events of type A and type B. We need to perform an inner join on both types of events using some common field (primary key), and the joined events should be inserted into Elasticsearch.

In the usual case, type A and type B events (with the same key) are observed to arrive within about 15 minutes of each other. But in some cases they may be far apart, let's say 6 hours, and sometimes an event of one of the types never arrives.

In all cases, we should be able to find joined events instantly after they are joined, and not-joined events within 15 minutes.

Replies (2)
Recommends Elasticsearch

The first solution that came to mind is to use upserts to update Elasticsearch:

  1. Use the primary key as the ES document id.
  2. Upsert the records to ES as soon as you receive them. Because you are using upsert, the second record with the same primary key will not overwrite the first one, but will be merged with it.

Cons: the load on ES will be higher, due to upserts.
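
A minimal sketch of that upsert approach with the Python Elasticsearch client (index name, fields, and key are illustrative, and the exact parameter shape depends on the client version):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def index_event(primary_key, event):
    # doc_as_upsert merges the incoming fields into the existing document
    # (or creates it), so the second event does not overwrite the first.
    es.update(
        index="joined-events",
        id=primary_key,  # the primary key doubles as the document id
        body={"doc": event, "doc_as_upsert": True},
    )

index_event("order-123", {"type_a": {"amount": 42}})
index_event("order-123", {"type_b": {"status": "shipped"}})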

To use Flink:

  1. Create a keyed stream (KeyedStream) on the primary key.
  2. In the ProcessFunction, save the first record in state. At the same time, register a timer for 15 minutes in the future.
  3. When the second record arrives, read the first record from state, merge the two, emit the result, and clear the state and the timer if it has not fired yet.
  4. When the timer fires, read the first record from state and emit it as the output record.
  5. Register a second timer of 6 hours (or more) to clean up the state if you are not using windowing.

Pro: this makes sense if you already have Flink ingesting this stream. Otherwise, I would just go with the first solution.

Akshaya Rawat
Senior Specialist Platform at Publicis Sapient · 3 upvotes · 400.2K views
Recommends Apache Spark

Please refer to the "Structured Streaming" feature of Spark, in particular the "Stream-Stream Joins" section at https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#stream-stream-joins. In short, you need to define watermark delays on both inputs and define a constraint on event time across the two inputs.
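
A rough PySpark sketch of such a stream-stream join with watermarks; topic names, column names, and the 6-hour bound are assumptions, and the Kafka connector package must be on the Spark classpath:

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("ab-join").getOrCreate()

events_a = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events-a")
    .load()
    .selectExpr("CAST(key AS STRING) AS a_key", "timestamp AS a_time")
    .withWatermark("a_time", "6 hours")  # watermark delay on input A
)

events_b = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events-b")
    .load()
    .selectExpr("CAST(key AS STRING) AS b_key", "timestamp AS b_time")
    .withWatermark("b_time", "6 hours")  # watermark delay on input B
)

# The time constraint across the two inputs bounds how long Spark keeps state.
joined = events_a.join(
    events_b,
    expr("a_key = b_key AND "
         "b_time BETWEEN a_time - INTERVAL 6 HOURS AND a_time + INTERVAL 6 HOURS"),
    "inner",
)

query = joined.writeStream.format("console").start()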

MITHIRIDI PRASANTH
Software Engineer at LightMetrics · 4 upvotes · 296.3K views
Needs advice on Amazon MQ and Amazon SQS

I want to schedule a message. Amazon SQS provides a delay of up to 15 minutes, but I want a delay of several hours.

Example: let's say Message1 is consumed by consumer A but fails inside the consumer. I would want to put it back in a queue and retry it after 4 hours. Can I do this in Amazon MQ? I have seen some Amazon MQ videos saying message scheduling can be done, but I'm not sure how.

Replies (1)
Andres Paredes
Lead Senior Software Engineer at InTouch Technology · 1 upvote · 225.1K views
Recommends Amazon SQS

Mithiridi, I believe you are talking about two different things.

  1. If you need to process messages with delays of more than 15 minutes, or at specific times, it's not a good idea to use queues, regardless of the tool (SQS, RabbitMQ, or Amazon MQ); you should consider another approach using a scheduled job.
  2. For dead-letter queues and retry policies, RabbitMQ, for example, doesn't support your use case directly: https://medium.com/@kiennguyen88/rabbitmq-delay-retry-schedule-with-dead-letter-exchange-31fb25a440fc. I'm not sure whether SNS/SQS support this; they have a maximum delay for delivery (maxDelayTarget) in seconds, but the exact limit isn't clear. You can check this out: https://docs.aws.amazon.com/sns/latest/dg/sns-message-delivery-retries.html
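
For reference, per-message delays in SQS are set with DelaySeconds and cap out at 900 seconds (15 minutes); a minimal boto3 sketch with an illustrative queue URL:

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/example-queue",
    MessageBody="retry this work item",
    DelaySeconds=900,  # maximum allowed; anything longer needs a scheduler
)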

Pros of Amazon SQS
  • 62
    Easy to use, reliable
  • 40
    Low cost
  • 28
    Simple
  • 14
    Doesn't need to maintain it
  • 8
    It is Serverless
  • 4
    Has a max message size (currently 256K)
  • 3
    Triggers Lambda
  • 3
    Easy to configure with Terraform
  • 3
    Delayed delivery up to 15 mins only
  • 3
    Delayed delivery up to 12 hours
  • 1
    JMS compliant
  • 1
    Support for retry and dead letter queue
  • 1
    D

Pros of Apache Spark
  • 61
    Open-source
  • 48
    Fast and Flexible
  • 8
    One platform for every big data problem
  • 8
    Great for distributed SQL like applications
  • 6
    Easy to install and to use
  • 3
    Works well for most Datascience usecases
  • 2
    Interactive Query
  • 2
    Machine learning libraries and streaming in real time
  • 2
    In memory Computation


Cons of Amazon SQS
  • 2
    Has a max message size (currently 256K)
  • 2
    Proprietary
  • 2
    Difficult to configure
  • 1
    Has a maximum 15 minutes of delayed messages only

Cons of Apache Spark
  • 4
    Speed



What is Amazon SQS?

Transmit any volume of data, at any level of throughput, without losing messages or requiring other services to be always available. With SQS, you can offload the administrative burden of operating and scaling a highly available messaging cluster, while paying a low price for only what you use.

What is Apache Spark?

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.


What are some alternatives to Amazon SQS and Apache Spark?
Amazon MQ
Amazon MQ is a managed message broker service for Apache ActiveMQ that makes it easy to set up and operate message brokers in the cloud.
Kafka
Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.
Redis
Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache, and message broker. Redis provides data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes, and streams.
ActiveMQ
Apache ActiveMQ is fast, supports many Cross Language Clients and Protocols, comes with easy to use Enterprise Integration Patterns and many advanced features while fully supporting JMS 1.1 and J2EE 1.4. Apache ActiveMQ is released under the Apache 2.0 License.
Amazon SNS
Amazon Simple Notification Service makes it simple and cost-effective to push to mobile devices such as iPhone, iPad, Android, Kindle Fire, and internet connected smart devices, as well as pushing to other distributed services. Besides pushing cloud notifications directly to mobile devices, SNS can also deliver notifications by SMS text message or email, to Simple Queue Service (SQS) queues, or to any HTTP endpoint.