Amazon SQS vs Apache Spark: What are the differences?
Introduction
Amazon SQS and Apache Spark are two popular and widely used technologies in the field of distributed computing. While they both serve the purpose of handling large volumes of data, there are several key differences between the two. This article aims to highlight and explain those differences in a concise manner.
Message Broker vs. Distributed Computing Framework: Amazon SQS is a fully managed message queuing service that allows you to decouple and scale microservices, distributed systems, and serverless applications. On the other hand, Apache Spark is an open-source distributed computing framework that provides fast in-memory data processing and analytics capabilities.
Data Processing Paradigm: Amazon SQS primarily focuses on asynchronous message passing and follows a point-to-point queue model, in which each message is handled by a single consumer (publish-subscribe fan-out is the role of Amazon SNS, which is often paired with SQS). It allows the decoupling of sender and receiver systems through the use of queues. In contrast, Apache Spark operates on data batches or streams and follows a batch processing or stream processing paradigm. It provides the resilient distributed dataset (RDD) as its core abstraction for processing large datasets.
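As a minimal sketch of the queue-based decoupling described above, the following Python functions use the boto3 SQS client calls send_message, receive_message, and delete_message. The function names, the queue URL, and the handle callback are illustrative assumptions; the client is passed in as an argument so that the sketch makes no assumption about credentials or region setup.

```python
from typing import Any, Callable


def produce(sqs: Any, queue_url: str, body: str) -> str:
    """Send one message and return the MessageId assigned by SQS."""
    response = sqs.send_message(QueueUrl=queue_url, MessageBody=body)
    return response["MessageId"]


def consume_once(sqs: Any, queue_url: str, handle: Callable[[str], None]) -> int:
    """Poll once, process each message, and delete it only after success."""
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # SQS returns at most 10 messages per call
        WaitTimeSeconds=20,      # long polling; 20 seconds is the maximum
    )
    processed = 0
    for message in response.get("Messages", []):
        handle(message["Body"])
        # Deleting acknowledges the message. If handle() raises, the message
        # is not deleted and becomes visible again after the visibility timeout.
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
        processed += 1
    return processed
```

With a real client, the sender would call something like produce(boto3.client("sqs"), queue_url, "payload"), while the receiver runs consume_once in a loop in a separate process; neither side needs to know about the other.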
Storage Requirement: In Amazon SQS, the messages are stored in a distributed manner within the Amazon managed infrastructure, reducing the overhead of managing the storage yourself. In contrast, Apache Spark requires you to set up a distributed storage system, such as Hadoop Distributed File System (HDFS) or Amazon S3, to store and manage the input data.
Compatibility and Integration: Amazon SQS is a cloud-native service provided by Amazon Web Services (AWS), and it seamlessly integrates with other AWS services like Lambda, EC2, and S3, making it easy to build serverless architectures. Apache Spark, being an open-source technology, can run on various platforms, including AWS, and allows integration with multiple data sources and databases.
Fault Tolerance and Scalability: Amazon SQS provides high fault tolerance by replicating messages across multiple availability zones within a region, ensuring high availability and durability. It also scales automatically to accommodate variable message traffic. Apache Spark, on the other hand, offers fault tolerance through the concept of RDD lineage, allowing the reconstruction of lost data partitions. It provides horizontal scalability by distributing the dataset and computation across a cluster of machines.
Real-time vs. Batch Processing: Amazon SQS handles messages asynchronously, which makes it suitable for messaging scenarios where immediate processing of data is not necessary; it transports messages between components rather than processing them. In contrast, Apache Spark is designed to handle both stream and batch processing tasks efficiently. It can process streaming data in near real time as well as apply complex batch analytics to large volumes of data.
In summary, Amazon SQS is a managed message queuing service that offers asynchronous messaging and decoupling of distributed systems, whereas Apache Spark is a distributed computing framework that provides fast data processing and analytics capabilities through RDDs. SQS is better suited to asynchronous messaging between components, while Spark excels at both stream and batch processing tasks.
Hi! I am creating a scraping system in Django, which involves long-running tasks that take between 1 minute and 1 day. As I am new to message brokers and task queues, I need advice on which architecture to use for my system (Amazon SQS, RabbitMQ, or Celery). The system should be autoscalable using Kubernetes (K8s) based on the number of pending tasks in the queue.
Hello, I highly recommend Apache Kafka; to me it's the best. You can deploy it in cluster mode inside K8s, so you can have a highly available (and also autoscalable) system.
Good luck
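Whichever broker is chosen, scaling on the number of pending tasks comes down to reading the queue depth and turning it into a worker count. The sketch below assumes SQS is used and reads the ApproximateNumberOfMessages attribute through boto3's get_queue_attributes; the desired_workers rule and its parameters are hypothetical and would need tuning for tasks that run anywhere from a minute to a day.

```python
import math
from typing import Any


def pending_messages(sqs: Any, queue_url: str) -> int:
    """Read the approximate number of visible (pending) messages."""
    response = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    # SQS returns attribute values as strings.
    return int(response["Attributes"]["ApproximateNumberOfMessages"])


def desired_workers(
    pending: int,
    tasks_per_worker: int,
    min_workers: int,
    max_workers: int,
) -> int:
    """Hypothetical rule: one worker per `tasks_per_worker` pending tasks,
    clamped to the range [min_workers, max_workers]."""
    wanted = math.ceil(pending / tasks_per_worker)
    return max(min_workers, min(max_workers, wanted))
```

The resulting number would then be handed to whatever mechanism sets the replica count of the worker deployment; that part is left out here because it depends on the cluster setup.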
Hi, we have a ZeroMQ (ZMQ) setup in a push/pull pattern, and we are starting to see more traffic and cases where the service is unavailable or stuck. We want to:
- Not lose messages during service outages
- Safely restart the service without losing messages (ZeroMQ seems to require manually closing the socket in the receiver before a restart)
Do you have experience with this setup with ZeroMQ? Would you suggest RabbitMQ or Amazon SQS (we are in AWS setup) instead? Something else?
Thank you for your time
ZeroMQ is fast, but you need to build reliability yourself. There are a number of patterns described in the ZeroMQ guide. I have used RabbitMQ before, which gives you a lot of functionality out of the box; you can probably use the worker queues example from the tutorial. It can also persist messages in the queue.
I haven't used Amazon SQS before. Another tool you could use is Kafka.
Both would do the trick, but there are some nuances. We work with both.
From the sound of it, your main focus is "not losing messages". In that case, I would go with RabbitMQ with a high availability policy (ha-mode=all) and a main/retry/error queue pattern.
Push messages to an exchange, which sends them to the main queue. If an error occurs, push the failed message to the retry exchange, which forwards it to the retry queue. Give the retry queue an x-message-ttl and set the main exchange as its dead-letter exchange, so that expired messages flow back to the main queue. If a message has been retried several times, push it to the error exchange, where it can remain until someone has time to look at it.
This is a very useful and resilient pattern that allows you to never lose messages. With the high availability policy, you make sure that if one of your rabbitmq nodes dies, another can take over and messages are already mirrored to it.
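The main/retry/error pattern described above can be sketched with the pika client roughly as follows. The exchange and queue names, the routing key, the 30-second TTL, and the retry limit are all assumptions made for illustration; the high availability policy is configured on the broker and is not shown here.

```python
from typing import Any

RETRY_DELAY_MS = 30_000  # assumed delay before a failed message is retried
MAX_RETRIES = 3          # assumed retry budget before a message is parked
ROUTING_KEY = "task"     # assumed routing key shared by all three queues


def declare_topology(channel: Any) -> None:
    """Declare main, retry, and error exchanges and queues on a pika channel."""
    for name in ("main", "retry", "error"):
        channel.exchange_declare(exchange=name, exchange_type="direct", durable=True)

    channel.queue_declare(queue="main", durable=True)
    channel.queue_declare(
        queue="retry",
        durable=True,
        arguments={
            # Messages wait here for the TTL, then are dead-lettered to the
            # "main" exchange, which routes them back to the main queue.
            "x-message-ttl": RETRY_DELAY_MS,
            "x-dead-letter-exchange": "main",
        },
    )
    channel.queue_declare(queue="error", durable=True)

    for name in ("main", "retry", "error"):
        channel.queue_bind(queue=name, exchange=name, routing_key=ROUTING_KEY)


def exchange_after_failure(retries_so_far: int) -> str:
    """Pick where a failed message goes: retry until the budget is spent."""
    return "retry" if retries_so_far < MAX_RETRIES else "error"
```

A consumer that fails to process a message would publish it to the exchange returned by exchange_after_failure and then acknowledge the original delivery, tracking the retry count in a message header of its own choosing.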
This is not really possible with SQS, because SQS is much more focused on throughput and scaling. Combined with SNS it can do interesting things such as message deduplication. That said, one thing core to its design is that messages have a maximum retention time (at most 14 days). The idea is that a message that has sat in an SQS queue for a long time no longer serves a purpose, so it gets removed rather than tying up listener resources. You can also set up a DLQ here, but it similarly does not hold onto messages forever. Since you seem to depend on messages surviving at all costs, I would suggest that the scaling/throughput benefit of SQS does not outweigh its different approach to message retention.
We have a Kafka topic containing events of type A and type B. We need to perform an inner join on both types of events using a common field (primary key). The joined events are to be inserted into Elasticsearch.
Usually, type A and type B events with the same key arrive within 15 minutes of each other. But in some cases they may be far apart, say 6 hours. Sometimes an event of one of the types never arrives.
In all cases, we should be able to find joined events immediately after they are joined, and not-joined events within 15 minutes.
The first solution that came to me is to use upsert to update ElasticSearch:
- Use the primary-key as ES document id
- Upsert the records to ES as soon as you receive them. As you are using upsert, the 2nd record of the same primary-key will not overwrite the 1st one, but will be merged with it.
Cons: The load on ES will be higher, due to upsert.
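A minimal sketch of the upsert approach, assuming the Elasticsearch update endpoint with doc_as_upsert: the first function only builds the request path and body, and the second is an in-memory stand-in that mimics the merge for flat documents. The function names and the index name in the usage note are hypothetical, and the sketch assumes type A and type B events use different field names so that neither overwrites the other.

```python
import json
from typing import Any, Dict, Tuple


def build_upsert_request(
    index: str, primary_key: str, event: Dict[str, Any]
) -> Tuple[str, str]:
    """Return the (path, JSON body) for a partial-update upsert request."""
    path = f"/{index}/_update/{primary_key}"
    body = {
        "doc": event,           # fields to merge into an existing document
        "doc_as_upsert": True,  # create the document from `doc` if missing
    }
    return path, json.dumps(body)


def simulate_upsert(
    store: Dict[str, Dict[str, Any]], primary_key: str, event: Dict[str, Any]
) -> None:
    """In-memory stand-in for the merge behaviour (shallow merge only)."""
    store.setdefault(primary_key, {}).update(event)
```

For example, build_upsert_request("events", "42", {"a_field": 1}) yields the path /events/_update/42; sending the type B event for the same key later adds its fields to the same document instead of replacing it.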
To use Flink:
- Create a KeyedDataStream by the primary-key
- In the ProcessFunction, save the first record in a State. At the same time, create a Timer for 15 minutes in the future
- When the 2nd record comes, read the 1st record from the State, merge those two, and send out the result, and clear the State and the Timer if it has not fired
- When the Timer fires, read the 1st record from the State and send out as the output record.
- Have a 2nd Timer of 6 hours (or more) if you are not using Windowing to clean up the State
Pro: this fits well if you already have Flink ingesting this stream. Otherwise, I would just go with the 1st solution.
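The state-and-timer logic in the Flink steps above can be illustrated with a plain Python simulation. This is not Flink API code: the class, its method names, and the explicit now argument are assumptions that stand in for keyed state and timers, so the control flow can be followed and tested without a cluster.

```python
from typing import Any, Dict, List, Optional, Tuple

JOIN_TIMEOUT = 15 * 60   # seconds before an unmatched record is emitted alone
STATE_TTL = 6 * 60 * 60  # seconds before leftover state is cleaned up


class KeyedJoin:
    """Simulates per-key state with a 15-minute timer and a 6-hour cleanup."""

    def __init__(self) -> None:
        # key -> (first record, arrival time, already emitted unjoined?)
        self.state: Dict[str, Tuple[Dict[str, Any], float, bool]] = {}

    def on_record(self, key: str, record: Dict[str, Any], now: float) -> Optional[Dict[str, Any]]:
        """Store the first record for a key; join and clear on the second."""
        if key not in self.state:
            self.state[key] = (record, now, False)
            return None
        first, _, _ = self.state.pop(key)  # clearing state also cancels timers
        return {**first, **record}

    def on_time(self, now: float) -> List[Dict[str, Any]]:
        """Fire due timers: emit unjoined records, drop expired state."""
        emitted_now = []
        for key, (first, arrived, emitted) in list(self.state.items()):
            age = now - arrived
            if not emitted and age >= JOIN_TIMEOUT:
                emitted_now.append(first)
                emitted = True
            if age >= STATE_TTL:
                del self.state[key]
            else:
                self.state[key] = (first, arrived, emitted)
        return emitted_now
```

Keeping the state after the 15-minute timer fires is what allows a second record arriving hours later to still be joined; the 6-hour cleanup bounds how long state for never-matched keys is held.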
Please refer to the "Structured Streaming" feature of Spark, specifically "Stream-stream Joins" at https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#stream-stream-joins . In short, you need to "Define watermark delays on both inputs" and "Define a constraint on time across the two inputs".
I want to schedule a message. Amazon SQS supports a delay of up to 15 minutes, but I need a delay of several hours.
Example: Let's say Message1 is consumed by consumer A, but processing fails inside the consumer. I would want to put it back in a queue and retry after 4 hours. Can I do this in Amazon MQ? Some Amazon MQ videos say that scheduling messages can be done, but I'm not sure how.
Mithiridi, I believe you are talking about two different things.
1. If you need to process messages with delays of more than 15 minutes or at specific times, it's not a good idea to use queues, regardless of the tool (SQS, RabbitMQ, or Amazon MQ). You should consider another approach using a scheduled job.
2. For dead-letter queues and retry policies, RabbitMQ, for example, doesn't support your use case. https://medium.com/@kiennguyen88/rabbitmq-delay-retry-schedule-with-dead-letter-exchange-31fb25a440fc I'm not sure whether SNS/SQS supports it; they have a maximum delay for delivery (maxDelayTarget) in seconds, but the exact number isn't clear. You can check this out: https://docs.aws.amazon.com/sns/latest/dg/sns-message-delivery-retries.html
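To make the 15-minute limit concrete: the per-message DelaySeconds parameter of SQS accepts at most 900 seconds. If a queue is still preferred over a scheduled job, one workaround (an assumption here, not something the posts above propose) is to carry the target delivery time in a message attribute and re-send the message in hops of at most 900 seconds until it is due. The function names and the deliver_at attribute are hypothetical, and the sketch assumes a standard queue.

```python
import time
from typing import Any, Optional

SQS_MAX_DELAY = 900  # seconds; SQS caps DelaySeconds at 15 minutes


def next_delay(deliver_at: float, now: float) -> int:
    """Delay for the next hop: the remaining time, capped at the SQS limit."""
    remaining = max(0, int(deliver_at - now))
    return min(SQS_MAX_DELAY, remaining)


def schedule(
    sqs: Any,
    queue_url: str,
    body: str,
    deliver_at: float,
    now: Optional[float] = None,
) -> int:
    """Send (or re-send) a message so it reappears after one more hop."""
    current = time.time() if now is None else now
    delay = next_delay(deliver_at, current)
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=body,
        DelaySeconds=delay,
        MessageAttributes={
            # Hypothetical attribute carrying the target time (epoch seconds).
            "deliver_at": {"DataType": "Number", "StringValue": str(int(deliver_at))},
        },
    )
    return delay


def is_due(deliver_at: float, now: float) -> bool:
    """A consumer processes the message only once this returns True."""
    return now >= deliver_at
```

A consumer that receives a message that is not yet due would call schedule again with the same deliver_at and delete the received copy. A 4-hour delay therefore costs 16 hops, which is part of why a scheduled job is often the simpler choice.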
Pros of Amazon SQS
- Easy to use, reliable (62)
- Low cost (40)
- Simple (28)
- Doesn't need to maintain it (14)
- It is Serverless (8)
- Has a max message size (currently 256K) (4)
- Triggers Lambda (3)
- Easy to configure with Terraform (3)
- Delayed delivery up to 15 mins only (3)
- Delayed delivery up to 12 hours (3)
- JMS compliant (1)
- Support for retry and dead letter queue (1)
- D (1)
Pros of Apache Spark
- Open-source (61)
- Fast and Flexible (48)
- One platform for every big data problem (8)
- Great for distributed SQL like applications (8)
- Easy to install and to use (6)
- Works well for most data science use cases (3)
- Interactive Query (2)
- Machine learning library, Streaming in real time (2)
- In memory Computation (2)
Cons of Amazon SQS
- Has a max message size (currently 256K) (2)
- Proprietary (2)
- Difficult to configure (2)
- Has a maximum 15 minutes of delayed messages only (1)
Cons of Apache Spark
- Speed (4)