Amazon RDS for Aurora vs Apache Spark


Amazon RDS for Aurora vs Apache Spark: What are the differences?

Amazon RDS for Aurora: MySQL and PostgreSQL compatible relational database with several times better performance. Amazon Aurora is a MySQL-compatible relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open-source databases. Aurora delivers up to five times the throughput of standard MySQL, with commercial-grade performance and availability at roughly one tenth the cost; Apache Spark: Fast and general engine for large-scale data processing. Spark is a fast, general-purpose processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or in Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to handle both batch processing (similar to MapReduce) and newer workloads such as streaming, interactive queries, and machine learning.

Amazon RDS for Aurora belongs to the "SQL Database as a Service" category of the tech stack, while Apache Spark is primarily classified under "Big Data Tools".

Some of the features offered by Amazon RDS for Aurora are:

  • High Throughput with Low Jitter
  • Push-button Compute Scaling
  • Storage Auto-scaling

On the other hand, Apache Spark provides the following key features:

  • Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
  • Write applications quickly in Java, Scala or Python
  • Combine SQL, streaming, and complex analytics (see the sketch below)
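
As a rough illustration of that last point, the sketch below runs the same aggregation through Spark SQL and through the DataFrame API in one program; the input file and column names are hypothetical.

    # Minimal PySpark sketch: SQL and DataFrame analytics side by side.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("sql-plus-analytics").getOrCreate()

    events = spark.read.json("events.json")  # hypothetical input
    events.createOrReplaceTempView("events")

    # The same engine answers a SQL query...
    daily = spark.sql("SELECT date, COUNT(*) AS n FROM events GROUP BY date")

    # ...and a DataFrame aggregation; the two styles can be mixed freely.
    top_users = (events.groupBy("user_id")
                       .agg(F.count("*").alias("actions"))
                       .orderBy(F.desc("actions"))
                       .limit(10))
    top_users.show()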

"MySQL compatibility " is the primary reason why developers consider Amazon RDS for Aurora over the competitors, whereas "Open-source" was stated as the key factor in picking Apache Spark.

Apache Spark is an open source tool with 22.3K GitHub stars and 19.3K GitHub forks; the project's repository is hosted on GitHub at https://github.com/apache/spark.

Slack, Shopify, and SendGrid are some of the popular companies that use Apache Spark, whereas Amazon RDS for Aurora is used by StackShare, GoGuardian, and Akoova. Apache Spark has broader approval, being mentioned in 263 company stacks and 111 developer stacks, compared to Amazon RDS for Aurora, which is listed in 116 company stacks and 30 developer stacks.


What is Amazon RDS for Aurora?

Amazon Aurora is a MySQL-compatible relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open-source databases. Aurora delivers up to five times the throughput of standard MySQL, with commercial-grade performance and availability at roughly one tenth the cost.
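
In practice, MySQL compatibility means existing drivers work unchanged. A minimal sketch, assuming a hypothetical cluster endpoint and credentials, using the stock PyMySQL driver:

    import pymysql

    # Aurora speaks the MySQL wire protocol, so a standard MySQL client connects as-is.
    conn = pymysql.connect(
        host="mycluster.cluster-abc123.us-east-1.rds.amazonaws.com",  # hypothetical endpoint
        user="admin",
        password="secret",
        database="app",
        port=3306,
    )
    with conn.cursor() as cur:
        cur.execute("SELECT VERSION()")
        print(cur.fetchone())
    conn.close()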

What is Apache Spark?

Spark is a fast, general-purpose processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or in Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to handle both batch processing (similar to MapReduce) and newer workloads such as streaming, interactive queries, and machine learning.
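
Because batch and streaming share one engine, the same DataFrame code serves both. A minimal Structured Streaming sketch (the socket source is just the simplest demo input):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("streaming-wordcount").getOrCreate()

    # Read a stream of lines; only the source differs from a batch job.
    lines = (spark.readStream.format("socket")
                  .option("host", "localhost")
                  .option("port", 9999)
                  .load())
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Continuously print running counts to the console.
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()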

What are some alternatives to Amazon RDS for Aurora and Apache Spark?
Amazon RDS
Amazon RDS gives you access to the capabilities of a familiar MySQL, Oracle or Microsoft SQL Server database engine. This means that the code, applications, and tools you already use today with your existing databases can be used with Amazon RDS. Amazon RDS automatically patches the database software and backs up your database, storing the backups for a user-defined retention period and enabling point-in-time recovery. You benefit from the flexibility of being able to scale the compute resources or storage capacity associated with your Database Instance (DB Instance) via a single API call.
Google Cloud SQL
MySQL databases deployed in the cloud without a fuss. Google Cloud Platform provides you with powerful databases that run fast, don’t run out of space and give your application the redundant, reliable storage it needs.
ClearDB
ClearDB uses a combination of advanced replication techniques, advanced cluster technology, and layered web services to provide you with a MySQL database that is "smarter" than usual.
Azure Database for MySQL
Azure Database for MySQL provides a managed database service for app development and deployment that allows you to stand up a MySQL database in minutes and scale on the fly – on the cloud you trust most.
DigitalOcean Managed Databases
Build apps and store data in minutes with easy access to one or more databases and sleep better knowing your data is backed up and optimized.
Decisions about Amazon RDS for Aurora and Apache Spark
StackShare Editors
Presto · Apache Spark · Hadoop

Around 2015, the growing use of Uber’s data exposed limitations in the ETL and Vertica-centric setup, not to mention the increasing costs. “As our company grew, scaling our data warehouse became increasingly expensive. To cut down on costs, we started deleting older, obsolete data to free up space for new data.”

To overcome these challenges, Uber rebuilt their big data platform around Hadoop. “More specifically, we introduced a Hadoop data lake where all raw data was ingested from different online data stores only once and with no transformation during ingestion.”

“In order for users to access data in Hadoop, we introduced Presto to enable interactive ad hoc user queries, Apache Spark to facilitate programmatic access to raw data (in both SQL and non-SQL formats), and Apache Hive to serve as the workhorse for extremely large queries.”
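
To make the "programmatic access in both SQL and non-SQL formats" point concrete, here is a sketch of the two access styles against the same raw files; the lake path and columns are invented for illustration:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder.appName("raw-access")
                         .enableHiveSupport().getOrCreate())

    trips = spark.read.parquet("hdfs:///lake/raw/trips/")  # hypothetical raw data
    trips.createOrReplaceTempView("trips")

    # SQL access...
    by_city = spark.sql("SELECT city_id, COUNT(*) AS trips FROM trips GROUP BY city_id")

    # ...and the equivalent non-SQL (DataFrame) access.
    by_city_df = trips.groupBy("city_id").count()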

StackShare Editors
Presto · Apache Spark · Hadoop

To improve platform scalability and efficiency, Uber transitioned from JSON to Parquet, and built a central schema service to manage schemas and integrate different client libraries.
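
A conversion step like that can be as small as the sketch below; the paths are hypothetical, and in Uber's setup the schema would come from the central schema service rather than being inferred:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

    # Read raw, possibly nested JSON and rewrite it as columnar Parquet.
    raw = spark.read.json("hdfs:///ingest/events/")
    raw.write.mode("append").parquet("hdfs:///lake/events/")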

While the first generation big data platform was vulnerable to upstream data format changes, “ad hoc data ingestions jobs were replaced with a standard platform to transfer all source data in its original, nested format into the Hadoop data lake.”

These platform changes allowed Uber to meet the scaling challenges it was facing at the time: “On a daily basis, there were tens of terabytes of new data added to our data lake, and our Big Data platform grew to over 10,000 vcores with over 100,000 running batch jobs on any given day.”

StackShare Editors
Presto · Apache Spark · Scala · MySQL · Kafka

Slack’s data team works to “provide an ecosystem to help people in the company quickly and easily answer questions about usage, so they can make better and data informed decisions.” To achieve that goal, they rely on a complex data pipeline.

An in-house tool called Sqooper scrapes MySQL backups and pipes them to S3. Job-queue and log data are sent to Kafka, then persisted to S3 using Secor, an open-source tool created by Pinterest.

For compute, Amazon’s Elastic MapReduce (EMR) creates clusters preconfigured for Presto, Hive, and Spark.

Presto is then used for ad-hoc questions, validating data assumptions, exploring smaller datasets, and creating visualizations for some internal tools. Hive is used for larger data sets or longer time series data, and Spark allows teams to write efficient and robust batch and aggregation jobs. Most of the Spark pipeline is written in Scala.
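
Slack writes those jobs in Scala; the PySpark sketch below shows the same shape of batch aggregation over S3-persisted logs, with a hypothetical bucket layout and columns:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("daily-usage-rollup").getOrCreate()

    logs = spark.read.parquet("s3://logs-bucket/jobqueue/date=2019-06-01/")
    rollup = (logs.groupBy("team_id")
                  .agg(F.count("*").alias("events"),
                       F.countDistinct("user_id").alias("active_users")))
    rollup.write.mode("overwrite").parquet("s3://warehouse-bucket/rollups/2019-06-01/")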

Thrift binds all of these engines together with a typed schema and structured data.

Finally, the Hive Metastore serves as the ground truth for all data and its schema.

Tim Specht
Co-Founder and CTO at Dubsmash · 13 upvotes · 49.9K views
at Dubsmash
Amazon RDS for Aurora · Redis · Amazon DynamoDB · Amazon RDS · Heroku · PostgreSQL

Over the years we have added a wide variety of data stores to our stack, including PostgreSQL (some hosted by Heroku, some by Amazon RDS) for relational data, Amazon DynamoDB for non-relational data like recommendations and user connections, and Redis to hold pre-aggregated data to speed up API endpoints.

Since we started running Postgres ourselves on RDS instead of only using the managed offerings of Heroku, we've gained additional flexibility in scaling our application while reducing costs at the same time.

We are also heavily testing Amazon RDS for Aurora in its Postgres-compatible version and will also give the new release of Aurora Serverless a try!
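
Operationally, "Postgres-compatible" means the standard drivers just work. A sketch with psycopg2 against a hypothetical Aurora cluster endpoint:

    import psycopg2

    # Aurora's Postgres-compatible edition speaks the Postgres wire protocol.
    conn = psycopg2.connect(
        host="app-test.cluster-xyz789.eu-west-1.rds.amazonaws.com",  # hypothetical endpoint
        port=5432,
        dbname="app",
        user="app",
        password="secret",
    )
    with conn.cursor() as cur:
        cur.execute("SELECT version()")
        print(cur.fetchone())
    conn.close()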

#SqlDatabaseAsAService #NosqlDatabaseAsAService #Databases #PlatformAsAService

StackShare Editors
Apache Thrift · Kotlin · Presto · HHVM (HipHop Virtual Machine) · gRPC · Kubernetes · Apache Spark · Airflow · Terraform · Hadoop · Swift · Hack · Memcached · Consul · Chef · Prometheus

Since the beginning, Cal Henderson has been the CTO of Slack. Earlier this year, he commented on a Quora question summarizing their current stack.

Apps
  • Web: a mix of JavaScript/ES6 and React.
  • Desktop: Electron, used to ship the web client as a desktop application.
  • Android: a mix of Java and Kotlin.
  • iOS: written in a mix of Objective C and Swift.
Backend
  • The core application and the API are written in PHP/Hack and run on HHVM.
  • The data is stored in MySQL using Vitess.
  • Caching is done using Memcached and MCRouter.
  • The search service is backed by SolrCloud, with various Java services.
  • The messaging system uses WebSockets, with many services in Java and Go.
  • Load balancing is done using HAProxy, with Consul for configuration.
  • Most services talk to each other over gRPC; some use Thrift or JSON-over-HTTP.
  • The voice and video calling service was built in Elixir.
Data warehouse
  • Built using open source tools including Presto, Spark, Airflow, Hadoop and Kafka (a sketch of a typical orchestration step follows below).
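
Purely as a hypothetical sketch of how Airflow might tie that warehouse together (import paths assume Airflow 2.x; the DAG, job name, and path are invented):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # A daily DAG that submits a Spark rollup job to the Yarn cluster.
    with DAG(
        dag_id="warehouse_daily_rollup",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="spark_rollup",
            bash_command="spark-submit --master yarn /jobs/daily_rollup.py",
        )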
Julien DeFrance
Full Stack Engineering Manager at ValiMail · 16 upvotes · 265.1K views
at SmartZip
Amazon DynamoDB · Ruby · Node.js · AWS Lambda · New Relic · Amazon Elasticsearch Service · Elasticsearch · Superset · Amazon Quicksight · Amazon Redshift · Zapier · Segment · Amazon CloudFront · Memcached · Amazon ElastiCache · Amazon RDS for Aurora · MySQL · Amazon RDS · Amazon S3 · Docker · Capistrano · AWS Elastic Beanstalk · Rails API · Rails · Algolia

Back in 2014, I was given an opportunity to re-architect the SmartZip Analytics platform and its flagship product, SmartTargeting. This is a SaaS product that helps real estate professionals keep up with their prospects and leads in a given neighborhood/territory, find out (thanks to predictive analytics) who is most likely to list or sell their home, and run cross-channel marketing automation against them: direct mail, online ads, email... The company also provides Data APIs to Enterprise customers.

I had inherited years and years of technical debt and I knew things had to change radically. The first enabler to this was to make use of the cloud and go with AWS, so we would stop re-inventing the wheel, and build around managed/scalable services.

For the SaaS product, we kept on working with Rails as this was what my team had the most knowledge in. We did, however, break up the monolith and decouple the front-end application from the backend using Rails API, so we'd get independently scalable micro-services going forward.

Our various applications could now be deployed using AWS Elastic Beanstalk, so we wouldn't waste any more effort writing time-consuming Capistrano deployment scripts. We combined this with Docker so each application would run within its own container, independently of the underlying host configuration.

Storage-wise, we went with Amazon S3 and ditched any pre-existing local or network storage people used to deal with in our legacy systems. On the database side: Amazon RDS / MySQL initially, then we migrated to Amazon RDS for Aurora / MySQL when it was released. Once again, the goal was a managed service your cloud provider handles for you.

Future improvements / technology decisions included:

  • Caching: Amazon ElastiCache / Memcached
  • CDN: Amazon CloudFront
  • Systems integration: Segment / Zapier
  • Data warehousing: Amazon Redshift
  • BI: Amazon Quicksight / Superset
  • Search: Elasticsearch / Amazon Elasticsearch Service / Algolia
  • Monitoring: New Relic

As our usage grows, patterns changed, and/or our business needs evolved, my role as Engineering Manager then Director of Engineering was also to ensure my team kept on learning and innovating, while delivering on business value.

One of these innovations was to get ourselves into serverless: adopting AWS Lambda was a big step forward. At the time it was only available for Node.js (not Ruby), but it was a great way to handle cost efficiency, unpredictable traffic, and sudden bursts of traffic. Ultimately you want the whole chain of services involved in a call to be serverless, which is when we started leveraging Amazon DynamoDB on these projects so they'd be fully scalable.
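
A minimal sketch of that fully serverless chain, as it might look today in Python (Lambda was Node.js-only at the time; the table name and event shape are hypothetical):

    import boto3

    table = boto3.resource("dynamodb").Table("prospects")  # hypothetical table

    def handler(event, context):
        # Persist the incoming record; DynamoDB scales with the function itself.
        table.put_item(Item={"id": event["id"], "payload": event.get("payload", {})})
        return {"statusCode": 200}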

Eric Colson
Chief Algorithms Officer at Stitch Fix · 19 upvotes · 207.8K views
at Stitch Fix
Amazon EC2 Container Service · Docker · PyTorch · R · Python · Presto · Apache Spark · Amazon S3 · PostgreSQL · Kafka
#Data #DataStack #DataScience #ML #Etl #AWS

The algorithms and data infrastructure at Stitch Fix is housed in #AWS. Data acquisition is split between events flowing through Kafka and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on Yarn is our tool of choice for data movement and #ETL. Because our storage layer (S3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling Yarn clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for ad hoc queries and dashboards.
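
A sketch of one such ETL step, reading from and writing back to the S3-backed warehouse (bucket layout and columns are hypothetical), which is what lets compute scale independently of storage:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("etl-enrich").getOrCreate()

    events = spark.read.parquet("s3a://warehouse/events/")              # Kafka-sourced events
    clients = spark.read.parquet("s3a://warehouse/snapshots/clients/")  # periodic Postgres snapshot

    # Join event data with the latest snapshot and write the result back to S3.
    enriched = events.join(clients, "client_id", "left")
    enriched.write.mode("overwrite").parquet("s3a://warehouse/enriched_events/")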

Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).

At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated into systems. That requires a serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn) by automatically packaging them as Docker containers and deploying them to Amazon ECS. This gives our data scientists a one-click path from algorithm to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.


How developers use Amazon RDS for Aurora and Apache Spark
Wei Chen uses Apache Spark

Spark is good at parallel data processing management. We wrote a neat program to handle the TBs of data we get every day.

Secumail uses Amazon RDS for Aurora

Managed, clustered MySQL database so I don't have to deal with the underlying infrastructure.

RedLine13 uses Amazon RDS for Aurora

Core database for managing users, teams, tests, and result summaries

Yaakov Gesher uses Amazon RDS for Aurora

We moved our database from compose.io to AWS for speed and price.

Ralic Lo uses Apache Spark

Used the Spark DataFrame API with SparkR for big data analysis.

Bùi Thanh uses Amazon RDS for Aurora

  • Performance, high availability, and scalability.
  • Auto-scaling replicas.
Kalibrr uses Apache Spark

We use Apache Spark in computing our recommendations.

BrainFinance uses Apache Spark

As part of our big data machine learning stack (SMACK).

Dotmetrics uses Apache Spark

Big data analytics and nightly transformation jobs.
