Memcached vs Apache Spark: What are the differences?

What is Memcached? High-performance, distributed memory object caching system. Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering.
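As a rough illustration of caching the result of a database call in a key-value store, here is a minimal cache-aside sketch. It does not talk to a real Memcached server: `TinyCache` is a hypothetical in-process stand-in with get/set and a time-to-live, and `load_user_from_db` is an invented placeholder for a slow query.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TinyCache:
    """Hypothetical in-process stand-in for a Memcached client."""

    def __init__(self) -> None:
        # key -> (value, expiry timestamp on the monotonic clock)
        self._items: Dict[str, Tuple[Any, float]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Expired entries behave like a miss.
            del self._items[key]
            return None
        return value

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._items[key] = (value, time.monotonic() + ttl_seconds)


def cached_call(cache: TinyCache, key: str, loader: Callable[[], Any],
                ttl_seconds: float = 60.0) -> Any:
    """Cache-aside: return the cached value, or load it and store it.

    A loader that returns None is never cached in this sketch.
    """
    value = cache.get(key)
    if value is None:
        value = loader()
        cache.set(key, value, ttl_seconds)
    return value


db_calls = 0


def load_user_from_db() -> dict:
    """Invented placeholder for a slow database call."""
    global db_calls
    db_calls += 1
    return {"id": 42, "name": "Ada"}


cache = TinyCache()
first = cached_call(cache, "user:42", load_user_from_db)   # miss: hits the "database"
second = cached_call(cache, "user:42", load_user_from_db)  # hit: served from the cache
```

Only the first call reaches the loader; the second is answered from memory until the entry expires.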

What is Apache Spark? Fast and general engine for large-scale data processing. Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.
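To make "batch processing (similar to MapReduce)" concrete, here is a small word-count sketch in plain Python. It does not use the Spark API; the function names and the sample partitions are assumptions for illustration, and on a real cluster the map step would run in parallel across workers.

```python
from collections import Counter
from functools import reduce
from typing import Iterable, List


def map_partition(lines: Iterable[str]) -> Counter:
    """Map step: count words within one partition of the input."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial counts into one."""
    return left + right


def word_count(partitions: List[List[str]]) -> Counter:
    # Each partition is processed independently, then the partial
    # results are merged pairwise.
    partials = [map_partition(p) for p in partitions]
    return reduce(reduce_counts, partials, Counter())


# Made-up sample input, split into two partitions.
partitions = [
    ["spark is fast", "spark is general"],
    ["memcached is a cache"],
]
totals = word_count(partitions)
```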

Memcached belongs to the "Databases" category of the tech stack, while Apache Spark is primarily classified under "Big Data Tools".

"Fast object cache" is the primary reason why developers consider Memcached over the competitors, whereas "Open-source" was stated as the key factor in picking Apache Spark.

Memcached and Apache Spark are both open source tools. Apache Spark, with 22.3K GitHub stars and 19.3K forks, appears to have more adoption than Memcached, with 8.93K GitHub stars and 2.6K forks.

According to the StackShare community, Memcached has broader approval, being mentioned in 750 company stacks and 264 developer stacks, compared with Apache Spark, which is listed in 263 company stacks and 111 developer stacks.

    What are some alternatives to Memcached and Apache Spark?
    Redis
    Redis is an open source, BSD licensed, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.
    Ehcache
    Ehcache is an open source, standards-based cache for boosting performance, offloading your database, and simplifying scalability. It's the most widely-used Java-based cache because it's robust, proven, and full-featured. Ehcache scales from in-process, with one or more nodes, all the way to mixed in-process/out-of-process configurations with terabyte-sized caches.
    Varnish
    Varnish Cache is a web application accelerator also known as a caching HTTP reverse proxy. You install it in front of any server that speaks HTTP and configure it to cache the contents. Varnish Cache is really, really fast. It typically speeds up delivery with a factor of 300 - 1000x, depending on your architecture.
    Hazelcast
    With its various distributed data structures, distributed caching capabilities, elastic nature, memcache support, integration with Spring and Hibernate and, more importantly, so many happy users, Hazelcast is a feature-rich, enterprise-ready and developer-friendly in-memory data grid solution.
    MongoDB
    MongoDB stores data in JSON-like documents that can vary in structure, offering a dynamic, flexible schema. MongoDB was also designed for high availability and scalability, with built-in replication and auto-sharding.
    Decisions about Memcached and Apache Spark
    Tools: HAProxy, Varnish, Tornado, Django, Redis, RabbitMQ, nginx, Memcached, MySQL, Python, Node.js

    Around the time of their Series A, Pinterest’s stack included Python and Django, with Tornado and Node.js as web servers. Memcached / Membase and Redis handled caching, with RabbitMQ handling queueing. Nginx, HAProxy and Varnish managed static delivery and load balancing, with persistent data storage handled by MySQL.

    StackShare Editors
    Tools: Presto, Apache Spark, Hadoop

    Around 2015, the growing use of Uber’s data exposed limitations in the ETL and Vertica-centric setup, not to mention the increasing costs. “As our company grew, scaling our data warehouse became increasingly expensive. To cut down on costs, we started deleting older, obsolete data to free up space for new data.”

    To overcome these challenges, Uber rebuilt their big data platform around Hadoop. “More specifically, we introduced a Hadoop data lake where all raw data was ingested from different online data stores only once and with no transformation during ingestion.”

    “In order for users to access data in Hadoop, we introduced Presto to enable interactive ad hoc user queries, Apache Spark to facilitate programmatic access to raw data (in both SQL and non-SQL formats), and Apache Hive to serve as the workhorse for extremely large queries.”

    StackShare Editors
    Tools: Presto, Apache Spark, Hadoop

    To improve platform scalability and efficiency, Uber transitioned from JSON to Parquet, and built a central schema service to manage schemas and integrate different client libraries.

    While the first generation big data platform was vulnerable to upstream data format changes, “ad hoc data ingestions jobs were replaced with a standard platform to transfer all source data in its original, nested format into the Hadoop data lake.”

    These platform changes enabled Uber to meet the scaling challenges it was facing around that time: “On a daily basis, there were tens of terabytes of new data added to our data lake, and our Big Data platform grew to over 10,000 vcores with over 100,000 running batch jobs on any given day.”

    StackShare Editors
    Tools: Presto, Apache Spark, Scala, MySQL, Kafka

    Slack’s data team works to “provide an ecosystem to help people in the company quickly and easily answer questions about usage, so they can make better and data informed decisions.” To achieve that goal, they rely on a complex data pipeline.

    An in-house tool called Sqooper scrapes MySQL backups and pipes them to S3. Job queue and log data is sent to Kafka, then persisted to S3 using an open source tool called Secor, which was created by Pinterest.

    For compute, Amazon’s Elastic MapReduce (EMR) creates clusters preconfigured for Presto, Hive, and Spark.

    Presto is then used for ad-hoc questions, validating data assumptions, exploring smaller datasets, and creating visualizations for some internal tools. Hive is used for larger data sets or longer time series data, and Spark allows teams to write efficient and robust batch and aggregation jobs. Most of the Spark pipeline is written in Scala.

    Thrift binds all of these engines together with a typed schema and structured data.

    Finally, the Hive Metastore serves as the ground truth for all data and its schema.

    Kir Shatrov
    Production Engineer at Shopify · 12 upvotes · 55.9K views
    at Shopify
    Tools: Redis, Memcached, MySQL, Rails

    As is common in the Rails stack, since the very beginning, we've stayed with MySQL as a relational database, Memcached for key/value storage and Redis for queues and background jobs.

    In 2014, we could no longer store all our data in a single MySQL instance - even by buying better hardware. We decided to use sharding and split all of Shopify into dozens of database partitions.

    Sharding played nicely for us because Shopify merchants are isolated from each other and we were able to put a subset of merchants on a single shard. It would have been harder if our business assumed shared data between customers.
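The property described above, that each merchant's data lives on exactly one shard, can be sketched as a routing function. This is an illustration only, not Shopify's implementation: the hash-modulo routing, the shard count, and the names `shard_for_merchant` and `ShardedStore` are all assumptions made for the example.

```python
import hashlib
from typing import Dict, List

NUM_SHARDS = 4  # assumed value for the example


def shard_for_merchant(merchant_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a merchant to a shard index using a stable hash (assumed scheme)."""
    digest = hashlib.sha256(str(merchant_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy store: each shard is a dict standing in for one database."""

    def __init__(self, num_shards: int = NUM_SHARDS) -> None:
        self.shards: List[Dict[int, List[dict]]] = [{} for _ in range(num_shards)]

    def _shard(self, merchant_id: int) -> Dict[int, List[dict]]:
        return self.shards[shard_for_merchant(merchant_id, len(self.shards))]

    def add_order(self, merchant_id: int, order: dict) -> None:
        # All writes for a merchant land on that merchant's shard.
        self._shard(merchant_id).setdefault(merchant_id, []).append(order)

    def orders_for(self, merchant_id: int) -> List[dict]:
        # Reads touch a single shard; no cross-shard query is needed
        # because merchants do not share data.
        return self._shard(merchant_id).get(merchant_id, [])
```

Because no query spans merchants, a read or write never has to consult more than one shard, which is what makes this kind of partitioning straightforward.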

    The sharding project bought us some time regarding database capacity, but as we soon found out, there was a huge single point of failure in our infrastructure. All those shards were still using a single Redis. At one point, the outage of that Redis took down all of Shopify, causing a major disruption we later called “Redismageddon”. This taught us an important lesson to avoid any resources that are shared across all of Shopify.

    Over the years, we moved from shards to the concept of "pods". A pod is a fully isolated instance of Shopify with its own datastores like MySQL, Redis, and Memcached. A pod can be spawned in any region. This approach has helped us eliminate global outages. As of today, we have more than a hundred pods, and since moving to this architecture we haven't had any major outages that affected all of Shopify. An outage today only affects a single pod or region.

    Kir Shatrov
    Production Engineer at Shopify · 13 upvotes · 106.3K views
    at Shopify
    Tools: Memcached, Redis, MySQL, Google Kubernetes Engine, Kubernetes, Docker

    At Shopify, over the years, we moved from shards to the concept of "pods". A pod is a fully isolated instance of Shopify with its own datastores like MySQL, Redis, Memcached. A pod can be spawned in any region. This approach has helped us eliminate global outages. As of today, we have more than a hundred pods, and since moving to this architecture we haven't had any major outages that affected all of Shopify. An outage today only affects a single pod or region.

    As we grew into hundreds of shards and pods, it became clear that we needed a solution to orchestrate those deployments. Today, we use Docker, Kubernetes, and Google Kubernetes Engine to make it easy to bootstrap resources for new Shopify Pods.

    Tools: Amazon ElastiCache, Amazon Elasticsearch Service, AWS Elastic Load Balancing (ELB), Memcached, Redis, Python, AWS Lambda, Amazon RDS, Microsoft SQL Server, MariaDB, Amazon RDS for PostgreSQL, Rails, Ruby, Heroku, AWS Elastic Beanstalk

    We started out with Heroku as our PaaS provider because our original developer wanted to use it for our Ruby on Rails application/website at the time. Response times were painfully slow, sometimes taking 10 seconds to start loading the main page. Moving up to the next "compute" level was going to be very expensive.

    We moved our site over to AWS Elastic Beanstalk. Not only did response times on the site become practically instant, our cloud bill for the application was also cut in half.

    On the database side, we currently use Amazon RDS for PostgreSQL, and we also have MariaDB and Microsoft SQL Server, both hosted on Amazon RDS. The plan is to migrate all 3 of those database systems to AWS Aurora Serverless.

    Additional services we use for our public applications: AWS Lambda, Python, Redis, Memcached, AWS Elastic Load Balancing (ELB), Amazon Elasticsearch Service, Amazon ElastiCache

    StackShare Editors
    Tools: Apache Thrift, Kotlin, Presto, HHVM (HipHop Virtual Machine), gRPC, Kubernetes, Apache Spark, Airflow, Terraform, Hadoop, Swift, Hack, Memcached, Consul, Chef, Prometheus

    Since the beginning, Cal Henderson has been the CTO of Slack. Earlier this year, he commented on a Quora question summarizing their current stack.

    Apps
    • Web: a mix of JavaScript/ES6 and React.
    • Desktop: Electron, used to ship it as a desktop application.
    • Android: a mix of Java and Kotlin.
    • iOS: a mix of Objective-C and Swift.
    Backend
    • The core application and the API are written in PHP/Hack and run on HHVM.
    • The data is stored in MySQL using Vitess.
    • Caching is done using Memcached and MCRouter.
    • The search service relies on SolrCloud, with various Java services.
    • The messaging system uses WebSockets with many services in Java and Go.
    • Load balancing is done using HAProxy with Consul for configuration.
    • Most services talk to each other over gRPC, with some Thrift and JSON-over-HTTP.
    • The voice and video calling service was built in Elixir.
    Data warehouse
    • Built using open source tools including Presto, Spark, Airflow, Hadoop and Kafka.
    Etc
    Julien DeFrance
    Principal Software Engineer at Tophatter · 16 upvotes · 372.8K views
    at SmartZip
    Tools: Amazon DynamoDB, Ruby, Node.js, AWS Lambda, New Relic, Amazon Elasticsearch Service, Elasticsearch, Superset, Amazon Quicksight, Amazon Redshift, Zapier, Segment, Amazon CloudFront, Memcached, Amazon ElastiCache, Amazon RDS for Aurora, MySQL, Amazon RDS, Amazon S3, Docker, Capistrano, AWS Elastic Beanstalk, Rails API, Rails, Algolia

    Back in 2014, I was given an opportunity to re-architect the SmartZip Analytics platform and its flagship product, SmartTargeting. This is SaaS software that helps real estate professionals keep up with their prospects and leads in a given neighborhood/territory, find out (thanks to predictive analytics) who is most likely to list/sell their home, and run cross-channel marketing automation against them: direct mail, online ads, email... The company also provides Data APIs to Enterprise customers.

    I had inherited years and years of technical debt and I knew things had to change radically. The first enabler to this was to make use of the cloud and go with AWS, so we would stop re-inventing the wheel, and build around managed/scalable services.

    For the SaaS product, we kept on working with Rails as this was what my team had the most knowledge in. We've however broken up the monolith and decoupled the front-end application from the backend thanks to the use of Rails API so we'd get independently scalable micro-services from now on.

    Our various applications could now be deployed using AWS Elastic Beanstalk, so we wouldn't waste any more effort writing time-consuming Capistrano deployment scripts, for instance. We combined this with Docker so our application would run within its own container, independently of the underlying host configuration.

    Storage-wise, we went with Amazon S3 and ditched any pre-existing local or network storage people used to deal with in our legacy systems. On the database side: Amazon RDS / MySQL initially. Ultimately migrated to Amazon RDS for Aurora / MySQL when it got released. Once again, here you need a managed service your cloud provider handles for you.

    Future improvements / technology decisions included:

    • Caching: Amazon ElastiCache / Memcached
    • CDN: Amazon CloudFront
    • Systems Integration: Segment / Zapier
    • Data-warehousing: Amazon Redshift
    • BI: Amazon Quicksight / Superset
    • Search: Elasticsearch / Amazon Elasticsearch Service / Algolia
    • Monitoring: New Relic

    As our usage grew, patterns changed, and our business needs evolved, my role as Engineering Manager and then Director of Engineering was also to ensure my team kept on learning and innovating, while delivering on business value.

    One of these innovations was to get into serverless: adopting AWS Lambda was a big step forward. At the time, it was only available for Node.js (not Ruby), but it was a great way to handle cost efficiency, unpredictable traffic, sudden bursts of traffic... Ultimately you want the whole chain of services involved in a call to be serverless, and that's when we started leveraging Amazon DynamoDB on these projects so they'd be fully scalable.

    Yonas Beshawred
    CEO at StackShare · 9 upvotes · 26K views
    at StackShare
    Tools: Memcached, Heroku, Amazon ElastiCache, Rails, PostgreSQL, MemCachier
    #RailsCaching #Caching

    We decided to use MemCachier as our Memcached provider because we were seeing some serious PostgreSQL performance issues with query-heavy pages on the site. We use MemCachier for all Rails caching and pretty aggressively too for the logged out experience (fully cached pages for the most part). We really need to move to Amazon ElastiCache as soon as possible so we can stop paying so much. The only reason we're not moving is because there are some restrictions on the network side due to our main app being hosted on Heroku.

    #Caching #RailsCaching

    Eric Colson
    Chief Algorithms Officer at Stitch Fix · 19 upvotes · 267.4K views
    at Stitch Fix
    Tools: Amazon EC2 Container Service, Docker, PyTorch, R, Python, Presto, Apache Spark, Amazon S3, PostgreSQL, Kafka
    #Data #DataStack #DataScience #ML #Etl #AWS

    The algorithms and data infrastructure at Stitch Fix is housed in #AWS. Data acquisition is split between events flowing through Kafka and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on YARN is our tool of choice for data movement and #ETL. Because our storage layer (S3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling YARN clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for ad hoc queries and dashboards.

    Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).

    At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated into our systems. That requires a serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan gives our data scientists the ability to quickly productionize the models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn), by automatically packaging them as Docker containers and deploying to Amazon ECS. This gives our data scientists a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.

    For more info:

    #DataScience #DataStack #Data

    How developers use Memcached and Apache Spark
    Reactor Digital uses Memcached

    As part of the caching system within Drupal.

    Memcached mainly took care of creating and rebuilding the REST API cache once changes had been made within Drupal.

    Casey Smith uses Memcached

    Distributed cache exposed through Google App Engine APIs; used to stage fresh data (incoming and recently processed) for faster access in the data processing pipeline.

    The Independent uses Memcached

    Memcache caches database results and articles, reducing overall DB load and allowing seamless DB maintenance during quiet periods.

    Wei Chen uses Apache Spark

    Spark is good at managing parallel data processing. We wrote a neat program to handle the terabytes of data we get every day.

    eXon Technologies uses Memcached

    Used to cache the most-used files for our clients. Connected with CloudFlare Railgun Optimizer.

    ScholaNoctis uses Memcached

    Memcached is used as a simple page cache across the whole application.

    Ralic Lo uses Apache Spark

    Used the Spark DataFrame API on SparkR for big data analysis.

    Kalibrr uses Apache Spark

    We use Apache Spark in computing our recommendations.

    BrainFinance uses Apache Spark

    As part of a big data machine learning stack (SMACK).

    Dotmetrics uses Apache Spark

    Big data analytics and nightly transformation jobs.
