
Amazon Machine Learning vs Apache Spark: What are the differences?

What is Amazon Machine Learning? Visualization tools and wizards that guide you through the process of creating ML models without having to learn complex ML algorithms and technology. This AWS service helps you use all of the data you’ve been collecting to improve the quality of your decisions. You can build and fine-tune predictive models using large amounts of data, and then use Amazon Machine Learning to make predictions (in batch mode or in real time) at scale. You can benefit from machine learning even if you don’t have an advanced degree in statistics or the desire to set up, run, and maintain your own processing and storage infrastructure.

What is Apache Spark? Spark is a fast and general engine for large-scale data processing, compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and newer workloads like streaming, interactive queries, and machine learning.

Amazon Machine Learning belongs to the "Machine Learning as a Service" category of the tech stack, while Apache Spark is primarily classified under "Big Data Tools".

Some of the features offered by Amazon Machine Learning are:

  • Easily Create Machine Learning Models
  • From Models to Predictions in Seconds
  • Scalable, High Performance Prediction Generation Service
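
These wizards sit on top of a plain API, so the same steps can be scripted. Below is a minimal sketch using the boto3 SDK's machinelearning client; the bucket, resource IDs, and schema are hypothetical placeholders, not real resources:

```python
# Hypothetical sketch: create a training data source from a CSV in S3,
# then train a binary classification model on it.
import json
import boto3

ml = boto3.client("machinelearning", region_name="us-east-1")

# Amazon ML data schemas describe each column and name the target attribute.
schema = {
    "version": "1.0",
    "targetAttributeName": "churned",
    "dataFormat": "CSV",
    "dataFileContainsHeader": True,
    "attributes": [
        {"attributeName": "tenure_months", "attributeType": "NUMERIC"},
        {"attributeName": "plan", "attributeType": "CATEGORICAL"},
        {"attributeName": "churned", "attributeType": "BINARY"},
    ],
}

ml.create_data_source_from_s3(
    DataSourceId="ds-churn-train",
    DataSpec={
        "DataLocationS3": "s3://example-bucket/churn/train.csv",
        "DataSchema": json.dumps(schema),
    },
    ComputeStatistics=True,  # statistics are required before training
)

# Amazon ML chooses and tunes the learning algorithm itself.
ml.create_ml_model(
    MLModelId="ml-churn-v1",
    MLModelType="BINARY",
    TrainingDataSourceId="ds-churn-train",
)
```

Predictions can then be generated in batch (create_batch_prediction) or in real time once a real-time endpoint has been created for the model.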

On the other hand, Apache Spark provides the following key features:

  • Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
  • Write applications quickly in Java, Scala or Python
  • Combine SQL, streaming, and complex analytics
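
The third point is what sets Spark apart: SQL and programmatic code run on the same engine and can be mixed in one job. A minimal PySpark sketch (the input file and its fields are hypothetical):

```python
# Minimal PySpark sketch mixing SQL and DataFrame analytics in a single job.
# events.json and its fields (user_id, ts) are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("combined-analytics").getOrCreate()

events = spark.read.json("events.json")   # load semi-structured data
events.createOrReplaceTempView("events")  # make it queryable from SQL

# A SQL aggregation feeding a DataFrame aggregation on the same engine.
daily = spark.sql(
    "SELECT user_id, to_date(ts) AS day, count(*) AS n "
    "FROM events GROUP BY user_id, to_date(ts)"
)
top_users = daily.groupBy("user_id").agg(F.sum("n").alias("total"))
top_users.orderBy(F.desc("total")).show(10)
```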

Apache Spark is an open source tool with 22.5K GitHub stars and 19.4K GitHub forks; its open source repository lives at https://github.com/apache/spark.

Uber Technologies, Slack, and Shopify are some of the popular companies that use Apache Spark, whereas Amazon Machine Learning is used by Apli, Cymatic Security, and FetchyFox. Apache Spark has broader approval, being mentioned in 266 company stacks and 112 developer stacks, compared to Amazon Machine Learning, which is listed in 9 company stacks and 10 developer stacks.



What are some alternatives to Amazon Machine Learning and Apache Spark?

TensorFlow
TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API.

Amazon SageMaker
A fully managed service that enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale.

RapidMiner
A software platform for data science teams that unites data prep, machine learning, and predictive model deployment.

Azure Machine Learning
A fully managed cloud service that enables data scientists and developers to efficiently embed predictive analytics into their applications, helping organizations use massive data sets and bring all the benefits of the cloud to machine learning.

NanoNets
Build a custom machine learning model without expertise or a large amount of data: upload images to NanoNets, wait a few minutes, and integrate the NanoNets API into your application.
Decisions about Amazon Machine Learning and Apache Spark

StackShare Editors | Presto, Apache Spark, Hadoop

Around 2015, the growing use of Uber’s data exposed limitations in the ETL and Vertica-centric setup, not to mention the increasing costs. “As our company grew, scaling our data warehouse became increasingly expensive. To cut down on costs, we started deleting older, obsolete data to free up space for new data.”

To overcome these challenges, Uber rebuilt their big data platform around Hadoop. “More specifically, we introduced a Hadoop data lake where all raw data was ingested from different online data stores only once and with no transformation during ingestion.”

“In order for users to access data in Hadoop, we introduced Presto to enable interactive ad hoc user queries, Apache Spark to facilitate programmatic access to raw data (in both SQL and non-SQL formats), and Apache Hive to serve as the workhorse for extremely large queries.”
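
That “programmatic access to raw data (in both SQL and non-SQL formats)” maps directly onto Spark's two front ends. A hedged sketch, with hypothetical lake paths and columns:

```python
# Hypothetical sketch of programmatic access to a Hadoop data lake:
# the same raw data is reachable via the DataFrame API and via SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-access").getOrCreate()

trips = spark.read.parquet("hdfs:///data_lake/trips")  # non-SQL: DataFrame API
trips.createOrReplaceTempView("trips")                 # expose the same data to SQL
spark.sql("SELECT city_id, count(*) AS n FROM trips GROUP BY city_id").show()
```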

StackShare Editors | Presto, Apache Spark, Hadoop

To improve platform scalability and efficiency, Uber transitioned from JSON to Parquet and built a central schema service to manage schemas and integrate different client libraries.

While the first-generation big data platform was vulnerable to upstream data format changes, “ad hoc data ingestion jobs were replaced with a standard platform to transfer all source data in its original, nested format into the Hadoop data lake.”

These platform changes enabled Uber to meet the scaling challenges it was facing around that time: “On a daily basis, there were tens of terabytes of new data added to our data lake, and our Big Data platform grew to over 10,000 vcores with over 100,000 running batch jobs on any given day.”
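
In Spark terms, that standard ingestion step amounts to reading the original nested JSON once and landing it in the lake as Parquet. A sketch with hypothetical paths:

```python
# Hypothetical sketch of the JSON-to-Parquet conversion described above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# Spark infers the nested schema from the source JSON.
raw = spark.read.json("hdfs:///ingest/source_table/")

# Parquet preserves the nested structure but stores it columnar and
# compressed, which is what made the format switch pay off at scale.
raw.write.mode("overwrite").parquet("hdfs:///data_lake/source_table/")
```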

StackShare Editors | Presto, Apache Spark, Scala, MySQL, Kafka

Slack’s data team works to “provide an ecosystem to help people in the company quickly and easily answer questions about usage, so they can make better and data informed decisions.” To achieve that goal, they rely on a complex data pipeline.

An in-house tool called Sqooper scrapes MySQL backups and pipes them to S3. Job queue and log data are sent to Kafka, then persisted to S3 using Secor, an open source tool created by Pinterest.

For compute, Amazon’s Elastic MapReduce (EMR) creates clusters preconfigured for Presto, Hive, and Spark.

Presto is then used for ad-hoc questions, validating data assumptions, exploring smaller datasets, and creating visualizations for some internal tools. Hive is used for larger data sets or longer time series data, and Spark allows teams to write efficient and robust batch and aggregation jobs. Most of the Spark pipeline is written in Scala.
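
Slack writes those Spark jobs in Scala; purely as an illustration, here is a rough PySpark equivalent of such a batch aggregation over log data landed in S3 (the bucket and fields are hypothetical):

```python
# Hypothetical PySpark sketch of a daily usage rollup; Slack's real jobs
# are written in Scala. Bucket names and fields are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-usage-rollup").getOrCreate()

logs = spark.read.json("s3://example-logs/2019/06/01/")
rollup = (
    logs.groupBy("team_id", F.to_date("ts").alias("day"))
        .agg(F.countDistinct("user_id").alias("active_users"),
             F.count("*").alias("events"))
)
rollup.write.mode("overwrite").parquet("s3://example-warehouse/daily_usage/")
```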

Thrift binds all of these engines together with a typed schema and structured data.

Finally, the Hive Metastore serves as the ground truth for all data and its schema.

StackShare Editors | Apache Thrift, Kotlin, Presto, HHVM (HipHop Virtual Machine), gRPC, Kubernetes, Apache Spark, Airflow, Terraform, Hadoop, Swift, Hack, Memcached, Consul, Chef, Prometheus

Since the beginning, Cal Henderson has been the CTO of Slack. Earlier this year, he commented on a Quora question summarizing their current stack.

Apps
• Web: a mix of JavaScript/ES6 and React.
• Desktop: Electron, used to ship the web client as a desktop application.
• Android: a mix of Java and Kotlin.
• iOS: written in a mix of Objective-C and Swift.
Backend
• The core application and the API are written in PHP/Hack and run on HHVM.
• The data is stored in MySQL using Vitess.
• Caching is done using Memcached and MCRouter.
• The search service is backed by SolrCloud, with various Java services.
• The messaging system uses WebSockets, with many services in Java and Go.
• Load balancing is done using HAProxy, with Consul for configuration.
• Most services talk to each other over gRPC; some use Thrift or JSON-over-HTTP.
• The voice and video calling service was built in Elixir.
Data warehouse
• Built using open source tools including Presto, Spark, Airflow, Hadoop, and Kafka.
Julien DeFrance, Full Stack Engineering Manager at ValiMail | 2 upvotes · 11.8K views | at SmartZip
Amazon SageMaker, Amazon Machine Learning, AWS Lambda, Serverless | #FaaS #GCP #PaaS

Which #IaaS / #PaaS to choose? Not all #Cloud providers are created equal. As you start to use one or the other, you'll build around very specific services that don't have an equivalent elsewhere.

Back in 2014/2015, this decision I made for SmartZip was a no-brainer, and #AWS won. AWS has been a leader and, over the years, has demonstrated its capacity to innovate and reduce toil like no other.

Year after year, this kept being confirmed as they rolled out new (managed) services, got into serverless with AWS Lambda / FaaS, and put domains such as #AI / #MachineLearning into the hands of every developer thanks to Amazon Machine Learning or Amazon SageMaker, for instance.
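
To make that concrete, here is a hypothetical sketch of how little glue that takes: a Lambda handler scoring one record against an Amazon Machine Learning real-time endpoint (the model ID, endpoint URL, and features below are placeholders):

```python
# Hypothetical AWS Lambda handler calling an Amazon Machine Learning
# real-time endpoint; model ID, endpoint URL, and features are
# placeholders. Record values are passed as strings.
import boto3

ml = boto3.client("machinelearning")

def handler(event, context):
    resp = ml.predict(
        MLModelId="ml-churn-v1",
        Record={"tenure_months": "18", "plan": "pro"},
        PredictEndpoint="https://realtime.machinelearning.us-east-1.amazonaws.com",
    )
    return resp["Prediction"]  # contains predictedLabel / predictedScores
```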

Should you compare with #GCP, for instance, it's not quite there yet. Building around these managed services, #AWS allowed me to get my developers to a whole new level. Where they know what's under the hood. Where they know they have these services available and can build around them. Where they care about and are responsible for operations, security, and deployment of what they've worked on.

Eric Colson, Chief Algorithms Officer at Stitch Fix | 19 upvotes · 204.6K views
Amazon EC2 Container Service, Docker, PyTorch, R, Python, Presto, Apache Spark, Amazon S3, PostgreSQL, Kafka | #Data #DataStack #DataScience #ML #Etl #AWS

The algorithms and data infrastructure at Stitch Fix are housed in #AWS. Data acquisition is split between events flowing through Kafka and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on YARN is our tool of choice for data movement and #ETL. Because our storage layer (S3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling YARN clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for ad hoc queries and dashboards.
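
A minimal PySpark sketch of that decoupled pattern (buckets and columns are hypothetical): the job reads from S3 and writes back to S3, so YARN clusters can be resized or recreated without moving any data.

```python
# Hypothetical sketch of storage/compute decoupling: Spark (submitted with
# --master yarn) reads from and writes to S3; nothing lives on the cluster.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-daily-counts").getOrCreate()

events = spark.read.parquet("s3://example-warehouse/events/")
daily = events.groupBy("client_id", "event_date").agg(F.count("*").alias("n"))
daily.write.mode("overwrite").parquet("s3://example-warehouse/daily_counts/")
```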

Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in-house and open sourced (see https://github.com/stitchfix/flotilla-os).

At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated into systems. That requires a serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan gives our data scientists the ability to quickly productionize models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn) by automatically packaging them as Docker containers and deploying them to Amazon ECS. This provides our data scientists a one-click path from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.


How developers use Amazon Machine Learning and Apache Spark

Wei Chen uses Apache Spark

Spark is good at managing parallel data processing. We wrote a neat program to handle the TBs of data we get every day.

Ralic Lo uses Apache Spark

Used the Spark DataFrame API on SparkR for big data analysis.

Kalibrr uses Apache Spark

We use Apache Spark in computing our recommendations.

BrainFinance uses Apache Spark

As part of our big data machine learning stack (SMACK).

Dotmetrics uses Apache Spark

Big data analytics and nightly transformation jobs.

Taylor Host uses Amazon Machine Learning

Mild re-training data usage.
