
What is Apache Spark?

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.
Apache Spark is a tool in the Big Data Tools category of a tech stack.
Apache Spark is an open source tool with 23.4K GitHub stars and 20.1K GitHub forks. Here's a link to Apache Spark's open source repository on GitHub: https://github.com/apache/spark
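
To get a feel for the programming model, here is a minimal PySpark batch job. This is only a sketch: the input file "events.json" and its columns are invented for illustration.

    # Minimal PySpark batch job (hypothetical input file and columns).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-intro").getOrCreate()

    df = spark.read.json("events.json")   # load semi-structured data
    (df.filter(df.status == "ok")         # keep only successful events
       .groupBy("country")
       .count()
       .show())                           # action: triggers the actual job

    spark.stop()

The same DataFrame API also drives Spark SQL, streaming, and MLlib, which is what makes Spark a general engine rather than a batch-only one.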

Who uses Apache Spark?

Companies
355 companies reportedly use Apache Spark in their tech stacks, including Uber, Slack, and Shopify.

Developers
591 developers on StackShare have stated that they use Apache Spark.

Apache Spark Integrations

Couchbase, Azure Cosmos DB, Snowflake, Apache Zeppelin, and Apache Hive are some of the popular tools that integrate with Apache Spark. Here's a list of all 21 tools that integrate with Apache Spark.

Why do developers like Apache Spark?

Here's a list of reasons why companies and developers use Apache Spark.
Apache Spark Reviews

Here are some stack decisions, common use cases and reviews by companies and developers who chose Apache Spark in their tech stack.

Eric Colson
Chief Algorithms Officer at Stitch Fix · 19 upvotes · 152.4K views
at Stitch Fix
Tools: Amazon EC2 Container Service, Docker, PyTorch, R, Python, Presto, Apache Spark, Amazon S3, PostgreSQL, Kafka
#Data #DataStack #DataScience #ML #Etl #AWS

The algorithms and data infrastructure at Stitch Fix are housed in #AWS. Data acquisition is split between events flowing through Kafka and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on YARN is our tool of choice for data movement and #ETL. Because our storage layer (S3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling YARN clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for ad hoc queries and dashboards.
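
As a rough illustration of that pattern, a Spark-on-YARN ETL job can read a snapshot from S3, aggregate it, and write the result back to S3. The bucket names, paths, and columns below are hypothetical, not Stitch Fix's actual layout.

    # Hypothetical S3-to-S3 ETL job; storage (S3) stays decoupled from compute (YARN).
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("s3-etl").getOrCreate()

    orders = spark.read.parquet("s3a://example-warehouse/raw/orders/")

    daily = (orders
             .withColumn("order_date", F.to_date("created_at"))
             .groupBy("order_date")
             .agg(F.count("*").alias("orders"),
                  F.sum("amount").alias("revenue")))

    daily.write.mode("overwrite").parquet("s3a://example-warehouse/agg/daily_orders/")

Because the data lives in S3 rather than on cluster-local disks, a YARN cluster running jobs like this can be grown, shrunk, or replaced without migrating any data.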

Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).

At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated into systems. That requires a serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan gives our data scientists the ability to quickly productionize models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn) by automatically packaging them as Docker containers and deploying them to Amazon ECS. This provides our data scientists a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.

Conor Myhrvold
Tech Brand Mgr, Office of CTO at Uber · 11 upvotes · 505.2K views
at Uber Technologies
Tools: Apache Spark, C#, OpenShift, JavaScript, Kubernetes, C++, Go, Node.js, Java, Python, Jaeger

How Uber developed the open source, end-to-end distributed tracing system Jaeger, now a CNCF project:

Distributed tracing is quickly becoming a must-have component in the tools that organizations use to monitor their complex, microservice-based architectures. At Uber, our open source distributed tracing system Jaeger saw large-scale internal adoption throughout 2016, integrated into hundreds of microservices and now recording thousands of traces every second.

Here is the story of how we got here, from investigating off-the-shelf solutions like Zipkin, to why we switched from pull to push architecture, and how distributed tracing will continue to evolve:

https://eng.uber.com/distributed-tracing/

(GitHub Pages: https://www.jaegertracing.io/, GitHub: https://github.com/jaegertracing/jaeger)
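
For a sense of what instrumenting a service with Jaeger looks like, here is a minimal sketch using the jaeger-client Python package; the service and span names are invented.

    # Minimal Jaeger instrumentation sketch (hypothetical service and spans).
    from jaeger_client import Config

    config = Config(
        config={"sampler": {"type": "const", "param": 1}, "logging": True},
        service_name="example-service",
    )
    tracer = config.initialize_tracer()

    with tracer.start_span("handle-request") as span:
        span.set_tag("user.id", 42)
        with tracer.start_span("db-query", child_of=span):
            pass  # the traced work would go here

    tracer.close()  # flush remaining spans before shutdown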

Bindings/Operator: Python, Java, Node.js, Go, C++, Kubernetes, JavaScript, OpenShift, C#, Apache Spark

Patrick Sun
Software Engineer at Stitch Fix · 10 upvotes · 22.1K views
at Stitch Fix
Tools: Apache Spark, Victory, Amazon S3, Elasticsearch, Redux, React

As a frontend engineer on the Algorithms & Analytics team at Stitch Fix, I work with data scientists to develop applications and visualizations to help our internal business partners make data-driven decisions. I envisioned a platform that would assist data scientists in the data exploration process, allowing them to visually explore and rapidly iterate through their assumptions, then share their insights with others. This would align with our team's philosophy of having engineers "deploy platforms, services, abstractions, and frameworks that allow the data scientists to conceive of, develop, and deploy their ideas with autonomy", and solve the pain of data exploration.

The final product, code-named Dora, is built with React, Redux.js and Victory, backed by Elasticsearch to enable fast and iterative data exploration, and uses Apache Spark to move data from our Amazon S3 data warehouse into the Elasticsearch cluster.
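
The S3-to-Elasticsearch hop can be sketched with the elasticsearch-hadoop connector; the paths, host, and index name below are placeholders, since the real Dora pipeline is internal to Stitch Fix.

    # Hypothetical Spark job loading S3 data into Elasticsearch via elasticsearch-hadoop.
    from pyspark.sql import SparkSession

    # The connector jar must be on the classpath, e.g. via spark-submit --packages.
    spark = SparkSession.builder.appName("s3-to-es").getOrCreate()

    df = spark.read.parquet("s3a://example-warehouse/exploration/dataset/")

    (df.write
       .format("org.elasticsearch.spark.sql")
       .option("es.nodes", "es.internal.example.com")
       .option("es.port", "9200")
       .mode("overwrite")
       .save("dora-dataset"))   # target Elasticsearch index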

movilebr
at Movile · 6 upvotes · 7.1K views
at Grupo Movile
Tools: Apache Spark

An article introducing how we are applying Apache Spark in Machine Learning projects.

In this article you get a high-level view of how Spark works, which APIs are available, what each of them is designed to do, and how to configure it and program with PySpark on Google Colab.

I suggest you take a look at the PySpark.sql documentation and try building a few different analyses of the dataset. Try to get a better sense of how the API works and how the processing is done, just no using Pandas, OK? I saved the notebook with all the commands in [8]. Just download it, upload it to Google Colab, and play around with the commands.
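
In the spirit of that exercise, a minimal pyspark.sql session in Colab might look like the following; the CSV path and column names are placeholders for whatever dataset you load.

    # In a Colab cell first: !pip install pyspark
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[*]").appName("colab").getOrCreate()

    df = spark.read.csv("sample.csv", header=True, inferSchema=True)

    (df.groupBy("category")                       # aggregate without Pandas
       .agg(F.avg("price").alias("avg_price"))
       .orderBy(F.desc("avg_price"))
       .show(10))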

Tools: Apache Spark

I use Apache Spark because it is THE framework for big data processing, from big tech to startups. It can run on pretty much any platform, it's open source, and there is lots of community support and plenty of code samples to draw from.

The Python API is good for low- to mid-level transformations, but most people recommend starting with Scala/Java to use Spark's full capabilities.

It comes with quite a learning curve to make sense of how data is shuffled across different nodes, but it's worth it for running large-scale ETL.

Also, keep in mind that the streaming and batch frameworks are not unified, so you'll have to learn them both separately.
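
To make the shuffle point concrete: narrow transformations like filter() stay within a partition, while wide ones like groupBy() move rows between nodes. A small, hypothetical illustration:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
    df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 100)

    narrow = df.filter(F.col("id") > 10)      # no shuffle: computed in place
    wide = df.groupBy("bucket").count()       # shuffle: rows repartitioned by key

    wide.explain()   # the physical plan shows an Exchange (shuffle) stage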

StackShare Editors
4 upvotes · 21.9K views
at Uber Technologies
Tools: TensorFlow, Apache Spark, Cassandra

In mid-2015, Uber began exploring ways to scale ML across the organization, avoiding ML anti-patterns while standardizing workflows and tools. This effort led to Michelangelo.

Michelangelo consists of a mix of open source systems and components built in-house. The primary open source components used are HDFS, Spark, Samza, Cassandra, MLlib, XGBoost, and TensorFlow.
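
As a hedged sketch of one of those open source building blocks (not Michelangelo's actual code), a Spark MLlib training pipeline looks roughly like this; the data path and column names are invented.

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()
    train = spark.read.parquet("s3a://example-bucket/training-data/")

    # Assemble raw columns into a feature vector, then fit a classifier.
    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")

    model = Pipeline(stages=[assembler, lr]).fit(train)
    model.transform(train).select("label", "prediction").show(5)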


Apache Spark's Features

  • Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
  • Write applications quickly in Java, Scala, or Python
  • Combine SQL, streaming, and complex analytics (see the sketch after this list)
  • Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
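
As a sketch of the third point, Structured Streaming lets the same SQL-style API run over an unbounded stream; the socket source and port here are just for illustration.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("sql-plus-streaming").getOrCreate()

    # Read an unbounded stream of text lines from a socket.
    lines = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    words = lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()    # identical API to a batch groupBy

    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()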

Apache Spark Alternatives & Comparisons

What are some alternatives to Apache Spark?
Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Splunk
Splunk Inc. provides the leading platform for Operational Intelligence. Customers use Splunk to search, monitor, analyze and visualize machine data.
Cassandra
Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent manner. Cassandra will automatically repartition as machines are added and removed from the cluster. Row store means that, like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL.
Apache Beam
It implements batch and streaming data processing jobs that run on any execution engine. It executes pipelines on multiple execution environments.
Apache Flume
It is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
See all alternatives

Apache Spark's Followers
774 developers follow Apache Spark to keep up with related blogs and decisions.
Waris Radji
Pradeep Gupta
Bharat Gupta
ran_ved
zhenxu66
Clifton Jadoo
Clyde Eccleston-Barrow
Benjamin Mpoyi
Gopi K
Suresh Grandhisiri