
Apache Spark

Fast and general engine for large-scale data processing

What is Apache Spark?

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.
Apache Spark is a tool in the Big Data Tools category of a tech stack.
Apache Spark is an open source tool with 25.5K GitHub stars and 21.4K GitHub forks. Here’s a link to Apache Spark's open source repository on GitHub: https://github.com/apache/spark.
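
To make the description above concrete, here is a minimal PySpark sketch (a hedged example, not code from the Spark docs): it starts a local session, loads a CSV, and runs both a batch-style aggregation and an interactive SQL query. The file path and column names are hypothetical placeholders.

    # Minimal PySpark sketch; the CSV path and column names are placeholders.
    from pyspark.sql import SparkSession

    # Local standalone mode; on a Hadoop cluster this would target YARN.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("spark-intro")
             .getOrCreate())

    # Batch processing, MapReduce-style: read a file and aggregate it.
    events = spark.read.csv("events.csv", header=True, inferSchema=True)
    events.groupBy("event_date").count().show()

    # Interactive queries: the same data is queryable with SQL.
    events.createOrReplaceTempView("events")
    spark.sql("SELECT event_type, COUNT(*) AS n "
              "FROM events GROUP BY event_type ORDER BY n DESC").show()

    spark.stop()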

Who uses Apache Spark?

Companies
381 companies reportedly use Apache Spark in their tech stacks, including Uber, Slack, and Shopify.

Developers
893 developers on StackShare have stated that they use Apache Spark.

Apache Spark Integrations

Jupyter, Azure Cosmos DB, Couchbase, Snowflake, and Apache Hive are some of the popular tools that integrate with Apache Spark. Here's a list of all 26 tools that integrate with Apache Spark.

Why do developers like Apache Spark?

Here’s a list of reasons why companies and developers use Apache Spark.
Private Decisions about Apache Spark

Here are some stack decisions, common use cases and reviews by members with Apache Spark in their tech stack.

Apache Spark

Data retrieval and analysis of Cassandra data.

Apache Spark

Spark is good at managing parallel data processing. We wrote a neat program to handle the terabytes of data we get every day.

Apache Spark

Used the Spark DataFrame API via SparkR for big data analysis.

Apache Spark

Apache Spark 1.6.2 in standalone mode.

Apache Spark

Fast, fast, fast, fast, fast.

Public Decisions about Apache Spark

Here are some stack decisions, common use cases and reviews by companies and developers who chose Apache Spark in their tech stack.

Conor Myhrvold
Tech Brand Mgr, Office of CTO at Uber · 24 upvotes · 2M views
at Uber Technologies
Jaeger · Python · Java · Node.js · Go · C++ · Kubernetes · JavaScript · Red Hat OpenShift · C# · Apache Spark

How Uber developed Jaeger, the open source, end-to-end distributed tracing system that is now a CNCF project:

Distributed tracing is quickly becoming a must-have component in the tools that organizations use to monitor their complex, microservice-based architectures. At Uber, our open source distributed tracing system Jaeger saw large-scale internal adoption throughout 2016, integrated into hundreds of microservices and now recording thousands of traces every second.

Here is the story of how we got here, from investigating off-the-shelf solutions like Zipkin, to why we switched from pull to push architecture, and how distributed tracing will continue to evolve:

https://eng.uber.com/distributed-tracing/

(GitHub Pages: https://www.jaegertracing.io/, GitHub: https://github.com/jaegertracing/jaeger)

Bindings/Operator: Python, Java, Node.js, Go, C++, Kubernetes, JavaScript, OpenShift, C#, Apache Spark

Eric Colson
Chief Algorithms Officer at Stitch Fix · 19 upvotes · 881.9K views
at Stitch Fix
Kafka · PostgreSQL · Amazon S3 · Apache Spark · Presto · Python · R Language · PyTorch · Docker · Amazon EC2 Container Service
#AWS #Etl #ML #DataScience #DataStack #Data

The algorithms and data infrastructure at Stitch Fix are housed in #AWS. Data acquisition is split between events flowing through Kafka and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on YARN is our tool of choice for data movement and #ETL. Because our storage layer (S3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling YARN clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for ad hoc queries and dashboards.
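
As a rough sketch of the pattern described above (not Stitch Fix's actual code), a PySpark job submitted to YARN that reads from S3, transforms, and writes back might look like the following; the bucket and column names are hypothetical, and the S3A connector plus AWS credentials are assumed to be configured on the cluster.

    # Hypothetical S3-to-S3 ETL job; names and paths are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Submitted with --master yarn in a setup like the one described.
    spark = SparkSession.builder.appName("s3-etl").getOrCreate()

    # Read from the S3-backed warehouse.
    orders = spark.read.parquet("s3a://example-warehouse/raw/orders/")

    # Transform: roll raw orders up into daily revenue.
    daily_revenue = (orders
                     .withColumn("order_date", F.to_date("created_at"))
                     .groupBy("order_date")
                     .agg(F.sum("amount").alias("revenue")))

    # Write back to S3; compute stays elastic because storage is decoupled.
    daily_revenue.write.mode("overwrite").parquet(
        "s3a://example-warehouse/marts/daily_revenue/")

    spark.stop()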

Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).

At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated into systems. That requires a serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize those models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn) by automatically packaging them as Docker containers and deploying them to Amazon ECS. This gives our data scientists a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.

For more info:

#DataScience #DataStack #Data

Patrick Sun
Software Engineer at Stitch Fix · 10 upvotes · 36.8K views
at Stitch Fix
Victory · Apache Spark · React · Redux · Elasticsearch · Amazon S3

As a frontend engineer on the Algorithms & Analytics team at Stitch Fix, I work with data scientists to develop applications and visualizations to help our internal business partners make data-driven decisions. I envisioned a platform that would assist data scientists in the data exploration process, allowing them to visually explore and rapidly iterate through their assumptions, then share their insights with others. This would align with our team's philosophy of having engineers "deploy platforms, services, abstractions, and frameworks that allow the data scientists to conceive of, develop, and deploy their ideas with autonomy", and solve the pain of data exploration.

The final product, code-named Dora, is built with React, Redux.js and Victory, backed by Elasticsearch to enable fast and iterative data exploration, and uses Apache Spark to move data from our Amazon S3 data warehouse into the Elasticsearch cluster.
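
A hedged sketch of that last step, using the elasticsearch-hadoop Spark connector: the S3 path, Elasticsearch host, and index name are placeholders, and the connector JAR is assumed to be on the classpath (e.g. supplied via --packages).

    # Hypothetical S3-to-Elasticsearch copy; the host, path, and index
    # name are placeholders, not Stitch Fix's actual configuration.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-to-es").getOrCreate()

    # Read exploration data out of the S3 warehouse.
    df = spark.read.parquet("s3a://example-warehouse/exploration/sales/")

    # Write into Elasticsearch through the elasticsearch-hadoop connector.
    (df.write
     .format("org.elasticsearch.spark.sql")
     .option("es.nodes", "elasticsearch.internal.example.com")
     .option("es.port", "9200")
     .mode("append")
     .save("sales-exploration"))  # target Elasticsearch index

    spark.stop()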

Conor Myhrvold
Tech Brand Mgr, Office of CTO at Uber · 7 upvotes · 450.2K views
at Uber Technologies
Kafka · Kafka Manager · Hadoop · Apache Spark · GitHub

Why we built Marmaray, an open source generic data ingestion and dispersal framework and library for Apache Hadoop:

Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse it to any sink, leveraging Apache Spark. The name, Marmaray, comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:

https://eng.uber.com/marmaray-hadoop-ingestion-open-source/

(Direct GitHub repo: https://github.com/uber/marmaray)
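
Marmaray's own plug-in API is not reproduced here; as a loose illustration of the same any-source-to-any-sink idea in plain Spark, a DataFrame can act as the common in-flight representation between an arbitrary source and an arbitrary sink (the formats and paths below are made-up examples):

    # Loose illustration of the source-to-sink pattern (NOT Marmaray's API).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("source-to-sink").getOrCreate()

    # Source: e.g. JSON events landed on HDFS by Kafka consumers.
    df = spark.read.json("hdfs:///data/landing/events/")

    # Sink: e.g. a Parquet table in the warehouse. Swapping the source or
    # sink changes only the read/write format, not the pipeline in between.
    df.write.mode("append").parquet("hdfs:///data/warehouse/events/")

    spark.stop()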

Apache Spark

I use Apache Spark because it is THE framework for big data processing, from big tech to startups. It can be run on pretty much any platform. It's open source, with lots of community support and code samples to draw from.

The Python API is good for low- to mid-level transformations, but most people recommend starting with Scala/Java to use Spark's full capabilities.

It comes with quite a learning curve to make sense of how data shuffles between nodes, but it's worth it for running large-scale ETL.

Also, keep in mind the streaming and batch frameworks are not unified, so you'll have to learn them both separately.
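
As a rough illustration of that last point, here is a hedged sketch of the two separate entry points, batch reads versus Structured Streaming (the paths and the socket source are placeholders):

    # Batch vs. streaming entry points; paths and sources are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("batch-vs-streaming").getOrCreate()

    # Batch API: a finite DataFrame; the job runs to completion.
    batch_df = spark.read.json("hdfs:///data/events/")
    batch_df.groupBy("event_type").count().show()

    # Streaming API: an unbounded DataFrame; the job runs continuously.
    stream_df = (spark.readStream
                 .format("socket")
                 .option("host", "localhost")
                 .option("port", 9999)
                 .load())
    query = (stream_df.groupBy("value").count()
             .writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()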

movilebr
at Movile · 6 upvotes · 19.1K views
at Grupo Movile
Apache Spark

An article introducing how we are applying Apache Spark in Machine Learning projects.

In this article we get a high-level view of how Spark works, which APIs are available, what each of them sets out to do, and how to configure it and program with PySpark on Google Colab.

I suggest you read through the PySpark.sql documentation and try to build a few different analyses of the dataset. Try to get a better understanding of how the API works and how the processing is done; just don't use Pandas, ok? I saved the notebook with all the commands in [8]. Just download it, upload it to Google Colab, and play with the commands.
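
In that spirit, a small hedged pyspark.sql warm-up that would run in Google Colab after installing PySpark (the CSV file and column names are placeholders, not the article's dataset):

    # Small pyspark.sql warm-up; the file and columns are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("colab-demo")
             .getOrCreate())

    df = spark.read.csv("sample.csv", header=True, inferSchema=True)

    # A couple of analyses without touching Pandas:
    df.groupBy("category").agg(F.avg("price").alias("avg_price")).show()

    df.createOrReplaceTempView("sample")
    spark.sql("SELECT category, COUNT(*) AS n "
              "FROM sample GROUP BY category").show()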


Apache Spark's Features

  • Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
  • Write applications quickly in Java, Scala or Python
  • Combine SQL, streaming, and complex analytics (see the sketch after this list)
  • Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3
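
As a hedged sketch of the "combine SQL and complex analytics" point above, a SQL aggregation can feed an MLlib model inside one program (the table, columns, and model choice are illustrative placeholders):

    # SQL output feeding an MLlib model; data and columns are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("sql-plus-ml").getOrCreate()

    spark.read.parquet("hdfs:///data/orders/").createOrReplaceTempView("orders")

    # SQL step: build per-customer features.
    features = spark.sql("""
        SELECT customer_id,
               COUNT(*)    AS n_orders,
               AVG(amount) AS avg_amount,
               SUM(amount) AS total_spend
        FROM orders
        GROUP BY customer_id
    """)

    # Analytics step: fit an MLlib regression on the SQL output.
    assembled = (VectorAssembler(inputCols=["n_orders", "avg_amount"],
                                 outputCol="features")
                 .transform(features))
    model = LinearRegression(featuresCol="features",
                             labelCol="total_spend").fit(assembled)
    print(model.coefficients)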

Apache Spark Alternatives & Comparisons

What are some alternatives to Apache Spark?
Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Splunk
Splunk Inc. provides the leading platform for Operational Intelligence. Customers use Splunk to search, monitor, analyze and visualize machine data.
Cassandra
Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent manner. Cassandra will automatically repartition as machines are added and removed from the cluster. Row store means that, like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL.
Apache Beam
It implements batch and streaming data processing jobs that run on any execution engine, and it can execute pipelines on multiple execution environments.
Apache Flume
It is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
See all alternatives

Apache Spark's Followers
1292 developers follow Apache Spark to keep up with related blogs and decisions.