Apache Spark

#10 in Databases · 46 Discussions · 3.53k Followers

What is Apache Spark?

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

Apache Spark is a tool in the Databases category of a tech stack.

Key Features

  • Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
  • Write applications quickly in Java, Scala, or Python
  • Combine SQL, streaming, and complex analytics
  • Runs on Hadoop, Mesos, standalone, or in the cloud, and can access diverse data sources including HDFS, Cassandra, HBase, and S3
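Spark's batch model generalizes MapReduce's map and reduce steps into a richer set of transformations. As a rough single-process analogy in plain Python (not the actual PySpark API — a real Spark job runs these steps partitioned across a cluster), a word count in the RDD style chains flatMap → map → reduceByKey:

```python
# Local, single-process analogy of Spark's RDD word count.
# flat_map and reduce_by_key here are stand-ins for the RDD
# transformations of the same names; nothing is distributed.
def flat_map(func, records):
    for record in records:
        yield from func(record)

def reduce_by_key(func, pairs):
    out = {}
    for key, value in pairs:
        out[key] = func(out[key], value) if key in out else value
    return out

lines = ["spark is fast", "spark is general"]
words = flat_map(str.split, lines)                 # flatMap: line -> words
pairs = ((w, 1) for w in words)                    # map: word -> (word, 1)
counts = reduce_by_key(lambda a, b: a + b, pairs)  # reduceByKey: sum counts
print(counts["spark"])
```

In actual PySpark the same pipeline reads almost identically (`rdd.flatMap(...).map(...).reduceByKey(...)`), which is part of why the API is quick to pick up.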

Apache Spark Pros & Cons

Pros of Apache Spark

  • ✓Open-source
  • ✓Fast and flexible
  • ✓Great for distributed SQL-like applications
  • ✓One platform for every big data problem
  • ✓Easy to install and use
  • ✓Works well for most data science use cases
  • ✓In-memory computation
  • ✓Interactive queries
  • ✓Machine learning libraries and real-time streaming

Cons of Apache Spark

  • ✗Speed

Apache Spark Alternatives & Comparisons

What are some alternatives to Apache Spark?

Splunk

It provides the leading platform for Operational Intelligence. Customers use it to search, monitor, analyze and visualize machine data.

Apache Flink

Apache Flink is an open source system for fast and versatile data analytics in clusters. Flink supports batch and streaming analytics, in one system. Analytical programs can be written in concise and elegant APIs in Java and Scala.

Amazon Athena

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

Apache Hive

Hive facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage.

AWS Glue

A fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics.

Presto

Distributed SQL Query Engine for Big Data

Apache Spark Integrations

Google Cloud Bigtable, Azure Cosmos DB, MapD, Apache Zeppelin, Apache Kylin and 7 more are some of the popular tools that integrate with Apache Spark. Here's a list of all 12 tools that integrate with Apache Spark.

  • Google Cloud Bigtable
  • Azure Cosmos DB
  • MapD
  • Apache Zeppelin
  • Apache Kylin
  • TransmogrifAI
  • Couchbase
  • SQLdep
  • Delta Lake
  • .NET for Apache Spark
  • Apache Hive
  • Azure Databricks

Apache Spark Discussions

Discover why developers choose Apache Spark. Read real-world technical decisions and stack choices from the StackShare community. Showing 4 of 5 discussions.

Conor Myhrvold

Tech Brand Mgr, Office of CTO at Uber Technologies

Dec 4, 2018

Needs advice on Kafka, Kafka Manager, and Hadoop

Why we built Marmaray, an open source generic data ingestion and dispersal framework and library for Apache Hadoop:

Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse it to any sink, leveraging Apache Spark. The name, Marmaray, comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:

https://eng.uber.com/marmaray-hadoop-ingestion-open-source/

(Direct GitHub repo: https://github.com/uber/marmaray)
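The plug-in design described in the post — any registered source dispersed to any registered sink — can be sketched with a small registry. This is a hypothetical illustration in plain Python; the names, decorators, and sample records below are invented for the sketch and are not Marmaray's actual API (see the GitHub repo for the real thing):

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical plug-in registries: sources produce records,
# sinks consume them. Any source can pair with any sink.
SOURCES: Dict[str, Callable[[], Iterable[dict]]] = {}
SINKS: Dict[str, Callable[[Iterable[dict]], List[dict]]] = {}

def register_source(name: str):
    def wrap(fn):
        SOURCES[name] = fn
        return fn
    return wrap

def register_sink(name: str):
    def wrap(fn):
        SINKS[name] = fn
        return fn
    return wrap

@register_source("kafka")
def kafka_source():
    # Stand-in for consuming from a Kafka topic.
    yield {"event": "ride_requested"}
    yield {"event": "ride_completed"}

@register_sink("hive")
def hive_sink(records):
    # Stand-in for writing to a Hive table; returns what it "wrote".
    return list(records)

def run_pipeline(source_name: str, sink_name: str) -> List[dict]:
    return SINKS[sink_name](SOURCES[source_name]())

print(run_pipeline("kafka", "hive"))
```

The point of the pattern is that adding a new source or sink is one registration, not a change to the pipeline runner — which matches the "any source to any sink" goal stated above.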

Conor Myhrvold

Tech Brand Mgr, Office of CTO at Uber Technologies

Dec 4, 2018

Needs advice on GitHub, GitHub Pages, and Jaeger

How Uber developed the open source, end-to-end distributed tracing system Jaeger, now a CNCF project:

Distributed tracing is quickly becoming a must-have component in the tools that organizations use to monitor their complex, microservice-based architectures. At Uber, our open source distributed tracing system Jaeger saw large-scale internal adoption throughout 2016, integrated into hundreds of microservices and now recording thousands of traces every second.

Here is the story of how we got here, from investigating off-the-shelf solutions like Zipkin, to why we switched from pull to push architecture, and how distributed tracing will continue to evolve:

https://eng.uber.com/distributed-tracing/

(GitHub Pages: https://www.jaegertracing.io/, GitHub: https://github.com/jaegertracing/jaeger)

Bindings/Operator: Python, Java, Node.js, Golang, C++, Kubernetes, JavaScript, Red Hat OpenShift, C#, Apache Spark

Tobias Widmer

CTO at Onedot

Dec 3, 2018

Needs advice on React, Redux, and Scala

Onedot is building an automated data preparation service using probabilistic and statistical methods including artificial intelligence (AI). From the beginning, having a stable foundation while at the same time being able to iterate quickly was very important to us. Due to the nature of compute workloads we face, the decision for a functional programming paradigm and a scalable cluster model was a no-brainer.

We started playing with Apache Spark very early on, when the platform was still in its infancy. As a storage backend, we first used Cassandra, but found out that it was not the optimal choice for our workloads (lots of rather smallish datasets, data pipelines with considerable complexity, etc.). In the end, we migrated dataset storage to Amazon S3, which proved to be much more adequate for our case.

In the frontend, we bet on more traditional frameworks like React/Redux, Blueprint and a number of common npm packages of our universe. Because of the very positive experience with Scala (in particular the ability to write things very expressively, use immutability across the board, etc.) we settled with TypeScript in the frontend. In our opinion, a very good decision. Nowadays, transpiling is a common thing, so we thought why not introduce the same type-safety and mathematical rigour to the user interface?

Patrick Sun

Software Engineer at Stitch Fix

Sep 13, 2018

Needs advice on Victory, Apache Spark, and React

As a frontend engineer on the Algorithms & Analytics team at Stitch Fix, I work with data scientists to develop applications and visualizations to help our internal business partners make data-driven decisions. I envisioned a platform that would assist data scientists in the data exploration process, allowing them to visually explore and rapidly iterate through their assumptions, then share their insights with others. This would align with our team's philosophy of having engineers "deploy platforms, services, abstractions, and frameworks that allow the data scientists to conceive of, develop, and deploy their ideas with autonomy", and solve the pain of data exploration.

The final product, code-named Dora, is built with React, Redux and Victory, backed by Elasticsearch to enable fast and iterative data exploration, and uses Apache Spark to move data from our Amazon S3 data warehouse into the Elasticsearch cluster.
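The last leg of the pipeline described here — landing batch output in Elasticsearch — ultimately means producing documents in Elasticsearch's bulk-API format (alternating action and document lines of newline-delimited JSON). A hedged sketch of that formatting step in plain Python; the index name and rows are made up, and a real Spark job would typically delegate this to a connector such as elasticsearch-hadoop rather than build payloads by hand:

```python
import json

def to_bulk_payload(rows, index="styles"):
    """Format rows as an Elasticsearch bulk-API payload:
    one {"index": ...} action line, then the document line,
    for each row. The bulk API requires a trailing newline."""
    lines = []
    for row in rows:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(row))
    return "\n".join(lines) + "\n"

# Hypothetical warehouse rows standing in for S3 batch output.
rows = [{"sku": 1, "score": 0.9}, {"sku": 2, "score": 0.4}]
payload = to_bulk_payload(rows)
print(payload)
```

Batching documents this way (rather than indexing one at a time) is what makes the load step fast enough for the iterative exploration the post describes.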

View all 5 discussions


Adoption

On StackShare

Companies: 566
Developers: 2.37k