Apache Spark vs Serverless

Overview

Apache Spark: Stacks 3.1K, Followers 3.5K, Votes 140, GitHub Stars 42.2K, Forks 28.9K

Serverless: Stacks 2.2K, Followers 1.2K, Votes 28, GitHub Stars 46.9K, Forks 5.7K

Apache Spark vs Serverless: What are the differences?

Introduction:

Apache Spark and Serverless are two popular approaches to running data workloads at scale. Spark is a distributed engine for processing large datasets, while serverless platforms run event-driven functions on demand. Although both can handle data processing, there are key differences between them. This article covers the six main differences between Apache Spark and Serverless.

Deployment Model: Apache Spark is typically deployed on a cluster of machines, where data is partitioned across nodes and processed in parallel. Serverless technologies such as AWS Lambda or Azure Functions are event-driven and let developers run code on demand without managing the underlying infrastructure.
Resource Allocation: In Apache Spark, resources are usually declared per application, for example how much memory and how many cores each executor gets. Dynamic allocation can adjust the number of executors, but the cluster itself still has to be sized and managed. Serverless platforms allocate resources automatically based on demand, scaling up or down as needed, which can improve resource utilization and reduce cost.
Scalability: Apache Spark scales horizontally, meaning capacity grows by adding more machines to the cluster. Serverless platforms scale at the function level, so each function scales independently with its incoming workload without affecting other functions.
State Management: Apache Spark provides an in-memory computing model that lets users cache or persist datasets in memory across operations for faster processing. Serverless functions are stateless by design. They are intended for short-lived invocations that process small units of data, and any state that must outlive an invocation has to be kept in an external store.
Cost Model: Apache Spark needs a cluster to run on, which may involve upfront costs for hardware and infrastructure, or ongoing charges for provisioned cloud instances even when the cluster is idle. Serverless platforms follow a pay-as-you-go pricing model in which users pay only for the execution time and resources their functions actually use, which can save money, especially for sporadic workloads.
Flexibility: Apache Spark offers a wide range of data processing and analysis capabilities through its library ecosystem, including batch processing, streaming, interactive queries, machine learning, and graph processing. Serverless platforms focus on event-driven functions and are optimized for short-lived, stateless operations.
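
To make the resource allocation and state management points above more concrete, here is a minimal Python sketch. It assumes a Spark job submitted with spark-submit and a function written in the style of an AWS Lambda handler. The configuration keys are standard Spark properties, but the values, the file name job.py, and the shape of the event (a "records" list with a "value" field) are hypothetical and chosen only for illustration.

```python
import shlex


def spark_submit_command(app_path, conf):
    """Build a spark-submit command line from explicit resource settings.

    Each setting is passed as a --conf key=value pair, which is how
    Spark properties are supplied on the command line.
    """
    args = ["spark-submit"]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app_path)
    return " ".join(shlex.quote(arg) for arg in args)


# Spark side: resources are declared up front for the application.
# The values below are illustrative, not tuning advice.
spark_conf = {
    "spark.executor.memory": "4g",
    "spark.executor.cores": "2",
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.maxExecutors": "10",
}


def handler(event, context):
    """Lambda-style entry point: everything it needs arrives in the event.

    Nothing is kept between invocations, so the platform can run as many
    copies in parallel as the incoming workload requires.
    The event shape used here is a hypothetical example.
    """
    records = event.get("records", [])
    return {"count": len(records), "total": sum(r["value"] for r in records)}


if __name__ == "__main__":
    print(spark_submit_command("job.py", spark_conf))
    print(handler({"records": [{"value": 2}, {"value": 3}]}, None))
```

On the Spark side the developer states the resource envelope explicitly, while the handler contains no resource settings at all because the platform decides how many instances to run.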

In summary, Apache Spark and Serverless differ in their deployment model, resource allocation, scalability, state management, cost model, and flexibility. Apache Spark runs on a cluster with explicitly configured resources and can handle large and complex workloads. Serverless platforms are event-driven, allocate resources automatically, and are optimized for short-lived functions with lower upfront costs.
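
The cost trade-off can be made concrete with a little arithmetic. The sketch below compares an always-on cluster with per-invocation billing and estimates the monthly invocation count at which the two cost the same. All rates and sizes are hypothetical placeholders, not real provider prices, and the model ignores storage, data transfer, and free tiers.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def cluster_monthly_cost(nodes, hourly_rate_per_node, hours=HOURS_PER_MONTH):
    """Cost of a cluster that is provisioned for the whole month."""
    return nodes * hourly_rate_per_node * hours


def cost_per_invocation(avg_seconds, memory_gb, price_per_gb_second,
                        price_per_million_requests):
    """Pay-as-you-go cost of one function call: compute time plus request fee."""
    compute = avg_seconds * memory_gb * price_per_gb_second
    request = price_per_million_requests / 1_000_000
    return compute + request


def break_even_invocations(cluster_cost, per_invocation):
    """Monthly invocations at which serverless costs as much as the cluster."""
    return cluster_cost / per_invocation


if __name__ == "__main__":
    # Hypothetical inputs: 3 nodes at 0.20 per hour, functions that run
    # 0.2 s with 0.5 GB at 0.00002 per GB-second and 0.20 per million requests.
    cluster = cluster_monthly_cost(nodes=3, hourly_rate_per_node=0.20)
    per_call = cost_per_invocation(0.2, 0.5, 0.00002, 0.20)
    print(f"cluster per month: {cluster:.2f}")
    print(f"break-even invocations: {break_even_invocations(cluster, per_call):,.0f}")
```

Under these assumed numbers, a workload well below the break-even count favors per-invocation billing, while a sustained, heavy workload favors the cluster.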


Advice on Apache Spark, Serverless

Tim

CTO at Checkly Inc.

Sep 18, 2019

Needs advice on Heroku, AWS Lambda

When adding a new feature to Checkly or rearchitecting some older piece, I tend to pick Heroku for rolling it out. But not always, because sometimes I pick AWS Lambda. The short story:

Developer Experience trumps everything.
AWS Lambda is cheap. Up to a limit though. This impacts not only your wallet.
If you need geographic spread, AWS is lonely at the top.

The setup

Recently, I was doing a brainstorm at a startup here in Berlin on the future of their infrastructure. They were ready to move on from their initial, almost 100% EC2 + Chef based setup. Everything was on the table. But we crossed out a lot quite quickly:

Pure, uncut, self hosted Kubernetes — way too much complexity
Managed Kubernetes in various flavors — still too much complexity
Zeit — Maybe, but no Docker support
Elastic Beanstalk — Maybe, bit old but does the job
Heroku
Lambda

It became clear a mix of PaaS and FaaS was the way to go. What a surprise! That is exactly what I use for Checkly! But when do you pick which model?

I chopped that question up into the following categories:

Developer Experience / DX 🤓
Ops Experience / OX 🐂 (?)
Cost 💵
Lock in 🔐

Read the full post linked below for all details

357k views

Nilesh

Technical Architect at Self Employed

Jul 8, 2020

Needs advice on Elasticsearch, Kafka

We have a Kafka topic containing events of type A and type B. We need to perform an inner join on the two types of events using a common field (the primary key). The joined events must be inserted into Elasticsearch.

Usually, type A and type B events with the same key arrive within 15 minutes of each other. In some cases, however, they may be far apart, say 6 hours. Sometimes an event of one of the types never arrives.

In all cases, we should be able to find joined events immediately after they are joined, and not-joined events within 15 minutes.
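
As an illustration of the requirement only (not a recommended architecture), the timing rules can be sketched as a keyed buffer in plain Python. The class and field names are hypothetical, the Kafka consumer and the Elasticsearch writes are left out, and the sketch assumes a late joined record would overwrite the earlier not-joined document for the same key.

```python
from dataclasses import dataclass, field


@dataclass
class Pending:
    event_type: str         # "A" or "B"
    payload: dict
    arrived_at: float       # arrival time in seconds
    reported: bool = False  # already emitted as not-joined?


@dataclass
class JoinBuffer:
    report_after_s: float = 15 * 60   # surface unmatched events after 15 minutes
    retain_s: float = 6 * 60 * 60     # keep waiting for a late partner up to 6 hours
    pending: dict = field(default_factory=dict)  # key -> Pending

    def on_event(self, key, event_type, payload, now):
        """Return a joined record if the partner is buffered, else buffer the event."""
        other = self.pending.get(key)
        if other is not None and other.event_type != event_type:
            del self.pending[key]
            sides = {other.event_type: other.payload, event_type: payload}
            return {"key": key, "joined": True, "a": sides["A"], "b": sides["B"]}
        if other is None:
            self.pending[key] = Pending(event_type, payload, now)
        # A repeated event of the same type keeps the first one buffered.
        return None

    def sweep(self, now):
        """Emit not-joined records that are due and evict entries past retention."""
        out = []
        for key, p in list(self.pending.items()):
            age = now - p.arrived_at
            if age >= self.report_after_s and not p.reported:
                p.reported = True
                out.append({"key": key, "joined": False,
                            p.event_type.lower(): p.payload})
            if age >= self.retain_s:
                del self.pending[key]
        return out


if __name__ == "__main__":
    buf = JoinBuffer()
    buf.on_event("k1", "A", {"x": 1}, now=0)
    print(buf.on_event("k1", "B", {"y": 2}, now=60))  # partner found, joined
    buf.on_event("k2", "B", {"y": 3}, now=0)
    print(buf.sweep(now=900))                         # k2 reported as not-joined
```

A reported event stays buffered until the retention limit, so a partner that arrives hours later can still produce a joined record.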

576k views

Detailed Comparison

Apache Spark	Serverless
Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.	Build applications comprised of microservices that run in response to events, auto-scale for you, and only charge you when they run. This lowers the total cost of maintaining your apps, enabling you to build more logic, faster. The Framework uses new event-driven compute services, like AWS Lambda, Google CloudFunctions, and more.
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk; Write applications quickly in Java, Scala or Python; Combine SQL, streaming, and complex analytics; Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, S3	-
Statistics
GitHub Stars 42.2K	GitHub Stars 46.9K
GitHub Forks 28.9K	GitHub Forks 5.7K
Stacks 3.1K	Stacks 2.2K
Followers 3.5K	Followers 1.2K
Votes 140	Votes 28
Pros & Cons
Pros (votes): Open-source (61); Fast and Flexible (48); Great for distributed SQL like applications (8); One platform for every big data problem (8); Easy to install and to use (6). Cons (votes): Speed (4)	Pros (votes): API integration (14); Supports cloud functions for Google, Azure, and IBM (7); Lower cost (3); Auto scale (1); Openwhisk (1)
Integrations
No integrations available	Azure Functions, AWS Lambda, Amazon API Gateway

What are some alternatives to Apache Spark, Serverless?

AWS Lambda

AWS Lambda is a compute service that runs your code in response to events and automatically manages the underlying compute resources for you. You can use AWS Lambda to extend other AWS services with custom logic, or create your own back-end services that operate at AWS scale, performance, and security.

Presto

Distributed SQL Query Engine for Big Data

Azure Functions

Azure Functions is an event driven, compute-on-demand experience that extends the existing Azure application platform with capabilities to implement code triggered by events occurring in virtually any Azure or 3rd party service as well as on-premises systems.

Google Cloud Run

A managed compute platform that enables you to run stateless containers that are invocable via HTTP requests. It's serverless by abstracting away all infrastructure management.

Amazon Athena

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

Apache Flink

Apache Flink is an open source system for fast and versatile data analytics in clusters. Flink supports batch and streaming analytics, in one system. Analytical programs can be written in concise and elegant APIs in Java and Scala.

lakeFS

It is an open-source data version control system for data lakes. It provides a “Git for data” platform enabling you to implement best practices from software engineering on your data lake, including branching and merging, CI/CD, and production-like dev/test environments.

Druid

Druid is a distributed, column-oriented, real-time analytics data store that is commonly used to power exploratory dashboards in multi-tenant environments. Druid excels as a data warehousing solution for fast aggregate queries on petabyte sized data sets. Druid supports a variety of flexible filters, exact calculations, approximate algorithms, and other useful calculations.

Google Cloud Functions

Construct applications from bite-sized business logic billed to the nearest 100 milliseconds, only while your code is running

Apache Kylin

Apache Kylin™ is an open source Distributed Analytics Engine designed to provide SQL interface and multi-dimensional analysis (OLAP) on Hadoop/Spark supporting extremely large datasets, originally contributed from eBay Inc.
