
Apache Spark vs SQLite: What are the differences?

Developers describe Apache Spark as a "Fast and general engine for large-scale data processing". Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

SQLite, on the other hand, is described as "A software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine". SQLite is an embedded SQL database engine. Unlike most other SQL databases, SQLite does not have a separate server process. SQLite reads and writes directly to ordinary disk files. A complete SQL database with multiple tables, indices, triggers, and views is contained in a single disk file.

Apache Spark belongs to the "Big Data Tools" category of the tech stack, while SQLite is primarily classified under "Databases".

"Open-source" is the primary reason developers choose Apache Spark over its competitors, whereas "Lightweight" is the key factor cited for picking SQLite.

Apache Spark is an open source tool with 22.3K GitHub stars and 19.3K GitHub forks.

According to the StackShare community, SQLite has broader approval, being mentioned in 313 company stacks and 470 developer stacks, compared to Apache Spark, which is listed in 263 company stacks and 111 developer stacks.

What is Apache Spark?

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

What is SQLite?

SQLite is an embedded SQL database engine. Unlike most other SQL databases, SQLite does not have a separate server process. SQLite reads and writes directly to ordinary disk files. A complete SQL database with multiple tables, indices, triggers, and views, is contained in a single disk file.
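
The single-file claim can be made concrete with a minimal sketch using Python's standard sqlite3 module. The function names, and the table, index, trigger, and view names, are hypothetical and chosen only for illustration. The point is that every object ends up in one ordinary file and no server process is involved.

```python
import os
import sqlite3
import tempfile


def build_demo_database(path):
    """Create two tables, an index, a trigger, and a view inside one SQLite file."""
    conn = sqlite3.connect(path)  # no server process: this simply opens a file
    try:
        conn.executescript(
            """
            CREATE TABLE jobs (id INTEGER PRIMARY KEY, title TEXT NOT NULL, salary INTEGER);
            CREATE TABLE audit (job_id INTEGER, note TEXT);
            CREATE INDEX idx_jobs_salary ON jobs (salary);
            CREATE TRIGGER log_new_job AFTER INSERT ON jobs
            BEGIN
                INSERT INTO audit (job_id, note) VALUES (NEW.id, 'created');
            END;
            CREATE VIEW well_paid AS SELECT title FROM jobs WHERE salary >= 100;
            """
        )
        conn.executemany(
            "INSERT INTO jobs (title, salary) VALUES (?, ?)",
            [("analyst", 90), ("engineer", 120)],
        )
        conn.commit()
    finally:
        conn.close()


def summarize(path):
    """Reopen the file and read back its schema objects and some derived data."""
    conn = sqlite3.connect(path)
    try:
        # sqlite_master is SQLite's built-in catalogue of schema objects.
        objects = conn.execute(
            "SELECT type, name FROM sqlite_master ORDER BY type, name"
        ).fetchall()
        well_paid = [row[0] for row in conn.execute("SELECT title FROM well_paid")]
        audit_rows = conn.execute("SELECT COUNT(*) FROM audit").fetchone()[0]
    finally:
        conn.close()
    return objects, well_paid, audit_rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "demo.db")
        build_demo_database(db_path)
        print(os.listdir(tmp))
        print(summarize(db_path))
```

Calling build_demo_database on a new path and then summarize on the same path should list one index, two tables, one trigger, and one view, all read back from that single file.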

What are some alternatives to Apache Spark and SQLite?
Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Splunk
Splunk Inc. provides the leading platform for Operational Intelligence. Customers use Splunk to search, monitor, analyze and visualize machine data.
Cassandra
Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent manner. Cassandra will automatically repartition as machines are added and removed from the cluster. Row store means that, like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL.
Apache Beam
It implements batch and streaming data processing jobs that run on any execution engine. It executes pipelines on multiple execution environments.
Apache Flume
It is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
Decisions about Apache Spark and SQLite
StackShare Editors
Presto
Apache Spark
Hadoop

Around 2015, the growing use of Uber’s data exposed limitations in the ETL and Vertica-centric setup, not to mention the increasing costs. “As our company grew, scaling our data warehouse became increasingly expensive. To cut down on costs, we started deleting older, obsolete data to free up space for new data.”

To overcome these challenges, Uber rebuilt their big data platform around Hadoop. “More specifically, we introduced a Hadoop data lake where all raw data was ingested from different online data stores only once and with no transformation during ingestion.”

“In order for users to access data in Hadoop, we introduced Presto to enable interactive ad hoc user queries, Apache Spark to facilitate programmatic access to raw data (in both SQL and non-SQL formats), and Apache Hive to serve as the workhorse for extremely large queries.”

StackShare Editors
Presto
Apache Spark
Hadoop

To improve platform scalability and efficiency, Uber transitioned from JSON to Parquet, and built a central schema service to manage schemas and integrate different client libraries.

While the first generation big data platform was vulnerable to upstream data format changes, “ad hoc data ingestions jobs were replaced with a standard platform to transfer all source data in its original, nested format into the Hadoop data lake.”

These platform changes helped Uber address the scaling challenges it was facing around that time: “On a daily basis, there were tens of terabytes of new data added to our data lake, and our Big Data platform grew to over 10,000 vcores with over 100,000 running batch jobs on any given day.”

StackShare Editors
Presto
Apache Spark
Scala
MySQL
Kafka

Slack’s data team works to “provide an ecosystem to help people in the company quickly and easily answer questions about usage, so they can make better and data informed decisions.” To achieve that goal, they rely on a complex data pipeline.

An in-house tool called Sqooper scrapes MySQL backups and pipes them to S3. Job queue and log data are sent to Kafka and then persisted to S3 using an open source tool called Secor, which was created by Pinterest.

For compute, Amazon’s Elastic MapReduce (EMR) creates clusters preconfigured for Presto, Hive, and Spark.

Presto is then used for ad-hoc questions, validating data assumptions, exploring smaller datasets, and creating visualizations for some internal tools. Hive is used for larger data sets or longer time series data, and Spark allows teams to write efficient and robust batch and aggregation jobs. Most of the Spark pipeline is written in Scala.

Thrift binds all of these engines together with a typed schema and structured data.

Finally, the Hive Metastore serves as the ground truth for all data and its schema.

StackShare Editors
Apache Thrift
Kotlin
Presto
HHVM (HipHop Virtual Machine)
gRPC
Kubernetes
Apache Spark
Airflow
Terraform
Hadoop
Swift
Hack
Memcached
Consul
Chef
Prometheus

Since the beginning, Cal Henderson has been the CTO of Slack. Earlier this year, he commented on a Quora question summarizing their current stack.

Apps
  • Web: a mix of JavaScript/ES6 and React.
  • Desktop: Electron, used to ship the web app as a desktop application.
  • Android: a mix of Java and Kotlin.
  • iOS: a mix of Objective-C and Swift.
Backend
  • The core application and the API are written in PHP/Hack and run on HHVM.
  • The data is stored in MySQL using Vitess.
  • Caching is done using Memcached and MCRouter.
  • The search service uses SolrCloud, with various Java services.
  • The messaging system uses WebSockets with many services in Java and Go.
  • Load balancing is done using HAProxy with Consul for configuration.
  • Most services talk to each other over gRPC, with some Thrift and JSON-over-HTTP.
  • The voice and video calling service was built in Elixir.
Data warehouse
  • Built using open source tools including Presto, Spark, Airflow, Hadoop and Kafka.

Eric Colson
Chief Algorithms Officer at Stitch Fix · 19 upvotes · 209.4K views
Amazon EC2 Container Service
Docker
PyTorch
R
Python
Presto
Apache Spark
Amazon S3
PostgreSQL
Kafka
#Data #DataStack #DataScience #ML #Etl #AWS

The algorithms and data infrastructure at Stitch Fix is housed in #AWS. Data acquisition is split between events flowing through Kafka and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on YARN is our tool of choice for data movement and #ETL. Because our storage layer (S3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling YARN clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for ad hoc queries and dashboards.

Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).

At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated into our systems. That requires a serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize the models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn), by automatically packaging them as Docker containers and deploying to Amazon ECS. This provides our data scientists a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.

Daniel Quinn
Senior Developer at Founders4Schools · 2 upvotes · 11.5K views
at The Paperless Project
PostgreSQL
SQLite

SQLite is a tricky beast. It's great if you're working single-threaded, but a Terrible Idea if you've got more than one concurrent connection. You use it because it's easy to set up, light, and portable (it's just a file).

In Paperless, we've built a self-hosted web application, so it makes sense to standardise on something small & light, and as we don't have to worry about multiple connections (it's just you using the app), it's a perfect fit.

For users wanting to scale Paperless up to a multi-user environment, though, we do provide the hooks to switch to PostgreSQL.
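
One concrete form of the concurrency caveat is write contention: SQLite allows many readers but only one writer per database file at a time. The following is a minimal sketch of that behaviour using Python's standard sqlite3 module. The function name, the notes table, and the zero timeout are assumptions made for illustration and are not taken from Paperless.

```python
import os
import sqlite3
import tempfile


def second_writer_is_blocked(path):
    """Show that a second connection cannot start a write while the first holds one.

    Returns (blocked, row_count) where blocked is True if the second writer
    was refused while the first transaction was still open.
    """
    # isolation_level=None puts each connection in autocommit mode, so the
    # BEGIN/COMMIT statements below are issued explicitly.
    first = sqlite3.connect(path, isolation_level=None)
    # timeout=0 makes the second connection fail at once instead of waiting.
    second = sqlite3.connect(path, timeout=0, isolation_level=None)
    try:
        first.execute("CREATE TABLE IF NOT EXISTS notes (body TEXT)")
        first.execute("BEGIN IMMEDIATE")  # first connection takes the write lock
        first.execute("INSERT INTO notes VALUES ('from first')")
        try:
            second.execute("BEGIN IMMEDIATE")  # needs the same write lock
        except sqlite3.OperationalError:
            blocked = True  # typically reported as "database is locked"
        else:
            blocked = False
            second.execute("ROLLBACK")
        first.execute("COMMIT")  # releases the write lock
        # Once the first writer has committed, the second can write normally.
        second.execute("INSERT INTO notes VALUES ('from second')")
        row_count = second.execute("SELECT COUNT(*) FROM notes").fetchone()[0]
        return blocked, row_count
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(second_writer_is_blocked(os.path.join(tmp, "app.db")))
```

With a non-zero timeout the second writer would wait for the lock rather than fail immediately, which is often enough for a single-user application.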

How developers use Apache Spark and SQLite
Romans Malinovskis uses SQLite

We build queries in PHP with DSQL that work with SQLite. We also have a SQLite data controller, so that you can build SQLite-based models.

Coolfront Technologies uses SQLite

Used during the "build process" of Coolfront Mobile's Flat rate search engine database. Flat rate data that resides in Salesforce is transformed using SQLite into a format that is usable for our mobile Flat rate search engine (AKA: Charlie).

Sripathi Krishnan uses SQLite

RDBTools is a self-hosted application, and it is important that the installation process is simple. With SQLite, we create a new database file for every analysis. Once the analysis is done, the SQLite file can be thrown away easily.
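
The file-per-analysis pattern described here can be sketched with Python's standard sqlite3 and tempfile modules. This is an illustration under assumptions, not RDBTools code: the run_analysis function, the samples table, and the aggregate query are all hypothetical.

```python
import os
import sqlite3
import tempfile


def run_analysis(values):
    """Load values into a fresh SQLite file, aggregate them, then delete the file."""
    # A new, uniquely named database file is created for this analysis only.
    fd, path = tempfile.mkstemp(suffix=".sqlite")
    os.close(fd)  # sqlite3 opens the path itself; an empty file is a valid new database
    try:
        conn = sqlite3.connect(path)
        try:
            conn.execute("CREATE TABLE samples (value REAL)")
            conn.executemany(
                "INSERT INTO samples VALUES (?)", [(v,) for v in values]
            )
            conn.commit()
            count, total = conn.execute(
                "SELECT COUNT(*), SUM(value) FROM samples"
            ).fetchone()
        finally:
            conn.close()
    finally:
        os.remove(path)  # throwing the analysis away is just deleting one file
    return path, count, total


if __name__ == "__main__":
    print(run_analysis([1.0, 2.0, 3.5]))
```

Because the whole database lives in one file, cleanup needs no DROP statements or server-side housekeeping.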

Perljobs.Ru uses SQLite

All the dynamic data (i.e., jobs) is stored in a simple SQLite database.

A. M. Douglas uses SQLite

There's really no call for something heavier for this site. SQLite is simple, easy to use and quite reliable given its age.

Wei Chen uses Apache Spark

Spark is good at managing parallel data processing. We wrote a neat program to handle the terabytes of data we get every day.

Ralic Lo uses Apache Spark

Used the Spark DataFrame API through SparkR for big data analysis.

BrainFinance uses Apache Spark

Used as part of a big data machine learning stack (SMACK).

Kalibrr uses Apache Spark

We use Apache Spark in computing our recommendations.

Dotmetrics uses Apache Spark

Big data analytics and nightly transformation jobs.
