PostgreSQL vs Apache Spark



PostgreSQL vs Apache Spark: What are the differences?

Developers describe PostgreSQL as "A powerful, open source object-relational database system". PostgreSQL is an advanced object-relational database management system that supports an extended subset of the SQL standard, including transactions, foreign keys, subqueries, triggers, user-defined types and functions. On the other hand, Apache Spark is detailed as "Fast and general engine for large-scale data processing". Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

PostgreSQL can be classified as a tool in the "Databases" category, while Apache Spark is grouped under "Big Data Tools".

"Relational database" is the primary reason why developers consider PostgreSQL over the competitors, whereas "Open-source" was stated as the key factor in picking Apache Spark.

PostgreSQL and Apache Spark are both open source tools. Apache Spark with 22.3K GitHub stars and 19.3K forks on GitHub appears to be more popular than PostgreSQL with 5.38K GitHub stars and 1.79K GitHub forks.

reddit, Instacart, and StackShare are some of the popular companies that use PostgreSQL, whereas Apache Spark is used by Slack, Shopify, and SendGrid. PostgreSQL has broader approval, being mentioned in 2701 company stacks & 2097 developer stacks, compared to Apache Spark, which is listed in 263 company stacks and 111 developer stacks.

What is PostgreSQL?

PostgreSQL is an advanced object-relational database management system that supports an extended subset of the SQL standard, including transactions, foreign keys, subqueries, triggers, user-defined types and functions.

What is Apache Spark?

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.
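To make that concrete, here is a minimal sketch of a Spark batch job plus an interactive-style SQL query in Python (PySpark); the input file and column names are hypothetical:

```python
# Minimal PySpark sketch: batch processing plus an interactive-style query.
# Assumes `pyspark` is installed; the input file and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

# Batch: read a dataset (could equally live in HDFS or S3).
df = spark.read.json("events.json")  # hypothetical input file

# Interactive-style query via Spark SQL.
df.createOrReplaceTempView("events")
top = spark.sql("""
    SELECT user_id, COUNT(*) AS n
    FROM events
    GROUP BY user_id
    ORDER BY n DESC
    LIMIT 10
""")
top.show()

spark.stop()
```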


What are some alternatives to PostgreSQL and Apache Spark?
MySQL
The MySQL software delivers a very fast, multi-threaded, multi-user, and robust SQL (Structured Query Language) database server. MySQL Server is intended for mission-critical, heavy-load production systems as well as for embedding into mass-deployed software.
MariaDB
Started by core members of the original MySQL team, MariaDB actively works with outside developers to deliver the most featureful, stable, and sanely licensed open SQL server in the industry. MariaDB is designed as a drop-in replacement of MySQL(R) with more features, new storage engines, fewer bugs, and better performance.
Oracle
Oracle Database is an RDBMS. An RDBMS that implements object-oriented features such as user-defined types, inheritance, and polymorphism is called an object-relational database management system (ORDBMS). Oracle Database has extended the relational model to an object-relational model, making it possible to store complex business models in a relational database.
MongoDB
MongoDB stores data in JSON-like documents that can vary in structure, offering a dynamic, flexible schema. MongoDB was also designed for high availability and scalability, with built-in replication and auto-sharding.
SQLite
SQLite is an embedded SQL database engine. Unlike most other SQL databases, SQLite does not have a separate server process. SQLite reads and writes directly to ordinary disk files. A complete SQL database with multiple tables, indices, triggers, and views, is contained in a single disk file.
Decisions about PostgreSQL and Apache Spark
Anton Sidelnikov, Backend Developer at Beamery · 5 upvotes · 9K views
MongoDB · PostgreSQL

In my opinion PostgreSQL wins outright over MongoDB: it not only handles structured data with SQL and strict types, but also has excellent support for unstructured data as a separate data type (you can store arbitrary JSON documents, and they remain queryable, depending on which format you choose). Both writes and reads are much faster than in Mongo. So you get the best of document NoSQL and SQL in a single database.
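For illustration, a minimal sketch of that queryable-JSON point using psycopg2 against a local PostgreSQL instance — the table, field, and connection details are hypothetical:

```python
# Sketch: storing and querying JSON documents in PostgreSQL. Assumes
# psycopg2 is installed and a local database is reachable; the DSN,
# table, and fields are hypothetical.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=test user=postgres")  # hypothetical DSN
cur = conn.cursor()

# JSONB is the binary, indexable JSON type referred to above.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        id      serial PRIMARY KEY,
        payload jsonb NOT NULL
    )
""")
cur.execute("INSERT INTO events (payload) VALUES (%s)",
            (Json({"type": "signup", "plan": "pro"}),))

# Documents stay queryable: ->> extracts a field as text.
cur.execute("SELECT id FROM events WHERE payload->>'type' = %s", ("signup",))
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()
```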

The formal downside of PostgreSQL is clustering scalability; there is no simple way to build a distributed cluster. However, two points:

1) Thanks to PG's efficiency, you will have much more time before you actually need to scale. And if you follow a database-per-service pattern, you may never need to, since handling a few billion records on a single machine is an option for PG.

2) When you do need to scale, you can do it the way you need, including as part of the app's logic (e.g. sharding by key, or a PG-based clustering solution with a strict model; see the sketch below). Scalability will be very transparent, much more predictable than Mongo's "the cluster just works (until it fails)" replication.
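As a sketch of what "sharding by key" in the app's logic might look like — the shard DSNs and hash choice here are assumptions, not a prescription:

```python
# Minimal application-level sharding by key: each tenant's rows live in
# one of several PostgreSQL instances. DSNs and shard count are hypothetical.
import hashlib

SHARD_DSNS = [
    "dbname=app_shard0 host=pg0",  # hypothetical
    "dbname=app_shard1 host=pg1",
    "dbname=app_shard2 host=pg2",
]

def shard_for(tenant_id: str) -> str:
    """Map a tenant id to a shard deterministically via a stable hash."""
    digest = hashlib.sha1(tenant_id.encode()).digest()
    return SHARD_DSNS[digest[0] % len(SHARD_DSNS)]

# Usage: the app opens a connection to shard_for(tenant) and runs normal
# single-node SQL against it.
print(shard_for("acme-corp"))
```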

John Kodumal, CTO at LaunchDarkly · 15 upvotes · 150.9K views
Kafka · Amazon Kinesis · Redis · Amazon EC2 · Amazon ElastiCache · Consul · Patroni · TimescaleDB · PostgreSQL · Amazon RDS

As we've evolved or added additional infrastructure to our stack, we've biased towards managed services. Most new backing stores are Amazon RDS instances now. We do use self-managed PostgreSQL with TimescaleDB for time-series data—this is made HA with the use of Patroni and Consul.

We also use managed Amazon ElastiCache instances instead of spinning up Amazon EC2 instances to run Redis workloads, as well as shifting to Amazon Kinesis instead of Kafka.

Joshua Dean Küpper, CEO at Scrayos UG (haftungsbeschränkt) · 5 upvotes · 40.1K views
Sentry · GitLab · PostgreSQL · MariaDB

We primarily use MariaDB, but use PostgreSQL as part of GitLab, Sentry, and @Nextcloud, which (initially) forced us to use it anyway. While this wasn't much of a decision – because we didn't have one (ha ha) – we learned to love the perks and advantages of PostgreSQL anyway. PostgreSQL's extension system makes it even more flexible than a lot of the other SQL-based DBs (which only offer stored procedures), and the additional JOIN options, the enhanced role management, and the different authentication options came in really handy when doing manual maintenance on the databases.

Alex A, Founder at PRIZ Guru · 6 upvotes · 8.4K views
PostgreSQL · MySQL

One of our battles at the very beginning of the road was choosing the right database. In fact, our first prototype was built on MySQL, and back then nothing else was even under consideration (don't ask me why). At some point I was working on a project which was running on PostgreSQL, and only then did I understand its full power. We have over a billion records in our production instance, and we are able to optimize it to run fast and reliably. Well, now my default DB is PostgreSQL :)

Tim Nolet, Founder, Engineer & Dishwasher at Checkly · 8 upvotes · 61.9K views
Amazon DynamoDB · MongoDB · Node.js · Heroku · PostgreSQL

When I started building Checkly, one of the first things on the agenda was how to actually structure our SaaS database model: think accounts, users, subscriptions, etc. Weirdly, there is not a lot of information on this in the "blogosphere" (cringe...). After research and some false starts with MongoDB and Amazon DynamoDB, we ended up with PostgreSQL and a schema consisting of just four tables that form the backbone of all the generic "SaaSy" stuff almost any B2B SaaS bumps into.

In a nutshell:

  • We use Postgres on Heroku.
  • We use a "one database, one schema" approach for partitioning customer data.
  • We use accounts, memberships, and users tables to create a many-to-many relation between users and accounts (sketched below).
  • We completely decouple prices, payments, and the exact ingredients for a customer's plan.

All the details including a database schema diagram are in the linked blog post.
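The authoritative schema is in the linked post; the following is a hypothetical reconstruction of the accounts/memberships/users relation (plus a decoupled plans table), just to make the shape concrete:

```python
# Hypothetical reconstruction of the many-to-many described above, applied
# to PostgreSQL via psycopg2; table and column names are assumptions.
import psycopg2

DDL = """
CREATE TABLE accounts (
    id   serial PRIMARY KEY,
    name text NOT NULL
);
CREATE TABLE users (
    id    serial PRIMARY KEY,
    email text UNIQUE NOT NULL
);
-- The join table that makes users-to-accounts many-to-many.
CREATE TABLE memberships (
    account_id int REFERENCES accounts (id),
    user_id    int REFERENCES users (id),
    role       text NOT NULL DEFAULT 'member',
    PRIMARY KEY (account_id, user_id)
);
-- Plans/prices kept separate so billing can change independently.
CREATE TABLE plans (
    id         serial PRIMARY KEY,
    account_id int REFERENCES accounts (id),
    price_usd  numeric NOT NULL
);
"""

with psycopg2.connect("dbname=saas user=postgres") as conn:  # hypothetical DSN
    conn.cursor().execute(DDL)
```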

Łukasz Korecki, CTO & Co-founder at EnjoyHQ · 12 upvotes · 40.5K views
PostgreSQL · MongoDB · RethinkDB

We initially chose RethinkDB because of its schema-less document store features and a better durability/resilience story than MongoDB. In the end, it didn't work out quite as we expected: there are plenty of scalability issues, it's near impossible to run analytical workloads, and the small community makes working with Rethink a challenge. We're in the process of migrating all our workloads to PostgreSQL, and hopefully we will be able to decommission our RethinkDB deployment soon.

Eric Colson, Chief Algorithms Officer at Stitch Fix · 19 upvotes · 291.9K views
Amazon EC2 Container Service · Docker · PyTorch · R · Python · Presto · Apache Spark · Amazon S3 · PostgreSQL · Kafka
#AWS #ETL #ML #DataScience #DataStack #Data

The algorithms and data infrastructure at Stitch Fix is housed in #AWS. Data acquisition is split between events flowing through Kafka and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3-based data warehouse. Apache Spark on YARN is our tool of choice for data movement and #ETL. Because our storage layer (S3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling YARN clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for ad-hoc queries and dashboards.
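As a sketch of that decoupled storage/compute pattern (not Stitch Fix's actual jobs), here is a PySpark job that reads from S3, aggregates, and writes back to S3 — bucket paths and columns are hypothetical:

```python
# Sketch of the decoupled-storage ETL pattern: read from S3, transform,
# write back to S3, so the cluster itself holds no state. Assumes pyspark
# plus S3 connectivity (hadoop-aws); buckets and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl").getOrCreate()

orders = spark.read.parquet("s3a://example-warehouse/orders/")  # hypothetical

# Aggregate order amounts into daily revenue.
daily = (orders
         .groupBy(F.to_date("created_at").alias("day"))
         .agg(F.sum("amount").alias("revenue")))

daily.write.mode("overwrite").parquet("s3a://example-warehouse/daily_revenue/")
spark.stop()
```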

Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).

At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated into systems. That requires a serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn) by automatically packaging them as Docker containers and deploying them to Amazon ECS. This gives our data scientists a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.


Mauro Bennici, CTO at You Are My GUide · 7 upvotes · 10.8K views
MongoDB · TimescaleDB · PostgreSQL

PostgreSQL plus TimescaleDB allows us to concentrate our business effort on how to analyze valuable data instead of managing it on the IT side. We are now able to ingest thousands of social-share records without compromising the scalability of the system or query time. TimescaleDB is transparent to PostgreSQL, so we continue to use the same SQL syntax without any changes. At the same time, because we only need to manage a few document objects, we dismissed the MongoDB cluster.
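To illustrate the "transparent to PostgreSQL" point, a minimal sketch assuming the timescaledb extension is installed; the table and connection details are hypothetical:

```python
# Sketch: after one create_hypertable() call, reads and writes are plain
# SQL. Assumes the timescaledb extension is installed; DSN, table, and
# columns are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=metrics user=postgres")  # hypothetical DSN
cur = conn.cursor()

cur.execute("""
    CREATE TABLE shares (
        time    timestamptz NOT NULL,
        network text,
        count   int
    )
""")
# The only TimescaleDB-specific step; everything after is standard SQL.
cur.execute("SELECT create_hypertable('shares', 'time')")

cur.execute("INSERT INTO shares VALUES (now(), 'twitter', 42)")
cur.execute("SELECT network, sum(count) FROM shares GROUP BY network")
print(cur.fetchall())
conn.commit()
```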

Tor Hagemann, at Socotra · 2 upvotes · 2.2K views
Amazon DynamoDB · PostgreSQL · MySQL

Much of our data model is relational, which makes MySQL or PostgreSQL (and family) fit the APIs we need to build in order to meet the needs of our customers.

Sometimes the flexibility of a NoSQL store like Amazon DynamoDB is very useful, but the lack of consistency really impacts usability and performance long-term, compared with viable alternatives. At our current scale, we've seen huge benefits from moving some of our tables out of Dynamo and doing more in SQL.

There will always be use cases for NoSQL and key-value stores, but if your model is well understood in your business/industry, relational databases are the way to go after finding product-market fit. Always understand the trade-offs (and a few intimate details) of any data store before you add it to your company's stack!

Joseph Irving, DevOps Engineer at uSwitch · 8 upvotes · 7.5K views
Go · PostgreSQL · MySQL · Kubernetes · Vault

At uSwitch we use Vault to generate short lived database credentials for our applications running in Kubernetes. We wanted to move from an environment where we had 100 dbs with a variety of static passwords being shared around to a place where each pod would have credentials that only last for its lifetime.

We chose Vault because:

  • It had built-in Kubernetes support, so we could use service accounts to control which pods could access which database (a rough sketch of the credential flow follows this list).

  • It has a Terraform provider, so we could configure both our RDS instances and their Vault configuration in one place.

  • It supports a variety of database providers, including MySQL/PostgreSQL (our most common DBs).

  • It has a good API and Go SDK, so we could build tooling around it to simplify the development workflow.

  • It had other features we would utilise, such as PKI.
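A rough sketch of that credential flow using the hvac Python client — the Vault address, token, and role name here are assumptions, not uSwitch's actual configuration:

```python
# Sketch: fetch short-lived database credentials from Vault's database
# secrets engine. Assumes `hvac` is installed and a database role exists;
# the URL, token, and role name are hypothetical.
import hvac

client = hvac.Client(url="http://vault.example:8200", token="s.example")

# Vault mints a fresh username/password pair whose lifetime is tied to a
# lease, instead of a static password shared between applications.
resp = client.secrets.database.generate_credentials(name="app-readonly")
username = resp["data"]["username"]
password = resp["data"]["password"]

# The pod uses these until its lease expires; nothing long-lived to leak.
print(username)
```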

Daniel Quinn, Senior Developer at Workfinder (The Paperless Project) · 2 upvotes · 24.6K views
PostgreSQL · SQLite

SQLite is a tricky beast. It's great if you're working single-threaded, but a terrible idea if you've got more than one concurrent connection. You use it because it's easy to set up, light, and portable (it's just a file).

In Paperless, we've built a self-hosted web application, so it makes sense to standardise on something small & light, and as we don't have to worry about multiple connections (it's just you using the app), it's a perfect fit.

For users wanting to scale Paperless up to a multi-user environment, though, we do provide the hooks to switch to PostgreSQL.
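Paperless is built on Django, so those hooks boil down to pointing Django's DATABASES setting at PostgreSQL instead of the default SQLite file — a sketch with hypothetical names and credentials:

```python
# Django settings sketch: swap the default SQLite file for PostgreSQL.
# Database name, user, and password here are hypothetical placeholders.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "paperless",
        "USER": "paperless",
        "PASSWORD": "change-me",
        "HOST": "localhost",
        "PORT": "5432",
    }
}
```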

Robert Zuber, CTO at CircleCI · 22 upvotes · 175.8K views
Amazon S3 · GitHub · Redis · PostgreSQL · MongoDB

We use MongoDB as our primary #datastore. Mongo's approach to replica sets enables some fantastic patterns for operations like maintenance, backups, and #ETL.

As we pull #microservices from our #monolith, we are taking the opportunity to build them with their own datastores using PostgreSQL. We also use Redis to cache data we’d never store permanently, and to rate-limit our requests to partners’ APIs (like GitHub).
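As an illustration of that rate-limiting use, here is a minimal fixed-window counter in Redis — one possible scheme among several, and not CircleCI's actual implementation; key naming and the limit are assumptions:

```python
# Sketch: fixed-window rate limiter in Redis guarding a partner API.
# Assumes `redis` (redis-py) is installed and a local instance is running.
import time
import redis

r = redis.Redis()  # hypothetical local instance

def allow_request(partner: str, limit: int = 60) -> bool:
    """Allow at most `limit` calls per partner per minute."""
    window = int(time.time() // 60)
    key = f"ratelimit:{partner}:{window}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, 60)  # let each window's counter expire on its own
    return count <= limit

if allow_request("github"):
    pass  # safe to call the partner API
```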

When we’re dealing with large blobs of immutable data (logs, artifacts, and test results), we store them in Amazon S3. We handle any side-effects of S3’s eventual consistency model within our own code. This ensures that we deal with user requests correctly while writes are in process.

Martin Johannesson, Senior Software Developer at IT Minds · 10 upvotes · 15.7K views
AMP · PWA · React · MongoDB · Next.js · GraphQL · Apollo · PostgreSQL · TypeORM · Node.js · TypeScript
#B2B #Backend #Serverless

At IT Minds we create customized internal or #B2B web and mobile apps. I have a go-to stack that I pitch to our customers, consisting of 3 core areas: 1) a data core #backend; 2) a micro #serverless #backend; 3) a user client #frontend.

For the data core, I create a backend using TypeScript and Node.js, with TypeORM connecting to PostgreSQL, exposing an action-based API with Apollo GraphQL.

The micro serverless backend's purpose is verification for authentication, authorization, logins, and the like. It is created with Next.js API pages, using MongoDB to store essential information, caching, etc.

Finally, the frontend is built with React using Next.js, TypeScript, and @Apollo. We create the frontend as a PWA and have an AMP landing page by default.

Jelena Dedovic
MSSQL · PostgreSQL · AIOHTTP · asyncio · Tornado

Investigating Tortoise ORM and GINO ORM...

I need to introduce an async ORM into the current stack: Tornado with an asyncio loop, AIOHTTP, with PostgreSQL and MSSQL. This project revolves heavily around realtime, and due to the realtime requirements, blocking during database access is not acceptable.

Considering that these ORMs are both young projects, I wondered if anybody has experience with a similar stack and these async ORMs?
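Not a verdict on Tortoise vs GINO, but a sketch of the underlying requirement — non-blocking PostgreSQL access on an asyncio loop — using asyncpg, which GINO builds on; the DSN and query are hypothetical:

```python
# Sketch: non-blocking PostgreSQL access on an asyncio loop via asyncpg.
# Assumes `asyncpg` is installed; DSN, table, and columns are hypothetical.
import asyncio
import asyncpg

async def main():
    conn = await asyncpg.connect("postgresql://app@localhost/realtime")
    # The await yields control to the event loop instead of blocking it.
    rows = await conn.fetch("SELECT id, state FROM sessions WHERE active")
    print(rows)
    await conn.close()

asyncio.run(main())
```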

Nicolas Apx, CEO - FullStack Javascript at APX Development Limited · 14 upvotes · 18.5K views
PostgreSQL · MongoDB · Node.js · Python

I am planning on building a microservice eCommerce back-end that will be easy to reuse in any project as we need. I would like to use both Python and Node.js, and MongoDB & PostgreSQL. In your opinion, which would be best suited for the following services:

  • Users-service
  • Products-service
  • Auth-service
  • Inventory-service
  • Order-service
  • Payment-service
  • Sku-service
  • And more not yet defined....

Thanks

Nicolas

How developers use PostgreSQL and Apache Spark
AngeloR uses PostgreSQL

We use PostgreSQL for the merge between SQL/NoSQL. A lot of our data is unstructured JSON, or JSON that is currently in flux due to some MVP/iteration processes that are going on. PostgreSQL gives us the capability to do this.

At the moment, PostgreSQL on Amazon is only at 9.5, which is one minor version short of support for document fragment updates, something we are waiting for. However, that may be some ways away.

Other than that, we are using PostgreSQL as our main SQL store as a replacement for all the MSSQL databases that we have. Not only does it have great support through RDS (we have a small ops team), but it also gives us some great ways to migrate off RDS to managed EC2 instances down the line if we need to.

Cloudcraft uses PostgreSQL

PostgreSQL combines the best aspects of traditional SQL databases, such as reliability, consistent performance, transactions, and querying power, with the flexibility of schemaless NoSQL systems that are all the rage these days. Through the powerful JSON column types and indexes, you can now have your cake and eat it too! PostgreSQL may seem a bit arcane and old-fashioned at first, but the developers have clearly shown that they understand databases and storage trends better than almost anyone else. It definitely deserves to be part of everyone's toolbox; when you find yourself needing rock-solid performance, operational simplicity, and reliability, reach for PostgreSQL.

Brandon Adams uses PostgreSQL

Relational data stores solve a lot of problems reasonably well. Postgres has some data types that are really handy, such as spatial, JSON, and a plethora of useful date and integer types. It has good availability of indexing solutions, and is well supported for both custom modifications and hosting options (I like Amazon's Postgres for RDS). I use HoneySQL for Clojure as a composable AST that translates reliably to SQL. I typically use JDBC on Clojure, usually via org.clojure/java.jdbc.

ReviewTrackers uses PostgreSQL

PostgreSQL is responsible for nearly all data storage, validation and integrity. We leverage constraints, functions and custom extensions to ensure we have only one source of truth for our data access rules and that those rules live as close to the data as possible. Call us crazy, but ORMs only lead to ruin and despair.

Jeff Flynn uses PostgreSQL

Tried MongoDB - early euphoria - later dread. Tried MySQL - not bad at all. Found PostgreSQL - will never go back. There is so much support for it that it should be your first choice. Simple local (free) installation, and one-click setup in Heroku - lots of options in terms of pricing/performance combinations.

Wei Chen uses Apache Spark

Spark is good at managing parallel data processing. We wrote a neat program to handle the TBs of data we get every day.

Ralic Lo uses Apache Spark

Used the Spark DataFrame API on SparkR for big data analysis.

BrainFinance uses Apache Spark

As a part of big data machine learning stack (SMACK).

Kalibrr uses Apache Spark

We use Apache Spark in computing our recommendations.

Dotmetrics uses Apache Spark

Big data analytics and nightly transformation jobs.
