Alternatives to Snowflake

MySQL, Cassandra, Databricks, Hadoop, and Oracle are the most popular alternatives and competitors to Snowflake.

Stacks: 1.1K · Followers: 1.2K · Votes: 27

What is Snowflake and what are its top alternatives?

Snowflake eliminates the administration and management demands of traditional data warehouses and big data platforms. Snowflake is a true data warehouse as a service running on Amazon Web Services (AWS)—no infrastructure to manage and no knobs to turn.

Snowflake is a tool in the Big Data as a Service category of a tech stack.

Top Alternatives to Snowflake

MySQL
The MySQL software delivers a very fast, multi-threaded, multi-user, and robust SQL (Structured Query Language) database server. MySQL Server is intended for mission-critical, heavy-load production systems as well as for embedding into mass-deployed software. ...
Cassandra
Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent manner. Cassandra will automatically repartition as machines are added and removed from the cluster. Row store means that, like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL. ...
Databricks
Databricks Unified Analytics Platform, from the original creators of Apache Spark™, unifies data science and engineering across the Machine Learning lifecycle from data preparation to experimentation and deployment of ML applications. ...
Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. ...
Oracle
Oracle Database is an RDBMS. An RDBMS that implements object-oriented features such as user-defined types, inheritance, and polymorphism is called an object-relational database management system (ORDBMS). Oracle Database has extended the relational model to an object-relational model, making it possible to store complex business models in a relational database. ...
PostgreSQL
PostgreSQL is an advanced object-relational database management system that supports an extended subset of the SQL standard, including transactions, foreign keys, subqueries, triggers, user-defined types and functions. ...
MongoDB
MongoDB stores data in JSON-like documents that can vary in structure, offering a dynamic, flexible schema. MongoDB was also designed for high availability and scalability, with built-in replication and auto-sharding. ...
Redis
Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache, and message broker. Redis provides data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes, and streams. ...

Snowflake alternatives & related posts

MySQL

The world's most popular open source database

Stacks: 126.8K · Votes: 3.8K

PROS OF MYSQL

SQL: 800
Free: 679
Easy: 562
Widely used: 528
Open source: 490
High availability: 180
Cross-platform support: 160
Great community: 104
Secure: 79
Full-text indexing and searching: 75
Fast, open, available: 26
Reliable: 16
SSL support: 16
Robust: 15
Enterprise Version: 9
Easy to set up on all platforms: 7
NoSQL access to JSON data type: 3
Relational database: 1
Easy, light, scalable: 1
Sequel Pro (best SQL GUI): 1
Replica Support: 1

CONS OF MYSQL

Owned by a company with their own agenda: 16
Can't roll back schema changes: 3

related MySQL posts

Nick Rockwell

SVP, Engineering at Fastly · Sep 24, 2018 | 46 upvotes · 4.3M views

Shared insights on PHP, React, Apollo, GraphQL (+3 more) at The New York Times

When I joined NYT there was already broad dissatisfaction with the LAMP (Linux, Apache HTTP Server, MySQL, PHP) stack and the front-end framework in particular. So I wasn't passing judgment on it. I mean, LAMP's fine, you can do good work in LAMP. It's a little dated at this point, but it's not ... I didn't want to rip it out for its own sake, but everyone else was like, "We don't like this, it's really inflexible." And I remember, from being outside the company, when that was called NYT5, when it had launched. I had been observing it from the outside, and I was like, you guys took so long to do that and you did it so carefully, and yet you're not happy with your decisions. Why is that? That was more the impetus. If we're going to do this again, how are we going to do it in a way that we're gonna get a better result?

So we're moving quickly away from LAMP, I would say. So, right now, the new front end is React based and using Apollo. And we've been in a long, protracted, gradual rollout of the core experiences.

React is now talking to GraphQL as a primary API. There's a Node.js back end, to the front end, which is mainly for server-side rendering, as well.

Behind there, the main repository for the GraphQL server is a big table repository, that we call Bodega because it's a convenience store. And that reads off of a Kafka pipeline.

The Evolution of The New York Times Tech Stack | StackShare

Abdussamad ARHUN

Mar 2, 2024 | 27 upvotes · 151.5K views

Shared insights on PostgreSQL, ExpressJS, Next.js, PHP

Hello, I am building a website for a school that's used by students to find Zoom meeting links, view their marks, and check course materials. It is also used by the teachers to put the meeting links, students' marks, and course materials.

I created a similar website using HTML, CSS, PHP, and MySQL. Now I want to implement this project using some frameworks (Next.js and ExpressJS) and use PostgreSQL instead of MySQL.

I want to have some advice on whether these are enough to implement my project.

Cassandra

A partitioned row store. Rows are organized into tables with a required primary key.

Stacks: 3.6K · Votes: 507

PROS OF CASSANDRA

Distributed: 119
High performance: 98
High availability: 81
Easy scalability: 74
Replication: 53
Reliable: 26
Multi datacenter deployments: 26
Schema optional: 10
OLTP: 9
Open source: 8
Workload separation (via MDC): 2
Fast: 1

CONS OF CASSANDRA

Reliability of replication: 3
Size: 1
Updates: 1

related Cassandra posts

Thierry Schellenbach

CEO at Stream · Sep 13, 2018 | 17 upvotes · 1.1M views

Shared insights on Redis, Cassandra, RocksDB

Stream 1.0 leveraged Cassandra for storing the feed. Cassandra is a common choice for building feeds. Instagram, for instance, started out with Redis but eventually switched to Cassandra to handle their rapid usage growth. Cassandra can handle write-heavy workloads very efficiently.

Cassandra is a great tool that allows you to scale write capacity simply by adding more nodes, though it is also very complex. This complexity made it hard to diagnose performance fluctuations. Even though we had years of experience with running Cassandra, it still felt like a bit of a black box. When building Stream 2.0 we decided to go for a different approach and build Keevo. Keevo is our in-house key-value store built upon RocksDB, gRPC and Raft.
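To make the "add more nodes" point concrete, here is a minimal sketch of a consistent-hashing token ring, the general idea behind spreading rows over a cluster so that a new node takes over only part of the data. It is an illustration under stated assumptions, not Cassandra's implementation: the `HashRing` class, the use of MD5 as the hash, and the default of 64 virtual nodes per machine are all hypothetical choices made for the example.

```python
import bisect
import hashlib
from typing import Dict, List


def _token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an arbitrary stand-in hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy token ring: each node owns the ranges that end at its tokens."""

    def __init__(self, vnodes: int = 64) -> None:
        self.vnodes = vnodes                 # virtual nodes per machine (assumed value)
        self._tokens: List[int] = []         # sorted ring positions
        self._owner: Dict[int, str] = {}     # ring position -> node name

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            token = _token(f"{node}#{i}")
            self._owner[token] = node
            bisect.insort(self._tokens, token)

    def remove_node(self, node: str) -> None:
        self._tokens = [t for t in self._tokens if self._owner[t] != node]
        self._owner = {t: n for t, n in self._owner.items() if n != node}

    def node_for(self, key: str) -> str:
        if not self._tokens:
            raise LookupError("ring has no nodes")
        # First token clockwise from the key's position, wrapping around at the end.
        index = bisect.bisect_right(self._tokens, _token(key)) % len(self._tokens)
        return self._owner[self._tokens[index]]


if __name__ == "__main__":
    ring = HashRing()
    for name in ("node-a", "node-b", "node-c"):
        ring.add_node(name)
    keys = [f"feed:{i}" for i in range(1000)]
    before = {k: ring.node_for(k) for k in keys}
    ring.add_node("node-d")
    moved = sum(1 for k in keys if ring.node_for(k) != before[k])
    print(f"{moved} of {len(keys)} keys moved to the new node")
```

In this scheme, every key that changes owner after `add_node` lands on the new node, so existing nodes never shuffle data among themselves. That property is what lets capacity grow one machine at a time.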

RocksDB is a highly performant embeddable database library developed and maintained by Facebook’s data engineering team. RocksDB started as a fork of Google’s LevelDB that introduced several performance improvements for SSDs. Nowadays RocksDB is a project of its own and is under active development. It is written in C++ and it’s fast. Have a look at how this benchmark handles 7 million QPS. In terms of technology it’s much simpler than Cassandra.

This translates into reduced maintenance overhead, improved performance and, most importantly, more consistent performance. It’s interesting to note that LinkedIn also uses RocksDB for their feed.

#InMemoryDatabases #DataStores #Databases

Stream & Go: News Feeds for Over 300 Million End Users - Stream Tech Stack | StackShare

kew44

Nov 10, 2022 | 6 upvotes · 120.2K views

Shared insights on JavaScript, Cassandra, Snowplow, Azure Data Factory, AWS Glue (+1 more)

I'm trying to establish a data lake (or maybe puddle) for my org's Data Sharing project. The idea is that outside partners would send cuts of their PHI data, regardless of format/variables/systems, to our Data Team, who would then harmonize the data, create data marts, and eventually use it for something. End-to-end, I'm envisioning:

1. Ingestion -> Secure, role-based, self-service portal for users to upload data (1a. bonus points if it can perform basic validations/masking).
2. Storage -> Amazon S3 seems like the cheapest. We probably won't need much space, even at full capacity. Our current storage is a secure Box folder that has ~4GB with several batches of test data, code, presentations, and planning docs.
3. Data Catalog -> AWS Glue? Azure Data Factory? Snowplow? Is the main difference basically the vendor? We will also have Data Dictionaries/Codebooks from submitters. Where would they fit in?
4. Partitions -> I've seen Cassandra and YARN mentioned, but have no experience with either.
5. Processing -> We want to use SAS if at all possible. What will work with SAS code?
6. Pipeline/Automation -> The check-in and verification processes that have been outlined are rather involved. Some sort of automated messaging or approval workflow would be nice.
7. I have very little guidance on what a "Data Mart" should look like, so I'm going with the idea that it would be another "experimental" partition. Unless there's an actual mart-building paradigm I've missed?
8. An end user might use the catalog to pull certain de-identified data sets from the marts. Again, role-based access and a self-service GUI would be preferable.

I'm the only full-time tech person on this project, but I'm mostly an OOP, HTML, JavaScript, and some SQL programmer. Most of this is out of my repertoire. I've done a lot of research, but I can't be an effective evangelist without hands-on experience. Since we're starting a new year of our grant, they've finally decided to let me try some stuff out. Any pointers would be appreciated!

Databricks

A unified analytics platform, powered by Apache Spark

Stacks: 511 · Votes: 8

PROS OF DATABRICKS

Best performance on large datasets: 1
True lakehouse architecture: 1
Scalability: 1
Databricks doesn't get access to your data: 1
Usage Based Billing: 1
Security: 1
Data stays in your cloud account: 1
Multicloud: 1

related Databricks posts

Jan Vlnas

Senior Software Engineer at Mews · Oct 6, 2022 | 5 upvotes · 464.1K views

Shared insights on Apache Hive, Apache Spark, Deepnote, Databricks, Jupyter (+1 more)

From my point of view, OpenRefine and Apache Hive serve completely different purposes. OpenRefine is intended for interactive cleaning of messy data locally. You could work with their libraries to use some OpenRefine features as part of your data pipeline (there are pointers in the FAQ), but OpenRefine in general is intended for single-user local operation.

I can't recommend a particular alternative without better understanding of your use case. But if you are looking for an interactive tool to work with big data at scale, take a look at notebook environments like Jupyter, Databricks, or Deepnote. If you are building a data processing pipeline, consider also Apache Spark.

Edit: Fixed references from Hadoop to Hive, which is actually closer to Spark.

Hadoop

Open-source software for reliable, scalable, distributed computing

Stacks: 2.5K · Votes: 56

PROS OF HADOOP

Great ecosystem: 39
One stack to rule them all: 11
Great load balancer: 4
Amazon AWS: 1
Java syntax: 1

related Hadoop posts

StackShare Editors

May 10, 2014 | 11 upvotes · 621.3K views

Shared insights on Kafka, Hadoop

The early data ingestion pipeline at Pinterest used Kafka as the central message transporter, with the app servers writing messages directly to Kafka, which then uploaded log files to S3.

For databases, a custom Hadoop streamer pulled database data and wrote it to S3.

Challenges cited for this infrastructure included high operational overhead, as well as potential data loss occurring when Kafka broker outages led to an overflow of in-memory message buffering.
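The data-loss failure mode described above can be sketched in a few lines. This is a simplified model, not Pinterest's code: it assumes the producer keeps unsent messages in a fixed-size in-memory buffer and discards the oldest one when the buffer is full. The class and function names (`BoundedBuffer`, `simulate_outage`) and the drop-oldest policy are hypothetical.

```python
from collections import deque
from typing import Deque, List, Tuple


class BoundedBuffer:
    """In-memory message buffer that drops the oldest message once it is full."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.dropped = 0                      # messages lost to overflow
        self._queue: Deque[str] = deque()

    def append(self, message: str) -> None:
        if len(self._queue) >= self.capacity:
            self._queue.popleft()             # overflow: the oldest message is lost
            self.dropped += 1
        self._queue.append(message)

    def drain(self) -> List[str]:
        """Return and clear whatever is still buffered (the broker is reachable again)."""
        messages = list(self._queue)
        self._queue.clear()
        return messages


def simulate_outage(capacity: int, produced_during_outage: int) -> Tuple[int, int]:
    """Return (delivered, dropped) for an outage during which nothing can be sent."""
    buffer = BoundedBuffer(capacity)
    for i in range(produced_during_outage):
        buffer.append(f"event-{i}")
    delivered = len(buffer.drain())
    return delivered, buffer.dropped


if __name__ == "__main__":
    print(simulate_outage(capacity=3, produced_during_outage=5))
```

In this model, loss begins once an outage lasts long enough for the app server to produce more messages than the buffer can hold. Persisting messages to local disk before shipping them is one common way to avoid depending on memory alone.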

Scalable and reliable data ingestion at Pinterest - Pinterest Engineering - Medium

Conor Myhrvold

Tech Brand Mgr, Office of CTO at Uber · Dec 4, 2018 | 7 upvotes · 3M views

Shared insights on Kafka, Kafka Manager, Hadoop, Apache Spark, GitHub at Uber Technologies

Why we built Marmaray, an open source generic data ingestion and dispersal framework and library for Apache Hadoop:

Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse it to any sink using Apache Spark. The name, Marmaray, comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:

https://eng.uber.com/marmaray-hadoop-ingestion-open-source/

(Direct GitHub repo: https://github.com/uber/marmaray)

Marmaray: An Open Source Generic Data Ingestion and Dispersal Framework and Library for Apache Hadoop | Uber Engineering Blog

Oracle

An RDBMS that implements object-oriented features such as user-defined types, inheritance, and polymorphism

Stacks: 2.3K · Votes: 113

PROS OF ORACLE

Reliable: 44
Enterprise: 33
High Availability: 15
Hard to maintain: 5
Expensive: 5
Maintainable: 4
Hard to use: 4
High complexity: 3

CONS OF ORACLE

Expensive: 14

related Oracle posts

Babu D

Nov 20, 2020 | 6 upvotes · 736.2K views

Shared insights on Laravel, Node.js, ASP.NET, MongoDB

Hi. We are planning to develop web, desktop, and mobile apps for procurement, logistics, and contracts: procure-to-pay and source-to-pay, spend management, supplier management, and catalog management (similar to SAP Ariba, gap.com, coupa.com, ivalua.com, vroozi.com, procurify.com).

We got stuck when deciding which technology stack is good for the future. We look forward to your kind guidance that will help us.

We want to integrate with multiple databases with seamless bidirectional integration. Which available APIs and middleware are best for achieving this? SAP HANA, Oracle, MySQL, MongoDB...

ASP.NET / Node.js / Laravel ...?

Please guide us.

Bob hs

Data Scientist · Mar 6, 2022 | 5 upvotes · 345.5K views

Shared insights on Google BigQuery, Apache Spark, Google Sheets, Google Datastudio, Python

I recently started a new position as a data scientist at an e-commerce company. The company was founded about 4-5 years ago and is new to many data-related areas. Specifically, I'm their first data science employee. So I have to take care of data analysis tasks as well as bringing new technologies to the company.

1. They have used Elasticsearch (and Kibana) to have reporting dashboards on their daily purchases and user interactions on their e-commerce website.
2. They also use the Oracle database system to keep records of their daily turnovers and lists of their current products, clients, and sellers.
3. They use Data-Warehouse with cockpit 10 for generating reports on different aspects of their business, including number 2 in this list.

At the moment, I grab batches of data from their system to perform predictive analytics from a data science perspective. In some cases, I use a static form of data such as monthly turnover, client values, and high-demand products, and run my predictive analysis using Python (VS Code). Also, I use Google Datastudio or Google Sheets to present my findings. In other cases, I try to do time-series analysis using offline batches of data extracted from Elasticsearch to do user recommendations and user personalization.

I really want to use modern data science tools such as Apache Spark, Google BigQuery, AWS, Azure, or others where they really fit. I think these tools can improve my performance as a data scientist and can provide more continuous analytics of their business interactions. But honestly, I'm not sure where each tool is needed and what part of their system should be replaced by or combined with the current state of technology to improve productivity from the above perspectives.

PostgreSQL

A powerful, open source object-relational database system

Stacks: 99.3K · Votes: 3.5K

PROS OF POSTGRESQL

CONS OF POSTGRESQL

Table/index bloat: 10

related PostgreSQL posts

Simon Reymann

Senior Fullstack Developer at QUANTUSflow Software GmbH · Apr 27, 2020 | 30 upvotes · 12.2M views

Shared insights on OpenSSL, SSLMate, NGINX, Docker Swarm, Redis at QUANTUSflow Software GmbH

Our whole DevOps stack consists of the following tools:

GitHub (incl. GitHub Pages/Markdown for Documentation, GettingStarted and HowTo's) as collaborative review and code management tool
Git as revision control system
SourceTree as Git GUI
Visual Studio Code as IDE
CircleCI for continuous integration (automates the development process)
Prettier / TSLint / ESLint as code linter
SonarQube as quality gate
Docker as container management (incl. Docker Compose for multi-container application management)
VirtualBox for operating system simulation tests
Kubernetes as cluster management for docker containers
Heroku for deploying in test environments
nginx as web server (preferably used as facade server in production environment)
SSLMate (using OpenSSL) for certificate management
Amazon EC2 (incl. Amazon S3) for deploying in stage (production-like) and production environments
PostgreSQL as preferred database system
Redis as preferred in-memory database/store (great for caching)

The main reason we have chosen Kubernetes over Docker Swarm is related to the following artifacts:

Key features: Easy and flexible installation, Clear dashboard, Great scaling operations, Monitoring is an integral part, Great load balancing concepts, Monitors the condition and ensures compensation in the event of failure.
Applications: An application can be deployed using a combination of pods, deployments, and services (or micro-services).
Functionality: Kubernetes has a complex installation and setup process, but it is not as limited as Docker Swarm.
Monitoring: It supports multiple versions of logging and monitoring when the services are deployed within the cluster (Elasticsearch/Kibana (ELK), Heapster/Grafana, Sysdig cloud integration).
Scalability: All-in-one framework for distributed systems.
Other Benefits: Kubernetes is backed by the Cloud Native Computing Foundation (CNCF), has a huge community among container orchestration tools, and is an open source, modular tool that works with any OS.

MongoDB

The database for giant ideas

Stacks: 94.3K · Votes: 4.1K

PROS OF MONGODB

CONS OF MONGODB

Very slow for connected models that require joins: 6
Not ACID compliant: 3
Proprietary query language: 2

related MongoDB posts

Jeyabalaji Subramanian

CTO at FundsCorner · Jan 30, 2019 | 25 upvotes · 3.5M views

Shared insights on MongoDB, PostgreSQL, MongoDB Stitch, Node.js, Amazon SQS (+4 more)

Recently we were looking at a few robust and cost-effective ways of replicating the data that resides in our production MongoDB to a PostgreSQL database for data warehousing and business intelligence.

We set ourselves the following criteria for the optimal tool that would do this job:

- The data replication must be near real-time, yet it should NOT impact the production database
- The data replication must be horizontally scalable (based on the load), asynchronous & crash-resilient

Based on the above criteria, we selected the following tools to perform the end to end data replication:

We chose MongoDB Stitch for picking up the changes in the source database. It is the serverless platform from MongoDB. One of the services offered by MongoDB Stitch is Stitch Triggers. Using Stitch Triggers, you can execute a serverless function (in Node.js) in real time in response to changes in the database. When there are a lot of database changes, Stitch automatically "feeds forward" these changes through an asynchronous queue.

We chose Amazon SQS as the pipe / message backbone for communicating the changes from MongoDB to our own replication service. Interestingly enough, MongoDB Stitch offers integration with AWS services.

In the Node.js function, we wrote minimal functionality to communicate the database changes (insert / update / delete / replace) to Amazon SQS.

Next we wrote a minimal micro-service in Python to listen to the message events on SQS, pick up the data payload & mirror the DB changes onto the target data warehouse. We implemented source data to target data translation by modelling target table structures through SQLAlchemy. We deployed this micro-service as an AWS Lambda function with Zappa. With Zappa, deploying your services as event-driven & horizontally scalable Lambda services is dumb-easy.

In the end, we got to implement a highly scalable near real-time Change Data Replication service that "works" and deployed it to production in a matter of a few days!
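The "mirror the DB changes" step can be sketched roughly as follows. This is not the FundsCorner service. It is a minimal illustration that assumes each queue message is a JSON string with `operationType`, `documentKey`, and `fullDocument` fields (loosely modelled on MongoDB change events). It also swaps in an in-memory SQLite database for the PostgreSQL warehouse so the example is self-contained. The `customers` table and the helper names are hypothetical.

```python
import json
import sqlite3


def make_warehouse() -> sqlite3.Connection:
    """Create a stand-in target database (SQLite here, PostgreSQL in the post)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id TEXT PRIMARY KEY, name TEXT, city TEXT)")
    return conn


def apply_change(conn: sqlite3.Connection, message_body: str) -> None:
    """Mirror one change event, given as the JSON body of a queue message."""
    event = json.loads(message_body)
    operation = event["operationType"]
    doc_id = event["documentKey"]["_id"]

    if operation in ("insert", "update", "replace"):
        doc = event["fullDocument"]
        # Write the whole row so that a redelivered message leaves the same result.
        conn.execute(
            "INSERT OR REPLACE INTO customers (id, name, city) VALUES (?, ?, ?)",
            (doc_id, doc["name"], doc["city"]),
        )
    elif operation == "delete":
        conn.execute("DELETE FROM customers WHERE id = ?", (doc_id,))
    else:
        raise ValueError(f"unsupported operation: {operation}")
    conn.commit()


if __name__ == "__main__":
    warehouse = make_warehouse()
    apply_change(warehouse, json.dumps({
        "operationType": "insert",
        "documentKey": {"_id": "c1"},
        "fullDocument": {"name": "Ada", "city": "Chennai"},
    }))
    print(warehouse.execute("SELECT * FROM customers").fetchall())
```

Writing the full row on every insert, update, or replace keeps the handler idempotent, which matters because queue consumers may see the same message more than once.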

Robert Zuber

CTO at CircleCI · Jul 24, 2019 | 24 upvotes · 3.3M views

Shared insights on MongoDB, PostgreSQL, Redis, GitHub, Amazon S3

We use MongoDB as our primary #datastore. Mongo's approach to replica sets enables some fantastic patterns for operations like maintenance, backups, and #ETL.

As we pull #microservices from our #monolith, we are taking the opportunity to build them with their own datastores using PostgreSQL. We also use Redis to cache data we’d never store permanently, and to rate-limit our requests to partners’ APIs (like GitHub).
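The post does not say how the rate limiting is implemented, so the following is only a sketch of one common approach, a fixed-window counter. A plain dictionary stands in for Redis here; with Redis, the counter would typically be a key that is incremented atomically and expires with the window. The class name, the limits, and the injectable clock are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """Allow at most `limit` calls per partner in each window of `window_seconds`."""

    def __init__(
        self,
        limit: int,
        window_seconds: int,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock
        self._counters: Dict[Tuple[str, int], int] = {}  # stand-in for Redis keys

    def allow(self, partner: str) -> bool:
        window = int(self._clock() // self.window_seconds)
        # Forget counters from earlier windows (key expiry would do this in Redis).
        for stale in [key for key in self._counters if key[1] < window]:
            del self._counters[stale]
        key = (partner, window)
        self._counters[key] = self._counters.get(key, 0) + 1
        return self._counters[key] <= self.limit


if __name__ == "__main__":
    limiter = FixedWindowLimiter(limit=2, window_seconds=60)
    print([limiter.allow("github") for _ in range(3)])
```

A fixed window is the simplest scheme, but it can let through up to twice the limit across a window boundary. Sliding-window or token-bucket variants trade a little complexity for smoother behaviour.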

When we’re dealing with large blobs of immutable data (logs, artifacts, and test results), we store them in Amazon S3. We handle any side-effects of S3’s eventual consistency model within our own code. This ensures that we deal with user requests correctly while writes are in process.

Update: How CircleCI Processes Over 30 Million Builds Per Month - CircleCI Tech Stack

Redis

Open source (BSD licensed), in-memory data structure store

Stacks: 60.1K · Votes: 3.9K

PROS OF REDIS

CONS OF REDIS

Cannot query objects directly: 15
No secondary indexes for non-numeric data types: 3
No WAL: 1

related Redis posts

Russel Werner

Lead Engineer at StackShare · Dec 3, 2018 | 32 upvotes · 2.9M views

Shared insights on React, Glamorous, Apollo, Node.js, Rails

StackShare Feed is built entirely with React, Glamorous, and Apollo. One of our objectives with the public launch of the Feed was to enable a Server-side rendered (SSR) experience for our organic search traffic. When you visit the StackShare Feed, and you aren't logged in, you are delivered the Trending feed experience. We use an in-house Node.js rendering microservice to generate this HTML. This microservice needs to run and serve requests independent of our Rails web app. Up until recently, we had a mono-repo with our Rails and React code living happily together and all served from the same web process. In order to deploy our SSR app into a Heroku environment, we needed to split out our front-end application into a separate repo in GitHub. The driving factor in this decision was mostly due to limitations imposed by Heroku specifically with how processes can't communicate with each other. A new SSR app was created in Heroku and linked directly to the frontend repo so it stays in-sync with changes.

Related to this, we needed a way to "deploy" our frontend changes to various server environments without building & releasing the entire Ruby application. We built a hybrid Amazon S3 and Amazon CloudFront solution to host our Webpack bundles. A new CircleCI script builds the bundles and uploads them to S3. The final step in our rollout is to update some keys in Redis so our Rails app knows which bundles to serve. The results of these efforts were significant. Our frontend team now moves independently of our backend team, our build & release process takes only a few minutes, we are now using an edge CDN to serve JS assets, and we have pre-rendered React pages!
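The "update some keys so the app knows which bundles to serve" step amounts to a pointer swap in a shared key-value store. The sketch below uses a plain dictionary in place of Redis. The key naming scheme, function names, and URLs are hypothetical and are not StackShare's actual layout.

```python
from typing import Dict


def bundle_key(environment: str) -> str:
    """Key under which an environment's current bundle URL is stored (assumed scheme)."""
    return f"frontend:bundle:{environment}"


def release_bundle(store: Dict[str, str], environment: str, bundle_url: str) -> None:
    """Last step of a frontend release: point the environment at the new bundle."""
    store[bundle_key(environment)] = bundle_url


def bundle_for(store: Dict[str, str], environment: str, fallback_url: str) -> str:
    """What the web app would look up at render time to decide which bundle to serve."""
    return store.get(bundle_key(environment), fallback_url)


if __name__ == "__main__":
    store: Dict[str, str] = {}  # stand-in for the shared key-value store
    release_bundle(store, "production", "https://cdn.example.com/app.abc123.js")
    print(bundle_for(store, "production", "https://cdn.example.com/app.default.js"))
```

Because the backend only reads a pointer, a frontend release or rollback is a single key write and needs no backend deploy.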

#StackDecisionsLaunch #SSR #Microservices #FrontEndRepoSplit
