Alternatives to Google BigQuery

Google Cloud Bigtable, Amazon Redshift, Hadoop, Snowflake, and Google Analytics are the most popular alternatives and competitors to Google BigQuery.

What is Google BigQuery and what are its top alternatives?

Run super-fast, SQL-like queries against terabytes of data in seconds, using the processing power of Google's infrastructure. Load data with ease: bulk-load your data using Google Cloud Storage or stream it in. Access is easy too: use a browser tool, a command-line tool, or make calls to the BigQuery REST API with client libraries for Java, PHP, or Python.
Google BigQuery is a tool in the Big Data as a Service category of a tech stack.
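As a minimal sketch of the client-library path, here's a query via the official Python client (assuming `pip install google-cloud-bigquery` and default application credentials; the public dataset is just an example):

```python
# A minimal sketch: querying BigQuery through the official Python client.
# Assumes default application credentials are configured in the environment.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""

# client.query() submits the job; .result() blocks until it finishes.
for row in client.query(query).result():
    print(row.name, row.total)
```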

Google BigQuery alternatives & related posts


Google Cloud Bigtable

The same database that powers Google Search, Gmail and Analytics

related Google Cloud Bigtable posts

Google Cloud IoT Core, Terraform, Python, Google Cloud Deployment Manager, Google Cloud Build, Google Cloud Run, Google Cloud Bigtable, Google BigQuery, Google Cloud Storage, Google Compute Engine, GitHub

Context: I wanted to create an end-to-end IoT data pipeline simulation using Google Cloud IoT Core and other GCP services. I had never touched Terraform meaningfully until working on this project, and it's been one of the best explorations of my development career. The documentation and syntax are incredibly human-readable and friendly. I'm used to building infrastructure through the Google APIs via Python, but I'm so glad past Sung did not make that decision. I was tempted to use Google Cloud Deployment Manager, but the templates seemed a bit convoluted on first impression. I'm glad past Sung did not make that decision either.

Solution: Leveraging Google Cloud Build, Google Cloud Run, Google Cloud Bigtable, Google BigQuery, Google Cloud Storage, and Google Compute Engine, along with some other fun tools, I can deploy over 40 GCP resources using Terraform!

Check Out My Architecture: CLICK ME

Check out the GitHub repo attached


Amazon Redshift

Fast, fully managed, petabyte-scale data warehouse service

related Amazon Redshift posts

Julien DeFrance
Principal Software Engineer at Tophatter · at SmartZip

Rails, Rails API, AWS Elastic Beanstalk, Capistrano, Docker, Amazon S3, Amazon RDS, MySQL, Amazon RDS for Aurora, Amazon ElastiCache, Memcached, Amazon CloudFront, Segment, Zapier, Amazon Redshift, Amazon Quicksight, Superset, Elasticsearch, Amazon Elasticsearch Service, New Relic, AWS Lambda, Node.js, Ruby, Amazon DynamoDB, Algolia

Back in 2014, I was given an opportunity to re-architect the SmartZip Analytics platform and its flagship product, SmartTargeting. This is SaaS software that helps real estate professionals keep up with their prospects and leads in a given neighborhood/territory, find out (thanks to predictive analytics) who's most likely to list/sell their home, and run cross-channel marketing automation against them: direct mail, online ads, email... The company also provides Data APIs to Enterprise customers.

I had inherited years and years of technical debt, and I knew things had to change radically. The first enabler was to make use of the cloud and go with AWS, so we would stop re-inventing the wheel and build around managed, scalable services.

For the SaaS product, we kept working with Rails, as this was what my team had the most knowledge in. We did, however, break up the monolith and decouple the front-end application from the backend using Rails API, so we'd have independently scalable microservices from then on.

Our various applications could now be deployed using AWS Elastic Beanstalk, so we wouldn't waste any more effort writing time-consuming Capistrano deployment scripts. We combined this with Docker, so each application would run within its own container, independently of the underlying host configuration.

Storage-wise, we went with Amazon S3 and ditched the pre-existing local and network storage people used to deal with in our legacy systems. On the database side: Amazon RDS / MySQL initially, ultimately migrated to Amazon RDS for Aurora / MySQL when it was released. Once again, you want a managed service your cloud provider handles for you.

Future improvements / technology decisions included:

  • Caching: Amazon ElastiCache / Memcached
  • CDN: Amazon CloudFront
  • Systems integration: Segment / Zapier
  • Data warehousing: Amazon Redshift
  • BI: Amazon Quicksight / Superset
  • Search: Elasticsearch / Amazon Elasticsearch Service / Algolia
  • Monitoring: New Relic

As our usage grew, patterns changed, and our business needs evolved, my role as Engineering Manager and then Director of Engineering was also to ensure my team kept learning and innovating while delivering on business value.

One of these innovations was to get into serverless: adopting AWS Lambda was a big step forward. At the time it was only available for Node.js (not Ruby), but it was a great way to handle cost efficiency, unpredictable traffic, and sudden bursts of traffic... Ultimately you want the whole chain of services involved in a call to be serverless, and that's when we started leveraging Amazon DynamoDB on these projects so they'd be fully scalable.
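As a minimal sketch of that serverless chain, here's a Lambda handler writing to DynamoDB (in Python, which Lambda now supports alongside Node.js; the table name is hypothetical):

```python
# A minimal sketch of a serverless handler: AWS Lambda persisting a record
# to DynamoDB so the whole call chain scales without servers.
# The "prospects" table name is hypothetical.
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("prospects")

def handler(event, context):
    # Store the incoming payload; DynamoDB scales with the Lambda fleet.
    table.put_item(Item=json.loads(event["body"]))
    return {"statusCode": 200, "body": json.dumps({"ok": True})}
```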

Ankit Sobti

Looker, Stitch, Amazon Redshift, dbt

We recently moved our Data Analytics and Business Intelligence tooling to Looker. It's already helping us create a solid process for reusable SQL-based data modeling, with consistent definitions across the entire organization. Looker allows us to collaboratively build these version-controlled models and push the limits of what we've traditionally been able to accomplish with analytics as a lean team.

For Data Engineering, we're in the process of moving from maintaining our own ETL pipelines on AWS to a managed ELT system on Stitch. We're also evaluating dbt, a command-line tool, to manage data transformations. Our hope is that Stitch + dbt will streamline the ELT bit, allowing us to focus our energies on analyzing data rather than managing it.


Hadoop

Open-source software for reliable, scalable, distributed computing

related Hadoop posts

StackShare Editors

Kafka, Kibana, Elasticsearch, Logstash, Hadoop

With services interacting with each other and with mobile devices, logging is important: it provides information for internal cases like debugging and for business cases like dynamic pricing.

With multiple Kafka clusters, data is archived into Hadoop before expiration. Data is ingested in real time and indexed into an ELK stack. The ELK stack comprises Elasticsearch, Logstash, and Kibana for searching and visualization.
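As a minimal sketch of that real-time ingestion step (assuming the kafka-python and elasticsearch client packages; broker, topic, and index names are hypothetical):

```python
# A minimal sketch: consuming log events from Kafka and indexing them into
# Elasticsearch for the ELK stack. Broker, topic, and index are hypothetical.
import json
from kafka import KafkaConsumer
from elasticsearch import Elasticsearch

consumer = KafkaConsumer(
    "app-logs",
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
es = Elasticsearch("http://localhost:9200")

for message in consumer:
    # Each event becomes a searchable document; Kibana visualizes the index.
    es.index(index="logs", document=message.value)
```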

StackShare Editors

Prometheus, Chef, Consul, Memcached, Hack, Swift, Hadoop, Terraform, Airflow, Apache Spark, Kubernetes, gRPC, HHVM (HipHop Virtual Machine), Presto, Kotlin, Apache Thrift

Since the beginning, Cal Henderson has been the CTO of Slack. Earlier this year, he commented on a Quora question summarizing their current stack.

Apps
  • Web: a mix of JavaScript/ES6 and React.
  • Desktop: the same web codebase, shipped as a desktop application with Electron.
  • Android: a mix of Java and Kotlin.
  • iOS: written in a mix of Objective-C and Swift.
Backend
  • The core application and the API are written in PHP/Hack and run on HHVM.
  • Data is stored in MySQL using Vitess.
  • Caching is done using Memcached and MCRouter.
  • The search service is built on SolrCloud, with various Java services.
  • The messaging system uses WebSockets, with many services in Java and Go.
  • Load balancing is done using HAProxy, with Consul for configuration.
  • Most services talk to each other over gRPC; some use Apache Thrift or JSON-over-HTTP.
  • The voice and video calling service was built in Elixir.
Data warehouse
  • Built using open-source tools including Presto, Spark, Airflow, Hadoop, and Kafka (a minimal Airflow sketch follows below).
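To make the warehouse tooling concrete, here's a minimal Airflow DAG sketch of that kind of pipeline (not Slack's actual code; task commands are hypothetical placeholders, and a recent Airflow 2.x is assumed):

```python
# A minimal Airflow DAG sketch: land raw Kafka archives in Hadoop, then run
# a Spark job over them. Commands are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="warehouse_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(task_id="ingest", bash_command="hdfs dfs -ls /raw")
    transform = BashOperator(task_id="transform", bash_command="spark-submit job.py")
    ingest >> transform  # transform runs only after ingest succeeds
```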

Snowflake

The data warehouse built for the cloud

related Snowflake posts

Google BigQuery, Snowflake

I use Google BigQuery because it makes it super easy to query and store data for analytics workloads. If you're using GCP, you're likely using BigQuery. However, data viz tools connected directly to BigQuery run pretty slow. They recently announced BI Engine, which will hopefully compete well against big players like Snowflake when it comes to concurrency.

What's nice too is that it has SQL-based ML tools and great GIS support!
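As a minimal sketch of those SQL-based ML tools, here's BigQuery ML's CREATE MODEL statement run through the Python client (dataset and table names are hypothetical):

```python
# A minimal BigQuery ML sketch: train a logistic regression model with plain
# SQL. The mydataset tables are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
    CREATE OR REPLACE MODEL mydataset.churn_model
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT * FROM mydataset.customer_features
""").result()  # blocks until training finishes
```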


Google Analytics

related Google Analytics posts

Tassanai Singprom

This is my stack.

Application & Data: JavaScript, PHP, HTML5, jQuery, Redis, Amazon EC2, Ubuntu, Sass, Vue.js, Firebase, Laravel, Lumen, Amazon RDS, GraphQL, MariaDB

Utilities: Google Analytics, Postman, Elasticsearch

DevOps: Git, GitHub, GitLab, npm, Visual Studio Code, Kibana, Sentry, BrowserStack

Business: Slack

Yonas Beshawred
CEO at StackShare

Amplitude, Segment, Google Analytics

Adopting Amplitude was one of the best decisions we've made. We didn't try any of the alternatives; the free tier was really generous, so it was easy to justify trying it out (via Segment). We've had Google Analytics since inception, but just for logged-out traffic. We knew we'd need some sort of funnel analysis solution, so it came down to just a few options.

We had heard good things about Amplitude from friends, and we even had a consultant/advisor who was an Amplitude pro from using it at his company, so he kinda convinced us to splurge on the Enterprise tier for the behavioral cohorts alone. The queries they provide via a few clicks in their UI would take days/weeks to craft in SQL. The behavioral cohorts allow us to create a lot of useful retention charts.

Another really useful feature is kinda minor but kinda not. When you change a saved chart, a new URL gets generated and is visible in your browser (chartURL/edit), and that URL is immediately available to share with your team. It may sound inconsequential, but in practice it makes it really easy to share and iterate on graphs. My only complaint is that you have to explicitly tag other team members as owners of whatever chart you're creating for them to be able to edit and save it. I can see why this is the case, but more often than not, the people I'm sharing the chart with are the ones I want to edit it 🤷🏾‍♂️

The Engagement Matrix feature is also really helpful (once you filter out the noisy events). Charts and dashboards are also great and make it easy for us to focus on the important metrics. We've been using Amplitude in production for about 6 months now. There's a bunch of other features we don't use regularly, like Pathfinder, that I personally don't fully understand yet, but I'm sure we'll start using them eventually.

Again, we haven't tried any of the alternatives like Heap, Mixpanel, or Kissmetrics, so I can't speak to those, but Amplitude works great for us.

#analytics #analyticsstack


related Amazon Athena posts

Amazon Athena, Google BigQuery

I use Amazon Athena because, similar to Google BigQuery, you can store and query data easily. And since you can define your data schema in the Glue Data Catalog, there's a central way to define data models.

However, I would not recommend it for batch jobs. I typically use it to check intermediary datasets in data engineering workloads. It's good for getting a look and feel for the data along its ETL journey.
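A minimal sketch of that kind of spot check via boto3 (database, table, and results bucket are hypothetical):

```python
# A minimal sketch: spot-checking an intermediate dataset with Athena via
# boto3. Database, table, and results bucket are hypothetical.
import time

import boto3

athena = boto3.client("athena")

qid = athena.start_query_execution(
    QueryString="SELECT * FROM etl_staging.events LIMIT 10",
    QueryExecutionContext={"Database": "etl_staging"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Athena is asynchronous: poll until the query leaves the running states.
while athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
    time.sleep(1)

for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
    print([col.get("VarCharValue") for col in row["Data"]])
```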

Google BigQuery, Amazon Redshift, Amazon Athena, Amazon S3

Hi all,

Currently, we need to ingest data from Amazon S3 into a DB, either Amazon Athena or Amazon Redshift. But the problem is that the data is in .PSV (pipe-separated values) format and is over 200 GB in size. Query performance in Athena/Redshift is not up to the mark; queries time out or are too slow compared to Google BigQuery. How would I optimize the performance and query result time? Can anyone please help me out?
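Not an answer from the thread, but one common Athena-side optimization for a case like this is rewriting the raw PSV as compressed, partitioned Parquet with a CTAS query, so each query scans far less data; a sketch with hypothetical names:

```python
# A hedged sketch (hypothetical names throughout): convert pipe-separated
# data already registered as the table sales_psv into partitioned Parquet
# via an Athena CTAS query, which typically cuts scan time and cost.
# Note: Athena requires partition columns to come last in the SELECT.
import boto3

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="""
        CREATE TABLE sales_parquet
        WITH (format = 'PARQUET',
              external_location = 's3://my-bucket/sales-parquet/',
              partitioned_by = ARRAY['sale_date']) AS
        SELECT * FROM sales_psv
    """,
    QueryExecutionContext={"Database": "mydb"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```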


related Elasticsearch posts

Tim Abbott
Founder at Zulip

PostgreSQL, MySQL, Elasticsearch

We've been using PostgreSQL since the very early days of Zulip, but we actually didn't use it from the beginning. Zulip started out as a MySQL project back in 2012 because we'd heard it was a good choice for a startup with a wide community. However, we found that even though we were using the Django ORM for most of our database access, we spent a lot of time fighting with MySQL. Issues ranged from bad collation defaults to bad query plans, which required a lot of manual query tweaks.

We ended up getting so frustrated that we tried out PostgreSQL, and the results were fantastic. We didn't have to do any real customization (just some tuning of settings for how big a server we had), and all of our most important queries were faster out of the box. As a result, we were able to delete a bunch of custom queries escaping the ORM that we'd written to make the MySQL query planner happy (because Postgres just did the right thing automatically).

Since then, we've gotten a ton of value out of Postgres. We use its excellent built-in full-text search, which has helped us avoid needing to bring in a tool like Elasticsearch, and we've really enjoyed features like its partial indexes, which saved us from adding unnecessary extra tables to get good performance for things like our "unread messages" and "starred messages" indexes.
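As a minimal sketch of the partial-index idea (hypothetical schema, issued through psycopg2):

```python
# A minimal sketch of a Postgres partial index like the ones described
# above. Table, columns, and DSN are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=app")
with conn, conn.cursor() as cur:
    # Index only the rows the hot query touches (unread messages), keeping
    # the index small and cheap to maintain - no extra table required.
    cur.execute("""
        CREATE INDEX IF NOT EXISTS user_message_unread_idx
        ON user_messages (user_id)
        WHERE NOT read
    """)
```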

I can't recommend it highly enough.

Tymoteusz Paul
Devops guy at X20X Development LTD

Vagrant, VirtualBox, Ansible, Elasticsearch, Kibana, Logstash, TeamCity, Jenkins, Slack, Apache Maven, Vault, Git, Docker, CircleCI, LXC, Amazon EC2

Often enough I have to explain my way of setting up a CI/CD pipeline with multiple deployment platforms. Since I am a bit tired of yapping the same thing every single time, I've decided to write it up and share it with the world this way, and send people to read it instead ;). I will explain it on a "live example" of how Rome got built, assuming the current methodology consists only of a readme.md and wishes of good luck (as it usually does ;)).

It always starts with an app, whatever it may be, and reading the readmes available while Vagrant and VirtualBox install and update. Following that is the first hurdle: convert all the instructions/scripts into Ansible playbook(s), stopping only when a clean vagrant up or vagrant reload gives us a fully working environment. As our Vagrant environment is now functional, it's time to break it! This is the moment to look for how things can be done better (too rigid/too loose versioning? Sloppy environment setup?) and replace them with the right way to do stuff, one that won't bite us in the backside. This is the point, and the best opportunity, to upcycle the existing way of doing the dev environment into a proper, production-grade product.

I should probably digress here for a moment and explain why. I firmly believe that the way you deploy production is the same way you should deploy development, shy of a few debugging-friendly settings. This way you avoid the discrepancy between how production works and how development works, which almost always causes major pains in the back of the neck, and with the use of proper tools it should mean no extra work for the developers. That's why we start with Vagrant, as developer boxes should be as easy as vagrant up, but the meat of our product lies in Ansible, which will do the meat of the work and can be applied to almost anything: AWS, bare metal, Docker, LXC, in the open net, behind a VPN - you name it.

We must also give proper consideration to monitoring and log hoovering at this point. My generic answer here is to grab Elasticsearch, Kibana, and Logstash. While for different use cases there may be better solutions, this one is well battle-tested, performs reasonably, and is very easy to scale both vertically (within some limits) and horizontally. Logstash rules are easy to write and are well supported in maintenance through Ansible, which, as I've mentioned earlier, is at the very core of things, and creating triggers/reports and alerts based on Elastic and Kibana is generally a breeze, including some quite complex aggregations.

If we are happy with the state of the Ansible, it's time to move on and put all those roles and playbooks to work. Namely, we need something to manage our CI/CD pipelines. For me, the choice is obvious: TeamCity. It's modern, robust, and unlike most of the lightweight alternatives, it's transparent. What I mean by that is that it doesn't tell you how to do things, and doesn't limit your ways to deploy, or test, or package for that matter. Instead, it provides a developer-friendly and rich playground for your pipelines. You can do most of the same with Jenkins, but it has a quite dated look and feel to it, while also missing some key functionality that must be brought in via plugins (like a quality REST API, which comes built in with TeamCity). It also comes with all the common handy plugins like Slack or Apache Maven integration.

The exact flow between CI and CD varies too greatly from one application to another to describe, so I will outline a few rules that guide me:

1. Make build steps as small as possible. This way, when something breaks, we know exactly where, without needing to dig and root around.
2. All security credentials besides the development environment's must be sourced from individual Vault instances. Keys to those containers should exist only on the CI/CD box and be accessible by a few people (the fewer the better). This is pretty self-explanatory, as anything besides dev may contain sensitive data and, at times, be public-facing. Because of that, appropriate security must be present. TeamCity shines in this department with excellent secrets management.
3. Every part of the build chain shall consume and produce artifacts. If it creates nothing, it likely shouldn't be its own build. This way, if any issue shows up with any environment or version, all a developer has to do is grab the appropriate artifacts to reproduce the issue locally.
4. Deployment builds should be tied directly to specific Git branches/tags. This enables much easier tracking of what caused an issue, including automatically identifying and tagging the author (nothing like automated regression testing!).

Speaking of deployments, I generally try to keep them simple, but also with a close eye on the wallet. Because of that, I am more than happy with AWS or another cloud provider, but I also constantly watch the load and whether we get the value of what we are paying for. Often enough the pattern of use is not constantly erratic, but rather has a firm baseline, which could be migrated away from the cloud and onto bare-metal boxes. That is another part where this approach strongly triumphs over the common Docker and CircleCI setup, where you are very much tied to cloud providers and getting out is expensive. Here, to embrace bare-metal hosting, all you need is the help of some container-based self-hosting software; my personal preference is Proxmox and LXC. Following that, all you must write are Ansible scripts to manage the Proxmox hardware, much the same way as you do for Amazon EC2 (Ansible supports both well), and you are good to go. One does not exclude the other; quite the opposite, as they can live in great synergy and cut your costs dramatically (the heavier your base load, the bigger the savings) while providing production-grade resiliency.
