What is Google BigQuery?
Google BigQuery is Google Cloud's fully managed, serverless data warehouse, designed to run fast SQL analytics over large datasets.
Here are some stack decisions, common use cases and reviews by companies and developers who chose Google BigQuery in their tech stack.
My process is like this: I get data once a month, either from Google BigQuery or as parquet files from Azure Blob Storage. I have a script that does some cleaning and then stores the result as partitioned parquet files, because the following process cannot load all of the data into memory.
The next process runs a heavy computation in parallel (per partition) and stores three intermediate versions as parquet files: two are used for statistics, and the third is filtered to produce the final files.
I build a report from the two statistics files in a Jupyter notebook and convert it to HTML.
- Everything is done with vanilla Python and pandas.
- Sometimes I may get data in a different format.
- The cloud service is Microsoft Azure.
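To make the current setup concrete, here is a minimal sketch of the per-partition parallel step, assuming hypothetical paths and a stand-in `heavy_computation`; the real script's logic is not shown in the original post:

```python
import pandas as pd
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def heavy_computation(df: pd.DataFrame):
    # Stand-in for the real computation: two stats frames plus a candidate frame.
    stats_a = df.describe()
    stats_b = df.nunique().to_frame("n_unique")
    candidates = df.assign(keep=True)
    return stats_a, stats_b, candidates

def process_partition(path: Path) -> None:
    # Load one partition at a time so the full dataset never sits in memory.
    df = pd.read_parquet(path)
    stats_a, stats_b, candidates = heavy_computation(df)
    stats_a.to_parquet(f"out/stats_a/{path.stem}.parquet")
    stats_b.to_parquet(f"out/stats_b/{path.stem}.parquet")
    # The third intermediate version is filtered to produce the final files.
    candidates[candidates["keep"]].to_parquet(f"out/final/{path.stem}.parquet")

if __name__ == "__main__":
    for d in ("out/stats_a", "out/stats_b", "out/final"):
        Path(d).mkdir(parents=True, exist_ok=True)
    partitions = sorted(Path("data/partitioned").glob("*.parquet"))
    with ProcessPoolExecutor() as pool:  # one partition per worker process
        list(pool.map(process_partition, partitions))
```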
What I'm considering is the following:
Ingest the data with Kafka or with native Python, do the first processing step, and store the results in Druid. The second processing step would be done with Apache Spark reading data from Apache Druid; the intermediate states can be stored in Druid too, and visualization would be done with Apache Superset.
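On the Spark-from-Druid part of that idea: Druid exposes SQL over an Avatica JDBC endpoint, so a rough sketch along these lines should be possible, assuming the Avatica driver jar is on Spark's classpath and using placeholder hostnames and table names:

```python
from pyspark.sql import SparkSession

# Assumes the Avatica JDBC driver was supplied via --jars / spark.jars.
spark = SparkSession.builder.appName("second-processing").getOrCreate()

df = (
    spark.read.format("jdbc")
    # Placeholder broker URL; Druid's SQL-over-JDBC endpoint lives under /druid/v2/sql/avatica/.
    .option("url", "jdbc:avatica:remote:url=http://druid-broker:8082/druid/v2/sql/avatica/")
    .option("driver", "org.apache.calcite.avatica.remote.Driver")
    .option("query", "SELECT * FROM intermediate_state")  # hypothetical datasource
    .load()
)

# Stand-in for the heavy computation, then persist the result for Superset or the report.
result = df.groupBy("partition_key").count()
result.write.mode("overwrite").parquet("out/second-stage/")
```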
I am currently running 50 pipelines on a Google Cloud Data Fusion version 6.4 instance. These pipelines run daily and move data from a MySQL Server database to Google BigQuery. The cost is becoming very high, and I was wondering whether the costs with Google Cloud Dataflow would be lower for the same rows transported.
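Costs aside, for a sense of what one of those daily pipelines might look like rewritten for Dataflow, here is a minimal Apache Beam sketch with placeholder connection details, table names, and project IDs; treat it as an assumption about the setup rather than a drop-in replacement:

```python
import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner", project="my-project",
    region="us-central1", temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadMySQL" >> ReadFromJdbc(
            table_name="orders",  # hypothetical source table
            driver_class_name="com.mysql.cj.jdbc.Driver",
            jdbc_url="jdbc:mysql://db-host:3306/shop",
            username="reader",
            password="secret",
        )
        # Rows come back as named tuples; convert to dicts for the BigQuery sink.
        | "ToDict" >> beam.Map(lambda row: row._asdict())
        | "WriteBQ" >> beam.io.WriteToBigQuery(
            "my-project:analytics.orders",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```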
I'm wondering if any Cloud Firestore users might be open to sharing some input on the challenges they've encountered when trying to build a low-cost, low-latency data pipeline into their analytics warehouse (e.g. Google BigQuery, Snowflake, etc.).
I'm working with a platform by the name of Estuary.dev, an ETL/ELT platform, and we are researching the pain points here to see whether there are drawbacks to the Firestore->BQ extension and/or whether users are seeking easier ways to turn NoSQL documents into fine-grained tabular data.
Please feel free to drop some knowledge/wish-list items on me for a better pipeline here!
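As a point of reference for the kind of flattening involved, here is a minimal batch sketch using the official Python clients; the collection name, field layout, and destination table are hypothetical, and a real pipeline would need incremental sync rather than full reloads:

```python
from google.cloud import bigquery, firestore

fs = firestore.Client()
bq = bigquery.Client()

def flatten(doc_id, data, prefix=""):
    # Flatten nested maps into underscore-joined columns (BigQuery-safe names).
    row = {"doc_id": doc_id} if not prefix else {}
    for key, value in data.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(doc_id, value, prefix=f"{name}_"))
        else:
            row[name] = value
    return row

rows = [flatten(doc.id, doc.to_dict()) for doc in fs.collection("users").stream()]

# Let BigQuery autodetect the schema from the flattened JSON rows.
job = bq.load_table_from_json(
    rows,
    "my-project.analytics.users",  # hypothetical destination table
    job_config=bigquery.LoadJobConfig(
        autodetect=True, write_disposition="WRITE_TRUNCATE",
    ),
)
job.result()  # block until the load job finishes
```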
Hello! For security and strategic reasons, we are migrating our apps from AWS/Google to a cloud provider with more security certifications and fewer features, named Outscale. So far we have been using Google BigQuery as our data warehouse, with ELT workflows (using Stitch and dbt), and we need to migrate our data ecosystem to this new cloud provider.
We are setting up a Kubernetes cluster with our new cloud provider for our apps. Regarding the data warehouse, it's not clear whether there are advantages or drawbacks to setting it up on Kubernetes (apart from having to create node groups and tolerations with more RAM/CPU). Also, we are not sure what the best open-source or on-premise tool to use would be. The main requirement is that data must remain in the secure cluster, and no external entity (especially US-based) can have access to it. We have a dev cluster/environment and a production cluster/environment on this cloud.
Regarding the actual DWH usage:
- Today we have ~1.5TB in BigQuery in production. We're going to run our initial tests with ~50-100GB of data on our test cluster.
- Most of our data comes from other databases, so in most cases we already have replicated sources somewhere; only a handful of collections have their source directly in the DWH (such as snapshots, some external data we've fetched at some point, Google Analytics, etc.) and need an appropriate level of replication.
- We are a team of 30-ish people, we do not have critical needs regarding analytics speed, and we do not need real time. We rebuild our dbt models 2-3 times a day and this usually proves enough.
Apart from PostgreSQL, I haven't really found open-source or on-premise alternatives for setting up a data warehouse and running transformations with dbt. There is also the question of data ingestion: I've shortlisted Airbyte and Meltano, and I have trouble telling whether one of the two is better, but Airbyte seems to have the bigger community.
What do you suggest regarding the data warehouse and the ELT workflows?
- Kubernetes or not Kubernetes?
- PostgreSQL or something else? If Postgres, what are the important configs you'd have in mind?
- Airbyte/dbt or something else?
I recently started a new position as a data scientist at an e-commerce company. The company was founded about 4-5 years ago and is new to many data-related areas. Specifically, I'm their first data science employee, so I have to handle both data analysis tasks and bringing new technologies to the company.
1. They have used Elasticsearch (and Kibana) for reporting dashboards on their daily purchases and user interactions on their e-commerce website.
2. They also use an Oracle database to keep records of their daily turnover and lists of their current products, clients, and sellers.
3. They use a data warehouse with Cockpit 10 to generate reports on different aspects of their business, including number 2 in this list.
At the moment, I grab batches of data from their systems to perform predictive analytics from a data science perspective. In some cases, I use a static form of data such as monthly turnover, client value, and high-demand products, and run my predictive analysis using Python (VS Code); I use Google Data Studio or Google Sheets to present my findings. In other cases, I do time-series analysis on offline batches of data extracted from Elasticsearch for user recommendations and personalization.
I really want to use modern data science tools such as Apache Spark, Google BigQuery, AWS, Azure, or others where they really fit. I think these tools could improve my performance as a data scientist and provide more continuous analytics of the business's interactions. But honestly, I'm not sure where each tool is needed, or which parts of their system should be replaced by or combined with current technology to improve productivity from the above perspectives.
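For the Elasticsearch batches specifically, here is a minimal sketch of pulling documents into pandas for time-series work, assuming a hypothetical `purchases` index with `timestamp` and `amount` fields and a placeholder endpoint:

```python
import pandas as pd
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Scroll through the last 30 days of purchase documents.
hits = helpers.scan(
    es,
    index="purchases",
    query={"query": {"range": {"timestamp": {"gte": "now-30d"}}}},
)
df = pd.DataFrame(hit["_source"] for hit in hits)

# Aggregate to a daily series suitable for forecasting or recommendations work.
df["timestamp"] = pd.to_datetime(df["timestamp"])
daily = df.set_index("timestamp").resample("D")["amount"].sum()
print(daily.tail())
```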
Reading data from an on-prem data lake into cloud storage in order to utilize cloud computing for resource-heavy NLP and ML operations (<10GB total). We're trying to decide whether we need Google BigQuery here or whether we can work directly from Google Cloud Storage with a Dataproc cluster. Any thoughts on which would be the better approach would be appreciated. Thanks!
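If the GCS-plus-Dataproc route is viable, reading files straight from a bucket is straightforward, since Dataproc clusters ship with the GCS connector; a minimal PySpark sketch with a placeholder bucket and paths:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("nlp-prep").getOrCreate()

# gs:// paths work out of the box on Dataproc via the preinstalled GCS connector.
docs = spark.read.text("gs://my-bucket/raw-docs/*.txt")

# Stand-in for the real NLP preprocessing step: lowercase and tokenize on whitespace.
tokens = docs.select(F.split(F.lower(F.col("value")), r"\s+").alias("tokens"))
tokens.write.mode("overwrite").parquet("gs://my-bucket/tokens/")
```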
Google BigQuery's Features
- All behind the scenes: your queries can execute asynchronously in the background and can be polled for status.
- Import data with ease: bulk-load your data using Google Cloud Storage or stream it in bursts of up to 1,000 rows per second.
- Affordable big data: the first terabyte of data processed each month is free.
- The right interface: separate interfaces for administration and developers make sure that you have access to the tools you need.
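To illustrate the asynchronous execution model from the first bullet, here is a minimal sketch using the google-cloud-bigquery Python client; the query and polling interval are arbitrary:

```python
import time
from google.cloud import bigquery

client = bigquery.Client()

# client.query() returns immediately with a QueryJob handle;
# the query itself runs in the background on BigQuery's side.
job = client.query(
    "SELECT COUNT(*) AS n FROM `bigquery-public-data.samples.shakespeare`"
)

# Poll the job for status instead of blocking on it.
while not job.done():
    print(f"job {job.job_id} state: {job.state}")
    time.sleep(1)

for row in job.result():  # fetch the results once the job completes
    print(row.n)
```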