What is Google BigQuery?
Who uses Google BigQuery?
Google BigQuery Integrations
Why developers like Google BigQuery?
Here are some stack decisions, common use cases and reviews by companies and developers who chose Google BigQuery in their tech stack.
Context: I wanted to create an end to end IoT data pipeline simulation in Google Cloud IoT Core and other GCP services. I never touched Terraform meaningfully until working on this project, and it's one of the best explorations in my development career. The documentation and syntax is incredibly human-readable and friendly. I'm used to building infrastructure through the google apis via Python , but I'm so glad past Sung did not make that decision. I was tempted to use Google Cloud Deployment Manager, but the templates were a bit convoluted by first impression. I'm glad past Sung did not make this decision either.
Solution: Leveraging Google Cloud Build Google Cloud Run Google Cloud Bigtable Google BigQuery Google Cloud Storage Google Compute Engine along with some other fun tools, I can deploy over 40 GCP resources using Terraform!
Check Out My Architecture: CLICK ME
Check out the GitHub repo attached
In order to accurately measure & track user behaviour on our platform we moved over quickly from the initial solution using Google Analytics to a custom-built one due to resource & pricing concerns we had.
While this does sound complicated, it’s as easy as clients sending JSON blobs of events to Amazon Kinesis from where we use AWS Lambda & Amazon SQS to batch and process incoming events and then ingest them into Google BigQuery. Once events are stored in BigQuery (which usually only takes a second from the time the client sends the data until it’s available), we can use almost-standard-SQL to simply query for data while Google makes sure that, even with terabytes of data being scanned, query times stay in the range of seconds rather than hours. Before ingesting their data into the pipeline, our mobile clients are aggregating events internally and, once a certain threshold is reached or the app is going to the background, sending the events as a JSON blob into the stream.
In the past we had workers running that continuously read from the stream and would validate and post-process the data and then enqueue them for other workers to write them to BigQuery. We went ahead and implemented the Lambda-based approach in such a way that Lambda functions would automatically be triggered for incoming records, pre-aggregate events, and write them back to SQS, from which we then read them, and persist the events to BigQuery. While this approach had a couple of bumps on the road, like re-triggering functions asynchronously to keep up with the stream and proper batch sizes, we finally managed to get it running in a reliable way and are very happy with this solution today.
#ServerlessTaskProcessing #GeneralAnalytics #RealTimeDataProcessing #BigDataAsAService
We really drank the Google Kool-Aid on analytics. So, everything's going into Google BigQuery and almost everything is going straight into Google Cloud Pub/Sub and then doing some processing in Google Cloud Dataflow before ending up in BigQuery. We still do too much processing and augmentation on the front end before it goes into Pub/Sub. And that's using some kind of stuff we pulled together using Amazon DynamoDB and so on. And it's very brittle, actually. Actually, Dynamo throttling is one of our biggest headaches. So, I want all of that to go away and do all our augmentation in BigQuery after the data's been collected. And having it just go straight into Pub/Sub. So, we're working on that. And it'll happen, some time. #Analytics #AnalyticsPipeline
I use Amazon Athena because similar to Google BigQuery , you can store and query data easily. Especially since you can define data schema in the Glue data catalog, there's a central way to define data models.
However, I would not recommend for batch jobs. I typically use this to check intermediary datasets in data engineering workloads. It's good for getting a look and feel of the data along its ETL journey.
I use Google BigQuery because it makes is super easy to query and store data for analytics workloads. If you're using GCP, you're likely using BigQuery. However, running data viz tools directly connected to BigQuery will run pretty slow. They recently announced BI Engine which will hopefully compete well against big players like Snowflake when it comes to concurrency.
What's nice too is that it has SQL-based ML tools, and it has great GIS support!
I used dbt over manually setting up python wrappers around SQL scripts because it makes managing transformations within Google BigQuery much easier. This saves future Sung dozens of hours maintaining plumbing code to run a couple SQL queries. Check out my tutorial in the link!
I haven't seen any other tool make it as easy to run dependent SQL DAGs directly in a data warehouse.
Google BigQuery's Features
- All behind the scenes- Your queries can execute asynchronously in the background, and can be polled for status.
- Import data with ease- Bulk load your data using Google Cloud Storage or stream it in bursts of up to 1,000 rows per second.
- Affordable big data- The first Terabyte of data processed each month is free.
- The right interface- Separate interfaces for administration and developers will make sure that you have access to the tools you need.