
Context: I wanted to create an end-to-end IoT data pipeline simulation in Google Cloud IoT Core and other GCP services. I had never touched Terraform meaningfully until working on this project, and it's been one of the best explorations of my development career. The documentation and syntax are incredibly human-readable and friendly. I'm used to building infrastructure through the Google APIs via Python, but I'm so glad past Sung did not make that decision. I was tempted to use Google Cloud Deployment Manager, but the templates seemed a bit convoluted at first impression. I'm glad past Sung did not make that decision either.

Solution: Leveraging Google Cloud Build, Google Cloud Run, Google Cloud Bigtable, Google BigQuery, Google Cloud Storage, and Google Compute Engine, along with some other fun tools, I can deploy over 40 GCP resources using Terraform!
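To give a flavor of what that looks like, here's a hypothetical sketch of the kind of resources such a pipeline declares; the names, regions, and sizes are illustrative, not the actual repo config:

```hcl
provider "google" {
  project = var.project_id
  region  = "us-central1"
}

# Bucket for raw device payloads (illustrative name)
resource "google_storage_bucket" "raw_iot_data" {
  name     = "${var.project_id}-raw-iot-data"
  location = "US"
}

# BigQuery dataset for downstream analytics
resource "google_bigquery_dataset" "iot_analytics" {
  dataset_id = "iot_analytics"
  location   = "US"
}

# Bigtable instance for time-series writes
resource "google_bigtable_instance" "iot_timeseries" {
  name = "iot-timeseries"

  cluster {
    cluster_id   = "iot-cluster"
    zone         = "us-central1-b"
    num_nodes    = 3
    storage_type = "SSD"
  }
}
```

One `terraform apply` stands all of this up together, which is the point: the 40+ resources live in one declarative codebase instead of 40+ console clicks.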

Check Out My Architecture: CLICK ME

Check out the GitHub repo attached

GitHub - sungchun12/iot-python-webapp: Live, real-time dashboard in a serverless docker web app, and deployed via terraform with a built-in CICD trigger-See Mock Website (github.com)

I use GitHub because it's the coolest kid on the block for open source. Searching for repos you need/want is easy.

Especially with the Apache Foundation moving its workloads there, unlimited private repos, and a package registry on the way, GitHub is becoming the one-stop shop for open source needs.

I'm curious to see how GitHub Sponsors (Patreon for developers) plays out, and what it'll do for open source. Hopefully, they design it in a way where it's not abused by big tech to "plant" developers who look like they're building open source when they're actually building proprietary tools.


I use AWS Lambda because it is the most mature of the major cloud platforms for serverless functions. The fact that you can add VPC configs at the start is huge from a security perspective. However, it does take a lot of work to configure an Amazon VPC to work with AWS Secrets Manager and Lambda. It's also nice because it works so well with Amazon API Gateway.

I typically use it to connect with databases to insert and extract information for downstream analytics.
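As a minimal sketch of that insert-and-extract pattern, here's a hypothetical handler; it uses an in-memory SQLite stand-in where a real Lambda would build a database client from credentials fetched out of AWS Secrets Manager inside the VPC, and the event shape is made up:

```python
import json
import sqlite3

# Stand-in connection: a real Lambda would connect to RDS/another database
# using credentials retrieved from AWS Secrets Manager.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (device_id TEXT, temperature REAL)")

def lambda_handler(event, context):
    """Hypothetical handler: insert a reading, then return that device's rows."""
    body = json.loads(event["body"])
    conn.execute(
        "INSERT INTO readings VALUES (?, ?)",
        (body["device_id"], body["temperature"]),
    )
    rows = conn.execute(
        "SELECT device_id, temperature FROM readings WHERE device_id = ?",
        (body["device_id"],),
    ).fetchall()
    return {"statusCode": 200, "body": json.dumps(rows)}
```

Behind API Gateway, `event["body"]` carries the request payload, which is why the handler JSON-decodes it first.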

I won't be surprised if one day the majority of workloads run on this service. Not having to manage and maintain infrastructure is truly a blessing.


I use Apache Spark because it is THE framework for big data processing, from big tech to startups. It can be run on pretty much any platform, it's open source, and there are lots of community support and code samples to draw from.

The Python API is good for low-to-medium complexity transformations, but most recommend starting with Scala/Java to use the full Spark capabilities.

It comes with quite a learning curve to make sense of how data shuffles through different nodes, but it's worth it for running large-scale ETL.
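To give a feel for what "shuffling" means, here's a tiny pure-Python illustration (not Spark itself) of hash-partitioning records by key across nodes, which is roughly what happens before a groupBy or reduce stage; the sensor names and toy hash are made up:

```python
from collections import defaultdict

def shuffle_by_key(records, num_nodes):
    """Toy shuffle: route each (key, value) record to a 'node' based on a
    hash of its key, so all values for a given key end up co-located."""
    partitions = defaultdict(list)
    for key, value in records:
        node = sum(key.encode()) % num_nodes  # deterministic toy hash
        partitions[node].append((key, value))
    return dict(partitions)

records = [("sensor_a", 1), ("sensor_b", 2), ("sensor_a", 3)]
partitions = shuffle_by_key(records, num_nodes=2)
```

The expensive part in real Spark is that this routing happens over the network, which is why understanding where shuffles occur matters so much for performance.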

Also, keep in mind the streaming and batch frameworks are not unified, so you'll have to learn them both separately.


I used dbt over manually setting up Python wrappers around SQL scripts because it makes managing transformations within Google BigQuery much easier. This saves future Sung dozens of hours maintaining plumbing code just to run a couple of SQL queries. Check out my tutorial in the link!

I haven't seen any other tool make it as easy to run dependent SQL DAGs directly in a data warehouse.
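For context, this is the kind of plumbing dbt replaces: a hypothetical hand-rolled wrapper that topologically sorts dependent SQL models before running them (the model names and dependency graph are made up, standing in for what dbt infers from `ref()` calls):

```python
from graphlib import TopologicalSorter

# Hand-maintained model -> upstream dependencies; dbt derives this
# automatically from ref() calls inside each model's SQL.
models = {
    "stg_events": [],
    "stg_devices": [],
    "fct_device_events": ["stg_events", "stg_devices"],
    "rpt_daily_summary": ["fct_device_events"],
}

def run_order(dependency_graph):
    """Return an execution order where every model runs after its upstreams."""
    return list(TopologicalSorter(dependency_graph).static_order())

order = run_order(models)
```

Multiply this by scheduling, retries, and incremental logic, and the "dozens of hours" of maintenance above adds up fast.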

GitHub - sungchun12/dbt_bigquery_example: dbt(data build tool) tutorial on bigquery with extensive NOTES (github.com)

I use Python because it is one of the most versatile and easy-to-read programming languages. The open source community is vibrant, and there are so many tutorials and Medium blogs that it can be overwhelming, and that's a good problem to have!

I primarily use it for automating backend infrastructure tasks, data exploration via Jupyter, and data engineering development. It's great to maintain most of my stack in one language for consistency.

HOWEVER, when it comes to scaling data engineering workloads, performance degrades significantly compared to other languages like Java and Scala. You'll notice that most of the big tech companies use Scala or Java for Spark because the Python API is still a second-class citizen in new releases.

ANOTHER HOWEVER: I'm excited for the future of parallelism in Python and how it may replace complex Spark workloads. It's still young, but growing: Ray
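Ray itself aside, even the standard library hints at this direction; here's a minimal sketch that fans a per-record transformation across a worker pool instead of a serial loop (the workload and record shape are made-up placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

def transform(record):
    """Placeholder per-record transformation (made-up workload)."""
    return {"device_id": record["device_id"],
            "temp_f": record["temp_c"] * 9 / 5 + 32}

records = [{"device_id": f"d{i}", "temp_c": i} for i in range(8)]

# Fan records out across a pool of workers; frameworks like Ray
# generalize this same pattern across many machines.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(transform, records))
```

The appeal is that the parallel version reads almost exactly like the serial one, which is the ergonomic bar that Ray and friends are trying to hit at cluster scale.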


I use Terraform because it hits the sweet spot of being high-level yet flexible, and it's agnostic to cloud platforms. Creating complex infrastructure components for a solution through a UI console is tedious to repeat. Low-level APIs are usually specific to a cloud platform, and you still have to build your own tooling for deploying, state management, and destroying infrastructure.

However, Terraform is usually slower to support new services than the cloud-specific APIs. It's worth the trade-off though, especially if you're multi-cloud. I heard someone say, "We want to preference a cloud, not lock in to one." Terraform builds on that claim.


I use Amazon Athena because, similar to Google BigQuery, you can store and query data easily. Especially since you can define data schemas in the AWS Glue Data Catalog, there's a central way to define data models.

However, I would not recommend it for batch jobs. I typically use it to check intermediary datasets in data engineering workloads. It's good for getting a look and feel of the data along its ETL journey.


I use Google BigQuery because it makes it super easy to query and store data for analytics workloads. If you're using GCP, you're likely using BigQuery. However, running data viz tools directly connected to BigQuery can be pretty slow. They recently announced BI Engine, which will hopefully compete well against big players like Snowflake when it comes to concurrency.

What's nice too is that it has SQL-based ML tools and great GIS support!
