Avatar of Sung Won Chung

Decision about GitLab, Bitbucket, GitHub

Avatar of sungchun12

I use GitHub because it's the coolest kid on the block for open source. Searching for repos you need/want is easy.

Especially with the Apache Software Foundation moving its workloads there, unlimited private repos, and a package registry on the way, GitHub is becoming the one-stop shop for open source needs.

I'm curious to see how GitHub Sponsors (Patreon for developers) plays out, and what it'll do for open source. Hopefully, they design it so it can't be abused by big tech to "plant" developers who look like they're building open source when they're actually building proprietary tools.

Bitbucket GitLab

9 upvotes·3.4K views

Decision about Amazon API Gateway, AWS Secrets Manager, Amazon VPC, AWS Lambda

I use AWS Lambda because it is the most mature serverless function service among the major cloud platforms. The fact that you can add VPC configs at the start is huge from a security perspective. However, it does take a lot of work to configure the Amazon VPC to work with AWS Secrets Manager and Lambda. It's also nice because it works so well with Amazon API Gateway.

I typically use it to connect with databases to insert and extract information for downstream analytics.
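For a rough idea of what that looks like, here is a minimal sketch of a handler that pulls database credentials from AWS Secrets Manager before touching the database. The secret name `example/db-credentials`, the `DB_SECRET_ID` environment variable, and the `host` key inside the secret are all made up for illustration, and the actual database call is left as a comment because it depends on the driver you use.

```python
import json
import os


def load_db_credentials(secrets_client, secret_id):
    """Fetch a JSON-formatted secret and return it as a dict."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])


def lambda_handler(event, context, secrets_client=None):
    """Entry point; Lambda itself only passes event and context."""
    if secrets_client is None:
        # boto3 ships with the Lambda Python runtime; imported lazily so the
        # helper above can be exercised locally without it.
        import boto3

        secrets_client = boto3.client("secretsmanager")

    # Hypothetical secret name; set DB_SECRET_ID on the function to override.
    secret_id = os.environ.get("DB_SECRET_ID", "example/db-credentials")
    creds = load_db_credentials(secrets_client, secret_id)

    # A real function would open a database connection with `creds` here
    # and run its inserts or extracts.
    return {"statusCode": 200, "body": json.dumps({"db_host": creds["host"]})}
```

Passing the client in as an optional argument is just a convenience for local testing with a stub. If the function sits inside a VPC, it also needs a network path to Secrets Manager, which is where most of the configuration work mentioned above goes.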

I won't be surprised if one day the majority of workloads run on this service. Not having to manage and maintain infrastructure is truly a blessing.

7 upvotes·1.7K views

Decision about Python

I use Python because it is one of the most versatile and easy-to-read programming languages. The open source community is vibrant, and there are so many tutorials and Medium blogs that it can be overwhelming, which is a good problem to have!

I primarily use it for automating backend infrastructure tasks, data exploration via Jupyter, and data engineering development. It's great to maintain most of my stack in one language for consistency.

HOWEVER, when it comes to scaling data engineering workloads, performance degrades significantly compared to languages like Java and Scala. You'll notice that most of the big tech companies use Scala or Java for Spark because the Python API is still a second-class citizen in new releases.

ANOTHER HOWEVER, I'm excited for the future of parallelism in Python and how that may replace complex Spark workloads. It's still young, but growing: Ray
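To give a feel for it, here is a small sketch of fanning a plain Python function out across workers with Ray. The function names `square` and `parallel_squares` are made up for illustration, and it assumes Ray is installed (`pip install ray`) and running locally.

```python
def square(x):
    """Ordinary Python function; nothing Ray-specific about it."""
    return x * x


def parallel_squares(values):
    """Run square() over values as Ray tasks and collect the results."""
    import ray  # third-party; imported here so square() works without it

    ray.init(ignore_reinit_error=True)   # start or reuse a local Ray runtime
    remote_square = ray.remote(square)   # wrap the function as a remote task
    futures = [remote_square.remote(v) for v in values]
    return ray.get(futures)              # block until every task finishes
```

Calling `parallel_squares(range(5))` schedules five tasks and gathers their results in order. The appeal is that the function being parallelized stays regular Python.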

5 upvotes·1.1K views

Decision about Apache Spark

I use Apache Spark because it is THE framework for big data processing, from big tech to startups. It can be run on pretty much any platform. It's open source, and there's lots of community support and code samples to draw from.

The Python API is good for low- to medium-complexity transformations, but most recommend starting with Scala/Java to use Spark's full capabilities.

It comes with quite a learning curve to make sense of how data is shuffled between nodes, but it's worth it for running large-scale ETL.

Also, keep in mind the streaming and batch frameworks are not unified, so you'll have to learn them both separately.

5 upvotes·1K views

Decision about AWS CloudFormation, Google Cloud Deployment Manager, Terraform

I use Terraform because it hits the sweet spot of abstraction: it's high-level yet flexible, and it's agnostic to cloud platforms. Creating complex infrastructure components for a solution with a UI console is tedious to repeat. Low-level APIs are usually specific to a cloud platform, and you still have to build your own tooling for deploying, state management, and destroying infrastructure.

However, Terraform is usually slower to support new services compared to cloud-specific APIs. It's worth the trade-off though, especially if you're multi-cloud. I heard someone say, "We want to preference a cloud, not lock in to one." Terraform builds on that claim.

Terraform Google Cloud Deployment Manager AWS CloudFormation

4 upvotes·2.1K views

Decision about Sublime Text, Atom, Visual Studio Code

I use Visual Studio Code because it is a super flexible code editor that can be customized to function like a full IDE. It has great git and terminal integrations out of the box compared to Atom and Sublime Text.

It has so many extensions and boots up pretty fast even with all my extensions.

Feel free to check out my settings: VS Code Settings

4 upvotes·2K views

Decision about Snowflake, Google BigQuery

I use Google BigQuery because it makes it super easy to query and store data for analytics workloads. If you're using GCP, you're likely using BigQuery. However, data viz tools connected directly to BigQuery will run pretty slowly. They recently announced BI Engine, which will hopefully compete well against big players like Snowflake when it comes to concurrency.

What's nice too is that it has SQL-based ML tools, and it has great GIS support!
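As a sketch of the SQL-based ML side, this builds a BigQuery ML `CREATE MODEL` statement and submits it with the `google-cloud-bigquery` client. The helper names and the dataset, table, and label column you pass in are placeholders I made up, and linear regression is just one of the available model types.

```python
def build_create_model_sql(model_name, source_table, label_column):
    """Return a BigQuery ML statement that trains a linear regression model."""
    return (
        f"CREATE OR REPLACE MODEL `{model_name}`\n"
        f"OPTIONS (model_type = 'linear_reg', "
        f"input_label_cols = ['{label_column}']) AS\n"
        f"SELECT * FROM `{source_table}`"
    )


def train_model(sql):
    """Submit the statement and wait for training to finish."""
    # Third-party client (pip install google-cloud-bigquery); imported lazily
    # so the SQL builder can be used on its own.
    from google.cloud import bigquery

    client = bigquery.Client()   # uses default GCP credentials and project
    client.query(sql).result()   # result() blocks until the job completes
```

For example, `build_create_model_sql("my_dataset.trip_model", "my_dataset.trips", "fare")` produces a statement that trains on every column of the hypothetical `trips` table with `fare` as the label.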

4 upvotes·789 views

Decision about Travis CI, CircleCI, Google Cloud Build

I use Google Cloud Build because it's my first foray into the CI/CD world (loving it so far), and I wanted to work with something GCP native to avoid giving permissions to other SaaS tools like CircleCI and Travis CI.

I really like it because the first 120 build minutes per day are free, and it's one of the few CI/CD tools that enterprises are open to using since it's contained within GCP.

One of the unique things is that it has the Kaniko cache, which speeds up builds by caching the intermediate layers of the Docker image so unchanged layers are reused instead of rebuilding the full thing from the start. Helpful when you're installing just a few additional dependencies.

Feel free to check out an example: Cloudbuild Example

4 upvotes·718 views

Decision about Google BigQuery, Amazon Athena

I use Amazon Athena because, similar to Google BigQuery, you can store and query data easily. Especially since you can define data schemas in the Glue Data Catalog, there's a central way to define data models.

However, I would not recommend it for batch jobs. I typically use it to check intermediary datasets in data engineering workloads. It's good for getting a look and feel of the data along its ETL journey.
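A quick spot check like that can be sketched with boto3's Athena client: start the query, then poll until it reaches a terminal state. The helper name `run_athena_query` and the database, SQL, and S3 output location you pass in are placeholders for illustration.

```python
import time


def run_athena_query(athena_client, sql, database, output_location,
                     poll_seconds=1.0):
    """Start an Athena query and poll until it finishes.

    Returns (query_execution_id, final_state).
    """
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)  # still queued or running
```

With a real client this would be called as `run_athena_query(boto3.client("athena"), "SELECT * FROM staging_events LIMIT 10", "my_database", "s3://my-bucket/athena-results/")`, where every name is hypothetical. Results land in the S3 output location.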

4 upvotes·645 views

Decision about Google App Engine

I use Google App Engine because it's great for setting up Flask applications for some of my data workloads. The auto-scaling feature is really nice and abstracts away a lot of the tedium that comes with setting up an application.

Back in the day, using a cron job on App Engine was the easiest way to automate tasks on GCP. Thankfully, that's not the case anymore. For those that want to see: Cron Job on App Engine
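For context, the pattern looks roughly like this: a Flask route that App Engine's cron service calls on a schedule, guarded by the `X-Appengine-Cron` header that App Engine attaches to cron requests. This is a sketch rather than the code behind the link above; the route path `/tasks/refresh` and the function names are made up, and the schedule itself would live in a separate `cron.yaml` file.

```python
def is_cron_request(headers):
    """True if the request carries App Engine's cron header."""
    return headers.get("X-Appengine-Cron") == "true"


def create_app():
    """Build the Flask app; Flask is imported lazily (pip install flask)."""
    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/tasks/refresh")  # hypothetical path targeted by cron.yaml
    def refresh():
        if not is_cron_request(request.headers):
            return "Forbidden", 403
        # The scheduled work (for example, a data refresh) would run here.
        return "ok", 200

    return app
```

The header check keeps outside callers from triggering the task by hitting the URL directly.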

4 upvotes·641 views