AWS Glue vs Databricks

Overview

AWS Glue: 464 stacks, 819 followers, 9 votes

Databricks: 532 stacks, 768 followers, 8 votes

AWS Glue vs Databricks: What are the differences?

AWS Glue and Databricks are both widely used for data processing and analytics, but they differ in scope, scaling model, ecosystem, machine learning support, and pricing. The key differences are:

Managed Service vs Collaborative Workspace: AWS Glue is a fully managed ETL (Extract, Transform, Load) service from Amazon Web Services. It automates much of the work of discovering, cataloging, and transforming data into a usable format. Databricks is a collaborative workspace built around a unified analytics platform that combines data engineering with advanced analytics, machine learning, and visualization.
Scalability and Flexibility: AWS Glue is serverless and designed to process large volumes of data. You choose a worker type and worker count per job, and with auto scaling enabled Glue can add and remove workers as the workload changes. Databricks runs on clusters that you size and configure yourself, with optional autoscaling. Because the data lives in cloud object storage rather than on the cluster, compute and storage scale independently, which gives more granular control over resource allocation.
Data Lake vs Data Warehouse: AWS Glue is often used to build data lakes by consolidating data from various sources and making it available for analysis. It is well integrated with other AWS services such as Amazon S3, Redshift, and Athena, which simplifies data ingestion and transformation. Databricks takes a lakehouse approach instead: it layers warehouse-style capabilities such as ACID transactions and schema enforcement on top of data lake storage through Delta Lake (an open storage layer, not a warehouse product), with Apache Spark as the processing engine.
Integration with Ecosystem: AWS Glue integrates closely with other AWS services, so you can build end-to-end pipelines that combine Glue Spark ETL jobs with services such as AWS Lambda and AWS Step Functions. Databricks runs on multiple clouds and integrates with a range of third-party tools and services (for example MLflow, Delta Lake, Kafka, and TensorFlow), which makes it easier to connect to different data sources and systems.
Machine Learning Capabilities: Databricks provides extensive support for machine learning and advanced analytics through tools such as Spark MLlib and MLflow. It offers a collaborative environment in which data scientists and data engineers can build, deploy, and manage machine learning models. AWS Glue is primarily focused on data processing and ETL and offers far fewer built-in machine learning capabilities.
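Pricing Model: AWS Glue bills ETL jobs, crawlers, and development endpoints by the DPU-hour (data processing unit), metered per second with a minimum duration. It charges for compute time, not for the volume of data processed. The Data Catalog is billed separately for stored objects and requests beyond a free tier. Databricks also follows a consumption-based model: you pay for the Databricks Units (DBUs) your workloads consume, typically plus the cost of the underlying cloud compute and storage.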
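
To make the two consumption-based models concrete, here is a minimal sketch that estimates the cost of a single job run under each. The function names and every rate in it are illustrative assumptions, not published prices; actual rates vary by region, tier, and instance type, and minimum billing durations are ignored.

```python
# Rough per-run cost estimates for the two pricing models.
# All rates passed in below are placeholder assumptions for illustration only.


def glue_job_cost(dpus: int, runtime_minutes: float, usd_per_dpu_hour: float) -> float:
    """Glue-style billing: DPUs x hours x rate per DPU-hour."""
    hours = runtime_minutes / 60
    return dpus * hours * usd_per_dpu_hour


def databricks_job_cost(
    nodes: int,
    runtime_minutes: float,
    dbu_per_node_hour: float,
    usd_per_dbu: float,
    vm_usd_per_node_hour: float,
) -> float:
    """Databricks-style billing: DBU charge plus the cloud provider's VM charge."""
    hours = runtime_minutes / 60
    platform = nodes * dbu_per_node_hour * hours * usd_per_dbu
    infrastructure = nodes * vm_usd_per_node_hour * hours
    return platform + infrastructure


if __name__ == "__main__":
    # Hypothetical 30-minute job; every number here is an assumption.
    glue = glue_job_cost(dpus=10, runtime_minutes=30, usd_per_dpu_hour=0.44)
    dbx = databricks_job_cost(
        nodes=4,
        runtime_minutes=30,
        dbu_per_node_hour=1.0,
        usd_per_dbu=0.15,
        vm_usd_per_node_hour=0.20,
    )
    print(f"Glue estimate: ${glue:.2f}")
    print(f"Databricks estimate: ${dbx:.2f}")
```

In both cases the cost is driven by how much compute runs for how long, so tuning job runtime and cluster size matters more than raw data volume.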

In summary, AWS Glue is a fully managed ETL service focusing on data integration and processing in the AWS ecosystem, while Databricks is a collaborative workspace that provides a unified analytics platform with powerful machine learning capabilities. The choice between the two depends on your specific use case, whether you need a fully managed service for ETL or a collaborative environment for advanced analytics and machine learning.


Advice on AWS Glue, Databricks

Vamshi

Data Engineer at Tata Consultancy Services

May 29, 2020

Needs advice on PySpark, Azure Data Factory, Databricks

I have to collect different data from multiple sources and store them in a single cloud location. Then perform cleaning and transforming using PySpark, and push the end results to other applications like reporting tools, etc. What would be the best solution? I can only think of Azure Data Factory + Databricks. Are there any alternatives to #AWS services + Databricks?

269k views

datocrats-org

Jul 29, 2020

Needs advice on Amazon EC2, Tableau, PowerBI

We need to perform ETL from several databases into a data warehouse or data lake. We want to:

keep raw and transformed data available to users so they can draft their own queries efficiently
support custom user permissions and SSO
move between open-source on-premises development and cloud-based production environments

We want to use only inexpensive Amazon EC2 instances on a medium-sized data set (16 GB to 32 GB) feeding into Tableau Server or PowerBI for reporting and data analysis.

319k views

Pavithra

Mar 12, 2020

Needs advice on Amazon S3, Amazon Athena, Amazon Redshift

Hi all,

Currently, we need to ingest data from Amazon S3 into either Amazon Athena or Amazon Redshift. The problem is that the data is in .PSV (pipe-separated values) format and is over 200 GB in size. Query performance in Athena/Redshift is not up to the mark: queries time out or run too slowly compared to Google BigQuery. How can I optimize performance and query response time? Can anyone please help me out?
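
For readers unfamiliar with the format named in this question: a .PSV file is ordinary delimited text that uses `|` as the field separator. The sketch below parses it with Python's standard `csv` module. The sample rows and the `read_psv` helper are made up for illustration and say nothing about the poster's actual data or pipeline.

```python
import csv
import io

# Hypothetical sample data; the real files in the question are not shown.
SAMPLE_PSV = (
    "order_id|customer|amount\n"
    "1001|Acme Corp|250.00\n"
    "1002|Globex|99.50\n"
)


def read_psv(text_stream):
    """Parse pipe-separated text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(text_stream, delimiter="|")
    return list(reader)


rows = read_psv(io.StringIO(SAMPLE_PSV))
for row in rows:
    print(row["order_id"], row["customer"], row["amount"])
```

The same delimiter setting is what any loader or query engine has to be told when it reads such files.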

522k views

Detailed Comparison

Description

AWS Glue: A fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics.

Databricks: Databricks Unified Analytics Platform, from the original creators of Apache Spark™, unifies data science and engineering across the machine learning lifecycle, from data preparation to experimentation and deployment of ML applications.

Key features

AWS Glue:
Easy: AWS Glue automates much of the effort in building, maintaining, and running ETL jobs. It crawls your data sources, identifies data formats, and suggests schemas and transformations. It automatically generates the code to execute your data transformations and loading processes.
Integrated: AWS Glue is integrated across a wide range of AWS services.
Serverless: There is no infrastructure to provision or manage. AWS Glue handles provisioning, configuration, and scaling of the resources required to run your ETL jobs on a fully managed, scale-out Apache Spark environment. You pay only for the resources used while your jobs are running.
Developer Friendly: AWS Glue generates ETL code that is customizable, reusable, and portable, using familiar technology: Scala, Python, and Apache Spark. You can also import custom readers, writers, and transformations into your Glue ETL code. Since the code AWS Glue generates is based on open frameworks, there is no lock-in. You can use it anywhere.

Databricks:
Built on Apache Spark and optimized for performance
Reliable and Performant Data Lakes
Interactive Data Science and Collaboration
Data Pipelines and Workflow Automation
End-to-End Data Security and Compliance
Compatible with Common Tools in the Ecosystem
Unparalleled Support by the Leading Committers of Apache Spark

Statistics

AWS Glue: Stacks 464, Followers 819, Votes 9
Databricks: Stacks 532, Followers 768, Votes 8

Pros & Cons

AWS Glue pros: Managed Hive Metastore (10)
Databricks pros: Multicloud (1), Data stays in your cloud account (1), Security (1), Best Performances on large datasets (1), Databricks doesn't get access to your data (1)

Integrations

AWS Glue: Amazon Redshift, Amazon S3, Amazon RDS, Amazon Athena, MySQL, Microsoft SQL Server, Amazon EMR, Amazon Aurora, Oracle, Amazon RDS for PostgreSQL
Databricks: MLflow, Delta Lake, Kafka, Apache Spark, TensorFlow, Hadoop, PyTorch, Keras
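
The feature list above says that AWS Glue crawls data sources and suggests schemas. As rough intuition for what schema suggestion involves, here is a toy sketch that infers a column type from sample string values and widens it when values disagree. This is not Glue's crawler logic; the function names and the type labels are assumptions chosen for illustration.

```python
# Toy schema inference: not AWS Glue's algorithm, just the general idea.

# Wider types rank higher; a column takes the widest type seen in its values.
TYPE_RANK = {"bigint": 0, "double": 1, "string": 2}


def infer_type(value: str) -> str:
    """Return the narrowest type label that can represent a raw string value."""
    for caster, label in ((int, "bigint"), (float, "double")):
        try:
            caster(value)
            return label
        except ValueError:
            continue
    return "string"


def suggest_schema(records):
    """Map each column name to the widest type observed across all records."""
    schema = {}
    for record in records:
        for column, value in record.items():
            candidate = infer_type(value)
            current = schema.get(column)
            if current is None or TYPE_RANK[candidate] > TYPE_RANK[current]:
                schema[column] = candidate
    return schema


# Hypothetical sample rows, as a crawler might read them from a delimited file.
sample = [
    {"id": "1", "price": "9.5", "name": "widget"},
    {"id": "2", "price": "10", "name": "gadget"},
]
print(suggest_schema(sample))
```

A real crawler also has to detect file formats, sample large data sets, and handle nested structures, which this sketch leaves out.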

What are some alternatives to AWS Glue, Databricks?

Google Analytics

Google Analytics lets you measure your advertising ROI as well as track your Flash, video, and social networking sites and applications.

Mixpanel

Mixpanel helps companies build better products through data. With our powerful, self-serve product analytics solution, teams can easily analyze how and why people engage, convert, and retain to improve their user experience.

Apache Spark

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

Piwik

Matomo (formerly Piwik) is a full-featured PHP MySQL software program that you download and install on your own webserver. At the end of the five-minute installation process, you will be given a JavaScript code.

Presto

Distributed SQL Query Engine for Big Data

Amazon Athena

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

Apache Flink

Apache Flink is an open source system for fast and versatile data analytics in clusters. Flink supports batch and streaming analytics, in one system. Analytical programs can be written in concise and elegant APIs in Java and Scala.

lakeFS

lakeFS is an open-source data version control system for data lakes. It provides a “Git for data” platform that lets you apply software engineering best practices to your data lake, including branching and merging, CI/CD, and production-like dev/test environments.

Druid

Druid is a distributed, column-oriented, real-time analytics data store that is commonly used to power exploratory dashboards in multi-tenant environments. Druid excels as a data warehousing solution for fast aggregate queries on petabyte-sized data sets. It supports a variety of flexible filters, exact calculations, approximate algorithms, and other useful calculations.

Clicky

Clicky Web Analytics gives bloggers and smaller websites a more personal understanding of their visitors. Clicky has various features that help set it apart from the competition, specifically Spy and RSS feeds, which allow website owners to get live information about their visitors.
