AWS Glue vs Airflow: What are the differences?
AWS Glue and Airflow are two popular tools utilized for data processing and workflow management. Let's explore the key differences between the two:
Data Processing Approach: AWS Glue is a fully managed extract, transform, load (ETL) service provided by Amazon Web Services. It follows a serverless architecture in which AWS handles the infrastructure. It supports automated schema discovery and offers an integrated metadata catalog. Airflow, on the other hand, is an open-source workflow orchestration platform maintained by the Apache Software Foundation. It takes a code-driven approach in which workflows are defined as Python scripts, providing flexibility for building complex data pipelines.
Scalability and Elasticity: AWS Glue is built to scale automatically based on the workload, without requiring manual provisioning or configuration. It can handle large data volumes and parallel processing across multiple nodes. Airflow, being an open-source tool, can be deployed on any infrastructure and scaled based on the available resources. However, the scaling process may require additional manual effort compared to AWS Glue.
Integration with AWS Services: As a native AWS service, AWS Glue seamlessly integrates with other AWS services like S3, Redshift, Athena, and more. It provides direct connectivity to these services out of the box, allowing easy data extraction and loading. In contrast, Airflow supports a wide range of integrations with various services and databases through custom operators and hooks. While it offers flexibility to work with different systems, it may require additional setup and configuration to establish connections.
Managed vs Self-Hosted: AWS Glue is fully managed by AWS, which handles infrastructure management, scaling, and monitoring. This relieves users from the operational overhead of maintaining and scaling the infrastructure. Airflow, on the other hand, is typically self-hosted, with users responsible for setting up and managing the infrastructure (although managed offerings such as Amazon MWAA exist). Self-hosting provides greater control over the environment but requires additional effort and expertise for maintenance.
Ecosystem and Community Support: AWS Glue is a part of the extensive AWS ecosystem, which includes various services and solutions. It offers built-in integration with AWS data lakes, analytics tools, and machine learning capabilities. Airflow, being an open-source project, has a vibrant community supporting its development. It has a wide range of plugins and extensions contributed by the community, which extends its functionality and addresses specific use cases.
Pricing Model and Cost: AWS Glue follows a pay-as-you-go pricing model, billed for the data processing units (DPUs) consumed by crawlers and ETL job runs, plus Data Catalog storage and requests. In contrast, Airflow itself is free open-source software; the cost primarily comes from the infrastructure and resources needed to host and scale it, and varies with the deployment model and infrastructure choices.
In summary, AWS Glue provides a serverless, managed solution, tightly integrated with AWS services, while Airflow offers a code-driven, flexible approach with a vibrant open-source community.
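To make the contrast concrete, here is a minimal sketch of kicking off a Glue job from Python with boto3. The job name, region, and arguments are hypothetical placeholders, and it assumes a Glue ETL job already exists with the right IAM permissions; it is an illustration, not a complete pipeline.

```python
import boto3

# Assumes a Glue ETL job named "nightly-etl" already exists in your account
# and that your environment has credentials with glue:StartJobRun permission.
glue = boto3.client("glue", region_name="us-east-1")

response = glue.start_job_run(
    JobName="nightly-etl",                                # hypothetical job name
    Arguments={"--source_path": "s3://my-bucket/raw/"},   # hypothetical argument
)

run_id = response["JobRunId"]
state = glue.get_job_run(JobName="nightly-etl", RunId=run_id)["JobRun"]["JobRunState"]
print(f"Started Glue job run {run_id}, current state: {state}")
```

With Airflow, the equivalent logic would live in a Python DAG that you host and schedule yourself (or via a managed Airflow service).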
We need to perform ETL from several databases into a data warehouse or data lake. We want to
- keep raw and transformed data available to users to draft their own queries efficiently
- give users custom permissions and SSO
- move between open-source on-premises development and cloud-based production environments
We want to use inexpensive Amazon EC2 instances only, on medium-sized data sets (16 GB to 32 GB) feeding into Tableau Server or Power BI for reporting and data analysis purposes.
You could also use AWS Lambda with a CloudWatch Events schedule if you know when the function should be triggered. The benefit is that you can use any language and the respective database client.
But if you need to orchestrate ETLs, then it makes sense to use Apache Airflow. This requires Python knowledge.
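As a rough illustration of the Airflow option, here is a minimal DAG sketch (Airflow 2.x). The DAG name, task names, and the extract/load functions are hypothetical placeholders, not a full ETL implementation.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Hypothetical placeholder: pull rows from a source database here.
    print("extracting...")


def load():
    # Hypothetical placeholder: write transformed rows to the warehouse here.
    print("loading...")


with DAG(
    dag_id="example_etl",            # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task        # dependency: extract runs before load
```

The same DAG file works on a local open-source deployment for development and on a managed Airflow environment in production, which matches the on-premises-to-cloud requirement above.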
Though we have always built something custom, Apache Airflow (https://airflow.apache.org/) stood out as a key contender/alternative among open-source options. On the commercial side, Amazon Redshift combined with Amazon Kinesis (for complex manipulations) is great for BI, though Redshift itself is expensive.
You may want to look into a data virtualization product called Conduit. It connects to disparate data sources in AWS, on-prem, Azure, and GCP, and exposes them as a single unified Spark SQL view to Power BI (DirectQuery) or Tableau. It allows auto-query and caching policies to improve query speed and experience, has a GPU query engine with optimized Spark as a fallback, and can be deployed on your AWS VM or on-prem, scaling up and out. Sounds like the ideal solution to your needs.
I have to collect different data from multiple sources and store them in a single cloud location. Then perform cleaning and transforming using PySpark, and push the end results to other applications like reporting tools, etc. What would be the best solution? I can only think of Azure Data Factory + Databricks. Are there any alternatives to #AWS services + Databricks?
Hi all,
Currently, we need to ingest data from Amazon S3 into a database, either Amazon Athena or Amazon Redshift. The problem is that the data is in .PSV (pipe-separated values) format and its size is above 200 GB. Query performance in Athena/Redshift is not up to the mark: queries time out or run too slowly compared to Google BigQuery. How would I optimize the performance and query response time? Can anyone please help me out?
You can use the AWS Glue service to convert your pipe-delimited data to Parquet format, which also gives you data compression. Since your data is very large, you should choose Redshift to copy your data into. To manage your data, partition it in the S3 bucket and also distribute it across the Redshift cluster.
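A minimal PySpark sketch of that PSV-to-Parquet conversion (roughly what a Glue ETL job script would do); the bucket paths and the partition column are hypothetical assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("psv_to_parquet").getOrCreate()

# Read the pipe-separated files (hypothetical bucket/prefix).
df = (
    spark.read
    .option("sep", "|")
    .option("header", "true")
    .csv("s3://my-bucket/raw/psv/")
)

# Write compressed, partitioned Parquet (assumes an 'event_date' column exists).
(
    df.write
    .mode("overwrite")
    .partitionBy("event_date")
    .option("compression", "snappy")
    .parquet("s3://my-bucket/curated/parquet/")
)
```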
First of all, you should make your choice between Redshift and Athena based on your use case, since they are two very different services: Redshift is an enterprise-grade MPP data warehouse, while Athena is a SQL layer on top of S3 with limited performance. If performance is a key factor, users are going to execute unpredictable queries, and direct and management costs are not a problem, I'd definitely go for Redshift. If performance is not so critical and queries will be somewhat predictable, I'd go for Athena.
Once you select the technology, you'll need to optimize your data so that queries execute as fast as possible. In both cases you may need to adapt the data model to better fit your queries. If you go for Athena, you'd also probably need to change your file format to Parquet or Avro and review your partition strategy depending on your most frequent type of query. If you choose Redshift, you'll need to ingest the data from your files into it and maybe carry out some tuning tasks for performance gains.
I'd recommend Redshift for now since it can address a wider range of use cases, but we could give you better advice if you described your use case in more depth.
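If Redshift is the choice, ingesting the converted files typically comes down to a COPY from S3. Here is a minimal sketch using psycopg2; the cluster endpoint, database, table, bucket path, and IAM role are all hypothetical placeholders.

```python
import psycopg2

# Hypothetical connection details for the Redshift cluster.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="...",  # prefer Secrets Manager or IAM-based auth in practice
)

# COPY loads the Parquet files in parallel across the cluster's slices.
# The target table's columns must match the Parquet schema.
copy_sql = """
    COPY analytics.events
    FROM 's3://my-bucket/curated/parquet/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS PARQUET;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)
```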
It depends on the nature of your data (structured or not?) and of course on your queries (ad hoc or predictable?). For example, you can look at partitioning and columnar formats to maximize MPP capabilities for both Athena and Redshift.
You can convert your PSV-format data to the Parquet file format with AWS Glue, and your query performance will improve.
I am so confused. I need a tool that will allow me to go to about 10 different URLs to get a list of objects. Those object lists will be hundreds or thousands in length. I then need to get detailed data about each object. Those detailed data lists can have hundreds of elements that could be map/reduced somehow. My batch process sometimes dies halfway through, which means hours of processing gone, i.e. time wasted. I need something like a directed graph that will keep the results of successful data collection and allow me, either programmatically or manually, to retry the failed ones some number (0 - forever) of times. I then want it to process all the ones that have succeeded or been effectively ignored and load the data store with the aggregation of some couple thousand data points. I know hitting this many endpoints is not good practice, but I can't put collectors on all the endpoints or anything like that. It is pretty much the only way to get the data.
For a non-streaming approach:
You could consider using more checkpoints throughout your Spark jobs. Furthermore, you could separate your workload into multiple jobs with an intermediate data store (Cassandra, for example, or another store based on your preference and availability) to hold results, perform aggregations, and store the results of those.
- Spark Job 1: fetch data from the 10 URLs and store the data and metadata in a data store (Cassandra).
- Spark Jobs 2..n: check the data store for unprocessed items and continue the aggregation.
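A rough sketch of that first job, assuming the spark-cassandra-connector package is available on the cluster; the endpoints, keyspace, and table are hypothetical placeholders, and failures are recorded rather than crashing the batch so they can be retried by a later job.

```python
import requests
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("fetch_objects").getOrCreate()

# Hypothetical list of the ~10 endpoints to pull object lists from.
urls = ["https://example.com/api/objects?page=%d" % i for i in range(10)]

def fetch(url):
    # Mark failures instead of failing the whole batch, so they can be retried later.
    try:
        resp = requests.get(url, timeout=30)
        return Row(url=url, status="ok", payload=resp.text)
    except Exception as exc:
        return Row(url=url, status="failed", payload=str(exc))

results = spark.sparkContext.parallelize(urls).map(fetch).toDF()

# Persist raw results plus status so later jobs pick up only unprocessed/failed items.
(
    results.write
    .format("org.apache.spark.sql.cassandra")
    .options(table="raw_fetches", keyspace="etl")   # hypothetical keyspace/table
    .mode("append")
    .save()
)
```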
Alternatively, for a streaming approach: treating your data as a stream might also be useful. Spark Streaming allows you to use a checkpoint interval - https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
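A minimal sketch of enabling checkpointing with the classic Spark Streaming (DStream) API referenced in that guide; the checkpoint directory, batch interval, and socket source are hypothetical placeholders.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def create_context():
    # Build the whole streaming pipeline inside this function so it can be
    # reconstructed from the checkpoint after a failure.
    sc = SparkContext(appName="checkpointed_stream")
    ssc = StreamingContext(sc, batchDuration=60)           # 60-second batches (assumed)
    ssc.checkpoint("s3://my-bucket/checkpoints/stream/")   # hypothetical checkpoint dir

    lines = ssc.socketTextStream("localhost", 9999)        # placeholder source
    lines.count().pprint()
    return ssc

# Recover from an existing checkpoint if present, otherwise start fresh.
ssc = StreamingContext.getOrCreate("s3://my-bucket/checkpoints/stream/", create_context)
ssc.start()
ssc.awaitTermination()
```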
Pros of Airflow
- Features (53)
- Task Dependency Management (14)
- Beautiful UI (12)
- Cluster of workers (12)
- Extensibility (10)
- Open source (6)
- Complex workflows (5)
- Python (5)
- Good API (3)
- Apache project (3)
- Custom operators (3)
- Dashboard (2)
Pros of AWS Glue
- Managed Hive Metastore (9)
Cons of Airflow
- Observability is not great when the DAGs exceed 250 (2)
- Running it on a Kubernetes cluster is relatively complex (2)
- Open source - provides minimum or no support (2)
- Logical separation of DAGs is not straightforward (1)