Alternatives to AWS Glue

AWS Data Pipeline, Airflow, Apache Spark, Talend, and Alooma are the most popular alternatives and competitors to AWS Glue.

What is AWS Glue and what are its top alternatives?

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics.
AWS Glue is a tool in the Big Data Tools category of a tech stack.

AWS Glue alternatives & related posts

AWS Data Pipeline

Process and move data between different AWS compute and storage services

Airflow

A platform to programmatically author, schedule, and monitor data pipelines, by Airbnb

related Apache Spark posts

Eric Colson
Chief Algorithms Officer at Stitch Fix | 19 upvotes 346.6K views
Kafka, PostgreSQL, Amazon S3, Apache Spark, Presto, Python, R, PyTorch, Docker, Amazon EC2 Container Service
#AWS #Etl #ML #DataScience #DataStack #Data

The algorithms and data infrastructure at Stitch Fix are housed in #AWS. Data acquisition is split between events flowing through Kafka and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3-based data warehouse. Apache Spark on YARN is our tool of choice for data movement and #ETL. Because our storage layer (S3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling YARN clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for ad hoc queries and dashboards.

Beyond data movement and ETL, most #ML-centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in-house and open sourced (see https://github.com/stitchfix/flotilla-os).

At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated into systems. That requires a serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan gives our data scientists the ability to quickly productionize the models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn) by automatically packaging them as Docker containers and deploying them to Amazon ECS. This gives our data scientists a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.

For more info:

#DataScience #DataStack #Data

Conor Myhrvold
Tech Brand Mgr, Office of CTO at Uber | 5 upvotes 161K views
Kafka, Kafka Manager, Hadoop, Apache Spark, GitHub

Why we built Marmaray, an open source generic data ingestion and dispersal framework and library for Apache Hadoop:

Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse it to any sink using Apache Spark. The name, Marmaray, comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:

https://eng.uber.com/marmaray-hadoop-ingestion-open-source/

(Direct GitHub repo: https://github.com/uber/marmaray)
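
The "any source to any sink" plug-in idea can be illustrated with a small sketch. This is not Marmaray's API (Marmaray runs on the JVM with Apache Spark); the registries, the function names, and the in-memory source and sink below are all hypothetical and exist only to show the shape of such a framework.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]

# Hypothetical registries: each plug-in registers a reader or writer by name.
SOURCES: Dict[str, Callable[[], Iterable[Record]]] = {}
SINKS: Dict[str, Callable[[Iterable[Record]], int]] = {}


def register_source(name: str, reader: Callable[[], Iterable[Record]]) -> None:
    SOURCES[name] = reader


def register_sink(name: str, writer: Callable[[Iterable[Record]], int]) -> None:
    SINKS[name] = writer


def run_pipeline(source: str, sink: str) -> int:
    """Read every record from the named source and hand it to the named sink.

    Returns the number of records the sink reports as written.
    """
    return SINKS[sink](SOURCES[source]())


# Example plug-ins (illustrative only): an in-memory source and a list sink.
collected: List[Record] = []


def memory_source() -> Iterator[Record]:
    yield {"id": 1, "city": "Istanbul"}
    yield {"id": 2, "city": "Ankara"}


def list_sink(records: Iterable[Record]) -> int:
    before = len(collected)
    collected.extend(records)
    return len(collected) - before


register_source("memory", memory_source)
register_sink("list", list_sink)
```

Calling `run_pipeline("memory", "list")` moves both records into `collected`. Supporting a new system then means registering one more reader or writer, without touching the pipeline function.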

Talend

A single, unified suite for all integration needs

Alooma

Integrate any data source, such as databases, applications, and any API, with your own Amazon Redshift

related Amazon Athena posts

Amazon Athena, Google BigQuery

I use Amazon Athena because, similar to Google BigQuery, you can store and query data easily. Especially since you can define data schemas in the Glue Data Catalog, there's a central way to define data models.

However, I would not recommend it for batch jobs. I typically use it to check intermediary datasets in data engineering workloads. It's good for getting a look and feel of the data along its ETL journey.

Apache Flink

Fast and reliable large-scale data processing engine

related Apache Flink posts

Surabhi Bhawsar
Technical Architect at Pepcus | 6 upvotes 44.1K views
Kafka, Apache Flink

I need to build an alert and notification framework using a scheduled program. We will analyze the events from a database table, filter the events that fall within a one-day timespan, and send these event messages over email. Currently, we are using Kafka Pub/Sub for messaging. The customer wants us to move to Apache Flink, and I am trying to understand how Apache Flink could be a better fit for us.
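
The filtering step described in this question can be sketched in plain Python, independent of Kafka or Flink. The event shape (`created_at`, `message`) and both function names are assumptions made for illustration. A real job would read rows from the database table and pass the formatted messages to an email sender.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def events_in_last_day(events: List[Dict], now: datetime) -> List[Dict]:
    """Keep events whose timestamp falls within the 24 hours up to `now`."""
    cutoff = now - timedelta(days=1)
    return [e for e in events if cutoff <= e["created_at"] <= now]


def format_alert(event: Dict) -> str:
    """Render one event as a line suitable for an email body."""
    return f"[{event['created_at']:%Y-%m-%d %H:%M}] {event['message']}"


# Hypothetical sample data standing in for rows from the events table.
now = datetime(2024, 1, 2, 12, 0)
events = [
    {"created_at": datetime(2024, 1, 2, 9, 30), "message": "disk almost full"},
    {"created_at": datetime(2023, 12, 30, 8, 0), "message": "old event"},
]
alerts = [format_alert(e) for e in events_in_last_day(events, now)]
```

Only the first sample event survives the filter, so `alerts` holds a single line. Passing `now` in as a parameter, rather than calling `datetime.now()` inside the function, keeps the scheduled run reproducible and easy to test.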

Apache Hive

Data Warehouse Software for Reading, Writing, and Managing Large Datasets

related Apache Hive posts

Ashish Singh
Tech Lead, Big Data Platform at Pinterest | 19 upvotes 30.7K views
Apache Hive, Kubernetes, Kafka, Amazon S3, Amazon EC2, Presto
#DataScience #DataEngineering #AWS #BigData

To meet employees' critical need for interactive querying, we've worked with Presto, an open-source distributed SQL query engine, over the years. Operating Presto at Pinterest's scale has involved resolving quite a few challenges, like supporting deeply nested and huge Thrift schemas, slow/bad worker detection and remediation, cluster auto-scaling, graceful cluster shutdown, and impersonation support for the LDAP authenticator.

Our infrastructure is built on top of Amazon EC2 and we leverage Amazon S3 for storing our data. This separates compute and storage layers, and allows multiple compute clusters to share the S3 data.

We have hundreds of petabytes of data and tens of thousands of Apache Hive tables. Our Presto clusters comprise a fleet of 450 r4.8xl EC2 instances. Together, the Presto clusters have over 100 TB of memory and 14K vCPU cores. Within Pinterest, roughly 1,000 monthly active users (out of 1,600+ Pinterest employees) use Presto, running about 400K queries on these clusters per month.
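
As a quick sanity check on these figures, the totals can be recomputed from the instance count. The sketch assumes that "r4.8xl" refers to the r4.8xlarge instance type, which AWS lists at 32 vCPUs and 244 GiB of memory. The per-user query rate is simple division on the quoted numbers and is not a figure from the post.

```python
# Assumption: "r4.8xl" means r4.8xlarge (32 vCPUs, 244 GiB memory per instance).
INSTANCES = 450
VCPUS_PER_INSTANCE = 32
MEMORY_GIB_PER_INSTANCE = 244

total_vcpus = INSTANCES * VCPUS_PER_INSTANCE             # 14,400 vCPUs
total_memory_gib = INSTANCES * MEMORY_GIB_PER_INSTANCE   # 109,800 GiB
total_memory_tib = total_memory_gib / 1024               # about 107 TiB

# Rough monthly load per active user, using the rounded figures quoted above.
queries_per_user = 400_000 / 1_000                        # about 400
```

Both totals line up with the "over 100 TB of memory and 14K vCPU cores" quoted in the post.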

Each query submitted to a Presto cluster is logged to a Kafka topic via Singer, a logging agent built at Pinterest that we talked about in a previous post. Each query is logged when it is submitted and when it finishes. When a Presto cluster crashes, we will have query-submitted events without corresponding query-finished events. These events enable us to capture the effect of cluster crashes over time.
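
The pairing logic described here can be sketched as follows. The event fields (`query_id`, `type`) are assumed for illustration and are not Singer's or Presto's actual log schema. In practice the events would be consumed from the Kafka topic rather than from a list.

```python
from typing import Dict, Iterable, List, Set


def unfinished_queries(events: Iterable[Dict[str, str]]) -> List[str]:
    """Return IDs of queries with a 'submitted' event but no 'finished' event.

    Submission order is preserved so results can be lined up against a timeline.
    """
    submitted: List[str] = []
    finished: Set[str] = set()
    for event in events:
        if event["type"] == "submitted":
            submitted.append(event["query_id"])
        elif event["type"] == "finished":
            finished.add(event["query_id"])
    return [query_id for query_id in submitted if query_id not in finished]


# Hypothetical log: q2 was submitted but never finished, e.g. after a crash.
log = [
    {"query_id": "q1", "type": "submitted"},
    {"query_id": "q2", "type": "submitted"},
    {"query_id": "q1", "type": "finished"},
]
orphans = unfinished_queries(log)
```

Queries that are still running when the log is read also appear as unfinished, so a real analysis would likely restrict the check to a window that ended some time ago.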

Each Presto cluster at Pinterest has workers on a mix of dedicated AWS EC2 instances and Kubernetes pods. The Kubernetes platform lets us add and remove workers from a Presto cluster very quickly. The best-case latency for bringing up a new worker on Kubernetes is less than a minute. However, when the Kubernetes cluster itself is out of resources and needs to scale up, it can take up to ten minutes. Another advantage of deploying on the Kubernetes platform is that our Presto deployment becomes agnostic of cloud vendor, instance types, OS, etc.

#BigData #AWS #DataScience #DataEngineering

Presto

Distributed SQL Query Engine for Big Data

Druid

Fast column-oriented distributed data store

Apache Impala

Real-time Query for Hadoop

Pig

Platform for analyzing large data sets

Amazon Redshift Spectrum

Exabyte-Scale In-Place Queries of S3 Data

Apache Kudu

Fast Analytics on Fast Data. A columnar storage manager developed for the Hadoop platform

Vertica

Storage platform designed to handle large volumes of data

Apache Parquet

A free and open-source column-oriented data storage format

Hue

An open source SQL Workbench for Data Warehouses

Mule

Revolutionizing the way the world connects data and applications

Azure Data Factory

Hybrid data integration service that simplifies ETL at scale