Alternatives to Azure Data Factory

Azure Databricks, Talend, AWS Data Pipeline, AWS Glue, and Apache NiFi are the most popular alternatives and competitors to Azure Data Factory.

What is Azure Data Factory and what are its top alternatives?

Azure Data Factory is a service that lets developers integrate disparate data sources. It works somewhat like SSIS in the cloud, managing data that lives both on-premises and in the cloud.
Azure Data Factory is a tool in the Big Data Tools category of a tech stack.
Azure Data Factory itself is a managed Microsoft service rather than an open source tool; its public GitHub repository has 485 stars and 590 forks.

Top Alternatives to Azure Data Factory

  • Azure Databricks

    Accelerate big data analytics and artificial intelligence (AI) solutions with Azure Databricks, a fast, easy and collaborative Apache Spark–based analytics service. ...

  • Talend

    An open source software integration platform that helps you turn data into business insights. It uses native code generation that lets you run your data pipelines across all cloud providers and get optimized performance on all platforms. ...

  • AWS Data Pipeline

    AWS Data Pipeline is a web service that provides a simple management system for data-driven workflows. Using AWS Data Pipeline, you define a pipeline composed of the “data sources” that contain your data, the “activities” or business logic such as EMR jobs or SQL queries, and the “schedule” on which your business logic executes. For example, you could define a job that, every hour, runs an Amazon Elastic MapReduce (Amazon EMR)–based analysis on that hour’s Amazon Simple Storage Service (Amazon S3) log data, loads the results into a relational database for future lookup, and then automatically sends you a daily summary email. ...

  • AWS Glue

    A fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. ...

  • Apache NiFi

    An easy to use, powerful, and reliable system to process and distribute data. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. ...

  • Airflow

    Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress and troubleshoot issues when needed. ...

  • Databricks

    Databricks Unified Analytics Platform, from the original creators of Apache Spark™, unifies data science and engineering across the Machine Learning lifecycle from data preparation to experimentation and deployment of ML applications. ...

  • MySQL

    The MySQL software delivers a very fast, multi-threaded, multi-user, and robust SQL (Structured Query Language) database server. MySQL Server is intended for mission-critical, heavy-load production systems as well as for embedding into mass-deployed software. ...
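
The AWS Data Pipeline entry in the list above describes a pipeline as three parts: data sources, activities, and a schedule. The sketch below models that idea in plain Python. It is not the AWS Data Pipeline API; every class, field, and location string here is a hypothetical name chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DataSource:
    name: str       # hypothetical label, e.g. a bucket of hourly logs
    location: str   # hypothetical location string


@dataclass(frozen=True)
class Activity:
    name: str       # business logic, e.g. an EMR job or a SQL query
    reads: str      # name of the input DataSource
    writes: str     # name of the output DataSource


@dataclass
class Pipeline:
    sources: dict = field(default_factory=dict)
    activities: list = field(default_factory=list)
    period: timedelta = timedelta(hours=1)  # the "schedule"

    def add_source(self, source: DataSource) -> None:
        self.sources[source.name] = source

    def add_activity(self, activity: Activity) -> None:
        # Reject activities that point at undeclared data sources.
        for ref in (activity.reads, activity.writes):
            if ref not in self.sources:
                raise ValueError(f"unknown data source: {ref}")
        self.activities.append(activity)

    def run_times(self, start: datetime, end: datetime) -> list:
        """Return every scheduled run in the half-open window [start, end)."""
        times = []
        current = start
        while current < end:
            times.append(current)
            current += self.period
        return times


def build_hourly_log_pipeline() -> Pipeline:
    """Assemble the hourly log-analysis example with made-up names."""
    pipeline = Pipeline(period=timedelta(hours=1))
    pipeline.add_source(DataSource("s3_logs", "s3://example-bucket/logs/"))
    pipeline.add_source(DataSource("results_db", "example-reports-database"))
    pipeline.add_activity(Activity("hourly_emr_analysis", "s3_logs", "results_db"))
    return pipeline
```

Calling `build_hourly_log_pipeline().run_times(start, end)` lists the hourly run times inside a window, which is roughly the role the "schedule" plays in the description above.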

Azure Data Factory alternatives & related posts

Azure Databricks

247
393
0
Fast, easy, and collaborative Apache Spark–based analytics service

Talend

152
248
0
A single, unified suite for all integration needs

AWS Data Pipeline

95
398
1
Process and move data between different AWS compute and storage services

PROS OF AWS DATA PIPELINE
  • 1 · Easy to create DAG and execute it

AWS Glue

459
816
9
Fully managed extract, transform, and load (ETL) service

PROS OF AWS GLUE
  • 9 · Managed Hive Metastore

related AWS Glue posts

Will Dataflow be the right replacement for AWS Glue? Are there any unforeseen exceptions, such as certain proprietary transformations not being supported in Google Cloud Dataflow, gaps in the connector ecosystem, or data quality and data cleansing not being supported in Dataflow?

Also, how about Google Cloud Data Fusion as a replacement, in terms of no-code/low-code? (Since basic use cases in Glue support a UI, CDF may be the right choice in that case.)

What would be the best choice?

Pardha Saradhi
Technical Lead at Incred Financial Solutions · 6 upvotes · 107.4K views

Hi,

We are currently storing the data in Amazon S3 using Apache Parquet format. We are using Presto to query the data from S3 and catalog it using AWS Glue catalog. We have Metabase sitting on top of Presto, where our reports are present. Currently, Presto is becoming too costly for us, and we are looking for alternatives to it but want to reuse the remaining setup (S3, Metabase) as much as possible. Please suggest alternative approaches.
Apache NiFi

351
686
65
A reliable system to process and distribute data

PROS OF APACHE NIFI
  • 17 · Visual Data Flows using Directed Acyclic Graphs (DAGs)
  • 8 · Free (Open Source)
  • 7 · Simple-to-use
  • 5 · Scalable horizontally as well as vertically
  • 5 · Reactive with back-pressure
  • 4 · Fast prototyping
  • 3 · Bi-directional channels
  • 3 · End-to-end security between all nodes
  • 2 · Built-in graphical user interface
  • 2 · Can handle messages up to gigabytes in size
  • 2 · Data provenance
  • 1 · Lots of documentation
  • 1 · HBase support
  • 1 · Support for custom Processor in Java
  • 1 · Hive support
  • 1 · Kudu support
  • 1 · Slack integration
  • 1 · Lots of articles
CONS OF APACHE NIFI
  • 2 · HA support is not fully fledged
  • 2 · Memory-intensive

related Apache NiFi posts

John Calandra
Data Manager at The Garrett Group · 8 upvotes · 367K views

There is a question coming... I am using Oracle VirtualBox to spawn 3 Ubuntu Linux virtual machines (VM). VM1 is being used as a data lake - just a place to store flat files. VM2 hosts Apache NiFi. VM3 hosts PostgreSQL. I have built a NiFi pipeline that reads flat files on VM1, pipes the data over, and inserts it into the PostgreSQL database. I left this setup alone for a while, and then something hiccupped on VM3, and I had to rebuild it. Now I cannot make a remote connection to PostgreSQL on VM3. I was using pgAdmin3 on VM3, but it kept throwing errors - I found out it went end-of-life in 2018 and uninstalled it. pgAdmin4 is out, but for some reason, I cannot get the APT utility to find/install it. I am trying to figure out the pgAdmin4 install problem and looking for a good alternative to pgAdmin4 that I can use to diagnose the remote database connection problem. Does anyone have any suggestions? Thanks in advance.

I am looking for the best tool to orchestrate #ETL workflows in non-Hadoop environments, mainly for regression testing use cases. Would Airflow or Apache NiFi be a good fit for this purpose?

For example, I want to run an Informatica ETL job and then run an SQL task as a dependency, followed by another task from Jira. What tool is best suited to set up such a pipeline?
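
The dependency chain in this question (an ETL job, then an SQL task, then a Jira task) is a small directed acyclic graph. As a rough illustration of what any orchestrator has to compute, the sketch below orders such tasks with Python's standard-library `graphlib`. It does not use Airflow or NiFi, and the task names are hypothetical placeholders for the jobs mentioned above.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
REGRESSION_WORKFLOW = {
    "informatica_etl": set(),
    "sql_check": {"informatica_etl"},
    "jira_update": {"sql_check"},
}


def execution_batches(dag):
    """Group tasks into batches whose dependencies are already satisfied."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks that can run right now
        batches.append(ready)
        sorter.done(*ready)                 # mark them finished
    return batches


if __name__ == "__main__":
    for step, batch in enumerate(execution_batches(REGRESSION_WORKFLOW), start=1):
        print(step, batch)
```

Tasks that land in the same batch have no dependency on one another, so a scheduler could hand them to separate workers. A real orchestrator adds scheduling, retries, and monitoring on top of this ordering step.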

Airflow

1.7K
2.7K
128
A platform to programmatically author, schedule and monitor data pipelines, by Airbnb

PROS OF AIRFLOW
  • 53 · Features
  • 14 · Task Dependency Management
  • 12 · Beautiful UI
  • 12 · Cluster of workers
  • 10 · Extensibility
  • 6 · Open source
  • 5 · Complex workflows
  • 5 · Python
  • 3 · Good API
  • 3 · Apache project
  • 3 · Custom operators
  • 2 · Dashboard
CONS OF AIRFLOW
  • 2 · Observability is not great when the DAGs exceed 250
  • 2 · Running it on a Kubernetes cluster is relatively complex
  • 2 · Open source: provides minimal or no support
  • 1 · Logical separation of DAGs is not straightforward

related Airflow posts

Data science and engineering teams at Lyft maintain several big data pipelines that serve as the foundation for various types of analysis throughout the business.

Apache Airflow sits at the center of this big data infrastructure, allowing users to “programmatically author, schedule, and monitor data pipelines.” Airflow is an open source tool, and “Lyft is the very first Airflow adopter in production since the project was open sourced around three years ago.”

There are several key components of the architecture. A web UI allows users to view the status of their queries, along with an audit trail of any modifications to the query. A metadata database stores things like job status and task instance status. A multi-process scheduler handles job requests, and triggers the executor to execute those tasks.

Airflow supports several executors, though Lyft uses CeleryExecutor to scale task execution in production. Airflow is deployed to three Amazon Auto Scaling Groups, with each associated with a celery queue.

Audit logs supplied to the web UI are powered by the existing Airflow audit logs as well as Flask signal.

Datadog, Statsd, Grafana, and PagerDuty are all used to monitor the Airflow system.

We are a young start-up with 2 developers and a team in India looking to choose our next ETL tool. We have a few processes in Azure Data Factory but are looking to switch to a better platform. We were debating Trifacta and Airflow. Or even staying with Azure Data Factory. The use case will be to feed data to front-end APIs.
Databricks

497
752
8
A unified analytics platform, powered by Apache Spark

PROS OF DATABRICKS
  • 1 · Best performance on large datasets
  • 1 · True lakehouse architecture
  • 1 · Scalability
  • 1 · Databricks doesn't get access to your data
  • 1 · Usage Based Billing
  • 1 · Security
  • 1 · Data stays in your cloud account
  • 1 · Multicloud

related Databricks posts

Jan Vlnas
Senior Software Engineer at Mews · 5 upvotes · 455.8K views

From my point of view, both OpenRefine and Apache Hive serve completely different purposes. OpenRefine is intended for interactive cleaning of messy data locally. You could work with their libraries to use some of OpenRefine features as part of your data pipeline (there are pointers in FAQ), but OpenRefine in general is intended for a single-user local operation.

I can't recommend a particular alternative without better understanding of your use case. But if you are looking for an interactive tool to work with big data at scale, take a look at notebook environments like Jupyter, Databricks, or Deepnote. If you are building a data processing pipeline, consider also Apache Spark.

Edit: Fixed references from Hadoop to Hive, which is actually closer to Spark.
MySQL

125.3K
106K
3.8K
The world's most popular open source database

PROS OF MYSQL
  • 800 · SQL
  • 679 · Free
  • 562 · Easy
  • 528 · Widely used
  • 490 · Open source
  • 180 · High availability
  • 160 · Cross-platform support
  • 104 · Great community
  • 79 · Secure
  • 75 · Full-text indexing and searching
  • 26 · Fast, open, available
  • 16 · Reliable
  • 16 · SSL support
  • 15 · Robust
  • 9 · Enterprise Version
  • 7 · Easy to set up on all platforms
  • 3 · NoSQL access to JSON data type
  • 1 · Relational database
  • 1 · Easy, light, scalable
  • 1 · Sequel Pro (best SQL GUI)
  • 1 · Replica Support
CONS OF MYSQL
  • 16 · Owned by a company with their own agenda
  • 3 · Can't roll back schema changes

related MySQL posts

Nick Rockwell
SVP, Engineering at Fastly · 46 upvotes · 4.1M views

When I joined NYT there was already broad dissatisfaction with the LAMP (Linux Apache HTTP Server MySQL PHP) Stack and the front end framework, in particular. So, I wasn't passing judgment on it. I mean, LAMP's fine, you can do good work in LAMP. It's a little dated at this point, but it's not ... I didn't want to rip it out for its own sake, but everyone else was like, "We don't like this, it's really inflexible." And I remember from being outside the company when that was called MIT FIVE when it had launched. And been observing it from the outside, and I was like, you guys took so long to do that and you did it so carefully, and yet you're not happy with your decisions. Why is that? That was more the impetus. If we're going to do this again, how are we going to do it in a way that we're gonna get a better result?

So we're moving quickly away from LAMP, I would say. So, right now, the new front end is React based and using Apollo. And we've been in a long, protracted, gradual rollout of the core experiences.

React is now talking to GraphQL as a primary API. There's a Node.js back end, to the front end, which is mainly for server-side rendering, as well.

Behind there, the main repository for the GraphQL server is a big table repository, that we call Bodega because it's a convenience store. And that reads off of a Kafka pipeline.
Tim Abbott

We've been using PostgreSQL since the very early days of Zulip, but we actually didn't use it from the beginning. Zulip started out as a MySQL project back in 2012, because we'd heard it was a good choice for a startup with a wide community. However, we found that even though we were using the Django ORM for most of our database access, we spent a lot of time fighting with MySQL. Issues ranged from bad collation defaults, to bad query plans which required a lot of manual query tweaks.

We ended up getting so frustrated that we tried out PostgreSQL, and the results were fantastic. We didn't have to do any real customization (just some tuning settings for how big a server we had), and all of our most important queries were faster out of the box. As a result, we were able to delete a bunch of custom queries escaping the ORM that we'd written to make the MySQL query planner happy (because postgres just did the right thing automatically).

And then after that, we've just gotten a ton of value out of postgres. We use its excellent built-in full-text search, which has helped us avoid needing to bring in a tool like Elasticsearch, and we've really enjoyed features like its partial indexes, which saved us a lot of work adding unnecessary extra tables to get good performance for things like our "unread messages" and "starred messages" indexes.

I can't recommend it highly enough.
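
For readers unfamiliar with the partial indexes mentioned in this post, here is a minimal sketch of the idea. It uses Python's built-in `sqlite3` module as a stand-in, since SQLite also accepts `CREATE INDEX ... WHERE`; it is not PostgreSQL and not Zulip's actual schema. The table, column, and index names are hypothetical.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with a partial index on unread rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE user_messages ("
        " id INTEGER PRIMARY KEY,"
        " user_id INTEGER NOT NULL,"
        " unread INTEGER NOT NULL)"
    )
    # Partial index: only rows with unread = 1 are stored in the index,
    # so it stays small even when most messages have been read.
    conn.execute(
        "CREATE INDEX idx_unread ON user_messages (user_id) WHERE unread = 1"
    )
    rows = [(1, 10, 1), (2, 10, 0), (3, 10, 1), (4, 20, 1), (5, 20, 0)]
    conn.executemany("INSERT INTO user_messages VALUES (?, ?, ?)", rows)
    return conn


def unread_ids(conn: sqlite3.Connection, user_id: int) -> list:
    """Return the ids of unread messages for one user."""
    cur = conn.execute(
        "SELECT id FROM user_messages"
        " WHERE user_id = ? AND unread = 1 ORDER BY id",
        (user_id,),
    )
    return [row[0] for row in cur]


def query_plan(conn: sqlite3.Connection) -> str:
    """Return SQLite's plan text for the unread-messages lookup."""
    cur = conn.execute(
        "EXPLAIN QUERY PLAN SELECT id FROM user_messages"
        " WHERE user_id = ? AND unread = 1",
        (10,),
    )
    return " ".join(str(row[-1]) for row in cur)
```

The query repeats the index's `unread = 1` condition literally, which is what lets the planner pick the partial index instead of scanning the table or relying on a separate "unread" table.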
