Why developers like Azure Data Factory

What is Azure Data Factory?

Azure Data Factory is a service that lets developers integrate disparate data sources. It works somewhat like SSIS in the cloud, managing the data you have both on-premises and in the cloud.

Azure Data Factory is a tool in the Big Data Tools category of a tech stack.

Azure Data Factory is an open source tool with 499 GitHub stars and 599 GitHub forks. Its open source repository is hosted on GitHub.

Who uses Azure Data Factory?

Companies

33 companies reportedly use Azure Data Factory in their tech stacks, including ViaVarejo, Mews, and Awin.

ViaVarejo

Mews

Awin

iOLAP

ADEXT

DevRain

Roar Publicis

Retail Product Catalog

Driverama

Developers

218 developers on StackShare have stated that they use Azure Data Factory.

My Stack

templafy

My Stack

palmtree

My Stack

Azure Data Factory Integrations

Java, .NET, Azure HDInsight, Octotree, and Hosted Graphite are some of the popular tools that integrate with Azure Data Factory. Here's a list of all 6 tools that integrate with Azure Data Factory.

Java

.NET

Azure HDInsight

Octotree

Hosted Graphite

Octopai

Decisions about Azure Data Factory

Here are some stack decisions, common use cases and reviews by companies and developers who chose Azure Data Factory in their tech stack.

Andres Crucetta

Dec 19, 2022 | 6 upvotes · 68.4K views

Needs advice on Airflow, Azure Data Factory, and Trifacta

We are a young start-up with 2 developers and a team in India looking to choose our next ETL tool. We have a few processes in Azure Data Factory but are looking to switch to a better platform. We were debating Trifacta and Airflow, or even staying with Azure Data Factory. The use case will be to feed data to front-end APIs.

kew44

Nov 10, 2022 | 6 upvotes · 118.7K views

Needs advice on Amazon S3, Dremio, and Snowflake

Trying to establish a data lake (or maybe puddle) for my org's Data Sharing project. The idea is that outside partners would send cuts of their PHI data, regardless of format/variables/systems, to our Data Team, who would then harmonize the data, create data marts, and eventually use it for something. End-to-end, I'm envisioning:

1. Ingestion -> Secure, role-based, self-service portal for users to upload data (1a. bonus points if it can perform basic validations/masking)
2. Storage -> Amazon S3 seems like the cheapest. We probably won't need anything very big, even at full capacity. Our current storage is a secure Box folder that has ~4GB with several batches of test data, code, presentations, and planning docs.
3. Data Catalog -> AWS Glue? Azure Data Factory? Snowplow? Is the main difference basically the vendor? We will also have Data Dictionaries/Codebooks from submitters. Where would they fit in?
4. Partitions -> I've seen Cassandra and YARN mentioned, but have no experience with either.
5. Processing -> We want to use SAS if at all possible. What will work with SAS code?
6. Pipeline/Automation -> The check-in and verification processes that have been outlined are rather involved. Some sort of automated messaging or approval workflow would be nice.

I have very little guidance on what a "Data Mart" should look like, so I'm going with the idea that it would be another "experimental" partition. Unless there's an actual mart-building paradigm I've missed?
An end user might use the catalog to pull certain de-identified data sets from the marts. Again, role-based access and a self-service GUI would be preferable.

I'm the only full-time tech person on this project, but I'm mostly an OOP, HTML, JavaScript, and some SQL programmer. Most of this is out of my repertoire. I've done a lot of research, but I can't be an effective evangelist without hands-on experience. Since we're starting a new year of our grant, they've finally decided to let me try some stuff out. Any pointers would be appreciated!

Vamshi Krishna

Data Engineer at Tata Consultancy Services · May 29, 2020 | 4 upvotes · 263.9K views

Needs advice on AWS Data Pipeline, AWS Glue, and Azure Data Factory

I have to collect different data from multiple sources and store them in a single cloud location. Then perform cleaning and transforming using PySpark, and push the end results to other applications like reporting tools, etc. What would be the best solution? I can only think of Azure Data Factory + Databricks. Are there any alternatives to #AWS services + Databricks?

Azure Data Factory's Features

Real-Time Integration
Parallel Processing
Data Chunker
Data Masking
Proactive Monitoring
Big Data Processing

Azure Data Factory Alternatives & Comparisons

What are some alternatives to Azure Data Factory?

Azure Databricks

Accelerate big data analytics and artificial intelligence (AI) solutions with Azure Databricks, a fast, easy and collaborative Apache Spark–based analytics service.

Talend

It is an open source software integration platform that helps you turn data into business insights. It uses native code generation that lets you run your data pipelines across all cloud providers and get optimized performance on all platforms.

AWS Data Pipeline

AWS Data Pipeline is a web service that provides a simple management system for data-driven workflows. Using AWS Data Pipeline, you define a pipeline composed of the “data sources” that contain your data, the “activities” or business logic such as EMR jobs or SQL queries, and the “schedule” on which your business logic executes. For example, you could define a job that, every hour, runs an Amazon Elastic MapReduce (Amazon EMR)–based analysis on that hour’s Amazon Simple Storage Service (Amazon S3) log data, loads the results into a relational database for future lookup, and then automatically sends you a daily summary email.
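The three parts named in that description (data sources, activities, and a schedule) can be illustrated with a small Python sketch. This is only a toy model of the idea: the `Pipeline` class, `run_window`, `hourly_logs`, and `keep_errors` are hypothetical names made up for illustration and are not part of the AWS Data Pipeline API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable

# Hypothetical model of the concepts in the description above.
# None of these names come from the AWS Data Pipeline API.


@dataclass
class Pipeline:
    data_source: Callable[[datetime], list[str]]        # returns the data for one period
    activities: list[Callable[[list[str]], list[str]]]  # business logic, applied in order
    period: timedelta                                    # the schedule

    def run_window(self, start: datetime, end: datetime) -> list[list[str]]:
        """Run once per period from start (inclusive) to end (exclusive)."""
        results = []
        tick = start
        while tick < end:
            data = self.data_source(tick)
            for activity in self.activities:
                data = activity(data)
            results.append(data)
            tick += self.period
        return results


def hourly_logs(tick: datetime) -> list[str]:
    # Stand-in for reading one hour of log data from storage.
    return [f"{tick:%H}:00 GET /", f"{tick:%H}:30 ERROR timeout"]


def keep_errors(lines: list[str]) -> list[str]:
    # Stand-in for an analysis step such as an EMR job or SQL query.
    return [line for line in lines if "ERROR" in line]


pipeline = Pipeline(
    data_source=hourly_logs,
    activities=[keep_errors],
    period=timedelta(hours=1),
)
results = pipeline.run_window(datetime(2024, 1, 1, 0), datetime(2024, 1, 1, 3))
```

Here the pipeline runs three hourly periods and keeps only the error lines from each. A real service would also handle retries, dependencies, and delivery of results, which this sketch leaves out.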

AWS Glue

A fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics.

Apache NiFi

An easy-to-use, powerful, and reliable system to process and distribute data. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.

Related Comparisons