Azure Data Factory vs Dremio: What are the differences?

Introduction

Azure Data Factory and Dremio are both powerful data platforms that offer data integration and transformation capabilities. However, there are key differences between the two in terms of functionality and architecture.

  1. Data Integration Approach:

Azure Data Factory is a cloud-based integration service that allows you to create, schedule, and orchestrate data pipelines. It provides a visual interface for designing and managing data workflows, and supports seamless integration with various data sources and destination systems. On the other hand, Dremio is a data virtualization platform that enables users to access, query, and analyze data from multiple sources in real-time. It leverages a "self-service" approach, allowing users to directly access and query data without the need for data movement or ETL processes.
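
For a concrete sense of the orchestration model, here is a minimal sketch of triggering an existing Data Factory pipeline from Python with the azure-mgmt-datafactory SDK. The subscription ID, resource group, factory, pipeline name, and parameter are hypothetical placeholders.

```python
# Hypothetical sketch: start a run of an existing Azure Data Factory pipeline.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

credential = DefaultAzureCredential()
adf = DataFactoryManagementClient(credential, subscription_id="<subscription-id>")

# ADF schedules and orchestrates the pipeline server-side; this just kicks it off.
run = adf.pipelines.create_run(
    resource_group_name="my-resource-group",    # hypothetical
    factory_name="my-data-factory",             # hypothetical
    pipeline_name="copy-sales-data",            # hypothetical
    parameters={"windowStart": "2023-01-01"},   # hypothetical pipeline parameter
)
print("Started pipeline run:", run.run_id)
```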

  2. Data Transformation Capabilities:

Azure Data Factory provides built-in data transformation activities that let you transform and manipulate data during pipeline execution, supporting operations such as data cleansing, aggregation, and filtering. Dremio, in contrast, accelerates transformation and query workloads through its Reflections feature, which maintains optimized, materialized representations of the underlying data (such as aggregated or sorted copies), resulting in faster query performance.
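
As an illustration, the sketch below submits aggregate-reflection DDL to Dremio over Arrow Flight with pyarrow. The host, credentials, dataset, and columns are hypothetical, and the ALTER TABLE ... CREATE AGGREGATE REFLECTION syntax should be checked against your Dremio version.

```python
# Hypothetical sketch: define an aggregate Reflection on a Dremio dataset.
from pyarrow import flight

client = flight.FlightClient("grpc+tcp://dremio-host:32010")  # hypothetical host
token = client.authenticate_basic_token("user", "password")   # hypothetical creds
options = flight.FlightCallOptions(headers=[token])

ddl = """
ALTER TABLE sales.transactions
CREATE AGGREGATE REFLECTION daily_sales_agg
USING DIMENSIONS (region, sale_date)
MEASURES (amount (SUM, COUNT))
"""
info = client.get_flight_info(flight.FlightDescriptor.for_command(ddl), options)
result = client.do_get(info.endpoints[0].ticket, options).read_all()
print(result)  # DDL statements return a small status table
```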

  3. Data Storage and Processing:

Azure Data Factory integrates seamlessly with Azure's storage and processing services like Azure Blob Storage, Azure Data Lake Storage, and Azure Databricks, letting you leverage Azure's scalable and cost-effective storage and compute for data integration tasks. Dremio, on the other hand, can connect to a wide range of cloud-based and on-premises data sources. Rather than maintaining its own storage layer, it queries those sources directly and persists its Reflections (optimized materializations) in data lake storage to speed up query performance and data access.

  4. Data Governance and Security:

Azure Data Factory provides robust data governance and security features, including data encryption, role-based access control (RBAC), and Azure Active Directory integration. It also supports data masking and data classification to protect sensitive data. Dremio offers similar data governance capabilities, providing granular access controls and encryption of data in transit and at rest. It also integrates with existing identity management systems for secure authentication and authorization.

  5. Scalability and Performance:

Azure Data Factory is designed to scale seamlessly to handle large volumes of data and can be integrated with Azure's autoscaling capabilities. It leverages Azure's distributed computing resources for efficient and parallel execution of data integration workflows. Dremio is also highly scalable and can handle large-scale data processing, thanks to its distributed architecture. It optimizes query performance through data caching, indexing, and workload-aware query routing.

  6. Data Exploration and Visualization:

Azure Data Factory primarily focuses on data integration and orchestration and does not provide built-in data exploration or visualization capabilities. However, it can integrate with other Azure services like Power BI, Azure Synapse Analytics, and Azure Analysis Services for advanced data analytics and visualization. Dremio, on the other hand, offers powerful self-service data exploration and visualization capabilities. It includes a built-in SQL editor, a data virtualization layer, and integrates with popular BI tools like Tableau and Power BI.
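
To make the self-service workflow concrete, here is a hedged sketch that runs ad-hoc SQL against Dremio over Arrow Flight and pulls the result into a pandas DataFrame, the same kind of handoff a BI tool like Tableau or Power BI performs. The connection details, dataset, and query are hypothetical.

```python
# Hypothetical sketch: ad-hoc SQL against Dremio, results into pandas.
from pyarrow import flight

client = flight.FlightClient("grpc+tcp://dremio-host:32010")  # hypothetical host
token = client.authenticate_basic_token("user", "password")   # hypothetical creds
options = flight.FlightCallOptions(headers=[token])

sql = "SELECT region, SUM(amount) AS total FROM sales.transactions GROUP BY region"
info = client.get_flight_info(flight.FlightDescriptor.for_command(sql), options)
df = client.do_get(info.endpoints[0].ticket, options).read_all().to_pandas()
print(df.head())
```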

In summary, Azure Data Factory and Dremio are both powerful data platforms, but they differ in their integration approach, data transformation capabilities, data storage and processing options, data governance and security features, scalability and performance optimizations, and data exploration and visualization capabilities.

Advice on Azure Data Factory and Dremio

We need to perform ETL from several databases into a data warehouse or data lake. We want to

  • keep raw and transformed data available to users to draft their own queries efficiently
  • give users custom permissions and SSO
  • move between open-source on-premises development and cloud-based production environments

We want to use only inexpensive Amazon EC2 instances, on medium-sized data sets (16 GB to 32 GB), feeding into Tableau Server or Power BI for reporting and data analysis purposes.

Replies (3)
John Nguyen
Recommends Airflow and AWS Lambda

You could also use AWS Lambda with a CloudWatch Events schedule if you know when the function should be triggered. The benefit is that you can use any language and the respective database client.
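
A minimal sketch of that Lambda approach, assuming Python and PostgreSQL-compatible endpoints: a handler that copies rows from a source database into a warehouse staging table on each scheduled invocation. The environment variables, tables, and columns are hypothetical, and psycopg2 stands in for "the respective database client".

```python
# Hypothetical sketch: scheduled AWS Lambda that copies rows between databases.
import os
import psycopg2

def handler(event, context):
    src = psycopg2.connect(os.environ["SOURCE_DSN"])  # hypothetical env vars
    dst = psycopg2.connect(os.environ["TARGET_DSN"])
    with src.cursor() as read_cur, dst.cursor() as write_cur:
        read_cur.execute("SELECT id, amount, created_at FROM orders")
        rows = read_cur.fetchall()
        write_cur.executemany(
            "INSERT INTO staging.orders (id, amount, created_at) VALUES (%s, %s, %s)",
            rows,
        )
    dst.commit()
    src.close()
    dst.close()
    return {"rows_copied": len(rows)}
```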

But if you need to orchestrate multiple ETLs, it makes sense to use Apache Airflow. This requires Python knowledge.
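
For comparison, a hedged sketch of the same kind of job as an Airflow DAG; the DAG ID, schedule, and task body are hypothetical placeholders.

```python
# Hypothetical sketch: a daily Airflow DAG wrapping one extract/load task.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_and_load():
    # Replace with real extract/transform/load logic (database clients, etc.).
    print("running ETL step")

with DAG(
    dag_id="nightly_etl",              # hypothetical
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="extract_and_load", python_callable=extract_and_load)
```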

Recommends Airflow

Though we have always built something custom, Apache Airflow (https://airflow.apache.org/) stood out as a key contender/alternative among open-source options. On the commercial side, Amazon Redshift combined with Amazon Kinesis (for complex manipulations) is great for BI, though Redshift itself is expensive.

Recommends Conduit

You may want to look into a data virtualization product called Conduit. It connects to disparate data sources in AWS, on-prem, Azure, and GCP, and exposes them as a single unified Spark SQL view to Power BI (DirectQuery) or Tableau. It allows auto-query and caching policies to enhance query speeds and experience, has a GPU query engine with optimized Spark as fallback, and can be deployed on your AWS VM or on-prem, scaling up and out. It sounds like the ideal solution to your needs.

karunakaran karthikeyan
Needs advice on Dremio and Talend

I am trying to build a data lake by pulling data from multiple data sources (custom-built tools, Excel files, CSV files, etc.) and use the data lake to generate dashboards.

My question is: which is the best tool to do the following?

  1. Create pipelines to ingest the data from multiple sources into the data lake.
  2. Aggregate and filter the data available in the data lake.
  3. Create new reports by combining different data elements from the data lake.

I need to use only open-source tools for this activity.

I appreciate your valuable inputs and suggestions. Thanks in advance.

Replies (1)
Rod Beecham
Partnering Lead at Zetaris
Recommends Dremio

Hi Karunakaran. I obviously have an interest here, as I work for the company, but the problem you are describing is one that Zetaris can solve. Talend is a good ETL product, and Dremio is a good data virtualization product, but the problem you are describing best fits a tool that can combine the five styles of data integration (bulk/batch data movement, data replication/data synchronization, message-oriented movement of data, data virtualization, and stream data integration). I may be wrong, but Zetaris is, to the best of my knowledge, the only product in the world that can do this. Zetaris is not a dashboarding tool - you would need to combine us with Tableau or Qlik or Power BI (or whatever) - but Zetaris can consolidate data from any source and any location (structured, unstructured, on-prem or in the cloud) in real time to give clients a consolidated view of whatever they want whenever they want it. Please take a look at www.zetaris.com for more information. I don't want to do a "hard sell" here, so I'll say no more! Warmest regards, Rod Beecham.

Vamshi Krishna
Data Engineer at Tata Consultancy Services

I have to collect different data from multiple sources and store it in a single cloud location, then perform cleaning and transformation using PySpark and push the end results to other applications, such as reporting tools. What would be the best solution? I can only think of Azure Data Factory + Databricks. Are there any alternatives to AWS services + Databricks?
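
As a hedged sketch of the PySpark step described above (not a full solution): read raw files from cloud storage, clean them, and write curated output for a reporting tool to consume. The storage paths and column names are hypothetical.

```python
# Hypothetical sketch: PySpark cleaning/transform step between raw and curated zones.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("clean-and-transform").getOrCreate()

raw = spark.read.option("header", True).csv(
    "abfss://raw@myaccount.dfs.core.windows.net/orders/"  # hypothetical path
)
clean = (
    raw.dropDuplicates(["order_id"])
       .na.drop(subset=["order_id", "amount"])
       .withColumn("amount", F.col("amount").cast("double"))
)
clean.write.mode("overwrite").parquet(
    "abfss://curated@myaccount.dfs.core.windows.net/orders/"  # hypothetical path
)
```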

Pros of Azure Data Factory

  • No pros listed yet

Pros of Dremio

  • Nice GUI to enable more people to work with data (3 upvotes)
  • Connects NoSQL databases with RDBMSs (2 upvotes)
  • Easier to deploy (2 upvotes)
  • Free (1 upvote)


Cons of Azure Data Factory

  • No cons listed yet

Cons of Dremio

  • Works only on Iceberg-structured data (1 upvote)



What is Azure Data Factory?

It is a service designed to allow developers to integrate disparate data sources. It is a platform, somewhat like SSIS in the cloud, for managing the data you have both on-prem and in the cloud.

What is Dremio?

Dremio, the data lake engine, operationalizes your data lake storage and speeds your analytics processes with a high-performance, high-efficiency query engine, while also democratizing data access for data scientists and analysts.


What are some alternatives to Azure Data Factory and Dremio?

Azure Databricks
Accelerate big data analytics and artificial intelligence (AI) solutions with Azure Databricks, a fast, easy, and collaborative Apache Spark–based analytics service.

Talend
It is an open-source software integration platform that helps you effortlessly turn data into business insights. It uses native code generation that lets you run your data pipelines seamlessly across all cloud providers and get optimized performance on all platforms.

AWS Data Pipeline
AWS Data Pipeline is a web service that provides a simple management system for data-driven workflows. Using AWS Data Pipeline, you define a pipeline composed of the “data sources” that contain your data, the “activities” or business logic such as EMR jobs or SQL queries, and the “schedule” on which your business logic executes. For example, you could define a job that, every hour, runs an Amazon Elastic MapReduce (Amazon EMR)–based analysis on that hour’s Amazon Simple Storage Service (Amazon S3) log data, loads the results into a relational database for future lookup, and then automatically sends you a daily summary email.

AWS Glue
A fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics.

Apache NiFi
An easy-to-use, powerful, and reliable system to process and distribute data. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.