Azure Data Factory vs Apache Impala

Apache Impala vs Azure Data Factory: What are the differences?

Developers describe Apache Impala as "Real-time Query for Hadoop". Impala is a modern, open source, massively parallel processing (MPP) SQL query engine for Apache Hadoop, shipped by Cloudera, MapR, and Amazon. With Impala, you can query data stored in HDFS or Apache HBase in real time, using SELECT, JOIN, and aggregate functions. Azure Data Factory, by contrast, is described as "Create, Schedule, & Manage Data Pipelines". It is a service designed to let developers integrate disparate data sources: a platform somewhat like SSIS in the cloud for managing data held both on-premises and in the cloud.
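
As a rough illustration of the kind of SELECT, JOIN, and aggregate query mentioned above, here is a minimal sketch. It assumes hypothetical `customers` and `orders` tables, and it uses Python's built-in `sqlite3` module purely as a local stand-in so the query can be run without a Hadoop cluster; it does not connect to Impala, and Impala's SQL dialect may differ in details.

```python
import sqlite3

# sqlite3 is only a local stand-in here; the tables, columns, and rows
# are hypothetical and exist just to make the query runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'EMEA'), (2, 'APAC'), (3, 'EMEA');
    INSERT INTO orders VALUES (1, 100.0), (1, 50.0), (2, 70.0), (3, 30.0);
""")

# A BI-style query: join two tables, then aggregate per region.
BI_QUERY = """
    SELECT c.region,
           COUNT(*)      AS order_count,
           SUM(o.amount) AS revenue
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    GROUP BY c.region
    ORDER BY revenue DESC
"""

rows = conn.execute(BI_QUERY).fetchall()
for region, order_count, revenue in rows:
    print(region, order_count, revenue)
```

On an actual Impala deployment, a query of this general shape would be submitted through whichever client the cluster exposes, rather than through `sqlite3`.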

Apache Impala and Azure Data Factory both belong to the "Big Data Tools" category of the tech stack.

Some of the features offered by Apache Impala are:

  • Do BI-style Queries on Hadoop
  • Unify Your Infrastructure
  • Implement Quickly

On the other hand, Azure Data Factory provides the following key features:

  • Real-Time Integration
  • Parallel Processing
  • Data Chunker

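Apache Impala is an open source tool, whereas Azure Data Factory is a managed Microsoft Azure service rather than an open source project. On GitHub, Apache Impala has 2.22K stars and 834 forks, while the repository listed for Azure Data Factory has 150 stars and 255 forks. These counts suggest, but do not establish, wider adoption of Impala.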

Advice on Azure Data Factory and Apache Impala
Vamshi Krishna
Data Engineer at Tata Consultancy Services · 4 upvotes · 90.6K views

I have to collect different data from multiple sources and store them in a single cloud location. Then perform cleaning and transforming using PySpark, and push the end results to other applications like reporting tools, etc. What would be the best solution? I can only think of Azure Data Factory + Databricks. Are there any alternatives to #AWS services + Databricks?

Pros of Azure Data Factory
    None listed yet

Pros of Apache Impala
    • Super fast (10 upvotes)

    What are some alternatives to Azure Data Factory and Apache Impala?
    Azure Databricks
    Accelerate big data analytics and artificial intelligence (AI) solutions with Azure Databricks, a fast, easy and collaborative Apache Spark–based analytics service.
    Talend
    It is an open source software integration platform that helps you turn data into business insights effortlessly. It uses native code generation that lets you run your data pipelines seamlessly across all cloud providers and get optimized performance on all platforms.
    AWS Data Pipeline
    AWS Data Pipeline is a web service that provides a simple management system for data-driven workflows. Using AWS Data Pipeline, you define a pipeline composed of the “data sources” that contain your data, the “activities” or business logic such as EMR jobs or SQL queries, and the “schedule” on which your business logic executes. For example, you could define a job that, every hour, runs an Amazon Elastic MapReduce (Amazon EMR)–based analysis on that hour’s Amazon Simple Storage Service (Amazon S3) log data, loads the results into a relational database for future lookup, and then automatically sends you a daily summary email.
    AWS Glue
    A fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics.
    Apache NiFi
    An easy to use, powerful, and reliable system to process and distribute data. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.
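
The AWS Data Pipeline entry above describes a pipeline as three parts: data sources, activities, and a schedule. The sketch below models that idea with a plain Python dataclass. The `Pipeline` class, its fields, and the bucket path are hypothetical names chosen for illustration; this is not the AWS Data Pipeline API or its definition format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Pipeline:
    """Hypothetical model: what to read, what to run, and how often."""
    data_sources: List[str]   # where the input data lives
    activities: List[str]     # business logic to execute, in order
    period: timedelta         # interval between scheduled runs

    def run_times(self, start: datetime, count: int) -> List[datetime]:
        """Return the first `count` scheduled run times from `start`."""
        return [start + i * self.period for i in range(count)]


# Mirrors the hourly log-analysis example in the description above.
hourly_log_analysis = Pipeline(
    data_sources=["s3://example-bucket/logs/"],  # placeholder path
    activities=["EMR log analysis", "load results into relational database"],
    period=timedelta(hours=1),
)

times = hourly_log_analysis.run_times(datetime(2024, 1, 1, 0, 0), 3)
for t in times:
    print(t.isoformat())
```

Keeping the schedule separate from the activities reflects the description's split between business logic and when that logic executes.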