Pig vs Azure Data Factory: What are the differences?

Pig: Platform for analyzing large data sets. Pig is a dataflow programming environment for processing very large files. Pig's language is called Pig Latin. A Pig Latin program consists of a directed acyclic graph in which each node represents an operation that transforms data. Operations come in two flavors: (1) relational-algebra-style operations such as join, filter, and project; (2) functional-programming-style operators such as map and reduce.

Azure Data Factory: Create, schedule, and manage data pipelines. It is a service designed to let developers integrate disparate data sources, somewhat like SSIS in the cloud, for managing data held both on-premises and in the cloud.
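To make Pig's two flavors of operation concrete, here is a minimal sketch in plain Python rather than Pig Latin. It chains relational-style steps (filter, join, project) and then functional-style steps (map, reduce) as a small dataflow. The sample records and the helper names `filter_rows`, `join`, and `project` are assumptions made up for illustration; they are not Pig's API.

```python
from functools import reduce

# Hypothetical sample data; names and values are illustrative only.
users = [
    {"id": 1, "name": "ana", "country": "PT"},
    {"id": 2, "name": "bo", "country": "SE"},
    {"id": 3, "name": "cy", "country": "PT"},
]
orders = [
    {"user_id": 1, "amount": 30},
    {"user_id": 1, "amount": 12},
    {"user_id": 3, "amount": 7},
]


def filter_rows(rows, predicate):
    """Relational-style filter: keep rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def join(left, right, left_key, right_key):
    """Relational-style inner join using a hash index on the right side."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[left_key], [])]


def project(rows, fields):
    """Relational-style projection: keep only the named fields."""
    return [{f: row[f] for f in fields} for row in rows]


# Each step consumes the previous step's output, forming a small dataflow graph.
pt_users = filter_rows(users, lambda row: row["country"] == "PT")
joined = join(pt_users, orders, "id", "user_id")
slim = project(joined, ["name", "amount"])

# Functional-style operators: map extracts amounts, reduce sums them.
amounts = list(map(lambda row: row["amount"], slim))
total = reduce(lambda acc, amount: acc + amount, amounts, 0)
```

In Pig itself each of these steps would be a Pig Latin statement, and the engine, not the author, decides how to distribute the work across the cluster.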

Pig and Azure Data Factory can be categorized as "Big Data" tools.

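Of the two, only Pig is an open source project (Apache Pig). Azure Data Factory is a managed Microsoft cloud service, so the GitHub figures listed for it presumably describe a public companion repository rather than the service itself. On those figures, Pig has 585 GitHub stars and 448 forks against 150 stars and 255 forks for Azure Data Factory, though repository counts are only a rough proxy for adoption.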

Advice on Azure Data Factory and Pig
Vamshi Krishna
Data Engineer at Tata Consultancy Services · 4 upvotes · 261.1K views

I have to collect different data from multiple sources and store them in a single cloud location. Then perform cleaning and transforming using PySpark, and push the end results to other applications like reporting tools, etc. What would be the best solution? I can only think of Azure Data Factory + Databricks. Are there any alternatives to #AWS services + Databricks?

Pros of Azure Data Factory
    None listed yet.

Pros of Pig
    • Finer-grained control on parallelization (2 upvotes)
    • Proven at petabyte scale (1 upvote)
    • Open source (1 upvote)
    • Join optimizations for highly skewed data (1 upvote)



    What are some alternatives to Azure Data Factory and Pig?
    Azure Databricks
    Accelerate big data analytics and artificial intelligence (AI) solutions with Azure Databricks, a fast, easy and collaborative Apache Spark–based analytics service.
    Talend
    It is an open source software integration platform that helps you turn data into business insights. It uses native code generation that lets you run your data pipelines across all cloud providers and get optimized performance on all platforms.
    AWS Data Pipeline
    AWS Data Pipeline is a web service that provides a simple management system for data-driven workflows. Using AWS Data Pipeline, you define a pipeline composed of the “data sources” that contain your data, the “activities” or business logic such as EMR jobs or SQL queries, and the “schedule” on which your business logic executes. For example, you could define a job that, every hour, runs an Amazon Elastic MapReduce (Amazon EMR)–based analysis on that hour’s Amazon Simple Storage Service (Amazon S3) log data, loads the results into a relational database for future lookup, and then automatically sends you a daily summary email.
    AWS Glue
    A fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics.
    Apache NiFi
    An easy to use, powerful, and reliable system to process and distribute data. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.
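
The AWS Data Pipeline entry above describes a pipeline as three parts: data sources, activities, and a schedule. The sketch below models that idea in plain Python, assuming an hourly schedule and a made-up log source. The `Pipeline` class, `fake_hourly_logs`, and `count_errors` are hypothetical names for illustration; this is not the AWS SDK or the service's pipeline definition format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, Dict, List


@dataclass
class Pipeline:
    """Hypothetical model: one data source, one activity, one schedule."""
    source: Callable[[datetime], List[str]]   # fetches data for a time slot
    activity: Callable[[List[str]], int]      # business logic applied to it
    period: timedelta                         # how often the activity runs
    results: Dict[datetime, int] = field(default_factory=dict)

    def run_between(self, start: datetime, end: datetime) -> Dict[datetime, int]:
        """Run the activity once per period in the half-open range [start, end)."""
        tick = start
        while tick < end:
            self.results[tick] = self.activity(self.source(tick))
            tick += self.period
        return self.results


def fake_hourly_logs(hour_start: datetime) -> List[str]:
    """Stand-in for reading one hour of logs; the data is invented."""
    return ["ok"] + ["error"] * hour_start.hour


def count_errors(lines: List[str]) -> int:
    """Example activity: count error lines in one batch of logs."""
    return sum(1 for line in lines if line == "error")


pipeline = Pipeline(
    source=fake_hourly_logs,
    activity=count_errors,
    period=timedelta(hours=1),
)
results = pipeline.run_between(datetime(2024, 1, 1, 0), datetime(2024, 1, 1, 3))
error_counts = list(results.values())
```

A managed service adds what this sketch leaves out, such as retries, dependency tracking between activities, and provisioning of the compute that runs them.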