Need advice about which tool to choose?Ask the StackShare community!

CDAP

41
108
+ 1
0
Dremio

117
348
+ 1
8
Add tool

CDAP vs Dremio: What are the differences?

What is CDAP? Open source virtualization platform for Hadoop data and apps. Cask Data Application Platform (CDAP) is an open source application development platform for the Hadoop ecosystem that provides developers with data and application virtualization to accelerate application development, address a broader range of real-time and batch use cases, and deploy applications into production while satisfying enterprise requirements.

What is Dremio? Self-service data for everyone. It is a data-as-a-service platform that empowers users to discover, curate, accelerate, and share any data at any time, regardless of location, volume, or structure. Modern data is managed by a wide range of technologies, including relational databases, NoSQL datastores, file systems, Hadoop, and others.

CDAP and Dremio belong to "Big Data Tools" category of the tech stack.

Some of the features offered by CDAP are:

  • Streams for data ingestion
  • Reusable libraries for common Big Data access patterns
  • Data available to multiple applications and different paradigms

On the other hand, Dremio provides the following key features:

  • Democratize all your data
  • Make your data engineers more productive
  • Accelerate your favorite tools

CDAP is an open source tool with 356 GitHub stars and 184 GitHub forks. Here's a link to CDAP's open source repository on GitHub.

Advice on CDAP and Dremio

We need to perform ETL from several databases into a data warehouse or data lake. We want to

  • keep raw and transformed data available to users to draft their own queries efficiently
  • give users the ability to give custom permissions and SSO
  • move between open-source on-premises development and cloud-based production environments

We want to use inexpensive Amazon EC2 instances only on medium-sized data set 16GB to 32GB feeding into Tableau Server or PowerBI for reporting and data analysis purposes.

See more
Replies (3)
John Nguyen
Recommends
on
AirflowAirflowAWS LambdaAWS Lambda

You could also use AWS Lambda and use Cloudwatch event schedule if you know when the function should be triggered. The benefit is that you could use any language and use the respective database client.

But if you orchestrate ETLs then it makes sense to use Apache Airflow. This requires Python knowledge.

See more
Recommends
on
AirflowAirflow

Though we have always built something custom, Apache airflow (https://airflow.apache.org/) stood out as a key contender/alternative when it comes to open sources. On the commercial offering, Amazon Redshift combined with Amazon Kinesis (for complex manipulations) is great for BI, though Redshift as such is expensive.

See more
Recommends

You may want to look into a Data Virtualization product called Conduit. It connects to disparate data sources in AWS, on prem, Azure, GCP, and exposes them as a single unified Spark SQL view to PowerBI (direct query) or Tableau. Allows auto query and caching policies to enhance query speeds and experience. Has a GPU query engine and optimized Spark for fallback. Can be deployed on your AWS VM or on prem, scales up and out. Sounds like the ideal solution to your needs.

See more
karunakaran karthikeyan
Needs advice
on
DremioDremio
and
TalendTalend

I am trying to build a data lake by pulling data from multiple data sources ( custom-built tools, excel files, CSV files, etc) and use the data lake to generate dashboards.

My question is which is the best tool to do the following:

  1. Create pipelines to ingest the data from multiple sources into the data lake
  2. Help me in aggregating and filtering data available in the data lake.
  3. Create new reports by combining different data elements from the data lake.

I need to use only open-source tools for this activity.

I appreciate your valuable inputs and suggestions. Thanks in Advance.

See more
Replies (1)
Rod Beecham
Partnering Lead at Zetaris · | 3 upvotes · 67.8K views
Recommends
on
DremioDremio

Hi Karunakaran. I obviously have an interest here, as I work for the company, but the problem you are describing is one that Zetaris can solve. Talend is a good ETL product, and Dremio is a good data virtualization product, but the problem you are describing best fits a tool that can combine the five styles of data integration (bulk/batch data movement, data replication/data synchronization, message-oriented movement of data, data virtualization, and stream data integration). I may be wrong, but Zetaris is, to the best of my knowledge, the only product in the world that can do this. Zetaris is not a dashboarding tool - you would need to combine us with Tableau or Qlik or PowerBI (or whatever) - but Zetaris can consolidate data from any source and any location (structured, unstructured, on-prem or in the cloud) in real time to allow clients a consolidated view of whatever they want whenever they want it. Please take a look at www.zetaris.com for more information. I don't want to do a "hard sell", here, so I'll say no more! Warmest regards, Rod Beecham.

See more
Manage your open source components, licenses, and vulnerabilities
Learn More
Pros of CDAP
Pros of Dremio
    Be the first to leave a pro
    • 3
      Nice GUI to enable more people to work with Data
    • 2
      Connect NoSQL databases with RDBMS
    • 2
      Easier to Deploy
    • 1
      Free

    Sign up to add or upvote prosMake informed product decisions

    Cons of CDAP
    Cons of Dremio
      Be the first to leave a con
      • 1
        Works only on Iceberg structured data

      Sign up to add or upvote consMake informed product decisions

      What is CDAP?

      Cask Data Application Platform (CDAP) is an open source application development platform for the Hadoop ecosystem that provides developers with data and application virtualization to accelerate application development, address a broader range of real-time and batch use cases, and deploy applications into production while satisfying enterprise requirements.

      What is Dremio?

      Dremio—the data lake engine, operationalizes your data lake storage and speeds your analytics processes with a high-performance and high-efficiency query engine while also democratizing data access for data scientists and analysts.

      Need advice about which tool to choose?Ask the StackShare community!

      What companies use CDAP?
      What companies use Dremio?
      Manage your open source components, licenses, and vulnerabilities
      Learn More

      Sign up to get full access to all the companiesMake informed product decisions

      What tools integrate with CDAP?
      What tools integrate with Dremio?

      Sign up to get full access to all the tool integrationsMake informed product decisions

      What are some alternatives to CDAP and Dremio?
      Airflow
      Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command lines utilities makes performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress and troubleshoot issues when needed.
      Apache Spark
      Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.
      Akutan
      A distributed knowledge graph store. Knowledge graphs are suitable for modeling data that is highly interconnected by many types of relationships, like encyclopedic information about the world.
      Apache NiFi
      An easy to use, powerful, and reliable system to process and distribute data. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.
      StreamSets
      An end-to-end data integration platform to build, run, monitor and manage smart data pipelines that deliver continuous data for DataOps.
      See all alternatives