Alternatives to CDAP

Apache Spark, Amazon Athena, Apache Flink, Apache Hive, and Druid are the most popular alternatives and competitors to CDAP.

What is CDAP and what are its top alternatives?

Cask Data Application Platform (CDAP) is an open source application development platform for the Hadoop ecosystem that provides developers with data and application virtualization to accelerate application development, address a broader range of real-time and batch use cases, and deploy applications into production while satisfying enterprise requirements.
CDAP is a tool in the Big Data Tools category of a tech stack.
CDAP is an open source tool with 405 GitHub stars and 216 GitHub forks. Here's a link to CDAP's open source repository on GitHub.

CDAP alternatives & related posts

related Apache Spark posts

Eric Colson
Chief Algorithms Officer at Stitch Fix | 19 upvotes 622.1K views
Tools: Kafka, PostgreSQL, Amazon S3, Apache Spark, Presto, Python, R Language, PyTorch, Docker, Amazon EC2 Container Service
#AWS #Etl #ML #DataScience #DataStack #Data

The algorithms and data infrastructure at Stitch Fix are housed in #AWS. Data acquisition is split between events flowing through Kafka and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on YARN is our tool of choice for data movement and #ETL. Because our storage layer (S3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling YARN clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for ad hoc queries and dashboards.

Beyond data movement and ETL, most #ML-centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in-house and open sourced (see https://github.com/stitchfix/flotilla-os).

At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated into our systems. That requires a serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize the models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn) by automatically packaging them as Docker containers and deploying them to Amazon ECS. This gives our data scientists a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.

#DataScience #DataStack #Data

Conor Myhrvold
Tech Brand Mgr, Office of CTO at Uber | 7 upvotes 283.8K views
Tools: Kafka, Kafka Manager, Hadoop, Apache Spark, GitHub

Why we built Marmaray, an open source generic data ingestion and dispersal framework and library for Apache Hadoop:

Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse it to any sink using Apache Spark. The name, Marmaray, comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:

https://eng.uber.com/marmaray-hadoop-ingestion-open-source/

(Direct GitHub repo: https://github.com/uber/marmaray)

related Amazon Athena posts

Tools: Amazon Athena, Google BigQuery

I use Amazon Athena because, similar to Google BigQuery, it lets you store and query data easily. In particular, since you can define the data schema in the Glue data catalog, there is a central place to define data models.

However, I would not recommend it for batch jobs. I typically use it to check intermediary datasets in data engineering workloads. It's good for getting a look and feel of the data along its ETL journey.

Apache Flink

Fast and reliable large-scale data processing engine

related Apache Flink posts

Surabhi Bhawsar
Technical Architect at Pepcus | 6 upvotes 300.3K views
Tools: Kafka, Apache Flink

I need to build an alert and notification framework that runs as a scheduled program. We will analyze events from a database table, filter those that fall within a one-day timespan, and send the event messages by email. Currently, we use Kafka Pub/Sub for messaging. The customer wants us to move to Apache Flink, and I am trying to understand how Apache Flink could be a better fit for us.
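As a rough illustration of the filtering step described in this question, the sketch below selects events from the last 24 hours and renders them as an email body. It is plain Python and does not use Kafka or Flink; the `Event` fields, the function names, and the sample data are all assumptions made for the example, not part of the poster's system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    # Hypothetical shape of one row from the events table (assumed fields).
    event_id: int
    message: str
    created_at: datetime


def events_in_last_day(events, now):
    """Return the events created within the 24 hours ending at `now`."""
    cutoff = now - timedelta(days=1)
    return [e for e in events if cutoff < e.created_at <= now]


def build_email_body(events):
    """Render the selected events as plain-text lines for a single email."""
    return "\n".join(f"[{e.created_at.isoformat()}] {e.message}" for e in events)


# Sample data standing in for rows read from the database table.
now = datetime(2024, 1, 2, 12, 0, 0)
events = [
    Event(1, "disk usage high", datetime(2024, 1, 2, 9, 30, 0)),
    Event(2, "old alert", datetime(2023, 12, 30, 8, 0, 0)),
    Event(3, "login failures", datetime(2024, 1, 1, 13, 0, 0)),
]

recent = events_in_last_day(events, now)
body = build_email_body(recent)
print(body)
```

A scheduled job would call `events_in_last_day` on each run and pass the result to whatever mail sender is in use. A stream processor would express the same one-day window over a continuous stream rather than a table scan.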

Apache Hive

Data Warehouse Software for Reading, Writing, and Managing Large Datasets

related Apache Hive posts

Ashish Singh
Tech Lead, Big Data Platform at Pinterest | 26 upvotes 185.2K views
Tools: Apache Hive, Kubernetes, Kafka, Amazon S3, Amazon EC2, Presto
#DataScience #DataEngineering #AWS #BigData

To meet employees' critical need for interactive querying, we've worked with Presto, an open-source distributed SQL query engine, over the years. Operating Presto at Pinterest's scale has involved resolving quite a few challenges, such as supporting deeply nested and huge Thrift schemas, slow/bad worker detection and remediation, auto-scaling clusters, graceful cluster shutdown, and impersonation support for the LDAP authenticator.

Our infrastructure is built on top of Amazon EC2 and we leverage Amazon S3 for storing our data. This separates compute and storage layers, and allows multiple compute clusters to share the S3 data.

We have hundreds of petabytes of data and tens of thousands of Apache Hive tables. Our Presto clusters comprise a fleet of 450 r4.8xl EC2 instances. Together, the Presto clusters have over 100 TB of memory and 14K vCPU cores. Within Pinterest, we have around 1,000 monthly active users (out of a total of 1,600+ Pinterest employees) using Presto, who run about 400K queries on these clusters per month.
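The fleet totals quoted above can be sanity-checked with a little arithmetic. The sketch below assumes the published r4.8xlarge specification of 32 vCPUs and 244 GiB of memory per instance, which the post itself does not state, and takes the user count as a round 1,000.

```python
# Assumption (not stated in the post): each r4.8xlarge instance has
# 32 vCPUs and 244 GiB of memory.
INSTANCES = 450
VCPUS_PER_INSTANCE = 32
MEMORY_GIB_PER_INSTANCE = 244

total_vcpus = INSTANCES * VCPUS_PER_INSTANCE
total_memory_tib = INSTANCES * MEMORY_GIB_PER_INSTANCE / 1024

# Figures taken from the post, with the user count rounded to 1,000.
QUERIES_PER_MONTH = 400_000
MONTHLY_ACTIVE_USERS = 1_000
queries_per_user = QUERIES_PER_MONTH / MONTHLY_ACTIVE_USERS

print(f"total vCPUs: {total_vcpus}")
print(f"total memory: {total_memory_tib:.1f} TiB")
print(f"queries per active user per month: {queries_per_user:.0f}")
```

Under these assumptions the fleet works out to 14,400 vCPUs and roughly 107 TiB of memory, which is consistent with the "14K" and "over 100 TB" figures in the post.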

Each query submitted to a Presto cluster is logged to a Kafka topic via Singer. Singer is a logging agent built at Pinterest, and we talked about it in a previous post. Each query is logged when it is submitted and when it finishes. When a Presto cluster crashes, we will have query submitted events without corresponding query finished events. These events enable us to capture the effect of cluster crashes over time.
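The matching of submitted and finished events can be sketched in a few lines. This is a minimal illustration only: the event fields (`query_id`, `cluster`, `type`) are a hypothetical schema, and an in-memory list stands in for the Kafka topic, so no Kafka or Singer APIs are used.

```python
from collections import Counter


def unfinished_queries(events):
    """Map each query ID that was submitted but never finished to its cluster."""
    submitted = {}
    finished = set()
    for event in events:
        if event["type"] == "submitted":
            submitted[event["query_id"]] = event["cluster"]
        elif event["type"] == "finished":
            finished.add(event["query_id"])
    return {qid: cluster for qid, cluster in submitted.items() if qid not in finished}


# Hypothetical events, as they might be read back from the log topic.
events = [
    {"query_id": "q1", "cluster": "adhoc", "type": "submitted"},
    {"query_id": "q1", "cluster": "adhoc", "type": "finished"},
    {"query_id": "q2", "cluster": "adhoc", "type": "submitted"},
    {"query_id": "q3", "cluster": "batch", "type": "submitted"},
    {"query_id": "q3", "cluster": "batch", "type": "finished"},
    {"query_id": "q4", "cluster": "adhoc", "type": "submitted"},
]

lost = unfinished_queries(events)
lost_per_cluster = Counter(lost.values())
print(lost)
print(lost_per_cluster)
```

In a real pipeline, queries that are still running also lack a finished event, so a check like this would presumably be applied only to events older than some cutoff.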

Each Presto cluster at Pinterest has workers on a mix of dedicated AWS EC2 instances and Kubernetes pods. The Kubernetes platform lets us add and remove workers from a Presto cluster very quickly. The best-case latency for bringing up a new worker on Kubernetes is less than a minute. However, when the Kubernetes cluster itself is out of resources and needs to scale up, it can take up to ten minutes. Another advantage of deploying on the Kubernetes platform is that our Presto deployment becomes agnostic of cloud vendor, instance types, OS, etc.

#BigData #AWS #DataScience #DataEngineering

Druid

Fast column-oriented distributed data store

Presto

Distributed SQL Query Engine for Big Data

AWS Glue

Fully managed extract, transform, and load (ETL) service

Apache Impala

Real-time Query for Hadoop