Alternatives to Corral

Apache Spark, Amazon Athena, Apache Flink, Apache Hive, and Presto are the most popular alternatives and competitors to Corral.

What is Corral and what are its top alternatives?

Corral is a MapReduce framework designed to be deployed to serverless platforms, like AWS Lambda. It presents a lightweight alternative to Hadoop MapReduce.
Corral is a tool in the Big Data Tools category of a tech stack.
Corral is an open source tool with 624 GitHub stars and 18 GitHub forks. Here's a link to Corral's open source repository on GitHub
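
For readers unfamiliar with the MapReduce model that Corral implements, the following is a minimal single-process sketch of its three phases (map, shuffle, reduce) applied to word counting. It is plain Python written for illustration only; the function names are hypothetical and none of this is Corral's API.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a serverless deployment, each map or reduce call would presumably run as a separate function invocation, with the shuffle step going through shared storage; here everything runs in memory.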

Corral alternatives & related posts

related Apache Spark posts

Eric Colson
Chief Algorithms Officer at Stitch Fix | 19 upvotes 350.6K views
Tools: Kafka, PostgreSQL, Amazon S3, Apache Spark, Presto, Python, R, PyTorch, Docker, Amazon EC2 Container Service
Tags: #AWS #Etl #ML #DataScience #DataStack #Data

The algorithms and data infrastructure at Stitch Fix are housed in #AWS. Data acquisition is split between events flowing through Kafka and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on YARN is our tool of choice for data movement and #ETL. Because our storage layer (S3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling YARN clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for ad hoc queries and dashboards.

Beyond data movement and ETL, most #ML-centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in-house and open sourced (see https://github.com/stitchfix/flotilla-os).

At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated into systems. That requires a serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan gives our data scientists the ability to quickly productionize the models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn) by automatically packaging them as Docker containers and deploying them to Amazon ECS. This gives our data scientists a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.

Conor Myhrvold
Tech Brand Mgr, Office of CTO at Uber | 5 upvotes 163K views
Tools: Kafka, Kafka Manager, Hadoop, Apache Spark, GitHub

Why we built Marmaray, an open source generic data ingestion and dispersal framework and library for Apache Hadoop:

Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse it to any sink using Apache Spark. The name, Marmaray, comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:

https://eng.uber.com/marmaray-hadoop-ingestion-open-source/

(Direct GitHub repo: https://github.com/uber/marmaray)
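
The "any source to any sink" plug-in idea described in this post can be sketched roughly as a pair of registries that a pipeline looks up by name. This is a minimal illustration in plain Python, not Marmaray's actual API and not Spark code; the registry names, the `memory` source, and the `list` sink are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]

# Hypothetical plug-in registries: names map to source and sink callables.
SOURCES: Dict[str, Callable[[], Iterable[Record]]] = {}
SINKS: Dict[str, Callable[[Iterable[Record]], int]] = {}


def register_source(name: str):
    """Decorator that registers a function producing records."""
    def decorator(func: Callable[[], Iterable[Record]]):
        SOURCES[name] = func
        return func
    return decorator


def register_sink(name: str):
    """Decorator that registers a function consuming records."""
    def decorator(func: Callable[[Iterable[Record]], int]):
        SINKS[name] = func
        return func
    return decorator


collected: List[Record] = []  # stand-in for an external system


@register_source("memory")
def memory_source() -> Iterable[Record]:
    # Illustrative data only.
    return [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]


@register_sink("list")
def list_sink(records: Iterable[Record]) -> int:
    batch = list(records)
    collected.extend(batch)
    return len(batch)  # number of records written


def run_pipeline(source_name: str, sink_name: str) -> int:
    """Move every record from the named source to the named sink."""
    return SINKS[sink_name](SOURCES[source_name]())
```

Adding a new source or sink then means registering one more function, without touching `run_pipeline`.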

related Amazon Athena posts

Tools: Amazon Athena, Google BigQuery

I use Amazon Athena because, as with Google BigQuery, you can store and query data easily. In particular, since you can define data schemas in the Glue Data Catalog, there's a central way to define data models.

However, I would not recommend it for batch jobs. I typically use it to check intermediary datasets in data engineering workloads. It's good for getting a look and feel of the data along its ETL journey.

Apache Flink
Fast and reliable large-scale data processing engine

related Apache Flink posts

Surabhi Bhawsar
Technical Architect at Pepcus | 6 upvotes 44.8K views
Tools: Kafka, Apache Flink

I need to build an Alert & Notification framework using a scheduled program. We will analyze the events from a database table, filter the events that fall within a one-day timespan, and send those event messages by email. Currently, we are using Kafka Pub/Sub for messaging. The customer wants us to move to Apache Flink, and I am trying to understand how Apache Flink could be a better fit for us.
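
The filtering step described in this question (select events from the last day, then turn them into a message) can be sketched independently of Kafka or Flink. This is a minimal plain-Python illustration; the event fields `created_at` and `message` and both function names are assumptions, and the email delivery itself is left out.

```python
from datetime import datetime, timedelta
from typing import Dict, List

Event = Dict[str, object]


def events_in_last_day(events: List[Event], now: datetime) -> List[Event]:
    """Keep only events whose timestamp falls within the 24 hours before `now`."""
    cutoff = now - timedelta(days=1)
    return [e for e in events if cutoff <= e["created_at"] <= now]


def build_digest(events: List[Event]) -> str:
    """Render the selected events, oldest first, as the body of one message."""
    ordered = sorted(events, key=lambda e: e["created_at"])
    return "\n".join(
        f"{e['created_at'].isoformat()} {e['message']}" for e in ordered
    )
```

A scheduled job would call `events_in_last_day` on rows read from the table and pass the result to `build_digest`; a streaming engine would instead apply the same predicate continuously over a time window.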

Apache Hive
Data Warehouse Software for Reading, Writing, and Managing Large Datasets

related Apache Hive posts

Ashish Singh
Tech Lead, Big Data Platform at Pinterest | 20 upvotes 33.2K views
Tools: Apache Hive, Kubernetes, Kafka, Amazon S3, Amazon EC2, Presto
Tags: #DataScience #DataEngineering #AWS #BigData

To meet employees' critical need for interactive querying, we've worked with Presto, an open-source distributed SQL query engine, over the years. Operating Presto at Pinterest's scale has involved resolving quite a few challenges, such as supporting deeply nested and huge Thrift schemas, slow/bad worker detection and remediation, cluster auto-scaling, graceful cluster shutdown, and impersonation support for the LDAP authenticator.

Our infrastructure is built on top of Amazon EC2 and we leverage Amazon S3 for storing our data. This separates compute and storage layers, and allows multiple compute clusters to share the S3 data.

We have hundreds of petabytes of data and tens of thousands of Apache Hive tables. Our Presto clusters comprise a fleet of 450 r4.8xl EC2 instances. Together, the Presto clusters have over 100 TB of memory and 14K vCPU cores. Within Pinterest, about 1,000 monthly active users (out of 1,600+ Pinterest employees in total) use Presto, running about 400K queries on these clusters per month.
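
The fleet totals quoted above can be sanity-checked with simple arithmetic. The per-instance figures below (32 vCPUs and 244 GiB of memory for an r4.8xlarge) are assumptions taken from AWS's published instance specifications rather than from the post, and the user count is rounded to 1,000.

```python
INSTANCES = 450                # from the post
VCPUS_PER_INSTANCE = 32        # assumption: r4.8xlarge spec
MEM_GIB_PER_INSTANCE = 244     # assumption: r4.8xlarge spec

QUERIES_PER_MONTH = 400_000    # from the post
MONTHLY_ACTIVE_USERS = 1_000   # from the post, rounded

# 450 * 32 = 14,400 vCPUs, consistent with the quoted "14K".
total_vcpus = INSTANCES * VCPUS_PER_INSTANCE

# 450 * 244 GiB = 109,800 GiB, a little over 107 TiB,
# consistent with the quoted "over 100 TB".
total_mem_tib = INSTANCES * MEM_GIB_PER_INSTANCE / 1024

# Roughly 400 queries per active user per month.
queries_per_user = QUERIES_PER_MONTH / MONTHLY_ACTIVE_USERS

print(total_vcpus, round(total_mem_tib, 1), queries_per_user)
```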

Each query submitted to a Presto cluster is logged to a Kafka topic via Singer, a logging agent built at Pinterest that we talked about in a previous post. Each query is logged when it is submitted and when it finishes. When a Presto cluster crashes, we will have query-submitted events without corresponding query-finished events. These events enable us to capture the effect of cluster crashes over time.
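
The pairing of submitted and finished events described here can be sketched as a simple set difference per cluster. This is a minimal in-memory illustration, assuming hypothetical event fields `type`, `query_id`, and `cluster`; it does not reflect Pinterest's actual event schema or read from Kafka.

```python
from typing import Dict, Iterable, List, Set

Event = Dict[str, str]


def unfinished_queries(events: Iterable[Event]) -> Dict[str, List[str]]:
    """Return, per cluster, the query IDs that were submitted but never finished."""
    submitted: Dict[str, str] = {}   # query_id -> cluster
    finished: Set[str] = set()

    for event in events:
        if event["type"] == "submitted":
            submitted[event["query_id"]] = event["cluster"]
        elif event["type"] == "finished":
            finished.add(event["query_id"])

    lost: Dict[str, List[str]] = {}
    for query_id, cluster in submitted.items():
        if query_id not in finished:
            lost.setdefault(cluster, []).append(query_id)
    return lost
```

Queries that are still running would also show up as unfinished, so a real analysis would presumably apply a timeout or correlate the result with known crash times before counting a query as lost.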

Each Presto cluster at Pinterest has workers on a mix of dedicated AWS EC2 instances and Kubernetes pods. The Kubernetes platform lets us add and remove workers from a Presto cluster very quickly. The best-case latency for bringing up a new worker on Kubernetes is less than a minute. However, when the Kubernetes cluster itself is out of resources and needs to scale up, it can take up to ten minutes. Another advantage of deploying on Kubernetes is that our Presto deployment becomes agnostic of cloud vendor, instance types, OS, etc.

Presto
Distributed SQL Query Engine for Big Data

Druid
Fast column-oriented distributed data store

AWS Glue
Fully managed extract, transform, and load (ETL) service

Apache Impala
Real-time Query for Hadoop

Amazon Redshift Spectrum
Exabyte-Scale In-Place Queries of S3 Data

Pig
Platform for analyzing large data sets

Apache Kudu
Fast Analytics on Fast Data. A columnar storage manager developed for the Hadoop platform

Talend
A single, unified suite for all integration needs

Vertica
Storage platform designed to handle large volumes of data

Apache Parquet
A free and open-source column-oriented data storage format

Hue
An open source SQL Workbench for Data Warehouses

Mule
Revolutionizing the way the world connects data and applications

Azure Data Factory
Hybrid data integration service that simplifies ETL at scale

Singer
Simple, Composable, Open Source ETL

Pachyderm
MapReduce without Hadoop. Analyze massive datasets with Docker.