Alternatives to Apache Kudu

Cassandra, HBase, Apache Spark, Apache Impala, and Hadoop are the most popular alternatives and competitors to Apache Kudu.

What is Apache Kudu and what are its top alternatives?

A new addition to the open source Apache Hadoop ecosystem, Kudu completes Hadoop's storage layer to enable fast analytics on fast data.
Apache Kudu is a tool in the Big Data Tools category of a tech stack.
Apache Kudu is an open source tool with 800 GitHub stars and 268 GitHub forks. Here’s a link to Apache Kudu's open source repository on GitHub.

Apache Kudu alternatives & related posts

Cassandra

A partitioned row store. Rows are organized into tables with a required primary key.
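The required primary key is what makes the partitioning work: its first component (the partition key) is hashed to decide which node owns the row. A minimal sketch of that idea in Python (a hypothetical illustration of the concept, not Cassandra's actual token-ring implementation):

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical 3-node cluster

def owner(partition_key: str) -> str:
    """Hash the partition key and map it onto a node, as a stand-in
    for Cassandra's consistent-hashing token ring."""
    token = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return NODES[token % len(NODES)]

# Rows with the same partition key always land on the same node,
# which is why every table requires a primary key.
assert owner("user:42") == owner("user:42")
print(owner("user:42"), owner("user:99"))
```

Adding nodes grows write capacity because the key space is simply spread over more owners (the real ring also handles replication and rebalancing).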
related Cassandra posts

Thierry Schellenbach
CEO at Stream · 17 upvotes · 21.6K views

1.0 of Stream leveraged Cassandra for storing the feed. Cassandra is a common choice for building feeds. Instagram, for instance, started out with Redis but eventually switched to Cassandra to handle their rapid usage growth. Cassandra can handle write-heavy workloads very efficiently.

Cassandra is a great tool that allows you to scale write capacity simply by adding more nodes, though it is also very complex. This complexity made it hard to diagnose performance fluctuations. Even though we had years of experience with running Cassandra, it still felt like a bit of a black box. When building Stream 2.0 we decided to go for a different approach and build Keevo. Keevo is our in-house key-value store built upon RocksDB, gRPC and Raft.

RocksDB is a highly performant embeddable database library developed and maintained by Facebook’s data engineering team. RocksDB started as a fork of Google’s LevelDB that introduced several performance improvements for SSDs. Nowadays RocksDB is a project on its own and is under active development. It is written in C++ and it’s fast. Have a look at how this benchmark handles 7 million QPS. In terms of technology it’s much simpler than Cassandra.

This translates into reduced maintenance overhead, improved performance and, most importantly, more consistent performance. It’s interesting to note that LinkedIn also uses RocksDB for their feed.
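Part of why RocksDB is simpler is its log-structured merge-tree design: writes go into an in-memory memtable and are periodically flushed to immutable sorted runs on disk. A toy sketch of that write path (illustrative only, nothing like the real C++ implementation):

```python
class TinyLSM:
    """Toy LSM-style store: an in-memory memtable plus flushed sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []              # flushed runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Flush: freeze the memtable as an immutable sorted run.
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:        # freshest data first
            return self.memtable[key]
        for run in reversed(self.sstables):  # then newest flushed run wins
            if key in run:
                return run[key]
        return None

db = TinyLSM()
db.put("a", 1)
db.put("b", 2)   # reaches the limit and triggers a flush
db.put("a", 3)   # newer value shadows the flushed one
print(db.get("a"), db.get("b"))  # 3 2
```

The real engine adds write-ahead logging, compaction of the runs, and bloom filters, but the memtable-then-flush shape is the core of it.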

#InMemoryDatabases #DataStores #Databases

HBase

The Hadoop database, a distributed, scalable, big data store

related Apache Spark posts

Eric Colson
Chief Algorithms Officer at Stitch Fix · 19 upvotes · 209.1K views

The algorithms and data infrastructure at Stitch Fix are housed in #AWS. Data acquisition is split between events flowing through Kafka and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on Yarn is our tool of choice for data movement and #ETL. Because our storage layer (S3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling Yarn clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for ad hoc queries and dashboards.

Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).

At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated into systems. That requires a serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn) by automatically packaging them as Docker containers and deploying them to Amazon ECS. This gives our data scientists a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.

For more info:

#DataScience #DataStack #Data

Conor Myhrvold
Tech Brand Mgr, Office of CTO at Uber · 4 upvotes · 97.1K views

Why we built Marmaray, an open source generic data ingestion and dispersal framework and library for Apache Hadoop:

Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse to any sink leveraging Apache Spark. The name, Marmaray, comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:

https://eng.uber.com/marmaray-hadoop-ingestion-open-source/

(Direct GitHub repo: https://github.com/uber/marmaray)
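The any-source-to-any-sink idea behind Marmaray can be pictured as a small plug-in registry where sources produce records and sinks consume them. The sketch below is a hypothetical illustration of that pattern in Python, not Marmaray's actual API (which is JVM-based and built on Spark):

```python
# Hypothetical plug-in registry: sources yield records, sinks consume them.
SOURCES, SINKS = {}, {}

def source(name):
    def register(fn):
        SOURCES[name] = fn
        return fn
    return register

def sink(name):
    def register(fn):
        SINKS[name] = fn
        return fn
    return register

@source("kafka")
def kafka_source():
    # Stand-in for consuming records from a Kafka topic.
    yield {"rider": "r1", "fare": 12.5}
    yield {"rider": "r2", "fare": 8.0}

@sink("hive")
def hive_sink(records):
    # Stand-in for writing rows to a Hive table; just collect them.
    return list(records)

def run_pipeline(src_name, sink_name):
    """Connect any registered source to any registered sink."""
    return SINKS[sink_name](SOURCES[src_name]())

print(run_pipeline("kafka", "hive"))
```

Adding a new source or sink means registering one more plug-in; existing pipelines are untouched, which is the point of the design.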

Apache Impala

Real-time Query for Hadoop
Hadoop

Open-source software for reliable, scalable, distributed computing

related Hadoop posts

StackShare Editors
at Uber Technologies · 4 upvotes · 23.5K views

With services interacting with each other and with mobile devices, logging is important: it provides information for internal cases like debugging and business cases like dynamic pricing.

With multiple Kafka clusters, data is archived into Hadoop before expiration. Data is ingested in real time and indexed into an ELK stack. The ELK stack comprises Elasticsearch, Logstash, and Kibana for searching and visualization.
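That dual path (archive everything into Hadoop, keep a searchable copy in Elasticsearch) amounts to fanning each consumed message out to two writers. A toy sketch with in-memory stand-ins for Kafka, the data lake, and the search index (all names here are hypothetical):

```python
archive = []        # stand-in for the Hadoop data lake
search_index = {}   # stand-in for the Elasticsearch index

def consume(message):
    """Fan each consumed message out to both the archive and the index."""
    archive.append(message)                       # durable long-term copy
    search_index[message["id"]] = message["log"]  # searchable copy

# Stand-in for messages drained from a Kafka topic before expiration.
for msg in [{"id": 1, "log": "trip started"},
            {"id": 2, "log": "surge pricing applied"}]:
    consume(msg)

print(len(archive), search_index[2])
```

The real pipeline does this at scale with consumer groups and batch jobs, but the shape of the fan-out is the same.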

StackShare Editors

Since the beginning, Cal Henderson has been the CTO of Slack. Earlier this year, he commented on a Quora question summarizing their current stack.

Apps
  • Web: a mix of JavaScript/ES6 and React.
  • Desktop: Electron, to ship the web app as a desktop application.
  • Android: a mix of Java and Kotlin.
  • iOS: written in a mix of Objective C and Swift.
Backend
  • The core application and the API written in PHP/Hack that runs on HHVM.
  • The data is stored in MySQL using Vitess.
  • Caching is done using Memcached and MCRouter.
  • The search service takes help from SolrCloud, with various Java services.
  • The messaging system uses WebSockets with many services in Java and Go.
  • Load balancing is done using HAproxy with Consul for configuration.
  • Most services talk to each other over gRPC; some use Thrift or JSON-over-HTTP.
  • Voice and video calling service was built in Elixir.
Data warehouse
  • Built using open source tools including Presto, Spark, Airflow, Hadoop and Kafka.
Etc
Druid

Fast column-oriented distributed data store

related Amazon Athena posts


I use Amazon Athena because, similar to Google BigQuery, you can store and query data easily. Especially since you can define the data schema in the Glue data catalog, there's a central way to define data models.

However, I would not recommend it for batch jobs. I typically use it to check intermediary datasets in data engineering workloads. It's good for getting a look and feel of the data along its ETL journey.


related Apache Flink posts

Surabhi Bhawsar
Technical Architect at Pepcus · 6 upvotes · 15.6K views

I need to build an alert & notification framework using a scheduled program. We will analyze the events from a database table, filter the events that fall within a one-day timespan, and send these event messages over email. Currently we are using Kafka Pub/Sub for messaging. The customer wants us to move to Apache Flink, and I am trying to understand how Apache Flink could be a better fit for us.
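Whichever engine ends up running it, the core of the scheduled job is a time-window filter over the events table. A plain-Python sketch of that filter (the event shape, field names, and the email step are hypothetical; Flink would express the same window over a stream instead of a batch):

```python
from datetime import datetime, timedelta

def events_to_alert(events, now, window=timedelta(days=1)):
    """Return the events whose timestamp falls within the last day,
    i.e. the ones the scheduled job should email out."""
    return [e for e in events if timedelta(0) <= now - e["ts"] <= window]

now = datetime(2020, 1, 2, 12, 0)
events = [
    {"id": 1, "ts": datetime(2020, 1, 2, 9, 0)},    # within a day -> alert
    {"id": 2, "ts": datetime(2019, 12, 30, 9, 0)},  # too old -> skip
]
due = events_to_alert(events, now=now)
print([e["id"] for e in due])  # [1]
```

The main thing Flink would change is *when* this runs: instead of a cron-style batch re-scanning the table, the window is evaluated continuously as events arrive.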

Presto

Distributed SQL Query Engine for Big Data

related Presto posts

StackShare Editors
at Uber Technologies · 4 upvotes · 13.9K views

To improve platform scalability and efficiency, Uber transitioned from JSON to Parquet, and built a central schema service to manage schemas and integrate different client libraries.

While the first-generation big data platform was vulnerable to upstream data format changes, “ad hoc data ingestion jobs were replaced with a standard platform to transfer all source data in its original, nested format into the Hadoop data lake.”

These platform changes enabled Uber to meet the scaling challenges it was facing around that time: “On a daily basis, there were tens of terabytes of new data added to our data lake, and our Big Data platform grew to over 10,000 vcores with over 100,000 running batch jobs on any given day.”
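The JSON-to-Parquet move is essentially a switch from row-oriented to column-oriented layout, which lets queries read only the columns they need. A minimal illustration of that transposition in plain Python (Parquet itself adds typed schemas, encodings, and compression on top; the sample data is hypothetical):

```python
# Row-oriented records, as they might arrive from a JSON ingestion job.
rows = [
    {"city": "SF",  "trips": 120},
    {"city": "NYC", "trips": 340},
    {"city": "LA",  "trips": 95},
]

def to_columnar(rows):
    """Transpose row-oriented records into column vectors,
    the core idea behind Parquet's storage layout."""
    return {key: [r[key] for r in rows] for key in rows[0]}

cols = to_columnar(rows)
# A query that only needs `trips` can now skip the `city` column entirely.
print(sum(cols["trips"]))  # 555
```

Storing nested data in its original shape, as the quote describes, is what Parquet's record-shredding adds over this flat sketch.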

Apache Hive

Data Warehouse Software for Reading, Writing, and Managing Large Datasets

AWS Glue

Fully managed extract, transform, and load (ETL) service

Amazon Redshift Spectrum

Exabyte-Scale In-Place Queries of S3 Data

Talend

A single, unified suite for all integration needs

Vertica

Storage platform designed to handle large volumes of data

Apache Parquet

A free and open-source column-oriented data storage format

Hue

An open source SQL Workbench for Data Warehouses

Mule

Revolutionizing the way the world connects data and applications

Azure Data Factory

Create, Schedule, & Manage Data Pipelines