What is Apache Kudu and what are its top alternatives?
Apache Kudu is an open-source data storage engine that combines fast analytics with fast data ingestion. It is designed for analytical workloads such as SQL-based reporting and machine learning. Key features of Apache Kudu include columnar storage, real-time updates, support for diverse workloads, and seamless integration with Apache Spark and Apache Impala. However, its limitations include high memory usage for certain workloads and a lack of support for complex transactions.
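To make the real-time update path concrete, here is a minimal sketch using Kudu's Java client. The master address, table name, and columns (a `metrics` table with `host`/`ts`/`cpu`) are illustrative assumptions, not something from the original text:

```java
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.PartialRow;
import org.apache.kudu.client.Upsert;

public class KuduUpsertSketch {
    public static void main(String[] args) throws KuduException {
        // Hypothetical master address and table; adjust for a real cluster.
        KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
        try {
            KuduTable table = client.openTable("metrics");
            KuduSession session = client.newSession(); // AUTO_FLUSH_SYNC by default

            Upsert upsert = table.newUpsert();          // insert-or-update in one call
            PartialRow row = upsert.getRow();
            row.addString("host", "web-01");
            row.addLong("ts", System.currentTimeMillis());
            row.addDouble("cpu", 0.42);
            session.apply(upsert);                      // flushed synchronously

            session.close();
        } finally {
            client.close();
        }
    }
}
```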
Cloudera Impala: Cloudera Impala is an open-source, massively parallel processing SQL query engine for large-scale data stored in Apache Hadoop clusters. Key features include fast query performance, integration with various BI tools, and support for complex queries. Pros include fast query speeds, while cons include limited support for real-time data ingestion compared to Apache Kudu.
Apache HBase: Apache HBase is an open-source, distributed, scalable, Big Data store that runs on top of the Hadoop Distributed File System (HDFS). Key features include random, real-time read/write access to Big Data, linear and modular scalability, and automatic sharding. Pros include fast read and write access, while cons include performance limitations for analytical workloads compared to Apache Kudu.
Druid: Apache Druid is a high-performance, column-oriented, distributed data store for real-time analytics on large datasets. Key features include low-latency queries, scalable infrastructure, and support for time-series data. Pros include real-time data ingestion, while cons include limited support for transactional workloads compared to Apache Kudu.
ClickHouse: ClickHouse is an open-source, column-oriented database management system that allows for real-time analytics, interactive queries, and scalable data storage. Key features include high performance, horizontal scalability, and native support for various data formats. Pros include efficient query execution, while cons include limited support for complex transactions compared to Apache Kudu.
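As a hedged illustration of the real-time analytics style ClickHouse targets, here is a minimal JDBC sketch. The `events` table, the server address, and having the `com.clickhouse:clickhouse-jdbc` driver on the classpath are all assumptions for the example:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ClickHouseQuerySketch {
    public static void main(String[] args) throws Exception {
        // Assumes a ClickHouse server on the default HTTP port 8123
        // and a hypothetical "events" table with a DateTime column "ts".
        String url = "jdbc:clickhouse://localhost:8123/default";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT toStartOfMinute(ts) AS minute, count() AS hits " +
                 "FROM events GROUP BY minute ORDER BY minute")) {
            while (rs.next()) {
                System.out.println(rs.getTimestamp("minute") + "  " + rs.getLong("hits"));
            }
        }
    }
}
```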
Cassandra: Apache Cassandra is a highly scalable, distributed NoSQL database designed to handle large amounts of data across many commodity servers. Key features include linear scalability, continuous availability, and flexible data storage. Pros include high availability and fault tolerance, while cons include limited support for real-time analytics compared to Apache Kudu.
InfluxDB: InfluxDB is an open-source, time-series database designed for fast, high-availability storage and retrieval of time-series data. Key features include high performance, efficient storage, and support for data visualization tools. Pros include native support for time-series data, while cons include limited support for complex queries compared to Apache Kudu.
Vertica: Vertica is a commercial, column-oriented, relational database management system designed for Big Data analytics. Key features include high performance, scalability, and advanced analytics capabilities. Pros include support for advanced analytics functions, while cons include licensing costs, in contrast to the open-source Apache Kudu.
Greenplum: Greenplum is an open-source, massively parallel processing data platform based on PostgreSQL. Key features include scalable architecture, support for complex SQL queries, and advanced analytics capabilities. Pros include support for complex, ad-hoc queries, while cons include a steeper learning curve compared to Apache Kudu.
TiDB: TiDB is an open-source, distributed SQL database that combines the horizontal scalability of NoSQL with the ACID compliance of traditional RDBMS. Key features include distributed transactions, SQL support, and horizontal scalability. Pros include scalability and horizontal sharding, while cons include performance limitations compared to Apache Kudu.
MemSQL: MemSQL is a distributed, in-memory, SQL-compatible database that provides high performance and scalability for real-time analytics workloads. Key features include in-memory processing, high availability, and support for distributed SQL queries. Pros include fast query performance, while cons include limited support for complex data types compared to Apache Kudu.
Top Alternatives to Apache Kudu
- Cassandra
Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent manner. Cassandra will automatically repartition as machines are added and removed from the cluster. Row store means that like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL (see the CQL sketch after this list). ...
- HBase
Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Apache Hadoop. ...
- Apache Spark
Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning. ...
- Apache Impala
Impala is a modern, open source, MPP SQL query engine for Apache Hadoop. Impala is shipped by Cloudera, MapR, and Amazon. With Impala, you can query data, whether stored in HDFS or Apache HBase – including SELECT, JOIN, and aggregate functions – in real time. ...
- Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. ...
- Druid
Druid is a distributed, column-oriented, real-time analytics data store that is commonly used to power exploratory dashboards in multi-tenant environments. Druid excels as a data warehousing solution for fast aggregate queries on petabyte-sized data sets. Druid supports a variety of flexible filters, exact calculations, approximate algorithms, and other useful calculations. ...
- Apache Ignite
It is a memory-centric distributed database, caching, and processing platform for transactional, analytical, and streaming workloads, delivering in-memory speeds at petabyte scale. ...
- ClickHouse
It allows analysis of data that is updated in real time. It offers instant results in most cases: the data is processed faster than it takes to create a query. ...
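Picking up the CQL mention in the Cassandra item above, here is a minimal sketch using the DataStax Java driver. The keyspace and table names are invented for illustration, and the driver defaults to a node on localhost:9042 when no contact point is configured:

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;

public class CqlSketch {
    public static void main(String[] args) {
        // Connects to 127.0.0.1:9042 by default.
        try (CqlSession session = CqlSession.builder().build()) {
            session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
            session.execute("CREATE TABLE IF NOT EXISTS demo.users "
                + "(user_id uuid PRIMARY KEY, name text)");
            session.execute("INSERT INTO demo.users (user_id, name) VALUES (uuid(), 'Ada')");

            Row row = session.execute("SELECT name FROM demo.users LIMIT 1").one();
            System.out.println(row.getString("name"));
        }
    }
}
```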
Apache Kudu alternatives & related posts
Cassandra
Pros of Cassandra:
- Distributed (119)
- High performance (98)
- High availability (81)
- Easy scalability (74)
- Replication (53)
- Reliable (26)
- Multi datacenter deployments (26)
- Schema optional (10)
- OLTP (9)
- Open source (8)
- Workload separation (via MDC) (2)
- Fast (1)
Cons of Cassandra:
- Reliability of replication (3)
- Size (1)
- Updates (1)
related Cassandra posts
1.0 of Stream leveraged Cassandra for storing the feed. Cassandra is a common choice for building feeds. Instagram, for instance, started out with Redis but eventually switched to Cassandra to handle their rapid usage growth. Cassandra can handle write-heavy workloads very efficiently.
Cassandra is a great tool that allows you to scale write capacity simply by adding more nodes, though it is also very complex. This complexity made it hard to diagnose performance fluctuations. Even though we had years of experience with running Cassandra, it still felt like a bit of a black box. When building Stream 2.0 we decided to go for a different approach and build Keevo. Keevo is our in-house key-value store built upon RocksDB, gRPC and Raft.
RocksDB is a highly performant embeddable database library developed and maintained by Facebook’s data engineering team. RocksDB started as a fork of Google’s LevelDB that introduced several performance improvements for SSD. Nowadays RocksDB is a project on its own and is under active development. It is written in C++ and it’s fast. Have a look at how this benchmark handles 7 million QPS. In terms of technology it’s much more simple than Cassandra.
This translates into reduced maintenance overhead, improved performance and, most importantly, more consistent performance. It’s interesting to note that LinkedIn also uses RocksDB for their feed.
#InMemoryDatabases #DataStores #Databases
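To illustrate how compact RocksDB's embeddable surface is, here is a minimal sketch with the official Java binding (rocksdbjni); the database path and key are placeholders:

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class RocksDbSketch {
    static {
        RocksDB.loadLibrary(); // load the native library once
    }

    public static void main(String[] args) throws RocksDBException {
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/rocksdb-demo")) {
            db.put("feed:1".getBytes(), "hello".getBytes()); // write a key-value pair
            byte[] value = db.get("feed:1".getBytes());      // point lookup
            System.out.println(new String(value));
        }
    }
}
```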
Trying to establish a data lake (or maybe puddle) for my org's Data Sharing project. The idea is that outside partners would send cuts of their PHI data, regardless of format/variables/systems, to our Data Team who would then harmonize the data, create data marts, and eventually use it for something. End-to-end, I'm envisioning:
- Ingestion->Secure, role-based, self-service portal for users to upload data (1a. bonus points if it can perform basic validations/masking)
- Storage->Amazon S3 seems like the cheapest. We probably won't need anything very big, even at full capacity. Our current storage is a secure Box folder that has ~4GB with several batches of test data, code, presentations, and planning docs.
- Data Catalog-> AWS Glue? Azure Data Factory? Snowplow? Is the main difference basically the vendor? We also will have Data Dictionaries/Codebooks from submitters. Where would they fit in?
- Partitions-> I've seen Cassandra and YARN mentioned, but have no experience with either
- Processing-> We want to use SAS if at all possible. What will work with SAS code?
- Pipeline/Automation->The check-in and verification processes that have been outlined are rather involved. Some sort of automated messaging or approval workflow would be nice
- I have very little guidance on what a "Data Mart" should look like, so I'm going with the idea that it would be another "experimental" partition. Unless there's an actual mart-building paradigm I've missed?
- An end user might use the catalog to pull certain de-identified data sets from the marts. Again, role-based access and a self-service GUI would be preferable.
I'm the only full-time tech person on this project, but I'm mostly an OOP, HTML, JavaScript, and some SQL programmer. Most of this is out of my repertoire. I've done a lot of research, but I can't be an effective evangelist without hands-on experience. Since we're starting a new year of our grant, they've finally decided to let me try some stuff out. Any pointers would be appreciated!
HBase
Pros of HBase:
- Performance (9)
- OLTP (5)
- Fast Point Queries (1)
related HBase posts
I am researching different querying solutions to handle ~1 trillion records of data (in the realm of a petabyte). The data is mostly textual. I have identified a few options: Milvus, HBase, RocksDB, and Elasticsearch. I was wondering if there is a good way to compare the performance of these options (or if anyone has already done something like this). I want to be able to compare the speed of ingesting and querying textual data from these tools. Does anyone have information on this or know where I can find some? Thanks in advance!
Hi, I'm building a machine learning pipeline to store image bytes and image vectors in the backend.
So, when users query for the random access image data (key), we return the image bytes and perform machine learning model operations on it.
I'm currently considering going with Amazon S3 (in the future, maybe adding a Redis caching layer) as the backend system to store the information (S3 buckets with sharded prefixes).
The latency of S3 is 100-200 ms (get/put), and it has high throughput of 3,500 puts/sec and 5,500 gets/sec for a given bucket/prefix. If I need to reduce latency in the future, I can add a Redis cache.
Also, S3 costs are far lower than HBase's (on Amazon EC2 instances with a 3x replication factor).
I have not personally used HBase before, so can someone help me decide if I'm making the right choice here? I'm not aware of HBase latencies, and I have learned that the MOB feature in HBase has to be turned on if we store image bytes in one of the column families, as the average image size is 240 KB.
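On the MOB point: below is a minimal sketch of creating a table with a MOB-enabled column family via the HBase 2.x Java admin API. The table name, family name, and 100 KB threshold are assumptions for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateMobTableSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Values larger than the MOB threshold (here 100 KB) are stored
            // in separate MOB files instead of regular HFiles.
            ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
                .newBuilder(Bytes.toBytes("img"))
                .setMobEnabled(true)
                .setMobThreshold(100 * 1024L)
                .build();
            admin.createTable(TableDescriptorBuilder
                .newBuilder(TableName.valueOf("images"))
                .setColumnFamily(cf)
                .build());
        }
    }
}
```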
Apache Spark
Pros of Apache Spark:
- Open-source (61)
- Fast and Flexible (48)
- One platform for every big data problem (8)
- Great for distributed SQL-like applications (8)
- Easy to install and to use (6)
- Works well for most data science use cases (3)
- Interactive Query (2)
- Machine learning library, streaming in real time (2)
- In-memory Computation (2)
Cons of Apache Spark:
- Speed (4)
related Apache Spark posts
The algorithms and data infrastructure at Stitch Fix is housed in #AWS. Data acquisition is split between events flowing through Kafka, and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on Yarn is our tool of choice for data movement and #ETL. Because our storage layer (S3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling Yarn clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for ad-hoc queries and dashboards.
Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).
At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated into systems. That requires a serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize those models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn), by automatically packaging them as Docker containers and deploying to Amazon ECS. This provides our data scientists a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.
For more info:
- Our Algorithms Tour: https://algorithms-tour.stitchfix.com/
- Our blog: https://multithreaded.stitchfix.com/blog/
- Careers: https://multithreaded.stitchfix.com/careers/
#DataScience #DataStack #Data
As a frontend engineer on the Algorithms & Analytics team at Stitch Fix, I work with data scientists to develop applications and visualizations to help our internal business partners make data-driven decisions. I envisioned a platform that would assist data scientists in the data exploration process, allowing them to visually explore and rapidly iterate through their assumptions, then share their insights with others. This would align with our team's philosophy of having engineers "deploy platforms, services, abstractions, and frameworks that allow the data scientists to conceive of, develop, and deploy their ideas with autonomy", and solve the pain of data exploration.
The final product, code-named Dora, is built with React, Redux.js and Victory, backed by Elasticsearch to enable fast and iterative data exploration, and uses Apache Spark to move data from our Amazon S3 data warehouse into the Elasticsearch cluster.
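As a rough sketch of that last step, here is one way to move Parquet data from S3 into Elasticsearch with Spark via the elasticsearch-hadoop connector. The bucket, index, and host names are placeholders, and this is written in Java as an illustration rather than as the team's actual code:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class S3ToElasticsearchSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("s3-to-es")
            .getOrCreate();

        // Read Parquet from an S3 data warehouse (placeholder path).
        Dataset<Row> df = spark.read().parquet("s3a://example-warehouse/events/");

        // Write to Elasticsearch via the elasticsearch-hadoop connector,
        // which must be on the classpath (org.elasticsearch:elasticsearch-spark).
        df.write()
          .format("org.elasticsearch.spark.sql")
          .option("es.nodes", "es-host:9200")
          .mode("append")
          .save("events"); // target index name

        spark.stop();
    }
}
```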
Apache Impala
Pros of Apache Impala:
- Super fast (11)
- Massively Parallel Processing (1)
- Load Balancing (1)
- Replication (1)
- Scalability (1)
- Distributed (1)
- High Performance (1)
- Open Source (1)
related Apache Impala posts
I have been working on a Java application to demonstrate the latency of select/insert/update operations on Kudu storage using the Apache Kudu API (Java client). I have a few questions about using the Apache Kudu API:
Is there a JDBC wrapper around the Apache Kudu API for getting connections to Kudu masters, with a connection pool mechanism and all DB operations?
Does the Apache Kudu API support ORDER BY, GROUP BY, and aggregate functions? If yes, how can these be implemented using the Kudu APIs?
Can we add Kudu predicates to a Kudu update operation? If yes, how?
Does the Apache Kudu API support batch insertion (executing the Kudu insert for multiple rows in one go instead of row by row), e.g. something like KuduSession.apply(List)?
Does the Apache Kudu API support joins on tables?
Which tool is preferred (Apache Impala or the Kudu API) for read and update/insert DB operations?
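On the batch-insertion question specifically: the Java client's KuduSession can buffer operations in MANUAL_FLUSH mode and send them in one round trip. A minimal sketch, assuming a master at kudu-master:7051 and a hypothetical events table with id/payload columns:

```java
import java.util.List;
import org.apache.kudu.client.Insert;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.OperationResponse;
import org.apache.kudu.client.PartialRow;
import org.apache.kudu.client.SessionConfiguration;

public class KuduBatchInsertSketch {
    public static void main(String[] args) throws KuduException {
        KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
        try {
            KuduTable table = client.openTable("events");
            KuduSession session = client.newSession();
            // MANUAL_FLUSH buffers operations client-side until flush() is called.
            session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH);
            session.setMutationBufferSpace(10_000);

            for (int i = 0; i < 1_000; i++) {
                Insert insert = table.newInsert();
                PartialRow row = insert.getRow();
                row.addLong("id", i);
                row.addString("payload", "row-" + i);
                session.apply(insert); // buffered, not yet sent
            }
            List<OperationResponse> responses = session.flush(); // sent as a batch
            session.close();
        } finally {
            client.close();
        }
    }
}
```

As far as I know there is no apply(List) overload; apply() is called per operation and flush() sends the buffered batch. Joins, GROUP BY, and aggregates are not part of the Kudu client API; those are typically done through a SQL engine such as Impala or Spark SQL running on top of Kudu, which is also the usual route to JDBC access.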
Hadoop
Pros of Hadoop:
- Great ecosystem (39)
- One stack to rule them all (11)
- Great load balancer (4)
- Amazon AWS (1)
- Java syntax (1)
related Hadoop posts
The early data ingestion pipeline at Pinterest used Kafka as the central message transporter, with the app servers writing messages directly to Kafka, which then uploaded log files to S3.
For databases, a custom Hadoop streamer pulled database data and wrote it to S3.
Challenges cited for this infrastructure included high operational overhead, as well as potential data loss occurring when Kafka broker outages led to an overflow of in-memory message buffering.
Why we built Marmaray, an open source generic data ingestion and dispersal framework and library for Apache Hadoop:
Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse to any sink leveraging the use of Apache Spark. The name, Marmaray, comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:
https://eng.uber.com/marmaray-hadoop-ingestion-open-source/
(Direct GitHub repo: https://github.com/uber/marmaray)
Druid
Pros of Druid:
- Real Time Aggregations (15)
- Batch and Real-Time Ingestion (6)
- OLAP (5)
- OLAP + OLTP (3)
- Combining stream and historical analytics (2)
- OLTP (1)
Cons of Druid:
- Limited SQL support (3)
- Joins are not supported well (2)
- Complexity (1)
related Druid posts
My background is in data analytics in the telecom domain. I have to build a database for analyzing large volumes of CDR data. So far the data has been maintained on a file server, and the application queries data from the files. This consumes a lot of resources and queries are slow, so I have been asked to come up with a new approach. I plan to rewrite the app, so which database should be used? I am torn between MongoDB and Druid.
Please advise me on which of these two to pick, and why.
My process is like this: I would get data once a month, either from Google BigQuery or as parquet files from Azure Blob Storage. I have a script that does some cleaning and then stores the result as partitioned parquet files because the following process cannot handle loading all data to memory.
The next process is making a heavy computation in a parallel fashion (per partition), and storing 3 intermediate versions as parquet files: two used for statistics, and the third filtered to create the final files.
I make a report based on the two files in a Jupyter notebook and convert it to HTML.
- Everything is done with vanilla Python and Pandas.
- Sometimes I may get a different format of data.
- The cloud service is Microsoft Azure.
What I'm considering is the following:
Get the data with Kafka or with native Python, do the first processing step, and store the data in Druid; the second processing step would be done with Apache Spark, reading the data from Druid.
The intermediate states could be stored in Druid too, and visualization would be done with Apache Superset.
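If the Spark job (or any JVM consumer) needs to read those Druid-stored results, one option is Druid's SQL endpoint over the Avatica JDBC driver. A minimal sketch, assuming a Druid Router at localhost:8888 and a hypothetical events datasource, with org.apache.calcite.avatica:avatica-core on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class DruidSqlSketch {
    public static void main(String[] args) throws Exception {
        // Druid exposes SQL through the Avatica remote JDBC protocol.
        String url = "jdbc:avatica:remote:url=http://localhost:8888/druid/v2/sql/avatica/";
        try (Connection conn = DriverManager.getConnection(url, new Properties());
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT TIME_FLOOR(__time, 'PT1H') AS hr, COUNT(*) AS cnt " +
                 "FROM events GROUP BY 1 ORDER BY 1")) {
            while (rs.next()) {
                System.out.println(rs.getString("hr") + "  " + rs.getLong("cnt"));
            }
        }
    }
}
```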
Apache Ignite
Pros of Apache Ignite:
- Written in Java, runs on the JVM (5)
- Multiple client language support (5)
- Free (5)
- High Availability (5)
- REST interface (4)
- Cluster-wide SQL query support (4)
- Load balancing (4)
- Distributed compute (3)
- Better Documentation (3)
- Easy to use (2)
- Distributed Locking (1)
related Apache Ignite posts
ClickHouse
Pros of ClickHouse:
- Fast, very very fast (21)
- Good compression ratio (11)
- Horizontally scalable (7)
- Utilizes all CPU resources (6)
- RESTful (5)
- Open-source (5)
- Great CLI (5)
- Great number of SQL functions (4)
- Highly available (3)
- Flexible connection options (3)
- ODBC (2)
- Flexible compression options (2)
Cons of ClickHouse:
- Slow insert operations (5)
- Buggy (4)
- Server crashes, it's normal :( (3)
- Has no transactions (3)
- In IDEA, data import via HTTP interface not working (1)