
Hadoop

Open-source software for reliable, scalable, distributed computing

What is Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Hadoop is a tool in the Databases category of a tech stack.
Hadoop is an open source tool with 9.9K GitHub stars and 6.1K GitHub forks. Here's a link to Hadoop's open source repository on GitHub: https://github.com/apache/hadoop
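To give a feel for the "simple programming models" mentioned above, here is the classic MapReduce word-count job written against Hadoop's org.apache.hadoop.mapreduce API. It is a minimal sketch: the job name is illustrative and the input and output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A jar containing this class is submitted to the cluster with something like `hadoop jar wordcount.jar WordCount /input /output`; the framework schedules the map and reduce tasks across the cluster while HDFS provides the distributed storage.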

Who uses Hadoop?

Companies
321 companies reportedly use Hadoop in their tech stacks, including Airbnb, Uber, and Netflix.

Developers
728 developers on StackShare have stated that they use Hadoop.

Hadoop Integrations

Azure Cosmos DB, Apache Flink, Presto, Apache Hive, and Apache Zeppelin are some of the popular tools that integrate with Hadoop. Here's a list of all 30 tools that integrate with Hadoop.

Why do developers like Hadoop?

Here's a list of reasons why companies and developers use Hadoop.

Hadoop Reviews

Here are some stack decisions, common use cases and reviews by companies and developers who chose Hadoop in their tech stack.

Kafka
Hadoop

The early data ingestion pipeline at Pinterest used Kafka as the central message transporter, with app servers writing messages directly to Kafka; log files were then uploaded from Kafka to S3.

For databases, a custom Hadoop streamer pulled database data and wrote it to S3.

Challenges cited for this infrastructure included high operational overhead, as well as potential data loss when Kafka broker outages overflowed the in-memory message buffers.
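As a rough illustration of the "app servers writing messages directly to Kafka" step, here is a minimal producer sketch using the plain Kafka Java client; the broker address, topic name, key, and payload are hypothetical and not Pinterest-specific.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventLogger {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "kafka-broker-1:9092");   // hypothetical broker address
    props.put("acks", "all");                                 // favor durability over latency
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    try (Producer<String, String> producer = new KafkaProducer<>(props)) {
      // The app server publishes a log event; a separate consumer later batches topics to S3.
      producer.send(new ProducerRecord<>("app_events", "user-123", "{\"action\":\"pin_created\"}"));
    }
  }
}
```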

Conor Myhrvold
Tech Brand Mgr, Office of CTO at Uber | 5 upvotes · 155.5K views
at Uber Technologies
Kafka
Kafka Manager
Hadoop
Apache Spark
GitHub

Why we built Marmaray, an open source, generic data ingestion and dispersal framework and library for Apache Hadoop:

Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse it to any sink, leveraging Apache Spark. The name Marmaray comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:

https://eng.uber.com/marmaray-hadoop-ingestion-open-source/

(Direct GitHub repo: https://github.com/uber/marmaray)
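As a rough sketch of the plug-in idea described above, the core of such a framework can be reduced to a source abstraction and a sink abstraction wired together on top of Spark. The Source, Sink, and IngestionJob types below are hypothetical illustrations, not Marmaray's actual API; see the GitHub repo above for the real interfaces.

```java
import java.io.Serializable;

import org.apache.spark.api.java.JavaRDD;

// Hypothetical plug-in interfaces illustrating the "any source to any sink" idea.
interface Source<T> extends Serializable {
  JavaRDD<T> read();                 // ingest records from e.g. Kafka or a database dump
}

interface Sink<T> extends Serializable {
  void write(JavaRDD<T> records);    // disperse records to e.g. Hive or HDFS
}

// A generic job simply wires a source plug-in to a sink plug-in on top of Spark.
class IngestionJob<T> {
  void run(Source<T> source, Sink<T> sink) {
    sink.write(source.read());
  }
}
```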

StackShare Editors
Puppet Labs
Hadoop
Qubole

By mid-2014, around the time of the Series F, Pinterest users had already created more than 30 billion Pins, and the company was logging around 20 terabytes of new data daily, with around 10 petabytes of data in S3. To drive personalization for its users, and to empower engineers to build big data applications quickly, the data team built a self-serve Hadoop platform.

To start, they decoupled compute from storage, which meant teams would have to worry less about loading or synchronizing data, allowing existing or future clusters to make use of the data across a single shared file system.

A centralized Hive metastore acts as the source of truth. They chose Hive for most of their Hadoop jobs “primarily because the SQL interface is simple and familiar to people across the industry.”
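To show why the Hive SQL interface feels familiar, here is a minimal sketch of running a query against HiveServer2 over JDBC from Java; the host, credentials, and table name are hypothetical, and the hive-jdbc driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");          // register the Hive JDBC driver
    String url = "jdbc:hive2://hive-server:10000/default";     // hypothetical HiveServer2 endpoint

    try (Connection conn = DriverManager.getConnection(url, "analyst", "");
         Statement stmt = conn.createStatement();
         // "pins" is an illustrative table name, not an actual Pinterest schema.
         ResultSet rs = stmt.executeQuery(
             "SELECT board_id, COUNT(*) AS pins FROM pins GROUP BY board_id LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}
```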

Dependency management takes place across three layers: Baked AMIs, which are large, slow-loading dependencies pre-loaded on images; Automated Configurations (Masterless Puppets), which allows Puppet clients to “pull their configuration from S3 and set up a service that’s responsible for keeping S3 configurations in sync with the Puppet master;” and Runtime Staging on S3, which creates a working directory at runtime for each developer that pulls down its dependencies directly from S3.

Finally, they migrated their Hadoop jobs to Qubole, which “supported AWS/S3 and was relatively easy to get started on.”

StackShare Editors | 4 upvotes · 101.9K views
at Uber Technologies
Kafka
Kibana
Elasticsearch
Logstash
Hadoop

With services interacting with each other and with mobile devices, logging is important: log data feeds internal use cases like debugging as well as business use cases like dynamic pricing.

With multiple Kafka clusters, data is archived into Hadoop before expiration. Data is ingested in real time and indexed into an ELK stack, which comprises Elasticsearch, Logstash, and Kibana for searching and visualization.
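A minimal sketch of the archiving step, assuming a plain Kafka consumer that appends events to a file in HDFS; the broker, topic, and path are hypothetical, and a production pipeline would use a dedicated connector rather than hand-rolled code like this.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaToHdfsArchiver {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "kafka-broker-1:9092");   // hypothetical broker address
    props.put("group.id", "hdfs-archiver");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    // Picks up fs.defaultFS from core-site.xml on the classpath.
    FileSystem hdfs = FileSystem.get(new Configuration());

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
         FSDataOutputStream out = hdfs.create(new Path("/archive/service_logs/part-000"))) {
      consumer.subscribe(Collections.singletonList("service_logs"));   // hypothetical topic
      for (int i = 0; i < 10; i++) {                                   // bounded loop for the sketch
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> record : records) {
          out.writeBytes(record.value() + "\n");                       // append each event as a line
        }
      }
    }
  }
}
```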

StackShare Editors | 4 upvotes · 27.9K views
at Uber Technologies
Hadoop
Apache Spark
Presto

To improve platform scalability and efficiency, Uber transitioned from JSON to Parquet, and built a central schema service to manage schemas and integrate different client libraries.

While the first-generation big data platform was vulnerable to upstream data format changes, “ad hoc data ingestion jobs were replaced with a standard platform to transfer all source data in its original, nested format into the Hadoop data lake.”

These platform changes let Uber meet the scaling challenges it was facing around that time: “On a daily basis, there were tens of terabytes of new data added to our data lake, and our Big Data platform grew to over 10,000 vcores with over 100,000 running batch jobs on any given day.”
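A minimal sketch of the JSON-to-Parquet conversion using Spark's Java API; the paths and application name are placeholders, and this stands in for the schema-aware ingestion described above rather than reproducing it.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JsonToParquet {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("json-to-parquet")        // illustrative application name
        .getOrCreate();

    // Read raw JSON events and rewrite them as columnar Parquet files;
    // both paths are hypothetical locations in the data lake.
    Dataset<Row> events = spark.read().json("hdfs:///raw/trip_events/");
    events.write().mode("overwrite").parquet("hdfs:///lake/trip_events_parquet/");

    spark.stop();
  }
}
```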

StackShare Editors | 3 upvotes · 14K views
at Uber Technologies
Hadoop
Apache Spark
Presto

Around 2015, the growing use of Uber’s data exposed limitations in the ETL and Vertica-centric setup, not to mention the increasing costs. “As our company grew, scaling our data warehouse became increasingly expensive. To cut down on costs, we started deleting older, obsolete data to free up space for new data.”

To overcome these challenges, Uber rebuilt their big data platform around Hadoop. “More specifically, we introduced a Hadoop data lake where all raw data was ingested from different online data stores only once and with no transformation during ingestion.”

“In order for users to access data in Hadoop, we introduced Presto to enable interactive ad hoc user queries, Apache Spark to facilitate programmatic access to raw data (in both SQL and non-SQL formats), and Apache Hive to serve as the workhorse for extremely large queries.”
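To show what an "interactive ad hoc user query" looks like in practice, here is a minimal sketch of querying the Hadoop data lake through Presto's JDBC driver; the coordinator address, catalog, and table are hypothetical, and the presto-jdbc artifact is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PrestoAdHocQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("com.facebook.presto.jdbc.PrestoDriver");       // register the Presto JDBC driver
    // Coordinator endpoint, catalog, and schema are illustrative.
    String url = "jdbc:presto://presto-coordinator:8080/hive/default";

    try (Connection conn = DriverManager.getConnection(url, "analyst", null);
         Statement stmt = conn.createStatement();
         // "trips" is an illustrative table name, not an actual Uber schema.
         ResultSet rs = stmt.executeQuery(
             "SELECT city, COUNT(*) AS trips FROM trips GROUP BY city ORDER BY trips DESC LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString("city") + "\t" + rs.getLong("trips"));
      }
    }
  }
}
```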


Hadoop Alternatives & Comparisons

What are some alternatives to Hadoop?
Cassandra
Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent manner. Cassandra will automatically repartition as machines are added to and removed from the cluster. Row store means that, like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL.
MongoDB
MongoDB stores data in JSON-like documents that can vary in structure, offering a dynamic, flexible schema. MongoDB was also designed for high availability and scalability, with built-in replication and auto-sharding.
Elasticsearch
Elasticsearch is a distributed, RESTful search and analytics engine capable of storing data and searching it in near real time. Elasticsearch, Kibana, Beats and Logstash are the Elastic Stack (sometimes called the ELK Stack).
Splunk
Splunk Inc. provides the leading platform for Operational Intelligence. Customers use Splunk to search, monitor, analyze and visualize machine data.
HBase
Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Apache Hadoop.
See all alternatives

Hadoop's Followers
909 developers follow Hadoop to keep up with related blogs and decisions.
Alex Gauthier
William Javier Trigos Guevara
gsm1011
Midhun Suraj
Kyle Prifogle
Ryan McCall
Yevhen Lebid
Tristan Gilford
Kalyan Raghu
Renae F