Hadoop
Open-source software for reliable, scalable, distributed computing

What is Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Hadoop is a tool in the Databases category of a tech stack.
Hadoop is an open source tool with 9.2K GitHub stars and 5.7K GitHub forks.
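
The "simple programming models" mentioned above most commonly means MapReduce. The following is a minimal single-process sketch of that model in plain Python, not Hadoop's actual API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. It shows the three conceptual steps of a word count: emit key/value pairs, group them by key, then combine each group.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework would between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["hadoop stores data", "hadoop processes data"]
    print(word_count(sample))
```

On a real cluster the map and reduce steps would run in parallel on many machines, each working on locally stored data, and the framework would handle the grouping step; this sketch only illustrates the shape of the computation.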

Who uses Hadoop?

Companies
237 companies use Hadoop in their tech stacks, including Airbnb, Uber, and Spotify.

Developers
116 developers use Hadoop.

Hadoop Integrations

Datadog, Couchbase, Apache Flink, Presto, and Apache Zeppelin are some of the popular tools that integrate with Hadoop; 18 tools integrate with it in total.

Hadoop Reviews

Here are some stack decisions, common use cases and reviews by companies and developers who chose Hadoop in their tech stack.

Conor Myhrvold
Tech Brand Mgr, Office of CTO at Uber Technologies · 3 upvotes · 67.7K views
Kafka Manager, Kafka, GitHub, Apache Spark, Hadoop

Why we built Marmaray, an open source generic data ingestion and dispersal framework and library for Apache Hadoop:

Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse it to any sink using Apache Spark. The name, Marmaray, comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:

https://eng.uber.com/marmaray-hadoop-ingestion-open-source/

(Direct GitHub repo: https://github.com/uber/marmaray)

John Egan
at Pinterest · 1 upvote · 6.7K views
Hadoop

The MapReduce workflow starts to process experiment data nightly when data of the previous day is copied over from Kafka. At this time, all the raw log requests are transformed into meaningful experiment results and in-depth analysis. To populate experiment data for the dashboard, we have around 50 jobs running to do all the calculations and transforms of data.

Hadoop

In 2009 we open sourced mrjob, which allows any engineer to write a MapReduce job without contending for resources. We’re only limited by the number of machines in an Amazon data center (which is an issue we’ve rarely encountered).

John Egan
at Pinterest · 1 upvote · 1.5K views
Hadoop

The massive volume of discovery data that powers Pinterest and enables people to save Pins, create boards and follow other users is generated through daily Hadoop jobs...

Robert Brown
Co-Founder at University of Cincinnati · 1 upvote · 1.2K views
Hadoop

Importing/exporting data, interpreting results. Possible integration with SAS.


Hadoop Alternatives & Comparisons

What are some alternatives to Hadoop?
Cassandra
Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent manner. Cassandra will automatically repartition as machines are added and removed from the cluster. Row store means that, like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL.
MongoDB
MongoDB stores data in JSON-like documents that can vary in structure, offering a dynamic, flexible schema. MongoDB was also designed for high availability and scalability, with built-in replication and auto-sharding.
Elasticsearch
Elasticsearch is a distributed, RESTful search and analytics engine capable of storing data and searching it in near real time. Elasticsearch, Kibana, Beats and Logstash are the Elastic Stack (sometimes called the ELK Stack).
Splunk
Splunk Inc. provides the leading platform for Operational Intelligence. Customers use Splunk to search, monitor, analyze and visualize machine data.
HBase
Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Apache Hadoop.

Hadoop's Stats