Alternatives to Azure HDInsight

Amazon EMR, Azure Databricks, Hadoop, Azure Machine Learning, and Azure Data Factory are the most popular alternatives and competitors to Azure HDInsight.

What is Azure HDInsight and what are its top alternatives?

Azure HDInsight is a fully managed cloud-based service from Microsoft that provides Apache Hadoop and Apache Spark clusters. It allows users to process big data workloads in a cost-effective and scalable manner. Key features include support for various big data frameworks, integration with other Azure services, enterprise-grade security, and easy scalability. However, limitations include high costs for large workloads and potential complexity in managing different big data frameworks simultaneously.
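To ground the comparison, here is a minimal PySpark sketch of the kind of batch workload these managed services all run; the storage account, container names, and columns are hypothetical, and typically only the filesystem prefix (wasbs:// or abfss:// on HDInsight, s3:// on EMR, gs:// on Dataproc, dbfs:/ on Databricks) changes between platforms.

```python
# Minimal PySpark sketch (hypothetical paths and columns); the same script runs
# largely unchanged on HDInsight, EMR, Dataproc, or Databricks, since all of
# them provide stock Apache Spark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-event-counts").getOrCreate()

# On HDInsight, data usually lives in Azure Blob Storage (wasbs://) or ADLS Gen2 (abfss://).
events = spark.read.json("wasbs://events@mystorageaccount.blob.core.windows.net/2024/")

# Count events per type per day and write the result back as Parquet.
daily_counts = (
    events
    .withColumn("day", F.to_date("timestamp"))
    .groupBy("day", "event_type")
    .count()
)
daily_counts.write.mode("overwrite").parquet(
    "wasbs://curated@mystorageaccount.blob.core.windows.net/daily_counts/"
)
spark.stop()
```

With that baseline in mind, the main alternatives are: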

  1. Amazon EMR: Amazon EMR is a cloud-based big data platform that utilizes various open-source tools such as Apache Spark, Hadoop, and Hive. Key features include easy setup, integration with other AWS services, and cost-effectiveness. Pros include seamless integration with AWS services, while cons include potential complexity for users not familiar with AWS.

  2. Google Cloud Dataproc: Google Cloud Dataproc is a managed Apache Spark and Hadoop service that runs on Google Cloud Platform. Key features include easy cluster management, autoscaling, and integration with other Google Cloud services. Pros include seamless integration with Google Cloud Platform, while cons include potential higher costs compared to other alternatives.

  3. Cloudera Distribution for Hadoop (CDH): CDH is a distribution of Apache Hadoop and related projects from Cloudera. Key features include comprehensive data management capabilities, enterprise-grade security, and support for various big data frameworks. Pros include extensive support and documentation, while cons include potential higher costs for enterprise deployments.

  4. MapR: MapR is a converged data platform that integrates Hadoop, Spark, and other big data frameworks. Key features include high performance, enterprise-grade reliability, and global data consistency. Pros include faster performance compared to other alternatives, while cons include potential higher costs for large-scale deployments.

  5. IBM BigInsights: IBM BigInsights is an enterprise-grade Hadoop distribution with additional analytics capabilities. Key features include advanced analytics tools, integration with IBM Watson services, and enterprise-grade security. Pros include seamless integration with IBM ecosystem, while cons include potential higher costs for smaller deployments.

  6. Hortonworks Data Platform (HDP): HDP is an open-source distribution of Apache Hadoop from Hortonworks. Key features include comprehensive data management tools, enterprise-grade security, and support for various big data frameworks. Pros include open-source nature, while cons include potential complexity in managing different components.

  7. Databricks: Databricks is a unified data analytics platform that leverages Apache Spark for big data processing. Key features include collaborative notebooks, automated cluster management, and integration with various data sources. Pros include ease of use for data scientists, while cons include potential higher costs for large-scale deployments.

  8. Qubole: Qubole is a cloud-native data platform that simplifies big data processing using Apache Spark, Hadoop, and Presto. Key features include self-service analytics, auto-scaling, and cost optimization. Pros include ease of use for data analysts, while cons include potential limitations in customization compared to other alternatives.

  9. Snowflake: Snowflake is a cloud data platform that offers a data warehouse-as-a-service solution for analytics. Key features include instant elasticity, built-in security, and support for structured and semi-structured data. Pros include easy scalability for varying workloads, while cons include potential limitations for unstructured data processing.

  10. Apache Flink: Apache Flink is an open-source stream processing framework that can also be used for batch processing. Key features include low-latency processing, fault tolerance, and support for event time processing. Pros include high throughput and low latency, while cons include potential complexity in setting up and managing Flink clusters.

Top Alternatives to Azure HDInsight

  • Amazon EMR

    It is used in a variety of applications, including log analysis, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics. ...

  • Azure Databricks

    Accelerate big data analytics and artificial intelligence (AI) solutions with Azure Databricks, a fast, easy and collaborative Apache Spark–based analytics service. ...

  • Hadoop

    The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. ...

  • Azure Machine Learning

    Azure Machine Learning is a fully-managed cloud service that enables data scientists and developers to efficiently embed predictive analytics into their applications, helping organizations use massive data sets and bring all the benefits of the cloud to machine learning. ...

  • Azure Data Factory

    It is a service designed to allow developers to integrate disparate data sources. It is a platform somewhat like SSIS in the cloud to manage the data you have both on-prem and in the cloud. ...

  • Databricks

    Databricks Unified Analytics Platform, from the original creators of Apache Spark™, unifies data science and engineering across the Machine Learning lifecycle from data preparation to experimentation and deployment of ML applications. ...

  • MySQL

    The MySQL software delivers a very fast, multi-threaded, multi-user, and robust SQL (Structured Query Language) database server. MySQL Server is intended for mission-critical, heavy-load production systems as well as for embedding into mass-deployed software. ...

  • PostgreSQL

    PostgreSQL is an advanced object-relational database management system that supports an extended subset of the SQL standard, including transactions, foreign keys, subqueries, triggers, user-defined types and functions. ...

Azure HDInsight alternatives & related posts

Amazon EMR

Distribute your data and processing across Amazon EC2 instances using Hadoop.

PROS OF AMAZON EMR
  • On-demand processing power
  • No need to maintain a Hadoop cluster yourself
  • Hadoop tools
  • Elastic
  • Backed by Amazon
  • Flexible
  • Economical: pay as you go, easy-to-use CLI and SDKs
  • No need for a dedicated ops group
  • Massive data handling
  • Great support
CONS OF AMAZON EMR
  • None listed yet

    related Amazon EMR posts

    I have to build a data processing application with an Apache Beam stack and Apache Flink runner on an Amazon EMR cluster. I saw some instability with the process and EMR clusters that keep going down. Here, the Apache Beam application gets inputs from Kafka and sends the accumulative data streams to another Kafka topic. Any advice on how to make the process more stable?
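For reference, a stripped-down Kafka-to-Kafka Beam pipeline of that shape might look like the sketch below (Python SDK). The broker address, topic names, window size, and per-key accumulation are placeholders rather than the poster's actual code, and the Flink-on-EMR deployment options are omitted.

```python
# Hedged sketch of a streaming Beam pipeline that reads from one Kafka topic,
# accumulates per key in fixed windows, and writes to another topic.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.kafka import ReadFromKafka, WriteToKafka
from apache_beam.transforms.window import FixedWindows

options = PipelineOptions(["--runner=FlinkRunner", "--streaming"])

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadInput" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "broker:9092"},
            topics=["input-events"],
        )
        # ReadFromKafka yields (key, value) pairs of bytes.
        | "Window" >> beam.WindowInto(FixedWindows(60))
        | "AccumulatePerKey" >> beam.CombinePerKey(lambda values: b"".join(values))
        | "WriteOutput" >> WriteToKafka(
            producer_config={"bootstrap.servers": "broker:9092"},
            topic="accumulated-events",
        )
    )
```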

Shared insights on AWS Glue and Amazon EMR

I used AWS Glue because I thought it was worth all the hype in fall 2018. However, you had to use Python 2.7 with no pandas support, and cold starts lasted as long as 15 minutes. Also, setting up a dev environment for iterative development was near impossible at the time.

It was a terrible experience for me. I recommend using Amazon EMR instead. Even talking with a friend that works at Amazon, they use EMR instead of Glue for internal Spark workloads. Just because a company makes something doesn't mean they use that something :/
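If you do go the EMR route, a transient cluster that runs a single Spark step and then terminates can be launched with a few lines of boto3. The sketch below is illustrative only; the bucket, instance types, and release label are placeholders.

```python
# Launch a transient EMR cluster that runs one spark-submit step and shuts down.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="nightly-spark-etl",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate once the step finishes
    },
    Steps=[{
        "Name": "run-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/etl.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster started:", response["JobFlowId"])
```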


    Azure Databricks

Fast, easy, and collaborative Apache Spark–based analytics service.

PROS OF AZURE DATABRICKS
  • None listed yet
CONS OF AZURE DATABRICKS
  • None listed yet

        Hadoop

Open-source software for reliable, scalable, distributed computing.

PROS OF HADOOP
  • Great ecosystem
  • One stack to rule them all
  • Great load balancer
  • Amazon AWS
  • Java syntax
CONS OF HADOOP
  • None listed yet

related Hadoop posts

Shared insights on Kafka and Hadoop at Pinterest

          The early data ingestion pipeline at Pinterest used Kafka as the central message transporter, with the app servers writing messages directly to Kafka, which then uploaded log files to S3.

          For databases, a custom Hadoop streamer pulled database data and wrote it to S3.

          Challenges cited for this infrastructure included high operational overhead, as well as potential data loss occurring when Kafka broker outages led to an overflow of in-memory message buffering.
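As a rough illustration of the first hop in that pipeline (app servers writing directly to Kafka), a producer could look like the sketch below. The topic name, message shape, and client settings are assumptions, not Pinterest's actual code.

```python
# Minimal kafka-python producer; acks/retries are tuned to reduce loss when brokers misbehave.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka-broker:9092"],
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
    acks="all",   # wait for in-sync replicas instead of buffering optimistically
    retries=5,    # retry transient broker failures
)

producer.send("app-events", {"user_id": 42, "action": "pin_created"})
producer.flush()  # block until buffered messages are delivered
```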

Conor Myhrvold, Tech Brand Mgr, Office of CTO at Uber:

Why we built Marmaray, an open-source generic data ingestion and dispersal framework and library for Apache Hadoop:

Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse to any sink by leveraging Apache Spark. The name, Marmaray, comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:

https://eng.uber.com/marmaray-hadoop-ingestion-open-source/

(Direct GitHub repo: https://github.com/uber/marmaray)
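Marmaray itself is a JVM library, but the "any source to any sink via Spark" pattern it generalizes can be illustrated with a small PySpark Structured Streaming job; everything below (broker, topic, and paths) is an invented example, not Marmaray's actual API.

```python
# Illustrative source-to-sink job: Kafka topic in, Parquet on the data lake out.
# Requires the spark-sql-kafka package on the Spark classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-datalake").getOrCreate()

# Source: a Kafka topic read as a streaming DataFrame.
source = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "rider-events")
    .load()
)

# Sink: append the raw records to Parquet.
query = (
    source.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
    .writeStream.format("parquet")
    .option("path", "s3a://datalake/raw/rider-events/")
    .option("checkpointLocation", "s3a://datalake/checkpoints/rider-events/")
    .start()
)
query.awaitTermination()
```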


          Azure Machine Learning

A fully-managed cloud service for predictive analytics.

PROS OF AZURE MACHINE LEARNING
  • None listed yet
CONS OF AZURE MACHINE LEARNING
  • None listed yet

              Azure Data Factory

Hybrid data integration service that simplifies ETL at scale.

PROS OF AZURE DATA FACTORY
  • None listed yet
CONS OF AZURE DATA FACTORY
  • None listed yet

related Azure Data Factory posts

Trying to establish a data lake (or maybe puddle) for my org's Data Sharing project. The idea is that outside partners would send cuts of their PHI data, regardless of format/variables/systems, to our Data Team, who would then harmonize the data, create data marts, and eventually use it for something. End-to-end, I'm envisioning:

1. Ingestion -> A secure, role-based, self-service portal for users to upload data (1a. bonus points if it can perform basic validations/masking)
2. Storage -> Amazon S3 seems like the cheapest. We probably won't need very much space, even at full capacity. Our current storage is a secure Box folder that has ~4GB with several batches of test data, code, presentations, and planning docs.
3. Data Catalog -> AWS Glue? Azure Data Factory? Snowplow? Is the main difference basically the vendor? We also will have Data Dictionaries/Codebooks from submitters. Where would they fit in?
4. Partitions -> I've seen Cassandra and YARN mentioned, but have no experience with either.
5. Processing -> We want to use SAS if at all possible. What will work with SAS code?
6. Pipeline/Automation -> The check-in and verification processes that have been outlined are rather involved. Some sort of automated messaging or approval workflow would be nice.
7. I have very little guidance on what a "Data Mart" should look like, so I'm going with the idea that it would be another "experimental" partition. Unless there's an actual mart-building paradigm I've missed?
8. An end user might use the catalog to pull certain de-identified data sets from the marts. Again, role-based access and a self-service GUI would be preferable.

I'm the only full-time tech person on this project, but I'm mostly an OOP, HTML, JavaScript, and some SQL programmer. Most of this is out of my repertoire. I've done a lot of research, but I can't be an effective evangelist without hands-on experience. Since we're starting a new year of our grant, they've finally decided to let me try some stuff out. Any pointers would be appreciated!
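For what it's worth, the ingestion/storage steps (1 and 2) can be prototyped with very little code; the sketch below uploads one partner file to S3 with KMS encryption enforced. The bucket, key prefix, and KMS alias are placeholders, and a real PHI workflow would add validation and masking before upload.

```python
# Upload one partner data cut to S3 with server-side KMS encryption.
import boto3

s3 = boto3.client("s3")

s3.upload_file(
    Filename="partner_cut_2024_01.csv",
    Bucket="org-data-sharing-raw",
    Key="partner-a/2024-01/partner_cut_2024_01.csv",
    ExtraArgs={
        "ServerSideEncryption": "aws:kms",        # encrypt at rest with a KMS key
        "SSEKMSKeyId": "alias/data-sharing-phi",  # hypothetical key alias
    },
)
```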

                  We are a young start-up with 2 developers and a team in India looking to choose our next ETL tool. We have a few processes in Azure Data Factory but are looking to switch to a better platform. We were debating Trifacta and Airflow. Or even staying with Azure Data Factory. The use case will be to feed data to front-end APIs.
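If Airflow wins that debate, the "feed data to front-end APIs" use case maps onto a small DAG like the sketch below; the schedule, task names, and helper functions are hypothetical.

```python
# Minimal Airflow 2.x DAG: transform data, then publish it to the front-end API.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_and_transform():
    # Pull from the source system and shape the payload (placeholder).
    ...


def publish_to_api():
    # POST the prepared payload to the front-end API (placeholder).
    ...


with DAG(
    dag_id="feed_frontend_api",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",  # Airflow 2.4+; use schedule_interval on older versions
    catchup=False,
) as dag:
    transform = PythonOperator(task_id="extract_and_transform", python_callable=extract_and_transform)
    publish = PythonOperator(task_id="publish_to_api", python_callable=publish_to_api)
    transform >> publish
```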


                  Databricks

A unified analytics platform, powered by Apache Spark.

PROS OF DATABRICKS
  • Best performance on large datasets
  • True lakehouse architecture
  • Scalability
  • Databricks doesn't get access to your data
  • Usage-based billing
  • Security
  • Data stays in your cloud account
  • Multicloud
CONS OF DATABRICKS
  • None listed yet

related Databricks posts

Jan Vlnas, Senior Software Engineer at Mews:

                    From my point of view, both OpenRefine and Apache Hive serve completely different purposes. OpenRefine is intended for interactive cleaning of messy data locally. You could work with their libraries to use some of OpenRefine features as part of your data pipeline (there are pointers in FAQ), but OpenRefine in general is intended for a single-user local operation.

                    I can't recommend a particular alternative without better understanding of your use case. But if you are looking for an interactive tool to work with big data at scale, take a look at notebook environments like Jupyter, Databricks, or Deepnote. If you are building a data processing pipeline, consider also Apache Spark.

                    Edit: Fixed references from Hadoop to Hive, which is actually closer to Spark.
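To make the Spark suggestion concrete, a batch version of the kind of cleaning you might otherwise do interactively in OpenRefine could look like the sketch below; the input path and column names are invented.

```python
# Minimal PySpark batch-cleaning job: normalize, deduplicate, drop incomplete rows.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-cleaning").getOrCreate()

raw = spark.read.option("header", True).csv("s3a://my-bucket/raw/customers.csv")

cleaned = (
    raw
    .withColumn("email", F.lower(F.trim(F.col("email"))))  # normalize casing/whitespace
    .dropDuplicates(["email"])                              # remove duplicate records
    .filter(F.col("email").isNotNull())                     # drop rows missing the key
)

cleaned.write.mode("overwrite").parquet("s3a://my-bucket/clean/customers/")
```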


                    MySQL

The world's most popular open source database.

PROS OF MYSQL
  • SQL
  • Free
  • Easy
  • Widely used
  • Open source
  • High availability
  • Cross-platform support
  • Great community
  • Secure
  • Full-text indexing and searching
  • Fast, open, available
  • Reliable
  • SSL support
  • Robust
  • Enterprise version
  • Easy to set up on all platforms
  • NoSQL access to JSON data type
  • Relational database
  • Easy, light, scalable
  • Sequel Pro (best SQL GUI)
  • Replica support
CONS OF MYSQL
  • Owned by a company with their own agenda
  • Can't roll back schema changes

related MySQL posts

Nick Rockwell, SVP, Engineering at Fastly:

                    When I joined NYT there was already broad dissatisfaction with the LAMP (Linux Apache HTTP Server MySQL PHP) Stack and the front end framework, in particular. So, I wasn't passing judgment on it. I mean, LAMP's fine, you can do good work in LAMP. It's a little dated at this point, but it's not ... I didn't want to rip it out for its own sake, but everyone else was like, "We don't like this, it's really inflexible." And I remember from being outside the company when that was called MIT FIVE when it had launched. And been observing it from the outside, and I was like, you guys took so long to do that and you did it so carefully, and yet you're not happy with your decisions. Why is that? That was more the impetus. If we're going to do this again, how are we going to do it in a way that we're gonna get a better result?

                    So we're moving quickly away from LAMP, I would say. So, right now, the new front end is React based and using Apollo. And we've been in a long, protracted, gradual rollout of the core experiences.

                    React is now talking to GraphQL as a primary API. There's a Node.js back end, to the front end, which is mainly for server-side rendering, as well.

                    Behind there, the main repository for the GraphQL server is a big table repository, that we call Bodega because it's a convenience store. And that reads off of a Kafka pipeline.

Tim Abbott:

We've been using PostgreSQL since the very early days of Zulip, but we actually didn't use it from the beginning. Zulip started out as a MySQL project back in 2012, because we'd heard it was a good choice for a startup with a wide community. However, we found that even though we were using the Django ORM for most of our database access, we spent a lot of time fighting with MySQL. Issues ranged from bad collation defaults to bad query plans which required a lot of manual query tweaks.

We ended up getting so frustrated that we tried out PostgreSQL, and the results were fantastic. We didn't have to do any real customization (just some tuning settings for how big a server we had), and all of our most important queries were faster out of the box. As a result, we were able to delete a bunch of custom queries escaping the ORM that we'd written to make the MySQL query planner happy (because Postgres just did the right thing automatically).

And then after that, we've just gotten a ton of value out of Postgres. We use its excellent built-in full-text search, which has helped us avoid needing to bring in a tool like Elasticsearch, and we've really enjoyed features like its partial indexes, which saved us a lot of work adding unnecessary extra tables to get good performance for things like our "unread messages" and "starred messages" indexes.

I can't recommend it highly enough.
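The partial-index idea mentioned above is straightforward to reproduce; the sketch below (via psycopg2) uses invented table and column names, not Zulip's actual schema.

```python
# Create a partial index that only covers the rows the hot query touches,
# keeping the index small and cheap to maintain.
import psycopg2

conn = psycopg2.connect("dbname=app user=app")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE INDEX IF NOT EXISTS idx_unread_messages
        ON user_messages (user_id, message_id)
        WHERE NOT read
    """)
conn.close()
```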


                    PostgreSQL

A powerful, open source object-relational database system.

PROS OF POSTGRESQL
  • Relational database
  • High availability
  • Enterprise-class database
  • SQL
  • SQL + NoSQL
  • Great community
  • Easy to set up
  • Heroku
  • Secure by default
  • PostGIS
  • Supports key-value
  • Great JSON support
  • Cross-platform
  • Extensible
  • Replication
  • Triggers
  • Multiversion concurrency control
  • Rollback
  • Open source
  • Heroku add-on
  • Stable, simple, and good performance
  • Powerful
  • Let's be serious, what other SQL DB would you go for?
  • Good documentation
  • Scalable
  • Free
  • Reliable
  • Intelligent optimizer
  • Transactional DDL
  • Modern
  • One-stop solution for all things SQL, no matter the OS
  • Relational database with MVCC
  • Faster development
  • Full-text search
  • Developer friendly
  • Excellent source code
  • Free version
  • Great DB for transactional systems or applications
  • Search
  • Text
  • Can handle up to petabytes worth of size
  • Composability
  • Multiple procedural languages supported
  • Native
CONS OF POSTGRESQL
  • Table/index bloat

related PostgreSQL posts

Simon Reymann, Senior Fullstack Developer at QUANTUSflow Software GmbH:

Our whole DevOps stack consists of the following tools:

• GitHub (incl. GitHub Pages/Markdown for documentation, GettingStarted and HowTo's) as our collaborative review and code management tool
• Respectively Git as revision control system
• SourceTree as Git GUI
• Visual Studio Code as IDE
• CircleCI for continuous integration (automating the development process)
• Prettier / TSLint / ESLint as code linters
• SonarQube as quality gate
• Docker as container management (incl. Docker Compose for multi-container application management)
• VirtualBox for operating system simulation tests
• Kubernetes as cluster management for Docker containers
• Heroku for deploying in test environments
• nginx as web server (preferably used as facade server in production environment)
• SSLMate (using OpenSSL) for certificate management
• Amazon EC2 (incl. Amazon S3) for deploying in stage (production-like) and production environments
• PostgreSQL as preferred database system
• Redis as preferred in-memory database/store (great for caching)

The main reason we have chosen Kubernetes over Docker Swarm is related to the following artifacts:

• Key features: easy and flexible installation, clear dashboard, great scaling operations, monitoring as an integral part, great load-balancing concepts, and it monitors node condition and ensures compensation in the event of failure.
• Applications: an application can be deployed using a combination of pods, deployments, and services (or micro-services).
• Functionality: Kubernetes has a complex installation and setup process, but it is not as limited as Docker Swarm.
• Monitoring: it supports multiple versions of logging and monitoring when the services are deployed within the cluster (Elasticsearch/Kibana (ELK), Heapster/Grafana, Sysdig cloud integration).
• Scalability: all-in-one framework for distributed systems.
• Other benefits: Kubernetes is backed by the Cloud Native Computing Foundation (CNCF), has a huge community among container orchestration tools, and is an open-source, modular tool that works with any OS.

Jeyabalaji Subramanian:

Recently we were looking at a few robust and cost-effective ways of replicating the data that resides in our production MongoDB to a PostgreSQL database for data warehousing and business intelligence.

We set ourselves the following criteria for the optimal tool that would do this job:

  • The data replication must be near real-time, yet it should NOT impact the production database
  • The data replication must be horizontally scalable (based on the load), asynchronous, and crash-resilient

Based on the above criteria, we selected the following tools to perform the end-to-end data replication:

We chose MongoDB Stitch for picking up the changes in the source database. It is the serverless platform from MongoDB. One of the services offered by MongoDB Stitch is Stitch Triggers. Using Stitch triggers, you can execute a serverless function (in Node.js) in real time in response to changes in the database. When there are a lot of database changes, Stitch automatically "feeds forward" these changes through an asynchronous queue.

We chose Amazon SQS as the pipe / message backbone for communicating the changes from MongoDB to our own replication service. Interestingly enough, MongoDB Stitch offers integration with AWS services.

In the Node.js function, we wrote minimal functionality to communicate the database changes (insert / update / delete / replace) to Amazon SQS.

Next we wrote a minimal micro-service in Python to listen to the message events on SQS, pick up the data payload, and mirror the DB changes onto the target data warehouse. We implemented source-to-target data translation by modelling target table structures through SQLAlchemy. We deployed this micro-service as an AWS Lambda with Zappa. With Zappa, deploying your services as event-driven, horizontally scalable Lambda services is dumb-easy.

In the end, we got to implement a highly scalable, near real-time Change Data Replication service that "works" and was deployed to production in a matter of a few days!
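A condensed sketch of that SQS-listening micro-service is shown below; the queue URL, message format, and apply_change helper are assumptions, and the team's real service ran as an AWS Lambda deployed with Zappa rather than a polling loop.

```python
# Poll SQS for MongoDB change events and mirror each one onto the target warehouse.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/mongo-change-events"  # placeholder


def apply_change(change):
    """Mirror one insert/update/delete/replace onto the data warehouse (placeholder)."""
    ...


def poll_forever():
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling
        )
        for message in resp.get("Messages", []):
            apply_change(json.loads(message["Body"]))
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```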
