Alternatives to Azure HDInsight

Amazon EMR, Azure Databricks, Hadoop, Azure Machine Learning, and Azure Data Factory are the most popular alternatives and competitors to Azure HDInsight.

What is Azure HDInsight and what are its top alternatives?

Azure HDInsight is a cloud-based service from Microsoft for big data analytics that helps organizations process large amounts of streaming or historical data. It falls in the Big Data as a Service category of a tech stack.

Top Alternatives to Azure HDInsight

  • Amazon EMR

    It is used in a variety of applications, including log analysis, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics.

  • Azure Databricks

    Accelerate big data analytics and artificial intelligence (AI) solutions with Azure Databricks, a fast, easy, and collaborative Apache Spark–based analytics service.

  • Hadoop

    The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

  • Azure Machine Learning

    Azure Machine Learning is a fully-managed cloud service that enables data scientists and developers to efficiently embed predictive analytics into their applications, helping organizations use massive data sets and bring all the benefits of the cloud to machine learning.

  • Azure Data Factory

    It is a service designed to allow developers to integrate disparate data sources. It is a platform somewhat like SSIS in the cloud, used to manage data both on-prem and in the cloud.

  • Databricks

    Databricks Unified Analytics Platform, from the original creators of Apache Spark™, unifies data science and engineering across the machine learning lifecycle, from data preparation to experimentation and deployment of ML applications.

  • Apache Spark

    Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and newer workloads like streaming, interactive queries, and machine learning.

  • Google BigQuery

    Run super-fast, SQL-like queries against terabytes of data in seconds, using the processing power of Google's infrastructure. Load data with ease: bulk load your data using Google Cloud Storage or stream it in. Easy access: use BigQuery from a browser tool, a command-line tool, or by making calls to the BigQuery REST API with client libraries such as Java, PHP, or Python.
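
Hadoop, Spark, and the hosted services above (EMR, HDInsight) all build on the same divide-and-conquer model: map over records, shuffle by key, reduce per group. As an illustration only — plain Python on one machine, with made-up function names rather than any framework's API — the classic word count looks like this:

```python
from collections import defaultdict

def map_phase(documents):
    """Emit (word, 1) pairs from each document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data moves fast"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'moves': 1, 'fast': 1}
```

A real cluster runs many mappers and reducers in parallel and moves the shuffle over the network; the value of a managed service like EMR or HDInsight is that it provisions and operates that cluster for you.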

Azure HDInsight alternatives & related posts

Amazon EMR

Distribute your data and processing across Amazon EC2 instances using Hadoop.

PROS OF AMAZON EMR
  • On-demand processing power
  • No need to maintain a Hadoop cluster yourself
  • Hadoop tools
  • Elastic
  • Backed by Amazon
  • Flexible
  • Economical: pay as you go, with an easy-to-use CLI and SDKs
  • No need for a dedicated Ops group
  • Massive data handling
  • Great support


Azure Databricks

Fast, easy, and collaborative Apache Spark–based analytics service.


Hadoop

Open-source software for reliable, scalable, distributed computing.

PROS OF HADOOP
  • Great ecosystem
  • One stack to rule them all
  • Great load balancer
  • Amazon AWS
  • Java syntax

related Hadoop posts

Shared insights on Kafka and Hadoop at Pinterest

The early data ingestion pipeline at Pinterest used Kafka as the central message transporter, with the app servers writing messages directly to Kafka, which then uploaded log files to S3.

For databases, a custom Hadoop streamer pulled database data and wrote it to S3.

Challenges cited for this infrastructure included high operational overhead, as well as potential data loss occurring when Kafka broker outages led to an overflow of in-memory message buffering.
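
That buffering-overflow failure mode can be sketched in a few lines. This is a hypothetical illustration, not Pinterest's code: when a broker outage stops the producer from draining, a bounded in-memory queue simply evicts its oldest entries.

```python
from collections import deque

class BoundedBuffer:
    """Illustrative in-memory message buffer with a fixed capacity."""

    def __init__(self, capacity):
        # deque with maxlen silently drops from the head when full
        self.queue = deque(maxlen=capacity)
        self.dropped = 0

    def enqueue(self, message):
        if len(self.queue) == self.queue.maxlen:
            self.dropped += 1  # the oldest message is about to be lost
        self.queue.append(message)

# Broker outage: 5 messages arrive while nothing drains, only 3 fit.
buf = BoundedBuffer(capacity=3)
for i in range(5):
    buf.enqueue(f"event-{i}")

print(list(buf.queue))  # ['event-2', 'event-3', 'event-4']
print(buf.dropped)      # 2 messages silently lost
```

The usual mitigations are durable local spooling or producer-side backpressure, both of which trade the operational simplicity this post describes for delivery guarantees.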

Conor Myhrvold, Tech Brand Mgr, Office of CTO at Uber:

Why we built Marmaray, an open source generic data ingestion and dispersal framework and library for Apache Hadoop:

Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse to any sink, leveraging Apache Spark. The name Marmaray comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:

https://eng.uber.com/marmaray-hadoop-ingestion-open-source/

(Direct GitHub repo: https://github.com/uber/marmaray)

Azure Machine Learning

A fully-managed cloud service for predictive analytics.


Azure Data Factory

Hybrid data integration service that simplifies ETL at scale.

related Azure Data Factory posts

Trying to establish a data lake (or maybe puddle) for my org's Data Sharing project. The idea is that outside partners would send cuts of their PHI data, regardless of format/variables/systems, to our Data Team, who would then harmonize the data, create data marts, and eventually use it for something. End-to-end, I'm envisioning:

1. Ingestion -> Secure, role-based, self-service portal for users to upload data (1a. bonus points if it can perform basic validations/masking)
2. Storage -> Amazon S3 seems like the cheapest. We probably won't need very much, even at full capacity. Our current storage is a secure Box folder that has ~4GB with several batches of test data, code, presentations, and planning docs.
3. Data Catalog -> AWS Glue? Azure Data Factory? Snowplow? Is the main difference basically the vendor? We will also have Data Dictionaries/Codebooks from submitters. Where would they fit in?
4. Partitions -> I've seen Cassandra and YARN mentioned, but have no experience with either.
5. Processing -> We want to use SAS if at all possible. What will work with SAS code?
6. Pipeline/Automation -> The check-in and verification processes that have been outlined are rather involved. Some sort of automated messaging or approval workflow would be nice.
7. I have very little guidance on what a "Data Mart" should look like, so I'm going with the idea that it would be another "experimental" partition. Unless there's an actual mart-building paradigm I've missed?
8. An end user might use the catalog to pull certain de-identified data sets from the marts. Again, role-based access and a self-service GUI would be preferable.

I'm the only full-time tech person on this project, but I'm mostly an OOP, HTML, JavaScript, and some SQL programmer. Most of this is out of my repertoire. I've done a lot of research, but I can't be an effective evangelist without hands-on experience. Since we're starting a new year of our grant, they've finally decided to let me try some stuff out. Any pointers would be appreciated!

We are a young start-up with 2 developers and a team in India looking to choose our next ETL tool. We have a few processes in Azure Data Factory but are looking to switch to a better platform. We were debating Trifacta and Airflow, or even staying with Azure Data Factory. The use case will be to feed data to front-end APIs.

Databricks

A unified analytics platform, powered by Apache Spark.

PROS OF DATABRICKS
  • Best performance on large datasets
  • True lakehouse architecture
  • Scalability
  • Databricks doesn't get access to your data
  • Usage-based billing
  • Security
  • Data stays in your cloud account
  • Multicloud

related Databricks posts

Jan Vlnas, Developer Advocate at Superface:

From my point of view, both OpenRefine and Apache Hive serve completely different purposes. OpenRefine is intended for interactive cleaning of messy data locally. You could work with their libraries to use some OpenRefine features as part of your data pipeline (there are pointers in the FAQ), but OpenRefine in general is intended for single-user, local operation.

I can't recommend a particular alternative without a better understanding of your use case. But if you are looking for an interactive tool to work with big data at scale, take a look at notebook environments like Jupyter, Databricks, or Deepnote. If you are building a data processing pipeline, consider also Apache Spark.

Edit: Fixed references from Hadoop to Hive, which is actually closer to Spark.

Apache Spark

Fast and general engine for large-scale data processing.

PROS OF APACHE SPARK
  • Open-source
  • Fast and flexible
  • One platform for every big data problem
  • Great for distributed SQL-like applications
  • Easy to install and use
  • Works well for most data science use cases
  • Interactive query
  • Machine learning libraries and real-time streaming
  • In-memory computation
CONS OF APACHE SPARK
  • Speed

related Apache Spark posts

Eric Colson, Chief Algorithms Officer at Stitch Fix:

The algorithms and data infrastructure at Stitch Fix are housed in #AWS. Data acquisition is split between events flowing through Kafka and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on Yarn is our tool of choice for data movement and #ETL. Because our storage layer (S3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling Yarn clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for ad hoc queries and dashboards.

Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).

At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated into systems. That requires a serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn) by automatically packaging them as Docker containers and deploying to Amazon ECS. This gives our data scientists a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.

For more info:

#DataScience #DataStack #Data

Google BigQuery

Analyze terabytes of data in seconds.

PROS OF GOOGLE BIGQUERY
  • High performance
  • Easy to use
  • Fully managed service
  • Cheap pricing
  • Processes hundreds of GB in seconds
  • Big data
  • Full table scans in seconds, no indexes needed
  • Always on, no per-hour costs
  • Good combination with Fluentd
  • Machine learning
  • Easy to manage
  • Easy to learn
CONS OF GOOGLE BIGQUERY
  • You can't unit test changes in BQ data

related Google BigQuery posts

Context: I wanted to create an end-to-end IoT data pipeline simulation in Google Cloud IoT Core and other GCP services. I never touched Terraform meaningfully until working on this project, and it's one of the best explorations in my development career. The documentation and syntax are incredibly human-readable and friendly. I'm used to building infrastructure through the Google APIs via Python, but I'm so glad past Sung did not make that decision. I was tempted to use Google Cloud Deployment Manager, but the templates were a bit convoluted by first impression. I'm glad past Sung did not make this decision either.

Solution: Leveraging Google Cloud Build, Google Cloud Run, Google Cloud Bigtable, Google BigQuery, Google Cloud Storage, and Google Compute Engine, along with some other fun tools, I can deploy over 40 GCP resources using Terraform!

Tim Specht, Co-Founder and CTO at Dubsmash:

In order to accurately measure and track user behaviour on our platform, we moved over quickly from the initial solution using Google Analytics to a custom-built one due to resource and pricing concerns we had.

While this does sound complicated, it's as easy as clients sending JSON blobs of events to Amazon Kinesis, from where we use AWS Lambda and Amazon SQS to batch and process incoming events and then ingest them into Google BigQuery. Once events are stored in BigQuery (which usually only takes a second from the time the client sends the data until it's available), we can use almost-standard SQL to simply query for data, while Google makes sure that, even with terabytes of data being scanned, query times stay in the range of seconds rather than hours. Before ingesting their data into the pipeline, our mobile clients aggregate events internally and, once a certain threshold is reached or the app is going to the background, send the events as a JSON blob into the stream.

In the past we had workers running that continuously read from the stream and would validate and post-process the data and then enqueue them for other workers to write them to BigQuery. We went ahead and implemented the Lambda-based approach in such a way that Lambda functions would automatically be triggered for incoming records, pre-aggregate events, and write them back to SQS, from which we then read them and persist the events to BigQuery. While this approach had a couple of bumps in the road, like re-triggering functions asynchronously to keep up with the stream and proper batch sizes, we finally managed to get it running in a reliable way and are very happy with this solution today.

#ServerlessTaskProcessing #GeneralAnalytics #RealTimeDataProcessing #BigDataAsAService
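
The client-side aggregation described in this post can be sketched roughly as follows. The class name, threshold, and send hook are illustrative assumptions, not Dubsmash's actual client code: events accumulate in a local buffer and are flushed as one JSON blob once a threshold is reached (or when the app backgrounds).

```python
import json

class EventBatcher:
    """Illustrative client-side batcher: buffer events, flush as one JSON blob."""

    def __init__(self, threshold, send):
        self.threshold = threshold
        self.send = send    # e.g. a function that writes the blob to a stream
        self.buffer = []

    def track(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.threshold:
            self.flush()

    def flush(self):        # would also be called when the app backgrounds
        if self.buffer:
            self.send(json.dumps(self.buffer))
            self.buffer = []

sent = []
batcher = EventBatcher(threshold=3, send=sent.append)
for name in ["open", "tap", "share", "close"]:
    batcher.track({"event": name})

print(len(sent))            # 1 blob sent when the threshold of 3 was hit
print(len(batcher.buffer))  # 1 event ("close") still buffered
```

Batching like this trades a little latency for far fewer network calls, which is why the post notes events usually still reach BigQuery within about a second.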
