What is Azure HDInsight and what are its top alternatives?
Azure HDInsight is a fully managed cloud-based service from Microsoft that provides Apache Hadoop and Apache Spark clusters. It allows users to process big data workloads in a cost-effective and scalable manner. Key features include support for various big data frameworks, integration with other Azure services, enterprise-grade security, and easy scalability. However, limitations include high costs for large workloads and potential complexity in managing different big data frameworks simultaneously.
Amazon EMR: Amazon EMR is a cloud-based big data platform that utilizes various open-source tools such as Apache Spark, Hadoop, and Hive. Key features include easy setup, integration with other AWS services, and cost-effectiveness. Pros include seamless integration with AWS services, while cons include potential complexity for users not familiar with AWS.
Google Cloud Dataproc: Google Cloud Dataproc is a managed Apache Spark and Hadoop service that runs on Google Cloud Platform. Key features include easy cluster management, autoscaling, and integration with other Google Cloud services. Pros include seamless integration with Google Cloud Platform, while cons include potential higher costs compared to other alternatives.
Cloudera Distribution for Hadoop (CDH): CDH is a distribution of Apache Hadoop and related projects from Cloudera. Key features include comprehensive data management capabilities, enterprise-grade security, and support for various big data frameworks. Pros include extensive support and documentation, while cons include potential higher costs for enterprise deployments.
MapR: MapR is a converged data platform that integrates Hadoop, Spark, and other big data frameworks. Key features include high performance, enterprise-grade reliability, and global data consistency. Pros include faster performance compared to other alternatives, while cons include potential higher costs for large-scale deployments.
IBM BigInsights: IBM BigInsights is an enterprise-grade Hadoop distribution with additional analytics capabilities. Key features include advanced analytics tools, integration with IBM Watson services, and enterprise-grade security. Pros include seamless integration with IBM ecosystem, while cons include potential higher costs for smaller deployments.
Hortonworks Data Platform (HDP): HDP is an open-source distribution of Apache Hadoop from Hortonworks. Key features include comprehensive data management tools, enterprise-grade security, and support for various big data frameworks. Pros include open-source nature, while cons include potential complexity in managing different components.
Databricks: Databricks is a unified data analytics platform that leverages Apache Spark for big data processing. Key features include collaborative notebooks, automated cluster management, and integration with various data sources. Pros include ease of use for data scientists, while cons include potential higher costs for large-scale deployments.
Qubole: Qubole is a cloud-native data platform that simplifies big data processing using Apache Spark, Hadoop, and Presto. Key features include self-service analytics, auto-scaling, and cost optimization. Pros include ease of use for data analysts, while cons include potential limitations in customization compared to other alternatives.
Snowflake: Snowflake is a cloud data platform that offers a data warehouse-as-a-service solution for analytics. Key features include instant elasticity, built-in security, and support for structured and semi-structured data. Pros include easy scalability for varying workloads, while cons include potential limitations for unstructured data processing.
Apache Flink: Apache Flink is an open-source stream processing framework that can also be used for batch processing. Key features include low-latency processing, fault tolerance, and support for event time processing. Pros include high throughput and low latency, while cons include potential complexity in setting up and managing Flink clusters.
Top Alternatives to Azure HDInsight
- Amazon EMR
It is used in a variety of applications, including log analysis, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics. ...
- Azure Databricks
Accelerate big data analytics and artificial intelligence (AI) solutions with Azure Databricks, a fast, easy and collaborative Apache Spark–based analytics service. ...
- Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. ...
- Azure Machine Learning
Azure Machine Learning is a fully-managed cloud service that enables data scientists and developers to efficiently embed predictive analytics into their applications, helping organizations use massive data sets and bring all the benefits of the cloud to machine learning. ...
- Azure Data Factory
It is a service designed to allow developers to integrate disparate data sources. It is a platform somewhat like SSIS in the cloud to manage the data you have both on-prem and in the cloud. ...
- Databricks
Databricks Unified Analytics Platform, from the original creators of Apache Spark™, unifies data science and engineering across the Machine Learning lifecycle from data preparation to experimentation and deployment of ML applications. ...
- MySQL
The MySQL software delivers a very fast, multi-threaded, multi-user, and robust SQL (Structured Query Language) database server. MySQL Server is intended for mission-critical, heavy-load production systems as well as for embedding into mass-deployed software. ...
- PostgreSQL
PostgreSQL is an advanced object-relational database management system that supports an extended subset of the SQL standard, including transactions, foreign keys, subqueries, triggers, user-defined types and functions. ...
Azure HDInsight alternatives & related posts
- On demand processing power15
- Don't need to maintain Hadoop Cluster yourself12
- Hadoop Tools7
- Elastic6
- Backed by Amazon4
- Flexible3
- Economic - pay as you go, easy to use CLI and SDKs3
- Don't need a dedicated Ops group2
- Massive data handling1
- Great support1
related Amazon EMR posts
I have to build a data processing application with an Apache Beam stack and Apache Flink runner on an Amazon EMR cluster. I saw some instability with the process and EMR clusters that keep going down. Here, the Apache Beam application gets inputs from Kafka and sends the accumulative data streams to another Kafka topic. Any advice on how to make the process more stable?
I use AWS Glue because I thought it was worth all they hype Fall 2018. However, you had to use Python 2.7 with no pandas support, and cold starts lasted as long as 15 minutes. Also, setting up a dev environment for iterative development was near impossible at the time.
It was a terrible experience for me. I recommend using Amazon EMR instead. Even talking with a friend that works at Amazon, they use EMR instead of Glue for internal spark workloads. Just because a company makes something doesn't mean they use that something :/
related Azure Databricks posts
- Great ecosystem39
- One stack to rule them all11
- Great load balancer4
- Amazon aws1
- Java syntax1
related Hadoop posts
The early data ingestion pipeline at Pinterest used Kafka as the central message transporter, with the app servers writing messages directly to Kafka, which then uploaded log files to S3.
For databases, a custom Hadoop streamer pulled database data and wrote it to S3.
Challenges cited for this infrastructure included high operational overhead, as well as potential data loss occurring when Kafka broker outages led to an overflow of in-memory message buffering.
Why we built Marmaray, an open source generic data ingestion and dispersal framework and library for Apache Hadoop :
Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse to any sink leveraging the use of Apache Spark . The name, Marmaray, comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:
https://eng.uber.com/marmaray-hadoop-ingestion-open-source/
(Direct GitHub repo: https://github.com/uber/marmaray Kafka Kafka Manager )
related Azure Machine Learning posts
related Azure Data Factory posts
Trying to establish a data lake(or maybe puddle) for my org's Data Sharing project. The idea is that outside partners would send cuts of their PHI data, regardless of format/variables/systems, to our Data Team who would then harmonize the data, create data marts, and eventually use it for something. End-to-end, I'm envisioning:
- Ingestion->Secure, role-based, self service portal for users to upload data (1a. bonus points if it can preform basic validations/masking)
- Storage->Amazon S3 seems like the cheapest. We probably won't need very big, even at full capacity. Our current storage is a secure Box folder that has ~4GB with several batches of test data, code, presentations, and planning docs.
- Data Catalog-> AWS Glue? Azure Data Factory? Snowplow? is the main difference basically based on the vendor? We also will have Data Dictionaries/Codebooks from submitters. Where would they fit in?
- Partitions-> I've seen Cassandra and YARN mentioned, but have no experience with either
- Processing-> We want to use SAS if at all possible. What will work with SAS code?
- Pipeline/Automation->The check-in and verification processes that have been outlined are rather involved. Some sort of automated messaging or approval workflow would be nice
- I have very little guidance on what a "Data Mart" should look like, so I'm going with the idea that it would be another "experimental" partition. Unless there's an actual mart-building paradigm I've missed?
- An end user might use the catalog to pull certain de-identified data sets from the marts. Again, role-based access and self-service gui would be preferable. I'm the only full-time tech person on this project, but I'm mostly an OOP, HTML, JavaScript, and some SQL programmer. Most of this is out of my repertoire. I've done a lot of research, but I can't be an effective evangelist without hands-on experience. Since we're starting a new year of our grant, they've finally decided to let me try some stuff out. Any pointers would be appreciated!
We are a young start-up with 2 developers and a team in India looking to choose our next ETL tool. We have a few processes in Azure Data Factory but are looking to switch to a better platform. We were debating Trifacta and Airflow. Or even staying with Azure Data Factory. The use case will be to feed data to front-end APIs.
- Best Performances on large datasets1
- True lakehouse architecture1
- Scalability1
- Databricks doesn't get access to your data1
- Usage Based Billing1
- Security1
- Data stays in your cloud account1
- Multicloud1
related Databricks posts
From my point of view, both OpenRefine and Apache Hive serve completely different purposes. OpenRefine is intended for interactive cleaning of messy data locally. You could work with their libraries to use some of OpenRefine features as part of your data pipeline (there are pointers in FAQ), but OpenRefine in general is intended for a single-user local operation.
I can't recommend a particular alternative without better understanding of your use case. But if you are looking for an interactive tool to work with big data at scale, take a look at notebook environments like Jupyter, Databricks, or Deepnote. If you are building a data processing pipeline, consider also Apache Spark.
Edit: Fixed references from Hadoop to Hive, which is actually closer to Spark.
- Sql800
- Free679
- Easy562
- Widely used528
- Open source490
- High availability180
- Cross-platform support160
- Great community104
- Secure79
- Full-text indexing and searching75
- Fast, open, available26
- Reliable16
- SSL support16
- Robust15
- Enterprise Version9
- Easy to set up on all platforms7
- NoSQL access to JSON data type3
- Relational database1
- Easy, light, scalable1
- Sequel Pro (best SQL GUI)1
- Replica Support1
- Owned by a company with their own agenda16
- Can't roll back schema changes3
related MySQL posts
When I joined NYT there was already broad dissatisfaction with the LAMP (Linux Apache HTTP Server MySQL PHP) Stack and the front end framework, in particular. So, I wasn't passing judgment on it. I mean, LAMP's fine, you can do good work in LAMP. It's a little dated at this point, but it's not ... I didn't want to rip it out for its own sake, but everyone else was like, "We don't like this, it's really inflexible." And I remember from being outside the company when that was called MIT FIVE when it had launched. And been observing it from the outside, and I was like, you guys took so long to do that and you did it so carefully, and yet you're not happy with your decisions. Why is that? That was more the impetus. If we're going to do this again, how are we going to do it in a way that we're gonna get a better result?
So we're moving quickly away from LAMP, I would say. So, right now, the new front end is React based and using Apollo. And we've been in a long, protracted, gradual rollout of the core experiences.
React is now talking to GraphQL as a primary API. There's a Node.js back end, to the front end, which is mainly for server-side rendering, as well.
Behind there, the main repository for the GraphQL server is a big table repository, that we call Bodega because it's a convenience store. And that reads off of a Kafka pipeline.
We've been using PostgreSQL since the very early days of Zulip, but we actually didn't use it from the beginning. Zulip started out as a MySQL project back in 2012, because we'd heard it was a good choice for a startup with a wide community. However, we found that even though we were using the Django ORM for most of our database access, we spent a lot of time fighting with MySQL. Issues ranged from bad collation defaults, to bad query plans which required a lot of manual query tweaks.
We ended up getting so frustrated that we tried out PostgresQL, and the results were fantastic. We didn't have to do any real customization (just some tuning settings for how big a server we had), and all of our most important queries were faster out of the box. As a result, we were able to delete a bunch of custom queries escaping the ORM that we'd written to make the MySQL query planner happy (because postgres just did the right thing automatically).
And then after that, we've just gotten a ton of value out of postgres. We use its excellent built-in full-text search, which has helped us avoid needing to bring in a tool like Elasticsearch, and we've really enjoyed features like its partial indexes, which saved us a lot of work adding unnecessary extra tables to get good performance for things like our "unread messages" and "starred messages" indexes.
I can't recommend it highly enough.
- Relational database764
- High availability510
- Enterprise class database439
- Sql383
- Sql + nosql304
- Great community173
- Easy to setup147
- Heroku131
- Secure by default130
- Postgis113
- Supports Key-Value50
- Great JSON support48
- Cross platform34
- Extensible33
- Replication28
- Triggers26
- Multiversion concurrency control23
- Rollback23
- Open source21
- Heroku Add-on18
- Stable, Simple and Good Performance17
- Powerful15
- Lets be serious, what other SQL DB would you go for?13
- Good documentation11
- Scalable9
- Free8
- Reliable8
- Intelligent optimizer8
- Transactional DDL7
- Modern7
- One stop solution for all things sql no matter the os6
- Relational database with MVCC5
- Faster Development5
- Full-Text Search4
- Developer friendly4
- Excellent source code3
- Free version3
- Great DB for Transactional system or Application3
- Relational datanbase3
- search3
- Open-source3
- Text2
- Full-text2
- Can handle up to petabytes worth of size1
- Composability1
- Multiple procedural languages supported1
- Native0
- Table/index bloatings10
related PostgreSQL posts
Our whole DevOps stack consists of the following tools:
- GitHub (incl. GitHub Pages/Markdown for Documentation, GettingStarted and HowTo's) for collaborative review and code management tool
- Respectively Git as revision control system
- SourceTree as Git GUI
- Visual Studio Code as IDE
- CircleCI for continuous integration (automatize development process)
- Prettier / TSLint / ESLint as code linter
- SonarQube as quality gate
- Docker as container management (incl. Docker Compose for multi-container application management)
- VirtualBox for operating system simulation tests
- Kubernetes as cluster management for docker containers
- Heroku for deploying in test environments
- nginx as web server (preferably used as facade server in production environment)
- SSLMate (using OpenSSL) for certificate management
- Amazon EC2 (incl. Amazon S3) for deploying in stage (production-like) and production environments
- PostgreSQL as preferred database system
- Redis as preferred in-memory database/store (great for caching)
The main reason we have chosen Kubernetes over Docker Swarm is related to the following artifacts:
- Key features: Easy and flexible installation, Clear dashboard, Great scaling operations, Monitoring is an integral part, Great load balancing concepts, Monitors the condition and ensures compensation in the event of failure.
- Applications: An application can be deployed using a combination of pods, deployments, and services (or micro-services).
- Functionality: Kubernetes as a complex installation and setup process, but it not as limited as Docker Swarm.
- Monitoring: It supports multiple versions of logging and monitoring when the services are deployed within the cluster (Elasticsearch/Kibana (ELK), Heapster/Grafana, Sysdig cloud integration).
- Scalability: All-in-one framework for distributed systems.
- Other Benefits: Kubernetes is backed by the Cloud Native Computing Foundation (CNCF), huge community among container orchestration tools, it is an open source and modular tool that works with any OS.
Recently we were looking at a few robust and cost-effective ways of replicating the data that resides in our production MongoDB to a PostgreSQL database for data warehousing and business intelligence.
We set ourselves the following criteria for the optimal tool that would do this job: - The data replication must be near real-time, yet it should NOT impact the production database - The data replication must be horizontally scalable (based on the load), asynchronous & crash-resilient
Based on the above criteria, we selected the following tools to perform the end to end data replication:
We chose MongoDB Stitch for picking up the changes in the source database. It is the serverless platform from MongoDB. One of the services offered by MongoDB Stitch is Stitch Triggers. Using stitch triggers, you can execute a serverless function (in Node.js) in real time in response to changes in the database. When there are a lot of database changes, Stitch automatically "feeds forward" these changes through an asynchronous queue.
We chose Amazon SQS as the pipe / message backbone for communicating the changes from MongoDB to our own replication service. Interestingly enough, MongoDB stitch offers integration with AWS services.
In the Node.js function, we wrote minimal functionality to communicate the database changes (insert / update / delete / replace) to Amazon SQS.
Next we wrote a minimal micro-service in Python to listen to the message events on SQS, pickup the data payload & mirror the DB changes on to the target Data warehouse. We implemented source data to target data translation by modelling target table structures through SQLAlchemy . We deployed this micro-service as AWS Lambda with Zappa. With Zappa, deploying your services as event-driven & horizontally scalable Lambda service is dumb-easy.
In the end, we got to implement a highly scalable near realtime Change Data Replication service that "works" and deployed to production in a matter of few days!