What is Druid?
Who uses Druid?
Here are some stack decisions, common use cases and reviews by companies and developers who chose Druid in their tech stack.
My background is in Data analytics in the telecom domain. Have to build the database for analyzing large volumes of CDR data so far the data are maintained in a file server and the application queries data from the files. It's consuming a lot of resources queries are taking time so now I am asked to come up with the approach. I planned to rewrite the app, so which database needs to be used. I am confused between MongoDB and Druid.
So please do advise me on picking from these two and why?
My process is like this: I would get data once a month, either from Google BigQuery or as parquet files from Azure Blob Storage. I have a script that does some cleaning and then stores the result as partitioned parquet files because the following process cannot handle loading all data to memory.
The next process is making a heavy computation in a parallel fashion (per partition), and storing 3 intermediate versions as parquet files: two used for statistics, and the third will be filtered and create the final files.
I make a report based on the two files in Jupyter notebook and convert it to HTML.
- Everything is done with vanilla python and Pandas.
- sometimes I may get a different format of data
- cloud service is Microsoft Azure.
What I'm considering is the following:
Get the data with Kafka or with native python, do the first processing, and store data in Druid, the second processing will be done with Apache Spark getting data from apache druid.
the intermediate states can be stored in druid too. and visualization would be with apache superset.
Developing a solution that collects Telemetry Data from different devices, nearly 1000 devices minimum and maximum 12000. Each device is sending 2 packets in 1 second. This is time-series data, and this data definition and different reports are saved on PostgreSQL. Like Building information, maintenance records, etc. I want to know about the best solution. This data is required for Math and ML to run different algorithms. Also, data is raw without definitions and information stored in PostgreSQL. Initially, I went with TimescaleDB due to PostgreSQL support, but to increase in sites, I started facing many issues with timescale DB in terms of flexibility of storing data.
My major requirement is also the replication of the database for reporting and different purposes. You may also suggest other options other than Druid and Cassandra. But an open source solution is appreciated.