Druid

Needs advice on Druid and MongoDB

My background is in data analytics in the telecom domain. I have to build a database for analyzing large volumes of CDR data. So far the data has been kept on a file server and the application queries the files directly; this consumes a lot of resources and queries are slow, so I have been asked to propose a new approach. I plan to rewrite the app, and I need to decide which database to use. I am torn between MongoDB and Druid.

Please advise me on which of these two to pick, and why.

Replies (1)
Database Consultant at Aerospike

Hey Keyaan07,

Am I right to assume that you are looking to store this data in documents and therefore looking at Mongo as an option? If so, I'd take a look at using Aerospike's documentDB capabilities as a far more performant option that can scale with you as you grow and eliminate the need to re-platform in the future.

Aerospike DocumentDB: https://aerospike.com/products/document-data-services/

Optimize your database Infrastructure cost: https://aerospike.com/blog/optimizing-database-infrastructure-cost/

I'd be more than happy to jump on a call at your leisure to walk you through the trade-offs and when it's best to use Mongo, Druid, or Aerospike.

My LinkedIn Profile: Send me a message and let's chat! https://www.linkedin.com/in/ldwyatt/

Needs advice on Druid, Kafka, and Apache Spark

My process is like this: I would get data once a month, either from Google BigQuery or as parquet files from Azure Blob Storage. I have a script that does some cleaning and then stores the result as partitioned parquet files because the following process cannot handle loading all data to memory.

The next process is a heavy computation done in a parallel fashion (per partition), storing 3 intermediate versions as parquet files: two are used for statistics, and the third is filtered to create the final files.

I make a report based on the two files in Jupyter notebook and convert it to HTML.

  • Everything is done with vanilla Python and Pandas.
  • Sometimes I may get a different format of data.
  • The cloud service is Microsoft Azure.

What I'm considering is the following:

Get the data with Kafka or with native Python, do the first processing, and store the data in Druid; the second processing step would be done with Apache Spark, reading data from Apache Druid.

The intermediate states can be stored in Druid too, and visualization would be done with Apache Superset.
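Since the data arrives monthly as parquet, loading the first-processing output into Druid could look roughly like a native batch ingestion spec along these lines (a sketch only; the datasource name, path, and timestamp column are hypothetical assumptions):

```json
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { "type": "local", "baseDir": "/data/cleaned", "filter": "*.parquet" },
      "inputFormat": { "type": "parquet" }
    },
    "dataSchema": {
      "dataSource": "monthly_batch",
      "timestampSpec": { "column": "event_time", "format": "auto" },
      "dimensionsSpec": { "dimensions": [] },
      "granularitySpec": { "segmentGranularity": "month", "queryGranularity": "none" }
    },
    "tuningConfig": { "type": "index_parallel" }
  }
}
```

An empty `dimensions` list lets Druid discover dimensions schemalessly, which helps when the input format occasionally changes. Kafka would only add value here if the monthly drop ever becomes a continuous stream; for a once-a-month batch, a plain batch ingestion like the above is simpler.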

Technical Architect at ERP Studio
Needs advice on Cassandra, Druid, and TimescaleDB

I'm developing a solution that collects telemetry data from different devices: 1,000 devices at minimum and up to 12,000, each sending 2 packets per second. This is time-series data. The data definitions and various reports (building information, maintenance records, etc.) are stored in PostgreSQL, while the telemetry itself is raw, without definitions. This data is needed to run various math and ML algorithms, and I want to know the best solution for storing it. I initially went with TimescaleDB because of its PostgreSQL support, but as the number of sites increased I started facing many issues with TimescaleDB in terms of flexibility of storing data.

My other major requirement is replication of the database for reporting and other purposes. You may also suggest options other than Druid and Cassandra, but an open-source solution is appreciated.
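A quick back-of-the-envelope sizing from the numbers above (2 packets per device per second) helps frame the choice:

```python
# Ingest-rate and daily-volume estimate from the figures in the question.
DEVICES_MIN, DEVICES_MAX = 1_000, 12_000
PACKETS_PER_DEVICE_PER_SEC = 2
SECONDS_PER_DAY = 86_400

writes_per_sec_min = DEVICES_MIN * PACKETS_PER_DEVICE_PER_SEC   # 2,000 writes/s
writes_per_sec_max = DEVICES_MAX * PACKETS_PER_DEVICE_PER_SEC   # 24,000 writes/s
rows_per_day_max = writes_per_sec_max * SECONDS_PER_DAY         # ~2.07 billion rows/day
```

A sustained write rate of up to 24,000 rows/s and roughly two billion rows per day is the load any candidate store (and its replication setup) should be stress-tested against.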

Replies (1)
Recommends MongoDB

Hi Umair, did you try MongoDB? We are using MongoDB in a production environment, collecting data from devices in a scenario like yours. We have a MongoDB cluster with three replicas: data from the devices is written to the primary node, and the real-time dashboard UI uses the secondary nodes for read operations. With this setup, write operations are not affected by read operations.

Don Bizzell · February 9th 2022 at 5:05PM

You might want to look at YugabyteDB: it is open source, scalable, and can do geographically distributed clusters. Best of all, it fully supports Postgres, so you may not have to change anything but a driver.
