Hello Stackshare. I'm currently doing some research on real-time reporting and analytics architectures. We have a use case where 1million+ records of users, 4million+ activities, and messages that we want to report against. The start was to present it directly from MySQL, which didn't go well and puts a heavy load on the database. Anybody can suggest something where we feed the data and can report in realtime? Read some articles about ElasticSearch and Kafka https://medium.com/@D11Engg/building-scalable-real-time-analytics-alerting-and-anomaly-detection-architecture-at-dream11-e20edec91d33 EDIT: also considering Neo4j
One of the reasons why your real-time reporting built on top of MySQL might not be performing so well is due to the fact that you are most likely interested in aggregates (e.g. group by & SUM, AVG, TopN). In data warehousing, there is a term known as column-oriented vs row-oriented databases - the key here is that in column-oriented DBMSs, you more precisely access the data you need to answer a question, avoiding having to scan the entire table to calculate an answer. Most of the time pre-aggregates can be calculated on insertion instead of at query time.
An excellent OLAP modern tool that I successfully used for many years to index events from Kafka at a staggering rate and query millions of events in less than a second is Apache Druid and it's an example of a distributed column-oriented data store. There are of course many more technologies out there for answering OLAP business intelligence questions, but personally, I think you won't go very far with a traditional RDBMS or a Lucene based search engine like ElasticSearch for building a Business Intelligence database for vast amounts of data.
"Apache Druid is an open-source data store designed for sub-second queries on real-time and historical data. It is primarily used for business intelligence (OLAP) queries on event data. Druid provides low latency (real-time) data ingestion, flexible data exploration, and fast data aggregation."
If you don't want to invest resources into deploying and hosting it yourself, there are other companies out there that can host it for you, but I will leave that up to you to research.
Here is an excellent article by my former work colleagues explaining how they implemented real-time analytics on top of Druid: https://medium.com/superawesome-engineering/how-we-use-apache-druids-real-time-analytics-to-power-kidtech-at-superawesome-8da6a0fb28b1. Also, I recommend reading through this HackerNews thread that talks in-depth about time-series databases: https://news.ycombinator.com/item?id=18403507.
With the nature of application that you're building, you might even consider setting up some KSQL streams. I have just recently finished a poc on establishing a streaming analytics pipeline with KSQL dB (Both standalone and confluent supported) setting up kafka streams. Also they have a headless deployment mode in production which keeps your KSQL script pretty secured.