Hi, I need advice on which Database tool to use in the following scenario:
I work with Cesium, and I need to save and load CZML snapshot and update objects for a recording program that saves files containing several entities (along with the time of the snapshot or update). I need to be able to easily load the files according to the corresponding timeline point (for example, if the update was recorded at 13:15, I should be able to easily load the update file when I click on the 13:15 point on the timeline). I should also be able to make geo-queries relatively easily.
I am currently thinking about Elasticsearch or PostgreSQL, but I am open to suggestions. I looked into time series databases like TimescaleDB but found them more powerful than I need, since the update time is a simple variable.
Thanks for your advice in advance!
In your situation, PostgreSQL seems to be the better option. Why?
1) Saving structured data is possible in both PostgreSQL and Elasticsearch. PostgreSQL has the JSONB column type, and you can build indexes on top of it.
2) If you are able to use the time as a primary key, both Elasticsearch and PostgreSQL are great options.
3) PostgreSQL allows you to do a lot more with your data and handle it in a relational way. You did not say whether that is a benefit for you, but let's count extensibility as an advantage.
4) PostgreSQL has the PostGIS extension for working with geo data, which may be useful for your geo-queries.
5) PostgreSQL may serve other needs of your app. Managing one database is always better than having two of them.
Thanks to the JSONB column type, PostgreSQL is a sweet combination of a relational and a NoSQL database, but there are also drawbacks that come from ACID compliance and WAL overhead under rapid changes.
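To make the "load the update for the clicked timeline point" idea concrete, here is a minimal sketch. The table name `czml_updates` and its columns are hypothetical, and the SQL strings only show how the lookup might be phrased in PostgreSQL. The small in-memory class mirrors the same "latest record at or before the clicked time" logic so it can be tried without a database.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Any, List, Optional

# Hypothetical PostgreSQL schema and query (names are assumptions, not from the question).
CREATE_TABLE_SQL = """
CREATE TABLE czml_updates (
    recorded_at timestamptz PRIMARY KEY,  -- time of the snapshot or update
    czml        jsonb NOT NULL            -- the CZML packet(s) as JSONB
);
"""

LATEST_AT_OR_BEFORE_SQL = """
SELECT czml FROM czml_updates
WHERE recorded_at <= %s
ORDER BY recorded_at DESC
LIMIT 1;
"""


class SnapshotIndex:
    """In-memory stand-in for the lookup above: latest record at or before a time."""

    def __init__(self) -> None:
        self._times: List[datetime] = []
        self._docs: List[Any] = []

    def add(self, recorded_at: datetime, czml: Any) -> None:
        # Keep both lists sorted by time so lookups can use binary search.
        pos = bisect_right(self._times, recorded_at)
        self._times.insert(pos, recorded_at)
        self._docs.insert(pos, czml)

    def at(self, clicked: datetime) -> Optional[Any]:
        # Index of the first record strictly after the clicked time.
        pos = bisect_right(self._times, clicked)
        return None if pos == 0 else self._docs[pos - 1]
```

Clicking 13:15 on the timeline would then map to `index.at(datetime(..., 13, 15))`, or to running the query with that timestamp as the parameter.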
We have a Kafka topic with events of type A and type B. We need to perform an inner join on both types of events using a common field (the primary key). The joined events are to be inserted into Elasticsearch.
Usually, type A and type B events with the same key arrive within 15 minutes of each other. In some cases they may be far apart, let's say 6 hours. Sometimes an event of one of the types never arrives.
In all cases, we should be able to find joined events instantly after they are joined and not-joined events within 15 minutes.
The first solution that came to me is to use upsert to update Elasticsearch:
- Use the primary-key as ES document id
- Upsert the records to ES as soon as you receive them. As you are using upsert, the 2nd record of the same primary-key will not overwrite the 1st one, but will be merged with it.
Cons: The load on ES will be higher, due to upsert.
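A rough sketch of what the upsert approach could look like. `bulk_upsert_lines` builds the two NDJSON lines that the Elasticsearch bulk API expects for a partial-document update with `doc_as_upsert`; the index name and field names are made up for illustration. `apply_upsert` is only a local simulation of the merge (a shallow one), not Elasticsearch itself.

```python
import json
from typing import Any, Dict


def bulk_upsert_lines(index: str, doc_id: str, partial: Dict[str, Any]) -> str:
    """Build one bulk 'update' action that creates the doc if it is missing."""
    action = {"update": {"_index": index, "_id": doc_id}}
    # doc_as_upsert: use the partial doc as the initial doc when none exists yet.
    body = {"doc": partial, "doc_as_upsert": True}
    return json.dumps(action) + "\n" + json.dumps(body) + "\n"


def apply_upsert(store: Dict[str, Dict[str, Any]], doc_id: str,
                 partial: Dict[str, Any]) -> Dict[str, Any]:
    """Local simulation: merge the new fields into whatever is already stored."""
    merged = {**store.get(doc_id, {}), **partial}
    store[doc_id] = merged
    return merged
```

With the primary key as the document id, the type A event and the type B event each send their own fields, and whichever arrives second completes the document.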
The second solution is to use Flink:
- Create a KeyedStream by the primary-key (using keyBy)
- In the ProcessFunction, save the first record in a State. At the same time, create a Timer for 15 minutes in the future
- When the 2nd record comes, read the 1st record from the State, merge those two, and send out the result, and clear the State and the Timer if it has not fired
- When the Timer fires, read the 1st record from the State and send out as the output record.
- Have a 2nd Timer of 6 hours (or more) if you are not using Windowing to clean up the State
Pro: this fits well if you already have Flink ingesting this stream. Otherwise, I would just go with the 1st solution.
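The state-and-timer logic in the steps above can be sketched in plain Python. This is a simulation of the idea, not Flink code: the class and method names are hypothetical, time is passed in explicitly as seconds, and it assumes at most one type A and one type B event per key.

```python
from typing import Any, Dict, List

JOIN_TIMEOUT_S = 15 * 60      # emit the un-joined record after 15 minutes
STATE_TTL_S = 6 * 60 * 60     # drop the state after 6 hours


class TimedJoin:
    def __init__(self) -> None:
        # key -> {"record": dict, "first_seen": float, "emitted": bool}
        self.state: Dict[str, Dict[str, Any]] = {}

    def on_event(self, key: str, record: Dict[str, Any], now: float) -> List[dict]:
        """Handle one incoming event; returns the records to send downstream."""
        pending = self.state.pop(key, None)
        if pending is None:
            # First record for this key: remember it and wait for its partner.
            self.state[key] = {"record": record, "first_seen": now, "emitted": False}
            return []
        # Second record: merge with the stored one; the state is already cleared.
        return [{**pending["record"], **record}]

    def on_time(self, now: float) -> List[dict]:
        """Fire due timers; returns un-joined records whose 15 minutes are up."""
        out: List[dict] = []
        for key in list(self.state):
            entry = self.state[key]
            age = now - entry["first_seen"]
            if age >= JOIN_TIMEOUT_S and not entry["emitted"]:
                entry["emitted"] = True
                out.append(entry["record"])
            if age >= STATE_TTL_S:
                del self.state[key]  # cleanup timer: stop waiting for a partner
        return out
```

Note that a record emitted as un-joined at 15 minutes stays in state, so a partner arriving before the 6-hour cleanup still produces the joined output.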
In the Flink approach, we can't query the data while it is being processed (in Flink memory). Consequently, we have to wait 6 hours for an event to become available. This can be worked around, though, by maintaining a copy of the data being processed for 15 minutes.
Thank you so much for the detailed solution.
What are your views on preferring Apache Flink over Kafka Streams and Apache Spark for this use case?
What do you think about using MongoDB for the 1st case instead of ES? My point of view is that it's easier to get started with MongoDB.
Please refer to the "Structured Streaming" feature of Spark, specifically "Stream-stream Joins" at https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#stream-stream-joins . In short, you need to "define watermark delays on both inputs" and "define a constraint on time across the two inputs".
Hi, we have a situation where we are using Prometheus to get system metrics from the PCF (Pivotal Cloud Foundry) platform. We send them as time-series data to Cortex via a Prometheus server and have built a dashboard using Grafana. There is another pipeline where we need to read CPU, memory, and disk metrics from a Linux server using Metricbeat. Those will be sent to Elasticsearch, and Grafana will pull and show the data in a dashboard.
Is it OK to use Metricbeat for the Linux server, or can we use Prometheus?
What is the difference between the system metrics sent by Metricbeat and by the Prometheus node exporter?
If you're already using Prometheus for your system metrics, then it seems like standing up Elasticsearch just for Linux host monitoring is excessive. The node_exporter is probably sufficient if you're looking for standard system metrics.
Another thing to consider is that Metricbeat / ELK use a push model for metrics delivery, whereas Prometheus pulls metrics from each node it is monitoring. Depending on how you manage your network security, opting for one solution over two may make things simpler.
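To illustrate the pull model: Prometheus scrapes a plain-text page from each target over HTTP, so the target only has to expose its current values. The sketch below renders gauges in the Prometheus text exposition format and serves them at `/metrics`. The metric name `demo_cpu_load` and port 8000 are made up for illustration; in practice node_exporter already does this job for standard host metrics.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict


def render_metrics(gauges: Dict[str, float]) -> str:
    """Render gauges in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(gauges.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        # 1-minute load average; os.getloadavg() is available on Unix only.
        body = render_metrics({"demo_cpu_load": os.getloadavg()[0]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block and answer scrapes; Prometheus would be pointed at this port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

With a push agent such as Metricbeat the direction is reversed: the host opens the connection to the backend, which is why the firewall rules differ between the two setups.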
Hi Sunil! Unfortunately, I don't have much experience with Metricbeat, so I can't advise on the differences from Prometheus. For the Linux server, I encourage you to use the Prometheus node exporter, and for PCF, I would recommend using the Instana tile (https://www.instana.com/supported-technologies/pivotal-cloud-foundry/). Let me know if you have further questions! Regards, Jose
Hey everybody! (1) I am developing an Android application. I have around 3 million records (less than a TB). I want to save that data in the cloud. Which company provides the best cloud database services that would suit my scenario? It should be secure, usable long term, and provide good service. I decided to use Firebase Realtime Database. Should I stick with Firebase or are there any other companies that provide a better service?
(2) I have the functionality of searching data in my app. Same data (less than a TB). Which search solution should I use in this case? I found Elasticsearch and Algolia search. It should be secure and fast. If any other company provides better services than these, please feel free to suggest them.
Hi Rana, good question! From my Firebase experience, 3 million records is not too big at all, as long as the cost is within reason for you. With Firebase you will be able to access the data from anywhere, including an android app, and implement fine-grained security with JSON rules. The real-time-ness works perfectly. As a fully managed database, Firebase really takes care of everything. The only thing to watch out for is if you need complex query patterns - Firestore (also in the Firebase family) can be a better fit there.
To answer question 2: the right answer will depend on what's most important to you. Algolia is like Firebase in that it is fully managed, very easy to set up, and has great SDKs for Android. Algolia is really a full-stack search solution in this case, and it is easy to connect with your Firebase data. Bear in mind that Algolia does cost money, so you'll want to make sure the cost is okay for you, but you will save a lot of engineering time and never have to worry about scale. The search-as-you-type performance with Algolia is flawless, as that is a primary aspect of its design. Elasticsearch can store tons of data, is very flexible, is hosted cheaply by many cloud services, and has many users. If you haven't done a lot with search before, the learning curve for getting results ranked properly is steeper than with Algolia, and there is another learning curve if you want to do the DevOps part yourself. Both are very good platforms for search. Algolia shines when building your app is the most important thing and you don't want to spend many engineering hours; Elasticsearch shines when you have a lot of data and don't mind learning how to run and optimize it.
Rana - we use Cloud Firestore at our startup. It handles many millions of records without any issues. It provides the same set of features as the Firebase Realtime Database, on top of the indexing and security trims. The only thing to watch out for is to make sure your Cloud Functions have proper exception handling and that there are no infinite loops in the code. These can become very costly if not caught quickly.
For search, Algolia is a great option, but cost is a real consideration. Indexing a large number of records can be cost-prohibitive for most projects. Elasticsearch is a solid alternative, but requires a little additional work to configure and maintain if you want to self-host.
Hope this helps.
We are starting to work on a web-based platform aiming to connect investors/wholesalers (clients) and buyers (service providers). A third service provider, lenders, will be added in the future.
Our core functionality is the ability to create profiles of buyers with their buying criteria and to create saved records of properties for sale (provided by the client) that are cross-referenced against the buyers' criteria.
In-app, timeline-based, real-time communication between users (& storing it), file transfers, and push notifications are post MVP features we would like as well.
We are considering using React, Elasticsearch / App Search w/ their Search UI, and using Real-Time Database and functionalities of Firebase.
Hi, community, I'm planning to build a web service that will perform a text search in a data set of less than 3k well-structured JSON objects containing config data. I'm expecting no more than 20 MB of data. The general traits I need for this search are:
- Typo tolerant (fuzzy query), so it has to match the entries even though the query does not match 100% with a word in that JSON
- Allow a strict match mode
- Perform the search through all the JSON values (it can reach 6 nesting levels)
- Ignore all keys of the JSON; I'm interested only in the values.
The only thing I'm researching at the moment is Elasticsearch, and since the rest of the stack is on AWS, the Amazon Elasticsearch Service is my favorite candidate so far. However, the only knowledge I have of it comes from some articles and Q&A that I read here and there. Is Elasticsearch a good path for this project? I'm also considering Amazon DynamoDB (which I also don't know well), but it does not seem to cover the requirements of fuzzy search and ignoring the JSON keys. Thank you in advance for your precious advice!
The Amazon Elasticsearch Service will certainly help you do most of the heavy lifting, and you won't have to maintain any of the underlying infrastructure. However, Elasticsearch isn't trivial. Typically, this will mean several days' worth of work.
Over the years and across projects, I've leveraged another solution called Algolia Search. Algolia is a fully managed search-as-a-service solution. It has SDKs available for most common languages, will answer your fuzzy search requirements, and will also cut down implementation and maintenance costs significantly. You should be able to get a solution up and running within a couple of minutes to an hour.
I think Elasticsearch should be a great fit for that use case. Using the AWS version will make your life easier. With such a small dataset you may also be able to use an in-process library for searching and possibly remove the overhead of using a database. I don't know if it fits the bill, but you may also want to look into Lucene.
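For a dataset this small, the in-process route could look roughly like the sketch below. It uses only Python's standard library (`difflib`) as a stand-in for a real search library, so the matching is word-level similarity rather than proper full-text ranking. The function names and the 0.8 cutoff are assumptions chosen for illustration.

```python
from difflib import get_close_matches
from typing import Any, Iterator, List


def iter_values(node: Any) -> Iterator[str]:
    """Yield every leaf value as lowercase text; dict keys are never yielded."""
    if isinstance(node, dict):
        for child in node.values():
            yield from iter_values(child)
    elif isinstance(node, list):
        for child in node:
            yield from iter_values(child)
    elif node is not None:
        yield str(node).lower()


def tokens(doc: Any) -> List[str]:
    """Split all leaf values of one document into words."""
    return [word for value in iter_values(doc) for word in value.split()]


def search(docs: List[Any], query: str, strict: bool = False,
           cutoff: float = 0.8) -> List[int]:
    """Return indexes of documents with a value word matching the query."""
    needle = query.lower()
    hits = []
    for i, doc in enumerate(docs):
        words = tokens(doc)
        if strict:
            matched = needle in words  # exact word match only
        else:
            # Fuzzy: any word whose similarity ratio reaches the cutoff.
            matched = bool(get_close_matches(needle, words, n=1, cutoff=cutoff))
        if matched:
            hits.append(i)
    return hits
```

For example, `search(docs, "timout")` would find a document whose nested value contains "timeout", while `strict=True` would require the exact word. This rescans every document per query, which is fine for about 3k objects but is not how a real index works.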
I can tell you that DynamoDB is definitely not a good fit for your use case. There is no fuzzy matching feature, and you would need to have an index for each field you want to search or convert your data into a more searchable format for storing in DynamoDB, which is something a full-text search tool like Elasticsearch is going to do for you.
Hi everyone. I'm trying to create my personal syslog monitoring.
To get the logs, I am unsure which way to choose:
1.1 Use Logstash as a TCP server.
1.2 Implement a Go TCP server.
To store and plot the data:
2.1 Use Elasticsearch tools.
2.2 Use InfluxDB and Grafana.
I would like to know which is the cheaper and more scalable solution, or whether there is an even better way to do it.
A very simple and cheap (in resource usage) option here would be to use Promtail to send syslog data to Loki and visualise it in Grafana using the native Loki data source. I recently put this setup together: Promtail and Loki are less resource-intensive than Logstash/ES, the setup and configuration are simple, and it works very nicely.
Is Promtail available for PCF?
Hi @sunilmchaudhari, I do not know. I assume by PCF you are referring to Pivotal Cloud Foundry, which I have no knowledge of, sorry. Promtail is a Go binary, so if you can add log data to a syslog, then you can process it with Promtail.
For syslog, you can certainly use the TCP input. I'm really interested to know what your syslog client is (the one that will ship logs to Logstash). In any case, check whether that client can be configured with multiple Logstash hosts and ports so that it works as a load balancer. This will increase throughput. Also check the pipeline-to-pipeline communication of Logstash: https://www.elastic.co/guide/en/logstash/current/pipeline-to-pipeline.html This helps to implement the distributor pattern, where multiple types of data come in on the same input and you want to route filtering and processing based on type. It increases parallelism. About Elasticsearch: it is a native component that fits perfectly with Logstash, so you can use Elasticsearch for storage and search. It is also one of the data sources supported by Grafana.
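For a sense of what option 1.2 in the question (a hand-written TCP server) involves, here is a minimal sketch. It is written in Python rather than Go, assumes newline-delimited messages, and only decodes the `<PRI>` header, where facility is PRI divided by 8 and severity is the remainder. Port 5514 and the handler name are arbitrary choices for illustration.

```python
import socketserver
from typing import Tuple


def parse_priority(line: str) -> Tuple[int, int, str]:
    """Split '<PRI>rest' into (facility, severity, rest)."""
    if not line.startswith("<"):
        raise ValueError("missing <PRI> header")
    end = line.index(">")          # raises ValueError if '>' is absent
    pri = int(line[1:end])         # raises ValueError if not a number
    return pri // 8, pri % 8, line[end + 1:]


class SyslogHandler(socketserver.StreamRequestHandler):
    def handle(self) -> None:
        # Assumption: one syslog message per line (newline framing).
        for raw in self.rfile:
            line = raw.decode("utf-8", errors="replace").rstrip("\r\n")
            try:
                facility, severity, message = parse_priority(line)
            except ValueError:
                continue  # skip lines without a valid priority header
            # A real collector would forward this to storage instead of printing.
            print(facility, severity, message)


def serve(host: str = "0.0.0.0", port: int = 5514) -> None:
    """Block and accept syslog-over-TCP connections, one thread per client."""
    with socketserver.ThreadingTCPServer((host, port), SyslogHandler) as server:
        server.serve_forever()
```

Everything Logstash or Promtail give you for free (buffering, retries, parsing of the rest of the message, shipping to storage) would still have to be written on top of this, which is worth weighing against the resource savings.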