Cassandra

Cassandra

Application and Data / Data Stores / Databases
Needs advice
on
Amazon S3Amazon S3DremioDremio
and
SnowflakeSnowflake

Trying to establish a data lake(or maybe puddle) for my org's Data Sharing project. The idea is that outside partners would send cuts of their PHI data, regardless of format/variables/systems, to our Data Team who would then harmonize the data, create data marts, and eventually use it for something. End-to-end, I'm envisioning:

  1. Ingestion->Secure, role-based, self service portal for users to upload data (1a. bonus points if it can preform basic validations/masking)
  2. Storage->Amazon S3 seems like the cheapest. We probably won't need very big, even at full capacity. Our current storage is a secure Box folder that has ~4GB with several batches of test data, code, presentations, and planning docs.
  3. Data Catalog-> AWS Glue? Azure Data Factory? Snowplow? is the main difference basically based on the vendor? We also will have Data Dictionaries/Codebooks from submitters. Where would they fit in?
  4. Partitions-> I've seen Cassandra and YARN mentioned, but have no experience with either
  5. Processing-> We want to use SAS if at all possible. What will work with SAS code?
  6. Pipeline/Automation->The check-in and verification processes that have been outlined are rather involved. Some sort of automated messaging or approval workflow would be nice
  7. I have very little guidance on what a "Data Mart" should look like, so I'm going with the idea that it would be another "experimental" partition. Unless there's an actual mart-building paradigm I've missed?
  8. An end user might use the catalog to pull certain de-identified data sets from the marts. Again, role-based access and self-service gui would be preferable. I'm the only full-time tech person on this project, but I'm mostly an OOP, HTML, JavaScript, and some SQL programmer. Most of this is out of my repertoire. I've done a lot of research, but I can't be an effective evangelist without hands-on experience. Since we're starting a new year of our grant, they've finally decided to let me try some stuff out. Any pointers would be appreciated!
READ MORE
6 upvotes·81.3K views
Developer at SAP Labs India Private Ltd·
Needs advice
on
CassandraCassandraMongoDBMongoDB
and
PostgreSQLPostgreSQL

I was reading instagram system design question

He said he will be using Cassandra for tables storing user, photo metadata, user follows tables... I didn't understand why he is using Cassandra and why not regular rdms like PostgreSQL or nosql like MongoDB.

READ MORE
2 upvotes·29.6K views
Technical Architect at ERP Studio·
Needs advice
on
CassandraCassandraDruidDruid
and
TimescaleDBTimescaleDB

Developing a solution that collects Telemetry Data from different devices, nearly 1000 devices minimum and maximum 12000. Each device is sending 2 packets in 1 second. This is time-series data, and this data definition and different reports are saved on PostgreSQL. Like Building information, maintenance records, etc. I want to know about the best solution. This data is required for Math and ML to run different algorithms. Also, data is raw without definitions and information stored in PostgreSQL. Initially, I went with TimescaleDB due to PostgreSQL support, but to increase in sites, I started facing many issues with timescale DB in terms of flexibility of storing data.

My major requirement is also the replication of the database for reporting and different purposes. You may also suggest other options other than Druid and Cassandra. But an open source solution is appreciated.

READ MORE
3 upvotes·440.3K views
Replies (1)
Recommends
on
MongoDB

Hi Umair, Did you try MongoDB. We are using MongoDB on a production environment and collecting data from devices like your scenario. We have a MongoDB cluster with three replicas. Data from devices are being written to the master node and real-time dashboard UI is using the secondary nodes for read operations. With this setup write operations are not affected by read operations too.

READ MORE
6 upvotes·1 comment·64.9K views
Don Bizzell
Don Bizzell
·
February 9th 2022 at 5:05PM

You might want to look at Yugabyte DB it is open source, scalable, and can do geographicly distributed clusters. Best of all it fully supports Postgress, so you may not have to change anything but a driver.

·
Reply