Delta Lake

The answer to your question requires a careful study of your project's scope, demands, and resources. A one-size-fits-all answer would be misleading at best. Here, I summarize a few points you should consider before making a decision, and then I justify my personal recommendations. These recommendations are tools that could be among the many potential solutions to your data design problem, based on your brief description.

Threshold for Distributed Processing:

The threshold for distributed processing depends on data volume, complexity, and system performance.
To determine whether this threshold has been reached, assess:
  - System performance under current loads.
  - Query execution times and resource utilization.
  - The scalability requirements of your system.
  - The complexity of the data and the computations being performed.
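The checklist above can be turned into a rough, automatable test. The following is a minimal sketch, not a rule: the `WorkloadSnapshot` fields, the function name, and every cutoff (such as 85% CPU) are hypothetical values chosen for illustration, and you would replace them with measurements and targets from your own system.

```python
from dataclasses import dataclass


@dataclass
class WorkloadSnapshot:
    """Hypothetical measurements taken from a single-node system."""
    dataset_gb: float              # size of the working dataset
    node_memory_gb: float          # RAM on the largest practical single node
    p95_query_seconds: float       # 95th percentile query latency today
    target_query_seconds: float    # latency the use case can tolerate
    peak_cpu_utilization: float    # 0.0 to 1.0 under current load
    expected_yearly_growth: float  # 0.5 means +50% data per year


def distributed_processing_signals(w: WorkloadSnapshot) -> list:
    """Return the reasons, if any, that point toward distributed processing.

    Every cutoff below is an illustrative assumption, not an industry rule.
    """
    signals = []
    if w.dataset_gb > w.node_memory_gb:
        signals.append("working set exceeds single-node memory")
    if w.p95_query_seconds > w.target_query_seconds:
        signals.append("queries miss the latency target")
    if w.peak_cpu_utilization > 0.85:
        signals.append("CPU is saturated at peak")
    if w.dataset_gb * (1 + w.expected_yearly_growth) > w.node_memory_gb:
        signals.append("projected growth outgrows the node within a year")
    return signals


small = WorkloadSnapshot(20, 64, 2.0, 5.0, 0.40, 0.5)
large = WorkloadSnapshot(500, 64, 30.0, 5.0, 0.95, 1.0)
print(distributed_processing_signals(small))
print(distributed_processing_signals(large))
```

An empty list suggests a single well-tuned node is probably still enough; several signals at once suggest it is worth evaluating a distributed engine.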

Categories of Operations:

The answer largely depends on the kinds of operations you are performing:
1. Data Transformation: Such as normalizing data formats, cleaning data, and transforming data structures.
2. Aggregation and Summary Statistics: Useful for generating reports or insights from large datasets.
3. Complex Joins: Involving multiple datasets, which can be computationally intensive.
4. Predictive Analytics and Machine Learning: Where large volumes of data are used to train models.
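To make the first three categories concrete, here is a small sketch that runs a transformation, a join, and an aggregation in SQL. It uses Python's built-in `sqlite3` module purely as a stand-in so the example is self-contained; the `customers` and `orders` tables and their contents are invented for illustration. The same query shapes apply in PostgreSQL or Spark SQL, which is where the question of scale comes in.

```python
import sqlite3

# In-memory database used only as a self-contained stand-in.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, ' us '), (2, 'DE'), (3, 'us');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 7.5), (4, 3, 20.0);
""")

# Data transformation: normalize inconsistent country codes.
conn.execute("UPDATE customers SET country = UPPER(TRIM(country))")

# Join plus aggregation: order count and revenue per country.
rows = conn.execute("""
    SELECT c.country, COUNT(*) AS n_orders, SUM(o.amount) AS revenue
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()

for country, n_orders, revenue in rows:
    print(country, n_orders, revenue)
```

On a few rows any engine handles this instantly. The joins and group-bys are the steps whose cost grows fastest with data volume, so they are the ones to profile first.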

Then, choose a framework that best suits your needs. For big data applications specifically, I have experience with Apache Spark and have seen enormous potential in tools such as Delta Lake, so I believe the two make a versatile combination for different use cases. At the same time, I know PostgreSQL handles intensive workloads extremely well, and it appears in the stacks of many top-performing tech companies, likely for their business intelligence and reporting needs.

Another important hint is to plan a comprehensive stack, so that each framework is used for the use cases it handles best. I can definitely envision a system where the three technologies interact, each playing to its strengths. As much as you can, make them your own :)

Delta Lake Discussions


yurisugano

Feb 6, 2024

Needs advice on: Apache Spark, Delta Lake, PostgreSQL

Arjun R

Jun 3, 2022

Needs advice on: Delta Lake, Azure Cosmos DB, JSON

We are building a cloud-based analytical app. Most of the data for the UI flows from SQL Server to Delta Lake, and then from Delta Lake to Azure Cosmos DB as JSON using Databricks, so that the API can send it to the front end. Sometimes, when transforming table rows into JSON, we get documents that exceed Cosmos DB's 2 MB item size limit. What is the best solution for replacing Cosmos DB?
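One option to weigh before replacing the store is to split oversized documents into several smaller ones at write time. The sketch below is only an illustration under assumptions: the `split_rows` helper, the `parentId` and `rows` field names, and the chunk id scheme are hypothetical, and the size check measures the UTF-8 length of Python's default JSON serialization, which may differ slightly from how the database counts bytes.

```python
import json

MAX_ITEM_BYTES = 2 * 1024 * 1024  # the 2 MB limit mentioned in the question


def json_size_bytes(value) -> int:
    """UTF-8 size of the value when serialized with json.dumps defaults."""
    return len(json.dumps(value, ensure_ascii=False).encode("utf-8"))


def split_rows(parent_id, rows, max_bytes=MAX_ITEM_BYTES):
    """Group rows into documents that each serialize to at most max_bytes.

    Field names and the id scheme are hypothetical, for illustration only.
    """
    chunks = []

    def make(chunk_rows):
        return {"id": f"{parent_id}-{len(chunks)}",
                "parentId": parent_id,
                "rows": chunk_rows}

    current = []
    size = json_size_bytes(make([]))
    for row in rows:
        row_size = json_size_bytes(row) + 2  # +2 for the ", " list separator
        if json_size_bytes(make([])) + row_size > max_bytes:
            raise ValueError("a single row exceeds the size limit")
        if current and size + row_size > max_bytes:
            chunks.append(make(current))
            current = []
            size = json_size_bytes(make([]))
        current.append(row)
        size += row_size
    if current:
        chunks.append(make(current))
    return chunks


# Small limit so the splitting is visible with toy data.
sample_rows = [{"k": "x" * 100} for _ in range(10)]
docs = split_rows("doc", sample_rows, max_bytes=400)
print(len(docs), [json_size_bytes(d) for d in docs])
```

The API would then fetch all chunks sharing a `parentId` and concatenate their `rows`. Whether that extra read step is acceptable, compared with moving large payloads to a different store, depends on how often documents exceed the limit.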
