Apache Spark

Decision at Stitch Fix about Amazon EC2 Container Service, Docker, PyTorch, R, Python, Presto, Apache Spark, Amazon S3, PostgreSQL, Kafka, Data, DataStack, DataScience, ML, Etl, AWS

ecolson, Chief Algorithms Officer at Stitch Fix

The algorithms and data infrastructure at Stitch Fix are housed in #AWS. Data acquisition is split between events flowing through Kafka and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3-based data warehouse. Apache Spark on YARN is our tool of choice for data movement and #ETL. Because our storage layer (S3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling YARN clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for ad hoc queries and dashboards.
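A minimal sketch of the Spark-on-S3 ETL pattern described above, not Stitch Fix's actual code; the bucket names, paths, and columns are hypothetical, and the S3 connector (hadoop-aws) is assumed to be on the Spark classpath:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Read raw event data that has landed in S3 (hypothetical bucket and layout).
events = spark.read.json("s3a://example-raw-bucket/events/2019-01-01/")

# A simple transformation: keep valid events and count them per user and type.
daily_counts = (events
                .filter(F.col("event_type").isNotNull())
                .groupBy("user_id", "event_type")
                .count())

# Write the result back to the S3 warehouse, partitioned for downstream readers.
(daily_counts.write
 .mode("overwrite")
 .partitionBy("event_type")
 .parquet("s3a://example-warehouse-bucket/daily_event_counts/"))
```

Because both input and output live on S3 rather than on cluster-local storage, the YARN cluster that ran a job like this can be resized or torn down without affecting the data.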

Beyond data movement and ETL, most #ML-centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in-house and open-sourced (see https://github.com/stitchfix/flotilla-os).

At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated into systems. That requires a serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize those models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn), by automatically packaging them as Docker containers and deploying them to Amazon ECS. This gives our data scientists a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.
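As a rough illustration of the packaging step, here is a minimal sketch of the kind of scoring service such a framework might wrap in a Docker container; this is not Khan's actual interface, and the model file, route, and payload shape are hypothetical:

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load a model trained with an open source framework (e.g. scikit-learn).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [[1.0, 2.0, 3.0]]}.
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    # Inside the container this would typically run behind a proper WSGI server.
    app.run(host="0.0.0.0", port=8080)
```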

#DataScience #DataStack #Data


Decision at Uber Technologies about Apache Spark, C#, OpenShift, JavaScript, Kubernetes, C++, Go, Node.js, Java, Python, Jaeger

conor, Tech Brand Mgr, Office of CTO at Uber

How Uber developed Jaeger, the open source, end-to-end distributed tracing system that is now a CNCF project:

Distributed tracing is quickly becoming a must-have component in the tools that organizations use to monitor their complex, microservice-based architectures. At Uber, our open source distributed tracing system Jaeger saw large-scale internal adoption throughout 2016, integrated into hundreds of microservices and now recording thousands of traces every second.

Here is the story of how we got here, from investigating off-the-shelf solutions like Zipkin, to why we switched from a pull to a push architecture, and how distributed tracing will continue to evolve:

https://eng.uber.com/distributed-tracing/

(Project site: https://www.jaegertracing.io/, GitHub: https://github.com/jaegertracing/jaeger)

Bindings/Operator: Python, Java, Node.js, Go, C++, Kubernetes, JavaScript, OpenShift, C#, Apache Spark
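For a sense of what instrumentation looks like from the Python binding listed above, here is a minimal sketch using the jaeger-client package; the service name, operation name, and tags are hypothetical:

```python
from jaeger_client import Config

config = Config(
    config={
        "sampler": {"type": "const", "param": 1},  # sample every trace
        "logging": True,
    },
    service_name="example-service",
    validate=True,
)
tracer = config.initialize_tracer()

# Wrap a unit of work in a span; finished spans are pushed to the Jaeger agent.
with tracer.start_span("example-operation") as span:
    span.set_tag("example.tag", "value")
    span.log_kv({"event": "work-done"})

tracer.close()  # flush any buffered spans before exiting
```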


Decision at Stitch Fix about Apache Spark, Victory, Amazon S3, Elasticsearch, Redux.js, React

psunnn, Software Engineer at Stitch Fix

As a frontend engineer on the Algorithms & Analytics team at Stitch Fix, I work with data scientists to develop applications and visualizations to help our internal business partners make data-driven decisions. I envisioned a platform that would assist data scientists in the data exploration process, allowing them to visually explore and rapidly iterate through their assumptions, then share their insights with others. This would align with our team's philosophy of having engineers "deploy platforms, services, abstractions, and frameworks that allow the data scientists to conceive of, develop, and deploy their ideas with autonomy", and solve the pain of data exploration.

The final product, code-named Dora, is built with React, Redux.js and Victory, backed by Elasticsearch to enable fast and iterative data exploration, and uses Apache Spark to move data from our Amazon S3 data warehouse into the Elasticsearch cluster.
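A minimal sketch of that last step, not the actual Dora pipeline: Spark reads a dataset from S3 and indexes it into Elasticsearch via the elasticsearch-hadoop connector. The bucket, index name, Elasticsearch host, and connector version are hypothetical, and the connector jar must be available to Spark:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3-to-elasticsearch-sketch")
         # Pull in the elasticsearch-hadoop Spark connector (version assumed).
         .config("spark.jars.packages",
                 "org.elasticsearch:elasticsearch-spark-20_2.11:6.4.2")
         .getOrCreate())

# Read a dataset from the S3 warehouse (hypothetical bucket and layout).
df = spark.read.parquet("s3a://example-warehouse/shipments/")

# Index the rows into Elasticsearch so the frontend can query them quickly.
(df.write
   .format("org.elasticsearch.spark.sql")
   .option("es.nodes", "example-es-host")
   .option("es.port", "9200")
   .mode("append")
   .save("shipments/docs"))
```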


Decision at Grupo Movile about Apache Spark

movilebr

An article introducing how we are applying Apache Spark in Machine Learning projects.

In the article you get a high-level view of how Spark works, which APIs are available, what each of them is meant to do, and how to configure it and program with PySpark on Google Colab.

I suggest you read through the PySpark.sql documentation and try to build a few different analyses of the dataset. Try to get a better understanding of how the API works and how the processing is done, just don't use Pandas, okay? I saved the notebook with all the commands in [8]. Just download it, upload it to Google Colab, and play with the commands.
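A minimal sketch of the Colab setup the article walks through, assuming PySpark has been installed in the notebook with !pip install pyspark; the CSV file and column names are hypothetical stand-ins for the article's dataset:

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .master("local[*]")        # Colab gives you a single machine
         .appName("colab-sketch")
         .getOrCreate())

df = spark.read.csv("example_data.csv", header=True, inferSchema=True)

# A simple pyspark.sql-style analysis, without falling back to Pandas.
(df.groupBy("category")
   .agg(F.count("*").alias("rows"), F.avg("value").alias("avg_value"))
   .orderBy(F.desc("rows"))
   .show())
```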


Decision at Uber Technologies about Kafka Manager, Kafka, GitHub, Apache Spark, Hadoop

conor, Tech Brand Mgr, Office of CTO at Uber

Why we built Marmaray, an open source generic data ingestion and dispersal framework and library for Apache Hadoop:

Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse it to any sink, leveraging Apache Spark. The name, Marmaray, comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:

https://eng.uber.com/marmaray-hadoop-ingestion-open-source/

(Direct GitHub repo: https://github.com/uber/marmaray)
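To make the any-source-to-any-sink idea concrete, here is a conceptual sketch in PySpark; it is not Marmaray's actual API (Marmaray itself is a JVM framework), and the source, sink, paths, and table name are hypothetical:

```python
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.appName("ingestion-sketch").getOrCreate()

def json_file_source(path: str) -> DataFrame:
    """Hypothetical source plug-in: read previously landed JSON events."""
    return spark.read.json(path)

def table_sink(df: DataFrame, table: str) -> None:
    """Hypothetical sink plug-in: persist the DataFrame as a managed table."""
    df.write.mode("overwrite").saveAsTable(table)

def run_pipeline(source, sink, path: str, table: str) -> None:
    """Connect any source to any sink, mirroring the 'tunnel' idea above."""
    sink(source(path), table)

run_pipeline(json_file_source, table_sink,
             path="/tmp/example_events/", table="example_ingested_events")
```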



Decision at Onedot about npm, Blueprint, Amazon S3, Apache Spark, Cassandra, TypeScript, Scala, Redux.js, React

onedotadmin, CTO at Onedot

Onedot is building an automated data preparation service using probabilistic and statistical methods including artificial intelligence (AI). From the beginning, having a stable foundation while at the same time being able to iterate quickly was very important to us. Due to the nature of the compute workloads we face, the decision for a functional programming paradigm and a scalable cluster model was a no-brainer. We started playing with Apache Spark very early on, when the platform was still in its infancy. As a storage backend, we first used Cassandra, but found out that it was not the optimal choice for our workloads (lots of rather smallish datasets, data pipelines with considerable complexity, etc.). In the end, we migrated dataset storage to Amazon S3, which proved to be much better suited to our case. In the frontend, we bet on more traditional frameworks like React/Redux.js, Blueprint and a number of common npm packages from our ecosystem. Because of the very positive experience with Scala (in particular the ability to write things very expressively, use immutability across the board, etc.), we settled on TypeScript in the frontend. In our opinion, a very good decision. Nowadays, transpiling is a common thing, so we thought: why not introduce the same type-safety and mathematical rigour to the user interface?


Decision about Apache Spark

Wei-1

Spark is good at managing parallel data processing. We wrote a neat program to handle the terabytes of data we get every day.
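A minimal sketch, not the author's actual program, of the kind of parallel batch processing described here; the paths, columns, and partition count are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-batch-sketch").getOrCreate()

# Hypothetical daily drop of raw records.
raw = spark.read.parquet("/data/daily/2019-01-01/")

# Repartition so the terabyte-scale input fans out across many executors
# instead of a handful of oversized tasks.
result = (raw.repartition(2000)
             .filter(F.col("status") == "ok")
             .groupBy("account_id")
             .agg(F.sum("bytes").alias("total_bytes")))

result.write.mode("overwrite").parquet("/data/aggregates/2019-01-01/")
```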
