I have to collect different data from multiple sources and store it in a single cloud location, then clean and transform it using PySpark and push the end results to other applications such as reporting tools. What would be the best solution? I can only think of Azure Data Factory + Databricks. Are there any alternatives to #AWS services + Databricks?
That’s a great question! But first, is there any particular reason you are avoiding AWS? Amazon has a lot of great services to handle data collection (Amazon Kinesis), processing (AWS Lambda, Amazon EMR, AWS Glue), storage (Amazon S3), plus monitoring and reporting (Amazon CloudWatch). As you are collecting and processing data, you can kick-start your work on machine learning models with tools like Amazon SageMaker, which is often more efficient (and easier on the budget) than running an EC2 instance with a data science image. Lastly, if you are interested in Databricks, it is fully supported on AWS.
Back to the question: why not AWS? If the concern is the overhead of setting up all the resources, you can use The Ops Platform to automate the setup and maintenance, so that anyone on your team can handle it with confidence. Reach out to us if you need any help.