We've been experiencing performance problems in our Hadoop cluster (8 c5.2xlarge nodes) when processing huge volumes of Cassandra data.
The problems were in both speed and reliability (sometimes the M/R split calculation would hang due to driver inefficiencies on huge datasets).
That's why, over the past 6 months, we've put a tremendous effort into migrating our analytics/big data loads to Databricks with Delta Lake on S3.
So far the results are encouraging: our worst-case scenarios improved tremendously in terms of speed, from 8 hours for a 250 GB payload to just under 30 minutes. Stability is also good, but the real benefit is getting away from a pairing that is straight up not made for pain-free big data analytics out of the box: Cassandra + Hadoop. The number of projections of the data one would have to make in order to have Cassandra + Hadoop friendly analytics was simply not worth the pain.
At Relay42 our Cassandra nodes are backed by EBS volumes, which are damn expensive by themselves and also expensive to back up at the scale we run, with so many nodes and so many terabytes of data on each node. This just doesn't scale well financially.
Thus, we are very happy with the cost savings we are making by moving the data from EBS volumes to S3/Delta. Currently we are still running both in parallel, but soon those costly EBS volumes will be shrunk :)
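For a sense of scale, the worst-case numbers quoted above can be turned into rough throughput figures. This is just back-of-the-envelope arithmetic, not a benchmark: the function and constant names are made up for illustration, and "just under 30 minutes" is rounded up to a flat 0.5 hours, so the real speedup is slightly higher than what this prints.

```python
def throughput_gb_per_hour(size_gb: float, hours: float) -> float:
    """Average throughput of a job that processes size_gb in the given number of hours."""
    return size_gb / hours


PAYLOAD_GB = 250.0       # worst-case payload mentioned in the post
HADOOP_HOURS = 8.0       # old Hadoop + Cassandra run time
DATABRICKS_HOURS = 0.5   # assumption: "just under 30 min" rounded up to 30 min

old_rate = throughput_gb_per_hour(PAYLOAD_GB, HADOOP_HOURS)
new_rate = throughput_gb_per_hour(PAYLOAD_GB, DATABRICKS_HOURS)
speedup = new_rate / old_rate

print(f"Hadoop:     {old_rate:.2f} GB/h")
print(f"Databricks: {new_rate:.2f} GB/h")
print(f"Speedup:    at least {speedup:.0f}x")
```

In other words, the same 250 GB job went from roughly 31 GB per hour to at least 500 GB per hour, a speedup of 16x or a bit more.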