Alternatives to Amazon Kinesis Firehose logo

Alternatives to Amazon Kinesis Firehose

Stream, Kafka, Amazon Kinesis, and Google Cloud Dataflow are the most popular alternatives and competitors to Amazon Kinesis Firehose.
153
114
+ 1
0

What is Amazon Kinesis Firehose and what are its top alternatives?

Amazon Kinesis Firehose is the easiest way to load streaming data into AWS. It can capture and automatically load streaming data into Amazon S3 and Amazon Redshift, enabling near real-time analytics with existing business intelligence tools and dashboards you鈥檙e already using today.
Amazon Kinesis Firehose is a tool in the Real-time Data Processing category of a tech stack.

Amazon Kinesis Firehose alternatives & related posts

Stream logo

Stream

137
145
54
Build scalable feeds, activity streams & chat in a few hours instead of months.
137
145
+ 1
54

related Stream posts

related Kafka posts

Eric Colson
Eric Colson
Chief Algorithms Officer at Stitch Fix | 19 upvotes 路 1.4M views

The algorithms and data infrastructure at Stitch Fix is housed in #AWS. Data acquisition is split between events flowing through Kafka, and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on Yarn is our tool of choice for data movement and #ETL. Because our storage layer (s3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling Yarn clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for adhoc queries and dashboards.

Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).

At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated systems. That requires serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize those models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn), by automatically packaging them as Docker containers and deploying to Amazon ECS. This provides our data scientist a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.

For more info:

#DataScience #DataStack #Data

See more
John Kodumal
John Kodumal
CTO at LaunchDarkly | 16 upvotes 路 1M views

As we've evolved or added additional infrastructure to our stack, we've biased towards managed services. Most new backing stores are Amazon RDS instances now. We do use self-managed PostgreSQL with TimescaleDB for time-series data鈥攖his is made HA with the use of Patroni and Consul.

We also use managed Amazon ElastiCache instances instead of spinning up Amazon EC2 instances to run Redis workloads, as well as shifting to Amazon Kinesis instead of Kafka.

See more
Amazon Kinesis logo

Amazon Kinesis

507
370
4
Store and process terabytes of data each hour from hundreds of thousands of sources
507
370
+ 1
4
PROS OF AMAZON KINESIS
CONS OF AMAZON KINESIS
    No cons available

    related Amazon Kinesis posts

    John Kodumal
    John Kodumal
    CTO at LaunchDarkly | 16 upvotes 路 1M views

    As we've evolved or added additional infrastructure to our stack, we've biased towards managed services. Most new backing stores are Amazon RDS instances now. We do use self-managed PostgreSQL with TimescaleDB for time-series data鈥攖his is made HA with the use of Patroni and Consul.

    We also use managed Amazon ElastiCache instances instead of spinning up Amazon EC2 instances to run Redis workloads, as well as shifting to Amazon Kinesis instead of Kafka.

    See more
    Tim Specht
    Tim Specht
    鈥嶤o-Founder and CTO at Dubsmash | 14 upvotes 路 505.5K views

    In order to accurately measure & track user behaviour on our platform we moved over quickly from the initial solution using Google Analytics to a custom-built one due to resource & pricing concerns we had.

    While this does sound complicated, it鈥檚 as easy as clients sending JSON blobs of events to Amazon Kinesis from where we use AWS Lambda & Amazon SQS to batch and process incoming events and then ingest them into Google BigQuery. Once events are stored in BigQuery (which usually only takes a second from the time the client sends the data until it鈥檚 available), we can use almost-standard-SQL to simply query for data while Google makes sure that, even with terabytes of data being scanned, query times stay in the range of seconds rather than hours. Before ingesting their data into the pipeline, our mobile clients are aggregating events internally and, once a certain threshold is reached or the app is going to the background, sending the events as a JSON blob into the stream.

    In the past we had workers running that continuously read from the stream and would validate and post-process the data and then enqueue them for other workers to write them to BigQuery. We went ahead and implemented the Lambda-based approach in such a way that Lambda functions would automatically be triggered for incoming records, pre-aggregate events, and write them back to SQS, from which we then read them, and persist the events to BigQuery. While this approach had a couple of bumps on the road, like re-triggering functions asynchronously to keep up with the stream and proper batch sizes, we finally managed to get it running in a reliable way and are very happy with this solution today.

    #ServerlessTaskProcessing #GeneralAnalytics #RealTimeDataProcessing #BigDataAsAService

    See more
    Google Cloud Dataflow logo

    Google Cloud Dataflow

    130
    199
    0
    A fully-managed cloud service and programming model for batch and streaming big data processing.
    130
    199
    + 1
    0
    PROS OF GOOGLE CLOUD DATAFLOW
      No pros available
      CONS OF GOOGLE CLOUD DATAFLOW
        No cons available

        related Google Cloud Dataflow posts