Alternatives to Amazon Athena

Presto, Amazon Redshift Spectrum, Amazon Redshift, Cassandra, and Spectrum are the most popular alternatives and competitors to Amazon Athena.

What is Amazon Athena and what are its top alternatives?

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
Amazon Athena is a tool in the Big Data Tools category of a tech stack.
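
For example, a typical Athena workflow submits standard SQL against files already sitting in S3 and pays per query. A minimal sketch using boto3 (the database, table, and bucket names here are hypothetical):

```python
import time
import boto3

# Hypothetical names: replace with your own database, table, and result bucket.
athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)

# Athena is asynchronous: poll until the query finishes, then fetch results.
query_id = response["QueryExecutionId"]
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```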

Amazon Athena alternatives & related posts

Presto

Distributed SQL Query Engine for Big Data

related Presto posts

Eric Colson, Chief Algorithms Officer at Stitch Fix:

The algorithms and data infrastructure at Stitch Fix are housed in #AWS. Data acquisition is split between events flowing through Kafka and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on Yarn is our tool of choice for data movement and #ETL. Because our storage layer (S3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling Yarn clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for ad hoc queries and dashboards.

Beyond data movement and ETL, most #ML-centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in-house and open-sourced (see https://github.com/stitchfix/flotilla-os).

At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated into systems. That requires a serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize those models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn) by automatically packaging them as Docker containers and deploying them to Amazon ECS. This gives our data scientists a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.

#DataScience #DataStack #Data
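
The ad hoc query layer Colson describes can be reached programmatically. A minimal sketch using the presto-python-client (the coordinator host, catalog, and table names are hypothetical):

```python
import prestodb  # pip install presto-python-client

# Hypothetical coordinator host and schema; adjust for your cluster.
conn = prestodb.dbapi.connect(
    host="presto-coordinator.internal",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="warehouse",
)

cur = conn.cursor()
# Presto queries the S3-backed tables in place, without moving the data.
cur.execute("SELECT client_id, COUNT(*) FROM shipments GROUP BY client_id LIMIT 10")
for row in cur.fetchall():
    print(row)
```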

StackShare Editors, on Uber Technologies:

To improve platform scalability and efficiency, Uber transitioned from JSON to Parquet, and built a central schema service to manage schemas and integrate different client libraries.

While the first generation big data platform was vulnerable to upstream data format changes, “ad hoc data ingestions jobs were replaced with a standard platform to transfer all source data in its original, nested format into the Hadoop data lake.”

These platform changes enabled Uber to meet the scaling challenges it was facing around that time: “On a daily basis, there were tens of terabytes of new data added to our data lake, and our Big Data platform grew to over 10,000 vcores with over 100,000 running batch jobs on any given day.”
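
As an illustration of the JSON-to-Parquet shift described above, here is a minimal sketch using pyarrow (the file paths and field names are hypothetical):

```python
import json
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical input: newline-delimited JSON events.
with open("events.json") as f:
    records = [json.loads(line) for line in f]

# Columnar Parquet preserves nested structure and compresses far better
# than raw JSON, which is the core of the migration described above.
table = pa.Table.from_pylist(records)
pq.write_table(table, "events.parquet", compression="snappy")

# Reading back only the columns you need is cheap in Parquet
# ("user_id" and "event_type" are hypothetical fields).
subset = pq.read_table("events.parquet", columns=["user_id", "event_type"])
print(subset.num_rows)
```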

Amazon Redshift Spectrum

Exabyte-Scale In-Place Queries of S3 Data

    related Amazon Redshift posts

Julien DeFrance, Principal Software Engineer at Tophatter, on SmartZip:

Back in 2014, I was given an opportunity to re-architect SmartZip's Analytics platform and flagship product, SmartTargeting. This is SaaS software that helps real estate professionals keep up with their prospects and leads in a given neighborhood/territory, find out (thanks to predictive analytics) who is most likely to list or sell their home, and run cross-channel marketing automation against them: direct mail, online ads, email... The company also provides Data APIs to Enterprise customers.

    I had inherited years and years of technical debt and I knew things had to change radically. The first enabler to this was to make use of the cloud and go with AWS, so we would stop re-inventing the wheel, and build around managed/scalable services.

    For the SaaS product, we kept on working with Rails as this was what my team had the most knowledge in. We've however broken up the monolith and decoupled the front-end application from the backend thanks to the use of Rails API so we'd get independently scalable micro-services from now on.

Our various applications could now be deployed using AWS Elastic Beanstalk, so we wouldn't waste any more effort writing time-consuming Capistrano deployment scripts, for instance. We combined this with Docker so each application would run within its own container, independently of the underlying host configuration.

    Storage-wise, we went with Amazon S3 and ditched any pre-existing local or network storage people used to deal with in our legacy systems. On the database side: Amazon RDS / MySQL initially. Ultimately migrated to Amazon RDS for Aurora / MySQL when it got released. Once again, here you need a managed service your cloud provider handles for you.

    Future improvements / technology decisions included:

- Caching: Amazon ElastiCache / Memcached
- CDN: Amazon CloudFront
- Systems Integration: Segment / Zapier
- Data warehousing: Amazon Redshift
- BI: Amazon Quicksight / Superset
- Search: Elasticsearch / Amazon Elasticsearch Service / Algolia
- Monitoring: New Relic

As our usage grew, patterns changed, and our business needs evolved, my role as Engineering Manager, then Director of Engineering, was also to ensure my team kept on learning and innovating while delivering on business value.

One of these innovations was to get ourselves into serverless: adopting AWS Lambda was a big step forward. At the time, it was only available for Node.js (not Ruby), but it was a great way to handle cost efficiency, unpredictable traffic, and sudden bursts of traffic... Ultimately you want the whole chain of services involved in a call to be serverless, and that's when we started leveraging Amazon DynamoDB on these projects so they'd be fully scalable.
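
A minimal sketch of the Lambda-plus-DynamoDB pattern described above, written in Python (Lambda gained Python support after Node.js; the table and field names here are hypothetical, not SmartZip's):

```python
import boto3

# Hypothetical table name; the table itself is created separately.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("prospect-events")

def handler(event, context):
    """Invoked per request; scales automatically with traffic bursts."""
    table.put_item(
        Item={
            "prospect_id": event["prospect_id"],      # hypothetical fields
            "event_type": event.get("event_type", "unknown"),
            "timestamp": event["timestamp"],
        }
    )
    return {"statusCode": 200}
```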

Ankit Sobti, CTO at Postman:

Looker, Stitch, Amazon Redshift, dbt

We recently moved our Data Analytics and Business Intelligence tooling to Looker. It's already helping us create a solid process for reusable SQL-based data modeling, with consistent definitions across the entire organization. Looker allows us to collaboratively build these version-controlled models and push the limits of what we've traditionally been able to accomplish with analytics with a lean team.

For Data Engineering, we're in the process of moving from maintaining our own ETL pipelines on AWS to a managed ELT system on Stitch. We're also evaluating the command-line tool dbt to manage data transformations. Our hope is that Stitch + dbt will streamline the ELT bit, allowing us to focus our energies on analyzing data rather than managing it.

Cassandra

A partitioned row store. Rows are organized into tables with a required primary key.

    related Cassandra posts

Thierry Schellenbach, CEO at Stream:

1.0 of Stream leveraged Cassandra for storing the feed. Cassandra is a common choice for building feeds. Instagram, for instance, started out with Redis but eventually switched to Cassandra to handle their rapid usage growth. Cassandra can handle write-heavy workloads very efficiently.

    Cassandra is a great tool that allows you to scale write capacity simply by adding more nodes, though it is also very complex. This complexity made it hard to diagnose performance fluctuations. Even though we had years of experience with running Cassandra, it still felt like a bit of a black box. When building Stream 2.0 we decided to go for a different approach and build Keevo. Keevo is our in-house key-value store built upon RocksDB, gRPC and Raft.

RocksDB is a highly performant embeddable database library developed and maintained by Facebook’s data engineering team. RocksDB started as a fork of Google’s LevelDB that introduced several performance improvements for SSDs. Nowadays RocksDB is a project on its own and is under active development. It is written in C++ and it’s fast. Have a look at how this benchmark handles 7 million QPS. In terms of technology it’s much simpler than Cassandra.

    This translates into reduced maintenance overhead, improved performance and, most importantly, more consistent performance. It’s interesting to note that LinkedIn also uses RocksDB for their feed.

    #InMemoryDatabases #DataStores #Databases
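
For context on the write path discussed above, a minimal sketch using the DataStax cassandra-driver (the contact points, keyspace, and table are hypothetical):

```python
from cassandra.cluster import Cluster  # pip install cassandra-driver

# Hypothetical contact points and keyspace.
cluster = Cluster(["10.0.0.1", "10.0.0.2"])
session = cluster.connect("feeds")

# Writes scale horizontally: adding nodes adds write capacity.
insert = session.prepare(
    "INSERT INTO feed_items (user_id, item_id, payload) VALUES (?, ?, ?)"
)
session.execute(insert, ("user-42", "item-1001", '{"verb": "like"}'))
cluster.shutdown()
```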

React, AngularJS, jQuery

Laravel, Zend Framework

MySQL, MongoDB, Cassandra

Docker

Linux

Spectrum

A community platform for the future.

      related Spectrum posts


From a StackShare Community member: “We’re about to start a chat group for our open source project (over 5K stars on GitHub) so we can let our community collaborate more closely. The obvious choice would be Slack (k8s and a ton of major projects use it), but we’ve seen Gitter (webpack uses it) for a lot of open source projects, Discord (Vue.js moved to them), and as of late I’m seeing Spectrum more and more often. Does anyone have experience with these or other alternatives? Is it even worth assessing all these options, or should we just go with Slack? Some things that are important to us: free, all the regular integrations (GitHub, Heroku, etc), mobile & desktop apps, and open source is of course a plus.”

Amazon Quicksight

Fast, easy to use business analytics at 1/10th the cost of traditional BI solutions


      related Elasticsearch posts

Tim Specht, Co-Founder and CTO at Dubsmash:

      Although we were using Elasticsearch in the beginning to power our in-app search, we moved this part of our processing over to Algolia a couple of months ago; this has proven to be a fantastic choice, letting us build search-related features with more confidence and speed.

Elasticsearch is only used for searching in internal tooling nowadays; hosting and running it reliably took up too much of our time in the past, and fine-tuning the results to reach a great user experience was never an easy task for us either. With Algolia we can flexibly change ranking methods on the fly and can instead focus our time on fine-tuning the experience within our app.

Memcached is used in front of most of the API endpoints to cache responses in order to speed up response times and reduce server costs on our side.

      #SearchAsAService
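
A minimal sketch of the cache-aside pattern described above, using pymemcache (the endpoint, key scheme, and fetch function are hypothetical):

```python
import json
from pymemcache.client.base import Client  # pip install pymemcache

cache = Client(("localhost", 11211))  # hypothetical Memcached endpoint

def get_profile(user_id, fetch_from_db):
    """Cache-aside: serve API responses from Memcached when possible."""
    key = f"profile:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    profile = fetch_from_db(user_id)  # fall back to the database on a miss
    cache.set(key, json.dumps(profile), expire=300)  # cache for 5 minutes
    return profile
```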


      related Google BigQuery posts

Tim Specht, Co-Founder and CTO at Dubsmash:

      In order to accurately measure & track user behaviour on our platform we moved over quickly from the initial solution using Google Analytics to a custom-built one due to resource & pricing concerns we had.

      While this does sound complicated, it’s as easy as clients sending JSON blobs of events to Amazon Kinesis from where we use AWS Lambda & Amazon SQS to batch and process incoming events and then ingest them into Google BigQuery. Once events are stored in BigQuery (which usually only takes a second from the time the client sends the data until it’s available), we can use almost-standard-SQL to simply query for data while Google makes sure that, even with terabytes of data being scanned, query times stay in the range of seconds rather than hours. Before ingesting their data into the pipeline, our mobile clients are aggregating events internally and, once a certain threshold is reached or the app is going to the background, sending the events as a JSON blob into the stream.

      In the past we had workers running that continuously read from the stream and would validate and post-process the data and then enqueue them for other workers to write them to BigQuery. We went ahead and implemented the Lambda-based approach in such a way that Lambda functions would automatically be triggered for incoming records, pre-aggregate events, and write them back to SQS, from which we then read them, and persist the events to BigQuery. While this approach had a couple of bumps on the road, like re-triggering functions asynchronously to keep up with the stream and proper batch sizes, we finally managed to get it running in a reliable way and are very happy with this solution today.

      #ServerlessTaskProcessing #GeneralAnalytics #RealTimeDataProcessing #BigDataAsAService
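
A minimal sketch of the Lambda step in this pipeline: decoding Kinesis records, pre-aggregating, and forwarding to SQS via boto3 (the queue URL and event field names are hypothetical; the real pipeline's details are Dubsmash's own):

```python
import base64
import json
from collections import Counter

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/events"  # hypothetical

def handler(event, context):
    """Triggered by Kinesis; batches events before a downstream reader
    persists them to BigQuery."""
    counts = Counter()
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        counts[payload["event_name"]] += 1  # hypothetical field

    # Pre-aggregated batch goes to SQS for the next stage to consume.
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(dict(counts)))
```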


Context: I wanted to create an end-to-end IoT data pipeline simulation in Google Cloud IoT Core and other GCP services. I never touched Terraform meaningfully until working on this project, and it's one of the best explorations in my development career. The documentation and syntax are incredibly human-readable and friendly. I'm used to building infrastructure through the Google APIs via Python, but I'm so glad past Sung did not make that decision. I was tempted to use Google Cloud Deployment Manager, but the templates were a bit convoluted on first impression. I'm glad past Sung did not make this decision either.

Solution: Leveraging Google Cloud Build, Google Cloud Run, Google Cloud Bigtable, Google BigQuery, Google Cloud Storage, and Google Compute Engine along with some other fun tools, I can deploy over 40 GCP resources using Terraform!

      Check Out My Architecture: CLICK ME

      Check out the GitHub repo attached


      related Apache Spark posts

Conor Myhrvold, Tech Brand Mgr, Office of CTO at Uber:

Why we built Marmaray, an open source generic data ingestion and dispersal framework and library for Apache Hadoop:

Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse it to any sink, leveraging the use of Apache Spark. The name, Marmaray, comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:

      https://eng.uber.com/marmaray-hadoop-ingestion-open-source/

(Direct GitHub repo: https://github.com/uber/marmaray)
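
Marmaray itself is a JVM framework, but the any-source-to-any-sink pattern it implements can be sketched with PySpark (the paths and partition column are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest-sketch").getOrCreate()

# Source: raw, nested JSON events (hypothetical path).
events = spark.read.json("s3a://raw-bucket/events/")

# Sink: columnar Parquet in the data lake, partitioned for efficient scans
# ("event_date" is a hypothetical column assumed to exist in the events).
(events
    .write
    .mode("append")
    .partitionBy("event_date")
    .parquet("s3a://lake-bucket/events/"))
```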


      related Apache Flink posts

Surabhi Bhawsar, Technical Architect at Pepcus:

I need to build an Alert & Notification framework using a scheduled program. We will analyze events from a database table, filter events falling within a one-day timespan, and send these event messages over email. Currently, we are using Kafka Pub/Sub for messaging. The customer wants us to move to Apache Flink, and I am trying to understand how Apache Flink could be a better fit for us.
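
For reference, the scheduled approach described here can be sketched in plain Python (the table, SMTP host, and recipients are hypothetical); Flink would replace this periodic polling with continuous stream processing:

```python
import smtplib
import sqlite3
from datetime import datetime, timedelta
from email.message import EmailMessage

def send_daily_alerts():
    """Scheduled job: pull the last day's events and email them."""
    conn = sqlite3.connect("events.db")  # hypothetical event store
    cutoff = (datetime.utcnow() - timedelta(days=1)).isoformat()
    rows = conn.execute(
        "SELECT id, message FROM events WHERE created_at >= ?", (cutoff,)
    ).fetchall()
    conn.close()
    if not rows:
        return  # nothing to alert on

    msg = EmailMessage()
    msg["Subject"] = f"{len(rows)} events in the last 24 hours"
    msg["From"] = "alerts@example.com"
    msg["To"] = "oncall@example.com"
    msg.set_content("\n".join(f"{rid}: {text}" for rid, text in rows))

    with smtplib.SMTP("smtp.example.com") as server:  # hypothetical SMTP host
        server.send_message(msg)
```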

Apache Hive

Data Warehouse Software for Reading, Writing, and Managing Large Datasets

Druid

Fast column-oriented distributed data store

Apache Impala

Real-time Query for Hadoop

AWS Glue

Fully managed extract, transform, and load (ETL) service

Talend

A single, unified suite for all integration needs

Apache Kudu

Fast Analytics on Fast Data. A columnar storage manager developed for the Hadoop platform

Vertica

Storage platform designed to handle large volumes of data

Apache Parquet

A free and open-source column-oriented data storage format