AWS Data Pipeline vs Elasticsearch

AWS Data Pipeline vs Elasticsearch: What are the differences?

What is AWS Data Pipeline? Process and move data between different AWS compute and storage services. AWS Data Pipeline is a web service that provides a simple management system for data-driven workflows. Using AWS Data Pipeline, you define a pipeline composed of the “data sources” that contain your data, the “activities” or business logic such as EMR jobs or SQL queries, and the “schedule” on which your business logic executes. For example, you could define a job that, every hour, runs an Amazon Elastic MapReduce (Amazon EMR)–based analysis on that hour’s Amazon Simple Storage Service (Amazon S3) log data, loads the results into a relational database for future lookup, and then automatically sends you a daily summary email.

What is Elasticsearch? Open Source, Distributed, RESTful Search Engine. Elasticsearch is a distributed, RESTful search and analytics engine capable of storing data and searching it in near real time. Elasticsearch, Kibana, Beats and Logstash are the Elastic Stack (sometimes called the ELK Stack).

AWS Data Pipeline can be classified as a tool in the "Data Transfer" category, while Elasticsearch is grouped under "Search as a Service".

Some of the features offered by AWS Data Pipeline are:

  • You can find (and use) a variety of popular AWS Data Pipeline tasks in the AWS Management Console’s template section.
  • Hourly analysis of Amazon S3-based log data
  • Daily replication of Amazon DynamoDB data to Amazon S3
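To make the pipeline model concrete, here is a minimal sketch of creating and activating such a pipeline through the boto3 SDK. The pipeline name, schedule, and fields are illustrative; a real definition would add S3 data nodes, an activity such as an EmrActivity, and IAM roles.

```python
# Minimal sketch: define and activate an AWS Data Pipeline with boto3.
# All names and values here are illustrative, not a production definition.
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

# 1. Create an empty pipeline shell (uniqueId guards against duplicates).
created = dp.create_pipeline(name="hourly-log-analysis",
                             uniqueId="hourly-log-analysis-v1")
pipeline_id = created["pipelineId"]

# 2. Express the "schedule" (and, in a full definition, the "data sources"
#    and "activities") as pipeline objects made of key/value fields.
objects = [
    {"id": "Default", "name": "Default", "fields": [
        {"key": "scheduleType", "stringValue": "cron"},
        {"key": "schedule", "refValue": "HourlySchedule"},
    ]},
    {"id": "HourlySchedule", "name": "HourlySchedule", "fields": [
        {"key": "type", "stringValue": "Schedule"},
        {"key": "period", "stringValue": "1 Hour"},
        {"key": "startDateTime", "stringValue": "2024-01-01T00:00:00"},
    ]},
    # ...S3DataNode inputs/outputs and an EmrActivity would follow the same shape.
]
dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)

# 3. Activate: the service now runs the business logic on every scheduled hour.
dp.activate_pipeline(pipelineId=pipeline_id)
```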

On the other hand, Elasticsearch provides the following key features:

  • Distributed and highly available search engine
  • Multi-tenant, with support for multiple types
  • A broad set of APIs, including a RESTful HTTP API
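That RESTful API is the primary way applications talk to Elasticsearch: indexing, searching, and administration are all plain HTTP plus JSON. A minimal sketch, assuming an unsecured local node on localhost:9200 and an illustrative index name:

```python
# Minimal sketch of Elasticsearch's RESTful interface using plain HTTP.
# Assumes an unsecured node at localhost:9200; the index name is illustrative.
import requests

ES = "http://localhost:9200"

# Index a document; it becomes searchable in near real time.
requests.post(f"{ES}/articles/_doc",
              json={"title": "Comparing data tools", "views": 42})

# Force a refresh so the search below sees the document immediately.
requests.post(f"{ES}/articles/_refresh")

# Full-text search through the same HTTP interface.
resp = requests.post(f"{ES}/articles/_search",
                     json={"query": {"match": {"title": "data"}}})
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"])
```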

Elasticsearch is an open source tool with 42.4K GitHub stars and 14.2K GitHub forks; its source lives in the elastic/elasticsearch repository on GitHub. AWS Data Pipeline, by contrast, is a proprietary managed service with no public repository.


    What are some alternatives to AWS Data Pipeline and Elasticsearch?
    AWS Glue
    A fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics.
    Airflow
    Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command-line utilities make performing complex surgeries on DAGs a snap, and the rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed (see the DAG sketch after this list).
    Apache NiFi
    An easy to use, powerful, and reliable system to process and distribute data. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.
    AWS Step Functions
    AWS Step Functions makes it easy to coordinate the components of distributed applications and microservices using visual workflows. Building applications from individual components that each perform a discrete function lets you scale and change applications quickly.
    AWS Batch
    It enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. It dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory optimized instances) based on the volume and specific resource requirements of the batch jobs submitted.
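    To give a feel for Airflow's DAG-as-code model mentioned above, here is a minimal sketch in the Airflow 2.x style; the DAG ID, tasks, and shell commands are illustrative.

```python
# Minimal sketch of an Airflow DAG (Airflow 2.x style); names are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hourly_log_analysis",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",  # plays the role of Data Pipeline's "schedule"
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract_logs", bash_command="echo extract")
    analyze = BashOperator(task_id="analyze_logs", bash_command="echo analyze")

    extract >> analyze  # the dependency arrows form the directed acyclic graph
```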
    Decisions about AWS Data Pipeline and Elasticsearch
    Tim Specht, Co-Founder and CTO at Dubsmash, on Memcached, Algolia, and Elasticsearch (#SearchAsAService):

    Although we were using Elasticsearch in the beginning to power our in-app search, we moved this part of our processing over to Algolia a couple of months ago; this has proven to be a fantastic choice, letting us build search-related features with more confidence and speed.

    Elasticsearch is only used for searching in internal tooling nowadays; hosting and running it reliably took up too much of our time in the past, and fine-tuning the results to reach a great user experience was never easy for us either. With Algolia we can flexibly change ranking methods on the fly and can instead focus our time on fine-tuning the experience within our app.
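    As an illustration of that "change ranking on the fly" workflow, here is a hedged sketch using Algolia's Python API client (v2-style calls); the app ID, API key, index name, and ranking attributes are placeholders, not Dubsmash's actual configuration.

```python
# Hedged sketch: re-ranking an Algolia index without re-indexing the data.
# Credentials, index name, and attributes below are placeholders.
from algoliasearch.search_client import SearchClient

client = SearchClient.create("YOUR_APP_ID", "YOUR_ADMIN_API_KEY")
index = client.init_index("videos")

# Change the ranking method on the fly: surface well-liked, recent items first.
index.set_settings({"customRanking": ["desc(likes)", "desc(created_at)"]})

results = index.search("funny dubs")
print([hit["objectID"] for hit in results["hits"]])
```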

    Memcached is used in front of most of the API endpoints to cache responses, in order to speed up response times and reduce server costs on our side.
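    A minimal cache-aside sketch of that pattern, assuming pymemcache and a local memcached node; the endpoint, key scheme, TTL, and the load_profile_from_db helper are all hypothetical.

```python
# Hypothetical cache-aside wrapper for an API endpoint using pymemcache.
import json

from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def get_profile(user_id: int) -> dict:
    key = f"profile:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: skip the backend entirely
    profile = load_profile_from_db(user_id)  # hypothetical database helper
    cache.set(key, json.dumps(profile), expire=300)  # cache for five minutes
    return profile
```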

    Julien DeFrance, Full Stack Engineering Manager at ValiMail, on re-architecting the stack at SmartZip:

    Back in 2014, I was given an opportunity to re-architect the SmartZip Analytics platform and its flagship product, SmartTargeting. This is SaaS software that helps real estate professionals keep up with their prospects and leads in a given neighborhood/territory, find out (thanks to predictive analytics) who is most likely to list or sell their home, and run cross-channel marketing automation against them: direct mail, online ads, email... The company also provides Data APIs to Enterprise customers.

    I had inherited years and years of technical debt and I knew things had to change radically. The first enabler to this was to make use of the cloud and go with AWS, so we would stop re-inventing the wheel, and build around managed/scalable services.

    For the SaaS product, we kept working with Rails, as this was what my team had the most knowledge in. However, we broke up the monolith and decoupled the front-end application from the backend using Rails API, so that from then on we would have independently scalable micro-services.

    Our various applications could now be deployed using AWS Elastic Beanstalk, so we wouldn't waste any more effort writing time-consuming Capistrano deployment scripts. We combined this with Docker, so each application would run within its own container, independently of the underlying host configuration.

    Storage-wise, we went with Amazon S3 and ditched any pre-existing local or network storage people used to deal with in our legacy systems. On the database side: Amazon RDS / MySQL initially, ultimately migrating to Amazon RDS for Aurora / MySQL when it was released. Once again, you want a managed service your cloud provider handles for you.

    Future improvements / technology decisions included:

    • Caching: Amazon ElastiCache / Memcached
    • CDN: Amazon CloudFront
    • Systems integration: Segment / Zapier
    • Data warehousing: Amazon Redshift
    • BI: Amazon Quicksight / Superset
    • Search: Elasticsearch / Amazon Elasticsearch Service / Algolia
    • Monitoring: New Relic

    As our usage grows, patterns changed, and/or our business needs evolved, my role as Engineering Manager then Director of Engineering was also to ensure my team kept on learning and innovating, while delivering on business value.

    One of these innovations was to get ourselves into serverless: adopting AWS Lambda was a big step forward. At the time it was only available for Node.js (not Ruby), but it was a great way to handle cost efficiency, unpredictable traffic, and sudden bursts of traffic. Ultimately you want the whole chain of services involved in a call to be serverless, and that's when we started leveraging Amazon DynamoDB on these projects so they'd be fully scalable.

    How developers use AWS Data Pipeline and Elasticsearch
imgur uses Elasticsearch

    Elasticsearch is the engine that powers search on the site. From a high level perspective, it’s a Lucene wrapper that exposes Lucene’s features via a RESTful API. It handles the distribution of data and simplifies scaling, among other things.

Given that we are on AWS, we use an AWS cloud plugin for Elasticsearch that makes it easy to work in the cloud. It allows us to add nodes without much hassle: the plugin figures out when a new node has joined the cluster, and Elasticsearch then proceeds to move data to that new node. It works the same way when a node goes down, removing that node based on the AWS cluster configuration.
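That join/leave behavior is easy to observe through the REST API itself; a quick sketch, assuming an unsecured node on localhost:9200:

```python
# Watch nodes join or leave the cluster (assumes localhost:9200, no auth).
import requests

# _cat/nodes lists every node the cluster currently knows about.
print(requests.get("http://localhost:9200/_cat/nodes?v").text)

# _cluster/health reports relocating shards while data moves to a new node.
print(requests.get("http://localhost:9200/_cluster/health?pretty").text)
```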

Instacart uses Elasticsearch

The very first version of search was just a Postgres database query, and it wasn't terribly efficient. At some point we moved over to Elasticsearch, and Andrew has since done a lot of work with it. Elasticsearch is amazing, but out of the box it doesn't come configured with all the nice things that are there; you spend a lot of time figuring out how to put it all together to add stemming, auto-suggestions, and all kinds of different things, even spelling adjustments (tomato/tomatoes would otherwise return different results). Andrew did a ton of work to make it really, really nice and built a very simple Ruby gem called SearchKick.
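As a taste of that "not configured out of the box" point, here is a hedged sketch of opting in to stemming with Elasticsearch's built-in english analyzer, so tomato/tomatoes match the same documents. The index and field names are illustrative, and this is not SearchKick's actual implementation.

```python
# Sketch: enable stemming via the built-in "english" analyzer (names illustrative).
import requests

ES = "http://localhost:9200"

# Create an index whose "name" field stems plurals at index and query time.
requests.put(f"{ES}/groceries", json={
    "mappings": {"properties": {"name": {"type": "text", "analyzer": "english"}}}
})

requests.post(f"{ES}/groceries/_doc?refresh=true", json={"name": "ripe tomatoes"})

# Searching the singular form now matches the plural document.
resp = requests.post(f"{ES}/groceries/_search",
                     json={"query": {"match": {"name": "tomato"}}})
print(resp.json()["hits"]["total"])
```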

AngeloR uses Elasticsearch

We use Elasticsearch for:

    • Session Logs
    • Analytics
    • Leaderboards

We originally self-managed the Elasticsearch clusters, but due to our small ops team size we opted to move things to managed AWS services where possible.

The managed servers, however, do not allow us to manage our own backups, and a restore actually requires us to open a support ticket with them. We ended up setting up our own nightly backup, since we keep per-day indexes for the logs/analytics.
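A minimal sketch of that kind of nightly job using Elasticsearch's snapshot API, assuming a snapshot repository named "nightly" has already been registered; the per-day index naming is illustrative.

```python
# Hedged sketch: snapshot yesterday's per-day indexes into a registered repo.
from datetime import date, timedelta

import requests

ES = "http://localhost:9200"
day = (date.today() - timedelta(days=1)).strftime("%Y.%m.%d")

resp = requests.put(
    f"{ES}/_snapshot/nightly/logs-{day}",             # one snapshot per night
    params={"wait_for_completion": "true"},
    json={"indices": f"logs-{day},analytics-{day}"},  # only yesterday's indexes
)
print(resp.json())
```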

Brandon Adams uses Elasticsearch

Elasticsearch has good tooling and exposes a large API, which makes it ideal for denormalizing data. It has a simple-to-use aggregations API that tends to cover most of what I need a BI tool to do, especially in the early going (when paired with Kibana). It's also handy when you just want to search some text.
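For example, a single aggregations query can stand in for a simple BI rollup; a sketch, assuming a local node and illustrative index/field names:

```python
# Sketch: average order value per status, computed by the aggregations API.
import requests

resp = requests.post("http://localhost:9200/orders/_search", json={
    "size": 0,  # skip individual hits; we only want the rollup
    "aggs": {
        "by_status": {
            "terms": {"field": "status"},  # bucket by a keyword field
            "aggs": {"avg_total": {"avg": {"field": "total"}}},
        }
    },
})
for bucket in resp.json()["aggregations"]["by_status"]["buckets"]:
    print(bucket["key"], bucket["avg_total"]["value"])
```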

Ana Phi Sancho uses Elasticsearch

Self-taught: knowledge acquired on one's own initiative. Elasticsearch is open source search and analytics: a near real-time search and analytics engine based on Lucene. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.
