Alternatives to s3-lambda logo

Alternatives to s3-lambda

Amazon EC2, Apache Spark, Amazon Athena, Apache Flink, and Apache Hive are the most popular alternatives and competitors to s3-lambda.
3
28
+ 1
0

What is s3-lambda and what are its top alternatives?

s3-lambda enables you to run lambda functions over a context of S3 objects. It has a stateless architecture with concurrency control, allowing you to process a large number of files very quickly. This is useful for quickly prototyping complex data jobs without an infrastructure like Hadoop or Spark.
s3-lambda is a tool in the Big Data Tools category of a tech stack.
s3-lambda is an open source tool with 1.1K GitHub stars and 44 GitHub forks. Here鈥檚 a link to s3-lambda's open source repository on GitHub

s3-lambda alternatives & related posts

related Amazon EC2 posts

Ashish Singh
Ashish Singh
Tech Lead, Big Data Platform at Pinterest | 27 upvotes 241.8K views
Apache Hive
Apache Hive
Presto
Presto
Amazon EC2
Amazon EC2
Amazon S3
Amazon S3
Kafka
Kafka
Kubernetes
Kubernetes
#DataScience
#DataEngineering
#AWS
#BigData

To provide employees with the critical need of interactive querying, we鈥檝e worked with Presto, an open-source distributed SQL query engine, over the years. Operating Presto at Pinterest鈥檚 scale has involved resolving quite a few challenges like, supporting deeply nested and huge thrift schemas, slow/ bad worker detection and remediation, auto-scaling cluster, graceful cluster shutdown and impersonation support for ldap authenticator.

Our infrastructure is built on top of Amazon EC2 and we leverage Amazon S3 for storing our data. This separates compute and storage layers, and allows multiple compute clusters to share the S3 data.

We have hundreds of petabytes of data and tens of thousands of Apache Hive tables. Our Presto clusters are comprised of a fleet of 450 r4.8xl EC2 instances. Presto clusters together have over 100 TBs of memory and 14K vcpu cores. Within Pinterest, we have close to more than 1,000 monthly active users (out of total 1,600+ Pinterest employees) using Presto, who run about 400K queries on these clusters per month.

Each query submitted to Presto cluster is logged to a Kafka topic via Singer. Singer is a logging agent built at Pinterest and we talked about it in a previous post. Each query is logged when it is submitted and when it finishes. When a Presto cluster crashes, we will have query submitted events without corresponding query finished events. These events enable us to capture the effect of cluster crashes over time.

Each Presto cluster at Pinterest has workers on a mix of dedicated AWS EC2 instances and Kubernetes pods. Kubernetes platform provides us with the capability to add and remove workers from a Presto cluster very quickly. The best-case latency on bringing up a new worker on Kubernetes is less than a minute. However, when the Kubernetes cluster itself is out of resources and needs to scale up, it can take up to ten minutes. Some other advantages of deploying on Kubernetes platform is that our Presto deployment becomes agnostic of cloud vendor, instance types, OS, etc.

#BigData #AWS #DataScience #DataEngineering

See more
John-Daniel Trask
John-Daniel Trask
Co-founder & CEO at Raygun | 19 upvotes 172K views
atRaygunRaygun
Amazon S3
Amazon S3
Amazon RDS
Amazon RDS
nginx
nginx
Amazon EC2
Amazon EC2
AWS Elastic Load Balancing (ELB)
AWS Elastic Load Balancing (ELB)
#CloudHosting
#WebServers
#CloudStorage
#LoadBalancerReverseProxy

We chose AWS because, at the time, it was really the only cloud provider to choose from.

We tend to use their basic building blocks (EC2, ELB, Amazon S3, Amazon RDS) rather than vendor specific components like databases and queuing. We deliberately decided to do this to ensure we could provide multi-cloud support or potentially move to another cloud provider if the offering was better for our customers.

We鈥檝e utilized c3.large nodes for both the Node.js deployment and then for the .NET Core deployment. Both sit as backends behind an nginx instance and are managed using scaling groups in Amazon EC2 sitting behind a standard AWS Elastic Load Balancing (ELB).

While we鈥檙e satisfied with AWS, we do review our decision each year and have looked at Azure and Google Cloud offerings.

#CloudHosting #WebServers #CloudStorage #LoadBalancerReverseProxy

See more

related Apache Spark posts

Eric Colson
Eric Colson
Chief Algorithms Officer at Stitch Fix | 19 upvotes 896.6K views
atStitch FixStitch Fix
Kafka
Kafka
PostgreSQL
PostgreSQL
Amazon S3
Amazon S3
Apache Spark
Apache Spark
Presto
Presto
Python
Python
R Language
R Language
PyTorch
PyTorch
Docker
Docker
Amazon EC2 Container Service
Amazon EC2 Container Service
#AWS
#Etl
#ML
#DataScience
#DataStack
#Data

The algorithms and data infrastructure at Stitch Fix is housed in #AWS. Data acquisition is split between events flowing through Kafka, and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on Yarn is our tool of choice for data movement and #ETL. Because our storage layer (s3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling Yarn clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for adhoc queries and dashboards.

Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).

At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated systems. That requires serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize those models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn), by automatically packaging them as Docker containers and deploying to Amazon ECS. This provides our data scientist a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.

For more info:

#DataScience #DataStack #Data

See more
Conor Myhrvold
Conor Myhrvold
Tech Brand Mgr, Office of CTO at Uber | 7 upvotes 456.6K views
atUber TechnologiesUber Technologies
Kafka
Kafka
Kafka Manager
Kafka Manager
Hadoop
Hadoop
Apache Spark
Apache Spark
GitHub
GitHub

Why we built Marmaray, an open source generic data ingestion and dispersal framework and library for Apache Hadoop :

Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse to any sink leveraging the use of Apache Spark . The name, Marmaray, comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:

https://eng.uber.com/marmaray-hadoop-ingestion-open-source/

(Direct GitHub repo: https://github.com/uber/marmaray Kafka Kafka Manager )

See more

related Amazon Athena posts

Amazon Athena
Amazon Athena
Google BigQuery
Google BigQuery

I use Amazon Athena because similar to Google BigQuery , you can store and query data easily. Especially since you can define data schema in the Glue data catalog, there's a central way to define data models.

However, I would not recommend for batch jobs. I typically use this to check intermediary datasets in data engineering workloads. It's good for getting a look and feel of the data along its ETL journey.

See more
Google BigQuery
Google BigQuery
Amazon Redshift
Amazon Redshift
Amazon Athena
Amazon Athena
Amazon S3
Amazon S3

Hi all,

Currently, we need to ingest the data from Amazon S3 to DB either Amazon Athena or Amazon Redshift. But the problem with the data is, it is in .PSV (pipe separated values) format and the size is also above 200 GB. The query performance of the timeout in Athena/Redshift is not up to the mark, too slow while compared to Google BigQuery. How would I optimize the performance and query result time? Can anyone please help me out?

See more
Apache Flink logo

Apache Flink

207
275
11
207
275
+ 1
11
Fast and reliable large-scale data processing engine
Apache Flink logo
Apache Flink
VS
s3-lambda logo
s3-lambda

related Apache Flink posts

Surabhi Bhawsar
Surabhi Bhawsar
Technical Architect at Pepcus | 6 upvotes 371.2K views
Kafka
Kafka
Apache Flink
Apache Flink

I need to build the Alert & Notification framework with the use of a scheduled program. We will analyze the events from the database table and filter events that are falling under a day timespan and send these event messages over email. Currently, we are using Kafka Pub/Sub for messaging. The customer wants us to move on Apache Flink, I am trying to understand how Apache Flink could be fit better for us.

See more
Apache Hive logo

Apache Hive

172
129
0
172
129
+ 1
0
Data Warehouse Software for Reading, Writing, and Managing Large Datasets
    Be the first to leave a pro
    Apache Hive logo
    Apache Hive
    VS
    s3-lambda logo
    s3-lambda

    related Apache Hive posts

    Ashish Singh
    Ashish Singh
    Tech Lead, Big Data Platform at Pinterest | 27 upvotes 241.8K views
    Apache Hive
    Apache Hive
    Presto
    Presto
    Amazon EC2
    Amazon EC2
    Amazon S3
    Amazon S3
    Kafka
    Kafka
    Kubernetes
    Kubernetes
    #DataScience
    #DataEngineering
    #AWS
    #BigData

    To provide employees with the critical need of interactive querying, we鈥檝e worked with Presto, an open-source distributed SQL query engine, over the years. Operating Presto at Pinterest鈥檚 scale has involved resolving quite a few challenges like, supporting deeply nested and huge thrift schemas, slow/ bad worker detection and remediation, auto-scaling cluster, graceful cluster shutdown and impersonation support for ldap authenticator.

    Our infrastructure is built on top of Amazon EC2 and we leverage Amazon S3 for storing our data. This separates compute and storage layers, and allows multiple compute clusters to share the S3 data.

    We have hundreds of petabytes of data and tens of thousands of Apache Hive tables. Our Presto clusters are comprised of a fleet of 450 r4.8xl EC2 instances. Presto clusters together have over 100 TBs of memory and 14K vcpu cores. Within Pinterest, we have close to more than 1,000 monthly active users (out of total 1,600+ Pinterest employees) using Presto, who run about 400K queries on these clusters per month.

    Each query submitted to Presto cluster is logged to a Kafka topic via Singer. Singer is a logging agent built at Pinterest and we talked about it in a previous post. Each query is logged when it is submitted and when it finishes. When a Presto cluster crashes, we will have query submitted events without corresponding query finished events. These events enable us to capture the effect of cluster crashes over time.

    Each Presto cluster at Pinterest has workers on a mix of dedicated AWS EC2 instances and Kubernetes pods. Kubernetes platform provides us with the capability to add and remove workers from a Presto cluster very quickly. The best-case latency on bringing up a new worker on Kubernetes is less than a minute. However, when the Kubernetes cluster itself is out of resources and needs to scale up, it can take up to ten minutes. Some other advantages of deploying on Kubernetes platform is that our Presto deployment becomes agnostic of cloud vendor, instance types, OS, etc.

    #BigData #AWS #DataScience #DataEngineering

    See more
    Druid logo

    Druid

    163
    283
    20
    163
    283
    + 1
    20
    Fast column-oriented distributed data store
    Druid logo
    Druid
    VS
    s3-lambda logo
    s3-lambda
    Presto logo

    Presto

    159
    358
    51
    159
    358
    + 1
    51
    Distributed SQL Query Engine for Big Data
    Presto logo
    Presto
    VS
    s3-lambda logo
    s3-lambda

    related Presto posts

    Ashish Singh
    Ashish Singh
    Tech Lead, Big Data Platform at Pinterest | 27 upvotes 241.8K views
    Apache Hive
    Apache Hive
    Presto
    Presto
    Amazon EC2
    Amazon EC2
    Amazon S3
    Amazon S3
    Kafka
    Kafka
    Kubernetes
    Kubernetes
    #DataScience
    #DataEngineering
    #AWS
    #BigData

    To provide employees with the critical need of interactive querying, we鈥檝e worked with Presto, an open-source distributed SQL query engine, over the years. Operating Presto at Pinterest鈥檚 scale has involved resolving quite a few challenges like, supporting deeply nested and huge thrift schemas, slow/ bad worker detection and remediation, auto-scaling cluster, graceful cluster shutdown and impersonation support for ldap authenticator.

    Our infrastructure is built on top of Amazon EC2 and we leverage Amazon S3 for storing our data. This separates compute and storage layers, and allows multiple compute clusters to share the S3 data.

    We have hundreds of petabytes of data and tens of thousands of Apache Hive tables. Our Presto clusters are comprised of a fleet of 450 r4.8xl EC2 instances. Presto clusters together have over 100 TBs of memory and 14K vcpu cores. Within Pinterest, we have close to more than 1,000 monthly active users (out of total 1,600+ Pinterest employees) using Presto, who run about 400K queries on these clusters per month.

    Each query submitted to Presto cluster is logged to a Kafka topic via Singer. Singer is a logging agent built at Pinterest and we talked about it in a previous post. Each query is logged when it is submitted and when it finishes. When a Presto cluster crashes, we will have query submitted events without corresponding query finished events. These events enable us to capture the effect of cluster crashes over time.

    Each Presto cluster at Pinterest has workers on a mix of dedicated AWS EC2 instances and Kubernetes pods. Kubernetes platform provides us with the capability to add and remove workers from a Presto cluster very quickly. The best-case latency on bringing up a new worker on Kubernetes is less than a minute. However, when the Kubernetes cluster itself is out of resources and needs to scale up, it can take up to ten minutes. Some other advantages of deploying on Kubernetes platform is that our Presto deployment becomes agnostic of cloud vendor, instance types, OS, etc.

    #BigData #AWS #DataScience #DataEngineering

    See more
    Eric Colson
    Eric Colson
    Chief Algorithms Officer at Stitch Fix | 19 upvotes 896.6K views
    atStitch FixStitch Fix
    Kafka
    Kafka
    PostgreSQL
    PostgreSQL
    Amazon S3
    Amazon S3
    Apache Spark
    Apache Spark
    Presto
    Presto
    Python
    Python
    R Language
    R Language
    PyTorch
    PyTorch
    Docker
    Docker
    Amazon EC2 Container Service
    Amazon EC2 Container Service
    #AWS
    #Etl
    #ML
    #DataScience
    #DataStack
    #Data

    The algorithms and data infrastructure at Stitch Fix is housed in #AWS. Data acquisition is split between events flowing through Kafka, and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on Yarn is our tool of choice for data movement and #ETL. Because our storage layer (s3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling Yarn clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for adhoc queries and dashboards.

    Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).

    At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated systems. That requires serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize those models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn), by automatically packaging them as Docker containers and deploying to Amazon ECS. This provides our data scientist a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.

    For more info:

    #DataScience #DataStack #Data

    See more
    AWS Glue logo

    AWS Glue

    110
    191
    0
    110
    191
    + 1
    0
    Fully managed extract, transform, and load (ETL) service
      Be the first to leave a pro
      AWS Glue logo
      AWS Glue
      VS
      s3-lambda logo
      s3-lambda