What is CloudAMQP and what are its top alternatives?
CloudAMQP is a managed RabbitMQ service that allows users to easily create and manage RabbitMQ message brokers in the cloud. Key features include automatic scaling, monitoring, and backups. However, some limitations of CloudAMQP include limited customization options and potential costs associated with scaling.
RabbitMQ on Amazon Web Services: Amazon Web Services offers a fully managed RabbitMQ service with features such as high availability, monitoring, and scaling. Pros include seamless integration with AWS services, but cons may include potentially higher costs.
Google Cloud Pub/Sub: Google Cloud Pub/Sub is a fully managed messaging service that allows for scalable event-driven applications. Key features include real-time messaging and global availability. Pros include seamless integration with other Google Cloud services, while cons may include potential limitations on message retention policies.
IBM Message Hub: IBM Message Hub is a fully managed messaging service on the IBM Cloud platform. Features include support for Apache Kafka, message routing, and monitoring capabilities. Pros include integration with other IBM Cloud services, but potential cons may include limitations on message throughput.
Kafka on Azure: Microsoft Azure offers a fully managed Apache Kafka service with features such as high throughput, data retention, and scalability. Pros include seamless integration with Azure services, while cons may include potential learning curve for managing Kafka clusters.
ActiveMQ: Apache ActiveMQ is an open-source messaging broker that supports multiple messaging protocols. Key features include message persistence, clustering, and high availability. Pros include flexibility and customization options, but cons may include the need for manual management.
Mosquitto: Eclipse Mosquitto is an open-source MQTT broker that is lightweight and suitable for Internet of Things (IoT) applications. Features include low resource usage, TLS encryption, and message persistence. Pros include ease of use and scalability, while cons may include potential limitations on advanced features.
IronMQ: IronMQ is a cloud-based message queue service that offers features like message retries, delayed messages, and multi-cloud support. Pros include ease of use and scalability across multiple cloud providers, while cons may include potential costs based on usage.
KubeMQ: KubeMQ is a Kubernetes-native message broker that offers features like high availability, multi-tenancy, and support for various messaging patterns. Pros include efficient resource utilization within Kubernetes clusters, while cons may include potential complexity in managing Kubernetes resources.
NATS: NATS is a lightweight and high-performance messaging system that supports pub/sub and request/reply patterns. Key features include message streaming, clustering, and security. Pros include high performance and simplicity, while cons may include potential limitations on advanced features.
Azure Service Bus: Azure Service Bus is a fully managed messaging service on Microsoft Azure that supports both queues and topics. Features include message ordering, sessions, and dead-lettering. Pros include seamless integration with Azure services, while cons may include potential costs based on usage.
Top Alternatives to CloudAMQP
- RabbitMQ
RabbitMQ gives your applications a common platform to send and receive messages, and your messages a safe place to live until received. ...
- IronMQ
An easy-to-use highly available message queuing service. Built for distributed cloud applications with critical messaging needs. Provides on-demand message queuing with advanced features and cloud-optimized performance. ...
- Kafka
Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design. ...
- MySQL
The MySQL software delivers a very fast, multi-threaded, multi-user, and robust SQL (Structured Query Language) database server. MySQL Server is intended for mission-critical, heavy-load production systems as well as for embedding into mass-deployed software. ...
- PostgreSQL
PostgreSQL is an advanced object-relational database management system that supports an extended subset of the SQL standard, including transactions, foreign keys, subqueries, triggers, user-defined types and functions. ...
- MongoDB
MongoDB stores data in JSON-like documents that can vary in structure, offering a dynamic, flexible schema. MongoDB was also designed for high availability and scalability, with built-in replication and auto-sharding. ...
- Redis
Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache, and message broker. Redis provides data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes, and streams. ...
- Amazon S3
Amazon Simple Storage Service provides a fully redundant data storage infrastructure for storing and retrieving any amount of data, at any time, from anywhere on the web ...
CloudAMQP alternatives & related posts
- It's fast and it works with good metrics/monitoring235
- Ease of configuration80
- I like the admin interface60
- Easy to set-up and start with52
- Durable22
- Standard protocols19
- Intuitive work through python19
- Written primarily in Erlang11
- Simply superb9
- Completeness of messaging patterns7
- Reliable4
- Scales to 1 million messages per second4
- Better than most traditional queue based message broker3
- Distributed3
- Supports MQTT3
- Supports AMQP3
- Clear documentation with different scripting language2
- Better routing system2
- Inubit Integration2
- Great ui2
- High performance2
- Reliability2
- Open-source2
- Runs on Open Telecom Platform2
- Clusterable2
- Delayed messages2
- Supports Streams1
- Supports STOMP1
- Supports JMS1
- Too complicated cluster/HA config and management9
- Needs Erlang runtime. Need ops good with Erlang runtime6
- Configuration must be done first, not by your code5
- Slow4
related RabbitMQ posts
As Sentry runs throughout the day, there are about 50 different offline tasks that we execute—anything from “process this event, pretty please” to “send all of these cool people some emails.” There are some that we execute once a day and some that execute thousands per second.
Managing this variety requires a reliably high-throughput message-passing technology. We use Celery's RabbitMQ implementation, and we stumbled upon a great feature called Federation that allows us to partition our task queue across any number of RabbitMQ servers and gives us the confidence that, if any single server gets backlogged, others will pitch in and distribute some of the backlogged tasks to their consumers.
#MessageQueue
Hi, I am building an enhanced web-conferencing app that will have a voice/video call, live chats, live notifications, live discussions, screen sharing, etc features. Ref: Zoom.
I need advise finalizing the tech stack for this app. I am considering below tech stack:
- Frontend: React
- Backend: Node.js
- Database: MongoDB
- IAAS: #AWS
- Containers & Orchestration: Docker / Kubernetes
- DevOps: GitLab, Terraform
- Brokers: Redis / RabbitMQ
I need advice at the platform level as to what could be considered to support concurrent video streaming seamlessly.
Also, please suggest what could be a better tech stack for my app?
#SAAS #VideoConferencing #WebAndVideoConferencing #zoom #stack
- Great Support12
- Heroku Add-on8
- Push support3
- Delayed delivery upto 7 days3
- Super fast2
- Language agnostic2
- Good analytics/monitoring2
- Ease of configuration2
- GDPR Compliant2
- Can't use rabbitmqadmin1
related IronMQ posts
- High-throughput126
- Distributed119
- Scalable92
- High-Performance86
- Durable66
- Publish-Subscribe38
- Simple-to-use19
- Open source18
- Written in Scala and java. Runs on JVM12
- Message broker + Streaming system9
- KSQL4
- Avro schema integration4
- Robust4
- Suport Multiple clients3
- Extremely good parallelism constructs2
- Partioned, replayable log2
- Simple publisher / multi-subscriber model1
- Fun1
- Flexible1
- Non-Java clients are second-class citizens32
- Needs Zookeeper29
- Operational difficulties9
- Terrible Packaging5
related Kafka posts
When I joined NYT there was already broad dissatisfaction with the LAMP (Linux Apache HTTP Server MySQL PHP) Stack and the front end framework, in particular. So, I wasn't passing judgment on it. I mean, LAMP's fine, you can do good work in LAMP. It's a little dated at this point, but it's not ... I didn't want to rip it out for its own sake, but everyone else was like, "We don't like this, it's really inflexible." And I remember from being outside the company when that was called MIT FIVE when it had launched. And been observing it from the outside, and I was like, you guys took so long to do that and you did it so carefully, and yet you're not happy with your decisions. Why is that? That was more the impetus. If we're going to do this again, how are we going to do it in a way that we're gonna get a better result?
So we're moving quickly away from LAMP, I would say. So, right now, the new front end is React based and using Apollo. And we've been in a long, protracted, gradual rollout of the core experiences.
React is now talking to GraphQL as a primary API. There's a Node.js back end, to the front end, which is mainly for server-side rendering, as well.
Behind there, the main repository for the GraphQL server is a big table repository, that we call Bodega because it's a convenience store. And that reads off of a Kafka pipeline.
To provide employees with the critical need of interactive querying, we’ve worked with Presto, an open-source distributed SQL query engine, over the years. Operating Presto at Pinterest’s scale has involved resolving quite a few challenges like, supporting deeply nested and huge thrift schemas, slow/ bad worker detection and remediation, auto-scaling cluster, graceful cluster shutdown and impersonation support for ldap authenticator.
Our infrastructure is built on top of Amazon EC2 and we leverage Amazon S3 for storing our data. This separates compute and storage layers, and allows multiple compute clusters to share the S3 data.
We have hundreds of petabytes of data and tens of thousands of Apache Hive tables. Our Presto clusters are comprised of a fleet of 450 r4.8xl EC2 instances. Presto clusters together have over 100 TBs of memory and 14K vcpu cores. Within Pinterest, we have close to more than 1,000 monthly active users (out of total 1,600+ Pinterest employees) using Presto, who run about 400K queries on these clusters per month.
Each query submitted to Presto cluster is logged to a Kafka topic via Singer. Singer is a logging agent built at Pinterest and we talked about it in a previous post. Each query is logged when it is submitted and when it finishes. When a Presto cluster crashes, we will have query submitted events without corresponding query finished events. These events enable us to capture the effect of cluster crashes over time.
Each Presto cluster at Pinterest has workers on a mix of dedicated AWS EC2 instances and Kubernetes pods. Kubernetes platform provides us with the capability to add and remove workers from a Presto cluster very quickly. The best-case latency on bringing up a new worker on Kubernetes is less than a minute. However, when the Kubernetes cluster itself is out of resources and needs to scale up, it can take up to ten minutes. Some other advantages of deploying on Kubernetes platform is that our Presto deployment becomes agnostic of cloud vendor, instance types, OS, etc.
#BigData #AWS #DataScience #DataEngineering
- Sql800
- Free679
- Easy562
- Widely used528
- Open source490
- High availability180
- Cross-platform support160
- Great community104
- Secure79
- Full-text indexing and searching75
- Fast, open, available26
- Reliable16
- SSL support16
- Robust15
- Enterprise Version9
- Easy to set up on all platforms7
- NoSQL access to JSON data type3
- Relational database1
- Easy, light, scalable1
- Sequel Pro (best SQL GUI)1
- Replica Support1
- Owned by a company with their own agenda16
- Can't roll back schema changes3
related MySQL posts
When I joined NYT there was already broad dissatisfaction with the LAMP (Linux Apache HTTP Server MySQL PHP) Stack and the front end framework, in particular. So, I wasn't passing judgment on it. I mean, LAMP's fine, you can do good work in LAMP. It's a little dated at this point, but it's not ... I didn't want to rip it out for its own sake, but everyone else was like, "We don't like this, it's really inflexible." And I remember from being outside the company when that was called MIT FIVE when it had launched. And been observing it from the outside, and I was like, you guys took so long to do that and you did it so carefully, and yet you're not happy with your decisions. Why is that? That was more the impetus. If we're going to do this again, how are we going to do it in a way that we're gonna get a better result?
So we're moving quickly away from LAMP, I would say. So, right now, the new front end is React based and using Apollo. And we've been in a long, protracted, gradual rollout of the core experiences.
React is now talking to GraphQL as a primary API. There's a Node.js back end, to the front end, which is mainly for server-side rendering, as well.
Behind there, the main repository for the GraphQL server is a big table repository, that we call Bodega because it's a convenience store. And that reads off of a Kafka pipeline.
We've been using PostgreSQL since the very early days of Zulip, but we actually didn't use it from the beginning. Zulip started out as a MySQL project back in 2012, because we'd heard it was a good choice for a startup with a wide community. However, we found that even though we were using the Django ORM for most of our database access, we spent a lot of time fighting with MySQL. Issues ranged from bad collation defaults, to bad query plans which required a lot of manual query tweaks.
We ended up getting so frustrated that we tried out PostgresQL, and the results were fantastic. We didn't have to do any real customization (just some tuning settings for how big a server we had), and all of our most important queries were faster out of the box. As a result, we were able to delete a bunch of custom queries escaping the ORM that we'd written to make the MySQL query planner happy (because postgres just did the right thing automatically).
And then after that, we've just gotten a ton of value out of postgres. We use its excellent built-in full-text search, which has helped us avoid needing to bring in a tool like Elasticsearch, and we've really enjoyed features like its partial indexes, which saved us a lot of work adding unnecessary extra tables to get good performance for things like our "unread messages" and "starred messages" indexes.
I can't recommend it highly enough.
- Relational database764
- High availability510
- Enterprise class database439
- Sql383
- Sql + nosql304
- Great community173
- Easy to setup147
- Heroku131
- Secure by default130
- Postgis113
- Supports Key-Value50
- Great JSON support48
- Cross platform34
- Extensible33
- Replication28
- Triggers26
- Multiversion concurrency control23
- Rollback23
- Open source21
- Heroku Add-on18
- Stable, Simple and Good Performance17
- Powerful15
- Lets be serious, what other SQL DB would you go for?13
- Good documentation11
- Scalable9
- Free8
- Reliable8
- Intelligent optimizer8
- Transactional DDL7
- Modern7
- One stop solution for all things sql no matter the os6
- Relational database with MVCC5
- Faster Development5
- Full-Text Search4
- Developer friendly4
- Excellent source code3
- Free version3
- Great DB for Transactional system or Application3
- Relational datanbase3
- search3
- Open-source3
- Text2
- Full-text2
- Can handle up to petabytes worth of size1
- Composability1
- Multiple procedural languages supported1
- Native0
- Table/index bloatings10
related PostgreSQL posts
Our whole DevOps stack consists of the following tools:
- GitHub (incl. GitHub Pages/Markdown for Documentation, GettingStarted and HowTo's) for collaborative review and code management tool
- Respectively Git as revision control system
- SourceTree as Git GUI
- Visual Studio Code as IDE
- CircleCI for continuous integration (automatize development process)
- Prettier / TSLint / ESLint as code linter
- SonarQube as quality gate
- Docker as container management (incl. Docker Compose for multi-container application management)
- VirtualBox for operating system simulation tests
- Kubernetes as cluster management for docker containers
- Heroku for deploying in test environments
- nginx as web server (preferably used as facade server in production environment)
- SSLMate (using OpenSSL) for certificate management
- Amazon EC2 (incl. Amazon S3) for deploying in stage (production-like) and production environments
- PostgreSQL as preferred database system
- Redis as preferred in-memory database/store (great for caching)
The main reason we have chosen Kubernetes over Docker Swarm is related to the following artifacts:
- Key features: Easy and flexible installation, Clear dashboard, Great scaling operations, Monitoring is an integral part, Great load balancing concepts, Monitors the condition and ensures compensation in the event of failure.
- Applications: An application can be deployed using a combination of pods, deployments, and services (or micro-services).
- Functionality: Kubernetes as a complex installation and setup process, but it not as limited as Docker Swarm.
- Monitoring: It supports multiple versions of logging and monitoring when the services are deployed within the cluster (Elasticsearch/Kibana (ELK), Heapster/Grafana, Sysdig cloud integration).
- Scalability: All-in-one framework for distributed systems.
- Other Benefits: Kubernetes is backed by the Cloud Native Computing Foundation (CNCF), huge community among container orchestration tools, it is an open source and modular tool that works with any OS.
Recently we were looking at a few robust and cost-effective ways of replicating the data that resides in our production MongoDB to a PostgreSQL database for data warehousing and business intelligence.
We set ourselves the following criteria for the optimal tool that would do this job: - The data replication must be near real-time, yet it should NOT impact the production database - The data replication must be horizontally scalable (based on the load), asynchronous & crash-resilient
Based on the above criteria, we selected the following tools to perform the end to end data replication:
We chose MongoDB Stitch for picking up the changes in the source database. It is the serverless platform from MongoDB. One of the services offered by MongoDB Stitch is Stitch Triggers. Using stitch triggers, you can execute a serverless function (in Node.js) in real time in response to changes in the database. When there are a lot of database changes, Stitch automatically "feeds forward" these changes through an asynchronous queue.
We chose Amazon SQS as the pipe / message backbone for communicating the changes from MongoDB to our own replication service. Interestingly enough, MongoDB stitch offers integration with AWS services.
In the Node.js function, we wrote minimal functionality to communicate the database changes (insert / update / delete / replace) to Amazon SQS.
Next we wrote a minimal micro-service in Python to listen to the message events on SQS, pickup the data payload & mirror the DB changes on to the target Data warehouse. We implemented source data to target data translation by modelling target table structures through SQLAlchemy . We deployed this micro-service as AWS Lambda with Zappa. With Zappa, deploying your services as event-driven & horizontally scalable Lambda service is dumb-easy.
In the end, we got to implement a highly scalable near realtime Change Data Replication service that "works" and deployed to production in a matter of few days!
- Document-oriented storage828
- No sql593
- Ease of use553
- Fast464
- High performance410
- Free255
- Open source218
- Flexible180
- Replication & high availability145
- Easy to maintain112
- Querying42
- Easy scalability39
- Auto-sharding38
- High availability37
- Map/reduce31
- Document database27
- Easy setup25
- Full index support25
- Reliable16
- Fast in-place updates15
- Agile programming, flexible, fast14
- No database migrations12
- Easy integration with Node.Js8
- Enterprise8
- Enterprise Support6
- Great NoSQL DB5
- Support for many languages through different drivers4
- Schemaless3
- Aggregation Framework3
- Drivers support is good3
- Fast2
- Managed service2
- Easy to Scale2
- Awesome2
- Consistent2
- Good GUI1
- Acid Compliant1
- Very slowly for connected models that require joins6
- Not acid compliant3
- Proprietary query language2
related MongoDB posts
Recently we were looking at a few robust and cost-effective ways of replicating the data that resides in our production MongoDB to a PostgreSQL database for data warehousing and business intelligence.
We set ourselves the following criteria for the optimal tool that would do this job: - The data replication must be near real-time, yet it should NOT impact the production database - The data replication must be horizontally scalable (based on the load), asynchronous & crash-resilient
Based on the above criteria, we selected the following tools to perform the end to end data replication:
We chose MongoDB Stitch for picking up the changes in the source database. It is the serverless platform from MongoDB. One of the services offered by MongoDB Stitch is Stitch Triggers. Using stitch triggers, you can execute a serverless function (in Node.js) in real time in response to changes in the database. When there are a lot of database changes, Stitch automatically "feeds forward" these changes through an asynchronous queue.
We chose Amazon SQS as the pipe / message backbone for communicating the changes from MongoDB to our own replication service. Interestingly enough, MongoDB stitch offers integration with AWS services.
In the Node.js function, we wrote minimal functionality to communicate the database changes (insert / update / delete / replace) to Amazon SQS.
Next we wrote a minimal micro-service in Python to listen to the message events on SQS, pickup the data payload & mirror the DB changes on to the target Data warehouse. We implemented source data to target data translation by modelling target table structures through SQLAlchemy . We deployed this micro-service as AWS Lambda with Zappa. With Zappa, deploying your services as event-driven & horizontally scalable Lambda service is dumb-easy.
In the end, we got to implement a highly scalable near realtime Change Data Replication service that "works" and deployed to production in a matter of few days!
We use MongoDB as our primary #datastore. Mongo's approach to replica sets enables some fantastic patterns for operations like maintenance, backups, and #ETL.
As we pull #microservices from our #monolith, we are taking the opportunity to build them with their own datastores using PostgreSQL. We also use Redis to cache data we’d never store permanently, and to rate-limit our requests to partners’ APIs (like GitHub).
When we’re dealing with large blobs of immutable data (logs, artifacts, and test results), we store them in Amazon S3. We handle any side-effects of S3’s eventual consistency model within our own code. This ensures that we deal with user requests correctly while writes are in process.
- Performance887
- Super fast542
- Ease of use514
- In-memory cache444
- Advanced key-value cache324
- Open source194
- Easy to deploy182
- Stable165
- Free156
- Fast121
- High-Performance42
- High Availability40
- Data Structures35
- Very Scalable32
- Replication24
- Pub/Sub23
- Great community22
- "NoSQL" key-value data store19
- Hashes16
- Sets13
- Sorted Sets11
- Lists10
- NoSQL10
- Async replication9
- BSD licensed9
- Integrates super easy with Sidekiq for Rails background8
- Bitmaps8
- Open Source7
- Keys with a limited time-to-live7
- Lua scripting6
- Strings6
- Awesomeness for Free5
- Hyperloglogs5
- Runs server side LUA4
- Transactions4
- Networked4
- Outstanding performance4
- Feature Rich4
- Written in ANSI C4
- LRU eviction of keys4
- Data structure server3
- Performance & ease of use3
- Temporarily kept on disk2
- Dont save data if no subscribers are found2
- Automatic failover2
- Easy to use2
- Scalable2
- Channels concept2
- Object [key/value] size each 500 MB2
- Existing Laravel Integration2
- Simple2
- Cannot query objects directly15
- No secondary indexes for non-numeric data types3
- No WAL1
related Redis posts
StackShare Feed is built entirely with React, Glamorous, and Apollo. One of our objectives with the public launch of the Feed was to enable a Server-side rendered (SSR) experience for our organic search traffic. When you visit the StackShare Feed, and you aren't logged in, you are delivered the Trending feed experience. We use an in-house Node.js rendering microservice to generate this HTML. This microservice needs to run and serve requests independent of our Rails web app. Up until recently, we had a mono-repo with our Rails and React code living happily together and all served from the same web process. In order to deploy our SSR app into a Heroku environment, we needed to split out our front-end application into a separate repo in GitHub. The driving factor in this decision was mostly due to limitations imposed by Heroku specifically with how processes can't communicate with each other. A new SSR app was created in Heroku and linked directly to the frontend repo so it stays in-sync with changes.
Related to this, we need a way to "deploy" our frontend changes to various server environments without building & releasing the entire Ruby application. We built a hybrid Amazon S3 Amazon CloudFront solution to host our Webpack bundles. A new CircleCI script builds the bundles and uploads them to S3. The final step in our rollout is to update some keys in Redis so our Rails app knows which bundles to serve. The result of these efforts were significant. Our frontend team now moves independently of our backend team, our build & release process takes only a few minutes, we are now using an edge CDN to serve JS assets, and we have pre-rendered React pages!
#StackDecisionsLaunch #SSR #Microservices #FrontEndRepoSplit
Our whole DevOps stack consists of the following tools:
- GitHub (incl. GitHub Pages/Markdown for Documentation, GettingStarted and HowTo's) for collaborative review and code management tool
- Respectively Git as revision control system
- SourceTree as Git GUI
- Visual Studio Code as IDE
- CircleCI for continuous integration (automatize development process)
- Prettier / TSLint / ESLint as code linter
- SonarQube as quality gate
- Docker as container management (incl. Docker Compose for multi-container application management)
- VirtualBox for operating system simulation tests
- Kubernetes as cluster management for docker containers
- Heroku for deploying in test environments
- nginx as web server (preferably used as facade server in production environment)
- SSLMate (using OpenSSL) for certificate management
- Amazon EC2 (incl. Amazon S3) for deploying in stage (production-like) and production environments
- PostgreSQL as preferred database system
- Redis as preferred in-memory database/store (great for caching)
The main reason we have chosen Kubernetes over Docker Swarm is related to the following artifacts:
- Key features: Easy and flexible installation, Clear dashboard, Great scaling operations, Monitoring is an integral part, Great load balancing concepts, Monitors the condition and ensures compensation in the event of failure.
- Applications: An application can be deployed using a combination of pods, deployments, and services (or micro-services).
- Functionality: Kubernetes as a complex installation and setup process, but it not as limited as Docker Swarm.
- Monitoring: It supports multiple versions of logging and monitoring when the services are deployed within the cluster (Elasticsearch/Kibana (ELK), Heapster/Grafana, Sysdig cloud integration).
- Scalability: All-in-one framework for distributed systems.
- Other Benefits: Kubernetes is backed by the Cloud Native Computing Foundation (CNCF), huge community among container orchestration tools, it is an open source and modular tool that works with any OS.
- Reliable590
- Scalable492
- Cheap456
- Simple & easy329
- Many sdks83
- Logical30
- Easy Setup13
- REST API11
- 1000+ POPs11
- Secure6
- Easy4
- Plug and play4
- Web UI for uploading files3
- Faster on response2
- Flexible2
- GDPR ready2
- Easy to use1
- Plug-gable1
- Easy integration with CloudFront1
- Permissions take some time to get right7
- Requires a credit card6
- Takes time/work to organize buckets & folders properly6
- Complex to set up3
related Amazon S3 posts
To provide employees with the critical need of interactive querying, we’ve worked with Presto, an open-source distributed SQL query engine, over the years. Operating Presto at Pinterest’s scale has involved resolving quite a few challenges like, supporting deeply nested and huge thrift schemas, slow/ bad worker detection and remediation, auto-scaling cluster, graceful cluster shutdown and impersonation support for ldap authenticator.
Our infrastructure is built on top of Amazon EC2 and we leverage Amazon S3 for storing our data. This separates compute and storage layers, and allows multiple compute clusters to share the S3 data.
We have hundreds of petabytes of data and tens of thousands of Apache Hive tables. Our Presto clusters are comprised of a fleet of 450 r4.8xl EC2 instances. Presto clusters together have over 100 TBs of memory and 14K vcpu cores. Within Pinterest, we have close to more than 1,000 monthly active users (out of total 1,600+ Pinterest employees) using Presto, who run about 400K queries on these clusters per month.
Each query submitted to Presto cluster is logged to a Kafka topic via Singer. Singer is a logging agent built at Pinterest and we talked about it in a previous post. Each query is logged when it is submitted and when it finishes. When a Presto cluster crashes, we will have query submitted events without corresponding query finished events. These events enable us to capture the effect of cluster crashes over time.
Each Presto cluster at Pinterest has workers on a mix of dedicated AWS EC2 instances and Kubernetes pods. Kubernetes platform provides us with the capability to add and remove workers from a Presto cluster very quickly. The best-case latency on bringing up a new worker on Kubernetes is less than a minute. However, when the Kubernetes cluster itself is out of resources and needs to scale up, it can take up to ten minutes. Some other advantages of deploying on Kubernetes platform is that our Presto deployment becomes agnostic of cloud vendor, instance types, OS, etc.
#BigData #AWS #DataScience #DataEngineering
StackShare Feed is built entirely with React, Glamorous, and Apollo. One of our objectives with the public launch of the Feed was to enable a Server-side rendered (SSR) experience for our organic search traffic. When you visit the StackShare Feed, and you aren't logged in, you are delivered the Trending feed experience. We use an in-house Node.js rendering microservice to generate this HTML. This microservice needs to run and serve requests independent of our Rails web app. Up until recently, we had a mono-repo with our Rails and React code living happily together and all served from the same web process. In order to deploy our SSR app into a Heroku environment, we needed to split out our front-end application into a separate repo in GitHub. The driving factor in this decision was mostly due to limitations imposed by Heroku specifically with how processes can't communicate with each other. A new SSR app was created in Heroku and linked directly to the frontend repo so it stays in-sync with changes.
Related to this, we need a way to "deploy" our frontend changes to various server environments without building & releasing the entire Ruby application. We built a hybrid Amazon S3 Amazon CloudFront solution to host our Webpack bundles. A new CircleCI script builds the bundles and uploads them to S3. The final step in our rollout is to update some keys in Redis so our Rails app knows which bundles to serve. The result of these efforts were significant. Our frontend team now moves independently of our backend team, our build & release process takes only a few minutes, we are now using an edge CDN to serve JS assets, and we have pre-rendered React pages!
#StackDecisionsLaunch #SSR #Microservices #FrontEndRepoSplit