PostgreSQL

jeyabalajis, CTO at FundsCorner

Recently we were looking at a few robust and cost-effective ways of replicating the data that resides in our production MongoDB to a PostgreSQL database for data warehousing and business intelligence.

We set ourselves the following criteria for the optimal tool that would do this job:

  • The data replication must be near real-time, yet it must NOT impact the production database.
  • The data replication must be horizontally scalable (based on the load), asynchronous, and crash-resilient.

Based on the above criteria, we selected the following tools to perform the end-to-end data replication:

We chose MongoDB Stitch for picking up the changes in the source database. Stitch is MongoDB's serverless platform, and one of the services it offers is Stitch Triggers. Using Stitch Triggers, you can execute a serverless function (in Node.js) in real time in response to changes in the database. When there are a lot of database changes, Stitch automatically "feeds forward" these changes through an asynchronous queue.

We chose Amazon SQS as the pipe / message backbone for communicating the changes from MongoDB to our own replication service. Interestingly enough, MongoDB Stitch offers integration with AWS services.

In the Node.js function, we wrote minimal functionality to communicate the database changes (insert / update / delete / replace) to Amazon SQS.

Next we wrote a minimal micro-service in Python to listen to the message events on SQS, pick up the data payload, and mirror the DB changes onto the target data warehouse. We implemented the source-to-target data translation by modelling the target table structures through SQLAlchemy. We deployed this micro-service as an AWS Lambda function with Zappa. With Zappa, deploying your services as event-driven, horizontally scalable Lambda functions is dumb-easy.
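
A minimal sketch of what such a consumer can look like, assuming a hypothetical `orders` collection mirrored one-to-one into an `orders` table (the model, fields, and connection string below are illustrative placeholders, not our production schema):

```python
import json

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Order(Base):
    """Hypothetical target table mirroring a MongoDB 'orders' collection."""
    __tablename__ = "orders"
    id = Column(String, primary_key=True)  # MongoDB _id, stored as a string
    status = Column(String)
    amount = Column(Integer)

engine = create_engine("postgresql://user:pass@warehouse/dw")  # placeholder DSN
Session = sessionmaker(bind=engine)

def handler(event, context):
    """Lambda entry point: each SQS record carries one change event."""
    session = Session()
    try:
        for record in event["Records"]:
            change = json.loads(record["body"])
            op, doc = change["operationType"], change.get("fullDocument")
            if op in ("insert", "update", "replace"):
                # merge() gives upsert-by-primary-key semantics
                session.merge(Order(id=str(doc["_id"]),
                                    status=doc.get("status"),
                                    amount=doc.get("amount")))
            elif op == "delete":
                session.query(Order).filter_by(
                    id=str(change["documentKey"]["_id"])).delete()
        session.commit()
    except Exception:
        session.rollback()
        raise  # SQS redelivers the batch, which is what makes this crash-resilient
    finally:
        session.close()
```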

In the end, we got a highly scalable, near-real-time Change Data Replication service that "just works", and we deployed it to production in a matter of days!

z00b, CTO at CircleCI

We use MongoDB as our primary #datastore. Mongo's approach to replica sets enables some fantastic patterns for operations like maintenance, backups, and #ETL.

As we pull #microservices from our #monolith, we are taking the opportunity to build them with their own datastores using PostgreSQL. We also use Redis to cache data we’d never store permanently, and to rate-limit our requests to partners’ APIs (like GitHub).
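
For the rate-limiting use case, the pattern is roughly a fixed-window counter in Redis. A sketch of the idea (the key scheme and limits here are invented for illustration, not CircleCI's actual code):

```python
import time

import redis

r = redis.Redis()  # assumes a reachable Redis instance

def allow_request(partner: str, limit: int = 5000, window: int = 3600) -> bool:
    """Fixed-window rate limiter: at most `limit` calls per `window` seconds."""
    key = f"ratelimit:{partner}:{int(time.time()) // window}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, window)  # first hit in this window sets the TTL
    return count <= limit

if allow_request("github"):
    ...  # safe to call the partner API
```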

When we’re dealing with large blobs of immutable data (logs, artifacts, and test results), we store them in Amazon S3. We handle any side-effects of S3’s eventual consistency model within our own code. This ensures that we deal with user requests correctly while writes are in process.
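
One common way to absorb eventual consistency in code is to retry reads that race a recent write. A hedged sketch of that general pattern (bucket and key names are placeholders; this is not necessarily CircleCI's implementation):

```python
import time

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def get_object_with_retry(bucket: str, key: str, attempts: int = 5) -> bytes:
    """Retry reads that may race a recent write to the same key."""
    for i in range(attempts):
        try:
            return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        except ClientError as e:
            if e.response["Error"]["Code"] != "NoSuchKey":
                raise
            time.sleep(0.1 * 2 ** i)  # exponential backoff before retrying
    raise TimeoutError(f"{key} not visible after {attempts} attempts")
```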

Update: How CircleCI Processes Over 30 Million Builds Per Month - CircleCI Tech Stack (stackshare.io)
jtcunning, Operations Engineer at Sentry

Sentry started as (and remains) an open-source project, growing out of an error logging tool built in 2008. That original build nine years ago was Django and Celery (Python's asynchronous task queue), with PostgreSQL as the database and Redis as the power behind Celery.

We displayed a truly shrewd notion of branding even then, giving the project a catchy name that companies the world over remain jealous of to this day: django-db-log. For the longest time, Sentry’s subtitle on GitHub was “A simple Django app, built with love.” A slightly more accurate description probably would have included Starcraft and Soylent alongside love; regardless, this captured what Sentry was all about.

#MessageQueue #InMemoryDatabases

How Sentry Receives 20 Billion Events Per Month While Preparing to Handle Twice That - Sentry Tech Stack | StackShare (stackshare.io)
tabbott, Founder at Zulip

We've been using PostgreSQL since the very early days of Zulip, but we actually didn't use it from the beginning. Zulip started out as a MySQL project back in 2012, because we'd heard it was a good choice for a startup and it had a wide community. However, we found that even though we were using the Django ORM for most of our database access, we spent a lot of time fighting with MySQL. Issues ranged from bad collation defaults to bad query plans that required a lot of manual query tweaking.

We ended up getting so frustrated that we tried out PostgreSQL, and the results were fantastic. We didn't have to do any real customization (just tuning a few settings for the size of our server), and all of our most important queries were faster out of the box. As a result, we were able to delete a bunch of custom queries escaping the ORM that we'd written to make the MySQL query planner happy (because Postgres just did the right thing automatically).

Since then, we've gotten a ton of value out of Postgres. We use its excellent built-in full-text search, which has helped us avoid needing to bring in a tool like Elasticsearch, and we've really enjoyed features like its partial indexes, which saved us from adding a lot of unnecessary extra tables to get good performance for things like our "unread messages" and "starred messages" indexes.
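
For readers unfamiliar with partial indexes: only rows matching a condition get indexed, so an "unread messages" index stays small even when the table is huge. In Django (which Zulip uses) that looks roughly like this; the model below is a simplified stand-in, not Zulip's real schema:

```python
from django.db import models
from django.db.models import Q

class UserMessage(models.Model):
    # Simplified stand-in; the real Zulip model differs.
    user_id = models.IntegerField()
    message_id = models.IntegerField()
    unread = models.BooleanField(default=True)
    starred = models.BooleanField(default=False)

    class Meta:
        indexes = [
            # Partial indexes: only rows matching `condition` are indexed.
            models.Index(fields=["user_id", "message_id"],
                         condition=Q(unread=True),
                         name="idx_um_unread"),
            models.Index(fields=["user_id", "message_id"],
                         condition=Q(starred=True),
                         name="idx_um_starred"),
        ]
```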

I can't recommend it highly enough.

tim_nolet, Founder, Engineer & Dishwasher at Checkly

Checkly is a fairly young company and we're still working hard to find the correct mix of product features, price and audience.

We are focussed on tech B2B, but I always wanted to serve solo developers too. So I decided to make a $7 plan.

Why $7? Simply put, it seems to be a sweet spot for tech companies: Heroku, Docker, GitHub, AppOptics (Librato) all offer $7 plans. They must have done a ton of research into this, so why not piggyback on that and try it out?

Enough biz talk, onto tech. The challenges were:

  • Slice off a portion of the functionality so a $7 plan is still profitable. We call this the "plan limits".
  • Update API and back end services to handle and enforce plan limits.
  • Update the UI to kindly state when plan limits are in effect.
  • Update the pricing page to reflect all changes.
  • Keep the actual processing backend, storage and APIs as untouched as possible.

In essence, we went from strictly volume based pricing to value based pricing. Here come the technical steps & decisions we made to get there.

  1. We updated our PostgreSQL schema so plans now have an array of "features". These are string constants that represent feature toggles.
  2. The Vue.js frontend reads these from the vuex store on login.
  3. Based on these values, the UI has simple v-if statements to either just show the feature or show a friendly "please upgrade" button.
  4. The hapi API has a hook on each relevant API endpoint that checks whether a user's plan has the feature enabled (see the sketch after this list).
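
The hook itself is written for hapi in Node.js; the gate logic is simple enough to sketch in a few lines of Python (the `plan.features` attribute and `PlanLimitError` name are invented for illustration):

```python
class PlanLimitError(Exception):
    """Raised when an endpoint needs a feature the user's plan doesn't include."""

def require_feature(user, feature: str) -> None:
    # Plans carry an array of feature-toggle constants, e.g. ["sms", "teams"].
    if feature not in user.plan.features:
        raise PlanLimitError(f"Your plan does not include '{feature}'. Please upgrade.")

# At the top of a relevant endpoint handler:
# require_feature(current_user, "sms")
```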

Side note: We offer 10 SMS messages per month on the developer plan. However, we were not actually counting how many SMS messages people were sending. We had to update our alerting daemon (which runs on Heroku and triggers SMS messages via AWS SNS) to actually bump a counter.

What we built is basically feature-toggling based on plan features. It is very extensible for future additions. Our scheduling and storage backend that actually runs users' monitoring requests (AWS Lambda) and stores the results (S3 and Postgres) has no knowledge of any of this and remained unchanged.

Hope this helps anyone building out their SaaS who is in a similar situation.

ecolson, Chief Algorithms Officer at Stitch Fix

The algorithms and data infrastructure at Stitch Fix is housed in #AWS. Data acquisition is split between events flowing through Kafka and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3-based data warehouse. Apache Spark on YARN is our tool of choice for data movement and #ETL. Because our storage layer (S3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling YARN clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for ad-hoc queries and dashboards.

Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).

At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated into systems. That requires a serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan gives our data scientists the ability to quickly productionize models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn) by automatically packaging them as Docker containers and deploying them to Amazon ECS. This gives our data scientists a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.
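
Khan itself is internal, but the underlying pattern (train a model, serialize an artifact, bake it into a container that serves predictions) can be sketched generically; every name below is invented for illustration:

```python
import pickle

from sklearn.linear_model import LogisticRegression

def train_and_package(X, y, path="model.pkl"):
    """Train a model and serialize the artifact that gets baked into an image."""
    model = LogisticRegression().fit(X, y)
    with open(path, "wb") as f:
        pickle.dump(model, f)

def predict(features, path="model.pkl"):
    """What the container's HTTP handler would call per request."""
    with open(path, "rb") as f:
        model = pickle.load(f)
    return model.predict([features]).tolist()
```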

For more info: Stitch Fix Algorithms Tour (algorithms-tour.stitchfix.com)

#DataScience #DataStack #Data

drob, Heap, Inc.

Shared insights on PostgreSQL and Citus at Heap

PostgreSQL was an easy early decision for the founding team. The relational data model fit the types of analyses they would be doing: filtering, grouping, joining, etc., and it was the database they knew best.

Shortly after adopting PG, they discovered Citus, which is a tool that makes it easy to distribute queries. Although it was a young project and a fork of Postgres at that point, Dan says the Citus team was very available and highly expert, and it wouldn't have been very difficult to move back to plain PG if they needed to.

What Citus forked was the query execution layer; you could treat the worker nodes like regular PG instances. Citus also gave them a ton of flexibility to make queries fast, and again, they felt the relational data model was the best fit for their application.
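
Distributing a table with Citus is a one-line call on the coordinator. A sketch (the `events` schema here is hypothetical; `create_distributed_table` is Citus's real UDF):

```python
import psycopg2

# Shard an events table by user_id on a Citus coordinator (run once).
conn = psycopg2.connect("dbname=analytics")  # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            user_id  bigint NOT NULL,
            event_id bigint NOT NULL,
            payload  jsonb
        );
    """)
    cur.execute("SELECT create_distributed_table('events', 'user_id');")
```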

#DataStores #Databases

How Heap Built an Analytics Platform that Auto-Tracks Every User Event - Heap Tech Stack | StackShare (stackshare.io)

Recently I have been working on an open source stack to help people consolidate their personal health data in a single database, so that AI and analytics apps can be run against it to find personalized treatments. We chose to go with a #containerized approach leveraging Docker #containers, with a local development environment set up with Docker Compose and nginx for container routing. For the production environment, we chose to pull code from GitHub and build/push images using Jenkins, deploying to Amazon EC2 with Kubernetes.

We also implemented a dashboard app to handle user authentication/authorization, as well as a custom SSO server that runs on Heroku, which allows experts to easily visit more than one instance without having to log in repeatedly. The #Backend was implemented using my favorite #Stack, which consists of FeathersJS on top of Node.js and ExpressJS, with PostgreSQL as the main database. The #Frontend was implemented using React, Redux.js, Semantic UI React and the FeathersJS client. Though testing was light on this project, we chose to use AVA as well as ESLint to keep the codebase clean and consistent.

jkodumal, CTO at LaunchDarkly

As we've evolved and added infrastructure to our stack, we've biased towards managed services. Most new backing stores are Amazon RDS instances now. We do use self-managed PostgreSQL with TimescaleDB for time-series data; this is made highly available with Patroni and Consul.
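
For reference, turning a plain Postgres table into a TimescaleDB hypertable is a single function call. A sketch with a hypothetical schema (not LaunchDarkly's actual tables; `create_hypertable` is TimescaleDB's real function):

```python
import psycopg2

# Create a time-series table and promote it to a hypertable (run once).
conn = psycopg2.connect("dbname=metrics")  # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS flag_evaluations (
            time     timestamptz NOT NULL,
            flag_key text        NOT NULL,
            value    boolean
        );
    """)
    cur.execute(
        "SELECT create_hypertable('flag_evaluations', 'time', if_not_exists => TRUE);"
    )
```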

We also use managed Amazon ElastiCache instances instead of spinning up Amazon EC2 instances to run Redis workloads, and we've shifted to Amazon Kinesis instead of Kafka.

Redux: Scaling LaunchDarkly from 4 to 200 billion feature flags daily - LaunchDarkly Tech Stack | StackShare (stackshare.io)
dmitry-mukhin, CTO at Uploadcare

Uploadcare has built an infinitely scalable infrastructure by leveraging AWS. Building on top of AWS allows us to process 350M daily requests for file uploads, manipulations, and deliveries. When we started in 2011, the only cloud alternative to AWS was Google App Engine, which was a no-go for the rather complex solution we wanted to build. We also didn't want to buy any hardware or use colocation facilities.

Our stack handles receiving files, communicating with external file sources, managing file storage, managing user and file data, processing files, file caching and delivery, and managing user interface dashboards.

At its core, Uploadcare runs on Python. The EuroPython 2011 conference in Florence really inspired us, and that, coupled with the fact that Python was general enough to solve all of our challenges, informed this decision. Additionally, we had prior experience working in Python.

We chose to build the main application with Django because of its feature completeness and large footprint within the Python ecosystem.

All the communications within our ecosystem occur via several HTTP APIs, Redis, Amazon S3, and Amazon DynamoDB. We decided on this architecture so that our system could be scalable in terms of storage and database throughput. This way we only need Django running on top of our database cluster. We use PostgreSQL as our database because it is considered an industry standard when it comes to clustering and scaling.
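
A minimal sketch of what "Django on top of a Postgres cluster" looks like in settings terms, with placeholder hosts and credentials (not Uploadcare's actual configuration):

```python
# settings.py sketch: one Django app server in front of a Postgres cluster.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "app",
        "HOST": "pg-primary.internal",  # writes go to the cluster primary
        "USER": "app",
        "PASSWORD": "change-me",
    },
    "replica": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "app",
        "HOST": "pg-replica.internal",  # reads can be routed here via a DB router
        "USER": "app",
        "PASSWORD": "change-me",
    },
}
```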

How Uploadcare Built a Stack That Handles 350M File API Requests Per Day - Uploadcare Tech Stack | StackShare (stackshare.io)