What is RethinkDB and what are its top alternatives?
RethinkDB is an open-source, distributed document database built for the realtime web. Instead of polling for changes, applications can subscribe to changefeeds and have RethinkDB continuously push updated JSON query results to them. The company behind RethinkDB shut down in 2016 and the project now continues as community-driven open source, which is why many teams evaluate the alternatives below.
Top Alternatives to RethinkDB
- MongoDB
MongoDB stores data in JSON-like documents that can vary in structure, offering a dynamic, flexible schema. MongoDB was also designed for high availability and scalability, with built-in replication and auto-sharding. ...
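For a feel of what that flexible schema looks like in practice, here is a minimal sketch using the official MongoDB Node.js driver; the connection string, database, and collection names are placeholders:

```js
// Minimal sketch with the official MongoDB Node.js driver.
// Connection string, database, and collection names are placeholders.
import { MongoClient } from 'mongodb';

const client = new MongoClient('mongodb://localhost:27017');
await client.connect();
const users = client.db('app').collection('users');

// Documents in the same collection may have different shapes.
await users.insertOne({ name: 'Ada', languages: ['js', 'py'] });
await users.insertOne({ name: 'Lin', company: { name: 'Acme' } });

// Query on a nested field without any upfront schema.
console.log(await users.find({ 'company.name': 'Acme' }).toArray());
await client.close();
```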
- CouchDB
Apache CouchDB is a database that uses JSON for documents, JavaScript for MapReduce indexes, and regular HTTP for its API. CouchDB completely embraces the web: store your data as JSON documents, access your documents and query your indexes with your web browser via HTTP, and index, combine, and transform your documents with JavaScript. ...
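Because the API really is plain HTTP, any HTTP client works. A minimal sketch, assuming a local CouchDB with a database named "books" that allows unauthenticated access (add an Authorization header otherwise):

```js
// CouchDB speaks plain HTTP, so fetch is all you need.
const base = 'http://localhost:5984/books';

// POSTing to the database creates a document; CouchDB assigns _id and _rev.
await fetch(base, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ title: 'Sketches', year: 2020 }),
});

// Read every document back, bodies included.
const res = await fetch(`${base}/_all_docs?include_docs=true`);
console.log(await res.json());
```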
- CockroachDB
CockroachDB is a distributed SQL database that can be deployed as a serverless, dedicated, or on-prem cluster. It offers elastic scale, multi-active availability for resilience, and low-latency performance. ...
- Couchbase
Developed as an alternative to traditionally inflexible SQL databases, the Couchbase NoSQL database is built on an open source foundation and architected to help developers solve real-world problems and meet high scalability demands. ...
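As a rough sketch of the developer experience, here is the Couchbase Node.js SDK storing a document and querying it back with Couchbase's SQL-like query language (N1QL/SQL++); the bucket name and credentials are placeholders, and a primary index is assumed for ad-hoc queries:

```js
import couchbase from 'couchbase';

// Placeholder connection details; assumes a bucket named "app"
// with a primary index created for ad-hoc N1QL queries.
const cluster = await couchbase.connect('couchbase://localhost', {
  username: 'Administrator',
  password: 'password',
});
const collection = cluster.bucket('app').defaultCollection();

// Key-value access...
await collection.upsert('hotel::1', { type: 'hotel', name: 'Seaview', rooms: 12 });

// ...and the same data queried with SQL-like N1QL / SQL++.
const result = await cluster.query(
  'SELECT h.name, h.rooms FROM `app` h WHERE h.type = $type',
  { parameters: { type: 'hotel' } }
);
console.log(result.rows);
```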
- Firebase
Firebase is a cloud service designed to power real-time, collaborative applications. Simply add the Firebase library to your application to gain access to a shared data structure; any changes you make to that data are automatically synchronized with the Firebase cloud and with other clients within milliseconds. ...
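A minimal sketch of that synchronization using the modular Firebase JS SDK (v9+); the config values and database paths below are placeholders:

```js
// Sketch of Firebase Realtime Database sync; config is a placeholder.
import { initializeApp } from 'firebase/app';
import { getDatabase, ref, onValue, set } from 'firebase/database';

const app = initializeApp({ databaseURL: 'https://your-project.firebaseio.com' });
const db = getDatabase(app);

// Every connected client with this listener sees writes within milliseconds.
onValue(ref(db, 'rooms/lobby/topic'), (snapshot) => {
  console.log('topic is now:', snapshot.val());
});

// Writing from any client triggers the listener above everywhere.
await set(ref(db, 'rooms/lobby/topic'), 'realtime databases');
```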
- Redis
Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache, and message broker. Redis provides data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes, and streams. ...
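A short sketch with the node-redis client, showing both cache-style usage and one of the richer data structures (a sorted set); connection details are local defaults:

```js
// Sketch using the node-redis client; host/port are local defaults.
import { createClient } from 'redis';

const client = createClient({ url: 'redis://localhost:6379' });
await client.connect();

// Plain key-value with a 60-second TTL (cache-style usage).
await client.set('session:42', 'alice', { EX: 60 });

// Richer structures: a sorted set used as a leaderboard.
await client.zAdd('leaderboard', [
  { score: 120, value: 'alice' },
  { score: 95, value: 'bob' },
]);
console.log(await client.zRange('leaderboard', 0, -1)); // lowest to highest

await client.quit();
```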
- Cassandra
Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent manner. Cassandra will automatically repartition as machines are added and removed from the cluster. Row store means that like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL. ...
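A sketch with the DataStax Node.js driver: CQL reads much like SQL, and the partition key in the table definition is what lets Cassandra spread rows across the cluster. Contact point, data center, and keyspace are placeholders, and the keyspace is assumed to exist:

```js
import cassandra from 'cassandra-driver';

// Placeholder cluster details; assumes the "demo" keyspace already exists.
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'demo',
});

// The partition key (user_id) decides which nodes own each row.
await client.execute(
  `CREATE TABLE IF NOT EXISTS events (
     user_id uuid, ts timestamp, payload text,
     PRIMARY KEY (user_id, ts))`
);

const userId = cassandra.types.Uuid.random();
await client.execute(
  'INSERT INTO events (user_id, ts, payload) VALUES (?, ?, ?)',
  [userId, new Date(), 'signup'],
  { prepare: true }
);

const rs = await client.execute(
  'SELECT * FROM events WHERE user_id = ?', [userId], { prepare: true });
console.log(rs.rows);
await client.shutdown();
```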
- PostgreSQL
PostgreSQL is an advanced object-relational database management system that supports an extended subset of the SQL standard, including transactions, foreign keys, subqueries, triggers, user-defined types and functions. ...
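As a small illustration of the transactional side, a sketch with node-postgres in which two inserts linked by a foreign key either both commit or both roll back; the table names and connection string are placeholders:

```js
// Sketch of a transaction with node-postgres; connection details are placeholders.
import pg from 'pg';

const pool = new pg.Pool({ connectionString: 'postgres://localhost/app' });
const client = await pool.connect();
try {
  await client.query('BEGIN');
  // The foreign key on orders(user_id) is checked inside this transaction.
  const { rows } = await client.query(
    'INSERT INTO users (name) VALUES ($1) RETURNING id', ['ada']);
  await client.query(
    'INSERT INTO orders (user_id, total) VALUES ($1, $2)', [rows[0].id, 42]);
  await client.query('COMMIT');
} catch (err) {
  await client.query('ROLLBACK'); // either both rows land, or neither does
  throw err;
} finally {
  client.release();
}
await pool.end();
```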
RethinkDB alternatives & related posts
MongoDB
Pros of MongoDB
- Document-oriented storage (828)
- NoSQL (593)
- Ease of use (553)
- Fast (464)
- High performance (410)
- Free (255)
- Open source (218)
- Flexible (180)
- Replication & high availability (145)
- Easy to maintain (112)
- Querying (42)
- Easy scalability (39)
- Auto-sharding (38)
- High availability (37)
- Map/reduce (31)
- Document database (27)
- Easy setup (25)
- Full index support (25)
- Reliable (16)
- Fast in-place updates (15)
- Agile programming, flexible, fast (14)
- No database migrations (12)
- Easy integration with Node.js (8)
- Enterprise (8)
- Enterprise support (6)
- Great NoSQL DB (5)
- Support for many languages through different drivers (4)
- Schemaless (3)
- Aggregation framework (3)
- Drivers support is good (3)
- Fast (2)
- Managed service (2)
- Easy to scale (2)
- Awesome (2)
- Consistent (2)
- Good GUI (1)
- ACID compliant (1)
Cons of MongoDB
- Very slow for connected models that require joins (6)
- Not ACID compliant (3)
- Proprietary query language (2)
related MongoDB posts
Recently we were looking at a few robust and cost-effective ways of replicating the data that resides in our production MongoDB to a PostgreSQL database for data warehousing and business intelligence.
We set ourselves the following criteria for the optimal tool that would do this job:
- The data replication must be near real-time, yet it should NOT impact the production database
- The data replication must be horizontally scalable (based on the load), asynchronous & crash-resilient
Based on the above criteria, we selected the following tools to perform the end-to-end data replication:
We chose MongoDB Stitch for picking up the changes in the source database. It is the serverless platform from MongoDB. One of the services offered by MongoDB Stitch is Stitch Triggers. Using Stitch Triggers, you can execute a serverless function (in Node.js) in real time in response to changes in the database. When there are a lot of database changes, Stitch automatically "feeds forward" these changes through an asynchronous queue.
We chose Amazon SQS as the pipe / message backbone for communicating the changes from MongoDB to our own replication service. Interestingly enough, MongoDB Stitch offers integration with AWS services.
In the Node.js function, we wrote minimal functionality to communicate the database changes (insert / update / delete / replace) to Amazon SQS.
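A hypothetical sketch of what such a minimal function might look like with the AWS SDK for JavaScript; the queue URL is a placeholder, and the changeEvent shape follows MongoDB change events:

```js
// Hypothetical sketch of the "minimal functionality" described above.
// The queue URL is a placeholder; changeEvent follows MongoDB's change
// event shape (operationType, fullDocument).
import AWS from 'aws-sdk';

const sqs = new AWS.SQS({ region: 'us-east-1' });

async function forwardChange(changeEvent) {
  await sqs.sendMessage({
    QueueUrl: 'https://sqs.us-east-1.amazonaws.com/123456789012/mongo-changes',
    MessageBody: JSON.stringify({
      op: changeEvent.operationType,   // insert / update / delete / replace
      doc: changeEvent.fullDocument,
    }),
  }).promise();
}
```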
Next we wrote a minimal micro-service in Python to listen to the message events on SQS, pick up the data payload & mirror the DB changes on to the target data warehouse. We implemented source-data-to-target-data translation by modelling target table structures through SQLAlchemy. We deployed this micro-service as an AWS Lambda with Zappa. With Zappa, deploying your services as event-driven & horizontally scalable Lambda services is dumb-easy.
In the end, we got to implement a highly scalable, near-realtime Change Data Replication service that "works", and deployed it to production in a matter of a few days!
We use MongoDB as our primary #datastore. Mongo's approach to replica sets enables some fantastic patterns for operations like maintenance, backups, and #ETL.
As we pull #microservices from our #monolith, we are taking the opportunity to build them with their own datastores using PostgreSQL. We also use Redis to cache data we’d never store permanently, and to rate-limit our requests to partners’ APIs (like GitHub).
When we’re dealing with large blobs of immutable data (logs, artifacts, and test results), we store them in Amazon S3. We handle any side-effects of S3’s eventual consistency model within our own code. This ensures that we deal with user requests correctly while writes are in process.
CouchDB
Pros of CouchDB
- JSON (43)
- Open source (30)
- Highly available (18)
- Partition tolerant (12)
- Eventual consistency (11)
- Sync (7)
- REST API (5)
- Attachments mechanism to docs (4)
- Multi-master replication (4)
- Changes feed (3)
- REST interface (1)
- JS- and Erlang-views (1)
related CouchDB posts
I needed to choose a full stack of tools for cross platform mobile application design & development. After much research and trying different tools, these are what I came up with that work for me today:
For the client coding I chose Framework7 because of its performance, easy learning curve, and very well designed, beautiful UI widgets. I think it's perfect for solo development or small teams. I didn't like React Native. It felt heavy to me and rigid. Framework7 allows the use of #CSS3, which I think is the best technology to come out of the #WWW movement. No other tech before has allowed designers and developers to build such flexible, high-performance, customisable user interface elements that are highly responsive and hardware accelerated. Now that #CSS3 includes variables and flexbox, it is truly a powerful language and there is no longer a need for preprocessors such as #SCSS / #Sass / #less. React Native contains a very limited interpretation of #CSS3 which I found very frustrating after using #CSS3 for some years already and knowing its powerful features. The other very nice feature of Framework7 is that you can even build for the browser if you want your app to be available for desktop web browsers. The latest release also includes the ability to build for #Electron so you can have MacOS, Windows and Linux desktop apps. This is not possible with React Native yet.
Framework7 runs on top of Apache Cordova. Cordova and webviews have been slated as being slow in the past. Having a game developer background, I found the tweaks to make it run as smooth as silk. One of those tweaks is to use WKWebView. Another important one was using srcset on images.
I use #Template7 for the templating system, which is a no-nonsense, mobile-centric, #HandleBars-style extensible templating system. It's easy to write custom helpers for, is fast, and has a small footprint. I'm not forced into a new paradigm or learning some new syntax. It operates with standard JavaScript, HTML5 and CSS3. It's written by the developer of Framework7 and so dovetails with it as expected.
I configured TypeScript to work with the latest version of Framework7. I consider TypeScript to be one of the best creations to come out of Microsoft in some time. They must have an amazing team working on it. It's very powerful and flexible. It helps you catch a lot of bugs and also provides code completion in supporting IDEs. So for my IDE I use Visual Studio Code which is a blazingly fast and silky smooth editor that integrates seamlessly with TypeScript for the ultimate type checking setup (both products are produced by Microsoft).
I use Webpack and Babel to compile the JavaScript. TypeScript can compile to JavaScript directly but Babel offers a few more options and polyfills so you can use the latest (and even prerelease) JavaScript features today and compile to be backwards compatible with virtually any browser. My favorite recent addition is "optional chaining" which greatly simplifies and increases readability of a number of sections of my code dealing with getting and setting data in nested objects.
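For example, optional chaining (paired with nullish coalescing) collapses the usual chain of existence checks on nested objects into one expression:

```js
// Optional chaining with nullish coalescing for reading nested data safely.
const order = { customer: { address: null } };

// Without optional chaining, accessing .city here would throw on the null address.
const city = order?.customer?.address?.city ?? 'unknown';
console.log(city); // "unknown"
```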
I use some Ruby scripts to process images with ImageMagick and pngquant to optimise for size and even auto insert responsive image code into the HTML5. Ruby is the ultimate cross platform scripting language. Even as your scripts become large, Ruby allows you to refactor your code easily and make it Object Oriented if necessary. I find it the quickest and easiest way to maintain certain aspects of my build process.
For the user interface design and prototyping I use Figma. Figma has an almost identical user interface to #Sketch but has the added advantage of being cross platform (MacOS and Windows). Its real-time collaboration features are outstanding and I use them often, as I work mostly on remote projects. Clients can collaborate in real-time and see changes I make as I make them. The clickable prototyping features in Figma are also very well designed and mean I can send clickable prototypes to clients to try user interface updates as they are made and get immediate feedback. I'm currently also evaluating the latest version of #AdobeXD as an alternative to Figma as it has the very cool auto-animate feature. It doesn't have real-time collaboration yet, but I heard it is proposed for 2019.
For the UI icons I use Font Awesome Pro. They have the largest selection and best looking icons you can find on the internet with several variations in styles so you can find most of the icons you want for standard projects.
For the backend I was using the #GraphCool Framework. As I later found out, #GraphQL still has some way to go in order to provide the full power of a mature graph query language, so later in my project I ripped out #GraphCool and replaced it with CouchDB and PouchDB, primarily so I could provide good offline app support. CouchDB with PouchDB is a very flexible and efficient combination and overcomes some of the restrictions I found in #GraphQL and hence #GraphCool also. The most impressive and important feature of CouchDB is its replication. You can configure it in various ways for backups, fault tolerance, caching or conditional merging of databases. CouchDB and PouchDB even support storing, retrieving and serving binary or image data or other MIME types. This removes a level of complexity usually present in database implementations, where binary or image data is usually referenced through an #HTML5 link. With CouchDB and PouchDB, apps can operate offline and sync later, very efficiently, when the network connection is good.
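A minimal sketch of that offline-first setup: a local PouchDB database kept in live two-way sync with a remote CouchDB (the remote URL is a placeholder):

```js
// Sketch of CouchDB/PouchDB replication; the remote URL is a placeholder.
import PouchDB from 'pouchdb';

const local = new PouchDB('app'); // lives on the device/in the browser
const remote = 'http://localhost:5984/app';

// Live, retrying, two-way sync: the app keeps working offline and
// reconciles with CouchDB whenever the network connection is good.
PouchDB.sync(local, remote, { live: true, retry: true })
  .on('change', (info) => console.log('synced', info.direction))
  .on('error', (err) => console.error('sync error', err));
```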
I use PhoneGap when testing the app. It auto-reloads your app when its code is changed and you can also install it on Android phones to preview your app instantly. iOS is a bit trickier because of Apple's policies, so it's not available on the App Store, but you can build it and install it yourself on your device.
So that's my latest mobile stack. What tools do you use? Have you tried these ones?
We implemented our first large-scale ERP application from naologic.com using CouchDB.
Very fast, replication works great, it doesn't consume much RAM, and queries are blazing fast. But we found a problem: the queries were very hard to write, it took a long time to figure out the API, and we had to go and write our own @nodejs library to make it work properly.
CouchDB also seemed to have lost most of its community support. Since then, we migrated to Couchbase, and the learning curve was steep but all worth it. Memcached indexing out of the box, full-text search works great.
CockroachDB
related CockroachDB posts
Couchbase
Pros of Couchbase
- High performance (18)
- Flexible data model, easy scalability, extremely fast (18)
- Mobile app support (9)
- You can query it with ANSI-92 SQL (7)
- All nodes can be read/write (6)
- Equal nodes in cluster, allowing fast, flexible changes (5)
- Both a key-value store and document (JSON) DB (5)
- Open source, community and enterprise editions (5)
- Automatic configuration of sharding (4)
- Local cache capability (4)
- Easy setup (3)
- Linearly scalable, useful to large number of TPS (3)
- Easy cluster administration (3)
- Cross-datacenter replication (3)
- SDKs in popular programming languages (3)
- Elasticsearch connector (3)
- Web-based management, query and monitoring panel (3)
- Map-reduce views (2)
- DBaaS available (2)
- NoSQL (2)
- Buckets, scopes, collections & documents (1)
- FTS + SQL together (1)
Cons of Couchbase
- Terrible query language (3)
related Couchbase posts
(See also the CouchDB-to-Couchbase migration story above, under related CouchDB posts.)
Hey, we want to build a referral campaign mechanism that will probably contain millions of records within the next few years. We want fast read access based on IDs or some indexes, and isolation is crucial as some listeners will try to update the same document at the same time. What's your suggestion between Couchbase and MongoDB? Thanks!
Firebase
Pros of Firebase
- Realtime backend made easy (371)
- Fast and responsive (270)
- Easy setup (242)
- Real-time (215)
- JSON (191)
- Free (134)
- Backed by Google (128)
- Angular adaptor (83)
- Reliable (68)
- Great customer support (36)
- Great documentation (32)
- Real-time synchronization (25)
- Mobile friendly (21)
- Rapid prototyping (19)
- Great security (14)
- Automatic scaling (12)
- Freakingly awesome (11)
- Super fast development (8)
- Angularfire is an amazing addition! (8)
- Chat (8)
- Firebase hosting (6)
- Built-in user auth/OAuth (6)
- Awesome next-gen backend (6)
- iOS adaptor (6)
- Speed of light (4)
- Very easy to use (4)
- Great (3)
- It's made development super fast (3)
- Brilliant for startups (3)
- Free hosting (2)
- Cloud functions (2)
- JS offline and sync support (2)
- Low battery consumption (2)
- .NET (2)
- The concurrent updates create a great experience (2)
- Push notification (2)
- I can quickly create static web apps with no backend (2)
- Great all-round functionality (2)
- Free authentication solution (2)
- Easy ReactJS integration (1)
- Google's support (1)
- Free SSL (1)
- CDN & cache out of the box (1)
- Easy to use (1)
- Large (1)
- Faster workflow (1)
- Serverless (1)
- Good free limits (1)
- Simple and easy (1)
Cons of Firebase
- Can become expensive (31)
- Not open source; you depend on an external company (16)
- Scalability is not infinite (15)
- Not flexible enough (9)
- Can't filter queries (7)
- Very unstable server (3)
- No relational data (3)
- Too many errors (2)
- No offline sync (2)
related Firebase posts
Hi Otensia! I'd definitely recommend using the skills you've already got, and building with JavaScript is a smart way to go these days. Most platform services have JavaScript/Node SDKs or NPM packages, many serverless platforms support Node in case you need to write any backend logic, and JavaScript is incredibly popular - meaning it will be easy to hire for, should you ever need to.
My advice would be "don't reinvent the wheel". If you already have a skill set that will work well to solve the problem at hand, and you don't need it for any other projects, don't spend the time jumping into a new language. If you're looking for an excuse to learn something new, it would be better to invest that time in learning a new platform/tool that complements your knowledge of JavaScript. For this project, I might recommend using Netlify, Vercel, or Google Firebase to quickly and easily deploy your web app. If you need to add user authentication, there are great examples out there for Firebase Authentication, Auth0, or even Magic (a newcomer on the Auth scene, but very user friendly). All of these services work very well with a JavaScript-based application.
For inboxkitten.com, an open-source disposable email service:
We migrated our serverless workload from Cloud Functions for Firebase to Cloudflare Workers, taking advantage of the lower cost and faster-performing edge computing of the Cloudflare network. This was made possible by the extremely low CPU and RAM overhead of our serverless functions.
If I were to summarize the limitations of Cloudflare (as opposed to Firebase/GCP functions), they would be ...
- <5ms CPU time limit
- Incompatible with Express.js
- One-script-per-domain limitation
These are limitations our workload was able to conform with (YMMV).
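For reference, a minimal Worker in the service-worker syntax of the time; the routing logic here is just a placeholder:

```js
// Minimal Cloudflare Worker (service-worker syntax); routing is a placeholder.
addEventListener('fetch', (event) => {
  event.respondWith(handleRequest(event.request));
});

async function handleRequest(request) {
  // No Express here: routing is plain URL inspection,
  // and all work must fit inside the small CPU-time budget.
  const url = new URL(request.url);
  if (url.pathname.startsWith('/api/')) {
    return new Response(JSON.stringify({ ok: true }), {
      headers: { 'content-type': 'application/json' },
    });
  }
  return new Response('inboxkitten', { status: 200 });
}
```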
For hosting of static files, we migrated from Firebase to CommonsHost.
More details on the trade-offs between the two serverless providers are in the article.
Redis
Pros of Redis
- Performance (886)
- Super fast (542)
- Ease of use (513)
- In-memory cache (444)
- Advanced key-value cache (324)
- Open source (194)
- Easy to deploy (182)
- Stable (164)
- Free (155)
- Fast (121)
- High performance (42)
- High availability (40)
- Data structures (35)
- Very scalable (32)
- Replication (24)
- Great community (22)
- Pub/Sub (22)
- "NoSQL" key-value data store (19)
- Hashes (16)
- Sets (13)
- Sorted sets (11)
- NoSQL (10)
- Lists (10)
- Async replication (9)
- BSD licensed (9)
- Bitmaps (8)
- Integrates super easy with Sidekiq for Rails background (8)
- Keys with a limited time-to-live (7)
- Open source (7)
- Lua scripting (6)
- Strings (6)
- Awesomeness for free (5)
- Hyperloglogs (5)
- Transactions (4)
- Outstanding performance (4)
- Runs server-side Lua (4)
- LRU eviction of keys (4)
- Feature rich (4)
- Written in ANSI C (4)
- Networked (4)
- Data structure server (3)
- Performance & ease of use (3)
- Doesn't save data if no subscribers are found (2)
- Automatic failover (2)
- Easy to use (2)
- Temporarily kept on disk (2)
- Scalable (2)
- Existing Laravel integration (2)
- Channels concept (2)
- Object [key/value] size each 500 MB (2)
- Simple (2)
Cons of Redis
- Cannot query objects directly (15)
- No secondary indexes for non-numeric data types (3)
- No WAL (1)
related Redis posts
StackShare Feed is built entirely with React, Glamorous, and Apollo. One of our objectives with the public launch of the Feed was to enable a server-side rendered (SSR) experience for our organic search traffic. When you visit the StackShare Feed and you aren't logged in, you are delivered the Trending feed experience. We use an in-house Node.js rendering microservice to generate this HTML. This microservice needs to run and serve requests independently of our Rails web app. Up until recently, we had a mono-repo with our Rails and React code living happily together, all served from the same web process. In order to deploy our SSR app into a Heroku environment, we needed to split out our front-end application into a separate repo on GitHub. The driving factor in this decision was the limitations imposed by Heroku, specifically how processes can't communicate with each other. A new SSR app was created in Heroku and linked directly to the frontend repo so it stays in sync with changes.
Related to this, we needed a way to "deploy" our frontend changes to various server environments without building & releasing the entire Ruby application. We built a hybrid Amazon S3 / Amazon CloudFront solution to host our Webpack bundles. A new CircleCI script builds the bundles and uploads them to S3. The final step in our rollout is to update some keys in Redis so our Rails app knows which bundles to serve. The results of these efforts were significant: our frontend team now moves independently of our backend team, our build & release process takes only a few minutes, we are now using an edge CDN to serve JS assets, and we have pre-rendered React pages!
#StackDecisionsLaunch #SSR #Microservices #FrontEndRepoSplit
Our whole DevOps stack consists of the following tools:
- GitHub (incl. GitHub Pages/Markdown for Documentation, GettingStarted and HowTo's) as collaborative review and code management tool
- Git as the underlying revision control system
- SourceTree as Git GUI
- Visual Studio Code as IDE
- CircleCI for continuous integration (automating the development process)
- Prettier / TSLint / ESLint as code linter
- SonarQube as quality gate
- Docker as container management (incl. Docker Compose for multi-container application management)
- VirtualBox for operating system simulation tests
- Kubernetes as cluster management for docker containers
- Heroku for deploying in test environments
- nginx as web server (preferably used as facade server in production environment)
- SSLMate (using OpenSSL) for certificate management
- Amazon EC2 (incl. Amazon S3) for deploying in stage (production-like) and production environments
- PostgreSQL as preferred database system
- Redis as preferred in-memory database/store (great for caching)
The main reason we have chosen Kubernetes over Docker Swarm is related to the following artifacts:
- Key features: easy and flexible installation, clear dashboard, great scaling operations, monitoring as an integral part, great load balancing concepts, and health monitoring with automatic compensation in the event of failure.
- Applications: An application can be deployed using a combination of pods, deployments, and services (or micro-services).
- Functionality: Kubernetes has a complex installation and setup process, but it is not as limited as Docker Swarm.
- Monitoring: It supports multiple versions of logging and monitoring when the services are deployed within the cluster (Elasticsearch/Kibana (ELK), Heapster/Grafana, Sysdig cloud integration).
- Scalability: All-in-one framework for distributed systems.
- Other Benefits: Kubernetes is backed by the Cloud Native Computing Foundation (CNCF), huge community among container orchestration tools, it is an open source and modular tool that works with any OS.
Cassandra
Pros of Cassandra
- Distributed (119)
- High performance (98)
- High availability (81)
- Easy scalability (74)
- Replication (53)
- Reliable (26)
- Multi datacenter deployments (26)
- Schema optional (10)
- OLTP (9)
- Open source (8)
- Workload separation (via MDC) (2)
- Fast (1)
Cons of Cassandra
- Reliability of replication (3)
- Size (1)
- Updates (1)
related Cassandra posts
1.0 of Stream leveraged Cassandra for storing the feed. Cassandra is a common choice for building feeds. Instagram, for instance, started out with Redis but eventually switched to Cassandra to handle their rapid usage growth. Cassandra can handle write-heavy workloads very efficiently.
Cassandra is a great tool that allows you to scale write capacity simply by adding more nodes, though it is also very complex. This complexity made it hard to diagnose performance fluctuations. Even though we had years of experience with running Cassandra, it still felt like a bit of a black box. When building Stream 2.0 we decided to go for a different approach and build Keevo. Keevo is our in-house key-value store built upon RocksDB, gRPC and Raft.
RocksDB is a highly performant embeddable database library developed and maintained by Facebook’s data engineering team. RocksDB started as a fork of Google’s LevelDB that introduced several performance improvements for SSD. Nowadays RocksDB is a project on its own and is under active development. It is written in C++ and it’s fast. Have a look at how this benchmark handles 7 million QPS. In terms of technology it’s much more simple than Cassandra.
This translates into reduced maintenance overhead, improved performance and, most importantly, more consistent performance. It’s interesting to note that LinkedIn also uses RocksDB for their feed.
#InMemoryDatabases #DataStores #Databases
Trying to establish a data lake (or maybe puddle) for my org's Data Sharing project. The idea is that outside partners would send cuts of their PHI data, regardless of format/variables/systems, to our Data Team, who would then harmonize the data, create data marts, and eventually use it for something. End-to-end, I'm envisioning:
- Ingestion->Secure, role-based, self-service portal for users to upload data (1a. bonus points if it can perform basic validations/masking)
- Storage->Amazon S3 seems like the cheapest. We probably won't need anything very big, even at full capacity. Our current storage is a secure Box folder that has ~4GB with several batches of test data, code, presentations, and planning docs.
- Data Catalog-> AWS Glue? Azure Data Factory? Snowplow? is the main difference basically based on the vendor? We also will have Data Dictionaries/Codebooks from submitters. Where would they fit in?
- Partitions-> I've seen Cassandra and YARN mentioned, but have no experience with either
- Processing-> We want to use SAS if at all possible. What will work with SAS code?
- Pipeline/Automation->The check-in and verification processes that have been outlined are rather involved. Some sort of automated messaging or approval workflow would be nice
- I have very little guidance on what a "Data Mart" should look like, so I'm going with the idea that it would be another "experimental" partition. Unless there's an actual mart-building paradigm I've missed?
- An end user might use the catalog to pull certain de-identified data sets from the marts. Again, role-based access and a self-service GUI would be preferable.
I'm the only full-time tech person on this project, but I'm mostly an OOP, HTML, JavaScript, and some SQL programmer. Most of this is out of my repertoire. I've done a lot of research, but I can't be an effective evangelist without hands-on experience. Since we're starting a new year of our grant, they've finally decided to let me try some stuff out. Any pointers would be appreciated!
PostgreSQL
Pros of PostgreSQL
- Relational database (763)
- High availability (510)
- Enterprise class database (439)
- SQL (383)
- SQL + NoSQL (304)
- Great community (173)
- Easy to set up (147)
- Heroku (131)
- Secure by default (130)
- PostGIS (113)
- Supports key-value (50)
- Great JSON support (48)
- Cross platform (34)
- Extensible (33)
- Replication (28)
- Triggers (26)
- Multiversion concurrency control (23)
- Rollback (23)
- Open source (21)
- Heroku add-on (18)
- Stable, simple and good performance (17)
- Powerful (15)
- Let's be serious, what other SQL DB would you go for? (13)
- Good documentation (11)
- Scalable (9)
- Free (8)
- Reliable (8)
- Intelligent optimizer (8)
- Transactional DDL (7)
- Modern (7)
- One-stop solution for all things SQL no matter the OS (6)
- Relational database with MVCC (5)
- Faster development (5)
- Full-text search (4)
- Developer friendly (4)
- Excellent source code (3)
- Free version (3)
- Great DB for transactional systems or applications (3)
- Relational database (3)
- Search (3)
- Open-source (3)
- Text (2)
- Full-text (2)
- Can handle up to petabytes worth of size (1)
- Composability (1)
- Multiple procedural languages supported (1)
- Native (0)
Cons of PostgreSQL
- Table/index bloating (10)
related PostgreSQL posts
The PostgreSQL-related posts are the same two shared above: the DevOps stack write-up under Redis (with PostgreSQL as the preferred database system) and the MongoDB-to-PostgreSQL change data replication story under MongoDB.