What is Spring Batch and what are its top alternatives?
Spring Batch is a lightweight, open-source framework for building robust batch applications on the JVM. It provides reusable building blocks for processing large volumes of records (chunk-oriented readers, processors, and writers) along with transaction management, restart, skip and retry logic, and job metadata tracking.
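To make that concrete, here is a minimal sketch of a single-step Spring Batch job. It assumes the Spring Batch 4.x-style builder factories; the class, bean, and step names are illustrative, not taken from any particular project.

```java
// Minimal single-step Spring Batch job (sketch, assuming Spring Batch 4.x builder factories).
// Class, bean, and step names are illustrative only.
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class SampleJobConfig {

    @Bean
    public Step reportStep(StepBuilderFactory steps) {
        // A tasklet step: runs its logic once per job execution and then finishes.
        return steps.get("reportStep")
                .tasklet((contribution, chunkContext) -> {
                    System.out.println("Generating report...");
                    return RepeatStatus.FINISHED;
                })
                .build();
    }

    @Bean
    public Job reportJob(JobBuilderFactory jobs, Step reportStep) {
        // The job runs the single step; restart and execution metadata are
        // handled by the framework's job repository.
        return jobs.get("reportJob")
                .start(reportStep)
                .build();
    }
}
```

Run inside a Spring Boot application with the Batch starter and a configured DataSource for the job repository, a job like this is typically launched automatically on startup.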
Top Alternatives to Spring Batch
- Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. ...
- Talend
It is an open-source software integration platform that helps you effortlessly turn data into business insights. It uses native code generation that lets you run your data pipelines seamlessly across all cloud providers and get optimized performance on all platforms. ...
- Spring Boot
Spring Boot makes it easy to create stand-alone, production-grade, Spring-based applications that you can "just run". We take an opinionated view of the Spring platform and third-party libraries so you can get started with minimum fuss. Most Spring Boot applications need very little Spring configuration. ... (See the minimal sketch after this list.)
- Apache Spark
Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning. ... (See the Java word-count sketch after this list.)
- Kafka
Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design. ... (See the producer sketch after this list.)
- AWS Batch
It enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. It dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory optimized instances) based on the volume and specific resource requirements of the batch jobs submitted. ... (See the job-submission sketch after this list.)
- Node.js
Node.js uses an event-driven, non-blocking I/O model that makes it lightweight and efficient, perfect for data-intensive real-time applications that run across distributed devices. ...
- Django
Django is a high-level Python Web framework that encourages rapid development and clean, pragmatic design. ...
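For Spring Boot (referenced above), here is a minimal sketch of the "just run" experience, assuming the spring-boot-starter-web dependency is on the classpath; the class name and endpoint are illustrative:

```java
// Minimal runnable Spring Boot application with one HTTP endpoint (illustrative sketch).
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@SpringBootApplication
@RestController
public class DemoApplication {

    @GetMapping("/hello")
    public String hello() {
        return "Hello from Spring Boot";
    }

    public static void main(String[] args) {
        // Auto-configuration starts an embedded web server; no XML or external container needed.
        SpringApplication.run(DemoApplication.class, args);
    }
}
```

Packaging and running the jar is enough to serve the endpoint; there is no external application server to install.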
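For Apache Spark, a word-count sketch using the Java RDD API illustrates its batch-processing side; the application name, master setting, and file paths are placeholders:

```java
// Word count with Spark's Java RDD API (sketch; paths and master setting are placeholders).
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("word-count").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///tmp/input.txt"); // placeholder path
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);
            counts.saveAsTextFile("hdfs:///tmp/word-counts"); // placeholder path
        }
    }
}
```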
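For Kafka, a minimal producer sketch using the official Java client; the broker address, topic, key, and value are placeholders:

```java
// Publishing a message to a Kafka topic with the Java client (sketch; all names are placeholders).
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class EventPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");           // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key land on the same partition, preserving their order.
            producer.send(new ProducerRecord<>("user-events", "user-42", "signed_up"));
        }
    }
}
```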
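For AWS Batch, a sketch of submitting a job with the AWS SDK for Java v2. It assumes a job queue and job definition have already been created in the account; the names used here are placeholders:

```java
// Submitting a job to an existing AWS Batch queue with the AWS SDK for Java v2
// (sketch; queue and job definition names are placeholders and must already exist).
import software.amazon.awssdk.services.batch.BatchClient;
import software.amazon.awssdk.services.batch.model.SubmitJobRequest;
import software.amazon.awssdk.services.batch.model.SubmitJobResponse;

public class SubmitBatchJob {
    public static void main(String[] args) {
        try (BatchClient batch = BatchClient.create()) {
            SubmitJobRequest request = SubmitJobRequest.builder()
                    .jobName("nightly-report")              // placeholder name
                    .jobQueue("reporting-queue")            // placeholder, pre-created queue
                    .jobDefinition("report-generator:1")    // placeholder, pre-registered definition
                    .build();
            SubmitJobResponse response = batch.submitJob(request);
            System.out.println("Submitted job id: " + response.jobId());
        }
    }
}
```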
Spring Batch alternatives & related posts
Hadoop
Pros of Hadoop
- Great ecosystem (39)
- One stack to rule them all (11)
- Great load balancer (4)
- Amazon AWS (1)
- Java syntax (1)
related Hadoop posts
The early data ingestion pipeline at Pinterest used Kafka as the central message transporter, with the app servers writing messages directly to Kafka, which then uploaded log files to S3.
For databases, a custom Hadoop streamer pulled database data and wrote it to S3.
Challenges cited for this infrastructure included high operational overhead, as well as potential data loss occurring when Kafka broker outages led to an overflow of in-memory message buffering.
Why we built Marmaray, an open source generic data ingestion and dispersal framework and library for Apache Hadoop:
Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse to any sink, leveraging Apache Spark. The name, Marmaray, comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:
https://eng.uber.com/marmaray-hadoop-ingestion-open-source/
(Direct GitHub repo: https://github.com/uber/marmaray)
related Talend posts
Spring Boot
Pros of Spring Boot
- Powerful and handy (142)
- Easy setup (133)
- Java (125)
- Spring (90)
- Fast (85)
- Extensible (46)
- Lots of "off the shelf" functionalities (37)
- Cloud Solid (32)
- Caches well (26)
- Many recipes around for obscure features (24)
- Productive (24)
- Modular (23)
- Integrations with most other Java frameworks (23)
- Spring ecosystem is great (22)
- Fast performance with microservices (21)
- Auto-configuration (20)
- Community (18)
- Easy setup, community support, solid for ERP apps (17)
- One-stop shop (15)
- Easy to parallelize (14)
- Cross-platform (14)
- Easy setup, good for building ERP systems, well documented (13)
- Powerful 3rd-party libraries and frameworks (13)
- Easy setup, Git integration (12)
- It's so much easier to start a project on Spring (5)
- Kotlin (4)
- The ability to integrate with the open source ecosystem (1)
- Microservices and reactive programming (1)
Cons of Spring Boot
- Heavy weight (23)
- Annotation ceremony (18)
- Java (13)
- Many config files needed (11)
- Reactive (5)
- Excellent tools for cloud hosting, since 5.x (4)
related Spring Boot posts
We are in the process of building a modern content platform to deliver our content through various channels. We decided to go with Microservices architecture as we wanted scale. Microservice architecture style is an approach to developing an application as a suite of small, independently deployable services built around specific business capabilities. You can gain modularity, extensive parallelism and cost-effective scaling by deploying services across many distributed servers. Microservices modularity facilitates independent updates/deployments, and helps to avoid a single point of failure, which can help prevent large-scale outages. We also decided to use the Event-Driven Architecture pattern, which is a popular distributed asynchronous architecture pattern used to produce highly scalable applications. The event-driven architecture is made up of highly decoupled, single-purpose event processing components that asynchronously receive and process events.
To build our #Backend capabilities we decided to use the following: 1. #Microservices - Java with Spring Boot, Node.js with ExpressJS and Python with Flask 2. #Eventsourcingframework - Amazon Kinesis, Amazon Kinesis Firehose, Amazon SNS, Amazon SQS, AWS Lambda 3. #Data - Amazon RDS, Amazon DynamoDB, Amazon S3, MongoDB Atlas
To build #Webapps we decided to use Angular 2 with RxJS
#Devops - GitHub, Travis CI, Terraform, Docker, Serverless
Is learning Spring and Spring Boot for web app back-end development still relevant in 2021? Feel free to share your views in comparison to Django, Node.js/ExpressJS, or other frameworks.
Please share some good beginner resources for learning the Spring/Spring Boot framework to build web apps.
Apache Spark
Pros of Apache Spark
- Open-source (60)
- Fast and Flexible (48)
- Great for distributed SQL-like applications (8)
- One platform for every big data problem (8)
- Easy to install and to use (6)
- Works well for most data science use cases (3)
- In-memory computation (2)
- Interactive query (2)
- Machine learning libraries, streaming in real time (2)
Cons of Apache Spark
- Speed (3)
related Apache Spark posts
The algorithms and data infrastructure at Stitch Fix is housed in #AWS. Data acquisition is split between events flowing through Kafka, and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3-based data warehouse. Apache Spark on Yarn is our tool of choice for data movement and #ETL. Because our storage layer (S3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling Yarn clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for ad hoc queries and dashboards.
Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).
At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated into systems. That requires a serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize those models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn), by automatically packaging them as Docker containers and deploying to Amazon ECS. This provides our data scientists a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.
For more info:
- Our Algorithms Tour: https://algorithms-tour.stitchfix.com/
- Our blog: https://multithreaded.stitchfix.com/blog/
- Careers: https://multithreaded.stitchfix.com/careers/
#DataScience #DataStack #Data
Kafka
Pros of Kafka
- High-throughput (126)
- Distributed (119)
- Scalable (92)
- High-performance (86)
- Durable (66)
- Publish-subscribe (38)
- Simple to use (19)
- Open source (18)
- Written in Scala and Java; runs on the JVM (11)
- Message broker + streaming system (8)
- Robust (4)
- Avro schema integration (4)
- KSQL (4)
- Supports multiple clients (3)
- Partitioned, replayable log (2)
- Simple publisher / multi-subscriber model (1)
- Flexible (1)
- Extremely good parallelism constructs (1)
- Fun (1)
Cons of Kafka
- Non-Java clients are second-class citizens (32)
- Needs Zookeeper (29)
- Operational difficulties (9)
- Terrible packaging (4)
related Kafka posts
As we've evolved or added additional infrastructure to our stack, we've biased towards managed services. Most new backing stores are Amazon RDS instances now. We do use self-managed PostgreSQL with TimescaleDB for time-series data—this is made HA with the use of Patroni and Consul.
We also use managed Amazon ElastiCache instances instead of spinning up Amazon EC2 instances to run Redis workloads, as well as shifting to Amazon Kinesis instead of Kafka.
AWS Batch
Pros of AWS Batch
- Containerized (3)
- Scalable (3)
Cons of AWS Batch
- More overhead than Lambda (2)
- Image management (1)
related AWS Batch posts
Node.js
Pros of Node.js
- Npm (1.4K)
- Javascript (1.3K)
- Great libraries (1.1K)
- High-performance (1K)
- Open source (802)
- Great for APIs (485)
- Asynchronous (475)
- Great community (421)
- Great for realtime apps (390)
- Great for command line utilities (296)
- Websockets (82)
- Node Modules (82)
- Uber simple (69)
- Great modularity (59)
- Allows us to reuse code in the frontend (58)
- Easy to start (42)
- Great for data streaming (35)
- Realtime (32)
- Awesome (28)
- Non-blocking IO (25)
- Can be used as a proxy (18)
- High performance, open source, scalable (17)
- Non-blocking and modular (16)
- Easy and fun (15)
- Easy and powerful (14)
- Future of backend (13)
- Same lang as AngularJS (13)
- Fullstack (12)
- Fast (11)
- Scalability (10)
- Cross-platform (10)
- Simple (9)
- MEAN stack (8)
- Great for webapps (7)
- Easy concurrency (7)
- React (6)
- Fast, simple code and async (6)
- Friendly (6)
- TypeScript (6)
- Fast development (5)
- It's amazingly fast and scalable (5)
- Easy to use and fast and goes well with JSON DBs (5)
- Scalable (5)
- Great speed (5)
- Control everything (5)
- Easy to use (4)
- It's fast (4)
- Isomorphic coolness (4)
- Easy (3)
- Easy to learn (3)
- Great community (3)
- Not Python (3)
- Super easy for backend connectivity (3)
- TypeScript support (3)
- Scales, fast, simple, great community, npm, express (3)
- One language, end-to-end (3)
- Less boilerplate code (3)
- Performant and fast prototyping (3)
- Blazing fast (3)
- Npm i ape-updating (2)
- Event-driven (2)
- Lovely (2)
- Great for APIs (1)
- Node (0)
Cons of Node.js
- Bound to a single CPU (46)
- New framework every day (44)
- Lots of terrible examples on the internet (38)
- Asynchronous programming is the worst (31)
- Callback (23)
- Javascript (18)
- Dependency based on GitHub (11)
- Dependency hell (11)
- Low computational power (10)
- Very very slow (7)
- Can block whole server easily (7)
- Callback functions may not fire in expected sequence (6)
- Unneeded over-complication (3)
- Unstable (3)
- Breaking updates (3)
- No standard approach (2)
- Bad transitive dependency management (1)
- Can't read server session (1)
related Node.js posts
When I joined NYT there was already broad dissatisfaction with the LAMP (Linux, Apache HTTP Server, MySQL, PHP) stack and the front end framework, in particular. So, I wasn't passing judgment on it. I mean, LAMP's fine, you can do good work in LAMP. It's a little dated at this point, but it's not ... I didn't want to rip it out for its own sake, but everyone else was like, "We don't like this, it's really inflexible." And I remember from being outside the company when that was called MIT FIVE when it had launched. And been observing it from the outside, and I was like, you guys took so long to do that and you did it so carefully, and yet you're not happy with your decisions. Why is that? That was more the impetus. If we're going to do this again, how are we going to do it in a way that we're gonna get a better result?
So we're moving quickly away from LAMP, I would say. So, right now, the new front end is React based and using Apollo. And we've been in a long, protracted, gradual rollout of the core experiences.
React is now talking to GraphQL as a primary API. There's a Node.js back end, to the front end, which is mainly for server-side rendering, as well.
Behind there, the main repository for the GraphQL server is a big table repository that we call Bodega because it's a convenience store. And that reads off of a Kafka pipeline.
How Uber developed the open source, end-to-end distributed tracing system Jaeger, now a CNCF project:
Distributed tracing is quickly becoming a must-have component in the tools that organizations use to monitor their complex, microservice-based architectures. At Uber, our open source distributed tracing system Jaeger saw large-scale internal adoption throughout 2016, integrated into hundreds of microservices and now recording thousands of traces every second.
Here is the story of how we got here, from investigating off-the-shelf solutions like Zipkin, to why we switched from pull to push architecture, and how distributed tracing will continue to evolve:
https://eng.uber.com/distributed-tracing/
(GitHub Pages: https://www.jaegertracing.io/, GitHub: https://github.com/jaegertracing/jaeger)
Bindings/Operator: Python, Java, Node.js, Go, C++, Kubernetes, JavaScript, OpenShift, C#, Apache Spark
Django
Pros of Django
- Rapid development (660)
- Open source (480)
- Great community (416)
- Easy to learn (371)
- MVC (271)
- Beautiful code (225)
- Elegant (217)
- Free (201)
- Great packages (198)
- Great libraries (186)
- RESTful (74)
- Comes with auth and CRUD admin panel (73)
- Powerful (72)
- Great documentation (69)
- Great for web (65)
- Python (52)
- Great ORM (39)
- Great for APIs (37)
- All included (28)
- Fast (25)
- Web apps (23)
- Clean (21)
- Used by top startups (20)
- Easy setup (19)
- Sexy (18)
- ORM (14)
- Convention over configuration (14)
- Allows for very rapid development with great libraries (13)
- The Django community (12)
- Great MVC and templating engine (10)
- King of backend world (10)
- Full stack (8)
- It's elegant and practical (7)
- Batteries included (7)
- Cross-platform (6)
- Very quick to get something up and running (6)
- Have not found anything that it can't do (6)
- Fast prototyping (6)
- MVT (6)
- Easy structure, useful inbuilt libraries (5)
- Zero code burden to change databases (5)
- Easy to develop end-to-end AI models (5)
- Map (4)
- Python community (4)
- Easy to use (4)
- Easy to change database manager (4)
- Modular (4)
- Great performance (4)
- Easy (4)
- Many libraries (4)
- Full-text search (3)
- Just the right level of abstraction (3)
- Scaffold (3)
- Scalable (1)
- Node.js (1)
- Rails (0)
- FastAPI (0)
Cons of Django
- Underpowered templating (26)
- Autoreload restarts whole server (22)
- Underpowered ORM (22)
- URL dispatcher ignores HTTP method (15)
- Internal subcomponents coupling (10)
- Not Node.js (8)
- Configuration hell (8)
- Admin (7)
- Documentation not as clean and nice as Laravel's (5)
- Python (3)
- Not typed (3)
- Bloated admin panel included (3)
- Overwhelming folder structure (2)
- Ineffective multithreading (2)
- Not type safe (1)
related Django posts
Simple controls over complex technologies, as we put it, wouldn't be possible without neat UIs for our user areas including start page, dashboard, settings, and docs.
Initially, there was Django. Back in 2011, considering our Python-centric approach, that was the best choice. Later, we realized we needed to iterate on our website more quickly. And this led us to detaching Django from our front end. That was when we decided to build an SPA.
For building user interfaces, we're currently using React as it provided the fastest rendering back when we were building our toolkit. It’s worth mentioning Uploadcare is not a front-end-focused SPA: we aren’t running at high levels of complexity. If it were, we’d go with Ember.js.
However, there's a chance we will shift to the faster Preact, with its motto of using as little code as possible, and because it makes more use of browser APIs. One of our future tasks for our front end is to configure our Webpack bundler to split up the code for different site sections. For styles, we use PostCSS along with its plugins such as cssnano which minifies all the code.
All that allows us to provide a great user experience and quickly implement changes where they are needed with as little code as possible.
Hey, so I developed a basic application with Python. But to use it, you need a Python interpreter. I want to add a GUI to make it more appealing. What should I choose to develop a GUI? I have very basic skills in front-end development (CSS, JavaScript). I am fluent in Python. I'm looking for a tool that is easy to use and doesn't require too much code knowledge. I have recently tried out Flask, but it is kinda complicated. Should I stick with it, move to Django, or is there another nice framework to use?