I have started using AWS Batch for some long-running ML inference jobs. So far it is working well and gives decent performance, and since it is fully managed, it saves a lot of extra work. However, Batch takes a good amount of time to create a new cluster and then load the job based on the priority of the queue. Going forward, I would love to put effort into something that is fast to start and gives more flexibility as well. What other tools would you suggest for long-running backend jobs that can scale well? I am not looking for something fully managed, so please ignore the options similar to Batch in Google Cloud Platform or Microsoft Azure; I am looking for open-source alternatives here. Do you think Kubernetes, RabbitMQ/Kafka would be a good fit, or just overkill for my problem? We usually get thousands of requests in parallel, and each job might take 20-30 minutes on a 2 vCPU system.