Alternatives to Apache Oozie

Apache Spark, Airflow, Apache NiFi, Yarn, and Zookeeper are the most popular alternatives and competitors to Apache Oozie.

What is Apache Oozie and what are its top alternatives?

Apache Oozie is a server-based workflow scheduling system for managing Hadoop jobs. Workflows are defined as a collection of control-flow and action nodes in a directed acyclic graph (DAG); control-flow nodes mark the beginning and end of a workflow and provide a mechanism to control its execution path.
Apache Oozie is a tool in the Workflow Manager category of a tech stack.
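
To make that structure concrete, here is a tiny illustrative Python sketch of a workflow as a DAG of control-flow and action nodes. It is a conceptual model only: real Oozie workflows are defined in an XML file (workflow.xml), and the node names below are hypothetical.

    # Conceptual model only: real Oozie workflows are written in XML
    # (workflow.xml). Node names and structure here are hypothetical.
    workflow = {
        "start":     {"type": "control", "to": "import"},                    # entry point
        "import":    {"type": "action", "ok": "transform", "error": "fail"},
        "transform": {"type": "action", "ok": "end", "error": "fail"},
        "fail":      {"type": "control"},                                    # kill node
        "end":       {"type": "control"},                                    # exit point
    }

    def run(workflow, node="start"):
        """Walk the DAG from the start node, assuming every action succeeds."""
        while node not in ("end", "fail"):
            spec = workflow[node]
            print(f"executing {node} ({spec['type']} node)")
            node = spec.get("to") or spec.get("ok")
        print(f"workflow finished at '{node}'")

    run(workflow)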

Top Alternatives to Apache Oozie

  • Apache Spark

    Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning. ...

  • Airflow

    Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed. ...

  • Apache NiFi

    An easy-to-use, powerful, and reliable system to process and distribute data. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. ...

  • Yarn

    Yarn caches every package it downloads, so it never needs to download it again. It also parallelizes operations to maximize resource utilization, so install times are faster than ever. ...

  • Zookeeper

    A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. ...

  • Apache Beam

    It implements batch and streaming data processing jobs that run on any execution engine. It executes pipelines on multiple execution environments. ...

  • GitHub Actions

    It makes it easy to automate all your software workflows, now with world-class CI/CD. Build, test, and deploy your code right from GitHub. Make code reviews, branch management, and issue triaging work the way you want. ...

  • Camunda

    With Camunda, business users collaborate with developers to model and automate end-to-end processes using BPMN-powered flowcharts that run with the speed, scale, and resiliency required to compete in today’s digital-first world ...

Apache Oozie alternatives & related posts

Apache Spark

Fast and general engine for large-scale data processing
PROS OF APACHE SPARK
  • Open-source (60)
  • Fast and flexible (48)
  • Great for distributed SQL-like applications (8)
  • One platform for every big data problem (8)
  • Easy to install and to use (6)
  • Works well for most data science use cases (3)
  • In-memory computation (2)
  • Interactive query (2)
  • Machine learning libraries, streaming in real time (2)
CONS OF APACHE SPARK
  • Speed (3)
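
To ground the in-memory computation and interactive query entries above, here is a minimal PySpark sketch; the input file name and its schema (a "user" field) are hypothetical.

    # Minimal PySpark sketch: cache a dataset in memory, then query it with Spark SQL.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("demo").getOrCreate()

    events = spark.read.json("events.json")  # hypothetical input file with a "user" field
    events.cache()                           # keep the DataFrame in memory across actions

    events.createOrReplaceTempView("events")
    spark.sql("SELECT user, COUNT(*) AS n FROM events GROUP BY user").show()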

related Apache Spark posts

Eric Colson
Chief Algorithms Officer at Stitch Fix · 21 upvotes · 2.7M views

The algorithms and data infrastructure at Stitch Fix are housed in #AWS. Data acquisition is split between events flowing through Kafka and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3-based data warehouse. Apache Spark on Yarn is our tool of choice for data movement and #ETL. Because our storage layer (S3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling Yarn clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for ad hoc queries and dashboards.

Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).

At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated into systems. That requires a serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize those models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn), by automatically packaging them as Docker containers and deploying to Amazon ECS. This provides our data scientists a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.

For more info:

#DataScience #DataStack #Data

Conor Myhrvold
Tech Brand Mgr, Office of CTO at Uber · 7 upvotes · 1.3M views

Why we built Marmaray, an open source generic data ingestion and dispersal framework and library for Apache Hadoop:

Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse to any sink, leveraging the use of Apache Spark. The name, Marmaray, comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:

https://eng.uber.com/marmaray-hadoop-ingestion-open-source/

(Direct GitHub repo: https://github.com/uber/marmaray)

Airflow

A platform to programmatically author, schedule and monitor data pipelines, by Airbnb
PROS OF AIRFLOW
  • Features (50)
  • Task dependency management (14)
  • Beautiful UI (12)
  • Cluster of workers (12)
  • Extensibility (10)
  • Open source (6)
  • Complex workflows (5)
  • Python (5)
  • Good API (3)
  • Apache project (3)
  • Custom operators (3)
  • Dashboard (2)
CONS OF AIRFLOW
  • Running it on a Kubernetes cluster is relatively complex (2)
  • Open source: provides minimal or no support (2)
  • Logical separation of DAGs is not straightforward (1)
  • Observability is not great when DAGs exceed 250 (1)
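
To show what authoring a DAG of tasks and dependencies looks like, here is a minimal sketch against the Airflow 2.x API; the DAG id and task commands are placeholders.

    # Minimal Airflow 2.x DAG sketch: two tasks with an explicit dependency.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="etl_demo",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",  # Airflow 2.4+; older releases use schedule_interval
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extracting")
        load = BashOperator(task_id="load", bash_command="echo loading")

        extract >> load  # load runs only after extract succeeds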

related Airflow posts

Shared insights on Jenkins and Airflow

I am looking for an open-source scheduler tool with cross-functional application dependencies. Some of the tasks I am looking to schedule are as follows:

  1. Trigger Matillion ETL loads
  2. Trigger Attunity Replication tasks that have downstream ETL loads
  3. Trigger Golden gate Replication Tasks
  4. Shell scripts, wrappers, file watchers
  5. Event-driven schedules

I have used Airflow in the past, and I know we need to create DAGs for each pipeline. I am not familiar with Jenkins, but I know it works with configuration without much underlying code. I want to evaluate both and would appreciate any advice.

Shared insights on AWS Step Functions and Airflow

I am working on a project that grabs a set of input data from AWS S3, pre-processes and divvies it up, spins up 10K batch containers to process the divvied data in parallel on AWS Batch, post-aggregates the data, and pushes it to S3.

I already have software patterns from other projects for Airflow + Batch but have not dealt with the scaling factors of 10k parallel tasks. Airflow is nice since I can look at which tasks failed and retry a task after debugging. But dealing with that many tasks on one Airflow EC2 instance seems like a barrier. Another option would be to have one task that kicks off the 10k containers and monitors it from there.

I have no experience with AWS Step Functions but have heard it's AWS's Airflow. There looks to be plenty of patterns online for Step Functions + Batch. Do Step Functions seem like a good path to check out for my use case? Do you get the same insights on failing jobs / ability to retry tasks as you do with Airflow?

Apache NiFi

A reliable system to process and distribute data
PROS OF APACHE NIFI
  • Visual data flows using directed acyclic graphs (DAGs) (16)
  • Free (open source) (8)
  • Simple to use (7)
  • Reactive with back-pressure (5)
  • Scalable horizontally as well as vertically (5)
  • Fast prototyping (4)
  • Bi-directional channels (3)
  • Data provenance (2)
  • Built-in graphical user interface (2)
  • End-to-end security between all nodes (2)
  • Can handle messages up to gigabytes in size (2)
  • HBase support (1)
  • Kudu support (1)
  • Hive support (1)
  • Slack integration (1)
  • Support for custom processors in Java (1)
  • Lots of articles (1)
  • Lots of documentation (1)
CONS OF APACHE NIFI
  • HA support is not fully fledged (2)
  • Memory-intensive (2)

related Apache NiFi posts

I am looking for the best tool to orchestrate #ETL workflows in non-Hadoop environments, mainly for regression testing use cases. Would Airflow or Apache NiFi be a good fit for this purpose?

For example, I want to run an Informatica ETL job and then run an SQL task as a dependency, followed by another task from Jira. What tool is best suited to set up such a pipeline?

Yarn

A new package manager for JavaScript
PROS OF YARN
  • Incredibly fast (84)
  • Easy to use (21)
  • Open source (12)
  • Can install any npm package (10)
  • Works where npm fails (7)
  • Workspaces (6)
  • Incomplete to run tasks (2)
  • Fast (1)
CONS OF YARN
  • Facebook (15)
  • Sends data to Facebook (6)
  • Should be installed separately (3)
  • Cannot publish to a registry other than npm (2)

related Yarn posts

Simon Reymann
Senior Fullstack Developer at QUANTUSflow Software GmbH · 26 upvotes · 3.4M views

Our whole Node.js backend stack consists of the following tools:

  • Lerna as a tool for multi package and multi repository management
  • npm as package manager
  • NestJS as Node.js framework
  • TypeScript as programming language
  • ExpressJS as web server
  • Swagger UI for visualizing and interacting with the API’s resources
  • Postman as a tool for API development
  • TypeORM as object relational mapping layer
  • JSON Web Token for access token management

The main reasons we have chosen Node.js over PHP are the following:

  • Made for the web and widely in use: Node.js is a software platform for developing server-side network services. Well-known projects that rely on Node.js include the blogging software Ghost, the project management tool Trello and the operating system WebOS. Node.js requires the JavaScript runtime environment V8, which was specially developed by Google for the popular Chrome browser. This guarantees a very resource-saving architecture, which qualifies Node.js especially for the operation of a web server. Ryan Dahl, the developer of Node.js, released the first stable version on May 27, 2009. He developed Node.js out of dissatisfaction with the possibilities that JavaScript offered at the time. The basic functionality of Node.js has been mapped with JavaScript since the first version, which can be expanded with a large number of different modules. The current package managers (npm or Yarn) for Node.js know more than 1,000,000 of these modules.
  • Fast server-side solutions: Node.js adopts the JavaScript "event-loop" to create non-blocking I/O applications that conveniently serve simultaneous events. With the standard available asynchronous processing within JavaScript/TypeScript, highly scalable, server-side solutions can be realized. The efficient use of the CPU and the RAM is maximized and more simultaneous requests can be processed than with conventional multi-thread servers.
  • A language along the entire stack: Widely used frameworks such as React or AngularJS or Vue.js, which we prefer, are written in JavaScript/TypeScript. If Node.js is now used on the server side, you can use all the advantages of a uniform script language throughout the entire application development. The same language in the back- and frontend simplifies the maintenance of the application and also the coordination within the development team.
  • Flexibility: Node.js sets very few strict dependencies, rules and guidelines and thus grants a high degree of flexibility in application development. There are no strict conventions so that the appropriate architecture, design structures, modules and features can be freely selected for the development.
Johnny Bell

So when starting a new project you generally have your go-to tools to get your site up and running locally, and some scripts to build out a production version of your site. Create React App is great for that; however, for my projects I feel as though there is too much bloat in Create React App, and if I use it I'm tied to React, which I love, but if I want to switch to Vue or something else I want that flexibility.

So to start everything up and running I clone my personal Webpack boilerplate. This is still on Webpack 3 and does need some updating, but it gets the job done for now. Given the name of the repo you may have guessed that, yes, I am using Webpack as my bundler. I use Webpack because it is so powerful, and even though it has a steep learning curve, once you get it, it's amazing.

The next thing I do is make sure my machine has Node.js configured and the right version installed, then run Yarn. I decided to use Yarn because when I was building out this project npm had some shortcomings, such as no .lock file. I could probably move from Yarn to npm, but I don't really see the point.

I use Babel to transpile all of my #ES6 to #ES5 so the browser can read it, I love Babel and to be honest haven't looked up any other transpilers because Babel is amazing.

Finally when developing I have Prettier setup to make sure all my code is clean and uniform across all my JS files, and ESLint to make sure I catch any errors or code that could be optimized.

I'm really happy with this stack for my local env setup, and I'll probably stick with it for a while.

Zookeeper

Because coordinating distributed systems is a Zoo
PROS OF ZOOKEEPER
  • High performance, easy to generate node-specific config (11)
  • Kafka support (8)
  • Java (8)
  • Spring Boot support (5)
  • Supports extensive distributed IPC (3)
  • Used in ClickHouse (2)
  • Supports DC/OS (2)
  • Embeddable in a Java service (1)
  • Curator (1)
  • Used in Hadoop (1)
CONS OF ZOOKEEPER
Be the first to leave a con
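
As a sketch of the configuration and synchronization services in action, here is an example using kazoo, a widely used third-party Python client for ZooKeeper; the ensemble address, znode path, and payload are all hypothetical.

    # Sketch with kazoo (third-party Python client): publish a config value
    # to a znode and watch it for changes. Paths and values are hypothetical.
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")   # assumes a local ZooKeeper ensemble
    zk.start()

    zk.ensure_path("/app/config")              # create the znode if missing
    zk.set("/app/config", b"feature_flag=on")  # publish shared configuration

    # Every process that registers this watch is notified when the value changes.
    @zk.DataWatch("/app/config")
    def on_update(data, stat):
        print("config is now:", data)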

related Zookeeper posts

Apache Beam

A unified programming model
PROS OF APACHE BEAM
  • Open-source (5)
  • Cross-platform (5)
  • Portable (2)
  • Unified batch and stream processing (2)
CONS OF APACHE BEAM
Be the first to leave a con
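
To illustrate the unified model and runner portability, here is a minimal sketch with the Beam Python SDK; it runs locally on the DirectRunner, and pointing the runner option at, say, FlinkRunner is how the same pipeline would target another engine.

    # Minimal Apache Beam sketch: a batch word count on the local DirectRunner.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(runner="DirectRunner")  # swap for FlinkRunner, DataflowRunner, ...

    with beam.Pipeline(options=options) as p:
        (
            p
            | "Create" >> beam.Create(["alpha", "beta", "alpha"])
            | "PairWithOne" >> beam.Map(lambda word: (word, 1))
            | "CountPerKey" >> beam.CombinePerKey(sum)
            | "Print" >> beam.Map(print)
        )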

related Apache Beam posts

I have to build a data processing application with an Apache Beam stack and an Apache Flink runner on an Amazon EMR cluster. I saw some instability with the process, and the EMR clusters keep going down. Here, the Apache Beam application gets inputs from Kafka and sends the accumulated data streams to another Kafka topic. Any advice on how to make the process more stable?

GitHub Actions

Automate your workflow from idea to production
PROS OF GITHUB ACTIONS
  • Integration with GitHub (7)
  • Free (5)
  • Easy to duplicate a workflow (3)
  • Ready actions in Marketplace (3)
  • Configs stored in .github (2)
  • Docker support (2)
  • Active development roadmap (1)
  • Fast (1)
CONS OF GITHUB ACTIONS
  • Lacking [skip ci] (5)
  • Lacking allow failure (4)
  • Lacking job-specific badges (3)
  • No SSH login to servers (2)
  • No deployment projects (1)
  • No manual launch (1)

related GitHub Actions posts

Somnath Mahale
Engineering Leader at Altimetrik Corp. · 8 upvotes · 164.8K views

I am in the process of evaluating CircleCI, Drone.io, and GitHub Actions to cover my #CI/CD needs. I would appreciate your advice on a comparative study w.r.t. attributes like language-inclusive support, code-base integration, performance, cost, maintenance, support, ease of use, and the ability to deal with big projects, based on actual industry experience.

Thanks in advance!

Omkar Kulkarni
DevOps Engineer at LTI · 3 upvotes · 74.7K views
Shared insights on GitLab and GitHub Actions

Hello everyone, can someone please help me understand the difference between GitHub Actions and GitLab? I have been trying to understand them, but I still don't see exactly how they differ.

Camunda

The Universal Process Orchestrator
PROS OF CAMUNDA
Be the first to leave a pro
CONS OF CAMUNDA
Be the first to leave a con

related Camunda posts