Datameer vs Trifacta

Overview

Trifacta: Stacks 19, Followers 41, Votes 0

Datameer: Stacks 5, Followers 12, Votes 0

Datameer vs Trifacta: What are the differences?

1. **Ease of Use**: Datameer provides a user-friendly interface with drag-and-drop functionality, allowing users to easily manipulate large datasets without the need for extensive coding skills. In contrast, Trifacta emphasizes a more visual approach to data preparation, with interactive visualizations that can simplify complex data transformations for non-technical users.

2. **Integration Capabilities**: Datameer integrates seamlessly with Hadoop ecosystems, making it well-suited for processing large-scale data on distributed systems. On the other hand, Trifacta offers broader integration capabilities with a wide range of data sources beyond Hadoop, including cloud-based platforms and traditional databases, providing more flexibility in data connectivity.

3. **Collaboration Features**: Datameer offers robust collaboration features, allowing multiple users to work on the same dataset simultaneously and track changes made by team members. Trifacta is oriented more toward individual data wrangling and may suit independent data preparation projects better, although its feature list does include collaborative data governance.

4. **Advanced Data Profiling**: Datameer includes advanced data profiling tools that give users deeper insight into data quality, distributions, and patterns, enabling more informed data transformation decisions. Trifacta also profiles data, but its profiling may be less comprehensive than Datameer's.

5. **Automation and Workflow Orchestration**: Datameer provides robust automation and scheduling features for data processing workflows, enabling users to streamline repetitive tasks and ensure consistent data processing outcomes. Trifacta may require more manual intervention in workflow orchestration, which matters for organizations looking to optimize their data preparation processes.

6. **Scalability and Performance**: Datameer is known for its scalability and performance capabilities, handling large volumes of data efficiently and supporting parallel processing for faster data transformations. While Trifacta also offers scalability options, Datameer's architecture is specifically optimized for big data environments, making it a preferred choice for organizations dealing with massive datasets and complex processing requirements.

In summary, Datameer and Trifacta offer distinct advantages in ease of use, integration capabilities, collaboration features, data profiling, automation, and scalability, and they cater to different data preparation needs and preferences.

Advice on Trifacta, Datameer

Sarah

Jun 25, 2020

Needs advice on OpenRefine

I'm looking for an open-source/free/cheap tool to clean messy data coming from various travel APIs. We use many different APIs and save the info in our DB. However, many duplicates cannot be easily recognized as such.

We would either write an algorithm or use smart technology/tools with ML to help with product management.

While there are many things to be considered, this is one feature that it should have:

"To avoid confusion, we need to merge the suppliers & products accordingly. Products and suppliers must be able to be merged and assigned separately.

Reason: One supplier may offer different products. E.g., one tour operator offers 3 products via one API, but only 1 product with 3 (or a different number of) variations via a different API. Also, the commission may differ between products, which we need to consider. Very often, products are live (bookable in real time) via one API but not live via the other. E.g., for one supplier, products 1 and 2 are live on API1 and product 3 is not, while API2 provides live availability for products 1, 2, and 3.

Summing up, when merging the suppliers (tour operators) we need to consider:

Are the products the same for all APIs?
Which booking system API gives a better commission? Note: Some APIs charge us 1-5% depending on the monthly sales, which needs to be considered.
Which booking system provides live availability?
Is it the same supplier, or is the name only similar?

Most of the time, the supplier names differ even if they are the same (e.g., API1 often names them XX Pty Ltd, while API2 leaves "Pty Ltd" out). Additionally, the product title, description, etc. differ.

We need to write logic and create an algorithm to find the duplicates and to merge, assign, or (de)activate the respective supplier or product. My previous developer started a module to merge the suppliers, which does not seem to work correctly. Also, it is far too time-consuming given the large number of products that we have.

I would recommend merging, assigning, etc. products and suppliers only if our algorithm says it is a 90-100% match for the supplier/product. Otherwise, admins need to be able to check and modify this. E.g., everything with a lower probability of matching will be matched automatically, but can be undone or modified.

The next time the cron job runs, this needs to be considered to avoid recreating duplicates & creating a mess."
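One possible starting point for the matching logic described in the quoted requirement is sketched below. It is a minimal illustration under stated assumptions, not a finished solution: it uses plain string similarity from Python's standard library (`difflib.SequenceMatcher`) rather than ML, and the suffix list, function names, labels, and the 0.90 threshold are all hypothetical choices made for this example. The `decisions` dictionary stands in for a database table, so that a later cron run reuses earlier results and admin corrections instead of recreating duplicates.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal suffixes to ignore when comparing supplier names.
LEGAL_SUFFIXES = {"pty", "ltd", "limited", "inc", "llc", "gmbh"}


def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, and drop legal suffixes such as 'Pty Ltd'."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two normalized supplier names."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()


def classify_match(a: str, b: str, auto_threshold: float = 0.90) -> str:
    """Label a candidate pair as a confident merge or as one an admin should review."""
    return "auto_merge" if similarity(a, b) >= auto_threshold else "needs_review"


def decide(a: str, b: str, decisions: dict, auto_threshold: float = 0.90) -> str:
    """Reuse a stored decision for this pair if one exists, otherwise classify it.

    `decisions` maps an unordered pair of raw names to a label, so an admin
    override written into it is respected the next time the job runs.
    """
    key = frozenset((a, b))
    if key not in decisions:
        decisions[key] = classify_match(a, b, auto_threshold)
    return decisions[key]
```

For example, `decide("Blue Reef Tours Pty Ltd", "Blue Reef Tours", {})` treats the two made-up names as the same supplier because they are identical once "Pty Ltd" is dropped. Name similarity alone cannot tell apart two different suppliers with similar names, so in practice it would likely be combined with other fields such as address or product titles.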

I am not sure in what way OpenRefine can help to achieve this and what ML tool can be connected to learn from the decisions the product management team makes. Maybe you have an idea of how other travel portals deal with messy data, duplicates, etc.?

I'm looking for the cheapest solution for a start-up, but it should do the work properly.
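Once the same product has been matched across APIs, the merge considerations listed in the quoted requirement (live availability, commission, and the 1-5% fee some APIs charge) could be expressed as a simple ranking. The sketch below is only an illustration: the `Offer` fields, the example figures, and the rule "prefer live availability first, then the higher net commission" are assumptions, not something the post specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """One API's listing of a product (all field names are hypothetical)."""
    api: str
    commission_pct: float  # commission the portal earns on a sale
    api_fee_pct: float     # fee the API charges the portal, e.g. 1-5%
    is_live: bool          # True if the product is bookable in real time

    @property
    def net_commission_pct(self) -> float:
        """Commission left after subtracting the API's own fee."""
        return self.commission_pct - self.api_fee_pct


def pick_preferred(offers: list[Offer]) -> Offer:
    """Prefer live offers; among equally live offers, take the higher net commission."""
    return max(offers, key=lambda o: (o.is_live, o.net_commission_pct))


offers = [
    Offer(api="API1", commission_pct=20.0, api_fee_pct=5.0, is_live=False),
    Offer(api="API2", commission_pct=15.0, api_fee_pct=1.0, is_live=True),
]
preferred = pick_preferred(offers)  # API2, because it is the only live offer
```

Keeping the ranking in one small function makes it easy to change the priority later, for instance if commission should outweigh live availability for some product types.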

19.2k views

Detailed Comparison

Trifacta	Datameer
It is an intelligent platform that interoperates with your data investments. It sits between the data storage and processing environments and the visualization, statistical, or machine learning tools used downstream.	It is a single application that helps you get any data into Hadoop, bring it together, analyze it, and visualize it as quickly and easily as possible. No coding required. Everything in it is self-service and intuitive, from our wizard-based data integration, to a spreadsheet with point-and-click analytics, to our blank canvas for building custom visualizations.
Interactive Exploration; Automated visual representations of data based upon its content in the most compelling visual profile; Predictive Transformation; Intelligent Execution; Collaborative Data Governance.	Data integration; Data visualization; Dynamic data management; Open infrastructure; Pre-built application; Self-service analytics.
Statistics
Stacks 19	Stacks 5
Followers 41	Followers 12
Votes 0	Votes 0
Integrations
Microsoft Azure, Google Cloud Storage, Snowflake, AWS Data Pipeline, Tableau	Amazon S3, Microsoft Azure, MySQL, Oracle, PostgreSQL, Beehive, Snowflake

What are some alternatives to Trifacta, Datameer?

Metabase

It is an easy way to generate charts and dashboards, ask simple ad hoc queries without using SQL, and see detailed information about rows in your Database. You can set it up in under 5 minutes, and then give yourself and others a place to ask simple questions and understand the data your application is generating.

Apache Spark

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

Presto

Distributed SQL Query Engine for Big Data

Amazon Athena

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

Superset

Superset's main goal is to make it easy to slice, dice and visualize data. It empowers users to perform analytics at the speed of thought.

Apache Flink

Apache Flink is an open source system for fast and versatile data analytics in clusters. Flink supports batch and streaming analytics, in one system. Analytical programs can be written in concise and elegant APIs in Java and Scala.

lakeFS

It is an open-source data version control system for data lakes. It provides a “Git for data” platform enabling you to implement best practices from software engineering on your data lake, including branching and merging, CI/CD, and production-like dev/test environments.

Druid

Druid is a distributed, column-oriented, real-time analytics data store that is commonly used to power exploratory dashboards in multi-tenant environments. Druid excels as a data warehousing solution for fast aggregate queries on petabyte sized data sets. Druid supports a variety of flexible filters, exact calculations, approximate algorithms, and other useful calculations.

Cube

Cube: the universal semantic layer that makes it easy to connect BI silos, embed analytics, and power your data apps and AI with context.

Power BI

It aims to provide interactive visualizations and business intelligence capabilities with an interface simple enough for end users to create their own reports and dashboards.
