Reconciling Disparate Data Sources with aiMatch and eXpert-in-the-Loop
Last update: July 17, 2024
Introduction
The goal of applying AI/ML to enterprise data is to extract meaningful patterns and insights, enabling impactful decision making. To achieve this end, it is essential to understand how all of the data fits together, so meaningful relationships can be surfaced. However, enterprise data is often siloed across multiple data sources, primarily due to different aspects of the business being managed within different processes and tools, resulting in many partial sources of truth. To gain a holistic understanding of the data, it is necessary to stitch the information together across these disparate sources.
In an ideal scenario, where data sources are designed carefully by a centralized system designer, a natural way to stitch them together would be through traditional database or data warehouse operations such as joins relying on shared identifiers or keys.
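For instance, when two systems do share a clean key, a single join is all that is needed. Below is a minimal sketch in Python with pandas, where the tables, column names, and the shared txn_id key are all invented for illustration:

```python
import pandas as pd

# Two toy tables that happen to share a clean key ("txn_id"),
# which is rarely the case with real enterprise systems.
gl = pd.DataFrame({
    "txn_id": [101, 102, 103],
    "gl_account": ["Sales", "Rent", "Utilities"],
    "credit": [500.0, 0.0, 0.0],
})
bank = pd.DataFrame({
    "txn_id": [101, 103],
    "value_date": ["2024-07-01", "2024-07-02"],
    "amount": [500.0, -300.0],
})

# With a shared identifier, reconciliation reduces to a single join.
merged = gl.merge(bank, on="txn_id", how="left")
print(merged)
```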
However, the reality is that such data sources are rarely designed with easy integration in mind. To make matters worse, the data often contains missing values as well as errors and anomalies, making automation of data unification a challenging task.
aiMatch, a generative AI solution from Ikigai, is designed to address precisely this challenge. With a computationally efficient approach built on large graphical models (LGMs), Ikigai solutions such as aiMatch are able to solve large and critical problems for the enterprise.
For the problem of data reconciliation, aiMatch is able to bring together previously disparate datasets by matching data across tables with AI and human oversight, known as expert in the loop (XitL). Below we walk through an example of a typical business scenario that is supported by aiMatch.
Data input: Two mismatched tables
Consider the following business scenario: a large business with a complex network of suppliers and vendors relies on its general ledger (GL) system to maintain its overall financial records, while using a separate cash management system to manage banking transactions and cash flows across multiple banking partners and payment processors.
When the company makes a sale, pays a vendor, or incurs an expense, the transaction is recorded in the GL system using the appropriate GL account names, debit amounts, and credit amounts. These transactions are summarized and batched before being entered into the GL system, with each batch assigned a unique Batch ID. The company’s bank transactions, such as customer payments, vendor payments, and bank charges, are recorded in the cash management system, where each transaction is assigned a unique statement item (SI) number, Reference ID, and other relevant details like value date, posting date, and transaction reference.
While each system serves its purpose, reconciling the data across them is no easy task: the output of these systems is two mismatched tables that the business must bring into agreement. In the next several paragraphs, we show how aiMatch can be used to reconcile data between two such disparate sources.
Step 1: Schema mapping
To reconcile these datasets, we must identify which rows in Table 1 match which rows in Table 2.
To find matches, or more general similarities, between any pair of rows, we must first understand which columns in Table 1 have affinity with which columns in Table 2. To accomplish this, aiMatch generates a schema mapping out-of-the-box. For the example tables shown above, the following is the output generated by aiMatch.
When a user runs aiMatch, the system suggests the top matches out-of-the-box, as shown in the example above. It shows that certain columns in Table 1 (under Left Column) are similar to columns in Table 2 (under Right Column), each with a similarity score (under Weights). For example, if two columns match perfectly, their similarity score will be 1; if they do not match at all, their similarity score will be 0.
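Ikigai does not publish aiMatch's internals, but the idea of a column-affinity score can be illustrated with a simple heuristic that blends header-name similarity with value overlap. All names below are hypothetical, and this is a toy stand-in, not aiMatch's actual method:

```python
from difflib import SequenceMatcher

def column_similarity(name_a, values_a, name_b, values_b):
    """Toy affinity score in [0, 1]: the average of header-name
    similarity and the Jaccard overlap of the column values."""
    name_score = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    set_a, set_b = set(values_a), set(values_b)
    value_score = len(set_a & set_b) / max(len(set_a | set_b), 1)
    return 0.5 * name_score + 0.5 * value_score

# Near-identical columns score close to 1; unrelated columns score near 0.
print(column_similarity("Batch ID", ["B-1", "B-2"], "BatchID", ["B-1", "B-2"]))
print(column_similarity("Batch ID", ["B-1", "B-2"], "Value Date", ["2024-07-01"]))
```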
Step 2: Expert feedback to improve schema mapping
On the first attempt, the out-of-the-box column mapping is likely to contain some errors. These can be corrected quickly with the help of an expert in the loop (XitL), who can remove suggested matches and add new ones to improve aiMatch's output.
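The following sketch shows how such expert corrections might be applied on top of a suggested mapping; the dictionary structure and function are assumptions for illustration, not aiMatch's actual interface:

```python
# Suggested mapping: (left column, right column) -> similarity weight.
suggested = {
    ("Batch ID", "Reference ID"): 0.92,
    ("Debit Amount", "Value Date"): 0.41,  # a spurious suggestion
    ("Credit Amount", "Amount"): 0.78,
}

def apply_expert_feedback(mapping, removals=(), additions=()):
    """Drop pairs the expert rejects and pin pairs the expert confirms."""
    reviewed = {pair: w for pair, w in mapping.items() if pair not in removals}
    for pair in additions:
        reviewed[pair] = 1.0  # expert-confirmed matches get full weight
    return reviewed

mapping = apply_expert_feedback(
    suggested,
    removals=[("Debit Amount", "Value Date")],
    additions=[("GL Account", "Transaction Reference")],
)
```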
Step 3: Row matching
aiMatch uses the schema mapping to determine matches between rows across Tables 1 and 2. This results in some rows being matched while others remain unmatched. In the image below, we see that after the first round of matching, more than 10% of the rows remain unmatched across the tables; the distribution of pairwise row similarities is also shown.
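Conceptually, once a schema mapping is in place, each candidate row pair can be scored as a weighted agreement across the mapped columns, and pairs that clear a threshold are matched while the rest surface as exceptions. A simplified sketch, with illustrative weights, field names, and threshold:

```python
def row_similarity(row_a, row_b, mapping):
    """Weighted fraction of mapped column pairs on which two rows agree."""
    total = sum(mapping.values())
    agree = sum(w for (left, right), w in mapping.items()
                if row_a.get(left) == row_b.get(right))
    return agree / total if total else 0.0

mapping = {("Batch ID", "Reference ID"): 0.92, ("Credit Amount", "Amount"): 0.78}
row_a = {"Batch ID": "B-17", "Credit Amount": 500.0}
row_b = {"Reference ID": "B-17", "Amount": 500.0}

# Rows whose similarity clears a tunable threshold are matched; the
# remainder become the "unmatched" exceptions a reviewer inspects.
THRESHOLD = 0.8
print(row_similarity(row_a, row_b, mapping) >= THRESHOLD)  # True
```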
Once a user clicks into the data to inspect the exceptions across the tables, it quickly becomes clear that both tables contain rows that obviously should have been matched.
Step 4: Expert feedback to improve row matching
While aiMatch uses ML to identify potential matches between the fields across different systems, the real power of the tool lies in the way it solicits expert feedback. Rather than requiring users to go through the entire collection of matched and unmatched sets, aiMatch solicits input from reviewers in the form of a thumbs-up or thumbs-down on a small number of carefully chosen matches and non-matches.
As feedback is provided, aiMatch instantly incorporates and learns from the feedback, improving its ability to find additional matches.
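One simple way to picture this incremental learning is a re-weighting rule that nudges column weights up when they agree on an expert-approved pair and down when they agree on a rejected one. This toy update is an assumption for illustration, not aiMatch's actual algorithm:

```python
def update_weights(mapping, left_row, right_row, label, lr=0.2):
    """Nudge column weights using one review: label is +1 for a
    thumbs-up (confirmed match) and -1 for a thumbs-down.

    Column pairs that agreed on the reviewed rows gain weight after a
    thumbs-up and lose weight after a thumbs-down, clipped to [0, 1].
    """
    for (left, right), w in mapping.items():
        if left_row.get(left) == right_row.get(right):
            mapping[(left, right)] = min(1.0, max(0.0, w + lr * label))
    return mapping

mapping = {("Batch ID", "Reference ID"): 0.6}
update_weights(mapping, {"Batch ID": "B-17"}, {"Reference ID": "B-17"}, label=+1)
print(mapping)  # weight nudged from 0.6 toward 1.0
```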
Using the new information from the expert review, the system goes back to find all possible matches. In the image below, we can see that all data has been matched apart from 6 exceptions, which remain unmatched only because there was nothing left in Table 1 to match them to.
With the data now cleanly reconciled, the business is prepared to close out its books or perform analysis on the integrated dataset.
aiMatch: Use case example
Retail Promotions Effectiveness: Matching and Analysis Across Data Sources
A retailer is looking to better understand how different promotions have driven demand across a wide set of products sold through online and in-person stores, and whether there are trends across geographies and consumer categories. To accomplish this, the retailer will need to look across a wide variety of datasets, including:
- Promotion details such as start/end dates, discount amounts, and promotion types
- Promotion targeting information such as customer segments and loyalty tiers, geographic regions, or channels
- Promotion performance metrics such as sales revenue generated, units sold, redemption rates, and profit margins
Bringing the data together, and ensuring it is fit for analysis, is no easy task, as much of the data lives in different systems and is governed by different schemas and naming conventions. Basic promotion details (name, dates, and eligible products) are stored in the ERP, while a recently implemented CRM, with its own schema for storing promotional data, houses targeted customer segments and creative assets. To further complicate matters, the retailer has recently acquired a smaller company that uses a legacy POS system, whose schema for product information differs from that of the retailer's custom PIM system. Integrating the acquired company's data into the existing infrastructure adds another layer of complexity to an already diverse data landscape.
Across all these systems, the data schemas lack uniformity, and the degree of data completeness varies depending on the type of information each system stores. Despite these challenges, harmonizing these disparate data sources is crucial for gaining a comprehensive view of promotions and making data-driven decisions.
This is the perfect application of aiMatch. Rather than spending valuable resources to manually reconcile data across sources, the retailer can use aiMatch to largely automate the task. With the review of an expert in the loop, the retailer will benefit from automation as well as accuracy, making quick work of identifying all possible matches, and preparing the data for further analysis.
Conclusion
Connecting and reconciling disparate data sources is a common problem for businesses across industries, often accounting for more than 80% of a data analyst's time. Ikigai automates this process with aiMatch, built on its patented Large Graphical Models, to harmonize data across tables for greater efficiency and accuracy. aiMatch integrates human intuition and expertise through its eXpert-in-the-Loop feature to quickly address anomalies and exceptions, continuously improving model confidence for higher data quality and better decision-making.
Additional resources
Glossary
aiCast
aiCast is a forecasting AI model based on patented Large Graphical Models (LGM). It is designed to predict future trends and outcomes based on both historical tabular and time series data and real-time data. aiCast generates 20% more accurate forecasts than traditional models and methods, even with sparse data.
aiMatch
aiMatch is a data reconciliation AI model based on patented Large Graphical Models (LGM). It automates the process of connecting and harmonizing disparate datasets, ensuring consistency and accuracy across multiple sources. By utilizing advanced pattern recognition and probabilistic techniques, aiMatch enables identification and resolution of inconsistencies and can synthesize new values to address missing or incorrect data.
aiPlan
aiPlan is a scenario planning AI model based on Large Graphical Models (LGM) which can generate and evaluate up to 10¹⁹ scenarios based on complex datasets. By simulating various potential outcomes and their likelihoods, aiPlan enhances scenario planning by providing insights into risks, opportunities, and strategic responses for organizations to navigate uncertainties.
eXpert-in-the-Loop
"eXpert-in-the-loop" (Xitl) refers to a hybrid approach in artificial intelligence where human expertise is integrated into the machine learning process. This methodology involves combining the capabilities of machine learning algorithms with human domain knowledge or judgment to improve the accuracy, efficiency, and interpretability of AI systems.
Large Graphical Model (LGM)
A Large Graphical Model is a generative AI model that produces a graph to represent the conditional dependencies between a set of random variables. It is designed to work with enterprise-specific or proprietary data sources, such as tabular and time series data used in data reconciliation, forecasting, and scenario planning.
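To make this concrete, the toy Python sketch below represents a handful of invented variables and their conditional dependencies as a graph; real LGMs are far larger and are learned from data:

```python
# A toy directed graph of conditional dependencies: each variable
# lists its parents, i.e. the variables it directly depends on.
parents = {
    "promotion": [],
    "region": [],
    "demand": ["promotion", "region"],  # demand depends on both
    "revenue": ["demand"],              # revenue depends on demand
}

# The joint distribution factorizes along the graph's edges:
# P(promotion, region, demand, revenue) =
#   P(promotion) * P(region) * P(demand | promotion, region) * P(revenue | demand)
for var, deps in parents.items():
    print(f"P({var} | {', '.join(deps)})" if deps else f"P({var})")
```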
To learn more about the Ikigai platform, visit here.