RUBIX

A Framework for Improving Data Integration with Linked Data

April 13 2015

RUBIX is a framework that enables business users to semi-automatically combine potentially noisy data residing in heterogeneous silos. Semantically related data is identified and appropriate mappings are suggested to users

Introduction

Companies have traditionally performed business analysis based on transactional data stored in legacy relational databases. The enterprise data available for decision makers was typically relationship management or enterprise resource planning data. However social media feeds, weblogs, sensor data, or data published by governments or international organizations are nowadays becoming increasingly available.

With today’s public data sets containing billions of data items, more and more companies are looking to integrate external data with their traditional enterprise data to improve business intelligence analysis. These distributed data sources however exhibit heterogeneous data formats and terminologies and may contain noisy data. RUBIX is a novel framework that enables business users to semi-automatically perform data integration on potentially noisy tabular data. This framework offers an extension to Open Refine (Formerly Google Refine) with novel schema matching algorithms leveraging Freebase rich types. First experiments show that using Linked Data to map cell values with instances and column headers with types improves significantly the quality of the matching results and therefore should lead to more informed decisions.

The quality and amount of structured knowledge available make it now feasible for companies to mine this huge amount of public data and integrate it in their next-generation enterprise information management systems. Analyzing this new type of data within the context of existing enterprise data should bring them new or more accurate business insights and allow better recognition of sales and market opportunities. These new distributed sources however raise tremendous challenges. They have inherently different file formats, access protocols or query languages. They possess their own data model with different ways of representing and storing the data.

Data across these sources may be noisy (e.g. duplicate or inconsistent), uncertain or be semantically similar yet different. Integration and provision of a unified view for these heterogeneous and complex data structures therefore require powerful tools to map and organize the data. RUBIX is a framework that enables business users to semi-automatically combine potentially noisy data residing in heterogeneous silos. Semantically related data is identified and appropriate mappings are suggested to users. On user acceptance, data is aggregated and can be visualized directly or exported to Business Intelligence reporting tools.

The framework is composed of a set of extensions to Open Refine (now called Open Refine) server and a plug-in to its user interface. Open Refine was selected for its extensibility as well as good cleansing and transformation capabilities. We first map cell values with instances and column headers with types from popular data sets from the Linked Open Data Cloud. To perform the matching, we use the Auto Mapping Core (also called AMC ) that combines the results of various similarity algorithms. The novelty of our approach resides in our exploitation of Linked Data to improve the schema matching process. We developed specific algorithms on rich types from vector algebra and statistics. The AMC generates a list of high-quality mappings from these algorithms allowing better data integration. First experiments show that Linked Data increases significantly the number of mappings suggested to the user. Schemas can also be discovered if column headers are not defined and can be improved when they are not named or typed correctly. Finally, data reconciliation can be performed regardless of data source languages or ambiguity. All these enhancements allow business users to get more valuable and higher-quality data and consequently to take more informed decisions.

Related Work

While schema matching has always been an active research area in data integration, new challenges are faced today by the increasing size, number and complexity of data sources and their distribution over the network. Data sets are not always correctly typed or labeled and that hinders the matching process. In the past, some work has tried to improve existing data schemas  but literature mainly covers automatic or semi-automatic labeling of anonymous data sets through Web extraction. Examples include that automatically labels news articles with a tree structure analysis or that defines heuristics based on distance and alignment of a data value and its label.

These approaches are however restricting label candidates to Web content from which the data was extracted goes a step further by launching speculative queries to standard Web search engines to enlarge the set of potential candidate labels. More recently, applies machine learning techniques to respectively annotate table rows as entities, columns as their types and pairs of columns as relationships, referring to the YAGO ontology. The work presented aims however at leveraging such annotations to assist semantic search queries construction and not at improving schema matching. With the emergence of the Semantic Web, new work in the area has tried to exploit Linked Data repositories.

The authors of present techniques to automatically infer a semantic model on tabular data by getting top candidates from Wikitology and classifying them with the Google page ranking algorithm. Since the authors’ goal is to export the resulting table data as Linked Data and not to improve schema matching, some columns can be labeled incorrectly, and acronyms and languages are not well handled. In the Helix project, a tagging mechanism is used to add semantic information on tabular data. A sample of instances values for each column is taken and a set of tags with scores are gathered from online sources such as Freebase. Tags are then correlated to infer annotations for the column. The mechanism is quite similar to ours but the resulting tags for the column are independent of the existing column name and sampling might not always provide a representative population of the instance values.

Proposition

Open Refine (formerly Google Refine and Freebase Gridworks) is a tool designed to quickly and efficiently process, clean and eventually enrich large amounts of data with existing knowledge bases such as Freebase. The tool has however some limitations: it was initially designed for data cleansing on only one data set at a time, with no possibility to compose columns from different data sets. Moreover, Open Refine has some strict assumptions over the input of spreadsheets which makes it difficult to identify primitive and complex data types. The AMC is a novel framework that supports the construction and execution of new matching components or algorithms. AMC contains several matching components that can be plugged and used, like string matchers (Levenshtein, JaroWinkler … etc.), data types matchers and path matchers. It also provides a set of combination and selection algorithms to produce optimized results (weighted average, average, sigmoid … etc.).

Activity Flow

The data sets to match can be contained in files (e.g. csv, Excel spreadsheets, etc.) or defined in Open Refine projects. The inputs for the match module are the source and target files and/or projects that contain the data sets.

• These projects are imported into the internal data structure (called schema) of the AMC .
• The AMC then uses a set of built-in algorithms to calculate similarities between the source and target schemas on an element basis, i.e. column names in the case of spreadsheets or relational databases.
• The output is a set of similarities, each containing a triple consisting of source schema element, target element, and similarity between the two.
• These results are presented to the user in tabular form such that s/he can check, correct, and potentially complete the mappings.
• Once the user has completed the matching of columns, the merge information is sent back to Open Refine, which calls the merge module.
• This module creates a new project, which contains a union of the two projects where the matched columns of the target project are appended to the corresponding source columns.
• The user can then select the columns that s/he wants to merge and visualize by dragging and dropping the required columns onto the fields that represent the x and y axes.
• Once the selection has been performed, the aggregation module merges the filtered columns and the result can then be visualized.

As aggregation operations can quickly become complex, our default aggregation module can be replaced by more advanced analytics on tabular data.

Schema Matching

Schema matching is typically used in business to business integration, metamodel matching, as well as Extract, Transform, Load (ETL) processes. For non-IT specialists the typical way of comparing financial data from two different years or quarters, for example, would be to copy and paste the data from one Excel spreadsheet into another one, thus creating reduncancies and potentially introducing copy-and-paste errors. By using schema matching techniques it is possible to support this process semi-automatically, i.e. to determine which columns are similar and propose them to the user for integration. This integration can then be done with appropriate business intelligence tools to provide visualizations. One of the problems in performing the integration is the quality of data. The columns may contain data that is noisy or incorrect. There may also be no column headers to provide suitable information for matching. A number of approaches exploit the similarities of headers or similarities of types of column data. We propose a new approach that exploits semantic rich typing provided by popular datasets from the Linked Data cloud.

Data Reconciliation

Reconciliation enables entity resolution, i.e. matching cells with corresponding typed entities in case of tabular data. Open Refine already supports reconciliation with Freebase but requires confirmation from the user. For medium to large data sets, this can be very time-consuming. To reconcile data, we therefore first identify the columns that are candidates for reconciliation by skipping the columns containing numerical values or dates. We then use the Freebase search API to query for each cell of the source and target columns the list of typed entities candidates. Results are cached in order to be retrieved by our similarity algorithms.

Matching Unnamed and Untyped Columns

The AMC has the ability to combine the results of different matching algorithms. Its default built-in matching algorithms work on column headers and produce an overall similarity score between the compared schema elements. It has been proven that combining different algorithms greatly increases the quality of matching results. However, when headers are missing or ambiguous, the AMC can only exploit domain intersection and inclusion algorithms based on column data. We have therefore implemented three new similarity algorithms that leverage the rich types retrieved from Linked Data in order to enhance the matching results of unnamed or untyped columns.

Cosine Similarity

The first algorithm that we implemented is based on vector algebra. Let $$v$$ be the vector of ranked candidate types returned by Freebase for each cell value of a column. Then: $v:=\sum^K_{i=1}{a_i}*\overrightarrow{t_i}$ where $$a_i$$ is the score of the entry and $$\overrightarrow{t_i}$$ is the type returned by Freebase. The vector notation is chosen to indicate that each distinct answer determines one dimension in the space of results. Each cell value now has a weighted result set that can be used for aggregation to produce a result vector for the whole column. The column result $$V$$ is then given by: $V:\sum^n_{i=1}{v_i}$ We now compare the result vector of candidate types from the source column with the result vector of candidate types from the target column. Let $$W$$ be the result vector for the target column, then the similarity $$s$$ between the columns pair can be calculated using the absolute value of the cosine similarity function.

Pearson Product-Moment Correlation Coefficient (PPMCC)

The second algorithm that we implemented is PPMCC, a statistical measure of the linear independence between two variables $$\left(x,y\right)$$ . In our method, x is an array that represents the total scores for the source column rich types, y is an array that represents the mapped values between the source and the target columns. The values present in x but not in y are represented by zeros.

Spearman’s Rank Correlation Coefficient

The last algorithm that we implemented to match unnamed and untyped columns is Spearman’s rank correlation coefficient. It applies a rank transformation on the input data and computes PPMCC afterwards on the ranked data. In our experiments we used Natural Ranking with default strategies for handling ties and NaN values. The ranking algorithm is however configurable and can be enhanced by using more sophisticated measures.

Handling Non-String Values

So far we have covered several methods to identify the similarity between “String” values, but how about other numeral values like dates, money, distance …etc. For this purpose we have implemented some basic type identifier that can recognize dates, money, numeral values, numerals used as identifiers. This will help us in better match corresponding entries. Adjusting AMC’s combination algorithms can be of great importance at this stage; for example assigning weights to different matchers and tweaking the configuration can result in more accurate results.

Experiments

We present in this section results from experiments we conducted using the different methods described above. To appreciate the value of our approach, we have used a real life scenario that exposes common problems faced by the management in SAP. The data we have used come from two different SAP systems; the Event tracker and the Travel Expense Manager. The Event Tracker provides an overview of events (Conferences, Internal events … etc.) that SAP Research employees contribute to or host. The entries in this system contain as much information as necessary to give an overview of the activity like the Activity type and title, travel destination, travel costs divided into several sub categories (conference fees, accommodation, transportation and others), and duration related information (departure, return dates) … etc. Entries in the Event Tracker are generally entered in batches as employees fill in their planned events that they wish to attend or contribute to at the beginning of each year. Afterwards, managers according to their allocated budget can either accept or reject these planned events. On the other hand, the Travel Expense Manager contains the actual data for the successfully accepted events. This system is used by employees to enter their actual trip details in order to claim their expenses. It contains more detailed information and aggregated views of the events, such as the total cost, duration calculated in days, currency exchange rates and lots of internal system tags and Identifiers. Matching reports from these two systems is of great benefit to managers to organize and monitor their allocated budget, they mainly want to:

1. Find the number of the actual (accepted plans) compared with the total number of entered events.
2. Calculate the deviation between the estimated and actual cost of each event.

However, matching from these two sources can face several difficulties that can be classified in two domains; the first is in global labels (or column headers as we are dealing with Excel like files). These problems can be summarized in:

• Missing labels: Importing files into Open Refine with empty headers will result in assigning that column a dummy name by concatenating the word “column” with a number starting from 0.
• Dummy labels or semantically unrelated names: This is a common problem especially from the data coming from the Travel Expense Manager. This can be applied to columns that are labeled according to the corresponding database table (i.e. lbl_dst to denote destination label). Moreover, column labels do not often convey the semantic type of the underlying data.

The second domain is at cell (single entry) level; these problems can be summarized in:

• Detecting different date formats: We have found out that dates field coming from the two systems have different formats, moreover; the built in type detection in Open Refine converts detected date (if found) into another third format.
• Entries from different people can be made in different languages.
• Entries in the two systems can be incomplete, an entry can be shortened automatically by the system; for example, selecting a country in the Travel Expense Manager will result in filling out that country code in the exported report (i.e. France = FR).
• Inaccurate entries: This is one of the most common problems faced; users in the some field can enter several values that correspond to the same entity. For example, in the destination column, users can enter the country, the airport at the destination, the city or even the exact location of the event (i.e. office location).