The abundance of tabular data accumulated by enterprises, both internally and online, represents a significant challenge for data integration and analysis in companies that strive to be data-driven.
While the main focus of the AI/ML research community over the last several years has been on processing unstructured data, tabular data is still where most of the time and money are spent in the Data Integration world.
These days every company, whether big or small, runs dozens of different systems (data sources). Many of those have great integration capabilities, yet every day people still export data from various systems and ingest it into proprietary enterprise Data Lakes, Data Warehouses, Reporting databases, etc.
That requires a lot of time and qualified Data Engineers.
To keep up with quickly changing business requirements and growing data volumes, companies and communities have developed many great technologies and tools: the big data stack, data wrangling tools, quality and observability platforms, visualization, and NLU interfaces to data.
All of these technologies made it possible to collect and process unprecedented amounts of data. They also made the modern data engineering stack very complex, which, in turn, results in long and expensive development cycles that don't meet business requirements.
These tools assume that the user understands the data: knows how it's structured and how to cross-reference data that comes from different sources.
And that is probably the biggest problem in today's corporate world. To do anything with data, you have to understand it, know where it lives, and know how it's structured. It is a misconception that there are people who fully understand all the nuances of a company's data and can manage it efficiently.
The problem becomes exponentially harder as a company grows and people change.
Datuum specifically targets this challenge by applying state-of-the-art AI models to help Data Workers handle the variety and complexity of tabular data. Datuum speeds up and streamlines the integration of multiple data sources by helping data workers find the data they need and connect the dots between source and destination. As you interact with your data, our AI learns more with every iteration, so institutional knowledge is not lost.
Why is it not easy?
Imagine that you have two different data sources with seemingly similar data stored in them. For instance, data from medical or insurance domains, in which some tables could contain hundreds or even thousands of columns. Our major task is to unify this data into a single source of truth.
The cornerstone of this unification is the process of column matching – discovering combinations of columns from different data sources which store the same information (Fig. 1).
[Figure 1]: Table cross-reference comprises column matching and record matching
While it can be trivial to match columns holding generic data like names, emails, or addresses, that is not the case for columns containing domain-specific information -- to map this kind of data, we usually need domain expertise and manual intervention.
The other important phase of the unification pipeline is record matching - the process of merging records from different data sources into a single entity. For highly regulated domains like healthcare, this step can be extremely important. And it is even harder to get right on large datasets, even for human annotators.
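To make the difficulty concrete, a naive baseline for column matching is to compare the overlap of column values (Jaccard similarity). The sketch below uses made-up sample data; the column names and values are purely illustrative:

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of column values."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Two sources storing the same information under different column names.
source_a = {"patient_email": ["ann@x.com", "bob@y.com", "cid@z.com"]}
source_b = {"contact_addr":  ["bob@y.com", "cid@z.com", "dee@w.com"]}

score = jaccard(source_a["patient_email"], source_b["contact_addr"])
print(round(score, 2))  # 2 shared values out of 4 distinct ones -> 0.5
```

Value overlap works for generic identifiers like emails, but it breaks down exactly where the hard cases are: when two sources encode the same domain-specific concept differently (say, diagnosis codes versus free-text descriptions), the overlap is near zero even though the columns match.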
Transformers to the rescue!
Thankfully, the latest research in large language models gives us the key to simplifying the matching process. It doesn't yet fully remove the need for human experts (especially in complex domains), but it can give them a substantial boost over the standard setting.
By definition, language modeling is the use of statistical and probabilistic techniques to determine the probability of a given sequence of words. Language models are trained on text to correctly predict the next or masked word in a sequence (Fig. 2).
[Figure 2]: One of the possible language model objectives is to predict a masked word in an input sequence. Models trained with this objective are called Masked Language Models.
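Constructing a masked-LM training pair is simple enough to show directly. The toy sketch below (no actual model involved) hides one token of a sentence; a masked language model would be trained to recover it:

```python
def mask_position(tokens, i, mask_token="[MASK]"):
    """Build a masked-LM training pair by hiding the token at index i.
    The model's objective is to predict `target` given `inputs`."""
    inputs = tokens[:i] + [mask_token] + tokens[i + 1:]
    target = tokens[i]
    return inputs, target

sentence = "language models learn to predict hidden words".split()
inputs, target = mask_position(sentence, 5)
# inputs: ['language', 'models', 'learn', 'to', 'predict', '[MASK]', 'words']
# target: 'hidden'
```

In real training, positions are masked at random (commonly around 15% of tokens), and the loss is computed only on the masked positions.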
The task of language modeling is not new, but in recent years it has become one of the main areas of research and development in machine learning. There are three main reasons why this sub-field has seen such enormous progress:
The nature of the language modeling task lets us use tremendous amounts of data for unsupervised learning.
The attention-based Transformer architecture enables massive parallelism when training on sequential data.
Transformers pretrained on a language modeling task turned out to be great at solving other tasks unrelated to language modeling.
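The attention mechanism behind that parallelism fits in a few lines. The toy implementation below (plain Python lists, no ML framework) computes scaled dot-product attention; note that each output row is computed independently from the same K and V matrices, which is what lets all sequence positions be processed in parallel:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    """Scaled dot-product attention: out[i] = softmax(Q[i]·K^T / sqrt(d)) · V."""
    d = len(K[0])
    out = []
    for q in Q:  # every row is independent -> trivially parallelizable
        scores = [sum(qj * kj for qj, kj in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

# Each query attends most strongly to the matching key.
Q = K = V = [[1.0, 0.0], [0.0, 1.0]]
out = attention(Q, K, V)
```

This is the single-head, unbatched core of the mechanism; production implementations add multiple heads, masking, and batched matrix multiplies on accelerators.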
As a result, transformer models have found use in a wide range of tasks, ranging from question answering to video understanding. Naturally, people also looked at applying this architecture to tabular data as well.
Tabular data is textual in nature. One could argue that numeric columns are not text, but numerical values are, in fact, nothing more than sequences of specialized text characters. This means that tabular data, like free text, can be effectively modeled as a language using Transformers.
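A common first step in table-as-language approaches is to linearize table parts into flat token sequences a language model can consume. The sketch below is a simplified illustration, not the exact serialization used by any particular model:

```python
def linearize_column(name, values, sep=" | "):
    """Flatten a column (header plus cell values, numbers included)
    into a single text sequence."""
    return sep.join([str(name)] + [str(v) for v in values])

col = linearize_column("unit_price", [9.99, 12.50, 3.75])
print(col)  # unit_price | 9.99 | 12.5 | 3.75
```

Once a column is just a character sequence like this, the same tokenization and modeling machinery built for natural language applies to it unchanged.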
Modeling tabular data with Transformers facilitates the solution of other downstream tasks, such as column type prediction or column name recovery. Some of the proposed models could also learn efficient dense representations of the table parts: columns and rows.
Deep learning journey
Machine learning approaches using tabular data can hardly be called new. For many years, tree-based solutions, such as random forests or gradient boosting algorithms, were used for various classification and regression tasks, such as credit score prediction or fraud detection.
However, the limited capacity of these models only allows them to learn patterns at the observation (row) level.
For a deeper understanding of the data, such as relations between columns in a table or between tables in a database, neural networks come in very handy.
In the pre-transformer era, deep learning models had a limited capability in processing sequential data like column values or names and, hence, learning contextual information from it. As a result, they often relied on manually engineered features for model input and processed each column separately.
One prominent example of such a model is Sherlock – the first successful application of deep learning to the task of semantic type classification. Sherlock is a regular feed-forward neural network that uses a set of statistical features calculated from column values to predict one of 78 semantic column types. It was the first deep learning model used to improve column matching in Datuum.
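Such hand-engineered column features are easy to picture. The sketch below computes a few simple per-column statistics of the kind that feed a Sherlock-style classifier; this is an illustrative subset, not Sherlock's actual feature set:

```python
import statistics

def column_features(values):
    """A few hand-engineered statistics over a column's values,
    illustrating the kind of input a pre-transformer model consumes."""
    strs = [str(v) for v in values]
    lengths = [len(s) for s in strs]
    return {
        "n_values": len(strs),
        "n_unique": len(set(strs)),
        "mean_len": statistics.mean(lengths),
        # crude numeric test: digits with at most one decimal point
        "frac_numeric": sum(s.replace(".", "", 1).isdigit() for s in strs) / len(strs),
    }

feats = column_features(["10001", "94103", "60614"])  # looks like zip codes
```

The key limitation is visible here: each column is summarized in isolation, so neither the column name nor the surrounding table context can influence the prediction.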
Despite being quite effective for common column types like name, address, or zip code, Sherlock had a number of limitations: a high reliance on annotated data, especially in previously unseen domains, and an inability to take the table context and column name into account.
Obviously, that was an obstacle to quickly applying Datuum in new verticals, since the architecture required additional training on new data.
Fortunately, Transformer-based models proved to be a perfect alternative to Sherlock.
The case for transformers
Our history of applying transformers to column matching started with extensive research into related work. As a result, we chose the TABBIE and TaBERT models for further experiments. It should be noted that at the beginning of 2022, almost all SOTA approaches to column type classification were based on Transformers (Fig. 3).
[Figure 3]: The idea behind the TABBIE model is to use self-supervised training on large amounts of tabular data to get powerful representations of cells, columns, and rows.
By fine-tuning these models on multiple relevant datasets, we proved the superiority of transformer-based methods over Sherlock. For example, TABBIE's F1-macro score on the TURL dataset was almost 20% higher than Sherlock's.
In fact, the performance of transformers on classification tasks wasn't what we were after. The property that caught our attention most was the representational ability of these models.
For instance, by training transformer-based models like TABBIE on a domain-specific dataset for the task of semantic column type classification, we were able to learn dense representations of the table elements: rows and columns. This, in turn, allowed us to extend the model to previously unseen domains without the need to annotate new data.
Our column matching pipeline is specifically designed for a system that has a regular inflow of new data from previously unseen domains (Fig. 4).
[Figure 4]: The representational ability of the TABBIE model allows us to improve column-matching performance on domain-specific data without the need to annotate it.
Generally, our approach works the following way:
We fine-tune the TABBIE model on domain-specific data with a task that requires no additional annotation, such as corrupted-cell detection or cell in-filling. Such training is called self-supervised.
The model learns dense representations of columns that leverage both column information and table context.
At runtime, we encode the columns of the tables to be matched and combine them into matching pairs based on a distance measure between their dense representations in vector space.
Human experts review low-confidence matches for correctness.
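The runtime steps above can be sketched as a nearest-neighbour search over column embeddings. The vectors, column names, and review threshold below are made up for illustration; in the real pipeline the embeddings come from the fine-tuned encoder:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def match_columns(src_embs, dst_embs, review_threshold=0.8):
    """For each source column, pick the closest destination column by
    cosine similarity; low-confidence pairs are flagged for human review."""
    matches = []
    for s_name, s_vec in src_embs.items():
        d_name, score = max(
            ((d, cosine(s_vec, d_vec)) for d, d_vec in dst_embs.items()),
            key=lambda p: p[1],
        )
        matches.append((s_name, d_name, score, score < review_threshold))
    return matches

# Toy embeddings standing in for the output of a fine-tuned encoder.
src = {"patient_email": [0.9, 0.1, 0.0]}
dst = {"contact_addr": [0.85, 0.2, 0.0], "dob": [0.0, 0.1, 0.95]}
matches = match_columns(src, dst)
```

Because matching is done in embedding space rather than against a fixed label set, the same mechanism works for column types the model was never explicitly taught.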
This pipeline has a few important benefits over the Sherlock-based pipeline, namely:
Matching is not limited to a defined set of column types.
The need for annotation is eliminated.
The ability to combine encoded column data, metadata, and context in a single representation.
Combined, these advantages set a new standard for the accuracy and adaptability of column matching in Datuum.
The representational ability of Transformer-based language models is bringing new capabilities to almost every task that relies on textual data, and table cross-reference is no exception.
By employing a Transformer model in our column-matching pipeline, we increased the matching accuracy and generalization ability of the system without depending on annotated data, which is especially useful in domains where data access is restricted.