Article Preview
TopIntroduction
Manual solutions to schema matching involve examining all tables in the integrated databases and identifying semantically corresponding attributes in each pair of a source table and a target table. These manual processes are prone to errors, limited in their scalability and often too slow to meet the requirement of fast information integration. The goal of automatic and semi-automatic schema matching techniques is to cope with these difficulties in such database applications as schema integration, data warehousing, etc.
There are two main approaches to automatic schema matching, namely the schema-based and the instance-based approach. The schema-based matching considers the schema information like attribute names, data types, relationship types, integrity constraints, and data descriptions. When the column names and values are uninterpreted, the instance-based schema matching approach (Liang, 2008) can identify the semantic correspondence of schema attributes by analyzing the instance data without any schema-level information. A deep insight into the content of schema elements such as data range, data domain and statistics is very helpful for uncovering semantic correspondences. To identify matching columns, which use different encodings of logically similar domains, one can also utilize the inter-attribute dependencies in each table.
The following three types of cardinality constraints should be considered in schema matching:
- •
One-to-One Mapping ([1,1] – [1,1], in UML Notation): Each attribute in the source table A has a unique match in the target table B and vice versa. This applies to mapping between tables having the same number of attributes. Consequently, the problem simply involves finding a correspondence between the attributes;
- •
Onto Mapping ([0,1] – [1,1]): Each attribute in A has a unique match in B while each attribute in B either has a unique match in A or remains unmatched. This corresponds to cases where we know that table A’s attributes are a subset of table B’s. Hence, we have to discover this subset and then to map its attributes;
- •
Partial Mapping ([0,1] – [0,1]): Each attribute in A either has a unique match in B or remains unmatched and vice versa. This corresponds to the most general and difficult cases, where we do not know which attributes of A map to B or even how many attributes of A map to B. In this case, we need to find the best subset of attributes of A to be mapped to B as well as the best mapping for this subset.
Zhao (2010) uses information-theoretic metrics to identify matching attributes of the most common types across two databases. The method application scope is limited to one-to-one mapping between data sources that contain some amount of overlapping records.
Kang and Naughton (2003, 2008) proposed a two-step schema-matching algorithm for the uninterpreted matching domain using inter-attribute dependencies. In the first step, they measure dependencies by calculating mutual information between all attributes. In the second step, they find matching node pairs in the dependency graphs by running a graph-matching algorithm. In their first article, Kang & Naughton (2003) execute graph matching using a naïve exhaustive search, which is impractical, in terms of time complexity, for schemas with a large number of attributes. In their second article, Kang & Naughton (2008) proposed several heuristic algorithms, but the sub-problem they discuss is schema matching between tables with the same number of attributes, where each attribute in Table A has a unique match in Table B and vice versa (one-to-one mapping). A faster schema-matching algorithm using dependencies between discrete and continuous attributes is applied in our previous work (Rabinovich and Last, 2011) to the one-to-one and onto mapping sub-problems.