open access publication

Article, 2024

Incremental schema integration for data wrangling via knowledge graphs

Semantic Web, ISSN 1570-0844, Volume 15, Issue 3, Pages 793-830, DOI 10.3233/SW-233347

Contributors

Flores, Javier (Corresponding author) [1]; Rabbani, Kashif [2]; Nadal, Sergi (ORCID 0000-0002-8565-952X) [1]; Gomez, Cristina [1]; Romero, Oscar (ORCID 0000-0001-6350-8328) [1]; Jamin, Emmanuel [3]; Dasiopoulou, Stamatia [3]

Affiliations

  [1] Univ Politecn Cataluna, Dept Serv & Informat Syst Engn, Barcelona, Spain
      [NORA names: Spain; Europe, EU; OECD]
  [2] Aalborg Univ, Dept Comp Sci, Aalborg, Denmark
      [NORA names: AAU Aalborg University; University; Denmark; Europe, EU; Nordic; OECD]
  [3] NTT Data, SEMBU, Barcelona, Spain
      [NORA names: Spain; Europe, EU; OECD]

Abstract

Virtual data integration is the prevailing approach to data wrangling in data-driven decision-making. In this paper, we focus on automating schema integration, which extracts a homogenised representation of the data source schemata and integrates them into a global schema to enable virtual data integration. Schema integration requires a set of well-known constructs: the data source schemata and wrappers, a global integrated schema, and the mappings between them. Based on these constructs, virtual data integration systems enable fast, on-demand data exploration via query rewriting. Unfortunately, such constructs are currently generated in a largely manual manner, which hinders their feasibility in real scenarios. The problem is aggravated when dealing with heterogeneous and evolving data sources. To overcome these issues, we propose a fully-fledged, semi-automatic, and incremental approach grounded on knowledge graphs that generates the required schema integration constructs in four main steps: bootstrapping, schema matching, schema integration, and generation of system-specific constructs. We also present NextiaDI, a tool implementing our approach. Finally, we present a comprehensive evaluation of our approach.
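
As an illustration of the first three steps named in the abstract, the following is a minimal sketch, not NextiaDI's API or the authors' algorithm. It assumes plain Python dictionaries instead of knowledge graphs, and it substitutes simple string similarity (`difflib.SequenceMatcher`) for a real schema matcher. The function names `bootstrap`, `similar`, and `integrate`, the 0.8 threshold, and the sample records are all hypothetical.

```python
from difflib import SequenceMatcher


def bootstrap(source_name, record):
    """Bootstrapping (assumed form): derive a flat source schema from one sample
    record, as a set of (source, attribute, type name) triples."""
    return {(source_name, key, type(value).__name__) for key, value in record.items()}


def similar(name_a, name_b, threshold=0.8):
    """Schema matching stand-in: case-insensitive string similarity on names."""
    ratio = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    return ratio >= threshold


def integrate(global_schema, source_schema):
    """Schema integration (assumed form): merge one source schema into the
    global schema incrementally.

    global_schema maps each global attribute name to the set of
    (source, attribute) pairs it covers, which doubles as the mappings.
    """
    for source, attr, _type in sorted(source_schema):
        # Reuse an existing global attribute when one matches; otherwise add a new one.
        target = next((g for g in list(global_schema) if similar(g, attr)), attr)
        global_schema.setdefault(target, set()).add((source, attr))
    return global_schema


# Two hypothetical sources, integrated one after the other.
global_schema = {}
integrate(global_schema, bootstrap("customers", {"customer_id": 1, "name": "Ada"}))
integrate(global_schema, bootstrap("orders", {"customerId": 1, "total": 9.5}))

for attribute, covered in sorted(global_schema.items()):
    print(attribute, sorted(covered))
```

Because each call to `integrate` only touches the incoming source, a new or changed source can be added without recomputing the alignments already recorded, which is the sense of "incremental" this sketch tries to convey.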

Keywords

Schema integration, bootstrapping, virtual data integration

Data Provider: Clarivate