Tutorial on Semantic Schema Discovery: principles, methods and future research directions

The explosion of data on the semantic web has made many weakly structured and irregular data sources available every day. The schema of these sources is useful for a number of tasks, such as source selection, query answering, exploration and summarization. However, although semantic web data might contain schema information, in many cases this information is completely missing or only partially defined. Schema discovery consists in extracting schema-related information from the original semantic graph, which some applications can exploit instead of or along with the original graph, to perform some tasks more efficiently. This tutorial presents a structured analysis and comparison of existing works in the area of semantic schema discovery, helping researchers and practitioners understand the challenges in the area; it is based upon a recent survey we authored.

How?

Semantic Schema Discovery

Modern-day data sources are often weakly structured or incomplete. Type information is essential for a number of tasks, such as query answering, integration, summarization and partitioning.

Schema discovery consists in extracting schema-related information from the original semantic graph, which some applications can exploit instead of or along with the original graph, to perform some tasks more efficiently.

This tutorial presents a structured analysis and comparison of existing works in the area of semantic schema discovery, covering the concepts at the core of each approach as well as their main technical aspects and implementation.

Objectives

Scope of this tutorial

  • Implicit Type Discovery

    The works in this category try to discover the schema of a data source, usually without needing any additional information about the schema declared in the dataset. The goal is to extract new types by analyzing the entities in the dataset. Given a set of entities E in an RDF dataset G, schema discovery may be formulated as the problem of generating a set of possibly overlapping groups representing classes (types) C = {C_1, C_2, ..., C_n}, where each class C_i corresponds to a subset of E. Each class C_i represents a set of similar entities, i.e., entities which are of the same kind. Links among the classes may also be generated, considering the properties describing the corresponding entities in the dataset. Classes may overlap because a given entity may be an instance of several classes, as multiple typing is supported in RDF. Schema discovery approaches in this category proceed either by grouping entities or by grouping paths.

  • Explicit Schema Enrichment

    This category includes works that use the statements on the schema to complement or to enrich the schema already available. Their goal is to generate new schema statements, such as rdf:type, using existing types or other schema statements such as rdfs:domain, rdfs:range, rdfs:subClassOf, etc. Given a set of schema statements S = {s_1, s_2, ..., s_n} explicitly defined in the dataset G, schema discovery may be stated as the problem of enriching the set S with new schema statements inferred from S.

  • Structural Pattern Discovery

    Schema discovery can also be viewed as the problem of identifying all the possible structural patterns (versions) of the entities in an RDF dataset G. A structural pattern can be viewed as a set of properties V = {p_1, p_2, ..., p_n} such that there is at least one entity e_i in G that is described by all the properties in V. If V contains all the properties describing e_i, then V is an exact pattern; otherwise, V is an approximate pattern. The problem here is not to discover new schema statements such as rdf:type, rdfs:subClassOf, etc., but rather to analyse the co-occurrence relationships among the properties in the dataset.

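The three problem formulations above can be illustrated on a toy graph. The following Python sketch is only an illustration under stated assumptions, not the method of any surveyed approach: the triples, the `ex:` names, the function names and the Jaccard threshold of 0.5 are all hypothetical. For simplicity it produces disjoint groups (whereas classes may overlap in general) and assumes at most one rdfs:domain and one rdfs:range per property.

```python
from itertools import combinations

# Hypothetical toy graph: (subject, property, object) triples.
TRIPLES = [
    ("ex:alice", "ex:name", "Alice"),
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:bob", "ex:name", "Bob"),
    ("ex:bob", "ex:worksFor", "ex:acme"),
    ("ex:bob", "ex:email", "bob@example.org"),
    ("ex:acme", "ex:label", "Acme"),
    ("ex:acme", "ex:locatedIn", "ex:paris"),
    ("ex:worksFor", "rdfs:domain", "ex:Person"),
    ("ex:worksFor", "rdfs:range", "ex:Organization"),
]
SCHEMA_PROPS = {"rdf:type", "rdfs:domain", "rdfs:range", "rdfs:subClassOf"}


def property_sets(triples):
    """Map each entity to the set of non-schema properties describing it."""
    sets = {}
    for s, p, _ in triples:
        if p not in SCHEMA_PROPS:
            sets.setdefault(s, set()).add(p)
    return sets


def discover_types(triples, threshold=0.5):
    """Implicit type discovery (sketch): merge entities whose property
    sets have a Jaccard similarity of at least `threshold`."""
    sets = property_sets(triples)
    group_of = {e: {e} for e in sets}  # every entity starts alone
    for a, b in combinations(sets, 2):
        similarity = len(sets[a] & sets[b]) / len(sets[a] | sets[b])
        if similarity >= threshold and group_of[a] is not group_of[b]:
            merged = group_of[a] | group_of[b]
            for e in merged:
                group_of[e] = merged
    return {frozenset(g) for g in group_of.values()}


def infer_types(triples):
    """Explicit schema enrichment (sketch): derive rdf:type statements
    from rdfs:domain and rdfs:range declarations."""
    domain = {s: o for s, p, o in triples if p == "rdfs:domain"}
    range_ = {s: o for s, p, o in triples if p == "rdfs:range"}
    inferred = set()
    for s, p, o in triples:
        if p in domain:
            inferred.add((s, "rdf:type", domain[p]))
        if p in range_:
            inferred.add((o, "rdf:type", range_[p]))
    return inferred


def exact_patterns(triples):
    """Structural pattern discovery (sketch): the distinct full property sets."""
    return {frozenset(props) for props in property_sets(triples).values()}


def is_pattern(candidate, triples):
    """True if at least one entity is described by all properties in `candidate`."""
    return any(set(candidate) <= props for props in property_sets(triples).values())


if __name__ == "__main__":
    print(discover_types(TRIPLES))
    print(infer_types(TRIPLES))
    print(exact_patterns(TRIPLES))
```

On this toy graph, `ex:alice` and `ex:bob` share two of their three distinct properties and end up in the same group, while `ex:acme` forms a group of its own. The set {ex:name} is an approximate pattern in the sense defined above, since it is a strict subset of the properties describing `ex:alice`.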

Tutorial

Program

Dimitris Plexousakis 10.00 CET (15 min)

Introduction and preliminaries

The basics of semi-structured data models, such as RDF graphs, RDFS and OWL ontologies, Shape Expressions for RDF graphs, the Object Exchange Model and JSON.

Presentation Part 1
Dimitris Plexousakis 10.15 CET (15 min)

Applications

This part covers uses and applications of schema-related information, such as data indexing, query answering and semantic summarization.

Presentation Part 1
Zoubida Kedad 10.30 (35 min)

Dimensions of Analysis & Implicit Schema Discovery

An overview of the analysis dimensions and of the works on implicit schema discovery.

Presentation Part 2
Kenza Kellou-Menouer 11.05 (25 min)

Pattern Discovery

An overview of the works on pattern discovery.



Presentation Part 3
11.30 (15 min)

Break

Haridimos Kondylakis 11.45 (20 min)

Explicit Schema Enrichment

An overview of the works on explicit schema enrichment.



Presentation Part 4
Haridimos Kondylakis 12.05 (20 min)

Open issues and future research directions

Open issues on schema discovery and the possible future research directions are discussed in this part of the tutorial.

Presentation Part 4
Nikos Kardoulakis & Georgia Troullinou 12.25 (20 min)

Hands-on

A demonstration, through experimentation, of the challenges involved in semantic schema discovery.

Presentation Part 5
All 12.45 (15 min)

Q&A

Questions and answers.

Team

Collaborators

  • Kenza Kellou-Menouer

    Kenza Kellou-Menouer is a Collaborating Researcher at University of Versailles Paris-Saclay, France. She holds an engineering degree in machine learning, a research M.Sc. in databases and a Ph.D. on the topic of “Semantic Schema Discovery”. She has publications on this topic in international conferences and journals including Information Systems Journal, VLDB Journal, ER, SSDBM, ESWC, etc. She is also a freelance IT trainer for various companies and engineering schools, teaching courses on Semantic Web Mining, Database Management, Software Engineering, Multi-language Programming, etc.

  • Nikos Kardoulakis

    Nikolaos Kardoulakis is a Software Engineer and Research Assistant at the Institute of Computer Science, Foundation for Research and Technology - Hellas (FORTH). He holds an M.Sc. and a B.A. in Computer Science from the University of Crete. His research interests include big data, semantics, data partitioning and schema discovery. He has a recent publication on related topics at SSDBM 2021.

  • Georgia Troullinou

    Georgia Troullinou is a Ph.D. candidate in Computer Science at the University of Crete. Before that, she was a research and development engineer at FORTH-ICS. She received her B.Sc. and M.Sc. degrees from the Computer Science Department of the University of Crete. Georgia’s research interests include knowledge representation and management using Semantic Web technologies, with a particular interest in summarizing semantic knowledge bases, big data partitioning and semantic schema discovery. She has publications in international conferences and journals in those areas, including ACM SIGMOD, ISWC, ESWC, ICDE, SSDBM and VLDBJ.

  • Zoubida Kedad

    Zoubida Kedad is an Associate Professor in Computer Science at the University of Versailles-Paris Saclay (France). She is currently Deputy Head of the DAVID laboratory (Data and Algorithms for Smart and Sustainable Cities) at the University of Versailles, and a member of the ADAM team (Ambient Data Access and Mining). She holds an engineering degree in Computer Science, as well as a Ph.D. in Computer Science and an HDR, both from the University of Versailles. Her research interests lie in data integration, data quality, semantic-based data analysis and schema discovery for irregular and weakly structured data; she has participated in several funded research projects on these topics. She regularly serves as a PC member of international and national conferences, and as a reviewer for national and international journals.

  • Dimitris Plexousakis

    Dimitris Plexousakis is Director of the Institute of Computer Science (ICS), FORTH, a Professor at the Department of Computer Science, University of Crete, and Head of the Information Systems Laboratory of FORTH-ICS. He received his B.Sc. degree in Computer Science from the University of Crete in 1988 and his M.Sc. and Ph.D. degrees in Computer Science from the University of Toronto in 1990 and 1996 respectively. He was a visiting Professor at the Universities of Vienna, CNAM, Orsay and Paris-Est. His research interests span the following areas: knowledge representation and knowledge base design; formal knowledge representation models and query languages for the Semantic Web; formal reasoning systems; and applications of artificial intelligence in database systems and Ambient Intelligence. He has published over 200 articles in international conferences and journals. He has extensive experience in the scientific coordination of research projects at the national and European levels. He is a member of the ACM and AAAI.

  • Haridimos Kondylakis

    Haridimos Kondylakis is a Collaborating Researcher at FORTH-ICS. He holds a Ph.D. and an M.Sc. from the Department of Computer Science, University of Crete. He is also a visiting lecturer at the Department of Computer Science, University of Crete and at the Department of Electrical and Computer Engineering at the Hellenic Mediterranean University. He has more than 180 publications in international conferences, books and journals including ACM SIGMOD, VLDB, JWS, KER, EDBT, ISWC, ESWC, etc. He has more than 20 publications on semantic schema discovery and summarization and experience in delivering high-profile tutorials (e.g. EDBT 2019) on related topics, and he recently taught a course on semantic schema discovery and summarization at Aalborg University (Spring 2021).