EPIC: an iterative model for metadata improvement
Hannah Tarver; Mark Edward Phillips; Ana Krahmer
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 4 (2021) pp. 244 - 253
This paper provides a case study of iterative metadata correction and enhancement at the University of North Texas (UNT), within a model that we have developed to describe this process: Evaluate, Prioritise, Identify, Correct (EPIC). These steps are illustrated within the paper to show how they function at UNT and why the model may serve as a useful tool for other organisations. We suggest that the EPIC model works for ongoing assessment, but is particularly useful for large remediation and enhancement projects: it helps to plan timelines and to allocate the people and resources needed to determine which issues should be addressed (evaluate), to rate their severity, importance or difficulty (prioritise), to define the subsets of records that are affected (identify) and to make changes based on prioritisation (correct).
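The four EPIC steps can be pictured as a simple record-auditing loop. The sketch below is a minimal, hypothetical illustration: the field names, checks and severity weights are invented for this example and are not taken from the paper.

```python
# Hypothetical sketch of the EPIC loop (Evaluate, Prioritise, Identify,
# Correct) over a batch of metadata records; all fields are illustrative.
records = [
    {"id": "rec1", "title": "Photograph of campus", "date": "19uu", "language": ""},
    {"id": "rec2", "title": "", "date": "1952", "language": "eng"},
    {"id": "rec3", "title": "Yearbook", "date": "1948", "language": "eng"},
]

# Evaluate: define checks that flag issues in a record.
checks = {
    "missing_title": lambda r: not r["title"],
    "uncertain_date": lambda r: "u" in r["date"],
    "missing_language": lambda r: not r["language"],
}

# Prioritise: rate each issue's severity (higher = fix first).
severity = {"missing_title": 3, "uncertain_date": 1, "missing_language": 2}

# Identify: build the subset of affected records per issue.
affected = {
    issue: [r["id"] for r in records if check(r)]
    for issue, check in checks.items()
}

# Correct: work through the issues in priority order.
work_queue = sorted(affected, key=lambda issue: -severity[issue])
```

The point of the loop is that evaluation and prioritisation stay decoupled: new checks can be added without re-ranking existing ones.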
An ontology proposal for a corpus of letters of Vincenzo Bellini: formal properties of physical structure and the case of rotated texts
Salvatore Cristofaro; Pietro Sichera; Daria Spampinato
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 4 (2021) pp. 269 - 279
This paper describes the formal OntoBelliniLetters ontology, which covers the corpus of Vincenzo Bellini's letters kept at the Belliniano Civic Museum of Catania. The ontology is part of a wider effort, the BellinInRete project, one of whose aims is the development of a more general and complete ontology for the whole of Vincenzo Bellini's legacy preserved in the museum. The main concepts and relations that build up the ontology's knowledge base are described and discussed, and some of their formal properties are presented. The ontology schema is inspired by the CIDOC Conceptual Reference Model (CIDOC CRM).
Making heterogeneous smart home data interoperable with the SAREF ontology
Roderick van der Weerdt; Victor de Boer; Laura Daniele; Barry Nouwt; Ronald Siebes
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 4 (2021) pp. 280 - 293
SAREF is an ontology created to enable interoperability between smart devices, but the literature lacks practical examples of implementing SAREF in real applications. We validate the practical implementation of SAREF through two approaches. We first examine two methods for mapping the IoT data available in a smart home into linked data using SAREF: (1) creating a template-based mapping that describes how SAREF can be used, and (2) using a mapping language to demonstrate that the mapping can remain simple while still using SAREF. The second approach demonstrates the communication capabilities of IoT devices when they share knowledge represented using SAREF, and describes how SAREF enables interoperability between different devices. Together, the two approaches demonstrate that all the information from various smart-device data sets can successfully be transformed into the SAREF ontology, and show how SAREF can be applied in a concrete interoperability framework.
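A template-based mapping of the kind described can be sketched in a few lines. The device payload below is invented, and the triples are plain tuples rather than output from an RDF library; only the SAREF term names (saref:makesMeasurement, saref:hasValue, saref:isMeasuredIn) come from the actual ontology.

```python
# Illustrative template-based mapping of a hypothetical smart-home sensor
# reading into RDF-style (subject, predicate, object) triples using SAREF.
SAREF = "https://saref.etsi.org/core/"

reading = {"device": "thermo42", "type": "TemperatureSensor",
           "value": 21.5, "unit": "degree_Celsius"}

def map_reading(r):
    # Template: one device node, one measurement node, four triples.
    dev = f"urn:dev:{r['device']}"
    meas = f"{dev}/measurement/1"
    return [
        (dev, "rdf:type", SAREF + r["type"]),
        (dev, SAREF + "makesMeasurement", meas),
        (meas, SAREF + "hasValue", r["value"]),
        (meas, SAREF + "isMeasuredIn", r["unit"]),
    ]

triples = map_reading(reading)
```

A real pipeline would serialise these with an RDF library and a declared units vocabulary; the template structure, however, is the essential idea.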
Automated subject indexing using word embeddings and controlled vocabularies: a comparative study
Michalis Sfakakis; Leonidas Papachristopoulos; Kyriaki Zoutsou; Christos Papatheodorou; Giannis Tsakonas
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 4 (2021) pp. 233 - 243
Text mining methods contribute significantly to the understanding and management of digital content, increasing the potential of entry links. This paper introduces a method for subject analysis that combines topic modelling with automated labelling of the generated topics, exploiting terms from existing knowledge organisation systems. A testbed was developed in which the Latent Dirichlet Allocation (LDA) algorithm was deployed to model the topics of a corpus of papers in the Digital Library (DL) evaluation domain. The generated topics were represented as bag-of-words word embeddings and were used to retrieve terms from the EuroVoc Thesaurus and the Computer Science Ontology (CSO). The results of this study show that the DL domain can be described with different vocabularies, but that the context needs to be taken into account during automatic labelling.
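The labelling step can be illustrated with a toy stand-in: match an LDA topic's bag of words against controlled-vocabulary terms and pick the best-scoring term as the label. The paper matches via word embeddings; the sketch below substitutes a simple Jaccard overlap, and all topic words and vocabulary entries are invented.

```python
# Toy stand-in for labelling an LDA topic with a controlled-vocabulary term.
# Jaccard overlap replaces the paper's embedding similarity; data is invented.
topic = {"retrieval", "evaluation", "user", "digital", "library"}

vocabulary = {  # hypothetical thesaurus entries with associated word bags
    "information retrieval": {"retrieval", "search", "query", "information"},
    "usability testing": {"user", "evaluation", "usability", "testing"},
    "data mining": {"mining", "pattern", "data", "clustering"},
}

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Rank candidate labels by overlap with the topic's bag of words.
ranked = sorted(vocabulary, key=lambda term: jaccard(topic, vocabulary[term]),
                reverse=True)
```

Swapping Jaccard for cosine similarity over embedding vectors gives the embedding-based variant without changing the ranking structure.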
What process can a university follow for open data? The University of Crete case
Yannis Tzitzikas; Marios Pitikakis; Giorgos Giakoumis; Kalliopi Varouha; Eleni Karkanaki
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 4 (2021) pp. 254 - 268
All public bodies in Greece, including universities, are obliged to comply with the national legal framework and policy on open data. An emerging concern is how such a big and diverse organisation can develop supporting procedures, from an administrative, legal and technical standpoint, that enhance and expand the open data services it provides. In this paper, we describe our experience at the University of Crete in tackling these requirements. In particular, (a) we detail the steps of the process that we followed, (b) we show how an Open Data Catalogue can be exploited even in the first steps of this process, (c) we describe the platform that we selected and how we organised the catalogue and the metadata selection, (d) we describe the extensions that were required, (e) we motivate and describe various additional services that we developed and (f) we discuss the current status and possible next steps.
A method for archaeological and dendrochronological concept annotation using domain knowledge in information extraction
Andreas Vlachidis; Douglas Tudhope
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 3 (2021) pp. 192 - 203
Advances in Natural Language Processing allow the process of deriving information from large volumes of text to be automated. Attention is turned to one of the most important, but traditionally difficult to access, resources in archaeology, commonly known as 'grey literature'. This paper presents the development of two separate Named-Entity Recognition (NER) pipelines aimed at the extraction of archaeological and dendrochronological concepts, respectively, from Dutch texts. The role of domain vocabulary is discussed in the development of a Knowledge Organisation System (KOS)-driven, rule-based method of NER that makes complementary use of ontologies, thesauri and domain vocabulary for information extraction and for attribute assignment of semantic annotations. The NER task is challenged by a series of domain- and language-oriented aspects and is evaluated against a human-annotated gold standard. The results suggest the suitability of rule-based, KOS-driven approaches for attaining the low-hanging fruit of NER, using a combination of quality vocabulary and rules.
Interlinking and enrichment of disparate organisational data with LOD at application run-time
Sotiris Angelis; Konstantinos Kotis; Panagiotis Mouzakis
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 3 (2021) pp. 204 - 219
The present work focuses on the semantic integration, enrichment and interlinking of data that is semi-automatically generated by documenting artworks and their creators. In this work, we have been experimenting with RDFization and link-discovery tools, W3C standards and widely accepted vocabularies. This work has already been evaluated with museum data, emphasising the discovery of links between disparate data sets and external data sources at the back-end of the proposed approach. In this paper, while contributing a number of new and extended features at the back-end of this approach, we emphasise link discovery at the front-end, in order to interlink and enrich cultural data with LOD at application run-time and to facilitate the real-time and up-to-date exploitation of semantically integrated and LOD-enriched data. This is achieved by implementing a custom link-discovery method and evaluating it within a web application using LOD cloud data sources such as DBpedia and Europeana.
A workflow for supporting the evolution requirements of RDF-based semantic warehouses
Yannis Marketakis; Yannis Tzitzikas; Aureliano Gentile; Bracken Van Niekerk; Marc Taconet
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 3 (2021) pp. 220 - 232
Semantic data integration aims to exploit heterogeneous pieces of similar or complementary information to enable integrated browsing and querying services. A quite common approach is to transform data from the original sources according to a common graph-based data model and to construct a global semantic warehouse. The main problem is the periodic refreshing of the warehouse as the contents of the data sources change. This is a challenging requirement, not only because the transformations that were used for constructing the warehouse can be invalidated, but also because additional information may have been added to the semantic warehouse that needs to be preserved after every reconstruction. In this paper, we focus on this problem using a semantic warehouse that integrates data about fish stocks and fisheries from various information systems; we detail the requirements related to the evolution of semantic warehouses and propose a workflow for tackling them.
Introducing a novel bi-functional method for exploiting sentiment in complex information networks
Paraskevas Koukaras; Dimitrios Rousidis; Christos Tjortjis
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 3 (2021) pp. 157 - 169
This paper elaborates on multilayer Information Network (IN) modelling, utilising graph mining and machine learning. Although Social Media (SM) INs may be modelled as homogeneous networks, real-world networks contain multi-typed entities characterised by complex relations and interactions, posing as heterogeneous INs. To mine data whilst retaining semantic context in such complex structures, we need better ways of handling multi-typed and interconnected data. This work conceives and performs several simulations on SM data. The first simulation models information based on a bi-partite network schema. The second simulation utilises a star network schema, along with a graph database offering querying for graph metrics. The third simulation uses data from the previous simulations to generate a multilayer IN. The paper proposes a novel bi-functional method for sentiment extraction from user reviews/opinions across multiple SM platforms, drawing on the concepts of supervised/unsupervised learning and sentiment analysis.
Integrated classification schemas to interlink cultural heritage collections over the web using LOD technologies
Carlos Henrique Marcondes
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 3 (2021) pp. 170 - 177
Libraries, archives and museum collections are now being published over the web using LOD technologies. Many of them have thematic intersections or are related to other web subjects and resources, such as authorities, sites of historic events, online exhibitions, or articles in Wikipedia and its sibling resources DBpedia and Wikidata. The full potential of such LOD publishing initiatives rests heavily on the meaningful interlinking of these collections. In this context, vocabularies and classification schemas are important, as they provide meaning and context to heritage data. This paper proposes comprehensive classification schemas - a Culturally Relevant Relationships (CRR) vocabulary and a classification schema of types of heritage objects - to order, integrate and provide structure to the cultural heritage data brought about by the publication of heritage collections as LOD.
Institutional support for data management plans: case studies for a systematic approach
Yulia Karimova; Cristina Ribeiro; Gabriel David
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 3 (2021) pp. 178 - 191
Researchers have to ensure that their projects comply with Research Data Management (RDM) requirements, and the main funding agencies consequently require Data Management Plans (DMPs) in grant applications. Institutions are therefore investing in RDM tools and implementing RDM workflows to support their researchers. In this context, we propose a collaborative DMP-building method that involves researchers, data stewards and other parties where required. This method was applied as part of an RDM workflow in research groups across several scientific domains. We describe it as a systematic approach and illustrate it through a set of case studies. We also address the DMP monitoring process during the life cycle of projects. Feedback from the researchers highlighted the advantages of creating DMPs and the growing need for them, motivating further improvement of the DMP support process in line with the concept of machine-actionable DMPs and with the best practices of each scientific community.
A fuzzy logic and ontology-based approach for improving the CV and job offer matching in recruitment process
Amine Habous; El Habib Nfaoui
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 2 (2021) pp. 104 - 120
The recruitment process is a critical activity for every organisation: it finds the appropriate candidate for a job offer and its employer's work criteria. The competitive nature of the recruitment environment makes hiring new employees very hard for companies, owing to the high number of CVs (resumes) and profiles to process, personal job interests, customised requirements and the precise skills requested by employers, among other factors. Time becomes crucial for recruiters' choices and can consequently affect the quality of the selection process. In this paper, we propose a retrieval system for automating the matching between a candidate's CV and a job offer. It is designed on the basis of Natural Language Processing, machine learning and fuzzy logic to handle the matching between the job description and the CV, and it also considers the proficiency level of technology skills. Moreover, it offers an estimation of the overall CV/job-offer expertise level, thereby addressing the under-qualification and over-qualification issues in ICT (Information and Communication Technologies) recruitment. Experimental results on ground-truth data from a recruitment company demonstrate that our proposal provides effective results.
Systematic design and implementation of a semantic assistance system for aero-engine design and manufacturing
Sonika Gogineni; Jörg Brünnhäußer; Kai Lindow; Erik Paul Konietzko; Rainer Stark; Jonas Nickel; Heiko Witte
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 2 (2021) pp. 87 - 103
Data in organisations is often spread across various Information and Communication Technology (ICT) systems, leading to redundancies, lack of overview and time wasted searching for information during daily activities. This paper focuses on addressing these problems for an aerospace company by using semantic technologies to design and develop an assistance system on top of existing infrastructure. In the aero-engine industry, complex data systems for design, configuration, manufacturing and service data are common. Additionally, unstructured data and information from numerous sources become available during the product's life cycle. In this paper, a systematic approach is followed to design a system that integrates data silos through a common ontology. The paper highlights the problems being addressed and the approach selected to develop the system, along with the implementation of two use cases that support user activities in an aerospace company.
Links between research artefacts: use cases for digital libraries
Fidan Limani; Atif Latif; Klaus Tochtermann
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 2 (2021) pp. 133 - 143
The generation and availability of links between scholarly resources continue to increase. Initiatives to support them - both in terms of a (standard) representation model and of accompanying infrastructure for collection and exchange - make this emerging artefact interesting to explore. Its role in a more transparent, reproducible and, ultimately, richer research context makes it a valuable proposition for information infrastructures such as Digital Libraries. In this paper, we assess the potential of link artefacts for such an environment. We rely on a subset of a public link collection (>4.8 M links), which we represent following the Linked Data approach, resulting in a collection of >163.8 M RDF triples. The use cases incorporated in this study demonstrate the usefulness of this artefact. We claim that the adoption of links extends the scholarly data collection and advances the services a Digital Library offers to its users.
Ontology-based knowledge management in income tax of Nepal
Shib Raj Bhatta; Bhoj Raj Ghimire; Marut Buranarach
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 2 (2021) pp. 144 - 156
Organisational knowledge management in government agencies is crucial to creating a common understanding of data, systems and procedures among stakeholders. The main purpose of this work is to create a knowledge repository for the income tax of Nepal using a domain ontology. To create this ontology, we used 'Methontology', a widely used ontology development methodology. After knowledge acquisition, competency questions were created and the conceptualisation was specified. A class hierarchy, object properties and data properties were identified and specified. The ontology was then created using the Protégé tool. The consistency of the developed ontology was evaluated by correctly answering all the competency questions using SPARQL. The ontology can be used by the Inland Revenue Department as a stepping stone towards ontology-based knowledge management. The work can also serve as a foundation for future work in this domain on good governance, data sharing, interoperability and transparency for the government of Nepal.
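Evaluating an ontology by answering competency questions can be pictured with a toy in-memory triple store, where a pattern query stands in for SPARQL over the Protégé-built ontology. All class and property names below are invented for illustration, not taken from the paper's tax ontology.

```python
# Toy illustration of checking a competency question against ontology facts:
# a tiny triple store and a wildcard pattern query standing in for SPARQL.
# All tax-domain terms here are hypothetical.
triples = {
    ("ex:ResidentPerson", "rdfs:subClassOf", "ex:Taxpayer"),
    ("ex:IncomeTax", "ex:leviedOn", "ex:Income"),
    ("ex:Employment", "rdfs:subClassOf", "ex:IncomeSource"),
}

def query(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Competency question: "Which classes are kinds of income source?"
answers = [s for s, _, _ in query(p="rdfs:subClassOf", o="ex:IncomeSource")]
```

If a competency question returns an empty or wrong answer set, the ontology is missing axioms; a non-empty, correct set is evidence of adequate coverage.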
Keyphrase extraction from single textual documents based on semantically defined background knowledge and co-occurrence graphs
Mauro Dalle Lucca Tosi; Julio Cesar Dos Reis
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 2 (2021) pp. 121 - 132
The keyphrase extraction task is a fundamental and challenging task that aims to extract a set of keyphrases from textual documents. Keyphrases are essential to assist publishers in indexing documents and readers in identifying the most relevant ones. They are short phrases composed of one or more terms used to represent a textual document and its main topics. In this article, we extend our research on C-Rank, an unsupervised approach that automatically extracts keyphrases from single documents. C-Rank uses concept linking to connect the concepts of a single document with an external background knowledge base. We advance our study of C-Rank by evaluating it with different concept-linking approaches - Babelfy and DBpedia Spotlight. We evaluated C-Rank on data sets composed of academic articles, academic abstracts and news articles. Our findings indicate that C-Rank achieves state-of-the-art results in extracting keyphrases from scientific documents, as shown by experimental comparison with existing unsupervised approaches.
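The co-occurrence-graph side of such unsupervised extractors can be sketched compactly. The snippet below is a minimal stand-in, not C-Rank itself: it omits the concept-linking step entirely and scores single-word candidates by weighted degree in a word co-occurrence graph built over a sliding window.

```python
# Minimal sketch of keyphrase-candidate ranking over a co-occurrence graph,
# in the spirit of unsupervised extractors; no concept linking is performed.
from collections import defaultdict

tokens = ("semantic web ontology matching uses ontology alignment "
          "for semantic interoperability").split()
window = 2  # link each word to the next `window` words

cooc = defaultdict(int)
for i in range(len(tokens)):
    for j in range(i + 1, min(i + window + 1, len(tokens))):
        a, b = sorted((tokens[i], tokens[j]))
        if a != b:
            cooc[(a, b)] += 1  # undirected weighted edge

# Score each word by the total weight of its edges (weighted degree).
score = defaultdict(int)
for (a, b), w in cooc.items():
    score[a] += w
    score[b] += w

top = sorted(score, key=score.get, reverse=True)[:3]
```

A full system would add stopword filtering, multi-word candidate phrases and a PageRank-style centrality in place of raw degree.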
Applying cross-data set identity reasoning for producing URI embeddings over hundreds of RDF data sets
Michalis Mountantonakis; Yannis Tzitzikas
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 1 (2021) pp. 1 - 22
There is a proliferation of approaches that exploit RDF data sets for creating URI embeddings, i.e., embeddings produced by taking URI sequences (instead of simple words or phrases) as input, since such embeddings can be of primary importance for several tasks (e.g., machine learning tasks). However, existing techniques exploit only a single data set, or a few data sets, for creating URI embeddings. For this reason, we introduce a prototype, called LODVec, which exploits LODsyndesis to enable the creation of URI embeddings from hundreds of data sets simultaneously, after enriching them with the results of cross-data set identity reasoning. By using LODVec, it is feasible to produce URI sequences by following paths of any length (according to a given configuration), and the produced URI sequences are used as input for creating embeddings through the word2vec model. We provide comparative results evaluating the gain of using several data sets for creating URI embeddings, for the tasks of classification and regression and for finding the entities most similar to a given one.
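Producing URI sequences by following paths can be sketched as a random walk over an RDF graph, emitting entity and property URIs alternately; the sequences then feed a word2vec-style model. The graph and URIs below are invented, and training the embeddings themselves is out of scope for this sketch.

```python
# Sketch of generating URI sequences (walk paths) from a toy RDF graph, as
# input for a word2vec-style embedding model; graph and URIs are invented.
import random

edges = {  # subject -> list of (predicate, object)
    "ex:Aristotle": [("ex:bornIn", "ex:Stagira"), ("ex:studentOf", "ex:Plato")],
    "ex:Plato": [("ex:bornIn", "ex:Athens")],
    "ex:Stagira": [("ex:locatedIn", "ex:Greece")],
    "ex:Athens": [("ex:locatedIn", "ex:Greece")],
}

def uri_sequence(start, hops, rng):
    """Random walk of `hops` steps; the output alternates entity and
    property URIs, mirroring sentences fed to word2vec."""
    seq = [start]
    node = start
    for _ in range(hops):
        if node not in edges:
            break
        pred, obj = rng.choice(edges[node])
        seq += [pred, obj]
        node = obj
    return seq

walk = uri_sequence("ex:Aristotle", 2, random.Random(0))
```

Cross-data set identity reasoning would merge equivalent URIs before the walk, so paths can cross data set boundaries instead of stopping at them.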
Children's art museum collections as Linked Open Data
Konstantinos Kotis; Sotiris Angelis; Maria Chondrogianni; Efstathia Marini
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 1 (2021) pp. 60 - 70
It has recently been argued that it is beneficial for cultural institutions to provide their datasets as Linked Open Data, to achieve cross-referencing, interlinking and integration with other datasets in the LOD cloud. In this paper, we present the Greek Children's Art Museum (GCAM) linked dataset, along with dataset and vocabulary statistics, as well as lessons learned from the process of transforming the collections to HTML-embedded structured data using the Europeana Data Model and the Schema.org model. The dataset consists of three cultural collections of 121 child artworks (paintings), including detailed descriptions and interlinks to external datasets. In addition to presenting the GCAM data and the lessons learned from non-ICT experts' experimentation with the LOD paradigm, the paper introduces a new metric for measuring dataset quality in terms of links to and from other datasets.
Analysis of structured data on Wikipedia
Johny Moreira; Everaldo Costa Neto; Luciano Barbosa
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 1 (2021) pp. 71 - 86
Wikipedia has been widely used for information consumption and for implementing solutions using its content. It contains primarily unstructured text about entities, but it can also contain infoboxes, which are structured attributes describing these entities. Owing to their structured nature, infoboxes have been shown to be useful to many applications. In this work, we perform an extensive data analysis of different aspects of Wikipedia's structured data - infoboxes, templates and categories - aiming to uncover data issues and limitations and to guide researchers in the use of these structured data. We devise a framework to process, index and query the Wikipedia data, using it to analyse different scenarios such as the popularity of infoboxes, their size distribution and their usage across categories. Some of our findings are: only 54% of Wikipedia articles have infoboxes; there is a considerable amount of geographical and temporal information in infoboxes; and there is great heterogeneity among infoboxes within the same category.
An ontology-driven perspective on the emotional human reactions to social events
Danilo Cavaliere; Sabrina Senatore
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 1 (2021) pp. 23 - 38
Social media has become a fulcrum for sharing information on everyday-life events: people, companies and organisations express opinions about new products, political and social situations, football matches and concerts. Recognising feelings and reactions to events on social networks requires dealing with large amounts of streaming data, especially tweets, to investigate the main sentiments and opinions that underlie reactions. This paper presents an emotion-based classification model that extracts feelings from tweets related to an event or trend, described by a hashtag, and builds an emotional-concept ontology to study human reactions to events in context. From the tweet analysis, terms expressing a feeling are selected to build a topological space of emotion-based concepts. The extracted concepts serve to train a multi-class SVM classifier, which is used to perform soft classification aimed at identifying emotional reactions towards events. An ontology then arranges the classification results, enriched with additional DBpedia concepts. SPARQL queries on the final knowledge base provide specific insights that explain people's reactions towards events. Practical case studies and test results demonstrate the applicability and potential of the approach.
Persons, GLAM institutes and collections: an analysis of entity linking based on the COURAGE registry
Ghazal Faraj; András Micsik
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 1 (2021) pp. 39 - 49
Connecting encyclopaedic knowledge graphs by finding and linking nodes for the same entity is an important task. Various available automated linking solutions cannot be applied in situations where data is sparse or private, or where a high degree of correctness is expected. Wikidata has grown into a leading linking hub that collects entity identifiers from various registries and repositories. To get a picture of connectability, we analysed the linking methods and results between the COURAGE registry and Wikidata, VIAF, ISNI and ULAN. This paper describes our investigations and solutions for mapping and enriching entities in Wikidata. Each possible mapped pair of entities received a numeric reliability score. Using this score-based matching method, we tried to minimise the need for human decisions; hence, we introduce the term 'human decision window' for the mappings where neither acceptance nor refusal can be made automatically and safely. Furthermore, Wikidata has been enriched with related COURAGE entities and bi-directional links between mapped persons, organisations, collections and collection items. We also describe findings on the coverage and quality of the mappings among the above-mentioned authority databases.
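The human decision window amounts to two thresholds over the reliability score: accept above one, reject below the other, and route everything in between to a curator. The sketch below illustrates that routing; the thresholds, identifiers and scores are invented, not the paper's calibrated values.

```python
# Illustrative score-based matching with a "human decision window": pairs
# above ACCEPT are linked automatically, below REJECT discarded, and the
# window in between goes to a human curator. All values are hypothetical.
ACCEPT, REJECT = 0.85, 0.40

def route(pairs):
    decisions = {}
    for pair, score in pairs.items():
        if score >= ACCEPT:
            decisions[pair] = "accept"
        elif score <= REJECT:
            decisions[pair] = "reject"
        else:
            decisions[pair] = "human review"
    return decisions

candidates = {  # (COURAGE entity, Wikidata entity) -> reliability score
    ("courage:P1", "wd:Q101"): 0.93,
    ("courage:P2", "wd:Q202"): 0.61,
    ("courage:P3", "wd:Q303"): 0.12,
}
decisions = route(candidates)
```

Narrowing the window trades curator effort for risk: the tighter the thresholds, the fewer pairs need review but the more borderline errors slip through automatically.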
Documenting flooding areas calculation: a PROV approach
Monica De Martino; Alfonso Quarati; Sergio Rosim; Laércio Massaru Namikawa
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 1 (2021) pp. 50 - 59
Flooding events related to waste-lake dam ruptures are among the most threatening natural disasters in Brazil. They must be managed in advance by public institutions through the use of adequate hydrographic and environmental information. Although the Open Data paradigm offers an opportunity to share hydrographic data sets, their actual reuse is still low because of poor metadata quality. Our previous work highlighted a lack of detailed provenance information. This paper presents an Open Data approach to improve the release of hydrographic data sets. We discuss a methodology, based on W3C recommendations, for documenting the provenance of hydrographic data sets, considering the workflow activities related to the study of flood areas caused by waste-lake dam ruptures. We provide an illustrative example that documents, through the W3C PROV metadata model, the generation of flooding-area maps by integrating land-use classification from Sentinel images with hydrographic data sets produced by the Brazilian National Institute for Space Research.
A survey study on Arabic WordNet: baring opportunities and future research directions
Abdulmohsen S. Albesher; Osama B. Rabie
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 4 (2020) pp. 290 - 305
WordNet (WN) plays an essential role in knowledge management and information retrieval, as it allows for a better understanding of word relationships, which leads to more accurate text processing. The success of WN for the English language encouraged researchers to develop WNs for other languages, one of the most common of which is Arabic. However, the current state of affairs of the Arabic WN (AWN) has not been properly studied. This paper therefore presents a survey study on AWN, conducted to explore opportunities and possible future research directions. The results involve the synthesis of over 100 research papers on AWN, which were divided into categories and subcategories.
An ontology-based method for improving the quality of process event logs using database bin logs
Shokoufeh Ghalibafan; Behshid Behkamal; Mohsen Kahani; Mohammad Allahbakhsh
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 4 (2020) pp. 279 - 289
The main goal of process mining is to discover models from event logs. The usefulness of the discovered models is directly related to the quality of the event logs. Researchers have proposed various solutions to detect deficiencies and improve the quality of event logs; however, only a few have considered the application of a reliable external source for improving the quality of event data. In this paper, we propose a method to repair an event log using the database bin log. We show that database operations can be employed to overcome inadequacies in event logs, including incorrect and missing data. To this end, we first extract an ontology from each of the event log and the bin log. Then, we match the extracted ontologies and remove inadequacies from the event log. The results show the stability of our proposed model and its superiority over related works.
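The repair idea can be illustrated with a deliberately simplified stand-in: fill missing attribute values in log events from matching bin-log operations. The matching below is a plain key join, whereas the paper matches extracted ontologies; all field names and records are invented.

```python
# Minimal sketch of repairing an event log from a database bin log: missing
# attributes are filled from operations matched on (case, timestamp).
# A simple key join stands in for the paper's ontology matching.
event_log = [
    {"case": "c1", "activity": "register",
     "timestamp": "2020-01-01T09:00", "user": None},
    {"case": "c2", "activity": "approve",
     "timestamp": "2020-01-01T10:30", "user": "bob"},
]

bin_log = [  # database operations carrying the attributes the log lacks
    {"case": "c1", "operation": "INSERT",
     "timestamp": "2020-01-01T09:00", "user": "alice"},
]

def repair(events, operations):
    by_key = {(op["case"], op["timestamp"]): op for op in operations}
    for ev in events:
        op = by_key.get((ev["case"], ev["timestamp"]))
        if op is not None:
            for field in ("user",):  # fields recoverable from the bin log
                if ev[field] is None:
                    ev[field] = op[field]
    return events

repaired = repair(event_log, bin_log)
```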
Stress-testing big data platform to extract smart and interoperable food safety analytics
Ioanna Polychronou; Giannis Stoitsis; Mihalis Papakonstantinou; Nikos Manouselis
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 4 (2020) pp. 306 - 314
One of the significant challenges for the future is to guarantee safe food for all inhabitants of the planet. During the last 15 years, major fraud incidents such as the 2013 horse meat scandal and the 2008 Chinese milk scandal have greatly affected the food industry and public health. One alternative is to increase production, but accomplishing this requires innovative approaches that enhance the safety of the food supply chain. For this reason, it is quite important to have the right infrastructure for managing data in the food safety sector and for providing useful analytics to food safety experts. In this paper, we describe the architecture of Agroknow's Big Data Platform and examine its scalability for data management and experimentation.