Applying cross-data set identity reasoning for producing URI embeddings over hundreds of RDF data sets
Michalis Mountantonakis; Yannis Tzitzikas
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 1 (2021) pp. 1 - 22
There is a proliferation of approaches that exploit RDF data sets for creating URI embeddings, i.e., embeddings produced by taking as input URI sequences (instead of simple words or phrases), since they can be of primary importance for several tasks (e.g., machine learning tasks). However, existing techniques exploit only a single or a few data sets for creating URI embeddings. For this reason, we introduce a prototype, called LODVec, which exploits LODsyndesis to enable the creation of URI embeddings using hundreds of data sets simultaneously, after enriching them with the results of cross-data set identity reasoning. With LODVec, it is feasible to produce URI sequences by following paths of any length (according to a given configuration), and the produced URI sequences are used as input for creating embeddings through the word2vec model. We provide comparative results evaluating the gain of using several data sets for creating URI embeddings, for the tasks of classification and regression, and for finding the most similar entities to a given one.
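The path-following step described above can be illustrated with a minimal sketch: URI "sentences" are produced by walking a graph of triples, and each walk could then be fed to a word2vec implementation as a sentence. The toy graph, URIs and walk length below are invented for illustration and are not the authors' actual data or code.

```python
import random

# Hypothetical toy graph: node -> list of (property, neighbour) pairs.
edges = {
    "ex:Aristotle": [("ex:bornIn", "ex:Stagira"), ("ex:influenced", "ex:Alexander")],
    "ex:Stagira": [("ex:locatedIn", "ex:Greece")],
    "ex:Alexander": [("ex:bornIn", "ex:Pella")],
    "ex:Greece": [],
    "ex:Pella": [],
}

def uri_walks(start, length, n_walks, seed=0):
    """Produce URI sequences (entity, property, entity, ...) of up to
    `length` hops starting from `start`."""
    rng = random.Random(seed)
    walks = []
    for _ in range(n_walks):
        node, walk = start, [start]
        for _ in range(length):
            if not edges.get(node):
                break  # dead end: stop this walk early
            prop, nxt = rng.choice(edges[node])
            walk += [prop, nxt]
            node = nxt
        walks.append(walk)
    return walks

walks = uri_walks("ex:Aristotle", length=2, n_walks=3)
```

Each element of `walks` alternates entity and property URIs; treating such sequences as sentences is what lets a word2vec model learn URI embeddings.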
Children's art museum collections as Linked Open Data
Konstantinos Kotis; Sotiris Angelis; Maria Chondrogianni; Efstathia Marini
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 1 (2021) pp. 60 - 70
It has recently been argued that it is beneficial for cultural institutions to provide their datasets as Linked Open Data, in order to achieve cross-referencing, interlinking, and integration with other datasets in the LOD cloud. In this paper, we present the Greek Children's Art Museum (GCAM) linked dataset, along with dataset and vocabulary statistics, as well as lessons learned from the process of transforming the collections to HTML-embedded structured data using the Europeana Data Model and the Schema.org model. The dataset consists of three cultural collections of 121 child artworks (paintings), including detailed descriptions and interlinks to external datasets. In addition to the presentation of the GCAM data and the lessons learned from the experimentation of non-ICT experts with the LOD paradigm, the paper introduces a new metric for measuring dataset quality in terms of links to and from other datasets.
Analysis of structured data on Wikipedia
Johny Moreira; Everaldo Costa Neto; Luciano Barbosa
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 1 (2021) pp. 71 - 86
Wikipedia has been widely used for information consumption and for implementing solutions using its content. It primarily contains unstructured text about entities, but it can also contain infoboxes: structured sets of attributes describing these entities. Owing to their structured nature, infoboxes have been shown to be useful to many applications. In this work, we perform an extensive data analysis of different aspects of Wikipedia structured data (infoboxes, templates and categories), aiming to uncover data issues and limitations and to guide researchers in the use of these structured data. We devise a framework to process, index and query the Wikipedia data, using it to analyse different scenarios such as the popularity of infoboxes, their size distribution and their usage across categories. Some of our findings are: only 54% of Wikipedia articles have infoboxes; there is a considerable amount of geographical and temporal information in infoboxes; and there is great heterogeneity of infoboxes within the same category.
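The kind of coverage figure cited above ("only 54% of articles have infoboxes") rests on detecting infobox templates in article wikitext. A minimal sketch of that detection, with invented sample articles (real analyses would parse full Wikipedia dumps):

```python
import re

# Infobox templates start with "{{Infobox ..." in wikitext.
INFOBOX_RE = re.compile(r"\{\{\s*Infobox\b", re.IGNORECASE)

# Invented sample articles for illustration only.
articles = {
    "Ada Lovelace": "{{Infobox person\n| name = Ada Lovelace\n}} Ada was ...",
    "Mathematics": "Mathematics is the study of ...",
}

def has_infobox(wikitext):
    """Return True if the wikitext contains an infobox template."""
    return bool(INFOBOX_RE.search(wikitext))

# Fraction of articles carrying an infobox.
coverage = sum(has_infobox(t) for t in articles.values()) / len(articles)
```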
An ontology-driven perspective on the emotional human reactions to social events
Danilo Cavaliere; Sabrina Senatore
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 1 (2021) pp. 23 - 38
Social media has become a fulcrum for sharing information on everyday-life events: people, companies, and organisations express opinions about new products, political and social situations, football matches, and concerts. Recognising feelings and reactions to events from social networks requires dealing with large amounts of streaming data, especially tweets, to investigate the main sentiments and opinions behind those reactions. This paper presents an emotion-based classification model to extract feelings from tweets related to an event or a trend, described by a hashtag, and to build an emotional concept ontology for studying human reactions to events in context. From the tweet analysis, terms expressing a feeling are selected to build a topological space of emotion-based concepts. The extracted concepts serve to train a multi-class SVM classifier that performs soft classification aimed at identifying the emotional reactions towards events. An ontology then arranges the classification results, enriched with additional DBpedia concepts. SPARQL queries on the final knowledge base provide specific insights to explain people's reactions towards events. Practical case studies and test results demonstrate the applicability and potential of the approach.
Persons, GLAM institutes and collections: an analysis of entity linking based on the COURAGE registry
Ghazal Faraj; András Micsik
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 1 (2021) pp. 39 - 49
Connecting encyclopaedic knowledge graphs by finding and linking nodes that refer to the same entity is an important task. Various available automated linking solutions cannot be applied in situations where data is sparse or private, or where a high degree of correctness is expected. Wikidata has grown into a leading linking hub, collecting entity identifiers from various registries and repositories. To get a picture of connectability, we analysed the linking methods and results between the COURAGE registry and Wikidata, VIAF, ISNI and ULAN. This paper describes our investigations and solutions for mapping and enriching entities in Wikidata. Each possible mapped pair of entities received a numeric reliability score. Using this score-based matching method, we tried to minimise the need for human decisions; hence we introduced the term human decision window for mappings where neither acceptance nor refusal can be made automatically and safely. Furthermore, Wikidata has been enriched with related COURAGE entities and bi-directional links between mapped persons, organisations, collections, and collection items. We also describe findings on the coverage and quality of mapping among the above-mentioned authority databases.
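The "human decision window" idea reduces to a three-way decision over a reliability score: accept above an upper threshold, reject below a lower one, and defer to a human in between. A minimal sketch, with illustrative thresholds that are assumptions rather than the paper's actual values:

```python
# Illustrative thresholds; the paper's actual cut-offs are not given here.
ACCEPT, REJECT = 0.9, 0.4

def decide(score, accept=ACCEPT, reject=REJECT):
    """Three-way decision for a candidate entity mapping."""
    if score >= accept:
        return "auto-accept"       # safe to link automatically
    if score < reject:
        return "auto-reject"       # safe to discard automatically
    return "human-decision-window" # neither safe: defer to a curator
```

Tightening the window (raising `reject`, lowering `accept`) trades curator effort for automation risk, which is the central tuning decision in such score-based matching.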
Documenting flooding areas calculation: a PROV approach
Monica De Martino; Alfonso Quarati; Sergio Rosim; Laércio Massaru Namikawa
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 1 (2021) pp. 50 - 59
Flooding events related to waste-lake dam ruptures are among the most threatening natural disasters in Brazil. They must be managed in advance by public institutions through the use of adequate hydrographic and environmental information. Although the Open Data paradigm offers an opportunity to share hydrographic data sets, their actual reuse is still low because of poor metadata quality. Our previous work highlighted a lack of detailed provenance information. This paper presents an Open Data approach to improve the release of hydrographic data sets. We discuss a methodology, based on W3C recommendations, for documenting the provenance of hydrographic data sets, considering the workflow activities related to the study of flood areas caused by waste-lake dam ruptures. We provide an illustrative example that documents, through the W3C PROV metadata model, the generation of flooding-area maps by integrating land-use classification from Sentinel images with hydrographic data sets produced by the Brazilian National Institute for Space Research.
A survey study on Arabic WordNet: baring opportunities and future research directions
Abdulmohsen S. Albesher; Osama B. Rabie
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 4 (2020) pp. 290 - 305
WordNet (WN) plays an essential role in knowledge management and information retrieval, as it allows for a better understanding of word relationships, which leads to more accurate text processing. The success of WN for the English language encouraged researchers to develop WNs for other languages, Arabic being one of the most prominent. However, the current state of Arabic WN (AWN) has not been properly studied. Thus, this paper presents a survey study on AWN conducted to explore opportunities and possible future research directions. The results comprise a synthesis of over 100 research papers on AWN, which were divided into categories and subcategories.
An ontology-based method for improving the quality of process event logs using database bin logs
Shokoufeh Ghalibafan; Behshid Behkamal; Mohsen Kahani; Mohammad Allahbakhsh
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 4 (2020) pp. 279 - 289
The main goal of process mining is discovering models from event logs. The usefulness of these discovered models is directly related to the quality of the event logs. Researchers have proposed various solutions to detect deficiencies and improve the quality of event logs; however, only a few have considered the application of a reliable external source for improving the quality of event data. In this paper, we propose a method to repair an event log using the database bin log. We show that database operations can be employed to overcome inadequacies in event logs, including incorrect and missing data. To this end, we first extract an ontology from the event log and another from the bin log. Then, we match the extracted ontologies and remove inadequacies from the event log. The results show the stability of our proposed model and its superiority over related works.
Stress-testing big data platform to extract smart and interoperable food safety analytics
Ioanna Polychronou; Giannis Stoitsis; Mihalis Papakonstantinou; Nikos Manouselis
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 4 (2020) pp. 306 - 314
One of the significant challenges for the future is to guarantee safe food for all inhabitants of the planet. During the last 15 years, major fraud cases like the '2013 horse meat scandal' and the '2008 Chinese milk scandal' have greatly affected the food industry and public health. One way to address this challenge is to increase production, but accomplishing this requires innovative approaches that enhance the safety of the food supply chain. For this reason, it is important to have the right infrastructure to manage data in the food safety sector and to provide useful analytics to food safety experts. In this paper, we describe the architecture of Agroknow's Big Data Platform and examine its scalability for data management and experimentation.
Data aggregation lab: an experimental framework for data aggregation in cultural heritage
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 4 (2020) pp. 315 - 324
This paper describes the Data Aggregation Lab software, a system that implements the metadata aggregation workflow of cultural heritage, based on the underlying concepts and technologies of the Web of Data. These aggregation technologies can fulfil the functional requirements for metadata harvesting in cultural heritage and at the same time allow cultural heritage data to be globally interoperable with internet search engines and the Web of Data. The Data Aggregation Lab provides a framework to support several research activities within the Europeana network, such as conducting case studies, providing reference implementations, and supporting technology adoption. It provides working implementations for metadata aggregation methods with which our research has obtained positive results. These methods apply linked data, Schema.org, IIIF, Sitemaps and RDF-related technologies for innovation in data aggregation, data analysis and data conversion for cultural heritage data. The software is available for reuse and is open-sourced.
Automatic metadata extraction via image processing using Migne's Patrologia Graeca
Evagelos Varthis; Marios Poulos; Ilias Giarenis; Sozon Papavlasopoulos
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 4 (2020) pp. 265 - 278
A wealth of knowledge is kept in libraries and cultural institutions in various digital forms without, however, the possibility of a simple term search, let alone of a substantial semantic search. In this study, a novel approach is proposed which strives to recognise words and automatically generate metadata from large machine-printed corpora such as Migne's Patrologia Graeca (PG). The proposed framework first applies an efficient word segmentation and then transforms the word-images into special compact shapes. For the comparison, we use Hu's invariant moments for discarding unlikely matches, Shape Context (SC) for contour similarity and Pearson's Correlation Coefficient (PCC) for final verification. Comparative results are presented using the Long Short-Term Memory (LSTM) Neural Network (NN) engine of the Tesseract Optical Character Recognition (OCR) system instead of PCC. In addition, an intelligent scenario is proposed for the automatic generation of PG metadata by librarians.
Semantic similarity measurement: an intrinsic information content model
Abhijit Adhikari; Biswanath Dutta; Animesh Dutta; Deepjyoti Mondal
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 3 (2020) pp. 218 - 233
Ontology-dependent Semantic Similarity (SS) measurement has emerged as a new research paradigm for finding the semantic strength between any two entities. In this regard, as observed, the information-theoretic intrinsic approach yields better accuracy in correlation with human cognition. The precision of such a technique depends highly on how accurately we calculate the Information Content (IC) of concepts and on its compatibility with an SS model. In this work, we develop an intrinsic IC model to facilitate better SS measurement. The proposed model has been evaluated using three vocabularies, namely SNOMED CT, MeSH and WordNet, against a set of benchmark data sets. We compare the results with state-of-the-art IC models. The results show that the proposed intrinsic IC model yields a high correlation with human assessment. The article also evaluates the compatibility of the proposed IC model and other existing IC models in combination with a set of state-of-the-art SS models.
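The abstract does not reproduce the authors' IC formulation, but as background, a widely used intrinsic IC model (Seco et al.) derives IC purely from taxonomy structure: IC(c) = 1 - log(hypo(c) + 1) / log(N), where hypo(c) is the number of hyponyms of concept c and N is the total number of concepts. A minimal sketch of that baseline (not the paper's proposed model):

```python
import math

def intrinsic_ic(n_hyponyms, n_concepts):
    """Seco-style intrinsic Information Content: concepts with fewer
    hyponyms (more specific concepts) are more informative."""
    return 1.0 - math.log(n_hyponyms + 1) / math.log(n_concepts)

# A leaf concept (no hyponyms) gets the maximum IC of 1.0;
# the root, subsuming every other concept, gets IC 0.
leaf_ic = intrinsic_ic(0, 1000)
root_ic = intrinsic_ic(999, 1000)
```

Intrinsic models like this need no external corpus frequencies, which is what makes their accuracy hinge on the taxonomy-derived counts, as the abstract notes.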
Formalisation and classification of grammar- and template-mediated techniques for model and ontology verbalisation
Zola Mahlaza; C. Maria Keet
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 3 (2020) pp. 249 - 262
Computational tools that translate modelling languages into a restricted natural language can improve end-user involvement in modelling. Templates are a popular approach for such translation and are often paired with computational grammar rules to support grammatical complexity and obtain better-quality sentences. There is no explicit specification of the relations used for pairing templates with grammar rules, so it is challenging to compare such templates' suitability for less-resourced languages, where grammar reuse is vital in reducing development effort. To enable such comparisons, we devise a model of pairing templates and rules, and assess its applicability by considering 54 existing systems for classification, 16 of them in detail. Our classification shows that most grammar-infused template systems support detachable grammar rules and that half of them introduce syntax trees for multilingualism or error checking. Furthermore, of the 16 considered grammar-infused template systems, most do not currently support any form of aggregation (63%) or the embedding of verb conjugation rules (81%); hence, if such features were required, they would need to be implemented from the ground up.
Automatic classification of digital objects for improved metadata quality of electronic theses and dissertations in institutional repositories
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 3 (2020) pp. 234 - 248
Higher education institutions typically employ Institutional Repositories (IRs) in order to curate and make available Electronic Theses and Dissertations (ETDs). While most of these IRs are implemented with self-archiving functionalities, self-archiving practices are still a challenge. This arguably leads to inconsistencies in the tagging of digital objects with descriptive metadata, potentially compromising searching and browsing of scholarly research output in IRs. This paper proposes an approach to automatically classify ETDs in IRs, using supervised machine learning techniques, by extracting features from the minimum possible input expected from document authors: the ETD manuscript. The experiment results demonstrate the feasibility of automatically classifying IR ETDs and, additionally, ensuring that repository digital objects are appropriately structured. Automatic classification of repository objects has the obvious benefit of improving the searching and browsing of content in IRs and further presents opportunities for the implementation of third-party tools and extensions that could potentially result in effective self-archiving strategies.
Modelling weightlifting 'Training-Diet-Competition' cycle following a modular and scalable approach
Piyaporn Tumnark; Paulo Cardoso; Jorge Cabral; Filipe Conceição
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 3 (2020) pp. 185 - 196
Studies in weightlifting have been characterised by unclear results and information paucity, mainly due to the lack of information sharing between athletes, coaches, biomechanists, physiologists and nutritionists. These experts' knowledge is not captured, classified or integrated into an information system for decision-making. An ontology-driven knowledge model for Olympic weightlifting was developed to leverage a better understanding of the weightlifting domain as a whole, bringing together the related knowledge domains of training methodology, weightlifting biomechanics and dietary regimes, while modelling the synergy among them. It unifies terminology, semantics and concepts among sport scientists, coaches, nutritionists and athletes to partially obviate the recognised limitations and inconsistencies, leading to the provision of superior coaching and a research environment which promotes better understanding and more conclusive results. The ontology-assisted weightlifting knowledge base consists of 110 classes, 50 object properties, 92 data properties and 167 inheritance relationships among concepts, in a total of 1761 axioms, alongside 23 SWRL rules.
An algorithm to generate short sentences in natural language from linked open data based on linguistic templates
Augusto Lopes Da Silva; Sandro José Rigo; Jéssica Braun De Moraes
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 3 (2020) pp. 197 - 208
The generation of natural language phrases from Linked Open Data can benefit from the significant amount of information available on the internet, mostly in RDF format, and from the properties these data contain. Such properties can represent semantic relationships between concepts that help in creating sentences in natural language. Nevertheless, research in this field tends not to use this property information, and we argue that exploiting it might foster the generation of more natural phrases. In this scenario, this research explores RDF properties for the generation of natural language phrases. The short sentences generated by the algorithm implementation were evaluated for fluency by linguists and native English speakers. The results show that the generated sentences are promising with regard to fluency.
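The core of template-based verbalisation from triples can be sketched in a few lines: each RDF property maps to a linguistic template with subject and object slots. The property names and templates below are invented for illustration and are not the paper's actual template inventory.

```python
# Hypothetical property-to-template mapping; real systems would cover
# many properties and handle agreement, tense, etc.
TEMPLATES = {
    "dbo:birthPlace": "{s} was born in {o}.",
    "dbo:author": "{o} wrote {s}.",
}

def verbalise(subject, prop, obj):
    """Render one RDF-style triple as a short natural language sentence."""
    template = TEMPLATES.get(prop)
    if template is None:
        return None  # no linguistic template for this property
    return template.format(s=subject, o=obj)

sentence = verbalise("Ada Lovelace", "dbo:birthPlace", "London")
```

The paper's point is that the choice of property drives the sentence structure, which is exactly what a mapping like `TEMPLATES` encodes.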
Towards linked open government data in Canada
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 3 (2020) pp. 209 - 217
Governments are publishing enormous amounts of open data on the web every day in an effort to increase transparency and reusability. Linking data from multiple sources on the web enables the performance of advanced data analytics, which can lead to the development of valuable services and data products. However, Canada's open government data portals are isolated from one another and remain unlinked to other resources on the web. In this paper, we first expose the statistical data sets in Canadian provincial open data portals as Linked Data, and then integrate them using RDF Cube vocabulary, thereby making different open data portals available through a single search endpoint. We leverage Semantic Web Technologies to publish open data sets taken from two provincial portals (Nova Scotia and Alberta) as RDF (the Linked Data format), and to connect them to one another. The success of our approach illustrates its high potential for linking open government data sets across Canada, which will in turn enable greater data accessibility and improved search results.
Citation content/context data as a source for research cooperation analysis
Sergey Parinov; Victoria Antonova
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 2 (2020) pp. 149 - 157
Using citation relationships, one can build three groups of papers: (1) the papers of a selected author; (2) the papers cited by that author; and (3) the papers citing that author. The authors of papers in these three groups can be presented as a fragment of a research cooperation network, because they use/cite each other's research outputs. Their papers' full texts, and especially the contexts of their in-text citations, contain information about the character of this research cooperation. We present a concept of research cooperation based on publications, together with the current results of the Cirtec project for building research cooperation characteristics. This work is based on the processing of citation content/context data. The results include an online service for authors to monitor the citation content data extractions, and three types of built indicators/parameters: co-citation statistics, the spatial distribution of citations over papers' bodies, and topic models for citation contexts.
From the web of bibliographic data to the web of bibliographic meaning: structuring, interlinking and validating ontologies on the semantic web
Helena Simões Patrício; Maria Inês Cordeiro; Pedro Nogueira Ramos
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 2 (2020) pp. 124 - 134
Bibliographic data sets have revealed good levels of technical interoperability observing the principles and good practices of linked data. However, they have a low level of quality from the semantic point of view, due to many factors: lack of a common conceptual framework for a diversity of standards often used together, reduced number of links between the ontologies underlying data sets, proliferation of heterogeneous vocabularies, underuse of semantic mechanisms in data structures, "ontology hijacking" (Feeney et al., 2018), point-to-point mappings, as well as limitations of semantic web languages for the requirements of bibliographic data interoperability. After reviewing such issues, a research direction is proposed to overcome the misalignments found by means of a reference model and a superontology, using Shapes Constraint Language (SHACL) to solve current limitations of RDF languages.
Service traceability in SOA-based software systems: a traceability network add-in for BPAOntoSOA framework
Rana Yousef; Sarah Imtera
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 2 (2020) pp. 169 - 183
BPAOntoSOA is a generic framework that generates a service model from a given organisational business process architecture. Service-Oriented Architecture (SOA) traceability is essential for facilitating change management and supporting the reusability of an SOA; it has wide application in the development and maintenance process. Such a traceability network is not available for the BPAOntoSOA framework. This paper introduces an ontology-based traceability network for the BPAOntoSOA framework that semantically generates trace links between services and business process architectural elements in both forward and backward directions. The proposed traceability approach was evaluated using a postgraduate faculty information system case study in order to assess the framework's behaviour in general. As a continued evaluation effort, a group of parameters was selected to create an evaluation criterion, which was used to compare the BPAOntoSOA trace solution to one of the most closely related traceability frameworks, the STraS traceability framework.
CMDIfication process for textbook resources
Francesca Fallucchi; Ernesto William De Luca
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 2 (2020) pp. 135 - 148
Interoperability between heterogeneous resources and services is the key to a correctly functioning digital infrastructure that can provide shared resources in one place. We analyse the establishment of a standardised common infrastructure covering metadata, content, and inferred knowledge to allow collaborative work between researchers in the humanities. In this paper, we discuss how to provide a CMDI (Component MetaData Infrastructure) profile for textbooks, in order to integrate it into the Common Language Resources and Technology Infrastructure (CLARIN) and thus make the data available openly and according to the FAIR principles. We focus on the 'CMDIfication' process, which fulfils the needs of our related projects. We describe a process of building resources with CMDI descriptions from Text Encoding Initiative (TEI), Metadata Encoding and Transmission Standard (METS) and Dublin Core (DC) metadata, testing it on the textbook resources of the Georg Eckert Institute (GEI).
Intermediary XML schemas: constraint, templating and interoperability in complex environments
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 2 (2020) pp. 88 - 97
This article introduces the methodology of intermediary schemas for complex metadata environments. Metadata in instances conforming to these schemas is not generally intended for dissemination; it must usually be transformed by XSLT transformations to generate instances conforming to the referent schemas to which they mediate. The methodology is designed to enhance the interoperability of complex metadata within XML architectures and incorporates three subsidiary methods: project-specific schemas which act as constrained mediators to over-complex or over-flexible referents (Method 1), templates or conceptual maps from which instances may be generated (Method 2), and serialised maps of instances conforming to their referent schemas (Method 3). The three methods are detailed, and their applications to current research in digital ecosystems, archival description, and digital asset management and preservation are examined. A possible synthesis of the three is also proposed, to enable the methodology to operate within a single schema, the Metadata Encoding and Transmission Standard (METS).
Unique challenges facing Linked Data implementation for National Educational Television
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 2 (2020) pp. 98 - 111
Implementing Linked Data involves a costly process of converting metadata into an exchange format substantially different from traditional library "records-based" exchange. To achieve full implementation, it is necessary to navigate a complex process of data modelling, crosswalking, and publishing. This paper documents the transition of a data set of National Educational Television (NET) collection records to a "data-based" exchange environment of Linked Data by discussing challenges faced during the conversion. These challenges include silos like the Library's media asset management system Merged Audio-Visual Information System (MAVIS), aligning PBCore with the bibliographic Linked Data model BIBFRAME, modelling differences in works between archival moving image cataloguing and other domains using Entertainment Identifier Registry IDs (EIDR IDs), and possible alignments with EBUCore (the European Broadcasting Union Linked Data model) to address gaps between PBCore and BIBFRAME.
Exploring the utility of metadata record graphs and network analysis for metadata quality evaluation and augmentation
Mark Edward Phillips; Oksana L. Zavalina; Hannah Tarver
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 2 (2020) pp. 112 - 123
Our study explores the possible uses and effectiveness of network analysis, including Metadata Record Graphs, for evaluating collections of metadata records at scale. We present the results of an experiment applying these methods to records in the University of North Texas (UNT) Digital Library and two sub-collections of different compositions: the UNT Scholarly Works collection, which functions as an institutional repository, and a collection of architectural slide images. The data includes count- and value-based statistics with network metrics for every Dublin Core element in each set. The study finds that network analysis provides useful information that supplements other metrics, for example by identifying records that are completely unconnected to other items through the subject, creator, or other field values. Additionally, network density may help managers identify collections or records that could benefit from enhancement. We also discuss the constraints of these metrics and suggest possible future applications.
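The metadata record graph described above can be sketched without any graph library: records become nodes, an edge connects two records that share a value in a chosen field, and network density summarises connectedness. The sample records below are invented; a real analysis would read them from the repository.

```python
from itertools import combinations

# Invented sample records; values are sets of subject terms.
records = {
    "rec1": {"subject": {"maps", "texas"}},
    "rec2": {"subject": {"texas", "history"}},
    "rec3": {"subject": {"architecture"}},  # shares nothing: stays isolated
}

def record_graph_edges(records, field):
    """Edge between two records iff they share a value in `field`."""
    edges = set()
    for a, b in combinations(records, 2):
        if records[a][field] & records[b][field]:
            edges.add((a, b))
    return edges

def density(n_nodes, n_edges):
    """Fraction of possible undirected edges that are present."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1)) if n_nodes > 1 else 0.0

edges = record_graph_edges(records, "subject")
d = density(len(records), len(edges))
```

Isolated nodes such as `rec3` are exactly the "completely unconnected" records the study flags, and a low density suggests a collection that could benefit from metadata enhancement.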
The future of interlinked, interoperable and scalable metadata
Getaneh Alemu; Emmanouel Garoufallou
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 2 (2020) pp. 81 - 87
With the growing diversity of information resources, data-centric applications such as big data, metadata, semantics and ontologies have become central. This editorial paper presents a summary of recent developments in metadata, semantics and ontologies, focusing in particular on metadata enriching, linking and interoperability. National libraries and archives are devising new bibliographic models and metadata presentation formats, and bibliographic metadata sets are being made available in new data formats such as RDF. These new formats aim to represent data in granular structures and to define unique identification protocols such as URIs. The paper concludes by introducing the five papers included in the special issue, which present novel approaches to metadata integration, interoperability frameworks, re-use of metadata ontologies, and methods of metadata quality analysis.