A method for archaeological and dendrochronological concept annotation using domain knowledge in information extraction
Andreas Vlachidis; Douglas Tudhope
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 3 (2021) pp. 192 - 203
Advances in Natural Language Processing allow the process of deriving information from large volumes of text to be automated. Attention is turned to one of the most important, but traditionally difficult to access, resources in archaeology, commonly known as 'grey literature'. This paper presents the development of two separate Named-Entity Recognition (NER) pipelines aimed at the extraction of Archaeological and of Dendrochronological concepts in Dutch, respectively. The role of domain vocabulary is discussed for the development of a Knowledge Organisation System (KOS)-driven, Rule-Based method of NER which makes complementary use of ontology, thesauri and domain vocabulary for information extraction and attribute assignment of semantic annotations. The NER task is challenged by a series of domain and language-oriented aspects and evaluated against a human-annotated Gold Standard. The results suggest the suitability of rule-based, KOS-driven approaches for attaining the low-hanging fruit of NER, using a combination of quality vocabulary and rules.
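As a rough illustration of vocabulary-driven, rule-based annotation of the kind this abstract describes, the sketch below looks up terms from a small vocabulary in a text and attaches a concept label to each hit. The vocabulary entries, the concept labels and the function name `annotate` are hypothetical and are not taken from the paper's pipelines.

```python
import re

# Hypothetical mini-vocabulary: surface term -> concept label.
VOCABULARY = {
    "pottery": "Artefact",
    "post hole": "Feature",
    "oak": "Material",
}

def annotate(text, vocabulary=VOCABULARY):
    """Return (start, end, matched_text, label) tuples for each vocabulary hit."""
    annotations = []
    for term, label in vocabulary.items():
        # Word boundaries stop a term from matching inside a longer word,
        # e.g. "oak" inside "soaked".
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            annotations.append((match.start(), match.end(), match.group(0), label))
    # Sort by position in the text.
    return sorted(annotations)

sample = "A post hole contained pottery and soaked oak fragments."
result = annotate(sample)
```

A real pipeline would draw its terms from thesauri and an ontology rather than a hard-coded dictionary, and would add rules for negation, compounds and attribute assignment.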
Interlinking and enrichment of disparate organisational data with LOD at application run-time
Sotiris Angelis; Konstantinos Kotis; Panagiotis Mouzakis
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 3 (2021) pp. 204 - 219
The present work focuses on the semantic integration, enrichment and interlinking of data that is semi-automatically generated by documenting artworks and their creators. In this work, we have been experimenting with RDFization and links discovery tools, W3C standards and widely accepted vocabularies. This work has already been evaluated with museum data, emphasising the discovery of links between disparate data sets and external data sources at the back-end of the proposed approach. In this paper, while contributing a number of other new and extended features at the back-end of this approach, we emphasise links discovery at the front-end in order to interlink and enrich cultural data with LOD at application run-time and facilitate the real-time and up-to-date exploitation of semantically integrated and LOD-enriched data. This is achieved by implementing a custom links discovery method and evaluating it within a web application using LOD cloud data sources such as DBpedia and Europeana.
A workflow for supporting the evolution requirements of RDF-based semantic warehouses
Yannis Marketakis; Yannis Tzitzikas; Aureliano Gentile; Bracken Van Niekerk; Marc Taconet
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 3 (2021) pp. 220 - 232
Semantic data integration aims to exploit heterogeneous pieces of similar or complementary information for enabling integrated browsing and querying services. A quite common approach is the transformation of the original sources with respect to a common graph-based data model and the construction of a global semantic warehouse. The main problem is the periodic refreshment of the warehouse, as the contents from the data sources change. This is a challenging requirement, not only because the transformations that were used for constructing the warehouse can be invalidated, but also because additional information may have been added in the semantic warehouse, which needs to be preserved after every reconstruction. In this paper, we focus on this particular problem using a semantic warehouse that integrates data about stocks and fisheries from various information systems. We detail the requirements related to the evolution of semantic warehouses and propose a workflow for tackling them.
Introducing a novel bi-functional method for exploiting sentiment in complex information networks
Paraskevas Koukaras; Dimitrios Rousidis; Christos Tjortjis
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 3 (2021) pp. 157 - 169
This paper elaborates on multilayer Information Network (IN) modelling, utilising graph mining and machine learning. Although Social Media (SM) INs may be modelled as homogeneous networks, real-world networks contain multi-typed entities, characterised by complex relations and interactions, and are therefore heterogeneous INs. For mining data whilst retaining semantic context in such complex structures, we need better ways of handling multi-typed and interconnected data. This work conceives and performs several simulations on SM data. The first simulation models information based on a bi-partite network schema. The second simulation utilises a star network schema, along with a graph database offering querying for graph metrics. The third simulation handles data from the previous simulations to generate a multilayer IN. The paper proposes a novel bi-functional method for sentiment extraction of user reviews/opinions across multiple SM platforms, considering the concepts of supervised/unsupervised learning and sentiment analysis.
Integrated classification schemas to interlink cultural heritage collections over the web using LOD technologies
Carlos Henrique Marcondes
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 3 (2021) pp. 170 - 177
Libraries, archives and museum collections are now being published over the web using LOD technologies. Many of them have thematic intersections or are related to other web subjects and resources such as authorities, sites for historic events, online exhibitions, or to articles in Wikipedia and its sibling resources DBpedia and Wikidata. The full potential of such published initiatives using LOD rests heavily on the meaningful interlinking of such collections. Within this context, vocabularies and classification schemas are important, as they provide meaning and context to heritage data. This paper proposes comprehensive classification schemas - a Culturally Relevant Relationships (CRR) vocabulary and a classification schema of types of heritage objects - to order, integrate and provide structure to cultural heritage data brought about with the publication of heritage collections as LOD.
Institutional support for data management plans: case studies for a systematic approach
Yulia Karimova; Cristina Ribeiro; Gabriel David
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 3 (2021) pp. 178 - 191
Researchers have to ensure that their projects comply with Research Data Management (RDM) requirements. Consequently, the main funding agencies require Data Management Plans (DMPs) for grant applications. So, institutions are investing in RDM tools and implementing RDM workflows in order to support their researchers. In this context, we propose a collaborative DMP-building method that involves researchers, data stewards and other parties if required. This method was applied as part of an RDM workflow in research groups across several scientific domains. We describe it as a systematic approach and illustrate it through a set of case studies. We also address the DMP monitoring process during the life cycle of projects. The feedback from the researchers highlighted the advantages of creating DMPs and their growing need. So, there is motivation to improve the DMP support process according to the machine-actionable DMPs concept and to the best practices in each scientific community.
A fuzzy logic and ontology-based approach for improving the CV and job offer matching in recruitment process
Amine Habous; El Habib Nfaoui
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 2 (2021) pp. 104 - 120
The recruitment process is a critical activity for every organisation, as it allows the organisation to find the appropriate candidate for a job offer and its employer work criteria. The competitive nature of the recruitment environment makes the task of hiring new employees very hard for companies due to the high number of CVs (resumes) and profiles to process, the personal job interests, the customised requirements and precise skills requested by employers, etc. Time becomes crucial for recruiters' choices; consequently, it might impact the quality of the selection process. In this paper, we propose a retrieval system for automating the matching process between the candidate CV and the job offer. It is designed based on Natural Language Processing, machine learning and fuzzy logic to handle the matching between the job description and the CV. It also considers the proficiency level for the technology skills. Moreover, it offers an estimation of the overall CV/job offer expertise level. In that way, it overcomes the under-qualification and over-qualification issues in the ICT (Information and Communication Technologies) recruitment process. Experimental results on ground-truth data from a recruitment company demonstrate that our proposal provides effective results.
Systematic design and implementation of a semantic assistance system for aero-engine design and manufacturing
Sonika Gogineni; Jörg Brünnhäußer; Kai Lindow; Erik Paul Konietzko; Rainer Stark; Jonas Nickel; Heiko Witte
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 2 (2021) pp. 87 - 103
Data in organisations is often spread across various Information and Communication Technology (ICT) systems, leading to redundancies, lack of overview and time wasted searching for information while carrying out daily activities. This paper focuses on addressing these problems for an aerospace company by using semantic technologies to design and develop an assistance system using existing infrastructure. In the aero-engine industry, complex data systems for design, configuration, manufacturing and service data are common. Additionally, unstructured data and information from numerous sources become available during the product's life cycle. In this paper, a systematic approach is followed to design a system, which integrates data silos by using a common ontology. This paper highlights the problems being addressed, the approach selected to develop the system, along with the implementation of two use cases to support user activities in an aerospace company.
Links between research artefacts: use cases for digital libraries
Fidan Limani; Atif Latif; Klaus Tochtermann
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 2 (2021) pp. 133 - 143
The generation and availability of links between scholarly resources continues to increase. Initiatives to support it - both in terms of a (standard) representation model and accompanying infrastructure for collection and exchange - make this emerging artefact interesting to explore. Its role towards a more transparent, reproducible and, ultimately, richer research context makes it a valuable proposition for information infrastructures such as Digital Libraries. In this paper, we assess the potential of link artefacts for such an environment. We rely on a subset of a public link collection (>4.8 M links), which we represent following the Linked Data approach, resulting in a collection of >163.8 M RDF triples. The incorporated use cases demonstrate the usefulness of this artefact in this study. We claim that the adoption of links extends the scholarly data collection and advances the services a Digital Library offers to its users.
Ontology-based knowledge management in income tax of Nepal
Shib Raj Bhatta; Bhoj Raj Ghimire; Marut Buranarach
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 2 (2021) pp. 144 - 156
Organisational knowledge management in government agencies is crucial to create a common understanding of data, systems and procedures among stakeholders. The main purpose of this work is to create a knowledge repository for the income tax of Nepal using a domain ontology. To create this ontology, we used 'Methontology', a widely used ontology development methodology. After knowledge acquisition, competency questions were created and the conceptualisation was specified. A class hierarchy, object properties and data properties were identified and specified. The ontology was then created using the Protégé tool. The consistency of the developed ontology was evaluated by correctly answering all the competency questions using SPARQL. The ontology can be used by the Inland Revenue Department as a stepping stone towards knowledge management through ontology. The work can be a foundation for future work in this domain on good governance, data sharing, interoperability and transparency for the government of Nepal.
Keyphrase extraction from single textual documents based on semantically defined background knowledge and co-occurrence graphs
Mauro Dalle Lucca Tosi; Julio Cesar Dos Reis
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 2 (2021) pp. 121 - 132
Keyphrase extraction is a fundamental and challenging task that aims to extract a set of keyphrases from textual documents. Keyphrases are essential to assist publishers in indexing documents and readers in identifying the most relevant ones. They are short phrases composed of one or more terms used to represent a textual document and its main topics. In this article, we extend our research on C-Rank, which is an unsupervised approach that automatically extracts keyphrases from single documents. C-Rank uses concept-linking to link concepts in common between single documents and an external background knowledge base. We advance our study of C-Rank by evaluating it using different concept-linking approaches - Babelfy and DBpedia Spotlight. We evaluated C-Rank on data sets composed of academic articles, academic abstracts, and news articles. Our experimental comparison with existing unsupervised approaches indicates that C-Rank achieves state-of-the-art results in extracting keyphrases from scientific documents.
Applying cross-data set identity reasoning for producing URI embeddings over hundreds of RDF data sets
Michalis Mountantonakis; Yannis Tzitzikas
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 1 (2021) pp. 1 - 22
There is a proliferation of approaches that exploit RDF data sets for creating URI embeddings, i.e., embeddings that are produced by taking as input URI sequences (instead of simple words or phrases), since they can be of primary importance for several tasks (e.g., machine learning tasks). However, existing techniques exploit either a single or a few data sets for creating URI embeddings. For this reason, we introduce a prototype, called LODVec, which exploits LODsyndesis for enabling the creation of URI embeddings by using hundreds of data sets simultaneously, after enriching them with the results of cross-data set identity reasoning. By using LODVec, it is feasible to produce URI sequences by following paths of any length (according to a given configuration), and the produced URI sequences are used as input for creating embeddings through the word2vec model. We provide comparative results for evaluating the gain of using several data sets for creating URI embeddings, for the tasks of classification and regression, and for finding the most similar entities to a given one.
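To make the idea of producing URI sequences by following paths of a configured length more concrete, here is a minimal sketch over a toy graph. The graph, the `ex:` URIs and the function names are hypothetical; this is not LODVec's implementation, and the sequences are only generated, not passed to word2vec.

```python
import random

# Hypothetical toy graph: subject URI -> list of (predicate URI, object URI).
GRAPH = {
    "ex:Aristotle": [("ex:bornIn", "ex:Stagira"), ("ex:studentOf", "ex:Plato")],
    "ex:Plato": [("ex:bornIn", "ex:Athens")],
    "ex:Stagira": [],
    "ex:Athens": [],
}

def walk(graph, start, max_hops, rng):
    """Follow up to max_hops outgoing edges from start; return the URI sequence."""
    sequence = [start]
    current = start
    for _ in range(max_hops):
        edges = graph.get(current, [])
        if not edges:
            break  # dead end: the current node has no outgoing edges
        predicate, obj = rng.choice(edges)
        sequence.extend([predicate, obj])
        current = obj
    return sequence

def uri_sequences(graph, max_hops, walks_per_node, seed=0):
    """Generate walks_per_node random walks starting from every node."""
    rng = random.Random(seed)
    return [
        walk(graph, node, max_hops, rng)
        for node in graph
        for _ in range(walks_per_node)
    ]

sequences = uri_sequences(GRAPH, max_hops=2, walks_per_node=2)
```

Each resulting list plays the role of a "sentence" whose tokens are URIs, which is the input shape a word2vec-style model expects.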
Children's art museum collections as Linked Open Data
Konstantinos Kotis; Sotiris Angelis; Maria Chondrogianni; Efstathia Marini
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 1 (2021) pp. 60 - 70
It has recently been argued that it is beneficial for cultural institutions to provide their datasets as Linked Open Data, to achieve cross-referencing, interlinking, and integration with other datasets in the LOD cloud. In this paper, we present the Greek Children's Art Museum (GCAM) linked dataset, along with dataset and vocabulary statistics, as well as lessons learned from the process of transforming the collections to HTML-embedded structured data using the Europeana Data Model and the Schema.org model. The dataset consists of three cultural collections of 121 child artworks (paintings), including detailed descriptions and interlinks to external datasets. In addition to the presentation of GCAM data and the lessons learned from the experimentation of non-ICT experts with the LOD paradigm, the paper introduces a new metric for measuring dataset quality in terms of links to and from other datasets.
Analysis of structured data on Wikipedia
Johny Moreira; Everaldo Costa Neto; Luciano Barbosa
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 1 (2021) pp. 71 - 86
Wikipedia has been widely used for information consumption or for implementing solutions using its content. It contains primarily unstructured text about entities, but it can also contain infoboxes, which are structured attributes describing these entities. Owing to their structured nature, infoboxes have proven useful in many applications. In this work, we perform an extensive data analysis on different aspects of Wikipedia structured data: infoboxes, templates and categories, aiming to uncover data issues and limitations, and to guide researchers in the use of these structured data. We devise a framework to process, index and query the Wikipedia data, using it to analyse different scenarios such as the popularity of infoboxes, their size distribution and usage across categories. Some of our findings are: only 54% of Wikipedia articles have infoboxes; there is a considerable amount of geographical and temporal information in infoboxes; and there is great heterogeneity of infoboxes within the same category.
An ontology-driven perspective on the emotional human reactions to social events
Danilo Cavaliere; Sabrina Senatore
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 1 (2021) pp. 23 - 38
Social media has become a fulcrum for sharing information on everyday-life events: people, companies, and organisations express opinions about new products, political and social situations, football matches, and concerts. The recognition of feelings and reactions to events from social networks requires dealing with great amounts of data streams, especially for tweets, to investigate the main sentiments and opinions that justify some reactions. This paper presents an emotion-based classification model to extract feelings from tweets related to an event or a trend, described by a hashtag, and build an emotional concept ontology to study human reactions to events in a context. From the tweet analysis, terms expressing a feeling are selected to build a topological space of emotion-based concepts. The extracted concepts serve to train a multi-class SVM classifier that is used to perform soft classification aimed at identifying the emotional reactions towards events. Then, an ontology allows arranging classification results, enriched with additional DBpedia concepts. SPARQL queries on the final knowledge base provide specific insights to explain people's reactions towards events. Practical case studies and test results demonstrate the applicability and potential of the approach.
Persons, GLAM institutes and collections: an analysis of entity linking based on the COURAGE registry
Ghazal Faraj; András Micsik
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 1 (2021) pp. 39 - 49
It is an important task to connect encyclopaedic knowledge graphs by finding and linking the same entity nodes. Various available automated linking solutions cannot be applied in situations where data is sparse or private, or where a high degree of correctness is expected. Wikidata has grown into a leading linking hub collecting entity identifiers from various registries and repositories. To get a picture of connectability, we analysed the linking methods and results between the COURAGE registry and Wikidata, VIAF, ISNI and ULAN. This paper describes our investigations and solutions while mapping and enriching entities in Wikidata. Each possible mapped pair of entities received a numeric score of reliability. Using this score-based matching method, we tried to minimise the need for human decisions; hence, we introduced the term human decision window for the mappings where neither acceptance nor refusal can be made automatically and safely. Furthermore, Wikidata has been enriched with related COURAGE entities and bi-directional links between mapped persons, organisations, collections, and collection items. We also describe the findings on coverage and quality of mapping among the above-mentioned authority databases.
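The human decision window can be pictured as the band of scores lying between an automatic-reject threshold and an automatic-accept threshold. The sketch below illustrates that reading only; the threshold values, the 0 to 1 score scale and the function name are assumptions, not figures from the paper.

```python
def classify_mapping(score, reject_below=0.4, accept_from=0.8):
    """Map a reliability score to an automatic decision or to human review.

    The thresholds are hypothetical; scores between them fall into the
    "human decision window".
    """
    if reject_below > accept_from:
        raise ValueError("reject_below must not exceed accept_from")
    if score >= accept_from:
        return "accept"
    if score < reject_below:
        return "reject"
    return "human decision"

# Example: three candidate mappings with illustrative scores.
decisions = [classify_mapping(s) for s in (0.95, 0.55, 0.10)]
```

Narrowing the gap between the two thresholds shrinks the window and reduces manual work, at the cost of more automatic decisions made on borderline scores.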
Documenting flooding areas calculation: a PROV approach
Monica De Martino; Alfonso Quarati; Sergio Rosim; Laércio Massaru Namikawa
International Journal of Metadata, Semantics and Ontologies, Vol. 15, No. 1 (2021) pp. 50 - 59
Flooding events related to waste-lake dam ruptures are one of the most threatening natural disasters in Brazil. They must be managed in advance by public institutions through the use of adequate hydrographic and environmental information. Although the Open Data paradigm offers an opportunity to share hydrographic data sets, their actual reuse is still low because of poor metadata quality. Our previous work highlighted a lack of detailed provenance information. The paper presents an Open Data approach to improve the release of hydrographic data sets. We discuss a methodology, based on W3C recommendations, for documenting the provenance of hydrographic data sets, considering the workflow activities related to the study of flood areas caused by waste-lake breakdowns. We provide an illustrative example that documents, through the W3C PROV metadata model, the generation of flooding area maps by integrating land use classification, from Sentinel images, with hydrographic data sets produced by the Brazilian National Institute for Space Research.
A survey study on Arabic WordNet: baring opportunities and future research directions
Abdulmohsen S. Albesher; Osama B. Rabie
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 4 (2020) pp. 290 - 305
WordNet (WN) plays an essential role in knowledge management and information retrieval - as it allows for a better understanding of word relationships, which leads to more accurate text processing. The success of WN for the English language encouraged researchers to develop WNs for other languages. One of the most common of such languages is Arabic. However, the current state of affairs of Arabic WN (AWN) has not been properly studied. Thus, this paper presents a survey study on AWN conducted to explore opportunities and possible future research directions. The results involve the synthesis of over 100 research papers on AWN. These research papers were divided into categories and subcategories.
An ontology-based method for improving the quality of process event logs using database bin logs
Shokoufeh Ghalibafan; Behshid Behkamal; Mohsen Kahani; Mohammad Allahbakhsh
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 4 (2020) pp. 279 - 289
The main goal of process mining is discovering models from event logs. The usefulness of these discovered models is directly related to the quality of event logs. Researchers have proposed various solutions to detect deficiencies and improve the quality of event logs; however, only a few have considered the application of a reliable external source for the improvement of the quality of event data. In this paper, we propose a method to repair the event log using the database bin log. We show that database operations can be employed to overcome the inadequacies of the event logs, including incorrect and missing data. To this end, we first extract an ontology from each of the event logs and the bin log. Then, we match the extracted ontologies and remove inadequacies from the event log. The results show the stability of our proposed model and its superiority over related works.
Stress-testing big data platform to extract smart and interoperable food safety analytics
Ioanna Polychronou; Giannis Stoitsis; Mihalis Papakonstantinou; Nikos Manouselis
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 4 (2020) pp. 306 - 314
One of the significant challenges for the future is to guarantee safe food for all inhabitants of the planet. During the last 15 years, very important fraud issues like the '2013 horse meat scandal' and the '2008 Chinese milk scandal' have greatly affected the food industry and public health. One of the alternatives for this issue consists of increasing production, but to accomplish this, it is necessary that innovative options be applied to enhance the safety of the food supply chain. For this reason, it is quite important to have the right infrastructure in order to manage data of the food safety sector and provide useful analytics to Food Safety Experts. In this paper, we describe Agroknow's Big Data Platform architecture and examine its scalability for data management and experimentation.
Data aggregation lab: an experimental framework for data aggregation in cultural heritage
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 4 (2020) pp. 315 - 324
This paper describes the Data Aggregation Lab software, a system that implements the metadata aggregation workflow of cultural heritage, based on the underlying concepts and technologies of the Web of Data. These aggregation technologies can fulfil the functional requirements for metadata harvesting in cultural heritage and at the same time allow cultural heritage data to be globally interoperable with internet search engines and the Web of Data. The Data Aggregation Lab provides a framework to support several research activities within the Europeana network, such as conducting case studies, providing reference implementations, and supporting technology adoption. It provides working implementations for metadata aggregation methods with which our research has obtained positive results. These methods apply linked data, Schema.org, IIIF, Sitemaps and RDF-related technologies for innovation in data aggregation, data analysis and data conversion for cultural heritage data. The software is available for reuse and is open-sourced.
Automatic metadata extraction via image processing using Migne's Patrologia Graeca
Evagelos Varthis; Marios Poulos; Ilias Giarenis; Sozon Papavlasopoulos
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 4 (2020) pp. 265 - 278
A wealth of knowledge is kept in libraries and cultural institutions in various digital forms without, however, the possibility of a simple term search, let alone of a substantial semantic search. In this study, a novel approach is proposed which strives to recognise words and automatically generate metadata from large machine-printed corpora such as Migne's Patrologia Graeca (PG). The proposed framework first applies an efficient word segmentation and then transforms the word-images into special compact shapes. For the comparison, we use Hu's invariant moments for discarding unlikely matches, Shape Context (SC) for the contour similarity and Pearson's Correlation Coefficient (PCC) for final verification. Comparative results are presented by using the Long Short-Term Memory (LSTM) Neural Network (NN) engine of the Tesseract Optical Character Recognition (OCR) system instead of PCC. In addition, an intelligent scenario is proposed for automatic generation of PG metadata by librarians.
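For readers unfamiliar with the final verification step, Pearson's Correlation Coefficient between two equal-length sequences of values can be computed as in the sketch below. Treating each word-image as a flat list of pixel intensities is an assumption made for illustration; the paper's actual image representation and acceptance threshold are not given in the abstract.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("inputs must have the same length, at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations from the mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Two hypothetical flattened word-images (pixel intensities, row by row).
image_a = [0, 255, 255, 0, 0, 255]
image_b = [10, 240, 250, 5, 0, 245]
similarity = pearson(image_a, image_b)
```

Values close to 1 indicate that the two images vary together pixel by pixel, which is why the coefficient can serve as a last check on candidate matches.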
Semantic similarity measurement: an intrinsic information content model
Abhijit Adhikari; Biswanath Dutta; Animesh Dutta; Deepjyoti Mondal
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 3 (2020) pp. 218 - 233
Ontology-dependent Semantic Similarity (SS) measurement has emerged as a new research paradigm in finding the semantic strength between any two entities. In this regard, as observed, the information theoretic intrinsic approach yields better accuracy in correlation with human cognition. The precision of such a technique highly depends on how accurately we calculate the Information Content (IC) of concepts and its compatibility with an SS model. In this work, we develop an intrinsic IC model to facilitate better SS measurement. The proposed model has been evaluated using three vocabularies, namely SNOMED CT, MeSH and WordNet, against a set of benchmark data sets. We compare the results with the state-of-the-art IC models. The results show that the proposed intrinsic IC model yields a high correlation with human assessment. The article also evaluates the compatibility of the proposed IC model and the other existing IC models in combination with a set of state-of-the-art SS models.
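The abstract does not give the proposed model's formula. As background only, the sketch below computes one well-known intrinsic IC formulation, that of Seco et al., in which IC(c) = 1 - log(hypo(c) + 1) / log(N), where hypo(c) is the number of descendants of c and N is the number of concepts in the taxonomy. The toy taxonomy and function names are hypothetical, and this is not the model proposed in the paper.

```python
import math

# Hypothetical toy taxonomy (a tree): concept -> direct children.
TAXONOMY = {
    "entity": ["animal", "plant"],
    "animal": ["dog", "cat"],
    "plant": [],
    "dog": [],
    "cat": [],
}

def descendant_count(taxonomy, concept):
    """Number of concepts strictly below `concept` (assumes a tree)."""
    return sum(1 + descendant_count(taxonomy, child) for child in taxonomy[concept])

def intrinsic_ic(taxonomy, concept):
    """Seco-style intrinsic IC: 1 - log(hypo(c) + 1) / log(N)."""
    total = len(taxonomy)
    return 1.0 - math.log(descendant_count(taxonomy, concept) + 1) / math.log(total)

ic_values = {concept: intrinsic_ic(TAXONOMY, concept) for concept in TAXONOMY}
```

Under this formulation the root carries no information (IC of 0) and leaves carry the maximum (IC of 1), using only the structure of the taxonomy and no corpus statistics.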
Formalisation and classification of grammar and template-mediated techniques to model and ontology verbalisation
Zola Mahlaza; C. Maria Keet
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 3 (2020) pp. 249 - 262
Computational tools that translate modelling languages to a restricted natural language can improve end-user involvement in modelling. Templates are a popular approach for such a translation and are often paired with computational grammar rules to support grammatical complexity to obtain better quality sentences. There is no explicit specification of the relations used for the pairing of templates with grammar rules, so it is challenging to compare the latter templates' suitability for less-resourced languages, where grammar reuse is vital in reducing development effort. In order to enable such comparisons, we devise a model of pairing templates and rules, and assess its applicability by considering 54 existing systems for classification, and 16 of them in detail. Our classification shows that most grammar-infused template systems support detachable grammar rules and half of them introduce syntax trees for multilingualism or error checking. Furthermore, out of the 16 considered grammar-infused template systems, most do not currently support any form of aggregation (63%) or the embedding of verb conjugation rules (81%); hence, if such features were required, they would need to be implemented from the ground up.
Automatic classification of digital objects for improved metadata quality of electronic theses and dissertations in institutional repositories
International Journal of Metadata, Semantics and Ontologies, Vol. 14, No. 3 (2020) pp. 234 - 248
Higher education institutions typically employ Institutional Repositories (IRs) in order to curate and make available Electronic Theses and Dissertations (ETDs). While most of these IRs are implemented with self-archiving functionalities, self-archiving practices are still a challenge. This arguably leads to inconsistencies in the tagging of digital objects with descriptive metadata, potentially compromising searching and browsing of scholarly research output in IRs. This paper proposes an approach to automatically classify ETDs in IRs, using supervised machine learning techniques, by extracting features from the minimum possible input expected from document authors: the ETD manuscript. The experiment results demonstrate the feasibility of automatically classifying IR ETDs and, additionally, ensuring that repository digital objects are appropriately structured. Automatic classification of repository objects has the obvious benefit of improving the searching and browsing of content in IRs and further presents opportunities for the implementation of third-party tools and extensions that could potentially result in effective self-archiving strategies.