Metadata is a key data source for researchers seeking to apply machine learning (ML) to the vast collections of digitized biological specimens that can be found online. Unfortunately, the associated metadata is often sparse and, at times, erroneous. This paper extends previous research conducted with the Illinois Natural History Survey (INHS) collection (7244 specimen images) that uses computational approaches to analyze image quality, and then automatically generates 22 metadata properties representing the image quality and morphological features of the specimens. In the research reported here, we demonstrate the extension of our initial work to the University of Wisconsin Zoological Museum (UWZM) collection (4155 specimen images). Further, we enhance our computational methods in four ways: (1) augmenting the training set, (2) applying contrast enhancement, (3) upscaling small objects, and (4) refining our processing logic. Together these new methods improved our overall error rates from 4.6% to 1.1%. These enhancements also allowed us to compute an additional set of 17 image-based metadata properties. The new metadata properties provide supplemental features and information that may also be used to analyze and classify the fish specimens. Examples of these new features include convex area, eccentricity, perimeter, and skew. The newly refined process further outperforms humans in terms of time and labor cost, as well as accuracy, providing a novel solution for leveraging digitized specimens with ML. This research demonstrates the ability of computational methods to enhance the digital library services associated with the tens of thousands of digitized specimens stored in open-access repositories world-wide by generating accurate and valuable metadata for those repositories.
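As an illustration of how morphological metadata properties such as eccentricity can be derived from a segmented specimen, the following stdlib-only sketch computes area, centroid, and moments-based eccentricity from a set of foreground pixel coordinates. It is a simplified stand-in, not the pipeline used in the paper; the helper name `region_properties` is our own.

```python
import math

def region_properties(pixels):
    """Compute area, centroid, and eccentricity for a binary region.

    `pixels` is a list of (row, col) coordinates of foreground pixels.
    Eccentricity follows the moments-based definition common in
    image-analysis libraries: it is derived from the eigenvalues of
    the second central moment (covariance) matrix of the pixel cloud.
    """
    n = len(pixels)
    cy = sum(r for r, _ in pixels) / n
    cx = sum(c for _, c in pixels) / n
    # Second central moments, normalized by area.
    mu_rr = sum((r - cy) ** 2 for r, _ in pixels) / n
    mu_cc = sum((c - cx) ** 2 for _, c in pixels) / n
    mu_rc = sum((r - cy) * (c - cx) for r, c in pixels) / n
    # Eigenvalues of the 2x2 covariance matrix (major/minor axes).
    common = math.sqrt(((mu_rr - mu_cc) / 2) ** 2 + mu_rc ** 2)
    lam1 = (mu_rr + mu_cc) / 2 + common
    lam2 = (mu_rr + mu_cc) / 2 - common
    ecc = math.sqrt(1 - lam2 / lam1) if lam1 > 0 else 0.0
    return {"area": n, "centroid": (cy, cx), "eccentricity": ecc}

# A 1x5 horizontal run of pixels is maximally elongated (eccentricity 1).
props = region_properties([(0, c) for c in range(5)])
```

In practice such properties are computed per connected component of the segmented specimen mask; convex area and perimeter follow the same pattern with hull and boundary computations.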
In 2001, Rutgers University Libraries (RUL) accepted a substantial donation of Roman Republican coins. The work to catalog, house, digitize, describe, and present this collection online provided unique challenges for the institution. Coins are often seen as museum objects; however, they can serve pedagogical purposes within libraries. In the quest to innovate, RUL digitized coins from seven angles to provide a 180-degree view of coins. However, this strategy had its drawbacks; it had to be reassessed as the project continued. RUCore, RUL’s digital repository, uses Metadata Object Description Schema (MODS). Accordingly, it was necessary to adapt numismatic description to bibliographic metadata standards. With generous funding from the Loeb Foundation, the resulting digital collection of 1200 coins was added to RUCore from 2012 to 2018. Rutgers’s Badian Roman Coins Collection serves as an exemplar of numismatics in a library environment that is freely available to all on the Web.
This article reports on the progress and current status of the World Historical Gazetteer (whgazetteer.org) (WHG) in the context of its value for helping to organize and record digital and paleographic information. It summarizes the development and functionality of the WHG as a software platform for connecting specialist collections of historical place names, reviews the idea of places as entities (rather than simple objects with single labels), explains the utility of gazetteers in digital library infrastructure, and describes potential future developments.
This research analyzed and compared deposits of master’s theses in the institutional repositories of the SEA-EU Alliance universities in order to determine whether universities’ mandates to deposit master’s theses influence the number of theses deposited in those repositories. We compared the universities’ institutional repository content, focusing on the ratio of the number of deposited master’s theses to the number of enrolled students, taking into account the open access policies, or more specifically, the national and the universities’ mandates to deposit master’s theses. The research methods involved a quantitative approach in which the data were collected through direct communication with universities’ employees and from the institutional repositories in our sample. Our analysis showed that the number of papers stored in repositories reflects the open access policy, or more specifically, the master’s theses deposit policy, at certain universities. Furthermore, when analyzing the distribution of the number of theses across the scientific disciplines as well as the degree of openness, it was evident that it largely concurred with the trends recorded in the current scholarly literature.
This article advances the thesis that three decades of investments by national and international funders, combined with those of scholars, technologists, librarians, archivists, and their institutions, have resulted in a digital infrastructure in the humanities that is now capable of supporting end-to-end research workflows. The article refers to key developments in the epigraphy and paleography of the premodern period. It draws primarily on work in classical studies but also highlights related work in the adjacent disciplines of Egyptology, ancient Near East studies, and medieval studies. The argument makes a case that much has been achieved but it does not declare “mission accomplished.” The capabilities of the infrastructure remain unevenly distributed within and across disciplines, institutions, and regions. Moreover, the components, including the links between steps in the workflow, are generally far from user-friendly and seamless in operation. Because further refinements and additional capacities are still much needed, the article concludes with a discussion of key priorities for future work.
Scientific writing builds upon already published papers. Manual identification of publications to read, cite or consider as related papers relies on a researcher’s ability to identify fitting keywords or initial papers from which a literature search can be started. The rapidly increasing number of papers has called for automatic measures to find the desired relevant publications, so-called paper recommendation systems. As the number of publications increases, so does the number of paper recommendation systems. Former literature reviews focused on discussing the general landscape of approaches throughout the years and highlighting the main directions. We refrain from this perspective; instead, we consider only a comparatively small time frame but analyse it fully. In this literature review we discuss the methods, datasets, evaluations and open challenges encountered in all works first released between January 2019 and October 2021. The goal of this survey is to provide a comprehensive and complete overview of current paper recommendation systems.
Video retrieval methods, e.g., for visual concept classification, person recognition, and similarity search, are essential to perform fine-grained semantic search in large video archives. However, such retrieval methods often have to be adapted to the users’ changing search requirements: which concepts or persons are frequently searched for, what research topics are currently important or will be relevant in the future? In this paper, we present VIVA, a software tool for building content-based video retrieval methods based on deep learning models. VIVA allows non-expert users to conduct visual information retrieval for concepts and persons in video archives and to add new people or concepts to the underlying deep learning models as new requirements arise. For this purpose, VIVA provides a novel semi-automatic data acquisition workflow including a web crawler, image similarity search, as well as review and user feedback components to reduce the time-consuming manual effort for collecting training samples. We present experimental retrieval results using VIVA for four use cases in the context of a historical video collection of the German Broadcasting Archive based on about 34,000 h of television recordings from the former German Democratic Republic (GDR). We evaluate the performance of deep learning models built using VIVA for 91 GDR specific concepts and 98 personalities from the former GDR as well as the performance of the image and person similarity search approaches.
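Image similarity search of the kind described for VIVA is commonly implemented by comparing deep feature embeddings with cosine similarity. The sketch below shows only that comparison step, under the assumption that embeddings are already available as plain vectors; it is not the VIVA implementation itself.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query, gallery):
    """Return gallery indices ranked by similarity to the query embedding."""
    return sorted(range(len(gallery)),
                  key=lambda i: cosine_similarity(query, gallery[i]),
                  reverse=True)
```

In a production archive the ranking would typically run over an approximate nearest-neighbour index rather than a linear scan, but the similarity measure is the same.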
Cuneiform tablets remain founding cornerstones of more than two hundred collections in American academic institutions, having been acquired a century or more ago under dynamic ethical norms and global networks. To foster data sharing, this contribution incorporates empirical data from interactive ArcGIS and reusable OpenContext maps to encourage tandem dialogues about using the inscribed works and learning their collecting histories. On their own, such provenance research aids begin to narrate objects’ journeys over time while cultivating the digital inclusion of expert local knowledge relevant to an object biography. The paper annotates several approaches institutions use, or might consider using, to expand upon current provenance information in ways that encourage visitors’ critical thinking and learning about global journeys, travel archives, and such dispositions as virtual reunification, reconstruction, or restitution made possible by provenance research.
In recent years, several scientific digital libraries (DLs) in the digital humanities (DH) field have been developed following the Open Science principles. These DLs aim at sharing research outcomes, in several cases as FAIR data, and at creating linked information spaces. In several cases, Semantic Web technologies and Linked Data have been used to reach these aims. This paper presents how current scientific DLs in the DH field can support the creation of linked information spaces and navigational services that allow users to navigate them, using Semantic Web technologies to formally represent, search and browse knowledge. To support the argument, we present our experience in developing a scientific DL supporting scholars in creating, evolving and consulting a knowledge base related to Medieval and Renaissance geographical works within the three-year (2020–2023) Italian national research project IMAGO—Index Medii Aevi Geographiae Operum. In the presented case study, a linked information space was created to allow users to discover and navigate knowledge across multiple repositories, thanks to the extensive use of ontologies. In particular, the linked information spaces created within the IMAGO project make use of five different datasets, i.e. Wikidata, the MIRABILE digital archive, the Nuovo Soggettario thesaurus, the Mapping Manuscript Migration knowledge base and the Pleiades gazetteer. The linking among the different datasets considerably enriches the knowledge collected in the IMAGO KB.
This paper describes the Perseus Digital Library as, in part, a response to limitations of what is now a print culture that is rapidly receding from contemporary consciousness and, at the same time, as an attempt to fashion an infrastructure for the study of the past that can support a shared cultural heritage that extends beyond Europe and is global in scope. But if Greco-Roman culture cannot by itself represent the background of an international twenty-first century culture, this field, at the same time, offers challenges in its scale and complexity that allow us to explore the possibility of digital libraries. Greco-Roman studies is in a position to begin creating a completely transparent intellectual ecosystem, with a critical mass of its primary data available under an open license and with new forms of reading support that make sources in ancient and modern languages accessible to a global audience. In this model, traditional libraries play the role of archives: physically constrained spaces to which a handful of specialists can have access. If non-specialists draw problematic conclusions because the underlying sources are not publicly available and as well-documented as possible, the responsibility lies with the specialists who have not yet created the open, digital libraries upon which the intellectual life of humanity must depend. Greco-Roman Studies can play a major role in modeling such libraries. Perseus seeks to contribute to that transformation.
Machine Reading Comprehension (MRC) of a document is a challenging problem that requires discourse-level understanding. Information extraction from scholarly articles is nowadays a critical use case for researchers who need to understand the underlying research quickly and move forward, especially in this age of infodemic. MRC on research articles can also provide helpful information to reviewers and editors. However, the main bottleneck in building such models is the availability of human-annotated data. In this paper, we first introduce a dataset to facilitate question answering (QA) on scientific articles. We prepare the dataset in a semi-automated fashion, yielding more than 100k human-annotated context–question–answer triples. Second, we implement a baseline QA model based on Bidirectional Encoder Representations from Transformers (BERT). Additionally, we implement two further models: the first based on Science BERT (SciBERT), and the second combining SciBERT with Bi-Directional Attention Flow (Bi-DAF). The best model (i.e., SciBERT) obtains an F1 score of 75.46%. Our dataset is novel, and our work opens up a new avenue for scholarly document processing research by providing a benchmark QA dataset and standard baselines. We make our dataset and code available at https://github.com/TanikSaikh/Scientific-Question-Answering.
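The F1 score reported for extractive QA is conventionally the SQuAD-style token-overlap F1 between the predicted and gold answer spans. A minimal sketch of that metric (our own helper, not the authors' evaluation code, and omitting the usual punctuation/article normalization):

```python
from collections import Counter

def token_f1(prediction, gold):
    """SQuAD-style token-level F1 between predicted and gold answers."""
    pred = prediction.lower().split()
    ref = gold.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A corpus-level score is then the mean of `token_f1` over all question–answer pairs, usually taking the maximum over multiple gold answers per question.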
Citation indexes are by now part of the research infrastructure in use by most scientists: a necessary tool in order to cope with the increasing amounts of scientific literature being published. Commercial citation indexes are designed for the sciences and have uneven coverage and unsatisfactory characteristics for humanities scholars, while no comprehensive citation index is published by a public organisation. We argue that an open citation index for the humanities is desirable, for four reasons: it would greatly improve and accelerate the retrieval of sources, it would offer a way to interlink collections across repositories (such as archives and libraries), it would foster the adoption of metadata standards and best practices by all stakeholders (including publishers) and it would contribute research data to fields such as bibliometrics and science studies. We also suggest that the citation index should be informed by a set of requirements relevant to the humanities. We discuss four such requirements: source coverage must be comprehensive, including books and citations to primary sources; there needs to be chronological depth, as scholarship in the humanities remains relevant over time; the index should be collection driven, leveraging the accumulated thematic collections of specialised research libraries; and it should be rich in context in order to allow for the qualification of each citation, for example, by providing citation excerpts. We detail the fit-for-purpose research infrastructure which can make the Humanities Citation Index a reality. Ultimately, we argue that a citation index for the humanities can be created by humanists, via a collaborative, distributed and open effort.
While most previous research focused only on the textual content of documents, advanced support for document management in digital libraries, for open science, requires handling all aspects of a document: from structure, to content, to context. These different but inter-related aspects cannot be handled separately and were traditionally ignored in digital libraries. We propose a graph-based unifying representation and handling model based on the definition of an ontology that integrates all the different perspectives and drives the document description in order to boost the effectiveness of document management. We also show how even simple algorithms can profitably use our proposed approach to return relevant and personalized outcomes in different document management tasks.
Author affiliations provide key information when attributing academic performance like publication counts. So far, such measures have been aggregated either manually or only to top-level institutions, such as universities. Supervised affiliation resolution requires a large number of annotated alignments between affiliation strings and known institutions, which are not readily available. We introduce the task of unsupervised hierarchical affiliation resolution, which assigns affiliations to institutions on all hierarchy levels (e.g. departments), discovering the institutions as well as their hierarchical ordering on the fly. From the corresponding requirements, we derive a simple conceptual framework based on the subset partial order that can be extended to account for the discrepancies evident in realistic affiliations from the Web of Science. We implement initial baselines and provide datasets and evaluation metrics for experimentation. Results show that mapping affiliations to known institutions and discovering lower-level institutions works well with simple baselines, whereas unsupervised top-level- and hierarchical resolution is more challenging. Our work provides structured guidance for further in-depth studies and improved methodology by identifying and discussing a number of observed difficulties and important challenges that future work needs to address.
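The subset partial order mentioned above can be illustrated on token sets: an affiliation string whose tokens strictly contain those of another is placed below it in the hierarchy. This is a stdlib-only sketch of the idea, with hypothetical affiliation strings; the paper's normalization and handling of real-world discrepancies are necessarily more elaborate.

```python
def tokens(affiliation):
    """Normalize an affiliation string to a set of lower-cased tokens."""
    return frozenset(affiliation.lower().replace(",", " ").split())

def below(child, parent):
    """True if `child` sits below `parent` under the subset partial
    order: parent's token set is a strict subset of child's."""
    return tokens(parent) < tokens(child)

# Hypothetical example: a department string contains its university's tokens.
univ = "University of Placeville"
dept = "Department of History, University of Placeville"
```

Because the order is only partial, unrelated institutions are simply incomparable, which is what lets the hierarchy be discovered on the fly from the affiliation strings alone.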
Event detection is a crucial task for natural language processing; it involves identifying instances of specified types of events in text and classifying them into event types. The detection of events from digitised documents could enable historians to gather and combine a large amount of information into an integrated whole, a panoramic interpretation of the past. However, the level of degradation of digitised documents and the quality of the optical character recognition (OCR) tools might hinder the performance of an event detection system. While several studies have been performed on detecting events from historical documents, the transcribed documents needed to be hand-validated, which required substantial human expertise and labour-intensive manual work. Thus, in this study, we explore the robustness to OCR noise of two language-independent event detection models, over two datasets that cover different event types and multiple languages. We aim at analysing their ability to mitigate problems caused by the low quality of the digitised documents, and we simulate the existence of transcribed data, synthesised from clean annotated text, by injecting synthetic noise. For creating the noisy synthetic data, we chose to utilise four main types of noise that commonly occur after the digitisation process: Character Degradation, Bleed Through, Blur, and Phantom Character. Finally, we conclude that the imbalance of the datasets, the richness of the different annotation styles, and the language characteristics are the most important factors that can influence event detection in digitised documents.
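The noise-injection step can be sketched as a character-level corruption pass over clean annotated text. The confusion table below is an illustrative subset of typical OCR substitutions and is not the paper's degradation models (Character Degradation, Bleed Through, Blur, and Phantom Character operate on document images before recognition); a seeded generator keeps the synthetic data reproducible.

```python
import random

# Illustrative character confusions of the kind OCR introduces.
CONFUSIONS = {"e": "c", "l": "1", "o": "0", "m": "rn"}

def inject_ocr_noise(text, rate=0.1, seed=0):
    """Corrupt `text` by applying a confusion to each confusable
    character independently with probability `rate`."""
    rng = random.Random(seed)  # seeded for reproducible synthetic data
    return "".join(
        CONFUSIONS[ch] if ch in CONFUSIONS and rng.random() < rate else ch
        for ch in text
    )
```

Annotations keyed to character offsets must be re-aligned afterwards, since confusions like `m` → `rn` change the text length.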
The current scientific context is characterized by intensive digitization of research outcomes and by the creation of data infrastructures for the systematic publication of datasets and data services. Several relationships can exist among these outcomes. Some of them are explicit, e.g. the relationships of spatial or temporal similarity, whereas others are hidden, e.g. the relationship of causality. By materializing these hidden relationships through a linking mechanism, several patterns can be established. These knowledge patterns may lead to the discovery of previously unknown information. A new approach to knowledge production can emerge by following these patterns. This new approach is exploratory because, by following these patterns, a researcher can get new insights into a research problem. In this paper, we report our effort to depict this new exploratory approach using Linked Data and Semantic Web technologies (RDF, OWL). As a use case, we apply our approach to the archaeological domain.
This volume presents a special issue on selected papers from the 2019 & 2020 editions of the International Conference on Theory and Practice of Digital Libraries (TPDL). They cover different research areas within Digital Libraries, from Ontology and Linked Data to quality in Web Archives and Topic Detection. We first provide a brief overview of both TPDL editions, and we introduce the selected papers.
In Greece, there are many audiovisual resources available on the Internet that interest scientists and the general public. Although freely available, finding such resources often becomes a challenging task, because they are hosted on scattered websites and in different types/formats. These websites usually offer limited search options; at the same time, there is no aggregation service for audiovisual resources, nor a national registry for such content. To meet this need, the Open AudioVisual Archives project was launched and the first step in its development is to create a dataset with open access audiovisual material. The current research creates such a dataset by applying specific selection criteria in terms of copyright and content, form/use and process/technical characteristics. The results reported in this paper show that libraries, archives, museums, universities, mass media organizations, governmental and non-governmental organizations are the main types of providers, but the vast majority of resources are open courses offered by universities under the “Creative Commons” license. Providers have significant differences in terms of their collection management capabilities. Most of them do not own any kind of publishing infrastructure and use commercial streaming services, such as YouTube. In terms of metadata policy, most of the providers use application profiles instead of international metadata schemas.
Citation information in scholarly data is an important source of insight into the reception of publications and the scholarly discourse. Outcomes of citation analyses and the applicability of citation-based machine learning approaches heavily depend on the completeness of such data. One particular shortcoming of scholarly data nowadays is that non-English publications are often not included in data sets, or that language metadata is not available. Because of this, citations between publications of differing languages (cross-lingual citations) have only been studied to a very limited degree. In this paper, we present an analysis of cross-lingual citations based on over one million English papers, spanning three scientific disciplines and a time span of three decades. Our investigation covers differences between cited languages and disciplines, trends over time, and the usage characteristics as well as impact of cross-lingual citations. Among our findings are an increasing rate of citations to publications written in Chinese, citations being primarily to local non-English languages, and consistency in citation intent between cross- and monolingual citations. To facilitate further research, we make our collected data and source code publicly available.
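Given citation records annotated with the language of the citing and the cited publication, the basic quantities analysed above (the cross-lingual citation rate and the distribution of cited languages) can be computed as follows; the record format of language-code pairs is a simplifying assumption, not the paper's data model.

```python
from collections import Counter

def cross_lingual_stats(citations):
    """`citations` is a list of (citing_lang, cited_lang) pairs.

    Returns the share of cross-lingual citations (citing and cited
    language differ) and a Counter of cited languages among them."""
    cross = [(a, b) for a, b in citations if a != b]
    rate = len(cross) / len(citations) if citations else 0.0
    cited_langs = Counter(b for _, b in cross)
    return rate, cited_langs
```

Trends over time, as studied in the paper, would then bucket the same counts by publication year before comparing rates across disciplines.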
Digital repositories rely on technical metadata to manage their objects. The output of characterization tools is aggregated and analyzed through content profiling. The accuracy and correctness of characterization tools vary; they frequently produce contradicting outputs, resulting in metadata conflicts. These conflicts limit scalable preservation risk assessment and repository management. This article presents and evaluates a rule-based approach to improving data quality in this scenario through expert-conducted conflict resolution. We characterize the data quality challenges and present a method for developing conflict resolution rules to improve data quality. We evaluate the method and the resulting data quality improvements in an experiment on a publicly available document collection. The results demonstrate that our approach enables the effective resolution of conflicts by producing rules that reduce the proportion of conflicts in the data set from 17% to 3%. This replicable method represents a significant improvement in content profiling technology for digital repositories, since the enhanced data quality can improve risk assessment and preservation management in digital repository systems.
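A conflict-resolution rule of the kind described can be sketched as a condition over the set of values reported by the characterization tools, paired with the value an expert decided should win. The MIME types and rules below are hypothetical examples for illustration, not the rules developed in the article.

```python
# Each rule: (condition over the set of reported values, resolved value).
# Both rules below are hypothetical expert decisions.
RULES = [
    (lambda v: {"application/pdf", "text/plain"} <= v, "application/pdf"),
    (lambda v: {"image/tiff", "application/octet-stream"} <= v, "image/tiff"),
]

def resolve(values):
    """Return a single format value for a list of conflicting tool
    outputs, or None if no rule applies and the conflict must remain
    for expert review."""
    value_set = set(values)
    for condition, resolved in RULES:
        if condition(value_set):
            return resolved
    if len(value_set) == 1:          # tools agree: no conflict at all
        return values[0]
    return None
```

Running such rules over a content profile leaves only the unresolved conflicts, which is how the reduction in the conflict rate reported above would be measured.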
Our research aims at understanding children’s information search and their use of information search tools during educational pursuits. We conducted an observation study with 50 New Zealand school children between the ages of 9 and 13 years old. In particular, we studied the way that children constructed search queries and interacted with the Google search engine when undertaking a range of educationally appropriate inquiry tasks. As a result of this in situ study, we identified typical query-creation and query-reformulation strategies that children use. The children worked through 250 tasks and created a total of 550 search queries. Of the successful queries, 64.4% were natural language queries, compared with only 35.6% keyword queries. Only three children used the related searches feature of the search engine, while 46 children used query suggestions. We gained insights into the information search strategies children use during their educational pursuits. We observed a range of issues that children encountered when interacting with a search engine to create searches as well as to triage and explore information in the search engine results page lists. We found that search tasks posed as questions were more likely to result in query constructions based on natural language questions, while tasks posed as instructions were more likely to result in query constructions using natural language sentences or keywords. Our findings have implications for both educators and search engine designers.
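The query categories contrasted above (natural language questions, natural language sentences, keyword queries) can be approximated by a simple heuristic classifier; the question-word list and the length threshold below are our own assumptions, not the coding scheme used in the study.

```python
# Common English question openers (an assumed, non-exhaustive list).
QUESTION_WORDS = {"what", "who", "when", "where", "why", "how",
                  "which", "is", "are", "do", "does", "can"}

def classify_query(query):
    """Rough heuristic: a question word first -> question; several
    words without one -> natural language sentence; else keywords."""
    words = query.lower().split()
    if words and words[0] in QUESTION_WORDS:
        return "question"
    if len(words) >= 4:  # assumed threshold separating sentences from keywords
        return "sentence"
    return "keyword"
```

A study like this one would instead code queries manually; a heuristic of this kind is only useful for a first automatic pass over large query logs.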
Inferring the magnitude and occurrence of real-world events from natural language text is a crucial task in various domains. In the domain of public health in particular, state-of-the-art document- and token-centric event detection approaches have not kept pace with the growing need for more robust event detection. In this paper, we propose UPHED, a unified approach which combines both document- and token-centric event detection techniques in an unsupervised manner, such that both rare (aperiodic) and recurring (periodic) events can be detected using a generative model for the domain of public health. We evaluate the efficiency of our approach as well as its effectiveness for two real-world case studies with respect to the quality of document clusters. Our results show that we are able to achieve a precision of 60% and a recall of 71%, evaluated on manually annotated real-world data. Finally, we make a comparative analysis of our work with the well-established rule-based system MedISys and find that UPHED can be used in a cooperative way with MedISys to not only detect similar anomalies but also deliver more information about the specific outbreaks of reported diseases.