News on eLiteracias

✇ International Journal on Digital Libraries

Exploiting the untapped functional potential of Memento aggregators beyond aggregation

27 January 2024, 00:00

Abstract

Web archives capture, retain, and present historical versions of web pages. Viewing web archives often amounts to a user visiting the Wayback Machine homepage, typing in a URL, then choosing the date and time of a capture. Other web archives also capture the web and use Memento as an interoperable point of querying their captures. Memento aggregators are web-accessible software packages that allow clients to send requests for past web pages to a single endpoint that then relays each request to a set of web archives. Though few deployed aggregator instances exist that exhibit this aggregation trait, they all, for the most part, align with a model of serving a request for the URI of an original resource (URI-R) to a client by first querying and then aggregating the responses from a collection of web archives. This single-tier querying need not be the logical flow of an aggregator, so long as a user can still utilize the aggregator from a single URL. In this paper, we discuss theoretical aggregation models of web archives. We first describe the status quo as the conventional behavior exhibited by an aggregator. We then build on prior work to describe a multi-tiered, structured querying model that may be exhibited by an aggregator. We highlight some potential issues and high-level optimizations to ensure efficient aggregation while also extending the state of the art of Memento aggregation. Part of our contribution is the extension of an open-source, user-deployable Memento aggregator to exhibit the capability described in this paper. We also extend a browser extension that typically consults an aggregator to be able to aggregate itself rather than needing to consult an external service. A purely client-side, browser-based Memento aggregator is novel to this work.
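
A minimal sketch of the conventional single-tier aggregation model described above: one request is relayed concurrently to several archives' Memento TimeMap endpoints and the responses are merged. The endpoint URL patterns and the naive merge are illustrative only, not the paper's extended aggregator.

    import concurrent.futures
    import requests

    # Illustrative TimeMap (URI-T) templates for two public web archives.
    TIMEMAP_TEMPLATES = [
        "http://web.archive.org/web/timemap/link/{uri_r}",
        "https://arquivo.pt/wayback/timemap/link/{uri_r}",
    ]

    def fetch_timemap(template, uri_r):
        """Fetch one archive's TimeMap (link format) for an original resource."""
        try:
            resp = requests.get(template.format(uri_r=uri_r), timeout=10)
            resp.raise_for_status()
            return resp.text.splitlines()
        except requests.RequestException:
            return []  # an unreachable archive must not break aggregation

    def aggregate(uri_r):
        """Query all archives concurrently and merge their memento entries."""
        with concurrent.futures.ThreadPoolExecutor() as pool:
            results = pool.map(lambda t: fetch_timemap(t, uri_r), TIMEMAP_TEMPLATES)
        # A production aggregator would also parse and sort by Memento-Datetime.
        return [line for lines in results for line in lines
                if 'rel="memento"' in line]

    print(aggregate("http://example.com/"))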

✇ International Journal on Digital Libraries

Image searching in an open photograph archive: search tactics and faced barriers in historical research

24 January 2024, 00:00

Abstract

During the last decades, cultural heritage collections have been digitized, for example, for the use of academic scholars. However, earlier studies have mainly focused on the use of textual materials. Thus, little is known about how digitized photographs are used and searched in digital humanities. The aim of this paper is to investigate the applied search tactics and perceived barriers when looking for historical photographs in a digital image archive for research and writing tasks. The case archive of this study contains approximately 160,000 historical wartime photographs that are openly available. The study is based on qualitative interview and demonstration data from 15 expert users of the image collection searching for photographs for research and writing tasks. Critical incident questions yielded a total of 37 detailed real-life search examples and 158 expressed barriers to searching. Results show that expert users apply and combine different tactics (keywords, filtering and browsing) for image searching, and that one tactic alone is rarely enough. While searching, users face various barriers, most of them concerning keyword searching due to the shortcomings of image metadata. Barriers arose mostly in the context of the collection and its tools. Although scholars have benefited from the efforts put into digitizing cultural heritage collections, providing digitized content openly online is not enough if there are no sufficient means of accessing that content. Automatic annotation methods are one option for creating metadata to improve the findability of the images. However, a better understanding of human information interaction with image data is needed to better support digitalization in the humanities in this respect.

✇ International Journal on Digital Libraries

A BERT-based sequential deep neural architecture to identify contribution statements and extract phrases for triplets from scientific publications

23 January 2024, 00:00

Abstract

Research in Natural Language Processing (NLP) is increasing rapidly; as a result, a large number of research papers are being published. It is challenging to find the contributions of a research paper in any specific domain within this huge amount of unstructured data, so there is a need to structure the relevant contributions in a Knowledge Graph (KG). In this paper, we describe our work to accomplish four tasks toward building a Scientific Knowledge Graph (SKG). We propose a pipelined system that performs contribution sentence identification, phrase extraction from contribution sentences, Information Units (IUs) classification, and organization of phrases into triplets (subject, predicate, object) from NLP scholarly publications. We develop a multitasking system (ContriSci) for contribution sentence identification with two supporting tasks, viz. Section Identification and Citance Classification. We use a Bidirectional Encoder Representations from Transformers (BERT)-Conditional Random Field (CRF) model for the phrase extraction, trained with two additional datasets: SciERC and SciClaim. To classify the contribution sentences into IUs, we use a BERT-based model. For the triplet extraction, we categorize the triplets into five categories and classify them with a BERT-based classifier. Our proposed approach yields F1 scores of 64.21%, 77.47%, 84.52%, and 62.71% for contribution sentence identification, phrase extraction, IUs classification, and triplet extraction, respectively, in the non-end-to-end setting. The relative improvement for contribution sentence identification, IUs classification, and triplet extraction is 8.08, 2.46, and 2.31 F1 points on the NLPContributionGraph (NCG) dataset. Our system achieves its best performance (57.54% F1 score) in the end-to-end pipeline with all four sub-tasks combined. We make our code available at: https://github.com/92Komal/pipeline_triplet_extraction.
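
For readers unfamiliar with the first stage of such a pipeline, the sketch below shows a BERT-based binary sentence classifier of the kind used for contribution sentence identification. It is a generic stand-in for ContriSci, untrained here; fine-tuning on the NCG dataset would be required for meaningful predictions.

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2  # 0 = other, 1 = contribution sentence
    )

    sentences = [
        "We propose a pipelined system for contribution extraction.",
        "Prior work has studied citation classification extensively.",
    ]
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    print(logits.argmax(dim=-1))  # predicted class per sentence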

✇ International Journal on Digital Libraries

Sequential sentence classification in research papers using cross-domain multi-task learning

22 January 2024, 00:00

Abstract

The automatic semantic structuring of scientific text allows for more efficient reading of research articles and is an important indexing step for academic search engines. Sequential sentence classification is an essential structuring task and targets the categorisation of sentences based on their content and context. However, the potential of transfer learning for sentence classification across different scientific domains and text types, such as full papers and abstracts, has not yet been explored in prior work. In this paper, we present a systematic analysis of transfer learning for scientific sequential sentence classification. For this purpose, we derive seven research questions and present several contributions to address them: (1) We suggest a novel uniform deep learning architecture and multi-task learning for cross-domain sequential sentence classification in scientific text. (2) We tailor two transfer learning methods to the given task, namely sequential transfer learning and multi-task learning. (3) We compare the results of the two best models using qualitative examples in a case study. (4) We provide an approach for the semi-automatic identification of semantically related classes across annotation schemes and analyse the results for four annotation schemes. The clusters and underlying semantic vectors are validated using k-means clustering. (5) Our comprehensive experimental results indicate that, when using the proposed multi-task learning architecture, models trained on datasets from different scientific domains benefit from one another. Our approach significantly outperforms the state of the art on full-paper datasets while being on par on datasets consisting of abstracts.
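
One common way to realize such cross-domain multi-task learning is a shared encoder with one classification head per dataset or annotation scheme. The sketch below illustrates that general pattern in PyTorch; it is not the authors' exact architecture, which also models sentence context, and the task names and label counts are invented.

    import torch
    from torch import nn
    from transformers import AutoModel

    class MultiTaskSentenceClassifier(nn.Module):
        def __init__(self, task_label_counts, encoder_name="bert-base-uncased"):
            super().__init__()
            # Shared encoder: all tasks/domains update the same parameters.
            self.encoder = AutoModel.from_pretrained(encoder_name)
            hidden = self.encoder.config.hidden_size
            # One output head per dataset / annotation scheme.
            self.heads = nn.ModuleDict(
                {task: nn.Linear(hidden, n) for task, n in task_label_counts.items()}
            )

        def forward(self, input_ids, attention_mask, task):
            out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
            cls = out.last_hidden_state[:, 0]  # [CLS] representation per sentence
            return self.heads[task](cls)

    model = MultiTaskSentenceClassifier({"abstracts": 5, "full_papers": 7})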

✇ International Journal on Digital Libraries

Academics’ experience of online reading lists and the use of reading list notes

12 January 2024, 00:00

Abstract

Reading list systems are widely used in tertiary education as a pedagogical tool and for tracking copyrighted material. This paper explores academics' experiences with reading lists, and in particular their use of the reading list notes feature. A mixed-methods approach was employed in which we first conducted interviews with academics about their experience with reading lists. We identified the need for streamlining the workflow of reading list set-up, improved usability of the interfaces, and better synchronization with other teaching support systems. Next, we performed a log analysis of the use of the notes feature throughout one academic year. Our log analysis showed that the notes feature is under-utilized by academics. We recommend improving the systems' usability by re-engineering user workflows and better integrating the notes feature into academic teaching.

✇ International Journal on Digital Libraries

SciND: a new triplet-based dataset for scientific novelty detection via knowledge graphs

8 January 2024, 00:00

Abstract

Detecting texts that contain semantic-level new information is not straightforward, and the problem becomes more challenging for research articles. Over the years, many datasets and techniques have been developed to attempt automatic novelty detection. However, the majority of existing textual novelty detection investigations target general domains like newswire, and no comprehensive dataset for scientific novelty detection is available in the literature. In this paper, we present a new triplet-based corpus (SciND) for scientific novelty detection from research articles via knowledge graphs. The proposed dataset consists of three types of triplets: (i) triplets for the knowledge graph, (ii) novel triplets, and (iii) non-novel triplets. We build a scientific knowledge graph for research articles using triplets across several natural language processing (NLP) domains and extract novel triplets from papers published in the year 2021. For the non-novel triplets, we use blog post summaries of the research articles. Our knowledge graph is domain-specific: we build it for seven NLP domains. We further use a feature-based novelty detection scheme on the research articles as a baseline and show the applicability of our proposed dataset using this baseline novelty detection algorithm, which yields an F1 score of 72%. We present an analysis and discuss the future scope of our proposed dataset. To the best of our knowledge, this is the first dataset for scientific novelty detection via a knowledge graph. We make our code and dataset publicly available at https://github.com/92Komal/Scientific_Novelty_Detection.
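
At its simplest, triplet-level novelty against a knowledge graph reduces to a membership test, as the toy sketch below illustrates. The paper's actual baseline is a learned, feature-based classifier; the triplets here are invented.

    # A knowledge graph as a set of (subject, predicate, object) triplets.
    kg = {
        ("BERT", "used_for", "text classification"),
        ("CRF", "used_for", "sequence labeling"),
    }

    def is_candidate_novel(triplet, kg):
        """A triplet absent from the accumulated KG is potentially novel."""
        return triplet not in kg

    print(is_candidate_novel(("BERT", "used_for", "novelty detection"), kg))  # True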

✇ International Journal on Digital Libraries

Human-in-the-loop latent space learning for biblio-record-based literature management

4 January 2024, 00:00

Abstract

Every researcher must conduct a literature review, and the document management needs of researchers working on various research topics vary. However, there are two major challenges. First, traditional methods such as tree hierarchies of document folders and tag-based management are no longer effective given the enormous volume of publications. Second, although bibliographic information is available to everyone, many papers can only be accessed through paid services. This study attempts to develop an interactive tool for personal literature management based solely on bibliographic records. To make such a tool possible, we developed a principled "human-in-the-loop latent space learning" method that estimates the management criteria of each researcher based on his or her feedback in order to calculate the positions of documents in a two-dimensional space on the screen. As a set of bibliographic records forms a graph, our model is naturally designed as a graph-based encoder–decoder model that connects the graph and the space. In addition, we devised an active learning framework for it using uncertainty sampling; the challenge here is to define the uncertainty in this problem setting. Experiments with ten researchers from the humanities, science, and engineering domains show that the proposed framework provides superior results to a typical graph convolutional encoder–decoder model. In addition, we found that our active learning framework was effective in selecting good samples.
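
The uncertainty-sampling step mentioned above can be sketched generically: among unlabeled documents, ask the researcher about those whose predicted distribution is closest to uniform (highest entropy). The code below is a minimal illustration, not the paper's graph-specific definition of uncertainty.

    import numpy as np

    def select_most_uncertain(probs, k):
        """probs: (n_items, n_classes) predicted probabilities per document."""
        entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
        return np.argsort(-entropy)[:k]  # indices to show to the researcher next

    probs = np.array([[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]])
    print(select_most_uncertain(probs, k=1))  # -> [1], the least certain item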

✇ International Journal on Digital Libraries

OAVA: the open audio-visual archives aggregator

16 December 2023, 00:00

Abstract

The purpose of the current article is to provide an overview of an open-access audiovisual aggregation and search service platform developed for Greek audiovisual content during the OAVA (Open Access AudioVisual Archive) project. The platform allows searching audiovisual resources by their metadata descriptions, as well as full-text search over content generated by automatic speech recognition (ASR) processes based on deep learning models. A dataset of reliable Greek audiovisual content providers and their resources (1710 in total) was created. Both providers and resources are reviewed according to specific criteria, already established and used for content aggregation purposes, to ensure the quality of the content and to avoid copyright infringements. Well-known aggregation services and well-established schemas for audiovisual resources have been studied and considered with regard to both aggregated content and metadata. Most Greek audiovisual content providers do not use established metadata schemas when publishing their content, nor is technical cooperation with them guaranteed; thus, a model was developed for reconciliation and aggregation. To utilize audiovisual resources, the OAVA platform makes use of the latest state-of-the-art ASR approaches and supports Greek and English speech-to-text models. Specifically for Greek, to mitigate the scarcity of available datasets, a large-scale ASR dataset was annotated to train and evaluate deep learning architectures. The result of the above-mentioned efforts, namely the selection of content and metadata, the development of appropriate ASR techniques, and the aggregation and enrichment of content and metadata, is the OAVA platform: a unified search mechanism for Greek audiovisual content that will serve teaching, research, and cultural activities. The OAVA platform is available at: https://openvideoarchives.gr/.
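
The ASR step can be sketched with a generic speech-to-text pipeline. The model identifier below is a placeholder for a fine-tuned Greek checkpoint, not the OAVA project's own model, and the audio file name is invented.

    from transformers import pipeline

    # Hypothetical checkpoint id; substitute a real Greek ASR model.
    asr = pipeline("automatic-speech-recognition", model="some-org/wav2vec2-greek")

    result = asr("lecture_clip.wav")  # a local 16 kHz mono audio file
    print(result["text"])  # transcript, to be indexed for full-text search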

✇ International Journal on Digital Libraries

User versus institutional perspectives of metadata and searching: an investigation of online access to cultural heritage content during the COVID-19 pandemic

15 December 2023, 00:00

Abstract

Findings from log analyses of user interactions with the digital content of two large national cultural heritage institutions (National Museums of Scotland and National Galleries of Scotland) during the COVID-19 lockdown highlighted limited engagement compared to pre-pandemic levels. Just 8% of users returned to these sites, whilst the average time spent and the number of pages accessed were generally low. This prompted a user study to investigate the potential mismatch between the way content is indexed by curators and searched for by users. A controlled experiment with ten participants, involving two tasks and a selected set of digital cultural heritage content, explored (a) how the metadata assigned by cultural heritage organisations meets or differs from the search needs of users, and (b) how the search strategies of users can inform the search pathways employed by cultural heritage organisations. Findings reveal that collection management standards like Spectrum encourage a variety of different characteristics to be considered when developing metadata, yet much of the content is left to the interpretation of curators. User- and context-specific guidelines could instead be beneficial in ensuring that the aspects considered most important by consumers are indexed, thereby producing more relevant search results. A user-centred approach to designing cultural heritage websites would help to improve an individual's experience when searching for information. However, a process is needed for institutions to form a concrete understanding of who their target users are before developing features and designs to suit their specific needs and interests.

✇ International Journal on Digital Libraries

Enhancing the examination of obstacles in an automated peer review system

2 December 2023, 00:00

Abstract

The peer review process is the main academic mechanism to ensure that science advances and is disseminated. To contribute to this important process, classification models have been created to perform two tasks: review score prediction (RSP) and paper decision prediction (PDP). But what challenges prevent us from having a fully efficient system responsible for these tasks, and how far are we from having such an automated system? To answer these questions, we evaluated the overall performance of existing state-of-the-art models for the RSP and PDP tasks and investigated which types of instances these models tend to have difficulty classifying and how impactful those instances are. We found, for example, that the performance of a model predicting the final decision of a paper is 23.31% lower when it is exposed to difficult instances, and that the classifiers make mistakes with very high confidence. These and other results lead us to conclude that there are groups of instances that can negatively impact a model's performance. Consequently, the current state-of-the-art models have the potential to help editors decide whether to approve or reject a paper; however, we are still far from having a system that is fully responsible for scoring a paper and deciding whether it will be accepted or rejected.
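
The kind of slice-based error analysis described above can be sketched as follows: compare accuracy on easy versus difficult instances and inspect the confidence of the mistakes. The function name, labels, and probabilities below are illustrative, not the paper's data.

    import numpy as np

    def slice_report(y_true, probs, is_difficult):
        """probs: (n, n_classes); is_difficult: boolean mask over instances."""
        preds = probs.argmax(axis=1)
        conf = probs.max(axis=1)
        for name, mask in [("easy", ~is_difficult), ("difficult", is_difficult)]:
            acc = (preds[mask] == y_true[mask]).mean()
            wrong = mask & (preds != y_true)  # mean below is NaN if no errors
            print(f"{name}: accuracy={acc:.2%}, "
                  f"mean confidence of errors={conf[wrong].mean():.2f}")

    y_true = np.array([1, 0, 1, 1])
    probs = np.array([[0.2, 0.8], [0.4, 0.6], [0.3, 0.7], [0.9, 0.1]])
    slice_report(y_true, probs, is_difficult=np.array([False, False, True, True]))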

✇ International Journal on Digital Libraries

The case for the Humanities Citation Index (HuCI): a citation index by the humanities, for the humanities

1 December 2023, 00:00

Abstract

Citation indexes are by now part of the research infrastructure in use by most scientists: a necessary tool in order to cope with the increasing amounts of scientific literature being published. Commercial citation indexes are designed for the sciences and have uneven coverage and unsatisfactory characteristics for humanities scholars, while no comprehensive citation index is published by a public organisation. We argue that an open citation index for the humanities is desirable, for four reasons: it would greatly improve and accelerate the retrieval of sources, it would offer a way to interlink collections across repositories (such as archives and libraries), it would foster the adoption of metadata standards and best practices by all stakeholders (including publishers) and it would contribute research data to fields such as bibliometrics and science studies. We also suggest that the citation index should be informed by a set of requirements relevant to the humanities. We discuss four such requirements: source coverage must be comprehensive, including books and citations to primary sources; there needs to be chronological depth, as scholarship in the humanities remains relevant over time; the index should be collection driven, leveraging the accumulated thematic collections of specialised research libraries; and it should be rich in context in order to allow for the qualification of each citation, for example, by providing citation excerpts. We detail the fit-for-purpose research infrastructure which can make the Humanities Citation Index a reality. Ultimately, we argue that a citation index for the humanities can be created by humanists, via a collaborative, distributed and open effort.

✇ International Journal on Digital Libraries

Holistic graph-based document representation and management for open science

1 December 2023, 00:00

Abstract

While most previous research focused only on the textual content of documents, advanced support for document management in digital libraries for open science requires handling all aspects of a document: structure, content, and context. These different but inter-related aspects cannot be handled separately, yet they were traditionally ignored in digital libraries. We propose a graph-based unifying representation and handling model, based on the definition of an ontology that integrates all the different perspectives and drives the document description, in order to boost the effectiveness of document management. We also show how even simple algorithms can profitably use our proposed approach to return relevant and personalized outcomes in different document management tasks.

✇ International Journal on Digital Libraries

Focused Issue on Digital Library Challenges to Support the Open Science Process

1 December 2023, 00:00

Abstract

Open Science is a broad term that covers several efforts to remove the barriers to sharing any kind of output, resource, method or tool at any stage of the research process (https://book.fosteropenscience.eu/en/). The Open Science process is a set of transparent research practices that help to improve the quality of scientific knowledge and that are crucial to the most basic aspects of the scientific process, embodied in the FAIR (Findable, Accessible, Interoperable, and Reusable) principles. Thanks to research transparency and accessibility, we can evaluate the credibility of scientific claims, make the research process reproducible and make the obtained results replicable. In this context, digital libraries play a pivotal role in supporting the Open Science process by facilitating the storage, organization, and dissemination of research outputs, including open access publications and open data. In this focused issue, we invited researchers to discuss innovative solutions to technical and other challenges concerning the identifiability of digital objects, the use of metadata and ontologies to support replicable and reusable research, the adoption of standards and semantic technologies to link information, and the evaluation of the application of the FAIR principles.

✇ International Journal on Digital Libraries

Universities, heritage, and non-museum institutions: a methodological proposal for sustainable documentation

27 October 2023, 00:00

Abstract

This paper aims to provide a sustainable methodology for documenting the small (and underfunded) but often important university heritage collections. The sequence proposed by the DBLC (Database Life Cycle) (Coronel and Morris, Database Systems: Design, Implementation, & Management. Cengage Learning, Boston, 2018; Oppel, Databases: A Beginner's Guide. McGraw-Hill, New York, 2009) is followed, focusing on the database design phase. The resulting proposals aim at harmonising the different documentation tools developed by GLAM institutions (an acronym highlighting the common aspects of Galleries, Libraries, Archives and Museums), all of which are present in the university environment. The work phases are based mainly on the work of Valle, Fernández Cacho, and Arenillas (Muñoz Cruz et al., Introducción a la documentación del patrimonio cultural. Consejería de Cultura de la Junta de Andalucía, Seville, 2017), combined with the experience acquired from the creation of the virtual museum at our institution. The creation of a working team that includes university staff members is recommended, because we believe that universities have sufficient capacity to manage their own heritage. For documentation, we recommend the use of application profiles that consider the new trends in the semantic web and LOD (Linked Open Data) and that are created using structural interchange standards such as Dublin Core, LIDO, or Darwin Core, combined with content and value standards adapted from the GLAM area. The application of the methodology described above will make it possible to obtain quality metadata in a sustainable way given the limited resources of university collections. A proposed metadata schema is provided as an annex.
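
As an illustration of describing a collection item with a structural interchange standard, the sketch below builds a minimal Dublin Core record with rdflib. The item IRI and the field values are invented; a real application profile would add further elements and controlled values.

    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DC

    g = Graph()
    item = URIRef("https://example.org/university-museum/item/42")  # hypothetical IRI
    g.add((item, DC.title, Literal("Brass microscope, c. 1890")))
    g.add((item, DC.creator, Literal("Unknown maker")))
    g.add((item, DC.subject, Literal("Scientific instruments")))
    print(g.serialize(format="turtle"))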

✇ International Journal on Digital Libraries

Digital Libraries, Epigraphy and Paleography: Bring Records from the Distant Past to the Present: Part II

1 September 2023, 00:00

Abstract

The two volumes of this Special Issue explore the intersections of digital libraries, epigraphy and paleography. Digital libraries research, practices and infrastructures have transformed the study of ancient inscriptions by providing organizing principles for collections building, defining interoperability requirements and developing innovative user tools and services. Yet linking collections and their contents to support advanced scholarly work in epigraphy and paleography tests the limits of current digital libraries applications. This is due, in part, to the magnitude and heterogeneity of works created over a period of more than five millennia. The remarkable diversity ranges from the types of artifacts to the methods used in their production to the singularity of individual marks contained within them. Conversion of analogue collections to digital repositories is well underway, but most often not in a way that meets the basic requirements needed to support scholarly workflows. This is beginning to change as collections and content are being described more fully with rich annotations and metadata conforming to established standards. New uses of imaging technologies and computational approaches are remediating damaged works and revealing text that has, over time, become illegible or hidden. Transcription of handwritten text to machine-readable form is still primarily a manual process, but research into automated transcription is moving forward. Progress in digital libraries research and practices, coupled with collections development of ancient written works, suggests that epigraphy and paleography will gain new prominence in the Academy.

✇ International Journal on Digital Libraries

Analytical developments for the Homer Multitext: palaeography, orthography, morphology, prosody, semantics

1 September 2023, 00:00

Abstract

We describe ongoing development for The Homer Multitext, focusing on the interlocking challenges of automated analysis of diplomatic manuscript transcriptions. With the goal of lexical and morphological analysis of prose and poetry texts, and metrical analysis of poetic texts (and quotations thereof), we face the challenge of working generically across languages and across multiple possible orthographies in each language. In the case of Greek, our working dataset includes Greek following the conventions of Attica before 404 BCE, the conventions of "standard" literary polytonic Greek, and the particular conventions found in Byzantine codex manuscripts of Greek epic poetry with accompanying commentary. The latest work involves re-implementing existing CITE Architecture libraries in the Julia language, with documentation in the form of runnable code notebooks using the Pluto.jl framework. The Homer Multitext has been a work in progress for two decades. Because of the project's emphasis on simple data formats (plain text, very simple XML, tabular lists), our data remain valid even as we gain understanding of the challenges posed by our source material, particularly the 10th- and 11th-century manuscripts of Greek epic poetry with accompanying ancient commentary that, within themselves, represent over a thousand years of linguistic evolution. The work outlined here represents the latest shift in our development tools, a flexibility likewise made possible by the separation of concerns that has been a central value in the project.

✇ International Journal on Digital Libraries

Challenges in replaying archived Twitter pages

26 August 2023, 00:00

Abstract

Historians and researchers rely on web archives to preserve social media content that no longer exists on the live web. However, what we see on the live web and how it is replayed in the archive are not always the same. In this study, we document and analyze the problems in archiving Twitter after Twitter switched to a new user interface (UI) in June 2020. Most web archives could not archive the new UI, resulting in archived Twitter pages displaying Twitter's "Something went wrong" error. The challenges in archiving the new UI forced web archives to continue using the old UI. However, features such as Twitter labels were part of the new UI; hence, web archives capturing Twitter's old UI would be missing these labels. To analyze the potential loss of information in web archival data due to this change, we used the personal Twitter account of the 45th President of the USA, @realDonaldTrump, which was suspended by Twitter on January 8, 2021. Trump's account was heavily labeled by Twitter for spreading misinformation; however, we discovered that there is no evidence in web archives to prove that some of his tweets ever had a label assigned to them. We also studied the possibility of temporal violations in archived versions of the new UI, which may result in the replay of pages that never existed on the live web. Finally, we discovered that when some tweets with embedded media are replayed, portions of the rewritten t.co URL, meant to be hidden from the end user, are partially exposed in the replayed page. Our goal is to educate researchers who use web archives and caution them when drawing conclusions based on archived Twitter pages.

✇ International Journal on Digital Libraries

Graduate student search strategies within academic digital libraries

15 August 2023, 00:00

Abstract

When searching within an academic digital library, a variety of information seeking strategies may be employed. The purpose of this study is to determine whether graduate students choose appropriate information seeking strategies for the complexity of a given search scenario and to explore other factors that could influence their decisions. We used a survey method in which participants (n = 176) were asked to recall their most recent instance of an academic digital library search session that matched two given scenarios (randomly chosen from four alternatives) and, for each scenario, to identify whether they employed search strategies associated with four different information seeking models. Among the search strategies, only lookup search was used in a manner consistent with the complexity of the search scenario. Other factors that influenced the choice of strategy were the discipline of study and the type of academic search training received. Patterns of search tool use with respect to the complexity of the search scenarios were also identified. These findings highlight that not only is it important to train graduate students in how to conduct academic digital library searches; more work is also needed to train them to match information seeking strategies to the complexity of their search tasks and to develop interfaces that guide their search process.

✇ International Journal on Digital Libraries

Cross-lingual extreme summarization of scholarly documents

10 August 2023, 00:00

Abstract

The number of scientific publications nowadays is rapidly increasing, causing information overload for researchers and making it hard for scholars to keep up to date with current trends and lines of work. Recent work has tried to address this problem by developing methods for automated summarization in the scholarly domain, but concentrated so far only on monolingual settings, primarily English. In this paper, we consequently explore how state-of-the-art neural abstract summarization models based on a multilingual encoder–decoder architecture can be used to enable cross-lingual extreme summaries of scholarly texts. To this end, we compile a new abstractive cross-lingual summarization dataset for the scholarly domain in four different languages, which enables us to train and evaluate models that process English papers and generate summaries in German, Italian, Chinese and Japanese. We present our new X-SCITLDR dataset for multilingual summarization and thoroughly benchmark different models based on a state-of-the-art multilingual pre-trained model, including a two-stage pipeline approach that independently summarizes and translates, as well as a direct cross-lingual model. We additionally explore the benefits of intermediate-stage training using English monolingual summarization and machine translation as intermediate tasks and analyze performance in zero- and few-shot scenarios. Finally, we investigate how to make our approach more efficient on the basis of knowledge distillation methods, which make it possible to shrink the size of our models, so as to reduce the computational complexity of the summarization inference.
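
The two-stage pipeline baseline mentioned above (summarize, then translate) can be sketched with off-the-shelf checkpoints. The model ids below are common public models, not the paper's fine-tuned X-SCITLDR systems, and the input text is a stand-in for a full paper.

    from transformers import pipeline

    summarize = pipeline("summarization", model="facebook/bart-large-cnn")
    translate = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

    paper_text = (
        "We study cross-lingual extreme summarization of scholarly documents "
        "and compare two-stage pipelines with direct cross-lingual models."
    )
    summary_en = summarize(paper_text, min_length=10, max_length=40)[0]["summary_text"]
    summary_de = translate(summary_en)[0]["translation_text"]
    print(summary_de)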

✇ International Journal on Digital Libraries

Controlled vocabularies in digital libraries: challenges and solutions for increased discoverability of digital objects

5 August 2023, 00:00

Abstract

Digital library systems are widely used in the Higher Education sector, through the use of Institutional Repositories (IRs), to collect, store, manage and make available the scholarly research output produced by Higher Education Institutions (HEIs). This wide application of IRs is a direct response to the increase in scholarly research output. In order to facilitate the discoverability of digital content in IRs, the accurate, consistent and comprehensive association of descriptive metadata with digital objects during ingestion is crucial. However, due to human errors resulting from complex IR ingestion workflows, much digital content in IRs has incorrect and inconsistent descriptive metadata. While there exists a broad spectrum of descriptive metadata elements, subject headings present a classic example of a crucial metadata element that adversely affects the discoverability of digital content when incorrectly and inconsistently specified. This paper outlines a case study conducted at an HEI, The University of Zambia, to demonstrate the effectiveness of integrating controlled subject vocabularies during the ingestion of digital objects into IRs. A situational analysis was conducted to understand how subject headings are associated with digital objects and to analyse the subject headings associated with already ingested digital objects. In addition, an exploratory study was conducted to determine domain-specific subject headings to be integrated with the IR. Furthermore, a usability study was conducted to comparatively determine the usefulness of controlled vocabularies during the ingestion of digital objects into IRs. Finally, multi-label classification experiments were carried out in which digital objects were assigned more than one class. The results of the study revealed that a noticeable amount of digital content is associated with incorrect subject categories and, additionally, with few subject headings: 71.2% of objects have two or fewer subject headings, and a significant proportion of subject headings (92.1%) are associated with a single publication. The comparative study suggests that IRs integrated with controlled vocabularies are perceived to be more usable (SUS score = 68.9) than IRs without controlled vocabularies (SUS score = 66.2). Furthermore, the effectiveness of the multi-label arXiv subjects classifier demonstrates the viability of integrating automated techniques for subject classification.
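
The multi-label experiments can be illustrated with a compact scikit-learn setup: TF-IDF features plus a one-vs-rest classifier, where each object may receive several subject headings. The titles and labels below are invented, and this is a generic sketch rather than the study's actual classifier.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MultiLabelBinarizer

    titles = [
        "Deep learning for crop yield prediction",
        "Copyright law and institutional repositories",
        "Neural networks for plant disease detection",
    ]
    subjects = [["computing", "agriculture"], ["law"], ["computing", "agriculture"]]

    mlb = MultiLabelBinarizer()
    y = mlb.fit_transform(subjects)  # one binary column per subject heading
    clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
    clf.fit(titles, y)
    pred = clf.predict(["Machine learning for crop disease monitoring"])
    print(mlb.inverse_transform(pred))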

✇ International Journal on Digital Libraries

Complexities of leveraging user-generated book reviews for scholarly research: transiency, power dynamics, and cultural dependency

31 July 2023, 00:00

Abstract

In the past two decades, digital libraries (DL) have increasingly supported computational studies of digitized books (Jett et al., The HathiTrust Research Center Extracted Features Dataset (2.0), 2020; Underwood, Distant Horizons: Digital Evidence and Literary Change, University of Chicago Press, Chicago, 2019; Organisciak et al., J Assoc Inf Sci Technol 73:317–332, 2022; Michel et al., Science 331:176–182, 2011). Nonetheless, there remains a dearth of DL data provisions or infrastructures for research on book reception, and user-generated book reviews have opened up unprecedented research opportunities in this area. However, insufficient attention has been paid to the real-world complexities and limitations of using these datasets in scholarly research, which may cause analytical oversights (Crawford and Finn, Geo J 80:491–502, 2015), methodological pitfalls (Olteanu et al., Front Big Data 2:13, 2019), and ethical concerns (Hu et al., Research with User-Generated Book Review Data: Legal and Ethical Pitfalls and Contextualized Mitigations, Springer, Berlin, 2023; Diesner and Chin, Gratis, Libre, or Something Else? Regulations and Misassumptions Related to Working with Publicly Available Text Data, 2016). In this paper, we present three case studies that contextually and empirically investigate book reviews for their temporal, cultural, and socio-participatory complexities: (1) a longitudinal analysis of a ranked book list across ten years and over one month; (2) a text classification of 20,000 sponsored and 20,000 non-sponsored book reviews; and (3) a comparative analysis of 537 book ratings from Anglophone and non-Anglophone readerships. Our work reflects on both (1) data curation challenges that researchers may encounter (e.g., platform providers' lack of bibliographic control) when studying book reviews and (2) mitigations that researchers might adopt to address these challenges (e.g., how to align data from various platforms). Taken together, our findings illustrate some of the sociotechnical complexities of working with user-generated book reviews by revealing the transiency, power dynamics, and cultural dependency in these datasets. This paper explores some of the limitations and challenges of using user-generated book reviews for scholarship and calls for critical and contextualized usage of user-generated book reviews in future scholarly research.

✇ International Journal on Digital Libraries

RDFtex in-depth: knowledge exchange between LaTeX-based research publications and Scientific Knowledge Graphs

31 July 2023, 00:00

Abstract

For populating Scientific Knowledge Graphs (SciKGs), research publications are a central information source. However, typical forms of research publication, like traditional papers, do not provide means of integrating contributions into SciKGs, nor do they support making direct use of the rich information SciKGs provide. To tackle this, the present paper proposes RDFtex, a framework enabling (1) the import of contributions represented in SciKGs to facilitate the preparation of LaTeX-based research publications and (2) the export of original contributions from papers to facilitate their integration into SciKGs. The framework's functionality is demonstrated using the present paper itself, since it was prepared with our proof-of-concept implementation of RDFtex. The runtime of the implementation's preprocessor was evaluated on three projects with different numbers of imports and exports, and a small user study (N = 10) was conducted to obtain initial user feedback. The concept and the process of preparing a LaTeX-based research publication using RDFtex are discussed thoroughly. RDFtex's import functionality takes considerably more time than its export functionality; nevertheless, the entire preprocessing takes only a fraction of the time required to compile the PDF. The users were able to solve all predefined tasks but preferred the import functionality over the export functionality because of its general simplicity. RDFtex is a promising approach to facilitate the move toward knowledge-graph-augmented research, since it introduces only minor differences compared to the preparation of traditional LaTeX-based publications while narrowing the gap between papers and SciKGs.
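
The export direction can be pictured as serializing a paper's original contributions as RDF triples for SciKG ingestion. The namespace and property names below are illustrative, not RDFtex's actual vocabulary.

    from rdflib import Graph, Literal, Namespace, URIRef

    EX = Namespace("https://example.org/scikg/")  # hypothetical SciKG namespace
    g = Graph()
    paper = URIRef("https://example.org/paper/rdftex-demo")
    g.add((paper, EX.hasContribution, Literal("RDFtex preprocessor")))
    g.add((paper, EX.addresses, Literal("knowledge exchange between papers and SciKGs")))
    print(g.serialize(format="turtle"))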

✇ International Journal on Digital Libraries

Is this news article still relevant? Ranking by contemporary relevance in archival search

28 July 2023, 00:00

Abstract

Our civilization creates enormous volumes of digital data, a substantial fraction of which is preserved and made publicly available for present and future usage. Additionally, historical born-analog records are progressively being digitized and incorporated into digital document repositories. While professionals often have a clear idea of what they are looking for in document archives, average users are likely to have no precise search needs when accessing available archives (e.g., through their online interfaces). Thus, if the results are to be relevant and appealing to average users, they should include engaging and recognizable material. However, state-of-the-art document archival retrieval systems essentially use the same approaches as search engines for synchronic document collections. In this article, we develop novel ranking criteria for assessing the usefulness of archived contents based on their estimated relationship with current times, which we call contemporary relevance. Contemporary relevance may be utilized to enhance access to archival document collections, increasing the likelihood that users will discover interesting or valuable material. We then present an effective strategy for estimating the contemporary relevance degrees of news articles by utilizing a learning-to-rank approach based on a variety of diverse features, and we successfully test it on the New York Times news collection. The incorporation of contemporary relevance computation into archival retrieval systems should enable a new search style in which search results relate to the context of searchers' times, and thereby have the potential to engage archive users. As a proof of concept, we develop and demonstrate a working prototype of a simplified ranking model that operates on top of the Portuguese Web Archive portal (arquivo.pt).
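
The learning-to-rank step can be sketched with a gradient-boosted ranker. The features, relevance grades, and group sizes below are invented stand-ins for the article features and contemporary-relevance labels described above.

    import numpy as np
    from lightgbm import LGBMRanker

    X = np.random.rand(8, 4)                 # per-article feature vectors
    y = np.array([2, 0, 1, 0, 2, 1, 0, 1])   # graded contemporary relevance
    group = [4, 4]                           # two queries, four candidates each

    # min_child_samples=1 lets the toy dataset produce actual splits.
    ranker = LGBMRanker(n_estimators=50, min_child_samples=1)
    ranker.fit(X, y, group=group)
    scores = ranker.predict(X[:4])           # score the first query's candidates
    print(np.argsort(-scores))               # candidates from most to least relevant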

✇ International Journal on Digital Libraries

Gesture retrieval and its application to the study of multimodal communication

24 July 2023, 00:00

Abstract

Comprehending communication is dependent on analyzing the different modalities of conversation, including audio, visual, and others. This is a natural process for humans, but in digital libraries, where preservation and dissemination of digital information are crucial, it is a complex task. A rich conversational model, encompassing all modalities and their co-occurrences, is required to effectively analyze and interact with digital information. Currently, the analysis of co-speech gestures in videos is done through manual annotation by linguistic experts based on textual searches. However, this approach is limited and does not fully utilize the visual modality of gestures. This paper proposes a visual gesture retrieval method using a deep learning architecture to extend current research in this area. The method is based on body keypoints and uses an attention mechanism to focus on specific groups. Experiments were conducted on a subset of the NewsScape dataset, which presents challenges such as multiple people, camera perspective changes, and occlusions. A user study was conducted to assess the usability of the results, establishing a baseline for future gesture retrieval methods in real-world video collections. The results of the experiment demonstrate the high potential of the proposed method in multimodal communication research and highlight the significance of visual gesture retrieval in enhancing interaction with video content. The integration of visual similarity search for gestures in the open-source multimedia retrieval stack, vitrivr, can greatly contribute to the field of computational linguistics. This research advances the understanding of the role of the visual modality in co-speech gestures and highlights the need for further development in this area.
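
Stripped to its essentials, keypoint-based gesture retrieval reduces each video segment to a pose feature vector and ranks segments by similarity to a query. The sketch below uses plain cosine similarity as a stand-in for the deep, attention-based model, and the vectors are random placeholders.

    import numpy as np

    def top_k_similar(query_vec, index, k=3):
        """Rank indexed segments by cosine similarity to the query gesture."""
        index_n = index / np.linalg.norm(index, axis=1, keepdims=True)
        query_n = query_vec / np.linalg.norm(query_vec)
        return np.argsort(-(index_n @ query_n))[:k]

    # 100 segments, each a flattened vector of 17 body keypoints (x, y).
    index = np.random.rand(100, 34)
    query = np.random.rand(34)
    print(top_k_similar(query, index))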

✇ International Journal on Digital Libraries

Comparing different search methods for the open access journal recommendation tool B!SON

20 July 2023, 00:00

Abstract

Finding a suitable open access journal to publish academic work is a complex task: researchers have to navigate a constantly growing number of journals, institutional agreements with publishers, funders' conditions and the risk of predatory publishers. To help with these challenges, we introduce a web-based journal recommendation system called B!SON. A systematic requirements analysis was conducted in the form of a survey. The tool suggests open access journals based on the title, abstract and references provided by the user. The recommendations are built on open data, are publisher-independent and work across domains and languages. Transparency is provided by the tool's open-source nature, an open application programming interface (API) and by indicating which matches the shown recommendations are based on. The recommendation quality has been evaluated using two different evaluation techniques, including several new recommendation methods. We were able to improve on the results of our previous paper with a pre-trained transformer model. The beta version of the tool received positive feedback from the community and in several test sessions. In summary, we developed a recommendation system for open access journals to help researchers find a suitable journal; the open tool has been extensively tested, and we identified possible improvements to our current recommendation technique. Development by two German academic libraries ensures the longevity and sustainability of the system.
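
The core matching idea can be sketched as textual similarity between the submitted title/abstract and each candidate journal's published content. TF-IDF below is a simple stand-in (the abstract above also mentions a pre-trained transformer model), and the journal profiles are invented.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    journal_profiles = {  # concatenated titles/abstracts per journal (invented)
        "Journal A": "digital libraries metadata repositories open access",
        "Journal B": "machine learning neural networks optimization",
    }
    manuscript = "An open source recommendation system for open access journals"

    vec = TfidfVectorizer()
    matrix = vec.fit_transform(list(journal_profiles.values()) + [manuscript])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    for name, score in sorted(zip(journal_profiles, scores), key=lambda p: -p[1]):
        print(f"{name}: {score:.3f}")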
