News from eLiteracias

International Journal on Digital Libraries

The Perseus Digital Library and the future of libraries

Abstract

This paper describes the Perseus Digital Library as, in part, a response to limitations of what is now a print culture that is rapidly receding from contemporary consciousness and, at the same time, as an attempt to fashion an infrastructure for the study of the past that can support a shared cultural heritage that extends beyond Europe and is global in scope. But if Greco-Roman culture cannot by itself represent the background of an international twenty-first century culture, this field, at the same time, offers challenges in its scale and complexity that allow us to explore the possibilities of digital libraries. Greco-Roman studies is in a position to begin creating a completely transparent intellectual ecosystem, with a critical mass of its primary data available under an open license and with new forms of reading support that make sources in ancient and modern languages accessible to a global audience. In this model, traditional libraries play the role of archives: physically constrained spaces to which a handful of specialists can have access. If non-specialists draw problematic conclusions because the underlying sources are not publicly available and as well-documented as possible, the responsibility lies with the specialists who have not yet created the open, digital libraries upon which the intellectual life of humanity must depend. Greco-Roman studies can play a major role in modeling such libraries. Perseus seeks to contribute to that transformation.

  • 1 June 2023, 00:00

Linking different scientific digital libraries in Digital Humanities: the IMAGO case study

Abstract

In recent years, several scientific digital libraries (DLs) in the digital humanities (DH) field have been developed following the Open Science principles. These DLs aim at sharing research outcomes, in several cases as FAIR data, and at creating linked information spaces. In several cases, Semantic Web technologies and Linked Data have been used to reach these aims. This paper presents how current scientific DLs in the DH field can support the creation of linked information spaces and navigational services that allow users to navigate them, using Semantic Web technologies to formally represent, search, and browse knowledge. To support the argument, we present our experience in developing a scientific DL that supports scholars in creating, evolving, and consulting a knowledge base related to Medieval and Renaissance geographical works within the three-year (2020–2023) Italian national research project IMAGO—Index Medii Aevi Geographiae Operum. In the presented case study, a linked information space was created to allow users to discover and navigate knowledge across multiple repositories, thanks to the extensive use of ontologies. In particular, the linked information spaces created within the IMAGO project make use of five different datasets: Wikidata, the MIRABILE digital archive, the Nuovo Soggettario thesaurus, the Mapping Manuscript Migration knowledge base, and the Pleiades gazetteer. Linking these datasets considerably enriches the knowledge collected in the IMAGO KB.

  • 1 December 2022, 00:00

Envisioning networked provenance data storytelling with American cuneiform collections

Abstract

Cuneiform tablets remain founding cornerstones of more than two hundred collections in American academic institutions, having been acquired a century or more ago under dynamic ethical norms and global networks. To foster data sharing, this contribution incorporates empirical data from interactive ArcGIS and reusable OpenContext maps to encourage tandem dialogues about using the inscribed works and learning their collecting histories. Such provenance research aids initiate, on their own, the narration of objects’ journeys over time while cultivating the digital inclusion of expert local knowledge relevant to an object biography. The paper annotates several approaches institutions are using, or might consider using, to expand upon current provenance information in ways that encourage visitors’ critical thinking and learning about global journeys, travel archives, and such dispositions as virtual reunification, reconstruction, or restitution made possible by provenance research.

  • 1 September 2023, 00:00

VIVA: visual information retrieval in video archives

Abstract

Video retrieval methods, e.g., for visual concept classification, person recognition, and similarity search, are essential to perform fine-grained semantic search in large video archives. However, such retrieval methods often have to be adapted to the users’ changing search requirements: which concepts or persons are frequently searched for, what research topics are currently important or will be relevant in the future? In this paper, we present VIVA, a software tool for building content-based video retrieval methods based on deep learning models. VIVA allows non-expert users to conduct visual information retrieval for concepts and persons in video archives and to add new people or concepts to the underlying deep learning models as new requirements arise. For this purpose, VIVA provides a novel semi-automatic data acquisition workflow including a web crawler, image similarity search, as well as review and user feedback components to reduce the time-consuming manual effort for collecting training samples. We present experimental retrieval results using VIVA for four use cases in the context of a historical video collection of the German Broadcasting Archive based on about 34,000 h of television recordings from the former German Democratic Republic (GDR). We evaluate the performance of deep learning models built using VIVA for 91 GDR specific concepts and 98 personalities from the former GDR as well as the performance of the image and person similarity search approaches.

  • 1 December 2022, 00:00

Scientific paper recommendation systems: a literature review of recent publications

Abstract

Scientific writing builds upon already published papers. Manually identifying publications to read, cite, or consider as related work relies on a researcher’s ability to pick fitting keywords or initial papers from which a literature search can be started. The rapidly increasing number of papers has called for automatic measures to find the desired relevant publications, so-called paper recommendation systems. As the number of publications increases, so does the number of paper recommendation systems. Former literature reviews focused on discussing the general landscape of approaches throughout the years and highlighting the main directions. We refrain from this perspective; instead, we consider only a comparatively small time frame but analyse it fully. In this literature review we discuss the methods, datasets, evaluations, and open challenges encountered in all works first released between January 2019 and October 2021. The goal of this survey is to provide a comprehensive and complete overview of current paper recommendation systems.

  • 1 December 2022, 00:00

The emerging digital infrastructure for research in the humanities

Abstract

This article advances the thesis that three decades of investments by national and international funders, combined with those of scholars, technologists, librarians, archivists, and their institutions, have resulted in a digital infrastructure in the humanities that is now capable of supporting end-to-end research workflows. The article refers to key developments in the epigraphy and paleography of the premodern period. It draws primarily on work in classical studies but also highlights related work in the adjacent disciplines of Egyptology, ancient Near East studies, and medieval studies. The argument makes a case that much has been achieved but it does not declare “mission accomplished.” The capabilities of the infrastructure remain unevenly distributed within and across disciplines, institutions, and regions. Moreover, the components, including the links between steps in the workflow, are generally far from user-friendly and seamless in operation. Because further refinements and additional capacities are still much needed, the article concludes with a discussion of key priorities for future work.

  • 1 June 2023, 00:00

Accessibility of master’s theses at SEA-EU Alliance universities in open access repositories

Abstract

This research analyzed and compared master’s theses deposits in institutional repositories of the SEA-EU Alliance universities in order to determine whether universities’ mandates to deposit master’s theses influence the number of theses deposited in the institutional repositories. We compared the universities’ institutional repository content, focusing on the ratio of the number of deposited master’s theses to the number of enrolled students, taking into account the open access policies or, more specifically, the national and university mandates to deposit master’s theses. The research methods involved a quantitative approach in which the data were collected through direct communication with universities’ employees and from the institutional repositories in our sample. Our analysis showed that the number of papers stored in repositories reflects the open access policy or, more specifically, the master’s theses deposit policy at certain universities. Furthermore, when analyzing the distribution of the number of theses across scientific disciplines, as well as the degree of openness, it was evident that it largely concurred with the trends recorded in the current scholarly literature.

  • 1 December 2023, 00:00

Bringing places from the distant past to the present: a report on the World Historical Gazetteer

Abstract

This article is a report about the progress and current status of the World Historical Gazetteer (whgazetteer.org) (WHG) in the context of its value for helping to organize and record digital and paleographic information. It summarizes the development and functionality of the WHG as a software platform for connecting specialist collections of historical place names. It also reviews the idea of places as entities (rather than simple objects with single labels). It also explains the utility of gazetteers in digital library infrastructure and describes potential future developments.

  • 1 September 2023, 00:00

Coins in the library: the creation of a digital collection of Roman Republican coins

Abstract

In 2001, Rutgers University Libraries (RUL) accepted a substantial donation of Roman Republican coins. The work to catalog, house, digitize, describe, and present this collection online provided unique challenges for the institution. Coins are often seen as museum objects; however, they can serve pedagogical purposes within libraries. In the quest to innovate, RUL digitized coins from seven angles to provide a 180-degree view of each coin. However, this strategy had its drawbacks; it had to be reassessed as the project continued. RUCore, RUL’s digital repository, uses the Metadata Object Description Schema (MODS). Accordingly, it was necessary to adapt numismatic description to bibliographic metadata standards. With generous funding from the Loeb Foundation, the resulting digital collection of 1200 coins was added to RUCore from 2012 to 2018. Rutgers’s Badian Roman Coins Collection serves as an exemplar of numismatics in a library environment that is freely available to all on the Web.

  • 1 June 2023, 00:00

Computational metadata generation methods for biological specimen image collections

Abstract

Metadata is a key data source for researchers seeking to apply machine learning (ML) to the vast collections of digitized biological specimens that can be found online. Unfortunately, the associated metadata is often sparse and, at times, erroneous. This paper extends previous research conducted with the Illinois Natural History Survey (INHS) collection (7244 specimen images) that uses computational approaches to analyze image quality and then automatically generates 22 metadata properties representing the image quality and morphological features of the specimens. In the research reported here, we demonstrate the extension of our initial work to the University of Wisconsin Zoological Museum (UWZM) collection (4155 specimen images). Further, we enhance our computational methods in four ways: (1) augmenting the training set, (2) applying contrast enhancement, (3) upscaling small objects, and (4) refining our processing logic. Together these new methods improved our overall error rate from 4.6% to 1.1%. These enhancements also allowed us to compute an additional set of 17 image-based metadata properties. The new metadata properties provide supplemental features and information that may also be used to analyze and classify the fish specimens. Examples of these new features include convex area, eccentricity, perimeter, and skew. The newly refined process further outperforms humans in terms of time and labor cost, as well as accuracy, providing a novel solution for leveraging digitized specimens with ML. This research demonstrates the ability of computational methods to enhance the digital library services associated with the tens of thousands of digitized specimens stored in open-access repositories worldwide by generating accurate and valuable metadata for those repositories.

  • 23 November 2022, 00:00

Design, realization, and user evaluation of the ARCA system for exploring a digital library

Abstract

This paper presents ARCA, a software system that enables semantic search and exploration over a book catalog. The main purpose of this work is twofold: to propose a general paradigm for a semantic enrichment workflow and to evaluate a visual approach to information retrieval based on extracted information and existing knowledge graphs. ARCA has been designed and implemented following a user-centered design approach. Two different releases of the system have been incrementally and iteratively developed and evaluated. The first release was evaluated for the quality and usefulness of the extracted data. The second release, whose design was refined based on the previous evaluation results, was assessed by several users. Moreover, a comparative test with other information retrieval systems was conducted in order to study the potential added value of the system. ARCA is employed in a real editorial scenario to visually search and explore the books of a publishing house.

  • 1 March 2023, 00:00

Online reading lists: a mixed-method analysis of the academic perspective

Abstract

Reading list systems are widely used in tertiary education as a pedagogical tool and for tracking copyrighted material. This article explores the make-up of reading lists across a whole university. We investigated the experience of academics and librarians when creating reading lists. A mixed-method approach was employed in which we performed a transaction log analysis on reading lists at a single university from 2016 to 2020. A questionnaire was then answered by both academics and academic liaison librarians about their experience with reading lists. Our analysis found that uptake of reading lists varies widely between academic disciplines. Academic engagement with reading lists showed only incremental growth over time, and overall satisfaction of academics with the reading list system was low. We explore implications for reading lists implemented through digital libraries and recommend developing discipline-specific support to increase reading list numbers and integrating pedagogical features to increase academic buy-in.

  • 1 March 2023, 00:00

Evaluating and mitigating the impact of OCR errors on information retrieval

Abstract

Optical character recognition (OCR) is typically used to extract the textual contents of scanned texts. The output of OCR can be noisy, especially when the quality of the scanned image is poor, which in turn can impact downstream tasks such as information retrieval (IR). Post-processing OCR-ed documents is an alternative for fixing digitization errors and, intuitively, improving the results of downstream tasks. This work evaluates the impact of OCR digitization and correction on IR. We compared different digitization and correction methods on real OCR-ed data from an IR test collection with 22k documents and 34 query topics in the geoscientific domain in Portuguese. Our results showed significant differences in IR metrics across the digitization methods (up to 5 percentage points in terms of mean average precision). Regarding the impact of error correction, our results showed that, on average over the complete set of query topics, retrieval quality metrics change very little. However, a more detailed analysis revealed that correction improved 19 of the 34 query topics. Our findings indicate that, contrary to previous work, long documents are impacted by OCR errors.
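The mean average precision (MAP) figure cited above can be made concrete with a short sketch. The ranked lists, document IDs, and relevance judgments below are invented for illustration; only the metric itself is standard.

```python
def average_precision(ranked, relevant):
    """AP for one query: mean of precision@k over ranks where a relevant doc appears."""
    hits, precisions = 0, []
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over a set of query topics; runs is a list of (ranked_docs, relevant_set)."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# A noisier digitization that demotes a relevant document lowers AP:
ap_clean = average_precision(["d1", "d3", "d2"], {"d1", "d2"})  # relevant doc ranked first
ap_noisy = average_precision(["d3", "d1", "d2"], {"d1", "d2"})  # relevant doc demoted
```

Comparing pipelines as the authors do amounts to computing this over all 34 topics for each digitization method.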

  • 1 March 2023, 00:00

Transliterating Latin to Amharic scripts using user-defined rules and character mappings

Abstract

As social media platforms become increasingly accessible, individuals’ usage of new forms of textual communication (posts, comments, chats, etc.) on social media using local language scripts such as Amharic has increased tremendously. However, many users prefer to post comments in Latin scripts instead of local ones due to the availability of more convenient forms of character input using Latin keyboards. In existing Latin to Amharic transliteration systems, missing consideration of double consonants and double vowels has caused transliteration errors. Further, as there are multiple ways of character mapping conventions in existing systems, social media texts are susceptible to a wide variety of user adoptions during script production. The current systems have failed to address these gaps and adoptions. In this work, we present the RBLatAm (Rule-Based Latin to Amharic) transliteration system, a generic rule-based system that converts Amharic words which have been written using Latin script back into their native Amharic script. The system is based on mapping rules engineered from three existing transliteration systems (Microsoft, Google, SERA) and additional rules for double consonants, and conventions adopted on social media by speakers of Amharic. When tested on transliterated Amharic words of non-named entities, and named entities of persons, the system achieves an accuracy of 75.8% and 84.6%, respectively. The system also correctly transliterates words reported as errors in previous studies. This system drastically improves the basis for performing research on text mining for Amharic language texts by being able to process such texts even if they have originally been produced in Latin scripts.
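The rule-based mapping idea can be sketched as a longest-match-first substitution over a character-mapping table. The table below is a tiny hypothetical subset for illustration (enough to spell ሰላም, "selam"), not the actual RBLatAm rule set, which covers the full syllabary, double consonants, and social media conventions.

```python
# Illustrative character mappings only; the real system derives its rules from
# Microsoft, Google, and SERA conventions plus double-consonant handling.
RULES = {
    "se": "ሰ", "la": "ላ", "le": "ለ", "ma": "ማ", "m": "ም", "l": "ል", "s": "ስ",
}

def transliterate(latin, rules=RULES):
    out, i = [], 0
    max_len = max(map(len, rules))
    while i < len(latin):
        # Try the longest mapping first so "se" wins over "s" followed by "e".
        for n in range(min(max_len, len(latin) - i), 0, -1):
            chunk = latin[i:i + n]
            if chunk in rules:
                out.append(rules[chunk])
                i += n
                break
        else:
            out.append(latin[i])  # pass unmapped characters through unchanged
            i += 1
    return "".join(out)
```

Longest-match-first ordering is what lets a system distinguish a digraph mapping from two single-character mappings, the kind of ambiguity the paper's double-consonant rules address.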

  • 1 March 2023, 00:00

Implications of an ecospatial indigenous perspective on digital information organization and access

Abstract

The digitalisation of indigenous knowledge has been challenging considering epistemological differences and the lack of involvement of indigenous people. Drawing from our most recent community projects in Namibia, we share insights on indigenous ecospatial worldviews guiding the design of digital information organization and access of indigenous knowledge. With emerging technologies, such as augmented and virtual reality, offering new opportunities for richer and more meaningful spatial and embodied accounts of indigenous knowledge, we re-imagine digital libraries inclusive of indigenous people and their worldviews.

  • 7 March 2023, 00:00

DeepMetaGen: an unsupervised deep neural approach to generate template-based meta-reviews leveraging on aspect category and sentiment analysis from peer reviews

Abstract

Peer reviews form an essential part of scientific communication. Scholarly peer review is probably the most accepted way to evaluate research papers, involving multiple experts who review the concerned research independently. Usually, the area chair, the program chair, or the editor makes a decision weighing the reviewers’ judgments and communicates it to the authors by writing a meta-review that summarizes the review comments. With the exponential rise in research paper submissions and the corresponding growth of the reviewer pool, it becomes stressful for chairs and editors to manage conflicts, arrive at a consensus, and write an informative meta-review. In this work, we propose a novel deep neural network-based approach for generating meta-reviews in an unsupervised fashion. To generate consistent meta-reviews, we use a generic template, and the task is to slot-fill the template with the generated meta-review text. We consider the setting where only peer reviews, with no summaries or meta-reviews, are provided, and propose an end-to-end neural network model to perform unsupervised opinion-based abstractive summarization. We first use an aspect-based sentiment analysis model, which classifies the review sentences with the corresponding aspects (e.g., novelty, substance, soundness) and sentiment. We then extract opinion phrases from reviews for the corresponding aspect and sentiment labels. Next, we train a transformer model to reconstruct the original reviews from these extractions. Finally, we filter the selected opinions according to their aspect and/or sentiment at the time of summarization. The selected opinions of each aspect are used as input to the trained transformer model, which uses them to construct an opinion summary. The idea is to produce a concise meta-review that maximizes information coverage by focusing on the aspects and sentiment present in the reviews, as well as coherence, readability, and low redundancy.
We evaluate our model on human-written template-based meta-reviews to show that our framework outperforms competitive baselines. We believe that template-based meta-review generation focusing on aspect and sentiment will help editors and chairs in decision-making and assist meta-reviewers in writing better and more informative meta-reviews. We make our code available at https://github.com/sandeep82945/Unsupervised-meta-review-generation.
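The template slot-filling step can be pictured with a toy sketch; the template wording and aspect slots here are hypothetical stand-ins, not the paper's actual template or aspect taxonomy.

```python
# Hypothetical meta-review template; real aspect slots would come from the
# aspect-based sentiment model (novelty, substance, soundness, ...).
TEMPLATE = ("The reviewers found the novelty {novelty}. "
            "Regarding soundness, they noted that {soundness}.")

def fill_meta_review(opinion_summaries, template=TEMPLATE):
    """Slot-fill the fixed template with per-aspect opinion summaries."""
    return template.format(**opinion_summaries)
```

In the paper's pipeline, each slot value would be an opinion summary generated by the trained transformer for that aspect, rather than a hand-written phrase.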

  • 1 December 2023, 00:00

CH-Bench: a user-oriented benchmark for systems for efficient distant reading (design, performance, and insights)

Abstract

Data science deals with the discovery of information from large volumes of data. The data studied by scientists in the humanities include large textual corpora. An important objective is to study the ideas and expectations of a society regarding specific concepts, like “freedom” or “democracy,” both for today’s society and even more for societies of the past. Studying the meaning of words using large corpora requires efficient systems for text analysis, so-called distant reading systems. Making such systems efficient calls for a specification of the necessary functionality and clear expectations regarding typical workloads. Both are currently unclear, and there is no benchmark to evaluate distant reading systems. In this article, we propose such a benchmark, with the following innovations: As a first step, we collect and structure various information needs of the target users. We then formalize the notion of word context to facilitate the analysis of specific concepts. Using this notion, we formulate queries in line with the information needs of users. Finally, based on this, we propose concrete benchmark queries. To demonstrate the benefit of our benchmark, we conduct an evaluation with two objectives. First, we aim at insights regarding the content of different corpora, i.e., whether and how their size and nature (e.g., popular and broad literature or specific expert literature) affect results. Second, we benchmark different data management technologies. This has allowed us to identify performance bottlenecks.
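The formalized notion of word context can be illustrated with a minimal sketch: a fixed-size token window around each occurrence of a concept term. The window size and whitespace tokenization are illustrative assumptions; the benchmark's actual definition is richer.

```python
def word_contexts(tokens, concept, window=3):
    """Collect the +/- `window` surrounding tokens for each occurrence of `concept`."""
    contexts = []
    for i, tok in enumerate(tokens):
        if tok == concept:
            # Tokens before and after the hit, clipped at the corpus boundaries.
            contexts.append(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
    return contexts
```

Benchmark queries over such contexts (e.g., which words co-occur with "freedom" in one corpus versus another) are what stress a distant reading system's data management layer.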

  • 1 December 2023, 00:00

Beyond translation: engaging with foreign languages in a digital library

Abstract

Digital libraries can enable their patrons to go beyond modern language translations and to engage directly with sources in more languages than any individual could study, much less master. Translations should be viewed not so much as an end but as an entry point into the sources that they represent. In the case of highly studied sources, one or more experts can curate the network of annotations that support such reading. A digital library should, however, automatically create a serviceable first version of such a multi-lingual edition. Such a service is possible but benefits from (even if it does not require) a new generation of increasingly well-designed machine-readable translations, lexica, grammars, and encyclopedias. This paper reports on exploratory work that uses the Homeric epics to explore this wider topic and on the more general application of the results.

  • 1 September 2023, 00:00

The digitization of historical astrophysical literature with highly localized figures and figure captions

Abstract

Scientific articles published prior to the “age of digitization” in the late 1990s contain figures that are “trapped” within their scanned pages. While progress has been made on extracting figures and their captions, there is currently no robust method for this process. We present a YOLO-based method for use on scanned pages, after they have been processed with optical character recognition (OCR), which uses both grayscale and OCR features. We focus our efforts on translating the intersection-over-union (IOU) metric from the field of object detection to document layout analysis and quantify “high localization” as an IOU of 0.9. When applied to the astrophysics literature holdings of the NASA Astrophysics Data System, we find F1 scores of 90.9% (92.2%) for figures (figure captions) at the IOU cut-off of 0.9, a significant improvement over other state-of-the-art methods.
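The IOU threshold used to define "high localization" is straightforward to compute for axis-aligned bounding boxes; the box coordinates below are made-up examples.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) axis-aligned boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def is_high_localization(pred, truth, threshold=0.9):
    """A detection counts as 'highly localized' when IOU >= 0.9."""
    return iou(pred, truth) >= threshold
```

An IOU cut-off of 0.9 is far stricter than the 0.5 commonly used in object detection benchmarks, which is why the reported F1 scores are meaningful for layout analysis.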

  • 22 March 2023, 00:00

Scientific document processing: challenges for modern learning methods

Abstract

Neural network models enjoy success on language tasks related to Web documents, including news and Wikipedia articles. However, the characteristics of scientific publications pose specific challenges that have yet to be satisfactorily addressed: the discourse structure of scientific documents, which is crucial in scholarly document processing (SDP) tasks; the interconnected nature of scientific documents; and their multimodal nature. We survey modern neural network learning methods that tackle these challenges: those that can model discourse structure and interconnectivity and that exploit the multimodal nature of scientific documents. We also highlight efforts to collect large-scale datasets and tools developed to enable effective deep learning deployment for SDP. We conclude with a discussion of upcoming trends and recommend future directions for pursuing neural natural language processing approaches for SDP.

  • 1 December 2023, 00:00

Referencing behaviours across disciplines: publication types and common metadata for defining bibliographic references

Abstract

In this work, we investigate existing citation practices by analysing a large set of journal articles to measure which metadata are used across the various scholarly disciplines, independently of the particular citation style adopted, for defining bibliographic references. We selected the most cited journals in each of the 27 subject areas listed in the SCImago Journal Rank in the 2015–2017 triennium according to the SCImago total cites ranking. Each journal in the sample was represented by five articles (in PDF format) from its most recent issue as of October 2019, for a total of 729 articles. We extracted all 34,140 bibliographic references from the reference lists of these articles. Finally, we detected the types of cited works in each discipline and the structure of bibliographic references and in-text reference pointers for each type of cited work. By analysing the data gathered, we observed that the bibliographic references in our sample referenced 36 different types of cited works. Such a considerable variety of publications revealed the existence of particular citing behaviours in scientific articles that varied from subject area to subject area.

  • 27 March 2023, 00:00

Creating and validating a scholarly knowledge graph using natural language processing and microtask crowdsourcing

Abstract

Due to the growing number of scholarly publications, finding relevant articles becomes increasingly difficult. Scholarly knowledge graphs can be used to organize the scholarly knowledge presented within those publications and represent it in machine-readable formats. Natural language processing (NLP) provides scalable methods to automatically extract knowledge from articles and populate scholarly knowledge graphs. However, NLP extraction is generally not sufficiently accurate and thus fails to generate high-quality, fine-grained data. In this work, we present TinyGenius, a methodology for validating NLP-extracted scholarly knowledge statements using microtasks performed with crowdsourcing. TinyGenius is employed to populate a paper-centric knowledge graph using five distinct NLP methods. We extend our previous work on the TinyGenius methodology in various ways. Specifically, we discuss the NLP tasks in more detail and include an explanation of the data model. Moreover, we present a user evaluation in which participants validate the generated NLP statements. The results indicate that employing microtasks for statement validation is a promising approach despite the varying participant agreement for different microtasks.

  • 5 April 2023, 00:00

Approximate nearest neighbor for long document relationship labeling in digital libraries

Abstract

Relationship tagging of long text documents is a growing need in information science, spurred by the emergence of multi-million book bibliographic digital libraries. Large digital libraries offer an unprecedented glimpse into cultural history through their collections, but the combination of collection scale and document length complicates their study, given that prior work on large corpora has dealt primarily with much shorter texts. This study presents and evaluates an approach for fast retrieval on long texts, which leverages a chunk-and-aggregate approach with document sub-units to capture nuanced similarity relationships at scales which are not otherwise tractable. This approach is evaluated on book relationships from the HathiTrust Digital Library and shows strong results for relationships beyond exact duplicates. Finally, we argue for the value of approximate nearest neighbor search for narrowing the search space for downstream classification and retrieval contexts.
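A minimal sketch of the chunk-and-aggregate idea, assuming each long document has been split into sub-units with precomputed embedding vectors; taking the maximum over chunk pairs is just one plausible aggregation choice, not necessarily the paper's.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def doc_similarity(chunks_a, chunks_b):
    """Chunk-and-aggregate: the best-matching pair of sub-unit embeddings
    stands in for the similarity of two long documents."""
    return max(cosine(a, b) for a in chunks_a for b in chunks_b)
```

Scoring sub-units rather than whole books is what lets partial overlaps (a shared chapter, a reprinted essay) surface as strong relationships even when the documents differ elsewhere; approximate nearest neighbor indexes make the chunk search tractable at library scale.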

  • 1 December 2023, 00:00

A discovery system for narrative query graphs: entity-interaction-aware document retrieval

Abstract

Finding relevant publications in the scientific domain can be quite tedious: Accessing large-scale document collections often means formulating an initial keyword-based query followed by many refinements to retrieve a sufficiently complete, yet manageable set of documents to satisfy one’s information need. Since keyword-based search limits researchers to formulating their information needs as a set of unconnected keywords, retrieval systems try to guess each user’s intent. In contrast, distilling short narratives of the searchers’ information needs into simple, yet precise entity-interaction graph patterns provides all information needed for a precise search. As an additional benefit, such graph patterns may also feature variable nodes to flexibly allow for different substitutions of entities taking a specified role. An evaluation over the PubMed document collection quantifies the gains in precision of our novel entity-interaction-aware search. Moreover, we performed expert interviews and a questionnaire to verify the usefulness of our system in practice. This paper extends our previous work by giving a comprehensive overview of the discovery system that realizes narrative query graph retrieval.
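A narrative query graph pattern with a variable node can be sketched as a simple triple match; the entity names and the flat fact format below are illustrative assumptions, not the system's actual data model.

```python
def match(pattern, facts):
    """Match an entity-interaction pattern (subject, interaction, object)
    against a set of fact triples; '?x' marks a variable node."""
    s, p, o = pattern
    return sorted(f for f in facts
                  if (s == "?x" or f[0] == s)
                  and f[1] == p
                  and (o == "?x" or f[2] == o))

# The variable node '?x' flexibly binds to any entity taking the subject role:
facts = {("Metformin", "treats", "Diabetes"), ("Metformin", "causes", "Nausea")}
```

Unlike a bag of keywords, the pattern pins down who interacts with whom and how, which is where the precision gains over keyword retrieval come from.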

  • 24 April 2023, 00:00