News in eLiteracias

The Code4Lib Journal

Pythagoras: Discovering and Visualizing Musical Relationships Using Computer Analysis

By Brandon Bellanti
This paper presents an introduction to Pythagoras, an in-progress digital humanities project using Python to parse and analyze XML-encoded music scores. The goal of the project is to use recurring patterns of notes to explore existing relationships among musical works and composers. An intended outcome of this project is to give music performers, scholars, librarians, and anyone else interested in digital humanities new insights into musical relationships as well as new methods of data analysis in the arts.
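The article does not reproduce its code, and the abstract does not name a parsing library; the sketch below is only an illustration of the kind of analysis described, assuming the music21 library and a local MusicXML file.

```python
# A minimal sketch of the analysis described: parse an XML-encoded score and
# count recurring pitch patterns. Assumes the music21 library and a local
# MusicXML file ("score.musicxml"); neither is specified by the article.
from collections import Counter
from music21 import converter

def pitch_ngrams(path, n=4):
    """Return a Counter of n-grams of pitch names found in a score."""
    score = converter.parse(path)                # parse MusicXML into a Stream
    pitches = [p.nameWithOctave                  # e.g. "C4", "F#5"
               for note in score.recurse().notes
               for p in note.pitches]            # chords contribute several pitches
    grams = zip(*(pitches[i:] for i in range(n)))
    return Counter(grams)

if __name__ == "__main__":
    for pattern, count in pitch_ngrams("score.musicxml").most_common(5):
        print(" ".join(pattern), count)
```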

Conspectus: A Syllabi Analysis Platform for Leganto Data Sources

By David Massey, Thomas Sødring
In recent years, higher education institutions have implemented electronic solutions for the management of syllabi, resulting in new and exciting opportunities within the area of large-scale syllabi analysis. This article details an information pipeline that can be used to harvest, enrich and use such information.
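The abstract does not include the pipeline's code; the following is a rough sketch of one harvest-and-enrich step under stated assumptions: reading list items already exported to a hypothetical leganto_items.json file, with items carrying a DOI enriched from the public Crossref REST API.

```python
# A minimal harvest-and-enrich sketch, not the authors' pipeline. Assumes
# reading list items were exported to "leganto_items.json" (hypothetical file)
# and that items with a DOI are enriched via the public Crossref REST API.
import json
import requests

def enrich(item):
    """Attach publisher and subject metadata from Crossref when a DOI exists."""
    doi = item.get("doi")
    if not doi:
        return item
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if resp.ok:
        work = resp.json()["message"]
        item["publisher"] = work.get("publisher")
        item["subjects"] = work.get("subject", [])
    return item

with open("leganto_items.json") as fh:
    items = json.load(fh)

with open("leganto_items_enriched.json", "w") as fh:
    json.dump([enrich(i) for i in items], fh, indent=2)
```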

Digitization Decisions: Comparing OCR Software for Librarian and Archivist Use

By Leanne Olson and Veronica Berry
This paper is intended to help librarians and archivists who are involved in digitization work choose optical character recognition (OCR) software. The paper provides an introduction to OCR software for digitization projects, and shares the method we developed for easily evaluating the effectiveness of OCR software on resources we are digitizing. We tested three major OCR programs (Adobe Acrobat, ABBYY FineReader, Tesseract) for accuracy on three different digitized texts from our archives and special collections at the University of Western Ontario. Our test was divided into two parts: a word accuracy test (to determine how searchable the final documents were), and a test with a screen reader (to determine how accessible the final documents were). We share our findings from the tests and make recommendations for OCR work on digitized documents from archives and special collections.
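As an illustration of the word accuracy test described (not necessarily the authors' exact scoring method), a minimal comparison of OCR output against a hand-corrected transcript might look like this:

```python
# A simple word-accuracy check of OCR output against a ground-truth transcript.
# This is one straightforward way to run the kind of test described; the
# authors' exact scoring method may differ. File names are placeholders.
import re
from collections import Counter

def words(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def word_accuracy(ocr_text, truth_text):
    """Share of ground-truth words that also appear in the OCR output."""
    ocr_counts, truth_counts = Counter(words(ocr_text)), Counter(words(truth_text))
    matched = sum(min(count, ocr_counts[word]) for word, count in truth_counts.items())
    return matched / max(sum(truth_counts.values()), 1)

with open("tesseract_output.txt") as f_ocr, open("ground_truth.txt") as f_truth:
    print(f"Word accuracy: {word_accuracy(f_ocr.read(), f_truth.read()):.2%}")
```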

An XML-Based Migration from Digital Commons to Open Journal Systems

By Cara M. Key
The Oregon Library Association has produced its peer-reviewed journal, the OLA Quarterly (OLAQ), since 1995, and OLAQ was published in Digital Commons beginning in 2014. When the host institution undertook to move away from Bepress, their new repository solution was no longer a good match for OLAQ. Oregon State University and University of Oregon agreed to move the journal into their joint instance of Open Journal Systems (OJS), and a small team from OSU Libraries carried out the migration project. The OSU project team declined to use PKP’s existing migration plugin for a number of reasons, instead pursuing a metadata-centered migration pipeline from Digital Commons to OJS. We used custom XSLT to convert tabular data exported from Bepress into PKP’s Native XML schema, which we imported using the OJS Native XML Plugin. This approach provided a high degree of control over the journal’s metadata and a robust ability to test and make adjustments along the way. The article discusses the development of the transformation stylesheet, the metadata mapping and cleanup work involved, as well as advantages and limitations of using this migration strategy.
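The mapping itself lives in the authors' XSLT; as a sketch of how such a stylesheet can be applied in a scripted pipeline, the following uses Python's lxml with placeholder file names and assumes the Bepress export has been saved as XML.

```python
# Applying a custom XSLT stylesheet to an exported metadata file using lxml.
# The stylesheet and file names are placeholders; the article's actual field
# mapping lives inside the stylesheet itself.
from lxml import etree

transform = etree.XSLT(etree.parse("bepress_to_pkp_native.xsl"))
source = etree.parse("olaq_export.xml")
result = transform(source)

with open("olaq_native.xml", "wb") as out:
    # Write the PKP Native XML produced by the transform, ready for the
    # OJS Native XML Plugin import step described in the article.
    out.write(etree.tostring(result, pretty_print=True,
                             xml_declaration=True, encoding="UTF-8"))
```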

Building and Maintaining Metadata Aggregation Workflows Using Apache Airflow

By Leanne Finnigan and Emily Toner
PA Digital is a Pennsylvania network that serves as the state’s service hub for the Digital Public Library of America (DPLA). The group developed a homegrown aggregation system in 2014, used to harvest digital collection records from contributing institutions, validate and transform their metadata, and deliver aggregated records to the DPLA. Since our initial launch, PA Digital has expanded significantly, harvesting from an increasing number of contributors with a variety of repository systems. With each new system, our highly customized aggregator software became more complex and difficult to maintain. By 2018, PA Digital staff had determined that a new solution was needed. From 2019 to 2021, a cross-functional team implemented a more flexible and scalable approach to metadata aggregation for PA Digital, using Apache Airflow for workflow management and Solr/Blacklight for internal metadata review. In this article, we will outline how we use this group of applications and the new workflows adopted, which afford our metadata specialists more autonomy to contribute directly to the ongoing development of the aggregator. We will discuss how this work fits into our broader sustainability planning as a network and how the team leveraged shared expertise to build a more stable approach to maintenance.
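A minimal Airflow DAG sketching the harvest, validate/transform, and deliver pattern described above might look like the following; task names and bodies are illustrative placeholders, not PA Digital's production DAGs.

```python
# A minimal Apache Airflow DAG sketching the harvest -> validate/transform ->
# deliver pattern described in the article. Task bodies and names are
# illustrative placeholders, not PA Digital's production workflows.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def harvest(**context):
    print("Harvest digital collection records from a contributor")

def transform(**context):
    print("Validate and map contributor metadata to the DPLA profile")

def deliver(**context):
    print("Push aggregated records to the DPLA and to the review Solr index")

with DAG(
    dag_id="aggregation_example",
    start_date=datetime(2021, 1, 1),
    schedule="@weekly",          # requires Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    PythonOperator(task_id="harvest", python_callable=harvest) \
        >> PythonOperator(task_id="transform", python_callable=transform) \
        >> PythonOperator(task_id="deliver", python_callable=deliver)
```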

Introducing SAGE: An Open-Source Solution for Customizable Discovery Across Collections

By David B. Lowe, James Creel, Elizabeth German, Douglas Hahn, and Jeremy Huff
Digital libraries at research universities make use of a wide range of unique tools to enable the sharing of eclectic sets of texts, images, audio, video, and other digital objects. Presenting these assorted local treasures to the world can be a challenge, since text is often siloed with text, images with images, and so on, such that there may be separate user experiences per type in a variety of unique discovery interfaces. One common tool that has been developed in recent years to potentially unite them all is the Apache Solr index. Texas A&M University (TAMU) Libraries has harnessed Solr for internal indexing for repositories like DSpace, Fedora, and Avalon. Impressed by frameworks like Blacklight at peer institutions, TAMU Libraries wrote an analogous set of tools in Java, and thus was born SAGE, the Solr AGgregation Engine, with two primary functions: 1) aggregating Solr indices, or “cores,” from various local sources, and 2) presenting a search facility to the user in a discovery interface.
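SAGE itself is written in Java; purely as a language-neutral sketch of the aggregation idea, the following queries several Solr cores through their standard /select API and merges the responses (core URLs are placeholders).

```python
# A language-neutral sketch of the aggregation idea behind SAGE (which is
# implemented in Java): query several Solr cores through the standard /select
# API and merge the responses into one list. Core URLs are placeholders.
import requests

CORES = [
    "http://localhost:8983/solr/dspace",
    "http://localhost:8983/solr/fedora",
    "http://localhost:8983/solr/avalon",
]

def search_all(query, rows=10):
    merged = []
    for core in CORES:
        resp = requests.get(f"{core}/select",
                            params={"q": query, "rows": rows,
                                    "fl": "*,score", "wt": "json"},
                            timeout=10)
        for doc in resp.json()["response"]["docs"]:
            doc["_source_core"] = core          # remember which core answered
            merged.append(doc)
    # Scores from separate cores are only roughly comparable; a real
    # aggregator needs a more careful ranking strategy.
    return sorted(merged, key=lambda d: d.get("score", 0), reverse=True)

for hit in search_all("civil war photographs")[:5]:
    print(hit.get("title"), "|", hit["_source_core"])
```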

Using Low Code to Automate Public Service Workflows: Three Cases

By Dianna Morganti and Jess Williams
Public service librarians without coding experience or technical education may not always be aware of or consider automation to be an option to streamline their regular work tasks, but the new prevalence of enterprise-level low code solutions allows novices to take advantage of technology to make their work more efficient and effective. Low code applications apply a graphical user interface on top of a coding platform to make it easy for novices to leverage automation at work. This paper presents three cases of using low code solutions to automate public service problems using the prevalent Microsoft Power Automate application, available in many library workplaces that use the Microsoft Office ecosystem. From simplifying the communication and scheduling process for instruction classes to connecting our student workers’ hourly floor counts to our administrators’ dashboard of building occupancy, we’ve leveraged simple low code automation in a scalable and replicable manner. Pseudo-code examples are provided.

Closing the Gap between FAIR Data Repositories and Hierarchical Data Formats

By Connor B. Bailey, Fedor F. Balakirev, and Lyudmila L. Balakireva
Many in the scientific community, particularly in publicly funded research, are pushing to adhere to more accessible data standards to maximize the findability, accessibility, interoperability, and reusability (FAIR) of scientific data, especially with the growing prevalence of machine learning augmented research. Online FAIR data repositories, such as the Open Science Framework (OSF), help facilitate the adoption of these standards by providing frameworks for storage, access, search, APIs, and other features that create organized hubs of scientific data. However, the wider acceptance of such repositories is hindered by the lack of support of hierarchical data formats, such as Technical Data Management Streaming (TDMS) and Hierarchical Data Format 5 (HDF5), that many researchers rely on to organize their datasets. Various tools and strategies should be used to allow hierarchical data formats, FAIR data repositories, and scientific organizations to work more seamlessly together. A pilot project at Los Alamos National Laboratory (LANL) addresses the disconnect between them by integrating the OSF FAIR data repository with hierarchical data renderers, extending support for additional file types in their framework. The multifaceted interactive renderer displays a tree of metadata alongside a table and plot of the data channels in the file. This allows users to quickly and efficiently load large and complex data files directly in the OSF webapp. Users who are browsing files can quickly and intuitively see the files in the way they or their colleagues structured the hierarchical form and immediately grasp their contents. This solution helps bridge the gap between hierarchical data storage techniques and FAIR data repositories, making both of them more viable options for scientific institutions like LANL which have been put off by the lack of integration between them.
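As an illustration of the hierarchy the renderer exposes (not the OSF renderer's own code), the group/dataset/attribute tree of an HDF5 file can be walked with h5py; the file name below is a placeholder.

```python
# Walking the metadata tree of an HDF5 file with h5py: the same kind of
# group/dataset/attribute hierarchy the renderer described above displays.
# This is an illustration, not the OSF renderer's code; the file name is a
# placeholder.
import h5py

def describe(name, obj):
    indent = "  " * name.count("/")
    if isinstance(obj, h5py.Dataset):
        print(f"{indent}{name}  shape={obj.shape} dtype={obj.dtype}")
    else:                                    # an h5py.Group
        print(f"{indent}{name}/")
    for key, value in obj.attrs.items():     # attached metadata
        print(f"{indent}  @{key} = {value!r}")

with h5py.File("experiment.h5", "r") as f:
    f.visititems(describe)
```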

Editorial: The Cost of Knowing Our Users

By Mark Swenson
Some musings on the difficulty of wanting to know our users' secrets and simultaneously wanting to not know them.

Leveraging a Custom Python Script to Scrape Subject Headings for Journals

By Shelly R. McDavid, Eric McDavid, and Neil E. Das
In our current library fiscal climate, with yearly inflationary cost increases of 2-6+% for many journals and journal package subscriptions, it is imperative that libraries strive to make our budgets go further to expand our suite of resources. As a result, most academic libraries annually undertake some form of electronic journal review, employing factors such as cost per use to inform budgetary decisions. In this paper we detail some tech-savvy processes we created to leverage a Python script to automate journal subject heading generation within OCLC’s WorldCat catalog, the MOBIUS (a Missouri library consortium) Catalog, and the VuFind Library Catalog, a now-retired catalog for CARLI (the Consortium of Academic and Research Libraries in Illinois). We also describe the rationale for the inception of this project, the methodology we utilized, the current limitations, and details of our future work in automating our annual analysis of journal subject headings using an OCLC API.
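The authors' script works against live catalogs and an OCLC API; as a hedged illustration of just the extraction step, the sketch below reads MARC records assumed to have been exported to a local file and collects 650 subject headings with pymarc.

```python
# An illustration of the subject-heading extraction step only: read exported
# MARC records with pymarc and collect 650 (topical subject) headings per
# journal title. The authors' script queries live catalogs; this sketch
# assumes records were already exported to "journals.mrc" (a placeholder).
from collections import defaultdict
from pymarc import MARCReader

subjects_by_title = defaultdict(set)

with open("journals.mrc", "rb") as fh:
    for record in MARCReader(fh):
        if record is None:                       # skip unreadable records
            continue
        f245 = record.get_fields("245")
        title_subfields = f245[0].get_subfields("a") if f245 else []
        title = title_subfields[0] if title_subfields else "(no title)"
        for field in record.get_fields("650"):
            heading = " -- ".join(field.get_subfields("a", "x", "v", "y", "z"))
            if heading:
                subjects_by_title[title].add(heading)

for title, headings in sorted(subjects_by_title.items()):
    print(title)
    for heading in sorted(headings):
        print("   ", heading)
```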

Core Concepts and Techniques for Library Metadata Analysis

By Stacie Traill and Martin Patrick
Metadata analysis is a growing need in libraries of all types and sizes, as demonstrated in many recent job postings. Data migration, transformation, enhancement, and remediation all require strong metadata analysis skills. But there is no well-defined body of knowledge or competencies list for library metadata analysis, leaving library staff with analysis-related responsibilities largely on their own to learn how to do the work effectively. In this paper, two experienced metadata analysts will share what they see as core knowledge areas and problem solving techniques for successful library metadata analysis. The paper will also discuss suggested tools, though the emphasis is intentionally not to prescribe specific tools, software, or programming languages, but rather to help readers recognize tools that will meet their analysis needs. The goal of the paper is to help library staff and their managers develop a shared understanding of the skill sets required to meet their library’s metadata analysis needs. It will also be useful to individuals interested in pursuing a career in library metadata analysis and wondering how to enhance their existing knowledge and skills for success in analysis work.

Automated 3D Printing in Libraries

By Brandon Patterson, Ben Engel, and Willis Holle
This article highlights the creation of an automated 3D printing system at a health sciences library at a large research university. As COVID-19 limited in-person interaction with 3D printers, a group of library staff came together to code a form that took users’ 3D print files and connected them to machines automatically. A ticketing system and payment form were also automated via this system. The only remaining in-person interaction is dedicated staff members unloading the finished prints. This article will describe the journey in getting to an automated system and share code and strategies so others can try it for themselves.
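The abstract does not name the printer-control software the library uses; assuming a printer managed by OctoPrint, whose REST API accepts an upload and can start the job immediately, the "hand the file to a machine" step could be sketched as follows (host and API key are placeholders).

```python
# A hedged sketch of handing an uploaded file to a printer. The abstract does
# not say which printer-control software is used; this example assumes a
# printer running OctoPrint, whose REST API accepts an upload and can start
# printing it immediately. Host and API key are placeholders.
import requests

OCTOPRINT_URL = "http://printer-01.local"
API_KEY = "REPLACE_WITH_API_KEY"

def submit_print(path):
    with open(path, "rb") as gcode:
        resp = requests.post(
            f"{OCTOPRINT_URL}/api/files/local",
            headers={"X-Api-Key": API_KEY},
            files={"file": gcode},
            data={"select": "true", "print": "true"},  # queue and start the job
            timeout=30,
        )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(submit_print("patron_request_1234.gcode"))
```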

Strategies for Preserving Digital Scholarship / Humanities Projects

By Kirsta Stapelfeldt, Sukhvir Khera, Natkeeran Ledchumykanthan, Lara Gomez, Erin Liu, and Sonia Dhaliwal
The Digital Scholarship Unit (DSU) at the University of Toronto Scarborough Library frequently partners with faculty on the creation of digital scholarship (DS) projects. However, managing completed projects can be challenging when they are no longer under active development by the original project team and resources allocated to their ongoing maintenance are scarce. Maintaining inactive projects on the live web bloats staff workloads or is simply not possible given limited staff capacity. As technical obsolescence meets a lack of staff capacity, the gradual disappearance of digital scholarship projects forms a gap in the scholarly record. This article discusses the Library DSU’s experiments with using web archiving technologies to capture and describe digital scholarship projects, with the goal of accessioning the resulting web archives into the Library’s digital collections. In addition to comparing some common technologies used for crawling and replay of archives, this article describes aspects of the technical infrastructure the DSU is building with the goal of making web archives discoverable and playable through the library’s digital collections interface.
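As a generic illustration of the capture-and-describe step (not the DSU's infrastructure), a completed crawl stored as a WARC file can be inventoried with warcio; the file name is a placeholder.

```python
# Inventorying what a crawl captured: iterate a WARC file with warcio and list
# the archived URLs and their media types. A generic illustration of the
# capture/describe step, not the DSU's infrastructure; file name is a placeholder.
from collections import Counter
from warcio.archiveiterator import ArchiveIterator

mime_counts = Counter()

with open("ds_project_crawl.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        mime = record.http_headers.get_header("Content-Type", "unknown")
        mime_counts[mime.split(";")[0]] += 1
        print(url)

print(mime_counts.most_common(10))
```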

Supporting open access, integrating distributed research platforms, and building a research information management platform

By Daniel M. Coughlin, Cynthia Hudson Vitale

Academic libraries are often called upon by their university communities to collect, manage, and curate information about the research activity produced at their campuses. Proper research information management (RIM) can be leveraged for multiple institutional contexts, including networking, reporting activities, building faculty profiles, and supporting the reputation management of the institution.

In the last ten to fifteen years, the adoption and implementation of RIM infrastructure has become widespread throughout the academic world. Approaches to developing and implementing this infrastructure have varied, from commercial and open-source options to locally developed instances. Each piece of infrastructure has its own functionality, features, and metadata sources. There is no single application or data source that meets all the needs of these varying pieces of research information; rather, many of these systems together create an ecosystem that provides for the diverse set of needs and contexts.

This paper examines the systems at Pennsylvania State University that contribute to our RIM ecosystem: how and why we developed another piece of supporting infrastructure for our Open Access policy, and the successes and challenges of this work.

Citation Needed: Adding Citations to CONTENTdm Records

By Jenn Randles and Andrew Bullen
The Tennessee State Library and Archives and the Illinois State Library identified a need to add citation information to individual image records in OCLC’s CONTENTdm (https://www.oclc.org/en/contentdm.html). Experience with digital archives at both institutions showed that citation information was one of the most requested features. Unfortunately, CONTENTdm does not natively display citation information about image records; to add this functionality, custom JavaScript had to be written that would interact with the underlying React environment and parse out or retrieve the appropriate metadata to dynamically build record citations. Detailed code and a description of methods for building two different models of citation generators are presented.
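The article's generators are JavaScript running inside CONTENTdm's React front end; the following is only a language-neutral sketch of the core idea, assembling a citation string from whatever metadata fields a record exposes, with illustrative field names rather than CONTENTdm's actual schema.

```python
# A language-neutral sketch of the core idea: assemble a human-readable
# citation from a record's metadata fields. The article's actual generators
# are JavaScript inside CONTENTdm's React front end; the field names and URL
# pattern below are illustrative placeholders, as is the example record.
def build_citation(record, collection, base_url):
    parts = [
        record.get("creator"),
        record.get("title"),
        record.get("date"),
        collection,
        f"{base_url}/digital/collection/{record.get('alias')}/id/{record.get('id')}",
    ]
    return ". ".join(str(p) for p in parts if p)

example = {
    "creator": "Unknown photographer",
    "title": "Main Street view",
    "date": "1924",
    "alias": "photos",
    "id": "1042",
}
print(build_citation(example, "Example State Library and Archives",
                     "https://example.contentdm.oclc.org"))
```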

The DSA Toolkit Shines Light Into Dark and Stormy Archives

By Shawn M. Jones, Himarsha R. Jayanetti, Alex Osborne, Paul Koerbin, Martin Klein, Michele C. Weigle, Michael L. Nelson
Themed web archive collections exist to make sense of archived web pages (mementos). Some collections contain hundreds of thousands of mementos. There are many collections about the same topic. Few collections on platforms like Archive-It include standardized metadata. Reviewing the documents in a single collection thus becomes an expensive proposition. Search engines help find individual documents but do not provide an overall understanding of each collection as a whole. Visitors need to be able to understand what individual collections contain so they can make decisions about individual collections and compare them to each other. The Dark and Stormy Archives (DSA) Project applies social media storytelling to a subset of a collection to facilitate collection understanding at a glance. As part of this work, we developed the DSA Toolkit, which helps archivists and visitors leverage this capability. As part of our recent International Internet Preservation Consortium (IIPC) grant, Los Alamos National Laboratory (LANL) and Old Dominion University (ODU) piloted the DSA toolkit with the National Library of Australia (NLA). Collectively we have made numerous improvements, from better handling of NLA mementos to native Linux installers to more approachable Web User Interfaces. Our goal is to make the DSA approachable for everyone so that end-users and archivists alike can apply social media storytelling to web archives.

Automating reference consultation requests with JavaScript and a Google Form

By Stephen Zweibel
At the CUNY Graduate Center Library, reference consultation requests were previously sent to a central email address, then manually directed by our head of reference to the appropriate subject expert. This process was cumbersome and because the inbox was not checked every day, responses were delayed and messages were occasionally missed. In order to streamline this process, I created a form and wrote a script that uses the answers in the form to automatically forward any consultation requests to the correct subject specialist. This was done using JavaScript, Google Sheets, and the Google Apps Script backend. When a patron requesting a consultation fills out the form, they include their field of research. This field is associated in my script with a particular subject specialist librarian, who then receives an email with the pertinent information. Rather than requiring either that patrons themselves search for the right subject specialist, or that library faculty spend time distributing messages to the right liaison, this enables a smoother, more direct interaction. In this article, I will describe the steps I took to write this script, using only freely available online software.
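The author's implementation uses Google Apps Script; as a hedged sketch of the same routing logic in Python, the following maps a submitted research field to a subject specialist and forwards the request by email (all names and addresses are placeholders).

```python
# The author's implementation is Google Apps Script reading form responses;
# this is a Python sketch of the same routing logic: map the submitted
# research field to a subject specialist and forward the request by email.
# All names, addresses, and the SMTP host are placeholders.
import smtplib
from email.message import EmailMessage

SPECIALISTS = {
    "History": "history.librarian@example.edu",
    "Sociology": "socsci.librarian@example.edu",
    "English": "humanities.librarian@example.edu",
}

def route_request(form_response):
    liaison = SPECIALISTS.get(form_response["field_of_research"],
                              "reference@example.edu")   # fallback inbox
    msg = EmailMessage()
    msg["Subject"] = "New consultation request"
    msg["From"] = "consultations@example.edu"
    msg["To"] = liaison
    msg.set_content(
        f"Patron: {form_response['name']} <{form_response['email']}>\n"
        f"Field: {form_response['field_of_research']}\n"
        f"Question: {form_response['question']}"
    )
    with smtplib.SMTP("localhost") as server:
        server.send_message(msg)

route_request({"name": "A. Patron", "email": "patron@example.edu",
               "field_of_research": "History",
               "question": "Sources on 19th-century labor movements?"})
```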

Editorial — New name change policy

By Ron Peterson
The Code4Lib Journal Editorial Committee is implementing a new name change policy aimed at facilitating the process and ensuring timely and comprehensive name changes for anyone who needs to change their name within the Journal.

Fractal in detail: What information is in a file format identification report?

By Ross Spencer
A file format identification report, such as those generated by the digital preservation tools DROID, Siegfried, or FIDO, contains an incredible wealth of information. Used to scan discrete sets of files comprising part of, or the entirety of, a digital collection, these datasets can serve as entry points for further activities, including appraisal, identification of future work efforts, and the facilitation of transfer of digital objects into preservation storage. The information contained in them is fractal in detail, and there are numerous outputs that can be generated from that detail. This paper describes the purpose of a file format identification report and the extensive information that can be extracted from one. It summarizes a number of ways of transforming reports into the inputs for other systems and describes a handful of the tools already doing so. The paper concludes that the format identification report is a pivotal artefact in the digital transfer process, and asks readers to consider how they might leverage such reports and the benefits doing so might provide.
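As one small example of the further outputs such a report supports, a DROID CSV export can be summarized by format; the column names below ("PUID", "FORMAT_NAME", "SIZE") follow typical DROID exports but should be checked against your own report, and the file name is a placeholder.

```python
# Summarize a DROID CSV export by format. Column names ("PUID", "FORMAT_NAME",
# "SIZE") follow typical DROID exports but should be verified against your own
# report; the file name is a placeholder.
import csv
from collections import defaultdict

counts = defaultdict(int)
total_bytes = defaultdict(int)

with open("droid_report.csv", newline="", encoding="utf-8") as fh:
    for row in csv.DictReader(fh):
        key = (row.get("PUID") or "unidentified", row.get("FORMAT_NAME", ""))
        counts[key] += 1
        total_bytes[key] += int(row.get("SIZE") or 0)

for (puid, name), n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{puid:<15} {name:<40} {n:>6} files {total_bytes[(puid, name)]:>12} bytes")
```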

Lantern: A Pandoc Template for OER Publishing

By Chris Diaz
Lantern is a template and workflow for using Pandoc and GitHub to create and host multi-format open educational resources (OER) online. It applies minimal computing methods to OER publishing practices. The purpose is to minimize the technical footprint for digital publishing while maximizing control over the form, content, and distribution of OER texts. Lantern uses Markdown and YAML to capture an OER’s source content and metadata and Pandoc to transform it into HTML, PDF, EPUB, and DOCX formats. Pandoc’s options and arguments are pre-configured in a Bash script to simplify the process for users. Lantern is available as a template repository on GitHub. The template repository is set up to run Pandoc with GitHub Actions and serve output files on GitHub Pages for convenience; however, GitHub is not a required dependency. Lantern can be used on any modern computer to produce OER files that can be uploaded to any modern web server.
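Lantern wraps its Pandoc options in a Bash script; the following is a simplified Python equivalent of that loop, invoking the pandoc CLI once per output format (paths are placeholders, and the real template passes many more options such as templates, CSS, and filters).

```python
# A simplified stand-in for the Bash script Lantern uses: invoke the pandoc
# CLI once per output format. Paths are placeholders; the real template passes
# many more options (templates, CSS, filters). PDF output additionally
# requires a LaTeX engine to be installed.
import subprocess

SOURCES = ["text/01-introduction.md", "text/02-chapter-one.md"]
METADATA = "metadata.yml"

for fmt in ("html", "epub", "docx", "pdf"):
    subprocess.run(
        ["pandoc", *SOURCES,
         "--metadata-file", METADATA,   # YAML metadata for the OER
         "--standalone",
         "--output", f"public/book.{fmt}"],   # format inferred from extension
        check=True,
    )
```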

Works, Expressions, Manifestations, Items: An Ontology

By Karen Coyle
The concepts first introduced in the FRBR document and known as "WEMI" have been employed in situations quite different from the library bibliographic catalog. This is evidence that a definition of similar classes that are more general than those developed for library usage would benefit metadata developers broadly. This article proposes a minimally constrained set of classes and relationships that could form the basis for a useful model of created works.
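As a sketch of what such a minimally constrained vocabulary could look like, the four classes can be expressed with rdflib; the namespace URI and property names here are illustrative placeholders, not the published terms proposed in the article.

```python
# A sketch of a minimally constrained WEMI vocabulary expressed with rdflib.
# The namespace URI and property names are illustrative placeholders, not the
# terms proposed in the article.
from rdflib import Graph, Namespace, Literal, RDF, RDFS, OWL

WEMI = Namespace("https://example.org/wemi/")
g = Graph()
g.bind("wemi", WEMI)

for name in ("Work", "Expression", "Manifestation", "Item"):
    g.add((WEMI[name], RDF.type, OWL.Class))
    g.add((WEMI[name], RDFS.label, Literal(name, lang="en")))

# Relationships left without domain/range restrictions, in the spirit of a
# minimally constrained model that other communities can reuse.
for prop in ("realizes", "embodies", "exemplifies"):
    g.add((WEMI[prop], RDF.type, OWL.ObjectProperty))
    g.add((WEMI[prop], RDFS.label, Literal(prop, lang="en")))

print(g.serialize(format="turtle"))
```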

Editorial: On FOSS in Libraries

By Andrew Darby
Some thoughts on the state of free and open source software in libraries.

Annif Analyzer Shootout: Comparing text lemmatization methods for automated subject indexing

By Osma Suominen, Ilkka Koskenniemi
Automated text classification is an important function for many AI systems relevant to libraries, including automated subject indexing and classification. When implemented using the traditional natural language processing (NLP) paradigm, one key part of the process is the normalization of words using stemming or lemmatization, which reduces the amount of linguistic variation and often improves the quality of classification. In this paper, we compare the output of seven different text lemmatization algorithms as well as two baseline methods. We measure how the choice of method affects the quality of text classification using example corpora in three languages. The experiments were performed using the open source Annif toolkit for automated subject indexing and classification, but should also generalize to other NLP toolkits and similar text classification tasks. The results show that lemmatization methods in most cases outperform baseline methods in text classification, particularly for Finnish and Swedish text, but not for English, where baseline methods are most effective. The differences between lemmatization methods are quite small. This systematic comparison will help optimize text classification pipelines and inform the further development of the Annif toolkit to incorporate a wider choice of normalization methods.
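The paper compares the analyzers built into Annif; the stand-alone sketch below only shows the kind of normalization being compared, a Snowball stemmer versus a lowercasing baseline, using NLTK.

```python
# The paper compares the analyzers built into Annif; this stand-alone sketch
# just shows the kind of normalization being compared: a Snowball stemmer
# versus a simple lowercasing baseline, using NLTK's stemmers.
import re
from nltk.stem.snowball import SnowballStemmer

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def baseline(text):
    """Baseline: lowercased surface forms, no further normalization."""
    return tokenize(text)

def snowball(text, language="finnish"):
    """Stemmed forms, which collapse much of the inflectional variation."""
    stemmer = SnowballStemmer(language)
    return [stemmer.stem(token) for token in tokenize(text)]

sample = "Kirjaston kokoelmissa on kirjoja ja lehtiä"
print(baseline(sample))
print(snowball(sample))
```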

Teaching AI when to care about gender

By James Powell, Kari Sentz, Elizabeth Moyer, Martin Klein
Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) concerned with solving language tasks by modeling large amounts of textual data. Some NLP techniques use word embeddings which are semantic models where machine learning (ML) is used to learn to cluster semantically related words by learning about word co-occurrences in the original training text. Unfortunately, these models tend to reflect or even exaggerate biases that are present in the training corpus. Here we describe the Word Embedding Navigator (WEN), which is a tool for exploring word embedding models. We examine a specific potential use case for this tool: interactive discovery and neutralization of gender bias in word embedding models, and compare this human-in-the-loop approach to reducing bias in word embeddings with a debiasing post-processing technique.
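A common measurement underlying this kind of bias exploration is projecting word vectors onto the direction from "she" to "he"; the sketch below uses tiny hand-made vectors as hypothetical stand-ins so it runs without a trained model, whereas WEN itself operates on full word-embedding models.

```python
# A sketch of the measurement underlying this kind of bias exploration:
# project word vectors onto the direction from "she" to "he" and compare.
# The tiny hand-made vectors below are hypothetical stand-ins so the example
# runs without a trained model; WEN works on full word-embedding models.
import numpy as np

embeddings = {                      # hypothetical 4-dimensional vectors
    "he":        np.array([ 1.0,  0.1, 0.3, 0.0]),
    "she":       np.array([-1.0,  0.1, 0.3, 0.0]),
    "nurse":     np.array([-0.6,  0.5, 0.2, 0.1]),
    "engineer":  np.array([ 0.7,  0.4, 0.3, 0.1]),
    "librarian": np.array([-0.1,  0.6, 0.4, 0.2]),
}

gender_direction = embeddings["he"] - embeddings["she"]
gender_direction /= np.linalg.norm(gender_direction)

for word in ("nurse", "engineer", "librarian"):
    vec = embeddings[word] / np.linalg.norm(embeddings[word])
    projection = float(vec @ gender_direction)   # >0 leans "he", <0 leans "she"
    print(f"{word:>10}: {projection:+.2f}")
```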