News from eLiteracias

✇ The Code4Lib Journal

Core Concepts and Techniques for Library Metadata Analysis

By Stacie Traill and Martin Patrick — September 22, 2021, 15:52
Metadata analysis is a growing need in libraries of all types and sizes, as demonstrated in many recent job postings. Data migration, transformation, enhancement, and remediation all require strong metadata analysis skills. But there is no well-defined body of knowledge or competencies list for library metadata analysis, leaving library staff with analysis-related responsibilities largely on their own to learn how to do the work effectively. In this paper, two experienced metadata analysts share what they see as core knowledge areas and problem-solving techniques for successful library metadata analysis. The paper also discusses suggested tools, though the emphasis is intentionally not to prescribe specific tools, software, or programming languages, but rather to help readers recognize tools that will meet their analysis needs. The goal of the paper is to help library staff and their managers develop a shared understanding of the skill sets required to meet their library’s metadata analysis needs. It will also be useful to individuals interested in pursuing a career in library metadata analysis and wondering how to enhance their existing knowledge and skills for success in analysis work.

Leveraging a Custom Python Script to Scrape Subject Headings for Journals

By Shelly R. McDavid, Eric McDavid, and Neil E. Das — September 22, 2021, 15:52
In our current library fiscal climate, with yearly inflationary cost increases of 2-6+% for many journals and journal package subscriptions, it is imperative that libraries strive to make our budgets go further to expand our suite of resources. As a result, most academic libraries annually undertake some form of electronic journal review, employing factors such as cost per use to inform budgetary decisions. In this paper we detail some tech-savvy processes we created to leverage a Python script to automate journal subject heading generation within OCLC’s WorldCat catalog, the MOBIUS catalog (a Missouri library consortium), and the VuFind Library Catalog, a now-retired catalog for CARLI (the Consortium for Academic and Research Libraries in Illinois). We also describe the rationale for the inception of this project, the methodology we utilized, the current limitations, and details of our future work in automating our annual analysis of journal subject headings by use of an OCLC API.
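The core lookup such a script automates can be sketched as follows. This is a hedged illustration, not the authors' actual code: the record structure is a simplified stand-in for the MARC data a scraper or API call might return, and `subject_headings` is a hypothetical helper name.

```python
# Sketch: extract subject headings (MARC 650 fields) from a record.
# The record shape below is a simplified stand-in for what a script
# might parse after scraping a catalog page or calling an OCLC API.

def subject_headings(record):
    """Return the subject heading strings from a MARC-like record dict."""
    headings = []
    for field in record.get("fields", []):
        if field.get("tag") == "650":
            # Join the subfield values ($a topic, $x subdivision, ...)
            parts = [sf["value"] for sf in field.get("subfields", [])]
            headings.append(" -- ".join(parts))
    return headings

record = {
    "title": "Journal of Example Studies",
    "fields": [
        {"tag": "245", "subfields": [{"code": "a", "value": "Journal of Example Studies"}]},
        {"tag": "650", "subfields": [{"code": "a", "value": "Library science"},
                                     {"code": "x", "value": "Periodicals"}]},
        {"tag": "650", "subfields": [{"code": "a", "value": "Information science"}]},
    ],
}

print(subject_headings(record))
# ['Library science -- Periodicals', 'Information science']
```

In practice the same extraction would be run once per journal title in the review list, with the headings written back to the cost-per-use spreadsheet.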

Editorial: The Cost of Knowing Our Users

By Mark Swenson — September 22, 2021, 15:52
Some musings on the difficulty of wanting to know our users' secrets and simultaneously wanting to not know them.

Closing the Gap between FAIR Data Repositories and Hierarchical Data Formats

By Connor B. Bailey, Fedor F. Balakirev, and Lyudmila L. Balakireva — September 22, 2021, 15:52
Many in the scientific community, particularly in publicly funded research, are pushing to adhere to more accessible data standards to maximize the findability, accessibility, interoperability, and reusability (FAIR) of scientific data, especially with the growing prevalence of machine learning augmented research. Online FAIR data repositories, such as the Open Science Framework (OSF), help facilitate the adoption of these standards by providing frameworks for storage, access, search, APIs, and other features that create organized hubs of scientific data. However, the wider acceptance of such repositories is hindered by the lack of support of hierarchical data formats, such as Technical Data Management Streaming (TDMS) and Hierarchical Data Format 5 (HDF5), that many researchers rely on to organize their datasets. Various tools and strategies should be used to allow hierarchical data formats, FAIR data repositories, and scientific organizations to work more seamlessly together. A pilot project at Los Alamos National Laboratory (LANL) addresses the disconnect between them by integrating the OSF FAIR data repository with hierarchical data renderers, extending support for additional file types in their framework. The multifaceted interactive renderer displays a tree of metadata alongside a table and plot of the data channels in the file. This allows users to quickly and efficiently load large and complex data files directly in the OSF webapp. Users who are browsing files can quickly and intuitively see the files in the way they or their colleagues structured the hierarchical form and immediately grasp their contents. This solution helps bridge the gap between hierarchical data storage techniques and FAIR data repositories, making both of them more viable options for scientific institutions like LANL which have been put off by the lack of integration between them.

Using Low Code to Automate Public Service Workflows: Three Cases

By Dianna Morganti and Jess Williams — September 22, 2021, 15:52
Public service librarians without coding experience or technical education may not always be aware of or consider automation to be an option to streamline their regular work tasks, but the new prevalence of enterprise-level low code solutions allows novices to take advantage of technology to make their work more efficient and effective. Low code applications apply a graphical user interface on top of a coding platform to make it easy for novices to leverage automation at work. This paper presents three cases of using low code solutions for automating public service problems using the prevalent Microsoft Power Automate application, available in many library workplaces that use the Microsoft Office ecosystem. From simplifying the communication and scheduling process for instruction classes to connecting our student workers’ hourly floor counts to our administrators’ dashboard of building occupancy, we’ve leveraged simple low code automation in a scalable and replicable manner. Pseudo-code examples are provided.

Introducing SAGE: An Open-Source Solution for Customizable Discovery Across Collections

By David B. Lowe, James Creel, Elizabeth German, Douglas Hahn, and Jeremy Huff — September 22, 2021, 15:52
Digital libraries at research universities make use of a wide range of unique tools to enable the sharing of eclectic sets of texts, images, audio, video, and other digital objects. Presenting these assorted local treasures to the world can be a challenge, since text is often siloed with text, images with images, and so on, such that per type, there may be separate user experiences in a variety of unique discovery interfaces. One common tool that has been developed in recent years to potentially unite them all is the Apache Solr index. Texas A&M University (TAMU) Libraries has harnessed Solr for internal indexing for repositories like DSpace, Fedora, and Avalon. Impressed by frameworks like Blacklight at peer institutions, TAMU Libraries wrote an analogous set of tools in Java, and thus was born SAGE, the Solr AGgregation Engine, with two primary functions: 1) aggregating Solr indices, or “cores,” from various local sources, and 2) presenting a search facility to the user in a discovery interface.

Building and Maintaining Metadata Aggregation Workflows Using Apache Airflow

By Leanne Finnigan and Emily Toner — September 22, 2021, 15:52
PA Digital is a Pennsylvania network that serves as the state’s service hub for the Digital Public Library of America (DPLA). The group developed a homegrown aggregation system in 2014, used to harvest digital collection records from contributing institutions, validate and transform their metadata, and deliver aggregated records to the DPLA. Since our initial launch, PA Digital has expanded significantly, harvesting from an increasing number of contributors with a variety of repository systems. With each new system, our highly customized aggregator software became more complex and difficult to maintain. By 2018, PA Digital staff had determined that a new solution was needed. From 2019 to 2021, a cross-functional team implemented a more flexible and scalable approach to metadata aggregation for PA Digital, using Apache Airflow for workflow management and Solr/Blacklight for internal metadata review. In this article, we will outline how we use this group of applications and the new workflows adopted, which afford our metadata specialists more autonomy to contribute directly to the ongoing development of the aggregator. We will discuss how this work fits into our broader sustainability planning as a network and how the team leveraged shared expertise to build a more stable approach to maintenance.
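The aggregation workflow described (harvest, validate, transform, deliver) can be sketched as a chain of task functions of the kind an Airflow DAG orders. This is a stand-in in plain Python, not PA Digital's actual code: the function names and record shapes are illustrative, and a real deployment would register each function as an Airflow task with declared dependencies.

```python
# A minimal stand-in for a harvest -> validate -> transform -> deliver
# pipeline. Each function corresponds to one task in a workflow DAG.

def harvest():
    # In production this would be, e.g., an OAI-PMH harvest from a contributor.
    return [{"title": "Old Map of Philadelphia", "rights": "public domain"},
            {"title": "", "rights": "public domain"}]

def validate(records):
    # Drop records missing required fields (here: a non-empty title).
    return [r for r in records if r.get("title")]

def transform(records):
    # Map local fields to DPLA-style property names (illustrative mapping).
    return [{"sourceResource": {"title": r["title"], "rights": r["rights"]}}
            for r in records]

def deliver(records):
    # Stand-in for pushing the aggregated records to DPLA.
    return len(records)

# Chained in the same order the DAG would schedule the tasks:
delivered = deliver(transform(validate(harvest())))
print(delivered)  # 1
```

The appeal of expressing this in a workflow manager rather than a monolithic script is that each stage can be rerun, monitored, and modified independently, which is what gives metadata specialists the autonomy the article describes.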

An XML-Based Migration from Digital Commons to Open Journal Systems

By Cara M. Key — September 22, 2021, 15:52
The Oregon Library Association has produced its peer-reviewed journal, the OLA Quarterly (OLAQ), since 1995, and OLAQ was published in Digital Commons beginning in 2014. When the host institution undertook to move away from Bepress, their new repository solution was no longer a good match for OLAQ. Oregon State University and University of Oregon agreed to move the journal into their joint instance of Open Journal Systems (OJS), and a small team from OSU Libraries carried out the migration project. The OSU project team declined to use PKP’s existing migration plugin for a number of reasons, instead pursuing a metadata-centered migration pipeline from Digital Commons to OJS. We used custom XSLT to convert tabular data exported from Bepress into PKP’s Native XML schema, which we imported using the OJS Native XML Plugin. This approach provided a high degree of control over the journal’s metadata and a robust ability to test and make adjustments along the way. The article discusses the development of the transformation stylesheet, the metadata mapping and cleanup work involved, as well as advantages and limitations of using this migration strategy.
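The article's transformation runs in XSLT; the same tabular-row-to-XML step can be sketched in Python with the standard library's ElementTree. The row fields and element names below are simplified illustrations loosely modeled on PKP's Native XML schema, not the real schema or the OSU team's stylesheet.

```python
import xml.etree.ElementTree as ET

# Sketch of the tabular-to-XML step: one exported Digital Commons row
# becomes one <article> element for import into OJS.

row = {"title": "Open Source in Oregon Libraries",
       "abstract": "A survey of OSS adoption.",
       "author_first": "Jane", "author_last": "Doe"}

article = ET.Element("article")
ET.SubElement(article, "title").text = row["title"]
ET.SubElement(article, "abstract").text = row["abstract"]

# Authors nest inside a container element, one <author> per contributor.
authors = ET.SubElement(article, "authors")
author = ET.SubElement(authors, "author")
ET.SubElement(author, "givenname").text = row["author_first"]
ET.SubElement(author, "familyname").text = row["author_last"]

print(ET.tostring(article, encoding="unicode"))
```

The advantage the article claims for a metadata-centered pipeline shows up here: because each row passes through an explicit mapping, fields can be inspected, cleaned, and re-run before anything is imported.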

Digitization Decisions: Comparing OCR Software for Librarian and Archivist Use

By Leanne Olson and Veronica Berry — September 22, 2021, 15:52
This paper is intended to help librarians and archivists who are involved in digitization work choose optical character recognition (OCR) software. The paper provides an introduction to OCR software for digitization projects, and shares the method we developed for easily evaluating the effectiveness of OCR software on resources we are digitizing. We tested three major OCR programs (Adobe Acrobat, ABBYY FineReader, Tesseract) for accuracy on three different digitized texts from our archives and special collections at the University of Western Ontario. Our test was divided into two parts: a word accuracy test (to determine how searchable the final documents were), and a test with a screen reader (to determine how accessible the final documents were). We share our findings from the tests and make recommendations for OCR work on digitized documents from archives and special collections.
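A minimal version of the word accuracy test compares OCR output to a hand-corrected transcription word by word. This naive sketch (not the authors' exact method) matches by position and ignores insertions and deletions, which a fuller evaluation would handle with edit distance:

```python
# Compare OCR output against a ground-truth transcription, word by word.

def word_accuracy(ground_truth, ocr_output):
    """Fraction of ground-truth words the OCR got right, by position."""
    truth_words = ground_truth.split()
    ocr_words = ocr_output.split()
    matches = sum(1 for t, o in zip(truth_words, ocr_words) if t == o)
    return matches / len(truth_words)

truth = "the annual report of the university library"
ocr = "the annua1 report of the university librarv"  # two OCR misreads
print(round(word_accuracy(truth, ocr), 2))  # 0.71
```

Even this crude metric is enough to rank OCR engines against each other on the same page images, which is the comparison the paper performs.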

On Two Proposed Metrics of Electronic Resource Use

By William Denton — September 22, 2021, 15:52
There are many ways to look at electronic resource use, individually or aggregated. I propose two new metrics to help give a better understanding of comparative use across an online collection. Users per mille is a relative annual measure of how many users a platform had for every thousand potential users: this tells us how many people used a given platform. Interest factor is the average number of uses of a platform by people who used it more than once: this tells us how much people used a given platform. These two metrics are enough to give us good insight into collection use. Dividing each into quartiles allows a quadrant comparison of lows and highs on each metric, giving a quick view of platforms many people use a lot (the big expensive ones), many people use very little (a curious subset), a few people use a lot (very specific to a narrow subject) and a few people use very little (deserves attention). This helps understand collection use and informs collection management.
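Under the definitions above, both metrics reduce to a few lines. A sketch, with a hypothetical per-user use-count table for one platform:

```python
def users_per_mille(platform_users, potential_users):
    """Unique users of a platform per 1,000 potential users."""
    return 1000 * len(platform_users) / potential_users

def interest_factor(use_counts):
    """Average uses among users who used the platform more than once."""
    repeat = [n for n in use_counts if n > 1]
    return sum(repeat) / len(repeat) if repeat else 0.0

# Hypothetical data: 5 unique users out of 2,000 potential users,
# with per-user annual use counts.
uses = {"u1": 1, "u2": 1, "u3": 2, "u4": 5, "u5": 3}

print(users_per_mille(uses, potential_users=2000))  # 2.5
print(round(interest_factor(uses.values()), 2))     # 3.33
```

Computing both numbers per platform and splitting each into quartiles yields the low/high quadrant comparison the abstract describes.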

Conspectus: A Syllabi Analysis Platform for Leganto Data Sources

By David Massey and Thomas Sødring — September 22, 2021, 15:52
In recent years, higher education institutions have implemented electronic solutions for the management of syllabi, resulting in new and exciting opportunities within the area of large-scale syllabi analysis. This article details an information pipeline that can be used to harvest, enrich and use such information.

Pythagoras: Discovering and Visualizing Musical Relationships Using Computer Analysis

By Brandon Bellanti — June 14, 2021, 21:40
This paper presents an introduction to Pythagoras, an in-progress digital humanities project using Python to parse and analyze XML-encoded music scores. The goal of the project is to use recurring patterns of notes to explore existing relationships among musical works and composers. An intended outcome of this project is to give music performers, scholars, librarians, and anyone else interested in digital humanities new insights into musical relationships as well as new methods of data analysis in the arts.

On the Nature of Extreme Close-Range Photogrammetry: Visualization and Measurement of North African Stone Points

By Michael J. Bennett — June 14, 2021, 21:40
Image acquisition, visualization, and measurement are examined in the context of extreme close-range photogrammetric data analysis. Manual measurements commonly used in traditional stone artifact investigation are used as a starting point to better gauge the usefulness of high-resolution 3D surrogates and the flexible digital tool sets that can work with them. The potential of various visualization techniques is also explored in the context of future teaching, learning, and research in virtual environments.

Choose Your Own Educational Resource: Developing an Interactive OER Using the Ink Scripting Language

By Stewart Baker — June 14, 2021, 21:40
Learning games are games created with the purpose of educating, as well as entertaining, players. This article describes the potential of interactive fiction (IF), a type of text-based game, to serve as learning games. After summarizing the basic concepts of interactive fiction and learning games, the article describes common interactive fiction programming languages and tools, including Ink, a simple markup language that can be used to create choice-based text games that play in a web browser. The final section of the article includes code putting the concepts of Ink, interactive fiction, and learning games into action using part of an interactive OER created by the author in December of 2020.

Institutional Data Repository Development, a Moving Target

By Colleen Fallaw, Genevieve Schmitt, Hoa Luong, Jason Colwell, and Jason Strutz — June 14, 2021, 21:40
At the end of 2019, the Research Data Service (RDS) at the University of Illinois at Urbana-Champaign (UIUC) completed its fifth year as a campus-wide service. In order to gauge the effectiveness of the RDS in meeting the needs of Illinois researchers, RDS staff developed a five-year review consisting of a survey and a series of in-depth focus group interviews. As a result, our institutional data repository, Illinois Data Bank, developed in-house by University Library IT staff, was recognized as the most useful service offering by our unit. When launched in 2016, storage resources and web servers for Illinois Data Bank and supporting systems were hosted on-premises at UIUC. As anticipated, researchers increasingly need to share large and complex datasets. To leverage potentially more reliable, highly available, cost-effective, and scalable storage accessible to computation resources, we migrated our item bitstreams and web services to the cloud. Our efforts have met with success, but also with painful bumps along the way. This article describes how we supported data curation workflows through transitioning from on-premises to cloud resource hosting. It details our approaches to ingesting, curating, and offering access to dataset files up to 2TB in size, which may be archive-type files (e.g., .zip or .tar) containing complex directory structures.

How We Built a Spatial Subject Classification Based on Wikidata

By Adrian Pohl — June 14, 2021, 21:40
From the fall of 2017 to the beginning of 2020, a project was carried out to upgrade spatial subject indexing in the North Rhine-Westphalian Bibliography (NWBib) from uncontrolled strings to controlled values. For this purpose, a spatial classification with around 4,500 entries was created from Wikidata and published as a SKOS (Simple Knowledge Organization System) vocabulary. The article gives an overview of the initial problem and outlines the different implementation steps.

Better Together: Improving the Lives of Metadata Creators with Natural Language Processing

By Paul Kelly — June 14, 2021, 21:40
DC Public Library has long held digital copies of the full run of local alternative weekly, Washington City Paper, but had no official status as a rights grantor to enable use. That recently changed due to a full agreement being reached with the publisher. One condition of that agreement, however, was that issues become available with usable descriptive metadata and subject access in time to celebrate the upcoming 40th anniversary of the publication, which at that time was in six months. One of the most time intensive tasks our metadata specialists work on is assigning description to digital objects. This paper details how we applied Python’s Natural Language Toolkit and OpenRefine’s reconciliation functions to the collection’s OCR text to simplify subject selection for staff with no background in programming.
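The first step of such a pipeline, pulling frequent content words out of an issue's OCR text as candidate subject terms, can be sketched with the standard library alone. This is a simplified stand-in, not the library's NLTK code: the stopword list is truncated and the terms would still need reconciling against a controlled vocabulary, as the article does with OpenRefine.

```python
# Suggest candidate subject terms from OCR text by word frequency.
import re
from collections import Counter

# Deliberately tiny stopword list; NLTK ships a much fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "on", "at", "is"}

def candidate_terms(ocr_text, top_n=3):
    words = re.findall(r"[a-z]+", ocr_text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 3)
    return [term for term, _ in counts.most_common(top_n)]

text = ("The council debated the new metro line. Metro riders protested fares. "
        "The metro board approved the fares despite the protest.")
print(candidate_terms(text, top_n=2))  # ['metro', 'fares']
```

A staff member then only has to confirm or reject suggestions rather than read the full issue, which is where the time savings come from.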

Optimizing Elasticsearch Search Experience Using a Thesaurus

By Emmanuel Di Pretoro, Edwin De Roock, Wim Fremout, Erik Buelinckx, Stephanie Buyle, and Véronique Van der Stede — June 14, 2021, 21:40
The Belgian Art Links and Tools (BALaT) is the continuously expanding online documentary platform of the Royal Institute for Cultural Heritage (KIK-IRPA), Brussels (Belgium). BALaT contains over 750,000 images of KIK-IRPA’s unique collection of photo negatives on the cultural heritage of Belgium, but also the library catalogue, PDFs of articles from KIK-IRPA’s Bulletin and other publications, an extensive persons and institutions authority list, and several specialized thematic websites, each of those collections being multilingual, as Belgium has three official languages. All of these are interlinked to give the user easy access to freely available information on Belgian cultural heritage. In recent years, KIK-IRPA has been working on a detailed and inclusive data management plan. Through this data management plan, a new project, HESCIDA (Heritage Science Data Archive), will upgrade BALaT to BALaT+, enabling access to searchable registries of KIK-IRPA datasets and data interoperability. BALaT+ will be a building block of DIGILAB, one of the future pillars of the European Research Infrastructure for Heritage Science (E-RIHS), which will provide online access to scientific data concerning tangible heritage, following the FAIR principles (Findable, Accessible, Interoperable, Reusable). It will include and enable access to searchable registries of specialized digital resources (datasets, reference collections, thesauri, ontologies, etc.). In the context of this project, Elasticsearch has been chosen as the technology powering the search component of BALaT+. An essential feature of this search functionality is the need for linguistic equivalencies, meaning a term query in French should also return the matching results containing the equivalent term in Dutch. Another important feature is a mechanism to broaden the search with more precise terminology: a term like "furniture" should also match records containing chairs, tables, etc.
This article will explain how a thesaurus developed in-house at KIK-IRPA was used to obtain these functionalities, from the processing of that thesaurus to the production of the configuration needed by Elasticsearch.
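The general shape of that processing can be sketched as generating synonym rules from thesaurus entries. The thesaurus structure below is hypothetical (KIK-IRPA's in-house format is not described here), but the output uses the standard Solr-style rule syntax that Elasticsearch's synonym token filter accepts: `a, b` for equivalence and `a => a, b` for one-way expansion.

```python
# Turn thesaurus entries into Elasticsearch synonym rules.
# Entry shape is illustrative: a preferred term, its Dutch equivalent,
# and its narrower terms.

thesaurus = {
    "furniture": {"nl": "meubilair", "narrower": ["chair", "table"]},
    "chair":     {"nl": "stoel", "narrower": []},
}

def synonym_rules(thesaurus):
    rules = []
    for term, entry in thesaurus.items():
        # Cross-language equivalence: the term matches its Dutch form too.
        rules.append(f"{term}, {entry['nl']}")
        # One-way broader-to-narrower expansion: a query for the broad
        # term also matches records indexed with its narrower terms.
        for narrower in entry["narrower"]:
            rules.append(f"{term} => {term}, {narrower}")
    return rules

for rule in synonym_rules(thesaurus):
    print(rule)
```

The generated lines would then be loaded into the index's analysis configuration as a synonym filter, so the expansion happens at query time rather than in application code.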

Adaptive Digital Library Services: Emergency Access Digitization at the University of Illinois at Urbana-Champaign During the COVID-19 Pandemic

By Kyle R. Rimkus, Alex Dolski, Brynlee Emery, Rachael Johns, Patricia Lampron, William Schlaack, and Angela Waarala — June 14, 2021, 21:40
This paper describes how the University of Illinois at Urbana-Champaign Library provided access to circulating library materials during the 2020 COVID-19 pandemic. Specifically, it details how the library adapted existing staff roles and digital library infrastructure to offer on-demand digitization of and limited online access to library collection items requested by patrons working in a remote teaching and learning environment. The paper also provides an overview of the technology used, details how dedicated staff with strong local control of technology were able to scale up a university-wide solution, reflects on lessons learned, and analyzes nine months of usage data to shed light on library patrons’ changing needs during the pandemic.

Enhancing Print Journal Analysis for Shared Print Collections

By Dana Jemison, Lucy Liu, Anna Striker, Alison Wohlers, Jing Jiang, and Judy Dobry — June 14, 2021, 21:40
The Western Regional Storage Trust (WEST) is a distributed shared print journal repository program serving research libraries, college and university libraries, and library consortia in the Western Region of the United States. WEST solicits serial bibliographic records and related holdings biennially, which are evaluated and identified as candidates for shared print archiving using a complex collection analysis process. California Digital Library’s Discovery & Delivery WEST operations team (WEST-Ops) supports the functionality behind this collection analysis process used by WEST program staff (WEST-Staff) and members. For WEST, proposals for shared print archiving have historically been predicated on what is known as an Ulrich’s journal family, which pulls together related serial titles, for example, succeeding and preceding serial titles, their supplements, and foreign language parallel titles. Ulrich’s, while invaluable, proves problematic in several ways, resulting in the omission of approximately half of the journal titles submitted for collection analysis. Part of WEST’s effectiveness in archiving hinges upon its ability to analyze local serials data across its membership as holistically as possible. The process that enables this analysis, and subsequent archiving proposals, is dependent on Ulrich’s journal family, for which ISSN has traditionally been used to match and cluster all related titles within a particular family. As such, the process is limited in that many journals have never been assigned ISSNs, especially older publications, or member bibliographic records may lack an ISSN, though the ISSN may exist in an OCLC primary record. Building a mechanism for matching on ISSNs that goes beyond the base set of primary, former, and succeeding titles expands the number of eligible ISSNs that facilitate Ulrich’s journal family matching.
Furthermore, when no matches in Ulrich’s can be made based on ISSN, other types of control numbers within a bibliographic record may be used to match with records that have been previously matched with an Ulrich’s journal family via ISSN, resulting in a significant increase in the number of titles eligible for collection analysis. This paper will discuss problems in Ulrich’s journal family matching, improved functional methodologies developed to address those problems, and potential strategies to improve serial title clustering in the future.
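The fallback-matching idea can be sketched as clustering records by ISSN when one is present and falling back to another control number to attach ISSN-less records to an existing cluster. The record shapes, the choice of OCLC number as the fallback key, and the function name are all illustrative, not WEST's actual implementation:

```python
# Cluster serial records: match on ISSN first, then fall back to a
# secondary control number (here, an OCLC number).

def cluster_records(records):
    by_issn, by_oclc, clusters = {}, {}, []
    for rec in records:
        cluster = None
        if rec.get("issn") and rec["issn"] in by_issn:
            cluster = by_issn[rec["issn"]]
        elif rec.get("oclc") and rec["oclc"] in by_oclc:
            cluster = by_oclc[rec["oclc"]]
        if cluster is None:
            cluster = []
            clusters.append(cluster)
        cluster.append(rec["title"])
        # Register both identifiers so later records can join this cluster.
        if rec.get("issn"):
            by_issn[rec["issn"]] = cluster
        if rec.get("oclc"):
            by_oclc[rec["oclc"]] = cluster
    return clusters

records = [
    {"title": "Journal of Western History", "issn": "1234-5678", "oclc": "111"},
    {"title": "J. West. Hist.", "issn": None, "oclc": "111"},  # no ISSN
    {"title": "Western History Quarterly", "issn": "8765-4321", "oclc": None},
]
print(cluster_records(records))
# [['Journal of Western History', 'J. West. Hist.'], ['Western History Quarterly']]
```

The second record, which would previously have dropped out of analysis for lack of an ISSN, joins the first cluster through the shared control number, which is the gain the paper reports.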

Assessing High-volume Transfers from Optical Media at NYPL

By Michelle Rothrock, Alison Rhonemus, and Nick Krabbenhoeft — June 14, 2021, 21:40
NYPL’s workflow for transferring optical media to long-term storage was met with a challenge: an acquisition of a collection containing thousands of recordable CDs and DVDs. Many programs take a disk-by-disk approach to imaging or transferring optical media, but to deal with a collection of this size, NYPL developed a workflow using a Nimbie AutoLoader and a customized version of KBNL’s open-source IROMLAB software to batch disks for transfer. This workflow prioritized quantity, but, at the outset, it was difficult to tell if every transfer was as accurate as it could be. We discuss the process of evaluating the success of the mass transfer workflow, and the improvements we made to identify and troubleshoot errors that could occur during the transfer. A background of the institution and other institutions’ approaches to similar projects is given, then an in-depth discussion of the process of gathering and analyzing data. We finish with a discussion of our takeaways from the project.

Editorial: Closer to 100 than to 1

By Edward M. Corrado — June 14, 2021, 21:40
With the publication of Issue 51, the Code4Lib Journal is now closer to Issue 100 than it is to Issue 1. Also, we are developing a name change policy.

Machine Learning Based Chat Analysis

By Christopher Brousseau, Justin Johnson, and Curtis Thacker — February 10, 2021, 17:30
The BYU library implemented a machine learning-based tool to perform various text analysis tasks on transcripts of chat-based interactions between patrons and librarians. These tasks included estimating patron satisfaction and classifying queries into categories such as Research/Reference, Directional, Tech/Troubleshooting, Policy/Procedure, and others. An accuracy of 78% or better was achieved for each category. This paper describes the implementation and explores potential applications for the text analysis tool.

Archive This Moment D.C.: A Case Study of Participatory Collecting During COVID-19

By Julie Burns, Laura Farley, Siobhan C. Hagan, Paul Kelly, and Lisa Warwick — February 10, 2021, 17:30
When the COVID-19 pandemic brought life in Washington, D.C. to a standstill in March 2020, staff at DC Public Library began looking for ways to document how this historic event was affecting everyday life. Recognizing the value of first-person accounts for historical research, staff launched Archive This Moment D.C. to preserve the story of daily life in the District during the stay-at-home order. Materials were collected from public Instagram and Twitter posts submitted through the hashtag #archivethismomentdc. In addition to social media, creators also submitted materials using an Airtable webform set up for the project and through email. Over 2,000 digital files were collected. This article will discuss the planning, professional collaboration, promotion, selection, access, and lessons learned from the project, as well as the technical setup, collection strategies, and metadata requirements. In particular, it will include a discussion of the evolving collection scope of the project and the need for clear ethical guidelines surrounding privacy when collecting materials in real time.

Managing an institutional repository workflow with GitLab and a folder-based deposit system

By Whitney R. Johnson-Freeman, Mark E. Phillips, and Kristy K. Phillips — February 10, 2021, 17:30
Institutional Repositories (IR) exist in a variety of configurations and in various states of development across the country. Each organization with an IR has a workflow that can range from explicitly documented and codified sets of software and human workflows, to ad hoc assortments of methods for working with faculty to acquire, process and load items into a repository. The University of North Texas (UNT) Libraries has managed an IR called UNT Scholarly Works for the past decade but has until recently relied on ad hoc workflows. Over the past six months, we have worked to improve our processes in a way that is extensible and flexible while also providing a clear workflow for our staff to process submitted and harvested content. Our approach makes use of GitLab and its associated tools to track and communicate priorities for a multi-user team processing resources. We paired this Web-based management with a folder-based system for moving the deposited resources through a sequential set of processes that are necessary to describe, upload, and preserve the resource. This strategy can be used in a number of different applications and can serve as a set of building blocks that can be configured in different ways. This article will discuss which components of GitLab are used together as tools for tracking deposits from faculty as they move through different steps in the workflow. Likewise, the folder-based workflow queue will be presented and described as implemented at UNT, and examples for how we have used it in different situations will be presented.
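The folder-based queue idea can be sketched as a set of stage directories that a deposit moves through in order. The stage names and layout below are hypothetical illustrations, not UNT's actual folder structure:

```python
# A deposit is a folder; each workflow stage is a parent directory.
# Advancing a deposit moves its folder to the next stage in sequence.
import shutil, tempfile
from pathlib import Path

STAGES = ["01_new", "02_describe", "03_upload", "04_done"]

def advance(root, deposit_name):
    """Move a deposit folder to the next stage; return the new stage."""
    root = Path(root)
    for current, nxt in zip(STAGES, STAGES[1:]):
        src = root / current / deposit_name
        if src.exists():
            shutil.move(str(src), str(root / nxt / deposit_name))
            return nxt
    raise FileNotFoundError(deposit_name)

# Set up a throwaway queue with one new deposit.
root = tempfile.mkdtemp()
for stage in STAGES:
    (Path(root) / stage).mkdir()
(Path(root) / "01_new" / "deposit-001").mkdir()

print(advance(root, "deposit-001"))  # 02_describe
print(advance(root, "deposit-001"))  # 03_upload
```

Because the state of every deposit is visible as an ordinary directory listing, staff can see the whole queue at a glance, and the same pattern composes with an issue tracker like GitLab for assigning and discussing individual deposits.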