In this short practice paper, we introduce the public version of the Qualitative Data Repository’s (QDR) Curation Handbook. The Handbook documents and structures curation practices at QDR. We describe the background and genesis of the Handbook and highlight some of its key content.
Objective: To increase data quality and ensure compliance with appropriate policies, many institutional data repositories curate data that is deposited into their systems. Here, we present our experience as an academic library implementing and managing a semi-automated, cloud-based data curation workflow for a recently launched institutional data repository. Based on our experiences we then present management observations intended for data repository managers and technical staff looking to move some or all of their curation services to the cloud.
Methods: We implemented tooling for our curation workflow in a service-oriented manner, making significant use of our data repository platform’s application programming interface (API). With an eye towards sustainability, a guiding development philosophy has been to automate processes following industry best practices while avoiding solutions with high resource needs (e.g., maintenance) and minimizing the risk of becoming locked in to specific tooling.
Results: The initial barrier for implementing a data curation workflow in the cloud was high in comparison to on-premises curation, mainly due to the need to develop in-house cloud expertise. However, compared to the cost for on-premises servers and storage, infrastructure costs have been substantially lower. Furthermore, in our particular case, once the foundation had been established, a cloud approach resulted in increased agility allowing us to quickly automate our workflow as needed.
Conclusions: Workflow automation has put us on a path toward scaling the service, and a cloud-based approach has helped reduce initial costs. However, because cloud-based workflows and automation come with a maintenance overhead, it is important to build tooling that follows software development best practices and can be decoupled from curation workflows to avoid lock-in.
Objective: The Illinois Data Bank provides Illinois researchers with the infrastructure to publish research data publicly. During a five-year review of the Research Data Service at the University of Illinois at Urbana-Champaign, it was recognized as the most useful service offering in the unit. Internal metrics are captured and used to monitor the growth, document curation workflows, and surface technical challenges faced as we assist our researchers. Here we present examples of these curation challenges and the solutions chosen to address them.
Methods: Some Illinois Data Bank metrics are collected internally within the system, but most of the curation metrics reported here are tracked separately in a Google spreadsheet. The curator logs required information after curation is complete for each dataset. While the data are sometimes ambiguous (e.g., depending on researcher uptake of suggested actions), our curation data provide a general understanding of our data repository and have been useful in assessing our workflows and services. These metrics also help prioritize development needs for the Illinois Data Bank.
Results and Conclusions: The curatorial services polish and improve the datasets, which contributes to the spirit of data reuse. Although we continue to see challenges in our processes, curation makes a positive impact on datasets. Continued development and adaptation of the technical infrastructure allows for an ever-better experience for the curators and users. These improvements have helped our repository more effectively support the data sharing process by successfully fostering depositor engagement with curators to improve datasets and facilitating easy transfer of very large files.
This commentary describes how context, quality, and efficiency guide data curation at the University of Michigan's Inter-university Consortium for Political and Social Research (ICPSR). These three principles arise from necessity. A primary purpose of this work is to facilitate secondary data analysis, but in order to do so, the context of the data must be documented. Since a mistake in this work would render any results published from the data inaccurate, quality is paramount. However, optimizing data quality can be time consuming, so automated curation practices are necessary for efficiency. The implementation of these principles (context, quality, and efficiency) is demonstrated by a recent case study with a high-profile dataset. As the nature of data work changes, these principles will continue to guide the practice of curation and establish valuable skills for future curators to cultivate.
Purpose: This paper introduces the Portage Network’s Dataverse Curation Guide and the new bilingual curation framework developed to support it.
Brief Description: Canadian academic institutions and national organizations have been building infrastructure, staffing, and programming to support research data management. Amidst this work, a notable gap emerged between requirements for data curation in general repositories like Dataverse and the requisite workflows and guidance materials needed by curators to meet them. In response, Portage, a national network of data experts, organized a working group to develop a Dataverse curation guide built upon the Data Curation Network’s CURATED workflow. To create a bilingual resource, the original CURATE(D) acronym was modified to CURATION—which has the same meaning in both French and English—and steps were augmented with Dataverse-specific guidance and mapped to three conceptualized levels of curation to assist curators in prioritizing curation actions.
Methods: An environmental scan of relevant deposit and curation guidance materials from Canadian and international institutions identified the need for a comprehensive Dataverse Curation Guide, as most existing resources were either depositor-focused or contained only partial workflows. The resulting Guide synthesized these guidance materials into the CURATION steps and mapped actions to various theoretical levels of data repository services and levels of curation.
Resources: The following documents are supplemental to the Dataverse Curation Guide: the Portage Dataverse North Metadata Best Practices Guide, the Scholars Portal Dataverse Guide, and the Data Curation Network CURATED Workflow and Data Curation Primers.
Keeping in mind the work done by data librarians is key to understanding the importance of providing open and free access to data. Standards such as persistent identifiers (PIDs) were created to provide long-lasting access to all types of digital materials and resources. Providing new ways to inform and instruct researchers and other users on the importance of making data available for sharing, reproducibility, and re-use helps in driving good and effective social policy for researchers.
Methods: This study replicated the methods of a prior citation study to provide an updated, transparent, and reproducible citation analysis protocol that can be rerun with Jupyter Notebooks.
Results: This study replicated the prior citation study’s conclusions, and also adapted the author’s methods to analyze the citation practices of Earth Scientists at four institutions. We found that 80% of the citations could be accounted for by only 7.88% of journals, a key metric to help identify a core collection of titles in this discipline. We then demonstrated programmatically that 36% of these cited references were available as open access.
Conclusions: Jupyter Notebooks are a viable platform for disseminating replicable processes for citation analysis. A completely open methodology is emerging, and we consider this a step forward. Adherence to the 80/20 rule aligned with institutional research output, but citation preferences are evident. Reproducible citation analysis methods may be used to analyze open access uptake; however, results are inconclusive. It is difficult to determine whether an article was open access at the time of citation or became open access after an embargo.
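The "80% of citations from 7.88% of journals" figure reported in the Results above is a cumulative-share calculation. The following is a minimal sketch of that kind of calculation, not the study's own notebook code; the function name `core_journals` and the input format (one journal title per cited reference) are assumptions made for illustration.

```python
from collections import Counter


def core_journals(cited_journals, share=0.80):
    """Return the top-cited journals needed to cover `share` of all citations,
    plus the fraction of distinct journals that this core represents."""
    counts = Counter(cited_journals)
    if not counts:
        return [], 0.0
    total = sum(counts.values())
    covered = 0
    core = []
    # Walk journals from most to least cited until the target share is reached.
    for journal, n in counts.most_common():
        if covered >= share * total:
            break
        core.append(journal)
        covered += n
    return core, len(core) / len(counts)


# Hypothetical input: one entry per cited reference.
citations = ["Journal A"] * 6 + ["Journal B"] * 2 + ["Journal C", "Journal D"]
core, fraction = core_journals(citations)
print(core, fraction)
```

The returned fraction is the share of distinct journal titles in the core set, which is the quantity a collection manager would compare against the 80/20 rule.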
Book review of: Data Feminism by Catherine D'Ignazio and Lauren F. Klein, The MIT Press (2020). Data Feminism combines intersectional feminism and critical data studies to invite the reader to consider: “How can we use data to remake the world?” As non-profit organizations with a mandate to provide equitable access to non-neutral information and services, libraries and library workers are uniquely positioned to advance the principles laid out in Data Feminism.
A digital object identifier (DOI) is an increasingly prominent persistent identifier for finding and accessing scholarly information. This paper presents an overview of global developments and approaches in the field of DOIs and DOI services, with a slight geographical focus on Germany. First, the initiation and components of the DOI system and the structure of a DOI name are explored. Next, the fundamental and specific characteristics of DOIs are described, and DOIs for three kinds of typical intellectual entities in scholarly communication are discussed; then, a general DOI service pyramid is sketched with brief descriptions of the functions of institutions at different levels. After that, approaches of the research data librarianship community in the field of RDM, especially DOI services, are elaborated. As examples, the DOI services provided in German research libraries as well as best practices of DOI services in one German library are introduced. Finally, current practices and some issues in dealing with DOIs are summarized. It is foreseeable that the DOI, which is crucial to FAIR research data, will gain extensive recognition in the scientific world.
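As a small illustration of the DOI name structure mentioned above: a DOI name consists of a prefix beginning with "10." and a suffix, separated by the first forward slash. The sketch below splits a DOI name into those two parts; the helper name `split_doi` and the set of leading forms it strips are choices made for this example, not part of the paper.

```python
def split_doi(doi):
    """Split a DOI name into (prefix, suffix) at the first slash.

    Accepts a bare DOI name, a 'doi:' form, or a doi.org resolver URL.
    """
    name = doi.strip()
    # Strip common leading forms so only the DOI name itself remains.
    for lead in ("https://doi.org/", "http://doi.org/", "doi:"):
        if name.lower().startswith(lead):
            name = name[len(lead):]
            break
    # The suffix may itself contain slashes, so split only once.
    prefix, sep, suffix = name.partition("/")
    if not (prefix.startswith("10.") and sep and suffix):
        raise ValueError(f"not a DOI name: {doi!r}")
    return prefix, suffix


print(split_doi("https://doi.org/10.1000/182"))
```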
Objectives: Compare journal coverage of abstracting and indexing tools commonly used within academic science and engineering research.
Methods: Title lists of Compendex, Inspec, Reaxys, SciFinder, and Web of Science were provided by their respective publishers. These lists were imported into Excel and the overlap of the ISSN/EISSNs and journal titles was determined using the VLOOKUP command, which determines if the value in one cell can be found in a column of other cells.
Results: Web of Science’s Science Citation Index Expanded and Emerging Sources Citation Index, together the largest database with 17,014 titles, overlap substantially with Compendex (63.6%), Inspec (71.0%), Reaxys (67.0%), and SciFinder (75.8%). SciFinder also overlaps heavily with Reaxys (75.9%). Web of Science and Compendex combined contain 77.6% of the titles within Inspec.
Conclusion: Flat or decreasing library budgets combined with increasing journal prices result in an unsustainable system that will require a calculated allocation of resources at many institutions. The overlap of commonly indexed journals among abstracting and indexing tools could serve as one way to determine how these resources should be allocated.
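The lookup-based overlap check described in the Methods above can also be expressed as a set intersection. The sketch below is an illustrative equivalent, not the authors' Excel workflow; the function names, the normalization rule, and the choice of which list serves as the denominator are assumptions.

```python
def normalize_issn(value):
    """Uppercase and drop hyphens and whitespace so '1234-567x' matches '1234567X'."""
    return value.replace("-", "").strip().upper()


def overlap_percent(reference, other):
    """Percentage of identifiers in `reference` that also appear in `other`."""
    ref = {normalize_issn(v) for v in reference if v.strip()}
    oth = {normalize_issn(v) for v in other if v.strip()}
    if not ref:
        return 0.0
    return 100.0 * len(ref & oth) / len(ref)


# Hypothetical title lists, each reduced to its ISSN/EISSN column.
list_a = ["1234-5678", "2345-678X", "0000-0001", "1111-2222"]
list_b = ["12345678", "2345-678x", "9999-9999"]
print(overlap_percent(list_a, list_b))
```

Because the percentage depends on which list is the denominator, swapping the arguments generally gives a different figure, so any reported overlap should state the reference list.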
A range of regulatory pressures emanating from funding agencies and scholarly journals increasingly encourage researchers to engage in formal data sharing practices. As academic libraries continue to refine their role in supporting researchers in this data sharing space, one particular challenge has been finding new ways to meaningfully engage with campus researchers. Libraries help shape norms and encourage data sharing through education and training, and there has been significant growth in the services these institutions are able to provide and the ways in which library staff are able to collaborate and communicate with researchers. Evidence also suggests that within disciplines, normative pressures and expectations around professional conduct have a significant impact on data sharing behaviors (Kim and Adler 2015; Sigit Sayogo and Pardo 2013; Zenk-Moltgen et al. 2018). Duke University Libraries' Research Data Management program has recently centered part of its outreach strategy on leveraging peer networks and social modeling to encourage and normalize robust data sharing practices among campus researchers. The program has hosted two panel discussions on issues related to data management—specifically, data sharing and research reproducibility. This paper reflects on some lessons learned from these outreach efforts and outlines next steps.
Objective: Investigate how different groups of depositors vary in their use of optional data curation features that provide support for FAIR research data in the Harvard Dataverse repository.
Methods: A numerical score based upon the presence or absence of characteristics associated with the use of optional features was assigned to each of the 29,295 datasets deposited in Harvard Dataverse between 2007 and 2019. Statistical analyses were performed to investigate patterns of optional feature use amongst different groups of depositors and their relationship to other dataset characteristics.
Results: Members of groups make greater use of Harvard Dataverse's optional features than individual researchers. Datasets that undergo a data curation review before submission to Harvard Dataverse, are associated with a publication, or contain restricted files also make greater use of optional features.
Conclusions: Individual researchers might benefit from increased outreach and improved documentation about the benefits and use of optional features to improve their datasets' level of curation beyond the FAIR-informed support that the Harvard Dataverse repository provides by default. Platform designers, developers, and managers may also use the numerical scoring approach to explore how different user groups use optional application features.
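A presence/absence score of the kind described in the Methods above might be sketched as follows. The feature names, the equal weighting of one point per feature, and the `depositor_type` field are all hypothetical; the abstract does not list the characteristics or weights the study actually used.

```python
# Hypothetical feature flags; the study's actual characteristics are not given in the abstract.
OPTIONAL_FEATURES = (
    "has_keywords",
    "has_related_publication",
    "has_custom_license",
    "has_file_descriptions",
)


def feature_score(dataset, features=OPTIONAL_FEATURES):
    """One point for each optional feature present in the dataset record (equal weights assumed)."""
    return sum(1 for f in features if dataset.get(f))


def mean_score_by_group(datasets, group_key="depositor_type"):
    """Average feature score per depositor group."""
    totals = {}
    for d in datasets:
        score_sum, n = totals.get(d[group_key], (0, 0))
        totals[d[group_key]] = (score_sum + feature_score(d), n + 1)
    return {group: s / n for group, (s, n) in totals.items()}


records = [
    {"depositor_type": "group", "has_keywords": True, "has_related_publication": True},
    {"depositor_type": "group", "has_keywords": True},
    {"depositor_type": "individual"},
]
print(mean_score_by_group(records))
```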
Inspired by Reid Boehm’s presentation “Beyond Pronouns: Caring for Transgender Medical Research Data to Benefit All People,” at the Research Data Access and Preservation Summit (RDAP) in March 2018, four librarians from the University of Minnesota (UMN) set out to create a LibGuide to support research on transgender topics as a response to Boehm’s identification of insufficient traditional mechanisms for describing, securing, and accessing data on transgender people and topics. This commentary describes the process used to craft the LibGuide, "Library Resources for Transgender Topics," including assembling a team of interested library staff, defining the scope of the project, interacting with stakeholders and community partners, establishing a workflow, and designing an ongoing process to incorporate user feedback.
The Journal of eScience Librarianship has partnered with the Research Data Access & Preservation (RDAP) Association for a third year to publish selected conference proceedings. This issue highlights the research presented at the RDAP 2020 Summit and the community it has fostered.
Objective: Promoting discovery of research data helps archived data realize its potential to advance knowledge. Montana State University (MSU) Dataset Search aims to support discovery and reporting for research datasets created by researchers at institutions.
Methods and Results: The Dataset Search application consists of five core features: a streamlined browse and search interface, a data model based on dataset discovery, a harvesting process for finding and vetting datasets stored in external repositories, an administrative interface for managing the creation, ingest, and maintenance of dataset records, and a dataset visualization interface to demonstrate how data is produced and used by MSU researchers.
Conclusion: The Dataset Search application is designed to be easily customized and implemented by other institutions. Indexes like Dataset Search can improve search and discovery for content archived in data repositories, thereby amplifying the impact and benefits of archived data.
This commentary describes the experience of attending RDAP 2020 remotely after the author’s trip cancellation due to COVID-19 travel restrictions. The author describes the highs and lows of the remote viewing experience, and the potential future landscape of virtual conferences and remote attendance. Maintaining networking and casual conversation during a virtual conference is an area that needs improvement but has potential. Takeaways from several conference sessions, including the keynote speaker, are also included along with discussion of how the author learned valuable information or could apply the topics to her own work.
Objective: As electronic laboratory notebook (ELN) capability continues to expand, more researchers are turning to this digital format. The University of Massachusetts Medical School developed new guidelines to outline the retention and transferal of ELNs. How do other universities approach the retention and transferal of laboratory notebooks, including ELNs?
Methods: The websites of 25 universities were searched for policies or guidelines on laboratory notebook retention and transferal. A textual analysis of the policies was performed to find common themes.
Results: Information on the retention and transferal of laboratory notebooks was found in record retention and research data policies/guidelines. Out of the 25 institutional websites searched, 16 policies/guidelines on research notebook retention were found, and 10 institutions had policies/guidelines on transferring research notebooks when a researcher leaves the university. Only one policy had a retention recommendation for storage location specific to electronic media, including laboratory notebooks, that did not apply to its paper counterparts; the remaining policies either explicitly included multiple forms and media or did not mention multiple formats for research records at all. The minimum retention period for research notebooks ranged from immediately after report completion to 7 years after completing the research, with the possibility of extension depending on a wide range of external requirements. Most research notebook transferal policies and guidelines required associated researchers and students to request permission from their principal investigator (PI) before taking a copy of the notebook. Most institutions with policies also seek to retain access to research notebooks when a PI leaves an institution, both to protect intellectual property and to respond to any cases of scientific misconduct or conflict of interest.
Conclusions: Other universities have a range of approaches for the retention and transferal of laboratory notebooks, but most provide the same recommendations for both electronic and physical laboratory notebooks in their research data or record retention policies/guidelines.
Key themes in Dickens’ novel (transformation and resurrection, darkness and light, and social justice) are firmly connected to the work being done in data. Data librarians can make a difference in times like these: resurrecting data; transforming how students, researchers, or the public think about and use data; unearthing and bringing to light historical data that will give context and meaning to an issue; and providing accessible data that can help address, and perhaps solve, social justice issues.
Objective: This eScience in Action article describes the collaborative development process and outputs for a qualitative data curation curriculum initiative led by a library faculty member (a research data specialist) at an R1 research university.
Methods: The collaborative curriculum development activities described in this article took place between 2015 and 2020 and included 1) a college-wide “call out” meeting with graduate methods instructors and additional one-on-one conversations, 2) a year-long training series for disciplinary faculty teaching graduate-level qualitative research methods courses, 3) guest lectures and co-curricular workshops, and 4) the development of a credit-bearing graduate-level course.
Results: This practice-based article includes a reflection on the collaborative curriculum development process and impacts, including the development of networks between the Library and qualitative researchers across campus. The article provides a proof-of-concept example for developing relevant and trustworthy library data services for humanities and qualitative social-science researchers.
Conclusions: Curriculum development activities focused predominantly upon researcher-centered perspectives and identified needs. However, changes in institutional expectations for library faculty (i.e., a requirement to teach credit-bearing courses) played a major role in how the curriculum was implemented, in its impact, and in the continued sustainability of its outputs.
Researchers are faced with unprecedented challenges due to the size and complexity of data, and libraries are stepping in to help by providing guidance on research data management primarily to graduate students and faculty. Currently, many universities are encouraging an undergraduate research experience where students engage in research projects in the classroom and in research labs, yet research data management is often not included as part of these opportunities. At UW-Madison, we piloted researchERS (Emerging Research Scholars), a program for undergraduates from all disciplines to learn data management skills. Focusing on core concepts as well as data ethics, reproducibility, and research workflows, the format of the program included seven evening workshops, two networking events, and one field trip. Each workshop invited campus and community speakers relevant to the workshop’s theme as a way to introduce the students to the network of available resources and data expertise and provided food for attendees. The workshops also built in customized activities to show students how to incorporate best practices into their work. Local businesses provided a tour of their facilities as well as a talk on how they leverage data. This paper will describe this program as well as the benefits and drawbacks of tailoring a research data management program toward undergraduates.
Objective: Data curation is becoming widely accepted as a necessary component of data sharing. Yet, as there are so many different types of data with various curation needs, the Data Curation Network (DCN) project anticipated that a collaborative approach to data curation across a network of repositories would expand what any single institution might offer alone. Now, halfway through a three-year implementation phase, we’re testing our assumptions using one year of data from the DCN.
Methods: Ten institutions participated in the implementation phase of a shared staffing model for curating research data. Starting on January 1, 2019, for 12 months we tracked the number, file types, and disciplines represented in data sets submitted to the DCN. Participating curators were matched to data sets based on their self-reported curation expertise. Aspects such as curation time, level of satisfaction with the assignment, and lack of appropriate expertise in the network were tracked and analyzed.
Results: Seventy-four data sets were submitted to the DCN in year one. Seventy-one of them were successfully curated by DCN curators. Each curation assignment took 2.4 hours on average, and data sets took a median of three days to pass through the network. By analyzing the domains and file types of first-year submissions, we found that our coverage is well represented across domains and that our capacity is higher than the demand, but we also observed that the higher volume of data containing software code relied on certain curators' expertise more often than others', creating a potential imbalance.
Conclusions: The data from year one of the DCN pilot have verified key assumptions about our collaborative approach to data curation, and these results have raised additional questions about capacity, equitable use of network resources, and sustained growth that we hope to answer by the end of this implementation phase.
Objectives: This small-scale study explores the current state of connections between open data and open access (OA) articles in the life sciences.
Methods: This study involved 44 openly available life sciences datasets from the Illinois Data Bank that had 45 related research articles. For each article, I gathered the OA status of the journal and the article on the publisher website and checked whether the article was openly available via Unpaywall and ResearchGate. I also examined how and where the open data was included in the HTML and PDF versions of the related articles.
Results: Of the 45 articles studied, fewer than half were published in Gold/Full OA journals, and while the remaining articles were published in Gold/Hybrid journals, none of them were OA. This study found that OA articles pointed to the Illinois Data Bank datasets similarly to all of the related articles, most commonly with a data availability statement containing a DOI.
Conclusions: The findings indicate that Gold OA in hybrid journals does not appear to be a popular option, even for articles connected to open data, and this study emphasizes the importance of data repositories providing DOIs, since the related articles frequently used DOIs to point to the Illinois Data Bank datasets. This study also revealed concerns about free (not licensed OA) access to articles on publisher websites, which will be a significant topic for future research.
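The abstract above does not say how Unpaywall was consulted. As one possibility, the check could be scripted against Unpaywall's public v2 REST endpoint, which takes a DOI and a contact email and returns a JSON record with an `oa_status` field. The sketch below only builds the request URL and tallies statuses from records that have already been retrieved; the helper names and the sample records are hypothetical, and the endpoint details should be confirmed against the current Unpaywall documentation.

```python
from collections import Counter
from urllib.parse import quote


def unpaywall_url(doi, email):
    """Build a lookup URL for one DOI (assumed v2 endpoint; email is a required parameter)."""
    return f"https://api.unpaywall.org/v2/{quote(doi, safe='/')}?email={quote(email)}"


def tally_oa_status(records):
    """Count records by their `oa_status` value, treating missing values as 'unknown'."""
    return Counter(r.get("oa_status") or "unknown" for r in records)


# Hypothetical, already-fetched records reduced to the field of interest.
sample = [{"oa_status": "gold"}, {"oa_status": "closed"}, {"oa_status": "gold"}, {}]
print(unpaywall_url("10.1000/182", "me@example.org"))
print(tally_oa_status(sample))
```

Keeping the network call separate from the tally makes the counting step easy to rerun on saved responses, which matters because an article's OA status can change over time.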
There are many courses available to teach research data management to librarians and researchers. While these courses can help with technical skills, like programming or statistics, and practical knowledge of data life cycles or data sharing policies, there are “soft skills” and non-technical skills that are needed to successfully start and run data services. While there are many important characteristics of a good data librarian, reference skills, relationship building, collaboration, listening, and facilitation are some of the most important. Giving consideration to these skills will help any data librarian with their multifaceted job.
Objective: Evaluate and examine Data Literacy (DL) in the supported disciplines of four liaison librarians at a large research university.
Methods: Using a framework developed by Prado and Marzal (2013), the study analyzed 378 syllabi from a two-year period across six departments—Criminal Justice, Geography, Geology, Journalism, Political Science, and Sociology—to see which classes included DLs.
Results: The study determined which classes addressed specific DLs and where those classes might need more support in other DLs. The most common DLs being taught in courses are Reading, Interpreting, and Evaluating Data, and Using Data. The least commonly taught are Understanding Data and Managing Data skills.
Conclusions: While all disciplines touched on data in some way, there is clear room for librarians to support DLs in the areas of Understanding Data and Managing Data.
The Journal of eScience Librarianship has partnered with the Research Data Access & Preservation (RDAP) Association for a second year to publish selected conference proceedings. This issue highlights the research presented at the RDAP 2019 Summit and the community it has fostered.