Novel ideas often experience resistance from incumbent forces. While evidence of the bias against novelty has been widely identified in science, there is still a lack of large-scale quantitative work to study this problem occurring in the prepublication process of manuscripts. This paper examines the association between manuscript novelty and handling time of publication based on 778,345 articles in 1,159 journals indexed by PubMed. Measuring the novelty as the extent to which manuscripts disrupt existing knowledge, we found systematic evidence that higher novelty is associated with longer handling time. Matching and fixed-effect models were adopted to confirm the statistical significance of this pattern. Moreover, submissions from prestigious authors and institutions have the advantage of shorter handling time, but this advantage diminishes as manuscript novelty increases. In addition, we found that longer handling time is negatively related to the impact of manuscripts, while the relationships between novelty and 3- and 5-year citations are U-shaped. This study expands the existing knowledge of the novelty bias by examining its existence in the prepublication process of manuscripts.
Academic research often draws on multiple funding sources. This paper investigates whether complementarity or substitutability emerges when different types of funding are used. Scholars have examined this phenomenon at the university and scientist levels, but not at the publication level. This gap is significant since acknowledgement sections in scientific papers indicate publications are often supported by multiple funding sources. To address this gap, we examine the extent to which different funding types are jointly used in publications, and to what extent certain combinations of funding are associated with higher academic impact (citation count). We focus on three types of funding accessed by UK-based researchers: national, international, and industry. The analysis builds on data extracted from all UK cancer-related publications in 2011, thus providing a 10-year citation window. Findings indicate that, although there is complementarity between national and international funding in terms of their co-occurrence (where these are acknowledged in the same publication), when we evaluate funding complementarity in relation to academic impact (we employ the supermodularity framework), we found no evidence of such a relationship. Rather, our results suggest substitutability between national and international funding. We also observe substitutability between international and industry funding.
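The supermodularity condition referred to above can be illustrated with a minimal sketch. It assumes impact is summarized as a single mean citation count per funding combination; the function names and all numbers are hypothetical and are not taken from the study. Two funding types are complementary when the gain from adding one is larger in the presence of the other, that is, when f(1,1) + f(0,0) >= f(1,0) + f(0,1), and substitutable when the inequality is reversed.

```python
def supermodularity_gap(f00, f10, f01, f11):
    """Return f(1,1) + f(0,0) - f(1,0) - f(0,1).

    f00: impact with neither funding type
    f10: impact with only the first funding type
    f01: impact with only the second funding type
    f11: impact with both funding types
    """
    return (f11 + f00) - (f10 + f01)


def classify(gap, tol=1e-9):
    """Label the relationship implied by the sign of the gap."""
    if gap > tol:
        return "complementary"
    if gap < -tol:
        return "substitutable"
    return "independent"


# Hypothetical mean citation counts per funding combination (illustration only).
example = {
    "neither": 10.0,
    "national_only": 14.0,
    "international_only": 13.0,
    "both": 15.0,
}

gap = supermodularity_gap(
    example["neither"],
    example["national_only"],
    example["international_only"],
    example["both"],
)
print(gap, classify(gap))
```

A real analysis would estimate these quantities from a regression with controls and test the inequality statistically; the sketch only shows the direction of the comparison.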
The automatic summarization of scientific articles differs from other text genres because of the structured format and longer text length. Previous approaches have focused on tackling the lengthy nature of scientific articles, aiming to improve the computational efficiency of summarizing long text using a flat, unstructured abstract. However, the structured format of scientific articles and characteristics of each section have not been fully explored, despite their importance. The lack of a sufficient investigation and discussion of various characteristics for each section and their influence on summarization results has hindered the practical use of automatic summarization for scientific articles. To provide a balanced abstract proportionally emphasizing each section of a scientific article, the community introduced the structured abstract, an abstract with distinct, labeled sections. Using this information, in this study, we aim to understand tasks ranging from data preparation to model evaluation from diverse viewpoints. Specifically, we provide a preprocessed large-scale dataset and propose a summarization method applying the introduction, methods, results, and discussion (IMRaD) format reflecting the characteristics of each section. We also discuss the objective benchmarks and perspectives of state-of-the-art algorithms and present the challenges and research directions in this area.
How scholars and IRBs perceive and apply the Belmont principles in crowd work-based research was an open and largely neglected question. As crowd work becomes increasingly popular for scholars to implement research and collect data, such negligence, signaling a lack of attention to the ethical issues in crowd work-based research more broadly, seemed alarming. To fill this gap, we conducted a qualitative study with 32 scholars and IRB directors/analysts in the United States to inquire into their perceptions and applications of the Belmont principles in crowd work-based research. We found two dilemmas in applying the Belmont principles in crowd work-based research, namely the dilemma between the dehumanization and expected autonomy of crowd workers, and the dilemma between the monetary incentive/reputational risks and the conventional notion of research benefits/risks. We also compared the scholars' and IRBs' ethical perspectives and proposed our research implications for future work.
MEDLINE is the National Library of Medicine's (NLM) journal citation database. It contains over 28 million references to biomedical and life science journal articles, and a key feature of the database is that all articles are indexed with NLM Medical Subject Headings (MeSH). The library employs a team of MeSH indexers, and in recent years they have been asked to index close to 1 million articles per year in order to keep MEDLINE up to date. An important part of the MEDLINE indexing process is the assignment of articles to indexers. High quality and timely indexing is only possible when articles are assigned to indexers with suitable expertise. This article introduces the NLM indexer assignment dataset: a large dataset of 4.2 million indexer article assignments for articles indexed between 2011 and 2019. The dataset is shown to be a valuable testbed for expert matching and assignment algorithms, and indexer article assignment is also found to be useful domain-adaptive pre-training for the closely related task of reviewer assignment.
Research information management systems (RIMS) have become critical components of information technology infrastructure on university campuses. They are used not just for sharing and promoting faculty research, but also for conducting faculty evaluation and development, facilitating research collaborations, identifying mentors for student projects, and expert consultants for local businesses. This study is one of the first empirical investigations of the structure of researchers' scholarly profile maintenance activities in a nonmandatory institutional RIMS. By analyzing the RIMS's log data, we identified 11 tasks researchers performed when updating their profiles. These tasks were further grouped into three activities: (a) adding publication, (b) enhancing researcher identity, and (c) improving research discoverability. In addition, we found that junior researchers and female researchers were more engaged in maintaining their RIMS profiles than senior researchers and male researchers. The results provide insights for designing profile maintenance action templates for institutional RIMS that are tailored to researchers' characteristics and help enhance researchers' engagement in the curation of their research information. This also suggests that female and junior researchers can serve as early adopters of institutional RIMS.
Ensuring Wikipedia cites scholarly publications based on quality and relevancy without biases is critical to credible and fair knowledge dissemination. We investigate gender- and country-based biases in Wikipedia citation practices using linked data from the Web of Science and a Wikipedia citation dataset. Using coarsened exact matching, we show that publications by women are cited less by Wikipedia than expected, and publications by women are less likely to be cited than those by men. Scholarly publications by authors affiliated with non-Anglosphere countries are also disadvantaged in getting cited by Wikipedia, compared with those by authors affiliated with Anglosphere countries. The level of gender- or country-based inequalities varies by research field, and the gender-country intersectional bias is prominent in math-intensive STEM fields. To ensure the credibility and equality of knowledge presentation, Wikipedia should consider strategies and guidelines to cite scholarly publications independent of the gender and country of authors.
This study examines trends in open access article processing charges (APCs) from 2011 to 2021, building on a 2011 study by Solomon and Björk. Two methods are employed: a modified replica and a status update of the 2011 journals. Data are drawn from multiple sources and datasets are available as open data. Most journals do not charge APCs; this has not changed. The global average per-journal APC increased slightly, from 906 to 958 USD, while the per-article average increased from 904 to 1,626 USD, indicating that authors choose to publish in more expensive journals. Publisher size, type, impact metrics, and subject affect charging tendencies, average APC, and pricing trends. Half the journals from the 2011 sample are no longer listed in DOAJ in 2021, due to ceased publication or publisher de-listing. Conclusions include a caution about the potential of the APC model to increase costs beyond inflation. The university sector may be the most promising approach to economically sustainable no-fee OA journals. Universities publish many OA journals and nearly half of OA articles, tend not to charge APCs, and, when APCs are charged, the prices are very low on average.
The coronavirus pandemic changed the way science is done and shared, which motivates meta-research to help understand science communication in crises and improve its effectiveness. The objective is to study how many Spanish scientific papers on COVID-19 published during 2020 share their research data. This is a qualitative and descriptive study applying nine attributes: (a) availability, (b) accessibility, (c) format, (d) licensing, (e) linkage, (f) funding, (g) editorial policy, (h) content, and (i) statistics. We analyzed 1,340 papers, of which 1,173 (87.5%) did not have research data. A total of 12.5% share their research data: 2.1% share their data in repositories, 5% share their data through a simple request, 0.2% do not have permission to share their data, and 5.2% share their data as supplementary material. Only a small percentage of papers share their research data, and the way they do so demonstrates the researchers' poor knowledge of how to share research data properly and of what constitutes research data.
Automated text categorization methods are of broad relevance for domain experts since they free researchers and practitioners from manual labeling, save their resources (e.g., time, labor), and enrich the data with information helpful to study substantive questions. Despite a variety of newly developed categorization methods that require substantial amounts of annotated data, little is known about how to build models when (a) labeling texts with categories requires substantial domain expertise and/or in-depth reading, (b) only a few annotated documents are available for model training, and (c) no relevant computational resources, such as pretrained models, are available. In a collaboration with environmental scientists who study the socio-ecological impact of funded biodiversity conservation projects, we develop a method that integrates deep domain expertise with computational models to automatically categorize project reports based on a small sample of 93 annotated documents. Our results suggest that domain expertise can improve automated categorization and that the magnitude of these improvements is influenced by the experts' understanding of categories and their confidence in their annotation, as well as data sparsity and additional category characteristics such as the portion of exclusive keywords that can identify a category.
Compared to previous studies that generally detect scientific breakthroughs based on citation patterns, this article proposes a knowledge entity-based disruption indicator by quantifying the change of knowledge directly created and inspired by scientific breakthroughs to their evolutionary trajectories. Two groups of analytic units, including MeSH terms and their co-occurrences, are employed independently by the indicator to measure the change of knowledge. The effectiveness of the proposed indicators was evaluated against the four datasets of scientific breakthroughs derived from four recognition trials. In terms of identifying scientific breakthroughs, the proposed disruption indicator based on MeSH co-occurrences outperforms that based on MeSH terms and three earlier citation-based disruption indicators. It is also shown that in our indicator, measuring the change of knowledge inspired by the focal paper in its evolutionary trajectory is a larger contributor than measuring the change created by the focal paper. Our study not only offers empirical insights into the conceptual understanding of scientific breakthroughs but also provides a practical disruption indicator for scientists and science management agencies searching for valuable research.
Governmental and organizational policy increasingly claims to be data-driven, data-informed, or knowledge-driven. We explore the data practices of local governments and nonprofits seeking to end homelessness in the City of Austin. Drawing on 31 interviews with stakeholders, alongside the reflections and experiences of our interdisciplinary, cross-sector collaborative team, we consider the role of data in guiding and informing interventions and policy regarding homelessness. Ending homelessness is a particularly challenging scenario for intervention, with increasing politicization, changing circumstances, and a need for rapid intervention to reduce harm. In exploring some implications of data science “in the wild” as it is deployed, understood, and supported within the Travis County Continuum of Care (CoC), we analyze how data-intensive work connects and engages across disciplinary boundaries. Furthermore, we consider how data science and the iField can collaborate in addressing complex social problems as advisors and partners with invested organizations.
The transition from secondary education to higher education can be challenging for most freshmen. For students who fail to adjust to university life smoothly, their status may worsen if the university cannot offer timely and proper guidance. Helping students adapt to university life is a long-term goal for any academic institution. Therefore, understanding the nature of the maladaptation phenomenon and the early prediction of “at-risk” students are crucial tasks that urgently need to be tackled effectively. This article aims to analyze the relevant factors that affect the maladaptation phenomenon and predict this phenomenon in advance. We develop a prediction framework (MAladaptive STudEnt pRediction, MASTER) for the early prediction of students with maladaptation. First, our framework uses the SMOTE (Synthetic Minority Oversampling Technique) algorithm to solve the data label imbalance issue. Moreover, a novel ensemble algorithm, priority forest, is proposed for outputting ranks instead of binary results, which enables us to perform proactive interventions in a prioritized manner where limited education resources are available. Experimental results on real-world education datasets demonstrate that the MASTER framework outperforms other state-of-the-art methods.
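The oversampling step named above can be sketched in a few lines. This is a minimal, standard-library illustration of the general SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours); it is not the MASTER implementation, and the function name, parameters, and data points are assumptions made for illustration.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by interpolation.

    minority: list of equal-length numeric tuples (minority-class samples)
    k:        number of nearest minority neighbours to choose from
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the starting point.
        base_index = rng.randrange(len(minority))
        base = minority[base_index]
        # Find its k nearest minority neighbours (excluding itself).
        others = [p for i, p in enumerate(minority) if i != base_index]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Place the new sample at a random position on the segment
        # between the base sample and the chosen neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical two-feature minority samples (illustration only).
at_risk = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_samples = smote(at_risk, n_new=5)
print(new_samples)
```

Because every synthetic point lies on a segment between two existing minority samples, the new data stay inside the region already occupied by the minority class. In practice a maintained library implementation would normally be preferred over a hand-written one.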
Understanding the factors that influence trust in public health information is critical for designing successful public health campaigns during pandemics such as COVID-19. We present findings from a cross-sectional survey of 454 US adults—243 older (65+) and 211 younger (18–64) adults—who responded to questionnaires on human values, trust in COVID-19 information sources, attention to information quality, self-efficacy, and factual knowledge about COVID-19. Path analysis showed that trust in direct personal contacts (B = 0.071, p = .04) and attention to information quality (B = 0.251, p < .001) were positively related to self-efficacy for coping with COVID-19. The human value of self-transcendence, which emphasizes valuing others as equals and being concerned with their welfare, had significant positive indirect effects on self-efficacy in coping with COVID-19 (mediated by attention to information quality; effect = 0.049, 95% CI 0.001–0.104) and factual knowledge about COVID-19 (also mediated by attention to information quality; effect = 0.037, 95% CI 0.003–0.089). Our path model offers guidance for fine-tuning strategies for effective public health messaging and serves as a basis for further research to better understand the societal impact of COVID-19 and other public health crises.
Coauthorship prediction applies predictive analytics to bibliographic data to predict authors who are highly likely to be coauthors. In this study, we propose an approach for coauthorship prediction based on bibliographic network embedding through a graph-based bibliographic data model that can be used to model common bibliographic data, including papers, terms, sources, authors, departments, research interests, universities, and countries. A real-world dataset released by AMiner that includes more than 2 million papers, 8 million citations, and 1.7 million authors was integrated into a large bibliographic network using the proposed bibliographic data model. Translation-based methods were applied to the entities and relationships to generate their low-dimensional embeddings while preserving their connectivity information in the original bibliographic network. We applied machine learning algorithms to embeddings that represent the coauthorship relationships of the two authors and achieved high prediction results. The reference model, which is the combination of a network embedding size of 100, the most basic translation-based method, and a gradient boosting method, achieved an F1 score of 0.9, and even higher scores are obtainable with different embedding sizes and more advanced embedding methods. Thus, the strengths of the proposed approach lie in its customizable components under a unified framework.
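To make the idea of a translation-based embedding concrete, the sketch below scores a triple in the style of TransE, where a relation is modeled as a vector that translates the head entity toward the tail entity. The abstract does not name the specific method used, so treating it as TransE-like is an assumption; the entity names, relation name, and two-dimensional vectors are hypothetical and chosen by hand rather than learned.

```python
import math


def transe_score(head, relation, tail):
    """Return the negative Euclidean distance between head + relation and tail.

    Scores closer to zero mean the triple (head, relation, tail) fits the
    translation assumption better.
    """
    return -math.sqrt(
        sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail))
    )


# Hypothetical hand-picked embeddings (real ones would be learned).
author_a = (0.1, 0.2)
author_b = (0.4, 0.1)
author_c = (-0.5, 0.9)
coauthor_of = (0.3, -0.1)

score_ab = transe_score(author_a, coauthor_of, author_b)
score_ac = transe_score(author_a, coauthor_of, author_c)
print(score_ab, score_ac)
```

Here author_b sits almost exactly at author_a + coauthor_of, so that triple scores higher than the one involving author_c. In the approach described above, the learned embeddings are then passed to a separate classifier rather than being ranked by this score alone.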
Scholarly publications are often regarded as “information” by default. They are collected, organized, preserved, and made accessible as knowledge records. However, the instances of article retraction, misconduct and malpractices of researchers and the replication crisis have raised concerns about the informativeness and evidential qualities of information. Among many factors, knowledge production has moved away from “normal science” under the systemic influences of platformization involving the datafication and commodification of scholarly articles, research profiles and research activities. This article aims to understand the platformization of information by examining how research practices and knowledge production are steered by market and platform mechanisms in four ways: (a) ownership of information; (b) metrics for sale; (c) relevance by metrics, and (d) market-based competition. In conclusion, the article argues that information is platformized when platforms hold the dominating power in determining what kinds of information can be disseminated and rewarded and when informativeness is decoupled from the normative agreement or consensus co-constructed and co-determined in an open and public discourse.
While there is a considerable amount of interest in information and communication technologies for development (ICT4D) in the Indigenous communities, it remains limited to those who can afford it and have the skills and knowledge to implement the technology and access appropriate digital tools. Hence, Indigenous communities are continually stigmatized as marginalized, leading to a cultural misrepresentation of histories that affects the continuing information disparity between Indigenous and Western knowledge systems, particularly the insufficient technology infrastructure designed for traditional users. In this article, ICT4D was conceptualized as a digital platform to support Senior Ngarrindjeri Elder Aunty Ellen Trevorrow in continuing her practice of weaving and storytelling throughout the pandemic. In this context, the community-based participatory research (CBPR) principles within the structure of video ethnography were qualitatively designed to implement the ICT4D project culturally and ethically. Video recordings, image data, transcriptions, and the Ngarrindjeri ICT4D Pondi (Murray Cod) framework were embedded to justify the findings and the aim of illustrating Aunty Ellen's knowledge-sharing process to online learners. Likewise, the results demonstrate the positive and negative impact of COVID-19 on the continuity and orality of Aunty Ellen's cultural stories and practices. The future continuity of Aunty Ellen's knowledge ought to consider the inconsistency of technological infrastructure in regional areas, her waning health, and the interconnectedness of oral expertise, which often pose challenges. This study is a small step toward a better understanding of the value of oral knowledge; emphasizing the creation of e-learning weaving instructional videos is valuable for future digital management of Indigenous knowledge relevant to LIS.
This study investigates scholars' citation behaviors from a fine-grained perspective. Specifically, each scholarly citation is considered multidimensional rather than logically unidimensional (i.e., present or absent). Thirty million articles from PubMed were accessed for use in empirical research, in which a total of 15 interpretable features of scholarly citations were constructed and grouped into three main categories. Each category corresponds to one aspect of the reasons and motivations behind scholars' citation decision-making during academic writing. Using about 500,000 pairs of actual and randomly generated scholarly citations, a series of Random Forest-based classification experiments were conducted to quantitatively evaluate the correlation between each constructed citation feature and citation decisions made by scholars. Our experimental results indicate that citation proximity is the category most relevant to scholars' citation decision-making, followed by citation authority and citation inertia. However, big-name scholars whose h-indexes rank among the top 1% exhibit a unique pattern of citation behaviors—their citation decision-making correlates most closely with citation inertia, with the correlation nearly three times as strong as that of their ordinary counterparts. Hopefully, the empirical findings presented in this paper can bring us closer to characterizing and understanding the complex process of generating scholarly citations in academia.
Information and communication technology for development (ICT4D) research sporadically leverages information science scholarship. Our qualitative study employs the “information grounds” (IG) lens to investigate the consequences of information exchanges by pregnant women on Facebook, who are vulnerable in the doctor-centric birth culture in rural America. The thematic analysis of in-depth interviews with members and administrators of the Vaginal Birth After Cesarean (VBAC) group shows that positive consequences outweigh negative consequences of information exchanges and lead to the following progression of outcomes: (a) VBAC group as an information ground, (b) social capital (e.g., cognitive, structural, and relational capital) built on the information ground, (c) seven emergent properties of the information ground, and (d) value co-created (e.g., local, affordable, timely, enduring, and reliable support) by VBAC group members. The IG lens reveals the following roles of Facebook, an ICT, in development: (a) a linker that lets people with similar needs and interests convene and shapes their interactions, (b) a prerequisite to building an online, “third place” for social interactions, and (c) an apparatus for ubiquitously seeking, searching, sharing, and storing information in multiple formats and controlling its flow on the VBAC group. This paper fills in six gaps in the ICT4D research.