Digital repositories of research articles are growing at a rapid rate, and hence finding the right paper is becoming a tedious task for researchers. A research paper recommendation system can help researchers in this context. In designing such a system, proper representation of articles, more specifically feature identification and extraction, are two essential tasks. Existing approaches mainly consider direct features that are readily available from research articles. However, certain features are not readily available from a paper yet may greatly influence the performance of a recommendation system. This paper proposes four indirect features to represent a research article: keyword diversification, text complexity, citation analysis over time, and scientific quality measurement. Keyword diversification measures the uniqueness of a paper's keywords, which supports variation in recommendations. Text complexity measurement helps match a paper to the user's level of understanding. Citation analysis over time assesses the continuing relevance of a paper. Scientific quality measurement estimates the scientific value of a paper. Formal definitions of the proposed indirect features, schemes to extract the feature values from a given research article, and metrics to measure them quantitatively are discussed in this paper. To substantiate the efficacy of the proposed features, a number of experiments have been carried out. The experimental results reveal that the proposed indirect features define a research article more uniquely than the direct features. Given a research paper, extraction of its feature vector is computationally fast, making it feasible to filter a large corpus of papers in real time. More significantly, indirect features can be matched against user profile features, thus satisfying an important criterion in collaborative filtering.
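The abstract does not give a formula for keyword diversification; below is a minimal Python sketch under a hypothetical IDF-style reading, in which a keyword that appears in few other papers' keyword sets counts as more unique and a paper's score is the mean over its keywords. The corpus and function name are invented for illustration.

```python
import math

def keyword_diversification(paper_keywords, corpus_keyword_sets):
    """Hypothetical IDF-style uniqueness score: keywords that are rare
    across the corpus contribute more; the paper score is the mean."""
    n = len(corpus_keyword_sets)
    total = 0.0
    for kw in paper_keywords:
        df = sum(1 for ks in corpus_keyword_sets if kw in ks)  # document frequency
        total += math.log(n / (1 + df))
    return total / len(paper_keywords)

corpus = [
    {"recommendation", "deep learning"},
    {"recommendation", "collaborative filtering"},
    {"ontology", "semantic web"},
    {"recommendation", "citation analysis"},
]
common = keyword_diversification(["recommendation"], corpus)   # in 3 of 4 sets
rare = keyword_diversification(["citation analysis"], corpus)  # in 1 of 4 sets
```

Under this reading, widely shared keywords score near zero while rare ones score higher, which is what would let such a feature promote variety in recommendations.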
Metadata are fundamental for the indexing, browsing and retrieval of cultural heritage resources in repositories, digital libraries and catalogues. In order to be effectively exploited, metadata information has to meet some quality standards, typically defined in the collection usage guidelines. As manually checking the quality of metadata in a repository may not be affordable, especially in large collections, in this paper we specifically address the problem of automatically assessing the quality of metadata, focusing in particular on textual descriptions of cultural heritage items. We describe a novel approach based on machine learning that tackles this problem by framing it as a binary text classification task aimed at evaluating the accuracy of textual descriptions. We report our assessment of different classifiers using a new dataset that we developed, containing more than 100K descriptions. The dataset was extracted from different collections and domains from the Italian digital library “Cultura Italia” and was annotated with accuracy information in terms of compliance with the cataloguing guidelines. The results empirically confirm that our proposed approach can effectively support curators (F1 \(\sim \) 0.85) in assessing the quality of the textual descriptions of the records in their collections and provide some insights into how training data, specifically their size and domain, can affect classification performance.
The frequency at which new research documents are being published causes challenges for researchers who increasingly need access to relevant documents in order to conduct their research. Searching across a variety of databases and browsing millions of documents to find semantically relevant material is a time-consuming task. Recently, there has been a focus on recommendation algorithms that suggest relevant documents based on the current interests of the researchers. In this paper, we describe the implementation of seven commonly used algorithms and three aggregation algorithms. We evaluate the recommendation algorithms in a large-scale biomedical knowledge base with the goal of identifying relative weaknesses and strengths of each algorithm. We analyze the recommendations from each algorithm based on assessments of output as evaluated by 14 biomedical researchers. The results of our research provide unique insights into the performance of recommendation algorithms against the needs of modern-day biomedical researchers.
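The three aggregation algorithms are not named in the abstract; a Borda count is one common, simple way to combine ranked recommendation lists and serves here purely as an illustrative sketch, not as one of the algorithms the paper evaluates:

```python
def borda_aggregate(rankings):
    """Combine several ranked lists: each list awards len(list) - position
    points to every document it ranks; documents are re-ranked by total
    points (ties broken alphabetically for determinism)."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0) + (n - pos)
    return sorted(scores, key=lambda d: (-scores[d], d))
```

For example, aggregating ["a", "b", "c"], ["b", "a", "c"] and ["b", "c", "a"] gives b 8 points, a 6 and c 4, so the fused ranking is b, a, c.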
Purpose: Publishing research data for reuse has become good practice in recent years. However, not much is known about how researchers actually find such data. In this exploratory study, we observe the information-seeking behaviour of social scientists searching for research data to reveal impediments and identify opportunities for data search infrastructure. Methods: We asked 12 participants to search for research data and observed them in their natural environment. The sessions were recorded. Afterwards, we conducted semi-structured interviews to gain a thorough understanding of their way of searching. From the recordings, we extracted the participants' interaction behaviour and analysed the words spoken both during the search task and the interview by creating affinity diagrams. Results: We found that literature search is more closely intertwined with dataset search than previous literature suggests. Both the search itself and the relevance assessment are very complex, and many different strategies are employed, including the creative "misuse" of existing tools, since appropriate tools either do not exist or are unknown to the participants. Conclusion: Many of the issues we found relate directly or indirectly to the application of the FAIR principles, but some, like a greater need for dataset search literacy, go beyond that. Both the infrastructure and the tools offered for dataset search could be tailored more tightly to the observed work processes, particularly by offering more interconnectivity between datasets, literature, and other relevant materials.
Sheet music scores have been the traditional way to preserve and disseminate western classical music works for centuries. Nowadays, their content can be encoded in digital formats that yield a very detailed representation of music content expressed in the language of music notation. These digital scores therefore constitute an invaluable asset for digital library services such as search, analysis, clustering, recommendation, and synchronization with audio files. Digital scores, like any other published data, may suffer from quality problems. For instance, they can contain incomplete or inaccurate elements. As a "dirty" dataset may be an unsuitable input for some use cases, users need to be able to estimate the quality level of the data they are about to use. This article presents the data quality management framework for digital score libraries (DSL) designed by the GioQoso multi-disciplinary project. It relies on a content model that identifies several information levels that are unfortunately blurred in digital score encodings. This content model then serves as a foundation for organizing the categories of quality issues that can occur in a music score, leading to a quality model. The quality model also positions each issue with respect to potential usage contexts, allowing a consistent set of indicators to be attached that together measure how fit a given score is for a specific usage. We finally report an implementation of these conceptual foundations in an online DSL.
The automated recommendation of content resources to learners is one of the most promising functions of educational digital libraries. Underlying strategies should take the individual progress of the learner into account to provide appropriate recommendations that are meaningful to the learner. If presented with appropriate assistance, learners will more likely engage in productive learning strategies, such as reading up on concepts and accessing preparatory materials, and refrain from unproductive behavior, such as guessing on or copying of homework. In this exploratory case study, we are analyzing transactional data within an educational digital library of online physics homework problems and learning content. The sequence of events starting with a learner failing to solve a particular problem, interacting with other online resources, and then succeeding on that same problem is used to identify potentially helpful resources for future learners. It was found that these “success stories” indeed allow for providing recommendations with acceptable accuracy, which, when implemented, may lead to more productive learning paths.
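The paper's exact mining procedure is not given in the abstract; the sketch below shows one plausible reading of the "success story" pattern, collecting the resources a learner accessed between failing a problem and later solving that same problem. All tuple shapes and action names are invented for illustration.

```python
from collections import defaultdict

def success_story_resources(events):
    """Collect, per problem, the resources users viewed between failing
    that problem and later succeeding on it. Event tuples and the action
    names ('fail', 'view', 'success') are hypothetical."""
    open_failures = defaultdict(set)   # user -> problems currently failed
    pending = defaultdict(list)        # (user, problem) -> resources viewed since failure
    helpful = defaultdict(set)         # problem -> resources credited by success stories
    for user, action, item in events:
        if action == "fail":
            open_failures[user].add(item)
            pending[(user, item)] = []
        elif action == "view":
            for problem in open_failures[user]:
                pending[(user, problem)].append(item)
        elif action == "success" and item in open_failures[user]:
            helpful[item].update(pending.pop((user, item)))
            open_failures[user].discard(item)
    return helpful
```

Note that a success without a preceding failure (the second user below) contributes no story, matching the fail-interact-succeed sequence described above.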
Digital Humanities projects have proven to be endeavours of collaboration and interdisciplinary cooperation. As digital humanities researchers create and discover new methods of collaboration, libraries have been reflecting on how they can best support and nurture such collaborations. This paper demonstrates a practical case of what the University of South Florida Digital Collections has done to support the accessibility and discoverability of a new archaeological dataset in collaboration with a Digital Humanities research facility on campus. It is a case study showcasing the process employed by the University of South Florida Libraries in its partnership with an on-campus digital humanities institute to create a new digital collection. The partnership resulted in a prototype archaeological data repository named the Andean Archaeological Data Project, and a digital collection that was successfully housed in the existing digital library platform, through which public access to the material is natively enabled through the library web pages. The authors conclude that, with the proper infrastructure and the appropriate skill sets, a digital library can be a long-term platform for dynamic digital humanities research data.
Microblogging platforms such as Twitter are increasingly used to share information between users. They are also convenient means for propagating content related to history. Hence, from a research viewpoint, they offer opportunities to analyze the way in which users refer to the past, how and when such references appear, and what purposes they serve. Such a study could help quantify the degree of interest in historical content and the mechanisms behind its dissemination. We report the results of a large-scale exploratory analysis of history-oriented posts in microblogs based on a 28-month-long snapshot of Twitter data. The results can increase our understanding of the characteristics of history-focused content sharing on Twitter. They can also guide the design of content recommendation systems as well as time-aware search applications.
Assigning several labels to digital data is becoming easier, as this can be achieved collaboratively with Internet users. However, this process is still a challenge, especially in cases where several labels are assigned to each datum, as some suitable labels may be missed. The missing labels lead to inaccuracies in classification. In this study, we propose LPAC, a novel graph-based multi-label classifier that remains stable and accurate even when labels are missing from the training data. The core of our algorithm is to smooth the label values of the training data using their top-k similar data, propagating and averaging label values to generate values for the missing labels. In experimental evaluations, we used multi-labeled document and image datasets, and measured micro-averaged F-scores for eight classifiers. Even as we incrementally removed correct labels from the two datasets, the proposed algorithm tended to maintain its F-scores, whereas the scores of the other classifiers decreased. In addition, we evaluated the algorithm on Wikipedia, a real dataset that includes missing labels, to determine how well it predicted the correct labels and how useful it was as an initial decision aid for manual annotation. We confirmed that LPAC is useful not only for automatic annotation, but also for facilitating decision making in the initial manual category assignment.
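The smoothing step can be pictured as a single propagation pass over the training data; the following is a simplified, hypothetical rendering of that idea, not the actual LPAC implementation:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def smooth_labels(features, labels, k=2):
    """One propagation pass: each item's label vector is raised toward the
    average label vector of its top-k most similar items, so labels missing
    from an item (value 0.0) can be recovered from similar neighbours."""
    smoothed = []
    for i, (f, lab) in enumerate(zip(features, labels)):
        sims = sorted(
            ((cosine(f, features[j]), j) for j in range(len(features)) if j != i),
            reverse=True,
        )[:k]
        new = list(lab)
        for l in range(len(lab)):
            neigh = sum(labels[j][l] for _, j in sims) / len(sims)
            new[l] = max(lab[l], neigh)  # keep known labels, raise missing ones
        smoothed.append(new)
    return smoothed
```

In the toy data below, the third item is missing a label that both of its near-identical neighbours carry, and the pass restores it without disturbing the known labels.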
The study and analysis of past events can provide numerous benefits. While event categorization has been studied previously, it has usually assigned only one category to each event. In this study, we focus on multi-label classification for past events, a more general and challenging problem than those approached in previous studies. We categorize events into thirteen different types using a range of diverse features and classifiers trained on a dataset that has at least 50 labeled news articles for each category. We confirmed that using all the features to train the classifiers yields statistically significant improvements in micro- and macro-averaged \(F_1\), multi-label accuracy, average precision@5, area under the receiver operating characteristic curve, and example-based loss functions.
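The micro- and macro-averaged \(F_1\) measures used in the evaluation can be computed directly from per-label counts; a self-contained sketch (the function name and toy data are illustrative only):

```python
def multilabel_f1(y_true, y_pred):
    """Micro- and macro-averaged F1 for 0/1 multi-label indicator vectors.
    Micro pools true/false positives and false negatives over all labels;
    macro averages per-label F1 scores."""
    n_labels = len(y_true[0])
    tp, fp, fn = [0] * n_labels, [0] * n_labels, [0] * n_labels
    for t, p in zip(y_true, y_pred):
        for j in range(n_labels):
            if p[j] and t[j]:
                tp[j] += 1
            elif p[j]:
                fp[j] += 1
            elif t[j]:
                fn[j] += 1

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    micro = f1(sum(tp), sum(fp), sum(fn))
    macro = sum(f1(tp[j], fp[j], fn[j]) for j in range(n_labels)) / n_labels
    return micro, macro
```

The two averages can disagree: micro weights frequent labels more heavily, while macro treats every label equally, which is why the abstract reports both.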
Plagiarism detection deals with detecting plagiarized fragments among textual documents. The availability of digital documents in online libraries makes plagiarism easier to commit and, on the other hand, easier to detect with automatic plagiarism detection systems. Large-scale plagiarism corpora with a wide variety of plagiarism cases are needed to evaluate different detection methods in different languages. Plagiarism detection corpora play an important role in evaluating and tuning plagiarism detection systems. Despite their importance, few corpora have been developed for low-resource languages. In this paper, we propose HAMTA, a Persian plagiarism detection corpus. To simulate real cases of plagiarism, manually paraphrased texts are used to compile the corpus. To obtain the manual plagiarism cases, a crowdsourcing platform was developed, and crowd workers were asked to paraphrase fragments of text in order to simulate real cases of plagiarism. Moreover, artificial methods are used to scale up the proposed corpus by automatically generating cases of text re-use. The evaluation results indicate a high correlation between the proposed corpus and the PAN state-of-the-art English plagiarism detection corpus.
Article 2 of the EU Council conclusions of 21 May 2014 on cultural heritage as a strategic resource for a sustainable Europe (2014/C 183/08) states: “Cultural heritage consists of the resources inherited from the past in all forms and aspects—tangible, intangible and digital (born digital and digitized), including monuments, sites, landscapes, skills, practices, knowledge and expressions of human creativity, as well as collections conserved and managed by public and private bodies such as museums, libraries and archives”. Starting from this assumption, we have to rethink the digital and digitization as social and cultural expressions of the contemporary age. We need to rethink digital libraries produced by digitization as cultural entities, and no longer as mere datasets for enhancing the fruition of cultural heritage, by defining clear and homogeneous criteria to validate and certify them as memory and sources of knowledge for future generations. By expanding R: Re-usable of the FAIR Guiding Principles for scientific data management and stewardship into R4: Re-usable, Relevant, Reliable and Resilient, this paper proposes a more reflective approach to the creation of descriptive metadata for managing digital resources of cultural heritage, one that can guarantee their long-term preservation.
Citation recommendation describes the task of recommending citations for a given text. Due to the overload of published scientific works in recent years on the one hand, and the need to cite the most appropriate publications when writing scientific texts on the other hand, citation recommendation has emerged as an important research topic. In recent years, several approaches and evaluation data sets have been presented. However, to the best of our knowledge, no literature survey has been conducted explicitly on citation recommendation. In this article, we give a thorough introduction to automatic citation recommendation research. We then present an overview of the approaches and data sets for citation recommendation and identify differences and commonalities using various dimensions. Last but not least, we shed light on the evaluation methods and outline general challenges in the evaluation and how to meet them. We restrict ourselves to citation recommendation for scientific publications, as this document type has been studied the most in this area. However, many of the observations and discussions included in this survey are also applicable to other types of text, such as news articles and encyclopedic articles.
In this paper, we will explore the theme of the documentation of 3D cultural heritage assets, not only as entire artefacts but also including the interesting features of the object from an archaeological perspective. Indeed, the goal is supporting archaeological research and curation, providing a different approach to enrich the documentation of digital resources and their components with corresponding measurements, combining semantic and geometric techniques. A documentation scheme based on CIDOC, where measurements on digital data have been included extending CIDOC CRMdig, is discussed. To annotate accurately the components and features of the artefacts, a controlled vocabulary named Cultural Heritage Artefact Partonomy (CHAP) has been defined and integrated into the scheme as a SKOS taxonomy to showcase the proposed methodology. CHAP concerns Coroplastic, which is the study of ancient terracotta figurines and in particular the Cypriot production. Two case studies have been considered: the terracotta statues from the port of Salamis and the small clay statuettes from the Ayia Irini sanctuary. Focussing both on the artefacts and their digital counterparts, the proposed methodology supports effectively typical operations within digital libraries and repositories (e.g. search, part-based annotation), and more specific objectives such as the archaeological interpretation and digitally assisted classification, as proved in a real archaeological scenario. The proposed approach is general and applies to different contexts, since it is able to support any archaeological research where the goal is an extensive digital documentation of tangible findings including quantitative attributes.
Conducting a literature review requires a preliminary organization of the available bibliographic material. In this article, we present a novel method called OrgBR-M (method to organize bibliographic references), based on formal concept analysis theory, to assist in organizing bibliographic material. Our method systematizes the organization of the bibliography and proposes metrics to help guide the literature review. As a case study, we apply the OrgBR-M method to perform a literature review of the educational data mining field of study.
The ever-growing significance of the web acts as a driving force for the massive and rapid growth of websites in every domain of social life. To build a successful website, developers need to embrace an appropriate web testing and evaluation methodology. Some valuable works in the past have striven to appraise web applications quantitatively. Various parameters have been considered, which are in turn subdivided into measurable indicators. However, their weighting criteria have not been appropriately adjusted to the domain of the website, nor have the relative degrees of interaction among parameters been taken into consideration. The work presented in this paper describes a framework, the Quality Index Evaluation Method, to gauge the design quality of a website as an index value. An automated tool has been designed and coded to measure the metrics quantitatively. A weighting technique based on Fuzzy-DEMATEL (Decision Making Trial and Evaluation Laboratory Method) has been applied to these metrics. Fuzzy trapezoidal numbers have been used for the assessment of parameters and the final design quality index value. To verify the use of the framework in different website domains, it has been exercised on eight academic (four institutional and four digital libraries), five informative and four commercial websites. The results have been validated through the most widely used method in the literature, i.e., user judgment. Users' opinions of each website have been quantified and aggregated with a fuzzy aggregation technique. Experimental results show that the proposed framework provides accurate and consistent results in very little time.
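The fuzzy aggregation step can be illustrated with trapezoidal fuzzy numbers: ratings (a, b, c, d) are averaged component-wise and then defuzzified via the trapezoid's centroid. This is a generic sketch of that technique, not the paper's exact Fuzzy-DEMATEL procedure:

```python
def trapezoid_centroid(t):
    """Defuzzify a trapezoidal fuzzy number (a, b, c, d) via its centroid:
    x = ((d^2 + c^2 + d*c) - (a^2 + b^2 + a*b)) / (3 * ((d + c) - (a + b)))."""
    a, b, c, d = t
    denom = (d + c) - (a + b)
    if denom == 0:  # degenerate trapezoid: fall back to the interval midpoint
        return (a + d) / 2
    return ((d * d + c * c + d * c) - (a * a + b * b + a * b)) / (3 * denom)

def aggregate_ratings(ratings):
    """Average several trapezoidal ratings component-wise, then defuzzify
    to a single crisp score, as fuzzy aggregation schemes commonly do."""
    n = len(ratings)
    avg = tuple(sum(r[i] for r in ratings) / n for i in range(4))
    return trapezoid_centroid(avg)
```

For a symmetric trapezoid such as (0, 1, 2, 3), the centroid falls at the midpoint 1.5, which is a quick sanity check on the formula.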
A publication venue authority file stores variants of the names of journals and conferences that publish scientific articles. It is useful in the construction of search tools and data disambiguation, and it is of special interest to agencies funding research and evaluating graduate programs, which use the quality of publication venues as a basis for evaluating researchers’ and research groups’ publications. However, keeping an updated authority file is not a trivial task. Different names are used to refer to the same publication venue, these venues sometimes change their name, new venues emerge regularly, and journal bibliometrics are updated frequently. This paper presents the publication venue authority file (PVAF), an environment for the disambiguation of scientific publication venues. It consists of an authority file and a set of tools for updating and querying its data. We describe and experimentally evaluate each of these tools. We also propose a search algorithm based on an associative classifier, which allows for incremental updates of its learning model. The results show that the PVAF has coverage greater than 86% for publication venues in several fields of knowledge, and its tools attain a good accuracy in the classification of publication venues from curricula vitae formatted in various citation styles.
Web archives, such as the Internet Archive, preserve an unprecedented abundance of materials regarding major events and transformations in our society. In this paper, we present an approach for building event-centric sub-collections from such large archives, which includes not only the core documents related to the event itself but, even more importantly, documents describing related aspects (e.g., premises and consequences). This is achieved by identifying relevant concepts and entities from a knowledge base, and then detecting their mentions in documents, which are interpreted as indicators for relevance. We extensively evaluate our system on two diachronic corpora, the New York Times Corpus and the US Congressional Record; additionally, we test its performance on the TREC KBA Stream Corpus and on the TREC-CAR dataset, two publicly available large-scale web collections.
Web archiving is the process of collecting portions of the Web to ensure that the information is preserved for future exploitation. However, despite the increasing number of web archives worldwide, the absence of efficient and meaningful exploration methods still remains a major hurdle in the way of turning them into a usable and useful information source. In this paper, we focus on this problem and propose an RDF/S model and a distributed framework for building semantic profiles (“layers”) that describe semantic information about the contents of web archives. A semantic layer allows describing metadata information about the archived documents, annotating them with useful semantic information (like entities, concepts, and events), and publishing all these data on the Web as Linked Data. Such structured repositories offer advanced query and integration capabilities, and make web archives directly exploitable by other systems and tools. To demonstrate their query capabilities, we build and query semantic layers for three different types of web archives. An experimental evaluation showed that a semantic layer can answer information needs that existing keyword-based systems are not able to sufficiently satisfy.
Indexing documents with controlled vocabularies enables a wealth of semantic applications for digital libraries. Due to the rapid growth of scientific publications, machine learning-based methods are required that assign subject descriptors automatically. While stability of generative processes behind the underlying data is often assumed tacitly, it is being violated in practice. Addressing this problem, this article studies explicit and implicit concept drift, that is, settings with new descriptor terms and new types of documents, respectively. First, the existence of concept drift in automatic subject indexing is discussed in detail and demonstrated by example. Subsequently, architectures for automatic indexing are analyzed in this regard, highlighting individual strengths and weaknesses. The results of the theoretical analysis justify research on fusion of different indexing approaches with special consideration on information sharing among descriptors. Experimental results on titles and author keywords in the domain of economics underline the relevance of the fusion methodology, especially under concept drift. Fusion approaches outperformed non-fusion strategies on the tested data sets, which comprised shifts in priors of descriptors as well as covariates. These findings can help researchers and practitioners in digital libraries to choose appropriate methods for automatic subject indexing, as is finally shown by a recent case study.
The importance of and the need for the peer-review system are highly debated in the academic community, and recently there has been a growing call to get rid of it completely. Peer review is one of the steps in the publication pipeline that usually requires a publishing house to invest a significant portion of its budget to ensure quality editing and reviewing of the submissions received. Therefore, a very pertinent question is whether such investments are worth making at all. To answer this question, we perform a rigorous measurement study on a massive dataset (29k papers with 70k distinct review reports) to unfold the detailed characteristics of the peer-review process, considering its three most important entities: (i) the paper, (ii) the authors and (iii) the referees. We thereby identify different factors related to these three entities that can be leveraged to predict the long-term impact of a submitted paper. When plugged into a regression model, these features achieve a high \(R^2\) of 0.85 and an RMSE of 0.39. Analysis of feature importance indicates that reviewer- and author-related features are most indicative of the long-term impact of a paper. We believe that our framework could be utilized to assist editors in deciding the fate of a paper.
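The reported \(R^2\) and RMSE are standard regression metrics; for readers unfamiliar with them, a minimal sketch with a single feature (the paper's actual model uses many reviewer- and author-related features, which this toy fit does not reproduce):

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with one feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def r2_rmse(ys, preds):
    """Coefficient of determination (R^2) and root-mean-square error."""
    n = len(ys)
    my = sum(ys) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot, math.sqrt(ss_res / n)
```

An \(R^2\) of 0.85 means the fitted features explain about 85% of the variance in the impact measure; RMSE is in the units of that measure.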
Media bias describes differences in the content or presentation of news. It is a ubiquitous phenomenon in news coverage that can have severely negative effects on individuals and society. Identifying media bias is a challenging problem for which current information systems offer little support. News aggregators are the most important class of systems for supporting users in coping with the large amount of news published nowadays. These systems focus on identifying and presenting important, common information in news articles, but do not reveal different perspectives on the same topic. Due to this analysis approach, current news aggregators cannot effectively reveal media bias. To address this problem, we present matrix-based news aggregation, a novel approach for news exploration that helps users gain a broad and diverse news understanding by presenting various perspectives on the same news topic. Additionally, we present NewsBird, an open-source news aggregator that implements matrix-based news aggregation for international news topics. The results of a user study showed that NewsBird broadens the user's news understanding more effectively than the list-based visualization approach employed by established news aggregators, while achieving comparable effectiveness and efficiency for the two main use cases of news consumption: getting an overview of and finding details on current news topics.
Finding similar words with the help of word embedding models, such as Word2Vec or GloVe, computed on large-scale digital libraries has yielded meaningful results in many cases. However, the underlying notion of similarity has remained ambiguous. In this paper, we examine when exactly similarity values in word embedding models are meaningful. To do so, we analyze the statistical distribution of similarity values systematically, conducting two series of experiments. The first one examines how the distribution of similarity values depends on the different embedding model algorithms and parameters. The second one starts by showing that intuitive similarity thresholds do not exist. We then propose a method stating which similarity values and thresholds actually are meaningful for a given embedding model. Based on these results, we calculate how these thresholds, when taken into account during evaluation, change the evaluation scores of the models in similarity test sets. In more abstract terms, our insights give way to a better understanding of the notion of similarity in embedding models and to more reliable evaluations of such models.
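One way to make "meaningful similarity" concrete is to compare observed values against the empirical distribution of similarities between randomly paired vectors, treating only values well above that baseline as meaningful. The sketch below estimates a percentile-based threshold in that spirit; it illustrates the general idea, not the paper's exact method:

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_threshold(vectors, percentile=0.99, sample=2000, seed=0):
    """Estimate a meaningfulness threshold as a high percentile of the
    cosine similarities of randomly paired vectors: similarity values
    below it are indistinguishable from chance pairings."""
    rng = random.Random(seed)
    sims = sorted(cosine(*rng.sample(vectors, 2)) for _ in range(sample))
    return sims[min(int(percentile * len(sims)), len(sims) - 1)]
```

For high-dimensional random vectors, random-pair similarities cluster near zero, so the resulting threshold is small; real embedding models would shift this distribution, which is exactly what makes the model-specific analysis in the paper necessary.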
Museums are increasing access to their collections and providing richer user experiences via web-based interfaces. However, they are seeing high numbers of users looking at only one or two pages within 10 s and then leaving. To reduce this rate, a better understanding of the type of user who visits a museum website is required. Existing models for museum website users tend to focus on groups that are readily accessible for study or provide little detail in their definitions of the groups. This paper presents the results of a large-scale user survey for the National Museums Liverpool museum website in which data on a wide range of user characteristics were collected regarding their current visit to provide a better understanding of their motivations, tasks, engagement and domain knowledge. Results show that the frequently understudied general public and non-professional users make up the majority (approximately 77%) of the respondents.