News on eLiteracias

✇ International Journal on Digital Libraries

PEERRec: An AI-based approach to automatically generate recommendations and predict decisions in peer review

4 July 2023, 00:00

Abstract

One key frontier of artificial intelligence (AI) is the ability to comprehend research articles and validate their findings, a formidable problem for AI systems that must compete with human intelligence and intuition. As a benchmark of research validation, the existing peer-review system still stands strong despite frequent criticism. However, the paper vetting system has been severely strained by an influx of research paper submissions and a growing number of conferences and journals. As a result, problems such as a shortage of reviewers, difficulty finding the right experts, and difficulty maintaining review quality are surfacing steadily. To ease the workload of the stakeholders associated with the peer-review process, we probed into what an AI-powered review system would look like. In this work, we leverage the interaction between the paper’s full text and the corresponding peer-review text to predict the overall recommendation score and final decision. We do not envisage AI reviewing papers in the near future. Still, we intend to explore the possibility of a human–AI collaboration in the decision-making process to make the current system FAIR. The idea is to have an assistive decision-making tool for the chairs/editors to help them with an additional layer of confidence, especially with borderline and contrastive reviews. We use a deep attention network between the review text and paper to learn the interactions and predict the overall recommendation score and final decision. We also use sentiment information encoded within peer-review texts to guide the outcome further. Our proposed model outperforms the recent state-of-the-art competitive baselines. We release the code of our implementation here: https://github.com/PrabhatkrBharti/PEERRec.git.

✇ International Journal on Digital Libraries

Special Issue: Epigraphy and Paleography: Bringing Records from the Distant Past to the Present

1 June 2023, 00:00

Abstract

This special issue brings together three areas of research and scholarly work that would have demonstrated few obvious relationships three decades ago. Digital libraries research, practices and infrastructures have transformed the study of ancient inscriptions by providing organizing principles for collections building, defining interoperability requirements and developing innovative user tools and services. Yet linking collections and their contents to support advanced scholarly work in epigraphy and paleography tests the limits of current digital libraries applications. This is due, in part, to the magnitude and heterogeneity of works created over a time period of more than five millennia. The remarkable diversity ranges from the number of types of artifacts to the methods used in their production to the singularity of individual marks contained within them. Conversion of analog collections to digital repositories is well underway, but most often not in a way that meets the basic requirements needed to support scholarly workflows. This is beginning to change. In addition to efforts to develop complex data objects, linking strategies and repositories aggregation, there is a new use of imaging technologies and computational approaches to recognize, enhance, recover and restore writings. Most recently, leading-edge artificial intelligence methods are being applied for the automated transcription of handwritten text into machine-readable forms. The articles in this special issue will give examples of each.

✇ International Journal on Digital Libraries

AgAsk: an agent to help answer farmer’s questions from scientific documents

19 June 2023, 00:00

Abstract

Decisions in agriculture are increasingly data-driven. However, valuable agricultural knowledge is often locked away in free-text reports, manuals and journal articles. Specialised search systems are needed that can mine agricultural information to provide relevant answers to users’ questions. This paper presents AgAsk, an agent able to answer natural language agriculture questions by mining scientific documents. We carefully survey and analyse farmers’ information needs. On the basis of these needs, we release an information retrieval test collection comprising real questions, a large collection of scientific documents split into passages, and ground truth relevance assessments indicating which passages are relevant to each question. We implement and evaluate a number of information retrieval models to answer farmers’ questions, including two state-of-the-art neural ranking models. We show that neural rankers are highly effective at matching passages to questions in this context. Finally, we propose a deployment architecture for AgAsk that includes a client based on the Telegram messaging platform and a retrieval model deployed on commodity hardware. The test collection we provide is intended to stimulate more research in methods to match natural language to answers in scientific documents. While the retrieval models were evaluated in the agriculture domain, they are generalisable and of interest to others working on similar problems. The test collection is available at: https://github.com/ielab/agvaluate.

✇ International Journal on Digital Libraries

ORKG-Leaderboards: a systematic workflow for mining leaderboards as a knowledge graph

15 June 2023, 00:00

Abstract

The purpose of this work is to describe the orkg-Leaderboards software, designed to automatically extract leaderboards, defined as task–dataset–metric tuples, from large collections of empirical research papers in artificial intelligence (AI). The software can support both the main workflows of scholarly publishing, viz. as LaTeX files or as PDF files. Furthermore, the system is integrated with the open research knowledge graph (ORKG) platform, which fosters the machine-actionable publishing of scholarly findings. Thus, the system’s output, when integrated within the ORKG’s supported Semantic Web infrastructure of representing machine-actionable ‘resources’ on the Web, enables: (1) broadly, the integration of empirical results of researchers across the world, thus enabling transparency in empirical research with the potential to also be complete, contingent on the underlying data source(s) of publications; and (2) specifically, researchers to track the progress in AI with an overview of the state-of-the-art across the most common AI tasks and their corresponding datasets via dynamic ORKG frontend views leveraging tables and visualization charts over the machine-actionable data. Our best model achieves performances above 90% F1 on the leaderboard extraction task, thus proving orkg-Leaderboards a practically viable tool for real-world usage. Going forward, in a sense, orkg-Leaderboards transforms the leaderboard extraction task into an automated digitalization task, which has been, for a long time in the community, a crowdsourced endeavor.

✇ International Journal on Digital Libraries

A detailed library perspective on nearly unsupervised information extraction workflows in digital libraries

13 June 2023, 00:00

Abstract

Information extraction can support novel and effective access paths for digital libraries. Nevertheless, designing reliable extraction workflows can be cost-intensive in practice. On the one hand, suitable extraction methods rely on domain-specific training data. On the other hand, unsupervised and open extraction methods usually produce non-canonicalized extraction results. This paper is an extension of our original work and tackles the question of how digital libraries can handle such extractions and whether their quality is sufficient in practice. We focus on unsupervised extraction workflows by analyzing them in case studies in the domains of encyclopedias (Wikipedia), Pharmacy, and Political Sciences. As an extension, we analyze the extractions in more detail, verify our findings on a second extraction method, discuss another canonicalizing method, and give an outlook on how non-English texts can be handled. On this basis, we report on opportunities and limitations. Finally, we discuss best practices for unsupervised extraction workflows.

✇ International Journal on Digital Libraries

From stone to silicon: technical advances in epigraphy

1 June 2023, 00:00

Abstract

Through the annals of time, writing has slowly scrawled its way from the painted surfaces of stone walls to the grooves of inscriptions to the strokes of quill, pen, and ink. While we still inscribe stone (tombstones, monuments) and we continue to write on skin (tattoos abound), our quotidian method of writing on paper is increasingly abandoned in favor of the quick-to-generate digital text. And even though the stone-inscribed text of epigraphy offers demonstrably better permanence than that of writing on skin and paper—even better than that of the memory system of the modern computer (Bollacker in Am Sci 98:106, 2010)—this field of study has also made the digital leap. Today’s scholarly analyses of epigraphic content increasingly rely on high-tech approaches involving data science and computer models. This essay discusses how advances in a number of exciting technologies are enabling the digital analysis of epigraphic texts and accelerating the ability of scholars to preserve, renew, and reinvigorate the study of the inscriptions that remain from throughout history.

✇ International Journal on Digital Libraries

Coverage and similarity of bibliographic databases to find most relevant literature for systematic reviews in education

24 May 2023, 00:00

Abstract

Systematic literature reviews in educational research have become a popular research method. A key point here is the choice of bibliographic databases to reach a maximum probability of finding all potentially relevant literature that deals with the research question analyzed in a systematic literature review. Guidelines and handbooks on reviewing recommend proper databases and information sources for education, along with specific search strategies. However, in many disciplines, among them educational research, there is a lack of evidence on the relevance of databases that need to be considered to find relevant literature and lessen the risk of missing relevant publications. Educational research is an interdisciplinary field and has no core database. Instead, the field is covered by multiple disciplinary and multidisciplinary information sources that have either a national or international focus. In this article, we discuss the relevance of seven databases in systematic literature reviews in education, based on results of an empirical data analysis of three recently published reviews. To evaluate the relevance of a database, the relevant literature of those reviews served as the gold standard. Results indicate that discipline-specific databases outperform international multidisciplinary sources, and a combination of discipline-specific international and national sources is most efficient in finding a high proportion of relevant literature. The article discusses the relevance of the databases in relation to their coverage of relevant literature, while considering practical implications for researchers performing a systematic literature search. We thus present evidence for proper database choices for educational and discipline-related systematic literature reviews.

✇ International Journal on Digital Libraries

Correction: Beyond translation: engaging with foreign languages in a digital library

19 May 2023, 00:00

✇ International Journal on Digital Libraries

Self-training involving semantic-space finetuning for semi-supervised multi-label document classification

11 May 2023, 00:00

Abstract

Self-training is an effective solution for semi-supervised learning, in which both labeled and unlabeled data are leveraged for training. However, the application scenarios of existing self-training frameworks are mostly confined to single-label classification. Applying self-training in the multi-label scenario is difficult because, unlike single-label classification, there is no constraint of mutual exclusion over categories, and the vast number of possible label vectors makes the discovery of credible predictions harder. To realize effective self-training in the multi-label scenario, we propose ML-DST and ML-DST+, which utilize contextualized document representations of pretrained language models. A BERT-based multi-label classifier and newly designed weighted loss functions for finetuning are proposed. Two label propagation-based algorithms, SemLPA and SemLPA+, are also proposed to enhance multi-label prediction. Their similarity measure is iteratively improved through semantic-space finetuning, by which the semantic space consisting of document representations is finetuned to better reflect learnt label correlations. High-confidence label predictions are recognized by examining the prediction score on each category separately, and are in turn used for both classifier finetuning and semantic-space finetuning. According to our experimental results, the performance of our approach steadily exceeds the representative baselines under different label rates, demonstrating the superiority of our proposed approach.
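
The per-category selection of high-confidence predictions mentioned in the abstract could be sketched roughly as follows. This is a minimal illustration and not the authors' ML-DST code: the function name `confident_labels`, the two thresholds, and the idea of also keeping confident negatives are all assumptions made for the example.

```python
def confident_labels(scores, high=0.9, low=0.1):
    """Keep only the categories whose score is decisive.

    `scores` maps category -> predicted probability in [0, 1].
    `high` and `low` are hypothetical thresholds, not values from the paper.
    Returns {category: True} for confident positives and
    {category: False} for confident negatives; uncertain ones are dropped.
    """
    selected = {}
    for category, score in scores.items():
        if score >= high:
            selected[category] = True
        elif score <= low:
            selected[category] = False
    return selected


# Each category is judged on its own, so one document can yield
# several pseudo-labels or none at all.
example = confident_labels({"nlp": 0.97, "ir": 0.50, "db": 0.03})
```

Because every category is examined independently, no mutual-exclusion constraint between labels is needed, which is the property the multi-label setting lacks.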

✇ International Journal on Digital Libraries

DETEXA: declarative extensible text exploration and analysis through SQL

10 May 2023, 00:00

Abstract

Metadata enrichment through text mining techniques is becoming one of the most significant tasks in digital libraries. Due to the exponential increase of open access publications, several new challenges have emerged. Raw data are usually big, unstructured, and come from heterogeneous data sources. In this paper, we introduce a text analysis framework implemented in extended SQL that exploits the scalability characteristics of modern database management systems. The purpose of this framework is to provide the opportunity to build performant end-to-end text mining pipelines which include data harvesting, cleaning, processing, and text analysis at once. SQL is selected due to its declarative nature which offers fast experimentation and the ability to build APIs so that domain experts can edit text mining workflows via easy-to-use graphical interfaces. Our experimental analysis demonstrates that the proposed framework is very effective and achieves significant speedup, up to three times faster, in common use cases compared to other popular approaches.

✇ International Journal on Digital Libraries

Predicting answer acceptability for question-answering system

5 May 2023, 00:00

Abstract

Question-answering (QA) platforms such as Stack Overflow, Quora, and Stack Exchange have become favourite places to exchange knowledge with community users. Finding answers to simple or complex questions is easier on QA platforms nowadays. Due to the large number of responses from users all around the world, these community QA (CQA) systems are currently facing massive problems. Stack Overflow allows users to ask questions and give answers or comments on others’ posts. Stack Overflow also rewards those users whose posts are appreciated by the community in the form of reputation points. The accepted answer provides maximum reputation points to the answerer. More reputation points unlock more website privileges. Hence, each answerer wants to get their answer accepted. Very little research has been done to check whether a user’s answer will be accepted or not. This paper proposes a model that predicts answer acceptability and its reason. The model’s findings help the answerer learn about the likely acceptance of the answer; if the predicted probability of acceptance is low, the answerer might revise their answer immediately. The comparison with the state-of-the-art literature confirmed that the proposed model achieves better performance.

✇ International Journal on Digital Libraries

Deep author name disambiguation using DBLP data

4 May 2023, 00:00

Abstract

In the academic world, the number of scientists grows every year and so does the number of authors sharing the same names. Consequently, it is challenging to assign newly published papers to their respective authors. Therefore, author name ambiguity is considered a critical open problem in digital libraries. This paper proposes an author name disambiguation approach that links author names to their real-world entities by leveraging their co-authors and domain of research. To this end, we use data collected from the DBLP repository that contains more than 5 million bibliographic records authored by around 2.6 million co-authors. Our approach first groups authors who share the same last names and same first name initials. The author within each group is identified by capturing the relation with his/her co-authors and area of research, represented by the titles of the validated publications of the corresponding author. To this end, we train a neural network model that learns from the representations of the co-authors and titles. We validated the effectiveness of our approach by conducting extensive experiments on a large dataset.
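
The first step described above, grouping authors who share a last name and first-name initial, might look like the following minimal sketch. It is not taken from the paper: the helper names `blocking_key` and `group_by_key` are hypothetical, and the code assumes names are written in "First [Middle] Last" order, which real DBLP records do not always follow.

```python
from collections import defaultdict


def blocking_key(full_name):
    """Return (last name, first-name initial), both lowercased.

    Assumes a non-empty name in 'First [Middle] Last' order.
    """
    parts = full_name.strip().split()
    return (parts[-1].lower(), parts[0][0].lower())


def group_by_key(names):
    """Group author name strings that share the same blocking key."""
    groups = defaultdict(list)
    for name in names:
        groups[blocking_key(name)].append(name)
    return dict(groups)


names = ["Wei Wang", "Wen Wang", "Anna Smith", "W. Wang"]
groups = group_by_key(names)
```

Each resulting group is only a candidate set. Deciding which real-world person a paper belongs to within a group is the job of the model trained on co-authors and titles.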

✇ International Journal on Digital Libraries

Retrievability in an integrated retrieval system: an extended study

28 April 2023, 00:00

Abstract

Retrievability measures the influence a retrieval system has on the access to information in a given collection of items. This measure can help in making an evaluation of the search system based on which insights can be drawn. In this paper, we investigate the retrievability in an integrated search system consisting of items from various categories, particularly focussing on datasets, publications and variables in a real-life digital library. The traditional metrics, that is, the Lorenz curve and Gini coefficient, are employed to visualise the diversity in retrievability scores of the three retrievable document types (specifically datasets, publications, and variables). Our results show a significant popularity bias with certain items being retrieved more often than others. Particularly, it has been shown that certain datasets are more likely to be retrieved than other datasets in the same category. In contrast, the retrievability scores of items from the variable or publication category are more evenly distributed. We have observed that the distribution of document retrievability is more diverse for datasets as compared to publications and variables.
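
For readers unfamiliar with the two metrics named above, the sketch below computes a Lorenz curve and Gini coefficient over a list of retrievability scores. It is a generic illustration under the assumption that each item has a non-negative score; the function names and the sample values are invented for the example and do not come from the study.

```python
def lorenz_curve(scores):
    """Return Lorenz curve points (share of items, share of total score)."""
    values = sorted(scores)
    total = sum(values)
    points = [(0.0, 0.0)]
    if not values or total == 0:
        return points
    running = 0
    for i, value in enumerate(values, start=1):
        running += value
        points.append((i / len(values), running / total))
    return points


def gini(scores):
    """Gini coefficient: 0 means evenly spread, values near 1 mean concentrated."""
    values = sorted(scores)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted form: G = sum_i (2i - n - 1) * x_i / (n * sum(x)), i = 1..n
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(values, start=1))
    return weighted / (n * total)


even = gini([1, 1, 1, 1])      # every item equally retrievable
skewed = gini([0, 0, 0, 10])   # one item receives all retrievals
```

A higher Gini coefficient for one document type than for another would correspond to the kind of uneven retrievability the abstract reports for datasets.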

✇ International Journal on Digital Libraries

Towards automated meta-review generation via an NLP/ML pipeline in different stages of the scholarly peer review process

24 April 2023, 00:00

Abstract

With the ever-increasing number of submissions in top-tier conferences and journals, finding good reviewers and meta-reviewers is becoming increasingly difficult. Writing a meta-review is not straightforward as it involves a series of sub-tasks, including making a decision on the paper based on the reviewer’s recommendation and their confidence in the recommendation, mitigating disagreements among the reviewers, and other such similar tasks. In this work, we develop a novel approach to automatically generate meta-reviews that are decision-aware and which also take into account a set of relevant sub-tasks in the peer-review process. More specifically, we first predict the recommendation scores and confidence scores for the reviews, using which we then predict the decision on a particular manuscript. Finally, we utilize the decision signals for generating the meta-reviews using a transformer-based seq2seq architecture. Our proposed pipelined approach for automatic decision-aware meta-review generation achieves significant performance improvement over the standard summarization baselines as well as relevant prior works on this problem. We make our codes available at https://github.com/saprativa/seq-to-seq-decision-aware-mrg.

✇ International Journal on Digital Libraries

A discovery system for narrative query graphs: entity-interaction-aware document retrieval

24 April 2023, 00:00

Abstract

Finding relevant publications in the scientific domain can be quite tedious: accessing large-scale document collections often means formulating an initial keyword-based query followed by many refinements to retrieve a sufficiently complete, yet manageable set of documents to satisfy one’s information need. Since keyword-based search limits researchers to formulating their information needs as a set of unconnected keywords, retrieval systems try to guess each user’s intent. In contrast, distilling short narratives of the searchers’ information needs into simple, yet precise entity-interaction graph patterns provides all information needed for a precise search. As an additional benefit, such graph patterns may also feature variable nodes to flexibly allow for different substitutions of entities taking a specified role. An evaluation over the PubMed document collection quantifies the gains in precision for our novel entity-interaction-aware search. Moreover, we perform expert interviews and a questionnaire to verify the usefulness of our system in practice. This paper extends our previous work by giving a comprehensive overview of the discovery system to realize narrative query graph retrieval.

✇ International Journal on Digital Libraries

Approximate nearest neighbor for long document relationship labeling in digital libraries

1 December 2023, 00:00

Abstract

Relationship tagging of long text documents is a growing need in information science, spurred by the emergence of multi-million book bibliographic digital libraries. Large digital libraries offer an unprecedented glimpse into cultural history through their collections, but the combination of collection scale and document length complicates their study, given that prior work on large corpora has dealt primarily with much shorter texts. This study presents and evaluates an approach for fast retrieval on long texts, which leverages a chunk-and-aggregate approach with document sub-units to capture nuanced similarity relationships at scales which are not otherwise tractable. This approach is evaluated on book relationships from the HathiTrust Digital Library and shows strong results for relationships beyond exact duplicates. Finally, we argue for the value of approximate nearest neighbor search for narrowing the search space for downstream classification and retrieval contexts.

✇ International Journal on Digital Libraries

Creating and validating a scholarly knowledge graph using natural language processing and microtask crowdsourcing

5 April 2023, 00:00

Abstract

Due to the growing number of scholarly publications, finding relevant articles becomes increasingly difficult. Scholarly knowledge graphs can be used to organize the scholarly knowledge presented within those publications and represent it in machine-readable formats. Natural language processing (NLP) provides scalable methods to automatically extract knowledge from articles and populate scholarly knowledge graphs. However, NLP extraction is generally not sufficiently accurate and, thus, fails to generate high-quality, fine-grained data. In this work, we present TinyGenius, a methodology to validate NLP-extracted scholarly knowledge statements using microtasks performed with crowdsourcing. TinyGenius is employed to populate a paper-centric knowledge graph, using five distinct NLP methods. We extend our previous work on the TinyGenius methodology in various ways. Specifically, we discuss the NLP tasks in more detail and include an explanation of the data model. Moreover, we present a user evaluation where participants validate the generated NLP statements. The results indicate that employing microtasks for statement validation is a promising approach despite the varying participant agreement for different microtasks.

✇ International Journal on Digital Libraries

Referencing behaviours across disciplines: publication types and common metadata for defining bibliographic references

27 March 2023, 00:00

Abstract

In this work, we investigate existing citation practices by analysing a large set of articles published in journals to measure which metadata are used across the various scholarly disciplines, independently of the particular citation style adopted, for defining bibliographic references. We selected the most cited journals in each of the 27 subject areas listed in the SCImago Journal Rank in the 2015–2017 triennium according to the SCImago total cites ranking. Each journal in the sample was represented by five articles (in PDF format) published in the most recent issue published in October 2019, for a total of 729 articles. We extracted all 34,140 bibliographic references in the bibliographic reference lists of these articles. Finally, we detected the types of cited works in each discipline and the structure of bibliographic references and in-text reference pointers for each type of cited work. By analysing the data gathered, we observed that the bibliographic references in our sample referenced 36 different types of cited works. Such a considerable variety of publications revealed the existence of particular citing behaviours in scientific articles that varied from subject area to subject area.

✇ International Journal on Digital Libraries

Scientific document processing: challenges for modern learning methods

1 December 2023, 00:00

Abstract

Neural network models enjoy success on language tasks related to Web documents, including news and Wikipedia articles. However, the characteristics of scientific publications pose specific challenges that have yet to be satisfactorily addressed: the discourse structure of scientific documents, which is crucial in scholarly document processing (SDP) tasks, the interconnected nature of scientific documents, and their multimodal nature. We survey modern neural network learning methods that tackle these challenges: those that can model discourse structure and interconnectivity and make use of multimodality. We also highlight efforts to collect large-scale datasets and tools developed to enable effective deep learning deployment for SDP. We conclude with a discussion of upcoming trends and recommend future directions for pursuing neural natural language processing approaches for SDP.

✇ International Journal on Digital Libraries

The digitization of historical astrophysical literature with highly localized figures and figure captions

22 March 2023, 00:00

Abstract

Scientific articles published prior to the “age of digitization” in the late 1990s contain figures which are “trapped” within their scanned pages. While progress to extract figures and their captions has been made, there is currently no robust method for this process. We present a YOLO-based method for use on scanned pages, after they have been processed with optical character recognition (OCR), which uses both grayscale and OCR features. We focus our efforts on translating the intersection-over-union (IOU) metric from the field of object detection to document layout analysis and quantify “high localization” levels as an IOU of 0.9. When applied to the astrophysics literature holdings of the NASA Astrophysics Data System, we find F1 scores of 90.9% (92.2%) for figures (figure captions) with the IOU cut-off of 0.9, which is a significant improvement over other state-of-the-art methods.
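
The IOU metric and the 0.9 cut-off mentioned above can be illustrated with a short sketch. It assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`; the function names and the sample coordinates are hypothetical, and this is the generic object-detection definition rather than the paper's evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max) tuples.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap width and height are clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_highly_localized(predicted, truth, cutoff=0.9):
    """True when the predicted box meets the IOU cut-off (0.9 in the abstract)."""
    return iou(predicted, truth) >= cutoff


truth = (0, 0, 10, 10)
close = is_highly_localized((0, 0, 10, 9), truth)   # IOU = 90 / 100
loose = is_highly_localized((0, 0, 10, 5), truth)   # IOU = 50 / 100
```

A cut-off of 0.9 is strict: a predicted figure box that misses a tenth of the true region already sits at the boundary, which is why the abstract describes it as "high localization".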

✇ International Journal on Digital Libraries

Beyond translation: engaging with foreign languages in a digital library

1 September 2023, 00:00

Abstract

Digital libraries can enable their patrons to go beyond modern language translations and to engage directly with sources in more languages than any individual could study, much less master. Translations should be viewed not so much as an end but as an entry point into the sources that they represent. In the case of highly studied sources, one or more experts can curate the network of annotations that support such reading. A digital library should, however, automatically create a serviceable first version of such a multi-lingual edition. Such a service is possible but benefits from (if it does not require) a new generation of increasingly well-designed machine-readable translations, lexica, grammars, and encyclopedias. This paper reports on exploratory work that uses the Homeric epics to explore this wider topic and on the more general application of the results.

✇ International Journal on Digital Libraries

CH-Bench: a user-oriented benchmark for systems for efficient distant reading (design, performance, and insights)

1 December 2023, 00:00

Abstract

Data science deals with the discovery of information from large volumes of data. The data studied by scientists in the humanities include large textual corpora. An important objective is to study the ideas and expectations of a society regarding specific concepts, like “freedom” or “democracy,” both for today’s society and even more for societies of the past. Studying the meaning of words using large corpora requires efficient systems for text analysis, so-called distant reading systems. Making such systems efficient calls for a specification of the necessary functionality and clear expectations regarding typical workloads. But both are currently unclear, and there is no benchmark to evaluate distant reading systems. In this article, we propose such a benchmark, with the following innovations: As a first step, we collect and structure various information needs of the target users. We then formalize the notion of word context to facilitate the analysis of specific concepts. Using this notion, we formulate queries in line with the information needs of users. Finally, based on this, we propose concrete benchmark queries. To demonstrate the benefit of our benchmark, we conduct an evaluation with two objectives. First, we aim at insights regarding the content of different corpora, i.e., whether and how their size and nature (e.g., popular and broad literature or specific expert literature) affect results. Second, we benchmark different data management technologies. This has allowed us to identify performance bottlenecks.

✇ International Journal on Digital Libraries

DeepMetaGen: an unsupervised deep neural approach to generate template-based meta-reviews leveraging on aspect category and sentiment analysis from peer reviews

1 December 2023, 00:00

Abstract

Peer reviews form an essential part of scientific communication. Scholarly peer review is probably the most accepted way to evaluate research papers by involving multiple experts to review the concerned research independently. Usually, the area chair, the program chair, or the editor makes the decision by weighing the reviewers’ judgments, and communicates it to the author by writing a meta-review that summarizes the review comments. With the exponential rise in research paper submissions and the corresponding rise in the reviewer pool, it becomes stressful for the chairs/editors to manage conflicts, arrive at a consensus, and also write an informative meta-review. In this work, we propose a novel deep neural network-based approach for generating meta-reviews in an unsupervised fashion. To generate consistent meta-reviews, we use a generic template, where the task is to fill the template slots with the generated meta-review text. We consider the setting where only peer reviews with no summaries or meta-reviews are provided and propose an end-to-end neural network model to perform unsupervised opinion-based abstractive summarization. We first use an aspect-based sentiment analysis model, which classifies the review sentences with the corresponding aspects (e.g., novelty, substance, soundness, etc.) and sentiment. We then extract opinion phrases from reviews for the corresponding aspect and sentiment labels. Next, we train a transformer model to reconstruct the original reviews from these extractions. Finally, we filter the selected opinions according to their aspect and/or sentiment at the time of summarization. The selected opinions of each aspect are used as input to the trained Transformer model, which uses them to construct an opinion summary. The idea is to give a concise meta-review that maximizes information coverage by focusing on aspects and sentiment present in the review, coherence, readability, and redundancy. We evaluate our model on the human-written template-based meta-reviews to show that our framework outperforms competitive baselines. We believe that the template-based meta-review generation focusing on aspect and sentiment will help the editor/chair in decision-making and assist the meta-reviewer in writing better and more informative meta-reviews. We make our code available at https://github.com/sandeep82945/Unsupervised-meta-review-generation.

✇ International Journal on Digital Libraries

Implications of an ecospatial indigenous perspective on digital information organization and access

7 March 2023, 00:00

Abstract

The digitalisation of indigenous knowledge has been challenging considering epistemological differences and the lack of involvement of indigenous people. Drawing from our most recent community projects in Namibia, we share insights on indigenous ecospatial worldviews guiding the design of digital information organization and access of indigenous knowledge. With emerging technologies, such as augmented and virtual reality, offering new opportunities for richer and more meaningful spatial and embodied accounts of indigenous knowledge, we re-imagine digital libraries inclusive of indigenous people and their worldviews.

✇ International Journal on Digital Libraries

Transliterating Latin to Amharic scripts using user-defined rules and character mappings

1 March 2023, 00:00

Abstract

As social media platforms become increasingly accessible, individuals’ usage of new forms of textual communication (posts, comments, chats, etc.) on social media using local language scripts such as Amharic has increased tremendously. However, many users prefer to post comments in Latin scripts instead of local ones due to the availability of more convenient forms of character input using Latin keyboards. In existing Latin to Amharic transliteration systems, missing consideration of double consonants and double vowels has caused transliteration errors. Further, as there are multiple ways of character mapping conventions in existing systems, social media texts are susceptible to a wide variety of user adoptions during script production. The current systems have failed to address these gaps and adoptions. In this work, we present the RBLatAm (Rule-Based Latin to Amharic) transliteration system, a generic rule-based system that converts Amharic words which have been written using Latin script back into their native Amharic script. The system is based on mapping rules engineered from three existing transliteration systems (Microsoft, Google, SERA) and additional rules for double consonants, and conventions adopted on social media by speakers of Amharic. When tested on transliterated Amharic words of non-named entities, and named entities of persons, the system achieves an accuracy of 75.8% and 84.6%, respectively. The system also correctly transliterates words reported as errors in previous studies. This system drastically improves the basis for performing research on text mining for Amharic language texts by being able to process such texts even if they have originally been produced in Latin scripts.
