# News on eLiteracias

Knowledge and Information Systems

# Anonymous location sharing in urban area mobility

### Abstract

This work studies location-privacy-preserving location updates in the context of data-centric people mobility applications. The mobility model involves an urban area annotated city network (ACN) over which the users move and record/report their locations at non-regular intervals. The ACN is modeled as a directed weighted graph. Since the data receiver (e.g., an LBS provider) is curious in our privacy model, the users share their locations after anonymization, which requires k-member partitioning of the ACN. Our framework, in the offline stage, requires a prototype vertex selection for each of the partitions. To this end, we develop a heuristic to obtain more representative prototype vertices. The temporal dimension of location anonymity is achieved by two anonymity notions, called weak location k-anonymity (to provide snapshot location anonymity) and strong location k-anonymity (to provide historical location anonymity). The attack scenario models the belief of the attacker (the LBS provider) on the whereabouts of the users at each location update. In the online stage, our algorithms perform anonymity violation tests at every location update request and selectively block the anonymity-violating ones. The online stage algorithms providing weak/strong location k-anonymity are shown to run in constant time per location update. An extensive experimental evaluation on three real ACNs with simulated mobility, mainly addressing the privacy/utility trade-off, is presented.

• 1 July 2021, 00:00

# Application of genetic algorithm-based intuitionistic fuzzy weighted c-ordered-means algorithm to cluster analysis

### Abstract

With the advance of information technology, many fields have begun using data clustering to reveal data structures and obtain useful information. Most of the existing clustering algorithms are susceptible to outliers and noise as well as to the initial solution. The fuzzy c-ordered-means (FCOM) method can handle outlier and noise problems by using Huber’s M-estimators and Yager’s OWA operator to enhance its robustness. However, the result of the FCOM algorithm is still unstable because its initial centroids are randomly generated. Besides, the attributes’ weights also affect the clustering performance. Thus, this study first proposes an intuitionistic fuzzy weighted c-ordered-means (IFWCOM) algorithm that combines intuitionistic fuzzy sets (IFSs), feature weighting, and FCOM to improve the clustering result. Moreover, this study proposes a real-coded genetic algorithm-based IFWCOM (GA-IFWCOM) that employs the genetic algorithm to search for the globally optimal solution of the IFWCOM algorithm. Twelve benchmark datasets were used for verification in the experiment. According to the experimental results, the GA-IFWCOM algorithm achieved better clustering accuracy than the other clustering algorithms for most of the datasets.
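
As a point of reference for the OWA operator mentioned above, here is a minimal sketch of Yager's ordered weighted averaging in its textbook form. How FCOM combines it with Huber's M-estimators is not described in the abstract, so the function name `owa` and the example values are illustrative assumptions, not the authors' implementation.

```python
def owa(values, weights):
    """Ordered weighted average: weights attach to ranks, not to positions.

    Textbook form of Yager's operator (an illustrative sketch, not the
    FCOM code): sort the values in descending order, then take the
    weighted sum with non-negative weights that sum to 1.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    ordered = sorted(values, reverse=True)  # largest value gets weights[0]
    return sum(w * v for w, v in zip(weights, ordered))


# Special cases: all weight on the first rank gives the maximum,
# all weight on the last rank gives the minimum.
largest = owa([1.0, 5.0, 3.0], [1.0, 0.0, 0.0])
smallest = owa([1.0, 5.0, 3.0], [0.0, 0.0, 1.0])
```

Placing little or no weight on the extreme ranks is one way such an operator can damp the effect of outliers, which is presumably why it is useful for robustness.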

• 1 July 2021, 00:00

# Feature extraction for chart pattern classification in financial time series

### Abstract

Extracting shape-related features from a given query subsequence is a crucial preprocessing step for chart pattern matching in rule-based, template-based and hybrid pattern classification methods. The extracted features can significantly influence the accuracy of pattern recognition tasks during the data mining process. Although shape-related features are widely used for chart pattern matching in financial time series, the intrinsic properties of these features and their relationships to the patterns are rarely investigated in the research community. This paper aims to formally identify shape-related features used in chart patterns and investigates their impact on chart pattern classification in financial time series. We describe a comprehensive analysis of 14 shape-related features which can be used to classify 41 known chart patterns in the technical analysis domain. In order to evaluate their effectiveness, shape-related features are then translated into rules for chart pattern classification. We perform extensive experiments on real datasets containing historical price data of 24 stocks/indices to analyze the effectiveness of the rules. Experimental results reveal that the features put forward in this paper can be effectively used for recognizing chart patterns in financial time series. Our analysis also reveals that high-level features can be hierarchically composed from low-level features. Hierarchical composition allows construction of complex chart patterns from features identified in this paper. We hope that the features identified in this paper can be used as a reference model for future research in chart pattern analysis.

• 1 July 2021, 00:00

# A study on using data clustering for feature extraction to improve the quality of classification

### Abstract

There is a certain belief among data science researchers and enthusiasts alike that clustering can be used to improve classification quality. Insofar as this belief is fairly uncontroversial, it is also very general and therefore produces a lot of confusion around the subject. There are many ways of using clustering in classification and it obviously cannot always improve the quality of predictions, so a question arises, in which scenarios exactly does it help? Since we were unable to find a rigorous study addressing this question, in this paper, we try to shed some light on the concept of using clustering for classification. To do so, we first put forward a framework for incorporating clustering as a method of feature extraction for classification. The framework is generic w.r.t. similarity measures, clustering algorithms, classifiers, and datasets and serves as a platform to answer ten essential questions regarding the studied subject. Each answer is formulated based on a separate experiment on 16 publicly available datasets, followed by an appropriate statistical analysis. After performing the experiments and analyzing the results separately, we discuss them from a global perspective and form general conclusions regarding using clustering as feature extraction for classification.

• 1 July 2021, 00:00

# A meta-algorithm for finding large k-plexes

### Abstract

We focus on the automatic detection of communities in large networks, a challenging problem in many disciplines (such as sociology, biology, and computer science). Humans tend to associate to form families, villages, and nations. Similarly, the elements of real-world networks naturally tend to form highly connected groups. A popular model to represent such structures is the clique, that is, a set of fully interconnected nodes. However, it has been observed that cliques are too strict to represent communities in practice. The k-plex relaxes the notion of clique, by allowing each node to miss up to k connections. Although k-plexes are more flexible than cliques, finding them is more challenging as their number is greater. In addition, most of them are small and not significant. In this paper we tackle the problem of finding only large k-plexes (i.e., comparable in size to the largest clique) and design a meta-algorithm that can be used on top of known enumeration algorithms to return only significant k-plexes in a fraction of the time. Our approach relies on: (1) methods for strongly reducing the search space and (2) decomposition techniques based on the efficient computation of maximal cliques. We demonstrate experimentally that known enumeration algorithms equipped with our approach can run orders of magnitude faster than full enumeration.
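
To make the k-plex definition concrete, the following is a minimal membership check. It assumes the common convention that every node of a set S must be adjacent to at least |S| - k other members of S, so that a 1-plex is exactly a clique; the abstract's wording ("miss up to k connections") may count missing links slightly differently. The function name and the toy graph are hypothetical.

```python
def is_k_plex(adj, nodes, k):
    """Return True if `nodes` induces a k-plex in the undirected graph `adj`.

    `adj` maps each node to the set of its neighbours (no self-loops).
    Assumed convention: each member needs at least len(nodes) - k
    neighbours inside the set, so k = 1 corresponds to a clique.
    """
    members = set(nodes)
    required = len(members) - k
    return all(len(adj.get(v, set()) & members) >= required for v in members)


# Toy graph: a triangle 1-2-3 with a pendant node 4 attached to node 3.
toy_graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
triangle_is_clique = is_k_plex(toy_graph, {1, 2, 3}, 1)
```

This only verifies a candidate set; enumerating large k-plexes efficiently is the hard part that the meta-algorithm in the abstract addresses.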

• 1 July 2021, 00:00

# A streaming edge sampling method for network visualization

### Abstract

Visualization strategies facilitate streaming network analysis by allowing its exploration through graphical and interactive layouts. Depending on the strategy and the network density, such layouts may suffer from a high level of visual clutter that hides meaningful temporal patterns, highly active groups of nodes, bursts of activity, and other important network properties. Edge sampling improves layout readability, highlighting important properties and leading to easier and faster pattern identification and decision making. This paper presents Streaming Edge Sampling for Network Visualization (SEVis), a streaming edge sampling method that discards edges of low-activity nodes while preserving a distribution of edge counts that is similar to the original network. It can be applied to a variety of layouts to enhance streaming network analyses. We evaluated SEVis performance using synthetic and real-world networks through quantitative and visual analyses. The results indicate a higher performance of SEVis for clutter reduction and pattern identification when compared with other sampling methods.

• 1 July 2021, 00:00

# Attentive multi-task learning for group itinerary recommendation

### Abstract

Tourism is one of the largest service industries and a popular leisure activity that people enjoy with friends or family. A significant problem faced by tourists is how to plan sequences of points of interest (POIs) that maintain a balance between the group preferences and the given temporal and spatial constraints. Most traditional group itinerary recommendation methods adopt predefined preference aggregation strategies without considering the group members’ distinctive characteristics and inner relations. Besides, POI textual information is beneficial for capturing overall group preferences but is rarely considered. With these concerns in mind, this paper proposes an AMT-IRE (short for Attentive Multi-Task learning-based group Itinerary REcommendation) framework, which can dynamically learn the inner relations between group members and obtain consensus group preferences via the attention mechanism. Meanwhile, AMT-IRE integrates POI categories and POI textual information via another attention network. Finally, the group preferences are used in a variant of the orienteering problem to recommend group itineraries. Extensive experiments on six datasets validate the effectiveness of AMT-IRE.

• 1 July 2021, 00:00

# A word embedding-based approach to cross-lingual topic modeling

### Abstract

Cross-lingual topic analysis aims at extracting latent topics from corpora of different languages. Early approaches rely on high-cost multilingual resources (e.g., a parallel corpus), which are hard to come by in many real cases. Some works only require a translation dictionary as a linkage between languages; however, when given an inappropriate dictionary (e.g., one with small coverage), the cross-lingual topic model would shrink to a monolingual topic model and generate less diversified topics. Therefore, it is imperative to investigate a cross-lingual topic model requiring fewer bilingual resources. Recently, some space-mapping techniques have been proposed to help align multiple word embeddings of different languages into a quality cross-lingual word embedding by referring to a small number of translation pairs. This work proposes a cross-lingual topic model, called Cb-CLTM, which incorporates cross-lingual word embeddings. To leverage the power of word semantics and the linkage between languages from the cross-lingual word embedding, Cb-CLTM considers each word as a continuous embedding vector rather than a discrete word type. The experiments demonstrate that, when the cross-lingual word space exhibits strong isomorphism, Cb-CLTM can generate more coherent topics with higher diversity and induce better representations of documents across languages for further tasks such as cross-lingual document clustering and classification. When the cross-lingual word space is less isomorphic, Cb-CLTM generates less coherent topics yet still prevails in topic diversity and document classification.

• 1 June 2021, 00:00

# A novel cluster-based approach for keyphrase extraction from MOOC video lectures

### Abstract

Massive open online courses (MOOCs) have emerged as a great resource for learners. Numerous challenges remain to be addressed in order to make MOOCs more useful and convenient for learners. One such challenge is how to automatically extract a set of keyphrases from MOOC video lectures that can help students quickly identify the knowledge they want to learn and thus expedite their learning process. In this paper, we propose SemKeyphrase, an unsupervised cluster-based approach for keyphrase extraction from MOOC video lectures. SemKeyphrase incorporates a new semantic relatedness metric and a ranking algorithm, called PhraseRank, that ranks candidates in two phases. We conducted experiments on a real-world dataset of MOOC video lectures, and the results show that our proposed approach outperforms the state-of-the-art keyphrase extraction methods.

• 1 July 2021, 00:00

# Metro passengers counting and density estimation via dilated-transposed fully convolutional neural network

### Abstract

Metro passenger counting and density estimation are crucial for traffic scheduling and risk prevention. Although deep learning has achieved great success in passenger counting, most existing methods ignore fundamental appearance information, leading to density maps of low quality. To address this problem, we propose a novel counting method called “dilated-transposed fully convolutional neural network” (DT-CNN), which combines a feature extraction module (FEM) and a feature recovery module (FRM) to generate high-quality density maps and accurately estimate passenger counts in highly congested metro scenes. Specifically, the FEM is composed of a CNN, and a set of dilated convolutional layers extract 2D features relevant to scenes containing crowded human objects. Then, the resulting density map produced by the FEM is processed by the FRM to learn potential features, which are used to restore feature map pixels. The DT-CNN is end-to-end trainable and independent of the backbone fully convolutional network architecture. In addition, we introduce a new metro passenger counting dataset (Zhengzhou_MT++) that contains 396 images with 3,978 annotations. Extensive experiments conducted on self-built datasets and three representative crowd-counting datasets show that the proposed method achieves superior performance relative to other state-of-the-art methods in terms of counting accuracy and density map quality. The Zhengzhou_MT++ dataset is available at https://github.com/YellowChampagne/Zhengzhou_MT.

• 1 June 2021, 00:00

# Advancing synthesis of decision tree-based multiple classifier systems: an approximate computing case study

### Abstract

So far, multiple classifier systems have been increasingly designed to take advantage of hardware features, such as high parallelism and computational power. Indeed, compared to software implementations, hardware accelerators guarantee higher throughput and lower latency. Although the combination of multiple classifiers leads to high classification accuracy, the required area overhead makes the design of a hardware accelerator infeasible, hindering the adoption of commercial configurable devices. For this reason, in this paper, we exploit the approximate computing design paradigm to trade classification accuracy for reduced hardware area overhead. In particular, starting from trained decision tree (DT) models and employing a precision-scaling technique, we explore approximate decision tree variants by means of a multi-objective optimization problem, demonstrating a significant performance improvement targeting field-programmable gate array devices.

• 1 June 2021, 00:00

# Progressive approaches to flexible group skyline queries

### Abstract

The G-Skyline (GSky) query is formulated to report optimal groups that are not dominated by any other group of the same size. In particular, a given group $G_1$ dominates another group $G_2$ if each point $p \in G_1$ dominates or equals a point $p' \in G_2$ and, at the same time, at least one such point $p$ dominates its $p'$. Most existing group skyline queries need to calculate an aggregate point for each group. Compared to these queries, the GSky query is more practical because it avoids specifying an aggregate function, which can cause important results containing non-skyline points to be missed. This means the GSky query can return much more comprehensive results, which contain not only the G-Skylines consisting of skyline points but also the G-Skylines including non-skyline points. Here, a non-skyline point is one that is dominated by another point in a given data set. However, the GSky query usually returns too many results, making it a big burden for users to pick out their expected results. To address this issue, we investigate a flexible group skyline query, namely the Flexible G-Skyline (FGSky) query, which is flexible and practical for directly computing the optimal groups on the basis of user preferences. In this paper, we formulate the FGSky query, identify its properties, and present effective pruning strategies. Besides, we propose progressive algorithms for the FGSky query in which a grouping strategy and a layered strategy are utilized to get better query performance. Through extensive experiments on both synthetic and real data sets, we demonstrate the efficiency, effectiveness, and progressiveness of the proposed algorithms.
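
The group dominance relation can be illustrated with a brute-force check. This sketch assumes that smaller values are preferred in every dimension and that the points of the two groups are paired one-to-one (a permutation-based reading of the definition); both are assumptions, since the abstract does not spell them out. The function names are hypothetical, and trying all pairings is only practical for small groups.

```python
from itertools import permutations


def dominates_or_equals(p, q):
    """p is no worse than q in every dimension (smaller is better, assumed)."""
    return all(a <= b for a, b in zip(p, q))


def dominates(p, q):
    """p is no worse everywhere and strictly better somewhere."""
    return dominates_or_equals(p, q) and tuple(p) != tuple(q)


def group_dominates(g1, g2):
    """Brute-force test of whether group g1 dominates group g2.

    Tries every one-to-one pairing of g1's points with g2's points and
    accepts if, for some pairing, every pair satisfies dominates_or_equals
    and at least one pair satisfies strict dominance.
    """
    if len(g1) != len(g2):
        raise ValueError("groups must have the same size")
    first = [tuple(p) for p in g1]
    for pairing in permutations([tuple(q) for q in g2]):
        pairs = list(zip(first, pairing))
        if all(dominates_or_equals(p, q) for p, q in pairs) and any(
            dominates(p, q) for p, q in pairs
        ):
            return True
    return False


group_a = [(1, 1), (2, 2)]
group_b = [(2, 2), (1, 2)]
a_beats_b = group_dominates(group_a, group_b)
```

In this toy case `(1, 1)` can be paired with `(1, 2)` and `(2, 2)` with `(2, 2)`, so `group_a` dominates `group_b` but not the other way round.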

• 1 June 2021, 00:00

# Efficient unsupervised drift detector for fast and high-dimensional data streams

### Abstract

Stream mining considers the online arrival of examples at high speed and the possibility of changes in its descriptive features or class definitions compared with past knowledge (i.e., concept drifts). The fast detection of drifts is essential to keep the predictive model updated and stable in changing environments. For many applications, such as those related to smart sensors, the high number of features is an additional challenge in terms of memory and time for stream processing. This paper presents an unsupervised and model-independent concept drift detector suitable for high-speed and high-dimensional data streams. We propose a straightforward two-dimensional data representation that allows the faster processing of datasets with a large number of examples and dimensions. We developed an adaptive drift detector on this visual representation that is efficient for fast streams with thousands of features and is as accurate as existing costly methods that perform various statistical tests considering each feature individually. Our method achieves better performance measured by execution time and accuracy in classification problems for different types of drifts. The experimental evaluation considering synthetic and real data demonstrates the method’s versatility in several domains, including entomology, medicine, and transportation systems.

• 1 June 2021, 00:00

# Discovering cluster evolution patterns with the Cluster Association-aware matrix factorization

### Abstract

Tracking of document collections over time (or across domains) is helpful in several applications such as finding dynamics of terminologies, identifying emerging and evolving trends, and concept drift detection. We propose a novel ‘Cluster Association-aware’ Non-negative Matrix Factorization (NMF)-based method with graph-based visualization to identify the changing dynamics of text clusters over time/domains. NMF is utilized to find similar clusters in the set of clustering solutions. Based on the similarities, four major lifecycle states of clusters, namely birth, split, merge and death, are tracked to discover their emergence, growth, persistence and decay. The novel concepts of ‘cluster associations’ and term frequency-based ‘cluster density’ are used to improve the quality of evolution patterns. The cluster evolution is visualized using a k-partite graph. Empirical analysis with text data shows that the proposed method produces accurate and efficient solutions compared with the state-of-the-art methods.

• 1 June 2021, 00:00

# Comparing ontologies and databases: a critical review of lifecycle engineering models in manufacturing

### Abstract

The literature on the modeling and management of data generated through the lifecycle of a manufacturing system is split into two main paradigms: product lifecycle management (PLM) and product, process, resource (PPR) modeling. These paradigms are complementary, and the latter could be considered a more neutral version of the former. There are two main technologies associated with these paradigms: ontologies and databases. Database technology is widespread in industry and is well established. Ontologies remain largely a plaything of the academic community which, despite numerous projects and publications, have seen limited implementations in industrial manufacturing applications. The main objective of this paper is to provide a comparison between ontologies and databases, offering both qualitative and quantitative analyses in the context of PLM and PPR. To achieve this, the article presents (1) a literature review within the context of manufacturing systems that use databases and ontologies, identifying their respective strengths and weaknesses, and (2) an implementation in a real industrial scenario that demonstrates how different modeling approaches can be used for the same purpose. This experiment is used to enable discussion and comparative analysis of both modeling strategies.

• 1 June 2021, 00:00

# Deep graph transformation for attributed, directed, and signed networks

### Abstract

Generalized from image and language translation, the goal of graph translation or transformation is to generate a graph of the target domain on the condition of an input graph of the source domain. Existing works are limited to either merely generating the node attributes of graphs with fixed topology or only generating the graph topology without allowing the node attributes to change. They are prevented from simultaneously generating both node and edge attributes due to: (1) difficulty in modeling the iterative, interactive, and asynchronous process of both node and edge translation and (2) difficulty in learning and preserving the inherent consistency between the nodes and edges in generated graphs. A general, end-to-end framework for jointly generating node and edge attributes is needed for real-world problems. In this paper, this generic problem of multi-attributed graph translation is named and a novel framework coherently accommodating both node and edge translations is proposed. The proposed generic edge translation path is also proven to be a generalization of existing topology translation models. Then, in order to discover and preserve the consistency of the generated nodes and edges, a spectral graph regularization based on our nonparametric graph Laplacian is designed. In addition, two extensions of the proposed model are developed for signed and directed graph translation. Lastly, comprehensive experiments on both synthetic and real-world practical datasets demonstrate the power and efficiency of the proposed method.

• 1 June 2021, 00:00

# The impact of data difficulty factors on classification of imbalanced and concept drifting data streams

### Abstract

Class imbalance introduces additional challenges when learning classifiers from concept drifting data streams. Most existing work focuses on designing new algorithms for dealing with the global imbalance ratio and does not consider other data complexities. Independent research on static imbalanced data has highlighted the influential role of local data difficulty factors such as minority class decomposition and the presence of unsafe types of examples. Although such factors are often present in real-world data, their interactions with concept drifts have not yet been investigated in concept drifting data streams. We thoroughly study the impact of such interactions on drifting imbalanced streams. For this purpose, we put forward a new categorization of concept drifts for class imbalanced problems. Through comprehensive experiments with synthetic and real data streams, we study the influence of concept drifts, global class imbalance, local data difficulty factors, and their combinations, on predictions of representative online classifiers. Experimental results reveal the high influence of the newly considered factors and their local drifts, as well as differences in existing classifiers’ reactions to such factors. Combinations of multiple factors are the most challenging for classifiers. Although existing classifiers are partially capable of coping with global class imbalance, new approaches are needed to address challenges posed by imbalanced data streams.

• 1 June 2021, 00:00

# Efficient discovery of co-location patterns from massive spatial datasets with or without rare features

### Abstract

A co-location pattern indicates a group of spatial features whose instances are frequently located together in close geographic proximity. Spatial co-location pattern mining (SCPM) is valuable for many practical applications. Numerous previous SCPM studies emphasize equal participation per feature. As a result, interesting co-locations with rare features cannot be captured. In this paper, we propose a novel interest measure, i.e., the weighted participation index (WPI), to identify co-locations with or without rare features. The WPI measure possesses a conditional anti-monotone property which can be utilized to prune the search space. In addition, a fast row instance identification mechanism based on the ordered NR-tree is proposed to enhance efficiency. Subsequently, the ordered NR-tree-based algorithm is developed. To further improve efficiency and process massive spatial data, we break the ordered NR-tree into multiple independent subtrees, and parallelize the ordered NR-tree-based algorithm on the MapReduce framework. Extensive experiments are conducted on both real and synthetic datasets to verify the effectiveness, efficiency and scalability of our techniques.

• 1 June 2021, 00:00

# Incremental communication patterns in online social groups

### Abstract

In the last decades, temporal networks have played a key role in modelling, understanding, and analysing the properties of dynamic systems where individuals and events vary in time. Of paramount importance is the representation and analysis of Social Media, in particular Social Networks and Online Communities, through temporal networks, due to their intrinsic dynamism (social ties, online/offline status, users’ interactions, etc.). The identification of recurrent patterns in Online Communities, and specifically in Online Social Groups, is an important challenge which can reveal information concerning the structure of the social network, but also patterns of interactions, trending topics, and so on. Different works have already investigated pattern detection in several scenarios by focusing mainly on identifying the occurrences of fixed and well-known motifs (mostly triads) or more flexible subgraphs. In this paper, we present the concept of Incremental Communication Patterns, which lie in between motifs, from which they inherit the meaningfulness of the identified structure, and subgraphs, from which they inherit the possibility of being extended as needed. We formally define the Incremental Communication Patterns and exploit them to investigate the interaction patterns occurring in a real dataset consisting of 17 Online Social Groups taken from the list of Facebook groups. The results of our experimental analysis uncover interesting aspects of the interaction patterns occurring in social groups and reveal that Incremental Communication Patterns are able to capture the roles of the users within the groups.

• 1 June 2021, 00:00

# Kernel-based regression via a novel robust loss function and iteratively reweighted least squares

### Abstract

Least squares kernel-based methods have been widely used in regression problems due to their simple implementation and good generalization performance. Among them, least squares support vector regression (LS-SVR) and extreme learning machine (ELM) are popular techniques. However, noise sensitivity is a major bottleneck. To address this issue, a generalized loss function, called $\ell_s$-loss, is proposed in this paper. With the support of this novel loss function, two kernel-based regressors are constructed by replacing the $\ell_2$-loss in LS-SVR and ELM with the proposed $\ell_s$-loss for better noise robustness. Important properties of the $\ell_s$-loss, including robustness, asymmetry and asymptotic approximation behaviors, are verified theoretically. Moreover, iteratively reweighted least squares is utilized to optimize and interpret the proposed methods from a weighted viewpoint. The convergence of the proposal is proved, and detailed analyses of robustness are given. Experiments on both artificial and benchmark datasets confirm the validity of the proposed methods.

• 1 May 2021, 00:00

# Learning diffusion model-free and efficient influence function for influence maximization from information cascades

### Abstract

When considering the problem of influence maximization from information cascades, one essential component is influence estimation. Traditional approaches for influence estimation generally follow a two-stage framework, i.e., learn a hypothetical diffusion model from information cascades and then calculate the influence spread according to the learned diffusion model via Monte Carlo simulation or heuristic approximation. The effectiveness of these approaches relies heavily on the correctness of the diffusion model, so they suffer from the problem of model misspecification. Meanwhile, these approaches are inefficient when influence estimation is conducted via many Monte Carlo simulations. In this paper, without assuming a diffusion model a priori, we directly learn a monotone and submodular influence function from information cascades. Once the influence function is obtained, a greedy algorithm is applied to efficiently solve influence maximization. Experimental results on both synthetic and real-world datasets show the effectiveness and efficiency of the learned influence function for both influence estimation and influence maximization tasks.
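
The final greedy step can be sketched as follows. The learned influence function is not given in the abstract, so this example substitutes a toy coverage function (which is monotone and submodular) as a stand-in; `greedy_seed_selection`, `toy_reach` and `coverage_influence` are hypothetical names used only for illustration.

```python
def greedy_seed_selection(candidates, influence, budget):
    """Standard greedy selection for a monotone submodular set function.

    At each step, add the candidate with the largest marginal gain in
    `influence` until `budget` seeds have been chosen.
    """
    seeds = []
    remaining = list(candidates)  # keep input order so ties break predictably
    for _ in range(min(budget, len(remaining))):
        base = influence(seeds)
        best = max(remaining, key=lambda v: influence(seeds + [v]) - base)
        seeds.append(best)
        remaining.remove(best)
    return seeds


# Stand-in influence function (an assumption, not the learned model):
# the number of distinct users reached by the chosen seeds.
toy_reach = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}


def coverage_influence(seeds):
    reached = set()
    for s in seeds:
        reached |= toy_reach[s]
    return len(reached)


chosen = greedy_seed_selection(["a", "b", "c", "d"], coverage_influence, 2)
```

For monotone submodular functions this greedy rule carries the well-known (1 - 1/e) approximation guarantee, which is why learning an influence function with those two properties makes the maximization step tractable.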

• 1 May 2021, 00:00

# Unifying community detection and network embedding in attributed networks

### Abstract

Traditionally, community detection and network embedding are two separate tasks. Network embedding aims to output a vector representation for each node in the network, and community detection aims to find all densely connected groups of nodes and separate them well from others. Most of the existing approaches perform community detection and network embedding separately and ignore node attribute information, which leads to poor results. In this paper, we propose a novel model that jointly solves the network embedding and community detection problems. The model can make use of the network's local information, global information and node attribute information collaboratively. We empirically show that, by solving these two problems jointly, the model not only greatly improves community detection but also learns better network embeddings than advanced baseline methods. We evaluate the proposed model on several datasets, and the experimental results show its effectiveness.

• 1 May 2021, 00:00

# Cost-sensitive selection of variables by ensemble of model sequences

### Abstract

Many applications require the collection of data on different variables or measurements over many system performance metrics. We term those broadly as measures or variables. Often data collection along each measure incurs a cost, thus it is desirable to consider the cost of measures in modeling. This is a fairly new class of problems in the area of cost-sensitive learning. A few attempts have been made to incorporate costs in combining and selecting measures. However, existing studies either do not strictly enforce a budget constraint, or are not the ‘most’ cost effective. With a focus on classification problems, we propose a computationally efficient approach that could find a near optimal model under a given budget by exploring the most ‘promising’ part of the solution space. Instead of outputting a single model, we produce a model schedule—a list of models, sorted by model costs and expected predictive accuracy. This could be used to choose the model with the best predictive accuracy under a given budget, or to trade off between the budget and the predictive accuracy. Experiments on some benchmark datasets show that our approach compares favorably to competing methods.
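
One way to read the "model schedule" idea is as a lookup table from which the best affordable model is picked. The sketch below assumes a schedule stored as a list of records with `cost` and `accuracy` fields; the field names, the helper `best_under_budget` and the numbers are all hypothetical and are not taken from the paper.

```python
def best_under_budget(schedule, budget):
    """Pick the schedule entry with the highest expected accuracy
    whose cost does not exceed `budget`; return None if none is affordable."""
    affordable = [m for m in schedule if m["cost"] <= budget]
    if not affordable:
        return None
    return max(affordable, key=lambda m: m["accuracy"])


# Hypothetical schedule, sorted by cost; values are made up for illustration.
example_schedule = [
    {"name": "m1", "cost": 2.0, "accuracy": 0.71},
    {"name": "m2", "cost": 5.0, "accuracy": 0.80},
    {"name": "m3", "cost": 9.0, "accuracy": 0.84},
]

pick = best_under_budget(example_schedule, budget=6.0)
```

Scanning the same list for increasing budgets shows how much accuracy each extra unit of cost buys, which is the budget versus accuracy trade-off the abstract refers to.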

• 1 May 2021, 00:00

# Expert-driven trace clustering with instance-level constraints

### Abstract

Within the field of process mining, several different trace clustering approaches exist for partitioning traces or process instances into similar groups. Typically, this partitioning is based on certain patterns or similarity between the traces, or driven by the discovery of a process model for each cluster. The main drawback of these techniques, however, is that their solutions are usually hard for domain experts to evaluate or justify. In this paper, we present two constrained trace clustering techniques that are capable of leveraging expert knowledge in the form of instance-level constraints. In an extensive experimental evaluation using two real-life datasets, we show that our novel techniques are indeed capable of producing clustering solutions that are more justifiable without a substantial negative impact on their quality.

• 1 May 2021, 00:00

# Towards metrics-driven ontology engineering

### Abstract

The software engineering field is continuously making an effort to improve the effectiveness of the software development process. This improvement is performed by developing quantitative measures that can be used to enhance the quality of software products and to more accurately describe, better understand and manage the software development life cycle. Even though the ontology engineering field is constantly adopting practices from software engineering, it has not yet reached a state in which metrics are an integral part of ontology engineering processes and support evidence-based decisions about the process and its outputs. Up to now, ontology metrics have mainly focused on the ontology implementation and do not take into account the development process or other artefacts that can help assess the quality of the ontology, e.g. its requirements. This work envisions the need for a metrics-driven ontology engineering process and, as a first step, presents a set of metrics for ontology engineering which are obtained from artefacts generated during the ontology development process and from the process itself. The approach is validated by measuring the ontology engineering process carried out in a research project and by showing how the proposed metrics can be used to improve the efficiency of the process by making predictions, such as the effort needed to implement an ontology, or assessments, such as the coverage of the ontology according to its requirements.

• 1 April 2021, 00:00