News in eLiteracias

Yesterday — 29 November 2022 — Knowledge and Information Systems

DNETC: dynamic network embedding preserving both triadic closure evolution and community structures

Abstract

Network embedding, a central issue in deep learning preprocessing on social networks, aims to transform network elements (vertices) into a low-dimensional latent vector space while preserving the topology and properties of the network. However, most existing methods focus on static networks, neglecting the dynamic characteristics of real social networks. An explanation of the fundamental dynamic mechanism of social network evolution is still lacking. We design a novel dynamic network embedding approach preserving both triadic closure evolution and community structures (DNETC). First, three factors, the popularity of vertices, the proximity of vertices, and the community structures, are incorporated, relying on the triadic closure principle in social networks. Second, the triadic closure loss function, the community loss function, and the temporal smoothness loss function are constructed and combined to optimize DNETC. Finally, a low-dimensional representation of a dynamic social network is obtained that preserves both the evolution patterns of microscopic vertices and the structural information of macroscopic communities. Experiments on the classical tasks of link prediction, link reconstruction, and changed-link reconstruction and prediction demonstrate the superiority of DNETC over state-of-the-art methods. The first set of experimental results validates the effectiveness of adopting the triadic closure process and community structures to improve the quality of the learned low-dimensional vectors. The last set further examines the parameter sensitivity of DNETC with respect to the analysis task. This work provides a new idea for dynamic network embedding to reflect the real evolution characteristics of networks and enhance the effect of network analysis tasks. The code is available at https://github.com/YangMin-10/DNETC.
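
The abstract combines three named loss terms into one objective. The numpy sketch below shows how such a weighted combination could look; the triad scoring rule, the community term, and the weights alpha and beta are illustrative placeholders, not the paper's actual formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def triadic_loss(z, triads, closed):
    """Toy triadic-closure term: for each open triad (i, j, k) centered on j,
    score how likely the open edge (i, k) is to close from embedding distance,
    and compare against the observed 0/1 label `closed`."""
    i, _, k = triads.T
    d = np.sum((z[i] - z[k]) ** 2, axis=1)   # distance between the two open ends
    p = sigmoid(-d)                          # closer embeddings -> more likely to close
    eps = 1e-9
    return -np.mean(closed * np.log(p + eps) + (1 - closed) * np.log(1 - p + eps))

def dnetc_style_loss(z_t, z_prev, triads, closed, community_term, alpha=1.0, beta=0.5):
    """Weighted sum of the three terms named in the abstract; `community_term`
    is a placeholder scalar for the community loss."""
    smooth = np.mean(np.sum((z_t - z_prev) ** 2, axis=1))   # temporal smoothness
    return triadic_loss(z_t, triads, closed) + alpha * community_term + beta * smooth
```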

  • 27 November 2022, 00:00

Dynamic ensemble selection classification algorithm based on window over imbalanced drift data stream

Abstract

Data stream classification is an important research direction in the field of data mining, but in many practical applications it is impossible to collect the complete training set at one time, and the data may be imbalanced and interspersed with concept drift, which greatly affects classification performance. To this end, an online dynamic ensemble selection classification algorithm based on a window over an imbalanced drift data stream (DESW-ID) is proposed. The algorithm employs several balancing measures: it first resamples the data stream using a Poisson distribution, and if the stream is in a highly imbalanced state, secondary sampling is performed using a window that stores minority-class instances to restore the current balance of the data. To improve processing efficiency, a classifier selection ensemble is proposed to dynamically adjust the number of classifiers, and the algorithm runs with an ADWIN detector to detect the presence of concept drift. The experimental results show that the proposed algorithm ranks first on average in all five classification performance metrics compared to state-of-the-art methods. The proposed algorithm therefore classifies imbalanced data streams with concept drift better and also improves operating efficiency.
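
A minimal sketch of the two balancing steps described above, assuming a scikit-learn-style learner with partial_fit (initialized with the full class list beforehand); the window size, the Poisson rates, and the replay count are illustrative guesses, and the ADWIN detector and classifier selection ensemble are omitted.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(42)
minority_window = deque(maxlen=200)   # recent minority-class instances

def train_on_instance(learner, x, y, minority_label, imbalance_ratio):
    """Poisson online resampling: each instance is replicated Poisson(lam)
    times, with a larger lam for the minority class."""
    lam = imbalance_ratio if y == minority_label else 1.0
    for _ in range(rng.poisson(lam)):
        learner.partial_fit([x], [y])
    if y == minority_label:
        minority_window.append((x, y))

def secondary_sampling(learner, k=50):
    """If the stream is still highly imbalanced, replay up to k stored
    minority-class instances from the window."""
    for x, y in list(minority_window)[-k:]:
        learner.partial_fit([x], [y])
```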

  • 27 November 2022, 00:00
Before yesterday — Knowledge and Information Systems

KLECA: knowledge-level-evolution and category-aware personalized knowledge recommendation

Abstract

Knowledge recommendation plays a crucial role in online learning platforms. It aims to optimize service quality so as to improve users' learning efficiency and outcomes. Existing approaches generally combine RNN-based methods with attention mechanisms to learn user preferences. They lack an in-depth understanding of how users' knowledge levels change over time and of the impact of knowledge item categories on recommendation performance. To this end, we propose the knowledge-level-evolution and category-aware personalized knowledge recommendation (KLECA) model. The model first leverages a bidirectional GRU and a time adjustment function to understand users' learning evolution by analyzing their learning trajectory data. Second, it considers the effect of item categories and descriptive information and enhances the accuracy of knowledge recommendation by introducing a cross-head decorrelation module to capture the information of knowledge items based on a multi-head attention mechanism. In addition, a personalized attention mechanism and a gated function are introduced to capture the relationships among items, item categories, and the user's learning trajectory to strengthen the representation of information. Through extensive experiments on real-world data collected from an online learning platform, the proposed approach has been shown to significantly outperform other approaches.

  • 24 November 2022, 00:00

Concept drift detection and accelerated convergence of online learning

Abstract

Streaming data has become an important data form in the era of big data, and concept drift, one of its most important problems, has been studied deeply. However, just like true concept drift, noise and overly small training samples also cause classification performance to fluctuate, which is easy to confuse with true concept drift. To solve this problem, an improved concept drift detection method is proposed, and the accelerated convergence of the model after concept drift is also studied. First, the effective fluctuation sites are obtained by a group detection method. Second, the authenticity of the concept drift is determined by tracking the testing accuracy of reference sites near each effective fluctuation site. Lastly, in the convergence acceleration stage, the time sequential distance is designed to measure the similarity of sequential data blocks during different time periods, and the noncritical disturbance data with the largest time sequential distance are removed sequentially to improve the convergence speed of the model after concept drift occurs. The experimental results demonstrate that the proposed method not only distinguishes true from false concept drift better but also improves the convergence speed of the model.

  • 23 November 2022, 00:00

Logical design of multi-model data warehouses

Abstract

Multi-model DBMSs, which support different data models with a fully integrated backend, have been shown to be beneficial to data warehouses and OLAP systems. Indeed, they can store data according to the multidimensional model and, at the same time, let each element be represented through the most appropriate model. An open challenge in this context is the lack of methods for logical design. Indeed, in a multi-model context, several alternatives emerge for the logical representation of dimensions and facts. The goal of this paper is to devise a set of guidelines for the logical design of multi-model data warehouses so that the designer can achieve the best trade-off between features such as querying, storage, and ETL. To this end, for each model considered (relational, document-based, and graph-based) and for each type of multidimensional element (e.g., non-strict hierarchy), we propose some solutions and carry out a set of intra-model and inter-model comparisons. The resulting guidelines are then tested on a case study that features all types of multidimensional elements.

  • 15 November 2022, 00:00

A systematic construction of non-i.i.d. data sets from a single data set: non-identically distributed data

Abstract

Data-driven models strongly depend on data. For research and academic purposes, public data sets are usually considered and analyzed; for example, most machine learning algorithms are applied and tested using the UCI Machine Learning Repository. There is currently a need for non-i.i.d. data sets for distributed machine learning, where i.i.d. stands for independent and identically distributed. An example of this need is federated learning, where the typical scenario considers a set of agents, each with its own data set. Agents are typically heterogeneous, so it is not appropriate to assume that their data follow the same distributions. In this paper we propose an approach to build non-identically distributed data sets from a single data set for machine learning classification, where we may or may not suppose that all instances follow the same distribution. Each device will have only instances of a subset of the classes. The approach uses optimization to distribute the data set into a set of subsets, each following a different distribution. Our goal is to define an approach for building training subsets that is as systematic as the approaches used for cross-validation/k-fold validation.
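
A simple way to realize the "each device sees only a subset of the classes" idea is a label-skew split; the sketch below is a plain random version, whereas the paper distributes instances via optimization.

```python
import numpy as np

def label_skew_split(y, n_devices, classes_per_device, seed=0):
    """Assign each device a random subset of classes, then hand it only
    instances of those classes. Classes claimed by several devices are
    split evenly among them."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    device_classes = [rng.choice(classes, classes_per_device, replace=False)
                      for _ in range(n_devices)]
    idx_by_class = {c: rng.permutation(np.where(y == c)[0]) for c in classes}
    owners = {c: [] for c in classes}
    for d, cs in enumerate(device_classes):
        for c in cs:
            owners[c].append(d)
    device_idx = [[] for _ in range(n_devices)]
    for c, ds in owners.items():
        if ds:
            for d, part in zip(ds, np.array_split(idx_by_class[c], len(ds))):
                device_idx[d].extend(part.tolist())
    return [np.array(ix) for ix in device_idx]
```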

  • 11 November 2022, 00:00

Imbalanced data preprocessing techniques for machine learning: a systematic mapping study

Abstract

Machine Learning (ML) algorithms have been increasingly replacing people in several application domains, most of which suffer from data imbalance. To solve this problem, published studies implement data preprocessing techniques, cost-sensitive learning, and ensemble learning. These solutions reduce the naturally occurring bias towards the majority class in ML. This study uses a systematic mapping methodology to assess 9927 papers related to sampling techniques for ML in imbalanced data applications from 7 digital libraries. A filtering process selected 35 representative papers from various domains, such as health, finance, and engineering. As a result of a thorough quantitative analysis of these papers, this study proposes two taxonomies, illustrating sampling techniques and ML models. The results indicate that oversampling and classical ML are the most common preprocessing techniques and models, respectively. However, solutions with neural networks and ensemble ML models perform best, with potentially better results through hybrid sampling techniques. Finally, none of the 35 works applies simulation-based synthetic oversampling, indicating a path for future preprocessing solutions.

  • 9 November 2022, 00:00

Bold driver and static restart fused adaptive momentum for visual question answering

Abstract

Stacked attention networks (SANs) are among the most classic models for visual question answering (VQA) and have effectively promoted VQA research. Existing literature used momentum to optimize SANs and obtained impressive results. However, error analysis shows that the fixed global learning rate in momentum makes it easy to fall into a local optimum. Many learning rate adaptation (LRA) algorithms (e.g., static restart, bold driver) have been proposed to solve this issue by adjusting the global learning rate, but they still have defects. For example, static restart uses too high a restart learning rate and adapts the global learning rate blindly; bold driver avoids this blindness but sets its adaptive parameters improperly. To solve these issues, we fuse bold driver and static restart (BDSR) into momentum to devise our method, called bold driver and static restart fused adaptive momentum (BDSRM). We then analyze its optimization process and time complexity and conduct quantitative experiments on VQAv1, Cifar-10, and similar models to verify that BDSRM outperforms state-of-the-art optimization algorithms on SANs. Afterward, we perform ablation and visualization experiments to verify that BDSR is effective.
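
For reference, the classic bold-driver rule that the abstract builds on fits in a few lines; the growth and shrink factors below are conventional illustrative values, not the paper's settings.

```python
def bold_driver(lr, prev_loss, curr_loss, grow=1.05, shrink=0.5):
    """Bold driver: grow the learning rate gently while the loss keeps
    falling; cut it sharply as soon as the loss increases."""
    return lr * grow if curr_loss < prev_loss else lr * shrink

lr = 0.1
for prev, curr in [(1.0, 0.8), (0.8, 0.7), (0.7, 0.9)]:
    lr = bold_driver(lr, prev, curr)   # grows twice, then is halved
```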

  • 9 November 2022, 00:00

Information extraction from electronic medical documents: state of the art and future research directions

Abstract

In the medical field, doctors must maintain comprehensive knowledge by reading and writing narrative documents, and they are responsible for every decision they make for patients. Unfortunately, it is very tiring to read all the necessary information about drugs, diseases, and patients, given the large and ever-growing volume of documents. Consequently, many medical errors can happen, and some even kill people. Information extraction is an important field that can handle this problem. It comprises several tasks for extracting the important and desired information from unstructured text written in natural language. The principal tasks are named entity recognition and relation extraction, since they can structure the text by extracting the relevant information. However, in order to treat narrative text we should use natural language processing techniques to extract useful information and features. In this paper, we introduce and discuss the techniques and solutions used in these tasks. Furthermore, we outline the challenges in information extraction from medical documents. To our knowledge, this is the most comprehensive survey in the literature with an experimental analysis and suggestions for some uncovered directions.

  • 8 November 2022, 00:00

SVM-based subspace optimization domain transfer method for unsupervised cross-domain time series classification

Abstract

Time series classification on edge devices has received considerable attention in recent years, and it is often conducted on the assumption that the training and testing data are drawn from the same distribution. However, in practical IoT applications this assumption does not hold, due to variations in installation positions, precision errors, and sampling frequencies of edge devices. To tackle this problem, we propose a new SVM-based domain transfer method called subspace optimization transfer support vector machine (SOTSVM) for cross-domain time series classification. SOTSVM aims to learn a domain-invariant SVM classifier in which (1) global projected distribution alignment jointly exploits the marginal distribution discrepancy, geometric structure, and distribution scatter to reduce the global distribution discrepancy between the source and target domains; (2) feature grouping divides the features into highly transferable features (HTF) and lowly transferable features (LTF), where the importance of HTF is preserved and that of LTF is suppressed in training the domain-invariant classifier; and (3) empirical risk minimization improves the discrimination of SOTSVM. We formulate a minimization problem that integrates global projected distribution alignment, feature grouping, and empirical risk minimization into a joint SVM framework and give an effective optimization algorithm. Furthermore, we present a multiple-kernel extension of SOTSVM. Experimental results on three sets of cross-domain time series datasets show that our method outperforms several state-of-the-art conventional transfer learning methods as well as methods without transfer learning.

  • 7 November 2022, 00:00

New cosine similarity and distance measures for Fermatean fuzzy sets and TOPSIS approach

Abstract

The most straightforward ways to check the degrees of similarity and difference between two sets are distance and cosine similarity metrics. Cosine similarity is the cosine of the angle between two n-dimensional vectors in n-dimensional space. Even when two vectors differ in magnitude, cosine similarity can readily find commonalities, since it deals only with the angle between them. It is widely used because it is simple, well suited to sparse data, and depends on the angle between two vectors rather than their magnitudes. The distance function is an elegant and canonical quantitative tool to measure the similarity or difference between two sets. This work presents new distance and cosine similarity measures for Fermatean fuzzy sets. We first present the definitions of the new measures and explore their properties. Since the cosine measure does not satisfy the axiom of a similarity measure, we then propose a method to construct other similarity measures between Fermatean fuzzy sets, based on the proposed cosine similarity and Euclidean distance measures, that does satisfy the axiom. Furthermore, we obtain a cosine distance measure between Fermatean fuzzy sets from the relationship between similarity and distance measures, and we extend the technique for order of preference by similarity to ideal solution (TOPSIS) to the proposed cosine distance measure, which can handle related decision-making problems not only from the viewpoint of geometry but also from the viewpoint of algebra. Finally, we give a practical example to illustrate the reasonableness and effectiveness of the proposed method, which is also compared with other existing methods.
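
For context, a Fermatean fuzzy set assigns each element membership and non-membership grades mu and nu with mu^3 + nu^3 <= 1. The sketch below implements one cosine similarity form commonly used in this literature, with the cubes reflecting the Fermatean setting; the measures actually proposed in the paper differ in detail.

```python
import numpy as np

def ffs_cosine(mu_a, nu_a, mu_b, nu_b, eps=1e-12):
    """Cosine similarity between two Fermatean fuzzy sets: treat each element's
    (mu^3, nu^3) pair as a 2-d vector, take cosines element-wise, and average."""
    a = np.stack([mu_a ** 3, nu_a ** 3], axis=1)
    b = np.stack([mu_b ** 3, nu_b ** 3], axis=1)
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return float(np.mean(num / den))

# identical sets have similarity 1
mu = np.array([0.9, 0.5]); nu = np.array([0.3, 0.7])
print(ffs_cosine(mu, nu, mu, nu))   # 1.0 (up to rounding)
```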

  • 4 November 2022, 00:00

A storytree-based model for inter-document causal relation extraction from news articles

Abstract

With more and more news articles appearing on the Internet, discovering causal relations between news articles is very important for understanding how news develops. Extracting causal relations between news articles is an inter-document relation extraction task. Existing work on relation extraction cannot solve it well for two reasons: (1) most relation extraction models are intra-document models, which focus on relation extraction between entities; however, news articles are many times longer and more complex than entities, which makes inter-document relation extraction harder than intra-document extraction. (2) Existing inter-document relation extraction models rely on similarity information between news articles, which can limit extraction performance. In this paper, we propose an inter-document model based on storytree information to extract causal relations between news articles. We incorporate storytree information into integer linear programming (ILP) and design storytree constraints for the ILP objective function. Experimental results show that all the constraints are effective and that the proposed method outperforms widely used machine learning models and a state-of-the-art deep learning model, with F1 improved by more than 5% on three different datasets. Further analysis shows that the five constraints in our model improve the results to varying degrees and that their effects differ across the three datasets. The experiment on link features also suggests a positive influence of link information.

  • 3 November 2022, 00:00

A prediction model of student performance based on self-attention mechanism

Abstract

Performance prediction is an important research facet of educational data mining. Most models extract student behavior features from campus card data for prediction. However, most of these methods have coarse time granularity, difficulty in extracting useful high-order behavior combination features, dependence on historical achievements, etc. To solve these problems, this paper takes prediction of grade point average (GPA prediction) and of whether a student has failing subjects (failing prediction) in a term as the goals of performance prediction and proposes a comprehensive performance prediction model for college students based on behavior features. First, a method for representing campus card data based on behavior flow is introduced to retain higher time accuracy. Second, a method for extracting student behavior features based on a multi-head self-attention mechanism is proposed to automatically select the more important high-order behavior combination features. Finally, a performance prediction model based on differences in students' behavior feature modes is proposed to improve prediction accuracy and increase robustness for students whose performance changes significantly. The performance of the model is verified on real data collected by the teaching monitoring big data platform of Xi'an Jiaotong University. The results show that the model's prediction performance is better than the comparison algorithms on both failing prediction and GPA prediction.
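
A minimal numpy sketch of the generic multi-head self-attention mechanism named above, applied to a behavior sequence; the dimensions and random weights are illustrative, and the paper's cross-feature specifics are not reproduced.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, n_heads):
    """Multi-head self-attention over a sequence X (T x d); the (d x d)
    projection matrices are split across heads along the model dimension."""
    T, d = X.shape
    dh = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    out = np.empty_like(X)
    for h in range(n_heads):
        s = slice(h * dh, (h + 1) * dh)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(dh)   # (T x T) attention scores
        out[:, s] = softmax(scores, axis=-1) @ V[:, s]
    return out

T, d, heads = 16, 32, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d))                        # T behavior-flow steps, d features
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
H = multi_head_self_attention(X, Wq, Wk, Wv, heads)   # (T, d) attended features
```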

  • 2 November 2022, 00:00

Partial-order-based process mining: a survey and outlook

Abstract

The field of process mining focuses on distilling knowledge of the (historical) execution of a process based on the operational event data generated and stored during its execution. Most existing process mining techniques assume that the event data describe activity executions as degenerate time intervals, i.e., intervals of the form [t, t], yielding a strict total order on the observed activity instances. However, for various practical use cases, e.g., the logging of activity executions with a nonzero duration and uncertainty on the correctness of the recorded timestamps of the activity executions, assuming a partial order on the observed activity instances is more appropriate. Using partial orders to represent process executions, i.e., based on recorded event data, allows for new classes of process mining algorithms, i.e., aware of parallelism and robust to uncertainty. Yet, interestingly, only a limited number of studies consider using intermediate data abstractions that explicitly assume a partial order over a collection of observed activity instances. Considering recent developments in process mining, e.g., the prevalence of high-quality event data and techniques for event data abstraction, the need for algorithms designed to handle partially ordered event data is expected to grow in the upcoming years. Therefore, this paper presents a survey of process mining techniques that explicitly use partial orders to represent recorded process behavior. We performed a keyword search, followed by a snowball sampling strategy, yielding 68 relevant articles in the field. We observe a recent uptake in works covering partial-order-based process mining, e.g., due to the current trend of process mining based on uncertain event data. Furthermore, we outline promising novel research directions for the use of partial orders in the context of process mining algorithms.
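
The interval view is easy to make concrete: with (start, end) activity instances, "precedes" holds only when one instance finishes before the other starts, so overlapping instances remain unordered. A small illustrative sketch:

```python
def precedes(a, b):
    """Activity instances as (start, end) intervals: a strictly precedes b only
    if a finishes before b starts. Overlapping instances stay unordered
    (concurrent), which yields a partial order; degenerate intervals [t, t]
    recover the usual order on timestamps."""
    return a[1] < b[0]

events = {"register": (1, 4), "check": (2, 5), "pay": (6, 6)}
for n1, iv1 in events.items():
    for n2, iv2 in events.items():
        if precedes(iv1, iv2):
            print(n1, "->", n2)   # register -> pay, check -> pay; register/check overlap
```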

  • 2 November 2022, 00:00

AdaCC: cumulative cost-sensitive boosting for imbalanced classification

Abstract

Class imbalance poses a major challenge for machine learning, as most supervised learning models exhibit bias towards the majority class and under-perform on the minority class. Cost-sensitive learning tackles this problem by treating the classes differently, typically via a user-defined fixed misclassification cost matrix provided as input to the learner. Tuning these costs is challenging, requires domain knowledge, and wrong adjustments can deteriorate overall predictive performance. In this work, we propose a novel cost-sensitive boosting approach for imbalanced data that dynamically adjusts the misclassification costs over the boosting rounds in response to the model's performance, instead of using a fixed misclassification cost matrix. Our method, called AdaCC, is parameter-free, since it relies on the cumulative behavior of the boosting model to adjust the misclassification costs for the next boosting round, and it comes with theoretical guarantees regarding the training error. Experiments on 27 real-world datasets from different domains with high class imbalance demonstrate the superiority of our method over 12 state-of-the-art cost-sensitive boosting approaches, with consistent improvements across measures: in the range of 0.3–28.56% for AUC, 3.4–21.4% for balanced accuracy, 4.8–45% for g-mean, and 7.4–85.5% for recall.
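
A loose sketch of the dynamic-cost idea for binary labels in {0, 1}: each round, class costs are derived from the cumulative ensemble's per-class error instead of a fixed matrix. The cost rule and update shown are illustrative stand-ins, not AdaCC's actual equations.

```python
import numpy as np

def dynamic_costs(y, ensemble_pred):
    """Hypothetical cost rule in the spirit of AdaCC: weight each class by the
    cumulative ensemble's current error on it (the paper's exact rule differs)."""
    err_pos = np.mean(ensemble_pred[y == 1] != 1) if np.any(y == 1) else 0.0
    err_neg = np.mean(ensemble_pred[y == 0] != 0) if np.any(y == 0) else 0.0
    return 1.0 + err_pos, 1.0 + err_neg

def reweight(w, y, weak_pred, alpha, c_pos, c_neg):
    """One boosting-round update: misclassified instances are up-weighted in
    proportion to their class's current cost, then weights are renormalized."""
    cost = np.where(y == 1, c_pos, c_neg)
    w = w * np.exp(alpha * (y != weak_pred) * cost)
    return w / w.sum()
```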

  • 2 November 2022, 00:00

Tailoring and evaluating the Wikipedia for in-domain comparable corpora extraction

Abstract

We propose a language-independent graph-based method to build à-la-carte article collections on user-defined domains from the Wikipedia. The core model is based on the exploration of the encyclopedia’s category graph and can produce both mono- and multilingual comparable collections. We run thorough experiments to assess the quality of the obtained corpora in 10 languages and 743 domains. According to an extensive manual evaluation, our graph model reaches an average precision of 84% on in-domain articles, outperforming an alternative model based on information retrieval techniques. As manual evaluations are costly, we introduce the concept of domainness and design several automatic metrics to account for the quality of the collections. Our best metric for domainness shows a strong correlation with human judgments, representing a reasonable automatic alternative to assess the quality of domain-specific corpora. We release the WikiTailor toolkit with the implementation of the extraction methods, the evaluation measures and several utilities.

  • 1 November 2022, 00:00

Distributed real-time ETL architecture for unstructured big data

Abstract

Real-time extract-transform-load (ETL) is an integral part of the increasing demand for faster business decisions in a large number of modern applications. Extracting multi-source unstructured data streams and transforming them using disk data in a distributed environment are the building blocks of real-time ETL, due to the volume and velocity of the data. Designing an architecture for these basic building blocks therefore remains a major challenge. In this paper, we focus primarily on expediting stream-disk joins during the transformation phase of ETL, considered the most expensive operator in stream processing due to frequent disk access. We propose an architecture for real-time ETL that ingests unstructured streams of data from multiple sources, without having to worry about the structure of the data sources, and transforms them after joining with distributed disk data. We also present a novel data-pipeline stream-disk join that uses partition-based input and a best-effort in-memory database technique to reduce frequent disk access. The proposed architecture addresses the challenges of stream data loss, ignored unmatched streams, disk overhead, and real-time processing in a distributed environment. Experimental results obtained using a stream generator and real-world datasets on local and distributed machines show that the proposed architecture yields significantly improved throughput, especially for large numbers of stream tuples with large datasets.
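
One way to picture the join described above: stream tuples probe a bounded in-memory cache of disk records before falling back to disk. The LRU policy and sizes below are illustrative, and disk_lookup is a hypothetical stand-in for a partition read; the paper's pipeline differs in detail.

```python
from collections import OrderedDict

class StreamDiskJoin:
    """Sketch of a stream-disk join with a best-effort in-memory cache of
    disk records, so repeated keys avoid repeated disk access."""
    def __init__(self, disk_lookup, cache_size=10_000):
        self.disk_lookup = disk_lookup      # hypothetical disk-partition read
        self.cache = OrderedDict()          # LRU cache: key -> disk record
        self.cache_size = cache_size

    def join(self, key, stream_tuple):
        if key in self.cache:
            self.cache.move_to_end(key)              # LRU hit
            rec = self.cache[key]
        else:
            rec = self.disk_lookup(key)              # expensive disk access
            self.cache[key] = rec
            if len(self.cache) > self.cache_size:
                self.cache.popitem(last=False)       # evict least recently used
        return (stream_tuple, rec) if rec is not None else None
```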

  • 1 December 2022, 00:00

Risk-aware temporal cascade reconstruction to detect asymptomatic cases

Abstract

This paper studies the problem of detecting asymptomatic cases in a temporal contact network in which multiple outbreaks have occurred. We show that the key to detecting asymptomatic cases well is taking into account both individual risk and the likelihood of disease flow along edges. We consider both aspects by formulating the asymptomatic case detection problem as a directed prize-collecting Steiner tree (Directed PCST) problem. We present an approximation-preserving reduction from this problem to the directed Steiner tree problem and obtain scalable algorithms for the Directed PCST problem on instances with more than 1.5M edges obtained from both synthetic and fine-grained hospital data. On synthetic data, we demonstrate that our detection methods significantly outperform various baselines (with a gain of 3.6×). We apply our method to the infectious disease prediction task by using an additional feature set that captures exposure to detected asymptomatic cases and show that our method outperforms all baselines. We further use our method to detect the infection sources ("patient zero") of outbreaks, again outperforming baselines. We also demonstrate that the solutions returned by our approach are clinically meaningful by presenting case studies.

  • 1 December 2022, 00:00

Self-paced annotations of crowd workers

Abstract

Crowdsourcing can harness human intelligence to handle computer-hard tasks in a relatively economical way. The answers collected from various crowd workers are of different qualities, due to task difficulty, worker capability, incentives, and other factors. To maintain high-quality answers while reducing cost, various strategies have been developed by modeling tasks, workers, or both. Nevertheless, they typically deem the capability of workers static while assigning/completing all the tasks. In fact, crowd workers can improve their capability by gradually completing easy to hard tasks, akin to human beings' intrinsic self-paced learning ability. In this paper, we study crowdsourcing with self-paced workers, whose capability can progressively improve as they scrutinize and complete tasks from easy to hard. We introduce a Self-paced Crowd-worker model (SPCrowder). In SPCrowder, workers first do a set of golden tasks with known truths, which serve as feedback to help workers capture the raw modes of the tasks and to stimulate self-paced learning. This also helps to estimate workers' quality and tasks' difficulty. SPCrowder then uses a task difficulty model to dynamically measure the difficulty of tasks, rank them from easy to hard, and assign tasks to self-paced workers by maximizing a benefit criterion. By doing so, a normal worker becomes capable of handling hard tasks after completing some easier, related tasks. We conducted extensive experiments on semi-simulated and real crowdsourcing datasets; SPCrowder outperforms competitive methods in quality control and budget saving. Crowd workers indeed possess the self-paced learning ability, which boosts quality and saves budget.

  • 1 December 2022, 00:00

Multiresolution hierarchical support vector machine for classification of large datasets

Abstract

Support vector machine (SVM) is a popular supervised learning algorithm based on margin maximization. It has a high training cost and does not scale well to a large number of data points. We propose a multiresolution algorithm, MRH-SVM, that trains an SVM on a hierarchical data aggregation structure, which also serves as a common data input to other learning algorithms. The proposed algorithm learns SVM models using high-level data aggregates and only visits data aggregates at more detailed levels where support vectors reside. In addition to performance improvements, the algorithm has advantages such as the ability to handle data streams and datasets with imbalanced classes. Experimental results show significant performance improvements in comparison with existing SVM algorithms.
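
A two-level toy version of the idea, using scikit-learn: train on per-class cluster centroids first, then refit on the raw members of only those aggregates that became support vectors. The real algorithm uses a deeper multiresolution hierarchy and its own aggregation structure; this sketch only illustrates the "descend where support vectors reside" principle.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def two_level_svm(X, y, n_agg=50):
    """Fit a coarse SVM on per-class centroids, then a fine SVM on the raw
    points of only the centroid-aggregates that were support vectors."""
    cents, labels, members = [], [], []
    for c in np.unique(y):
        Xc = X[y == c]
        km = KMeans(n_clusters=min(n_agg, len(Xc)), n_init=3).fit(Xc)
        for j in range(km.n_clusters):
            cents.append(km.cluster_centers_[j])
            labels.append(c)
            members.append(Xc[km.labels_ == j])
    coarse = SVC(kernel="linear").fit(np.array(cents), np.array(labels))
    keep = coarse.support_                    # aggregates that carry the margin
    X_fine = np.vstack([members[i] for i in keep])
    y_fine = np.concatenate([np.full(len(members[i]), labels[i]) for i in keep])
    return SVC(kernel="linear").fit(X_fine, y_fine)
```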

  • 1 December 2022, 00:00

Iterative sliding window aggregation for generating length-scale-specific fractal features

Abstract

The prevalence of high-resolution geospatial raster data is rapidly increasing, with potentially far-reaching applications in the areas of food, energy, and water. The added resolution allows shifting the focus from data science at the level of individual pixels to working with windows of pixels that characterize a region. We propose a sliding-window-based approach that allows extracting derived features on a spectrum of well-defined length scales. The resulting image has the same resolution as the input image, albeit slightly smaller in size, and the fractal dimension measures are consistent with the definition of the conventional global feature. The sliding windows can be large since the dependence of the computational cost on the window size is logarithmic. We demonstrate the success of the approach for geometric examples and for land use data and show that the resulting features can aid in a downstream classification task. Overall, this work fits the broadly recognized need in agricultural data science of transforming raw data into multi-modal representations that capture application-relevant features.
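
A per-window box-counting sketch of the fractal-dimension feature; this naive version recomputes every window from scratch, whereas the paper's iterative sliding-window aggregation reuses work across windows (hence the logarithmic dependence on window size).

```python
import numpy as np

def box_counting_dim(window):
    """Box-counting fractal dimension of a binary 2-D window (square side,
    power of two): count occupied boxes at dyadic scales and fit
    log(count) against log(1/size)."""
    n = window.shape[0]
    sizes, counts = [], []
    s = n
    while s >= 1:
        occupied = window.reshape(n // s, s, n // s, s).any(axis=(1, 3))
        sizes.append(s)
        counts.append(occupied.sum())
        s //= 2
    sizes = np.array(sizes, dtype=float)
    counts = np.array(counts, dtype=float)
    ok = counts > 0
    slope, _ = np.polyfit(np.log(1.0 / sizes[ok]), np.log(counts[ok]), 1)
    return slope
```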

  • 1 December 2022, 00:00

Manifold clustering optimized by adaptive aggregation strategy

Abstract

Unlike general spherical datasets, manifold datasets have a more complex spatial manifold structure, which makes it difficult to distinguish sample points on different manifold structures by Euclidean distance. Although the density peak clustering algorithm (DPC, two parameters: the cut-off ratio dc and the number of class centers C) can search for density peaks quickly and assign sample points, it cannot effectively identify clusters with complex manifold structures, because its sample similarity measurement is based only on Euclidean distance. To solve these problems, this paper proposes Manifold Clustering optimized by an Adaptive Aggregation Strategy (MC-AAS, two parameters: the number of nearest neighbors k and the threshold ratio of core points p). First, it introduces a novel manifold similarity measurement based on shared nearest neighbors and redefines the local density of sample points by summing the manifold similarity. Second, the core points are determined from the statistical characteristics of local density, and the local sub-clusters of manifold datasets are obtained by nearest-neighbor connection of the core points. Then, the initial clusters are merged on the basis of a statistical test of boundary density and the silhouette coefficient of adjacent subclusters to identify manifold datasets. Finally, using three evaluation metrics (Adjusted Mutual Information, Adjusted Rand Index, and Fowlkes-Mallows Index), we conduct extensive experiments on synthetic and real-world datasets. The experimental results indicate that, compared with current methods, the MC-AAS algorithm achieves a better clustering effect on complex manifold datasets and is more robust.
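
The shared-nearest-neighbor idea behind the manifold similarity is simple to sketch: the similarity of two points is the size of the overlap of their k-nearest-neighbor sets. A brute-force version follows; MC-AAS builds its own local density on top of such a measure.

```python
import numpy as np

def snn_similarity(X, k=10):
    """Shared-nearest-neighbor similarity: s(i, j) = |kNN(i) & kNN(j)|,
    a manifold-friendly alternative to raw Euclidean distance."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                  # exclude self from neighbors
    knn = np.argsort(d, axis=1)[:, :k]
    sets = [set(row) for row in knn]
    n = len(X)
    s = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            s[i, j] = s[j, i] = len(sets[i] & sets[j])
    return s
```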

  • 12 October 2022, 00:00

Dual neighborhood thresholding patterns based on directional sampling

Abstract

The evaluation of face recognition algorithms relies on the diversity of the challenges simulated in the adopted benchmarks. The main face recognition challenges cover illumination changes, inhomogeneous backgrounds, different facial expressions, pose variations, occlusion, aging, and resolution. Many 2D databases that include one or more of these challenges have been proposed in the state of the art. These databases contain different numbers of individuals and samples, and generally the number of persons does not exceed 100 classes. The reason for this limitation lies in the resources and materials required to construct a database composed of thousands of images and hundreds of persons covering the basic face recognition challenges mentioned. As a solution, researchers proposed building benchmarks by collecting web images of celebrities from search engines such as Google Images and Flickr. The best-known database of this kind is Labeled Faces in the Wild (LFW), a public benchmark for face verification. This solution provided a dependable way to construct benchmarks, but it is not applicable to face recognition, since the collected images have low resolution and the majority of persons are represented by few samples (one or two in most cases), which makes these databases extremely hard for handcrafted-descriptor-based face recognition systems. In this paper, we propose to construct a challenging database, referred to as the mixed face recognition database (MFRD), by gathering the images of eight well-known benchmarks from the literature (FERET, Extended Yale B, ORL, AR, FEI, KDEF, IMM and JAFFE). The constructed database is expected to be more complex in terms of the number of classes/images and the diversity of challenges; we therefore expect recognition performance on this database to drop compared to that recorded on each benchmark individually. This paper also presents a new LBP variant, namely dual neighborhood thresholding patterns based on directional sampling (DNTPDS), as a robust and computationally efficient handcrafted descriptor for face recognition. The descriptor defines a 5×5 neighborhood topology that relies on directional sampling to select only the 16 prominent neighbors out of 25. The proposed DNTPDS operator demonstrates superior performance and outperforms 18 state-of-the-art LBP variants, as proved through a set of comprehensive experiments.
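
For contrast with the 5×5 directional descriptor above, here is a sketch of the classic 3×3 LBP encoding that DNTPDS generalizes; the 16-of-25 directional sampling itself is not reproduced here.

```python
import numpy as np

def lbp_3x3(img):
    """Classic LBP: encode each interior pixel by thresholding its 8 immediate
    neighbors against the center, packing the results into an 8-bit code.
    DNTPDS instead works on a 5x5 neighborhood with 16 directionally
    sampled neighbors."""
    H, W = img.shape
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]
    center = img[1:-1, 1:-1]
    codes = np.zeros((H - 2, W - 2), dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offs):
        neigh = img[1 + dy:H - 1 + dy, 1 + dx:W - 1 + dx]
        codes |= (neigh >= center).astype(np.uint8) << bit
    return codes
```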

  • 9 October 2022, 00:00

Isolation Kernel Estimators

Abstract

Existing adaptive kernel density estimators (KDEs) and kernel regressions (KRs) often employ a data-independent kernel, such as the Gaussian kernel. They require an additional means to adapt the kernel bandwidth locally in a given dataset in order to produce better estimations, but this comes with high computational cost. In this paper, we show that adaptive KDEs and KRs can be derived directly from Isolation Kernel with constant-time complexity for each estimation. The resultant estimators, called IKDE and IKR, are the first KDE and KR that are both fast and adaptive. We demonstrate the superior efficiency and efficacy of IKDE and IKR in anomaly detection and regression tasks, respectively.
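
A common Voronoi-style construction of Isolation Kernel, which such estimators build on: the similarity of two points is the fraction of random psi-point partitions of the data in which both fall into the same cell. The parameters psi and t below are illustrative, and the KDE/KR derivation itself is not reproduced.

```python
import numpy as np

def isolation_kernel(x, y, X, psi=16, t=100, seed=0):
    """Voronoi-based Isolation Kernel: build t random partitions, each induced
    by psi points sampled from the data X, and return the fraction of
    partitions in which x and y share a Voronoi cell. Denser regions get
    smaller cells, which is what makes the kernel data-adaptive."""
    rng = np.random.default_rng(seed)
    same = 0
    for _ in range(t):
        centers = X[rng.choice(len(X), psi, replace=False)]
        cx = np.argmin(np.linalg.norm(centers - x, axis=1))
        cy = np.argmin(np.linalg.norm(centers - y, axis=1))
        same += (cx == cy)
    return same / t
```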

  • 1 October 2022, 00:00

A study of approaches to answering complex questions over knowledge bases

Abstract

Question answering (QA) systems retrieve the most relevant answer to a natural language question. Knowledge base question answering (KBQA) systems explore entities and relations from knowledge bases to generate answers. Currently, QA systems achieve better results when answering simple questions, but complex QA systems are receiving great attention nowadays. However, there is a lack of studies that analyze complex questions within the KBQA field and how they have been addressed. This work aims to fill this gap by presenting a systematic mapping of complex knowledge base question answering (C-KBQA). The main contributions of this work are: (i) the use of a systematic method to provide an overview of C-KBQA; (ii) a collection of 54 papers systematically selected from 894 papers; (iii) the identification of the most frequent venues, domains, and knowledge bases used in the literature; (iv) a mapping of methods, datasets, and metrics used in the C-KBQA scenario; and (v) future directions and the main gaps in the C-KBQA field. We show that C-KBQA systems aim to solve two question types: multi-hop and constraint questions. We also identify three main steps for constructing a C-KBQA system and two main approaches used in this process. Datasets for C-KBQA remain an open challenge.

  • 1 November 2022, 00:00