
Machine Learning: Algorithms, Real-World Applications and Research Directions

  • Review Article
  • Published: 22 March 2021
  • Volume 2, article number 160 (2021)


  • Iqbal H. Sarker (ORCID: orcid.org/0000-0003-1740-5517)


Abstract

In the current age of the Fourth Industrial Revolution (4IR, or Industry 4.0), the digital world holds a wealth of data, such as Internet of Things (IoT) data, cybersecurity data, mobile data, business data, social media data, and health data. To analyze these data intelligently and develop the corresponding smart and automated applications, knowledge of artificial intelligence (AI), and particularly machine learning (ML), is key. Various types of machine learning algorithms exist in the area, such as supervised, unsupervised, semi-supervised, and reinforcement learning. In addition, deep learning, which is part of a broader family of machine learning methods, can analyze data intelligently on a large scale. In this paper, we present a comprehensive view of these machine learning algorithms that can be applied to enhance the intelligence and capabilities of an application. Thus, this study's key contribution is explaining the principles of different machine learning techniques and their applicability in various real-world application domains, such as cybersecurity systems, smart cities, healthcare, e-commerce, and agriculture, among many others. We also highlight the challenges and potential research directions based on our study. Overall, this paper aims to serve as a reference point for academics, industry professionals, and decision-makers in various real-world situations and application areas, particularly from a technical point of view.


Introduction

We live in the age of data, where everything around us is connected to a data source and everything in our lives is digitally recorded [21, 103]. For instance, the current electronic world holds a wealth of various kinds of data, such as Internet of Things (IoT) data, cybersecurity data, smart city data, business data, smartphone data, social media data, health data, and COVID-19 data. These data can be structured, semi-structured, or unstructured, as discussed briefly in Sect. "Types of Real-World Data and Machine Learning Techniques", and their volume is increasing day by day. The insights extracted from these data can be used to build various intelligent applications in the relevant domains. For instance, the relevant cybersecurity data can be used to build a data-driven, automated, and intelligent cybersecurity system [105]; the relevant mobile data can be used to build personalized, context-aware smart mobile applications [103]; and so on. Thus, data management tools and techniques capable of extracting insights or useful knowledge from data in a timely and intelligent way, on which real-world applications are based, are urgently needed.

Fig. 1: The worldwide popularity score of various types of ML algorithms (supervised, unsupervised, semi-supervised, and reinforcement), on a scale of 0 (min) to 100 (max) over time; the x-axis represents the timestamp and the y-axis the corresponding score

Artificial intelligence (AI), and particularly machine learning (ML), has grown rapidly in recent years in the context of data analysis and computing, typically allowing applications to function in an intelligent manner [95]. ML usually provides systems with the ability to learn and improve from experience automatically without being explicitly programmed, and is generally regarded as one of the most popular technologies of the fourth industrial revolution (4IR, or Industry 4.0) [103, 105]. "Industry 4.0" [114] refers to the ongoing automation of conventional manufacturing and industrial practices, including exploratory data processing, using new smart technologies such as machine learning automation. Thus, to intelligently analyze these data and develop the corresponding real-world applications, machine learning algorithms are key. The learning algorithms can be categorized into four major types: supervised, unsupervised, semi-supervised, and reinforcement learning [75], discussed briefly in Sect. "Types of Real-World Data and Machine Learning Techniques". The popularity of these learning approaches is increasing day by day, as shown in Fig. 1, based on data collected from Google Trends [4] over the last five years. The x-axis of the figure indicates the specific dates, and the corresponding popularity score, within the range of \(0 \; (minimum)\) to \(100 \; (maximum)\), is shown on the y-axis. According to Fig. 1, the popularity scores for these learning types were low in 2015 and have been increasing day by day. These statistics motivate us to study machine learning in this paper, as it can play an important role in the real world through Industry 4.0 automation.

In general, the effectiveness and efficiency of a machine learning solution depend on the nature and characteristics of the data and the performance of the learning algorithms. In the area of machine learning algorithms, classification analysis, regression, data clustering, feature engineering and dimensionality reduction, association rule learning, and reinforcement learning techniques exist to effectively build data-driven systems [41, 125]. Besides, deep learning, which originated from the artificial neural network and is known as part of a wider family of machine learning approaches, can be used to analyze data intelligently [96]. Thus, selecting a proper learning algorithm that is suitable for the target application in a particular domain is challenging. The reason is that different learning algorithms serve different purposes, and even the outcomes of learning algorithms in the same category may vary depending on the data characteristics [106]. It is therefore important to understand the principles of the various machine learning algorithms and their applicability in various real-world application areas, such as IoT systems, cybersecurity services, business and recommendation systems, smart cities, healthcare and COVID-19, context-aware systems, and sustainable agriculture, among many others, which are explained briefly in Sect. "Applications of Machine Learning".

Based on the importance and potential of "Machine Learning" to analyze the data mentioned above, in this paper we provide a comprehensive view of the various types of machine learning algorithms that can be applied to enhance the intelligence and capabilities of an application. Thus, the key contribution of this study is explaining the principles and potential of different machine learning techniques and their applicability in the various real-world application areas mentioned earlier. The purpose of this paper is therefore to provide a basic guide for those in academia and industry who want to study, research, and develop data-driven, automated, and intelligent systems in the relevant areas based on machine learning techniques.

The key contributions of this paper are listed as follows:

To define the scope of our study by taking into account the nature and characteristics of various types of real-world data and the capabilities of various learning techniques.

To provide a comprehensive view on machine learning algorithms that can be applied to enhance the intelligence and capabilities of a data-driven application.

To discuss the applicability of machine learning-based solutions in various real-world application domains.

To highlight and summarize the potential research directions within the scope of our study for intelligent data analysis and services.

The rest of the paper is organized as follows. The next section presents the types of data and machine learning algorithms in a broader sense and defines the scope of our study. We briefly discuss and explain the different machine learning algorithms in the subsequent section, after which the various real-world application areas based on machine learning algorithms are discussed and summarized. In the penultimate section, we highlight several research issues and potential future directions, and the final section concludes the paper.

Types of Real-World Data and Machine Learning Techniques

Machine learning algorithms typically consume and process data to learn the related patterns about individuals, business processes, transactions, events, and so on. In the following, we discuss various types of real-world data as well as categories of machine learning algorithms.

Types of Real-World Data

Usually, the availability of data is considered the key to constructing a machine learning model or data-driven real-world system [103, 105]. Data can take various forms, such as structured, semi-structured, or unstructured [41, 72]. Besides, "metadata" is another type, which typically represents data about the data. In the following, we briefly discuss these types of data.

Structured: Structured data has a well-defined structure and conforms to a data model following a standard order; it is highly organized and easily accessed and used by an entity or a computer program. Structured data are typically stored in well-defined schemas, such as relational databases, i.e., in a tabular format. Names, dates, addresses, credit card numbers, stock information, and geolocation are examples of structured data.

Unstructured: Unstructured data, on the other hand, has no pre-defined format or organization, making it much more difficult to capture, process, and analyze; it mostly consists of text and multimedia material. For example, sensor data, emails, blog entries, wikis, word processing documents, PDF files, audio files, videos, images, presentations, web pages, and many other types of business documents can be considered unstructured data.

Semi-structured: Semi-structured data are not stored in a relational database like the structured data mentioned above, but they do have certain organizational properties that make them easier to analyze. HTML, XML, JSON documents, and NoSQL databases are some examples of semi-structured data.

Metadata: Metadata is not a normal form of data, but "data about data". The primary difference between "data" and "metadata" is that data are simply the material that can classify, measure, or document something relative to an organization's data properties, whereas metadata describes the relevant information about those data, giving them more significance for data users. Basic examples of a document's metadata are the author, the file size, the date the document was generated, and the keywords that describe the document.

In the area of machine learning and data science, researchers use various widely used datasets for different purposes. Examples include cybersecurity datasets such as NSL-KDD [119], UNSW-NB15 [76], ISCX'12 [1], CIC-DDoS2019 [2], and Bot-IoT [59]; smartphone datasets such as phone call logs [84, 101], SMS logs [29], mobile application usage logs [117, 137], and mobile phone notification logs [73]; IoT data [16, 57, 62]; agriculture and e-commerce data [120, 138]; and health data such as heart disease [92], diabetes mellitus [83, 134], and COVID-19 [43, 74] data, among many more in various application domains. These data can be of the different types discussed above, which may vary from application to application in the real world. To analyze such data in a particular problem domain, and to extract insights or useful knowledge from the data for building real-world intelligent applications, different types of machine learning techniques can be used according to their learning capabilities, as discussed in the following.

Types of Machine Learning Techniques

Machine Learning algorithms are mainly divided into four categories: Supervised learning, Unsupervised learning, Semi-supervised learning, and Reinforcement learning [ 75 ], as shown in Fig. 2 . In the following, we briefly discuss each type of learning technique with the scope of their applicability to solve real-world problems.

Fig. 2: Various types of machine learning techniques

Supervised: Supervised learning is typically the task of machine learning to learn a function that maps an input to an output based on sample input-output pairs [ 41 ]. It uses labeled training data and a collection of training examples to infer a function. Supervised learning is carried out when certain goals are identified to be accomplished from a certain set of inputs [ 105 ], i.e., a task-driven approach . The most common supervised tasks are “classification” that separates the data, and “regression” that fits the data. For instance, predicting the class label or sentiment of a piece of text, like a tweet or a product review, i.e., text classification, is an example of supervised learning.

Unsupervised: Unsupervised learning analyzes unlabeled datasets without the need for human interference, i.e., a data-driven process [ 41 ]. This is widely used for extracting generative features, identifying meaningful trends and structures, groupings in results, and exploratory purposes. The most common unsupervised learning tasks are clustering, density estimation, feature learning, dimensionality reduction, finding association rules, anomaly detection, etc.

Semi-supervised: Semi-supervised learning can be defined as a hybridization of the above-mentioned supervised and unsupervised methods, as it operates on both labeled and unlabeled data [ 41 , 105 ]. Thus, it falls between learning “without supervision” and learning “with supervision”. In the real world, labeled data could be rare in several contexts, and unlabeled data are numerous, where semi-supervised learning is useful [ 75 ]. The ultimate goal of a semi-supervised learning model is to provide a better outcome for prediction than that produced using the labeled data alone from the model. Some application areas where semi-supervised learning is used include machine translation, fraud detection, labeling data and text classification.

Reinforcement: Reinforcement learning is a type of machine learning algorithm that enables software agents and machines to automatically evaluate the optimal behavior in a particular context or environment to improve their efficiency [52], i.e., an environment-driven approach. This type of learning is based on reward or penalty, and its ultimate goal is to use the insights obtained from interacting with the environment to take actions that increase the reward or minimize the risk [75]. It is a powerful tool for training AI models that can help increase automation or optimize the operational efficiency of sophisticated systems such as robotics, autonomous driving, manufacturing, and supply chain logistics; however, it is not preferable for solving basic or straightforward problems.

Thus, to build effective models in various application areas, different types of machine learning techniques can play a significant role according to their learning capabilities, depending on the nature of the data discussed earlier and the target outcome. In Table 1, we summarize the various types of machine learning techniques with examples. In the following, we provide a comprehensive view of machine learning algorithms that can be applied to enhance the intelligence and capabilities of a data-driven application.

Machine Learning Tasks and Algorithms

In this section, we discuss the various machine learning algorithms, including classification analysis, regression analysis, data clustering, association rule learning, and feature engineering for dimensionality reduction, as well as deep learning methods. A general structure of a machine learning-based predictive model is shown in Fig. 3, where the model is trained from historical data in phase 1 and the outcome is generated for new test data in phase 2.

Fig. 3: A general structure of a machine learning-based predictive model considering both the training and testing phases

Classification Analysis

Classification is regarded as a supervised learning method in machine learning and refers to a predictive modeling problem in which a class label is predicted for a given example [41]. Mathematically, it learns a function (f) that maps input variables (X) to output variables (Y), i.e., targets, labels, or categories. It can be carried out on structured or unstructured data to predict the class of given data points. For example, spam detection, with the classes "spam" and "not spam", in email service providers is a classification problem. In the following, we summarize the common classification problems.

Binary classification: It refers to the classification tasks having two class labels such as “true and false” or “yes and no” [ 41 ]. In such binary classification tasks, one class could be the normal state, while the abnormal state could be another class. For instance, “cancer not detected” is the normal state of a task that involves a medical test, and “cancer detected” could be considered as the abnormal state. Similarly, “spam” and “not spam” in the above example of email service providers are considered as binary classification.

Multiclass classification: Traditionally, this refers to classification tasks having more than two class labels [41]. Unlike binary classification, multiclass classification has no notion of normal and abnormal outcomes; instead, examples are classified as belonging to one of a range of specified classes. For example, classifying the various types of network attacks in the NSL-KDD dataset [119] is a multiclass classification task, where the attack categories are divided into four class labels: DoS (Denial of Service Attack), U2R (User to Root Attack), R2L (Root to Local Attack), and Probing Attack.

Multi-label classification: In machine learning, multi-label classification is an important consideration where an example is associated with several classes or labels. Thus, it is a generalization of multiclass classification, where the classes involved in the problem are hierarchically structured, and each example may simultaneously belong to more than one class in each hierarchical level, e.g., multi-level text classification. For instance, Google news can be presented under the categories of a “city name”, “technology”, or “latest news”, etc. Multi-label classification includes advanced machine learning algorithms that support predicting various mutually non-exclusive classes or labels, unlike traditional classification tasks where class labels are mutually exclusive [ 82 ].

Many classification algorithms have been proposed in the machine learning and data science literature [ 41 , 125 ]. In the following, we summarize the most common and popular methods that are used widely in various application areas.

Naive Bayes (NB): The naive Bayes algorithm is based on Bayes' theorem with the assumption of independence between each pair of features [51]. It works well in many real-world situations, such as document or text classification and spam filtering, and can be used for both binary and multi-class categories. The NB classifier can also be used to effectively classify noisy instances in the data and to construct a robust prediction model [94]. Its key benefit is that, compared to more sophisticated approaches, it needs only a small amount of training data to estimate the necessary parameters quickly [82]. However, its performance may suffer due to its strong assumption of feature independence. Gaussian, Multinomial, Complement, Bernoulli, and Categorical are the common variants of the NB classifier [82].
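The idea can be sketched in a few lines. The following is a minimal multinomial naive Bayes for the spam-filtering example above, not a production implementation; the toy corpus and the use of Laplace (add-one) smoothing are illustrative assumptions:

```python
# Minimal multinomial naive Bayes for spam filtering, from scratch.
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (tokens, label). Returns class priors and a smoothed likelihood."""
    labels = [y for _, y in docs]
    priors = {c: labels.count(c) / len(labels) for c in set(labels)}
    counts = {c: Counter() for c in priors}
    for tokens, y in docs:
        counts[y].update(tokens)
    vocab = {t for tokens, _ in docs for t in tokens}

    def likelihood(token, c):  # Laplace (add-one) smoothing
        return (counts[c][token] + 1) / (sum(counts[c].values()) + len(vocab))

    return priors, likelihood

def classify(tokens, priors, likelihood):
    # Log-space to avoid underflow; independence assumption = sum of token log-likelihoods.
    scores = {c: math.log(p) + sum(math.log(likelihood(t, c)) for t in tokens)
              for c, p in priors.items()}
    return max(scores, key=scores.get)

docs = [("win cash prize".split(), "spam"),
        ("cheap cash offer".split(), "spam"),
        ("meeting agenda today".split(), "ham"),
        ("project meeting notes".split(), "ham")]
priors, lik = train_nb(docs)
print(classify("cash prize offer".split(), priors, lik))  # spam
```

Note how a handful of labeled examples suffices to estimate the parameters, which is the small-training-data benefit mentioned above.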

Linear Discriminant Analysis (LDA): Linear discriminant analysis (LDA) is a linear decision boundary classifier created by fitting class-conditional densities to data and applying Bayes' rule [51, 82]. The method is also known as a generalization of Fisher's linear discriminant, which projects a given dataset into a lower-dimensional space, i.e., a dimensionality reduction that minimizes the complexity of the model or reduces the resulting model's computational cost. The standard LDA model fits each class with a Gaussian density, assuming that all classes share the same covariance matrix [82]. LDA is closely related to ANOVA (analysis of variance) and regression analysis, which seek to express one dependent variable as a linear combination of other features or measurements.

Logistic regression (LR): Another common probabilistic statistical model used to solve classification problems in machine learning is logistic regression (LR) [64]. Logistic regression typically estimates probabilities using a logistic function, the mathematically defined sigmoid function of Eq. 1, \(g(z) = \frac{1}{1 + e^{-z}}\). It works well when the dataset can be separated linearly, but it may overfit high-dimensional datasets; the regularization (L1 and L2) techniques [82] can be used to avoid over-fitting in such scenarios. The assumption of linearity between the dependent and independent variables is considered a major drawback of logistic regression. It can be used for both classification and regression problems, but it is more commonly used for classification.
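As a sketch (not any particular library's implementation), the sigmoid and a plain gradient-descent fit of a one-feature logistic model might look as follows; the toy data, learning rate, and epoch count are assumptions:

```python
# One-feature logistic regression trained by gradient descent on the log-loss.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)      # predicted probability of class 1
            w -= lr * (p - y) * x       # gradient of log-loss w.r.t. w
            b -= lr * (p - y)           # gradient of log-loss w.r.t. b
    return w, b

xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]     # a single illustrative feature
ys = [0, 0, 0, 1, 1, 1]                 # binary class labels
w, b = fit_logistic(xs, ys)
print(sigmoid(w * 0.7 + b) < 0.5, sigmoid(w * 3.8 + b) > 0.5)  # True True
```

The sigmoid squashes the linear score into a probability, so thresholding at 0.5 yields the class prediction.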

K-nearest neighbors (KNN): K-nearest neighbors (KNN) [9] is an "instance-based" or non-generalizing learning algorithm, also known as a "lazy learning" algorithm. It does not focus on constructing a general internal model; instead, it stores all instances of the training data in n-dimensional space and classifies new data points based on similarity measures, e.g., the Euclidean distance function [82]. Classification is computed from a simple majority vote of the k nearest neighbors of each point. KNN is quite robust to noisy training data, and its accuracy depends on the data quality. The biggest issue with KNN is choosing the optimal number of neighbors to consider. KNN can be used for both classification and regression.
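The whole method fits in a few lines, which illustrates why it is called "lazy": all the work happens at query time. This is a minimal sketch with Euclidean distance and majority voting; the 2-D toy points and k = 3 are illustrative assumptions:

```python
# Minimal k-nearest-neighbors classifier (Euclidean distance, majority vote).
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (point, label) pairs; query: point to classify."""
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((6, 7), "B"), ((7, 6), "B")]
print(knn_predict(train, (2, 2)))   # A
print(knn_predict(train, (6, 5)))   # B
```

No model is built; the training set itself is the model, which is why memory and query cost grow with the data.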

Support vector machine (SVM): In machine learning, another common technique that can be used for classification, regression, or other tasks is a support vector machine (SVM) [ 56 ]. In high- or infinite-dimensional space, a support vector machine constructs a hyper-plane or set of hyper-planes. Intuitively, the hyper-plane, which has the greatest distance from the nearest training data points in any class, achieves a strong separation since, in general, the greater the margin, the lower the classifier’s generalization error. It is effective in high-dimensional spaces and can behave differently based on different mathematical functions known as the kernel. Linear, polynomial, radial basis function (RBF), sigmoid, etc., are the popular kernel functions used in SVM classifier [ 82 ]. However, when the data set contains more noise, such as overlapping target classes, SVM does not perform well.

Decision tree (DT): The decision tree (DT) [88] is a well-known non-parametric supervised learning method. DT learning methods are used for both classification and regression tasks [82]. ID3 [87], C4.5 [88], and CART [20] are well-known DT algorithms. Moreover, the recently proposed BehavDT [100] and IntrudTree [97] by Sarker et al. are effective in the relevant application domains, such as user behavior analytics and cybersecurity analytics, respectively. A DT classifies instances by sorting them down the tree from the root to some leaf node, as shown in Fig. 4. Starting at the root node of the tree, instances are classified by checking the attribute defined by that node and then moving down the tree branch corresponding to the attribute value. For splitting, the most popular criteria are "gini" for the Gini impurity and "entropy" for the information gain, which can be expressed mathematically as \(\mathrm{Gini} = 1 - \sum_{i} p_i^2\) and \(\mathrm{Entropy} = -\sum_{i} p_i \log_2 p_i\), where \(p_i\) is the proportion of class-i examples at a node [82].
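The two splitting criteria can be computed directly from the class proportions at a node. A short sketch, with a toy label set as an assumption:

```python
# Gini impurity and entropy of a node's labels, as used by the
# "gini" and "entropy" splitting criteria.
import math
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

labels = ["yes", "yes", "no", "no"]     # a maximally impure binary node
print(gini(labels), entropy(labels))    # 0.5 1.0
```

A pure node scores 0 under both criteria; tree growers choose the split that most reduces the (weighted) impurity of the child nodes.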

Fig. 4: An example of a decision tree structure

Fig. 5: An example of a random forest structure considering multiple decision trees

Random forest (RF): A random forest classifier [ 19 ] is well known as an ensemble classification technique that is used in the field of machine learning and data science in various application areas. This method uses “parallel ensembling” which fits several decision tree classifiers in parallel, as shown in Fig. 5 , on different data set sub-samples and uses majority voting or averages for the outcome or final result. It thus minimizes the over-fitting problem and increases the prediction accuracy and control [ 82 ]. Therefore, the RF learning model with multiple decision trees is typically more accurate than a single decision tree based model [ 106 ]. To build a series of decision trees with controlled variation, it combines bootstrap aggregation (bagging) [ 18 ] and random feature selection [ 11 ]. It is adaptable to both classification and regression problems and fits well for both categorical and continuous values.
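The "parallel ensembling" idea can be sketched by shrinking each tree to a one-feature decision stump trained on a bootstrap sample and combining the predictions by majority vote. The stump learner and the 1-D toy data are simplifying assumptions; a real random forest grows full trees and also samples features at each split:

```python
# Bagging sketch: bootstrap samples + weak learners + majority vote.
import random
from collections import Counter

def fit_stump(data):
    """Pick the threshold t (from the sample's x values) minimizing errors of 'x > t'."""
    best = None
    for t, _ in data:
        errs = sum((x > t) != y for x, y in data)
        if best is None or errs < best[1]:
            best = (t, errs)
    return best[0]

def fit_forest(data, n_trees=25, seed=0):
    rng = random.Random(seed)
    # Each "tree" sees its own bootstrap sample (drawn with replacement).
    return [fit_stump([rng.choice(data) for _ in data]) for _ in range(n_trees)]

def predict(forest, x):
    votes = Counter(x > t for t in forest)   # majority vote across stumps
    return votes.most_common(1)[0][0]

data = [(1, False), (2, False), (3, False), (6, True), (7, True), (8, True)]
forest = fit_forest(data)
print(predict(forest, 0.5), predict(forest, 9))   # False True
```

Averaging many high-variance learners trained on resampled data is what reduces over-fitting relative to a single tree.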

Adaptive Boosting (AdaBoost): Adaptive boosting (AdaBoost) is an ensemble learning process that employs an iterative approach to improve poor classifiers by learning from their errors. It was developed by Yoav Freund et al. [35] and is also known as "meta-learning". Unlike the random forest, which uses parallel ensembling, AdaBoost uses "sequential ensembling". It creates a powerful classifier by combining many poorly performing classifiers into a single classifier of high accuracy. In that sense, AdaBoost is called an adaptive classifier, as it significantly improves the efficiency of the classifier, although in some instances it can trigger overfitting. AdaBoost is best used to boost the performance of decision trees, its usual base estimator [82], on binary classification problems; however, it is sensitive to noisy data and outliers.

Extreme gradient boosting (XGBoost): Gradient boosting, like the random forest [19] above, is an ensemble learning algorithm that generates a final model from a series of individual models, typically decision trees. The gradient is used to minimize the loss function, similar to how neural networks [41] use gradient descent to optimize weights. Extreme gradient boosting (XGBoost) is a form of gradient boosting that takes more detailed approximations into account when determining the best model [82]. It computes second-order gradients of the loss function to minimize loss and applies advanced regularization (L1 and L2) [82], which reduces over-fitting and improves model generalization and performance. XGBoost is fast to interpret and can handle large-sized datasets well.
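The core of plain gradient boosting can be sketched for squared-error regression: each round fits a weak learner to the current residuals, i.e., the negative gradient of the loss. The one-split regression stump used as the weak learner, the toy data, and the learning rate are simplifying assumptions; XGBoost adds second-order gradients and regularization on top of this scheme:

```python
# Gradient boosting sketch: sequentially fit stumps to residuals.
def fit_stump(xs, residuals):
    """Best single split minimizing squared error of two leaf means."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def fit_gbm(xs, ys, rounds=50, lr=0.3):
    base = sum(ys) / len(ys)              # start from the mean prediction
    pred = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]   # negative gradient of squared error
        s = fit_stump(xs, residuals)
        stumps.append(s)
        pred = [p + lr * s(x) for p, x in zip(pred, xs)]
    return lambda x: base + lr * sum(s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 5.0, 5.2, 4.9]
model = fit_gbm(xs, ys)
print(model(2) < 3 < model(5))   # True
```

Each round corrects what the ensemble so far gets wrong, which is the "sequential" contrast to the random forest's parallel bagging.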

Stochastic gradient descent (SGD): Stochastic gradient descent (SGD) [41] is an iterative method for optimizing an objective function with suitable smoothness properties, where the word "stochastic" refers to its use of random sampling. This reduces the computational burden, particularly in high-dimensional optimization problems, allowing faster iterations in exchange for a lower convergence rate. A gradient is the slope of a function, i.e., a variable's degree of change in response to another variable's changes. Mathematically, gradient descent minimizes a convex function by following the partial derivatives with respect to its input parameters. Let \(\alpha\) be the learning rate and \(J_i\) the cost of the \(i\mathrm{th}\) training example; then Eq. (4), \(w_{j+1} = w_j - \alpha \frac{\partial J_i}{\partial w_j}\), represents the stochastic gradient descent weight update at the \(j\mathrm{th}\) iteration. In large-scale and sparse machine learning, SGD has been successfully applied to problems often encountered in text classification and natural language processing [82]. However, SGD is sensitive to feature scaling and needs a range of hyperparameters, such as the regularization parameter and the number of iterations.
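The weight-update rule can be sketched for a one-variable least-squares fit, where each update uses the gradient of a single randomly chosen example's cost. The noiseless toy data, learning rate, and step count are assumptions:

```python
# SGD for simple linear regression: one random example per update.
import random

def sgd_linear(data, lr=0.01, steps=5000, seed=42):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        x, y = rng.choice(data)           # single stochastic sample
        err = (w * x + b) - y             # prediction error on that sample
        w -= lr * err * x                 # w_{j+1} = w_j - alpha * dJ_i/dw
        b -= lr * err                     # b_{j+1} = b_j - alpha * dJ_i/db
    return w, b

data = [(x, 2 * x + 1) for x in range(10)]   # points on the line y = 2x + 1
w, b = sgd_linear(data)
print(round(w, 1), round(b, 1))   # 2.0 1.0
```

Because each step touches only one example, the cost per iteration is constant in the dataset size, which is what makes SGD attractive at scale.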

Rule-based classification: The term rule-based classification can be used to refer to any classification scheme that makes use of IF-THEN rules for class prediction. Several classification algorithms with the ability to generate rules exist, such as Zero-R [125], One-R [47], decision trees [87, 88], DTNB [110], Ripple Down Rule learner (RIDOR) [125], and Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [126]. Among these techniques, the decision tree is one of the most common rule-based classification algorithms, because it has several advantages, such as being easier to interpret, the ability to handle high-dimensional data, simplicity and speed, good accuracy, and the capability to produce classification rules that are clear and understandable to humans [127, 128]. The decision tree-based rules also provide significant accuracy in a prediction model for unseen test cases [106]. Since the rules are easily interpretable, these rule-based classifiers are often used to produce descriptive models that can describe a system, including its entities and their relationships.
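A tiny IF-THEN classifier of the kind a decision tree can be flattened into might look as follows; the weather-style attributes and the rules themselves are illustrative assumptions, not rules learned from any dataset discussed here:

```python
# An ordered IF-THEN rule list with a default rule, evaluated top to bottom.
def classify(record):
    rules = [
        (lambda r: r["outlook"] == "sunny" and r["humidity"] == "high", "no"),
        (lambda r: r["outlook"] == "rainy" and r["windy"], "no"),
        (lambda r: True, "yes"),          # default rule when nothing above fires
    ]
    for condition, label in rules:
        if condition(record):
            return label

print(classify({"outlook": "sunny", "humidity": "high", "windy": False}))   # no
print(classify({"outlook": "overcast", "humidity": "low", "windy": True}))  # yes
```

Each root-to-leaf path of a decision tree corresponds to one such conjunctive rule, which is why tree models are prized for interpretability.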

Fig. 6: Classification vs. regression. In classification, the dotted line represents a linear boundary that separates the two classes; in regression, the dotted line models the linear relationship between the two variables

Regression Analysis

Regression analysis includes several machine learning methods that allow one to predict a continuous (y) result variable based on the value of one or more (x) predictor variables [41]. The most significant distinction between classification and regression is that classification predicts distinct class labels, while regression facilitates the prediction of a continuous quantity. Figure 6 shows how classification differs from regression models; some overlap is often found between the two types of machine learning algorithms. Regression models are now widely used in a variety of fields, including financial forecasting and prediction, cost estimation, trend analysis, marketing, time series estimation, and drug response modeling, among many more. Some of the familiar types of regression algorithms are linear, polynomial, lasso, and ridge regression, which are explained briefly in the following.

Simple and multiple linear regression: This is one of the most popular ML modeling techniques as well as a well-known regression technique. In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the form of the regression line is linear. Linear regression creates a relationship between the dependent variable ( Y ) and one or more independent variables ( X ) (also known as regression line) using the best fit straight line [ 41 ]. It is defined by the following equations:

where a is the intercept, b is the slope of the line, and e is the error term. This equation can be used to predict the value of the target variable based on the given predictor variable(s). Multiple linear regression is an extension of simple linear regression that allows two or more predictor variables to model a response variable, y, as a linear function [ 41 ], defined in Eq. 6 , whereas simple linear regression has only one independent variable, defined in Eq. 5 .
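For illustration, the simple linear model above can be fitted by ordinary least squares. The following minimal sketch uses NumPy on synthetic, noise-free data (values chosen purely for illustration), recovering the intercept a and slope b:

```python
import numpy as np

# Synthetic data generated from y = a + b*x with a = 2 (intercept), b = 3 (slope).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 + 3.0 * x

# Design matrix with a column of ones so that the intercept a is estimated too.
X = np.column_stack([np.ones_like(x), x])
(a, b), *_ = np.linalg.lstsq(X, y, rcond=None)
```

With noisy data, the recovered coefficients would only approximate the true values, and the error term e would absorb the residuals.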

Polynomial regression: Polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is not linear, but is modeled as an \(n^\mathrm{th}\) degree polynomial in x [ 82 ]. The equation for polynomial regression is also derived from the linear regression (polynomial regression of degree 1) equation, and is defined as below:

\(y = b_0 + b_1x + b_2x^2 + \cdots + b_nx^n + e\)

Here, y is the predicted/target output, \(b_0, b_1,... b_n\) are the regression coefficients, and x is an independent/input variable. In simple terms, when the data are not distributed linearly but instead follow an \(n^\mathrm{th}\) degree polynomial, we use polynomial regression to get the desired output.
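As a minimal sketch of the idea (synthetic data and coefficients chosen for illustration), NumPy's polyfit can fit such an \(n^\mathrm{th}\) degree polynomial by least squares:

```python
import numpy as np

# Synthetic data following a 2nd-degree polynomial y = 1 + 2x + 0.5x^2 (no noise).
x = np.linspace(-3.0, 3.0, 50)
y = 1.0 + 2.0 * x + 0.5 * x ** 2

# polyfit returns coefficients from the highest degree down: [b2, b1, b0].
b2, b1, b0 = np.polyfit(x, y, deg=2)
```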

LASSO and ridge regression: LASSO and ridge regression are well known as powerful techniques typically used for building learning models in the presence of a large number of features, due to their capability to prevent over-fitting and reduce the complexity of the model. The LASSO (least absolute shrinkage and selection operator) regression model uses the L 1 regularization technique [ 82 ], which applies shrinkage by penalizing the “absolute value of the magnitude of coefficients” ( L 1 penalty). As a result, LASSO can shrink coefficients all the way to zero. Thus, LASSO regression aims to find the subset of predictors that minimizes the prediction error for a quantitative response variable. On the other hand, ridge regression uses L 2 regularization [ 82 ], which penalizes the “squared magnitude of coefficients” ( L 2 penalty). Thus, ridge regression forces the weights to be small but never sets a coefficient value to zero, and yields a non-sparse solution. Overall, LASSO regression is useful to obtain a subset of predictors by eliminating less important features, and ridge regression is useful when a data set has “multicollinearity”, i.e., predictors that are correlated with other predictors.
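The contrast between the two penalties can be sketched with scikit-learn on synthetic data (the regularization strengths below are illustrative, not prescriptive): the L 1 penalty drives the coefficients of irrelevant features exactly to zero, while the L 2 penalty only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first feature actually drives the response; the rest are noise.
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: irrelevant coefficients become exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: coefficients shrunk but kept non-zero
```

Inspecting `lasso.coef_` shows a sparse solution (most entries exactly zero), whereas `ridge.coef_` is non-sparse, matching the description above.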

Cluster Analysis

Cluster analysis, also known as clustering, is an unsupervised machine learning technique for identifying and grouping related data points in large datasets without concern for the specific outcome. It groups a collection of objects in such a way that objects in the same category, called a cluster, are in some sense more similar to each other than objects in other groups [ 41 ]. It is often used as a data analysis technique to discover interesting trends or patterns in data, e.g., groups of consumers based on their behavior. Clustering can be used in a broad range of application areas, such as cybersecurity, e-commerce, mobile data processing, health analytics, and user modeling and behavioral analytics. In the following, we briefly discuss and summarize various types of clustering methods.

Partitioning methods: Based on the features and similarities in the data, this clustering approach categorizes the data into multiple groups or clusters. Data scientists or analysts typically determine the number of clusters to produce, either dynamically or statically, depending on the nature of the target application. The most common clustering algorithms based on partitioning methods are K-means [ 69 ], K-medoids [ 80 ], CLARA [ 55 ], etc.

Density-based methods: To identify distinct groups or clusters, it uses the concept that a cluster in the data space is a contiguous region of high point density isolated from other such clusters by contiguous regions of low point density. Points that are not part of a cluster are considered as noise. The typical clustering algorithms based on density are DBSCAN [ 32 ], OPTICS [ 12 ], etc. The density-based methods typically struggle with clusters of similar density and with high-dimensional data.

Hierarchical-based methods: Hierarchical clustering typically seeks to construct a hierarchy of clusters, i.e., a tree structure. Strategies for hierarchical clustering generally fall into two types: (i) Agglomerative—a “bottom-up” approach in which each observation begins in its own cluster and pairs of clusters are merged as one moves up the hierarchy, and (ii) Divisive—a “top-down” approach in which all observations begin in one cluster and splits are performed recursively as one moves down the hierarchy, as shown in Fig. 7 . Our earlier proposed BOTS technique, Sarker et al. [ 102 ], is an example of a hierarchical, particularly bottom-up, clustering algorithm.

Grid-based methods: To deal with massive datasets, grid-based clustering is especially suitable. To obtain clusters, the principle is first to summarize the dataset with a grid representation and then to combine grid cells. STING [ 122 ], CLIQUE [ 6 ], etc. are the standard algorithms of grid-based clustering.

Model-based methods: There are mainly two types of model-based clustering algorithms: one that uses statistical learning, and the other based on a method of neural network learning [ 130 ]. For instance, GMM [ 89 ] is an example of a statistical learning method, and SOM [ 22 ] [ 96 ] is an example of a neural network learning method.

Constraint-based methods: Constraint-based clustering is a semi-supervised approach to data clustering that uses constraints to incorporate domain knowledge. Application- or user-oriented constraints are incorporated to perform the clustering. The typical algorithms of this kind of clustering are COP K-means [ 121 ], CMWK-Means [ 27 ], etc.

Figure 7: A graphical interpretation of the widely-used hierarchical clustering (bottom-up and top-down) technique

Many clustering algorithms with the ability to group data have been proposed in the machine learning and data science literature [ 41 , 125 ]. In the following, we summarize the popular methods that are used widely in various application areas.

K-means clustering: K-means clustering [ 69 ] is a fast, robust, and simple algorithm that provides reliable results when data sets are well-separated from each other. The data points are allocated to a cluster in this algorithm in such a way that the sum of the squared distances between the data points and the centroid is as small as possible. In other words, the K-means algorithm identifies k centroids and then assigns each data point to the nearest cluster while keeping the total within-cluster distance as small as possible. Since it begins with a random selection of cluster centers, the results can be inconsistent. Since extreme values can easily affect a mean, the K-means clustering algorithm is sensitive to outliers. K-medoids clustering [ 91 ] is a variant of K-means that is more robust to noise and outliers.
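A minimal scikit-learn sketch on synthetic, well-separated blobs (the easy case mentioned above; all values are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs, the case where K-means is reliable.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(5.0, 0.3, size=(50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_  # each blob should fall entirely into one cluster
```

Fixing `random_state` makes the randomly initialized centroids reproducible, which addresses the inconsistency noted above.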

Mean-shift clustering: Mean-shift clustering [ 37 ] is a nonparametric clustering technique that does not require prior knowledge of the number of clusters or constraints on cluster shape. Mean-shift clustering aims to discover “blobs” in a smooth distribution or density of samples [ 82 ]. It is a centroid-based algorithm that works by updating centroid candidates to be the mean of the points in a given region. To form the final set of centroids, these candidates are filtered in a post-processing stage to remove near-duplicates. Cluster analysis in computer vision and image processing is an example of its application domains. Mean shift has the disadvantage of being computationally expensive. Moreover, in high-dimensional cases, where the number of clusters shifts abruptly, the mean-shift algorithm does not work well.

DBSCAN: Density-based spatial clustering of applications with noise (DBSCAN) [ 32 ] is a base algorithm for density-based clustering which is widely used in data mining and machine learning. This is known as a non-parametric density-based clustering technique for separating high-density clusters from low-density clusters that are used in model building. DBSCAN’s main idea is that a point belongs to a cluster if it is close to many points from that cluster. It can find clusters of various shapes and sizes in a vast volume of data that is noisy and contains outliers. DBSCAN, unlike k-means, does not require a priori specification of the number of clusters in the data and can find arbitrarily shaped clusters. Although k-means is much faster than DBSCAN, DBSCAN is efficient at finding high-density regions and outliers, i.e., it is robust to outliers.
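The following sketch (scikit-learn, synthetic data, illustrative eps and min_samples values) shows DBSCAN discovering one dense cluster without being told the number of clusters, and flagging a distant point as noise (label -1):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# One dense cluster around the origin plus a single distant outlier.
X = np.vstack([rng.normal(0.0, 0.2, size=(30, 2)), [[5.0, 5.0]]])

db = DBSCAN(eps=0.8, min_samples=5).fit(X)
# DBSCAN labels noise points as -1; no number of clusters was specified.
```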

GMM clustering: Gaussian mixture models (GMMs) are often used for data clustering, which is a distribution-based clustering algorithm. A Gaussian mixture model is a probabilistic model in which all the data points are produced by a mixture of a finite number of Gaussian distributions with unknown parameters [ 82 ]. To find the Gaussian parameters for each cluster, an optimization algorithm called expectation-maximization (EM) [ 82 ] can be used. EM is an iterative method that uses a statistical model to estimate the parameters. In contrast to k-means, Gaussian mixture models account for uncertainty and return the likelihood that a data point belongs to one of the k clusters. GMM clustering is more robust than k-means and works well even with non-linear data distributions.
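A brief scikit-learn sketch (synthetic data, illustrative settings) of the soft assignments described above: predict_proba returns, for each point, the probability of belonging to each Gaussian component.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(60, 2)),
               rng.normal(4.0, 0.5, size=(60, 2))])

# Fit two Gaussian components via expectation-maximization (EM).
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
# Unlike k-means, each point gets a probability of belonging to each component.
probs = gmm.predict_proba(X)
```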

Agglomerative hierarchical clustering: The most common method of hierarchical clustering used to group objects in clusters based on their similarity is agglomerative clustering. This technique uses a bottom-up approach, where each object is first treated as a singleton cluster by the algorithm. Following that, pairs of clusters are merged one by one until all clusters have been merged into a single large cluster containing all objects. The result is a dendrogram, which is a tree-based representation of the elements. Single linkage [ 115 ], Complete linkage [ 116 ], BOTS [ 102 ] etc. are some examples of such techniques. The main advantage of agglomerative hierarchical clustering over k-means is that the tree-structure hierarchy generated by agglomerative clustering is more informative than the unstructured collection of flat clusters returned by k-means, which can help to make better decisions in the relevant application areas.
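As an illustrative scikit-learn sketch (synthetic data; complete linkage, one of the linkage options named above, is used here), agglomerative clustering merges points bottom-up and is cut where two clusters remain:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(4.0, 0.3, size=(20, 2))])

# Bottom-up merging with complete linkage, cut so that two clusters remain.
agg = AgglomerativeClustering(n_clusters=2, linkage="complete").fit(X)
labels = agg.labels_
```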

Dimensionality Reduction and Feature Learning

In machine learning and data science, high-dimensional data processing is a challenging task for both researchers and application developers. Thus, dimensionality reduction, which is an unsupervised learning technique, is important because it leads to better human interpretation and lower computational cost, and it avoids overfitting and redundancy by simplifying models. Both the process of feature selection and feature extraction can be used for dimensionality reduction. The primary distinction between the selection and extraction of features is that “feature selection” keeps a subset of the original features [ 97 ], while “feature extraction” creates brand-new ones [ 98 ]. In the following, we briefly discuss these techniques.

Feature selection: The selection of features, also known as the selection of variables or attributes in the data, is the process of choosing a subset of unique features (variables, predictors) to use in building a machine learning or data science model. It decreases a model’s complexity by eliminating irrelevant or less important features and allows for faster training of machine learning algorithms. A well-chosen, optimal subset of the selected features in a problem domain can minimize the overfitting problem by simplifying and generalizing the model, as well as increase the model’s accuracy [ 97 ]. Thus, “feature selection” [ 66 , 99 ] is considered one of the primary concepts in machine learning that greatly affects the effectiveness and efficiency of the target machine learning model. The Chi-squared test, analysis of variance (ANOVA) test, Pearson’s correlation coefficient, and recursive feature elimination are some popular techniques that can be used for feature selection.

Feature extraction: In a machine learning-based model or system, feature extraction techniques usually provide a better understanding of the data, a way to improve prediction accuracy, and a way to reduce computational cost or training time. The aim of “feature extraction” [ 66 , 99 ] is to reduce the number of features in a dataset by generating new ones from the existing ones and then discarding the original features. The majority of the information found in the original set of features can then be summarized using this new reduced set of features. For instance, principal component analysis (PCA) is often used as a dimensionality-reduction technique to extract a lower-dimensional space, creating brand-new components from the existing features in a dataset [ 98 ].

Many algorithms have been proposed to reduce data dimensions in the machine learning and data science literature [ 41 , 125 ]. In the following, we summarize the popular methods that are used widely in various application areas.

Variance threshold: A simple, basic approach to feature selection is the variance threshold [ 82 ]. This excludes all features of low variance, i.e., all features whose variance does not exceed the threshold. By default, it eliminates all zero-variance features, i.e., features that have the same value in all samples. This feature selection algorithm looks only at the features ( X ), not the desired outputs ( y ), and can, therefore, be used for unsupervised learning.
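A minimal scikit-learn sketch of this idea (toy data chosen for illustration): the default threshold of 0.0 drops the constant first column, and no labels y are needed.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# The first column is constant (zero variance) and carries no information.
X = np.array([[1.0, 0.0, 3.0],
              [1.0, 1.0, 4.0],
              [1.0, 0.0, 5.0]])

sel = VarianceThreshold()          # default threshold 0.0 drops constant features
X_reduced = sel.fit_transform(X)   # note: only X is used, no labels y required
```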

Pearson correlation: Pearson’s correlation is another method to understand a feature’s relation to the response variable and can be used for feature selection [ 99 ]. This method is also used for finding the association between the features in a dataset. The resulting value lies in \([-1, 1]\) , where \(-1\) means perfect negative correlation, \(+1\) means perfect positive correlation, and 0 means that the two variables do not have a linear correlation. If two random variables are represented by X and Y , then the correlation coefficient between X and Y is defined as [ 41 ]

\(corr(X, Y) = \frac{cov(X, Y)}{\sigma_X \sigma_Y}\)

ANOVA: Analysis of variance (ANOVA) is a statistical tool used to test whether the mean values of two or more groups differ significantly from each other. ANOVA assumes a linear relationship between the variables and the target, as well as the variables’ normal distribution. To statistically test the equality of means, the ANOVA method utilizes F tests. For feature selection, the resulting ‘ANOVA F value’ [ 82 ] of this test can be used, whereby certain features independent of the goal variable can be omitted.

Chi square: The chi-square \({\chi }^2\) [ 82 ] statistic is an estimate of the difference between the observed and expected frequencies of a series of events or variables. The value of \({\chi }^2\) depends on the magnitude of the difference between the observed and expected values, the degrees of freedom, and the sample size. The chi-square \({\chi }^2\) is commonly used for testing relationships between categorical variables. If \(O_i\) represents the observed value and \(E_i\) represents the expected value, then

\({\chi }^2 = \sum _{i} \frac{(O_i - E_i)^2}{E_i}\)

Recursive feature elimination (RFE): Recursive feature elimination (RFE) is a brute-force approach to feature selection. RFE [ 82 ] fits the model and removes the weakest feature repeatedly until the specified number of features is reached. Features are ranked by the model’s coefficients or feature importance. By recursively removing a small number of features per iteration, RFE aims to remove dependencies and collinearity in the model.
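The following scikit-learn sketch (synthetic data; a linear model as the illustrative estimator) recursively drops the weakest features until two remain:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
# Only features 0 and 2 drive the response; the rest are noise.
y = 4.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Recursively eliminate the weakest feature (by coefficient magnitude)
# until two features are left.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X, y)
selected = rfe.support_  # boolean mask over the original features
```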

Model-based selection: To reduce the dimensionality of the data, linear models penalized with L 1 regularization can be used. Least absolute shrinkage and selection operator (Lasso) regression is a type of linear regression that has the property of shrinking some of the coefficients to zero [ 82 ]; such features can then be removed from the model. Thus, the penalized lasso regression method is often used in machine learning to select a subset of variables. The Extra Trees Classifier [ 82 ] is an example of a tree-based estimator that can be used to compute impurity-based feature importance, which can then be used to discard irrelevant features.

Principal component analysis (PCA): Principal component analysis (PCA) is a well-known unsupervised learning approach in the field of machine learning and data science. PCA is a mathematical technique that transforms a set of correlated variables into a set of uncorrelated variables known as principal components [ 48 , 81 ]. Figure 8 shows an example of the effect of PCA on various dimension spaces, where Fig. 8 a shows the original features in 3D space, and Fig. 8 b shows the created principal components PC1 and PC2 projected onto a 2D plane, and a 1D line with the principal component PC1, respectively. Thus, PCA can be used as a feature extraction technique that reduces the dimensionality of datasets and helps to build an effective machine learning model [ 98 ]. Technically, PCA identifies the eigenvectors of a covariance matrix with the highest eigenvalues and then uses those to project the data into a new subspace of equal or fewer dimensions [ 82 ].
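As an illustration of this idea (synthetic data: 3D points lying near a 2D plane), a scikit-learn sketch in which PCA recovers a 2D representation that retains almost all of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 3-D data that actually lies close to a 2-D plane: the third axis is tiny noise.
plane = rng.normal(size=(200, 2))
X = np.column_stack([plane, 0.01 * rng.normal(size=200)])

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)  # project onto the two principal components
```

Here `pca.explained_variance_ratio_` shows that the two components capture essentially all of the variance, so little information is lost by the reduction.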

Figure 8: An example of a principal component analysis (PCA) and created principal components PC1 and PC2 in different dimension spaces

Association Rule Learning

Association rule learning is a rule-based machine learning approach to discover interesting relationships, “IF-THEN” statements, between variables in large datasets [ 7 ]. One example is that “if a customer buys a computer or laptop (an item), s/he is likely to also buy anti-virus software (another item) at the same time”. Association rules are employed today in many application areas, including IoT services, medical diagnosis, usage behavior analytics, web usage mining, smartphone applications, cybersecurity applications, and bioinformatics. In comparison to sequence mining, association rule learning does not usually take into account the order of things within or across transactions. A common way of measuring the usefulness of association rules is to use its parameters, ‘support’ and ‘confidence’, which were introduced in [ 7 ].

In the data mining literature, many association rule learning methods have been proposed, such as logic dependent [ 34 ], frequent pattern based [ 8 , 49 , 68 ], and tree-based [ 42 ]. The most popular association rule learning algorithms are summarized below.

AIS and SETM: AIS is the first algorithm proposed by Agrawal et al. [ 7 ] for association rule mining. The AIS algorithm’s main downside is that too many candidate itemsets are generated, requiring more space and wasting a lot of effort. This algorithm calls for too many passes over the entire dataset to produce the rules. Another approach SETM [ 49 ] exhibits good performance and stable behavior with execution time; however, it suffers from the same flaw as the AIS algorithm.

Apriori: For generating association rules for a given dataset, Agrawal et al. [ 8 ] proposed the Apriori, Apriori-TID, and Apriori-Hybrid algorithms. These later algorithms outperform the AIS and SETM mentioned above due to the Apriori property of frequent itemsets [ 8 ]. The term ‘Apriori’ usually refers to having prior knowledge of frequent itemset properties. Apriori uses a “bottom-up” approach, where it generates the candidate itemsets. To reduce the search space, Apriori uses the property “all subsets of a frequent itemset must be frequent; and if an itemset is infrequent, then all its supersets must also be infrequent”. Another approach, predictive Apriori [ 108 ], can also generate rules; however, it may produce unexpected results as it combines both support and confidence. Apriori [ 8 ] is the most widely applicable technique in mining association rules.
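The Apriori property can be sketched in a few lines of pure Python on a hypothetical transaction dataset (the items and minimum support value are invented for illustration): itemsets below the support threshold are pruned at each level before larger candidates are generated.

```python
# Toy sketch of Apriori-style frequent-itemset mining (hypothetical transactions).
transactions = [
    {"laptop", "antivirus"},
    {"laptop", "antivirus", "mouse"},
    {"laptop", "mouse"},
    {"antivirus"},
]
min_support = 0.5  # an itemset must occur in at least half of the transactions

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted({i for t in transactions for i in t})
frequent = {}
k, current = 1, [frozenset([i]) for i in items]
while current:
    # Downward closure: only itemsets meeting min_support survive this level.
    survivors = [s for s in current if support(s) >= min_support]
    frequent.update({s: support(s) for s in survivors})
    # Join surviving k-itemsets into candidate (k+1)-itemsets.
    k += 1
    current = list({a | b for a in survivors for b in survivors if len(a | b) == k})
```

Because {antivirus, mouse} falls below the minimum support, it is pruned, and by the Apriori property no superset of it is ever considered.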

ECLAT: This technique was proposed by Zaki et al. [ 131 ] and stands for Equivalence Class Clustering and bottom-up Lattice Traversal. ECLAT uses a depth-first search to find frequent itemsets. In contrast to the Apriori [ 8 ] algorithm, which represents data in a horizontal pattern, it represents data vertically. Hence, the ECLAT algorithm is more efficient and scalable in the area of association rule learning. This algorithm is better suited for small and medium datasets whereas the Apriori algorithm is used for large datasets.

FP-Growth: Another common association rule learning technique based on the frequent-pattern tree (FP-tree) proposed by Han et al. [ 42 ] is Frequent Pattern Growth, known as FP-Growth. The key difference with Apriori is that while generating rules, the Apriori algorithm [ 8 ] generates frequent candidate itemsets; on the other hand, the FP-growth algorithm [ 42 ] avoids candidate generation and instead produces a tree through the successful ‘divide and conquer’ strategy. Due to its sophistication, however, FP-Tree is challenging to use in an interactive mining environment [ 133 ]. Moreover, the FP-Tree may not fit into memory for massive data sets, making it challenging to process big data as well. Another solution is RARM (Rapid Association Rule Mining) proposed by Das et al. [ 26 ], but it faces a related FP-tree issue [ 133 ].

ABC-RuleMiner: A rule-based machine learning method, recently proposed in our earlier paper by Sarker et al. [ 104 ], that discovers interesting non-redundant rules to provide real-world intelligent services. This algorithm effectively identifies the redundancy in associations by taking into account the impact or precedence of the related contextual features and discovers a set of non-redundant association rules. This algorithm first constructs an association generation tree (AGT), a top-down approach, and then extracts the association rules through traversing the tree. Thus, ABC-RuleMiner is more potent than traditional rule-based methods in terms of both non-redundant rule generation and intelligent decision-making, particularly in a context-aware smart computing environment, where human or user preferences are involved.

Among the association rule learning techniques discussed above, Apriori [ 8 ] is the most widely used algorithm for discovering association rules from a given dataset [ 133 ]. The main strength of the association learning technique is its comprehensiveness, as it generates all associations that satisfy the user-specified constraints, such as minimum support and confidence value. The ABC-RuleMiner approach [ 104 ] discussed earlier could give significant results in terms of non-redundant rule generation and intelligent decision-making for the relevant application areas in the real world.

Reinforcement Learning

Reinforcement learning (RL) is a machine learning technique that allows an agent to learn by trial and error in an interactive environment using input from its actions and experiences. Unlike supervised learning, which is based on given sample data or examples, the RL method is based on interacting with the environment. The problem to be solved in reinforcement learning (RL) is defined as a Markov Decision Process (MDP) [ 86 ], i.e., all about sequentially making decisions. An RL problem typically includes four elements such as Agent, Environment, Rewards, and Policy.

RL can be split roughly into Model-based and Model-free techniques. Model-based RL is the process of inferring optimal behavior from a model of the environment by performing actions and observing the results, which include the next state and the immediate reward [ 85 ]. AlphaZero, AlphaGo [ 113 ] are examples of the model-based approaches. On the other hand, a model-free approach does not use the distribution of the transition probability and the reward function associated with MDP. Q-learning, Deep Q Network, Monte Carlo Control, SARSA (State–Action–Reward–State–Action), etc. are some examples of model-free algorithms [ 52 ]. The policy network, which is required for model-based RL but not for model-free, is the key difference between model-free and model-based learning. In the following, we discuss the popular RL algorithms.

Monte Carlo methods: Monte Carlo techniques, or Monte Carlo experiments, are a wide category of computational algorithms that rely on repeated random sampling to obtain numerical results [ 52 ]. The underlying concept is to use randomness to solve problems that are deterministic in principle. Optimization, numerical integration, and making drawings from the probability distribution are the three problem classes where Monte Carlo techniques are most commonly used.

Q-learning: Q-learning is a model-free reinforcement learning algorithm for learning the quality of behaviors that tell an agent what action to take under what conditions [ 52 ]. It does not need a model of the environment (hence the term “model-free”), and it can deal with stochastic transitions and rewards without the need for adaptations. The ‘Q’ in Q-learning usually stands for quality, as the algorithm calculates the maximum expected rewards for a given behavior in a given state.
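A toy sketch of the Q-learning update on a hypothetical 3-state chain environment (the states, rewards, and hyperparameter values are invented for illustration): the agent learns by trial and error that moving right leads to the goal.

```python
import random
random.seed(0)

# Toy Q-learning sketch: a 3-state chain where moving right reaches the goal.
# States 0..2; actions 0 = left, 1 = right; reaching state 2 yields reward 1.
n_states, n_actions = 3, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.5, 0.9, 0.2  # learning rate, discount, exploration

def step(state, action):
    nxt = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if nxt == n_states - 1 else 0.0
    return nxt, reward

for _ in range(500):                      # episodes
    s = 0
    while s != n_states - 1:
        if random.random() < epsilon:     # epsilon-greedy exploration
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda i: Q[s][i])
        s2, r = step(s, a)
        # Core update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2
```

After training, the learned Q-values rank “right” above “left” in every non-terminal state, i.e., the greedy policy reaches the goal, without any model of the environment being used.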

Deep Q-learning: The basic working step in deep Q-learning [ 52 ] is that the initial state is fed into the neural network, which returns the Q-value of all possible actions as an output. Q-learning works well when we have a reasonably simple setting to handle; however, when the number of states and actions becomes more complex, deep learning can be used as a function approximator.

Reinforcement learning, along with supervised and unsupervised learning, is one of the basic machine learning paradigms. RL can be used to solve numerous real-world problems in various fields, such as game theory, control theory, operations analysis, information theory, simulation-based optimization, manufacturing, supply chain logistics, multi-agent systems, swarm intelligence, aircraft control, robot motion control, and many more.

Artificial Neural Network and Deep Learning

Deep learning is part of a wider family of artificial neural network (ANN)-based machine learning approaches with representation learning. Deep learning provides a computational architecture by combining several processing layers, such as input, hidden, and output layers, to learn from data [ 41 ]. The main advantage of deep learning over traditional machine learning methods is its better performance in several cases, particularly when learning from large datasets [ 105 , 129 ]. Figure 9 shows the general performance of deep learning versus machine learning as the amount of data increases. However, it may vary depending on the data characteristics and experimental setup.

Figure 9: Machine learning and deep learning performance in general with the amount of data

The most common deep learning algorithms are: Multi-layer Perceptron (MLP), Convolutional Neural Network (CNN, or ConvNet), Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) [ 96 ]. In the following, we discuss various types of deep learning methods that can be used to build effective data-driven models for various purposes.

Figure 10: A structure of an artificial neural network modeling with multiple processing layers

MLP: The base architecture of deep learning, which is also known as the feed-forward artificial neural network, is called a multilayer perceptron (MLP) [ 82 ]. A typical MLP is a fully connected network consisting of an input layer, one or more hidden layers, and an output layer, as shown in Fig. 10 . Each node in one layer connects to each node in the following layer at a certain weight. MLP utilizes the “Backpropagation” technique [ 41 ], the most “fundamental building block” in a neural network, to adjust the weight values internally while building the model. MLP is sensitive to scaling features and allows a variety of hyperparameters to be tuned, such as the number of hidden layers, neurons, and iterations, which can result in a computationally costly model.
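A minimal scikit-learn sketch of an MLP classifier (synthetic two-class data; the hidden-layer size and iteration count are illustrative instances of the hyperparameters mentioned above):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Two well-separated classes; a small MLP with one hidden layer suffices.
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(3.0, 0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# One hidden layer of 8 neurons, trained with backpropagation internally.
clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                    random_state=1).fit(X, y)
accuracy = clf.score(X, y)
```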

CNN or ConvNet: The convolutional neural network (CNN) [ 65 ] enhances the design of the standard ANN, consisting of convolutional layers, pooling layers, as well as fully connected layers, as shown in Fig. 11 . As it takes advantage of the two-dimensional (2D) structure of the input data, it is typically broadly used in several areas such as image and video recognition, image processing and classification, medical image analysis, natural language processing, etc. While CNN has a greater computational burden, it has the advantage of automatically detecting the important features without any manual intervention, and hence CNN is considered to be more powerful than a conventional ANN. A number of advanced deep learning models based on CNN can be used in the field, such as AlexNet [ 60 ], Xception [ 24 ], Inception [ 118 ], Visual Geometry Group (VGG) [ 44 ], ResNet [ 45 ], etc.

LSTM-RNN: Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture used in the area of deep learning [ 38 ]. LSTM has feedback links, unlike normal feed-forward neural networks. LSTM networks are well-suited for analyzing and learning sequential data, such as classifying, processing, and predicting data based on time series data, which differentiates it from other conventional networks. Thus, LSTM can be used when the data are in a sequential format, such as time, sentence, etc., and commonly applied in the area of time-series analysis, natural language processing, speech recognition, etc.

Figure 11: An example of a convolutional neural network (CNN or ConvNet) including multiple convolution and pooling layers

In addition to these most common deep learning methods discussed above, several other deep learning approaches [ 96 ] exist in the area for various purposes. For instance, the self-organizing map (SOM) [ 58 ] uses unsupervised learning to represent high-dimensional data by a 2D grid map, thus achieving dimensionality reduction. The autoencoder (AE) [ 15 ] is another learning technique that is widely used for dimensionality reduction and feature extraction in unsupervised learning tasks. Restricted Boltzmann machines (RBM) [ 46 ] can be used for dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling. A deep belief network (DBN) is typically composed of simple, unsupervised networks such as restricted Boltzmann machines (RBMs) or autoencoders, and a backpropagation neural network (BPNN) [ 123 ]. A generative adversarial network (GAN) [ 39 ] is a form of deep learning network that can generate data with characteristics close to the actual data input. Transfer learning, which is typically the re-use of a pre-trained model on a new problem, is currently very common because it can train deep neural networks with comparatively little data [ 124 ]. A brief discussion of these artificial neural network (ANN) and deep learning (DL) models is given in our earlier paper, Sarker et al. [ 96 ].

Overall, based on the learning techniques discussed above, we can conclude that various types of machine learning techniques, such as classification analysis, regression, data clustering, feature selection and extraction, dimensionality reduction, association rule learning, reinforcement learning, and deep learning, can play a significant role for various purposes according to their capabilities. In the following section, we discuss several application areas based on machine learning algorithms.

Applications of Machine Learning

In the current age of the Fourth Industrial Revolution (4IR), machine learning has become popular in various application areas because of its capability to learn from the past and make intelligent decisions. In the following, we summarize and discuss ten popular application areas of machine learning technology.

Predictive analytics and intelligent decision-making: A major application field of machine learning is intelligent decision-making by data-driven predictive analytics [ 21 , 70 ]. The basis of predictive analytics is capturing and exploiting relationships between explanatory variables and predicted variables from previous events to predict the unknown outcome [ 41 ]. Examples include identifying suspects or criminals after a crime has been committed, or detecting credit card fraud as it happens. In another application, machine learning algorithms can assist retailers in better understanding consumer preferences and behavior, better managing inventory, avoiding out-of-stock situations, and optimizing logistics and warehousing in e-commerce. Various machine learning algorithms such as decision trees, support vector machines, artificial neural networks, etc. [ 106 , 125 ] are commonly used in the area. Since accurate predictions provide insight into the unknown, they can improve the decisions of industries, businesses, and almost any organization, including government agencies, e-commerce, telecommunications, banking and financial services, healthcare, sales and marketing, transportation, social networking, and many others.

Cybersecurity and threat intelligence: Cybersecurity is one of the most essential areas of Industry 4.0 [ 114 ], which is typically the practice of protecting networks, systems, hardware, and data from digital attacks [ 114 ]. Machine learning has become a crucial cybersecurity technology that constantly learns by analyzing data to identify patterns, better detect malware in encrypted traffic, find insider threats, predict where "bad neighborhoods" are online, keep people safe while browsing, or secure data in the cloud by uncovering suspicious activity. For instance, clustering techniques can be used to identify cyber-anomalies, policy violations, etc. Machine learning classification models that take into account the impact of security features are useful for detecting various types of cyber-attacks or intrusions [ 97 ]. Various deep learning-based security models can also be applied to large-scale security datasets [ 96 , 129 ]. Moreover, security policy rules generated by association rule learning techniques can play a significant role in building a rule-based security system [ 105 ]. Thus, we can say that the various learning techniques discussed in Sect. “ Machine Learning Tasks and Algorithms ” can enable cybersecurity professionals to be more proactive in efficiently preventing threats and cyber-attacks.
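As a toy illustration of anomaly detection (a simple statistical sketch over hypothetical per-minute connection counts; real systems would use the clustering, classification, or deep learning models cited above), unusually high traffic can be flagged by its z-score:

```python
import statistics

def zscore_anomalies(values, threshold=2.0):
    """Return the points whose z-score magnitude exceeds the threshold."""
    mu = statistics.fmean(values)
    sigma = statistics.stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical per-minute connection counts; the burst mimics a network scan.
traffic = [52, 48, 50, 47, 51, 49, 53, 50, 400]
print(zscore_anomalies(traffic))  # the burst stands out
```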

Internet of things (IoT) and smart cities: The Internet of Things (IoT) is another essential area of Industry 4.0 [ 114 ], which turns everyday objects into smart objects by allowing them to transmit data and automate tasks without the need for human interaction. IoT is, therefore, considered to be the big frontier that can enhance almost all activities in our lives, such as smart governance, smart home, education, communication, transportation, retail, agriculture, health care, business, and many more [ 70 ]. The smart city is one of IoT’s core fields of application, using technologies to enhance city services and residents’ living experiences [ 132 , 135 ]. As machine learning utilizes experience to recognize trends and create models that help predict future behavior and events, it has become a crucial technology for IoT applications [ 103 ]. For example, predicting traffic in smart cities, predicting parking availability, estimating the total energy usage of the citizens for a particular period, and making context-aware and timely decisions for the people are some tasks that can be solved using machine learning techniques according to the current needs of the people.

Traffic prediction and transportation: Transportation systems have become a crucial component of every country’s economic development. Nonetheless, several cities around the world are experiencing an excessive rise in traffic volume, resulting in serious issues such as delays, traffic congestion, higher fuel prices, increased CO \(_2\) pollution, accidents, emergencies, and a decline in modern society’s quality of life [ 40 ]. Thus, an intelligent transportation system that predicts future traffic is important and is an indispensable part of a smart city. Accurate traffic prediction based on machine and deep learning modeling can help to minimize these issues [ 17 , 30 , 31 ]. For example, based on the travel history and the trend of traveling through various routes, machine learning can assist transportation companies in predicting possible issues that may occur on specific routes and recommending that their customers take a different path. Ultimately, these learning-based data-driven models help improve traffic flow, increase the usage and efficiency of sustainable modes of transportation, and limit real-world disruption by modeling and visualizing future changes.
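As a minimal sketch of such predictive modeling (ordinary least squares fitted to hypothetical counts; production systems use far richer machine and deep learning models), a linear trend over recent observations can extrapolate near-future traffic volume:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (closed-form solution)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return a, my - a * mx  # slope, intercept

# Hypothetical average vehicle counts vs. minutes after 7 a.m.
minutes = [0, 10, 20, 30, 40, 50]
vehicles = [120, 150, 178, 211, 240, 269]
a, b = fit_line(minutes, vehicles)
print(round(a * 60 + b))  # extrapolated count at 8 a.m.
```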

Healthcare and COVID-19 pandemic: Machine learning can help to solve diagnostic and prognostic problems in a variety of medical domains, such as disease prediction, medical knowledge extraction, detecting regularities in data, patient management, etc. [ 33 , 77 , 112 ]. Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus, according to the World Health Organization (WHO) [ 3 ]. Recently, machine learning techniques have become popular in the battle against COVID-19 [ 61 , 63 ]. For the COVID-19 pandemic, learning techniques have been used to classify high-risk patients, predict mortality rates, and detect other anomalies [ 61 ]. They can also be used to better understand the virus’s origin, to predict the COVID-19 outbreak, and for disease diagnosis and treatment [ 14 , 50 ]. With the help of machine learning, researchers can forecast where and when COVID-19 is likely to spread, and notify those regions to make the required arrangements. Deep learning also provides exciting solutions to the problems of medical image processing and is seen as a crucial technique for potential applications, particularly for the COVID-19 pandemic [ 10 , 78 , 111 ]. Overall, machine and deep learning techniques can help to fight the COVID-19 virus and the pandemic, as well as support intelligent clinical decision-making in the domain of healthcare.

E-commerce and product recommendations: Product recommendation is one of the most well known and widely used applications of machine learning, and it is one of the most prominent features of almost any e-commerce website today. Machine learning technology can assist businesses in analyzing their consumers’ purchasing histories and making customized product suggestions for their next purchase based on their behavior and preferences. E-commerce companies, for example, can easily position product suggestions and offers by analyzing browsing trends and click-through rates of specific items. Using predictive modeling based on machine learning techniques, many online retailers, such as Amazon [ 71 ], can better manage inventory, prevent out-of-stock situations, and optimize logistics and warehousing. The future of sales and marketing is the ability to capture, evaluate, and use consumer data to provide a customized shopping experience. Furthermore, machine learning techniques enable companies to create packages and content that are tailored to the needs of their customers, allowing them to maintain existing customers while attracting new ones.
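A minimal sketch of the underlying idea (user-based collaborative filtering over hypothetical purchase vectors; commercial recommenders combine many more signals) suggests items bought by the most similar other user:

```python
import math

def cosine(u, v):
    """Cosine similarity between two purchase-history vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def recommend(target, others, catalog):
    """Suggest items the most similar user bought that the target has not."""
    best = max(others, key=lambda o: cosine(target, o))
    return [item for item, mine, theirs in zip(catalog, target, best)
            if theirs and not mine]

catalog = ["laptop", "mouse", "keyboard", "monitor"]
target = [1, 1, 0, 0]                  # purchase history of the target user
others = [[1, 1, 1, 0], [0, 0, 1, 1]]  # histories of two other users
print(recommend(target, others, catalog))
```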

NLP and sentiment analysis: Natural language processing (NLP) involves the reading and understanding of spoken or written language through the medium of a computer [ 79 , 103 ]. Thus, NLP helps computers, for instance, to read a text, hear speech, interpret it, analyze sentiment, and decide which aspects are significant, where machine learning techniques can be used. Virtual personal assistants, chatbots, speech recognition, document description, and language or machine translation are some examples of NLP-related tasks. Sentiment analysis [ 90 ] (also referred to as opinion mining or emotion AI) is an NLP sub-field that seeks to identify and extract public mood and views within a given text through blogs, reviews, social media, forums, news, etc. For instance, businesses and brands use sentiment analysis to understand the social sentiment of their brand, product, or service through social media platforms or the web as a whole. Overall, sentiment analysis is considered a machine learning task that analyzes texts for polarity, such as “positive”, “negative”, or “neutral”, along with more fine-grained emotions such as very happy, happy, sad, very sad, angry, interested, or not interested.
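As a toy illustration of polarity scoring (a simple lexicon count with a hypothetical word list; ML-based sentiment classifiers learn such cues from labeled data instead), a text can be labeled by comparing positive and negative word hits:

```python
# Hypothetical, deliberately tiny sentiment lexicons.
POSITIVE = {"good", "great", "happy", "love", "excellent"}
NEGATIVE = {"bad", "sad", "angry", "terrible", "hate"}

def polarity(text):
    """Score a text as positive/negative/neutral by counting lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(polarity("I love this product, great value"))  # positive
print(polarity("terrible service, very sad"))        # negative
```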

Image, speech and pattern recognition: Image recognition [ 36 ] is a well-known and widespread example of machine learning in the real world, which can identify an object in a digital image. For instance, labeling an x-ray as cancerous or not, character recognition, face detection in an image, and tagging suggestions on social media, e.g., Facebook, are common examples of image recognition. Speech recognition [ 23 ], which typically uses sound and linguistic models, is also very popular, e.g., in Google Assistant, Cortana, Siri, and Alexa [ 67 ], where machine learning methods are used. Pattern recognition [ 13 ] is defined as the automated recognition of patterns and regularities in data, e.g., image analysis. Several machine learning techniques such as classification, feature selection, clustering, or sequence labeling methods are used in the area.
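A minimal sketch of pattern recognition (nearest-centroid matching over hypothetical 3x3 binary patterns; real image recognition uses learned features and deep networks) assigns a noisy pattern to the closest class template:

```python
def flatten(grid):
    """Flatten a 2D binary pattern into a feature vector."""
    return [p for row in grid for p in row]

def nearest_centroid(sample, centroids):
    """Assign sample to the class with the closest template (L1 distance)."""
    def dist(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))
    return min(centroids, key=lambda c: dist(sample, centroids[c]))

# Hypothetical 3x3 binary "images": a vertical bar vs. a horizontal bar.
vertical = flatten([[0, 1, 0], [0, 1, 0], [0, 1, 0]])
horizontal = flatten([[0, 0, 0], [1, 1, 1], [0, 0, 0]])
centroids = {"vertical": vertical, "horizontal": horizontal}

noisy = flatten([[0, 1, 0], [0, 1, 0], [0, 1, 1]])  # one flipped pixel
print(nearest_centroid(noisy, centroids))
```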

Sustainable agriculture: Agriculture is essential to the survival of all human activities [ 109 ]. Sustainable agriculture practices help to improve agricultural productivity while also reducing negative impacts on the environment [ 5 , 25 , 109 ]. The sustainable agriculture supply chains are knowledge-intensive and based on information, skills, technologies, etc., where knowledge transfer encourages farmers to enhance their decisions to adopt sustainable agriculture practices utilizing the increasing amount of data captured by emerging technologies, e.g., the Internet of Things (IoT), mobile technologies and devices, etc. [ 5 , 53 , 54 ]. Machine learning can be applied in various phases of sustainable agriculture, such as in the pre-production phase, for the prediction of crop yield, soil properties, irrigation requirements, etc.; in the production phase, for weather prediction, disease detection, weed detection, soil nutrient management, livestock management, etc.; in the processing phase, for demand estimation, production planning, etc.; and in the distribution phase, for inventory management, consumer analysis, etc.

User behavior analytics and context-aware smartphone applications: Context-awareness is a system’s ability to capture knowledge about its surroundings at any moment and modify behaviors accordingly [ 28 , 93 ]. Context-aware computing uses software and hardware to automatically collect and interpret data for direct responses. The mobile app development environment has been changed greatly by the power of AI, particularly machine learning techniques, through their capability to learn from contextual data [ 103 , 136 ]. Thus, the developers of mobile apps can rely on machine learning to create smart apps that can understand human behavior, support, and entertain users [ 107 , 137 , 140 ]. Machine learning techniques are applicable for building various personalized data-driven context-aware systems, such as smart interruption management, smart mobile recommendation, context-aware smart searching, and decision-making that intelligently assists end mobile phone users in a pervasive computing environment. For example, context-aware association rules can be used to build an intelligent phone call application [ 104 ]. Clustering approaches are useful in capturing users’ diverse behavioral activities by taking into account data in time series [ 102 ]. To predict future events in various contexts, classification methods can be used [ 106 , 139 ]. Thus, the various learning techniques discussed in Sect. “ Machine Learning Tasks and Algorithms ” can help to build context-aware adaptive and smart applications according to the preferences of the mobile phone users.
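As an illustrative sketch of such context-aware association rules (a tiny Apriori-style miner over hypothetical phone-context logs, not the rule-learning method of [ 104 ]), frequent co-occurrences of a context and an action can be turned into rules:

```python
from itertools import combinations

def association_rules(transactions, min_support=0.5, min_conf=0.8):
    """Mine single-antecedent rules A -> B (minimal Apriori-style sketch)."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = {i for t in transactions for i in t}
    rules = []
    for a, b in combinations(sorted(items), 2):
        for ant, cons in [(a, b), (b, a)]:
            s = support({ant, cons})
            if s >= min_support and s / support({ant}) >= min_conf:
                rules.append((ant, cons))
    return rules

# Hypothetical phone-context logs: each set is one observed situation.
logs = [
    {"meeting", "reject_call"}, {"meeting", "reject_call"},
    {"meeting", "reject_call"}, {"free", "accept_call"},
]
print(association_rules(logs))
```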

In addition to these application areas, machine learning-based models can also apply to several other domains such as bioinformatics, cheminformatics, computer networks, DNA sequence classification, economics and banking, robotics, advanced engineering, and many more.

Challenges and Research Directions

Our study on machine learning algorithms for intelligent data analysis and applications opens several research issues in the area. Thus, in this section, we summarize and discuss the challenges faced and the potential research opportunities and future directions.

In general, the effectiveness and the efficiency of a machine learning-based solution depend on the nature and characteristics of the data and on the performance of the learning algorithms. Collecting the data in the relevant domain, such as cybersecurity, IoT, healthcare, or agriculture discussed in Sect. “ Applications of Machine Learning ”, is not straightforward, although the current cyberspace enables the production of a huge amount of data with very high frequency. Thus, collecting useful data for the target machine learning-based applications, e.g., smart city applications, and managing them is important for further analysis. Therefore, a more in-depth investigation of data collection methods is needed while working on real-world data. Moreover, historical data may contain many ambiguous values, missing values, outliers, and meaningless data. The performance of the machine learning algorithms discussed in Sect. “ Machine Learning Tasks and Algorithms ” depends heavily on the quality and availability of the data for training, and consequently so does the resultant model. Thus, accurately cleaning and pre-processing the diverse data collected from diverse sources is a challenging task. Therefore, effectively modifying or enhancing existing pre-processing methods, or proposing new data preparation techniques, is required to effectively use the learning algorithms in the associated application domain.
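As a small illustration of such pre-processing (median imputation plus MAD-based outlier clipping on a hypothetical sensor column; real pipelines combine many such steps), missing and extreme values can be handled before training:

```python
import statistics

def clean(column, k=5.0):
    """Impute missing entries with the median; clip outliers via a MAD rule."""
    observed = [v for v in column if v is not None]
    med = statistics.median(observed)
    mad = statistics.median(abs(v - med) for v in observed)
    lo, hi = med - k * mad, med + k * mad
    return [min(max(med if v is None else v, lo), hi) for v in column]

# Hypothetical sensor stream with a missing reading and a spurious spike.
raw = [21.0, 23.5, None, 22.0, 500.0, 24.0]
print(clean(raw))
```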

To analyze the data and extract insights, there exist many machine learning algorithms, summarized in Sect. “ Machine Learning Tasks and Algorithms ”. Thus, selecting a proper learning algorithm that is suitable for the target application is challenging. The reason is that the outcome of different learning algorithms may vary depending on the data characteristics [ 106 ]. Selecting the wrong learning algorithm would produce unexpected outcomes, which may lead to wasted effort as well as a loss of the model’s effectiveness and accuracy. In terms of model building, the techniques discussed in Sect. “ Machine Learning Tasks and Algorithms ” can directly be used to solve many real-world issues in diverse domains, such as cybersecurity, smart cities, and healthcare, summarized in Sect. “ Applications of Machine Learning ”. However, hybrid learning models, e.g., ensembles of methods, the modification or enhancement of existing learning techniques, or the design of new learning methods, could be potential future work in the area.
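A minimal sketch of algorithm selection (k-fold cross-validation comparing a majority-class baseline with a 1-nearest-neighbor model on hypothetical data) shows how candidate learners can be compared empirically before committing to one:

```python
import math
from collections import Counter

def kfold_accuracy(data, labels, predict, k=3):
    """Average accuracy of `predict(train_x, train_y, query)` over k folds."""
    folds = [list(range(i, len(data), k)) for i in range(k)]
    accs = []
    for test_idx in folds:
        train_idx = [i for i in range(len(data)) if i not in test_idx]
        tx, ty = [data[i] for i in train_idx], [labels[i] for i in train_idx]
        correct = sum(predict(tx, ty, data[i]) == labels[i] for i in test_idx)
        accs.append(correct / len(test_idx))
    return sum(accs) / k

def majority(train_x, train_y, query):   # baseline: always the majority class
    return Counter(train_y).most_common(1)[0][0]

def one_nn(train_x, train_y, query):     # 1-nearest-neighbor model
    return min(zip(train_x, train_y), key=lambda p: math.dist(p[0], query))[1]

# Two well-separated clusters; 1-NN should beat the majority baseline.
data = [(0, 0), (1, 0), (0, 1), (1, 1), (8, 8), (9, 8), (8, 9), (9, 9)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(kfold_accuracy(data, labels, majority), kfold_accuracy(data, labels, one_nn))
```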

Thus, the ultimate success of a machine learning-based solution and its corresponding applications mainly depends on both the data and the learning algorithms. If the data are unsuitable for learning, e.g., non-representative, of poor quality, containing irrelevant features, or insufficient in quantity for training, then the machine learning models may become useless or produce lower accuracy. Therefore, effectively processing the data and handling the diverse learning algorithms are important for a machine learning-based solution and, eventually, for building intelligent applications.

Conclusion

In this paper, we have conducted a comprehensive overview of machine learning algorithms for intelligent data analysis and applications. According to our goal, we have briefly discussed how various types of machine learning methods can be used to solve various real-world issues. A successful machine learning model depends on both the data and the performance of the learning algorithms. The sophisticated learning algorithms must be trained on collected real-world data and knowledge related to the target application before the system can assist with intelligent decision-making. We also discussed several popular application areas based on machine learning techniques to highlight their applicability to various real-world issues. Finally, we summarized and discussed the challenges faced and the potential research opportunities and future directions in the area. The challenges that have been identified thus create promising research opportunities in the field, which must be addressed with effective solutions in various application areas. Overall, we believe that our study on machine learning-based solutions opens up a promising direction and can serve as a reference guide for potential research and applications for both academia and industry professionals as well as for decision-makers, from a technical point of view.

Canadian Institute of Cybersecurity, University of New Brunswick, ISCX dataset, http://www.unb.ca/cic/datasets/index.html/ (Accessed on 20 October 2019).

CIC-DDoS2019 [Online]. Available: https://www.unb.ca/cic/datasets/ddos-2019.html/ (Accessed on 28 March 2020).

World Health Organization (WHO). http://www.who.int/.

Google Trends. https://trends.google.com/trends/, 2019.

Adnan N, Nordin Shahrina Md, Rahman I, Noor A. The effects of knowledge transfer on farmers decision making toward sustainable agriculture practices. World J Sci Technol Sustain Dev. 2018.

Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the 1998 ACM SIGMOD international conference on Management of data. 1998; 94–105

Agrawal R, Imieliński T, Swami A. Mining association rules between sets of items in large databases. In: ACM SIGMOD Record. ACM. 1993;22: 207–216

Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Fast algorithms for mining association rules. In: Proceedings of the International Joint Conference on Very Large Data Bases, Santiago Chile. 1994; 1215: 487–499.

Aha DW, Kibler D, Albert M. Instance-based learning algorithms. Mach Learn. 1991;6(1):37–66.

Alakus TB, Turkoglu I. Comparison of deep learning approaches to predict covid-19 infection. Chaos Solit Fract. 2020;140:

Amit Y, Geman D. Shape quantization and recognition with randomized trees. Neural Comput. 1997;9(7):1545–88.

Ankerst M, Breunig MM, Kriegel H-P, Sander J. Optics: ordering points to identify the clustering structure. ACM Sigmod Record. 1999;28(2):49–60.

Anzai Y. Pattern recognition and machine learning. Elsevier; 2012.

Ardabili SF, Mosavi A, Ghamisi P, Ferdinand F, Varkonyi-Koczy AR, Reuter U, Rabczuk T, Atkinson PM. Covid-19 outbreak prediction with machine learning. Algorithms. 2020;13(10):249.

Baldi P. Autoencoders, unsupervised learning, and deep architectures. In: Proceedings of ICML workshop on unsupervised and transfer learning, 2012; 37–49 .

Balducci F, Impedovo D, Pirlo G. Machine learning applications on agricultural datasets for smart farm enhancement. Machines. 2018;6(3):38.

Boukerche A, Wang J. Machine learning-based traffic prediction models for intelligent transportation systems. Comput Netw. 2020;181

Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123–40.

Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.

Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and regression trees. CRC Press; 1984.

Cao L. Data science: a comprehensive overview. ACM Comput Surv (CSUR). 2017;50(3):43.

Carpenter GA, Grossberg S. A massively parallel architecture for a self-organizing neural pattern recognition machine. Comput Vis Graph Image Process. 1987;37(1):54–115.

Chiu C-C, Sainath TN, Wu Y, Prabhavalkar R, Nguyen P, Chen Z, Kannan A, Weiss RJ, Rao K, Gonina E, et al. State-of-the-art speech recognition with sequence-to-sequence models. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018 pages 4774–4778. IEEE .

Chollet F. Xception: deep learning with depthwise separable convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.

Cobuloglu H, Büyüktahtakın IE. A stochastic multi-criteria decision analysis for sustainable biomass crop selection. Expert Syst Appl. 2015;42(15–16):6065–74.

Das A, Ng W-K, Woon Y-K. Rapid association rule mining. In: Proceedings of the tenth international conference on Information and knowledge management, pages 474–481. ACM, 2001.

de Amorim RC. Constrained clustering with minkowski weighted k-means. In: 2012 IEEE 13th International Symposium on Computational Intelligence and Informatics (CINTI), pages 13–17. IEEE, 2012.

Dey AK. Understanding and using context. Person Ubiquit Comput. 2001;5(1):4–7.

Eagle N, Pentland AS. Reality mining: sensing complex social systems. Person Ubiquit Comput. 2006;10(4):255–68.

Essien A, Petrounias I, Sampaio P, Sampaio S. Improving urban traffic speed prediction using data source fusion and deep learning. In: 2019 IEEE International Conference on Big Data and Smart Computing (BigComp). IEEE. 2019: 1–8. .

Essien A, Petrounias I, Sampaio P, Sampaio S. A deep-learning model for urban traffic flow prediction with traffic events mined from twitter. In: World Wide Web, 2020: 1–24 .

Ester M, Kriegel H-P, Sander J, Xiaowei X, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. Kdd. 1996;96:226–31.

Fatima M, Pasha M, et al. Survey of machine learning algorithms for disease diagnostic. J Intell Learn Syst Appl. 2017;9(01):1.

Flach PA, Lachiche N. Confirmation-guided discovery of first-order rules with tertius. Mach Learn. 2001;42(1–2):61–95.

Freund Y, Schapire RE, et al. Experiments with a new boosting algorithm. In: Icml, Citeseer. 1996; 96: 148–156

Fujiyoshi H, Hirakawa T, Yamashita T. Deep learning-based image recognition for autonomous driving. IATSS Res. 2019;43(4):244–52.

Fukunaga K, Hostetler L. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans Inform Theory. 1975;21(1):32–40.

Goodfellow I, Bengio Y, Courville A, Bengio Y. Deep learning. Cambridge: MIT Press; 2016.

Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. In: Advances in neural information processing systems. 2014: 2672–2680.

Guerrero-Ibáñez J, Zeadally S, Contreras-Castillo J. Sensor technologies for intelligent transportation systems. Sensors. 2018;18(4):1212.

Han J, Pei J, Kamber M. Data mining: concepts and techniques. Amsterdam: Elsevier; 2011.

Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation. In: ACM Sigmod Record, ACM. 2000;29: 1–12.

Harmon SA, Sanford TH, Sheng X, Turkbey EB, Roth H, Ziyue X, Yang D, Myronenko A, Anderson V, Amalou A, et al. Artificial intelligence for the detection of covid-19 pneumonia on chest ct using multinational datasets. Nat Commun. 2020;11(1):1–7.

He K, Zhang X, Ren S, Sun J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell. 2015;37(9):1904–16.

He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016: 770–778.

Hinton GE. A practical guide to training restricted boltzmann machines. In: Neural networks: Tricks of the trade. Springer. 2012; 599-619

Holte RC. Very simple classification rules perform well on most commonly used datasets. Mach Learn. 1993;11(1):63–90.

Hotelling H. Analysis of a complex of statistical variables into principal components. J Edu Psychol. 1933;24(6):417.

Houtsma M, Swami A. Set-oriented mining for association rules in relational databases. In: Data Engineering, 1995. Proceedings of the Eleventh International Conference on, IEEE.1995:25–33.

Jamshidi M, Lalbakhsh A, Talla J, Peroutka Z, Hadjilooei F, Lalbakhsh P, Jamshidi M, La Spada L, Mirmozafari M, Dehghani M, et al. Artificial intelligence and covid-19: deep learning approaches for diagnosis and treatment. IEEE Access. 2020;8:109581–95.

John GH, Langley P. Estimating continuous distributions in bayesian classifiers. In: Proceedings of the Eleventh conference on Uncertainty in artificial intelligence, Morgan Kaufmann Publishers Inc. 1995; 338–345

Kaelbling LP, Littman ML, Moore AW. Reinforcement learning: a survey. J Artif Intell Res. 1996;4:237–85.

Kamble SS, Gunasekaran A, Gawankar SA. Sustainable industry 4.0 framework: a systematic literature review identifying the current trends and future perspectives. Process Saf Environ Protect. 2018;117:408–25.

Kamble SS, Gunasekaran A, Gawankar SA. Achieving sustainable performance in a data-driven agriculture supply chain: a review for research and applications. Int J Prod Econ. 2020;219:179–94.

Kaufman L, Rousseeuw PJ. Finding groups in data: an introduction to cluster analysis, vol. 344. John Wiley & Sons; 2009.

Keerthi SS, Shevade SK, Bhattacharyya C, Radha Krishna MK. Improvements to platt’s smo algorithm for svm classifier design. Neural Comput. 2001;13(3):637–49.

Khadse V, Mahalle PN, Biraris SV. An empirical comparison of supervised machine learning algorithms for internet of things data. In: 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), IEEE. 2018; 1–6

Kohonen T. The self-organizing map. Proc IEEE. 1990;78(9):1464–80.

Koroniotis N, Moustafa N, Sitnikova E, Turnbull B. Towards the development of realistic botnet dataset in the internet of things for network forensic analytics: bot-iot dataset. Fut Gen Comput Syst. 2019;100:779–96.

Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, 2012: 1097–1105

Kushwaha S, Bahl S, Bagha AK, Parmar KS, Javaid M, Haleem A, Singh RP. Significant applications of machine learning for covid-19 pandemic. J Ind Integr Manag. 2020;5(4).

Lade P, Ghosh R, Srinivasan S. Manufacturing analytics and industrial internet of things. IEEE Intell Syst. 2017;32(3):74–9.

Lalmuanawma S, Hussain J, Chhakchhuak L. Applications of machine learning and artificial intelligence for covid-19 (sars-cov-2) pandemic: a review. Chaos Sol Fract. 2020:110059 .

LeCessie S, Van Houwelingen JC. Ridge estimators in logistic regression. J R Stat Soc Ser C (Appl Stat). 1992;41(1):191–201.

LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86(11):2278–324.

Liu H, Motoda H. Feature extraction, construction and selection: A data mining perspective, vol. 453. Springer Science & Business Media; 1998.

López G, Quesada L, Guerrero LA. Alexa vs. siri vs. cortana vs. google assistant: a comparison of speech-based natural user interfaces. In: International Conference on Applied Human Factors and Ergonomics, Springer. 2017; 241–250.

Liu B, Hsu W, Ma Y. Integrating classification and association rule mining. In: Proceedings of the fourth international conference on knowledge discovery and data mining, 1998.

MacQueen J, et al. Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, 1967;volume 1, pages 281–297. Oakland, CA, USA.

Mahdavinejad MS, Rezvan M, Barekatain M, Adibi P, Barnaghi P, Sheth AP. Machine learning for internet of things data analysis: a survey. Digit Commun Netw. 2018;4(3):161–75.

Marchand A, Marx P. Automated product recommendations with preference-based explanations. J Retail. 2020;96(3):328–43.

McCallum A. Information extraction: distilling structured data from unstructured text. Queue. 2005;3(9):48–57.

Mehrotra A, Hendley R, Musolesi M. Prefminer: mining user’s preferences for intelligent mobile notification management. In: Proceedings of the International Joint Conference on Pervasive and Ubiquitous Computing, Heidelberg, Germany, 12–16 September, 2016; pp. 1223–1234. ACM, New York, USA. .

Mohamadou Y, Halidou A, Kapen PT. A review of mathematical modeling, artificial intelligence and datasets used in the study, prediction and management of covid-19. Appl Intell. 2020;50(11):3913–25.

Mohammed M, Khan MB, Bashier Mohammed BE. Machine learning: algorithms and applications. CRC Press; 2016.

Moustafa N, Slay J. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In: 2015 military communications and information systems conference (MilCIS), 2015;pages 1–6. IEEE .

Nilashi M, Ibrahim OB, Ahmadi H, Shahmoradi L. An analytical method for diseases prediction using machine learning techniques. Comput Chem Eng. 2017;106:212–23.

Oh Y, Park S, Ye JC. Deep learning covid-19 features on cxr using limited training data sets. IEEE Trans Med Imaging. 2020;39(8):2688–700.

Otter DW, Medina JR , Kalita JK. A survey of the usages of deep learning for natural language processing. IEEE Trans Neural Netw Learn Syst. 2020.

Park H-S, Jun C-H. A simple and fast algorithm for k-medoids clustering. Expert Syst Appl. 2009;36(2):3336–41.

Pearson K. LIII. On lines and planes of closest fit to systems of points in space. Lond Edinb Dublin Philos Mag J Sci. 1901;2(11):559–72.

Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12:2825–30.

Perveen S, Shahbaz M, Keshavjee K, Guergachi A. Metabolic syndrome and development of diabetes mellitus: predictive modeling based on machine learning techniques. IEEE Access. 2018;7:1365–75.

Santi P, Ram D, Rob C, Nathan E. Behavior-based adaptive call predictor. ACM Trans Auton Adapt Syst. 2011;6(3):21:1–21:28.

Polydoros AS, Nalpantidis L. Survey of model-based reinforcement learning: applications on robotics. J Intell Robot Syst. 2017;86(2):153–73.

Puterman ML. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons; 2014.

Quinlan JR. Induction of decision trees. Mach Learn. 1986;1:81–106.

Quinlan JR. C4.5: programs for machine learning. Mach Learn. 1993.

Rasmussen C. The infinite gaussian mixture model. Adv Neural Inform Process Syst. 1999;12:554–60.

Ravi K, Ravi V. A survey on opinion mining and sentiment analysis: tasks, approaches and applications. Knowl Syst. 2015;89:14–46.

Rokach L. A survey of clustering algorithms. In: Data mining and knowledge discovery handbook, pages 269–298. Springer, 2010.


Author information

Authors and Affiliations

Swinburne University of Technology, Melbourne, VIC, 3122, Australia

Iqbal H. Sarker

Department of Computer Science and Engineering, Chittagong University of Engineering & Technology, 4349, Chattogram, Bangladesh


Corresponding author

Correspondence to Iqbal H. Sarker.

Ethics declarations

Conflict of interest

The author declares no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications” guest edited by Bhanu Prakash K N and M. Shivakumar.


About this article

Sarker, I.H. Machine Learning: Algorithms, Real-World Applications and Research Directions. SN Comput. Sci. 2, 160 (2021). https://doi.org/10.1007/s42979-021-00592-x


Received: 27 January 2021

Accepted: 12 March 2021

Published: 22 March 2021

DOI: https://doi.org/10.1007/s42979-021-00592-x


Keywords

  • Machine learning
  • Deep learning
  • Artificial intelligence
  • Data science
  • Data-driven decision-making
  • Predictive analytics
  • Intelligent applications

Unsupervised Machine Learning for Clustering in Political and Social Research

New York, NY: Cambridge University Press (Forthcoming)

62 Pages Posted: 4 Nov 2020

Philip Waggoner

Columbia University, ISERP; YouGov America

Date Written: September 1, 2020

In the age of data-driven problem-solving, the ability to apply cutting edge computational tools for explaining substantive phenomena in a digestible way to a wide audience is an increasingly valuable skill. Such skills are no less important in political and social research. Yet, application of quantitative methods often assumes an understanding of the data, structure, patterns, and concepts that directly influence the broader research program. It is often the case that researchers may not be entirely aware of the precise structure and nature of their data or what to expect of their data when approaching analysis. Further, in teaching social science research methods, it is often overlooked that the process of exploring data is a key stage in applied research, which precedes predictive modeling and hypothesis testing. These tasks, though, require knowledge of appropriate methods for exploring and understanding data in the service of discerning patterns, which contribute to development of theories and testable expectations. This Element seeks to fill this gap by offering researchers and instructors an introduction to clustering, which is a prominent class of unsupervised machine learning for exploring, mining, and understanding data. I detail several widely used clustering techniques, and pair each with R code and real data to facilitate interaction with the concepts. Three unsupervised clustering algorithms are introduced: agglomerative hierarchical clustering, k-means clustering, and Gaussian mixture models. I conclude by offering a high-level look at three advanced methods: fuzzy C-means, DBSCAN, and partitioning around medoids clustering. The goal is to bring applied researchers into the world of unsupervised machine learning, both theoretically as well as practically. All code can be interactively run on the cloud computing platform Code Ocean to guide readers through implementation of the algorithms and techniques.
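The k-means procedure named in the abstract can be sketched compactly. Below is a minimal pure-Python implementation of Lloyd's algorithm; the Element itself pairs each technique with R code, so this Python version is an illustrative analogue rather than the author's code:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and mean update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center (squared Euclidean).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its assigned points.
        new_centers = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[j]
                       for j, cl in enumerate(clusters)]
        if new_centers == centers:  # converged: assignments can no longer change
            break
        centers = new_centers
    return centers, clusters
```

The two alternating steps each monotonically reduce the within-cluster sum of squares, which is why the loop is guaranteed to terminate at a (local) optimum.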

Keywords: machine learning, unsupervised learning, clustering, political science, social science, EDA

Suggested Citation

Philip Waggoner (Contact Author)

Columbia University, ISERP (email)

3022 Broadway New York, NY 10027 United States

HOME PAGE: http://pdwaggoner.github.io/

YouGov America ( email )

432 Park Avenue South, Floor 5 New York, NY 10016 United States



Open Access

Peer-reviewed

Research Article

A machine learning and clustering-based approach for county-level COVID-19 analysis

Roles Conceptualization, Investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing

* E-mail: [email protected]

Affiliations School of Industrial and Systems Engineering, University of Oklahoma, Norman, Oklahoma, United States of America, Data Science and Analytics Institute, University of Oklahoma, Norman, Oklahoma, United States of America


Roles Data curation, Investigation, Writing – review & editing

Affiliation Data Science and Analytics Institute, University of Oklahoma, Norman, Oklahoma, United States of America

Roles Conceptualization, Methodology, Visualization, Writing – original draft, Writing – review & editing

Roles Formal analysis, Investigation, Methodology, Writing – review & editing

Affiliation School of Industrial and Systems Engineering, University of Oklahoma, Norman, Oklahoma, United States of America

Roles Data curation, Formal analysis, Investigation, Methodology, Writing – review & editing

Affiliation Department of Biostatistics and Epidemiology, University of Oklahoma Health Sciences Center, Oklahoma City, Oklahoma, United States of America

  • Charles Nicholson, 
  • Lex Beattie, 
  • Matthew Beattie, 
  • Talayeh Razzaghi, 
  • S. Chen

  • Published: April 27, 2022
  • https://doi.org/10.1371/journal.pone.0267558


COVID-19 is a global pandemic threatening the lives and livelihood of millions of people across the world. Due to its novelty and quick spread, scientists have had difficulty in creating accurate forecasts for this disease. In part, this is due to variation in human behavior and environmental factors that impact disease propagation. This is especially true for regionally specific predictive models due to either limited case histories or other unique factors characterizing the region. This paper employs both supervised and unsupervised methods to identify the critical county-level demographic, mobility, weather, medical capacity, and health related county-level factors for studying COVID-19 propagation prior to the widespread availability of a vaccine. We use this feature subspace to aggregate counties into meaningful clusters to support more refined disease analysis efforts.

Citation: Nicholson C, Beattie L, Beattie M, Razzaghi T, Chen S (2022) A machine learning and clustering-based approach for county-level COVID-19 analysis. PLoS ONE 17(4): e0267558. https://doi.org/10.1371/journal.pone.0267558

Editor: Usman Qamar, National University of Sciences and Technology (NUST), PAKISTAN

Received: June 22, 2021; Accepted: April 11, 2022; Published: April 27, 2022

Copyright: © 2022 Nicholson et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Upon acceptance, all data will be available at the URL: http://oklahomaanalytics.com/software-research-data/ .

Funding: C.N., L.B., T.R., M.B., and S.C. received funding from the Office of the Vice President for Research and Partnerships, University of Oklahoma. Funder website: https://www.ou.edu/research-norman The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

The emergence of COVID-19 has evolved into a widespread pandemic in a very short time and drastically affected the United States and the world. Many forecasts are being made regarding the potential number of cases and fatalities associated with the virus. Much of the available data skews towards large urban areas. According to data available from Johns Hopkins University [ 1 ], as of September 6, 2020, there were 6,163,496 cases and 186,125 deaths in the US. Of those, 2,481,887 cases (40%) and 72,202 deaths (39%) were from the four most populous states (California, Texas, Florida, and New York). In contrast, a smaller state like Oklahoma had only 63,556 cases and 853 deaths. At the county level, the imbalance is even more explicit: 12 counties, less than 0.5% of all counties in the US, represent over 20% of total COVID-19 cases, and only 8 counties account for 20% of the reported deaths.

All projections of the spread of COVID-19 are subject to the limitations of the data upon which they are based. At the national level, projections are dominated by the volume of cases from large regions (states or counties). Projections for less populous areas become more difficult due to limited case histories and each location's heterogeneity. These less populous areas also tend to be the least prepared for an onslaught of COVID-19 cases [ 2 – 4 ]. Hospitals and medical funding in these counties rely on forecasting to determine how to concentrate their efforts to prepare for a potential outbreak without depleting precious resources that can be used for other needs such as education. Given the skew of data towards urban areas, many forecasts for rural, semi-rural, and small populations result in over- or under-forecasting outbreaks. With limited economic resources, relying on inaccurate forecasting can result in unnecessary spending or, in the case of under-forecasting, the loss of human lives.

Many traditional tools for disease analyses leverage only limited data to distinguish one area from another, i.e., age distribution and the number of current COVID-19 cases. While this may be sufficient to forecast disease spread for large regions, it is insufficient at a more refined level [ 5 , 6 ]. For example, on April 10, 2020 using a Susceptible- Exposed-Infectious-Recovered (SEIR) model, the Oklahoma State Department of Health forecast that daily COVID-19 infections would peak in the state on April 21, and, by May 1, Oklahoma would have 9,300 total cases and 469 deaths [ 7 ]. In actuality, there were only 3,748 cases and 230 deaths by May 1, and the disease was nowhere near peaking.

Forecasting is complicated by the fact that critical variables can differ significantly geographically and demographically. That is, disease transmissibility is not only a characteristic of the biological pathogen, but also a function of human behavior and environmental factors [ 8 , 9 ]. By not accounting for these differences, there is a risk of biasing the predictions towards large, urban areas and missing important unique traits among subgroups. The effect of this variation diminishes when considering large populations. However, there is a need for region-specific analyses and projections.

Additionally, while sufficient data quantity and quality might be available at higher levels of aggregation (e.g., state or country) or populous regions (e.g., New York City), this is not as likely at smaller scales and local levels. This study offers an approach to cluster small geographies based upon features found to be relevant to COVID-19 propagation. These clusters have greater amounts of data available for further modeling. To accomplish this, a large array of county-level data is collected for the 48 conterminous United States (US). Multiple machine learning approaches are used to analyze the data to discover the important and inherent county-level characteristics that potentially drive COVID-19 outcomes. The critical features are used to create clusters of counties with similar inherent traits. These clusters and their characteristics are analyzed in detail. Ultimately, we propose that this approach provides a valid and beneficial compromise between the highly aggregated national or state level data and the more granular and limited local-level data.

Related work

Multiple researchers and institutions have developed models for the spread of COVID-19, including publicly available tools from Stanford [ 10 ] and the US Center for Disease Control and Prevention [ 11 ]. A wide variety of propagation and forecasting models are being created alongside these since accurate prediction is proving to be a daunting task. The prediction models for the transmission dynamics of the COVID-19 pandemic can be categorized into two distinct classes: epidemiological methods and data-driven methods.

Epidemiological models

The most common epidemiological models are compartmental models, which were first described in a series of three papers by Kermack and McKendrick in the 1920s and 1930s [ 12 – 14 ]. In these models, individuals in a population exist in and move between compartments: susceptible (S), infected (I), and recovered (R). The Susceptible-Infected-Recovered (SIR) [ 12 ] and Susceptible-Exposed-Infected-Recovered (SEIR) [ 15 ] models are among the most popular techniques for outbreak prediction since the onset of the pandemic [ 16 – 18 ]. Researchers continue to investigate enhancements for SIR and SEIR-based models. Sun et al. [ 19 ] proposed a novel SIR model with varying coefficients to track the reproductivity of the COVID-19 epidemic in China. Syage [ 20 ] considered a statistical and dynamical model for forecasting COVID-19 deaths based on a hybrid asymmetric Gaussian and SEIR construct.

Compartmental models are useful for modeling the mechanisms of disease transfer, but they require the assumption of full-mixing within compartments and ignore many other factors such as geography, population heterogeneity, individual contact vectors, social dynamics, governmental decisions (e.g., lockdown measures), and other complexities of human behavior.
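The compartmental flow described above can be made concrete with a small numerical sketch. The parameter values below (β, γ, population size) are illustrative only and are not fitted to any data in the paper:

```python
def simulate_sir(N, I0, beta, gamma, days, dt=0.1):
    """Forward-Euler integration of the SIR equations:
    dS/dt = -beta*S*I/N,  dI/dt = beta*S*I/N - gamma*I,  dR/dt = gamma*I."""
    S, I, R = N - I0, float(I0), 0.0
    for _ in range(int(days / dt)):
        new_infections = beta * S * I / N * dt
        new_recoveries = gamma * I * dt
        S -= new_infections
        I += new_infections - new_recoveries
        R += new_recoveries
    return S, I, R

# Illustrative run with basic reproduction number R0 = beta/gamma = 3.
S, I, R = simulate_sir(N=10_000, I0=10, beta=0.3, gamma=0.1, days=160)
```

Note that the population is conserved by construction (every individual leaving one compartment enters another), which mirrors the full-mixing assumption criticized above: the model has no geography, heterogeneity, or behavioral dynamics at all.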

Data-driven models

Data-driven models can provide more accurate forecasts at the expense of explicit modeling of propagation mechanisms. Methods such as agent-based simulation (ABS) [ 21 ] and machine learning (ML) have been employed for infectious disease outbreak analysis and disease prediction.

Agent-based simulation is a computer simulation approach consisting of agents (e.g., individuals) interacting with each other in a virtual environment. The advantage of ABS is that it can take into account a wide array of human-level dynamics while tracking disease spread. ABS has been applied for COVID-19 transmission modeling and prediction recently in [ 22 – 26 ]. While a powerful and flexible modeling paradigm, drawbacks of ABS include potential computational complexity, intricate modeling design assumptions, and the lack of closed-form “insight” on the observed system behavior.

The use of ML methods for COVID-19 forecasting is in its infancy. Yang et al. [ 27 ] developed Long Short-Term Memory (LSTM) networks to predict the COVID-19 epidemic using the 2003 SARS data as a training set. The COVID-19 epidemiological parameters, such as the probability of transmission, incubation rate, the probability of recovery or death, and contact number, were used in the model. The authors of [ 28 ] proposed the use of seven ML models and a new hybrid forecasting method based on nearest neighbors and k-means clustering to forecast COVID-19 growth rates. They employed LSTM, multiple linear regression, ridge regression, decision trees, random forest, neural network, and support vector machines on country-level data (from the USA, India, UK, Germany, and Singapore). Other existing works have used the combination of epidemiological and machine learning models to predict pandemic propagation. The authors of [ 29 ] employed the SEIR model to obtain the value of R0 and then predicted the number of COVID-19 confirmed cases in India for the next 21 days using regression.

County-level COVID-19 propagation modeling has proven to be challenging for multiple reasons. Disease transmission is influenced by “numerous biological, sociobehavioral, and environmental factors that govern pathogen transmission” [ 8 ]. For instance, [ 30 ] found that rural populations in China had a less positive attitude towards COVID-19 preventive behaviors and were less likely to adhere to policies such as social distancing and using masks. Some very recent work has begun to recognize the urgency of creating refined propagation models. Wang et al. [ 31 ] and Zhou et al. [ 5 ] are two examples that both address county-level spatiotemporal modeling to predict COVID-19 related outcomes.

Research contribution

This study contributes to the growing body of knowledge and methods for county-level infectious disease analysis in multiple ways. The primary objective is to discover the most important county-level characteristics relating to COVID-19 propagation and aggregate individual counties into clusters based on the important county-level characteristics. Ideally, this will help balance the issues associated with high-level aggregation (which hides regional diversity but has sufficient data for evaluating trends and creating forecasts) with the granular data at the local level (which has significant diversity but may have limited populations, cases, etc. for in-depth analysis). To achieve the overall objective, we complete four important subtasks, detailed below and depicted in Fig 1, that each contribute to the literature.

Fig 1. https://doi.org/10.1371/journal.pone.0267558.g001

First, we produce a unified county-level database for the US that includes demographics, mobility, weather, medical capacity, and health related county-level data relating to COVID-19 propagation. The data is available at http://oklahomaanalytics.com/software-research-data . Second, we extract essential information from the high dimensional weather and mobility data by projecting these features to a lower dimensional space to support meaningful clustering. Third, the resulting feature set is analyzed via supervised learning to discover the most important county-level characteristics relating to COVID-19 propagation. It is important to note that we are not performing time series forecasting or month-to-month predictions, but rather identifying the underlying traits, i.e., the aforementioned “sociobehavioral” and “environmental factors”, affecting COVID-19 outcomes. To the best of our knowledge, this level of in-depth and advanced empirical analysis of the critical county-level factors for COVID-19 is a novel contribution. Finally, balancing statistical properties and practical considerations, we aggregate individual counties into clusters based on the important county-level characteristics. This increases the amount of data available for epidemiological models yet the aggregation retains regional-level diversity on the critical features. Each cluster is profiled and analyzed to demonstrate the validity of the approach and to set the stage for future work. We believe that our analytical approach, list of important variables related to COVID-19 outcomes, and novel clustering results will provide important practical guidance for health policy makers and stakeholders to implement future intervention and resource allocation plans for COVID-19 and other infectious diseases.

Data and methods

The data for this study is collected from multiple sources and includes demographic, health, mobility, and weather features for counties and county-equivalents across the US. The demographic data is gathered from a public data repository created by a group of faculty and students at Johns Hopkins University [ 32 ] that extracts and cleans data from various sources including the United States Census Bureau. The data reflects demographics as of 2017 or 2018 depending on the feature [ 33 ]. The relevant census data features include population, population by race and sex, population changes due to migration, number of births, number of deaths, and other descriptive demographic statistics. Population by race/ethnicity data is aggregated to reflect the following categories: Hispanic alone, or non-Hispanic White, Black, Asian, Native Hawaiian or Pacific Islander, or Native American alone. Additionally, the multiple census categories regarding two or more races (whether Hispanic or not) are aggregated into a single category.

The health care variables concerning the number of beds, hospitals, admissions, and full-time employees are collected from the COVID Severity Forecast data set, which pulls said features from Kaiser Health News, Amma Resonance Healing Foundation Health [ 34 ], and the Behavioral Risk Factor Surveillance System. The mobility features are gathered from Google Mobility [ 35 ] and reflect monthly averages of daily metrics that describe how mobility changes against the counties’ baseline scores. Monthly averages are defined to help account for missing daily data for smaller counties. Weather features are sourced from the National Oceanic and Atmospheric Administration and accessed via the Google BigQuery Platform [ 36 ]. These features reflect monthly averages of high temperature, low temperature, average temperature, high humidity percentage, low humidity percentage, and average humidity percentage. Lastly, the COVID-19 case data is collected from USA Facts [ 37 ] and includes the number of confirmed cases and number of deaths by county starting in January 2020. This information is updated daily and this study uses data through October 10, 2020.

The data was merged from the various sources based on the Federal Information Processing Standard (FIPS) code that uniquely identifies counties and county equivalents. All features are continuous and numeric. Demographic, health-related, and COVID-19 case data are expressed per 1,000 capita or as rates within the county population. The data set consists of 3,106 counties or county-equivalents (e.g., parishes and independent cities) across the conterminous US (two counties had missing data and the District of Columbia was not included). Each county is represented by 160 numerical features.
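The merge-and-standardize step can be sketched as joining per-source tables on the FIPS code and expressing counts per 1,000 capita. The field names and values below are made up for illustration; the actual data set carries 160 features per county:

```python
# Hypothetical per-source tables keyed by FIPS code (values are illustrative).
demographics = {"40109": {"population": 655057.0}}
covid_counts = {"40109": {"cases": 4200.0, "deaths": 90.0}}

merged = {}
for fips, demo in demographics.items():
    row = dict(demo)
    pop = demo["population"]
    for name, count in covid_counts.get(fips, {}).items():
        # Standardize raw counts to a per-1,000-capita rate.
        row[name + "_per_1000"] = 1000.0 * count / pop
    merged[fips] = row
```

Keying on FIPS rather than county name avoids ambiguity between identically named counties in different states.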

Principal component analysis

Principal component analysis is a statistical technique used to project high dimensional data to lower dimensions in a way that preserves the original variance in the data [ 38 ]. The approach is commonly used in many fields to simplify data for human consumption or visualization, reduce inherent correlation in data sets, or to mitigate the so-called ‘curse of dimensionality’ associated with machine learning [ 39 ].
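The projection idea can be illustrated by approximating the leading principal component with power iteration on the sample covariance matrix. This is a standard textbook technique, not the paper's implementation:

```python
import math

def first_principal_component(rows, iters=500):
    """Approximate the leading eigenvector of the sample covariance matrix
    (i.e., the direction of maximum variance) via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    X = [[r[j] - means[j] for j in range(d)] for r in rows]      # center the data
    C = [[sum(X[i][a] * X[i][b] for i in range(n)) / (n - 1)
          for b in range(d)] for a in range(d)]                  # covariance matrix
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]                                # renormalize
    return v
```

Projecting each centered row onto the leading eigenvectors yields the lower-dimensional coordinates used in place of the original (correlated) features.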

Supervised learning and variable importance

Supervised learning is a class of machine learning algorithms that use a set of data points and known outcomes to determine a predictive model to map input space to outcomes. Many of these algorithms allow for complex, non-linear relationships between the input and outcome variables. While the resulting models may be difficult to interpret, the most important variables for predictive modeling can be identified, e.g., [ 40 , 41 ]. The techniques selected each have rigorous, algorithm-specific mechanisms for quantifying the most important predictors. For instance, while support vector machines and neural networks are known to produce highly accurate models, neither have high quality methods to evaluate which predictors are the most important. Random forests, on the other hand, quantify individual variable importance naturally throughout the model building process. The methods, their hyperparameters, and the associated measure for determining variable importance are briefly described.

Elastic net regression.
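The elastic net objective could not be recovered from this excerpt; in its standard form, blending the LASSO (L1) and ridge (L2) penalties via a mixing hyperparameter α and overall strength λ, it is:

```latex
\hat{\beta} \;=\; \operatorname*{arg\,min}_{\beta}\;
  \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - x_i^{\top}\beta\bigr)^2
  \;+\; \lambda\Bigl(\alpha\,\lVert\beta\rVert_1
  \;+\; \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\Bigr)
```

Variable importance for a fitted penalized regression is typically read from the magnitudes of the standardized coefficients.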


Multivariate adaptive regression splines.

Multivariate Adaptive Regression Splines (MARS), proposed by Friedman (1991) [ 43 ], constructs a piecewise linear regression model by creating new features that isolate ranges of values from the original input data through the use of so-called hinge functions. Variables, their hinged versions, and interactions between variables are sequentially added to a linear regression model. Once complete, MARS employs a backwards stepwise elimination procedure to reduce the number of features and optimize the generalized cross-validation (GCV) performance statistic. The hyperparameters relate to the allowed degree of variable interaction and the maximum number of predictors retained after this second step. Variable importance is determined during the backwards elimination procedure and is based on the effect that the presence of a given variable has on the GCV value.
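The hinge functions at the heart of MARS are simple to state; the knot location and coefficients below are illustrative, not taken from the paper:

```python
def hinge_pair(x, knot):
    """MARS hinge (basis) functions: max(0, x - knot) and max(0, knot - x).
    Each isolates the range of x on one side of the knot."""
    return max(0.0, x - knot), max(0.0, knot - x)

# A fitted MARS model is a weighted sum of such terms plus an intercept,
# e.g. f(x) = 2 + 3 * max(0, x - 5): flat below the knot, linear above it.
def f(x):
    up, _down = hinge_pair(x, 5.0)
    return 2.0 + 3.0 * up
```

Because each hinge is zero on one side of its knot, adding and deleting terms lets the forward/backward passes carve the input space into piecewise-linear regions.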


Random forests, conditional inference forests, and gradient boosted trees.

Random forests (RF) [ 45 ], conditional inference forests (CF) [ 46 ], and gradient boosted trees (GBT) [ 47 ] each leverage an ensemble of weak learners (i.e., decision trees) to create highly predictive regression and classification models. RF and CF create many independently constructed decision trees and use a majority rule to determine outcome values. To reduce inter-tree correlation, at each step during the tree building process, only a random subset of predictors are evaluated to create node splits. RF uses an impurity metric to determine the split values whereas CF employs statistical tests. The number of variables considered at each split is tuned to reduce overfitting.

GBT constructs a sequence of simple decision trees in which each tree is built based on the predictive error of the previous tree. Hyperparameter values include the number of trees to fit, the maximum depth of each tree, the learning rate, and the minimum number of observations in the terminal nodes of the trees. For both RF and CF, the mean-squared error (MSE) on the out-of-bag data is recorded for each tree and each variable. Variables that most improve the MSE have higher importance scores assigned. For GBT, variable importance is related to how often a feature is selected in the construction of the underlying trees.
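The sequential-fitting and selection-count ideas can be sketched as below. This is a hedged pure-Python toy, not the gbm implementation: depth-one regression stumps are fit to the current residuals, and "importance" is tallied as how often each feature is split on; the data and hyperparameter values are invented for illustration.

```python
# Toy gradient boosting with regression stumps; importance = split counts.

def fit_stump(X, r):
    """Find the (feature, threshold) split minimizing the residual SSE."""
    best = None  # (sse, feature, threshold, left_mean, right_mean)
    for j in range(len(X[0])):
        for t in sorted(set(x[j] for x in X))[:-1]:
            left = [r[i] for i, x in enumerate(X) if x[j] <= t]
            right = [r[i] for i, x in enumerate(X) if x[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((v - lm) ** 2 for v in left)
                   + sum((v - rm) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best[1:]

def boost(X, y, n_trees=50, lr=0.1):
    pred = [0.0] * len(y)
    counts = [0] * len(X[0])  # how often each feature is selected
    trees = []
    for _ in range(n_trees):
        resid = [yi - pi for yi, pi in zip(y, pred)]  # fit the current error
        j, t, lm, rm = fit_stump(X, resid)
        counts[j] += 1
        trees.append((j, t, lm, rm))
        pred = [p + lr * (lm if x[j] <= t else rm) for p, x in zip(pred, X)]
    return trees, counts

# y depends only on feature 0; feature 1 is irrelevant.
X = [[0, 5], [1, 3], [2, 8], [3, 1], [4, 9], [5, 2]]
y = [0.0, 0.0, 0.0, 10.0, 10.0, 10.0]
trees, counts = boost(X, y)
print(counts)  # feature 0 dominates the split counts
```

On this toy data every boosting round splits on feature 0, so the count-based importance correctly ranks it above the irrelevant feature.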


Partitioning around medoids.


PAM selects the medoids for each cluster using two phases called build and swap . The build phase finds an initial clustering through the consecutive selection of k medoids. The swap phase then iteratively improves the selected set of medoids until the objective function value shown in Eq (6) no longer decreases or there is no further update to the set of medoids between two subsequent iterations.

Hierarchical clustering.

Hierarchical clustering techniques iteratively find nested clusters by constructing a tree structure in either an agglomerative (bottom up) or divisive (top down) manner. Agglomerative clustering begins with each observation in its own cluster and subsequently combines the least dissimilar pair of clusters into a single cluster, thus producing a hierarchy. In this study, we use agglomerative clustering because it is the most popular and practical approach. There are different measures to obtain the distance between clusters, such as single linkage, complete linkage, and Ward’s method [ 58 ]. We choose the latter for this study as it is based on minimizing the within-cluster sum of squares error from Eq (3) at each iteration when combining clusters.
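Agglomerative clustering under Ward's criterion can be sketched as below (a hedged pure-Python toy on one-dimensional data, not the R implementation used here): at each iteration, the pair of clusters whose merge least increases the within-cluster sum of squares is combined.

```python
def sse(cluster):
    """Within-cluster sum of squared deviations from the cluster mean."""
    m = sum(cluster) / len(cluster)
    return sum((x - m) ** 2 for x in cluster)

def ward_merge_cost(a, b):
    """Increase in total within-cluster SSE if clusters a and b are merged."""
    return sse(a + b) - sse(a) - sse(b)

def agglomerate(points, k):
    clusters = [[p] for p in points]  # start: each point in its own cluster
    while len(clusters) > k:
        best = None  # (cost, i, j)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                c = ward_merge_cost(clusters[i], clusters[j])
                if best is None or c < best[0]:
                    best = (c, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the cheapest pair
        del clusters[j]
    return [sorted(c) for c in clusters]

print(agglomerate([1, 2, 3, 10, 11, 12], 2))  # [[1, 2, 3], [10, 11, 12]]
```

Running the loop all the way down to a single cluster would trace out the full dendrogram; stopping at k clusters corresponds to cutting that tree at a chosen height.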


Computational tools

All statistical analyses, supervised learning, and clustering are performed using the R software environment [ 59 ] and the following R packages: elastic net models are developed using glmnet [ 60 ], the random forests are developed using randomForest [ 61 ], the conditional inference forests are developed using partykit [ 62 ], the gradient boosted trees are developed using gbm [ 63 ], and the MARS models are developed using earth [ 64 ]. Cross-validation is conducted using the caret package [ 65 ]. Finally, the mapping is performed using the package usmap [ 66 ].

Dimension reduction

For each county and each month, the average, minimum, and maximum temperatures and relative humidities are reported, producing 72 dimensions of data. For the mobility data, the changes are reported with respect to grocery, park, retail, residential, transit, and workplace values for February 2020 through September 2020, generating 48 dimensions. The weather variables exhibit high correlation with each other, as do the mobility variables. Both the weather and mobility data can be projected onto considerably lower dimensions while maintaining the majority of their informational value. Indeed, this finding is important for the success of the research effort. Ideally, we desire all of the input variables for the clustering procedure to represent inherent traits associated with each county. For example, we prefer general county-level weather characteristics (e.g., colder than the average US county) over a historical month’s specific values (e.g., the high temperature in May 2020). The former is easy to generalize, but the latter is not. We would like to project mobility data in a similar way—i.e., compacting the month-to-month specific data into something that relates to an overall behavioral pattern. Fortunately, the high correlation of variables indicates that this is feasible with principal component analysis.

Using PCA, the weather data is first mean-centered and scaled with respect to feature standard deviation. Next, the data is projected from 72 dimensions onto 2 principal components while retaining approximately 80% of the original variation. The first principal component (PC1) explains 47% of the variance and is dominated by the monthly temperature-related variables. The second principal component (PC2) explains 33% of the variance and is dominated by the monthly humidity-related variables. The 2D projection is depicted in Fig 2. The counties associated with extreme values for each axis are labeled. The mean-centered and scaled 48-dimensional mobility change data is successfully projected onto 8 dimensions while retaining nearly 80% of the original variance.
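The standardize-then-project step can be illustrated with a minimal sketch (pure Python on invented toy data, not the study's code): standardize the columns, build the covariance matrix, and extract the first principal component by power iteration to see what fraction of the total variance it retains.

```python
# Minimal PCA sketch: for standardized data, the covariance matrix is the
# correlation matrix, and the leading eigenvalue over the trace gives the
# fraction of variance explained by PC1.

def standardize(X):
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    sds = [(sum((row[j] - means[j]) ** 2 for row in X) / (n - 1)) ** 0.5
           for j in range(p)]
    return [[(row[j] - means[j]) / sds[j] for j in range(p)] for row in X]

def covariance(X):
    n, p = len(X), len(X[0])
    return [[sum(X[i][a] * X[i][b] for i in range(n)) / (n - 1)
             for b in range(p)] for a in range(p)]

def first_pc(C, iters=200):
    """Leading eigenvector/eigenvalue of C via power iteration."""
    p = len(C)
    v = [1.0] * p
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigval = sum(v[a] * sum(C[a][b] * v[b] for b in range(p)) for a in range(p))
    return v, eigval

# Two strongly correlated columns: PC1 should retain almost all the variance.
X = standardize([[1, 2], [2, 4.1], [3, 5.9], [4, 8.2], [5, 10.1]])
C = covariance(X)
v, lam = first_pc(C)
explained = lam / sum(C[j][j] for j in range(len(C)))
print(round(explained, 3))  # close to 1.0 for near-collinear columns
```

The paper's 72-to-2 weather projection works the same way, just with more columns: correlated monthly measurements collapse onto a few components that capture general temperature and humidity traits.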


https://doi.org/10.1371/journal.pone.0267558.g002

COVID-19 supervised learning and variable importance

Each of the supervised learning approaches described beforehand is trained to predict four distinct county-level outcomes: total per 1000 capita positive COVID-19 cases as of October 10, 2020 ( cases ), total per 1000 capita COVID-19 deaths as of October 10, 2020 ( deaths ), the growth rate for positive cases over the most recent 30 days (September 11, 2020 to October 10, 2020) ( case rate ), and the growth rate for COVID-19 deaths over the same 30 days ( death rate ). The goal of the training is to identify which county-level variables are the most important driving factors associated with COVID-19 outcomes. Table 1 summarizes the four target variables.


https://doi.org/10.1371/journal.pone.0267558.t001

The models are trained on the county-level aggregated data set and tuned using 5-fold cross-validation with five repeats. The minimal cross-validated (CV) root mean squared error (RMSE) is used to determine the associated hyperparameter values and to evaluate the generalizable error of each model. Table 2 reports the predictive performance for each model. For each outcome variable and supervised learning method, the average CV RMSE and average CV R² metrics are listed. The RMSE values provide an effective method for comparing models for a given outcome and are listed first; the R² values facilitate comparisons between models of different outcomes and are listed below the RMSE values. For each outcome predicted, the performance values associated with the model having the lowest CV RMSE values are in bold.
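Repeated k-fold CV RMSE can be sketched as below. The study uses caret in R; this is a hedged pure-Python toy with a mean predictor standing in for the "model", and invented data.

```python
# Repeated 5-fold cross-validation: shuffle, split into k folds, hold each
# fold out in turn, average the held-out RMSE over all folds and repeats.
import random

def rmse(actual, pred):
    return (sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual)) ** 0.5

def repeated_kfold_rmse(y, k=5, repeats=5, seed=0):
    rng = random.Random(seed)
    scores = []
    idx = list(range(len(y)))
    for _ in range(repeats):
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for held_out in folds:
            train = [y[i] for i in idx if i not in held_out]
            mean_pred = sum(train) / len(train)  # "fit": predict the train mean
            test = [y[i] for i in held_out]
            scores.append(rmse(test, [mean_pred] * len(test)))
    return sum(scores) / len(scores)  # average CV RMSE across folds and repeats

y = [float(v) for v in range(20)]
print(round(repeated_kfold_rmse(y), 3))
```

In the actual tuning loop, this average CV RMSE would be computed for every candidate hyperparameter setting, and the setting with the minimal value retained.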


https://doi.org/10.1371/journal.pone.0267558.t002

ENET and MARS generally underperform on all outcomes with respect to the RF, CF, and GBT algorithms. This implies that the fundamental relationships between the county characteristics and COVID-19 outcomes are both complex and non-linear. For predicting cases , deaths , and case rate , the random forest model performs the best. The conditional inference forest outperforms the competing techniques when predicting death rate . Each of the four forest models is built with 500 trees. The tuned hyperparameter values for the four best models define the number of variables considered at each split of the underlying trees. For all four models, this value is tuned using cross-validation and found to range from 10 to 20.

In terms of overall predictability, the highest CV R² is 0.5704, achieved using a random forest model to predict cases . It is important to note that this model uses only non-pathogen characteristics and no historical case load information, yet it captures over 57% of the variation in COVID-19 cases. The best predictive performances correspond to predicting the per capita cases by county. The next best set of predictive models are associated with deaths . The models predicting case rate are next, with R² values in the range of 0.2659 to 0.3521. Finally, every technique applied has difficulty predicting the increase in COVID-19 deaths for the most recent 30 days. This may be due to an inherent lack of predictability (e.g., due to noise in the data) or indicative that there are important features missing from the collected data.

To identify the critical county-level factors, the top 10 variables, ranked in terms of variable importance, for each of the best predictive models in Table 2 are extracted. Since multiple variables are important in different models, this set is comprised of 20 distinct variables. These 20 critical features are listed, categorized, and described in Table 3 . Four race/ethnicity variables are important: non-Hispanic Whites, Blacks, and American Indian (alone) and the per capita number of individuals belonging to two or more races (regardless of Hispanic classification). In terms of medical capacity, the number of specialized nursing facilities (including nursing homes) and the ratio of insured to uninsured individuals is critical. Three health related factors are identified as critical: percent of individuals who self-report as being in fair or poor health, the number of self-reported mentally unhealthy days, and the percent of the county that are smokers. The county-level median income and unemployment rate are two important economic factors. The first two principal components derived from the weather data are top predictors. Education level, age brackets, and population density each make the list as well as the ratio of Democrats to Republicans in each county.


https://doi.org/10.1371/journal.pone.0267558.t003

Fig 3 depicts a Spearman’s ρ rank correlation plot for the 20 variables reported in Table 3 . The correlation strengths are represented by ellipses in each cell. Strong correlations are indicated by dark, thin ellipses angled to the right (positive correlation) or to the left (negative correlation). Statistical tests for the correlation values are conducted at a significance level of 0.05. If a correlation is not statistically significant at this level, the corresponding cell is left blank.
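Spearman's ρ is simply the Pearson correlation applied to the ranks of each variable. An illustrative pure-Python sketch (with averaged ranks for ties, as in the standard definition; toy data):

```python
def ranks(x):
    """Ranks of x, 1-based, with tied values assigned their average rank."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank over the tied run
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

print(spearman([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))   # ≈ 1.0 (monotone increasing)
print(spearman([1, 2, 3, 4, 5], [10, 8, 6, 4, 2]))   # ≈ -1.0 (monotone decreasing)
```

Because it operates on ranks, Spearman's ρ captures any monotone association, not just linear ones, which suits the mix of income, health, and demographic variables compared in Fig 3.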


https://doi.org/10.1371/journal.pone.0267558.g003

Multiple variables demonstrate levels of moderate to strong correlation (or anti-correlation). The unemployment rate, percent of county without a high school degree, percent of county that are smokers, self-reported unhealthy mental days and self-reported fair/poor health status form a group of positively rank correlated variables. These same variables are negatively rank correlated to the set of factors including median income, percent of county with a 4-year degree and to some extent, with the ratio of health insurance, the first principal component for weather data, and the population of non-Hispanic Whites. The non-Hispanic Black population is negatively rank correlated with the first principal component for weather data and the population of non-Hispanic Whites, but positively rank correlated with the Democrat to Republican ratio and population density.

Table 4 lists the top ten variables, in order of importance, for each of the top performing models used in the prediction of the four distinct outcomes. The individual variable importance scores, scaled between 0 and 100 and rounded to the nearest integer, are reported in parentheses. The weather factor is a prominent predictor in all four models and the most important in all but the case rate model. This may reflect geographic diversity across the US and/or a more typical influenza-like propagation behavior associated with individuals spending more time indoors during inclement weather. Racial factors also play an important role in all four models. The deaths model uses all four race/ethnicity indicators. The case rate and death rate models only consider one race variable each, non-Hispanic American Indians and non-Hispanic Blacks, respectively. It is of note that self-reported mentally unhealthy days is the most important variable for the case rate model. This feature correlates (positively or negatively) with other socioeconomic factors such as percent reporting fair or poor health, median income values, health insurance coverage, and education. It may be that self-reported unhealthy days is an indication of other unhealthy behaviors or conditions that could lead to increases in COVID-19 cases. It is also interesting that the death rate model has median income as its second most important variable, and that it and the case rate model are the only two models that identify the number of SNF sites and health insurance status as important predictors.


https://doi.org/10.1371/journal.pone.0267558.t004

County-level clustering

The 20 features identified as critical intentionally do not include any direct COVID-19 outcomes. The objective is to identify county-level characteristics that are fundamental factors impacting how COVID-19 spreads within a community. If successful, identifying clusters of counties within this 20 dimensional subspace may enhance future analysis methods and allow researchers to distinguish important trends.

Number of clusters.

To create the subgroups, k-means, PAM, and agglomerative hierarchical clustering (HC) results are extensively evaluated on the mean-centered and scaled data. Simulation studies have shown there is no best clustering algorithm that works for all scenarios [ 67 – 69 ]. The appropriateness of a particular algorithm is dependent on the nature of the data and on the information sought. For example, k-means and PAM tend to produce “spherically” shaped clusters, whereas hierarchical clustering does not have a similar limitation. When a priori knowledge about the data is not available or insufficient, it is common to explore different algorithms to obtain meaningful clustering results through comparisons. The final choice should be a balance between statistical properties and practical interpretation.

The choice of the number of clusters is also somewhat subjective. There are many quantitative index methods used in the literature to identify the appropriate number of clusters. Unfortunately, these indicators do not typically agree with one another and there is no single “correct” method for determining the right cluster quantity. This discrepancy is clear from the excerpt of indices shown in Table 5 for k-means, PAM, and HC with the county-level data. A missing value in the table denotes that the index does not apply or is not commonly used for the associated clustering method.


https://doi.org/10.1371/journal.pone.0267558.t005

The Gap statistic is a modern numeric approach leveraging Monte Carlo simulation to help determine the optimal number of clusters and is applicable to k-means, PAM, and HC. Simulation studies show that the Gap statistic outperforms other early methods [ 75 ]. The results indicate that good settings for k are 2, 6, and values from 6 to 11, respectively, for the three algorithms. Fig 4 depicts a plot of the Gap statistic means and standard errors using 500 bootstrapped samples for k = 1, …, 18 for the hierarchical cluster values. The lower value of k = 6 is determined based on the guidance from [ 75 ], which considers the observed standard errors. The higher value of k = 11 is determined from the location of the first local maximum in the Gap statistic graph. Given the inconsistency among the index methods, we take the recommended values from the more modern Gap statistic to produce clusters for analysis. After visual inspection and evaluation of the characteristics of many sets of identified clusters, we choose the HC clusters with k = 9 as a good balance to support the objectives of this study, i.e., to identify clusters of reasonable size and similarity that also reflect a level of regionally specific diversity that can be leveraged to support public health decision-making.
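The Monte Carlo idea behind the Gap statistic can be sketched as below. This is a hedged pure-Python toy on one-dimensional data (the study's computation uses 500 bootstrapped samples in R): the log within-cluster dispersion of the data is compared against the expected dispersion under uniform reference samples; a larger gap indicates stronger cluster structure.

```python
# Gap(k) = E[log(W_k_ref)] - log(W_k), with W_k the within-cluster sum of
# squares from a simple k-means and the reference drawn uniformly over the
# data's range.
import math
import random

def kmeans_wss(points, k, rng, iters=25):
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            groups[j].append(p)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return sum(min((p - c) ** 2 for c in centers) for p in points)

def gap(points, k, rng, n_ref=20):
    w = kmeans_wss(points, k, rng)
    lo, hi = min(points), max(points)
    ref_logs = []
    for _ in range(n_ref):
        ref = [rng.uniform(lo, hi) for _ in points]
        ref_logs.append(math.log(kmeans_wss(ref, k, rng)))
    return sum(ref_logs) / n_ref - math.log(w)

rng = random.Random(1)
# Two well-separated groups: the Gap statistic should favor k = 2 over k = 1.
data = ([rng.gauss(0, 0.3) for _ in range(30)]
        + [rng.gauss(10, 0.3) for _ in range(30)])
print(gap(data, 2, rng) > gap(data, 1, rng))  # expected: True
```

In practice one evaluates Gap(k) over a range of k and applies a rule such as the first-local-maximum or standard-error criterion cited above, rather than simply taking the largest value.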


https://doi.org/10.1371/journal.pone.0267558.g004

Cluster geographic description.

Fig 5 depicts the geographical locations of the nine clusters. For clarity, the figure is shown in three maps: the first depicts clusters 1, 2, and 3; the second depicts clusters 4, 5, and 6; and the third depicts clusters 7, 8, and 9. While each cluster is often formed by sets of contiguous counties, this is entirely the result of inherent regional similarities along the 20-dimensional critical subspace.


https://doi.org/10.1371/journal.pone.0267558.g005

Cluster 1 is primarily spread throughout the Southern US census region; cluster 2 is widely dispersed and includes counties from the northwestern US, central Texas, western Oklahoma, Florida, and parts of the Northeastern US; cluster 3 forms a relatively tight grouping of counties primarily dispersed across parts of Arkansas, Missouri, Tennessee, and Kentucky. Cluster 4 is focused mostly in the southern part of the Western US region; cluster 5 is located across the US but especially grouped in certain areas (e.g., around the San Francisco area, Denver, and in the Northeastern states); cluster 6, which is composed of only 24 counties, is located in small pockets of large-area counties. Cluster 7 is another small cluster of mostly individual counties across the nation. Cluster 8 pinpoints specific, high population density counties such as San Francisco County, CA, and Bronx, NY. Cluster 9 is primarily located in the Midwestern US census region.

Cluster profile.

The nine clusters are fully profiled in Table 6 . For each cluster, the number of associated counties is reported along the average of the mean-centered and scaled values for each of the 20 critical dimensions. Additionally, the table reports the cluster average for the scaled COVID-19 outcomes, i.e., cases, deaths, case growth rate, and death growth rate. The average scaled absolute values that exceed 1 are highlighted in bold. These values indicate that the average value within the associated cluster are greater than 1 standard deviation above/below the average for counties across the US. A brief description highlighting some discriminating attributes of each cluster follows.

  • Cluster 1 has a larger Black population than the average and below average population for other races, especially White. This cluster also has a below average PC1-wx score indicating that it is associated with warmer regions. It has an above average score for the per capita COVID-19 cases and deaths and while its more recent case growth rate is about average, it has the highest value in the recent growth of COVID-19 deaths.
  • Cluster 2 is the largest subset of counties from all the groups, and none of its scores are far from the overall national average.
  • Cluster 3 has high scores for all three unhealthy metrics. This cluster has the highest score for the population of Whites and has one of the lowest median income values and a relatively low education level. This group has below average COVID-19 cases and deaths and is only slightly above average with respect to recent increases in either outcome.
  • Cluster 4 has more population identifying with two or more races, is younger, and is in a colder region of the US than the average.
  • Cluster 5 has the greatest median income and education levels and has among the lowest values for recent trends in COVID-19 cases or deaths.
  • Cluster 6 has an American Indian population that is 9.5 standard deviations above the average for US counties. It also has the lowest median income, highest unemployment rate, lowest health insurance ratio, and some of the most unhealthy metrics for physical and mental health. This group of counties has a population that is much younger than the average. The number of COVID-19 cases and recent COVID-19 case growth exceeds 1 standard deviation above the mean for all US counties. Cluster 6 has the highest values for the recent trend in COVID-19 deaths.
  • Cluster 7 has the highest percentage of adults without a high school degree and a much greater than average ratio of males to females (exceeding 4 standard deviations above the mean). This subset of 70 counties has, on average, the highest per capita COVID-19 cases and above average values for the other three COVID-19 outcomes.
  • Cluster 8 contains 21 counties whose average population density is far greater than the average (more than 7 standard deviations above the mean). It is the cluster with the greatest Black population per capita, the highest ratio of Democrats to Republicans, and the highest college education level. While its per capita COVID-19 deaths to date is the highest among all clusters, it has the lowest value for recent trend in COVID-19 cases and second to lowest in recent trend of COVID-19 deaths.
  • Cluster 9 has the second highest score for White population and the lowest number of mentally unhealthy days and lowest value for self-reported Poor/Fair health. This group also reports the lowest unemployment rate from among all the clusters. It has the second highest recent COVID-19 case growth.


https://doi.org/10.1371/journal.pone.0267558.t006

Clusters 6, 7, and 9 consist of counties with low population density, e.g., Big Horn, MT, Alfalfa, OK, and Kit Carson, CO, with 2.6, 6.5, and 3.8 persons per square mile, respectively. These rural clusters have greater than average recent COVID-19 case growth and/or recent increase in per capita deaths. Cluster 9 in particular is notable in that it represents 550 counties, and while its per capita COVID-19 cases and deaths are lower than average, its recent above average increase in cases may precede a significant increase in COVID-19 deaths. Cluster 6, on the other hand, while rural and also colder than average, looks very different from cluster 9. Cluster 6 has a notable American Indian population and has the lowest median income, highest unemployment rate, lowest health insurance ratio, and some of the unhealthiest metrics in the data. Cluster 9 mostly represents a White population with the least number of mentally unhealthy days and the lowest values for self-reported poor/fair health. Our results for cluster 6 are consistent with previous studies showing that COVID-19 incidence is much higher among American Indians/Alaska Natives than among White counterparts [ 76 ]. The lower values for the cluster 6 health and insurance factors imply that its recent case growth may have a more severe impact on lives lost. Indeed, the per county average for the increase in recent deaths is already well above average.

The 7-day rolling averages of new COVID-19 cases per 100,000 capita for the combined populations of each cluster are depicted in Fig 6 from July 2020 until mid-October. The upticks in both clusters 6 and 9 are notable in that the other clusters have had relatively flat trends recently, whereas these two have seen a pronounced increasing trend for several weeks. We hypothesize that the COVID-19 cases in both clusters have increased (since September) due in part to colder weather and potentially less restrictive lockdown policies. Cluster 6 has unique issues with inequities in access to health care, education, stable housing, healthy foods, and insurance coverage, which can lead to health disparities and higher risk for COVID-19 incidence among this aggregate population. We also suspect the notable rise in cluster 9 (since August) is due to multiple reasons, including both dropping temperatures and the fact that it is located in the Midwestern US census region, which has been the epicenter of long-term care facility outbreaks during the past four months from August to November 2020, according to [ 77 ].
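The smoothing applied in Fig 6 can be sketched in a few lines (an illustrative pure-Python toy with invented numbers, not the paper's data pipeline):

```python
# 7-day rolling average of new cases, normalized per 100,000 capita.

def rolling_avg_per_100k(daily_new_cases, population, window=7):
    out = []
    for i in range(window - 1, len(daily_new_cases)):
        window_avg = sum(daily_new_cases[i - window + 1:i + 1]) / window
        out.append(window_avg / population * 100_000)
    return out

# Toy series for a 350,000-person cluster with an uptick on day 8.
cases = [70, 70, 70, 70, 70, 70, 70, 140, 140, 140]
print(rolling_avg_per_100k(cases, population=350_000))
```

The window averages out day-of-week reporting artifacts, while the per-100k normalization makes clusters of very different total population directly comparable.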


https://doi.org/10.1371/journal.pone.0267558.g006

It is clear that the characteristics and trends are different for all of the defined clusters. Given the diversity from cluster to cluster, the underlying factors inherent to the associated groups affect both the speed and impact of the disease propagation. This inter-cluster diversity should be considered when designing interventions to effectively slow or stop the spread.

Forecasting COVID-19 propagation is difficult. The challenge is exacerbated for projections focused on local regions and locations with smaller populations, such as rural areas in the US. In part, this is due to the reliance of traditional methods on assumptions of population homogeneity. The heterogeneity of US counties contributes to this complexity, and local factors may have a disproportionate effect on disease spread.

The overall research objective of this study is to produce a new, statistically sound, data-driven clustering of US counties to create a novel COVID-19 related map of the US which balances issues of data quantity with that of regional diversity along a critical feature set. The resulting newly defined clusters are more homogeneous groups whose populations can be analyzed distinctly from one another. To achieve the objective, we address several important sub-tasks including (i) aggregation of a large array of demographic, mobility, health, and weather data, (ii) data transformation via dimension reduction to create a data set amenable to the research scope, and (iii) extensive experimentation with appropriate machine learning methods to intelligently filter and rank critical variables. From this exploration, we discover that weather plays a dominant role in case propagation, in a similar fashion to regular influenza spread; demonstrate that race plays an outsized role for both case counts and deaths; identify self-reported health and mental health as important predictors; and find that there is some political bias that relates to recent increases in county-level cases. Finally (iv), using k-means, agglomerative hierarchical clustering, and Partitioning Around Medoids, we evaluate numerous county-level clustering outcomes to determine a final set with good mathematical properties (i.e., according to the Gap statistic) that is composed of semi-contiguous regions reflecting wide diversity in their characteristics and COVID-19 patterns. Since this latter element was not embedded into the design of the clusters, the vastly different COVID-19 propagation trends are a direct result of the cluster definitions. This provides additional empirical evidence that the critical factors we identify do drive COVID-19 outcomes.

The policies, communication, and interventions to protect all groups identified should take into account their distinct profiles. This study provides a mechanism to leverage data to better understand the diversity across the nation and how that diversity impacts disease spread. When considering the clusters, meaningful patterns emerge that can help guide policy decisions, mitigation efforts, and analytical accuracy. In future work, we seek to leverage the unique characteristics of each cluster to enhance regional and local level time series forecasting and disease prediction. Additionally, we will consider the impact of local, state, and federal public health interventions on the unique subgroups across the US and how these exogenous factors interact with the inherent characteristics of the clusters to affect disease propagation.

Acknowledgments

The authors gratefully acknowledge the support of the Vice President for Research and Partnerships of the University of Oklahoma.

  • 1. Johns Hopkins University Medicine. COVID-19 SES Data Hub, Hopkins Population Center; 2020. Dataset. Available from: https://github.com/QFL2020/COVID_DataHub .
  • 2. Keating D, Karklis L. Rural areas may be the most vulnerable during the coronavirus outbreak; 2020. Available from: https://www.washingtonpost.com/nation/2020/03/19/rural-areas-may-be-most-vulnerable-during-coronavirus-outbreak .
  • 7. Wendelboe AM, Dvorak J, Anderson MP. OSDH releases COVID-19 modeling for Oklahoma, estimates April 21 peak. Available from: https://coronavirus.health.ok.gov/articles/osdh-releases-covid-19-modeling-oklahoma-estimates-april-21-peak .
  • 11. United States Centers for Disease Control and Prevention. FluSurge 2.0; 2016. Available from: https://www.cdc.gov/flu/pandemic-resources/tools/flusurge.htm .
  • 15. Keeling MJ, Rohani P. Modeling infectious diseases in humans and animals. Princeton University Press; 2011.
  • 31. Wang L, Wang G, Gao L, Li X, Yu S, Kim M, et al. Spatiotemporal dynamics, nowcasting and forecasting of COVID-19 in the United States. arXiv preprint arXiv:2004.14103; 2020.
  • 32. Killeen BD, Wu JY, Shah K, Zapaishchykova A, Nikutta P, Tamhane A, et al. A county-level dataset for informing the United States’ response to COVID-19; 2020. Dataset. Available from: https://github.com/JieYingWu/COVID-19_US_County-level_Summaries .
  • 33. Killeen BD, Wu JY, Shah K, Zapaishchykova A, Nikutta P, Tamhane A, et al. A county-level dataset for informing the United States’ response to COVID-19. arXiv preprint arXiv:2004.00756; 2020.
  • 35. Google. COVID-19 Community Mobility Reports; 2020. Dataset. Available from: https://www.google.com/covid19/mobility/ .
  • 36. Google. Public datasets: weather and climate; 2020. Dataset. Available from: https://cloud.google.com/public-datasets/weather .
  • 37. USAFacts. US Coronavirus Cases and Deaths; 2020. Dataset. Available from: https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/ .
  • 40. Saeys Y, Abeel T, Van de Peer Y. Robust Feature Selection Using Ensemble Feature Selection Techniques. In: Daelemans W, Goethals B, Morik K, editors. Machine Learning and Knowledge Discovery in Databases. Berlin, Heidelberg: Springer Berlin Heidelberg; 2008. p. 313–325.
  • 41. Beattie M. Combining classification and Bayesian methods to better model drug abuse. University of Oklahoma, Oklahoma, USA; 2018.
  • 54. MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. vol. 1. Oakland, CA, USA; 1967. p. 281–297.
  • 55. Kaufman L, Rousseeuw P. Clustering by means of medoids. Netherlands: Faculty of Mathematics and Informatics. Delft University of Technology. 1987;.
  • 57. Kassambara A. Practical guide to cluster analysis in R: Unsupervised machine learning. vol. 1. Create Space Independent Publishing Platform; 2017.
  • 59. R Core Team. R: A language and environment for statistical computing; 2021. Available from: https://www.R-project.org/ .
  • 63. Greenwell B, Boehmke B, Cunningham J, Developers G. gbm: Generalized boosted regression models; 2020. Available from: https://CRAN.R-project.org/package=gbm .
  • 64. Milborrow S. earth: Multivariate adaptive regression splines; 2021. Available from: https://CRAN.R-project.org/package=earth .
  • 65. Kuhn M. caret: Classification and regression training; 2021. Available from: https://CRAN.R-project.org/package=caret .
  • 66. Di Lorenzo P. usmap: US maps including Alaska and Hawaii; 2021. Available from: https://CRAN.R-project.org/package=usmap .
  • 70. Beale E. Euclidean cluster analysis. Scientific Control Systems Limited; 1969.
  • 77. Curiskis A, Goldfarb A, Kissane E, Ledur J, Rivera JM, Oehler K, et al. Midwest outbreaks pause, hospitalizations and deaths keep rising: This week in COVID-19 data, Nov 25; 2020. Available from: https://covidtracking.com/analysis-updates/midwest-outbreaks-pause-hospitalizations-and-deaths-keep-rising .


Springer Nature - PMC COVID-19 Collection

An Improved K-means Clustering Algorithm Towards an Efficient Data-Driven Modeling

1 Department of Computer Science and Engineering, Chittagong University of Engineering & Technology, Chittagong, 4349 Bangladesh

Md. Asif Iqbal

Avijeet Shil

M. J. M. Chowdhury

2 Department of Computer Science and Information Technology, La Trobe University, Victoria, 3086 Australia

Mohammad Ali Moni

3 School of Health and Rehabilitation Sciences, Faculty of Health and Behavioural Sciences, The University of Queensland, St Lucia, QLD 4072 Australia

Iqbal H. Sarker

Associated Data

Data and codes used in this work can be made available upon reasonable request.

Abstract

The K-means algorithm is one of the well-known unsupervised machine learning algorithms. It typically partitions a dataset into distinct non-overlapping clusters, assigning each point to the group whose centroid is nearest by minimum squared distance. One of the main concerns of the K-means algorithm is finding optimal initial cluster centroids: determining their positions at the very first iteration is the most challenging task. This paper proposes an approach that finds the optimal initial centroids efficiently, reducing both the number of iterations and the execution time. To analyze the effectiveness of the proposed method, we have conducted experiments on different real-world datasets. We first analyzed COVID-19 and patient datasets to show the proposed method's efficiency. A synthetic dataset of 10M instances with 8 dimensions is also used to estimate the performance of the proposed algorithm. Experimental results show that our proposed method outperforms the traditional kmeans++ and random centroid initialization methods in terms of computation time and number of iterations.

Introduction

Machine learning is a subset of Artificial Intelligence that makes applications capable of learning and improving results through experience, without being explicitly programmed [ 1 ]. Supervised and unsupervised learning are the basic approaches of machine learning. Unsupervised algorithms identify hidden structures in unlabelled data [ 2 ]. Based on these hidden structures, clustering algorithms are typically used to find groups of similar data, which can also be considered a core part of data science, as mentioned in Sarker et al. [ 3 ].

In the context of data science and machine learning, K-means clustering is known as one of the powerful unsupervised techniques to identify the structure of a given dataset. Clustering is a common choice for separating data into groups and is widely used for its simplicity [ 4 ]. Its applications appear in many important real-world scenarios, for example, recommendation systems, various smart city services, cybersecurity, and many more. Beyond this, clustering is one of the most useful techniques for business data analysis [ 5 ]. The K-means algorithm has also been used to analyze users' behavior and context-aware services [ 6 ]. Moreover, the K-means algorithm plays a vital role in complicated feature extraction.

In terms of problem type, the K-means algorithm is considered NP-hard [ 7 ]. It is widely used to divide an unlabelled dataset into k clusters to solve real-world problems in the various application domains mentioned above. Each point is assigned by calculating its distance from each cluster centroid. The coordinates of the initial centroids must be fixed at initialization, so this step has a crucial role in the K-means algorithm. Generally, the initial centroids are selected randomly; if they can be determined more systematically, the algorithm takes fewer steps to converge. According to D. T. Pham et al. [ 8 ], the overall complexity of the K-means algorithm is

O(nkt),    (1)

where t is the number of iterations, k is the number of clusters, and n is the number of data points.

Optimization plays an important role in both supervised and unsupervised learning algorithms [ 9 ], so it is a great advantage if we can save computational cost through optimization. This paper gives an overview of estimating the initial centroids more efficiently with the help of principal component analysis (PCA) and the percentile concept, requiring fewer iterations and less execution time than the conventional method.

In this paper, recent COVID-19 datasets, a healthcare dataset, and a synthetic dataset of 10 million instances are used to analyze our proposed method. In the COVID-19 dataset, K-means clustering is used to divide countries into different clusters based on health care quality. A patient dataset with relatively high instances and low dimensions is used for clustering the patients and inspecting the performance of the proposed algorithm. Finally, the 10M-instance synthetic dataset is used to evaluate the performance. We have also compared our method with the kmeans++ and random centroid selection methods.

The key contributions of our work are as follows:

  • We propose an improved K-means clustering algorithm that can be used to build an efficient data-driven model.
  • Our approach finds the optimal initial centroids efficiently to reduce the number of iterations and execution time.
  • To show the efficiency of our model comparing with the existing approaches, we conduct experimental analysis utilizing a COVID-19 real-world dataset. A 10M synthetic dataset as well as a health care dataset have also been analyzed to determine our proposed method’s efficiency compared to the benchmark models.

This paper provides an algorithmic overview of our proposed method to develop an efficient K-means clustering algorithm. It is an extended and refined version of the paper [ 10 ]. The extensions over the previous paper are (i) analysis of the proposed algorithm with two additional datasets along with the COVID-19 dataset [ 10 ], (ii) comparison with the random and kmeans++ centroid selection methods, (iii) a more generalized analysis of our proposed method in different fields, and (iv) more recent related work and a summary of several real-world applications.

In Sect. 2 , we discuss related work. In Sect. 3 , we describe our proposed methodology with a proper example. In Sect. 4 , we present experimental results, a description of the datasets, and a comparative analysis. Sections 5 and 6 contain the discussion and conclusion.

Related Work

Several approaches have been proposed to find the initial cluster centroids more efficiently. In this section, we review some of these works. M. S. Rahman et al. [ 11 ] provided a centroid selection method based on radial and angular coordinates. The authors showed experimental evaluations of their proposed work for small (10k-20k) and large (1M-2M) datasets. However, the number of iterations of their method is not constant across test cases, so its runtime increases drastically as the number of clusters grows. A. Kumar et al. proposed finding initial centroids based on a dissimilarity tree [ 12 ]. This method improves k-means clustering slightly, but the execution time is not significantly enhanced. In [ 13 ], M. S. Mahmud et al. proposed a novel weighted average approach to finding the initial centroids by calculating the mean of every data point's distance; it only reports the execution time for 3 clusters on 3 datasets, and the improvement in execution time is also trivial. In [ 14 ], M. Goyal et al. tried to find the centroids by dividing the sorted distances into k equal partitions, where k is the number of clusters. This method's execution time was not demonstrated. M. A. Lakshmi et al. [ 15 ] proposed a method to find initial centroids with the help of the nearest-neighbour method. They compared their method using SSE (Sum of Squared Errors) with the random and kmeans++ initial centroid selection methods; the SSE of their method was roughly similar to both, and they did not provide any comparison of execution time. K. B. Sawant [ 16 ] proposed a method to find the initial clusters using neighbourhood distances: all distances from the first point are calculated and sorted, and the entire dataset is divided into equal portions. However, the author did not present any comparative analysis showing the proposed method to be better than existing ones. In [ 17 ], the authors proposed saving the distance to the nearest cluster of the previous iteration and using it for comparison in the next iteration, but the initial centroids are still selected randomly. In [ 18 ], M. Motwani et al. proposed a method with the farthest distributed centroids clustering (FDCC) algorithm; the authors did not report how this approach performs on a distributed dataset or compare execution times. M. Yedla et al. [ 19 ] proposed a method where the algorithm sorts the data points by their distance from the origin and subdivides them into k sets, where k is the number of clusters needed.

COVID-19 has been the subject of several major studies lately, and the K-means algorithm has a significant influence on these studies. S. R. Vadyala et al. proposed a combined algorithm with k-means and LSTM to predict the number of confirmed COVID-19 cases [ 20 ]. In [ 21 ], A. Poompaavai et al. attempted to identify the areas of India affected by COVID-19 using the k-means clustering algorithm. Many approaches have attempted to address COVID-19 problems using k-means clustering. In [ 22 ], S. K. Sonbhadra et al. proposed a novel bottom-up approach for COVID-19 articles using k-means clustering along with DBSCAN and HAC. S. Chinchorkar used the K-means algorithm for defining COVID-19 containment zones; in this paper [ 23 ], the sizes and locations of such zones (affected by Corona-positive patients) are considered dynamic, and K-means is proposed to handle the zones dynamically. However, if the number of Corona-positive patients grows rapidly, K-means may not be effective, as handling the dataset would take huge computational power and resources. N. Aydin et al. used K-means in assessing countries' performance against COVID-19; in this paper [ 24 ], K-means and hierarchical clustering methods are used for cluster analysis to find the optimum number of classes in order to categorize the countries for further performance analysis. In [ 25 ], a comparison is made based on the GDP declines and deaths in China and the OECD countries; K-means is used for cluster analysis to find the current impact of GDP growth rate, deaths, and account balances. T. Zhang used a generalized K-means algorithm in GLMs to group state-level time-series patterns for analyzing the COVID-19 outbreak in the United States [ 26 ].

The K-means clustering algorithm also has a huge impact on patient- and medical-related work, and many researchers use it for their research. L. d. l. Fuente-Tomas et al. [ 27 ] used the k-means algorithm to classify patients with bipolar disorder. P. Silitonga et al. [ 28 ] used a k-means clustering algorithm for clustering patient disease data. N. Das et al. [ 29 ] used the k-means algorithm to find the nearest blood and plasma donors. M. S. Alam et al. [ 30 ] used the k-means algorithm for detecting human brain tumors in magnetic resonance imaging (MRI) images. Optimized data mining and clustering models can provide insightful information about the transmission pattern of the COVID-19 outbreak [ 31 ].

The improved and efficient k-means clustering algorithms proposed previously take the initial centers either by randomization [ 15 , 17 ] or via the k-means++ algorithm [ 15 , 32 ], and these selection processes take more time. In contrast, our proposed k-means algorithm chooses the initial centroids using principal component analysis (PCA) and percentiles.

Proposed Methodology

K-means clustering is an NP-hard optimization problem [ 33 ], and the efficiency of the k-means clustering algorithm depends on the assignment of the initial cluster centroids [ 34 ]. It is therefore important to select the centroids systematically to improve the algorithm's performance and execution time. This section introduces our proposed method of assigning initial centroids by using Principal Component Analysis (PCA) and dividing the values into percentiles to get efficient initial centroid coordinates. The flowchart in Fig. 1 depicts the overall process of our proposed method.

Fig. 1: Flowchart of the proposed method

In the next subsections, we describe our proposed method.

Input Dataset and Pre-processing

In Sect. 4.1 , a vivid description of the datasets of our model is given. Every dataset has properties that require manual pre-processing before it can be fed to the K-means clustering algorithm. K-means can only operate on numerical data, so any non-numerical (e.g., categorical) value must be transformed, and missing values must also be handled during manual pre-processing.

As we are trying to solve real-world problems with our proposed method, not all attributes are equally important. We have therefore selected a subset of attributes for the K-means clustering implementation; the right attributes must be specified before applying Principal Component Analysis (PCA) and percentiles.

Principal Component Analysis (PCA)

PCA is a method, mostly discussed in mathematics, that utilizes an orthogonal transformation to translate a set of observations of potentially correlated variables into a set of values of linearly uncorrelated variables, called principal components. PCA is widely used in data analysis and predictive modeling [ 35 ]. It reduces the dimension of a dataset, increasing interpretability while minimizing the loss of information; the orthogonal transformation is used for this purpose. Thus, the PCA algorithm helps to quantify the relationships within large related datasets [ 36 ] and helps to reduce computational complexity. A pseudo-code of the PCA algorithm is provided in Algorithm 1.

Algorithm 1: Pseudo-code of the PCA algorithm (shown as an image in the original)

PCA tries to fit as much information as possible into the first component, then the second component, and so on.

We convert the multi-dimensional dataset into two dimensions for our proposed method using PCA. Because with these two dimensions, we can easily split the data into horizontal and vertical planes.
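As a minimal sketch of this reduction step (assuming scikit-learn, with an illustrative random feature matrix standing in for the pre-processed data; the names and sizes are ours, not the paper's):

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative stand-in for a pre-processed numeric dataset:
# 500 samples with 8 features (not the paper's actual data).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 8))

# Project onto the first two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                     # (500, 2)
print(pca.explained_variance_ratio_)  # variance captured per component
```

The `explained_variance_ratio_` attribute shows how much of the information ends up in each component, which motivates using only the first one in the next step.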

Percentile

The percentile model is a well-known method in statistics. It divides the whole dataset into 100 parts, each containing 1 percent of the total data. For example, the 25th percentile bounds the part containing 25 percent of the total dataset. This means that, using the percentile method, we can split our dataset into different distributions according to given values [ 37 ].

The percentile formula is given below:

R = (P / 100) × (n + 1),

where P is the percentile to find, n is the total number of values, and R is the rank (position) of the P-th percentile.

After reducing the dataset to two dimensions by applying PCA, the percentile method is used on the dataset. Percentiles can only be applied to one-dimensional data, and since the first component of PCA holds the majority of the information, we consider only the first component.

Finally, the dataset is partitioned with the help of the percentile method according to the desired number of clusters: the projected data are split into as many equal parts as there are clusters.
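A hedged sketch of this percentile partitioning, using NumPy on an illustrative one-dimensional component (variable names such as `pc1` and `bands` are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
pc1 = rng.normal(size=100)   # illustrative first principal component
k = 4                        # desired number of clusters

# Interior cut points at the 25th, 50th and 75th percentiles
cuts = np.percentile(pc1, [100 * i / k for i in range(1, k)])

# Band index 0..k-1 for every point: one band per percentile slice
bands = np.digitize(pc1, cuts)

# With 100 distinct values, each band holds n/k = 25 points
counts = np.bincount(bands, minlength=k)
print(counts)   # [25 25 25 25]
```

Each band later maps back to a subset of the original rows, from which one initial centroid is derived.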

Dataset Split and Mean Calculation

After splitting the reduced-dimensional dataset through percentiles, we extract the corresponding rows from the primary dataset by indexing each percentile band; in this way, we get back the original data. After retrieving the original data for each band, we calculate the mean of each attribute. These means of the split datasets are the initial cluster centroids of our proposed method.
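A small sketch of this step with illustrative arrays: `bands` stands for the percentile-band index of each point and `X` for the original full-dimensional data (both are our stand-ins, not the paper's variables):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))        # original data, 5 attributes
k = 4
bands = np.repeat(np.arange(k), 25)  # percentile band of each point

# Mean of every attribute within each band: the proposed initial centroids
centroids = np.vstack([X[bands == j].mean(axis=0) for j in range(k)])
print(centroids.shape)   # (4, 5): one centroid per cluster, in original space
```

Note that the means are taken in the original feature space, not the PCA space, so the centroids can be passed directly to k-means.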

Centroids Determination

After splitting the dataset and calculating the means according to Subsect. 3.4 , we select each split dataset's mean as a centroid. These centroids are the proposed initial centroids for the efficient k-means clustering algorithm. K-means is an iterative method that partitions an unsupervised dataset into non-overlapping subgroups. The algorithm tries to make each group as homogeneous as possible for the data points within a cluster while keeping it separate from the other clusters. A pseudo-code of the k-means algorithm is provided in Algorithm 2:

Algorithm 2: Pseudo-code of the k-means algorithm (shown as an image in the original)
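The iterations above can be sketched as plain Lloyd-style k-means in NumPy (a minimal version under our own naming; the proposed method only changes how the starting `centroids` are chosen, not this loop):

```python
import numpy as np

def kmeans(X, centroids, max_iter=300, tol=1e-4):
    """Standard k-means iterations starting from the given initial centroids."""
    centroids = centroids.copy()
    for n_iter in range(1, max_iter + 1):
        # Assign each point to the nearest centroid by squared distance
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Recompute every centroid as the mean of its assigned points
        new = np.vstack([X[labels == j].mean(axis=0)
                         if (labels == j).any() else centroids[j]
                         for j in range(len(centroids))])
        if np.linalg.norm(new - centroids) < tol:   # converged
            return labels, new, n_iter
        centroids = new
    return labels, centroids, max_iter
```

Returning `n_iter` alongside the labels makes it easy to compare iteration counts across different initialization schemes, which is exactly what the experiments below measure.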

Cluster Generation

In the last step, we execute our modified k-means algorithm until the centroids converge. Passing our proposed centroids through the k-means algorithm instead of random or kmeans++ centroids, we generate the final clusters [ 32 ]. The proposed method always starts from the same centroids for each test. The pseudo-code of our whole proposed methodology is given in Algorithm 3. In the next section, the evaluation and experimental results of our proposed model are discussed.

Algorithm 3: Pseudo-code of the proposed methodology (shown as an image in the original)
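With scikit-learn, passing fixed centroids instead of random or kmeans++ initialization looks roughly like this (the placeholder `init_centroids` stands in for the percentile-split means described above; data and names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))   # illustrative data
k = 3

# Placeholder initial centroids; the proposed method derives these
# from the per-split attribute means instead
init_centroids = np.vstack([X[i::k][:50].mean(axis=0) for i in range(k)])

# n_init=1: the starting centroids are fixed, so no restarts are needed
km = KMeans(n_clusters=k, init=init_centroids, n_init=1).fit(X)
print(km.n_iter_, km.inertia_)
```

Because `init` is a deterministic array, every run of this configuration starts from the same point, which is what makes the iteration counts in the experiments constant.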

Evaluation and Experimental Result

We have conducted several experiments to measure the effectiveness of our proposed model for selecting optimal initial centroids for the k-means clustering algorithm and to validate it. The proposed model is tested with high-dimension, relatively high-instance, and very high-instance datasets. We have used a few COVID-19 datasets and merged them to obtain a handful of features for clustering countries according to their health quality during COVID-19. We have also tested the model for clustering health care patients [ 38 ]. The model is further tested with 10 million data points created with the scikit-learn library [ 39 ]. A detailed explanation of the datasets is given in the following subsections.

Dataset Exploration

We experimented with different datasets to assess our model's efficiency. The properties of the datasets used are:

  • Low instances with a high dimensional dataset
  • Relatively high instances with a low dimensional dataset
  • Very high instances dataset

In Subsects. 4.1.1 , 4.1.2 and 4.1.3 , a brief explanation of those datasets is given.

COVID-19 Dataset

Many machine learning algorithms, including supervised and unsupervised methods, have been applied to COVID-19 data. For creating our model, we used a few datasets to select the features required for analyzing the health care quality of countries. The selected datasets are owid-covid-data [ 40 ], covid-19-testing-policy [ 41 ], public-events-covid [ 41 ], covid-containment-and-health-index [ 41 ], and inform-covid-indicators [ 42 ]. It is worth mentioning that we used the data up to 11 August 2020.

For instance, some of the attributes of the owid-covid-data [ 40 ] are shown in Table 1. The covid-19-testing-policy [ 41 ] dataset contains the categorical values of the testing policies of the countries, shown in Table 2.

Sample data of COVID-19 dataset

Country | Total cases per million | New cases per million | Total deaths per million | New deaths per million | Cardiovasc death rate | Hospital beds per thousand | Life expectancy
Australia | 839.102 | 12.275 | 12.275 | 0.706 | 107.791 | 3.84 | 83.44
Bangladesh | 1581.808 | 17.651 | 20.876 | 0.237 | 298.003 | 0.8 | 72.59
China | 61.769 | 0.079 | 3.258 | 0.003 | 261.899 | 4.34 | 76.91

Sample data of Covid-19-testing-policy

Entity | Code | Date | Testing policy
Australia | AUS | Aug 11, 2020 | 3
Bangladesh | BGD | Aug 11, 2020 | 2
China | CHN | Aug 11, 2020 | 3

Other datasets also contained such features required to ensure the health care quality of a country. These real-world datasets helped us to analyze our proposed method for real-world scenarios.

We merged the datasets according to country name with regular-expression pre-processing. Some pre-processing and data cleaning were conducted when merging, and we also handled some missing data consciously. There are many attributes regarding COVID-19; among them, 25 attributes were finally selected, as these attributes closely signify the health care quality of a country. The attributes represent categorical and numerical values. These are: country name, cancellation of public events (due to public health awareness), stringency index, testing policy (category of testing facility available to the general public), total positive cases per million, new cases per million, total deaths per million, new deaths per million, cardiovascular death rate, hospital beds available per thousand, life expectancy, INFORM COVID-19 risk (rate), hazard and exposure dimension rate, people using at least basic sanitation services (rate), INFORM vulnerability (rate), INFORM health conditions (rate), INFORM epidemic vulnerability (rate), mortality rate, prevalence of undernourishment, lack of coping capacity, access to healthcare, physician density, current health expenditure per capita, and maternal mortality ratio. We consciously selected the features before feeding the model. A two-dimensional plot of the dataset is shown in Fig. 2.

Fig. 2: 2D distribution plot of COVID-19 dataset

It is a dataset of low instances with high dimensions.

Medical Dataset

Our second dataset is a medical dataset with 100k instances. It is an open-source dataset for research purposes, drawn from a random sample of 6 million patient records in the Medical Quality Improvement Consortium (MQIC) database [ 38 ], with all personal information excluded. The final attributes of the dataset are: gender, age, diabetes, hypertension, stroke, heart disease, smoking history, and BMI.

A glimpse of the medical dataset is given in Table 3. It is a dataset of relatively high instances with low-dimensional data. Figure 3 provides a graphical representation of the dataset, which gives meaningful insights into the distribution of the data.

Sample data of Medical Dataset

Gender | Age | Diabetes | Hypertension | Stroke | Heart disease | Smoking history | BMI
Female | 80.0 | 0 | 0 | 0 | 1 | Never | 25.19
Female | 36.0 | 0 | 0 | 0 | 0 | Current | 23.45
Female | 44.0 | 1 | 0 | 0 | 0 | Never | 19.31
Male | 42.0 | 0 | 0 | 0 | 0 | Never | 33.64
Male | 18.0 | 0 | 0 | 0 | 0 | Never | 21.78

Fig. 3: 2D distribution plot of medical dataset

Synthetic Dataset

A final synthetic dataset has been made to cover the very-high-instances category. This dataset is created with the scikit-learn library [ 39 ] and has 8 dimensions with 10M (ten million) instances. The two-dimensional distribution is shown in Fig. 4.

Fig. 4: 2D distribution plot of scikit-learn library dataset

Experimental Setup

We have selected the following questions to evaluate our proposed model.

  • Is the proposed method for selecting efficient initial cluster centroids working well for both high- and low-dimensional datasets?
  • Does the method reduce the number of iterations for finding the final clusters with the k-means algorithm compared to the existing methods?
  • Does the method reduce the execution time compared to the existing methods for the k-means clustering algorithm?

To answer these questions, we have used real-world COVID-19 health care quality data of different countries, patient data, and a scikit-learn library dataset with 10M instances. In the following subsections, we briefly discuss the experimental results and compare their effectiveness.

Evaluation Process

In machine learning and data science, computational power is one of the main issues because the computer needs to process a large amount of data at once, so reducing computational cost matters. K-means clustering is a popular unsupervised machine learning algorithm widely used in different clustering processes. The algorithm randomly selects the initial cluster centroids, which sometimes causes many iterations and high computational cost. We have implemented our proposed method as discussed in the methodology section. Many researchers have proposed other ideas, discussed in the related work in Sect. 2 ; we compare our method with the best existing method, k-means++, and the random method [ 32 ]. We measure the effectiveness of the model with:

  • Number of iterations needed for finding the final clusters
  • Execution time for reaching out to the final clusters

These two things will be measured in the upcoming subsections.

Experimental Result

Analysis with COVID-19 Dataset

Firstly, we start our experiment with the COVID-19 dataset; a detailed explanation of the dataset is provided in Subsect. 4.1.1 . As we are making clusters of countries with similar health care quality, we determined the optimum number of clusters with the elbow method [ 44 ]. For the COVID-19 dataset, the optimum number of clusters is 4, so we examine the results with 4 clusters, where each cluster contains countries of similar health care quality.
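The elbow method used here can be sketched as follows (assuming scikit-learn; `X` is an illustrative matrix, and the "elbow" is read off the inertia curve by eye):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 4))   # illustrative feature matrix

# Within-cluster sum of squares (inertia) for a range of k values;
# the k where the curve stops dropping sharply is the "elbow"
inertias = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_
            for k in range(1, 8)]
print(inertias)
```

Plotting `inertias` against k and picking the bend is the usual way the optimum cluster count (4 for the COVID-19 dataset) is chosen.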

In Fig. 5, we show the experimental results in terms of iteration count for the COVID-19 dataset over 50 tests. Here, we compare our proposed method with the traditional random centroid selection method and the best existing centroid selection method, kmeans++. The yellow, red, and blue lines in Fig. 5 represent the random, kmeans++, and proposed methods, respectively. The graph clearly shows that the number of iterations of our model is constant and outperforms the others in most cases, whereas the random and kmeans++ methods vary randomly and in most cases require more iterations than our proposed method. Thus, our model outperforms in terms of iteration count.

Fig. 5: Iteration for 4 clusters with COVID-19 dataset

The graph in Fig. 6 presents the experimental results for execution time with the COVID-19 data; execution time is closely related to the number of iterations. In Fig. 6, the yellow line represents the results for the random method, the red line the existing kmeans++ method, and the blue line our proposed method. As the centroids of our proposed method are the same for every run, the execution time is always nearly constant. Figure 6 shows that the execution time of the random and kmeans++ methods varies randomly, while our proposed method outperforms them in every test case.

Fig. 6: Execution time for 4 clusters with COVID-19 dataset

Analysis with Medical Dataset

The K-means clustering algorithm is widely used in the medical sector as well, and patient clustering is one of the important tasks where it is usually applied. We analyzed our model with the patient data mentioned in Subsect. 4.1.2 . As it is a real-world implementation, we used the elbow method to find the optimum number of clusters, which for this dataset is 3.

We conducted 50 tests over the dataset. The final outcome in terms of iterations is shown in Fig. 7. The blue line represents the number of iterations for 3 clusters with the dataset, and it is nearly constant. Compared to the other two methods, represented by the yellow line for random and the red line for kmeans++, our model notably outperforms.

Fig. 7: Iteration for 3 clusters with Medical Dataset

In Fig. 8, we show the experimental results for execution time. The blue line represents the execution time for 3 clusters with the dataset, and it is nearly constant. Compared to the other two methods, represented by the yellow line for random and the red line for kmeans++, our model notably outperforms.

Fig. 8: Execution time for 3 clusters with Medical Dataset

Analysis with Synthetic Dataset

In practical scenarios, the data to be clustered may be massive. For that reason, we created the synthetic dataset described in Subsect. 4.1.3 to check whether our model also works for a very large dataset. This synthetic dataset contains about 10 million instances and is created for demonstration purposes only; the clusters were also created randomly for testing our model. We ran the random, kmeans++, and proposed methods 50 times each on this dataset, testing the model with 3, 4, and 5 clusters. Figure 9 graphically represents the experimental results: the blue, yellow, and red lines represent the proposed, random, and kmeans++ methods, respectively. The left-side graphs show the results in terms of iterations, and the right-side graphs show the execution times for 3, 4, and 5 clusters.

Fig. 9: Total iteration and execution time for different numbers of clusters with Synthetic Dataset: (a) 3 clusters, (b) 4 clusters, (c) 5 clusters

For each number of clusters, the number of iterations is constant for our proposed method, and it outperforms the others in most test cases, whereas the other methods behave randomly; the kmeans++ and random methods do not reduce the iterations significantly. This is a remarkable contribution.

Execution time is a vital issue for improving an algorithm's performance. The right-side graphs of Fig. 9 show the results in terms of execution time for 3, 4, and 5 clusters. The execution time needed for our proposed method is nearly constant and outperforms the kmeans++ and random methods.

Effectiveness Comparison Based on Number of Iterations and Execution Time

Computational cost is one of the fundamental issues in data science; if we can reduce the number of iterations to some extent, performance improves. A constant number of iterations also means the algorithm delivers the same performance over time.

In Fig. 5, we present the experimental results for the COVID-19 dataset. With our proposed method, the number of iterations is constant in every test, while the iteration counts of the existing kmeans++ and random methods vary across test cases. Figure 6 compares the execution times of the existing kmeans++, random, and proposed methods; our model executed the k-means clustering algorithm in the shortest time in each test case.

The medical dataset contains relatively high-instance, real-world hospital patient data. Figures 7 and 8 present the experimental results for the random, kmeans++, and proposed methods. We found that our model converged to the final clusters with a reduced, constant number of iterations and improved execution time.

In Fig. 9, we show the experimental results of the K-means clustering algorithm for 3, 4, and 5 clusters, in terms of both the number of iterations and the execution time. This experiment was done on the synthetic dataset described in Subsect. 4.1.3 . The constant optimum iteration count, compared to the existing kmeans++ and random models, and the shortest execution time signify that our model also outperforms on large datasets.

Based on the experimental results, we conclude that our model performs well in real-world applications and reduces the computational cost of the K-means clustering algorithm, including on datasets with a very large number of instances.

The above discussion answers the last two questions posed in the experimental setup (Subsect. 4.2).

K-means clustering is one of the most popular unsupervised clustering algorithms. By default, the K-means algorithm selects the initial centroids randomly, which can consume considerable computational power and time. Over the years, many researchers have tried to select the initial centroids more systematically; some of these works are discussed in Sect. 2, though most previous approaches are neither detailed enough nor widely adopted. From Equation 1, we infer that the execution time is reduced significantly if the number of iterations is reduced, so our proposed method focuses on minimizing the iteration count. Two statistical tools, PCA (principal component analysis) [35] and percentiles [37], are used to implement the proposed method, which yields an optimal number of iterations for the K-means clustering algorithm. Since kmeans++ [32] is the most popular and standard initialization method, we compared our model against both kmeans++ and the default random method to analyze its efficiency. We have not modified the core K-means clustering algorithm; instead, we developed an efficient centroid-selection process that yields a constant, optimal number of iterations. Because the time complexity is directly proportional to the number of iterations (Equation 1), and our model requires fewer iterations, the overall execution time is reduced.
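One way to combine PCA and percentiles for centroid seeding is sketched below. This is our reading of the idea, not the paper's exact procedure: project the data onto the first principal component, cut the projected scores into k equal-frequency slices with percentiles, and use each slice's mean point as one initial centroid. The function name and parameter choices are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def pca_percentile_init(X, k):
    """Sketch of percentile-based seeding: project the data onto the first
    principal component, cut the projection into k equal-frequency slices,
    and use each slice's mean point as one initial centroid."""
    scores = PCA(n_components=1).fit_transform(X).ravel()
    edges = np.percentile(scores, np.linspace(0, 100, k + 1))
    centroids = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (scores >= lo) & (scores <= hi)
        centroids.append(X[mask].mean(axis=0))
    return np.asarray(centroids)

# Deterministic seeding: the same data always gives the same centroids,
# so K-means runs the same number of iterations on every invocation.
X = np.random.default_rng(0).normal(size=(500, 5))
init = pca_percentile_init(X, 3)
km = KMeans(n_clusters=3, init=init, n_init=1).fit(X)
```

Because the seeding is deterministic, repeated runs on the same data converge identically, which matches the constant iteration counts reported above.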

To solve a real-world problem with our proposed method, we clustered countries with similar levels of health-care quality; this is high-dimensional data. The medical dataset was likewise used to form patient clusters. We also used a synthetic dataset of 10M instances, created with scikit-learn, to confirm that our model also performs well on a very large number of instances. The model performs well on both low- and high-dimensional datasets. This technique could therefore be applied to many unsupervised learning problems in real-world application areas, ranging from personalized services to today's smart-city services and security, e.g., detecting cyber-anomalies [6, 45]. The Internet of Things (IoT) is another cutting-edge domain where clustering is widely used [46].
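A large synthetic benchmark of the kind described can be generated directly with scikit-learn. The sketch below uses a smaller stand-in (scale `n_samples` up toward 10,000,000 to approach the paper's size; the cluster and dimension counts here are illustrative) and times K-means under the two baseline initializations:

```python
import time

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Smaller stand-in for the paper's 10M-instance scikit-learn benchmark.
X, _ = make_blobs(n_samples=50_000, centers=5, n_features=10, random_state=42)

for init in ("random", "k-means++"):
    t0 = time.perf_counter()
    km = KMeans(n_clusters=5, init=init, n_init=1, random_state=0).fit(X)
    print(f"{init}: {km.n_iter_} iterations, {time.perf_counter() - t0:.2f}s")
```

On data of this shape the per-iteration cost dominates, so any reduction in iteration count translates almost directly into wall-clock savings.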

Our proposed method reduces computational cost, so it runs faster on clustering problems with very large data volumes. It is easy to implement, and no extra setup is needed.

In this article, we have proposed an improved K-means clustering method that increases the performance of the traditional algorithm. We significantly reduced the number of iterations by systematically selecting the initial centroids used to generate the clusters. PCA and percentile techniques were used to reduce the dimensionality of the data and to segregate the dataset according to the number of clusters; the segregated data were then used to select the initial centroids. In this way, we successfully minimized the number of iterations. Since the complexity of the traditional K-means clustering algorithm is directly related to the number of iterations, our approach outperforms the existing methods. We believe this method can play a significant role in data-driven solutions across various real-world application domains.

Author Contributions

All authors contributed equally to preparing and revising the manuscript.

Data Availability Statement

Not applicable.

Declarations

The authors declare no conflict of interest. The authors follow all relevant ethical rules.

1 It is one of the metrics used by the Oxford COVID-19 Government Response Tracker [ 43 ]. It gives a picture of the strongest measures a country has enforced.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Multi-Objective Unsupervised Feature Selection and Cluster Based on Symbiotic Organism Search


1. Introduction

2. Review of Related Works

2.1. Background of the SOS Algorithm

2.1.1. Mutualism Phase

2.1.2. Commensalism Phase

2.1.3. Parasitism Phase

2.2. Global-Search Unsupervised Feature-Selection Algorithms Based on SOS Methods

2.3. Clustering Algorithms Based on SOS Methods

3. Proposed Method

  • Initialization
  • REPEAT:
  • 1. Mutualism phase.
  • 2. Commensalism phase.
  • 3. Parasitism phase.
  • UNTIL (the termination criterion is met).
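The loop above can be sketched in code. The following is a minimal continuous-optimization SOS after Cheng and Prayogo (2014), not this paper's feature-selection variant; the function name, ecosystem size, and bounds are illustrative choices:

```python
import numpy as np

def sos_minimize(f, bounds, n_org=20, iters=100, seed=0):
    """Minimal Symbiotic Organisms Search: mutualism, commensalism,
    and parasitism phases, repeated until the iteration budget is spent."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    dim = len(lo)
    eco = rng.uniform(lo, hi, size=(n_org, dim))        # ecosystem of organisms
    fit = np.apply_along_axis(f, 1, eco)
    for _ in range(iters):
        for i in range(n_org):
            best = eco[np.argmin(fit)]
            # 1. Mutualism: organisms i and a random partner j both benefit.
            j = rng.choice([x for x in range(n_org) if x != i])
            mutual = (eco[i] + eco[j]) / 2
            bf1, bf2 = rng.integers(1, 3, size=2)       # benefit factors in {1, 2}
            for idx, bf in ((i, bf1), (j, bf2)):
                cand = np.clip(eco[idx] + rng.random(dim) * (best - mutual * bf), lo, hi)
                if f(cand) < fit[idx]:
                    eco[idx], fit[idx] = cand, f(cand)
            # 2. Commensalism: i benefits from j; j is unaffected.
            j = rng.choice([x for x in range(n_org) if x != i])
            cand = np.clip(eco[i] + rng.uniform(-1, 1, dim) * (best - eco[j]), lo, hi)
            if f(cand) < fit[i]:
                eco[i], fit[i] = cand, f(cand)
            # 3. Parasitism: a mutated copy of i tries to replace a random j.
            j = rng.choice([x for x in range(n_org) if x != i])
            parasite = eco[i].copy()
            dims = rng.random(dim) < 0.5
            parasite[dims] = rng.uniform(lo, hi)[dims]
            if f(parasite) < fit[j]:
                eco[j], fit[j] = parasite, f(parasite)
    k = np.argmin(fit)
    return eco[k], fit[k]

# Minimize the 5-dimensional sphere function as a smoke test.
x, fx = sos_minimize(lambda v: float(np.sum(v * v)),
                     (np.full(5, -5.0), np.full(5, 5.0)))
```

Binary or multi-objective variants, as used for feature selection in this article, replace the continuous update with a transfer function and the single fitness with a Pareto-based cost.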

3.1. Mutualism Phase

3.2. Commensalism Phase

3.3. Parasitism Phase

3.4. Development of Initial Features

3.5. Cost of Solutions

4. Experimental Settings

4.1. Parameter Setting of Symbiotic Organisms Search as Unsupervised Feature Selection

4.2. Investigating the Impact of Different SOS Parameters on Cluster SOS

Performance Measurement and Datasets

5. Results and Discussion

5.1. Evaluation of the SOS Cluster Using All Features

Evaluation of SOS-Based Unsupervised Feature Selection with SOS Cluster

5.2. Discussion

6. Conclusions

Author Contributions

Data Availability Statement

Acknowledgments

Conflicts of Interest

  • Oyewole, G.J.; Thopil, G.A. Data clustering: Application and trends. Artif. Intell. Rev. 2022 , 56 , 6439–6475. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Gedam, A.G.; Shikalpure, S.G. Direct kernel method for machine learning with support vector machine. In Proceedings of the 2017 International Conference on Intelligent Computing, Instrumentation and Control Technologies (ICICICT), Kerala, India, 6–7 July 2017; pp. 1772–1775. [ Google Scholar ]
  • da Silva, L.E.B.; Wunsch, D.C. An Information-Theoretic-Cluster Visualization for Self-Organizing Maps. IEEE Trans. Neural Netw. Learn. Syst. 2017 , 29 , 2595–2613. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Sinaga, K.P.; Yang, M.-S. Unsupervised K-Means Clustering Algorithm. IEEE Access 2020 , 8 , 80716–80727. [ Google Scholar ] [ CrossRef ]
  • Wang, P.; Xue, B.; Liang, J.; Zhang, M. Feature clustering-Assisted feature selection with differential evolution. Pattern Recognit. 2023 , 140 , 109523. [ Google Scholar ] [ CrossRef ]
  • Kohavi, R.; John, G.H. Wrappers for feature subset selection. Artif. Intell. 1997 , 97 , 273–324. [ Google Scholar ] [ CrossRef ]
  • Jiao, L.; Liu, Y.; Zou, B. Self-organizing dual clustering considering spatial analysis and hybrid distance measures. Sci. China Earth Sci. 2011 , 54 , 1268–1278. [ Google Scholar ] [ CrossRef ]
  • Chakraborty, B.; Chakraborty, G. Fuzzy Consistency Measure with Particle Swarm Optimization for Feature Selection. In Proceedings of the 2013 IEEE International Conference on Systems, Man and Cybernetics (SMC 2013), Manchester, UK, 13–16 October 2013; pp. 4311–4315. [ Google Scholar ]
  • Li, G.; Li, Y.; Tsai, C.-L. Quantile Correlations and Quantile Autoregressive Modeling. J. Am. Stat. Assoc. 2015 , 110 , 246–261. [ Google Scholar ] [ CrossRef ]
  • Pardo, L. New Developments in Statistical Information Theory Based on Entropy and Divergence Measures. Entropy 2019 , 21 , 391. [ Google Scholar ] [ CrossRef ]
  • Xue, B.; Zhang, M.; Browne, W.N.; Yao, X. A Survey on Evolutionary Computation Approaches to Feature Selection. IEEE Trans. Evol. Comput. 2015 , 20 , 606–626. [ Google Scholar ] [ CrossRef ]
  • Liu, Q.; Chen, C.; Zhang, Y.; Hu, Z. Feature selection for support vector machines with RBF kernel. Artif. Intell. Rev. 2011 , 36 , 99–115. [ Google Scholar ] [ CrossRef ]
  • Rong, M.; Gong, D.; Gao, X. Feature Selection and Its Use in Big Data: Challenges, Methods, and Trends. IEEE Access 2019 , 7 , 19709–19725. [ Google Scholar ] [ CrossRef ]
  • Abualigah, L.M.; Khader, A.T.; Al-Betar, M.A. Unsupervised feature selection technique based on genetic algorithm for improving the Text Clustering. In Proceedings of the 2016 7th International Conference on Computer Science and Information Technology (CSIT), Amman, Jordan, 13–14 July 2016; pp. 1–6. [ Google Scholar ]
  • Shamsinejadbabki, P.; Saraee, M. A new unsupervised feature selection method for text clustering based on genetic algorithms. J. Intell. Inf. Syst. 2011 , 38 , 669–684. [ Google Scholar ] [ CrossRef ]
  • Bennaceur, H.; Almutairy, M.; Alhussain, N. Genetic Algorithm Combined with the K-Means Algorithm: A Hybrid Technique for Unsupervised Feature Selection. Intell. Autom. Soft Comput. 2023 , 37 , 2687–2706. [ Google Scholar ] [ CrossRef ]
  • Zhang, Y.; Wang, S.; Ji, G. A Comprehensive Survey on Particle Swarm Optimization Algorithm and Its Applications. Math. Probl. Eng. 2015 , 2015 , 931256. [ Google Scholar ] [ CrossRef ]
  • Shami, T.M.; El-Saleh, A.A.; Alswaitti, M.; Al-Tashi, Q.; Summakieh, M.A.; Mirjalili, S. Particle Swarm Optimization: A Comprehensive Survey. IEEE Access 2022 , 10 , 10031–10061. [ Google Scholar ] [ CrossRef ]
  • Lalwani, S.; Sharma, H.; Satapathy, S.C.; Deep, K.; Bansal, J.C. A Survey on Parallel Particle Swarm Optimization Algorithms. Arab. J. Sci. Eng. 2019 , 44 , 2899–2923. [ Google Scholar ] [ CrossRef ]
  • Han, C.; Zhou, G.; Zhou, Y. Binary Symbiotic Organism Search Algorithm for Feature Selection and Analysis. IEEE Access 2019 , 7 , 166833–166859. [ Google Scholar ] [ CrossRef ]
  • Mohmmadzadeh, H.; Gharehchopogh, F.S. An efficient binary chaotic symbiotic organisms search algorithm approaches for feature selection problems. J. Supercomput. 2021 , 77 , 9102–9144. [ Google Scholar ] [ CrossRef ]
  • Cheng, M.-Y.; Prayogo, D. Symbiotic Organisms Search: A new metaheuristic optimization algorithm. Comput. Struct. 2014 , 139 , 98–112. [ Google Scholar ] [ CrossRef ]
  • Abdullahi, M.; Ngadi, A.; Dishing, S.I.; Abdulhamid, S.M.; Ahmad, B.I. An efficient symbiotic organisms search algorithm with chaotic optimization strategy for multi-objective task scheduling problems in cloud computing environment. J. Netw. Comput. Appl. 2019 , 133 , 60–74. [ Google Scholar ] [ CrossRef ]
  • Miao, F.; Zhou, Y.; Luo, Q. A modified symbiotic organisms search algorithm for unmanned combat aerial vehicle route planning problem. J. Oper. Res. Soc. 2018 , 70 , 21–52. [ Google Scholar ] [ CrossRef ]
  • Wu, H.; Zhou, Y.; Luo, Q. Hybrid symbiotic organisms search algorithm for solving 0–1 knapsack problem. Int. J. Bio-Inspired Comput. 2018 , 12 , 23–53. [ Google Scholar ] [ CrossRef ]
  • Baysal, Y.A.; Ketenci, S.; Altas, I.H.; Kayikcioglu, T. Multi-objective symbiotic organism search algorithm for optimal feature selection in brain computer interfaces. Expert Syst. Appl. 2020 , 165 , 113907. [ Google Scholar ] [ CrossRef ]
  • Gharehchopogh, F.S.; Shayanfar, H.; Gholizadeh, H. A comprehensive survey on symbiotic organisms search algorithms. Artif. Intell. Rev. 2019 , 53 , 2265–2312. [ Google Scholar ] [ CrossRef ]
  • Ganesh, N.; Shankar, R.; Čep, R.; Chakraborty, S.; Kalita, K. Efficient Feature Selection Using Weighted Superposition Attraction Optimization Algorithm. Appl. Sci. 2023 , 13 , 3223. [ Google Scholar ] [ CrossRef ]
  • Jaffel, Z.; Farah, M. A symbiotic organisms search algorithm for feature selection in satellite image classification. In Proceedings of the 2018 4th International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), Sousse, Tunisia, 21–24 March 2018; pp. 1–5. [ Google Scholar ]
  • Cheng, M.-Y.; Cao, M.-T.; Herianto, J.G. Symbiotic organisms search-optimized deep learning technique for mapping construction cash flow considering complexity of project. Chaos Solitons Fractals 2020 , 138 , 109869. [ Google Scholar ] [ CrossRef ]
  • Mohammadzadeh, H.; Gharehchopogh, F.S. Feature Selection with Binary Symbiotic Organisms Search Algorithm for Email Spam Detection. Int. J. Inf. Technol. Decis. Mak. 2021 , 20 , 469–515. [ Google Scholar ] [ CrossRef ]
  • Cheng, M.-Y.; Kusoemo, D.; Gosno, R.A. Text mining-based construction site accident classification using hybrid supervised machine learning. Autom. Constr. 2020 , 118 , 103265. [ Google Scholar ] [ CrossRef ]
  • Al-Tashi, Q.; Abdulkadir, S.J.; Rais, H.M.; Mirjalili, S.; Alhussian, H. Approaches to Multi-Objective Feature Selection: A Systematic Literature Review. IEEE Access 2020 , 8 , 125076–125096. [ Google Scholar ] [ CrossRef ]
  • Abdollahzadeh, B.; Gharehchopogh, F.S. A multi-objective optimization algorithm for feature selection problems. Eng. Comput. 2021 , 38 (Suppl. S3), 1845–1863. [ Google Scholar ] [ CrossRef ]
  • Zhang, M.; Wang, J.-S.; Liu, Y.; Song, H.-M.; Hou, J.-N.; Wang, Y.-C.; Wang, M. Multi-objective optimization algorithm based on clustering guided binary equilibrium optimizer and NSGA-III to solve high-dimensional feature selection problem. Inf. Sci. 2023 , 648 , 119638. [ Google Scholar ] [ CrossRef ]
  • Al-Tashi, Q.; Abdulkadir, S.J.; Rais, H.M.; Mirjalili, S.; Alhussian, H.; Ragab, M.G.; Alqushaibi, A. Binary Multi-Objective Grey Wolf Optimizer for Feature Selection in Classification. IEEE Access 2020 , 8 , 106247–106263. [ Google Scholar ] [ CrossRef ]
  • Xue, B.; Fu, W.; Zhang, M. Differential evolution (DE) for multi-objective feature selection in classification. In Proceedings of the GECCO’ 14: Genetic and Evolutionary Computation Conference, Dunedin, New Zealand, 15–18 December; pp. 83–84.
  • Vieira, S.M.; Sousa, J.M.C.; Runkler, T.A. Multi-criteria ant feature selection using fuzzy classifiers. In Swarm Intelligence for Multi-objective Problems in Data Mining ; Springer: Berlin/Heidelberg, Germany, 2009; pp. 19–36. [ Google Scholar ] [ CrossRef ]
  • Xue, B.; Zhang, M.; Browne, W.N. Particle swarm optimisation for feature selection in classification: Novel initialisation and updating mechanisms. Appl. Soft Comput. 2014 , 18 , 261–276. [ Google Scholar ] [ CrossRef ]
  • Abdullahi, M.; Ngadi, A.; Dishing, S.I.; Abdulhamid, S.M.; Usman, M.J. A survey of symbiotic organisms search algorithms and applications. Neural Comput. Appl. 2019 , 32 , 547–566. [ Google Scholar ] [ CrossRef ]
  • Ezugwu, A.E.; Adewumi, A.O. Soft sets based symbiotic organisms search algorithm for resource discovery in cloud computing environment. Future Gener. Comput. Syst. 2017 , 76 , 33–50. [ Google Scholar ] [ CrossRef ]
  • Ezugwu, A.E.-S.; Adewumi, A.O. Discrete symbiotic organisms search algorithm for travelling salesman problem. Expert Syst. Appl. 2017 , 87 , 70–78. [ Google Scholar ] [ CrossRef ]
  • Ezugwu, A.E.-S.; Adewumi, A.O.; Frîncu, M.E. Simulated annealing based symbiotic organisms search optimization algorithm for traveling salesman problem. Expert Syst. Appl. 2017 , 77 , 189–210. [ Google Scholar ] [ CrossRef ]
  • Mohammadzadeh, H.; Gharehchopogh, F.S. A multi-agent system based for solving high-dimensional optimization problems: A case study on email spam detection. Int. J. Commun. Syst. 2020 , 34 . [ Google Scholar ] [ CrossRef ]
  • Arora, S.; Anand, P. Binary butterfly optimization approaches for feature selection. Expert Syst. Appl. 2018 , 116 , 147–160. [ Google Scholar ] [ CrossRef ]
  • Du, Z.-G.; Pan, J.-S.; Chu, S.-C.; Chiu, Y.-J. Improved Binary Symbiotic Organism Search Algorithm with Transfer Functions for Feature Selection. IEEE Access 2020 , 8 , 225730–225744. [ Google Scholar ] [ CrossRef ]
  • Miao, F.; Yao, L.; Zhao, X. Symbiotic organisms search algorithm using random walk and adaptive Cauchy mutation on the feature selection of sleep staging. Expert Syst. Appl. 2021 , 176 , 114887. [ Google Scholar ] [ CrossRef ]
  • Kimovski, D.; Ortega, J.; Ortiz, A.; Baños, R. Parallel alternatives for evolutionary multi-objective optimization in unsupervised feature selection. Expert Syst. Appl. 2015 , 42 , 4239–4252. [ Google Scholar ] [ CrossRef ]
  • Liao, T.; Kuo, R. Five discrete symbiotic organisms search algorithms for simultaneous optimization of feature subset and neighborhood size of KNN classification models. Appl. Soft Comput. 2018 , 64 , 581–595. [ Google Scholar ] [ CrossRef ]
  • Apolloni, J.; Leguizamón, G.; Alba, E. Two hybrid wrapper-filter feature selection algorithms applied to high-dimensional microarray experiments. Appl. Soft Comput. 2016 , 38 , 922–932. [ Google Scholar ] [ CrossRef ]
  • Zare-Noghabi, A.; Shabanzadeh, M.; Sangrody, H. Medium-Term Load Forecasting Using Support Vector Regression, Feature Selection, and Symbiotic Organism Search Optimization. In Proceedings of the 2019 IEEE Power & Energy Society General Meeting (PESGM), Atlanta, GA, USA, 4–8 August 2019; pp. 1–5. [ Google Scholar ]
  • Gana, N.N.; Abdulhamid, S.M.; Misra, S.; Garg, L.; Ayeni, F.; Azeta, A. Optimization of Support Vector Machine for Classification of Spyware Using Symbiotic Organism Search for Features Selection. In Lecture Notes in Networks and Systems ; Springer: Cham, Switzerland, 2022. [ Google Scholar ] [ CrossRef ]
  • Zhou, Y.; Wu, H.; Luo, Q.; Abdel-Baset, M. Automatic data clustering using nature-inspired symbiotic organism search algorithm. Knowl.-Based Syst. 2019 , 163 , 546–557. [ Google Scholar ] [ CrossRef ]
  • Yang, C.-L.; Sutrisno, H. A clustering-based symbiotic organisms search algorithm for high-dimensional optimization problems. Appl. Soft Comput. 2020 , 97 , 106722. [ Google Scholar ] [ CrossRef ]
  • Zhang, B.; Sun, L.; Yuan, H.; Lv, J.; Ma, Z. An improved regularized extreme learning machine based on symbiotic organisms search. In Proceedings of the 2016 IEEE 11th Conference on Industrial Electronics and Applications (ICIEA), Hefei, China, 5–7 June 2016; pp. 1645–1648. [ Google Scholar ]
  • Ikotun, A.M.; Ezugwu, A.E. Boosting k-means clustering with symbiotic organisms search for automatic clustering problems. PLoS ONE 2022 , 17 , e0272861. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Acharya, D.S.; Mishra, S.K. A multi-agent based symbiotic organisms search algorithm for tuning fractional order PID controller. Measurement 2020 , 155 , 107559. [ Google Scholar ] [ CrossRef ]
  • Rajah, V.; Ezugwu, A.E. Hybrid Symbiotic Organism Search algorithms for Automatic Data Clustering. In Proceedings of the 2020 Conference on Information Communications Technology and Society (ICTAS), Durban, South Africa, 11–12 March 2020; pp. 1–9. [ Google Scholar ]
  • Chakraborty, S.; Nama, S.; Saha, A.K. An improved symbiotic organisms search algorithm for higher dimensional optimization problems. Knowl.-Based Syst. 2021 , 236 , 107779. [ Google Scholar ] [ CrossRef ]
  • Sherin, B.M.; Supriya, M.H. SOS based selection and parameter optimization for underwater target classification. In Proceedings of the OCEANS 2016 MTS/IEEE Monterey, Monterey, CA, USA, 19–23 September 2016; pp. 1–4. [ Google Scholar ]
  • Bsoul, Q.; Salam, R.A.; Atwan, J.; Jawarneh, M. Arabic Text Clustering Methods and Suggested Solutions for Theme-Based Quran Clustering: Analysis of Literature. J. Inf. Sci. Theory Pract. 2021 , 9 , 15–34. [ Google Scholar ] [ CrossRef ]
  • Mehdi, S.; Smith, Z.; Herron, L.; Zou, Z.; Tiwary, P. Enhanced Sampling with Machine Learning. Annu. Rev. Phys. Chem. 2024 , 75 , 347–370. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Larsen, B.; Aone, C. Fast and effective text mining using linear-time document clustering. In Proceedings of the KDD99: The First Annual International Conference on Knowledge Discovery in Data, San Diego, CA, USA, 15–18 August 1999; pp. 16–22. [ Google Scholar ]
  • Sanderson, M. Test Collection Based Evaluation of Information Retrieval Systems. Found. Trends Inf. Retr. 2010 , 4 , 247–375. [ Google Scholar ] [ CrossRef ]
  • Mohd, M.; Crestani, F.; Ruthven, I. Evaluation of an interactive topic detection and tracking interface. J. Inf. Sci. 2012 , 38 , 383–398. [ Google Scholar ] [ CrossRef ]
  • Zobeidi, S.; Naderan, M.; Alavi, S.E. Effective text classification using multi-level fuzzy neural network. In Proceedings of the 2017 5th Iranian Joint Congress on Fuzzy and Intelligent Systems (CFIS), Qazvin, Iran, 7–9 March 2017; pp. 91–96. [ Google Scholar ]
  • Lewis, D.D.; Yang, Y.; Rose, T.G.; Li, F. RCV1: A new benchmark collection for text categorization research. J. Mach. Learn. Res. 2004 , 5 , 361–397. [ Google Scholar ]
S/N | SOS Variation | Feature-Selection Approach | References | Supervised or Unsupervised Learning
1 | Modified SOS | Wrapper-based | [ ] | Unsupervised learning
2 | Hybrid method | Wrapper/coating-based | [ ] | Unsupervised learning
3 | Multi-objective SOS | Wrapper-based | [ ] | Unsupervised learning
4 | Modified SOS | Wrapper-based | [ ] | Unsupervised learning
5 | Improved SOS | Wrapper-based | [ ] | Supervised learning
6 | Modified SOS | Filter-based | [ ] | Unsupervised learning
7 | MOEA | Wrapper-based | [ ] | Unsupervised learning
8 | Five distinct SOS algorithms that combine modified and hybridized techniques | Wrapper-based | [ ] | Supervised learning
9 | Hybrid approach | Filter- and wrapper-based | [ ] | Supervised learning
10 | Hybrid method | | [ ] | Unsupervised learning
11 | Hybrid method | | [ ] | Supervised learning
Authors | Adopted Clustering Approach
[ ] | The number of clusters formed at the beginning is half of the ecosize, which is divided into smaller ecologies; CSOS optimization is then applied.
[ ] | Enhancing the BDI and BIC through the optimization process.
[ ] | The CVI serves as the objective function in the optimization problems of the Davies-Bouldin index and the compact separated index, both of which must be minimized.
[ ] | To optimize the clustering problem, the SOS algorithm randomly initializes the cluster within the ecosystem.
[ ] | The benefit factors are found by a non-linear method, and their weights are used to effectively explore and exploit the search region.
[ ] | The number of initial clusters is determined by the eco-size, i.e., the number of sub-ecosystems that the SOS forms and then optimizes.
Scenario | Value
1 | 8
2 | 16
3 | 24
4 | 32
5 | 40
6 | 48
7 | 56
8 | 64
9 | 72
Scenario | Npop
1 | 2
2 | 4
3 | 6
4 | 8
5 | 16
6 | 18
7 | 20
8 | 24
9 | 28
Document | Source | # of Documents | # of Clusters
DS1 | Classic 3 | 3892 | 3
DS2 | TDT2 and TDT3 of TREC 2001 | 1445 | 53
DS3 | 20 NEWSGROUP | 3831 | 10
DS4 | Reuters | 4195 | 8
Datasets | k-means | HS | SOS | KHS
Classic 3 | 0.929 | 0.909 | |
TREC 2001 | 0.804 | 0.829 | |
Newsgroup | 0.582 | 0.611 | |
Reuters | 0.636 | 0.682 | |
Comparison | DS1 Classic 3 | DS2 TREC 2001 | DS3 Newsgroup | DS4 Reuters
 | Features | F-Measure | Features | F-Measure | Features | F-Measure | Features | F-Measure
k-means | 13,310 | 0.929 | 6737 | 0.804 | 27,211 | 0.582 | 12,152 | 0.636
PSOC | 13,310 | 0.891 | 6737 | 0.841 | 27,211 | 0.606 | 12,152 | 0.688
HSC | 13,310 | 0.909 | 6737 | 0.829 | 27,211 | 0.611 | 12,152 | 0.682
WDOC | 13,310 | | 6737 | | 27,211 | | 12,152 |
KHCluster | 13,310 | | 6737 | | 27,211 | | 12,152 |
Comparison | DS1 Classic 3 | DS2 TREC 2001 | DS3 Newsgroup | DS4 Reuters
 | Features | F-Measure | Features | F-Measure | Features | F-Measure | Features | F-Measure
k-means | | 0.93 | 5573 | 0.824 | 20,854 | 0.602 | 9561 | 0.636
PSOC | 9927 | 0.928 | | 0.847 | | 0.636 | | 0.688
HSC | 10,843 | 0.929 | | 0.831 | | 0.621 | 10,854 | 0.682
WDOC | | | 5834 | | 19,283 | | 6891 |
KHCluster | 10,289 | | 6057 | | 20,851 | | |
Comparison | DS1 Classic 3 | DS2 TREC 2001 | DS3 Newsgroup | DS4 Reuters
 | Features | F-Measure | Features | F-Measure | Features | F-Measure | Features | F-Measure
k-means | | 0.93 | 5573 | 0.824 | 20,854 | 0.602 | 9561 | 0.636
PSOC | 9927 | 0.928 | | 0.847 | 13,824 | 0.636 | | 0.688
HSC | 10,843 | 0.929 | 5128 | 0.831 | | 0.621 | 10,854 | 0.682
WDOC | 9824 | 0.939 | 5834 | 0.83 | 19,283 | 0.64 | 6891 | 0.69
KHCluster | 10,289 | | 6057 | | 20,851 | | 5732 |
SOSFS with SOSC

Algorithms | Ranking
k-means | 10.18
PSOC | 10.15
HSC | 10.01
WDOC | 9.61
SOSC | 9.39
KHCluster | 9.3
Friedman test (p-value) | 0.00
Iman-Davenport (p-value) | 0.00

Share and Cite

AL-Gburi, A.F.J.; Nazri, M.Z.A.; Yaakub, M.R.B.; Alyasseri, Z.A.A. Multi-Objective Unsupervised Feature Selection and Cluster Based on Symbiotic Organism Search. Algorithms 2024, 17, 355. https://doi.org/10.3390/a17080355


A machine learning and clustering-based methodology for the identification of lead users and their needs from online communities


Published in

Pergamon Press, Inc., United States

Author Tags

  • New product development
  • Random forest-based algorithm
  • Clustering technique
  • Online community
  • Research-article



Title: RandomNet: Clustering Time Series Using Untrained Deep Neural Networks

Abstract: Neural networks are widely used in machine learning and data mining. Typically, these networks need to be trained, implying the adjustment of weights (parameters) within the network based on the input data. In this work, we propose a novel approach, RandomNet, that employs untrained deep neural networks to cluster time series. RandomNet uses different sets of random weights to extract diverse representations of time series and then ensembles the clustering relationships derived from these different representations to build the final clustering results. By extracting diverse representations, our model can effectively handle time series with different characteristics. Since all parameters are randomly generated, no training is required during the process. We provide a theoretical analysis of the effectiveness of the method. To validate its performance, we conduct extensive experiments on all of the 128 datasets in the well-known UCR time series archive and perform statistical analysis of the results. These datasets have different sizes, sequence lengths, and they are from diverse fields. The experimental results show that the proposed method is competitive compared with existing state-of-the-art methods.
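The core idea, several untrained random feature extractors whose clusterings are ensembled, can be sketched in a few lines. This is a simplified stand-in, not the paper's architecture: random linear maps with ReLU replace untrained deep networks, and a co-association matrix replaces the paper's ensembling; all names and parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def randomnet_style_cluster(X, k, n_branches=10, dim=16, seed=0):
    """Extract several representations with untrained (random) maps,
    cluster each one, and ensemble the pairwise co-association
    relationships into a final clustering."""
    rng = np.random.default_rng(seed)
    n = len(X)
    co = np.zeros((n, n))
    for _ in range(n_branches):
        W = rng.normal(size=(X.shape[1], dim))   # random, untrained weights
        Z = np.maximum(X @ W, 0)                 # ReLU "feature extractor"
        labels = KMeans(n_clusters=k, n_init=5,
                        random_state=int(rng.integers(1 << 16))).fit_predict(Z)
        co += labels[:, None] == labels[None, :] # co-association votes
    # Final clustering on the averaged co-association matrix.
    return KMeans(n_clusters=k, n_init=5, random_state=seed).fit_predict(co / n_branches)

# Toy example: rows are "time series" of length 32 from two regimes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 32)), rng.normal(3, 1, (20, 32))])
labels = randomnet_style_cluster(X, k=2)
```

No weights are trained at any point; diversity comes entirely from the different random draws of `W`, which is the property the abstract highlights.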
Comments: 25 pages, 10 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)



COMMENTS

  1. A comprehensive survey of clustering algorithms: State-of-the-art

    Clustering is an essential tool in data mining research and applications. It is the subject of active research in many fields of study, such as computer science, data science, statistics, pattern recognition, artificial intelligence, and machine learning.

  2. K-means clustering algorithms: A comprehensive review, variants

    In this paper, a focal research question was proposed to reflect the purpose of this comprehensive review work. Among the variants surveyed, robust K-means algorithms focus on resolving the adverse effects of outliers on K-means clustering, and several research works report on robust K-means clustering.

  3. Clustering algorithms: A comparative approach

    Many real-world systems can be studied in terms of pattern recognition tasks, so that proper use (and understanding) of machine learning methods in practical applications becomes essential. While many classification methods have been proposed, there is no consensus on which methods are more suitable for a given dataset. As a consequence, it is important to comprehensively compare methods in ...

  4. A Comprehensive Survey of Clustering Algorithms

    Data analysis is used as a common method in modern science research, which is across communication science, computer science and biology science. Clustering, as the basic composition of data analysis, plays a significant role. On one hand, many tools for cluster analysis have been created, along with the information increase and subject intersection. On the other hand, each clustering ...

  5. Survey paper A comprehensive survey of clustering algorithms: State-of

    Clustering is an essential tool in data mining research and applications. It is the subject of active research in many fields of study, such as computer science, data science, statistics, pattern recognition, artificial intelligence, and machine learning. Several clustering techniques have been proposed and implemented, and most of them successfully find excellent-quality or optimal clustering results.

  6. [2210.04142] Deep Clustering: A Comprehensive Survey

    Deep Clustering: A Comprehensive Survey. Yazhou Ren, Jingyu Pu, Zhimeng Yang, Jie Xu, Guofeng Li, Xiaorong Pu, Philip S. Yu, Lifang He. View a PDF of the paper titled Deep Clustering: A Comprehensive Survey, by Yazhou Ren and 7 other authors. Cluster analysis plays an indispensable role in machine learning and data mining.

  7. A Taxonomy of Machine Learning Clustering Algorithms, Challenges, and

    In the field of data mining, clustering has shown to be an important technique. Numerous clustering methods have been devised and put into practice, and most of them locate high-quality or optimum clustering outcomes in the field of computer science, data science, statistics, pattern recognition, artificial intelligence, and machine learning. This research provides a modern, thorough review of ...

  8. A Comprehensive Survey on Deep Clustering: Taxonomy, Challenges, and

    Motivated by the tremendous success of deep learning in clustering, one of the most fundamental machine learning tasks, and the large number of recent advances in this direction, in this paper we conduct a comprehensive survey on deep clustering by proposing a new taxonomy of various state-of-the-art approaches across different deep architectures and data types.

  9. [2206.07579] A Comprehensive Survey on Deep Clustering: Taxonomy

    A Comprehensive Survey on Deep Clustering: Taxonomy, Challenges, and Future Directions. Clustering is a fundamental machine learning task which has been widely studied in the literature. Classic clustering methods follow the assumption that data are represented as features in a vectorized form through various representation learning techniques.

  10. MICCF: A Mutual Information Constrained Clustering Framework for

    Deep clustering is a crucial task in machine learning and data mining that focuses on acquiring feature representations conducive to clustering. Previous research relies on self-supervised representation learning for general feature representations, but such features may not be optimally suited for downstream clustering tasks.

  11. Electronics

    The k-means clustering algorithm is considered one of the most powerful and popular data mining algorithms in the research community. However, despite its popularity, the algorithm has certain limitations, including problems associated with random initialization of the centroids which leads to unexpected convergence. Additionally, such a clustering algorithm requires the number of clusters to ...

  12. A Survey of Clustering With Deep Learning: From the Perspective of

    Clustering is a fundamental problem in many data-driven application domains, and clustering performance highly depends on the quality of data representation. Hence, linear or non-linear feature transformations have been extensively used to learn a better data representation for clustering. In recent years, a lot of works focused on using deep neural networks to learn a clustering-friendly ...

  13. PDF Machine Learning-Based Clustering Analysis: Foundational ...

    12.3 Centroid-Based Clustering. Instead of computing distances across observations and then recursively imposing a hierarchy over them, centroid-based clustering aims to partition observations into k groups in such a way that the sum of distances from points to the centroid of their group is minimized. (Fig. 12.4: optimal number of clusters by method.)

  14. (PDF) An overview of clustering methods

    Abstract — Data clustering is the process of identifying natural groupings or clusters within multidimensional data.

  15. Data Clustering: Algorithms and Its Applications

    Data is useless if information or knowledge that can be used for further reasoning cannot be inferred from it. Cluster analysis, based on some criteria, divides data into important or practical categories (clusters) based on shared common characteristics. In research, clustering and classification have been used to analyze data in the fields of machine learning, bioinformatics, and statistics.

  16. Machine Learning: Algorithms, Real-World Applications and Research

    To discuss the applicability of machine learning-based solutions in various real-world application domains. To highlight and summarize the potential research directions within the scope of our study for intelligent data analysis and services. The rest of the paper is organized as follows.

  17. Unsupervised Machine Learning for Clustering in Political and Social

    This Element seeks to fill this gap by offering researchers and instructors an introduction to clustering, which is a prominent class of unsupervised machine learning for exploring, mining, and understanding data. I detail several widely used clustering techniques, and pair each with R code and real data to facilitate interaction with the concepts.

  18. PDF Exploring Clustering Techniques in Machine Learning

    Clustering, a fundamental technique in machine learning, plays a pivotal role in pattern recognition, data mining, and exploratory data analysis. This paper provides a comprehensive exploration of clustering algorithms, evaluation metrics, applications, challenges, and recent advancements in the field.

  19. K-means clustering algorithms: A comprehensive review, variants

    In this paper, the following focal research question was proposed to reflect the purpose of this comprehensive review work: "What are the existing variants of K-means algorithms for solving clustering problems since its inception to date?" In providing answers to the main research question, several sub-research questions were also posed.

  20. A machine learning and clustering-based approach for county ...

    Clustering is an unsupervised machine learning approach to identify clusters of observations within data such that the intra-cluster similarity is high and the inter-cluster similarity is low. Suppose that a data set is represented by a set X = {x1, x2, …, xn}, such that there are n observations and each xi is an observation with m features.

  21. An Improved K-means Clustering Algorithm Towards an Efficient Data

    In machine learning and data science, computational power is one of the main issues, because the computer needs to process a large amount of data; reducing computational cost is therefore a big deal. K-means clustering is a popular unsupervised machine learning algorithm, widely used in different clustering processes.

  22. Research Paper on Cluster Techniques of Data Variations

    Introduction. Cluster analysis divides data into meaningful or useful groups (clusters). If meaningful clusters are our objective, then the resulting clusters should capture the "natural" structure of the data.

  23. (PDF) Analysis of Clustering Algorithms in Machine Learning for

    Abstract. Clustering algorithms are among the most popular data analysis techniques in machine learning for precisely evaluating the vast amount of healthcare data from body sensor networks.

  24. Multi-Objective Unsupervised Feature Selection and Cluster Based on

    Unsupervised learning is a type of machine learning that learns from data without human supervision. Unsupervised feature selection (UFS) is crucial in data analytics, which plays a vital role in enhancing the quality of results and reducing computational complexity in huge feature spaces. The UFS problem has been addressed in several research efforts. Recent studies have witnessed a surge in ...

  25. A review of clustering techniques and developments

    As the learning operation is central to the process of classification (supervised or unsupervised), it is used in this paper interchangeably with the same spirit. Clustering is a very essential component of various data analysis or machine learning based applications like regression, prediction, data mining [21], etc.

  26. A machine learning and clustering-based methodology for the

    In this paper, we present a three-phase methodology that integrates a machine-learning-based algorithm with a sophisticated clustering technique. The purpose of this methodology is to systematically identify lead users and their needs from a complex online community network.

  27. RandomNet: Clustering Time Series Using Untrained Deep Neural Networks

    Neural networks are widely used in machine learning and data mining. Typically, these networks need to be trained, implying the adjustment of weights (parameters) within the network based on the input data. In this work, we propose a novel approach, RandomNet, that employs untrained deep neural networks to cluster time series. RandomNet uses different sets of random weights to extract diverse representations of time series and then ensembles the clustering relationships derived from these different representations to build the final clustering results.

  28. A robust, agnostic molecular biosignature based on machine learning

    (3) Training random forest machine-learning models using three-dimensional chromatographic retention time/mass to charge ratio/intensity data from each sample analysis (SI Appendix and Machine-Learning Methods). In this work, chromatographic retention time is also called scan number, as we measure when a particular feature arises in the analysis.

  29. Novel Machine Learning-based Cluster Analysis Method that Leverages

    A Tokyo Tech study introduced a machine learning-powered clustering model that incorporates both basic features and target properties, successfully grouping over 1,000 inorganic materials. This model provides insights into material relationships and potential applications, and identifies key factors to balance band gaps and dielectric constants.
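Several of the entries above (e.g. items 13 and 20) define a good clustering as one with high intra-cluster similarity and low inter-cluster similarity. A minimal, hypothetical check of that criterion (pure NumPy; the function name is my own) compares the mean pairwise distance within clusters against the mean distance across clusters:

```python
import numpy as np

def intra_inter_distance(X, labels):
    # Mean pairwise Euclidean distance within clusters vs. across clusters.
    # A good partition should give intra << inter.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(X), dtype=bool)
    intra = d[same & off_diag].mean()   # same cluster, excluding self-pairs
    inter = d[~same].mean()             # different clusters
    return intra, inter

# Two well-separated groups: intra-cluster distances are much smaller.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (15, 4)), rng.normal(6, 0.5, (15, 4))])
labels = np.array([0] * 15 + [1] * 15)
intra, inter = intra_inter_distance(X, labels)
```

The same ratio underlies standard internal validity measures such as the silhouette coefficient, which many of the surveys listed above use to compare algorithms.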