Data mining articles within Scientific Reports

Article 17 September 2024 | Open Access

Evaluating segment anything model (SAM) on MRI scans of brain tumors

  • , Fady Alnajjar
  •  &  Rafat Damseh

Article 15 September 2024 | Open Access

Boundary-aware convolutional attention network for liver segmentation in ultrasound images

  • , Fulong Liu
  •  &  Xiao Zhang

Article 12 September 2024 | Open Access

Inferring gene regulatory networks with graph convolutional network based on causal feature reconstruction

  •  &  Xin Quan

Article 11 September 2024 | Open Access

A Bruton tyrosine kinase inhibitor-resistance gene signature predicts prognosis and identifies TRIP13 as a potential therapeutic target in diffuse large B-cell lymphoma

  • Yangyang Ding
  • , Keke Huang
  •  &  Shudao Xiong

Drug repurposing for Parkinson’s disease by biological pathway based edge-weighted network proximity analysis

  • Manyoung Han
  • , Seunghwan Jung
  •  &  Doheon Lee

Article 09 September 2024 | Open Access

Long-term trend prediction of pandemic combining the compartmental and deep learning models

  • Wanghu Chen
  •  &  Jiacheng Chi

Using geotagged facial expressions to visualize and characterize different demographic groups’ emotion in theme parks

  • Xiaoqing Song
  •  &  Qin Su

Article 06 September 2024 | Open Access

A coordinated adaptive multiscale enhanced spatio-temporal fusion network for multi-lead electrocardiogram arrhythmia detection

  • Zicong Yang
  • , Aitong Jin
  •  &  Yan Liu

Article 19 August 2024 | Open Access

The integrated genomic surveillance system of Andalusia (SIEGA) provides a One Health regional resource connected with the clinic

  • Carlos S. Casimiro-Soriguer
  • , Javier Pérez-Florido
  •  &  Joaquin Dopazo

Article 16 August 2024 | Open Access

An updated resource for the detection of protein-coding circRNA with CircProPlus

  • , Yunchang Liu
  •  &  Yundai Chen

Article 12 August 2024 | Open Access

Development, validation and use of custom software for the analysis of pain trajectories

  • M. R. van Ittersum
  • , A. de Zoete
  •  &  P. McCarthy

Comprehensive analysis identifies ubiquitin ligase FBXO42 as a tumor-promoting factor in neuroblastoma

  • Jianwu Zhou
  •  &  Yifei Du

Article 10 August 2024 | Open Access

Aquaporin 1 aggravates lipopolysaccharide-induced macrophage polarization and pyroptosis

  •  &  Abduxukur Ablimit

Article 07 August 2024 | Open Access

Bidirectional Mendelian randomization to explore the causal relationships between the gut microbiota and male reproductive diseases

  • Xiaofang Han
  •  &  Yuanyuan Ji

Article 06 August 2024 | Open Access

A comprehensive single-cell RNA transcriptomic analysis identifies a unique SPP1+ macrophages subgroup in aging skeletal muscle

  • , Mengyue Yang
  •  &  Weiming Guo

Article 01 August 2024 | Open Access

Association of GATA3 expression in triple-positive breast cancer with overall survival and immune cell infiltration

  • Xiuwen Chen
  • , Weilin Zhao
  •  &  Qiong Yi

Article 31 July 2024 | Open Access

Advanced differential evolution for gender-aware English speech emotion recognition

  •  &  Jiulong Zhu

Article 30 July 2024 | Open Access

Identification and immune landscape of sarcopenia-related molecular clusters in inflammatory bowel disease by machine learning and integrated bioinformatics

  • Chongkang Yue
  •  &  Huiping Xue

Article 26 July 2024 | Open Access

A data-centric machine learning approach to improve prediction of glioma grades using low-imbalance TCGA data

  • Raquel Sánchez-Marqués
  • , Vicente García
  •  &  J. Salvador Sánchez

Article 24 July 2024 | Open Access

Development and validation of a predictive model based on β-Klotho for head and neck squamous cell carcinoma

  • XiangXiu Wang
  • , HongWei Liu
  •  &  Ying Cui

Article 15 July 2024 | Open Access

Detecting depression severity using weighted random forest and oxidative stress biomarkers

  • Mariam Bader
  • , Moustafa Abdelwanis
  •  &  Herbert F. Jelinek

Article 06 July 2024 | Open Access

Scanpro is a tool for robust proportion analysis of single-cell resolution data

  • Yousef Alayoubi
  • , Mette Bentsen
  •  &  Mario Looso

Article 02 July 2024 | Open Access

Impact of Bariatric Surgery on metabolic health in a Uruguayan cohort and the emerging predictive role of FSTL1

  • Leonardo Santos
  • , Mariana Patrone
  •  &  Gustavo Bruno

Identification of a novel lactylation-related gene signature predicts the prognosis of multiple myeloma and experiment verification

  • , Wanqiu Zhang
  •  &  Wei Hu

Article 26 June 2024 | Open Access

Single-cell transcriptome profiling highlights the importance of telocyte, kallikrein genes, and alternative splicing in mouse testes aging

  • , Ziyan Zhang
  •  &  Gangcai Xie

Article 25 June 2024 | Open Access

Unifying aspect-based sentiment analysis BERT and multi-layered graph convolutional networks for comprehensive sentiment dissection

  • Kamran Aziz
  • , Donghong Ji
  •  &  Rashid Abbasi

Article 18 June 2024 | Open Access

Expression characteristics of lipid metabolism-related genes and correlative immune infiltration landscape in acute myocardial infarction

  • , Jingyi Luo
  •  &  Xiaorong Hu

Article 17 June 2024 | Open Access

Multi role ChatGPT framework for transforming medical data analysis

  • Haoran Chen
  • , Shengxiao Zhang
  •  &  Xuechun Lu

A tensor decomposition reveals ageing-induced differences in muscle and grip-load force couplings during object lifting

  • , Seyed Saman Saboksayr
  •  &  Ioannis Delis

Article 14 June 2024 | Open Access

Research on coal mine longwall face gas state analysis and safety warning strategy based on multi-sensor forecasting models

  • Haoqian Chang
  • , Xiangrui Meng
  •  &  Zuxiang Hu

PDE1B, a potential biomarker associated with tumor microenvironment and clinical prognostic significance in osteosarcoma

  • Qingzhong Chen
  • , Chunmiao Xing
  •  &  Zhongwei Qian

Article 13 June 2024 | Open Access

A real-world pharmacovigilance study on cardiovascular adverse events of tisagenlecleucel using machine learning approach

  • Juhong Jung
  • , Ju Hwan Kim
  •  &  Ju-Young Shin

Article 12 June 2024 | Open Access

Alteration of circulating ACE2-network related microRNAs in patients with COVID-19

  • Zofia Wicik
  • , Ceren Eyileten
  •  &  Marek Postula

DCRELM: dual correlation reduction network-based extreme learning machine for single-cell RNA-seq data clustering

  • Qingyun Gao
  •  &  Qing Ai

Article 10 June 2024 | Open Access

Multi-cohort analysis reveals immune subtypes and predictive biomarkers in tuberculosis

  •  &  Hong Ding

Article 03 June 2024 | Open Access

Depression recognition using voice-based pre-training model

  • Xiangsheng Huang
  • , Fang Wang
  •  &  Zhenrong Xu

Article 01 June 2024 | Open Access

Mitochondrial RNA modification-based signature to predict prognosis of lower grade glioma: a multi-omics exploration and verification study

  • Xingwang Zhou
  • , Yuanguo Ling
  •  &  Liangzhao Chu

Article 31 May 2024 | Open Access

Decoding intelligence via symmetry and asymmetry

  • Jianjing Fu
  •  &  Ching-an Hsiao

Article 27 May 2024 | Open Access

Research on domain ontology construction based on the content features of online rumors

  • Jianbo Zhao
  • , Huailiang Liu
  •  &  Ruiyu Ding

Exploring the pathways of drug repurposing and Panax ginseng treatment mechanisms in chronic heart failure: a disease module analysis perspective

  • Chengzhi Xie
  • , Ying Zhang
  •  &  Na Lang

Article 22 May 2024 | Open Access

Comprehensive data mining reveals RTK/RAS signaling pathway as a promoter of prostate cancer lineage plasticity through transcription factors and CNV

  • Guanyun Wei
  •  &  Zao Dai

Article 21 May 2024 | Open Access

Anoikis-related gene signatures in colorectal cancer: implications for cell differentiation, immune infiltration, and prognostic prediction

  • Taohui Ding
  • , Zhao Shang
  •  &  Bo Yi

Insights from modelling sixteen years of climatic and fumonisin patterns in maize in South Africa

  • Sefater Gbashi
  • , Oluwasola Abayomi Adelusi
  •  &  Patrick Berka Njobeh

Article 17 May 2024 | Open Access

Identification of cancer risk groups through multi-omics integration using autoencoder and tensor analysis

  • Ali Braytee
  •  &  Ali Anaissi

Article 14 May 2024 | Open Access

Multi-omics integration of scRNA-seq time series data predicts new intervention points for Parkinson’s disease

  • Katarina Mihajlović
  • , Gaia Ceddia
  •  &  Nataša Pržulj

Stellae-123 gene expression signature improved risk stratification in Taiwanese acute myeloid leukemia patients

  • Yu-Hung Wang
  • , Adrián Mosquera Orgueira
  •  &  Hwei-Fang Tien

Article 06 May 2024 | Open Access

Joint extraction of wheat germplasm information entity relationship based on deep character and word fusion

  • Xiaoxiao Jia
  • , Guang Zheng
  •  &  Lei Xi

Article 25 April 2024 | Open Access

Low ACADM expression predicts poor prognosis and suppressive tumor microenvironment in clear cell renal cell carcinoma

  •  &  Huimin Long

Article 19 April 2024 | Open Access

Automatic inference of ICD-10 codes from German ophthalmologic physicians’ letters using natural language processing

  • D. Böhringer
  • , P. Angelova
  •  &  T. Reinhard

Robust identification of interactions between heat-stress responsive genes in the chicken brain using Bayesian networks and augmented expression data

  • E. A. Videla Rodriguez
  • , John B. O. Mitchell
  •  &  V. Anne Smith

research papers for data mining

data mining Recently Published Documents

Distance Based Pattern Driven Mining for Outlier Detection in High Dimensional Big Dataset

Detecting outliers, or anomalies, is one of the vital problems in pattern-driven data mining. Outlier detection identifies objects whose behavior is inconsistent with the rest of the data. It is an important area of data mining with applications such as detecting credit card fraud, detecting hacking, and uncovering criminal activity. Tools are needed to uncover the critical information hidden in extensive data. This paper investigates a novel method for detecting cluster outliers in a multidimensional dataset, capable of identifying both clusters and outliers in datasets that contain noise. The proposed method can detect the groups and outliers left by the clustering process, such as irregular sets of clusters (C) and outliers (O), to improve the results. Applying the algorithm to the dataset improved the results on several parameters. For the comparative analysis, the average accuracy and the average recall are computed. The average accuracy is 74.05% for the existing COID algorithm and 77.21% for the proposed algorithm. The average recall is 81.19% for the existing algorithm and 89.51% for the proposed algorithm, which shows that the proposed method performs better than the existing COID algorithm.

Implementation of Data Mining Technology in Bonded Warehouse Inbound and Outbound Goods Trade

For taxed goods, the actual freight is generally determined by multiplying the freight allocated per kilogram by the actual outgoing weight, based on the outgoing order number on the outgoing bill. Because conventional logistics cannot keep up with the rapid response that e-commerce orders demand, this work discusses the implementation of data mining technology in the inbound and outbound goods trade of bonded warehouses. Specifically, a bonded warehouse decision-making system was developed, comprising a data warehouse, a conceptual model, an online analytical processing system, a human-computer interaction module, and a web data-sharing platform. The statistical query module can be used to run statistics and queries on warehousing operations. After the whole warehousing business process was optimized, obtaining the actual freight takes only 19.1 hours, nearly one third less than before optimization. This study could create a better environment for the development of China's processing trade.

Multi-objective economic load dispatch method based on data mining technology for large coal-fired power plants

User activity classification and domain-wise ranking through social interactions.

Twitter has become prevalent among users across numerous domains, in most countries, and among different age groups. It serves as a real-time micro-blogging service for communication and opinion sharing. Twitter shares its data for research and study purposes through open APIs, which makes it a highly suitable source of data for social media analytics. Interest in applying data mining and machine learning techniques to tweets is growing. The most prominent problem in social media analytics is to identify and rank influencers automatically. This research aims to detect users' topics of interest in social media and to rank users by specific topics, domains, etc. A few hybrid parameters are also identified, based on the post's content, the post's metadata, the user's profile, and the user's network features, to capture different aspects of being influential; these are used in the ranking algorithm. The results indicate that the proposed approach is effective in both the classification and the ranking of individuals in a cluster.

A data mining analysis of COVID-19 cases in states of United States of America

Epidemic diseases can be extremely dangerous because of their harmful effects. They may negatively affect economies, businesses, the environment, people, and the workforce. In this paper, some of the factors related to the COVID-19 pandemic are examined using data mining methodologies and approaches. The analysis uncovered a number of rules and insights, and the performance of the data mining algorithms was evaluated. According to the results, the JRip algorithm had the highest correct classification rate and the lowest root mean squared error (RMSE). Judged by classification rate and RMSE, JRip can be considered an effective method for understanding the factors related to deaths caused by the coronavirus.

Exploring distributed energy generation for sustainable development: A data mining approach

A comprehensive guideline for bengali sentiment annotation.

Sentiment Analysis (SA) is a Natural Language Processing (NLP) and Information Extraction (IE) task that primarily aims to determine whether the feelings a writer expresses are positive or negative by analyzing a large number of documents. SA is also widely studied in the fields of data mining, web mining, text mining, and information retrieval. The fundamental task in sentiment analysis is to classify the polarity of a given piece of content as Positive, Negative, or Neutral. Although extensive research has been conducted in this area of computational linguistics, most of it has been carried out in the context of the English language. Bengali sentiment expression, however, has varying degrees of sentiment labels, which can plausibly differ from those of English. Sentiment assessment for Bengali therefore needs to be developed and carried out properly. In sentiment analysis, the predictive power of an automatic model depends heavily on the quality of dataset annotation. Bengali sentiment annotation is a challenging task because of the diverse structures (syntax) of the language and its different degrees of innate sentiment (i.e., weakly and strongly positive/negative sentiments). In this article, we therefore propose a novel and precise guideline that researchers, linguistic experts, and referees can use to annotate Bengali sentences accurately, with a view to building effective datasets for automatic sentiment prediction.

Capturing Dynamics of Information Diffusion in SNS: A Survey of Methodology and Techniques

Studying information diffusion in SNS (Social Networks Service) has remarkable significance in both academia and industry. Theoretically, it boosts the development of other subjects such as statistics, sociology, and data mining. Practically, diffusion modeling provides fundamental support for many downstream applications (e.g., public opinion monitoring, rumor source identification, and viral marketing). Tremendous efforts have been devoted to understanding and quantifying information diffusion dynamics. This survey investigates and summarizes the notable emerging works in diffusion modeling. We first put forward a unified concept of information diffusion in terms of three components: information, user decision, and social vectors, followed by a detailed introduction to the methodologies for diffusion modeling. We then propose a new taxonomy that adopts a hybrid philosophy (i.e., granularity and techniques), and we compare the elementary diffusion models under this taxonomy in terms of their assumptions, methods, and pros and cons. We further summarize representative diffusion modeling in special scenarios and significant downstream tasks based on these elementary models. Finally, we discuss open issues in the field, following the methodology of diffusion modeling.

The Influence of E-book Teaching on the Motivation and Effectiveness of Learning Law by Using Data Mining Analysis

This paper studies students' motivation to learn law, compares the teaching effectiveness of two teaching methods, e-book teaching and traditional teaching, and analyses the influence of e-book teaching on the effectiveness of learning law by using big data analysis. From the perspective of law student psychology, e-book teaching can attract students' attention, stimulate their interest in learning, deepen their impression of the material, expand their knowledge, and ultimately improve their performance in practical assessment. Because the sample size is small, the results may not be fully representative. The findings nonetheless offer a useful reference for stimulating motivation to learn law and other theoretical disciplines in colleges and universities, and they provide ideas for reforming teaching modes there. The paper uses a decision tree algorithm from data mining for the analysis and identifies, from the students' perspective, the factors that influence law students' learning motivation and effectiveness.

Intelligent Data Mining based Method for Efficient English Teaching and Cultural Analysis

The emergence of online education has helped improve the quality of traditional English teaching considerably. However, it only moves the teaching process from offline to online and does not really change the essence of traditional English teaching. In this work, we study an intelligent English teaching method that aims to further improve the quality of English teaching. Specifically, a random forest is first used to analyze and extract the grammatical and syntactic features of English text. A decision tree based method is then proposed to predict grammar or syntax issues in the text. The evaluation results indicate that the proposed method can effectively improve the accuracy of English grammar and syntax recognition.

Mil Med Res

Data mining in clinical big data: the frequently used databases, steps, and methodological models

1 Department of Clinical Research, The First Affiliated Hospital of Jinan University, Tianhe District, 613 W. Huangpu Avenue, Guangzhou, 510632 Guangdong China

2 School of Public Health, Xi’an Jiaotong University Health Science Center, Xi’an, 710061 Shaanxi China

Yuan-Jie Li

3 Department of Human Anatomy, Histology and Embryology, School of Basic Medical Sciences, Xi’an Jiaotong University Health Science Center, Xi’an, 710061 Shaanxi China

4 Department of Neurology, The First Affiliated Hospital of Jinan University, Tianhe District, 613 W. Huangpu Avenue, Guangzhou, 510632 Guangdong China

Many high-quality studies have emerged from public databases, such as Surveillance, Epidemiology, and End Results (SEER), National Health and Nutrition Examination Survey (NHANES), The Cancer Genome Atlas (TCGA), and Medical Information Mart for Intensive Care (MIMIC); however, these data are often characterized by a high degree of dimensional heterogeneity, timeliness, scarcity, irregularity, and other characteristics, so their value is not fully utilized. Data-mining technology has been a frontier field in medical research, as it performs well in building disease-prediction models, evaluating patient risks, and assisting clinical decision-making. Data mining therefore has unique advantages in clinical big-data research, especially in large-scale public medical databases. This article introduced the main public medical databases and described the steps, tasks, and models of data mining in simple language. Additionally, we described data-mining methods along with their practical applications. The goal of this work was to help clinical researchers gain a clear and intuitive understanding of the application of data-mining technology to clinical big data in order to promote research results that benefit doctors and patients.

With the rapid development of computer software/hardware and internet technology, the amount of data has increased at an amazing speed. “Big data” as an abstract concept currently affects all walks of life [ 1 ], and although its importance has been recognized, its definition varies slightly from field to field. In the field of computer science, big data refers to a dataset that cannot be perceived, acquired, managed, processed, or served within a tolerable time by using traditional IT and software and hardware tools. More generally, big data refers to a dataset that exceeds the scope of the simple databases and data-processing architectures used in the early days of computing; it is characterized by high-volume, high-dimensional data that are rapidly updated, and it represents a phenomenon or feature that has emerged in the digital age. Across the medical industry, various types of medical data are generated at a high speed, and trends indicate that applying big data in the medical field helps improve the quality of medical care and optimizes medical processes and management strategies [ 2 , 3 ]. Currently, this trend is extending from civilian medicine to military medicine. For example, the United States is exploring the potential to use one of its largest healthcare systems (the Military Healthcare System) to provide healthcare to eligible veterans in order to potentially benefit > 9 million eligible personnel [ 4 ]. Another data-management system has been developed to assess the physical and mental health of active-duty personnel, and it is expected to yield significant economic benefits to the military medical system [ 5 ]. However, in medical research, the wide variety of clinical data and the differences between medical concepts across classification standards result in a high degree of dimensional heterogeneity, timeliness, scarcity, and irregularity in existing clinical data [ 6 , 7 ]. Furthermore, new data analysis techniques have yet to be popularized in medical research [ 8 ]. These factors hinder the full realization of the value of existing data, and the intensive exploration of the value of clinical data remains a challenging problem.

Computer scientists have made outstanding contributions to the application of big data and introduced the concept of data mining to solve difficulties associated with such applications. Data mining (also known as knowledge discovery in databases) refers to the process of extracting potentially useful information and knowledge hidden in a large amount of incomplete, noisy, fuzzy, and random practical application data [ 9 ]. Unlike traditional research methods, several data-mining technologies mine information to discover knowledge without clearly stated prior assumptions (i.e., they are applied directly, without a prior research design). The information obtained should be previously unknown, valid, and practical [ 9 ]. Data-mining technology does not aim to replace traditional statistical analysis techniques; it seeks to extend and expand statistical analysis methodologies. From a practical point of view, machine learning (ML) is the main analytical method in data mining: models are trained on data and then used to predict outcomes. Given the rapid progress of data-mining technology and its excellent performance in other industries and fields, it has introduced new opportunities and prospects to clinical big-data research [ 10 ]. Large amounts of high-quality medical data are available to researchers in the form of public databases, which enable more researchers to participate in medical data mining in the hope that the generated results can further guide clinical practice.

This article provided a valuable overview to medical researchers interested in studying the application of data mining on clinical big data. To allow a clearer understanding of the application of data-mining technology on clinical big data, the second part of this paper introduced the concept of public databases and summarized those commonly used in medical research. In the third part of the paper, we offered an overview of data mining, including introducing an appropriate model, tasks, and processes, and summarized the specific methods of data mining. In the fourth and fifth parts of this paper, we introduced data-mining algorithms commonly used in clinical practice along with specific cases in order to help clinical researchers clearly and intuitively understand the application of data-mining technology on clinical big data. Finally, we discussed the advantages and disadvantages of data mining in clinical analysis and offered insight into possible future applications.

Overview of common public medical databases

A public database is a data repository dedicated to housing data related to scientific research on an open platform. Such databases collect and store heterogeneous, multi-dimensional health, medical, and scientific research data in a structured form and are characterized by mass/multi-ownership, complexity, and security. These databases cover a wide range of data, including those related to cancer research, disease burden, nutrition and health, and genetics and the environment. Table 1 summarizes the main public medical databases [ 11 – 26 ]. Researchers can apply for access to data based on the scope of the database and the application procedures required to perform relevant medical research.

Table 1. Overview of main medical public database

| Database | Range | Location | Founded year | Cost | URL | References |
| Surveillance, Epidemiology, and End Results (SEER) | Tumor | USA | 1973 | Partially free | | [ ] |
| Medical Information Mart for Intensive Care (MIMIC) | Intensive medical | USA | 2001 | Free | | [ ] |
| National Health and Nutrition Examination Survey (NHANES) | Children and adults health | USA | 1999 | Free | | [ ] |
| Global Burden of Disease (GBD) | Epidemic trends and burden of disease | Global | 1988 | Free | | [ ] |
| UK Biobank (UKB) | Health-related genetic data and phenotypic data | UK | 2006 | Partially free | | [ ] |
| The Cancer Genome Atlas (TCGA) | Cancer genomics | USA | 2006 | Free | | [ ] |
| Gene Expression Omnibus (GEO) | Sequencing and gene expression | USA | 2000 | Free | | [ ] |
| International Cancer Genome Consortium (ICGC) | Cancer genomics | Global | 2008 | Free | | [ ] |
| China Kadoorie Biobank (CKB) | Chronic diseases | China | 2004 | Partially free | | [ ] |
| Comparative Toxicogenomics Database (CTD) | Environmental chemicals and human health | USA | 2004 | Free | | [ ] |
| Paediatric Intensive Care (PIC) | Paediatric Intensive | China | 2010 | Free | | [ ] |
| Biologic Specimen and Data Repositories Information Coordinating Center (BioLINCC) | Cardiovascular, pulmonary, and hematological | USA | 2009 | Free | | [ ] |
| China Health and Nutrition Survey (CHNS) | Health and nutrition | China | 1989 | Partially free | | [ ] |
| China Health and Retirement Longitudinal Study (CHARLS) | Ageing and health | China | 2011 | Free | | [ ] |
| eICU Collaborative Research Database (eICU-CRD) | Intensive medical | USA | 2018 | Free | | [ ] |
| Health and Retirement Study (HRS) | Aging health and social support | Global | 1992 | Free | | [ ] |

Data mining: an overview

Data mining is a multidisciplinary field at the intersection of database technology, statistics, ML, and pattern recognition that profits from all these disciplines [ 27 ]. Although this approach is not yet widespread in the field of medical research, several studies have demonstrated the promise of data mining in building disease-prediction models, assessing patient risk, and helping physicians make clinical decisions [ 28 – 31 ].

Data-mining models

Data-mining has two kinds of models: descriptive and predictive. Predictive models are used to predict unknown or future values of other variables of interest, whereas descriptive models are often used to find patterns that describe data that can be interpreted by humans [ 32 ].

Data-mining tasks

A model is usually implemented through a task. The goal of description is to generalize patterns of potential associations in the data, so using a descriptive model usually yields a few collections whose members have the same or similar attributes. Prediction mainly refers to estimating the value of a specific attribute from the values of other attributes, and it includes classification and regression [ 33 ].

Data-mining methods

After the data-mining model and task have been defined, the data-mining methods needed to build the approach are chosen according to the discipline involved. The choice of method depends on whether dependent variables (labels) are present in the analysis. Predictions with dependent variables (labels) are generated through supervised learning, which can be performed using linear regression, generalized linear regression, a proportional hazards model (the Cox regression model), a competing risk model, decision trees, the random forest (RF) algorithm, and support vector machines (SVMs). In contrast, unsupervised learning involves no labels; the learning model infers some internal structure of the data. Common unsupervised learning methods include principal component analysis (PCA), association analysis, and clustering analysis.
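The distinction between supervised and unsupervised learning can be illustrated with a minimal Python sketch. It is not taken from the article: the function names `fit_linear_regression` and `two_means_1d` and the toy numbers are assumptions chosen for illustration. The first function uses labels (`ys`) to fit a simple linear regression by ordinary least squares; the second receives no labels and groups one-dimensional values into two clusters with a basic k-means loop.

```python
from statistics import mean


def fit_linear_regression(xs, ys):
    """Supervised: fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must contain at least two distinct values")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def two_means_1d(values, iterations=20):
    """Unsupervised: split unlabeled 1-D values into two clusters (k-means, k = 2)."""
    low, high = min(values), max(values)  # initial centroids (an assumption)
    for _ in range(iterations):
        # Assign each value to the nearer centroid.
        cluster_low = [v for v in values if abs(v - low) <= abs(v - high)]
        cluster_high = [v for v in values if abs(v - low) > abs(v - high)]
        if not cluster_high:  # all values identical, nothing to separate
            break
        # Move each centroid to the mean of its cluster.
        low, high = mean(cluster_low), mean(cluster_high)
    return cluster_low, cluster_high


# Hypothetical toy data, for illustration only.
slope, intercept = fit_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
groups = two_means_1d([1.0, 1.2, 0.9, 5.0, 5.3, 4.8])
```

The regression needs the outcome `ys` to learn from, whereas the clustering function infers the grouping from the values alone, which mirrors the labeled versus unlabeled distinction described above.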

Data-mining algorithms for clinical big data

Data mining based on clinical big data can produce effective and valuable knowledge, which is essential for accurate clinical decision-making and risk assessment [ 34 ]. Data-mining algorithms enable realization of these goals.

Supervised learning

A concept often mentioned in supervised learning is the partitioning of datasets. To prevent overfitting of a model, a dataset can generally be divided into two or three parts: a training set, validation set, and test set. Ripley [ 35 ] defined these parts as a set of examples used for learning and used to fit the parameters (i.e., weights) of the classifier, a set of examples used to tune the parameters (i.e., architecture) of a classifier, and a set of examples used only to assess the performance (generalization) of a fully-specified classifier, respectively. Briefly, the training set is used to train the model or determine the model parameters, the validation set is used to perform model selection, and the test set is used to verify model performance. In practice, data are generally divided into training and test sets, whereas the validation set is used less often. It should be emphasized that results on the test set do not guarantee model correctness but only show that the model can obtain similar results on similar data. Therefore, the applicability of a model should be analysed in combination with specific problems in the research.

Classical statistical methods, such as linear regression, generalized linear regression, and the proportional hazards model, have been widely used in medical research. Notably, most of these classical statistical methods have certain data requirements or assumptions; however, in the face of complicated clinical data, assumptions about data distribution are difficult to make. In contrast, some ML methods (algorithmic models) make no assumptions about the data and cross-validate the results; thus, they are likely to be favoured by clinical researchers [ 36 ]. For these reasons, this chapter focuses on ML methods that do not require assumptions about data distribution and classical statistical methods that are used in specific situations.
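The three-way partition described above can be sketched in a few lines. This is only an illustration: the function name `split_dataset`, the 60/20/20 proportions, and the fixed seed are assumptions made for the example, not values taken from the review.

```python
import random

def split_dataset(records, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle records, then cut them into training, validation, and test sets."""
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty share for the test set")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]                  # used to fit model parameters
    val = shuffled[n_train:n_train + n_val]     # used for model selection
    test = shuffled[n_train + n_val:]           # used only for the final assessment
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

Setting `val_frac=0` gives the two-way training/test split that the text notes is more common in practice.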

Decision tree

A decision tree is a basic classification and regression method that generates a result similar to the tree structure of a flowchart, where each tree node represents a test on an attribute, each branch represents the output of an attribute, each leaf node (decision node) represents a class or class distribution, and the topmost part of the tree is the root node [ 37 ]. The decision tree model is called a classification tree when used for classification and a regression tree when used for regression. Studies have demonstrated the utility of the decision tree model in clinical applications. In a study on the prognosis of breast cancer patients, a decision tree model and a classical logistic regression model were constructed, respectively, with the predictive performance of the different models indicating that the decision tree model showed stronger predictive power when using real clinical data [ 38 ]. Similarly, the decision tree model has been applied to other areas of clinical medicine, including diagnosis of kidney stones [ 39 ], predicting the risk of sudden cardiac arrest [ 40 ], and exploration of the risk factors of type II diabetes [ 41 ]. A common feature of these studies is the use of a decision tree model to explore the interaction between variables and classify subjects into homogeneous categories based on their observed characteristics. In fact, because the decision tree accounts for the strong interaction between variables, it is more suitable for use with decision algorithms that follow the same structure [ 42 ]. In the construction of clinical prediction models and exploration of disease risk factors and patient prognosis, the decision tree model might offer more advantages and practical application value than some classical algorithms. Although the decision tree has many advantages, it recursively separates observations into branches to construct a tree; therefore, in terms of data imbalance, the precision of decision tree models needs improvement.
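To make the idea of "a test on an attribute" concrete, the sketch below searches for the best binary split of one numeric attribute using Gini impurity, a common splitting criterion for classification trees. The review does not name a specific criterion, so the use of Gini impurity, the function names, and the toy ages and risk labels are all assumptions for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best test 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))          # no split: impurity of the whole node
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue                     # identical values cannot be separated
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = ((lo + hi) / 2, score)
    return best

# Made-up toy data: patient ages and a binary risk label.
ages = [35, 42, 50, 61, 68, 74]
risk = ["low", "low", "low", "high", "high", "high"]
threshold, impurity = best_split(ages, risk)
print(threshold, impurity)
```

A full tree would apply `best_split` recursively to each resulting branch until the leaves are pure enough or too small to split further.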

The RF method

The RF algorithm was developed as an application of an ensemble-learning method based on a collection of decision trees. The bootstrap method [ 43 ] is used to randomly retrieve sample sets from the training set, with decision trees generated by the bootstrap method constituting a “random forest” and predictions based on this derived from an ensemble average or majority vote. The biggest advantage of the RF method is that the random sampling of predictor variables at each decision tree node decreases the correlation among the trees in the forest, thereby improving the precision of ensemble predictions [ 44 ]. Given that a single decision tree model might encounter the problem of overfitting [ 45 ], the initial application of RF minimizes overfitting in classification and regression and improves predictive accuracy [ 44 ]. Taylor et al. [ 46 ] highlighted the potential of RF in correctly differentiating in-hospital mortality in patients experiencing sepsis after admission to the emergency department. Nowhere in the healthcare system is the need more pressing to find methods to reduce uncertainty than in the fast, chaotic environment of the emergency department. The authors demonstrated that the predictive performance of the RF method was superior to that of traditional emergency medicine methods and the methods enabled evaluation of more clinical variables than traditional modelling methods, which subsequently allowed the discovery of clinical variables not expected to be of predictive value or which otherwise would have been omitted as a rare predictor [ 46 ]. Another study based on the Medical Information Mart for Intensive Care (MIMIC) II database [ 47 ] found that RF had excellent predictive power regarding intensive care unit (ICU) mortality [ 48 ]. These studies showed that the application of RF to big data stored in the hospital healthcare system provided a new data-driven method for predictive analysis in critical care. 
Additionally, random survival forests have recently been developed to analyse survival data, especially right-censored survival data [ 49 , 50 ], which can help researchers conduct survival analyses in clinical oncology and help develop personalized treatment regimens that benefit patients [ 51 ].
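The two ingredients described above, bootstrap resampling and a majority vote, can be sketched as follows. This is a deliberately simplified illustration and not the full RF algorithm: each "tree" here is a single-split stump, and one random predictor is drawn per tree rather than at every node. All function names and the toy data are assumptions.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, feature):
    """One-split tree on a single feature, chosen to minimise training errors."""
    values = sorted({r[feature] for r in rows})
    best = None
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [y for r, y in zip(rows, labels) if r[feature] <= thr]
        right = [y for r, y in zip(rows, labels) if r[feature] > thr]
        left_lab, right_lab = majority(left), majority(right)
        errors = sum(y != left_lab for y in left) + sum(y != right_lab for y in right)
        if best is None or errors < best[0]:
            best = (errors, thr, left_lab, right_lab)
    if best is None:                     # bootstrap sample had one distinct value
        lab = majority(labels)
        return (feature, 0.0, lab, lab)
    return (feature,) + best[1:]

def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap: sample with replacement
        feature = rng.randrange(n_features)          # random predictor for this tree
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest

def predict(forest, row):
    votes = [left if row[f] <= thr else right for f, thr, left, right in forest]
    return majority(votes)                           # ensemble majority vote

# Made-up toy data with two predictors and a binary outcome.
rows = [(1, 10), (2, 11), (3, 12), (8, 20), (9, 21), (10, 22)]
labels = [0, 0, 0, 1, 1, 1]
forest = fit_forest(rows, labels)
print(predict(forest, (2, 11)), predict(forest, (9, 21)))
```

Because each tree sees a different resample and a different predictor, the trees disagree on individual samples, and the vote averages those disagreements out.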

The SVM

The SVM is a relatively new classification or prediction method developed by Cortes and Vapnik and represents a data-driven approach that does not require assumptions about data distribution [ 52 ]. The core purpose of an SVM is to identify a separation boundary (called a hyperplane) to help classify cases; thus, the advantages of SVMs are obvious when classifying and predicting cases based on high dimensional data or data with a small sample size [ 53 , 54 ].

In a study of drug compliance in patients with heart failure, researchers used an SVM to build a predictive model for patient compliance in order to overcome the problem of a large number of input variables relative to the number of available observations [ 55 ]. Additionally, the mechanisms of certain chronic and complex diseases observed in clinical practice remain unclear, and many risk factors, including gene–gene interactions and gene-environment interactions, must be considered in the research of such diseases [ 55 , 56 ]. SVMs are capable of addressing these issues. Yu et al. [ 54 ] applied an SVM for predicting diabetes onset based on data from the National Health and Nutrition Examination Survey (NHANES). Furthermore, these models have strong discrimination ability, making SVMs a promising classification approach for detecting individuals with chronic and complex diseases. However, a disadvantage of SVMs is that when the number of observation samples is large, the method becomes time- and resource-intensive, which is often highly inefficient.
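As an illustration of how a separating hyperplane can be found, the sketch below trains a linear SVM by batch sub-gradient descent on the regularized hinge loss. This is one simple way to fit a linear SVM and is not the solver used in the cited studies; the function names, the hyperparameters (`lam`, `lr`, `epochs`), and the toy points are assumptions.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Minimise lam/2 * ||w||^2 + mean hinge loss; labels must be +1 or -1."""
    dim, n = len(points[0]), len(points)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]          # gradient of the regulariser
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:                       # inside the margin or misclassified
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def classify(w, b, x):
    """Side of the hyperplane w.x + b = 0 on which x falls."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Made-up, linearly separable toy data.
points = [(2, 2), (3, 3), (3, 2), (2, 3), (-2, -2), (-3, -3), (-3, -2), (-2, -3)]
labels = [1, 1, 1, 1, -1, -1, -1, -1]
w, b = train_linear_svm(points, labels)
print(classify(w, b, (2.5, 2.5)), classify(w, b, (-2.5, -2.5)))
```

Each epoch visits every observation, which hints at why training cost grows with the number of samples, the drawback noted above.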

Competitive risk model

Kaplan–Meier marginal regression and the Cox proportional hazards model are widely used in survival analysis in clinical studies. Classical survival analysis usually considers only one endpoint, such as the impact of patient survival time. However, in clinical medical research, multiple endpoints usually coexist, and these endpoints compete with one another to generate competitive risk data [ 57 ]. In the case of multiple endpoint events, the use of a single endpoint-analysis method can lead to a biased estimation of the probability of endpoint events due to the existence of competitive risks [ 58 ]. The competitive risk model is a classical statistical model based on the hypothesis of data distribution. Its main advantage is its accurate estimation of the cumulative incidence of outcomes for right-censored survival data with multiple endpoints [ 59 ]. In data analysis, the cumulative risk rate is estimated using the cumulative incidence function in single-factor analysis, and Gray’s test is used for between-group comparisons [ 60 ].
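The cumulative incidence function mentioned above can be estimated nonparametrically: at each event time, the increment for a given cause is the overall event-free survival just before that time multiplied by the fraction of at-risk subjects who experience that cause. The sketch below is a minimal version of this estimator; the function name, the coding of causes (0 for censoring), and the five-subject toy data are assumptions.

```python
def cumulative_incidence(times, causes, cause_of_interest):
    """times: follow-up times; causes: 0 = censored, 1, 2, ... = event type.
    Returns [(time, CIF)] at each distinct time where any event occurs."""
    data = sorted(zip(times, causes))
    at_risk = len(data)
    surv = 1.0        # probability of being free of every event type so far
    cif = 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        tied = [c for tt, c in data[i:] if tt == t]
        d_interest = sum(c == cause_of_interest for c in tied)
        d_any = sum(c != 0 for c in tied)
        if d_any:
            cif += surv * d_interest / at_risk   # increment for the cause of interest
            surv *= 1 - d_any / at_risk          # all causes reduce event-free survival
            curve.append((t, cif))
        at_risk -= len(tied)                     # events and censored leave the risk set
        i += len(tied)
    return curve

# Made-up toy data: cause 1 is the endpoint of interest, cause 2 competes with it.
times = [1, 2, 3, 4, 5]
causes = [1, 2, 0, 1, 2]
cif_1 = cumulative_incidence(times, causes, 1)
cif_2 = cumulative_incidence(times, causes, 2)
print(cif_1[-1][1], cif_2[-1][1])
```

On this toy data the two cause-specific incidences sum to 1 at the last event time. Treating cause 2 as ordinary censoring and taking one minus the Kaplan–Meier estimate would instead give 0.6 for cause 1, which illustrates the overestimation that a single-endpoint analysis can produce.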

Multifactor analysis uses the Fine-Gray and cause-specific (CS) risk models to explore the cumulative risk rate [ 61 ]. The difference between the Fine-Gray and CS models is that the former is applicable to establishing a clinical prediction model and predicting the risk of a single endpoint of interest [ 62 ], whereas the latter is suitable for answering etiological questions, where the regression coefficient reflects the relative effect of covariates on the increased incidence of the main endpoint in the target event-free risk set [ 63 ]. Currently, in databases with CS records, such as Surveillance, Epidemiology, and End Results (SEER), competitive risk models exhibit good performance in exploring disease-risk factors and prognosis [ 64 ]. A study of prognosis in patients with oesophageal cancer from SEER showed that Cox proportional risk models might misestimate the effects of age and disease location on patient prognosis, whereas competitive risk models provide more accurate estimates of factors affecting patient prognosis [ 65 ]. In another study of the prognosis of penile cancer patients, researchers found that using a competitive risk model was more helpful in developing personalized treatment plans [ 66 ].

Unsupervised learning

In many data-analysis processes, the amount of usable identified data is small, and identifying data is a tedious process [ 67 ]. Unsupervised learning is necessary to judge and categorize data according to similarities, characteristics, and correlations and has three main applications: data clustering, association analysis, and dimensionality reduction. Therefore, the unsupervised learning methods introduced in this section include clustering analysis, association rules, and PCA.

Clustering analysis

The classification algorithm needs to “know” information concerning each category in advance, with all of the data to be classified having corresponding categories. When the above conditions cannot be met, cluster analysis can be applied to solve the problem [ 68 ]. Clustering groups similar objects into the same category or subset through the process of static classification. Consequently, objects in the same subset have similar properties. Many kinds of clustering techniques exist. Here, we introduce the four most commonly used clustering techniques.

Partition clustering

The core idea of this clustering method is to regard the centre of the data points as the centre of the cluster. The k-means method [ 69 ] is a representative example of this technique. The k-means method takes n observations and an integer, k , and outputs a partition of the n observations into k sets such that each observation belongs to the cluster with the nearest mean [ 70 ]. The k-means method exhibits low time complexity and high computing efficiency but performs poorly on high dimensional data and cannot identify nonspherical clusters.
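A minimal sketch of the k-means iteration (often called Lloyd's algorithm) is given below: assign each observation to the nearest mean, recompute the means, and repeat until nothing changes. The function signature, the optional `init` argument, and the toy points are assumptions for the example.

```python
import random
from math import dist

def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Partition points into k clusters; returns (centroids, clusters)."""
    if init is not None:
        centroids = list(init)
    else:
        centroids = random.Random(seed).sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                          # assignment step
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[nearest].append(p)
        new_centroids = []
        for c, members in enumerate(clusters):    # update step
            if members:
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centre of an empty cluster
        if new_centroids == centroids:
            break                                 # assignments are stable
        centroids = new_centroids
    return centroids, clusters

# Made-up toy data: two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2, init=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

Each pass costs on the order of n times k distance computations, which is the source of the low time complexity noted above. The reliance on distance to a mean is also why elongated or nested clusters are not recovered.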

Hierarchical clustering

The hierarchical clustering algorithm decomposes a dataset hierarchically to facilitate the subsequent clustering [ 71 ]. Common algorithms for hierarchical clustering include BIRCH [ 72 ], CURE [ 73 ], and ROCK [ 74 ]. The algorithm starts by treating every point as a cluster, with clusters grouped according to closeness. When further combinations result in unexpected results under multiple causes or only one cluster remains, the grouping process ends. This method has wide applicability, and the relationship between clusters is easy to detect; however, the time complexity is high [ 75 ].
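The bottom-up procedure described above, where every point starts as its own cluster and the closest clusters are merged repeatedly, can be sketched as follows. This is plain agglomerative clustering with single linkage, not an implementation of BIRCH, CURE, or ROCK; the stopping rule (a target number of clusters), the function name, and the toy points are assumptions.

```python
from math import dist

def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain (single linkage)."""
    clusters = [[p] for p in points]              # every point starts as a cluster

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(dist(p, q) for p in a for q in b)

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters.pop(j)                  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters

# Made-up toy data: two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
groups = agglomerative(points, n_clusters=2)
print(groups)
```

Every merge re-examines all pairs of clusters, which makes the high time complexity mentioned above easy to see in this naive form.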

Clustering according to density

The density algorithm takes areas presenting a high degree of data density and defines these as belonging to the same cluster [ 76 ]. This method aims to find arbitrarily-shaped clusters, with the most representative algorithm being DBSCAN [ 77 ]. In practice, DBSCAN does not need to input the number of clusters to be partitioned and can handle clusters of various shapes; however, the time complexity of the algorithm is high. Furthermore, when data density is irregular, the quality of the clusters decreases; thus, DBSCAN cannot process high dimensional data [ 75 ].
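The sketch below follows the usual description of DBSCAN: a point with at least `min_pts` neighbours within radius `eps` is a core point, clusters grow outward from core points, and points reachable from no core point are labelled noise. The parameter names, the brute-force neighbour search, and the toy points are assumptions made for the example.

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Return one cluster id per point; -1 marks noise."""
    labels = [None] * len(points)

    def neighbours(i):
        # Brute-force search; each point counts as its own neighbour.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1                 # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nearby = neighbours(j)
            if len(nearby) >= min_pts:     # j is a core point: keep expanding
                queue.extend(nearby)
    return labels

# Made-up toy data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Note that the number of clusters is never supplied; it emerges from `eps` and `min_pts`. The brute-force neighbour search is quadratic in the number of points, which reflects the cost concern raised above.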

Clustering according to a grid

Neither partition nor hierarchical clustering can identify clusters with nonconvex shapes. Although a density-based algorithm can accomplish this task, the time complexity is high. To address this problem, data-mining researchers proposed grid-based algorithms that change the original data space into a grid structure of a certain size. A representative algorithm is STING, which divides the data space into several square cells according to different resolutions and clusters the data of different structure levels [ 78 ]. The main advantage of this method is its high processing speed, which depends only on the number of units in each dimension of the quantized space.

In clinical studies, subjects tend to be actual patients. Although researchers adopt complex inclusion and exclusion criteria before determining the subjects to be included in the analyses, heterogeneity among different patients cannot be avoided [ 79 , 80 ]. The most common application of cluster analysis in clinical big data is in classifying heterogeneous mixed groups into homogeneous groups according to the characteristics of existing data (i.e., “subgroups” of patients or observed objects are identified) [ 81 , 82 ]. This new information can then be used in the future to develop patient-oriented medical-management strategies. Docampo et al. [ 81 ] used hierarchical clustering to reduce heterogeneity and identify subgroups of clinical fibromyalgia, which aided the evaluation and management of fibromyalgia. Additionally, Guo et al. [ 83 ] used k-means clustering to divide patients with essential hypertension into four subgroups, which revealed that the potential risk of coronary heart disease differed between different subgroups. On the other hand, density- and grid-based clustering algorithms have mostly been used to process large numbers of images generated in basic research and clinical practice, with current studies focused on developing new tools to help clinical research and practices based on these technologies [ 84 , 85 ]. Cluster analysis will continue to have extensive application prospects along with the increasing emphasis on personalized treatment.

Association rules

Association rules discover interesting associations and correlations between item sets in large amounts of data. These rules were first proposed by Agrawal et al. [ 86 ] and applied to analyse customer buying habits to help retailers create sales plans. Data mining based on association rules identifies rules in a two-step process: 1) all high frequency item sets in the collection are listed and 2) frequent association rules are generated based on the high frequency item sets [ 87 ]. Therefore, before association rules can be obtained, sets of frequent items must be calculated using certain algorithms. The Apriori algorithm is based on the a priori principle and finds all item sets in the database transactions that satisfy a minimum threshold or other restrictions [ 88 ]. Other algorithms are mostly variants of the Apriori algorithm [ 64 ]. The Apriori algorithm must rescan the entire database at each iteration; therefore, algorithm performance deteriorates as database size increases [ 89 ], making it potentially unsuitable for analysing large databases. The frequent pattern (FP) growth algorithm was proposed to improve efficiency. After the first scan, the FP algorithm compresses the frequency set in the database into an FP tree while retaining the associated information and then mines the conditional libraries separately [ 90 ].

Association-rule technology is often used in medical research to identify association rules between disease risk factors (i.e., exploration of the joint effects of disease risk factors and combinations of other risk factors). For example, Li et al. [ 91 ] used the association-rule algorithm to identify the most important stroke risk factor as atrial fibrillation, followed by diabetes and a family history of stroke. Based on the same principle, association rules can also be used to evaluate treatment effects and other aspects. For example, Guo et al. [ 92 ] used the FP algorithm to generate association rules and evaluate individual characteristics and treatment effects of patients with diabetes, thereby reducing the readmission rate of patients with diabetes. Association rules reveal a connection between premises and conclusions; however, the reasonable and reliable application of information can only be achieved through validation by experienced medical professionals and through extensive causal research [ 92 ].
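The two-step process described above (find frequent item sets, then derive rules from them) can be sketched with a small level-wise Apriori search. The function names, the support and confidence thresholds, and the abstract items "a", "b", and "c" are assumptions chosen for illustration and carry no clinical meaning.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent item set."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # One full pass over the database per candidate item set.
        return sum(itemset <= t for t in transactions) / n

    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        level = {c: support(c) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Join step: unions of frequent (k-1)-item sets that contain exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

def rules(frequent, min_confidence):
    """Return (premise, conclusion, confidence) triples from frequent item sets."""
    found = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for premise in map(frozenset, combinations(itemset, r)):
                confidence = supp / frequent[premise]
                if confidence >= min_confidence:
                    found.append((set(premise), set(itemset - premise), confidence))
    return found

# Made-up toy transactions over abstract items.
transactions = [{"a", "c"}, {"a", "b", "c"}, {"b"}, {"a", "c"}, {"b", "c"}]
frequent = apriori(transactions, min_support=0.4)
strong = rules(frequent, min_confidence=0.9)
print(strong)
```

The repeated passes inside `support` show why performance degrades on large databases, which is the motivation given above for the FP growth algorithm.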

PCA is a widely used data-mining method that aims to reduce data dimensionality in an interpretable way while retaining most of the information present in the data [ 93 , 94 ]. The main purpose of PCA is descriptive, as it requires no assumptions about data distribution and is, therefore, an adaptive and exploratory method. During the process of data analysis, the main steps of PCA include standardization of the original data, calculation of a correlation coefficient matrix, calculation of eigenvalues and eigenvectors, selection of principal components, and calculation of the comprehensive evaluation value. PCA does not often appear as a separate method, as it is often combined with other statistical methods [ 95 ]. In practical clinical studies, the existence of multicollinearity often leads to deviation from multivariate analysis. A feasible solution is to construct a regression model by PCA, which replaces the original independent variables with each principal component as a new independent variable for regression analysis, with this most commonly seen in the analysis of dietary patterns in nutritional epidemiology [ 96 ]. In a study of socioeconomic status and child-developmental delays, PCA was used to derive a new variable (the household wealth index) from a series of household property reports and incorporate this new variable as the main analytical variable into the logistic regression model [ 97 ]. Additionally, PCA can be combined with cluster analysis. Burgel et al. [ 98 ] used PCA to transform clinical data to address the lack of independence between existing variables used to explore the heterogeneity of different subtypes of chronic obstructive pulmonary disease. Therefore, in the study of subtypes and heterogeneity of clinical diseases, PCA can eliminate noisy variables that can potentially corrupt the cluster structure, thereby increasing the accuracy of the results of clustering analysis [ 98 , 99 ].
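The first steps listed above (standardization, correlation matrix, eigenvalues and eigenvectors) can be sketched without external libraries. The example below extracts only the first principal component by power iteration, which is an assumption made to keep the code short; a numerical library would normally compute all components at once. The function names and the three-variable toy data are also assumptions.

```python
from math import sqrt

def standardize(rows):
    """Centre each column and scale it to unit sample standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1)) for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]

def correlation_matrix(z):
    """Correlation matrix of already standardized rows."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)] for i in range(p)]

def first_component(matrix, iters=500):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    p = len(matrix)
    v = [1.0 / (i + 1) for i in range(p)]     # arbitrary non-uniform starting vector
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * matrix[i][j] * v[j] for i in range(p) for j in range(p))
    return eigenvalue, v

# Made-up toy data: the second variable is exactly twice the first.
rows = [(1, 2, 5), (2, 4, 3), (3, 6, 4), (4, 8, 1), (5, 10, 2)]
z = standardize(rows)
corr = correlation_matrix(z)
eigenvalue, loadings = first_component(corr)
scores = [sum(l * v for l, v in zip(loadings, r)) for r in z]   # component scores
print(round(eigenvalue / len(corr), 3))                          # share of variance explained
```

The scores on the leading components are what would replace the original collinear variables in a principal component regression of the kind described above.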

The data-mining process and examples of its application using common public databases

Open-access databases offer large volumes of data, wide data coverage, rich data information, and a cost-efficient means of research, making them beneficial to medical researchers. In this chapter, we introduce the data-mining process and methods and their application in research, based on examples that use public databases and data-mining algorithms.

The data-mining process

Figure  1 shows a series of research concepts. The data-mining process is divided into several steps: (1) database selection according to the research purpose; (2) data extraction and integration, including downloading the required data and combining data from multiple sources; (3) data cleaning and transformation, including removal of incorrect data, filling in missing data, generating new variables, converting data format, and ensuring data consistency; (4) data mining, involving extraction of implicit relational patterns through traditional statistics or ML; (5) pattern evaluation, which focuses on the validity parameters and values of the relationship patterns of the extracted data; and (6) assessment of the results, involving translation of the extracted data-relationship model into comprehensible knowledge made available to the public.

Fig. 1 The steps of data mining in medical public databases

Examples of data-mining applied using public databases

Establishment of warning models for the early prediction of disease.

A previous study identified sepsis as a major cause of death in ICU patients [ 100 ]. The authors noted that the predictive model developed previously used a limited number of variables, and that model performance required improvement. The data-mining process applied to address these issues was as follows: (1) data selection using the MIMIC III database; (2) extraction and integration of three types of data, including multivariate features (demographic information and clinical biochemical indicators), time series data (temperature, blood pressure, and heart rate), and clinical latent features (various scores related to disease); (3) data cleaning and transformation, including fixing irregular time series measurements, estimating missing values, deleting outliers, and addressing data imbalance; (4) data mining through the use of logistic regression, generation of a decision tree, application of the RF algorithm, an SVM, and an ensemble algorithm (a combination of multiple classifiers) to establish the prediction model; (5) pattern evaluation using sensitivity, precision, and the area under the receiver operating characteristic curve to evaluate model performance; and (6) evaluation of the results, in this case the potential to predict the prognosis of patients with sepsis and whether the model outperformed current scoring systems.
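The three evaluation measures named in step (5) can be computed directly from outcomes and model scores. The sketch below is illustrative only: the function names, the 0.5 decision threshold, and the eight outcome/score pairs are made up and are not results from the cited study. The area under the curve is computed through its rank interpretation, the probability that a randomly chosen positive case scores higher than a randomly chosen negative case.

```python
def sensitivity_precision(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); precision = TP / (TP + FP)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tp / (tp + fp)

def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)   # ties count half
    return wins / (len(pos) * len(neg))

# Made-up outcomes (1 = event) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.5, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

sens, prec = sensitivity_precision(y_true, y_pred)
print(sens, prec, auc(y_true, scores))
```

Sensitivity and precision depend on the chosen threshold, whereas the area under the curve summarizes discrimination across all thresholds, which is one reason the measures are reported together.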

Exploring prognostic risk factors in cancer patients

Wu et al. [ 101 ] noted that traditional survival-analysis methods often ignored the influence of competitive risk events, such as suicide and car accident, on outcomes, leading to deviations and misjudgements in estimating the effect of risk factors. They used the SEER database, which offers cause-of-death data for cancer patients, and a competitive risk model to address this problem according to the following process: (1) data were obtained from the SEER database; (2) demography, clinical characteristics, treatment modality, and cause of death of cecum cancer patients were extracted from the database; (3) patient data were deleted when there were no demographic, clinical, therapeutic, or cause-of-death variables; (4) Cox regression and two kinds of competitive risk models were applied for survival analysis; (5) the results were compared between three different models; and (6) the results revealed that for survival data with multiple endpoints, the competitive risk model was more favourable.

Derivation of dietary patterns

A study by Martínez Steele et al. [ 102 ] applied PCA for nutritional epidemiological analysis to determine dietary patterns and evaluate the overall nutritional quality of the population based on those patterns. Their process involved the following: (1) data were extracted from the NHANES database covering the years 2009–2010; (2) demographic characteristics and two 24 h dietary recall interviews were obtained; (3) data were weighted and excluded based on subjects not meeting specific criteria; (4) PCA was used to determine dietary patterns in the United States population, and Gaussian regression and restricted cubic splines were used to assess associations between ultra-processed foods and nutritional balance; (5) eigenvalues, scree plots, and the interpretability of the principal components were reviewed to screen and evaluate the results; and (6) the results revealed a negative association between ultra-processed food intake and overall dietary quality. Their findings indicated that a nutritionally balanced eating pattern was characterized by a diet high in fibre, potassium, magnesium, and vitamin C intake along with low sugar and saturated fat consumption.

The use of “big data” has changed multiple aspects of modern life, with its use combined with data-mining methods capable of improving the status quo [ 86 ]. The aim of this study was to aid clinical researchers in understanding the application of data-mining technology on clinical big data and public medical databases to further their research goals in order to benefit clinicians and patients. The examples provided offer insight into the data-mining process applied for the purposes of clinical research. Notably, researchers have raised concerns that big data and data-mining methods were not a perfect fit for adequately replicating actual clinical conditions, with the results potentially capable of misleading doctors and patients [ 86 ]. Therefore, given the rate at which new technologies and trends progress, it is necessary to maintain a positive attitude concerning their potential impact while remaining cautious in examining the results provided by their application.

In the future, the healthcare system will need to utilize increasingly larger volumes of big data with higher dimensionality. The tasks and objectives of data analysis will also have higher demands, including higher degrees of visualization, results with increased accuracy, and stronger real-time performance. As a result, the methods used to mine and process big data will continue to improve. Furthermore, to increase the formality and standardization of data-mining methods, it is possible that a new programming language specifically for this purpose will need to be developed, as well as novel methods capable of addressing unstructured data, such as graphics, audio, and text represented by handwriting. In terms of application, the development of data-management and disease-screening systems for large-scale populations, such as the military, will help determine the best interventions and formulation of auxiliary standards capable of benefitting both cost-efficiency and personnel. Data-mining technology can also be applied to hospital management in order to improve patient satisfaction, detect medical-insurance fraud and abuse, and reduce costs and losses while improving management efficiency. Currently, this technology is being applied for predicting patient disease, with further improvements resulting in the increased accuracy and speed of these predictions. Moreover, it is worth noting that technological development will concomitantly require higher quality data, which will be a prerequisite for accurate application of the technology.

Finally, the ultimate goal of this study was to explain the methods associated with data mining and commonly used to process clinical big data. This review will potentially promote further study and aid doctors and patients.

Abbreviations

BioLINCC: Biologic Specimen and Data Repositories Information Coordinating Center
CHARLS: China Health and Retirement Longitudinal Study
CHNS: China Health and Nutrition Survey
CKB: China Kadoorie Biobank
CS: Cause-specific risk
CTD: Comparative Toxicogenomics Database
eICU-CRD: eICU Collaborative Research Database
FP: Frequent pattern
GBD: Global burden of disease
GEO: Gene expression omnibus
HRS: Health and Retirement Study
ICGC: International Cancer Genome Consortium
MIMIC: Medical Information Mart for Intensive Care
ML: Machine learning
NHANES: National Health and Nutrition Examination Survey
PCA: Principal component analysis
PIC: Paediatric intensive care
RF: Random forest
SEER: Surveillance, epidemiology, and end results
SVM: Support vector machine
TCGA: The Cancer Genome Atlas
UKB: UK Biobank

Authors’ contributions

WTW, YJL and JL designed the review. JL, AZF, TH, LL and ADX reviewed and criticized the original paper. All authors read and approved the final manuscript.

This study was supported by the National Social Science Foundation of China (No. 16BGL183).

Declarations

Not applicable.

The authors declare that they have no competing interests.

Wen-Tao Wu and Yuan-Jie Li have contributed equally to this work

Contributor Information

Wen-Tao Wu, Email: moc.361@61733808651 .

Yuan-Jie Li, Email: c.ude.utjx@0102jyil .

Ao-Zi Feng, Email: moc.361@71392183851 .

Li Li, Email: ten.haey@1201_ylil .

Tao Huang, Email: moc.361@63oath .

An-Ding Xu, Email: nc.ude.unj@lilt .

Jun Lyu, Email: nc.ude.unj@0202nujuyl .

Review Paper on Data Mining Techniques and Applications

International Journal of Innovative Research in Computer Science & Technology (IJIRCST), Volume-7, Issue-2, March 2019

5 Pages Posted: 2 Mar 2020

GVMGC Sonipat

Date Written: MARCH 31, 2019

Data mining is the process of extracting hidden and useful patterns and information from data. Data mining is a new technology that helps businesses predict future trends and behaviors, allowing them to make proactive, knowledge-driven decisions. The aim of this paper is to show the process of data mining and how it can help decision makers make better decisions. In practice, data mining is useful for any organization that holds a huge amount of data. Data mining helps regular databases perform faster. It also helps increase profit, because of the correct decisions made with its help. This paper shows the various steps performed during the process of data mining and how it can be used by various industries to get better answers from huge amounts of data.

Keywords: Data Mining, Regression, Time Series, Prediction, Association

Anshu (Contact Author)

Streamlining of simple sequence repeat data mining methodologies and pipelines for crop scanning.


1. Introduction

2. SSRs: A Robust Framework for Crop Genetic Markers
2.1. Development of SSR Markers
2.1.1. Genomic Library Construction
2.1.2. In Silico Approaches
2.2. Genomic Resources for SSR Data Mining
2.3. Databases for Genomic Resources
2.4. Preprocessing of Raw Sequences
2.5. Computational Tools and Algorithms for SSR Data Mining
2.6. Algorithmic Approaches
2.7. Library-Based Methods
2.8. Signature-Based Methods
2.9. Ab Initio Approaches
2.9.1. Self-Comparison Approaches
2.9.2. Enumeration of k-mers
2.9.3. Spaced Seed Approaches
2.9.4. Visualization Approaches
2.9.5. Periodicity-Based Approaches
Repeat Masker
TRA and E-TRA
SSR Scanner

3. Primer Designing

4. Pipelines

  • Read2marker
  • MicroPrimers

5. Efficiency of SSR Data Mining Computational Tools

6. Conclusions

Supplementary Materials

Author Contributions

Acknowledgments

Conflicts of Interest

  • The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 2012 , 489 , 57–74. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • de Koning, A.J.; Gu, W.; Castoe, T.A.; Batzer, M.A.; Pollock, D.D. Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet. 2011 , 7 , e1002384. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Liehr, T. Repetitive elements in humans. Int. J. Mol. Sci. 2021 , 22 , 2072. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Thakur, J.; Packiaraj, J.; Henikoff, S. Sequence, chromatin and evolution of satellite DNA. Int. J. Mol. Sci. 2021 , 22 , 4309. [ Google Scholar ] [ CrossRef ]
  • Balzano, E.; Pelliccia, F.; Giunta, S. Genome (in)stability at tandem repeats. Semin. Cell Dev. Biol. 2021 , 113 , 97–112. [ Google Scholar ] [ CrossRef ]
  • Bhargava, A.; Fuentes, F. Mutational dynamics of microsatellites. Mol. Biotechnol. 2010 , 44 , 250–266. [ Google Scholar ] [ CrossRef ]
  • Biscotti, M.A.; Olmo, E.; Heslop-Harrison, J. Repetitive DNA in eukaryotic genomes. Chromosome Res. 2015 , 23 , 415–420. [ Google Scholar ] [ CrossRef ]
  • Gemayel, R.; Vinces, M.D.; Legendre, M.; Verstrepen, K.J. Variable tandem repeats accelerate evolution of coding and regulatory sequences. Annu. Rev. Genet. 2010 , 44 , 445–477. [ Google Scholar ] [ CrossRef ]
  • Lower, S.S.; McGurk, M.P.; Clark, A.G.; Barbash, D.A. Satellite DNA evolution: Old ideas, new approaches. Curr. Opin. Genet. Dev. 2018 , 49 , 70–78. [ Google Scholar ] [ CrossRef ]
  • Pereira, G.; Nunes, E.; Laperuta, L.; Braga, M.; Penha, H.; Diniz, A.; Munhoz, C.; Gazaffi, R.; Garcia, A.A.F.; Vieira, M.L.C. Molecular polymorphism and linkage analysis in sweet passion fruit, an outcrossing species. Ann. Appl. Biol. 2013 , 162 , 347–361. [ Google Scholar ] [ CrossRef ]
  • Varshney, R.K.; Graner, A.; Sorrells, M.E. Genic microsatellite markers in plants: Features and applications. Trends Biotechnol. 2005 , 23 , 48–55. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Zane, L.; Bargelloni, L.; Patarnello, T. Strategies for microsatellite isolation: A review. Mol. Ecol. 2002 , 11 , 347–361. [ Google Scholar ] [ CrossRef ]
  • Techen, N.; Arias, R.S.; Glynn, N.C.; Pan, Z.; Khan, I.A.; Scheffler, B.E. Optimized construction of microsatellite-enriched libraries. Mol. Ecol. Resour. 2010 , 10 , 508–515. [ Google Scholar ] [ CrossRef ]
  • Ellison, C.K.; Shaw, K.L. Mining non-model genomic libraries for microsatellites: BAC versus EST libraries and the generation of allelic richness. BMC Genom. 2010 , 11 , 428. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Hong, C.; Lee, S.; Park, J.; Plaha, P.; Park, Y.; Lee, Y.; Choi, J.; Kim, K.; Lee, J.; Lee, J. Construction of a BAC library of Korean ginseng and initial analysis of BAC-end sequences. Mol. Genet. Genom. 2004 , 271 , 709–716. [ Google Scholar ] [ CrossRef ]
  • Kalita, B.; Roy, A.; Lakshmi, P. In-silico mining and characterization of EST-SSRs for the genetic diversity analysis of lemon. Nelumbo 2022 , 64 , 122–131. [ Google Scholar ] [ CrossRef ]
  • Poornima, K.N.; Shankar, R.; Ramesh, S.; Ravishankar, K.V. De-novo development and validation of EST-SSRs in Moringa oliefera . J. Plant Biochem. Biotechnol. 2023 , 32 , 319–327. [ Google Scholar ] [ CrossRef ]
  • Singh, K.N.; Parveen, S.; Kaushik, P.; Goel, S.; Jagannath, A.; Kumar, K.; Agarwal, M. Identification and validation of in silico mined polymorphic EST-SSR for genetic diversity and cross-species transferability studies in safflower. J. Plant Biochem. Biotechnol. 2022 , 31 , 168–177. [ Google Scholar ] [ CrossRef ]
  • Chandel, G.; Samuel, P.; Dubey, M.; Meena, R. In silico expression analysis of QTL specific candidate genes for grain micronutrient (Fe/Zn) content using ESTs and MPSS signature analysis in rice ( Oryza sativa L.). J. Plant Genet. Transgenics 2011 , 2 , 11–22. [ Google Scholar ]
  • Mehta, G.; Muthusamy, S.K.; Singh, G.; Sharma, P. Identification and development of novel salt-responsive candidate gene based SSRs (cg-SSRs) and MIR gene based SSRs (mir-SSRs) in bread wheat ( Triticum aestivum ). Sci. Rep. 2021 , 11 , 2210. [ Google Scholar ] [ CrossRef ]
  • Molla, K.A.; Azharudheen, T.M.; Ray, S.; Sarkar, S.; Swain, A.; Chakraborti, M.; Vijayan, J.; Singh, O.N.; Baig, M.J.; Mukherjee, A.K. Novel biotic stress responsive candidate gene based SSR (cgSSR) markers from rice. Euphytica 2019 , 215 , 17. [ Google Scholar ] [ CrossRef ]
  • Sharma, P.; Mehta, G.; Shefali; Muthusamy, S.K.; Singh, S.K.; Singh, G.P. Development and validation of heat-responsive candidate gene and miRNA gene based SSR markers to analysis genetic diversity in wheat for heat tolerance breeding. Mol. Biol. Rep. 2021 , 48 , 381–393. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Singh, A.K.; Chaurasia, S.; Kumar, S.; Singh, R.; Kumari, J.; Yadav, M.C.; Singh, N.; Gaba, S.; Jacob, S.R. Identification, analysis and development of salt responsive candidate gene based SSR markers in wheat. BMC Plant Biol. 2018 , 18 , 249. [ Google Scholar ] [ CrossRef ]
  • Varshney, R.K.; Mahendar, T.; Aggarwal, R.K.; Börner, A. Genic molecular markers in plants: Development and applications. In Genomics-Assisted Crop Improvement ; Genomics approaches and platforms; Springer: Dordrecht, The Netherlands, 2007; Volume 1, pp. 13–29. [ Google Scholar ]
  • Zalapa, J.E.; Cuevas, H.; Zhu, H.; Steffan, S.; Senalik, D.; Zeldin, E.; McCown, B.; Harbut, R.; Simon, P. Using next-generation sequencing approaches to isolate simple sequence repeat (SSR) loci in the plant sciences. Am. J. Bot. 2012 , 99 , 193–208. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Castoe, T.A.; Poole, A.W.; De Koning, A.J.; Jones, K.L.; Tomback, D.F.; Oyler-McCance, S.J.; Fike, J.A.; Lance, S.L.; Streicher, J.W.; Smith, E.N. Rapid microsatellite identification from Illumina paired-end genomic sequencing in two birds and a snake. PLoS ONE 2012 , 7 , e30953. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Jennings, T.; Knaus, B.; Mullins, T.; Haig, S.; Cronn, R. Multiplexed microsatellite recovery using massively parallel sequencing. Mol. Ecol. Resour. 2011 , 11 , 1060–1067. [ Google Scholar ] [ CrossRef ]
  • Hon, T.; Mars, K.; Young, G.; Tsai, Y.-C.; Karalius, J.W.; Landolin, J.M.; Maurer, N.; Kudrna, D.; Hardigan, M.A.; Steiner, C.C. Highly accurate long-read HiFi sequencing data for five complex genomes. Sci. Data 2020 , 7 , 399. [ Google Scholar ] [ CrossRef ]
  • Lu, T.-Y.; The Human Genome Structural Variation Consortium; Chaisson, M.J. Profiling variable-number tandem repeat variation across populations using repeat-pangenome graphs. Nat. Commun. 2021 , 12 , 4250. [ Google Scholar ] [ CrossRef ]
  • McCouch, S.R.; Teytelman, L.; Xu, Y.; Lobos, K.B.; Clare, K.; Walton, M.; Fu, B.; Maghirang, R.; Li, Z.; Xing, Y. Development and mapping of 2240 new SSR markers for rice ( Oryza sativa L.). DNA Res. 2002 , 9 , 199–207. [ Google Scholar ] [ CrossRef ]
  • Temnykh, S.; DeClerck, G.; Lukashova, A.; Lipovich, L.; Cartinhour, S.; McCouch, S. Computational and experimental analysis of microsatellites in rice ( Oryza sativa L.): Frequency, length variation, transposon associations, and genetic marker potential. Genome Res. 2001 , 11 , 1441–1452. [ Google Scholar ] [ CrossRef ]
  • Brake, M.; Al-Qadumii, L.; Hamasha, H.; Migdadi, H.; Awad, A.; Haddad, N.; Sadder, M.T. Development of SSR markers linked to stress responsive genes along tomato chromosome 3 ( Solanum lycopersicum L.). BioTech 2022 , 11 , 34. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Geethanjali, S.; Chen, K.-Y.; Pastrana, D.V.; Wang, J.-F. Development and characterization of tomato SSR markers from genomic sequences of anchored BAC clones on chromosome 6. Euphytica 2010 , 173 , 85–97. [ Google Scholar ] [ CrossRef ]
  • Geethanjali, S.; Kadirvel, P.; de la Peña, R.; Rao, E.S.; Wang, J.-F. Development of tomato SSR markers from anchored BAC clones of chromosome 12 and their application for genetic diversity analysis and linkage mapping. Euphytica 2011 , 178 , 283–295. [ Google Scholar ] [ CrossRef ]
  • Feng, C.; Bluhm, B.H.; Correll, J.C. Construction of a spinach bacterial artificial chromosome (BAC) library as a resource for gene identification and marker development. Plant Mol. Biol. Report. 2015 , 33 , 1996–2005. [ Google Scholar ] [ CrossRef ]
  • Meng, Y.; Zheng, C.; Li, H.; Li, A.; Zhai, H.; Wang, Q.; He, S.; Zhao, N.; Zhang, H.; Gao, S. Development of a high-density SSR genetic linkage map in sweet potato. Crop J. 2021 , 9 , 1367–1374. [ Google Scholar ] [ CrossRef ]
  • Jiang, H.; Waseem, M.; Liu, P. Development of simple sequence repeat markers for sugarcane from data mining of expressed sequence tags. Front. Plant Sci. 2023 , 14 , 1199210. [ Google Scholar ] [ CrossRef ]
  • Muoki, R.; Maangi, J.; Korir, R.; Bargul, J.; Kamunya, S. Mining and validation of polymorphic EST-SSR markers for analysing genetic diversity among interspecific hybrids of tea. Int. J. Tea Sci. 2020 , 15 , 40–45. [ Google Scholar ] [ CrossRef ]
  • Das, M.; Sahu, S.P.; Tiwari, A. De novo transcriptome assembly and mining of EST-SSR markers in Gloriosa superba . J. Genet. 2020 , 99 , 77. [ Google Scholar ] [ CrossRef ]
  • Taheri, S.; Abdullah, T.L.; Rafii, M.; Harikrishna, J.A.; Werbrouck, S.P.; Teo, C.H.; Sahebi, M.; Azizi, P. De novo assembly of transcriptomes, mining, and development of novel EST-SSR markers in Curcuma alismatifolia (Zingiberaceae family) through Illumina sequencing. Sci. Rep. 2019 , 9 , 3047. [ Google Scholar ]
  • Han, Z.; Ma, X.; Wei, M.; Zhao, T.; Zhan, R.; Chen, W. SSR marker development and intraspecific genetic divergence exploration of Chrysanthemum indicum based on transcriptome analysis. BMC Genom. 2018 , 19 , 291. [ Google Scholar ] [ CrossRef ]
  • Liu, C.; Zhang, M.; Zhao, X. Development of unigene-derived SSR markers from RNA-seq data of Uraria lagopodioides (Fabaceae) and their application in the genus Uraria Desv. (Fabaceae). BMC Plant Biol. 2023 , 23 , 87. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Divakar, S.; Jha, R.K.; Singh, A. Validation of candidate gene-based EST-SSR markers for sugar yield in sugarcane. Front. Plant Sci. 2023 , 14 , 1273740. [ Google Scholar ] [ CrossRef ]
  • Schumacher, C.; Krannich, C.T.; Maletzki, L.; Köhl, K.; Kopka, J.; Sprenger, H.; Hincha, D.K.; Seddig, S.; Peters, R.; Hamera, S. Unravelling differences in candidate genes for drought tolerance in potato ( Solanum tuberosum L.) by use of new functional microsatellite markers. Genes 2021 , 12 , 494. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Zhou, X.; Dong, Y.; Zhao, J.; Huang, L.; Ren, X.; Chen, Y.; Huang, S.; Liao, B.; Lei, Y.; Yan, L. Genomic survey sequencing for development and validation of single-locus SSR markers in peanut ( Arachis hypogaea L.). BMC Genom. 2016 , 17 , 420. [ Google Scholar ] [ CrossRef ]
  • Li, J.; Zhou, R.; Endo, T.R.; Stein, N. High-throughput development of SSR marker candidates and their chromosomal assignment in rye ( Secale cereale L.). Plant Breed. 2018 , 137 , 561–572. [ Google Scholar ] [ CrossRef ]
  • Patturaj, M.; Munusamy, A.; Kannan, N.; Kandasamy, U.; Ramasamy, Y. Chromosome-specific polymorphic SSR markers in tropical eucalypt species using low coverage whole genome sequences: Systematic characterization and validation. Genom. Inform. 2021 , 19 , e33. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Nashima, K.; Hosaka, F.; Terakami, S.; Kunihisa, M.; Nishitani, C.; Moromizato, C.; Takeuchi, M.; Shoda, M.; Tarora, K.; Urasaki, N. SSR markers developed using next-generation sequencing technology in pineapple, Ananas comosus (L.) Merr. Breed. Sci. 2020 , 70 , 415–421. [ Google Scholar ] [ CrossRef ]
  • Portis, E.; Lanteri, S.; Barchi, L.; Portis, F.; Valente, L.; Toppino, L.; Rotino, G.L.; Acquadro, A. Comprehensive characterization of simple sequence repeats in eggplant ( Solanum melongena L.) genome and construction of a web resource. Front. Plant Sci. 2018 , 9 , 350273. [ Google Scholar ] [ CrossRef ]
  • Varshney, R.K.; Chen, W.; Li, Y.; Bharti, A.K.; Saxena, R.K.; Schlueter, J.A.; Donoghue, M.T.; Azam, S.; Fan, G.; Whaley, A.M. Draft genome sequence of pigeonpea ( Cajanus cajan ), an orphan legume crop of resource-poor farmers. Nat. Biotechnol. 2012 , 30 , 83. [ Google Scholar ] [ CrossRef ]
  • Jabeen, S.; Saif, R.; Haq, R.; Hayat, A.; Naz, S. Whole-genome sequencing and variant discovery of Citrus reticulata “Kinnow” from Pakistan. Funct. Integr. Genom. 2023 , 23 , 227. [ Google Scholar ] [ CrossRef ]
  • Uncu, A.O.; Uncu, A.T. High-throughput simple sequence repeat (SSR) mining saturates the carrot ( Daucus carota L.) genome with chromosome-anchored markers. Biotechnol. Biotechnol. Equip. 2020 , 34 , 1–9. [ Google Scholar ] [ CrossRef ]
  • Zhao, H.; Wang, W.; Yang, Y.; Wang, Z.; Sun, J.; Yuan, K.; Rabbi, S.H.A.; Khanam, M.; Kabir, M.S.; Seraj, Z.I. A high-quality chromosome-level wild rice genome of Oryza coarctata . Sci. Data 2023 , 10 , 701. [ Google Scholar ] [ CrossRef ]
  • Zhao, M.; Shu, G.; Hu, Y.; Cao, G.; Wang, Y. Pattern and variation in simple sequence repeat (SSR) at different genomic regions and its implications to maize evolution and breeding. BMC Genom. 2023 , 24 , 136. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Gaikwad, A.B.; Kumari, R.; Yadav, S.; Rangan, P.; Bhat, K. Small cardamom genome: Development and utilization of microsatellite markers from a draft genome sequence of Elettaria cardamomum Maton. Front. Plant Sci. 2023 , 14 , 1161499. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Kim, K.-R.; Yu, J.-N.; Hong, J.M.; Kim, S.-Y.; Park, S.Y. Genome assembly and microsatellite marker development using Illumina and PacBio Sequencing in the Carex pumila (Cyperaceae) from Korea. Genes 2023 , 14 , 2063. [ Google Scholar ] [ CrossRef ]
  • Caro, R.E.S.; Cagayan, J.; Gardoce, R.R.; Manohar, A.N.C.; Canama-Salinas, A.O.; Rivera, R.L.; Lantican, D.V.; Galvez, H.F.; Reaño, C.E. Mining and validation of novel simple sequence repeat (SSR) markers derived from coconut ( Cocos nucifera L.) genome assembly. J. Genet. Eng. Biotechnol. 2022 , 20 , 71. [ Google Scholar ] [ CrossRef ]
  • Bhattarai, G.; Shi, A.; Kandel, D.R.; Solís-Gracia, N.; Da Silva, J.A.; Avila, C.A. Genome-wide simple sequence repeats (SSR) markers discovered from whole-genome sequence comparisons of multiple spinach accessions. Sci. Rep. 2021 , 11 , 9999. [ Google Scholar ] [ CrossRef ]
  • Sari, D.; Sari, H.; Ikten, C.; Toker, C. Genome-wide discovery of di-nucleotide SSR markers based on whole genome re-sequencing data of Cicer arietinum L. and Cicer reticulatum Ladiz. Sci. Rep. 2023 , 13 , 10351. [ Google Scholar ] [ CrossRef ]
  • Sayers, E.W.; Cavanaugh, M.; Clark, K.; Pruitt, K.D.; Sherry, S.T.; Yankie, L.; Karsch-Mizrachi, I. GenBank 2023 update. Nucleic Acids Res. 2023 , 51 , D141–D144. [ Google Scholar ] [ CrossRef ]
  • Ewing, B.; Green, P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 1998 , 8 , 186–194. [ Google Scholar ] [ CrossRef ]
  • Green, P. Documentation for Phrap and Cross_Match. 1999. Available online: http://bozeman.mbt.washington.edu/phrap.docs/phrap.html (accessed on 24 June 2024).
  • Pearson, W.R.; Lipman, D.J. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 1988 , 85 , 2444–2448. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Chen, Y.; Ye, W.; Zhang, Y.; Xu, Y. High speed BLASTN: An accelerated MegaBLAST search tool. Nucleic Acids Res. 2015 , 43 , 7762–7768. [ Google Scholar ] [ CrossRef ]
  • Seqclean. Available online: https://sourceforge.net/projects/seqclean/ (accessed on 24 June 2024).
  • Hancock, J.M.; Armstrong, J.S. SIMPLE34: An improved and enhanced implementation for VAX and Sun computers of the SIMPLE algorithm for analysis of clustered repetitive motifs in nucleotide sequences. Bioinformatics 1994 , 10 , 67–70. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Morgulis, A.; Gertz, E.M.; Schäffer, A.A.; Agarwala, R. A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J. Comput. Biol. 2006 , 13 , 1028–1040. [ Google Scholar ] [ CrossRef ]
  • Bolger, A.M.; Lohse, M.; Usadel, B. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics 2014 , 30 , 2114–2120. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 2011 , 17 , 10–12. [ Google Scholar ] [ CrossRef ]
  • Andrews, S.; Krueger, F.; Segonds-Pichon, A.; Biggins, L.; Krueger, C.; Wingett, S. FastQC. A Quality Control Tool for High Throughput Sequence Data ; Babraham Bioinformatics: Cambridgeshire, UK, 2010. [ Google Scholar ]
  • Chen, S.; Huang, T.; Zhou, Y.; Han, Y.; Xu, M.; Gu, J. AfterQC: Automatic filtering, trimming, error removing and quality control for fastq data. BMC Bioinform. 2017 , 18 , 80. [ Google Scholar ] [ CrossRef ]
  • Chen, S.; Zhou, Y.; Chen, Y.; Gu, J. fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 2018 , 34 , i884–i890. [ Google Scholar ] [ CrossRef ]
  • Ptitsyn, A.; Hide, W. CLU: A new algorithm for EST clustering. BMC Bioinform. 2005 , 6 , S3. [ Google Scholar ] [ CrossRef ]
  • Lee, Y.; Tsai, J.; Sunkara, S.; Karamycheva, S.; Pertea, G.; Sultana, R.; Antonescu, V.; Chan, A.; Cheung, F.; Quackenbush, J. The TIGR Gene Indices: Clustering and assembling EST and known genes and integration with eukaryotic genomes. Nucleic Acids Res. 2005 , 33 , D71–D74. [ Google Scholar ] [ CrossRef ]
  • Christoffels, A.; Gelder, A.v.; Greyling, G.; Miller, R.; Hide, T.; Hide, W. STACK: Sequence tag alignment and consensus knowledgebase. Nucleic Acids Res. 2001 , 29 , 234–238. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Chou, A.; Burke, J. CRAWview: For viewing splicing variation, gene families, and polymorphism in clusters of ESTs and full-length sequences. Bioinformatics 1999 , 15 , 376–381. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Huang, X.; Madan, A. CAP3: A DNA sequence assembly program. Genome Res. 1999 , 9 , 868–877. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Pertea, G.; Huang, X.; Liang, F.; Antonescu, V.; Sultana, R.; Karamycheva, S.; Lee, Y.; White, J.; Cheung, F.; Parvizi, B. TIGR Gene Indices clustering tools (TGICL): A software system for fast clustering of large EST datasets. Bioinformatics 2003 , 19 , 651–652. [ Google Scholar ] [ CrossRef ]
  • Kim, S.; Lee, J. BAG: A graph theoretic sequence clustering algorithm. Int. J. Data Min. Bioinform. 2006 , 1 , 178–200. [ Google Scholar ] [ CrossRef ]
  • Merkel, A.; Gemmell, N. Detecting short tandem repeats from genome data: Opening the software black box. Brief. Bioinform. 2008 , 9 , 355–366. [ Google Scholar ] [ CrossRef ]
  • Merkel, A.; Gemmell, N.J.; Merkel, A.; Gemmell, N.J. Detecting microsatellites in genome data: Variance in definitions and bioinformatic approaches cause systematic bias. Evol. Bioinform. 2008 , 4 , 1–6. [ Google Scholar ] [ CrossRef ]
  • Lim, K.G.; Kwoh, C.K.; Hsu, L.Y.; Wirawan, A. Review of tandem repeat search tools: A systematic approach to evaluating algorithmic performance. Brief. Bioinform. 2013 , 14 , 67–81. [ Google Scholar ] [ CrossRef ]
  • Bergman, C.M.; Quesneville, H. Discovering and detecting transposable elements in genome sequences. Brief. Bioinform. 2007 , 8 , 382–392. [ Google Scholar ] [ CrossRef ]
  • Saha, S.; Bridges, S.; Magbanua, Z.V.; Peterson, D.G. Computational approaches and tools used in identification of dispersed repetitive DNA sequences. Trop. Plant Biol. 2008 , 1 , 85–96. [ Google Scholar ] [ CrossRef ]
  • Lerat, E. Identifying repeats and transposable elements in sequenced genomes: How to find your way through the dense forest of programs. Heredity 2010 , 104 , 520–533. [ Google Scholar ] [ CrossRef ]
  • Bao, W.; Kojima, K.K.; Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA 2015 , 6 , 11. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Gelfand, Y.; Rodriguez, A.; Benson, G. TRDB—The tandem repeats database. Nucleic Acids Res. 2007 , 35 , D80–D87. [ Google Scholar ] [ CrossRef ]
  • Bao, Z.; Eddy, S.R. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 2002 , 12 , 1269–1276. [ Google Scholar ] [ CrossRef ]
  • Price, A.L.; Jones, N.C.; Pevzner, P.A. De novo identification of repeat families in large genomes. Bioinformatics 2005 , 21 , i351–i358. [ Google Scholar ] [ CrossRef ]
  • Koch, P.; Platzer, M.; Downie, B.R. RepARK—De novo creation of repeat libraries from whole-genome NGS reads. Nucleic Acids Res. 2014 , 42 , e80. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Stein, L.D.; Bao, Z.; Blasiar, D.; Blumenthal, T.; Brent, M.R.; Chen, N.; Chinwalla, A.; Clarke, L.; Clee, C.; Coghlan, A. The genome sequence of Caenorhabditis briggsae : A platform for comparative genomics. PLoS Biol. 2003 , 1 , e45. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Altschul, S.F.; Gish, W.; Miller, W.; Myers, E.W.; Lipman, D.J. Basic local alignment search tool. J. Mol. Biol. 1990 , 215 , 403–410. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Bennett, M.; Leitch, I. Plant genome size research: A field in focus. Ann. Bot. 2005 , 95 , 1–6. [ Google Scholar ] [ CrossRef ]
  • Kurtz, S.; Narechania, A.; Stein, J.C.; Ware, D. A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes. BMC Genom. 2008 , 9 , 517. [ Google Scholar ] [ CrossRef ]
  • Ilie, L.; Ilie, S. Multiple spaced seeds for homology search. Bioinformatics 2007 , 23 , 2969–2977. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Mak, D.; Gelfand, Y.; Benson, G. Indel seeds for homology search. Bioinformatics 2006 , 22 , e341–e349. [ Google Scholar ] [ CrossRef ]
  • Whiteford, N.; Haslam, N.; Weber, G.; Prugel-Bennett, A.; Essex, J.; Neylon, C. Visualising the repeat structure of genomic sequences. Complex Syst. 2008 , 17 , 381–398. [ Google Scholar ]
  • Yoshida, T.; Obata, N.; Oosawa, K. Color-coding reveals tandem repeats in the Escherichia coli genome. J. Mol. Biol. 2000 , 298 , 343–349. [ Google Scholar ] [ CrossRef ]
  • Du, L.; Zhou, H.; Yan, H. OMWSA: Detection of DNA repeats using moving window spectral analysis. Bioinformatics 2007 , 23 , 631–633. [ Google Scholar ] [ CrossRef ]
  • Sharma, D.; Issac, B.; Raghava, G.; Ramaswamy, R. Spectral Repeat Finder (SRF): Identification of repetitive sequences using Fourier transformation. Bioinformatics 2004 , 20 , 1405–1412. [ Google Scholar ] [ CrossRef ]
  • Hauth, A.M.; Joseph, D.A. Beyond tandem repeats: Complex pattern structures and distant regions of similarity. Bioinformatics 2002 , 18 , S31–S37. [ Google Scholar ] [ CrossRef ]
  • Kurtz, S.; Choudhuri, J.V.; Ohlebusch, E.; Schleiermacher, C.; Stoye, J.; Giegerich, R. REPuter: The manifold applications of repeat analysis on a genomic scale. Nucleic Acids Res. 2001 , 29 , 4633–4642. [ Google Scholar ] [ CrossRef ]
  • Abajian, C. Sputnik: DNA Microsatellite Repeat Search Utility. 1994. [ Google Scholar ]
  • La Rota, M.; Kantety, R.V.; Yu, J.-K.; Sorrells, M.E. Nonrandom distribution and frequencies of genomic and EST-derived microsatellite markers in rice, wheat, and barley. BMC Genom. 2005 , 6 , 23. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Smit, A.; Hubley, R.; Green, P. RepeatMasker Open-3.0. 2004. Available online: http://www.repeatmasker.org (accessed on 24 June 2024).
  • Bedell, J.A.; Korf, I.; Gish, W. MaskerAid: A performance enhancement to RepeatMasker. Bioinformatics 2000 , 16 , 1040–1041. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Tarailo-Graovac, M.; Chen, N. Using Repeat Masker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinform. 2009 , 5 , 4.10.11–14.10.14. [ Google Scholar ] [ CrossRef ]
  • Benson, G. Tandem repeats finder: A program to analyze DNA sequences. Nucleic Acids Res. 1999 , 27 , 573–580. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Castelo, A.T.; Martins, W.; Gao, G.R. TROLL—Tandem repeat occurrence locator. Bioinformatics 2002 , 18 , 634–636. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Duran, C.; Appleby, N.; Edwards, D.; Batley, J. Molecular genetic markers: Discovery, applications, data storage and visualisation. Curr. Bioinform. 2009 , 4 , 16–27. [ Google Scholar ] [ CrossRef ]
  • Thiel, T.; Michalek, W.; Varshney, R.; Graner, A. Exploiting EST databases for the development and characterization of gene-derived SSR-markers in barley ( Hordeum vulgare L.). Theor. Appl. Genet. 2003 , 106 , 411–422. [ Google Scholar ] [ CrossRef ]
  • Beier, S.; Thiel, T.; Münch, T.; Scholz, U.; Mascher, M. MISA-web: A web server for microsatellite prediction. Bioinformatics 2017 , 33 , 2583–2585. [ Google Scholar ] [ CrossRef ]
  • Bizzaro, J.W.; Marx, K.A. Poly: A quantitative analysis tool for simple sequence repeat (SSR) tracts in DNA. BMC Bioinform. 2003 , 4 , 22. [ Google Scholar ] [ CrossRef ]
  • Parisi, V.; De Fonzo, V.; Aluffi-Pentini, F. STRING: Finding tandem repeats in DNA sequences. Bioinformatics 2003 , 19 , 1733–1738. [ Google Scholar ] [ CrossRef ]
  • Bilgen, M.; Karaca, M.; Onus, A.N.; Ince, A.G. A software program combining sequence motif searches with keywords for finding repeats containing DNA sequences. Bioinformatics 2004 , 20 , 3379–3386. [ Google Scholar ] [ CrossRef ]
  • Karaca, M.; Bilgen, M.; Onus, A.N.; Ince, A.G.; Elmasulu, S.Y. Exact tandem repeats analyzer (E-TRA): A new program for DNA sequence mining. J. Genet. 2005 , 84 , 49–54. [ Google Scholar ] [ CrossRef ]
  • Wexler, Y.; Yakhini, Z.; Kashi, Y.; Geiger, D. Finding approximate tandem repeats in genomic sequences. J. Comput. Biol. 2005 , 12 , 928–942. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Anwar, T.; Khan, A.U. SSRscanner: A program for reporting distribution and exact location of simple sequence repeats. Bioinformation 2006 , 1 , 89. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Boeva, V.; Regnier, M.; Papatsenko, D.; Makeev, V. Short fuzzy tandem repeats in genomic sequences, identification, and possible role in regulation of gene expression. Bioinformatics 2006 , 22 , 676–684. [ Google Scholar ] [ CrossRef ]
  • Kofler, R.; Schlötterer, C.; Lelley, T. SciRoKo: A new tool for whole genome microsatellite search and investigation. Bioinformatics 2007 , 23 , 1683–1685. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Fonzo, V.D.; Aluffi-Pentini, F.; Parisi, V. JSTRING: A novel Java tandem repeats searcher in genomic sequences with an interactive graphic output. Open Appl. Inform. J. 2008 , 2 , 14–17. [ Google Scholar ] [ CrossRef ]
  • Banerjee, N.; Chidambarathanu, N.; Michael, D.; Balakrishnan, N.; Sekar, K. An algorithm to find all identical internal sequence repeats. Curr. Sci. 2008 , 95 , 188–195. [ Google Scholar ]
  • Senthilkumar, R.; Sabarinathan, R.; Hameed, B.S.; Banerjee, N.; Chidambarathanu, N.; Karthik, R.; Sekar, K. FAIR: A server for internal sequence repeats. Bioinformation 2010 , 4 , 271–275. [ Google Scholar ] [ CrossRef ]
  • Pai, T.-W.; Chen, C.-M.; Hsiao, M.-C.; Cheng, R.; Tzou, W.-S.; Hu, C.-H. An online conserved SSR discovery through cross-species comparison. Adv. Appl. Bioinform. Chem. 2009 , 2 , 23–35. [ Google Scholar ] [ CrossRef ]
  • Jorda, J.; Kajava, A.V. T-REKS: Identification of Tandem REpeats in sequences with a K-meanS based algorithm. Bioinformatics 2009 , 25 , 2632–2638. [ Google Scholar ] [ CrossRef ]
  • Chen, M.; Tan, Z.; Zeng, G. MfSAT: Detect simple sequence repeats in viral genomes. Bioinformation 2011, 6, 171–172.
  • Wang, X.; Lu, P.; Luo, Z. GMATo: A novel tool for the identification and analysis of microsatellites in large genomes. Bioinformation 2013, 9, 541–544.
  • Lopes, R.d.S.; Moraes, W.J.L.; Rodrigues, T.d.S.; Bartholomeu, D.C. ProGeRF: Proteome and genome repeat finder utilizing a fast parallel hash function. BioMed Res. Int. 2015, 394157.
  • Weiner, P. Linear pattern matching algorithms. In Proceedings of the 14th Annual Symposium on Switching and Automata Theory (Swat 1973), Iowa City, IA, USA, 15–17 October 1973; pp. 1–11.
  • Pickett, B.D.; Karlinsey, S.; Penrod, C.; Cormier, M.J.; Ebbert, M.T.; Shiozawa, D.K.; Whipple, C.; Ridge, P.G. SA-SSR: A suffix array-based algorithm for exhaustive and efficient SSR discovery in large genetic sequences. Bioinformatics 2016, 32, 2707–2709.
  • Pickett, B.D.; Miller, J.B.; Ridge, P.G. Kmer-SSR: A fast and exhaustive SSR search algorithm. Bioinformatics 2017, 33, 3922–3928.
  • Avvaru, A.K.; Sowpati, D.T.; Mishra, R.K. PERF: An exhaustive algorithm for ultra-fast and efficient identification of microsatellites from large DNA sequences. Bioinformatics 2018, 34, 943–948.
  • Gou, X.; Ma, J.; Liu, Y. SSRMMD: A rapid and accurate algorithm for mining SSR feature loci and candidate polymorphic SSRs based on assembled sequences. Front. Genet. 2020, 11, 548380.
  • Alves, S.I.A.; Ferreira, V.B.C.; Dantas, C.W.D.; Silva, A.L.d.C.d.; Ramos, R.T.J. EasySSR: A user-friendly web application with full command-line features for large-scale batch microsatellite mining and samples comparison. Front. Genet. 2023, 14, 1228552.
  • Volfovsky, N.; Haas, B.J.; Salzberg, S.L. A clustering method for repeat analysis in DNA sequences. Genome Biol. 2001, 2, RESEARCH0027.
  • Kolpakov, R.; Bana, G.; Kucherov, G. mreps: Efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Res. 2003, 31, 3672–3678.
  • Warburton, P.E.; Giordano, J.; Cheung, F.; Gelfand, Y.; Benson, G. Inverted repeat structure of the human genome: The X-chromosome contains a preponderance of large, highly homologous inverted repeats that contain testes genes. Genome Res. 2004, 14, 1861–1869.
  • Delgrange, O.; Rivals, E. STAR: An algorithm to search for tandem approximate repeats. Bioinformatics 2004, 20, 2812–2820.
  • Krishnan, A.; Tang, F. Exhaustive whole-genome tandem repeats search. Bioinformatics 2004, 20, 2702–2710.
  • Kumpatla, S.P.; Mukhopadhyay, S. Mining and survey of simple sequence repeats in expressed sequence tags of dicotyledonous species. Genome 2005, 48, 985–998.
  • Thurston, M.; Field, D. Msatfinder: Detection and Characterisation of Microsatellites; CEH Oxford: Nottingham, UK, 2006.
  • de Ridder, C.; Kourie, D.G.; Watson, B.W. FireµSat: An algorithm to detect microsatellites in DNA. In Proceedings of the Prague Stringology Conference, Prague, Czech Republic, 28–30 August 2006; pp. 137–150.
  • de Ridder, C.; Kourie, D.G.; Watson, B.W.; Fourie, T.; Reyneke, P. Fine-tuning the search for microsatellites. J. Discret. Algorithms 2013, 20, 21–37.
  • Mayer, C. Phobos, a tandem repeat search tool for complete genomes. Version 2008, 3, 12.
  • Mudunuri, S.B.; Nagarajaram, H.A. IMEx: Imperfect microsatellite extractor. Bioinformatics 2007, 23, 1181–1187.
  • Faircloth, B.C. MSATCOMMANDER: Detection of microsatellite repeat arrays and automated, locus-specific primer design. Mol. Ecol. Resour. 2008, 8, 92–94.
  • Otto, T.D.; Gomes, L.H.; Alves-Ferreira, M.; de Miranda, A.B.; Degrave, W.M. ReRep: Computational detection of repetitive sequences in genome survey sequences (GSS). BMC Bioinform. 2008, 9, 366.
  • da Maia, L.C.; Palmieri, D.A.; de Souza, V.Q.; Kopp, M.M.; de Carvalho, F.I.F.; Costa de Oliveira, A. SSR locator: Tool for simple sequence repeat discovery integrated with primer design and PCR simulation. Int. J. Plant Genom. 2008, 2008, 412696.
  • Abraham, A.-L.; Rocha, E.P.; Pothier, J. Swelfe: A detector of internal repeats in sequences and structures. Bioinformatics 2008, 24, 1536–1537.
  • Pellegrini, M.; Renda, M.; Vecchio, A. TRStalker: An efficient heuristic for finding fuzzy tandem repeats. Bioinformatics 2010, 26, 358–366.
  • Catanese, H.N.; Brayton, K.A.; Gebremedhin, A.H. RepeatAnalyzer: A tool for analysing and managing short-sequence repeat data. BMC Genom. 2016, 17, 165–168.
  • Untergasser, A.; Cutcutache, I.; Koressaar, T.; Ye, J.; Faircloth, B.C.; Remm, M.; Rozen, S.G. Primer3—new capabilities and interfaces. Nucleic Acids Res. 2012, 40, e115.
  • Rychlik, W. OLIGO 7 primer analysis software. PCR primer design. Methods Mol. Biol. 2007, 402, 35–59.
  • You, F.M.; Huo, N.; Gu, Y.Q.; Luo, M.-c.; Ma, Y.; Hane, D.; Lazo, G.R.; Dvorak, J.; Anderson, O.D. BatchPrimer3: A high throughput web application for PCR and sequencing primer design. BMC Bioinform. 2008, 9, 1–13.
  • Ye, J.; Coulouris, G.; Zaretskaya, I.; Cutcutache, I.; Rozen, S.; Madden, T.L. Primer-BLAST: A tool to design target-specific primers for polymerase chain reaction. BMC Bioinform. 2012, 13, 1–11.
  • Kalendar, R.; Lee, D.; Schulman, A.H. FastPCR software for PCR primer and probe design and repeat search. Genes Genomes Genom. 2009, 3, 1–14.
  • Kalendar, R.; Lee, D.; Schulman, A.H. FastPCR software for PCR, in silico PCR, and oligonucleotide assembly and analysis. DNA Cloning Assem. Methods 2014, 271–302.
  • Sreenu, V.B.; Alevoor, V.; Nagaraju, J.; Nagarajaram, H.A. MICdb: Database of prokaryotic microsatellites. Nucleic Acids Res. 2003, 31, 106–108.
  • Sreenu, V.B.; Ranjitkumar, G.; Swaminathan, S.; Priya, S.; Bose, B.; Pavan, M.N.; Thanu, G.; Nagaraju, J.; Nagarajaram, H.A. MICAS: A fully automated web server for microsatellite extraction and analysis from prokaryote and viral genomic sequences. Appl. Bioinform. 2003, 2, 165–168.
  • Robinson, A.J.; Love, C.G.; Batley, J.; Barker, G.; Edwards, D. Simple sequence repeat marker loci discovery using SSR primer. Bioinformatics 2004, 20, 1475–1476.
  • Jewell, E.; Robinson, A.; Savage, D.; Erwin, T.; Love, C.G.; Lim, G.A.; Li, X.; Batley, J.; Spangenberg, G.C.; Edwards, D. SSRPrimer and SSR taxonomy tree: Biome SSR discovery. Nucleic Acids Res. 2006, 34, W656–W659.
  • Fukuoka, H.; Nunome, T.; Minamiyama, Y.; Kono, I.; Namiki, N.; Kojima, A. Read2Marker: A data processing tool for microsatellite marker development from a large data set. Biotechniques 2005, 39, 472–476.
  • Tang, J.; Baldwin, S.J.; Jacobs, J.M.; van der Linden, C.G.; Voorrips, R.E.; Leunissen, J.A.; van Eck, H.; Vosman, B. Large-scale identification of polymorphic microsatellites using an in silico approach. BMC Bioinform. 2008, 9, 374.
  • Martins, W.S.; Lucas, D.C.S.; de Souza Neves, K.F.; Bertioli, D.J. WebSat-A web software for microsatellite marker development. Bioinformation 2009, 3, 282–283.
  • Sarmah, R.; Sahu, J.; Dehury, B.; Sarma, K.; Sahoo, S.; Sahu, M.; Barooah, M.; Sen, P.; Modi, M.K. ESMP: A high-throughput computational pipeline for mining SSR markers from ESTs. Bioinformation 2012, 8, 206–208.
  • Churbanov, A.; Ryan, R.; Hasan, N.; Bailey, D.; Chen, H.; Milligan, B.; Houde, P. HighSSR: High-throughput SSR characterization and locus development from next-gen sequencing data. Bioinformatics 2012, 28, 2797–2803.
  • Meglécz, E.; Costedoat, C.; Dubut, V.; Gilles, A.; Malausa, T.; Pech, N.; Martin, J.-F. QDD: A user-friendly program to select microsatellite markers and design primers from large sequencing projects. Bioinformatics 2010, 26, 403–404.
  • Meglécz, E.; Pech, N.; Gilles, A.; Dubut, V.; Hingamp, P.; Trilles, A.; Grenier, R.; Martin, J.F. QDD version 3.1: A user-friendly computer program for microsatellite selection and primer design revisited: Experimental validation of variables determining genotyping success rate. Mol. Ecol. Resour. 2014, 14, 1302–1313.
  • Wang, X.; Wang, L. GMATA: An integrated software package for genome-scale SSR mining, marker development and viewing. Front. Plant Sci. 2016, 7, 215951.
  • Ponyared, P.; Ponsawat, J.; Tongsima, S.; Seresangtakul, P.; Akkasaeng, C.; Tantisuwichwong, N. ESAP plus: A web-based server for EST-SSR marker development. BMC Genom. 2016, 17, 163–173.
  • Xia, E.-H.; Yao, Q.-Y.; Zhang, H.-B.; Jiang, J.-J.; Zhang, L.-P.; Gao, L.-Z. CandiSSR: An efficient pipeline used for identifying candidate polymorphic SSRs based on multiple assembled sequences. Front. Plant Sci. 2016, 6, 157128.
  • Metz, S.; Cabrera, J.M.; Rueda, E.; Giri, F.; Amavet, P. FullSSR: Microsatellite finder and primer designer. Adv. Bioinform. 2016, 6040124.
  • Pandey, M.; Kumar, R.; Srivastava, P.; Agarwal, S.; Srivastava, S.; Nagpure, N.S.; Jena, J.K.; Kushwaha, B. WGSSAT: A high-throughput computational pipeline for mining and annotation of SSR markers from whole genomes. J. Hered. 2018, 109, 339–343.
  • Guang, X.-M.; Xia, J.-Q.; Lin, J.-Q.; Yu, J.; Wan, Q.-H.; Fang, S.-G. IDSSR: An efficient pipeline for identifying polymorphic microsatellites from a single genome sequence. Int. J. Mol. Sci. 2019, 20, 3497.
  • Alves, F.; Martins, F.M.; Areias, M.; Muñoz-Mérida, A. Automating microsatellite screening and primer design from multi-individual libraries using Micro-Primers. Sci. Rep. 2022, 12, 295.
  • Mokhtar, M.M.; Alsamman, A.M.; El Allali, A. MegaSSR: A web server for large scale microsatellite identification, classification, and marker development. Front. Plant Sci. 2023, 14, 1219055.
  • Leclercq, S.; Rivals, E.; Jarne, P. Detecting microsatellites within genomes: Significant variation among algorithms. BMC Bioinform. 2007, 8, 125.
  • Chen, C.; Chen, C.; Shih, T.; Pai, T.; Hu, C.; Tzou, W. Efficient algorithms for identifying orthologous simple sequence repeats of disease genes. J. Syst. Sci. Complex. 2010, 23, 906–916.
  • Mathur, M. A comparative study of various SSRs identification tools using Aspergillus fumigatus chromosome sequences. J. Bioinform. Comp. Genom. 2020, 3, 1–13.
  • Landau, G.M.; Schmidt, J.P.; Sokol, D. An algorithm for approximate tandem repeats. J. Comput. Biol. 2001, 8, 1–18.
  • TE Hub Consortium; Elliott, T.A.; Heitkam, T.; Hubley, R.; Quesneville, H.; Suh, A.; Wheeler, T.J. TE Hub: A community-oriented space for sharing and connecting tools, data, resources, and methods for transposable element annotation. Mob DNA 2021, 12, 16.
  • Aishwarya, V.; Grover, A.; Sharma, P.C. EuMicroSat db: A database for microsatellites in the sequenced genomes of eukaryotes. BMC Genom. 2007, 8, 225.
  • Aishwarya, V.; Sharma, P.C. UgMicroSat db: Database for mining microsatellites from unigenes. Nucleic Acids Res. 2007, 36, D53–D56.
  • Avvaru, A.K.; Saxena, S.; Sowpati, D.T.; Mishra, R.K. MSDB: A comprehensive database of simple sequence repeats. Genome Biol. Evol. 2017, 9, 1797–1802.
  • Avvaru, A.K.; Sharma, D.; Verma, A.; Mishra, R.K.; Sowpati, D.T. MSDB: A comprehensive, annotated database of microsatellites. Nucleic Acids Res. 2020, 48, D155–D159.
  • Kumar, P.; Chaitanya, P.S.; Nagarajaram, H.A. PSSRdb: A relational database of polymorphic simple sequence repeats extracted from prokaryotic genomes. Nucleic Acids Res. 2010, 39, D601–D605.
  • Mokhtar, M.M.; Atia, M.A.M. SSRome: An integrated database and pipelines for exploring microsatellites in all organisms. Nucleic Acids Res. 2019, 47, D244–D252.
  • Subramanian, S.; Madgula, V.M.; George, R.; Mishra, R.K.; Pandit, M.W.; Kumar, C.S.; Singh, L. MRD: A microsatellite repeats database for prokaryotic and eukaryotic genomes. Genome Biol. 2002, 3, 1–13.
  • Boby, T.; Patch, A.-M.; Aves, S. TRbase: A database relating tandem repeats to disease genes for the human genome. Bioinformatics 2005, 21, 811–816.
  • Chang, Y.-H.; Su, W.-H.; Lee, T.-C.; Sun, H.-F.S.; Chen, C.-H.; Pan, W.-H.; Tsai, S.-F.; Jou, Y.-S. TPMD: A database and resources of microsatellite marker genotyped in Taiwanese populations. Nucleic Acids Res. 2005, 33, D174–D177.
  • Missirlis, P.I.; Mead, C.-L.R.; Butland, S.L.; Ouellette, B.F.; Devon, R.S.; Leavitt, B.R.; Holt, R.A. Satellog: A database for the identification and prioritization of satellite repeats in disease association studies. BMC Bioinform. 2005, 6, 1–14.
  • Subramanian, S.; Madgula, V.M.; George, R.; Kumar, S.; Pandit, M.W.; Singh, L. SSRD: Simple sequence repeats database of the human genome. Comp. Funct. Genom. 2003, 4, 342–345.
  • Sakai, T.; Miura, I.; Yamada-Ishibashi, S.; Wakita, Y.; Kohara, Y.; Yamazaki, Y.; Inoue, T.; Kominami, R.; Moriwaki, K.; Shiroishi, T. Update of mouse microsatellite database of Japan (MMDBJ). Exp. Anim. 2004, 53, 151–154.
  • Archak, S.; Meduri, E.; Kumar, P.S.; Nagaraju, J. InSatDb: A microsatellite database of fully sequenced insect genomes. Nucleic Acids Res. 2007, 35, D36–D39.
  • Prasad, M.; Muthulakshmi, M.; Arunkumar, K.; Madhu, M.; Sreenu, V.B.; Pavithra, V.; Bose, B.; Nagarajaram, H.A.; Mita, K.; Shimada, T. SilkSatDb: A microsatellite database of the silkworm, Bombyx mori. Nucleic Acids Res. 2005, 33, D403–D406.
  • Karaoglu, H.; Lee, C.M.Y.; Meyer, W. Survey of simple sequence repeats in completed fungal genomes. Mol. Biol. Evol. 2005, 22, 639–649.
  • Mudunuri, S.; Appa Rao, A.; Pallamsetty, S.; Mishra, P.; Nagarajaram, H. VMD: Viral Microsatellite Database-A Comprehensive Resource for all Viral Microsatellites. J. Comput. Sci. Syst. Biol. 2009, 2, 283–286.
  • Arora, V.; Kapoor, N.; Fatma, S.; Jaiswal, S.; Iquebal, M.A.; Rai, A.; Kumar, D. BanSatDB, a whole-genome-based database of putative and experimentally validated microsatellite markers of three Musa species. Crop J. 2018, 6, 642–650.
  • Arumugam, V.; Riju, A.; Arunachalam, V. Mining of expressed sequence tag (EST) libraries and core nucleotide sequences for simple sequence repeats (SSR) in papaya. In Proceedings of the II International Symposium on Papaya, Madurai, India, 9–12 December 2008; Volume 851, pp. 197–200.
  • Babu, B.K.; Rani, K.M.; Sahu, S.; Mathur, R.; Kumar, P.N.; Ravichandran, G.; Anitha, P.; Bhagya, H. Development and validation of whole genome-wide and genic microsatellite markers in oil palm (Elaeis guineensis Jacq.): First microsatellite database (OpSatdb). Sci. Rep. 2019, 9, 1899.
  • Blenda, A.; Scheffler, J.; Scheffler, B.; Palmer, M.; Lacape, J.-M.; Yu, J.Z.; Jesudurai, C.; Jung, S.; Muthukumar, S.; Yellambalase, P. CMD: A cotton microsatellite database resource for Gossypium genomics. BMC Genom. 2006, 7, 1–10.
  • Channdrasekar, A.; Rijju, A.; Sathyanath, N.V.; Santhosh, E. SpicEST-An Annotated database on Expressed Sequence tags of spices. Genes Genomes Genom. 2009, 3, 50–53.
  • Duhan, N.; Meshram, M.; Loaiza, C.D.; Kaundal, R. citSATdb: Genome-wide simple sequence repeat (SSR) marker database of Citrus species for germplasm characterization and crop improvement. Genes 2020, 11, 1486.
  • Jayashree, B.; Punna, R.; Prasad, P.; Bantte, K.; Hash, C.T.; Chandra, S.; Hoisington, D.A.; Varshney, R.K. A database of simple sequence repeats from cereal and legume expressed sequence tags mined in silico: Survey and evaluation. Silico Biol. 2006, 6, 607–620.
  • Mueller, L.A.; Solow, T.H.; Taylor, N.; Skwarecki, B.; Buels, R.; Binns, J.; Lin, C.; Wright, M.H.; Ahrens, R.; Wang, Y. The SOL Genomics Network. A comparative resource for Solanaceae biology and beyond. Plant Physiol. 2005, 138, 1310–1317.
  • Portis, E.; Portis, F.; Valente, L.; Moglia, A.; Barchi, L.; Lanteri, S.; Acquadro, A. A genome-wide survey of the microsatellite content of the globe artichoke genome and the development of a web-based database. PLoS ONE 2016, 11, e0162841.
  • Purru, S.; Sahu, S.; Rai, S.; Rao, A.; Bhat, K. GinMicrosatDb: A genome-wide microsatellite markers database for sesame (Sesamum indicum L.). Physiol. Mol. Biol. Plants 2018, 24, 929–937.
  • Shirasawa, K.; Asamizu, E.; Fukuoka, H.; Ohyama, A.; Sato, S.; Nakamura, Y.; Tabata, S.; Sasamoto, S.; Wada, T.; Kishida, Y. An interspecific linkage map of SSR and intronic polymorphism markers in tomato. Theor. Appl. Genet. 2010, 121, 731–739.
  • Song, X.; Yang, Q.; Bai, Y.; Gong, K.; Wu, T.; Yu, T.; Pei, Q.; Duan, W.; Huang, Z.; Wang, Z. Comprehensive analysis of SSRs and database construction using all complete gene-coding sequences in major horticultural and representative plants. Hortic. Res. 2021, 8.
  • Youens-Clark, K.; Buckler, E.; Casstevens, T.; Chen, C.; DeClerck, G.; Derwent, P.; Dharmawardhana, P.; Jaiswal, P.; Kersey, P.; Karthikeyan, A. Gramene database in 2010: Updates and extensions. Nucleic Acids Res. 2010, 39, D1085–D1094.
  • Yu, J.; Dossa, K.; Wang, L.; Zhang, Y.; Wei, X.; Liao, B.; Zhang, X. PMDBase: A database for studying microsatellite DNA and marker development in plants. Nucleic Acids Res. 2017, 45, D1046–D1053.
  • Du, L.; Liu, Q.; Zhao, K.; Tang, J.; Zhang, X.; Yue, B.; Fan, Z. PSMD: An extensive database for pan-species microsatellite investigation and marker development. Mol. Ecol. Resour. 2020, 20, 283–291.

Sequences for SSR Data Mining | Organism | References
PAC | Rice (Oryza sativa) | [ , ]
BAC | Tomato (Solanum lycopersicum) | [ , , ]
BAC end | Spinach (Spinacia oleracea) | [ ]
 | Sweet potato (Ipomoea batatas) | [ ]
EST | Sugarcane (Saccharum officinarum) | [ ]
 | Safflower (Carthamus tinctorius) | [ ]
 | Lemon (Citrus limon) | [ ]
 | Tea (Camellia sinensis) | [ ]
 | Glory lily (Gloriosa superba) | [ ]
 | Siam tulip (Curcuma alismatifolia) | [ ]
Unigenes | Indian chrysanthemum (Chrysanthemum indicum) | [ ]
 | Tick trefoil (Uraria lagopodioides) | [ ]
Candidate genes | Wheat (Triticum aestivum) | [ , , ]
 | Rice (O. sativa) | [ , ]
 | Sugarcane (S. officinarum) | [ ]
 | Potato (Solanum tuberosum) | [ ]
Genomic survey sequences | Peanut (Arachis hypogaea) | [ ]
 | Rye (Secale cereale) | [ ]
Pseudomolecules | Eucalyptus (Eucalyptus spp.) | [ ]
 | Pineapple (Ananas comosus) | [ ]
 | Eggplant (Solanum melongena) | [ ]
 | Pigeon pea (Cajanus cajan) | [ ]
Scaffolds | Mandarin orange (Citrus reticulata) | [ ]
 | Carrot (Daucus carota) | [ ]
WGSs | Rice (O. sativa) (PacBio) | [ ]
 | Maize (Zea mays) (PacBio) | [ ]
 | Cardamon (Elettaria cardamomum) (Nanopore and Illumina) | [ ]
 | Strand sedge (Carex pumila) (Illumina and PacBio) | [ ]
 | Coconut (Cocos nucifera) | [ ]
 | Spinach (S. oleracea) | [ ]
 | Chickpea (Cicer arietinum) | [ ]

Tool | Algorithm/Detection Method | Script | Platform | URL (Accessed on 24 June 2024) | Type of Tandem Repeats Detected | Reference
Sputnik ** | Recursive | C | Windows | Updated: | Perfect and approximate repeats | [ ]
Repeat masker | String matching | Perl | Unix/Linux | | Perfect, imperfect, and compound repeats | [ ]
Tandem Repeat finder (TRF) ** | Heuristic: based on K-tuple match and alignments | NA | System independent | Updated: | Perfect, imperfect, and compound repeats | [ ]
Reputer | K-mer approach and suffix trees, Hamming edit distance model | NA | Unix | | Perfect, imperfect, and compound repeats | [ ]
Repeat finder | K-mer approach and clustering | NA | Unix/Linux | | Perfect repeats | [ ]
Simple sequence repeat identification tool (SSRIT) ** | Regular expressions and similarity searches | Perl script | System independent | Updated: | Perfect repeats | [ ]
ComplexTR * | Seed extension technique and K-length substrings | C++, Perl | NA | | Variable-length and multiple-period tandem repeats | [ ]
POLY | Sliding window approach | Python | Not known | | Perfect repeats | [ ]
Tandem repeats Occurrence locator (TROLL) | Dictionary approach, Aho–Corasick algorithm | C++ (Tcl/Tk script) | Linux | | Perfect repeats | [ ]
Search for tandem repeats in Genomes (STRING) ** | Heuristic and auto-alignment search using dynamic programming | C | Unix | Updated: | Perfect and imperfect repeats | [ ]
Microsatellite search (MISA) | Regular expression | Perl | System independent | | Perfect and compound repeats | [ ]
Mreps | Mixed combinatorial/heuristic | ANSI C | Linux, SunOS, Digital Unix, Windows | | Fuzzy tandem repeats | [ ]
Inverted Repeat Finder (IRF) | K-tuple match and alignment score | NA | Windows, Linux, Mac OS | | Approximate inverted repeats | [ ]
Spectral repeat finder (SRF) | Periodicity approach, Fourier transform | Perl | System independent | | Perfect and imperfect repeats | [ ]
Search for tandem approximate Repeats (STAR) | Minimum distance length criterion, data compression, and optimization algorithm | NA | Linux, SunOS, Mac OSX, and Windows | | Approximate tandem repeats | [ ]
Exhaustive whole genome Tandem Repeat Search (ExTRS) | K-mer and Hamming distance | NA | - | On request from the authors | Variable-length tandem repeats | [ ]
Tandem Repeats Analyser (TRA) * | String matching and algorithm similar to STRING | C++ | Windows | | Perfect and imperfect repeats | [ ]
ATRHunter * | Iterative string matching and dynamic programming | NA | Windows, Unix, Linux | | Approximate tandem repeats | [ ]
Exact tandem repeats Analyser (E-TRA) * | One of the TRA algorithms | C++ | Windows | | Perfect, imperfect, and compound repeats | [ ]
Repeat fetcher * | Pattern recognition, regular expression | Perl | Unix | | Perfect repeats | [ ]
MsatFinder ** | Regular expressions | Perl | Linux | Updated: | Perfect repeats | [ ]
FireµSat/FireµSat * | Regular expressions, FA, and Moore machine technology | C++ | Windows, Linux | | Perfect repeats | [ , ]
Phobos | Exact search | NA | Mac, Linux, Windows | | Perfect and imperfect tandem repeats | [ ]
SSRscanner | Dictionary approach based on preselected motifs | Perl | System independent | Available on request from authors | Perfect repeats | [ ]
TandemSWAN * | Auto-correlation analysis and statistical weights | C++ | System independent | | Fuzzy tandem repeats | [ ]
OMWSA | Periodicity approach using moving window spectral analysis | NA | NA | | Perfect, imperfect, and compound repeats | [ ]
Imperfect Microsatellite Extraction (IMEx) ** | String matching algorithm and sliding window approach | C | System independent | | Imperfect repeats | [ ]
SciRoko ** | SSR seed extension | C | Windows | Updated: | Perfect and imperfect repeats | [ ]
JSTRING ** | Similar to STRING | Java | System independent | Updated: | Perfect and imperfect tandem repeats | [ ]
msatcommander | Regular expressions | Python | MacOS X, Windows, Unix | | Perfect repeats | [ ]
ReRep (read Repeat) Finder * | Similarity searches | Perl | Linux | | De novo repeat identification in GSS | [ ]
SSRlocator * | Similar to MISA and SSRIT | Perl | Windows | | Perfect and imperfect repeats | [ ]
SWELFE | Alignment based on dynamic programming | C | Linux and Mac OS X | | Internal repeats | [ ]
TREKS | K-means clustering algorithm | Java | Windows | | Perfect and imperfect repeats | [ ]
FAIR ** | Dynamic programming | C++ | Web-based | Updated: | Internal repeats | [ ]
TRStalker | Heuristic, edit distance | NA | Unknown | | Fuzzy tandem repeats | [ ]
Mfsat * | Regular expressions | NA | Windows | | Perfect repeats | [ ]
PALFINDER | Text search | Perl | System independent | | Perfect repeats | [ ]
GMATo | Regular expression with a greedy matching algorithm | Perl | System independent | | Perfect repeats | [ ]
ProGeRF * | Sequence search and alignment by hashing algorithm | Perl and C | Linux | | Perfect and imperfect repeats | [ ]
Repeat Analyzer | Knuth–Morris–Pratt (KMP) string searching algorithm | Python | Windows, Linux, Mac OS X | | Genic SSRs | [ ]
SA-SSR | Suffix and prefix array | | Linux | | Micro- and minisatellites | [ ]
Kmer-SSR | K-mer approach | C++ | Linux | | Perfect repeats | [ ]
PERF | K-mer approach | Python | System independent | | Perfect and imperfect repeats | [ ]
SSRMMD | Regular expression with a greedy matching algorithm | Perl | System independent | | Perfect repeats and polymorphic SSRs | [ ]
EasySSR | String matching implemented in IMEx | Python and Perl | Linux | | Perfect and imperfect repeats | [ ]

Objective | Suitable Computational Tools
Whole genome search for SSRs at a faster pace | Sciroko
Mining for repeats within GSSs | ReREP
Mining for microsatellites in viral genomes | Mfsat
Mining for internal repeats | Swelfe, IRF, FAIR
Mining for perfect repeats only | SSRIT, CUGISSR, TROLL, Sputnik
Mining for perfect, imperfect, and compound repeats | MISA, IMEx, Msatfinder, TRF
Mining for repeats within reads obtained from sequencing platforms | Palfinder
Mining for polymorphic SSR | Palfinder, PolySSR
Mining for fuzzy tandem repeats/VNTR | Tandem swan, ATR hunter, TRF, Mreps, STRING, STAR
Identification and masking repeats | Repeatmasker, SIMPLE, DUST
Mining for long and divergent repeats | Repeat masker
Mining for short repeats | IMEx, Sputnik
Mining for repeats in both nucleic acid and protein sequences | FAIR, TreKs
Mining for palindromic repeats | Adplot, Reputer, CRISPRFinder
Pipelines | Read2marker, QDD, ESMP, POLYSSR, HighSSR, FullSSR, WGSSAT, IDSSR

Geethanjali, S.; Kadirvel, P.; Anumalla, M.; Hemanth Sadhana, N.; Annamalai, A.; Ali, J. Streamlining of Simple Sequence Repeat Data Mining Methodologies and Pipelines for Crop Scanning. Plants 2024 , 13 , 2619. https://doi.org/10.3390/plants13182619

A comprehensive survey of data mining

  • Original Research
  • Published: 06 February 2020
  • Volume 12, pages 1243–1257 (2020)

  • Manoj Kumar Gupta (ORCID: orcid.org/0000-0002-4481-8432) &
  • Pravin Chandra

Data mining plays an important role in many human activities because it extracts previously unknown, useful patterns (or knowledge) from data. Owing to these capabilities, data mining has become an essential task in a large number of application domains, such as banking, retail, medicine, insurance, and bioinformatics. To give a holistic view of research trends in the area, this paper presents a systematic and comprehensive survey of data mining tasks and techniques. It also presents various real-life applications of data mining, along with the challenges and open issues in data mining research.


Kumar D, Bezdek JC, Rajasegarar S, Palaniswami M, Leckie C, Chan J, Gubbi J (2016) Adaptive cluster tendency visualization and anomaly detection for streaming data. ACM Trans Knowl Discov Data 11(2):1–24

Lee G, Yun U (2017) A new efficient approach for mining uncertain frequent patterns using minimum data structure without false positives. Future Gener Comput Syst 68:89–110

Li G, Zaki MJ (2015) Sampling frequent and minimal boolean patterns: theory and application in classification. Data Min Knowl Discov 30(1):181–225. https://doi.org/10.1007/s10618-015-0409-y

Article   MathSciNet   MATH   Google Scholar  

Liao TW, Triantaphyllou E (2007) Recent advances in data mining of enterprise data: algorithms and applications. World Scientific Publishing, Singapore, pp 111–145

Mabroukeh NR, Ezeife CI (2010) A taxonomy of sequential pattern mining algorithms. ACM Comput Surv 43:1

Mampaey M, Vreeken J (2011) Summarizing categorical data by clustering attributes. Data Min Knowl Discov 26(1):130–173

Menardi G, Torelli N (2012) Training and assessing classification rules with imbalanced data. Data Min Knowl Discov 28(1):4–28. https://doi.org/10.1007/s10618-012-0295-5

Mukhopadhyay A, Maulik U, Bandyopadhyay S (2015) A survey of multiobjective evolutionary clustering. ACM Comput Surv 47(4):1–46

Pei Y, Fern XZ, Tjahja TV, Rosales R (2016) ‘Comparing clustering with pairwise and relative constraints: a unified framework. ACM Trans Knowl Discov Data 11:2

Rafalak M, Deja M, Wierzbicki A, Nielek R, Kakol M (2016) Web content classification using distributions of subjective quality evaluations. ACM Trans Web 10:4

Reddy D, Jana PK (2014) A new clustering algorithm based on Voronoi diagram. Int J Data Min Model Manag 6(1):49–64

Rustogi S, Sharma M, Morwal S (2017) Improved Parallel Apriori Algorithm for Multi-cores. IJ Inf Technol Comput Sci 4:18–23

Shah-Hosseini H (2013) Improving K-means clustering algorithm with the intelligent water drops (IWD) algorithm. Int J Data Min Model Manag 5(4):301–317

Silva JA, Faria ER, Barros RC, Hruschka ER, de Carvalho ACPLF, Gama J (2013) Data stream clustering: a survey. ACM Comput Surv 46(1):1–31

Silva A, Antunes C (2014) Multi-relational pattern mining over data streams. Data Min Knowl Discov 29(6):1783–1814. https://doi.org/10.1007/s10618-014-0394-6

Sim K, Gopalkrishnan V, Zimek A, Cong G (2012) A survey on enhanced subspace clustering. Data Min Knowl Discov 26(2):332–397

Sohrabi MK, Roshani R (2017) Frequent itemset mining using cellular learning automata. Comput Hum Behav 68:244–253

Craw Susan, Wiratunga Nirmalie, Rowe Ray C (2006) Learning adaptation knowledge to improve case-based reasoning. Artif Intell 170:1175–1192

Tan KC, Teoh EJ, Yua Q, Goh KC (2009) A hybrid evolutionary algorithm for attribute selection in data mining. Expert Syst Appl 36(4):8616–8630

Tew C, Giraud-Carrier C, Tanner K, Burton S (2013) Behavior-based clustering and analysis of interestingness measures for association rule mining. Data Min Knowl Discov 28(4):1004–1045

Wang L, Dong M (2015) Exemplar-based low-rank matrix decomposition for data clustering. Data Min Knowl Discov 29:324–357

Wang F, Sun J (2014) Survey on distance metric learning and dimensionality reduction in data mining. Data Min Knowl Discov 29:534–564

Wang B, Rahal I, Dong A (2011) Parallel hierarchical clustering using weighted confidence affinity. Int J Data Min Model Manag 3(2):110–129

Zacharis NZ (2018) Classification and regression trees (CART) for predictive modeling in blended learning. IJ Intell Syst Appl 3:1–9

Zhang W, Li R, Feng D, Chernikov A, Chrisochoides N, Osgood C, Ji S (2015) Evolutionary soft co-clustering: formulations, algorithms, and applications. Data Min Knowl Discov 29:765–791

Han J, Fu Y (1996) Exploration of the power of attribute-oriented induction in data mining. Adv Knowl Discov Data Min. AAAI/MIT Press, pp 399-421

Gupta A, Mumick IS (1995) Maintenance of materialized views: problems, techniques, and applications. IEEE Data Eng Bull 18(2):3

Sawant V, Shah K (2013) A review of distributed data mining using agents. Int J Adv Technol Eng Res 3(5):27–33

Gupta MK, Chandra P (2019) An efficient approach for selection of initial cluster centroids for k-means clustering algorithm. In: Proceedings international conference on recent developments in science engineering and technology (REDSET-2019), November 15–16 2019

Gupta MK, Chandra P (2019) MP-K-means: modified partition based cluster initialization method for k-means algorithm. Int J Recent Technol Eng 8(4):1140–1148

Gupta MK, Chandra P (2019) HYBCIM: hypercube based cluster initialization method for k-means. IJ Innov Technol Explor Eng 8(10):3584–3587. https://doi.org/10.35940/ijitee.j9774.0881019

Article   Google Scholar  

Enke David, Thawornwong Suraphan (2005) The use of data mining and neural networks for forecasting stock market returns. Expert Syst Appl 29:927–940

Mezyk Edward, Unold Olgierd (2011) Machine learning approach to model sport training. Comput Hum Behav 27:1499–1506

Esling P, Agon C (2012) Time-series data mining. ACM Comput Surv 45(1):1–34

Hüllermeier Eyke (2005) Fuzzy methods in machine learning and data mining: status and prospects. Fuzzy Sets Syst 156:387–406

Hullermeier Eyke (2011) Fuzzy sets in machine learning and data mining. Appl Soft Comput 11:1493–1505

Gengshen Du, Ruhe Guenther (2014) Two machine-learning techniques for mining solutions of the ReleasePlanner™ decision support system. Inf Sci 259:474–489

Smith Kate A, Gupta Jatinder ND (2000) Neural networks in business: techniques and applications for the operations researcher. Comput Oper Res 27:1023–1044

Huang Mu-Jung, Tsou Yee-Lin, Lee Show-Chin (2006) Integrating fuzzy data mining and fuzzy artificial neural networks for discovering implicit knowledge. Knowl Based Syst 19:396–403

Padhraic S (2000) Data mining: analysis on grand scale. Stat Method Med Res 9(4):309–327. https://doi.org/10.1191/096228000701555181

Article   MATH   Google Scholar  

Saeed S, Ali M (2012) Privacy-preserving back-propagation and extreme learning machine algorithms. Data Knowl Eng 79–80:40–61

Singh Y, Bhatia PK, Sangwan OP (2007) A review of studies on machine learning techniques. Int J Comput Sci Secur 1(1):70–84

Yahia ME, El-taher ME (2010) A new approach for evaluation of data mining techniques. Int J Comput Sci Issues 7(5):181–186

Jackson J (2002) Data mining: a conceptual overview. Commun Assoc Inf Syst 8:267–296

Heckerman D (1998) A tutorial on learning with Bayesian networks. Learning in graphical models. Springer, Netherlands, pp 301–354

Politano PM, Walton RO (2017) Statistics & research methodol. Lulu. com

Wetherill GB (1987) Regression analysis with application. Chapman & Hall Ltd, UK

Anderberg MR (2014) Cluster analysis for applications: probability and mathematical statistics: a series of monographs and textbooks, vol 19. Academic Press, USA

Mihoci A (2017) Modelling limit order book volume covariance structures. In: Hokimoto T (ed) Advances in statistical methodologies and their application to real problems. IntechOpen, Croatia. https://doi.org/10.5772/66152

Chapter   Google Scholar  

Thompson B (2004) Exploratory and confirmatory factor analysis: understanding concepts and applications. American Psychological Association, Washington, DC (ISBN:1-59147-093-5)

Kuzey C, Uyar A, Delen (2014) The impact of multinationality on firm value: a comparative analysis of machine learning techniques. Decis Support Syst 59:127–142

Chan Philip K, Salvatore JS (1997) On the accuracy of meta-learning for scalable data mining. J Intell Inf Syst 8:5–28

Tsai Chih-Fong, Hsu Yu-Feng, Lin Chia-Ying, Lin Wei-Yang (2009) Intrusion detection by machine learning: a review. Expert Syst Appl 36:11994–12000

Liao SH, Chu PH, Hsiao PY (2012) Data mining techniques and applications—a decade review from 2000 to 2011. Expert Syst Appl 39:11303–11311

Kanevski M, Parkin R, Pozdnukhov A, Timonin V, Maignan M, Demyanov V, Canu S (2004) Environmental data mining and modelling based on machine learning algorithms and geostatistics. Environ Model Softw 19:845–855

Jain N, Srivastava V (2013) Data mining techniques: a survey paper. Int J Res Eng Technol 2(11):116–119

Baker RSJ (2010) Data mining for education. In: McGaw B, Peterson P, Baker E (eds) International encyclopedia of education, 3rd edn. Elsevier, Oxford, UK

Lew A, Mauch H (2006) Introduction to data mining and its applications. Springer, Berlin

Mukherjee S, Shaw R, Haldar N, Changdar S (2015) A survey of data mining applications and techniques. Int J Comput Sci Inf Technol 6(5):4663–4666

Data mining examples: most common applications of data mining (2019). https://www.softwaretestinghelp.com/data-mining-examples/ . Accessed 27 Dec 2019

Devi SVSG (2013) Applications and trends in data mining. Orient J Comput Sci Technol 6(4):413–419

Data mining—applications & trends. https://www.tutorialspoint.com/data_mining/dm_applications_trends.htm

Keleş MK (2017) An overview: the impact of data mining applications on various sectors. Tech J 11(3):128–132

Top 14 useful applications for data mining. https://bigdata-madesimple.com/14-useful-applications-of-data-mining/ . Accessed 20 Aug 2014

Yang Q, Wu X (2006) 10 challenging problems in data mining research. Int J Inf Technol Decis Making 5(4):597–604

Padhy N, Mishra P, Panigrahi R (2012) A survey of data mining applications and future scope. Int J Comput Sci Eng Inf Technol 2(3):43–58

Gibert K, Sanchez-Marre M, Codina V (2010) Choosing the right data mining technique: classification of methods and intelligent recommendation. In: International Congress on Environment Modelling and Software Modelling for Environment’s Sake, Fifth Biennial Meeting, Ottawa, Canada

Download references

Author information

Authors and affiliations.

University School of Information, Communication and Technology, Guru Gobind Singh Indraprastha University, Sector-16C, Dwarka, Delhi, 110078, India

Manoj Kumar Gupta & Pravin Chandra

Corresponding author

Correspondence to Manoj Kumar Gupta.

About this article

Gupta, M.K., Chandra, P. A comprehensive survey of data mining. Int. j. inf. tecnol. 12, 1243–1257 (2020). https://doi.org/10.1007/s41870-020-00427-7

Received: 29 June 2019

Accepted: 20 January 2020

Published: 06 February 2020

Issue Date: December 2020

DOI: https://doi.org/10.1007/s41870-020-00427-7

  • Data mining techniques
  • Data mining tasks
  • Data mining applications
  • Classification

Title: Research on Data Right Confirmation Mechanism of Federated Learning Based on Blockchain

Abstract: Federated learning can address the privacy protection problem in distributed data mining and machine learning, but protecting the ownership, usage and income rights of all parties involved in federated learning remains an important issue. This paper proposes a data ownership confirmation mechanism for federated learning based on blockchain and smart contracts. It uses decentralized blockchain technology to record the contribution of each participant on the blockchain and distributes the benefits of federated learning results through the blockchain. The relevant smart contracts and data structures are implemented in a local blockchain simulation environment, and the feasibility of the scheme is preliminarily demonstrated.
Comments: in Chinese language
Subjects: Cryptography and Security (cs.CR)
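
The mechanism outlined in the abstract (record each participant's contribution, then distribute benefits from those records) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `ContributionLedger` and `distribute_benefits`, the hash-chained list standing in for a blockchain, and the proportional payout rule. The paper's actual smart contracts and data structures are not given in the abstract.

```python
import hashlib
import json
from collections import defaultdict

GENESIS = "0" * 64  # placeholder hash for the first record (assumption)


def _digest(payload):
    """Deterministic SHA-256 digest of a record payload."""
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()


class ContributionLedger:
    """Toy append-only ledger: each record stores the hash of the previous one."""

    def __init__(self):
        self.blocks = []

    def record(self, participant, score):
        if score < 0:
            raise ValueError("contribution score must be non-negative")
        prev_hash = self.blocks[-1]["hash"] if self.blocks else GENESIS
        payload = {"participant": participant, "score": score, "prev": prev_hash}
        self.blocks.append({**payload, "hash": _digest(payload)})

    def verify(self):
        """Return True only if no record was altered after being appended."""
        prev_hash = GENESIS
        for block in self.blocks:
            payload = {k: block[k] for k in ("participant", "score", "prev")}
            if block["prev"] != prev_hash or block["hash"] != _digest(payload):
                return False
            prev_hash = block["hash"]
        return True

    def totals(self):
        """Sum the recorded scores per participant."""
        sums = defaultdict(float)
        for block in self.blocks:
            sums[block["participant"]] += block["score"]
        return dict(sums)


def distribute_benefits(ledger, total_benefit):
    """Split total_benefit in proportion to recorded contributions (assumed rule)."""
    totals = ledger.totals()
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {p: 0.0 for p in totals}
    return {p: total_benefit * s / grand_total for p, s in totals.items()}


# Hypothetical participants and scores.
ledger = ContributionLedger()
ledger.record("hospital_a", 3.0)
ledger.record("hospital_b", 1.0)
ledger.record("hospital_a", 1.0)
shares = distribute_benefits(ledger, 100.0)
```

Chaining each record to the previous hash means that editing an earlier contribution invalidates `verify()`, which is the property that makes recorded contributions usable as a basis for payouts. A real deployment would rely on the consensus and contract layer of an actual blockchain rather than an in-memory list.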

Optimized Application of the Decision Tree ID3 Algorithm Based on Big Data in Sports Performance Management

Index terms

Applied computing

Operations research

Decision analysis

Information systems

Data management systems

Information storage systems

Record storage systems

Directory structures

Information systems applications

Data mining

Decision support systems

Data analytics

Mathematics of computing

Discrete mathematics

Graph theory

Recommendations

An improved decision tree classification algorithm based on ID3 and the application in score analysis

The decision tree is an important classification method in data mining. To address deficiencies of the ID3 algorithm, a new improved classification algorithm is proposed in this paper. The new algorithm combines the principle of the Taylor formula with ...

Improved Decision Tree Method for Imbalanced Data Sets in Digital Forensics

An improved ID3 decision tree algorithm suited to digital forensics is presented in the study. Forensics data are imbalanced, inconstant, noisy and dispersive. Based on these characteristics, we improve the ID3 algorithm by adopting a correction factor and two ...

Sports Big Data: Management, Analysis, Applications, and Challenges

With the rapid growth of information technology and sports, analyzing sports information has become an increasingly challenging issue. Sports big data come from the Internet and show a rapid growth trend. Sports big data contain rich information such as ...
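
For readers unfamiliar with ID3, which the article title and the recommendations above refer to, the standard algorithm picks the attribute with the highest information gain at each split. The sketch below shows only that textbook selection step, not the optimized or improved variants those papers propose. The toy athlete records and the attribute names `training` and `shoe` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by partitioning the rows on one attribute."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in partitions.values()
    )
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """ID3 split choice: the attribute with the largest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Hypothetical data: training load determines the outcome, shoe colour does not.
rows = [
    {"training": "high", "shoe": "red"},
    {"training": "high", "shoe": "blue"},
    {"training": "low", "shoe": "red"},
    {"training": "low", "shoe": "blue"},
]
labels = ["pass", "pass", "fail", "fail"]

gain_training = information_gain(rows, labels, "training")
gain_shoe = information_gain(rows, labels, "shoe")
chosen = best_attribute(rows, labels, ["training", "shoe"])
```

A full ID3 implementation would apply `best_attribute` recursively to each partition until the labels in a node are pure or no attributes remain.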

Author tags.

  • Sports Performance Management
  • Big Data Function
  • Decision Tree
  • User Interest

IMAGES

  1. (PDF) Data Mining Methods: A Review

  2. (PDF) Research on Data Mining Methods

  3. (PDF) Use of Data Mining to Analyse Students' Performance

  4. (PDF) Data mining techniques and methodologies

  5. (PDF) Data Mining: An Overview

  6. (PDF) A Study on Importance of Data Mining in Information Technology

VIDEO

  1. Student data mining solution–knowledge management system related to higher education institutions

  2. Lecture 15: Data Mining CSE 2020 Fall

  3. Smart PLS-SEM: Lecture 15 Assessing Results of Structural Model Part-I

  4. Soil Classification Using Data Mining Techniques: A Comparative Study

  5. Data Mining Introduction

  6. Geology for Mining Exam CBT 2024

COMMENTS

  1. 345193 PDFs

    Explore the latest full-text research PDFs, articles, conference papers, preprints and more on DATA MINING. Find methods information, sources, references or conduct a literature review on DATA MINING

  2. Home

    Data Mining and Knowledge Discovery is a leading technical journal focusing on the extraction of information from vast databases. Publishes original research papers and practice in data mining and knowledge discovery. Provides surveys and tutorials of important areas and techniques. Offers detailed descriptions of significant applications.

  3. Data mining

    Data mining is the process of extracting potentially useful information from data sets. It uses a suite of methods to organise, examine and combine large data sets, including machine learning ...

  4. Knowledge Discovery: Methods from data mining and machine learning

    The interdisciplinary field of knowledge discovery and data mining emerged from a necessity of big data requiring new analytical methods beyond the traditional statistical approaches to discover new knowledge from the data mine. This emergent approach is a dialectic research process that is both deductive and inductive.

  5. Data mining articles within Scientific Reports

    Read the latest Research articles in Data mining from Scientific Reports.

  6. Recent Advances in Data Mining

    Feature papers represent the most advanced research with significant potential for high impact in the field. A Feature Paper should be a substantial original Article that involves several techniques or approaches, provides an outlook for future research directions and describes possible research applications. ... Data mining is the procedure of ...

  7. Statistical Analysis and Data Mining: The ASA Data Science Journal

    Statistical Analysis and Data Mining: The ASA Data Science Journal addresses the broad area of data analysis, including data mining algorithms, statistical approaches, and practical applications. Topics include problems involving massive and complex datasets, solutions utilizing innovative data mining algorithms and/or novel ...

  8. Recent advances in domain-driven data mining

    Data mining research has been significantly motivated by and benefited from real-world applications in novel domains. This special issue was proposed and edited to draw attention to domain-driven data mining and disseminate research in foundations, frameworks, and applications for data-driven and actionable knowledge discovery. Along with this special issue, we also organized a related ...

  9. A comprehensive survey of clustering algorithms: State-of-the-art

    Clustering is an essential tool in data mining research and applications. It is the subject of active research in many fields of study, such as computer science, data science, statistics, pattern recognition, artificial intelligence, and machine learning. ... The paper surveyed the different data mining methods that can be applied to extract ...

  10. Trends in data mining research: A two-decade review using topic analysis

    The research direction related to practical Applications of data mining also shows a tendency to grow. The last two topics, Text Mining and Data Streams have attracted steady interest from ...

  11. Data Mining Methods and Obstacles: A Comprehensive Analysis

    Big data analytics: a literature review paper. In: Advances in Data Mining. Applications and Theoretical Aspects: 14th Industrial Conference, ICDM 2014, St. Petersburg ...

  12. Methods and Applications of Data Mining in Business Domains

    This Special Issue invited researchers to contribute original research in the field of data mining, particularly in its application to diverse domains, like healthcare, software development, logistics, and human resources. We were especially interested in how the data mining method was modified to cater to the specific domain in question.

  13. Data Mining for the Internet of Things: Literature Review and

    In this paper, we survey the data mining in 3 different views: knowledge view, technique view, and application view. In knowledge view, we review classification, clustering, association analysis, time series analysis, and outlier analysis. ... and data mining system area. Based on the survey of the current research, a suggested big data mining ...

  14. Adaptations of data mining methodologies: a systematic literature

    The main research objective of this article is to study how data mining methodologies are applied by researchers and practitioners. To this end, we use systematic literature review (SLR) as scientific method for two reasons. Firstly, systematic review is based on trustworthy, rigorous, and auditable methodology.

  15. A Review of Data Mining in Personalized Education: Current ...

    To offer a comprehensive review of recent advancements in personalized educational data mining, this paper focuses on four primary scenarios: educational recommendation, cognitive diagnosis, knowledge tracing, and learning analysis. ... Zhang, Y., & Ran, J. (2018a). Research on clustering mining and feature analysis of online learning ...

  16. Data Mining in Healthcare: Applying Strategic Intelligence Techniques

    In order to identify the strategic topics and the thematic evolution structure of data mining applied to healthcare, in this paper, a bibliometric performance and network analysis (BPNA) was conducted. ... Table 5 presents the most important WoS subject research fields of data mining in healthcare from 1995 to July 2020. Computer Science ...

  17. data mining Latest Research Papers

    Epidemic diseases can be extremely dangerous because of their harmful effects. They may have negative effects on economies, businesses, the environment, humans, and the workforce. In this paper, some of the factors that are interrelated with the COVID-19 pandemic have been examined using data mining methodologies and approaches.

  18. Data Mining: Data Mining Concepts and Techniques

    There are different processes and techniques used to carry out data mining successfully. Published in: 2013 International Conference on Machine Intelligence and Research Advancement. Date of Conference: 21-23 December 2013. Date Added to IEEE Xplore: 09 October 2014. Electronic ISBN:978--7695-5013-8.

  19. Data mining in clinical big data: the frequently used databases, steps

    To allow a clearer understanding of the application of data-mining technology on clinical big data, the second part of this paper introduced the concept of public databases and summarized those commonly used in medical research. ... Figure 1 shows a series of research concepts. The data-mining process is divided into several steps: (1) database ...

  20. IEEE International Conference on Data Mining (ICDM)

  21. Data Mining in Healthcare

    The data mining techniques developed in recent years include generalization, characterization, classification, clustering, association, evolution, pattern matching, data visualization and meta-rule guided mining [2]. As an element of data mining technique research, this paper surveys the ...

  22. (PDF) Data mining techniques and applications

    Data mining is a process that finds useful patterns in large amounts of data. The paper discusses a few of the data mining techniques, algorithms...

  23. Review Paper on Data Mining Techniques and Applications

    Abstract. Data mining is the process of extracting hidden and useful patterns and information from data. Data mining is a new technology that helps businesses to predict future trends and behaviors, allowing them to make proactive, knowledge driven decisions. The aim of this paper is to show the process of data mining and how it can help ...

  24. Streamlining of Simple Sequence Repeat Data Mining Methodologies and

    This paper includes the currently available methodologies for producing SSR markers, genomic resource databases, and computational tools/pipelines for SSR data mining and primer generation. This review aims to provide a 'one-stop shop' of information to help each new user carefully select tools for identifying and utilizing SSRs in genetic ...

  25. A comprehensive survey of data mining

    Data mining plays an important role in various human activities because it extracts unknown useful patterns (or knowledge). Due to its capabilities, data mining has become an essential task in a large number of application domains such as banking, retail, medical, insurance, bioinformatics, etc. To take a holistic view of the research trends in the area of data mining, a comprehensive survey is ...

  26. Research on Data Right Confirmation Mechanism of Federated Learning

    Federated learning can solve the privacy protection problem in distributed data mining and machine learning, and how to protect the ownership, use and income rights of all parties involved in federated learning is an important issue. This paper proposes a federated learning data ownership confirmation mechanism based on blockchain and smart contract, which uses decentralized blockchain ...

  27. Optimized Application of the Decision Tree ID3 Algorithm Based on Big

    The research results show that the data query and data display functions are widely used in the system, and the utilization rate of each function of big data in the system can be clearly seen. In the accuracy of sports performance management, the improved ID3 algorithm is obviously higher than the ID3 algorithm, and the accuracy is improved by ...

  28. Monitoring cybersecurity technology through the years: a technology

    This paper investigates the research and development activity in this space. Technology mining techniques including bibliometrics and patent analyses were used to zoom in on academic research and industrial development, respectively. The paper presents a framework which can be replicated for other critical areas like cybersecurity.