• Review article
  • Open access
  • Published: 02 November 2020

Big data in education: a state of the art, limitations, and future research directions

  • Maria Ijaz Baig 1 ,
  • Liyana Shuib   ORCID: orcid.org/0000-0002-7907-0671 1 &
  • Elaheh Yadegaridehkordi 1  

International Journal of Educational Technology in Higher Education volume 17, Article number: 44 (2020)


Big data is an essential aspect of innovation which has recently gained major attention from both academics and practitioners. Considering the importance of the education sector, the current tendency is moving towards examining the role of big data in this sector. So far, many studies have been conducted to comprehend the application of big data in different fields for various purposes. However, a comprehensive review of big data in education is still lacking. Thus, this study aims to conduct a systematic review of big data in education in order to explore the trends, classify the research themes, and highlight the limitations and possible future directions in the domain. Following a systematic review procedure, 40 primary studies published from 2014 to 2019 were utilized and the related information extracted. The findings show that the number of studies addressing big data in education has increased during the last 2 years. The current studies cover four main research themes under big data in education: learners' behavior and performance, modelling and educational data warehouses, improvement of the educational system, and integration of big data into the curriculum. Most big data educational research has focused on learners' behavior and performance. Moreover, this study highlights research limitations and portrays future directions. It provides a guideline for future studies and highlights new insights and directions for the successful utilization of big data in education.

Introduction

The world is changing rapidly due to the emergence of innovative technologies (Chae, 2019). Currently, a large number of technological devices are used by individuals (Shorfuzzaman, Hossain, Nazir, Muhammad, & Alamri, 2019). At every single moment, an enormous amount of data is produced through these devices (ur Rehman et al., 2019). In order to cater for this massive data, new technologies and applications are being developed for data analysis and storage (Kalaian, Kasim, & Kasim, 2019). Big data has now become a matter of interest for researchers (Anshari, Alas, & Yunus, 2019), who are trying to define and characterize it in different ways (Mikalef, Pappas, Krogstie, & Giannakos, 2018).

According to Yassine, Singh, Hossain, and Muhammad (2019), big data is a large volume of data. However, De Mauro, Greco, and Grimaldi (2016) referred to it as an informational asset that is characterized by high quantity, speed, and diversity. Moreover, Shahat (2019) described big data as large data sets that are difficult to process, control, or examine in a traditional way. Big data is generally characterized by 3 Vs: Volume, Variety, and Velocity (Xu & Duan, 2019). Volume refers to the large amount, or increasing scale, of data; the size of big data can be measured in terabytes and petabytes (Herschel & Miori, 2017), and high-capacity storage systems are required to cater for it. Variety refers to the type or heterogeneity of data: data can be in a structured format (databases) or an unstructured format (images, video, emails), and big data analytical tools are helpful in handling unstructured data. Velocity refers to the speed at which big data can be accessed, with data virtually present in a real-time environment (Internet logs) (Sivarajah, Kamal, Irani, & Weerakkody, 2017).

Currently, the concept of 3 Vs has been expanded into several Vs. For instance, Demchenko, Grosso, De Laat, and Membrey (2013) classified big data into 5 Vs: Volume, Velocity, Variety, Veracity, and Value. Similarly, Saggi and Jain (2018) characterized big data into 7 Vs, namely Volume, Velocity, Variety, Valence, Veracity, Variability, and Value.

Big data demand is significantly increasing in different fields of endeavour such as insurance and construction (Dresner Advisory Services, 2017 ), healthcare (Wang, Kung, & Byrd, 2018 ), telecommunication (Ahmed et al., 2018 ), and e-commerce (Wu & Lin, 2018 ). According to Dresner Advisory Services ( 2017 ), technology (14%), financial services (10%), consulting (9%), healthcare (9%), education (8%) and telecommunication (7%) are the most active sectors in producing a vast amount of data.

The educational sector is no exception. In the educational realm, a large volume of data is produced through online courses and teaching and learning activities (Oi, Yamada, Okubo, Shimada, & Ogata, 2017). With the advent of big data, teachers can now access students' academic performance and learning patterns and provide instant feedback (Black & Wiliam, 2018). Timely and constructive feedback motivates and satisfies students, which has a positive impact on their performance (Zheng & Bender, 2019). Academic data can help teachers to analyze their teaching pedagogy and effect changes according to students' needs and requirements. Many online educational sites have been designed, and multiple courses based on individual student preferences have been introduced (Holland, 2019). Improvement in the educational sector depends upon the acquisition of data and on technology. Large-scale administrative data can play a tremendous role in managing various educational problems (Sorensen, 2018). Therefore, it is essential for professionals to understand the effectiveness of big data in education in order to minimize educational issues.

So far, several review studies have been conducted in the big data realm. Mikalef et al. (2018) conducted a systematic literature review focused on big data analytics capabilities in firms. Mohammadpoor and Torabi (2018), in their review study on big data, observed the emerging trends of big data in the oil and gas industry. Furthermore, another systematic literature review was conducted by Neilson, Daniel, and Tjandra (2019) on big data in the transportation system. Kamilaris, Kartakoullis, and Prenafeta-Boldú (2017) conducted a review study on the use of big data in agriculture. Similarly, Wolfert, Ge, Verdouw, and Bogaardt (2017) conducted a review study on the use of big data in smart farming. Moreover, Camargo Fiorini, Seles, Jabbour, Mariano, and Sousa Jabbour (2018) conducted a review study on big data and management theory. Even though many fields have been covered in previous review studies, a comprehensive review of big data in the education sector is still lacking. Thus, this study aims to conduct a systematic review of big data in education in order to identify the primary studies, their trends and themes, as well as limitations and possible future directions. This research can play a significant role in the advancement of big data in the educational domain, and the identified limitations and future directions will be helpful to new researchers seeking to advance this particular realm.

The research questions of this study are stated below:

What are the trends in the papers published on big data in education?

What research themes have been addressed in the big data in education domain?

What are the limitations and possible future directions?

The remainder of this study is organized as follows: Section 2 explains the review methodology and presents the SLR results; Section 3 reports the findings for the research questions; and finally, Section 4 presents the discussion, conclusion, and research implications.

Review methodology

In order to achieve the aforementioned objective, this study employs a systematic literature review method. An effective review is based on analysis of the literature and identification of the limitations and research gaps in a particular area. A systematic review can be defined as a process of identifying, assessing, and interpreting the available research relevant to particular research questions or an area of research. The essential purpose of conducting a systematic review is to explore and conceptualize the extant studies, identify the themes, relations, and gaps, and describe future directions accordingly. These purposes match the aim of this study. This research applies the Kitchenham and Charters (2007) strategies. A systematic review comprises three phases: planning the review, conducting the review, and reporting the review. Each phase has specific activities: 1) develop the review protocol, 2) formulate the inclusion and exclusion criteria, 3) describe the search strategy process, 4) define the selection process, 5) perform the quality evaluation procedure, and 6) extract and synthesize the data. The description of each activity is provided in the following sections.

Review protocol

The review protocol provides the foundation and mechanism for undertaking a systematic literature review. The essential purpose of the review protocol is to minimize research bias. The review protocol comprises the background, research questions, search strategy, selection process, quality assessment, and data extraction and synthesis. The review protocol helps to maintain the consistency of the review and eases later updates when new findings are incorporated. This is the most significant aspect that distinguishes an SLR from other literature reviews.

Inclusion and exclusion criteria

The aim of defining the inclusion and exclusion criteria is to ensure that only highly relevant research is included in this study. This study considers articles published in journals, workshops, conferences, and symposiums. Articles that consist of introductions, tutorials, posters, and summaries were eliminated. Complete, full-length, relevant studies published in the English language between January 2014 and March 2019 were considered for the study. The searched words had to be present in the title, abstract, or keywords section.

Table  1 shows a summary of the inclusion and exclusion criteria.

Search strategy process

The search strategy comprised two stages, namely S1 (automatic stage) and S2 (manual stage). Initially, an automatic search (S1) process was applied to identify the primary studies of big data in education. The following databases and search engines were explored: Science Direct, SAGE Journals, Emerald Insight, Springer Link, IEEE Xplore, ACM Digital Library, Taylor and Francis, and AIS e-Library. These databases were considered because they contain high-impact journals and germane conference proceedings, workshops, and symposiums. According to Kitchenham and Charters (2007), electronic databases provide a broad perspective on a subject rather than a limited set of specific journals and conferences. In order to find the relevant articles, keywords on big data and education were searched to obtain relatable results. General words correlated to education were also explored (education OR academic OR university OR learning OR curriculum OR higher education OR school), and this search string was paired with big data. The second stage is the manual search stage (S2), in which a manual search was performed on the references of all initially retrieved studies. Kitchenham (2004) suggested that a manual search should be applied to the primary study references. EndNote was used to manage, sort, and remove duplicate studies easily.
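As an illustration, the combined search string can be sketched as follows. This is a minimal reconstruction in Python, assuming a simple Boolean pairing of the education-related terms listed above with "big data"; the exact query syntax and field restrictions (title, abstract, keywords) differ between databases and are assumptions here.

```python
# Minimal sketch of the Boolean search string used in the automatic search (S1).
# The education-related terms are taken from the text; the quoting and AND/OR
# pairing with "big data" are illustrative assumptions, as each database
# accepts its own query syntax.
education_terms = [
    "education", "academic", "university", "learning",
    "curriculum", "higher education", "school",
]

education_clause = " OR ".join(f'"{term}"' for term in education_terms)
search_string = f'"big data" AND ({education_clause})'

print(search_string)
# "big data" AND ("education" OR "academic" OR "university" OR "learning" OR ...)
```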

Selection process

The selection process is used to identify the studies that are relevant to the research questions of this review. The selection process of this study is presented in Fig. 1. By applying the string of keywords, a total of 559 studies were found through the automatic search. Of these, 348 were duplicate studies and were removed using the EndNote library. The inclusion and exclusion criteria were applied to the remaining 211 studies. According to Kitchenham and Charters (2007), recommendations and irrelevant studies should be excluded from the review subject. At this phase, 147 studies were excluded because full-length articles were not available to download; thus, 64 full-length articles were downloaded. To ensure the comprehensiveness of the initial search results, the snowball technique was used: in the second stage, a manual search (S2) was performed on the references of all the relevant papers through Google Scholar (Fig. 1), and 1 additional study was found. The quality assessment criteria were then applied to the 65 studies, and 25 studies were excluded because they did not fulfil the quality assessment criteria. Therefore, a total of 40 highly relevant primary studies were included in this research. The selection of studies from different databases and sources before and after results retrieval is shown in Table 2. The number of studies retrieved from each source was: Science Direct (90), SAGE Journals (50), Emerald Insight (81), Springer Link (38), IEEE Xplore (158), ACM Digital Library (73), Taylor and Francis (17), and AIS e-Library (52). Google Scholar was employed only for the second round of manual search.
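The counts reported above can be traced step by step. The short sketch below re-derives them, using only the figures stated in the text (559 automatic results, 348 duplicates, 147 exclusions for unavailable full text, 1 study added by manual search, and 25 quality-based exclusions); no underlying database results are recomputed.

```python
# Sketch of the study-selection flow described above; all numbers are taken
# from the text of this review, not recomputed from the source databases.
automatic_hits = 559
duplicates = 348
after_dedup = automatic_hits - duplicates             # studies screened after deduplication
not_downloadable = 147
full_text_available = after_dedup - not_downloadable  # full-length articles downloaded
manual_search_additions = 1                           # snowballing via Google Scholar
quality_assessed = full_text_available + manual_search_additions
failed_quality = 25
primary_studies = quality_assessed - failed_quality   # final set of primary studies

assert after_dedup == 211
assert full_text_available == 64
assert quality_assessed == 65
assert primary_studies == 40
```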

Figure 1. Selection Process

Quality assessment

According to Kitchenham and Charters (2007), quality assessment plays a significant role in checking the quality of primary research. The rigor of the assessment depends on the quality of the assessment instruments, which can be based on a checklist of components or a set of questions whose purpose is to analyze the quality of every study. For this study, four quality assessment standards were created to evaluate the quality of each research study. The assessment standards are given as:

QA1. Is the topic addressed in the study related to big data in education?

QA2. Does the study describe the context?

QA3. Is the research method given in the paper?

QA4. Is the data collection method described in the article?

The four quality assessment standards were applied to the 65 selected studies to determine the integrity of each research study. The results were categorized into low, medium, and high. The quality of each study depends on its total score. Each quality assessment question is scored out of two points: if the study fully meets the standard, a score of 2 is awarded; in the case of partial fulfillment, a score of 1 is given; and if none of the assessment standards is met, a score of 0 is awarded. If the total score is below 4, the study is counted as 'low'; exactly 4 is considered 'medium'; and above 4 is regarded as 'high'. The details of the studies are presented in Table 11 in Appendix B. Twenty-five studies were excluded because they did not meet the quality assessment standard. Therefore, based on the quality assessment standard, a total of 40 primary studies were included in this systematic literature review (Table 10 in Appendix A). The scores of the studies (in terms of low, medium, and high) are presented in Fig. 2.
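A compact way to express this scoring rule is sketched below: each of the four quality assessment questions contributes 0, 1, or 2 points, and the total is mapped to 'low' (below 4), 'medium' (exactly 4), or 'high' (above 4). The function name and the example score vectors are illustrative only; they do not correspond to specific studies in the review.

```python
# Sketch of the quality-assessment scoring described above.
# scores: one value per QA1-QA4, where 2 = fully met, 1 = partially met, 0 = not met.
def quality_category(scores):
    """Map four QA scores to the 'low' / 'medium' / 'high' bands used in this review."""
    if len(scores) != 4 or any(s not in (0, 1, 2) for s in scores):
        raise ValueError("expected four scores, each 0, 1, or 2")
    total = sum(scores)
    if total < 4:
        return "low"
    if total == 4:
        return "medium"
    return "high"

# Hypothetical examples (not taken from the actual assessment in Appendix B):
print(quality_category([2, 2, 1, 1]))  # -> "high"   (total 6)
print(quality_category([1, 1, 1, 1]))  # -> "medium" (total 4)
print(quality_category([2, 1, 0, 0]))  # -> "low"    (total 3)
```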

Figure 2. Scores of studies

Data extraction and synthesis

The data extraction and synthesis process was carried out by reading the 65 primary studies. The studies were thoroughly studied, and the required details were extracted accordingly. The objective of this stage was to extract the needed facts and figures from the primary studies. The data were collected under the aspects of research ID, names of authors, title of the research, publication year and venue, research theme, research context, research method, and data collection method. Data were extracted from the 65 studies using these aspects. The description of each item is given in Table 3. The data extracted from all primary studies are tabulated. The process of data synthesis is presented in the next section.
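The extracted aspects can be thought of as one record per primary study. A minimal sketch of such a record is given below; the field names follow the aspects listed above, while the exact labels used in Table 3 and the example values are illustrative assumptions rather than entries from the actual extraction form.

```python
# Sketch of a single data-extraction record, mirroring the aspects listed above.
from dataclasses import dataclass

@dataclass
class ExtractionRecord:
    research_id: str            # e.g. "S01"
    authors: str
    title: str
    publication_year: int
    publication_venue: str      # journal, conference, workshop, or symposium
    research_theme: str         # one of the four themes identified in this review
    research_context: str
    research_method: str        # e.g. qualitative, quantitative, mixed method, review
    data_collection_method: str

# Hypothetical example record (values are illustrative, not from the actual dataset).
example = ExtractionRecord(
    research_id="S01",
    authors="Author A, Author B",
    title="An illustrative primary study on big data in education",
    publication_year=2017,
    publication_venue="Example journal",
    research_theme="Learner's behavior and performance",
    research_context="Higher education",
    research_method="Quantitative",
    data_collection_method="Survey",
)
```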

Figure 3 presents the allocation of studies based on their publication sources. All publications were from high-impact journals, high-level conferences, and workshops. The primary studies comprise 21 journal articles, 17 conference papers, 1 workshop paper, and 1 symposium paper. Fourteen studies were from Science Direct journals and conferences, 5 primary studies were from the SAGE group, and 1 primary study was from Springer Link. Six studies were from IEEE conferences and 2 studies were from an IEEE symposium and workshop. Moreover, 1 primary study was from an AISeL conference, 4 studies were from Emerald Insight journals, 5 studies were from ACM conferences, and 2 studies were from Taylor and Francis. The summary of publication sources is given in Table 4.

Figure 3. Allocation of studies based on publication

Temporal view of studies

The selection period of this study is from January 2014 to March 2019. The yearly allocation of primary studies is presented in Fig. 4. The big data in education trend started in 2014 and gradually gained popularity. In 2015, 8 studies were published in this domain. The number of studies rose in 2017, which saw the highest number of publications in the big data in education realm, with 12 studies published. This trend continued in 2018, when 11 studies belonging to big data in education were published. The trend was still continuing in 2019, although this review covers only the period up to March 2019; 4 studies had been published by then.

Figure 4. Temporal view of papers

In order to find the total citation count for the studies, Google Scholar was used. The number of citations is shown in Fig. 5. It was observed that 28 studies were cited by other sources between 1 and 50 times, 11 studies were not cited by any other source, and 1 study was cited 127 times. The top-cited studies with their titles are presented in Table 5, which provides general verification; the data provided here is not intended for comparison among the studies.

Figure 5. Number of citations

Research methodologies

The research methods employed by the primary studies are shown in Fig. 6. The majority of them are review-based studies, conducted in different educational contexts and on big data; reviews covered 28% of the primary studies. The second most used research method was quantitative, covering 23% of the primary studies. Only 3% of the studies were based on a mixed-method approach, and the design science method also covered 3% of the primary studies. A further 20% of the studies used qualitative research methods, whereas the research method was not stated in the remaining 25% of the studies.

Figure 6. Distribution of research methods of primary studies

Data collection methods

The data collection methods used by the primary studies are shown in Fig. 7. The primary studies employed different data collection methods, with the majority drawing on extant literature. Five studies conducted surveys, covering 13% of the primary studies. Four studies carried out experiments for data collection, covering 10% of the primary studies. Six studies conducted interviews for data collection, accounting for 15% of the primary studies. Four studies used data logs, accounting for 10% of the primary studies. Two studies collected data through observations, 1 study used social network data, and 3 studies used website data; the observational, social network data, and website-based studies covered 5%, 3%, and 8% of the primary studies, respectively. Moreover, 11 studies used extant literature and 1 study extracted data from a focus group discussion; the extant literature and focus group-based studies covered 28% and 3% of the primary studies, respectively. The data collection method was not available for the remaining 3 studies.
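The percentages quoted above follow directly from the per-method counts over the 40 primary studies. The sketch below re-derives them; the counts are taken from the text, while the half-up rounding to whole percentages is an assumption made so that the output matches the figures quoted above.

```python
# Sketch re-deriving the data-collection percentages from the counts in the text
# (40 primary studies in total).
counts = {
    "survey": 5,
    "experiment": 4,
    "interview": 6,
    "data logs": 4,
    "observation": 2,
    "social network data": 1,
    "website data": 3,
    "extant literature": 11,
    "focus group": 1,
    "not reported": 3,
}

total = sum(counts.values())
assert total == 40

def pct(n, total):
    # Round half up so the output matches the whole-number percentages in the text.
    return int(100 * n / total + 0.5)

for method, n in counts.items():
    print(f"{method}: {n} studies (~{pct(n, total)}%)")
# e.g. survey: 5 studies (~13%), interview: 6 studies (~15%), extant literature: 11 studies (~28%)
```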

Figure 7. Distribution of data collection methods of primary studies

What research themes have been addressed in educational studies of big data?

A theme refers to an idea, topic, or area covered by different research studies. The central idea reflects the theme, which can be helpful in developing real insight and analysis. A theme can be a single word or a combination of words (Rimmon-Kenan, 1995). This study classified big data research themes into four groups (Table 6). Figure 8 shows a mind map of big data in education research themes, sub-themes, and methodologies.

Figure 8. Mind map of big data in education research themes, sub-themes, and the methodologies

Figure 9 presents the research themes under big data in education, namely learner's behavior and performance, modelling and educational data warehouse, improvement of the educational system, and integration of big data into the curriculum.

Figure 9. Research themes

The first research theme is learner's behavior and performance. This theme covers 21 studies, which constitute 53% of the overall primary studies (Fig. 9). The studies in this theme address teaching and learning analytics, big data frameworks, user behavior and attitude, learners' strategies, adaptive learning, and satisfaction. A total of 8 studies rely on teaching and learning analytics (Table 7). Three studies deal with big data frameworks, 6 studies concentrate on user behavior and attitude, and 2 studies deal with learning strategies. Adaptive learning and satisfaction are covered by 1 study each. In this theme, 2 studies conducted surveys, 4 studies carried out experiments, and 1 study employed the observational method. Five studies drew on extant literature. In addition, 4 studies used event log data and 5 conducted interviews (Fig. 10).

Figure 10. Number of studies and data collection methods

The second theme focuses on modeling and educational data warehouses. The 6 studies in this theme cover 15% of the primary studies. These studies investigated the cloud environment, big data modeling, cluster analysis, and data warehouses for educational purposes (Table 8). Three studies introduced big data modeling in education and highlighted the potential for organizing data from multiple sources. One study analyzed a data warehouse with big data tools (Hadoop), 1 study analyzed the accessibility of huge academic data in a cloud computing environment, and 1 study used clustering techniques and a data warehouse for educational purposes. In this theme, 4 studies drew on extant literature, 1 study conducted a survey, and 1 study used social network data.

The third theme concentrates on the improvement of the educational system. The 9 studies in this theme cover 23% of the primary studies. They address statistical tools and measurements, educational research implications, big data training, the introduction of a ranking system, usage of websites, and big data educational challenges and effectiveness (Table 9). Two studies considered statistical tools and measurements; educational research implications, the ranking system, usage of websites, and big data training were covered by 1 study each; and 3 studies considered big data effectiveness and challenges. In this theme, 1 study conducted a survey for data collection, 2 studies used website traffic data, 1 study used the observational method, and 3 studies drew on extant literature.

The fourth theme concentrates on incorporating big data approaches into the curriculum. The 4 studies in this theme cover 10% of the primary studies and consider the introduction of big data topics into different courses. One study conducted interviews, 1 study employed the survey method, and 1 study used a focus group discussion.

Twenty percent of the studies (Fig. 6) used qualitative research methods (Dinter et al., 2017; Veletsianos et al., 2016; Yang & Du, 2016). Qualitative methods are mostly applicable for observing a single variable and its relationship with other variables; however, this method does not quantify relationships. In qualitative research, understanding is attained through 'wording' (Chaurasia & Frieda Rosin, 2017). Behaviors, attitude, satisfaction, and overall learning performance are related to human phenomena (Cantabella et al., 2019; Elia et al., 2018; Sedkaoui & Khelfaoui, 2019). Qualitative studies are not statistically tested (Chaurasia & Frieda Rosin, 2017). Big data educational studies that employed qualitative methods lack some of the certainty present in quantitative research methods. Therefore, future research might quantify educational big data applications and their impact on higher education.

Six studies conducted interviews for data collection (Chaurasia et al., 2018; Chaurasia & Frieda Rosin, 2017; Nelson & Pouchard, 2017; Troisi et al., 2018; Veletsianos et al., 2016). Two studies used the observational method (Maldonado-Mahauad et al., 2018; Sooriamurthi, 2018) and one study conducted a focus group discussion (Buffum et al., 2014) for data collection (Fig. 10). The observational studies were conducted in uncontrolled environments, and the results of such studies are sometimes subject to self-selection bias. There is also a chance of ambiguity in data collection where human language and observation are involved. The findings of interviews, observations, and focus group discussions are limited and cannot be extended to a wider population of learners (Dinter et al., 2017).

Four big data educational studies analyzed event log data and conducted interviews (Cantabella et al., 2019; Hirashima et al., 2017; Liang et al., 2016; Yang & Du, 2016). However, longitudinal data are more appropriate for multidimensional measurements and for analyzing large data sets in the future (Sorensen, 2018).

Eight studies considered teaching and learning analytics (Chaurasia et al., 2018; Chaurasia & Frieda Rosin, 2017; Dessì et al., 2019; Roy & Singh, 2017). There is limited research covering the aspects of learning environments, ethical and cultural values, and government support in the adoption of educational big data (Yang & Du, 2016). In the future, comparisons of big data in different learning environments, ethical and cultural values, government support, and training in adopting big data in higher education can be covered through leading journals and conferences.

Three studies are related to big data frameworks for education (Cantabella et al., 2019; Muthukrishnan & Yasin, 2018). However, the existing frameworks do not cover organizational and institutional cultures and lack robust theoretical grounds (Dubey & Gunasekaran, 2015; Muthukrishnan & Yasin, 2018). In the future, big data educational frameworks that concentrate on theories and the adoption of big data technology are recommended, as are the extension of existing models and the interpretation of data models. This will support better decisions and ensure predictive analysis in the academic realm. Moreover, further relations can be tested by integrating other constructs such as university size and type (Chaurasia et al., 2018).

Three studies dwelled on big data modeling (Pardos, 2017; Petrova-Antonova et al., 2017; Wassan, 2015). These models do not integrate with present systems (Santoso & Yulia, 2017). Therefore, efficient research solutions that can manage educational data, new data interchange, and resources are required in the future. One study explored a cloud-based solution for managing academic big data (Logica & Magdalena, 2015); however, this solution is expensive. In the future, a combination of LMS supported by open-source applications and software can be used. This development will help universities to obtain benefits from a unified LMS and to introduce new trends and economic opportunities for the academic industry. A data warehouse with big data tools was investigated by one study (Santoso & Yulia, 2017). A manifold node cluster can be implemented to process and access structured and unstructured data in the future (Ramos et al., 2015). In addition, new techniques based on relational and non-relational databases and the development of index catalogs are recommended to improve the overall retrieval system. Furthermore, the applicability of the latest analytical tools and parallel programming models needs to be tested for academic big data. MapReduce, MongoDB, Pig, Cassandra, YARN, and Mahout are suggested for exploring and analyzing educational big data (Wassan, 2015). These tools will improve the analysis process and help in the development of reliable models for academic analytics.

One study detected ICT factors through data mining techniques and tools in order to enhance educational effectiveness and improve the educational system (Martínez-Abad et al., 2018). Additionally, two studies employed big data analytic tools on popular websites to examine academic users' interests (Martínez-Abad et al., 2018; Qiu et al., 2015). In future research, more targeted strategies and regions can be selected for organizing the academic data, and in-depth data mining techniques can be applied according to the nature of the data. Future research can validate the findings by applying them to other educational websites, and the present research can be extended by analyzing socioeconomic backgrounds and the use of other websites (Qiu et al., 2015).

Two research studies were conducted on measurements and the selection of statistical software for educational big data (Ozgur et al., 2015; Selwyn, 2014). However, no single statistical software is fit for every academic project. Therefore, in future research, 'all-in-one' statistical software is recommended for big data in order to fulfill the needs of all academic projects. Four research studies were based on incorporating big data into academic curricula (Buffum et al., 2014; Sledgianowski et al., 2017). However, in order to integrate big data into the curriculum, significant changes are required. Firstly, in future research, curricula need to be redeveloped or restructured according to the level and learning environment (Nelson & Pouchard, 2017). Secondly, the training factor, learning objectives, and outcomes should be well designed in future studies. Lastly, comparable exercises, learning activities, and assessment plans need to be well structured before integrating big data into curricula (Dinter et al., 2017).

Discussion and conclusion

Big data has become an essential part of the educational realm. This study presented a systematic review of the literature on big data in the educational sector. Three research questions were formulated to present trends and themes in big data educational studies and to identify limitations and directions for further research. The primary studies were collected by performing a systematic search through the IEEE Xplore, ScienceDirect, Emerald Insight, AIS Electronic Library, SAGE, ACM Digital Library, Springer Link, and Taylor and Francis databases, as well as Google Scholar. Finally, 40 studies that met the research protocol were selected. These studies were published between January 2014 and March 2019. From the findings of this study, it can be concluded that 53% of the extant studies were conducted on the learner's behavior and performance theme, 15% of the studies on modeling and educational data warehouse, and 23% of the studies on the improvement of the educational system theme. However, only 10% of the studies were on the integration of big data into the curriculum theme.

Thus, a large number of studies were conducted on the learner's behavior and performance theme, while the other themes gained less attention. Therefore, more research is expected in the future on the modeling and educational data warehouse, improvement of the educational system, and integration of big data into the curriculum themes.

It was found that 20% of the studies used qualitative research methods; 6 studies conducted interviews, 2 studies used the observational method, and 1 study conducted a focus group discussion for data collection. The findings of interviews, observations, and focus group discussions are limited and cannot be extended to a wider population of learners. Therefore, future research might quantify educational big data applications and their impact on higher education. Longitudinal data are more appropriate for multidimensional measurements and future analysis of large data sets. Eight studies were carried out on teaching and learning analytics. In the future, comparisons of big data in different learning environments, ethical and cultural values, government support, and training to adopt big data in higher education can be covered through leading journals and conferences.

Three studies were related to big data frameworks for education. In the future, big data educational frameworks that build on theories and extend existing models are recommended. Three studies concentrated on big data modeling; these models do not integrate with present systems. Therefore, efficient solutions that can manage educational data, new data interchange, and resources are required in future studies. Two studies explored a cloud-based solution for managing academic big data and investigated a data warehouse with big data tools. In the future, a manifold node cluster can be implemented for processing and accessing structured and unstructured data. The applicability of the latest analytical tools and parallel programming models also needs to be tested for academic big data.

One study considered the detection of ICT factors through data mining techniques, and 2 studies employed big data analytic tools on popular websites to examine academic users' interests. More targeted strategies and regions can be selected for organizing the academic data in the future. Four research studies focused on incorporating big data into academic curricula. However, big data based curricula need to be redeveloped by considering the learning objectives; in the future, well-designed learning activities for big data curricula are suggested.

Research implications

This study has two-fold implications for stakeholders and researchers. Firstly, this review explored the trends in papers published in the big data in education realm. The identified trends uncover the allocation of studies, publication sources, temporal view, and most cited papers, and highlight the research methods used in these studies. The described trends can provide opportunities and new ideas to researchers and help them choose accurate directions in future studies.

Secondly, this research explored the themes, sub-themes, and methodologies in the big data in education domain. The classified themes, sub-themes, and methodologies present a comprehensive overview of the existing literature on big data in education. The described themes and sub-themes can help researchers to identify new research gaps and avoid repeating themes in future studies. Meanwhile, they can help researchers to focus on combinations of different themes in order to uncover new insights on how big data can improve the learning and teaching process. In addition, the illustrated methodologies can be useful for researchers in selecting a method according to the nature of their study in the future.

The identified research also has implications for stakeholders regarding the holistic expansion of educational competencies. The identified themes give new insight to universities to plan mixed learning programs that combine conventional learning with web-based learning. This permits students to accomplish focused learning outcomes, engaging in exercises at an ideal pace. It can help teachers to understand ways to gauge students' learning behavior and attitude and to advance their teaching strategy accordingly. Understanding the latest trends in big data and education is of growing importance for ministries of education, as they can develop flexible policies to support institutions in improving the educational system.

Lastly, the identified limitations and possible future directions can provide guidelines for researchers about what has been explored and what needs to be explored in the future. In addition, stakeholders can extract ideas to educate future cohorts and to comprehend learning and academic requirements.

Availability of data and materials

Not applicable.

Ahmed, E., Yaqoob, I., Hashem, I. A. T., Shuja, J., Imran, M., Guizani, N., & Bakhsh, S. T. (2018). Recent advances and challenges in mobile big data. IEEE Communications Magazine, 56(2), 102–108. https://doi.org/10.1109/MCOM.2018.1700294.

Anshari, M., Alas, Y., & Yunus, N. (2019). A survey study of smartphones behavior in Brunei: A proposal of Modelling big data strategies. In Multigenerational Online Behavior and Media Use: Concepts, Methodologies, Tools, and Applications , (pp. 201–214). IGI global.

Black, P., & Wiliam, D. (2018). Classroom assessment and pedagogy. Assessment in Education: Principles, Policy & Practice , 25 (6), 551–575. https://doi.org/10.1080/0969594X.2018.1441807 .


Buffum, P. S., Martinez-Arocho, A. G., Frankosky, M. H., Rodriguez, F. J., Wiebe, E. N., & Boyer, K. E. (2014, March). CS principles goes to middle school: Learning how to teach big data. In Proceedings of the 45th ACM technical symposium on Computer science education, (pp. 151–156). New York: ACM. https://doi.org/10.1145/2538862.2538949.

Camargo Fiorini, P., Seles, B. M. R. P., Jabbour, C. J. C., Mariano, E. B., & Sousa Jabbour, A. B. L. (2018). Management theory and big data literature: From a review to a research agenda. International Journal of Information Management , 43 , 112–129. https://doi.org/10.1016/j.ijinfomgt.2018.07.005 .

Cantabella, M., Martínez-España, R., Ayuso, B., Yáñez, J. A., & Muñoz, A. (2019). Analysis of student behavior in learning management systems through a big data framework. Future Generation Computer Systems , 90 (2), 262–272. https://doi.org/10.1016/j.future.2018.08.003 .

Chae, B. K. (2019). A general framework for studying the evolution of the digital innovation ecosystem: The case of big data. International Journal of Information Management , 45 , 83–94. https://doi.org/10.1016/j.ijinfomgt.2018.10.023 .

Chaurasia, S. S., & Frieda Rosin, A. (2017). From big data to big impact: Analytics for teaching and learning in higher education. Industrial and Commercial Training , 49 (7), 321–328. https://doi.org/10.1108/ict-10-2016-0069 .

Chaurasia, S. S., Kodwani, D., Lachhwani, H., & Ketkar, M. A. (2018). Big data academic and learning analytics. International Journal of Educational Management , 32 (6), 1099–1117. https://doi.org/10.1108/ijem-08-2017-0199 .

Coccoli, M., Maresca, P., & Stanganelli, L. (2017). The role of big data and cognitive computing in the learning process. Journal of Visual Languages & Computing , 38 , 97–103. https://doi.org/10.1016/j.jvlc.2016.03.002 .

De Mauro, A., Greco, M., & Grimaldi, M. (2016). A formal definition of big data based on its essential features. Library Review , 65 (3), 122–135. https://doi.org/10.1108/LR-06-2015-0061 .

Demchenko, Y., Grosso, P., De Laat, C., & Membrey, P. (2013). Addressing big data issues in scientific data infrastructure. In Collaboration Technologies and Systems (CTS), 2013 International Conference on , (pp. 48–55). San Diego: IEEE. https://doi.org/10.1109/CTS.2013.6567203 .

Dessì, D., Fenu, G., Marras, M., & Reforgiato Recupero, D. (2019). Bridging learning analytics and cognitive computing for big data classification in micro-learning video collections. Computers in Human Behavior , 92 (1), 468–477. https://doi.org/10.1016/j.chb.2018.03.004 .

Dinter, B., Jaekel, T., Kollwitz, C., & Wache, H. (2017). Teaching big data management – An active learning approach for higher education. In Proceedings of the pre-ICIS 2017 SIGDSA symposium, (pp. 1–17). North America: AISeL.

Dresner Advisory Services. (2017). Big data adoption: State of the market. ZoomData. Retrieved from https://www.zoomdata.com/master-class/state-market/big-data-adoption


Dubey, R., & Gunasekaran, A. (2015). Education and training for successful career in big data and business analytics. Industrial and Commercial Training , 47 (4), 174–181. https://doi.org/10.1108/ict-08-2014-0059 .

Elia, G., Solazzo, G., Lorenzo, G., & Passiante, G. (2018). Assessing learners’ satisfaction in collaborative online courses through a big data approach. Computers in Human Behavior , 92 , 589–599. https://doi.org/10.1016/j.chb.2018.04.033 .

Gupta, D., & Rani, R. (2018). A study of big data evolution and research challenges. Journal of Information Science. , 45 (3), 322–340. https://doi.org/10.1177/0165551518789880 .

Herschel, R., & Miori, V. M. (2017). Ethics & big data. Technology in Society , 49 , 31–36. https://doi.org/10.1016/j.techsoc.2017.03.003 .

Hirashima, T., Supianto, A. A., & Hayashi, Y. (2017, September). Model-based approach for educational big data analysis of learners thinking with process data. In 2017 International Workshop on Big Data and Information Security (IWBIS) (pp. 11-16). San Diego: IEEE. https://doi.org/10.1177/0165551518789880

Holland, A. A. (2019). Effective principles of informal online learning design: A theory-building metasynthesis of qualitative research. Computers & Education , 128 , 214–226. https://doi.org/10.1016/j.compedu.2018.09.026 .

Kalaian, S. A., Kasim, R. M., & Kasim, N. R. (2019). Descriptive and predictive analytical methods for big data. In Web Services: Concepts, Methodologies, Tools, and Applications , (pp. 314–331). USA: IGI global. https://doi.org/10.4018/978-1-5225-7501-6.ch018 .

Kamilaris, A., Kartakoullis, A., & Prenafeta-Boldú, F. X. (2017). A review on the practice of big data analysis in agriculture. Computers and Electronics in Agriculture , 143 , 23–37. https://doi.org/10.1016/j.compag.2017.09.037 .

Kitchenham, B. (2004). Procedures for performing systematic reviews. Keele, UK, Keele University , 33 (2004), 1–26.

Kitchenham, B., & Charters, S. (2007). Guidelines for performing systematic literature reviews in software engineering version 2.3. Engineering , 45 (4), 13–65.

Li, Y., & Zhai, X. (2018). Review and prospect of modern education using big data. Procedia Computer Science, 129(3), 341–347. https://doi.org/10.1016/j.procs.2018.03.085.

Liang, J., Yang, J., Wu, Y., Li, C., & Zheng, L. (2016). Big Data Application in Education: Dropout Prediction in Edx MOOCs. In Paper presented at the 2016 IEEE second international conference on multimedia big data (BigMM) , (pp. 440–443). USA: IEEE. https://doi.org/10.1109/BigMM.2016.70 .

Logica, B., & Magdalena, R. (2015). Using big data in the academic environment. Procedia Economics and Finance , 33 (2), 277–286. https://doi.org/10.1016/s2212-5671(15)01712-8 .

Maldonado-Mahauad, J., Pérez-Sanagustín, M., Kizilcec, R. F., Morales, N., & Munoz-Gama, J. (2018). Mining theory-based patterns from big data: Identifying self-regulated learning strategies in massive open online courses. Computers in Human Behavior, 80(1), 179–196. https://doi.org/10.1016/j.chb.2017.11.011.

Martínez-Abad, F., Gamazo, A., & Rodríguez-Conde, M. J. (2018). Big Data in Education. In Paper presented at the proceedings of the sixth international conference on technological ecosystems for enhancing Multiculturality - TEEM'18, Salamanca, Spain , (pp. 145–150). New York: ACM. https://doi.org/10.1145/3284179.3284206 .

Mikalef, P., Pappas, I. O., Krogstie, J., & Giannakos, M. (2018). Big data analytics capabilities: A systematic literature review and research agenda. Information Systems and e-Business Management, 16(3), 547–578. https://doi.org/10.1007/s10257-017-0362-y.

Mohammadpoor, M., & Torabi, F. (2018). Big Data analytics in oil and gas industry: An emerging trend. Petroleum. In press. https://doi.org/10.1016/j.petlm.2018.11.001 .

Muthukrishnan, S. M., & Yasin, N. B. M. (2018). Big Data Framework for Students’ Academic. Paper presented at the symposium on computer applications & industrial electronics (ISCAIE), Penang, Malaysia (pp. 376–382). USA: IEEE. https://doi.org/10.1109/ISCAIE.2018.8405502

Neilson, A., Daniel, B., & Tjandra, S. (2019). Systematic review of the literature on big data in the transportation Domain: Concepts and Applications. Big Data Research . In press. https://doi.org/10.1016/j.bdr.2019.03.001 .

Nelson, M., & Pouchard, L. (2017). A pilot “big data” education modular curriculum for engineering graduate education: Development and implementation. In Paper presented at the Frontiers in education conference (FIE), Indianapolis, USA , (pp. 1–5). USA: IEEE. https://doi.org/10.1109/FIE.2017.8190688 .

Nie, M., Yang, L., Sun, J., Su, H., Xia, H., Lian, D., & Yan, K. (2018). Advanced forecasting of career choices for college students based on campus big data. Frontiers of Computer Science , 12 (3), 494–503. https://doi.org/10.1007/s11704-017-6498-6 .

Oi, M., Yamada, M., Okubo, F., Shimada, A., & Ogata, H. (2017). Reproducibility of findings from educational big data. In Paper presented at the proceedings of the Seventh International Learning Analytics & Knowledge Conference , (pp. 536–537). New York: ACM. https://doi.org/10.1145/3027385.3029445 .

Ong, V. K. (2015). Big Data and Its Research Implications for Higher Education: Cases from UK Higher Education Institutions. In Paper presented at the 2015 IIAI 4th international confress on advanced applied informatics , (pp. 487–491). USA: IEEE. https://doi.org/10.1109/IIAI-AAI.2015.178 .

Ozgur, C., Kleckner, M., & Li, Y. (2015). Selection of statistical software for solving big data problems. SAGE Open , 5 (2), 59–94. https://doi.org/10.1177/2158244015584379 .

Pardos, Z. A. (2017). Big data in education and the models that love them. Current Opinion in Behavioral Sciences , 18 (2), 107–113. https://doi.org/10.1016/j.cobeha.2017.11.006 .

Petrova-Antonova, D., Georgieva, O., & Ilieva, S. (2017, June). Modelling of educational data following big data value chain. In Proceedings of the 18th International Conference on Computer Systems and Technologies (pp. 88–95). New York City: ACM. https://doi.org/10.1145/3134302.3134335

Qiu, R. G., Huang, Z., & Patel, I. C. (2015, June). A big data approach to assessing the US higher education service. In 2015 12th International Conference on Service Systems and Service Management (ICSSSM) (pp. 1–6). New York: IEEE. https://doi.org/10.1109/ICSSSM.2015.7170149

Ramos, T. G., Machado, J. C. F., & Cordeiro, B. P. V. (2015). Primary education evaluation in Brazil using big data and cluster analysis. Procedia Computer Science, 55(1), 1031–1039. https://doi.org/10.1016/j.procs.2015.07.061.

Rimmon-Kenan, S. (1995). What Is Theme and How Do We Get at It?. Thematics: New Approaches, 9–20.

Roy, S., & Singh, S. N. (2017). Emerging trends in applications of big data in educational data mining and learning analytics. In 2017 7th International Conference on Cloud Computing, Data Science & Engineering-Confluence , (pp. 193–198). New York: IEEE. https://doi.org/10.1109/confluence.2017.7943148 .

Saggi, M. K., & Jain, S. (2018). A survey towards an integration of big data analytics to big insights for value-creation. Information Processing & Management , 54 (5), 758–790. https://doi.org/10.1016/j.ipm.2018.01.010 .

Santoso, L. W., & Yulia (2017). Data warehouse with big data Technology for Higher Education. Procedia Computer Science , 124 (1), 93–99. https://doi.org/10.1016/j.procs.2017.12.134 .

Sedkaoui, S., & Khelfaoui, M. (2019). Understand, develop and enhance the learning process with big data. Information Discovery and Delivery , 47 (1), 2–16. https://doi.org/10.1108/idd-09-2018-0043 .

Selwyn, N. (2014). Data entry: Towards the critical study of digital data and education. Learning, Media and Technology , 40 (1), 64–82. https://doi.org/10.1080/17439884.2014.921628 .

Shahat, O. A. (2019). A novel big data analytics framework for smart cities. Future Generation Computer Systems , 91 (1), 620–633. https://doi.org/10.1016/j.future.2018.06.046 .

Shorfuzzaman, M., Hossain, M. S., Nazir, A., Muhammad, G., & Alamri, A. (2019). Harnessing the power of big data analytics in the cloud to support learning analytics in mobile learning environment. Computers in Human Behavior , 92 (1), 578–588. https://doi.org/10.1016/j.chb.2018.07.002 .

Sivarajah, U., Kamal, M. M., Irani, Z., & Weerakkody, V. (2017). Critical analysis of big data challenges and analytical methods. Journal of Business Research , 70 , 263–286. https://doi.org/10.1016/j.jbusres.2016.08.001 .

Sledgianowski, D., Gomaa, M., & Tan, C. (2017). Toward integration of big data, technology and information systems competencies into the accounting curriculum. Journal of Accounting Education , 38 (1), 81–93. https://doi.org/10.1016/j.jaccedu.2016.12.008 .

Sooriamurthi, R. (2018). Introducing big data analytics in high school and college. In Proceedings of the 23rd Annual ACM Conference on Innovation and Technology in Computer Science Education (pp. 373–374). New York: ACM. https://doi.org/10.1145/3197091.3205834

Sorensen, L. C. (2018). "Big data" in educational administration: An application for predicting school dropout risk. Educational Administration Quarterly , 45 (1), 1–93. https://doi.org/10.1177/0013161x18799439 .


Su, Y. S., Ding, T. J., Lue, J. H., Lai, C. F., & Su, C. N. (2017). Applying big data analysis technique to students’ learning behavior and learning resource recommendation in a MOOCs course. In 2017 International conference on applied system innovation (ICASI) (pp. 1229–1230). New York: IEEE. https://doi.org/10.1109/ICASI.2017.7988114

Troisi, O., Grimaldi, M., Loia, F., & Maione, G. (2018). Big data and sentiment analysis to highlight decision behaviours: A case study for student population. Behaviour & Information Technology , 37 (11), 1111–1128. https://doi.org/10.1080/0144929x.2018.1502355 .

Ur Rehman, M. H., Yaqoob, I., Salah, K., Imran, M., Jayaraman, P. P., & Perera, C. (2019). The role of big data analytics in industrial internet of things. Future Generation Computer Systems , 92 , 578–588. https://doi.org/10.1016/j.future.2019.04.020 .

Veletsianos, G., Reich, J., & Pasquini, L. A. (2016). The Life Between Big Data Log Events. AERA Open , 2 (3), 1–45. https://doi.org/10.1177/2332858416657002 .

Wang, Y., Kung, L., & Byrd, T. A. (2018). Big data analytics: Understanding its capabilities and potential benefits for healthcare organizations. Technological Forecasting and Social Change , 126 , 3–13. https://doi.org/10.1016/j.techfore.2015.12.019 .

Wassan, J. T. (2015). Discovering big data modelling for educational world. Procedia - Social and Behavioral Sciences , 176 , 642–649. https://doi.org/10.1016/j.sbspro.2015.01.522 .

Wolfert, S., Ge, L., Verdouw, C., & Bogaardt, M. J. (2017). Big data in smart farming–a review. Agricultural Systems , 153 , 69–80. https://doi.org/10.1016/j.agsy.2017.01.023 .

Wu, P. J., & Lin, K. C. (2018). Unstructured big data analytics for retrieving e-commerce logistics knowledge. Telematics and Informatics , 35 (1), 237–244. https://doi.org/10.1016/j.tele.2017.11.004 .

Xu, L. D., & Duan, L. (2019). Big data for cyber physical systems in industry 4.0: A survey. Enterprise Information Systems , 13 (2), 148–169. https://doi.org/10.1080/17517575.2018.1442934 .

Yang, F., & Du, Y. R. (2016). Storytelling in the age of big data. Asia Pacific Media Educator , 26 (2), 148–162. https://doi.org/10.1177/1326365x16673168 .

Yassine, A., Singh, S., Hossain, M. S., & Muhammad, G. (2019). IoT big data analytics for smart homes with fog and cloud computing. Future Generation Computer Systems , 91 (2), 563–573. https://doi.org/10.1016/j.future.2018.08.040 .

Zhang, M. (2015). Internet use that reproduces educational inequalities: Evidence from big data. Computers & Education , 86 (1), 212–223. https://doi.org/10.1016/j.compedu.2015.08.007 .

Zheng, M., & Bender, D. (2019). Evaluating outcomes of computer-based classroom testing: Student acceptance and impact on learning and exam performance. Medical Teacher , 41 (1), 75–82. https://doi.org/10.1080/0142159X.2018.1441984 .


Acknowledgements

Not applicable

Author information

Authors and affiliations.

Department of Information Systems, Faculty of Computer Science & Information Technology, University of Malaya, 50603, Kuala Lumpur, Malaysia

Maria Ijaz Baig, Liyana Shuib & Elaheh Yadegaridehkordi


Contributions

Maria Ijaz Baig composed the manuscript under the guidance of Elaheh Yadegaridehkordi. Liyana Shuib supervised the project. All authors discussed the results and contributed to the final manuscript.

Corresponding author

Correspondence to Liyana Shuib .

Ethics declarations

Competing interests.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Baig, M.I., Shuib, L. & Yadegaridehkordi, E. Big data in education: a state of the art, limitations, and future research directions. Int J Educ Technol High Educ 17 , 44 (2020). https://doi.org/10.1186/s41239-020-00223-0


Received : 09 March 2020

Accepted : 10 June 2020

Published : 02 November 2020

DOI : https://doi.org/10.1186/s41239-020-00223-0


  • Data science applications in education
  • Learning communities
  • Teaching/learning strategies


  • Open access
  • Published: 06 January 2022

The use of Big Data Analytics in healthcare

  • Kornelia Batko   ORCID: orcid.org/0000-0001-6561-3826 1 &
  • Andrzej Ślęzak 2  

Journal of Big Data volume 9, Article number: 3 (2022)


The introduction of Big Data Analytics (BDA) in healthcare will allow the use of new technologies both in the treatment of patients and in health management. The paper aims at analyzing the possibilities of using Big Data Analytics in healthcare. The research is based on a critical analysis of the literature, as well as the presentation of selected results of direct research on the use of Big Data Analytics in medical facilities. The direct research was carried out based on a research questionnaire and conducted on a sample of 217 medical facilities in Poland. Literature studies have shown that the use of Big Data Analytics can bring many benefits to medical facilities, while direct research has shown that medical facilities in Poland are moving towards data-based healthcare: they use structured and unstructured data and reach for analytics in the administrative, business, and clinical areas. The research positively confirmed that medical facilities work on both structured and unstructured data. The following kinds and sources of data can be distinguished: data from databases, transaction data, unstructured content of emails and documents, and data from devices and sensors; however, the use of data from social media is lower. In their activity, medical facilities reach for analytics not only in the administrative and business areas but also in the clinical area. This clearly shows that the decisions made in medical facilities are highly data-driven. The results of the study confirm what has been analyzed in the literature: that medical facilities are moving towards data-based healthcare, together with its benefits.

Introduction

The main contribution of this paper is to present an analytical overview of the use of structured and unstructured data (Big Data) analytics in medical facilities in Poland. Medical facilities use both structured and unstructured data in their practice. Structured data has a predetermined schema, whereas unstructured data is extensive, freeform, and comes in a variety of forms [27]. Unstructured data, referred to here as Big Data (BD), does not fit into the typical data processing format. Big Data is a massive amount of data sets that cannot be stored, processed, or analyzed using traditional tools; it remains stored but not analyzed. Due to the lack of a well-defined schema, it is difficult to search and analyze such data and, therefore, it requires specific technologies and methods to transform it into value [20, 68]. Integrating data stored in both structured and unstructured formats can add significant value to an organization [27]. Organizations must approach unstructured data in a different way; therefore, the potential is seen in Big Data Analytics (BDA). Big Data Analytics comprises the techniques and tools used to analyze and extract information from Big Data. The results of Big Data analysis can be used to predict the future and to identify trends from the past. In healthcare, it makes it possible to analyze large datasets from thousands of patients, identifying clusters and correlations between datasets, as well as developing predictive models using data mining techniques [60].

This paper is the first study to consolidate and characterize the use of Big Data from different perspectives. The first part consists of a brief literature review of studies on Big Data (BD) and Big Data Analytics (BDA), while the second part presents the results of direct research aimed at diagnosing the use of Big Data analyses in medical facilities in Poland.

Healthcare is a complex system with varied stakeholders: patients, doctors, hospitals, pharmaceutical companies and healthcare decision-makers. The sector is also constrained by strict rules and regulations. However, a departure from the traditional doctor–patient approach can be observed worldwide. The doctor becomes a partner and the patient is involved in the therapeutic process [14]. Healthcare is no longer focused solely on the treatment of patients. The priority for decision-makers should be to promote proper health attitudes and prevent diseases that can be avoided [81]. This became visible and especially important during the Covid-19 pandemic [44].

The next challenges that healthcare will have to face are the growing number of elderly people and a decline in fertility. Fertility rates are below the reproductive minimum necessary to keep the population stable [10]. Both effects, namely population ageing and lower fertility, are reflected in the demographic dependency ratio, which is constantly growing. Forecasts show that providing healthcare in the form it is provided today will become impossible in the next 20 years [70]. This is especially visible now, during the Covid-19 pandemic, when healthcare has faced a major challenge related to the analysis of huge amounts of data and the need to identify trends and predict the spread of the coronavirus. The pandemic showed even more clearly that patients should have access to information about their health condition, to digital analysis of this data and to reliable medical support online. Health monitoring and cooperation with doctors in order to prevent diseases can actually revolutionize the healthcare system. One of the most important aspects of the necessary change in healthcare is putting the patient at the center of the system.

Technology alone is not enough to achieve these goals. Therefore, changes should be made not only at the technological level but also in the management and design of complete healthcare processes and, what is more, they should affect the business models of service providers. The use of Big Data Analytics is becoming more and more common in enterprises [17, 54]. However, medical enterprises still cannot keep up with the information needs of patients, clinicians, administrators and policy makers. The adoption of a Big Data approach would allow the implementation of personalized and precise medicine based on personalized information, delivered in real time and tailored to individual patients.

To achieve this goal, it is necessary to implement systems that will be able to learn quickly from the data generated by people within clinical care and everyday life. This will enable data-driven decision making; better personalized predictions about prognosis and responses to treatments; a deeper understanding of the complex factors, and their interactions, that influence health at the level of the patient, the health system and society; enhanced approaches to detecting safety problems with drugs and devices; as well as more effective methods of comparing prevention, diagnostic, and treatment options [40].

In the literature, there is a lot of research showing what opportunities big data analysis can offer to companies and what data can be analyzed. However, there are few studies showing how data analysis in the area of healthcare is performed, what data is used by medical facilities, and what analyses they carry out and in which areas. This paper aims to fill this gap by presenting the results of research carried out in medical facilities in Poland. The goal is to analyze the possibilities of using Big Data Analytics in healthcare, especially in Polish conditions. In particular, the paper is aimed at determining what data is processed by medical facilities in Poland, what analyses they perform and in what areas, and how they assess their analytical maturity. In order to achieve this goal, a critical analysis of the literature was performed, and the direct research was based on a research questionnaire conducted on a sample of 217 medical facilities in Poland. It was hypothesized that medical facilities in Poland are working on both structured and unstructured data and are moving towards data-based healthcare and its benefits. Examining the maturity of healthcare facilities in the use of Big Data and Big Data Analytics is crucial in determining the potential future benefits that the healthcare sector can gain from Big Data Analytics. There is also a pressing need to predict whether, in the coming years, healthcare will be able to cope with the threats and challenges it faces.

This paper is divided into eight parts. The first is the introduction, which provides the background and the general problem statement of this research. The second part discusses considerations on the use of Big Data and Big Data Analytics in healthcare, and the third part moves on to the challenges and potential benefits of using Big Data Analytics in healthcare. The next part explains the research method. The results of the direct research and the discussion are presented in the fifth part, while the following part is the conclusion. The seventh part presents practical implications. The final section provides limitations and directions for future research.

Considerations on the use of Big Data and Big Data Analytics in healthcare

In recent years, one can observe a constantly increasing demand for solutions offering effective analytical tools. This trend is also noticeable in the analysis of large volumes of data (Big Data, BD). Organizations are looking for ways to use the power of Big Data to improve their decision making, competitive advantage or business performance [7, 54]. Big Data is considered to offer potential solutions to public and private organizations; however, still not much is known about the outcomes of the practical use of Big Data in different types of organizations [24].

As already mentioned, in recent years healthcare management worldwide has been changing from a disease-centered model to a patient-centered model, and even to a value-based healthcare delivery model [68]. In order to meet the requirements of this model and provide effective patient-centered care, it is necessary to manage and analyze healthcare Big Data.

The issue often raised when it comes to the use of data in healthcare is the appropriate use of Big Data. Healthcare has always generated huge amounts of data and nowadays, the introduction of electronic medical records, as well as the huge amount of data sent by various types of sensors or generated by patients in social media causes data streams to constantly grow. Also, the medical industry generates significant amounts of data, including clinical records, medical images, genomic data and health behaviors. Proper use of the data will allow healthcare organizations to support clinical decision-making, disease surveillance, and public health management. The challenge posed by clinical data processing involves not only the quantity of data but also the difficulty in processing it.

In the literature one can find many different definitions of Big Data. This concept has evolved in recent years; however, it is still not clearly understood. Nevertheless, despite the range and differences in definitions, Big Data can be treated as: a large amount of digital data, large data sets, a tool, a technology, or a phenomenon (cultural or technological).

Big Data can be considered as massive and continually generated digital datasets that are produced via interactions with online technologies [ 53 ]. Big Data can be defined as datasets that are of such large sizes that they pose challenges in traditional storage and analysis techniques [ 28 ]. A similar opinion about Big Data was presented by Ohlhorst who sees Big Data as extremely large data sets, possible neither to manage nor to analyze with traditional data processing tools [ 57 ]. In his opinion, the bigger the data set, the more difficult it is to gain any value from it.

In turn, Knapp perceived Big Data as tools, processes and procedures that allow an organization to create, manipulate and manage very large data sets and storage facilities [ 38 ]. From this point of view, Big Data is identified as a tool to gather information from different databases and processes, allowing users to manage large amounts of data.

A similar perception of the term 'Big Data' is presented by Carter. According to him, Big Data technologies refer to a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data by enabling high-velocity capture, discovery and/or analysis [13].

Jordan combines these two approaches by identifying Big Data as a complex system, as it needs databases in which data can be stored, programs and tools with which it can be managed, as well as expertise and personnel able to retrieve useful information and visualization so that it can be understood [37].

Following Laney's definition of Big Data, it can be stated that it is a large amount of data generated very rapidly and containing a great deal of content [43]. Such data comes from unstructured sources, such as streams of clicks on the web, social networks (Twitter, blogs, Facebook), video recordings from shops, recordings of calls in a call center, and real-time information from various kinds of sensors, RFID, GPS devices, mobile phones and other devices that identify and monitor something [8]. Big Data is a powerful digital data silo: raw, collected from all sorts of sources, unstructured and difficult, or even impossible, to analyze using the conventional techniques applied so far to relational databases.

While describing Big Data, it cannot be overlooked that the term refers more to a phenomenon than to a specific technology. Therefore, instead of defining this phenomenon, more and more authors describe Big Data by giving it characteristics, a collection of 'V's' related to its nature [2, 3, 23, 25, 58]:

Volume (refers to the amount of data and is one of the biggest challenges in Big Data Analytics),

Velocity (speed with which new data is generated, the challenge is to be able to manage data effectively and in real time),

Variety (heterogeneity of data, many different types of healthcare data, the challenge is to derive insights by looking at all available heterogenous data in a holistic manner),

Variability (inconsistency of data, the challenge is to correct the interpretation of data that can vary significantly depending on the context),

Veracity (how trustworthy the data is, quality of the data),

Visualization (ability to interpret data and resulting insights, challenging for Big Data due to its other features as described above).

Value (the goal of Big Data Analytics is to discover the hidden knowledge from huge amounts of data).

Big Data is defined as an information asset with high volume, velocity, and variety, which requires specific technology and methods for its transformation into value [21, 77]. Big Data is also a collection of information of high volume, high velocity or high variety, requiring new forms of processing in order to support decision-making, discover new phenomena and optimize processes [5, 7]. Big Data is too large for traditional data-processing systems and software tools to capture, store, manage and analyze, and therefore it requires new technologies [28, 50, 61] to manage (capture, aggregate, process) its volume, velocity and variety [9].

Undoubtedly, Big Data differs from the data sources used so far by organizations. Therefore, organizations must approach this type of unstructured data in a different way. First of all, organizations must start to see data as flows and not stocks—this entails the need to implement the so-called streaming analytics [ 48 ]. The mentioned features make it necessary to use new IT tools that allow the fullest use of new data [ 58 ]. The Big Data idea, inseparable from the huge increase in data available to various organizations or individuals, creates opportunities for access to valuable analyses, conclusions and enables making more accurate decisions [ 6 , 11 , 59 ].

The Big Data concept is constantly evolving and currently it does not focus on huge amounts of data, but rather on the process of creating value from this data [52]. Big Data is collected from various sources that have different data properties and is processed by different organizational units, resulting in the creation of a Big Data chain [36]. The aim of organizations is to manage, process and analyze Big Data. In the healthcare sector, Big Data streams consist of various types of data, namely [8, 51]:

clinical data, i.e. data obtained from electronic medical records, data from hospital information systems, image centers, laboratories, pharmacies and other organizations providing health services, patient generated health data, physician’s free-text notes, genomic data, physiological monitoring data [ 4 ],

biometric data provided from various types of devices that monitor weight, pressure, glucose level, etc.,

financial data, constituting a full record of economic operations reflecting the conducted activity,

data from scientific research activities, i.e. results of research, including drug research, design of medical devices and new methods of treatment,

data provided by patients, including description of preferences, level of satisfaction, information from systems for self-monitoring of their activity: exercises, sleep, meals consumed, etc.

data from social media.

These data are provided not only by patients but also by organizations and institutions, as well as by various types of monitoring devices, sensors or instruments [ 16 ]. Data that has been generated so far in the healthcare sector is stored in both paper and digital form. Thus, the essence and the specificity of the process of Big Data analyses means that organizations need to face new technological and organizational challenges [ 67 ]. The healthcare sector has always generated huge amounts of data and this is connected, among others, with the need to store medical records of patients. However, the problem with Big Data in healthcare is not limited to an overwhelming volume but also an unprecedented diversity in terms of types, data formats and speed with which it should be analyzed in order to provide the necessary information on an ongoing basis [ 3 ]. It is also difficult to apply traditional tools and methods for management of unstructured data [ 67 ]. Due to the diversity and quantity of data sources that are growing all the time, advanced analytical tools and technologies, as well as Big Data analysis methods which can meet and exceed the possibilities of managing healthcare data, are needed [ 3 , 68 ].
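To make the above taxonomy of healthcare data streams more concrete, the short, illustrative Python sketch below tags records from different sources as structured or unstructured before routing them to an appropriate processing path. The source names, fields and routing rule are assumptions made for this example only and are not taken from the cited studies.

from dataclasses import dataclass, field
from typing import Any, Dict

# Illustrative (assumed) classification of healthcare data sources.
STRUCTURED_SOURCES = {"ehr_record", "lab_result", "biometric_sensor", "billing"}
UNSTRUCTURED_SOURCES = {"physician_notes", "email", "social_media", "audio", "video"}

@dataclass
class HealthRecord:
    source: str                      # e.g. "ehr_record" or "physician_notes"
    payload: Dict[str, Any] = field(default_factory=dict)

    @property
    def is_structured(self) -> bool:
        return self.source in STRUCTURED_SOURCES

def route(record: HealthRecord) -> str:
    # Structured records go to schema-based storage, the rest to a data lake
    # for text/stream mining.
    return "relational_warehouse" if record.is_structured else "data_lake"

if __name__ == "__main__":
    records = [
        HealthRecord("lab_result", {"patient_id": 1, "glucose_mg_dl": 104}),
        HealthRecord("physician_notes", {"text": "Patient reports mild chest pain."}),
        HealthRecord("social_media", {"text": "Flu symptoms all over town this week."}),
    ]
    for r in records:
        print(r.source, "->", route(r))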

Therefore, the potential is seen in Big Data analyses, especially in the aspect of improving the quality of medical care, saving lives or reducing costs [30]. Extracting association rules, patterns and trends from this tangle of data will allow health service providers and other stakeholders in the healthcare sector to offer more accurate and more insightful diagnoses of patients, personalized treatment, monitoring of patients, preventive medicine, support for medical research and population health, as well as better quality of medical services and patient care while, at the same time, reducing costs (Fig. 1).

Figure 1. Healthcare Big Data Analytics applications (Source: Own elaboration)

The main challenge with Big Data is how to handle such a large amount of information and use it to make data-driven decisions in many areas [64]. In the context of healthcare data, another major challenge is to adapt big data storage, analysis, the presentation of analysis results and the inference based on them to a clinical setting. Data analytics systems implemented in healthcare are designed to describe, integrate and present complex data in an appropriate way so that it can be understood better (Fig. 2). This would improve the efficiency of acquiring, storing, analyzing and visualizing big data from healthcare [71].

Figure 2. Process of Big Data Analytics
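As a rough illustration of the acquire–integrate–analyze–visualize process sketched in Fig. 2, the following Python fragment joins a small structured table with unstructured free-text notes and produces a simple descriptive summary. The data, column names and the keyword-based flag are invented for the example; a production pipeline would rely on dedicated Big Data infrastructure rather than an in-memory data frame.

import pandas as pd

# Acquire: a toy structured table (e.g. from a hospital information system)...
visits = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "age": [54, 67, 71, 45],
    "ward": ["cardiology", "cardiology", "internal", "internal"],
    "cost": [1200.0, 2300.0, 1800.0, 900.0],
})

# ...and toy unstructured notes (e.g. physicians' free-text notes).
notes = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "note": ["chest pain, stable", "chest pain, readmitted",
             "diabetes follow-up", "routine check"],
})

# Integrate: join structured and unstructured data on the patient identifier.
data = visits.merge(notes, on="patient_id")
data["mentions_chest_pain"] = data["note"].str.contains("chest pain")

# Analyze and present: descriptive summary per ward.
summary = data.groupby("ward").agg(
    patients=("patient_id", "count"),
    avg_cost=("cost", "mean"),
    avg_age=("age", "mean"),
    chest_pain_notes=("mentions_chest_pain", "sum"),
)
print(summary)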

The result of data processing with the use of Big Data Analytics is appropriate data storytelling, which may contribute to making decisions with both lower risk and data support. This, in turn, can benefit healthcare stakeholders. To take advantage of the potentially massive amounts of data in healthcare, and to ensure that the right intervention for the right patient is properly timed, personalized, and beneficial to all components of the healthcare system such as the payer, patient, and management, analytics of large datasets must connect the communities involved in data analytics and healthcare informatics [49]. Big Data Analytics can provide insight into clinical data and thus facilitate informed decision-making about the diagnosis and treatment of patients, the prevention of diseases, and more. Big Data Analytics can also improve the efficiency of healthcare organizations by realizing the potential of their data [3, 62].

Big Data Analytics in medicine and healthcare refers to the integration and analysis of a large amount of complex heterogeneous data, such as various omics data (genomics, epigenomics, transcriptomics, proteomics, metabolomics, interactomics, pharmacogenetics, diseasomics), biomedical data, telemedicine data (sensors, medical equipment data) and electronic health records data [46, 65].

When analyzing the phenomenon of Big Data in the healthcare sector, it should be noted that it can be considered from the point of view of three areas: epidemiological, clinical and business.

From a clinical point of view, Big Data analysis aims to improve the health and condition of patients, and to enable long-term predictions about their health status and the implementation of appropriate therapeutic procedures. Ultimately, the use of data analysis in medicine is meant to allow the adaptation of therapy to a specific patient, that is, personalized (precision) medicine.

From an epidemiological point of view, it is desirable to obtain an accurate prognosis of morbidity in order to implement preventive programs in advance.

In the business context, Big Data analysis may enable offering personalized packages of commercial services or determining the probability of individual disease and infection occurrence. It is worth noting that Big Data means not only the collection and processing of data but, most of all, the inference and visualization of data necessary to obtain specific business benefits.

In order to introduce new management methods and new solutions in terms of effectiveness and transparency, it becomes necessary to make data more accessible, digital, searchable, as well as analyzed and visualized.

Erickson and Rothberg state that the information and data do not reveal their full value until insights are drawn from them. Data becomes useful when it enhances decision making and decision making is enhanced only when analytical techniques are used and an element of human interaction is applied [ 22 ].

Thus, healthcare has experienced much progress in the usage and analysis of data. Large-scale digitalization and transparency in this sector are key commitments in the policies of almost all governments. For centuries, the treatment of patients was based on the judgment of doctors who made treatment decisions. In recent years, however, Evidence-Based Medicine has become more and more important, as it relies on the systematic analysis of clinical data and on treatment decisions based on the best available information [42]. In the healthcare sector, Big Data Analytics is expected to improve the quality of life and reduce operational costs [72, 82]. Big Data Analytics enables organizations to improve and increase their understanding of the information contained in data. It also helps identify data that provides useful insights for current as well as future decisions [28].

Big Data Analytics refers to technologies that are grounded mostly in data mining: text mining, web mining, process mining, audio and video analytics, statistical analysis, network analytics, social media analytics and web analytics [ 16 , 25 , 31 ]. Different data mining techniques can be applied on heterogeneous healthcare data sets, such as: anomaly detection, clustering, classification, association rules as well as summarization and visualization of those Big Data sets [ 65 ]. Modern data analytics techniques explore and leverage unique data characteristics even from high-speed data streams and sensor data [ 15 , 16 , 31 , 55 ]. Big Data can be used, for example, for better diagnosis in the context of comprehensive patient data, disease prevention and telemedicine (in particular when using real-time alerts for immediate care), monitoring patients at home, preventing unnecessary hospital visits, integrating medical imaging for a wider diagnosis, creating predictive analytics, reducing fraud and improving data security, better strategic planning and increasing patients’ involvement in their own health.
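As a small illustration of the data mining techniques listed above, the hedged sketch below clusters synthetic patient measurements with k-means and flags records far from their cluster centre as potential anomalies. It assumes scikit-learn is available; the features and thresholds are invented and only demonstrate the kind of technique referred to, not any method from the cited studies.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic patient features: [age, systolic blood pressure, BMI].
patients = np.vstack([
    rng.normal([45, 120, 24], [8, 8, 2], size=(50, 3)),    # roughly "healthy" profile
    rng.normal([68, 150, 31], [6, 10, 3], size=(50, 3)),    # roughly "at-risk" profile
])
X = StandardScaler().fit_transform(patients)

# Clustering: group patients into two profiles.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Simple anomaly flag: patients unusually far from their cluster centre.
distances = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)
threshold = distances.mean() + 2 * distances.std()
anomalies = np.where(distances > threshold)[0]

print("cluster sizes:", np.bincount(kmeans.labels_))
print("possible anomalies (row indices):", anomalies)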

Big Data Analytics in healthcare can be divided into [ 33 , 73 , 74 ]:

descriptive analytics in healthcare is used to understand past and current healthcare decisions, converting data into useful information for understanding and analyzing healthcare decisions, outcomes and quality, as well as making informed decisions [33]. It can be used to create reports (e.g. about patients' hospitalizations, physicians' performance, utilization management), visualizations, customized reports and drill-down tables, or to run queries on the basis of historical data.

predictive analytics operates on past performance in an effort to predict the future by examining historical or summarized health data, detecting patterns of relationships in these data, and then extrapolating these relationships to forecast. It can be used, for example, to predict the response of different patient groups to different drugs (dosages) or reactions (clinical trials), to anticipate risk, and to find relationships in health data and detect hidden patterns [62]. In this way, it is possible to predict the spread of an epidemic, anticipate service contracts and plan healthcare resources. Predictive analytics is used in proper diagnosis and for appropriate treatments to be given to patients suffering from certain diseases [39].

prescriptive analytics—occurs when health problems involve too many choices or alternatives. It uses health and medical knowledge in addition to data or information. Prescriptive analytics is used in many areas of healthcare, including drug prescriptions and treatment alternatives. Personalized medicine and evidence-based medicine are both supported by prescriptive analytics.

discovery analytics—utilizes knowledge about knowledge to discover new “inventions” like drugs (drug discovery), previously unknown diseases and medical conditions, alternative treatments, etc.

Although the models and tools used in descriptive, predictive, prescriptive, and discovery analytics are different, many applications involve all four of them [62]. Big Data Analytics in healthcare can help enable personalized medicine by identifying optimal patient-specific treatments. This can improve living standards, reduce the waste of healthcare resources and save healthcare costs [56, 63, 71]. The introduction of large data analysis gives new analytical possibilities in terms of scope, flexibility and visualization. Techniques such as data mining (the computational process of discovering patterns in large data sets) facilitate inductive reasoning and the analysis of exploratory data, enabling scientists to identify data patterns that are independent of specific hypotheses. As a result, predictive analysis and real-time analysis become possible, making it easier for medical staff to start early treatments and reduce potential morbidity and mortality. In addition, document analysis, statistical modeling, discovering patterns and topics in document collections and in EHR data, as well as an inductive approach, can help identify and discover relationships between health phenomena.
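To make the idea of predictive analysis on patient data more tangible, the sketch below trains a logistic regression model on a fully synthetic data set to estimate readmission risk. The feature names, the data-generating rule and the model choice are assumptions for illustration only; real clinical predictive models require far more careful data, validation and governance.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 500

# Synthetic "EHR-like" features: age, number of prior visits, chronic-condition flag.
age = rng.normal(60, 12, n)
prior_visits = rng.poisson(2, n)
chronic = rng.integers(0, 2, n)

# Invented ground-truth rule: risk grows with age, prior visits and chronic illness.
logit = -6 + 0.05 * age + 0.4 * prior_visits + 1.0 * chronic
readmitted = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([age, prior_visits, chronic])
X_train, X_test, y_train, y_test = train_test_split(X, readmitted, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
risk = model.predict_proba(X_test)[:, 1]   # predicted readmission probability
print("ROC AUC on held-out data:", round(roc_auc_score(y_test, risk), 3))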

Advanced analytical techniques can be applied to the large amount of existing (but not yet analyzed) data on patient health and related medical data to achieve a better understanding of the information and results obtained, as well as to design optimal clinical pathways [62]. Big Data Analytics in healthcare integrates the analysis of several scientific areas such as bioinformatics, medical imaging, sensor informatics, medical informatics and health informatics [65]. Big Data Analytics in healthcare makes it possible to analyze large datasets from thousands of patients, identifying clusters and correlations between datasets, as well as developing predictive models using data mining techniques [65]. Discussing all the techniques used for Big Data Analytics goes beyond the scope of a single article [25].

The success of Big Data analysis and its accuracy depend heavily on the tools and techniques used for the analysis and on the ability to provide reliable, up-to-date and meaningful information to various stakeholders [12]. It is believed that the implementation of big data analytics by healthcare organizations could bring many benefits in the upcoming years, including lowering healthcare costs, better diagnosis and prediction of diseases and their spread, improving patient care and developing protocols to prevent re-hospitalization, optimizing staff and equipment, forecasting the need for hospital beds, operating rooms and treatments, and improving the drug supply chain [71].

Challenges and potential benefits of using Big Data Analytics in healthcare

Modern analytics gives the possibility not only to gain insight into historical data, but also to obtain the information necessary to generate insight into what may happen in the future, even when it comes to the prediction of evidence-based actions. The emphasis on reform has prompted payers and suppliers to pursue data analysis to reduce risk, detect fraud, improve efficiency and save lives. Everyone—payers, providers, even patients—is focusing on doing more with fewer resources. Some of the areas in which enhanced data and analytics can yield the greatest results for the various healthcare stakeholders are summarized in Table 1.

Healthcare organizations see the opportunity to grow through investments in Big Data Analytics. In recent years, by collecting medical data of patients, converting it into Big Data and applying appropriate algorithms, reliable information has been generated that helps patients, physicians and stakeholders in the health sector to identify values and opportunities [31]. It is worth noting that there are many changes and challenges in the structure of the healthcare sector. Digitization and the effective use of Big Data in healthcare can bring benefits to every stakeholder in this sector: a single doctor would benefit just as much as the entire healthcare system. Potential opportunities to achieve benefits and effects from Big Data in healthcare can be divided into four groups [8]:

Improving the quality of healthcare services:

assessment of diagnoses made by doctors and the manner of treatment of diseases indicated by them based on the decision support system working on Big Data collections,

detection of more effective, from a medical point of view, and more cost-effective ways to diagnose and treat patients,

analysis of large volumes of data to reach practical information useful for identifying needs, introducing new health services, preventing and overcoming crises,

prediction of the incidence of diseases,

detecting trends that lead to an improvement in health and lifestyle of the society,

analysis of the human genome for the introduction of personalized treatment.

Supporting the work of medical personnel

doctors’ comparison of current medical cases to cases from the past for better diagnosis and treatment adjustment,

detection of diseases at earlier stages when they can be more easily and quickly cured,

detecting epidemiological risks and improving control of pathogenic spots and reaction rates,

identification of patients predicted to be at the highest risk of specific, life-threatening diseases, by collating data on the history of the most common diseases of treated patients with reports submitted to insurance companies,

health management of each patient individually (personalized medicine) and health management of the whole society,

capturing and analyzing, in real time, large amounts of data from hospitals, homes and life-monitoring devices in order to monitor safety and predict adverse events,

analysis of patient profiles to identify people for whom prevention should be applied, lifestyle change or preventive care approach,

the ability to predict the occurrence of specific diseases or worsening of patients’ results,

predicting disease progression and its determinants, estimating the risk of complications,

detecting drug interactions and their side effects.

Supporting scientific and research activity

supporting work on new drugs and clinical trials thanks to the possibility of analyzing “all data” instead of selecting a test sample,

the ability to identify patients with specific, biological features that will take part in specialized clinical trials,

selecting a group of patients for which the tested drug is likely to have the desired effect and no side effects,

using modeling and predictive analysis to design better drugs and devices.

Business and management

reduction of costs and counteracting abuse and fraudulent practices,

faster and more effective identification of incorrect or unauthorized financial operations in order to prevent abuse and eliminate errors,

increase in profitability by detecting patients generating high costs or identifying doctors whose work, procedures and treatment methods cost the most and offering them solutions that reduce the amount of money spent,

identification of unnecessary medical activities and procedures, e.g. duplicate tests.

According to research conducted by Wang, Kung and Byrd, Big Data Analytics benefits can be classified into five categories: IT infrastructure benefits (reducing system redundancy, avoiding unnecessary IT costs, transferring data quickly among healthcare IT systems, better use of healthcare systems, processing standardization among various healthcare IT systems, reducing IT maintenance costs regarding data storage), operational benefits (improving the quality and accuracy of clinical decisions, processing a large number of health records in seconds, reducing the time of patient travel, immediate access to clinical data to analyze, shortening the time of diagnostic tests, reductions in surgery-related hospitalizations, exploring inconceivable new research avenues), organizational benefits (detecting interoperability problems much more quickly than traditional manual methods, improving cross-functional communication and collaboration among administrative staff, researchers, clinicians and IT staff, enabling data sharing with other institutions and adding new services, content sources and research partners), managerial benefits (gaining quick insights about changing healthcare trends in the market, providing members of the board and heads of department with sound decision-support information on the daily clinical setting, optimizing business growth-related decisions) and strategic benefits (providing a big-picture view of treatment delivery for meeting future needs, creating highly competitive healthcare services) [73].

The above specification does not constitute a full list of the potential areas of use of Big Data analysis in healthcare, because the possibilities of using such analysis are practically unlimited. In addition, advanced analytical tools make it possible to analyze data from all possible sources and to conduct cross-analyses that provide better data insights [26]. For example, a cross-analysis can combine patient characteristics with costs and care results, which can help identify the best, in medical terms, and most cost-effective treatment or treatments, and this may allow a better adjustment of the service provider's offer [62].
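A cross-analysis of this kind can be sketched very simply: the illustrative pandas fragment below combines invented patient characteristics, treatment costs and outcomes, and then compares treatments by average cost and recovery rate. All names and numbers are made up; the point is only to show the general shape of such an analysis.

import pandas as pd

# Invented records combining patient characteristics, treatment, cost and outcome.
records = pd.DataFrame({
    "age_group": ["<50", "<50", "50+", "50+", "50+", "<50"],
    "treatment": ["A", "B", "A", "B", "A", "B"],
    "cost": [800, 1500, 950, 1600, 1020, 1450],
    "recovered": [True, True, False, True, True, True],
})

# Compare treatments: average cost and recovery rate, overall and per age group.
by_treatment = records.groupby("treatment").agg(
    avg_cost=("cost", "mean"),
    recovery_rate=("recovered", "mean"),
)
by_group = records.groupby(["age_group", "treatment"]).agg(
    avg_cost=("cost", "mean"),
    recovery_rate=("recovered", "mean"),
)
print(by_treatment)
print(by_group)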

In turn, the analysis of patient profiles (e.g. segmentation and predictive modeling) allows the identification of people who should be subject to prophylaxis or prevention, or who should change their lifestyle [8]. A shortened list of the benefits of Big Data Analytics in healthcare is presented in [3] and consists of: better performance, day-to-day guides, detection of diseases at early stages, predictive analytics, cost effectiveness, Evidence-Based Medicine and effectiveness in patient treatment.

Summarizing, healthcare big data represents a huge potential for the transformation of healthcare: improvement of patients' results, prediction of outbreaks of epidemics, valuable insights, avoidance of preventable diseases, reduction of the cost of healthcare delivery and improvement of the quality of life in general [1]. Big Data also generates many challenges, such as difficulties in data capture, data storage, data analysis and data visualization [15]. The main challenges are connected with the issues of: data structure (Big Data should be user-friendly, transparent and menu-driven, but it is fragmented, dispersed, rarely standardized and difficult to aggregate and analyze), security (data security, privacy and the sensitivity of healthcare data; there are significant concerns related to confidentiality), data standardization (data is stored in formats that are not compatible with all applications and technologies), storage and transfers (especially the costs associated with securing, storing, and transferring unstructured data), managerial skills such as data governance, the lack of appropriate analytical skills, and problems with real-time analytics (healthcare needs to be able to utilize Big Data in real time) [4, 34, 41].

Methods

The research is based on a critical analysis of the literature, as well as on the presentation of selected results of direct research on the use of Big Data Analytics in medical facilities in Poland.

The presented research results are part of a larger questionnaire study on Big Data Analytics. The direct research was based on an interview questionnaire which contained 100 questions with a 5-point Likert scale (1—strongly disagree, 2—rather disagree, 3—neither agree nor disagree, 4—rather agree, 5—definitely agree) and 4 metric questions. The study was conducted in December 2018 on a sample of 217 medical facilities (110 private, 107 public). The research was conducted by a specialized market research agency: the Center for Research and Expertise of the University of Economics in Katowice.

Regarding the direct research, the selected entities included entities financed from public sources—the National Health Fund (23.5%)—and entities operating commercially (11.5%). More than half of the surveyed entities (64.9%) are financed from both public and commercial sources. The diversity of the research sample also applies to the size of the entities, defined by the number of employees. Taking into account the proportions of the surveyed entities, it should be noted that medium-sized (10–50 employees—34% of the sample) and large (51–250 employees—27%) entities dominate the sector structure. The research was nationwide, and the entities included in the research sample come from all of the voivodships. The largest groups were entities from the Łódzkie (32%), Śląskie (18%) and Mazowieckie (18%) voivodships, as these voivodships have the largest number of medical institutions. Other regions of the country were represented by single units. The selection of the research sample was random and stratified. Within the medical facilities database, groups of private and public medical facilities were identified, and the facilities to which the questionnaire was targeted were drawn from each of these groups. The analyses were performed using the GNU PSPP 0.10.2 software.

The aim of the study was to determine whether medical facilities in Poland use Big Data Analytics and, if so, in which areas. The characteristics of the research sample are presented in Table 2.

The research is non-exhaustive due to the incomplete and uneven regional distribution of the sample, which is overrepresented in three voivodships (Łódzkie, Mazowieckie and Śląskie). The size of the research sample (217 entities) nevertheless allows the authors to formulate specific conclusions on the use of Big Data in the management of medical facilities.

For the purpose of this paper, the following research hypotheses were formulated: (1) medical facilities in Poland are working on both structured and unstructured data, and (2) medical facilities in Poland are moving towards data-based healthcare and its benefits.

The paper poses the following research questions and statements that coincide with the selected questions from the research questionnaire:

What types of data are used by the particular organization, whether structured or unstructured, and to what extent?

From what sources do medical facilities obtain data?

In which areas (clinical or business) do organizations use data and analytical systems?

Is data analytics performed based on historical data or are predictive analyses also performed?

Do administrative and medical staff receive complete, accurate and reliable data in a timely manner?

Are real-time analyses performed to support the particular organization's activities?

Results and discussion

On the basis of the literature analysis and the research study, a set of questions and statements related to the researched area was formulated. The results from the surveys show that medical facilities use a variety of data sources in their operations. These sources provide both structured and unstructured data (Table 3).

According to the data provided by the respondents, considering the first statement in the questionnaire, almost half of the medical institutions (47.58%) rather agreed that they collect and use structured data (e.g. databases and data warehouses, reports to external entities) and 10.57% entirely agreed with this statement. As many as 23.35% of the representatives of medical institutions neither agreed nor disagreed. The remaining medical facilities rather do not collect and use structured data (7.93%) or strongly disagreed with the first statement (6.17%). The median calculated on the basis of the obtained results (median: 4) also indicates that medical facilities in Poland collect and use structured data (Table 4).

In turn, 28.19% of the medical institutions rather agreed that they collect and use unstructured data and as many as 9.25% entirely agreed with this statement. The share of representatives of medical institutions who neither agreed nor disagreed was 27.31%. The remaining medical facilities rather do not collect and use unstructured data (17.18%) or strongly disagreed with this statement (13.66%). In the case of unstructured data, the median is 3, which means that the collection and use of this type of data by medical facilities in Poland is lower.

In the further part of the analysis, it was checked whether the size of the medical facility and form of ownership have an impact on whether it analyzes unstructured data (Tables 4 and 5 ). In order to find this out, correlation coefficients were calculated.

Based on the calculations, it can be concluded that there is a small, statistically significant monotonic correlation between the size of the medical facility and its collection and use of structured data (p < 0.001; τ = 0.16). This means that the use of structured data increases slightly in larger medical facilities. The size of the medical facility matters more for the use of unstructured data (p < 0.001; τ = 0.23) (Table 4).

To determine whether the form of medical facility ownership affects data collection, the Mann–Whitney U test was used. The calculations show that the form of ownership does not affect what data the organization collects and uses (Table 5 ).
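The association measures reported in this section (Kendall's tau for facility size versus data use, and the Mann–Whitney U test for the form of ownership) can be reproduced on example data with standard Python tooling, as in the hedged sketch below. The responses are simulated and do not correspond to the survey results; the fragment only shows how such coefficients and tests are typically computed (the authors used GNU PSPP).

import numpy as np
from scipy.stats import kendalltau, mannwhitneyu

rng = np.random.default_rng(42)
n = 217  # same sample size as the survey, but simulated responses

# Simulated facility size category (1=small ... 3=large) and 5-point Likert answers.
size = rng.integers(1, 4, n)
uses_unstructured = np.clip(size + rng.integers(-1, 3, n), 1, 5)
ownership = rng.integers(0, 2, n)  # 0 = public, 1 = private

# Kendall's tau: monotonic association between facility size and the Likert response.
tau, p_tau = kendalltau(size, uses_unstructured)
print(f"Kendall's tau = {tau:.2f}, p = {p_tau:.4f}")

# Mann-Whitney U: do Likert responses differ between public and private facilities?
u, p_u = mannwhitneyu(uses_unstructured[ownership == 0], uses_unstructured[ownership == 1])
print(f"Mann-Whitney U = {u:.1f}, p = {p_u:.4f}")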

Detailed information on the sources from which medical facilities collect and use data is presented in Table 6.

The questionnaire results show that medical facilities especially use information published in databases, reports to external units and transaction data, but they also use unstructured data from e-mails, medical devices, sensors, phone calls, and audio and video recordings (Table 6). Data from social media, RFID and geolocation data are used to a small extent. Similar findings are reported in the literature.

The analysis of the answers given by the respondents shows that more than half of the medical facilities have an integrated hospital information system (HIS) implemented: 43.61% use an integrated hospital system and 16.30% use it extensively (Table 7), while 19.38% of the examined medical facilities do not use it at all. Moreover, most of the examined medical facilities (34.80% use it, 32.16% use it extensively) keep medical documentation in electronic form, which creates an opportunity to use data analytics. Only 4.85% of medical facilities do not use it at all.

Other issues that needed to be investigated were whether medical facilities in Poland use data analytics and, if so, in what form and in what areas (Table 8). The analysis of the answers given by the respondents about the potential of data analytics in medical facilities shows that a similar number of medical facilities use data analytics in administration and business (31.72% agreed with statement no. 5 and 12.33% strongly agreed) as in the clinical area (33.04% agreed with statement no. 6 and 12.33% strongly agreed). When considering decision-making issues, 35.24% agree with the statement "the organization uses data and analytical systems to support business decisions" and 8.37% of respondents strongly agree. Almost 40.09% agree with the statement that "the organization uses data and analytical systems to support clinical decisions (in the field of diagnostics and therapy)" and 15.42% of respondents strongly agree. The examined medical facilities use in their activity analytics based both on historical data (33.48% agree with statement no. 7 and 12.78% strongly agree) and predictive analytics (33.04% agree with statement no. 8 and 15.86% strongly agree). Detailed results are presented in Table 8.

Medical facilities focus on development in the field of data processing, as they confirm that they conduct analytical planning processes systematically and analyze new opportunities for the strategic use of analytics in business and clinical activities (38.33% rather agree and 10.57% strongly agree with this statement). The situation is different with real-time data analysis, where the picture is less optimistic: only 28.19% rather agree and 14.10% strongly agree with the statement that real-time analyses are performed to support the organization's activities.

When considering whether a facility's performance in the clinical area depends on the form of ownership, the mean values and the Mann–Whitney U test indicate that it does. A higher degree of use of analyses in the clinical area can be observed in public institutions.

Whether a medical facility performs descriptive or predictive analyses does not depend on the form of ownership (p > 0.05). When analyzing the mean and median, they are higher in public facilities than in private ones. What is more, the Mann–Whitney U test shows that these variables are dependent on each other (p < 0.05) (Table 9).

When considering whether a facility's performance in the clinical area depends on its size, Kendall's tau (τ) indicates that it does (p < 0.001; τ = 0.22); the correlation is weak but statistically significant. This means that the use of data and analytical systems to support clinical decisions (in the field of diagnostics and therapy) increases with the size of the medical facility. A similar, but even weaker, relationship can be found for the use of descriptive and predictive analyses (Table 10).

Considering the results concerning the analytical maturity of medical facilities, 8.81% of medical facilities stated that they are at the first level of maturity, i.e. the organization has not developed analytical skills and does not perform analyses. As many as 13.66% of medical facilities confirmed that they have poor analytical skills, while 38.33% of the medical facilities located themselves at level 3, meaning that "there is a lot to do in analytics". On the other hand, 28.19% believe that their analytical capabilities are well developed and 6.61% stated that analytics are at the highest level and the analytical capabilities are very well developed. Detailed data is presented in Table 11. The average amounts to 3.11 and the median to 3.

The results of the research have enabled the formulation of the following conclusions. Medical facilities in Poland are working on both structured and unstructured data. This data comes from databases, transactions, the unstructured content of e-mails and documents, and from devices and sensors. However, the use of data from social media is smaller. In their activity, the facilities use analytics in the administrative and business areas, as well as in the clinical area. The decisions made are also largely data-driven.

In summary, the analysis of the literature shows that the benefits that medical facilities can obtain from using Big Data Analytics in their activities relate primarily to patients, physicians and the medical facilities themselves. It can be confirmed that patients will be better informed, will receive treatments that work for them, will be prescribed medications that work for them and will not be given unnecessary medications [78]. Physicians' roles will likely change to more of a consultant than a decision maker: they will advise, warn and help individual patients, and have more time to form positive and lasting relationships with their patients in order to help people. Medical facilities will see changes as well, for example fewer unnecessary hospitalizations, resulting initially in less revenue but, after the market adjusts, in better overall outcomes [78]. The use of Big Data Analytics can revolutionize the way healthcare is practiced for better health and disease reduction.

The analysis of the latest data reveals that data analytics increases the accuracy of diagnoses. Physicians can use predictive algorithms to help them make more accurate diagnoses [45]. Moreover, data analytics can be helpful in preventive medicine and public health because, with early intervention, many diseases can be prevented or ameliorated [29]. Predictive analytics also makes it possible to identify risk factors for a given patient, and with this knowledge patients will be able to change their lifestyles, which, in turn, may dramatically change population disease patterns and result in savings in medical costs. Moreover, personalized medicine is the best solution for an individual patient seeking treatment: it can help doctors decide on the exact treatments for those individuals. Better diagnoses and more targeted treatments will naturally lead to increases in good outcomes and fewer resources used, including doctors' time.

The quantitative analysis of the research carried out and presented in this article made it possible to determine whether medical facilities in Poland use Big Data Analytics and, if so, in which areas. The results obtained made it possible to formulate the following conclusions. Medical facilities are working on both structured and unstructured data, which comes from databases, transactions, the unstructured content of e-mails and documents, and from devices and sensors. They use analytics in the administrative and business areas, as well as in the clinical area. This clearly shows that the decisions made are largely data-driven. The results of the study confirm what has been analyzed in the literature: medical facilities are moving towards data-based healthcare and its benefits.

In conclusion, Big Data Analytics has the potential for positive impact and global implications in healthcare. Future research on the use of Big Data in medical facilities will concern the strategies adopted by medical facilities to promote and implement such solutions, the benefits they gain from the use of Big Data analysis, and how they perceive the prospects in this area.

Practical implications

This work sought to narrow the gap that exists in analyzing the possibility of using Big Data Analytics in healthcare. Showing how medical facilities in Poland are doing in this respect is an element that is part of global research carried out in this area, including [ 29 , 32 , 60 ].

Limitations and future directions

The research described in this article does not fully exhaust the questions related to the use of Big Data Analytics in Polish healthcare facilities. Only some of the dimensions characterizing the use of data by medical facilities in Poland have been examined. In order to get the full picture, it would be necessary to examine the results of using structured and unstructured data analytics in healthcare. Future research may examine the benefits that medical institutions achieve as a result of the analysis of structured and unstructured data in the clinical and management areas, and the limitations they encounter in these areas. For this purpose, it is planned to conduct in-depth interviews with chosen medical facilities in Poland. These facilities could provide additional data for empirical analyses based more closely on their suggestions. Further research should also include medical institutions from beyond the borders of Poland, enabling international comparative analyses.

Future research in the healthcare field has virtually endless possibilities. It could address the use of Big Data Analytics to diagnose specific conditions [47, 66, 69, 76], propose approaches that can be used in other healthcare applications, and create mechanisms to identify "patients like me" [75, 80]. Big Data Analytics could also be used for studies related to the spread of pandemics, the efficacy of COVID-19 treatment [18, 79], or psychology and psychiatry studies, e.g. emotion recognition [35].

Availability of data and materials

The datasets for this study are available on request to the corresponding author.

References

Abouelmehdi K, Beni-Hessane A, Khaloufi H. Big healthcare data: preserving security and privacy. J Big Data. 2018. https://doi.org/10.1186/s40537-017-0110-7.

Agrawal A, Choudhary A. Health services data: big data analytics for deriving predictive healthcare insights. Health Serv Eval. 2019. https://doi.org/10.1007/978-1-4899-7673-4_2-1 .

Al Mayahi S, Al-Badi A, Tarhini A. Exploring the potential benefits of big data analytics in providing smart healthcare. In: Miraz MH, Excell P, Ware A, Ali M, Soomro S, editors. Emerging technologies in computing—first international conference, iCETiC 2018, proceedings (Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering, LNICST). Cham: Springer; 2018. p. 247–58. https://doi.org/10.1007/978-3-319-95450-9_21 .

Bainbridge M. Big data challenges for clinical and precision medicine. In: Househ M, Kushniruk A, Borycki E, editors. Big data, big challenges: a healthcare perspective: background, issues, solutions and research directions. Cham: Springer; 2019. p. 17–31.

Bartuś K, Batko K, Lorek P. Business intelligence systems: barriers during implementation. In: Jabłoński M, editor. Strategic performance management new concept and contemporary trends. New York: Nova Science Publishers; 2017. p. 299–327. ISBN: 978-1-53612-681-5.

Bartuś K, Batko K, Lorek P. Diagnoza wykorzystania big data w organizacjach-wybrane wyniki badań. Informatyka Ekonomiczna. 2017;3(45):9–20.

Bartuś K, Batko K, Lorek P. Wykorzystanie rozwiązań business intelligence, competitive intelligence i big data w przedsiębiorstwach województwa śląskiego. Przegląd Organizacji. 2018;2:33–9.

Batko K. Możliwości wykorzystania Big Data w ochronie zdrowia. Roczniki Kolegium Analiz Ekonomicznych. 2016;42:267–82.

Bi Z, Cochran D. Big data analytics with applications. J Manag Anal. 2014;1(4):249–65. https://doi.org/10.1080/23270012.2014.992985 .

Boerma T, Requejo J, Victora CG, Amouzou A, Asha G, Agyepong I, Borghi J. Countdown to 2030: tracking progress towards universal coverage for reproductive, maternal, newborn, and child health. Lancet. 2018;391(10129):1538–48.

Bollier D, Firestone CM. The promise and peril of big data. Washington, D.C: Aspen Institute, Communications and Society Program; 2010. p. 1–66.

Bose R. Competitive intelligence process and tools for intelligence analysis. Ind Manag Data Syst. 2008;108(4):510–28.

Carter P. Big data analytics: future architectures, skills and roadmaps for the CIO: in white paper, IDC sponsored by SAS. 2011. p. 1–16.

Castro EM, Van Regenmortel T, Vanhaecht K, Sermeus W, Van Hecke A. Patient empowerment, patient participation and patient-centeredness in hospital care: a concept analysis based on a literature review. Patient Educ Couns. 2016;99(12):1923–39.

Chen H, Chiang RH, Storey VC. Business intelligence and analytics: from big data to big impact. MIS Q. 2012;36(4):1165–88.

Chen CP, Zhang CY. Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inf Sci. 2014;275:314–47.

Chomiak-Orsa I, Mrozek B. Główne perspektywy wykorzystania big data w mediach społecznościowych. Informatyka Ekonomiczna. 2017;3(45):44–54.

Corsi A, de Souza FF, Pagani RN, et al. Big data analytics as a tool for fighting pandemics: a systematic review of literature. J Ambient Intell Hum Comput. 2021;12:9163–80. https://doi.org/10.1007/s12652-020-02617-4 .

Davenport TH, Harris JG. Competing on analytics, the new science of winning. Boston: Harvard Business School Publishing Corporation; 2007.

Davenport TH. Big data at work: dispelling the myths, uncovering the opportunities. Boston: Harvard Business School Publishing; 2014.

De Cnudde S, Martens D. Loyal to your city? A data mining analysis of a public service loyalty program. Decis Support Syst. 2015;73:74–84.

Erickson S, Rothberg H. Data, information, and intelligence. In: Rodriguez E, editor. The analytics process. Boca Raton: Auerbach Publications; 2017. p. 111–26.

Fang H, Zhang Z, Wang CJ, Daneshmand M, Wang C, Wang H. A survey of big data research. IEEE Netw. 2015;29(5):6–9.

Fredriksson C. Organizational knowledge creation with big data. A case study of the concept and practical use of big data in a local government context. 2016. https://www.abo.fi/fakultet/media/22103/fredriksson.pdf .

Gandomi A, Haider M. Beyond the hype: big data concepts, methods, and analytics. Int J Inf Manag. 2015;35(2):137–44.

Groves P, Kayyali B, Knott D, Van Kuiken S. The ‘big data’ revolution in healthcare. Accelerating value and innovation. 2015. http://www.pharmatalents.es/assets/files/Big_Data_Revolution.pdf (Reading: 10.04.2019).

Gupta V, Rathmore N. Deriving business intelligence from unstructured data. Int J Inf Comput Technol. 2013;3(9):971–6.

Gupta V, Singh VK, Ghose U, Mukhija P. A quantitative and text-based characterization of big data research. J Intell Fuzzy Syst. 2019;36:4659–75.

Hampel HOBS, O’Bryant SE, Castrillo JI, Ritchie C, Rojkova K, Broich K, Escott-Price V. PRECISION MEDICINE-the golden gate for detection, treatment and prevention of Alzheimer’s disease. J Prev Alzheimer’s Dis. 2016;3(4):243.

Harerimana GB, Jang J, Kim W, Park HK. Health big data analytics: a technology survey. IEEE Access. 2018;6:65661–78. https://doi.org/10.1109/ACCESS.2018.2878254 .

Hu H, Wen Y, Chua TS, Li X. Toward scalable systems for big data analytics: a technology tutorial. IEEE Access. 2014;2:652–87.

Hussain S, Hussain M, Afzal M, Hussain J, Bang J, Seung H, Lee S. Semantic preservation of standardized healthcare documents in big data. Int J Med Inform. 2019;129:133–45. https://doi.org/10.1016/j.ijmedinf.2019.05.024 .

Islam MS, Hasan MM, Wang X, Germack H. A systematic review on healthcare analytics: application and theoretical perspective of data mining. In: Healthcare. Basel: Multidisciplinary Digital Publishing Institute; 2018. p. 54.

Ismail A, Shehab A, El-Henawy IM. Healthcare analysis in smart big data analytics: reviews, challenges and recommendations. In: Security in smart cities: models, applications, and challenges. Cham: Springer; 2019. p. 27–45.

Jain N, Gupta V, Shubham S, et al. Understanding cartoon emotion using integrated deep neural network on large dataset. Neural Comput Appl. 2021. https://doi.org/10.1007/s00521-021-06003-9 .

Janssen M, van der Voort H, Wahyudi A. Factors influencing big data decision-making quality. J Bus Res. 2017;70:338–45.

Jordan SR. Beneficence and the expert bureaucracy. Public Integr. 2014;16(4):375–94. https://doi.org/10.2753/PIN1099-9922160404 .

Knapp MM. Big data. J Electron Resourc Med Libr. 2013;10(4):215–22.

Koti MS, Alamma BH. Predictive analytics techniques using big data for healthcare databases. In: Smart intelligent computing and applications. New York: Springer; 2019. p. 679–86.

Krumholz HM. Big data and new knowledge in medicine: the thinking, training, and tools needed for a learning health system. Health Aff. 2014;33(7):1163–70.

Kruse CS, Goswamy R, Raval YJ, Marawi S. Challenges and opportunities of big data in healthcare: a systematic review. JMIR Med Inform. 2016;4(4):e38.

Kyoungyoung J, Gang HK. Potentiality of big data in the medical sector: focus on how to reshape the healthcare system. Healthc Inform Res. 2013;19(2):79–85.

Laney D. Application delivery strategies 2011. http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf .

Lee IK, Wang CC, Lin MC, Kung CT, Lan KC, Lee CT. Effective strategies to prevent coronavirus disease-2019 (COVID-19) outbreak in hospital. J Hosp Infect. 2020;105(1):102.

Lerner I, Veil R, Nguyen DP, Luu VP, Jantzen R. Revolution in health care: how will data science impact doctor-patient relationships? Front Public Health. 2018;6:99.

Lytras MD, Papadopoulou P, editors. Applying big data analytics in bioinformatics and medicine. IGI Global: Hershey; 2017.

Ma K, et al. Big data in multiple sclerosis: development of a web-based longitudinal study viewer in an imaging informatics-based eFolder system for complex data analysis and management. In: Proceedings volume 9418, medical imaging 2015: PACS and imaging informatics: next generation and innovations. 2015. p. 941809. https://doi.org/10.1117/12.2082650 .

Mach-Król M. Analiza i strategia big data w organizacjach. In: Studia i Materiały Polskiego Stowarzyszenia Zarządzania Wiedzą. 2015;74:43–55.

Madsen LB. Data-driven healthcare: how analytics and BI are transforming the industry. Hoboken: Wiley; 2014.

Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Hung BA. Big data: the next frontier for innovation, competition, and productivity. Washington: McKinsey Global Institute; 2011.

Marconi K, Dobra M, Thompson C. The use of big data in healthcare. In: Liebowitz J, editor. Big data and business analytics. Boca Raton: CRC Press; 2012. p. 229–48.

Mehta N, Pandit A. Concurrence of big data analytics and healthcare: a systematic review. Int J Med Inform. 2018;114:57–65.

Michel M, Lupton D. Toward a manifesto for the ‘public understanding of big data.’ Public Underst Sci. 2016;25(1):104–16. https://doi.org/10.1177/0963662515609005 .

Mikalef P, Krogstie J. Big data analytics as an enabler of process innovation capabilities: a configurational approach. In: International conference on business process management. Cham: Springer; 2018. p. 426–41.

Mohammadi M, Al-Fuqaha A, Sorour S, Guizani M. Deep learning for IoT big data and streaming analytics: a survey. IEEE Commun Surv Tutor. 2018;20(4):2923–60.

Nambiar R, Bhardwaj R, Sethi A, Vargheese R. A look at challenges and opportunities of big data analytics in healthcare. In: 2013 IEEE international conference on big data; 2013. p. 17–22.

Ohlhorst F. Big data analytics: turning big data into big money, vol. 65. Hoboken: Wiley; 2012.

Olszak C, Mach-Król M. A conceptual framework for assessing an organization’s readiness to adopt big data. Sustainability. 2018;10(10):3734.

Olszak CM. Toward better understanding and use of business intelligence in organizations. Inf Syst Manag. 2016;33(2):105–23.

Palanisamy V, Thirunavukarasu R. Implications of big data analytics in developing healthcare frameworks—a review. J King Saud Univ Comput Inf Sci. 2017;31(4):415–25.

Provost F, Fawcett T. Data science and its relationship to big data and data-driven decisionmaking. Big Data. 2013;1(1):51–9.

Raghupathi W, Raghupathi V. An overview of health analytics. J Health Med Inform. 2013;4:132. https://doi.org/10.4172/2157-7420.1000132 .

Raghupathi W, Raghupathi V. Big data analytics in healthcare: promise and potential. Health Inf Sci Syst. 2014;2(1):3.

Ratia M, Myllärniemi J. Beyond IC 4.0: the future potential of BI-tool utilization in the private healthcare, conference: proceedings IFKAD, 2018 at: Delft, The Netherlands.

Ristevski B, Chen M. Big data analytics in medicine and healthcare. J Integr Bioinform. 2018. https://doi.org/10.1515/jib-2017-0030 .

Rumsfeld JS, Joynt KE, Maddox TM. Big data analytics to improve cardiovascular care: promise and challenges. Nat Rev Cardiol. 2016;13(6):350–9. https://doi.org/10.1038/nrcardio.2016.42 .

Schmarzo B. Big data: understanding how data powers big business. Indianapolis: Wiley; 2013.

Senthilkumar SA, Rai BK, Meshram AA, Gunasekaran A, Chandrakumarmangalam S. Big data in healthcare management: a review of literature. Am J Theor Appl Bus. 2018;4:57–69.

Shubham S, Jain N, Gupta V, et al. Identify glomeruli in human kidney tissue images using a deep learning approach. Soft Comput. 2021. https://doi.org/10.1007/s00500-021-06143-z .

Thuemmler C. The case for health 4.0. In: Thuemmler C, Bai C, editors. Health 4.0: how virtualization and big data are revolutionizing healthcare. New York: Springer; 2017.

Tsai CW, Lai CF, Chao HC, et al. Big data analytics: a survey. J Big Data. 2015;2:21. https://doi.org/10.1186/s40537-015-0030-3 .

Wamba SF, Gunasekaran A, Akter S, Ji-fan RS, Dubey R, Childe SJ. Big data analytics and firm performance: effects of dynamic capabilities. J Bus Res. 2017;70:356–65.

Wang Y, Byrd TA. Business analytics-enabled decision-making effectiveness through knowledge absorptive capacity in health care. J Knowl Manag. 2017;21(3):517–39.

Wang Y, Kung L, Wang W, Yu C, Cegielski CG. An integrated big data analytics-enabled transformation model: application to healthcare. Inf Manag. 2018;55(1):64–79.

Wicks P, et al. Scaling PatientsLikeMe via a “generalized platform” for members with chronic illness: web-based survey study of benefits arising. J Med Internet Res. 2018;20(5):e175.

Willems SM, et al. The potential use of big data in oncology. Oral Oncol. 2019;98:8–12. https://doi.org/10.1016/j.oraloncology.2019.09.003 .

Williams N, Ferdinand NP, Croft R. Project management maturity in the age of big data. Int J Manag Proj Bus. 2014;7(2):311–7.

Winters-Miner LA. Seven ways predictive analytics can improve healthcare. Medical predictive analytics have the potential to revolutionize healthcare around the world. 2014. https://www.elsevier.com/connect/seven-ways-predictive-analytics-can-improve-healthcare (Reading: 15.04.2019).

Wu J, et al. Application of big data technology for COVID-19 prevention and control in China: lessons and recommendations. J Med Internet Res. 2020;22(10): e21980.

Yan L, Peng J, Tan Y. Network dynamics: how can we find patients like us? Inf Syst Res. 2015;26(3):496–512.

Yang JJ, Li J, Mulder J, Wang Y, Chen S, Wu H, Pan H. Emerging information technologies for enhanced healthcare. Comput Ind. 2015;69:3–11.

Zhang Q, Yang LT, Chen Z, Li P. A survey on deep learning for big data. Inf Fusion. 2018;42:146–57.



Scientific Research and Big Data

Big Data promises to revolutionise the production of knowledge within and beyond science, by enabling novel, highly efficient ways to plan, conduct, disseminate and assess research. The last few decades have witnessed the creation of novel ways to produce, store, and analyse data, culminating in the emergence of the field of data science , which brings together computational, algorithmic, statistical and mathematical techniques towards extrapolating knowledge from big data. At the same time, the Open Data movement—emerging from policy trends such as the push for Open Government and Open Science—has encouraged the sharing and interlinking of heterogeneous research data via large digital infrastructures. The availability of vast amounts of data in machine-readable formats provides an incentive to create efficient procedures to collect, organise, visualise and model these data. These infrastructures, in turn, serve as platforms for the development of artificial intelligence, with an eye to increasing the reliability, speed and transparency of processes of knowledge creation. Researchers across all disciplines see the newfound ability to link and cross-reference data from diverse sources as improving the accuracy and predictive power of scientific findings and helping to identify future directions of inquiry, thus ultimately providing a novel starting point for empirical investigation. As exemplified by the rise of dedicated funding, training programmes and publication venues, big data are widely viewed as ushering in a new way of performing research and challenging existing understandings of what counts as scientific knowledge.

This entry explores these claims in relation to the use of big data within scientific research, and with an emphasis on the philosophical issues emerging from such use. To this aim, the entry discusses how the emergence of big data—and related technologies, institutions and norms—informs the analysis of the following themes:

  • how statistics, formal and computational models help to extrapolate patterns from data, and with which consequences;
  • the role of critical scrutiny (human intelligence) in machine learning, and its relation to the intelligibility of research processes;
  • the nature of data as research components;
  • the relation between data and evidence, and the role of data as source of empirical insight;
  • the view of knowledge as theory-centric;
  • understandings of the relation between prediction and causality;
  • the separation of fact and value; and
  • the risks and ethics of data science.

These are areas where attention to research practices revolving around big data can benefit philosophy, and particularly work in the epistemology and methodology of science. This entry doesn’t cover the vast scholarship in the history and social studies of science that has emerged in recent years on this topic, though references to some of that literature can be found when conceptually relevant. Complementing historical and social scientific work in data studies, the philosophical analysis of data practices can also elicit significant challenges to the hype surrounding data science and foster a critical understanding of the role of data-fuelled artificial intelligence in research.

1. What Are Big Data?
2. Extrapolating Data Patterns: The Role of Statistics and Software
3. Human and Artificial Intelligence
4. The Nature of (Big) Data
5. Big Data and Evidence
6. Big Data, Knowledge and Inquiry
7. Big Data between Causation and Prediction
8. The Fact/Value Distinction
9. Big Data Risks and the Ethics of Data Science
10. Conclusion: Big Data and Good Science
Other Internet Resources
Related Entries

We are witnessing a progressive “datafication” of social life. Human activities and interactions with the environment are being monitored and recorded with increasing effectiveness, generating an enormous digital footprint. The resulting “big data” are a treasure trove for research, with ever more sophisticated computational tools being developed to extract knowledge from such data. One example is the use of various types of data acquired from cancer patients, including genomic sequences, physiological measurements and individual responses to treatment, to improve diagnosis and treatment. Another example is the integration of data on traffic flow, environmental and geographical conditions, and human behaviour to produce safety measures for driverless vehicles, so that when confronted with unforeseen events (such as a child suddenly darting into the street on a very cold day), the data can be promptly analysed to identify and generate an appropriate response (the car swerving enough to avoid the child while also minimising the risk of skidding on ice and damaging other vehicles). Yet another instance is the understanding of the nutritional status and needs of a particular population that can be extracted from combining data on food consumption generated by commercial services (e.g., supermarkets, social media and restaurants) with data coming from public health and social services, such as blood test results and hospital intakes linked to malnutrition. In each of these cases, the availability of data and related analytic tools is creating novel opportunities for research and for the development of new forms of inquiry, which are widely perceived as having a transformative effect on science as a whole.

A useful starting point in reflecting on the significance of such cases for a philosophical understanding of research is to consider what the term “big data” actually refers to within contemporary scientific discourse. There are multiple ways to define big data (Kitchin 2014, Kitchin & McArdle 2016). Perhaps the most straightforward characterisation is as large datasets that are produced in a digital form and can be analysed through computational tools. Hence the two features most commonly associated with Big Data are volume and velocity. Volume refers to the size of the files used to archive and spread data. Velocity refers to the pressing speed with which data is generated and processed. The body of digital data created by research is growing at breakneck pace and in ways that are arguably impossible for the human cognitive system to grasp and thus require some form of automated analysis.

Volume and velocity are also, however, the most disputed features of big data. What may be perceived as “large volume” or “high velocity” depends on rapidly evolving technologies to generate, store, disseminate and visualise the data. This is exemplified by the high-throughput production, storage and dissemination of genomic sequencing and gene expression data, where both data volume and velocity have dramatically increased within the last two decades. Similarly, current understandings of big data as “anything that cannot be easily captured in an Excel spreadsheet” are bound to shift rapidly as new analytic software becomes established, and the very idea of using spreadsheets to capture data becomes a thing of the past. Moreover, data size and speed do not take account of the diversity of data types used by researchers, which may include data that are not generated in digital formats or whose format is not computationally tractable, and which underscores the importance of data provenance (that is, the conditions under which data were generated and disseminated) to processes of inference and interpretation. And as discussed below, the emphasis on physical features of data obscures the continuing dependence of data interpretation on circumstances of data use, including specific queries, values, skills and research situations.

An alternative is to define big data not by reference to their physical attributes, but rather by virtue of what can and cannot be done with them. In this view, big data is a heterogeneous ensemble of data collected from a variety of different sources, typically (but not always) in digital formats suitable for algorithmic processing, in order to generate new knowledge. For example, boyd and Crawford (2012: 663) identify big data with “the capacity to search, aggregate and cross-reference large datasets”, while O’Malley and Soyer (2012) focus on the ability to interrogate and interrelate diverse types of data, with the aim of consulting them as a single body of evidence. The examples of transformative “big data research” given above are all easily fitted into this view: it is not the mere fact that lots of data are available that makes a difference in those cases, but rather the fact that lots of data can be mobilised from a wide variety of sources (medical records, environmental surveys, weather measurements, consumer behaviour). This account makes sense of other characteristic “v-words” that have been associated with big data, including:

  • Variety in the formats and purposes of data, which may include objects as different as samples of animal tissue, free-text observations, humidity measurements, GPS coordinates, and the results of blood tests;
  • Veracity, understood as the extent to which the quality and reliability of big data can be guaranteed. Data with high volume, velocity and variety are at significant risk of containing inaccuracies, errors and unaccounted-for bias. In the absence of appropriate validation and quality checks, this could result in a misleading or outright incorrect evidence base for knowledge claims (Floridi & Illari 2014; Cai & Zhu 2015; Leonelli 2017);
  • Validity, which indicates the selection of appropriate data with respect to the intended use. The choice of a specific dataset as evidence base requires adequate and explicit justification, including recourse to relevant background knowledge to ground the identification of what counts as data in that context (e.g., Loettgers 2009, Bogen 2010);
  • Volatility, i.e., the extent to which data can be relied upon to remain available, accessible and re-interpretable despite changes in archival technologies. This is significant given the tendency of formats and tools used to generate and analyse data to become obsolete, and the efforts required to update data infrastructures so as to guarantee data access in the long term (Bowker 2006; Edwards 2010; Lagoze 2014; Borgman 2015);
  • Value, i.e., the multifaceted forms of significance attributed to big data by different sections of society (Leonelli 2016, D’Ignazio and Klein 2020). Alongside scientific value, researchers may impute financial, ethical, reputational and even affective value to data, depending on their intended use as well as the historical, social and geographical circumstances of that use. The institutions involved in governing and funding research also have ways of valuing data, which may not always overlap with the priorities of researchers (Tempini 2017).

This list of features, though not exhaustive, highlights how big data is not simply “a lot of data”. The epistemic power of big data lies in their capacity to bridge between different research communities, methodological approaches and theoretical frameworks that are difficult to link due to conceptual fragmentation, social barriers and technical difficulties (Leonelli 2019a). And indeed, appeals to big data often emerge from situations of inquiry that are at once technically, conceptually and socially challenging, and where existing methods and resources have proved insufficient or inadequate (Sterner & Franz 2017; Sterner, Franz, & Witteveen 2020).

This understanding of big data is rooted in a long history of researchers grappling with large and complex datasets, as exemplified by fields like astronomy, meteorology, taxonomy and demography (see the collections assembled by Daston 2017; Anorova et al. 2017; Porter & Chaderavian 2018; as well as Anorova et al. 2010, Sepkoski 2013, Stevens 2016, Strasser 2019 among others). Similarly, biomedical research—and particularly subfields such as epidemiology, pharmacology and public health—has an extensive tradition of tackling data of high volume, velocity, variety and volatility, and whose validity, veracity and value are regularly negotiated and contested by patients, governments, funders, pharmaceutical companies, insurances and public institutions (Bauer 2008). Throughout the twentieth century, these efforts spurred the development of techniques, institutions and instruments to collect, order, visualise and analyse data, such as: standard classification systems and formats; guidelines, tools and legislation for the management and security of sensitive data; and infrastructures to integrate and sustain data collections over long periods of time (Daston 2017).

This work culminated in the application of computational technologies, modelling tools and statistical methods to big data (Porter 1995; Humphreys 2004; Edwards 2010), increasingly pushing the boundaries of data analytics thanks to supervised learning, model fitting, deep neural networks, search and optimisation methods, complex data visualisations and various other tools now associated with artificial intelligence. Many of these tools are based on algorithms whose functioning and results are tested against specific data samples (a process called “training”). These algorithms are programmed to “learn” from each interaction with novel data: in other words, they have the capacity to change themselves in response to new information being inputted into the system, thus becoming more attuned to the phenomena they are analysing and improving their ability to predict future behaviour. The scope and extent of such changes is shaped by the assumptions used to build the algorithms and the capability of related software and hardware to identify, access and process information of relevance to the learning in question. There is however a degree of unpredictability and opacity to these systems, which can evolve to the point of defying human understanding (more on this below).
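The incremental character of such “learning” can be made concrete with a toy sketch. The following Python example (the data stream and the perceptron-style update rule are illustrative assumptions, not a reconstruction of any particular system discussed here) shows a classifier whose internal state is revised every time a new observation contradicts its current prediction:

```python
# Toy sketch of incremental "learning": an online perceptron that revises its
# weights whenever a new observation contradicts its current prediction.
# The data stream and update rule are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stream of observations: two features, binary label in {-1, +1}.
X = rng.normal(size=(500, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)

w = np.zeros(2)   # the model's current state: a weight vector
b = 0.0           # bias term
updates = 0

for x_i, y_i in zip(X, y):
    prediction = np.sign(w @ x_i + b) or 1.0   # current guess (treat 0 as +1)
    if prediction != y_i:                      # mistake: adjust the model
        w += y_i * x_i                         # shift the decision boundary
        b += y_i
        updates += 1

print(f"weights after one pass over the stream: {w}, corrections made: {updates}")
```

Each correction leaves the algorithm in a different state than before, which is the minimal sense in which such systems “change themselves in response to new information”.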

New institutions, communication platforms and regulatory frameworks also emerged to assemble, prepare and maintain data for such uses (Kitchin 2014), such as various forms of digital data infrastructures, organisations aiming to coordinate and improve the global data landscape (e.g., the Research Data Alliance), and novel measures for data protection, like the General Data Protection Regulation launched in 2017 by the European Union. Together, these techniques and institutions afford the opportunity to assemble and interpret data at a much broader scale, while also promising to deliver finer levels of granularity in data analysis. [ 1 ] They increase the scope of any investigation by making it possible for researchers to link their own findings to those of countless others across the world, both within and beyond the academic sphere. By enhancing the mobility of data, they facilitate their repurposing for a variety of goals that may have been unforeseeable when the data were originally generated. And by transforming the role of data within research, they heighten their status as valuable research outputs in and of themselves. These technological and methodological developments have significant implications for philosophical conceptualisations of data, inferential processes and scientific knowledge, as well as for how research is conducted, organised, governed and assessed. It is to these philosophical concerns that I now turn.

Big data are often associated with the idea of data-driven research, where learning happens through the accumulation of data and the application of methods to extract meaningful patterns from those data. Within data-driven inquiry, researchers are expected to use data as their starting point for inductive inference, without relying on theoretical preconceptions—a situation described by advocates as “the end of theory”, in contrast to theory-driven approaches where research consists of testing a hypothesis (Anderson 2008, Hey et al. 2009). In principle at least, big data constitute the largest pool of data ever assembled and thus a strong starting point to search for correlations (Mayer-Schönberger & Cukier 2013). Crucial to the credibility of the data-driven approach is the efficacy of the methods used to extrapolate patterns from data and evaluate whether or not such patterns are meaningful, and what “meaning” may involve in the first place. Hence, some philosophers and data scholars have argued that

the most important and distinctive characteristic of Big Data [is] its use of statistical methods and computational means of analysis, (Symons & Alvarado 2016: 4)

such as for instance machine learning tools, deep neural networks and other “intelligent” practices of data handling.

The emphasis on statistics as key adjudicator of validity and reliability of patterns extracted from data is not novel. Exponents of logical empiricism looked for logically watertight methods to secure and justify inference from data, and their efforts to develop a theory of probability proceeded in parallel with the entrenchment of statistical reasoning in the sciences in the first half of the twentieth century (Romeijn 2017). In the early 1960s, Patrick Suppes offered a seminal link between statistical methods and the philosophy of science through his work on the production and interpretation of data models. As a philosopher deeply embedded in experimental practice, Suppes was interested in the means and motivations of key statistical procedures for data analysis such as data reduction and curve fitting. He argued that once data are adequately prepared for statistical modelling, all the concerns and choices that motivated data processing become irrelevant to their analysis and interpretation. This inspired him to differentiate between models of theory, models of experiment and models of data, noting that such different components of inquiry are governed by different logics and cannot be compared in a straightforward way. For instance,

the precise definition of models of the data for any given experiment requires that there be a theory of the data in the sense of the experimental procedure, as well as in the ordinary sense of the empirical theory of the phenomena being studied. (Suppes 1962: 253)

Suppes viewed data models as necessarily statistical: that is, as objects

designed to incorporate all the information about the experiment which can be used in statistical tests of the adequacy of the theory. (Suppes 1962: 258)

His formal definition of data models reflects this decision, with statistical requirements such as homogeneity, stationarity and order identified as the ultimate criteria to identify a data model Z and evaluate its adequacy:

Z is an N-fold model of the data for experiment Y if and only if there is a set Y and a probability measure P on subsets of Y such that \(Y = \langle Y, P\rangle\) is a model of the theory of the experiment, Z is an N-tuple of elements of Y , and Z satisfies the statistical tests of homogeneity, stationarity and order. (1962: 259)

This analysis of data models portrayed statistical methods as key conduits between data and theory, and hence as crucial components of inferential reasoning.
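As a rough illustration of the statistical criteria Suppes invokes, the sketch below runs simple chi-square checks for homogeneity (across subjects), stationarity (across early and late trials) and order (lag-one dependence) on simulated binary responses. The experiment, the specific tests and the data are assumptions made for demonstration only; they are not Suppes’ own procedures.

```python
# Rough illustration (not Suppes' own procedures) of the kinds of statistical
# checks--homogeneity, stationarity, order--that his definition treats as
# criteria for a data model. Experiment, data and tests are simulated
# assumptions for demonstration only.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)

# Simulated experiment: 4 subjects, 200 binary responses each.
responses = rng.binomial(1, 0.6, size=(4, 200))

def chi_square_p(table):
    """p-value of a chi-square test on a two-way table of counts."""
    chi2, p, dof, expected = chi2_contingency(np.asarray(table))
    return p

# Homogeneity: do all subjects share the same response rate?
homogeneity = [[r.sum(), r.size - r.sum()] for r in responses]

# Stationarity: does the response rate drift between early and late trials?
early, late = responses[:, :100].ravel(), responses[:, 100:].ravel()
stationarity = [[early.sum(), early.size - early.sum()],
                [late.sum(), late.size - late.sum()]]

# Order: are consecutive responses independent of one another (lag-1 table)?
order = np.zeros((2, 2))
for r in responses:                       # count transitions within each subject
    for a, b in zip(r[:-1], r[1:]):
        order[a, b] += 1

for name, table in [("homogeneity", homogeneity),
                    ("stationarity", stationarity),
                    ("order", order)]:
    print(f"{name}: p = {chi_square_p(table):.3f}")
```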

The focus on statistics as entry point to discussions of inference from data was widely promoted in subsequent philosophical work. Prominent examples include Deborah Mayo, who in her book Error and the Growth of Experimental Knowledge asked:

What should be included in data models? The overriding constraint is the need for data models that permit the statistical assessment of fit (between prediction and actual data); (Mayo 1996: 136)

and Bas van Fraassen, who also embraced the idea of data models as “summarizing relative frequencies found in data” (Van Fraassen 2008: 167). Closely related is the emphasis on statistics as means to detect error within datasets in relation to specific hypotheses, most prominently endorsed by the error-statistical approach to inference championed by Mayo and Aris Spanos (Mayo & Spanos 2009a). This approach aligns with the emphasis on computational methods for data analysis within big data research, and supports the idea that the better the inferential tools and methods, the better the chance to extract reliable knowledge from data.

When it comes to addressing methodological challenges arising from the computational analysis of big data, however, statistical expertise needs to be complemented by computational savvy in the training and application of algorithms associated to artificial intelligence, including machine learning but also other mathematical procedures for operating upon data (Bringsjord & Govindarajulu 2018). Consider for instance the problem of overfitting, i.e., the mistaken identification of patterns in a dataset, which can be greatly amplified by the training techniques employed by machine learning algorithms. There is no guarantee that an algorithm trained to successfully extrapolate patterns from a given dataset will be as successful when applied to other data. Common approaches to this problem involve the re-ordering and partitioning of both data and training methods, so that it is possible to compare the application of the same algorithms to different subsets of the data (“cross-validation”), combine predictions arising from differently trained algorithms (“ensembling”) or use hyperparameters (parameters whose value is set prior to data training) to prepare the data for analysis.
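The following sketch illustrates the overfitting problem and the cross-validation check just described, using scikit-learn on synthetic data; the choice of models, dataset and parameter values is an illustrative assumption, not a recommended analysis pipeline.

```python
# Minimal sketch of overfitting and the cross-validation remedy described
# above, using scikit-learn on synthetic data. Models, dataset and parameter
# values are illustrative assumptions only.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset: 300 samples, 20 features, only 3 of which are informative.
X, y = make_classification(n_samples=300, n_features=20, n_informative=3,
                           random_state=0)

# An unconstrained tree can memorise its training data (fitting noise as well
# as signal); max_depth is a hyperparameter fixed before training.
models = {
    "unconstrained tree": DecisionTreeClassifier(random_state=0),
    "depth-3 tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

for name, model in models.items():
    train_acc = model.fit(X, y).score(X, y)              # accuracy on the data it has seen
    cv_acc = cross_val_score(model, X, y, cv=5).mean()   # accuracy on held-out folds
    print(f"{name}: training accuracy {train_acc:.2f}, cross-validated accuracy {cv_acc:.2f}")
```

The unconstrained model typically scores perfectly on the data it was trained on while doing worse on held-out folds, which is precisely the gap that cross-validation is designed to expose.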

Handling these issues, in turn, requires

familiarity with the mathematical operations in question, their implementations in code, and the hardware architectures underlying such implementations. (Lowrie 2017: 3)

For instance, machine learning

aims to build programs that develop their own analytic or descriptive approaches to a body of data, rather than employing ready-made solutions such as rule-based deduction or the regressions of more traditional statistics. (Lowrie 2017: 4)

In other words, statistics and mathematics need to be complemented by expertise in programming and computer engineering. The ensemble of skills thus construed results in a specific epistemological approach to research, which is broadly characterised by an emphasis on the means of inquiry as the most significant driver of research goals and outputs. This approach, which Sabina Leonelli characterised as data-centric , involves “focusing more on the processes through which research is carried out than on its ultimate outcomes” (Leonelli 2016: 170). In this view, procedures, techniques, methods, software and hardware are the prime motors of inquiry and the chief influence on its outcomes. Focusing more specifically on computational systems, John Symons and Jack Horner argued that much of big data research consists of software-intensive science rather than data-driven research: that is, science that depends on software for its design, development, deployment and use, and thus encompasses procedures, types of reasoning and errors that are unique to software, such as for example the problems generated by attempts to map real-world quantities to discrete-state machines, or approximating numerical operations (Symons & Horner 2014: 473). Software-intensive science is arguably supported by an algorithmic rationality focused on the feasibility, practicality and efficiency of algorithms, which is typically assessed by reference to concrete situations of inquiry (Lowrie 2017).

Algorithms are enormously varied in their mathematical structures and underpinning conceptual commitments, and more philosophical work needs to be carried out on the specifics of computational tools and software used in data science and related applications—with emerging work in philosophy of computer science providing an excellent way forward (Turner & Angius 2019). Nevertheless, it is clear that whether or not a given algorithm successfully applies to the data at hand depends on factors that cannot be controlled through statistical or even computational methods: for instance, the size, structure and format of the data, the nature of the classifiers used to partition the data, the complexity of decision boundaries and the very goals of the investigation.

In a forceful critique informed by the philosophy of mathematics, Christian Calude and Giuseppe Longo argued that there is a fundamental problem with the assumption that more data will necessarily yield more information:

very large databases have to contain arbitrary correlations. These correlations appear only due to the size, not the nature, of data. (Calude & Longo 2017: 595)

They conclude that big data analysis is by definition unable to distinguish spurious from meaningful correlations and is therefore a threat to scientific research. A related worry, sometimes dubbed “the curse of dimensionality” by data scientists, concerns the extent to which the analysis of a given dataset can be scaled up in complexity and in the number of variables being considered. It is well known that the more dimensions one considers in classifying samples, for example, the larger the dataset required for generalisations across those dimensions to be accurate. This demonstrates the continuing, tight dependence between the volume and quality of data on the one hand, and the type and breadth of research questions for which data need to serve as evidence on the other hand.
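The point about size-driven correlations can be illustrated in a few lines of code: purely random, mutually independent variables will, once numerous enough, exhibit strong pairwise correlations by chance alone (the sample and variable counts below are arbitrary assumptions).

```python
# Toy illustration of the claim that sheer size manufactures correlations:
# with enough purely random variables, some pairs correlate strongly by
# chance alone. Sample and variable counts are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(42)
n_samples = 50

for n_vars in (10, 100, 1000):
    data = rng.normal(size=(n_samples, n_vars))        # independent noise by construction
    corr = np.corrcoef(data, rowvar=False)             # all pairwise correlations
    off_diagonal = np.abs(corr[np.triu_indices(n_vars, k=1)])
    print(f"{n_vars:>4} variables: strongest chance correlation = {off_diagonal.max():.2f}")
```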

Determining the fit between inferential methods and data requires high levels of expertise and contextual judgement (a situation known within machine learning as the “no free lunch theorem”). Indeed, overreliance on software for inference and data modelling can yield highly problematic results. Symons and Horner note that the use of complex software in big data analysis makes margins of error unknowable, because there is no clear way to test them statistically (Symons & Horner 2014: 473). The path complexity of programs with high conditionality imposes limits on standard error correction techniques. As a consequence, there is no effective method for characterising the error distribution in the software except by testing all paths in the code, which is unrealistic and intractable in the vast majority of cases due to the complexity of the code.
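A back-of-the-envelope calculation conveys why exhaustive path testing is intractable: with k independent two-way branches a program has up to 2^k execution paths, so even modestly sized programs outstrip any realistic testing budget (the throughput figure below is a purely hypothetical assumption).

```python
# Back-of-the-envelope arithmetic: a program with k independent two-way
# branches has up to 2**k execution paths. The assumed throughput of one
# million path-executions per second is hypothetical.
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

for k in (10, 50, 100, 300):
    paths = 2 ** k
    years = paths / 1_000_000 / SECONDS_PER_YEAR
    print(f"{k:>3} branch points: {paths:.2e} paths, about {years:.2e} years to execute them all")
```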

Rather than acting as a substitute, the effective and responsible use of artificial intelligence tools in big data analysis requires the strategic exercise of human intelligence—but for this to happen, AI systems applied to big data need to be accessible to scrutiny and modification. Whether or not this is the case, and who is best qualified to exercise such scrutiny, is under dispute. Thomas Nickles argued that the increasingly complex and distributed algorithms used for data analysis follow in the footsteps of long-standing scientific attempts to transcend the limits of human cognition. The resulting epistemic systems may no longer be intelligible to humans: an “alien intelligence” within which “human abilities are no longer the ultimate criteria of epistemic success” (Nickles forthcoming). Such unbound cognition holds the promise of enabling powerful inferential reasoning from previously unimaginable volumes of data. The difficulties in contextualising and scrutinising such reasoning, however, cast doubt on the reliability of the results. It is not only machine learning algorithms that are becoming increasingly inaccessible to evaluation: beyond the complexities of programming code, computational data analysis requires a whole ecosystem of classifications, models, networks and inference tools which typically have different histories and purposes, and whose relation to each other—and effects when they are used together—are far from understood and may well be untraceable.

This raises the question of whether the knowledge produced by such data analytic systems is at all intelligible to humans, and if so, what forms of intelligibility it yields. It is certainly the case that deriving knowledge from big data may not involve an increase in human understanding, especially if understanding is understood as an epistemic skill (de Regt 2017). This may not be a problem to those who await the rise of a new species of intelligent machines, who may master new cognitive tools in a way that humans cannot. But as Nickles, Nicholas Rescher (1984), Werner Callebaut (2012) and others pointed out, even in that case “we would not have arrived at perspective-free science” (Nickles forthcoming). While the human histories and assumptions interwoven into these systems may be hard to disentangle, they still affect their outcomes; and whether or not these processes of inquiry are open to critical scrutiny, their telos, implications and significance for life on the planet arguably should be. As argued by Dan McQuillan (2018), the increasing automation of big data analytics may foster acceptance of a Neoplatonist machinic metaphysics , within which mathematical structures “uncovered” by AI would trump any appeal to human experience. Luciano Floridi echoes this intuition in his analysis of what he calls the infosphere :

The great opportunities offered by Information and Communication Technologies come with a huge intellectual responsibility to understand them and take advantage of them in the right way. (2014: vii)

These considerations parallel Paul Humphreys’s long-standing critique of computer simulations as epistemically opaque (Humphreys 2004, 2009)—and particularly his definition of what he calls essential epistemic opacity:

A process is essentially epistemically opaque to X if and only if it is impossible, given the nature of X, for X to know all of the epistemically relevant elements of the process. (Humphreys 2009: 618)

Different facets of the general problem of epistemic opacity are stressed within the vast philosophical scholarship on the role of modelling, computing and simulations in the sciences: the implications of lacking experimental access to the concrete parts of the world being modelled, for instance (Morgan 2005; Parker 2009; Radder 2009); the difficulties in testing the reliability of computational methods used within simulations (Winsberg 2010; Morrison 2015); the relation between opacity and justification (Durán & Formanek 2018); the forms of black-boxing associated to mechanistic reasoning implemented in computational analysis (Craver and Darden 2013; Bechtel 2016); and the debate over the intrinsic limits of computational approaches and related expertise (Collins 1990; Dreyfus 1992). Roman Frigg and Julian Reiss argued that such issues do not constitute fundamental challenges to the nature of inquiry and modelling, and in fact exist in a continuum with traditional methodological issues well-known within the sciences (Frigg & Reiss 2009). Whether or not one agrees with this position (Humphreys 2009; Beisbart 2012), big data analysis is clearly pushing computational and statistical methods to their limit, thus highlighting the boundaries to what even technologically augmented human beings are capable of knowing and understanding.

Research on big data analysis thus sheds light on elements of the research process that cannot be fully controlled, rationalised or even considered through recourse to formal tools.

One such element is the work required to present empirical data in a machine-readable format that is compatible with the software and analytic tools at hand. Data need to be selected, cleaned and prepared to be subjected to statistical and computational analysis. The processes involved in separating data from noise, clustering data so that it is tractable, and integrating data of different formats turn out to be highly sophisticated and theoretically structured, as demonstrated for instance by James McAllister’s (1997, 2007, 2011) and Uljana Feest’s (2011) work on data patterns, Marcel Boumans’s and Leonelli’s comparison of clustering principles across fields (forthcoming), and James Griesemer’s (forthcoming) and Mary Morgan’s (forthcoming) analyses of the peculiarities of datasets. Suppes was so concerned by what he called the “bewildering complexity” of data production and processing activities, that he worried that philosophers would not appreciate the ways in which statistics can and does help scientists to abstract data away from such complexity. He described the large group of research components and activities used to prepare data for modelling as “pragmatic aspects” encompassing “every intuitive consideration of experimental design that involved no formal statistics” (Suppes 1962: 258), and positioned them as the lowest step of his hierarchy of models—at the opposite end of its pinnacle, which are models of theory. Despite recent efforts to rehabilitate the methodology of inductive-statistical modelling and inference (Mayo & Spanos 2009b), this approach has been shared by many philosophers who regard processes of data production and processing as so chaotic as to defy systematic analysis. This explains why data have received so little consideration in philosophy of science when compared to models and theory.

The question of how data are defined and identified, however, is crucial for understanding the role of big data in scientific research. Let us now consider two philosophical views—the representational view and the relational view —that are both compatible with the emergence of big data, and yet place emphasis on different aspects of that phenomenon, with significant implications for understanding the role of data within inferential reasoning and, as we shall see in the next section, as evidence. The representational view construes data as reliable representations of reality which are produced via the interaction between humans and the world. The interactions that generate data can take place in any social setting regardless of research purposes. Examples range from a biologist measuring the circumference of a cell in the lab and noting the result in an Excel file, to a teacher counting the number of students in her class and transcribing it in the class register. What counts as data in these interactions are the objects created in the process of description and/or measurement of the world. These objects can be digital (the Excel file) or physical (the class register) and form a footprint of a specific interaction with the natural world. This footprint—“trace” or “mark”, in the words of Ian Hacking (1992) and Hans-Jörg Rheinberger (2011), respectively—constitutes a crucial reference point for analytic study and for the extraction of new insights. This is the reason why data forms a legitimate foundation to empirical knowledge: the production of data is equivalent to “capturing” features of the world that can be used for systematic study. According to the representative approach, data are objects with fixed and unchangeable content, whose meaning, in virtue of being representations of reality, needs to be investigated and revealed step-by-step through adequate inferential methods. The data documenting cell shape can be modelled to test the relevance of shape to the elasticity, permeability and resilience of cells, producing an evidence base to understand cell-to-cell signalling and development. The data produced counting students in class can be aggregated with similar data collected in other schools, producing an evidence base to evaluate the density of students in the area and their school attendance frequency.

This reflects the intuition that data, especially when they come in the form of numerical measurements or images such as photographs, somehow mirror the phenomena that they are created to document, thus providing a snapshot of those phenomena that is amenable to study under the controlled conditions of research. It also reflects the idea of data as “raw” products of research, which are as close as it gets to unmediated knowledge of reality. This makes sense of the truth-value sometimes assigned to data as irrefutable sources of evidence—the Popperian idea that if data are found to support a given claim, then that claim is corroborated as true at least as long as no other data are found to disprove it. Data in this view represent an objective foundation for the acquisition of knowledge and this very objectivity—the ability to derive knowledge from human experience while transcending it—is what makes knowledge empirical. This position is well-aligned with the idea that big data is valuable to science because it facilitates the (broadly understood) inductive accumulation of knowledge: gathering data collected via reliable methods produces a mountain of facts ready to be analysed and, the more facts are produced and connected with each other, the more knowledge can be extracted.

Philosophers have long acknowledged that data do not speak for themselves and different types of data require different tools for analysis and preparation to be interpreted (Bogen 2009 [2013]). According to the representative view, there are correct and incorrect ways of interpreting data, which those responsible for data analysis need to uncover. But what is a “correct” interpretation in the realm of big data, where data are consistently treated as mobile entities that can, at least in principle, be reused in countless different ways and towards different objectives? Perhaps more than at any other time in the history of science, the current mobilisation and re-use of big data highlights the degree to which data interpretation—and with it, whatever data is taken to represent—may differ depending on the conceptual, material and social conditions of inquiry. The analysis of how big data travels across contexts shows that the expectations and abilities of those involved determine not only the way data are interpreted, but also what is regarded as “data” in the first place (Leonelli & Tempini forthcoming). The representative view of data as objects with fixed and contextually independent meaning is at odds with these observations.

An alternative approach is to embrace these findings and abandon the idea of data as fixed representations of reality altogether. Within the relational view, data are objects that are treated as potential or actual evidence for scientific claims in ways that can, at least in principle, be scrutinised and accounted for (Leonelli 2016). The meaning assigned to data depends on their provenance, their physical features and what these features are taken to represent, and the motivations and instruments used to visualise them and to defend specific interpretations. The reliability of data thus depends on the credibility and strictness of the processes used to produce and analyse them. The presentation of data; the way they are identified, selected, and included (or excluded) in databases; and the information provided to users to re-contextualise them are fundamental to producing knowledge and significantly influence its content. For instance, changes in data format—as most obviously involved in digitisation, data compression or archival procedures—can have a significant impact on where, when, and by whom the data are used as a source of knowledge.

This framework acknowledges that any object can be used as a datum, or stop being used as such, depending on the circumstances—a consideration familiar to big data analysts used to pick and mix data coming from a vast variety of sources. The relational view also explains how, depending on the research perspective interpreting it, the same dataset may be used to represent different aspects of the world (“phenomena” as famously characterised by James Bogen and James Woodward, 1988). When considering the full cycle of scientific inquiry from the viewpoint of data production and analysis, it is at the stage of data modelling that a specific representational value is attributed to data (Leonelli 2019b).

The relational view of data encourages attention to the history of data, highlighting their continual evolution and sometimes radical alteration, and the impact of this feature on the power of data to confirm or refute hypotheses. It explains the critical importance of documenting data management and transformation processes, especially with big data that transit far and wide over digital channels and are grouped and interpreted in different ways and formats. It also explains the increasing recognition of the expertise of those who produce, curate, and analyse data as indispensable to the effective interpretation of big data within and beyond the sciences; and the inextricable link between social and ethical concerns around the potential impact of data sharing and scientific concerns around the quality, validity, and security of data (boyd & Crawford 2012; Tempini & Leonelli, 2018).

Depending on which view on data one takes, expectations around what big data can do for science will vary dramatically. The representational view accommodates the idea of big data as providing the most comprehensive, reliable and generative knowledge base ever witnessed in the history of science, by virtue of its sheer size and heterogeneity. The relational view makes no such commitment, focusing instead on what inferences are being drawn from such data at any given point, how and why.

One thing that the representational and relational views agree on is the key epistemic role of data as empirical evidence for knowledge claims or interventions. While there is a large philosophical literature on the nature of evidence (e.g., Achinstein 2001; Reiss 2015; Kelly 2016), however, the relation between data and evidence has received less attention. This is arguably due to an implicit acceptance, by many philosophers, of the representational view of data. Within the representational view, the identification of what counts as data is prior to the study of what those data can be evidence for: in other words, data are “givens”, as the etymology of the word indicates, and inferential methods are responsible for determining whether and how the data available to investigators can be used as evidence, and for what. The focus of philosophical attention is thus on formal methods to single out errors and misleading interpretations, and the probabilistic and/or explanatory relation between what is unproblematically taken to be a body of evidence and a given hypothesis. Hence much of the expansive philosophical work on evidence avoids the term “data” altogether. Peter Achinstein’s seminal work is a case in point: it discusses observed facts and experimental results, and whether and under which conditions scientists would have reasons to believe such facts, but it makes no mention of data and related processing practices (Achinstein 2001).

By contrast, within the relational view an object can only be identified as a datum when it is viewed as having value as evidence. Evidence becomes a category of data identification, rather than a category of data use as in the representational view (Canali 2019). Evidence is thus constitutive of the very notion of data and cannot be disentangled from it. This involves accepting that the conditions under which a given object can serve as evidence (and thus be viewed as a datum) may change; and that should this evidential role stop altogether, the object would revert to being an ordinary, non-datum item. For example, the photograph of a plant taken by a tourist in a remote region may become relevant as evidence for an inquiry into the morphology of plants from that particular locality; yet most photographs of plants are never considered as evidence for an inquiry into the features and functioning of the world, and of those that are, many may subsequently be discarded as uninteresting or no longer pertinent to the questions being asked.

This view accounts for the mobility and repurposing that characterises big data use, and for the possibility that objects that were not originally generated in order to serve as evidence may be subsequently adopted as such. Consider Mayo and Spanos’s “minimal scientific principle for evidence”, which they define as follows:

Data x₀ provide poor evidence for H if they result from a method or procedure that has little or no ability of finding flaws in H, even if H is false. (Mayo & Spanos 2009b)

This principle is compatible with the relational view of data since it incorporates cases where the methods used to generate and process data may not have been geared towards the testing of a hypothesis H: all it asks is that such methods can be made relevant to the testing of H at the point at which data are used as evidence for H (I shall come back to the role of hypotheses in the handling of evidence in the next section).
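
For readers who find a schematic rendering useful, the principle can be glossed roughly as follows. This is a loose probabilistic paraphrase of the prose statement above, not Mayo and Spanos’s own formalism; M simply names the method or procedure that generated the data.

    % A loose probabilistic gloss of the minimal principle for evidence.
    % M is the method or procedure that produced the data x_0; H is the hypothesis under test.
    \[
      \Pr\big(\, M \text{ detects a flaw in } H \mid H \text{ is false} \,\big) \approx 0
      \;\Longrightarrow\;
      x_0 \text{ provide poor evidence for } H.
    \]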

The relational view also highlights the relevance of practices of data formatting and manipulation to the treatment of data as evidence, thus taking attention away from the characteristics of the data objects alone and focusing instead on the agency attached to and enabled by those characteristics. Nora Boyd has provided a way to conceptualise data processing as an integral part of inferential processes, and thus of how we should understand evidence. To this aim she introduced the notion of “line of evidence”, which she defines as:

a sequence of empirical results including the records of data collection and all subsequent products of data processing generated on the way to some final empirical constraint. (Boyd 2018: 406)

She thus proposes a conception of evidence that embraces both data and the way in which data are handled, and indeed emphasises the importance of auxiliary information used when assessing data for interpretation, which includes

the metadata regarding the provenance of the data records and the processing workflow that transforms them. (2018: 407)

As she concludes,

together, a line of evidence and its associated metadata compose what I am calling an “enriched line of evidence”. The evidential corpus is then to be made up of many such enriched lines of evidence. (2018: 407)
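
To make the structure of an “enriched line of evidence” more concrete, the following short Python sketch records it as a simple data structure, keeping raw records, processing steps and provenance metadata together with the empirical constraint they are taken to support. The field names and toy values are hypothetical illustrations, not Boyd’s own formalism.

    from dataclasses import dataclass, field
    from typing import Any, Dict, List

    @dataclass
    class ProcessingStep:
        """One step in the workflow that turns raw records into empirical results."""
        description: str                            # e.g. "normalisation", "outlier removal"
        parameters: Dict[str, Any] = field(default_factory=dict)

    @dataclass
    class EnrichedLineOfEvidence:
        """Illustrative sketch: data records, their processing and their provenance,
        bundled with the empirical constraint they are used to warrant."""
        raw_records: List[Any]                      # records of data collection
        processing: List[ProcessingStep]            # subsequent products of data processing
        provenance: Dict[str, Any]                  # metadata on instruments, time, place, curators
        empirical_constraint: str                   # the claim this line of evidence supports

    # A toy instance: photographs repurposed as evidence about plant morphology.
    line = EnrichedLineOfEvidence(
        raw_records=["photo_001.jpg", "photo_002.jpg"],
        processing=[ProcessingStep("georeferencing"), ProcessingStep("species annotation")],
        provenance={"collector": "tourist", "region": "remote locality", "year": 2014},
        empirical_constraint="leaf shape varies with altitude in this locality",
    )

Keeping the processing workflow and provenance alongside the records, rather than the records alone, is what the qualifier “enriched” is meant to capture.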

The relational view thus fosters a functional and contextualist approach to evidence as the manner through which one or more objects are used as warrant for particular knowledge items (which can be propositional claims, but also actions such as specific decisions or modes of conduct/ways of operating). This chimes with the contextual view of evidence defended by Reiss (2015), John Norton’s work on the multiple, tangled lines of inferential reasoning underpinning appeals to induction (2003), and Hasok Chang’s emphasis on the epistemic activities required to ground evidential claims (2012). Building on these ideas and on Stephen Toulmin’s seminal work on research schemas (1958), Alison Wylie has gone one step further in evaluating the inferential scaffolding that researchers (and particularly archaeologists, who so often are called to re-evaluate the same data as evidence for new claims; Wylie 2017) need to make sense of their data, interpret them in ways that are robust to potential challenges, and modify interpretations in the face of new findings. This analysis enabled Wylie to formulate a set of conditions for robust evidential reasoning, which include epistemic security in the chain of evidence, causal anchoring and causal independence of the data used as evidence, as well as the explicit articulation of the grounds for calibration of the instruments and methods involved (Chapman & Wylie 2016; Wylie forthcoming). A similar conclusion is reached by Jessey Wright’s evaluation of the diverse data analysis techniques that neuroscientists use to make sense of functional magnetic resonance imaging of the brain (fMRI scans):

different data analysis techniques reveal different patterns in the data. Through the use of multiple data analysis techniques, researchers can produce results that are locally robust. (Wright 2017: 1179)

Wylie’s and Wright’s analyses exemplify how a relational approach to data fosters a normative understanding of “good evidence” which is anchored in situated judgement—the arguably human prerogative to contextualise and assess the significance of evidential claims. The advantages of this view of evidence are eloquently expressed by Nancy Cartwright’s critique of both philosophical theories and policy approaches that do not recognise the local and contextual nature of evidential reasoning. As she notes,

we need a concept that can give guidance about what is relevant to consider in deciding on the probability of the hypothesis, not one that requires that we already know significant facts about the probability of the hypothesis on various pieces of evidence. (Cartwright 2013: 6)

Thus she argues for a notion of evidence that is not too restrictive, takes account of the difficulties in combining and selecting evidence, and allows for contextual judgement on what types of evidence are best suited to the inquiry at hand (Cartwright 2013, 2019). Reiss’s proposal of a pragmatic theory of evidence similarly

takes scientific practice [..] seriously, both in terms of its greater use of knowledge about the conditions under which science is practised and in terms of its goal to develop insights that are relevant to practising scientists. (Reiss 2015: 361)

A better characterisation of the relation between data and evidence, predicated on the study of how data are processed and aggregated, may go a long way towards addressing these demands. As aptly argued by James Woodward, the evidential relationship between data and claims is not “a purely formal, logical, or a priori matter” (Woodward 2000: S172–173). This again sits uneasily with the expectation that big data analysis may automate scientific discovery and make human judgement redundant.

Let us now return to the idea of data-driven inquiry, often suggested as a counterpoint to hypothesis-driven science (e.g., Hey et al. 2009). Kevin Elliott and colleagues have offered a brief history of hypothesis-driven inquiry (Elliott et al. 2016), emphasising how scientific institutions (including funding programmes and publication venues) have pushed researchers towards a Popperian conceptualisation of inquiry as the formulation and testing of a strong hypothesis. Big data analysis clearly points to a different and arguably Baconian understanding of the role of hypothesis in science. Theoretical expectations are no longer seen as driving the process of inquiry, and empirical input is recognised as primary in determining the direction of research and the phenomena—and related hypotheses—considered by researchers.

The emphasis on data as a central component of research poses a significant challenge to one of the best-established philosophical views on scientific knowledge. According to this view, which I shall label the theory-centric view of science, scientific knowledge consists of justified true beliefs about the world. These beliefs are obtained through empirical methods aiming to test the validity and reliability of statements that describe or explain aspects of reality. Hence scientific knowledge is conceptualised as inherently propositional: what counts as an output are claims published in books and journals, which are also typically presented as solutions to hypothesis-driven inquiry. This view acknowledges the significance of methods, data, models, instruments and materials within scientific investigations, but ultimately regards them as means towards one end: the achievement of true claims about the world. Reichenbach’s seminal distinction between contexts of discovery and justification exemplifies this position (Reichenbach 1938). Theory-centrism recognises research components such as data and related practical skills as essential to discovery, and more specifically to the messy, irrational part of scientific work that involves value judgements, trial-and-error, intuition and exploration and within which the very phenomena to be investigated may not have been stabilised. The justification of claims, by contrast, involves the rational reconstruction of the research that has been performed, so that it conforms to established norms of inferential reasoning. Importantly, within the context of justification, only data that support the claims of interest are explicitly reported and discussed: everything else—including the vast majority of data produced in the course of inquiry—is lost to the chaotic context of discovery. [ 2 ]

Much recent philosophy of science, and particularly work on modelling and experimentation, has challenged theory-centrism by highlighting the role of models, methods and modes of intervention as research outputs rather than simple tools, and stressing the importance of expanding philosophical understandings of scientific knowledge to include these elements alongside propositional claims. The rise of big data offers another opportunity to reframe understandings of scientific knowledge as not necessarily centred on theories and to include non-propositional components—thus, in Cartwright’s paraphrase of Gilbert Ryle’s famous distinction, refocusing on knowing-how over knowing-that (Cartwright 2019). One way to construe data-centric methods is indeed to embrace a conception of knowledge as ability, such as that promoted by early pragmatists like John Dewey and more recently reprised by Chang, who specifically highlighted it as the broader category within which the understanding of knowledge-as-information needs to be placed (Chang 2017).

Another way to interpret the rise of big data is as a vindication of inductivism in the face of the barrage of philosophical criticism levelled against theory-free reasoning over the centuries. For instance, Jon Williamson (2004: 88) has argued that advances in automation, combined with the emergence of big data, lend plausibility to inductivist philosophy of science. Wolfgang Pietsch agrees with this view and has provided a sophisticated framework to understand just what kind of inductive reasoning is instigated by big data and related machine learning methods such as decision trees (Pietsch 2015). Following John Stuart Mill, he calls this approach variational induction and presents it as common to both big data approaches and exploratory experimentation, though the former can handle a much larger number of variables (Pietsch 2015: 913). Pietsch concludes that the problem of theory-ladenness in machine learning can be addressed by determining under which theoretical assumptions variational induction works (2015: 910ff).
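
As a concrete, if highly simplified, illustration of the sort of method at stake, the following Python sketch fits a decision tree to synthetic data with many candidate variables and lets the algorithm, rather than a prior hypothesis, pick out which variables carry a signal. The dataset, variable indices and tree depth are invented for illustration, and the sketch assumes that scikit-learn is available.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    # Synthetic "big data": 5000 samples, 50 candidate variables, no hypothesis in advance.
    X = rng.normal(size=(5000, 50))
    # Unbeknownst to the analyst, only variables 3 and 17 drive the outcome.
    y = ((X[:, 3] > 0.5) & (X[:, 17] < 0.0)).astype(int)

    # Variational-style induction in miniature: the tree searches over combinations of
    # variables and reports which ones it found informative.
    tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
    top_variables = np.argsort(tree.feature_importances_)[::-1][:5]
    print("Variables ranked as most informative:", top_variables)

Whether such variable selection counts as theory-free, or as quietly presupposing the assumptions under which variational induction works, is precisely what is at issue here.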

Others are less inclined to see theory-ladenness as a problem that can be mitigated by data-intensive methods, and rather see it as a constitutive part of the process of empirical inquiry. Harking back to the extensive literature on perspectivism and experimentation (Gooding 1990; Giere 2006; Radder 2006; Massimi 2012), Werner Callebaut has forcefully argued that the most sophisticated and standardised measurements embody a specific theoretical perspective, and this is no less true of big data (Callebaut 2012). Elliott and colleagues emphasise that conceptualising big data analysis as atheoretical risks encouraging unsophisticated attitudes to empirical investigation as a

“fishing expedition”, having a high probability of leading to nonsense results or spurious correlations, being reliant on scientists who do not have adequate expertise in data analysis, and yielding data biased by the mode of collection. (Elliott et al. 2016: 880)
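
The worry about spurious correlations is easy to reproduce computationally. In the following Python sketch, which uses purely random synthetic data, screening a large number of candidate predictors against an outcome yields dozens of apparently notable correlations by chance alone; the sample size and threshold are arbitrary choices for illustration.

    import numpy as np

    rng = np.random.default_rng(1)

    # 200 observations of an outcome and 1000 candidate predictors, all pure noise.
    outcome = rng.normal(size=200)
    candidates = rng.normal(size=(200, 1000))

    # Correlate every candidate with the outcome and count the "notable" associations.
    correlations = np.array([np.corrcoef(candidates[:, j], outcome)[0, 1]
                             for j in range(candidates.shape[1])])
    print("Candidates with |r| > 0.15:", int(np.sum(np.abs(correlations) > 0.15)))

Nothing in these data warrants treating any of the resulting associations as meaningful; only knowledge of how they were generated does.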

To address related worries in genetic analysis, Ken Waters has provided the useful characterisation of “theory-informed” inquiry (Waters 2007), which can be invoked to stress how theory informs the methods used to extract meaningful patterns from big data, and yet does not necessarily determine either the starting point or the outcomes of data-intensive science. This does not resolve the question of what role theory actually plays. Rob Kitchin (2014) has proposed to see big data as linked to a new mode of hypothesis generation within a hypothetico-deductive framework. Leonelli is more sceptical of attempts to match big data approaches, which are many and diverse, with a specific type of inferential logic. She focuses instead on the extent to which the theoretical apparatus at work within big data analysis rests on conceptual decisions about how to order and classify data, and proposes that such decisions can give rise to a particular form of theorization, which she calls classificatory theory (Leonelli 2016).

These disagreements point to big data as eliciting diverse understandings of the nature of knowledge and inquiry, and the complex iterations through which different inferential methods build on each other. Again, in the words of Elliott and colleagues,

attempting to draw a sharp distinction between hypothesis-driven and data-intensive science is misleading; these modes of research are not in fact orthogonal and often intertwine in actual scientific practice. (Elliott et al. 2016: 881, see also O’Malley et al. 2009, Elliott 2012)

Another epistemological debate strongly linked to reflection on big data concerns the specific kinds of knowledge emerging from data-centric forms of inquiry, and particularly the relation between predictive and causal knowledge.

Big data science is widely seen as revolutionary in the scale and power of predictions that it can support. Unsurprisingly perhaps, a philosophically sophisticated defence of this position comes from the philosophy of mathematics, where Marco Panza, Domenico Napoletani and Daniele Struppa have argued for big data science as occasioning a momentous shift in the predictive knowledge that mathematical analysis can yield, and thus in its role within broader processes of knowledge production. The whole point of big data analysis, they posit, is its disregard for causal knowledge:

answers are found through a process of automatic fitting of the data to models that do not carry any structural understanding beyond the actual solution of the problem itself. (Napoletani, Panza, & Struppa 2014: 486)

This view differs from simplistic popular discourse on “the death of theory” (Anderson 2008) and the “power of correlations” (Mayer-Schönberger and Cukier 2013) insofar as it does not side-step the constraints associated with knowledge and generalisations that can be extracted from big data analysis. Napoletani, Panza and Struppa recognise that there are inescapable tensions around the ability of mathematical reasoning to overdetermine empirical input, to the point of providing a justification for any and every possible interpretation of the data. In their words,

the problem arises of how we can gain meaningful understanding of historical phenomena, given the tremendous potential variability of their developmental processes. (Napoletani et al. 2014: 487)

Their solution is to clarify that understanding phenomena is not the goal of predictive reasoning, which is rather a form of agnostic science: “the possibility of forecasting and analysing without a structured and general understanding” (Napoletani et al. 2011: 12). The opacity of algorithmic rationality thus becomes its key virtue and the reason for the extraordinary epistemic success of forecasting grounded on big data. While “the phenomenon may forever remain hidden to our understanding” (ibid.: 5), the application of mathematical models and algorithms to big data can still provide meaningful and reliable answers to well-specified problems—similarly to what has been argued in the case of false models (Wimsatt 2007). Examples include the use of “forcing” methods such as regularisation or diffusion geometry to facilitate the extraction of useful insights from messy datasets.
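
Regularisation, one of the “forcing” methods just mentioned, can be illustrated with a minimal Python sketch: ridge regression shrinks coefficient estimates so that predictions remain stable even when predictors are numerous and the data noisy, without offering any structural account of the underlying phenomenon. The data are synthetic and the penalty value is an arbitrary choice for illustration.

    import numpy as np

    rng = np.random.default_rng(2)

    # Synthetic, messy data: 100 observations, 80 candidate predictors.
    n, p = 100, 80
    X = rng.normal(size=(n, p))
    true_w = np.zeros(p)
    true_w[:5] = 1.0                                  # only a few predictors matter
    y = X @ true_w + rng.normal(scale=2.0, size=n)    # plus substantial noise

    lam = 10.0  # regularisation strength, chosen arbitrarily here

    # Ordinary least squares vs ridge regression (closed-form solutions).
    w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    # The "forcing" effect: ridge coefficients are pulled towards zero, trading
    # fidelity to this particular sample for stability of prediction.
    print(f"largest |coefficient|, OLS  : {np.abs(w_ols).max():.2f}")
    print(f"largest |coefficient|, ridge: {np.abs(w_ridge).max():.2f}")

On the agnostic-science view, the justification for such shrinkage is predictive success on well-specified problems rather than structural insight into the phenomenon.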

This view is at odds with accounts that posit scientific understanding as a key aim of science (de Regt 2017), and the intuition that what researchers are ultimately interested in is

whether the opaque data-model generated by machine-learning technologies count as explanations for the relationships found between input and output. (Boon 2020: 44)

Within the philosophy of biology, for example, it is well recognised that big data facilitates effective extraction of patterns and trends, and that being able to model and predict how an organism or ecosystem may behave in the future is of great importance, particularly within more applied fields such as biomedicine or conservation science. At the same time, researchers are interested in understanding the reasons for observed correlations, and typically use predictive patterns as heuristics to explore, develop and verify causal claims about the structure and functioning of entities and processes. Emanuele Ratti (2015) has argued that big data mining within genome-wide association studies often used in cancer genomics can actually underpin mechanistic reasoning, for instance by supporting eliminative inference to develop mechanistic hypotheses and by helping to explore and evaluate generalisations used to analyse the data. In a similar vein, Pietsch (2016) proposed to use variational induction as a method to establish what counts as causal relationships among big data patterns, by focusing on which analytic strategies allow for reliable prediction and effective manipulation of a phenomenon.

Through the study of data sourcing and processing in epidemiology, Stefano Canali has instead highlighted the difficulties of deriving mechanistic claims from big data analysis, particularly where data are varied and embody incompatible perspectives and methodological approaches (Canali 2016, 2019). Relatedly, the semantic and logistical challenges of organising big data give reason to doubt the reliability of causal claims extracted from such data. In terms of logistics, having a lot of data is not the same as having all of them, and cultivating illusions of comprehensiveness is a risky and potentially misleading strategy, particularly given the challenges encountered in developing and applying curatorial standards for data other than the high-throughput results of “omics” approaches (see also the next section). The constant worry about the partiality and reliability of data is reflected in the care put by database curators in enabling database users to assess such properties; and in the importance given by researchers themselves, particularly in the biological and environmental sciences, to evaluating the quality of data found on the internet (Leonelli 2014; Fleming et al. 2017). In terms of semantics, we are back to the role of data classifications as theoretical scaffolding for big data analysis that we discussed in the previous section. Taxonomic efforts to order and visualise data inform causal reasoning extracted from such data (Sterner & Franz 2017), and can themselves constitute a bottom-up method—grounded in comparative reasoning—for assigning meaning to data models, particularly in situations where a full-blown theory or explanation for the phenomenon under investigation is not available (Sterner 2014).

It is no coincidence that much philosophical work on the relation between causal and predictive knowledge extracted from big data comes from the philosophy of the life sciences, where the absence of axiomatized theories has elicited sophisticated views on the diversity of forms and functions of theory within inferential reasoning. Moreover, biological data are heterogeneous both in their content and in their format; are curated and re-purposed to address the needs of highly disparate and fragmented epistemic communities; and present curators with specific challenges to do with tracking complex, diverse and evolving organismal structures and behaviours, whose relation to an ever-changing environment is hard to pinpoint with any stability (e.g., Shavit & Griesemer 2009). Hence in this domain, some of the core methods and epistemic concerns of experimental research—including exploratory experimentation, sampling and the search for causal mechanisms—remain crucial parts of data-centric inquiry.

At the start of this entry I listed “value” as a major characteristic of big data and pointed to the crucial role of valuing procedures in identifying, processing, modelling and interpreting data as evidence. Identifying and negotiating different forms of data value is an unavoidable part of big data analysis, since these valuation practices determine which data is made available to whom, under which conditions and for which purposes. What researchers choose to consider as reliable data (and data sources) is closely intertwined not only with their research goals and interpretive methods, but also with their approach to data production, packaging, storage and sharing. Thus, researchers need to consider what value their data may have for future research by themselves and others, and how to enhance that value—such as through decisions around which data to make public, how, when and in which format; or, whenever dealing with data already in the public domain (such as personal data on social media), decisions around whether the data should be shared and used at all, and how.

No matter how one conceptualises value practices, it is clear that their key role in data management and analysis prevents facile distinctions between values and “facts” (understood as propositional claims for which data provide evidential warrant). For example, consider a researcher who values both openness—and related practices of widespread data sharing—and scientific rigour—which requires a strict monitoring of the credibility and validity of conditions under which data are interpreted. The scale and manner of big data mobilisation and analysis create tensions between these two values. While the commitment to openness may prompt interest in data sharing, the commitment to rigour may hamper it, since once data are freely circulated online it becomes very difficult to retain control over how they are interpreted, by whom and with which knowledge, skills and tools. How a researcher responds to this conflict affects which data are made available for big data analysis, and under which conditions. Similarly, the extent to which diverse datasets may be triangulated and compared depends on the intellectual property regimes under which the data—and related analytic tools—have been produced. Privately owned data are often unavailable to publicly funded researchers; and many algorithms, cloud systems and computing facilities used in big data analytics are only accessible to those with enough resources to buy relevant access and training. Whatever claims result from big data analysis are, therefore, strongly dependent on social, financial and cultural constraints that condition the data pool and its analysis.

This prominent role of values in shaping data-related epistemic practices is not surprising given existing philosophical critiques of the fact/value distinction (e.g., Douglas 2009); and the existing literature on values in science—such as Helen Longino’s seminal distinction between constitutive and contextual values, as presented in her 1990 book Science as Social Knowledge—may well apply in this case too. Similarly, it is well-established that the technological and social conditions of research strongly condition its design and outcomes. What is particularly worrying in the case of big data is the temptation, prompted by hyped expectations around the power of data analytics, to hide or side-line the valuing choices that underpin the methods, infrastructures and algorithms used for big data extraction.

Consider the use of high-throughput data production tools, which enable researchers to easily generate a large volume of data in formats already geared to computational analysis. Just as in the case of other technologies, researchers have a strong incentive to adopt such tools for data generation; and may do so even in cases where such tools are not good or even appropriate means to pursue the investigation. Ulrich Krohs uses the term convenience experimentation to refer to experimental designs that are adopted not because they are the most appropriate ways of pursuing a given investigation, but because they are easily and widely available and usable, and thus “convenient” means for researchers to pursue their goals (Krohs 2012).

Appeals to convenience can extend to other aspects of data-intensive analysis. Not all data are equally easy to digitally collect, disseminate and link through existing algorithms, which makes some data types and formats more convenient than others for computational analysis. For example, research databases often display the outputs of well-resourced labs within research traditions which deal with “tractable” data formats (such as “omics”). And indeed, the existing distribution of resources, infrastructure and skills determines high levels of inequality in the production, dissemination and use of big data for research. Big players with large financial and technical resources are leading the development and uptake of data analytics tools, leaving much publicly funded research around the world at the receiving end of innovation in this area. Contrary to popular depictions of the data revolution as harbinger of transparency, democracy and social equality, the digital divide between those who can access and use data technologies, and those who cannot, continues to widen. A result of such divides is the scarcity of data relating to certain subgroups and geographical locations, which again limits the comprehensiveness of available data resources.

In the vast ecosystem of big data infrastructures, it is difficult to keep track of such distortions and assess their significance for data interpretation, especially in situations where heterogeneous data sources structured through appeal to different values are mashed together. Thus, the systematic aggregation of convenient datasets and analytic tools over others often results in a big data pool where the relevant sources and forms of bias are impossible to locate and account for (Pasquale 2015; O’Neil 2016; Zuboff 2017; Leonelli 2019a). In such a landscape, arguments for a separation between fact and value—and even a clear distinction between the role of epistemic and non-epistemic values in knowledge production—become very difficult to maintain without discrediting the whole edifice of big data science. Given the extent to which this approach has penetrated research in all domains, it is arguably impossible, however, to critique the value-laden structure of big data science without calling into question the legitimacy of science itself. A more constructive approach is to embrace the extent to which big data science is anchored in human choices, interests and values, and ascertain how this affects philosophical views on knowledge, truth and method.

In closing, it is important to consider at least some of the risks and related ethical questions raised by research with big data. As already mentioned in the previous section, reliance on big data collected by powerful institutions or corporations raises significant social concerns. Contrary to the view that sees big and open data as harbingers of democratic social participation in research, the way that scientific research is governed and financed is not challenged by big data. Rather, the increasing commodification of, and the large value attributed to, certain kinds of data (e.g., personal data) are associated with an increase in inequality of power and visibility between different nations, segments of the population and scientific communities (O’Neil 2016; Zuboff 2017; D’Ignazio and Klein 2020). The gap between those who can not only access data but also use them, and those who cannot, is widening, leading from a state of digital divide to a condition of “data divide” (Bezuidenhout et al. 2017).

Moreover, the privatisation of data has serious implications for the world of research and the knowledge it produces. Firstly, it affects which data are disseminated, and with which expectations. Corporations usually release only the data that they regard as having lesser commercial value and that they need public sector assistance to interpret. This introduces another distortion in the sources and types of data that are accessible online, while more expensive and complex data are kept secret. Even many of the ways in which citizens, researchers included, are encouraged to interact with databases and data interpretation sites tend to foster participation that generates further commercial value. Sociologists have recently described this type of social participation as a form of exploitation (Prainsack & Buyx 2017; Srnicek 2017). In turn, these ways of exploiting data strengthen their economic value over their scientific value. When it comes to the trade in personal data between companies working in analytics, the value of the data as commercial products, which includes the speed and efficiency with which access to certain data can help develop new products, often takes priority over scientific concerns such as the representativeness and reliability of the data and the ways they were analysed. This can result in decisions that are scientifically problematic, or in a lack of interest in investigating the consequences of the assumptions made and the processes used. Such a lack of interest easily translates into ignorance of discrimination, inequality and potential errors in the data considered. This type of ignorance is highly strategic and economically productive, since it enables the use of data without concern for their social and scientific implications. In this scenario the evaluation of the quality of data shrinks to an evaluation of their usefulness for the short-term analyses or forecasting required by the client. There are no incentives in this system to encourage evaluation of the long-term implications of data analysis. The risk is that the commerce of data is accompanied by an increasing divergence between data and their context: the interest in the history of data, in the plurality of their emotional or scientific value and in the re-evaluation of their origins tends to fade over time, substituted by the increasing hold of the financial value of data.

The multiplicity of data sources and tools for aggregation also creates risks. The complexity of the data landscape is making it harder to identify which parts of the infrastructure require updating or have been put in doubt by new scientific developments. The situation worsens when considering the number of databases that populate every area of scientific research, each containing assumptions that influence the circulation and interoperability of data and that are often not updated in a reliable and regular way. To give an idea of the numbers involved, the prestigious journal Nucleic Acids Research publishes an annual special issue on new databases relevant to molecular biology, which included 56 new infrastructures in 2015, 62 in 2016, 54 in 2017 and 82 in 2018. These are just a small proportion of the hundreds of databases developed each year in the life sciences alone. The fact that these databases rely on short-term funding means that a growing percentage of such resources remain available to consult online even though they are long dead. This condition is not always visible to users, who may trust a database without checking whether it is actively maintained. At what point do these infrastructures become obsolete? What are the risks involved in weaving an ever more extensive tapestry of infrastructures that depend on each other, given the disparity in the ways they are managed and the challenges in identifying and comparing their prerequisite conditions, the theories and the scaffolding used to build them? One of these risks is rampant conservatism: the insistence on recycling old data whose features and management become increasingly murky as time goes by, instead of encouraging the production of new data with features that specifically respond to the requirements and circumstances of their users. In disciplines such as biology and medicine, which study living beings that are by definition continually evolving and developing, such trust in old data is particularly alarming. It is not the case, for example, that data collected on fungi ten, twenty or even a hundred years ago are reliable for explaining the behaviour of the same species of fungi now or in the future (Leonelli 2018).

Researchers of what Luciano Floridi calls the infosphere—the way in which the introduction of digital technologies is changing the world—are becoming aware of the destructive potential of big data and of the urgent need to focus efforts on managing and using data in active and thoughtful ways, towards the improvement of the human condition. In Floridi’s own words:

ICT yields great opportunity which, however, entails the enormous intellectual responsibility of understanding this technology to use it in the most appropriate way. (Floridi 2014: vii; see also British Academy & Royal Society 2017)

In light of these findings, it is essential that ethical and social issues are seen as a core part of the technical and scientific requirements associated with data management and analysis. The ethical management of data is not achieved solely by regulating the commerce and management of personal data in research, nor by introducing monitoring of research financing, even though these are important strategies. To guarantee that big data are used in the most scientifically and socially forward-thinking way, it is necessary to transcend the concept of ethics as something external and alien to research. An analysis of the ethical implications of data science should become a basic component of the background and activity of those who take care of data, and of the methods used to view and analyse them. Ethical evaluations and choices are hidden in every aspect of data management, including choices that may seem purely technical.

This entry has stressed how the emerging emphasis on big data signals the rise of a data-centric approach to research, in which efforts to mobilise, integrate, disseminate and visualise data are viewed as central contributions to discovery. The emergence of data-centrism highlights the challenges involved in gathering, classifying and interpreting data, and the concepts, technologies and institutions that surround these processes. Tools such as high-throughput measurement instruments and apps for smartphones are fast generating large volumes of data in digital formats. In principle, these data are immediately available for dissemination through internet platforms, which can make them accessible to anybody with a broadband connection in a matter of seconds. In practice, however, access to data is fraught with conceptual, technical, legal and ethical implications; and even when access can be granted, it does not guarantee that the data can be fruitfully used to spur further research. Furthermore, the mathematical and computational tools developed to analyse big data are often opaque in their functioning and assumptions, leading to results whose scientific meaning and credibility may be difficult to assess. This increases the worry that big data science may be grounded upon, and may ultimately support, a process that makes human ingenuity hostage to an alien, artificial and ultimately unintelligible intelligence.

Perhaps the most confronting aspect of big data science as discussed in this entry is the extent to which it deviates from understandings of rationality grounded on individual agency and cognitive abilities (on which much of contemporary philosophy of science is predicated). The power of any one dataset to yield knowledge lies in the extent to which it can be linked with others: this is what lends high epistemic value to digital objects such as GPS locations or sequencing data, and what makes extensive data aggregation from a variety of sources into a highly effective surveillance tool. Data production and dissemination channels such as social media, governmental databases and research repositories operate in a globalised, interlinked and distributed network, whose functioning requires a wide variety of skills and expertise. The distributed nature of decision-making involved in developing big data infrastructures and analytics makes it impossible for any one individual to retain oversight over the quality, scientific significance and potential social impact of the knowledge being produced.

Big data analysis may therefore constitute the ultimate instance of a distributed cognitive system. Where does this leave accountability questions? Many individuals, groups and institutions end up sharing responsibility for the conceptual interpretation and social outcomes of specific data uses. A key challenge for big data governance is to find mechanisms for allocating responsibilities across this complex network, so that erroneous and unwarranted decisions—as well as outright fraudulent, unethical, abusive, discriminatory or misguided actions—can be singled out, corrected and appropriately sanctioned. Thinking about the complex history, processing and use of data can encourage philosophers to avoid ahistorical, uncontextualized approaches to questions of evidence, and instead consider the methods, skills, technologies and practices involved in handling data—and particularly big data—as crucial to understanding empirical knowledge-making.

  • Achinstein, Peter, 2001, The Book of Evidence , Oxford: Oxford University Press. doi:10.1093/0195143892.001.0001
  • Anderson, Chris, 2008, “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”, Wired Magazine , 23 June 2008.
  • Aronova, Elena, Karen S. Baker, and Naomi Oreskes, 2010, “Big science and big data in biology: From the International Geophysical Year through the International Biological Program to the Long Term Ecological Research (LTER) Network, 1957–present”, Historical Studies in the Natural Sciences , 40: 183–224.
  • Aronova, Elena, Christine von Oertzen, and David Sepkoski, 2017, “Introduction: Historicizing Big Data”, Osiris , 32(1): 1–17. doi:10.1086/693399
  • Bauer, Susanne, 2008, “Mining Data, Gathering Variables and Recombining Information: The Flexible Architecture of Epidemiological Studies”, Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences , 39(4): 415–428. doi:10.1016/j.shpsc.2008.09.008
  • Bechtel, William, 2016, “Using Computational Models to Discover and Understand Mechanisms”, Studies in History and Philosophy of Science Part A , 56: 113–121. doi:10.1016/j.shpsa.2015.10.004
  • Beisbart, Claus, 2012, “How Can Computer Simulations Produce New Knowledge?”, European Journal for Philosophy of Science , 2(3): 395–434. doi:10.1007/s13194-012-0049-7
  • Bezuidenhout, Louise, Leonelli, Sabina, Kelly, Ann and Rappert, Brian, 2017, “Beyond the Digital Divide: Towards a Situated Approach to Open Data”. Science and Public Policy , 44(4): 464–475. doi: 10.1093/scipol/scw036
  • Bogen, Jim, 2009 [2013], “Theory and Observation in Science”, in The Stanford Encyclopedia of Philosophy (Spring 2013 Edition), Edward N. Zalta (ed.), URL = < https://plato.stanford.edu/archives/spr2013/entries/science-theory-observation/ >.
  • –––, 2010, “Noise in the World”, Philosophy of Science , 77(5): 778–791. doi:10.1086/656006
  • Bogen, James and James Woodward, 1988, “Saving the Phenomena”, The Philosophical Review , 97(3): 303. doi:10.2307/2185445
  • Bokulich, Alisa, 2018, “Using Models to Correct Data: Paleodiversity and the Fossil Record”, in S.I.: Abstraction and Idealization in Scientific Modelling by Synthese , 29 May 2018. doi:10.1007/s11229-018-1820-x
  • Boon, Mieke, 2020, “How Scientists Are Brought Back into Science—The Error of Empiricism”, in A Critical Reflection on Automated Science , Marta Bertolaso and Fabio Sterpetti (eds.), (Human Perspectives in Health Sciences and Technology 1), Cham: Springer International Publishing, 43–65. doi:10.1007/978-3-030-25001-0_4
  • Borgman, Christine L., 2015, Big Data, Little Data, No Data , Cambridge, MA: MIT Press.
  • Boumans, M.J. and Sabina Leonelli, forthcoming, “From Dirty Data to Tidy Facts: Practices of Clustering in Plant Phenomics and Business Cycles”, in Leonelli and Tempini forthcoming.
  • Boyd, Danah and Kate Crawford, 2012, “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon”, Information, Communication & Society , 15(5): 662–679. doi:10.1080/1369118X.2012.678878
  • Boyd, Nora Mills, 2018, “Evidence Enriched”, Philosophy of Science , 85(3): 403–421. doi:10.1086/697747
  • Bowker, Geoffrey C., 2006, Memory Practices in the Sciences , Cambridge, MA: The MIT Press.
  • Bringsjord, Selmer and Naveen Sundar Govindarajulu, 2018, “Artificial Intelligence”, in The Stanford Encyclopedia of Philosophy (Fall 2018 edition), Edward N. Zalta (ed.), URL = < https://plato.stanford.edu/archives/fall2018/entries/artificial-intelligence/ >.
  • British Academy & Royal Society, 2017, Data Management and Use: Governance in the 21st Century. A Joint Report of the Royal Society and the British Academy , British Academy & Royal Society 2017 available online (see Report).
  • Cai, Li and Yangyong Zhu, 2015, “The Challenges of Data Quality and Data Quality Assessment in the Big Data Era”, Data Science Journal , 14: 2. doi:10.5334/dsj-2015-002
  • Callebaut, Werner, 2012, “Scientific Perspectivism: A Philosopher of Science’s Response to the Challenge of Big Data Biology”, Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences , 43(1): 69–80. doi:10.1016/j.shpsc.2011.10.007
  • Calude, Cristian S. and Giuseppe Longo, 2017, “The Deluge of Spurious Correlations in Big Data”, Foundations of Science , 22(3): 595–612. doi:10.1007/s10699-016-9489-4
  • Canali, Stefano, 2016, “Big Data, Epistemology and Causality: Knowledge in and Knowledge out in EXPOsOMICS”, Big Data & Society , 3(2): 205395171666953. doi:10.1177/2053951716669530
  • –––, 2019, “Evaluating Evidential Pluralism in Epidemiology: Mechanistic Evidence in Exposome Research”, History and Philosophy of the Life Sciences , 41(1): art. 4. doi:10.1007/s40656-019-0241-6
  • Cartwright, Nancy D., 2013, Evidence: For Policy and Wheresoever Rigor Is a Must , London School of Economics and Political Science (LSE), Order Project Discussion Paper Series [Cartwright 2013 available online ].
  • –––, 2019, Nature, the Artful Modeler: Lectures on Laws, Science, How Nature Arranges the World and How We Can Arrange It Better (The Paul Carus Lectures) , Chicago, IL: Open Court.
  • Chang, Hasok, 2012, Is Water H2O? Evidence, Realism and Pluralism , (Boston Studies in the Philosophy of Science 293), Dordrecht: Springer Netherlands. doi:10.1007/978-94-007-3932-1
  • –––, 2017, “VI—Operational Coherence as the Source of Truth”, Proceedings of the Aristotelian Society , 117(2): 103–122. doi:10.1093/arisoc/aox004
  • Chapman, Robert and Alison Wylie, 2016, Evidential Reasoning in Archaeology , London: Bloomsbury Publishing Plc.
  • Collins, Harry M., 1990, Artificial Experts: Social Knowledge and Intelligent Machines , Cambridge, MA: MIT Press.
  • Craver, Carl F. and Lindley Darden, 2013, In Search of Mechanisms: Discoveries Across the Life Sciences , Chicago: University of Chicago Press.
  • Daston, Lorraine, 2017, Science in the Archives: Pasts, Presents, Futures , Chicago: University of Chicago Press.
  • De Regt, Henk W., 2017, Understanding Scientific Understanding , Oxford: Oxford University Press. doi:10.1093/oso/9780190652913.001.0001
  • D’Ignazio, Catherine and Klein, Lauren F., 2020, Data Feminism , Cambridge, MA: The MIT Press.
  • Douglas, Heather E., 2009, Science, Policy and the Value-Free Ideal , Pittsburgh, PA: University of Pittsburgh Press.
  • Dreyfus, Hubert L., 1992, What Computers Still Can’t Do: A Critique of Artificial Reason , Cambridge, MA: MIT Press.
  • Durán, Juan M. and Nico Formanek, 2018, “Grounds for Trust: Essential Epistemic Opacity and Computational Reliabilism”, Minds and Machines , 28(4): 645–666. doi:10.1007/s11023-018-9481-6
  • Edwards, Paul N., 2010, A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming , Cambridge, MA: The MIT Press.
  • Elliott, Kevin C., 2012, “Epistemic and methodological iteration in scientific research”. Studies in History and Philosophy of Science , 43: 376–382.
  • Elliott, Kevin C., Kendra S. Cheruvelil, Georgina M. Montgomery, and Patricia A. Soranno, 2016, “Conceptions of Good Science in Our Data-Rich World”, BioScience , 66(10): 880–889. doi:10.1093/biosci/biw115
  • Feest, Uljana, 2011, “What Exactly Is Stabilized When Phenomena Are Stabilized?”, Synthese , 182(1): 57–71. doi:10.1007/s11229-009-9616-7
  • Fleming, Lora, Niccolò Tempini, Harriet Gordon-Brown, Gordon L. Nichols, Christophe Sarran, Paolo Vineis, Giovanni Leonardi, Brian Golding, Andy Haines, Anthony Kessel, Virginia Murray, Michael Depledge, and Sabina Leonelli, 2017, “Big Data in Environment and Human Health”, in Oxford Research Encyclopedia of Environmental Science, Oxford: Oxford University Press. doi:10.1093/acrefore/9780199389414.013.541
  • Floridi, Luciano, 2014, The Fourth Revolution: How the Infosphere is Reshaping Human Reality , Oxford: Oxford University Press.
  • Floridi, Luciano and Phyllis Illari (eds.), 2014, The Philosophy of Information Quality , (Synthese Library 358), Cham: Springer International Publishing. doi:10.1007/978-3-319-07121-3
  • Frigg, Roman and Julian Reiss, 2009, “The Philosophy of Simulation: Hot New Issues or Same Old Stew?”, Synthese , 169(3): 593–613. doi:10.1007/s11229-008-9438-z
  • Frigg, Roman and Stephan Hartmann, 2016, “Models in Science”, in The Stanford Encyclopedia of Philosophy (Winter 2016 edition), Edward N. Zalta (ed.), URL = < https://plato.stanford.edu/archives/win2016/entries/models-science/ >.
  • Gooding, David C., 1990, Experiment and the Making of Meaning , Dordrecht & Boston: Kluwer.
  • Giere, Ronald, 2006, Scientific Perspectivism , Chicago: University of Chicago Press.
  • Griesemer, James R., forthcoming, “A Data Journey through Dataset-Centric Population Biology”, in Leonelli and Tempini forthcoming.
  • Hacking, Ian, 1992, “The Self-Vindication of the Laboratory Sciences”, In Science as Practice and Culture , Andrew Pickering (ed.), Chicago, IL: The University of Chicago Press, 29–64.
  • Harris, Todd, 2003, “Data Models and the Acquisition and Manipulation of Data”, Philosophy of Science , 70(5): 1508–1517. doi:10.1086/377426
  • Hey Tony, Stewart Tansley, and Kristin Tolle, 2009, The Fourth Paradigm. Data-Intensive Scientific Discovery , Redmond, WA: Microsoft Research.
  • Humphreys, Paul, 2004, Extending Ourselves: Computational Science, Empiricism, and Scientific Method , Oxford: Oxford University Press. doi:10.1093/0195158709.001.0001
  • –––, 2009, “The Philosophical Novelty of Computer Simulation Methods”, Synthese , 169(3): 615–626. doi:10.1007/s11229-008-9435-2
  • Karaca, Koray, 2018, “Lessons from the Large Hadron Collider for Model-Based Experimentation: The Concept of a Model of Data Acquisition and the Scope of the Hierarchy of Models”, Synthese , 195(12): 5431–5452. doi:10.1007/s11229-017-1453-5
  • Kelly, Thomas, 2016, “Evidence”, in The Stanford Encyclopedia of Philosophy (Winter 2016 edition), Edward N. Zalta (ed.), URL = < https://plato.stanford.edu/archives/win2016/entries/evidence/ >.
  • Kitchin, Rob, 2013, The Data Revolution: Big Data, Open Data, Data Infrastructures & Their Consequences , Los Angeles: Sage.
  • –––, 2014, “Big Data, new epistemologies and paradigm shifts”, Big Data and Society , 1(1) April-June. doi: 10.1177/2053951714528481
  • Kitchin, Rob and Gavin McArdle, 2016, “What Makes Big Data, Big Data? Exploring the Ontological Characteristics of 26 Datasets”, Big Data & Society , 3(1): 205395171663113. doi:10.1177/2053951716631130
  • Krohs, Ulrich, 2012, “Convenience Experimentation”, Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences , 43(1): 52–57. doi:10.1016/j.shpsc.2011.10.005
  • Lagoze, Carl, 2014, “Big Data, data integrity, and the fracturing of the control zone,” Big Data and Society , 1(2) July-December. doi: 10.1177/2053951714558281
  • Leonelli, Sabina, 2014, “What Difference Does Quantity Make? On the Epistemology of Big Data in Biology”, Big Data & Society , 1(1): 205395171453439. doi:10.1177/2053951714534395
  • –––, 2016, Data-Centric Biology: A Philosophical Study , Chicago: University of Chicago Press.
  • –––, 2017, “Global Data Quality Assessment and the Situated Nature of ‘Best’ Research Practices in Biology”, Data Science Journal , 16: 32. doi:10.5334/dsj-2017-032
  • –––, 2018, “The Time of Data: Timescales of Data Use in the Life Sciences”, Philosophy of Science , 85(5): 741–754. doi:10.1086/699699
  • –––, 2019a, La Recherche Scientifique à l’Ère des Big Data: Cinq Façons Donc les Données Massive Nuisent à la Science, et Comment la Sauver , Milano: Éditions Mimésis.
  • –––, 2019b, “What Distinguishes Data from Models?”, European Journal for Philosophy of Science , 9(2): 22. doi:10.1007/s13194-018-0246-0
  • Leonelli, Sabina and Niccolò Tempini, 2018, “Where Health and Environment Meet: The Use of Invariant Parameters in Big Data Analysis”, Synthese , special issue on the Philosophy of Epidemiology , Sean Valles and Jonathan Kaplan (eds.). doi:10.1007/s11229-018-1844-2
  • –––, forthcoming, Data Journeys in the Sciences , Cham: Springer International Publishing.
  • Loettgers, Andrea, 2009, “Synthetic Biology and the Emergence of a Dual Meaning of Noise”, Biological Theory , 4(4): 340–356. doi:10.1162/BIOT_a_00009
  • Longino, Helen E., 1990, Science as Social Knowledge: Values and Objectivity in Scientific Inquiry , Princeton, NJ: Princeton University Press.
  • Lowrie, Ian, 2017, “Algorithmic Rationality: Epistemology and Efficiency in the Data Sciences”, Big Data & Society , 4(1): 1–13. doi:10.1177/2053951717700925
  • MacLeod, Miles and Nancy J. Nersessian, 2013, “Building Simulations from the Ground Up: Modeling and Theory in Systems Biology”, Philosophy of Science , 80(4): 533–556. doi:10.1086/673209
  • Massimi, Michela, 2011, “From Data to Phenomena: A Kantian Stance”, Synthese , 182(1): 101–116. doi:10.1007/s11229-009-9611-z
  • –––, 2012, “ Scientific perspectivism and its foes”, Philosophica , 84: 25–52.
  • –––, 2016, “Three Tales of Scientific Success”, Philosophy of Science , 83(5): 757–767. doi:10.1086/687861
  • Mayer-Schönberger, Victor and Kenneth Cukier, 2013, Big Data: A Revolution that Will Transform How We Live, Work, and Think , New York: Eamon Dolan/Houghton Mifflin Harcourt.
  • Mayo, Deborah G., 1996, Error and the Growth of Experimental Knowledge , Chicago: University of Chicago Press.
  • Mayo, Deborah G. and Aris Spanos (eds.), 2009a, Error and Inference , Cambridge: Cambridge University Press.
  • Mayo, Deborah G. and Aris Spanos, 2009b, “Introduction and Background”, in Mayo and Spanos (eds.) 2009a, pp. 1–27.
  • McAllister, James W., 1997, “Phenomena and Patterns in Data Sets”, Erkenntnis , 47(2): 217–228. doi:10.1023/A:1005387021520
  • –––, 2007, “Model Selection and the Multiplicity of Patterns in Empirical Data”, Philosophy of Science , 74(5): 884–894. doi:10.1086/525630
  • –––, 2011, “What Do Patterns in Empirical Data Tell Us about the Structure of the World?”, Synthese , 182(1): 73–87. doi:10.1007/s11229-009-9613-x
  • McQuillan, Dan, 2018, “Data Science as Machinic Neoplatonism”, Philosophy & Technology , 31(2): 253–272. doi:10.1007/s13347-017-0273-3
  • Mitchell, Sandra D., 2003, Biological Complexity and Integrative Pluralism , Cambridge: Cambridge University Press. doi:10.1017/CBO9780511802683
  • Morgan, Mary S., 2005, “Experiments versus Models: New Phenomena, Inference and Surprise”, Journal of Economic Methodology , 12(2): 317–329. doi:10.1080/13501780500086313
  • –––, forthcoming, “The Datum in Context”, in Leonelli and Tempini forthcoming.
  • Morrison, Margaret, 2015, Reconstructing Reality: Models, Mathematics, and Simulations , Oxford: Oxford University Press. doi:10.1093/acprof:oso/9780199380275.001.0001
  • Müller-Wille, Staffan and Isabelle Charmantier, 2012, “Natural History and Information Overload: The Case of Linnaeus”, Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences , 43(1): 4–15. doi:10.1016/j.shpsc.2011.10.021
  • Napoletani, Domenico, Marco Panza, and Daniele C. Struppa, 2011, “Agnostic Science. Towards a Philosophy of Data Analysis”, Foundations of Science , 16(1): 1–20. doi:10.1007/s10699-010-9186-7
  • –––, 2014, “Is Big Data Enough? A Reflection on the Changing Role of Mathematics in Applications”, Notices of the American Mathematical Society , 61(5): 485–490. doi:10.1090/noti1102
  • Nickles, Thomas, forthcoming, “Alien Reasoning: Is a Major Change in Scientific Research Underway?”, Topoi , first online: 20 March 2018. doi:10.1007/s11245-018-9557-1
  • Norton, John D., 2003, “A Material Theory of Induction”, Philosophy of Science , 70(4): 647–670. doi:10.1086/378858
  • O’Malley, Maureen A., Kevin C. Elliott, Chris Haufe, and Richard Burian, 2009, “Philosophies of funding”, Cell, 138: 611–615. doi:10.1016/j.cell.2009.08.008
  • O’Malley, Maureen A. and Orkun S. Soyer, 2012, “The Roles of Integration in Molecular Systems Biology”, Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences , 43(1): 58–68. doi:10.1016/j.shpsc.2011.10.006
  • O’Neil, Cathy, 2016, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy, New York: Crown.
  • Parker, Wendy S., 2009, “Does Matter Really Matter? Computer Simulations, Experiments, and Materiality”, Synthese , 169(3): 483–496. doi:10.1007/s11229-008-9434-3
  • –––, 2017, “Computer Simulation, Measurement, and Data Assimilation”, The British Journal for the Philosophy of Science , 68(1): 273–304. doi:10.1093/bjps/axv037
  • Pasquale, Frank, 2015, The Black Box Society: The Secret Algorithms That Control Money and Information , Cambridge, MA: Harvard University Press.
  • Pietsch, Wolfgang, 2015, “Aspects of Theory-Ladenness in Data-Intensive Science”, Philosophy of Science , 82(5): 905–916. doi:10.1086/683328
  • –––, 2016, “The Causal Nature of Modeling with Big Data”, Philosophy & Technology , 29(2): 137–171. doi:10.1007/s13347-015-0202-2
  • –––, 2017, “Causation, probability and all that: Data science as a novel inductive paradigm”, in Frontiers in Data Science , Matthias Dehmer and Frank Emmert-Streib (eds.), Boca Raton, FL: CRC, 329–353.
  • Porter, Theodore M., 1995, Trust in Numbers: The Pursuit of Objectivity in Science and Public Life , Princeton, NJ: Princeton University Press.
  • Porter, Theodore M. and Soraya de Chadarevian, 2018, “Introduction: Scrutinizing the Data World”, Historical Studies in the Natural Sciences , 48(5): 549–556. doi:10.1525/hsns.2018.48.5.549
  • Prainsack, Barbara and Buyx, Alena, 2017, Solidarity in Biomedicine and Beyond , Cambridge, UK: Cambridge University Press.
  • Radder, Hans, 2009, “The Philosophy of Scientific Experimentation: A Review”, Automated Experimentation , 1(1): 2. doi:10.1186/1759-4499-1-2
  • Ratti, Emanuele, 2015, “Big Data Biology: Between Eliminative Inferences and Exploratory Experiments”, Philosophy of Science , 82(2): 198–218. doi:10.1086/680332
  • Reichenbach, Hans, 1938, Experience and Prediction: An Analysis of the Foundations and the Structure of Knowledge , Chicago, IL: The University of Chicago Press.
  • Reiss, Julian, 2015, “A Pragmatist Theory of Evidence”, Philosophy of Science , 82(3): 341–362. doi:10.1086/681643
  • –––, 2015, Causation, Evidence, and Inference, New York: Routledge.
  • Rescher, Nicholas, 1984, The Limits of Science, Berkeley, CA: University of California Press.
  • Rheinberger, Hans-Jörg, 2011, “Infra-Experimentality: From Traces to Data, from Data to Patterning Facts”, History of Science , 49(3): 337–348. doi:10.1177/007327531104900306
  • Romeijn, Jan-Willem, 2017, “Philosophy of Statistics”, in The Stanford Encyclopedia of Philosophy (Spring 2017), Edward N. Zalta (ed.), URL: https://plato.stanford.edu/archives/spr2017/entries/statistics/ .
  • Sepkoski, David, 2013, “Toward ‘a natural history of data’: Evolving practices and epistemologies of data in paleontology, 1800–2000”, Journal of the History of Biology , 46: 401–444.
  • Shavit, Ayelet and James Griesemer, 2009, “There and Back Again, or the Problem of Locality in Biodiversity Surveys*”, Philosophy of Science , 76(3): 273–294. doi:10.1086/649805
  • Srnicek, Nick, 2017, Platform capitalism , Cambridge, UK and Malden, MA: Polity Press.
  • Sterner, Beckett, 2014, “The Practical Value of Biological Information for Research”, Philosophy of Science , 81(2): 175–194. doi:10.1086/675679
  • Sterner, Beckett and Nico M. Franz, 2017, “Taxonomy for Humans or Computers? Cognitive Pragmatics for Big Data”, Biological Theory , 12(2): 99–111. doi:10.1007/s13752-017-0259-5
  • Sterner, Beckett W., Nico M. Franz, and J. Witteveen, 2020, “Coordinating dissent as an alternative to consensus classification: insights from systematics for bio-ontologies”, History and Philosophy of the Life Sciences , 42(1): 8. doi: 10.1007/s40656-020-0300-z
  • Stevens, Hallam, 2016, “Hadooping the Genome: The Impact of Big Data Tools on Biology”, BioSocieties , 11: 352–371.
  • Strasser, Bruno, 2019, Collecting Experiments: Making Big Data Biology , Chicago: University of Chicago Press.
  • Suppes, Patrick, 1962, “Models of data”, in Logic, Methodology and Philosophy of Science , Ernest Nagel, Patrick Suppes, & Alfred Tarski (eds.), Stanford: Stanford University Press, 252–261.
  • Symons, John and Ramón Alvarado, 2016, “Can We Trust Big Data? Applying Philosophy of Science to Software”, Big Data & Society , 3(2): 1-17. doi:10.1177/2053951716664747
  • Symons, John and Jack Horner, 2014, “Software Intensive Science”, Philosophy & Technology , 27(3): 461–477. doi:10.1007/s13347-014-0163-x
  • Tempini, Niccolò, 2017, “Till Data Do Us Part: Understanding Data-Based Value Creation in Data-Intensive Infrastructures”, Information and Organization , 27(4): 191–210. doi:10.1016/j.infoandorg.2017.08.001
  • Tempini, Niccolò and Sabina Leonelli, 2018, “Concealment and Discovery: The Role of Information Security in Biomedical Data Re-Use”, Social Studies of Science , 48(5): 663–690. doi:10.1177/0306312718804875
  • Toulmin, Stephen, 1958, The Uses of Argument , Cambridge: Cambridge University Press.
  • Turner, Raymond and Nicola Angius, 2019, “The Philosophy of Computer Science”, in The Stanford Encyclopedia of Philosophy (Spring 2019 edition), Edward N. Zalta (ed.), URL = < https://plato.stanford.edu/archives/spr2019/entries/computer-science/ >.
  • Van Fraassen, Bas C., 2008, Scientific Representation: Paradoxes of Perspective , Oxford: Oxford University Press. doi:10.1093/acprof:oso/9780199278220.001.0001
  • Waters, C. Kenneth, 2007, “The Nature and Context of Exploratory Experimentation: An Introduction to Three Case Studies of Exploratory Research”, History and Philosophy of the Life Sciences , 29(3): 275–284.
  • Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E. Bourne, Jildau Bouwman, Anthony J. Brookes, Tim Clark, Mercè Crosas, Ingrid Dillo, Olivier Dumon, Scott Edmunds, Chris T. Evelo, Richard Finkers, Alejandra Gonzalez-Beltran, et al., 2016, “The FAIR Guiding Principles for Scientific Data Management and Stewardship”, Scientific Data , 3(1): 160018. doi:10.1038/sdata.2016.18
  • Williamson, Jon, 2004, “A dynamic interaction between machine learning and the philosophy of science”, Minds and Machines , 14(4): 539–54.
  • Wimsatt, William C., 2007, Re-Engineering Philosophy for Limited Beings: Piecewise Approximations to Reality , Cambridge, MA: Harvard University Press.
  • Winsberg, Eric, 2010, Science in the Age of Computer Simulation , Chicago: University of Chicago Press.
  • Woodward, James, 2000, “Data, phenomena and reliability”, Philosophy of Science , 67(supplement): Proceedings of the 1998 Biennial Meetings of the Philosophy of Science Association. Part II: Symposia Papers (Sep., 2000), pp. S163–S179. https://www.jstor.org/stable/188666
  • –––, 2010, “Data, Phenomena, Signal, and Noise”, Philosophy of Science , 77(5): 792–803. doi:10.1086/656554
  • Wright, Jessey, 2017, “The Analysis of Data and the Evidential Scope of Neuroimaging Results”, The British Journal for the Philosophy of Science , 69(4): 1179–1203. doi:10.1093/bjps/axx012
  • Wylie, Alison, 2017, “How Archaeological Evidence Bites Back: Strategies for Putting Old Data to Work in New Ways”, Science, Technology, & Human Values , 42(2): 203–225. doi:10.1177/0162243916671200
  • –––, forthcoming, “Radiocarbon Dating in Archaeology: Triangulation and Traceability”, in Leonelli and Tempini forthcoming.
  • Zuboff, Shoshana, 2017, The Age of Surveillance Capitalism: The Fight for the Future at the New Frontier of Power , New York: Public Affairs.
Related Entries

artificial intelligence | Bacon, Francis | biology: experiment in | computer science, philosophy of | empiricism: logical | evidence | human genome project | models in science | Popper, Karl | science: theory and observation in | scientific explanation | scientific method | scientific theories: structure of | statistics, philosophy of

Acknowledgments

The research underpinning this entry was funded by the European Research Council (grant award 335925) and the Alan Turing Institute (EPSRC Grant EP/N510129/1).

Copyright © 2020 by Sabina Leonelli <s.leonelli@exeter.ac.uk>

Using big data and artificial intelligence to accelerate global development

  • Jennifer L. Cohen, Project Coordinator and Assistant to the Vice President and Director, Global Economy and Development Program, The Brookings Institution
  • Homi Kharas, Senior Fellow, Global Economy and Development, Center for Sustainable Development

November 15, 2018


This report is part of “ A Blueprint for the Future of AI ,” a series from the Brookings Institution that analyzes the new challenges and potential policy solutions introduced by artificial intelligence and other emerging technologies.

When U.N. member states unanimously adopted the 2030 Agenda in 2015, the narrative around global development embraced a new paradigm of sustainability and inclusion—of planetary stewardship alongside economic progress, and inclusive distribution of income. This comprehensive agenda—merging social, economic and environmental dimensions of sustainability—is not supported by current modes of data collection and data analysis, so the report of the High-Level Panel on the post-2015 development agenda called for a “data revolution” to empower people through access to information. 1

Today, a central development problem is that high-quality, timely, accessible data are absent in most poor countries, where development needs are greatest. In a world of unequal distributions of income and wealth across space, age and class, gender and ethnic pay gaps, and environmental risks, data that provide only national averages conceal more than they reveal. This paper argues that spatial disaggregation and timeliness could permit a process of evidence-based policy making that monitors outcomes and adjusts actions in a feedback loop that can accelerate development through learning. Big data and artificial intelligence are key elements in such a process.

Emerging technologies could lead to the next quantum leap in (i) how data is collected; (ii) how data is analyzed; and (iii) how analysis is used for policymaking and the achievement of better results. Big data platforms expand the toolkit for acquiring real-time information at a granular level, while machine learning permits pattern recognition across multiple layers of input. Together, these advances could make data more accessible, scalable, and finely tuned. In turn, the availability of real-time information can shorten the feedback loop between results monitoring, learning, and policy formulation or investment, accelerating the speed and scale at which development actors can implement change.

Data collection: From surveys to satellites and sensors

Traditionally, economists have relied on household consumption surveys and national account estimates to map patterns of poverty and to assess the impact of policy interventions, in particular, social assistance and social insurance programs. Conducting household surveys, however, is time-intensive, costly, and prone to error. In many countries, notably in the poorest countries and fragile states where development needs are greatest, survey data is simply unavailable. Between 2000 and 2010, 39 of 59 countries in Africa conducted fewer than two surveys, implying that no time trends could be reliably established. 2 Even in those countries where more frequent household survey data is available, the quality is in doubt. For example, survey results vary considerably depending on the method used to identify consumption or income.

Kathleen Beegle et al. (2012) found that recall methods, requiring that respondents recall consumption over a defined period, measure lower consumption than personal diaries, in which respondents track their consumption in real-time. 3 Survey results are notoriously at odds with national income accounts estimates of personal consumption, with the gap amounting to 60 percent of the total in some countries, including large countries with relatively sophisticated statistical systems like India and Indonesia. 4 Micro studies suggest that survey answers depend on the type of respondent, reference period, and degree of commodity detail, all of which can be difficult to control across organizations and projects, and which are often changed from survey to survey, complicating any analysis of what is happening over time. Furthermore, underreporting is frequent in illiterate homes and among urban respondents, which can lead to large data gaps among poorer households. 5

Big data from satellites, mobile phones, and social media, among other tools, allows researchers to build on, and in some cases replace, traditional methods of acquiring socioeconomic data. Its advantages are frequency, timeliness, accuracy, and objectivity. Its main disadvantage is that the available indicators are only proxies for what policymakers are actually interested in and need for policy design.

Remote sensing satellites

Earth Observations (EO) provide finely tuned and near-real-time data on global terrain. These data are becoming widely available to public and private actors through platforms like the Global EO System of Systems (GEOSS) . A coalition of 105 governments and 127 participating organizations, known as the Group on Earth Observations (GEO), is working to ensure that EO are accessible and interoperable. 6 There is increasing recognition that these data can be used to support the 2030 Agenda for Sustainable Development. 7 8

While satellite sensors have been widely adopted in the environmental science community to observe changes in weather, climate, and terrain, their application to economics is new. Researchers have found that high-resolution, spatially tuned satellite imagery can provide important insight into human economic activity. Because data is disaggregated to local levels, comparisons within and among countries are possible. 9

Social scientists have started to use nighttime light measures, or luminosity, as proxies for economic activity and population distribution. Satellites, like the U.S. Air Force Defense Meteorological Satellite Program (DMSP) Operational Linescan System (OLS), can map artificial light in cities, towns, and industrial centers on the Earth’s surface. Now declassified, the raw data are publicly available. Several researchers have noted a correlation between nighttime light measures and country-level or subnational economic output. In 1997, Christopher Elvidge et al. identified a correlation between illuminated areas, electric power consumption, and GDP at the country level. 10 Paul C. Sutton and Robert Costanza (2002) examined luminosity and GDP per square kilometer, also finding a high correlation. 11 Xi Chen and William Nordhaus (2011) compared luminosity measures with conventional measures of output to indicate the value-add in data-poor countries. More recently, J. Vernon Henderson et al. (2012) determined that nighttime lights were “uniquely suited to spatial analyses of economic activity” and could serve as a proxy for GDP growth on the subnational level. 12
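
To make the proxy relationship concrete, the sketch below runs the kind of log-log regression these studies rely on, regressing regional output on summed nighttime-light radiance; the slope is the lights-to-output elasticity. All numbers are synthetic placeholders, not data from the cited papers.

```python
# Illustrative sketch only: a log-log regression of regional output on
# nighttime-light intensity, the basic relationship behind luminosity proxies.
# The data below are synthetic placeholders, not values from the cited studies.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical summed radiance and GDP for 50 subnational regions.
luminosity = rng.uniform(1e3, 1e6, size=50)
gdp = 2.5 * luminosity ** 0.3 * rng.lognormal(sigma=0.2, size=50)

# Regress log GDP on log lights; the slope estimates the elasticity of output
# with respect to observed light, and R^2 gauges how much the proxy explains.
X = sm.add_constant(np.log(luminosity))
fit = sm.OLS(np.log(gdp), X).fit()
print(fit.params)    # [intercept, elasticity]
print(fit.rsquared)
```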

The satellite imagery below shows NASA Earth Observatory nighttime light data from Syria. The left-hand image shows a concentration of Syria’s economic activity in two corridors in 2012: a north-south corridor on the country’s western border, stretching from Damascus in the lower left of the image to Aleppo in the upper-left corner, and a diagonal corridor linking Aleppo with Baghdad in the lower-right corner. Just four years later, in 2016, the satellite captured a far darker image reflecting the losses to Syria’s economy and infrastructure during the ongoing civil war. In particular, Aleppo is barely visible, and the road from there to Baghdad no longer supports any economic activity. The luminosity of Damascus and its environs is also sharply reduced.

Figure 1: Satellite Imagery comparing Syria 2012 and 2016

Syria is a case in point where conflict made it impossible to collect data through any means other than remote sensing. And, in fact, many countries are constrained by budget or conflict, making satellite imagery the only option from which to infer socioeconomic characteristics.

There are limitations to this approach. Luminosity data is hard to interpret in low-output and high-output regions. In low-output regions, it is difficult to differentiate man-made lights from natural background lighting and reflections. In high-output regions, usually urban areas, the measure of bright lights may be capped by a saturation band, so that the metric is not smooth. Improvements in data quality to address these limitations are already happening and this will further open up the field to social scientists.

Cellphones, social media, and automated sensors

Mobile phone data can also be used to infer socioeconomic characteristics in a geographically disaggregated way. Cell phones are ubiquitous in developed and emerging economies. Call Detail Records (CDRs), which are stored and secured by Mobile Network Operators (MNOs), provide data on: (i) mobility, (ii) social interactions, and (iii) consumption and expenditure patterns (from the degree to which airtime is pre-paid). Joshua Blumenstock et al. (2015) used anonymized metadata from Rwanda’s largest cell phone network in combination with follow-up surveys to examine the extent to which mobile phone data can be used to estimate socioeconomic characteristics, and map a country-level wealth profile. 13 When aggregated at a district level, Blumenstock et al. found that mobile phone data estimations were comparable to predictions using ground data collected by the Kigali Demographic and Health Survey (DHS). More granularly, historical records of an individual’s mobile phone use can accurately predict socioeconomic characteristics. Vanessa Frias-Martinez et al. determined that cell records can also be used to approximate costly and infrequent census information. 14 They propose a new tool, CenCell, which uses behavioral patterns collected from CDRs to classify socioeconomic levels, with classification accuracy rates of up to 70 percent. The tool provides policymakers with affordable census maps at varying degrees of granularity. It should be noted that while CDRs provide detailed information on individual patterns of behavior, the data is proprietary and thus difficult to obtain. Additionally, questions over privacy and cybersecurity complicate efforts. Even when the data are available in the public domain, and individuals consent to its use for evaluation, some vulnerable populations may be underrepresented in mobile phone data.
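
A minimal sketch of that workflow, with made-up feature names and synthetic data, is shown below: fit a model on phone-metadata features for the subscribers reached by a follow-up survey, predict a wealth index for all subscribers, then average by district, the level at which Blumenstock et al. compare against DHS estimates.

```python
# Sketch of a Blumenstock-style pipeline on synthetic data with hypothetical
# feature names; real CDR features and survey instruments would differ.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 1000

# Hypothetical per-subscriber CDR features and district identifiers.
cdr = pd.DataFrame({
    "calls_per_day": rng.poisson(5, n),
    "distinct_contacts": rng.poisson(20, n),
    "topups_per_week": rng.gamma(2.0, 1.5, n),
    "district": rng.integers(0, 30, n),
})

# Only a small subset of subscribers is reached by the follow-up survey.
surveyed = rng.choice(n, size=200, replace=False)
wealth_index = 0.1 * cdr.loc[surveyed, "topups_per_week"].to_numpy() + rng.normal(0, 0.2, 200)

features = ["calls_per_day", "distinct_contacts", "topups_per_week"]
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(cdr.loc[surveyed, features], wealth_index)

# Predict for everyone and average by district to build a wealth profile map.
cdr["predicted_wealth"] = model.predict(cdr[features])
print(cdr.groupby("district")["predicted_wealth"].mean().head())
```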

Digital footprints from social media can also fill gaps in data for policymakers and development practitioners. For example, Google Trends (GT) reports, which provide real-time information on search queries at state and metro levels for several countries, have informed private consumption predictions. 15 Google analytics could have broad-reaching utility for other socioeconomic measures.

Once relegated to sci-fi films, robotics have expanded opportunities to collect in situ data on environmental indicators. Autonomous underwater vehicles (AUVs) and underwater smart devices allow researchers to explore uncharted areas of the ocean. Marine sensing technology provides real-time, multidimensional data on the sea surface and deep sea. UNESCO notes that it is now possible to incorporate marine sensors on submarine telecommunication cables at intervals of 50-70 km. 16 These sensors could collect data on the seafloor and detect movement related to earthquakes or tsunamis. Above ground, a spinoff of Bivee Inc., Starling Data , has devised a unit that collects and transmits localized data in real-time without reliance on external power sources. Designed to improve data collection in developing economies, the unit tracks data on power quality, climate (rainfall, wind, humidity, etc.), and infrastructure (mapping, emissions), which it then uploads to the cloud. As the global community works in pursuit of economic progress joined with planetary stewardship, data on the environment will be increasingly important.

Data analysis: From hypothesis testing to machine learning

Machine learning (ML) allows researchers to analyze data in novel ways. Computers today can process multiple sets of data in little time and, with the correct classification sets, recognize highly complex patterns among them. Designed to simulate the interactions of biological neurons, “deep learning” uses artificial neural networks to discern features in successive layers of data while iterating on previously recognized trends. In the mid-1980s, artificial intelligence required that programmers classify data as part of the algorithms. 17 Today, machines learn from and adapt to different inputs with little human supervision.

Neal Jean et al. (2016) explain how this might work in the field of economic development. 18 Using a combination of survey and satellite data from Nigeria, Tanzania, Uganda, Malawi, and Rwanda, the Stanford team trained machines to recognize visual patterns that could then make predictions about socioeconomic distributions. Neal Jean et al. employed a particular type of machine learning, known as convolutional neural networks (CNNs), to improve the accuracy of their forecasts. Here’s how it works: the CNN model pre-trains on ImageNet, a classification data set with over 1,000 different categories of labeled images, to discern visual features that appear in daytime satellite imagery. Next, programmers train the CNN to predict which features best explain the variance observed in nighttime light intensities. Finally, these daytime features are combined with cluster-level, geolocated socio-economic variables from survey data (such as USAID supported Demographic and Health Surveys) to build ridge regression models. The model parameters can then be used to extend forecasts to areas of the country not covered by the DHS, to get comprehensive national maps, such as poverty and mortality maps. Jean et al. determined that CNN estimates could accurately predict average household consumption and asset wealth in Nigeria, Tanzania, Uganda, Malawi, and Rwanda. In addition, their model outperformed luminosity and mobile-phone only approaches.
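
The final stage of that pipeline, a ridge regression from image features to surveyed consumption followed by prediction over the unsurveyed grid, can be sketched as follows. The feature matrices here are random placeholders standing in for extracted CNN activations; this is not the Stanford team's code, just an illustration of the regression step.

```python
# Sketch of the last step of a Jean et al.-style pipeline, assuming CNN feature
# extraction has already been done. All arrays below are random placeholders.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n_surveyed, n_grid, n_features = 300, 5000, 512

# Hypothetical image features for DHS survey clusters and for the full national
# grid, plus synthetic log consumption observed only in the survey clusters.
survey_features = rng.normal(size=(n_surveyed, n_features))
grid_features = rng.normal(size=(n_grid, n_features))
log_consumption = survey_features[:, :5].sum(axis=1) + rng.normal(0, 0.5, n_surveyed)

# Ridge regularization keeps the fit stable when features rival clusters in number.
model = RidgeCV(alphas=np.logspace(-3, 3, 13))
model.fit(survey_features, log_consumption)
print("CV R^2:", cross_val_score(model, survey_features, log_consumption, cv=5).mean())

# Extend predictions to unsurveyed locations to draw a national consumption map.
consumption_map = model.predict(grid_features)
```

In the actual study the daytime features are produced by a CNN fine-tuned on the nighttime-light prediction task, which is what allows the model to transfer to areas where labeled survey data are scarce.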

Applications of artificial intelligence, like the one detailed above, could have a sweeping impact on the development field. Training machines on multiple layers of input reduces inaccuracies while allowing researchers to include a rich variety of publicly available variables by merging geocoded data sets with infrastructure variables and social indicators.

Policymaking using data analysis and feedback loops

There are a number of examples that illustrate the ways data analytics can inform global development. These include using satellite imagery to map schools, identifying the hidden costs of conflict and reconciliation, tracking illegal mining, and addressing rapid urbanization. We elaborate on these examples below.

Employing satellite imagery to map schools in Kyrgyzstan

The government of Kyrgyzstan previously relied on administrative data to evaluate school placement, determine expected volume of students, and allocate classroom resources. It has recently adopted a new program called “Taza Koom,” designed to increase access to 21st century skills in schools across the country. The program faced significant barriers. For example, as of March 2017, 40 percent of schools lacked access to basic internet services. 19 UNICEF Kyrgyzstan teamed with the government to generate a highly detailed map of schools with real-time measures of connectivity, overlaid with additional sources of data that could serve as proxies for education efficiency. The hope is that this system will give national stakeholders the insights they need to address digital gaps in the school system.

Kyrgyzstan’s school mapping project is part of a broader UNICEF Innovation initiative to map every school in the world. In collaboration with UC San Diego’s Big Pixel Initiative and Development Seed, UNICEF is developing a convolutional neural network to recognize patterns in satellite imagery that could be used to locate schools. UNICEF has joined traditional measures of data collection with crowdsourcing methods and remote sensing observations. To aid decision-making, the data will be available in real-time on an online platform.

Identifying the hidden costs of conflict and reconciliation in Colombia

Decades of conflict between the government of Colombia and the guerilla group Revolutionary Armed Forces of Colombia (FARC) left large portions of the Colombian Amazon unexamined. Now that FARC guerilla fighters have vacated the forest, scientists are quickly working to document Colombia’s distinct ecosystem and biodiversity. 20 Geoscientist Ruiz Carrascal is building a network of climate sensors that monitor temperature and humidity in alpine regions. 21 Meanwhile, more than 40 researchers have kick-started a digital platform that collects and analyzes data on a wide range of environmental indicators, including species distribution, forest cover, and weather patterns.

Fears of rapid urbanization give urgency to the effort to analyze Amazonian data. After FARC abandoned its strongholds, logging, cattle, and gold-mining industries expanded their operations into the forest. While this has brought much-welcomed economic growth to the region, it has also brought about rapid deforestation: post-peace accords, the rate has increased by 44 percent. 22 The hope is that new in situ environmental sensors and machine learning techniques will generate models that can predict threats to conservation. This information could then inform policies to better protect forested areas and encourage both peace and sustainable development.

Harnessing Earth observations to track illegal mining in Ghana

Illegal mining is prevalent in Ghana. At least 30 cocoa farmers in the regions outside of Dunkwa, in Ghana’s Central region, have sold their plantations to gold miners, who quickly excavate the land. 23 The cost of these often-illegal operations is high: in addition to supporting an illicit economy, gold mining contributes to deforestation and water contamination. While the government of Ghana works to balance the economic benefits of small-scale gold mining alongside environmental conservation, getting the balance right is proving difficult. The Small-Scale Gold Mining Act of 1989 permitted groups of nine or fewer to mine for gold. An updated law from 2006 requires that miners obtain licenses from the Ghanaian Environmental Protection Agency and Forest Commission, but enforcement of these regulations is difficult.

Data from the Africa Regional Data Cube (ARDC) could help policymakers identify topographic changes and track illegal mining operations. 24 The ARDC collects EO data, including 17 years of satellite imagery archives, on Kenya, Senegal, Sierra Leone, Tanzania, and Ghana. It combines 8,000 visual layers across a defined period of time to produce localized, easily accessible data. The ARDC’s ability to compare changes in land across many years in Ghana could help policymakers identify and enforce regulation of extractive industries.
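
As an illustration of what such multi-temporal comparison involves (a toy sketch on hypothetical arrays, not the ARDC's actual workflow), the code below flags pixels whose vegetation index drops sharply between two dates, a crude signal of recent land clearing such as surface mining.

```python
# Toy change-detection sketch on hypothetical arrays; the ARDC itself serves
# analysis-ready satellite data rather than this exact workflow.
import numpy as np

rng = np.random.default_rng(7)

def ndvi(red, nir):
    # Normalized difference vegetation index from red and near-infrared bands.
    return (nir - red) / (nir + red + 1e-9)

# Two hypothetical 100x100 scenes of the same area, years apart.
red_t0 = rng.uniform(0.05, 0.2, (100, 100))
nir_t0 = rng.uniform(0.3, 0.6, (100, 100))
red_t1, nir_t1 = red_t0.copy(), nir_t0.copy()
nir_t1[40:60, 40:60] = 0.1   # simulate vegetation loss in one excavated patch

# Pixels with a large NDVI drop are candidates for recently cleared land.
change = ndvi(red_t1, nir_t1) - ndvi(red_t0, nir_t0)
cleared = change < -0.3
print("Share of area flagged as cleared:", float(cleared.mean()))
```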

Addressing rapid urbanization in Sierra Leone through high-resolution poverty mapping

Rapid urbanization in Sierra Leone has contributed to major inequities. As of 2015, 40 percent of the national population lives in urban areas. 25 Of that, 50 percent lives in the Western Region, where Freetown is located, compared to 10 percent in the Southern Region. 26 Due to rapid population growth in Freetown, affordable land and housing are in short supply. Estimates place the housing deficit at 166,000 units. 27 Land degradation has further complicated efforts to improve the situation. Sierra Leone’s Environmental Protection Agency warns that deforestation associated with unplanned dwellings and the rise of informal settlements is leading to soil erosion, among other environmental issues. In 2017, flooding killed upward of 400 people and contributed to rising homelessness.

The Africa Regional Data Cube could help policymakers track rapid urbanization in Sierra Leone. 28 High-resolution satellite imagery of land cover and human settlements may aid efforts to identify vulnerable populations and improve city planning. 29   GRID 3 —a project led by the United Nations Population Fund, U.K. Department for International Development, Bill & Melinda Gates Foundation, WorldPOP/Flowminder, and Columbia University’s Center for International Earth Science Information Network—also aims to build robust geospatial data for population mapping, among other policy priorities. GRID 3 is already being used in Nigeria to identify and collect data on settlements across the country to improve public health responses (starting with polio eradication ) and it could be used in a similar way to deliver better policy outcomes in other countries.

In fact, subnational mapping of population distributions and wealth profiles is already garnering attention within the academic community. Christopher Elvidge et al. (2009) produced the first satellite-generated, spatially disaggregated global map of poverty. 30 He and his team used four types of remote sensor data—DMSP lights, MODIS land cover, Shuttle Radar Topography Mission (SRTM) topography, and National Geospatial Intelligence Agency’s Controlled Image Base (CIB) —calibrated with 2006 World Development Indicators national poverty levels to estimate the number of people living in poverty. Their estimates show that remote sensing can aid efforts to calculate the number of individuals living in poverty, and determine where they are located. This could have far-reaching advantages for the development community.

Recommendations for action

Social science is just beginning to exploit big data and machine learning. In each area—data collection, data analysis, and policymaker use of analysis—there is scope for improvements. There are a number of actions that would improve access to big data, improve the use of data analytics, and use machine learning to monitor outcomes and drive policymaking.

Improve access to and cost of big data.

Data is expensive and, increasingly, is held within private companies. Researchers must negotiate access to data such as Call Detail Records on a case-by-case basis. However, the telecommunications companies that currently collect these data are concerned about privacy issues (although researchers typically ask for aggregated data) and are reluctant to give away, free of charge, data that they could potentially sell.

Granted, generating data is expensive, so a core challenge will be funding. High-quality satellite machinery is expensive and requires ongoing maintenance. The Department of the Navy and Department of the Air Force spent a combined $29.8 million in FY15 to acquire and process data from the Department of Defense’s Defense Meteorological Satellite Program (DMSP) and other sources of SBEM data. 31

To implement this recommendation, two things are needed. First, a set of ethically based protocols for provision of mobile data, along with an agreement with companies that they provide public access to such data as a condition of their license to operate. Second, governments, especially rich country governments with satellites, should provide access to the imagery for free or at marginal cost (which, given the digital technology involved, is almost free).

Improve big data analytics.

Data providers are often surprised that remote sensing data is being used for social science purposes. Their primary audience is in the military and intelligence services and the data tend to be mostly classified. However, this restricts data availability and timeliness in a way that compromises machine learning. Robust classification sets are needed to train the artificial neural networks. Additionally, machines require some degree of human supervision. Many of the countries that most need data analysis do not have the statistical infrastructure, nor do they have sufficient numbers of trained personnel, to employ “deep learning” techniques.

To implement this suggestion, data providers should work with analysts to understand better what kinds of data would enable better machine learning.

Use machine learning to monitor outcomes and drive policymaking, with particular attention to spatial implications.

Policymakers in economic development are largely unfamiliar with big data and its potential benefits, especially in identifying spatial issues. Development projects and interventions are over-designed at the beginning, with long gestation periods to try to overcome potential obstacles and bottlenecks. As a result, despite significant investment in monitoring and evaluation, the time frames involved are very long: decades from project concept to completion, followed by more years in evaluation and development of new approaches.

Machine learning offers an opportunity to shortcut this process, but policymakers have not yet systematically built into project design feedback loops that would permit rapid fine-tuning, while projects are being implemented. Results-based approaches require a mindset change: away from evaluating results and toward constantly learning to scale up and improve results. Such a mindset change will require very different project designs.

Emerging technologies have transformed three core areas: (i) data collection; (ii) data analysis and (iii) use of data analysis for policymaking. New big data platforms allow researchers to acquire granular details on a number of socioeconomic and environmental indicators. Remote sensing satellites provide real-time luminosity and daytime pictures that can serve as proxies for human economic activity, as well as determine changes to land cover and urban features. Other sources of geospatial data—like Call Detail Records, social media footprints, automated marine sensors, and climate-measuring devices—expand the scope and volume of information available to policymakers. Meanwhile, advances in data analytics transform the way in which data scientists and machines manipulate large sets of data. Artificial neural networks make it possible to recognize patterns across multiple layers of input, improving accuracy and permitting multidimensional analyses. Policymakers have used this information to map digital connectivity across schools in Kyrgyzstan, assess deforestation in Colombia following the peace process, track illegal mining operations in Ghana, and improve city planning in the Western Region of Sierra Leone, to name a few examples. Agenda 2030 has at its disposal a new digital toolkit that is spearheading a data revolution.

The global community is entering a new world, where real-time data is shortening the feedback loop between outcomes and policy. Conventional methods of data collection, which require substantial time to conduct and disseminate, have hindered efforts to implement change quickly and effectively. By the time reports are available to key decision-makers, data on the ground have already changed. In contrast, big data and artificial intelligence allow researchers to acquire up-to-date information at varying degrees of granularity, while simultaneously processing for patterns that can inform policy.

The key to this data revolution is trust. How can the development community foster trust among individuals, whose socioeconomic data are critical to achieving sustainable solutions, at a time when concerns are mounting over privacy and cybersecurity? Relatedly, how can researchers assure policymakers that machine-generated analyses can be trusted as evidence on which to base key policy decisions?

While emerging technologies bring about a number of technical solutions, transformation will be felt most acutely in our ability to learn and adapt alongside the machines. After all, artificial intelligence is not a panacea. Only when machine learning is coupled with human insight will the global community achieve sustainable development solutions.

  • United Nations, High-Level Panel of Eminent Persons on the Post-2015 Development Agenda: A New Global Partnership: Eradicate Poverty and Transform Economies through Sustainable Development , (New York: United Nations Publications, 2013)
  • World Bank. A Measured Approach to Ending Poverty and Boosting Shared Prosperity: Concepts, Data, and the Twin Goals. (Washington, D.C.: World Bank, 2015), http://www.worldbank.org/en/research/publication/a-measured-approach-to-ending-poverty-and-boosting-shared-prosperity . PovcalNet online poverty analysis tool, https://iresearch.worldbank.org/povcalnet/ (2015); quoted in Jean, Neal, Marshall Burke, Michael Xie, W. Matthew Davis, David B. Lobell, Stefano Ermon, “Combining satellite imagery and machine learning to predict poverty,” Science 353, no. 6301 (August 2016): 790-794. doi: 10.1126/science.aaf7894
  • Beegle, Kathleen, Joachim De Weerdt, Jed Friedman, and John Gibson, “Methods of household consumption measurement through surveys: Experimental results from Tanzania,” Journal of Development Economics 98, no. 1 (May 2012): 3-18. Science Direct.
  • Pinkovskiy and Xavier Sala-i-Martin, “Lights, Camera…Income! Illuminating the National Accounts Household Survey Debate,” The Quarterly Journal of Economics 131, no. 2 (May 2016): 579-631. doi: 10.1093/qje/qjw003
  • Group on Earth Observations. https://earthobservations.org/index2.php (accessed September 17, 2018).
  • Anderson, Katherine, Barbara Ryan, William Sonntag, Argyro Kavvada, and Lawrence Friedl, “Earth observation in service of the 2030 Agenda for Sustainable Development,” Geo-spatial Information Science 20, no. 2 (June 2017): 77-96. Taylor & Francis Online.
  • See, Linda, Steffen Fritz, Inian Moorthy, Olha Danylo, Michiel van Dijk, and Barbara Ryan, “Using Remote Sensing and Geospatial Information for Sustainable Development,” In From Summits to Solutions: Innovations in Implementing the Sustainable Development Goals, edited by Raj M. Desai, Hiroshi Kato, Homi Kharas, and John W. McArthur, 172-198. Washington, D.C.: The Brookings Institution, 2018.
  • Elvidge, Christopher D., Kimberly E. Baugh, Eric A. Kihn, Herbert W. Kroehl, Ethan R. Davis, and C.W. Davis, “Relation between satellite observed visible-near infrared emissions, population, economic activity and electric power consumption,” International Journal of Remote Sensing 18, no. 6 (1997): 1373-1379. doi: 10.1080/014311697218485
  • Sutton, Paul C. and Robert Costanza, “Global estimates of market and non-market values derived from nighttime satellite imagery, land cover, and ecosystem service valuation,” Ecological Economics 41, no. 3 (June 2002): 509-527. doi: 10.1016/S0921-8009(02)00097-6
  • Henderson, J. Vernon, Adam Storeygard, and David N. Weil, “Measuring Economic Growth from Outer Space,” American Economic Review 102, no. 2 (April 2012): 994-1028. American Economic Association.
  • Blumenstock, Joshua, Gabriel Cadamuro, and Robert On, “Predicting poverty and wealth from mobile phone metadata,” Science 350, no. 6264 (November 2015). doi: 10.1126/science.aac4420
  • Frias-Martinez, Vanessa, Victor Soto, Jesus Virseda, Enrique Frias-Martinez, “Can Cell Phone Traces Measure Social Development?” Third Conference on the Analysis of Mobile Phone Datasets, NetMob 2013, Boston, MA, 2013 (Oral Presentation).
  • Schmidt, Torsten and Simeon Vosen, “Forecasting Private Consumption: Survey-based Indicators vs. Google Trends,” Journal of Forecasting 30, no. 6 (September 2011): 565-578. doi: 10.1002/for.1213
  • “The Ocean We Need for the Future: Proposal for an International Decade of Ocean Service for Sustainable Development,” United Nations Educational, Scientific and Cultural Organization (2017). http://www.unesco.org/new/en/media-services/single view/news/towards_an_international_decade_of_ocean_science_for_sustain/
  • Hof, Robert D. “Deep Learning: With massive amounts of computational power, machines can now recognize objects and translate speech in real time. Artificial intelligence is finally getting smart,” MIT Technology Review . https://www.technologyreview.com/s/513696/deep-learning/
  • Jean, Neal, Marshall Burke, Michael Xie, W. Matthew Davis, David B. Lobell, Stefano Ermon, “Combining satellite imagery and machine learning to predict poverty,” Science 353, no. 6301 (August 2016): 790-794. doi: 10.1126/science.aaf7894
  • “Magic Box – School Mapping,” UNICEF. http://unicefstories.org/magicbox/schoolmapping/
  • Reardon, Sara, “FARC and the forest: Peace is destroying Colombia’s jungle – and opening it to science,” Nature International Journal of Science 558 (June 2018). doi: 10.1038/d41586-018-05397-2
  • Taylor, Kevin and Marisa Schwartz Taylor, “Illegal Gold Mining Boom Threatens Cocoa Farmers (And your Chocolate),” National Geographic , March 6, 2018. https://news.nationalgeographic.com/2018/03/ghana-gold-mining-cocoa-environment/
  • Melamed, Claire, “The Africa Regional Data Cube: Harnessing Satellites for SDG Progress,” Global Partnership for Sustainable Development, June 19, 2018.  http://www.data4sdgs.org/news/africa-regional-data-cube-harnessing-satellites-sdg-progress
  • Diagne, Alioune, “Sierra Leone 2015 Population and Housing Census: Thematic Report on Migration and Urbanization,” Statistics Sierra Leone (October 2017). https://sierraleone.unfpa.org/
  • “Flooding in Freetown: a failure of planning?” Africa Research Institute, November 6, 2015. https://www.africaresearchinstitute.org/newsite/blog/flooding-in-freetown-a-failure-of-planning/
  • Melamed, Claire, “The Africa Regional Data Cube: Harnessing Satellites for SDG Progress,” Global Partnership for Sustainable Development, June 19, 2018. http://www.data4sdgs.org/news/africa-regional-data-cube-harnessing-satellites-sdg-progress
  • Galeon, Florence A., “Estimation of Population in Informal Settlement Communities Using High Resolution Satellite Image,” The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XXXVII, B4 (Beijing 2008): 1377-1382. ResearchGate.
  • Elvidge, Christopher D., Paul C. Sutton, Tilottama Ghosh, Benjamin T. Tuttle, Kimberly E. Baugh, Budhendra Bhaduri, and Edward Bright, “A global poverty map derived from satellite data,” Computers and Geosciences 35, no. 8 (August 2009): 1652-1660. doi: 10.1016/j.cageo.2009.01.009
  • United States Air Force, “Department of Defense Plan to Meet Joint Requirements Oversight Council Meteorological and Oceanographic Collection Requirements ,” (2016). Defense Daily Network: http://cdn.defensedaily.com/

Machine learning accelerated carbon neutrality research using big data—from predictive models to interatomic potentials

  • Published: 20 September 2022
  • Volume 65, pages 2274–2296 (2022)

  • LingJun Wu 1 ,
  • ZhenMing Xu 2 ,
  • ZiXuan Wang 1 ,
  • ZiJian Chen 1 ,
  • ZhiChao Huang 1 ,
  • Chao Peng 3 ,
  • XiangDong Pei 4 ,
  • XiangGuo Li 5 ,
  • Jonathan P. Mailoa 6 ,
  • Chang-Yu Hsieh 6 ,
  • Xue-Feng Yu 1 &
  • HaiTao Zhao 1  

430 Accesses

Carbon neutrality has been proposed as a solution to the current severe energy and climate crisis caused by the overuse of fossil fuels, and machine learning (ML) has exhibited excellent performance in accelerating related research owing to its powerful capacity for big data processing. This review presents a detailed overview of ML-accelerated carbon neutrality research, focusing on energy management, the screening of novel energy materials, and ML interatomic potentials (MLIPs), with illustrations of two selected MLIP algorithms: the moment tensor potential (MTP) and the neural equivariant interatomic potential (NequIP). We conclude by outlining the important role of ML in accelerating the achievement of carbon neutrality, from global-scale energy management and the unprecedented screening of advanced energy materials across a massive chemical space to the revolution in atomic-scale simulation brought about by MLIPs, which hold bright prospects for application.
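
As a schematic of what an ML interatomic potential does (a deliberately simplified stand-in, not the MTP or NequIP formulations the review illustrates), the sketch below fits a kernel regression from crude descriptors of atomic configurations to total energies and then predicts the energy of an unseen configuration.

```python
# Highly simplified MLIP-style sketch on synthetic data: kernel ridge regression
# from structure descriptors to energies. Real MLIPs such as MTP or NequIP use
# far richer, symmetry-aware representations and are also fitted to forces.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(3)

def descriptor(positions):
    # Crude permutation-invariant descriptor: histogram of pairwise distances.
    dists = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    iu = np.triu_indices(len(positions), k=1)
    hist, _ = np.histogram(dists[iu], bins=20, range=(0.0, 6.0))
    return hist / len(positions)

def toy_energy(positions):
    # Lennard-Jones-like pair energy as a stand-in for quantum-mechanical labels.
    iu = np.triu_indices(len(positions), k=1)
    d = np.clip(np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)[iu], 0.8, None)
    return np.sum((1.0 / d) ** 12 - 2.0 * (1.0 / d) ** 6)

# Synthetic training set of random 8-atom configurations.
configs = [rng.uniform(0, 4, size=(8, 3)) for _ in range(200)]
X = np.array([descriptor(c) for c in configs])
y = np.array([toy_energy(c) for c in configs])

model = KernelRidge(kernel="rbf", alpha=1e-3, gamma=0.5)
model.fit(X, y)

new_config = rng.uniform(0, 4, size=(8, 3))
print("Predicted energy:", model.predict(descriptor(new_config)[None, :])[0])
```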

Chen J M. Carbon neutrality: Toward a sustainable future. Innovation, 2021, 2: 100127

Zhao X, Ma X, Chen B, et al. Challenges toward carbon neutrality in China: Strategies and countermeasures. Resources Conservat Recycl, 2022, 176: 105959

Barnham K W J, Mazzer M, Clive B. Resolving the energy crisis: Nuclear or photovoltaics? Nat Mater, 2006, 5: 161–164

Houghton J. Global warming. Rep Prog Phys, 2005, 68: 1343–1403

Gillingham K, Stock J H. The cost of reducing greenhouse gas emissions. J Economic Perspect, 2018, 32: 53–72

Rosa E A, Dietz T. Human drivers of national greenhouse-gas emissions. Nat Clim Change, 2012, 2: 581–586

Howarth R W, Santoro R, Ingraffea A. Methane and the greenhouse-gas footprint of natural gas from shale formations. Climatic Change, 2011, 106: 679–690

Tanaka K. Review of policies and measures for energy efficiency in industry sector. Energy Policy, 2011, 39: 6532–6550

Dan S. Regional differences in China’s energy efficiency and conservation potentials. China World Economy, 2007, 15: 96–115

Jeong K, Hong T, Kim J. Development of a CO 2 emission benchmark for achieving the national CO 2 emission reduction target by 2030. Energy Buildings, 2018, 158: 86–94

Zhang B, Wang Z, Yin J, et al. CO 2 emission reduction within Chinese iron & steel industry: Practices, determinants and performance. J Cleaner Production, 2012, 33: 167–178

Peng S S, Piao S, Zeng Z, et al. Afforestation in China cools local land surface temperature. Proc Natl Acad Sci USA, 2014, 111: 2915–2919

Arora V K, Montenegro A. Small temperature benefits provided by realistic afforestation efforts. Nat Geosci, 2011, 4: 514–518

Gielen D, Boshell F, Saygin D, et al. The role of renewable energy in the global energy transformation. Energy Strategy Rev, 2019, 24: 38–50

Lund H. Renewable energy strategies for sustainable development. Energy, 2007, 32: 912–919

George G, Haas M R, Pentland A. Big data and management. Academy Manage J, 2014, 57: 321–326

Sagiroglu S, Sinanc D. Big data: A review. In: Proceedings of the International Conference on Collaboration Technologies and Systems (CTS). San Diego, 2013. 42–47

Haenlein M, Kaplan A. A brief history of artificial intelligence: On the past, present, and future of artificial intelligence. California Manage Rev, 2019, 61: 5–14

Stilgoe J. Machine learning, social learning and the governance of self-driving cars. Soc Stud Sci, 2018, 48: 25–56

Deo R C. Machine learning in medicine. Circulation, 2015, 132: 1920–1930

Silver D, Hubert T, Schrittwieser J, et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 2018, 362: 1140–1144

Jordan M I, Mitchell T M. Machine learning: Trends, perspectives, and prospects. Science, 2015, 349: 255–260

Yin H, Sun Z, Wang Z, et al. The data-intensive scientific revolution occurring where two-dimensional materials meet machine learning. Cell Rep Phys Sci, 2021, 2: 100482

Lai F, Sun Z, Saji S E, et al. Machine learning-aided crystal facet rational design with ionic liquid controllable synthesis. Small, 2021, 17: 2100024

Zhao H, Ezeh C I, Ren W, et al. Integration of machine learning approaches for accelerated discovery of transition-metal dichalcogenides as Hg 0 sensing materials. Appl Energy, 2019, 254: 113651

Wang Z, Sun Z, Yin H, et al. Data-driven materials innovation and applications. Adv Mater, 2022, 2104113

Kruitwagen L, Story K T, Friedrich J, et al. A global inventory of photovoltaic solar energy generating units. Nature, 2021, 598: 604–610

Alkandari M, Ahmad I. Solar power generation forecasting using ensemble approach based on deep learning and statistical methods. Appl Comput Inf, 2020, doi: https://doi.org/10.1016/j.aci.2019.11.002

Nauck C, Lindner M, Schürholt K, et al. Predicting basin stability of power grids using graph neural networks. arXiv: 2108.08230

Chen A, Zhang X, Chen L, et al. A machine learning model on simple features for CO 2 reduction electrocatalysts. J Phys Chem C, 2020, 124: 22471–22478

Zhong M, Tran K, Min Y, et al. Accelerated discovery of CO 2 electrocatalysts using active machine learning. Nature, 2020, 581: 178–183

Priya P, Aluru N R. Accelerated design and discovery of perovskites with high conductivity for energy applications through machine learning. npj Comput Mater, 2021, 7: 1–2

Carvalho R P, Marchiori C F N, Brandell D, et al. Artificial intelligence driven in-silico discovery of novel organic lithium-ion battery cathodes. Energy Storage Mater, 2022, 44: 313–325

Wang C, Aoyagi K, Wisesa P, et al. Lithium ion conduction in cathode coating materials from on-the-fly machine learning. Chem Mater, 2020, 32: 3741–3752

Park C W, Kornbluth M, Vandermause J, et al. Accurate and scalable graph neural network force field and molecular dynamics with direct force architecture. npj Comput Mater, 2021, 7: 73

Rangel-Martinez D, Nigam K D P, Ricardez-Sandoval L A. Machine learning on sustainable energy: A review and outlook on renewable energy systems, catalysis, smart grid and energy storage. Chem Eng Res Des, 2021, 174: 414–441

Narciso D A C, Martins F G. Application of machine learning tools for energy efficiency in industry: A review. Energy Rep, 2020, 6: 1181–1199

Liu Y, Zhao T, Ju W, et al. Materials discovery and design using machine learning. J Materiomics, 2017, 3: 159–177

Gu G H, Noh J, Kim I, et al. Machine learning for renewable energy materials. J Mater Chem A, 2019, 7: 17096–17117

Moses O A, Chen W, Adam M L, et al. Integration of data-intensive, machine learning and robotic experimental approaches for accelerated discovery of catalysts in renewable energy-related reactions. Mater Rep-Energy, 2021, 1: 100049

Liu Y, Guo B, Zou X, et al. Machine learning assisted materials design and discovery for rechargeable batteries. Energy Storage Mater, 2020, 31: 434–450

Deringer V L, Caro M A, Csányi G. Machine learning interatomic potentials as emerging tools for materials science. Adv Mater, 2019, 31: 1902765

Deringer V L. Modelling and understanding battery materials with machine-learning-driven atomistic simulations. J Phys Energy, 2020, 2: 041003

Chen C, Zuo Y, Ye W, et al. A critical review of machine learning of energy materials. Adv Energy Mater, 2020, 10: 1903242

Weber T, Wiseman N A, Kock A. Global ocean methane emissions dominated by shallow coastal waters. Nat Commun, 2019, 10: 4584

Pourghasemi H R, Gayen A, Lasaponara R, et al. Application of learning vector quantization and different machine learning techniques to assessing forest fire influence factors and spatial modelling. Environ Res, 2020, 184: 109321

Qi J, Banerjee S, Zuo Y, et al. Bridging the gap between simulated and experimental ionic conductivities in lithium superionic conductors. Mater Today Phys, 2021, 21: 100463

Zuo Y, Chen C, Li X, et al. Performance and cost assessment of machine learning interatomic potentials. J Phys Chem A, 2020, 124: 731–745

Batzner S, Musaelian A, Sun L, et al. E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. arXiv: 2101.03164

Demolli H, Dokuz A S, Ecemis A, et al. Wind power forecasting based on daily wind speed data using machine learning algorithms. Energy Convers Manage, 2019, 198: 111823

Wang Y, Liu J, Han Y. Production capacity prediction of hydropower industries for energy optimization: Evidence based on novel extreme learning machine integrating Monte Carlo. J Cleaner Production, 2020, 272: 122824

Gao J. Machine learning applications for data center optimization. Google White Paper. 2014

Wan X, Zhang Z, Niu H, et al. Machine-learning-accelerated catalytic activity predictions of transition metal phthalocyanine dual-metal-site catalysts for CO 2 reduction. J Phys Chem Lett, 2021, 12: 6111–6118

Garrido Torres J A, Gharakhanyan V, Artrith N, et al. Augmenting zero-Kelvin quantum mechanics with machine learning for the prediction of chemical reactions at high temperatures. Nat Commun, 2021, 12: 7012

Joshi J, Sukumar R. Improving prediction and assessment of global fires using multilayer neural networks. Sci Rep, 2021, 11: 3295

Mutlu A Y, Yucel O. An artificial intelligence based approach to predicting syngas composition for downdraft biomass gasification. Energy, 2018, 165: 895–901

Fischer S. Globalization and its challenges. Am Econ Rev, 2003, 93: 1–30

Huang Z, Zhang H, Duan H. How will globalization contribute to reduce energy consumption?. Energy, 2020, 213: 118825

Shahbaz M, Lahiani A, Abosedra S, et al. The role of globalization in energy consumption: A quantile cointegrating regression approach. Energy Economics, 2018, 71: 161–170

Baloch M A, Ozturk I, Bekun F V, et al. Modeling the dynamic linkage between financial development, energy innovation, and environmental quality: Does globalization matter?. Bus Strat Env, 2021, 30: 176–184

Breiman L. Random forests. Mach Learn, 2001, 45: 5–32

Chen T, Guestrin C. XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: Association for Computing Machinery, 2016. 785–794

Mozer M C, Jordan M I, Petsche T. Advances in Neural Information Processing Systems 9: In: Proceedings of the 1996 Conference. Cambridge, Massachusetts: MIT Press, 1997

Yao Z, Ruzzo W L. A regression-based K nearest neighbor algorithm for gene function prediction from heterogeneous data. BMC BioInf, 2006, 7: S11

Tibshirani R. Regression shrinkage and selection via the lasso. J Roy Stat Soc B Met, 1996, 58: 267–288

Elith J, Leathwick J R, Hastie T. A working guide to boosted regression trees. J Anim Ecol, 2008, 77: 802–813

Vorpahl P, Elsenbeer H, Märker M, et al. How can statistical models help to determine driving factors of landslides? Ecol Model, 2012, 239: 27–39

Weng Z, Jiang J, Wu Y, et al. Electrochemical CO 2 reduction to hydrocarbons on a heterogeneous molecular Cu catalyst in aqueous solution. J Am Chem Soc, 2016, 138: 8076–8079

Chen Y, Li C W, Kanan M W. Aqueous CO 2 reduction at very low overpotential on oxide-derived Au nanoparticles. J Am Chem Soc, 2012, 134: 19969–19972

Lim R J, Xie M, Sk M A, et al. A review on the electrochemical reduction of CO 2 in fuel cells, metal electrodes and molecular catalysts. Catal Today, 2014, 233: 169–180

Wu Y, Zhao H, Wu Z, et al. Rational design of carbon materials as anodes for potassium-ion batteries. Energy Storage Mater, 2021, 34: 483–507

Liu B, Yang J, Yang H, et al. Rationalizing the interphase stability of Li|doped-Li 7 La 3 Zr 2 O 12 via automated reaction screening and machine learning. J Mater Chem A, 2019, 7: 19961–19969

Ahmad Z, Xie T, Maheshwari C, et al. Machine learning enabled computational screening of inorganic solid electrolytes for suppression of dendrite formation in lithium metal anodes. ACS Cent Sci, 2018, 4: 996–1006

Severson K A, Attia P M, Jin N, et al. Data-driven prediction of battery cycle life before capacity degradation. Nat Energy, 2019, 4: 383–391

Ouyang Y, Shi L, Bai X, et al. Breaking scaling relations for efficient CO 2 electrochemical reduction through dual-atom catalysts. Chem Sci, 2020, 11: 1807–1813

Ying Y, Luo X, Qiao J, et al. “More is different:” Synergistic effect and structural engineering in double-atom catalysts. Adv Funct Mater, 2021, 31: 2007423

Guimaräes da Silva M, Costa Muniz A R, Hoffmann R, et al. Impact of greenhouse gases on surface coal mining in Brazil. J Cleaner Production, 2018, 193: 206–216

Norgate T, Haque N. Energy and greenhouse gas impacts of mining and mineral processing operations. J Cleaner Production, 2010, 18: 266–274

Liang Q, Gongora A E, Ren Z, et al. Benchmarking the performance of Bayesian optimization across multiple experimental materials science domains. npj Comput Mater, 2021, 7: 188

Burger B, Maffettone P M, Gusev V V, et al. A mobile robotic chemist. Nature, 2020, 583: 237–241

Granda J M, Donina L, Dragone V, et al. Controlling an organic synthesis robot with machine learning to search for new reactivity. Nature, 2018, 559: 377–381

Zuo X, Zhu J, Müller-Buschbaum P, et al. Silicon based lithium-ion battery anodes: A chronicle perspective review. Nano Energy, 2017, 31: 113–143

Giffin G A. Ionic liquid-based electrolytes for “beyond lithium” battery technologies. J Mater Chem A, 2016, 4: 13378–13389

Siqi S, Jian G, Yue L, et al. Multi-scale computation methods: Their applications in lithium-ion battery research and development. Chin Phys B, 2016, 25: 018212

Zhao Q, Avdeev M, Chen L, et al. Machine learning prediction of activation energy in cubic Li-argyrodites with hierarchically encoding crystal structure-based (HECS) descriptors. Sci Bull, 2021, 66: 1401–1408

Zhao Q, Zhang L, He B, et al. Identifying descriptors for Li + conduction in cubic Li-argyrodites via hierarchically encoding crystal structure and inferring causality. Energy Storage Mater, 2021, 40: 386–393

Xie J, Lu Y C. A retrospective on lithium-ion batteries. Nat Commun, 2020, 11: 2499

Chen Y, Kang Y, Zhao Y, et al. A review of lithium-ion battery safety concerns: The issues, strategies, and testing standards. J Energy Chem, 2021, 59: 83–99

Armand M, Axmann P, Bresser D, et al. Lithium-ion batteries—current state of the art and anticipated developments. J Power Sources, 2020, 479: 228708

Yang Y, Okonkwo E G, Huang G, et al. On the sustainability of lithium ion battery industry—a review and perspective. Energy Storage Mater, 2021, 36: 186–212

Masias A, Marcicki J, Paxton W A. Opportunities and challenges of lithium ion batteries in automotive applications. ACS Energy Lett, 2021, 6: 621–630

Wang F, Wang B, Li J, et al. Prelithiation: A crucial strategy for boosting the practical application of next-generation lithium ion battery. ACS Nano, 2021, 15: 2197–2218

Duffner F, Kronemeyer N, Tübke J, et al. Post-lithium-ion battery cell production and its compatibility with lithium-ion cell production infrastructure. Nat Energy, 2021, 6: 123–134

Steiger J, Kramer D, Mönig R. Mechanisms of dendritic growth investigated by in situ light microscopy during electrodeposition and dissolution of lithium. J Power Sources, 2014, 261: 112–119

Aurbach D, Zinigrad E, Teller H, et al. Factors which limit the cycle life of rechargeable lithium (metal) batteries. J Electrochem Soc, 2000, 147: 1274

Ahmad Z, Viswanathan V. Stability of electrodeposition at solidsolid interfaces and implications for metal anodes. Phys Rev Lett, 2017, 119: 056003

Jacobs R, Mayeshiba T, Booske J, et al. Material discovery and design principles for stable, high activity perovskite cathodes for solid oxide fuel cells. Adv Energy Mater, 2018, 8: 1702708

Emery A A, Saal J E, Kirklin S, et al. High-throughput computational screening of perovskites for thermochemical water splitting applications. Chem Mater, 2016, 28: 5621–5634

Xu X, Chen Y, Zhou W, et al. A perovskite electrocatalyst for efficient hydrogen evolution reaction. Adv Mater, 2016, 28: 6442–6448

Poizot P, Dolhem F. Clean energy new deal for a sustainable world: From non-CO 2 generating energy sources to greener electrochemical storage devices. Energy Environ Sci, 2011, 4: 2003–2019

Larcher D, Tarascon J M. Towards greener and more sustainable batteries for electrical energy storage. Nat Chem, 2015, 7: 19–29

Wang A, Zou Z, Wang D, et al. Identifying chemical factors affecting reaction kinetics in Li-air battery via ab initio calculations and machine learning. Energy Storage Mater, 2021, 35: 595–601

Liu Y, Wu J, Avdeev M, et al. Multi-layer feature selection incorporating weighted score-based expert knowledge toward modeling materials with targeted properties. Adv Theor Simul, 2020, 3: 1900215

Cano Z P, Banham D, Ye S, et al. Batteries and fuel cells for emerging electric vehicle markets. Nat Energy, 2018, 3: 279–289

Schuster S F, Bach T, Fleder E, et al. Nonlinear aging characteristics of lithium-ion cells under different operational conditions. J Energy Storage, 2015, 1: 44–53

Harris S J, Harris D J, Li C. Failure statistics for commercial lithium ion batteries: A study of 24 pouch cells. J Power Sources, 2017, 342: 589–597

Burke K. Perspective on density functional theory. J Chem Phys, 2012, 136: 150901

Hospital A, Goñi J R, Orozco M, et al. Molecular dynamics simulations: Advances and applications. Adv Appl Bioinform Chem, 2015, 8: 37

Fan H B, Yuen M M F. Material properties of the cross-linked epoxy resin compound predicted by molecular dynamics simulation. Polymer, 2007, 48: 2174–2178

Brooks B R, Bruccoleri R E, Olafson B D, et al. CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. J Comput Chem, 1983, 4: 187–217

Sun H. COMPASS: An ab initio force-field optimized for condensed-phase applicationsoverview with details on alkane and benzene compounds. J Phys Chem B, 1998, 102: 7338–7364

Shapeev A V. Moment tensor potentials: A class of systematically improvable interatomic potentials. Multiscale Model Simul, 2016, 14: 1153–1173

Bartók A P, Kondor R, Csányi G. On representing chemical environments. Phys Rev B, 2013, 87: 184115

Deng Z, Chen C, Li X-G, et al. An electrostatic spectral neighbor analysis potential (eSNAP) for lithium nitride. arXiv: 1901.08749

Bartók A P, Payne M C, Kondor R, et al. Gaussian approximation potentials: The accuracy of quantum mechanics, without the electrons. Phys Rev Lett, 2010, 104: 136403

Hajibabaei A, Kim K S. Universal machine learning interatomic potentials: surveying solid electrolytes. J Phys Chem Lett, 2021, 12: 8115–8120

Hajibabaei A, Myung C W, Kim K S. Sparse Gaussian process potentials: Application to lithium diffusivity in superionic conducting solid electrolytes. Phys Rev B, 2021, 103: 214102

Li W, Ando Y, Minamitani E, et al. Study of Li atom diffusion in amorphous Li 3 PO 4 with neural network potential. J Chem Phys, 2017, 147: 214106

Artrith N, Urban A, Ceder G. Constructing first-principles phase diagrams of amorphous Li x Si using machine-learning-assisted sampling with an evolutionary algorithm. J Chem Phys, 2018, 148: 241711

Onat B, Cubuk E D, Malone B D, et al. Implanted neural network potentials: Application to Li-Si alloys. Phys Rev B, 2018, 97: 094106

Mailoa J P, Kornbluth M, Batzner S, et al. A fast neural network approach for direct covariant forces prediction in complex multielement extended systems. Nat Mach Intell, 2019, 1: 471–479

Zhang L, Han J, Wang H, et al. Deep potential molecular dynamics: A scalable model with the accuracy of quantum mechanics. Phys Rev Lett, 2018, 120: 143001

Marcolongo A, Binninger T, Zipoli F, et al. Simulating diffusion properties of solid-state electrolytes via a neural network potential: Performance and training scheme. arXiv: 1910.10090

Fujikake S, Deringer V L, Lee T H, et al. Gaussian approximation potential modeling of lithium intercalation in carbon nanostructures. J Chem Phys, 2018, 148: 241714

Thompson A P, Swiler L P, Trott C R, et al. Spectral neighbor analysis method for automated generation of quantum-accurate interatomic potentials. J Comput Phys, 2015, 285: 316–330

Moses O A, Gao L, Zhao H, et al. 2D materials inks toward smart flexible electronics. Mater Today, 2021, 50: 116–148

Deringer V L, Merlet C, Hu Y, et al. Towards an atomistic understanding of disordered carbon electrode materials. Chem Commun, 2018, 54: 5988–5991

Deringer V L, Bernstein N, Bartók A P, et al. Realistic atomistic structure of amorphous silicon from machine-learning-driven molecular dynamics. J Phys Chem Lett, 2018, 9: 2879–2885

Behler J. Atom-centered symmetry functions for constructing high-dimensional neural network potentials. J Chem Phys, 2011, 134: 074106

Behler J, Parrinello M. Generalized neural-network representation of high-dimensional potential-energy surfaces. Phys Rev Lett, 2007, 98: 146401

Behler J. Neural network potential-energy surfaces in chemistry: A tool for large-scale simulations. Phys Chem Chem Phys, 2011, 13: 17930–17955

Schütt K T, Arbabzadah F, Chmiela S, et al. Quantum-chemical insights from deep tensor neural networks. Nat Commun, 2017, 8: 13890

Schütt K T, Sauceda H E, Kindermans P J, et al. SchNet—a deep learning architecture for molecules and materials. J Chem Phys, 2018, 148: 241722

Schütt K T, Kindermans P J, Sauceda H E, et al. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions. arXiv: 1706.08566

Wang W, Yang T, Harris W H, et al. Active learning and neural network potentials accelerate molecular screening of ether-based solvate ionic liquids. Chem Commun, 2020, 56: 8920–8923

Schütt K T, Unke O T, Gastegger M. Equivariant message passing for the prediction of tensorial properties and molecular spectra. arXiv: 2102.03150

Anderson B, Hy T S, Kondor R. Cormorant: Covariant molecular neural networks. arXiv: 1906.04015

Haghighatlari M, Li J, Guan X, et al. NewtonNet: A Newtonian message passing network for deep learning of interatomic potentials and forces. arXiv: 2108.02913

Jørgensen P B, Bhowmik A. Graph neural networks for fast electron density estimation of molecules, liquids, and solids. arXiv: 2112.00652

Montes-Campos H, Carrete J, Bichelmaier S, et al. A differentiable neural-network force field for ionic liquids. J Chem Inf Model, 2022, 62: 88–101

Download references

Author information

These authors contributed equally to this work.

Authors and Affiliations

Materials Interfaces Center, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China

LingJun Wu, ZiXuan Wang, ZiJian Chen, ZhiChao Huang, Xue-Feng Yu & HaiTao Zhao

College of Materials Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016, China

ZhenMing Xu

Multiscale Crystal Materials Research Center, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China

College of Computer, National University of Defense Technology, Changsha, 410073, China

XiangDong Pei

School of Materials, Shenzhen Campus of Sun Yat-sen University, Shenzhen, 518107, China

XiangGuo Li

Quantum Laboratory, Tencent, Shenzhen, 518057, China

Jonathan P. Mailoa & Chang-Yu Hsieh

China Beacons Institute, The University of Nottingham Ningbo China, Ningbo, 315100, China


Corresponding authors

Correspondence to XiangGuo Li , Jonathan P. Mailoa or HaiTao Zhao .

Additional information

This work was supported by the National Natural Science Foundation of China (Grant No. 52173234), the Shenzhen Science and Technology Program (Grant Nos. JCYJ20210324102008023 and JSGG20210802153408024), the Shenzhen-Hong Kong-Macau Technology Research Program (Type C, SGDX2020110309300301), the Natural Science Foundation of Guangdong Province (Grant No. 2022A1515010554), and CCF-Tencent Open Fund. The authors also thank Ningbo Municipal Key Laboratory on Clean Energy Conversion Technologies and the Zhejiang Provincial Key Laboratory for Carbonaceous Wastes Processing and Process Intensification Research funded by the Zhejiang Provincial Department of Science and Technology (Grant No. 2020E10018).


About this article

Wu, L., Xu, Z., Wang, Z. et al. Machine learning accelerated carbon neutrality research using big data—from predictive models to interatomic potentials. Sci. China Technol. Sci. 65 , 2274–2296 (2022). https://doi.org/10.1007/s11431-022-2095-7


Received : 19 February 2022

Accepted : 26 May 2022

Published : 20 September 2022

Issue Date : October 2022

DOI : https://doi.org/10.1007/s11431-022-2095-7


Keywords

  • carbon neutrality
  • machine learning
  • molecular dynamics
  • interatomic potentials
214 Best Big Data Research Topics for Your Thesis Paper


Finding an ideal big data research topic can take a long time. Big data, IoT, and robotics have evolved rapidly, and future generations will be immersed in technologies that make work easier. Work that once required ten people can now be done by one person or a machine. Although some jobs will be lost, many new ones will be created in their place.

Big data is a major topic that is being embraced globally. Data science and analytics are helping institutions, governments, and the private sector. We will share with you the best big data research topics.

On top of that, we offer writing tips to help you succeed in your academics. As a university student, you need to do proper research to earn top grades. If you need extra support, you can consult us for research paper writing services.

Big Data Analytics Research Topics for your Research Project

Are you looking for an ideal big data analytics research topic? Once you choose one, consult your professor to confirm that it is suitable. This will help you earn good grades.

  • Which are the best tools and software for big data processing?
  • Evaluate the security issues that face big data.
  • An analysis of large-scale data for social networks globally.
  • The influence of big data storage systems.
  • The best platforms for big data computing.
  • The relation between business intelligence and big data analytics.
  • The importance of semantics and visualization of big data.
  • Analysis of big data technologies for businesses.
  • The common methods used for machine learning in big data.
  • The difference between self-tuning and symmetric spectral clustering.
  • The importance of information-based clustering.
  • Evaluate the hierarchical clustering and density-based clustering application.
  • How is data mining used to analyze transaction data?
  • The major importance of dependency modeling.
  • The influence of probabilistic classification in data mining.

Interesting Big Data Analytics Topics

Who said big data had to be boring? Here are some interesting big data analytics topics that you can try. They are based on how data-driven methods are being used to make the world a better place.

  • Discuss the privacy issues in big data.
  • Evaluate scalable storage systems in big data.
  • The best big data processing software and tools.
  • Popularly used data mining tools and techniques.
  • Evaluate the scalable architectures for parallel data processing.
  • The major natural language processing methods.
  • Which are the best big data tools and deployment platforms?
  • The best algorithms for data visualization.
  • Analyze anomaly detection in cloud servers.
  • The screening normally done when recruiting for big data job profiles.
  • Malicious user detection in big data collection.
  • Learning long-term dependencies via the Fourier recurrent units.
  • Nomadic computing for big data analytics.
  • The elementary estimators for graphical models.
  • The memory-efficient kernel approximation.

Big Data Latest Research Topics

Do you know the latest research topics at the moment? These 15 topics will help you to dive into interesting research. You may even build on research done by other scholars.

  • Evaluate the data mining process.
  • The influence of the various dimension reduction methods and techniques.
  • The best data classification methods.
  • The simple linear regression modeling methods.
  • Evaluate the logistic regression modeling.
  • What are the commonly used theorems?
  • The influence of cluster analysis methods in big data.
  • The importance of smoothing methods analysis in big data.
  • How is fraud detection done through AI?
  • Analyze the use of GIS and spatial data.
  • How important is artificial intelligence in the modern world?
  • What is agile data science?
  • Analyze the behavioral analytics process.
  • Semantic analytics distribution.
  • How is domain knowledge important in data analysis?

Big Data Debate Topics

If you want to prosper in the field of big data, you need to try even hard topics. These big data debate topics are interesting and will help you to get a better understanding.

  • The difference between big data analytics and traditional data analytics methods.
  • Why do you think the organization should think beyond the Hadoop hype?
  • Does the size of the data matter more than how recent the data is?
  • Is it true that bigger data are not always better?
  • The debate of privacy and personalization in maintaining ethics in big data.
  • The relation between data science and privacy.
  • Do you think data science is a rebranding of statistics?
  • Who delivers better results between data scientists and domain experts?
  • According to your view, is data science dead?
  • Do you think analytics teams need to be centralized or decentralized?
  • The best methods to resource an analytics team.
  • The best business case for investing in analytics.
  • The societal implications of the use of predictive analytics within Education.
  • Is there a need for greater control to prevent experimentation on social media users without their consent?
  • How is the government using big data: to improve public statistics or to control the population?

University Dissertation Topics on Big Data

Are you doing your Masters or Ph.D. and wondering which dissertation or thesis topic to choose? Why not try any of these? They are interesting and based on various phenomena. While doing the research, ensure you relate the phenomenon to modern society.

  • The machine learning algorithms used for fall recognition.
  • The divergence and convergence of the Internet of Things.
  • Reliable data movement using bandwidth provisioning strategies.
  • How is big data analytics using artificial neural networks in cloud gaming?
  • How is Twitter account classification done using network-based features?
  • How is online anomaly detection done in the cloud collaborative environment?
  • Evaluate the public transportation insights provided by big data.
  • Evaluate paradigms that use nursing EHR data to predict outcomes for cancer patients.
  • Discuss current lossless data compression in the smart grid.
  • How does online advertising traffic prediction help boost businesses?
  • How is the hyperspectral classification done using the multiple kernel learning paradigm?
  • The analysis of large data sets downloaded from websites.
  • How does social media data help advertising companies globally?
  • Which are the systems recognizing and enforcing ownership of data records?
  • The alternate possibilities emerging for edge computing.

The Best Big Data Analysis Research Topics and Essays

There are a lot of issues associated with big data. Here are some research topics that you can use in your essays. These topics are ideal whether you are in high school or college.

  • The various errors and uncertainty in making data decisions.
  • The application of big data on tourism.
  • The automation innovation with big data or related technology
  • The business models of big data ecosystems.
  • Privacy awareness in the era of big data and machine learning.
  • The data privacy for big automotive data.
  • How is traffic managed in defined data center networks?
  • Big data analytics for fault detection.
  • The need for machine learning with big data.
  • The innovative big data processing used in health care institutions.
  • The money normalization and extraction from texts.
  • How is text categorization done in AI?
  • The opportunistic development of data-driven interactive applications.
  • The use of data science and big data towards personalized medicine.
  • The programming and optimization of big data applications.

The Latest Big Data Research Topics for your Research Proposal

Doing a research proposal can be hard at first unless you choose an ideal topic. If you are just diving into the big data field, you can use any of these topics to get a deeper understanding.

  • The data-centric network of things.
  • Big data management using artificial intelligence supply chain.
  • The big data analytics for maintenance.
  • The high confidence network predictions for big biological data.
  • The performance optimization techniques and tools for data-intensive computation platforms.
  • The predictive modeling in the legal context.
  • Analysis of large data sets in life sciences.
  • How can we understand mobility and transport modal disparities using emerging data sources?
  • How do you think data analytics can support asset management decisions?
  • An analysis of travel patterns for cellular network data.
  • The data-driven strategic planning for citywide building retrofitting.
  • How is money normalization done in data analytics?
  • Major techniques used in data mining.
  • The big data adaptation and analytics of cloud computing.
  • The predictive data maintenance for fault diagnosis.

Interesting Research Topics on A/B Testing In Big Data

A/B testing topics are different from the usual big data topics. However, you use a broadly similar methodology to find the reasons behind the issues. These topics are interesting and will help you gain a deeper understanding.

  • How is ultra-targeted marketing done?
  • The transition of A/B testing from digital to offline.
  • How can big data and A/B testing be done to win an election?
  • Evaluate the use of A/B testing on big data
  • Evaluate A/B testing as a randomized control experiment.
  • How does A/B testing work?
  • The mistakes to avoid while conducting the A/B testing.
  • The most ideal time to use A/B testing.
  • The best way to interpret results for an A/B test.
  • The major principles of A/B tests.
  • Evaluate the cluster randomization in big data
  • The best way to analyze A/B test results and the statistical significance.
  • How is A/B testing used in boosting businesses?
  • The importance of data analysis in conversion research
  • The importance of A/B testing in data science.

Amazing Research Topics on Big Data and Local Governments

Governments are now using big data to make citizens' lives better, both within government agencies and across various public institutions. The topics below are based on real-life applications that are making the world better.

  • Assess the benefits and barriers of big data in the public sector.
  • The best approach to smart city data ecosystems.
  • The big data analytics used for policymaking.
  • Evaluate smart technology and the emergence of algorithmic bureaucracy.
  • Evaluate the use of citizen scoring in public services.
  • An analysis of government administrative data globally.
  • The public values found in the era of big data.
  • Public engagement on local government data use.
  • Data analytics use in policymaking.
  • How are algorithms used in public sector decision-making?
  • The democratic governance in the big data era.
  • The best business model innovation to be used in sustainable organizations.
  • How does the government use the collected data from various sources?
  • The role of big data for smart cities.
  • How does big data play a role in policymaking?

Easy Research Topics on Big Data

Who said big data topics had to be hard? Here are some of the easiest research topics. They are based on data management, research, and data retention. Pick one and try it!

  • Who uses big data analytics?
  • Evaluate structured machine learning.
  • Explain the whole deep learning process.
  • Which are the best ways to manage platforms for enterprise analytics?
  • Which are the new technologies used in data management?
  • What is the importance of data retention?
  • The best ways to work with images when doing research.
  • The best way to promote research outreach is through data management.
  • The best way to source and manage external data.
  • Does machine learning improve the quality of data?
  • Describe the security technologies that can be used in data protection.
  • Evaluate token-based authentication and its importance.
  • How can poor data security lead to the loss of information?
  • How to determine whether data is secure.
  • What is the importance of centralized key management?

Unique IoT and Big Data Research Topics

The Internet of Things has evolved, and many devices now rely on it. There are smart devices, smart cities, smart locks, and much more. Things can now be controlled at the touch of a button.

  • Evaluate the 5G networks and IoT.
  • Analyze the use of Artificial intelligence in the modern world.
  • How do ultra-low-power IoT technologies work?
  • Evaluate the adaptive systems and models at runtime.
  • How have smart cities and smart environments improved the living space?
  • The importance of the IoT-based supply chains.
  • How does smart agriculture influence water management?
  • Naming and identifiers for Internet applications.
  • How does the smart grid influence energy management?
  • Which are the best design principles for IoT application development?
  • The best human-device interactions for the Internet of Things.
  • The relation between urban dynamics and crowdsourcing services.
  • The best wireless sensor network for IoT security.
  • The best intrusion detection in IoT.
  • The importance of big data on the Internet of Things.

Big Data Database Research Topics You Should Try

Big data is broad and interesting. These big data database research topics will put you in a better place in your research. You also get to evaluate the roles of various phenomena.

  • The best cloud computing platforms for big data analytics.
  • The parallel programming techniques for big data processing.
  • The importance of big data models and algorithms in research.
  • Evaluate the role of big data analytics for smart healthcare.
  • How is big data analytics used in business intelligence?
  • The best machine learning methods for big data.
  • Evaluate the Hadoop programming in big data analytics.
  • What is privacy-preserving to big data analytics?
  • The best tools for massive big data processing
  • IoT deployment in Governments and Internet service providers.
  • How will IoT be used for future internet architectures?
  • How does big data close the gap between research and implementation?
  • What are the cross-layer attacks in IoT?
  • The influence of big data and smart city planning in society.
  • Why do you think user access control is important?

Big Data Scala Research Topics

Scala is a programming language that is used in data management. It is closely related to other data programming languages. Here are some of the best Scala questions that you can research.

  • Which are the most used languages in big data?
  • How is Scala used in big data research?
  • Is Scala better than Java for big data?
  • How is Scala a concise programming language?
  • How does the Scala language support stream processing in real time?
  • Which are the various libraries for data science and data analysis?
  • How does Scala allow imperative programming in data collection?
  • Evaluate how Scala includes a useful REPL for interaction.
  • Evaluate Scala’s IDE support.
  • The data catalog reference model.
  • Evaluate the basics of data management and its influence on research.
  • Discuss the behavioral analytics process.
  • What can be termed the experience economy?
  • The difference between agile data science and the Scala language.
  • Explain the graph analytics process.

Independent Research Topics for Big Data

These independent research topics for big data are based on the various technologies and how they are related. Big data will be of great importance to modern society.

  • The biggest investment is in big data analysis.
  • How are multi-cloud and hybrid settings putting down deep roots?
  • Why do you think machine learning will be in focus for a long while?
  • Discuss in-memory computing.
  • What is the difference between edge computing and in-memory computing?
  • The relation between the Internet of things and big data.
  • How will digital transformation make the world a better place?
  • How does data analysis help in social network optimization?
  • How will complex big data be essential for future enterprises?
  • Compare the various big data frameworks.
  • The best ways to gather and monitor traffic information using CCTV images.
  • Evaluate the hierarchical structure of groups and clusters in the decision tree.
  • Which are the 3D mapping techniques for live-streaming data?
  • How does machine learning help to improve data analysis?
  • Evaluate DataStream management in task allocation.
  • How is big data provisioned through edge computing?
  • The model-based clustering of texts.
  • The best ways to manage big data.
  • The use of machine learning in big data.

Is Your Big Data Thesis Giving You Problems?

These are some of the best topics that you can use to prosper in your studies. Not only are they easy to research, but they also reflect real-world issues. Whether you are in university or college, you need to put enough effort into your studies to prosper. However, if you have time constraints, we can provide professional writing help. Are you looking for online expert writers? Look no further; we will provide quality work at an affordable price.


  • Open access
  • Published: 31 August 2024

Patient regional index: a new way to rank clinical specialties based on outpatient clinics big data

  • Xiaoling Peng 1 ,
  • Moyuan Huang 1 ,
  • Xinyang Li 1 ,
  • Tianyi Zhou 1 ,
  • Guiping Lin 2 &
  • Xiaoguang Wang 2  

BMC Medical Research Methodology volume  24 , Article number:  192 ( 2024 ) Cite this article

Metrics details

Many existing healthcare ranking systems are notably intricate. The standards for peer review and evaluation often differ across specialties, leading to contradictory results among various ranking systems. There is a significant need for a comprehensible and consistent mode of specialty assessment.

This quantitative study aimed to assess the influence of clinical specialties on the regional distribution of patient origins based on 10,097,795 outpatient records of a large comprehensive hospital in South China. We proposed the patient regional index (PRI), a novel metric to quantify the regional influence of hospital specialties, using the principle of representative points of a statistical distribution. Additionally, a two-dimensional measure was constructed to gauge the significance of hospital specialties by integrating the PRI and outpatient volume.

We calculated the PRI for each of the 16 specialties of interest over eight consecutive years. The longitudinal changes in the PRI accurately captured the impact of the 2017 Chinese healthcare reforms and the 2020 COVID-19 pandemic on hospital specialties. At last, the two-dimensional assessment model we devised effectively illustrates the distinct characteristics across hospital specialties.

We propose a novel, straightforward, and interpretable index for quantifying the influence of hospital specialties. This index, built on outpatient data, requires only the patients’ origin, thereby facilitating its widespread adoption and comparison across specialties of varying backgrounds. This data-driven method offers a patient-centric view of specialty influence, diverging from the traditional reliance on expert opinions. As such, it serves as a valuable augmentation to existing ranking systems.


Introduction

In the realm of healthcare, a ‘specialty’ denotes a distinct branch of medicine that is dedicated to the study and treatment of a specific category of diseases or conditions. The ‘clinical specialty ability’ encapsulates the capacity and growth potential of medical institutions to deliver specialized healthcare services. The quantitative appraisal of specialty ability is crucial for the distribution of medical resources, industry regulation, and hospital administration. Importantly, objective and rational assessments of specialty ability equip patients with valuable information, aiding them in making informed decisions regarding their choice of medical institutions.

Over the past few decades, numerous organizations have ranked or rated hospitals and specialties within their respective countries. These rankings serve dual purposes: guiding patients in selecting hospitals or medical centers, and providing a scientific foundation for the evolution of specialties in government-run hospitals. One of the most renowned ranking systems is the ‘Best Hospitals Honor Roll’, introduced by the U.S. News & World Report in 1990 [ 1 ]. This system, designed to assist patients in identifying superior medical centers and physicians across the United States, was established based on patient outcomes, patient experiences, medical technology, specialty reputation (derived from physician surveys), and other health-related indicators. However, the inherent complexity of this ranking system often leads to conflicting standings amongst various institutions [ 2 , 3 ]. In China, the most authoritative ranking system is the ‘Chinese Hospital Specialist Reputation Rankings’, issued by the Institute of Hospital Management at Fudan University. The evaluations conducted under this system primarily depend on expert ratings and research capability, and their credibility hinges on the authority and professionalism of the experts involved [ 4 , 5 , 6 ]. Despite its intricate computation, this system also neglects the intended audience of these rankings and lacks sufficiently objective evaluation standards [ 6 , 7 ].

Recently, eight healthcare experts reviewed four major hospital rating systems and identified the following significant issues [ 8 ]:

Comprehensiveness and representativeness of the data. For example, most rating systems use administrative data collected for billing rather than clinical purposes, and incomplete data can lead to bias in the assessments.

Reliability of the data. Rating systems generate their own data through surveys and they are not able to assess the validity and reliability of the data independently.

Methods for integrating and weighting composite metrics vary. Different rating systems use different methods to calculate composite indicators, which causes overall scores or ratings to vary widely. In some cases, the choice of weights even depends on stakeholder perceptions.

Distorted evaluation of small hospitals. Small hospitals are difficult to assess fairly with the usual performance estimation methods because of their low capacity.

Lack of uniform peer-review. Although each rating system uses expert panels to some extent, the expertise of the panels is uncertain and their evaluations are heavily influenced by the subjective thinking of the experts.

From the discussion above, it’s evident that while large and complex rating systems may seem comprehensive, their complexity in data collection and computation often hinders their effectiveness. Considering the diversity among different medical systems, it is challenging for these rating systems to draw convincing and consistent conclusions. Therefore, the ranking of hospitals and specialties should be straightforward and quantifiable. Utilizing assessment metrics that emphasize relevant patient-centered information can enhance patient acceptance of these evaluation metrics and better define the concept of patient-centered quality of medical care [ 9 ]. For instance, Cram suggested the use of patient-centered objective indicators, such as Readmission Reduction, to measure hospital quality [ 10 ].

Many studies investigating the factors that influence patients’ access to medical care have identified both the reputation of the specialty and geographical distance as crucial variables affecting patient choices [ 11 , 12 , 13 ]. Typically, patients favor medical institutions with a higher reputation in their specialty and those closer in proximity [ 14 ]. Therefore, the scope of a specialty service is tied to its overall societal reputation; the higher the reputation, the broader the geographical origins of its patient base [ 15 ]. Specialties with stronger reputations draw patients not only from the immediate vicinity of the medical institution and the city but also from areas outside the city and even the province. The geographical proximity of the patients’ origins to the medical institution serves as an objective indicator of the patients’ trust in its specialty competencies.

Representative Points (RPs) were proposed as a technique to discretize and approximate a continuous statistical distribution [ 16 ] and have since been utilized in a variety of fields, such as information compression and transmission in signal processing [ 17 ]. Fang and He have applied RPs for the grouping of Chinese body sizes to develop clothing standards [ 18 ]. The most common type of RPs are the mean squared error representative points (MSE-RPs), which aim to minimize the mean squared error in relation to the original distribution. MSE-RPs are accomplished by segmenting the domain of the distribution into distinct intervals, each symbolized by a single point, known as the representative point. A unique characteristic of MSE-RPs is their self-consistency in single-peaked distributions, where the RP for each interval aligns with the expected value for that interval. This property allows for the application of the Lloyd-Max method [ 19 , 20 , 21 ] to calculate RPs. By iteratively adjusting the interval endpoints and the expected values within each interval, MSE-RPs and their corresponding intervals can be obtained from a set of initial points.
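In symbols, self-consistency means that each representative point equals the conditional mean of its own interval (our notation here anticipates the Methods section below):

$$y_i = \mathbb{E}\left[X \mid X \in \Omega_i\right] = \frac{\int_{\Omega_i} x f(x)\, dx}{\int_{\Omega_i} f(x)\, dx}, \qquad i = 1, \dots, k,$$

which is precisely the fixed-point condition that the Lloyd-Max iteration alternates with the midpoint update of the interval boundaries.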

Drawing from outpatient clinics big data, the objective of this paper is to introduce a novel index referred to as the patient regional index for assessing specialty influence using the MSE-RPs technology. Unlike conventional ranking methods that rely on diverse indicators and expert evaluations, our data-centric approach is straightforward to execute and comprehend, and it accurately represents the specialty influence from the patient’s viewpoint. Consequently, it offers superior consistency and comparability when evaluating the specialty influence across various medical institutions.

Data selection and preprocessing

The Chinese healthcare system is a government-funded and government-administered system. In China, public general hospitals are typically the main healthcare institutions that provide comprehensive medical services, including various specialties. In this study, we utilized a large, comprehensive hospital in South China to demonstrate the specific calculation process of the PRI.

From all 33 departments of this general hospital, we meticulously chose 16 specialties as the focus of our analysis. The selection of these specific specialties was guided by several factors. Firstly, they represent the hospital’s priority areas, reflecting the institution’s strategic focus. Secondly, these specialties have a well-established history within the hospital, indicating their enduring relevance. Lastly, these specialties are ubiquitous across most general hospitals, underscoring their widespread prevalence. It’s important to note that certain departments, such as the Emergency and Intensive Care Department, were deliberately omitted from our selection due to their unique patient demographics. As a result, the data set was collected from the outpatient registration records spanning 16 departments such as Pediatrics and Urology. Covering the period from 2014 to 2021, the dataset comprised a total of 10,098,024 visit records.

We began data processing by converting all patients’ origins in the database into latitude and longitude coordinates, where the patient regional information was extracted from the address and telephone number in the patients’ visit records. To achieve this, we utilized the Baidu Open Platform’s geocoding Application Programming Interface (API), which enabled us to convert text addresses into their corresponding latitude and longitude coordinates. When we encountered invalid addresses, such as blank fields or unrecognizable entries, we turned to the phone module in Python to extract area information from the patients’ phone numbers. Instances where data lacked both a valid address and a valid phone number were classified as invalid.

We then employed the Python library ‘geopy’ to compute the distance from each patient’s origin to the hospital, using the latitude and longitude coordinates. All coordinates were maintained to four decimal places, adhering to the default WGS-84 model. Upon calculation, 229 distances exceeding 5,000 km were identified; these were excluded from the analysis as implausible. Ultimately, we were left with 10,097,795 valid distances, which were further employed for estimating the baseline distance distribution and calculating the patient regional index (PRI).
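As a rough illustration of this preprocessing step, the sketch below computes the geodesic (WGS-84) distance from an already-geocoded patient origin to the hospital with geopy and applies the 5,000 km plausibility cut-off; the hospital coordinates and the example origin are placeholders, not values from the study.

```python
from geopy.distance import geodesic

# Placeholder hospital coordinates (latitude, longitude); the study does not publish the real values.
HOSPITAL = (23.13, 113.26)

def origin_distance_km(origin_lat, origin_lon):
    """Geodesic distance (WGS-84 ellipsoid, geopy's default) from a patient's
    geocoded origin to the hospital. Returns None for the implausible
    distances (> 5,000 km) that the study excluded."""
    d = geodesic((origin_lat, origin_lon), HOSPITAL).km
    return d if d <= 5000 else None

# Example: an origin geocoded to roughly Changsha, a few hundred kilometres away.
print(origin_distance_km(28.2282, 112.9388))
```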

Patient regional index

After segmenting the distances from patients’ origins to the healthcare institution into several intervals, the PRI is constructed as a weighted sum of the quantity (or proportion) of patients in each interval, with weights inversely proportional to each interval’s probability mass so that more distant intervals carry larger weights.

In this paper, we apply the theory of MSE-RPs to derive an optimal partition for the statistical distribution of distances, which allows us to determine the corresponding weights for each interval. Our initial step involves establishing a baseline distribution for these distances.

Fitting the baseline distance distribution

Considering that the majority of patients come from nearby areas, the likelihood of a patient traveling from a remote location is comparatively low. As such, the distribution of distances should exhibit right-skewness. Therefore, the two-parameter Gamma distribution \(\mathrm{Ga}(\alpha, \beta)\) serves as an appropriate model to characterize the distribution of distances from patients’ residences to the healthcare institution.

Utilizing R 4.0.4, we fitted the baseline distance distribution Ga(α, β) using all valid distances from patients’ origins to the hospital. The maximum likelihood estimation yielded parameter estimates of \(\hat{\alpha} = 0.1954\) and \(\hat{\beta} = 0.0014\). Figure 1 illustrates the distance distribution as a histogram, superimposed with the probability density curve for Ga(0.1954, 0.0014). From Fig. 1 we see that the fitted Gamma distribution effectively captures the right-skewed nature of the distance distribution. This distribution indicates that the majority of visits originate from areas in close proximity to the hospital, with the frequency of visits significantly decreasing as the distance increases. Therefore, Ga(0.1954, 0.0014) will serve as the baseline distance distribution for future partitioning and weighting.
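The same kind of maximum-likelihood fit can be reproduced in Python with SciPy (the authors used R 4.0.4); this is a minimal sketch in which `distances_km` stands in for the array of valid origin-to-hospital distances, and the rate \(\beta\) is recovered as the reciprocal of SciPy's scale parameter.

```python
import numpy as np
from scipy import stats

# distances_km: 1-D array of valid origin-to-hospital distances in km.
# A synthetic stand-in is generated here so the snippet runs on its own.
rng = np.random.default_rng(0)
distances_km = rng.gamma(shape=0.1954, scale=1 / 0.0014, size=100_000)

# Two-parameter Gamma MLE: fix the location at 0 so only shape and scale are estimated.
alpha_hat, _, scale_hat = stats.gamma.fit(distances_km, floc=0)
beta_hat = 1.0 / scale_hat  # rate parameterization, matching Ga(alpha, beta)

print(f"alpha_hat = {alpha_hat:.4f}, beta_hat = {beta_hat:.4f}")
```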

Figure 1. Baseline distribution of the distances between the patients’ origin and the hospital

Partitioning the baseline distance distribution

According to the theory of MSE-RPs [18], for a continuous distribution \(F(x)\) defined on \([c,d]\), the \(k\) representative points \(\mathbf{y}=\{y_1, y_2, \cdots, y_k\}\), the corresponding \(k\) intervals \(\boldsymbol{\Omega}=\{\Omega_1, \Omega_2, \cdots, \Omega_k\}\), and their respective probabilities \(\mathbf{p}=\{p_1, p_2, \cdots, p_k\}\) can be derived by minimizing the MSE below:

$$\mathrm{MSE}(\mathbf{y}) = \sum_{i=1}^{k} \int_{\Omega_i} (x - y_i)^2 f(x)\, dx,$$

where \(f(x)\) is the probability density function of \(F(x)\). Naturally, the interval \(\Omega_i\) associated with \(y_i\) is its interval of integration. The cumulative probability \(p_i\) for each interval is given by

$$p_i = \int_{\Omega_i} f(x)\, dx.$$

Figure 2 illustrates the distance intervals obtained from the partition of a right-skewed Gamma distribution, specifically for \(k = 6\).

Figure 2. Six distance intervals for a right-skewed Gamma distribution

In this study, we used the Lloyd-Max method [ 19 ] to generate six representative points, along with their corresponding intervals and cumulative probabilities, from the baseline distance distribution Ga(0.1954,0.0014). These are listed in Table  1 .
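For readers wishing to reproduce the partitioning, the sketch below implements the Lloyd-Max iteration for the fitted Gamma baseline, using the self-consistency property (each representative point is the conditional mean of its interval). The starting points, iteration cap, and tolerance are our own arbitrary choices, and the printed values are not guaranteed to match Table 1 exactly.

```python
import numpy as np
from scipy.stats import gamma

ALPHA, BETA = 0.1954, 0.0014          # fitted baseline Ga(alpha, beta)
SCALE = 1.0 / BETA                    # SciPy uses the scale parameterization
K = 6                                 # number of representative points / intervals

def conditional_mean(a, b):
    """E[X | a < X < b] for X ~ Gamma(ALPHA, scale=SCALE), using the identity
    E[X; a < X < b] = ALPHA * SCALE * (F_{ALPHA+1}(b) - F_{ALPHA+1}(a))."""
    mass = gamma.cdf(b, ALPHA, scale=SCALE) - gamma.cdf(a, ALPHA, scale=SCALE)
    partial = ALPHA * SCALE * (gamma.cdf(b, ALPHA + 1, scale=SCALE)
                               - gamma.cdf(a, ALPHA + 1, scale=SCALE))
    return partial / mass

# Start from equally spaced quantiles and alternate the two Lloyd-Max updates.
y = gamma.ppf((np.arange(K) + 0.5) / K, ALPHA, scale=SCALE)
for _ in range(500):
    # Interval boundaries are midpoints between adjacent representative points.
    bounds = np.concatenate(([0.0], (y[:-1] + y[1:]) / 2, [np.inf]))
    # Each representative point becomes the conditional mean of its interval.
    y_new = np.array([conditional_mean(bounds[i], bounds[i + 1]) for i in range(K)])
    if np.allclose(y_new, y, rtol=1e-10):
        break
    y = y_new

probs = np.diff(gamma.cdf(bounds, ALPHA, scale=SCALE))  # probability of each interval
print("representative points (km):", np.round(y, 1))
print("interval probabilities   :", np.round(probs, 4))
```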

Calculating the patient regional index

Adhering to the principle that greater distances should be assigned higher weights, we define the weight \(w_i\) of the \(i\)th distance interval to be inversely proportional to the probability \(p_i\), that is, \(w_i \propto 1/p_i\). To ensure the sum of all weights equals 1, i.e., \(\sum_i w_i = 1\), the weight of the \(i\)th distance interval is assigned as below:

$$w_i = \frac{1/p_i}{\sum_{j=1}^{k} 1/p_j}. \qquad (1)$$

Following the probabilities of each interval provided in Table  1 , the weights for these distance intervals were computed in accordance with Eq. ( 1 ) and are presented in the last row of Table  1 .

Given the proportions \(r_1, r_2, \cdots, r_k\) of patients’ origins distributed across these \(k\) distance intervals, the patient regional index (PRI) is then defined as the weighted average of the patients’ geographical distribution:

$$\mathrm{PRI} = \sum_{i=1}^{k} w_i r_i. \qquad (2)$$

Since the distribution of patients’ origins varies across departments, the PRI scores derived from Eq. ( 2 ) will differ for different departments. Generally, a department will have a higher PRI score if it attracts a larger proportion of patients from more distant regions.
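Given the interval probabilities, Eqs. (1) and (2) reduce to a couple of lines; in the sketch below both the probabilities and the departmental proportions `r` are made-up numbers for illustration, not values from Table 1.

```python
import numpy as np

# probs: interval probabilities from the baseline partition (see the Lloyd-Max sketch above).
probs = np.array([0.62, 0.18, 0.09, 0.06, 0.03, 0.02])  # illustrative values only

w = (1.0 / probs) / np.sum(1.0 / probs)                  # Eq. (1): inverse-probability weights

# r: observed proportion of a department's patients falling in each distance interval.
r = np.array([0.55, 0.20, 0.10, 0.08, 0.04, 0.03])       # illustrative values only
pri = float(np.dot(w, r))                                # Eq. (2): weighted average
print(round(pri, 4))
```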

Finally, we provide a summary of the process used to construct the PRI from outpatient clinics data:

Step 1 : Transform patient origin data into respective distances from the healthcare institution.

Step 2 : Establish the baseline distribution for the population of distances.

Step 3 : Obtain representative intervals and their associated probabilities from the baseline distribution.

Step 4 : Determine the PRI for a specific specialty as a weighted average of the proportions of its patients within each representative interval.

Given that the PRI scores derived in the aforementioned manner were numerically small, we adopted the average PRI score from 2017 as a benchmark, setting this score at 100. Subsequently, each PRI was adjusted as:

$$\mathrm{PRI}_{\mathrm{adj}} = \frac{\mathrm{PRI}}{\overline{\mathrm{PRI}}_{2017}} \times 100,$$

where \(\overline{\mathrm{PRI}}_{2017}\) denotes the average PRI score across specialties in 2017.

For the sake of simplicity, in the remaining sections of this paper, any mention of the PRI will refer to the adjusted PRI.

A two-dimensional assessment model

The calculation of the PRI, as described above, primarily relies on the proportion of patients from different distances rather than their absolute numbers. This is because the number of patients can significantly vary from one department to another due to the unique characteristics of each specialty. For instance, in densely populated cities, pediatrics typically sees a larger patient volume, while some specialties like orthopedics primarily cater to a smaller population with physical deformities, resulting in fewer outpatient visits. By basing the PRI on proportions rather than total numbers, we can evaluate the regional influence of a specialty while mitigating the impact of the specialty’s inherent attributes.

Nevertheless, it’s crucial to recognize that the volume of outpatient visits serves as a significant measure of a specialty’s proficiency, reflecting aspects such as patient demand, quality of care, efficiency, and capacity, among others. Hence, a two-dimensional assessment framework, leveraging outpatient big data, that incorporates both patient regional distribution and outpatient volume can provide a more comprehensive evaluation of a specialty’s influence. Figure  3 briefly illustrates a schematic diagram of this joint assessment model for specialty influence based on outpatient big data.
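As a toy illustration of this joint view (not the authors' figure or data), the sketch below places a few hypothetical specialties in the PRI versus outpatient-volume plane and marks arbitrary quadrant thresholds:

```python
import matplotlib.pyplot as plt

# Illustrative (made-up) points for a few specialties: (adjusted outpatient volume, PRI).
specialties = {
    "Gynecology & Obstetrics": (170, 105),
    "Urology": (60, 130),
    "Pediatrics": (220, 80),
    "Oncology": (40, 70),
}

VOL_CUT, PRI_CUT = 100, 100  # quadrant thresholds chosen only for this diagram

fig, ax = plt.subplots()
for name, (vol, pri) in specialties.items():
    ax.scatter(vol, pri)
    ax.annotate(name, (vol, pri), textcoords="offset points", xytext=(5, 5))
ax.axvline(VOL_CUT, linestyle="--")
ax.axhline(PRI_CUT, linestyle="--")
ax.set_xlabel("Adjusted outpatient volume")
ax.set_ylabel("Patient regional index (PRI)")
ax.set_title("Two-dimensional assessment of specialty influence (illustrative)")
plt.show()
```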

Figure 3. A joint assessment model of specialty influence based on outpatient big data

Results

We calculated the PRI for each of the 16 specialties of interest over eight consecutive years. This calculation involved a weighted average of the proportions of patients from six distance intervals for each specialty, which we summarized annually. The proportions were weighted according to Eq. (2), using the weights specified in Table 1.

PRI trends amid healthcare reform and pandemic

Figure  4 illustrates the changes in the PRI across 16 specialties within the hospital from 2014 to 2021. For comparative analysis, we categorized the 16 specialties into two groups: non-surgical and surgical departments. The non-surgical departments included pediatrics, nuclear medicine, dermatology, endocrinology, traditional Chinese medicine, cardiovascular medicine, rheumatology, and respiratory medicine. The surgical departments comprised gynecology and obstetrics, urology, hepatobiliary and pancreatic surgery, breast surgery, ENT (ear-nose-throat), plastic surgery, orthopedics, and oncology.

Figure 4. Comparison of the PRI of specialties in a large comprehensive hospital, 2014–2021

Figure  4 reveals a gradual increase in the PRI for each specialty over time, reflecting the hospital’s development and diversification of its patient origin. This trend was particularly noticeable in the surgical departments, implying a growing regional influence and suggesting higher patient loyalty compared to non-surgical departments. It indicates that once patients recognize a hospital’s specialty, they are willing to travel longer distances for treatment.

However, we observed a decrease in the PRI for most specialties post-2017. This decline can be attributed to healthcare reforms initiated by the Chinese government in 2017, which encouraged patients with common diseases to seek initial treatment at primary healthcare institutions. This policy led to a significant reduction in out-of-town patients visiting this comprehensive hospital [ 22 , 23 ].

Additionally, the COVID-19 outbreak at the end of 2019 restricted people’s mobility, creating a noticeable inflection point in 2020 for the PRI of certain hospital specialties, especially surgical ones. Following the Chinese government’s control of the epidemic, patient mobility was restored, and the PRI of all specialties showed a significant rebound. Thus, the fluctuation in the PRI effectively mirrors the impact of China’s healthcare reform and the COVID-19 pandemic on specialty outpatient clinics.

Specialties overview using the joint assessment model

Beyond the scope of patient origin, the volume of outpatient visits recorded in the outpatient information system is another important metric of the proficiency of hospital specialties. This volume serves as a broad indicator of the scale of the specialty, the standard of medical technology, the efficiency of outpatient management, and the extent of patient trust in the hospital [24]. For the purpose of comparison, the average number of outpatient visits per specialty in 2017 was utilized as a benchmark and assigned a score of 100. The number of visits per specialty was then adjusted as follows:

$$\mathrm{OA} = \frac{\text{outpatient visits}}{\overline{\text{outpatient visits}}_{2017}} \times 100,$$

where OA denotes the adjusted outpatient volume and \(\overline{\text{outpatient visits}}_{2017}\) is the average number of visits per specialty in 2017.

Figure  5 illustrates the shift in the number of visits to nonsurgical and surgical specialties in this hospital from 2014 to 2021. Among the nonsurgical specialties, pediatrics, Chinese medicine, dermatology, and endocrinology observed considerably higher outpatient volumes compared to the remaining four departments. Notably, there was a significant downturn in the volume of pediatric outpatients in 2020, a trend likely attributable to the effects of the pandemic. In the realm of surgical departments, gynecology and obstetrics were the most profoundly impacted by the pandemic, while the number of visits to other surgical departments showed a tendency to rise, rather than decline, in 2020. This trend underscores the resilience and capacity of this hospital’s specialties during challenging times.

Figure 5. Comparison of the outpatient volume of specialties in a large comprehensive hospital, 2014–2021

By integrating the patients’ origin with outpatient volume, we could deliver a more holistic perspective on the strengths and unique features of the hospital’s various specialties. Figure  6 depicts a two-dimensional distribution in terms of the PRI and the adjusted outpatient volume for 16 specialties within this large comprehensive hospital over an eight-year span. Identical symbols were employed to represent the same specialty values across different years.

Figure 6. Joint assessment of the specialties in a large comprehensive hospital, 2014–2021

Using the two-dimensional assessment model illustrated in Fig.  6 , we broadly classified the hospital specialties into four primary categories. Category I encompasses specialties that demonstrate excellence through high outpatient volume and significant social influence. This category includes nonsurgical specialties such as Dermatology, and surgical specialties such as Gynecology and Obstetrics, which boasted an adjusted outpatient volume of approximately 150 or higher, and a PRI around 100. In fact, according to the ‘China Hospital and Specialty Reputation Ranking’, the Gynecology and Obstetrics department of this hospital holds the 2nd position in South China, affirming its superior specialty status.

Category II comprises specialties that, despite a smaller number of outpatient visits, maintain a high social reputation and attract patients from diverse regions. The Urology and Orthopedics departments fall under this category, with a PRI of 125 or above. These specialties are highly specialized, attracting a significant number of outpatients from distant locations.

Category III refers to specialties with a high outpatient volume, primarily serving local residents, and demonstrating robust operational capacity. The Pediatrics, Chinese Medicine, and Cardiovascular departments are included in this category, enjoying a strong reputation among local residents.

Lastly, Category IV includes specialties with smaller volumes, primarily serving local patients. The Nuclear Medicine and Rheumatology departments, due to the nature of their specialties, had a smaller volume of visits and an intermediate PRI.

We further selected Gynecology and Obstetrics (2017), Pediatrics (2019), Urology (2021), and Oncology (2014) as representatives of these four categories. We then generated a heat map detailing the distribution of their patients’ origins (Fig.  7 ). This heat map confirms that the dual indicators of the PRI and outpatient volume can effectively capture the unique characteristics of each specialty department. For instance, the Department of Gynecology and Obstetrics enjoyed an outstanding reputation, attracting patients not only from Southern China but also in large numbers from Central and Eastern China. Some patients even traveled from as far as Northeast China. In comparison to Pediatrics, the Urology department, despite its smaller total patient visits, had a significantly broader geographical reach for its patient origins. Consequently, Urology’s PRI was substantially higher than that of Pediatrics. In contrast, the Oncology department in 2014 was still in its nascent stages of development within the hospital. Its patients primarily resided in the hospital’s immediate vicinity, resulting in both its PRI and outpatient volume being relatively low.

Figure 7. Heat maps illustrating the distribution of patient origins across different specialty types. ‘OA’ denotes the adjusted outpatient volume

In this study, we developed a novel Patient Regional Index (PRI) to assess the influence of hospital specialties based on the statistical distribution of patient origins. By analyzing 10,097,795 outpatient records from a large comprehensive hospital in South China, we demonstrated that the PRI effectively captures the impact of significant healthcare events, such as the 2017 Chinese healthcare reforms and the 2020 COVID-19 pandemic. We also introduced a two-dimensional model that combines PRI with outpatient volume to provide a comprehensive characterization of various specialties. Based on the case study we have presented and discussed earlier, we wish to highlight the distinct advantages of the PRI as follows:

Accessibility of data: The dataset employed is straightforward and easily accessible. The methodology relies on only two simple indicators (i.e., regional patient origins and outpatient volume), which can be readily collected from a hospital’s outpatient system.

Objective weighting: The weighting of the distance intervals is determined by the statistical distribution of the data and relevant statistical theory, not by the subjective perceptions of stakeholders.

Patient-centric robustness: The index is robust and reliable because it reflects the choices of patients. It is a composite of the independent behavior of a substantial number of patients, rather than the subjective opinions of experts.

Unbiased by specialty capacity: The PRI is calculated from the proportion of patients in each regional interval rather than from the total number of patients. This mitigates the distortion that other performance assessment methods introduce for specialties with smaller volumes, thereby maximizing the fairness of the evaluation (a minimal computational sketch follows this list).

Ease of understanding and implementation: Assessing specialty influence from patient origins is easy to understand and implement. It can therefore be readily extended to evaluate the influence of an entire hospital and to rank and compare specialties across hospitals, with consistent results.
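The following minimal Python sketch illustrates the proportion-weighted calculation described above. The distance intervals and weights are hypothetical placeholders: in the paper, the intervals and their weights are derived from the empirical distribution of patient origins (via representative points) rather than set by hand, so this is an illustration of the idea, not the authors’ exact formula.

```python
# Illustrative sketch of a proportion-weighted regional index.
# The interval weights and counts below are hypothetical placeholders.

def regional_index(counts_by_interval, weights):
    """Return a PRI-like score from patient counts per distance interval.

    counts_by_interval: patient counts per interval, nearest to farthest.
    weights:            one weight per interval, larger for farther intervals.
    """
    total = sum(counts_by_interval)
    if total == 0:
        return 0.0
    # Proportions, not raw counts, so small specialties are not penalized
    # simply for having a low absolute volume.
    proportions = [c / total for c in counts_by_interval]
    return sum(p * w for p, w in zip(proportions, weights))


if __name__ == "__main__":
    weights = [10, 50, 100, 200]                 # hypothetical interval weights
    local_specialty = [9000, 800, 150, 50]       # large volume, mostly local
    regional_specialty = [400, 300, 200, 100]    # small volume, wide reach

    print(regional_index(local_specialty, weights))      # ~15.5
    print(regional_index(regional_specialty, weights))   # 59.0
```

In this toy example the smaller specialty receives the higher index because a larger share of its patients travel from distant intervals, which is exactly the behavior the “unbiased by specialty capacity” point describes.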

Although our study is preliminary, it offers innovative ideas for effectively using big data from outpatient clinics to assess and rank hospital specialties. Compared with existing mainstream evaluation methods, which are often complex, our method is straightforward, easy to implement, and replicable, providing consistent conclusions. Therefore, our proposed assessment method can provide a meaningful reference that complements the existing evaluation systems.

Limitations

This study was confined to calculating and comparing the influence of a single hospital across various specialties and years due to data limitations. While outpatient big data is readily available within individual hospitals, obstacles persist in sharing this data across different institutions. The implementation of advanced data privacy technology is necessary to overcome these barriers, and that is the focus of our upcoming work.

Moreover, since the PRI is entirely reliant on outpatient data, there may be potential distortions for certain specialties. For instance, with the growth of the hospital, our study noted that the oncology department began to draw patients from an increasingly broader geographical range (PRI > 100). However, it was observed that a significant number of these patients were not attracted to the hospital due to the reputation of the oncology specialty per se, but rather due to the need for subsequent treatment in the radiation and chemotherapy clinics of the oncology department following surgeries from other specialties. As such, it is crucial to meticulously scrutinize patient origins when evaluating specific specialties using our method.

Availability of data and materials

The data presented in the study are included in the article; further inquiries about the dataset can be directed to the corresponding author (XW). Analysis code can be obtained upon request to the first author (XP).

Abbreviations

PRI: Patient Regional Index

RP: Representative Points

MSE: Mean Squared Error

API: Application Programming Interface


Acknowledgements

We thank Qianqian Lu, Beibei Liu and Jiahao Chen for their assistance with data processing. Additionally, we would like to express our sincere gratitude to the reviewers for their meticulous and thoughtful suggestions. Their contributions have greatly aided in enhancing the quality of our manuscript.

This work was supported in part by the Guangdong Provincial Key Laboratory IRADS (No. 2022B1212010006), the Yat-sen Management Research Fund (No. GL2213), and the UIC research grant (No. R202108). None of these funding sources played any role in the design of the study, collection, analysis, and interpretation of data, or in writing the manuscript.

Author information

Authors and Affiliations

Guangdong Provincial Key Laboratory of Interdisciplinary Research and Application for Data Science, BNU-HKBU United International College, Zhuhai, China

Xiaoling Peng, Moyuan Huang, Xinyang Li & Tianyi Zhou

Sun Yat-sen Memorial Hospital, Sun Yat-sen University, No. 107, Yanjiang West Road, Yuexiu District, Guangzhou, China

Guiping Lin & Xiaoguang Wang


Contributions

X.P. conducted the whole work and wrote the manuscript; M.H. did the main computation work; X.L. contributed to the methodology part and wrote the manuscript; T.Z. helped process massive amounts of data; G.L. interpreted the results and provided guidance; X.W. conceived the study and provided the data. All authors reviewed the manuscript and approved the submitted version.

Corresponding author

Correspondence to Xiaoguang Wang .

Ethics declarations

Ethics approval and consent to participate

This is a retrospective study; the exemption from informed consent was approved by the ethics committee board of Sun Yat-sen Memorial Hospital, Sun Yat-sen University. All methods were carried out in accordance with relevant guidelines and regulations. All experimental protocols were approved by the ethics committee board of Sun Yat-sen Memorial Hospital, Sun Yat-sen University.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .


About this article

Cite this article

Peng, X., Huang, M., Li, X. et al. Patient regional index: a new way to rank clinical specialties based on outpatient clinics big data. BMC Med Res Methodol 24 , 192 (2024). https://doi.org/10.1186/s12874-024-02309-z


Received : 25 May 2023

Accepted : 14 August 2024

Published : 31 August 2024

DOI : https://doi.org/10.1186/s12874-024-02309-z


Keywords

  • Specialty influence
  • Outpatient big data
  • Representative points of statistical distributions
  • Two-dimensional assessment model


The Imperial Origins of Big Data

August 28, 2024 | East Asian Studies, Essays, European History, History, Technology

Asheesh Kapur Siddique—

We live in a moment of massive transformation in the nature of information. In 2020, according to one report, users of the Internet created 64.2 zettabytes of data, a quantity greater than the “number of detectable stars in the cosmos,” a colossal increase whose origins can be traced to the emergence of the World Wide Web in 1993. 1 Facilitated by technologies like satellites, smartphones, and artificial intelligence, the scale and speed of data creation seem likely only to balloon over the rest of our lifetimes—and with them, the problem of how to govern ourselves in relation to the inequalities and opportunities that the explosion of data creates.

But while much about our era of big data is indeed revolutionary, the political questions that it raises—How should information be used? Who should control it? And how should it be preserved?—are ones with which societies have long grappled. These questions attained a particular importance in Europe from the eleventh century due to a technological change no less significant than the ones we are witnessing today: the introduction of paper into Europe. Initially invented in China, paper travelled to Europe via the conduit of Islam around the eleventh century after the Moors conquered Spain. Over the twelfth, thirteenth, and fourteenth centuries, paper emerged as the fundamental substrate which politicians, merchants, and scholars relied on to record and circulate information in governance, commerce, and learning. At the same time, governing institutions sought to preserve and control the spread of written information through the creation of archives: repositories where they collected, organized, and stored documents.

The expansion of European polities overseas from the late fifteenth century onward saw governments massively scale up their use of paper—and confront the challenge of controlling its dissemination across thousands of miles of ocean and land. These pressures were felt particularly acutely in what eventually became the largest empire in world history, the British empire. As people from the British Isles fought, traded, and settled their way to power in the Atlantic world and South Asia from the early seventeenth century, administrators faced the problem of how to govern both their emigrating subjects and the non-British peoples with whom they interacted. This meant collecting information about their behavior through the technology of paper. Just as we struggle to organize, search, and control our email boxes, text messages, and app notifications, so too did these early moderns confront the attendant challenges of developing practices of collection and storage to manage the resulting information overload. And despite the best efforts of states and companies to control information, it constantly escaped their grasp, falling into the hands of their opponents and rivals who deployed it to challenge and contest ruling powers.

The history of the early modern information state offers no simple or straightforward answers to the questions that data raises for us today. But it does remind us of a crucial truth, all too readily obscured by the deluge of popular narratives glorifying technological innovation: that questions of data are inherently questions about politics—about who gets to collect, control, and use information, and the ends to which information should be put. We should resist any effort to insulate data governance from democratic processes—and having an informed perspective on the politics of data requires that we attend not just to its present, but also to its past.

As I have written elsewhere, “According to what rules should this information be gathered? How should it be used? Who should have access to it? These questions continue to preoccupy our world, much as they did the strange and remote world of the early modern British Empire. The past is at once more distant and more proximate than we may think.” 2

1. “Breaking Down the Numbers: How Much Data Does the World Create Daily in 2024?,” Edge Delta , March 11, 2024, https://edgedelta.com/company/blog/how-much-data-is-created-per-day

2. Asheesh Kapur Siddique, The Archive of Empire: Knowledge, Conquest, and the Making of the Early Modern British World (Yale University Press, 2024), p. 180.

Asheesh Kapur Siddique  is assistant professor in the Department of History at the University of Massachusetts Amherst. He is a historian of early America, early modern Europe, and the British Empire. He lives in Northampton, MA.

Recent Posts

  • Ep. 138 – Around the World in Public Art
  • A Most Normal Election Cycle
  • Divided Democracy—The Past is (Frighteningly) Never Dead
  • The Petit Network
  • Surely but Slowly: NATO Adapts to Strategic Competition
  • The How and Why of Ukrainian Resilience and Courage

Sign up for updates on new releases and special offers

Newsletter signup, shipping location.

Our website offers shipping to the United States and Canada only. For customers in other countries:

Mexico and South America: Contact W.W. Norton to place your order. All Others: Visit our Yale University Press London website to place your order.

Notice for Canadian Customers

Due to temporary changes in our shipping process, we cannot fulfill orders to Canada through our website from August 12th to September 30th, 2024.

In the meantime, you can find our titles at the following retailers:

  • Barnes & Noble
  • Powell’s
  • Seminary Co-op

We apologize for the inconvenience and appreciate your understanding.

Shipping Updated

Learn more about Schreiben lernen, 2nd Edition, available now. 

U.S. flag

An official website of the United States government

The .gov means it’s official. Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

The site is secure. The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

  • Publications
  • Account settings

Preview improvements coming to the PMC website in October 2024. Learn More or Try it out now .

  • Advanced Search
  • Journal List
International Journal of Nursing Sciences, Vol. 6, No. 2, 10 April 2019

The application of big data and the development of nursing science: A discussion paper

Ruifang Zhu

a School of Nursing, Shanxi Medical University, Taiyuan, China

b School of Humanities and Social Sciences, Shanxi Medical University, Taiyuan, China

Chichen Zhang

c School of Management, Shanxi Medical University, Taiyuan, China

Zhiguang Duan

Abstract

Based on the concept and research status of big data, we analyze and examine the importance of constructing a knowledge system of nursing science for the development of the nursing discipline in the context of big data. We propose that it is necessary to establish big data centers for nursing science in order to share resources, unify language standards, improve professional nursing databases, and establish a knowledge system structure.

1. Introduction

We must cope with all types of data because we live in a data society. We ourselves are a part of these data; whether we want to be a part of big data or not, data will always find us and cover us. Were there no data, modern economic activities, innovations, and economic growth simply would not be possible. Recent technological advances have revolutionized how we collect, store and manage information. The digitization of the world has dramatically increased the amount of data we collect. Big data technology can be used to extract, manage, analyze, and interpret large datasets and transform them into meaningful hypotheses that can be translated into practices. With the continuous acceleration of informatization, the medical and health field has gradually entered the era of big data. Large-scale, multi-channel, and diverse data can provide new methods and ideas for nursing practices and have application value in many areas, including nursing evaluation, improving the level of nursing practice, monitoring disease, nursing research and clinical decision support. How, then, can we find knowledge in massive data, detect the patterns, trends and correlations hidden in big data, reveal the rules of social phenomena and social development, and identify possible business applications? To answer these questions, we need better big data research centers. Therefore, in the era of big data, big data research in the field of nursing has become an important trend in the development of the nursing discipline.

1.1. Big data

The concept of big data can be traced back to the 1970s. Beginning in 2009, “big data” became a buzzword in the Internet information technology industry. In 2010, in “Big Data: A revolution that will transform how we live, work, and think” [1], Kenneth Neil Cukier and Viktor Mayer-Schoenberger noted that the term “big data” refers to the analysis and processing of all the data, not stochastic analysis. The McKinsey Global Institute (MGI) [2] defined big data as large-scale data collections that traditional databases cannot obtain, store, manage and analyze; they are a large amount of data collected from different sources and the analyses thereof, often described using the 10 “Vs”: volume, velocity, variety, veracity, variability, validity, vulnerability, volatility, visualization, and value. Volume refers to the sheer volume of the data; for example, according to IBM's estimate, 2.5 quintillion bytes of data are created every day (one quintillion, i.e., one million raised to the third power (10^18), is one followed by 18 zeroes). Velocity refers to the speed at which big data are generated. Variety refers to the diversity of the data or their sources. Veracity refers to the fact that data collected for one purpose may have missing information or poor data quality when put to secondary use; for example, in the field of nursing, patient data are initially recorded or acquired to provide patient care and can later be used for secondary purposes. Finally, value refers to the new insights derived from big data [3]. Big data represent a change in thinking, including a preference for all data over sampling, for efficiency over absolute accuracy, and for correlation over causality [1]. The essence of big data is an unprecedented change in the fields of thinking, business, and management. The core value of big data lies in the storage and analysis of massive data, while the strategic significance of big data lies in the specialization of data processing, in which the goal of adding value to data is achieved by improving data-processing capability. Compared with traditional databases, big data have the following advantages: the capability of storing a massive amount of data and information, rapid data exchange and sharing, diversified types of data, low value density, etc.

1.2. Big data of nursing science

In 2011, nursing science was listed as a national first-level discipline in China. Nursing science is an applied science with a multidisciplinary theoretical basis that studies nursing theory, knowledge, technology and the patterns of development involved in maintaining, promoting, and restoring human physical and mental health. It is also a practical science and plays an important role in the medical field as an independent discipline. In the era of big data and massive information, people are increasingly concerned about how to apply information technology to work and life, and massive, unstructured data have also been widely used in hospitals, where the most obvious changes are reflected in nursing services.

Big data of nursing refer to the vast amount of data related to care and health, including big data of hospital nursing, big data of regional health service platforms, and big data based on nursing research or disease monitoring in large populations. Compared with other technologies, big data offer the best overall balance of cost, speed, and optimization. Big data could play a huge role in gaining insight into the value of data, preventing the spread of disease, eliminating the waste of medical resources, and avoiding the high cost of medical care, and could become a “superpower” that makes medical care more efficient. Brennan and Bakken argued that nursing requires big data and vice versa. In traditional surveys in nursing, big data were understood in terms of electronic health records, claims data and public health data sheets, whereas large nursing databases concern nursing diagnoses, nursing interventions and nursing outcomes and are extracted from a large number of electronic medical records (EMRs) through a general nursing classification [4].

1.3. Nursing big data center

The term “big data center” refers to a facility that achieves the centralized processing, storage, transmission, exchange, and management of information within a physical space; computers, servers, and network and storage devices are generally considered the key equipment at the core of a data center.

A big data center of nursing is a data center that integrates nursing science, informatics and analysis to identify, define, manage and exchange data, information, knowledge and wisdom, along with a nursing research center that integrates business, learning and research. It is centered on big data mining and inference studies. The presence of a big data center is important because it guarantees an improved ability to interpret big data. For the same data that are collected with the same method, different processing methods, different ideas, different decision-making methods, and different viewpoints can result in vastly different data processing outcomes [ 5 ]. Therefore, the development of the ability and perspective to systematically interpret data is the core of a large data center.

However, as more data are collected, the demand for data analysis increases, and the challenges we face are real. To establish nursing big data centers, we must acquire a considerable number of valuable resources that have been standardized or normalized, and we must make use of analysis tools and model algorithms to obtain more research value from these massive data.

2. Research status

2.1. Research status of big data abroad

With the rapid development of Internet technology and computer technology, a global big data industry is gradually emerging. Many countries, including the United States (US), the United Kingdom, Japan and South Korea, have formulated national big data strategies. The US has even suggested that the strategic position of big data is comparable to that of oil in the industrial era. In the 1970s, the US and Japan developed information systems and successfully applied them to the health care industry. In the 1990s, the American Nurses Association (ANA) and the National Nursing Alliance placed the content of published English-language nursing journals into the CINAHL database. In 2008, the journal “Nature” published an article about patents on big data, and researchers from various disciplines have since realized that huge amounts of data can bring opportunities and challenges to their fields [6]. Since the US Congress passed the Health Information Technology for Economic and Clinical Health (HITECH) Act in 2009 [7], big data in the health sector have gradually increased in significance. In 2011, the journal “Science” published an article on challenges in data processing and indicated that we face tremendous difficulties in processing gigantic amounts of data [8]. In 2011, the Korea Bioinformatics Center planned to develop a national DNA management system that integrates a large amount of DNA data and patient medical information to provide individualized diagnosis and treatment for patients [9]. In 2013 and 2014, the University of Minnesota School of Nursing and its Center for Nursing Informatics hosted conferences on big data and transforming health care, aiming to create a national action plan for sharable and comparable nursing data. As of February 2014, the National Institutes of Health (NIH) had accumulated tens of billions of bytes of human genetic variation data at the Amazon Web Services Center, enabling researchers to access and analyze huge amounts of data [10]. In 2015, at the 3rd International Symposium on Systems Biomedicine, the “European Union Action Plan” on big data was launched by the Council of Europe to implement the strategy of prediction-treatment-care by investigating and analyzing data and to provide a data strategy for big data in personalized medicine in Europe [11]. In April 2015, Joyce Sensmeier, the deputy director of the Information Department of the Healthcare Information and Management Systems Society (HIMSS) in Chicago, published an article in “Nursing Management” presenting her thoughts about how nurses should meet the challenges of the big data era and rationally use their resources. Big data could accelerate the penetration of new Internet-based technologies into a wider range of areas and radiate to all walks of life, and nursing is no exception. In June 2015, at the Nursing Knowledge Conference in Minneapolis, it was announced that an expert group would be formed to represent informatics organizations, professional nursing organizations, electronic medical record software developers, education and research institutes, the federal government, and medical care providers. In July 2015, the National Institute of Nursing Research (NINR) under the NIH held a conference on big data in nursing to promote the development of big data in the nursing field [12]. US nursing databases include the NDNQI and the NMMDS. Currently, the Elsevier Clinical Key for Nursing database is the most used, but unfortunately, there is no Chinese version.

2.2. Research status of big data in China

In May 2013, China joined the International Council of Nurses (ICN), and China's nursing development has since entered the international arena. The development of nursing care has also met new challenges in the development of nursing informatics. In June 2014, at the Symposium on Big Data and Translational Medicine of the International Council of Nurses, it was made clear that the future of nursing will focus on transforming nursing practice, research and education in the context of big data, and on developing and cultivating the ability to obtain and integrate data and information. In July 2015, China's State Council issued the “Guidelines on actively promoting the ‘Internet Plus’ action” and proposed that reshaping the healthcare service model requires nurses to seize opportunities and meet challenges, becoming designers and leaders in “Internet Plus” healthcare or the “Internet Plus” nursing field. On November 18, 2016, China's National Health and Family Planning Commission formulated and issued “The development plan for national nursing work (2016–2020)”, which indicates that the rapid development of information technology has created favorable conditions for nursing development. On August 21–25, 2017, the 5th International Conference of Nursing Informatics was held in Hangzhou, China, with the theme of “Informatics promotes precision nursing: when information helps, nursing soars.” On May 28, 2018, at the 19th General Assembly of Academicians of the Chinese Academy of Sciences (CAS), it was advocated that we should promote the deep integration of the Internet, big data and artificial intelligence with the real economy and develop a bigger and stronger digital economy.

In the field of big data research, the Big Data College was established in China in 2014. Both the Big Data College and the Big Data International Forum were created by the WSS, an internationally renowned training institution. The China Education Big Data Research Institute was jointly established by the China Statistical Information Service Center and Qufu Normal University in 2015. Later, the China Big Data website ( http://www.thebigdata.cn/ ) was launched. In 2014, the University of Electronic Science and Technology of China (Chengdu) established the Big Data Research Center ( http://www.bigdata-research.org/about/ ). The Zhongwei Institute of Nursing Information ( http://www.zwini.org/nurseinfo-web/page/toIndex ) was established in 2015. It is a non-profit research institute under the management of a board of directors; it focuses on issues at the frontier of nursing and nursing management and on improving the effectiveness of nursing management by actively utilizing the power of digitalization and intelligent technology to help the nursing industry solve practical problems in clinical services and management. In October 2016, China launched its first batch of national pilot projects for a medical big data center and industrial park; Fuzhou and Xiamen in Fujian Province, and Nanjing and Changzhou in Jiangsu Province, were chosen as the first batch of pilot cities. In 2017, Guizhou Medical University launched a program in Medical Information Engineering, a new specialty dedicated to the analysis of health-related big data.

China did not engage in information technology construction until the 1990s, and big data in the field of nursing and health developed late compared with other countries, thus lagging behind. In traditional medical research, enumeration data and measurement data are the most common data forms; as numerical structured data, they can be processed using general data analysis techniques or tools. However, in the context of big data, unstructured data, e.g., text, images, videos, e-mails, and questionnaires with open-ended questions, increasingly emerge, and the primary content and inevitable trend of big data research is to understand and investigate these large-scale, multi-channel, and diverse data to obtain valuable information. Hospitals at the county level and above in China have essentially established their hospital information systems, and in 20% of the hospitals at the county level and above, patient-centered, EMR-based integrated management systems covering registration, billing, prescription, and treatment have been established. In 12 provinces and municipalities, including Beijing, Shanghai and Anhui, electronic health archives have also been established. The data resources for medical “big data” include electronic health record (EHR) data from medical services; billing and expense data from hospitals and healthcare; academic, social and government data for medical research; data involving medicines, medical equipment, and the clinical trials of medical manufacturers; behavior and health management data related to residents; the government's population and public health data; and the data generated by networks in China's public social and economic life. All of these types of data constitute the initial data resources for big data in China's health sector [13].

In summary, the research and application of big data in the field of nursing are still relatively lagging, and the involvement of nurses in big data science is mostly limited to entering data into EHRs [5].

The real applications of big data in clinical nursing still need to be improved. Many countries and governments have already attached great importance to the application and development of big data. Therefore, it is necessary for nursing researchers to participate in the construction of a big data information platform and to incorporate nursing elements into a multi-level, cross-organizational information platform.

3. Current issues

Although viewed as beneficial and satisfying, inspiring both innovation and new thinking, big data can also mean big danger. The credibility of big data can be jeopardized by data incompleteness and by dubious standards and processes for storing, acquiring, analyzing, and presenting big data [14]. Furthermore, although the application value of big data has been fully demonstrated in numerous industries, big data still have their own limitations and development constraints.

3.1. Big data indicate relationships, not causality

Big data can reveal “what is it?” but not “why is it?”, show large trends and patterns but not revolutionary innovations, and offer appropriate services but not satisfy new demands, all of which are core issues to be addressed in our research and development on and improvement of big data.

3.2. Big data do not represent the whole and tend to show selection bias

Although colossal in quantity, big data are only sampled data from a time section; they may infinitely approximate the whole but cannot represent the whole [15]. When the amount of data is too large, interfering information or too much noise may be present, making cluster analysis very difficult. Many data are not relevant to what we want to investigate, and it is necessary to delete a large quantity of irrelevant data using statistical methods. Therefore, the most important steps in data mining are cleaning the data and padding missing data, and sometimes it is necessary to calculate summary statistics for each characteristic so that filtering and filling can be performed based on quantiles, means, variances, covariances and correlation coefficients (a minimal sketch of this kind of cleaning follows below).
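As a minimal sketch of the kind of quantile-based filtering and mean imputation described above; the column name, values, and the interquartile-range fence are invented for illustration and are not taken from any nursing dataset:

```python
# Minimal sketch of quantile-based filtering plus mean imputation.
# The column name, values, and the IQR fence are invented for illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame({"systolic_bp": [118, 125, np.nan, 132, 900, 121, np.nan, 140]})
col = "systolic_bp"

# 1. Filter implausible values that fall outside an interquartile-range fence.
q1, q3 = df[col].quantile([0.25, 0.75])
iqr = q3 - q1
keep = df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr) | df[col].isna()
cleaned = df[keep].copy()

# 2. Pad the remaining missing values with the column mean.
cleaned[col] = cleaned[col].fillna(cleaned[col].mean())

print(cleaned)
```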

3.3. Resource data sharing is poor and data security needs improvement

Large hospitals have essentially completed informatization, but the corresponding data resources are still scattered in different data pools that are not connected, so information islands form. As a result, nursing data cannot be shared between hospitals, and relevant data from health systems throughout society cannot be effectively integrated, which in turn affects the formation of nursing resource data. The issue of how to extract the big data of nursing from the immense amount of medical care information resources and make them play an independent role is one of the difficulties in building a big data information platform. In addition, while promoting data transmission and sharing, big data also create a risk that personal privacy will be breached. In the era of big data, it is necessary to engage in data sharing, which is largely limited by strict data protection; only when both the sharing and the protection of big data are guaranteed can the potential value of medical big data be maximally realized [16].

3.4. Standardized nursing terminology is still lacking

Applications of big data are based on resource sharing, and the standardization of terminology is the basis for realizing resource sharing and reaping the benefits and effects of big data. Standardized nursing terminology, a generalized data model and the information structure of EMRs are the basis for integrating nursing data into clinical databases for big data and big data science uses. Sensmeier [17] suggested the use of the Systematized Nomenclature of Medicine-Clinical Terms recommended by the ANA. In the era of big data, we certainly need to adopt international standards to maintain compatibility with the standard terminologies of international institutions. More importantly, however, we need to establish a unified standardized nursing terminology with Chinese characteristics; this has become a necessity for promoting the development of nursing science in China.

3.5. Lack of unified software development

Most of China's nursing information systems have been independently developed by companies. The requirements for Nursing Information Systems (NIS) are specified for software developers by nurses based on their clinical needs, and the developers then engage in systematic research, development and improvement. This kind of hospital information system development model can only meet the needs of a particular hospital and is unsuitable for schools, research institutes, databases, etc., since it has neither a unified standard nor a unified information system; thus, the levels of nursing information software development are uneven and R&D efforts are often redundant, wasting resources while impeding the introduction of NIS. At the same time, the information systems of different hospitals are not unified, which makes it impossible for hospitals to share information resources and is not conducive to the development of information systems. Furthermore, this situation restricts the promotion of big data.

3.6. The shortage of talents in nursing informatics

In other countries such as the US, nursing informatics was recognized in the 1990s and became an independent discipline in the 2000s, with its own teaching and research faculty and qualification credentials. In this regard, the gap between China and other countries is large, and basic education in nursing informatics remains inadequate in China. Most nursing schools have no graduate programs in nursing informatics, and most hospitals have no full-time nursing informatics posts; this situation is bound to hinder the development of nursing informatics. Therefore, at the level of national policy, a nursing informatician qualification and examination should be established to train nursing informatics talent so that the developmental needs of nursing big data can be met.

4. The significance of setting up centers for the big data of nursing

Under normal circumstances, owning big data is in itself meaningless; the real significance of big data is reflected in the specialized processing of data containing a large amount of information [18]. However, to effectively process specialized data on a large scale and within a certain time frame, special technical support such as large-scale data mining technology is needed. In mining big data, we cannot always rely on outside teams; instead, we should have an in-house task team and core technology. The establishment of centers for the big data of nursing enables the reasonable and effective use of big data, which will play a vital role in the development of big data centers in industry, academia and education. It can also provide support for clinical decisions in precision nursing, public health and satisfactory service.

4.1. For forecasting

The core meaning of big data is prediction [19]. First, big data can predict hot trends: literature with different data features is analyzed quantitatively or qualitatively to reveal patterns, trends and hot issues in a specialty. Second, big data can be used to make predictions about diseases; for example, after mining the search terms frequently used by Americans, Google established a mathematical model based on a set of 45 search terms that could accurately predict the flu, and the prediction results had a correlation of 97% with the official data (a toy illustration of this idea follows below). In addition, for genetic data, since some diseases are hereditary, when we have some information about genes, we can predict and prevent certain diseases.
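As a toy illustration of the idea behind search-based forecasting, the sketch below regresses a weekly illness rate on the frequencies of a handful of search terms. The data are synthetic and the model is a plain linear regression; Google's actual 45-term flu model is not reproduced here.

```python
# Toy illustration of search-based forecasting: regress a weekly illness rate
# on the frequencies of a few search terms. The data are synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_weeks, n_terms = 120, 5

search_freqs = rng.random((n_weeks, n_terms))        # weekly term frequencies
true_weights = np.array([0.8, 0.5, 0.3, 0.1, 0.05])  # unknown in practice
flu_rate = search_freqs @ true_weights + rng.normal(0, 0.05, n_weeks)

# Fit on the first 100 weeks, predict the remaining 20.
model = LinearRegression().fit(search_freqs[:100], flu_rate[:100])
predicted = model.predict(search_freqs[100:])

# Correlation between predictions and the held-out observed rates.
print(np.corrcoef(predicted, flu_rate[100:])[0, 1])
```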

4.2. For evaluation

Research papers are the main forms used to present research achievements and activities, and the quantity and quality of such papers are important indicators for evaluating the level and results of the scientific research of research institutes and personnel [20]. Therefore, the evaluation referenced here is mainly performed based on the “literature”, aiming to provide services to “scientists”, as reflected in the assessment of comprehensive nursing capability, nursing teaching, scientific journals, nursing talents, etc.

  • (1) Evaluation of comprehensive nursing capability: Comprehensive nursing capability refers to the ability of nurses to find, analyze, and solve problems. The evaluation of the comprehensive nursing strength of a hospital involves an evaluation of the nurses' mastery and application of knowledge, which reflects the overall level of scientific research of the nursing personnel and department.
  • (2) Evaluation of nursing teaching: During “China's 13th Five-Year Plan” period, the Ministry of Education will build a networked, digitalized, personalized, and lifelong educational system. Using big data and starting by setting nursing teaching objectives, this system can make value judgments and research evaluations of the effect and process of teaching based on the goals of nursing teaching, thus exploring an innovative development route that promotes the fusion of information technology and education.
  • (3) Evaluation of scientific journals: The level of scientific journals must be examined through evaluation. In other words, the evaluation criteria determine the direction and goals of future development of a journal. The application of big data will also provide new technical means and methods of journal evaluation. The research and development of intelligent review assistance systems to assess each of the criteria of the journal, such as impact, innovative ideas, and application value, will greatly enhance the accuracy and comprehensiveness of the evaluation criteria for scientific periodicals.
  • (4) Evaluation of nursing talents: Nursing is a highly specialized discipline that investigates human health, and nursing talents should be the “scientist-type”; however, nursing is also a highly articulate specialty, and nursing talents should also be the “artist-type”. Undoubtedly, whether an evaluation examines Nobel Laureates, Nightingale Awards winners, or academics at the American Academy of Sciences, bibliometric analyses are indispensable.

4.3. For research

In traditional nursing research, investigators test their hypotheses using small samples, which reduces the credibility of the research results to some extent. Against the current background of big data, data acquisition is no longer a problem, so nursing research is no longer limited by sample size, single data types, insufficient funds, etc., and researchers can spend more time designing research plans or conducting in-depth analysis of the results, improving research efficiency while saving time, manpower and financial resources. Nursing research is mainly non-experimental, and descriptive research, case-control research and cohort research are the most commonly selected research types; the large-scale, multi-form and multi-source features of big data resources can well satisfy the data needs of such studies. Taking case-control studies as an example, large sample sizes have become the development trend: nursing investigators can obtain a large number of cases through data platforms and perform retrospective studies using the retrospective data stored in systems such as EHRs to examine the effect of a particular nursing intervention or medicine on the nursing outcome. In addition, big data-oriented cohort studies present a good opportunity for nursing research [21]. The large-scale cohort study has the characteristics of a large sample size; a prospective outlook; a multidisciplinary, multi-pathological, multi-factor, integrative, and sharing approach; etc. The “10 V” features of big data can well satisfy the needs of large-scale cohort studies, and investigators can screen the target population using the database, tracking and following up patient information in the medical information platform for prospective research. They can survey the literature through the big data platform of nursing to obtain data with a high degree of matching, which provides a good opportunity for nursing research. Transforming big data resources into research results and promoting them clinically are the most basic applications of big data in nursing research.

4.4. For education

Big data are important to both educators and learners because big data revolutionize the education policy, research and practice of nursing science. Since the datasets of big data are huge and complex, it is impractical to manage them with traditional software tools, whose technology is more than a decade old. If big data are a noun and the analytics is a verb, the issue of how to extract, validate, transform, and use big data has become a new trend. Analytics can provide numerous methods for nursing education, including improving operations and making economic decisions, to help achieve specific learning goals and to predict behaviors and events by revealing the relationships and patterns between big datasets.

The types of data used in nursing education include data on teaching, learning, and evaluation. Students generate data through e-learning archives, EMRs, and social media. In addition, administrators and school staff generate data through academic reports, class attendance sheets, scholarships, research, and so on. Related personnel, including students, teachers, administrators, doctors, and scholars, can make decisions through the collection, analysis, and use of data. They can collect the data on education and assessment from different systems from the curriculum list of the first year to the clinical skills record of the final year [ 18 ].

4.5. For clinical practice

Big data can reflect the scale and impact of data-related issues in the medical and nursing fields. The application of big data to clinics is mainly reflected in the health guidance of patients through the collection of clinical data. Big data enable every bit of data uploaded to the network in the nursing process to be automatically recorded. Over a nurse's career, the number of patients to whom he or she provides care is limited, but a big data database contains a wide variety of data related to patient records; once a patient is received, information about that patient can be immediately compared with that in the database, and based on the existing data, nurses can provide real-time health interventions for the patient's condition and health guidance on diet, exercise, etc., potentially improving the work efficiency of nurses while achieving the real individualization of health care and promoting the rehabilitation of the patient [22]. In addition, big data enable the health monitoring of specific groups by collecting data on collective signs. In this way, many traditional methods of information collection will be overturned, all kinds of information will be monitored and collected at any time, and patient care can be expanded to before disease onset and after patient discharge, even to the relatives and friends of the patient, making it possible to achieve individual-centered whole-process health care and resource sharing [23]. Through the use of personal digital assistants (PDAs), sensors, wearable medical devices, etc., nurses can perform real-time, continuous monitoring and evaluation of the health of the patient, detect the health problems or risks of a specific patient, and adopt targeted preventive measures accordingly, extending care services into the later ecological service circle. For example, in Europe, elderly people are asked to wear a watch capable of monitoring vital signs; in the event of health problems, the watch sends an automatic alarm, making it possible to save elderly people's lives.

4.6. For decision making

Big data can be used as datasets for medicine, operational logistics, cases, and decision-making systems. On the one hand, based on data applications under big data, nursing decision support systems can improve clinicians' rational decision-making related to patient care. They can help nurses and medical staff make correct judgments, obtain correct information at the right time to support the best clinical decision-making, and provide timely and accurate care for patients, which can significantly strengthen the evidence-based practice of nurses, improve the quality of clinical care and patient outcomes, lower medical costs, and ensure patient safety. On the other hand, big data can support decision making by care managers. Empirical correlation analysis can be conducted on the entire dataset by utilizing the ability of big data to collect, analyze and extract massive amounts of data, overturning the traditional top-down elite decision-making management model so that nursing managers and practitioners make decisions that rely not on experience and brainstorming but on analysis of the entire dataset, gradually transforming from following the rules to following the data.

4.7. For the market

According to incomplete statistics, as of the end of October 2016, 184 enterprises in China's big data industry have obtained financing. The big data industry has become the new favorite of the capital market, and data sources have become the core competitive feature of big data companies. On November 17, 2016, at the 3rd World Internet Conference “Internet Plus Smart Healthcare” Forum hosted by the National Health and Family Planning Commission, the innovative applications of big data, cloud computing, Internet of Things, and information and communication technologies in the field of health care were discussed to promote collaborative innovation in production, education and research. Obviously, based on the sharing and application of massive data, big data centers can be used for the transformation of pharmaceutical research and development of the results of collaboration among production, education and research entities. On November 23, 2016, at the Pharmaceutical Industry and Commerce Strategic Cooperation Forum, it was made clear that with the “Health 2030” proposal, we should “gather the momentum, integrate, and achieve the win-win result”, develop a big health market using the ideas of big data and the Internet, implement the concept of big health, and accurately both tap the needs of users and meet their demands. Therefore, big data centers can be used to analyze the behavior characteristics of users and meet the specific demands of users.

According to the industry direction of “big data, small sensors, huge storage, cloud applications”, in addition to directly providing users with the data they need, big data centers can provide targeted information by analyzing data according to different enterprises and their needs. Furthermore, they can provide learning platforms, training services, consulting services and so on.

5. Conclusion

In short, when the amount of accumulated data is large enough, the information system will transform from one that provides simple data exchange and information transfer to one that provides integrated analysis based on massive data. Big data enable the information system to change from “a tool for people” to one capable of “self-thinking”. Standards are the cornerstone, data are the core, and applications are the key. Through the integrative analysis of massive data, big data can reveal “non-causal relationships”; reasonable analysis and utilization of these big data will change nursing practice, nursing research and nursing education, promoting the advancement of the nursing discipline. The establishment of big data centers to apply the big data of nursing to various aspects such as nursing management, precision care, and patient safety will serve as the connection and center of government, enterprises, universities, research institutes, capital and entrepreneurial businesses to build a large-scale innovation platform for nursing in China in the five major fields of discipline, academia, technology, industry, and manufacturing, forming the “laboratory of nursing science” of the big data industry. Huge “nursing databases” are both inexhaustible assets and an insurmountable barrier for competitors.

Conflicts of interest

All contributing authors declare no conflicts of interest.

This work was supported by the National Natural Science Foundation of China (No. 71573162).

Peer review under responsibility of Chinese Nursing Association.

Appendix A Supplementary data to this article can be found online at https://doi.org/10.1016/j.ijnss.2019.03.001 .



Purdue’s online data science master’s addresses burgeoning demand for trained data scientists

The interdisciplinary degree is accessible for working professionals from both technical and nontechnical backgrounds

[Image: A digital display superimposed on fingers typing on a keyboard, with the words “online master’s in data science.”]

WEST LAFAYETTE, Ind. — Data scientists who can make sense of today’s epic floods of data to generate actionable insights and communicate them to a variety of audiences are in demand in almost any field, from retail business and industry to health care, government, education, and more.

The U.S. Bureau of Labor Statistics estimates that jobs for data scientists will grow 36% by 2031. Nationally, nearly 125,000 data scientist jobs were added from 2013 to 2023, yet many of those jobs, with many more openings coming, went unfilled for lack of trained data scientists. The bottom line: nearly every industry today requires data scientists, and the number of these positions is expected to grow.

Purdue University’s new 100% online Master of Science in data science degree addresses the need and the high demand for a trained data science workforce that can harness the power of data to drive innovation, efficiency and competitiveness. The interdisciplinary master’s program is designed for working professionals with a technical background but includes a pathway to entry for professionals from nontechnical fields.

“This data science master’s program is specifically designed for online delivery and optimal online learning, making it accessible to professionals around the world,” said Dimitrios Peroulis, Purdue senior vice president for partnerships and online. “The interdisciplinary curriculum is diverse, customizable to a student’s needs and tailored for practical application immediately.”

Purdue’s online master’s in data science features core courses covering foundations of data science, machine learning and data mining, big data technologies and tools, data analysis, and data visualization and communication.

Students complete a capstone project that pairs them with an industry mentor and a collaborative team to manage a data science project from inception to completion, including developing project timelines, allocating resources and adapting strategies as the project evolves. The capstone, modeled after curriculum from The Data Mine, Purdue’s award-winning data science learning community, is an opportunity to apply knowledge acquired throughout the master’s program to solve complex, real-world problems.

The online master’s program also features the opportunity to earn industry-aligned certificates along the way to earning a master’s degree. Options include education, leadership, and policy; smart mobility and smart transportation; data science in finance; spatial data science; geospatial information science; managing information technology projects; IT business analysis; and applied statistics.

The program was developed by an interdisciplinary cohort of expert faculty from Purdue’s flagship campus, including the colleges of Agriculture, Education, Engineering, Health and Human Sciences, Liberal Arts, Pharmacy, Science, and Veterinary Medicine, along with the Mitch Daniels School of Business, the Purdue Polytechnic Institute, the Purdue Libraries, and the Office of the Vice Provost for Graduate Students and Postdoctoral Scholars.

“Purdue’s new online MS in data science program leverages the real-world experience of faculty working across several distinct disciplines,” said Timothy Keaton, assistant professor of practice in Purdue’s Department of Statistics, who was involved in developing the new degree. “This cooperation between experts in the application of data science in diverse fields provides a great opportunity to create engaging and meaningful coursework that incorporates many different potential areas of interest for our students.”

Students will develop expertise in programming languages, gaining the ability to design and implement data-driven solutions; learn to apply advanced technologies, including cloud computing and big data frameworks, to effectively handle and process large-scale datasets; gain a deep understanding of machine learning algorithms and models, applying them to real-world scenarios; and become proficient in collecting, cleaning, and analyzing diverse datasets.
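
This paragraph describes skills rather than a specific implementation, so the snippet below is only a hypothetical sketch of the kind of task those skills target: cleaning and aggregating a large dataset with a distributed big data framework. PySpark is assumed here, and the bucket path and column names are invented; none of this is taken from Purdue’s coursework.

```python
# Illustrative only: cleaning and summarizing a large dataset with PySpark.
# The framework choice, file path, and column names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("spend-summary").getOrCreate()

# Read a large CSV dataset; Spark distributes the work across executors.
df = spark.read.csv("s3://example-bucket/transactions.csv",
                    header=True, inferSchema=True)

# Basic cleaning: drop rows missing key fields and remove exact duplicates.
clean = df.dropna(subset=["customer_id", "amount"]).dropDuplicates()

# Aggregate: average and total spend per region, sorted for reporting.
summary = (clean.groupBy("region")
                .agg(F.avg("amount").alias("avg_spend"),
                     F.sum("amount").alias("total_spend"))
                .orderBy(F.desc("total_spend")))

summary.show(10)
spark.stop()
```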

The curriculum also is designed to teach learners data visualization and communication methods for creating compelling visual representations of complex data to effectively convey insights, along with the application of storytelling techniques to communicate findings clearly to both technical and nontechnical audiences. The program covers adherence to ethical standards in data science, privacy, transparency and fairness as well.
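
Likewise, purely as a hypothetical illustration of the visualization-and-communication idea (not course material), a short matplotlib sketch might turn an aggregate like the one above into a single chart with a headline-style takeaway; the data values are invented.

```python
# Illustrative only: a simple chart that pairs an aggregate with a takeaway
# message, the kind of visual-communication task described above.
import matplotlib.pyplot as plt

regions = ["North", "South", "East", "West"]
avg_spend = [112.4, 98.7, 140.2, 87.9]  # invented example values

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(regions, avg_spend, color="steelblue")
ax.set_ylabel("Average spend (USD)")
# A headline-style title states the takeaway instead of just naming the axes.
ax.set_title("East-region customers spend the most per order")
for side in ("top", "right"):
    ax.spines[side].set_visible(False)  # reduce chart clutter
fig.tight_layout()
fig.savefig("regional_spend.png", dpi=150)
```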

The program draws on Purdue’s expertise in myriad aspects of data science. Known for its emphasis on practical programs with proven value, Purdue has been rated among the Top 10 Most Innovative Schools for six years running by U.S. News & World Report and is the No. 8 public university in the U.S. according to the latest QS World University Rankings.

“The breadth and depth of topics that data science encompasses necessitate graduate programs that incorporate expertise from a variety of disciplines and then integrate this into a curriculum to meet the needs of its students,” said John Springer, a Purdue computer and information technology professor who was involved in developing the new degree. “Purdue’s unique approach to the development and delivery of its new online master’s program wholly fulfills these requirements by utilizing a highly interdisciplinary team of Purdue faculty backed by Purdue’s outstanding team of instructional designers.”

For more information about Purdue’s 100% online Master of Science in data science degree, visit the program website.

About Purdue University

Purdue University is a public research institution demonstrating excellence at scale. Ranked among the top 10 public universities in the United States, with two colleges in the top four, Purdue discovers and disseminates knowledge with a quality and at a scale second to none. More than 105,000 students study at Purdue across modalities and locations, including nearly 50,000 in person on the West Lafayette campus. Committed to affordability and accessibility, Purdue’s main campus has frozen tuition for 13 years in a row. See how Purdue never stops in the persistent pursuit of the next giant leap, including its first comprehensive urban campus in Indianapolis, the Mitch Daniels School of Business, Purdue Computes and the One Health initiative, at https://www.purdue.edu/president/strategic-initiatives.

Media contact: Brian Huchel, [email protected]



Identifying Legal, BIM Data and Visualization Requirements to Form Legal Spaces and Developing a Web-Based 3D Cadastre Prototype: A Case Study of Condominium Building


1. Introduction

2. 3D Cadastre-Related Legislation and Project in Türkiye

2.1. Condominium Unit and 3D Cadastre-Related Legislation

2.2. The Production of 3D City Models and Creation of 3D Cadastral Bases Project

3. Legal, BIM Data and Visualization Requirements for 3D Cadastre in Türkiye

3.1. Legal Requirements

  • The inner face of walls should be used when drawing the legal boundaries (spaces) of a condominium unit or an accessory part of a condominium unit.
  • If a space in a condominium building is not designated as a condominium unit or an accessory part, then it should be considered a common space.
  • All the walls outside a condominium unit, all main walls and the walls separating the condominium units should be considered common spaces.
  • Common spaces include, but are not limited to:
    ○ all the structural components (e.g., foundations, main walls, beams, columns, bearing walls forming the load-bearing system, other elements forming part of the load-bearing system, walls separating condominium units, ceilings, floors, roofs, chimneys, common roof terraces, rain gutters and fire escapes);
    ○ joint facilities (e.g., courtyards, entrance doors, entrances, stairs, elevators, landings, corridors, caretaker’s rooms, laundry rooms, laundry drying rooms, coal cellars, common garages);
    ○ installations outside the condominium units (e.g., slots and closed installations for the protection of electricity, water and gas meters; heating rooms, wells, cisterns, common water tanks, shelters, sewers, rubbish chutes; heating, water, gas and electricity installations; common networks and aerials for telephone, radio and television; hot and cold air installations).
  • Spaces (e.g., parking space, cellar) that are indicated in the architectural drawings and/or the condominium deed as accessory parts should be considered accessory parts.
  • All of the above requirements should be supported by architectural drawings, building survey projects, building layout plans and condominium unit plans in order to support as-built BIM-based 3D cadastre.

3.2. BIM Data Requirements

  • All space-bounding physical objects, such as the structural components, including load-bearing walls, floors, and columns, as well as the architectural features, such as the roof, and non-load-bearing walls, should be represented with their true (scaled) dimensions as specified in the architectural drawing.
  • The representation of all walls should include their functions, particularly whether or not they are main walls (e.g., exterior) and/or load-bearing.
  • Space-bounding physical objects should not overlap.
  • Spaces should be delineated by the interior boundary faces of the abovementioned space-bounding physical objects. If the spaces are not related to a physical object, then virtual boundaries should be specified.
  • It may be beneficial to annotate rooms with the type of use indicated by the architectural drawing in order to provide users with further information regarding the 3D spaces.
  • It is recommended that the spaces be grouped according to the condominium units to which they belong.
  • All physical objects and spaces, including those extending from the lowest to the highest points of the building, should be separated and associated with the relevant storeys (levels); a minimal code sketch illustrating this kind of check follows this list.
  • The elevations of the storeys should be stated.
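
As referenced in the list above, the following is a minimal sketch, assuming the open-source IfcOpenShell library and a hypothetical IFC file, of how the storey association stated in these requirements could be inspected programmatically. It simply lists the IfcSpace elements aggregated under each IfcBuildingStorey; it is not part of the authors' workflow, and grouping spaces by condominium unit would require an additional, project-specific property or zone.

```python
# Minimal sketch (assumptions noted in the lead-in): list IfcSpace elements
# per storey and report their names and elevations, as a basic check of the
# storey-association requirement.
import ifcopenshell

model = ifcopenshell.open("condominium.ifc")  # hypothetical file name

for storey in model.by_type("IfcBuildingStorey"):
    print(f"Storey: {storey.Name}  (elevation: {storey.Elevation})")
    # Spaces are usually aggregated under storeys via IfcRelAggregates.
    for rel in storey.IsDecomposedBy:
        for obj in rel.RelatedObjects:
            if obj.is_a("IfcSpace"):
                # LongName often carries the human-readable room/unit label.
                print("  space:", obj.Name, "/", obj.LongName)
```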

3.3. Visualization Requirements

  • the parcel on which the condominium building is built should be identified. Annotations and attributes (e.g., identifier, area and so on) on the parcel should be included;
  • the georeferenced condominium building should be identified. Annotations and attributes (e.g., identifier, area and so on) on the condominium buildings should be included;
  • the private (main part) space of the condominium unit should be identified. This can be carried out by grouping each part (e.g., rooms) of the private space into one space. It should be noted that the rooms have no legal significance [68]. On the other hand, by grouping all rooms into one space, the legal space of the condominium unit can be created;
  • the common spaces should be identified. Annotations and attributes (e.g., identifier, area and so on) on the common spaces should be included. Grouping all common spaces into one may help users to better understand these spaces;
  • the accessory parts should be identified. Annotations and attributes (e.g., identifier, area and so on) on the accessory parts should be included;
  • the condominium unit and the accessory part(s) should be associated (grouped) for a better understanding of the private ownership spaces together;
  • spaces above and below ground should be identified;
  • all spaces in condominium buildings should be topologically consistent;
  • the legal spaces should be visually distinguished from the physical objects in 3D;
  • having the option to visualize both the legal space of condominium buildings and the physical building objects in the same prototype system can further support effective visualization and dissemination (a rough sketch of this idea follows the list).
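
As a rough sketch of the last requirement above, the snippet below writes a tiny CityJSON-style file in which every city object carries an attribute marking it as either a legal space or a physical object, so a web viewer could style or toggle the two groups separately. The attribute name ("featureKind"), the identifiers and the empty geometries are invented for illustration and are not prescribed by the paper or by the CityJSON specification.

```python
# Rough illustration: tag city objects as "legal" or "physical" so that a
# web viewer can style or toggle them separately. Identifiers, the
# "featureKind" attribute and the empty geometries are invented placeholders.
import json

city = {
    "type": "CityJSON",
    "version": "2.0",
    "transform": {"scale": [0.001, 0.001, 0.001], "translate": [0.0, 0.0, 0.0]},
    "CityObjects": {
        "building-1": {
            "type": "Building",
            "attributes": {"featureKind": "physical"},
            "geometry": []  # physical model geometry would go here
        },
        "unit-1-legal-space": {
            "type": "GenericCityObject",
            "attributes": {
                "featureKind": "legal",
                "condominiumUnitNo": "1",
                "parcelId": "example-parcel"
            },
            "geometry": []  # extruded legal-space volume would go here
        }
    },
    "vertices": []
}

with open("cadastre_demo.city.json", "w", encoding="utf-8") as f:
    json.dump(city, f, indent=2)
```

A prototype viewer could then key a layer toggle or a distinct color scheme on this attribute to keep legal spaces visually distinguishable from the physical model.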

4. Case Study to Form Legal Spaces and Develop a Web-Based 3D Cadastre Prototype

4.1. Preparation of the Data—Depicting Physical Objects and Basis of Legal Spaces

4.2. Producing Legal Spaces and Geovisualization

4.3. Web-Based 3D Cadastre Prototype for Geovisualization

5. Discussion

6. Conclusions

Author Contributions

Data Availability Statement

Acknowledgments

Conflicts of Interest

References
  • Enemark, S. The Evolving Role of Cadastral Systems in Support of Good Land Governance. In Proceedings of the FIG Commission 7 Open Symposium on Digital Cadastral Map, Karlovy Vary, Czech Republic, 9 September 2010. [ Google Scholar ]
  • United Nations (UN). The New Urban Agenda (NUA) ; The United Nations Conference on Housing and Sustainable Urban Development (Habitat III) in Quito, Ecuador, on 20 October 2016, endorsed by the United Nations General Assembly at its Sixty-Eighth Plenary Meeting of the Seventy-First Session on 23 December 2016; United Nations: New York, NY, USA, 2016; ISBN 978-92-1-132731-1.
  • FAO; UNECE; FIG. Digital Transformation and Land Administration–Sustainable Practices from the UNECE Region and Beyond ; No. 80; FIG Publication: Rome, Italy, 2022; p. 88. [ Google Scholar ] [ CrossRef ]
  • Kalogianni, E.; Dimopoulou, E.; Lemmen, C.H.J.; van Oosterom, P.J. BIM/IFC files for 3D real property registration: An initial analysis. In Proceedings of the FIG Working Week 2020: Smart Surveyors for Land and Water Management, Amsterdam, The Netherlands, 10–14 May 2020. [ Google Scholar ]
  • ISO 29481-1:2016(en) ; Building Information Models—Information Delivery Manual—Part 1: Methodology and Format. International Organisation for Standardisation: Geneva, Switzerland, 2012. Available online: https://www.iso.org/standard/60553.html (accessed on 10 July 2024).
  • BIM Corner. 9 Reasons Why Norway is THE BEST in BIM! Available online: https://bimcorner.com/9-reasons-why-norway-is-the-best-in-bim/ (accessed on 10 July 2024).
  • NovaTR. Global BIM Adoption Around the World. Available online: https://www.novatr.com/blog/bim-adoption-around-the-world-global-overview (accessed on 10 July 2024).
  • United Nations Committee of Experts on Global Geospatial Information Management (UN-GGIM). Future Trends in Geospatial Information Management: The Five to Ten Year Vision ; United Nations: New York, NY, USA, 2019. Available online: https://ggim.un.org/meetings/GGIM-committee/10th-Session/documents/Future_Trends_Report_THIRD_EDITION_digital_accessible.pdf (accessed on 10 July 2024).
  • 1st International Workshop on “3D Cadastres. Available online: https://repository.tudelft.nl/record/uuid:bff0922a-b952-404b-a0d1-a26100a2fbd7 (accessed on 10 July 2024).
  • Atazadeh, B.; Kalantari, M.; Rajabifard, A.; Ho, S.; Champion, T. Extending a BIM-based data model to support 3D digital management of complex ownership spaces. Geogr. Inf. Syst. 2017 , 31 , 499–522. [ Google Scholar ] [ CrossRef ]
  • Xie, Y.; Atazadeh, B.; Rajabifard, A.; Olfat, H. Automatic Modelling of Property Ownership in BIM. In Proceedings of the 17th 3D GeoInfo Conference, Sydney, Australia, 19–21 October 2022. [ Google Scholar ]
  • New America. 2019. 3D Cadastre and Property Rights. Available online: https://www.newamerica.org/future-land-housing/reports/proprightstech-primers/3d-cadastre-and-property-rights (accessed on 10 July 2024).
  • Open Geospatial Consortium (OGC). Land and Infrastructure Conceptual Model Standard (LandInfra), Version 1.0, Publication Date: 2016-12-20, Edited by Paul Scarponcini, Contributors: Hans Christoph Gruler (Survey), Erik Stubkjær (Land), Peter Axelsson, Lars Wikstrom (Rail). Available online: http://docs.opengeospatial.org/is/15-111r1/15-111r1.html (accessed on 10 July 2024).
  • Thompson, R.J.; Van Oosterom, P.J.M.; Soon, K.H. LandXML encoding of mixed 2D and 3D survey plans with multi-level topology. ISPRS Int. J. Geo-Inf. 2017 , 6 , 171. [ Google Scholar ] [ CrossRef ]
  • Cemellini, B.; van Oosterom, P.; Thompson, R.; de Vries, M. Design, development and usability testing of an LADM compliant 3D Cadastral prototype system. Land Use Policy 2020 , 98 , 104418. [ Google Scholar ] [ CrossRef ]
  • Guo, R.; Li, L.; Ying, S.; Luo, P.; He, B.; Jiang, R. Developing a 3D cadastre for the administration of urban land use: A case study of Shenzhen, China. Comput. Environ. Urban Syst. 2013 , 40 , 46–55. [ Google Scholar ] [ CrossRef ]
  • Stoter, J.; Ploeger, H.; Roes, R.; van der Riet, E.; Biljecki, F.; Ledoux, H. First 3D cadastral registration of multi-level ownerships rights in the Netherlands. In Proceedings of the 5th International FIG 3D Cadastre Workshop, Athens, Greece, 18–20 October 2016. [ Google Scholar ]
  • Stoter, J.; Ploeger, H.; Roes, R.; van der Riet, E.; Biljecki, F.; Ledoux, H.; Kok, D.; Kim, S. Registration of multi-level property rights in 3D in the Netherlands: Two cases and next steps in further implementation. ISPRS Int. J. Geo-Inf. 2017 , 6 , 158. [ Google Scholar ] [ CrossRef ]
  • Góźdź, K.; Pachelski, W.; van Oosterom, P.J.M.; Coors, V. The possibilities of using CityGML for 3D representation of buildings in the cadastre. In Proceedings of the 4th International Workshop on 3D Cadastres, Dubai, United Arab Emirates, 9–11 November 2014. [ Google Scholar ]
  • Ying, S.; Guo, R.; Yang, J.; He, B.; Zhao, Z.; Jin, F. 3D space shift from CityGML LoD3-based multiple building elements to a 3D volumetric object. ISPRS Int. J. Geo-Inf. 2017 , 6 , 17. [ Google Scholar ] [ CrossRef ]
  • Hajji, R.; Yaagoubi, R.; Meliana, I.; Laafou, I.; Gholabzouri, A.E. Development of an integrated BIM-3D GIS approach for 3D cadastre in Morocco. ISPRS Int. J. Geo-Inf. 2021 , 10 , 351. [ Google Scholar ] [ CrossRef ]
  • Sürmeneli, H.G.; Koeva, M.; Alkan, M. The application domain extension (ADE) 4D cadastral data model and its application in Turkey. Land 2022 , 11 , 634. [ Google Scholar ] [ CrossRef ]
  • Nega, A.; Coors, V. The use of CITYGML 3.0 in 3D cadastre system: The case of Addis Ababa City. In Proceedings of the 17th 3D GeoInfo Conference, Sydney, Australia, 19–21 October 2022. [ Google Scholar ]
  • Liamis, T.; Mimis, A. Establishing semantic 3D city models by GRextADE: The case of the Greece. J. Geovisualization Spat. Anal. 2022 , 6 , 15. [ Google Scholar ] [ CrossRef ]
  • El-Mekawy, M.S.A.; Paasch, J.M.; Paulsson, J. Integration of legal aspects in 3D cadastral systems. Int. J. E-Plan. Res. 2015 , 4 , 47–71. [ Google Scholar ] [ CrossRef ]
  • Atazadeh, B.; Kalantari, M.; Rajabifard, A.; Ho, S.; Ngo, T. Building Information Modelling for High-rise Land Administration. Trans. GIS. 2017 , 21 , 91–113. [ Google Scholar ] [ CrossRef ]
  • Oldfield, J.; van Oosterom, P.J.M.; Beetz, J.; Krijnen, T.F. Working with Open BIM Standards to Source Legal Spaces for a 3D Cadastre. ISPRS Int. J. Geo-Inf. 2017 , 6 , 351. [ Google Scholar ] [ CrossRef ]
  • Shojaei, D.; Olfat, H.; Rajabifard, A.; Briffa, M. Design and development of a 3D digital cadastre visualization prototype. ISPRS Int. J. Geo-Inf. 2018 , 7 , 384. [ Google Scholar ]
  • Olfat, H.; Atazadeh, B.; Shojaei, D.; Rajabifard, A. The Feasibility of a BIM-Driven Approach to Support Building Subdivision Workflows—Case Study of Victoria, Australia. ISPRS Int. J. Geo-Inf. 2019 , 8 , 499. [ Google Scholar ] [ CrossRef ]
  • Sun, J.; Mi, S.; Olsson, P.O.; Paulsson, J.; Harrie, L. Utilizing BIM and GIS for Representation and Visualization of 3D Cadastre. ISPRS Int. J. Geo-Inf. 2019 , 8 , 503. [ Google Scholar ] [ CrossRef ]
  • Meulmeester, R.W.E. BIM Legal. Proposal for Defining Legal Spaces for Apartment Rights in the Dutch Cadastre Using the IFC Data Model. Master’s Thesis, Delft University of Technology, Deft, The Netherlands, 2019. [ Google Scholar ]
  • Broekhuizen, M.; Kalogianni, E.; van Oosterom, P. BIM models as input for 3D land administration systems for apartment registration. In Proceedings of the 7th International FIG 3D Cadastre Workshop, New York, NY, USA, 11–13 October 2021. [ Google Scholar ]
  • Alattas, A.; Kalogianni, E.; Alzahrani, T.; Zlatanova, S.; van Oosterom, P.J.M. Mapping private, common, and exclusive common spaces in buildings from BIM/IFC to LADM: A case study from Saudi Arabia. Land Use Policy 2021 , 104 , 105355. [ Google Scholar ] [ CrossRef ]
  • Petronijević, M.; Višnjevac, N.; Praščević, N.; Bajat, B. The extension of IFC for supporting 3D cadastre LADM geometry. ISPRS Int. J. Geo-Inf. 2021 , 10 , 297. [ Google Scholar ] [ CrossRef ]
  • Ying, S.; Xu, Y.; Li, C.; Guo, R.; Li, L. Easement spatialization with two cases based on LADM and BIM. Land Use Policy 2021 , 109 , 105641. [ Google Scholar ] [ CrossRef ]
  • Einali, M.; Alesheikh, A.A.; Atazadeh, B. Developing a building information modelling approach for 3D urban land administration in Iran: A case study in the city of Tehran. Geocarto Int. 2022 , 37 , 12669–12688. [ Google Scholar ] [ CrossRef ]
  • Guler, D.; Van Oosterom, P.J.M.; Yomralioglu, T. How to exploit BIM/IFC for 3D registration of ownership rights in multi-storey buildings: An evidence from Turkey. Geocarto Int. 2022 , 37 , 18418–18447. [ Google Scholar ] [ CrossRef ]
  • Andritsou, D.; Gkeli, M.; Soile, S.; Potsiou, C. A BIM/IFC–LADM solution aligned to the Greek legislation. In Proceedings of the XXIV ISPRS Congress, 2022 Edition, Nice, France, 6–11 June 2022; pp. 471–477. [ Google Scholar ]
  • Liu, C.; Zhu, H.; Li, L.; Ma, J.; Li, F. BIM/IFC-based 3D spatial model for condominium ownership: A case study of China. Geo-Spat. Inf. Sci. 2023 , 1–19. [ Google Scholar ] [ CrossRef ]
  • Shojaei, D.; Rajabifard, A.; Kalantari, M.; Bishop, I.D.; Aien, A. Design and development of a web-based 3D cadastral visualisation prototype. Int. J. Digit. Earth 2015 , 8 , 538–557. [ Google Scholar ] [ CrossRef ]
  • Production of 3D City Models and Creation of 3D Cadastral Bases Project (3 Boyutlu Şehir Modelleri ve Kadastro Projesi). Available online: https://www.tkgm.gov.tr/projeler/3-boyutlu-kent-modelleri-ve-kadastro-projesi (accessed on 10 July 2024).
  • Döner, F.; Şirin, S. 3D digital representation of cadastral data in Turkey—Apartments case. Land 2020 , 9 , 179. [ Google Scholar ] [ CrossRef ]
  • Dursun, İ. 3 Boyutlu Kadastro Veri Modeli Tasarimi, Gayrimenkul Değerleme Entegrasyonu ve Uygulamasi: Amasya Pilot Projesi. 2023. Available online: https://tkgmmakale.com/3-boyutlu-kadastro-veri-modeli-tasarimi-gayrimenkul-degerleme-entegrasyonu-ve-uygulamasi-amasya-pilot-projesi (accessed on 10 July 2024).
  • GDRLC Circular 2021/4 (2021/4 Sayılı Genelgede Değişiklik Hakkında Duyuru). 2014. Available online: https://www.tkgm.gov.tr/sites/default/files/2022-06/2021-4%20Kat%20M%C3%BClkiyeti%20Kat%20%C4%B0rtifak%C4%B1.pdf (accessed on 10 July 2024).
  • Turkish Civil Code (Türk Medeni Kanunu) No: 4721. Official Gazette, 22 October 2001. Available online: https://www.resmigazete.gov.tr/eskiler/2001/12/20011208.htm (accessed on 10 July 2024).
  • Condominium Law (Kat Mülkiyeti Kanunu) (No:634). Official Gazette, 23 June 1965. Available online: https://www.mevzuat.gov.tr/mevzuat?MevzuatNo=634&MevzuatTur=1&MevzuatTertip=5 (accessed on 10 July 2024).
  • Van Der Merwe, C. (Ed.) European Condominium Law ; Cambridge University Press: Cambridge, UK, 2015. [ Google Scholar ] [ CrossRef ]
  • Çağdaş, V.; Paasch, J.M.; Paulsson, J.; Ploeger, H.; Kara, A. Co-ownership shares in condominium—A comparative analysis for selected civil law jurisdictions. Land Use Policy 2020 , 95 , 104668. [ Google Scholar ] [ CrossRef ]
  • The General Directorate of Land Registry and Cadastre (Tapu ve Kadastro Genel Müdürlüğü). Available online: https://www.tkgm.gov.tr/ (accessed on 10 July 2024).
  • GDRLC circular 2017/6 (2017/6 sayılı Mimari Projelerin Elektronik Ortamda Alınması). 2017. Available online: https://tkgm.gov.tr/sites/default/files/2020-11/2017-6_genelge_2.pdf (accessed on 10 July 2024).
  • Planned Areas Zoning Regulation (Planlı Alanlar İmar Yönetmeliği). 2017. Available online: https://www.mevzuat.gov.tr/mevzuat?MevzuatNo=23722&MevzuatTur=7&MevzuatTertip=5 (accessed on 10 July 2024).
  • The Turkish Chamber of Survey and Cadastre Engineers (CSCE) Workshop on Preparation, Implementation and Technical Liability of Building Survey Project–Final Report, (HKMO Yapı Aplikasyon Projesi Yapımı, Uygulaması ve Fenni Mesuliyet Çalıştayı Sonuç Raporu). Updated 2nd edition, November 2023. Available online: https://obs.hkmo.org.tr/s/yapfm2023 (accessed on 10 July 2024).
  • Regulation on Title Plans (Tapu Planları Tüzüğü). 2008. Available online: https://www.mevzuat.gov.tr/MevzuatMetin/2.5.200814001.pdf (accessed on 10 July 2024).
  • GDLRC’s Spatial Property System (TKHM Mekânsal Gayrimenkul Sistemi–MEGSİS). Available online: https://www.tkgm.gov.tr/projeler/mekansal-gayrimenkul-sistemi-megsis (accessed on 10 July 2024).
  • GDLRC’s Spatial Property System (TKGM Mekânsal Gayrimenkul Sistemi–MEGSİS). Available online: https://cbs.tkgm.gov.tr/uygulama.aspx (accessed on 10 July 2024).
  • Dursun, İ.; Aslan, M.; Cankurt, İ.; Yıldırım, C.; Ayyıldız, E. 3D city models as a 3D cadastral layer: The case of TKGM model. In Proceedings of the FIG Congress 2022 Volunteering for the Future-Geospatial Excellence for a Better Living, Warsaw, Poland, 11–15 September 2022; Available online: https://www.fig.net/resources/proceedings/fig_proceedings/fig2022/ppt/ts01e/TS01E_aslan_cankurt_et_al_11367_ppt.pdf (accessed on 10 July 2024).
  • Bayramoğlu, Z.; Dursun, İ.; Aslan, M.; Adli, M.Z. Creation of 3D City Models and Sustainability of the Project: Türkiye Example. In Proceedings of the FIG Working Week 2024 Your World, Our World: Resilient Environment and Sustainable Resource Management for all, Accra, Ghana, 19–24 May 2024. [ Google Scholar ]
  • GDRLC Amasya 3D City Model (TKGM Amasya 3B Kent Modeli). Available online: https://amasya3b.tkgm.gov.tr/#/ (accessed on 10 July 2024).
  • GDRLC 3D Cadastre (TKGM 3B Kadastro). Available online: https://cbs.tkgm.gov.tr/3d/html/ (accessed on 10 July 2024).
  • Open Geospatial Consortium (OGC). OGC City Geography Markup Language (CityGML) Part 1: Conceptual Model Standard. Edited by Thomas H. Kolbe, Tatjana Kutzner, Carl Stephen Smyth, Claus Nagel, Carsten Roensdorf, Charles Heazel. Publication Date: 2021-09-13. Available online: http://www.opengis.net/doc/IS/CityGML-1/3.0 (accessed on 10 July 2024).
  • GDRLC. Administrative Activity Report 2023 (TKGM 2023 Yılı İdare Faaliyet Raporu). Available online: https://www.tkgm.gov.tr/stratejig-db/idare-faaliyet-raporlari (accessed on 10 July 2024).
  • GDRLC. Strategic Plan 2024-2028 (TKGM, 2024-2028 Strateji Planı). Available online: https://www.tkgm.gov.tr/stratejig-db/stratejik-planlar (accessed on 10 July 2024).
  • GDRLC’s Land Registry and Cadastre Technical Works Offering Circular (Tapu ve Kadastro Fen İşleri İzahnamesi. Land Registry and Cadastre Technical Works Offering Circular) ; GDRLC: Ankara, Türkiye, 1948.
  • Shojaei, D.; Kalantari, M.; Bishop, I.D.; Rajabifard, A.; Aien, A. Visualization requirements for 3D cadastral systems. Comput. Environ. Urban Syst. 2013 , 41 , 39–54. [ Google Scholar ] [ CrossRef ]
  • Wang, C. 3D Visualization of Cadastre: Assessing the Suitability of Visual Variables and Enhancement Techniques in the 3D model of Condominium Property Units. Doctoral Dissertation, Université Laval, Québec, QC, Canada, 2015. [ Google Scholar ]
  • Pouliot, J.; Ellul, C.; Hubert, F.; Wang, C.; Rajabifard, A.; Kalantari, M.; Ying, S. 3D cadastres best practices, chapter 5: Visualization and new opportunities. In Proceedings of the FIG Congress 2018, Istanbul, Türkiye, 6–11 May 2018. [ Google Scholar ]
  • Kalogianni, E.; van Oosterom, P.J.M.; Dimopoulou, E.; Lemmen, C. 3D land administration: A review and a future vision in the context of the spatial development lifecycle. ISPRS Int. J. Geo-Inf. 2020 , 9 , 107. [ Google Scholar ] [ CrossRef ]
  • Stoter, J.; Diakité, A.; Reuvers, M.; Smudde, D.; Vos, J.; Roes, R.; van der Vaart, J.; Hakim, A.; El Yamani, S. BIM Legal: Implementation of a standard for Cadastral Registration of Apartment Complexes in 3D. In Proceedings of the 19th 3D GeoInfo Conference 2024, Vigo, Spain, 1–3 July 2024. [ Google Scholar ]
  • Stoter, J.; Ho, S.; Biljecki, F. Considerations for a contemporary 3D cadastre for our times. In Proceedings of the 14th 3D GeoInfo Conference, Singapore, 24–27 September 2019. [ Google Scholar ]
  • Energy Efficiency 2030 Strategy and II. National Energy Efficiency Action Plan 2024–2030 (Enerji Verimliliği 2030 Stratejisi ve II. Ulusal Enerji Verimliliği Eylem Planı 2024–2030). Available online: https://enerji.gov.tr/Media/Dizin/BHIM/tr/Duyurular/T%C3%BCrkiyeninEnerjiVerimlili%C4%9Fi2030StratejisiveIIUlusalEnerjiVerimlili%C4%9FiEylemPlan%C4%B1_202401161407.pdf (accessed on 10 July 2024).
  • Open Geospatial Consortium (OGC). CityJSON Specifications 2.0.1, Living Standard, Edited by Hugo Ledoux; Balázs Dukai. Publication Date: 11 April 2024. Available online: https://www.cityjson.org/specs/2.0.1/ (accessed on 10 July 2024).


GDLRC-CityGML object types (each entry lists the GDLRC object, its CityGML schema, CityGML class and semantics (related class), followed by the definition of the GDLRC-CityGML type):

  • Fotogrametrik Bina (Photogrammetric Building): schema building, class Building, semantics genericAttributes (Building). Buildings identified by the photogrammetry operator using orthophoto maps, the land model, the cadastral parcel/structure, the address register, building services and, if available, architectural drawings associated with the parcel.
  • Mimari Bina (Architectural Building): schema building, class Building, semantics genericAttributes (Building). Represents digital buildings generated from architectural projects; the block number, entrance number, floor number and condominium unit number are taken from the architectural project. The boundaries of the outer walls should be used in the digitization.
  • Bina (Building): schema building, class Building, semantics genericAttributes. No definition is given.
  • Kat (Floor): schema core, class CityObjectGroup, semantics genericAttributes (cityObjectGroup). The type that collects geometric and semantic information about the floors of 3D building models created by digitizing the architectural project. The boundaries of the exterior walls are used when digitizing each floor from the architectural drawings. When drawing the outer boundary of a floor, details such as projections and recesses that do not affect the condominium units and remain below 50 cm are not shown.
  • Bağımsız Bölüm (Condominium Unit): schema generics, class GenericCityObject, semantics genericAttributes (GenericCityObject). The type that collects geometric and semantic information about the condominium units of 3D building models created by digitizing the architectural project. When digitizing condominium units, the boundaries of the exterior walls are used.
  • Bağımsız Bölüm Kısım (Condominium Unit Part): schema building, class Building/interiorRoom, semantics genericAttributes (Room). Used to collect the features of the parts of condominium units, such as rooms, living rooms and bathrooms, in the 3D building models produced by digitizing the architectural project. The interior walls of the rooms are used when digitizing the floor plans. The floor area of each part is taken from the floor plans of the architectural drawings; if the floor area indicated in the architectural drawings and the area measured from the digitized floor plan differ by more than 10%, the measured area is used, stating the reason.
  • Ortak Alan İç Yapı (Common Space Inside Condominium Unit): schema building, class Building/roomInstallation, semantics genericAttributes (intBuildingInstallation). Represents structures that cannot be physically separated from the rooms within the condominium unit, such as columns and stairs.
  • Ortak Alanlar (Common Spaces): schema building, class Building/interiorRoom, semantics genericAttributes (Room). Represents common spaces in buildings, such as car parks, heating centers, electrical centers, cellars, water tanks and shelters. The interior walls of the common spaces are used when digitizing the floor plans. Non-qualified common areas (e.g., stairs, ventilation, main entrance, elevator) that are adjacent to each other are digitized as a single area.
  • Balkon (Balcony) and Teras (Terrace): schema building, class Building/OuterBuildingInstallation, semantics genericAttributes (BuildingInstallation). Shows the floor and the side walls of balconies and terraces. The external wall boundary and the wall thickness of the balcony should be taken into account in the digitization process.
  • Kapı (Door): schema building, class Opening/Door, no generic attributes. Shows openings used by people to enter buildings or rooms.
  • Pencere (Window): schema building, class Opening/Window, no generic attributes. Represents openings that open outwards or between two parts.
  • Çatı (Roof): schema building, class Building/RoofSurface, no generic attributes. Created by drawing the outer boundaries of the roof plan in the architectural drawings; combined with roof types obtained by the photogrammetric method.
  • Bina Duvar (Building Wall): schema building, classes Building/WallSurface, OuterFloorSurface and InteriorWallSurface, no generic attributes. Represents the wall type of the buildings.

Citation: Ilgar, A.; Kara, A.; Çağdaş, V. Identifying Legal, BIM Data and Visualization Requirements to Form Legal Spaces and Developing a Web-Based 3D Cadastre Prototype: A Case Study of Condominium Building. Land 2024, 13, 1380. https://doi.org/10.3390/land13091380


Self-Managing Some Abortion Care Later In Pregnancy Is Safe, Effective And Boosts Access, Researchers Say


Managing the early stages of medication abortion care at home later in pregnancy is safe and cuts down time spent in hospital, according to new research published on Thursday, which researchers say could boost access to the procedure as reproductive rights stand out as a key issue in the 2024 election.

Self-managing some abortion care in later pregnancies is safe, researchers found.

Almost all abortions in the United States are performed at or before 13 weeks of pregnancy—the first trimester—according to the Centers for Disease Control and Prevention, and clinical guidelines typically recommend the procedure be performed exclusively in clinical settings after the 12-week mark so patients can remain under observation.

Medication abortions involve taking two types of pills to end a pregnancy: mifepristone, which blocks a hormone needed for pregnancy to continue, and misoprostol, which makes the womb contract. For abortions after 12 weeks, patients are usually given mifepristone in a clinic and return a day or two later to receive misoprostol, with repeat doses until the procedure is complete, a process that frequently requires an overnight stay in hospital.

Based on a randomized control trial of 435 women having a medical abortion between 12 and 22 weeks published in the Lancet medical journal, researchers from universities and hospitals in Sweden found medical abortions conducted after 12 weeks of pregnancy were as safe and as effective when misoprostol was started at home instead of in hospital and that women managing the early stages of care at home spent less time in hospital.

Of the pregnant people starting misoprostol at home and returning to clinics to receive further doses—several are usually required to complete the procedure—71% spent fewer than nine hours in hospital, compared to 46% of those in the hospital treatment group, the peer reviewed study showed.

There were no differences in the pain reported by patients in either group, the type and number of side effects or rates of hospital admission earlier than what was planned, the researchers said, adding that a follow up survey revealed 78% of the home group said they preferred their allocated treatment option compared to just 49% of the hospital group.

One of the study’s authors, Johanna Rydelius, a gynecologist at Sahlgrenska University Hospital and researcher at the University of Gothenburg, said the findings offer a safe alternative to a practice that often requires overnight stays that many women “find stressful and isolating” and could potentially lead to more “feelings of autonomy during a time where women can feel extremely vulnerable.”

Crucial Quote

“Increasing access to abortion later in pregnancy is a crucial component of the struggle for reproductive autonomy,” said Heidi Moseson and Caitlin Gerdts, researchers at Ibis Reproductive Health, a U.S. non-profit reproductive health research organization, in a linked comment piece published alongside the study in the Lancet. Given the “overwhelming preference for at-home” misoprostol administration among the pregnant people involved in the study, Moseson and Gerdts, who were not involved in the research, said reforming guidance and “moving towards a less clinically supervised model of medical abortion care later in pregnancy is an important first step” for improving access.

Why Do Current Guidelines Recommend Facility-Based Medication Abortions After 12 Weeks?

This is because of the greater risk of complications the procedure carries in later stages of pregnancy that may require additional care. This could include pain, bleeding, infection or an incomplete abortion (material remaining). But despite “profound” implications for access, the requirement to administer medications in clinical settings is “driven primarily by the absence of data on alternative models of care,” Moseson and Gerdts explained. This system “limits the number and type of facilities that can offer abortion care after 12 weeks,” they added, pointing to limited bed spaces and staffing requirements needed to keep people overnight.

Key Background

Since the Supreme Court overturned Roe v. Wade in 2022, many states across the U.S. have harshly restricted access to abortion. The procedure is now banned from conception in 14 states and from six weeks, a point at which many people are not yet aware they are pregnant, in four states, with the future of abortion uncertain in a handful of other states due to legal challenges. While exceptions are made under limited circumstances in states banning abortion, these vary by state, and clinicians and health experts have complained that the vague or inconsistent language setting out exceptions is out of touch with medical reality and unworkable in practice. This fits within broader efforts to roll back reproductive care even further, such as Republican attempts to restrict access to mifepristone. The matter has polarized the country, and both Democrats and Republicans have seized upon reproductive healthcare as a major dividing issue for the upcoming presidential election.

625,978. That’s how many legal induced abortions there were in the U.S. in 2021, according to the CDC, excluding California, Maryland, New Hampshire and New Jersey, which did not submit data to the agency. Almost all of these, 93.5%, happened at or before 13 weeks.


