10 Real World Data Science Case Studies Projects with Example

Top 10 Data Science Case Studies Projects with Examples and Solutions in Python to inspire your data science learning in 2023.

10 Real World Data Science Case Studies Projects with Example

BelData science has been a trending buzzword in recent times. With wide applications in various sectors like healthcare , education, retail, transportation, media, and banking -data science applications are at the core of pretty much every industry out there. The possibilities are endless: analysis of frauds in the finance sector or the personalization of recommendations on eCommerce businesses.  We have developed ten exciting data science case studies to explain how data science is leveraged across various industries to make smarter decisions and develop innovative personalized products tailored to specific customers.

data_science_project

Walmart Sales Forecasting Data Science Project

Downloadable solution code | Explanatory videos | Tech Support

Table of Contents

Data science case studies in retail , data science case study examples in entertainment industry , data analytics case study examples in travel industry , case studies for data analytics in social media , real world data science projects in healthcare, data analytics case studies in oil and gas, what is a case study in data science, how do you prepare a data science case study, 10 most interesting data science case studies with examples.

data science case studies

So, without much ado, let's get started with data science business case studies !

With humble beginnings as a simple discount retailer, today, Walmart operates in 10,500 stores and clubs in 24 countries and eCommerce websites, employing around 2.2 million people around the globe. For the fiscal year ended January 31, 2021, Walmart's total revenue was $559 billion showing a growth of $35 billion with the expansion of the eCommerce sector. Walmart is a data-driven company that works on the principle of 'Everyday low cost' for its consumers. To achieve this goal, they heavily depend on the advances of their data science and analytics department for research and development, also known as Walmart Labs. Walmart is home to the world's largest private cloud, which can manage 2.5 petabytes of data every hour! To analyze this humongous amount of data, Walmart has created 'Data Café,' a state-of-the-art analytics hub located within its Bentonville, Arkansas headquarters. The Walmart Labs team heavily invests in building and managing technologies like cloud, data, DevOps , infrastructure, and security.

ProjectPro Free Projects on Big Data and Data Science

Walmart is experiencing massive digital growth as the world's largest retailer . Walmart has been leveraging Big data and advances in data science to build solutions to enhance, optimize and customize the shopping experience and serve their customers in a better way. At Walmart Labs, data scientists are focused on creating data-driven solutions that power the efficiency and effectiveness of complex supply chain management processes. Here are some of the applications of data science  at Walmart:

i) Personalized Customer Shopping Experience

Walmart analyses customer preferences and shopping patterns to optimize the stocking and displaying of merchandise in their stores. Analysis of Big data also helps them understand new item sales, make decisions on discontinuing products, and the performance of brands.

ii) Order Sourcing and On-Time Delivery Promise

Millions of customers view items on Walmart.com, and Walmart provides each customer a real-time estimated delivery date for the items purchased. Walmart runs a backend algorithm that estimates this based on the distance between the customer and the fulfillment center, inventory levels, and shipping methods available. The supply chain management system determines the optimum fulfillment center based on distance and inventory levels for every order. It also has to decide on the shipping method to minimize transportation costs while meeting the promised delivery date.

Here's what valued users are saying about ProjectPro

user profile

Anand Kumpatla

Sr Data Scientist @ Doubleslash Software Solutions Pvt Ltd

user profile

Gautam Vermani

Data Consultant at Confidential

Not sure what you are looking for?

iii) Packing Optimization 

Also known as Box recommendation is a daily occurrence in the shipping of items in retail and eCommerce business. When items of an order or multiple orders for the same customer are ready for packing, Walmart has developed a recommender system that picks the best-sized box which holds all the ordered items with the least in-box space wastage within a fixed amount of time. This Bin Packing problem is a classic NP-Hard problem familiar to data scientists .

Whenever items of an order or multiple orders placed by the same customer are picked from the shelf and are ready for packing, the box recommendation system determines the best-sized box to hold all the ordered items with a minimum of in-box space wasted. This problem is known as the Bin Packing Problem, another classic NP-Hard problem familiar to data scientists.

Here is a link to a sales prediction data science case study to help you understand the applications of Data Science in the real world. Walmart Sales Forecasting Project uses historical sales data for 45 Walmart stores located in different regions. Each store contains many departments, and you must build a model to project the sales for each department in each store. This data science case study aims to create a predictive model to predict the sales of each product. You can also try your hands-on Inventory Demand Forecasting Data Science Project to develop a machine learning model to forecast inventory demand accurately based on historical sales data.

Get Closer To Your Dream of Becoming a Data Scientist with 70+ Solved End-to-End ML Projects

Amazon is an American multinational technology-based company based in Seattle, USA. It started as an online bookseller, but today it focuses on eCommerce, cloud computing , digital streaming, and artificial intelligence . It hosts an estimate of 1,000,000,000 gigabytes of data across more than 1,400,000 servers. Through its constant innovation in data science and big data Amazon is always ahead in understanding its customers. Here are a few data analytics case study examples at Amazon:

i) Recommendation Systems

Data science models help amazon understand the customers' needs and recommend them to them before the customer searches for a product; this model uses collaborative filtering. Amazon uses 152 million customer purchases data to help users to decide on products to be purchased. The company generates 35% of its annual sales using the Recommendation based systems (RBS) method.

Here is a Recommender System Project to help you build a recommendation system using collaborative filtering. 

ii) Retail Price Optimization

Amazon product prices are optimized based on a predictive model that determines the best price so that the users do not refuse to buy it based on price. The model carefully determines the optimal prices considering the customers' likelihood of purchasing the product and thinks the price will affect the customers' future buying patterns. Price for a product is determined according to your activity on the website, competitors' pricing, product availability, item preferences, order history, expected profit margin, and other factors.

Check Out this Retail Price Optimization Project to build a Dynamic Pricing Model.

iii) Fraud Detection

Being a significant eCommerce business, Amazon remains at high risk of retail fraud. As a preemptive measure, the company collects historical and real-time data for every order. It uses Machine learning algorithms to find transactions with a higher probability of being fraudulent. This proactive measure has helped the company restrict clients with an excessive number of returns of products.

You can look at this Credit Card Fraud Detection Project to implement a fraud detection model to classify fraudulent credit card transactions.

New Projects

Let us explore data analytics case study examples in the entertainment indusry.

Ace Your Next Job Interview with Mock Interviews from Experts to Improve Your Skills and Boost Confidence!

Data Science Interview Preparation

Netflix started as a DVD rental service in 1997 and then has expanded into the streaming business. Headquartered in Los Gatos, California, Netflix is the largest content streaming company in the world. Currently, Netflix has over 208 million paid subscribers worldwide, and with thousands of smart devices which are presently streaming supported, Netflix has around 3 billion hours watched every month. The secret to this massive growth and popularity of Netflix is its advanced use of data analytics and recommendation systems to provide personalized and relevant content recommendations to its users. The data is collected over 100 billion events every day. Here are a few examples of data analysis case studies applied at Netflix :

i) Personalized Recommendation System

Netflix uses over 1300 recommendation clusters based on consumer viewing preferences to provide a personalized experience. Some of the data that Netflix collects from its users include Viewing time, platform searches for keywords, Metadata related to content abandonment, such as content pause time, rewind, rewatched. Using this data, Netflix can predict what a viewer is likely to watch and give a personalized watchlist to a user. Some of the algorithms used by the Netflix recommendation system are Personalized video Ranking, Trending now ranker, and the Continue watching now ranker.

ii) Content Development using Data Analytics

Netflix uses data science to analyze the behavior and patterns of its user to recognize themes and categories that the masses prefer to watch. This data is used to produce shows like The umbrella academy, and Orange Is the New Black, and the Queen's Gambit. These shows seem like a huge risk but are significantly based on data analytics using parameters, which assured Netflix that they would succeed with its audience. Data analytics is helping Netflix come up with content that their viewers want to watch even before they know they want to watch it.

iii) Marketing Analytics for Campaigns

Netflix uses data analytics to find the right time to launch shows and ad campaigns to have maximum impact on the target audience. Marketing analytics helps come up with different trailers and thumbnails for other groups of viewers. For example, the House of Cards Season 5 trailer with a giant American flag was launched during the American presidential elections, as it would resonate well with the audience.

Here is a Customer Segmentation Project using association rule mining to understand the primary grouping of customers based on various parameters.

Get FREE Access to Machine Learning Example Codes for Data Cleaning , Data Munging, and Data Visualization

In a world where Purchasing music is a thing of the past and streaming music is a current trend, Spotify has emerged as one of the most popular streaming platforms. With 320 million monthly users, around 4 billion playlists, and approximately 2 million podcasts, Spotify leads the pack among well-known streaming platforms like Apple Music, Wynk, Songza, amazon music, etc. The success of Spotify has mainly depended on data analytics. By analyzing massive volumes of listener data, Spotify provides real-time and personalized services to its listeners. Most of Spotify's revenue comes from paid premium subscriptions. Here are some of the examples of case study on data analytics used by Spotify to provide enhanced services to its listeners:

i) Personalization of Content using Recommendation Systems

Spotify uses Bart or Bayesian Additive Regression Trees to generate music recommendations to its listeners in real-time. Bart ignores any song a user listens to for less than 30 seconds. The model is retrained every day to provide updated recommendations. A new Patent granted to Spotify for an AI application is used to identify a user's musical tastes based on audio signals, gender, age, accent to make better music recommendations.

Spotify creates daily playlists for its listeners, based on the taste profiles called 'Daily Mixes,' which have songs the user has added to their playlists or created by the artists that the user has included in their playlists. It also includes new artists and songs that the user might be unfamiliar with but might improve the playlist. Similar to it is the weekly 'Release Radar' playlists that have newly released artists' songs that the listener follows or has liked before.

ii) Targetted marketing through Customer Segmentation

With user data for enhancing personalized song recommendations, Spotify uses this massive dataset for targeted ad campaigns and personalized service recommendations for its users. Spotify uses ML models to analyze the listener's behavior and group them based on music preferences, age, gender, ethnicity, etc. These insights help them create ad campaigns for a specific target audience. One of their well-known ad campaigns was the meme-inspired ads for potential target customers, which was a huge success globally.

iii) CNN's for Classification of Songs and Audio Tracks

Spotify builds audio models to evaluate the songs and tracks, which helps develop better playlists and recommendations for its users. These allow Spotify to filter new tracks based on their lyrics and rhythms and recommend them to users like similar tracks ( collaborative filtering). Spotify also uses NLP ( Natural language processing) to scan articles and blogs to analyze the words used to describe songs and artists. These analytical insights can help group and identify similar artists and songs and leverage them to build playlists.

Here is a Music Recommender System Project for you to start learning. We have listed another music recommendations dataset for you to use for your projects: Dataset1 . You can use this dataset of Spotify metadata to classify songs based on artists, mood, liveliness. Plot histograms, heatmaps to get a better understanding of the dataset. Use classification algorithms like logistic regression, SVM, and Principal component analysis to generate valuable insights from the dataset.

Explore Categories

Below you will find case studies for data analytics in the travel and tourism industry.

Airbnb was born in 2007 in San Francisco and has since grown to 4 million Hosts and 5.6 million listings worldwide who have welcomed more than 1 billion guest arrivals in almost every country across the globe. Airbnb is active in every country on the planet except for Iran, Sudan, Syria, and North Korea. That is around 97.95% of the world. Using data as a voice of their customers, Airbnb uses the large volume of customer reviews, host inputs to understand trends across communities, rate user experiences, and uses these analytics to make informed decisions to build a better business model. The data scientists at Airbnb are developing exciting new solutions to boost the business and find the best mapping for its customers and hosts. Airbnb data servers serve approximately 10 million requests a day and process around one million search queries. Data is the voice of customers at AirBnB and offers personalized services by creating a perfect match between the guests and hosts for a supreme customer experience. 

i) Recommendation Systems and Search Ranking Algorithms

Airbnb helps people find 'local experiences' in a place with the help of search algorithms that make searches and listings precise. Airbnb uses a 'listing quality score' to find homes based on the proximity to the searched location and uses previous guest reviews. Airbnb uses deep neural networks to build models that take the guest's earlier stays into account and area information to find a perfect match. The search algorithms are optimized based on guest and host preferences, rankings, pricing, and availability to understand users’ needs and provide the best match possible.

ii) Natural Language Processing for Review Analysis

Airbnb characterizes data as the voice of its customers. The customer and host reviews give a direct insight into the experience. The star ratings alone cannot be an excellent way to understand it quantitatively. Hence Airbnb uses natural language processing to understand reviews and the sentiments behind them. The NLP models are developed using Convolutional neural networks .

Practice this Sentiment Analysis Project for analyzing product reviews to understand the basic concepts of natural language processing.

iii) Smart Pricing using Predictive Analytics

The Airbnb hosts community uses the service as a supplementary income. The vacation homes and guest houses rented to customers provide for rising local community earnings as Airbnb guests stay 2.4 times longer and spend approximately 2.3 times the money compared to a hotel guest. The profits are a significant positive impact on the local neighborhood community. Airbnb uses predictive analytics to predict the prices of the listings and help the hosts set a competitive and optimal price. The overall profitability of the Airbnb host depends on factors like the time invested by the host and responsiveness to changing demands for different seasons. The factors that impact the real-time smart pricing are the location of the listing, proximity to transport options, season, and amenities available in the neighborhood of the listing.

Here is a Price Prediction Project to help you understand the concept of predictive analysis which is widely common in case studies for data analytics. 

Uber is the biggest global taxi service provider. As of December 2018, Uber has 91 million monthly active consumers and 3.8 million drivers. Uber completes 14 million trips each day. Uber uses data analytics and big data-driven technologies to optimize their business processes and provide enhanced customer service. The Data Science team at uber has been exploring futuristic technologies to provide better service constantly. Machine learning and data analytics help Uber make data-driven decisions that enable benefits like ride-sharing, dynamic price surges, better customer support, and demand forecasting. Here are some of the real world data science projects used by uber:

i) Dynamic Pricing for Price Surges and Demand Forecasting

Uber prices change at peak hours based on demand. Uber uses surge pricing to encourage more cab drivers to sign up with the company, to meet the demand from the passengers. When the prices increase, the driver and the passenger are both informed about the surge in price. Uber uses a predictive model for price surging called the 'Geosurge' ( patented). It is based on the demand for the ride and the location.

ii) One-Click Chat

Uber has developed a Machine learning and natural language processing solution called one-click chat or OCC for coordination between drivers and users. This feature anticipates responses for commonly asked questions, making it easy for the drivers to respond to customer messages. Drivers can reply with the clock of just one button. One-Click chat is developed on Uber's machine learning platform Michelangelo to perform NLP on rider chat messages and generate appropriate responses to them.

iii) Customer Retention

Failure to meet the customer demand for cabs could lead to users opting for other services. Uber uses machine learning models to bridge this demand-supply gap. By using prediction models to predict the demand in any location, uber retains its customers. Uber also uses a tier-based reward system, which segments customers into different levels based on usage. The higher level the user achieves, the better are the perks. Uber also provides personalized destination suggestions based on the history of the user and their frequently traveled destinations.

You can take a look at this Python Chatbot Project and build a simple chatbot application to understand better the techniques used for natural language processing. You can also practice the working of a demand forecasting model with this project using time series analysis. You can look at this project which uses time series forecasting and clustering on a dataset containing geospatial data for forecasting customer demand for ola rides.

Explore More  Data Science and Machine Learning Projects for Practice. Fast-Track Your Career Transition with ProjectPro

7) LinkedIn 

LinkedIn is the largest professional social networking site with nearly 800 million members in more than 200 countries worldwide. Almost 40% of the users access LinkedIn daily, clocking around 1 billion interactions per month. The data science team at LinkedIn works with this massive pool of data to generate insights to build strategies, apply algorithms and statistical inferences to optimize engineering solutions, and help the company achieve its goals. Here are some of the real world data science projects at LinkedIn:

i) LinkedIn Recruiter Implement Search Algorithms and Recommendation Systems

LinkedIn Recruiter helps recruiters build and manage a talent pool to optimize the chances of hiring candidates successfully. This sophisticated product works on search and recommendation engines. The LinkedIn recruiter handles complex queries and filters on a constantly growing large dataset. The results delivered have to be relevant and specific. The initial search model was based on linear regression but was eventually upgraded to Gradient Boosted decision trees to include non-linear correlations in the dataset. In addition to these models, the LinkedIn recruiter also uses the Generalized Linear Mix model to improve the results of prediction problems to give personalized results.

ii) Recommendation Systems Personalized for News Feed

The LinkedIn news feed is the heart and soul of the professional community. A member's newsfeed is a place to discover conversations among connections, career news, posts, suggestions, photos, and videos. Every time a member visits LinkedIn, machine learning algorithms identify the best exchanges to be displayed on the feed by sorting through posts and ranking the most relevant results on top. The algorithms help LinkedIn understand member preferences and help provide personalized news feeds. The algorithms used include logistic regression, gradient boosted decision trees and neural networks for recommendation systems.

iii) CNN's to Detect Inappropriate Content

To provide a professional space where people can trust and express themselves professionally in a safe community has been a critical goal at LinkedIn. LinkedIn has heavily invested in building solutions to detect fake accounts and abusive behavior on their platform. Any form of spam, harassment, inappropriate content is immediately flagged and taken down. These can range from profanity to advertisements for illegal services. LinkedIn uses a Convolutional neural networks based machine learning model. This classifier trains on a training dataset containing accounts labeled as either "inappropriate" or "appropriate." The inappropriate list consists of accounts having content from "blocklisted" phrases or words and a small portion of manually reviewed accounts reported by the user community.

Here is a Text Classification Project to help you understand NLP basics for text classification. You can find a news recommendation system dataset to help you build a personalized news recommender system. You can also use this dataset to build a classifier using logistic regression, Naive Bayes, or Neural networks to classify toxic comments.

Get confident to build end-to-end projects

Access to a curated library of 250+ end-to-end industry projects with solution code, videos and tech support.

Pfizer is a multinational pharmaceutical company headquartered in New York, USA. One of the largest pharmaceutical companies globally known for developing a wide range of medicines and vaccines in disciplines like immunology, oncology, cardiology, and neurology. Pfizer became a household name in 2010 when it was the first to have a COVID-19 vaccine with FDA. In early November 2021, The CDC has approved the Pfizer vaccine for kids aged 5 to 11. Pfizer has been using machine learning and artificial intelligence to develop drugs and streamline trials, which played a massive role in developing and deploying the COVID-19 vaccine. Here are a few data analytics case studies by Pfizer :

i) Identifying Patients for Clinical Trials

Artificial intelligence and machine learning are used to streamline and optimize clinical trials to increase their efficiency. Natural language processing and exploratory data analysis of patient records can help identify suitable patients for clinical trials. These can help identify patients with distinct symptoms. These can help examine interactions of potential trial members' specific biomarkers, predict drug interactions and side effects which can help avoid complications. Pfizer's AI implementation helped rapidly identify signals within the noise of millions of data points across their 44,000-candidate COVID-19 clinical trial.

ii) Supply Chain and Manufacturing

Data science and machine learning techniques help pharmaceutical companies better forecast demand for vaccines and drugs and distribute them efficiently. Machine learning models can help identify efficient supply systems by automating and optimizing the production steps. These will help supply drugs customized to small pools of patients in specific gene pools. Pfizer uses Machine learning to predict the maintenance cost of equipment used. Predictive maintenance using AI is the next big step for Pharmaceutical companies to reduce costs.

iii) Drug Development

Computer simulations of proteins, and tests of their interactions, and yield analysis help researchers develop and test drugs more efficiently. In 2016 Watson Health and Pfizer announced a collaboration to utilize IBM Watson for Drug Discovery to help accelerate Pfizer's research in immuno-oncology, an approach to cancer treatment that uses the body's immune system to help fight cancer. Deep learning models have been used recently for bioactivity and synthesis prediction for drugs and vaccines in addition to molecular design. Deep learning has been a revolutionary technique for drug discovery as it factors everything from new applications of medications to possible toxic reactions which can save millions in drug trials.

You can create a Machine learning model to predict molecular activity to help design medicine using this dataset . You may build a CNN or a Deep neural network for this data analyst case study project.

Access Data Science and Machine Learning Project Code Examples

9) Shell Data Analyst Case Study Project

Shell is a global group of energy and petrochemical companies with over 80,000 employees in around 70 countries. Shell uses advanced technologies and innovations to help build a sustainable energy future. Shell is going through a significant transition as the world needs more and cleaner energy solutions to be a clean energy company by 2050. It requires substantial changes in the way in which energy is used. Digital technologies, including AI and Machine Learning, play an essential role in this transformation. These include efficient exploration and energy production, more reliable manufacturing, more nimble trading, and a personalized customer experience. Using AI in various phases of the organization will help achieve this goal and stay competitive in the market. Here are a few data analytics case studies in the petrochemical industry:

i) Precision Drilling

Shell is involved in the processing mining oil and gas supply, ranging from mining hydrocarbons to refining the fuel to retailing them to customers. Recently Shell has included reinforcement learning to control the drilling equipment used in mining. Reinforcement learning works on a reward-based system based on the outcome of the AI model. The algorithm is designed to guide the drills as they move through the surface, based on the historical data from drilling records. It includes information such as the size of drill bits, temperatures, pressures, and knowledge of the seismic activity. This model helps the human operator understand the environment better, leading to better and faster results will minor damage to machinery used. 

ii) Efficient Charging Terminals

Due to climate changes, governments have encouraged people to switch to electric vehicles to reduce carbon dioxide emissions. However, the lack of public charging terminals has deterred people from switching to electric cars. Shell uses AI to monitor and predict the demand for terminals to provide efficient supply. Multiple vehicles charging from a single terminal may create a considerable grid load, and predictions on demand can help make this process more efficient.

iii) Monitoring Service and Charging Stations

Another Shell initiative trialed in Thailand and Singapore is the use of computer vision cameras, which can think and understand to watch out for potentially hazardous activities like lighting cigarettes in the vicinity of the pumps while refueling. The model is built to process the content of the captured images and label and classify it. The algorithm can then alert the staff and hence reduce the risk of fires. You can further train the model to detect rash driving or thefts in the future.

Here is a project to help you understand multiclass image classification. You can use the Hourly Energy Consumption Dataset to build an energy consumption prediction model. You can use time series with XGBoost to develop your model.

10) Zomato Case Study on Data Analytics

Zomato was founded in 2010 and is currently one of the most well-known food tech companies. Zomato offers services like restaurant discovery, home delivery, online table reservation, online payments for dining, etc. Zomato partners with restaurants to provide tools to acquire more customers while also providing delivery services and easy procurement of ingredients and kitchen supplies. Currently, Zomato has over 2 lakh restaurant partners and around 1 lakh delivery partners. Zomato has closed over ten crore delivery orders as of date. Zomato uses ML and AI to boost their business growth, with the massive amount of data collected over the years from food orders and user consumption patterns. Here are a few examples of data analyst case study project developed by the data scientists at Zomato:

i) Personalized Recommendation System for Homepage

Zomato uses data analytics to create personalized homepages for its users. Zomato uses data science to provide order personalization, like giving recommendations to the customers for specific cuisines, locations, prices, brands, etc. Restaurant recommendations are made based on a customer's past purchases, browsing history, and what other similar customers in the vicinity are ordering. This personalized recommendation system has led to a 15% improvement in order conversions and click-through rates for Zomato. 

You can use the Restaurant Recommendation Dataset to build a restaurant recommendation system to predict what restaurants customers are most likely to order from, given the customer location, restaurant information, and customer order history.

ii) Analyzing Customer Sentiment

Zomato uses Natural language processing and Machine learning to understand customer sentiments using social media posts and customer reviews. These help the company gauge the inclination of its customer base towards the brand. Deep learning models analyze the sentiments of various brand mentions on social networking sites like Twitter, Instagram, Linked In, and Facebook. These analytics give insights to the company, which helps build the brand and understand the target audience.

iii) Predicting Food Preparation Time (FPT)

Food delivery time is an essential variable in the estimated delivery time of the order placed by the customer using Zomato. The food preparation time depends on numerous factors like the number of dishes ordered, time of the day, footfall in the restaurant, day of the week, etc. Accurate prediction of the food preparation time can help make a better prediction of the Estimated delivery time, which will help delivery partners less likely to breach it. Zomato uses a Bidirectional LSTM-based deep learning model that considers all these features and provides food preparation time for each order in real-time. 

Data scientists are companies' secret weapons when analyzing customer sentiments and behavior and leveraging it to drive conversion, loyalty, and profits. These 10 data science case studies projects with examples and solutions show you how various organizations use data science technologies to succeed and be at the top of their field! To summarize, Data Science has not only accelerated the performance of companies but has also made it possible to manage & sustain their performance with ease.

FAQs on Data Analysis Case Studies

A case study in data science is an in-depth analysis of a real-world problem using data-driven approaches. It involves collecting, cleaning, and analyzing data to extract insights and solve challenges, offering practical insights into how data science techniques can address complex issues across various industries.

To create a data science case study, identify a relevant problem, define objectives, and gather suitable data. Clean and preprocess data, perform exploratory data analysis, and apply appropriate algorithms for analysis. Summarize findings, visualize results, and provide actionable recommendations, showcasing the problem-solving potential of data science techniques.

Access Solved Big Data and Data Science Projects

About the Author

author profile

ProjectPro is the only online platform designed to help professionals gain practical, hands-on experience in big data, data engineering, data science, and machine learning related technologies. Having over 270+ reusable project templates in data science and big data with step-by-step walkthroughs,

arrow link

© 2024

© 2024 Iconiq Inc.

Privacy policy

User policy

Write for ProjectPro

logo

FOR EMPLOYERS

Top 10 real-world data science case studies.

Data Science Case Studies

Aditya Sharma

Aditya is a content writer with 5+ years of experience writing for various industries including Marketing, SaaS, B2B, IT, and Edtech among others. You can find him watching anime or playing games when he’s not writing.

Frequently Asked Questions

Real-world data science case studies differ significantly from academic examples. While academic exercises often feature clean, well-structured data and simplified scenarios, real-world projects tackle messy, diverse data sources with practical constraints and genuine business objectives. These case studies reflect the complexities data scientists face when translating data into actionable insights in the corporate world.

Real-world data science projects come with common challenges. Data quality issues, including missing or inaccurate data, can hinder analysis. Domain expertise gaps may result in misinterpretation of results. Resource constraints might limit project scope or access to necessary tools and talent. Ethical considerations, like privacy and bias, demand careful handling.

Lastly, as data and business needs evolve, data science projects must adapt and stay relevant, posing an ongoing challenge.

Real-world data science case studies play a crucial role in helping companies make informed decisions. By analyzing their own data, businesses gain valuable insights into customer behavior, market trends, and operational efficiencies.

These insights empower data-driven strategies, aiding in more effective resource allocation, product development, and marketing efforts. Ultimately, case studies bridge the gap between data science and business decision-making, enhancing a company's ability to thrive in a competitive landscape.

Key takeaways from these case studies for organizations include the importance of cultivating a data-driven culture that values evidence-based decision-making. Investing in robust data infrastructure is essential to support data initiatives. Collaborating closely between data scientists and domain experts ensures that insights align with business goals.

Finally, continuous monitoring and refinement of data solutions are critical for maintaining relevance and effectiveness in a dynamic business environment. Embracing these principles can lead to tangible benefits and sustainable success in real-world data science endeavors.

Data science is a powerful driver of innovation and problem-solving across diverse industries. By harnessing data, organizations can uncover hidden patterns, automate repetitive tasks, optimize operations, and make informed decisions.

In healthcare, for example, data-driven diagnostics and treatment plans improve patient outcomes. In finance, predictive analytics enhances risk management. In transportation, route optimization reduces costs and emissions. Data science empowers industries to innovate and solve complex challenges in ways that were previously unimaginable.

Hire remote developers

Tell us the skills you need and we'll find the best developer for you in days, not weeks.

banner-in1

  • Data Science

12 Data Science Case Studies: Across Various Industries

Home Blog Data Science 12 Data Science Case Studies: Across Various Industries

Play icon

Data science has become popular in the last few years due to its successful application in making business decisions. Data scientists have been using data science techniques to solve challenging real-world issues in healthcare, agriculture, manufacturing, automotive, and many more. For this purpose, a data enthusiast needs to stay updated with the latest technological advancements in AI. An excellent way to achieve this is through reading industry data science case studies. I recommend checking out Data Science With Python course syllabus to start your data science journey.   In this discussion, I will present some case studies to you that contain detailed and systematic data analysis of people, objects, or entities focusing on multiple factors present in the dataset. Almost every industry uses data science in some way. You can learn more about data science fundamentals in this Data Science course content .

Let’s look at the top data science case studies in this article so you can understand how businesses from many sectors have benefitted from data science to boost productivity, revenues, and more.

case study for data science project

List of Data Science Case Studies 2024

  • Hospitality:  Airbnb focuses on growth by  analyzing  customer voice using data science.  Qantas uses predictive analytics to mitigate losses
  • Healthcare:  Novo Nordisk  is  Driving innovation with NLP.  AstraZeneca harnesses data for innovation in medicine  
  • Covid 19:  Johnson and Johnson use s  d ata science  to fight the Pandemic  
  • E-commerce:  Amazon uses data science to personalize shop p ing experiences and improve customer satisfaction  
  • Supply chain management:  UPS optimizes supp l y chain with big data analytics
  • Meteorology:  IMD leveraged data science to achieve a rec o rd 1.2m evacuation before cyclone ''Fani''  
  • Entertainment Industry:  Netflix  u ses data science to personalize the content and improve recommendations.  Spotify uses big   data to deliver a rich user experience for online music streaming  
  • Banking and Finance:  HDFC utilizes Big  D ata Analytics to increase income and enhance  the  banking experience
  • Urban Planning and Smart Cities:  Traffic management in smart cities such as Pune and Bhubaneswar
  • Agricultural Yield Prediction:  Farmers Edge in Canada uses Data science to help farmers improve their produce
  • Transportation Industry:  Uber optimizes their ride-sharing feature and track the delivery routes through data analysis
  • Environmental Industry:  NASA utilizes Data science to predict potential natural disasters, World Wildlife analyzes deforestation to protect the environment

Top 12 Data Science Case Studies

1. data science in hospitality industry.

In the hospitality sector, data analytics assists hotels in better pricing strategies, customer analysis, brand marketing, tracking market trends, and many more.

Airbnb focuses on growth by analyzing customer voice using data science.  A famous example in this sector is the unicorn '' Airbnb '', a startup that focussed on data science early to grow and adapt to the market faster. This company witnessed a 43000 percent hypergrowth in as little as five years using data science. They included data science techniques to process the data, translate this data for better understanding the voice of the customer, and use the insights for decision making. They also scaled the approach to cover all aspects of the organization. Airbnb uses statistics to analyze and aggregate individual experiences to establish trends throughout the community. These analyzed trends using data science techniques impact their business choices while helping them grow further.  

Travel industry and data science

Predictive analytics benefits many parameters in the travel industry. These companies can use recommendation engines with data science to achieve higher personalization and improved user interactions. They can study and cross-sell products by recommending relevant products to drive sales and increase revenue. Data science is also employed in analyzing social media posts for sentiment analysis, bringing invaluable travel-related insights. Whether these views are positive, negative, or neutral can help these agencies understand the user demographics, the expected experiences by their target audiences, and so on. These insights are essential for developing aggressive pricing strategies to draw customers and provide better customization to customers in the travel packages and allied services. Travel agencies like Expedia and Booking.com use predictive analytics to create personalized recommendations, product development, and effective marketing of their products. Not just travel agencies but airlines also benefit from the same approach. Airlines frequently face losses due to flight cancellations, disruptions, and delays. Data science helps them identify patterns and predict possible bottlenecks, thereby effectively mitigating the losses and improving the overall customer traveling experience.  

How Qantas uses predictive analytics to mitigate losses  

Qantas , one of Australia's largest airlines, leverages data science to reduce losses caused due to flight delays, disruptions, and cancellations. They also use it to provide a better traveling experience for their customers by reducing the number and length of delays caused due to huge air traffic, weather conditions, or difficulties arising in operations. Back in 2016, when heavy storms badly struck Australia's east coast, only 15 out of 436 Qantas flights were cancelled due to their predictive analytics-based system against their competitor Virgin Australia, which witnessed 70 cancelled flights out of 320.  

2. Data Science in Healthcare

The  Healthcare sector  is immensely benefiting from the advancements in AI. Data science, especially in medical imaging, has been helping healthcare professionals come up with better diagnoses and effective treatments for patients. Similarly, several advanced healthcare analytics tools have been developed to generate clinical insights for improving patient care. These tools also assist in defining personalized medications for patients reducing operating costs for clinics and hospitals. Apart from medical imaging or computer vision,  Natural Language Processing (NLP)  is frequently used in the healthcare domain to study the published textual research data.     

A. Pharmaceutical

Driving innovation with NLP: Novo Nordisk.  Novo Nordisk  uses the Linguamatics NLP platform from internal and external data sources for text mining purposes that include scientific abstracts, patents, grants, news, tech transfer offices from universities worldwide, and more. These NLP queries run across sources for the key therapeutic areas of interest to the Novo Nordisk R&D community. Several NLP algorithms have been developed for the topics of safety, efficacy, randomized controlled trials, patient populations, dosing, and devices. Novo Nordisk employs a data pipeline to capitalize the tools' success on real-world data and uses interactive dashboards and cloud services to visualize this standardized structured information from the queries for exploring commercial effectiveness, market situations, potential, and gaps in the product documentation. Through data science, they are able to automate the process of generating insights, save time and provide better insights for evidence-based decision making.  

How AstraZeneca harnesses data for innovation in medicine.  AstraZeneca  is a globally known biotech company that leverages data using AI technology to discover and deliver newer effective medicines faster. Within their R&D teams, they are using AI to decode the big data to understand better diseases like cancer, respiratory disease, and heart, kidney, and metabolic diseases to be effectively treated. Using data science, they can identify new targets for innovative medications. In 2021, they selected the first two AI-generated drug targets collaborating with BenevolentAI in Chronic Kidney Disease and Idiopathic Pulmonary Fibrosis.   

Data science is also helping AstraZeneca redesign better clinical trials, achieve personalized medication strategies, and innovate the process of developing new medicines. Their Center for Genomics Research uses  data science and AI  to analyze around two million genomes by 2026. Apart from this, they are training their AI systems to check these images for disease and biomarkers for effective medicines for imaging purposes. This approach helps them analyze samples accurately and more effortlessly. Moreover, it can cut the analysis time by around 30%.   

AstraZeneca also utilizes AI and machine learning to optimize the process at different stages and minimize the overall time for the clinical trials by analyzing the clinical trial data. Summing up, they use data science to design smarter clinical trials, develop innovative medicines, improve drug development and patient care strategies, and many more.

C. Wearable Technology  

Wearable technology is a multi-billion-dollar industry. With an increasing awareness about fitness and nutrition, more individuals now prefer using fitness wearables to track their routines and lifestyle choices.  

Fitness wearables are convenient to use, assist users in tracking their health, and encourage them to lead a healthier lifestyle. The medical devices in this domain are beneficial since they help monitor the patient's condition and communicate in an emergency situation. The regularly used fitness trackers and smartwatches from renowned companies like Garmin, Apple, FitBit, etc., continuously collect physiological data of the individuals wearing them. These wearable providers offer user-friendly dashboards to their customers for analyzing and tracking progress in their fitness journey.

3. Covid 19 and Data Science

In the past two years of the Pandemic, the power of data science has been more evident than ever. Different  pharmaceutical companies  across the globe could synthesize Covid 19 vaccines by analyzing the data to understand the trends and patterns of the outbreak. Data science made it possible to track the virus in real-time, predict patterns, devise effective strategies to fight the Pandemic, and many more.  

How Johnson and Johnson uses data science to fight the Pandemic   

The  data science team  at  Johnson and Johnson  leverages real-time data to track the spread of the virus. They built a global surveillance dashboard (granulated to county level) that helps them track the Pandemic's progress, predict potential hotspots of the virus, and narrow down the likely place where they should test its investigational COVID-19 vaccine candidate. The team works with in-country experts to determine whether official numbers are accurate and find the most valid information about case numbers, hospitalizations, mortality and testing rates, social compliance, and local policies to populate this dashboard. The team also studies the data to build models that help the company identify groups of individuals at risk of getting affected by the virus and explore effective treatments to improve patient outcomes.

4. Data Science in E-commerce  

In the  e-commerce sector , big data analytics can assist in customer analysis, reduce operational costs, forecast trends for better sales, provide personalized shopping experiences to customers, and many more.  

Amazon uses data science to personalize shopping experiences and improve customer satisfaction.  Amazon  is a globally leading eCommerce platform that offers a wide range of online shopping services. Due to this, Amazon generates a massive amount of data that can be leveraged to understand consumer behavior and generate insights on competitors' strategies. Data science case studies reveal how Amazon uses its data to provide recommendations to its users on different products and services. With this approach, Amazon is able to persuade its consumers into buying and making additional sales. This approach works well for Amazon as it earns 35% of the revenue yearly with this technique. Additionally, Amazon collects consumer data for faster order tracking and better deliveries.     

Similarly, Amazon's virtual assistant, Alexa, can converse in different languages; uses speakers and a   camera to interact with the users. Amazon utilizes the audio commands from users to improve Alexa and deliver a better user experience. 

5. Data Science in Supply Chain Management

Predictive analytics and big data are driving innovation in the Supply chain domain. They offer greater visibility into the company operations, reduce costs and overheads, forecasting demands, predictive maintenance, product pricing, minimize supply chain interruptions, route optimization, fleet management, drive better performance, and more.     

Optimizing supply chain with big data analytics: UPS

UPS  is a renowned package delivery and supply chain management company. With thousands of packages being delivered every day, on average, a UPS driver makes about 100 deliveries each business day. On-time and safe package delivery are crucial to UPS's success. Hence, UPS offers an optimized navigation tool ''ORION'' (On-Road Integrated Optimization and Navigation), which uses highly advanced big data processing algorithms. This tool for UPS drivers provides route optimization concerning fuel, distance, and time. UPS utilizes supply chain data analysis in all aspects of its shipping process. Data about packages and deliveries are captured through radars and sensors. The deliveries and routes are optimized using big data systems. Overall, this approach has helped UPS save 1.6 million gallons of gasoline in transportation every year, significantly reducing delivery costs.    

6. Data Science in Meteorology

Weather prediction is an interesting  application of data science . Businesses like aviation, agriculture and farming, construction, consumer goods, sporting events, and many more are dependent on climatic conditions. The success of these businesses is closely tied to the weather, as decisions are made after considering the weather predictions from the meteorological department.   

Besides, weather forecasts are extremely helpful for individuals to manage their allergic conditions. One crucial application of weather forecasting is natural disaster prediction and risk management.  

Weather forecasts begin with a large amount of data collection related to the current environmental conditions (wind speed, temperature, humidity, clouds captured at a specific location and time) using sensors on IoT (Internet of Things) devices and satellite imagery. This gathered data is then analyzed using the understanding of atmospheric processes, and machine learning models are built to make predictions on upcoming weather conditions like rainfall or snow prediction. Although data science cannot help avoid natural calamities like floods, hurricanes, or forest fires. Tracking these natural phenomena well ahead of their arrival is beneficial. Such predictions allow governments sufficient time to take necessary steps and measures to ensure the safety of the population.  

IMD leveraged data science to achieve a record 1.2m evacuation before cyclone ''Fani''   

Most  d ata scientist’s responsibilities  rely on satellite images to make short-term forecasts, decide whether a forecast is correct, and validate models. Machine Learning is also used for pattern matching in this case. It can forecast future weather conditions if it recognizes a past pattern. When employing dependable equipment, sensor data is helpful to produce local forecasts about actual weather models. IMD used satellite pictures to study the low-pressure zones forming off the Odisha coast (India). In April 2019, thirteen days before cyclone ''Fani'' reached the area,  IMD  (India Meteorological Department) warned that a massive storm was underway, and the authorities began preparing for safety measures.  

It was one of the most powerful cyclones to strike India in the recent 20 years, and a record 1.2 million people were evacuated in less than 48 hours, thanks to the power of data science.   

7. Data Science in the Entertainment Industry

Due to the Pandemic, demand for OTT (Over-the-top) media platforms has grown significantly. People prefer watching movies and web series or listening to the music of their choice at leisure in the convenience of their homes. This sudden growth in demand has given rise to stiff competition. Every platform now uses data analytics in different capacities to provide better-personalized recommendations to its subscribers and improve user experience.   

How Netflix uses data science to personalize the content and improve recommendations  

Netflix  is an extremely popular internet television platform with streamable content offered in several languages and caters to various audiences. In 2006, when Netflix entered this media streaming market, they were interested in increasing the efficiency of their existing ''Cinematch'' platform by 10% and hence, offered a prize of $1 million to the winning team. This approach was successful as they found a solution developed by the BellKor team at the end of the competition that increased prediction accuracy by 10.06%. Over 200 work hours and an ensemble of 107 algorithms provided this result. These winning algorithms are now a part of the Netflix recommendation system.  

Netflix also employs Ranking Algorithms to generate personalized recommendations of movies and TV Shows appealing to its users.   

Spotify uses big data to deliver a rich user experience for online music streaming  

Personalized online music streaming is another area where data science is being used.  Spotify  is a well-known on-demand music service provider launched in 2008, which effectively leveraged big data to create personalized experiences for each user. It is a huge platform with more than 24 million subscribers and hosts a database of nearly 20million songs; they use the big data to offer a rich experience to its users. Spotify uses this big data and various algorithms to train machine learning models to provide personalized content. Spotify offers a "Discover Weekly" feature that generates a personalized playlist of fresh unheard songs matching the user's taste every week. Using the Spotify "Wrapped" feature, users get an overview of their most favorite or frequently listened songs during the entire year in December. Spotify also leverages the data to run targeted ads to grow its business. Thus, Spotify utilizes the user data, which is big data and some external data, to deliver a high-quality user experience.  

8. Data Science in Banking and Finance

Data science is extremely valuable in the Banking and  Finance industry . Several high priority aspects of Banking and Finance like credit risk modeling (possibility of repayment of a loan), fraud detection (detection of malicious or irregularities in transactional patterns using machine learning), identifying customer lifetime value (prediction of bank performance based on existing and potential customers), customer segmentation (customer profiling based on behavior and characteristics for personalization of offers and services). Finally, data science is also used in real-time predictive analytics (computational techniques to predict future events).    

How HDFC utilizes Big Data Analytics to increase revenues and enhance the banking experience    

One of the major private banks in India,  HDFC Bank , was an early adopter of AI. It started with Big Data analytics in 2004, intending to grow its revenue and understand its customers and markets better than its competitors. Back then, they were trendsetters by setting up an enterprise data warehouse in the bank to be able to track the differentiation to be given to customers based on their relationship value with HDFC Bank. Data science and analytics have been crucial in helping HDFC bank segregate its customers and offer customized personal or commercial banking services. The analytics engine and SaaS use have been assisting the HDFC bank in cross-selling relevant offers to its customers. Apart from the regular fraud prevention, it assists in keeping track of customer credit histories and has also been the reason for the speedy loan approvals offered by the bank.  

9. Data Science in Urban Planning and Smart Cities  

Data Science can help the dream of smart cities come true! Everything, from traffic flow to energy usage, can get optimized using data science techniques. You can use the data fetched from multiple sources to understand trends and plan urban living in a sorted manner.  

The significant data science case study is traffic management in Pune city. The city controls and modifies its traffic signals dynamically, tracking the traffic flow. Real-time data gets fetched from the signals through cameras or sensors installed. Based on this information, they do the traffic management. With this proactive approach, the traffic and congestion situation in the city gets managed, and the traffic flow becomes sorted. A similar case study is from Bhubaneswar, where the municipality has platforms for the people to give suggestions and actively participate in decision-making. The government goes through all the inputs provided before making any decisions, making rules or arranging things that their residents actually need.  

10. Data Science in Agricultural Prediction   

Have you ever wondered how helpful it can be if you can predict your agricultural yield? That is exactly what data science is helping farmers with. They can get information about the number of crops they can produce in a given area based on different environmental factors and soil types. Using this information, the farmers can make informed decisions about their yield and benefit the buyers and themselves in multiple ways.  

Data Science in Agricultural Yield Prediction

Farmers across the globe and overseas use various data science techniques to understand multiple aspects of their farms and crops. A famous example of data science in the agricultural industry is the work done by Farmers Edge. It is a company in Canada that takes real-time images of farms across the globe and combines them with related data. The farmers use this data to make decisions relevant to their yield and improve their produce. Similarly, farmers in countries like Ireland use satellite-based information to ditch traditional methods and multiply their yield strategically.  

11. Data Science in the Transportation Industry   

Transportation keeps the world moving around. People and goods commute from one place to another for various purposes, and it is fair to say that the world will come to a standstill without efficient transportation. That is why it is crucial to keep the transportation industry in the most smoothly working pattern, and data science helps a lot in this. In the realm of technological progress, various devices such as traffic sensors, monitoring display systems, mobility management devices, and numerous others have emerged.  

Many cities have already adapted to the multi-modal transportation system. They use GPS trackers, geo-locations and CCTV cameras to monitor and manage their transportation system. Uber is the perfect case study to understand the use of data science in the transportation industry. They optimize their ride-sharing feature and track the delivery routes through data analysis. Their data science case studies approach enabled them to serve more than 100 million users, making transportation easy and convenient. Moreover, they also use the data they fetch from users daily to offer cost-effective and quickly available rides.  

12. Data Science in the Environmental Industry    

Increasing pollution, global warming, climate changes and other poor environmental impacts have forced the world to pay attention to environmental industry. Multiple initiatives are being taken across the globe to preserve the environment and make the world a better place. Though the industry recognition and the efforts are in the initial stages, the impact is significant, and the growth is fast.  

The popular use of data science in the environmental industry is by NASA and other research organizations worldwide. NASA gets data related to the current climate conditions, and this data gets used to create remedial policies that can make a difference. Another way in which data science is actually helping researchers is they can predict natural disasters well before time and save or at least reduce the potential damage considerably. A similar case study is with the World Wildlife Fund. They use data science to track data related to deforestation and help reduce the illegal cutting of trees. Hence, it helps preserve the environment.  

Where to Find Full Data Science Case Studies?  

Data science is a highly evolving domain with many practical applications and a huge open community. Hence, the best way to keep updated with the latest trends in this domain is by reading case studies and technical articles. Usually, companies share their success stories of how data science helped them achieve their goals to showcase their potential and benefit the greater good. Such case studies are available online on the respective company websites and dedicated technology forums like Towards Data Science or Medium.  

Additionally, we can get some practical examples in recently published research papers and textbooks in data science.  

What Are the Skills Required for Data Scientists?  

Data scientists play an important role in the data science process as they are the ones who work on the data end to end. To be able to work on a data science case study, there are several skills required for data scientists like a good grasp of the fundamentals of data science, deep knowledge of statistics, excellent programming skills in Python or R, exposure to data manipulation and data analysis, ability to generate creative and compelling data visualizations, good knowledge of big data, machine learning and deep learning concepts for model building & deployment. Apart from these technical skills, data scientists also need to be good storytellers and should have an analytical mind with strong communication skills.    

Opt for the best business analyst training  elevating your expertise. Take the leap towards becoming a distinguished business analysis professional

Conclusion  

These were some interesting  data science case studies  across different industries. There are many more domains where data science has exciting applications, like in the Education domain, where data can be utilized to monitor student and instructor performance, develop an innovative curriculum that is in sync with the industry expectations, etc.   

Almost all the companies looking to leverage the power of big data begin with a SWOT analysis to narrow down the problems they intend to solve with data science. Further, they need to assess their competitors to develop relevant data science tools and strategies to address the challenging issue.  Thus, the utility of data science in several sectors is clearly visible, a lot is left to be explored, and more is yet to come. Nonetheless, data science will continue to boost the performance of organizations in this age of big data.  

Frequently Asked Questions (FAQs)

A case study in data science requires a systematic and organized approach for solving the problem. Generally, four main steps are needed to tackle every data science case study: 

  • Defining the problem statement and strategy to solve it  
  • Gather and pre-process the data by making relevant assumptions  
  • Select tool and appropriate algorithms to build machine learning /deep learning models 
  • Make predictions, accept the solutions based on evaluation metrics, and improve the model if necessary. 

Getting data for a case study starts with a reasonable understanding of the problem. This gives us clarity about what we expect the dataset to include. Finding relevant data for a case study requires some effort. Although it is possible to collect relevant data using traditional techniques like surveys and questionnaires, we can also find good quality data sets online on different platforms like Kaggle, UCI Machine Learning repository, Azure open data sets, Government open datasets, Google Public Datasets, Data World and so on.  

Data science projects involve multiple steps to process the data and bring valuable insights. A data science project includes different steps - defining the problem statement, gathering relevant data required to solve the problem, data pre-processing, data exploration & data analysis, algorithm selection, model building, model prediction, model optimization, and communicating the results through dashboards and reports.  

Profile

Devashree Madhugiri

Devashree holds an M.Eng degree in Information Technology from Germany and a background in Data Science. She likes working with statistics and discovering hidden insights in varied datasets to create stunning dashboards. She enjoys sharing her knowledge in AI by writing technical articles on various technological platforms. She loves traveling, reading fiction, solving Sudoku puzzles, and participating in coding competitions in her leisure time.

Something went wrong

Upcoming Data Science Batches & Dates

NameDateFeeKnow more

Course advisor icon

  • Digital Marketing
  • Facebook Marketing
  • Instagram Marketing
  • Ecommerce Marketing
  • Content Marketing
  • Data Science Certification
  • Machine Learning
  • Artificial Intelligence
  • Data Analytics
  • Graphic Design
  • Adobe Illustrator
  • Web Designing
  • UX UI Design
  • Interior Design
  • Front End Development
  • Back End Development Courses
  • Business Analytics
  • Entrepreneurship
  • Supply Chain
  • Financial Modeling
  • Corporate Finance
  • Project Finance
  • Harvard University
  • Stanford University
  • Yale University
  • Princeton University
  • Duke University
  • UC Berkeley
  • Harvard University Executive Programs
  • MIT Executive Programs
  • Stanford University Executive Programs
  • Oxford University Executive Programs
  • Cambridge University Executive Programs
  • Yale University Executive Programs
  • Kellog Executive Programs
  • CMU Executive Programs
  • 45000+ Free Courses
  • Free Certification Courses
  • Free DigitalDefynd Certificate
  • Free Harvard University Courses
  • Free MIT Courses
  • Free Excel Courses
  • Free Google Courses
  • Free Finance Courses
  • Free Coding Courses
  • Free Digital Marketing Courses

Top 25 Data Science Case Studies [2024]

In an era where data is the new gold, harnessing its power through data science has led to groundbreaking advancements across industries. From personalized marketing to predictive maintenance, the applications of data science are not only diverse but transformative. This compilation of the top 25 data science case studies showcases the profound impact of intelligent data utilization in solving real-world problems. These examples span various sectors, including healthcare, finance, transportation, and manufacturing, illustrating how data-driven decisions shape business operations’ future, enhance efficiency, and optimize user experiences. As we delve into these case studies, we witness the incredible potential of data science to innovate and drive success in today’s data-centric world.

Related: Interesting Data Science Facts

Top 25 Data Science Case Studies [2024]

Case study 1 – personalized marketing (amazon).

Challenge:  Amazon aimed to enhance user engagement by tailoring product recommendations to individual preferences, requiring the real-time processing of vast data volumes.

Solution:  Amazon implemented a sophisticated machine learning algorithm known as collaborative filtering, which analyzes users’ purchase history, cart contents, product ratings, and browsing history, along with the behavior of similar users. This approach enables Amazon to offer highly personalized product suggestions.

Overall Impact:

  • Increased Customer Satisfaction:  Tailored recommendations improved the shopping experience.
  • Higher Sales Conversions:  Relevant product suggestions boosted sales.

Key Takeaways:

  • Personalized Marketing Significantly Enhances User Engagement:  Demonstrating how tailored interactions can deepen user involvement and satisfaction.
  • Effective Use of Big Data and Machine Learning Can Transform Customer Experiences:  These technologies redefine the consumer landscape by continuously adapting recommendations to changing user preferences and behaviors.

This strategy has proven pivotal in increasing Amazon’s customer loyalty and sales by making the shopping experience more relevant and engaging.

Case Study 2 – Real-Time Pricing Strategy (Uber)

Challenge:  Uber needed to adjust its pricing dynamically to reflect real-time demand and supply variations across different locations and times, aiming to optimize driver incentives and customer satisfaction without manual intervention.

Solution:  Uber introduced a dynamic pricing model called “surge pricing.” This system uses data science to automatically calculate fares in real time based on current demand and supply data. The model incorporates traffic conditions, weather forecasts, and local events to adjust prices appropriately.

  • Optimized Ride Availability:  The model reduced customer wait times by incentivizing more drivers to be available during high-demand periods.
  • Increased Driver Earnings:  Drivers benefitted from higher earnings during surge periods, aligning their incentives with customer demand.
  • Efficient Balance of Supply and Demand:  Dynamic pricing matches ride availability with customer needs.
  • Importance of Real-Time Data Processing:  The real-time processing of data is crucial for responsive and adaptive service delivery.

Uber’s implementation of surge pricing illustrates the power of using real-time data analytics to create a flexible and responsive pricing system that benefits both consumers and service providers, enhancing overall service efficiency and satisfaction.

Case Study 3 – Fraud Detection in Banking (JPMorgan Chase)

Challenge:  JPMorgan Chase faced the critical need to enhance its fraud detection capabilities to safeguard the institution and its customers from financial losses. The primary challenge was detecting fraudulent transactions swiftly and accurately in a vast stream of legitimate banking activities.

Solution:  The bank implemented advanced machine learning models that analyze real-time transaction patterns and customer behaviors. These models are continuously trained on vast amounts of historical fraud data, enabling them to identify and flag transactions that significantly deviate from established patterns, which may indicate potential fraud.

  • Substantial Reduction in Fraudulent Transactions:  The advanced detection capabilities led to a marked decrease in fraud occurrences.
  • Enhanced Security for Customer Accounts:  Customers experienced greater security and trust in their transactions.
  • Effectiveness of Machine Learning in Fraud Detection:  Machine learning models are greatly effective at identifying fraud activities within large datasets.
  • Importance of Ongoing Training and Updates:  Continuous training and updating of models are crucial to adapt to evolving fraudulent techniques and maintain detection efficacy.

JPMorgan Chase’s use of machine learning for fraud detection demonstrates how financial institutions can leverage advanced analytics to enhance security measures, protect financial assets, and build customer trust in their banking services.

Case Study 4 – Optimizing Healthcare Outcomes (Mayo Clinic)

Challenge:  The Mayo Clinic aimed to enhance patient outcomes by predicting diseases before they reach critical stages. This involved analyzing large volumes of diverse data, including historical patient records and real-time health metrics from various sources like lab results and patient monitors.

Solution:  The Mayo Clinic employed predictive analytics to integrate and analyze this data to build models that predict patient risk for diseases such as diabetes and heart disease, enabling earlier and more targeted interventions.

  • Improved Patient Outcomes:  Early identification of at-risk patients allowed for timely medical intervention.
  • Reduction in Healthcare Costs:  Preventing disease progression reduces the need for more extensive and costly treatments later.
  • Early Identification of Health Risks:  Predictive models are essential for identifying at-risk patients early, improving the chances of successful interventions.
  • Integration of Multiple Data Sources:  Combining historical and real-time data provides a comprehensive view that enhances the accuracy of predictions.

Case Study 5 – Streamlining Operations in Manufacturing (General Electric)

Challenge:  General Electric needed to optimize its manufacturing processes to reduce costs and downtime by predicting when machines would likely require maintenance to prevent breakdowns.

Solution:  GE leveraged data from sensors embedded in machinery to monitor their condition continuously. Data science algorithms analyze this sensor data to predict when a machine is likely to disappoint, facilitating preemptive maintenance and scheduling.

  • Reduction in Unplanned Machine Downtime:  Predictive maintenance helped avoid unexpected breakdowns.
  • Lower Maintenance Costs and Improved Machine Lifespan:  Regular maintenance based on predictive data reduced overall costs and extended the life of machinery.
  • Predictive Maintenance Enhances Operational Efficiency:  Using data-driven predictions for maintenance can significantly reduce downtime and operational costs.
  • Value of Sensor Data:  Continuous monitoring and data analysis are crucial for forecasting equipment health and preventing failures.

Related: Data Engineering vs. Data Science

Case Study 6 – Enhancing Supply Chain Management (DHL)

Challenge:  DHL sought to optimize its global logistics and supply chain operations to decreases expenses and enhance delivery efficiency. It required handling complex data from various sources for better route planning and inventory management.

Solution:  DHL implemented advanced analytics to process and analyze data from its extensive logistics network. This included real-time tracking of shipments, analysis of weather conditions, traffic patterns, and inventory levels to optimize route planning and warehouse operations.

  • Enhanced Efficiency in Logistics Operations:  More precise route planning and inventory management improved delivery times and reduced resource wastage.
  • Reduced Operational Costs:  Streamlined operations led to significant cost savings across the supply chain.
  • Critical Role of Comprehensive Data Analysis:  Effective supply chain management depends on integrating and analyzing data from multiple sources.
  • Benefits of Real-Time Data Integration:  Real-time data enhances logistical decision-making, leading to more efficient and cost-effective operations.

Case Study 7 – Predictive Maintenance in Aerospace (Airbus)

Challenge:  Airbus faced the challenge of predicting potential failures in aircraft components to enhance safety and reduce maintenance costs. The key was to accurately forecast the lifespan of parts under varying conditions and usage patterns, which is critical in the aerospace industry where safety is paramount.

Solution:  Airbus tackled this challenge by developing predictive models that utilize data collected from sensors installed on aircraft. These sensors continuously monitor the condition of various components, providing real-time data that the models analyze. The predictive algorithms assess the likelihood of component failure, enabling maintenance teams to schedule repairs or replacements proactively before actual failures occur.

  • Increased Safety:  The ability to predict and prevent potential in-flight failures has significantly improved the safety of Airbus aircraft.
  • Reduced Costs:  By optimizing maintenance schedules and minimizing unnecessary checks, Airbus has been able to cut down on maintenance expenses and reduce aircraft downtime.
  • Enhanced Safety through Predictive Analytics:  The use of predictive analytics in monitoring aircraft components plays a crucial role in preventing failures, thereby enhancing the overall safety of aviation operations.
  • Valuable Insights from Sensor Data:  Real-time data from operational use is critical for developing effective predictive maintenance strategies. This data provides insights for understanding component behavior under various conditions, allowing for more accurate predictions.

This case study demonstrates how Airbus leverages advanced data science techniques in predictive maintenance to ensure higher safety standards and more efficient operations, setting an industry benchmark in the aerospace sector.

Case Study 8 – Enhancing Film Recommendations (Netflix)

Challenge:  Netflix aimed to improve customer retention and engagement by enhancing the accuracy of its recommendation system. This task involved processing and analyzing vast amounts of data to understand diverse user preferences and viewing habits.

Solution:  Netflix employed collaborative filtering techniques, analyzing user behaviors (like watching, liking, or disliking content) and similarities between content items. This data-driven approach allows Netflix to refine and personalize recommendations continuously based on real-time user interactions.

  • Increased Viewer Engagement:  Personalized recommendations led to longer viewing sessions.
  • Higher Customer Satisfaction and Retention Rates:  Tailored viewing experiences improved overall customer satisfaction, enhancing loyalty.
  • Tailoring User Experiences:  Machine learning is pivotal in personalizing media content, significantly impacting viewer engagement and satisfaction.
  • Importance of Continuous Updates:  Regularly updating recommendation algorithms is essential to maintain relevance and effectiveness in user engagement.

Case Study 9 – Traffic Flow Optimization (Google)

Challenge:  Google needed to optimize traffic flow within its Google Maps service to reduce congestion and improve routing decisions. This required real-time analysis of extensive traffic data to predict and manage traffic conditions accurately.

Solution:  Google Maps integrates data from multiple sources, including satellite imagery, sensor data, and real-time user location data. These data points are used to model traffic patterns and predict future conditions dynamically, which informs updated routing advice.

  • Reduced Traffic Congestion:  More efficient routing reduced overall traffic buildup.
  • Enhanced Accuracy of Traffic Predictions and Routing:  Improved predictions led to better user navigation experiences.
  • Integration of Multiple Data Sources:  Combining various data streams enhances the accuracy of traffic management systems.
  • Advanced Modeling Techniques:  Sophisticated models are crucial for accurately predicting traffic patterns and optimizing routes.

Case Study 10 – Risk Assessment in Insurance (Allstate)

Challenge:  Allstate sought to refine its risk assessment processes to offer more accurately priced insurance products, challenging the limitations of traditional actuarial models through more nuanced data interpretations.

Solution:  Allstate enhanced its risk assessment framework by integrating machine learning, allowing for granular risk factor analysis. This approach utilizes individual customer data such as driving records, home location specifics, and historical claim data to tailor insurance offerings more accurately.

  • More Precise Risk Assessment:  Improved risk evaluation led to more tailored insurance offerings.
  • Increased Market Competitiveness:  Enhanced pricing accuracy boosted Allstate’s competitive edge in the insurance market.
  • Nuanced Understanding of Risk:  Machine learning provides a deeper, more nuanced understanding of risk than traditional models, leading to better risk pricing.
  • Personalized Pricing Strategies:  Leveraging detailed customer data in pricing strategies enhances customer satisfaction and business performance.

Related: Can you move from Cybersecurity to Data Science?

Case Study 11 – Energy Consumption Reduction (Google DeepMind)

Challenge:  Google DeepMind aimed to significantly reduce the high energy consumption required for cooling Google’s data centers, which are crucial for maintaining server performance but also represent a major operational cost.

Solution:  DeepMind implemented advanced AI algorithms to optimize the data center cooling systems. These algorithms predict temperature fluctuations and adjust cooling processes accordingly, saving energy and reducing equipment wear and tear.

  • Reduction in Energy Consumption:  Achieved a 40% reduction in energy used for cooling.
  • Decrease in Operational Costs and Environmental Impact:  Lower energy usage resulted in cost savings and reduced environmental footprint.
  • AI-Driven Optimization:  AI can significantly decrease energy usage in large-scale infrastructure.
  • Operational Efficiency Gains:  Efficiency improvements in operational processes lead to cost savings and environmental benefits.

Case Study 12 – Improving Public Safety (New York City Police Department)

Challenge:  The NYPD needed to enhance its crime prevention strategies by better predicting where and when crimes were most likely to occur, requiring sophisticated analysis of historical crime data and environmental factors.

Solution:  The NYPD implemented a predictive policing system that utilizes data analytics to identify potential crime hotspots based on trends and patterns in past crime data. Officers are preemptively dispatched to these areas to deter criminal activities.

  • Reduction in Crime Rates:  There is a notable decrease in crime in areas targeted by predictive policing.
  • More Efficient Use of Police Resources:  Enhanced allocation of resources where needed.
  • Effectiveness of Data-Driven Crime Prevention:  Targeting resources based on data analytics can significantly reduce crime.
  • Proactive Law Enforcement:  Predictive analytics enable a shift from reactive to proactive law enforcement strategies.

Case Study 13 – Enhancing Agricultural Yields (John Deere)

Challenge:  John Deere aimed to help farmers increase agricultural productivity and sustainability by optimizing various farming operations from planting to harvesting.

Solution:  Utilizing data from sensors on equipment and satellite imagery, John Deere developed algorithms that provide actionable insights for farmers on optimal planting times, water usage, and harvest schedules.

  • Increased Crop Yields:  More efficient farming methods led to higher yields.
  • Enhanced Sustainability of Farming Practices:  Improved resource management contributed to more sustainable agriculture.
  • Precision Agriculture:  Significantly improves productivity and resource efficiency.
  • Data-Driven Decision-Making:  Enables better farming decisions through timely and accurate data.

Case Study 14 – Streamlining Drug Discovery (Pfizer)

Challenge:  Pfizer faced the need to accelerate the process of discoverying drug and improve the success rates of clinical trials.

Solution:  Pfizer employed data science to simulate and predict outcomes of drug trials using historical data and predictive models, optimizing trial parameters and improving the selection of drug candidates.

  • Accelerated Drug Development:  Reduced time to market for new drugs.
  • Increased Efficiency and Efficacy in Clinical Trials:  More targeted trials led to better outcomes.
  • Reduction in Drug Development Time and Costs:  Data science streamlines the R&D process.
  • Improved Clinical Trial Success Rates:  Predictive modeling enhances the accuracy of trial outcomes.

Case Study 15 – Media Buying Optimization (Procter & Gamble)

Challenge:  Procter & Gamble aimed to maximize the ROI of their extensive advertising budget by optimizing their media buying strategy across various channels.

Solution:  P&G analyzed extensive data on consumer behavior and media consumption to identify the most effective times and channels for advertising, allowing for highly targeted ads that reach the intended audience at optimal times.

  • Improved Effectiveness of Advertising Campaigns:  More effective ads increased campaign impact.
  • Increased Sales and Better Budget Allocation:  Enhanced ROI from more strategic media spending.
  • Enhanced Media Buying Strategies:  Data analytics significantly improves media buying effectiveness.
  • Insights into Consumer Behavior:  Understanding consumer behavior is crucial for optimizing advertising ROI.

Related: Is Data Science Certificate beneficial for your career?

Case Study 16 – Reducing Patient Readmission Rates with Predictive Analytics (Mount Sinai Health System)

Challenge:  Mount Sinai Health System sought to reduce patient readmission rates, a significant indicator of healthcare quality and a major cost factor. The challenge involved identifying patients at high risk of being readmitted within 30 days of discharge.

Solution:  The health system implemented a predictive analytics platform that analyzes real-time patient data and historical health records. The system detects patterns and risk factors contributing to high readmission rates by utilizing machine learning algorithms. Factors such as past medical history, discharge conditions, and post-discharge care plans were integrated into the predictive model.

  • Reduced Readmission Rates:  Early identification of at-risk patients allowed for targeted post-discharge interventions, significantly reducing readmission rates.
  • Enhanced Patient Outcomes: Patients received better follow-up care tailored to their health risks.
  • Predictive Analytics in Healthcare:  Effective for managing patient care post-discharge.
  • Holistic Patient Data Utilization: Integrating various data points provides a more accurate prediction and better healthcare outcomes.

Case Study 17 – Enhancing E-commerce Customer Experience with AI (Zalando)

Challenge:  Zalando aimed to enhance the online shopping experience by improving the accuracy of size recommendations, a common issue that leads to high return rates in online apparel shopping.

Solution:  Zalando developed an AI-driven size recommendation engine that analyzes past purchase and return data in combination with customer feedback and preferences. This system utilizes machine learning to predict the best-fit size for customers based on their unique body measurements and purchase history.

  • Reduced Return Rates:  More accurate size recommendations decreased the returns due to poor fit.
  • Improved Customer Satisfaction: Customers experienced a more personalized shopping journey, enhancing overall satisfaction.
  • Customization Through AI:  Personalizing customer experience can significantly impact satisfaction and business metrics.
  • Data-Driven Decision-Making: Utilizing customer data effectively can improve business outcomes by reducing costs and enhancing the user experience.

Case Study 18 – Optimizing Energy Grid Performance with Machine Learning (Enel Group)

Challenge:  Enel Group, one of the largest power companies, faced challenges in managing and optimizing the performance of its vast energy grids. The primary goal was to increase the efficiency of energy distribution and reduce operational costs while maintaining reliability in the face of fluctuating supply and demand.

Solution:  Enel Group implemented a machine learning-based system that analyzes real-time data from smart meters, weather stations, and IoT devices across the grid. This system is designed to predict peak demand times, potential outages, and equipment failures before they occur. By integrating these predictions with automated grid management tools, Enel can dynamically adjust energy flows, allocate resources more efficiently, and schedule maintenance proactively.

  • Enhanced Grid Efficiency:  Improved distribution management, reduced energy wastage, and optimized resource allocation.
  • Reduced Operational Costs: Predictive maintenance and better grid management decreased the frequency and cost of repairs and outages.
  • Predictive Maintenance in Utility Networks:  Advanced analytics can preemptively identify issues, saving costs and enhancing service reliability.
  • Real-Time Data Integration: Leveraging data from various sources in real-time enables more agile and informed decision-making in energy management.

Case Study 19 – Personalizing Movie Streaming Experience (WarnerMedia)

Challenge:  WarnerMedia sought to enhance viewer engagement and subscription retention rates on its streaming platforms by providing more personalized content recommendations.

Solution:  WarnerMedia deployed a sophisticated data science strategy, utilizing deep learning algorithms to analyze viewer behaviors, including viewing history, ratings given to shows and movies, search patterns, and demographic data. This analysis helped create highly personalized viewer profiles, which were then used to tailor content recommendations, homepage layouts, and promotional offers specifically to individual preferences.

  • Increased Viewer Engagement:  Personalized recommendations resulted in extended viewing times and increased interactions with the platform.
  • Higher Subscription Retention: Tailored user experiences improved overall satisfaction, leading to lower churn rates.
  • Deep Learning Enhances Personalization:  Deep learning algorithms allow a more nuanced knowledge of consumer preferences and behavior.
  • Data-Driven Customization is Key to User Retention: Providing a customized experience based on data analytics is critical for maintaining and growing a subscriber base in the competitive streaming market.

Case Study 20 – Improving Online Retail Sales through Customer Sentiment Analysis (Zappos)

Challenge:  Zappos, an online shoe and clothing retailer, aimed to enhance customer satisfaction and boost sales by better understanding customer sentiments and preferences across various platforms.

Solution:  Zappos implemented a comprehensive sentiment analysis program that utilized natural language processing (NLP) techniques to gather and analyze customer feedback from social media, product reviews, and customer support interactions. This data was used to identify emerging trends, customer pain points, and overall sentiment towards products and services. The insights derived from this analysis were subsequently used to customize marketing strategies, enhance product offerings, and improve customer service practices.

  • Enhanced Product Selection and Marketing:  Insight-driven adjustments to inventory and marketing strategies increased relevancy and customer satisfaction.
  • Improved Customer Experience: By addressing customer concerns and preferences identified through sentiment analysis, Zappos enhanced its overall customer service, increasing loyalty and repeat business.
  • Power of Sentiment Analysis in Retail:  Understanding and reacting to customer emotions and opinions can significantly impact sales and customer satisfaction.
  • Strategic Use of Customer Feedback: Leveraging customer feedback to drive business decisions helps align product offerings and services with customer expectations, fostering a positive brand image.

Related: Data Science Industry in the US

Case Study 21 – Streamlining Airline Operations with Predictive Analytics (Delta Airlines)

Challenge:  Delta Airlines faced operational challenges, including flight delays, maintenance scheduling inefficiencies, and customer service issues, which impacted passenger satisfaction and operational costs.

Solution:  Delta implemented a predictive analytics system that integrates data from flight operations, weather reports, aircraft sensor data, and historical maintenance records. The system predicts potential delays using machine learning models and suggests optimal maintenance scheduling. Additionally, it forecasts passenger load to optimize staffing and resource allocation at airports.

  • Reduced Flight Delays:  Predictive insights allowed for better planning and reduced unexpected delays.
  • Enhanced Maintenance Efficiency:  Maintenance could be scheduled proactively, decreasing the time planes spend out of service.
  • Improved Passenger Experience: With better resource management, passenger handling became more efficient, enhancing overall customer satisfaction.
  • Operational Efficiency Through Predictive Analytics:  Leveraging data for predictive purposes significantly improves operational decision-making.
  • Data Integration Across Departments: Coordinating data from different sources provides a holistic view crucial for effective airline management.

Case Study 22 – Enhancing Financial Advisory Services with AI (Morgan Stanley)

Challenge:  Morgan Stanley sought to offer clients more personalized and effective financial guidance. The challenge was seamlessly integrating vast financial data with individual client profiles to deliver tailored investment recommendations.

Solution:  Morgan Stanley developed an AI-powered platform that utilizes natural language processing and ML to analyze financial markets, client portfolios, and historical investment performance. The system identifies patterns and predicts market trends while considering each client’s financial goals, risk tolerance, and investment history. This integrated approach enables financial advisors to offer highly customized advice and proactive investment strategies.

  • Improved Client Satisfaction:  Clients received more relevant and timely investment recommendations, enhancing their overall satisfaction and trust in the advisory services.
  • Increased Efficiency: Advisors were able to manage client portfolios more effectively, using AI-driven insights to make faster and more informed decisions.
  • Personalization through AI:  Advanced analytics and AI can significantly enhance the personalization of financial services, leading to better client engagement.
  • Data-Driven Decision Making: Leveraging diverse data sets provides a comprehensive understanding crucial for tailored financial advising.

Case Study 23 – Optimizing Inventory Management in Retail (Walmart)

Challenge:  Walmart sought to improve inventory management across its vast network of stores and warehouses to reduce overstock and stockouts, which affect customer satisfaction and operational efficiency.

Solution:  Walmart implemented a robust data analytics system that integrates real-time sales data, supply chain information, and predictive analytics. This system uses machine learning algorithms to forecast demand for thousands of products at a granular level, considering factors such as seasonality, local events, and economic trends. The predictive insights allow Walmart to dynamically adjust inventory levels, optimize restocking schedules, and manage distribution logistics more effectively.

  • Reduced Inventory Costs:  More accurate demand forecasts helped minimize overstock and reduce waste.
  • Enhanced Customer Satisfaction: Improved stock availability led to better in-store experiences and higher customer satisfaction.
  • Precision in Demand Forecasting:  Advanced data analytics and machine learning significantly enhance demand forecasting accuracy in retail.
  • Integrated Data Systems:  Combining various data sources provides a comprehensive view of inventory needs, improving overall supply chain efficiency.

Case Study 24: Enhancing Network Security with Predictive Analytics (Cisco)

Challenge:  Cisco encountered difficulties protecting its extensive network infrastructure from increasingly complex cyber threats. The objective was to bolster their security protocols by anticipating potential breaches before they happen.

Solution:  Cisco developed a predictive analytics solution that leverages ML algorithms to analyze patterns in network traffic and identify anomalies that could suggest a security threat. By integrating this system with their existing security protocols, Cisco can dynamically adjust defenses and alert system administrators about potential vulnerabilities in real-time.

  • Improved Security Posture:  The predictive system enabled proactive responses to potential threats, significantly reducing the incidence of successful cyber attacks.
  • Enhanced Operational Efficiency: Automating threat detection and response processes allowed Cisco to manage network security more efficiently, with fewer resources dedicated to manual monitoring.
  • Proactive Security Measures:  Employing predictive cybersecurity analytics helps organizations avoid potential threats.
  • Integration of Machine Learning: Machine learning is crucial for effectively detecting patterns and anomalies that human analysts might overlook, leading to stronger security measures.

Case Study 25 – Improving Agricultural Efficiency with IoT and AI (Bayer Crop Science)

Challenge:  Bayer Crop Science aimed to enhance agricultural efficiency and crop yields for farmers worldwide, facing the challenge of varying climatic conditions and soil types that affect crop growth differently.

Solution:  Bayer deployed an integrated platform that merges IoT sensors, satellite imagery, and AI-driven analytics. This platform gathers real-time weather conditions, soil quality, and crop health data. Utilizing machine learning models, the system processes this data to deliver precise agricultural recommendations to farmers, including optimal planting times, watering schedules, and pest management strategies.

  • Increased Crop Yields:  Tailored agricultural practices led to higher productivity per hectare.
  • Reduced Resource Waste: Efficient water use, fertilizers, and pesticides minimized environmental impact and operational costs.
  • Precision Agriculture:  Leveraging IoT and AI enables more precise and data-driven agricultural practices, enhancing yield and efficiency.
  • Sustainability in Farming:  Advanced data analytics enhance the sustainability of farming by optimizing resource utilization and minimizing waste.

Related: Is Data Science Overhyped?

The power of data science in transforming industries is undeniable, as demonstrated by these 25 compelling case studies. Through the strategic application of machine learning, predictive analytics, and AI, companies are solving complex challenges and gaining a competitive edge. The insights gleaned from these cases highlight the critical role of data science in enhancing decision-making processes, improving operational efficiency, and elevating customer satisfaction. As we look to the future, the role of data science is set to grow, promising even more innovative solutions and smarter strategies across all sectors. These case studies inspire and serve as a roadmap for harnessing the transformative power of data science in the journey toward digital transformation.

  • What is Narrow AI [Pros & Cons] [Deep Analysis] [2024]
  • Use of AI in Medicine: 5 Transformative Case Studies [2024]

Team DigitalDefynd

We help you find the best courses, certifications, and tutorials online. Hundreds of experts come together to handpick these recommendations based on decades of collective experience. So far we have served 4 Million+ satisfied learners and counting.

case study for data science project

Career in Data Science vs UX/UI [2024]

case study for data science project

Role of Data Analytics in Fintech [2024]

case study for data science project

Data Leadership Traits [2024]

case study for data science project

Role of Data Science in Cybersecurity & Threat Detection[2024]

case study for data science project

5 Reasons Why You Must Learn Excel VBA [2024]

case study for data science project

How to Use Data Science in Real Estate? [2024]

case study for data science project

The Data Science Newsletter

case study for data science project

Data Science Case Studies: Lessons from the Real World

case study for data science project

In the swiftly evolving domain of data science, real-world case studies serve as invaluable resources, offering insights into the challenges, strategies, and outcomes associated with various data science projects.

This comprehensive article explores a series of case studies across different industries, highlighting the pivotal lessons learned from each. By examining these case studies, data science professionals and enthusiasts can glean practical insights to apply in their work.

person holding clear glass ball

Furthermore, we encourage readers to subscribe to our newsletter for ongoing updates and in-depth analysis of data science trends and applications.

The Transformative Power of Data Science

Data science continues to revolutionize industries by extracting meaningful insights from complex datasets, driving decision-making processes, and fostering innovation.

Through real-world case studies, we can observe the transformative power of data science in action, from enhancing customer experiences to optimizing operations and beyond.

These stories not only inspire but also provide a tangible blueprint for how data science can be effectively applied to solve real-world problems.

Diverse Applications of Data Science

E-commerce personalization.

In the competitive e-commerce sector, personalization has emerged as a key differentiator.

A landmark case study involves a leading online retailer that leveraged data science to personalize the shopping experience for millions of users.

By analyzing customer data, including purchase history, browsing behavior, and preferences, the retailer developed algorithms to recommend products tailored to individual users.

This personalization strategy resulted in a significant increase in customer engagement and sales, highlighting the potential of data science to transform marketing and sales strategies.

Healthcare Predictive Analytics

The healthcare industry has seen remarkable benefits from the application of data science, particularly in predictive analytics. A notable case study is a hospital that implemented a predictive model to identify patients at risk of readmission within 30 days of discharge. By integrating data from electronic health records, social determinants of health, and patient-reported outcomes, the model provided healthcare providers with actionable insights to develop personalized care plans. This initiative led to improved patient outcomes and a reduction in healthcare costs, underscoring the impact of data science on patient care and health system efficiency.

Financial Fraud Detection

In the financial services industry, fraud detection is a critical application of data science. A compelling case study involves a bank that employed machine learning algorithms to detect fraudulent transactions in real-time. By analyzing transaction patterns and comparing them against known fraud indicators, the system could flag suspicious activities for further investigation. This proactive approach to fraud detection safeguarded customers' assets and enhanced the bank's security measures, demonstrating the effectiveness of data science in combating financial crime.

Embracing Data Science for Real-World Impact

The case studies presented illuminate the broad spectrum of challenges that data science can address, showcasing its versatility and impact.

For organizations and professionals looking to harness the power of data science, these examples provide inspiration and guidance on applying data science techniques to achieve tangible results.

The key lessons from these case studies emphasize the importance of understanding the specific context and objectives of each project, selecting appropriate methodologies, and continuously refining models based on real-world feedback.

Action: Stay Ahead with Our Data Science Newsletter

To delve deeper into the world of data science and explore more case studies, strategies, and innovations, we invite you to subscribe to our specialized newsletter. Our newsletter offers a curated selection of articles, case studies, expert insights, and the latest developments in data science, tailored to professionals seeking to enhance their knowledge and apply data science principles effectively in their domains.

By subscribing, you'll join a community of forward-thinking individuals passionate about leveraging data science for real-world impact. Whether you're a data science practitioner, a business leader looking to implement data-driven strategies, or simply curious about the potential of data science, our newsletter is your gateway to staying informed and inspired.

Don't miss out on the opportunity to expand your understanding of data science and gain valuable insights from the frontlines of industry innovation. Subscribe now and take the first step toward translating the lessons from real-world case studies into success in your own data science endeavors.

The Data Science Newsletter is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

case study for data science project

Ready for more?

Case studies

How data science is used to solve real-world problems in business, public policy and beyond

Our case study content is in development. Interested in contributing a case study to Real World Data Science? Check out our call for contributions .

A doctor takes a blood pressure reading from a patient.

Forecasting the Health Needs of a Changing Population

Photo of yellow chain links by Possessed Photography on Unsplash.

Deduplicating and linking large datasets using Splink

Robin Linacre introduces an open source tool, developed by the UK Ministry of Justice, which uses probabilistic record linkage to improve the quality of justice system data.

Computer monitor displaying bodycam recordings, primary colours, digital art. Created using Bing Image Creator.

Learning from failure: ‘Red flags’ in body-worn camera data

A shopper walks the aisles of a supermarket in the United States.

Food for Thought: Second place winners – DeepFFTLink

A man is seen grocery shopping during the pandemic, looking at canned food while wearing a mask.

The Food for Thought Challenge: Using AI to support evidence-based food and nutrition policy

Close up of shopping trolleys linked together.

Food for Thought: The value of competitions for confidential data

Section of article's Figure 1 graphic, illustrating the challenge of linking retail scanner data to nutrition information.

Food for Thought: Competition and challenge design

From structure to setup, to metrics, results, and lessons learned, Zheyuan Zhang and Uyen Le give an overview of the design of the Food for Thought competition.

Fresh vegetables on grocery store shelves.

Food for Thought: First place winners – Auburn Big Data

Auburn University’s team of PhD students and faculty describe their winning solution to the Food for Thought challenge: random forest classifiers.

Dairy products lined up on refrigerator shelves in a farmers market in the United States.

Food for Thought: The importance of the Purchase to Plate Suite

An aerial shot of the groceries section of the Fred Meyer superstore in Redmond, WA.

Food for Thought: Third place winners – Loyola Marymount

Yifan Hu and Mandy Korpusik of Loyola Marymount University describe their solution to the Food for Thought challenge: binary classification with pre-trained BERT.

A young man sits in front of a computer keyboard, surrounded by monitors and books and with computer cables covering various surfaces.

The road to reproducible research: hazards to avoid and tools to get you there safely

IEEE Account

  • Change Username/Password
  • Update Address

Purchase Details

  • Payment Options
  • Order History
  • View Purchased Documents

Case Study: Delivering A Successful Data Science Project

case study for data science project

Article Snapshot

This knowledge is brought to you by Olga Rudenko , just one of thousands of top freelance agile consultants on Expert360. Sign up free to hire freelancers here, or apply to become an Expert360 consultant here. Let us know your thoughts on this piece in the comments below.

  Table of Contents

What Is A Data Science Project?

Context: case company & project, key learnings.

  • Data science projects are a new beast. Very few companies understand the difference between data science and analytical project. Analytical projects require stakeholders to decide before the value is unlocked. Data science projects build digital products that deliver value via fully automated micro decisions.
  • The case study follows the development of the recommendation engine for an Australian leading online business. It shares the results of the phase 1 of the project.
  • Setting up the right team and the right process is key. The team should combine creative, technical and commercial talent. The process should start with clear success measures, stimulate enough challenge of the current way of thinking of the problem, release a hypothesised solution as soon as possible and continue to iterate.
  • Key learnings were clearly communicating the vision for the project, bring a diverse team together to a non-hierarchical ideation process, and deliver value incrementally.

Before we jump into the case study, I felt it was important to briefly address the misconception about what a data science project is by giving an example of a side-by-side comparison. A lot of Australian companies are currently misusing the term and refer to a business analytics project as data science or big data project.  Data science is a very different beast. Here is an example of how the projects differ:  

Analytical project Data science project

What it is?

Any project that uses statistical analysis or logic driven process supported by data. The outcomes are a set of recommendations or conclusions that stakeholders need to make a decision on to unlock value

A project that uses large volumes of data to build self-learning and self-improving algorithms that require no human input to unlock value

  • Pricing strategy  
  • Choosing the right products for promotional discounting
  • Supply chain optimisation models 
  • Recommendation engines
  • Search optimisation bidding models
  • Fraud detection and prediction models

Capability:

An analyst usually needs to have:

  • Solid logic
  • Basic statistical toolkit
  • Basic or often advanced commercial acumen
  • Ability to build models in excel

A data scientist usually needs to have:

  • Advanced  statistical toolkit
  • Basic commercial acumen
  • Basic or advanced technical ability (at least R or better basic coding)

Measuring success

The success of an analytical project is often an effective strategic recommendation that requires a decision. As a result of the decision, certain business benefits are expected to be gained.  E.g. setting a different price for products within a category - 10% increase.

The success of the data science project is often an establishment of a self-running data product that delivers value via micro decisions that require no human input. They automatically get better over time with more data collected. You can improve them by tweaking the algorithms. A/B is often used to validate success. E.g. recommender engine consistently delivering better product recommendations over time.  n our case study, we will focus on a data science project  - improving a recommendation engine for an online business.  

Company: Our case company is a sizeable online retailer that sells merchandise to consumers all over the world. The company has a strong acquisition strategy that consistently delivered traffic growth. The key business challenge was loyalty.   Project: An improved recommendation engine was meant to deliver better product suggestions, that would lead to improved conversion at the time of purchase. However, the key success measure for the project was retention. Both revisitation and repeat purchase rates we key metrics of success. Loyalty was a bigger goal than one-off conversion. This affected design and delivery and made the project different from many recommendation engines targeting immediate behaviour changes.  

Outcomes Of The Data Science Project

Any data science project is rarely ever final. I was personally involved in phase 1 of the project delivery. It is still ongoing. Measures of success (in order of priority):

  • Repeat purchase: how many more customers came back to the site and bought again compared to old recommendation experience
  • Revisitation: how many more customers came back to the site again compared to old recommendation experience
  • Favouriting: how many more customers favourited the products recommended compared to old recommendation experience

  Phase 1. Email test We first tested the recommender via the email recommendation and these are the movements in core metrics we observed:

  • Repeat purchase: +5% better
  • Revisitation: +20% better
  • Favouriting: 5x better
  • Conversion: 3x better

We also observed much more impressive movement in the earlier stages of the funnel. Our open rates were 5 times better than an average email campaign and click through from email was 80% better than average.  We discovered that our loyalty metrics  - revisitation and repurchase had a more significant delay and were understated.   

Phase 2. Onsite test After an email recommendation test was successful, we moved the A/B test to site. At this point the results are being measured and the company is still working on further improvements to its recommendation engine.  

Choosing the team to deliver the project was the most important factor for us. What we found is that bringing creative talent to collaborate with the technical talent from the very beginning was an absolute key to innovation and led to a step change in recommendation methodology.

Commercial : having someone on the team with a commercial skill set and perspective was key, particularly at the beginning. You can get key leadership stakeholders to set project goals. For us, a product manager played that role. They also helped the team to make trade offs relating to impact along the journey. E.g. should we invest more time in refining the algorithm or release in small parts to test the impact? How much time can we afford to invest in marginal algorithm improvements given the benefits we expect to see?

Technical : this is a default capability in any data science project and is unlikely to get missed. The point of difference for us was that we had a data scientist as well as engineers involved. Data scientist was able to focus on the algorithm logic while engineers focused on making the solution scalable and fast.

Creative: design or customer insights teams rarely come into a data science project team. This was a key differentiator for us. Our designer not only brought more creativity into redefining the algorithm. He really challenged the team to take more risk and take a customer point of view. It was a designed who kept insisting that we can deliver a better recommendation experience for a smaller group of customers if we use different signals in our engine.  

Step 1. Define success

  Setting up your project with clear metrics of success is key. This is pretty much the only thing that will stay constant. For us it was important to keep in mind that loyalty was always a number one goal. Having won in conversion was a positive stepping stone, but not the final prize. Even when we had exceptional open rates and email engagement, we had to keep working until we got to meaningful loyalty gains.  Most meaningful data science projects always have a strong or direct connection to a top company goal.  

Step 2. Push the limits

With projects like a recommendation engine, it is easy to fall into the trap of marginal improvements. It is particularly difficult to step outside of the signals that are already clearly linked to the outcome. For us the options were:

  • Option 1. Iterate transactional signals by changing weights in the algorithm (product sales, add to cart, product page visits)
  • Option 2 . Use low coverage signals - those that could potentially be higher quality but work for a smaller number of people. Signals like favouriting is a much more sparse signal than add to cart, and therefore got ignored in the original algorithm design. We brought that back. We also discovered a completely new unconventional signal that had to do with the design of the pattern on the product (e.g. recommend red products if a customer was looking at red).

Putting together a very unconventional combination of signals was a crazy move. But it did pay off. These options would not have been on the table if not for the diversity within the team and the process we followed.  

Step 3. Release quick

Data science is really not that different from any digital product project in the way it uses MVP thinking (Minimal Viable Product). The quickest possible way you can test your product always wins. It does not mean a compromise on things that really matter. For instance, we took the time on step 2 to really challenge the status quo. But we did try to think of the quickest possible way to test. For instance, doing an email test vs releasing on site meant 0.5 days of work vs 1 week.  

Step 4. Learn and refine

Constant iteration is another digital product delivery concept that applies to data science and makes it very different from an analytical project. The only way a data science model or algorithm can be better is by collecting positive and negative signals. For us, thinking about exposing our recommendation engine via different channels and to different audiences was key in making sure the model got refined and was able to deal with a variety of situations.   

 So what have we learnt? If I were, to sum up, the three most important things:

Communicate vision

Once we discovered that a certain group of users would have a better recommendation from signals that have low coverage across the board, that helped us paint a picture for a truly personalised recommendation experience (see below). Communicating the vision for where a data science project can go is critical. It inspires the team. It helps stakeholders back you. It keeps you focused on the most important milestones.    

Empower the team

Assuming that your product owner will do the vision and data scientist will build the product is a wrong way to start the ideation. If you start the ideation process with everyone having an equal role, you may be surprised by the level of insight coming from most unlikely team members.   

Deliver incrementally

While having a vision as a goal post is great, the focus of the day to day should be on proving it. Find the quickest and most feasible way to test the vision. If your aspiration is to build a truly personal recommendation experience, start by delivering a user cluster version of it. If your aspiration is to deliver recommendations to 10 million visitors on site, start by sending recommendation email to 500,000 people to test it. A data product is just another digital product. You can read more on agile delivery in my guide to agile transformation.

Get our insights into what’s happening in business and the world of work; interesting news, trends, and perspectives from our Expert community, and access to our data & trend analysis. Be first in line to read The 360˚ View by subscribing below.

Hire exceptional talent in under 48 hours with Expert360 - Australia & New Zealand's #1 Skilled Talent Network.

case study for data science project

case study for data science project

  • Computers & Technology
  • Computer Science

Sorry, there was a problem.

Kindle app logo image

Download the free Kindle app and start reading Kindle books instantly on your smartphone, tablet, or computer - no Kindle device required .

Read instantly on your browser with Kindle for Web.

Using your mobile phone camera - scan the code below and download the Kindle app.

QR code to download the Kindle App

Image Unavailable

Data Science Projects with Python

  • To view this video download Flash Player

Follow the author

Stephen Klosterman

Data Science Projects with Python

Gain hands-on experience with industry-standard data analysis and machine learning tools in Python

Key Features

  • Tackle data science problems by identifying the problem to be solved
  • Illustrate patterns in data using appropriate visualizations
  • Implement suitable machine learning algorithms to gain insights from data

Book Description

Data Science Projects with Python is designed to give you practical guidance on industry-standard data analysis and machine learning tools, by applying them to realistic data problems. You will learn how to use pandas and Matplotlib to critically examine datasets with summary statistics and graphs, and extract the insights you seek to derive. You will build your knowledge as you prepare data using the scikit-learn package and feed it to machine learning algorithms such as regularized logistic regression and random forest. You'll discover how to tune algorithms to provide the most accurate predictions on new and unseen data. As you progress, you'll gain insights into the working and output of these algorithms, building your understanding of both the predictive capabilities of the models and why they make these predictions.

By then end of this book, you will have the necessary skills to confidently use machine learning algorithms to perform detailed data analysis and extract meaningful insights from unstructured data.

What you will learn

  • Install the required packages to set up a data science coding environment
  • Load data into a Jupyter notebook running Python
  • Use Matplotlib to create data visualizations
  • Fit machine learning models using scikit-learn
  • Use lasso and ridge regression to regularize your models
  • Compare performance between models to find the best outcomes
  • Use k-fold cross-validation to select model hyperparameters

Who this book is for

If you are a data analyst, data scientist, or business analyst who wants to get started using Python and machine learning techniques to analyze data and predict outcomes, this book is for you. Basic knowledge of Python and data analytics will help you get the most from this book. Familiarity with mathematical concepts such as algebra and basic statistics will also be useful.

Table of Contents

  • Data Exploration and Cleaning
  • Introduction to Scikit-Learn and Model Evaluation
  • Details of Logistic Regression and Feature Exploration
  • The Bias-Variance Trade-off
  • Decision Trees and Random Forests
  • Imputation of Missing Data, Financial Analysis, and Delivery to Client
  • ISBN-10 1838551026
  • ISBN-13 978-1838551025
  • Publisher Packt Publishing
  • Publication date April 30, 2019
  • Language English
  • Dimensions 9.25 x 7.52 x 0.78 inches
  • Print length 374 pages
  • See all details

From the Publisher

Woman learning Python with the Python data science tutorial Data Science Projects with Python

About Data Science Projects with Python

Data Science Projects with Python is a hands-on introduction to real-world data science. You'll take an active approach to learning by following real case studies that elegantly tie together mathematics and code.

You'll start by learning how to examine, extract and present insights from your own data with tools like pandas and Matplotlib. You'll build upon these skills as you start feeding data into machine learning algorithms with sci-kit learn, tuning them individually as you learn how they make their predictions.

Data Science Projects with Python is ideal if you're looking for guidance on industry-standard data analysis and machine learning tools. You'll learn how to navigate your own unstructured data, extracting meaningful insights along the way.

  • Ideal introduction to data science for those already familiar with foundational Python
  • Full of practical step-by-step exercises, activities and solutions
  • Structured to let you pause and progress learning on your terms

Learn Real-World Data Science

You'll complete step-by-step exercises that build practical, hands-on skills. Activities then challenge you to apply and extend your new abilities. It's a proven structure that's designed to hack your brain and embed your learning.

It can be tough to manage learning around your existing commitments. With a structure that lets you pick up and progress at your own pace, you're far more likely to reach your goal. It's time to take back control of your schedule.

Every project brings together a team of expert authors, instructors, and technical editors. You'll find step-by-step guidance from start to finish, complete with worked solutions. You won't ever be stuck with an example that doesn't compile.

Two men together at a table, verifying their Python skill online after completing a Python tutorial

Editorial Reviews

About the author.

Stephen Klosterman is a machine learning data scientist at CVS Health. He enjoys helping to frame problems in a data science context and delivering machine learning solutions that business stakeholders understand and value. His education includes a Ph.D. in biology from Harvard University, where he was an assistant teacher of the data science course.

Product details

  • Publisher ‏ : ‎ Packt Publishing (April 30, 2019)
  • Language ‏ : ‎ English
  • Paperback ‏ : ‎ 374 pages
  • ISBN-10 ‏ : ‎ 1838551026
  • ISBN-13 ‏ : ‎ 978-1838551025
  • Item Weight ‏ : ‎ 1.43 pounds
  • Dimensions ‏ : ‎ 9.25 x 7.52 x 0.78 inches
  • #497 in Data Processing
  • #900 in Python Programming
  • #1,356 in Artificial Intelligence & Semantics

About the author

Stephen klosterman.

Stephen Klosterman is a Machine Learning Data Scientist with a background in math, environmental science, and ecology. His education includes a PhD in Biology from Harvard University, where he was assistant teacher of the Data Science course. Currently he works in the health care industry. At work, he likes to research and develop machine learning solutions that stakeholders understand and value. In his spare time, he enjoys running, biking, sailing, and music. For blog posts on Data Science and Machine Learning, as well as errata and Q&A about the book, visit www.steveklosterman.com.

Customer reviews

  • 5 star 4 star 3 star 2 star 1 star 5 star 66% 18% 10% 2% 5% 66%
  • 5 star 4 star 3 star 2 star 1 star 4 star 66% 18% 10% 2% 5% 18%
  • 5 star 4 star 3 star 2 star 1 star 3 star 66% 18% 10% 2% 5% 10%
  • 5 star 4 star 3 star 2 star 1 star 2 star 66% 18% 10% 2% 5% 2%
  • 5 star 4 star 3 star 2 star 1 star 1 star 66% 18% 10% 2% 5% 5%

Customer Reviews, including Product Star Ratings help customers to learn more about the product and decide whether it is the right product for them.

To calculate the overall star rating and percentage breakdown by star, we don’t use a simple average. Instead, our system considers things like how recent a review is and if the reviewer bought the item on Amazon. It also analyzed reviews to verify trustworthiness.

  • Sort reviews by Top reviews Most recent Top reviews

Top reviews from the United States

There was a problem filtering reviews right now. please try again later..

case study for data science project

Top reviews from other countries

case study for data science project

  • About Amazon
  • Investor Relations
  • Amazon Devices
  • Amazon Science
  • Sell products on Amazon
  • Sell on Amazon Business
  • Sell apps on Amazon
  • Become an Affiliate
  • Advertise Your Products
  • Self-Publish with Us
  • Host an Amazon Hub
  • › See More Make Money with Us
  • Amazon Business Card
  • Shop with Points
  • Reload Your Balance
  • Amazon Currency Converter
  • Amazon and COVID-19
  • Your Account
  • Your Orders
  • Shipping Rates & Policies
  • Returns & Replacements
  • Manage Your Content and Devices
 
 
 
   
  • Conditions of Use
  • Privacy Notice
  • Consumer Health Data Privacy Disclosure
  • Your Ads Privacy Choices

case study for data science project

Navigation Menu

Search code, repositories, users, issues, pull requests..., provide feedback.

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly.

To see all available qualifiers, see our documentation .

  • Notifications You must be signed in to change notification settings

A Case Study Approach to Successful Data Science Projects Using Python, Pandas, and Scikit-Learn

TrainingByPackt/Data-Science-Projects-with-Python

Folders and files.

NameName
34 Commits

Repository files navigation

GitHub issues

Data Science Projects with Python

Data Science Projects with Python is designed to give you practical guidance on industry-standard data analysis and machine learning tools in Python, with the help of realistic data. The course will help you understand how you can use pandas and Matplotlib to critically examine a dataset with summary statistics and graphs, and extract the insights you seek to derive. You will continue to build on your knowledge as you learn how to prepare data and feed it to machine learning algorithms, such as regularized logistic regression and random forest, using the scikit-learn package. You’ll discover how to tune the algorithms to provide the best predictions on new and, unseen data. As you delve into later chapters, you’ll be able to understand the working and output of these algorithms and gain insight into not only the predictive capabilities of the models but also their reasons for making these predictions.

Data Science Projects with Python by Stephen Klosterman

What you will learn

  • Install the required packages to set up a data science coding environment
  • Load data into a Jupyter notebook running Python
  • Use Matplotlib to create data visualizations
  • Fit a model using scikit-learn
  • Use lasso and ridge regression to regularize the model
  • Fit and tune a random forest model and compare performance with logistic regression
  • Create visuals using the output of the Jupyter notebook
  • Use k-fold cross-validation to select the best combination of hyperparameters

Hardware requirements

For an optimal student experience, we recommend the following hardware configuration:

  • Processor : Intel Core i5 or equivalent
  • Memory : 4 GB RAM or higher
  • Storage : 35 Gb or higher

Software requirements

  • OS: Windows 7 SP1 64-bit, Windows 8.1 64-bit or Windows 10 64-bit, Ubuntu Linux, or the latest version of OS X
  • Browser: Google Chrome/Mozilla Firefox Latest Version
  • Notepad++/Sublime Text as IDE (Optional, as you can practice everything using Jupyter notecourse on your browser)
  • Python 3.4+ (latest is recommended) installed (from https://python.org )
  • Python libraries as needed (Jupyter, Numpy, Pandas, Matplotlib, BeautifulSoup4, and so on)

Contributors 3

  • Jupyter Notebook 100.0%
  • Subscription

Live Tutorials and Project Walkthroughs

Upcoming live events, no webinar this week, past tutorials, machine learning and deep learning beginner intro and overview (with code).

Video Featured Image

Forecast the Weather Using Prophet for Time Series Prediction

Video Featured Image

Deploy a Deep Learning API to the Cloud with Docker and Azure

Video Featured Image

Create a Deep Learning API with Python and FastAPI

Video Featured Image

Weather Prediction with Python and Machine Learning (with Code)

Video Featured Image

How to Ace a Data Scientist Interview

Video Featured Image

Predict NBA Games Using Python and Machine Learning (Part 2)

Video Featured Image

Web Scraping NBA Games Using Python (Part 1)

Video Featured Image

Detect Dog Emotions with Deep Learning (Full Walkthrough with Code)

Video Featured Image

Create Your First Data Solution in the Cloud

Video Featured Image

Predict Baseball Stats Using Machine Learning and Python

Video Featured Image

Predict Bitcoin Prices Using Machine Learning and Python (with Full Code)

Video Featured Image

Predict House Prices Using Machine Learning and Python (Full Tutorial)

Video Featured Image

3 Real-World Data Projects for Your Portfolio

Video Featured Image

Build a Custom Search Engine in Python with Filtering

Video Featured Image

Ridge Regression from Scratch in Python (Machine Learning Tutorial)

Video Featured Image

Linear Regression Algorithm in Python from Scratch (Beginner Tutorial)

Video Featured Image

Build Your First Machine Learning Project (Full Beginner Walkthrough)

Video Featured Image

7 Beginner Python Data Projects (with Full Code and Walkthroughs)

Video Featured Image

K-means Clustering Algorithm in Python From Scratch (Beginner Tutorial)

Video Featured Image

Understanding Different Data Roles (a Roundtable Discussion)

Video Featured Image

Business Analyst Job Outlook in 2022 (a Panel Discussion)

Video Featured Image

Real-Time Speech Recognition Using Your Microphone (Beginner Tutorial with Full Code)

Speech recognition and summarization system in python (project tutorial).

Video Featured Image

Build an Airflow Data Pipeline to Download Podcasts (Beginner Data Engineer Tutorial)

Video Featured Image

How to Build a Data Project Portfolio and Stand Out to Employers (with Examples)

Video Featured Image

Learner Roundtable Discussion: How to Start a Career in Data Science

Video Featured Image

Build a Movie Recommendation System with Jupyter and Pandas

Video Featured Image

Predict the Stock Market with Machine Learning

Video Featured Image

How to Start Your Career in Data

Video Featured Image

Is Power BI Certification Worth It?

Video Featured Image

Predicting Football Match Winners with Machine Learning

Video Featured Image

Web Scraping Football Matches from The EPL with Python (part 1 of 2)

Video Featured Image

How to Become a Business Analyst (with Q&A)

Video Featured Image

Classifying Dog Images with Deep Learning and TensorFlow

Video Featured Image

Power BI Beginner Tutorial: Analyzing the Olympics

Video Featured Image

Web Scraping Beginner Tutorial with Playwright

Video Featured Image

Exploring FIFA Stats with Microsoft Power BI

Video Featured Image

Analyzing COVID RNA Sequences with Python

Video Featured Image

Analyzing Data with Microsoft Power BI

Video Featured Image

Using Collaborative Filtering to Recommend Books (part 2 of 2)

Video Featured Image

Building a Book Recommendation System with Python (part 1 of 2)

Video Featured Image

Predicting the Weather with Machine Learning (Beginner Project)

Video Featured Image

Predicting the NBA MVP: Machine Learning Project (part 3 of 3)

Video Featured Image

Cleaning NBA Stats Data with Python And Pandas: Data Project (part 2 of 3)

Video Featured Image

Web Scraping NBA Stats with Python: Data Project (Part 1 of 3)

Video Featured Image

Analyzing Star Wars Survey Data with Python and Pandas (Data Project)

Video Featured Image

Predicting Stock Prices with Python and Scikit-Learn (Machine Learning Project)

Video Featured Image

  • Data Science
  • Data Analysis
  • Data Visualization
  • Machine Learning
  • Deep Learning
  • Computer Vision
  • Artificial Intelligence
  • AI ML DS Interview Series
  • AI ML DS Projects series
  • Data Engineering
  • Web Scrapping

5 Python Projects for Data Science Portfolio

Building a portfolio with well-thought-out projects is crucial for anyone aspiring to enter the field of data science. It not only demonstrates your technical skills but also shows your ability to handle real-world data problems.

5-Python-Projects-for-Data-Science-Portfolio-copy

In this article, we will explore 5 Python Projects for a Data Science Portfolio.

Table of Content

Exploratory Data Analysis on the Titanic Dataset

House price prediction, stock price prediction using time series analysis, sentiment analysis of social media posts, interactive data visualization dashboard.

Here are 5 project Ideas that you can include in your Data Science Portfolio to make it stand out.

Description: Perform an Exploratory Data Analysis (EDA) on the Titanic dataset to uncover insights about the passengers and their survival rates. EDA helps in understanding the data’s underlying patterns and relationships.

Explanation: The Titanic dataset is a classic example used for EDA. You’ll clean the data, handle missing values, and visualize the relationships between different features and survival rates. The steps involved include:

  • Data Cleaning: Handle missing values for features like Age, Cabin, and Embarked.
  • Feature Exploration: Analyze categorical features (e.g., Pclass, Sex, Embarked) and numerical features (e.g., Age, Fare).
  • Visualization: Use Seaborn and Matplotlib to create plots like bar charts, box plots, and heatmaps to visualize the relationships between features and survival rates.
  • Insights: Identify key factors that influenced survival rates, such as passenger class, gender, and age.
Project Link: Python | Titanic Data EDA using Seaborn

Description: Develop a machine learning model to predict house prices based on various features such as location, size, and amenities.

Explanation: This project involves data preprocessing, feature selection, and training different regression models to predict house prices. You will evaluate the models and tune their hyperparameters to improve accuracy. This project involves several steps to build a robust predictive model for house prices:

  • Data Preprocessing: Handle missing values, encode categorical variables, and scale numerical features.
  • Feature Selection: Identify and select important features that significantly impact house prices.
  • Model Training: Train different regression models such as Linear Regression, Decision Trees, and Random Forest.
  • Model Evaluation: Evaluate models using metrics like RMSE (Root Mean Squared Error) and R² score.
  • Hyperparameter Tuning: Optimize model performance by tuning hyperparameters using techniques like Grid Search or Random Search.
Project Link: House Price Prediction

Description: Analyze historical stock prices and build models to predict future stock prices.

Explanation: This project involves decomposing time series data, analyzing trends and seasonality, and implementing models like ARIMA or LSTM to forecast future stock prices. It’s a great way to showcase your skills in time series analysis and forecasting. This project focuses on time series analysis and forecasting using historical stock price data:

  • Data Collection: Gather historical stock price data from sources like Yahoo Finance or Alpha Vantage.
  • Time Series Decomposition: Decompose the time series data into trend, seasonality, and residual components.
  • Model Implementation: Implement models like ARIMA (AutoRegressive Integrated Moving Average) and LSTM (Long Short-Term Memory) for forecasting.
  • Evaluation: Evaluate model performance using metrics like MAE (Mean Absolute Error) and MSE (Mean Squared Error).
  • Forecasting: Generate future stock price predictions and visualize the forecasted values.
Project Link : Stock Price Prediction using Time Series Analysis

Description: Perform sentiment analysis on social media posts to classify them as positive, negative, or neutral.

Explanation: This NLP project involves text preprocessing, feature extraction using techniques like TF-IDF, and training models to classify sentiments. It’s a practical project that demonstrates your ability to work with textual data and extract meaningful insights. This NLP (Natural Language Processing) project involves several key steps:

  • Data Collection: Collect social media posts (e.g., tweets) related to a specific topic using APIs like Tweepy.
  • Text Preprocessing: Clean and preprocess the text data by removing stopwords, punctuation, and performing tokenization.
  • Feature Extraction: Extract features using techniques like TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings.
  • Model Training: Train machine learning models such as Logistic Regression, SVM (Support Vector Machine), or Neural Networks to classify sentiments.
  • Evaluation: Evaluate model performance using metrics like accuracy, precision, recall, and F1-score.
Project Link: Twitter Sentiment Analysis using Python

Description: Create an interactive data visualization dashboard to present data insights in a user-friendly manner.

Explanation: This project involves designing and implementing a dashboard that allows users to interact with the data through filters, dropdowns, and other interactive elements. It demonstrates your ability to create effective data visualizations and present data in an accessible way. This project involves designing and implementing a dashboard to make data insights accessible and interactive:

  • Data Preparation: Collect and preprocess the data to ensure it is clean and suitable for visualization.
  • Dashboard Design: Plan the layout and components of the dashboard, including charts, graphs, and interactive elements.
  • Implementation: Use libraries like Plotly and Dash to create interactive visualizations and build the dashboard.
  • Interactivity: Add features like filters, dropdowns, and sliders to allow users to interact with the data and customize their view.
  • Deployment: Deploy the dashboard on a web server or cloud platform to make it accessible to users.
Project Link: Using Plotly for Interactive Data Visualization in Python

Tips for Showcasing Your Projects

  • Documentation: Thoroughly document your projects. Include your thought process, methodologies, and interpretations.
  • Source Code: Make your code available on GitHub with a clear README file explaining the project.
  • Presentation: Create a portfolio website to present your projects. Use tools like Jupyter Notebook to narrate your analysis with code, visualizations, and markdown text.
  • Deployment: If possible, deploy your projects so that others can interact with them. Platforms like Heroku, AWS, or Streamlit Sharing can be useful for this.

By completing these projects, you demonstrate your ability to handle real-world data, apply machine learning algorithms, automate data collection, extract insights from text data, and analyze temporal patterns. Each project not only showcases your technical proficiency with Python and relevant libraries but also your problem-solving skills and ability to communicate complex results effectively.

Please Login to comment...

Similar reads.

  • AI-ML-DS Blogs
  • Data Science Blogathon 2024
  • Data Science Blogs

Improve your Coding Skills with Practice

 alt=

What kind of Experience do you want to share?

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Shiny in production 2024: full speaker lineup.

Posted on August 8, 2024 by The Jumping Rivers Blog in R bloggers | 0 Comments

case study for data science project

We are pleased to announce the full line-up for this year’s Shiny in Production conference! This year, we’re introducing a new lightning talk session. These short 5 minute talks will allow us to showcase many more uses of Shiny in Production. The conference will still feature 6 full length talks, as well as a session of lightning speakers.

Register now

Cara Thompson – Freelance Data Consultant

Data-To-Wow: Leveraging Shiny as a no-code solution for high-end parameterised visualisations

Photo of Cara Thompson

You’ve created a prototype visualisation, fine-tuned it so it looks amazing and perfectly on-brand, and turned the plot code into a function so that you can run it again on different data and highlight different aspects of the story. Others on the team have seen how good the outputs look and they want in on the magic! But they don’t want to learn R.

This talk will offer a behind-the-scenes look at the process of creating a Shiny App that functions as a black box to get straight from the data to high-end parameterised visualisations. We’ll start by looking at creating parameterised plot functions using ggplot, before exploring how to bring the data and parameterisation into Shiny to create a seamless no-code data-to-viz workflow for the users.

Gareth Burns – Exploristics

Shiny in Secondary Education: Supplementing traditional learning resources to allow students to explore statistical concepts

Photo of Gareth Burns

The Statisticians in the Pharmaceutical Industry (PSI) Schools Outreach initiative aims at promoting data literacy and statistical concepts to the next generation of Statisticians and Data Scientists. Volunteers attend secondary schools to present from specialised workshops which are designed to be interactive, engaging and aligned to the national curriculum for different age groups.

The PSI Visualisation Special Interest Group (VIS SIG) created a Shiny application to supplement an existing workshop for Asthma. This workshop aims to introduce the students to analysis of continuous data and make them think about concealing treatment assignment and consider false positive and false negative results. The application allowed electronic data capture the ability to dynamically explore their own data, re-enforcing the statistical concepts and making learning more engaging and accessible.

Each school is different in terms of class size, computer resources and student abilities, therefore the application needed to be flexible to account for this and enable independent set up by a volunteer instructor. User experience and accessibility were fundamental in the design concepts to ensure the application was appropriate for a classroom environment and data visualisation were at an appropriate level for students.

In this presentation we discuss the range of issues required to get a Shiny application being implemented by a team of volunteers into a classroom setting. This includes flexible project management for a team of volunteers, use of persistent storage to enable multiple simultaneous users and use of Shiny modules to make code flexible and scalable for future Workshops.

Cassio Felix Jardim – Data4Shiny

Creating any User Interface in Shiny: The Importance of CSS in Shaping Shiny Apps’ User Interface

Photo of Cassio Felix Jardim

The main goal of this presentation is to use CSS concepts to assist in building User Interfaces for Dashboards constructed through Programming Languages. In particular, the R language and its Dashboard creation package (shiny package).

The presentation aims to demonstrate that CSS is crucial for organizing the elements of our Dashboard on the screen and also for the aesthetic aspect of the Dashboard User Interface.

Through the concepts of CSS Flexbox and CSS Grid, the presentation will take on a tutorial format where the entire process of constructing the user interface of any dashboard will be covered from start to finish. The main idea is to consider elements of storytelling, UI Design, and UX Design in the process of building a Dashboard.

The Shiny package and its entire ecosystem include various packages that bridge the gap between Data Science and Web Design, especially languages like Html, CSS, and Javascript. Creating this “bridge” between the worlds of Data Science and Web Design is my main objective.

Katy Morgan – Government Internal Audit Agency

More than just a chat bot: Tailoring the use of Generative AI within Government Internal Audit Agency with user-friendly R shiny applications

Photo of Katy Morgan

Generative AI offers huge potential for driving creativity by suggesting new ideas and perspectives and can also improve efficiency by rapidly processing and extracting insights from large volumes of text. However, using a chatbot-style tool such as ChatGPT can be overwhelming as users have to work out, through trial and error, which questions and instructions give them the outputs they need. The Government Internal Audit Agency’s data analytics team has created two R shiny web applications, each of which simplifies the user’s experience of using generative AI by providing a user-friendly interface and implementing a set of standardised prompts. The Risk Engine walks the user through a stepwise process to explore and articulate the potential risks that might impact any given business objective. The Writing Engine enables users to analyse and generate text in several ways, including generating a draft audit report from rough notes, and summarising common themes from a set of audit reports. This presentation will cover the process of developing and deploying the web applications and the challenges we faced along the way, describing how we tailored the appearance and functionality of the apps to best meet user needs.

Keith Newman – Jumping Rivers

Title coming soon

Photo of Keith Newman

Following a PhD in statistics at Newcastle University, Keith developed software to improve road safety modelling. He enjoys creating Shiny apps and teaching the use of R.

Vikki Richardson – Audit Scotland

Faster than a Speeding Arrow – R Shiny Optimisation In Practice

Photo of Vikki Richardson

The task of optimising your R Shiny apps for great performance can be challenging. Ensuring your code is efficient, using promises where you can, caching resources, and reducing the number of widgets or reactive variables can all help. But datasets can’t be squeezed any more – or can they? By storing larger chunks of data in Arrow format and using the Arrow package for manipulation, we were able to speed up some slower computations by at least one order of magnitude – often more.

This presentation will cover a case study of migrating a financial data auditing system to Arrow data storage. Because of Arrow, we were able to drop from two Connect servers to one, making management very happy with the cost savings – and delighting our users with the new, snappier application.

Lightning talks

Yigit aydede – saint mary’s university.

Transforming Community Understanding: A Shiny Application for Real-Time Crime and Real Estate Market Insights in Nova Scotia

Photo of Yigit Aydede

This presentation showcases the Nova Scotia Property Insights (NSPI) application, a Shiny-based tool designed to provide comprehensive neighborhood insights through the integration of crime statistics and real estate market data. NSPI leverages the power of interactive maps to offer users a dynamic and engaging experience, facilitating informed decision-making for residents, potential homebuyers, policymakers, and researchers.

The core functionality of NSPI includes real-time visualization of crime data and property market trends across Nova Scotia neighborhoods. Users can select specific areas on the map to view detailed statistics within customizable radii, offering a granular perspective on local conditions. The application features a user-friendly interface with multiple tabs, including crime type comparisons, real estate market analysis, and historical data trends.

One of the key innovations of NSPI is its ability to allow users to perform side-by-side neighborhood comparisons. By simply clicking on different map areas, users can generate comparative reports that highlight variations in crime rates and property values. This feature is particularly valuable for those considering relocation or investment in Nova Scotia.

The presentation will delve into the technical aspects of developing NSPI, including data integration, user authentication, and the creation of a responsive UI. Additionally, we will discuss the challenges encountered and the solutions implemented to ensure data accuracy and user engagement.

Abbie Brookes & Jeremy Horne – Datacove

Shiny Policies: Dashboards to Aid British Government Decisions

Photo of Abbie Brookes

In collaboration with Natural England, Datacove developed a bespoke Shiny dashboard for informed government decision-making, covering Health and Wellbeing, Nature, and Sustainability (HWNS). This presentation will outline three major topics: project and data management, our approach to customization, and the route taken to enhance usability.

The first phase involved project and data management to establish clear expectations. By engaging with Natural England stakeholders, we ensured that the envisioned product met their specific needs and provided a tangible preview of the dashboard’s functionality and design. We connected to government APIs and used R to extract, process, and transform multiple sources of HWNS data, bringing this information into one place for localised decision-making.

In the second phase, we focused on customisation to ensure seamless integration with Natural England’s existing webpage. Using the brand guidelines and custom CSS/JavaScript, we ensured that the dashboard had the same look and feel as other products built outside of Shiny. This step was crucial in maintaining a cohesive user experience by complementing their established digital ecosystem. Thus, making it easy to access and increasing the likelihood of use.

In the third phase, we emphasized making the dashboard accessible to all, regardless of data literacy. We implemented user-friendly design principles, pre-calculated dynamic stats, and intuitive navigation. For example, we built interactive charts using libraries such as Leaflet and Highcharts, this ensured that comparisons were clear and easy to dynamically explore. We will demonstrate our tips for easy interactive visualisations.

Throughout the project, we adopted best practices in data interpretation and are looking forward to sharing our insights at Shiny in Production.

David Carayon – INRAE

The SK8 project: A scalable institutional architecture for managing and hosting Shiny applications

Photo of David Carayon

Introducing the SK8 Project (Shiny Kubernetes Service), where data scientists, statisticians and engineers from INRAE, the French national research institute for agriculture, food and environment, have teamed up to create a new solution for managing and hosting Shiny applications.

Shiny has become very popular in our institute, widely used for sharing, showcasing, and democratizing scientific work. However, the enduring challenge of establishing scalable, secure, and sustainable hosting for these apps had yet to be addressed.

So, after realizing that different research labs had each implemented their own local and makeshift solutions, we put on our thinking caps and decided to craft an open-source institutional solution. Our mission? Break down silos, unite the R community at INRAE, and make hosting applications easy for Shiny developers with no IT backgrounds.

The SK8 infrastructure allows to host Shiny code on a GitLab instance opened to all INRAE staff. We’ve got pipelines (GitLab CI/CD), stability ({renv}), containerization with Docker, scalability and seamless deployment in a Kubernetes cluster. All of this is developed, managed, and maintained by the SK8 team using open-source solutions.

Using SK8 is a piece of cake – just toss your application code into a dedicated GitLab project and hit the “play” button.

In this talk, we will be speaking about the project itself, the ecosystem that’s making it all happen and how you could replicate this in your own company.

Juan Ramon Vallarta Robledo – FIND

Chagas diagnostic algorithms: an online application to estimate cost and effectiveness of diagnostic algorithms for Chagas disease

Photo of David Carayon

Chagas disease, caused by the Trypanosoma cruzi parasite, is a significant public health concern in Latin America, with an estimated 6-7 million people affected and increasing incidence rates worldwide. Examining the available diagnostic tests and their cost-effectiveness is essential for improving early diagnosis, which is crucial in managing the disease and preventing severe chronic conditions. To address this, FIND, a non-profit organization dedicated to facilitating equitable access to reliable diagnosis, developed Chagaspathways to provide guidance for Chagas disease testing.

The application is entirely built using Shiny and it incorporates a separate R library ( patientpathways ), developed by FIND that contains all the analysis algorithms. It is designed to let users select different scenarios and specify parameters about the target population they are analyzing, like prevalence, testing costs, and the type of test used. The results show the recommended testing approach, the expected number of diagnosed cases, the cost per diagnosed case, along with the positive and negative predictive values. A comprehensive outcomes table is included in the results section and users have the option to download the results as an html report, to help them with further dissemination.

The Chagaspathways application is designed to be a user-friendly tool for public health professionals, recommending the most economical testing approaches to maximize resources and achieve the best results for patients and healthcare infrastructures. The application is intended to expand its scope to cover additional diseases, aiming to become an essential asset in global health initiatives for disease diagnostic modeling.

For updates and revisions to this article, see the original post

Copyright © 2024 | MH Corporate basic by MH Themes

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Exploring coral reefs in real time

Advanced Inquiry Program (AIP) graduate Zovig Minassian '24 of La Crescenta, California, wrote an article for California Classroom Science, a publication of the California Association of Science Educators (CASE).

Zovig Minassian

Advanced Inquiry Program (AIP) graduate Zovig Minassian '24 of La Crescenta, California, wrote an article for California Classroom Science, a publication of the California Association of Science Educators (CASE). In " Increasing Student Awareness of Coral Reef Conservation with Inquiry and Participatory Education " Minassian shares an ecology unit she developed for her students. The activities and lessons, which focus on climate change and coral reef conservation, target mostly high schoolers but can be tweaked to fit elementary and middle school students.

As a student in Miami's biology department, Minassian earned a Master of Arts (M.A.) in Biology through Project Dragonfly's AIP while working as a biology teacher at Clark Magnet High School.

501 E. High Street Oxford, OH 45056

  • Online: Miami Online
  • Main Operator 513-529-1809
  • Office of Admission 513-529-2531
  • Vine Hotline 513-529-6400
  • Emergency Info https://miamioh.edu/emergency

1601 University Blvd. Hamilton, OH 45011

  • Online: E-Campus
  • Main Operator 513-785-3000
  • Office of Admission 513-785-3111
  • Campus Status Line 513-785-3077
  • Emergency Info https://miamioh.edu/regionals/emergency

4200 N. University Blvd. Middletown, OH 45042

  • Main Operator 513-727-3200
  • Office of Admission 513-727-3216
  • Campus Status 513-727-3477

7847 VOA Park Dr. (Corner of VOA Park Dr. and Cox Rd.) West Chester, OH 45069

  • Main Operator 513-895-8862
  • From Middletown 513-217-8862

Chateau de Differdange 1, Impasse du Chateau, L-4524 Differdange Grand Duchy of Luxembourg

  • Main Operator 011-352-582222-1
  • Email [email protected]
  • Website https://miamioh.edu/luxembourg

217-222 MacMillan Hall 501 E. Spring St. Oxford, OH 45056, USA

  • Main Operator 513-529-8600

Find us on Facebook

Initiatives

  • Miami THRIVE Strategic Plan
  • Miami Rise Strategic Plan
  • Boldly Creative
  • Annual Report
  • Moon Shot for Equity
  • Miami and Ohio
  • Majors, Minors, and Programs
  • Inclusive Excellence
  • Employment Opportunities
  • University Safety and Security
  • Parking, Directions, and Maps
  • Equal Opportunity
  • Consumer Information
  • Land Acknowledgement
  • Privacy Statement
  • Title IX Statement
  • Report an Accessibility Issue
  • Annual Security and Fire Safety Report
  • Report a Problem with this Website
  • Policy Library

Analysis of Critical Success Factors for Implementing Project Management: A Case Study at PT ABC Company

New citation alert added.

This alert has been successfully added and will be sent to:

You will be notified whenever a record that you have chosen has been cited.

To manage your alert preferences, click on the button below.

New Citation Alert!

Please log in to your account

Information & Contributors

Bibliometrics & citations, index terms.

Social and professional topics

Professional topics

Management of computing and information systems

Project and people management

Project management techniques

Recommendations

An investigation of bi implementation critical success factors in iranian context.

Nowadays, many organizations take Business Intelligence (BI) systems to improve their decision-making processes. Although many organizations have adopted BI systems, not all of these implementations have been successful. This paper seeks to identify ...

Critical success factors (CSFs) for information technology governance (ITG)

The main studys aim is to investigate and identify the factors that encourage the successful implementation of ITG which will be called in this paper Critical Success Factors (CSFs).The literature has been thoroughly reviewed and analysed to identify ...

Correlation of critical success factors with success of software projects: an empirical investigation

Software engineering researchers have, over the years, proposed different critical success factors (CSFs) which are believed to be critically correlated with the success of software projects. To conduct an empirical investigation into the correlation of ...

Information

Published in.

cover image ACM Other conferences

Association for Computing Machinery

New York, NY, United States

Publication History

Permissions, check for updates, author tags.

  • Analytic Hierarchy Process (AHP)
  • Critical Success Factors (CSFs)
  • Importance Performance Analysis (IPA) Matrix
  • Project management
  • Research-article
  • Refereed limited

Contributors

Other metrics, bibliometrics, article metrics.

  • 0 Total Citations
  • 0 Total Downloads
  • Downloads (Last 12 months) 0
  • Downloads (Last 6 weeks) 0

View Options

Login options.

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

View options.

View or Download as a PDF file.

View online with eReader .

HTML Format

View this article in HTML Format.

Share this Publication link

Copying failed.

Share on social media

Affiliations, export citations.

  • Please download or close your previous search result export first before starting a new bulk export. Preview is not available. By clicking download, a status dialog will open to start the export process. The process may take a few minutes but once it finishes a file will be downloadable from your browser. You may continue to browse the DL while the export process is in progress. Download
  • Download citation
  • Copy citation

We are preparing your search results for download ...

We will inform you here when the file is ready.

Your file of search results citations is now ready.

Your search export query has expired. Please try again.

Grab your spot at the free arXiv Accessibility Forum

Help | Advanced Search

Computer Science > Machine Learning

Title: how transformers utilize multi-head attention in in-context learning a case study on sparse linear regression.

Abstract: Despite the remarkable success of transformer-based models in various real-world tasks, their underlying mechanisms remain poorly understood. Recent studies have suggested that transformers can implement gradient descent as an in-context learner for linear regression problems and have developed various theoretical analyses accordingly. However, these works mostly focus on the expressive power of transformers by designing specific parameter constructions, lacking a comprehensive understanding of their inherent working mechanisms post-training. In this study, we consider a sparse linear regression problem and investigate how a trained multi-head transformer performs in-context learning. We experimentally discover that the utilization of multi-heads exhibits different patterns across layers: multiple heads are utilized and essential in the first layer, while usually only a single head is sufficient for subsequent layers. We provide a theoretical explanation for this observation: the first layer preprocesses the context data, and the following layers execute simple optimization steps based on the preprocessed context. Moreover, we demonstrate that such a preprocess-then-optimize algorithm can significantly outperform naive gradient descent and ridge regression algorithms. Further experimental results support our explanations. Our findings offer insights into the benefits of multi-head attention and contribute to understanding the more intricate mechanisms hidden within trained transformers.
Subjects: Machine Learning (cs.LG)
Cite as: [cs.LG]
  (or [cs.LG] for this version)
  Focus to learn more arXiv-issued DOI via DataCite (pending registration)

Submission history

Access paper:.

  • Other Formats

license icon

References & Citations

  • Google Scholar
  • Semantic Scholar

BibTeX formatted citation

BibSonomy logo

Bibliographic and Citation Tools

Code, data and media associated with this article, recommenders and search tools.

  • Institution

arXivLabs: experimental projects with community collaborators

arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.

Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.

Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs .

Promoting Data Sharing: The Moral Obligations of Public Funding Agencies

  • Original Research/Scholarship
  • Open access
  • Published: 06 August 2024
  • Volume 30 , article number  35 , ( 2024 )

Cite this article

You have full access to this open access article

case study for data science project

  • Christian Wendelborn   ORCID: orcid.org/0000-0002-8012-1835 1   nAff2 ,
  • Michael Anger   ORCID: orcid.org/0000-0002-9328-510X 1 &
  • Christoph Schickhardt   ORCID: orcid.org/0000-0003-2038-1456 1  

164 Accesses

Explore all metrics

Sharing research data has great potential to benefit science and society. However, data sharing is still not common practice. Since public research funding agencies have a particular impact on research and researchers, the question arises: Are public funding agencies morally obligated to promote data sharing? We argue from a research ethics perspective that public funding agencies have several pro tanto obligations requiring them to promote data sharing. However, there are also pro tanto obligations that speak against promoting data sharing in general as well as with regard to particular instruments of such promotion. We examine and weigh these obligations and conclude that all things considered funders ought to promote the sharing of data. Even the instrument of mandatory data sharing policies can be justified under certain conditions.

Similar content being viewed by others

case study for data science project

Preparedness for Research Data Sharing: A Study of University Researchers in Three European Countries

case study for data science project

Data Decisions and Ethics: The Case of Stakeholder-Engaged Research

case study for data science project

Openness in Big Data and Data Repositories

Avoid common mistakes on your manuscript.

Introduction

The potential benefits of sharing research data for science and society have been widely acknowledged and emphasised. Some disciplines or sub-disciplines have a longstanding tradition and well established practices of data sharing, for instance, astrophysics, climate research and biomedical genomic research. However, despite various efforts to promote and encourage data sharing, for instance by scientific journals, it is still not common practice in most fields of the sciences. As public funding agencies have considerable influence on both the scientific communities as well as the individual researchers, the question arises whether they are morally obligated to promote data sharing. In order to answer this question, we examine the following more specific three questions from the perspective of research ethics:

Do public funders have general pro tanto moral obligations that require them to promote data sharing?

Do public funders have general pro tanto moral obligations that speak against promoting data sharing?

What pro tanto moral obligations have to be considered in the particular case of using mandatory data sharing policies, i.e., policies that require researchers to share data?

Answering these questions is a desideratum of (bio)ethical research on issues of data sharing. Although it is stated that individual researchers have a scientific responsibility (Bauchner et al., 2016 ; Fischer & Zigmond, 2010 ) and even a moral obligation to share data (Schickhardt et al., 2016 ), the moral responsibilities and obligations of public funding agencies in matters of data sharing have not been discussed systematically and explicitly from the perspective of research ethics. While it is common to postulate that funders “should” encourage data sharing or that it is their “responsibility” to do so, we want to carry out an in-depth ethical analysis of funders’ moral obligations. In doing so, we also contribute to an analysis of what funders are generally morally obligated to – another question that has thus far been rather neglected in research ethics and discussed primarily in terms of priority-setting and with regard to the general obligation to benefit society (Pierson & Millum, 2018 ; Pratt & Hyder, 2017 , 2019 ). Thus, we will provide a broader analysis of general moral obligations of funders and evaluate what they imply with regard to promoting data sharing in particular.

We proceed as follows: After some preliminary remarks in Sect. " Preliminary Remarks ", we provide a brief review of empirical data on the current status quo of data sharing in Sect. " The Current State of Data Sharing and of Promoting Data Sharing ". In Sect. " The Moral Obligations of Funders and the Promotion of Data Sharing ", we set out that funders have three general moral pro tanto obligations that require them to promote data sharing. In Sect. " Further Relevant Moral Obligations ", we examine two pro tanto obligations that both speak in favour of and against promoting data sharing. We conclude Sect. " Further Relevant Moral Obligations " by weighing all pro tanto obligations. In Sect. " Mandatory Data Sharing Policies and Academic Freedom ", we ethically assess the specific instrument of promoting data sharing by way of mandatory policies with regards to academic freedom. We conclude and summarise our arguments in Sect. " Summary and Conclusion ".

Preliminary Remarks

In the following, we use the term “research data” and “data” as referring to digital data that is collected and/or generated during a research project. We use the term “data sharing” as referring to the act of making data available for other researchers – either for the purpose of transparency of studies and replication of published research results or for the purpose of other researchers using the data for their own research questions and projects (secondary research use). Footnote 1 Data sharing is increasingly supposed to meet the requirements of the FAIR principles, i.e., data should be findable, accessible, interoperable and re-usable (Wilkinson et al., 2016 ). Data can be shared in various ways, for example via open access or restricted or controlled access, and by using Data Use and Access Committees or data sharing licenses. Footnote 2 Restricted or controlled access comes, for instance, with additional data protection requirements when personal data are involved. Data sharing activities (and data sharing policies by funders) must comply with the applicable local laws and regulations. In EU countries, the possibilities for international sharing of non-anonymous data are dependent on the EU GDPR, making personal data sharing difficult between EU countries and the US, for example. As to legal challenges to international data sharing raised by local laws, there are possible legal approaches (contracts) and technical solutions such as code-to-data approaches, when the data remains at the location of the data producer or the repository and is only analysed there on behalf of the other researcher. Footnote 3

We define public funding agencies, following the European Commission Joint Research Centre (2017), as organisational entities that distribute public funding for research on behalf of either regional, national, or transnational governments. The definition covers both i) funding agencies operating at arm’s length from the public administration and enjoying relative autonomy from the government and ii) ministries and offices within the government that fund research projects. The definition comprises of centralised and non-discipline specific agencies such as the German Research Foundation (Deutsche Forschungsgemeinschaft), de-centralised and discipline specific agencies such as the National Institutes of Health in the US or the UK Research Councils, as well as international funding agencies and bodies such as the European Commission. When we speak of research funding, we refer to funders who grant funds to individual researchers or groups of researchers (collaborative projects or research consortia). Against the background of the existing organisation of the (academic) science system with its systematic competition between researchers and the importance of scientific publications, we assume that funded researchers use the funding to seek and publish new findings and that they do so in a somehow exclusive way that does not involve the immediate disclosure of all data and results. The tendencies of competition, exclusive use of data and the pursuit of (more or less) exclusive first scientific publications of previously unknown research results are the reasons why funders' policies on sharing research data and overcoming data secrecy are important, at least at some point in the project and research cycle. Traditionally, research projects funded in this way tend to be hypothesis driven. However, as research methods, the nature of projects and the associated research funding evolve rapidly and potentially change in the era of Big Data and AI, the boundaries are blurring, and some things may change. There might be more scientific community-led research projects that are designed to be less exclusive and competitive, with community participation, immediate disclosure, and data sharing at the forefront from the start. A historical example is the Human Genome Project. Funding of such community-led research projects is not the focus of our paper, but community-led research is worth mentioning and discussing in further research.

As public funders are public (or even state) institutions and spend public money that they receive from the government, their moral obligations are related to their public and therefore political character. Our analysis of the moral obligations assumes a liberal-democratic and rights-based normative-ethical framework. To put it simply, public institutions are normatively conceived as "by the people, for the people and of the people", and citizens, including researchers, have fundamental liberal rights vis-à-vis the state and public institutions, especially negative rights that protect them from state interference. These moral rights, which play an important role in our analysis, include academic freedom and the rights to privacy and informational self-determination.

We confine our analysis in this article only to the promotion of data sharing within academic science and exclude the question of the promotion of data sharing from publicly funded academic science with private for-profit companies.

We do not limit our argument to funders that focus on a particular scientific discipline (for instance, biomedical funders), as we believe that the pro tanto obligations we will attribute to funders do not depend on the specific characteristics of particular scientific disciplines. However, we think that when applying our framework in practice, context factors that depend on the features of a certain discipline or a specific research project need to be taken into account.

Some of the following arguments for the moral pro tanto obligations of public funders can be translated mutatis mutandis to private funders, but not all of them can. Particularly those arguments that refer to the special status of public funders as public institutions that spend public money and have particular responsibilities towards the public and the rights of citizens cannot be applied to private funders. The obligations of private funders call for a separate analysis in a separate paper.

This paper presents an ethical analysis of the moral obligations of funders and is not concerned with legal rights and obligations that pertain to funders in a particular national state or legal area such as the European Union. We assume that the moral obligations presented below do not conflict with the legal requirements of (public) funders in any legal context. However, our claims that funders have a moral obligation to promote data sharing and that they should also implement mandatory data sharing policies under certain circumstances have implications for the revision of (templates for) future legally binding funding contracts between funders and funded researchers. In this respect our ethical analysis has legal implications.

We take a pro tanto obligation as an obligation that has some weight in determining what an actor morally ought to do all things considered (Dabbagh, 2018 ). Suppose I promise my friend to visit her tonight; however, my daughter is sick, and I ought to stay with her. I have then two pro tanto obligations that prescribe conflicting actions. To find out what I am obligated to do all things considered , I must find out which of the two obligations weighs heavier. Footnote 4 Therefore, when we examine pro tanto obligations that require the promotion of data sharing, these obligations must be weighed against other pro tanto obligations that speak against such promotion.

The Current State of Data Sharing and of Promoting Data Sharing

As to the current state of data sharing, there are differences across scientific disciplines (Tedersoo et al., 2021a , 2021b ). Some disciplines, such as astrophysics, climate research or genomic research, have a long history of data sharing. For instance, genomics research paved the way with the important and pioneering Fort Lauderdale (Fort Lauderdale Agreement, 2003 ) and Bermuda principles (First International Strategy Meeting on Human Genome Sequencing, 1996 ) on data sharing (Kaye et al., 2009 ) within the revolutionary and community driven Human Genome Project and has created a genomic commons, i.e., openly available data bases for genetic and genomic driven biomedical research (Contreras & Knoppers, 2018 ; National Cancer Institute; National Library of Medicine). With the exception of some more advanced scientific disciplines or sub-disciplines, the sharing of research data for purposes of transparency and secondary use still remains the exception rather than the norm in most fields and disciplines of the sciences (Danchev et al., 2021 ; Gabelica et al., 2022 ; Naudet et al., 2021 ; Ohmann et al., 2021 ; Thelwall et al., 2020 ; Watson, 2022 ; Strcic et al., 2022 ; Gorman, 2020 ; Towse et al., 2021 ). While there is an increased awareness of the benefits and importance of data sharing in all of the sciences and although various initiatives of funders and journals promote data sharing, for instance through data sharing policies, data sharing is still not common practice. Several studies report rather low rates of compliance with data sharing expectations or requirements of funders and journals (Couture et al., 2018 ; Federer et al., 2018 ; Gabelica et al., 2022 ; Naudet et al., 2018 , 2021 ; Danchev et al., 2021 ). Studies also report a gap between high in-principle support for data sharing, and low in-practice intention (Tan et al., 2021 ).

It is frequently emphasised that funders should improve and intensify their current efforts to promote data sharing. Some see the need to create incentives, for example by including a record of past data sharing as an additional criterion for the reviews of grant applications (Perrier et al., 2020 ; Terry et al., 2018 ). Since the majority of funders’ data sharing policies do not strictly require the sharing of data (Ohmann et al., 2021 ), some authors call for stronger policies with strict requirements for data sharing (Couture et al., 2018 ; Naudet et al., 2021 ; Ohmann et al., 2021 ; Sim et al., 2020 ; Stewart et al., 2022 ; Tedersoo et al., 2021a , 2021b ) Footnote 5 and contest the lack of monitoring and enforcing compliance (Couture et al., 2018 ; Kozlov, 2022 ). However, as a series of interviews shows, funders struggle to implement data sharing requirements, incentives, monitoring, and sanctions for non-compliance for various reasons (Anger et al., 2022 , 2024 ).

In consideration of the foregoing and from the perspective of research ethics, the question arises whether public funders are morally obligated to promote data sharing. To answer this question, in the next section we set out a description and analysis of funders' general moral obligations and their relevance for data sharing.

The Moral Obligations of Funders and the Promotion of Data Sharing

We will argue that funding agencies have several general moral pro tanto obligations requiring them to promote data sharing: The obligation to benefit society, the obligation to promote scientific progress as such and the obligation to promote scientific integrity. Our methodological approach consists of first introducing and explaining the individual moral obligations in order to then briefly justify them with reference to plausible and, for the most part, generally shared fundamental considerations, values or norms.

The Obligation to Benefit Society

Publicly funded research should benefit society, or, as it is sometimes put, it should have social value. Footnote 6 As a requirement for public funders, this means funders should base their decisions on considerations of social value. Barsdorf and Millum ( 2017 ) argue that funders ought to consider the social value in particular in their priority-setting, i.e., when setting goals and priorities for the research they fund. We extend the obligation to promote social value to all decisions and actions of public funding agencies. Footnote 7 Benefitting society or social value is sometimes conceptualised in terms of well-being. The concept of well-being is notoriously controversial in philosophy (as it relates to the complicated and controversial topic of the “good life”). In research ethics, the benefits at stake in the social value obligation are sometimes framed more pragmatically, for example when Resnik ( 2018b ) (following Kitcher 2001 ) states that benefits are “practical applications in technology, industry, medicine, engineering, criminal justice, the military, and public policy”, and that these applications “can also produce economic growth and prosperity”. We limit our conception of social value (benefit) to a more basic understanding (which does not include potentially problematic or controversial elements such as military and economic growth): We understand it in terms of the basic goods of health and wealth (housing, food, employment, income, etc.), infrastructure development (for communications, travel, etc.), and environmental protection (as natural resources).

What are the justifying reasons for this obligation? First of all, it must be pointed out that the obligation can be understood in different ways, depending on whether the population to be benefited is the local or the global population. Barsdorf and Millum ( 2017 ), for instance, argue that for health research the social value obligation of funders is towards the global and not the local (national) population of the funders’ country. In the literature, this question (local vs. global) is controversial. In general, the controversial positions on this question also depend on the justification one is willing to accept for the obligation. For instance, if one justifies the obligation as owed to the citizens as tax payers who finance the state and the public funder via taxes, then it is rather obvious to understand social value as benefit for the national tax paying population. In contrast, if one considers the social value obligation of funders as owed to all humans all over the world, it suggests itself to understand the social value broadly in terms of global benefit for all humans. Such a global understanding of the social value obligation could be justified with considerations of beneficence towards every human being or with a universalistic-egalitarian account of human rights. Global understandings of the obligation are likely to give priority to poor populations of the global South. We deem a combination of a local and a global understanding as being the most plausible one: funders have a primary obligation to foster social value on the national level, and an additional (weaker) social value obligation on a global level. But even this combined view raises questions and cannot be elaborated here. Most importantly for the purpose of our paper, we believe that the question concerning the understanding of the social value obligation(s) of funders (towards national vs global population or both) is not relevant for our question about the promotion of data sharing by funders. At first glance it might seem that a local reading of the social value obligation suggests that funders should promote sharing of research data only among local/national researchers. However, the contrary is much more plausible, at least for the academic sciences. Most fields of modern academic scientific research are international endeavours and advancements are achieved through multiple and interacting contributions from scientists from different countries. In most disciplines, there is no such thing as a „national current state of scientific progress “. As for sharing research data from the academic and publicly financed sciences with private for-profit companies, it might be plausible to assume that sharing data only with national companies is more likely to benefit the national population than sharing data with for-profit companies from abroad. However, this assumption can also be challenged, for example, in light of the rapid and effective development of vaccines during the covid pandemic. Most importantly, the sharing of research data from the publicly funded academic sciences with private for-profit companies is a very specific topic that we do not address in this paper. Footnote 8 As far as sharing of research data between academic researchers is concerned, it is plausible to assume: The more data are shared on a national and international level, and the more science advances – which in almost all scientific disciplines occurs as an international advancement -, the more likely national populations will benefit.

A last and more specific reason for funders’ obligation to foster social benefit is the following, which applies only to research involving humans or animals: If funders fund research that exposes animals and humans to risks and burdens, the funding can only be justified if the potential benefits for society are maximised (National Commission for the Protection of Human Subjects of Biomedical & Behavioral Research, 1978 ; World Medical Association, 2013 ). Footnote 9

The concept of social value refers to (classical and much debated) questions of distributive justice: Of all persons concerned, who should benefit how much ? Following Barsdorf and Millum, we think the obligation to benefit society, i.e., the social value obligation, should be understood according to a prioritarian account of social value. On a prioritarian account, benefits should be distributed such that the distribution (expectedly) maximises “weighted well-being” (or in our terms “weighted social benefit”), i.e. the well-being of the worse off gets some priority in the distribution of benefits.

Let's put this in the following proposition and call it the social value obligation for public funders:

Funders have a pro tanto obligation to align their decisions and actions in such a way that the research they fund maximises weighted social benefit. Footnote 10

Now, what is the relevance of the social value obligation for matters of promoting data sharing? We develop our answer step by step:

First step . Data sharing has the potential to optimise research in terms of i) progressiveness, ii) cost and iii) quality (Fischer & Zigmond, 2010 ; Sardanelli et al., 2018 ). Ad i) The sharing of research data accelerates research, enables more cooperation and collaboration between researchers and disciplines, allows for the integration and pooling of data from disparate sources into large data sets, and bears the potential for innovative research, meta-analyses and new lines of inquiry that can lead to better diagnoses and treatments. Ad ii) It reduces costs and is efficient as reusing the data increases the value of the initial investment. Ad iii) It allows research findings to be verified or reproduced based on the original data and thus increases the quality of research and potentially reduces “research waste” (i.e., research with questionably quality).

Second step . Given this efficiency-, quality- and progress-enhancing potential of data sharing, it is rational to assume that the following holds true: A world in which funded researchers share their data is better in terms of social value than a world in which funded researchers do not share their data. Notice that this holds true only under the following conditions: a) Funders must set research funding priorities according to the social value obligation. It is plausible to assume that only the sharing of data from research projects that were selected according to the right priorities (expectedly) maximises weighted social benefit. b) The funding of secondary use and decisions on data access for secondary use must be aligned to the social value obligation as well. Footnote 11

Third step . From the claim that a world in which funded researchers share their data is better in terms of social value it does not directly follow that funding agencies are obligated to promote a world in which researchers share their data, for two reasons:

If there are alternative actions than promoting data sharing that lead to a larger increase in weighed social benefit and that cannot (for cost or other reasons) be taken together with promoting data sharing, then these alternative actions should be taken. For instance, perhaps an initiative to promote translational biomedical research increases weighed social benefit more than the promotion of data sharing and the funder's budget can only finance one of the two initiatives.

Realising a world in which researchers share data comes with costs, for instance for warranting long-term storage and data availability or for incentivising data sharing. Hence, it may be that the means to realise a data sharing-world are so costly that they cancel out the benefits data sharing brings, so that realising this world does not maximise weighted social benefit and ought not to be done.

However, we think that both possibilities are very unlikely. Ad 1. We deem it highly unlikely that there are alternatives that are incompatible with promoting data sharing and more efficient in terms of social value. Ad 2. We think that the means to realise a world in which researchers share their data are not so costly that they cancel out the benefits. For instance, incentivising data sharing or making data sharing mandatory are means that can be expected to promote data sharing without being too costly. Footnote 12

Therefore, we conclude: To fulfil the social value obligation, funders pro tanto ought to promote data sharing. Footnote 13

This conclusion leaves open which specific means of promotion funders are required to take. Since there are many ways of promoting data sharing, some of which are cheaper, some of which are more effective, the social value obligation – in principle – requires a specific means of promotion. For example, incentivising data sharing (for instance, through data sharing prizes or other forms of recognition) might be cheaper but less effective, whereas mandatory policies in combination with monitoring and sanctioning might be more expensive but lead to a greater extent of data sharing. It is an empirical question which of these different means (or combination of means) maximises weighted social benefit (for each situation of each individual funder). We cannot answer this question here. For now, we confine ourselves to the conclusion that the social value-obligation pro tanto requires funders to promote data sharing and leave it open which specific means of promotion they ought to apply. Footnote 14

The Obligation to Promote Scientific Progress

In addition to the social value obligation, public funding agencies have a pro tanto obligation to promote scientific progress. Since scientific progress is likely to increase the social value of scientific research, one reason for funders’ obligation to promote scientific progress is the already discussed social value obligation. However, beyond social value there are also other reasons for the obligation to promote scientific progress and these reasons ground an independent obligation to promote scientific progress. In the following, we focus on these reasons that justify the obligation to foster scientific progress independently from social value.

In democratic countries, public funders have an obligation to promote scientific progress, i.e., the growth of (significant) scientific knowledge and understanding, Footnote 15 because it is their mandate to support a science system that is geared towards producing scientific knowledge (independently of considerations of social benefits). In most democratic countries this mandate is institutionalised on a constitutional level. In this sense, funders owe this obligation to the (democratic) public and the citizens.

There is a set of further reasons that justify the obligation of funders to support the scientific system and foster scientific progress with considerations of the value of scientific knowledge and progress. The value of science and scientific progress touches on complex questions about whether knowledge is valuable in itself and/or (only) insofar as it is somehow conducive to realising other values or ends. We do not want to take a position here on the hotly contested question about whether scientific knowledge (or progress) are intrinsically valuable (end in itself). Footnote 16 We just want to point to the aspects of knowledge that make knowledge instrumentally valuable apart from its instrumental value for the benefits of society. i) Scientific knowledge can be instrumentally valuable when it satisfies “human curiosity” (Kitcher, 2001 ) and the desire for a practically disinterested understanding of the natural world. ii) Scientific knowledge is a precondition and a contributory factor for the ability and freedom of “pursuing our own good in our own way” (Mill, 2008 ) and making reflective decisions about the goals of our own lives. By expanding our understanding of the world and our place in it, scientific progress can contribute to the exercise of this elementary freedom and can thus be seen as valuable for a self-governed and autonomous life (Kitcher, 2001 ; Wilholt, 2012 ). iii) scientific knowledge and progress is valuable for a functioning democracy insofar as (growth of) knowledge is a requirement for processes of informed deliberation, opinion-forming and decision-making (Brown & Guston, 2009 ). Now, this set of three reasons (i-iii) could be understood as reflecting not only the values and interests of the citizens (or tax payers) of the funder’s country, but also the values and interests of all people all over the world. Although it is plausible to some extent that the three reasons also reflect values or interests of people around the world, we do not think that this can establish a relationship in terms of strong moral rights and obligations between the global population and the local funder. Due to the rather loose relationship between persons in each country of the world on the one hand and the local state and funder on the other hand, only rather weak reasons for funders to promote scientific progress could result from the global understanding of the three reasons.

So far, we have argued that the obligation of funders to promote scientific progress is primarily owed to the public and the citizens (and rather weakly to the global population). But of course the question arises whether funders owe the promotion of scientific progress also to scientists or the scientific community. We think that this is the case. Scientists have the professional obligation to strive for scientific knowledge and progress. To fulfil this professional obligation, they depend on the scientific system in which funders play an important role. Scientists need a functional system that is designed to enable and promote scientific progress. Therefore, it is plausible that funders owe the obligation to promote scientific progress to the scientists as well.

We take the scientific progress obligation as follows:

Funders have a pro tanto obligation to align their decisions and actions such that the research they fund maximises scientific progress.

What relevance does this obligation have when discussing funders’ role in promoting data sharing? First and in general terms, this obligation to maximise scientific progress does not necessarily require funders to exercise intensive control and strong intervention in science. Keeping funders largely out of the methodological and content-related decisions of researchers is plausibly conducive to a functioning and progress-making scientific system. However, specific measures or interventions on the part of funders (for instance through policies) might have the potential to promote scientific progress. The promotion of data sharing plausibly is such an intervention: As we argued in Sect. " The Obligation to Benefit Society ", a scientific system in which researchers share their data can be expected to be a more efficient, effective, and innovative scientific system, and this means that it is also a better system in terms of scientific progress than a system in which researchers do not share data. Funders can contribute to realising such a system through various means (such as, for instance, data sharing policies) and thus promoting scientific progress.

However, as it is the case of the social value obligation as well, it does not follow directly that funders are obligated to promote data sharing. This depends on whether there are other means than promoting data sharing which are more conducive to scientific progress (and which cannot be taken together with the promotion of data sharing). Again (as with the social value obligation), this is an empirical question that we cannot answer here. Nonetheless, we think it is plausible to assume that promoting data sharing is an effective and efficient means to promoting scientific progress and that it is rather unlikely there are other more efficient and effective actions or means, which, at the same time, are incompatible (for cost or other reasons) with the promotion of data sharing. Footnote 17

Accordingly, to fulfil their moral obligation to use the resources at their disposal to maximise scientific progress requires them to promote data sharing.

The Obligation to Promote the Epistemic Integrity of Research

Public funding agencies have an obligation to promote the integrity of the research they fund—a view which is widely held (Bouter, 2016 , 2018 , 2020 ; Mejlgaard et al., 2020 ; Titus & Bosch, 2010 ), but not systematically developed and justified. To give a more detailed account of this obligation, we start with clarifying the concept of research integrity.

Research integrity relates to a set of professional norms and obligations that morally regulate and prescribe how researchers ought to conduct research. These norms and obligations can be differentiated between epistemic and socio-moral norms and obligations . Footnote 18 Epistemic norms or obligations are grounded in the goals or nature of science (Resnik, 1998 ), i.e., (roughly) the goals to obtain knowledge and understanding through reliable methods of inquiry. These obligations prohibit misconduct that is problematic from the point of view of epistemic rationality . Epistemic obligations are, for instance, the obligation not to fabricate, falsify, or misrepresent data. Epistemic obligations form what one might call epistemic research integrity . We take epistemic research integrity to be mainly about avoiding practices that lead to deception, inaccuracy, and imprecision in research and (the presentation) of research results. We thus follow Winter and Kosolosky ( 2013 ), who explicate the notion of epistemic research integrity by drawing on the property of deceptiveness and “define the epistemic integrity of a practice as a function of the degree to which the statements resulting from this practice are deceptive.”

Socio-moral obligations result from the fact that research can negatively affect the rights and interests of individuals or groups outside science. Such non-epistemic obligations take into account general responsibilities and potential effects of science for society and humanity and comprises, for example, obligations to obtain consent and to minimise risks for participants and third parties. These socio-moral obligations constitute what one might call socio-moral research integrity .

In the following, we focus only on epistemic research integrity and investigate whether funders’ obligation to promote epistemic research integrity implies that they ought to promote data sharing. We briefly address the relationship between data sharing and socio-moral research integrity in Sect. " Further Relevant Moral Obligations ".

The promotion of epistemic research integrity is required by the two above mentioned obligations of funders to promote social value and scientific progress since epistemic integrity arguably furthers social value and scientific progress or is even a prerequisite for them. Now, the goal of this section is to show that there are reasons independent from social value and scientific progress that ground or justify an obligation of funders to maximize epistemic research integrity. There are two reasons for this as an independent obligation in its own right:

Funders should promote the epistemic integrity of research for two reasons. 1. As public funders are either governmental institutions or at least spend public money, they should ensure that the activities they finance abide by professional norms and standards. Funders are not supposed to spend public money on activities where “anything goes” but rather fund activities and work that are lege artis . This is owed to the citizens and taxpayers and required by the recognition of the value of a rules-based scientific system. 2. Funders must guarantee a fair and rule-based research environment and competition. This is primarily owed to the scientists, among other things, to protect the honest and bona fide researchers against unfair and dishonest competitors.

In the following, we take the obligation of funders to promote epistemic research integrity as follows:

Funders have a pro tanto obligation to align their decisions and actions such that they maximise the epistemic integrity of research.

What does the obligation to promote epistemic research integrity imply for the question of whether funders ought to promote data sharing? To answer this question, we must investigate whether data sharing is required by epistemic research integrity.

To begin, we must differentiate between two different perspectives on epistemic research integrity. One perspective can be labelled as normative - philosophical and takes research integrity as a set of philosophically justified norms. The other perspective can be labelled as the community consensus perspective and takes research integrity as a set of norms that are agreed on and prescribed by the scientific community and that are codified in statements and codes of conduct by scientific societies and associations. These two perspectives usually do not display great discrepancies in terms of concrete norms of research integrity, but in principle they are not necessarily congruent. For reasons of space, we cannot give a systematic answer to the question of which of the two perspectives takes normative priority when they have conflicting norms and prescriptions. However, in the following we first examine the relationship between epistemic integrity and data sharing from a philosophical perspective and then describe how this relationship is treated in relevant codes of conduct and guidelines on research integrity. We will show that the two perspectives converge to some extent, and where they do not clearly converge, we will explain what this means for funders. We will do this in turn for data sharing for transparency (A.) and data sharing for secondary use (B.). Footnote 19

A. Epistemic Integrity and Data Sharing for Transparency

1. Philosophical perspective : Philosophers of science consider practices that enable “each scientist to scrutinize the work of others in his field, to verify and replicate results [and that make] it more likely that flaws will be uncovered” (Haack, 2007 ) to be prescribed by an important epistemic norm. The pertinent norm here is what David Resnik calls the “principle of openness” (Resnik, 1998 ) or what Susan Haack calls the epistemic norm of “evidence-sharing” (Haack, 2007 ). According to this understanding, practices of evidence-sharing enable collective efforts of communicating, reviewing, critiquing, and reproducing the evidence claimed by researchers as supporting their scientific claims and research results, i.e., evidence “which includes the methodology applied, the data acquired, and the process of methodology implementation, data analysis and outcome interpretation” (Munafò et al., 2017 ). Footnote 20 The sharing of evidence is a necessary condition for science as rational communication and argumentation and a requirement for efforts of reviewing and assessing scientific claims. Evidence-sharing can thus be understood as part of an organized skepticism Footnote 21 that increases the credibility of scientific claims and characterises (the ideal of) modern science as a specific social and cooperative enterprise. Following Winter and Koslovsky (2013), the principle of openness and the norm of evidence-sharing can be understood as prescribing practices that prevent and guard against deceptiveness.

One of these practices is arguably data transparency, i.e., transparency with respect to data on which an already published scientific paper is based. We want to explicate at least two reasons for why data transparency is an important norm of evidence-sharing and openness.

Data sharing as a prerequisite for replication . It is widely agreed that replication studies have epistemic value and are an essential and important part of scientific practice at least in a substantial part of the quantitative empirical sciences. Even those who caution against the crisis narrative in connection with failed replications or even doubt the epistemic value of replications for all disciplines (Leonelli, 2018 ) agree with this proposition. However, a precondition and minimal requirement for conducting replication studies is that the original studies can be (computationally or analytically) reproduced , that is, the published findings can be reproduced when the reported analyses are repeated upon the raw data (Hardwicke et al., 2021 ; Nuijten et al., 2018 ; Peels & Bouter, 2021 ). If a result cannot be reproduced, there is no need to even attempt a replication – since something with the analysis or the data must have gone wrong. Therefore, if we agree that efforts to replicate should be enabled and encouraged (due to its important epistemic value for research), then we must also recognise the importance of data transparency.

Data sharing as means for preventing and detecting breaches of epistemic integrity. Although the empirical evidence about the prevalence of scientific misconduct and questionable research practices (QRP) should be handled with care, studies suggest that it is non-negligible. For instance, a survey among researchers in The Netherlands found that “over the last three years one in two researchers engaged frequently in at least one QRP, while one in twelve reported having falsified or fabricated their research at least once” – with the highest prevalence estimate for fabrication and falsification in the life and medical sciences (Gopalakrishna et al., 2022 ). Similarly worrisome results with regard to different forms of questionable research practices or misconduct are reported in (Boutron & Ravaud, 2018 ; John et al., 2012 ; Kaiser et al., 2021 ). Footnote 22 Additionally, we think that it is not entirely unreasonable to assume that the widespread lack of transparency (particularly the much-reported difficulties of obtaining data even after personal requests) is at least somewhat indicative of a non-negligible prevalence of scientific misconduct and questionable research (data) practices. Footnote 23

The possibility of keeping data opaque enables misconduct or at least makes it more difficult to detect it. As data transparency makes it easier to detect (at least some forms of) fraud and questionable research practices and can function as a deterrent (Fischer & Zigmond, 2010 ; Gopalakrishna et al., 2022 ; Hedrick, 1988 ; Winter & Kosolosky, 2013 ), we argue that data sharing for transparency can help prevent and detect unethical scientific practices.

Since data transparency is a prerequisite for reproducibility and a means for preventing and detecting misconduct and questionable research practices, we conclude that there are good (normative-philosophical) arguments for taking data sharing for transparency as an important requirement of epistemic research integrity.

2. The community consensus perspective: The scientific community also sees data sharing as an important part of epistemic integrity (All European Academies ALLEA, 2017 ; Deutsche Forschungsgemeinschaft (DFG), 2019 ; Kretser et al., 2019 ; National Academies Press (US), 2017 ; Netherlands Code of Conduct for Research Integrity, 2018 ; Resnik & Shamoo, 2011 ; World Conference on Research Integrity, 2010 ). However, most of these guidelines and codes of conduct do not explicitly differentiate between epistemic and socio-moral integrity of research and many do not clearly differentiate between the purposes of data sharing (i.e., the purposes of transparency and secondary use). Therefore, we must deduce from the context what the respective statements refer to. We cannot do this in a systematic way here. But our impression is that many documents emphasise the values of transparency and honesty and explicitly or implicitly refer to these values when they state the importance of data sharing for research integrity. It thus seems there is a (international and trans-disciplinary) consensus that data sharing for purposes of transparency is a part of epistemic integrity. For example, the Netherland Code of Conduct explicitly connects data availability with the value of transparency, and the German DFG also explicitly refers to data sharing for the purpose of confirmability (“Nachvollziehbarkeit”).

Hence, both perspectives—the normative-philosophical and the community consensus perspectives—support the proposition that data sharing for transparency is an important component of epistemic research integrity.

B. Epistemic Integrity and Data Sharing for Secondary Use

Philosophical perspective : While data sharing for transparency clearly falls within the scope of epistemic research integrity, the same cannot be said about data sharing for secondary use. Since we follow Winter and Kosolosky ( 2013 ) and “define the epistemic integrity of a practice as a function of the degree to which the statements resulting from this practice are deceptive”, we believe that data sharing for secondary use is not part of epistemic research integrity. Although one might argue that secondary use of data has the potential to correct for misleading or deceptive statements from original studies, we think that the main importance of sharing data for secondary use is that it promotes scientific progress and social value. Data sharing for secondary use is of rather secondary importance when it comes to correcting misleading scientific statements or results. It does not seem to be a strict requirement of epistemic integrity but more of a supererogatory practice. Therefore, from a philosophical perspective, the promotion of data sharing for secondary use is not required by the obligation to promote epistemic research integrity

Community consensus perspective : Only a few guidelines and codes of conduct explicitly state that data sharing for secondary use is a requirement of research integrity (for instance, DFG, 2019 ). Many do not mention data sharing for secondary use explicitly, and some do not even seem to consider it implicitly. Thus, there does not appear to be a clear and unambiguous international consensus on the relationship between data sharing for secondary use and epistemic integrity. And since most of these documents do not differentiate explicitly between epistemic and socio-moral integrity, it is not clear whether data sharing for secondary use is considered as important from an epistemic perspective or from a non-epistemic, socio-moral perspective. Footnote 24

Therefore, from a community consensus perspective there is no clear consensus that data sharing for secondary use is a requirement of (epistemic) research integrity. From this perspective then, the obligation of funders to promote epistemic research integrity does not require the promotion of data sharing for secondary use. However, if there are specific disciplinary or national communities that explicitly take data sharing for secondary use as part of research integrity, those funders for whom this consensus is pertinent might have a reason to promote this kind of data sharing with reference to the obligation to promote research integrity. This holds true even though from a philosophical perspective data sharing for secondary use is not a part of epistemic research integrity: If the pertinent community takes data sharing for secondary use as part of (epistemic) integrity, funders might take this as a reason to promote it. Footnote 25

Therefore, and to conclude this whole Sect. " The Obligation to Promote the Epistemic Integrity of Research " about research integrity: Since funders have the obligation to promote epistemic research integrity, and since data sharing for transparency is an important part of epistemic research integrity, funders pro tanto ought to promote data sharing for transparency. From a philosophical perspective, epistemic research integrity does not require data sharing for secondary use, and from a community consensus perspective it is clearly considered as part of epistemic integrity only in a few cases of specific scientific communities. Therefore, a universal obligation for funders to promote data sharing for secondary use cannot be derived from considerations of epistemic research integrity.

Further Relevant Moral Obligations

In this section, we present two further obligations that partially speak in favour of funders promoting data sharing and partially against it. After presenting these various obligations in the following, we will close the section by weighing all pertinent obligations of funders and come to an all things considered judgement.

Funders have a pro tanto obligation to respect the rights of individuals and to not harm human or non-human beings, which includes the obligation to not induce, cause or increase risks of harm and of rights violations. This includes the obligation to respect privacy and informational autonomy of data subjects and not to induce, cause or increase informational risks or harms This obligation is part of the obligation to promote the socio-moral integrity of funded research and it speaks both in favour and against the promotion of data sharing:

As data sharing reduces the need for ever-new data collection, data sharing also reduces the amount and frequency of research procedures in interventional and non-interventional studies that carry risks for participants (Fischer & Zigmond, 2010 ). Hence, in this regard the obligation to respect the rights of persons and to not harm anybody speaks in favour of funders’ promoting data sharing.

The sharing of research data and its ensuing secondary use increases informational risks for data subjects. Prima facie , this speaks against the promotion of data sharing. However, if subjects are informed about these risks and give consent to the usage of their data despite these risks, this increase of informational risks does not represent an infringement of the obligation not to harm. Volenti non fit inuria . Thus, the risks do not speak against funders promoting data sharing if consent is obtained in funded research. Of course, this argument raises the question of a model that offers research subjects appropriate information and opportunities to consent or reject consent and, at the same time, allows for data sharing without causing unreasonable practical burdens or hurdles (Manson, 2019 ; Mikkelsen et al., 2019 ; Ploug & Holm, 2016 ). We deem that broad consent, if combined with a normative and technical governance framework and data protection measures, is an appropriate information and consent model. In order to meet their obligation to respect the rights of data subjects, funders should thus recommend that broad consent be embedded in appropriate normative and technical governance frameworks. Irrespective of the question of informed consent, informational risks exist due to data misuse and data breaches. Erlich & Narayanan, 2014 ; Hayden, 2013 ; Homer et al., 2008 ; Levy et al., 2007 have shown how different techniques could be used for breaching (particularly genetic) privacy. These risks pro tanto speak against the promotion of the sharing of personal data.

The pooling of data from different sources and the use of big data methods enables predictions about sensitive information regarding persons or groups other than the original data subjects (Mühlhoff, 2021 ). Some authors warn that this increases risks of stigmatisation and discrimination of marginalised groups (Favaretto et al., 2019 ; Reed-Berendt et al., 2022 ; Xafis et al., 2019 ). Promoting and accelerating data sharing and secondary use expand the opportunities for pooling and big data and thus might increase these risks. Thus, in this regard the obligation to minimise risks of harm speaks against the promotion of data sharing.

Funders also have a pro tanto obligation to increase public trust in science and research funding. This obligation partly speaks in favour and partially speaks against the promotion of data sharing. On the one hand, as data sharing promotes transparency and accountability, it can increase and consolidate public trust and confidence in science and research funding. Hence, in this respect funders ought to promote data sharing in order to promote public trust. On the other hand, since promoting data sharing increases risks for privacy and creates challenges for informational self-determination, concerns about these risks and challenges might reduce trust in the research system (Platt et al., 2018 ; Ploug, 2020 ). Hence, in this respect funders ought not to promote data sharing in order to promote or maintain public trust.

However, the extent to which the two obligations (not to harm and respect rights and to foster public trust) speak against the promotion of data sharing can be significantly reduced. In fact, funding agencies can and should do various things to minimise or prevent the pertaining risks:

They should fund technological as well as ethical, legal, and social research (ELSA-research) on practical solutions for data security and privacy protection with a particular view on problems and risks resulting from big data and machine learning.

Funders should promote research on data augmentation and synthetic data as potential approaches to handle limitations to data sharing due to risks for data subjects.

They should finance and promote data infrastructures and archives or repositories that can guarantee data privacy and security and require funded researchers use these trusted repositories.

Funders should fund the development and implementation of data access committees that take into account the aforementioned risks resulting from secondary use.

Funders should support data stewardship infrastructures that convey “a fiduciary (or trust) relation” that also takes into consideration the rights of patients and participants (Rosenbaum, 2010 ).

Funders should develop principles and provide best practices that support and enable researchers to provide appropriate forms of consent with regard to data sharing. They should create a framework for protecting the privacy of research participants that provides guidance on how participant information and (broad) consent forms are to be designed.

Funders should provide standards and best practice for contracts between data producers, repositories, and data re-users with special attention to data protection and security. Footnote 26

In all of the aforementioned measures, the participation and inclusion of patient representatives should be promoted and enabled. Footnote 27

Funders should require researchers to reflect upon and identify potential risks (early in the process) by creating a data management plan, elaborating how they address and intent to deal with or avoid these risks.

If the pertaining risks are addressed and thus comparatively small, the trust obligation and the not to harm obligation rather speak in favour of the promotion of data sharing or at least have no significant weight against data sharing. Even if they keep having some limited weight against data sharing, they are outweighed by the obligations in favour of promoting data sharing, i.e., the obligations of social value and scientific progress. Footnote 28 Of course, the more funders encourage and press researchers to share persona-related (non-anonymous) data, the more they are responsible for the impact of their policies on data subjects and the more they have to support researchers in protecting data subjects’ informational rights and privacy and this increases the financial and administrative costs and burdens for funders. However, we do not think that this outweighs the benefits in terms of social value and scientific progress.

The main conclusion of Sects. " The Moral Obligations of Funders and the Promotion of Data Sharing " and " Further Relevant Moral Obligations " is thus: Although there are two pro tanto obligations that speak against the promotion of data sharing by public funders, the pro tanto obligations in favour of the promotion weigh heavier (provided that the mentioned risk reducing measures are implemented). Public funders thus have an all things considered obligation to promote the sharing of data of funded researchers. Footnote 29

Mandatory Data Sharing Policies and Academic Freedom

Up to this point, we have not directly commented which means funders ought to use to promote data sharing. As we said, it is an empirical question what follows from the pro tanto obligations of funders to promote data sharing in terms of specific means to promote it.

However, in the following we want to examine a question with regard to a specific means of promoting data sharing: Mandatory Data Sharing Policies . Funders are increasingly advised to adopt policies that require data sharing (Sim et al., 2020 ). The NIH as a major funder has been setting standards in implementing such policies for years now and has recently implemented a new mandatory data sharing policy (Kozlov, 2022 ; National Institutes of Health, 2022 ). We find it plausible that such policies are comparatively effective and efficient. Mandatory data sharing policies can be designed at least with two different objectives: 1. They can require only the sharing of data that is the evidential basis for an already published paper and only for purposes of transparency and confirmability. 2. Or they can additionally require the sharing of data (either publication-related or all data that are generated during a research project) for purposes of secondary use.

As mandatory data sharing policies of public funders restrict the individual freedom of funded researchers (at least if these are dependent on third-party funding), the question arises whether such policies conflict with academic freedom . Do data sharing requirements implemented by public funders infringe on the academic freedom of individual researchers? Footnote 30

To answer this question, we have to clarify what academic freedom is and what it protects. From a philosophical perspective, Footnote 31 academic freedom is first and foremost the negative right of individual researchers against external intervention in their scientific work and decision-making. Academic freedom mainly concerns the freedom to choose research questions, theories, and methodologies as well as publication venues independently from outside intervention, in particular state intervention . This negative right of freedom from intervention of researchers thus corresponds to the negative duty of the state not to intervene.

As public funders are (semi-)governmental institutions whose funding comes predominantly from government budgets and on whose boards government representatives participate in decision-making, the following holds: Public funders have the negative obligation to respect the negative right to academic freedom of researchers.

The question now is whether mandatory data sharing policies violate the negative right of researchers to academic freedom? To answer this question, we must determine in more detail the scope of protection of academic freedom. From our perspective, the scope of protection of academic freedom includes only actions of researchers that are not violations of crucial and basic norms of epistemic research integrity. Such crucial and basic norms determine fundamental requirements of science and research as a specific kind of rational practice and communication. For instance, researchers that engage in data fabrication or falsification fail to meet such fundamental requirements. They thus violate crucial and basic norms of research integrity and engage in behaviour that is not protected by academic freedom.

Hence, we must answer the following questions:

Is the omission (or refusal) to share data that are the evidential basis of published research results for purposes of transparency and reproducibility a violation of fundamental requirements of scientific work and communication?

Is the omission (or refusal) to share data (either publication-related or all data that are generated during a research project) for purposes of secondary use a failure to meet such fundamental requirements?

Ad 1 : We believe that not sharing data that underlies research results (a published paper) for purposes of transparency is a violation of the fundamental requirements of scientific work and communication. This is clearly the case from the philosophical perspective on research integrity. Although not sharing data that underlies a published paper for transparency seems to be a less severe scientific misconduct as data fabrication or falsification, it clearly runs counter to one of the basic requirements of scientific communication and (collective) truth-seeking: To make one’s own scientific work transparent and reproducible. There is no reasonable justification for why researchers should be generally free to avoid that their published (!) work can be reviewed in all its parts. Footnote 32 We believe the philosophical perspective is backed by the perspective of the consensus of the scientific community. The community recognises data sharing for transparency as a key requirement of epistemic research integrity. Almost all codes of conduct and guidelines on research integrity emphasise the close relation between honesty, reproducibility, and data transparency (see Sect. " The Obligation to Promote the Epistemic Integrity of Research ". A). Therefore, research without sharing data for transparency is not protected by academic freedom. Thus, mandatory policies that require data sharing for transparency do not infringe on the right to academic freedom of individual researchers.

Ad 2 : In Sect. " The Obligation to Promote the Epistemic Integrity of Research ", we have already noted that neither in the community consensus nor in the normative-philosophical perspective data sharing for secondary use is a requirement of epistemic integrity. This means that the freedom to share or not to share data for secondary use is within the scope of protection of academic freedom. However, data sharing requirements for secondary use of public funders are not necessarily an infringement of academic freedom. First, it depends on how much researchers must rely on third-party funding in their research. If they have access to basic financial resources of their institutions and are not dependent on applying for additional public funding, then such requirements are not a restriction of their academic freedom. Second, data sharing requirements of public funders that enjoy relative autonomy from government and whose decisions are essentially made by scientists themselves do not represent state coercion but rather self-determination of the scientific community. However, academic freedom does not only protect individuals against state intervention but also against infringements through (parts of) the scientific community. Thus, data sharing requirements of public funders with state autonomy (and also those without state autonomy) do represent an infringement of academic freedom (at least for researchers that depend on their funding), though not in the classical sense of state infringements.

It must be noted, however, that this infringement of academic freedom is a fairly small one. The freedom to share or not to share data for secondary use does not belong to the core of academic freedom. The core arguably is the freedom to “follow a line of research where it leads” (Russell, 1993 ), i.e., the freedom to choose research questions, theories, and methodologies as well as publication venues independently from outside intervention. Nonetheless, it is an infringement, but we believe it can be mitigated by the following measures: Funders can a) offer the possibility for a justified exception from data sharing requirements (for instance, for reasons of data protection or dual use risks), b) allow for an embargo period in which the funded and data producing researcher has the exclusive privilege to use their data, c) consider discipline-specific standards for data management and sharing, and d) compensate for burden and costs financially (for instance, for fees of repositories for long-term storage or for data protection measures) and through investments in and supply of technical and administrative support (for instance, digital privacy and security safeguarding solutions and best practices). If funders implement measures like these, the infringement of academic freedom through mandatory data sharing policies becomes so small that it can be justified with reference to the other pro tanto obligations of funders, namely the obligations with respect to social value, scientific progress, and the minimisation of harm. Footnote 33

However, the justifiability of the infringement on academic freedom through mandatory data sharing policies is dependent on a further condition: Mandatory policies can only be justified, if there are no measures of promoting data sharing that are more effective and less invasive in terms of academic freedom.

A last word on the implications of the diagnosis that policies that require data sharing for secondary use infringe on the academic freedom of researchers: If public funders infringe on the academic freedom of researchers with reference to the benefits of data sharing, they have the responsibility to ensure that these benefits are realised. This requires two things of them: 1. Since the benefits of data sharing only materialise if reproduction and replication as well as secondary use are actually carried out, funders should fund appropriate projects. They should finance and reward reproduction and replication studies and set up a funding programme for secondary research. 2. Funders should fund research and monitoring on whether their own initiatives to promote data sharing are i) effective in terms of actual data sharing and ii) actually lead to the hoped-for benefits.

Summary and Conclusion

In this paper, we investigated the question whether public funders have a moral obligation to promote the sharing of research data generated in funded research projects. More specifically, we asked which of funders general moral obligations speak in favour of and which of these obligations speak against the promotion of data sharing. We draw the following conclusions: First, public funders have several general pro tanto obligations that (under certain conditions) require funders to promote data sharing. The main obligations are the social value-, scientific progress- and epistemic research integrity-obligation. Second, in the assessment of pro tanto obligations against promoting data sharing, we argued that – provided that funders take measures to minimise the risks for research subjects and third parties – the obligations in favour of promoting data sharing outweigh the obligations against. Therefore, we concluded with respect to our overall research question that public funders ought to promote data sharing all things considered .

With respect to our third specific research question whether mandatory data sharing policies are an ethically justifiable means of promoting data sharing, we argued: First, the scope of protection of academic freedom does not cover the omission or refusal to share data for purposes of transparency. Requirements to share data for the purpose of transparency therefore do not violate academic freedom. Second, the scope of protection does cover the omission or refusal to share data for secondary use, therefore requirements to share data for secondary use violate academic freedom to a small extent (at least for researchers that are dependent on public funding). However, such requirements and thus the violation of academic freedom can be justified with reference to the other pro tanto obligations that public funders have.

Sometimes research data can only be re-used when research methodologies that have been used to collect, generate and analyse the data (questionnaires, analytical codes, etc.) are shared as well (Goldacre et al., 2019 ). Thus, sharing these methodologies and other intermediary resources might equally be important as sharing the data themselves. However, due to some disanalogies between data and those resources (most saliently the fact that some of the latter can be seen as intellectual property), we confine our discussion here to research data.

The Creative Commons set of licenses are the most commonly used for sharing research data. These licenses are designed to be open, which means that data can be freely reused without requiring explicit permission as long as the terms of the license are adhered to. Such licences can be a good and efficient way of reducing costs and burdens for data sharing, while they may have limited applicability in cases of person-related data or for researchers who wish to retain control over the subsequent use of the data they produce.

If regulatory considerations limit the sharing of data generally or on an international level, the generation of synthetic data can be an alternative. However, (sharing of) synthetic data can only complement but not fully replace (the sharing of) non-synthetic data.

Pro tanto obligations are what David Ross ( 1930 ) called prima facie obligations. In line with the established terminology “ pro tanto and all things considered moral reasons” (Alvarez, 2016 ), we chose to deviate from Ross’ terminology, see also Hurtig ( 2007 ) and Dabbagh ( 2018 ).

Similar claims are made with regard to journal policies in Federer et al. ( 2018 ) and Gabelica et al. ( 2019 ).

For the debate about the social value requirement, see Barsdorf and Millum ( 2017 ), Pierson and Millum ( 2018 ), Resnik ( 2018a , 2018b ), Wendler and Rid ( 2017 ), Wertheimer ( 2015 ).

See also Bierer et al. ( 2018 ).

For further discussions on this topic see Winkler et al. ( 2023 ).

Notice that the references only state that research must have sufficient benefits for society in order to be justified if it exposes participants to risks. However, we find this implausible and believe that it has to maximise benefits. For it seems questionable to choose project A over the alternatively fundable projects B and C, if it can be expected that either project B or C have more social benefit than A.

Notice that this obligation does not require a short-sighted restriction to immediate benefits and “mere” application-oriented research but will plausibly take into account basic research that enables long-term fruitful and sustainable research by exploring fundamental causal mechanisms. Otherwise, maximisation would hardly be possible.

These conditions also secure that data sharing of funded projects does not facilitate the exploitation or extraction of resources from the underprivileged to the privileged or to private corporations and does not promote epistemically biased research. See Leonelli ( 2023 ) for examples of such detrimental effects of data sharing.

This issue of the cost–benefit balance of promoting data sharing is also pertinent for all other obligations we will discuss below. We will not mention it again though and assume for the rest of the paper that the benefits of promoting data sharing are greater than the costs.

Bierer et al. ( 2018 ) also argue that funders ought to promote data sharing in order to advance the social value of research. Notice that this obligation might be stronger or weaker for particular research fields or specific data. For instance, the social value of sharing particular health data in a pandemic or biomedical data in general is presumably bigger than the social value of the sharing of archaeological data about a particular Egyptian pharaoh.

We believe that on any other plausible account of social value, i.e., on any plausible distributive principle funders ought to promote data sharing and fund research that has social value. For instance, a utilitarian account of social value will give us the same conclusion.

On the notion of scientific progress and “significant” knowledge, see Bird ( 2007 ), Kitcher ( 2001 ), Niiniluoto ( 2019 ).

For the view that scientific knowledge has intrinsic value, see for instance Schwartz ( 2020 ).

What holds for the social value obligation also holds for the obligation to promote scientific progress: Depending on the particular research field and the particular data the obligation to promote data sharing in order to promote scientific progress is stronger or weaker. The sharing of particular (kinds of) data might bear more potential to promote scientific progress while the sharing of other (kinds of) data might bear less potential.

For different terminologies for both kinds of obligations, for instance internal vs. external norms, see Resnik ( 1996 ) and Reydon ( 2013 ). For an attempt to differentiate the justificatory grounds for the various kinds of obligations of scientists see Resnik ( 1998 ).

For a legal analysis of the relation between (semi-)governmental promotion of data sharing and good scientific practice in the context of German constitutional law see Fehling and Tormin ( 2021 ).

Strictly speaking, evidence is that which confirms or disconfirms a scientific claim, i.e., data. A methodology or an analysis is not evidence in this sense. However, we stick to the understanding of Munafò above because the sharing of evidence in his sense is required by the norm of evidence-sharing. At least we think that Haack has this in mind.

Robert Merton ( 1942 /1973) famously introduced this term in his description of “the normative structure” and the “ethos of science”, see also Ziman ( 2009 ).

Although Fanelli ( 2018 ) doubts that misconduct has a major impact on the scientific literature, she agrees that it is non-negligible.

For instance, Tsuyoshi Miyakawa ( 2020 ) reports the results of analyses on the manuscripts that he has handled as Editor-in-Chief of Molecular Brain as showing that “more than 97% of the 41 manuscripts did not present the raw data supporting their results when requested by an editor, suggesting a possibility that the raw data did not exist from the beginning, at least in some portions of these cases”.

The DFG Guideline ( 2019 ) is arguably a guideline exclusively for epistemic research integrity, and it is thus reasonable to assume that the explicit inclusion of data sharing for secondary use means that it is considered to be an epistemically required practice. However, the ALLEA code (ALLEA 2017 ), as some other codes, is not exclusively focused on epistemic integrity as it includes socio-moral obligations (for instance, to respect the needs and right of study participants). Its statement that data should be as open as possible, as closed as necessary can be understood as including data sharing for secondary use, but it remains open whether this is taken to be a requirement of epistemic integrity. It could be case that the justification for data sharing for secondary use is mainly seen in its benefit for society and scientific progress. If this is the reason why data sharing for secondary use is included in research integrity, then the research integrity obligation adds nothing to the social value and scientific progress obligation with respect to data sharing for secondary use – which we already discussed in Sect. " The Obligation to Benefit Society " and " The Obligation to Promote Scientific Progress ".

Only in cases in which there are strong philosophical or ethical reasons that speak against the community consensus, funders might not be allowed to follow this consensus. However, we believe this is not the case for the issue of data sharing for secondary use.

There have been intense and broad research and debates on ethical, legal, and social issues of privacy and data protection and other informational aspects of research subject protection in biomedical data intense research and data sharing for 10–15 years. Following the increasing activities of genomic data sharing, approaches and best practices have been developed to address challenges concerning data protection, privacy, and informational rights and autonomy. See for instance the GA4GH and its “Regulatory and Ethics Work Stream” ( https://www.ga4gh.org/how-we-work/workstreams/ ) that provides standard solutions for genetic data sharing and a framework for responsible sharing of genomic and health-related data ( https://www.ga4gh.org/genomic-data-toolkit/regulatory-ethics-toolkit/framework-for-responsible-sharing-of-genomic-and-health-related-data/ ) or the European Genome Archive (EGA) which also provides best practices for genetic data sharing.

We develop a systematic approach to funders' responsibilities for the protection and participation of data subjects from a legal and ethical perspective in Fehling et al. ( 2023 ).

Since the pertaining risks are mainly associated with data sharing for secondary use, and since data sharing for secondary use is not a requirement of research integrity, the weighing of obligations here must exclude the obligation to promote research integrity and focus only on scientific progress and social value.

Of course, we cannot exclude the possibility of very specific cases in certain areas of research where there are additional reasons against the promotion of data sharing which override the pro tanto obligation that speak in favour of promoting data sharing. For example, sharing huge amounts of high quality data used to develop machine learning programs in biomedicine with a Russian research institute closely linked to the Russian military complex might bear the risk of harmful consequences for society. Our all things considered claim should thus be understood as not applying to such special cases. For the possibility of such cases see footnote 11 and the reference to Leonelli ( 2023 ).

How differently the relation between academic freedom and data sharing requirements is perceived by German funders as compared to non-German funders is examined in more detail in Anger et al. ( 2024 ).

The following is a philosophical and not a legal analysis. For a legal analysis of the possibilities and limits of (semi-)governmental promotion of data sharing in the German context see Overkamp and Tormin ( 2022 ) and for the German and European context with a side glance at US constitutional law see Fehling and Tormin ( 2021 ).

Of course, there can be specific reasons in a particular case not to make data transparent for confirmation efforts (such as, for instance, privacy concerns). However, our point is that besides such special circumstances, there is no reason for why researchers ought to generally be free to refuse to make their data available for confirmation.

Of course, this depends on how strong these pro tanto obligations are with respect to particular (kinds of) data. As we explained in footnotes 13 and 17 the weight of these obligations depends on how much the sharing of particular data from a particular research field contributes to social value and scientific progress. We believe, however, that for the most part of the sciences the sharing of research data is so valuable in theses respects that an infringement of academic freedom can be justified.

All European Academies (ALLEA) (2017). The European Code of Conduct for Research Integrity. Retrieved 25 February 2022 https://allea.org/code-of-conduct/ .

Alvarez, M. (2016). Reasons for action: Justification, motivation, explanation. In E. N. Zalta (Ed.). The Stanford encyclopedia of philosophy (Winter 2017 edition). Retrieved June 14, 2022, from https://plato.stanford.edu/archives/win2017/entries/reasons-just-vs-expl/ .

Anger, M., Wendelborn, C., & Schickhardt, C. (2024). German funders’ data sharing policies—A qualitative interview study. PLoS ONE, 19 (2), e0296956. https://doi.org/10.1371/journal.pone.0296956

Article   Google Scholar  

Anger, M., Wendelborn, C., Winkler, E. C., & Schickhardt, C. (2022). Neither carrots nor sticks? Challenges surrounding data sharing from the perspective of research funding agencies—A qualitative expert interview study. PLoS ONE, 17 (9), e0273259. https://doi.org/10.1371/journal.pone.0273259

Barsdorf, N., & Millum, J. (2017). The social value of health research and the worst off. Bioethics, 31 (2), 105–115. https://doi.org/10.1111/bioe.12320

Bauchner, H., Golub, R. M., & Fontanarosa, P. B. (2016). Data sharing: An ethical and scientific imperative. JAMA, 315 (12), 1237–1239. https://doi.org/10.1001/jama.2016.2420

Begley, C. G., & Ioannidis, J. P. A. (2015). Reproducibility in science: Improving the standard for basic and preclinical research. Circulation Research, 116 (1), 116–126. https://doi.org/10.1161/CIRCRESAHA.114.303819

Bierer, B. E., Strauss, D. H., White, S. A., & Zarin, D. A. (2018). Universal funder responsibilities that advance social value. The American Journal of Bioethics AJOB, 18 (11), 30–32. https://doi.org/10.1080/15265161.2018.1523498

Bird, A. (2007). What is scientific progress? Noûs, 41 (1), 64–89. https://doi.org/10.1111/j.1468-0068.2007.00638.x

Bouter, L. (2016). What funding agencies and journals can do to prevent sloppy science. Retrieved June 14, 2022, from https://www.euroscientist.com/what-funding-agencies-and-journals-can-do-to-prevent-sloppy-science/ .

Bouter, L. (2020). What research institutions can do to foster research integrity. Science and Engineering Ethics, 26 (4), 2363–2369. https://doi.org/10.1007/s11948-020-00178-5

Bouter, L. M. (2018). Fostering responsible research practices is a shared responsibility of multiple stakeholders. Journal of Clinical Epidemiology, 96 , 143–146. https://doi.org/10.1016/j.jclinepi.2017.12.016

Boutron, I., & Ravaud, P. (2018). Misrepresentation and distortion of research in biomedical literature. Proceedings of the National Academy of Sciences of the United States of America, 115 (11), 2613–2619. https://doi.org/10.1073/pnas.1710755115

Brock, D. W. (2012). Priority to the worse off in health care resource prioritization. In R. Rhodes, M. Battin, & A. Silvers (Eds.), Medicine and social justice: Essays on the distribution of health care (pp. 155–164). Oxford University Press.

Chapter   Google Scholar  

Brown, M. B., & Guston, D. H. (2009). Science, democracy, and the right to research. Science and Engineering Ethics, 15 (3), 351–366. https://doi.org/10.1007/s11948-009-9135-4

Burton, P. R., Banner, N., Elliot, M. J., Knoppers, B. M., & Banks, J. (2017). Policies and strategies to facilitate secondary use of research data in the health sciences. International Journal of Epidemiology, 46 (6), 1729–1733. https://doi.org/10.1093/ije/dyx195

Chan, A.-W., Song, F., Vickers, A., Jefferson, T., Dickersin, K., Gøtzsche, P. C., Krumholz, H. M., Ghersi, D., & van der Worp, H. B. (2014). Increasing value and reducing waste: Addressing inaccessible research. The Lancet, 383 (9913), 257–266. https://doi.org/10.1016/S0140-6736(13)62296-5

Contreras, J., & Knoppers, B. M. (2018). The genomic commons. Annual Review of Genomics and Human Genetics, 19 , 429–453.

Couture, J. L., Blake, R. E., McDonald, G., & Ward, C. L. (2018). A funder-imposed data publication requirement seldom inspired data sharing. PLOS ONE , 13 (7). https://doi.org/10.1371/journal.pone.0199789 .

Dabbagh, H. (2018). The problem of explanation and reason-giving account of pro tanto duties in the Rossian ethical framework. Public Reason, 10 (1), 69–80.

Google Scholar  

Danchev, Valentin, Min, Yan, Borghi, John, Baiocchi, Mike, & Ioannidis, John P. A. (2021). Evaluation of data sharing after implementation of the International Committee of Medical Journal Editors Data Sharing Statement Requirement. JAMA Network Open , 4 (1), e2033972. https://doi.org/10.1001/jamanetworkopen.2020.33972

Deutsche Forschungsgemeinschaft (DFG) (2019). Leitlinien zur Sicherung guter wissenschaftlicher Praxis: Kodex. Retrieved 25 February 2022 https://doi.org/10.5281/zenodo.3923602 .

Digital Science Report (2019). State of Open Data 2019. A selection of analyses and articles about open data, curated by Figshare. figshare. https://doi.org/10.6084/M9.FIGSHARE.10011788.V2 .

Eckert, Ester M., Di Cesare, Andrea, Fontaneto, Diego, Berendonk, Thomas U., Bürgmann, Helmut, Cytryn, Eddie et al. (2020). Every fifth published metagenome is not available to science. PLOS Biology 18 (4), e3000698. https://doi.org/10.1371/journal.pbio.3000698 .

Erlich, Y., & Narayanan, A. (2014). Routes for breaching and protecting genetic privacy. Nature Reviews Genetics, 15 , 409–421. https://doi.org/10.1038/nrg3723

Errington, T. M., Denis, A., Perfito, N., Iorns, E., & Nosek, B. A. (2021). Challenges for assessing replicability in preclinical cancer biology. eLife , 10 . https://doi.org/10.7554/eLife.67995 .

European Commission. Joint Research Centre. (2017). Analysis of national public research funding (PREF). In Handbook for data collection and indicators production . Publications Office. https://doi.org/10.2760/849945

Fanelli, D. (2018). Opinion: Is science really facing a reproducibility crisis, and do we need it to? Proceedings of the National Academy of Sciences of the United States of America, 115 (11), 2628–2631. https://doi.org/10.1073/pnas.1708272114

Favaretto, M., Clercq, E. de, & Elger, B. S. (2019). Big Data and discrimination: Perils, promises and solutions. A systematic review. Journal of Big Data , 6 (1). https://doi.org/10.1186/s40537-019-0177-4 .

Federer, L. M., Belter, C. W., Joubert, D. J., Livinski, A., Lu, Y.-L., Snyders, L. N., & Thompson, H. (2018). Data sharing in PLOS ONE: An analysis of data availability statements. PLOS ONE , 13 (5). https://doi.org/10.1371/journal.pone.0194768 .

Fehling, M., & Tormin, M. (2021). Das Teilen von Forschungsdaten zwischen Wissenschaftsfreiheit und guter wissenschaftlicher Praxis. Wissenschaftsrecht, 54 (3–4), 281. https://doi.org/10.1628/wissr-2021-0022

Fehling, M., Tormin, M., Wendelborn, C., & Schickhardt, C. (2023). Forschungsförderorganisationen in der Verantwortung zwischen Data Sharing und dem Schutz von Datensubjekten. Medizinrecht, 41 (11), 869–878. https://doi.org/10.1007/s00350-023-6599-1

First International Strategy Meeting on Human Genome Sequencing (1996): Bermuda principles. http://web.ornl.gov/sci/techresources/Human_Genome/research/bermuda.shtml#1 . Accessed 29 July 2023

Fischer, B. A., & Zigmond, M. J. (2010). The essential nature of sharing in science. Science and Engineering Ethics, 16 (4), 783–799. https://doi.org/10.1007/s11948-010-9239-x

Fort Lauderdale Agreement (2003). Sharing data from large-scale biological research projects: A system of tripartite responsibility. http://www.genome.gov/Pages/Research/WellcomeReport0303.pdf . Accessed 29 July 2023

Gabelica, M., Cavar, J., & Puljak, L. (2019). Authors of trials from high-ranking anesthesiology journals were not willing to share raw data. Journal of Clinical Epidemiology, 109 , 111–116. https://doi.org/10.1016/j.jclinepi.2019.01.012

Gabelica, M., Bojčić, R., & Puljak, L. (2022). Many researchers were not compliant with their published data sharing statement: A mixed-methods study. Journal of Clinical Epidemiology, 150 , 33–41. https://doi.org/10.1016/j.jclinepi.2022.05.019

Glasziou, P., Altman, D. G., Bossuyt, P., Boutron, I., Clarke, M., Julious, S., Michie, S., Moher, D., & Wager, E. (2014). Reducing waste from incomplete or unusable reports of biomedical research. The Lancet, 383 (9913), 267–276. https://doi.org/10.1016/S0140-6736(13)62228-X

Goldacre, B., Morton, C. E., & DeVito, N. J. (2019). Why researchers should share their analytic code. BMJ (Clinical Research ed.), 367 , l6365. https://doi.org/10.1136/bmj.l6365

Gopalakrishna, G., Riet, G. ter, Vink, G., Stoop, I., Wicherts, J. M., & Bouter, L. M. (2022). Prevalence of questionable research practices, research misconduct and their potential explanatory factors: A survey among academic researchers in the Netherlands. PLOS ONE , 17 (2). https://doi.org/10.1371/journal.pone.0263023 .

Gorman, D. M. (2020). Availability of research data in high-impact addiction journals with data sharing policies. Science and Engineering Ethics , 26 (3), S. 1625–1632. https://doi.org/10.1007/s11948-020-00203-7 .

Haack, S. (2007). The integrity of science: What it means, why it matters. Contrastes: Revista International de Filosofia 12 , S. 5–26. Online verfügbar unter https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1105831 , zuletzt geprüft am 25. Februar 2022.

Hardwicke, T. E., Bohn, M., MacDonald, K., Hembacher, E., Nuijten, M. B., Peloquin, B. N., deMayo, B. E., Long, B., Yoon, E. J., & Frank, M. C. (2021). Analytic reproducibility in articles receiving open data badges at the Journal Psychological Science: An observational study. Royal Society Open Science , 8 (1). https://doi.org/10.1098/rsos.201494 .

Hardwicke, T. E., Mathur, M. B., MacDonald, K., Nilsonne, G., Banks, G. C., Kidwell, M. C., Hofelich Mohr, A., Clayton, E., Yoon, E. J., Henry Tessler, M., Lenne, R. L., Altman, S., Long, B., & Frank, M. C. (2018). Data availability, reusability, and analytic reproducibility: Evaluating the impact of a mandatory open data policy at the journal Cognition. Royal Society Open Science , 5 (8). https://doi.org/10.1098/rsos.180448 .

Hayden, E. C. (2013). Privacy protections: The genome hacker. Nature 497 (7448), S. 172–174. https://doi.org/10.1038/497172a .

Hedrick, T. E. (1988). Justifications for the sharing of social science data. Law and Human Behavior, 12 (2), 163–171. https://doi.org/10.1007/BF01073124

Herlitz, A. (2018). Health, priority to the worse off, and time. Medicine, Health Care, and Philosophy, 21 (4), 517–527. https://doi.org/10.1007/s11019-018-9825-2

Homer, N., Szelinger, S., Redman, M., Duggan, D., Tembe, W., & Muehling, J. et al. (2008). Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics , 4 (8), e1000167. https://doi.org/10.1371/journal.pgen.1000167 .

Hurtig, K. (2007). On prima facie obligations and nonmonotonicity. Journal of Philosophical Logic, 36 (5), 599–604.

Iqbal, S. A., Wallach, J. D., Khoury, M. J., Schully, S. D., & Ioannidis, J. P. A. (2016). Reproducible research practices and transparency across the biomedical literature. PLOS Biology , 14 (1). https://doi.org/10.1371/journal.pbio.1002333 .

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23 (5), 524–532. https://doi.org/10.1177/0956797611430953

Kaiser, M., Drivdal, L., Hjellbrekke, J., Ingierd, H., & Rekdal, O. B. (2021). Questionable research practices and misconduct among Norwegian researchers. Science and Engineering Ethics , 28 (1). https://doi.org/10.1007/s11948-021-00351-4 .

Kaye, J., Heeney, C., Hawkins, N., de Vries, J., & Boddington, P. (2009). Data sharing in genomics—Re-shaping scientific practice. Nature Reviews Genetics, 10 (5), 331–335. https://doi.org/10.1038/nrg2573

Kitcher, P. (2001). Science, truth, and democracy . Oxford University Press. https://doi.org/10.1093/0195145836.001.0001

Kozlov, M. (2022). NIH issues a seismic mandate: Share data publicly. Nature . https://doi.org/10.1038/d41586-022-00402-1

Kretser, A., Murphy, D., Bertuzzi, S., Abraham, T., Allison, D. B., Boor, K. J., Dwyer, J., Grantham, A., Harris, L. J., Hollander, R., Jacobs-Young, C., Rovito, S., Vafiadis, D., Woteki, C., Wyndham, J., & Yada, R. (2019). Scientific Integrity principles and best practices: Recommendations from a scientific integrity consortium. Science and Engineering Ethics, 25 (2), 327–355. https://doi.org/10.1007/s11948-019-00094-3

Leonelli, S. (2018). Rethinking reproducibility as a criterion for research quality. In L. Fiorito (Ed.), Including a symposium on the work of Mary Morgan: Curiosity, imagination, and surprise (pp. 129–146). Emerald Publishing Limited.

Leonelli, S. (2023). Philosophy of open science . Cambridge University Press. https://doi.org/10.1017/9781009416368

Levy, S., Sutton, G., Ng, P. C., Feuk, L., Halpern, A. L., Walenz, B. P. et al. (2007): The diploid genome sequence of an individual human. PLoS Biology 5 (10), e254. https://doi.org/10.1371/journal.pbio.0050254 .

Manson, N. C. (2019). The biobank consent debate: Why ‘meta-consent’ is not the solution? Journal of Medical Ethics, 45 (5), 291–294. https://doi.org/10.1136/medethics-2018-105007

Mejlgaard, N., Bouter, L. M., Gaskell, G., Kavouras, P., Allum, N., Bendtsen, A.-K., Charitidis, C. A., Claesen, N., Dierickx, K., Domaradzka, A., Reyes Elizondo, A., Foeger, N., Hiney, M., Kaltenbrunner, W., Labib, K., Marušić, A., Sørensen, M. P., Ravn, T., Ščepanović, R. … Veltri, G. A. (2020). Research integrity: Nine ways to move from talk to walk. Nature , 586 (7829), 358–360. https://doi.org/10.1038/d41586-020-02847-8 .

Merton, R. (Ed.) (1942/1973). The sociology of science: Theoretical and empirical investigations . The University of Chicago Press.

Mikkelsen, R. B., Gjerris, M., Waldemar, G., & Sandøe, P. (2019). Broad consent for biobanks is best—provided it is also deep. BMC Medical Ethics, 20 (1), 71. https://doi.org/10.1186/s12910-019-0414-6

Mill, J. S. (2008). On liberty and other essays . Oxford University Press.

Miyakawa, T. (2020). No raw data, no science: Another possible source of the reproducibility crisis. Molecular Brain , 13 (1). https://doi.org/10.1186/s13041-020-0552-2 .

Mühlhoff, R. (2021). Predictive privacy: Towards an applied ethics of data analytics. Ethics and Information Technology, 23 (4), 675–690. https://doi.org/10.1007/s10676-021-09606-x

Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., Du Sert, N. P., Simonsohn, U., Wagenmakers, E.-J., Ware, J. J., & Ioannidis, J. P. A. (2017). A manifesto for reproducible science. Nature Human Behaviour , 1 . https://doi.org/10.1038/s41562-016-0021 .

National Academies Press (US) (2017). Fostering integrity in research . https://doi.org/10.17226/21896 .

National Cancer Institute (n.d.). Genomic data commons, accessed 27 July 2023, https://gdc.cancer.gov/

National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research (1978). The Belmont report: Ethical principles and guidelines for the protection of human subjects of research. DHEW Pub , No (OS) 78–0014. US Govt Print Office.

National Institutes of Health (2022). NIH Data Sharing Policy 2023. Retrieved 23 June 2022 https://sharing.nih.gov/data-management-and-sharing-policy/about-data-management-sharing-policy/data-management-and-sharing-policy-overview .

National Library of Medicine (n.d.). ClinVar, accessed 27 July 2023, https://www.ncbi.nlm.nih.gov/clinvar/

Naudet, F., Sakarovitch, C., Janiaud, P., Cristea, I., Fanelli, D., Moher, D., & Ioannidis, J. P. A. (2018). Data sharing and reanalysis of randomized controlled trials in leading biomedical journals with a full data sharing policy: Survey of studies published in The BMJ and PLOS Medicine. BMJ , 360 . https://doi.org/10.1136/bmj.k400 .

Naudet, F., Siebert, M., Pellen, C., Gaba, J., Axfors, C., Cristea, I., Danchev, V., Mansmann, U., Ohmann, C., Wallach, J. D., Moher, D., & Ioannidis, J. P. A. (2021). Medical journal requirements for clinical trial data sharing: Ripe for improvement. PLOS Medicine , 18 (10). https://doi.org/10.1371/journal.pmed.1003844 .

Netherlands Code of Conduct for Research Integrity (2018).

Neylon, C. (2017). Compliance culture or culture change? The role of funders in improving data management and sharing practice amongst researchers. Research Ideas and Outcomes, 3 , e21705. https://doi.org/10.3897/rio.3.e21705

Niiniluoto, I. (2019). Scientific progress. In E. N. Zalta (Ed.). The Stanford encyclopedia of philosophy (Winter 2019 edition). Retrieved June 14, 2022, from https://plato.stanford.edu/archives/win2019/entries/scientific-progress/ .

Nuijten, M. B., Bakker, M., Maassen, E., & Wicherts, J. M. (2018). Verify original results through reanalysis before replicating. Behavioral and Brain Sciences , 41 . https://doi.org/10.1017/S0140525X18000791 .

Ohmann, C., Moher, D., Siebert, M., Motschall, E., & Naudet, F. (2021). Status, use and impact of sharing individual participant data from clinical trials: A scoping review. BMJ Open , 11 (8). https://doi.org/10.1136/bmjopen-2021-049228 .

Ottersen, T. (2013). Lifetime QALY prioritarianism in priority setting. Journal of Medical Ethics, 39 (3), 175–180. https://doi.org/10.1136/medethics-2012-100740

Overkamp, P., & Tormin, M. (2022). Staatliche Steuerungsmöglichkeiten zur Förderung des Teilens von Forschungsdaten. Ordnungen der Wissenschaft, 1 , 39–54.

Peels, R. (2019). Replicability and replication in the humanities. Research Integrity and Peer Review , 4 . https://doi.org/10.1186/s41073-018-0060-4 .

Peels, R., & Bouter, L. (2021). Replication and trustworthiness. Accountability in Research . https://doi.org/10.1080/08989621.2021.1963708

Perrier, L., Blondal, E., & MacDonald, H. (2020). The views, perspectives, and experiences of academic researchers with data sharing and reuse: A meta-synthesis. PLOS ONE , 15 (2). https://doi.org/10.1371/journal.pone.0229182 .

Persad, G. (2019). Justice and public health. In A. C. Mastroianni, J. P. Kahn, & N. E. Kass (Eds.), The Oxford handbook of public health ethics (pp. 32–46). Oxford University Press.

Pierson, L., & Millum, J. (2018). Health research priority setting: The duties of individual funders. The American Journal of Bioethics, 18 (11), 6–17. https://doi.org/10.1080/15265161.2018.1523490

Platt, J. E., Jacobson, P. D., & Kardia, S. L. R. (2018). Public trust in health information sharing: A measure of system trust. Health Services Research, 53 (2), 824–845. https://doi.org/10.1111/1475-6773.12654

Ploug, T. (2020). In defence of informed consent for health record research—Why arguments from ‘easy rescue’, ‘no harm’ and ‘consent bias’ fail. BMC Medical Ethics, 21 (1), 75. https://doi.org/10.1186/s12910-020-00519-w

Ploug, T., & Holm, S. (2016). Meta consent—A flexible solution to the problem of secondary use of health data. Bioethics, 30 (9), 721–732. https://doi.org/10.1111/bioe.12286

Powell, Kendall (2021). The broken promise that undermines human genome research. Nature 590 (7845), S. 198–201. https://doi.org/10.1038/d41586-021-00331-5 .

Pratt, B., & Hyder, A. A. (2017). Fair resource allocation to health research: Priority topics for bioethics scholarship. Bioethics, 31 (6), 454–466. https://doi.org/10.1111/bioe.12350

Pratt, B., & Hyder, A. A. (2019). Ethical responsibilities of health research funders to advance global health justice. Global Public Health, 14 (1), 80–90. https://doi.org/10.1080/17441692.2018.1471148

Rauh, S., Torgerson, T., Johnson, A. L., Pollard, J., Tritz, D., & Vassar, M. (2020). Reproducible and transparent research practices in published neurology research. Research Integrity and Peer Review , 5 . https://doi.org/10.1186/s41073-020-0091-5 .

Reed-Berendt, R., Dove, E. S., & Pareek, M. (2022). The ethical implications of big data research in public health: “Big Data Ethics by Design” in the UK-REACH study. Ethics and Human Research, 44 (1), 2–17. https://doi.org/10.1002/eahr.500111

Resnik, D. (1996). Review: Ethics of scientific research by Shrader-Frechette, Kristin. Noûs , 30 (1), 133–143. https://doi.org/10.2307/2216307 .

Resnik, D. B. (1998). The ethics of science: An introduction. Philosophical issues in science. Routledge.

Resnik, D. B. (2018a). Difficulties with applying a strong social value requirement to clinical research. The Hastings Center Report, 48 (6), 35–37. https://doi.org/10.1002/hast.936

Resnik, D. B. (2018b). Examining the social benefits principle in research with human participants. Health Care Analysis, 26 (1), 66–80. https://doi.org/10.1007/s10728-016-0326-2

Resnik, D. B., & Shamoo, A. E. (2011). The singapore statement on research integrity. Accountability in Research, 18 (2), 71–75. https://doi.org/10.1080/08989621.2011.557296

Reydon, T. (2013). Wissenschaftsethik: Eine Einführung. UTB Philosophie, Naturwissenschaften , 4032. Ulmer.

Rosenbaum, S. (2010). Data governance and stewardship: Designing data stewardship entities and advancing data access. Health Services Research, 45 (5 Pt 2), 1442–1455. https://doi.org/10.1111/j.1475-6773.2010.01140.x

Ross, W. D. (1930). The right and the good . Clarendon.

Russell, C. (1993). Academic freedom. (1 st edition). Routledge.

Sardanelli, F., Alì, M., Hunink, M. G., Houssami, N., Sconfienza, L. M., & Di Leo, G. (2018). To share or not to share? Expected pros and cons of data sharing in radiological research. European Radiology, 28 (6), 2328–2335. https://doi.org/10.1007/s00330-017-5165-5

Schickhardt, C., Hosley, N., & Winkler, E. C. (2016). Researchers’ duty to share pre-publication data: From the prima facie duty to practice. In B. D. Mittelstadt & L. Floridi (Eds.), The ethics of biomedical big data (pp. 309–337). Springer.

Schwartz, J. S. J. (2020). The value of science in space exploration . Oxford University Press. https://doi.org/10.1093/oso/9780190069063.001.0001

Sen, A. (2002). Why health equity? Health Economics, 11 (8), 659–666. https://doi.org/10.1002/hec.762

Sim, I., Stebbins, M., Bierer, B. E., Butte, A. J., Drazen, J., Dzau, V., Hernandez, A. F., Krumholz, H. M., Lo, B., Munos, B., Perakslis, E., Rockhold, F., Ross, J. S., Terry, S. F., Yamamoto, K. R., Zarin, D. A., & Li, R. (2020). Time for NIH to lead on data sharing. Science, 367 (6484), 1308–1309. https://doi.org/10.1126/science.aba4456

Stewart, S. L. K., Pennington, C. R., da Silva, G. R., Ballou, N., Butler, J., Dienes, Z., Jay, C., Rossit, S., & Samara, A. (2022). Reforms to improve reproducibility and quality must be coordinated across the research ecosystem: The view from the UKRN local network leads. BMC Research Notes , 15 (1). https://doi.org/10.1186/s13104-022-05949-w .

Strcic, Josip, Civljak, Antonia, Glozinic, Terezija, Pacheco, Rafael Leite, Brkovic, Tonci, & Puljak, Livia (2022): Open data and data sharing in articles about COVID-19 published in preprint servers medRxiv and bioRxiv. Scientometrics 127 (5), S. 2791–2802. https://doi.org/10.1007/s11192-022-04346-1 .

Tan, Aidan Christopher, Askie, Lisa M., Hunter, Kylie Elizabeth, Barba, Angie, Simes, Robert John, & Seidler, Anna Lene (2021): Data sharing-trialists' plans at registration, attitudes, barriers and facilitators: A cohort study and cross-sectional survey. Research Synthesis Methods,1 2 (5), S. 641–657. https://doi.org/10.1002/jrsm.1500

Tedersoo, Leho, Küngas, Rainer, Oras, Ester, Köster, Kajar, Eenmaa, Helen, Leijen, Äli et al. (2021). Data sharing practices and data availability upon request differ across scientific disciplines. Scientific Data , 8 (1), Artikel 192. https://doi.org/10.1038/s41597-021-00981-0 .

Tedersoo, L., Küngas, R., Oras, E., Köster, K., Eenmaa, H., Leijen, Ä., Pedaste, M., Raju, M., Astapova, A., Lukner, H., Kogermann, K., & Sepp, T. (2021). Data sharing practices and data availability upon request differ across scientific disciplines. Scientific Data , 8 (1). https://doi.org/10.1038/s41597-021-00981-0 .

Terry, R. F., Littler, K., & Olliaro, P. L. (2018). Sharing health research data - the role of funders in improving the impact. F1000Research , 7 . https://doi.org/10.12688/f1000research.16523.2 .

Thelwall, M., Munafò, M., Mas-Bleda, A., Stuart, E., Makita, M., Weigert, V., Keene, C., Khan, N., Drax, K., & Kousha, K. (2020). Is useful research data usually shared? An investigation of genome-wide association study summary statistics. PLOS ONE , 15 (2). https://doi.org/10.1371/journal.pone.0229578 .

Titus, S., & Bosch, X. (2010). Tie funding to research integrity. Nature, 466 (7305), 436–437. https://doi.org/10.1038/466436a

Towse, John N., Ellis, David A., & Towse, Andrea S. (2021). Opening Pandora's Box: Peeking inside psychology's data sharing practices, and seven recommendations for change. Behavior Research Methods 53 (4), S. 1455–1468. https://doi.org/10.3758/s13428-020-01486-1 .

Watson, Clare (2022). Many researchers say they'll share data - but don't. Nature 606 (7916), S. 853. https://doi.org/10.1038/d41586-022-01692-1 .

Wendler, D., & Rid, A. (2017). In defense of a social value requirement for clinical research. Bioethics, 31 (2), 77–86. https://doi.org/10.1111/bioe.12325

Wertheimer, A. (2015). The social value requirement reconsidered. Bioethics, 29 (5), 301–308. https://doi.org/10.1111/bioe.12128

Wilholt, T. (2010). Scientific freedom: Its grounds and their limitations. Studies in History and Philosophy of Science Part A, 41 (2), 174–181. https://doi.org/10.1016/j.shpsa.2010.03.003

Wilholt, T. (2012). Die Freiheit der Forschung: Begründungen und Begrenzungen . Suhrkamp.

Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., & Finkers, R. … Mons, B. (2016). The FAIR guiding principles for scientific data management and stewardship. Scientific Data , 3 . https://doi.org/10.1038/sdata.2016.18 .

Winkler, E. C., Jungkunz, M., Thorogood, A. et al. (2023). Patient data for commercial companies? An ethical framework for sharing patients’ data with for-profit companies for research . Journal of Medical Ethics. https://doi.org/10.1136/jme-2022-108781

de Winter, J., & Kosolosky, L. (2013). The epistemic integrity of scientific research. Science and Engineering Ethics, 19 (3), 757–774. https://doi.org/10.1007/s11948-012-9394-3

World Conference on Research Integrity (2010). Singapore Statement on Research Integrity. Retrieved 25 February 2022 https://wcrif.org/guidance/singapore-statement .

World Medical Association. (2013). World medical association declaration of Helsinki: Ethical principles for medical research involving human subjects. JAMA, 310 (20), 2191–2194. https://doi.org/10.1001/jama.2013.281053

Xafis, V., Schaefer, G. O., Labude, M. K., Brassington, I., Ballantyne, A., Lim, H. Y., Lipworth, W., Lysaght, T., Stewart, C., Sun, S., Laurie, G. T., & Tai, E. S. (2019). An ethics framework for big data in health and research. Asian Bioethics Review, 11 (3), 227–254. https://doi.org/10.1007/s41649-019-00099-x

Ziman, J. (2009). Real science. Cambridge University Press. https://doi.org/10.1017/CBO9780511541391

Download references

Acknowledgements

The authors would like to thank the following individuals and groups for their contributions to this project: our partners within the joint research project DATABLIC, Prof. Dr. Michael Fehling and Miriam Tormin (Bucerius Law School, Hamburg), Prof. Dr. Christiane Schwieren and Tamás Olah (University of Heidelberg); all members of the Section Translational Medical Ethics at the National Center for Tumour Diseases, Heidelberg, especially the head of section Prof. Dr. Dr. Eva Winkler; Maya Doering for assistence with literature review and formatting.  

The work on this article has been funded by the German Ministry for Education and Research (Bundesministerium für Bildung und Forschung, funding reference no. 01GP1904A) as part of the joint research project DATABLIC. The funder had no role in research design, analysis, decision to publish, or preparation of the manuscript.

Author information

Christian Wendelborn

Present address: University of Konstanz, Konstanz, Germany

Authors and Affiliations

Section for Translational Medical Ethics, German Cancer Research Center (DKFZ), National Center for Tumor Diseases (NCT) Heidelberg, Heidelberg, Germany

Christian Wendelborn, Michael Anger & Christoph Schickhardt

You can also search for this author in PubMed   Google Scholar

Contributions

Conceptualization: CW and CS. Methodology: CW and CS. Ethical analysis and investigation: CW and CS. Writing—original draft preparation: CW. Writing—review and editing: MA and CS. Supervision: CS. Project proposal and successful application: CS   

Corresponding author

Correspondence to Christian Wendelborn .

Ethics declarations

Conflict of interest.

The authors declare that no competing interests exist.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Wendelborn, C., Anger, M. & Schickhardt, C. Promoting Data Sharing: The Moral Obligations of Public Funding Agencies. Sci Eng Ethics 30 , 35 (2024). https://doi.org/10.1007/s11948-024-00491-3

Download citation

Received : 21 October 2022

Accepted : 08 June 2024

Published : 06 August 2024

DOI : https://doi.org/10.1007/s11948-024-00491-3

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Data sharing
  • Epistemic integrity
  • Funding agencies
  • Moral obligations
  • Research integrity
  • Scientific progress
  • Scientific freedom
  • Social value
  • Find a journal
  • Publish with us
  • Track your research

IMAGES

  1. Top 8 Data Science Case Studies for Data Science Enthusiasts

    case study for data science project

  2. Data Science Process: 7 Steps With Comprehensive Case Study

    case study for data science project

  3. 10 Real World Data Science Case Studies Projects with Example

    case study for data science project

  4. Data Science Case Studies: Solved and Explained

    case study for data science project

  5. 4 Most Viewed Data Science Case Studies given by Top Data Scientists

    case study for data science project

  6. Top 10 Data Science Case Study Interview Questions for 2024

    case study for data science project

COMMENTS

  1. 10 Real World Data Science Case Studies Projects with Example

    BelData science has been a trending buzzword in recent times. With wide applications in various sectors like healthcare, education, retail, transportation, media, and banking -data science applications are at the core of pretty much every industry out there. The possibilities are endless: analysis of frauds in the finance sector or the personalization of recommendations on eCommerce businesses.

  2. 10 Real-World Data Science Case Studies Worth Reading

    Real-world data science case studies differ significantly from academic examples. While academic exercises often feature clean, well-structured data and simplified scenarios, real-world projects tackle messy, diverse data sources with practical constraints and genuine business objectives.

  3. 12 Data Science Case Studies: Across Various Industries

    Top 12 Data Science Case Studies. 1. Data Science in Hospitality Industry. In the hospitality sector, data analytics assists hotels in better pricing strategies, customer analysis, brand marketing, tracking market trends, and many more. Airbnb focuses on growth by analyzing customer voice using data science.

  4. Top 25 Data Science Case Studies [2024]

    Top 25 Data Science Case Studies [2024] In an era where data is the new gold, harnessing its power through data science has led to groundbreaking advancements across industries. From personalized marketing to predictive maintenance, the applications of data science are not only diverse but transformative. This compilation of the top 25 data ...

  5. Data Science Case Studies: Solved and Explained

    53. 1. Solving a Data Science case study means analyzing and solving a problem statement intensively. Solving case studies will help you show unique and amazing data science use cases in your ...

  6. Data in Action: 7 Data Science Case Studies Worth Reading

    Case studies are helpful tools when you want to illustrate a specific point or concept. They can be used to show how a data science project works in real life, or they can be used as an example of what to avoid. Data science case studies help students, and entry-level data scientists understand how professionals have approached previous ...

  7. Case Study: Applying a Data Science Process Model to a Real-World

    However, it's essential to note that real-world data science projects pose several challenges, such as data quality issues, lack of domain expertise, and inadequate communication between stakeholders. In comparison, fictitious case studies provide an idealized environment with clean, well-labeled data and well-defined problem statements.

  8. Data Science Case Studies: Lessons from the Real World

    The case studies presented illuminate the broad spectrum of challenges that data science can address, showcasing its versatility and impact. For organizations and professionals looking to harness the power of data science, these examples provide inspiration and guidance on applying data science techniques to achieve tangible results.

  9. Case studies

    Case studies. How data science is used to solve real-world problems in business, public policy and beyond. Categories. All (11) Coding (1) Collaboration (1) Crime and justice (2) Data analysis (1) Data linkage (1) Data quality (2) Deep learning (1) Forecasting (1) Health and Wellbeing (1)

  10. Case Studies

    Discover some of our best data science and machine learning case studies. Your home for data science. A Medium publication sharing concepts, ideas and codes. ... A simple and customizable project with Python and Selenium, that will search for flights and send the prices directly to your email! Fábio Neves. May 2, 2019.

  11. Machine Learning Case-Studies

    Genetic Algorithms + Neural Networks = Best of Both Worlds. Learn how Neural Network training can be accelerated using Genetic Algorithms! Suryansh S. Mar 26, 2018. Real-world case studies on applications of machine learning to solve real problems. Your home for data science. A Medium publication sharing concepts, ideas and codes.

  12. Doing Data Science: A Framework and Case Study

    A data science framework has emerged and is presented in the remainder of this article along with a case study to illustrate the steps. This data science framework warrants refining scientific practices around data ethics and data acumen (literacy). A short discussion of these topics concludes the article. 2.

  13. Data Science Use Cases Guide

    Data science use case planning is: outlining a clear goal and expected outcomes, understanding the scope of work, assessing available resources, providing required data, evaluating risks, and defining KPI as a measure of success. The most common approaches to solving data science use cases are: forecasting, classification, pattern and anomaly ...

  14. Data Science Projects with Python: A case study approach to gaining

    This creates a case-study approach that simulates the working conditions you'll experience in real-world data science projects. You'll learn how to use key Python packages, including pandas, Matplotlib, and scikit-learn, and master the process of data exploration and data processing, before moving on to fitting, evaluating, and tuning ...

  15. Data Science Projects with Python: A case study approach to successful

    Data Science Projects with Python: A case study approach to successful data science projects using Python, pandas, and scikit-learn. Stephen Klosterman. ... Data Science Projects with Python is designed to give you practical guidance on industry-standard data analysis and machine learning tools, by applying them to realistic data problems. ...

  16. Case Studies: Data Science Projects

    Welcome to some case study of data science projects - (Personal Projects). Topics. data-science machine-learning pyspark healthcare logistic-regression decision-tree churn-prediction census-income house-price-prediction anomaly-detection pyspark-mllib data-science-projects case-study-data-science spaceship-titanic Resources.

  17. Case Study: Delivering A Successful Data Science Project

    Before we jump into the case study, I felt it was important to briefly address the misconception about what a data science project is by giving an example of a side-by-side comparison. A lot of Australian companies are currently misusing the term and refer to a business analytics project as data science or big data project. Data science is a ...

  18. Open Case Studies: Statistics and Data Science Education through Real

    Case Studies (opencasestudies.org) project, which offers a new statistical and data science education case study model. This educational resource pro-vides self-contained, multimodal, peer-reviewed, and open-source guides (or case studies) from real-world examples for active experiences of complete data analyses.

  19. Data Science Projects with Python: A case study approach to successful

    About Data Science Projects with Python . Data Science Projects with Python is a hands-on introduction to real-world data science. You'll take an active approach to learning by following real case studies that elegantly tie together mathematics and code.

  20. Data Science Project Lifecycle

    Explore and run machine learning code with Kaggle Notebooks | Using data from Data Science Project Lifecycle

  21. GitHub

    Data Science Projects with Python is designed to give you practical guidance on industry-standard data analysis and machine learning tools in Python, with the help of realistic data. The course will help you understand how you can use pandas and Matplotlib to critically examine a dataset with summary statistics and graphs, and extract the insights you seek to derive.

  22. Data Science and Machine Learning Project Walkthroughs

    In this video, we will go in-depth with three data science case studies to see how companies leverage data science to make better decisions, innovate in their sectors, and meet customers' specific needs. ... In this video, we'll walk through a data science project to predict stock market prices using Python, scikit-learn, and pandas. ...

  23. 5 Python Projects for Data Science Portfolio

    Data Science: Data science is study of data. It involves developing methods of recording, storing, and analyzing data to extract useful information. The goal of data science is to gain knowledge from any type of data both structured and unstructured. Data science is a term for set of fields that are focused on mining big data sets and discovering t

  24. Shiny in Production 2024: Full speaker lineup

    We are pleased to announce the full line-up for this year's Shiny in Production conference! This year, we're introducing a new lightning talk session. These short 5 minute talks will allow us to showcase many more uses of Shiny in Production. The conference will still feature 6 full length talks, as well as a session of lightning speakers. Register now Talks Cara Thompson - Freelance Data ...

  25. Exploring coral reefs in real time

    Advanced Inquiry Program (AIP) graduate Zovig Minassian '24 of La Crescenta, California, wrote an article for California Classroom Science, a publication of the California Association of Science Educators (CASE). In "Increasing Student Awareness of Coral Reef Conservation with Inquiry and ...

  26. Analysis of Critical Success Factors for Implementing Project

    Analysis of Critical Success Factors for Implementing Project Management: A Case Study at PT ABC Company. Social and professional topics. Professional topics. Management of computing and information systems. ... MSIE '24: Proceedings of the 2024 6th International Conference on Management Science and Industrial Engineering. April 2024. 395 pages.

  27. A long section of serpentinized depleted mantle peridotite

    Funding: This research used samples and data provided by the International Ocean Discovery Program (IODP). Funding for the operation and management of the scientific drilling vessel JOIDES Resolution was provided by the US National Science Foundation (NSF) (award OCE, 1326927).

  28. [2408.04532] How Transformers Utilize Multi-Head Attention in In

    Despite the remarkable success of transformer-based models in various real-world tasks, their underlying mechanisms remain poorly understood. Recent studies have suggested that transformers can implement gradient descent as an in-context learner for linear regression problems and have developed various theoretical analyses accordingly. However, these works mostly focus on the expressive power ...

  29. PDF Global Macro ISSUE 129

    vis-à-vis the US. However, a bull case for AI data centers— which assumes a slightly higher data center market share of 25% for Europe and no efficiency gains on future server deliveries—could see cumulative electricity consumption growth of around 50% over the next decade. But even in our base case, the incremental power consumption we expect

  30. Promoting Data Sharing: The Moral Obligations of Public Funding

    1. In the following, we use the term "research data" and "data" as referring to digital data that is collected and/or generated during a research project. We use the term "data sharing" as referring to the act of making data available for other researchers - either for the purpose of transparency of studies and replication of published research results or for the purpose of other ...