Federated Machine Learning
Vast amounts of data are collected and stored every day. Yet when we look at the most pressing problems of
our time, why are complex machine learning models, for instance for cancer treatment, trained on only a few thousand
samples? The core problem is that the data is available but must not be shared due to legal
restrictions and bureaucratic hurdles, in particular when the data is sensitive.
In this project, you will help to alleviate this problem by using an approach called federated machine learning.
Here, the data remains at its safe storage site and the machine learning comes to the data. A small local
model is trained at each site, and the local models are then combined into a global model while only anonymous model
parameters are transmitted. With that, we will be able to train high-quality models without compromising data security
or privacy.
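The train-locally-then-combine idea described above can be illustrated with a minimal sketch. It is not the FeatureCloud implementation: the toy linear model, the function names (local_update, federated_average, train), the learning rate and the data are all assumptions made for illustration. The aggregation shown is a sample-weighted average of the local parameters, which is one common way to combine local models.

```python
from typing import List, Tuple

Params = Tuple[float, float]        # (slope, intercept) of a toy linear model
Dataset = List[Tuple[float, float]]  # (x, y) pairs that never leave their site


def local_update(params: Params, data: Dataset,
                 lr: float = 0.05, epochs: int = 20) -> Params:
    """Run a few gradient-descent steps on one site's private data."""
    w, b = params
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def federated_average(updates: List[Tuple[Params, int]]) -> Params:
    """Combine local parameters, weighting each site by its sample count."""
    total = sum(n for _, n in updates)
    w = sum(p[0] * n for p, n in updates) / total
    b = sum(p[1] * n for p, n in updates) / total
    return w, b


def train(sites: List[Dataset], rounds: int = 50) -> Params:
    """Alternate local training and aggregation; only parameters are exchanged."""
    params: Params = (0.0, 0.0)
    for _ in range(rounds):
        updates = [(local_update(params, data), len(data)) for data in sites]
        params = federated_average(updates)
    return params


# Hypothetical toy data following y = 2x + 1, split across two sites.
site_a = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
site_b = [(3.0, 7.0), (4.0, 9.0)]
global_params = train([site_a, site_b])
print(global_params)  # close to the true slope 2 and intercept 1
```

Note that the coordinator only ever sees the parameter pairs and sample counts, never the (x, y) records themselves. Whether transmitted parameters are sufficiently anonymous in a real setting depends on the model and is outside the scope of this sketch.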
The project is part of FeatureCloud, a large consortium project. You will implement federated solutions for
specific problems that integrate seamlessly into the FeatureCloud platform. Potential projects are:
- Federated Clustering Algorithms
- Federated Similarity Calculation
- Federated Cluster Evaluation
- Federated Network Enrichment
GWAS Study
Impact of fibre, red and processed meat on risk of chronic inflammatory diseases: a prospective UK Biobank cohort study on prognostic factors and personalised medicine in the UK Biobank
UK Biobank is an incredibly large data source that has collected data from 500,000 British individuals, enabling the
investigation of gene-lifestyle interactions in relation to the development of diseases. The data is already available and
requires advanced data science skills to cope with its sheer volume.
The aim of this project is to program statistical models that investigate gene-diet interactions in an efficient way.
You will learn to perform observational and advanced computer analyses such as interaction studies, case-control
and case-only studies. Further, you will learn how to do network and pathway analyses.
You will be supervised by Richard Röttger and Vibeke Andersen, professor, Research Leader, Research Unit for
Molecular Diagnostics and Clinical Research, University Hospital of Southern Denmark.
Practical Information. The study is designed as a Master's thesis project and can also be done in a team:
two students working on their own projects, but collaborating on methods, interpretation, manuscript writing,
etc., are preferred. You will be responsible for writing a manuscript under supervision. In-depth knowledge of
Medicine or Biology is not crucial. The project could form the basis for a PhD study, if desired.
Rare Disease
Rare diseases collectively affect 30 million people in Europe. Patients undergo an expensive and time-consuming odyssey between doctors and hospitals to receive a correct diagnosis. On average, a diagnosis takes 10 years. We want to assist medical researchers in identifying potential rare diseases using artificial intelligence and electronic health records. However, there are two challenges:
- Rare diseases are, by definition, rare, so we need to gather a lot of data from different hospitals around the world.
- Because the data comes from around the world, it is not necessarily compatible between hospitals.
Therefore, we would like to discover which data, or in other words, which features, are available and compatible across a set of hospitals. A plethora of features is useful for determining a patient's rare disease. We can address the first challenge using federated learning: a model is trained locally at each hospital, and only anonymous model parameters are sent to the central model. The patient data never leaves the hospital.
The second challenge is that data types and formats may not be compatible between hospitals, which makes federated learning difficult to apply. We need to implement methods to discover data compatibility between hospitals for federated learning. For example, if a medical researcher selects 10 features at one hospital, they may lose 10,000 potential samples at other hospitals that do not record all of those features. A medical researcher who wants to train a machine learning model for predicting rare diseases may therefore have specific requirements. What is the optimal number of features and samples? Which features are more important than others? Furthermore, medical researchers need accessible software that meets these needs. This is an optimization problem that requires finding the optimal balance between features and samples.
Overall, the goal of the project is to implement methods that find the optimal balance between the number of selected features and the number of samples for training a machine learning model. The student is not expected to have prior knowledge of rare diseases or patient data.
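To make the trade-off between features and samples concrete, here is a minimal sketch under strong simplifying assumptions: each hospital shares only metadata (which features it records and how many patients it has), and a hospital contributes samples only if it records every selected feature. The hospital names, feature names, patient counts and function names are all hypothetical, and the exhaustive search shown here is only feasible for a small number of features.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Tuple

Catalogue = Dict[str, Tuple[FrozenSet[str], int]]

# Hypothetical metadata: hospital -> (recorded features, number of patients).
HOSPITALS: Catalogue = {
    "hospital_a": (frozenset({"age", "sex", "bmi", "crp"}), 1200),
    "hospital_b": (frozenset({"age", "sex", "crp"}), 800),
    "hospital_c": (frozenset({"age", "bmi"}), 500),
}


def usable_samples(features: Iterable[str], hospitals: Catalogue) -> int:
    """Count patients at hospitals that record all selected features."""
    selected = frozenset(features)
    return sum(n for recorded, n in hospitals.values() if selected <= recorded)


def best_feature_set(k: int, hospitals: Catalogue) -> Tuple[FrozenSet[str], int]:
    """Try every feature set of size k and keep the one retaining most samples."""
    all_features = sorted(set().union(*(f for f, _ in hospitals.values())))
    candidates = (
        (frozenset(combo), usable_samples(combo, hospitals))
        for combo in combinations(all_features, k)
    )
    return max(candidates, key=lambda pair: pair[1])


# Show how the number of usable samples shrinks as more features are required.
for k in range(1, 5):
    chosen, samples = best_feature_set(k, HOSPITALS)
    print(k, sorted(chosen), samples)
```

In this toy catalogue, requiring all four features leaves only the patients of one hospital. A real method would also have to account for feature importance and for differing formats of the same feature, which this sketch ignores.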
Synthetic Data Generation for Rare Disease Detection
Research Objective:
We invite master’s students with a foundational understanding of machine learning, data analysis, and natural language processing (NLP) to contribute to our cutting-edge research focused on detecting rare diseases. The central challenge in this domain is the limited availability of sensitive data, which cannot be freely shared due to legal and bureaucratic constraints. Our goal is to overcome these hurdles by generating high-quality synthetic data that can be used to train robust machine learning models.
Key Research Components
- Synthetic Data Generation:
  - Develop and refine algorithms for generating synthetic datasets that accurately mimic real-world data of rare diseases.
  - Ensure the synthetic data maintains the statistical properties and nuances of the original datasets to enable effective model training.
- Privacy-Preserving Techniques:
  - Implement and evaluate methods to ensure the synthetic data upholds privacy standards and complies with legal restrictions.
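As a deliberately simple illustration of what "maintaining statistical properties" can mean, the sketch below fits a mean and standard deviation per numeric column and samples synthetic records from independent normal distributions. This is an assumed baseline for illustration only, not the method to be developed in the thesis: it ignores correlations between columns and gives no formal privacy guarantee. The records, column names and function names are made up.

```python
import random
import statistics
from typing import Dict, List, Tuple

Record = Dict[str, float]


def fit_marginals(rows: List[Record]) -> Dict[str, Tuple[float, float]]:
    """Estimate the mean and standard deviation of every numeric column."""
    return {
        col: (statistics.mean([r[col] for r in rows]),
              statistics.stdev([r[col] for r in rows]))
        for col in rows[0]
    }


def sample_synthetic(marginals: Dict[str, Tuple[float, float]],
                     n: int, seed: int = 0) -> List[Record]:
    """Draw n synthetic records, each column independently from a normal."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [
        {col: rng.gauss(mu, sigma) for col, (mu, sigma) in marginals.items()}
        for _ in range(n)
    ]


# Hypothetical "real" records; values are invented for illustration.
real = [
    {"age": 34.0, "crp": 2.1},
    {"age": 51.0, "crp": 8.4},
    {"age": 47.0, "crp": 5.0},
    {"age": 62.0, "crp": 12.3},
]
marginals = fit_marginals(real)
synthetic = sample_synthetic(marginals, n=5000)

# Compare a simple statistic between the real and the synthetic data.
print(statistics.mean(r["age"] for r in real))
print(statistics.mean(r["age"] for r in synthetic))
```

Comparing summary statistics as in the last two lines is only a first sanity check. Evaluating whether a model trained on the synthetic data performs well on real data, and whether individual patients can be re-identified, would require dedicated methods.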
Ideal Candidate Profile
We are looking for a motivated master’s student who is passionate about leveraging synthetic data generation and federated learning to address critical challenges in rare disease detection. The candidate should possess:
- Basic knowledge of machine learning and data analysis techniques.
- Some familiarity with natural language processing (NLP) (not compulsory).
- A keen interest in privacy-preserving technologies and healthcare applications.
Research Significance
This thesis offers an exciting opportunity to contribute to a highly impactful area of research. By generating synthetic data and employing federated learning, you will help develop innovative solutions that facilitate the safe and effective use of sensitive healthcare data. Your work will have the potential to improve rare disease detection and ultimately enhance patient outcomes.