Thesis Topics @ RöttgerLab

Federated Machine Learning

Vast amounts of data are collected and stored every day. Yet for some of the most pressing problems of our time, complex machine learning models, for instance for cancer treatment, are still trained on only a few thousand samples. How come? The core problem is that the data exists but must not be shared due to legal restrictions and bureaucratic hurdles, in particular when it comes to sensitive data.

In this project, you will help to alleviate this problem by utilizing an approach called federated machine learning. Here, the data remains at its secure storage site and the machine learning comes to the data. A small local model is trained at each site, and the local models are then combined into a global model while only anonymous model parameters are transmitted. With that, we will be able to train high-quality models without compromising data security and privacy.
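The idea of training locally and combining only parameters can be illustrated with a minimal sketch. It assumes one particular aggregation strategy, a sample-size-weighted average of locally fitted parameters, and a toy linear model; the function names `train_local_model` and `aggregate_models` are hypothetical and are not part of the FeatureCloud platform, whose actual API and aggregation methods may differ.

```python
from typing import List, Sequence, Tuple

# Hypothetical illustration: each site fits y = slope * x + intercept locally
# and shares only the two parameters plus its sample count.


def train_local_model(xs: Sequence[float], ys: Sequence[float]) -> Tuple[List[float], int]:
    """Fit a 1-D least-squares line on one site's private data.

    Returns ([slope, intercept], n_samples). The raw xs/ys are never returned.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return [slope, intercept], n


def aggregate_models(updates: Sequence[Tuple[List[float], int]]) -> List[float]:
    """Combine local parameters into global ones, weighting each site by its sample count."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(params[i] * n for params, n in updates) / total for i in range(dim)]


if __name__ == "__main__":
    # Two sites holding private data; only the parameters leave each site.
    site_a = train_local_model([0, 1, 2], [1, 3, 5])
    site_b = train_local_model([3, 4, 5, 6], [7, 9, 11, 13])
    print(aggregate_models([site_a, site_b]))
```

Weighting by sample count keeps a small site from influencing the global model as much as a large one. Real federated training typically repeats this exchange over several rounds and may add further privacy protections on top of the parameter exchange.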

The project is part of the large consortium project FeatureCloud. You will implement federated solutions for specific problems that integrate seamlessly into the FeatureCloud platform. Potential projects are:

  1. Federated Clustering Algorithms
  2. Federated Similarity Calculation
  3. Federated Cluster Evaluation
  4. Federated Network Enrichment

GWAS Study

Impact of fibre, red and processed meat on risk of chronic inflammatory diseases: a prospective UK Biobank cohort study on prognostic factors and personalised medicine in the UK Biobank

UK Biobank is an incredibly large data source that has collected data from 500,000 British individuals, enabling the investigation of gene-lifestyle interactions in relation to the development of diseases. The data is already available, and coping with its sheer volume requires advanced data science skills.

The aim of this project is to program statistical models that investigate gene-diet interactions in an efficient way. You will learn to perform observational and advanced computational analyses such as interaction studies, case-control studies, and case-only studies. Further, you will learn how to do network and pathway analyses.

You will be supervised by Richard Röttger and Vibeke Andersen, Professor and Research Leader at the Research Unit for Molecular Diagnostics and Clinical Research, University Hospital of Southern Denmark.

Practical Information. The study is designed as a Master's thesis project and can also be done in a team: two students each working on their own project, but collaborating on methods, interpretation, manuscript writing, etc., are preferred. You will be responsible for writing a manuscript under supervision. In-depth knowledge of medicine or biology is not crucial. The project could form the basis for a PhD study, if desired.

Rare Disease

Rare diseases collectively affect 30 million people in Europe. Patients undergo an expensive and time-consuming odyssey between doctors and hospitals before receiving a correct diagnosis. On average, the diagnosis takes 10 years. We want to assist medical researchers in identifying potential rare diseases using artificial intelligence and electronic health records. However, the challenges are two-fold:

  1. Rare diseases are, by definition, rare, meaning that we need to gather a lot of data from different hospitals around the world.
  2. As the data comes from around the world, it is not necessarily compatible between hospitals. Therefore, we would like to discover which data, or in other words, features, are available and compatible across a set of hospitals. There exists a plethora of features useful for determining a patient's rare disease.

We can address the first challenge using federated learning. A model is trained locally at each hospital, and only anonymous training parameters are sent to the central model. The patient data never leaves the hospital.

The second challenge is that data types and formats may not be compatible across hospitals, which makes federated learning difficult to apply. We need to implement methods to discover data compatibility between hospitals for federated learning. For example, if a medical researcher selects 10 features in one hospital, they may lose 10,000 potential samples in other hospitals. A medical researcher who wants to train a machine learning model for predicting rare diseases may therefore have specific requirements: What is the optimal number of features and samples? Which features are more important than others? This is an optimization problem that requires finding the optimal balance between features and samples. Furthermore, medical researchers need accessible software that meets these needs.
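The trade-off between features and samples can be made concrete with a small sketch. It assumes that each hospital stores records as dictionaries in which an unavailable feature is either absent or `None`, and that a record is usable only if every selected feature is present. The names `count_complete_records`, `total_usable_samples`, and `best_feature_subset` are hypothetical, and the exhaustive search shown here is only feasible for a small number of candidate features.

```python
from itertools import combinations
from typing import Dict, List, Optional, Sequence, Tuple

# Assumed record layout: feature name -> value, with None (or a missing key)
# meaning the feature is not available for that patient.
Record = Dict[str, Optional[float]]


def count_complete_records(records: List[Record], features: Sequence[str]) -> int:
    """Count records at one hospital that have every selected feature.

    In a federated setting this count would be computed locally, so only
    the number, not the records, would leave the hospital.
    """
    return sum(all(r.get(f) is not None for f in features) for r in records)


def total_usable_samples(hospitals: Dict[str, List[Record]], features: Sequence[str]) -> int:
    """Sum the per-hospital counts of usable records for a feature selection."""
    return sum(count_complete_records(recs, features) for recs in hospitals.values())


def best_feature_subset(
    hospitals: Dict[str, List[Record]], candidates: Sequence[str], k: int
) -> Tuple[Tuple[str, ...], int]:
    """Brute-force the size-k feature subset that keeps the most samples."""
    best = max(combinations(candidates, k), key=lambda fs: total_usable_samples(hospitals, fs))
    return best, total_usable_samples(hospitals, best)


if __name__ == "__main__":
    # Made-up toy data for illustration only.
    hospitals = {
        "hospital_a": [
            {"age": 54, "crp": 3.1, "bmi": 27.0},
            {"age": 61, "crp": None, "bmi": 31.2},
        ],
        "hospital_b": [
            {"age": 47, "bmi": 22.5},
            {"age": 70, "bmi": 29.9},
            {"age": 38, "crp": 1.2},
        ],
    }
    print(best_feature_subset(hospitals, ["age", "bmi", "crp"], k=2))
```

This sketch only maximizes the sample count for a fixed number of features. It does not account for how informative each feature is, which is part of the optimization problem described above.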

Overall, the goal of the project is to implement methods to find the optimal balance between the number of selected features and samples for training a machine learning model. Students are not expected to have prior knowledge of rare diseases or patient data.