PhD Studentship: Data Integration of ‘omic data using Differential Bayesian Networks.

Engineering & the Environment

Location: Highfield Campus

Closing Date: Wednesday 17 January 2018

Reference: 828117F2

Project Reference: NGCM-0102

Data integration of multiple sources is a key challenge in the Life Sciences, particularly for ‘omic data, where analysis of a single data type is often insufficient to explain the aetiology of complex traits. The assimilation of different sources of data is therefore essential to understand the underlying biological mechanisms linked with phenotype.

To be applicable to typical biological data sets, data integration methodologies must meet many computational challenges, from data size and heterogeneity to dimensionality and noise. Although a number of analytical approaches and software tools are available for the integration of biological data, new strategies are required to better understand the aetiology of complex traits.

Graphical models are often used to represent the conditional independence structure (i.e. the associations) between a set of variables. For example, a particular gene may be associated with a given disease, but this association may depend on several other variables. In an undirected graphical model there is no direction to these associations: one variable cannot be said to predict or cause the other. Bayesian networks, which are defined on directed acyclic graphs, add a direction to these associations. This means that, given the presence of a particular gene and the status of other variables, the probability of disease can be inferred. Differential Bayesian networks are a technique for determining where the association structures of two (or more) different populations are the same.
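As a purely illustrative sketch of the kind of inference described above, the following Python snippet builds a three-node Bayesian network (gene and exposure both pointing to disease) and computes the probability of disease given evidence by enumerating the joint distribution. The variable names, the network structure and every probability value are hypothetical and are not taken from the IoW F2 cohort or from the project itself.

```python
from itertools import product

# Hypothetical probabilities, chosen only for illustration.
P_GENE = {1: 0.2, 0: 0.8}        # P(gene variant present / absent)
P_EXPOSURE = {1: 0.3, 0: 0.7}    # P(environmental exposure present / absent)
# P(disease = 1 | gene, exposure): disease has two parents in the graph.
P_DISEASE = {(1, 1): 0.6, (1, 0): 0.3, (0, 1): 0.2, (0, 0): 0.05}


def joint(gene, exposure, disease):
    """Joint probability, factorised along the directed edges of the network."""
    p_d = P_DISEASE[(gene, exposure)]
    return P_GENE[gene] * P_EXPOSURE[exposure] * (p_d if disease else 1.0 - p_d)


def prob_disease(evidence):
    """Return P(disease = 1 | evidence) by summing over all assignments.

    `evidence` maps 'gene' and/or 'exposure' to 0 or 1; unobserved
    variables are marginalised out.
    """
    numerator = 0.0
    denominator = 0.0
    for gene, exposure, disease in product((0, 1), repeat=3):
        assignment = {"gene": gene, "exposure": exposure}
        if any(assignment[name] != value for name, value in evidence.items()):
            continue  # inconsistent with the observed evidence
        p = joint(gene, exposure, disease)
        denominator += p
        if disease == 1:
            numerator += p
    return numerator / denominator


p_given_gene = prob_disease({"gene": 1})                    # about 0.39
p_given_both = prob_disease({"gene": 1, "exposure": 1})     # about 0.6
```

Enumeration is only practical for a handful of variables; realistic ‘omic networks would need structure learning and approximate inference, which this sketch does not attempt.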

The main aim of the project is to develop a robust data integration technique for heterogeneous biological data sources, using a meta-dimensional transformation integration technique based on a Bayesian Network approach. Further, we will extend it by performing Differential Bayesian Network testing on the resulting networks, which could potentially differentiate phenotypes. To implement this method, we will use ‘omic data available from the Isle of Wight Third Generation (IoW F2) cohort, which will help us understand the gene-environment interaction in childhood asthma. This project will add value to the fields of both data integration and respiratory research, as it will develop a novel methodology for integrating ‘omic datasets as well as contributing to the understanding of childhood respiratory disease.

The main objectives for the PhD project are:

a. Data pre-processing and quality control: DNA methylation profiles of blood samples from the IoW F2 cohort will be processed using the standard methods in [6,7]. Microbiome samples will be pre-processed and QC-ed with the standard bioinformatic process used in [8]. QC-ed and pre-processed gene expression data for the matching samples are also available.

b. Data reduction: Explore and implement data reduction techniques to handle the high dimensionality of these three ‘omic data types while retaining biologically meaningful variables. This can be a combination of intrinsic (such as principal component analysis or factor analysis) and extrinsic (using an existing knowledge base) data reduction techniques.

c. Data integration: Develop, perform, and validate a meta-dimensional data integration method using Bayesian networks for heterogeneous sources of biological data. The meta-dimensional analyses are not hypothesis free and, therefore, we need to establish relationships between phenotypes and each genomic data set in the first instance. Then, based on phenotypic classification (for example, childhood asthma and non-asthma), we would implement transformation-based data integration techniques.

d. Web service: The last stage of the project is to implement a web service where external users can upload their ‘omic data and run a user-interactive data integration analysis. This will produce a user-friendly QC and analysis report to support further downstream analyses.
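To give a flavour of the intrinsic data reduction mentioned in objective (b), the sketch below extracts the first principal component of a small data matrix using power iteration in plain Python. The toy matrix, the function names and the iteration count are all assumptions made for illustration; a real analysis of ‘omic data would use an established numerical library and many more samples and features.

```python
import math
import random


def center(rows):
    """Subtract the column mean from every feature."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in rows]


def covariance(centered):
    """Sample covariance matrix (features x features)."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_component(cov, iterations=200, seed=0):
    """Leading eigenvector of the covariance matrix via power iteration."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in cov]  # strictly positive starting vector
    for _ in range(iterations):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("covariance matrix has no variance to explain")
        v = [x / norm for x in w]
    return v


def explained_ratio(cov, v):
    """Share of total variance captured by the unit vector v."""
    cv = [sum(c * x for c, x in zip(row, v)) for row in cov]
    eigenvalue = sum(a * b for a, b in zip(v, cv))
    return eigenvalue / sum(cov[i][i] for i in range(len(cov)))


# Hypothetical toy data: 4 samples x 3 features, the second feature
# being exactly twice the first and the third carrying no variance.
data = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0], [4.0, 8.0, 0.0]]
centered = center(data)
cov = covariance(centered)
pc1 = first_component(cov)
scores = [sum(x * c for x, c in zip(row, pc1)) for row in centered]
ratio = explained_ratio(cov, pc1)
```

Here the three features collapse onto a single score per sample, which is the sense in which principal component analysis reduces dimensionality; whether the resulting components remain biologically meaningful is a separate question that the project would need to address.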

The prospective candidate must be a highly motivated individual with at least an upper second-class degree in Computer Science, Bioinformatics, Physics or a related field, and a background and/or interest in Molecular Biology. Programming experience in a numerical computing environment (ideally R, Matlab, Perl, Python, Java, C or C++), data analytical techniques, and UNIX skills are desirable. An enthusiasm for real-world applications of complex mathematical ideas and a positive attitude towards interdisciplinary research are essential.

If you wish to discuss any details of the project informally, please contact Faisal Rezwan, Email:, Tel: +44 (0) 2380 482002

This project is run through participation in the EPSRC Centre for Doctoral Training in Next Generation Computational Modelling. For details of our 4 Year PhD programme, please see;=2652

For details of available projects click here

To apply, please use the following website:

Further details:

  • Job Description and Person Specification