When a population is challenged by a pathogen, responses vary greatly between individuals, depending on their past exposure to infections, on environmental and socio-economic factors, and on their genetic background. Indeed, we expect that the genomes we have inherited from our parents carry common mutations that slightly modify the proteins essential for our defenses against microbes. The Milieu Intérieur Project proposes to verify this expectation and to identify such mutations.
Defining host genetics
An essential goal of the Milieu Intérieur project is to quantify the contribution of host genetics to the normal immune response. To do so, we first have to identify the millions of mutations carried in the DNA of the 1,000 healthy individuals who participate in this project. We will first use genome-wide SNP genotyping arrays on the entire cohort, and then whole-exome sequencing. The millions of mutations found in the cohort will then be tested for association with the thousands of phenotypes obtained in the same individuals (that is, phenotypes selected in previous phases to capture several aspects of the human immune response). Among all these association signals, only a small fraction of the strongest ones will be considered convincing, and these will identify the genes and mutations that modulate our immune response to pathogens.
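To make the testing step concrete, here is a minimal sketch of a single mutation-phenotype association test, together with a significance threshold corrected for the number of tests. It is an illustration under stated assumptions, not the project's actual pipeline: it assumes a quantitative phenotype, an additive coding of genotypes as allele counts (0, 1 or 2), a normal approximation for the p-value, and a Bonferroni correction, which is likely conservative when phenotypes are correlated. The function names are hypothetical.

```python
from statistics import NormalDist, mean


def association_test(genotypes, phenotype):
    """Regress a quantitative phenotype on allele counts (0/1/2).

    Returns (slope, p_value). The p-value uses a normal approximation,
    which is reasonable for cohorts of about 1,000 individuals.
    """
    n = len(genotypes)
    g_mean, y_mean = mean(genotypes), mean(phenotype)
    sxx = sum((g - g_mean) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("monomorphic SNP: no genotype variation to test")
    sxy = sum((g - g_mean) * (y - y_mean) for g, y in zip(genotypes, phenotype))
    slope = sxy / sxx
    intercept = y_mean - slope * g_mean
    # Residual sum of squares around the fitted line.
    rss = sum((y - intercept - slope * g) ** 2 for g, y in zip(genotypes, phenotype))
    if rss == 0:
        return slope, 0.0  # perfect fit: the association is exact
    std_error = (rss / (n - 2) / sxx) ** 0.5
    z = slope / std_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return slope, p_value


def bonferroni_threshold(n_variants, n_phenotypes, alpha=0.05):
    """Per-test p-value threshold when every variant meets every phenotype."""
    return alpha / (n_variants * n_phenotypes)
```

With hypothetical round numbers such as one million variants and one thousand phenotypes, `bonferroni_threshold(1_000_000, 1_000)` gives a very stringent per-test threshold, which is why only the strongest signals survive.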
Our power to identify association signals
For a project of this scale, we had to verify that we have sufficient statistical power to identify strong mutation-phenotype association signals. This power depends mainly on two factors: the cohort size and the comprehensiveness (that is, the "coverage") of the genetic data obtained. Our cohort includes 1,000 individuals, which will allow us to identify associations with odds ratios greater than 1.5, a resolution substantially higher than that of most studies of cellular phenotypes published to date. We evaluated the coverage of several commercial SNP arrays and determined that the Illumina OmniExpress has high power to tag the entire human genome in the European-descent population, at an affordable price (see Visual). This SNP array was thus retained for the next phases of the LabEx MI.
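The way cohort size and coverage combine can be illustrated with a short power calculation. This is a simplified sketch, not the calculation performed for the project: it assumes a quantitative phenotype, a one-degree-of-freedom additive test, and a genome-wide significance threshold of 5e-8, and it models imperfect coverage through a hypothetical `tag_r2` parameter, the squared correlation between the genotyped SNP and the causal mutation.

```python
from statistics import NormalDist


def association_power(n, variance_explained, tag_r2=1.0, alpha=5e-8):
    """Approximate power of a 1-df association test for a quantitative trait.

    n:                  cohort size
    variance_explained: fraction of phenotypic variance due to the causal mutation
    tag_r2:             squared correlation between tag SNP and causal mutation
                        (1.0 means the causal mutation is genotyped directly)
    alpha:              two-sided significance threshold (assumed 5e-8)
    """
    std_normal = NormalDist()
    # Imperfect tagging attenuates the variance the tested SNP can explain.
    q2 = variance_explained * tag_r2
    # Non-centrality parameter of the 1-df chi-square test statistic.
    ncp = n * q2 / (1.0 - q2)
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    shift = ncp ** 0.5
    # A 1-df non-central chi-square is the square of a shifted standard normal.
    return (1.0 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)
```

Under these assumptions, comparing `association_power(1000, 0.05)` with `association_power(1000, 0.05, tag_r2=0.8)` shows how much power is lost when the array tags the causal mutation imperfectly, and comparing it with `association_power(500, 0.05)` shows the effect of halving the cohort.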
Precise methodologies to curate the SNP array data will soon be available on the Protocols webpage. Powerful statistical models will then be used to test for mutation-phenotype associations, accounting for demographic covariates, correlations among phenotypes, interactions among mutations and prior knowledge about the genomics of gene expression. Systems biology approaches will also be employed to search for cellular pathways enriched for association signals. The final integration of all of these results will help define the human genetic variation that is relevant to our normal immune response, and shed new light on our susceptibility to infectious diseases and our response to treatments and vaccines.
Power of commercial SNP arrays to tag the human genome.
High-coverage whole-genome sequences of European-descent individuals were filtered to retain only the SNPs present on each of five different commercial SNP arrays. The filtered data were then imputed with IMPUTE v. 2, using the 1000 Genomes Project phase 1 as a reference dataset. The number of high-quality imputed SNPs is reported in green in the left panel, while low-quality imputed SNPs are in red (low quality here corresponds to a certainty score reported by IMPUTE lower than 0.8). The right panel reports the concordance rate between imputed genotypes and actual genotypes, that is, the genotypes initially present in the non-filtered whole-genome sequences. All SNP arrays produced very high concordance rates.
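The two quantities shown in the panels can be sketched as follows. This is an illustrative sketch only: the function name and the data layout (dictionaries keyed by SNP identifier) are assumptions, it does not parse IMPUTE output files, and it computes concordance over all imputed SNPs passed in, since the caption does not state whether low-quality SNPs were excluded from that comparison.

```python
def summarize_imputation(imputed, truth, certainty, threshold=0.8):
    """Count high/low-quality imputed SNPs and compute genotype concordance.

    imputed, truth: dict mapping SNP id -> list of genotype calls (0, 1 or 2),
                    one call per individual, in the same order in both dicts.
    certainty:      dict mapping SNP id -> certainty score between 0 and 1.
    threshold:      scores below this value count as low quality (0.8 here).
    """
    high = sum(1 for snp in imputed if certainty[snp] >= threshold)
    low = len(imputed) - high

    matches = 0
    total = 0
    for snp, calls in imputed.items():
        for called, actual in zip(calls, truth[snp]):
            total += 1
            matches += called == actual  # True counts as 1

    concordance = matches / total if total else float("nan")
    return {"high_quality": high, "low_quality": low, "concordance": concordance}


# Hypothetical toy data: two SNPs, four individuals.
toy_imputed = {"snp_a": [0, 1, 2, 1], "snp_b": [0, 0, 1, 2]}
toy_truth = {"snp_a": [0, 1, 2, 1], "snp_b": [0, 1, 1, 2]}
toy_certainty = {"snp_a": 0.95, "snp_b": 0.60}
toy_summary = summarize_imputation(toy_imputed, toy_truth, toy_certainty)
```

To restrict the concordance rate to high-quality SNPs, the caller can filter `imputed` by certainty score before calling the function.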