Codes4strains: Tracking bacterial pathogens through sources, geography and time using stable phylogenetically informative genome codes
The implementation of genome sequencing in public health microbiology has allowed the natural variation exhibited by pathogenic bacteria to be leveraged for infectious disease surveillance and outbreak detection. Genotype information derived from whole-genome sequencing (WGS) allows the monitoring of pathogenic potential and the tracking of epidemic behaviour, to inform infection control, diagnostic and treatment practice. To track strains globally, and as they spread between the environment, food, animals and humans, universal strain nomenclatures are necessary. Two main strain nomenclature approaches currently exist. First, core genome Multilocus Sequence Typing (cgMLST) is widely applied for bacterial pathogen surveillance. It relies on predefined gene loci, the sequence variants of which are given unique identifiers (allelic numbers). The resulting allelic profiles are given unique identifiers (cgST) or are grouped based on their similarity, generally using the single-linkage clustering method. An alternative approach, known as the SNP address, was developed at Public Health England. Unlike cgMLST, it is based on single nucleotide polymorphisms (SNPs) called against a reference genome. Single-linkage clustering is performed based on the resulting SNP distances between isolates. An original concept of the SNP address is to apply several thresholds to allelic or SNP differences. The ‘address’ is a multi-position code, in which each position corresponds to cluster membership at one of a series of descending thresholds of genetic (SNP) distance among strains. The result is a multi-level nomenclature that provides a good approximation of the phylogenetic relatedness among isolates. Likewise, several cgMLST thresholds can be used to provide phylogenetic information in addition to classification, as was done for Listeria monocytogenes by the group of the main applicant.
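The multi-level address described above can be illustrated with a minimal sketch. It assumes a small pairwise distance matrix and three hypothetical thresholds (50, 10 and 0 differences); the function names `single_linkage` and `multilevel_address`, the thresholds and the data are illustrative only and do not reproduce any published implementation.

```python
from itertools import combinations


def single_linkage(dist, threshold):
    """Return one cluster label per isolate. Two isolates share a cluster
    if a chain of pairwise distances <= threshold connects them."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Follow parent links up to the representative of the cluster.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if dist[i][j] <= threshold:
            parent[find(i)] = find(j)

    # Number clusters 1, 2, 3... in order of first appearance.
    labels, out = {}, []
    for i in range(n):
        out.append(labels.setdefault(find(i), len(labels) + 1))
    return out


def multilevel_address(dist, thresholds):
    """Join the cluster labels obtained at each descending threshold."""
    levels = [single_linkage(dist, t) for t in thresholds]
    return [".".join(str(level[i]) for level in levels)
            for i in range(len(dist))]


# Hypothetical SNP distances between four isolates A, B, C, D.
dist = [
    [0,   3,   30,  200],
    [3,   0,   32,  201],
    [30,  32,  0,   210],
    [200, 201, 210, 0],
]
addresses = multilevel_address(dist, thresholds=(50, 10, 0))
print(addresses)  # ['1.1.1', '1.1.2', '1.2.3', '2.3.4']
```

In this toy example, isolates A and B share the first two positions of their address, which reflects their close relatedness, whereas isolate D differs from all others at the first position.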
Providing multi-level information on phylogenetic relatedness has proved helpful for epidemiological investigations and for prospective surveillance. This has facilitated outbreak detection and has provided a framework for case/control studies at different diversity levels, depending on the length or complexity of an outbreak. Further, utilising a flexible level of divergence to define an ‘outbreak type’ aids hypothesis generation and may, in some cases, allow the specific source of the outbreak to be identified by maximizing the power of case-control source attribution studies. SNP and cgMLST approaches have complementary characteristics. One strength of the cgMLST approach is its standardization (predefined sets of loci; unlike SNPs, which have proven difficult to standardize), which maximizes the applicability of the method for international or cross-sector strain comparisons where analysis is performed independently. Conversely, whole-genome SNPs are more discriminatory than cgMLST, which relies on a predefined set of ‘core’ loci. Therefore, SNP and cgMLST should be regarded as two useful approaches to be integrated jointly in future genomic epidemiology strategies. However, one major limitation of current SNP address or multi-level cgMLST classifications is that they utilise single-linkage clustering to define groups. This approach is unstable, as the fusion of predefined groups upon discovery of ‘intermediate’ genotypes is an inherent mathematical property of single-linkage clustering. This issue is pertinent within epidemiological timescales, where intermediate genotypes have a high probability of being sampled. It is the experience of both applicants' groups that the fusion of predefined groups is difficult to handle in practice and introduces nomenclatural confusion. Currently, no genomic nomenclature system for bacterial pathogens exists that combines complete stability of identifiers, high standardization and reproducibility, and high resolution.
This gap represents an important barrier for the field of genomic epidemiology and slows down communication and action against the transmission of pathogens across sectors, world regions and over long periods of time. We will address this critical gap in the present PhD project.
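The instability of single-linkage groups mentioned above can be shown with a toy sketch. It assumes that isolates are placed on a line, so that the absolute difference between positions stands in for an SNP or allelic distance; the function name `single_linkage_groups`, the positions and the threshold of 5 are hypothetical and chosen only for illustration.

```python
def single_linkage_groups(positions, threshold):
    """Group isolates by single linkage. The distance between two isolates
    is |pos_a - pos_b|, a toy stand-in for an SNP or allelic distance."""
    names = list(positions)
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links up to the representative of the group.
        while parent[x] != x:
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(positions[a] - positions[b]) <= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(group) for group in groups.values())


# Two well-separated groups at a threshold of 5 differences.
before = {"A": 0, "B": 2, "C": 10, "D": 12}
# A newly sampled 'intermediate' genotype E lies between B and C.
after = dict(before, E=6)

groups_before = single_linkage_groups(before, threshold=5)
groups_after = single_linkage_groups(after, threshold=5)
print(groups_before)  # [['A', 'B'], ['C', 'D']]
print(groups_after)   # [['A', 'B', 'C', 'D', 'E']]
```

Adding the single isolate E, which is within the threshold of both B and C, fuses the two previously distinct groups, so any identifiers assigned to them beforehand can no longer both be kept.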
About me: My name is Melanie and I am 24 years old. I obtained a Master’s degree in Bioinformatics and Modelling (BIM) from Sorbonne University in July 2018. This multidisciplinary course allowed me to deepen my knowledge of biology and to acquire skills in informatics, such as the use and implementation of bioinformatics tools. I have therefore chosen to focus my scientific profile on methodological development for the analysis of large genomic datasets.
What motivated me to do a PhD: My professional experience in the Unit “Biodiversity and Epidemiology of Bacterial Pathogens” at the Institut Pasteur allowed me to discover the field of molecular epidemiology and reinforced my interest in comparative genomics. In addition, pursuing doctoral studies will allow me to explore and design methods of bacterial nomenclature that will facilitate communication between different public health actors in the future. This project will also allow me to benefit from the expertise of the various laboratories in bacterial population genomics and bioanalysis in the field of microbiology.