Codes4strains: Tracking bacterial pathogens through sources, geography and time using stable phylogenetically informative genome codes

         Start: October 2019

         Duration: 3 years

         Research Domain: Foodborne Zoonoses, Antimicrobial Resistance

         Members: IP, INRA, ANSES (France); PHE (UK)

         Contact: Dr Sylvain Brisse

The Project

The implementation of genome sequencing in public health microbiology has allowed the natural variation exhibited by pathogenic bacteria to be leveraged for infectious disease surveillance and outbreak detection. Genotype information derived from whole-genome sequencing (WGS) allows the monitoring of pathogenic potential and the tracking of epidemic behaviour, to inform infection control, diagnostic and treatment practice. To track strains globally, and as they spread between the environment, food, animals and humans, universal strain nomenclatures are necessary. Two main approaches to strain nomenclature currently exist. First, core genome Multilocus Sequence Typing (cgMLST) is widely applied for bacterial pathogen surveillance. It relies on predefined gene loci, the sequence variants of which are given unique identifiers (allele numbers). The resulting allelic profiles are either given unique identifiers (cgST) or grouped based on their similarity, generally using the single-linkage clustering method. An alternative approach, known as the SNP address, was developed at Public Health England. Unlike MLST, it is based on single nucleotide polymorphisms (SNPs) identified by comparison to a reference genome. Single-linkage clustering is performed on the resulting SNP distances between isolates. An original concept of the SNP address is to apply several thresholds to allelic or SNP differences. The ‘address’ is a multi-position code, where each position corresponds to cluster membership at descending thresholds of genetic (SNP) distance among strains. The result is a multi-level nomenclature that provides a good approximation of the phylogenetic relatedness among isolates. Likewise, several cgMLST thresholds can be used to provide phylogenetic information in addition to classification, as was done for Listeria monocytogenes by the group of the main applicant.
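The multi-level idea described above can be illustrated with a minimal Python sketch. It is not the Public Health England implementation: the function names, the toy distance matrix and the thresholds (50, 10 and 5 SNPs) are all assumptions chosen for illustration, and cluster numbers are simply assigned in order of first appearance. Each isolate is clustered by single linkage at every threshold, and the cluster labels are joined from the loosest to the tightest threshold to form an address-like code.

```python
from itertools import combinations


def single_linkage(dist, threshold):
    """Label isolates so that isolates joined by a chain of pairwise
    distances <= threshold share the same cluster number."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if dist[i][j] <= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)

    # Number clusters 1, 2, 3, ... in order of first appearance.
    labels, ids = [], {}
    for i in range(n):
        root = find(i)
        ids.setdefault(root, len(ids) + 1)
        labels.append(ids[root])
    return labels


def addresses(dist, thresholds):
    """Build one code per isolate, one position per threshold,
    ordered from the largest threshold to the smallest."""
    per_level = [single_linkage(dist, t) for t in sorted(thresholds, reverse=True)]
    return [".".join(str(level[i]) for level in per_level) for i in range(len(dist))]


# Hypothetical pairwise SNP distances between four isolates A, B, C, D.
example_dist = [
    [0, 2, 8, 60],
    [2, 0, 7, 58],
    [8, 7, 0, 61],
    [60, 58, 61, 0],
]
example_addresses = addresses(example_dist, [5, 10, 50])
print(example_addresses)
```

In this toy data set, isolates A and B receive identical codes because they fall within the tightest threshold of each other, C shares only the two looser levels with them, and D differs at every level.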
Providing multi-level information on phylogenetic relatedness has proved helpful for epidemiological investigations and for prospective surveillance. This has facilitated outbreak detection and has provided a framework for case-control studies at different diversity levels, depending on the length or complexity of an outbreak. Further, utilising a flexible level of divergence to define an ‘outbreak type’ aids hypothesis generation and may, in some cases, allow the specific source of the outbreak to be identified by maximizing the power of case-control source attribution studies. SNP and cgMLST approaches have complementary characteristics. One strength of the cgMLST approach is its standardization (it uses predefined sets of loci, whereas SNP approaches have proven difficult to standardize), which maximizes the applicability of the method for international or cross-sector strain comparisons where analysis is performed independently. In turn, whole-genome SNPs are more discriminatory than cgMLST, which relies on a predefined set of ‘core’ loci. Therefore, SNP and cgMLST should be regarded as two useful approaches to be integrated jointly in future genomic epidemiology strategies. However, one major limitation of the current SNP address and multi-level cgMLST classifications is that they utilise single-linkage clustering to define groups. This approach is unstable, as the fusion of predefined groups upon discovery of ‘intermediate’ genotypes is an inherent mathematical property of single linkage. This issue is pertinent within epidemiological timescales, where intermediate genotypes have a high probability of being sampled. It is the experience of both applicants' groups that the fusion of predefined groups is a challenge to handle in practice and introduces nomenclatural confusion. Currently, no genomic nomenclature system for bacterial pathogens exists that combines complete stability of identifiers, high standardization and reproducibility, and high resolution.
This gap represents an important barrier to the field of genomic epidemiology and slows down communication and action against the transmission of pathogens across sectors, world regions and over long periods of time. We will address this critical gap in the present PhD project.
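The instability of single-linkage groups described above can be reproduced with a small Python sketch. The isolates, their SNP sets and the threshold of 2 are hypothetical, and the distance used here (the number of positions at which two isolates differ relative to a shared reference) is a simplifying assumption. Two groups that are distinct in the initial sample fuse into one as soon as a single intermediate genotype is added.

```python
def snp_distance(a, b):
    """Number of variant positions present in one isolate but not the other."""
    return len(a ^ b)


def single_linkage_groups(isolates, threshold):
    """Group isolates so that any chain of pairwise distances <= threshold
    places its members in the same group."""
    groups = []
    for name, snps in isolates.items():
        # Every existing group containing a close enough member is linked.
        linked = [
            g for g in groups
            if any(snp_distance(snps, isolates[m]) <= threshold for m in g)
        ]
        # The new isolate and all groups it links to become a single group.
        merged = {name}.union(*linked)
        groups = [g for g in groups if g not in linked] + [merged]
    return {frozenset(g) for g in groups}


# Hypothetical isolates, each described by its variant positions
# relative to a reference genome.
sample = {
    "A": set(),
    "B": {1, 2},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {1, 2, 3, 4, 5, 6, 7},
}
before = single_linkage_groups(sample, threshold=2)

# A newly sampled intermediate genotype lies within 2 SNPs of both B and C.
after = single_linkage_groups({**sample, "E": {1, 2, 3, 4}}, threshold=2)

print(sorted(sorted(g) for g in before))
print(sorted(sorted(g) for g in after))
```

Before isolate E is sampled, A and B form one group and C and D another. Once E is included, all five isolates belong to a single group, so any identifier previously assigned to either group can no longer be kept unchanged.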



