SARS-CoV-2 Variant Competition

SARS-CoV-2 is constantly evolving, with new variants competing against one another for dominance in different regions. This model integrates SARS-CoV-2 genotype sequencing data from around the world to estimate the growth advantages of different variants, which are then used to provide regional estimates of variant frequencies and of how these are changing over time. These estimates can be used by researchers who want to know which variants to focus on in their studies, or by public health officials who want to know which variants are likely to become dominant in their region.

The full set of model estimates, which includes estimates for countries other than Sweden, can be found in the GitHub repository of the Murrell research group, who conduct this research.

Global statistics on lineage competition

Advantage estimates

Growth rate advantages are estimated from all variant frequency data, globally. A variant with a higher growth rate advantage is expected to increase in frequency relative to other variants.
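
As a rough illustration of what a growth rate advantage implies, the sketch below assumes only two competing variants whose log-odds change linearly over time, which is the two-category case of a multinomial logistic model. The function name, the per-day units and the numbers are illustrative assumptions, not values or code from the model.

```python
import math

def frequency_after(p0, advantage_per_day, days):
    """Frequency of a variant competing against one other variant.

    Assumes the log-odds of the variant rise linearly with time at
    `advantage_per_day` (a hypothetical per-day growth rate advantage).
    """
    log_odds = math.log(p0 / (1.0 - p0)) + advantage_per_day * days
    return 1.0 / (1.0 + math.exp(-log_odds))

# Illustrative numbers only: a variant starting at 5% with a 0.05/day advantage.
for day in (0, 14, 28, 56):
    print(day, round(frequency_after(0.05, 0.05, day), 3))
```

Under these assumptions a positive advantage always increases the variant's share, and the speed of replacement depends only on the size of the advantage and the starting frequency.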

Growth advantage estimates for the top variants

For convergent mutations (those occurring independently at least three times), the contribution of each mutation to the growth rate advantage is estimated.

Estimates of contribution to growth of convergently occurring mutations

The relatedness of SARS-CoV-2 variants, with their estimated growth rate advantages, can be visualised in a phylogenetic tree. Only recent variants, and key ancestral variants, are shown. Lineages with low growth rate estimates are excluded.

Growth-annotated phylogeny

Variant trajectories

The “model average” variant frequencies are forecasts from the model with all region-specific effects set to zero. This provides a single “snapshot” of the global variant situation. It is not meant to be representative of the true global variant frequencies, since it is influenced by differences in sequencing coverage between regions, but it is useful for understanding the model’s estimates of how quickly one variant might be expected to replace others.
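
A minimal sketch of how such a “model average” could be computed, assuming each variant has a global intercept and a global growth rate and that frequencies follow a softmax over intercept plus rate times time. The function name, the variant labels and all numbers are hypothetical, and region-specific effects are simply left out, which corresponds to setting them to zero.

```python
import math

def model_average_frequencies(intercepts, growth_rates, t):
    """Softmax of (intercept + rate * t), using global terms only."""
    scores = {v: intercepts[v] + growth_rates[v] * t for v in intercepts}
    top = max(scores.values())  # subtract the maximum for numerical stability
    weights = {v: math.exp(s - top) for v, s in scores.items()}
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()}

# Hypothetical variants "A", "B" and "C" with made-up parameters.
intercepts = {"A": 0.0, "B": -2.0, "C": -4.0}
growth_rates = {"A": 0.0, "B": 0.03, "C": 0.08}
for t in (0, 30, 60):
    freqs = model_average_frequencies(intercepts, growth_rates, t)
    print(t, {v: round(f, 3) for v, f in freqs.items()})
```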

Variants are coloured such that related variants should be similar in colour.

Model average variant trajectories
Model average variant frequencies

Results from Sweden

Estimates of variant frequencies and growth rate advantages for Sweden are always included in the model. As with all data used in the model, Swedish genotype data comes from GISAID. Sequencing volumes are often low for Sweden, especially when the case counts themselves are low (and there are not many infections to sequence). In such cases, the estimates for variant frequencies in Sweden can be very uncertain. It is therefore important to treat the results of the model with caution when sequencing volumes are low.
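
To give a feel for why low sequencing volumes matter, the sketch below uses the simple binomial standard error of an observed frequency. This is an assumption made for illustration only: it is not the model's posterior uncertainty, which also draws on data from other regions through the hierarchical structure. The function name and numbers are hypothetical.

```python
import math

def frequency_standard_error(p, n_sequences):
    """Binomial standard error of an observed frequency p from n sequences."""
    return math.sqrt(p * (1.0 - p) / n_sequences)

# Illustrative only: the same observed frequency at different sequencing volumes.
for n in (20, 200, 2000):
    print(n, round(frequency_standard_error(0.3, n), 3))
```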

SARS-CoV-2 genotype volumes
Variant trajectories in Sweden
Variant frequencies in Sweden

Methods

Lineage dynamics are modelled using a Bayesian multinomial logistic regression over lineage counts. The latest global GISAID SARS-CoV-2 dataset (obtained via bulk download) is filtered to include only sequences with collection dates within the last 100 days. Lineage assignment is performed using Nextclade, retaining only sequences with a “good” overall quality control (QC) status and >90% coverage. Daily lineage counts are aggregated by region (including only countries with sufficient recent sequencing volumes), with low-frequency sub-lineages (too rare to model) merged into their most recent ancestors.
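
The filtering and aggregation steps above can be sketched as follows. This is not the group's pipeline: the record keys, the function names, the `parent_of` mapping and the `min_total` threshold are all hypothetical, and the rule used here for deciding which lineages are “too rare” is an assumption, since the text does not specify one.

```python
from collections import Counter
from datetime import date, timedelta

def daily_lineage_counts(records, today, window_days=100, min_coverage=0.9):
    """Count sequences per (region, collection date, lineage) after filtering.

    `records` is an iterable of dicts with hypothetical keys:
    'region', 'date' (datetime.date), 'lineage', 'qc_status', 'coverage'.
    """
    cutoff = today - timedelta(days=window_days)
    counts = Counter()
    for r in records:
        if r["date"] < cutoff or r["date"] > today:
            continue  # outside the recent window
        if r["qc_status"] != "good" or r["coverage"] <= min_coverage:
            continue  # fails the quality filters
        counts[(r["region"], r["date"], r["lineage"])] += 1
    return counts

def merge_rare_lineages(counts, parent_of, min_total):
    """Fold lineages with fewer than `min_total` sequences into an ancestor.

    `parent_of` maps a lineage to its parent and is assumed to form a tree.
    """
    totals = Counter()
    for (_, _, lineage), n in counts.items():
        totals[lineage] += n

    def target(lineage):
        # Walk up the ancestry until a sufficiently common lineage is found.
        while totals[lineage] < min_total and lineage in parent_of:
            lineage = parent_of[lineage]
        return lineage

    merged = Counter()
    for (region, day, lineage), n in counts.items():
        merged[(region, day, target(lineage))] += n
    return merged
```

The restriction to countries with sufficient recent sequencing volumes is omitted from this sketch; it could be added as a further filter on per-region totals.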

Growth rates are modelled with a hierarchical approach. Each lineage’s growth rate in a given region is the sum of a global rate and a region-specific random effect. The global rate for each lineage comprises three components: i) branch-specific terms for each branch ancestral to the lineage, ii) terms for convergent spike mutations occurring 3 or more times independently that are present in the lineage, and iii) a lineage-specific term. This parameterisation allows for shared evidence when mutations occur across multiple branches, and phylogenetic heritability of growth rates, such that growth rates for closely-related lineages are more likely, under the model’s prior, to be similar to one another. Recombinants inherit weighted mixtures of their multiple parents’ growth terms. Lineage-specific intercept terms, which control the relative timing of the emergence of variants, comprise a global shared term and region-specific random effects.
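
The additive structure of the growth rates can be sketched as below. All names are hypothetical, and the code makes two assumptions not stated in the text: branch terms are indexed by the ancestor they lead to, and recombinants are left out because the weighting of their parents' terms is not specified here.

```python
def global_growth_rate(lineage, parent_of, branch_term, mutation_term,
                       lineage_term, mutations_in):
    """Sum the three components of a lineage's global growth rate."""
    # iii) lineage-specific term
    rate = lineage_term.get(lineage, 0.0)
    # ii) terms for convergent mutations carried by the lineage
    rate += sum(mutation_term.get(m, 0.0) for m in mutations_in.get(lineage, ()))
    # i) terms for every branch ancestral to the lineage
    node = lineage
    while node in parent_of:
        node = parent_of[node]
        rate += branch_term.get(node, 0.0)
    return rate

def regional_growth_rate(global_rate, region_effect):
    """Growth rate in one region: global rate plus a region-specific effect."""
    return global_rate + region_effect

# Made-up lineage tree and parameter values, for illustration only.
parent_of = {"A.1.1": "A.1", "A.1": "A"}
branch_term = {"A": 0.01, "A.1": 0.02}
mutation_term = {"S:mut1": 0.03}
lineage_term = {"A.1.1": 0.004}
mutations_in = {"A.1.1": ["S:mut1"]}

g = global_growth_rate("A.1.1", parent_of, branch_term, mutation_term,
                       lineage_term, mutations_in)
print(g, regional_growth_rate(g, -0.01))
```

Because closely related lineages share most of their ancestral branch terms, their global rates differ only through the few terms they do not share, which is one way to read the phylogenetic heritability described above.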

Gaussian priors (centred on zero) are used for each parameter type, with Gaussian hyperpriors over the log of their standard deviations. Global and region-specific parameters are sampled jointly from the posterior, using all of the global data, with Hamiltonian Monte Carlo and the No-U-Turn sampler, as implemented in the Julia package AdvancedHMC.jl.
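
The prior structure for a single parameter type can be written as a log density, sketched here in Python rather than Julia. The function names and the hyperprior values are assumptions for illustration; the actual hyperparameters are not given in the text, and this is not the AdvancedHMC.jl implementation.

```python
import math

def normal_logpdf(x, mean, sd):
    """Log density of a Normal(mean, sd) distribution at x."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sd)
            - 0.5 * ((x - mean) / sd) ** 2)

def log_prior(params, log_sd, hyper_mean, hyper_sd):
    """Log prior density for one parameter type.

    params ~ Normal(0, exp(log_sd)); log_sd ~ Normal(hyper_mean, hyper_sd).
    The sampler is assumed to work directly with log_sd, so no Jacobian
    correction is needed.
    """
    sd = math.exp(log_sd)
    lp = normal_logpdf(log_sd, hyper_mean, hyper_sd)
    lp += sum(normal_logpdf(p, 0.0, sd) for p in params)
    return lp

# Made-up values: small parameters are favoured when the standard deviation is small.
print(log_prior([0.01, -0.02], log_sd=-3.0, hyper_mean=-3.0, hyper_sd=1.0))
```

A full log posterior would add the multinomial log likelihood of the observed lineage counts to the sum of such terms over all parameter types.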

The Pango designations are used for lineage names in all of the plots produced using the model.

The Murrell group gratefully acknowledges all data contributors, i.e. the Authors and their Originating Laboratories responsible for obtaining the specimens, and their Submitting Laboratories that generated the genetic sequence and metadata and shared via the GISAID Initiative the data on which this research is based.