B 281, 20140732 (2014). Ge, X. et al. Evol. Pangolin was developed to implement the dynamic nomenclature of SARS-CoV-2 lineages, known as the Pango nomenclature. It is clear from our analysis that viruses closely related to SARS-CoV-2 have been circulating in horseshoe bats for many decades. Note that breakpoints can be shared between sequences if they are descendants of the same recombination events. Preprint at https://doi.org/10.1101/2020.04.20.052019 (2020). Phylogenies of subregions of NRR1 depict an appreciable degree of spatial structuring of the bat sarbecovirus population across different regions (Fig. There are outstanding evolutionary questions on the recent emergence of human coronavirus SARS-CoV-2 including the role of reservoir species, the role of recombination and its time of divergence from animal viruses. Mol. A second breakpoint-conservative approach was conservative with respect to breakpoint identification, but this means that it is accepting of false-negative outcomes in breakpoint inference, resulting in less certainty that a putative NRR truly contains no breakpoints. Wang, H., Pipes, L. & Nielsen, R. Synonymous mutations and the molecular evolution of SARS-Cov-2 origins. 4), but also by markedly different evolutionary rates. Our third approach involved identifying breakpoints and masking minor recombinant regions (with gaps, which are treated as unobserved characters in probabilistic phylogenetic approaches). Concatenated region ABC is NRR1. 1, vev016 (2015). Emerg. Holmes, E. C. The Evolution and Emergence of RNA Viruses (Oxford Univ. Divergence time estimates based on the HCoV-OC43-centred rate prior for the separate BFRs (Supplementary Table 3) show consistency in TMRCA estimates across the genome. is funded by the MRC (no. Sequence similarity. and T.A.C. Preprint at https://doi.org/10.1101/2020.02.10.942748 (2020). Bayesian evaluation of temporal signal in measurably evolving populations. 1c). Nucleotide positions for phylogenetic inference are 147695, 9621,686 (first tree), 3,6259,150 (second tree, also BFR B), 9,26111,795 (third tree, also BFR C), 12,44319,638 (fourth tree) and 23,63124,633, 24,79525,847, 27,70228,843 and 29,57430,650 (fifth tree). From this perspective, it may be useful to perform surveillance for more closely related viruses to SARS-CoV-2 along the gradient from Yunnan to Hubei. Biol. Coronavirus: Pangolins found to carry related strains. eLife 7, e31257 (2018). Next, we (1) collected all breakpoints into a single set, (2) complemented this set to generate a set of non-breakpoints, (3) grouped non-breakpoints into contiguous BFRs and (4) sorted these regions by length. Nature 579, 270273 (2020). a, Breakpoints identified by 3SEQ illustrated by percentage of sequences (out of 68) that support a particular breakpoint position. RegionC showed no PI signals within it. Here, we analyse the evolutionary history of SARS-CoV-2 using available genomic data on sarbecoviruses. Biol. July 26, 2021. Zhou, H. et al. 3) to examine the sensitivity of date estimates to this prior specification. A new coronavirus associated with human respiratory disease in China. Duchene, S. et al. Share . Nature 579, 265269 (2020). The latter was reconstructed using IQTREE66 v.2.0 under a general time-reversible (GTR) model with a discrete gamma distribution to model inter-site rate variation. PubMed Across a large region of the virus genome, corresponding approximately to ORF1b, it did not cluster with any of the known bat coronaviruses indicating that recombination probably played a role in the evolutionary history of these viruses5,7. Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia. N. Engl. Bayesian evolutionary rate and divergence date estimates were shown to be consistent for these three approaches and for two different prior specifications of evolutionary rates based on HCoV-OC43 and MERS-CoV. Our most conservative approach attempted to ensure that putative NRRs had no mosaic or phylogenetic incongruence signals. N. China corresponds to Jilin, Shanxi, Hebei and Henan provinces, and the N. China clade also includes one sequence sampled in Hubei Province in 2004. For the HCoV-OC43, MERS-CoV and SARS datasets we specified flexible skygrid coalescent tree priors. While pangolins could be acting as intermediate hosts for bat viruses to get into humansthey develop severe respiratory disease38 and commonly come into contact with people through traffickingthere is no evidence that pangolin infection is a requirement for bat viruses to cross into humans. M.F.B., P.L. We showed that severe acute respiratory syndrome coronavirus 2 is probably a novel recombinant virus. We thank A. Chan and A. Irving for helpful comments on the manuscript. & Minh, B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Using the most conservative approach to identification of a non-recombinant genomic region (NRR1), SARS-CoV-2 forms a sister lineage with RaTG13, with genetically related cousin lineages of coronavirus sampled in pangolins in Guangdong and Guangxi provinces (Fig. This leaves the insertion of polybasic. Google Scholar. Boni, M. F., Zhou, Y., Taubenberger, J. K. & Holmes, E. C. Homologous recombination is very rare or absent in human influenza A virus. Below, we report divergence time estimates based on the HCoV-OC43-centred rate prior for NRR1, NRR2 and NRA3 and summarize corresponding estimates for the MERS-CoV-centred rate priors in Extended Data Fig. Trova, S. et al. The variable-loop region in SARS-CoV-2 shows closer identity to the 2019 pangolin coronavirus sequence than to the RaTG13 bat virus, supported by phylogenetic inference (Fig. Maclean, O. In the variable-loop region, RaTG13 diverges considerably with the TMRCA, now outside that of SARS-CoV-2 and the Pangolin Guangdong 2019 ancestor, suggesting that RaTG13 has acquired this region from a more divergent and undetected bat lineage. 23, 18911901 (2006). Genetics 176, 10351047 (2007). Nature 583, 286289 (2020). Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding. To evaluate the performance procedure, we confirmed that the recombination masking resulted in (1) a markedly different outcome of the PHI test64, (2) removal of well-supported (bootstrap value >95%) incompatible splits in Neighbor-Net65 and (3) a near-complete reduction of mosaic signal as identified by 3SEQ. [12] We compiled a dataset including 27human coronavirus OC43 virus genomes and ten related animal virus genomes (six bovine, three white-tailed deer and one canine virus). We compiled a set of 69SARS-CoV genomes including 58 sampled from humans and 11 sampled from civets and raccoon dogs. Time-measured phylogenetic reconstruction was performed using a Bayesian approach implemented in BEAST42 v.1.10.4. Menachery, V. D. et al. Gorbalenya, A. E. et al. Mol. Originally, PANGOLIN used a maximum-likelihood-based assignment algorithm to assign query SARS-CoV-2 the most likely lineage sequence. Despite the high frequency of recombination among bat viruses, the block-like nature of the recombination patterns across the genome permits retrieval of a clean subalignment for phylogenetic analysis. & Muhire, B. RDP4: Detection and analysis of recombination patterns in virus genomes. Wu, Y. et al. 87, 62706282 (2013). Region A has been shortened to A (5,017nt) based on potential recombination signals within the region. Graham, R. L. & Baric, R. S. Recombination, reservoirs, and the modular spike: mechanisms of coronavirus cross-species transmission. Epidemiology, genetic recombination, and pathogenesis of coronaviruses. While there is evidence of positive selection in the sarbecovirus lineage leading to RaTG13/SARS-CoV-2 (ref. 68, 10521061 (2019). Of importance for future spillover events is the appreciation that SARS-CoV-2 has emerged from the same horseshoe bat subgenus that harbours SARS-like coronaviruses. The ongoing pandemic spread of a new human coronavirus, SARS-CoV-2, which is associated with severe pneumonia/disease (COVID-19), has resulted in the generation of tens of thousands of virus . Extended Data Fig. Biol. These authors contributed equally: Maciej F. Boni, Philippe Lemey. Evol. 874850). Lam, H. M., Ratmann, O. Yres, D. L. et al. Nat. the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Microbiol. A hypothesis of snakes as intermediate hosts of SARS-CoV-2 was posited during the early epidemic phase54, but we found no evidence of this55,56; see Extended Data Fig. When the first genome sequence of SARS-CoV-2, Wuhan-Hu-1, was released on 10January 2020 (GMT) on Virological.org by a consortium led by Zhang6, it enabled immediate analyses of its ancestry. Subsequently a bat sarbecovirusRaTG13, sampled from a Rhinolophus affinis horseshoe bat in 2013 in Yunnan Provincewas reported that clusters with SARS-CoV-2 in almost all genomic regions with approximately 96% genome sequence identity2. Genetics 172, 26652681 (2006). In regionA, we removed subregion A1 (ntpositions 3,8724,716 within regionA) and subregion A4 (nt1,6422,113) because both showed PI signals with other subregions of regionA. Publishers note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. Aiewsakun, P. & Katzourakis, A. Time-dependent rate phenomenon in viruses. Ji, W., Wang, W., Zhao, X., Zai, J. In the absence of a strong temporal signal, we sought to identify a suitable prior rate distribution to calibrate the time-measured trees by examining several coronaviruses sampled over time, including HCoV-OC43, MERS-CoV, and SARS-CoV virus genomes. Wong, A. C. P., Li, X., Lau, S. K. P. & Woo, P. C. Y. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. EPI_ISL_410721) and Beijing Institute of Microbiology and Epidemiology (W.-C. Cao, T.T.-Y.L., N. Jia, Y.-W. Zhang, J.-F. Jiang and B.-G. Jiang, nos. 6, 8391 (2015). The coronavirus genome that these researchers had assembled, from pangolin lung-tissue samples, contained some gene regions that were ninety-nine per cent similar to equivalent parts of the SARS . 26 March 2020. DRAGEN COVID Lineage App This app aligns reads to a SARS-CoV-2 reference genome and reports coverage of targeted regions. wrote the first draft of the manuscript, and all authors contributed to manuscript editing. performed recombination analysis for non-recombining regions1 and 2, breakpoint analysis and phylogenetic inference on recombinant segments. Because coronaviruses are known to be highly recombinant, we used three different approaches to identify non-recombinant regions for use in our Bayesian time-calibrated phylogenetic inference. Lemey, P., Minin, V. N., Bielejec, F., Pond, S. L. K. & Suchard, M. A. Wu, F. et al. 1a-c ), has the third-highest number of confirmed COVID-19 cases in the state of So. CAS One study suggests that over a century ago, one lineage of coronavirus circulating in bats gave rise to SARS-CoV-2, RaTG13 and a Pangolin coronavirus known as Pangolin-2019, Live Science . Our approach resulted in similar posterior rates using two different prior means, implying that the sarbecovirus data do inform the rate estimate even though a root-to-tip temporal signal was not apparent. Genet. CNN . The boxplots show divergence time estimates (posterior medians) for SARS-CoV-2 (red) and the 20022003 SARS-CoV virus (blue) from their most closely related bat virus. If the latter still identified non-negligible recombination signal, we removed additional genomes that were identified as major contributors to the remaining signal. CAS Lie, P., Chen, W. & Chen, J.-P. RegionB is 5,525nt long. Martin, D. P., Murrell, B., Golden, M., Khoosal, A. Nguyen, L.-T., Schmidt, H. A., Von Haeseler, A. Accurate estimation of ages for deeper nodes would require adequate accommodation of time-dependent rate variation. However, for several reasons, nucleotide sequences may be generated that cover only the spike gene of SARS-CoV-2. Researchers in the UK had just set the scientific world . 24, 490502 (2016). Duchene, S., Holmes, E. C. & Ho, S. Y. W. Analyses of evolutionary dynamics in viruses are hindered by a time-dependent bias in rate estimates. b, Similarity plot between SARS-CoV-2 and several selected sequences including RaTG13 (black), SARS-CoV (pink) and two pangolin sequences (orange). Another similarity between SARS-CoV and SARS-CoV-2 is their divergence time (4070years ago) from currently known extant bat virus lineages (Fig. In this study, we report the case of a child with severe combined immu presenting a prolonged severe acute respiratory syndrome coronavirus 2 infection. Lond. J. Med Virol. Sci. Scientists trying to trace the ancestry of SARS-CoV-2, the virus responsible for COVID-19, have found the pangolin is unlikely to be the source of the virus responsible for the current pandemic. 88, 70707082 (2014). 5). 3 Priors and posteriors for evolutionary rate of SARS-CoV-2. the development of viral diversity. Of the nine breakpoints defining these ten BFRs, four showed phylogenetic incongruence (PI) signals with bootstrap support >80%, adopting previously published criteria on using a combination of mosaic and PI signals to show evidence of past recombination events19. Microbiol. To gauge the length of time this lineage has circulated in bats, we estimate the time to the most recent common ancestor (TMRCA) of SARS-CoV-2 and RaTG13. We thank all authors who have kindly deposited and shared genome data on GISAID. Virus Evol. The fact that these estimates lie between the rates for MERS-CoV and HCoV-OC43 is consistent with the intermediate sampling time range of about 18years (Fig. Means and 95% HPD intervals are 0.080 [0.0580.101] and 0.530 [0.3040.780] for the patristic distances between SARS-CoV-2 and RaTG13 (green) and 0.143 [0.1090.180] and 0.154 [0.0930.231] for the patristic distances between SARS-CoV-2 and Pangolin 2019 (orange). 32, 268274 (2014). Extended Data Fig. 56, 152179 (1992). Pangolin was developed to implement the dynamic nomenclature of SARS-CoV-2 lineages, known as the Pango nomenclature. The divergence time estimates for SARS-CoV-2 and SARS-CoV from their respective most closely related bat lineages are reasonably consistent among the three approaches we use to eliminate the effects of recombination in the alignment. Alexandre Hassanin, Vuong Tan Tu, Gabor Csorba, Nicola F. Mller, Kathryn E. Kistler & Trevor Bedford, Jack M. Crook, Ivana Murphy, Diana Bell, Simon Pollett, Matthew A. Conte, Irina Maljkovic Berry, Yatish Turakhia, Bryan Thornlow, Russell Corbett-Detig, Nature Microbiology This underscores the need for a global network of real-time human disease surveillance systems, such as that which identified the unusual cluster of pneumonia in Wuhan in December 2019, with the capacity to rapidly deploy genomic tools and functional studies for pathogen identification and characterization. GARD identified eight breakpoints that were also within 50nt of those identified by 3SEQ. SARS-CoV-2 and RaTG13 are the most closely related (their most recent common ancestor nodes denoted by green circles), except in the 222-nt variable-loop region of the C-terminal domain (bar graphs at bottom). Virological.org http://virological.org/t/ncovs-relationship-to-bat-coronaviruses-recombination-signals-no-snakes-no-evidence-the-2019-ncov-lineage-is-recombinant/331 (2020). With horseshoe bats currently the most plausible origin of SARS-CoV-2, it is important to consider that sarbecoviruses circulate in a variety of horseshoe bat species with widely overlapping species ranges57. Background & objectives: Several phylogenetic classification systems have been devised to trace the viral lineages of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Sequences were aligned by MAFTT58 v.7.310, with a final alignment length of 30,927, and used in the analyses below. Because 3SEQ identified ten BFRs >500nt, we used GARDs (v.2.5.0) inference on 10, 11 and 12 breakpoints. Using the most conservative approach (NRR1), the divergence time estimate for SARS-CoV-2 and RaTG13 is 1969 (95% HPD: 19302000), while that between SARS-CoV and its most closely related bat sequence is 1962 (95% HPD: 19321988); see Fig. It is RaTG13 that is more divergent in the variable-loop region (Extended Data Fig. Zhou et al.2 concluded from the genetic proximity of SARS-CoV-2 to RaTG13 that a bat origin for the current COVID-19 outbreak is probable. We infer time-measured evolutionary histories using a Bayesian phylogenetic approach while incorporating rate priors based on mean MERS-CoV and HCoV-OC43 rates and with standard deviations that allow for more uncertainty than the empirical estimates for both viruses (see Methods). Centre for Genomic Pathogen Surveillance. Influenza viruses reassort17 but they do not undergo homologous recombination within RNA segments18,19, meaning that origins questions for influenza outbreaks can always be reduced to origins questions for each of influenzas eight RNA segments. PubMed Use the Previous and Next buttons to navigate the slides or the slide controller buttons at the end to navigate through each slide. Further information on research design is available in the Nature Research Reporting Summary linked to this article. It performs: K-mer based detection Map/align, variant calling Consensus sequence generation Lineage/clade analysis using Pangolin and NextClade Access the DRAGEN COVID Lineage App on BaseSpace Sequence Hub 5. The presence in pangolins of an RBD very similar to that of SARS-CoV-2 means that we can infer this was also probably in the virus that jumped to humans. 95% credible interval bars are shown for all internal node ages. Lancet 395, 949950 (2020). Evolutionary rate estimation can be profoundly affected by the presence of recombination50. Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Extended Data Fig. The estimated divergence times for the pangolin virus most closely related to the SARS-CoV-2/RaTG13 lineage range from 1851 (1730-1958) to 1877 (1746-1986), indicating that these pangolin . Boxplots show interquartile ranges, white lines are medians and box whiskers show the full range of posterior distribution. In outbreaks of zoonotic pathogens, identification of the infection source is crucial because this may allow health authorities to separate human populations from the wildlife or domestic animal reservoirs posing the zoonotic risk9,10. Boxes show 95% HPD credible intervals. The red and blue boxplots represent the divergence time estimates for SARS-CoV-2 (red) and the 2002-2003 SARS-CoV (blue) from their most closely related bat virus, with the light- and dark-colored versions based on the HCoV-OC43 and MERS-CoV centered priors, respectively. Specifically, progenitors of the RaTG13/SARS-CoV-2 lineage appear to have recombined with the Hong Kong clade (with inferred breakpoints at 11.9 and 20.8kb) to form the CoVZXC21/CoVZC45-lineage. All authors contributed to analyses and interpretations. 2 Lack of root-to-tip temporal signal in SARS-CoV-2. Smuggled pangolins were carrying viruses closely related to the one sweeping the world, say scientists. Evol. c, Maximum likelihood phylogenetic trees rooted on a 2007 virus sampled in Kenya (BtKy72; root truncated from images), shown for five BFRs of the sarbecovirus alignment. The unsampled diversity descended from the SARS-CoV-2/RaTG13 common ancestor forms a clade of bat sarbecoviruses with generalist propertieswith respect to their ability to infect a range of mammalian cellsthat facilitated its jump to humans and may do so again. and P.L.) Our results indicate the presence of a single lineage circulating in bats with properties that allowed it to infect human cells, as previously described for bat sarbecoviruses related to the first SARS-CoV lineage29,30,31. 2, vew007 (2016). We named the length-sorted BFRs as: BFRA (ntpositions 13,29119,628, length=6,338nt), BFRB (ntpositions 3,6259,150, length=5,526nt), BFRC (ntpositions 9,26111,795, length=2,535nt), BFRD (ntpositions 27,70228,843, length=1,142nt) and six further regions (EJ). 110. Open reading frames are shown above the breakpoint plot, with the variable-loop region indicated in the Sprotein. This new approach classifies the newly sequenced genome against all the diverse lineages present instead of a representative select sequences. Evol. 4 we compare these divergence time estimates to those obtained using the MERS-CoV-centred rate priors for NRR1, NRR2 and NRA3. 62,63), the GTR+ model and 100bootstrap replicateswas inferred for each BFR >500nt. The ongoing pandemic spread of a new human coronavirus, SARS-CoV-2, which is associated with severe pneumonia/disease (COVID-19), has resulted in the generation of tens of thousands of virus genome sequences. 5, 536544 (2020). obtained the genome sequences of 10 SARS-CoV-2 virus strains through nanopore sequencing of nasopharyngeal swabs in Malta and analyzed the assembled genome with pangolin software, and the results showed that these virus strains were assigned to B.1 lineage, indicating that SARS-CoV-2 was widely spread in Europe (Biazzo et al., 2021). Zhou, P. et al. Stegeman, A. et al. The most parsimonious explanation for these shared ACE2-specific residues is that they were present in the common ancestors of SARS-CoV-2, RaTG13 and Pangolin Guangdong 2019, and were lost through recombination in the lineage leading to RaTG13. https://doi.org/10.1093/molbev/msaa163 (2020). A reduced sequence set of 25sequences chosen to capture the breadth of diversity in the sarbecoviruses (obvious recombinants not involving the SARS-CoV-2 lineage were also excluded) was used because GARD is computationally intensive. We thank T. Bedford for providing M.F.B. 382, 11991207 (2020). collected SARS-CoV data and assisted in analyses of SARS-CoV and SARS-CoV-2 data. Google Scholar. Nevertheless, the viral population is largely spatially structured according to provinces in the south and southeast on one lineage, and provinces in the centre, east and northeast on another (Fig. This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository. PubMed Central PubMed Central (Yes, Pango is a tongue-in-cheek reference to pangolins, which were briefly suspected to have had a role in the coronavirus's originseveral of the team's computational tools are named after. To begin characterizing any ancestral relationships for SARS-CoV-2, NRRs of the genome must be identified so that reliable phylogenetic reconstruction and dating can be performed. Because the estimated rates and divergence dates were highly similar in the three datasets analysed, we conclude that our estimates are robust to the method of identifying a genomes NRRs. Evol. MC_UU_1201412). Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic, https://doi.org/10.1038/s41564-020-0771-4. We used an uncorrelated relaxed clock model with log-normal distribution for all datasets, except for the low-diversity SARS data for which we specified a strict molecular clock model. We used TreeAnnotator to summarize posterior tree distributions and annotated the estimated values to a maximum clade credibility tree, which was visualized using FigTree. A., Filip, I., AlQuraishi, M. & Rabadan, R. Recombination and lineage-specific mutations led to the emergence of SARS-CoV-2. Bryant, D. & Moulton, V. Neighbor-Net: an agglomerative method for the construction of phylogenetic networks. Boni, M. F., de Jong, M. D., van Doorn, H. R. & Holmes, E. C. Guidelines for identifying homologous recombination events in influenza A virus. 3) clusters with viruses from provinces in the centre, east and northeast of China. 1) and thus likely to be the product of recombination, acquiring a divergent variable loop from a hitherto unsampled bat sarbecovirus28. Conducting analogous analyses of codon usage bias as Ji et al. & Li, X. Crossspecies transmission of the newly identified coronavirus 2019nCoV. Mol. ISSN 2058-5276 (online). MERS-CoV data were subsampled to match sample sizes with SARS-CoV and HCoV-OC43. In Extended Data Fig. A., Lytras, S., Singer, J. To employ phylogenetic dating methods, recombinant regions of a 68-genome sarbecovirus alignment were removed with three independent methods. J. Med. Discovery and genetic analysis of novel coronaviruses in least horseshoe bats in southwestern China.