An Open and Continuously Updated Fern Tree of Life (FTOL)

Joel Nitta (1), Eric Schuettpelz (2), Santiago Ramírez-Barahona (3), Wataru Iwasaki (1)

1: The University of Tokyo, 2: Smithsonian Institution, 3: Universidad Nacional Autónoma de México
ISMB EvolCompGen COSI 2022-07-13
https://joelnitta.github.io/ismb_2022

Building the Tree of Life (TOL): a major goal of biology since Darwin


Darwin (1837)


Hinchliff et al. (2015)

Growth of data on GenBank means TOL may be possible…


Gauthier et al. (2019)

…using automated pipelines

Antonelli et al. (2016)

Problem 1: tradeoff between scalability and accuracy

  • Any automated pipeline must take shortcuts and make assumptions

  • Manual inspection of all sequences would lead to high-quality results, but does not scale

Problem 2: a tree built today will be out of date tomorrow

  • Due to the rapid accumulation of data on GenBank

Our approach: combine automation with customization

Goal: construct a pipeline to generate a maximally sampled phylogeny of ferns with high taxonomic quality

Why ferns?

A large, diverse, ecologically important group of plants

Much more tractable than seed plants (angiosperms):

  • Ferns: ca. 12,000 species, 40-50% sequenced

  • Seed plants: ca. 350,000 species, 20% sequenced

Sanger-sequenced plastid genes = workhorse of fern molecular systematics

Methods

Workflow highlights

Increase efficiency and reproducibility of working with GenBank data by using a local DB

Extract sequences without relying on GenBank annotations

Implement a custom fern taxonomy

Use two-step phylogenetic analysis to maximize accuracy and sampling

Incorporate automated and manual checks for rogues

Increase efficiency and reproducibility of working with GenBank data by using a local DB

restez* R package

  • Download a portion (the “plants” division) of GenBank from the FTP site (v249, cutoff date 2022-04-15, ca. 170 GB)

  • Put only fern and outgroup sequences into a local DB (MonetDB, ca. 400 MB)

  • Post the DB on figshare so others can use it
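
The idea behind the local DB can be illustrated with a small base R sketch. This is not the restez API: the records, accession numbers, lineage strings, and the choice of outgroup below are all made up for illustration, and an RDS file stands in for MonetDB. It only shows the pattern of subsetting once, storing locally, and querying by accession afterwards.

```r
# Hypothetical, already-parsed GenBank records (toy data, not real accessions).
gb <- data.frame(
  accession = c("XX000001", "XX000002", "XX000003", "XX000004"),
  organism  = c("Acrostichum aureum", "Oryza sativa",
                "Ceratopteris richardii", "Ginkgo biloba"),
  lineage   = c("Viridiplantae; Polypodiopsida; Polypodiales",
                "Viridiplantae; Magnoliopsida; Poales",
                "Viridiplantae; Polypodiopsida; Polypodiales",
                "Viridiplantae; Ginkgoopsida; Ginkgoales"),
  sequence  = c("ATGGCA", "ATGTTT", "ATGGCC", "ATGAAA"),
  stringsAsFactors = FALSE
)

# Keep ferns plus an assumed outgroup (Ginkgo is chosen purely for illustration).
keep_pattern <- "Polypodiopsida|Ginkgoopsida"
local_db <- gb[grepl(keep_pattern, gb$lineage), ]

# Persist the subset so later steps never touch the full download again.
# (An RDS file is a stand-in here; the talk describes a MonetDB database.)
db_file <- tempfile(fileext = ".rds")
saveRDS(local_db, db_file)

# Look up a sequence by accession from the local copy.
# Returns NA for accessions that were filtered out.
get_seq <- function(acc, path = db_file) {
  db <- readRDS(path)
  db$sequence[match(acc, db$accession)]
}

get_seq("XX000003")
```

Because every later query reads the same local file, rerunning the pipeline gives the same answer regardless of what has since been added to GenBank.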

Extract sequences without relying on GenBank annotations

There is no standard for how to annotate GenBank accessions

  • Curate set of reference sequences (one sequence per genus)

  • Use these as a BLAST DB to extract matching regions with superCRUNCH (Portik and Wiens 2020)
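
A minimal base R sketch of annotation-free extraction, assuming BLAST has already been run against the reference sequences and its tabular output loaded as a data frame. The column names follow BLAST's tabular format; the hits, sequence IDs, and sequences are toy values. superCRUNCH does considerably more than this (for example, handling multiple hits per sequence), so treat this only as an illustration of the principle.

```r
# Toy BLAST hits of GenBank sequences (queries) against a reference locus.
hits <- data.frame(
  qseqid   = c("acc1", "acc1", "acc2"),
  sseqid   = c("ref_rbcL", "ref_rbcL", "ref_rbcL"),
  qstart   = c(11, 3, 1),
  qend     = c(20, 6, 8),
  bitscore = c(18.5, 7.2, 15.0),
  stringsAsFactors = FALSE
)

# Toy query sequences: the target region is flanked by unrelated bases (N).
seqs <- c(
  acc1 = paste0(strrep("N", 10), "ATGTCACCAC", "NNNN"),
  acc2 = "ATGTCACCNN"
)

# Keep the highest-scoring hit per query sequence.
best <- hits[order(hits$qseqid, -hits$bitscore), ]
best <- best[!duplicated(best$qseqid), ]

# Cut out the matching region using the hit coordinates,
# ignoring whatever the GenBank annotation says.
extracted <- substr(seqs[best$qseqid], best$qstart, best$qend)
names(extracted) <- best$qseqid

extracted
```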

Implement a custom fern taxonomy

NCBI species names include many synonyms

  • Used Catalog of Life as basis for new, fern-specific taxonomic database, pteridocat
    • Built database with dwctaxon* R package (handles taxonomic data in compliance with Darwin Core standard)
  • Resolved GenBank species names to pteridocat using taxastand* R package
    • Matches taxonomic names to a custom database, while accounting for spelling differences and taxonomic idiosyncrasies
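
The name-resolution step can be sketched in base R as follows. This is not how taxastand is implemented: edit distance via `utils::adist()` is swapped in as a simple stand-in for fuzzy matching, the function `resolve_names()` and the `max_dist` threshold are hypothetical, and all taxon names are invented.

```r
# Toy reference taxonomy: each name maps to an accepted name
# (synonyms point to a different accepted name). All names are made up.
ref <- data.frame(
  name          = c("Exemplum album", "Exemplum nigrum", "Specimen album"),
  accepted_name = c("Exemplum album", "Exemplum nigrum", "Exemplum album"),
  stringsAsFactors = FALSE
)

# Resolve query names to accepted names, tolerating small spelling differences.
resolve_names <- function(query, ref, max_dist = 2) {
  d <- utils::adist(query, ref$name)          # edit distance, query x reference
  best <- apply(d, 1, which.min)              # closest reference name per query
  best_dist <- d[cbind(seq_along(query), best)]
  ok <- best_dist <= max_dist                 # reject matches that are too distant
  data.frame(
    query    = query,
    matched  = ifelse(ok, ref$name[best], NA_character_),
    accepted = ifelse(ok, ref$accepted_name[best], NA_character_),
    distance = best_dist,
    stringsAsFactors = FALSE
  )
}

res <- resolve_names(
  c("Exemplum albun",      # misspelling
    "Specimen album",      # synonym
    "Exemplum nigrum",     # exact match
    "Totally different"),  # no plausible match
  ref
)
res
```

Names that fail to match within the threshold come back as `NA`, which gives a natural list of cases to inspect by hand.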

Use two-step phylogenetic analysis to maximize accuracy and sampling

  1. Plastome backbone (423 species x 79 loci)

  2. Use plastome backbone as constraint tree for analysis of Sanger dataset (5,582 species x 7 loci)

(We also tried a single supermatrix analysis, but it was much slower: ca. 1 month vs. ca. 10 days.)

Incorporate automated and manual checks for rogues

Automated

  • All-by-all BLAST
  • Exclude any sequences that matched the wrong family
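
One way the automated check could look in base R, assuming the all-by-all BLAST results are available as a table. The slide does not say exactly which hits count as a match to the wrong family; this sketch assumes the criterion is the best non-self hit, and the sequence IDs, scores, and family assignments are toy values.

```r
# Toy all-by-all BLAST results (each sequence is both query and subject).
hits <- data.frame(
  qseqid   = c("s1", "s1", "s1", "s2", "s2", "s3", "s3", "s3",
               "s4", "s4", "s4", "s5", "s5"),
  sseqid   = c("s1", "s2", "s3", "s2", "s1", "s3", "s4", "s1",
               "s4", "s5", "s3", "s5", "s4"),
  bitscore = c(200, 180, 90, 210, 180, 190, 170, 90,
               205, 185, 170, 195, 185),
  stringsAsFactors = FALSE
)

# Family assigned to each sequence from its species name.
# s3 is labeled Pteridaceae but behaves like Aspleniaceae.
family <- c(s1 = "Pteridaceae", s2 = "Pteridaceae", s3 = "Pteridaceae",
            s4 = "Aspleniaceae", s5 = "Aspleniaceae")

# Drop self-hits, then keep the best remaining hit per query.
non_self <- hits[hits$qseqid != hits$sseqid, ]
non_self <- non_self[order(non_self$qseqid, -non_self$bitscore), ]
top <- non_self[!duplicated(non_self$qseqid), ]

# Flag queries whose best hit belongs to a different family.
top$q_family <- family[top$qseqid]
top$s_family <- family[top$sseqid]
rogues <- top$qseqid[top$q_family != top$s_family]

rogues
```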

Manual

  • Construct initial tree in “fast” mode with IQ-TREE (Nguyen et al. 2015)
  • Analyze monophyly (genus level and higher) with MonoPhy R package (Schwery and O’Meara 2016)
  • Curate exclusion list in consultation with taxonomic experts
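
The genus-level monophyly check can be illustrated with the ape package, used here as a stand-in for MonoPhy (which reports much richer diagnostics). The tree and taxon names are invented, and the sketch assumes tip labels follow the `Genus_species` pattern seen in FTOL.

```r
library(ape)

# Toy rooted tree: Aus_c is nested among the Cus tips, away from other Aus.
tree <- read.tree(
  text = "((Aus_a,Aus_b),((Bus_a,Bus_b),(Cus_a,(Cus_b,Aus_c))));"
)

# Derive the genus from each tip label (everything before the first underscore).
genus <- sub("_.*$", "", tree$tip.label)

# Test each genus for monophyly.
is_mono <- sapply(
  split(tree$tip.label, genus),
  function(tips) is.monophyletic(tree, tips)
)

# Genera to inspect manually.
non_mono <- names(is_mono)[!is_mono]
non_mono
```

Here a single misplaced tip makes two genera non-monophyletic, which is why the flagged genera are a starting point for expert review rather than an automatic exclusion list.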

Results

Revisiting the timeline of fern diversification

  • Dated with treePL (Smith and O’Meara 2012) using 51 fossils (ca. twice as many as used previously)
  • Pushes back stem ages for most families by ca. 10–30 million years
  • Suggests ferns did not diversify “in the shadow” of angiosperms

Older stem ages for most families

Web portal

https://fernphy.github.io/

  • Data downloads

  • Shiny app for exploring data

R package ftolr

https://github.com/fernphy/ftolr

  • Read tree and data (alignments) directly into R

  • Options for outgroups, rooting, locus selection, etc.

library(ftolr)
ft_tree(drop_og = TRUE)

Phylogenetic tree with 5582 tips and 5581 internal nodes.

Tip labels:
  Acrostichum_danaeifolium, Acrostichum_speciosum, Acrostichum_aureum, Ceratopteris_richardii, Ceratopteris_cornuta, Ceratopteris_shingii, ...
Node labels:
  100/100, 100/100, 100, 100/100, 100, 100/100, ...

Rooted; includes branch lengths.

Community involvement

Consulted with a taxonomic expert on family Thelypteridaceae (S. Fawcett) between v1.0.0 and v1.1.0

  • Implemented “inclusion list” (preferred accessions for some species) based on recent phylogeny (Patel et al. 2019)

  • Number of non-monophyletic genera dropped from 16 to 7

Summary

FTOL hits sweet spot between automation and customization

  • Automated, versioned mining of GenBank data

  • Custom taxonomy tailored for ferns

  • Open data, methods, and results

  • Input from taxonomic experts and broader community

Future directions

  • Completion of FTOL

    • “Unlock the vault” of herbarium specimens via plastome skimming
  • Integration with Pteridophyte Phylogeny Group II

    • Species-level, community-driven, living taxonomy
  • Transition to phylogenomics for all species

  • Continue to provide tools (R packages) towards building TOL

Acknowledgements

  • Japan Society for the Promotion of Science

  • Smithsonian National Museum of Natural History Peter Buck Fellowship

  • Members of the Iwasaki lab, The University of Tokyo

  • A.E. White

  • S. Fawcett

  • M. Hassler

References

Antonelli, A., H. Hettling, F. L. Condamine, K. Vos, R. H. Nilsson, M. J. Sanderson, H. Sauquet, R. Scharn, D. Silvestro, M. Töpel, C. D. Bacon, B. Oxelman, and R. A. Vos. 2016. Toward a self-updating platform for estimating rates of speciation and migration, ages, and relationships of taxa. Systematic Biology 66:152–166.
Gauthier, J., A. T. Vincent, S. J. Charette, and N. Derome. 2019. A brief history of bioinformatics. Briefings in Bioinformatics 20:1981–1996.
Hinchliff, C. E., S. A. Smith, J. F. Allman, J. G. Burleigh, R. Chaudhary, L. M. Coghill, K. A. Crandall, J. Deng, B. T. Drew, R. Gazis, K. Gude, D. S. Hibbett, L. A. Katz, H. D. Laughinghouse, E. J. McTavish, P. E. Midford, C. L. Owen, R. H. Ree, J. A. Rees, D. E. Soltis, T. Williams, and K. A. Cranston. 2015. Synthesis of phylogeny and taxonomy into a comprehensive tree of life. Proceedings of the National Academy of Sciences:201423041.
Landau, W. M. 2021. The targets R package: A dynamic Make-like function-oriented pipeline toolkit for reproducibility and high-performance computing. Journal of Open Source Software 6:2959.
Nguyen, L.-T., H. A. Schmidt, A. von Haeseler, and B. Q. Minh. 2015. IQ-TREE: A fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Molecular Biology and Evolution 32:268–274.
Patel, N., S. Fawcett, M. Sundue, and J. M. Budke. 2019. Evolution of perine morphology in the Thelypteridaceae. International Journal of Plant Sciences 180:1016–1035.
Portik, D. M., and J. J. Wiens. 2020. SuperCRUNCH: A bioinformatics toolkit for creating and manipulating supermatrices and other large phylogenetic datasets. Methods in Ecology and Evolution 11:763–772.
Schwery, O., and B. C. O’Meara. 2016. MonoPhy: a simple R package to find and visualize monophyly issues. PeerJ Computer Science 2:e56.
Smith, S. A., and B. C. O’Meara. 2012. treePL: divergence time estimation using penalized likelihood for large phylogenies. Bioinformatics 28:2689–2690.
Smith, S. A., and J. F. Walker. 2019. PyPHLAWD: A python tool for phylogenetic dataset construction. Methods in Ecology and Evolution 10:104–108.
Testo, W., and M. Sundue. 2016. A 4000-species dataset provides new insight into the evolution of ferns. Molecular Phylogenetics and Evolution 105:200–211.