An Open and Continuously Updated Fern Tree of Life (FTOL)

Joel Nitta (1), Eric Schuettpelz (2), Santiago Ramírez-Barahona (3), Wataru Iwasaki (1)

1: The University of Tokyo, 2: Smithsonian Institution, 3: Universidad Nacional Autónoma de México
Botany 2022
https://joelnitta.github.io/botany_2022_ftol

Phylogenies are essential to biology

“Only with a phylogeny can we begin to understand diversification, regularities in patterns of evolution, or simply suggest individual evolutionary changes within a clade”

- APG

https://www.digitalatlasofancientlife.org/learn/embryophytes/angiosperms/angiosperm-phylogeny/

Automated pipelines enable building large trees

Antonelli et al. (2016)

Problem 1: tradeoff between scalability and accuracy

  • Any automated pipeline must make shortcuts and assumptions

  • Manual inspection of all sequences would lead to high-quality results, but does not scale

Problem 2: tree built today will be out-of-date tomorrow

  • Due to the rapid accumulation of data on GenBank

Our approach: combine automation with customization

Goal: construct a pipeline to generate a maximally sampled, high taxonomic quality phylogeny of ferns

Why ferns?

A large, diverse, ecologically important group of plants

Much more tractable than seed plants (mostly angiosperms):

  • Ferns: ca. 12,000 species, 40-50% sequenced

  • Seed plants: ca. 350,000 species, 20% sequenced

Sanger-sequenced plastid genes = workhorse of fern molecular systematics

Methods

GenBank mining

Sanger: 7 commonly used loci

  • Genes: atpA, atpB, matK, rbcL, rps4
  • Spacers: trnL-trnF, rps4-trnS
  • ca. 5,100 species

Plastomes

  • 77 single-copy genes + 2 spacers
  • ca. 500 species

GenBank mining

  • Download data to a local database using the restez R package

  • Use superCRUNCH (Portik and Wiens 2020) to extract sequences without relying on annotations

Taxonomic name resolution

query               matched_name        resolved_name
Anemia collina Sm.  Anemia collina Sm.  Anemia collina Raddi
Pteris flava Merr.  Pteris flava Merr.  Pteris linearis Poir.

… (6,475 total)
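
The query / matched_name / resolved_name columns above can be illustrated with a minimal sketch. It assumes a simple synonym lookup table; the function names and the whitespace normalization step are hypothetical and are not the pipeline's actual matching logic. The two entries are the rows shown on the slide.

```python
# Toy synonym table: matched name -> accepted (resolved) name.
# Only the two rows shown on the slide are included; a real table would
# come from a taxonomic database.
ACCEPTED_NAME = {
    "Anemia collina Sm.": "Anemia collina Raddi",
    "Pteris flava Merr.": "Pteris linearis Poir.",
}


def normalize(name: str) -> str:
    """Collapse runs of whitespace so trivially different spellings match."""
    return " ".join(name.split())


def resolve_name(query: str) -> dict:
    """Return the query, the name it matched, and the resolved name.

    matched_name and resolved_name are None when the query is not found.
    """
    matched = normalize(query)
    if matched not in ACCEPTED_NAME:
        return {"query": query, "matched_name": None, "resolved_name": None}
    return {
        "query": query,
        "matched_name": matched,
        "resolved_name": ACCEPTED_NAME[matched],
    }
```

Returning all three columns keeps the full trail from the GenBank name to the accepted name, so any resolution can be audited later.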

Automated cleaning

  • Run all-by-all BLAST (Camacho et al. 2009)

  • Any query matching the wrong family is excluded as mis-ID

species                        accession  locus      query_family      match_family
Abacopteris_gymnopteridifrons  JF303974   rbcL       Thelypteridaceae  Athyriaceae
Angiopteris_evecta             AY344778   trnL-trnF  Marattiaceae      Ophioglossaceae

… (70 total)
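
A minimal sketch of the family-mismatch check, under stated assumptions: the slide does not say whether "matching the wrong family" refers to the top hit or to all hits, so this version flags a query when the most common family among its non-self hits differs from its own. The function name and input layout are hypothetical.

```python
from collections import Counter


def flag_misidentified(hits, family_of):
    """Return {query: most common hit family} for queries that look misidentified.

    hits: iterable of (query_accession, subject_accession) pairs, e.g. the
          first two columns of BLAST tabular output.
    family_of: dict mapping each accession to its family name.
    Assumption: a query is flagged when the most common family among its
    non-self hits differs from the query's own family.
    """
    hit_families = {}
    for query, subject in hits:
        if query == subject:
            continue  # ignore self-hits in an all-by-all search
        hit_families.setdefault(query, Counter())[family_of[subject]] += 1

    flagged = {}
    for query, counts in hit_families.items():
        top_family, _ = counts.most_common(1)[0]
        if top_family != family_of[query]:
            flagged[query] = top_family
    return flagged
```

Flagged accessions would then be dropped before alignment, which is how rows like the two in the table above are removed.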

Phylogenetic analysis: backbone

  • Align plastome sequences with MAFFT (Katoh et al. 2002) (544 species x 74,883 bp, 12.1% missing)

  • Infer tree using ML in IQ-TREE (Nguyen et al. 2015) (concatenated matrix, no partitioning)
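
A "% missing" figure like the one quoted for the alignment can be computed along these lines. This is a sketch only: which characters count as missing (here gaps, "?", and "N") is an assumption, since the slide does not define it, and the function name is hypothetical.

```python
def percent_missing(alignment, missing_chars="-?Nn"):
    """Percentage of alignment cells that are gaps or ambiguous.

    alignment: dict mapping species name -> aligned sequence string.
    missing_chars: characters treated as missing data (an assumption here).
    """
    total = sum(len(seq) for seq in alignment.values())
    if total == 0:
        raise ValueError("alignment is empty")
    missing = sum(
        seq.count(ch) for seq in alignment.values() for ch in missing_chars
    )
    return 100.0 * missing / total
```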

Phylogenetic analysis: full tree

  • Align Sanger sequences with MAFFT (5,582 species x 12,716 bp, 77% missing)

  • Infer tree in IQ-TREE (concatenated matrix, no partitioning) with plastome tree as constraint

  • Before final analysis, run IQ-TREE in “fast” mode and manually inspect for rogues
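
To show what a "concatenated matrix" means in practice, here is a minimal sketch of joining per-locus alignments into one supermatrix. It assumes that a species lacking a locus is padded with gap characters, which is one common convention and not necessarily what the pipeline does; the function name is hypothetical.

```python
def concatenate_loci(loci, locus_order):
    """Join per-locus alignments into one supermatrix.

    loci: dict mapping locus name -> {species: aligned sequence}.
    locus_order: list of locus names giving the column order.
    Species missing from a locus are padded with '-' (an assumption).
    """
    species = sorted({sp for aln in loci.values() for sp in aln})
    supermatrix = {sp: "" for sp in species}
    for locus in locus_order:
        aln = loci[locus]
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{locus}: sequences are not aligned to equal length")
        width = lengths.pop()
        for sp in species:
            supermatrix[sp] += aln.get(sp, "-" * width)
    return supermatrix
```

Padding of this kind is why a matrix built from several loci can have a high share of missing data when most species are sequenced for only a few of them.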

Results

Revisiting the timeline of fern diversification

  • 51 fossils (about twice as many as in previous studies)

  • Pushes back stem ages for most families by ca. 10-30 million years

  • Suggests ferns did not diversify “in the shadow” of angiosperms

Older stem ages for most families

Web portal

https://fernphy.github.io/

  • Data downloads

  • Shiny app for exploring data

R package ftolr

https://github.com/fernphy/ftolr

  • Read tree and data (alignments) directly into R

  • Options for outgroups, rooting, locus selection, etc.

library(ftolr)
# Load the FTOL tree, excluding the outgroup
ft_tree(drop_og = TRUE)

Phylogenetic tree with 5582 tips and 5581 internal nodes.

Tip labels:
  Acrostichum_danaeifolium, Acrostichum_speciosum, Acrostichum_aureum, Ceratopteris_richardii, Ceratopteris_cornuta, Ceratopteris_shingii, ...
Node labels:
  100/100, 100/100, 100, 100/100, 100, 100/100, ...

Rooted; includes branch lengths.

Community involvement

Consulted with a taxonomic expert on family Thelypteridaceae (S. Fawcett) between v1.0.0 and v1.1.0

  • Number of non-monophyletic genera dropped from 16 to 7
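
Counting non-monophyletic genera can be sketched as follows. The assumptions are hypothetical and not taken from the talk: the tree is rooted and given as nested tuples, tip labels follow the Genus_species pattern, and a genus counts as monophyletic when some clade contains exactly its tips.

```python
def clade_tip_sets(tree):
    """Return one frozenset of tip labels per clade, including tips and the root.

    tree: nested tuples with tip labels as strings (rooted tree assumed).
    """
    clades = []

    def walk(node):
        if isinstance(node, str):
            tips = frozenset([node])
        else:
            tips = frozenset().union(*(walk(child) for child in node))
        clades.append(tips)
        return tips

    walk(tree)
    return clades


def non_monophyletic_genera(tree):
    """List genera whose tips do not form exactly one clade."""
    clades = set(clade_tip_sets(tree))
    all_tips = max(clades, key=len)  # the root clade holds every tip
    by_genus = {}
    for tip in all_tips:
        by_genus.setdefault(tip.split("_")[0], set()).add(tip)
    return sorted(
        genus for genus, tips in by_genus.items() if frozenset(tips) not in clades
    )
```

For a toy tree such as `((("GenusA_x", "GenusB_y"), "GenusA_z"), ("GenusC_p", "GenusC_q"))`, with placeholder names, only GenusA is reported, because GenusB_y is nested inside it.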

Summary

FTOL hits the sweet spot between automation and customization

  • Automated, versioned mining of GenBank data

  • Custom taxonomy tailored for ferns

  • Input from taxonomic experts and broader community

  • Model for other plant groups at similar scale?

Future directions

  • Completion of FTOL

    • “Unlock the vault” of herbarium specimens via plastome skimming

  • Integration with Pteridophyte Phylogeny Group II

  • Transition to phylogenomics for all species

Acknowledgements

  • Japan Society for the Promotion of Science

  • Smithsonian National Museum of Natural History Peter Buck Fellowship

  • Members of the Iwasaki lab, The University of Tokyo

  • A.E. White

  • S. Fawcett

  • M. Hassler

References

Antonelli, A., H. Hettling, F. L. Condamine, K. Vos, R. H. Nilsson, M. J. Sanderson, H. Sauquet, R. Scharn, D. Silvestro, M. Töpel, C. D. Bacon, B. Oxelman, and R. A. Vos. 2016. Toward a self-updating platform for estimating rates of speciation and migration, ages, and relationships of taxa. Systematic Biology 66:152–166.
Camacho, C., G. Coulouris, V. Avagyan, N. Ma, J. Papadopoulos, K. Bealer, and T. Madden. 2009. BLAST+: architecture and applications. BMC Bioinformatics 10:421.
Hassler, M. 2022. World Ferns. Synonymic Checklist and Distribution of Ferns and Lycophytes of the World. www.worldplants.de/ferns/.
Katoh, K., K. Misawa, K. Kuma, and T. Miyata. 2002. MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research 30:3059–3066.
Landau, W. M. 2021. The targets R package: A dynamic Make-like function-oriented pipeline toolkit for reproducibility and high-performance computing. Journal of Open Source Software 6:2959.
Nguyen, L.-T., H. A. Schmidt, A. von Haeseler, and B. Q. Minh. 2015. IQ-TREE: A fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Molecular Biology and Evolution 32:268–274.
Portik, D. M., and J. J. Wiens. 2020. SuperCRUNCH: A bioinformatics toolkit for creating and manipulating supermatrices and other large phylogenetic datasets. Methods in Ecology and Evolution 11:763–772.
Smith, S. A., and B. C. O’Meara. 2012. treePL: divergence time estimation using penalized likelihood for large phylogenies. Bioinformatics 28:2689–2690.
Smith, S. A., and J. F. Walker. 2019. PyPHLAWD: A python tool for phylogenetic dataset construction. Methods in Ecology and Evolution 10:104–108.
Testo, W., and M. Sundue. 2016. A 4000-species dataset provides new insight into the evolution of ferns. Molecular Phylogenetics and Evolution 105:200–211.