Joel Nitta¹, Eric Schuettpelz², Santiago Ramírez-Barahona³, Wataru Iwasaki¹
1: The University of Tokyo, 2: Smithsonian Institution, 3: Universidad Nacional Autónoma de México
ISMB EvolCompGen COSI 2022-07-13
https://joelnitta.github.io/ismb_2022
Darwin (1837)
Hinchliff et al. (2015)
Gauthier et al. (2019)
Antonelli et al. (2016)
Any automated pipeline must make shortcuts and assumptions
Manual inspection of all sequences would lead to high-quality results, but does not scale
Goal: construct a pipeline to generate a maximally sampled, high taxonomic quality phylogeny of ferns
A large, diverse, ecologically important group of plants
Much more tractable than seed plants (angiosperms):
Ferns: ca. 12,000 species, 40-50% sequenced
Seed plants: ca. 350,000 species, 20% sequenced
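As a rough check on the figures above, the approximate species counts and sequenced fractions can be converted into numbers of sequenced species. This is a back-of-the-envelope sketch; the function name is illustrative and the inputs are the approximate values quoted above.

```python
def sequenced_species(total_species, fraction):
    """Approximate number of species with at least one sequence."""
    return round(total_species * fraction)

# Ferns: ca. 12,000 species, 40-50% sequenced
ferns_low = sequenced_species(12_000, 0.40)
ferns_high = sequenced_species(12_000, 0.50)

# Seed plants: ca. 350,000 species, 20% sequenced
seed_plants = sequenced_species(350_000, 0.20)

print(f"Ferns: {ferns_low}-{ferns_high} species sequenced")
print(f"Seed plants: about {seed_plants} species sequenced")
```

Ferns have far fewer sequenced species to process, and a larger share of the group is already covered.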
Increase efficiency and reproducibility of working with GenBank data by using a local DB
Extract sequences without relying on GenBank annotations
Implement a custom fern taxonomy
Use two-step phylogenetic analysis to maximize accuracy and sampling
Incorporate automated and manual checks for rogues
restez R package
Download portion (“plants” division) of GenBank from FTP site (v249, cutoff date 2022-04-15, ca. 170 GB)
Put only fern and outgroup sequences into local DB (MonetDB, ca. 400 MB)
Post DB on figshare so others can use it
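The filtering step can be illustrated with a minimal sketch. It does not use restez or MonetDB: Python's built-in sqlite3 stands in as the local database, and the records, genus names, and the `build_local_db` helper are all hypothetical placeholders rather than real GenBank data or the pipeline's own code.

```python
import sqlite3

# Hypothetical records: (accession, organism, sequence). Not real GenBank data.
records = [
    ("ACC0001", "Fern_genus_a species_one", "ATGCATGC"),
    ("ACC0002", "Seedplant_genus species_two", "GGGTTTAA"),
    ("ACC0003", "Outgroup_genus species_three", "ATGAAACC"),
]

# Hypothetical set of genera to keep (ferns plus outgroups).
keep_genera = {"Fern_genus_a", "Outgroup_genus"}

def build_local_db(records, keep_genera, path=":memory:"):
    """Store only records whose genus (first word of organism) is in keep_genera."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE seqs (accession TEXT PRIMARY KEY, organism TEXT, sequence TEXT)"
    )
    kept = [r for r in records if r[1].split()[0] in keep_genera]
    conn.executemany("INSERT INTO seqs VALUES (?, ?, ?)", kept)
    conn.commit()
    return conn

conn = build_local_db(records, keep_genera)
accessions = [
    row[0] for row in conn.execute("SELECT accession FROM seqs ORDER BY accession")
]
print(accessions)
```

Keeping only the target groups is what shrinks the working database relative to the full download, and a single database file is easy to version and share.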
There is no standard for how to annotate GenBank accessions
Curate set of reference sequences (one sequence per genus)
Use these as a BLAST DB to extract matching regions with superCRUNCH (Portik and Wiens 2020)
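The idea of extracting regions by similarity rather than by annotation can be sketched as follows. This is not superCRUNCH's implementation: it only assumes BLAST tabular output with the 12 standard columns (`-outfmt 6`), with GenBank sequences as queries and the reference set as the database. The sequences, hit lines, and function names are hypothetical, and merging all hits into one span per query is a simplification.

```python
# Hypothetical input sequences keyed by accession (not real data).
sequences = {
    "ACC0001": "NNNNNATGCATGCATNNNNN",
    "ACC0002": "TTTTTTTTTT",
}

# Hypothetical BLAST tabular lines with the standard columns:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
blast_lines = [
    "ACC0001\tref_locus\t98.0\t7\t0\t0\t6\t12\t1\t7\t1e-20\t50.0",
    "ACC0001\tref_locus\t97.0\t6\t0\t0\t10\t15\t5\t10\t1e-15\t40.0",
]

def hit_spans(blast_lines):
    """Return {query id: (start, end)} covering all hits, 1-based inclusive."""
    spans = {}
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        qseqid = fields[0]
        qstart, qend = int(fields[6]), int(fields[7])
        lo, hi = min(qstart, qend), max(qstart, qend)
        if qseqid in spans:
            lo = min(lo, spans[qseqid][0])
            hi = max(hi, spans[qseqid][1])
        spans[qseqid] = (lo, hi)
    return spans

def extract_regions(sequences, spans):
    """Trim each query with a hit to its span; queries with no hit are dropped."""
    return {
        acc: sequences[acc][lo - 1:hi]
        for acc, (lo, hi) in spans.items()
        if acc in sequences
    }

extracted = extract_regions(sequences, hit_spans(blast_lines))
print(extracted)
```

Because the target region is defined by similarity to the curated references, a sequence with missing or inconsistent annotations can still contribute, and a sequence with no match is simply left out.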
NCBI species names include many synonyms
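One simple way to handle synonyms is to map every name to its accepted name in the custom taxonomy before analysis and to flag any name that cannot be matched. The sketch below assumes a plain lookup table; the names and the `resolve_names` helper are hypothetical placeholders, not entries from the actual fern taxonomy or the pipeline's own code.

```python
# Hypothetical lookup: every known name (synonym or accepted) -> accepted name.
synonym_to_accepted = {
    "Genus_a old_epithet": "Genus_b accepted_epithet",
    "Genus_b accepted_epithet": "Genus_b accepted_epithet",
    "Genus_c other_epithet": "Genus_c other_epithet",
}

def resolve_names(names, lookup):
    """Return ({input name: accepted name}, [names with no match in lookup])."""
    resolved = {}
    unmatched = []
    for name in names:
        if name in lookup:
            resolved[name] = lookup[name]
        else:
            unmatched.append(name)
    return resolved, unmatched

ncbi_names = ["Genus_a old_epithet", "Genus_c other_epithet", "Genus_d unknown"]
resolved, unmatched = resolve_names(ncbi_names, synonym_to_accepted)
print(resolved)
print(unmatched)
```

Unmatched names are returned rather than silently dropped so that they can be reviewed by hand.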
Plastome backbone (423 species x 79 loci)
Use plastome backbone as constraint tree for analysis of Sanger dataset (5,582 species x 7 loci)
(Also tried a single supermatrix analysis, but this was much slower: ca. 1 month vs. 10 days)
Automated
Manual
Data downloads
Shiny app for exploring data
https://github.com/fernphy/ftolr
Read tree and data (alignments) directly into R
Options for outgroups, rooting, locus selection, etc.
Phylogenetic tree with 5582 tips and 5581 internal nodes.
Tip labels:
Acrostichum_danaeifolium, Acrostichum_speciosum, Acrostichum_aureum, Ceratopteris_richardii, Ceratopteris_cornuta, Ceratopteris_shingii, ...
Node labels:
100/100, 100/100, 100, 100/100, 100, 100/100, ...
Rooted; includes branch lengths.
Consulted with a taxonomic expert on family Thelypteridaceae (S. Fawcett) between v1.0.0 and v1.1.0
Implemented “inclusion list” (preferred accessions for some species) based on recent phylogeny (Patel et al. 2019)
Number of non-monophyletic genera dropped from 16 to 7
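Counting non-monophyletic genera can be sketched as a check on a rooted tree: a genus is treated as monophyletic if the set of all its tips equals the tip set of some clade. The code below is an illustrative sketch, not the pipeline's actual check. It assumes tip labels of the form `Genus_species`, represents the tree as nested tuples, and uses a hypothetical toy topology.

```python
def clade_tip_sets(tree):
    """Return (tips under this node, list of tip sets for every clade below it)."""
    if isinstance(tree, str):
        tips = frozenset([tree])
        return tips, [tips]
    tips = frozenset()
    clades = []
    for child in tree:
        child_tips, child_clades = clade_tip_sets(child)
        tips |= child_tips
        clades.extend(child_clades)
    clades.append(tips)
    return tips, clades

def non_monophyletic_genera(tree):
    """List genera whose tips do not form exactly one clade in a rooted tree."""
    all_tips, clades = clade_tip_sets(tree)
    clade_set = set(clades)
    by_genus = {}
    for tip in all_tips:
        by_genus.setdefault(tip.split("_")[0], set()).add(tip)
    return sorted(
        genus for genus, tips in by_genus.items()
        if frozenset(tips) not in clade_set
    )

# Hypothetical toy tree: GenusB is split across two clades.
toy_tree = (
    (("GenusA_sp1", "GenusA_sp2"), "GenusB_sp1"),
    ("GenusB_sp2", "GenusC_sp1"),
)
print(non_monophyletic_genera(toy_tree))
```

Running such a check on each release gives a single number that can be compared across versions, as in the drop from 16 to 7 reported above.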
Automated, versioned mining of GenBank data
Custom taxonomy tailored for ferns
Open data, methods, and results
Input from taxonomic experts and broader community
Completion of FTOL
Integration with Pteridophyte Phylogeny Group II
Transition to phylogenomics for all species
Continue to provide tools (R packages) towards building TOL
Japan Society for the Promotion of Science
Smithsonian National Museum of Natural History Peter Buck Fellowship
Members of the Iwasaki lab, The University of Tokyo
A.E. White
S. Fawcett
M. Hassler