The basics • taxastand

This vignette explains the three basic steps of the taxonomic name resolution workflow, which consist of:

Name parsing
Name matching
Name resolution

Setup

We’ll start by loading taxastand. For more information on installing taxastand, see the package’s installation instructions.

library(taxastand)

Name parsing

In R, scientific names are often just stored as character vectors (strings). For example,

example_name <- "Crepidomanes minutum (Bl.) K. Iwats."

However, such a name actually consists of several distinct parts:

"Crepidomanes minutum (Bl.) K. Iwats."
------------- ------- ---------------
      |         |          |
    genus    specific    author
             epithet

Furthermore, in the case of this name, it was originally named by Blume (shown in parentheses as “(Bl.)”), then transferred to a different genus by Iwatsuki (“K. Iwats.”).

When working with taxonomic names, it can be useful to parse the name into its component parts. That is what ts_parse_names() does. It takes a character vector as input and returns a dataframe:

ts_parse_names(example_name)
#>                                   name         id genus_hybrid_sign
#> 1 Crepidomanes minutum (Bl.) K. Iwats. d6ec2820-1              <NA>
#>     genus_name species_hybrid_sign specific_epithet infraspecific_rank
#> 1 Crepidomanes                <NA>          minutum               <NA>
#>   infraspecific_epithet          author
#> 1                  <NA> (Bl.) K. Iwats.

The first column, name, is the original input name. id is a unique identifier attached to the name. The rest of the columns are the parsed components of the name.

Note that the name parsing algorithm used by taxastand is case-sensitive! It assumes that the standard capitalization of scientific names is being used: genus is capitalized, specific epithet is lower case, author is capitalized as a proper noun, etc. Name parsing probably won’t work without this type of capitalization.
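To see why capitalization matters, here is a deliberately simplified sketch of name parsing in base R. It is not the algorithm taxastand uses, which also handles hybrid signs, infraspecific ranks, and many other cases; the function parse_simple() and its regular expression are hypothetical and exist only for this illustration.

```r
# Simplified illustration only; NOT the parser used by taxastand.
# parse_simple() is a hypothetical helper written for this example.
parse_simple <- function(x) {
  # Capitalized genus, lower-case epithet, everything else treated as author
  pattern <- "^([A-Z][a-z-]+) ([a-z-]+) ?(.*)$"
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Each match holds the full match plus three groups; no match gives NA
  parts <- vapply(
    m,
    function(p) if (length(p) == 4) p[2:4] else rep(NA_character_, 3),
    character(3)
  )
  data.frame(
    name = x,
    genus_name = parts[1, ],
    specific_epithet = parts[2, ],
    author = parts[3, ],
    stringsAsFactors = FALSE
  )
}

parsed <- parse_simple(c(
  "Crepidomanes minutum (Bl.) K. Iwats.",
  "crepidomanes Minutum (Bl.) K. Iwats." # non-standard capitalization
))
print(parsed)
```

The first name is split into genus, specific epithet, and author, while the second yields only missing values, because the pattern relies on the genus being capitalized and the epithet being lower case.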

Now that we’ve parsed a name, in the next section we will see why this is useful for matching names to each other.

Name matching

One reason that name parsing is important is that some scientific names differ only in certain components.

For example, the species Hymenophyllum pectinatum actually corresponds to two different scientific names with different authors, Hymenophyllum pectinatum Nees & Blume and Hymenophyllum pectinatum Cav.

We can see this by querying the name:

ts_match_names(
  "Hymenophyllum pectinatum", 
  c("Hymenophyllum pectinatum Nees & Blume", 
    "Hymenophyllum pectinatum Cav."), 
  simple = TRUE)
#>                      query                             reference match_type
#> 1 Hymenophyllum pectinatum Hymenophyllum pectinatum Nees & Blume auto_fuzzy
#> 2 Hymenophyllum pectinatum         Hymenophyllum pectinatum Cav. auto_fuzzy

ts_match_names() matches both scientific names¹, because the algorithm can’t distinguish between them without additional information. It is therefore almost always better to include the taxonomic author in the query, to distinguish between such cases.

However, there can be quite a bit of variation in how authors are recorded. Sometimes names are abbreviated to different lengths, or the basionym author (an author name in parentheses) might get left out by accident, etc. The algorithm used by taxastand can account for this (to a point). Here is an example where the query lacks a basionym author:

ts_match_names(
  "Hymenophyllum taiwanense C. V. Morton", 
  c("Hymenophyllum taiwanense (Tagawa) C. V. Morton", 
    "Hymenophyllum taiwanense De Vol"), 
  simple = TRUE)
#>                                   query
#> 1 Hymenophyllum taiwanense C. V. Morton
#>                                        reference  match_type
#> 1 Hymenophyllum taiwanense (Tagawa) C. V. Morton auto_basio-

The name matching algorithm was able to narrow the match down to Hymenophyllum taiwanense (Tagawa) C. V. Morton even though the query lacked (Tagawa). Furthermore, the match_type tells us how the matching was done: auto_basio- means an automatic match based on excluding the basionym author from the reference. It is recommended to always check any results that weren’t identical (exact) to verify that the matching algorithm worked correctly, especially for fuzzy matches (auto_fuzzy).
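One way to organize that check is sketched below in base R. The data frame is built by hand to mimic the simple = TRUE output shown above; its first three rows are copied from the examples in this vignette, and the fourth (an exact match) is a hypothetical row added so that the filter has something to exclude.

```r
# Hand-built table mimicking ts_match_names(..., simple = TRUE) output.
# Rows 1-3 come from the examples above; row 4 is hypothetical.
matches <- data.frame(
  query = c(
    "Hymenophyllum taiwanense C. V. Morton",
    "Hymenophyllum pectinatum",
    "Hymenophyllum pectinatum",
    "Hymenophyllum pectinatum Cav."
  ),
  reference = c(
    "Hymenophyllum taiwanense (Tagawa) C. V. Morton",
    "Hymenophyllum pectinatum Nees & Blume",
    "Hymenophyllum pectinatum Cav.",
    "Hymenophyllum pectinatum Cav."
  ),
  match_type = c("auto_basio-", "auto_fuzzy", "auto_fuzzy", "exact"),
  stringsAsFactors = FALSE
)

# Keep everything that was not an exact match, fuzzy matches first
to_check <- matches[matches$match_type != "exact", ]
to_check <- to_check[order(to_check$match_type != "auto_fuzzy"), ]

# Queries that matched more than one reference name need a decision
n_per_query <- table(matches$query)
ambiguous <- names(n_per_query[n_per_query > 1])

print(to_check)
print(ambiguous)
```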

Here is a summary of the values taken by match_type from taxon-tools:

exact: Exact match to all parts of the name (genus hybrid marker, genus name, species hybrid marker, species epithet, infraspecific rank signifier, infraspecific epithet, author string).
auto_punct: Exact match to all parts of the name after removing mis-matching spaces, periods, non-ASCII author name characters, etc.
auto_noauth (only applies if match_no_auth is TRUE): Match between a query lacking an author and a reference name lacking an author that occurs only once in the reference.
auto_basio-: Match after excluding the basionym author from the reference. For example, Cardaminopsis umbrosa Czerep. vs. Cardaminopsis umbrosa (Turcz.) Czerep.; the basionym author is (Turcz.).
auto_basio+: Match after excluding the basionym author from the query.
auto_in-: Match after excluding all in elements from reference. An in element refers to phrases such as Tagawa in Morton. The version excluding in elements is Tagawa.
auto_in+: Match after excluding all in elements from query.
auto_ex-: Match after excluding all in and ex elements from reference. An ex element refers to phrases such as Rändel ex D.F.Murray. The version excluding ex elements is Rändel.
auto_ex+: Match after excluding all in and ex elements from query.
auto_basexin: Match after excluding all basionym authors and all in and ex elements from query and reference.
auto_irank: Match where all elements agree except for infraspecific rank.
auto_fuzzy: Fuzzy match; scientific names are allowed to differ up to the threshold given by max_dist, the maximum Levenshtein distance (the total number of insertions, deletions, and substitutions).
cfonly: Match by “canonical form”, i.e., genus plus specific epithet plus infraspecific epithet (if present), not including the infraspecific specifier (“subsp.”, etc.).
no_match: No match detected.

The matching algorithm will prefer match codes higher in the list; so if a name could be matched both by auto_punct and auto_fuzzy, it will be matched based on auto_punct².
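To get a feel for the Levenshtein distance behind auto_fuzzy, the sketch below uses adist() from base R, which computes a generalized Levenshtein distance between strings. This is only an approximation of what happens during matching: the names are compared here as whole strings without authors, the threshold of 2 is arbitrary, and the distance computed by taxastand during matching may differ in detail.

```r
# Illustration with base R's adist(); not a call into taxastand.
query <- "Trichomanes bifidum"
candidates <- c(
  "Trichomanes bifidum",
  "Trichomanes rigidum",
  "Trichomanes bifolium",
  "Trichomanes asynkii"
)

# adist() returns a 1 x 4 matrix here; drop() turns it into a plain vector
d <- drop(adist(query, candidates))
names(d) <- candidates

# Arbitrary threshold chosen for this example (plays the role of max_dist)
max_dist <- 2
within_threshold <- d[d <= max_dist]

print(d)
print(within_threshold)
```

With this threshold, only the identical name (distance 0) and Trichomanes rigidum (two substitutions) are retained; Trichomanes bifolium needs three edits and is excluded.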

Name resolution

Name resolution refers to the process of mapping a query name to its standard version. This could just be accounting for orthographic variations, or it could involve resolving synonyms: different names that actually refer to the same species.

In order to conduct name resolution, we require a taxonomic standard in the form of a dataframe. taxastand requires that the taxonomic standard conform to Darwin Core standards. There are many sources of taxonomic data online, including GBIF, Catalogue of Life, and ITIS, among others.

taxastand comes supplied with an example taxonomic standard for filmy ferns (family Hymenophyllaceae):

# Load example reference taxonomy in Darwin Core format
data(filmy_taxonomy)

# Take a look at the columns used by taxastand
head(filmy_taxonomy[c("taxonID", "acceptedNameUsageID", "taxonomicStatus", "scientificName")])
#> # A tibble: 6 × 4
#>    taxonID acceptedNameUsageID taxonomicStatus scientificName                   
#>      <dbl>               <dbl> <chr>           <chr>                            
#> 1 54115096                  NA accepted name   Cephalomanes atrovirens Presl    
#> 2 54133783            54115097 synonym         Trichomanes crassum Copel.       
#> 3 54115097                  NA accepted name   Cephalomanes crassum (Copel.) M.…
#> 4 54133784            54115098 synonym         Trichomanes densinervium Copel.  
#> 5 54115098                  NA accepted name   Cephalomanes densinervium (Copel…
#> 6 54133785            54115099 synonym         Trichomanes infundibulare Alderw.

Here, taxonID is a unique identifier for each taxonomic name. acceptedNameUsageID only applies in the case of synonyms: it tells us the taxonID of the accepted name corresponding to that synonym. taxonomicStatus describes the status of the name, typically either as an accepted name, synonym, or something else (“dubious”, etc.). Finally, the scientificName is the full scientific name, preferably with the author.
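The link between a synonym and its accepted name can be pictured as a simple lookup. The sketch below uses a two-row toy reference in base R: the scientific names are real names from the filmy fern examples in this vignette, but the IDs are made up, and resolve_simple() is a hypothetical helper, not part of taxastand.

```r
# Toy reference in Darwin Core style; taxonID values are made up.
ref <- data.frame(
  taxonID = c(1, 2),
  acceptedNameUsageID = c(NA, 1),
  taxonomicStatus = c("accepted name", "synonym"),
  scientificName = c(
    "Crepidomanes minutum (Bl.) K. Iwats.",
    "Gonocormus minutus (Bl.) Bosch"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map a name already present in the reference
# to its accepted name.
resolve_simple <- function(matched_name, ref) {
  i <- match(matched_name, ref$scientificName)
  # Accepted names resolve to themselves; other names follow
  # acceptedNameUsageID (which may be missing, giving NA)
  accepted_id <- ifelse(
    ref$taxonomicStatus[i] == "accepted name",
    ref$taxonID[i],
    ref$acceptedNameUsageID[i]
  )
  ref$scientificName[match(accepted_id, ref$taxonID)]
}

resolved <- resolve_simple("Gonocormus minutus (Bl.) Bosch", ref)
print(resolved)
```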

In its simplest usage, ts_resolve_names() takes a character vector as the query and returns the resolved name from the taxonomic standard (reference):

ts_resolve_names("Gonocormus minutum", filmy_taxonomy)
#>                query                        resolved_name
#> 1 Gonocormus minutum Crepidomanes minutum (Bl.) K. Iwats.
#>                     matched_name resolved_status matched_status match_type
#> 1 Gonocormus minutus (Bl.) Bosch   accepted name        synonym auto_fuzzy

In this case, the query, Gonocormus minutum, was a misspelled name that is actually a synonym for Crepidomanes minutum (Bl.) K. Iwats. Under the hood, ts_resolve_names() is calling both ts_parse_names() and ts_match_names() to do parsing and matching steps before name resolution³.

However, when used this way, ts_resolve_names() may not be able to provide a resolved name if the input is not matched unambiguously:

t_bifid_res <- ts_resolve_names("Trichomanes bifidum", filmy_taxonomy)
head(t_bifid_res)
#>                 query resolved_name                 matched_name
#> 1 Trichomanes bifidum          <NA>   Trichomanes asynkii Racib.
#> 2 Trichomanes bifidum          <NA>      Trichomanes hartii Bak.
#> 3 Trichomanes bifidum          <NA>  Trichomanes minimum Alderw.
#> 4 Trichomanes bifidum          <NA>      Trichomanes loreum Bory
#> 5 Trichomanes bifidum          <NA> Trichomanes bifidum C. Presl
#> 6 Trichomanes bifidum          <NA>   Trichomanes bilingue Hook.
#>   resolved_status    matched_status match_type
#> 1            <NA>           synonym auto_fuzzy
#> 2            <NA>           synonym auto_fuzzy
#> 3            <NA>           synonym auto_fuzzy
#> 4            <NA>           synonym auto_fuzzy
#> 5            <NA> ambiguous synonym auto_fuzzy
#> 6            <NA>           synonym auto_fuzzy
dim(t_bifid_res)
#> [1] 211   6

In this case, name resolution using the default settings produced 211 possible answers! That is obviously far too many. Let’s try to adjust the arguments and see if we can reduce the output:

ts_resolve_names(
  "Trichomanes bifidum", filmy_taxonomy, 
  match_no_auth = TRUE, match_canon = TRUE, max_dist = 5)
#>                 query resolved_name             matched_name resolved_status
#> 1 Trichomanes bifidum          <NA> Trichomanes bifolium Bl.            <NA>
#> 2 Trichomanes bifidum          <NA>  Trichomanes rigidum Sw.            <NA>
#>      matched_status match_type
#> 1           synonym auto_fuzzy
#> 2 ambiguous synonym auto_fuzzy

By allowing matches without the author name (we probably should have done that anyway, since the query lacked an author), allowing matches on the canonical form, and lowering the fuzzy match threshold, we are able to reduce the number of candidate matches from 211 to 2.

Name resolution workflows typically involve tweaking these arguments to resolve a maximum number of names automatically, followed by some amount of manual edits to the remaining resolved names.

A benefit of taxastand is that, if during the name resolution workflow we discover mistakes in the reference database, the reference database can be edited so that the query names resolve correctly (this is not possible with packages that rely on querying a remote taxonomic database that can’t be modified by the user).
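As a rough sketch of such an edit, the base R code below adds a missing synonym to a toy reference data frame. The IDs are made up and the added row is purely illustrative; with real data, the same kind of edit would be applied to the reference dataframe before passing it to ts_resolve_names().

```r
# Toy reference in Darwin Core style; taxonID values are made up.
ref <- data.frame(
  taxonID = c(1, 2),
  acceptedNameUsageID = c(NA, 1),
  taxonomicStatus = c("accepted name", "synonym"),
  scientificName = c(
    "Crepidomanes minutum (Bl.) K. Iwats.",
    "Gonocormus minutus (Bl.) Bosch"
  ),
  stringsAsFactors = FALSE
)

# Illustrative new entry: a synonym we assume was missing from the reference
new_row <- data.frame(
  taxonID = max(ref$taxonID) + 1,
  acceptedNameUsageID = 1,
  taxonomicStatus = "synonym",
  scientificName = "Trichomanes minutum Bl.",
  stringsAsFactors = FALSE
)
ref_edited <- rbind(ref, new_row)

# Sanity checks: IDs stay unique and every synonym points to an existing ID
ids_unique <- !anyDuplicated(ref_edited$taxonID)
links_valid <- all(
  na.omit(ref_edited$acceptedNameUsageID) %in% ref_edited$taxonID
)
stopifnot(ids_unique, links_valid)

print(ref_edited)
```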

Conclusion

This vignette illustrated the typical steps involved in name resolution with taxastand on some trivial examples. In another vignette, I will provide a more realistic example with a larger dataset.

¹ Note that ts_match_names() did the name parsing by calling ts_parse_names() for us internally. This is usually fine, but it can also take parsed names (dataframes) produced by ts_parse_names() as input to either query or reference.
² The algorithm used by taxastand is optimized for plants, algae, and fungi, which vary in their taxonomic rules somewhat from animals. For example, plants include basionym authors in parentheses followed by the combination author, and typically don’t include the year, whereas animals normally include the year and may not provide the combination author.
³ You can pass the output of ts_match_names() to the query input of ts_resolve_names() if you want to see the matching results first.