This is a wrapper for the CD-HIT-EST algorithm. According to the CD-HIT user's guide, "CD-HIT-EST clusters a nucleotide dataset into clusters that meet a user-defined similarity threshold, usually a sequence identity." cd-hit-est comes bundled with TransDecoder, so this function runs the copy that ships with TransDecoder.
cd_hit_est(input, output, wd = here::here(), other_args = NULL, echo = pkgconfig::get_config("baitfindR::echo", fallback = FALSE), ...)
| Argument | Description |
|---|---|
| input | Character vector of length one; the path to the input file for cd-hit-est. Should be DNA or AA sequences in fasta format. |
| output | Character vector of length one; the name to assign to the output. Can include a path, in which case the output will be written there. |
| wd | Character vector of length one; the directory where the command will be run. |
| other_args | Character vector; other arguments to pass to cd-hit-est. Each should be an element of the vector. |
| echo | Logical; should the standard output and error be printed to the screen? |
| ... | Additional other arguments. Not used by this function, but meant to be used by |
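As a minimal sketch of how `other_args` might be assembled, the snippet below builds a character vector in which every flag and every value is its own element. The helper `cd_hit_est_args()` is hypothetical and not part of baitfindR. The flags `-c` (identity threshold), `-n` (word length), and `-T` (threads) are taken from the CD-HIT user's guide, and the values shown are only illustrative.

```r
# Hypothetical helper (not part of baitfindR): build a vector suitable for
# `other_args`, with each flag and each value as a separate element.
cd_hit_est_args <- function(identity = 0.95, word_length = 10L, threads = 1L) {
  stopifnot(identity > 0, identity <= 1)
  c(
    "-c", as.character(identity),     # sequence identity threshold
    "-n", as.character(word_length),  # word length
    "-T", as.character(threads)       # number of threads
  )
}

args <- cd_hit_est_args(identity = 0.95, word_length = 10L, threads = 2L)
print(args)

# The vector would then be passed on unchanged. The file names here are
# placeholders, so the call is left commented out:
# cd_hit_est(input = "longest_orfs.cds", output = "clustered.cds", other_args = args)
```

Keeping flags and values as separate elements matches the description of `other_args` above, where each argument should be its own element of the vector.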
Within the R environment, a list with components specified in run.
Externally, two files will be written: according to the CD-HIT user's guide, "The output are two files: a fasta file of representative sequences and a text file of list of clusters."
The fasta file is named with the value of output; the cluster list has the same name with .clstr appended.
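The cluster list can be read back into R for inspection. The following is a rough sketch only: `parse_clstr()` is a hypothetical helper, not part of baitfindR, and it assumes the usual `.clstr` layout in which each cluster starts with a `>Cluster N` line, each member line contains `>name...`, and the representative sequence is marked with a trailing `*`. The example lines are made up to illustrate that layout.

```r
# Hypothetical helper (not part of baitfindR): turn the lines of a .clstr
# file into a data frame with one row per clustered sequence.
parse_clstr <- function(lines) {
  is_header <- grepl("^>Cluster", lines)
  # Number clusters from 0, carrying each header's index down to its members
  cluster <- cumsum(is_header) - 1L
  members <- lines[!is_header]
  data.frame(
    cluster = cluster[!is_header],
    # Sequence name sits between ">" and the "..." that follows it
    sequence = sub("^.*>(.*?)\\.\\.\\..*$", "\\1", members, perl = TRUE),
    # The representative sequence of each cluster ends with "*"
    representative = grepl("\\*\\s*$", members),
    stringsAsFactors = FALSE
  )
}

# Made-up lines in the assumed layout; in practice these would come from
# readLines() on the file named `output` with ".clstr" appended.
example_lines <- c(
  ">Cluster 0",
  "0\t1500nt, >seqA... *",
  "1\t1420nt, >seqB... at +/98.10%",
  ">Cluster 1",
  "0\t900nt, >seqC... *"
)

clusters <- parse_clstr(example_lines)
print(clusters)
```

If the file produced by your version of cd-hit-est differs from this layout, the regular expressions would need adjusting.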
http://www.bioinformatics.org/cd-hit/, http://transdecoder.github.io
# NOT RUN {
library(ape)
library(baitfindR)

# Make temp dir for storing output
temp_dir <- fs::dir_create(fs::path(tempdir(), "baitfindR_example"))
data("PSKY")

# Write downsized transcriptome to temp dir
write.FASTA(PSKY, fs::path(temp_dir, "PSKY"))

# Get CDS
transdecoder_long_orfs(
  transcriptome_file = fs::path(temp_dir, "PSKY"),
  wd = temp_dir
)

# Cluster similar genes in CDS
cd_hit_est(
  input = fs::path(temp_dir, "PSKY.transdecoder_dir", "longest_orfs.cds"),
  output = fs::path(temp_dir, "PSKY.cd-hit-est"),
  wd = temp_dir,
  echo = TRUE
)

# Check output
list.files(temp_dir)
head(readr::read_lines(fs::path(temp_dir, "PSKY.cd-hit-est")))
head(readr::read_lines(fs::path(temp_dir, "PSKY.cd-hit-est.clstr")))

# Cleanup
fs::file_delete(temp_dir)
# }