VirFind pipeline

VirFind is a pipeline of bioinformatics scripts currently running on 4 high-performance computing nodes (each with 64 cores and 512 GB RAM) at the Arkansas High Performance Computing Center. You can upload your fasta or Illumina fastq files to the VirFind ftp server and let the tool determine whether known or new viruses are present in your samples.

Because processing next-generation sequencing data is computationally intensive, it can take up to several days for your job to be analyzed completely.

There might also be a queue of other users' jobs ahead of yours.

VirFind version 1.2

Pipeline:

User file submission on the VirFind ftp server, together with completion of the Sequence submission form, which specifies how the pipeline will run. File types permitted: fastq (Illumina), fasta, or gz-compressed files of these two types.
File transfer to VirFind bioinformatics server
Convert fastq to fasta format and collapse identical sequences
Trim n nucleotides (n = user's choice) from both ends
Map to reference genome (user's choice) by Bowtie2

Output mapped sequences
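The conversion, collapsing, and trimming steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not VirFind's own code: the function names, the seqN_xCOUNT naming of collapsed reads, and the choice to drop reads that become empty after trimming are all hypothetical.

```python
from collections import Counter


def fastq_to_sequences(lines):
    """Return the sequence line of every 4-line FASTQ record."""
    rows = [ln.strip() for ln in lines if ln.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain 4 lines per record")
    return [rows[i + 1] for i in range(0, len(rows), 4)]


def preprocess(fastq_lines, n):
    """Convert FASTQ to FASTA-style records, collapse duplicates, trim n nt per end.

    Returns a list of (name, sequence) tuples. Names follow a hypothetical
    seqN_xCOUNT convention, where COUNT is the number of identical reads.
    """
    counts = Counter(fastq_to_sequences(fastq_lines))  # collapse identical reads
    records = []
    for idx, (seq, count) in enumerate(counts.items(), start=1):
        trimmed = seq[n:len(seq) - n]  # remove n nucleotides from both ends
        if trimmed:  # assumption: reads emptied by trimming are dropped
            records.append((f"seq{idx}_x{count}", trimmed))
    return records


def to_fasta(records):
    """Format (name, sequence) tuples as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)
```

For example, calling preprocess on three reads of which two are identical, with n = 1, yields two collapsed records, each shortened by two nucleotides.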

Unmapped sequences: calculate average sequence length

De novo sequence assembly by Trinity and SPAdes
In addition, if average sequence length <= 80 nt, de novo sequence assembly by SPAdes with kmer = 13, 19, 25
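The assembly decision above depends only on the average length of the unmapped sequences. A minimal sketch of that rule follows; the function names and the dictionary layout of the returned plan are hypothetical, and no actual assembler command lines are shown because the source does not give them.

```python
def average_length(seqs):
    """Mean length, in nucleotides, of a collection of sequences."""
    seqs = list(seqs)
    if not seqs:
        raise ValueError("no sequences to average")
    return sum(len(s) for s in seqs) / len(seqs)


def assembly_plan(seqs, short_read_cutoff=80, short_kmers=(13, 19, 25)):
    """Return (average length, list of assembly runs to perform).

    Trinity and SPAdes are always run. When the average length is at most
    short_read_cutoff, an extra SPAdes run with small k-mers is added.
    """
    avg = average_length(seqs)
    plan = [
        {"assembler": "Trinity"},
        {"assembler": "SPAdes", "kmers": None},  # None: k-mer sizes not stated in the source
    ]
    if avg <= short_read_cutoff:
        plan.append({"assembler": "SPAdes", "kmers": list(short_kmers)})
    return avg, plan
```

With reads averaging 75 nt the plan contains three runs; with reads averaging 100 nt it contains only the two default runs.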

Assembled contigs: Blastn against the NCBI nt database (e-value = user's choice, default = 0.01), which generates the following outputs

Blastn_NON_VIRUS_reads.fna
Blastn_NON_VIRUS_report.tab
Blastn_VIRUS_reads.fna
Blastn_VIRUS_report.tab

Sequences not detected by Blastn: Blastx against all GenBank virus proteins (e-value = user's choice, default = 0.01), which generates the following outputs

Blastx_VIRUS_reads.fna
Blastx_VIRUS_report.tab
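The hand-off from Blastn to Blastx described above is a cascade: a contig reaches Blastx only if Blastn did not detect it. The sketch below shows that logic on already-parsed results. It assumes each search has been reduced to a mapping from contig ID to its best e-value and that a hit counts when its e-value is at or below the cutoff; the function name and data layout are hypothetical, and the virus versus non-virus split of Blastn hits is not modeled because the source does not say how it is made.

```python
def triage(contig_ids, blastn_evalues, blastx_evalues, max_evalue=0.01):
    """Split contigs into Blastn-detected, Blastx-detected, and undetected.

    blastn_evalues and blastx_evalues map contig ID -> best e-value;
    contigs without any hit are simply absent from the mapping.
    """
    no_hit = float("inf")
    blastn_detected, blastx_detected, undetected = [], [], []
    for cid in contig_ids:
        if blastn_evalues.get(cid, no_hit) <= max_evalue:
            blastn_detected.append(cid)
        elif blastx_evalues.get(cid, no_hit) <= max_evalue:
            # Only contigs missed by Blastn are considered for Blastx.
            blastx_detected.append(cid)
        else:
            undetected.append(cid)
    return blastn_detected, blastx_detected, undetected
```

Contigs in the third list are those with no qualifying hit in either search.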

Sequences not detected by Blastx will be output to:

Reads_with_NO_Blastn_NO_Blastx.fna
Reads_with_NO_Blastn_NO_Blastx.faa (translation of .fna file to protein)
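To illustrate what translating the .fna file to protein involves, here is a small sketch using the standard genetic code. How VirFind actually translates (which reading frames, which tool) is not stated above, so the single forward frame used here, the function name, and the use of X for codons containing ambiguous bases are assumptions.

```python
from itertools import product

# Standard genetic code, codons ordered by base in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq, frame=0):
    """Translate a nucleotide sequence in one forward frame (0, 1, or 2).

    Stop codons become '*'; codons with ambiguous bases (e.g. N) become 'X'.
    A trailing partial codon is ignored.
    """
    seq = seq.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(frame, len(seq) - 2, 3)
    )
```

A fuller version would typically translate all six frames (three forward, three on the reverse complement), since the coding strand and frame of an unknown read are not known in advance.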

Conserved domain search (user's choice) of the .faa file against NCBI CDD database, e-value = 0.05, output to Conserved_domain_search_report.txt

Users will use the information from the Blastn and Blastx .tab files and Conserved_domain_search_report.txt to decide whether one or more viruses are present in their sample.

Blastn_VIRUS_reads.fna shows reads that share nucleotide identity to GenBank sequences.
Blastx_VIRUS_reads.fna shows reads that share amino acid identity to GenBank sequences.
Reads_with_NO_Blastn_NO_Blastx.fna shows reads that cannot be detected by Blastn and Blastx with the chosen e-values. Please keep in mind that this file, while it may still contain host material, might also contain sequences of new viruses that differ significantly from those deposited in GenBank.

Users will need some experience to judge whether a nucleotide/amino acid read reported by VirFind is a real virus read, and whether the virus is an isolate of a known species or a completely new species belonging to a particular genus, subfamily, family, or order.

VirFind virus discovery using NGS
