Automating BLAST validation of human viral read assignment

Experiments with BLASTN remote mode

Author

Will Bradshaw

Published

January 30, 2024

In previous entries, I presented the results of my Nextflow workflow on recent BMC RNA-seq data. To validate and calibrate my Bowtie2/Kraken2 pipeline for identifying human-viral reads, I manually BLASTed putative viral reads against NCBI’s nt database using their online BLAST tool, then assessed the results by eye to determine if each read likely came from a human virus. This approach was effective, but slow and manual, and thus hard to scale to other datasets I wanted to analyze with a higher relative abundance of viral sequences. I thus investigated options for automating the process to generate custom tabular BLAST results on the command line, which could then be read and processed programmatically.

Among the several options I investigated, the most promising were (1) Elastic-BLAST, NCBI’s own cloud-based BLAST tool, and (2) running regular command-line blast with the -remote option enabled to query NCBI’s online databases. I don’t currently have the AWS permissions needed to run Elastic-BLAST on our AWS service, and blastn -remote initially seemed too slow to be useful when run on an EC2 instance. However, I discovered that the latter option appears to run much more quickly when run on my local machine, making this a much more attractive option, at least until I’m able to run Elastic-BLAST.

To evaluate the potential for this approach to semi-automate the process of HV read validation, I ran BLASTN on the 227 putative HV reads from my previous BMC entry, using the following command:

blastn -query <input_path> -out <output_path> -db nt -remote -perc_identity 60 -max_hsps 5 -num_alignments 50 -qcov_hsp_perc 30 -outfmt "6 qseqid sseqid sgi staxid qlen evalue bitscore qcovs length pident mismatch gapopen sstrand qstart qend sstart send"

To begin with, I first needed to identify a list of taxids I could use to designate a read as human-viral. To do this, I took the list of HV taxids generated by my Nextflow workflow and expanded it to include any missing descendent taxids using the NCBI taxid tree structure. I also generated another list of taxids that included any taxid descended from any of these taxids or the overall virus taxid (10239); this latter approach decreases the risk of false negatives at the cost of potentially increasing the risk of false positives from phage sequences. The two approaches generated lists of 28,105 and 234,499 taxids, respectively.

Code
# Set file paths
data_dir <- "../data/2024-01-30_blast/"
reads_db_path <- file.path(data_dir, "bmc-hv-putative.tsv")
blast_results_path <- file.path(data_dir, "bmc-hv-putative.blast")
hv_taxids_path <- file.path(data_dir, "human-virus-taxids-all.txt")
tax_nodes_path <- file.path(data_dir, "nodes.dmp.gz")

# Import files for viral taxid search
hv_taxids <- read_tsv(hv_taxids_path, show_col_types = FALSE, col_names = "taxid")
tax_nodes <- read_delim(tax_nodes_path, delim = "\t|\t", show_col_types = FALSE, 
                        col_names = FALSE) %>%
  select(X1:X3) %>% rename(child_taxid = X1, parent_taxid = X2, rank = X3)

# Define taxid search function
expand_taxids <- function(taxids_in, nodes){
  taxids_out <- taxids_in
  taxids_new <- filter(nodes, parent_taxid %in% taxids_out, !child_taxid %in% taxids_out) %>%
    pull(child_taxid) %>% sort
  while (length(taxids_new) > 0){
    taxids_out <- c(taxids_out, taxids_new) %>% sort
    taxids_new <- filter(nodes, parent_taxid %in% taxids_out, 
                         !child_taxid %in% taxids_out) %>%
      pull(child_taxid) %>% sort
  }
  return(taxids_out)
}
v_taxids_1 <- expand_taxids(hv_taxids$taxid, tax_nodes)
v_taxids_2 <- expand_taxids(c(10239, hv_taxids$taxid), tax_nodes)

I next imported the BLAST alignment results and performed exploratory data analysis. Following visual inspection, it appeared that the following process would generate good results:

  1. If both the forward and reverse reads from a read pair align to the same viral taxid (given the specified filters on query coverage, percent identity, etc), classify that read pair as human viral.
  2. If only one of the reads in a read pair aligns to any given viral taxid, flag the read for manual inspection.
  3. If neither read aligns to a viral taxid, classify that read pair as non-human-viral.
Code
# Prepare and import BLAST results
reads_db <- read_tsv(reads_db_path, show_col_types = FALSE)
blast_cols <- c("qseqid", "sseqid", "sgi", "staxid", "qlen", "evalue", "bitscore", "qcovs", "length", "pident", "mismatch", "gapopen", "sstrand", "qstart", "qend", "sstart", "send")
blast_results <- read_tsv(blast_results_path, show_col_types = FALSE, col_names = blast_cols,
                          col_types = cols(.default="c"))

# Filter BLAST results by read ID
reads_db_seqs <- reads_db %>% mutate(read_id = paste(sample, seq_num, sep="_")) %>% pull(read_id) %>% c(paste(., "1", sep="_"), paste(., "2", sep="_"))
blast_results_filtered <- blast_results %>% filter(qseqid %in% reads_db_seqs)

# Summarize BLAST results by read pair and subject taxid
blast_results_paired <- blast_results_filtered %>%
  separate(qseqid, c("sample", "seq_num", "read_pair"), "_") %>%
  group_by(sample, seq_num, read_pair, staxid) %>% filter(bitscore == max(bitscore)) %>%
  filter(length == max(length)) %>% filter(row_number() == 1) %>%
  group_by(sample, seq_num, staxid) %>%
  mutate(bitscore = as.numeric(bitscore), seq_num = as.numeric(seq_num)) %>%
  summarize(bitscore_max = max(bitscore), bitscore_min = min(bitscore), 
            n_reads = n(), .groups = "drop")

# Assign viral status to BLAST results
blast_results_hviral <- blast_results_paired %>%
  mutate(viral_1 = staxid %in% v_taxids_1,
         viral_2 = staxid %in% v_taxids_2) %>%
  mutate(viral_1_full = viral_1 & n_reads == 2,
         viral_2_full = viral_2 & n_reads == 2)

# Assign putative viral status to read pairs
blast_results_out <- blast_results_hviral %>%
  group_by(sample, seq_num) %>%
  summarize(viral_status_1 = ifelse(any(viral_1_full), "TRUE", ifelse(any(viral_1), "INSPECT", "FALSE")),
            viral_status_2 = ifelse(any(viral_2_full), "TRUE", ifelse(any(viral_2), "INSPECT", "FALSE")),
            .groups = "drop") %>%
  inner_join(reads_db %>% select(sample, seq_num, hv_blast), by=c("sample", "seq_num"))

# Assume manual inspection produces same results as previous
blast_results_inspected <- blast_results_out %>%
  mutate(viral_status_1 = ifelse(viral_status_1 == "INSPECT", hv_blast, as.logical(viral_status_1)),
         viral_status_2 = ifelse(viral_status_2 == "INSPECT", hv_blast, as.logical(viral_status_2)))

# Evaluate performance vs fully manual inspection
blast_status <- blast_results_inspected %>%
    mutate(pos_tru_1 = viral_status_1 & hv_blast,
           pos_fls_1 = viral_status_1 & !hv_blast,
           neg_tru_1 = !viral_status_1 & !hv_blast,
           neg_fls_1 = !viral_status_1 & hv_blast,
           pos_tru_2 = viral_status_2 & hv_blast,
           pos_fls_2 = viral_status_2 & !hv_blast,
           neg_tru_2 = !viral_status_2 & !hv_blast,
           neg_fls_2 = !viral_status_2 & hv_blast)
blast_performance_1 <- blast_status %>%
  summarize(n_pos_tru = sum(pos_tru_1),
            n_pos_fls = sum(pos_fls_1),
            n_neg_tru = sum(neg_tru_1),
            n_neg_fls = sum(neg_fls_1),
            precision = n_pos_tru / (n_pos_tru + n_pos_fls),
            sensitivity = n_pos_tru / (n_pos_tru + n_neg_fls),
            specificity = n_neg_tru / (n_neg_tru + n_pos_fls),
            f1 = 2 * precision * sensitivity / (precision + sensitivity)) %>%
  mutate(ref_taxids = "Expanded human viral")
blast_performance_2 <- blast_status %>%
  summarize(n_pos_tru = sum(pos_tru_2),
            n_pos_fls = sum(pos_fls_2),
            n_neg_tru = sum(neg_tru_2),
            n_neg_fls = sum(neg_fls_2),
            precision = n_pos_tru / (n_pos_tru + n_pos_fls),
            sensitivity = n_pos_tru / (n_pos_tru + n_neg_fls),
            specificity = n_neg_tru / (n_neg_tru + n_pos_fls),
            f1 = 2 * precision * sensitivity / (precision + sensitivity)) %>%
  mutate(ref_taxids = "All viral")
blast_performance <- bind_rows(blast_performance_1, blast_performance_2) %>% select(ref_taxids, n_pos_tru:f1)
blast_performance

Using the inferred list of human-virus taxids results in high precision and specificity, but low sensitivity, missing many reads flagged as human-viral by manual inspection. Conversely, using a list of all viral taxids as the reference achieves perfect sensitivity and near-perfect precision and specificity, with only one false positive result, resulting in an F1 score of over 99%.

That one false positive turned out to be a sequence with a high-identity but low-query-coverage human-viral match (to human enterovirus 71) which was excluded during my previous manual assessment; in my opinion, it’s unclear whether this should really be classed as a false positive. If we want to class this sequence as non-human-viral, increasing the query coverage threshold from 30% to 35% successfully excludes it without losing any true positives, resulting in perfect precision relative to manual assignments:

Code
# Filter BLAST results by query coverage
blast_results_qcov <- blast_results_filtered %>% filter(as.numeric(qcovs) >= 35)

# Summarize BLAST results by read pair and subject taxid
blast_results_qcov_paired <- blast_results_qcov %>%
  separate(qseqid, c("sample", "seq_num", "read_pair"), "_") %>%
  group_by(sample, seq_num, read_pair, staxid) %>% filter(bitscore == max(bitscore)) %>%
  filter(length == max(length)) %>% filter(row_number() == 1) %>%
  group_by(sample, seq_num, staxid) %>%
  mutate(bitscore = as.numeric(bitscore), seq_num = as.numeric(seq_num)) %>%
  summarize(bitscore_max = max(bitscore), bitscore_min = min(bitscore), 
            n_reads = n(), .groups = "drop")

# Assign viral status to BLAST results
blast_results_qcov_hviral <- blast_results_qcov_paired %>%
  mutate(viral_1 = staxid %in% v_taxids_1,
         viral_2 = staxid %in% v_taxids_2) %>%
  mutate(viral_1_full = viral_1 & n_reads == 2,
         viral_2_full = viral_2 & n_reads == 2)

# Assign putative viral status to read pairs
blast_results_qcov_out <- blast_results_qcov_hviral %>%
  group_by(sample, seq_num) %>%
  summarize(viral_status_1 = ifelse(any(viral_1_full), "TRUE", ifelse(any(viral_1), "INSPECT", "FALSE")),
            viral_status_2 = ifelse(any(viral_2_full), "TRUE", ifelse(any(viral_2), "INSPECT", "FALSE")),
            .groups = "drop") %>%
  full_join(reads_db %>% select(sample, seq_num, hv_blast), by=c("sample", "seq_num")) %>%
  mutate(viral_status_1 = replace_na(viral_status_1, "FALSE"),
         viral_status_2 = replace_na(viral_status_2, "FALSE"))

# Assume manual inspection produces same results as previous
blast_results_qcov_inspected <- blast_results_qcov_out %>%
  mutate(viral_status_1 = ifelse(viral_status_1 == "INSPECT", hv_blast, as.logical(viral_status_1)),
         viral_status_2 = ifelse(viral_status_2 == "INSPECT", hv_blast, as.logical(viral_status_2)))

# Evaluate performance vs fully manual inspection
blast_status_qcov <- blast_results_qcov_inspected %>%
    mutate(pos_tru_1 = viral_status_1 & hv_blast,
           pos_fls_1 = viral_status_1 & !hv_blast,
           neg_tru_1 = !viral_status_1 & !hv_blast,
           neg_fls_1 = !viral_status_1 & hv_blast,
           pos_tru_2 = viral_status_2 & hv_blast,
           pos_fls_2 = viral_status_2 & !hv_blast,
           neg_tru_2 = !viral_status_2 & !hv_blast,
           neg_fls_2 = !viral_status_2 & hv_blast)
blast_performance_qcov_1 <- blast_status_qcov %>%
  summarize(n_pos_tru = sum(pos_tru_1),
            n_pos_fls = sum(pos_fls_1),
            n_neg_tru = sum(neg_tru_1),
            n_neg_fls = sum(neg_fls_1),
            precision = n_pos_tru / (n_pos_tru + n_pos_fls),
            sensitivity = n_pos_tru / (n_pos_tru + n_neg_fls),
            specificity = n_neg_tru / (n_neg_tru + n_pos_fls),
            f1 = 2 * precision * sensitivity / (precision + sensitivity)) %>%
  mutate(ref_taxids = "Expanded human viral")
blast_performance_qcov_2 <- blast_status_qcov %>%
  summarize(n_pos_tru = sum(pos_tru_2),
            n_pos_fls = sum(pos_fls_2),
            n_neg_tru = sum(neg_tru_2),
            n_neg_fls = sum(neg_fls_2),
            precision = n_pos_tru / (n_pos_tru + n_pos_fls),
            sensitivity = n_pos_tru / (n_pos_tru + n_neg_fls),
            specificity = n_neg_tru / (n_neg_tru + n_pos_fls),
            f1 = 2 * precision * sensitivity / (precision + sensitivity)) %>%
  mutate(ref_taxids = "All viral")
blast_performance_qcov <- bind_rows(blast_performance_qcov_1, blast_performance_qcov_2) %>% select(ref_taxids, n_pos_tru:f1)
blast_performance_qcov

In summary, using command-line blastn -remote, combined with tabular processing in R against a reference class of all viral taxids, works well as a substitute for manual inspection of online BLAST results for validating human-viral read assignments. It’s too slow and failure-prone to be added to the pipeline as a replacement or supplement to the Bowtie/Kraken automated approach, but will be my go-to approach in future for manual validation and refinement of human-viral assignment criteria – at least until I’m able to try out Elastic-BLAST.