Workflow analysis of Rosario et al. (2018)

Air sampling from a student dorm in Colorado.

Author

Will Bradshaw

Published

April 12, 2024

Continuing our look at air sampling datasets, we turn to Rosario et al. (2018), another study of air filters, this time from HVAC filters from an undergraduate dorm building at the University of Colorado campus in Boulder. As in Prussin, samples were eluted from filters (in this case MERV-8, so less stringent than Prussin’s MERV-14 filters) and underwent both RNA and DNA sequencing – this time on an Illumina MiSeq with 2x250bp reads.

The raw data

The Rosario dataset comprised sequencing data from 12 individual dormitory rooms at the sampling site, as well as a pooled sample of eight additional rooms and a negative control.

Code
# Data input paths
data_dir <- "../data/2024-04-12_rosario/"
libraries_path <- file.path(data_dir, "sample-metadata.csv")
basic_stats_path <- file.path(data_dir, "qc_basic_stats.tsv")
adapter_stats_path <- file.path(data_dir, "qc_adapter_stats.tsv")
quality_base_stats_path <- file.path(data_dir, "qc_quality_base_stats.tsv")
quality_seq_stats_path <- file.path(data_dir, "qc_quality_sequence_stats.tsv")

# Import libraries and extract metadata from sample names
libraries_raw <- read_csv(libraries_path, show_col_types = FALSE)
libraries <- libraries_raw %>%
  arrange(desc(na_type)) %>% mutate(na_type = fct_inorder(na_type)) %>%
  mutate(sample_type = ifelse(grepl("Control", room), "NC",
                              ifelse(grepl("Pool", room), "Pool", "Single-Room")),
         sample_type = factor(sample_type, levels=c("Single-Room", "Pool", "NC")),
         room = ifelse(grepl("Control", room), "Negative Control", room))

# Import QC data
stages <- c("raw_concat", "cleaned", "dedup", "ribo_initial", "ribo_secondary")
basic_stats <- read_tsv(basic_stats_path, show_col_types = FALSE) %>%
  inner_join(libraries, by="sample") %>% 
  mutate(stage = factor(stage, levels = stages),
         sample = fct_inorder(sample))
adapter_stats <- read_tsv(adapter_stats_path, show_col_types = FALSE) %>%
    mutate(sample = sub("_2$", "", sample)) %>%
  inner_join(libraries, by="sample") %>%
  mutate(stage = factor(stage, levels = stages),
         read_pair = fct_inorder(as.character(read_pair)))
quality_base_stats <- read_tsv(quality_base_stats_path, show_col_types = FALSE) %>%
  inner_join(libraries, by="sample") %>%
  mutate(stage = factor(stage, levels = stages),
         read_pair = fct_inorder(as.character(read_pair)))
quality_seq_stats <- read_tsv(quality_seq_stats_path, show_col_types = FALSE) %>%
  inner_join(libraries, by="sample") %>%
  mutate(stage = factor(stage, levels = stages),
         read_pair = fct_inorder(as.character(read_pair)))

# Filter to raw data
basic_stats_raw <- basic_stats %>% filter(stage == "raw_concat")
adapter_stats_raw <- adapter_stats %>% filter(stage == "raw_concat")
quality_base_stats_raw <- quality_base_stats %>% filter(stage == "raw_concat")
quality_seq_stats_raw <- quality_seq_stats %>% filter(stage == "raw_concat")

# Get key values for readout
raw_read_counts <- basic_stats_raw %>% group_by(na_type, sample_type) %>% 
  summarize(rmin = min(n_read_pairs), rmax=max(n_read_pairs),
            rmean=mean(n_read_pairs), .groups = "drop")
raw_read_totals <- basic_stats_raw %>% group_by(na_type, sample_type) %>% 
  summarize(n_read_pairs = sum(n_read_pairs), 
            n_bases_approx = sum(n_bases_approx), .groups = "drop")
raw_dup <- basic_stats_raw %>% group_by(na_type, sample_type) %>% 
  summarize(dmin = min(percent_duplicates), dmax=max(percent_duplicates),
            dmean=mean(percent_duplicates), .groups = "drop")

The 12 single-room samples yielded 0.57-0.74M (mean 0.67M) RNA-sequencing reads and 0.48M-0.84M (mean 0.67M) DNA-sequencing reads; the pooled and control samples had similar depth. In total, non-control samples yielded 8.0M RNA read pairs and 8.0M DNA read pairs (4 gigabases of sequence each).

Read qualities were so-so, in need of cleaning. Adapter levels were high. Inferred duplication levels were low (4.6-8.2%) in DNA reads but highly variable (3-83%) in RNA reads.

Code
# Prepare data
basic_stats_raw_metrics <- basic_stats_raw %>%
  select(room, na_type, sample_type,
         `# Read pairs` = n_read_pairs,
         `Total base pairs\n(approx)` = n_bases_approx,
         `% Duplicates\n(FASTQC)` = percent_duplicates) %>%
  pivot_longer(-(room:sample_type), names_to = "metric", values_to = "value") %>%
  mutate(metric = fct_inorder(metric))

# Set up plot templates
scale_fill_na <- purrr::partial(scale_fill_brewer, palette="Set1", 
                                name="Nucleic acid type")
g_basic <- ggplot(basic_stats_raw_metrics, 
                  aes(x=room, y=value, fill=na_type)) +
  geom_col(position = "dodge") +
  scale_x_discrete(name="Collection Date") +
  scale_y_continuous(expand=c(0,0)) +
  expand_limits(y=c(0,100)) +
  scale_fill_na() + 
  facet_grid(metric~sample_type, scales = "free", space="free_x", switch="y") +
  theme_rotate + theme(
    axis.title.y = element_blank(),
    strip.text.y = element_text(face="plain")
  )
g_basic

Code
# Set up plotting templates
scale_color_na <- purrr::partial(scale_color_brewer, palette="Set1",
                                 name="Nucleic acid type")
g_qual_raw <- ggplot(mapping=aes(color=na_type, linetype=read_pair, 
                         group=interaction(sample,read_pair))) + 
  scale_color_na() + scale_linetype_discrete(name = "Read Pair") +
  guides(color=guide_legend(nrow=2,byrow=TRUE),
         linetype = guide_legend(nrow=2,byrow=TRUE)) +
  theme_base

# Visualize adapters
g_adapters_raw <- g_qual_raw + 
  geom_line(aes(x=position, y=pc_adapters), data=adapter_stats_raw) +
  scale_y_continuous(name="% Adapters", limits=c(0,NA),
                     breaks = seq(0,100,10), expand=c(0,0)) +
  scale_x_continuous(name="Position", limits=c(0,NA),
                     breaks=seq(0,140,20), expand=c(0,0)) +
  facet_grid(sample_type~adapter)
g_adapters_raw

Code
# Visualize quality
g_quality_base_raw <- g_qual_raw +
  geom_hline(yintercept=25, linetype="dashed", color="red") +
  geom_hline(yintercept=30, linetype="dashed", color="red") +
  geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats_raw) +
  scale_y_continuous(name="Mean Phred score", expand=c(0,0), limits=c(10,45)) +
  scale_x_continuous(name="Position", limits=c(0,NA),
                     breaks=seq(0,140,20), expand=c(0,0)) +
  facet_grid(sample_type~.)
g_quality_base_raw

Code
g_quality_seq_raw <- g_qual_raw +
  geom_vline(xintercept=25, linetype="dashed", color="red") +
  geom_vline(xintercept=30, linetype="dashed", color="red") +
  geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats_raw) +
  scale_x_continuous(name="Mean Phred score", expand=c(0,0)) +
  scale_y_continuous(name="# Sequences", expand=c(0,0)) +
  facet_grid(sample_type~., scales = "free_y")
g_quality_seq_raw

Preprocessing

The average fraction of reads lost at each stage in the preprocessing pipeline is shown in the following table. Single-room samples lost an average of 52%/14% (RNA/DNA) of total read pairs during cleaning and deduplication, with a further loss of 4.3%/0.1% during ribodepletion. However, the fraction of reads lost, especially during cleaning and deduplication, was highly variable between samples.

Code
n_reads_rel <- basic_stats %>% 
  select(sample, na_type, sample_type, stage, 
         percent_duplicates, n_read_pairs) %>%
  group_by(sample, na_type) %>% arrange(sample, na_type, stage) %>%
  mutate(p_reads_retained = replace_na(n_read_pairs / lag(n_read_pairs), 0),
         p_reads_lost = 1 - p_reads_retained,
         p_reads_retained_abs = n_read_pairs / n_read_pairs[1],
         p_reads_lost_abs = 1-p_reads_retained_abs,
         p_reads_lost_abs_marginal = replace_na(p_reads_lost_abs - lag(p_reads_lost_abs), 0))
n_reads_rel_display <- n_reads_rel %>% 
  rename(Stage=stage, `NA Type`=na_type, `Sample Type`=sample_type) %>% 
  group_by(`Sample Type`, `NA Type`, Stage) %>% 
  summarize(`% Total Reads Lost (Cumulative)` = paste0(round(min(p_reads_lost_abs*100),1), "-", round(max(p_reads_lost_abs*100),1), " (mean ", round(mean(p_reads_lost_abs*100),1), ")"),
            `% Total Reads Lost (Marginal)` = paste0(round(min(p_reads_lost_abs_marginal*100),1), "-", round(max(p_reads_lost_abs_marginal*100),1), " (mean ", round(mean(p_reads_lost_abs_marginal*100),1), ")"), .groups="drop") %>% 
  filter(Stage != "raw_concat") %>%
  mutate(Stage = Stage %>% as.numeric %>% factor(labels=c("Trimming & filtering", "Deduplication", "Initial ribodepletion", "Secondary ribodepletion")))
n_reads_rel_display
Code
g_stage_trace <- ggplot(basic_stats, aes(x=stage, color=na_type, group=sample)) +
  scale_color_na() +
  facet_wrap(sample_type~na_type, scales="free", ncol=2,
             labeller = label_wrap_gen(multi_line=FALSE)) +
  theme_kit

# Plot reads over preprocessing
g_reads_stages <- g_stage_trace +
  geom_line(aes(y=n_read_pairs)) +
  scale_y_continuous("# Read pairs", expand=c(0,0), limits=c(0,NA))
g_reads_stages

Code
# Plot relative read losses during preprocessing
g_reads_rel <- ggplot(n_reads_rel, aes(x=stage, color=na_type, group=sample)) +
  geom_line(aes(y=p_reads_lost_abs_marginal)) +
  scale_y_continuous("% Total Reads Lost", expand=c(0,0), 
                     labels = function(x) x*100) +
  scale_color_na() +
  facet_wrap(sample_type~na_type, scales="free", ncol=2,
             labeller = label_wrap_gen(multi_line=FALSE)) +
  theme_kit
g_reads_rel

Data cleaning was very successful at improving read qualities, as well as removing most adapter sequences. The exception to the latter was terminal poly-A sequences, which remained at a substantial (though reduced) prevalence throughout the pipeline. (This doesn’t seem to have caused downstream problems that I can see.)

Code
g_qual <- ggplot(mapping=aes(color=na_type, linetype=read_pair, 
                         group=interaction(sample,read_pair))) + 
  scale_color_na() + scale_linetype_discrete(name = "Read Pair") +
  guides(color=guide_legend(nrow=2,byrow=TRUE),
         linetype = guide_legend(nrow=2,byrow=TRUE)) +
  theme_base

# Visualize adapters
g_adapters <- g_qual + 
  geom_line(aes(x=position, y=pc_adapters), data=adapter_stats) +
  scale_y_continuous(name="% Adapters", limits=c(0,20),
                     breaks = seq(0,50,10), expand=c(0,0)) +
  scale_x_continuous(name="Position", limits=c(0,NA),
                     breaks=seq(0,140,20), expand=c(0,0)) +
  facet_grid(stage~adapter)
g_adapters

Code
# Visualize quality
g_quality_base <- g_qual +
  geom_hline(yintercept=25, linetype="dashed", color="red") +
  geom_hline(yintercept=30, linetype="dashed", color="red") +
  geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats) +
  scale_y_continuous(name="Mean Phred score", expand=c(0,0), limits=c(10,45)) +
  scale_x_continuous(name="Position", limits=c(0,NA),
                     breaks=seq(0,140,20), expand=c(0,0)) +
  facet_grid(stage~.)
g_quality_base

Code
g_quality_seq <- g_qual +
  geom_vline(xintercept=25, linetype="dashed", color="red") +
  geom_vline(xintercept=30, linetype="dashed", color="red") +
  geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats) +
  scale_x_continuous(name="Mean Phred score", expand=c(0,0)) +
  scale_y_continuous(name="# Sequences", expand=c(0,0)) +
  facet_grid(stage~.)
g_quality_seq

According to FASTQC, deduplication was very effective at reducing measured duplicate levels, which for single-room samples fell from an average of 40% to 7.8% for RNA reads and from 5.5% to 0.3% for DNA reads.

Code
stage_dup <- basic_stats %>% group_by(na_type, sample_type, stage) %>% 
  summarize(dmin = min(percent_duplicates), dmax=max(percent_duplicates),
            dmean=mean(percent_duplicates), .groups = "drop")

g_dup_stages <- g_stage_trace +
  geom_line(aes(y=percent_duplicates)) +
  scale_y_continuous("% Duplicates", limits=c(0,NA), expand=c(0,0))
g_dup_stages

Code
g_readlen_stages <- g_stage_trace + geom_line(aes(y=mean_seq_len)) +
  scale_y_continuous("Mean read length (nt)", expand=c(0,0), limits=c(0,NA))
g_readlen_stages

High-level composition

As before, to assess the high-level composition of the reads, I ran the ribodepleted files through Kraken (using the Standard 16 database) and summarized the results with Bracken. Combining these results with the read counts above gives us a breakdown of the inferred composition of the samples:

Code
# Import Bracken data
bracken_path <- file.path(data_dir, "bracken_counts.tsv")
bracken <- read_tsv(bracken_path, show_col_types = FALSE)
total_assigned <- bracken %>% group_by(sample) %>% summarize(
  name = "Total",
  kraken_assigned_reads = sum(kraken_assigned_reads),
  added_reads = sum(added_reads),
  new_est_reads = sum(new_est_reads),
  fraction_total_reads = sum(fraction_total_reads)
)
bracken_spread <- bracken %>% select(name, sample, new_est_reads) %>%
  mutate(name = tolower(name)) %>%
  pivot_wider(id_cols = "sample", names_from = "name", 
              values_from = "new_est_reads")

# Count reads
read_counts_preproc <- basic_stats %>% 
  select(sample, room, na_type, sample_type, stage, n_read_pairs) %>%
  pivot_wider(id_cols = c("sample", "room", "na_type", "sample_type"),
              names_from="stage", values_from="n_read_pairs")
read_counts <- read_counts_preproc %>%
  inner_join(total_assigned %>% select(sample, new_est_reads), by = "sample") %>%
  rename(assigned = new_est_reads) %>%
  inner_join(bracken_spread, by="sample")

# Assess composition
read_comp <- transmute(read_counts, sample=sample, na_type=na_type,
                       room=room, sample_type=sample_type,
                       n_filtered = raw_concat-cleaned,
                       n_duplicate = cleaned-dedup,
                       n_ribosomal = (dedup-ribo_initial) + (ribo_initial-ribo_secondary),
                       n_unassigned = ribo_secondary-assigned,
                       n_bacterial = bacteria,
                       n_archaeal = archaea,
                       n_viral = viruses,
                       n_human = eukaryota)
read_comp_long <- pivot_longer(read_comp, -(sample:sample_type), 
                               names_to = "classification",
                               names_prefix = "n_", values_to = "n_reads") %>%
  mutate(classification = fct_inorder(str_to_sentence(classification))) %>%
  group_by(sample) %>% mutate(p_reads = n_reads/sum(n_reads))

# Summarize composition
read_comp_summ <- read_comp_long %>% 
  group_by(na_type, sample_type, classification) %>%
  summarize(n_reads = sum(n_reads), .groups = "drop_last") %>%
  mutate(n_reads = replace_na(n_reads,0),
    p_reads = n_reads/sum(n_reads),
    pc_reads = p_reads*100)
Code
# Prepare plotting templates
g_comp_base <- ggplot(mapping=aes(x=room, y=p_reads, fill=classification)) +
  scale_x_discrete(name="Collection Date") +
  facet_grid(na_type~sample_type, scales = "free", space = "free_x") +
  theme_rotate
scale_y_pc_reads <- purrr::partial(scale_y_continuous, name = "% Reads",
                                   expand = c(0,0), labels = function(y) y*100)

# Plot overall composition
g_comp <- g_comp_base + geom_col(data = read_comp_long, position = "stack") +
  scale_y_pc_reads(limits = c(0,1.01), breaks = seq(0,1,0.2)) +
  scale_fill_brewer(palette = "Set1", name = "Classification")
g_comp

Code
# Plot composition of minor components
read_comp_minor <- read_comp_long %>% 
  filter(classification %in% c("Archaeal", "Viral", "Other"))
palette_minor <- brewer.pal(9, "Set1")[c(6,7,9)]
g_comp_minor <- g_comp_base + geom_col(data=read_comp_minor, position = "stack") +
  scale_y_pc_reads() +
  scale_fill_manual(values=palette_minor, name = "Classification")
g_comp_minor

Code
p_reads_summ_group <- read_comp_long %>%
  mutate(classification = ifelse(classification %in% c("Filtered", "Duplicate", "Unassigned"), "Excluded", as.character(classification)),
         classification = fct_inorder(classification)) %>%
  group_by(classification, sample, na_type, sample_type) %>%
  summarize(p_reads = sum(p_reads), .groups = "drop") %>%
  group_by(classification, na_type, sample_type) %>%
  summarize(pc_min = min(p_reads)*100, pc_max = max(p_reads)*100, 
            pc_mean = mean(p_reads)*100, .groups = "drop")
p_reads_summ_prep <- p_reads_summ_group %>%
  mutate(classification = fct_inorder(classification),
         pc_min = pc_min %>% signif(digits=2) %>% sapply(format, scientific=FALSE, trim=TRUE, digits=2),
         pc_max = pc_max %>% signif(digits=2) %>% sapply(format, scientific=FALSE, trim=TRUE, digits=2),
         pc_mean = pc_mean %>% signif(digits=2) %>% sapply(format, scientific=FALSE, trim=TRUE, digits=2),
         display = paste0(pc_min, "-", pc_max, "% (mean ", pc_mean, "%)"))
p_reads_summ <- p_reads_summ_prep %>%
  select(sample_type, classification, na_type, display) %>%
  pivot_wider(names_from=na_type, values_from = display) %>% 
  arrange(sample_type, classification)
p_reads_summ

As with the Prussin data, human reads are very elevated compared to wastewater, making up an average of 21% of read pairs in RNA libraries and 4% for DNA libraries for single-room samples (in both cases, the maximum abundance of human reads across samples is much higher). Viral read abundances are higher than Prussin, averaging 0.08% for single-room RNA libraries and 0.04% for DNA libraries; for some reason, the pooled RNA library shows much higher viral prevalence, at 0.33%. The most obvious difference from the Prussin data is that DNA libraries in Rosario show a much higher fraction of non-ribosomal unassigned reads, with an average of 78% of all reads for single-room libraries.

As in Prussin, viral DNA reads were dominated by Caudoviricetes phages, with many unclassifiable below the class level. Other prominent viral classes in DNA reads included Papovaviricetes (which includes Polyomaviridae and Papillomaviridae), Megaviridae (giant viruses) and Repensiviricetes (plant and fungal viruses). Pooled samples also showed a significant level of Quintoviricetes (parvoviruses).

RNA libraries, meanwhile, were dominated by Alsuviricetes (mainly plant viruses) and, notably, Pisoniviricetes (which includes coronaviruses, picornaviruses including Enterovirus, and caliciviruses including Norovirus). Caudoviricetes, Herviviricetes (herpesviruses) and Revtraviricetes (retroviruses + Hep B) were also prominent in some samples.

Code
# Get viral taxonomy
viral_taxa_path <- file.path(data_dir, "viral-taxids.tsv.gz")
viral_taxa <- read_tsv(viral_taxa_path, show_col_types = FALSE)

# Import Kraken reports & extract viral taxa
samples <- as.character(basic_stats_raw$sample)
col_names <- c("pc_reads_total", "n_reads_clade", "n_reads_direct",
               "rank", "taxid", "name")
report_paths <- paste0(data_dir, "kraken/", samples, ".report.gz")
kraken_reports <- lapply(1:length(samples), function(n) 
  read_tsv(report_paths[n], col_names = col_names, show_col_types = FALSE) %>%
    mutate(sample = samples[n])) %>% bind_rows
kraken_reports_viral <- filter(kraken_reports, taxid %in% viral_taxa$taxid) %>%
  group_by(sample) %>%
  mutate(p_reads_viral = n_reads_clade/n_reads_clade[1])
kraken_reports_viral_cleaned <- kraken_reports_viral %>%
  inner_join(libraries, by="sample") %>%
  select(-pc_reads_total, -n_reads_direct) %>%
  select(name, taxid, p_reads_viral, n_reads_clade, everything())

viral_classes <- kraken_reports_viral_cleaned %>% filter(rank == "C")
viral_families <- kraken_reports_viral_cleaned %>% filter(rank == "F")
Code
# Identify major viral classes
viral_classes_major_tab <- viral_classes %>% 
  group_by(name, taxid) %>%
  summarize(p_reads_viral_max = max(p_reads_viral), .groups="drop") %>%
  filter(p_reads_viral_max >= 0.04)
viral_classes_major_list <- viral_classes_major_tab %>% pull(name)
viral_classes_major <- viral_classes %>% 
  filter(name %in% viral_classes_major_list) %>%
  select(name, taxid, sample, room, na_type, sample_type, p_reads_viral)
viral_classes_minor <- viral_classes_major %>% 
  group_by(sample, room, na_type, sample_type) %>%
  summarize(p_reads_viral_major = sum(p_reads_viral), .groups = "drop") %>%
  mutate(name = "Other", taxid=NA, p_reads_viral = 1-p_reads_viral_major) %>%
  select(name, taxid, sample, room, na_type, sample_type, p_reads_viral)
viral_classes_display <- bind_rows(viral_classes_major, viral_classes_minor) %>%
  arrange(desc(p_reads_viral)) %>% 
  mutate(name = factor(name, levels=c(viral_classes_major_list, "Other")),
         p_reads_viral = pmax(p_reads_viral, 0)) %>%
  rename(p_reads = p_reads_viral, classification=name)

palette_viral <- c(brewer.pal(12, "Set3"), brewer.pal(8, "Dark2"))
g_classes <- g_comp_base + 
  geom_col(data=viral_classes_display, position = "stack") +
  scale_y_continuous(name="% Viral Reads", limits=c(0,1.01), breaks = seq(0,1,0.2),
                     expand=c(0,0), labels = function(y) y*100) +
  scale_fill_manual(values=palette_viral, name = "Viral class")
g_classes

Of these, the strong presence of Papovaviricetes and Pisoniviricetes across multiple samples is the most interesting, as both contain important human pathogens. The presence of Quintoviricetes, Herviviricetes and Revtraviricetes in a smaller number of samples is also notable.

In DNA reads, Papovaviricetes are quite heterogeneous across samples, while mostly being dominated by one or two genera within samples. The three rooms with the largest Papovaviricetes fraction (Female room 1 & 3 and Male room 4) are each dominated by a single genus, specifically Dyothetapapillomavirus (of which the sole member is Dyothetapapillomavirus 1, a cat pathogen), Betapapillomavirus (a genus of human wart-causing viruses), and Alphapolyomavirus (specifically Merkel cell polyomavirus).

Code
# Get all read counts in class
papova_taxid <- 2732421
papova_desc_taxids_old <- papova_taxid
papova_desc_taxids_new <- unique(c(papova_desc_taxids_old, viral_taxa %>% filter(parent_taxid %in% papova_desc_taxids_old) %>% pull(taxid)))
while (length(papova_desc_taxids_new) > length(papova_desc_taxids_old)){
  papova_desc_taxids_old <- papova_desc_taxids_new
  papova_desc_taxids_new <- unique(c(papova_desc_taxids_old, viral_taxa %>% filter(parent_taxid %in% papova_desc_taxids_old) %>% pull(taxid)))
}
papova_counts <- kraken_reports_viral_cleaned %>%
  filter(taxid %in% papova_desc_taxids_new) %>%
  mutate(p_reads_papova = n_reads_clade/n_reads_clade[1])

# Get genus composition
papova_genera <- papova_counts %>% filter(rank == "G", na_type == "DNA")
papova_genera_major_tab <- papova_genera %>% 
  group_by(name, taxid) %>%
  summarize(p_reads_papova_max = max(p_reads_papova), .groups="drop") %>%
  filter(p_reads_papova_max >= 0.04)
papova_genera_major_list <- papova_genera_major_tab %>% pull(name)
papova_genera_major <- papova_genera %>% 
  filter(name %in% papova_genera_major_list) %>%
  select(name, taxid, sample, room, na_type, sample_type, p_reads_papova)
papova_genera_minor <- papova_genera_major %>% 
  group_by(sample, room, na_type, sample_type) %>%
  summarize(p_reads_papova_major = sum(p_reads_papova), .groups = "drop") %>%
  mutate(name = "Other", taxid=NA, p_reads_papova = 1-p_reads_papova_major) %>%
  select(name, taxid, sample, room, na_type, sample_type, p_reads_papova)
papova_genera_display <- bind_rows(papova_genera_major, papova_genera_minor) %>%
  arrange(desc(p_reads_papova)) %>% 
  mutate(name = factor(name, levels=c(papova_genera_major_list, "Other"))) %>%
  rename(p_reads = p_reads_papova, classification=name)

# Plot
g_papova_genera <- g_comp_base + 
  geom_col(data=papova_genera_display, position = "stack") +
  scale_y_continuous(name="% Papovaviricetes Reads", limits=c(0,1.01), 
                     breaks = seq(0,1,0.2),
                     expand=c(0,0), labels = function(y) y*100) +
  scale_fill_manual(values=palette_viral, name = "Viral genus") +
  guides(fill=guide_legend(ncol=3))
g_papova_genera

In RNA reads, Pisoniviricetes are conversely dominated by a single genus, Sobemovirus, across all samples, and specifically a single species. Disappointingly, this is Ryegrass mottle virus, a plant pathogen.

Code
# Get all read counts in class
pisoni_taxid <- 2732506
pisoni_desc_taxids_old <- pisoni_taxid
pisoni_desc_taxids_new <- unique(c(pisoni_desc_taxids_old, viral_taxa %>% filter(parent_taxid %in% pisoni_desc_taxids_old) %>% pull(taxid)))
while (length(pisoni_desc_taxids_new) > length(pisoni_desc_taxids_old)){
  pisoni_desc_taxids_old <- pisoni_desc_taxids_new
  pisoni_desc_taxids_new <- unique(c(pisoni_desc_taxids_old, viral_taxa %>% filter(parent_taxid %in% pisoni_desc_taxids_old) %>% pull(taxid)))
}
pisoni_counts <- kraken_reports_viral_cleaned %>%
  filter(taxid %in% pisoni_desc_taxids_new) %>%
  mutate(p_reads_pisoni = n_reads_clade/n_reads_clade[1])

# Get genus composition
pisoni_genera <- pisoni_counts %>% filter(rank == "S", na_type == "RNA")
pisoni_genera_major_tab <- pisoni_genera %>% 
  group_by(name, taxid) %>%
  summarize(p_reads_pisoni_max = max(p_reads_pisoni), .groups="drop") %>%
  filter(p_reads_pisoni_max >= 0.04)
pisoni_genera_major_list <- pisoni_genera_major_tab %>% pull(name)
pisoni_genera_major <- pisoni_genera %>% 
  filter(name %in% pisoni_genera_major_list) %>%
  select(name, taxid, sample, room, na_type, sample_type, p_reads_pisoni)
pisoni_genera_minor <- pisoni_genera_major %>% 
  group_by(sample, room, na_type, sample_type) %>%
  summarize(p_reads_pisoni_major = sum(p_reads_pisoni), .groups = "drop") %>%
  mutate(name = "Other", taxid=NA, p_reads_pisoni = 1-p_reads_pisoni_major) %>%
  select(name, taxid, sample, room, na_type, sample_type, p_reads_pisoni)
pisoni_genera_display <- bind_rows(pisoni_genera_major, pisoni_genera_minor) %>%
  arrange(desc(p_reads_pisoni)) %>% 
  mutate(name = factor(name, levels=c(pisoni_genera_major_list, "Other"))) %>%
  rename(p_reads = p_reads_pisoni, classification=name)

# Plot
g_pisoni_genera <- g_comp_base + 
  geom_col(data=pisoni_genera_display, position = "stack") +
  scale_y_continuous(name="% Pisoniviricetes Reads", limits=c(0,1.01), 
                     breaks = seq(0,1,0.2),
                     expand=c(0,0), labels = function(y) y*100) +
  scale_fill_manual(values=palette_viral, name = "Viral genus") +
  guides(fill=guide_legend(ncol=3))
g_pisoni_genera

Human-infecting virus reads: validation

Next, I investigated the human-infecting virus read content of these unenriched samples. Using the same workflow I used for Prussin et al, I identified 118 RNA read pairs and 1269 DNA read pairs as putatively human viral: 0.004% and 0.017% of surviving reads, respectively.

Code
hv_reads_filtered_path <- file.path(data_dir, "hv_hits_putative_filtered.tsv.gz")
hv_reads_filtered <- read_tsv(hv_reads_filtered_path, show_col_types = FALSE) %>%
  inner_join(libraries, by="sample") %>% 
  arrange(na_type, sample_type)
n_hv_filtered <- hv_reads_filtered %>% 
  group_by(sample, room, na_type, sample_type) %>% count %>% 
  inner_join(basic_stats %>% filter(stage == "ribo_initial") %>% 
               select(sample, n_read_pairs), by="sample") %>% 
  rename(n_putative = n, n_total = n_read_pairs) %>% 
  mutate(p_reads = n_putative/n_total, pc_reads = p_reads * 100)
n_hv_filtered_summ <- n_hv_filtered %>% group_by(na_type) %>% 
  summarize(n_putative = sum(n_putative), n_total = sum(n_total), 
            .groups="drop") %>% 
  mutate(p_reads = n_putative/n_total, pc_reads = p_reads*100)
Code
mrg <- hv_reads_filtered %>%
  mutate(kraken_label = ifelse(assigned_hv, "Kraken2 HV\nassignment",
                               ifelse(hit_hv, "Kraken2 HV\nhit",
                                      "No hit or\nassignment"))) %>%
  group_by(sample, na_type) %>% arrange(desc(adj_score_fwd), desc(adj_score_rev)) %>%
  mutate(seq_num = row_number(),
         adj_score_max = pmax(adj_score_fwd, adj_score_rev))
# Import Bowtie2/Kraken data and perform filtering steps
g_mrg <- ggplot(mrg, aes(x=adj_score_fwd, y=adj_score_rev)) +
  geom_point(alpha=0.5, shape=16) + 
  scale_x_continuous(name="S(forward read)", limits=c(0,65), breaks=seq(0,100,10), expand = c(0,0)) +
  scale_y_continuous(name="S(reverse read)", limits=c(0,65), breaks=seq(0,100,10), expand = c(0,0)) +
  facet_grid(na_type~kraken_label, labeller = labeller(kit = label_wrap_gen(20))) +
  theme_base + theme(aspect.ratio=1)
g_mrg

Code
g_hist_0 <- ggplot(mrg, aes(x=adj_score_max)) + geom_histogram(binwidth=5,boundary=0,position="dodge") + facet_grid(na_type~kraken_label, scales="free_y") + scale_x_continuous(name = "Maximum adjusted alignment score") + scale_y_continuous(name="# Read pairs") + scale_fill_brewer(palette = "Dark2") + theme_base + geom_vline(xintercept=20, linetype="dashed", color="red")
g_hist_0

As previously described, I ran BLASTN on these reads via a dedicated EC2 instance, using the same parameters I’ve used for previous datasets.

Code
mrg_fasta <-  mrg %>%
  mutate(seq_head = paste0(">", seq_id)) %>%
  ungroup %>%
  select(header1=seq_head, seq1=query_seq_fwd, 
         header2=seq_head, seq2=query_seq_rev) %>%
  mutate(header1=paste0(header1, "_1"), header2=paste0(header2, "_2"))
mrg_fasta_out <- do.call(paste, c(mrg_fasta, sep="\n")) %>% paste(collapse="\n")
write(mrg_fasta_out, file.path(data_dir, "blast/putative-viral.fasta"))
Code
# Import BLAST results
blast_results_path <- file.path(data_dir, "blast/putative-viral.blast.gz")
blast_cols <- c("qseqid", "sseqid", "sgi", "staxid", "qlen", "evalue", "bitscore", "qcovs", "length", "pident", "mismatch", "gapopen", "sstrand", "qstart", "qend", "sstart", "send")
blast_results <- read_tsv(blast_results_path, show_col_types = FALSE,
                          col_names = blast_cols, col_types = cols(.default="c"))
# blast_results_path <- file.path(data_dir, "blast/putative-viral-best.blast.gz")
# blast_results <- read_tsv(blast_results_path, show_col_types = FALSE)

# Filter for best hit for each query/subject combination
blast_results_best <- blast_results %>% group_by(qseqid, staxid) %>% 
  filter(bitscore == max(bitscore)) %>%
  filter(length == max(length)) %>% filter(row_number() == 1)

# Rank hits for each query and filter for high-ranking hits
blast_results_ranked <- blast_results_best %>% 
  group_by(qseqid) %>% mutate(rank = dense_rank(desc(bitscore)))
blast_results_highrank <- blast_results_ranked %>% filter(rank <= 5) %>%
    mutate(read_pair = str_split(qseqid, "_") %>% sapply(nth, n=-1), 
         seq_id = str_split(qseqid, "_") %>% sapply(nth, n=1)) %>%
    mutate(bitscore = as.numeric(bitscore))

# Summarize by read pair and taxid
blast_results_paired <- blast_results_highrank %>%
  group_by(seq_id, staxid) %>%
  summarize(bitscore_max = max(bitscore), bitscore_min = min(bitscore),
            n_reads = n(), .groups = "drop")

# Add viral status
blast_results_viral <- mutate(blast_results_paired, viral = staxid %in% viral_taxa$taxid) %>%
  mutate(viral_full = viral & n_reads == 2)

# Compare to Kraken & Bowtie assignments
mrg_assign <- mrg %>% select(sample, seq_id, taxid, assigned_taxid, adj_score_max, na_type)
blast_results_assign <- left_join(blast_results_viral, mrg_assign, by="seq_id") %>%
    mutate(taxid_match_bowtie = (staxid == taxid),
           taxid_match_kraken = (staxid == assigned_taxid),
           taxid_match_any = taxid_match_bowtie | taxid_match_kraken)
blast_results_out <- blast_results_assign %>%
  group_by(seq_id) %>%
  summarize(viral_status = ifelse(any(viral_full), 2,
                                  ifelse(any(taxid_match_any), 2,
                                             ifelse(any(viral), 1, 0))),
            .groups = "drop")
Code
# Merge BLAST results with unenriched read data
mrg_blast <- full_join(mrg, blast_results_out, by="seq_id") %>%
  mutate(viral_status = replace_na(viral_status, 0),
         viral_status_out = ifelse(viral_status == 0, FALSE, TRUE))
mrg_blast_rna <- mrg_blast %>% filter(na_type == "RNA")
mrg_blast_dna <- mrg_blast %>% filter(na_type == "DNA")

# Plot RNA
g_mrg_blast_rna <- mrg_blast_rna %>%
  ggplot(aes(x=adj_score_fwd, y=adj_score_rev, color=viral_status_out)) +
  geom_point(alpha=0.5, shape=16) +
  scale_x_continuous(name="S(forward read)", limits=c(0,65), breaks=seq(0,100,10), expand = c(0,0)) +
  scale_y_continuous(name="S(reverse read)", limits=c(0,65), breaks=seq(0,100,10), expand = c(0,0)) +
  scale_color_brewer(palette = "Set1", name = "Viral status") +
  facet_grid(viral_status_out~kraken_label, labeller = labeller(kit = label_wrap_gen(20))) +
  theme_base + labs(title="RNA") + 
  theme(aspect.ratio=1, plot.title = element_text(size=rel(2), hjust=0))
g_mrg_blast_rna

Code
# Plot DNA
g_mrg_blast_dna <- mrg_blast_dna %>%
  ggplot(aes(x=adj_score_fwd, y=adj_score_rev, color=viral_status_out)) +
  geom_point(alpha=0.5, shape=16) +
  scale_x_continuous(name="S(forward read)", limits=c(0,65), breaks=seq(0,100,10), expand = c(0,0)) +
  scale_y_continuous(name="S(reverse read)", limits=c(0,65), breaks=seq(0,100,10), expand = c(0,0)) +
  scale_color_brewer(palette = "Set1", name = "Viral status") +
  facet_grid(viral_status_out~kraken_label, labeller = labeller(kit = label_wrap_gen(20))) +
  theme_base + labs(title="DNA") + 
  theme(aspect.ratio=1, plot.title = element_text(size=rel(2), hjust=0))
g_mrg_blast_dna

Code
g_hist_1 <- ggplot(mrg_blast, aes(x=adj_score_max)) + geom_histogram(binwidth=5,boundary=0,position="dodge") + facet_grid(na_type~viral_status_out, scales = "free_y") + scale_x_continuous(name = "Maximum adjusted alignment score") + scale_y_continuous(name="# Read pairs") + scale_fill_brewer(palette = "Dark2") + theme_base + geom_vline(xintercept=20, linetype="dashed", color="red")
g_hist_1

These results look good on visual inspection, and indeed precision and sensitivity are both very high. For a disjunctive score threshold of 20, my updated workflow achieves an F1 score of 98.2% for RNA sequences and 97.9% for DNA sequences.

Code
test_sens_spec <- function(tab, score_threshold, conjunctive, include_special){
  if (!include_special) tab <- filter(tab, viral_status_out %in% c("TRUE", "FALSE"))
  tab_retained <- tab %>% mutate(
    conjunctive = conjunctive,
    retain_score_conjunctive = (adj_score_fwd > score_threshold & adj_score_rev > score_threshold), 
    retain_score_disjunctive = (adj_score_fwd > score_threshold | adj_score_rev > score_threshold),
    retain_score = ifelse(conjunctive, retain_score_conjunctive, retain_score_disjunctive),
    retain = assigned_hv | hit_hv | retain_score) %>%
    group_by(viral_status_out, retain) %>% count
  pos_tru <- tab_retained %>% filter(viral_status_out == "TRUE", retain) %>% pull(n) %>% sum
  pos_fls <- tab_retained %>% filter(viral_status_out != "TRUE", retain) %>% pull(n) %>% sum
  neg_tru <- tab_retained %>% filter(viral_status_out != "TRUE", !retain) %>% pull(n) %>% sum
  neg_fls <- tab_retained %>% filter(viral_status_out == "TRUE", !retain) %>% pull(n) %>% sum
  sensitivity <- pos_tru / (pos_tru + neg_fls)
  specificity <- neg_tru / (neg_tru + pos_fls)
  precision   <- pos_tru / (pos_tru + pos_fls)
  f1 <- 2 * precision * sensitivity / (precision + sensitivity)
  out <- tibble(threshold=score_threshold, include_special = include_special, 
                conjunctive = conjunctive, sensitivity=sensitivity, 
                specificity=specificity, precision=precision, f1=f1)
  return(out)
}
range_f1 <- function(intab, inc_special, inrange=15:45){
  tss <- purrr::partial(test_sens_spec, tab=intab, include_special=inc_special)
  stats_conj <- lapply(inrange, tss, conjunctive=TRUE) %>% bind_rows
  stats_disj <- lapply(inrange, tss, conjunctive=FALSE) %>% bind_rows
  stats_all <- bind_rows(stats_conj, stats_disj) %>%
    pivot_longer(!(threshold:conjunctive), names_to="metric", values_to="value") %>%
    mutate(conj_label = ifelse(conjunctive, "Conjunctive", "Disjunctive"))
  return(stats_all)
}
inc_special <- FALSE
stats_rna <- range_f1(mrg_blast_rna, inc_special) %>% mutate(na_type = "RNA")
stats_dna <- range_f1(mrg_blast_dna, inc_special) %>% mutate(na_type = "DNA")
stats_0 <- bind_rows(stats_rna, stats_dna) %>% mutate(attempt=0)
threshold_opt_0 <- stats_0 %>% group_by(conj_label,attempt,na_type) %>% 
  filter(metric == "f1") %>%
  filter(value == max(value)) %>% filter(threshold == min(threshold))
g_stats_0 <- ggplot(stats_0, aes(x=threshold, y=value, color=metric)) +
  geom_vline(data = threshold_opt_0, mapping = aes(xintercept=threshold),
             color = "red", linetype = "dashed") +
  geom_line() +
  scale_y_continuous(name = "Value", limits=c(0,1), breaks = seq(0,1,0.2), expand = c(0,0)) +
  scale_x_continuous(name = "Threshold", expand = c(0,0)) +
  scale_color_brewer(palette="Set3") +
  facet_grid(na_type~conj_label) +
  theme_base
g_stats_0

Looking into the composition of different read groups, the most notable observations for me are the predominance of (1) HIV reads among low-scoring RNA “true-positives”, and (2) human mastadenovirus C among RNA true-positives in general:

Code
viral_taxa$name[viral_taxa$taxid == 211787] <- "Human papillomavirus type 92"
viral_taxa$name[viral_taxa$taxid == 509154] <- "Porcine endogenous retrovirus C"
viral_taxa$name[viral_taxa$taxid == 493803] <- "Merkel cell polyomavirus"
viral_taxa$name[viral_taxa$taxid == 427343] <- "Human papillomavirus 107"

major_threshold <- 0.1
fp <- mrg_blast %>% 
  group_by(na_type, viral_status_out, 
           highscore = adj_score_max >= 20, taxid) %>% count %>% 
  group_by(na_type, viral_status_out, highscore) %>% mutate(p=n/sum(n)) %>% 
  left_join(viral_taxa, by="taxid") %>%
  arrange(desc(p)) %>%
  mutate(name = ifelse(taxid == 194958, "Porcine endogenous retrovirus A", name))
fp_major_tab <- fp %>% filter(p > major_threshold) %>% arrange(desc(p))
fp_major_list <- fp_major_tab %>% pull(name) %>% sort %>% unique %>% c(., "Other")
fp_major <- fp %>% mutate(major = p > major_threshold) %>% 
  mutate(name_display = ifelse(major, name, "Other")) %>%
  group_by(na_type, viral_status_out, highscore, name_display) %>% 
  summarize(n=sum(n), p=sum(p), .groups = "drop")  %>%
  mutate(name_display = factor(name_display, levels = fp_major_list),
         score_display = ifelse(highscore, "S >= 20", "S < 20"),
         status_display = ifelse(viral_status_out, "True positive", "False positive"))
g_fp <- ggplot(fp_major, aes(x=score_display, y=p, fill=name_display)) +
  geom_col(position="stack") +
  scale_x_discrete(name = "True positive?") +
  scale_y_continuous(name = "% reads", limits = c(0,1.01), 
                     breaks = seq(0,1,0.2), expand = c(0,0)) +
  scale_fill_manual(values = palette_viral, name = "Viral\ntaxon") +
  facet_grid(na_type~status_display) +
  guides(fill=guide_legend(ncol=3)) +
  theme_kit
g_fp

There are only eight low-scoring true-positive RNA reads, three of which are HIV. These do seem to be real; at least, they don’t BLAST to anything else. Ironically the eight high-scoring “true-positive” HIV reads seem more likely to be fake; they also map to a range of cloning vectors:

Code
# Configure
ref_taxid_hiv <- 11676
p_threshold <- 0.3

# Get taxon names
tax_names_path <- file.path(data_dir, "taxid-names.tsv.gz")
tax_names <- read_tsv(tax_names_path, show_col_types = FALSE)

# Add missing names
tax_names_new <- tribble(~staxid, ~name,
                         3050295, "Cytomegalovirus humanbeta5",
                         459231, "FLAG-tagging vector pFLAG97-TSR")
tax_names <- bind_rows(tax_names, tax_names_new)
ref_name_hiv <- tax_names %>% filter(staxid == ref_taxid_hiv) %>% pull(name)

# Get major matches
fp_staxid <- mrg_blast %>% filter(taxid == ref_taxid_hiv) %>%
  group_by(na_type, highscore = adj_score_max >= 20) %>% mutate(n_seq = n()) %>%
  left_join(blast_results_paired, by="seq_id") %>%
  mutate(staxid = as.integer(staxid)) %>%
  left_join(tax_names, by="staxid") %>% rename(sname=name) %>%
  left_join(tax_names %>% rename(taxid=staxid), by="taxid")
fp_staxid_count <- fp_staxid %>%
  group_by(viral_status_out, highscore, na_type, 
           taxid, name, staxid, sname, n_seq) %>%
  count %>%
  group_by(viral_status_out, highscore, na_type, taxid, name) %>%
  mutate(p=n/n_seq)
fp_staxid_count_major <- fp_staxid_count %>%
  filter(n>1, p>p_threshold, !is.na(staxid)) %>%
  mutate(score_display = ifelse(highscore, "S >= 20", "S < 20"),
         status_display = ifelse(viral_status_out, 
                                 "True positive", "False positive"))

# Plot
g <- ggplot(fp_staxid_count_major, aes(x=p, y=sname)) + 
  geom_col() + 
  facet_grid(na_type~status_display+score_display, scales="free",
             labeller = label_wrap_gen(multi_line = FALSE)) +
  scale_x_continuous(name="% mapped reads", limits=c(0,1), breaks=seq(0,1,0.2),
                     expand=c(0,0)) +
  labs(title=paste0(ref_name_hiv, " (taxid ", ref_taxid_hiv, ")")) +
  theme_base + theme(
    axis.title.y = element_blank(),
    plot.title = element_text(size=rel(1.4), hjust=0, face="plain"))
g

A similar pattern is seen for the Human mastadenovirus C hits: low-scoring true-positives only map to adenoviruses, but high-scoring ones also map to a range of cloning vectors.

Code
# Configure
ref_taxid_masta <- 129951
p_threshold <- 0.1

# Add missing names
tax_names_new <- tribble(~staxid, ~name,
                         3050295, "Cytomegalovirus humanbeta5",
                         459231, "FLAG-tagging vector pFLAG97-TSR",
                         3033760, "Human adenovirus 89",
                         3043599, "Human adenovirus 108",
                         3088344, "Human adenovirus C108",
                         3102992, "Cloning vector pBWH-C5",
                         3102994, "Cloning vector pBWH-C5-pIX-Kan",
                         3102995, "Cloning vector pBWH-C5-pIXZG"
                         )
tax_names <- bind_rows(tax_names, tax_names_new)
ref_name_masta <- tax_names %>% filter(staxid == ref_taxid_masta) %>% pull(name)

# Get major matches
fp_staxid <- mrg_blast %>% filter(taxid == ref_taxid_masta) %>%
  group_by(na_type, highscore = adj_score_max >= 20) %>% mutate(n_seq = n()) %>%
  left_join(blast_results_paired, by="seq_id") %>%
  mutate(staxid = as.integer(staxid)) %>%
  left_join(tax_names, by="staxid") %>% rename(sname=name) %>%
  left_join(tax_names %>% rename(taxid=staxid), by="taxid")
fp_staxid_count <- fp_staxid %>%
  group_by(viral_status_out, highscore, na_type, 
           taxid, name, staxid, sname, n_seq) %>%
  count %>%
  group_by(viral_status_out, highscore, na_type, taxid, name) %>%
  mutate(p=n/n_seq)
fp_staxid_count_major <- fp_staxid_count %>%
  filter(p>p_threshold, !is.na(staxid)) %>%
  mutate(score_display = ifelse(highscore, "S >= 20", "S < 20"),
         status_display = ifelse(viral_status_out, 
                                 "True positive", "False positive"))

# Plot
g <- ggplot(fp_staxid_count_major, aes(x=p, y=sname)) + 
  geom_col() + 
  facet_grid(na_type~status_display+score_display, scales="free",
             labeller = label_wrap_gen(multi_line = FALSE)) +
  scale_x_continuous(name="% mapped reads", limits=c(0,1), breaks=seq(0,1,0.2),
                     expand=c(0,0)) +
  labs(title=paste0(ref_name_masta, " (taxid ", ref_taxid_masta, ")")) +
  theme_base + theme(
    axis.title.y = element_blank(),
    plot.title = element_text(size=rel(1.5), hjust=0, face="plain"))
g

The plots above only tell us the total number of reads that BLAST to each cloning vector; they don’t tell us how many reads map to any vector. In particular, they can’t distinguish between a situation where many reads map to a few cloning vectors, and one where a few reads map to many. This turns out to be about 35% of high-scoring HIV RNA reads and 55% of high-scoring mastadenovirus RNA reads:

Code
fp_hiv_vec <- mrg_blast %>% filter(taxid == ref_taxid_hiv) %>%
  left_join(blast_results_paired, by="seq_id")  %>%
  mutate(staxid = as.integer(staxid)) %>%
  left_join(tax_names, by="staxid") %>% rename(sname=name) %>%
  left_join(tax_names %>% rename(taxid=staxid), by="taxid") %>%
  group_by(vector = grepl("vector", sname))
fp_hiv_vec_match <- fp_hiv_vec %>%
  group_by(viral_status_out, seq_id, na_type, highscore = adj_score_max >= 20) %>%
  summarize(vector_match = any(vector), .groups = "drop")
fp_hiv_vec_count <- fp_hiv_vec_match %>%
  group_by(viral_status_out, na_type, highscore, vector_match) %>% count %>%
  group_by(viral_status_out, na_type, highscore) %>% mutate(p=n/sum(n)) %>%
  mutate(score_display = ifelse(highscore, "S >= 20", "S < 20"),
         status_display = ifelse(viral_status_out,
                                 "True positive", "False positive"))


g_hiv_vec_count <- ggplot(fp_hiv_vec_count, aes(y=score_display, x=p, fill=vector_match)) +
  geom_col() +
  scale_x_continuous(name="% Mapped Reads", limits=c(0,1), breaks=seq(0,1,0.2),
                     expand=c(0,0)) +
  scale_fill_brewer(name="Maps to vector?", palette = "Dark2") +
  facet_grid(na_type~status_display, scales="free") +
  labs(title=paste0(ref_name_hiv, " (taxid ", ref_taxid_hiv, ")")) +
  theme_base + theme(
    axis.title.y = element_blank(),
    plot.title = element_text(size=rel(1.4), hjust=0, face="plain"))
g_hiv_vec_count

Code
fp_masta_vec <- mrg_blast %>% filter(taxid == ref_taxid_masta) %>%
  left_join(blast_results_paired, by="seq_id")  %>%
  mutate(staxid = as.integer(staxid)) %>%
  left_join(tax_names, by="staxid") %>% rename(sname=name) %>%
  left_join(tax_names %>% rename(taxid=staxid), by="taxid") %>%
  group_by(vector = grepl("vector", sname))
fp_masta_vec_match <- fp_masta_vec %>%
  group_by(viral_status_out, seq_id, na_type, highscore = adj_score_max >= 20) %>%
  summarize(vector_match = any(vector), .groups = "drop")
fp_masta_vec_count <- fp_masta_vec_match %>%
  group_by(viral_status_out, na_type, highscore, vector_match) %>% count %>%
  group_by(viral_status_out, na_type, highscore) %>% mutate(p=n/sum(n)) %>%
  mutate(score_display = ifelse(highscore, "S >= 20", "S < 20"),
         status_display = ifelse(viral_status_out,
                                 "True positive", "False positive"))


g_masta_vec_count <- ggplot(fp_masta_vec_count, aes(y=score_display, x=p, fill=vector_match)) +
  geom_col() +
  scale_x_continuous(name="% Mapped Reads", limits=c(0,1), breaks=seq(0,1,0.2),
                     expand=c(0,0)) +
  scale_fill_brewer(name="Maps to vector?", palette = "Dark2") +
  facet_grid(na_type~status_display, scales="free") +
  labs(title=paste0(ref_name_masta, " (taxid ", ref_taxid_masta, ")")) +
  theme_base + theme(
    axis.title.y = element_blank(),
    plot.title = element_text(size=rel(1.4), hjust=0, face="plain"))
g_masta_vec_count

This is more concerning for the mastadenovirus reads, which make up a nontrivial fraction of all high-scoring RNA “true-positives”. As in my last post, though, it’s hard to know what to take from this; many vectors are based on lentiviral or adenoviral backbones, and it’s not always easy to distinguish between true virus and virus-based vector. I’m going to stick with my pipeline as it is for now, but it’s worth keeping this in mind in interpreting the results.

Human-infecting viruses: overall relative abundance

Code
# Get raw read counts
read_counts_raw <- basic_stats_raw %>%
  select(sample, room, na_type, sample_type, n_reads_raw = n_read_pairs)

# Get HV read counts
mrg_hv <- mrg %>% mutate(hv_status = assigned_hv | hit_hv | adj_score_max >= 20)
read_counts_hv <- mrg_hv %>% filter(hv_status) %>% group_by(sample) %>% 
  count(name="n_reads_hv")
read_counts <- read_counts_raw %>% left_join(read_counts_hv, by="sample") %>%
  mutate(n_reads_hv = replace_na(n_reads_hv, 0))

# Aggregate
read_counts_total <- read_counts %>% group_by(na_type, sample_type) %>%
  filter(n() > 1) %>%
  summarize(n_reads_raw = sum(n_reads_raw),
            n_reads_hv = sum(n_reads_hv), .groups="drop") %>%
  mutate(sample= "All samples", room = "All rooms")
read_counts_agg <- read_counts %>% arrange(sample) %>%
  bind_rows(read_counts_total) %>% 
  mutate(sample = fct_inorder(sample),
         p_reads_hv = n_reads_hv/n_reads_raw)

In non-control samples, applying a disjunctive cutoff at S=20 identifies 97 RNA reads and 1204 DNA reads as human-viral. This gives an overall relative HV abundance of \(1.21 \times 10^{-5}\) for RNA reads and \(1.50 \times 10^{-4}\) for DNA reads. For DNA reads, HV RA in the negative controls is about 10x lower than the non-control samples; however, for RNA reads, the HV RA in the negative controls is actually higher than in the non-controls. This, combined with the very low absolute read count, suggests that we shouldn’t take the HV results for RNA reads too seriously.

Code
# Visualize
g_phv_agg <- ggplot(read_counts_agg, aes(x=room, color=na_type)) +
  geom_point(aes(y=p_reads_hv)) +
  scale_y_log10("Relative abundance of human virus reads") +
  scale_x_discrete(name="Collection Date") +
  facet_grid(.~sample_type, scales = "free", space = "free_x") +
  scale_color_na() + theme_rotate
g_phv_agg

Code
# Collate past RA values
ra_past <- tribble(~dataset, ~ra, ~na_type, ~panel_enriched,
                   "Brumfield", 5e-5, "RNA", FALSE,
                   "Brumfield", 3.66e-7, "DNA", FALSE,
                   "Spurbeck", 5.44e-6, "RNA", FALSE,
                   "Yang", 3.62e-4, "RNA", FALSE,
                   "Rothman (unenriched)", 1.87e-5, "RNA", FALSE,
                   "Rothman (panel-enriched)", 3.3e-5, "RNA", TRUE,
                   "Crits-Christoph (unenriched)", 1.37e-5, "RNA", FALSE,
                   "Crits-Christoph (panel-enriched)", 1.26e-2, "RNA", TRUE,
                   "Prussin (non-control)", 1.63e-5, "RNA", FALSE,
                   "Prussin (non-control)", 4.16e-5, "DNA", FALSE
)

# Collate new RA values
ra_new <- tribble(~dataset, ~ra, ~na_type, ~panel_enriched,
                  "Rosario (non-control)", 1.21e-5, "RNA", FALSE,
                  "Rosario (non-control)", 1.50e-4, "DNA", FALSE)


# Plot
ra_comp <- bind_rows(ra_past, ra_new) %>% mutate(dataset = fct_inorder(dataset))
g_ra_comp <- ggplot(ra_comp, aes(y=dataset, x=ra, color=na_type)) +
  geom_point() +
  scale_color_na() +
  scale_x_log10(name="Relative abundance of human virus reads") +
  theme_base + theme(axis.title.y = element_blank())
g_ra_comp

Human-infecting viruses: taxonomy and composition

I’m going to focus on the DNA reads here, since as discussed above I expect the RNA reads to be noisy and not very informative – which seems to in fact be the case.

At the family level, as in Prussin, we see that Papillomaviridae, and Polyomaviridae dominate DNA reads in most samples, with Poxviridae and Genomoviridae showing significant presence in several samples. The first of these is primarily Betapapillomavirus and Gammapapillomavirus, with some Alphapapillomavirus; the second is primarily Alphapolyomavirus and Deltapolyomavirus. All of these are made up of several different viruses at the species level. Genomoviridae is represented primarily by Gemykibivirus, while Poxviridae is represented by small numbers of reads from several species, including vaccinia, cowpox and tanapox. (There’s also one variola read, which would be alarming, but BLASTN thinks it isn’t real.)

Compared to Prussin, the most striking difference is the absence of herpesvirus reads; whereas HCMV was dominant in the Prussin data, it barely shows up here except in the controls.

Code
# Get viral taxon names for putative HV reads
viral_taxa$name[viral_taxa$taxid == 249588] <- "Mamastrovirus"
viral_taxa$name[viral_taxa$taxid == 194960] <- "Kobuvirus"
viral_taxa$name[viral_taxa$taxid == 688449] <- "Salivirus"
viral_taxa$name[viral_taxa$taxid == 585893] <- "Picobirnaviridae"
viral_taxa$name[viral_taxa$taxid == 333922] <- "Betapapillomavirus"
viral_taxa$name[viral_taxa$taxid == 334207] <- "Betapapillomavirus 3"
viral_taxa$name[viral_taxa$taxid == 369960] <- "Porcine type-C oncovirus"
viral_taxa$name[viral_taxa$taxid == 333924] <- "Betapapillomavirus 2"
viral_taxa$name[viral_taxa$taxid == 687329] <- "Anelloviridae"
viral_taxa$name[viral_taxa$taxid == 325455] <- "Gammapapillomavirus"
viral_taxa$name[viral_taxa$taxid == 333750] <- "Alphapapillomavirus"
viral_taxa$name[viral_taxa$taxid == 694002] <- "Betacoronavirus"
viral_taxa$name[viral_taxa$taxid == 334202] <- "Mupapillomavirus"
viral_taxa$name[viral_taxa$taxid == 197911] <- "Alphainfluenzavirus"
viral_taxa$name[viral_taxa$taxid == 186938] <- "Respirovirus"
viral_taxa$name[viral_taxa$taxid == 333926] <- "Gammapapillomavirus 1"
viral_taxa$name[viral_taxa$taxid == 337051] <- "Betapapillomavirus 1"
viral_taxa$name[viral_taxa$taxid == 337043] <- "Alphapapillomavirus 4"
viral_taxa$name[viral_taxa$taxid == 694003] <- "Betacoronavirus 1"
viral_taxa$name[viral_taxa$taxid == 334204] <- "Mupapillomavirus 2"
viral_taxa$name[viral_taxa$taxid == 334208] <- "Betapapillomavirus 4"
viral_taxa$name[viral_taxa$taxid == 333928] <- "Gammapapillomavirus 2"
viral_taxa$name[viral_taxa$taxid == 337039] <- "Alphapapillomavirus 2"


mrg_hv_named <- mrg_hv %>% left_join(viral_taxa, by="taxid")

# Discover viral species & genera for HV reads
raise_rank <- function(read_db, taxid_db, out_rank = "species", verbose = FALSE){
  # Get higher ranks than search rank
  ranks <- c("subspecies", "species", "subgenus", "genus", "subfamily", "family", "suborder", "order", "class", "subphylum", "phylum", "kingdom", "superkingdom")
  rank_match <- which.max(ranks == out_rank)
  high_ranks <- ranks[rank_match:length(ranks)]
  # Merge read DB and taxid DB
  reads <- read_db %>% select(-parent_taxid, -rank, -name) %>%
    left_join(taxid_db, by="taxid")
  # Extract sequences that are already at appropriate rank
  reads_rank <- filter(reads, rank == out_rank)
  # Drop sequences at a higher rank and return unclassified sequences
  reads_norank <- reads %>% filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))
  while(nrow(reads_norank) > 0){ # As long as there are unclassified sequences...
    # Promote read taxids and re-merge with taxid DB, then re-classify and filter
    reads_remaining <- reads_norank %>% mutate(taxid = parent_taxid) %>%
      select(-parent_taxid, -rank, -name) %>%
      left_join(taxid_db, by="taxid")
    reads_rank <- reads_remaining %>% filter(rank == out_rank) %>%
      bind_rows(reads_rank)
    reads_norank <- reads_remaining %>%
      filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))
  }
  # Finally, extract and append reads that were excluded during the process
  reads_dropped <- reads %>% filter(!seq_id %in% reads_rank$seq_id)
  reads_out <- reads_rank %>% bind_rows(reads_dropped) %>%
    select(-parent_taxid, -rank, -name) %>%
    left_join(taxid_db, by="taxid")
  return(reads_out)
}
hv_reads_species <- raise_rank(mrg_hv_named, viral_taxa, "species")
hv_reads_genus <- raise_rank(mrg_hv_named, viral_taxa, "genus")
hv_reads_family <- raise_rank(mrg_hv_named, viral_taxa, "family")
Code
threshold_major_family <- 0.06

# Count reads for each human-viral family
hv_family_counts <- hv_reads_family %>% 
  group_by(sample, room, sample_type, na_type, name, taxid) %>%
  count(name = "n_reads_hv") %>%
  group_by(sample, room, sample_type, na_type) %>%
  mutate(p_reads_hv = n_reads_hv/sum(n_reads_hv))

# Identify high-ranking families and group others
hv_family_major_tab <- hv_family_counts %>% group_by(name) %>% 
  filter(p_reads_hv == max(p_reads_hv)) %>% filter(row_number() == 1) %>%
  arrange(desc(p_reads_hv)) %>% filter(p_reads_hv > threshold_major_family)
hv_family_counts_major <- hv_family_counts %>%
  mutate(name_display = ifelse(name %in% hv_family_major_tab$name, name, "Other")) %>%
  group_by(sample, room, sample_type, na_type, name_display) %>%
  summarize(n_reads_hv = sum(n_reads_hv), p_reads_hv = sum(p_reads_hv), 
            .groups="drop") %>%
  mutate(name_display = factor(name_display, 
                               levels = c(hv_family_major_tab$name, "Other")))
hv_family_counts_display <- hv_family_counts_major %>%
  rename(p_reads = p_reads_hv, classification = name_display)

# Plot
g_hv_family <- g_comp_base + 
  geom_col(data=hv_family_counts_display, position = "stack") +
  scale_y_continuous(name="% HV Reads", limits=c(0,1.01), 
                     breaks = seq(0,1,0.2),
                     expand=c(0,0), labels = function(y) y*100) +
  scale_fill_manual(values=palette_viral, name = "Viral family")
g_hv_family

Code
threshold_major_genus <- 0.05

# Count reads for each human-viral genus
hv_genus_counts <- hv_reads_genus %>% filter(na_type == "DNA") %>%
  group_by(sample, room, sample_type, na_type, name, taxid) %>%
  count(name = "n_reads_hv") %>%
  group_by(sample, room, sample_type, na_type) %>%
  mutate(p_reads_hv = n_reads_hv/sum(n_reads_hv))

# Identify high-ranking families and group others
hv_genus_major_tab <- hv_genus_counts %>% group_by(name) %>% 
  filter(p_reads_hv == max(p_reads_hv)) %>% filter(row_number() == 1) %>%
  arrange(desc(p_reads_hv)) %>% filter(p_reads_hv > threshold_major_genus)
hv_genus_counts_major <- hv_genus_counts %>%
  mutate(name_display = ifelse(name %in% hv_genus_major_tab$name, name, "Other")) %>%
  group_by(sample, room, sample_type, na_type, name_display) %>%
  summarize(n_reads_hv = sum(n_reads_hv), p_reads_hv = sum(p_reads_hv), 
            .groups="drop") %>%
  mutate(name_display = factor(name_display, 
                               levels = c(hv_genus_major_tab$name, "Other")))
hv_genus_counts_display <- hv_genus_counts_major %>%
  rename(p_reads = p_reads_hv, classification = name_display)

# Plot
g_hv_genus <- g_comp_base + 
  geom_col(data=hv_genus_counts_display, position = "stack") +
  scale_y_continuous(name="% HV Reads", limits=c(0,1.01), 
                     breaks = seq(0,1,0.2),
                     expand=c(0,0), labels = function(y) y*100) +
  scale_fill_manual(values=palette_viral, name = "Viral genus")
g_hv_genus

Code
threshold_major_species <- 0.1

# Count reads for each human-viral species
hv_species_counts <- hv_reads_species %>% filter(na_type == "DNA") %>%
  group_by(sample, room, sample_type, na_type, name, taxid) %>%
  count(name = "n_reads_hv") %>%
  group_by(sample, room, sample_type, na_type) %>%
  mutate(p_reads_hv = n_reads_hv/sum(n_reads_hv))

# Identify high-ranking families and group others
hv_species_major_tab <- hv_species_counts %>% group_by(name) %>% 
  filter(p_reads_hv == max(p_reads_hv)) %>% filter(row_number() == 1) %>%
  arrange(desc(p_reads_hv)) %>% filter(p_reads_hv > threshold_major_species)
hv_species_counts_major <- hv_species_counts %>%
  mutate(name_display = ifelse(name %in% hv_species_major_tab$name, name, "Other")) %>%
  group_by(sample, room, sample_type, na_type, name_display) %>%
  summarize(n_reads_hv = sum(n_reads_hv), p_reads_hv = sum(p_reads_hv), 
            .groups="drop") %>%
  mutate(name_display = factor(name_display, 
                               levels = c(hv_species_major_tab$name, "Other")))
hv_species_counts_display <- hv_species_counts_major %>%
  rename(p_reads = p_reads_hv, classification = name_display)

# Plot
g_hv_species <- g_comp_base + 
  geom_col(data=hv_species_counts_display, position = "stack") +
  scale_y_continuous(name="% HV Reads", limits=c(0,1.01), 
                     breaks = seq(0,1,0.2),
                     expand=c(0,0), labels = function(y) y*100) +
  scale_fill_manual(values=palette_viral, name = "Viral species") +
  guides(fill=guide_legend(ncol=3))
g_hv_species

Finally, here again are the overall relative abundances of the specific viral genera I picked out manually in my last entry:

Code
# Define reference genera
path_genera_rna <- c("Mamastrovirus", "Enterovirus", "Salivirus", "Kobuvirus", "Norovirus", "Sapovirus", "Rotavirus", "Alphacoronavirus", "Betacoronavirus", "Alphainfluenzavirus", "Betainfluenzavirus", "Lentivirus")
path_genera_dna <- c("Mastadenovirus", "Alphapolyomavirus", "Betapolyomavirus", "Alphapapillomavirus", "Betapapillomavirus", "Gammapapillomavirus", "Orthopoxvirus", "Simplexvirus",
                     "Lymphocryptovirus", "Cytomegalovirus", "Dependoparvovirus")
path_genera <- bind_rows(tibble(name=path_genera_rna, genome_type="RNA genome"),
                         tibble(name=path_genera_dna, genome_type="DNA genome")) %>%
  left_join(viral_taxa, by="name")

# Count in each sample
n_path_genera <- hv_reads_genus %>% 
  group_by(sample, room, na_type, sample_type, name, taxid) %>% 
  count(name="n_reads_viral") %>% 
  inner_join(path_genera, by=c("name", "taxid")) %>%
  left_join(read_counts_raw, by=c("sample", "room", "na_type", "sample_type")) %>%
  mutate(p_reads_viral = n_reads_viral/n_reads_raw)

# Pivot out and back to add zero lines
n_path_genera_out <- n_path_genera %>% ungroup %>% select(sample, name, n_reads_viral) %>%
  pivot_wider(names_from="name", values_from="n_reads_viral", values_fill=0) %>%
  pivot_longer(-sample, names_to="name", values_to="n_reads_viral") %>%
  left_join(read_counts_raw, by="sample") %>%
  left_join(path_genera, by="name") %>%
  mutate(p_reads_viral = n_reads_viral/n_reads_raw)

## Aggregate across dates
n_path_genera_stype <- n_path_genera_out %>% 
  group_by(na_type, sample_type,
           name, taxid, genome_type) %>%
  summarize(n_reads_raw = sum(n_reads_raw),
            n_reads_viral = sum(n_reads_viral), .groups = "drop") %>%
  mutate(sample="All samples", date="All dates",
         p_reads_viral = n_reads_viral/n_reads_raw)

# Plot
g_path_genera <- ggplot(n_path_genera_stype,
                        aes(y=name, x=p_reads_viral, color=na_type)) +
  geom_point() +
  scale_x_log10(name="Relative abundance") +
  scale_color_na() +
  facet_grid(genome_type~sample_type, scales="free_y") +
  theme_base + theme(axis.title.y = element_blank())
g_path_genera
Warning: Transformation introduced infinite values in continuous x-axis

One immediate observation, which I think helps demonstrate the unreliability of the RNA data here, is the almost complete absence of RNA-virus genera of interest. It’s very suspicious for putative HIV reads to be almost as abundant as Enterovirus reads (which, remember, includes rhinovirus), while coronaviruses and influenza are totally absent. The DNA reads seem much more similar to Prussin, though I’m still surprised HCMV and Mastadenovirus are so rare.

Conclusion

This is the second, and by far the smaller, of the air-sampling datasets I’ve analyzed with this pipeline so far. Many of the high-level findings were similar to Prussin, including high relative abundance of human reads, low total viral reads, an absence of enteric viruses, and high abundance of papillomaviruses among human-infecting viruses. The biggest difference in the DNA data was the almost complete lack of cytomegalovirus, compared to its very high abundance in the Prussin data.

Given how small this dataset is (16M reads on a MiSeq, compared to ~450M for Prussin and 1.3B for the upcoming Leung et al. dataset), I don’t think too much weight should be placed on these findings, but given how rare air-MGS datasets are it’s nice to have another one.