Re-analysis of Crits-Christoph et al. (2021) using pipeline v2.4.0

Author

Will Bradshaw

Published

October 17, 2024

It’s been quite a while since I last made a public notebook post, as I’ve been focusing on pipeline development and analysis of our increasing stream of private datasets. Recently, pipeline development has accelerated with the addition of Harmon Bhasin to the NAO computational team. Harmon and I recently finished a tranche of issue fixes and new features, and I’m writing this notebook as a way to both demo those features and test them before merging the new pipeline version to the main branch.

My initial analysis of the Crits-Christoph dataset was quite some time ago now (1,2,3), back when I was still figuring out the basic functionality of the pipeline. We’ve come quite a long way since then, so this is a good opportunity to apply a more streamlined and well-tested analysis process to a very early dataset.

As a reminder, the Crits-Christoph data comprises 18 samples, all of which were 24-hour composite samples of raw sewage collected from wastewater interceptor lines in four locations in the SF Bay Area (Berkeley, Oakland, San Francisco, and Marin). Several different extraction protocols were used; for simplicity, I won’t include protocol as a variable in this analysis, since the number of samples is too small to draw strong conclusions about inter-protocol differences.

Seven of the samples (the “unenriched” fraction) underwent bacterial ribodepletion followed by untargeted metagenomic RNA-sequencing. The other 11 samples (the “enriched” fraction) underwent panel enrichment for respiratory viruses using the Illumina Respiratory Virus Oligo Panel, with no ribodepletion. In total, the unenriched samples comprise roughly 298M read pairs (mean 42.5M per sample) while the enriched samples comprise roughly 9M (mean 0.8M per sample).

1 Cost profiling

As with the previous datasets I’ve run through this version of the pipeline, I generated data on the computing resources consumed using nextflow log and AWS Cost Explorer.

Code
# Set input dir
data_dir <- "../data/2024-10-17_crits-christoph-2-4-0"
log_dir <- file.path(data_dir, "logging")

# Profiling parameters
tab_log <- tibble(
  dataset = "Crits-Christoph et al. 2021",
  read_pairs = 306407694,
  n_samples = 18
)

# Import trace
trace_path <- file.path(log_dir, "trace-2024-10-17_16-27-24.txt")
trace <- lapply(trace_path, read_tsv, show_col_types = FALSE) %>% bind_rows %>%
  mutate(realtime_s = parse_durations(realtime),
         cpu_hours = realtime_s * cpus / 3600,
         workflow = process %>% str_split(":") %>% sapply(dplyr::first),
         job = process %>% str_split(":") %>% sapply(dplyr::last),
         subworkflow = process %>% middle(":"),
         dataset = "Crits-Christoph et al. 2021"
         )
Code
# Parse trace data
trace_cpuh_summary <- trace %>% group_by(dataset, job) %>%
  summarize(cpu_hours = sum(cpu_hours), .groups = "drop_last") %>%
  mutate(p_cpu_hours = cpu_hours/sum(cpu_hours))
cpuh_total <- trace_cpuh_summary %>% summarize(cpu_hours = sum(cpu_hours))

#==============================================================================
# Aggregate data for plotting
#==============================================================================

# Group processes across subworkflows
jobs_major <- trace_cpuh_summary %>% group_by(job) %>%
  filter(p_cpu_hours == max(p_cpu_hours)) %>%
  arrange(desc(p_cpu_hours)) %>% ungroup %>%
  head(8) %>% pull(job)
trace_cpuh_display <- trace_cpuh_summary %>%
  mutate(minor = !(job %in% jobs_major),
         job_display = ifelse(minor, "Other", job)) %>%
  group_by(dataset, job_display, minor) %>%
  summarize(cpu_hours = sum(cpu_hours), .groups = "drop") %>%
  arrange(minor, desc(cpu_hours)) %>%
  mutate(job_display = fct_inorder(job_display))

# Break down log files into subworkflows, segregating BBDUK from the rest of the HV workflow
cat_levels <- c("Preprocessing", "Taxonomic profiling",
                "HV identification (BBDuk)", 
                "HV identification (other)", "Other")
trace_cpuh_categories <- trace %>%
  mutate(category = case_when(
    grepl("^RUN:RAW", process) ~ "Preprocessing",
    grepl("^RUN:CLEAN", process) ~ "Preprocessing",
    grepl("^RUN:PROFILE", process) ~ "Taxonomic profiling",
    grepl("^RUN:HV:BBDUK", process) ~ "HV identification (BBDuk)",
    grepl("^RUN:HV", process) ~ "HV identification (other)",
    TRUE ~ "Other"
  )) %>%
  group_by(dataset, category) %>% summarize(cpu_hours = sum(cpu_hours), .groups = "drop") %>%
  mutate(p_cpu_hours = cpu_hours / sum(cpu_hours),
         category = factor(category, levels = cat_levels))

In total, processing the Crits-Christoph data consumed roughly 99 CPU-hours. While I don’t have AWS costs for this specific run, at our usual cost per CPU-hour I estimate it cost roughly $5-8.
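For reference, here’s the underlying arithmetic; the per-CPU-hour rates below are assumed placeholder values bracketing our usual range, not measured AWS prices:

Code
# Back-of-the-envelope cost estimate; the rates are assumptions,
# not measured AWS prices
cost_per_cpu_hour <- c(low = 0.05, high = 0.08)
round(cpuh_total$cpu_hours * cost_per_cpu_hour, 2)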

Here’s the breakdown of CPU-hours by process and subworkflow:

Code
# Aggregate across datasets (only one here)
trace_cpuh_display_agg <- trace_cpuh_display %>% 
  group_by(job_display, minor) %>% 
  summarize(cpu_hours = sum(cpu_hours), .groups = "drop")

# Plot
g_log_pie <- ggplot(trace_cpuh_display_agg, 
                    aes(x=1, y=cpu_hours, fill=job_display)) +
  geom_col(width=1, position="fill") +
  coord_polar("y", start = 0) +
  scale_fill_brewer(palette="Set1", name="Process") +
  theme_void()
g_log_pie
Figure 1.1: Pie chart of CPU-hour composition of the v2.4.0 pipeline across processes.
Code
# Aggregate across datasets (only one here)
trace_cpuh_categories_agg <- trace_cpuh_categories %>% 
  group_by(category) %>% 
  summarize(cpu_hours = sum(cpu_hours), .groups = "drop")

# Plot
g_cat_pie <- ggplot(trace_cpuh_categories_agg,
                    aes(x=1, y=cpu_hours, fill=category)) +
  geom_col(width=1, position="fill") +
  coord_polar("y", start = 0) +
  scale_fill_brewer(palette="Set3", name="Process category") +
  theme_void()
g_cat_pie
Figure 1.2: Pie chart of CPU-hour composition of the v2.4.0 pipeline across process categories.

2 Raw data & preprocessing

2.1 Read counts

Code
# Define directory structure
output_dirs <- data_dir
input_dirs <- file.path(output_dirs, "input")
results_dirs <- file.path(output_dirs, "results")

# Get metadata paths
metadata_paths <- file.path(output_dirs, "metadata.csv")

# Import metadata
metadata_raw <- lapply(metadata_paths, read_csv, show_col_types = FALSE) %>%
  bind_rows

# Process metadata
metadata <- metadata_raw %>%
  rename(sample_alias = sample, sample=library, date=collection_date) %>%
  mutate(enrichment = str_to_title(enrichment)) %>%
  arrange(date) %>%
  mutate(date = fct_inorder(as.character(date)))

# Define grouping columns
grouping_cols <- c("date", "location", "enrichment")
grouping_cols_agg <- c("enrichment")
Code
# Data input paths
qc_dirs <- file.path(results_dirs, "qc")
hv_dirs <- file.path(results_dirs, "hv")
basic_stats_paths <- file.path(qc_dirs, "qc_basic_stats.tsv.gz")
adapter_stats_paths <- file.path(qc_dirs, "qc_adapter_stats.tsv.gz")
quality_base_stats_paths <- file.path(qc_dirs, "qc_quality_base_stats.tsv.gz")
quality_seq_stats_paths <- file.path(qc_dirs, "qc_quality_sequence_stats.tsv.gz")

# Import QC data
stages <- c("raw_concat", "cleaned")
basic_stats <- lapply(basic_stats_paths, read_tsv, show_col_types = FALSE) %>%
  bind_rows %>%
  inner_join(metadata, by="sample") %>% arrange(sample) %>%
  mutate(stage = factor(stage, levels = stages),
         sample = fct_inorder(sample))
adapter_stats <- lapply(adapter_stats_paths, read_tsv, show_col_types = FALSE) %>%
  bind_rows %>%
  inner_join(metadata, by="sample") %>% arrange(sample) %>%
  mutate(stage = factor(stage, levels = stages),
         read_pair = fct_inorder(as.character(read_pair)))
quality_base_stats <- lapply(quality_base_stats_paths, read_tsv, show_col_types = FALSE) %>%
  bind_rows %>%
  inner_join(metadata, by="sample") %>% arrange(sample) %>%
  mutate(stage = factor(stage, levels = stages),
         read_pair = fct_inorder(as.character(read_pair)))
quality_seq_stats <- lapply(quality_seq_stats_paths, read_tsv, show_col_types = FALSE) %>%
  bind_rows %>%
  inner_join(metadata, by="sample") %>% arrange(sample) %>%
  mutate(stage = factor(stage, levels = stages),
         read_pair = fct_inorder(as.character(read_pair)))

basic_stats_raw <- basic_stats %>% filter(stage == "raw_concat")
Code
get_read_counts <- function(tab) {
  summarize(tab, rmin = min(n_read_pairs), rmax=max(n_read_pairs),
            rmean=mean(n_read_pairs), 
            rtot = sum(n_read_pairs),
            btot = sum(n_bases_approx),
            .groups = "drop")
}

read_counts_total <- basic_stats_raw %>% ungroup %>% get_read_counts
read_counts_grp <- basic_stats_raw %>% group_by(across(all_of(grouping_cols))) %>%
  get_read_counts
read_counts_agg <- basic_stats_raw %>% 
  group_by(across(all_of(grouping_cols_agg))) %>%
  get_read_counts

The 18 libraries in the Crits-Christoph dataset comprised roughly 306M read pairs, corresponding to roughly 46 gigabases of sequence data. The great majority of this data (~97%) came from unenriched libraries.
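As a quick check, the unenriched share falls straight out of the aggregated read counts:

Code
# Fraction of total read pairs contributed by each enrichment fraction
read_counts_agg %>%
  mutate(p_reads = rtot / sum(rtot)) %>%
  select(all_of(grouping_cols_agg), rtot, p_reads)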

Code
# Prepare data
basic_stats_raw_metrics <- basic_stats_raw %>%
  mutate(n_read_pairs = replace_na(n_read_pairs, 0),
         n_bases_approx = replace_na(n_bases_approx, 0)) %>%
  select(sample, all_of(grouping_cols),
         `# Read pairs` = n_read_pairs,
         `Total base pairs\n(approx)` = n_bases_approx) %>%
  pivot_longer(-(sample:last(grouping_cols)), names_to = "metric", values_to = "value") %>%
  mutate(metric = fct_inorder(metric))

# Set up plot templates
g_basic <- ggplot(basic_stats_raw_metrics, 
                  aes(x=enrichment, y=value, fill=location, group=sample)) +
  geom_col(position = position_dodge2(preserve = "single")) +
  scale_x_discrete() +
  scale_y_continuous(expand=c(0,0)) +
  scale_fill_loc() +
  expand_limits(y=c(0,100)) +
  facet_grid(metric~., scales = "free", space="free_x", switch="y") +
  theme_tilt + theme(
    axis.title.y = element_blank(),
    strip.text.y = element_text(face="plain")
  )
g_basic
Figure 2.1: Read and base counts for different samples.

On average, read preprocessing with FASTP resulted in the loss of roughly 2.1% of reads in unenriched libraries and 6.8% in enriched libraries. A minority of enriched libraries showed even higher losses:

Code
# Count read losses
n_reads_rel <- basic_stats %>% 
  select(sample, all_of(grouping_cols), 
         stage, percent_duplicates, n_read_pairs) %>%
  group_by(sample, across(all_of(grouping_cols_agg))) %>% 
  arrange(sample, stage) %>%
  mutate(p_reads_retained = n_read_pairs / lag(n_read_pairs),
         p_reads_lost = 1 - p_reads_retained)
Code
# Plot relative read losses during preprocessing
g_reads_rel <- ggplot(n_reads_rel %>% filter(stage == "cleaned"), 
                      aes(x=enrichment, y=p_reads_lost, fill=location, group=sample)) +
  geom_col(position = position_dodge2(preserve = "single")) +
  scale_y_continuous("% Total Reads Lost", expand=c(0,0), 
                     labels = function(x) x*100,
                     breaks=seq(0,0.2,0.01)) +
  scale_fill_loc() +
  theme_tilt
g_reads_rel
Figure 2.2: Relative read losses during cleaning with FASTP.
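The mean losses quoted above can be recovered from the per-sample values (an unweighted mean across samples):

Code
# Mean preprocessing read loss by enrichment status
n_reads_rel %>% filter(stage == "cleaned") %>%
  group_by(enrichment) %>%
  summarize(mean_p_reads_lost = mean(p_reads_lost), .groups = "drop")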

2.2 Adapters

Adapter contamination in the raw reads was moderate, and was very effectively removed by preprocessing with FASTP:

Code
g_qual <- ggplot(mapping=aes(color=location, linetype=read_pair, 
                         group=interaction(sample,read_pair))) + 
  scale_color_loc() + scale_linetype_discrete(name = "Read Pair") +
  guides(color=guide_legend(nrow=2,byrow=TRUE),
         linetype = guide_legend(nrow=2,byrow=TRUE)) +
  theme_base

# Visualize adapters
g_adapters <- g_qual + 
  geom_line(aes(x=position, y=pc_adapters), data=adapter_stats) +
  scale_y_continuous(name="% Adapters", limits=c(0,NA),
                     breaks = seq(0,1000,10), expand=c(0,0)) +
  scale_x_continuous(name="Position along read", limits=c(0,NA),
                     breaks=seq(0,140,20), expand=c(0,0)) +
  facet_grid(stage~adapter)
g_adapters
Figure 2.3: FASTQC-measured adapter contamination in raw and FASTP-cleaned reads.

2.3 Quality

Qualities were high across samples, at both the per-base and per-read level:

Code
g_quality_base <- g_qual +
  geom_hline(yintercept=25, linetype="dashed", color="red") +
  geom_hline(yintercept=30, linetype="dashed", color="red") +
  geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats) +
  scale_y_continuous(name="Mean Phred score", expand=c(0,0), limits=c(10,45)) +
  scale_x_continuous(name="Position", limits=c(0,NA),
                     breaks=seq(0,140,20), expand=c(0,0)) +
  facet_grid(stage~.)
g_quality_base
Figure 2.4: FASTQC-measured per-base quality scores.
Code
g_quality_seq <- g_qual +
  geom_vline(xintercept=25, linetype="dashed", color="red") +
  geom_vline(xintercept=30, linetype="dashed", color="red") +
  geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats) +
  scale_x_continuous(name="Mean Phred score", expand=c(0,0)) +
  scale_y_continuous(name="# Sequences", expand=c(0,0)) +
  facet_grid(stage~., scales = "free_y")
g_quality_seq
Figure 2.5: FASTQC-measured per-read quality scores.

2.4 GC content

Unenriched samples show somewhat lower average GC content than enriched samples, with the former showing a mean GC content of 43.4% and the latter 47.9%:

Code
# Get raw GC content
gc_content <- basic_stats_raw %>%
  group_by(across(all_of(grouping_cols))) %>%
  summarize(mean_percent_gc = mean(percent_gc),
            .groups = "drop")

# Plot
g_gc_st <- ggplot(gc_content, aes(x=enrichment, y=mean_percent_gc)) +
  geom_boxplot() +
  geom_quasirandom(alpha=0.7, shape=16) +
  scale_y_continuous(name="Mean GC content (% bases)") +
  coord_flip() +
  guides(color=guide_legend(nrow=2)) +
  theme_base + theme(axis.title.y = element_blank())
g_gc_st
Figure 2.6: Distribution of GC content across samples.
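As a check on the quoted means (again an unweighted mean across samples):

Code
# Mean GC content by enrichment status
gc_content %>% group_by(enrichment) %>%
  summarize(mean_percent_gc = mean(mean_percent_gc), .groups = "drop")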

3 Taxonomic profiling

3.1 Domain-level classification

As usual for wastewater, all samples were dominated by bacterial and unassigned sequences:

Code
# Import Bracken data
bracken_paths <- file.path(results_dirs, "taxonomy/bracken_reports_merged.tsv.gz")
bracken_tab <- lapply(bracken_paths, read_tsv, show_col_types = FALSE) %>% bind_rows %>%
  inner_join(metadata, by="sample") %>% 
  mutate(ribosomal_label = ifelse(ribosomal, "Ribosomal", "Non-ribosomal"))

# Import Kraken data
kraken_paths <- file.path(results_dirs, "taxonomy/kraken_reports_merged.tsv.gz")
kraken_tab <- lapply(kraken_paths, read_tsv, show_col_types = FALSE) %>% bind_rows %>%
  inner_join(metadata, by="sample") %>% 
  mutate(ribosomal_label = ifelse(ribosomal, "Ribosomal", "Non-ribosomal"))

# Extract taxon reads from Bracken and unassigned from Kraken
class_tab <- bracken_tab %>%
  select(all_of(grouping_cols),
         n_reads=new_est_reads, name, ribosomal_label)
unclass_tab <- kraken_tab %>% filter(taxid == 0) %>%
  select(all_of(grouping_cols),
         n_reads=n_reads_clade, name, ribosomal_label) %>%
  mutate(name=str_to_title(name))
taxa_tab_raw <- bind_rows(class_tab, unclass_tab)
taxon_levels <- expand_grid(
  taxon = c("Unclassified", "Bacteria", "Archaea", "Eukaryota", "Viruses"),
  ribo  = c("Ribosomal", "Non-ribosomal") 
) %>% mutate(label = paste0(taxon, " (", ribo, ")")) %>% pull(label)
taxa_tab <- mutate(taxa_tab_raw, 
                   label = paste0(name, " (", ribosomal_label, ")"),
                   label = factor(label, levels = taxon_levels))
taxa_tab_display <- taxa_tab
Code
# Plot
g_bracken <- ggplot(mapping = aes(x=location, y=n_reads, fill=label)) +
  scale_y_continuous(name="% Reads", labels = function(y) y*100) +
  facet_grid(.~enrichment, scales = "free_x", space = "free_x",
             labeller = label_wrap_gen(width = 10)) +
  guides(fill = guide_legend(ncol=3)) +
  theme_tilt
g_bracken_1 <- g_bracken +
    geom_col(data = taxa_tab_display, position = "fill", width=1) +
    scale_fill_brewer(palette="Set3", name="Taxon") +
    ggtitle("Taxonomic composition (all reads)")
g_bracken_1
Figure 3.1: Domain-level composition of samples, as measured with Kraken2 on 1M reads per sample.

3.2 Ribosomal content

The above plot makes it clear that, consistent with previous analysis of this dataset, the unenriched samples (which underwent ribodepletion) show much lower ribosomal content than the enriched samples (which did not), despite the latter having undergone viral panel enrichment. Quantitatively, enriched samples contained 63.4% ribosomal sequences, while unenriched samples contained only 5.5%.

Code
p_ribo_sample <- taxa_tab %>%
  group_by(across(all_of(grouping_cols)), ribosomal_label) %>%
  summarize(n_reads = sum(n_reads), .groups = "drop_last") %>%
  mutate(p_reads = n_reads/sum(n_reads)) %>%
  filter(ribosomal_label == "Ribosomal")
p_ribo_agg <- taxa_tab %>%
  group_by(across(all_of(grouping_cols_agg)), ribosomal_label) %>%
  summarize(n_reads = sum(n_reads), .groups = "drop_last") %>%
  mutate(p_reads = n_reads/sum(n_reads)) %>%
  filter(ribosomal_label == "Ribosomal")
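The aggregate fractions quoted above come from p_ribo_agg:

Code
# Aggregate ribosomal read fractions by enrichment status
p_ribo_agg %>% select(all_of(grouping_cols_agg), n_reads, p_reads)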
Code
g_ribo_box <- ggplot(p_ribo_sample, aes(x=enrichment, y=p_reads)) +
  geom_boxplot() +
  geom_quasirandom(alpha=0.7, shape=16, mapping=aes(color=location)) +
  coord_flip() +
  scale_color_loc() +
  scale_y_continuous(name="Ribosomal content (% reads)",
                     breaks=seq(0,1,0.2),
                     labels=function(y)y*100) +
  guides(color=guide_legend(nrow=2)) +
  theme_base + theme(axis.title.y = element_blank())
g_ribo_box
Figure 3.2: Boxplots of ribosomal content by enrichment status.

3.3 Total viral content

Code
# Viral content across all reads
p_reads_viral_all <- taxa_tab %>% mutate(viral = name == "Viruses") %>%
  group_by(across(all_of(grouping_cols)), viral) %>%
  summarize(n_reads = sum(n_reads), .groups = "drop_last") %>% 
  mutate(p_reads = n_reads/sum(n_reads)) %>%
  filter(viral)

# Aggregate
p_reads_viral_all_agg <- taxa_tab %>% mutate(viral = name == "Viruses") %>%
  group_by(across(all_of(grouping_cols_agg)), viral) %>%
  summarize(n_reads = sum(n_reads), .groups = "drop_last") %>% 
  mutate(p_reads = n_reads/sum(n_reads)) %>%
  filter(viral)

Total viral content in the Kraken data was 4.0% in the enriched samples and 0.77% in the unenriched samples, demonstrating the efficacy of panel enrichment:
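Printing the aggregated table as a sanity check:

Code
# Aggregate viral read fractions by enrichment status
p_reads_viral_all_agg %>% select(all_of(grouping_cols_agg), n_reads, p_reads)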

Code
g_viral_box <- ggplot(p_reads_viral_all, aes(x=enrichment, y=p_reads)) +
  geom_boxplot() +
  geom_quasirandom(alpha=0.7, shape=16, mapping=aes(color=location)) +
  coord_flip() +
  scale_color_loc() +
  scale_y_log10(name="Viral read fraction") +
  guides(color=guide_legend(nrow=2)) +
  theme_base + theme(axis.title.y = element_blank())
g_viral_box
Figure 3.3: Fraction of viral reads across samples, as measured with Kraken2 on 1M reads per sample.

3.4 Taxonomic composition of viruses

Code
# Set up base plot
g_comp_base <- ggplot(mapping = aes(x=location, y=p_reads, fill=label)) +
  facet_grid(.~enrichment, scales = "free_x", space = "free_x", labeller = label_wrap_gen(width = 10)) +
  guides(fill = guide_legend(ncol=3)) +
  theme_tilt

# Specify palette
palette_viral <- c(brewer.pal(12, "Set3"), brewer.pal(8, "Dark2"))
scale_fill_viral <- purrr::partial(scale_fill_manual, values = palette_viral)

# Set up composition scale
scale_y_composition <- purrr::partial(scale_y_continuous, limits = c(0,1.01),
                                      breaks = seq(0,1,0.2), expand = c(0,0),
                                      labels = function(y) y*100)

# Set up geom
geom_composition <- purrr::partial(geom_col, position = "fill", width = 1)

# Specify major family threshold
major_threshold <- 0.01
Code
# Get viral taxonomy
viral_taxa_path <- file.path(data_dir, "total-virus-db.tsv.gz")
viral_taxa <- read_tsv(viral_taxa_path, show_col_types = FALSE)

# Prepare viral Kraken tab
kraken_tab_viral_raw <- filter(kraken_tab, taxid %in% viral_taxa$taxid)
kraken_tab_viral_sum_noribo <- kraken_tab_viral_raw %>%
  group_by(taxid, name, rank, across(all_of(grouping_cols))) %>%
  # Compute ribosomal counts before excluding ribosomal reads, so the
  # ribosomal columns are meaningful rather than always zero
  summarize(n_reads_clade_ribosomal = sum(n_reads_clade[ribosomal]),
            n_reads_clade = sum(n_reads_clade[!ribosomal]),
            n_reads_direct = sum(n_reads_direct[!ribosomal]),
            n_minimizers_total = sum(n_minimizers_total[!ribosomal]),
            n_minimizers_distinct = sum(n_minimizers_distinct[!ribosomal]),
            .groups = "drop") %>%
  mutate(p_reads_clade_ribosomal =
           n_reads_clade_ribosomal/(n_reads_clade + n_reads_clade_ribosomal))
kraken_tab_viral_total_noribo <- kraken_tab_viral_sum_noribo %>%
  filter(taxid == 10239) %>%
  select(all_of(grouping_cols), n_reads_viral = n_reads_clade)
kraken_tab_viral_noribo <- kraken_tab_viral_sum_noribo %>%
  inner_join(kraken_tab_viral_total_noribo, by = grouping_cols) %>%
  mutate(p_reads_viral = n_reads_clade/n_reads_viral)
kraken_tab_viral_cleaned_noribo <- kraken_tab_viral_noribo %>%
  select(name, taxid, rank, all_of(grouping_cols), 
         n_reads_clade, n_reads_viral, p_reads_viral, p_reads_clade_ribosomal)

# Subset to specific taxonomic ranks
viral_classes_noribo <- kraken_tab_viral_cleaned_noribo %>% filter(rank == "C")
viral_families_noribo <- kraken_tab_viral_cleaned_noribo %>% filter(rank == "F")

As is normal for wastewater samples, the unenriched samples are dominated by Alsuviricetes (a class of eukaryotic viruses that includes Virgaviridae), Leviviricetes (a class of RNA phages), and Caudoviricetes (tailed DNA phages). Alsuviricetes and Caudoviricetes were also prominent in enriched samples; in addition, however, two other virus classes were prominent there: Pisoniviricetes (which includes Picornaviridae and Coronaviridae, among other animal viruses) and Quintoviricetes (parvoviruses). Both of these additional classes contain prominent human pathogens targeted by the Illumina Respiratory Virus Oligo Panel.

Code
# Identify major viral classes and collapse others
viral_classes_major_noribo <- viral_classes_noribo %>%
  group_by(name, taxid) %>%
  filter(max(p_reads_viral) >= major_threshold) %>% ungroup
viral_classes_minor_noribo <- viral_classes_major_noribo %>%
  group_by(across(all_of(grouping_cols)), n_reads_viral) %>%
  summarize(n_reads_clade = n_reads_viral[1] - sum(n_reads_clade),
            p_reads_viral = 1 - sum(p_reads_viral), .groups = "drop") %>%
  mutate(name = "Other", taxid = NA, rank = "C")
viral_classes_levels_noribo <- viral_classes_major_noribo %>% pull(name) %>% sort %>% 
  unique %>% append("Other")
viral_classes_out_noribo <- bind_rows(viral_classes_major_noribo, viral_classes_minor_noribo) %>%
  mutate(name = factor(name, levels = viral_classes_levels_noribo))

# Prepare data for plotting
viral_classes_display_noribo <- viral_classes_out_noribo %>%
  rename(p_reads = p_reads_viral, label = name)

# Plot
g_classes_noribo <- g_comp_base + 
  geom_composition(data=viral_classes_display_noribo) +
  ggtitle("Viral class composition (non-ribosomal reads only)") +
  scale_y_composition(name="% Viral Reads") +
  scale_fill_viral(name="Viral Class")
g_classes_noribo
Figure 3.4: Class composition of Kraken2-identified viral reads across samples, excluding putative ribosomal reads.

As usual, Alsuviricetes is overwhelmingly dominated by Virgaviridae, with a small minority of other plant viruses:

Code
# Get reads for major eukaryotic viral classes
euk_classes <- c("Alsuviricetes", "Pisoniviricetes", "Quintoviricetes")
euk_taxids <- sapply(euk_classes, function(x) viral_taxa %>% filter(name == x) %>% pull(taxid))
euk_taxid_desc <- sapply(euk_taxids, add_descendent_taxids, reference=viral_taxa)
euk_families <- lapply(euk_classes, function(x) viral_families_noribo %>% 
                         filter(taxid %in% euk_taxid_desc[[x]]) %>% mutate(class=x)) %>%
  bind_rows %>% group_by(across(all_of(grouping_cols)), class) %>%
  mutate(n_reads_class = sum(n_reads_clade),
         p_reads_class = n_reads_clade/n_reads_class)
euk_families_display <- lapply(euk_classes, function(x)
  euk_families %>% filter(class==x) %>% rename(p_reads = p_reads_class, label = name)) %>%
  setNames(euk_classes)

# Plot
g_euk_families <- lapply(euk_classes, function(x)
  g_comp_base + geom_composition(data=euk_families_display[[x]]) +
    scale_y_composition(name=paste0("% ", x, " Reads")) +
    scale_fill_viral(name="Viral Family")) %>% setNames(euk_classes)
g_euk_families[["Alsuviricetes"]]
Figure 3.5: Family composition of Kraken2-identified Alsuviricetes reads, excluding putative ribosomal reads.

Pisoniviricetes reads in enriched samples, meanwhile, are overwhelmingly made up of coronaviruses:

Code
g_euk_families[["Pisoniviricetes"]]
Figure 3.6: Family composition of Kraken2-identified Pisoniviricetes reads across samples, excluding putative ribosomal reads.

4 Human-infecting virus reads

Code
# Import and format reads
hv_reads_paths <- file.path(hv_dirs, "hv_hits_putative_collapsed.tsv.gz")
mrg_hv <- lapply(hv_reads_paths, read_tsv, show_col_types = FALSE,
                   col_types = cols(taxid_all = "c", assigned_taxid_all = "c")) %>%
  bind_rows %>%
  inner_join(metadata, by="sample") %>%
  mutate(kraken_label = ifelse(assigned_hv, "Kraken2 HV assignment",
                               "No Kraken2 assignment"),
         adj_score_max = pmax(adj_score_fwd, adj_score_rev),
         highscore = adj_score_max >= 20)

# Raise to the class level and exclude phages
phage_taxids <- c("2731619", "2732413", "2732411") # Caudoviricetes, Malgrandaviricetes, Faserviricetes
mrg_hv_named <- mrg_hv %>% left_join(viral_taxa, by="taxid") 
mrg_hv_class <- raise_rank(mrg_hv_named, viral_taxa, "class")
mrg_hv_class_filtered <- filter(mrg_hv_class, !taxid %in% phage_taxids)
mrg_hv_filtered <- mrg_hv %>% filter(seq_id %in% mrg_hv_class_filtered$seq_id)


4.1 Overall relative abundance

Next, I calculated the number of human-infecting virus (HV) reads as a fraction of total raw reads:

Code
# Get read counts and fractions
read_counts_raw <- select(basic_stats_raw, sample, n_reads_raw = n_read_pairs)
read_counts_hv <- count(mrg_hv_filtered, sample, name="n_reads_hv")
read_counts <- left_join(read_counts_raw, read_counts_hv, by="sample") %>%
  mutate(n_reads_hv = replace_na(n_reads_hv, 0)) %>%
  inner_join(metadata, by="sample") %>%
  select(all_of(grouping_cols), n_reads_raw, n_reads_hv) %>%
  group_by(across(all_of(grouping_cols))) %>%
  summarize(n_reads_raw = sum(n_reads_raw),
            n_reads_hv = sum(n_reads_hv), .groups = "drop") %>%
  mutate(n_samples = 1, p_reads_total = n_reads_hv/n_reads_raw)

# Aggregate read counts
read_counts_agg <- read_counts %>%
  group_by(across(all_of(grouping_cols_agg))) %>%
  summarize(n_reads_raw = sum(n_reads_raw),
            n_reads_hv = sum(n_reads_hv), 
            n_samples = sum(n_samples), .groups="drop") %>%
  mutate(p_reads_total = n_reads_hv/n_reads_raw)
Code
g_hv_box <- ggplot(read_counts, aes(x=enrichment, y=p_reads_total)) +
  geom_boxplot() +
  geom_quasirandom(alpha=0.7, shape=16, mapping=aes(color=location)) +
  coord_flip() +
  scale_color_loc() +
  scale_y_log10(name="Unique human-viral read fraction") +
  guides(color=guide_legend(nrow=2)) +
  theme_base + theme(axis.title.y = element_blank())
g_hv_box
Figure 4.1: Unique human-infecting virus reads as a fraction of all raw input reads, for enriched and unenriched samples.

The effect of panel enrichment on HV read fraction was much stronger than its effect on total viral read fraction, with overall relative abundance increasing from 1.23e-05 in unenriched samples to 1.14e-02 in enriched samples, an increase of almost 1000-fold.
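This fold change can be computed directly from the aggregated counts (the level names "Enriched" and "Unenriched" are assumed here, following the str_to_title() call in the metadata processing above):

Code
# Fold-change in HV relative abundance from panel enrichment
# (enrichment level names assumed from str_to_title() above)
read_counts_agg %>%
  summarize(fold_change = p_reads_total[enrichment == "Enriched"] /
              p_reads_total[enrichment == "Unenriched"])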

4.2 Overall taxonomy and composition

HV reads in enriched samples were dominated by coronaviruses and parvoviruses, which makes sense given the composition of the panel and what we saw for total-virus composition. The HV composition of unenriched samples, meanwhile, was quite unusual for wastewater: while some standard enteric virus groups (e.g. Sedoreoviridae, Picornaviridae) were still abundant, a large fraction of HV reads mapped to Polyomaviridae and Papillomaviridae, two families that seldom show a large presence in wastewater data.

Code
# Filter samples and add viral taxa information
mrg_hv_filtered_named <- mrg_hv_filtered %>% left_join(viral_taxa, by="taxid") 

# Discover viral species & genera for HV reads
hv_reads_genus <- raise_rank(mrg_hv_filtered_named, viral_taxa, "genus")
hv_reads_family <- raise_rank(mrg_hv_filtered_named, viral_taxa, "family")
Code
threshold_major_family <- 0.01

# Count reads for each human-viral family
hv_family_counts <- hv_reads_family %>% 
  group_by(name, taxid, across(all_of(grouping_cols))) %>%
  count(name = "n_reads_hv") %>%
  group_by(across(all_of(grouping_cols))) %>%
  mutate(p_reads_hv = n_reads_hv/sum(n_reads_hv))
hv_family_counts_collapsed <- hv_family_counts %>% 
  mutate(minor = p_reads_hv < threshold_major_family,
         name_display = ifelse(minor, "Other", name),
         taxid_display = ifelse(minor, NA, taxid)) %>%
  group_by(across(all_of(grouping_cols)), name_display, taxid_display) %>%
  summarize(n_reads_hv = sum(n_reads_hv), p_reads_hv = sum(p_reads_hv), .groups = "drop")
hv_family_levels <- hv_family_counts_collapsed %>% 
  group_by(is.na(taxid_display), name_display) %>% summarize(.groups = "drop") %>%
  pull(name_display)
hv_family_counts_display <- hv_family_counts_collapsed %>%
  rename(p_reads = p_reads_hv, label = name_display) %>%
  mutate(label = factor(label, levels = hv_family_levels))

# Get most prominent families for text
hv_family_counts_collate <- hv_family_counts %>%
  group_by(name, taxid, across(all_of(grouping_cols_agg))) %>%
  summarize(n_reads_tot = sum(n_reads_hv),
            p_reads_max = max(p_reads_hv), .groups = "drop") %>%
  arrange(desc(n_reads_tot))

# Plot
g_hv_family <- g_comp_base +
  geom_composition(data=hv_family_counts_display) +
  ggtitle("Family composition of human-viral reads") +
  scale_y_composition(name="% HV Reads") +
  scale_fill_viral(name="Viral family")
g_hv_family
Figure 4.2: Family composition of putative human-viral reads across samples.

Given my other research goals for this post, I’m not going to dig into this in detail here. But it might be worth a more detailed look in the future.

4.3 Length and complexity metrics

Some of the most important changes in this version of the pipeline are new measurements of fragment length and duplication levels in the HV read subset. In this section, I’ll investigate these new metrics. This analysis largely follows Harmon’s investigation of the same data here.

4.3.1 Fragment lengths

We measured fragment length in two ways: first, as the alignment length to the HV reference calculated by Bowtie2; and second, as the length of the FASTA sequence produced by merging each read pair with BBMerge. The latter method is only applicable to read pairs with enough overlap to be successfully merged; fragments too long for this to occur cannot be measured this way. For the Crits-Christoph data, about 14% of HV sequences were successfully merged by BBMerge and thus have both length metrics available.
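The merge rate is easy to verify directly:

Code
# Fraction of HV read pairs successfully merged by BBMerge
mean(mrg_hv_filtered$bbmerge_merge_status == 1)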

Code
# Prepare subdatasets
frag_bb <- mrg_hv_filtered %>%
  filter(bbmerge_merge_status == 1) %>%
  transmute(length = bbmerge_frag_length, group="BBMerge")
frag_bt_merged <- mrg_hv_filtered %>%
  filter(bbmerge_merge_status == 1) %>%
  transmute(length = bowtie2_frag_length, group="Bowtie2 (merged only)")
frag_bt_unmerged <- mrg_hv_filtered %>%
  filter(bbmerge_merge_status == 0) %>%
  transmute(length = bowtie2_frag_length, group="Bowtie2 (unmerged only)")
frag_bt_all <- mrg_hv_filtered %>%
  transmute(length = bowtie2_frag_length, group="Bowtie2 (all)")
tab_frag <- bind_rows(frag_bb, frag_bt_merged, frag_bt_unmerged, frag_bt_all)

# Plot
g_frag_density <- ggplot(tab_frag, aes(x=length, color=group)) +
  geom_density() +
  scale_color_brewer(palette="Set1", name="Method") +
  labs(y="Density", x="Fragment length") +
  theme_base
g_frag_density
Figure 4.3: Kernel density plot of fragment lengths in Crits-Christoph data, measured using different methods.
Code
g_frag_cor <- mrg_hv_filtered %>%
  filter(bbmerge_merge_status == 1) %>%
  ggplot(aes(x=bowtie2_frag_length, y=bbmerge_frag_length)) +
  geom_abline(color="red") +
  geom_point(shape=16, alpha=0.3) +
  stat_cor(method = "spearman", label.x.npc = 0.8) +
  labs(x="Fragment length (Bowtie2)", y="Fragment length (BBMerge)") +
  theme_base
g_frag_cor
Figure 4.4: Scatterplot showing Bowtie2- and BBMerge-measured fragment lengths for the subset of reads that merged successfully with BBMerge. Red line indicates x=y; R gives the Spearman’s rank correlation between the two variables.
  • For sequences that merged successfully, the Bowtie2 and BBMerge length metrics are very similar (Spearman’s rho > 0.8), indicating a high level of agreement between the two approaches.

  • Unsurprisingly, sequences that did not successfully merge with BBMerge show much longer Bowtie2-measured fragment lengths than those that did, consistent with the former arising from fragments that were too long to achieve sufficient overlap between the forward and reverse reads.

  • Across all HV reads, Bowtie2-measured fragment length (post-preprocessing) ranged from 36 to 549 nt, with a mean of 214.5 nt (see the quick checks after this list).
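Quick numerical checks for the correlation and length figures above:

Code
# Spearman correlation between the two length metrics (merged pairs only)
mrg_hv_filtered %>% filter(bbmerge_merge_status == 1) %>%
  summarize(rho = cor(bowtie2_frag_length, bbmerge_frag_length,
                      method = "spearman"))

# Range and mean of Bowtie2-measured fragment lengths across all HV reads
summary(mrg_hv_filtered$bowtie2_frag_length)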

4.3.2 Duplication

As with fragment length, we identified duplicates in the data via two different methods:

  1. Duplicates were identified using Bowtie2 by looking for read pairs that aligned to the same reference genome with the same start and end coordinates (+/- 1 nucleotide position).
  2. Duplicates were identified using Clumpify (from the BBTools suite) by looking for read pairs with near-identical sequences, with a small amount of tolerance for sequencing errors.

Importantly, the Bowtie2 duplicate analysis identified duplicate groups but did not discard the identified duplicates, while the Clumpify analysis both identified duplicate groups and performed deduplication. That deduplication also discarded some information about the output of the Bowtie2-based method, which somewhat limits our ability to analyze and compare duplication patterns across the two methods. This situation isn’t ideal, and is something we want to address in future pipeline versions.
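To make the coordinate-based method concrete, here is a minimal sketch of Bowtie2-style duplicate grouping on a toy alignment table. This is illustrative only, not the pipeline’s implementation: the column names are hypothetical, and for simplicity it matches coordinates exactly rather than with the +/- 1 nt tolerance described above.

Code
library(dplyr)

# Toy alignment table: one row per read pair (hypothetical columns)
alignments <- tibble(
  seq_id     = c("r1", "r2", "r3", "r4"),
  genome_id  = c("gA", "gA", "gA", "gB"),
  frag_start = c(100, 100, 250, 100),
  frag_end   = c(300, 300, 480, 300)
)

# Collapse read pairs sharing a reference genome and alignment coordinates
# into duplicate groups, recording group size and an exemplar read per group
alignments %>%
  group_by(genome_id, frag_start, frag_end) %>%
  mutate(dupcount = n(), exemplar = first(seq_id)) %>%
  ungroup()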

Proceeding despite these limitations, we find that the two methods produce very similar distributions of duplication behavior:

Code
# Calculate dupcount distribution
tab_dupcount_clumpify <- mrg_hv_filtered %>% count(clumpify_dupcount) %>%
  rename(dupcount = clumpify_dupcount) %>% mutate(method="Clumpify")
tab_dupcount_bowtie2  <- mrg_hv_filtered %>% group_by(bowtie2_exemplar) %>%
  slice_head(n=1) %>% ungroup %>% count(bowtie2_dupcount) %>%
  rename(dupcount = bowtie2_dupcount) %>% mutate(method="Bowtie2")
tab_dupcount <- bind_rows(tab_dupcount_clumpify, tab_dupcount_bowtie2) %>%
  group_by(method) %>% mutate(p=n/sum(n)) %>%
  mutate(reads_total = n * dupcount)

# Plot
g_dupcount_density <- ggplot(tab_dupcount, 
                             aes(x=dupcount, y=n, color=method)) +
  geom_point(shape=16, alpha=0.7) +
  scale_color_brewer(palette="Dark2", name="Method") +
  scale_y_log10() +
  labs(y="# Duplicate groups", x="Reads in duplicate group") +
  theme_base
g_dupcount_density
Figure 4.5: Distribution of duplicate group sizes in the HV data, as measured by both Clumpify and Bowtie2.

In both cases, the great majority of sequences (>92%) are only represented by a single read, and the great majority of the remainder (>80%) are represented by two or three reads. This broad pattern continues into progressively higher duplicate counts. The overall fraction of duplicate reads (i.e. reads that aren’t the first in their duplicate group) was around 9% under both methods.
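The ~9% figure can be recovered from the duplicate-group size distribution:

Code
# Overall fraction of duplicate reads (reads beyond the first in each group)
tab_dupcount %>% group_by(method) %>%
  summarize(p_duplicate_reads = sum((dupcount - 1) * n) / sum(dupcount * n),
            .groups = "drop")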

There is a moderate correlation between the two methods (r = 0.47) in terms of the size of each duplicate group:

Code
tab_dupcount_cor <- mrg_hv_filtered %>% group_by(bowtie2_exemplar) %>%
  slice_head(n=1)

g_dupcount_cor <- ggplot(tab_dupcount_cor,
                         aes(x=clumpify_dupcount, y=bowtie2_dupcount)) +
  geom_abline(color="red") +
  geom_point(shape=16, alpha=0.3) +
  stat_cor(method = "pearson", label.x.npc = 0.8) +
  scale_y_continuous(breaks = seq(0,40,4)) +
  labs(x="Duplicate group size (Clumpify)", y="Duplicate group size (Bowtie2)") +
  theme_base
g_dupcount_cor
Figure 4.6: Scatterplot showing Bowtie2- and Clumpify-measured duplicate counts for each duplicate group. Red line indicates x=y; R gives the Pearson correlation between the two variables.