Developing a qPCR data analysis workflow

qPCR R

In-progress qPCR data analysis workflow in R, using data from the 2022-Q2 Sprint spike-in experiment.

Mike
2022-11-05

Setup

The data is from a folder in the NAO Drive currently called ‘Spike-in experiments’, which I’ve downloaded locally.

data_path <- here( '_data/nao/qpcr', '2022-06-29-spike-in-experiment/results')
dir_ls(data_path) %>% path_file
 [1] "!README.docx"                          
 [2] "2022-06-29-trip01.eds"                 
 [3] "2022-06-29-trip01.txt"                 
 [4] "2022-06-29-trip01.xlsx"                
 [5] "2022-06-29-trip01_2.eds"               
 [6] "2022-06-29-trip02.eds"                 
 [7] "2022-06-29-trip02.xlsx"                
 [8] "2022-06-30-trip02_009_seq_barcode.xlsx"
 [9] "2022-06-30_trip03.xlsx"                
[10] "Results.ipynb"                         
[11] "plate_layout_trip1.txt"                
[12] "plate_layout_trip2.txt"                
[13] "results.md5"                           
[14] "trip1-template.edt"                    

TODO: try loading the data directly from Google Drive

Load qPCR data

This file assumes that the data is within a folder ‘data/’ within the root project folder.

File info: - Excel files have the raw and processed florescence measurements (Rn and Delta Rn), as well as the softwares autothreshold stuff in another sheet. - Also some sample metadata is here; however, we might want want to take that from the .txt file, since that is (I believe) closer to the original supplied table.

TODO: Talk to Ari about whether we have/can save the files used to set up the qPCR experiment

First, we can read in the relevant sections of the Excel file and clean up thedata a bit,

First, load in the sample metadata.

sam <- path(data_path, '2022-06-29-trip01.txt') %>%
  read_tsv(skip = 43) %>%
  janitor::clean_names() %>%
  mutate(
    row = str_sub(well_position, 1, 1) %>% as.ordered,
    column = str_sub(well_position, 2) %>% as.integer %>% as.ordered,
  ) %>%
  # relocate(row, column, .after = well_position) %>%
  glimpse
Rows: 96
Columns: 15
$ well           <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
$ well_position  <chr> "A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8…
$ sample_name    <chr> "Blank", "NTC", "Trip1_010_Neg_Ctrl", "Trip1_…
$ sample_color   <chr> "RGB(0,139,69)", "RGB(142,56,142)", "RGB(139,…
$ biogroup_name  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ biogroup_color <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ target_name    <chr> "Blank", "Trip1_010_0", "Trip1_010_Neg_Ctrl",…
$ target_color   <chr> "RGB(0,139,69)", "RGB(142,142,56)", "RGB(238,…
$ task           <chr> "UNKNOWN", "UNKNOWN", "UNKNOWN", "UNKNOWN", "…
$ reporter       <chr> "FAM", "FAM", "FAM", "FAM", "FAM", "FAM", "FA…
$ quencher       <chr> "NFQ-MGB", "NFQ-MGB", "NFQ-MGB", "NFQ-MGB", "…
$ quantity       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ comments       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ row            <ord> A, A, A, A, A, A, A, A, A, A, A, A, B, B, B, …
$ column         <ord> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, …

There is additional sample data hidden in the sample names, which we’ll need to fix in future runs (it should be in its own columns in a table). But for now we can parse it from the sample names. I’ll overwrite the faulty target names.

sam <- sam %>%
  separate(sample_name, 
    into = c('ww_triplicate', 'target_name', 'dilution_name'),
    sep = '_', extra = 'merge',
    remove = FALSE
  )

NOTE: The target name of the NTC is now NA, which is incorrect, but I’m not going to worry about that now. The proper fix is for the target name to be fixed in the source data.

NOTE: We need to use the actual starting concentrations for the standard curve. Need to talk to Anjali about how these should be supplied. Here I will assume the following.

# concentration in copies per microliter
conc_low <- 0.1
dilution_step <- 20
num_samples <- 7
conc_max <- conc_low * dilution_step^(num_samples - 1)

dilution_df <- tibble(
  dilution_power = 0:6,
  dilution_factor = dilution_step^dilution_power,
  conc = conc_max / dilution_factor,
  dilution_name = str_c('D', '0', dilution_power + 1)
)

which we must add to the sample data,

sam <- sam %>%
  left_join(dilution_df, by = 'dilution_name')

Next, load the amplification data — the relative florescence values (Rn and Delta Rn) — and join the sample metadata.

amp <- path(data_path, '2022-06-29-trip01.xlsx') %>%
  readxl::read_excel(
    sheet = 'Amplification Data',
    skip = 40,
    col_types = c('numeric', 'text', 'numeric', 'text', 'numeric', 'numeric')
  ) %>%
  janitor::clean_names() %>%
  glimpse
Rows: 3,840
Columns: 6
$ well          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ well_position <chr> "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A1"…
$ cycle         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
$ target_name   <chr> "Blank", "Blank", "Blank", "Blank", "Blank", "…
$ rn            <dbl> 5.125966, 5.172779, 5.221941, 5.282046, 5.3525…
$ delta_rn      <dbl> -0.228727385, -0.190693140, -0.150309995, -0.0…

Note, this table has the original, incorrect target name. I’ll drop that, and join the sample metadata table which has the corrected target names.

amp <- amp %>%
  select(-target_name) %>%
  left_join(sam, by = c('well', 'well_position')) %>%
  glimpse
Rows: 3,840
Columns: 23
$ well            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ well_position   <chr> "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A…
$ cycle           <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
$ rn              <dbl> 5.125966, 5.172779, 5.221941, 5.282046, 5.35…
$ delta_rn        <dbl> -0.228727385, -0.190693140, -0.150309995, -0…
$ sample_name     <chr> "Blank", "Blank", "Blank", "Blank", "Blank",…
$ ww_triplicate   <chr> "Blank", "Blank", "Blank", "Blank", "Blank",…
$ target_name     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ dilution_name   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ sample_color    <chr> "RGB(0,139,69)", "RGB(0,139,69)", "RGB(0,139…
$ biogroup_name   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ biogroup_color  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ target_color    <chr> "RGB(0,139,69)", "RGB(0,139,69)", "RGB(0,139…
$ task            <chr> "UNKNOWN", "UNKNOWN", "UNKNOWN", "UNKNOWN", …
$ reporter        <chr> "FAM", "FAM", "FAM", "FAM", "FAM", "FAM", "F…
$ quencher        <chr> "NFQ-MGB", "NFQ-MGB", "NFQ-MGB", "NFQ-MGB", …
$ quantity        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ comments        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ row             <ord> A, A, A, A, A, A, A, A, A, A, A, A, A, A, A,…
$ column          <ord> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ dilution_power  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ dilution_factor <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ conc            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

Explore sample metadata

sam %>%
  count(target_name)
# A tibble: 3 × 2
  target_name     n
  <chr>       <int>
1 009            24
2 010            24
3 <NA>           48

Plot the plate layout

TODO: use code from vivo vitro to do this nicely; need to first parse the row and column from the well position

TODO: see if can flow the sample names so that they print nicer

QC checks

We’ll want to do the analysis separately for each target.

Maybe there’s value in first looking at everything

TODO: Set a fixed color scheme for the targets, and use this in all plots.

delta_rn_min <- 1e-3
p1 <- amp %>%
  filter(!is.na(target_name)) %>%
  ggplot(aes(cycle, pmax(delta_rn, delta_rn_min), color = target_name)) +
  geom_line(aes(group = well)) +
  scale_color_brewer(type = 'qual') +
  geom_point() +
  labs(y = 'Delta Rn', x = 'Cycle', color = 'Target')
p2 <- p1 +
  scale_y_log10()
p1 / p2

TODO: Consider if a better way to deal with non-positive values in the log-scale plot. See what the software does. (I think it might just not show these points.)

Analysis of a single target

Will ultimately do this for all targets; perhaps show in tabs? For now, I’ll use target 009.

amp_cur <- amp %>%
  filter(target_name == '009')
delta_rn_min <- 1e-3
p1 <- amp_cur %>%
  ggplot(aes(cycle, pmax(delta_rn, delta_rn_min), color = target_name)) +
  geom_line(aes(group = well)) +
  scale_color_brewer(type = 'qual') +
  geom_point() +
  labs(y = 'Delta Rn', x = 'Cycle', color = 'Target')
p2 <- p1 +
  scale_y_log10()
p1 / p2

Pick threshold

For now, will pick the threshold manually.

threshold <- 3e-1
p1 / p2 &
  geom_hline(yintercept = threshold)

TODO: implement chosen auto-threshold algorithm.

Compute Cq values

I’ll define a custom function estimate_cq to estimate the Cq value for a trajectotry crossing a given quantification threshold. See the appendix below for more info.

estimate_cq <- function(.data, threshold) {
  .data <- .data %>%
    arrange(cycle) %>%
    mutate(
      # NOTE: log transformation, important for interpolation
      across(delta_rn, log),
      below = delta_rn < threshold,
      above = delta_rn > threshold,
    )
  before <- .data %>%
    filter(below) %>%
    slice_tail(n = 1)
  after <- .data %>%
    filter(above) %>%
    slice_head(n = 1)
  # And find the intersection of the line passing between these two points, and the threshold.
  slope <- (after$delta_rn - before$delta_rn) / (after$cycle - before$cycle)
  delta_rn_diff <- threshold - before$delta_rn
  cycle_diff <- delta_rn_diff / slope
  ct <- before$cycle + cycle_diff
  ct
}

Now use this function to compute the Cq values for each trajectory

sample_vars <-  sam %>% colnames
cqs <- amp_cur %>%
  with_groups(all_of(sample_vars), nest) %>%
  mutate(
    cq = map_dbl(data, estimate_cq, threshold = threshold)
  ) %>%
  select(-data) %>%
  glimpse
Rows: 24
Columns: 21
$ well            <dbl> 51, 52, 53, 54, 55, 56, 57, 58, 63, 64, 65, …
$ well_position   <chr> "E3", "E4", "E5", "E6", "E7", "E8", "E9", "E…
$ sample_name     <chr> "Trip1_009_Neg_Ctrl", "Trip1_009_D07", "Trip…
$ ww_triplicate   <chr> "Trip1", "Trip1", "Trip1", "Trip1", "Trip1",…
$ target_name     <chr> "009", "009", "009", "009", "009", "009", "0…
$ dilution_name   <chr> "Neg_Ctrl", "D07", "D06", "D05", "D04", "D03…
$ sample_color    <chr> "RGB(238,44,44)", "RGB(51,161,201)", "RGB(23…
$ biogroup_name   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ biogroup_color  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ target_color    <chr> "RGB(51,161,201)", "RGB(238,18,137)", "RGB(1…
$ task            <chr> "UNKNOWN", "UNKNOWN", "UNKNOWN", "UNKNOWN", …
$ reporter        <chr> "FAM", "FAM", "FAM", "FAM", "FAM", "FAM", "F…
$ quencher        <chr> "NFQ-MGB", "NFQ-MGB", "NFQ-MGB", "NFQ-MGB", …
$ quantity        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ comments        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ row             <ord> E, E, E, E, E, E, E, E, F, F, F, F, F, F, F,…
$ column          <ord> 3, 4, 5, 6, 7, 8, 9, 10, 3, 4, 5, 6, 7, 8, 9…
$ dilution_power  <int> NA, 6, 5, 4, 3, 2, 1, 0, NA, 6, 5, 4, 3, 2, …
$ dilution_factor <dbl> NA, 6.4e+07, 3.2e+06, 1.6e+05, 8.0e+03, 4.0e…
$ conc            <dbl> NA, 1.0e-01, 2.0e+00, 4.0e+01, 8.0e+02, 1.6e…
$ cq              <dbl> 33.15247, 32.86074, 33.07862, 32.43584, 30.9…

Estimate and plot standard curve

Questions:

cqs <- cqs %>%
  mutate(
    conc_log10 = log10(conc) 
  )
fit <- lm(cq ~ conc_log10, data = cqs)
fit %>% summary

Call:
lm(formula = cq ~ conc_log10, data = cqs)

Residuals:
   Min     1Q Median     3Q    Max 
-3.006 -2.612 -0.405  1.888  3.032 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  33.8227     0.7298   46.35  < 2e-16 ***
conc_log10   -2.0442     0.1872  -10.92 1.25e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.232 on 19 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared:  0.8626,    Adjusted R-squared:  0.8553 
F-statistic: 119.3 on 1 and 19 DF,  p-value: 1.251e-09
cqs %>%
  ggplot(aes(conc_log10, cq)) +
  geom_point() +
  geom_abline(
    intercept = coef(fit)[1],
    slope = coef(fit)[2]
  )

Note, better to plot in a way that shows the uncertainty. Can use stat smooth, but then not connected to the fit we did. Could be better to use some of the ggdist et al tools.

cqs %>%
  ggplot(aes(conc_log10, cq)) +
  geom_point() +
  stat_smooth(method = 'lm')

FALSE
[1] FALSE

Estimate the efficiency from the standard curve

We can estimate the efficiency from the slope of the standard curve using the standard formula,

x <- fit %>% broom::tidy()
slope <- coef(fit)['conc_log10']
efficiency_estimate <- 10^(-1/slope) - 1

In this case, the estimate is unreasonably large because the standard curve isn’t good.

90% confidence interval:

slope_ci <- confint(fit, parm = 'conc_log10', level = 0.9)
efficiency_ci <- 10^(-1/slope_ci) - 1
efficiency_ci
                5 %     95 %
conc_log10 1.644336 2.812536

Quite a large range!

Bayesian version with rstanarm

library(rstanarm)
options(mc.cores = parallel::detectCores())

library(ggdist)
stan_fit <- stan_glm(
  cq ~ conc_log10, 
  data = cqs,
)
stan_fit %>% summary

Model Info:
 function:     stan_glm
 family:       gaussian [identity]
 formula:      cq ~ conc_log10
 algorithm:    sampling
 sample:       4000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 21
 predictors:   2

Estimates:
              mean   sd   10%   50%   90%
(Intercept) 33.8    0.8 32.8  33.8  34.8 
conc_log10  -2.0    0.2 -2.3  -2.0  -1.8 
sigma        2.4    0.4  1.9   2.3   3.0 

Fit Diagnostics:
           mean   sd   10%   50%   90%
mean_PPD 27.9    0.8 26.9  27.9  28.9 

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
              mcse Rhat n_eff
(Intercept)   0.0  1.0  3052 
conc_log10    0.0  1.0  3186 
sigma         0.0  1.0  2274 
mean_PPD      0.0  1.0  2947 
log-posterior 0.0  1.0  1453 

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
stan_fit %>% plot

try to get the posterior samples, so that we can then get the posterior of the efficiency estimate.

TODO: google a better way to do this

slope_post <- rstan::extract(stan_fit$stanfit)$beta
efficiency_post <- 10^(-1/slope_post) - 1
efficiency_post %>% qplot

note, we could use a stronger prior since we have a lot of relevant domain info.

Demo how to use the standard curve for calibration

Appendix

Function for finding ct in a well

Standard method is to compute the Ct for each well independentally. Need to interpolate between points in the trajectory, and find when the trajectory crosses the threshold. I wonder what the software does; simplest method is perhaps linear interpolation of log Delta Rn.

TODO: google linear interpolation in R. This is essentially what geom_line is doing. It would be handy if I could just get access to that output. Can also google how to create a piecewise linear function.

First try

Since I don’t have wifi, I will need to hack it myself.

FALSE
[1] FALSE
# Suppose we have the trajectory for a particular well
x <- amp_cur %>%
  filter(well == 52) %>%
  select(cycle, delta_rn) %>%
  arrange(cycle)
# We can get the crossing point as follows
x <- x %>%
  mutate(
    below = delta_rn < threshold,
    above = delta_rn > threshold,
  )
before <- x %>%
  filter(below) %>%
  slice_tail(n = 1)
after <- x %>%
  filter(above) %>%
  slice_head(n = 1)

And find the intersection.

line passing between these two points, and the threshold.

slope <- (after$delta_rn - before$delta_rn) / (after$cycle - before$cycle)
delta_rn_diff <- threshold - before$delta_rn
cycle_diff <- delta_rn_diff / slope
ct <- before$cycle + cycle_diff

we can put this all together in a function,

estimate_cq <- function(.data, threshold) {
  .data <- .data %>%
    arrange(cycle) %>%
    mutate(
      # NOTE: log transformation, important for interpolation
      across(delta_rn, log),
      below = delta_rn < threshold,
      above = delta_rn > threshold,
    )
  before <- .data %>%
    filter(below) %>%
    slice_tail(n = 1)
  after <- .data %>%
    filter(above) %>%
    slice_head(n = 1)
  # And find the intersection of the line passing between these two points, and the threshold.
  slope <- (after$delta_rn - before$delta_rn) / (after$cycle - before$cycle)
  delta_rn_diff <- threshold - before$delta_rn
  cycle_diff <- delta_rn_diff / slope
  ct <- before$cycle + cycle_diff
  ct
}

ways we could improve

Try 2, with wifi

# Suppose we have the trajectory for a particular well
x <- amp_cur %>%
  filter(well == 52) %>%
  select(cycle, delta_rn) %>%
  arrange(cycle)
# We can get the crossing point as follows
x <- x %>%
  mutate(
    below = delta_rn < threshold,
    above = delta_rn > threshold,
  )
before <- x %>%
  filter(below) %>%
  slice_tail(n = 1)
after <- x %>%
  filter(above) %>%
  slice_head(n = 1)

We can use approxfun() to define the interpolating function.

f <- approxfun(x$cycle, x$delta_rn %>% log, rule = 1)
a <- seq(from = -5, to = 45, by = 0.1)
qplot(a, f(a))

and then the intersection with e.g. a root-finding function, uniroot(). however, this approach is a bit funny because of the non-monotonicity in the noise region, and the fact that once we know the cycle interval we can calculate the intersection of the interpolation manually.

Efficiency estimate

Derivation:

The slope of the standard curve tells us how many extra cycles correspond to a 10X decrease in starting concentration. With perfect efficiency of \(E=1\), this would equal \(\log(10) / \log(2)\). More generally, the number of extra cycles is \(A = \log(10) / \log(1 + E)\), for any log base; taking the log base to be 10 gives \(A = 1 / \log(1 + E)\). The slope of the standard curve corresponds to \(-A\). Therefore, we can estimate \(E\) by

\[\begin{align} \hat E = 10^{1 / A} - 1 \end{align}\]

Methods for setting the threshold

https://www.researchgate.net/post/How-can-I-set-the-threshold-in-a-Real-Time-PCR-result has some discussion. One suggestion is to find candidate points based on the maximum of the second derivative of the amplification curves

Methods for finding the baseline region

Not needed right now since we’re using the software’s determination of this.

Next steps

References