In-progress qPCR data analysis workflow in R, using data from the 2022-Q2 Sprint spike-in experiment.
The data is from a folder in the NAO Drive currently called ‘Spike-in experiments’, which I’ve downloaded locally.
data_path <- here( '_data/nao/qpcr', '2022-06-29-spike-in-experiment/results')
dir_ls(data_path) %>% path_file
[1] "!README.docx"
[2] "2022-06-29-trip01.eds"
[3] "2022-06-29-trip01.txt"
[4] "2022-06-29-trip01.xlsx"
[5] "2022-06-29-trip01_2.eds"
[6] "2022-06-29-trip02.eds"
[7] "2022-06-29-trip02.xlsx"
[8] "2022-06-30-trip02_009_seq_barcode.xlsx"
[9] "2022-06-30_trip03.xlsx"
[10] "Results.ipynb"
[11] "plate_layout_trip1.txt"
[12] "plate_layout_trip2.txt"
[13] "results.md5"
[14] "trip1-template.edt"
TODO: try loading the data directly from Google Drive
This file assumes that the data is within a folder ‘data/’ within the root project folder.
File info: - Excel files have the raw and processed florescence measurements (Rn and Delta Rn), as well as the softwares autothreshold stuff in another sheet. - Also some sample metadata is here; however, we might want want to take that from the .txt file, since that is (I believe) closer to the original supplied table.
TODO: Talk to Ari about whether we have/can save the files used to set up the qPCR experiment
First, we can read in the relevant sections of the Excel file and clean up thedata a bit,
First, load in the sample metadata.
sam <- path(data_path, '2022-06-29-trip01.txt') %>%
read_tsv(skip = 43) %>%
janitor::clean_names() %>%
mutate(
row = str_sub(well_position, 1, 1) %>% as.ordered,
column = str_sub(well_position, 2) %>% as.integer %>% as.ordered,
) %>%
# relocate(row, column, .after = well_position) %>%
glimpse
Rows: 96
Columns: 15
$ well <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
$ well_position <chr> "A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8…
$ sample_name <chr> "Blank", "NTC", "Trip1_010_Neg_Ctrl", "Trip1_…
$ sample_color <chr> "RGB(0,139,69)", "RGB(142,56,142)", "RGB(139,…
$ biogroup_name <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ biogroup_color <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ target_name <chr> "Blank", "Trip1_010_0", "Trip1_010_Neg_Ctrl",…
$ target_color <chr> "RGB(0,139,69)", "RGB(142,142,56)", "RGB(238,…
$ task <chr> "UNKNOWN", "UNKNOWN", "UNKNOWN", "UNKNOWN", "…
$ reporter <chr> "FAM", "FAM", "FAM", "FAM", "FAM", "FAM", "FA…
$ quencher <chr> "NFQ-MGB", "NFQ-MGB", "NFQ-MGB", "NFQ-MGB", "…
$ quantity <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ comments <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ row <ord> A, A, A, A, A, A, A, A, A, A, A, A, B, B, B, …
$ column <ord> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, …
There is additional sample data hidden in the sample names, which we’ll need to fix in future runs (it should be in its own columns in a table). But for now we can parse it from the sample names. I’ll overwrite the faulty target names.
NOTE: The target name of the NTC is now NA, which is incorrect, but I’m not going to worry about that now. The proper fix is for the target name to be fixed in the source data.
NOTE: We need to use the actual starting concentrations for the standard curve. Need to talk to Anjali about how these should be supplied. Here I will assume the following.
# concentration in copies per microliter
conc_low <- 0.1
dilution_step <- 20
num_samples <- 7
conc_max <- conc_low * dilution_step^(num_samples - 1)
dilution_df <- tibble(
dilution_power = 0:6,
dilution_factor = dilution_step^dilution_power,
conc = conc_max / dilution_factor,
dilution_name = str_c('D', '0', dilution_power + 1)
)
which we must add to the sample data,
Next, load the amplification data — the relative florescence values (Rn and Delta Rn) — and join the sample metadata.
amp <- path(data_path, '2022-06-29-trip01.xlsx') %>%
readxl::read_excel(
sheet = 'Amplification Data',
skip = 40,
col_types = c('numeric', 'text', 'numeric', 'text', 'numeric', 'numeric')
) %>%
janitor::clean_names() %>%
glimpse
Rows: 3,840
Columns: 6
$ well <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ well_position <chr> "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A1"…
$ cycle <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
$ target_name <chr> "Blank", "Blank", "Blank", "Blank", "Blank", "…
$ rn <dbl> 5.125966, 5.172779, 5.221941, 5.282046, 5.3525…
$ delta_rn <dbl> -0.228727385, -0.190693140, -0.150309995, -0.0…
Note, this table has the original, incorrect target name. I’ll drop that, and join the sample metadata table which has the corrected target names.
Rows: 3,840
Columns: 23
$ well <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ well_position <chr> "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A…
$ cycle <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
$ rn <dbl> 5.125966, 5.172779, 5.221941, 5.282046, 5.35…
$ delta_rn <dbl> -0.228727385, -0.190693140, -0.150309995, -0…
$ sample_name <chr> "Blank", "Blank", "Blank", "Blank", "Blank",…
$ ww_triplicate <chr> "Blank", "Blank", "Blank", "Blank", "Blank",…
$ target_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ dilution_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ sample_color <chr> "RGB(0,139,69)", "RGB(0,139,69)", "RGB(0,139…
$ biogroup_name <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ biogroup_color <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ target_color <chr> "RGB(0,139,69)", "RGB(0,139,69)", "RGB(0,139…
$ task <chr> "UNKNOWN", "UNKNOWN", "UNKNOWN", "UNKNOWN", …
$ reporter <chr> "FAM", "FAM", "FAM", "FAM", "FAM", "FAM", "F…
$ quencher <chr> "NFQ-MGB", "NFQ-MGB", "NFQ-MGB", "NFQ-MGB", …
$ quantity <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ comments <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ row <ord> A, A, A, A, A, A, A, A, A, A, A, A, A, A, A,…
$ column <ord> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ dilution_power <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ dilution_factor <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ conc <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
TODO: use code from vivo vitro to do this nicely; need to first parse the row and column from the well position
TODO: see if can flow the sample names so that they print nicer
We’ll want to do the analysis separately for each target.
Maybe there’s value in first looking at everything
TODO: Set a fixed color scheme for the targets, and use this in all plots.
delta_rn_min <- 1e-3
p1 <- amp %>%
filter(!is.na(target_name)) %>%
ggplot(aes(cycle, pmax(delta_rn, delta_rn_min), color = target_name)) +
geom_line(aes(group = well)) +
scale_color_brewer(type = 'qual') +
geom_point() +
labs(y = 'Delta Rn', x = 'Cycle', color = 'Target')
p2 <- p1 +
scale_y_log10()
p1 / p2
TODO: Consider if a better way to deal with non-positive values in the log-scale plot. See what the software does. (I think it might just not show these points.)
Will ultimately do this for all targets; perhaps show in tabs? For now, I’ll use target 009.
delta_rn_min <- 1e-3
p1 <- amp_cur %>%
ggplot(aes(cycle, pmax(delta_rn, delta_rn_min), color = target_name)) +
geom_line(aes(group = well)) +
scale_color_brewer(type = 'qual') +
geom_point() +
labs(y = 'Delta Rn', x = 'Cycle', color = 'Target')
p2 <- p1 +
scale_y_log10()
p1 / p2
For now, will pick the threshold manually.
threshold <- 3e-1
p1 / p2 &
geom_hline(yintercept = threshold)
TODO: implement chosen auto-threshold algorithm.
I’ll define a custom function estimate_cq
to estimate the Cq value for a trajectotry crossing a given quantification threshold.
See the appendix below for more info.
estimate_cq <- function(.data, threshold) {
.data <- .data %>%
arrange(cycle) %>%
mutate(
# NOTE: log transformation, important for interpolation
across(delta_rn, log),
below = delta_rn < threshold,
above = delta_rn > threshold,
)
before <- .data %>%
filter(below) %>%
slice_tail(n = 1)
after <- .data %>%
filter(above) %>%
slice_head(n = 1)
# And find the intersection of the line passing between these two points, and the threshold.
slope <- (after$delta_rn - before$delta_rn) / (after$cycle - before$cycle)
delta_rn_diff <- threshold - before$delta_rn
cycle_diff <- delta_rn_diff / slope
ct <- before$cycle + cycle_diff
ct
}
Now use this function to compute the Cq values for each trajectory
sample_vars <- sam %>% colnames
cqs <- amp_cur %>%
with_groups(all_of(sample_vars), nest) %>%
mutate(
cq = map_dbl(data, estimate_cq, threshold = threshold)
) %>%
select(-data) %>%
glimpse
Rows: 24
Columns: 21
$ well <dbl> 51, 52, 53, 54, 55, 56, 57, 58, 63, 64, 65, …
$ well_position <chr> "E3", "E4", "E5", "E6", "E7", "E8", "E9", "E…
$ sample_name <chr> "Trip1_009_Neg_Ctrl", "Trip1_009_D07", "Trip…
$ ww_triplicate <chr> "Trip1", "Trip1", "Trip1", "Trip1", "Trip1",…
$ target_name <chr> "009", "009", "009", "009", "009", "009", "0…
$ dilution_name <chr> "Neg_Ctrl", "D07", "D06", "D05", "D04", "D03…
$ sample_color <chr> "RGB(238,44,44)", "RGB(51,161,201)", "RGB(23…
$ biogroup_name <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ biogroup_color <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ target_color <chr> "RGB(51,161,201)", "RGB(238,18,137)", "RGB(1…
$ task <chr> "UNKNOWN", "UNKNOWN", "UNKNOWN", "UNKNOWN", …
$ reporter <chr> "FAM", "FAM", "FAM", "FAM", "FAM", "FAM", "F…
$ quencher <chr> "NFQ-MGB", "NFQ-MGB", "NFQ-MGB", "NFQ-MGB", …
$ quantity <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ comments <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ row <ord> E, E, E, E, E, E, E, E, F, F, F, F, F, F, F,…
$ column <ord> 3, 4, 5, 6, 7, 8, 9, 10, 3, 4, 5, 6, 7, 8, 9…
$ dilution_power <int> NA, 6, 5, 4, 3, 2, 1, 0, NA, 6, 5, 4, 3, 2, …
$ dilution_factor <dbl> NA, 6.4e+07, 3.2e+06, 1.6e+05, 8.0e+03, 4.0e…
$ conc <dbl> NA, 1.0e-01, 2.0e+00, 4.0e+01, 8.0e+02, 1.6e…
$ cq <dbl> 33.15247, 32.86074, 33.07862, 32.43584, 30.9…
Questions:
Call:
lm(formula = cq ~ conc_log10, data = cqs)
Residuals:
Min 1Q Median 3Q Max
-3.006 -2.612 -0.405 1.888 3.032
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.8227 0.7298 46.35 < 2e-16 ***
conc_log10 -2.0442 0.1872 -10.92 1.25e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.232 on 19 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.8626, Adjusted R-squared: 0.8553
F-statistic: 119.3 on 1 and 19 DF, p-value: 1.251e-09
cqs %>%
ggplot(aes(conc_log10, cq)) +
geom_point() +
geom_abline(
intercept = coef(fit)[1],
slope = coef(fit)[2]
)
Note, better to plot in a way that shows the uncertainty. Can use stat smooth, but then not connected to the fit we did. Could be better to use some of the ggdist et al tools.
cqs %>%
ggplot(aes(conc_log10, cq)) +
geom_point() +
stat_smooth(method = 'lm')
FALSE
[1] FALSE
We can estimate the efficiency from the slope of the standard curve using the standard formula,
In this case, the estimate is unreasonably large because the standard curve isn’t good.
90% confidence interval:
slope_ci <- confint(fit, parm = 'conc_log10', level = 0.9)
efficiency_ci <- 10^(-1/slope_ci) - 1
efficiency_ci
5 % 95 %
conc_log10 1.644336 2.812536
Quite a large range!
Model Info:
function: stan_glm
family: gaussian [identity]
formula: cq ~ conc_log10
algorithm: sampling
sample: 4000 (posterior sample size)
priors: see help('prior_summary')
observations: 21
predictors: 2
Estimates:
mean sd 10% 50% 90%
(Intercept) 33.8 0.8 32.8 33.8 34.8
conc_log10 -2.0 0.2 -2.3 -2.0 -1.8
sigma 2.4 0.4 1.9 2.3 3.0
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 27.9 0.8 26.9 27.9 28.9
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.0 1.0 3052
conc_log10 0.0 1.0 3186
sigma 0.0 1.0 2274
mean_PPD 0.0 1.0 2947
log-posterior 0.0 1.0 1453
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
stan_fit %>% plot
try to get the posterior samples, so that we can then get the posterior of the efficiency estimate.
TODO: google a better way to do this
note, we could use a stronger prior since we have a lot of relevant domain info.
Standard method is to compute the Ct for each well independentally. Need to interpolate between points in the trajectory, and find when the trajectory crosses the threshold. I wonder what the software does; simplest method is perhaps linear interpolation of log Delta Rn.
TODO: google linear interpolation in R. This is essentially what geom_line is doing. It would be handy if I could just get access to that output. Can also google how to create a piecewise linear function.
Since I don’t have wifi, I will need to hack it myself.
FALSE
[1] FALSE
# Suppose we have the trajectory for a particular well
x <- amp_cur %>%
filter(well == 52) %>%
select(cycle, delta_rn) %>%
arrange(cycle)
# We can get the crossing point as follows
x <- x %>%
mutate(
below = delta_rn < threshold,
above = delta_rn > threshold,
)
before <- x %>%
filter(below) %>%
slice_tail(n = 1)
after <- x %>%
filter(above) %>%
slice_head(n = 1)
And find the intersection.
line passing between these two points, and the threshold.
slope <- (after$delta_rn - before$delta_rn) / (after$cycle - before$cycle)
delta_rn_diff <- threshold - before$delta_rn
cycle_diff <- delta_rn_diff / slope
ct <- before$cycle + cycle_diff
we can put this all together in a function,
estimate_cq <- function(.data, threshold) {
.data <- .data %>%
arrange(cycle) %>%
mutate(
# NOTE: log transformation, important for interpolation
across(delta_rn, log),
below = delta_rn < threshold,
above = delta_rn > threshold,
)
before <- .data %>%
filter(below) %>%
slice_tail(n = 1)
after <- .data %>%
filter(above) %>%
slice_head(n = 1)
# And find the intersection of the line passing between these two points, and the threshold.
slope <- (after$delta_rn - before$delta_rn) / (after$cycle - before$cycle)
delta_rn_diff <- threshold - before$delta_rn
cycle_diff <- delta_rn_diff / slope
ct <- before$cycle + cycle_diff
ct
}
ways we could improve
# Suppose we have the trajectory for a particular well
x <- amp_cur %>%
filter(well == 52) %>%
select(cycle, delta_rn) %>%
arrange(cycle)
# We can get the crossing point as follows
x <- x %>%
mutate(
below = delta_rn < threshold,
above = delta_rn > threshold,
)
before <- x %>%
filter(below) %>%
slice_tail(n = 1)
after <- x %>%
filter(above) %>%
slice_head(n = 1)
We can use approxfun()
to define the interpolating function.
and then the intersection with e.g. a root-finding function, uniroot()
. however, this approach is a bit funny because of the non-monotonicity in the noise region, and the fact that once we know the cycle interval we can calculate the intersection of the interpolation manually.
Derivation:
The slope of the standard curve tells us how many extra cycles correspond to a 10X decrease in starting concentration. With perfect efficiency of \(E=1\), this would equal \(\log(10) / \log(2)\). More generally, the number of extra cycles is \(A = \log(10) / \log(1 + E)\), for any log base; taking the log base to be 10 gives \(A = 1 / \log(1 + E)\). The slope of the standard curve corresponds to \(-A\). Therefore, we can estimate \(E\) by
\[\begin{align} \hat E = 10^{1 / A} - 1 \end{align}\]
https://www.researchgate.net/post/How-can-I-set-the-threshold-in-a-Real-Time-PCR-result has some discussion. One suggestion is to find candidate points based on the maximum of the second derivative of the amplification curves
Not needed right now since we’re using the software’s determination of this.