Subsampling MERFISH Data Using SPARROW
Source: vignettes/MERFISH_Using_SPARROW.Rmd
You will need the following packages
library(sparrow)
library(ggplot2)
library(dplyr)
Reading in Data
First we read in the MERFISH data using the `read_merfish` function. This function returns a list with the following entries.
Output
- counts: A matrix of gene counts, of dimension #genes (500) by #cells (n = 766,757)
- regions: A vector of region ids for each cell. There are p = 14 regions.
- CT: Cell type classification for each cell
data = read_merfish()
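We can sanity-check this structure before moving on:
#check the structure of the returned list
dim(data$counts) #500 genes by 766757 cells
length(data$regions) #one region id per cell
head(table(data$CT)) #counts of the first few cell types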
Overview
We use a MERFISH colorectal cancer dataset containing 15 cell types and 5 annotated regions. For each gene $g$ we fit the model

$$y_{gi} \sim \text{Poisson}\left(s_i \, \mu_{g, c(i), r(i)}\right),$$

where $c(i)$ is the cell type of cell $i$, $s_i$ is the library size of cell $i$, and $r(i)$ is the region that cell $i$ belongs to. Here $\mu_{g,t,r}$ is the expected expression of gene $g$ in cell type $t$ in region $r$. Then for all cell types $t$ and pairs of regions $r \neq r'$ we test

$$H_0: \mu_{g,t,r} = \mu_{g,t,r'}.$$
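To make the model concrete, here is a toy simulation of counts for one gene under this model; the sample size, cell types, regions, and rates below are made up purely for illustration:
set.seed(1)
n = 10
s = rpois(n, 1000) #library sizes s_i
ct = sample(1:2, n, replace = TRUE) #cell type c(i) of each cell
rg = sample(1:2, n, replace = TRUE) #region r(i) of each cell
mu = matrix(c(0.001, 0.002, 0.003, 0.004), 2, 2) #mu_{g,t,r} for one gene
y = rpois(n, s * mu[cbind(ct, rg)]) #simulated counts y_{gi}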
The original dataset contains approximately 800,000 cells. We will first downsample to 300,000 cells for the sake of speed. We will then use SPARROW to select a powerful subsample of cells that can recover the majority of significant effects.
Step 1: One-Hot Encoding Region and Cell Type Classification
data$regions = model.matrix(~ regions - 1,
data = data.frame(regions = data$regions))
data$CT = model.matrix(~ CT - 1, data = data.frame(CT = data$CT))
#Take the first 300,000 cells to make downstream analysis faster
ind = 1:300000
data$counts = data$counts[,ind]
data$regions = data$regions[ind,]
data$CT = data$CT[ind,]
Step 2: Expand the Covariate Matrix to Incorporate Deconvolution
The model above has a variance-covariance matrix equal to

$$\left(\tilde{X}^\top W \tilde{X}\right)^{-1}.$$

The matrix $\tilde{X}$ is equal to

$$\tilde{X} = \left[\lambda_{\cdot 1} \odot X \;\cdots\; \lambda_{\cdot T} \odot X\right],$$

i.e. we append $T$ copies of $X$ together, weighted by the deconvolution vectors for each cell type. The matrix $W$ is diagonal with $W_{ii}$ approximately equal to $s_i \mu_i$. As $\mu$ gets larger, there is more signal. Because we intend to estimate power later, we fix a lower bound for $\mu$ in order to place a conservative estimate on the standard error. For now, what we will do with spatial transcriptomic data is compute the proportion of total UMIs that come from each gene, compute the 75th percentile of these values, then convert this into UMIs per 1000 (e.g. $1000 \cdot \exp(q_{0.75})$, where $q_{0.75}$ is the 75th percentile of the log proportions).
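To see what the expansion does, here is a minimal hand-built version of $\tilde{X}$ for a toy one-hot $X$ and deconvolution matrix $\lambda$. This is a sketch of the construction above, not the package implementation (which also applies the weighting and frequency filtering):
#toy one-hot region matrix X (n = 4 cells, p = 2 regions)
X = matrix(c(1,0, 0,1, 1,0, 0,1), nrow = 4, byrow = TRUE)
#toy deconvolution matrix lambda (T = 2 cell types)
lambda = matrix(c(1,0, 0,1, 0.5,0.5, 0,1), nrow = 4, byrow = TRUE)
#append T copies of X, each weighted row-wise by one cell type's deconvolution vector
expanded = do.call(cbind, lapply(1:ncol(lambda), function(t) X * lambda[,t]))
dim(expanded) #n by p*T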
To expand our covariate matrix we use the `expand_covariate_matrix` function, which takes 6 arguments
Arguments
- X: The covariate matrix. Of dimension n by p
- lambda: The deconvolution matrix for each spot. Of dimension n by T
- family: The family of the downstream GLM that one intends to use.
- lib_size: The library size of each spot. Either a scalar or a vector of length n
- min_reads_per_1000: The minimum reads per 1000 that we want to test for. This is only relevant for data selection later.
- min_freq: How many non-zero observations a column of the expanded covariate matrix needs to be valid. Default is 500.
#get gene totals and the total UMI count
gene_totals = rowSums(data$counts)
total_UMI = sum(data$counts)
#get a good cutoff value for the minimum reads per 1000
cutoff = quantile(log(gene_totals/total_UMI),0.75)
#make expanded covariate matrix
expanded_X = sparrow::expand_covariate_matrix(X = data$regions, lambda = data$CT,
family = "poisson",lib_size = colSums(data$counts),
min_reads_per_1000 = 1000*exp(cutoff))
#record which region-cell type combinations have enough non-zero observations
valid_cov = matrix(apply(expanded_X,2,function(x){sum(x>0)}) > 50,ncol(data$regions),ncol(data$CT))
#drop columns with too few non-zero observations
expanded_X = expanded_X[,apply(expanded_X,2,function(x){sum(x>0)}) > 50]
After this step we have two objects. valid_cov is a matrix of dimension p by T that tells us which cell type-covariate combinations have sample sizes exceeding the minimum frequency. expanded_X holds the expanded covariate matrix, with the invalid cell type-covariate combinations removed.
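We can quickly check how many region-cell type combinations survive the filter:
#number of valid region-cell type combinations out of p*T
sum(valid_cov)
#number of columns retained in the expanded covariate matrix
ncol(expanded_X)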
Step 3: Compute Target Standard Error of Subsample
The goal of SPARROW is to identify a subsample of the data such that the subsampled data and the original data have similar power. This is done by comparing the standard errors in the subsample and the standard errors in the original dataset. Key to this procedure is specifying a minimum effect size $\delta$. This is the minimum effect size that one cares about identifying. For example, in the case of Poisson regression, one may not care about identifying log fold changes less than a certain amount. In this case let us choose $\delta = 0.05$.
For covariate $j$, let its standard error on the entire dataset be $\sigma_j$. Under some assumptions one can compute the power when the standard error is $\sigma$ and the effect size is $\delta$, written $\mathrm{Power}(\sigma, \delta)$. Define $\bar{P}(\sigma)$ as the average power over the interval of effect sizes of magnitude at least $\delta$. If $\sigma' \leq \sigma$, then $\bar{P}(\sigma') \geq \bar{P}(\sigma)$. Given $\sigma_j$, we can compute the target $\sigma_j^*$ satisfying $\bar{P}(\sigma_j^*) = \gamma \, \bar{P}(\sigma_j)$, where $\gamma$ is a constant close to 1 (e.g. 0.99). $\sigma_j^*$ represents the largest standard error at which we recover at least $\gamma$ of the power that the original data has for covariate $j$. Below we compute the target standard errors when $\gamma = 0.99$. To compute $\sigma_j^*$ we use the `compute_target_standard_error` function. This takes 3 arguments
Arguments
- X: The expanded covariate matrix. Of dimension n by p
- min_effect: The smallest effect size we wish to detect ($\delta$ above)
- target_power_approx: The proportion of the full-data power we wish to retain (the constant $\gamma$ above)
target_standard_errors = sparrow::compute_target_standard_error(X = expanded_X,
min_effect = 0.05,target_power_approx = 0.99)
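For intuition about the power calculation above, the power of a two-sided z-test with effect size $\delta$ and standard error $\sigma$ can be approximated as below. The helper power_z is illustrative only and not part of sparrow; the exact formula SPARROW uses may differ:
#approximate power of a two-sided z-test at level alpha (illustrative helper)
power_z = function(sigma, delta, alpha = 0.05){
z = qnorm(1 - alpha/2)
pnorm(delta/sigma - z) + pnorm(-delta/sigma - z)
}
power_z(sigma = 0.01, delta = 0.05) #standard error well below the effect: high power
power_z(sigma = 0.05, delta = 0.05) #standard error equal to the effect: low power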
Step 4: Subsampling via SPARROW (~2 minutes)
SPARROW is a subsampling algorithm that attempts to minimize the trace of $(\tilde{X}_S^\top W_S \tilde{X}_S)^{-1}$, where $\tilde{X}_S$ is a submatrix of $\tilde{X}$ containing the rows indexed by the subsample $S$. This is done via stochastic greedy selection, which can identify solutions that are at least $(1 - 1/e - \epsilon)$ efficient. A natural stopping criterion is when every covariate reaches its target, i.e. $\hat{\sigma}_j \leq \sigma_j^*$ for all $j$. Note that when $\gamma < 1$, the amount of data needed to achieve the target standard error plateaus as the full dataset grows. This is what characterizes the ultra-scalability of SPARROW.
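To illustrate the idea, here is a minimal sketch of stochastic greedy row selection on a toy Gaussian design matrix, minimizing the trace of the inverse information matrix (an A-optimality objective). This is illustration only, not SPARROW's implementation; the dimensions, pool size, and number of steps are arbitrary:
#toy stochastic greedy selection: pick rows of X to minimize trace((X_S' X_S)^-1)
set.seed(1)
X = matrix(rnorm(2000), 500, 4)
S = sample(1:500, 10) #small seed set so the information matrix is invertible
info = crossprod(X[S,])
for(step in 1:40){
#score a random candidate pool (the "stochastic" part of stochastic greedy)
cand = sample(setdiff(1:500, S), 25)
scores = sapply(cand, function(i) sum(diag(solve(info + tcrossprod(X[i,])))))
#greedily add the row giving the largest one-step reduction in the trace
best = cand[which.min(scores)]
S = c(S, best)
info = info + tcrossprod(X[best,])
}
length(S) #size of the selected subsample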
To run data selection we use the `data_selection` function. It takes 5 arguments
Arguments
- X: The expanded covariate matrix, transposed so that it is of dimension p by n
- max_data_size: The maximum subsample size we want to take
- min_standard_error: A vector of the target standard errors
- log: Logical indicating whether we should log the progress of data selection
- period: If log is TRUE, how often (in iterations) to print progress
selected_indices = sparrow::data_selection(X = t(expanded_X),
max_data_size = 300000,min_standard_error = target_standard_errors,
log = TRUE,period = 10000)
selected_indices = selected_indices[!is.na(selected_indices)]
print(paste0("Cells Chosen: ",length(selected_indices)))
The output is a vector of indices in the order they were selected. Therefore if one initially picks a subset of size 100,000 but then wants to downsample to 50,000 observations, they simply take the first 50,000 indices of the output. We see that we select around 90,000 cells, roughly 1/3 of the original sample.
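Because of this ordering, a smaller nested subsample is simply a prefix of the output; for example (50,000 below is arbitrary):
#take a smaller nested subsample by truncating the selection order
smaller_subsample = selected_indices[1:min(50000, length(selected_indices))]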
Step 5: Downstream Analysis (Poisson GLM)
Full Data (~ 5 minutes)
#get number of regions and cell types
nregion = ncol(data$regions)
nCT = ncol(data$CT)
#get number of genes
ngene = nrow(data$counts)
#get library size of cells
spot_size = colSums(data$counts)
#save results to a list
res = list()
#start time
t1 = Sys.time()
for(i in c(1:ngene)){
if(i %%100 == 0){
print(paste0("Iteration ",i," out of ",ngene))
print(Sys.time() - t1)
}
#save beta
beta = matrix(NA,nregion ,nCT)
#get dimnames of beta
colnames(beta) = colnames(data$CT)
rownames(beta) = colnames(data$regions)
#standard error matrix
SE = matrix(NA,nregion,nCT)
colnames(SE) = colnames(data$CT)
rownames(SE) = colnames(data$regions)
#get counts
Y = data$counts[i,]
for(j in c(1:nCT)){
#get cells that correspond to that cell type
cells = which(data$CT[,j] == 1)
#get covariate matrix
X = data$regions[cells,]
#subset covariate matrix by valid
ct_ind = which(valid_cov[,j] == 1)
X = X[,ct_ind]
#get response
y = Y[cells]
#get offset
offset_cells = log(spot_size)[cells]
#fit model
model = glm(y~X-1+offset(offset_cells),family = "poisson")
#save beta
beta[ct_ind,j] = coef(model)
#save standard errors
SE[ct_ind,j] = sqrt(diag(vcov(model)))
}
#save results
res[[i]] = list(beta = beta,SE = SE)
}
#total elapsed time
print(Sys.time() - t1)
Subsampled Data (~ 2 minutes)
#get number of regions and cell types
nregion = ncol(data$regions)
nCT = ncol(data$CT)
#get number of genes
ngene = nrow(data$counts)
#get library size of cells
spot_size = colSums(data$counts)
#save results to a list
res_subsample = list()
t1 = Sys.time()
for(i in c(1:ngene)){
if(i %%100 == 0){
print(paste0("Iteration ",i," out of ",ngene))
print(Sys.time() - t1)
}
#save beta
beta = matrix(NA,nregion ,nCT)
#get dimnames of beta
colnames(beta) = colnames(data$CT)
rownames(beta) = colnames(data$regions)
#standard error matrix
SE = matrix(NA,nregion,nCT)
colnames(SE) = colnames(data$CT)
rownames(SE) = colnames(data$regions)
#get counts
Y = data$counts[i,]
for(j in c(1:nCT)){
#get cells that correspond to that cell type
cells = selected_indices[data$CT[selected_indices,j] == 1]
#get covariate matrix
X = data$regions[cells,]
#subset covariate matrix by valid
ct_ind = which(valid_cov[,j] == 1)
X = X[,ct_ind]
#get response
y = Y[cells]
#get offset
offset_cells = log(spot_size)[cells]
#fit model
model = glm(y~X-1+offset(offset_cells),family = "poisson")
#save beta
beta[ct_ind,j] = coef(model)
#save standard errors
SE[ct_ind,j] = sqrt(diag(vcov(model)))
}
#save results
res_subsample[[i]] = list(beta = beta,SE = SE)
}
Recall Plot: Contrast Test
Compute contrast test scores
contrast_test = function(beta,SE,upper = T){
#pairwise two-sided z-tests between all pairs of estimates
pvals = matrix(NA,length(beta),length(beta))
for(j in c(1:length(beta))){
#difference between estimate j and every other estimate
diff = beta[j] - beta
#standard error of each difference
diff_SE = sqrt(SE[j]^2 + SE^2)
#two-sided normal p-values
pvals[,j] = 2*(1-pnorm(abs(diff/diff_SE)))
}
#keep only one copy of each pair
if(upper){
pvals[upper.tri(pvals)] <- NA
}
diag(pvals) = NA
return(pvals)
}
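As a toy example, with three region estimates for a single cell type the lower triangle holds the pairwise p-values:
#pairwise contrasts among three made-up region estimates
contrast_test(beta = c(0.1, 0.5, 0.2), SE = c(0.05, 0.05, 0.05))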
#get pvalue list for full data and subsampled data
pvalue_subsample = c()
pvalue_full = c()
for(j in c(1:ngene)){
full_result = res[[j]]
subsample_result = res_subsample[[j]]
nCT = ncol(full_result$beta)
for(k in c(1:nCT)){
p_full = contrast_test(full_result$beta[,k],full_result$SE[,k])
p_sub = contrast_test(subsample_result$beta[,k],subsample_result$SE[,k])
pvalue_subsample = c(pvalue_subsample,p_sub)
pvalue_full = c(pvalue_full,p_full)
}
}
pvalue_subsample_adj = p.adjust(pvalue_subsample, method = "BH")
pvalue_full_adj = p.adjust(pvalue_full, method = "BH")
Compute and Plot Recall
We now adjust our p-values using the BH procedure and compute the recall, assuming that the results from the full data model are the truth, across a variety of q-value cutoffs. The x-axis shows the q-value cutoff that would be used on the original dataset. The y-axis shows the recall on the subsetted dataset if its q-value cutoff were 0.05. We see that the recall is nearly perfect no matter what q-value cutoff is chosen. That is, a standard analysis with 1/3 of the data can identify nearly all effects found in the full dataset.
CT_names = colnames(res[[1]]$beta)
N = min(1000,sum(pvalue_full_adj[is.na(pvalue_full_adj)==F] <= 0.05))
set.seed(110)
Nind = 1000
Nindices = floor(seq(1,N,length.out = Nind))
ordered_pvals = sort(pvalue_full_adj[is.na(pvalue_full_adj)==F])
ordered_pvals = ordered_pvals[ordered_pvals < 0.05]
pval_inds = sort(c(1,sample(c(2:length(ordered_pvals)),N-1,replace = F)))
log_p = log(pvalue_full_adj[is.na(pvalue_full_adj)==F],10)
log_p = log_p[is.infinite(log_p) == F]
min_log_p = min(log_p)
#get numerator and denominator in recall
recall = rep(NA,Nind)
NUM = rep(0,Nind)
DEN = rep(0,Nind)
C = rep(0,Nind)
for(k in c(1:Nind)){
#get cutoff value
cutoff = max(min(log_p),ordered_pvals[pval_inds[k]])
standard_sig = (pvalue_full_adj <= cutoff & is.na(pvalue_subsample_adj) == F)
query_sig = (pvalue_subsample_adj <= 0.05)
NUM[k] = sum(standard_sig * query_sig,na.rm = T)
DEN[k] = sum(standard_sig,na.rm = T)
recall[k] = NUM[k]/DEN[k]
C[k] = log(cutoff,10)
if(is.infinite(C[k])){
C[k] = min(log_p)
}
}
# Assemble plot data
plot_data <- data.frame(
C = C,
recall = recall
)
# Create the plot
PLOT = ggplot(plot_data, aes(x = C, y = recall)) +
geom_point() + # Scatter points
geom_line() + # Optional: Connect points with lines
scale_y_continuous(
limits = c(0, 1), # Set y-axis limits
breaks = seq(0, 1, by = 0.1) # Add y-axis ticks
) +
theme_minimal() + # Use a clean theme
labs(
x = "Full Data Log Q-value Cutoff", # Label for the x-axis
y = "Recall", # Label for the y-axis
title = "Recall of Subsetted Data" # Plot title
) +
theme(
panel.grid.major = element_line(color = "grey80"), # Add a grid
panel.grid.minor = element_blank() # Optional: Hide minor grids
)
print(PLOT)
#compute type 1 error
standard_null = (pvalue_full_adj > 0.05 & is.na(pvalue_subsample_adj) == F)
query_sig = (pvalue_subsample_adj <= 0.05)
type_1_error = sum(query_sig * standard_null,na.rm = T)/sum(standard_null,na.rm = T)
#print
print(paste0("Type 1 error rate: ",round(type_1_error,2)))