Subsampling Data Using SPARROW
Source:vignettes/Subsampling_Data_Using_SPARROW.Rmd
Subsampling_Data_Using_SPARROW.Rmd
You will need the following packages
library(sparrow)
library(spotglm)
Overview
SPARROW (Submodular Power Adaptive data Reduction for Representative dOWnstream analysis) is a method that selects a subsample of size from a large scale dataset. The subsample is chosen such that it maximizes the power of a downstream analysis (e.g. regression). This is done by minimizing the trace of which corresponds to the sum of the square of the standard errors of a standard linear regression. In the case of large scale spatial-omic data that contains mixtures of cells called spots, the covariate matrix must be expanded to be of dimension where is the number of cell types in the sample. More technical details can be found in the supplementary.
Step 1: Simulating Data
We use the simulate data function in sparrow to simulate 100,000 spots with 8 cell types. Each spot contains at most 2 cell types. We simulate data according to the model Here is the response (e.g. gene expression) for spot , is the covariate vector for spot , denotes a cell type, is the deconvolution estimate for cell type in spot , and is the effect vector for cell type . Note that each cell type gets a different effect vector. Also note that the columns of are bounded between 0 and 1. Bounding all columns between some interval is recommended to prevent scaling issues and to make interpretation easier.
data = sparrow::simulate_data(n = 1e5, nct = 8,effect_range = c(-0.2,0.2),min_effect = 0.05,
intercept_range = c(-6,-4),
library_size = 1000, spot_ct = 2,
p = 8,num_null = 2,prob_ct = NULL)
Step 2: Expand the Covariate Matrix to Incorporate Deconvolution
The model above has a variance covariance matrix equal to
.
The matrix
is equal to
i.e. we append
copies of
together, weighted by the deconvolution vectors for each cell
type.
The matrix
is diagonal with
approximately equal to
.
As
gets larger, there is more signal. Because we intend to estimate power
later, we fix a lower bound for
in order to place a conservative estimate on the standard error.
For now, let this bound be equal to
.
In general, what one can do with spatial transcriptomic data is to
compute the proportion of total UMIs that come from the each
gene,compute the 75th percentile of these values, then convert this into
UMIs per 1000
(e.g. ).
To expand our covariate matrix we use the “expand covariate matrix”
function which takes 6 arguments
Arguments
- X: The covariate matrix. Of dimension n by p
- lambda: The deconvolution matrix for each spot. Of dimension n by T
- family: The family of the downstream glm process that one intends to use.
- lib size: The library size of each spot. Either a scalar or a vector of length n
- min reads per 1000: The minimum reads per 1000 that we want to test for. This is only relevant for data selection later.
- min freq: How many non-zero observations a column in the expanded covariate matrix needs to be valid. Default is 500.
expanded_X = sparrow::expand_covariate_matrix(X = data$X, lambda = data$lambda,
family = "poisson",lib_size = 1000,
min_reads_per_1000 = 1000*exp(-5))
The output is a list with two entries. Valid is a matrix of dimension by that tells us which cell type-covariate combinations have sample size that exceed the min frequency. X holds the expanded covariate matrix, removing the invalid cell type-covariate combinations.
Step 3: Compute Target Standard Error of Subsample
The goal of SPARROW is to identify a subsample of the data such that
the subsampled data and the original data have similar power. This is
done by comparing the standard errors in the subsample and the standard
errors in the original dataset. Key to this procedure is specifying a
minimum effect size
.
This is the minimum effect size that one cares about identifying. For
example, in the case of poisson regression, one may not care about
identifying log fold changes less than a certain amount. In this case
let us choose
.
For covariate
,
let its standard error on the entire dataset be
.
Under some assumptions one can compute the power when the standard error
is
and the effect size is
,
.
Define
as the average power the interval
.
is defined as
If , . Given , we can compute where is a constant close to 1 (e.g. 0.99). represents the largest standard error at which we recover at least of the power that the original data has for covariate . Below we compute the target standard errors when . To compute we us the “compute target standard error” function. This takes 3 arguments
Arguments
- X: The expanded covariate matrix. Of dimension n by p
- min effect: The smallest effect size we wish to detect
- target power approximation: The power we wish to retain compared to the standard error on the entire dataset be
target_standard_errors = sparrow::compute_target_standard_error(X = expanded_X,
min_effect = 0.05,target_power_approx = 0.99)
The output is a vector of target standard errors.
Step 4: Subsampling via SPARROW (~1 minute)
SPARROW is a subsampling algorithm that attempts to minimize the trace of . Here is a submatrix of containing rows. This is done via stochastic greedy selection which can identify solutions that are at least efficient. A natural stopping criterion is when
Note that when
,
the amount of data needed to achieve the target standard error plateaus.
This is what characterizes to the ultra scalability of SPARROW.
To run data selection we use the “data selection” function. It takes 5
arguments
Arguments
- X: The expanded covariate matrix. Of dimension p by n
- data size: The maximum subsample size we want to take
- min SE: A vector of the target standard errors
- log: Logical of if we should log progress of data selection
- period: If log is TRUE, how often we want to print progress in iterations
selected_indices = sparrow::data_selection(X = t(expanded_X),
max_data_size = 80000,
min_standard_error = target_standard_errors,
log = TRUE,period = 5000)
selected_indices = selected_indices[is.na(selected_indices) == F]
print(paste0("#Cells Subsampled: ", length(selected_indices)))
The output is a vector of indices in the order they are selected. Therefore if one initially picks a subset of size 100,000 but then wants to downsample to 50,000 observations, they simply have to take the first 50,000 indices of the output. Note here that we select approximately 30,000 out of 100,000 total spots.
Step 4: Example Downstream Analysis (~5 minuts)
To test the performance of our subsampled dataset, we will run spotGLM on both the original dataset and the subsampled data. We see that the power on both datasets are near identical.
nsim = 25
full_power = rep(NA,nsim)
subset_power = rep(NA,nsim)
full_time = rep(NA,nsim)
subset_time = rep(NA,nsim)
t1 = Sys.time()
for(j in c(1:nsim)){
#simulate new response vector
expected_value = spotglm::spot_poisson$predict(X = data$X,beta = data$beta,
lambda = data$lambda,
offset = rep(log(1000),
length(data$y)))
expected_value = expected_value$total
data$y = rpois(1e5,lambda = expected_value)
#run spotglm on original dataset
model_full = spotglm::run_model(y = data$y, X = data$X, lambda = data$lambda,
offset = rep(log(1000),length(data$y)),
family = "spot poisson",initialization = T,
batch_size = 500,n_epochs = 100,
improvement_threshold = 1e-6,max_conv = 20)
#run spotglm on subsetted data
model_subset = spotglm::run_model(y = data$y[selected_indices],
X = data$X[selected_indices,],
lambda = data$lambda[selected_indices,],
offset=rep(log(1000),length(data$y))[selected_indices],
family = "spot poisson",initialization = T,
batch_size = 500,n_epochs = 100,
improvement_threshold = 1e-6,max_conv = 20)
#compute power
full_sig = abs(model_full$beta_est/model_full$stand_err_mat)>1.96
subset_sig = abs(model_subset$beta_est/model_subset$stand_err_mat)>1.96
full_power[j] = sum(full_sig * (1-data$null_beta))/sum(1-data$null_beta)
subset_power[j] = sum(subset_sig * (1-data$null_beta))/sum(1-data$null_beta)
#sompute time
full_time[j] = model_full$time
subset_time[j] = model_subset$time
}
print(Sys.time() - t1)
print(paste0("Mean Power on Full Data: ", round(mean(full_power),3)))
print(paste0("Mean Power on Subsetted Data: ", round(mean(subset_power),3)))