In this tutorial, we will use a simulated data set to illustrate how SpotGLM works. We use simulations as it is more straightforward to illustrate the probabilistic modeling framework underlying SpotGLM, and allows us to directly compare trade-offs in terms of power and specificity.
Let’s get started!
First, load all required packages:
library(spotglm)
library(sparrow)
the package sparrow allows for fast and scalable computation, as detailed in our paper.
Step 1: Simulate the data
We use the simulate data function in sparrow to generate a virtual spatial transcriptomic data with 100,000 spots. We assume that their are 8 cell types, and that each spot straddles at most 2 cell types. We simulate data according to the model:
Here is the response (e.g. gene expression) for spot , is the covariate vector for spot , indexes across the cell types, is the deconvolution estimate for cell type in spot , and is the effect vector for cell type . Note that each cell type gets a different effect vector. Also note that the columns of are bounded between 0 and 1. Bounding all columns between some interval is recommended to prevent scaling issues and to make interpretation easier.
data = sparrow::simulate_data(n = 1e5, nct = 8,family = "poisson",
effect_range = c(-2,2),
min_effect = 0.05,
intercept_range = c(-6,-4),
library_size = 1000, spot_ct = 2,
p = 8,num_null = 2,prob_ct = NULL)
colnames(data$X) = paste0("effect_",c(1:ncol(data$X))-1)
colnames(data$X)[1] = "intercept"
colnames(data$lambda) = paste0("cell_type_",c(1:ncol(data$lambda)))
Step 2: Formulating a SpotGLM
SpotGLM is a method for fitting model of the form
where
Here index a cell type and is the weighting for cell type in spot . is the weight vector for cell type and is the covariate vector for spot . In general,to specify a spotGLM one needs to first specify single cell level models, that is models that would be used if the data were of single cell resolution. Then they relate the response of a spot to that of the cell types that belong to the spot.
For example, consider a single cell level generalized linear model where for a cell Here is the cell type identity of cell . Common formulations of include (ordinary least squares), (poisson or negative binomial regression), or (logistic regression). For now let us assume that we are using a poisson or negative binomial model as is common in spatial transcriptomic data. In this case we have
Where is the library size of cell .
Now we need to relate a spot’s response to the responses of the cells
that lay in the spot. Accounting for the library size of spot
,
,
we can model the response of spot
to be the weighted average of the cells that belong to the spot. Let
be the expected library size of a cell of type
.
Then
where
Here, is the library size for spot , can be thought of as the deconvolution amount of cell type in spot . This form matches the spotGLM form and thus we can use the spotGLM package to fit this model.
Fitting SpotGLM
We fit a spotGLM using therun_model()
function. This
function takes several arguments, although most are pre-specified. The
main arguments are as follows
Arguments
-
y
: A vector of observed response values (e.g., gene expression or spatial counts).
-
X
: A design matrix of covariates or predictors.
-
family
: A description of the error distribution and link function to be used in the model (e.g.,"spot poisson"
or"spot binomial"
). -
offset
: For poisson/negative binomial models: An optional numeric vector to be used as an offset in the model (useful for normalization or exposure adjustment).Generally this is set to be the log of the library size for each spot. -
initialization
: Boolean if initialization via single cell approximation should be performed. Default TRUE. -
learning_rate
: The learning rate used for gradient descent. It is advised to pick a large value since the learning rate changes dynamically throughout fitting. Default value is 1. -
n_epochs
: The number of epochs to use in fitting. Default value 50. -batch_size
: Batch size to use in mini-batch gradient descent. Default value 128 -max_diff
: Stopping criterion: If change in likelihood between epochs is less than max diff, terminate fitting. Default value 1-1e-6
The run_model()
function returns a list with the
following elements:
Output
-
beta_estimate
: Estimated regression coefficients. Of dimension #Columns of X by #Cell Types -
standard_error_matrix
: Standard error matrix for beta coefficients -
time
: Time elapsed to fit model. Not including initialization -
disp
: The dispersion parameter (for negative binomial or gaussian models) -
converged
: Logical flag indicating whether the optimization algorithm converged -
likelihood
: The log-likelihood of the fitted model -
vcov
: The variance covariance matrix of the beta coefficients -
niter
: Number of epochs used -
fixed coef
: Which coefficients has insufficient sample size or expression to be fit
In the case where a spot binomial model is ran, an additional weight
argument must be specified. The weight argument in a standard logistic
regression corresponds to the number of trials for each response
variable. The same interpretation is used for the spot binomial model.
Additionally, if one knows wants to set
where
is a vector of length equal to the number of cell types
,
then one can set the lambda
argument equal to
and set the ct_cov_weights
argument equal to
.
In this case, spotGLM will normalize each row of
to sum to 1. This can be useful for fitting many models serially, where
where
does not change on each iteration (e.g. deconvolution estimates) but
does (e.g. the weighting depends on the relative gene expression across
cell types).
If a linear spotGLM, known as a “spot gaussian” model is fit, one can
simply use the spot_lm
function. The spot_lm
function is a direct analog of the standard lm function in R and take
the following arguments.
Arguments
-
y
: A vector of observed response values (e.g., gene expression or spatial counts).
-
X
: A design matrix of covariates or predictors.
-
lambda
: The deconvolution matrix. -
fix_coef
: A matrix of dimension
spot_lm
function returns a list with the following
elements:
Output
-
beta
: Estimated regression coefficients. Of dimension #Columns of X by #Cell Types -
standard_error_matrix
: Standard error matrix. -
sigma.sq
: Residual variance estimate. -
vcov
: The Variance-covariance matrix of the beta coefficients
model = spotglm::run_model(y = data$y,X = data$X,lambda = data$lambda,
family = "spot poisson",offset=rep(log(1000),
length(data$y)),n_epochs = 100,batch_size= 500,
learning_rate = 1,max_diff = 1-1e-6,
initialization = T)
Finally, we recommend scaling features to be in the range [0,1] or [-1,1] if possible.
Downstream Analyses
Interpreting Outputs
The beta estimates can be found via beta_estimate
index.
We see that the column names correspond to cell types and the row names
correspond to the effect names. To find the estimated effect vector
,
one simply needs to look at the relevant column of the matrix.
model$beta_estimate
The standard error matrix gives the corresponding standard errors of
each estimated coefficient and can be found in the
standard_error_matrix
index. it has the same structure as
the coefficient matrix.
model$standard_error_matrix
The vcov
index gives us the variance covariance matrix
of the beta coefficients. It is of dimension
where
is the number of cell types and
is the number of features. The columns correspond to the coefficients
traversing column wise throughout the coefficient matrix. Therefore, the
first 9 columns of the variance-covariance matrix correspond to the the
first column of the coefficient matrix.
model$vcov[1:9,1:9]
Assessing Significance
One can compute pvalues and qvalues for testing significance via the
compute_significance
function. This function takes the
following arguments:
Arguments
-
input_list
: A list of spotglm outputs. Each element of the list should contain a list that holds the beta estimates and standard errors. -
cell_type
: The cell type of interest -
effect_names
: The effects we wish to assess significance for in the cell type of interest. -
beta_name
: Where the beta coefficient estimate matrix is held in the each element of the input list. Default isbeta_estimate
-
standard_error_name
: Where the beta coefficient estimate matrix is held in the each element of the input list. Default isstd_error_mat
-
sided
: Should we test for a one sided (1) or two-sided (2) difference in effects? Default value is 2 -
direction
: If one sided, should we test for positive effects(effect > 0) or negative?
The output is a matrix with 8 columns; the name of the response, the cell type,the effect, the sidedness of the test, the direction of the test,the test statistic,the pvalue, and the qvalue. The qvalue is designed to be used in scenarios where one is testing across many responses (e.g. genes) and is interested in assessing significance for a cell type specific effect across all responses.Because the input must be a list of lists, for this example we use list(model) as an argument. Note that if the input list has no names, the names are replaced with the prefix “Test”. For easy interpretation of results it is suggested that one name the list.
spotglm::compute_significance(input_list = list(model),
cell_type = "cell_type_1",
"effect_name" = "effect_1",sided = 2)
Contrasts
Another downstream analysis one can use is a contrast test. This test
whether
for some cell type
and effects
and
.
One can compute pvalues and qvalues for testing significance via the
compute_contrast_significance
function. This function takes
the following arguments:
Arguments
-
input_list
: A list of spotglm outputs. Each element of the list should contain a list that holds the beta estimates and the variance-covariance matrix. -
cell_type
: The cell type of interest -
effect_names
: A vector of length two that lists the effects we wish to compare in the cell type of interest. -
beta_name
: Where the beta coefficient estimate matrix is held in the each element of the input list -
covariance_name
: Where the variance-covariance matrix for beta coefficient estimate matrix is held in the each element of the input list. Default isvcov
-
sided
: Should we test for a one sided (1) or two-sided (2) difference in effects? Default is 2. -
direction
: If one sided, should we test for positive effects(effect 1 > effect 2) or negative?
The output is a matrix with 9 columns; the name of the response, the cell type,the effects, the sidedness of the test, the direction of the test,the test statistic, the pvalue, and the qvalue.
spotglm::compute_contrast_significance(input_list = list(model),
cell_type = "cell_type_1",
"effect_name" = c("effect_1","effect_2"),
sided = 2)