SAVER Tutorial

2018-08-29

SAVER (Single-cell Analyses Via Expression Recovery) is a method for denoising single-cell RNA sequencing data by borrowing information across genes and cells. Here, we demonstrate how to use the method. For more information, take a look at the Github page and the paper.

Installation

You can install the most recent updates of SAVER from github with:

# install.packages("devtools")
devtools::install_github("mohuangx/SAVER")

To install the stable release of SAVER:

# install.packages("devtools")
devtools::install_github("mohuangx/SAVER@*release")

Getting Started

Once we have the package installed, we can load the package.

library(SAVER)
packageVersion("SAVER")
#> [1] '1.1.0'

In this tutorial, we will run SAVER on the mouse cortex data from Zeisel (2015).

data.path <- "../data/expression_mRNA_17-Aug-2014.txt"

# Need to remove first 10 rows of metadata
raw.data <- read.table(data.path, header = FALSE, skip = 11, row.names = 1,
                check.names = FALSE)
cortex <- as.matrix(raw.data[, -1])

cellnames <- read.table(data.path, skip = 7, nrows = 1, row.names = 1,
                        stringsAsFactors = FALSE)
colnames(cortex) <- cellnames[-1]

dim(cortex)
#> [1] 19972  3005

Preprocessing

The input to SAVER is a matrix of gene expression counts with genes along the rows and cells along the columns. SAVER was developed for use on UMI count data but it seems to work well with non-UMI TPM data. Standard quality control such as removing low quality cells and filtering out genes with low expression should be performed prior to running SAVER.

Here, the Zeisel dataset has already been preprocessed.

Running SAVER

The main function to run SAVER is the saver function. Since SAVER is computationally intensive for large datasets, we recommend running it in parallel on a cluster either by creating your own parallel environment or by specifying ncores. We will run SAVER on the Zeisel dataset across 12 cores.

cortex.saver <- saver(cortex, ncores = 12)
#> 19972 genes, 3005 cells
#> Running SAVER with 12 worker(s)
#> Calculating predictions for 19972 genes using 10971 genes and 3005 cells...
#> Start time: 2018-08-29 14:57:37
#> Estimating finish time...
#> Finished 12/19972 genes. Approximate finish time: 2018-08-29 15:22:51
#> Calculating max cor cutoff...
#> Finished 100/19972 genes. Approximate finish time: 2018-08-29 15:52:05
#> Calculating lambda coefficients...
#> Finished 289/19972 genes. Approximate finish time: 2018-08-29 15:58:53
#> Predicting remaining genes...
#> Finished 10131/19972 genes. Approximate finish time: 2018-08-29 17:09:14
#> Predicting remaining genes...
#> Done!
#> Finish time: 2018-08-29 17:09:30
#> Total time: 2.198162 hours
str(cortex.saver)
#> List of 3
#>  $ estimate: num [1:19972, 1:3005] 0.169 1.561 1.789 0.024 0.522 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:19972] "Tspan12" "Tshz1" "Fnbp1l" "Adamts15" ...
#>   .. ..$ : chr [1:3005] "1772071015_C02" "1772071017_G12" "1772071017_A05" "1772071014_B06" ...
#>  $ se      : num [1:19972, 1:3005] 0.253 0.845 0.83 0.092 0.43 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:19972] "Tspan12" "Tshz1" "Fnbp1l" "Adamts15" ...
#>   .. ..$ : chr [1:3005] "1772071015_C02" "1772071017_G12" "1772071017_A05" "1772071014_B06" ...
#>  $ info    :List of 10
#>   ..$ size.factor : num [1:3005] 1.55 1.56 2.27 2.36 1.54 ...
#>   ..$ maxcor      : num [1:19972] 0.2144 0.1746 0.229 0.2272 0.0796 ...
#>   ..$ lambda.max  : num [1:19972] 0.266 0.303 0.522 0.13 0 ...
#>   ..$ lambda.min  : num [1:19972] 0.0431 0.0644 0.0773 0.0195 0 ...
#>   ..$ sd.cv       : num [1:19972] NA NA NA NA 0 NA NA NA NA NA ...
#>   ..$ pred.time   : num [1:19972] 5.708236 5.712082 5.882065 4.172383 0.000221 ...
#>   ..$ var.time    : num [1:19972] 0.0526 0.0509 0.052 0.0572 0.0161 ...
#>   ..$ cutoff      : num 0.103
#>   ..$ lambda.coefs: Named num [1:2] -1.59 22.88
#>   .. ..- attr(*, "names")= chr [1:2] "(Intercept)" "info$maxcor[pred]"
#>   ..$ total.time  :Class 'difftime'  atomic [1:1] 2.2
#>   .. .. ..- attr(*, "units")= chr "hours"
#>  - attr(*, "class")= chr "saver"

cortex.saver is a saver object with the following components:

If you are only interested in the estimate, you can run saver setting estimates.only = TRUE. For example,

cortex.saver <- saver(cortex, ncores = 12, estimates.only = TRUE)

Using SAVER estimates

The SAVER estimates represent the library size normalized posterior means of the recovered gene expression. They can be used as input to many of the common downstream analysis methods such as Seurat and Monocle.

Additional Settings

Normalization

By default, SAVER takes in an unnormalized count matrix and performs library size normalization during the denoising step. However, if your data is already normalized or normalization is not desired, you can set size.factor = 1. In addition, if you want to use custom cell size factors, you can set size.factor to be a vector of size factors.

Predicting certain genes

SAVER has two main steps: the first is the prediction step and the second is a shrinkage step. The prediction step is the time-consuming part of SAVER. If you are only interested in a handful of genes, you can specify those genes and generate predictions for those genes only and choose to return the entire data matrix or for those genes only. For example,

# Identify the indices of the genes of interest
genes <- c("Thy1", "Mbp", "Stim2", "Psmc6", "Rps19")
genes.ind <- which(rownames(cortex) %in% genes)

# Generate predictions for those genes and return entire dataset
cortex.saver.genes <- saver(cortex, pred.genes = genes.ind,
                            estimates.only = TRUE)

# Generate predictions for those genes and return only those genes
cortex.saver.genes.only <- saver(cortex, pred.genes = genes.ind,
                                 pred.genes.only = TRUE, estimates.only = TRUE)

Using other predictions

The main contribution of SAVER is the shrinkage step after generating the predictions. The shrinkage step allows the algorithm to adaptively evaluate the quality of the predictions and weigh the estimates either towards the predictions or the observed value. SAVER performs a Lasso Poisson regression for each gene using other genes as covariates to generate the predictions. However, if you want to use other predictions such as MAGIC or DCA, you can input the results in matrix format into the mu parameter of the saver function.

Creating your own parallel backend

By default, if you specify ncores without already having a parallel backend set up, saver will make a default cluster using the makeCluster function in the parallel package and register a parallel backend via the registerDoParallel function in the doParallel package. However, if you want to set up a more complicated backend, such as using MPI across nodes, you can do so and run saver without specifying the ncores option.

Combining SAVER objects

Running SAVER on large datasets may run into memory issues, especially when parallelized across cores where each core has limited memory. One solution would be to register a parallel backend using SOCK or MPI to parallelize across nodes. Another solution would be to split the genes into multiple runs using pred.genes and pred.genes.only. For example, say we have a dataset x with 10,000 genes. We can split this up into 4 runs on different machines and combine using the combine.saver function. We recommend setting do.fast = FALSE when splitting the dataset.

saver1 <- saver(x, pred.genes = 1:2500, pred.genes.only = TRUE,
                do.fast = FALSE)
saver2 <- saver(x, pred.genes = 2501:5000, pred.genes.only = TRUE,
                do.fast = FALSE)
saver3 <- saver(x, pred.genes = 5001:7500, pred.genes.only = TRUE,
                do.fast = FALSE)
saver4 <- saver(x, pred.genes = 7501:10000, pred.genes.only = TRUE,
                do.fast = FALSE)

saver.all <- combine.saver(list(saver1, saver2, saver3, saver4))

Sampling example

Oftentimes for downstream analysis, you may want to sample from the posterior distributions to account for estimation uncertainty. This is similar to performing multiple imputation to obtain multiple imputed datasets. To sample from the posterior distributions, we use the function sample.saver, which takes in the saver result and outputs sampled datasets. For example,

samp1 <- sample.saver(saver1, rep = 1, seed = 50)
samp5 <- sample.saver(saver1, rep = 5, seed = 50)

Correlation example

Because the SAVER estimates contain uncertainty, correlations between genes and cells cannot be directly calculated using the SAVER estimates. To adjust for the uncertainty, we provide the functions cor.genes and cor.cells to construct gene-to-gene and cell-to-cell correlation matrices respectively for the SAVER output. These functions take in the saver result and outputs the gene-to-gene or cell-to-cell correlation matrix. For example,

saver1.cor.gene <- cor.genes(saver1)
saver1.cor.cell <- cor.cells(saver1)