Specifying Embedding Methods and Preprocessing Steps

The main functionality of scloop is to provide a unified interface for embedding and clustering methods with a common set of preprocessing steps. The embedding_algs argument is a list of strings, each of which specifies an embedding method and the preprocessing steps to run with it. Each string consists of the method name followed by a + and a comma-separated list of preprocessing steps.
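To make the string format concrete, here is a minimal sketch of how one entry could be split into a method name and its preprocessing steps. The helper parse_embedding_spec is hypothetical and is not part of scloop; it only mirrors the syntax described above, and it assumes that a bare method name with no + means no explicit steps were requested. The example entries are assembled from names listed in the tables on this page.

```python
def parse_embedding_spec(spec):
    """Split 'method+step1,step2' into (method, [step1, step2]).

    Hypothetical helper for illustration; not part of the scloop API.
    """
    method, sep, steps = spec.partition("+")
    method = method.strip()
    if not method:
        raise ValueError(f"missing method name in {spec!r}")
    # Assumption: no '+' (or nothing after it) means no explicit steps.
    step_list = [s.strip() for s in steps.split(",") if s.strip()] if sep else []
    return method, step_list


# Illustrative list in the format described above.
embedding_algs = ["2d_pca+vc_sqrt_norm,random_walk", "scHiCluster"]
parsed = [parse_embedding_spec(s) for s in embedding_algs]
print(parsed)
```

Parameterized steps such as quantile_<q> would come through as plain strings (for example quantile_0.8) and would need a second parsing pass to extract the value.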

List of Embedding Methods

Baselines:

Name

Description

1d_pca

1-dimensional PCA baseline. Takes each contact matrix or preprocessed matrix and aggregates over rows to produce a 1D vector for each cell. Embed using PCA.

1d_lsi

1-dimensional LSI baseline. Same as the 1D PCA baseline, but embeds using LSI instead of PCA.

2d_pca

PCA baseline. Takes each contact matrix or preprocessed matrix and unravels the specified number of strata into a single vector representation for each cell. Embed using PCA

2d_lsi

LSI baseline. Same as PCA baseline but embed using LSI instead

scHiCluster

scHiCluster method for embedding scHi-C data, which applies VC-SQRT normalization, convolution, and random-walk imputation before PCA embedding. These preprocessing steps run by default unless others are specified.

fastHiCRep

Similarity-based method which relies on the HiCRep stratum-adjusted correlation coefficient (SCC) as a distance metric for MDS embedding

InnerProduct

Generalization of fastHiCRep that ignores distance in the correlation computation and simply computes the cosine similarity of each stratum vector

cisTopic

Convert the dataset into a set of discrete locus-pair occurrences (bag-of-words representation) and run Latent Dirichlet Allocation for topic modeling
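The difference between the 1D and 2D baselines can be sketched with NumPy. This is a minimal illustration, not scloop's implementation: the function names are hypothetical, "aggregates over rows" is assumed to mean a row sum, a stratum is taken to be the k-th diagonal of the contact matrix, and PCA is done with a plain SVD.

```python
import numpy as np


def aggregate_1d(mat):
    """1D representation: sum each row of a contact matrix (assumed aggregation)."""
    return np.asarray(mat).sum(axis=1)


def unravel_strata(mat, n_strata):
    """2D representation: concatenate the first n_strata diagonals into one vector."""
    mat = np.asarray(mat)
    return np.concatenate([np.diagonal(mat, offset=k) for k in range(n_strata)])


def pca_embed(features, n_components=2):
    """Plain PCA via SVD of the centered cell-by-feature matrix."""
    centered = features - features.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


rng = np.random.default_rng(0)
# Five toy "cells", each a symmetric 6x6 contact matrix.
cells = [rng.poisson(2.0, size=(6, 6)) for _ in range(5)]
cells = [c + c.T for c in cells]

x1d = np.stack([aggregate_1d(c) for c in cells])       # one value per bin
x2d = np.stack([unravel_strata(c, 3) for c in cells])  # 6 + 5 + 4 values
emb1d = pca_embed(x1d)
emb2d = pca_embed(x2d)
```

The 1D form discards which bins interact and keeps only per-bin coverage, while the strata form keeps interaction structure up to the chosen distance.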

Conventional scRNA-seq/scATAC-seq methods:

Name

Description

scVI

Aggregate each contact matrix into a 1D vector and treat it like a transcription vector. Embed using default scVI model

scVI_2d

Unravel each contact matrix into a 1D vector and embed using default scVI model

peakvi

Aggregate each contact matrix into a 1D vector and treat like a binary peak vector. Embed using PeakVI

peakvi_2d

Unravel each contact matrix into a 1D vector and embed using PeakVI

Deep learning methods:

Name

Description

Higashi

Represent entire dataset as a hypergraph and learn cell node embeddings by training a hypergraph neural network

Fast-Higashi

Higashi model based on tensor decomposition rather than training a hypergraph neural network

3DVI

Train an scVI model on each stratum independently and concatenate the results to obtain the final cell embeddings

VaDE

Variational deep embedding model, trains a VAE with Gaussian mixture prior (if number of clusters is specified). Similar to 3DVI but embeds entire matrix instead of independent strata

Biological feature representations:

Name

Description

deTOKI/deDOC

Identify TADs and domain boundaries in each contact matrix and embed using PCA of domain density vectors

InsScore

Identify domain boundaries by computing the insulation score over a sliding window and embed using PCA of domain density vectors

scGAD

Map each cell to a gene score vector based on a set of known loci. Scores represent z-scores from the BandNorm normalization method
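As an illustration of the sliding-window idea behind InsScore, the sketch below computes a simple insulation score with NumPy. It is not scloop's implementation: the function name is hypothetical, and the score is assumed to be the mean contact between the window of bins upstream and the window of bins downstream of each position. The later step of turning boundaries into domain density vectors is not shown.

```python
import numpy as np


def insulation_score(mat, window):
    """Mean contact between the `window` bins upstream and downstream of each bin."""
    mat = np.asarray(mat, dtype=float)
    n = mat.shape[0]
    scores = np.full(n, np.nan)  # positions without a full window stay NaN
    for i in range(window, n - window):
        scores[i] = mat[i - window:i, i + 1:i + window + 1].mean()
    return scores


# Toy matrix: two 4-bin domains with weak contact between them.
mat = np.full((8, 8), 0.1)
mat[:4, :4] = 1.0
mat[4:, 4:] = 1.0

scores = insulation_score(mat, window=2)
# The lowest score marks the candidate domain boundary.
boundary = int(np.nanargmin(scores))
```

Low scores indicate positions that few contacts cross, which is why local minima are commonly read as domain boundaries.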

List of Preprocessing Steps

Filtering operations:

Name

Description

quantile_<q>

Filter low values based on a quantile cutoff specified by q

min_count_<N>

Filter values with count lower than N
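The two filters can be sketched as follows. This is an assumption-laden illustration rather than scloop's code: the function names are hypothetical, filtered entries are assumed to be set to zero, and the quantile cutoff is assumed to be computed over nonzero entries only.

```python
import numpy as np


def quantile_filter(mat, q):
    """Zero out entries below the q-th quantile of the nonzero values."""
    mat = np.asarray(mat, dtype=float)
    nonzero = mat[mat > 0]
    if nonzero.size == 0:
        return mat.copy()
    cutoff = np.quantile(nonzero, q)
    return np.where(mat >= cutoff, mat, 0.0)


def min_count_filter(mat, n):
    """Zero out entries whose count is lower than n."""
    mat = np.asarray(mat, dtype=float)
    return np.where(mat >= n, mat, 0.0)


contacts = np.array([[0, 1, 2],
                     [1, 0, 4],
                     [2, 4, 0]])
by_quantile = quantile_filter(contacts, 0.5)
by_count = min_count_filter(contacts, 3)
```

In the spec string these would correspond to entries such as quantile_0.5 or min_count_3, following the naming pattern in the table above.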

Normalization operations:

Name

Description

vc_sqrt_norm

Vanilla square-root coverage correction. Normalize by the square root of the row sums and column sums.

oe_norm

Distance correction. Compute the average of each distance stratum as the expected value and return the observed/expected ratios

kr_norm

KR normalization. Convert the contact matrix into a doubly-stochastic matrix using the KR algorithm
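The first two normalizations are simple enough to sketch directly. The functions below are hypothetical illustrations, not scloop's implementation; they assume a square, symmetric contact matrix, leave empty rows, columns, and strata unchanged at zero, and omit KR normalization, which needs an iterative solver.

```python
import numpy as np


def vc_sqrt_norm(mat):
    """Divide each entry by sqrt(row sum * column sum)."""
    mat = np.asarray(mat, dtype=float)
    rows = mat.sum(axis=1)
    cols = mat.sum(axis=0)
    # Avoid division by zero for empty rows or columns.
    rows[rows == 0] = 1.0
    cols[cols == 0] = 1.0
    return mat / np.sqrt(np.outer(rows, cols))


def oe_norm(mat):
    """Divide each stratum (diagonal) by its mean: observed / expected."""
    mat = np.asarray(mat, dtype=float)
    n = mat.shape[0]
    out = np.zeros_like(mat)
    for k in range(n):
        expected = np.diagonal(mat, offset=k).mean()
        if expected > 0:
            idx = np.arange(n - k)
            out[idx, idx + k] = mat[idx, idx + k] / expected
            out[idx + k, idx] = mat[idx + k, idx] / expected
    return out


toy = np.array([[4, 2, 1],
                [2, 4, 2],
                [1, 2, 4]])
vc = vc_sqrt_norm(toy)
oe = oe_norm(toy)
```

In the toy matrix every stratum is constant, so the observed/expected ratio is 1 everywhere; real data would show enrichment or depletion relative to the distance trend.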

Graph operations:

Name

Description

convolution

Perform simple neighbor averaging with a box filter

random_walk

Perform random-walk imputation on the contact matrices

network_enhance

Compute KNN transition matrix for each contact matrix

google

Compute PageRank transition matrix for each contact matrix
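Two of the graph operations can be sketched with NumPy alone. These are illustrative assumptions rather than scloop's code: the function names and parameters are hypothetical, the box filter pads the matrix edges with zeros, and the imputation is assumed to be a random walk with restart on the row-normalized matrix, with an illustrative restart probability and iteration count.

```python
import numpy as np


def convolution(mat, pad=1):
    """Average each entry with its neighbors in a (2*pad+1)-wide box."""
    mat = np.asarray(mat, dtype=float)
    n, m = mat.shape
    size = 2 * pad + 1
    padded = np.pad(mat, pad, mode="constant")  # zero padding at the edges
    out = np.zeros_like(mat)
    for di in range(size):
        for dj in range(size):
            out += padded[di:di + n, dj:dj + m]
    return out / size**2


def random_walk(mat, restart=0.5, n_iter=30):
    """Random walk with restart on the row-normalized contact matrix."""
    mat = np.asarray(mat, dtype=float)
    row_sums = mat.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    p = mat / row_sums  # row-stochastic transition matrix
    eye = np.eye(mat.shape[0])
    q = eye.copy()
    for _ in range(n_iter):
        q = (1 - restart) * q @ p + restart * eye
    return q


sparse = np.array([[2.0, 1.0, 0.0, 0.0],
                   [1.0, 2.0, 1.0, 0.0],
                   [0.0, 1.0, 2.0, 1.0],
                   [0.0, 0.0, 1.0, 2.0]])
smoothed = convolution(sparse)
imputed = random_walk(sparse)
```

Both steps spread signal into empty entries: the box filter borrows from adjacent bins, while the random walk propagates contacts along paths in the contact graph, so distant bin pairs can receive nonzero values.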