SCMER package contents

class scmer.Comparison

Bases: object

Methods for comparing gene sets

static compare(y_true: list | set, y_pred: list | set, for_print=True)

Compare two gene sets

Parameters:

y_true – gene set 1 (the reference gene set)
y_pred – gene set 2 (the predicted gene set)

Returns:

[number of overlapping genes, number of genes in gene set, number of genes in prediction, list of overlapping genes]
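The return shape described above can be illustrated with plain Python sets. This is a minimal sketch of the idea, not the library's implementation; the function name compare_gene_sets and the sorted ordering of the overlap list are assumptions made for the example.

```python
def compare_gene_sets(y_true, y_pred):
    """Return [n_overlap, n_true, n_pred, overlapping genes].

    Hypothetical stand-in for illustration; sorting the overlap is an
    assumption, chosen only to make the output deterministic.
    """
    true_set, pred_set = set(y_true), set(y_pred)
    overlap = sorted(true_set & pred_set)
    return [len(overlap), len(true_set), len(pred_set), overlap]


result = compare_gene_sets(["CD3E", "CD4", "CD8A"], ["CD4", "CD8A", "MS4A1"])
print(result)  # [2, 3, 3, ['CD4', 'CD8A']]
```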

static make_recall_curve(y_true, y_pred)

static read_gmt(file: str, keep_description: bool = False)

Read gene set(s) in GMT format. See https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29

Parameters:

file – gmt file name/path
keep_description – whether to also return the description of gene sets

Returns:

genesets as in {‘pathway1’: [gene1, gene2, …], ‘pathway2’: [gene3, gene4, …], …} (and if applicable, descriptions as in {‘pathway1’: ‘description1’, ‘pathway2’: ‘description2’, …})
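For readers unfamiliar with the format, a GMT file is tab-separated, with one gene set per line: name, description, then the genes. The sketch below parses such lines into the dictionaries described above. It is an illustrative stand-in rather than the package's own reader; the name parse_gmt_lines and the decision to skip malformed lines are assumptions.

```python
def parse_gmt_lines(lines, keep_description=False):
    """Parse GMT-formatted lines (name, description, gene1, gene2, ...).

    Hypothetical helper for illustration only.
    """
    genesets, descriptions = {}, {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # assumption: silently skip blank or malformed lines
        name, description, genes = fields[0], fields[1], fields[2:]
        genesets[name] = [gene for gene in genes if gene]
        descriptions[name] = description
    if keep_description:
        return genesets, descriptions
    return genesets


example = [
    "pathway1\tdescription1\tgene1\tgene2\n",
    "\n",
    "pathway2\tdescription2\tgene3\tgene4\n",
]
genesets, descriptions = parse_gmt_lines(example, keep_description=True)
```

To read from disk, the same function can be fed an open file handle, since iterating over a text file yields its lines.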

class scmer.TsneL1(*, w: float | str | list | ndarray = 'ones', lasso: float = 0.0001, n_pcs: int | None = None, perplexity: float = 30.0, use_beta_in_Q: bool = True, max_outer_iter: int = 5, max_inner_iter: int = 20, owlqn_history_size: int = 100, eps: float = 1e-12, verbosity: int = 2, torch_precision: int | str | dtype = 32, torch_cdist_compute_mode: str = 'use_mm_for_euclid_dist', t_distr: bool = True, n_threads: int = 1, use_gpu: bool = False, pca_seed=0, ridge=0.0)

Bases: _ABCSelector

TsneL1 model

Parameters:

w – initial value of w, weight of each marker. Acceptable values are ‘ones’ (all 1), ‘uniform’ (random [0, 1] values), float numbers (all set to that number), or a list or numpy array with specific numbers.
lasso – lasso strength
n_pcs – Number of PCs used to generate P matrix. Skip PCA if set to None.
perplexity – perplexity of t-SNE modeling
use_beta_in_Q – whether to use the cell-specific sigma^2 (1 / beta) calculated from P in Q.
max_outer_iter – number of iterations of OWL-QN
max_inner_iter – number of iterations inside OWL-QN
owlqn_history_size – history size for OWL-QN.
eps – epsilon for considering a value to be 0.
verbosity – verbosity level (0 to 2).
torch_precision – The dtype used inside the torch model. By default, torch.float32 (a.k.a. torch.float) is used. However, if precision becomes an issue, torch.float64 may be worth trying. You can input 32, “32”, 64, or “64”.
torch_cdist_compute_mode – compute mode for torch.cdist. By default, “use_mm_for_euclid_dist” is used to (dramatically) improve performance. However, if numerical stability becomes an issue, “donot_use_mm_for_euclid_dist” may be used instead. This option does not affect distances computed outside of PyTorch, e.g., matrix P. Only matrix Q is affected.
t_distr – By default, use t-distribution (1. / (1. + pdist2)) for Q. Use Normal distribution instead (exp(-pdist2)) if set to False.
n_threads – number of threads (currently only for calculating P and beta)
use_gpu – whether to use GPU to train the model.
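The two kernels named under t_distr behave quite differently at larger distances. The sketch below evaluates both on a few squared distances. It is a simplified illustration that omits the per-cell beta scaling and the normalization over cell pairs; the helper name q_kernel is an assumption.

```python
import math


def q_kernel(squared_distance, t_distr=True):
    """Unnormalized similarity for one squared pairwise distance (pdist2).

    Hypothetical helper: ignores beta and the normalization step.
    """
    if t_distr:
        return 1.0 / (1.0 + squared_distance)  # heavy-tailed t-distribution kernel
    return math.exp(-squared_distance)  # Normal (Gaussian) kernel


squared_distances = [0.0, 1.0, 4.0, 9.0]
heavy_tail = [q_kernel(d, t_distr=True) for d in squared_distances]
gaussian = [q_kernel(d, t_distr=False) for d in squared_distances]

for d, t_value, g_value in zip(squared_distances, heavy_tail, gaussian):
    print(f"pdist2={d:4.1f}  t-distribution={t_value:.4f}  normal={g_value:.6f}")
```

The Normal kernel decays much faster, so distant pairs contribute similarities very close to zero.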

fit(X, *, X_teacher=None, batches=None, P=None, beta=None, must_keep=None)

Select markers from one dataset to keep the cell-cell similarities in the same dataset

Parameters:

X – data matrix (cells (rows) x genes/proteins (columns))
X_teacher – get target similarities from this dataset
batches – (optional) batch labels
P – The P matrix, if calculated in advance
beta – The beta associated with P, if calculated in advance
must_keep – A boolean vector indicating if a feature must be kept. Those features will have a fixed weight 1.
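A must_keep vector has one entry per feature (column of X). One way to build it from feature names is sketched below. The gene names are made up for the example, and whether fit() accepts a plain Python list or needs a NumPy boolean array is not stated here, so converting the list before passing it may be necessary.

```python
# Hypothetical feature names, in the same order as the columns of X.
feature_names = ["CD3E", "CD4", "CD8A", "MS4A1", "NKG7"]

# Features that should always be retained (illustrative choice).
required = {"CD4", "MS4A1"}

# Guard against typos: every required feature should exist in the data.
missing = required - set(feature_names)
if missing:
    raise ValueError(f"Required features not found: {sorted(missing)}")

# One boolean per feature, True where the feature must be kept.
must_keep = [name in required for name in feature_names]
print(must_keep)  # [False, True, False, True, False]
```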

Returns:

fit_transform(X, **kwargs)

Fit on a matrix / AnnData and then transform it.

Parameters:

X – The matrix / AnnData to be transformed
kwargs – Other parameters for TsneL1.fit().

Returns:

Shrunk matrix / AnnData

get_mask()

Get the feature selection mask. For AnnData in scanpy, it can be used as adata[:, model.get_mask()]

Returns: mask

transform(X)

Shrink a matrix / AnnData object with full markers to the selected markers only. If such operation is not supported by your data object, you can do it manually using get_mask().

Parameters: X – Matrix / AnnData to be shrunk
Returns: Shrunk matrix / AnnData
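If a data container does not support transform(), the manual route amounts to keeping only the columns where the mask is True. A minimal sketch for a row-major list of lists follows; the helper name apply_mask and the example values are assumptions, and the mask here is written by hand rather than obtained from a fitted model.

```python
def apply_mask(matrix, mask):
    """Keep only the columns of a row-major matrix where mask is True.

    Hypothetical helper for containers without boolean column indexing.
    """
    return [[value for value, keep in zip(row, mask) if keep] for row in matrix]


X = [[1.0, 0.0, 3.0],
     [4.0, 5.0, 6.0]]
mask = [True, False, True]  # in practice this would come from get_mask()
X_small = apply_mask(X, mask)
print(X_small)  # [[1.0, 3.0], [4.0, 6.0]]
```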

classmethod tune(target_n_features, X=None, *, X_teacher=None, batches=None, P=None, beta=None, must_keep=None, perplexity=30.0, n_pcs=None, w='ones', min_lasso=1e-08, max_lasso=0.01, tolerance=0, smallest_log10_fold_change=0.1, max_iter=100, return_P_beta=False, n_threads=6, **kwargs)

Automatically find proper lasso strength that returns the preferred number of markers

Parameters:

target_n_features – number of features
return_P_beta – controls what to return
kwargs – all other parameters are the same for a TsneL1 model or TsneL1.fit().

Returns:

if return_P_beta is True and there are batches, (model, X, P, beta); if return_P_beta is True and there are no batches, (model, P, beta); otherwise, only model by default.
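The search strategy behind tune() is not described here. The parameter names (min_lasso, max_lasso, tolerance, smallest_log10_fold_change, max_iter) are consistent with a bounded search on a log10 scale, so the sketch below shows one plausible version: bisection over log10(lasso). This is an assumption, not the documented algorithm, and count_selected is a hypothetical callback standing in for "fit a model and count the features with nonzero weight".

```python
import math


def tune_lasso(target_n_features, count_selected, min_lasso=1e-8, max_lasso=1e-2,
               tolerance=0, smallest_log10_fold_change=0.1, max_iter=100):
    """Bisect log10(lasso) until count_selected(lasso) is close to the target.

    Assumes the number of selected features does not increase as lasso grows.
    """
    low, high = math.log10(min_lasso), math.log10(max_lasso)
    lasso, n_selected = None, None
    for _ in range(max_iter):
        mid = (low + high) / 2
        lasso = 10 ** mid
        n_selected = count_selected(lasso)
        if abs(n_selected - target_n_features) <= tolerance:
            break
        if n_selected > target_n_features:
            low = mid  # too many features: strengthen the penalty
        else:
            high = mid  # too few features: weaken the penalty
        if high - low < smallest_log10_fold_change:
            break  # interval too narrow to be worth refining further
    return lasso, n_selected


def toy_count(lasso):
    # Hypothetical stand-in for model fitting: fewer features survive as lasso grows.
    return max(0, round(-10 * math.log10(lasso) - 20))


best_lasso, n_selected = tune_lasso(25, toy_count)
print(best_lasso, n_selected)
```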

class scmer.UmapL1(*, w: float | str | list | ndarray = 'ones', lasso: float = 0.0001, n_pcs: int | None = None, perplexity: float = 30.0, use_beta_in_Q: bool = True, max_outer_iter: int = 5, max_inner_iter: int = 20, owlqn_history_size: int = 100, eps: float = 1e-12, verbosity: int = 2, torch_precision: int | str | dtype = 32, torch_cdist_compute_mode: str = 'use_mm_for_euclid_dist', t_distr: bool = True, n_threads: int = 1, use_gpu: bool = False, pca_seed: int = 0, ridge: float = 0.0, _keep_fitting_info: bool = False)

Bases: _BaseSelector

UmapL1 model

Parameters:

w – initial value of w, weight of each marker. Acceptable values are ‘ones’ (all 1), ‘uniform’ (random [0, 1] values), float numbers (all set to that number), or a list or numpy array with specific numbers.
lasso – lasso strength (i.e., strength of L1 regularization in elastic net)
n_pcs – Number of PCs used to generate P matrix. Skip PCA if set to None.
perplexity – perplexity of t-SNE modeling
use_beta_in_Q – whether to use the cell-specific sigma^2 (1 / beta) calculated from P in Q.
max_outer_iter – number of iterations of OWL-QN
max_inner_iter – number of iterations inside OWL-QN
owlqn_history_size – history size for OWL-QN.
eps – epsilon for considering a value to be 0.
verbosity – verbosity level (0 to 2).
torch_precision – The dtype used inside the torch model. By default, torch.float32 (a.k.a. torch.float) is used. However, if precision becomes an issue, torch.float64 may be worth trying. You can input 32, “32”, 64, or “64”.
torch_cdist_compute_mode – compute mode for torch.cdist. By default, “use_mm_for_euclid_dist” is used to (dramatically) improve performance. However, if numerical stability becomes an issue, “donot_use_mm_for_euclid_dist” may be used instead. This option does not affect distances computed outside of PyTorch, e.g., matrix P. Only matrix Q is affected.
t_distr – By default, use t-distribution (1. / (1. + pdist2)) for Q. Use Normal distribution instead (exp(-pdist2)) if set to False. The latter is not stable.
n_threads – number of threads (currently only for calculating P and beta)
use_gpu – whether to use GPU to train the model.
pca_seed – random seed used by PCA (if applicable)
ridge – ridge strength (i.e., strength of L2 regularization in elastic net)
_keep_fitting_info – if True, write similarity matrix P to self.P and PyTorch model to self.model
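The lasso and ridge parameters are described as the L1 and L2 strengths of an elastic net. In its generic textbook form, that penalty on the weight vector w is lasso * sum(|w_i|) + ridge * sum(w_i^2). The sketch below computes this generic form; the exact scaling used inside the model is not stated here, so treat the formula and the function name as assumptions.

```python
def elastic_net_penalty(w, lasso=1e-4, ridge=0.0):
    """Generic elastic net penalty: lasso * ||w||_1 + ridge * ||w||_2^2.

    Illustrative only; the library may scale or combine the terms differently.
    """
    l1_term = sum(abs(value) for value in w)
    l2_term = sum(value * value for value in w)
    return lasso * l1_term + ridge * l2_term


penalty = elastic_net_penalty([1.0, -2.0, 0.0], lasso=0.5, ridge=0.25)
print(penalty)  # 0.5 * 3 + 0.25 * 5 = 2.75
```

The L1 term is what drives weights to exactly zero, which is why lasso controls how many markers are kept.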

fit(X, *, X_teacher=None, batches=None, P=None, beta=None, must_keep=None)

Select markers from one dataset to keep the cell-cell similarities in the same dataset

Parameters:

X – data matrix (cells (rows) x genes/proteins (columns))
X_teacher – get target similarities from this dataset
batches – (optional) batch labels
P – The P matrix, if calculated in advance
beta – The beta associated with P, if calculated in advance
must_keep – A boolean vector indicating if a feature must be kept. Those features will have a fixed weight 1.

Returns:

fit_transform(X, **kwargs)

Fit on a matrix / AnnData and then transform it.

Parameters:

X – The matrix / AnnData to be transformed
kwargs – Other parameters for UmapL1.fit().

Returns:

Shrunk matrix / AnnData

get_mask(target_n_features=None)

Get the feature selection mask. For AnnData in scanpy, it can be used as adata[:, model.get_mask()]

Parameters: target_n_features – If None, all features with w > 0 are selected. If not None, only the target_n_features features with the largest weights are selected
Returns: mask
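The two selection modes can be sketched on a plain list of weights. This is an illustration of the described behavior rather than the library's code; the function name feature_mask and the tie-breaking rule (earlier features win among equal weights) are assumptions.

```python
def feature_mask(w, target_n_features=None):
    """Boolean mask over features, given their weights w.

    Hypothetical helper: None keeps every feature with w > 0; otherwise
    keep the target_n_features features with the largest weights.
    """
    if target_n_features is None:
        return [value > 0 for value in w]
    # sorted() is stable, so ties are broken by original position.
    ranked = sorted(range(len(w)), key=lambda i: w[i], reverse=True)
    chosen = set(ranked[:target_n_features])
    return [i in chosen for i in range(len(w))]


w = [0.0, 0.8, 0.1, 0.0, 0.5]
all_nonzero = feature_mask(w)
top_two = feature_mask(w, target_n_features=2)
print(all_nonzero)  # [False, True, True, False, True]
print(top_two)      # [False, True, False, False, True]
```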

transform(X, target_n_features=None, **kwargs)

Shrink a matrix / AnnData object with full markers to the selected markers only. If such operation is not supported by your data object, you can do it manually using get_mask().

Parameters:

X – Matrix / AnnData to be shrunk
target_n_features – If None, all features with w > 0 are selected. If not None, only select target_n_features largest features

Returns:

Shrunk matrix / AnnData

classmethod tune(target_n_features, X=None, *, X_teacher=None, batches=None, P=None, beta=None, must_keep=None, perplexity=30.0, n_pcs=None, w='ones', min_lasso=1e-08, max_lasso=0.01, tolerance=0, smallest_log10_fold_change=0.1, max_iter=100, return_P_beta=False, n_threads=6, **kwargs)

Automatically find proper lasso strength that returns the preferred number of markers

Parameters:

target_n_features – number of features
return_P_beta – controls what to return
kwargs – all other parameters are the same for a UmapL1 model or UmapL1.fit().

Returns:

if return_P_beta is True and there are batches, (model, X, P, beta); if return_P_beta is True and there are no batches, (model, P, beta); otherwise, only model by default.
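Because the return value changes shape with return_P_beta and batches, calling code may find it convenient to normalize it. The helper below is a hypothetical convenience wrapper, not part of the package, and the placeholder strings stand in for real objects.

```python
def unpack_tune_result(result, return_P_beta=False, has_batches=False):
    """Normalize the three documented return shapes into a dict.

    Hypothetical helper; has_batches should be True when batches was passed.
    """
    if not return_P_beta:
        return {"model": result}
    if has_batches:
        model, X, P, beta = result
        return {"model": model, "X": X, "P": P, "beta": beta}
    model, P, beta = result
    return {"model": model, "P": P, "beta": beta}


# Placeholder values stand in for a fitted model and its matrices.
only_model = unpack_tune_result("model")
no_batches = unpack_tune_result(("model", "P", "beta"), return_P_beta=True)
with_batches = unpack_tune_result(("model", "X", "P", "beta"),
                                  return_P_beta=True, has_batches=True)
print(sorted(with_batches))  # ['P', 'X', 'beta', 'model']
```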