SCMER package contents

class scmer.Comparison

Bases: object

Methods for comparing gene sets

static compare(y_true: list | set, y_pred: list | set, for_print=True)

Compare two gene sets

Parameters:
  • y_true – gene set 1 (the reference gene set)

  • y_pred – gene set 2 (the predicted gene set)

Returns:

[number of overlapping genes, number of genes in gene set, number of genes in prediction, list of overlapping genes]
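The return value described above can be illustrated with a minimal sketch. The function name overlap_summary is hypothetical and this is not SCMER's implementation; it only assumes that the comparison is a plain set intersection and that duplicates are ignored.

```python
def overlap_summary(y_true, y_pred):
    """Hypothetical stand-in for a gene-set comparison; not SCMER's code."""
    true_set, pred_set = set(y_true), set(y_pred)
    overlap = sorted(true_set & pred_set)  # sorted only to get a stable order
    # Same layout as the documented return value:
    # [n overlapping, n in gene set, n in prediction, overlapping genes]
    return [len(overlap), len(true_set), len(pred_set), overlap]


reference = ["CD3E", "CD4", "CD8A", "MS4A1"]
selected = ["CD4", "MS4A1", "NKG7"]
summary = overlap_summary(reference, selected)
print(summary)
```

Whether the real method sorts the overlapping genes or preserves input order is not stated in this reference.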

static make_recall_curve(y_true, y_pred)
static read_gmt(file: str, keep_description: bool = False)

Read gene set(s) in GMT format; see https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29

Parameters:
  • file – gmt file name/path

  • keep_description – whether to also return the description of gene sets

Returns:

genesets as in {‘pathway1’: [gene1, gene2, …], ‘pathway2’: [gene3, gene4, …], …} (and if applicable, descriptions as in {‘pathway1’: ‘description1’, ‘pathway2’: ‘description2’, …})
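For orientation, a GMT file is tab-delimited with one gene set per line: name, description, then the genes. The sketch below parses such lines into the dictionaries described above. The helper parse_gmt_lines is hypothetical and works on an iterable of lines rather than a file path; it is not SCMER's read_gmt.

```python
def parse_gmt_lines(lines, keep_description=False):
    """Hypothetical GMT parser sketch; not SCMER's read_gmt."""
    genesets, descriptions = {}, {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        name, description, genes = fields[0], fields[1], fields[2:]
        genesets[name] = [gene for gene in genes if gene]  # drop empty cells
        descriptions[name] = description
    if keep_description:
        return genesets, descriptions
    return genesets


example = [
    "pathway1\tdescription1\tgene1\tgene2\n",
    "pathway2\tdescription2\tgene3\tgene4\n",
]
genesets, descriptions = parse_gmt_lines(example, keep_description=True)
print(genesets)
print(descriptions)
```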

class scmer.TsneL1(*, w: float | str | list | ndarray = 'ones', lasso: float = 0.0001, n_pcs: int | None = None, perplexity: float = 30.0, use_beta_in_Q: bool = True, max_outer_iter: int = 5, max_inner_iter: int = 20, owlqn_history_size: int = 100, eps: float = 1e-12, verbosity: int = 2, torch_precision: int | str | dtype = 32, torch_cdist_compute_mode: str = 'use_mm_for_euclid_dist', t_distr: bool = True, n_threads: int = 1, use_gpu: bool = False, pca_seed=0, ridge=0.0)

Bases: _ABCSelector

TsneL1 model

Parameters:
  • w – initial value of w, weight of each marker. Acceptable values are ‘ones’ (all 1), ‘uniform’ (random [0, 1] values), float numbers (all set to that number), or a list or numpy array with specific numbers.

  • lasso – lasso strength

  • n_pcs – Number of PCs used to generate P matrix. Skip PCA if set to None.

  • perplexity – perplexity of t-SNE modeling

  • use_beta_in_Q – whether to use the cell-specific sigma^2 (1 / beta) calculated from P in Q.

  • max_outer_iter – number of iterations of OWL-QN

  • max_inner_iter – number of iterations inside OWL-QN

  • owlqn_history_size – history size for OWL-QN.

  • eps – epsilon for considering a value to be 0.

  • verbosity – verbosity level (0 ~ 2).

  • torch_precision – The dtype used inside the torch model. By default, torch.float32 (a.k.a. torch.float) is used. However, if precision becomes an issue, torch.float64 may be worth trying. You can input 32, “32”, 64, or “64”.

  • torch_cdist_compute_mode – compute mode for torch.cdist. By default, “use_mm_for_euclid_dist” is used to (dramatically) improve performance. However, if numerical stability becomes an issue, “donot_use_mm_for_euclid_dist” may be used instead. This option does not affect distances computed outside of PyTorch, e.g., matrix P. Only matrix Q is affected.

  • t_distr – By default, use t-distribution (1. / (1. + pdist2)) for Q. Use Normal distribution instead (exp(-pdist2)) if set to False.

  • n_threads – number of threads (currently only for calculating P and beta)

  • use_gpu – whether to use GPU to train the model.
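The accepted forms of w listed above can be made concrete with a small sketch. The helper init_weights and its seed argument are hypothetical, and plain Python lists stand in for arrays; the library's own handling may differ.

```python
import random


def init_weights(w, n_features, seed=0):
    """Hypothetical helper expanding the documented `w` options; not SCMER's code."""
    if isinstance(w, str):
        if w == "ones":
            return [1.0] * n_features
        if w == "uniform":
            rng = random.Random(seed)  # seed is an assumption for reproducibility
            return [rng.random() for _ in range(n_features)]
        raise ValueError(f"unknown w option: {w!r}")
    if isinstance(w, (int, float)):
        return [float(w)] * n_features  # every marker gets the same weight
    values = [float(v) for v in w]      # list or array of specific numbers
    if len(values) != n_features:
        raise ValueError("w must have one entry per feature")
    return values


print(init_weights("ones", 3))
print(init_weights(0.5, 2))
print(init_weights([0.2, 0.0, 1.0], 3))
```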

fit(X, *, X_teacher=None, batches=None, P=None, beta=None, must_keep=None)

Select markers from one dataset to keep the cell-cell similarities in the same dataset

Parameters:
  • X – data matrix (cells (rows) x genes/proteins (columns))

  • X_teacher – get target similarities from this dataset

  • batches – (optional) batch labels

  • P – The P matrix, if calculated in advance

  • beta – The beta associated with P, if calculated in advance

  • must_keep – A boolean vector indicating if a feature must be kept. Those features will have a fixed weight 1.
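A must_keep vector has one boolean per feature, in the same order as the columns of X. The sketch below builds one from feature names; build_must_keep and the gene names are hypothetical, and whether fit() expects a NumPy array rather than a plain list is not stated here.

```python
def build_must_keep(feature_names, required):
    """Return one boolean per feature: True if the feature must be kept."""
    required = set(required)
    missing = required - set(feature_names)
    if missing:
        raise KeyError(f"not in feature_names: {sorted(missing)}")
    return [name in required for name in feature_names]


feature_names = ["CD3E", "CD4", "CD8A", "MS4A1", "NKG7"]
must_keep = build_must_keep(feature_names, ["CD4", "NKG7"])
print(must_keep)
```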

Returns:

fit_transform(X, **kwargs)

Fit on a matrix / AnnData and then transform it.

Parameters:
  • X – The matrix / AnnData to be transformed

  • kwargs – Other parameters for TsneL1.fit().

Returns:

Shrunk matrix / AnnData

get_mask()

Get the feature selection mask. For AnnData in scanpy, it can be used as adata[:, model.get_mask()]

Returns:

mask

transform(X)

Shrink a matrix / AnnData object with full markers to the selected markers only. If this operation is not supported by your data object, you can do it manually using get_mask().

Parameters:

X – Matrix / AnnData to be shrunk

Returns:

Shrunk matrix / AnnData
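Doing the shrink manually amounts to keeping the columns where the mask is True. A minimal sketch on a list-of-rows matrix follows; apply_mask is a hypothetical helper, and the mask here is written by hand rather than taken from a fitted model.

```python
def apply_mask(rows, mask):
    """Keep only the columns whose mask entry is True."""
    if any(len(row) != len(mask) for row in rows):
        raise ValueError("each row must have one value per mask entry")
    return [[value for value, keep in zip(row, mask) if keep] for row in rows]


X = [[1.0, 0.0, 2.5],
     [0.5, 3.0, 0.0]]          # 2 cells x 3 markers
mask = [True, False, True]     # stand-in for the result of get_mask()
X_small = apply_mask(X, mask)
print(X_small)
```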

classmethod tune(target_n_features, X=None, *, X_teacher=None, batches=None, P=None, beta=None, must_keep=None, perplexity=30.0, n_pcs=None, w='ones', min_lasso=1e-08, max_lasso=0.01, tolerance=0, smallest_log10_fold_change=0.1, max_iter=100, return_P_beta=False, n_threads=6, **kwargs)

Automatically find a lasso strength that yields the target number of markers

Parameters:
  • target_n_features – number of features

  • return_P_beta – controls what to return

  • kwargs – all other parameters are the same for a TsneL1 model or TsneL1.fit().

Returns:

If return_P_beta is True and there are batches, (model, X, P, beta); if return_P_beta is True and there are no batches, (model, P, beta); otherwise, only the model (the default).
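This reference does not describe how the search works. One plausible reading of min_lasso, max_lasso, tolerance, smallest_log10_fold_change and max_iter is a bisection on log10(lasso), sketched below. The function search_lasso, the toy count_features callback, and the interpretation of smallest_log10_fold_change as a minimum interval width are all assumptions, not the library's algorithm.

```python
import math


def search_lasso(count_features, target, min_lasso=1e-8, max_lasso=1e-2,
                 tolerance=0, smallest_log10_fold_change=0.1, max_iter=100):
    """Bisect log10(lasso) until the feature count is within tolerance of target.

    Assumes count_features(lasso) does not increase as lasso grows.
    """
    low, high = math.log10(min_lasso), math.log10(max_lasso)
    lasso, n = None, None
    for _ in range(max_iter):
        mid = (low + high) / 2
        lasso = 10 ** mid
        n = count_features(lasso)
        if abs(n - target) <= tolerance:
            break
        if n > target:
            low = mid    # too many features: try a stronger penalty
        else:
            high = mid   # too few features: try a weaker penalty
        if high - low < smallest_log10_fold_change:
            break        # interval too narrow to keep refining
    return lasso, n


# Toy stand-in for "fit a model and count markers with nonzero weight".
toy_thresholds = [3e-7, 3e-6, 3e-5, 3e-4, 3e-3]


def toy_count(lasso):
    return sum(1 for t in toy_thresholds if t > lasso)


found_lasso, found_n = search_lasso(toy_count, target=2)
print(found_lasso, found_n)
```

In a real run each call to count_features would be a full model fit, which is why a cap such as max_iter matters.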

class scmer.UmapL1(*, w: float | str | list | ndarray = 'ones', lasso: float = 0.0001, n_pcs: int | None = None, perplexity: float = 30.0, use_beta_in_Q: bool = True, max_outer_iter: int = 5, max_inner_iter: int = 20, owlqn_history_size: int = 100, eps: float = 1e-12, verbosity: int = 2, torch_precision: int | str | dtype = 32, torch_cdist_compute_mode: str = 'use_mm_for_euclid_dist', t_distr: bool = True, n_threads: int = 1, use_gpu: bool = False, pca_seed: int = 0, ridge: float = 0.0, _keep_fitting_info: bool = False)

Bases: _BaseSelector

UmapL1 model

Parameters:
  • w – initial value of w, weight of each marker. Acceptable values are ‘ones’ (all 1), ‘uniform’ (random [0, 1] values), float numbers (all set to that number), or a list or numpy array with specific numbers.

  • lasso – lasso strength (i.e., strength of L1 regularization in elastic net)

  • n_pcs – Number of PCs used to generate P matrix. Skip PCA if set to None.

  • perplexity – perplexity of t-SNE modeling

  • use_beta_in_Q – whether to use the cell-specific sigma^2 (1 / beta) calculated from P in Q.

  • max_outer_iter – number of iterations of OWL-QN

  • max_inner_iter – number of iterations inside OWL-QN

  • owlqn_history_size – history size for OWL-QN.

  • eps – epsilon for considering a value to be 0.

  • verbosity – verbosity level (0 ~ 2).

  • torch_precision – The dtype used inside the torch model. By default, torch.float32 (a.k.a. torch.float) is used. However, if precision becomes an issue, torch.float64 may be worth trying. You can input 32, “32”, 64, or “64”.

  • torch_cdist_compute_mode – compute mode for torch.cdist. By default, “use_mm_for_euclid_dist” is used to (dramatically) improve performance. However, if numerical stability becomes an issue, “donot_use_mm_for_euclid_dist” may be used instead. This option does not affect distances computed outside of PyTorch, e.g., matrix P. Only matrix Q is affected.

  • t_distr – By default, use t-distribution (1. / (1. + pdist2)) for Q. Use Normal distribution instead (exp(-pdist2)) if set to False. The latter one is not stable.

  • n_threads – number of threads (currently only for calculating P and beta)

  • use_gpu – whether to use GPU to train the model.

  • pca_seed – random seed used by PCA (if applicable)

  • ridge – ridge strength (i.e., strength of L2 regularization in elastic net)

  • _keep_fitting_info – if True, write similarity matrix P to self.P and PyTorch model to self.model
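The two kernels named under t_distr can be compared directly. The sketch below evaluates the unnormalized similarity for a single squared distance; q_kernel is a hypothetical helper that copies the two formulas from the parameter description and leaves out normalization and beta.

```python
import math


def q_kernel(squared_distance, t_distr=True):
    """Unnormalized similarity for one pair of cells."""
    if t_distr:
        return 1.0 / (1.0 + squared_distance)  # t-distribution kernel
    return math.exp(-squared_distance)         # Normal distribution kernel


for d2 in (0.0, 1.0, 25.0):
    print(d2, q_kernel(d2), q_kernel(d2, t_distr=False))
```

Both kernels equal 1 at zero distance, but the Normal kernel shrinks toward zero much faster as the distance grows.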

fit(X, *, X_teacher=None, batches=None, P=None, beta=None, must_keep=None)

Select markers from one dataset to keep the cell-cell similarities in the same dataset

Parameters:
  • X – data matrix (cells (rows) x genes/proteins (columns))

  • X_teacher – get target similarities from this dataset

  • batches – (optional) batch labels

  • P – The P matrix, if calculated in advance

  • beta – The beta associated with P, if calculated in advance

  • must_keep – A boolean vector indicating if a feature must be kept. Those features will have a fixed weight 1.

Returns:

fit_transform(X, **kwargs)

Fit on a matrix / AnnData and then transform it.

Parameters:
  • X – The matrix / AnnData to be transformed

  • kwargs – Other parameters for UmapL1.fit().

Returns:

Shrunk matrix / AnnData

get_mask(target_n_features=None)

Get the feature selection mask. For AnnData in scanpy, it can be used as adata[:, model.get_mask()]

Parameters:

target_n_features – If None, all features with w > 0 are selected. If not None, select only the target_n_features features with the largest w

Returns:

mask
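The selection rule for target_n_features can be sketched on a plain list of weights. mask_from_weights is a hypothetical helper, not SCMER's code, and it assumes that "largest features" means the features with the largest weights w, with ties broken by position.

```python
def mask_from_weights(w, target_n_features=None):
    """Hypothetical re-creation of the documented selection rule."""
    if target_n_features is None:
        return [value > 0 for value in w]  # every feature with w > 0
    # Indices ordered by decreasing weight; the sort is stable, so ties keep
    # their original order.
    order = sorted(range(len(w)), key=lambda i: w[i], reverse=True)
    chosen = set(order[:target_n_features])
    return [i in chosen for i in range(len(w))]


w = [0.0, 0.8, 0.1, 0.0, 0.5]
print(mask_from_weights(w))
print(mask_from_weights(w, target_n_features=2))
```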

transform(X, target_n_features=None, **kwargs)

Shrink a matrix / AnnData object with full markers to the selected markers only. If this operation is not supported by your data object, you can do it manually using get_mask().

Parameters:
  • X – Matrix / AnnData to be shrunk

  • target_n_features – If None, all features with w > 0 are selected. If not None, select only the target_n_features features with the largest w

Returns:

Shrunk matrix / AnnData

classmethod tune(target_n_features, X=None, *, X_teacher=None, batches=None, P=None, beta=None, must_keep=None, perplexity=30.0, n_pcs=None, w='ones', min_lasso=1e-08, max_lasso=0.01, tolerance=0, smallest_log10_fold_change=0.1, max_iter=100, return_P_beta=False, n_threads=6, **kwargs)

Automatically find a lasso strength that yields the target number of markers

Parameters:
  • target_n_features – number of features

  • return_P_beta – controls what to return

  • kwargs – all other parameters are the same for a UmapL1 model or UmapL1.fit().

Returns:

If return_P_beta is True and there are batches, (model, X, P, beta); if return_P_beta is True and there are no batches, (model, P, beta); otherwise, only the model (the default).
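Because the shape of the return value depends on return_P_beta and on whether batches were given, calling code may want to normalize it. The helper unpack_tune_result below is hypothetical and uses placeholder strings in place of a real model and matrices.

```python
def unpack_tune_result(result, return_P_beta=False, has_batches=False):
    """Normalize the documented return shapes into a dict (hypothetical helper)."""
    if not return_P_beta:
        return {"model": result}           # default: only the model
    if has_batches:
        model, X, P, beta = result         # (model, X, P, beta)
        return {"model": model, "X": X, "P": P, "beta": beta}
    model, P, beta = result                # (model, P, beta)
    return {"model": model, "P": P, "beta": beta}


# Placeholder values stand in for real objects.
print(unpack_tune_result("model"))
print(unpack_tune_result(("model", "P", "beta"), return_P_beta=True))
print(unpack_tune_result(("model", "X", "P", "beta"),
                         return_P_beta=True, has_batches=True))
```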