SCMER package contents
- class scmer.Comparison
Bases: object
Methods for comparing gene sets
- static compare(y_true: list | set, y_pred: list | set, for_print=True)
Compare two gene sets
- Parameters:
y_true – gene set 1 (the reference gene set)
y_pred – gene set 2 (the predicted gene set)
- Returns:
[number of overlapping genes, number of genes in gene set, number of genes in prediction, list of overlapping genes]
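The documented return value can be illustrated with a small standalone function. This is a minimal sketch, not the library's implementation: the name compare_gene_sets is hypothetical, and sorting the overlap is an assumption made here so the output is deterministic.

```python
def compare_gene_sets(y_true, y_pred):
    """Return [n_overlap, n_true, n_pred, overlapping genes].

    Hypothetical stand-in that mirrors the return shape described above.
    """
    true_set, pred_set = set(y_true), set(y_pred)
    # Sorting is an assumption of this sketch; it only makes the result stable.
    overlap = sorted(true_set & pred_set)
    return [len(overlap), len(true_set), len(pred_set), overlap]


result = compare_gene_sets(["CD3E", "CD4", "CD8A"], {"CD4", "CD8A", "MS4A1"})
print(result)  # [2, 3, 3, ['CD4', 'CD8A']]
```

Both lists and sets are accepted, matching the documented signature.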
- static make_recall_curve(y_true, y_pred)
- static read_gmt(file: str, keep_description: bool = False)
Read gene set(s) in GMT format: https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29
- Parameters:
file – gmt file name/path
keep_description – whether to also return the description of gene sets
- Returns:
genesets as in {‘pathway1’: [gene1, gene2, …], ‘pathway2’: [gene3, gene4, …], …} (and if applicable, descriptions as in {‘pathway1’: ‘description1’, ‘pathway2’: ‘description2’, …})
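For readers unfamiliar with the format: each GMT line is tab-separated, with the gene set name first, a description second, and the genes after that. The sketch below parses that layout into the dictionaries described above. It is an illustration only; parse_gmt is a hypothetical name, and skipping blank or malformed lines is an assumption of this sketch rather than documented behavior of read_gmt.

```python
import io


def parse_gmt(handle, keep_description=False):
    """Parse GMT text from an open file-like object (hypothetical helper)."""
    genesets, descriptions = {}, {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        # Assumption: silently skip blank lines and lines without a description.
        if len(fields) < 2 or not fields[0]:
            continue
        name, description, genes = fields[0], fields[1], fields[2:]
        genesets[name] = [gene for gene in genes if gene]
        descriptions[name] = description
    if keep_description:
        return genesets, descriptions
    return genesets


example = io.StringIO(
    "pathway1\tdescription1\tgene1\tgene2\n"
    "pathway2\tdescription2\tgene3\tgene4\n"
)
genesets, descriptions = parse_gmt(example, keep_description=True)
print(genesets)      # {'pathway1': ['gene1', 'gene2'], 'pathway2': ['gene3', 'gene4']}
print(descriptions)  # {'pathway1': 'description1', 'pathway2': 'description2'}
```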
- class scmer.TsneL1(*, w: float | str | list | ndarray = 'ones', lasso: float = 0.0001, n_pcs: int | None = None, perplexity: float = 30.0, use_beta_in_Q: bool = True, max_outer_iter: int = 5, max_inner_iter: int = 20, owlqn_history_size: int = 100, eps: float = 1e-12, verbosity: int = 2, torch_precision: int | str | dtype = 32, torch_cdist_compute_mode: str = 'use_mm_for_euclid_dist', t_distr: bool = True, n_threads: int = 1, use_gpu: bool = False, pca_seed=0, ridge=0.0)
Bases: _ABCSelector
TsneL1 model
- Parameters:
w – initial value of w, weight of each marker. Acceptable values are ‘ones’ (all 1), ‘uniform’ (random [0, 1] values), float numbers (all set to that number), or a list or numpy array with specific numbers.
lasso – lasso strength
n_pcs – Number of PCs used to generate P matrix. Skip PCA if set to None.
perplexity – perplexity of t-SNE modeling
use_beta_in_Q – whether to use the cell-specific sigma^2 (1 / beta) calculated from P in Q.
max_outer_iter – number of iterations of OWL-QN
max_inner_iter – number of iterations inside OWL-QN
owlqn_history_size – history size for OWL-QN.
eps – epsilon for considering a value to be 0.
verbosity – verbosity level (0 ~ 2).
torch_precision – The dtype used inside the torch model. By default, torch.float32 (a.k.a. torch.float) is used. However, if precision becomes an issue, torch.float64 may be worth trying. You can input 32, “32”, 64, or “64”.
torch_cdist_compute_mode – compute mode for torch.cdist. The default, “use_mm_for_euclid_dist”, (dramatically) improves performance. However, if numerical stability becomes an issue, “donot_use_mm_for_euclid_dist” may be used instead. This option does not affect distances computed outside of PyTorch, e.g., matrix P; only matrix Q is affected.
t_distr – By default, use t-distribution (1. / (1. + pdist2)) for Q. Use Normal distribution instead (exp(-pdist2)) if set to False.
n_threads – number of threads (currently only for calculating P and beta)
use_gpu – whether to use GPU to train the model.
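The two kernels named under t_distr differ mainly in their tails. The sketch below evaluates both formulas on a single squared distance; it is an illustration of the expressions quoted above, not the model's code. The function name q_similarity is hypothetical, and normalization across cell pairs and any beta scaling (see use_beta_in_Q) are deliberately left out.

```python
import math


def q_similarity(sq_dist, t_distr=True):
    """Unnormalized similarity for one squared pairwise distance.

    Hypothetical helper: covers only the two kernel formulas from the
    t_distr description, without normalization or beta scaling.
    """
    if t_distr:
        return 1.0 / (1.0 + sq_dist)  # t-distribution kernel (heavy tail)
    return math.exp(-sq_dist)  # Normal kernel (light tail)


for d2 in (0.0, 1.0, 4.0):
    print(d2, q_similarity(d2), q_similarity(d2, t_distr=False))
```

At a squared distance of 4, the t-distribution kernel still gives 0.2, while the Normal kernel has already decayed to exp(-4), which is below 0.02.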
- fit(X, *, X_teacher=None, batches=None, P=None, beta=None, must_keep=None)
Select markers from one dataset to keep the cell-cell similarities in the same dataset
- Parameters:
X – data matrix (cells (rows) x genes/proteins (columns))
X_teacher – get target similarities from this dataset
batches – (optional) batch labels
P – The P matrix, if calculated in advance
beta – The beta associated with P, if calculated in advance
must_keep – A boolean vector indicating if a feature must be kept. Those features will have a fixed weight 1.
- Returns:
- fit_transform(X, **kwargs)
Fit on a matrix / AnnData and then transform it.
- Parameters:
X – The matrix / AnnData to be transformed
kwargs – Other parameters for TsneL1.fit().
- Returns:
Shrunk matrix / AnnData
- get_mask()
Get the feature selection mask. For AnnData in scanpy, it can be used as adata[:, model.get_mask()]
- Returns:
mask
- transform(X)
Shrink a matrix / AnnData object with full markers to the selected markers only. If this operation is not supported by your data object, you can do it manually using get_mask().
- Parameters:
X – Matrix / AnnData to be shrunk
- Returns:
Shrunk matrix / AnnData
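When a data object does not support this operation directly, the manual route amounts to keeping only the columns where the mask is True. The sketch below does this for a plain list-of-rows matrix. The helper apply_mask and the toy mask are hypothetical; with a fitted model the mask would come from get_mask().

```python
def apply_mask(rows, mask):
    """Keep only the columns whose mask entry is True (hypothetical helper)."""
    return [[value for value, keep in zip(row, mask) if keep] for row in rows]


# Toy data: 2 cells (rows) x 3 markers (columns).
X = [[1.0, 0.0, 3.0],
     [4.0, 5.0, 6.0]]
mask = [True, False, True]  # stand-in for model.get_mask()

shrunk = apply_mask(X, mask)
print(shrunk)  # [[1.0, 3.0], [4.0, 6.0]]
```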
- classmethod tune(target_n_features, X=None, *, X_teacher=None, batches=None, P=None, beta=None, must_keep=None, perplexity=30.0, n_pcs=None, w='ones', min_lasso=1e-08, max_lasso=0.01, tolerance=0, smallest_log10_fold_change=0.1, max_iter=100, return_P_beta=False, n_threads=6, **kwargs)
Automatically find a lasso strength that yields the desired number of markers
- Parameters:
target_n_features – number of features
return_P_beta – controls what to return
kwargs – all other parameters are the same as for a TsneL1 model or TsneL1.fit().
- Returns:
if return_P_beta is True and there are batches, (model, X, P, beta); if return_P_beta is True and there are no batches, (model, P, beta); otherwise, only model by default.
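The parameter names (min_lasso, max_lasso, tolerance, smallest_log10_fold_change, max_iter) suggest a bracketing search over the lasso strength on a log10 scale. The sketch below shows one plausible reading of such a search as a bisection. It is an assumption-laden illustration, not the package's algorithm: count_selected is a hypothetical callback standing in for "fit a model with this lasso and count the selected features", and the search assumes that a stronger lasso never selects more features.

```python
import math


def tune_lasso(count_selected, target_n_features, min_lasso=1e-8, max_lasso=1e-2,
               tolerance=0, smallest_log10_fold_change=0.1, max_iter=100):
    """Bisect log10(lasso) until the feature count is close to the target."""
    low, high = math.log10(min_lasso), math.log10(max_lasso)
    lasso = 10 ** ((low + high) / 2)
    for _ in range(max_iter):
        mid = (low + high) / 2
        lasso = 10 ** mid
        n_selected = count_selected(lasso)
        if abs(n_selected - target_n_features) <= tolerance:
            break  # close enough to the target
        if n_selected > target_n_features:
            low = mid  # too many features: try a stronger lasso
        else:
            high = mid  # too few features: try a weaker lasso
        if high - low < smallest_log10_fold_change:
            break  # bracket too narrow to be worth refining
    return lasso


# Toy stand-in: a feature "survives" if its importance exceeds the lasso.
importances = [3e-3, 3e-4, 3e-5, 3e-6, 3e-7]


def count_selected(lasso):
    return sum(value > lasso for value in importances)


best = tune_lasso(count_selected, target_n_features=2)
print(best, count_selected(best))
```

In the toy example the search stops at a lasso between 3e-5 and 3e-4, where exactly two features survive.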
- class scmer.UmapL1(*, w: float | str | list | ndarray = 'ones', lasso: float = 0.0001, n_pcs: int | None = None, perplexity: float = 30.0, use_beta_in_Q: bool = True, max_outer_iter: int = 5, max_inner_iter: int = 20, owlqn_history_size: int = 100, eps: float = 1e-12, verbosity: int = 2, torch_precision: int | str | dtype = 32, torch_cdist_compute_mode: str = 'use_mm_for_euclid_dist', t_distr: bool = True, n_threads: int = 1, use_gpu: bool = False, pca_seed: int = 0, ridge: float = 0.0, _keep_fitting_info: bool = False)
Bases: _BaseSelector
UmapL1 model
- Parameters:
w – initial value of w, weight of each marker. Acceptable values are ‘ones’ (all 1), ‘uniform’ (random [0, 1] values), float numbers (all set to that number), or a list or numpy array with specific numbers.
lasso – lasso strength (i.e., strength of L1 regularization in elastic net)
n_pcs – Number of PCs used to generate P matrix. Skip PCA if set to None.
perplexity – perplexity of t-SNE modeling
use_beta_in_Q – whether to use the cell-specific sigma^2 (1 / beta) calculated from P in Q.
max_outer_iter – number of iterations of OWL-QN
max_inner_iter – number of iterations inside OWL-QN
owlqn_history_size – history size for OWL-QN.
eps – epsilon for considering a value to be 0.
verbosity – verbosity level (0 ~ 2).
torch_precision – The dtype used inside the torch model. By default, torch.float32 (a.k.a. torch.float) is used. However, if precision becomes an issue, torch.float64 may be worth trying. You can input 32, “32”, 64, or “64”.
torch_cdist_compute_mode – compute mode for torch.cdist. The default, “use_mm_for_euclid_dist”, (dramatically) improves performance. However, if numerical stability becomes an issue, “donot_use_mm_for_euclid_dist” may be used instead. This option does not affect distances computed outside of PyTorch, e.g., matrix P; only matrix Q is affected.
t_distr – By default, use t-distribution (1. / (1. + pdist2)) for Q. Use Normal distribution instead (exp(-pdist2)) if set to False. The latter is not stable.
n_threads – number of threads (currently only for calculating P and beta)
use_gpu – whether to use GPU to train the model.
pca_seed – random seed used by PCA (if applicable)
ridge – ridge strength (i.e., strength of L2 regularization in elastic net)
_keep_fitting_info – if True, write similarity matrix P to self.P and PyTorch model to self.model
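The lasso and ridge parameters are described as the L1 and L2 strengths of an elastic net. The sketch below shows the textbook form of that penalty on a weight vector w. The exact scaling used inside the model (for example, whether the L2 term carries a factor of 1/2) is not stated here, so treat the formula and the name elastic_net_penalty as assumptions.

```python
def elastic_net_penalty(w, lasso=1e-4, ridge=0.0):
    """Textbook elastic-net penalty: lasso * sum|w| + ridge * sum(w^2).

    Hypothetical helper; the model's internal scaling may differ.
    """
    l1_term = sum(abs(weight) for weight in w)
    l2_term = sum(weight * weight for weight in w)
    return lasso * l1_term + ridge * l2_term


w = [1.0, 0.0, 2.0]
print(elastic_net_penalty(w, lasso=0.5, ridge=0.25))  # 2.75
```

With ridge left at its default of 0.0, the penalty reduces to a pure L1 (lasso) term, which is what drives weights to exactly zero.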
- fit(X, *, X_teacher=None, batches=None, P=None, beta=None, must_keep=None)
Select markers from one dataset to keep the cell-cell similarities in the same dataset
- Parameters:
X – data matrix (cells (rows) x genes/proteins (columns))
X_teacher – get target similarities from this dataset
batches – (optional) batch labels
P – The P matrix, if calculated in advance
beta – The beta associated with P, if calculated in advance
must_keep – A boolean vector indicating if a feature must be kept. Those features will have a fixed weight 1.
- Returns:
- fit_transform(X, **kwargs)
Fit on a matrix / AnnData and then transform it.
- Parameters:
X – The matrix / AnnData to be transformed
kwargs – Other parameters for UmapL1.fit().
- Returns:
Shrunk matrix / AnnData
- get_mask(target_n_features=None)
Get the feature selection mask. For AnnData in scanpy, it can be used as adata[:, model.get_mask()]
- Parameters:
target_n_features – If None, all features with w > 0 are selected. If not None, only the target_n_features features with the largest w are selected.
- Returns:
mask
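The two selection modes can be sketched on a plain list of weights. This is a minimal illustration under stated assumptions, not the library's code: mask_from_weights is a hypothetical name, ties are broken by feature order, and this sketch does not exclude zero-weight features when target_n_features exceeds the number of positive weights.

```python
def mask_from_weights(w, target_n_features=None):
    """Build a boolean selection mask from feature weights (hypothetical helper)."""
    if target_n_features is None:
        # Default mode: keep every feature with a positive weight.
        return [weight > 0 for weight in w]
    # Top-k mode: rank feature indices by weight, largest first.
    ranked = sorted(range(len(w)), key=lambda i: w[i], reverse=True)
    keep = set(ranked[:target_n_features])
    return [i in keep for i in range(len(w))]


w = [0.0, 0.8, 0.1, 0.5]
print(mask_from_weights(w))                       # [False, True, True, True]
print(mask_from_weights(w, target_n_features=2))  # [False, True, False, True]
```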
- transform(X, target_n_features=None, **kwargs)
Shrink a matrix / AnnData object with full markers to the selected markers only. If this operation is not supported by your data object, you can do it manually using get_mask().
- Parameters:
X – Matrix / AnnData to be shrunk
target_n_features – If None, all features with w > 0 are selected. If not None, only the target_n_features features with the largest w are selected.
- Returns:
Shrunk matrix / AnnData
- classmethod tune(target_n_features, X=None, *, X_teacher=None, batches=None, P=None, beta=None, must_keep=None, perplexity=30.0, n_pcs=None, w='ones', min_lasso=1e-08, max_lasso=0.01, tolerance=0, smallest_log10_fold_change=0.1, max_iter=100, return_P_beta=False, n_threads=6, **kwargs)
Automatically find a lasso strength that yields the desired number of markers
- Parameters:
target_n_features – number of features
return_P_beta – controls what to return
kwargs – all other parameters are the same as for a UmapL1 model or UmapL1.fit().
- Returns:
if return_P_beta is True and there are batches, (model, X, P, beta); if return_P_beta is True and there are no batches, (model, P, beta); otherwise, only model by default.
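Because the return shape depends on both return_P_beta and whether batches were supplied, calling code may find it convenient to normalize the result. The helper below is a hypothetical convenience wrapper, not part of the package, and the strings in the example are stubs standing in for real objects.

```python
def unpack_tune_result(result, return_P_beta=False, has_batches=False):
    """Map the three documented return shapes onto one dict (hypothetical helper)."""
    if not return_P_beta:
        return {"model": result}
    if has_batches:
        model, X, P, beta = result
        return {"model": model, "X": X, "P": P, "beta": beta}
    model, P, beta = result
    return {"model": model, "P": P, "beta": beta}


# Stub values in place of a fitted model and matrices.
with_batches = unpack_tune_result(("model", "X", "P", "beta"),
                                  return_P_beta=True, has_batches=True)
print(sorted(with_batches))  # ['P', 'X', 'beta', 'model']
```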