Basic Usage
The examples below assume that your dataset is a scanpy/AnnData
object named adata.
First, import the module:
from scmer import UmapL1
Then, if you want to train the model with a given strength of l1-regularization:
model = UmapL1(lasso=1e-3).fit(adata.X)
Or, if you want to keep a specific number of features:
model_20 = UmapL1.tune(target_n_features=20, X=adata.X)
This performs a binary search over the strength of l1-regularization to find a value that yields the desired number of features.
To retain only the selected markers:
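The binary search described above can be sketched in a few lines. This is an illustrative sketch only, not the library's implementation: `tune_lasso`, `count_features`, and `toy_count` are hypothetical names, the search is assumed to run on a log10 scale, and the stopping rules are one guess at what `tolerance`, `smallest_log10_fold_change`, and `max_iter` in the `tune` signature mean. The toy callback stands in for actually training a model.

```python
import math


def tune_lasso(count_features, target, min_lasso=1e-8, max_lasso=1e-2,
               tolerance=0, smallest_log10_fold_change=0.1, max_iter=100):
    """Binary-search an L1 strength on a log10 scale (illustrative sketch).

    count_features(lasso) is a hypothetical callback that trains a model with
    the given strength and returns the number of features with nonzero weight.
    It is assumed to be non-increasing in lasso.
    """
    lo, hi = math.log10(min_lasso), math.log10(max_lasso)
    best = None
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        lasso = 10 ** mid
        n = count_features(lasso)
        # Remember the candidate closest to the target seen so far.
        if best is None or abs(n - target) < abs(best[1] - target):
            best = (lasso, n)
        if abs(n - target) <= tolerance:
            break
        if n > target:
            lo = mid  # too many features kept: regularize more strongly
        else:
            hi = mid  # too few features kept: regularize less strongly
        if hi - lo < smallest_log10_fold_change:
            break  # the search interval is too narrow to be worth refining
    return best


def toy_count(lasso):
    """Toy stand-in for training: a stronger penalty keeps fewer features."""
    return max(0, int(round(-20 * math.log10(lasso) - 40)))


lasso, n_features = tune_lasso(toy_count, target=20,
                               smallest_log10_fold_change=0.01)
```

Searching on a log scale is a natural choice here because useful regularization strengths span several orders of magnitude (1e-8 to 1e-2 by default).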
selected_adata = model.transform(adata)
Note that the model has a space complexity of O(n^2), where n is the number of cells. We therefore recommend subsampling your data to 5,000 to 10,000 cells. Please refer to the “Advanced” section for running on more cells.
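As a rough illustration of the subsampling advice, the sketch below draws a random subset of cells with NumPy and estimates the size of a single dense n-by-n matrix. The helper names `subsample_cells` and `pairwise_matrix_gib` are hypothetical, the random matrix stands in for adata.X, and the 8-bytes-per-value figure assumes float64 storage; the model's real memory footprint may differ.

```python
import numpy as np


def subsample_cells(X, n_cells=5000, seed=0):
    """Return a random subset of rows (cells) of X and the chosen row indices."""
    rng = np.random.default_rng(seed)
    n_total = X.shape[0]
    if n_total <= n_cells:
        idx = np.arange(n_total)  # nothing to drop
    else:
        idx = np.sort(rng.choice(n_total, size=n_cells, replace=False))
    return X[idx], idx


def pairwise_matrix_gib(n_cells, bytes_per_value=8):
    """Approximate size in GiB of one dense n-by-n matrix (float64 assumed)."""
    return n_cells ** 2 * bytes_per_value / 2 ** 30


X = np.random.default_rng(1).random((20000, 50))  # stand-in for adata.X
X_small, idx = subsample_cells(X, n_cells=5000)

# The same integer indices can be used to subset the AnnData object itself,
# e.g. adata[idx], so that expression values and annotations stay aligned.
```

Under these assumptions a single pairwise matrix for 10,000 cells takes about 0.75 GiB, and the cost grows fourfold each time the number of cells doubles.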
Advanced
Marker transferring
To use one set of markers (e.g., mRNA) to fit the cell-cell similarity defined by another set of markers (e.g., protein), pass the latter as X_teacher:
model.fit(rna_adata.X, X_teacher=protein_adata.X)
Batch stratification
To find markers that are important in multiple samples (batches), you
can specify batches in fit():
model.fit(rna_adata.X, batches=rna_adata.obs['batch'].values)
The dataset will be split by the given batches, and the loss will be the sum of the losses on the resulting subsets. In this way, the model will not be lured by markers that merely separate the batches.
Incidentally, this approach also reduces the memory requirement. If a dataset with n cells is separated into b batches, the space complexity is reduced from O(n^2) to O(b * (n/b)^2) = O(n^2 / b). Thus, if subsampling is not desired, you may randomly separate the dataset into several batches. (That said, do not define the batches as the cell type labels or any category that is biologically meaningful.)
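The memory argument above can be checked with a small calculation. The sketch below assigns cells to near-equal random batches and compares n^2 with the sum of squared batch sizes. `random_batches` is a hypothetical helper, and the cell and batch counts are arbitrary example values.

```python
import numpy as np


def random_batches(n_cells, n_batches, seed=0):
    """Assign each cell to one of n_batches random groups of near-equal size."""
    rng = np.random.default_rng(seed)
    labels = np.arange(n_cells) % n_batches  # balanced labels 0..n_batches-1
    rng.shuffle(labels)                      # randomize which cell gets which label
    return labels


n, b = 40000, 4
batches = random_batches(n, b)
sizes = np.bincount(batches, minlength=b).astype(np.int64)

full_cost = n ** 2                       # entries in one n-by-n matrix
batched_cost = int((sizes ** 2).sum())   # entries summed over per-batch matrices
ratio = full_cost / batched_cost         # reduction factor, equal to b for equal batches
```

The resulting label array can then be passed as the batches argument of fit(), as in the example above. Because the labels are random, they carry no biological meaning, which is what the caveat about cell type labels asks for.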
Predetermined markers
If there are markers that you think should be considered with priority,
there are two ways to indicate/enforce it.
1. Use a vector as the parameter lasso, and set the corresponding entries to 0. In this way, you remove l1-regularization for those genes.
model = UmapL1(lasso=[0., 0., 1e-5, 1e-5, 1e-5, ...])
model.fit(rna_adata.X)
2. Set must_keep to nonzero values:
model.fit(rna_adata.X, must_keep=[1., 1., 0., 0., 0., ...])
If you wish to use both, the lasso parameter should only contain entries whose must_keep status is zero. For example:
model = UmapL1(lasso=lasso[must_keep == 0])
model.fit(rna_adata.X, must_keep=must_keep)
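In practice the lasso and must_keep vectors are usually built from gene names rather than typed by hand. The sketch below shows one way to do that with NumPy. The gene names, the priority list, and the base strength of 1e-5 are made-up example values, and `gene_names` stands in for something like the variable names of the AnnData object.

```python
import numpy as np

# Hypothetical example values; in practice gene_names would come from the dataset.
gene_names = np.array(["CD3E", "CD4", "CD8A", "MS4A1", "NKG7"])
priority = ["CD3E", "CD4"]
base_lasso = 1e-5

is_priority = np.isin(gene_names, priority)

# Option 1: a per-gene lasso vector with no penalty on the priority genes.
lasso_vector = np.where(is_priority, 0.0, base_lasso)

# Option 2: a must_keep vector with nonzero entries for the priority genes.
must_keep = is_priority.astype(float)

# Using both: the lasso vector only covers genes whose must_keep entry is zero.
lasso_for_both = lasso_vector[must_keep == 0]
```

Deriving both vectors from the same boolean mask keeps them aligned with the column order of the expression matrix.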
Tuning
UmapL1.tune(cls, target_n_features,
X=None, X_teacher=None, batches=None, P=None, beta=None, perplexity=30., n_pcs=None, w=None,
min_lasso=1e-8, max_lasso=1e-2, tolerance=0, smallest_log10_fold_change=0.1, max_iter=100,
**kwargs)
All other parameters of scmer.UmapL1 (except for lasso,
which is to be tuned) can also be specified.
Full API
Please refer to the documentation of scmer.UmapL1.
All model parameters
n_pcs: If you want to use PCs to calculate the pairwise distances, specify the number of PCs. If you want to use the expression directly, set it to None. Default: None.
w: Initial value of w. Leave it as None to randomly generate one. Default: None.
owlqn_history_size: History size for OWLQN optimization. Set to a smaller value if you encounter an insufficient memory problem. Default: 100.
n_threads: Number of threads used in calculating pairwise similarity. A linear speed-up is expected, so it is recommended to use all CPUs.