MMEASE: Enhanced Analytical Workflow for Single-cell Metabolomics

Metabolomics, as the youngest of the omics, provides the readout closest to the phenotype and gives the best view of what chemical processes are taking place in cells (Nat Methods. 18: 1452-6, 2021). Now, metabolomics is shifting to the single-cell level (Nat Biotechnol. 42: 159-62, 2024). Moreover, single-cell metabolomics has regarded as one of seven technologies to watch in 2023 (Nature. 613: 794-7, 2023). A complete analysis procedure of single-cell metabolomics consists of data processing (data filtering, imputation, data transformation, normalization, batch effect removal, etc.), cellular heterogeneity analysis (dimensionality reduction, differential analysis, correlation analysis, classification model construction etc.), metabolite annotation and biological interpretation (Nat Methods. 18: 799-805, 2021). Therefore, it is highly necessary to develop a unified and standardized tool to provide the whole process and enough methods for analyzing single-cell metabolomic data.

MMEASE 2.0 is updated to provide entire analytical workflow of single-cell metabolomics from bulk metabolomics. Specifically, MMEASE 2.0 can (a) provide the most comprehensive workflow for enabling data processing, (b) realize systematical analytical functions for both metabolic heterogeneity and functional heterogeneity, and (c) significantly enhance the capability of metabolite annotation using tandem spectra by integrating the well-established MS2 matching similarity method. All in all, MMEASE 2.0 provides a unique online service of whole analytical workflow of single-cell metabolomics.

The Previous Version of MMEASE can be accessed at: https://idrblab.cn/mmease2021/

MMEASE 2.0 is powered by R shiny. It is free and open to all users with no login requirement and can be readily accessed by a variety of popular web browsers (such as, Chrome, Firefox, Edge, and Safari) and operating systems (such as, Linux, MacOS, and Windows).

The R Package of MMEASE 2.0 can be accessed at: https://github.com/mmease2025/mmease/

Thanks a million for using and improving MMEASE 2.0, and please feel free to report any errors to Dr. YANG at yangqx@zju.edu.cn.

High Resolution Metabolite Annotation and Cell-based Biological Interpretation

Metabolite Annotation Using Tandem Spectra

Metabolite annotation for single-cell metabolomics in MMEASE 2.0 is supported by a comprehensive reference spectra database organized into several libraries including biology, lipid, and exposome. These databases are curated from public sources like HMDB, MoNA, LipidBlast, GNPS, and KEGG, with MS2 spectra fragments annotated into formulas using BUDDY (Xing, et al. Nat Methods. 20: 881-90, 2023). Two widely used methods, dot product and spectral entropy, are implemented to evaluate MS2 matching similarity (Pang, et al. Nat Commun. 1:3675, 2024).

Metabolite Annotation for Biological Functions

Metabolites are classified into differential biological or functional groups through literature reviews and information from public sources such as HMDB, T3DB, KEGG, and DrugBank, providing insights into exogenous factors like food, drugs, toxins, and pollutants (Yang, et al. J Proteomics. 232:104023, 2021). In the paste box, two columns separate by spaces are needed. In the uploaded CSV file, two columns separate by commas are needed. The m/z values of the MS2 spectra are in the first column and the relative abundances of the responding m/z values are in the second column. The sample data of tandem spectrum could be downloaded .

Annotation Results for Tandem Spectrum

Ranking First with the Highest Matching Score

Ranking Second with the Highest Matching Score

Ranking Third with the Highest Matching Score

Metabolite Annotation Results based on Matching Score

Metabolit Annotation Table

Table of Contents

1. The Compatibility of Browser and Operating System (OS)

2. Required Formats of the Input Files

3. Step-by-step Instruction on the Usage of MMEASE 2.0

3.1 Uploading the Customized Single-Cell Metabolomic Data or the Sample Data Provided in MMEASE 2.0

3.2 Data Processing for Single-Cell Metabolomics

3.3 Data Analysis for Single-Cell Metabolomics

3.4 Metabolite Annotation for Single-Cell Metabolomics

1. The Compatibility of Browser and Operating System (OS)

MMEASE 2.0 is powered by R shiny. It is free and open to all users with no login requirement and can be readily accessed by a variety of popular web browsers and operating systems as shown below.

2. Required Formats of the Input Files

In general, the file required at the Step 1 of MMEASE 2.0 should be a sample-by-feature matrix in a csv format. In the uploaded dataset, the first 4 columns of the input file contain cell name, class, cell type, and batch, and are kept as "Cell Name","Class", "Cell Type" and "Batch". In the following columns of the input file, metabolites’ raw intensities across all samples are further provided. Chemical formulas of each metabolite are listed in the first row of the csv file. The sample data of single-cell metabolomic data could be downloaded .

3. 3. Step-by-step Instruction on the Usage of MMEASE 2.0

This website is free and open to all users and there is no login requirement, and can be readily accessed by all popular web browsers including Google Chrome, Mozilla Firefox, Safari and Internet Explorer 10 (or later), and so on. Analysis is started by clicking on the "Analysis" panel on the homepage of MMEASE 2.0. The collection of web services and the whole process provided by MMEASE 2.0 can be summarized into 4 steps: (3.1) Upload Single-Cell Metabolomic Data, (3.2) Data Processing for Single-Cell Metabolomics, (3.3) Data Analysis for Single-Cell Metabolomics, and (3.4) Metabolite Annotation for Single-Cell Metabolomics.

3.1 Uploading the Customized Single-Cell Metabolomic Data or the Sample Data Provided in MMEASE 2.0

There is one radio checkboxes in STEP-1 on the left side of the Analysis page. Users can choose to upload their own single-cell metabolomics data or to directly load sample data. After selecting the corresponding radio checkboxes, datasets provided by the users for further analysis can be then directly uploaded by clicking "Browse". Preview of the uploaded data is subsequently provided on the web page.

The sample data are also provided in this step facilitating a direct access of MMEASE 2.0. In this sample dataset, the single-cell metabolomic data collected from 962 HeLa and 692 NIH3T3 cells were selected as the example dataset (Luca Rappez, et al. Nat Methods. 18:799-805,2021). By clicking the Load Data button, the sample dataset selected by the users can be uploaded for further analysis.

3.2 Data Processing for Single-Cell Metabolomics

MMEASE 2.0 provides data processing including Data Filtering, Data Imputation, Data Transformation, Data Normalization, and Batch Correction. Single-cell metabolomic data is filtered when the tolerable percent of missing values in each metabolite is over the set cutoff. The cutoff of the tolerable percent of missing values could be set by users and default value is 0.2 based on "80% rule" in metabolomics (Chen, et al. Anal Chem. 89:5342-8, 2017). A detailed explanation of the methods used in each process is provided in this Manual. After selecting or defining preferred methods or parameters, please proceed by clicking the "PROCESS" button, the summary and visualization of single-cell metabolomic data are automatically generated. All resulting data and figures can be downloaded by clicking the corresponding "Download" button.

Data Filtering

After data filtering, proper handling of missing values is that the shapes of the histogram and density curve become more regular. The processed histogram exhibits a more reasonable range on the x-axis, a frequency distribution on the y-axis that aligns better with expectations, and no noticeable abnormal troughs or empty intervals.

Data Imputation

After data imputation, good performance exhibits that there a significant reduction in blank areas representing missing values, a more complete data matrix, and a data distribution morphology that remains largely consistent before and after imputation.

Data Imputation Method

a. KNN Inputation

KNN imputation (K-nearest Neighbor Imputation) method aims to find k metabolites of interest which are similar to the metabolites with missing value. The similarity is measured by Euclidean distance and the missing value is imputed by the weighted average of those k metabolites (Tang,et al. J Brief Bioinform. 2:621-636 ,2020).In the KNN algorithm, number of neighbors represents the count of nearest data points, which ranges from 1 to 50, and default value is 10. Max no. of missing data allowed in any row represents the maximum percent of missing data allowed in each row. The minimum is 0, the maximum equals to the number of columns, the default is half of the number of columns. Max no. of missing data allowed in any column indicates the maximum number of missing data permitted in each column, which ranges from 0 to the number of rows, and default value is 80% of the number of rows. The largest block of metabolites imputed refers to the maximum number of metabolite blocks imputed, which ranges from 1 to the number of metabolites.

b. 1/5 of Minimum Positive Value

1/5 of Minimum Positive Value of corresponding variables imputation strategy aims to handle missing or zero values with a small value derived from the data itself (1/5 of the smallest positive value of the same variable). (Li, et al. Adv Sci. 9:e2305401, 2024).

Data Transformation

After data transformation, optimal performance demonstrates that histogram more closely resembles a unimodal and roughly symmetric bell-shaped curve of a normal distribution, which exhibits a narrower width and mitigating the influence of outliers represented by isolated bars in the original histogram.

Data Transformation Methods

a. G-log Transformation

G-log transformation applies a generalized logarithm function that incorporates a small constant to handle zero or near-zero values, ensuring numerical stability. By compressing large values and expanding smaller ones, it balances the data scale while preserving the relative differences between variables (Li, et al. Adv Sci. 9:e2305401, 2024).

b. Log2 Transformation

Log2 transformation aims to make the distribution of data more symmetric and closer to normal, which is beneficial for statistical analysis. It applies a base-2 logarithm to each data point and compresses large values while expanding small values, creating a balanced scale (Liu, et al. Anal Chem. 18:7127-7133, 2023).

c. Log10 Transformation

Log10 transformation applies the base-10 logarithm to each data point. It compresses large values and stretches small values to reduce the data range and stabilize variance across datasets with large dynamic ranges (Luca Rappez, et al. Nat Methods. 18:799-805, 2021).

Data Normalization

After data normalization, superior performance exhibits that medians across boxplots converge toward a similar level, box lengths become more consistent, effectively controlling data dispersion and variability between different groups, and outliers are significantly reduced and distributed more uniformly.

Data Normalization Methods

a. Auto Scaling

Auto Scaling is also referred as unit variance scaling, which is a metabolite-based normalization method to adjust metabolite variances. AUT achieves its efficacy by using the standard deviation as scaling factor to scale all of the metabolic signals (Kohl SM, et al. Metabolomics. 8:146-60, 2012).

b. Mean Normalization

Mean Normalization is a sample-based normalization method that eliminates background effects by equalizing the means of intensities across samples, making them comparable (Ejigu BA, et al. OMICS. 17:473-85, 2013;Andjelkovic V, et al. Plant Cell Rep. 25:71-9, 2006).

c. Median Normalization

Median Normalization is one of the commonly used sample-based normalization methods which do not take any internal standards, normalizing each sample to make the median of the metabolite abundances across samples equal to each other (Wang W, et al. Anal Chem. 75:4818-26, 2003).

d. MS Total Useful Signal (MSTUS)

MS Total Useful Signal (MSTUS), also referred as MS total useful signal, is a sample-based normalization method where the intensity of each spectrum is divided by the sum of intensities of all spectra based on the assumption that there is an equivalence between increased intensities and decreased intensities (Saccenti E, et al. J Proteome Res. 16:619-34, 2017;Warrack BM, et al. J Chromatogr B. 877:547-52, 2009).

e. Single Internal Standard Normalization

Single Internal Standard (SIS) normalizes data by subtracting the log abundance of a single internal standard from the log abundances of metabolites in each sample (De Livera AM, et al. Anal Chem. 84:10768-76, 2012;Jonas Gullberg, et al. Anal biochem. 2:283-95, 2004). It assumes that all metabolites experience the same amount of unwanted variation, which can be measured by the internal standard (De Livera AM, et al. Anal Chem. 84:10768-76, 2012).To select the column of single IS (internal standard), IS represents the column number as the reference for data normalization. The value ranges from 1 to the number of columns.

Batch Correction

A successful batch correction should result in data points from different batches being better mixed and no longer distinctly separated in the UMAP plot, which produces a cleaner plot without obvious isolated points, with a more regular distribution of post-correction data points and reduced noise.

Batch Correction Methods

a. ComBat

ComBat is a standard choice for batch correction in omics data analysis. It uses an empirical Bayes framework to adjust for batch effects and removes batch-specific mean and variance while retaining the overall data structure (Luca Rappez,et al. Nat Methods. 18:799-805, 2021;Liu, et al. Anal Chem. 18:7127-7133, 2023). The method can accommodate known covariates (e.g., experimental groups) to prevent overcorrection of biological differences.

b. Limma

Limma is a statistical software package designed for analyzing gene expression data with complex experimental designs and batch effects. Limma fits linear models to the expression data for each gene across samples. It uses empirical Bayes methods to improve variance estimation, particularly for small sample sizes (Biswapriya B Misra, et al. Methods Mol Biol. 2064:191-217, 2020).

3.3 Data Analysis for Single-Cell Metabolomics

Metabolic Heterogeneity

Metabolic Heterogeneity is the variation analysis of metabolic processes and profiles across different cell types, such as T cell and B cell. For this analysis type, MMEASE 2.0 provides the following six analysis processes, Dimensionality Reduction, Differential Analysis, Visualization of Specific Metabolite, Visualization of Multiple Metabolites, Correlation Analysis, and Classification Model.

Functional Heterogeneity

Functional Heterogeneity is the variability analysis of the functional characteristics of cells, particularly in their response to stimuli, biological processes, or environmental changes. For example, cells of the same type exhibit significant differences in metabolites between normal and cancerous tissues. For this analysis type, MMEASE 2.0 also provides the following six analysis processes, Dimensionality Reduction, Differential Analysis, Visualization of Specific Metabolite, Visualization of Multiple Metabolites, Correlation Analysis, and Classification Model.

A detailed explanation of the methods used in each process is provided in this Manual. After selecting or defining preferred methods or parameters, please proceed by clicking the "PROCESS" button, the summary and visualization of single-cell metabolomic data are automatically generated. All resulting data and figures can be downloaded by clicking the corresponding "Download" button.

Dimensionality Reduction

UMAP Visualizations

UMAP is a dimensionality reduction technique used for visualizing high-dimensional data in a lower-dimensional space. It is particularly useful for visualizing complex datasets, such as single-cell metabolomic data, by preserving both local and global structures in the data.

t-SNE Visualizations

t-SNE is used for visualizing high-dimensional data in lower-dimensional spaces, typically 2D or 3D. It focuses on preserving the local structure of the data by minimizing the divergence between probability distributions of high-dimensional and low-dimensional data points.

Differential Analysis

Fold Change

Fold change (FC) is the ratio of mean abundance values between two groups for each metabolite in single-cell metabolomics. FC is to compare the absolute value change between means of two groups. A threshold needs to be defined, if fold change value exceeds this threshold, the metabolite will be reported as significant(Bertini I, et al. Cancer Res. 72(1):356-64, 2012).For fold change, log2FC was utilized to identify the differential metabolites.

One-Way ANOVA

One-Way ANOVA is a commonly used statistical method that compares the means of two or more independent groups to determine if their population means are significantly different (Wu, et al. Biochem Biophys Res Commun. 358:1108-13, 2007).For one-way ANOVA, p value was utilized to identify the differential metabolites.

Orthogonal Partial Least Squares Discriminant Analysis (OPLS-DA)

OPLS-DA improves PLS-DA by using a single component as a predictor for group discrimination, with other components orthogonal to it. It has been widely applied in metabolomics to identify biomarkers in amyotrophic lateral sclerosis and detect diagnostic markers for early-stage ovarian cancer (H Blasco, et al. Eur J Neurol. 2:346-53, 2015).For OPLS-DA, VIP was utilized to identify the differential metabolites.

Partial Least Squares-Discriminant Analysis (PLS-DA)

In metabolomic study, partial least squares-discriminant analysis (PLS-DA) is the most well-known tool to perform classification and regression. One of the perceived advantages of PLS-DA is that it can analyze highly collinear and noisy data (Gromski PS, et al. Anal Chim Acta. 879:10-23, 2015).For PLS-DA, VIP was utilized to identify the differential metabolites.

Variable Selection from Random Forests (VSRF)

VSRF is a recursive feature elimination method that starts with all features and iteratively removes the least important one, based on random forest importance scores, until no features remain. The features are then ranked by the deletion sequence, with the top-ranked feature being the last deleted. RF-RFE has been used to study fatty acid metabolism deregulation in liver diseases (Zhou, et al. Anal Bioanal Chem. 1:203-13, 2012).For VSRF, importance rating of variables was utilized to identify the differential metabolites.

Student’ t-Test

Student’ t-Test is a statistical hypothesis test in which the test statistic follows a Student’ s t-distribution if the null hypothesis is supported. Student’ t-Test has been widely applied in metabolomics analysis to predict early-onset preeclampsia (Ray O Bahado-Singh, et al. Am J Obstet Gynecol. 4:530.e1-530.e10, 2015), and determinate the untargeted metabolomic profile (Arnald Alonso, et al. Front Bioeng Biotechnol. 5:3:23, 2015).For student’ t-test, p value was utilized to identify the differential metabolites.

Kruskal–Wallis Test (KWT)

Kruskal-Wallis Test (KWT) is a non-parametric statistical test and is applied when the goal is to test the difference between multiple samples and the underlying population distributions are nonnormal or unknown (Abenavoli, et al. J Am Osteopath Assoc. 120: 647-54, 2020). A Kruskal-Wallis Test of the feature among various classes reveals a statistically significant difference. In metabolomics studies, Kruskal-Wallis Test is widely applied to identify metabolomic markers (Sawicka-Smiarowska E, et al. J Clin Med. 10: 5074, 2021).For KWT, p value was utilized to identify the differential metabolites.

Support Vector Machine-Recursive Feature Elimination (SVM-RFE)

Support vector machine-recursive feature elimination (SVM-RFE) is an efficient feature selection technique and has shown promising applications in the analysis of the metabolome data. SVM-RFE measures the weights of the features according to the support vectors, noise and non-informative variables in the high dimension data may affect the hyper-plane of the SVM learning model (Lin X, et al. J Chromatogr B. 910: 149-55, 2012). Nowadays, SVM-RFE is widely used in metabolomics studies for revealing metabolite biomarkers (Gromski PS, et al. Anal Chim Acta. 829: 1-8, 2014).For SVM-RFE, in recursive feature elimination, the features corresponding to the highest accuracy are identified as the differential metabolites.

Visualization of Specific Metabolite

UMAP of Specific Metabolite

UMAP of Specific Metabolite refers to a two-dimensional visualization technique using the Uniform Manifold Approximation and Projection (UMAP) algorithm to display the distribution and clustering of specific metabolite across samples or conditions. This approach helps to identify patterns, relationships, and variations in the levels of selected metabolites within complex datasets, providing insights into their roles and significance in biological processes.

Boxlplot of Specific Metabolite

A boxplot of specific metabolite is a graphical representation used to display the distribution of metabolite levels across different groups or conditions. It summarizes the data by showing the median, interquartile range (IQR), and potential outliers, allowing for the comparison of metabolite abundances and the identification of variations or significant differences between groups.

Visualization of Multiple Metabolites

Heatmap of Differential Metabolites

A Heatmap of Differential Metabolites is a graphical representation that displays the intensity of metabolic changes between different conditions or groups, using colors to indicate the levels of metabolites across samples, often highlighting significant differences.

Radar Map of Differential Metabolites

Radar Map is a graphical representation used to display multivariate data, where multiple variables are plotted along axes that radiate from a central point, allowing for a visual comparison of different categories or conditions in a circular format.

Correlation Analysis

Correlation Scatter Matrix

Correlation Scatter Matrix is a grid of scatter plots that shows the relationships between multiple variables. Each plot represents the correlation between two variables, allowing for the visualization of linear relationships and patterns in the data.

Heatmap of Mental Test Correlation

Heatmap of Mental Test Correlation is a graphical representation that visualizes the correlation between the results of multiple mental tests. It uses color gradients to indicate the strength and direction of the relationships, with higher correlations shown in darker or more intense colors.

Classification Model

Adaptive Boosting (AdaBoost)

The idea of boosting is to take a weak classifier, and use it to build a much better classifier, thereby boosting the performance of the weak classification algorithm. The most popular boosting algorithm is AdaBoost because it is adaptive (Dou L, et al. J Proteome Res. 20:191-201, 2021).For AdaBoost method, the parameters are set as default.

Bagging

In bagging, a random sample in the training set is selected with replacement and the individual data points can be chosen more than once. After several data samples are generated, these weak models are then trained independently, and the average of the predictions yield a more accurate estimate (Datta S, et al. BMC Bioinformatics. 11:427, 2010).For bagging method, the parameters are set as default.

Decision Trees (DT)

Decision tree is a reliable and effective decision making technique that provide high classification accuracy with a simple representation of gathered knowledge (Luna JM, et al. Proc Natl Acad Sci U S A. 116:19887-93, 2019). The most commonly used applications of decision trees are data mining and data classification in different areas of medical decision making (Podgorelec V, et al. J Med Syst. 26:445-63, 2002).For decision tree, the number of boosting iterations is set as 10, and other parameters is set as default.

K-Nearest Neighbor (KNN)

KNN is a non-parametric algorithm without any assumption on underlying data (Wang, et al. IEEE Trans Neural Netw Learn Syst. 31:1544-56, 2020). KNN is one of the simplest and most common classifiers and the core of this classifier depends mainly on measuring the distance or similarity between the tested examples and the training examples (Abu Alfeilat HA, et al. Big Data. 7:221-48, 2019. 26:445-63, 2002).For KNN, Minkowski distance is set 1, the kernel is set "triangular", and the parameters are set as default.

Linear Discriminat Analysis (LDA)

LDA is a widely used classification method with ready implementability and close relationships with many modern machine learning techniques (Ye, et al. Neural Netw. 105:393-404, 2018). Now, LDA helps to represent data for more than two classes. Linear discriminant analysis takes the mean value for each class and considers variants in order to make predictions assuming a Gaussian distribution.For LDA, the parameters are set as default.

Naive Bayes (NB)

The Naive Bayes implements the Bayes theorem for the computation and uses class levels for representing as feature values or vectors of predictors for classification (Miasnikof P, et al. BMC Med. 13:286, 2015). Naive Bayes classifiers have high accuracy and speed on large datasets (Zhang, et al. Food Chem Toxicol. 143:111513, 2020).For NB, Laplace is set as 3, indicating positive double controlling Laplace smoothing. The parameters are set as default.

Random Forest (RF)

The random forest classifier consists of multiple decision trees just as a forest has many trees. It makes decision trees based on a random selection of data samples and get predictions from every tree (de Santana FB, et al. Food Chem. 293:323-32, 2019). The random forest classification offers a rapid, sensitive, and accurate solution for identifying signatures in omics data (Roguet A, et al. Microbiome. 6:185, 2018).For RF, importance is set TRUE, indicating importance of predictors are assessed, proximity is set TRUE, indicating proximity measure among the rows is calculated, and the parameters are set as default.

Support Vector Machine (SVM)

SVM is a supervised machine learning algorithm used for both classification and regression. The objective of SVM is to find a hyperplane in an N-dimensional space that distinctly classifies the data points (Nedaie A, et al. Neural Netw . 98:87-101, 2018).For SVM, the kernel is set as "radial", the probability is set as TRUE, and the parameters are set as default.

3.4 Metabolite Annotation for Single-Cell Metabolomics

The Previous Version of MMEASE can be accessed at: https://idrblab.cn/mmease2021/

STEP-1: Dataset Upload

Summary and Visualization of the Uploaded Single-Cell Metabolomic Data

Download the Sample Data of Single-Cell Metabolomics for Testing and File Format Correcting

Summary and Visualization of the Uploaded Single-Cell Metabolomic Data

STEP-2: SCM Data Processing

Processing Method of Single-Cell Metabolomics Data

Summary and Visualization of Single-Cell Metabolomic Data after Data Processing

STEP-3: Cellular Heterogeneity Analysis

Methods of Data Analysis in Single-Cell Metabolomics

Summary and Visualization of Metabolic Heterogeneity for Single-Cell Metabolomics

UMAP of Selected Metabolite

Boxplot Selected Metabolite

Heatmap of Differential Metabolites

Radargram of Differential Metabolites

Correlation Scatter Matrix

Heatmap of Mental Test Correlation

Summary and Visualization of Functional Heterogeneity for Single-Cell Metabolomics