Posted: January 20, 2014
Modern analytical techniques, including gas chromatography-mass spectrometry (GC-MS), infrared spectroscopy (IR) and two-dimensional gas chromatography (GCxGC), can provide an abundance of detailed data on the nature of complex samples. Chemometric models can then extract the underlying patterns and translate that data into more useful information. However, a main challenge in constructing chemometric models is choosing which variables to incorporate from the thousands or millions that have been collected. Variable selection can be accomplished with objective variable ranking techniques, such as Analysis of Variance (ANOVA), coupled with a method that evaluates the quality of a model. For methods that generate scatter plots of data separated into classes, such as Principal Component Analysis (PCA), that evaluation should objectively compare the distance between clusters relative to their shapes, sizes, and orientations. The distance between the classes, or clusters, gives an objective measure of class separation that can be used to optimize the number of variables to include. Current methods for estimating class separation, however, do not consider the shape and orientation of each class.
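As an illustration of the variable ranking step, the following is a minimal sketch of ranking variables by their one-way ANOVA F-ratio. It assumes the data are held as a list of rows (`X`) with one class label per row (`y`); the function names `f_ratio` and `rank_variables` are hypothetical and are not taken from the researchers' software.

```python
from statistics import mean


def f_ratio(groups):
    """One-way ANOVA F statistic for a single variable.

    `groups` is a list of lists: the values of the variable for each class.
    Assumes at least two classes and non-zero within-class variance.
    """
    k = len(groups)                      # number of classes
    n = sum(len(g) for g in groups)      # total number of samples
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_variables(X, y):
    """Return the column indices of X, sorted by decreasing F-ratio."""
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        groups = [[row[j] for row, label in zip(X, y) if label == c]
                  for c in classes]
        scores.append(f_ratio(groups))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


# Toy data (made up for illustration): column 0 separates the classes,
# column 1 has identical class means, column 2 is only weakly informative.
X = [[1.0, 5.0, 0.1], [1.2, 4.0, 0.2], [0.9, 6.0, 0.1],
     [3.0, 5.5, 0.2], [3.1, 4.5, 0.1], [2.9, 5.0, 0.2]]
y = ["A", "A", "A", "B", "B", "B"]
ranking = rank_variables(X, y)
```

Variables at the front of `ranking` differ most between classes relative to their within-class spread. The ranking alone does not say how many of them to keep, which is why it has to be paired with a measure of model quality.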
Researchers at the University of Alberta’s Department of Chemistry have developed a novel metric, termed “chemometric resolution”, which compares the separation of clusters of data points while simultaneously considering the shapes of the clusters and their relative orientations. The minimum distance between (or the extent of overlap of) confidence ellipses constructed around clusters of points representing different classes of objects is a key element of chemometric resolution. This metric, in conjunction with an objective variable ranking metric, can be used to automatically determine the optimal number of variables to be included in a chemometric model of a system.
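To illustrate why cluster shape and orientation matter, here is a simplified sketch of an ellipse-aware separation score for two clusters in a two-dimensional scores plot. It is not the published chemometric resolution calculation: instead of the minimum distance between the ellipses, it compares the distance between centroids with the extent of each 95% confidence ellipse along the line joining them. The function names and the use of a chi-square quantile (which treats the sample covariance as known) are assumptions made for this example.

```python
import math


def mean_and_cov(points):
    """Centroid and sample covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def ellipse_radius(cov, direction, level=0.95):
    """Distance from the ellipse centre to its boundary along a unit vector."""
    sxx, sxy, syy = cov
    ux, uy = direction
    det = sxx * syy - sxy * sxy
    # Quadratic form u^T S^-1 u for the 2x2 covariance matrix S.
    q = (syy * ux * ux - 2.0 * sxy * ux * uy + sxx * uy * uy) / det
    # Chi-square quantile with 2 degrees of freedom (closed form).
    c = -2.0 * math.log(1.0 - level)
    return math.sqrt(c / q)


def centroid_line_separation(a, b, level=0.95):
    """Centroid distance divided by the summed ellipse extents along it.

    Values above 1 mean the two ellipses do not overlap along the line
    joining the centroids; values below 1 mean they do.
    """
    (ax, ay), cov_a = mean_and_cov(a)
    (bx, by), cov_b = mean_and_cov(b)
    d = math.hypot(bx - ax, by - ay)
    u = ((bx - ax) / d, (by - ay) / d)
    return d / (ellipse_radius(cov_a, u, level) + ellipse_radius(cov_b, u, level))


# Toy example (made-up points): one elongated cluster, copied and shifted
# by the same distance either across or along its long axis.
base = [(-2.0, 0.0), (-1.0, 0.1), (0.0, -0.1), (1.0, 0.1), (2.0, -0.1)]
shifted_across = [(x, y + 3.0) for x, y in base]
shifted_along = [(x + 3.0, y) for x, y in base]
sep_across = centroid_line_separation(base, shifted_across)
sep_along = centroid_line_separation(base, shifted_along)
```

Both shifted copies sit at the same Euclidean distance from the original cluster, yet `sep_across` is well above 1 and `sep_along` is below 1. A metric based only on the distance between centroids cannot tell these two cases apart.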
When used to construct a PCA model to classify gasoline samples by octane rating, chemometric resolution automatically determined an optimal number of variables to use in the model. In contrast, when a previously described class-separation metric based on Euclidean distances was applied to the same data, the optimum it suggested was severely over-fit and did not identify a set of variables that could be used to classify the samples successfully.
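The automatic selection described above can be pictured as a simple search over the number of top-ranked variables. The sketch below assumes a ranked variable list and a hypothetical `score` callable that builds a model from a subset of variables and returns a class-separation value; neither the function name nor the stopping rule comes from the original work.

```python
def best_variable_count(ranked, score, max_vars=None):
    """Try the top 1, 2, ... ranked variables and keep the best-scoring count.

    `ranked` is a list of variable indices, most discriminating first.
    `score` is a caller-supplied function (hypothetical here) that takes a
    list of variable indices and returns a separation value, higher = better.
    """
    limit = len(ranked) if max_vars is None else min(max_vars, len(ranked))
    best_n, best_score = 0, float("-inf")
    for n in range(1, limit + 1):
        s = score(ranked[:n])
        if s > best_score:
            best_n, best_score = n, s
    return best_n, best_score


# Stand-in scoring function for illustration only: it peaks at three variables.
def toy_score(subset):
    return -(len(subset) - 3) ** 2


best_n, best_value = best_variable_count(list(range(10)), toy_score)
```

In practice `score` would fit the model (for example, a PCA projection) on the chosen variables and evaluate the separation metric on the resulting clusters. The choice of metric determines whether the selected count generalizes or over-fits.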
- New metric simultaneously considers the shapes, sizes, and relative alignments of clusters when measuring the separation between them
- Optimal number of variables can be found even when clusters form highly eccentric ellipses – a situation where other metrics fail
- Reliably ensures distinct class separation in automatically generated chemometric models
This invention will be of interest to researchers building multivariate data analysis and interpretation models in fields such as metabolomics, forensics, process control, pharmaceuticals and earth science.
Technology Management Group
TEC Edmonton – University of Alberta