Machine Learning Notes
From I571/ChE531 2007 Class Wiki
Machine Learning Methods
A nice introduction to machine learning in general can be found at MIT.
The most often used machine learning methods in cheminformatics are currently:
- Neural Nets, particularly Self Organizing / Kohonen Maps - see NCI SOM. Regular nets can be feed-forward (the connections form a directed acyclic graph) or recurrent (cycles are allowed, so arbitrary structures can form)
- Bayesian Analysis - see J Med Chem article about classifying Kinase inhibitors
- Recursive Partitioning - a kind of Decision Tree. See original JCICS article
- Random Forests - see JCICS article and recent solubility paper - an example of Bagging
- Support Vector Machines - see aqueous solubility paper in JCICS
Prediction and interpretability
Linear and classification-based methods are generally more interpretable than non-linear algorithmic ones. Interpretability is a product of the descriptor set used and the ability of the method to demonstrate why a decision was made in terms of the descriptors. Prediction involves both accuracy and confidence.
Best practices for choosing models
- Any method will work well given the right dataset!
- No method will always work well over a realistic diversity of datasets
- Whether something works is a function of the model type and descriptors used
- Always consider the relative importance of predictive ability and interpretability. Remember that the end point is usually a human being!
- A simple strategy:
  - Analyze the diversity of the set to see if it should be pre-clustered, or only related structures selected (MIMS > 0.55?)
  - Develop a linear model (e.g. PLS) to get initial ideas about important descriptors
  - Use a random forest (good predictive ability, minimal parameter tweaking, reasonable interpretability, resistance to overfitting, no need for feature selection), or a Bayesian-derived model as a second choice
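The diversity check in the first step of the strategy could be sketched roughly as follows. The notes do not define MIMS, so this sketch assumes it means a mean pairwise similarity and uses Tanimoto similarity on fingerprints stored as sets of on-bit indices; both are assumptions. The fingerprints are made-up values, and the names `tanimoto`, `mean_pairwise_similarity` and `THRESHOLD` are hypothetical.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    return len(a & b) / len(a | b)


def mean_pairwise_similarity(fingerprints):
    """Average Tanimoto similarity over all distinct pairs in the set."""
    pairs = list(combinations(fingerprints, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Made-up fingerprints: three related structures and one outlier.
fps = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},
    {1, 2, 4, 5},
    {7, 8, 9},
]

THRESHOLD = 0.55  # cutoff taken from the notes; its exact meaning is assumed
score = mean_pairwise_similarity(fps)
needs_clustering = score < THRESHOLD  # low similarity suggests pre-clustering
print(f"mean pairwise similarity = {score:.2f}, pre-cluster: {needs_clustering}")
```

A set that falls below the cutoff would then be clustered, or trimmed to related structures, before any model is built.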
Choosing Descriptors for QSAR
There are an awful lot of descriptors available, and many of them are highly correlated. They may be experimental values (e.g. logP), predicted values (e.g. ClogP), structure-based (fingerprints, topological indices), concrete (e.g. presence of a functional group) or abstract (e.g. chi-squared), 1D (global properties), 2D or 3D (geometric, electronic, etc.). See all the ones in Molconn-Z and Dragon for a start. Mixing binary and non-binary information can be problematic. Descriptors may come with 100% confidence (e.g. presence of a structural feature), a supplied confidence (e.g. an error term from a predictive method for logP) or unknown confidence. It is not clear how these should be mixed together for QSAR models.
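Since many descriptors are highly correlated, one common preprocessing step (not prescribed by these notes) is to drop descriptors that nearly duplicate one already kept. A minimal sketch, assuming a Pearson cutoff of 0.9 and made-up descriptor values; the name `drop_correlated`, the `n_rings` column and the cutoff are illustrative choices, not recommendations from the notes.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column is treated as uncorrelated here
    return sxy / sqrt(sxx * syy)


def drop_correlated(columns, cutoff=0.9):
    """Greedy filter: keep a descriptor only if |r| <= cutoff against all kept ones."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= cutoff for k in kept):
            kept.append(name)
    return kept


# Made-up values for five compounds.
descriptors = {
    "logP": [1.0, 2.0, 3.0, 4.0, 5.0],
    "ClogP": [1.1, 2.0, 3.2, 3.9, 5.1],  # nearly duplicates logP
    "n_rings": [2, 1, 3, 1, 2],
}
kept = drop_correlated(descriptors)
print(kept)
```

The filter is order dependent: the first descriptor of a correlated pair is the one that survives, so putting the more interpretable or more trusted descriptors first is a reasonable design choice.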
Descriptors may be chosen by hand (useful if there is prior knowledge of important features, but subject to errors of omission) or automatically (which can be stepwise or stochastic). Stepwise selection is generally based on creating linear regression models with various sets of descriptors. Its problems are that it is tied to linear regression, produces only a single result, and is of questionable reliability. Stochastic methods generally produce a set of good descriptor subsets, but can't guarantee finding the best model. Popular methods are the same as the popular general optimization methods: genetic algorithms, simulated annealing, tabu search, etc. These all work fairly well, and are not restricted to a single model type (e.g. with a GA, the fitness function is simply the predictive ability of the model based on those descriptors).
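A GA-based descriptor selection of the kind described above might look roughly like this. It is a minimal sketch rather than any published method: the model is a 1-nearest-neighbour predictor scored by leave-one-out squared error (chosen only to show that the fitness function need not be linear regression), the data are synthetic, and all names and parameter values (`loo_error`, `ga_select`, population size, mutation rate) are hypothetical.

```python
import random


def loo_error(X, y, mask):
    """Leave-one-out mean squared error of a 1-nearest-neighbour predictor
    that only looks at the descriptors switched on in mask."""
    cols = [j for j, on in enumerate(mask) if on]
    if not cols:
        return float("inf")  # an empty descriptor subset cannot predict anything
    total = 0.0
    for i, xi in enumerate(X):
        nearest = min(
            (k for k in range(len(X)) if k != i),
            key=lambda k: sum((xi[j] - X[k][j]) ** 2 for j in cols),
        )
        total += (y[i] - y[nearest]) ** 2
    return total / len(X)


def ga_select(X, y, n_generations=20, pop_size=12, mutation_rate=0.1, seed=0):
    """Evolve boolean descriptor masks; lower loo_error means fitter."""
    rng = random.Random(seed)
    n = len(X[0])
    pop = [tuple(rng.random() < 0.5 for _ in range(n)) for _ in range(pop_size)]
    history = []  # best error seen in each generation
    for _ in range(n_generations):
        pop.sort(key=lambda m: loo_error(X, y, m))
        history.append(loo_error(X, y, pop[0]))
        parents = pop[: pop_size // 2]
        children = [pop[0]]  # elitism: the best mask survives unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = tuple((not g) if rng.random() < mutation_rate else g for g in child)
            children.append(child)
        pop = children
    best = min(pop, key=lambda m: loo_error(X, y, m))
    history.append(loo_error(X, y, best))
    return best, history


# Synthetic data: activity depends on descriptors 0 and 2; the rest are noise.
data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(5)] for _ in range(30)]
y = [2.0 * row[0] + row[2] for row in X]

best_mask, history = ga_select(X, y)
print("selected descriptors:", [j for j, on in enumerate(best_mask) if on])
```

Swapping in a different model only means replacing `loo_error`; the search itself never looks inside the model. Because the search is stochastic, different seeds can return different subsets of similar quality, which matches the point above that these methods yield a set of good subsets rather than a guaranteed best one.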
Recent research shows that for corporate datasets, the qualities of the dataset (size, range of activities) are probably the most important, and reconfirms that not all descriptors work well on all datasets.
Best practice for choosing descriptors
- Choose model type first, based on desired interpretability, applicability, etc.
- Generate as many descriptors as possible, but restrict based on desired interpretability
- Use a stochastic descriptor selection method to automatically select best descriptors for the model