Machine Learning Notes

From I571/ChE531 2007 Class Wiki

Machine Learning Methods

A nice introduction to machine learning in general can be found at MIT.

The most often used machine learning methods in cheminformatics are currently:

Prediction and interpretability

Linear and classification-based methods are generally more interpretable than non-linear algorithmic ones. Interpretability depends both on the descriptor set used and on the ability of the method to show, in terms of those descriptors, why a decision was made. Prediction involves both accuracy and confidence.
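As a rough illustration of why linear models are easy to interpret, the sketch below splits a linear prediction into one additive contribution per descriptor. The function name `explain_linear`, the descriptor names, the coefficients and the intercept are all made up for the example; they are not taken from any fitted model.

```python
def explain_linear(coefs, intercept, x):
    """Return the prediction of a linear model plus each descriptor's share of it."""
    # Each descriptor contributes coefficient * value; the shares add up
    # (together with the intercept) to the final prediction.
    contributions = {name: coefs[name] * x[name] for name in coefs}
    prediction = intercept + sum(contributions.values())
    return prediction, contributions


# Hypothetical coefficients and one hypothetical compound.
coefs = {"logP": 0.5, "n_hbd": -0.25}
compound = {"logP": 2.0, "n_hbd": 4.0}

prediction, contributions = explain_linear(coefs, 1.0, compound)
for name, value in contributions.items():
    print(f"{name}: {value:+.2f}")
print(f"prediction: {prediction:.2f}")
```

A non-linear method generally has no such direct decomposition, which is one reason it is harder to say why it made a particular decision.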

Best practices for choosing models

Choosing Descriptors for QSAR

A very large number of descriptors are available, and many of them are highly correlated. They may be experimental values (e.g. logP), predicted values (e.g. ClogP), structure-based (fingerprints, topological indices), concrete (e.g. presence of a functional group) or abstract (e.g. chi-squared), and 1D (global properties), 2D or 3D (geometric, electronic, etc.). The descriptor sets in Molconn-Z and Dragon are a good place to start. Mixing binary and non-binary information can be problematic. Descriptors may come with 100% confidence (e.g. presence of a structural feature), a supplied confidence (e.g. an error term from a predictive method for logP) or unknown confidence. It is not clear how these should be mixed together in QSAR models.
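One common way to deal with highly correlated descriptors is to drop any descriptor that is nearly collinear with one already kept. The sketch below is a minimal version of that idea using only the standard library. The helper names, the 0.9 threshold and the descriptor values are all illustrative assumptions, not recommendations from these notes.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column is treated as uncorrelated here
    return sxy / sqrt(sxx * syy)


def drop_correlated(columns, threshold=0.9):
    """Greedily keep descriptors whose |r| with every kept one is below threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy descriptor table (made-up values): one list per descriptor, one entry per compound.
descriptors = {
    "logP": [1.2, 2.3, 3.1, 0.5, 4.0],
    "ClogP": [1.3, 2.2, 3.0, 0.6, 4.1],
    "n_rings": [1, 3, 1, 2, 2],
}
kept = drop_correlated(descriptors)
print(kept)
```

The result depends on the order in which descriptors are visited: the first member of a correlated group is the one that survives. In this toy table ClogP is dropped because it tracks logP almost exactly.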

Descriptors may be chosen by hand (useful if there is prior knowledge of important features, but subject to errors of omission) or automatically (by stepwise or stochastic methods). Stepwise selection is generally based on building linear regression models with various sets of descriptors. Its problems include being tied to linear regression, producing only a single result, and limited reliability. Stochastic methods generally produce a set of good descriptor subsets, but cannot guarantee finding the best model. The popular methods are the same as the popular general optimization methods: genetic algorithms, simulated annealing, tabu search, etc. These all work fairly well, and are not restricted to a single model type (e.g. with a GA, the fitness function is simply the predictive ability of the model built on those descriptors).
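The GA approach described above can be sketched as follows. Each individual is a tuple of booleans marking which descriptors are selected, and the fitness function is supplied by the caller, so any model type can be plugged in. Everything here is an assumption made for illustration: the function names, the parameter defaults, and especially `toy_fitness`, which stands in for a real measure such as the cross-validated predictive ability of a model built on the selected descriptors.

```python
import random


def ga_select(n_descriptors, fitness, pop_size=30, generations=40,
              mutation_rate=0.05, top_k=3, seed=0):
    """Return the top_k distinct descriptor masks seen, as (mask, score) pairs."""
    rng = random.Random(seed)
    pop = [tuple(rng.random() < 0.5 for _ in range(n_descriptors))
           for _ in range(pop_size)]
    scores = {}  # cache: every mask evaluated so far -> fitness

    def evaluate(masks):
        for mask in masks:
            if mask not in scores:
                scores[mask] = fitness(mask)

    for _ in range(generations):
        evaluate(pop)
        # Truncation selection: the better half survives and breeds.
        parents = sorted(pop, key=scores.get, reverse=True)[:pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_descriptors)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit independently with probability mutation_rate.
            child = tuple(bit != (rng.random() < mutation_rate) for bit in child)
            children.append(child)
        pop = parents + children

    evaluate(pop)
    best = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [(mask, scores[mask]) for mask in best]


def toy_fitness(mask):
    """Hypothetical stand-in: reward descriptors 0, 2 and 5, penalize subset size."""
    useful = (0, 2, 5)
    return sum(mask[i] for i in useful) - 0.1 * sum(mask)


results = ga_select(8, toy_fitness)
for mask, score in results:
    selected = [i for i, bit in enumerate(mask) if bit]
    print(selected, round(score, 2))
```

Returning several top-scoring masks rather than a single winner reflects the point made above: a stochastic search yields a set of good descriptor subsets, with no guarantee that the best possible one is among them.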

Recent research shows that for corporate datasets, the qualities of the dataset itself (size, range of activities) are probably the most important factors, and reconfirms that not all descriptors work well on all datasets.

Best practice for choosing descriptors

Retrieved from "http://cheminfo.informatics.indiana.edu/djwild/i571wiki2007/index.php/Machine_Learning_Notes"

This page was last modified 12:48, 3 October 2007.

