- Profiling OpenBabel:
Carry out a performance analysis of OpenBabel. You can examine a
variety of aspects, such as looking for memory leaks and measuring memory
usage with valgrind, or using gprof to
profile the various file format conversion routines and
some specific tasks such as SMARTS matching. If you can
provide fixes for some of the leaks, all the better.
- Chemical Database Application: Design a program that will
allow users to load SDF files into a database and then browse the
database using a graphical interface. Loading the SDF files implies
that the structure in the SD file will be represented within the
database as SMILES (and possibly as InChI) and the properties in
the SD file will be stored as columns. The project should use a
subset of PubChem compounds (say 10,000 or so). You can use Postgres
or a local database (SQLite for Python or Derby for Java). The GUI
should be in the form of a table, with one of the columns showing
the 2D structure of the molecule and the remaining columns simply showing
the rest of the data for that molecule.
You can use your language of choice and are free to use open-source
(CDK, OpenBabel, RDKit) or commercial toolkits (OEChem).
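As a rough sketch of the loading step, the Python below parses the data items of an SD file and stores them as columns of a SQLite table. The table name `compounds` and the helper names are assumptions made for illustration. The `smiles` column is left empty here, on the assumption that it would be filled by whichever toolkit you choose.

```python
import re
import sqlite3

# A data header looks like "> <FIELD_NAME>" (optionally with extra text).
DATA_HEADER = re.compile(r">.*?<([^>]+)>")


def iter_sdf_records(text):
    """Yield each SD-file record as a list of lines (records end with '$$$$')."""
    record = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            yield record
            record = []
        else:
            record.append(line)


def parse_properties(record):
    """Return {field name: value} for the data items of one record."""
    props, i = {}, 0
    while i < len(record):
        match = DATA_HEADER.match(record[i])
        i += 1
        if match:
            value = []
            # A data value runs until the next blank line.
            while i < len(record) and record[i].strip():
                value.append(record[i])
                i += 1
            props[match.group(1)] = "\n".join(value)
    return props


def quote(name):
    """Quote an identifier so arbitrary field names can be used as columns."""
    return '"' + name.replace('"', '""') + '"'


def load_sdf(conn, text):
    """Create the (hypothetical) 'compounds' table and fill it from SD text.

    Assumes no SD field is itself called id, title or smiles.
    """
    records = [(rec[0].strip() if rec else "", parse_properties(rec))
               for rec in iter_sdf_records(text)]
    columns = sorted({name for _, props in records for name in props})
    column_sql = "".join(f", {quote(c)} TEXT" for c in columns)
    conn.execute("CREATE TABLE compounds "
                 f"(id INTEGER PRIMARY KEY, title TEXT, smiles TEXT{column_sql})")
    for title, props in records:
        names = ["title"] + list(props)
        marks = ", ".join("?" for _ in names)
        conn.execute(
            f"INSERT INTO compounds ({', '.join(quote(n) for n in names)}) "
            f"VALUES ({marks})",
            [title] + list(props.values()),
        )
    conn.commit()
    return columns
```

Calling `load_sdf(sqlite3.connect("compounds.db"), open(path).read())` would populate a local database file. All values are stored as text in this sketch, so numeric properties would need type handling before sorting or filtering in the GUI.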
- CATS-2D and CATS-3D Descriptors: Implement these
descriptors. I will provide you with a copy of the book chapter that
describes these descriptors. You can use any language you want, but
it would be useful to implement it using the CDK. Currently the CDK
does not have these descriptors, so after a review, your code could
be incorporated into the library.
- Extend Conformer Handling in the CDK: The CDK currently
reads conformers from SD files (see IteratingMDLConformerReader). At
this point it assumes that in an SD file, conformers will be
contiguous and will all have the same title. Make a local copy of
the above class and extend it so that it will, optionally, use graph
isomorphism to identify conformers. That is, rather than assuming
conformers will have the same title, the code will check that the
conformers have identical connectivity (their molecular graphs will
be isomorphic). See the UniversalIsomorphismTester
class in the CDK library. Obviously, if the isomorphism test is chosen, things
will slow down. You should run benchmarks to measure the differences
in time required by title checking and by graph checking. Use a
number of different molecules, both small and large. You can generate
conformers on cheminfo by using omega (manual).
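The benchmark described above could be prototyped outside the CDK before touching the reader class. The Python sketch below groups contiguous records into conformer sets using either a title comparison or a simple backtracking graph-isomorphism test, and times each. The record layout (a dict with `title` and `mol`), the function names, and the isomorphism routine are all assumptions for illustration. The routine is a stand-in for the CDK's UniversalIsomorphismTester, not its API, and it ignores bond orders.

```python
import time


def adjacency(n_atoms, bonds):
    """Build neighbour sets from a collection of (i, j) atom-index pairs."""
    adj = [set() for _ in range(n_atoms)]
    for i, j in bonds:
        adj[i].add(j)
        adj[j].add(i)
    return adj


def is_isomorphic(mol_a, mol_b):
    """Backtracking isomorphism test on (elements, bonds) molecules.

    Exponential in the worst case; fine for a sketch, slow for large graphs.
    """
    elems_a, bonds_a = mol_a
    elems_b, bonds_b = mol_b
    if len(elems_a) != len(elems_b) or len(bonds_a) != len(bonds_b):
        return False
    adj_a = adjacency(len(elems_a), bonds_a)
    adj_b = adjacency(len(elems_b), bonds_b)
    mapping, used = {}, set()

    def extend(i):
        if i == len(elems_a):
            return True
        for j in range(len(elems_b)):
            if (j in used or elems_a[i] != elems_b[j]
                    or len(adj_a[i]) != len(adj_b[j])):
                continue
            # Adjacency to every already-mapped atom must be preserved.
            if all((mapping[k] in adj_b[j]) == (k in adj_a[i]) for k in mapping):
                mapping[i] = j
                used.add(j)
                if extend(i + 1):
                    return True
                del mapping[i]
                used.discard(j)
        return False

    return extend(0)


def same_title(rec_a, rec_b):
    return rec_a["title"] == rec_b["title"]


def same_graph(rec_a, rec_b):
    return is_isomorphic(rec_a["mol"], rec_b["mol"])


def group_conformers(records, same):
    """Group contiguous records for which same(first_in_group, record) holds."""
    groups = []
    for rec in records:
        if groups and same(groups[-1][0], rec):
            groups[-1].append(rec)
        else:
            groups.append([rec])
    return groups


def timed(func, *args, repeats=5):
    """Return (result, best wall-clock seconds over several repeats)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return result, best
```

Comparing `timed(group_conformers, records, same_title)` with `timed(group_conformers, records, same_graph)` over small and large molecules gives the kind of timing table the project asks for.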
- Genetic Algorithms for Feature Selection: When one builds
QSAR models, one usually evaluates a large number (hundreds) of
descriptors, but many are correlated or are (near) constant. The
initial pool is reduced by correlation and variance tests to a
smaller pool. Given this small pool (say, a few tens of descriptors),
one must select a small subset (from 3 to 6) of descriptors to build
a predictive model. Though
this can be done by brute force, it can take a long time (for example,
for a 43 descriptor pool there are approximately 1,000,000
5-descriptor subsets). As a result, genetic algorithms are usually
employed to perform feature selection.
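The subset counts for the 43-descriptor pool can be checked directly with the standard library; the variable names below are only illustrative.

```python
from math import comb

pool_size = 43

# Number of distinct k-descriptor subsets that brute force must evaluate.
subset_counts = {k: comb(pool_size, k) for k in (3, 4, 5)}
print(subset_counts)  # {3: 12341, 4: 123410, 5: 962598}
```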
In this project you should
either implement your own or use a prepackaged genetic algorithm to
select a good subset of descriptors from a descriptor pool that I
will provide. You can choose any type of model (linear regression,
SVM etc). The goal of the project is to examine how fast the genetic
algorithm reaches the global best model, if it does at all. To do
this, execute the GA to find the best 3-descriptor model. Then do a
brute force calculation to evaluate all possible 3-descriptor
models. Is the best model from the GA the global best model? How
many generations are required to reach the best model? Show
the evolution of the population best RMSE with generation. Repeat
this for 4- and 5-descriptor models. Also, compare the use of the
different objective functions (say, RMSE, R2 and overall
F-statistic, or any others of your choosing). Do certain objective
functions show better convergence?
You can do this in any
environment of your choice. If working in R, you can use the genalg
package and the gtools
package for combinatoric functions.
Descriptor pool (no need to perform
correlation and variance tests as it has just 43 descriptors)
Dependent variable
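The GA-versus-brute-force comparison described above could be set up along the following lines. This is a minimal sketch, not a recommended GA design: the function names, the elitist selection scheme, the crossover and mutation operators, and all default parameters are assumptions. The `objective` argument is a placeholder for whatever you choose to minimize, for example the RMSE of a model fitted on the given descriptor subset.

```python
import itertools
import random


def brute_force(n_pool, k, objective):
    """Evaluate every k-subset of the pool and return the best (lowest) one."""
    return min(itertools.combinations(range(n_pool), k), key=objective)


def genetic_search(n_pool, k, objective, pop_size=30, generations=50,
                   mutation_rate=0.2, seed=0):
    """Toy elitist GA over k-descriptor subsets (requires n_pool > k).

    Returns (best subset, best objective value recorded per generation).
    """
    rng = random.Random(seed)

    def random_subset():
        return tuple(sorted(rng.sample(range(n_pool), k)))

    def crossover(a, b):
        # Child draws k descriptors from the union of both parents.
        genes = list(set(a) | set(b))
        return tuple(sorted(rng.sample(genes, k)))

    def mutate(subset):
        if rng.random() >= mutation_rate:
            return subset
        genes = list(subset)
        unused = [d for d in range(n_pool) if d not in genes]
        genes[rng.randrange(k)] = rng.choice(unused)
        return tuple(sorted(genes))

    population = [random_subset() for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=objective)
        history.append(objective(population[0]))
        # Keep the better half unchanged and refill with mutated children.
        parents = population[: pop_size // 2]
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    best = min(population, key=objective)
    history.append(objective(best))
    return best, history
```

Plotting `history` against generation number gives the evolution of the population best score, and comparing `objective(best)` with the value for `brute_force(n_pool, k, objective)` shows whether the GA reached the global best model. Swapping in a different `objective` (R2, F-statistic, with the sign flipped where larger is better) allows the convergence comparison.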
- Network Analysis of Drug Targets: The DrugBank
database is a comprehensive resource of drugs and their targets (if
available). The database can be downloaded
or you can go through their web interface for various queries. Choose
a selection of drugs (at least 10) whose targets are known. Given these targets,
identify other proteins that they interact with, if any. To do this
you'll need to use a protein-protein interaction database such as
HPRD or IntAct.
Given
a target and its interacting proteins, you can generate a graph
where the target and interacting proteins are nodes connected by edges. The
interacting proteins represent the first shell around the
target protein. For each interacting protein perform lookups to find
proteins that interact with them (obviously excluding the target
protein). This is the second shell around the target
protein. How far can this shell be extended for each target protein?
This will obviously depend on the number of interactions recorded in
the database.
Do any of the networks you build for the individual drug targets
share nodes? If so, what is the shortest path between the two
drug targets of such connected networks? Characterize the networks
that you generate by calculating properties such as the clustering
coefficient, average vertex degree and so on.
Another possibility is to consider which pathways a drug target is
involved in and identify the known components of these pathways. If
the pathways for two drug targets have common components, we can
therefore say the drug targets are connected. Is it possible to connect
any drug targets given the pathways they play a role in? You can try
the KEGG
Pathway database.
In both cases, try to develop some visualizations of the network using
tools such as Cytoscape or
Graphviz.
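The shell construction and the network properties mentioned above can be sketched with plain breadth-first search. The sketch assumes the interactions have already been extracted from the database of your choice as a list of protein-name pairs; the function names are illustrative.

```python
from collections import deque


def build_graph(pairs):
    """Undirected interaction graph as {protein: set of partners}."""
    graph = {}
    for a, b in pairs:
        if a == b:
            continue  # ignore self-interactions
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shells(graph, target):
    """Return [{target}, first shell, second shell, ...] around a target."""
    seen = {target}
    layers = [{target}]
    while True:
        nxt = {n for p in layers[-1] for n in graph.get(p, ()) if n not in seen}
        if not nxt:
            return layers
        seen |= nxt
        layers.append(nxt)


def shortest_path(graph, source, goal):
    """Breadth-first shortest path as a list of nodes, or None if unconnected."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nbr in sorted(graph.get(node, ())):
            if nbr not in parents:
                parents[nbr] = node
                queue.append(nbr)
    return None


def average_degree(graph):
    return sum(len(nbrs) for nbrs in graph.values()) / len(graph)


def clustering_coefficient(graph, node):
    """Fraction of pairs of a node's neighbours that are themselves linked."""
    nbrs = graph[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a in nbrs for b in graph[a] if b in nbrs) / 2
    return 2 * links / (k * (k - 1))
```

The number of layers returned by `shells` answers how far the shells extend for a target, and `shortest_path` between two targets returns `None` when their networks share no nodes.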
- Measures of Molecular Complexity: Implement at least two
measures of molecular complexity. Perform benchmark tests to see if
the measures correlate with each other; if they don't, present an
analysis of why not. Then apply the measures to
reaction sequences (search places such as J. Am. Chem. Soc.,
J. Nat. Prod. etc). The changes in the complexity measure should be
able to indicate whether a reaction is building a large complex
molecule or simplifying a molecule. A good example of the former is
natural product synthesis reactions.
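Once the complexity measures themselves are implemented, the two analyses above reduce to a correlation and a series of differences. The sketch below assumes each measure has already been evaluated to a list of numbers; the function names and the "building"/"simplifying" labels are illustrative only, and no particular complexity index is implied.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant measure")
    return cov / math.sqrt(var_x * var_y)


def complexity_changes(scores):
    """Change in complexity at each step of a reaction sequence."""
    return [later - earlier for earlier, later in zip(scores, scores[1:])]


def classify_step(delta):
    """Label one step by the sign of its complexity change."""
    if delta > 0:
        return "building"
    if delta < 0:
        return "simplifying"
    return "unchanged"
```

Here `scores` would hold the complexity of each intermediate along one synthesis, in order, so a natural product synthesis would be expected to show mostly "building" steps.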
Some example papers include
- Bertz, S., "The First General Index of
Molecular Complexity", J. Am. Chem. Soc., 1981,
103, 3599--3601.
- Allu, T. K., et al., "Rapid Evaluation of Synthetic and
Molecular Complexity for in Silico Chemistry",
J. Chem. Inf. Model., 2005, 45, 1237--1243 (http://dx.doi.org/10.1021/ci0501387)
Rajarshi
Guha, Indiana University, Bloomington
Last modified: Sun Jan 18 17:36:48 EST 2009