Python Programs, Scripts and Snippets
ringset.py is a snippet that will find the smallest ring in a
graph that contains a specified vertex. It is based on a breadth-first search
and is described by Figueras in J. Chem. Inf. Comput. Sci., 1996, 36, 986-991.
An implementation of the Figueras algorithm to determine the SSSR may be found
here. The main file is fig.py, which contains the actual SSSR search algorithm
along with the ring search algorithm described above. It utilizes graph, queue and set classes, which are
included in the tarball. The test example is the graph corresponding to cubane.
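A minimal sketch of a breadth-first search for the smallest ring through a given vertex is shown here. It is not the code from ringset.py or fig.py, and it uses a simpler variant than the one Figueras describes: for each neighbour of the start vertex it searches for the shortest way back that avoids the direct edge. The plain dictionary-of-sets graph and the function name are assumptions made for illustration.

```python
from collections import deque


def smallest_ring(graph, start):
    """Return the vertices of a smallest ring through `start`, or None.

    `graph` maps each vertex to the set of its neighbours (undirected).
    """
    best = None
    for first in graph[start]:
        # Breadth-first search from `first` back to `start`, ignoring the
        # direct edge first-start so that the path closes a ring.
        parent = {first: None}
        queue = deque([first])
        while queue and start not in parent:
            v = queue.popleft()
            for w in graph[v]:
                if w in parent or (v == first and w == start):
                    continue
                parent[w] = v
                queue.append(w)
        if start not in parent:
            continue  # the edge start-first is not part of any ring
        # Walk back from `start` to `first` to collect the ring members.
        ring, v = [], start
        while v is not None:
            ring.append(v)
            v = parent[v]
        if best is None or len(ring) < len(best):
            best = ring
    return best


# Carbon skeleton of cubane: the eight corners of a cube.
cubane = {i: {i ^ 1, i ^ 2, i ^ 4} for i in range(8)}
ring = smallest_ring(cubane, 0)
print(len(ring), ring)
```

For the cubane skeleton every vertex sits in four-membered rings, so the ring found has four vertices. A vertex that is not in any ring gives None.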
HinFile.py: A class which holds the
details of a Hyperchem HIN
file. Pretty well documented and contains several helper
functions to manipulate the molecule data.
setbin.py and setbinmod.py: The latter provides
functions which are used by setbin.py to generate sets in QSAR
studies (check out the ADAPT menu item, alongside) using the
activity ranked binning method. setbin.py is a replacement for
the original setbin program in the ADAPT suite.
setbinmod.py is
also used in ksomsetbin.py which
uses the results from a
KSOM
run to generate QSAR sets (using the binning algorithm).
morgan.py: An implementation of the Morgan algorithm
to get a canonical numbering of the atoms in a molecule. Currently it only takes
the Hyperchem HIN file format
as input. It requires the HinFile class to run.
Currently it is very simplistic:
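For orientation, the extended connectivity step at the heart of the Morgan algorithm can be sketched as follows. This is not the code from morgan.py: it works on a plain dictionary graph rather than the HinFile class, the function names are made up for illustration, and ties between atoms with equal values are broken arbitrarily (a full implementation would use atom and bond properties to break them). It is written for current Python 3.

```python
def extended_connectivity(graph):
    """Iterate Morgan's extended connectivity (EC) values.

    `graph` maps each atom to a list of its neighbours. The values start
    as the number of neighbours; each round replaces a value by the sum of
    the neighbours' values. Iteration stops when the number of distinct
    values no longer increases, and the last useful set is returned.
    """
    ec = {atom: len(nbrs) for atom, nbrs in graph.items()}
    n_classes = len(set(ec.values()))
    while True:
        new_ec = {atom: sum(ec[n] for n in graph[atom]) for atom in graph}
        new_classes = len(set(new_ec.values()))
        if new_classes <= n_classes:
            return ec
        ec, n_classes = new_ec, new_classes


def morgan_numbering(graph):
    """Number atoms from 1: highest EC first, then neighbours by EC."""
    ec = extended_connectivity(graph)
    start = max(graph, key=lambda atom: ec[atom])
    order = [start]
    for atom in order:  # `order` grows while we walk along it
        for nbr in sorted(graph[atom], key=lambda a: -ec[a]):
            if nbr not in order:
                order.append(nbr)
    return {atom: i + 1 for i, atom in enumerate(order)}


# Heavy atoms of 2-methylbutane: C1-C2(-C5)-C3-C4
graph = {1: [2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2]}
print(extended_connectivity(graph))
print(morgan_numbering(graph))
```

Only atoms connected to the starting atom get a number, so a file with several molecules would need one pass per molecule.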
cml101.py: This is a very basic
CML parser. It's more of a proof of
concept (the concept being that I could parse CML :) and uses a SAX parser.
CML looks pretty interesting as a way to store molecule
information independently
of vendor formats. It parses CML 1.01 and has no error handling, so unless
it's supplied a correct CML file, it will probably crash. The information from a
<molecule> element is stored in a Molecule class - read the source for more
info on how the info is stored in the program.
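A rough sketch of the SAX approach, written for current Python 3, is given here. It is not the code from cml101.py. The sample document, the assumption that atom properties appear as child elements carrying a builtin attribute, and the layout of the Molecule class are all illustrative assumptions and should be checked against the CML 1.01 DTD. Like the original, it has no error handling.

```python
import xml.sax


class Molecule:
    """Hypothetical container: an id plus one dict of properties per atom."""

    def __init__(self, mol_id):
        self.id = mol_id
        self.atoms = []


class CMLHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.molecules = []
        self._atom = None      # dict for the <atom> being read, if any
        self._builtin = None   # builtin name of the current child element
        self._text = []

    def startElement(self, name, attrs):
        if name == "molecule":
            self.molecules.append(Molecule(attrs.get("id")))
        elif name == "atom":
            self._atom = {"id": attrs.get("id")}
        elif self._atom is not None and "builtin" in attrs:
            self._builtin = attrs["builtin"]
            self._text = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect it.
        if self._builtin is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "atom":
            self.molecules[-1].atoms.append(self._atom)
            self._atom = None
        elif self._builtin is not None:
            self._atom[self._builtin] = "".join(self._text).strip()
            self._builtin = None


# Assumed CML 1 style input, for illustration only.
SAMPLE = b"""<molecule id="water">
  <atomArray>
    <atom id="a1"><string builtin="elementType">O</string></atom>
    <atom id="a2"><string builtin="elementType">H</string></atom>
    <atom id="a3"><string builtin="elementType">H</string></atom>
  </atomArray>
</molecule>"""

handler = CMLHandler()
xml.sax.parseString(SAMPLE, handler)
for atom in handler.molecules[0].atoms:
    print(atom["id"], atom["elementType"])
```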
gengrid.py: This code snippet will
generate a set of points on an N dimensional grid. It's useful
as a brute force method of parameter optimization. It requires
you to pass the axes which are defined as lists of numbers
(basically each axis corresponds to a single parameter and the
values of the axis correspond to the possible values of the
parameter). Thus in the 2D case, if our axes are defined as x = [1,2,3] and y = [1,2,3],
then the code will return 9 points: (1,1), (1,2), ..., (3,2), (3,1).
For the 3D case we would get 27 points.
The problem with the code is that it runs in exponential
time so for 6 axes of length 5 or more, it can take
more than a few minutes to generate the grid. The use of copy.deepcopy()
also seems very inelegant. NOTE: gengrid2.py is a modification
of gengrid.py suggested by Simon Place which generates a
list comprehension expression which is then exec'ed. The
resulting code is much faster (though still exponential)
and is more elegant.
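In current Python the same grid can be produced with itertools.product, which needs neither copy.deepcopy() nor exec. This is a sketch of an alternative, not the code in gengrid.py or gengrid2.py; the function names and the axis values [1,2,3] are assumptions. The number of points is still the product of the axis lengths, so the growth with the number of axes remains, but the iterator form avoids holding the whole grid in memory.

```python
from itertools import product


def gengrid(*axes):
    """Return every point of the grid spanned by `axes` as a list of tuples."""
    return list(product(*axes))


def itergrid(*axes):
    """Lazy version: yields one grid point at a time."""
    return product(*axes)


x = [1, 2, 3]
y = [1, 2, 3]
points = gengrid(x, y)
print(len(points))

# Brute force parameter search: keep the point with the lowest score.
best = min(itergrid(x, y), key=lambda p: (p[0] - 2) ** 2 + (p[1] - 3) ** 2)
print(best)
```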
xbel2netscape.py: Converts
Galeon
bookmarks (output in XBEL format)
to plain HTML. Uses the expat parser and is pretty simple. I need to
fiddle around a little more with XML parsing.
run.py: A small script which provides a
pop up text box (uses
PyGTK)
to run a program, without having to open up a terminal.
Exam Score System: This is a set of
Python and shell scripts which manage the storage of test scores
in a MySQL database. It's a
pretty plain application, but shows how to get the values out of
an HTML form, store and retrieve data from a MySQL database, and
display the results as HTML.
Email -> HTML Email: A small
function that uses the Python email module to parse an mbox-style
mailbox (KMail and mutt use this format) and generates an HTML
page with the individual mails formatted. Converts links and
email addresses to HTML links. Currently skips multipart mails
& mails that the email module chokes on.
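The idea can be sketched in current Python 3 with the standard mailbox, email and html modules. This is an illustration under assumptions, not the original function: the regular expression for links is deliberately crude, the HTML layout is made up, and the function names are hypothetical.

```python
import html
import mailbox
import re

# Crude patterns: group 1 is a URL, group 2 an e-mail address.
LINK_RE = re.compile(r"(https?://[^\s<>\"]+)|([\w.+-]+@[\w-]+(?:\.[\w-]+)+)")


def linkify(text):
    """HTML-escape `text`, turning URLs and e-mail addresses into links."""
    out, pos = [], 0
    for m in LINK_RE.finditer(text):
        out.append(html.escape(text[pos:m.start()]))
        target = m.group(0)
        href = target if m.group(1) else "mailto:" + target
        out.append('<a href="%s">%s</a>' % (html.escape(href), html.escape(target)))
        pos = m.end()
    out.append(html.escape(text[pos:]))
    return "".join(out)


def message_to_html(msg):
    """Format one message as an HTML fragment; None for multipart mails."""
    if msg.is_multipart():
        return None
    body = msg.get_payload(decode=True).decode("utf-8", "replace")
    header = "<h3>%s</h3><p>From: %s</p>" % (
        html.escape(msg.get("Subject", "")), linkify(msg.get("From", "")))
    return header + "<pre>%s</pre>" % linkify(body)


def mbox_to_html(path):
    """Build one HTML page from all messages in the mbox file at `path`."""
    parts = []
    for msg in mailbox.mbox(path, create=False):
        try:
            fragment = message_to_html(msg)
        except Exception:
            fragment = None  # skip mails the email module chokes on
        if fragment:
            parts.append(fragment)
    return "<html><body>%s</body></html>" % "<hr>".join(parts)
```

Calling mbox_to_html("/path/to/folder") on an mbox file returns the page as a string, which can then be written to disk.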
Install: Extract the tarball and run bibgui.py
Currently an interesting feature is the ability to be used
from within a VIM session.
This is possible with vim.py, a
Python script that can be run by doing :pyfile vim.py
on a visible buffer. It assumes that the VIM program has
Python support compiled in. The procedure to use it is:
The connection between VIM and PyBibTeX is managed through a Unix
domain socket. Thus this setup could conceivably be used
to interface PyBibTeX (if it gets to be usable!) to other
word processing packages (OpenOffice, KWord, AbiWord).
Tested with ImageMagick 5.3.8, Ghostscript 6.5.1 and Python
2.2 & 1.5.2
Using the -x and -y switches you can also specify the width
and height respectively of the main slide image.
This program
is used to combine the data from Brian Mattioni's
exlogp.perl
script and the actual molecule file (a Hyperchem HIN file in
this case) to produce a PDB file with the hydrophobicity values
in the B-factor column. The resultant PDB file can be viewed in
PyMOL - more
importantly PyMOL can use the B-factor column to color code a
molecular surface. Thus we can view hydrophobicity surfaces.
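The key step, writing a per-atom value into the B-factor column, can be sketched as below. This is not the original program: reading the HIN file and the exlogp.perl output is left out, the atom list and values are invented for illustration, and the residue name, chain and occupancy are placeholder assumptions. The column layout follows the fixed-width PDB ATOM record, where the B-factor occupies columns 61-66.

```python
def pdb_atom_line(serial, name, element, x, y, z, hydrophobicity, resname="MOL"):
    """Format one PDB ATOM record with `hydrophobicity` as the B-factor."""
    # Atom names shorter than four characters conventionally start in
    # column 14 (this simple rule suits one-letter elements).
    name_field = name if len(name) == 4 else " %-3s" % name
    return ("ATOM  %5d %4s %3s A%4d    %8.3f%8.3f%8.3f%6.2f%6.2f%10s%2s"
            % (serial, name_field, resname, 1, x, y, z,
               1.0,             # occupancy (placeholder)
               hydrophobicity,  # stored in the B-factor column
               "", element))


# Hypothetical data: (name, element, x, y, z) per atom and one value per atom.
atoms = [("C1", "C", 0.000, 0.000, 0.000), ("O1", "O", 1.430, 0.000, 0.000)]
values = [0.12, -0.35]

lines = [pdb_atom_line(i, name, element, x, y, z, value)
         for i, ((name, element, x, y, z), value)
         in enumerate(zip(atoms, values), start=1)]
print("\n".join(lines + ["END"]))
```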
To color a molecule using the B-factor column you need a
Python script color_bh.py which is a
slightly modified version of
color_b.py.
My modified version fixes the minimum and maximum B values to
-1 and 1 and scales the calculated atomic hydrophobicity values to this scale.
There are two ways to color code molecules with hydrophobicity
data. Firstly, you can generate the required PDB file using
pdbsurf.py and then load it into
PyMOL and invoke color_bh as shown below.
The steps to view a hydrophobic surface are:
The code effectively only parses messages in Maildir format and
simply copies mbox style folders to the corresponding Evolution
directory. When it encounters a Maildir message that it cannot parse,
it will log the message filename to mail.log.
The script uses the email module in Python 2.2 and so won't
work with 1.5.2 (unless you patched it). As a result, if the
email module can't parse a message there's nothing that I can do!
Finally, if you have converted a *large* mail store then
Evolution will take some time to initially load and display
the messages. This is because this script does not do any
indexing. Hence Evolution must create indices the first time it
loads the new folders.
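For a single folder, the Maildir to mbox step can be sketched with the standard mailbox module of current Python 3. This is not kmail2evo.py: it does not walk the KMail folder hierarchy or know anything about Evolution's directory layout, the function name is hypothetical, and it logs the Maildir key of a failed message rather than its full filename.

```python
import mailbox


def maildir_to_mbox(maildir_path, mbox_path, log_path="mail.log"):
    """Copy every message of one Maildir folder into an mbox file.

    Messages that cannot be read or parsed are skipped and their keys are
    appended to `log_path`. Returns the number of messages copied.
    """
    source = mailbox.Maildir(maildir_path, create=False)
    dest = mailbox.mbox(mbox_path)
    copied = 0
    try:
        with open(log_path, "a") as log:
            for key in source.iterkeys():
                try:
                    dest.add(source[key])
                    copied += 1
                except Exception:
                    log.write("%s\n" % key)
        dest.flush()
    finally:
        dest.close()
    return copied
```

As with the original script, no index is written, so the mail client still has to build its own indices when it first opens the folder.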
The simplest example is:
reorder.py is a program that can
reorder a HIN file such that all the hydrogens come after the
heavy atoms. This is required by ADAPT. Currently it only handles
HIN files which contain 1 molecule (i.e., only one set of mol -
endmol markers).
splitsdf.py can be used to split a large
SDF file containing multiple structures into individual SDF files
each containing a single structure.
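A minimal sketch of the splitting step is given here; it relies only on the fact that each record in an SD file ends with a line containing $$$$. It is not the code from splitsdf.py, and the function names and the output naming scheme (mol_1.sdf, mol_2.sdf, ...) are assumptions.

```python
def split_sdf(text):
    """Split the contents of an SD file into one string per structure."""
    records, current = [], []
    for line in text.splitlines(keepends=True):
        current.append(line)
        if line.strip() == "$$$$":
            records.append("".join(current))
            current = []
    if "".join(current).strip():
        records.append("".join(current))  # last record without a terminator
    return records


def write_records(records, prefix="mol"):
    """Write each record to <prefix>_<n>.sdf and return the file names."""
    names = []
    for n, record in enumerate(records, start=1):
        name = "%s_%d.sdf" % (prefix, n)
        with open(name, "w") as handle:
            handle.write(record)
        names.append(name)
    return names
```

For a really large file it would be better to read and write line by line instead of holding everything in memory, but the record boundary test stays the same.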
The algorithm was implemented from the discussion given in Cheminformatics, A Textbook, Gasteiger & Engel (Eds.).
I hope to fiddle some more with CML and the Python XML tools, so more code
(and even a CML section at some time) should pop up.
y = [1,2,3]
As you can see, it's not a very efficient way to do things. What I would like is to
have a toolbar button that can be clicked and the selected citations inserted.
One thing is that the text from the PDF file now seems a
little jagged. I'm not sure how I can get Ghostscript to give me
better looking text. If anybody knows the proper Ghostscript
directive, please let me know. (Note: the script
requires the
program mogrify from the ImageMagick suite, since GS cannot
output GIF; thus the JPEG or PNG files from GS must be converted
to GIFs by mogrify.)
Good Features:
Bad Features:
Some points to note:
However, you can automate the above by using
hydrosurf.py. This is a
Python script that can be run from within PyMOL and handles the
conversion to PDB and coloring of the resultant molecule
automatically. Requires the
color_bh.py script as before. To use
it load up PyMOL and then do
data.dat and mol.hin are your hydrophobicity data file and HIN
file respectively. The 0 is to color the molecule using hues from
blue to magenta; if you want colors of the rainbow use 1.
The 2 is the number of colors to use - it
will draw a colored (coded by the hydrophobicity values) line
image of the molecule. Use PyMOL to convert it to a surface
view (show surface).
You can see movies of propanol
and 1-propane thiol color coded
using hydrophobicity values.
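import os, time

# os.popen3 (Python 2 only) returns (child_stdin, child_stdout, child_stderr)
o, i, e = os.popen3("/usr/local/adapt/bin/descmng -g -f qnetin")
# desc holds the text that is sent to the program's standard input
o.write("\n\n%s\n" % (desc))
o.close()
time.sleep(2)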
The call to sleep() is useful in some cases where the called
program (like gnuplot) might take a second or two to do its
work.
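os.popen3 no longer exists in Python 3. A sketch of the same pattern with the subprocess module is shown below; subprocess.run waits for the child to exit, so no call to sleep() is needed. The function name is hypothetical, and a small child Python process stands in for the real program so that the example can be run anywhere.

```python
import subprocess
import sys


def feed_program(command, text, timeout=30):
    """Run `command` (a list of arguments), send `text` to its standard
    input, wait for it to exit and return (stdout, stderr)."""
    result = subprocess.run(command, input=text, capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout, result.stderr


# Stand-in for the real program: a child Python process that upper-cases
# whatever it receives on standard input.
child = [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().upper())"]
out, err = feed_program(child, "\n\n%s\n" % "descriptor")
print(repr(out))
```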
kmail2evo.py
This will convert the whole KMail mail store under $HOME/Mail to
Evolution mbox format under $HOME/evolution/local and will
maintain the whole folder hierarchy that you had in KMail.
Before writing this I looked around for converters - the only
thing I could find was CEIConvert.
However, it didn't seem to handle my hierarchy properly - though
a nice feature is that it does generate indices.
Update (16/05/2005): Thanks to Achim Vierheilig for providing
an update which allows a KMail mail store to be converted to a
Thunderbird mail store.
Update (11/09/2003): Thanks to David Fenyes for
updating the code so that it avoids adding the 'From' and
'Date' portions of the parsed message if they're not strings.
python refcheck.py -h www.someserver.com -r http://www.somereferer.com \
    -u "A User Agent"
The host name is mandatory, and if the User-Agent and Referer headers are not specified
they will be empty. The path that the
script requests by default is '/' but this can be changed on the command line with the
-p switch.
The output is simply the
status code and
associated reason from the response.
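The request itself can be sketched with http.client from current Python 3. This is not the code from refcheck.py: the function names are hypothetical, and the use of a GET request is an assumption.

```python
import http.client


def build_headers(referer=None, user_agent=None):
    """Headers to send; both are empty strings when not specified."""
    return {"Referer": referer or "", "User-Agent": user_agent or ""}


def check(host, path="/", referer=None, user_agent=None, timeout=10):
    """Request `path` from `host` and return (status, reason)."""
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("GET", path, headers=build_headers(referer, user_agent))
        response = conn.getresponse()
        return response.status, response.reason
    finally:
        conn.close()
```

A call such as check("www.someserver.com", referer="http://www.somereferer.com", user_agent="A User Agent") returns the status code and reason as a tuple.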
Example usage is
extractpdfpages.py -p "3-10,15-25" file.pdf
The extracted pages are placed in a file called extract.pdf by default, though the output file can be specified on the command line.
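Turning a page specification such as "3-10,15-25" into page numbers can be sketched as follows. The function name is hypothetical, and accepting single pages (for example "4") alongside ranges is an assumption about what the script allows.

```python
def parse_page_spec(spec):
    """Expand a spec such as "3-10,15-25" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        first, sep, last = part.partition("-")
        start = int(first)
        end = int(last) if sep else start  # a lone number is a one-page range
        if start < 1 or end < start:
            raise ValueError("bad page range: %r" % part)
        pages.update(range(start, end + 1))
    return sorted(pages)


print(parse_page_spec("3-10,15-25"))
```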
An alternative method that does not need pdflatex is to use the ReportLab toolkit. However it seems that the ability to extract specific pages from a PDF is only available in the for-fee version of the library. Another option which I don't really know much about is the pdftk toolkit which is Java based (but Jython would come to the rescue in this case).
cats2d.py is an implementation of the CATS2D descriptors in Python and uses the OpenEye toolkit. The pharmacophore definitions are based on those described by Renner, S. et al. (in Pharmacophores and Pharmacophore Searches, Langer, T. and Hoffmann, R.D. (Eds.), Wiley-VCH, 2006), with some changes.
The H-bond donor and H-bond acceptor patterns were taken from the Daylight web site and for the "carbon atom adjacent only to another carbon atom" I used the following SMARTS:
[C;D2;$(C(=C)(=C))], [C;D3;$(C(=C)(C)(C))], [C;D4;$(C(C)(C)(C)(C))], [C;D3;H1;$(C(C)(C)(C))], [C;D2;H2;$(C(C)(C))]
It is easy to replace the SMARTS for any of the given pharmacophore groups (H-bond donor, acceptor, positive, negative and lipophilic) within the code. The maximum topological path length is 9, as described by Renner, S. et al., but this can be changed on the command line. The program can calculate three variants of the CATS2D descriptors: the raw versions, scaled by the number of heavy atoms, and scaled by the co-occurrence of the individual PPPs. By default the raw descriptors (no scaling) are calculated, but this can be changed on the command line.
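The counting step behind the descriptors can be sketched without any toolkit, assuming the pharmacophore types have already been assigned to the atoms (in cats2d.py that is done with SMARTS matching through OpenEye, which is not reproduced here). The graph representation, the one-letter type codes and the way an atom is paired with itself at distance 0 are assumptions made for illustration; only the raw and heavy-atom-scaled variants are shown.

```python
from collections import deque
from itertools import combinations_with_replacement, product

TYPES = "DAPNL"  # donor, acceptor, positive, negative, lipophilic
PAIRS = ["".join(p) for p in combinations_with_replacement(TYPES, 2)]  # 15 pairs


def topological_distances(graph, source):
    """Bond counts from `source` to every reachable atom (breadth-first)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def cats2d(graph, atom_types, max_dist=9, scale_by_atoms=False):
    """Count pharmacophore type pairs at each topological distance.

    `graph` maps heavy atom -> neighbours; `atom_types` maps atom -> string
    of type letters (missing or empty means no type). Returns a dict keyed
    by (pair, distance).
    """
    counts = {(pair, d): 0.0 for pair in PAIRS for d in range(max_dist + 1)}
    atoms = sorted(graph)
    for i, a in enumerate(atoms):
        dist = topological_distances(graph, a)
        for b in atoms[i:]:  # each unordered atom pair once, including a == b
            d = dist.get(b)
            if d is None or d > max_dist:
                continue
            types_a, types_b = atom_types.get(a, ""), atom_types.get(b, "")
            if a == b:
                combos = combinations_with_replacement(types_a, 2)
            else:
                combos = product(types_a, types_b)
            for ta, tb in combos:
                pair = "".join(sorted(ta + tb, key=TYPES.index))
                counts[(pair, d)] += 1
    if scale_by_atoms:
        counts = {key: value / len(graph) for key, value in counts.items()}
    return counts


# Hypothetical four-atom chain: donor - lipophilic - lipophilic - acceptor
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
atom_types = {0: "D", 1: "L", 2: "L", 3: "A"}
raw = cats2d(graph, atom_types)
print(raw[("DA", 3)], raw[("LL", 1)])
```

With 15 type pairs and distances 0 to 9 the result has 150 entries.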
A recent article in CDK News (O'Boyle, N.; CDK News, 2005, 2(2), 40-42) described the use of the CDK libraries in Python using Jython.
An example of the use of Jython to allow the inclusion of Java libraries in a Python program is available here. The function of this program is to evaluate the total surface area for a set of molecules specified in an SD file. Since the CDK surface area algorithm is numerical, it uses a probe radius and a tessellation level, both of which can be specified on the command line. An example of its usage is:
jython calcSA.jy -r 1,2,3 -t 3,4 structures.sdf