Advanced Python notes - Structural Bioinformatics Group

sbg.bio.ic.ac.uk

Advanced Python notes

Benjamin Jefferys - benjamin.jefferys03@imperial.ac.uk

Advanced Python notes benjamin.jefferys03@imperial.ac.uk Page 1 of 16


Python language, Python library, and third-party libraries

Python language

Core features

The Python language is the foundation for everything else. It includes language keywords such as:

for

in

def

if

elif

else

not

class

import

And symbolic operators such as:

:

=

"

()

{}

[]

%

+ - * /

Built-in library

A very basic set of "extras" is included in a built-in

library. There are a lot of functions included in this,

amongst the most useful are:

sum

max

min

abs

list

set

open

print

input

any

all

enumerate

zip

reduce (moved to the functools module in Python 3)

Python library

In addition to the built-in library, there are many extra libraries which provide features not required for

basic computation, but most certainly required for any practical task on a computer.


Most important libraries

These libraries will be used for many bioinformatics scripting tasks.

math - sin, cos, pi, floor, log, exp, factorial...

os - directories, processes, pipes, user info, execution environment and files

sys - process arguments (argv), platform information, exiting

shutil - copy, delete and move files

subprocess - call other programs, get output from and send input to them

Other useful libraries

Reading and writing files

pickle - save and load Python objects

gzip - compress and decompress

xml - write and read using DOM or SAX

glob - search for files using wildcards

Coding utilities

logging - log debugging output, messages and warnings

cProfile - identify bottlenecks in code

unittest - test your code in a consistent manner

pdb - debug, find out events leading up to a crash

pydoc - generate documentation from your code

optparse - read standardised arguments from the command line (superseded by argparse in newer versions of Python)

threading - split your program into parallel parts

Networking and internet

socket - create and connect to servers at a low level

email - construct emails

smtplib - send emails

urllib - get resources from a web server

Data types and algorithms

random - generate random numbers (uniformly and from distributions), shuffle and sample lists

string - string formatting

re - Perl-style regular expressions

bisect - search sorted lists efficiently

datetime - dates, times and time intervals


Other libraries

Hundreds or thousands of libraries have been written for python by other people!

Why use a library? | Why not use a library?
Save coding time - do more science | Takes time to understand library
Library has few bugs (probably) | May not be free of bugs
Library is efficiently implemented | May not be efficiently implemented
"Native" libraries are faster than pure Python | May not be native, and native libraries are platform-specific
Often fully featured and complete | May lack features you want

Choosing a library

Ask these questions:

Is the documentation good?

Is it being actively developed?

Are the message boards active?

Are the bug lists actively addressed?

Try out quickly and abandon if unsuitable

Two useful libraries for bioinformatics and modelling

BioPython

Includes bioinformatics data structures (sequence, structure...) and parsers for common file formats (PDB,

FASTA...). It also has easy methods for accessing common bioinformatic servers (BLAST, Entrez...). Here's

an example of how to read in a PDB file to a protein data structure:

    from Bio.PDB.PDBParser import PDBParser

    # PERMISSIVE=1 turns some PDB format errors into warnings instead of exceptions
    pdbParser = PDBParser(PERMISSIVE=1)
    structure = pdbParser.get_structure("1ten", "d1ten__.pdb")

    for model in structure:
        for chain in model:
            for residue in chain:
                aminoAcidType = residue.get_resname()
                # waters and ligands have no alpha carbon, so check first
                if residue.has_id("CA"):
                    caPosition = residue["CA"].get_coord().tolist()

SciPy and Numpy

SciPy is built upon Numpy, which gives you "magic" multidimensional arrays, where you can do the same thing to all the elements at once. This can take advantage of your processor's ability to do arithmetic on up to 8 numbers at once. For example, you can do this:

    import numpy

    a = numpy.array([5, 4, 8, 2])
    b = numpy.array([7, 1, 2, 9])
    c = a + b
    # c = [5+7, 4+1, 8+2, 2+9]

Array creation takes time - maybe longer than just adding four numbers! Time your code and make sure

using Numpy is faster. Some more amazing things that Numpy can do:

    import numpy

    # 10m numbers 0..9999999
    a = numpy.arange(10000000)
    # squares all the numbers in a
    a = a ** 2
    # turns numbers into False if odd, True if even
    evens = (a % 2 == 0)
    # all() returns True if all elements are True
    assert not evens.all()
    # 10m numbers 0..2π
    b = numpy.linspace(0, 2 * numpy.pi, 10000000)
    # calculates sine of numbers in b
    b = numpy.sin(b)
    # adds elements in a to elements in b, without creating a new array
    b += a
    # makes every other element in the array 42
    b[::2] = 42
    # calculates product of all elements in b
    p = numpy.multiply.reduce(b)
    # calculates mean of all elements in b
    m = b.mean()

SciPy builds upon this with lots of useful functions for scientific and mathematical modelling. For example,

clustering:


    import numpy
    from scipy.cluster.vq import kmeans2, whiten

    points = numpy.array([ [1.9, 2.3, 5.2], ..., [1.3, 0.2, 5.7] ])
    points = whiten(points) # normalises points in each dimension
    (means, labels) = kmeans2(points, 3)
    # now labels[i] is a cluster number for points[i]
    # ... will be 0, 1 or 2 in this case
    # More at: www.scipy.org/doc/api_docs/SciPy.cluster.vq.html

Integrating ODEs:

    from scipy.integrate import ode

    def f(t, y):
        return 0.1*y

    # integrate() returns an array, so take its first element
    print("y at time 1.0: %f" % ode(f).set_initial_value(200.0).integrate(1.0)[0])
    print("y at time 20.0: %f" % ode(f).set_initial_value(200.0).integrate(20.0)[0])
    # More at: www.scipy.org/doc/api_docs/SciPy.integrate.ode.html

Optimisation:

    from scipy.optimize import fsolve, fmin

    def f(x):
        return x ** 2.0 + x - 1.0

    print("f(%0.3f) == 0.0" % fsolve(f, 0.0)[0])

    minimisingX = fmin(f, 0.0)[0]
    print("f(%0.3f) == %0.3f (local minimum)" % (minimisingX, f(minimisingX)))
    # More at: www.scipy.org/doc/api_docs/SciPy.optimize.html

It can also do image processing and Fourier transforms, provides sparse matrix data types, and has 100+ statistical functions including central tendency (means etc.), other moments (skew, kurtosis etc.), frequency statistics, variability, correlation, inference (Mann-Whitney, chi-squared etc.), probability (P-value) calculations and ANOVA.

Three useful libraries for visualisation

In the end you have to see the results of your work. You can use other software for this, but it is

sometimes faster to generate the visualisation of your data in your own program.

matplotlib

An excellent and flexible library for data plotting and general visualisation. It will cover most of your needs

for plotting graphs. Example for plotting some randomly generated points:

    import random
    import matplotlib.pyplot as plt

    points = [ (n, n**2 + random.random()*1000) for n in range(100) ]
    (xs, ys) = zip(*points)

    plt.scatter(xs, ys)
    plt.xlabel("X axis label")
    plt.ylabel("Y axis label")
    plt.grid(True)
    plt.savefig("graph.png") # save to a PNG file
    plt.show() # display for interactive viewing

Some example output from matplotlib:



PyCairo

This generates 2D graphics from primitives such as lines, circles and polygons. This poster gives a nice

summary of the architecture and features of PyCairo:

Some example output, showing a network:



PyOpenGL

This generates 3D graphics from polygons, with lighting and other effects. It is very fast on most machines

due to specialised graphics cards, and it is based on technology which drives most modern PC games. It is

not simple to set up or use, so I will not give an example here, but will just show some of the effects that

are possible:



Python and speed

Python is an interpreted scripting language, so it is not inherently built to be fast. But there are a few things you can do to make sure your programs run as fast as possible!

Worry about speed if... | Forget about speed if...
Program runs for hours or days | Program completes in minutes
Program will run many times | Program will run few times
Program is being developed and improved | Program will work perfectly first time (!)
You're using a shared computer farm | You're using a dedicated computer

Eliminate bottlenecks

You can manually instrument your code or use a profiler (e.g. cProfile, which comes with Python) to find

out which parts of your code run slowest. But there are a few common things that make your code

slower:

• Poor data organisation

• Using more memory than your computer has

• Memory allocation, for example large lists

o Creating large lists in one go is faster than growing them

• String operations

• File access - especially over a network
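As a rough illustration of the profiling advice above, the following sketch runs two ways of building a large list under cProfile. It assumes Python 3; the function names grow_list, build_list_at_once and profile_report are made up for this example, and the timings you see will depend on your machine.

```python
import cProfile
import io
import pstats


def grow_list(n):
    """Build a list of n squares by appending one element at a time."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def build_list_at_once(n):
    """Build the same list in one go with a list comprehension."""
    return [i * i for i in range(n)]


def profile_report(func, *args):
    """Run func(*args) under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort so the functions with the largest cumulative time come first
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


grown, grown_report = profile_report(grow_list, 200000)
at_once, at_once_report = profile_report(build_list_at_once, 200000)
print(grown_report)
print(at_once_report)
```

Comparing the two reports shows where the time goes in each version. For a whole script, running "python -m cProfile myscript.py" gives a similar report without changing the code.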

Run your code on lots of processors

This is only useful if you can split your code into many processes - particularly when you're running the

same program lots of times with different input data, parameters or random seeds.

Your computer probably has two or more processors already. You can run at least two programs on your machine at the same time, and they will complete in the same time as a single program! If you want to run more programs than this, then you need to use a computer farm - lots of computers for shared use with a queuing system. Some tips on using a farm:

• Submit single programs with qsub

o A program specified for qsub cannot take arguments, so if you want to run the same program with different arguments, you must write lots of small scripts with the arguments built in - it is common to write a program which writes out these scripts and then qsubs them!

o Also remember the #! line at the start of the script

• Test your program with qsub -I before submitting thousands of jobs

• Be considerate - farms are a shared resource

• Network disk access is slow - consider copying large files to and from local disks before and after your program is run

• MPI can be used for communication between your running programs, but it is more complicated and slower, so avoid it if you can!
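The "program which writes out these scripts" mentioned in the tips could look something like the sketch below. The program name simulate.py, the --seed argument and the function write_job_scripts are all hypothetical, and the arguments are written out without any shell quoting, so this only suits simple values. The qsub call is left as a comment because it only makes sense on a farm.

```python
import os
import tempfile


def write_job_scripts(program, argument_sets, directory):
    """Write one small shell script per argument set and return their paths."""
    paths = []
    for index, arguments in enumerate(argument_sets):
        path = os.path.join(directory, "job_%04d.sh" % index)
        with open(path, "w") as script:
            script.write("#!/bin/sh\n")  # the #! line at the start of the script
            script.write("%s %s\n" % (program, " ".join(arguments)))
        os.chmod(path, 0o755)  # make the script executable
        paths.append(path)
    return paths


# Hypothetical program name and random seeds, written to a temporary directory
job_directory = tempfile.mkdtemp()
seeds = [["--seed", str(seed)] for seed in (1, 2, 3)]
job_paths = write_job_scripts("python simulate.py", seeds, job_directory)
for job_path in job_paths:
    print(job_path)
    # On a farm, each script would then be submitted, for example with:
    # subprocess.call(["qsub", job_path])
```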

If all else fails...

The psyco library (or, more recently, the PyPy interpreter) compiles your code as it runs, possibly speeding it up by 2-4 times depending on the task. This can be error prone but worth a try to squeeze the last bit of speed from your program!

In the end, your program has to do something, and you have to weigh the time spent speeding it up against

the time saved when running it. There is a theoretical limit to how fast your code can run! Don't bang your

head against this limit.


Bioinformatics programming in context: input and output

Your program will usually take input in the form of biological data and parameters, and produce output intended either as input for another program or for the user directly.

Input

Biological data

Biological data is provided in a format that depends upon the type of the data. Often data will be stored in files on a disk you can access directly. File formats are either binary or text-based. For text-based files, a program must parse data from the file into a memory-based data structure which you can use (for example, a list of numbers, or a set of objects). Binary files must also be converted into such a memory-based structure, often using a library. There are advantages and disadvantages associated with text vs. binary formats:

Text files | Binary files
Human readable in a text editor - easy to understand and recover data from files without a specialist viewer - less likely to become obsolete | Not readable by humans without a specialist viewer
Slow to load into program and parse | Fast to load into program and parse
Very space-inefficient storage for most data types: 2-10 times bigger than binary for numbers! | Space-efficient storage of data

Some examples of how biological data is stored and loaded:

Biological data | Example file format | Loading method
DNA, RNA and protein sequences | FASTA (text-based) | Simple to parse yourself, or use BioPython
Protein and DNA structures | PDB (text-based) | BioPython parser
Microarray data | Comma (CSV), space or tab (TSV) separated value files (text-based) | Python csv library
Microscopy, medical and other images | PNG, JPEG, TIFF, TGA (binary) | Python Imaging Library
Other data | Proprietary formats sometimes using XML or CSV as a container (usually text-based) and possibly compressed (gzip, ZIP) | Python xml, csv, zipfile and gzip libraries, or write your own parser using re
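To show what "simple to parse yourself" might mean for FASTA, here is a minimal sketch of a parser. The function name parse_fasta and the example records are made up for illustration, and the parser assumes well-formed input where every record starts with a ">" header line. BioPython's own parser handles more cases.

```python
def parse_fasta(lines):
    """Return a list of (header, sequence) pairs from FASTA-formatted lines."""
    records = []
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header finishes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up example records; a real file would be passed as open("file.fasta")
example = """>seq1 first example
MKTAYIAKQR
QISFVKSHFS
>seq2 second example
ACGTACGT
"""
records = parse_fasta(example.splitlines())
for header, sequence in records:
    print(header, len(sequence))
```

Because the function only needs something it can loop over line by line, the same code works on an open file or on text fetched from elsewhere.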

Sometimes you will be using data taken directly from a web server. BioPython has tools for accessing some bioinformatics web servers and returning data in a form you can use directly. Otherwise you will use urllib to fetch data from web pages, which is almost the same as loading a file from the local disk, so you can use the same methods to process it. Data fetched from a web server may be part of an HTML page. You can use re (regular expression library) or preferably htmllib (html.parser in Python 3) to extract the relevant data, and ignore the "wrapping" which presents it nicely on a web browser.

Finally, data may come from an SQL database. In Python you can use SQLAlchemy to access one.
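Going back to data wrapped in an HTML page, the sketch below pulls the text out of table cells. It uses html.parser from the Python 3 standard library rather than htmllib, which only exists in Python 2. The class name TableCellExtractor and the example page are made up, and in real use the page text would come from a web request rather than a string.

```python
from html.parser import HTMLParser


class TableCellExtractor(HTMLParser):
    """Collect the text inside every <td> element of an HTML page."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Only keep text that appears between <td> and </td>
        if self.in_cell:
            self.cells[-1] += data


# Made-up page; real HTML would come from urllib.request.urlopen(url).read()
page = ("<html><body><h1>Results</h1><table><tr>"
        "<td>P12345</td><td>0.93</td></tr></table></body></html>")
extractor = TableCellExtractor()
extractor.feed(page)
print(extractor.cells)
```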

Parameters

Parameters for your program, such as option switches, numerical parameters for your model and filenames

to write to, can be input through files, or input directly from the user.

Parameters in files

It is useful to have a method for keeping parameters in files. This allows a user to keep many different sets of related parameters together. Parameter files might also be used as defaults for parameters which are not specified by the user interactively.

You might use a proprietary text-based file format and your own parser for this purpose. However, XML files in conjunction with the Python xml parser, or cfg files read by the Python ConfigParser library (configparser in Python 3), are possibly a better option, requiring less work, particularly in the latter case.
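As a minimal sketch of the cfg file approach, the example below reads parameters with configparser (the Python 3 name of the library). The section names, option names and values are invented for illustration, and the file contents are given as a string so the example is self-contained.

```python
import configparser

# Made-up parameter file contents; normally this text would live in a .cfg file
example_cfg = """
[model]
temperature = 300.0
steps = 5000

[output]
filename = results.txt
"""

config = configparser.ConfigParser()
config.read_string(example_cfg)  # use config.read("params.cfg") for a real file

temperature = config.getfloat("model", "temperature")
steps = config.getint("model", "steps")
# fallback supplies a default when the option is missing from the file
verbose = config.getboolean("output", "verbose", fallback=False)
filename = config.get("output", "filename")
print(temperature, steps, verbose, filename)
```

The fallback argument is one way to combine file-based defaults with values the user has not specified.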

Python files can also be used to store parameters. Of course, you can keep parameters in your actual

script file which does all the work - but this is considered bad practice. Keep parameters separate from the

model. This can be done by putting parameters in a module file, and importing it into your program.

Interactive parameter input

Derek already talked about passing arguments to a Python script, and taking input from a terminal

interactively. But there are more interactive graphical methods, which have several advantages:

• Encourages experimentation

• Easier to see patterns in output

• Fast to change default inputs

• Allows a more flexible workflow

• Friendlier to users - more likely to be used


... but it takes more work to implement a graphical input method. There are two main ways to do it.

Your program could display a graphical user interface (GUI) using Tkinter (which comes with most

versions of Python) or PyGTK. This isn't too hard but can be difficult to get started - some important

setup needs to be done correctly beforehand.

You might also write your program as an HTTP (web) server, in which case your parameters will be entered on a web page. This is somewhat easier than writing a GUI, and has the advantage that your program is accessible remotely by many users, without having to install software on their machine. Use Python's cgi library to help you write such a server.
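The following is only a sketch of the web server idea, and it swaps in http.server and urllib.parse from the Python 3 standard library instead of cgi, which has been removed from the newest Python versions. The names read_parameters, ParameterHandler and serve, the port number and the example parameters are all assumptions made for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def read_parameters(path):
    """Extract parameters from a request path such as /run?steps=10."""
    query = parse_qs(urlparse(path).query)
    # parse_qs returns a list per name; keep the first value of each
    return {name: values[0] for name, values in query.items()}


class ParameterHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        parameters = read_parameters(self.path)
        body = "Parameters received: %r\n" % (sorted(parameters.items()),)
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.end_headers()
        self.wfile.write(body.encode("utf-8"))


def serve(port=8000):
    """Start the server; then visit http://localhost:8000/run?steps=10"""
    HTTPServer(("localhost", port), ParameterHandler).serve_forever()


# serve() is not called here because it runs until interrupted
print(read_parameters("/run?steps=10&temperature=300.0"))
```

All values arrive as strings, so a real program would convert and validate them before passing them to the model.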

Think about providing a command-line or file-based parameter input method in addition to graphical input,

so that other programs can invoke your program, as well as human users.

Output

For other programs

Often you'll be producing output which is actually input for another program. If you are outputting

biological data, this will usually be in the same format as input files, especially if the other programs are

written by a third party. If the "downstream" program is written by you, you have more choice in formats,

and might want to use your own proprietary format.

The Python pickle library is a nice way to format intermediate files between different Python programs, as

the saving and loading can be done in a single line.
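A small sketch of using pickle for an intermediate file is shown below. The dictionary contents and the file name are made up, and a temporary directory is used so the example does not leave files behind in your working directory.

```python
import os
import pickle
import tempfile

# Made-up intermediate results from an "upstream" program
results = {"protein": "1ten", "scores": [0.12, 0.87, 0.45]}

path = os.path.join(tempfile.mkdtemp(), "results.pickle")

# Saving is a single call...
with open(path, "wb") as handle:
    pickle.dump(results, handle)

# ...and so is loading, in the "downstream" program
with open(path, "rb") as handle:
    loaded = pickle.load(handle)

print(loaded)
```

Pickle files can run code when loaded, so only load files that you or a trusted program wrote.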

It is unusual (but possible!) to output files to a web server, but outputting to a SQL database is more

common. Again, use SQLAlchemy or another Python SQL library to do this.

For users

Often the initial user is just you, but ultimately it will be the whole world, as you will want to present and publish work to the scientific community and the general public. Often people prefer to use third-party tools (for example, MATLAB, Excel, Word, or graphviz) for producing output more elaborate than basic text, but it can be faster to produce such output directly from Python.

Derek has shown you how to output basic text. A simple extension to this is outputting an HTML file,

which allows more elaborate formatting, especially for tables. If you are writing a server, HTML output is

the default.
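To illustrate HTML output for tables, here is one possible sketch. The function name html_table and the example data are invented, and html.escape is the Python 3 function used to keep characters such as "<" from breaking the page.

```python
from html import escape


def html_table(headers, rows):
    """Return an HTML table as a string, escaping special characters."""
    lines = ["<table>"]
    header_cells = "".join("<th>%s</th>" % escape(str(h)) for h in headers)
    lines.append("<tr>" + header_cells + "</tr>")
    for row in rows:
        cells = "".join("<td>%s</td>" % escape(str(c)) for c in row)
        lines.append("<tr>" + cells + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


# Made-up results
table = html_table(["Protein", "Score"], [("1ten", 0.87), ("A<B", 0.45)])
page = "<html><body>\n%s\n</body></html>\n" % table
print(page)
```

Writing the page string to a file ending in .html is enough to view it in a web browser.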

Graphical output, either to files or directly to the screen, is enabled by the Python Imaging Library,

matplotlib, PyCairo, PyOpenGL and GUI libraries like Tkinter and PyGTK. Consider graphical output when

your data is complex and you are trying to spot patterns: visualisation is a powerful tool for science, and of

course is generally required for publishing and presenting work. You might as well do it as early as possible

so that you get the benefits of visualisation, as well as your ultimate audience!
