Standard text notes - Structural Bioinformatics Group

sbg.bio.ic.ac.uk


Python coding notes
benjamin.jefferys03@imperial.ac.uk

Notes on debugging, formatting and structuring code

Benjamin Jefferys, benjamin.jefferys@imperial.ac.uk

These notes guide you through a series of changes to a Python script. They illustrate how to debug code, and also how to structure and format code so it is easier to debug and improve, and less likely to have bugs in the first place. In each case, changes made to the script as compared to the previous version are highlighted in grey in the left margin.

Further reading

    en.wikipedia.org/wiki/Coding_style
    en.wikipedia.org/wiki/Unit_test
    en.wikipedia.org/wiki/Regression_testing
    docs.python.org/library/logging.html

An important method for debugging not covered here is the use of a debugger. This allows you to control and study the execution of a program, step by step.

    Debuggers in general: en.wikipedia.org/wiki/Debugger
    Python's debugger: docs.python.org/library/pdb.html

Step 1 - syntax errors

This is adapted from Derek's code for translating some DNA into a protein sequence, via a codon translation table. I have changed it to illustrate some things; Derek's code did not have any bugs. Look at the script and the output and see if you can work out what is wrong...

      # Syntax errors - understanding errors, staring at code, pair programming

      in_file = open('fasta_file.txt')
      seq_list = in_file.readilnes()
      seq_name = seq_list.pop(0).strip()
      seq = seq_list.pop(0).rstrip(().lower()
      for line in seq_list:
          seq += line.rstrip().lower()
      in_file.close()
      in_file = open('codons.txt')
      codons_list = in_file.readlines()
      codons = {} # Initialise the dictionary
      for count in range(0, len(codons_list), 2):
          codons[codons_list[count+1].rstrip().lower()] = \
              codons_list[count].rstrip().lower()
      in_file.close()
      for count in range(0, len(seq), 3):
          codon = seq[count:count+3]
          if codons.has_key(codon):
              aa = codons[codon]


    AttributeError: 'file' object has no attribute 'readilnes'

Step 3 - logical errors

The basic error has been corrected (a). But the output isn't what we expected... it should be a protein sequence. What's going on? Well, the code is a bit of a mess at the moment; it isn't clear how it works. Let's tidy it up so we can begin to understand it.

      # Logic errors

      in_file = open('fasta_file.txt')
(a)   seq_list = in_file.readlines()
      seq_name = seq_list.pop(0).strip()
      seq = seq_list.pop(0).rstrip().lower()
      for line in seq_list:
          seq += line.rstrip().lower()
      in_file.close()
      in_file = open('codons.txt')
      codons_list = in_file.readlines()
      codons = {} # Initialise the dictionary
      for count in range(0, len(codons_list), 2):
          codons[codons_list[count+1].rstrip().lower()] = \
              codons_list[count].rstrip().lower()
      in_file.close()
      for count in range(0, len(seq), 3):
          codon = seq[count:count+3]
          if codons.has_key(codon):
              aa = codons[codon]
          else:
              aa = '-'
          print aa,

Script output

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    - - - - - - - - - - - - - - - - - - - -

Step 4 - commenting

The first thing to do is annotate, or comment, the code, and split it up into logical steps using white space. All the changes here are just adding comments and newlines, so the output is the same as before.
Of course, doing this involved looking at the code and understanding each step - adding comments just records this understanding so we don't have to do it again.

      # Commenting

(a)   # Read in fasta file to name string and list of lower-case sequence strings
      in_file = open('fasta_file.txt')
      seq_list = in_file.readlines()
      seq_name = seq_list.pop(0).strip()
      seq = seq_list.pop(0).rstrip().lower()
(b)
      # Join up the sequence strings
      for line in seq_list:
          seq += line.rstrip().lower()
      in_file.close()
(c)
      # Read in the codon -> amino acid translation file
      in_file = open('codons.txt')
      codons_list = in_file.readlines()
(d)
      # Read pairs of lines, first in pair is codon, second in pair is amino acid
      # Put it into a codon -> amino acid dictionary, all lower case
      codons = {} # Initialise the dictionary
      for count in range(0, len(codons_list), 2):
          codons[codons_list[count+1].rstrip().lower()] = \
              codons_list[count].rstrip().lower()
(e)
      in_file.close()
(f)
      # Translate triplets of the sequence into codons, via the codon dictionary
      for count in range(0, len(seq), 3):
          codon = seq[count:count+3]
          if codons.has_key(codon):
              aa = codons[codon]
          else:
              aa = '-'
          print aa,

Script output

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    - - - - - - - - - - - - - - - - - - - -

Step 5 - readable code

Using understandable variable names and formatting them consistently is very important. It means other people can understand your code more easily, and you'll be able to go back to old code and understand it. The broad aim is to make your code read like sentences - almost to the extent that comments are unnecessary. But don't go crazy - very long variable names are tiresome to type in!

Often you'll want to have more than one word in a name, but names cannot have spaces, so an alternative has to be used. Derek uses underscores (_) instead of spaces, and keeps all his variable names lower-case. A commonly used alternative is to capitalise the start of all words, except the first word. So sequence_name becomes sequenceName. This is the standard in Java. Python's own style guide, PEP 8, recommends the lower-case-with-underscores style for variable and function names, though you will meet both styles in Python code. I prefer the capitalised style.
So I've expanded the variable names to be more descriptive and use my preferred style.

This might seem terribly picky, but it is vital to follow a consistent style if you're going to share code with others. In industry, you will have to get used to following a "house style", and there will probably be a long document describing it! So, get used to sticking to a standard.

Check the output is still the same! We've just tidied stuff up, so nothing should change.

      # Nice names

      # Read in fasta file to name string and list of lower-case sequence strings
(a)   fastaFile = open('fasta_file.txt')
      sequenceFileLines = fastaFile.readlines()
      sequenceName = sequenceFileLines.pop(0).strip()
      nucleotideSequence = sequenceFileLines.pop(0).rstrip().lower()

      # Join up the sequence strings
(b)   for sequenceFragment in sequenceFileLines:
          nucleotideSequence += sequenceFragment.rstrip().lower()
      fastaFile.close()

      # Read in the codon -> amino acid translation file
(c)   codonFile = open('codons.txt')
      codonFileLines = codonFile.readlines()

      # Read pairs of lines, first in pair is codon, second in pair is amino acid
      # Put it into a codon -> amino acid dictionary, all lower case
(d)   codonToAminoAcid = {} # Initialise the dictionary
      for i in range(0, len(codonFileLines), 2):
          codonToAminoAcid[codonFileLines[i+1].rstrip().lower()] = \
              codonFileLines[i].rstrip().lower()

(e)   codonFile.close()

      # Translate triplets of the sequence into codons, via the codon dictionary
(f)   for i in range(0, len(nucleotideSequence), 3):
          codon = nucleotideSequence[i:i+3]
          if codonToAminoAcid.has_key(codon):
              aminoAcid = codonToAminoAcid[codon]
          else:
(g)           aminoAcid = '-'
          print aminoAcid,

Script output

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    - - - - - - - - - - - - - - - - - - - -

Step 6 - instrumentation

Finally we can start understanding the problem. We can instrument the code - print out status and information - so we can see what is going on as it is running, and perhaps identify the problem.
Here I've just added print statements between each step of the program to give a progress report and give a summary of what has happened.

Straight away we can see a problem: the codon table has only 21 entries, but there are 64 codon triplets. Biochemists could probably have a good guess at what has happened.

      # Instrumentation - notice anomalous count in codon table

      # Read in fasta file to name string and list of lower-case sequence strings
      fastaFile = open('fasta_file.txt')
      sequenceFileLines = fastaFile.readlines()
      sequenceName = sequenceFileLines.pop(0).strip()
(a)
      print "Read in %i lines of sequence %s from fasta_file.txt" % \
          (len(sequenceFileLines), sequenceName)

      nucleotideSequence = sequenceFileLines.pop(0).rstrip().lower()

      # Join up the sequence strings
      for sequenceFragment in sequenceFileLines:
          nucleotideSequence += sequenceFragment.rstrip().lower()
      fastaFile.close()

(b)   print "Sequence length: %i" % len(nucleotideSequence)

      # Read in the codon -> amino acid translation file
      codonFile = open('codons.txt')
      codonFileLines = codonFile.readlines()
(c)
      print "Read in %i lines from codons.txt" % len(codonFileLines)

      # Read pairs of lines, first in pair is codon, second in pair is amino acid
      # Put it into a codon -> amino acid dictionary, all lower case
      codonToAminoAcid = {} # Initialise the dictionary
      for i in range(0, len(codonFileLines), 2):
          codonToAminoAcid[codonFileLines[i+1].rstrip().lower()] = \
              codonFileLines[i].rstrip().lower()

      codonFile.close()

(d)   print "Filled codon translation table, %i entries" % len(codonToAminoAcid)

      # Translate triplets of the sequence into codons, via the codon dictionary
      for i in range(0, len(nucleotideSequence), 3):
          codon = nucleotideSequence[i:i+3]
          if codonToAminoAcid.has_key(codon):
              aminoAcid = codonToAminoAcid[codon]
          else:
              aminoAcid = '-'
          print aminoAcid,
(e)
      print
      print "Finished"

Script output

    Read in 7 lines of sequence >bleh from fasta_file.txt
    Sequence length: 420
    Read in 128 lines from codons.txt
    Filled codon translation table, 21 entries
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    - - - - - - - - - - - - - - - - - - - -
    Finished

Step 7 - more instrumentation

Now we add more instrumentation to see what's wrong with the codon table, by printing out the keys from the table (a).
The problem is obvious - the keys are actually amino acid names. We seem to have constructed this codon translation table wrongly.

      # Instrumentation - further instrumentation to show error

      # Read in fasta file to name string and list of lower-case sequence strings
      fastaFile = open('fasta_file.txt')
      sequenceFileLines = fastaFile.readlines()
      sequenceName = sequenceFileLines.pop(0).strip()

      print "Read in %i lines of sequence %s from fasta_file.txt" % \
          (len(sequenceFileLines), sequenceName)

      nucleotideSequence = sequenceFileLines.pop(0).rstrip().lower()

      # Join up the sequence strings
      for sequenceFragment in sequenceFileLines:
          nucleotideSequence += sequenceFragment.rstrip().lower()
      fastaFile.close()

      print "Sequence length: %i" % len(nucleotideSequence)

      # Read in the codon -> amino acid translation file
      codonFile = open('codons.txt')
      codonFileLines = codonFile.readlines()

      print "Read in %i lines from codons.txt" % len(codonFileLines)

      # Read pairs of lines, first in pair is codon, second in pair is amino acid
      # Put it into a codon -> amino acid dictionary, all lower case
      codonToAminoAcid = {} # Initialise the dictionary
      for i in range(0, len(codonFileLines), 2):
          codonToAminoAcid[codonFileLines[i+1].rstrip().lower()] = \
              codonFileLines[i].rstrip().lower()

      codonFile.close()

      print "Filled codon translation table, %i entries" % len(codonToAminoAcid)
(a)   print "Codons: %s" % codonToAminoAcid.keys()

      # Translate triplets of the sequence into codons, via the codon dictionary
      for i in range(0, len(nucleotideSequence), 3):
          codon = nucleotideSequence[i:i+3]
          if codonToAminoAcid.has_key(codon):
              aminoAcid = codonToAminoAcid[codon]
          else:
              aminoAcid = '-'
          print aminoAcid,

      print
      print "Finished"

Script output

    Read in 7 lines of sequence >bleh from fasta_file.txt
    Sequence length: 420
    Read in 128 lines from codons.txt
    Filled codon translation table, 21 entries
    Codons: ['a', 'c', 'e', 'd', 'g', 'f', 'i', 'h', 'k', 'm', 'l', 'n',
    'q', 'p', 's', 'r', 't', 'w', 'v', 'y', '.']
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    - - - - - - - - - - - - - - - - - - - -
    Finished

Step 8 - logging rather than printing


A small aside... having your program print out lots of information is great for debugging, but once it is working it might become annoying! Instead of using print, you might use the Python logging library. This lets you print messages out with some "importance level", but only messages at or above a given importance level are actually shown. Here all the messages are "debug level" (b-f) and so will only be shown if the logging level is DEBUG or below (a). The output is a bit different too.

(a)   # Instrumentation - using logging to turn it on and off

      import logging

      logging.basicConfig(level=logging.DEBUG)

      # Read in fasta file to name string and list of lower-case sequence strings
      fastaFile = open('fasta_file.txt')
      sequenceFileLines = fastaFile.readlines()
      sequenceName = sequenceFileLines.pop(0).strip()

(b)   logging.debug("Read in %i lines of sequence %s from fasta_file.txt" % \
          (len(sequenceFileLines), sequenceName))

      nucleotideSequence = sequenceFileLines.pop(0).rstrip().lower()

      # Join up the sequence strings
      for sequenceFragment in sequenceFileLines:
          nucleotideSequence += sequenceFragment.rstrip().lower()
      fastaFile.close()

(c)   logging.debug("Sequence length: %i" % len(nucleotideSequence))

      # Read in the codon -> amino acid translation file
      codonFile = open('codons.txt')
      codonFileLines = codonFile.readlines()

(d)   logging.debug("Read in %i lines from codons.txt" % len(codonFileLines))

      # Read pairs of lines, first in pair is codon, second in pair is amino acid
      # Put it into a codon -> amino acid dictionary, all lower case
      codonToAminoAcid = {} # Initialise the dictionary
      for i in range(0, len(codonFileLines), 2):
          codonToAminoAcid[codonFileLines[i+1].rstrip().lower()] = \
              codonFileLines[i].rstrip().lower()

      codonFile.close()

(e)   logging.debug("Filled codon translation table, %i entries" %
          len(codonToAminoAcid))
      logging.debug("Codons: %s" % codonToAminoAcid.keys())

      # Translate triplets of the sequence into codons, via the codon dictionary
      for i in range(0, len(nucleotideSequence), 3):
          codon = nucleotideSequence[i:i+3]
          if codonToAminoAcid.has_key(codon):
              aminoAcid = codonToAminoAcid[codon]
          else:
              aminoAcid = '-'
          print aminoAcid,

(f)   logging.debug("Finished")

Script output

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    - - - - - - - - - - - - - - - - - - - -
    DEBUG:root:Read in 7 lines of sequence >bleh from fasta_file.txt
    DEBUG:root:Sequence length: 420
    DEBUG:root:Read in 128 lines from codons.txt
    DEBUG:root:Filled codon translation table, 21 entries
    DEBUG:root:Codons: ['a', 'c', 'e', 'd', 'g', 'f', 'i', 'h', 'k', 'm', 'l', 'n',
    'q', 'p', 's', 'r', 't', 'w', 'v', 'y', '.']
    DEBUG:root:Finished

Step 9 - logical checkpoints

Now we've narrowed down the error, we can leave the logging messages off by not setting the logging level to DEBUG at the start. We can also try to avoid introducing more bugs in the future by checking our assumptions at various points in the code. At (a), we are checking that the codon to amino acid table has 64 entries. The assert statement will do nothing if the value passed to it is True, but will halt the program and print an error if it is False. It means that the error is spotted early before it leads to a more confusing or damaging problem later on. This will not only highlight coding errors, but will also highlight to the user if the codon translation file is broken.

The check at (a) is a postcondition: it checks that things are right just after some action has been performed. The check at (b) is a precondition. It checks that the DNA sequence has any nucleotides, and that there is a multiple of 3 nucleotides, and therefore it makes sense to try to translate it. This is checking an assumption of the following code. In general, DNA sequences can be empty or have any number of nucleotides - only when it comes to translation do these things matter.
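As a minimal sketch of the two kinds of checkpoint described above, here is a cut-down version written in Python 3 (the scripts in these notes are Python 2). The names build_codon_table, translate and expected_size are made up for illustration, and a toy two-codon table stands in for the real codons.txt.

```python
# Python 3 sketch; function and parameter names are illustrative assumptions.

def build_codon_table(lines, expected_size=64):
    """Pair alternating codon / amino-acid lines into a dictionary."""
    table = {}
    for i in range(0, len(lines) - 1, 2):
        codon = lines[i].strip().lower()
        table[codon] = lines[i + 1].strip().lower()
    # Postcondition: checked just after the table has been built.
    assert len(table) == expected_size, \
        "expected %i codons, got %i" % (expected_size, len(table))
    return table


def translate(sequence, table):
    """Translate a nucleotide string one codon (three letters) at a time."""
    # Precondition: checked just before the code that relies on it.
    assert len(sequence) > 0 and len(sequence) % 3 == 0, \
        "sequence length %i is not a positive multiple of 3" % len(sequence)
    return [table.get(sequence[i:i + 3], "-")
            for i in range(0, len(sequence), 3)]


# A toy two-codon table stands in for the real 64-entry codons.txt.
toy_table = build_codon_table(["ATG\n", "M\n", "TGG\n", "W\n"], expected_size=2)
protein = translate("atgtgg", toy_table)
print(" ".join(protein))

try:
    translate("atgtg", toy_table)  # five letters: the precondition fails
except AssertionError as error:
    print("Precondition failed:", error)
```

Two things are worth knowing if you add messages like these. Because assert is a statement, the message goes after a comma outside any brackets: writing assert(condition, "message") builds a two-item tuple, which always counts as true, so the check never fires. Also, running Python with the -O option skips assert statements altogether, so they suit checks on your own assumptions rather than anything the program must always enforce.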
So the logical check belongs here, just before the translation.

When run, this script produces an error, as the assertion that there are 64 entries in the codon translation table is false.

      # Logic checkpoints

      import logging

      # Read in fasta file to name string and list of lower-case sequence strings
      fastaFile = open('fasta_file.txt')
      sequenceFileLines = fastaFile.readlines()
      sequenceName = sequenceFileLines.pop(0).strip()

      logging.debug("Read in %i lines of sequence %s from fasta_file.txt" % \
          (len(sequenceFileLines), sequenceName))

      nucleotideSequence = sequenceFileLines.pop(0).rstrip().lower()

      # Join up the sequence strings
      for sequenceFragment in sequenceFileLines:
          nucleotideSequence += sequenceFragment.rstrip().lower()
      fastaFile.close()

      logging.debug("Sequence length: %i" % len(nucleotideSequence))

      # Read in the codon -> amino acid translation file
      codonFile = open('codons.txt')
      codonFileLines = codonFile.readlines()

      logging.debug("Read in %i lines from codons.txt" % len(codonFileLines))

      # Read pairs of lines, first in pair is codon, second in pair is amino acid
      # Put it into a codon -> amino acid dictionary, all lower case
      codonToAminoAcid = {} # Initialise the dictionary
      for i in range(0, len(codonFileLines), 2):
          codonToAminoAcid[codonFileLines[i+1].rstrip().lower()] = \
              codonFileLines[i].rstrip().lower()

      codonFile.close()

(a)   # Logic checkpoint - Postcondition
      assert(len(codonToAminoAcid) == 64)

      logging.debug("Filled codon translation table, %i entries" %
          len(codonToAminoAcid))
      logging.debug("Codons: %s" % codonToAminoAcid.keys())

      # Translate triplets of the sequence into codons, via the codon dictionary
(b)
      # Logic checkpoint - Precondition
      assert(len(nucleotideSequence) > 0 and len(nucleotideSequence) % 3 == 0)

      for i in range(0, len(nucleotideSequence), 3):
          codon = nucleotideSequence[i:i+3]
          if codonToAminoAcid.has_key(codon):
              aminoAcid = codonToAminoAcid[codon]
          else:
              aminoAcid = '-'
          print aminoAcid,

      logging.debug("Finished")

Script output

Script produces an error:

    Traceback (most recent call last):
      File "09.py", line 38, in <module>
        assert(len(codonToAminoAcid) == 64)
    AssertionError

Step 10 - fix the bug!

Some more staring at the code which makes the translation table, or some pair programming, reveals the error: some list indexing has gone wrong, so the amino acid line was being used as the key and the codon line as the value. We change it around (a), and we get a protein sequence! Brilliant!

      # Bug fixed

      import logging

      # Read in fasta file to name string and list of lower-case sequence strings
      fastaFile = open('fasta_file.txt')
      sequenceFileLines = fastaFile.readlines()
      sequenceName = sequenceFileLines.pop(0).strip()

      logging.debug("Read in %i lines of sequence %s from fasta_file.txt" % \
          (len(sequenceFileLines), sequenceName))

      nucleotideSequence = sequenceFileLines.pop(0).rstrip().lower()

      # Join up the sequence strings
      for sequenceFragment in sequenceFileLines:
          nucleotideSequence += sequenceFragment.rstrip().lower()
      fastaFile.close()

      logging.debug("Sequence length: %i" % len(nucleotideSequence))

      # Read in the codon -> amino acid translation file
      codonFile = open('codons.txt')
      codonFileLines = codonFile.readlines()

      logging.debug("Read in %i lines from codons.txt" % len(codonFileLines))

      # Read pairs of lines, first in pair is codon, second in pair is amino acid
      # Put it into a codon -> amino acid dictionary, all lower case
      codonToAminoAcid = {} # Initialise the dictionary
      for i in range(0, len(codonFileLines), 2):
(a)       codonToAminoAcid[codonFileLines[i].rstrip().lower()] = \
              codonFileLines[i+1].rstrip().lower()

      codonFile.close()

      # Logic checkpoint - Postcondition
      assert(len(codonToAminoAcid) == 64)

      logging.debug("Filled codon translation table, %i entries" %
          len(codonToAminoAcid))
      logging.debug("Codons: %s" % codonToAminoAcid.keys())

      # Translate triplets of the sequence into codons, via the codon dictionary

      # Logic checkpoint - Precondition
      assert(len(nucleotideSequence) > 0 and len(nucleotideSequence) % 3 == 0)

      for i in range(0, len(nucleotideSequence), 3):
          codon = nucleotideSequence[i:i+3]
          if codonToAminoAcid.has_key(codon):
              aminoAcid = codonToAminoAcid[codon]
          else:
              aminoAcid = '-'
          print aminoAcid,

      logging.debug("Finished")

Script output

    k s t c a f i . k k n t i r y c p e m f w k e r s l e k s l v f i i i i i i i i
    l i i l i i s l k w k q y c f s g f c c m k c k k r d g f q . h i . s l l t k i
    m t i n y t a s c k i r l v l k q t t p v h p t p l p f p e p p k r s k k l l r
    c c h w e i f a e s n n k t e t c w r l

Step 11 - functions to reduce repetition

There are a few places in the code where the same thing is done several times, simply by repeating the same set of statements. What if there is an error in one of these?
It means there's probably an error in all of them! If the same code is used five times, we have to fix the same bug five times (assuming we even find all the repetitions of the same code!). Twice we open a file and read the lines from it, and three times we make a standard sequence from a string by stripping off non-sequence information and making it lower case. Let's turn these into functions (a) and call them where necessary (b-f) - and check again that the output doesn't change! We don't want to add new bugs whilst removing others.

      # Better organisation - reading files and standardising sequences

      import logging

(a)   def readFileLines(filename):
          f = open(filename)
          lines = f.readlines()
          f.close()

          return lines

      def formatSequenceLine(sequence):
          return sequence.rstrip().lower()

      # Read in fasta file to name string and list of lower-case sequence strings
(b)   sequenceFileLines = readFileLines('fasta_file.txt')
      sequenceName = sequenceFileLines.pop(0).strip()

      logging.debug("Read in %i lines of sequence %s from fasta_file.txt" % \
          (len(sequenceFileLines), sequenceName))

(c)   nucleotideSequence = formatSequenceLine(sequenceFileLines.pop(0))

      # Join up the sequence strings
      for sequenceFragment in sequenceFileLines:
(d)       nucleotideSequence += formatSequenceLine(sequenceFragment)

      logging.debug("Sequence length: %i" % len(nucleotideSequence))

      # Read in the codon -> amino acid translation file
(e)   codonFileLines = readFileLines('codons.txt')

      logging.debug("Read in %i lines from codons.txt" % len(codonFileLines))

      # Read pairs of lines, first in pair is codon, second in pair is amino acid
      # Put it into a codon -> amino acid dictionary, all lower case
      codonToAminoAcid = {} # Initialise the dictionary
      for i in range(0, len(codonFileLines), 2):
(f)       codonToAminoAcid[formatSequenceLine(codonFileLines[i])] = \
              formatSequenceLine(codonFileLines[i+1])

      # Logic checkpoint - Postcondition
      assert(len(codonToAminoAcid) == 64)

      logging.debug("Filled codon translation table, %i entries" %
          len(codonToAminoAcid))
      logging.debug("Codons: %s" % codonToAminoAcid.keys())

      # Translate triplets of the sequence into codons, via the codon dictionary

      # Logic checkpoint - Precondition
      assert(len(nucleotideSequence) > 0 and len(nucleotideSequence) % 3 == 0)

      for i in range(0, len(nucleotideSequence), 3):
          codon = nucleotideSequence[i:i+3]
          if codonToAminoAcid.has_key(codon):
              aminoAcid = codonToAminoAcid[codon]
          else:
              aminoAcid = '-'
          print aminoAcid,

      logging.debug("Finished")

Script output


    k s t c a f i . k k n t i r y c p e m f w k e r s l e k s l v f i i i i i i i i
    l i i l i i s l k w k q y c f s g f c c m k c k k r d g f q . h i . s l l t k i
    m t i n y t a s c k i r l v l k q t t p v h p t p l p f p e p p k r s k k l l r
    c c h w e i f a e s n n k t e t c w r l

Step 12 - functions to structure code

We can take this a step further and actually put most of the code into functions. This helps to make code self-contained with well-defined tasks for each function, isolates errors within particular functions, and makes the flow of logic in the code clearer. It is a bit like division of labour in factories.

Now the start of the program is just function definitions (a-c). At the end, the code at the "top level" is reduced to just four lines (d-g)! It is very clear what is going on from the function names, and it gives a nice overview of the logical flow of the program. Good structure reduces the need for comments. Note the output is still the same!

      # Better organisation - stages of process into functions

      import logging

      def readFileLines(filename):
          f = open(filename)
          lines = f.readlines()
          f.close()

          return lines

      def formatSequenceLine(sequence):
          return sequence.rstrip().lower()

(a)   def readFastaFile(filename):
          # Read in fasta file to name string and list of lower-case sequence strings
          sequenceFileLines = readFileLines(filename)
          sequenceName = sequenceFileLines.pop(0).strip()

          logging.debug("Read in %i lines of sequence %s from fasta_file.txt" % \
              (len(sequenceFileLines), sequenceName))

          nucleotideSequence = formatSequenceLine(sequenceFileLines.pop(0))

          # Join up the sequence strings
          for sequenceFragment in sequenceFileLines:
              nucleotideSequence += formatSequenceLine(sequenceFragment)

          logging.debug("Sequence length: %i" % len(nucleotideSequence))

          return (sequenceName, nucleotideSequence)

(b)   def readCodonTable(filename):
          # Read in the codon -> amino acid translation file
          # (use the filename argument rather than a hard-coded 'codons.txt')
          codonFileLines = readFileLines(filename)

          logging.debug("Read in %i lines from codons.txt" % len(codonFileLines))

          # Read pairs of lines, first in pair is codon, second in pair is amino acid
          # Put it into a codon -> amino acid dictionary, all lower case
          codonToAminoAcid = {} # Initialise the dictionary
          for i in range(0, len(codonFileLines), 2):
              codonToAminoAcid[formatSequenceLine(codonFileLines[i])] = \
                  formatSequenceLine(codonFileLines[i+1])

          # Logic checkpoint - Postcondition
          assert(len(codonToAminoAcid) == 64)

          logging.debug("Filled codon translation table, %i entries" %
              len(codonToAminoAcid))
          logging.debug("Codons: %s" % codonToAminoAcid.keys())

          return codonToAminoAcid


(c)   def translateNucleotideSequence(nucleotideSequence, codonToAminoAcid):
          # Translate triplets of the sequence into codons, via the codon dictionary

          # Logic checkpoint - Precondition
          assert(len(nucleotideSequence) > 0 and len(nucleotideSequence) % 3 == 0)

          proteinSequence = []

          for i in range(0, len(nucleotideSequence), 3):
              codon = nucleotideSequence[i:i+3]
              if codonToAminoAcid.has_key(codon):
                  aminoAcid = codonToAminoAcid[codon]
              else:
                  aminoAcid = '-'
              proteinSequence.append(aminoAcid)

          return " ".join(proteinSequence)

(d)   (sequenceName, nucleotideSequence) = readFastaFile('fasta_file.txt')

(e)   codonToAminoAcid = readCodonTable('codons.txt')

(f)   proteinSequence = translateNucleotideSequence(nucleotideSequence,
          codonToAminoAcid)

(g)   print proteinSequence

      logging.debug("Finished")

Script output

    k s t c a f i . k k n t i r y c p e m f w k e r s l e k s l v f i i i i i i i i
    l i i l i i s l k w k q y c f s g f c c m k c k k r d g f q . h i . s l l t k i
    m t i n y t a s c k i r l v l k q t t p v h p t p l p f p e p p k r s k k l l r
    c c h w e i f a e s n n k t e t c w r l

Step 13 - gathering related functions into a module

It is annoying to see all those function definitions at the top of the file before you get to the "meat" of the program. So they can be put in a different file (called bioinformatics.py) and then "imported" for this script to use (a). We've hidden away the details of how the program works, and can now focus on what it does - crucial when writing large programs. Now all the function names must be prefixed with "bioinformatics."
(See b-d.)

      # Better organisation - modules - centralise bugs + opportunities for improvement

      import logging
(a)   import bioinformatics

(b)   (sequenceName, nucleotideSequence) = \
          bioinformatics.readFastaFile('fasta_file.txt')

(c)   codonToAminoAcid = bioinformatics.readCodonTable('codons.txt')

(d)   proteinSequence = bioinformatics.translateNucleotideSequence(
          nucleotideSequence, codonToAminoAcid)

      print proteinSequence

      logging.debug("Finished")

Script output

    k s t c a f i . k k n t i r y c p e m f w k e r s l e k s l v f i i i i i i i i
    l i i l i i s l k w k q y c f s g f c c m k c k k r d g f q . h i . s l l t k i
    m t i n y t a s c k i r l v l k q t t p v h p t p l p f p e p p k r s k k l l r
    c c h w e i f a e s n n k t e t c w r l

Step 14 - unit testing

We might want to make changes to our bioinformatics module in the future - but how can we be sure we haven't introduced new bugs? One way is to write a unit test, which runs each function on fixed input and compares the result against known output produced using a reliable method. Here, we define input (lines 7-8) and write this to a test fasta file (lines 12-18), and also give the expected output from translation (line 9). Lines 21-23 check that reading in the fasta file works, using a couple of assertions. Lines 28-30 check the translation works OK.

Note this isn't a script for the end user - it is for the programmer to check that code hasn't regressed to do less than it did before your latest changes! Note it produces no output - indicating all the tests passed! It would generate an assertion error if there was a problem.

      01  # Unit testing
      02
      03  import logging
      04  import bioinformatics
      05
(a)   06  # input and expected output
      07  testNucleotideSequenceName = ">test fasta sequence"
      08  testNucleotideSequence = "ctatcggagccattc"
      09  testProteinSequence = "l s e p f"
      10
      11  # create input fasta file
      12  fastaFilename = "testFastaFile.txt"
      13
      14  fastaFile = file(fastaFilename, "w")
      15  print >>fastaFile, testNucleotideSequenceName
      16  print >>fastaFile, testNucleotideSequence[:8]
      17  print >>fastaFile, testNucleotideSequence[8:]
      18  fastaFile.close()
      19
      20  # Unit test readFastaFile
      21  (sequenceName, nucleotideSequence) = bioinformatics.readFastaFile(fastaFilename)
      22  assert(sequenceName == testNucleotideSequenceName)
      23  assert(nucleotideSequence == testNucleotideSequence)
      24
      25  codonToAminoAcid = bioinformatics.readCodonTable('codons.txt')
      26
(b)   27  # Unit test translateNucleotideSequence
      28  proteinSequence = bioinformatics.translateNucleotideSequence(
      29      nucleotideSequence, codonToAminoAcid)
(c)   30  assert(proteinSequence == testProteinSequence)
      31
      32  logging.debug("Finished")

Script output

    (no output)

Step 15 - using a third-party library

Finally, possibly the most reliable way to produce bug-free code is to use code that has been written and tested by tens or thousands of other people. The BioPython library can load a fasta file and translate it in just two lines! Third-party libraries also include more functions which would be time-consuming to write. For example, our code assumed that the fasta file had just one sequence - in fact it can have many, and BioPython takes care of that. BioPython also considers the "transcription" step to an RNA sequence, and can transcribe the reverse complement - functionality our code did not include. Save time and trouble by using libraries!

      # Using a third party library

(a)   import Bio.Seq
      import Bio.SeqIO
      from Bio.Alphabet import IUPAC

(b)   nucleotideSequence = Bio.SeqIO.parse(file("fasta_file.txt", "r"), "fasta").next()

(c)   print Bio.Seq.translate(Bio.Seq.transcribe(nucleotideSequence.seq))

Script output

    KSTCAFI*KKNTIRYCPEMFWKERSLEKSLVFIIIIIIIILIILIISLKWKQYCFSGFCCMKCKKRDGFQ*HI*SLLTKIMTINYTASCKIRLVLKQTTPVHPTPLPFPEPPKRSKKLLRCCHWEIFAESNNKTETCWRL
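To make the single-sequence assumption concrete, here is a minimal sketch of a reader that copes with several sequences in one fasta file. It is written in plain Python 3 (the scripts in these notes are Python 2), it is not how BioPython does it, and the name read_fasta_records and the example data are made up for illustration.

```python
import io


def read_fasta_records(handle):
    """Yield (name, sequence) pairs from an open fasta-formatted file."""
    name = None
    fragments = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(fragments)
            name = line[1:]
            fragments = []
        elif name is not None:
            # Sequence lines before any header are ignored.
            fragments.append(line.lower())
    # The last record has no following header to close it.
    if name is not None:
        yield name, "".join(fragments)


# In-memory stand-in for a two-sequence fasta file (made-up data).
example = io.StringIO(">first\nCTATCG\nGAGCCATTC\n>second\nATGTGG\n")
records = list(read_fasta_records(example))
for name, sequence in records:
    print(name, len(sequence))
```

Even this short function has several edge cases to get right (blank lines, text before the first header, the final record), which is rather the point of the step above: for real work, Bio.SeqIO.parse already handles them and is the better choice.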
