11.07.2015 Views

Computer Exercise 3: Introduction to Bioinformatics 1


C. Baer, PCB 4674 Fall 2010

Now return to the notes you took on CDS or reference the page you bookmarked for the human sequence file. Notice that the coordinates for the human CDS start at position 85 of the sequence. Edit the human sequence to remove all nucleotides upstream of the start codon (atg) at position 85. To do so, click on “Props” > “Allow seq. Editing”, put the cursor on the 85th nucleotide, and use the backspace key on your keyboard to delete the beginning of the sequence. Now translate the edited human adh sequence in all three forward reading frames and see what changes.

Q4. WHICH WAS THE APPROPRIATE CODING FRAME OF THE THREE YOU TRIED? EXPLAIN.

3. Remember that a chromosome has two strands. By convention, cDNA is reported as the "sense" strand, so the step of going from DNA->RNA->Polypeptide results in the actual polypeptide the gene encodes. You can also use SeaView to get the reverse complement and/or translate the reverse complement of a sequence.

III. Sequence Alignment Using Muscle

1. To import your remaining three sequences into a SeaView alignment, click on “File” > “Open” and choose the file you want to import. The sequence will open in a new window. In this window, select the sequence by clicking on its name (the background becomes black) and then click on “Edit” > “Copy selected sequence”. Go back to the window with the human sequence and click on “Edit” > “Paste alignment data”. Repeat these steps until all your sequences are in the same window. Refer to your notes on the CDS and delete all sequence before the start position of the CDS. The start codon should be ATG (e.g., position 85 in the human cDNA; however, the gorilla will begin with GTC…; don’t mind this). Note that you can change the names of your sequences to include the species name (select a sequence, and click on “Edit” > “Rename Sequence”).

2. Select all four sequences by clicking on “Edit” > “Select all”. Click on “Align” > “Align all”. The alignment program used is called “Muscle”. It is a fast and fairly accurate method to align nucleotide and amino-acid sequences. Look at the aligned sequences in both nucleic acid and amino acid form (include gaps at the beginning to obtain the correct reading frame if necessary). Note that the aligned region is where both sequences are present; it does not include the regions of dashes.

Q5. WHICH SEQUENCE IS MOST DIFFERENT FROM THE OTHERS (over the aligned length)?

Questions 6-7 refer to only the aligned human and mouse sequences.

Q6. IN THE AMINO ACID SEQUENCE, FIND THE (FIRST) STOP CODON AND RECORD ITS POSITION. POSITIONS BEFORE (“UPSTREAM OF”) THIS STOP CODON REFER TO THE CDS. COUNT THE AMINO ACID SUBSTITUTIONS UPSTREAM OF THIS STOP CODON, AND THEN COUNT THE SUBSTITUTIONS DOWNSTREAM OF THIS STOP CODON.

To make it easier to find each substitution, select the human sequence by clicking on its name and click on “Props” > “by reference”.
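If you want to check your manual counts, the bookkeeping behind counting substitutions on either side of the first stop codon can be sketched in a few lines of Python. This is only a minimal sketch, not part of SeaView: the function names and the two short sequences are made up for illustration, the inputs are assumed to be aligned amino acid strings of equal length with "*" for a stop and "-" for a gap, and the choice to skip gap columns and the stop column itself is an assumption you may want to change.

```python
def first_stop(protein):
    """Return the 0-based index of the first '*' (stop), or len(protein) if none."""
    idx = protein.find("*")
    return len(protein) if idx == -1 else idx


def count_substitutions(ref, other):
    """Count substitutions upstream and downstream of the first stop in ref.

    Returns (stop_index, counts), where counts maps each region to
    [substitutions, columns_compared]. Columns with a gap in either
    sequence, and the stop column itself, are skipped (an assumption).
    """
    if len(ref) != len(other):
        raise ValueError("aligned sequences must have equal length")
    stop = first_stop(ref)
    counts = {"upstream": [0, 0], "downstream": [0, 0]}
    for i, (a, b) in enumerate(zip(ref, other)):
        if i == stop or a == "-" or b == "-":
            continue
        region = "upstream" if i < stop else "downstream"
        counts[region][1] += 1
        if a != b:
            counts[region][0] += 1
    return stop, counts


# Toy aligned sequences (hypothetical, NOT the real ADH proteins).
human = "MSTAGKVIKC*RLPQ-SE"
mouse = "MSTAGKVIRC*RLSQAS-"

stop, counts = count_substitutions(human, mouse)
print(f"first stop at alignment position {stop + 1} (1-based)")
for region, (subs, compared) in counts.items():
    percent = 100.0 * subs / compared if compared else 0.0
    print(f"{region}: {subs} of {compared} compared positions differ ({percent:.1f}%)")
```

Dividing by the number of compared columns, rather than by the full alignment length, keeps gapped regions from diluting the percentage.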


Q7. WHAT IS THE TOTAL FRACTION OF AMINO ACID SUBSTITUTIONS (in percent) BETWEEN HUMAN AND MOUSE? IS THE AA SUBSTITUTION RATE HIGHER UPSTREAM OR DOWNSTREAM OF THE STOP CODON?

Questions 8-9 refer to the aligned human and chimp sequences.

Q8. IN THE AMINO ACID SEQUENCE, FIND THE (FIRST) STOP CODON AND RECORD ITS POSITION. COUNT THE AMINO ACID SUBSTITUTIONS UPSTREAM OF THIS STOP CODON, AND THEN COUNT THE SUBSTITUTIONS DOWNSTREAM OF THIS STOP CODON.

Q9. WHAT IS THE TOTAL FRACTION OF AMINO ACID SUBSTITUTIONS (in percent) BETWEEN HUMAN AND CHIMP? IS THE AA SUBSTITUTION RATE HIGHER UPSTREAM OR DOWNSTREAM OF THE STOP CODON?

Q10. REFER TO YOUR RESPONSES FOR QUESTIONS 7 AND 9. ARE YOU SURPRISED BY YOUR RESULTS? WHY?

IV. Finding Genomic Sequence by BLAST

In this section you will find genomic DNA sequence that contains the ADH-IB gene, introns and all.

1. Go to the NCBI web page (http://www.ncbi.nlm.nih.gov/). It would be a good idea to quickly read through the material on the "Getting Started" link under the “Help” menu. "BLAST" stands for Basic Local Alignment Search Tool and combines an alignment algorithm with database search tools. It is the standard method for mining genomic data, and if you continue in biological science you will almost certainly use it at some point in your career. To begin, look for “BLAST” under Resources > DNA & RNA. Then select “nucleotide blast” under “Basic BLAST”.

2. Enter the mouse accession number given to you on the first page of this lab. In the "choose search set" window, use the “database” drop-down menu to select "whole-genome shotgun reads (wgs)"; this is the method by which the mouse genome was sequenced. Just below, limit the organism to “mouse.” Go to the bottom of the page, click on the “BLAST” button, and wait until the results appear. This might take a few minutes.

3. When the results appear, scroll down to the frame with the colored features and the list of sequences producing alignments. The "query" sequence is the sequence you entered into the BLAST search; the colored lines are links to the results producing alignments. Of the aligned sequences, the smaller the "E value," the better the match ("E" stands for "expected by chance"). Click on the "max score" of the sequence with the highest score (if two or more alignments have the same score, just choose the top one); it will take you down to the sequence alignment. In addition to the score and the E value, the % identity, the number of gaps, and the strand info are presented (plus and minus are arbitrary references). The alignments report the position of the query sequence and of the "subject" (target) sequence. See what the alignments look like. Now use the "back" feature to return to the list of sequences. Record the accession number and the max score of the sequence with the highest score (or the top one, if two or more have the same max score). Then click on the accession number to find the gi number and the sequence length from the NCBI page.
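To see why a higher score goes with a smaller E value, it can help to look at the standard approximation that relates the two: for a bit score S', the number of matches expected by chance is roughly E = m * n * 2^(-S'), where m is the query length and n is the total length of the database searched. The sketch below is only an illustration under that approximation. The function name and both lengths are hypothetical, and because BLAST itself uses corrected ("effective") lengths, the E values on your results page will not match these numbers exactly.

```python
def expected_hits(bit_score, query_len, db_len):
    """Approximate E value for a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical sizes for illustration only (not taken from the lab).
QUERY_LEN = 1_400            # assumed cDNA length in bp
DB_LEN = 3_000_000_000       # assumed total database length in bp

for bits in (30, 40, 50, 100):
    e_value = expected_hits(bits, QUERY_LEN, DB_LEN)
    print(f"bit score {bits:>3}: E ~ {e_value:.3g}")
```

Each additional bit of score halves E, so a match that scores 10 bits higher is expected by chance about 1000 times less often against the same database.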


Q11. WHAT IS THE gi NUMBER, THE ACCESSION NUMBER, AND THE MAX SCORE OF THE SEQUENCE WITH THE HIGHEST SCORE?

Q12. HOW LONG IS THE SEQUENCE?

Repeat this exercise for the human cDNA and find the human genomic sequence that provides the best match to the cDNA. You can BLAST the human cDNA accession number given to you on page 1 of this lab, but remember to limit the search to human. Also, if two or more alignments have the same max score, just choose the top one.

Q13-14. REPEAT QUESTIONS 11-12 FOR THE HUMAN GENOMIC SEQUENCE.

V. Finding Introns with BLAST

Next we will use BLAST to help find the introns in the gene. Recall that the cDNA sequence was obtained by reverse transcription of an mRNA. Thus, the sequence had the introns spliced out.

1. Return to the BLAST home page and look for the "Specialized BLAST" heading. Click on "align two sequences using BLAST". The default "Program" will be "blastn," which aligns nucleotide sequences. Enter the accession number for the mouse cDNA given to you on page 1 of this lab into the “Sequence 1” window. Enter the accession number or gi number of the mouse wgs genomic sequence you just found in question 11 into the "Sequence 2" window and click the "align" button. The reason you are using BLAST to align these sequences rather than ClustalW is that ClustalW is not good at handling long regions of unmatched sequence, as occur when you attempt to align cDNA sequence with genomic sequence.

2. When the results appear, scroll down through the output. Note that the query sequence aligns to the target sequence in pieces; the pieces of the query sequence correspond to the exons of the gene; the missing pieces of the target sequence are the introns. Note that the longest matching sequence is listed first, so you will have to piece together the sequence of the gene. Also note that there may be small amounts of overlapping sequence, and that there are a few small pieces of the query sequence that appear to match other pieces of the mouse genomic DNA contig ("contig" stands for "contiguous sequence"; a contig is built by piecing together short sequence reads (~1 kb), the process by which whole chromosome sequence is assembled).

Q15. HOW MANY EXONS ARE IN THE MOUSE ADH-1 GENE?

Q16. WHAT ARE THE LENGTHS OF THE EXONS? (e.g., "Exon 1 = x1, Exon 2 = x2," etc.)

Q17. WHAT ARE THE LENGTHS OF THE INTRONS (ROUGHLY, WITHIN A FEW BP)?

VI. Comparative Genomics

The last exercise is to compare the human ADH-IB gene with the mouse ADH1 gene. Return to the "BLAST 2 sequences” window. In the first window (sequence 1), enter the sequence accession number for the human cDNA given to you on page 1 of this lab. In the second window, enter the accession number or gi number from the mouse genomic DNA sequence you determined in question 11. Click "Align".

Q18. HOW MANY REGIONS OF ALIGNMENT ARE THERE?
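The piecing-together described in section V (and needed again for the regions of alignment in section VI) amounts to sorting the alignment blocks by their position in the query and then measuring the gaps between consecutive blocks in the genomic sequence. The sketch below shows one way to do that bookkeeping. It is a minimal sketch under stated assumptions: the `HSP` record, the function names, and all coordinates are hypothetical (they are not real ADH results), positions are 1-based and inclusive as BLAST reports them, and both sequences are assumed to align on the plus strand.

```python
from collections import namedtuple

# One alignment block: query (cDNA) range and subject (genomic) range.
HSP = namedtuple("HSP", "q_start q_end s_start s_end")


def order_by_query(hsps):
    """Sort blocks by where they start in the cDNA query."""
    return sorted(hsps, key=lambda h: h.q_start)


def exon_lengths(hsps):
    """Length of each aligned query piece, in query order."""
    return [h.q_end - h.q_start + 1 for h in order_by_query(hsps)]


def intron_lengths(hsps):
    """Unaligned genomic bases between consecutive blocks, in query order."""
    ordered = order_by_query(hsps)
    return [nxt.s_start - prev.s_end - 1
            for prev, nxt in zip(ordered, ordered[1:])]


# Hypothetical blocks, listed longest first as BLAST reports them.
blocks = [
    HSP(301, 520, 9101, 9320),
    HSP(1, 120, 1001, 1120),
    HSP(119, 300, 4500, 4681),
]

for number, length in enumerate(exon_lengths(blocks), start=1):
    print(f"Exon {number} = {length} bp")
for number, length in enumerate(intron_lengths(blocks), start=1):
    print(f"Intron {number} ~ {length} bp")
```

In this toy example the first two blocks overlap by two query bases (119-120), so those bases are counted in both exons. That kind of small overlap is why the intron lengths are only asked for roughly, within a few bp.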


Q19. USING THE HUMAN GENE AS A FRAME OF REFERENCE, FIND THE REGIONS IN THE MOUSE GENOMIC DNA THAT ARE HOMOLOGOUS (I.E., REGION OF SEQUENCE SIMILARITY) TO EACH EXON IN THE HUMAN GENE. GIVE A LIST OF MATCHING HUMAN-MOUSE SEQUENCE, IDENTIFIED BY THE POSITIONS IN THE RESPECTIVE GENBANK SEQUENCE FILES.
