Midterm Exam

Please use only the spaces provided !! 

Write clearly and legibly !! 

Spelling and grammar are important !! 

GENOMICS 5301 

MID-TERM EXAM — Answer Key 

OCTOBER 16, 2003 

True or False / Multiple Choice (2 Points each) 

1. Which of the following factors does not affect the E-value? 

a. Size of the target database 

b. Length of the alignment discovered 

c. Organism where the query sequence originated 

d. Blast score 

2. T/F _F_ The BLAST algorithm is a heuristic algorithm. This means that it is 

guaranteed to find the optimum alignment, and to find it the quickest. 

3. T/F _T_ It would make no sense to say that two sequences are 75% homologous. 

4. T/F _F_ Computer programs today for “finding” genes in unannotated genome 

sequence are extremely accurate, rarely missing real genes or declaring genes that 

don’t, in fact, exist. 

5. T/F _F_ Genes are approximately evenly distributed along the entire length of 

chromosomes in Arabidopsis and other plants that have been examined so far. 

Find the BEST Match for Each Genomic Term (2 Points Each) 

1. _f_ Homologue 

2. _c_ Minimum Tiling Path 

3. _i_ Chromosome Walking 

4. _h_ E-value 

5. _l_ Genbank 

6. _d_ Segmental Duplication 

7. _n_ Tandem Duplication 

8. _a_ Phrap 

9. _g_ Syntenic 

10. _e_ Colinear 

a. A program that finds overlapping 

regions in sequencing data 

b. A program that calculates the 

strength of evidence for each basecall 

as part of DNA sequencing 

c. Order of clones that most efficiently 

covers a genome region 

d. Multiple copies of a genome region 

derived from ancient genome 

rearrangement(s) 

e. Sequences in the same linear order in 

two or more taxa 

f. Sequences that share a common 

ancestor 

g. Sequences physically linked to one 

another in two or more taxa 

h. Probability of finding an alignment 

with a given Blast score 

i. Gene cloning based on a marker 

located close to a target, followed by 

searching for clones progressively 

closer 

j. Gene cloning based on isolating 

mRNA & cDNAs coming from a 

tissue or treatment of interest 

k. A database of conserved protein 

sequence motifs 

l. A database of all public nucleotide 

and protein sequences 

m. Related sequences separated from 

one another by duplication 

n. Multiple copies of a genome region 

located adjacent to one another 

Short Answer (8 points each) — Use Only the Space Provided! 

1. What is the difference between an unknown protein and a hypothetical protein? 

A “hypothetical” protein is a predicted entity never before seen in nature. A hypothetical 

protein is predicted directly from genome sequence by one of many computer algorithms 

that search for sequence regions that have the hallmarks of protein-coding genes. By 

contrast, an “unknown” protein is known to exist in nature because its corresponding 

mRNA has been observed at least once, either in a targeted cDNA study or as part of an 

EST project. 

2. Imagine you have been invited to a local high school to talk to students about genomics. 

The very first student asks “So, what is genomics anyway?” How do you answer? 

There is no one correct definition for genomics. A good answer would include the 

following features: A perspective on biology in which all of the genes or gene products 

are viewed together to reveal interesting or important properties. Genomics is based on 

massive amounts of DNA sequence data and requires the use of high-powered 

computer tools. 

3. Assume that you’ve used the program FPC to construct the BAC contig map shown 

below. What additional steps would you use to validate the map, and do so in a highly 

efficient manner? (Use any of the genomic/molecular techniques we’ve talked about in 

class plus those you’ve learned about in other classes.) 

[Figure: BAC contig map of clones 29F19, 84J03, 19B16, and 65J15] 

A BAC contig map, such as the one displayed here, would typically be constructed by 

DNA fingerprinting and the use of an algorithm such as FPC. This approach is usually 

effective at discovering BAC clones that overlap one another, but it frequently makes 

errors about the exact order or the degree of overlap. Before using such a physical map 

as a basis for generating a minimum tiling path (and subsequent DNA sequencing), 

scientists frequently sequence just the ends of BAC clones and use this information to 

generate probes/PCR primers from those BAC-ends. The probes and PCR primers can 

then be used to determine whether other BAC clones in the physical map really do have 

the predicted orientation through a combination of Southern blotting and PCR 

amplification. 

4. Briefly compare and contrast BAC-by-BAC versus Shotgun genome sequencing. Be sure 

to illustrate the advantages and disadvantages of both. (10 points) 

A BAC-by-BAC approach to genome sequencing begins with the construction of 

detailed genetic maps (segregation analysis) and physical maps (fingerprint/FPC 

of BAC clones). These maps provide the basis for predicting a minimum tiling path of 

BAC clones, typically verified through BAC-end sequencing. The chosen BAC clones are 

subjected to finished sequencing (~8X redundant coverage of a shotgun sub-library, 

followed by closure) and the underlying genome sequence is reconstructed by analyzing 

overlaps between clones and predicting the underlying sequence for each chromosome 

arm. 
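
Choosing a minimum tiling path from a finished physical map can be pictured as an interval-cover problem. The sketch below is only an illustration under simple assumptions: each clone is reduced to a (left, right) position on the map, the clone names and coordinates are hypothetical, and the function name minimum_tiling_path is ours rather than part of FPC or any other tool. It uses the standard greedy rule of always extending coverage with the clone that reaches farthest to the right.

```python
def minimum_tiling_path(clones, start, end):
    """Greedy interval cover: fewest clones spanning [start, end].

    clones maps a clone name to hypothetical (left, right) map coordinates.
    Returns the chosen clone names in map order, or None if a gap exists.
    """
    ordered = sorted(clones.items(), key=lambda kv: kv[1][0])
    path = []
    covered = start  # right-most map position covered so far
    i = 0
    while covered < end:
        best_name, best_right = None, covered
        # Among clones starting at or before the covered edge,
        # keep the one that extends farthest to the right.
        while i < len(ordered) and ordered[i][1][0] <= covered:
            name, (left, right) = ordered[i]
            if right > best_right:
                best_name, best_right = name, right
            i += 1
        if best_name is None:
            return None  # no clone bridges the next stretch of the map
        path.append(best_name)
        covered = best_right
    return path


# Hypothetical clones with positions in kbp.
example_clones = {
    "bacA": (0, 120),
    "bacB": (40, 150),
    "bacC": (100, 230),
    "bacD": (140, 260),
    "bacE": (200, 320),
}
print(minimum_tiling_path(example_clones, 0, 320))
```

In practice the overlaps predicted by fingerprinting are uncertain, which is why the path is verified by BAC-end sequencing before any clone is committed to sequencing.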

A shotgun sequencing approach begins with random (sheared) fragmenting of the 

genome to construct a very large shotgun library, which is size-selected to enrich for two 

size classes, typically 2 kbp and 10 kbp. These clones are all end-sequenced (taking 

care to keep track of which end sequences come from the same clone) and this process 

is carried out to >4X redundant coverage. In the end, millions of short reads are 

produced, which must be reassembled into a best guess of the genome sequence. 
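
The redundancy figures quoted above can be related to expected genome coverage with a short calculation. This is a rough sketch that assumes the idealized Lander-Waterman (Poisson) model, in which reads land uniformly at random and the expected fraction of the genome covered at c-fold redundancy is 1 - e^(-c). Real libraries deviate from this because of cloning bias and repeats, so the numbers are an upper-bound intuition rather than a prediction. The function name is ours.

```python
import math


def expected_fraction_covered(redundancy):
    """Expected fraction of bases covered at least once under the
    idealized Poisson model: 1 - exp(-c), where c is fold redundancy."""
    return 1.0 - math.exp(-redundancy)


for c in (1, 4, 8):
    print(f"{c}X redundancy: {expected_fraction_covered(c):.4f} of genome covered")
```

Under these assumptions about 98% of the genome is covered at 4X, and the uncovered fraction shrinks by a factor of e with each additional fold of redundancy.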

Salient features of BAC-by-BAC sequencing: Very high quality genome sequence; 

Very well suited for comprehensive coverage; Well suited for studies of 

synteny/comparative genomics; Resolve cases of genome duplications; Significant upfront 

work & expense; Substantial project coordination; High levels of technical 

expertise; Very expensive on a per-base pair basis. 

Salient features of Shotgun sequencing: Comparatively low cost for >95% genome 

coverage; Can be carried out in short time-frame; Can be performed by a single (very 

well-equipped) laboratory; Relatively low levels of expertise required; Poor to modest 

quality sequence; Impossible to know what genome regions have been missed; Not well-suited 

for synteny/comparative genomic analysis. 

5. Fill in the following table: 

Marker             Ease of        Ease of            Work Across   Work Across     Suitable as
                   Development    Analysis/Scoring   Populations   Species/Genus   Map Anchor
Possible Answers   Easy/Hard      Easy/Hard          Yes/No        Yes/No          Yes/No
RFLP               H              H                  Y             Y               Y
RAPD               E              E                  N             N               N
SSR                H (moderate)   E                  Y             N (sometimes)   Y
AFLP               E              H                  N             N               N

Look at this BLAST report below and answer the following questions. (10 points) 

A B C 

1. Zea mays leucine-rich repeat transmembrane protein kinase 2 (ltk2)... 499 e-140 

2. Gossypium arboreum 7-10 dpa fiber library Gossypium arboreum cDNA... 384 e-105 

3. Gossypium arboreum 7-10 dpa fiber library Gossypium arboreum cDNA... 373 e-102 

4. Six-day Cotton fiber Gossypium hirsutum 5' similar to LRR kinase1... 319 1e-85 

5. Arabidopsis thaliana DNA chromosome 4, contig fragment No. 56... 284 1e-85 

6. Arabidopsis thaliana DNA chromosome 4, BAC clone F1N20... 284 1e-85 

7. Six-day Cotton fiber Gossypium hirsutum 5' similar to LRR kinase 1.. 312 1e-83 

8. Gm-c1036-1924 5' similar to LRR TRANSMEMBRANE KINASE I.. 307 3e-82 

9. Nodulated root Medicago truncatula cDNA clone NF030H07NR 305 1e-81 

10. Arabidopsis thaliana chromosome 1 YAC YUP8H12R sequence,... 141 1e-80 

11.tomato fruit mature green, cDNA clone cLEF46B14 5', mRNA sequence ... 284 4e-75 

1. What is the meaning of the numbers in the column C and how are these numbers calculated, at 

least in a conceptual sense? 

Column C shows the Expectation Value, which is the likelihood of observing a sequence 

alignment with the Blast score (shown in column B) by chance -- given the corresponding 

alignment length and target database size. 
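
The dependence of the expectation value on score and database size can be illustrated numerically. The sketch below assumes the commonly cited Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score and the search space is approximated as query length m times database length n. It ignores the length adjustments and multi-segment statistics that BLAST applies in practice, and the lengths used are hypothetical, so it will not reproduce the values in the report above.

```python
def expect_value(bit_score, query_len, db_len):
    """Simplified E-value: search space (m * n) scaled by 2^(-bit score)."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical query of 1,000 residues against a 1e9-residue database.
base = expect_value(100, 1_000, 1e9)
bigger_db = expect_value(100, 1_000, 2e9)      # database twice as large
better_score = expect_value(101, 1_000, 1e9)   # one extra bit of score

print(f"baseline E       = {base:.3e}")
print(f"doubled database = {bigger_db:.3e}")
print(f"score + 1 bit    = {better_score:.3e}")
```

In this simplified form, doubling the database doubles the E-value and each additional bit of score halves it, while nothing about the organism of origin enters the calculation.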

2. Look at hits 10 and 11. How is it possible that #10 (Arabidopsis…) has a lower value in column 

B, but a more significant value in column C -- compared with #11 (tomato fruit…), which has a 

higher value in column B but a less significant value in column C? 

The most likely reason that the two Blast hits have similar expectation values (column 
C) but very different Blast scores (column B) is that the lengths of the underlying 
alignments differ. The alignment in line 10 is probably much shorter than the 
alignment in line 11. 

3. The query used in this search came from flax and the BLAST results seem to indicate it is some 

sort of leucine-rich repeat transmembrane kinase. How confident would you be of this 

functional assignment and why? What is one additional type of information that you could look 

for “informatically” that would increase your confidence? (There is no one right answer to the 

second part of this question). 

The flax query sequence is probably a member of the leucine-rich repeat transmembrane 
kinase (LRR-TM-kinase) family, because several of its Blast hits are annotated as 
such. Only one hit, however, is an original annotation of an LRR-TM-kinase (the top hit, 
ltk2 from maize); all of the others are derived annotations (“similar to…”). Therefore, the 
best evidence, and really the only evidence shown, is the top hit to ltk2, together with its 
extremely low expectation value (e-140). The other Blast hits do help to reinforce 
the conclusion that the query belongs to a large, well-defined protein family, and the top 
hit indicates it is probably related to LRR-TM-kinases. 

Your advisor asks you to consider a project to follow up on the research of a previous 

graduate student. That student mapped the location of a major quantitative trait locus (QTL) 

controlling aluminum tolerance in maize. You are being asked to clone the Al-tolerance locus by 

a combination of positional cloning and comparative genomics. (20 points) 

DO NOT write any part of your answer outside the lines – and be sure to write neatly. You 

should organize your thoughts (maybe even prepare an outline ahead of time) so that your 

answer fits in the space below. 

1. Briefly describe the key steps that the previous student probably used to “map” the Al-tolerance QTL. 

2. If you wanted to clone the maize gene based solely on map position, potentially including 

information gathered from the rice genome sequencing project, what strategy would you use? 

(Be sure to mention the key steps in your strategy.) 

3. Of course, maize and rice both have gene/genome duplications. What will you do to address 

this added complexity? 

To map the QTL underlying aluminum tolerance in maize, the previous graduate 

student probably screened several maize genotypes for this trait, and looked for two 

lines with distinctly contrasting responses to aluminum. The student would have made a 

cross and generated a segregating population (F2, RIL, etc.). The individuals in this 

population would be screened for aluminum response and also for hundreds of DNA 

markers distributed throughout the genome. QTL(s) for this trait would be tagged 

(located) by those markers showing the highest statistical correlation with aluminum 

tolerance. 
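
The final step, finding the markers most strongly associated with the trait, can be sketched as a single-marker analysis. This is a deliberately simplified illustration: genotypes are coded 0/1 for the two parental alleles (as in a RIL population), the marker names and phenotype scores are invented toy data, and markers are ranked by the absolute Pearson correlation with the phenotype. Real QTL studies normally use interval mapping and LOD scores rather than raw correlations.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def rank_markers(genotypes, phenotypes):
    """Rank markers by |correlation| between allele calls and the trait."""
    scores = {name: abs(pearson(calls, phenotypes))
              for name, calls in genotypes.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: eight lines scored for a hypothetical aluminum-response index.
phenotypes = [9.1, 8.7, 3.2, 2.9, 8.9, 3.5, 9.4, 3.0]
genotypes = {
    "m1": [1, 1, 0, 0, 1, 0, 1, 0],  # co-segregates with the trait
    "m2": [1, 1, 0, 0, 1, 0, 0, 0],  # one recombinant line
    "m3": [1, 0, 1, 0, 1, 0, 1, 0],  # weakly associated
}
for name, score in rank_markers(genotypes, phenotypes):
    print(f"{name}: |r| = {score:.3f}")
```

The marker with the highest score would be taken as the one tagging the QTL, and the drop in association at neighboring markers reflects recombination between them and the locus.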

To clone the aluminum tolerance locus by map position, I would choose several of 

the markers nearest the gene (by segregation analysis) and use these markers to 

screen a maize BAC library. Ideally, some or all of the clones identified by these markers 

would overlap one another based on fingerprint analysis. More likely, a small number of 

BAC contigs would be uncovered and one or more “chromosome walking” steps would 

be required, in which previously discovered BAC clones are used to identify new, 

overlapping clones. Once this process identifies clones that must, based on genetic and 

physical mapping, span the region containing the aluminum tolerance locus, I would be 

in a position to look for candidate sequences. If, in the course of chromosome walking, I 

run into problems in maize, I could potentially refer to syntenic regions in rice, with its 

nearly sequenced genome, as a new source of probes to seek BAC clones at or near 

the aluminum tolerance locus. 

Maize is known to be an ancient polyploid and most genome regions are duplicated. 

This would be a special problem during chromosome walking, because it might happen 

that I “walk” right off of the target genome region (where the aluminum locus is found) 

and onto a duplicated segment. The rice syntenic region might help (probably not), so for 

every chromosome walking step, I will need to verify (by PCR, sequencing, 

fingerprinting, or segregation analysis) that the step is keeping me in the target region. 
