Midterm Exam

Please use only the spaces provided!!

Write clearly and legibly!!

Spelling and grammar are important!!

GENOMICS 5301

MID-TERM EXAM: Answer Key

OCTOBER 16, 2003

True or False / Multiple Choice (2 Points each)

1. _c_ Which of the following factors does not affect the E-value?

a. Size of the target database
b. Length of the alignment discovered
c. Organism where the query sequence originated
d. Blast score

2. T/F _F_ The BLAST algorithm is a heuristic algorithm. This means that it is guaranteed to find the optimum alignment, and to find it the quickest.

3. T/F _T_ It would make no sense to say that two sequences are 75% homologous.

4. T/F _F_ Computer programs today for “finding” genes in unannotated genome sequence are extremely accurate, rarely missing real genes or declaring genes that don’t, in fact, exist.

5. T/F _F_ Genes are approximately evenly distributed along the entire length of chromosomes in Arabidopsis and other plants that have been examined so far.

Find the BEST Match for Each Genomic Term (2 Points Each)

1. _f_ Homologue
2. _c_ Minimum Tiling Path
3. _i_ Chromosome Walking
4. _h_ E-value
5. _l_ Genbank
6. _d_ Segmental Duplication
7. _n_ Tandem Duplication
8. _a_ Phrap
9. _g_ Syntenic
10. _e_ Colinear

a. A program that finds overlapping regions in sequencing data
b. A program that calculates the strength of evidence for each basecall as part of DNA sequencing
c. Order of clones that most efficiently covers a genome region
d. Multiple copies of a genome region derived from ancient genome rearrangement(s)
e. Sequences in the same linear order in two or more taxa
f. Sequences that share a common ancestor
g. Sequences physically linked to one another in two or more taxa
h. Probability of finding an alignment with a given Blast score
i. Gene cloning based on a marker located close to a target, followed by searching for clones progressively closer
j. Gene cloning based on isolating mRNA & cDNAs coming from a tissue or treatment of interest
k. A database of conserved protein sequence motifs
l. A database of all public nucleotide and protein sequences
m. Related sequences separated from one another by duplication
n. Multiple copies of a genome region located adjacent to one another

Short Answer (8 points each): Use Only the Space Provided!

1. What is the difference between an unknown protein and a hypothetical protein?

A “hypothetical” protein is a predicted entity never before seen in nature. A hypothetical protein is predicted directly from genome sequence by one of many computer algorithms that search for sequence regions that have the hallmarks of protein-coding genes. By contrast, an “unknown” protein is known to exist in nature because its corresponding mRNA has been observed at least once, either in a targeted cDNA study or as part of an EST project.

2. Imagine you have been invited to a local high school to talk to students about genomics. The very first student asks “So, what is genomics anyway?” How do you answer?

There is no one correct definition for genomics. A good answer would include the following features: A perspective on biology in which all of the genes or gene products are viewed together to reveal interesting or important properties. Genomics is based on massive amounts of DNA sequence data and requires the use of high-powered computer tools.

3. Assume that you’ve used the program FPC to construct the BAC contig map shown below. What additional steps would you use to validate the map, and to do so in a highly efficient manner? (Use any of the genomic/molecular techniques we’ve talked about in class plus those you’ve learned about in other classes.)

[Figure: BAC contig map of clones 29F19, 84J03, 19B16, and 65J15]

A BAC contig map, such as the one displayed here, would typically be constructed by DNA fingerprinting and the use of an algorithm such as FPC. This approach is usually effective at discovering BAC clones that overlap one another, but it frequently makes errors about the exact order or the degree of overlap. Before using such a physical map as a basis for generating a minimum tiling path (and subsequent DNA sequencing), scientists frequently sequence just the ends of BAC clones and use this information to generate probes/PCR primers from those BAC-ends. The probes and PCR primers can then be used to determine whether other BAC clones in the physical map really do have the predicted orientation through a combination of Southern blotting and PCR amplification.
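
Once the order and overlaps in a contig have been validated, choosing a minimum tiling path can be treated as an interval-covering problem. The sketch below is one simple way to do that (a greedy cover), not the method used by FPC. The clone names come from the question, but the start and end coordinates are hypothetical values invented for illustration, as is the function name `minimum_tiling_path`.

```python
from typing import Dict, List, Tuple


def minimum_tiling_path(
    clones: Dict[str, Tuple[int, int]], start: int, end: int
) -> List[str]:
    """Greedy interval cover.

    Repeatedly pick the clone that begins at or before the position covered
    so far and extends furthest to the right, until `end` is reached.
    """
    path: List[str] = []
    covered = start
    remaining = dict(clones)
    while covered < end:
        # Clones that overlap the covered region and extend beyond it.
        candidates = {
            name: (s, e)
            for name, (s, e) in remaining.items()
            if s <= covered < e
        }
        if not candidates:
            raise ValueError(f"gap in contig at position {covered}")
        best = max(candidates, key=lambda name: candidates[name][1])
        path.append(best)
        covered = candidates[best][1]
        del remaining[best]
    return path


# Hypothetical clone coordinates in kb (NOT taken from the exam figure).
clones = {
    "29F19": (0, 120),
    "84J03": (40, 150),
    "19B16": (100, 230),
    "65J15": (210, 330),
}

path = minimum_tiling_path(clones, 0, 330)
print(path)
```

With these made-up coordinates, clone 84J03 lies entirely within the span of its neighbors and is left out of the path. An error in the assumed overlaps would change the result, which is why the map is validated first.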

4. Briefly compare and contrast BAC-by-BAC versus Shotgun genome sequencing. Be sure to illustrate the advantages and disadvantages of both. (10 points)

A BAC-by-BAC approach to genome sequencing begins with the construction of a detailed genetic map (by segregation analysis) and physical map (by fingerprinting/FPC of BAC clones). These maps provide the basis for predicting a minimum tiling path of BAC clones, typically verified through BAC-end sequencing. The chosen BAC clones are subjected to finished sequencing (~8X redundant coverage of a shotgun sub-library, followed by closure) and the underlying genome sequence is reconstructed by analyzing overlaps between clones and predicting the underlying sequence for each chromosome arm.

A shotgun sequencing approach begins with random (sheared) fragmenting of the genome to construct a very large shotgun library, which is size-selected to enrich for two size classes, typically 2 kbp and 10 kbp. These clones are all end-sequenced (taking care to keep track of which end sequences come from the same clone) and this process is carried out to >4X redundant coverage. In the end, millions of short reads are produced, which must be reassembled into a best guess of the genome sequence.

Salient features of BAC-by-BAC sequencing: Very high quality genome sequence; Very well suited for comprehensive coverage; Well suited for studies of synteny/comparative genomics; Resolves cases of genome duplication; Significant up-front work & expense; Substantial project coordination; High levels of technical expertise; Very expensive on a per-base-pair basis.

Salient features of Shotgun sequencing: Comparatively low cost for >95% genome coverage; Can be carried out in a short time-frame; Can be performed by a single (very well-equipped) laboratory; Relatively low levels of expertise required; Poor to modest quality sequence; Impossible to know what genome regions have been missed; Not well-suited for synteny/comparative genomic analysis.
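
The link between redundancy (4X, 8X) and the fraction of the genome actually sampled can be sketched with the idealized Lander-Waterman (Poisson) model. That model is not part of the answer above. It assumes reads land uniformly at random, so real genomes with repeats and cloning bias will do worse. The function names are illustrative only.

```python
import math


def redundancy(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sequenced (the 'X' coverage)."""
    return num_reads * read_length / genome_size


def expected_fraction_covered(coverage: float) -> float:
    """Under a Poisson model, a base is missed with probability exp(-coverage)."""
    return 1.0 - math.exp(-coverage)


for c in (1, 4, 8):
    frac = expected_fraction_covered(c)
    print(f"{c}X redundancy: {frac:.4f} of bases expected in at least one read")
```

Under these assumptions, 4X redundancy leaves roughly 2% of bases unsampled, which is consistent with the ">95% genome coverage" figure quoted for shotgun sequencing.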

5. Fill in the following table:

                  Ease of        Ease of            Work Across   Work Across     Suitable as
                  Development    Analysis/Scoring   Populations   Species/Genus   Map Anchor
Possible Answers  Easy/Hard      Easy/Hard          Yes/No        Yes/No          Yes/No
RFLP              H              H                  Y             Y               Y
RAPD              E              E                  N             N               N
SSR               H (moderate)   E                  Y             N (sometimes)   Y
AFLP              E              H                  N             N               N

Look at this BLAST report below and answer the following questions. (10 points)

    A                                                                        B  C
 1. Zea mays leucine-rich repeat transmembrane protein kinase 2 (ltk2)...  499  e-140
 2. Gossypium arboreum 7-10 dpa fiber library Gossypium arboreum cDNA...   384  e-105
 3. Gossypium arboreum 7-10 dpa fiber library Gossypium arboreum cDNA...   373  e-102
 4. Six-day Cotton fiber Gossypium hirsutum 5' similar to LRR kinase1...   319  1e-85
 5. Arabidopsis thaliana DNA chromosome 4, contig fragment No. 56...       284  1e-85
 6. Arabidopsis thaliana DNA chromosome 4, BAC clone F1N20...              284  1e-85
 7. Six-day Cotton fiber Gossypium hirsutum 5' similar to LRR kinase 1..   312  1e-83
 8. Gm-c1036-1924 5' similar to LRR TRANSMEMBRANE KINASE I..               307  3e-82
 9. Nodulated root Medicago truncatula cDNA clone NF030H07NR               305  1e-81
10. Arabidopsis thaliana chromosome 1 YAC YUP8H12R sequence,...            141  1e-80
11. tomato fruit mature green, cDNA clone cLEF46B14 5', mRNA sequence ...  284  4e-75

1. What is the meaning of the numbers in column C, and how are these numbers calculated, at least in a conceptual sense?

Column C shows the Expectation Value (E-value): the number of sequence alignments with a Blast score at least as high as the one shown in column B that would be expected by chance, given the corresponding alignment length and target database size.
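
The calculation can be illustrated with the standard relation between a normalized (bit) score S' and the E-value, E = m × n × 2^(-S'), where m and n are the effective lengths of the query and the database. The sketch below is only that relation in code. The lengths used are hypothetical, and BLAST applies further corrections (effective lengths, and sum statistics when a hit has several aligned segments), so it will not reproduce the values in column C.

```python
def expect_value(bit_score: float, query_length: int, database_length: int) -> float:
    """E = m * n * 2**(-S'): expected number of chance hits scoring >= S'."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search space: a 300-residue query against 1e9 residues.
m = 300
n = 1_000_000_000

for score in (40, 50, 60):
    print(f"bit score {score}: E = {expect_value(score, m, n):.3g}")

# Same score, database twice as large: twice as many chance hits expected.
print(expect_value(50, m, 2 * n) / expect_value(50, m, n))
```

Each additional bit of score halves the E-value, while doubling the search space doubles it. This is why the same Blast score can be more or less significant depending on what was searched.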

2. Look at hits 10 and 11. How is it possible that #10 (Arabidopsis…) has a lower value in column B but a more significant value in column C, compared with #11 (tomato fruit…), which has a higher value in column B but a less significant value in column C?

The most likely reason that the two Blast hits have similar expectation values (column C) but very different Blast scores (column B) is that the lengths of the underlying alignments are different. The alignment in line 10 is probably much shorter than the alignment in line 11.

3. The query used in this search came from flax and the BLAST results seem to indicate it is some sort of leucine-rich repeat transmembrane kinase. How confident would you be of this functional assignment and why? What is one additional type of information that you could look for “informatically” that would increase your confidence? (There is no one right answer to the second part of this question.)

The flax query sequence is predicted to be a member of the leucine-rich repeat transmembrane kinase (LRR-TM-kinase) family because several of its Blast hits are annotated as such. Only one hit, however, is an original annotation of an LRR-TM-kinase (the top hit from maize, ltk2); all of the others are derived annotations (similar to…). Therefore, the best evidence, and really the only evidence shown, is the top hit to ltk2, plus the fact that the expectation value for this hit is so extremely small (e-140). The other Blast hits do help to reinforce the conclusion that the query belongs to a large, well-defined protein family, and the top hit indicates it is probably related to LRR-TM-kinases.

Your advisor asks you to consider a project to follow up on the research of a previous graduate student. That student mapped the location of a major quantitative trait locus (QTL) controlling aluminum tolerance in maize. You are being asked to clone the Al-tolerance locus by a combination of positional cloning and comparative genomics. (20 points)

DO NOT write any part of your answer outside the lines, and be sure to write neatly. You should organize your thoughts (maybe even prepare an outline ahead of time) so that your answer fits in the space below.

1. Briefly describe the key steps that the previous student probably used to “map” the Al-tolerance QTL.

2. If you wanted to clone the maize gene based solely on map position, potentially including information gathered from the rice genome sequencing project, what strategy would you use? (Be sure to mention the key steps in your strategy.)

3. Of course, maize and rice both have gene/genome duplications. What will you do to address this added complexity?

To map the QTL underlying aluminum tolerance in maize, the previous graduate student probably screened several maize genotypes for this trait and looked for two lines with distinctly contrasting responses to aluminum. The student would have made a cross and generated a segregating population (F2, RIL, etc.). The individuals in this population would be screened for aluminum response and also for hundreds of DNA markers distributed throughout the genome. QTL(s) for this trait would be tagged (located) by those markers showing the highest statistical correlation with aluminum tolerance.
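
The last step, finding the markers most strongly correlated with the trait, can be sketched as a single-marker scan. This is a deliberately simplified illustration; real QTL analyses typically use interval mapping and LOD scores across a genetic map. The marker names, genotype codes and phenotype scores below are all invented for the example.

```python
from math import sqrt
from typing import Dict, List


def pearson_r(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical F2 genotypes per individual:
# 0 = homozygous parent A, 1 = heterozygous, 2 = homozygous parent B.
genotypes: Dict[str, List[float]] = {
    "marker_1": [0, 1, 2, 0, 1, 2, 0, 2],
    "marker_2": [0, 0, 2, 1, 1, 2, 0, 2],
    "marker_3": [2, 0, 1, 1, 0, 2, 1, 0],
}

# Hypothetical aluminum-tolerance score for the same eight individuals.
tolerance: List[float] = [1.0, 1.1, 3.2, 2.0, 2.1, 3.0, 0.9, 3.1]

# r squared = fraction of trait variance explained by each marker.
r_squared = {
    name: pearson_r(calls, tolerance) ** 2 for name, calls in genotypes.items()
}
best_marker = max(r_squared, key=r_squared.get)

for name, value in sorted(r_squared.items()):
    print(f"{name}: r^2 = {value:.3f}")
print("marker most associated with tolerance:", best_marker)
```

In this toy data set, `marker_2` tracks the phenotype most closely, so the QTL would be placed nearest that marker.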

To clone the aluminum tolerance locus by map position, I would choose several of the markers nearest the gene (by segregation analysis) and use these markers to screen a maize BAC library. Ideally, some or all of the clones identified by these markers would overlap one another based on fingerprint analysis. More likely, a small number of BAC contigs would be uncovered and one or more “chromosome walking” steps would be required, in which previously discovered BAC clones are used to identify new, overlapping clones. Once this process identifies clones that must, based on genetic and physical mapping, span the region containing the aluminum tolerance locus, I would be in a position to look for candidate sequences. If, in the course of chromosome walking, I run into problems in maize, I could potentially refer to syntenic regions in rice, with its nearly sequenced genome, as a new source of probes to seek BAC clones at or near the aluminum tolerance locus.

Maize is known to be an ancient polyploid and most genome regions are duplicated. This would be a special problem during chromosome walking, because it might happen that I “walk” right off of the target genome region (where the aluminum tolerance locus is found) and onto a duplicated segment. The rice syntenic region might help (probably not), so for every chromosome walking step I will need to verify (by PCR, sequencing, fingerprinting, or segregation analysis) that my step is keeping me in the target region.
