Midterm Exam
Please use only the spaces provided !!<br />
Write clearly and legibly !!<br />
Spelling and grammar are important !!<br />
GENOMICS 5301<br />
MID-TERM EXAM — Answer Key<br />
OCTOBER 16, 2003<br />
True or False / Multiple Choice (2 Points each)<br />
1. Which of the following factors does not affect the E-value?<br />
a. Size of the target database<br />
b. Length of the alignment discovered<br />
c. Organism where the query sequence originated<br />
d. Blast score<br />
2. T/F _F_ The BLAST algorithm is a heuristic algorithm. This means that it is<br />
guaranteed to find the optimum alignment, and to find it the quickest.<br />
3. T/F _T_ It would make no sense to say that two sequences are 75% homologous.<br />
4. T/F _F_ Computer programs today for “finding” genes in unannotated genome<br />
sequence are extremely accurate, rarely missing real genes or declaring genes that<br />
don’t, in fact, exist.<br />
5. T/F _F_ Genes are approximately evenly distributed along the entire length of<br />
chromosomes in Arabidopsis and other plants that have been examined so far.<br />
Find the BEST Match for Each Genomic Term (2 Points Each)<br />
1. _f_ Homologue<br />
2. _c_ Minimum Tiling Path<br />
3. _i_ Chromosome Walking<br />
4. _h_ E-value<br />
5. _l_ Genbank<br />
6. _d_ Segmental Duplication<br />
7. _n_ Tandem Duplication<br />
8. _a_ Phrap<br />
9. _g_ Syntenic<br />
10. _e_ Colinear<br />
a. A program that finds overlapping<br />
regions in sequencing data<br />
b. A program that calculates the<br />
strength of evidence for each basecall<br />
as part of DNA sequencing<br />
c. Order of clones that most efficiently<br />
covers a genome region<br />
d. Multiple copies of a genome region<br />
derived from ancient genome<br />
rearrangement(s)<br />
e. Sequences in the same linear order in<br />
two or more taxa<br />
f. Sequences that share a common<br />
ancestor<br />
g. Sequences physically linked to one<br />
another in two or more taxa<br />
h. Probability of finding an alignment<br />
with a given Blast score<br />
i. Gene cloning based on a marker<br />
located close to a target, followed by<br />
searching for clones progressively<br />
closer<br />
j. Gene cloning based on isolating<br />
mRNA & cDNAs coming from a<br />
tissue or treatment of interest<br />
k. A database of conserved protein<br />
sequence motifs<br />
l. A database of all public nucleotide<br />
and protein sequences<br />
m. Related sequences separated from<br />
one another by duplication<br />
n. Multiple copies of a genome region<br />
located adjacent to one another<br />
Short Answer (8 points each) — Use Only the Space Provided!<br />
1. What is the difference between an unknown protein and a hypothetical protein?<br />
A “hypothetical” protein is a predicted entity never before seen in nature. A hypothetical<br />
protein is predicted directly from genome sequence by one of many computer algorithms<br />
that search for sequence regions that have the hallmarks of protein-coding genes. By<br />
contrast, an “unknown” protein is known to exist in nature because its corresponding<br />
mRNA has been observed at least once, either in a targeted cDNA study or as part of an<br />
EST project.<br />
2. Imagine you have been invited to a local high school to talk to students about genomics.<br />
The very first student asks “So, what is genomics anyway?” How do you answer?<br />
There is no one correct definition for genomics. A good answer would include the<br />
following features: A perspective on biology in which all of the genes or gene products<br />
are viewed together to reveal interesting or important properties. Genomics is based on<br />
massive amounts of DNA sequence data and requires the use of high-powered<br />
computer tools.<br />
3. Assume that you’ve used the program FPC to construct the BAC contig map shown<br />
below. What additional steps would you use to validate the map – and do so in a highly<br />
efficient manner? (Use any of the genomic/molecular techniques we’ve talked about in<br />
class plus those you’ve learned about in other classes.)<br />
[Figure: BAC contig map showing clones 29F19, 84J03, 19B16, and 65J15]<br />
A BAC contig map, such as the one displayed here, would typically be constructed by<br />
DNA fingerprinting and the use of an algorithm such as FPC. This approach is usually<br />
effective at discovering BAC clones that overlap one another, but it frequently makes<br />
errors about the exact order or the degree of overlap. Before using such a physical map<br />
as a basis for generating a minimum tiling path (and subsequent DNA sequencing),<br />
scientists frequently sequence just the ends of BAC clones and use this information to<br />
generate probes/PCR primers from those BAC-ends. The probes and PCR primers can<br />
then be used to determine whether other BAC clones in the physical map really do have<br />
the predicted orientation through a combination of Southern blotting and PCR<br />
amplification.<br />
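Once clone order and overlaps have been validated, choosing a minimum tiling path is an interval-covering problem. The following is a minimal sketch, not FPC's own method: it assumes each clone has already been assigned start and end coordinates on the contig, and the function name, clone names, and coordinates are all hypothetical.

```python
def minimum_tiling_path(clones, region_start, region_end):
    """Greedy interval cover: fewest clones spanning [region_start, region_end].

    clones maps a clone name to (start, end) coordinates on the contig.
    Raises ValueError if the clones leave a gap in the region.
    """
    ordered = sorted(clones.items(), key=lambda item: item[1][0])
    path = []
    covered_to = region_start
    i = 0
    while covered_to < region_end:
        best_name, best_end = None, covered_to
        # Among clones starting at or before the covered frontier,
        # keep the one that extends coverage the furthest.
        while i < len(ordered) and ordered[i][1][0] <= covered_to:
            name, (_start, end) = ordered[i]
            if end > best_end:
                best_name, best_end = name, end
            i += 1
        if best_name is None:
            raise ValueError(f"gap in clone coverage at position {covered_to}")
        path.append(best_name)
        covered_to = best_end
    return path


# Hypothetical clone coordinates (kbp), for illustration only.
example_clones = {
    "cloneA": (0, 100),
    "cloneB": (50, 180),
    "cloneC": (80, 150),
    "cloneD": (170, 300),
    "cloneE": (160, 250),
}
tiling = minimum_tiling_path(example_clones, 0, 300)
```

In this example cloneC and cloneE are skipped because other clones reach further from the same frontier. The result is only as good as the coordinates fed in, which is why the BAC-end validation described above comes first.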
4. Briefly compare and contrast BAC-by-BAC versus Shotgun genome sequencing. Be sure<br />
to illustrate the advantages and disadvantages of both. (10 points)<br />
A BAC-by-BAC approach to genome sequencing begins with the construction of a<br />
detailed genetic map (segregation analysis) and physical map (fingerprint/FPC<br />
of BAC clones). These maps provide the basis for predicting a minimum tiling path of<br />
BAC clones, typically verified through BAC-end sequencing. The chosen BAC clones are<br />
subjected to finished sequencing (~8X redundant coverage of a shotgun sub-library,<br />
followed by closure) and the underlying genome sequence is reconstructed by analyzing<br />
overlaps between clones and predicting the underlying sequence for each chromosome<br />
arm.<br />
A shotgun sequencing approach begins with random (sheared) fragmenting of the<br />
genome to construct a very large shotgun library, which is size-selected to enrich for two<br />
size classes, typically 2 kbp and 10 kbp. These clones are all end-sequenced (taking<br />
care to keep track of which end sequences come from the same clone) and this process<br />
is carried out to >4X redundant coverage. In the end, millions of short reads are<br />
produced, which must be reassembled into a best guess of the genome sequence.<br />
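The link between redundancy and genome coverage can be estimated with the Lander-Waterman (Poisson) model, in which the expected fraction of bases covered at depth c is 1 - e^(-c). The sketch below assumes reads land uniformly at random, which real libraries only approximate. The function names and the read and genome sizes are hypothetical.

```python
import math


def coverage_depth(num_reads, read_length, genome_size):
    """Average redundancy c = (number of reads * read length) / genome size."""
    return num_reads * read_length / genome_size


def expected_fraction_covered(depth):
    """Poisson estimate of the fraction of bases hit by at least one read."""
    return 1.0 - math.exp(-depth)


# Hypothetical project: 1,000,000 reads of 500 bp on a 125 Mbp genome.
depth = coverage_depth(1_000_000, 500, 125_000_000)
for c in (depth, 8.0):
    print(f"{c:.0f}X coverage -> {expected_fraction_covered(c):.4f} of bases covered")
```

Under these assumptions, 4X redundancy already leaves fewer than 2% of bases unsampled, while 8X leaves well under 0.1%.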
Salient features of BAC-by-BAC sequencing: Very high quality genome sequence;<br />
Very well suited for comprehensive coverage; Well suited for studies of<br />
synteny/comparative genomics; Resolve cases of genome duplications; Significant upfront<br />
work & expense; Substantial project coordination; High levels of technical<br />
expertise; Very expensive on a per-base pair basis.<br />
Salient features of Shotgun sequencing: Comparatively low cost for >95% genome<br />
coverage; Can be carried out in short time-frame; Can be performed by a single (very<br />
well-equipped) laboratory; Relatively low levels of expertise required; Poor to modest<br />
quality sequence; Impossible to know what genome regions have been missed; Not well-suited<br />
for synteny/comparative genomic analysis.<br />
5. Fill in the following table:<br />
Marker | Ease of Development | Ease of Analysis/Scoring | Work Across Populations | Work Across Species/Genus | Suitable as Map Anchor<br />
Possible Answers | Easy/Hard | Easy/Hard | Yes/No | Yes/No | Yes/No<br />
RFLP | H | H | Y | Y | Y<br />
RAPD | E | E | N | N | N<br />
SSR | H (moderate) | E | Y | N (sometimes) | Y<br />
AFLP | E | H | N | N | N<br />
Look at this BLAST report below and answer the following questions. (10 points)<br />
A B C<br />
1. Zea mays leucine-rich repeat transmembrane protein kinase 2 (ltk2)... 499 e-140<br />
2. Gossypium arboreum 7-10 dpa fiber library Gossypium arboreum cDNA... 384 e-105<br />
3. Gossypium arboreum 7-10 dpa fiber library Gossypium arboreum cDNA... 373 e-102<br />
4. Six-day Cotton fiber Gossypium hirsutum 5' similar to LRR kinase1... 319 1e-85<br />
5. Arabidopsis thaliana DNA chromosome 4, contig fragment No. 56... 284 1e-85<br />
6. Arabidopsis thaliana DNA chromosome 4, BAC clone F1N20... 284 1e-85<br />
7. Six-day Cotton fiber Gossypium hirsutum 5' similar to LRR kinase 1.. 312 1e-83<br />
8. Gm-c1036-1924 5' similar to LRR TRANSMEMBRANE KINASE I.. 307 3e-82<br />
9. Nodulated root Medicago truncatula cDNA clone NF030H07NR 305 1e-81<br />
10.Arabidopsis thaliana chromosome 1 YAC YUP8H12R sequence,... 141 1e-80<br />
11.tomato fruit mature green, cDNA clone cLEF46B14 5', mRNA sequence ... 284 4e-75<br />
1. What is the meaning of the numbers in the column C and how are these numbers calculated, at<br />
least in a conceptual sense?<br />
Column C shows the Expectation value (E-value): the number of alignments with a Blast score<br />
at least as high as the one shown in column B that would be expected by chance, given the<br />
corresponding alignment length and target database size.<br />
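In a conceptual sense, the calculation can be sketched with the Karlin-Altschul bit-score formula, E = m * n * 2^(-S'), where S' is the bit score, m the query length, and n the total database length. This is a simplified illustration: BLAST itself applies effective-length corrections, and the function name and all numeric values below are hypothetical.

```python
def expected_chance_hits(bit_score, query_length, database_length):
    """Simplified Karlin-Altschul E-value: E = m * n * 2**(-S').

    bit_score is the normalized (bit) score S', query_length is m,
    and database_length is n, the total residues in the database.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search: a 1,000-residue query against a 1e9-residue database.
small_db = expected_chance_hits(60, 1_000, 1_000_000_000)
large_db = expected_chance_hits(60, 1_000, 2_000_000_000)
higher_score = expected_chance_hits(70, 1_000, 1_000_000_000)
```

Doubling the database doubles the E-value, while each additional bit of score halves it. The same score is therefore less significant in a larger search space.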
2. Look at hits 10 and 11. How is it possible that #10 (Arabidopsis…) has a lower value in column<br />
B, but a more significant value in column C -- compared with #11 (tomato fruit…), which has a<br />
higher value in column B but a less significant value in column C?<br />
The most likely reason that the two Blast hits have similar expectation values (column<br />
C) but very different Blast scores (column B) is that the lengths of the underlying<br />
alignments are different. The alignment in line 10 is probably much shorter than the<br />
alignment in line 11.<br />
3. The query used in this search came from flax and the BLAST results seem to indicate it is some<br />
sort of leucine-rich repeat transmembrane kinase. How confident would you be of this<br />
functional assignment and why? What is one additional type of information that you could look<br />
for “informatically” that would increase your confidence? (There is no one right answer to the<br />
second part of this question).<br />
The flax query sequence is predicted to be a member of the leucine-rich repeat<br />
transmembrane kinase family because several of its Blast hits are annotated as<br />
such. Only one hit, however, is an original annotation of an LRR-TM-kinase (the top hit from<br />
maize, ltk2) – all of the others are derived annotations (similar to…). Therefore, the best<br />
evidence, and really the only evidence shown, is the top hit to ltk2, plus the fact that the<br />
expectation value for this hit is so very low (e-140). The other Blast hits do help to reinforce<br />
the conclusion that the query belongs to a large, well-defined protein family, and the top<br />
hit indicates it is probably related to LRR-TM-kinases.<br />
Your advisor asks you to consider a project to follow up on the research of a previous<br />
graduate student. That student mapped the location of a major quantitative trait locus (QTL)<br />
controlling aluminum tolerance in maize. You are being asked to clone the Al-tolerance locus by<br />
a combination of positional cloning and comparative genomics. (20 points)<br />
DO NOT write any part of your answer outside the lines – and be sure to write neatly. You<br />
should organize your thoughts (maybe even prepare an outline ahead of time) so that your<br />
answer fits in the space below.<br />
1. Briefly describe the key steps that the previous student probably used to “map” the Al-tolerance<br />
QTL.<br />
2. If you wanted to clone the maize gene based solely on map position, potentially including<br />
information gathered from the rice genome sequencing project, what strategy would you use?<br />
(Be sure to mention the key steps in your strategy.)<br />
3. Of course, maize and rice both have gene/genome duplications. What will you do to address<br />
this added complexity?<br />
To map the QTL underlying aluminum tolerance in maize, the previous graduate<br />
student probably screened several maize genotypes for this trait, and looked for two<br />
lines with distinctly contrasting responses to aluminum. The student would have made a<br />
cross and generated a segregating population (F2, RIL, etc.). The individuals in this<br />
population would be screened for aluminum response and also for hundreds of DNA<br />
markers distributed throughout the genome. QTL(s) for this trait would be tagged<br />
(located) by those markers showing the highest statistical correlation with aluminum<br />
tolerance.<br />
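The marker-trait association step can be sketched as a single-marker scan that ranks markers by the strength of their correlation with the phenotype. This is a deliberately simple stand-in for real QTL methods such as interval mapping. It assumes genotypes are coded as 0, 1, or 2 copies of one parental allele, and the function names, marker names, and data are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def rank_markers(genotypes_by_marker, phenotypes):
    """Order marker names from strongest to weakest association with the trait."""
    scores = {
        marker: abs(pearson_r(genotypes, phenotypes))
        for marker, genotypes in genotypes_by_marker.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical data: six F2 individuals, three markers, one tolerance score each.
tolerance = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
genotypes = {
    "m1": [0, 1, 2, 0, 1, 2],
    "m2": [0, 0, 1, 1, 2, 2],
    "m3": [2, 0, 1, 0, 2, 1],
}
ranking = rank_markers(genotypes, tolerance)
```

Here marker m2 tracks the phenotype most closely, so it would be the candidate tag for the QTL. A real analysis would use far more individuals and a significance threshold corrected for testing many markers.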
To clone the aluminum tolerance locus by map position, I would choose several of<br />
the markers nearest the gene (by segregation analysis) and use these markers to<br />
screen a maize BAC library. Ideally, some or all of the clones identified by these markers<br />
would overlap one another based on fingerprint analysis. More likely, a small number of<br />
BAC contigs would be uncovered and one or more “chromosome walking” steps would<br />
be required, in which previously discovered BAC clones are used to identify new,<br />
overlapping clones. Once this process identifies clones that must, based on genetic and<br />
physical mapping, span the region containing the aluminum tolerance locus, I would be<br />
in a position to look for candidate sequences. If, in the course of chromosome walking, I<br />
run into problems in maize, I could potentially refer to syntenic regions in rice, with its<br />
nearly sequenced genome, as a new source of probes to seek BAC clones at or near<br />
the aluminum tolerance locus.<br />
Maize is known to be an ancient polyploid and most genome regions are duplicated.<br />
This would be a special problem during chromosome walking, because it might happen<br />
that I “walk” right off of the target genome region (where the aluminum locus is found)<br />
and onto a duplicated segment. The rice syntenic region might help (probably not), so for<br />
every chromosome walking step, I will need to verify (by PCR, sequencing,<br />
fingerprinting, or segregation analysis) that my step is keeping me in the target region.<br />