INFO1105 Assignment 2 – Gene Collection - GetACoder
INFO1105/1905: Data Structures<br />
INFO1105 Assignment 2 – Gene Collection<br />
Due: 5:00 pm Wednesday (Week 11) 15/10/2008<br />
Purpose<br />
This assignment emphasizes the details of programming a data structure, with careful<br />
attention to making sure that each operation keeps the integrity of the structure. It should also<br />
provide practice in recursion. In addition, you will demonstrate your ability to analyze the<br />
run-time costs of your code. This is an individual assignment: each student must work<br />
independently, and any assistance must be acknowledged in the README file.<br />
Task<br />
Each student must write a collection class, which would be suitable to be kept in a class<br />
library. Note: your job is to write part of a library; in this assignment you should not use a<br />
collection class library (of course, you can still use the built-in arrays of Java, and you can use<br />
java.util.Scanner and the classes in java.lang and java.io, such as String).<br />
The class you write must be called RadixSearchTrie. It must have a constructor with no<br />
arguments, to produce an object representing an empty collection. It must be part of a<br />
package called GeneCollection; that is, the RadixSearchTrie.java file must be in a<br />
directory called GeneCollection.<br />
To clients, this class should appear as an implementation of the interface GCInterface. The<br />
interface is given on Page 2 of this document.<br />
Internally, the collection class must be structured as a radix search trie. A detailed explanation<br />
of the radix search trie is given in the Appendix at the end of this document.<br />
As well as the collection class and any associated classes for Nodes, you must produce a<br />
textual file called README. This must contain a statement about the authorship of the code<br />
you submit, including acknowledgements of all assistance you received (for example,<br />
conversations with friends, resources you found on the web, etc). The README must also<br />
contain a big-O running time for each method you implemented, and an answer to the<br />
following question.<br />
"Fred Foolish claims that the worst-case scalability of the findDescription method in the radix<br />
trie is O(log n) where n is the total length of all the genes in the collection. Fred says this<br />
because the depth of a tree is approximately the logarithm of the number of nodes, and you<br />
have at most one node for each character in each gene. Show that Fred is wrong in general,<br />
and explain the errors in his argument."<br />
School of IT, The University of Sydney Page 1 of 6
INFO1105/1905: Data Structures Assignment 02<br />
package GeneCollection;<br />
/*<br />
* This interface represents the API<br />
* suitable for an object which keeps a collection of genes, each with associated<br />
* information. A gene is given by a String, with the special property that every<br />
* character is A, C, G or T. In real uses, the associated information would<br />
* be voluminous, including discoverer, literature citation, species, chromosome<br />
* identifier, location within chromosome, purpose, date first mentioned, etc etc;<br />
* however for this simple example we will assume the associated information<br />
* is just a String description.<br />
*/<br />
public interface GCInterface {<br />
/*<br />
* Find the description associated with a given gene.<br />
* If the gene is not in the collection, return null.<br />
*/<br />
public String findDescription(String gene);<br />
/*<br />
* add a new gene to the collection, with an associated description.<br />
* Neither argument may be null. If the gene is already present in the collection,<br />
* the new description replaces the old; otherwise the new description and gene<br />
* are added.<br />
*/<br />
public void newGene(String gene, String description);<br />
/*<br />
* return the number of genes in the collection.<br />
*/<br />
public int numberOfGenes();<br />
/*<br />
* return the number of genes in the collection which have the given string as a prefix.<br />
*/<br />
public int numberOfGenes(String prefix);<br />
/*<br />
* remove a gene and its associated description.<br />
* This must not be called with a null argument.<br />
*/<br />
public void remove(String gene);<br />
}<br />
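To make the contract above concrete, here is a minimal sketch of client code calling through GCInterface. The interface is repeated (without the package line) only so that the sketch compiles on its own. ArrayGeneStore and ContractDemo are hypothetical names, and the array-backed stand-in is deliberately NOT a radix search trie, so it would not be an acceptable submission. The sketch also assumes two things the handout does not spell out: that removing an absent gene does nothing, and that a gene counts as having itself as a prefix.

```java
/** Copy of the handout's interface, package line omitted so this compiles alone. */
interface GCInterface {
    String findDescription(String gene);
    void newGene(String gene, String description);
    int numberOfGenes();
    int numberOfGenes(String prefix);
    void remove(String gene);
}

/** Hypothetical stand-in using two parallel arrays; not the required trie. */
class ArrayGeneStore implements GCInterface {
    private String[] genes = new String[4];
    private String[] descriptions = new String[4];
    private int size = 0;

    /** Linear search; returns the index of the gene or -1 if it is absent. */
    private int indexOf(String gene) {
        for (int i = 0; i < size; i++) {
            if (genes[i].equals(gene)) return i;
        }
        return -1;
    }

    public String findDescription(String gene) {
        int i = indexOf(gene);
        return i < 0 ? null : descriptions[i];
    }

    public void newGene(String gene, String description) {
        int i = indexOf(gene);
        if (i >= 0) {                      // already present: replace the description
            descriptions[i] = description;
            return;
        }
        if (size == genes.length) {        // arrays full: double their capacity
            String[] moreGenes = new String[2 * size];
            String[] moreDescriptions = new String[2 * size];
            System.arraycopy(genes, 0, moreGenes, 0, size);
            System.arraycopy(descriptions, 0, moreDescriptions, 0, size);
            genes = moreGenes;
            descriptions = moreDescriptions;
        }
        genes[size] = gene;
        descriptions[size] = description;
        size++;
    }

    public int numberOfGenes() {
        return size;
    }

    public int numberOfGenes(String prefix) {
        int count = 0;
        for (int i = 0; i < size; i++) {
            if (genes[i].startsWith(prefix)) count++;  // assumption: a gene is its own prefix
        }
        return count;
    }

    public void remove(String gene) {
        int i = indexOf(gene);
        if (i < 0) return;                 // assumption: removing an absent gene is a no-op
        size--;
        genes[i] = genes[size];            // move the last entry into the gap
        descriptions[i] = descriptions[size];
        genes[size] = null;
        descriptions[size] = null;
    }
}

class ContractDemo {
    public static void main(String[] args) {
        GCInterface genes = new ArrayGeneStore();
        genes.newGene("GC", "first description");
        genes.newGene("GCGT", "another gene");
        genes.newGene("GC", "replacement");               // same gene: description replaced
        System.out.println(genes.numberOfGenes());        // 2
        System.out.println(genes.findDescription("GC"));  // replacement
        System.out.println(genes.findDescription("AT"));  // null
        System.out.println(genes.numberOfGenes("GC"));    // 2 under the prefix assumption
    }
}
```

A test harness for the real RadixSearchTrie could reuse the same calls, since clients see only GCInterface.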
Assessment<br />
This assignment is worth 20% of the marks for the unit.<br />
There are 10 marks awarded for the functionality of the program. This is determined both by<br />
running tests on your code, and also by consideration of the submitted code.<br />
0 if the program does not run (for example, it doesn't compile or it doesn't implement<br />
GCInterface). This mark is also given if you are unable to explain the working of the code,<br />
when asked to do so, or if the code does not follow the radix search trie data structure as<br />
described (e.g. if you code a simpler structure such as a linked list, or if you internally use<br />
or inherit a library collection class).<br />
3 if the program runs correctly in several simple test cases (involving only the methods<br />
findDescription and newGene).<br />
5 if the methods findDescription and newGene work correctly in a wide range of<br />
straightforward cases (such as never looking for a gene that isn't present, never adding a<br />
gene which is already present, never having two genes where one is a prefix of another,<br />
etc).<br />
7 if the methods findDescription and newGene work correctly (including on "corner"<br />
cases such as looking for a gene that isn't present etc).<br />
10 if the program works correctly (to be precise, if the graders can't find any mistakes in<br />
functionality), and it follows the radix search trie data structure as described.<br />
There are 8 marks for design and style of code. While the problem can be solved without<br />
recursion, the most marks a solution without recursion can receive for this part is 4.<br />
0 if the program is generally hard to understand (e.g. due to inadequate comments, poorly<br />
chosen identifiers, insufficient data hiding, not using the idioms/conventions of the<br />
language). This mark is also given if you are unable to explain the working of the code,<br />
when asked to do so, or if the code does not follow the radix search trie data structure as<br />
described (e.g. if you code a simpler structure such as a linked list, or if you internally use<br />
or inherit a library collection class).<br />
2 if the intention is clear, but there are significant flaws in style (such as inadequate<br />
comments or redundant code, or badly chosen instance or local variables). Alternatively<br />
this mark is given for non-recursive code with minor flaws in style.<br />
4 if the code is well-written throughout but does not use recursion in appropriate ways at<br />
all.<br />
6 if recursion is used in at least one of the required methods, and either there are minor<br />
flaws in coding style or the recursion is not well-explained.<br />
8 if the program is well-written throughout with sensible and well-explained use of<br />
recursion for some of the methods.<br />
There are 2 marks for the report explaining the error in Fred Foolish's view. This report must<br />
be included in the README file.<br />
0 if the report is missing, unclear, obviously wrong, or irrelevant.<br />
1 if the report has some sensible ideas, but the argument has flaws.<br />
2 if the report is convincing and correct.<br />
Functionality | Recursion, design, style of code | README | Total<br />
10% | 8% | 2% | 20%<br />
How to Submit<br />
• Your collection should consist of files (ending in .java) all in a single directory called<br />
GeneCollection. There must also be a README file in that directory, which includes the<br />
discussion of Fred Foolish's views. Zip the folder, naming the zip file after your login name (e.g.<br />
abcd5678.zip), and submit the zip file.<br />
• You must submit your zip file by 5:00pm on the Wednesday of Week 11, 15/10/2008.<br />
Late submissions will not be marked. A submission link will be provided on the course<br />
website and WebCT.<br />
• You can have multiple submissions, but only the last submission before the deadline will<br />
be kept and marked.<br />
• There will be a penalty if 1) you do not include your name and SID in your source code, 2)<br />
your zip file cannot be unzipped successfully, or 3) your program cannot be compiled and<br />
run successfully. It is your responsibility to make sure your zip file can be unzipped<br />
successfully and your program can be run successfully.<br />
• Submit a hard-copy assignment cover sheet, with the declaration signed, to your<br />
tutor during your Week 11 tutorial.<br />
• PLAGIARISM is strictly prohibited. Refer to the course website to learn more about the<br />
Academic Honesty policies of the University of Sydney and the School of Information<br />
Technologies.<br />
Appendix<br />
Radix Search Trie Structure<br />
A radix search trie is a data structure that can be used to provide map or dictionary<br />
functionality for the special case where the keys are strings with a small range of possible<br />
characters (for example, names involve only the 26 letters of the alphabet). In this appendix the<br />
keys are strings that represent genes. Each gene is a sequence of the 4 nucleotide<br />
bases, adenine, cytosine, guanine and thymine, and so each can be represented as a string<br />
where each character is A, C, G or T.<br />
This data structure is a variant of the trie or prefix tree, incorporating some but not all aspects<br />
of the radix tree. The term "trie" is not a misprint; it stands for "reTRIEval structure", and it's<br />
pronounced "try". Warning: some books have descriptions of structures like this for the simpler<br />
case where no key is a prefix of another. That does not apply here, since we can have both<br />
GC and GCGT as genes in our collection.<br />
The radix search trie is a structure involving two types of Nodes. A LeafNode represents (and<br />
stores) a gene with its associated information. An InternalNode represents (but does not store)<br />
a string which is a prefix of two or more genes in the collection. Each InternalNode stores<br />
references to other nodes, one reference leading (directly or indirectly) to the Nodes<br />
representing those genes which start with the given prefix and then an A, another to those<br />
genes that start with the given prefix and then a C, etc. The InternalNode also has a reference<br />
leading to the possible gene whose string is exactly the prefix, with no extra characters.<br />
Usually the five references will be in an array of Node references, indexed by 0 to 4 (with<br />
0="next character is A", 1="next character is C", 2="next character is G", 3="next character is<br />
T" and 4="string ends here").<br />
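The two node types and the slot numbering just described could be declared roughly as follows. This is only a sketch under stated assumptions: the handout fixes the names LeafNode and InternalNode and the 0 to 4 slot order, while the wrapper class TrieNodes, the field names, the END constant and the slotFor helper are hypothetical choices made for illustration.

```java
class TrieNodes {
    /** Common supertype so one child slot can hold either kind of node. */
    static abstract class Node { }

    /** Represents and stores one gene with its associated description. */
    static final class LeafNode extends Node {
        final String gene;
        String description;

        LeafNode(String gene, String description) {
            this.gene = gene;
            this.description = description;
        }
    }

    /** Represents a shared prefix; it stores only the five child references. */
    static final class InternalNode extends Node {
        static final int END = 4;             // slot 4: "string ends here"
        final Node[] children = new Node[5];  // slots 0..3: next character is A, C, G, T

        /** Maps a base to its slot; rejects anything other than A, C, G or T. */
        static int slotFor(char base) {
            switch (base) {
                case 'A': return 0;
                case 'C': return 1;
                case 'G': return 2;
                case 'T': return 3;
                default:
                    throw new IllegalArgumentException("not a base: " + base);
            }
        }
    }

    public static void main(String[] args) {
        // A node for a prefix GC that leads to the genes GCGT and GC.
        InternalNode gc = new InternalNode();
        gc.children[InternalNode.slotFor('G')] = new LeafNode("GCGT", "description of GCGT");
        gc.children[InternalNode.END] = new LeafNode("GC", "description of GC");

        LeafNode exact = (LeafNode) gc.children[InternalNode.END];
        System.out.println(exact.gene);  // the gene held in the "string ends here" slot
    }
}
```

Keeping the prefix out of InternalNode matches the statement that an internal node represents, but does not store, its string; the prefix is implied by the path from the root.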
For example, suppose the collection has the genes AGTC, ATACG, ATG, GCGT, and GC.<br />
Then the internal nodes correspond to the common prefixes AT (prefix of both ATACG and<br />
ATG), A (prefix of AGTC, ATACG and ATG), GC (prefix of GC and GCGT), G (also a prefix of<br />
GCGT and GC), and the empty string (a prefix of all the strings). Note that there is not an<br />
internal node for a string like ATA, which is a prefix only of one gene in the collection.<br />
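The rule "one internal node per prefix shared by two or more genes" can be checked mechanically for the example. The sketch below is not part of the trie itself; it is a brute-force check using only arrays and java.lang. The class and method names are hypothetical, and it assumes the genes in the array are distinct.

```java
class InternalPrefixes {
    /** Counts the genes that start with the given prefix (a gene is a prefix of itself). */
    static int countWithPrefix(String[] genes, String prefix) {
        int count = 0;
        for (String gene : genes) {
            if (gene.startsWith(prefix)) count++;
        }
        return count;
    }

    /** Index of the first gene that starts with the prefix, or -1 if none does. */
    static int firstWithPrefix(String[] genes, String prefix) {
        for (int i = 0; i < genes.length; i++) {
            if (genes[i].startsWith(prefix)) return i;
        }
        return -1;
    }

    /** Space-separated list of prefixes shared by 2+ genes; "" is shown as (empty). */
    static String internalPrefixes(String[] genes) {
        StringBuilder out = new StringBuilder();
        for (int g = 0; g < genes.length; g++) {
            for (int len = 0; len <= genes[g].length(); len++) {
                String prefix = genes[g].substring(0, len);
                // Report each prefix once, from the first gene that has it.
                boolean firstOwner = firstWithPrefix(genes, prefix) == g;
                if (firstOwner && countWithPrefix(genes, prefix) >= 2) {
                    if (out.length() > 0) out.append(' ');
                    out.append(prefix.isEmpty() ? "(empty)" : prefix);
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] genes = {"AGTC", "ATACG", "ATG", "GCGT", "GC"};
        System.out.println(internalPrefixes(genes));  // (empty) A AT G GC
    }
}
```

ATA does not appear in the output because only ATACG starts with it, whereas GC appears because it is a prefix of both GC and GCGT.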
The arrangement of the nodes is shown on the next page.<br />
Thus the node representing A has a "T" child which is the internal node representing the<br />
prefix AT (and leads to all LeafNodes whose genes start with AT); it also has a G child which<br />
is the LeafNode for the only string which starts with AG, namely AGTC. The InternalNode<br />
representing GC has a G child which is the LeafNode representing GCGT (the only gene<br />
starting with GCG) and another child in the last position of the array, corresponding to the gene<br />
GC itself.<br />
Of course the RadixSearchTrie class itself will be a separate class, with an instance variable<br />
myRoot which refers to a Node object. Once the collection has two or more genes, the root<br />
will be the InternalNode for the empty prefix; when the collection is empty, myRoot is null, and<br />
when the collection has just one gene, the root is the LeafNode for that single gene.<br />
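As an illustration of how a recursive lookup might walk this structure, the sketch below builds the example trie by hand and searches it. It is one possible shape under stated assumptions, not a model solution. The wrapper class TrieLookup, the helper names and the depth parameter are hypothetical, and insertion and removal, which must maintain the structure, are not shown.

```java
class TrieLookup {
    static abstract class Node { }

    static final class LeafNode extends Node {
        final String gene;
        final String description;
        LeafNode(String gene, String description) {
            this.gene = gene;
            this.description = description;
        }
    }

    static final class InternalNode extends Node {
        final Node[] children = new Node[5];  // 0=A, 1=C, 2=G, 3=T, 4="string ends here"
    }

    static final int END = 4;

    /** Slot for a base, or -1 if the character is not A, C, G or T. */
    static int slotFor(char base) {
        return "ACGT".indexOf(base);
    }

    /**
     * Recursive lookup. depth is the length of the prefix that node represents,
     * that is, how many characters of gene have been consumed so far.
     */
    static String find(Node node, String gene, int depth) {
        if (node == null) return null;             // empty slot or empty collection
        if (node instanceof LeafNode) {            // base case: compare the whole key
            LeafNode leaf = (LeafNode) node;
            return leaf.gene.equals(gene) ? leaf.description : null;
        }
        InternalNode internal = (InternalNode) node;
        if (depth == gene.length()) {              // all characters consumed
            Node end = internal.children[END];     // only ever a leaf or null
            return (end instanceof LeafNode) ? find(end, gene, depth) : null;
        }
        int slot = slotFor(gene.charAt(depth));
        if (slot < 0) return null;                 // not a valid base
        return find(internal.children[slot], gene, depth + 1);
    }

    static LeafNode leaf(String gene) {
        return new LeafNode(gene, "description of " + gene);
    }

    /** Builds by hand the trie for AGTC, ATACG, ATG, GCGT and GC. */
    static Node exampleRoot() {
        InternalNode at = new InternalNode();
        at.children[0] = leaf("ATACG");
        at.children[2] = leaf("ATG");
        InternalNode a = new InternalNode();
        a.children[2] = leaf("AGTC");
        a.children[3] = at;
        InternalNode gc = new InternalNode();
        gc.children[2] = leaf("GCGT");
        gc.children[END] = leaf("GC");
        InternalNode g = new InternalNode();
        g.children[1] = gc;
        InternalNode root = new InternalNode();
        root.children[0] = a;
        root.children[2] = g;
        return root;
    }

    public static void main(String[] args) {
        Node root = exampleRoot();
        System.out.println(find(root, "GC", 0));   // description of GC
        System.out.println(find(root, "ATA", 0));  // null
    }
}
```

The same find method covers all three root cases described above: a null root returns null at once, a single LeafNode root is compared directly, and an InternalNode root starts the descent.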
Arrangement of the nodes for the example (each line gives the child slot, then the node it refers to):<br />
node representing empty prefix<br />
    A: node for A<br />
        G: [leaf for AGTC]<br />
        T: node for AT<br />
            A: [leaf for ATACG]<br />
            G: [leaf for ATG]<br />
    G: node for G<br />
        C: node for GC<br />
            G: [leaf for GCGT]<br />
            end: [leaf for GC]<br />