15.11.2013 Views

INFO1105 Assignment 2 – Gene Collection - GetACoder


INFO1105/1905: Data Structures

INFO1105 Assignment 2 – Gene Collection

Due: 5:00 pm Wednesday (Week 11) 15/10/2008

Purpose

This assignment emphasizes the details of programming a data structure, with careful attention to making sure that each operation preserves the integrity of the structure. It should also provide practice in recursion. In addition, you will demonstrate your ability to analyze the run-time costs of your code. This is an individual assignment: each student must work independently, and any assistance must be acknowledged in the README file.

Task

Each student must write a collection class that would be suitable for inclusion in a class library. Note: your job is to write part of a library, so in this assignment you must not use a collection class library (of course, you can still use the built-in arrays of Java, and you can use java.util.Scanner and the classes in java.lang and java.io, such as String).

The class you write must be called RadixSearchTrie. It must have a constructor with no arguments, which produces an object representing an empty collection. It must be part of a package called GeneCollection; that is, the RadixSearchTrie.java file must be in a directory called GeneCollection.

To clients, this class should appear as an implementation of the interface GCInterface. The interface is given in full later in this document.

Internally, the collection class must be structured as a radix search trie. A detailed explanation of the radix search trie is included in the Appendix at the end of this document.

As well as the collection class and any associated classes for Nodes, you must produce a plain-text file called README. This must contain a statement about the authorship of the code you submit, including acknowledgements of all assistance you received (for example, conversations with friends, resources you found on the web, etc). The README must also contain a big-O running time for each method you implemented, and an answer to the following question.

"Fred Foolish claims that the worst-case scalability of the findDescription method in the radix trie is O(log n), where n is the total length of all the genes in the collection. Fred says this because the depth of a tree is approximately the logarithm of the number of nodes, and you have at most one node for each character in each gene. Show that Fred is wrong in general, and explain the errors in his argument."

School of IT, The University of Sydney Page 1 of 6


INFO1105/1905: Data Structures Assignment 02

package GeneCollection;

/*
 * This interface represents the API
 * suitable for an object which keeps a collection of genes, each with associated
 * information. A gene is given by a String, with the special property that every
 * character is A, C, G or T. In real uses, the associated information would
 * be voluminous, including discoverer, literature citation, species, chromosome
 * identifier, location within chromosome, purpose, date first mentioned, etc etc;
 * however for this simple example we will assume the associated information
 * is just a String description.
 */
public interface GCInterface {

    /*
     * Find the description associated with a given gene.
     * If the gene is not in the collection, return null.
     */
    public String findDescription(String gene);

    /*
     * Add a new gene to the collection, with an associated description.
     * Neither argument may be null. If the gene is already present in the collection,
     * the new description replaces the old; otherwise the new description and gene
     * are added.
     */
    public void newGene(String gene, String description);

    /*
     * Return the number of genes in the collection.
     */
    public int numberOfGenes();

    /*
     * Return the number of genes in the collection which have a given gene as prefix.
     */
    public int numberOfGenes(String prefix);

    /*
     * Remove a gene and its associated description.
     * This must not be called with a null argument.
     */
    public void remove(String gene);

}
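To make the contract of GCInterface concrete, the following is a minimal sketch of how a client might exercise it. The interface is redeclared here without its package so the sketch compiles on its own. NaiveGeneCollection and ContractDemo are hypothetical names: the array-backed class only illustrates the expected behaviour, is not a radix search trie, and would not satisfy the assignment. Two behaviours are assumptions that the interface comments do not settle: a gene is counted as having itself as a prefix, and removing an absent gene does nothing.

```java
// Stand-in for the handout's GCInterface, redeclared without the package.
interface GCInterface {
    String findDescription(String gene);
    void newGene(String gene, String description);
    int numberOfGenes();
    int numberOfGenes(String prefix);
    void remove(String gene);
}

// Hypothetical array-backed stand-in: illustrates the contract only.
// It is NOT a radix search trie.
class NaiveGeneCollection implements GCInterface {
    private String[] genes = new String[4];
    private String[] descriptions = new String[4];
    private int size = 0;

    // Linear scan for the slot holding the gene, or -1 if it is absent.
    private int indexOf(String gene) {
        for (int i = 0; i < size; i++) {
            if (genes[i].equals(gene)) return i;
        }
        return -1;
    }

    public String findDescription(String gene) {
        int i = indexOf(gene);
        return i < 0 ? null : descriptions[i];
    }

    public void newGene(String gene, String description) {
        int i = indexOf(gene);
        if (i >= 0) {                // already present: replace the description
            descriptions[i] = description;
            return;
        }
        if (size == genes.length) {  // grow both arrays together
            String[] g = new String[2 * size];
            String[] d = new String[2 * size];
            System.arraycopy(genes, 0, g, 0, size);
            System.arraycopy(descriptions, 0, d, 0, size);
            genes = g;
            descriptions = d;
        }
        genes[size] = gene;
        descriptions[size] = description;
        size++;
    }

    public int numberOfGenes() {
        return size;
    }

    // Assumption: a gene equal to the prefix is included in the count.
    public int numberOfGenes(String prefix) {
        int count = 0;
        for (int i = 0; i < size; i++) {
            if (genes[i].startsWith(prefix)) count++;
        }
        return count;
    }

    public void remove(String gene) {
        int i = indexOf(gene);
        if (i < 0) return;           // assumption: absent gene is a no-op
        size--;
        genes[i] = genes[size];      // fill the gap with the last entry
        descriptions[i] = descriptions[size];
        genes[size] = null;
        descriptions[size] = null;
    }
}

class ContractDemo {
    public static void main(String[] args) {
        GCInterface c = new NaiveGeneCollection();
        c.newGene("GC", "short gene");
        c.newGene("GCGT", "longer gene");
        c.newGene("GC", "updated");                   // replaces, does not add
        System.out.println(c.numberOfGenes());        // prints 2
        System.out.println(c.findDescription("GC"));  // prints updated
        System.out.println(c.numberOfGenes("GC"));    // prints 2
        c.remove("GC");
        System.out.println(c.findDescription("GC"));  // prints null
    }
}
```

A RadixSearchTrie submission should give the same answers to these calls wherever the interface comments fix the behaviour.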


Assessment

This assignment is worth 20% of the marks for the unit.

There are 10 marks awarded for the functionality of the program. This is determined both by running tests on your code and by inspection of the submitted code.

0 if the program does not run (for example, it doesn't compile or it doesn't implement GCInterface). This mark is also given if you are unable to explain the working of the code when asked to do so, or if the code does not follow the radix search trie data structure as described (e.g. if you code a simpler structure such as a linked list, or if you internally use or inherit from a library collection class).

3 if the program runs correctly in several simple test cases (involving only the methods findDescription and newGene).

5 if the methods findDescription and newGene work correctly in a wide range of straightforward cases (such as never looking for a gene that isn't present, never adding a gene which is already present, never having two genes where one is a prefix of another, etc).

7 if the methods findDescription and newGene work correctly (including on "corner" cases such as looking for a gene that isn't present, etc).

10 if the program works correctly (to be precise, if the graders can't find any mistakes in functionality), and it follows the radix search trie data structure as described.

There are 8 marks for the design and style of the code. While the problem can be solved without recursion, a solution without recursion can receive at most 4 marks for this part.

0 if the program is generally hard to understand (e.g. due to inadequate comments, poorly chosen identifiers, insufficient data hiding, or not using the idioms and conventions of the language). This mark is also given if you are unable to explain the working of the code when asked to do so, or if the code does not follow the radix search trie data structure as described (e.g. if you code a simpler structure such as a linked list, or if you internally use or inherit from a library collection class).

2 if the intention is clear, but there are significant flaws in style (such as inadequate comments, redundant code, or badly chosen instance or local variables). Alternatively, this mark is given for non-recursive code with minor flaws in style.

4 if the code is well-written throughout but does not use recursion in appropriate ways at all.

6 if recursion is used in at least one of the required methods, and either there are minor flaws in coding style or the recursion is not well-explained.

8 if the program is well-written throughout, with sensible and well-explained use of recursion for some of the methods.

There are 2 marks for the report explaining the error in Fred Foolish's view. This report must be included in the README file.

0 if the report is missing, unclear, obviously wrong, or irrelevant.

1 if the report has some sensible ideas, but the argument has flaws.


2 if the report is convincing and correct.

Functionality    Recursion, design, style of code    README    Total
10%              8%                                  2%        20%

How to Submit

• Your submission should consist of source files (ending in .java), all in a single directory called GeneCollection. There must also be a README file in that directory, which includes the discussion of Fred Foolish's views. Zip the folder, name the zip file after your login name (e.g. abcd5678.zip), and submit the zip file.

• You must submit your zip file by 5:00 pm on the Wednesday of Week 11, 15/10/2008. Late submissions will not be marked. A submission link will be provided on the course website and WebCT.

• You can make multiple submissions, but only the last submission before the deadline will be kept and marked.

• There will be a penalty if 1) you do not include your name and SID in your source code, 2) your zip file cannot be unzipped successfully, or 3) your program cannot be compiled and run successfully. It is your responsibility to make sure your zip file can be unzipped and your program can be run.

• Submit a hard-copy assignment cover sheet, with the declaration and your signature, to your tutor during the Week 11 tutorial.

• PLAGIARISM is strictly prohibited. Refer to the course website to learn more about the Academic Honesty policies of the University of Sydney and the School of Information Technologies.



Appendix

Radix Search Trie Structure

A radix search trie is a data structure that can be used to provide map or dictionary functionality for the special case where the keys are strings with a small range of possible characters (for example, names only involve the 26 letters of the alphabet). In this assignment the keys are strings that represent genes. Each gene is a sequence of the 4 nucleotide bases, adenine, cytosine, guanine and thymine, and so each can be represented as a string where each character is A, C, G or T.

This data structure is a variant of the trie or prefix tree, incorporating some but not all aspects of the radix tree. The term "trie" is not a misprint; it stands for "reTRIEval structure", and it's pronounced "try". Warning: some books describe structures like this for the simpler case where no key is a prefix of another. That does not apply here, since we can have both GC and GCGT as genes in our collection.

The radix search trie is a structure involving two types of Nodes. A LeafNode represents (and stores) a gene with its associated information. An InternalNode represents (but does not store) a string which is a prefix of two or more genes in the collection. Each InternalNode stores references to other nodes: one reference leads (directly or indirectly) to the Nodes representing those genes which start with the given prefix followed by an A, another to those genes that start with the given prefix followed by a C, and so on. The InternalNode also has a reference leading to the possible gene whose string is exactly the prefix, with no extra characters. Usually the five references will be held in an array of Node references, indexed 0 to 4 (with 0 = "next character is A", 1 = "next character is C", 2 = "next character is G", 3 = "next character is T" and 4 = "string ends here").
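As an illustration of the two node types and the five-slot array, here is a minimal sketch. The wrapper class NodeSketch, the constant END, the helper slotFor and the field names are all hypothetical; the handout fixes only the names LeafNode and InternalNode and the slot numbering.

```java
class NodeSketch {
    // Slot numbering from the handout: 0=A, 1=C, 2=G, 3=T, 4="string ends here".
    static final int END = 4;

    // Map one nucleotide character to its slot in an InternalNode's array.
    static int slotFor(char base) {
        switch (base) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:
                throw new IllegalArgumentException("not a nucleotide base: " + base);
        }
    }

    // Slot to follow from the InternalNode whose prefix has the given length.
    // If the gene has no further characters, the "string ends here" slot is used.
    static int slotFor(String gene, int depth) {
        return depth == gene.length() ? END : slotFor(gene.charAt(depth));
    }

    // Common supertype so an array slot can hold either kind of node.
    static abstract class Node { }

    // Stores a whole gene together with its associated description.
    static class LeafNode extends Node {
        final String gene;
        String description;

        LeafNode(String gene, String description) {
            this.gene = gene;
            this.description = description;
        }
    }

    // Represents a shared prefix; stores only the five child references.
    static class InternalNode extends Node {
        final Node[] children = new Node[5];
    }
}
```

Keeping the end-of-string case as a fifth slot is what lets a gene such as GC coexist with a longer gene such as GCGT under the same InternalNode.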

For example, suppose the collection has the genes AGTC, ATACG, ATG, GCGT, and GC. Then the internal nodes correspond to the common prefixes AT (prefix of both ATACG and ATG), A (prefix of AGTC, ATACG and ATG), GC (prefix of GC and GCGT), G (also a prefix of GCGT and GC), and the empty string (a prefix of all the strings). Note that there is no internal node for a string like ATA, which is a prefix of only one gene in the collection.
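The rule in this example (one internal node per prefix shared by two or more genes) can be checked mechanically. The sketch below is only a checking aid, not part of the trie: it uses java.util.TreeSet, which the submitted collection class itself must not use, and it assumes the genes in the array are distinct. The names InternalPrefixes and sharedPrefixes are hypothetical.

```java
import java.util.TreeSet;

class InternalPrefixes {
    // Returns every prefix (including "" and whole genes) that is a prefix
    // of at least two of the given genes. Assumes the genes are distinct.
    static TreeSet<String> sharedPrefixes(String[] genes) {
        TreeSet<String> shared = new TreeSet<>();
        for (String gene : genes) {
            for (int len = 0; len <= gene.length(); len++) {
                String prefix = gene.substring(0, len);
                int count = 0;
                for (String other : genes) {
                    if (other.startsWith(prefix)) count++;
                }
                if (count >= 2) shared.add(prefix);
            }
        }
        return shared;
    }

    public static void main(String[] args) {
        String[] genes = {"AGTC", "ATACG", "ATG", "GCGT", "GC"};
        // The first element is the empty string, hence the leading comma.
        System.out.println(sharedPrefixes(genes)); // prints [, A, AT, G, GC]
    }
}
```

The five prefixes printed are exactly the five internal nodes listed in the example, and ATA is absent because only ATACG starts with it.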

The arrangement of the nodes is shown in the diagram at the end of this appendix.

Thus the node representing A has a "T" child which is the internal node representing the prefix AT (and leads to all LeafNodes whose genes start with AT); it also has a "G" child which is the LeafNode for the only string which starts with AG, namely AGTC. The InternalNode representing GC has a "G" child which is the LeafNode representing GCGT (the only gene starting with GCG) and another child in the last position of the array, corresponding to the gene GC itself.

Of course the RadixSearchTrie class itself will be a separate class, with an instance variable myRoot which refers to a Node object. Once the collection has two or more genes, the root will be the InternalNode for the empty prefix; when the collection is empty, myRoot is null, and when the collection has just one gene, the root is the LeafNode for that single gene.
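Given the node layout and the three root cases just described, a recursive lookup can be sketched as follows. This is one possible reading of the structure, not a reference solution: the class TrieLookupSketch, the methods find, slot and exampleRoot, and the description strings are hypothetical, and the example trie is built by hand because insertion is not shown.

```java
class TrieLookupSketch {
    static abstract class Node { }

    static class LeafNode extends Node {
        final String gene;
        final String description;
        LeafNode(String gene, String description) {
            this.gene = gene;
            this.description = description;
        }
    }

    static class InternalNode extends Node {
        final Node[] children = new Node[5]; // slots: A, C, G, T, end-of-string
    }

    // Slot for the character at this depth; 4 once the gene has run out,
    // -1 if the character is not one of A, C, G, T.
    static int slot(String gene, int depth) {
        if (depth >= gene.length()) return 4;
        return "ACGT".indexOf(gene.charAt(depth));
    }

    // Recursive lookup: returns the description, or null if the gene is absent.
    static String find(Node node, String gene, int depth) {
        if (node == null) return null;          // empty collection or empty slot
        if (node instanceof LeafNode) {         // a leaf stores the whole gene
            LeafNode leaf = (LeafNode) node;
            return leaf.gene.equals(gene) ? leaf.description : null;
        }
        int s = slot(gene, depth);
        if (s < 0) return null;                 // character outside A, C, G, T
        return find(((InternalNode) node).children[s], gene, depth + 1);
    }

    // Hand-built trie for the genes AGTC, ATACG, ATG, GCGT and GC.
    static Node exampleRoot() {
        InternalNode at = new InternalNode();
        at.children[0] = new LeafNode("ATACG", "description of ATACG");
        at.children[2] = new LeafNode("ATG", "description of ATG");
        InternalNode a = new InternalNode();
        a.children[2] = new LeafNode("AGTC", "description of AGTC");
        a.children[3] = at;
        InternalNode gc = new InternalNode();
        gc.children[2] = new LeafNode("GCGT", "description of GCGT");
        gc.children[4] = new LeafNode("GC", "description of GC");
        InternalNode g = new InternalNode();
        g.children[1] = gc;
        InternalNode root = new InternalNode();
        root.children[0] = a;
        root.children[2] = g;
        return root;
    }

    public static void main(String[] args) {
        Node root = exampleRoot();
        System.out.println(find(root, "GC", 0));   // prints description of GC
        System.out.println(find(root, "GCG", 0));  // prints null
        System.out.println(find(null, "GC", 0));   // prints null (empty collection)
    }
}
```

The comparison at the leaf is needed because a leaf can sit higher in the trie than the full length of its gene: looking up ATA reaches the leaf for ATACG, and only the string comparison shows that ATA is not in the collection.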


node representing empty prefix
    A   -> node for A
               G   -> [leaf for AGTC]
               T   -> node for AT
                          A   -> [leaf for ATACG]
                          G   -> [leaf for ATG]
    G   -> node for G
               C   -> node for GC
                          G   -> [leaf for GCGT]
                          end -> [leaf for GC]
