INFO1105 Assignment 2 – Gene Collection - GetACoder

INFO1105/1905: Data Structures 

INFO1105 Assignment 2 – Gene Collection 

Due: 5:00 pm Wednesday (Week 11) 15/10/2008 

Purpose 

This assignment emphasizes the details of programming a data structure, with careful 

attention to making sure that each operation keeps the integrity of the structure. It should also 

provide practice in recursion. In addition, you will demonstrate your ability to analyze the 

run-time costs of your code. This is an individual assignment: each student must work 

independently, and any assistance must be acknowledged in the README file. 

Task 

Each student must write a collection class, which would be suitable to be kept in a class 

library. Note: your job is to write part of a library; in this assignment you should not use a 

collection class library (of course, you can still use the built-in arrays of Java, and you can use 

java.util.Scanner and the classes in java.lang and java.io, such as String). 

The class you write must be called RadixSearchTrie. It must have a constructor with no 

arguments, to produce an object representing an empty collection. It must be part of a 

package called GeneCollection; that is, the RadixSearchTrie.java file must be in a 

directory called GeneCollection. 

To clients, this class should appear as an implementation of the interface GCInterface. The 

interface is on Page 2 of this document. 

Internally, the collection class must be structured as a radix search trie. A detailed explanation 

of the radix search trie is included in the Appendix Section at the end of this document. 

As well as the collection class and any associated classes for Nodes, you must produce a 

textual file called README. This must contain a statement about the authorship of the code 

you submit, including acknowledgements of all assistance you received (for example, 

conversations with friends, resources you found on the web, etc). The README must also 

contain a big-O running time for each method you implemented, and an answer to the 

following question. 

"Fred Foolish claims that the worst case scalability of the findDescription method in the radix 

trie is O(log n) where n is the total length of all the genes in the collection. Fred says this 

because the depth of a tree is approximately the logarithm of the number of nodes, and you 

have at most one node for each character in each gene. Show that Fred is wrong in general, 

and explain the errors in his argument." 

School of IT, The University of Sydney Page 1 of 6


package GeneCollection;

/*
 * This interface represents the API
 * suitable for an object which keeps a collection of genes, each with associated
 * information. A gene is given by a String, with the special property that every
 * character is A, C, G or T. In real uses, the associated information would
 * be voluminous, including discoverer, literature citation, species, chromosome
 * identifier, location within chromosome, purpose, date first mentioned, etc etc;
 * however for this simple example we will assume the associated information
 * is just a String description.
 */
public interface GCInterface {

    /*
     * Find the description associated with a given gene.
     * If the gene is not in the collection, return null.
     */
    public String findDescription(String gene);

    /*
     * add a new gene to the collection, with an associated description.
     * Neither argument may be null. If the gene is already present in the collection,
     * the new description replaces the old; otherwise the new description and gene
     * are added.
     */
    public void newGene(String gene, String description);

    /*
     * return the number of genes in the collection.
     */
    public int numberOfGenes();

    /*
     * return the number of genes in the collection which have a given gene as prefix.
     */
    public int numberOfGenes(String prefix);

    /*
     * remove a gene and its associated description.
     * This must not be called with a null argument.
     */
    public void remove(String gene);

}
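
One way to read the contract above is through a deliberately naive, array-backed reference collection. The sketch below is not a radix search trie and would not be an acceptable submission; it only illustrates the behaviour a client might expect, for instance as an oracle when testing. The class name NaiveGeneCollection and its fields are hypothetical. It also assumes that a gene counts as a prefix of itself and that removing an absent gene does nothing, neither of which the interface comments state.

```java
// Illustrative only: a linear-scan collection with the same method names as
// GCInterface. In a real project it would declare "implements GCInterface".
class NaiveGeneCollection {
    private String[] genes = new String[4];
    private String[] descriptions = new String[4];
    private int size = 0;

    // Position of the gene in the arrays, or -1 if it is absent.
    private int indexOf(String gene) {
        for (int i = 0; i < size; i++) {
            if (genes[i].equals(gene)) {
                return i;
            }
        }
        return -1;
    }

    public String findDescription(String gene) {
        int i = indexOf(gene);
        return i < 0 ? null : descriptions[i];
    }

    public void newGene(String gene, String description) {
        int i = indexOf(gene);
        if (i >= 0) {                 // already present: replace the description
            descriptions[i] = description;
            return;
        }
        if (size == genes.length) {   // arrays full: double their capacity
            String[] biggerGenes = new String[size * 2];
            String[] biggerDescriptions = new String[size * 2];
            System.arraycopy(genes, 0, biggerGenes, 0, size);
            System.arraycopy(descriptions, 0, biggerDescriptions, 0, size);
            genes = biggerGenes;
            descriptions = biggerDescriptions;
        }
        genes[size] = gene;
        descriptions[size] = description;
        size++;
    }

    public int numberOfGenes() {
        return size;
    }

    // Assumption: a gene is counted as having itself as a prefix.
    public int numberOfGenes(String prefix) {
        int count = 0;
        for (int i = 0; i < size; i++) {
            if (genes[i].startsWith(prefix)) {
                count++;
            }
        }
        return count;
    }

    // Assumption: removing a gene that is not present is a no-op.
    public void remove(String gene) {
        int i = indexOf(gene);
        if (i < 0) {
            return;
        }
        genes[i] = genes[size - 1];   // move the last entry into the gap
        descriptions[i] = descriptions[size - 1];
        genes[size - 1] = null;
        descriptions[size - 1] = null;
        size--;
    }
}
```

Comparing a trie implementation against a reference like this on the same sequence of calls is one possible way to look for functional mistakes.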



Assessment 

This assignment is worth 20% of the marks for the unit. 

There are 10 marks awarded for the functionality of the program. This is determined both by 

running tests on your code, and also by consideration of the submitted code. 

0 if the program does not run (for example, it doesn't compile or it doesn't implement 

GCInterface). This mark is also given if you are unable to explain the working of the code, 

when asked to do so, or if the code does not follow the radix search trie data structure as 

described (e.g. if you code a simpler structure such as a linked list, or if you internally use 

or inherit a library collection class). 

3 if the program runs correctly in several simple test cases (involving only the methods 

findDescription and newGene). 

5 if the methods findDescription and newGene work correctly in a wide range of 

straightforward cases (such as never looking for a gene that isn't present, never adding a 

gene which is already present, never having two genes where one is a prefix of another, 

etc). 

7 if the methods findDescription and newGene work correctly (including on "corner" 

cases such as looking for a gene that isn't present etc). 

10 if the program works correctly (to be precise, if the graders can't find any mistakes in 

functionality), and it follows the radix search trie data structure as described. 

There are 8 marks for design and style of code. While the problem can be solved without 

recursion, the most marks a solution without recursion can receive for this part is 4. 

0 if the program is generally hard to understand (e.g. due to inadequate comments, poorly 

chosen identifiers, insufficient data hiding, not using the idioms/conventions of the 

language). This mark is also given if you are unable to explain the working of the code, 

when asked to do so, or if the code does not follow the radix search trie data structure as 

described (e.g. if you code a simpler structure such as a linked list, or if you internally use 

or inherit a library collection class). 

2 if the intention is clear, but there are significant flaws in style (such as inadequate 

comments or redundant code, or badly chosen instance or local variables). Alternatively 

this mark is given for non-recursive code with minor flaws in style. 

4 if the code is well-written throughout but does not use recursion in appropriate ways at 

all. 

6 if recursion is used in at least one of the required methods, and either there are minor 

flaws in coding style or the recursion is not well-explained. 

8 if the program is well-written throughout with sensible and well-explained use of 

recursion for some of the methods. 

There are 2 marks for the report explaining the error in Fred Foolish's view. This report must 

be included in the README file. 

0 if the report is missing, unclear, obviously wrong, or irrelevant. 

1 if the report has some sensible ideas, but the argument has flaws. 



2 if the report is convincing and correct. 

Functionality    Recursion, design, style of code    README    Total
10%              8%                                  2%        20%

How to Submit 

• Your collection should consist of files (ending in .java) all in a single directory called 

GeneCollection. There must also be a README file in that directory, which includes the 

discussion of Fred Foolish's views. Zip the folder, name the archive after your login (e.g. abcd5678.zip), 

and submit the zip file. 

• You must submit your zip file by 5:00pm on the Wednesday of Week 11, 15/10/2008. 

Late submissions will not be marked. A submission link will be provided on the course 

website and WebCT. 

• You can have multiple submissions, but only the last submission before the deadline will 

be kept and marked. 

• There will be a penalty if 1) you do not include your name and SID in your source code, 2) 

your zip file cannot be unzipped successfully, or 3) your program cannot be compiled and 

run successfully. It is your responsibility to make sure your zip file can be unzipped 

successfully and your program can be run successfully. 

• Submit a hard copy assignment cover sheet with declaration and your signature to your 

tutor in the tutorial time of week 11. 

• PLAGIARISM is strictly prohibited. Refer to the course website for more about the 

Academic Honesty policies of the University of Sydney and the School of Information 

Technologies. 


Appendix 


Radix Search Trie Structure 

A radix search trie is a data structure that can be used to provide map or dictionary 

functionality for the special case where the keys are strings with a small range of possible 

characters (for example, names only involve the 26 alphabetical characters). In this page we 

will use as keys, strings that represent genes. Each gene is a sequence of the 4 nucleotide 

bases, adenine, cytosine, guanine and thymine, and so each can be represented as a string 

where each character is A, C, G or T. 

This data structure is a variant of the trie or prefix tree, incorporating some but not all aspects 

of the radix tree. The term "trie" is not a misprint; it stands for "reTRIEval structure", and it's 

pronounced "try". Warning: some books have descriptions of structures like this for the simpler 

case where no key is a prefix of another. That does not apply here, since we can have both 

GC and GCGT as genes in our collection. 

The radix search trie is a structure involving two types of Nodes. A LeafNode represents (and 

stores) a gene with its associated information. An InternalNode represents (but does not store) 

a string which is a prefix of two or more genes in the collection. Each InternalNode stores 

references to other nodes, one reference leading (directly or indirectly) to the Nodes 

representing those genes which start with the given prefix and then an A, another to those 

genes that start with the given prefix and then a C, etc. The InternalNode also has a reference 

leading to the possible gene whose string is exactly the prefix, with no extra characters. 

Usually the five references will be in an array of Node references, indexed by 0 to 4 (with 

0="next character is A", 1= "next character is C", 2="next character is G", 3="next character is 

T" and 4 means "string ends here"). 
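
The two node types and the five-slot array described above could be sketched roughly as follows. Only the names LeafNode and InternalNode come from this document; the abstract Node base class, the field names, the END constant and the slotFor helper are assumptions made for illustration, and other layouts are equally reasonable.

```java
// Sketch under stated assumptions; not a prescribed design.
abstract class Node { }

// A leaf stores one gene together with its associated description.
final class LeafNode extends Node {
    final String gene;
    String description;

    LeafNode(String gene, String description) {
        this.gene = gene;
        this.description = description;
    }
}

// An internal node stands for a shared prefix and holds five child references.
final class InternalNode extends Node {
    static final int END = 4;            // slot meaning "string ends here"
    final Node[] children = new Node[5]; // slots 0..3 are A, C, G, T

    // Slot to follow for the character of gene at position depth.
    static int slotFor(String gene, int depth) {
        if (depth == gene.length()) {
            return END;
        }
        switch (gene.charAt(depth)) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:
                throw new IllegalArgumentException(
                        "not a nucleotide: " + gene.charAt(depth));
        }
    }
}
```

With this mapping, following the gene GC from the root uses slot 2 at depth 0, slot 1 at depth 1, and the END slot at depth 2.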

For example, suppose the collection has the genes AGTC, ATACG, ATG, GCGT, and GC. 

Then the internal nodes correspond to the common prefixes AT (prefix of both ATACG and 

ATG), A (prefix of AGTC, ATACG and ATG), GC (prefix of GC and GCGT), G (also a prefix of 

GCGT and GC), and the empty string (a prefix of all the strings). Note that there is not an 

internal node for a string like ATA, which is a prefix only of one gene in the collection. 
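
The rule in this example (an internal node exists exactly for those strings that are a prefix of two or more genes) can be checked by brute force. The sketch below is only a way to verify the example by hand; the class and method names are hypothetical, and a real trie would never scan all genes like this.

```java
// Brute-force check of which prefixes would get an InternalNode.
class InternalPrefixSketch {
    // Counts how many of the genes start with the given prefix.
    static int countWithPrefix(String[] genes, String prefix) {
        int count = 0;
        for (String gene : genes) {
            if (gene.startsWith(prefix)) {
                count++;
            }
        }
        return count;
    }

    // True when the prefix is shared by two or more genes.
    static boolean hasInternalNode(String[] genes, String prefix) {
        return countWithPrefix(genes, prefix) >= 2;
    }

    public static void main(String[] args) {
        String[] genes = {"AGTC", "ATACG", "ATG", "GCGT", "GC"};
        String[] candidates = {"", "A", "AT", "ATA", "G", "GC", "GCG"};
        for (String prefix : candidates) {
            System.out.println("\"" + prefix + "\" -> "
                    + hasInternalNode(genes, prefix));
        }
    }
}
```

For the five genes above, the check is true for the empty string, A, AT, G and GC, and false for ATA and GCG, which each prefix only one gene.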

The arrangement of the nodes is shown on the next page. 

Thus the node representing A has a "T" child which is the internal node representing the 

prefix AT (and leads to all LeafNodes whose genes start with AT); it also has a G child which 

is the LeafNode for the only string which starts AG, namely AGTC. The InternalNode 

representing GC has a G child which is the LeafNode representing GCGT (the only gene 

starting GCG) and another child in the last position of the array, corresponding to the gene 

GC itself. 

Of course the RadixSearchTrie class itself will be a separate class, with an instance variable 

myRoot which refers to a Node object. Once the collection has two or more genes, the root 

will be the InternalNode for the empty prefix; when the collection is empty, myRoot is null, and 

when the collection has just one gene, the root is the LeafNode for that single gene. 
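
The three possible states of the root can be illustrated with a small recursive lookup. This is a minimal sketch, not a reference solution: apart from myRoot, LeafNode and InternalNode, every name here (RootCasesSketch, slotFor, find, twoGeneExample) is an assumption, the tries are built by hand rather than through newGene, and the code assumes the trie is well formed.

```java
class RootCasesSketch {
    static abstract class Node { }

    static final class LeafNode extends Node {
        final String gene;
        final String description;
        LeafNode(String gene, String description) {
            this.gene = gene;
            this.description = description;
        }
    }

    static final class InternalNode extends Node {
        // Slots 0..3 are A, C, G, T; slot 4 means "string ends here".
        final Node[] children = new Node[5];
    }

    Node myRoot; // null while the collection is empty

    // Slot for the character at position depth, or -1 if it is not A, C, G or T.
    static int slotFor(String gene, int depth) {
        if (depth >= gene.length()) {
            return 4;
        }
        return "ACGT".indexOf(gene.charAt(depth));
    }

    String findDescription(String gene) {
        return find(myRoot, gene, 0);
    }

    // Recursive descent: one character of the gene is consumed per level.
    private static String find(Node node, String gene, int depth) {
        if (node == null) {
            return null;                       // empty trie or empty slot
        }
        if (node instanceof LeafNode) {
            LeafNode leaf = (LeafNode) node;   // a leaf may hold a longer gene
            return leaf.gene.equals(gene) ? leaf.description : null;
        }
        int slot = slotFor(gene, depth);
        if (slot < 0) {
            return null;                       // character outside A, C, G, T
        }
        return find(((InternalNode) node).children[slot], gene, depth + 1);
    }

    // Hand-built trie for the two genes GC and GCGT.
    static RootCasesSketch twoGeneExample() {
        InternalNode gc = new InternalNode();
        gc.children[2] = new LeafNode("GCGT", "longer gene");
        gc.children[4] = new LeafNode("GC", "shorter gene");
        InternalNode g = new InternalNode();
        g.children[1] = gc;
        InternalNode root = new InternalNode();
        root.children[2] = g;
        RootCasesSketch trie = new RootCasesSketch();
        trie.myRoot = root;
        return trie;
    }
}
```

Because the leaf comparison uses the whole gene, the same recursive method covers a null root, a root that is a single LeafNode, and a root that is an InternalNode.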



Each internal node has five slots (A, C, G, T, end); only the non-empty slots are drawn.

[node representing empty prefix]
  |
  +-- A --> [node for A]
  |           |
  |           +-- G --> [leaf for AGTC]
  |           |
  |           +-- T --> [node for AT]
  |                       |
  |                       +-- A --> [leaf for ATACG]
  |                       |
  |                       +-- G --> [leaf for ATG]
  |
  +-- G --> [node for G]
              |
              +-- C --> [node for GC]
                          |
                          +-- G ----> [leaf for GCGT]
                          |
                          +-- end --> [leaf for GC]
