Vol 9 No1 - Journal of Cell and Molecular Biology - Haliç Üniversitesi

More documents

Recommendations

Info

with a highly conserved region. It is calculated based on the length of the pattern and frequency i.e., number of occurrences in the database. The given pattern p is classified as significant if it satisfies the following constraints: The length of p ≥ a given length-threshold: a significant pattern must be sufficiently long to carry important biological information. The score of p ≥ a given score-threshold: a significant pattern must have a sufficiently high score. GSArray is constructed for the entire sequences in the database. While constructing GSArray, at each node i, the length and the frequency ) (i.e. the number of occurrences of pi in the database) is stored for the corresponding pattern pi this frequency is incremented for every new node. Then the score function is calculated using the equation (1) (1) Where = Function value of the pattern, = frequency of the node, = length of the label of prefix, i = node number, |DB| = Database size. The score function is used as a measure to determine whether a particular pattern is significant to be included in the post-processing for evaluation of sequences, to determine the closest set of sequences. Then the query sequence Q and number of sequences to be selected from the database is read from the user, temporarily added onto GSArray. This enables to determine which suffixes of the query are shared by the sequence in the database. The query sequence is only temporarily added to the tree so that GSAlign is not affected for future sequence searches. Initially all the nodes in the GSArray are 0. When the query sequence Q is added as a suffix, the nodes visited is set to 1. This expedites the search for common patterns within the GSArray because only those paths in the tree for patterns that contain substrings of the query sequence are examined. In depth-first manner, starting at the root all the nodes are visited to check the value 1. If the current node has no child whose value is 1, then the search backtrack to its parent node. During this traversal all the significant patterns are collected. The sequence may have other common patterns that are not significant. An optimal alignment between these two sequences in an ideal case contains all significant patterns. After this process the query Bio-database compression using enhanced suffix array 49 sequence from GSArray is deleted. Top ten significant patterns are selected and stored. The sequence that contains significant patterns is extracted and stored. Reverse check is made to obtain the accuracy of the results; it computes how many chosen patterns are being shared by each of the sequences extracted already. Higher the number (weight), greater will be the similarity of the corresponding sequence to the query. Based on this weight the sequences are ranked. A top n sequences are transferred to new database. Algorithm for BioDB Compression Tool is shown in Figure 3. //Input: Set of DNA and Protein Sequences // Output: Compressed set of DNA and Protein sequences S1: Read DNA or Protein database; S2: Construct GSArray for the input sequences, Set node visit =0; S3: While constructing the suffix array, store the information of label length, frequency at nodes and sequences traversing through the branches; S4: Use the equation (1) to calculate the degree of similarity of patterns in the form of prefixes; S5: Read query sequence Q and the number of sequences n to select from the user; S6: Temporarily add the suffixes of query to the generalized GSArray; S7: While adding the suffixes highlight nodes of the paths which are traversed by the query sequence; S8: Post process the GSArray to extract patterns shared by the query Sequence, which lies above a defined threshold on the function-value; S9: Pick the top ten of these patterns and store; S8: Do a reverse check to compute the weight of each sequence in the subset; S9: Rank the sequences according to these weights; S11: Pick top n sequences from the subset and write to a new database; Figure 3. Algorithm for BioDB Compression Tool Cross validation The real world DNA (GSS, EST) and Protein sequence (Protein C, S) databases from Oryza sativa group is extracted from NCBI website and enhanced suffix array is formed for all the sequences in the database. Based on the user given query sequence, the significant patterns are generated. The sequences in the DNA or Protein database are given weights according to the number of patterns they contain. The compressed database is formed by selecting top ‘n’ sequences with highest ranks and written into a new database. The latest data mining technique, 7-fold cross validation is used to validate the results obtained from BioDB Compression Tool and WU-BLAST.
50 Arumugam KUNTHAVAI and Somasundaram VASANTHA RATHNA The main idea of 7-fold cross validation approach is ‘train on 6 folds, test on 1 fold. The data set is divided into 7 parts. Among the 7 parts 6/7 of the data are used for training data set and the remaining 1/7 is used for testing data set. BioDB Compression Tool is applied on training data set and then on WU_BLAST for single user given query sequence. For the same query sequence WU-BLAST alone is applied on the testing data set. The average of this seven runs is computed for analysis. In this paper such seven queries are taken and analyzed. 49 sequences from Oryza sativa GSS gi: 288881557 to gi: 288881606 are taken as a database set, 42 sequences are used for training data set and 7 sequences are used for test data set. Single query sequence is applied first on BioDB Compression Tool, data base is compressed. WU-BLAST is performed on the new data base and the same query sequence. Then for same sequence WU-BLAST alone is applied. The sequences in the training and test data set are interchanged and the above steps are repeated until every fold is used for training. Average of the result from 7 runs are calculated and stored. This is experimented for 7 different queries. The objective is to test whether the results are consistent for all the queries on a particular database in terms of computational time. Results Compressed database from BioDB Compression Tool is cross validated using 7-fold cross validation approach. The cross validation results of seven different queries for DNA database set GSS gi: 288881557 to gi: 288881606 are shown in Figure 4 and EST gi: 288886142 to gi: 288886191 are shown in Figure 5 and the results of seven different queries Protein C &S are shown in Figures 6 and 7. Figure 4. Results of cross validation for GSS DNA dataset Figure 5. Results of cross validation for EST DNA dataset The idea is to test whether the results are consistent for all queries on a particular database in terms of computation time. From the figures 4-6 it is inferred that the developed BioDB Compression Tool performs much better than WU-BLAST. Computational complexity of using developed BioDB Compression Tool is reduced compared to WU-BLAST because; compression takes place along sequence alignment. The Enhanced suffix array algorithm used in BioDB Compression Tool requires 5n bytes character where SCT requires 20n bytes which uses suffix tree. Figure 6. Results of cross validation for Protein C Hence it is proved that space complexity is approximately five times less than SCT. Implementation results show that the running time of developed algorithm BioDB Compression Tool using enhanced suffix array is much better than SCT which uses suffix tree.
Page 1 and 2:
Journal of Cell and •Notch in apo
Page 3 and 4: Haliç University Faculty of Arts a
Page 6 and 7: Journal of Cell and Molecular Biolo
Page 8 and 9: The aforesaid, thus, strongly sugge
Page 12 and 13: factors accounts for the different
Page 14 and 15: compromise nearly 5-10% of the Cauc
Page 16 and 17: Genetic susceptibility to bladder c
Page 18 and 19: Cunningham JM. SJ Hebbring SJ, McDo
Page 22 and 23: Identification of silencing suppres
Page 24 and 25: AAUAGAAAAUGA preceding the 12K AUG
Page 28 and 29: Semiquantitative RT-PCR Total RNA w
Page 30 and 31: A Corchorus_olitorius SAPKARQRCVADG
Page 32 and 33: are indistinguishable in the region
Page 34: Rabbani MA, Maruyama K, Abe H, Khan
Page 37 and 38: 32 Preeti SINGH et al. Introduction
Page 39 and 40: 34 Preeti SINGH et al. Discussion C
Page 41 and 42: 36 Preeti SINGH et al. Mukhopadhyay
Page 43 and 44: 38 Jeyaraj ANBURAJ et al. Introduct
Page 45 and 46: 40 Jeyaraj ANBURAJ et al. Surface o
Page 47 and 48: 42 Jeyaraj ANBURAJ et al. Table 3.
Page 49 and 50: 44 Jeyaraj ANBURAJ et al. Gokhale M
Page 51 and 52: 46 Arumugam KUNTHAVAI and Somasunda
Page 53: 48 Arumugam KUNTHAVAI and Somasunda
Page 57 and 58: 52 Arumugam KUNTHAVAI and Somasunda
Page 59 and 60: 54 Sabira TAIPAKOVA et al. crystall
Page 61 and 62: 56 Sabira TAIPAKOVA et al. Results
Page 63 and 64: 58 Sabira TAIPAKOVA et al. Synthesi
Page 65 and 66: 60 Sabira TAIPAKOVA et al. Discussi
Page 70 and 71: µl containing 50-100 ng DNA templa
Page 72 and 73: There are number of factors which a
Page 76 and 77: Data regarding changes in hemoglobi
Page 78 and 79: elated to major structural and func
Page 82 and 83: C: C1 & C2 and D: D1 & D2. Animals
Page 84 and 85: Effects of Vernonia amygdalina extr
Page 86 and 87: histological evidences from our stu
Page 90 and 91: EGFR variations in bladder cancer 8
Page 94 and 95: Submission Policies and Authorship
Page 96: Journal of Cell and Volume 9 · No
show all

Vol 9 No1 - Journal of Cell and Molecular Biology - Haliç Üniversitesi

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?