Download - Academy Publisher

More documents

Recommendations

Info

tree and WSTD model, and then presents the detailed design of the new weighted phrase-based document similarity; Section 4 illustrates the experimental results of evaluation for the new WSTD model, with the comparison of STD model; Section 5 summarizes the work. II. WEB DOCUMENT REPRESENTATION For Web documents, one of three levels of significance is assigned to the different parts; HIGH (H), MEDIUM (M), and LOW (L)[4]. Examples of HIGH significance parts are the title, meta keywords, meta description, and section headings. Examples of MEDIUM significance parts are texts that appear in bold, italic, colored, hyperlinked text. LOW significance parts are usually comprised of the document body text that was not assigned any of the other levels. A formal model is presented that the document is represented as a vector of sentences, which in turn are composed of a vector of terms[4]: d i ={s ij : j=1,…,p i } , s ij ={t ijk :k=1,…,l ij ; w ij } , (1a) (1b) where d i : is document i in documents set D, s ij : is sentence j in document i, p i : is the number of sentences in document i, t ijk : is term k of sentence s ij , l ij : is the length of sentence s ij , and w ij : is the level of the significance associated with sentence s ij . III. CLUSTERING METHOD BASED ON WSTD MODEL The main steps of the new approach are constructing WSTD model, mapping each node into a unique dimension of an M-dimensional vector space, computing feature term weights, finally applying GHAC to clustering Web documents. A. Definition of Weighted Suffix Tree For a sentence s of n words, in order to build a suffix tree[6, 9], a nonempty character $ is appended to the end of the sentence s with Ukkonen’s algorithm[9]. The length of s is changed to n+1. When constructing the suffix tree, the sentences s is represented as a sequence of words, not characters. The kth suffix of a sentence s is the substring of s that starts with word k to the end of s. Definition 1 The suffix tree T of the sentence s is a rooted, compact trie containing all the suffixes of s. Each internal node has at least 2 children. Each edge is labeled with a nonempty substring of s. The label (phrase from the node) of a node is defined to be the concatenation of the edge-labels on the path from the root to that node. No two edges out of the same node can have edge-labels that begin with the same word. For each suffix s[k…n+1] of s, there exists a suffix-node whose label equals it. Definition 2 The weighted suffix tree T s of the sentence s is based on a suffix tree of s, each node v of which records the sentence s, the level of significance w of s, and the times r that s traverses the node v. Its information is represented as a triple (s, r, w). Definition 3 The weighted suffix tree T d of document d i is based on a suffix tree which p i sentences are inserted into. Each node records a list of triple (s ij , r ij , w ij ). Definition 4 The weighted suffix tree T D of documents set D is based on a suffix tree which N documents are inserted into. Each document d i contains p i sentences. Thus, ∑p i sentences are inserted into T D . Each node records a list of triple (s ij , r ij , w ij ). Four properties are concluded from T D . Property 1 Any node v of the weighted suffix tree T D has its depth level L v . Each phrase P v denoted by any internal node v contains at least two words at a higher level (L v ≥2). Phrases of various lengths can be extracted. Property 2 Structure weights of any node appearing in different sentences or documents can be obtained from T D . Property 3 Summing up all times which a node is traversed by the sentences of a document, the node frequency, i.e., phrase frequency in a document, can be obtained. Phrase frequency is denoted by tf. Property 4 Summing up all times which a node is traversed by different documents, the document frequency of a node can be obtained. Document frequency is denoted by df. Fig.1 is an example of a suffix tree composed from two Web documents. d 0 : cat ate cheese. mouse ate cheese too. d 1 : cat ate mouse too. 0.0,1.0 (1,1) (H,L) cheese mouse too Figure 1. Suffix tree of documents set. cat ate cheese ate mouse too $ too 0.0,0.1 0.0,0.1,1.0 (1,1) (1,1,1) (H,L) (H,L,M) cheese too too $ mouse too ate cheese too 0.0,0.1 (1,1) (H,L) Figure 2. Weighted suffix tree of documents set. $ 0.1,1.0 (1,1) (L,M) 166
The nodes in the suffix tree are drawn in circles. There are three kinds of nodes: the root node, internal node, and leaf node. Each internal node represents a common phrase shared by at least two suffix substrings. Each internal node is attached to an individual box. Each upper number in the box designate a document identifier, the number below designates the traversed times of the document. Fig.2 is an example of a weighted suffix tree composed of three sentences with levels of significance which originate from the two documents d 0 and d 1 partitioned by HTML tags. Each internal node is attached to an individual box. Each upper number designates a sentence identifier which contains document identifier in the left of the period mark, the middle number designates the traversed times of the corresponding sentence, the number below designates the level of significance of corresponding sentence. The length of phrases extracted from weighted suffix tree is much less than from ordinary suffix tree. From the weighted suffix tree, structure weight w, tf, and df of each node can be obtained. B. Similarity Based on the Weighted Phrases In this paper, the symbols N, M, and q are used to denote the number of documents, the number of feature terms, and the number of clusters, respectively. The C 1 , C 2 ,…,C q are used to denote each one of the q clusters of result. Each document d is considered to be a vector in the M- dimensional feature term space. In this paper, the TF-IDF weighting scheme is employed, in which each document d can be represented as[5]: r d = { w(1, d), w(2, d),..., w( M , d)} , (2a) w(i,d)=(0.1+logtf(i,d))×log(0.1+N/df(i)) , (2b) where w(i,d) is ith feature term weight in document d, tf(i,d) is the frequency of the ith term in the document d, and df(i) is the number of documents containing the ith term. In the VSD model, the cosine similarity is the most commonly used to measure and compute the pairwise similarity of two document d i and d j , which is defined as: r r di • d j sim( di, d j ) = r r . (3) d × d With respect to WSTD model, a node v can appear in different sentences with its level of significance. When mapping a node v of WSTD model to VSD model, besides df(v) and tf(v,d), the feature term weight depends on the level of significance of v. The symbols tf H (v,d), tf M (v,d), and tf L (v,d) are used to denote the term frequency of the node v in HIGH, MEDIUM and LOW significance parts of a document respectively. The term frequency tf(v,d) of node v is updated to: tf(v,d)=w H ×tf H (v,d)+w M ×tf M (v,d)+w L ×tf L (v,d) , (4) i j where w H , w M , w L are used to denote different level of significance of a node v, w L is assigned 1, and the others are greater than w L . The feature term weight of node v is represented as: w(v,d)=(0.1+logtf(v,d))×log(0.1+N/df(v)) . (5) C. Computation Complexity of the Novel Approach Ukkonen[9] provides an building suffix algorithm for a document of m words with time complexity of O(m). In WSTD model, the length of a document m is assumed to be bounded by a constant. Therefore, the time complexity of building WSTD model is O(N). The time complexity of using GHAC algorithm to cluster Web documents is O(N 2 logN)[8]. For the dimensionality M of features vector usually is large, similarity computations are time consuming. As a result, the overall time complexity of the proposed approach is O(N 2 logN). IV. EXPERIMENTAL RESULTS A. Experimental Setup Three data sets are used, two of which are Web document data sets, and the third is a collection of the search engine results called AMBIENT[10] in which each document includes a title and a snippet. Table 1 describes the data sets. Documents set DS1 and DS2 are described and used in [4]. The proposed approach is implemented in Java with JDK 6. For the weighted suffix tree construction, an open source library of the Carrot2[11] is used. But there exist some errors and defects for counting the times which documents traverse nodes. These errors are corrected in the work and Carrot2 is extended with levels of significance of nodes for WSTD model. The GHAC algorithm is implemented in Java by following the description in [8]. B. Evaluation Metrics In order to evaluate the quality of the clustering, two quality measures are adopted, which are widely used in the text mining literature for the purpose of document clustering [2-5]. The first is the F-measure, which combines the Precision and recall ideas in information retrieval. The second measure is the Entropy, which tells how homogeneous a cluster is. The higher the homogeneity of a cluster, the lower the entropy is, and vice versa. To sum up, the approach maximizes the F-measure and minimizes the entropy score of the clusters to achieve a high-quality clustering. TABLE I. DATA SETS DESCRIPTIONS Data Avg. Name Type #Docs Categories Sets words DS1 UW-CAN HTML 314 10 469 DS2 Yahoo news HTML 2340 20 289 DS3 AMBIENT TEXT 4400 44 41 167
Page 1 and 2:
Proceedings The Second Internationa
Page 3 and 4:
Table of Contents Message from the
Page 5 and 6:
Zuming Xiao, Zhan Guo, Bin Tan, and
Page 7 and 8:
Message from the Symposium Chairs T
Page 9 and 10:
Second International Symposium on N
Page 11 and 12:
ISBN 978-952-5726-09-1 (Print) Proc
Page 13 and 14:
model that can deal with time serie
Page 15 and 16:
ISBN 978-952-5726-09-1 (Print) Proc
Page 17 and 18:
training for the Wushu competition
Page 19 and 20:
ISBN 978-952-5726-09-1 (Print) Proc
Page 21 and 22:
information, called weak uncertain
Page 23 and 24:
Student side Student side Student s
Page 25 and 26:
ISBN 978-952-5726-09-1 (Print) Proc
Page 27 and 28:
If we define element of student as
Page 29 and 30:
ISBN 978-952-5726-09-1 (Print) Proc
Page 31 and 32:
QoS, each frame data is divided int
Page 33 and 34:
ISBN 978-952-5726-09-1 (Print) Proc
Page 35 and 36:
for Imaging Two and Three Phase Flo
Page 37 and 38:
ISBN 978-952-5726-09-1 (Print) Proc
Page 39 and 40:
A. Profiling&following control algo
Page 41 and 42:
Research and Realization about Conv
Page 43 and 44:
PDF document structure is a tree st
Page 45 and 46:
ISBN 978-952-5726-09-1 (Print) Proc
Page 47 and 48:
support 10 Mb / s. But ENC28J60 onl
Page 49 and 50:
ISBN 978-952-5726-09-1 (Print) Proc
Page 51 and 52:
condition the first byte output of
Page 53 and 54:
ISBN 978-952-5726-09-1 (Print) Proc
Page 55 and 56:
According to maximum membership deg
Page 57 and 58:
ISBN 978-952-5726-09-1 (Print) Proc
Page 59 and 60:
ubber according to the mass ratio o
Page 61 and 62:
Ⅲ. AN IMPROVED DNA ALGORITHM FOR
Page 63 and 64:
Ⅴ.CONCLUSION REMARKS DNA computer
Page 65 and 66:
its first child q 1 on the left, th
Page 67 and 68:
chains can greatly improve efficien
Page 69 and 70:
II. RELATED WORK A. Mobile Service
Page 71 and 72:
special services. SOAP is used to b
Page 73 and 74:
detection methods of DDoS attacks m
Page 75 and 76:
Step4 calculate the new subordinate
Page 77 and 78:
Figure 1. An analysis of the partit
Page 79 and 80:
ISBN 978-952-5726-09-1 (Print) Proc
Page 81 and 82:
The preceding three formulas can be
Page 83 and 84:
ISBN 978-952-5726-09-1 (Print) Proc
Page 85 and 86:
And the decay speed of buffer seque
Page 87 and 88:
ISBN 978-952-5726-09-1 (Print) Proc
Page 89 and 90:
distance of view point. Given that
Page 91 and 92:
ISBN 978-952-5726-09-1 (Print) Proc
Page 93 and 94:
Ⅳ. EVALUATION OF BLENDED LEARNING
Page 95 and 96:
ISBN 978-952-5726-09-1 (Print) Proc
Page 97 and 98:
symmetric with respect to the origi
Page 99 and 100:
ISBN 978-952-5726-09-1 (Print) Proc
Page 101 and 102:
each sample belongs to each categor
Page 103 and 104:
ISBN 978-952-5726-09-1 (Print) Proc
Page 105 and 106:
A. Data Preparing To generate our t
Page 107 and 108:
ISBN 978-952-5726-09-1 (Print) Proc
Page 109 and 110:
oth α and β . determination of Qu
Page 111 and 112:
ISBN 978-952-5726-09-1 (Print) Proc
Page 113 and 114:
Strong earthquake 0.1< M L
Page 115 and 116:
ISBN 978-952-5726-09-1 (Print) Proc
Page 117 and 118:
indirect causes, and the logical re
Page 119 and 120:
egression. Granger causality test i
Page 121 and 122:
B. Evaluation We have evaluated thi
Page 123 and 124:
used as a source and neighboring ce
Page 125 and 126: Figure 3. Drainage networks generat
Page 127 and 128: Combinational logic unit failures i
Page 129 and 130: Tabal.1 Combinational fault logic t
Page 131 and 132: under the endorsement of both the m
Page 133 and 134: TABLE I. DESCRIPTION OF PROPOSITION
Page 135 and 136: Figure 2. The process of the invers
Page 137 and 138: Figure 9. (a)the original image.(b)
Page 139 and 140: ISBN 978-952-5726-09-1 (Print) Proc
Page 141 and 142: In order to reduce to the number of
Page 145 and 146: that studying being going to be to
Page 147 and 148: Reference[10] analyzed the evolutio
Page 149 and 150: module, communication module, apper
Page 151 and 152: After the comprehensive performance
Page 153 and 154: system testing can be seen that the
Page 155 and 156: III. AUTONOMIC RESOURCE ALLOCATION
Page 157 and 158: The QoE i (T i ) is the ith user’
Page 161 and 162: nodes will be formed one cluster, t
Page 165 and 166: Where n is the number of data point
Page 169 and 170: Keyboard event Application User mod
Page 173 and 174: SQL Azure will eventually include a
Page 175: ISBN 978-952-5726-09-1 (Print) Proc
Page 179 and 180: [3] Y. Li, S. M. Chung, and J. D. H
Page 181 and 182: some drilling fluid produces hydrog
Page 183 and 184: "normalization", whose membership b
Page 185 and 186: and minimum structural elements in
Page 187 and 188: esults of the spatial transform par
Page 189 and 190: B. Research on protocol actions Com
Page 191 and 192: ACKNOWLEDGMENT This work is funded
Page 193 and 194: A. Problem Description In a small b
Page 195 and 196: [4] Huang, Hung, and J. Y. jen Hsu.
Page 197 and 198: Further by calculating, the followi
Page 199 and 200: TABLE II. THE CONCENTRATION BETWEEN
Page 201 and 202: private key that obtained by using
Page 205 and 206: ontology, and the domain dictionary
Page 209 and 210: Eq.4 is NP-hard and can be solved b
Page 213 and 214: Where: G i is the selection field o
Page 215 and 216: If there is X j in a generation, wh
Page 217 and 218: Theorem 2.4([10]). Let L 1 and L 2
Page 219 and 220: The relation between the lattice im
Page 221 and 222: Figure 2. Example of single-step de
Page 223 and 224: evocation and the fourth group stor
Page 225 and 226: focused mainly on rule-based forms
Page 227 and 228:
Ⅵ. CONCLUSION Figure 6. The Flask
Page 229 and 230:
EDCF in comparison with DCF, has so
Page 231 and 232:
B. Simulation results and analysis
Page 233 and 234:
ISBN 978-952-5726-09-1 (Print) Proc
Page 235 and 236:
in routing table. When search resou
Page 237 and 238:
ISBN 978-952-5726-09-1 (Print) Proc
Page 239 and 240:
others through the selective emotio
Page 241 and 242:
such as weather information, techno
Page 243 and 244:
the opening window, a related page
Page 245 and 246:
1) Process context storage areas. I
Page 247 and 248:
V. CONCLUSION In this thesis, sCPU-
Page 249 and 250:
main types of horizontal search env
Page 251 and 252:
Enterprise Portal security provides
Page 253 and 254:
() t = [ S () t S () t ] T N S ,...
Page 255 and 256:
[2] Kumaravel, N., and Kavitha, V.,
Page 257 and 258:
output, so BP network has been wide
Page 259 and 260:
input to the artificial neural netw
Page 261 and 262:
(2) In a process (tokens from outsi
Page 263 and 264:
The reduction process consists of t
Page 265 and 266:
all eight normal vectors are identi
Page 267 and 268:
is B object , and the number of emp
Page 269 and 270:
xi, j y' = εα ( (1 + tanh( )) −
Page 271 and 272:
Error 8000 7000 6000 5000 4000 3000
Page 273 and 274:
Architecture (AMBA) a new bus archi
Page 275 and 276:
In order to ensure the smooth proce
Page 277 and 278:
ISBN 978-952-5726-09-1 (Print) Proc
Page 279 and 280:
In this paper, we adopt two Sobel o
Page 281 and 282:
ISBN 978-952-5726-09-1 (Print) Proc
Page 283 and 284:
four layers.The far right of the gr
Page 285 and 286:
ISBN 978-952-5726-09-1 (Print) Proc
Page 287 and 288:
TABLE II. OPERATIONAL EMPLOYEE TABL
Page 289 and 290:
A. Object-relational Type To audit
Page 291 and 292:
to execution time. Also, DBA would
Page 293 and 294:
Liping Chen .......................
show all

Download - Academy Publisher

Create successful ePaper yourself

Delete template?

Save as template?