ESSAYS ON TEXT MINING FOR IMPROVED DECISION MAKING ...

More documents

Recommendations

Info

Analyzing existing Customers’ Websites depending on part of speech. Unfortunately, lemmatization is time consuming and still error prone (Thorleuchter, 2008). Therefore, a dictionary-based stemmer is used combined with a set of production rules to give each term a correct stem. The production rules are used when a term is unrecognizable in the dictionary. The term frequencies in textual information follow a Zipf distribution (Zipf, 1949). Half of them appear only once or twice. Thus, those rare terms under these thresholds are deleted that often yield great savings. The last step in term filtering is to check the selected terms manually (Gericke et al., 2009). 3.2.3 Term vector weighting The selected terms are used to construct a vector of weighted frequencies for each web page. In contrast to term vectors where the component values are raw frequencies of appearance for a term in a web page, the use of term weighting schemes leads to significant improvements in retrieval performance (Sparck Jones, 1972). The weights reflect the importance of a term in a specific web page of the considered web page collection. Large weights are assigned to terms that are used frequently in selected web pages but rarely in the whole web page collection (Salton & Buckley, 1988). Thus a weight wi,j for a term i in web page j is computed by term frequency tfi,j times inverse web page frequency idfi, which describes the term specificity within the web page collection. In (Salton, Allan, & Buckley, 1994) a weighting scheme was proposed that has meanwhile proven its usability in practice. Besides term frequency – defined as the absolute frequency of term i in web page j - and inverse document frequency - defined as idfi := log(n/ dfi) -, a length normalization factor is used to ensure that all documents have equal chances of being retrieved independent of their lengths: tfi, j ⋅ log( n / dfi ) w i, j = (1) 2 tf ⋅ (log( n / df )) m ∑ p= 1 2 i, jp i p where n is the size of the web page collection D where web pages are represented by term vectors in m-dimensional space, and dfi is the number of web pages in D that contain term i (Chen, Chiu, & Chang, 2005). 3.2.4 Term vector aggregation As a result, a high-dimensional, weighted term-by-web page matrix is created. However, from the managerial point of view, a prediction is done per customer’s company website. _______________________________________________________________________ 36
Analyzing existing Customers’ Websites Thus, an aggregation of the web pages that belong to the same customer’s company website is needed. The aggregated weight of term i for all web pages belonging to a customer’s company website j (Coussement & Van den Poel, 2009) is r Aw i j = ∑ wi, k k = 1 , (2) with wi,k equal to the weight of term i in web page k and r equal to the number of web pages belonging to the same customer’s company website. 3.3 Concept identification with LSI and singular value decomposition Using each distinct term as a feature would lead to an unmanageable high dimensionality of the feature space. Additionally, most weights are zero for a customer’s company. To reduce the dimension of the feature space LSI is used. LSI groups together related terms (Deerwester et al., 1990) and together with singular value decomposition (SVD) it forms semantic generalizations due to the fact that relationships between terms are recognized by the appearance of terms in similar documents (e.g. web pages). SVD transforms web pages from the high-dimensional feature space to an orthonormal, semantic, latent subspace. Similar terms (keywords) are grouped into concepts. Each concept has a high discriminatory power to other concepts in the reduced feature space. 3.3.1 Feature space dimension reduction The SVD of a term-by-website (m x n) matrix A with rank r (r ≤ min(m,n)) is a transformation into a product of three matrices in form of A = U Σ V t (3) with Σ equal to a diagonal (r x r) matrix containing positive singular values of matrix A where λ1 ≥ λ2 ≥ … ≥ λr, U equal to the term-concept similarity (m x r) matrix, and V equal to the concept-company similarity (n x r) matrix. The columns of U and V are orthonormal in the Euclidean sense (Yen & Lina, 2010). The weights of the matrix A depended on the latent concepts by r ∑ x= 1 w i, j = Ui, x ⋅ Σ x ⋅ d j, x (4) _______________________________________________________________________ 37
Page 1 and 2: ESSAYS ON TEXT MINING FOR IMPROVED
Page 3 and 4: Table of Contents 1. Acknowledgment
Page 5 and 6: 2. Nederlandstalige Samenvatting In
Page 7 and 8: ideeën zijn. Ten einde dit doel te
Page 9 and 10: Ein wesentlicher Erfolgsfaktor für
Page 11 and 12: 4. Summary Data mining is defined a
Page 13 and 14: from customers' and companies' webs
Page 15 and 16: of e-commerce companies divided int
Page 17 and 18: that provides commentary, news, and
Page 19 and 20: Study 3 shows that a semi-automatic
Page 21 and 22: take the granularity of the context
Page 23 and 24: [25] Herring, S.C., Scheidt, L.A.,
Page 25 and 26: [56] Thorleuchter, D., Van den Poel
Page 27 and 28: Table of Contents Analyzing existin
Page 29 and 30: 1 Introduction Analyzing existing C
Page 31 and 32: Analyzing existing Customers’ Web
Page 33 and 34: Fig. 2: Different steps in data col
Page 35: Analyzing existing Customers’ Web
Page 39 and 40: 3.4 Website clustering Analyzing ex
Page 41 and 42: 4 Empirical verification 4.1 Resear
Page 47 and 48: 5 Conclusion Analyzing existing Cus
Page 51 and 52: Chapter II Predicting E-Commerce Co
Page 53 and 54: Predicting E-Commerce Success using
Page 59 and 60: with , )} x T = the training set, }
Page 61 and 62: 4.3 Comparing predictive performanc
Page 63 and 64: 4.4 Results Predicting E-Commerce S
Page 69 and 70: Chapter III Finding new technologic
Page 71 and 72: Finding New Technological Ideas Fin
Page 73 and 74: Finding New Technological Ideas to
Page 75 and 76: Finding New Technological Ideas 3.
Page 77 and 78: p( Ξ raw δ ,.., δ ) 1 μ ⎧ raw
Page 79 and 80: Finding New Technological Ideas inv
Page 81 and 82: Chapter IV Mining Ideas from Textua
Page 83 and 84: Mining Ideas from Textual Informati
Page 85 and 86: 1.2. Idea Definition Mining Ideas f
Page 87 and 88:
3. Idea Mining Process Mining Ideas
Page 89 and 90:
Mining Ideas from Textual Informati
Page 91 and 92:
Page 93 and 94:
Page 95 and 96:
6. Results and Discussions Mining I
Page 97 and 98:
Page 99 and 100:
Page 101 and 102:
Page 103 and 104:
Chapter V Mining Innovative Ideas t
Page 105 and 106:
Mining Innovative Ideas to Support
Page 107 and 108:
Mining Innovative Ideas idea is a n
Page 109 and 110:
Mining Innovative Ideas identifies
Page 111 and 112:
Mining Innovative Ideas categories
Page 113 and 114:
9. Outlook Mining Innovative Ideas
Page 115 and 116:
Chapter VI Extracting Consumers Nee
Page 117 and 118:
Extracting Consumers Needs for New
Page 119 and 120:
2. Rationale Behind this Approach E
Page 121 and 122:
Page 123 and 124:
Page 125 and 126:
Page 127 and 128:
Bibliography Extracting Consumers N
Page 129 and 130:
Table of Contents A Compared Cross
Page 131 and 132:
A Compared Cross Impact Analysis Ke
Page 133 and 134:
A Compared Cross Impact Analysis Th
Page 135 and 136:
A Compared Cross Impact Analysis te
Page 137 and 138:
A Compared Cross Impact Analysis In
Page 139 and 140:
A Compared Cross Impact Analysis De
Page 141 and 142:
A Compared Cross Impact Analysis A
Page 143 and 144:
A Compared Cross Impact Analysis Ta
Page 145 and 146:
A Compared Cross Impact Analysis Ea
Page 147 and 148:
Table 3: Technology area pairs with
Page 149 and 150:
A Compared Cross Impact Analysis A
Page 151 and 152:
A Compared Cross Impact Analysis al
Page 153 and 154:
A Compared Cross Impact Analysis As
Page 155 and 156:
A Compared Cross Impact Analysis Ta
Page 157 and 158:
A Compared Cross Impact Analysis Th
Page 159:
A Compared Cross Impact Analysis [2
show all

ESSAYS ON TEXT MINING FOR IMPROVED DECISION MAKING ...

Create successful ePaper yourself

Delete template?

Save as template?