ESSAYS ON TEXT MINING FOR IMPROVED DECISION MAKING ...


Analyzing existing Customers’ Websites

If $k \le r$ and the singular values $\lambda_{k+1}, \dots, \lambda_r$ are small compared to $\lambda_1, \dots, \lambda_k$, then LSI based on SVD allows a good approximation of $A_r$ with rank $r$ by $A_k$ with rank $k$. LSI therefore drops the smaller singular values in $\Sigma$, retaining only the first $k$ singular values (with $k$ predetermined), and keeps only the first $k$ columns of $U$ and $V$:

$$A_k = U_k \Sigma_k V_k^{t} \qquad (5)$$

where $U_k$, $\Sigma_k$ and $V_k$ are the rank-$k$ approximations of $U$, $\Sigma$ and $V$, respectively. The approximated rank-$k$ concept-website similarity matrix $V_k$ contains information on how strongly a given website loads on each of the $k$ concepts, which reflect the hidden (latent semantic) patterns in the textual information.

3.3.2 Concept dimension selection

The choice of $k$, the number of concepts, is critical for the predictive performance obtained with SVD. If $k$ is too large, too many irrelevant or unimportant concepts are used for prediction. Conversely, if $k$ is too small, relevant concepts are likely to be missed. An optimal number of concepts $k$ can be determined with an operational criterion, i.e. a value of $k$ that yields good performance (Chen et al., 2010). In this paper, we use a parameter-selection procedure: we construct several rank-$k$ models, apply fivefold cross-validation on the training set to each rank-$k$ model, and select the rank-$k$ model with the best cross-validated performance for further analysis. The cross-validation performance is determined by the results of the prediction model (see Sect. 3.5).

3.3.3 Projection of test examples into the LSI-subspace

The meaning of the concepts during testing should stay the same as during training. Consequently, test examples are transformed into term vectors using the same steps of the pre-processing phase (see Sect. 3.2).
Additionally, the test examples are projected into the same latent semantic subspace that was created during training (Zhong & Li, 2010). A term vector $A_d$ is created for each test example, and the new concept-website vector is calculated by

$$V_d = A_d' \cdot U_k \cdot \Sigma_k^{-1} \qquad (6)$$

with $U_k$ the rank-$k$ concept-term similarity matrix and $\Sigma_k$ the rank-$k$ diagonal singular value matrix, both taken from the original SVD. The new concept-website vector $V_d$ is comparable to the concept-website vectors of the matrix $V_k$.
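The truncation in Eq. (5) and the projection in Eq. (6) can be illustrated with a minimal NumPy sketch. The toy term-by-website matrix, the value k = 2, and the helper names `truncated_svd` and `project` are assumptions made for illustration only; they are not taken from the study.

```python
import numpy as np

def truncated_svd(A, k):
    """Return U_k, the k largest singular values, and V_k of a term-by-website matrix A."""
    # np.linalg.svd returns the singular values in descending order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T

def project(a_d, U_k, s_k):
    """Project a term vector into the LSI subspace: V_d = A_d' U_k Sigma_k^{-1} (Eq. 6)."""
    # Dividing by s_k is the same as multiplying by the inverse of the diagonal Sigma_k.
    return a_d @ U_k / s_k

# Hypothetical toy data: 5 terms (rows) x 4 websites (columns).
A = np.array([[2.0, 0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 3.0, 0.0, 1.0],
              [0.0, 1.0, 0.0, 2.0],
              [1.0, 0.0, 2.0, 0.0]])
k = 2  # assumed number of concepts

U_k, s_k, V_k = truncated_svd(A, k)
A_k = U_k @ np.diag(s_k) @ V_k.T  # rank-k approximation, Eq. 5

# Projecting a training column reproduces its row of V_k, which is why
# projected test vectors are comparable to the training vectors in V_k.
v_d = project(A[:, 0], U_k, s_k)
```

Because the projection reuses $U_k$ and $\Sigma_k$ from the training SVD, a test website is expressed in the same concepts as the training websites rather than in concepts re-estimated from the test data.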
3.4 Website clustering

The concepts of the dimension-reduced concept-website matrix reflect the hidden, latent semantic patterns in the textual information from companies’ websites that have a high discriminatory power relative to other concepts. To apply these concepts to customer acquisition (e.g. to create a list of new potential customers), we have to identify prevalent terms that mainly represent concepts from profitable companies’ websites and, as far as possible, not concepts from non-profitable companies’ websites. Websites are clustered to identify these terms. An expectation-maximization (EM) algorithm is used to find maximum likelihood estimates of the parameters of a probabilistic model, where the model depends on the dimension-reduced SVD concepts. The resulting classes contain profitable as well as non-profitable companies’ websites. Each class label consists of a number of prevalent terms from the websites assigned to that class. Labels are selected from classes that are mainly assigned to profitable customers’ websites, and the terms from these labels are used as a search query in a web mining approach. In this way, companies can be identified whose websites contain the selected terms and which do not themselves occur among the training examples. It can be supposed that their websites contain concepts similar to those of the corresponding class and that they are therefore probably profitable customers, too. To evaluate their profitability, textual information from their websites is collected (see Sect. 3.1), pre-processed (see Sect. 3.2), projected into the LSI-subspace (see Sect. 3.3.3), and used as test examples in a prediction modeling approach (see Sect. 3.5).

3.5 Prediction Modeling

Logistic regression is used as the modeling technique: a likelihood function is constructed and maximized to obtain an appropriate fit to the data (Allison, 1999; Inagaki, 2010).
Logistic regression is conceptually simple (DeLong, DeLong, & Clarke-Pearson, 1988), a closed-form solution for the posterior probabilities is available, and it provides quick and robust results in a prediction context (Greiff, 1998). Given a training set $T = \{(x_i, y_i)\}$ with $i \in \{1, 2, \dots, N\}$, input data $x_i \in \mathbb{R}^n$ and corresponding binary target labels $y_i \in \{0, 1\}$ (non-profitable, profitable), logistic regression is used to estimate the probability $P(y = 1 \mid x)$ given by

$$P(y = 1 \mid x) = \frac{1}{1 + \exp(-(w_0 + w x))} \qquad (7)$$
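As a minimal sketch of Eq. (7), the following NumPy code evaluates the logistic probability and fits $w_0$ and $w$ by maximizing the log-likelihood. The gradient-ascent optimizer, the learning rate, the iteration count, and the toy concept-website vectors are all assumptions made for illustration; the text does not state which optimizer or data were used.

```python
import numpy as np

def predict_proba(X, w0, w):
    """P(y = 1 | x) = 1 / (1 + exp(-(w0 + w x))), as in Eq. 7."""
    return 1.0 / (1.0 + np.exp(-(w0 + X @ w)))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Maximize the mean log-likelihood by plain gradient ascent (illustrative choice)."""
    w0, w = 0.0, np.zeros(X.shape[1])
    for _ in range(n_iter):
        # The gradient of the log-likelihood is driven by the residual y - P(y = 1 | x).
        residual = y - predict_proba(X, w0, w)
        w0 += lr * residual.mean()
        w += lr * (X.T @ residual) / len(y)
    return w0, w

# Hypothetical concept-website vectors (one row per website, k = 2 concepts);
# label 1 = profitable, 0 = non-profitable.
X = np.array([[0.9, 0.1], [0.8, 0.3], [0.7, 0.2],
              [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]])
y = np.array([1, 1, 1, 0, 0, 0])

w0, w = fit_logistic(X, y)
p = predict_proba(X, w0, w)  # estimated probabilities of being profitable
```

In the setting described here, the rows of `X` would be the concept-website vectors from $V_k$ for training and the projected vectors $V_d$ for test examples.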