<strong>GfKl 2008</strong><br />
German Classification Society<br />
32nd Annual Conference<br />
Advances in Data Analysis, Data Handling<br />
and Business Intelligence<br />
Joint Conference with the<br />
British Classification Society (BCS) and the<br />
Dutch/Flemish Classification Society (VOC)<br />
July 16−18, 2008<br />
Hamburg<br />
Program and<br />
Abstract Volume<br />
http://gfkl2008.hsu-hh.de
Main contact:<br />
Prof. Dr. Wilfried Seidel<br />
Helmut-Schmidt-University<br />
Holstenhofweg 85<br />
22043 Hamburg<br />
Germany<br />
+49 (0)40 6541 2315<br />
gfkl2008@hsu-hh.de<br />
Contents<br />
Welcome, v<br />
Sponsors, vi<br />
Scientific Program Committee, vii<br />
Organizing Committee, vii<br />
Plenary and Semi-plenary Lectures, viii<br />
Invited Sessions, x<br />
Detailed Schedule, xi<br />
List of Contributions, xxviii<br />
Author Index, xxxvii<br />
Abstracts, 1<br />
Welcome<br />
On behalf of the Helmut-Schmidt-University Hamburg,<br />
we welcome you to GfKl 2008 − Advances in Data Analysis,<br />
Data Handling and Business Intelligence − the 32nd Annual<br />
Conference of the German Classification Society, organized<br />
in cooperation with the British Classification Society (BCS)<br />
and the Dutch/Flemish Classification Society (VOC).<br />
The conference features 13 invited lectures (3 plenary<br />
speeches and 10 semi-plenary lectures), 166 contributed<br />
talks, 4 invited sessions, and 2 workshops.<br />
We are indebted to those who proposed and supported<br />
holding GfKl 2008 in Hamburg. We are grateful to everyone<br />
who volunteered in organizing the conference, and we<br />
acknowledge the generous financial backing of our sponsors.<br />
We wish you a very pleasant and stimulating GfKl 2008.<br />
Claudia Fantapié Altobelli<br />
Andreas Fink<br />
Hartmut Hebbel<br />
Wilfried Seidel<br />
Detlef Steuer<br />
Ulrich Tüshaus<br />
(Organizing committee,<br />
Helmut-Schmidt-University Hamburg)<br />
Berthold Lausen<br />
Alfred Ultsch<br />
(Co-chairs of the program committee)<br />
Sponsors<br />
The organizers would like to express their appreciation to the<br />
following organizations for providing financial help and other<br />
support:<br />
Deutsche Forschungsgemeinschaft<br />
Gesellschaft für Konsumforschung<br />
Hamburg-Mannheimer Versicherungen<br />
Hamburger Sparkasse<br />
Springer-Verlag<br />
StatSoft GmbH<br />
Vattenfall<br />
Volksfürsorge Deutsche Lebensversicherung AG<br />
Scientific Program Committee<br />
H.-H. Bock (RWTH Aachen)<br />
R. Decker (Uni Bielefeld)<br />
V. Esposito Vinzi (ESSEC Paris)<br />
W. Esswein (TU Dresden)<br />
C. Fantapié Altobelli (HSU Hamburg)<br />
A. Fink (HSU Hamburg)<br />
W. Gaul (Uni Karlsruhe)<br />
H. Hebbel (HSU Hamburg)<br />
Ch. Hennig (Uni London)<br />
K. Jajuga (Wroclaw Univ. of Economics)<br />
H.-P. Klenk (DSMZ Braunschweig)<br />
B. Lausen (Uni Erlangen-Nürnberg, Co-Chair)<br />
H. Locarek-Junge (TU Dresden)<br />
F. Murtagh (Uni London)<br />
A. Okada (Uni Tokyo)<br />
L. Schmidt-Thieme (Uni Hildesheim)<br />
W. Seidel (HSU Hamburg)<br />
D. Steuer (HSU Hamburg)<br />
U. Tüshaus (HSU Hamburg)<br />
A. Ultsch (Uni Marburg, Co-Chair)<br />
M. van de Velden (Uni Rotterdam)<br />
D. van den Poel (Uni Ghent)<br />
I. van Mechelen (Uni Leuven)<br />
R. Wehrens (Uni Nijmegen)<br />
C. Weihs (Uni Dortmund)<br />
Organizing Committee<br />
Claudia Fantapié Altobelli<br />
Andreas Fink<br />
Hartmut Hebbel<br />
Wilfried Seidel<br />
Detlef Steuer<br />
Ulrich Tüshaus<br />
Plenary and Semi-plenary Lectures<br />
Wednesday, July 16, 10:00–10:45:<br />
Walter Radermacher, President Federal Statistical Office of<br />
Germany, Wiesbaden, Germany<br />
“Statistical Processes Under Change – Enhancing Data Quality<br />
with Pretests” (Room 5)<br />
Wednesday, July 16, 11:00–11:40:<br />
Geoffrey John McLachlan, University of Queensland, Brisbane, Australia<br />
“Clustering of High-Dimensional Data Via Finite Mixture Models”<br />
(Room 5)<br />
Fred Hamprecht, University Heidelberg, Germany<br />
“Segmentation of Neural Tissue” (Room 3)<br />
Wednesday, July 16, 14:00–14:45:<br />
Bernhard Schölkopf, Max-Planck-Institute, Tübingen, Germany<br />
“Machine Learning Applications of Positive Definite Kernels”<br />
(Room 5)<br />
Thursday, July 17, 09:00–09:40:<br />
Patrick Groenen, University Rotterdam, The Netherlands<br />
“Support Vector Machines in the Primal using Majorization and<br />
Kernels” (Room 5)<br />
Gilles Bisson, La Tronche, France<br />
“Clustering of Molecules and Structured Data” (Room 3)<br />
Thursday, July 17, 15:55–16:35:<br />
Sabine Krolak-Schwerdt, University Wuppertal, Germany<br />
“Strategies of Model Construction for the Analysis of Judgment<br />
Data” (Room 3)<br />
Gilles Celeux, INRIA, France<br />
“Choosing the Number of Clusters in the Latent Class Model”<br />
(Room 5)<br />
Friday, July 18, 09:00–09:40:<br />
Francesco Palumbo, University of Macerata, Italy<br />
“Clustering and Dimensionality Reduction to Discover Interesting<br />
Patterns in Binary Data” (Room 5)<br />
Raimund Wildner, GfK, Nürnberg, Germany<br />
“Management and Methods: How to do Market Segmentation<br />
Projects” (Room 3)<br />
Friday, July 18, 11:20–12:00:<br />
Adi Ben-Israel, Rutgers University, USA<br />
“Probabilistic Distance Clustering” (Room 5)<br />
Tadashi Imaizumi, Tama University, Tokyo, Japan<br />
“Dimensionality Reduction of Similarity Matrix” (Room 3)<br />
Friday, July 18, 13:15–14:00:<br />
Fred R. McMorris, Illinois Institute of Technology, USA<br />
“Majority-rule Consensus: From Preferences (Social Choice) to<br />
Trees (Biology and Classification Theory)” (Room 5)<br />
Invited Sessions<br />
Wednesday, July 16, 14:50–16:05 (Room 2)<br />
VOC<br />
(Chairs: van de Velden, Wehrens)<br />
Wednesday, July 16, 16:50–18:30 (Room 2)<br />
PLS Path Modeling<br />
(Chair: Esposito Vinzi)<br />
Thursday, July 17, 09:45–11:00 (Room 2)<br />
Microarrays in Clinical Research<br />
(Chairs: Lausen, Ultsch)<br />
Thursday, July 17, 14:00–15:40 (Room 2)<br />
BCS<br />
(Chairs: Hennig, Murtagh)<br />
Detailed Schedule<br />
Tuesday July 15, 2008<br />
13:30-17:30 Pre-conference Workshop, Room 4<br />
Lenz, Hans-J.: Data Quality: Defining, Measuring and Improving<br />
20:00 Informal Get Together (Hotel Baseler Hof, Esplanade 11)<br />
Wednesday July 16, 2008<br />
09:00-10:00 Opening Ceremony (Chair: Seidel), Room 5<br />
09:00 Welcome: Claus Weihs (President of GfKl); Wilfried Seidel (Local Organizers)<br />
09:05 Welcome: Herlind Gundelach (Senator for Science and Research, State Hamburg)<br />
09:15 Welcome: Hans Christoph Zeidler (President of Helmut-Schmidt-University Hamburg)<br />
09:30 GfKl Best Paper Award 2007, Presentation and Laudation: Claus Weihs (President of GfKl); N.N.<br />
09:50 Program Overview: Berthold Lausen, Alfred Ultsch (Co-Chairs Program Committee)<br />
10:00-10:45 Plenary Lecture (Chair: Seidel), Room 5<br />
Radermacher, Walter: Statistical processes under change − Enhancing data quality with pretests (p. 118)<br />
10:00-18:00 Workshop: Libraries (see separate schedule), Room 403<br />
10:45-11:00 Coffee<br />
11:00-11:40 Semi-plenary Lectures<br />
McLachlan, Geoffrey John: Clustering of High-Dimensional Data Via Finite Mixture Models (Chair: McMorris), Room 5 (p. 93)<br />
Hamprecht, Fred A. et al.: Segmentation of Neural Tissue (Chair: Wehrens), Room 3 (p. 5)<br />
Session Mixture Analysis I: Testing (Chair: Seidel), Room 3<br />
11:45-12:10 Gassiat, Elisabeth: Likelihood ratio test for general mixture models (p. 49)<br />
12:10-12:35 Holzmann, Hajo; Dannemann, Jörn: Likelihood ratio testing for hidden Markov models (p. 68)<br />
12:35-13:00 Pommeret, Denys: Testing distribution in errors in variables models (p. 113)<br />
Session Pattern Recognition and Machine Learning I (Chair: Groenen), Room 405/6<br />
11:45-12:10 Haasdonk, Bernard; Pekalska, Elzbieta: Classification with Regularized Kernel Mahalanobis-Distances (p. 57)<br />
12:10-12:35 Louw, Nelmarie; Lamont, Morne; Steel, Sarel: Identifying Atypical Cases in Kernel Fisher Discriminant Analysis by using the Smallest Enclosing Hypersphere (p. 87)<br />
12:35-13:00 Trzesiok, Michal: Relevant Importance of Predictor Variables in Support Vector Machines Models (p. 152)<br />
Session Collective Intelligence (Chair: Geyer-Schulz), Room 101/3<br />
11:45-12:10 Geyer-Schulz, Andreas; Hoser, Bettina: The Potential of Social Intelligence for Collective Intelligence (p. 51)<br />
12:10-12:35 Mylonas, Phivos; Solachidis, Vassilios; Geyer-Schulz, Andreas; Hoser, Bettina; Chapman, Sam; Ciravegna, Fabio; Staab, Steffen; Smrz, Pavel; Kompatsiaris, Yiannis; Avrithis, Yannis: Efficient Media Exploitation towards Collective Intelligence (p. 100)<br />
12:35-13:00 Solachidis, Vassilios; Mylonas, Phivos; Geyer-Schulz, Andreas; Hoser, Bettina; Chapman, Sam; Ciravegna, Fabio; Staab, Steffen; Contopoulos, Costis; Gkika, Ioanna; Smrz, Pavel; Kompatsiaris, Yiannis; Avrithis, Yannis: Generating Collective Intelligence (p. 137)<br />
Session Evaluation of Clustering Algorithms and Data Structures (Chair: Leisch), Room 2<br />
11:45-12:10 Kaiser, Sebastian; Leisch, Friedrich: Benchmarking Bicluster Algorithms (p. 75)<br />
12:10-12:35 Bade, Korinna; Benz, Dominik: Evaluation Strategies for Learning Algorithms of Hierarchical Structures (p. 8)<br />
12:35-13:00 Grün, Bettina; Leisch, Friedrich: Model diagnostics of finite mixtures using bootstrapping (p. 56)<br />
Session Genome and DNA Analysis (Chair: Klenk), Room 6<br />
11:45-12:10 Klenk, Hans-Peter: Polyphasic genomic approach for the taxonomy of archaea and bacteria (p. 78)<br />
12:10-12:35 Huson, Daniel H.; Rupp, Regula: Using Cluster Networks to Represent Non-Compatible Sets of Clusters (p. 71)<br />
12:35-13:00 Hütt, Marc-Thorsten: Genome phylogeny based on short-range correlations in DNA sequences (p. 72)<br />
Session Market Risk and Credit Risk (Chair: Locarek-Junge), Room 4<br />
11:45-12:10 Bravo, Cristian; Maldonado, Sebastian; Weber, Richard: Practical experiences from Credit Scoring projects for Chilean financial organizations (p. 21)<br />
12:10-12:35 Kuziak, Katarzyna: An application of copula functions to market risk management (p. 83)<br />
12:35-13:00 Rokita, Pawel; Piontek, Krzysztof: Extreme unconditional dependence vs. multivariate GARCH effect in the analysis of dependence between high losses on Polish and German stock indexes (p. 121)<br />
13:00-14:00 Lunch (and Meetings)<br />
14:00-14:45 Plenary Lecture (Chair: Lausen), Room 5<br />
Schölkopf, Bernhard: Machine Learning applications of positive definite kernels (p. 132)<br />
Session Mixture Analysis II: Clustering and Classification (Chair: Montanari), Room 3<br />
14:50-15:15 Pons, Odile: Classification with an increasing number of components (p. 114)<br />
15:15-15:40 Lukociene, Olga; Vermunt, Jeroen K.: Determining the number of components in mixture models for hierarchical data (p. 90)<br />
15:40-16:05 Calò, Daniela G.; Viroli, Cinzia: Visualizing data in Gaussian mixture model classification (p. 24)<br />
16:05-16:30 Latouche, Pierre J.; Ambroise, Christophe; Birmelé, Etienne: Bayesian Methods for Graph Clustering (p. 85)<br />
Session Pattern Recognition and Machine Learning II (Chair: Nalbantov), Room 405/6<br />
14:50-15:15 Stecking, Ralf; Schebesch, Klaus B.: Generating Fictitious Training Data for Credit Client Classification (p. 140)<br />
15:15-15:40 Hüllermeier, Eyke; Vanderlooy, Stijn: Combining Predictions in Pairwise Classification: An Adaptive Voting Strategy and Its Relation to Weighted Voting (p. 70)<br />
15:40-16:05 Hühn, Jens; Hüllermeier, Eyke: Rule-Based Learning of Reliable Classifiers (p. 69)<br />
Session Invited Session: VOC (Chairs: van de Velden, Wehrens), Room 2<br />
14:50-15:15 Timmerman, Marieke E.; Lichtwarck-Aschoff, Anna; Ceulemans, Eva: Multilevel Simultaneous Component Analysis for Studying Inter-individual and Intra-individual Variabilities (p. 149)<br />
15:15-15:40 van der Heijden, Peter G.M.: Estimating the prevalence of rule transgression (p. 158)<br />
15:40-16:05 van der Ark, Andries L.; Straat, J. Hendrik: Selection of items for tests and questionnaires using Mokken scale analysis (p. 157)<br />
Session Ensemble Methods and Other Subjects (Chair: Boulesteix), Room 6<br />
14:50-15:15 Häberle, Lothar: On classification of species of representation rings (p. 58)<br />
15:15-15:40 Strobl, Carolin; Zeileis, Achim: A New, Conditional Variable Importance Measure for Random Forests (p. 143)<br />
15:40-16:05 Potapov, Sergej; Lausen, Berthold: Bagging with different split criteria (p. 115)<br />
16:05-16:30 Adler, Werner; Brenning, Alexander; Lausen, Berthold: Classification of Paired Data Using Ensemble Methods (p. 4)<br />
Session Market Risk and Credit Risk (Chair: Locarek-Junge), Room 4<br />
14:50-15:15 Piontek, Krzysztof: The Analysis of the power for some chosen VaR backtesting procedures - simulation approach (p. 112)<br />
15:15-15:40 Koralun-Bereznicka, Julia: Multivariate comparative analysis of stock exchanges - the European perspective (p. 81)<br />
15:40-16:05 Dias, José G.; Vermunt, Jeroen K.; Ramos, Sofia: Mixture Hidden Markov Models in Finance Research (p. 33)<br />
16:05-16:30 Sardet, Laure; Patilea, Valentin: Beta-kernel density estimation using mixture-based transformations: an application to claims distribution (p. 125)<br />
16:30-16:50 Coffee<br />
Session Mixture Analysis III: Model Fitting, Estimation and Applications (Chair: McLachlan), Room 3<br />
16:50-17:15 Greselin, Francesca; Ingrassia, Salvatore: A note on constrained EM algorithms for mixtures of elliptical distributions (p. 53)<br />
17:15-17:40 Neykov, Neyko; Filzmoser, Peter; Neytchev, Plamen: Robust fitting of mixtures: The approach based on the Trimmed Likelihood Estimator (p. 103)<br />
17:40-18:05 Garel, Bernard; Boucharel, Julien; Dewitte, Boris; du Penhoat, Yves: Non-Gaussian nature of ENSO signals and climate shifts: implications for regional studies off the western coast of South America (p. 48)<br />
18:05-18:30 Schlattmann, Peter: Comparison of four estimators of the heterogeneity variance for meta-analysis (p. 131)<br />
18:30-18:55 Schiffner, Julia; Weihs, Claus: Localized Classification Using Mixture Models (p. 130)<br />
Session Pattern Recognition and Machine Learning III (Chair: Hüllermeier), Room 405/6<br />
16:50-17:15 Wehrens, Ron: Supervised Self-Organising Maps and More (p. 160)<br />
17:15-17:40 Worm, Katja; Meffert, Beate: Image Based Mail Piece Identification using Unsupervised Learning (p. 166)<br />
17:40-18:05 Barbosa, Rui Pedro; Belo, Orlando: Autonomous Forex Trading Agents (p. 9)<br />
18:05-18:30 Chiou, Hua-Kai; Huang, Yong-Ting; Liu, Gia-Shie: Applying Rough Set Theory to Constructing Knowledge Base for Critical Military Commodity Management (p. 27)<br />
Session Invited Session: PLS Path Modeling (Chair: Esposito Vinzi), Room 2<br />
16:50-17:15 Ringle, Christian: FIMIX-PLS Segmentation of Data for Path Models with Multiple Endogenous LVs (p. 120)<br />
17:15-17:40 Trinchera, L.; Esposito Vinzi, Vincenzo: A Comprehensive Partial Least Squares Approach to Component-Based Structural Equation Modeling (p. 151)<br />
17:40-18:05 Henseler, Jörg: Nonlinear Effects in PLS Path Models: A Comparison of Available Approaches (p. 63)<br />
18:05-18:30 Betzin, Jörg: Categorical Data in PLS Path modeling (p. 14)<br />
Session Microarray Data Analysis (Chair: Benner), Room 6<br />
16:50-17:15 Boulesteix, Anne-Laure; Slawski, Martin: On optimistic bias in reporting microarray-based classification accuracy (p. 20)<br />
17:15-17:40 Slawski, Martin; Boulesteix, Anne-Laure; Daumer, Martin: 'CMA' - Steps in developing a comprehensive R-toolbox for classification with microarray data and other high-dimensional problems (p. 136)<br />
17:40-18:05 Scharl, Theresa; Leisch, Friedrich: Quality-Based Clustering of Functional Data: Applications to Time Course Microarray Data (p. 127)<br />
18:05-18:30 Martin-Magniette, Marie-Laure; Mary-Huard, Tristan; Bérard, Caroline; Robin, Stéphane: ChIPmix: Mixture model of regressions for ChIP-chip experiment analysis (p. 92)<br />
Session Investments and Capital Markets (Chair: Ultsch), Room 4<br />
16:50-17:15 Locarek-Junge, Hermann; Mihm, Max: Fundamental Indexation - testing the concept in the German stock market (p. 86)<br />
17:15-17:40 Ultsch, Alfred: Is log ratio a good value for measuring return in stock investments? (p. 154)<br />
17:40-18:05 Klein, Christian; Kundisch, Dennis: Index-Based Investment Vehicles - A Comparative Study for the German DAX (p. 77)<br />
18:05-18:30 Bessler, Wolfgang; Holler, Julian: Hedge Funds in a Bayesian Asset Allocation Framework: Incorporating Information on market states and manager's ability (p. 13)<br />
19:00 Reception (Building “M1”)<br />
Thursday July 17, 2008<br />
09:00-09:40 Semi-plenary Lectures<br />
Groenen, Patrick J.F. et al.: Support Vector Machines in the Primal using Majorization and Kernels (Chair: Okada), Room 5 (p. 54)<br />
Bisson, Gilles: Clustering of molecules and structured data (Chair: Hennig), Room 3 (p. 16)<br />
09:45-17:00 Workshop: Decimal Classification (see separate schedule), Room 403<br />
Session Clustering and Classification I (Chair: Bock), Room 3<br />
09:45-10:10 Godehardt, Erhard; Jaworski, Jerzy; Rybarczyk, Katarzyna: Isolated vertices in random intersection graphs (p. 52)<br />
10:10-10:35 Rozmus, Dorota: Cluster ensemble based on co-occurrence data (p. 123)<br />
10:35-11:00 Enyukov, Igor: Regression-autoregression based clustering (p. 37)<br />
Session Bayesian, Neural, and Fuzzy Clustering I (Chair: Kruse), Room 405/6<br />
09:45-10:10 Borgelt, Christian: Weighting and Selecting Features in Fuzzy Clustering (p. 18)<br />
10:10-10:35 Neumann, Anneke; Ambrosi, Klaus; Hahne, Felix: Approach for Dynamic Problems in Clustering (p. 102)<br />
10:35-11:00 Winkler, Roland; Rehm, Frank; Kruse, Rudolf: Clustering with Repulsive Prototypes (p. 163)<br />
Session Invited Session: Microarrays in Clinical Research (Chairs: Lausen, Ultsch), Room 2<br />
09:45-10:25 Ultsch, Alfred: Comparison of Algorithms to find differentially expressed Genes in Microarray Data (p. 153)<br />
10:25-11:00 Hielscher, Thomas; Zucknick, Manuela; Werft, Wiebke; Benner, Axel: On the prognostic value of gene expression signatures for censored data (p. 67)<br />
Session Statistical Musicology I (Chair: Weihs), Room 6<br />
09:45-10:10 Eigenfeldt, Arne; Kapur, Ajay: Multimodal Performance Analysis of Electronic Sitar (p. 35)<br />
10:10-10:35 Sommer, Katrin; Weihs, Claus: Analysis of polyphonic musical time series (p. 138)<br />
10:35-11:00 Desmet, Frank Michel; Leman, Marc; Lesaffre, Micheline: Statistical analysis of human body movement and group interactions in response to music (p. 32)<br />
Session Marketing and Management Science I (Chair: van den Poel), Room 4<br />
09:45-10:10 Wagner, Ralf; Sauerwald, Erik: Clustering Consumers with Respect to Their Marketing Reactance Behavior (p. 159)<br />
10:10-10:35 Sagan, Adam; Kowalska-Musial, Magdalena: Dyadic Interactions in Service Encounter - Bayesian SEM Approach (p. 124)<br />
10:35-11:00 Becker, Niels; Werners, Brigitte: Improving Product Line Design with Bundling (p. 10)<br />
11:00-11:20 Coffee<br />
Session Clustering and Classification II (Chair: Vichi), Room 3<br />
11:20-11:45 Nugent, Rebecca; Stuetzle, Werner: Cluster Tree Estimation using a Generalized Single Linkage Method (p. 104)<br />
11:45-12:10 Herrmann, Lutz; Ultsch, Alfred: Strengths and Weaknesses of Ant Colony Clustering (p. 65)<br />
12:15-12:45 Software Presentation: Eichenberg, Thilo (StatSoft): STATISTICA, Room 3<br />
Session Bayesian, Neural, and Fuzzy Clustering II (Chair: Kruse), Room 405/6<br />
11:20-11:45 Gabriel, Thomas R.; Thiel, Kilian; Berthold, Michael R.: Multi-Dimensional Scaling applied to Hierarchical Fuzzy Rule Systems (p. 45)<br />
11:45-12:10 Fritsch, Arno; Ickstadt, Katja: An Improved Criterion for Clustering Based on the Posterior Similarity Matrix (p. 43)<br />
12:10-12:35 Steinbrecher, Matthias; Kruse, Rudolf: Clustering Association Rules with Fuzzy Concepts (p. 141)<br />
Session Text Mining (Chair: Schmidt-Thieme), Room 101/3<br />
11:20-11:45 Karatzoglou, Alexandros; Feinerer, Ingo; Hornik, Kurt: Nonparametric distribution analysis for text mining (p. 76)<br />
11:45-12:10 Schierle, Martin; Trabold, Daniel: Multilingual knowledge based concept recognition in textual data (p. 128)<br />
12:10-12:35 Hermes, Jürgen; Schwiebert, Stephan: Classification of text processing components: The Tesla Role System (p. 64)<br />
12:35-13:00 Thorleuchter, Dirk: Mining ideas from textual information (p. 147)<br />
Session Modelling Exchange from Archaeological Evidence (Chair: Kerig), Room 2<br />
11:20-11:45 Schyle, Daniel: The Late Neolithic flint axe production on the Lousberg (Aachen, Germany) – An extrapolation of supply and demand and population density (p. 134)<br />
11:45-12:10 Dolata, Jens; Mucha, Hans-Joachim; Bartel, Hans-Georg: Mapping Findspots of Roman Military Brickstamps in Mogontiacum (Mainz) and Archaeometrical Analysis (p. 34)<br />
12:10-12:35 Herzog, Irmela: Reconstructing Central Places and Settlements Groups (p. 66)<br />
Session Statistical Musicology II (Chair: Weihs), Room 6<br />
11:20-11:45 Meyer, Florian; Ultsch, Alfred: Finding Music Fads by clustering Online Radio Data with Emergent Self-Organizing Maps (p. 96)<br />
11:45-12:10 Lukashevich, Hanna; Dittmar, Christian; Bastuck, Christoph: Applying Statistical Models and Parametric Distance Measures for Music Similarity Search (p. 89)<br />
12:10-12:35 Fricke, Jobst P.: A statistical theory of musical consonance proved in praxis (p. 42)<br />
Session Marketing and Management Science II (Chair: Decker), Room 4<br />
11:20-11:45 Gazda, Vladimir: On a Location of the Retail Units and Equilibrium Price Determination (p. 50)<br />
11:45-12:10 Zeileis, Achim; Kleiber, Christian: Recursive Partitioning of Economic Regressions: Trees of Costly Journals and Beautiful Professors (p. 168)<br />
12:10-12:35 van de Velden, Michel; de Beuckelaer, Alain; Groenen, Patrick; Busing, Frank: Visualizing preferences using minimum variance nonmetric unfolding (p. 156)<br />
13:00-14:00 Lunch (and Meetings)<br />
Session Linguistics (Chairs: Goebl, Grzybek), Room 3<br />
14:00-14:25 Rapp, Reinhard; Zock, Michael: Automatic Dictionary Expansion Using Non-parallel Corpora (p. 119)<br />
14:25-14:50 Fenk-Oczlon, Gertraud; Fenk, August: Cross-linguistic regularities in the monosyllabic system (p. 40)<br />
14:50-15:15 Rolshoven, Jürgen: Grundzüge einer generativen Korpuslinguistik [Foundations of a Generative Corpus Linguistics] (p. 122)<br />
15:15-15:40 Petersen, Wiebke: Lineare Kodierung multipler Vererbungshierarchien: Wiederbelebung einer antiken Klassifikationsmethode [Linear Coding of Multiple Inheritance Hierarchies: Reviving an Ancient Classification Method] (p. 110)<br />
Session Invited Session: BCS (Chairs: Hennig, Murtagh), Room 2<br />
14:00-14:25 Dean, Nema; Nugent, Rebecca: Augmenting Model-Based Clustering with Generalized Linkage methods (p. 31)<br />
14:25-14:50 Mirkin, Boris: Deviant box and dual clusters for the analysis of conceptual contexts (p. 97)<br />
14:50-15:15 Critchley, Frank; Pires, Ana; Amado, Conceicao: Principal Axis Analysis – with HDLSS bonuses! (p. 30)<br />
15:15-15:40 Hennig, Christian; Hausdorf, Bernhard: Using cluster analysis for species delimitation (p. 62)<br />
Session Processes in Industry (Chair: Joos), Room 6<br />
14:00-14:25 Hahlweg, Cornelius; Rothe, Hendrik: Auswertung hochaufgelöster Streulichtdaten mit Methoden der multivariaten Statistik [Analysis of High-Resolution Scattered-Light Data with Multivariate Statistical Methods] (p. 59)<br />
14:25-14:50 Raabe, Nils; Enk, Dirk; Weihs, Claus; Biermann, Dirk: Dynamic disturbances in BTA deep-hole drilling - Identification of spiralling as a regenerative effect (p. 117)<br />
14:50-15:15 Meier, René; Joos, Franz: Optimization Methods with Evolutionary Algorithms and Artificial Neural Networks (p. 95)<br />
15:15-15:40 Große, Lars; Joos, Franz: Usage of Artificial Neural Networks for Data Handling (p. 55)<br />
Session Marketing and Management Science III (Chair: van den Poel), Room 4<br />
14:00-14:25 Lübke, Karsten; Papenhoff, Heike: Latent growth models for analyzing a multi partner reward program (p. 88)<br />
14:25-14:50 Wilczynski, Petra; Sarstedt, Marko: Multi-Item Versus Single-Item Measures: A Review and Future Research Directions (p. 161)<br />
14:50-15:15 Sommerfeld, Angela: Trust as a Key Determinant of Loyalty and its Moderators (p. 139)<br />
15:15-15:40 Kneib, Thomas; Baumgartner, Bernhard; Steiner, Winfried J.: Time-Varying Parameters in Brand Choice Models (p. 80)<br />
15:40-15:55 Coffee<br />
15:55-16:35 Semi-plenary Lectures<br />
Celeux, Gilles Paul et al.: Choosing the number of clusters in the latent class model (Chair: Bock), Room 5 (p. 15)<br />
Krolak-Schwerdt, Sabine: Strategies of model construction for the analysis of judgement data (Chair: Decker), Room 3 (p. 82)<br />
Session Clustering and Classification III (Chair: Geyer-Schulz), Room 3<br />
16:40-17:05 Müller-Funk, Ulrich; Dlugosz, Stephan: Predictive classification trees (p. 99)<br />
17:05-17:30 Azam, Muhammad; Ostermann, Alexander; Pfeiffer, Karl-Peter: Evaluation Criteria for the Construction of Binary Classification Trees with Two or More Classes (p. 7)<br />
17:30-17:55 Gantner, Zeno; Schmidt-Thieme, Lars: Scalable and Incrementally Updated Hybrid Recommender Systems (p. 47)<br />
Session Optimization in Statistics (Chair: Ritter), Room 405/6<br />
16:40-17:05 Hansohm, Jürgen: Algorithms for Computing the Multivariate Isotonic Regression (p. 60)<br />
17:05-17:30 Schachtner, Reinhard; Pöppel, Gerhard; Lang, Elmar: Nonnegative Matrix Factorization for Binary Data to Extract Elementary Failure Maps from Wafer Test Images (p. 126)<br />
17:30-17:55 Nalbantov, Georgi Ilkov; Groenen, Patrick J.F.; Bioch, Cor: Support Vector Machines in the Dual using Majorization and Kernels (p. 101)<br />
Session Computational Intelligence and Metaheuristics (Chair: Fink), Room 101/3<br />
16:40-17:05 Winkler, Stephan; Affenzeller, Michael; Wagner, Stefan; Kronberger, Gabriel: On the Effects of Enhanced Selection Models on Quality and Comparability of Classifiers Produced by Genetic Programming (p. 164)<br />
17:05-17:30 Caserta, Marco; Lessmann, Stefan: A novel approach to construct discrete support vector machine classifiers (p. 25)<br />
17:30-17:55 Thorleuchter, Dirk: Mining technologies in security and defense (p. 148)<br />
Session Miscellaneous Models (Archeology) (Chair: Posluschny), Room 2<br />
16:40-17:05 Okada, Akinori; Sakaehara, Towao: Analysis of Borrowing and Guaranteeing Relationships among Government Officials at the Eighth Century in the Old Capital of Japan (p. 106)<br />
17:05-17:30 Gans, Ulrich-Walter; Lang, Matthias: ArcheoInf - Leistungszentrum für die digitale Unterstützung feldarchäologischer Projekte [ArcheoInf - A Competence Center for the Digital Support of Field-Archaeology Projects] (p. 46)<br />
Session Education and Psychology (Chair: Krolak-Schwerdt), Room 6<br />
16:40-17:05 Fuchs, Sebastian; Sarstedt, Marko: On the Use of Student Samples in Major Marketing Research Journals. A Meta-Study (p. 44)<br />
17:05-17:30 Strobl, Carolin; Leisch, Friedrich: Who's Afraid of Statistics? - Measurement and Predictors of Statistics Anxiety in German University Students (p. 142)<br />
17:30-17:55 Ünlü, Ali: Mosaic Plots and Knowledge Structures (p. 155)<br />
Session Marketing and Management Science IV (Chair: Decker), Room 4<br />
17:05-17:30 Lam, Kar Yin; Koning, Alex J.; Franses, Philip Hans: Testing preference rankings (p. 84)<br />
17:30-17:55 Wagner, Ralf; Klaus, Martin: Exploring the Interaction Structure of Weblogs (p. 91)<br />
18:00-19:00 General Assembly of <strong>GfKl</strong>, Room 5<br />
20:00 Conference Dinner (Handwerkskammer, Holstenwall 12, bus transfer 19:30)<br />

Friday July 18, <strong>2008</strong><br />

09:00-09:40 Semi-plenary Lectures<br />
Palumbo, Francesco: Clustering and Dimensionality Reduction to Discover Interesting Patterns in Binary Data (Chair: Ultsch), Room 5 (p. 109)<br />
Wildner, Raimund: Management and Methods: How to do Market Segmentation Projects (Chair: Fantapié Altobelli), Room 3 (p. 162)<br />

Coffee<br />

Session Clustering and Classification IV (Chair: Groenen), Room 3<br />
10:00-10:25 Buza, Krisztian Antal; Schmidt-Thieme, Lars: Motif-based Classification of Time Series with Bayesian Networks and SVMs (p. 23)<br />
10:25-10:50 Tomas, Amber: Issues related to the implementation of a dynamic logistic model for classifier combination (p. 150)<br />
10:50-11:15 Oosthuizen, Surette; Steel, Sarel J.: Variable selection for kernel classifiers: a feature-to-input space approach (p. 107)<br />

Session Visualization and Scaling Methods I (Chair: Hennig), Room 405/6<br />
10:00-10:25 Mucha, Hans-Joachim: Clustering a Contingency Table Accompanied by Visualization (p. 98)<br />
10:25-10:50 Bocci, Laura; Vichi, Maurizio: The K-INDSCAL Model for Heterogeneous Three-way Dissimilarity Data (p. 17)<br />
10:50-11:15 Cortina-Borja, Mario: Extending Multivariate Planing (p. 29)<br />
− xxiv −
Session Exploratory Data Analysis I (Chair: Wehrens), Room 101/3<br />
10:00-10:25 Chiou, Hua-Kai; Yuan, Benjamin J.C.; Wang, Yen-Wen: Correspondence Analysis for Exploring the Implementation of One Village One Product Programs in Taiwan (p. 28)<br />
10:25-10:50 Cernian, Alexandra; Carstoiu, Dorin; Ionescu, Tudor: Modeling the Classification of Heterogeneous Data (p. 26)<br />
10:50-11:15 Einbeck, Jochen; Evers, Ludger: Data compression and regression based on local principal curves (p. 36)<br />

Session Spatial Planning I (Chair: Behnisch), Room 2<br />
10:00-10:25 Behnisch, Martin; Ultsch, Alfred: Estimating the number of buildings in Germany (p. 11)<br />
10:25-10:50 Thiel, Klaus: Optimal VDSL Expansion taking into Consideration of Infrastructure Restrictions and Marketing Requirements (p. 145)<br />
10:50-11:15 Aden, Christian; Mucha, Hans-Joachim; Schmidt, Gunther; Schröder, Winfried: WaldIS - a web based reference system for the forest monitoring in North Rhine-Westphalia (p. 3)<br />

Session Medical and Health Sciences I (Chair: Lausen), Room 6<br />
10:00-10:25 Augustin, Thomas; Wallner, Matthias: On the power of corrected score functions to adjust for measurement error (p. 6)<br />
10:25-10:50 Sieben, Wiebke: Time Related Features for Alarm Classification in Intensive Care Monitoring (p. 135)<br />
10:50-11:15 Ostermann, Thomas; Schuster, Reinhard; Erben, Christoph: Classifying hospitals with respect to their diagnostic diversity using Shannon's entropy (p. 108)<br />

Session Market Research, Controlling, OR I (Chair: Baier), Room 4<br />
10:00-10:25 Brusch, Michael; Baier, Daniel: Analyzing the Stability of Price Response Functions - Measuring the Influence of Different Parameters in a Monte Carlo Comparison (p. 22)<br />
10:25-10:50 Tarka, Piotr: Conjoint Analysis within the field of customer satisfaction problems – a model of composite product/service (p. 144)<br />
− xxv −
10:50-11:15 Punzo, Antonio: Considerations on the impact of JML-ill-conditioned configurations in the CML approach (p. 116)<br />

11:20-12:00 Semi-plenary Lectures<br />
Ben-Israel, Adi: Probabilistic Distance Clustering (Chair: Vichi), Room 5 (p. 12)<br />
Imaizumi, Tadashi: Dimensionality Reduction of Similarity Matrix (Chair: Gaul), Room 3 (p. 73)<br />

12:00-12:20 Coffee<br />

Session Clustering and Classification V (Chair: Godehardt), Room 3<br />
12:20-12:45 Kludas, Jana; Bruno, Eric; Marchand-Maillet, Stéphane: Exploiting synergetic and redundant features for multimedia document classification (p. 79)<br />
12:45-13:30 Schiffner, Julia; Szepannek, Gero; Monthé, Thierry; Weihs, Claus: Localized Logistic Regression for Discrete Influential Factors (p. 129)<br />

Session Visualization and Scaling Methods II (Chair: van de Velden), Room 405/6<br />
12:20-12:45 Adachi, Kohei: Joint Procrustes Analysis with Constrained Simplimax Rotation: Nonsingular Transformation of Component Score and Loading Matrices Toward Simple Structure (p. 2)<br />
12:45-13:30 Fernández-Aguirre, Karmele; Garín-Martín, María Araceli: Validity of images from binary coding tables. Student motivation surveys: some evidence (p. 41)<br />

Session Exploratory Data Analysis II (Chair: Wehrens), Room 101/3<br />
12:20-12:45 Zarraga, Amaya; Goitisolo, Beatriz: Factor Analysis of Incomplete Disjunctive Tables (p. 167)<br />
12:45-13:30 Nusser, Sebastian; Otte, Clemens; Hauptmann, Werner: Multi-Class Extension of Verifiable Ensemble Models for Safety-Related Applications (p. 105)<br />
− xxvi −
Session Spatial Planning II (Chair: Behnisch), Room 2<br />
12:20-12:45 Witek, Ewa: Analysis of massive emigration from Poland - the model-based clustering approach (p. 165)<br />
12:45-13:30 Thinh, Nguyen Xuan; Küttner, Leander; Meinel, Gotthard: Evaluate the data structure and identify homogenous spatial units in the data base "Sustainability issues in sensitive areas" of the EU-FP6 Integrated Project SENSOR (p. 146)<br />

Session Medical and Health Sciences II (Chair: Lausen), Room 6<br />
12:20-12:45 Henker, Uwe; Ultsch, Alfred; Petersohn, Uwe: Die präzise und effiziente Erkennung von medizinischen Anforderungsformularen (p. 61)<br />
12:45-13:30 Schuster, Reinhard; von Arnstedt, Eva: Age Distributions for costs in drug prescription by practitioners and for DRG-based hospital treatment (p. 133)<br />

Session Market Research, Controlling, OR II (Chair: Baier), Room 4<br />
12:20-12:45 Esber, Said; Baier, Daniel: Realoptionen bei der Bewertung von neuen Produkten (p. 38)<br />
12:45-13:30 Abu Assab, Samah; Baier, Daniel: Designing Products Using Quality Function Deployment and Conjoint Analysis: A Comparison in a Market for Elderly People (p. 1)<br />

13:15-14:00 Plenary Lecture (Chair: Weihs), Room 5<br />
McMorris, Fred R.: Majority-rule consensus: from preferences (social choice) to trees (biology and classification theory) (p. 94)<br />

14:00-15:00 Informal Farewell (Conference site)<br />
− xxvii −
List of Contributions<br />
Authors: Title (Page)<br />
Abu Assab, Samah; Baier, Daniel: Designing Products Using Quality Function Deployment and Conjoint Analysis: A Comparison in a Market for Elderly People (p. 1)<br />
Adachi, Kohei: Joint Procrustes Analysis with Constrained Simplimax Rotation: Nonsingular Transformation of Component Score and Loading Matrices Toward Simple Structure (p. 2)<br />
Aden, Christian; Mucha, Hans-Joachim; Schmidt, Gunther; Schröder, Winfried: WaldIS - a web based reference system for the forest monitoring in North Rhine-Westphalia (p. 3)<br />
Adler, Werner; Brenning, Alexander; Lausen, Berthold: Classification of Paired Data Using Ensemble Methods (p. 4)<br />
Andres, Bjoern; Koethe, Ullrich; Helmstaedter, Moritz; Denk, Winfried; Hamprecht, Fred: Segmentation of Neural Tissue (p. 5)<br />
Augustin, Thomas; Wallner, Matthias: On the power of corrected score functions to adjust for measurement error (p. 6)<br />
Azam, Muhammad; Ostermann, Alexander; Pfeiffer, Karl-Peter: Evaluation Criteria for the Construction of Binary Classification Trees with Two or More Classes (p. 7)<br />
Bade, Korinna; Benz, Dominik: Evaluation Strategies for Learning Algorithms of Hierarchical Structures (p. 8)<br />
Barbosa, Rui Pedro; Belo, Orlando: Autonomous Forex Trading Agents (p. 9)<br />
Becker, Niels; Werners, Brigitte: Improving Product Line Design with Bundling (p. 10)<br />
Behnisch, Martin; Ultsch, Alfred: Estimating the number of buildings in Germany (p. 11)<br />
Ben-Israel, Adi: Probabilistic Distance Clustering (p. 12)<br />
Bessler, Wolfgang; Holler, Julian: Hedge Funds in a Bayesian Asset Allocation Framework: Incorporating Information on market states and manager's ability (p. 13)<br />
Betzin, Jörg: Categorical Data in PLS Path modeling (p. 14)<br />
Biernacki, Christophe; Celeux, Gilles Paul; Govaert, Gérard: Choosing the number of clusters in the latent class model (p. 15)<br />
Bisson, Gilles: Clustering of molecules and structured data (p. 16)<br />
Bocci, Laura; Vichi, Maurizio: The K-INDSCAL Model for Heterogeneous Three-way Dissimilarity Data (p. 17)<br />
Borgelt, Christian: Weighting and Selecting Features in Fuzzy Clustering (p. 18)<br />
Boulesteix, Anne-Laure; Slawski, Martin: On optimistic bias in reporting microarray-based classification accuracy (p. 20)<br />
− xxviii −
Bravo, Cristian; Maldonado, Sebastian; Weber, Richard: Practical experiences from Credit Scoring projects for Chilean financial organizations (p. 21)<br />
Brusch, Michael; Baier, Daniel: Analyzing the Stability of Price Response Functions - Measuring the Influence of Different Parameters in a Monte Carlo Comparison (p. 22)<br />
Buza, Krisztian Antal; Schmidt-Thieme, Lars: Motif-based Classification of Time Series with Bayesian Networks and SVMs (p. 23)<br />
Calò, Daniela G.; Viroli, Cinzia: Visualizing data in Gaussian mixture model classification (p. 24)<br />
Caserta, Marco; Lessmann, Stefan: A novel approach to construct discrete support vector machine classifiers (p. 25)<br />
Cernian, Alexandra; Carstoiu, Dorin; Ionescu, Tudor: Modeling the Classification of Heterogeneous Data (p. 26)<br />
Chiou, Hua-Kai; Huang, Yong-Ting; Liu, Gia-Shie: Applying Rough Set Theory to Constructing Knowledge Base for Critical Military Commodity Management (p. 27)<br />
Chiou, Hua-Kai; Yuan, Benjamin J.C.; Wang, Yen-Wen: Correspondence Analysis for Exploring the Implementation of One Village One Product Programs in Taiwan (p. 28)<br />
Cortina-Borja, Mario: Extending Multivariate Planing (p. 29)<br />
Critchley, Frank; Pires, Ana; Amado, Conceicao: Principal Axis Analysis – with HDLSS bonuses! (p. 30)<br />
Dean, Nema; Nugent, Rebecca: Augmenting Model-Based Clustering with Generalized Linkage methods (p. 31)<br />
Desmet, Frank Michel; Leman, Marc; Lesaffre, Micheline: Statistical analysis of human body movement and group interactions in response to music (p. 32)<br />
Dias, José G.; Vermunt, Jeroen K.; Ramos, Sofia: Mixture Hidden Markov Models in Finance Research (p. 33)<br />
Dolata, Jens; Mucha, Hans-Joachim; Bartel, Hans-Georg: Mapping Findspots of Roman Military Brickstamps in Mogontiacum (Mainz) and Archaeometrical Analysis (p. 34)<br />
Eigenfeldt, Arne; Kapur, Ajay: Multimodal Performance Analysis of Electronic Sitar (p. 35)<br />
Einbeck, Jochen; Evers, Ludger: Data compression and regression based on local principal curves (p. 36)<br />
Enyukov, Igor: Regression-autoregression based clustering (p. 37)<br />
Esber, Said; Baier, Daniel: Realoptionen bei der Bewertung von neuen Produkten (p. 38)<br />
Fenk-Oczlon, Gertraud; Fenk, August: Cross-linguistic regularities in the monosyllabic system (p. 40)<br />
Fernández-Aguirre, Karmele; Garín-Martín, María Araceli: Validity of images from binary coding tables. Student motivation surveys: some evidence (p. 41)<br />
Fricke, Jobst P.: A statistical theory of musical consonance proved in praxis (p. 42)<br />
− xxix −
Fritsch, Arno; Ickstadt, Katja: An Improved Criterion for Clustering Based on the Posterior Similarity Matrix (p. 43)<br />
Fuchs, Sebastian; Sarstedt, Marko: On the Use of Student Samples in Major Marketing Research Journals. A Meta-Study (p. 44)<br />
Gabriel, Thomas R.; Thiel, Kilian; Berthold, Michael R.: Multi-Dimensional Scaling applied to Hierarchical Fuzzy Rule Systems (p. 45)<br />
Gans, Ulrich-Walter; Lang, Matthias: ArcheoInf - Leistungszentrum für die digitale Unterstützung feldarchäologischer Projekte (p. 46)<br />
Gantner, Zeno; Schmidt-Thieme, Lars: Scalable and Incrementally Updated Hybrid Recommender Systems (p. 47)<br />
Garel, Bernard; Boucharel, Julien; Dewitte, Boris; du Penhoat, Yves: Non-Gaussian nature of ENSO signals and climate shifts: implications for regional studies off the western coast of South America (p. 48)<br />
Gassiat, Elisabeth: Likelihood ratio test for general mixture models (p. 49)<br />
Gazda, Vladimir: On a Location of the Retail Units and Equilibrium Price Determination (p. 50)<br />
Geyer-Schulz, Andreas; Hoser, Bettina: The Potential of Social Intelligence for Collective Intelligence (p. 51)<br />
Godehardt, Erhard; Jaworski, Jerzy; Rybarczyk, Katarzyna: Isolated vertices in random intersection graphs (p. 52)<br />
Greselin, Francesca; Ingrassia, Salvatore: A note on constrained EM algorithms for mixtures of elliptical distributions (p. 53)<br />
Groenen, Patrick J.F.; Nalbantov, Georgi; Bioch, Cor: Support Vector Machines in the Primal using Majorization and Kernels (p. 54)<br />
Große, Lars; Joos, Franz: Usage of Artificial Neural Networks for Data Handling (p. 55)<br />
Grün, Bettina; Leisch, Friedrich: Model diagnostics of finite mixtures using bootstrapping (p. 56)<br />
Haasdonk, Bernard; Pekalska, Elzbieta: Classification with Regularized Kernel Mahalanobis-Distances (p. 57)<br />
Häberle, Lothar: On classification of species of representation rings (p. 58)<br />
Hahlweg, Cornelius; Rothe, Hendrik: Auswertung hochaufgelöster Streulichtdaten mit Methoden der multivariaten Statistik (p. 59)<br />
Hansohm, Jürgen: Algorithms for Computing the Multivariate Isotonic Regression (p. 60)<br />
Henker, Uwe; Ultsch, Alfred; Petersohn, Uwe: Die präzise und effiziente Erkennung von medizinischen Anforderungsformularen (p. 61)<br />
Hennig, Christian; Hausdorf, Bernhard: Using cluster analysis for species delimitation (p. 62)<br />
Henseler, Jörg: Nonlinear Effects in PLS Path Models: A Comparison of Available Approaches (p. 63)<br />
Hermes, Jürgen; Schwiebert, Stephan: Classification of text processing components: The Tesla Role System (p. 64)<br />
− xxx −
Herrmann, Lutz; Ultsch, Alfred: Strengths and Weaknesses of Ant Colony Clustering (p. 65)<br />
Herzog, Irmela: Reconstructing Central Places and Settlements Groups (p. 66)<br />
Hielscher, Thomas; Zucknick, Manuela; Werft, Wiebke; Benner, Axel: On the prognostic value of gene expression signatures for censored data (p. 67)<br />
Holzmann, Hajo; Dannemann, Jörn: Likelihood ratio testing for hidden Markov models (p. 68)<br />
Hühn, Jens; Hüllermeier, Eyke: Rule-Based Learning of Reliable Classifiers (p. 69)<br />
Huellermeier, Eyke; Vanderlooy, Stijn: Combining Predictions in Pairwise Classification: An Adaptive Voting Strategy and Its Relation to Weighted Voting (p. 70)<br />
Huson, Daniel H.; Rupp, Regula: Using Cluster Networks to Represent Non-Compatible Sets of Clusters (p. 71)<br />
Hütt, Marc-Thorsten: Genome phylogeny based on short-range correlations in DNA sequences (p. 72)<br />
Imaizumi, Tadashi: Dimensionality Reduction of Similarity Matrix (p. 73)<br />
Kaiser, Sebastian; Leisch, Friedrich: Benchmarking Bicluster Algorithms (p. 75)<br />
Karatzoglou, Alexandros; Feinerer, Ingo; Hornik, Kurt: Nonparametric distribution analysis for text mining (p. 76)<br />
Klein, Christian; Kundisch, Dennis: Index-Based Investment Vehicles - A Comparative Study for the German DAX (p. 77)<br />
Klenk, Hans-Peter: Polyphasic genomic approach for the taxonomy of archaea and bacteria (p. 78)<br />
Kludas, Jana; Bruno, Eric; Marchand-Maillet, Stéphane: Exploiting synergetic and redundant features for multimedia document classification (p. 79)<br />
Kneib, Thomas; Baumgartner, Bernhard; Steiner, Winfried J.: Time-Varying Parameters in Brand Choice Models (p. 80)<br />
Koralun-Bereznicka, Julia: Multivariate comparative analysis of stock exchanges - the European perspective (p. 81)<br />
Krolak-Schwerdt, Sabine: Strategies of model construction for the analysis of judgement data (p. 82)<br />
Kuziak, Katarzyna: An application of copula functions to market risk management (p. 83)<br />
Lam, Kar Yin; Koning, Alex J.; Franses, Philip Hans: Testing preference rankings (p. 84)<br />
Latouche, Pierre J.; Ambroise, Christophe; Birmelé, Etienne: Bayesian Methods for Graph Clustering (p. 85)<br />
Locarek-Junge, Hermann; Mihm, Max: Fundamental Indexation - testing the concept in the German stock market (p. 86)<br />
Louw, Nelmarie; Lamont, Morne; Steel, Sarel: Identifying Atypical Cases in Kernel Fisher Discriminant Analysis by using the Smallest Enclosing Hypersphere (p. 87)<br />
− xxxi −
Lübke, Karsten; Papenhoff, Heike: Latent growth models for analyzing a multi partner reward program (p. 88)<br />
Lukashevich, Hanna; Dittmar, Christian; Bastuck, Christoph: Applying Statistical Models and Parametric Distance Measures for Music Similarity Search (p. 89)<br />
Lukociene, Olga; Vermunt, Jeroen K.: Determining the number of components in mixture models for hierarchical data (p. 90)<br />
Klaus, Martin; Wagner, Ralf: Exploring the Interaction Structure of Weblogs (p. 91)<br />
Martin-Magniette, Marie-Laure; Mary-Huard, Tristan; Bérard, Caroline; Robin, Stéphane: ChIPmix: Mixture model of regressions for ChIP-chip experiment analysis (p. 92)<br />
McLachlan, Geoffrey John: Clustering of High-Dimensional Data Via Finite Mixture Models (p. 93)<br />
McMorris, F. R.: Majority-rule consensus: from preferences (social choice) to trees (biology and classification theory) (p. 94)<br />
Meier, René; Joos, Franz: Optimization Methods with Evolutionary Algorithms and Artificial Neural Networks (p. 95)<br />
Meyer, Florian; Ultsch, Alfred: Finding Music Fads by clustering Online Radio Data with Emergent Self-Organizing Maps (p. 96)<br />
Mirkin, Boris: Deviant box and dual clusters for the analysis of conceptual contexts (p. 97)<br />
Mucha, Hans-Joachim: Clustering a Contingency Table Accompanied by Visualization (p. 98)<br />
Müller-Funk, Ulrich; Dlugosz, Stephan: Predictive classification trees (p. 99)<br />
Mylonas, Phivos; Solachidis, Vassilios; Geyer-Schulz, Andreas; Hoser, Bettina; Chapman, Sam; Ciravegna, Fabio; Staab, Steffen; Smrz, Pavel; Kompatsiaris, Yiannis; Avrithis, Yannis: Efficient Media Exploitation towards Collective Intelligence (p. 100)<br />
Nalbantov, Georgi Ilkov; Groenen, Patrick J.F.; Bioch, Cor: Support Vector Machines in the Dual using Majorization and Kernels (p. 101)<br />
Neumann, Anneke; Ambrosi, Klaus; Hahne, Felix: Approach for Dynamic Problems in Clustering (p. 102)<br />
Neykov, Neyko; Filzmoser, Peter; Neytchev, Plamen: Robust fitting of mixtures: The approach based on the Trimmed Likelihood Estimator (p. 103)<br />
Nugent, Rebecca; Stuetzle, Werner: Cluster Tree Estimation using a Generalized Single Linkage Method (p. 104)<br />
Nusser, Sebastian; Otte, Clemens; Hauptmann, Werner: Multi-Class Extension of Verifiable Ensemble Models for Safety-Related Applications (p. 105)<br />
Okada, Akinori; Sakaehara, Towao: Analysis of Borrowing and Guaranteeing Relationships among Government Officials at the Eighth Century in the Old Capital of Japan (p. 106)<br />
− xxxii −<br />
Oosthuizen, Surette; Steel, Sarel J.: Variable selection for kernel classifiers: a feature-to-input space approach (p. 107)<br />
Ostermann, Thomas; Schuster, Reinhard; Erben, Christoph: Classifying hospitals with respect to their diagnostic diversity using Shannon's entropy (p. 108)<br />
Palumbo, Francesco: Clustering and Dimensionality Reduction to Discover Interesting Patterns in Binary Data (p. 109)<br />
Petersen, Wiebke: Lineare Kodierung multipler Vererbungshierarchien: Wiederbelebung einer antiken Klassifikationsmethode (p. 110)<br />
Petersen, Wiebke; Heinrich, Petja: Begriffsanalytischer Ansatz zur qualitativen Zitationsanalyse (p. 111)<br />
Piontek, Krzysztof: The Analysis of the power for some chosen VaR backtesting procedures - simulation approach (p. 112)<br />
Pommeret, Denys: Testing distribution in errors in variables models (p. 113)<br />
Pons, Odile: Classification with an increasing number of components (p. 114)<br />
Potapov, Sergej; Lausen, Berthold: Bagging with different split criteria (p. 115)<br />
Punzo, Antonio: Considerations on the impact of JML-ill-conditioned configurations in the CML approach (p. 116)<br />
Raabe, Nils; Enk, Dirk; Weihs, Claus; Biermann, Dirk: Dynamic disturbances in BTA deephole drilling - Identification of spiralling as a regenerative effect (p. 117)<br />
Radermacher, Walter: Statistical processes under change - Enhancing data quality with pretests (p. 118)<br />
Rapp, Reinhard; Zock, Michael: Automatic Dictionary Expansion Using Non-parallel Corpora (p. 119)<br />
Ringle, Christian M.: FIMIX-PLS Segmentation of Data for Path Models with Multiple Endogenous LVs (p. 120)<br />
Rokita, Pawel; Piontek, Krzysztof: Extreme unconditional dependence vs. multivariate GARCH effect in the analysis of dependence between high losses on Polish and German stock indexes (p. 121)<br />
Rolshoven, Jürgen: Grundzüge einer generativen Korpuslinguistik (p. 122)<br />
Rozmus, Dorota: Cluster ensemble based on co-occurrence data (p. 123)<br />
Sagan, Adam; Kowalska-Musial, Magdalena: Dyadic Interactions in Service Encounter - Bayesian SEM Approach (p. 124)<br />
− xxxiii −
Sardet, Laure; Patilea, Valentin: Beta-kernel density estimation using mixture-based transformations: an application to claims distribution (p. 125)<br />
Schachtner, Reinhard; Pöppel, Gerhard; Lang, Elmar: Nonnegative Matrix Factorization for Binary Data to Extract Elementary Failure Maps from Wafer Test Images (p. 126)<br />
Scharl, Theresa; Leisch, Friedrich: Quality-Based Clustering of Functional Data: Applications to Time Course Microarray Data (p. 127)<br />
Schierle, Martin; Trabold, Daniel: Multilingual knowledge based concept recognition in textual data (p. 128)<br />
Schiffner, Julia; Szepannek, Gero; Monthé, Thierry; Weihs, Claus: Localized Logistic Regression for Discrete Influential Factors (p. 129)<br />
Schiffner, Julia; Weihs, Claus: Localized Classification Using Mixture Models (p. 130)<br />
Schlattmann, Peter: Comparison of four estimators of the heterogeneity variance for meta-analysis (p. 131)<br />
Schölkopf, Bernhard: Machine Learning applications of positive definite kernels (p. 132)<br />
Schuster, Reinhard; von Arnstedt, Eva: Age Distributions for costs in drug prescription by practitioners and for DRG-based hospital treatment (p. 133)<br />
Schyle, Daniel: The Late Neolithic flint axe production on the Lousberg (Aachen, Germany) – An extrapolation of supply and demand and population density (p. 134)<br />
Sieben, Wiebke: Time Related Features for Alarm Classification in Intensive Care Monitoring (p. 135)<br />
Slawski, Martin; Boulesteix, Anne-Laure; Daumer, Martin: 'CMA' - Steps in developing a comprehensive R-toolbox for classification with microarray data and other high-dimensional problems (p. 136)<br />
Solachidis, Vassilios; Mylonas, Phivos; Geyer-Schulz, Andreas; Hoser, Bettina; Chapman, Sam; Ciravegna, Fabio; Staab, Steffen; Contopoulos, Costis; Gkika, Ioanna; Smrz, Pavel; Kompatsiaris, Yiannis; Avrithis, Yannis: Generating Collective Intelligence (p. 137)<br />
Sommer, Katrin; Weihs, Claus: Analysis of polyphonic musical time series (p. 138)<br />
Sommerfeld, Angela: Trust as a Key Determinant of Loyalty and its Moderators (p. 139)<br />
Stecking, Ralf; Schebesch, Klaus B.: Generating Fictitious Training Data for Credit Client Classification (p. 140)<br />
Steinbrecher, Matthias; Kruse, Rudolf: Clustering Association Rules with Fuzzy Concepts (p. 141)<br />
− xxxiv −
Strobl, Carolin; Leisch, Friedrich: Who's Afraid of Statistics? - Measurement and Predictors of Statistics Anxiety in German University Students (p. 142)<br />
Strobl, Carolin; Zeileis, Achim: A New, Conditional Variable Importance Measure for Random Forests (p. 143)<br />
Tarka, Piotr: Conjoint Analysis within the field of customer satisfaction problems – a model of composite product/service (p. 144)<br />
Thiel, Klaus: Optimal VDSL Expansion taking into Consideration of Infrastructure Restrictions and Marketing Requirements (p. 145)<br />
Thinh, Nguyen Xuan; Küttner, Leander; Meinel, Gotthard: Evaluate the data structure and identify homogenous spatial units in the data base "Sustainability issues in sensitive areas" of the EU-FP6 Integrated Project SENSOR (p. 146)<br />
Thorleuchter, Dirk: Mining ideas from textual information (p. 147)<br />
Thorleuchter, Dirk: Mining technologies in security and defense (p. 148)<br />
Timmerman, Marieke E.; Lichtwarck-Aschoff, Anna; Ceulemans, Eva: Multilevel Simultaneous Component Analysis for Studying Inter-individual and Intra-individual Variabilities (p. 149)<br />
Tomas, Amber: Issues related to the implementation of a dynamic logistic model for classifier combination (p. 150)<br />
Trinchera, Laura; Esposito Vinzi, Vincenzo: A Comprehensive Partial Least Squares Approach to Component-Based Structural Equation Modeling (p. 151)<br />
Trzesiok, Michal: Relevant Importance of Predictor Variables in Support Vector Machines Models (p. 152)<br />
Ultsch, Alfred: Comparison of Algorithms to find differentially expressed Genes in Microarray Data (p. 153)<br />
Ultsch, Alfred: Is log ratio a good value for measuring return in stock investments? (p. 154)<br />
Ünlü, Ali: Mosaic Plots and Knowledge Structures (p. 155)<br />
van de Velden, Michel; de Beuckelaer, Alain; Groenen, Patrick; Busing, Frank: Visualizing preferences using minimum variance nonmetric unfolding (p. 156)<br />
van der Ark, Andries L.; Straat, J. Hendrik: Selection of items for tests and questionnaires using Mokken scale analysis (p. 157)<br />
van der Heijden, Peter G.M.: Estimating the prevalence of rule transgression (p. 158)<br />
Wagner, Ralf; Sauerwald, Erik: Clustering Consumers with Respect to Their Marketing Reactance Behavior (p. 159)<br />
Wehrens, Ron: Supervised Self-Organising Maps and More (p. 160)<br />
Wilczynski, Petra; Sarstedt, Marko: Multi-Item Versus Single-Item Measures: A Review and Future Research Directions (p. 161)<br />
− xxxv −
Wildner, Raimund: Management and methods: How to do market segmentation projects (p. 162)<br />
Winkler, Roland; Rehm, Frank; Kruse, Rudolf: Clustering with Repulsive Prototypes (p. 163)<br />
Winkler, Stephan; Affenzeller, Michael; Wagner, Stefan; Kronberger, Gabriel: On the Effects of Enhanced Selection Models on Quality and Comparability of Classifiers Produced by Genetic Programming (p. 164)<br />
Witek, Ewa: Analysis of massive emigration from Poland - the model-based clustering approach (p. 165)<br />
Worm, Katja; Meffert, Beate: Image Based Mail Piece Identification using Unsupervised Learning (p. 166)<br />
Zarraga, Amaya; Goitisolo, Beatriz: Factor Analysis of Incomplete Disjunctive Tables (p. 167)<br />
Zeileis, Achim; Kleiber, Christian: Recursive Partitioning of Economic Regressions: Trees of Costly Journals and Beautiful Professors (p. 168)<br />
− xxxvi −
Author Index<br />
Abu Assab, Samah 1<br />
Adachi, Kohei 2<br />
Aden, Christian 3<br />
Adler, Werner 4<br />
Affenzeller, Michael 164<br />
Amado, Conceicao 30<br />
Ambroise, Christophe 85<br />
Ambrosi, Klaus 102<br />
Andres, Bjoern 5<br />
Augustin, Thomas 6<br />
Avrithis, Yannis 100, 137<br />
Azam, Muhammad 7<br />
Bade, Korinna 8<br />
Baier, Daniel 1, 22, 38<br />
Barbosa, Rui Pedro 9<br />
Bartel, Hans-Georg 34<br />
Bastuck, Christoph 89<br />
Baumgartner, Bernhard 80<br />
Becker, Niels 10<br />
Behnisch, Martin 11<br />
Belo, Orlando 9<br />
Ben-Israel, Adi 12<br />
Benner, Axel 67<br />
Benz, Dominik 8<br />
Bérard, Caroline 92<br />
Berthold, Michael R. 45<br />
Bessler, Wolfgang 13<br />
Betzin, Jörg 14<br />
Biermann, Dirk 117<br />
Biernacki, Christophe 15<br />
Bioch, Cor 54, 101<br />
Birmelé, Etienne 85<br />
Bisson, Gilles 16<br />
Bocci, Laura 17<br />
Borgelt, Christian 18<br />
Boucharel, Julien 48<br />
Boulesteix, Anne-Laure 20, 136<br />
− xxxvii −<br />
Bravo, Cristian 21<br />
Brenning, Alexander 4<br />
Bruno, Eric 79<br />
Brusch, Michael 22<br />
Busing, Frank 156<br />
Buza, Krisztian Antal 23<br />
Calò, Daniela G. 24<br />
Carstoiu, Dorin 26<br />
Caserta, Marco 25<br />
Celeux, Gilles Paul 15<br />
Cernian, Alexandra 26<br />
Ceulemans, Eva 149<br />
Chiou, Hua-Kai 27, 28<br />
Ciravegna, Fabio 100, 137<br />
Contopoulos, Costis 137<br />
Cortina-Borja, Mario 29<br />
Critchley, Frank 30<br />
Dannemann, Jörn 68<br />
Daumer, Martin 136<br />
de Beuckelaer, Alain 156<br />
Dean, Nema 31<br />
Denk, Winfried 5<br />
Desmet, Frank Michel 32<br />
Dewitte, Boris 48<br />
Dias, José G. 33<br />
Dittmar, Christian 89<br />
Dlugosz, Stephan 99<br />
Dolata, Jens 34<br />
du Penhoat, Yves 48<br />
Eigenfeldt, Arne 35<br />
Einbeck, Jochen 36<br />
Enk, Dirk 117<br />
Enyukov, Igor 37<br />
Erben, Christoph 108<br />
Esber, Said 38<br />
Esposito Vinzi, Vincenzo 151<br />
Evers, Ludger 36
Feinerer, Ingo 76<br />
Fenk, August 40<br />
Fenk-Oczlon, Gertraud 40<br />
Fernández-Aguirre,<br />
Karmele<br />
41<br />
Filzmoser, Peter 103<br />
Franses, Philip Hans 84<br />
Fricke, Jobst P. 42<br />
Fritsch, Arno 43<br />
Fuchs, Sebastian 44<br />
Gabriel, Thomas R. 45<br />
Gans, Ulrich-Walter 46<br />
Gantner, Zeno 47<br />
Garel, Bernard 48<br />
Garín-Martín, María Araceli 41<br />
Gassiat, Elisabeth 49<br />
Gazda, Vladimir 50<br />
Geyer-Schulz, Andreas 51, 100,<br />
137<br />
Gkika, Ioanna 137<br />
Godehardt, Erhard 52<br />
Goitisolo, Beatriz 167<br />
Govaert, Gérard 15<br />
Greselin, Francesca 53<br />
Groenen, Patrick J.F. 54, 101,<br />
156<br />
Große, Lars 55<br />
Grün, Bettina 56<br />
Haasdonk, Bernard 57<br />
Häberle, Lothar 58<br />
Hahlweg, Cornelius 59<br />
Hahne, Felix 102<br />
Hamprecht, Fred A. 5<br />
Hansohm, Jürgen 60<br />
Hauptmann, Werner 105<br />
Hausdorf, Bernhard 62<br />
Heinrich, Petja 111<br />
Helmstaedter, Moritz 5<br />
Henker, Uwe 61<br />
Hennig, Christian 62<br />
Henseler, Jörg 63<br />
− xxxviii −<br />
Hermes, Jürgen 64<br />
Herrmann, Lutz 65<br />
Herzog, Irmela 66<br />
Hielscher, Thomas 67<br />
Holler, Julian 13<br />
Holzmann, Hajo 68<br />
Hornik, Kurt 76<br />
Hoser, Bettina 51, 100,<br />
137<br />
Huang, Yong-Ting 27<br />
Hühn, Jens 69<br />
Hüllermeier, Eyke 69, 70<br />
Hütt, Marc-Thorsten 72<br />
Huson, Daniel H. 71<br />
Ickstadt, Katja 43<br />
Imaizumi, Tadashi 73<br />
Ingrassia, Salvatore 53<br />
Ionescu, Tudor 26<br />
Jaworski, Jerzy 52<br />
Joos, Franz 55, 95<br />
Kapur, Ajay 35<br />
Karatzoglou, Alexandros 76<br />
Klaus, Martin 91<br />
Kleiber, Christian 168<br />
Klein, Christian 77<br />
Klenk, Hans-Peter 78<br />
Kludas, Jana 79<br />
Kneib, Thomas 80<br />
Koethe, Ullrich 5<br />
Kompatsiaris, Yiannis 100, 137<br />
Koning, Alex J. 84<br />
Koralun-Bereznicka, Julia 81<br />
Kowalska-Musial, Magdal. 124<br />
Krolak-Schwerdt, Sabine 82<br />
Kronberger, Gabriel 164<br />
Kruse, Rudolf 141, 163<br />
Küttner, Leander 146<br />
Kundisch, Dennis 77<br />
Kuziak, Katarzyna 83<br />
Lam, Kar Yin 84
Lamont, Morne 87<br />
Lang, Elmar 126<br />
Lang, Matthias 46<br />
Latouche, Pierre J. 85<br />
Lausen, Berthold 4, 115<br />
Leisch, Friedrich 56, 75,<br />
127, 142<br />
Leman, Marc 32<br />
Lesaffre, Micheline 32<br />
Lessmann, Stefan 25<br />
Lichtwarck-Aschoff, Anna 149<br />
Locarek-Junge, Hermann 86<br />
Louw, Nelmarie 87<br />
Lübke, Karsten 88<br />
Lukashevich, Hanna 89<br />
Lukociene, Olga 90<br />
Maldonado, Sebastian 21<br />
Marchand-Maillet, Stéphane 79<br />
Martin-Magniette, Marie-L. 92<br />
Mary-Huard, Tristan 92<br />
McLachlan, Geoffrey John 93<br />
McMorris, Fred R. 94<br />
Meffert, Beate 166<br />
Meier, René 95<br />
Meinel, Gotthard 146<br />
Mihm, Max 86<br />
Mirkin, Boris 97<br />
Monthé, Thierry 129<br />
Mucha, Hans-Joachim 3, 34, 98<br />
Müller-Funk, Ulrich 99<br />
Mylonas, Phivos 100, 137<br />
Nalbantov, Georgi Ilkov 54, 101<br />
Neumann, Anneke 102<br />
Neykov, Neyko 103<br />
Neytchev, Plamen 103<br />
Nugent, Rebecca 31, 104<br />
Nusser, Sebastian 105<br />
Okada, Akinori 106<br />
Oosthuizen, Surette 107<br />
Ostermann, Alexander 7<br />
− xxxix −<br />
Ostermann, Thomas 108<br />
Otte, Clemens 105<br />
Palumbo, Francesco 109<br />
Papenhoff, Heike 88<br />
Patilea, Valentin 125<br />
Pekalska, Elzbieta 57<br />
Petersen, Wiebke 110, 111<br />
Petersohn, Uwe 61<br />
Pfeiffer, Karl-Peter 7<br />
Piontek, Krzysztof 112, 121<br />
Pires, Ana 30<br />
Pöppel, Gerhard 126<br />
Pommeret, Denys 113<br />
Pons, Odile 114<br />
Potapov, Sergej 115<br />
Punzo, Antonio 116<br />
Raabe, Nils 117<br />
Radermacher, Walter 118<br />
Ramos, Sofia 33<br />
Rapp, Reinhard 119<br />
Rehm, Frank 163<br />
Ringle, Christian M. 120<br />
Robin, Stéphane 92<br />
Rokita, Pawel 121<br />
Rolshoven, Jürgen 122<br />
Rothe, Hendrik 59<br />
Rozmus, Dorota 123<br />
Rupp, Regula 71<br />
Rybarczyk, Katarzyna 52<br />
Sagan, Adam 124<br />
Sakaehara, Towao 106<br />
Sardet, Laure 125<br />
Sarstedt, Marko 44, 161<br />
Sauerwald, Erik 159<br />
Schachtner, Reinhard 126<br />
Scharl, Theresa 127<br />
Schebesch, Klaus B. 140<br />
Schierle, Martin 128<br />
Schiffner, Julia 129, 130<br />
Schlattmann, Peter 131
Schmidt, Gunther 3<br />
Schmidt-Thieme, Lars 23, 47<br />
Schölkopf, Bernhard 132<br />
Schröder, Winfried 3<br />
Schuster, Reinhard 108, 133<br />
Schwiebert, Stephan 64<br />
Schyle, Daniel 134<br />
Sieben, Wiebke 135<br />
Slawski, Martin 20, 136<br />
Smrz, Pavel 100, 137<br />
Solachidis, Vassilios 100, 137<br />
Sommer, Katrin 138<br />
Sommerfeld, Angela 139<br />
Staab, Steffen 100, 137<br />
Stecking, Ralf 140<br />
Steel, Sarel J. 87, 107<br />
Steinbrecher, Matthias 141<br />
Steiner, Winfried J. 80<br />
Straat, J. Hendrik 157<br />
Stuetzle, Werner 104<br />
Szepannek, Gero 129<br />
Tarka, Piotr 144<br />
Thiel, Kilian 45<br />
Thiel, Klaus 145<br />
Thinh, Nguyen Xuan 146<br />
Thorleuchter, Dirk 147, 148<br />
Timmerman, Marieke E. 149<br />
Tomas, Amber 150<br />
Trabold, Daniel 128<br />
Trinchera, Laura 151<br />
Trzesiok, Michal 152<br />
Ünlü, Ali 155<br />
Ultsch, Alfred 11, 61,<br />
65, 96,<br />
153, 154<br />
van de Velden, Michel 156<br />
− xl −<br />
van der Ark, Andries L. 157<br />
van der Heijden, Peter G.M. 158<br />
Vanderlooy, Stijn 70<br />
Vermunt, Jeroen K. 33, 90<br />
Vichi, Maurizio 17<br />
Viroli, Cinzia 24<br />
von Arnstedt, Eva 133<br />
Wagner, Ralf 91, 159<br />
Wagner, Stefan 164<br />
Wallner, Matthias 6<br />
Weber, Richard 21<br />
Wehrens, Ron 160<br />
Weihs, Claus 117,<br />
129,<br />
130, 138<br />
Werft, Wiebke 67<br />
Werners, Brigitte 10<br />
Wilczynski, Petra 161<br />
Wildner, Raimund 162<br />
Winkler, Roland 163<br />
Winkler, Stephan 164<br />
Witek, Ewa 165<br />
Worm, Katja 166<br />
Yuan, Benjamin J.C. 28<br />
Zarraga, Amaya 167<br />
Zeileis, Achim 143, 168<br />
Zock, Michael 119<br />
Zucknick, Manuela 67
Designing Products Using Quality Function<br />
Deployment and Conjoint Analysis: A<br />
Comparison in a Market for Elderly People<br />
Samah Abu Assab and Daniel Baier<br />
Chair of Marketing and Innovation Management, Brandenburg University of<br />
Technology, Erich-Weinert-Str. 1, 03046 Cottbus, Germany<br />
samah.assab@tu-cottbus.de, baier@tu-cottbus.de<br />
Abstract. In this paper, we compare two product design approaches, quality function<br />
deployment (QFD) and conjoint analysis (CA), using mobile phones for elderly people<br />
as an example. We then compare our results with those of earlier, similar comparisons<br />
(e.g., Pullman et al. (2002), Katz (2004)), following the same procedures and conditions<br />
as Pullman et al. (2002).<br />
Pullman et al. (2002) view the relation between the two methods: QFD and<br />
CA as a complementary one in which both should be simultaneously implemented<br />
and each providing feedback to the other. They concluded that CA is more efficient<br />
in reflecting the end-users’ present preferences for the product attributes, whereas<br />
QFD is definitely better in satisfying end-users’ needs from the developers’ point of<br />
view. Katz (2004) in his response from a practitioner’s point of view agreed with<br />
Pullman’s. However, he concluded that the two methods are better used sequentially<br />
and that QFD should precede conjoint analysis. We test these results in a market<br />
for elderly people<br />
Key words: Conjoint analysis, Quality function deployment, new product design,<br />
elderly people<br />
References<br />
Baier, D. and Brusch, M. (2005): Linking Quality Function Deployment and Conjoint<br />
Analysis for New Product Design. In: D. Baier, R. Decker and L. Schmidt-<br />
Thieme (Eds.): Data Analysis and Decision Support. Springer, Berlin, 189-198.<br />
Katz, G.M. (2004): A Response to Pullman et al.’s (2002) Comparison of Quality<br />
Function Deployment versus Conjoint Analysis. Journal of Product Innovation<br />
Management, 21, 61-63.<br />
Pullman, M.E., Moore, W.L. and Wardell, D.G. (2002): A Comparison of Quality<br />
Function Deployment and Conjoint Analysis in New Product Design. The Journal<br />
of Product Innovation Management, 19, 354-364.<br />
− 1 −
Joint Procrustes Analysis with Constrained<br />
Simplimax Rotation: Nonsingular<br />
Transformation of Component Score and<br />
Loading Matrices Toward Simple Structure<br />
Kohei Adachi<br />
Graduate School of Human Sciences<br />
Osaka University, Japan<br />
Abstract. The solution of component analysis is indeterminate up to a nonsingular<br />
transformation: post-multiplying the component score and loading matrices by a<br />
nonsingular matrix and its transposed inverse, respectively, does not change the<br />
goodness of fit. To obtain a nonsingular matrix which gives simple structure to both<br />
the transformed score and loading matrices, we propose joint Procrustes analysis<br />
with constrained simplimax rotation, which consists of two phases. First, the score<br />
and loading matrices are rotated orthogonally so as to match target score and loading<br />
matrices that contain zero elements; the number of zero elements is predetermined,<br />
but their placement and the values of the non-zero elements are unknown. Second,<br />
with the placement of the zero elements fixed at the first-phase result, a nonsingular<br />
matrix is obtained which matches the transformed score and loading matrices to the<br />
target matrices, where the values of the non-zero elements are unknown. This<br />
procedure is argued to be useful<br />
for the cases where score and loading matrices have symmetric roles, for example,<br />
a case where component analysis is performed for a data matrix of input signals by<br />
output responses.<br />
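The orthogonal matching step in the first phase is, in essence, the classical Procrustes rotation to a fixed target, which has a closed-form solution via the singular value decomposition. The following sketch (our Python illustration, not the authors' implementation; the simplimax-style target update with unknown zero placements is omitted) shows that step:<br />

```python
import numpy as np

def orthogonal_procrustes(A, T):
    """Rotation Q minimizing ||A Q - T||_F over orthogonal Q.

    Classical solution: with the SVD A^T T = U S V^T, the optimizer
    is Q = U V^T.
    """
    U, _, Vt = np.linalg.svd(A.T @ T)
    return U @ Vt
```

In the procedure described above, `T` would be the current target matrix with its fixed zero pattern, re-estimated between rotations.<br />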
References<br />
Adachi, K. (2005). Simultaneous Procrustes transformation of components and loadings<br />
obtained from three-way data. P. 77 in http://www.psychometrika.org/<br />
PDFs/IMPS2005_Abstracts.pdf.<br />
Kiers, H. A. L. (1994). Simplimax: Oblique rotation to an optimal target with simple<br />
structure. Psychometrika, 59, 567-579.<br />
− 2 −
WaldIS - a web based reference system for the<br />
forest monitoring in North Rhine-Westphalia<br />
Christian Aden 1 , Hans Mucha 2 , Gunther Schmidt 1 , and Winfried Schröder 1<br />
1 Lehrstuhl für Landschaftsökologie, Hochschule Vechta, D-49377 Vechta,<br />
Germany, caden@iuw.uni-vechta.de<br />
2 Weierstraß-Institut für Angewandte Analysis und Stochastik (WIAS),<br />
D-10117 Berlin, Germany, mucha@wias-berlin.de<br />
Abstract. In Germany, a multi-level forest monitoring system has been in place since<br />
the mid-1980s. This hierarchical monitoring system consists of the annual forest<br />
condition surveys, the forest soil survey, and the intensive long-term monitoring<br />
of forest ecosystems. In North Rhine-Westphalia, these forest monitoring programmes<br />
are supplemented by monitoring of the foliar chemistry. Within these programmes,<br />
the monitoring data are recorded and evaluated separately by the respective federal<br />
authorities. An integrative statistical analysis of all data collected in the different<br />
monitoring programmes has not yet been realised. To overcome these constraints,<br />
the German Research Foundation (DFG) sponsors a research project that aims at<br />
compiling the monitoring data by means of WebGIS techniques and at integrated<br />
statistical analyses using geostatistics, time series analysis and multivariate statistics.<br />
Currently, the reference data system WaldIS is being developed for integrating the<br />
survey data interactively and for visualising them via a WebGIS. In addition, tools<br />
for logical data queries and downloads, as well as some GIS functions, were included.<br />
WaldIS was realised using open source software components instead of proprietary<br />
software: the UMN Mapserver was combined with the WebGIS client suite Mapbender<br />
and the database management system PostgreSQL. Furthermore, WaldIS relies on<br />
standards for processing geo-objects published by the Open Geospatial Consortium<br />
(Pesch et al. 2007). Moreover, WaldIS will be used to visualise statistical results such<br />
as clusters or principal components. With the help of stable statistical analyses based<br />
on rank-order data, the aim is to find areas of homogeneous environmental and forest<br />
conditions.<br />
Key words: forest monitoring, WebGIS, multivariate rank analysis, stability<br />
References<br />
Pesch, R., Schmidt, G., Schröder, W., Aden, C., Kleppin, L. and Holy, M. (2007):<br />
Development, Implementation and Application of the WebGIS MossMet. In:<br />
A. Scharl and K. Tochtermann (Eds.): The Geospatial Web. Springer, London,<br />
191–200.<br />
− 3 −
Classification of Paired Data Using Ensemble<br />
Methods<br />
Werner Adler 1 , Alexander Brenning 2 , and Berthold Lausen 1<br />
1 Chair for Biometry and Epidemiology, University of Erlangen-Nuremberg, Germany,<br />
werner.adler@imbe.imed.uni-erlangen.de, berthold.lausen@rzmail.uni-erlangen.de<br />
2 Department of Geography, University of Waterloo, Canada, brenning@fesmail.uwaterloo.ca<br />
Abstract. In glaucoma classification, the underlying data have a paired structure<br />
that often is accounted for by simply using only one eye per subject. Brenning and<br />
Lausen (2008) showed that the proper use of both eyes in paired cross-validation<br />
decreases the variance of the estimation, compared to cross-validation using only<br />
one eye per subject.<br />
We discuss and compare different strategies to generate the bootstrap samples<br />
for training Adaboost (Freund and Schapire, 1996), Random Forest (Breiman, 2001),<br />
and Double Bagging (Hothorn and Lausen, 2005). The simplest approach is to ignore<br />
the paired data structure and proceed as usual. Adapting the idea by Brenning and<br />
Lausen, we also perform subject based sampling. In a first step, subjects are drawn<br />
with replacement. In a second step, for each drawn subject either both eyes or<br />
one randomly selected eye are chosen, or two eyes are drawn with replacement.<br />
The subjects not selected for training the base learners constitute the out-of-bag<br />
samples. We compare error rates resulting from these different approaches obtained<br />
by a simulation study.<br />
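To make the subject-based sampling strategies concrete, a minimal sketch (our own illustration; the function name, parameters and use of NumPy are assumptions, not the authors' code):<br />

```python
import numpy as np

def paired_bootstrap(n_subjects, mode="both", rng=None):
    """Draw a bootstrap sample of subjects, then select eyes per subject.

    Returns a list of (subject, eye) pairs for training and the set of
    out-of-bag subjects. mode: "both" eyes, "one" randomly selected eye,
    or "resample" two eyes drawn with replacement.
    """
    rng = np.random.default_rng(rng)
    # Step 1: draw subjects with replacement.
    drawn = rng.integers(0, n_subjects, size=n_subjects)
    sample = []
    for s in drawn:
        # Step 2: choose eyes for each drawn subject.
        if mode == "both":
            sample += [(s, 0), (s, 1)]
        elif mode == "one":
            sample.append((s, rng.integers(0, 2)))
        elif mode == "resample":
            sample += [(s, rng.integers(0, 2)), (s, rng.integers(0, 2))]
    # Subjects never drawn form the out-of-bag sample.
    oob = set(range(n_subjects)) - set(drawn.tolist())
    return sample, oob
```

Ignoring the paired structure would instead resample the 2n eyes directly, the baseline the abstract compares against.<br />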
Key words: Bootstrap, Classification, Glaucoma, Paired Organs<br />
References<br />
Breiman, L. (2001): Random forests. Machine Learning, 45, 5–32.<br />
Brenning, A. and Lausen, B. (2008): Estimating error rates in the classification of<br />
paired organs. Statistics in Medicine, submitted.<br />
Freund, Y. and Schapire, R. (1996): Experiments with a new boosting algorithm.<br />
Proceedings of the 13th International Conference on Machine Learning, 148–<br />
156.<br />
Hothorn, T. and Lausen, B. (2005): Bundling classifiers by bagging trees. Computational<br />
Statistics & Data Analysis, 49, 1068–1078.<br />
− 4 −
Segmentation of Neural Tissue<br />
Bjoern Andres 1 , Ullrich Koethe 1 , Moritz Helmstaedter 2 , Winfried Denk 2 ,<br />
and Fred Hamprecht 1<br />
1 Interdisciplinary Center for Scientific Computing, University of Heidelberg<br />
2 Max Planck Institute for Medical Research, Heidelberg<br />
Abstract. Three-dimensional electron-microscopic image stacks with almost isotropic<br />
resolution allow, for the first time, the determination of the complete connectivity<br />
matrix of parts of the brain. In spite of major advances in staining, correct segmentation<br />
of these stacks remains challenging, because very few local mistakes can lead to<br />
severe global errors. We propose a hierarchical segmentation procedure based on<br />
statistical learning and topology-preserving grouping. First, edge probability maps<br />
are computed by a random forest classifier, and are partitioned into supervoxels by<br />
the watershed transform. Over-segmentation is then resolved by constructing an<br />
irregular graphical model on these supervoxels and inferring the most likely global<br />
segmentation. Careful validation shows that the results of our algorithm are close<br />
to human labelings.<br />
− 5 −
On the power of corrected score functions to<br />
adjust for measurement error<br />
Thomas Augustin and Matthias Wallner<br />
Department of Statistics, University of Munich (LMU)<br />
augustin@stat.uni-muenchen.de<br />
Abstract. Measurement error modeling, also called errors-in-variables-modeling,<br />
is a generic term for all situations where additional uncertainty in the variables<br />
has to be taken into account, in order to avoid severe bias in the statistical analysis.<br />
The problem is omnipresent in technical statistics, when data from imperfect<br />
measurement instruments are analyzed, as well as in biometrics, econometrics or<br />
social science, where operationalizations (surrogates) are used instead of complex<br />
theoretical constructs.<br />
After a brief introduction to the area of measurement error modelling, the talk<br />
discusses the power and some limitations of Nakamura’s general principle of corrected<br />
score functions, mainly in the context of failure time data. Starting with classical<br />
covariate measurement error in Cox’s PH model, it is shown how the Breslow<br />
likelihood can be corrected, while according to results by Stefanski and Nakamura<br />
himself no corrected score function for the partial likelihood can exist. We then turn<br />
to parametric failure time models and extend consideration to additionally error-prone<br />
lifetimes. Finally, some ideas for handling Berkson-type errors (as occurring,<br />
e.g., in Radon studies) and rounding errors will be sketched.<br />
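For intuition, consider the simplest corrected-score example (classical additive measurement error in linear regression, our illustration rather than the survival setting of the talk), where correcting a biased moment removes the attenuation bias of the naive estimator:<br />

```latex
% Model: Y = X\beta + \varepsilon, but only W = X + U is observed,
% with U \sim N(0, \Sigma_u) independent of (X, \varepsilon).
% Since E[W^\top W] = E[X^\top X] + n\,\Sigma_u, the naive OLS estimator
% based on W is biased; correcting the moment yields
\hat{\beta}_{\mathrm{corr}} \;=\; \bigl(W^\top W - n\,\Sigma_u\bigr)^{-1} W^\top Y .
```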
Key words: Measurement error, error-in-variables, survival analysis, Cox model,<br />
rounding<br />
− 6 −
Evaluation Criteria for the Construction of<br />
Binary Classification Trees with Two or More<br />
Classes<br />
Muhammad Azam 1 , Alexander Ostermann 2 , and Karl-Peter Pfeiffer 3<br />
1 Department of Medical Statistics, Informatics and Health Economics, Medical<br />
University Innsbruck, csag2533@uibk.ac.at<br />
2 Institute for Mathematics, University of Innsbruck, Technikerstrasse 25/7, 6020<br />
Innsbruck, alexander.ostermann@uibk.ac.at<br />
3 Department of Medical Statistics, Informatics and Health Economics, Medical<br />
University Innsbruck, Karl-Peter.Pfeiffer@i-med.ac.at<br />
Abstract. Classification trees recursively partition labelled sampling units in a<br />
top-down fashion until end nodes are reached. Each end node is labelled with the<br />
class of the majority of its units; the remaining units are counted as misclassified.<br />
In the top-down induction process, the evaluation criterion plays an important role<br />
in sending as many units with the same label as possible to the same node. To<br />
achieve this, a “goodness of split” measure is calculated using an evaluation criterion,<br />
e.g. the Gini function or the Twoing rule, for each distinct value of each variable,<br />
and the split that most enhances purity is chosen. For a small number of classes,<br />
almost all evaluation criteria provide the same results in terms of misclassified units,<br />
deviance and number of end nodes; for a larger number of classes, however, the choice<br />
matters, and the best criterion is the one that yields the fewest misclassified units.<br />
Here we propose an impurity-based evaluation criterion which fulfils all the required<br />
properties of an evaluation criterion (Breiman et al., 1984): (i) the node impurity<br />
function achieves its maximum value when the units in a node are distributed equally<br />
over the J classes; (ii) a node is pure when all its observations belong to a single<br />
class; (iii) the node impurity function is symmetric. We conducted a simulation<br />
study to test the performance of the proposed criterion on several real-life datasets<br />
from the UCI repository and observed that the proposed strategy provides improved<br />
results.<br />
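As a point of reference (the authors' proposed criterion is not specified in the abstract), a standard “goodness of split” search with the Gini function can be sketched as:<br />

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Scan each distinct value as a threshold and return the split that
    most reduces the weighted child impurity (maximum purity gain)."""
    n = len(labels)
    parent = gini(labels)
    best = (None, 0.0)
    for t in sorted(set(values))[:-1]:  # last value cannot split
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        gain = parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
        if gain > best[1]:
            best = (t, gain)
    return best
```

Property (i) above corresponds to `gini` being maximal for equal class counts, and (ii) to `gini` vanishing on a pure node.<br />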
Key words: Classification trees, Evaluation criteria, Misclassification rate, Deviance<br />
References<br />
Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984): Classification<br />
and Regression Trees. Wadsworth International Group, Belmont, CA.<br />
− 7 −
Evaluation Strategies for Learning Algorithms<br />
of Hierarchical Structures<br />
Korinna Bade 1 and Dominik Benz 2<br />
1 Faculty of Computer Science, Otto-von-Guericke-University Magdeburg,<br />
D-39106 Magdeburg, Germany, Email: korinna.bade@ovgu.de<br />
2 Department of Electrical Engineering/Computer Science, University of Kassel,<br />
D-34121 Kassel, Germany, Email: benz@cs.uni-kassel.de<br />
Abstract. The idea to automatically induce a hierarchical structure among a set<br />
of objects, or to integrate a given hierarchy into the learning process, is common to<br />
a number of disciplines such as hierarchical clustering (Bade and Nürnberger, 2008),<br />
classification, and ontology learning. A crucial aspect is how to assess the quality<br />
of the learned hierarchical scheme. Existing evaluation approaches can be broadly<br />
classified into methods defining quality metrics on the resulting scheme alone<br />
and methods which invoke an external “gold standard” for comparison. We focus<br />
on the latter case, for which various similarity metrics have been proposed, mostly<br />
depending on the characteristics of the applied learning procedure.<br />
This work aims at bringing together the different disciplines by presenting and<br />
comparing existing gold-standard based evaluation methods for learning algorithms<br />
that generate hierarchical structures. We present an interdisciplinary framework in<br />
order to enable comparison across the different contexts, from which the metrics<br />
originate. Our goal is to emphasize the strong similarities of evaluation tasks in different<br />
disciplines and to create a general pool of evaluation methods. Based on prior<br />
work (Dellschaft and Staab, 2006), we analyze properties of (good) evaluation measures.<br />
Different types of structural errors in the learned hierarchies are identified and<br />
their effects on existing measures are shown. Observing strengths and weaknesses of<br />
existing methods, we also suggest some new methods.<br />
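One simple instance of a gold-standard based measure, an ancestor-overlap F1 (our illustrative choice, only one of the many metric families such work surveys), can be sketched as:<br />

```python
def ancestors(parent, node):
    """Set of ancestors of `node` in a hierarchy given as a child->parent map."""
    out = set()
    while node in parent:
        node = parent[node]
        out.add(node)
    return out

def hierarchy_f1(learned, gold):
    """Average ancestor-overlap F1 over nodes present in both hierarchies.

    A node placed far from its gold-standard position shares few
    ancestors with it and is penalised accordingly.
    """
    shared = set(learned) & set(gold)
    scores = []
    for n in shared:
        a, b = ancestors(learned, n), ancestors(gold, n)
        p = len(a & b) / len(a) if a else 0.0
        r = len(a & b) / len(b) if b else 0.0
        scores.append(2 * p * r / (p + r) if p + r else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

A structural error such as attaching a leaf to the wrong parent lowers the score only for the affected subtree, one of the error-sensitivity properties such a comparison framework examines.<br />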
Key words: evaluation metrics, hierarchical clustering, ontology learning, gold-standard<br />
References<br />
Bade, K. and Nürnberger, A. (2008): Creating a Cluster Hierarchy under Constraints<br />
of a Partially Known Hierarchy. In: Proceedings of the 2008 SIAM International<br />
Conference on Data Mining. (to appear)<br />
Dellschaft, K. and Staab, S. (2006): On How to Perform a Gold Standard Based<br />
Evaluation of Ontology Learning. In: Proc. of 5 th Int. Semantic Web Conference.<br />
228–241.<br />
− 8 −
Autonomous Forex Trading Agents<br />
Rui Pedro Barbosa 1 and Orlando Belo 2<br />
1 Department of Informatics, University of Minho, 4710-057 Braga, Portugal,<br />
rui.barbosa@di.uminho.pt<br />
2 Department of Informatics, University of Minho, 4710-057 Braga, Portugal,<br />
obelo@di.uminho.pt<br />
Abstract. Trading in financial markets is undergoing a radical transformation,<br />
one in which algorithmic methods are becoming increasingly more important. The<br />
development of intelligent agents that can act as autonomous traders of financial<br />
instruments seems like a logical step forward in this “algorithms arms race”. With<br />
this in mind, our study proposes an infrastructure for implementing hybrid intelligent<br />
agents with the ability to trade in the Forex Market without requiring human<br />
supervision. This infrastructure is composed of three modules. The Intuition Module,<br />
implemented using an Ensemble Model, is responsible for performing pattern<br />
recognition and predicting the direction of the exchange rate. The A Posteriori<br />
Knowledge Module, implemented using a Case-Based Reasoning System, enables<br />
the agents to learn from empirical experience and is responsible for suggesting<br />
how much to invest in each trade. Finally, the A Priori Knowledge Module, implemented<br />
using a Rule-Based Expert System, enables the agents to incorporate<br />
non-experiential knowledge in their trading decisions. This infrastructure was used<br />
to implement two agents, one trading the USD/JPY currency pair and the other<br />
the EUR/USD currency pair, both on a 6-hour timeframe. Using 12 months of<br />
out-of-sample data, the USD/JPY agent performed<br />
826 simulated trades and obtained an average profit per trade of 6.88 pips. It accurately<br />
predicted the direction of the price in 54.72% of the trades, 65.74% of which<br />
were profitable. Over the same period, the EUR/USD agent performed 885 trades,<br />
with an average profit of 6.06 pips per trade. Its accuracy in predicting the direction<br />
of the price was 52.99%, and 60.45% of its trades were profitable. These agents were<br />
integrated with an Electronic Communication Network and have been trading live<br />
for the past several months. So far their live trading results are consistent with the<br />
simulated results, which leads us to believe our infrastructure can be of practical<br />
interest to the traditional trading community.<br />
Key words: Forex trading, Hybrid agents, Autonomy<br />
− 9 −
Improving Product Line Design with Bundling<br />
Niels Becker and Brigitte Werners<br />
Faculty of Economics and Business Administration,<br />
Ruhr-University Bochum, 44780 Bochum, Germany<br />
niels.becker@ruhr-uni-bochum.de and or@ruhr-uni-bochum.de<br />
Abstract. Designing and pricing new products is of particular importance in many<br />
industries. In order to meet heterogeneous customer needs, many companies offer different<br />
variants of every product type. To support these product line design decisions,<br />
various mathematical programming approaches have been developed (Steiner and<br />
Hruschka, 2003). Most models are based on part-worth utilities, estimated within a<br />
conjoint framework. Besides, bundling is an important tool in marketing. It has been<br />
shown that bundling can transfer customers’ willingness to pay from one product to<br />
another. Therefore, prices can be differentiated so that higher profits are obtainable<br />
(Simon and Wübker, 1999). For determining optimal bundles and prices, Hanson<br />
and Martin (1990) have suggested a well-known linear programming model.<br />
Here the problem of optimally designing, bundling and pricing new products<br />
is investigated. One of the questions is, at which point in time bundling decisions<br />
should be made. Therefore, we compare product line design without bundling with<br />
sequential bundling, which means bundling subsequent to product line decisions, and<br />
simultaneous bundling, which means determining optimal bundles and product lines<br />
simultaneously. We developed a combined product line design and bundling model<br />
and present the impact on profits using simulated data. For this example, optimal<br />
results can be obtained using MILP-Software. Our studies show that simultaneous<br />
bundling leads to differently designed products and can improve profits substantially.<br />
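The economic intuition can be illustrated with a deliberately tiny brute-force version of the bundle-pricing problem (our sketch, far simpler than the Hanson and Martin (1990) MILP; it assumes additive bundle valuations and that each customer buys at most one offer):<br />

```python
from itertools import product

def best_bundle_prices(reservations, price_grid):
    """Brute-force optimal prices for products A, B and the bundle AB.

    reservations: list of (r_a, r_b) willingness-to-pay per customer;
    the bundle is valued at r_a + r_b. Each customer picks the option
    (A, B, AB, or nothing) with the largest non-negative surplus,
    ties broken in favour of the higher-priced option.
    """
    best_profit, best_prices = 0.0, None
    for pa, pb, pab in product(price_grid, repeat=3):
        profit = 0.0
        for ra, rb in reservations:
            options = [(ra - pa, pa), (rb - pb, pb), (ra + rb - pab, pab), (0.0, 0.0)]
            surplus, paid = max(options)  # "nothing" guarantees surplus >= 0
            profit += paid
        if profit > best_profit:
            best_profit, best_prices = profit, (pa, pb, pab)
    return best_prices, best_profit
```

With negatively correlated valuations, the bundle price extracts the full willingness to pay, the transfer effect the abstract refers to.<br />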
Key words: Product Line Design, Pricing, Bundling, Optimization<br />
References<br />
Hanson, W. and Martin, K. (1990): Optimal Bundle Pricing. Management Science,<br />
36(2), 155–174.<br />
Simon, H. and Wübker, G. (1999): Bundling - A Powerful Method to Better Exploit<br />
Profit Potential. In: R. Füderer, A. Hermann and G. Wübker (Eds.): Optimal<br />
Bundling, Springer-Verlag, Heidelberg-Berlin, 7–28.<br />
Steiner, W. and Hruschka, H. (2003): Genetic Algorithms for Product Design: How<br />
Well Do They Really Work. Int. Journal of Market Research, 45(2), 229–240.<br />
− 10 −
Estimating the number of buildings in<br />
Germany<br />
Martin Behnisch 1 and Alfred Ultsch 2<br />
1 Institute of Historic Building Research and Conservation, ETH Hoenggerberg,<br />
HIL D 25.9, CH-8093 Zurich. Behnisch@arch.ethz.ch<br />
2 Datenbionic Research Group, Hans-Meerwein-Strasse, Philipps-University<br />
Marburg, D-35032 Marburg. Ultsch@Mathematik.Uni-Marburg.de<br />
Abstract. The building stock can be considered the largest physical, economic<br />
and cultural capital of a society. For German building stocks, many institutions<br />
record different kinds of data. Unfortunately, there are only a few basic statistics<br />
on the number of buildings. Data collection is therefore very complicated and often<br />
expensive, and the handling of missing data is one of the biggest obstacles.<br />
With the exception of data on residential buildings and, in particular, monuments,<br />
determining the total number of buildings is an unsolved problem. The main<br />
contribution of this article is an estimation procedure for this number. Using<br />
methods from the so-called Urban Knowledge Discovery approach, the authors find<br />
unsuspected relationships in the urban data which can be used for the estimation.<br />
The estimation procedure covers 12,430 municipalities and refers to data<br />
from the Cadaster of Real Estates and the Federal Bureau of Statistics. With this<br />
estimation it is possible to use statistical data from well-known and easily accessible<br />
institutions. The number of buildings is estimated for regions with missing data,<br />
and the quality of the estimation is analyzed with training and test data sets.<br />
Information optimization leads to the conclusion that 20% of the municipalities<br />
hold 80% of all buildings. To improve the estimation, it is therefore essential to<br />
refine the amount and quality of data in the larger municipalities.<br />
Key words: Spatial Planning, Engineering, Knowledge Discovery, Data Mining,<br />
Building Stock<br />
References<br />
Aachener Institut für Bauschadensforschung und Angewandte Bauphysik, Hrsg:<br />
Hofman, F. (2001): Urban heritage - building maintenance. Final report. COST<br />
Action C5, European Commission.<br />
Becher, St. (1995): Klassifikation der regionalen Immobilienmärkte der Bundesrepublik<br />
Deutschland (Dissertation). Universität, Mainz.<br />
Behnisch, M. (2007): Urban Knowledge Discovery (Doctoral thesis). Universitätsverlag,<br />
Karlsruhe.<br />
− 11 −
Probabilistic Distance Clustering<br />
Adi Ben-Israel<br />
Rutgers University, RUTCOR<br />
Summary. A new iterative method [1] for probabilistic clustering of data is presented.<br />
Given clusters, their centers, and the distances of data points from these<br />
centers, the probability of cluster membership at any point is assumed inversely<br />
proportional to the distance from (the center of) the cluster in question.<br />
The resulting method is a generalization, to several centers, of the Weiszfeld<br />
method for solving the Fermat–Weber location problem. At each iteration, the<br />
distances (Euclidean, Mahalanobis, etc.) from the cluster centers are computed for<br />
all data points, and the centers are updated as convex combinations of the data<br />
points, with weights determined by the above principle. Computations stop when<br />
the centers stop moving.<br />
This approach also works for problems where the cluster sizes are unknowns (to<br />
be estimated), giving a viable alternative to the EM method [2].<br />
Progress is monitored by the joint distance function (JDF), a measure of<br />
distance from all cluster centers, that evolves during the iterations and captures the<br />
data in its low contours. This is a new concept in data reduction and representation.<br />
A duality theory for the JDF is given in [3].<br />
The method is simple, fast (requiring a small number of cheap iterations), and<br />
insensitive to outliers.<br />
Key words: Probabilistic clustering, Fermat–Weber problem, Joint distance function<br />
References<br />
Ben-Israel, A. and Iyigun, C. (to appear): Probabilistic Distance Clustering. Journal<br />
of Classification. http://benisrael.net/J-CLASSIFICATION-07.pdf.<br />
Iyigun, C. and Ben-Israel, A. (to appear): Probabilistic Distance Clustering Adjusted<br />
for Cluster Size. Probability in the Engineering and Informational Sciences.<br />
http://benisrael.net/PEIS-07.pdf.<br />
Iyigun, C. and Ben-Israel, A. (submitted): Contour Approximation of Data: A Duality<br />
Theory. http://benisrael.net/DUAL-12-20-07.pdf.<br />
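The update rule summarized above (membership probabilities inversely proportional to distance, centers recomputed Weiszfeld-style as convex combinations of the data points, stopping when the centers stop moving) can be sketched as follows. This is an illustrative sketch following reference [1], not the authors' implementation; the deterministic initialization and the squared-probability weights are our reading of the method:

```python
import numpy as np

def pd_cluster(X, K, iters=100, tol=1e-6):
    """Sketch of probabilistic distance clustering: p_k(x) is inversely
    proportional to the distance d_k(x) from center k, and each center is
    updated as a convex combination of the data points with
    Weiszfeld-style weights w = p^2 / d."""
    # simple deterministic spread initialization (an assumption of this sketch)
    centers = X[np.linspace(0, len(X) - 1, K).astype(int)].astype(float)
    for _ in range(iters):
        # Euclidean distances of every point from every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # p_k(x) proportional to 1/d_k(x)
        p = (1.0 / d) / (1.0 / d).sum(axis=1, keepdims=True)
        # convex-combination center update with weights p^2 / d
        w = p**2 / d
        new_centers = (w[:, :, None] * X[:, None, :]).sum(axis=0) / w.sum(axis=0)[:, None]
        moved = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if moved < tol:          # "computations stop when the centers stop moving"
            break
    return centers, p
```

On two well-separated blobs the centers converge to the blob locations and each point's membership concentrates on its own cluster.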
− 12 −
Hedge Funds in a Bayesian Asset Allocation<br />
Framework: Incorporating information on<br />
market states and manager’s ability<br />
Wolfgang Bessler 1 and Julian Holler 2<br />
1 Center for Finance and Banking, Licher Strasse 74, 35394 Giessen<br />
Wolfgang.Bessler@wirtschaft.uni-giessen.de<br />
2 Center for Finance and Banking, Licher Strasse 74, 35394 Giessen<br />
Julian.Holler@wirtschaft.uni-giessen.de<br />
Abstract. A growing number of private and institutional investors make significant<br />
allocations to hedge funds in order to improve the risk-return trade-off of their portfolios.<br />
However, in a portfolio context there are a number of issues that are specific<br />
to hedge funds. We attempt to address these issues in a Bayesian asset allocation<br />
framework (Pastor 2000). In particular, we focus on the returns of two representative<br />
equity hedge fund strategies constructed by replication of two well-known<br />
statistical arbitrage strategies. Importantly, this approach allows us to obtain daily<br />
return observations despite the fact that most funds only report at a monthly interval.<br />
Using this framework, we investigate the following two research questions.<br />
First, we address the issue that many arbitrage strategies exhibit substantial exposures<br />
to financial crises, which is reflected in their high levels of kurtosis and negative<br />
skewness. By including relevant state variables in the prior distribution, we infer<br />
whether investors can improve the risk-adjusted performance of their portfolios by<br />
reducing their exposures prior to the onset of a crisis. Second, investors should only<br />
pay high fees to hedge fund managers if they earn additional alpha above the returns<br />
generated by our dynamic trading strategy. Thus, we attempt to analyze how much<br />
confidence an investor has to put into a manager’s abilities by varying her prior<br />
beliefs about alpha.<br />
References<br />
Pastor, L. (2000): Portfolio Selection and Asset Pricing Models. Journal of Finance,<br />
55, 179–223<br />
Key words: Asset Allocation, Alternative Investments, Hedge Funds<br />
− 13 −
Categorical Data in PLS Path modeling<br />
Jörg Betzin<br />
German Centre of Gerontology (DZA), Berlin, Germany<br />
joerg.betzin@dza.de<br />
Summary. There are many surveys with categorical data in which the relationships<br />
between the variables should be used in a path model with latent variables, but so<br />
far there are only a few possibilities to do so. We present a way of using categorical<br />
manifest variables (MV's) in PLS.<br />
The main idea is, on the one hand, to think of PLS as a generalization of PCA<br />
(principal component analysis) or canonical correlation and, on the other hand, to use<br />
the framework of Correspondence Analysis (CorA) as a generalization of PCA for<br />
categorical variables, and to put these two approaches together.<br />
In the basic PLS algorithm the latent variables (LV's) ηm (m = 1, ..., M) are<br />
estimated as weighted sums of their manifest variables (with data matrices Ym),<br />
ηm = Ym ωm, where the pooled weight vector ω = (ω′1, ..., ω′M)′ is the result of an<br />
iteration algorithm like<br />
ω = ((Y′Y) ∗ P) ω<br />
with an additional normalization constraint, where Y = (Y1, ..., YM), P is a weight<br />
matrix changing with the iteration steps, and '∗' denotes the elementwise<br />
matrix product. The key point is the use of the covariance matrix Y′Y inside the<br />
iteration algorithm.<br />
Now, one main aspect of CorA is the transformation of the raw data matrix Ym<br />
into an indicator matrix Gm and the analysis of a kind of correlation matrix for Gm.<br />
Let Q̃m denote a suitable transformation of Gm such that the elements of Q̃′m Q̃m are<br />
roots of χ²-components from two-dimensional contingency tables for columns in Gm.<br />
Then, in short, Q̃ = (Q̃1, ..., Q̃M) is used as an equivalent of the covariance<br />
matrix Y′Y in the PLS iteration algorithm.<br />
We will show results for different examples using the basic PLS algorithms and<br />
PLS algorithms adapted for categorical manifest variables, together with interpretations<br />
of the weights ωm in the case of categorical data and of the other model<br />
parameters such as correlations and regression coefficients.<br />
Key words: Partial Least Squares, Correspondence Analysis, Categorical Data<br />
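A minimal numerical sketch of the basic PLS weight iteration ω = ((Y′Y) ∗ P) ω with renormalization. This is our own simplification, not the author's method: the scheme-dependent weight matrix P is replaced by a fixed 0/1 block mask expanded from a block adjacency. With a single block and P ≡ 1 the iteration reduces to the power method for the first principal component, which illustrates the "PLS as a generalization of PCA" view mentioned in the summary:

```python
import numpy as np

def pls_weights(blocks, adjacency, iters=100, tol=1e-8):
    """Sketch of the weight iteration w <- ((Y'Y) * P) w ('*' elementwise),
    with the pooled weight vector renormalized after each step.
    blocks: list of (n, p_m) data matrices; adjacency: 0/1 block links
    (a crude, fixed stand-in for the scheme-dependent P)."""
    Y = np.hstack(blocks)
    Y = Y - Y.mean(axis=0)                 # column-center the pooled data
    sizes = [b.shape[1] for b in blocks]
    # expand the block adjacency to a variable-by-variable mask P
    P = np.block([[np.full((p, q), float(adjacency[i][j]))
                   for j, q in enumerate(sizes)]
                  for i, p in enumerate(sizes)])
    C = Y.T @ Y                            # covariance-type matrix Y'Y
    w = np.ones(sum(sizes))
    w /= np.linalg.norm(w)
    for _ in range(iters):
        w_new = (C * P) @ w                # elementwise mask, then multiply
        w_new /= np.linalg.norm(w_new)     # normalization constraint
        if np.linalg.norm(w_new - w) < tol:
            w = w_new
            break
        w = w_new
    # latent variable scores eta_m = Y_m w_m, block by block
    scores, start = [], 0
    for b, p in zip(blocks, sizes):
        bc = b - b.mean(axis=0)
        scores.append(bc @ w[start:start + p])
        start += p
    return w, scores
```

The categorical variant described in the summary would substitute the χ²-based matrix Q̃ for Y′Y inside the same loop.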
− 14 −
Choosing the number of clusters in the latent<br />
class model<br />
Christophe Biernacki 1 , Gilles Celeux 2 , and Gérard Govaert 3<br />
1 Université Lille 1 UMR CNRS 8524, France<br />
Christophe.Biernacki@math.univ-lille1.fr<br />
2 INRIA Saclay, France Gilles.Celeux@inria.fr<br />
3 UTC Compiègne UMR CNRS 6599 Heudiasyc, France Gerard.Govaert@utc.fr<br />
Abstract. The latent class model or multivariate multinomial mixture is a powerful<br />
model for clustering discrete data. This model is expected to be useful to represent<br />
nonhomogeneous populations. It uses a conditional independence assumption given<br />
the latent class to which a statistical unit belongs. In this presentation, we exploit<br />
the fact that a fully Bayesian analysis of the latent class model with Jeffreys<br />
non-informative prior distributions involves no technical difficulty in deriving<br />
the exact integrated complete likelihood. We then exploit this integrated complete<br />
likelihood as a criterion to assess the number of mixture components in a cluster<br />
analysis perspective. We highlight with numerical experiments how this exact criterion<br />
can outperform the BIC-like asymptotic approximation generally used to<br />
choose a sensible number of clusters derived from the latent class model.<br />
Key words: Latent Class model, Integrated complete likelihood, Model selection<br />
References<br />
Biernacki, C., Celeux, G. and Govaert, G. (2000). Assessing a mixture model for<br />
clustering with the integrated completed likelihood. IEEE Trans. on PAMI,<br />
22, 719-725.<br />
Goodman, L. A. (1974). Exploratory latent structure analysis using both identifiable<br />
and unidentifiable models. Biometrika, 61, 215-231.<br />
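As a sketch of the exact criterion referred to above: with Jeffreys Dirichlet(1/2, ..., 1/2) priors the integrated complete likelihood of the latent class model factors into closed-form Dirichlet-multinomial terms, one for the mixing proportions and one per (class, variable) pair. The following illustrative implementation of that closed form is ours (function name and data layout are assumptions, not taken from the paper):

```python
import numpy as np
from math import lgamma

def exact_icl(X, z, K):
    """Exact integrated complete likelihood of the latent class model
    under Jeffreys Dirichlet(1/2) priors.  X: (n, d) array of 0-based
    category codes; z: (n,) array of class labels in {0, ..., K-1}."""
    n, d = X.shape
    icl = 0.0
    # Dirichlet-multinomial term for the mixing proportions
    nk = np.array([(z == k).sum() for k in range(K)])
    icl += lgamma(K / 2) - K * lgamma(0.5)
    icl += sum(lgamma(c + 0.5) for c in nk) - lgamma(n + K / 2)
    # one Dirichlet-multinomial term per (class, variable) pair
    for j in range(d):
        m = int(X[:, j].max()) + 1          # number of categories of variable j
        for k in range(K):
            counts = np.bincount(X[z == k, j], minlength=m)
            icl += lgamma(m / 2) - m * lgamma(0.5)
            icl += sum(lgamma(c + 0.5) for c in counts) - lgamma(nk[k] + m / 2)
    return icl
```

On perfectly separated binary data the criterion prefers the two-class partition over lumping everything into one class, which is the model-selection behavior the abstract exploits.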
− 15 −
Clustering of molecules and structured data<br />
Gilles Bisson<br />
Laboratoire TIMC-IMAG - Equipe AMA, Faculté de Médecine<br />
Summary. The discovery or the synthesis of molecules that activate or inhibit some<br />
biological systems is a central issue for biological research and health care. The objective<br />
of High Throughput Screening (HTS) is to rapidly evaluate, through automated<br />
approaches, the activity of a given collection of molecules on a given biological target<br />
that can be an enzyme or a whole cell. In practice, the results of an HTS test<br />
highlight some tens of active molecules, named the "hits", representing a very small<br />
percentage of the initial collection. However, these tests are just the beginning of the<br />
work, since the identified molecules generally lack desirable characteristics in<br />
terms of sensitivity and specificity (a relevant molecule must be specific to the biological<br />
target and should be effective at a low concentration). In such a context, it<br />
is crucial to provide chemists with tools to explore the contents of their<br />
chemical libraries and especially to ease the search for molecules that are<br />
structurally similar to the "hits". A possible approach, given a relevant distance, is<br />
to seek the nearest neighbours of those hits. More broadly, chemists have a need for<br />
methods to automatically organize the collections of molecules in order to locate the<br />
active molecules within the chemical space. Above all, they would like to evaluate<br />
the real diversity of the chemical structures contained in a collection. Clustering<br />
methods are well suited to carry out this kind of task. However, with structurally<br />
complex objects such as molecules, it is obvious that the quality of the results depends<br />
on the capacity of the distance used by the clustering method to grasp the<br />
structural similarities and also to take into account all the background knowledge<br />
of the chemists. The search for a structural distance between molecules is clearly<br />
related (but not totally equivalent) to the search for isomorphic partial subgraphs,<br />
which is an NP-complete problem. To overcome this problem, many methods use an<br />
"a priori" molecular linearization: a molecule is represented by a vector of descriptors,<br />
each one corresponding to a molecular fragment, and well-known distances can<br />
then be used. However, over the last ten years, kernel functions comparable to distances<br />
between graphs have been proposed in the Support Vector Machines framework.<br />
In these approaches, the molecular representation is more accurate. It can be based on<br />
a set of paths (i.e. molecular fragments specifically chosen or randomly selected), or,<br />
more interestingly, use the whole molecule to estimate structural distances by<br />
dynamically exploring the mapping that can be done between two molecules.<br />
− 16 −
The K-INDSCAL Model for Heterogeneous<br />
Three-way Dissimilarity Data<br />
Laura Bocci 1 and Maurizio Vichi 2<br />
1 Department of Sociology and Communication, University of Rome “La<br />
Sapienza”, Rome, Italy laura.bocci@uniroma1.it<br />
2 Department of Statistics, Probability and Applied Statistics, University of Rome<br />
“La Sapienza”, Rome, Italy maurizio.vichi@uniroma1.it<br />
Abstract. The weighted Euclidean model proposed by Carroll and Chang (1970) is<br />
the best-known and most widely used model for multidimensional scaling of three-way data.<br />
INDSCAL assumes a unique representation of the objects (common configuration<br />
space) and, for each occasion, weights for the dimensions of this representation (individual<br />
differences weights), thus assuming that there are no systematic<br />
"strong" differences between the data dissimilarity sources. However, when heterogeneous<br />
occasions are observed, it is shown that INDSCAL may fail to identify a<br />
common space representative of the observed data structure. In such a frequent and<br />
realistic situation it is reasonable to assume that there are systematic differences<br />
among some, say, K clusters of occasions in the evaluation of the dissimilarities,<br />
so that within each cluster of occasions the evaluations may differ only because of<br />
sampling or measurement errors, while between clusters of occasions the dissimilarities<br />
are really different. The heterogeneous INDSCAL in K classes model, simply called<br />
K-INDSCAL, is proposed to handle the above-described heterogeneity in the data.<br />
The model includes the individual weights in order to preserve the rotational invariance<br />
of the INDSCAL model. The high number of parameters of INDSCAL, and<br />
consequently of K-INDSCAL, may produce instability of the estimates; thus a parsimonious<br />
model, which drastically reduces the number of parameters, is also discussed.<br />
The parameters of the model are estimated in a least-squares fitting context and an<br />
efficient coordinate descent algorithm is given. The usefulness of K-INDSCAL is<br />
demonstrated by both artificial and real data analyses.<br />
Key words: Three-way dissimilarity data, INDSCAL, heterogeneous data dissimilarities<br />
References<br />
Carroll, J.D. and Chang, J.J. (1970): Analysis of individual differences in multidimensional<br />
scaling via an N-generalization of the Eckart-Young decomposition.<br />
Psychometrika, 35, 283–319.<br />
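For readers unfamiliar with the weighted Euclidean model, a small sketch of the INDSCAL distances and the least-squares loss referred to above. This illustrates the model equations only, not the authors' K-INDSCAL estimation algorithm; function names are ours:

```python
import numpy as np

def indscal_distances(X, W):
    """Weighted Euclidean (INDSCAL) model distances.
    X: (n, r) common configuration; W: (H, r) nonnegative dimension
    weights, one row per occasion.  Returns an (H, n, n) distance array
    with d[h, i, j] = sqrt(sum_a W[h, a] * (X[i, a] - X[j, a])**2)."""
    diff2 = (X[:, None, :] - X[None, :, :]) ** 2        # (n, n, r)
    return np.sqrt(np.einsum('ha,ija->hij', W, diff2))

def stress(D_obs, X, W):
    """Least-squares loss between observed dissimilarities and model
    distances -- the quantity the fitting procedure minimizes (K-INDSCAL
    additionally constrains W through K clusters of occasions)."""
    return float(((D_obs - indscal_distances(X, W)) ** 2).sum())
```

Occasion weights stretch or shrink the common dimensions: doubling the weight on one axis doubles that axis's contribution to the squared distance for that occasion.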
− 17 −
Weighting and Selecting Features<br />
in Fuzzy Clustering<br />
Christian Borgelt<br />
European Center for Soft Computing<br />
c/ Gonzalo Gutiérrez Quirós s/n, 33600 Mieres, Spain<br />
christian.borgelt@softcomputing.es<br />
Abstract. A serious problem in distance-based clustering is that the more dimensions<br />
(attributes) a dataset has, the more the distances between data points—and<br />
thus also the distances between data points and constructed cluster centers—tend to<br />
become uniform. This, of course, impedes the effectiveness of clustering, as distance-based<br />
clustering exploits that these distances differ. In addition, in practice often<br />
only a subset of the available attributes is relevant for forming clusters, even though<br />
this may not be known beforehand. In such cases it is desirable to have a clustering<br />
algorithm that automatically weights the attributes or even selects a proper subset.<br />
In this contribution I study the problem of weighting and selecting features in<br />
clustering and in particular in fuzzy clustering. Apart from reviewing straightforward<br />
modifications of Gustafson–Kessel fuzzy clustering (Gustafson and Kessel 1979) and<br />
attribute weighting fuzzy clustering (Keller and Klawonn 2000) that lead to simple,<br />
but effective attribute weighting schemes, I introduce a new feature selection method<br />
by applying the idea of an alternative to the fuzzifier (Klawonn and Höppner 2003)<br />
to the latter scheme. The resulting combined feature weighting and selection method<br />
has the advantage that the obtained clustering result on the chosen subspace coincides<br />
with the projection of the result obtained on the full data space. Finally I<br />
discuss an extension of this scheme to principal axes selection.<br />
Key words: fuzzy clustering, feature weighting, feature selection<br />
References<br />
1. Gustafson, E.E. and Kessel, W.C. (1979): Fuzzy Clustering with a Fuzzy Covariance<br />
Matrix. Proc. IEEE Conf. on Decision and Control (CDC 1979, San<br />
Diego, CA), 761–766. IEEE Press, Piscataway, NJ, USA.<br />
2. Keller, A. and Klawonn, F. (2000): Fuzzy Clustering with Weighting of Data<br />
Variables. Int. Journal of Uncertainty, Fuzziness and Knowledge-based Systems,<br />
8, 735–746. World Scientific, Hackensack, NJ, USA.<br />
3. Klawonn, F. and Höppner, F. (2003): What is Fuzzy about Fuzzy Clustering?<br />
Understanding and Improving the Concept of the Fuzzifier. Proc. 5th Int.<br />
Symposium on Intelligent Data Analysis (IDA 2003, Berlin, Germany), 254–<br />
264. Springer-Verlag, Berlin, Germany.<br />
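A compact sketch of the attribute-weighting idea reviewed above, in the spirit of Keller and Klawonn (2000): alternate the usual fuzzy c-means updates with an update of one global weight per attribute, so that attributes with large within-cluster dispersion are down-weighted. This is illustrative code under our own simplifications, not the author's combined weighting-and-selection scheme:

```python
import numpy as np

def weighted_fcm(X, K, m=2.0, t=2.0, iters=50):
    """Attribute-weighted fuzzy c-means sketch.  m is the fuzzifier, t the
    weighting exponent; alpha holds one weight per attribute (summing to 1)."""
    n, d = X.shape
    # simple deterministic spread initialization (an assumption of this sketch)
    centers = X[np.linspace(0, n - 1, K).astype(int)].astype(float)
    alpha = np.full(d, 1.0 / d)
    for _ in range(iters):
        # attribute-weighted squared distances to the centers
        d2 = ((alpha ** t) * (X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) + 1e-12
        # standard fuzzy c-means membership update
        u = 1.0 / ((d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1))).sum(axis=2)
        um = u ** m
        centers = (um.T @ X) / um.sum(axis=0)[:, None]
        # per-attribute within-cluster dispersion, then weight update
        s = (um[:, :, None] * (X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=(0, 1)) + 1e-12
        alpha = 1.0 / ((s[:, None] / s[None, :]) ** (1.0 / (t - 1))).sum(axis=1)
    return centers, u, alpha
```

On data with one cluster-defining attribute and one pure-noise attribute, the weight of the informative attribute dominates.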
− 18 −
Hidden Markov Model Based<br />
Classification of Natural Objects<br />
in Aerial Pictures<br />
Mohamed El Yazid Boudaren 1 , Abdenour Labed 1 , Adel Aziz Boulfekhar 1 ,<br />
and Yacine Amara 1<br />
Military Polytechnic School, Algiers boudarenyazid@hotmail.com<br />
Abstract. This work is part of a more global one that consists in creating virtual<br />
environments from aerial pictures combined with altimetry data. In such environments,<br />
while getting too close to the ground, one has to solve the problem of limited<br />
texture resolution. So, these textures have to be amplified to get more realistic<br />
scenes. Texture amplification must take account of object nature. This paper deals<br />
with the supervised classification of picture pixels in order to amplify texture resolution.<br />
For this purpose, we propose a hidden Markov model based approach that takes<br />
into account the spatial dependencies between natural objects present in the area<br />
of interest. HMMs have long been used to efficiently model one-dimensional data,<br />
in particular in speech recognition systems. In theory, HMMs can be applied as<br />
well to multi-dimensional data. However, the complexity of the algorithms grows<br />
exponentially in higher dimensions, so that, even in dimension 2, the usage of plain<br />
HMM becomes prohibitive in practice. To overcome the 2D-HMM complexity, we<br />
propose a two-level HMM, where the higher layer comprises one unique HMM consisting<br />
of super-states, each associated with one low-level HMM. Our model differs<br />
from the classic embedded HMM in that it deals with pixel blocks instead of pixel lines<br />
as elementary symbols. Another difference is that our high-level HMM is ergodic;<br />
this enables our model to accurately capture spatial dependencies between natural<br />
objects. The training of our HMM models is done in two steps: first, the low-level<br />
HMMs are trained on unitextured pictures. Second, the high-level one is trained<br />
on multitextured pictures of the same region using the parameters of the HMMs of the<br />
first step, according to the Baum-Welch algorithm with slight modifications. For our<br />
experiments, we used real-world aerial pictures of a relatively large area, with a resolution<br />
of 50 centimeters. Our results were then used to generate a virtual interactive<br />
3D scene. This showed that our classifier was able to satisfactorily reproduce the<br />
original terrain.<br />
Key words: Hidden Markov models, Aerial pictures supervised classification, texture<br />
recognition<br />
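The abstract's starting point — that the machinery making one-dimensional HMMs efficient scales as O(T·N²) while plain two-dimensional generalizations blow up — can be illustrated with a minimal sketch of the forward algorithm. This is illustrative only and does not reproduce the authors' two-level, block-based model:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm for a discrete HMM, returning P(obs | model)
    in O(T * N^2) time.  pi: (N,) initial distribution; A: (N, N)
    transition matrix; B: (N, M) emission matrix; obs: symbol indices."""
    alpha = pi * B[:, obs[0]]          # initialize with the first emission
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate, then emit
    return float(alpha.sum())
```

A quick sanity check: summed over every possible observation sequence of a fixed length, the likelihoods must add up to 1.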
− 19 −
On optimistic bias in reporting<br />
microarray-based classification accuracy<br />
Anne-Laure Boulesteix 1 and Martin Slawski 1<br />
1 Sylvia Lawry Centre for Multiple Sclerosis Research, Hohenlindenerstr. 1,<br />
81677-München, Germany, boulesteix@slcmsr.org<br />
Abstract. Almost all published medical studies present positive research results.<br />
In the special case of microarray studies, which often focus on, e.g., the identification<br />
of differentially expressed genes or the construction of outcome prediction<br />
rules, it means that almost all studies report at least a few significant differentially<br />
expressed genes or a small prediction error, respectively. Authors are virtually urged<br />
to “find something significant” in their data, which encourages the publication of<br />
wrong research findings due to multiple comparison effects. If authors try a large<br />
number of different analysis methods and designs on their data, they are likely to<br />
obtain “acceptable results” with at least one of them. Microarray-based class prediction<br />
is particularly affected by this problem. Whereas logistic regression is routinely<br />
applied as the standard class prediction approach in the simple case where only a<br />
small number of predictors are available, there is no consensus on the procedure to<br />
be applied for classification using high-dimensional microarray data.<br />
It is well-known that, if several statistical methods are tried on the same microarray<br />
data set, one should report all results, not only the best ones (Dupuy and<br />
Simon, 2007). Through simulations and real data studies, we address this problem<br />
quantitatively and determine the effect of not respecting this “good practice” rule.<br />
Our approach consists of applying a large number of well-known classifiers combined<br />
with several variable selection procedures and different numbers of selected variables,<br />
and evaluating them following different schemes (see Boulesteix et al. 2008 for an<br />
overview). The considered data sets are real publicly available microarray data sets,<br />
with or without random permutation of the class labels. The output of our study is<br />
the distribution of the minimally selected error rate and the bias resulting from this<br />
optimal selection in the different settings.<br />
References<br />
Boulesteix, A.-L., Strobl, C., Augustin, T. and Daumer, M. (2008): Evaluating<br />
microarray-based classifiers: An overview. Cancer Informatics, 4.<br />
Dupuy, A. and Simon, R. (2007): Critical Review of Published Microarray Studies for<br />
Cancer Outcome and Guidelines on Statistical Analysis and Reporting. Journal<br />
of the National Cancer Institute, 99, 147–157.<br />
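The effect studied above can be reproduced in a toy simulation: on data whose labels are pure noise (so every classifier's true error is 50%), reporting only the smallest cross-validated error over many analysis variants is optimistically biased. In the sketch below, a nearest-centroid rule on random feature subsets stands in for the classifier and variable-selection combinations of the study; all names and parameters are ours:

```python
import numpy as np

def optimistic_bias_demo(n=40, p=50, n_methods=20, folds=5, seed=0):
    """Compare the minimum cross-validated error over many 'methods'
    with the average error, on label-permuted (pure noise) data."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    y = np.tile([0, 1], n // 2)            # balanced labels, independent of X
    errors = []
    for _ in range(n_methods):
        feats = rng.choice(p, 5, replace=False)   # one 'analysis variant'
        idx = rng.permutation(n)
        wrong = 0
        for f in range(folds):                    # simple cross-validation
            test = idx[f::folds]
            train = np.setdiff1d(idx, test)
            Xt, yt = X[train][:, feats], y[train]
            c0, c1 = Xt[yt == 0].mean(0), Xt[yt == 1].mean(0)
            d0 = np.linalg.norm(X[test][:, feats] - c0, axis=1)
            d1 = np.linalg.norm(X[test][:, feats] - c1, axis=1)
            wrong += ((d1 < d0).astype(int) != y[test]).sum()
        errors.append(wrong / n)
    return min(errors), float(np.mean(errors))
```

The gap between the minimum and the mean is exactly the selection bias the abstract quantifies: the best-of-many error understates the true 50% error of every individual method.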
− 20 −
Practical experiences from Credit Scoring<br />
projects for Chilean financial organizations<br />
Cristian Bravo, Sebastian Maldonado and Richard Weber<br />
Department of Industrial Engineering, University of Chile.<br />
cbravo@dii.uchile.cl, semaldon@ing.uchile.cl, rweber@dii.uchile.cl<br />
Abstract. All financial organizations that offer loans to their customers face the<br />
problem of determining whether the loaned money will be returned. Credit scoring systems<br />
have been successfully applied to determine the probability that a certain customer<br />
will fail to pay back the received credit. In many cases these systems are based<br />
on experience but offer a "closed solution" in which the user has few possibilities to<br />
influence the decision process.<br />
We have developed credit scoring systems for several Chilean financial organizations<br />
mapping the KDD process (Knowledge Discovery in Databases) to their<br />
special needs. This paper presents our experiences from these projects and explains<br />
in detail how we solved the problems in each step of the KDD process.<br />
In the data mining step we applied Logistic Regression and Support Vector<br />
Machines as classification techniques, comparing both their performance and their<br />
flexibility. A particular wrapper approach for feature selection using Support Vector<br />
Machines has been developed. Comparing this approach with alternative schemes<br />
underlines its strengths in terms of classification performance and selected features.<br />
Since most KDD projects propose just static solutions we had to develop a<br />
module for model updating that will be described in detail. In particular we propose<br />
to apply statistical techniques in order to determine changes in feature weights and<br />
structural changes in the respective universe.<br />
During the development of our solutions the users gained important insights into<br />
their customers' behavior; some of these were surprising, others merely confirmed notions<br />
the respective experts had held before. By using the systems in daily operation, the<br />
rates of false positives as well as false negatives could be reduced, leading to higher<br />
coverage of the respective market.<br />
Key words: Credit scoring, Classification, Support Vector Machines.<br />
References<br />
Famili, A., Shen, W.-M., Weber, R., Simoudis, E. (1997): Data Preprocessing and<br />
Intelligent Data Analysis. Intelligent Data Analysis 1, No. 1, 3-23.<br />
− 21 −
Analyzing the Stability of Price Response<br />
Functions - Measuring the Influence of<br />
Different Parameters in a Monte Carlo<br />
Comparison<br />
Michael Brusch and Daniel Baier<br />
Institute of Business Administration and Economics,<br />
Brandenburg University of Technology Cottbus, Postbox 101344,<br />
D-03013 Cottbus, Germany {m.brusch|daniel.baier}@tu-cottbus.de<br />
Abstract. The usage, and therefore the estimation, of price response functions (see,<br />
e.g., Steiner et al. 2007) is very important for strategic marketing decisions. Typically,<br />
price response functions with an empirical basis are used (see, e.g., Balderjahn<br />
1998). However, such price response functions are subject to many disturbing influence<br />
factors, e.g. the assumed profit-maximal price and the assumed corresponding<br />
quantity of sales.<br />
In such cases, the question of how stable the estimated price response function is<br />
has not been answered sufficiently up to now. In this paper, we pursue the question of how<br />
large (and what kind of) errors in market research are tolerable for a stable price<br />
response function. Innovative technologies and systems of house power engineering<br />
are used as an application example (see Brusch et al. 2003). For the comparisons, a<br />
factorial design with synthetically generated and disturbed data is used.<br />
Key words: Monte Carlo comparison, Price response functions<br />
References<br />
Balderjahn, I. (1998): Empirical analysis of price response functions. In: I. Balderjahn,<br />
C. Mennicken, E. Vernette (Eds.): New Developments and Approaches in<br />
Consumer Behavior Research. Schäffer-Poeschel/Macmillan, 185–200.<br />
Brusch, M., Zühlsdorff, D., Baier, D. and Kessler, A. (2003): Neue Technologien und<br />
erneuerbare Energiequellen auf dem Vormarsch. Energiewirtschaftliche Tagesfragen,<br />
53, 12, 825–829.<br />
Steiner, W. J., Brezger, A. and Belitz, Ch. (2007): Flexible estimation of price<br />
response functions using retail scanner data. Journal of retailing and consumer<br />
services, 14, afl. 6 (11), 383–393.<br />
− 22 −
Motif-based Classification of Time Series with<br />
Bayesian Networks and SVMs<br />
Krisztian Antal Buza 1 and Lars Schmidt-Thieme 1<br />
University of Hildesheim, Information Systems and Machine Learning Lab<br />
{buza,schmidt-thieme}@ismll.uni-hildesheim.de<br />
Abstract. Classification of time series is of crucial importance in a wide range of<br />
applications. One possible solution for this problem is based on characteristic<br />
local patterns of time series, so-called motifs [Patel 2002].<br />
We present a novel technique to make the classification of (multivariate) time<br />
series more accurate. We define different types of motifs. The simplest ones are frequent<br />
subseries. In the case of noisy time series, as well as in several application domains, these<br />
simple motifs are not sufficient; more complex ones are necessary. Complex motifs<br />
used in our work may consist of several subseries, continuous and non-continuous<br />
parts, and "joker" parts. We present an efficient algorithm for mining complex motifs in<br />
time series. We extend the highly efficient implementation of the Apriori algorithm<br />
described in [Borgelt 2003] to our task.<br />
We evaluate our method on real medical data, which consists of time series of<br />
dialysis sessions. We compare different types of motifs according to their ability to<br />
predict the class of (multivariate) time series. We show that additional<br />
motif features significantly improve the accuracy of Bayesian Networks and Support<br />
Vector Machines for the classification of time series.<br />
Key words: Time Series, Complex motifs, Bayesian Networks, SVM<br />
References<br />
Borgelt, C. (2003): Efficient Implementations of Apriori and Eclat. 1st Workshop of<br />
Frequent Item Set Mining Implementations (FIMI 2003, Melbourne, FL, USA).<br />
Kunik, V., Solan, Z., Edelman, S., Ruppin, E. and Horn, D. (2005): Motif<br />
Extraction and Protein Classification. IEEE Computational Systems Bioinformatics<br />
Conference (CSB'05), pp. 80-85.<br />
Ferreira, P. G. and Azevedo, P. (2005): Protein Sequence Classification through Relevant<br />
Sequence Mining and Bayes Classifiers. Proceedings of the 12th Portuguese<br />
Conference on Artificial Intelligence, pp. 236-247, LNAI 3808, Springer-Verlag.<br />
Patel, P., Keogh, E., Lin, J. and Lonardi, S. (2002): Mining Motifs in Massive<br />
Time Series Databases. Proceedings of the 2002 IEEE International Conference<br />
on Data Mining (ICDM 2002).<br />
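The simplest motif type mentioned above, frequent contiguous subseries, can be sketched in a few lines, together with the derived binary motif features that could augment the inputs of a Bayesian network or SVM. This is an illustration under the assumption that the series is already discretized into symbols; the paper's complex motifs with non-continuous and "joker" parts are not covered, and all names are ours:

```python
from collections import Counter

def frequent_motifs(series, length, min_count):
    """Mine the simplest motif notion: contiguous subseries of the given
    length occurring at least min_count times in a symbol sequence."""
    counts = Counter(
        tuple(series[i:i + length]) for i in range(len(series) - length + 1)
    )
    return {m: c for m, c in counts.items() if c >= min_count}

def motif_features(series, motifs, length):
    """Binary feature vector ('does motif occur in this series?') used to
    augment the inputs of a downstream classifier."""
    present = {tuple(series[i:i + length]) for i in range(len(series) - length + 1)}
    return [int(m in present) for m in motifs]
```

Mining complex, gapped motifs would replace the sliding-window counter with an Apriori-style candidate generation over subseries combinations, as the abstract describes.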
− 23 −
Visualizing data in Gaussian mixture model<br />
classification<br />
Daniela G. Calo’ and Cinzia Viroli<br />
Department of Statistics - University of Bologna<br />
via Belle Arti, 41 - 40126 Bologna, Italy<br />
danielagiovanna.calo@unibo.it, cinzia.viroli@unibo.it<br />
Abstract. The paper presents a post-processing strategy for producing low-dimensional<br />
summary plots of the data after a Gaussian mixture classification model has been<br />
fitted. The most revealing projections are those along which the class-conditional<br />
densities are maximally separable. We consider a particular probability product kernel<br />
as a measure of similarity or affinity between class-conditional distributions. It<br />
takes an appealing closed form in the case of Gaussian mixture components. The<br />
performance of the proposed strategy has been evaluated on simulated and real data.<br />
Key words: Gaussian mixture models, Low-dimensional plots, Normalized expected<br />
likelihood kernel, Bayes error<br />
References<br />
Chan, A.B., Vasconcelos, N. and Moreno, P.J. (2004): A Family of Probabilistic<br />
Kernels Based on Information Divergence. Technical Report, University of California,<br />
San Diego.<br />
Jebara, T. and Kondor, R. (2004): Probability Product Kernels. Journal of Machine<br />
Learning Research, 5, 819–844.<br />
McLachlan, G.J. and Peel, D. (2000): Finite Mixture Models. Wiley, New York.<br />
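The "appealing closed form" mentioned in the abstract can be made concrete: for the probability product kernel with ρ = 1 (the expected likelihood kernel), the kernel between two Gaussians is itself a Gaussian density, N(μ1; μ2, Σ1 + Σ2), and the kernel between two Gaussian mixtures is the weighted sum of pairwise component kernels. A sketch (function names are ours, not from the paper):

```python
import numpy as np

def expected_likelihood_kernel(mu1, S1, mu2, S2):
    """Probability product kernel with rho = 1 between two Gaussians:
    the closed form is the Gaussian density N(mu1; mu2, S1 + S2)."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    S = np.asarray(S1, float) + np.asarray(S2, float)
    d = len(mu1)
    diff = mu1 - mu2
    expo = -0.5 * diff @ np.linalg.solve(S, diff)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(S))
    return float(norm * np.exp(expo))

def mixture_class_kernel(w1, mus1, Ss1, w2, mus2, Ss2):
    """Kernel between two Gaussian mixtures (e.g. two class-conditional
    densities): the weighted sum of pairwise component kernels."""
    return sum(a * b * expected_likelihood_kernel(m1, S1, m2, S2)
               for a, m1, S1 in zip(w1, mus1, Ss1)
               for b, m2, S2 in zip(w2, mus2, Ss2))
```

Projections along which this affinity between the class-conditional mixtures is small are the maximally separating directions the summary plots aim for.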
− 24 −
A novel approach to construct discrete support<br />
vector machine classifiers<br />
Marco Caserta and Stefan Lessmann<br />
Institute of Information Systems<br />
University of Hamburg, Germany<br />
Abstract. The support of managerial decision making by means of data mining<br />
has received considerable attention in the academic literature as well as corporate<br />
practice. This paper considers support vector machines (SVMs) which represent a<br />
popular classification method that may be used in data mining to, e.g., guide the selection<br />
of customers for a direct marketing campaign or assess the credibility of loan<br />
applications. Recently, Orsenigo and Vercellis proposed a<br />
novel discrete support vector machine (DSVM) and demonstrated its effectiveness in<br />
several empirical studies. Building a respective classifier involves solving an integer<br />
program which is a challenging computational task in general and in large-scale data<br />
mining settings in particular. This paper strives to improve upon a linear programming<br />
based heuristic, originally proposed by Orsenigo and Vercellis for solving the<br />
DSVM program. The core of the suggested procedure consists of a recursive algorithm<br />
that solves the (linear) relaxation of DSVM and exploits dual information to<br />
construct a smaller sized sub-program with integer constraints that may be solved<br />
to optimality. The sequence of linear and integer programs solved during the course<br />
of the algorithm provides upper and lower bounds of the final solution which are<br />
employed as termination criterion. Empirical experiments are conducted to scrutinize<br />
the suitability of the proposed procedure and examine the problem size (i.e.<br />
the number of examples and features) that can be processed with state-of-the-art<br />
integer programming techniques.<br />
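The bounding principle behind such relaxation-based heuristics can be illustrated on a much smaller integer program than the DSVM itself; the sketch below (not the authors' algorithm) uses a 0/1 knapsack, whose LP relaxation happens to be solved exactly by a greedy fractional fill:

```python
from itertools import combinations

def knapsack_bounds(values, weights, capacity):
    """Upper bound from the LP relaxation (greedy fractional fill) and
    lower bound from the best integer solution (brute force, small n).
    For a maximization problem the relaxation bounds from above."""
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    upper, room = 0.0, float(capacity)
    for i in order:
        take = min(1.0, room / weights[i])   # fractional amount of item i
        upper += take * values[i]
        room -= take * weights[i]
        if room <= 0:
            break
    best, n = 0, len(values)
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            if sum(weights[i] for i in subset) <= capacity:
                best = max(best, sum(values[i] for i in subset))
    return best, upper   # feasible integer value <= optimum <= LP bound

lo, up = knapsack_bounds([60, 100, 120], [10, 20, 30], 50)
```

When the gap between the two bounds closes, the procedure can terminate with a proven (near-)optimal solution; this is the role the LP/IP sequence plays in the heuristic described above.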
− 25 −
Modeling the Classification of Heterogeneous<br />
Data<br />
Dorin Carstoiu, Tudor Ionescu, and Alexandra Cernian<br />
University of Bucharest<br />
Faculty of Automatic Control and Computer Science
Bucharest, Romania<br />
{dorin.carstoiu,tudor.ionescu,alexcernian}@yahoo.com<br />
Abstract. The goal of this work is to study the feasibility of a Heterogeneous Data
Classification and Search (HDCS) system and to provide a possible design for its
implementation. In order to design an HDCS system we propose an actor-oriented
modeling technique, for which we show the information flow. We have identified 6
different actors (subsystems) which collaborate to construct a file sheet and produce
the final search result. The first 5 actors add information to the file sheet, which is
afterwards used by the final actor to produce the desired result.
Given the vast quantity of data and the variety of formats and encodings in which
it exists, a semantic approach based on metadata has been chosen. Instead of digging
into the actual data to extract information, we use the context of the file to collect its
metadata. The metadata is afterwards used for the classification process. The reason
for this approach is that data are made available by people who are interested in
other people understanding what the respective data are about. This observation
provided the confidence needed to pursue the presented approach.
The HDCS system we propose combines techniques from conventional search
systems, classification systems, and search-results clustering systems, while also
providing original solutions, such as an innovative data sampling method.
− 26 −
Applying Rough Set Theory to Constructing<br />
Knowledge Base for Critical Military<br />
Commodity Management<br />
Hua-Kai Chiou, 1 Yong-Ting Huang 2 and Gia-Shie Liu 3<br />
1 Department of International Business, China Institute of Technology. 245 Sec.3,<br />
Academia Rd., Nangang Taipei 11581, Taiwan. hkchiou@cc.chit.edu.tw<br />
2 Graduate School of Resources Management and Decision Science,National<br />
Defense University. 70, Sec.2, Central North Rd., Beitou Taipei 11258 Taiwan.<br />
coby777.tw@yahoo.com.tw<br />
3 Department of Information Management,Lunghwa University of Science and<br />
Technology. 300, Sec.1, Wanshou Rd., Gueishan Shiang, Taoyuan County 33306,<br />
Taiwan. liugtw@yahoo.com.tw<br />
Abstract. Reduction of pattern dimensionality via feature extraction and feature<br />
selection belongs to the most fundamental steps in data preprocessing. Feature selection<br />
is a valuable technique in data analysis for information-preserving data reduction.<br />
Features constituting the object's pattern may be relevant or irrelevant. A large
number of methods, like discriminant analysis, logit analysis, recursive partitioning
algorithms, etc., have been used in the past for prediction problems in pattern
recognition. These traditional approaches suffer from some limitations, often due to
unrealistic statistical assumptions or to a confusing language of communication
with the decision makers. In this paper, we present applications of rough set
methods for feature selection in pattern recognition. Firstly, we employ a Delphi
process to generate 39 critical attributes with respect to 6 key factors for
evaluation. We then utilize the rough set approach to discover a set of rules able to
discriminate among the considered attributes for critical military commodity
management. Ten reduct sets and 226 rules were derived from our proposed model;
some concluding remarks are stated in the final section. Through this research
we found that rough set theory is an efficient technique for pattern recognition in
solving real-world decision-making problems.
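For illustration, the core rough-set quantities (indiscernibility classes, positive region, dependency degree, and reducts) can be sketched on a toy decision table; the table and attribute values below are hypothetical and far smaller than the 39-attribute model of the paper:

```python
from itertools import combinations

# Toy decision table: each row is (condition attribute values, decision).
rows = [
    ((0, 1, 0), 'keep'),
    ((0, 1, 1), 'keep'),
    ((1, 0, 0), 'drop'),
    ((1, 0, 1), 'drop'),
    ((1, 1, 0), 'keep'),
]

def partition(attrs):
    """Indiscernibility classes induced by a subset of attribute indices."""
    classes = {}
    for idx, (cond, _) in enumerate(rows):
        classes.setdefault(tuple(cond[a] for a in attrs), []).append(idx)
    return classes.values()

def dependency(attrs):
    """Fraction of objects in the positive region: objects whose
    indiscernibility class is consistent with respect to the decision."""
    pos = sum(len(cls) for cls in partition(attrs)
              if len({rows[i][1] for i in cls}) == 1)
    return pos / len(rows)

full = dependency((0, 1, 2))
# Reducts: minimal attribute subsets preserving the full dependency degree.
reducts = [s for r in range(1, 4) for s in combinations(range(3), r)
           if dependency(s) == full
           and not any(dependency(t) == full
                       for t in combinations(s, len(s) - 1) if len(s) > 1)]
```

Decision rules can then be read off the indiscernibility classes of a reduct, since each consistent class maps a condition pattern to a single decision.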
Key words: Pattern Recognition, Rough Set Theory, Delphi Process, Discriminant<br />
Analysis<br />
− 27 −
Correspondence Analysis for Exploring the<br />
Implementation of One Village One Product<br />
Programs in Taiwan<br />
Hua-Kai Chiou, 1 Benjamin J.C. Yuan 2 and Yen-Wen Wang 2,3<br />
1 Department of International Business, China Institute of Technology. 245 Sec.3,<br />
Academia Rd., Nangang Taipei 11581, Taiwan. hkchiou@cc.chit.edu.tw<br />
2 Institute of Management of Technology, National Chiao Tung University. 1001,<br />
Ta-Hsueh Rd., Hsinchu 30010, Taiwan. benjamin@cc.nctu.edu.tw<br />
3 Industrial Economics & Knowledge Center,Industrial Technology Research<br />
Institute. 195 Sec.4, Chung Hsing Rd., Chutung Hsinchu 31040, Taiwan.<br />
stevenwang@itri.org.tw<br />
Abstract. The One Village One Product (OVOP) program is a community-centered,
demand-driven local economic development approach initiated by Oita Prefecture
in Japan in the 1970s. The uniqueness of the approach is the intention to achieve
regional economic development by adding value to products made from locally
available resources through processing, quality control and marketing. The OVOP
programs in Taiwan are supported by the government with the objective of
promoting the economic development and cooperative relationships of each county
and region through localization and innovation. Firstly, we employ a Delphi process
to converge on 18 critical factors with respect to 3 key dimensions for evaluation.
We then utilize correspondence analysis to explore the implementation of the OVOP
programs and derive some meaningful suggestions on policy direction from these
empirical cases. Through this study we successfully demonstrate that correspondence
analysis is an efficient technique for real-world industrial analysis and strategy
management.
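As a sketch of the underlying computation (not the paper's data), classical correspondence analysis can be obtained from the SVD of the standardized residuals of a contingency table:

```python
import numpy as np

def correspondence_analysis(table, k=2):
    """Classical CA of a two-way contingency table: SVD of the matrix of
    standardized residuals, returning row and column principal coordinates."""
    P = table / table.sum()
    r = P.sum(axis=1)             # row masses
    c = P.sum(axis=0)             # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    rows = U[:, :k] * sv[:k] / np.sqrt(r)[:, None]
    cols = Vt.T[:, :k] * sv[:k] / np.sqrt(c)[:, None]
    inertia = sv ** 2             # principal inertia per axis
    return rows, cols, inertia

# Illustrative table: villages (rows) by evaluation dimensions (columns).
table = np.array([[20., 10., 5.], [10., 15., 10.], [5., 10., 20.]])
rows, cols, inertia = correspondence_analysis(table)
```

The sum of the squared singular values equals the total inertia, i.e. the chi-square statistic of the table divided by the grand total.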
Key words: OVOP, Correspondence Analysis, Delphi Process, Industrial Analysis<br />
− 28 −
Extending Multivariate Planing<br />
Mario Cortina–Borja 1<br />
Centre for Paediatric Epidemiology and Biostatistics<br />
Institute of Child Health, University College London<br />
30 Guilford Street, London WC1N 1EH, UK<br />
M.Cortina@ich.ucl.ac.uk<br />
Abstract. Friedman and Rafsky (1981) introduced planing, a visualization technique
based on a triangulation procedure for constructing 2–dimensional representations
from a set of n multivariate observations: relatively few distances from the original
distance matrix are preserved exactly, and the observations are plotted in the plane
accordingly. The way these distances are selected and the order in which the
observations are positioned in the plane are induced by a minimal spanning tree
(MST) of the data.
Other spanning trees could be used to provide the set of distances to be preserved<br />
exactly in a 2–dimensional configuration. One is the exodic tree (ET ) (Gilbert,<br />
1965), which is a not quite minimal spanning tree, though it may be regarded as a<br />
close approximation to the MST (Cortina–Borja and Robinson, 2000). To construct<br />
an ET we choose any point of the dataset as the root and label it P0; we then label
the remaining n − 1 points as {P1, P2, · · · , Pn−1}, the indices being assigned so
that the points are ordered by increasing distance from P0. Next, we link each point
Pi (i ≥ 1) to the point Pj chosen from {P0, P1, · · · , Pi−1} that minimizes the
distance to Pi.
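The construction just described can be sketched directly (hypothetical coordinates; `math.dist` requires Python 3.8+):

```python
import math

def exodic_tree(points):
    """Order points by distance from a chosen root, then link each point
    to its nearest predecessor in that order (Gilbert's exodic tree)."""
    root = points[0]                # any point may serve as P0
    order = sorted(range(len(points)),
                   key=lambda i: math.dist(points[i], root))
    edges = []
    for pos in range(1, len(order)):
        i = order[pos]
        # nearest point among those already placed in the tree
        j = min(order[:pos], key=lambda k: math.dist(points[i], points[k]))
        edges.append((j, i))
    return edges

pts = [(0, 0), (1, 0), (0, 2), (1, 2.1), (5, 5)]
edges = exodic_tree(pts)
```

The result is a spanning tree with n − 1 edges whose preserved distances can then serve as the skeleton of the low-dimensional configuration.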
This paper extends two aspects of planing: first, obtaining 3–dimensional configurations;<br />
second, using the ET as the structure defining the distances to be preserved<br />
in the low–dimensional representation.<br />
Key words: Exodic Tree, Minimal Spanning Tree, Planing, Visualization<br />
References<br />
Cortina–Borja, M. and Robinson, T. (2000) Estimating the Asymptotic Constants of<br />
the Total Length of the Euclidean Minimal Spanning Tree with Power–Weighted<br />
Edges. Statistics and Probability Letters, 47, 125–128.<br />
Friedman, J.H. and Rafsky, L.C. (1981) Graphics for the Multivariate Two–Sample
Problem. Journal of the American Statistical Association, 76, 277–287.
Gilbert, E.N. (1965) Random Minimal Trees. SIAM Journal on Applied Mathematics,
13, 376–387.
− 29 −
Principal Axis Analysis with HDLSS bonuses!<br />
Frank Critchley 1 , Ana Pires 2 , and Conceição Amado 2<br />
1 Open University, UK<br />
F.Critchley@open.ac.uk<br />
2 IST, Lisbon<br />
Abstract. Principal axis analysis rotates standardised principal components to optimally<br />
detect subgroup structure, rotation being based on preferred directions in<br />
the spherised data. As such, it is a computationally efficient method of exploratory<br />
data analysis, particularly well-suited to detecting mixtures of elliptically contoured<br />
distributions. High dimensional, low sample size (HDLSS) data are also discussed.<br />
Overall, principal axis analysis exemplifies the maxim: two decompositions are better<br />
than one. More technically, it is an example of invariant coordinate selection<br />
(ICS).<br />
− 30 −
Augmenting Model-Based Clustering with<br />
Generalized Linkage methods<br />
Nema Dean 1 and Rebecca Nugent 2<br />
1 Department of Statistics, University of Glasgow, 15 University Gardens,<br />
Glasgow G12 8QW, UK. nema@stats.gla.ac.uk<br />
2 Department of Statistics, Carnegie Mellon University, Baker Hall, Pittsburgh,<br />
PA 15213, USA. rnugent@stat.cmu.edu<br />
Abstract. The fundamental assumption made by model-based clustering (Fraley<br />
and Raftery 1998) is that the groups or sub-populations underlying the data have<br />
(multivariate) Gaussian distributions, giving the overall population a finite mixture<br />
model distribution. An additional assumption is that the number and type of
components found to best fit the data are a good estimate of the number and type
of true groups in the data. Given the shape assumptions implicit in the choice of
Gaussian distributions - elliptical, symmetric contours - in cases of skewed, curved
or more generally complex-shaped groups, the equivalence of the mixture model
components and the underlying groups is likely false.
Since general continuous densities can be modelled arbitrarily well by mixtures<br />
of Gaussian densities, the mixture model chosen may still be a good estimate of<br />
the density of the data but it is likely that more than one component is identified<br />
with each group. Generalized single linkage methods (Stuetzle and Nugent 2007)<br />
use density estimates to create density-based similarity (or dissimilarity) measures<br />
which can then be used as a replacement for Euclidean (or other types of) distance<br />
in hierarchical agglomerative methods. Applying this to the density estimate from
the model-based clustering, we can use the resulting dendrogram to visualize the
hierarchical structure of the components of the mixture model and make decisions
about combining components to estimate groups. Since it is difficult to easily summarize
information about complex shaped groups, offering a summary that is essentially<br />
a subset of components of the original mixture model with means and covariance<br />
matrices is an attractive alternative.<br />
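The agglomeration step can be sketched generically; the dissimilarity matrix below is a hypothetical stand-in (e.g. distances between fitted component means), whereas the paper uses a density-based measure:

```python
def single_linkage(dissim):
    """Agglomerative single-linkage clustering from a dissimilarity matrix;
    returns the merge history: which cluster pair merged, at what height."""
    clusters = {i: {i} for i in range(len(dissim))}
    history = []
    while len(clusters) > 1:
        # single linkage: cluster distance = minimum pairwise dissimilarity
        (a, b), h = min(
            (((a, b), min(dissim[i][j]
                          for i in clusters[a] for j in clusters[b]))
             for a in clusters for b in clusters if a < b),
            key=lambda t: t[1])
        clusters[a] |= clusters.pop(b)
        history.append(((a, b), h))
    return history

# Stand-in dissimilarities between four fitted mixture components.
D = [[0, 1, 6, 7],
     [1, 0, 5, 6],
     [6, 5, 0, 2],
     [7, 6, 2, 0]]
merges = single_linkage(D)
```

Cutting the resulting dendrogram at a chosen height yields the decision about which components to combine into estimated groups.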
Key words: Model-Based Clustering, Generalized Single Linkage Clustering<br />
References<br />
Fraley, C. and Raftery, A. E. (1998): How many clusters? Which clustering method?
Answers via model-based cluster analysis. The Computer Journal, 41, 578–588.
Stuetzle, W. and Nugent, R. (2007): A generalized single linkage method for estimating<br />
the cluster tree of a density. Technical Report 514, Department of Statistics,<br />
University of Washington.
− 31 −
Statistical analysis of human body movement<br />
and group interactions in response to music<br />
Frank Desmet, Marc Leman and Micheline Lesaffre<br />
IPEM, Department of Musicology, Ghent University, Belgium fm.desmet@ugent.be<br />
Abstract. The quantification of time series that relate to physiological data is a<br />
challenging research topic for music research. Up to now, most studies have focused<br />
on time dependent responses of individual subjects. However, little is known about<br />
time dependent responses of between-subject interactions. At IPEM, Ghent University,<br />
a large scale multidisciplinary research project targets the development of<br />
innovative music interaction based on the movement of groups of subjects. Based on<br />
a recent pilot experiment, we report new findings concerning the statistical analysis<br />
of group synchronicity in response to musical stimuli. The aim was to refine future<br />
experimental designs and to generate statistical pathways as practical guidelines for<br />
researchers. The experiment was carried out in the context of the ACCENTA 2007
Fair in Ghent. 16 groups of 4 subjects took part in an experiment where they had
to move a wireless Wii sensor in response to music. In the first condition, the subjects
were blindfolded, while in the second condition, the subjects could see each
other. The movements of the subjects were recorded as acceleration data on a PC.
Fourier coefficients, total intensity, intra-group correlations and sample entropy
characteristics were derived from these raw acceleration data. Combined with pre-
and post-survey data of the participants, we generated a multivariate dataset for
analysis. The statistical methods used in this study are basic descriptive statistics,<br />
paired correlation analysis, auto correlation analysis, regression analysis and GLM<br />
(general linear modeling). The different empirical methodologies were validated as<br />
potential tools for the study of social embodied music interaction. It was found that<br />
the synchronicity of the human-human interactions increases significantly in the<br />
social context. The type of music is the predominant factor for the human-music<br />
interaction in both the individual and the social context.<br />
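A simple synchronicity index of the kind used in such analyses can be sketched as the mean pairwise correlation within a group (the four series below are hypothetical):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def group_synchronicity(signals):
    """Mean pairwise correlation of the subjects' movement signals:
    a simple index of how synchronously a group moves."""
    pairs = [(i, j) for i in range(len(signals))
             for j in range(i + 1, len(signals))]
    return sum(pearson(signals[i], signals[j]) for i, j in pairs) / len(pairs)

# Four hypothetical movement-intensity series for one group of subjects.
group = [[1, 2, 3, 4, 5],
         [1.1, 2.0, 2.9, 4.2, 5.1],
         [2, 4, 6, 8, 10],
         [1, 2, 3, 4, 5.5]]
sync = group_synchronicity(group)
```

Comparing such an index between the blindfolded and the seeing condition is one way to quantify the reported increase in synchronicity in the social context.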
Key words: Human Movement, Social Interaction, Statistical Analysis, Music
− 32 −
Mixture Hidden Markov Models in Finance<br />
Research<br />
José G. Dias 1 , Jeroen K. Vermunt 2 and Sofia Ramos 3<br />
1 Department of Quantitative Methods, ISCTE – Higher Institute of Social
Sciences and Business Studies, Edifício ISCTE, Av. das Forças Armadas,
1649–026 Lisboa, Portugal, jose.dias@iscte.pt
2 Department of Methodology and Statistics, Tilburg University, P.O. Box 90153,
5000 LE Tilburg, The Netherlands, J.K.Vermunt@uvt.nl
3 Department of Finance, ISCTE – Higher Institute of Social Sciences and
Business Studies, Edifício ISCTE, Av. das Forças Armadas, 1649–026 Lisboa,
Portugal, sofia.ramos@iscte.pt
Abstract. Latent class or finite mixture modeling has proven to be a powerful<br />
tool for analyzing unobserved heterogeneity in a wide range of applications (see, for<br />
example, McLachlan and Peel (2000) or Dias and Vermunt (2007)). We introduce
into finance research the mixture hidden Markov model (MHMM), which takes into
account both time-constant unobserved heterogeneity between time series and
hidden regimes within them. This approach is flexible in the sense that it can deal
with the specific features of financial time series data, such as asymmetry, kurtosis,
and unobserved heterogeneity, aspects that are almost always ignored in finance
research. The methodology is applied to model simultaneously 12 time series of the
returns of Asian stock markets. Because we selected a heterogeneous sample of
countries including both developed and emerging countries, we expect that
heterogeneity in market returns due to country idiosyncrasies will show up in the
results. The best fitting model was the one with two latent classes or clusters at the
country level, which clearly differ in the switching dynamics between the two regimes.
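The within-series regime component of such models rests on standard hidden Markov machinery; below is a minimal sketch of the scaled forward recursion for a discrete-observation HMM (hypothetical two-regime parameters, not the fitted model of the paper):

```python
import numpy as np

def forward_loglik(pi, A, B, obs):
    """Log-likelihood of a discrete-observation HMM via the scaled forward
    recursion; pi: initial state probabilities, A: transition matrix,
    B[state, symbol]: emission probabilities, obs: observed symbol indices."""
    alpha = pi * B[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate, then weight by emission
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()            # rescale to avoid underflow
    return loglik

# Hypothetical two-regime model: a calm and a turbulent market state
# emitting discretized returns {down, flat, up}.
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.1, 0.6, 0.3], [0.4, 0.2, 0.4]])
ll = forward_loglik(pi, A, B, [1, 1, 0, 2])
```

In a mixture HMM, one such likelihood is computed per latent class and the class-specific likelihoods are combined with the class weights.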
Key words: latent class model, finite mixture model, hidden Markov model,
model-based clustering, stock indexes
References<br />
Dias, J.G., Vermunt, J.K. (2007): Latent class modeling of website users’ search<br />
patterns: Implications for online market segmentation. Journal of Retailing and<br />
Consumer Services, 14(6), 359–368.<br />
McLachlan, G.J., Peel, D. (2000): Finite Mixture Models. John Wiley & Sons, New<br />
York.<br />
− 33 −
Mapping Findspots of Roman Military<br />
Brickstamps in Mogontiacum (Mainz)<br />
and Archaeometrical Analysis<br />
Jens Dolata 1 , Hans-Joachim Mucha 2 , and Hans-Georg Bartel 3<br />
1 Generaldirektion Kulturelles Erbe Rheinland-Pfalz, Direktion Archäologie<br />
Mainz, Große Langgasse 29, D-55116 Mainz, Germany,<br />
dolata@ziegelforschung.de<br />
2 Weierstraß-Institut für Angewandte Analysis und Stochastik (WIAS),<br />
D-10117 Berlin, Germany, mucha@wias-berlin.de<br />
3 Institut für Chemie, Humboldt-Universität zu Berlin, Brook-Taylor-Straße 2,<br />
D-12489 Berlin, Germany, hg.bartel@yahoo.de<br />
Abstract. 1775 Roman military brickstamps dating to the 1st century A.D.
have been found in archaeological excavations in Mainz, the ancient Mogontiacum.
In the course of cataloguing them for a paper on Roman military archaeology, the
stamps have been classified and new types of stamps have been defined. All in all, 238
findspots of bricks and tiles of the 1st century have been investigated. Additionally,<br />
the findspots are described by survey-coordinates. The mapping of the brickstamps<br />
visualizes the size of the ancient city and gives details for the localization of military<br />
camps and of civil settlement. Dating the brickstamps by epigraphical investigation<br />
or by assigning them to a military brickyard based on geochemical analysis allows<br />
the mapping of different periods. Two main maps have been plotted: (A) The earliest<br />
brickstamps found in Mainz are from the period of Emperors Claudius and Nero (41 -<br />
68 A.D., n = 932). They were manufactured by soldiers of legiones XXII Primigenia<br />
and IIII Macedonica. (B) Brickstamps of legiones I Adiutrix, XIV Gemina, VII<br />
Gemina, and XXI Rapax belong to the Flavian period (69 - 96 A.D., n = 843).<br />
These two main maps can be compared with some maps showing a selection<br />
of brickstamps from 3rd and 4th centuries (Emperors Caracalla, Constantine I or<br />
Julian, and Valentinian I, n = 102). Thus the maps show a total of 1877 brickstamps<br />
from 246 sites. In this paper we try to improve the evaluation of all these maps for
the history of settlement and urban development. Using statistical methods we
compare the different entries of the maps. The single entries are not of equal weight,
depending on the radius of the finding area. These different weights can be taken
into account in statistical analysis, for instance, in nonparametric density
estimation. The intention is to obtain a mapping of densities of brickstamps for
different urban regions. By combining mapping with the search for dated brickstamps
there is a good chance of obtaining new sources for the ancient history of Mogontiacum.
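A weighted kernel density estimate of the kind described can be sketched as follows (hypothetical coordinates and weights):

```python
import math

def weighted_kde(points, weights, query, bandwidth):
    """Weighted Gaussian kernel density estimate at a 2-d query location;
    findspots with a large search radius would receive smaller weights."""
    total_w = sum(weights)
    dens = 0.0
    for (x, y), w in zip(points, weights):
        d2 = (query[0] - x) ** 2 + (query[1] - y) ** 2
        dens += w * math.exp(-d2 / (2 * bandwidth ** 2))
    return dens / (total_w * 2 * math.pi * bandwidth ** 2)

# Hypothetical findspot coordinates with precision weights.
spots = [(0, 0), (1, 0), (0, 1), (10, 10)]
w = [1.0, 1.0, 0.5, 1.0]
near = weighted_kde(spots, w, (0.5, 0.3), 1.0)
far = weighted_kde(spots, w, (20.0, 20.0), 1.0)
```

Evaluating such an estimate on a grid over the survey coordinates yields the density map of brickstamps for different urban regions.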
Key words: Roman bricks, archaeometry, nonparametric density, mapping<br />
− 34 −
Multimodal Performance Analysis of<br />
Electronic Sitar<br />
Arne Eigenfeldt 1 and Ajay Kapur 2<br />
1 Simon Fraser University<br />
Burnaby, BC, Canada<br />
2 California Institute of the Arts<br />
Valencia, CA, USA<br />
Abstract. This paper describes a custom-built system which extracts high-level
musical information from real-time sensor data received from an ESitar [1]. Data is
collected from sensors during rehearsal using one program, GATHER, and, combined
with audio analysis, is used to derive statistical coefficients which identify
three different playing styles of North Indian music: Alap, Ghat, and Jhala. A real-time
program, LISTEN, uses these coefficients in a multi-agent analysis to determine
the current playing style.
The ESitar is an instrument which gathers gesture data from a performing artist<br />
using sensors embedded on the traditional instrument. A number of different
performance parameters are captured, including fret detection (based on the position
of the left hand on the neck of the instrument), thumb pressure (based on right-hand
strumming), and 3 dimensions of neck acceleration. Audio features from the
instrument are computed as well, and include root mean square (RMS), spectral
centroid, spectral flux, and spectral rolloff at 85%.
Analysis is done on sample data to derive statistical information (minimum,
maximum, mean, standard deviation) for each sensor and audio feature, yielding
unique class coefficients. These coefficients are compared to incoming performance
data, which is also statistically analysed, by a multi-agent system.
Interactive and real-time computer music is becoming more complex, and the
requirements placed upon the software are increasing accordingly. Composers, hoping
to gain more understanding of a performer's actions, are looking not just at incoming
audio, but also at sensor data, for such information. Our research has focused
upon high-level musical cognition - playing style - rather than detail recognition
- beat or pitch tracking, for example - by augmenting real-time audio analysis with
sensor data analysis.
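The class-coefficient idea can be sketched as follows; the feature values and style profiles below are hypothetical, and the real system combines many sensors and audio features rather than a single series:

```python
def features(series):
    """Per-sensor statistical coefficients used as a style signature."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series) / n
    return [min(series), max(series), mean, var ** 0.5]

def classify(sample, class_profiles):
    """Assign the incoming window to the style whose stored feature
    profile is closest in (unweighted) Euclidean distance."""
    f = features(sample)
    def d(profile):
        return sum((a - b) ** 2 for a, b in zip(f, profile))
    return min(class_profiles, key=lambda style: d(class_profiles[style]))

# Hypothetical thumb-pressure profiles for the three playing styles.
profiles = {
    "Alap":  [0.0, 0.3, 0.1, 0.05],
    "Ghat":  [0.1, 0.7, 0.4, 0.15],
    "Jhala": [0.3, 1.0, 0.8, 0.2],
}
window = [0.75, 0.85, 0.8, 0.9, 0.7]
style = classify(window, profiles)
```

In the multi-agent setting, one such comparison runs per sensor or audio feature, and the agents' votes are combined into the final style decision.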
References<br />
[1] Kapur, A., Tindale, A., Benning, M. S. and P. Driessen. ”The KiOm: A Paradigm
for Collaborative Controller Design”, Proceedings of the International Computer
Music Conference, New Orleans, USA, 2006.
− 35 −
Data compression and regression based on<br />
local principal curves<br />
Jochen Einbeck 1 and Ludger Evers 2<br />
1 Department of Mathematical Sciences, Durham University, Durham, UK<br />
2 School of Mathematics, University of Bristol, Bristol, UK<br />
Abstract. Frequently the predictor space of a multivariate regression problem of<br />
the type y = f(x1, . . . , xp) + ɛ is in fact much lower-dimensional than p, often even<br />
(approximately) one-dimensional. Usual modeling attempts such as the additive<br />
model y = f1(x1) + . . . + fp(xp) + ɛ, which try to reduce the complexity of the<br />
regression problem by making additional structural assumptions, are then inefficient<br />
as they ignore the inherent structure of the predictor space and involve complicated<br />
model and variable selection stages.<br />
In a fundamentally different approach, one may consider first approximating<br />
the predictor space by a (usually nonlinear) curve passing through it, and then<br />
regressing the response only against the one-dimensional projections onto this curve.<br />
This entails the reduction from a p− to a one-dimensional regression problem.<br />
As a tool for the compression of the predictor space we apply local principal<br />
curves [1], which form a more flexible alternative to earlier proposed principal curve<br />
algorithms [2] as they also allow for branched or disconnected curves. Building on
the results presented in [1], we show how local principal curves can be
parameterized and how the projections are obtained. The regression step can then
be carried out using any nonparametric smoother. We illustrate the technique using
16- and higher-dimensional data from astrophysical applications. Possible extensions
to more than one-dimensional nonparametric summaries of the predictor space are<br />
discussed.<br />
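The projection step can be sketched for a curve given as a polyline (illustrative data; the actual parameterization of local principal curves is described in the paper):

```python
import numpy as np

def project_to_polyline(X, verts):
    """Project each point onto a polyline (the fitted principal curve,
    here given by its vertices) and return the arc-length parameter."""
    t = np.empty(len(X))
    seg_len = np.linalg.norm(np.diff(verts, axis=0), axis=1)
    seg_start = np.concatenate(([0.0], np.cumsum(seg_len)))[:-1]
    best = np.full(len(X), np.inf)
    for s, (a, b) in enumerate(zip(verts[:-1], verts[1:])):
        d = b - a
        L = seg_len[s]
        u = np.clip((X - a) @ d / L ** 2, 0.0, 1.0)  # position along segment
        proj = a + u[:, None] * d
        dist = np.linalg.norm(X - proj, axis=1)
        closer = dist < best                         # keep the nearest segment
        best[closer] = dist[closer]
        t[closer] = seg_start[s] + u[closer] * L
    return t

# Noisy points around a bent curve in the plane (illustrative data).
verts = np.array([[0., 0.], [1., 0.], [1., 1.]])
X = np.array([[0.5, 0.1], [0.9, -0.1], [1.1, 0.5], [0.95, 0.9]])
t = project_to_polyline(X, verts)
```

The returned arc-length parameters can then serve as the single predictor in any nonparametric smoother, completing the reduction to a one-dimensional regression problem.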
References<br />
[1] EINBECK, J., TUTZ, G., and EVERS, L. (2005). Exploring multivariate data<br />
structures with local principal curves. In: Weihs, C. and Gaul, W. (Eds.): Classification<br />
- The Ubiquitous Challenge. Springer, Heidelberg, pages 256-263.<br />
[2] CHANG, K. and GHOSH, J. (1998). Principal curves for nonlinear feature extraction<br />
and classification. SPIE Applications of Artificial Neural Networks in<br />
Image Processing III, 3307, 120–129.<br />
− 36 −
Regression-autoregression based clustering<br />
Igor Enyukov<br />
StatPoint Ltd., Moscow<br />
Abstract. The usual approach to clustering cases into k groups (for example,
k-means) can be regarded as a regression of the source variables on a set of k dummy
binary (indicator) variables which satisfy the following conditions:
• for the i-th case only one of these variables has value 1
• if the j-th such variable has value 1 for the i-th case, the case belongs
to the j-th group.
These dummy variables are in turn nonlinear functions of the source variables,
and their values are evaluated by performing a classification procedure.
The set of dependent variables and predictors is the same in this approach,
so it may be regarded as an autoregression approach. Such an approach may lead to
problems when we work with spatially distributed data (like objects with geographical
coordinates), because objects that are close in physical properties (for example,
in seasonal properties of river stocks) may be situated rather far apart in the
geographical space. In this paper it is suggested to use a smoothed variant of such
indicator functions. For this purpose radial basis functions (RBF) are used, with
their centers at the group centers obtained at the end of the procedure. In
this case the set of dependent variables (the source variables which we want to explain
by the clustering) may be distinct from the set of explanatory (independent, or
predictor) variables; for example, the latter may be geographical coordinates.
Such an approach may be regarded as regression-autoregression. The regression
problem is linear in the RBF. This approach makes it possible to determine the class
centers and to evaluate their number as well. For this purpose a kind of ”step-wise”
regression algorithm can be used.
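The linearity of the regression problem in the RBFs can be sketched as follows (hypothetical coordinates, responses, and centers):

```python
import numpy as np

def rbf_design(coords, centers, width):
    """Design matrix of Gaussian radial basis functions centred at the
    (current) group centres in predictor space."""
    d2 = ((coords[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2))

# Geographic coordinates (predictors) and a source variable to explain.
coords = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
y = np.array([1.0, 1.2, 4.0, 4.1])
centers = np.array([[0., 0.5], [5., 5.5]])   # two tentative group centres

Phi = rbf_design(coords, centers, 1.0)
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # regression is linear in the RBFs
fitted = Phi @ beta
```

A step-wise variant would add or remove candidate centres and re-fit this linear problem, using the fit quality to choose the number of classes.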
− 37 −
Real Options in the Valuation of New Products
Said Esber and Daniel Baier<br />
Lehrstuhl für Marketing und Innovationsmanagement, Erich-Weinert-Str. 1,<br />
D-03046 Cottbus<br />
saidesber1@yahoo.com, baier@tu-cottbus.de<br />
Abstract. When developing new products, it is very important for R&D management
to take technical, market and competitive uncertainties into account. Increasing
changes in markets and in the business environment mean that investment decisions
must more and more be taken under high uncertainty. Real options valuation
provides a better understanding of the uncertainties in R&D projects, of the
flexibility of management during the project life cycle, and of the selection of the
best project alternative. This contribution describes the application of the real
options approach in the field of information technology (IT) by the product
development management of a video conferencing system. First, an overview is given
of the real options approach, the properties of different types of real options, and
their relationship to other applicable supplementary or alternative valuation
methods for R&D projects (sensitivity analysis, scenario analysis, Monte Carlo
simulation and decision trees). Subsequently, a decision-supporting Excel-based
tool is used to compute the real options, to develop the decision trees and to carry
out the Monte Carlo simulations. Finally, the decision to introduce the video
conferencing system (BRAVIS) into the production process of German VW companies
is analysed as an application example.
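For illustration, a minimal binomial-lattice valuation of an option to invest can be sketched as follows (hypothetical figures, not the BRAVIS case; the Excel-based tool of the paper is not reproduced):

```python
def binomial_option_value(v0, cost, r, u, d, steps):
    """Value of a (European-style) option to invest: pay `cost` at the end
    to receive the project value, on a Cox-Ross-Rubinstein lattice."""
    q = (1 + r - d) / (u - d)           # risk-neutral up probability
    # Terminal payoffs after k up-moves and (steps - k) down-moves.
    values = [max(v0 * u ** k * d ** (steps - k) - cost, 0.0)
              for k in range(steps + 1)]
    for step in range(steps, 0, -1):    # roll back through the lattice
        values = [(q * values[k + 1] + (1 - q) * values[k]) / (1 + r)
                  for k in range(step)]
    return values[0]

# Hypothetical numbers for a product-development project.
option = binomial_option_value(v0=100.0, cost=100.0, r=0.0, u=1.1, d=0.9, steps=1)
```

With these numbers the static net present value of investing immediately is zero, while the option value of 5 reflects the value of managerial flexibility under uncertainty.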
Key words: Real Options, Option Valuation Methods, Decisions under Uncertainty,
IT Projects, Risk Analysis Methods
References<br />
Dixit, A. K. and Pindyck, R. S. (1995): The Options Approach to Capital Investment.
Harvard Business Review, May/June, 105–115.
Pritsch, G. (2000): Realoptionen als Controlling-Instrument - Das Beispiel pharmazeutische<br />
Forschung und Entwicklung. Gabler, Wiesbaden.<br />
Rese, A. and Baier, D. (2007): Deciding on new products using a computer-assisted<br />
real options approach. Int. J. of Techn. Intelligence & Planning, 3(3), 292–303.<br />
− 38 −
Regression and Classification using Bayesian<br />
Additive Voronoi Tessellation Models<br />
Ludger Evers 1<br />
School of Mathematics, University of Bristol, Bristol, UK, l.evers@bris.ac.uk<br />
Abstract. Voronoi-tessellation-based regression and classification models are based<br />
on approximating the unknown regression function by a discontinuous, piecewise<br />
constant (or linear) function. The discontinuities are modeled by a Voronoi tessellation<br />
of the covariate space. This distinguishes them from recursive partitioning<br />
models like CART which model the discontinuities by (typically axis aligned) hierarchical<br />
splits. Voronoi-tessellation-based regression and classification models are<br />
typically considered in a Bayesian framework, where inference is done using Reversible<br />
Jump MCMC techniques.<br />
These methods however possess two important drawbacks. In many situations<br />
only a small proportion of the covariates studied are relevant to the regression or<br />
classification task at hand. The pairwise distances, which the Voronoi tessellation<br />
is based on, are then dominated by the irrelevant covariates, i.e. it is increasingly<br />
difficult to find a Voronoi tessellation that is informative for the problem at hand.<br />
Second, the estimated regression function is, due to its high-dimensional and complex<br />
nature, typically difficult to interpret.<br />
We propose to use an additive model of the form ∑I fI(xI) that addresses these
two shortcomings. Each fI(·) is based on a Voronoi tessellation of a subspace of the
covariates. A hierarchical model is used for the inclusion of the covariates in order<br />
to ensure that the model makes sparing use of the covariates. In many situations,<br />
each of the fI(·) involves only a small number of covariates and thus allows for easy<br />
interpretation. A further benefit of this approach is that it allows for constructing<br />
faster mixing MCMC compared to the basic non-additive model.<br />
A case study and an empirical comparison with competing methods like CART, BART, MARS, and Support Vector Machines are presented for both regression and classification tasks.
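The prediction side of such a model can be sketched as follows (a minimal illustration only: centers, cell values, and covariate subsets are made up, and the reversible jump MCMC inference over tessellations is not shown):

```python
import numpy as np

def voronoi_predict(x, centers, values):
    """Piecewise-constant prediction: each point receives the value of the
    Voronoi cell (nearest center) it falls into."""
    d = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
    return values[np.argmin(d, axis=1)]

def additive_voronoi_predict(x, components):
    """Additive model: a sum of Voronoi models, each defined on a small
    subset of the covariates."""
    pred = np.zeros(len(x))
    for cols, centers, values in components:
        pred += voronoi_predict(x[:, cols], centers, values)
    return pred

# Two one-covariate components f1(x1) and f2(x2) with made-up parameters.
components = [
    ([0], np.array([[0.0], [1.0]]), np.array([1.0, 3.0])),
    ([1], np.array([[0.0], [1.0]]), np.array([0.5, 2.0])),
]
x = np.array([[0.1, 0.9], [0.8, 0.2]])
print(additive_voronoi_predict(x, components))  # -> [3.  3.5]
```

Each component involving only one or two covariates is what makes the fitted function easy to inspect and plot.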
Keywords<br />
Additive model, random basis, transdimensional simulation.<br />
− 39 −
Cross-linguistic regularities in the monosyllabic system
Gertraud Fenk-Oczlon and August Fenk<br />
Alps-Adriatic University of Klagenfurt<br />
Abstract. We assumed cross-linguistic correlations between the number of monosyllabic words (a), of syllable types (b), of phonemes per syllable (c), and of the size of the phonemic inventory (d). Menzerath's (1954: 112–121) descriptions of 8 Indo-European languages and Campbell's (1991) data regarding their phonemic inventories offered the basis for a statistical evaluation. All correlations between a, b, and c turned out to be significant, those between these three parameters and d almost significant (Fenk-Oczlon & Fenk, to appear). The discussion of these results within
a systems-theoretical framework includes: (I) Diachronic changes: A comparison of<br />
the Beowulf Prologue in Old English (OE) with its translation into Modern English<br />
(ME) shows a remarkable increase of monosyllables from 105 in OE to 312 in ME<br />
and a concomitant increase of the mean syllable complexity from 2.63 phonemes<br />
in OE to 2.88 in ME. (II) Semantic functions: The verb as well as the adverb or<br />
preposition forming a phrasal verb are often polysemous. In a short analysis of a collection<br />
of 1406 English phrasal verbs we found that 1367 or 97 % of the verbs that<br />
were part of the phrasal verb construction were monosyllabic. (39 phrasal verbs<br />
included a bisyllabic verb and only one was found with a trisyllabic verb.) (III)<br />
General relations between a language's phonemic inventory, the number of conceivable combinatorial possibilities, and the number of those phonotactic possibilities actually realized in the monosyllables of typologically different languages.
References<br />
Campbell, G. L. (1991): Compendium of the World's Languages. Routledge, London.
Fenk-Oczlon, G. and Fenk, A. (to appear): Complexity trade-offs between the subsystems of language. In: M. Miestamo, K. Sinnemäki and F. Karlsson (Eds.): Language Complexity: Typology, Contact, Change. John Benjamins, Amsterdam, 43–65.
Menzerath, P. (1954): Die Architektonik des deutschen Wortschatzes. Dümmler, Hannover/Stuttgart.
− 40 −
Validity of images from binary coding tables.<br />
Student motivation surveys: some evidence<br />
K. Fernández-Aguirre 1 and M. A. Garín-Martín 1<br />
University of the Basque Country (UPV/EHU), Bilbao, Spain<br />
karmele.fernandez@ehu.es<br />
Abstract. Using both artificial and real data, this paper analyses the superiority<br />
of Correspondence Analysis (CA) over Principal Component Analysis (PCA) as a<br />
procedure for displaying and exploring data in the processing of contingency tables<br />
or binary tables.<br />
Simple and Multiple Correspondence Analysis (CA and MCA) are becoming<br />
more and more widely used in many areas of science. However, PCA is much better known and more accessible in software packages, so it continues to be used even at the risk of poor results when the data are not quantitative. A second point examined is the low percentages of projected variance obtained on the first factorial axes in CA applications. This may deter users from applying these methods, as it suggests that the visualization obtained will be of poor quality. This problem has been widely discussed: Benzécri (1979) and Greenacre (1994) propose corrections to projected variance rates. Moreover, Lebart and co-authors (1984, 1998, 2000) consider the case of a matrix associated with a symmetric graph and analytically study the variations in representation depending on the different codifications of the associated matrix.
On the one hand, our study shows the superiority of CA over PCA for the reconstitution and visualization of a matrix M associated with a symmetric graph G. On the other hand, we present four examples which
show that projected variance rates are a highly pessimistic measure of the quality<br />
of a representation. These examples of application to survey data examine various<br />
aspects of the motivation of university students in the field of education, and in<br />
particular present extremely low percentages of projected variance on the first axes.<br />
The application of MCA enables us to obtain apparently fragile but interpretable images. Moreover, classification analysis identifies various types of student with clear interpretations, leading to robust conclusions.
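The core CA computation discussed above can be sketched as follows (a standard SVD-of-standardized-residuals formulation on a toy 2×2 table, not the authors' survey data):

```python
import numpy as np

def correspondence_analysis(N):
    """CA via SVD of the standardized residuals of a contingency table."""
    P = N / N.sum()                          # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)      # row and column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U * sv) / np.sqrt(r)[:, None]    # principal row coordinates
    cols = (Vt.T * sv) / np.sqrt(c)[:, None] # principal column coordinates
    return rows, cols, sv ** 2               # sv**2 = principal inertias

N = np.array([[20.0, 5.0], [4.0, 21.0]])     # toy contingency table
rows, cols, inertia = correspondence_analysis(N)
print(inertia[0] / inertia.sum())            # share of inertia on axis 1
```

The total inertia equals chi-squared divided by the table total; the share on the first axes is exactly the "projected variance" rate whose pessimism the abstract discusses.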
Key words: Binary tables, Visualization, Correspondence Analysis, Clustering<br />
References<br />
Lebart, L., Morineau, A. and Warwick, K. M. (1984): Multivariate Descriptive Statistical Analysis. John Wiley & Sons, New York.
− 41 −
A Statistical Theory of Musical Consonance<br />
Proved in Praxis<br />
Jobst Fricke<br />
Universität zu Köln<br />
Abstract. One of the recent models of consonance perception of musical intervals<br />
is based on the simulation of neural autocorrelation. It is assumed that the shape of<br />
the coinciding neural spikes resembles Dirac delta impulses. Then, musical intervals are recognized as consonant if and only if the frequencies of the interval tones exactly form a simple numerical proportion. But intervals are also perceived as consonant when they deviate slightly from these simple numerical proportions. The models of Tramo et al. (2001) and Ebeling (2007), which introduce impulses of larger width, are the first to be in accordance with the reality of perception. Both of them compute the autocorrelation of interval tones that consist of impulses with a width different from zero. In fact, the time window for the spikes' coincidence has a width different from zero; this is the latency which is relevant in cognitive processes. In order to adapt the model to reality, the width of the statistical distribution of neural impulses should be considered.
It is investigated to what extent the behavior of the model corresponds to the auditory<br />
perception in a realistic environment. The experimental data were taken from<br />
a study dealing with the judgment of intervals in a musical context (Fricke 2005b).<br />
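The basic idea, autocorrelation of impulse trains whose spikes have nonzero width, can be sketched as follows (a generic illustration, not the Tramo et al. or Ebeling model itself; sampling rate and pulse width are arbitrary choices):

```python
import numpy as np

def pulse_train(freq, width, t):
    """Periodic train of Gaussian pulses: Dirac spikes smeared to a
    nonzero width, mimicking the latency of neural impulses."""
    phase = (t * freq) % 1.0
    dist = np.minimum(phase, 1.0 - phase) / freq  # seconds to nearest spike
    return np.exp(-0.5 * (dist / width) ** 2)

t = np.linspace(0.0, 0.1, 4000)                   # 0.1 s, ~40 kHz sampling
signal = pulse_train(200.0, 5e-4, t) + pulse_train(300.0, 5e-4, t)  # 2:3 fifth
s = signal - signal.mean()
ac = np.correlate(s, s, mode="full")
ac = ac[ac.size // 2:] / ac[ac.size // 2]         # normalized, lags >= 0
lag = int(round(0.01 / (t[1] - t[0])))            # 10 ms: common period of 2:3
print(ac[lag])  # pronounced autocorrelation peak at the interval's period
```

Because the pulses have nonzero width, the peak at the common period survives a slight mistuning of the 2:3 ratio, which is the point of contact with perceptual reality.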
Keywords<br />
Music, consonance theory
References<br />
Ebeling, M. (2007): Verschmelzung und neuronale Autokorrelation als Grundlage einer Konsonanztheorie. Frankfurt/M. et al.
Fricke, J. (2005b): Classification of Perceived Musical Intervals. In: C. Weihs and W. Gaul (Eds.): Classification - The Ubiquitous Challenge. Springer, Berlin/Heidelberg/New York, 585–592.
Tramo, M., Cariani, P., Delgutte, B. and Braida, L. (2001): Neurobiological Foundations for the Theory of Harmony in Western Tonal Music. In: R. Zatorre and I. Peretz (Eds.): Biological Foundations of Music. Annals of the New York Academy of Sciences, Vol. 930.
− 42 −
An Improved Criterion for Clustering Based<br />
on the Posterior Similarity Matrix<br />
Arno Fritsch and Katja Ickstadt<br />
Technische Universität Dortmund, Fakultät Statistik<br />
Vogelpothsweg 87, 44221 Dortmund<br />
arno.fritsch@tu-dortmund.de, ickstadt@statistik.uni-dortmund.de<br />
Abstract. Complex Bayesian cluster models are often fitted using Markov Chain<br />
Monte Carlo (MCMC) algorithms. A problem is then how to summarize the MCMC<br />
sample c(1), . . . , c(M) from the posterior distribution of clusterings p(c|y) with a single
estimated clustering ĉ. The problem is complicated by the fact that the labels<br />
associated with the clusters can switch during the MCMC run. One way to overcome<br />
this is to derive the estimate ĉ based on the posterior similarity matrix, a matrix<br />
with entries πij = P (ci = cj|y), the posterior probabilities that the observations i<br />
and j are in the same cluster. This approach is taken for example in the Bayesian<br />
cluster models for gene expression microarray data by Medvedovic et al. (2004) and<br />
Dahl (2006). The former applies hierarchical clustering to the matrix of (1 − πij),<br />
while the latter tries to minimize a loss function proposed by Binder (1978). We<br />
show that this minimization is equivalent to maximizing the Rand index between<br />
estimated and true clustering and propose a new criterion for choosing ĉ, the posterior<br />
expected adjusted Rand index with the true clustering. In a simulation study<br />
with a Dirichlet process mixture model it is shown that our new criterion leads to<br />
estimated clusterings closer to the true one than the other two approaches and that<br />
it also outperforms the usage of the maximum a posteriori (MAP) clustering.<br />
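The posterior similarity matrix, together with the Binder/least-squares style summary it supports, can be sketched as follows (toy MCMC draws; the paper's proposed posterior expected adjusted Rand criterion is not reproduced here):

```python
import numpy as np

def posterior_similarity(draws):
    """pi[i, j]: fraction of MCMC clusterings in which observations i and j
    share a cluster -- invariant to label switching by construction."""
    draws = np.asarray(draws)                 # shape (M, n)
    pi = np.zeros((draws.shape[1],) * 2)
    for c in draws:
        pi += c[:, None] == c[None, :]
    return pi / draws.shape[0]

def least_squares_clustering(draws, pi):
    """Dahl-style summary: the sampled clustering whose co-membership
    matrix is closest in squared error to the similarity matrix."""
    losses = [(((c[:, None] == c[None, :]) - pi) ** 2).sum()
              for c in np.asarray(draws)]
    return np.asarray(draws)[int(np.argmin(losses))]

draws = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]  # labels switch freely
pi = posterior_similarity(draws)
print(pi[0, 1], least_squares_clustering(draws, pi))
```

Note that the first two draws are the same partition under different labels; the similarity matrix treats them identically, which is exactly why summaries built on it avoid the label-switching problem.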
Key words: Adjusted Rand index, Bayesian cluster analysis, Markov Chain Monte<br />
Carlo, Loss functions<br />
References<br />
Binder, D.A. (1978): Bayesian Cluster Analysis. Biometrika, 65, 31–38.<br />
Dahl, D.B. (2006): Model-based Clustering for Expression Data via a Dirichlet Process<br />
Mixture Model. In: K.A. Do, P. Müller and M. Vannucci (Eds.): Bayesian<br />
Inference for Gene Expression and Proteomics, Cambridge University Press,<br />
New York, 201–216.<br />
Medvedovic, M., Yeung, K. and Bumgarner, R. (2004): Bayesian Mixture Model<br />
Based Clustering of Replicated Microarray Data. Bioinformatics, 20, 1222–<br />
1232.<br />
− 43 −
On the Use of Student Samples in Major<br />
Marketing Research Journals. A Meta-Study<br />
Sebastian Fuchs and Marko Sarstedt<br />
Institute for Market-based Management, Munich School of Management, D-80539<br />
Munich, Germany imm@bwl.lmu.de<br />
Abstract. In recent years, almost every marketing research journal has experienced a sharp rise in the number of high-quality paper submissions. This has led to increased competition among contributing authors and heightened requirements for manuscript submissions. In this context, the manuscript evaluation criteria of almost every marketing journal underline the importance of the sample's characteristics and how well it represents the population being studied. However, the predominance of student samples in empirical marketing research documents the divergence between these theoretical requirements and practical implementation. Despite theoretical and empirical objections, several authors claim that a silent acceptance of the usage of student samples has become observable, even in top research societies. According to Peterson (2001), this development is problematic, as generalizations are only feasible if replicating research with non-student subjects is carried out. Thus, the objective of this paper is to analyze the development of the usage of student samples in the most reputable marketing research journals. For this purpose, all eleven marketing journals rated A or A+ in the ranking developed on behalf of the Association of University Professors of Management in German-speaking countries (VHB) were investigated. A total of 1,491 studies that have appeared since 2005 were analyzed with regard to the samples used, the measures evaluated and the limitations addressed. The results show vast differences between the various journals.
Key words: Student Samples, Sampling, Marketing Research<br />
References<br />
Peterson, R.A. (2001): On the Use of College Students in Social Science Research: Insights from a Second-Order Meta-analysis. Journal of Consumer Research, 28(3), 450–461.
Völckner, F. and Sattler, H. (2007): Empirical Generalizability of Consumer Evaluations of Brand Extensions. International Journal of Research in Marketing, 24(2), 149–162.
− 44 −
Multi-Dimensional Scaling applied to<br />
Hierarchical Fuzzy Rule Systems<br />
Thomas R. Gabriel, Kilian Thiel, and Michael R. Berthold<br />
Chair for Bioinformatics and Information Mining<br />
Department of Computer and Information Science<br />
University of Konstanz, Box 712, 78457 Konstanz, Germany<br />
{gabriel|thiel|berthold}@inf.uni-konstanz.de<br />
Abstract. This paper presents an approach for visualizing high-dimensional fuzzy<br />
rules arranged in a hierarchy together with the training patterns they cover. A standard<br />
multi-dimensional scaling method is used to map the rule centers of the top<br />
hierarchy level to one coherent picture. Rules of the underlying levels are projected<br />
relatively to their superior level(s). In addition to the rules, all patterns are mapped<br />
onto the two-dimensional projection in relation to the positions of the corresponding
rule centers. The visualization is further extended by showing hierarchical relationships<br />
between overlapping rules of different levels as generated by a hierarchical rule<br />
learner. This delivers interesting insights into the rule hierarchy and makes the model<br />
more explorable. Additionally, rules can be highlighted interactively emphasizing the<br />
subsequent rules at all underlying levels together with their covered patterns. We<br />
demonstrate that this technique allows investigation of interesting rules at different<br />
levels of granularity, which even makes this approach applicable to a large number<br />
of rule sets. The proposed technique is illustrated and discussed on a number<br />
of hierarchical rule model visualizations generated on well-known benchmark data<br />
sets.<br />
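The top-level mapping step can be sketched with classical (Torgerson) MDS (a generic formulation with hypothetical rule centers; the relative projection of the lower hierarchy levels and the interactive highlighting are not shown):

```python
import numpy as np

def classical_mds(D, dim=2):
    """Classical (Torgerson) MDS from a distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    B = -0.5 * J @ (D ** 2) @ J             # double-centered squared distances
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]         # largest eigenvalues first
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Hypothetical top-level rule centers in a 4-dimensional feature space.
centers = np.array([[0.0, 0, 0, 0], [1.0, 0, 0, 0], [0.0, 1, 0, 0]])
D = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)
X = classical_mds(D)                        # 2-D map of the rule centers
D2 = np.linalg.norm(X[:, None] - X[None, :], axis=2)
print(np.allclose(D, D2))                   # exact for Euclidean distances
```

In this toy case the three centers lie in a plane, so the two-dimensional map reproduces their distances exactly; with many high-dimensional rules the map is only an approximation.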
Key words: Multi-Dimensional Scaling, Fuzzy Rule Induction, Rule Hierarchy<br />
References<br />
Berthold, M.R. (2003): Mixed Fuzzy Rule Formation. International Journal of Approximate Reasoning, 32, 67–84.
Gabriel, T.R. and Berthold, M.R. (2003): Constructing Hierarchical Rule Systems.<br />
In: M.R. Berthold, H.-J. Lenz, E. Bradley, R. Kruse, C. Borgelt (Eds.): Proc. 5th<br />
International Symposium on Intelligent Data Analysis. Springer, Berlin, 76–87.<br />
Gabriel, T.R., Thiel, K. and Berthold, M.R. (2006): Rule Visualization based on
Multi-Dimensional Scaling. In: IEEE International Conference on Fuzzy Systems.<br />
IEEE Press, Vancouver, 66–71.<br />
− 45 −
A Center of Excellence for the Digital Support of the Execution, Analysis and Publication of Archaeological Field Projects (Excavation, Survey)
Ulrich-Walter Gans and Matthias Lang
Institut für Archäologische Wissenschaften der RUB, Fakultät für Geschichtswissenschaft, e-mail: johannes.bergemann@rub.de
Abstract. For about two decades, the information produced in very large quantities by archaeological fieldwork has been recorded predominantly or even exclusively in digital form. Extensive databases have been built in numerous places, but their structures are extremely heterogeneous; isolated solutions without interconnections are the rule. Further fundamental problems concern the long-term preservation and maintenance of existing data pools as well as rapidly changing operating systems. An interdisciplinary team from the fields of archaeology, information management, software development and geoinformatics aims to give archaeologists an entirely new way of working with digital media. So far, the databases distributed over the servers of universities, research institutes and museums can only be queried individually over the Internet. ArcheoInf will build a mediator capable of searching numerous archaeological fieldwork databases simultaneously, without users having to switch between query interfaces. The mediator rests on a thesaurus covering as many areas of archaeology as possible and on a corresponding ontology, which together enable a convenient search of fieldwork data. In this way, a repository committed to the open-access principle will be created, offering free access to archaeological object and image data. The embedding of bibliographic data, citations, holdings records and electronic full texts is also planned. At the same time, a central WebGIS server is to be set up that links the archaeologists' geoinformatic applications to text and image data. The project, funded by the Deutsche Forschungsgemeinschaft, involves archaeologists of the Ruhr-Universität Bochum, computer scientists of the Technische Universität Dortmund, geoinformation scientists of the Hochschule Bochum, and the university libraries of Bochum and Dortmund. Further archaeological project partners in Berlin, Bochum, Cottbus, Darmstadt, Karlsruhe, Cologne, Frankfurt and Tübingen are associated.
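The mediator idea, one query normalized through a shared thesaurus and fanned out over heterogeneous project databases, can be caricatured as follows (all names, terms and records are invented; ArcheoInf's actual ontology-based mediator is far richer):

```python
# All vocabulary and records below are hypothetical illustrations.
THESAURUS = {"amphore": "amphora", "amphora": "amphora", "krug": "jug"}

DATABASES = {
    "excavation_a": [{"find": "Amphore", "site": "Trench 3"}],
    "survey_b": [{"find": "amphora", "site": "Field 7"}],
}

def normalize(term):
    """Map a project-specific term onto the shared thesaurus vocabulary."""
    return THESAURUS.get(term.lower(), term.lower())

def mediator_query(term):
    """One query, fanned out over all registered databases at once."""
    target = normalize(term)
    return [(db, r["site"]) for db, records in DATABASES.items()
            for r in records if normalize(r["find"]) == target]

print(mediator_query("Amphora"))  # hits from both databases
```

The point of the sketch is only the indirection: the user never addresses an individual database, and heterogeneous local terms are reconciled through the shared vocabulary.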
− 46 −
Scalable and Incrementally Updated Hybrid<br />
Recommender Systems<br />
Zeno Gantner and Lars Schmidt-Thieme<br />
Information Systems and Machine Learning Lab (ISMLL)<br />
University of Hildesheim, Germany<br />
{gantner,schmidt-thieme}@ismll.uni-hildesheim.de<br />
Abstract. A typical approach in collaborative filtering is to treat the ratings as a<br />
matrix with many unknown entries, and to use the known data to approximate the<br />
complete matrix, including the unknown ratings. George and Merugu demonstrated<br />
that using a co-clustering algorithm to approximate the ratings matrix achieves prediction<br />
accuracies comparable to techniques like SVD, NMF, or kNN, while having<br />
computational properties which allow the use of the method in dynamic scenarios,<br />
where incoming data has to be incorporated instantly. However, pure collaborative<br />
filtering approaches suffer from insufficient data, especially in cases when there are<br />
only a few users, or when there are new items which have not yet been rated.<br />
To overcome this problem, we combine co-clustering with a Naïve Bayes classifier<br />
to predict ratings based on both the ratings matrix and item attributes. The simplicity<br />
of the classifier allows us to preserve the desirable properties (parallelization,<br />
scalability, incremental updates). Our evaluation indicates that the hybrid recommender<br />
system performs better than pure co-clustering.<br />
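The combination rule can be sketched as follows (a deliberately simplified stand-in: the co-cluster assignments are taken as given rather than learned iteratively, and the content-based prediction is passed in as a number instead of being computed by an actual Naïve Bayes classifier):

```python
import numpy as np

def cocluster_predict(R, row_c, col_c, u, i):
    """Mean rating of the (user-cluster, item-cluster) block of u and i."""
    rated = ~np.isnan(R)
    block = rated & (row_c[:, None] == row_c[u]) & (col_c[None, :] == col_c[i])
    return R[block].mean() if block.any() else np.nan

def hybrid_predict(R, row_c, col_c, u, i, nb_prediction):
    """Fall back to the content-based prediction when the co-cluster
    estimate has no ratings to draw on (e.g. a new item)."""
    p = cocluster_predict(R, row_c, col_c, u, i)
    return nb_prediction if np.isnan(p) else p

R = np.array([[5.0, 3.0, np.nan],
              [1.0, 2.0, np.nan]])           # item 2 is unrated (new item)
row_c = np.array([0, 1])                     # user cluster assignments
col_c = np.array([0, 0, 1])                  # item cluster assignments
print(hybrid_predict(R, row_c, col_c, 0, 2, nb_prediction=4.5))  # -> 4.5
print(hybrid_predict(R, row_c, col_c, 0, 1, nb_prediction=4.5))  # -> 4.0
```

Because the fallback only needs item attributes, the hybrid keeps working for items no one has rated, which is precisely the new-item scenario described above.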
Key words: hybrid recommender systems, collaborative filtering, content-based<br />
filtering, co-clustering, naïve Bayes
References<br />
Banerjee, A., Dhillon, I., Ghosh, J., Merugu, S. and Modha, D. (2007): A Generalized<br />
Maximum Entropy Approach to Bregman Co-clustering and Matrix<br />
Approximation. The Journal of Machine Learning Research, 8, 1919–1986.
Hauger, S., Tso, K. and Schmidt-Thieme, L. (2007): Comparison of Recommender<br />
System Algorithms focusing on the New-Item and User-Bias Problem. In: The<br />
31st Annual Conference of the German Classification Society on Data Analysis,<br />
Machine Learning, and Applications.<br />
George, T. and Merugu, S. (2005): A Scalable Collaborative Filtering Framework<br />
Based on Co-Clustering. In: Proceedings of the 5th IEEE Conference on Data<br />
Mining (ICDM). IEEE Computer Society, Los Alamitos, CA, USA, 625–628.
− 47 −
Non-Gaussian nature of ENSO signals and<br />
climate shifts: implications for regional<br />
studies off the western coast of South America<br />
Bernard Garel 1 , J. Boucharel, B. Dewitte and Y. du Penhoat 2<br />
1 Institut de Mathématiques de Toulouse (IMT-LSP)
bernard.garel@math.univ-toulouse.fr
2 Laboratoire d'Études en Géophysique et Océanographie Spatiales (LEGOS)
julien.boucharel@legos.obs-mip.fr
Abstract. El Niño/Southern Oscillation (ENSO) exhibits a significant modulation at decadal timescales which is also associated with changes of its characteristics (amplitude, frequency, propagation, predictability). Some of these characteristics are generally ignored in ENSO regional studies, such as asymmetry (the numbers of warm and cold events are not equal) and the deviation of its statistics from those of an assumed Gaussian distribution. They tend to reduce ENSO prediction skill.
Empirical variance shifts (assumed to be an index of low-frequency variability), first detected in the western tropical Pacific, propagate (with propagation characteristics related to the unstable modes of ENSO) and grow eastward along the equator, leading to enhanced SST anomalies and asymmetry.
Statistical tests are used to quantify the non-Gaussian nature and asymmetry of<br />
ENSO typical indices from in situ data and a variety of models (from intermediate<br />
complexity models to full-physics coupled general circulation models). It is tested whether ENSO can be accounted for by a non-Gaussian alpha-stable law (i.e. a distribution with heavier tails than the Gaussian), by a mixture of distributions, or by a non-stationary process dominated by abrupt changes in mean state and empirical variance. This last question is addressed by a shift detection method applied to ENSO typical
indices. Implications for the interpretation of proxies of the upwelling variability off<br />
the coast of Peru are discussed.<br />
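Standard moment-based tests of the kind alluded to can be sketched on synthetic, skewed "anomalies" (an illustration only; the actual study works with ENSO indices, alpha-stable fits and shift detection, none of which are reproduced here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Stand-in "index": positively skewed, heavy-tailed anomalies.
index = rng.gamma(shape=2.0, scale=1.0, size=2000) - 2.0

print(stats.skewtest(index).pvalue)      # asymmetry of warm vs. cold events
print(stats.kurtosistest(index).pvalue)  # tails heavier than Gaussian
print(stats.normaltest(index).pvalue)    # omnibus departure from normality
```

All three p-values are essentially zero for this sample, i.e. a Gaussian description of such a signal would be firmly rejected.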
Key words: alpha-stable distributions, mixtures, non-stationary process<br />
− 48 −
Likelihood ratio test for general mixture<br />
models<br />
Elisabeth Gassiat 1<br />
Université Paris-Sud 11, Bâtiment 425, 91405 Orsay Cédex, France<br />
elisabeth.gassiat@math.u-psud.fr<br />
Abstract. We investigate the likelihood ratio test (LRT) for testing hypotheses on<br />
the mixing measure in mixture models with a possible structural parameter. The main result gives the asymptotic distribution of the LRT statistic under conditions that are proved to be almost necessary. The asymptotic distribution of the LRT statistic under contiguous alternatives may also be derived. This applies to various testing
problems: the test of a single distribution against any mixture, with application to<br />
Gaussian, Poisson and binomial distributions; the test of the number of populations<br />
in a finite mixture with a possible structural parameter. This allows us to prove that, for the simple contamination model, the asymptotic local power under contiguous alternatives may be arbitrarily close to the asymptotic level when the set of parameters is large enough.
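The LRT statistic itself can be sketched on a toy problem (a unit-variance two-component Gaussian mixture against a single Gaussian; the point of the paper is the nonstandard asymptotic null distribution of this statistic, which the sketch does not compute):

```python
import numpy as np
from scipy import stats

def loglik_two_gaussians(x, iters=200):
    """EM for p*N(mu1, 1) + (1-p)*N(mu2, 1); returns the maximized log-likelihood."""
    p, mu1, mu2 = 0.5, x.min(), x.max()
    for _ in range(iters):
        f1 = p * stats.norm.pdf(x, mu1)
        f2 = (1 - p) * stats.norm.pdf(x, mu2)
        r = f1 / (f1 + f2)                   # E step: responsibilities
        p = r.mean()                         # M step
        mu1 = (r * x).sum() / r.sum()
        mu2 = ((1 - r) * x).sum() / (1 - r).sum()
    return np.log(p * stats.norm.pdf(x, mu1)
                  + (1 - p) * stats.norm.pdf(x, mu2)).sum()

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(2, 1, 150)])
ll0 = stats.norm.logpdf(x, x.mean(), 1.0).sum()   # null: a single N(mu, 1)
lrt = 2 * (loglik_two_gaussians(x) - ll0)
print(lrt)  # large here: the single-component null is untenable
```

Calibrating such a statistic is exactly where the paper's results are needed: under the null the usual chi-squared approximation does not apply.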
Key words: Likelihood ratio test, mixture models, number of components, local<br />
power, contiguity<br />
References<br />
Azais, J.-M., Gassiat, E. and Mercadier, C. (2006): Asymptotic distribution and power of the likelihood ratio test for mixtures: bounded and unbounded case. Bernoulli, 12(5), 775–799.
Azais, J.-M., Gassiat, E. and Mercadier, C. (2007): The likelihood ratio test for general mixture models with possibly unknown structural parameters. ESAIM P. and S., submitted.
Gassiat, E. (2002): Likelihood ratio inequalities with applications to various mixtures. Ann. Inst. H. Poincaré Probab. Statist., 6, 897–906.
− 49 −
On the Location of Retail Units and Equilibrium Price Determination
Vladimir Gazda 1<br />
Technical University in Kosice, Nemcovej Str., 040 01 Kosice, Slovakia<br />
vladimir.gazda@tuke.sk<br />
Abstract. The classical view of price determination rests on the assumption of perfect competition, which implies the existence of a single price accepted by all retailers. Traditionally, we suppose that all consumers neglect the search costs incurred in looking for the most appropriate purchasing opportunity. New views on the price-location competition among retailers were presented by Hotelling and, subsequently, by d'Aspremont et al., Gabszewicz and Thisse, Dobson and Waterson, and Martinez et al., who stressed mainly a continuous approach. The discrete model describing the relation between search costs and the selling price was described by Stigler (1961) in his search theory. His approach is based on the sequential search of particular retail places and focuses more on the searching process itself than on the equilibrium price determination. We propose a price equilibrium problem formulation for a more complicated (non-sequential) structure of consumers and retailers.
The model describes the behaviour of cost-minimising homogeneous consumers and the retailers' price policy. There, the union of the set of consumers V1, the set of retailers V2, a virtual source s and a virtual sink u gives the set V of digraph nodes. Then, E = {s} × V1 ∪ V1 × V2 ∪ V2 × {u} is the set of digraph edges. We define the unit cost function as c : E → N0, where cs,i = 0, ci,j = di,j, and cj,u = pj; here di,j ∈ N is the search cost spent by the i-th consumer to reach the j-th retail centre and pj ∈ πj is the price of the j-th retailer. It is proved that the min-cost flow in the digraph G = (V, E, c) models the optimal behaviour of all consumers. Then, the normal form game of retailers S = (V2, ∏j πj, ∏j µj) describes the behaviour of retailers with their payoff functions µj derived from the min-cost decisions of the consumers. The Nash equilibrium price strategies of the retailers are discussed in the article.
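Because the retailers in this formulation carry no capacity constraints, the min-cost flow decomposes: each consumer's unit of flow simply follows the path s → i → j → u minimizing di,j + pj. A minimal sketch on an invented instance:

```python
# Invented instance: consumers 1, 2; retailers "A", "B".
search_cost = {(1, "A"): 1, (1, "B"): 4, (2, "A"): 3, (2, "B"): 1}  # d_ij
price = {"A": 10, "B": 9}                                           # p_j
consumers, retailers = (1, 2), ("A", "B")

# Cheapest path s -> i -> j -> u for each consumer (c_si = 0 is omitted).
choice = {i: min(retailers, key=lambda j: search_cost[i, j] + price[j])
          for i in consumers}
total_cost = sum(search_cost[i, choice[i]] + price[choice[i]]
                 for i in consumers)
print(choice, total_cost)  # -> {1: 'A', 2: 'B'} 21
```

The retailers' game then sits on top of this: each retailer anticipates how changing pj re-routes the consumers' min-cost choices.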
Key words: Graph, Game, Consumption, Retailer, Min-cost Flow, Nash Equilibrium<br />
− 50 −
The Potential of Social Intelligence for<br />
Collective Intelligence<br />
Andreas Geyer-Schulz 1 and Bettina Hoser 2<br />
1 Information Service and Electronic Markets andreas.geyer-schulz@kit.edu<br />
2 bettina.hoser@kit.edu<br />
Abstract. In this contribution we review the history and potential of social intelligence<br />
for driving collective intelligence. Different social networks generated by<br />
computer-mediated communication have been researched extensively in the past.<br />
The development of new technologies and applications on the Internet has resulted<br />
in a recent dramatic rise of user participation within these networks. We systematically<br />
explore the possibility of cross-usage of information about social network<br />
structures for personal, community or organisational services. In the other direction,<br />
we investigate the potential of improving community services by integrating<br />
information on the network structure with personal, organisational and mass information.<br />
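One common way to make a directed communication network amenable to eigensystem analysis, in the spirit of Hoser and Geyer-Schulz (2005) though not necessarily their exact construction, is a Hermitian encoding:

```python
import numpy as np

# Directed communication counts among four actors (row = sender).
A = np.array([[0, 5, 1, 0],
              [4, 0, 0, 1],
              [1, 0, 0, 6],
              [0, 1, 5, 0]], dtype=float)

# Symmetric part real, antisymmetric part imaginary: H is Hermitian,
# so it retains the edge directions yet has a real spectrum.
H = (A + A.T) / 2 + 1j * (A - A.T) / 2

eigenvalues, eigenvectors = np.linalg.eigh(H)
print(eigenvalues)  # real; leading eigenvectors expose group substructure
```

In this toy matrix, actors {0, 1} and {2, 3} communicate mostly within their pair, and the magnitudes of the leading eigenvector components separate the two groups.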
Key words: social networks, social network analysis, social intelligence, collective<br />
intelligence<br />
References<br />
Hoser, B. and Geyer-Schulz, A. (2005): Eigenspectralanalysis of Hermitian Adjacency Matrices for the Analysis of Group Substructures. Journal of Mathematical Sociology, 29(4), 265–294.
Wasserman, S. and Faust, K. (1994): Social Network Analysis: Methods and Applications. Cambridge University Press, Cambridge.
− 51 −
Isolated vertices in random intersection graphs<br />
Erhard Godehardt 1 , Jerzy Jaworski 2 and Katarzyna Rybarczyk 2<br />
1 Clinic of Thoracic and Cardiovascular Surgery, Heinrich Heine University, 40225<br />
Düsseldorf, Germany; godehard@uni-duesseldorf.de<br />
2 Faculty of Mathematics and Computer Science, Adam Mickiewicz University,<br />
60769 Poznań, Poland; jaworski@amu.edu.pl, kryba@amu.edu.pl<br />
Abstract. In applications like the analysis of non-metric data, it is a natural approach<br />
to classify objects according to the properties they possess. Often it is useful<br />
to consider two objects similar if they share at least s properties (with a given arbitrary<br />
number s). Then an effective model to analyze the structure of similarities<br />
between objects is the random intersection graph G(m, n, P(m)), with a set of vertices<br />
V generated by the random bipartite graph BG(m, n, P(m)) with bipartition (V, W),<br />
where clusters are defined as given subgraphs of the generated intersection graph.<br />
In BG(m, n, P(m)) the number of neighbors (properties) of a vertex v ∈ V (object) is<br />
assigned according to the probability distribution P(m) and an edge between v ∈ V<br />
and w ∈ W means that the object v has the property w. Moreover in G(m, n, P(m)),<br />
an edge connects v1 and v2 (v1, v2 ∈ V) if and only if in BG(m, n, P(m)) they have<br />
at least s common properties. The models were introduced in Godehardt and Jaworski<br />
(2002). Using specific properties of such graphs, we can test the hypothesis<br />
of randomness of the underlying data set.<br />
Our main purpose is to study the number of isolated vertices (objects similar<br />
to no other) in G(m, n, P(m)). Previous results concerning this problem considered only the case where each vertex had the same number of properties and s = 1. In our new approach we manage to cope with dependencies between edge appearances for s ≥ 1 (which is important from the application point of view). Moreover, we give results for the case where the number of properties differs between objects (different distributions P(m)).
distributions P(m)). We give the asymptotics for the probability of nonexistence of<br />
isolated vertices in G(m, n, P(m)) and conditions for asymptotic convergence of the<br />
number of isolated vertices to the Poisson distribution.<br />
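The model itself is easy to simulate (a sketch with an arbitrary property distribution P(m) and threshold s; the asymptotic results of the paper are not reproduced):

```python
import numpy as np
from itertools import combinations

def sample_intersection_graph(m, n, draw_props, s, rng):
    """G(m, n, P(m)): object v draws draw_props() properties out of n;
    v and w are adjacent iff they share at least s properties."""
    props = [set(rng.choice(n, size=draw_props(), replace=False))
             for _ in range(m)]
    adj = {v: set() for v in range(m)}
    for v, w in combinations(range(m), 2):
        if len(props[v] & props[w]) >= s:
            adj[v].add(w)
            adj[w].add(v)
    return adj

rng = np.random.default_rng(7)
adj = sample_intersection_graph(m=60, n=200,
                                draw_props=lambda: rng.integers(2, 6),
                                s=2, rng=rng)
isolated = sum(1 for v in adj if not adj[v])   # objects similar to no other
print(isolated, "of 60 objects are isolated")
```

Comparing the empirical count of isolated vertices with the limiting Poisson behaviour described above is the basis of the randomness test mentioned in the abstract.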
Key words: Random Intersection Graphs, Isolated Vertices, Non-metric Data<br />
Analysis<br />
References<br />
Godehardt, E. and Jaworski, J. (2002): Two Models of Random Intersection Graphs for Classification. In: M. Schwaiger and O. Opitz (Eds.): Exploratory Data Analysis in Empirical Research. Springer, Berlin, 68–81.
− 52 −
A note on constrained EM algorithms<br />
for mixtures of elliptical distributions<br />
Francesca Greselin 1 and Salvatore Ingrassia 2<br />
1 Dipartimento di Metodi Quantitativi per le Scienze Economiche e Aziendali,<br />
Università di Milano Bicocca (Italy) francesca.greselin@unimib.it<br />
2 Dipartimento di Economia e Metodi Quantitativi, Università di Catania (Italy)<br />
s.ingrassia@unict.it<br />
Abstract. We extend some theoretical results about the likelihood maximization<br />
on constrained parameter spaces to mixtures of multivariate elliptical distributions.<br />
In particular, mixtures of multivariate t distributions provide a robust parametric<br />
extension to the fitting of data with respect to normal mixtures. In this framework, the degrees of freedom can act as a robustness parameter, tuning the heaviness of the tails and down-weighting the effect of outliers on parameter estimation.
Further, a constrained monotone algorithm implementing maximum likelihood mixture<br />
decomposition of multivariate t distributions is proposed, to achieve improved<br />
convergence capabilities and robustness. Numerical studies are presented in order<br />
to demonstrate the better performance of the algorithm, comparing it to earlier<br />
proposals.<br />
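The core device behind such constrained algorithms can be sketched as follows (eigenvalue bounds [a, b] chosen arbitrarily; the paper's conditions for t-mixtures are more refined than this Hathaway/Ingrassia-style clipping):

```python
import numpy as np

def constrain_covariance(S, a, b):
    """Clip the eigenvalues of a covariance estimate into [a, b]; this keeps
    the likelihood bounded and rules out degenerate, spiky components."""
    w, V = np.linalg.eigh(S)
    return (V * np.clip(w, a, b)) @ V.T

S = np.array([[1e-9, 0.0], [0.0, 4.0]])   # nearly singular M-step estimate
S_c = constrain_covariance(S, a=0.1, b=10.0)
print(np.linalg.eigvalsh(S_c))            # eigenvalues pushed into [0.1, 10]
```

Applied after every M step, such a projection preserves the monotone increase of the constrained likelihood, which is what "constrained monotone algorithm" refers to above.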
Key words: Mixture models, Robust Clustering, EM algorithm, elliptical distributions,<br />
t-distribution.<br />
References<br />
Hathaway, R.J. (1985): A constrained formulation of maximum-likelihood estimation for normal mixture distributions. The Annals of Statistics, 13, 795–800.
Hennig, C. (2004): Breakdown points for maximum likelihood estimators of location-scale mixtures. The Annals of Statistics, 32, 1313–1340.
Ingrassia, S. and Rocci, R. (2007): Constrained monotone EM algorithms for finite<br />
mixture of multivariate Gaussians. Computational Statistics & Data Analysis,<br />
51, 5339–5351.<br />
McLachlan, G. J. and Peel, D. (2000): Finite Mixture Models, John Wiley & Sons,<br />
New York.<br />
− 53 −
Support Vector Machines in the Primal using<br />
Majorization and Kernels<br />
Patrick J.F. Groenen 1 , Georgi Nalbantov 2 , and Cor Bioch 3<br />
1 Econometric Institute, Erasmus University Rotterdam,<br />
Rotterdam, The Netherlands groenen@few.eur.nl<br />
2 Econometric Institute, Erasmus University Rotterdam, Rotterdam, The Netherlands, and MICC, University Maastricht, Maastricht, The Netherlands nalbantov@few.eur.nl
3 Econometric Institute, Erasmus University Rotterdam,<br />
Rotterdam, The Netherlands bioch@few.eur.nl<br />
Abstract. Support vector machines have become one of the mainstream methods for two-group classification. At the 2006 GfKl meeting in Berlin, we proposed SVM-Maj, a majorization algorithm that minimizes the SVM loss function (see Groenen, Nalbantov, and Bioch, 2007, 2008). A big advantage of majorization is that in each iteration, the SVM-Maj algorithm is guaranteed to decrease the loss until the global minimum is reached. Nonlinearity was achieved by replacing the predictor variables by their monotone spline bases and then fitting a linear SVM. A disadvantage of the method so far is that SVM-Maj becomes slow if the number of predictor variables m is large.
In this paper, we extend the SVM-Maj algorithm in the primal to handle efficiently cases where the number of observations n is (much) smaller than m. We show that the SVM-Maj algorithm can be adapted to handle this case of n ≪ m as well. In addition, using kernels instead of splines to handle the nonlinearity also becomes possible while still maintaining the guaranteed descent properties of SVM-Maj.
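The guaranteed monotone descent comes from the majorization principle: each iteration minimizes a quadratic upper bound that touches the objective at the current point. A toy illustration on f(w) = |w − 3| + 0.5 w² (not the SVM-Maj loss itself):

```python
def majorize_descent(w=0.0, iters=100, eps=1e-12):
    # minimize f(w) = |w - 3| + 0.5*w**2 by iterative majorization: at w0,
    # |w - 3| <= (w - 3)**2 / (2*d) + d/2 with d = |w0 - 3|, so each
    # surrogate is quadratic and its minimizer w = 3 / (1 + d) is closed form
    losses = []
    for _ in range(iters):
        losses.append(abs(w - 3) + 0.5 * w * w)
        d = max(abs(w - 3), eps)   # eps guards against division by zero
        w = 3.0 / (1.0 + d)        # argmin of the quadratic surrogate
    return w, losses
```

Because the surrogate lies above the loss and touches it at the current iterate, the loss sequence can never increase — the property the abstract highlights.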
Key words: Support vector machines, Iterative majorization, Binary classification<br />
problem, Kernel<br />
References<br />
Groenen, P.J.F., Nalbantov, G., and Bioch, J.C. (2007): Nonlinear support vector machines through iterative majorization and I-splines. In: R. Decker and H.-J. Lenz (Eds.): Advances in Data Analysis. Springer, Berlin, 149–162.
Groenen, P.J.F., Nalbantov, G., and Bioch, J.C. (2008, in press): SVM-Maj: A Majorization Approach to Linear Support Vector Machines with Different Hinge Errors. Advances in Data Analysis and Classification.
− 54 −
Usage of Artificial Neural Networks for Data<br />
Handling<br />
Lars Frank Groe and Franz Joos<br />
Helmut-Schmidt-University<br />
University of the Federal Armed Forces Hamburg<br />
Power Engineering<br />
Laboratory of Turbomachinery<br />
Hamburg, Germany<br />
Abstract. To reduce environmental pollution, it is essential to increase the efficiency of commercially available combustion engines. If one succeeds in modeling the combustion process, in particular the chemical reactions, it becomes feasible to replace experiments by computer simulations. Complex chemical reaction mechanisms like GRI3.0 consist of 325 reactions with 53 species. Computational hardware costs limit the integration of the resulting stiff equations to simple problems (2-D, low Reynolds numbers) or to very small numbers of species. Turbulent combustion, however, for example in combustion chambers of gas turbines, often involves complex geometry with a wide spectrum of chemical states and proceeds at high Reynolds numbers. The use of databases for storing chemical reactions is widely described in the literature (Pope, 1997), and several storage-based techniques have been implemented for data mining (look-up tables, in situ adaptive tabulation).
This paper suggests the use of artificial neural networks (ANN) to simulate complex chemistry with the full GRI3.0 mechanism. An ANN can represent the chemical reactions by creating a nonlinear multivariate model of the dataset. The information of the dataset is stored in the weights of the connected neurons. The net finds an optimal approximation of the presented data by a supervised learning method called backpropagation. The modeling and generalisation of a large number of chemical states by means of ANN, with regard to complicated combustion simulations, is the purpose of this work.
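As a sketch of the approach — with a stand-in scalar mapping rather than GRI3.0 chemistry, and a deliberately tiny network — a one-hidden-layer net trained by backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X)                      # stand-in for a tabulated chemical state

W1, b1 = rng.normal(0, 0.5, (1, 16)), np.zeros(16)
W2, b2 = rng.normal(0, 0.5, (16, 1)), np.zeros(1)
lr, losses = 0.1, []
for _ in range(2000):
    H = np.tanh(X @ W1 + b1)           # forward pass through the hidden layer
    out = H @ W2 + b2
    err = out - y
    losses.append(float(np.mean(err ** 2)))
    g_out = 2 * err / len(X)           # backward pass (backpropagation)
    g_W2, g_b2 = H.T @ g_out, g_out.sum(0)
    g_H = (g_out @ W2.T) * (1 - H ** 2)
    g_W1, g_b1 = X.T @ g_H, g_H.sum(0)
    W2 -= lr * g_W2; b2 -= lr * g_b2   # gradient step on all weights
    W1 -= lr * g_W1; b1 -= lr * g_b1
```

The trained weights then replace the stored table: evaluating the net is cheap compared to integrating the stiff chemistry directly.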
References<br />
Pope, S.B. (1997): Computationally efficient implementation of combustion chemistry using in situ adaptive tabulation. Combustion Theory and Modelling, 1, 41–63.
Smith, G.P., Golden, D.M., Frenklach, M., Moriarty, N.W., Eiteneer, B., Goldenberg, M., Bowman, C.T., Hanson, R.K., Song, S., Gardiner, W.C., Jr., Lissianski, V.V. and Qin, Z.: GRI-Mech 3.0, http://www.me.berkeley.edu/gri_mech/.
− 55 −
Model diagnostics of finite mixtures using<br />
bootstrapping<br />
Bettina Grün 1 and Friedrich Leisch 2<br />
1 Department für Statistik und Mathematik, Wirtschaftsuniversität Wien<br />
Augasse 2-6, 1090 Wien, Austria; Bettina.Gruen@wu-wien.ac.at<br />
2 Institut für Statistik, Ludwig-Maximilians-Universität München, Ludwigstraße<br />
33, D-80539 München, Germany; Friedrich.Leisch@stat.uni-muenchen.de<br />
Abstract. The EM algorithm provides a common framework for maximum likelihood<br />
estimation of finite mixture models. The fitted models can differ with respect<br />
to the component-specific models and may also allow for concomitant variables to model the component weights. The use of resampling methods to analyze finite mixture models fitted with the EM algorithm is appealing because the bootstrap, similarly to the EM algorithm, constitutes a common framework for these models.
We will outline various possibilities to use resampling methods for model diagnostics<br />
such as for determining the number of components, checking model identifiability<br />
and analyzing the stability of induced clusterings.<br />
The R package flexmix implements the EM algorithm for ML estimation of finite mixture models. It provides the E-step and all data handling, while arbitrary mixture models can be fitted by modifying the M-step. The implementation of bootstrap
techniques to allow for model diagnostics of the models fitted with the package is<br />
presented.<br />
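The combination of EM fitting and bootstrap resampling can be sketched as follows; this minimal univariate example (synthetic data, not using the flexmix package) resamples the data and refits to gauge the stability of the component means:

```python
import numpy as np

def em_2normals(x, iters=150):
    # bare-bones EM for a two-component univariate Gaussian mixture;
    # the 1/sqrt(2*pi) constant is dropped since it cancels in the E-step
    mu = np.array([x.min(), x.max()], dtype=float)
    sd = np.array([x.std(), x.std()])
    w = np.array([0.5, 0.5])
    for _ in range(iters):
        dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / sd
        r = dens / dens.sum(axis=1, keepdims=True)       # E-step
        nk = r.sum(axis=0)                               # M-step
        w, mu = nk / len(x), (r * x[:, None]).sum(0) / nk
        sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(0) / nk)
    return w, mu, sd

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 150)])
boot_means = np.array([
    np.sort(em_2normals(rng.choice(x, len(x), replace=True))[1])
    for _ in range(20)
])  # bootstrap distribution of the (sorted) component means
```

The spread of `boot_means` across replications is one simple stability diagnostic of the induced clustering.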
Key words: Bootstrap, Finite mixture, Model diagnostics, Resampling
References<br />
Grün, B. and Leisch, F. (2004): Bootstrapping Finite Mixture Models. In: J. Antoch<br />
(Ed.): Compstat 2004—Proceedings in Computational Statistics. Springer,<br />
Heidelberg, 1115–1122.<br />
Hothorn, T., Leisch, F., Zeileis, A. and Hornik, K. (2005): The Design and Analysis<br />
of Benchmark Experiments. Journal of Computational and Graphical Statistics,<br />
14(3), 1–25.<br />
Leisch, F. (2004): FlexMix: A general framework for finite mixture models and latent<br />
class regression in R. Journal of Statistical Software, 11(8).<br />
McLachlan, G.J. (1987): On Bootstrapping the Likelihood Ratio Test Statistic for<br />
the Number of Components in a Normal Mixture. Applied Statistics, 36(3),<br />
318–324.<br />
− 56 −
Classification with Regularized Kernel<br />
Mahalanobis-Distances<br />
Bernard Haasdonk 1 and Elżbieta Pękalska 2
1 Institute of Numerical and Applied Mathematics, University of Münster,<br />
Germany haasdonk@math.uni-muenster.de<br />
2 School of Computer Science, University of Manchester, United Kingdom<br />
pekalska@cs.man.ac.uk<br />
Abstract. Linear discriminant analysis has been demonstrated to be successful in<br />
kernel-induced feature spaces. In particular, in terms of accuracy, the kernel Fisher<br />
discriminant (KFDA) can frequently compete with or even outperform the support<br />
vector machine (SVM) (Mika et al. 2000). In situations, where linear discrimination<br />
in kernel feature space is suboptimal, nonlinear techniques offer a better solution<br />
(Huang et al. 2005). An example is quadratic classification in the kernel space, based<br />
on kernelized versions of class-related Mahalanobis distances.<br />
In this presentation, we give two different formulations for quadratic classifiers in kernel-induced feature spaces, depending on whether the class-related covariance operator has to be regularized or not. Experimental results on a toy data set enable
us to draw comparisons to SVM and KFDA. More importantly, these results provide<br />
a proof of principle that nonlinear discriminants can be beneficial in the kernel<br />
space.<br />
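In the Euclidean (non-kernelized) setting, the underlying classifier assigns each point to the class with the smallest regularized Mahalanobis distance; a sketch, with an illustrative ridge term eps standing in for the operator regularization discussed above:

```python
import numpy as np

def fit_mahalanobis(X, y, eps=1e-3):
    # one regularized class-wise Mahalanobis distance per class:
    # d_c(x)^2 = (x - mu_c)' (S_c + eps*I)^{-1} (x - mu_c)
    models = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        P = np.linalg.inv(np.cov(Xc.T) + eps * np.eye(X.shape[1]))
        models[c] = (mu, P)

    def predict(x):
        # assign to the class whose regularized Mahalanobis distance is smallest
        return min(models,
                   key=lambda c: (x - models[c][0]) @ models[c][1] @ (x - models[c][0]))
    return predict
```

The kernelized version replaces the explicit covariance inverse by operations on the kernel matrix, but the decision rule has the same shape.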
Key words: Kernel Methods, Quadratic Discriminants, Kernel Mahalanobis-Distance<br />
References<br />
Mika, S., Rätsch, G., Schölkopf, B., Smola, A., Weston, J. and Müller, K.-R. (2000): Invariant feature extraction and classification in kernel spaces. In: S.A. Solla, T.K. Leen, and K.-R. Müller (Eds.): Advances in Neural Information Processing Systems 12. MIT Press, Cambridge, MA, 526–532.
Huang, S.-Y., Hwang, C.-R. and Lin, M.-H. (2005): Kernel Fisher's discriminant analysis in Gaussian reproducing kernel Hilbert space. Technical Report, Academia Sinica, Taipei, Taiwan.
− 57 −
On classification of species of representation<br />
rings<br />
Lothar Häberle<br />
Department of Biometry and Epidemiology, University of Erlangen-Nuremberg,<br />
Waldstr. 6, 91054 Erlangen<br />
Abstract. In biology and chemistry, crystal structures and symmetries of molecules, for example, are classified by mathematical groups. The assigned groups can then be used to determine physical properties such as polarity and chirality.
Representing groups as linear transformations of vector spaces and, more generally, as modules enables many group-theoretical problems to be reduced to problems of linear algebra, a well-understood theory. Defining addition and
multiplication via direct sum and tensor product on the set of these modules and<br />
then considering them as elements of a ring, the representation ring, is an approach<br />
to examine such modules. In order to investigate representation rings one may study<br />
their structure preserving maps to the complex numbers, which are called species.<br />
We consider finite groups whose largest subgroup of prime power order is cyclic<br />
for some prime number and study the corresponding representation ring. The indecomposable<br />
modules are stated and the species are classified. The proposed way of<br />
classification may be applied to other classes of groups in the future and then be used<br />
in natural sciences. Throughout the paper we illustrate the theoretical statements<br />
with examples.<br />
Key words: mathematical group, species, representation ring, indecomposable<br />
module<br />
References<br />
Benson, D.J. (1991): Representations and Cohomology I. Cambridge University Press, Cambridge.
Fotsing, B. and Külshammer, B. (2005): Modular species and prime ideals for the<br />
ring of monomial representations of a finite group. Communications in Algebra,<br />
33, 3667–3677.<br />
Green, J.A. (1962): The modular representation algebra of a finite group. Illinois<br />
Journal of Mathematics, 6, 607–619.<br />
Häberle, L. (submitted): The species and idempotents of the Green algebra of a finite group with a cyclic Sylow subgroup.
Shriver, D.F. and Atkins, P.W. (2006): Inorganic Chemistry. Oxford University<br />
Press.<br />
− 58 −
Analysis of High-Resolution Scattered Light Data with Methods of Multivariate Statistics
Cornelius Hahlweg and Hendrik Rothe
Helmut-Schmidt-Universität
Hamburg
Abstract. The development of scattered light measurement techniques has been pushed forward over the past decades, in particular with a view to their use in the quality inspection of surfaces. Scattered light methods prove especially powerful for surface inspection, since they operate without contact, probe a scalable section of the surface, and at the same time resolve the finest surface structures.
In principle, scattered light distributions provide spectral information about properties of the surface under examination. While for very smooth surfaces this information indeed reflects the surface function itself, for surfaces above the so-called Rayleigh limit methods of multivariate statistics prove useful. In particular, the higher moments of the scattering distribution can serve as features for classification procedures. While these moments were proposed in earlier publications as a description of the scattering distribution itself on rather heuristic grounds, a physical meaning can now also be assigned to them. For preprocessing and reduction of the often very large two-dimensional data sets, principal component analysis (PCA) is applied first. For the classification of different samples, e.g., in the sense of quality control, linear canonical discriminant analysis is used. The contribution gives an insight into the foundations of the methods employed in relation to scattered light measurement and offers examples from their application to the classification of technical surfaces.
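The PCA preprocessing step can be sketched as a projection onto the leading principal components obtained from an SVD of the centred data (variable names are illustrative):

```python
import numpy as np

def pca_scores(X, k):
    # centre the measured scatter data, then project onto the k leading
    # principal components (right singular vectors) for dimension reduction
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T
```

The reduced scores would then feed the linear canonical discriminant analysis used for sample classification.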
− 59 −
Algorithms for Computing the Multivariate<br />
Isotonic Regression<br />
Jürgen Hansohm 1<br />
University of the Federal Armed Forces, Munich, Germany<br />
Juergen.Hansohm@UniBw-Muenchen.de<br />
Abstract. Sasabuchi et al. (1983) introduced a multivariate version of the well-known univariate isotonic regression, which plays a key role in the field of statistical inference under order restrictions. Their proposed algorithm for computing the multivariate isotonic regression, however, is guaranteed to converge only under special conditions (Sasabuchi et al., 2003). In this paper, a more general framework for multivariate isotonic regression is given, and an algorithm based on Dykstra's method is used to compute the multivariate isotonic regression. Two numerical examples are given to illustrate the algorithm and to compare the results with the Monte Carlo simulation published by Fernando and Kulatunga (2007).
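Dykstra's method finds the projection onto an intersection of convex sets by cycling through the individual projections while carrying a correction increment for each set; a generic sketch:

```python
import numpy as np

def dykstra(x, projections, iters=100):
    # Dykstra's cyclic-projection algorithm: converges to the least-squares
    # projection of x onto the intersection of the given closed convex sets
    # (plain alternating projection, without the increments, does not)
    incr = [np.zeros_like(x, dtype=float) for _ in projections]
    y = np.asarray(x, dtype=float).copy()
    for _ in range(iters):
        for i, proj in enumerate(projections):
            z = proj(y + incr[i])
            incr[i] = y + incr[i] - z   # update the correction for set i
            y = z
    return y
```

For example, projecting (3, −1) onto the intersection of {x : x₁ = x₂} and {x : x ≥ 0} yields (1, 1); in the isotonic-regression setting the convex sets encode the partial-order constraints.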
Key words: multivariate isotonic regression, projection, Dykstra’s algorithm, partial<br />
order, least squares solution<br />
References<br />
Sasabuchi, S., Inutsuka, M. and Kulatunga, D.D.S. (1983): A multivariate version of isotonic regression. Biometrika, 70, 465–472.
Sasabuchi, S., Miura, T. and Oda, H. (2003): Estimation and test of several multivariate normal means under an order restriction when the dimension is larger than two. Journal of Statistical Computation and Simulation, 73, 619–641.
Fernando, W.T.P.S. and Kulatunga, D.D.S. (2007): On the computation and some applications of multivariate isotonic regression. Computational Statistics and Data Analysis, 52, 702–712.
− 60 −
Precise and Efficient Recognition of Medical Request Forms
Uwe Henker 1 , Alfred Ultsch 2 , and Uwe Petersohn 3
1 DOCexpert Computer GmbH, Bamberg u.henker@docexpert.de
2 Databionics Research Group, Philipps-University of Marburg, Germany ultsch@informatik.uni-marburg.de
3 Institut Künstliche Intelligenz, TU Dresden peterson@inf.tu-dresden.de
Abstract. Forms for requesting medical and/or diagnostic services play a major role in current medical practice. With such forms, potentially life-critical medical or laboratory services are requested for a patient. The requests entered into such a form by the physician by hand are transferred to the laboratory or hospital information systems by machine recognition methods (optical mark recognition, OMR). Here, a changing set of different forms (prototypes) has to be assumed, all of which must be recognized reliably.
The contribution describes the knowledge representation of such forms in a case database of prototypes by means of case-based reasoning (CBR). The central idea is to compare the preprocessed and abstracted images of scanned forms with the prototypes in such a way that error tolerances are admitted. If a new prototype that overlaps strongly with existing prototypes is inserted into the knowledge base, additional decision knowledge for the form classifier is represented in the knowledge base in a multi-stage procedure. The approach achieves a recognition rate of 97% with no false-positive cases. Compared to other published approaches, this is a substantial improvement in recognition performance; in particular, the special requirement that no false-positive results be produced is met in full. The system was tested with real forms that, in the current application, carry a unique identifying feature (a barcode). The need for this barcode on every form is a considerable restriction, which the approach described here removes.
Key words: Classification, Knowledge Representation, Optical Mark Recognition, Image Processing, Medical Information Systems
− 61 −
Using cluster analysis for species delimitation<br />
Christian Hennig 1 and Bernhard Hausdorf 2<br />
1 Department of Statistical Science, University College London, Gower St, London<br />
WC1E 6BT, United Kingdom chrish@stats.ucl.ac.uk<br />
2 Zoologisches Museum der Universität Hamburg, Martin-Luther-King-Platz 3,<br />
20146 Hamburg, Germany hausdorf@zoologie.uni-hamburg.de<br />
Abstract. Species delimitation is a fundamental task in biology. Operationally,<br />
species can be conceived as continuously varying groups of organisms that are separate<br />
from other such groups. This suggests methods of cluster analysis to delimit<br />
species empirically for given data. However, in the literature there is no agreement<br />
about the species concept (see Mayden, 1997, for an overview), which affects the<br />
choice of the appropriate data, cluster analysis method, and the interpretation of<br />
the results. A particular problem arises because of the hierarchical nature of evolution.<br />
Clusters occur at many levels and may represent, beside species, intrapopulation<br />
polymorphisms, populations, regional variation or higher taxa. We present<br />
a methodology for delimiting putative species based on codominant and dominant<br />
genetic markers. The method combines the definition of an appropriate dissimilarity<br />
measure, multidimensional scaling and model-based cluster analysis. We propose a<br />
null model taking into account spatial autocorrelation in order to check whether<br />
inhomogeneities in the data can be explained from regional variation alone. The<br />
methodology is a generalization of the techniques presented in Hennig and Hausdorf<br />
(2004) to categorical genetic data. The methodology is compatible with most species
concepts. We discuss some general issues such as the choice of the clustering method<br />
and joining of not well separated clusters, which rather indicate inhomogeneity on<br />
lower levels than species.<br />
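The multidimensional scaling step of the pipeline can be sketched with classical (Torgerson) MDS, which embeds a Euclidean distance matrix exactly:

```python
import numpy as np

def classical_mds(D, k=2):
    # Torgerson's classical MDS: double-centre the squared distances and
    # embed the points with the top-k eigenvectors of the Gram matrix
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:k]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))
```

In the proposed methodology the input distances come from the genetic dissimilarity measure, and the resulting coordinates feed the model-based cluster analysis.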
Key words: Model-based cluster analysis, genotypes, spatial autocorrelation<br />
References<br />
Hennig, C. and Hausdorf, B. (2004): Distance-based parametric bootstrap tests for<br />
clustering of species ranges. Computational Statistics and Data Analysis, 45,<br />
875–896.<br />
Mayden, R.L. (1997): A hierarchy of species concepts: the denouement in the saga<br />
of the species problem. In: M.F. Claridge, H.A. Dawah, M.R. Wilson (Eds.):<br />
The Units of Biodiversity. Chapman and Hall, London, 381–424.<br />
− 62 −
Nonlinear Effects in PLS Path Models:<br />
A Comparison of Available Approaches<br />
Jörg Henseler 1<br />
Institute of Management Research, Radboud University Nijmegen, Thomas van<br />
Aquinostraat 1, 6525 GD Nijmegen, The Netherlands, J.Henseler@fm.ru.nl<br />
Summary. Along with the development of their scientific disciplines, researchers in business and social sciences are increasingly interested in investigating nonlinear effects
between latent variables. In this contribution, I present four approaches to modeling<br />
nonlinear effects with PLS: Firstly, Wold’s (1982) original approach takes the nonlinearity<br />
in the structural model into account during the iterative PLS algorithm.<br />
Secondly, the product indicator approach developed by Chin, Marcolin, and Newsted<br />
(2003) requires that the nonlinear function be applied a priori on the indicator<br />
level. Thirdly, a two-stage approach as suggested by Henseler and Fassott (2008) estimates
the nonlinear effect a posteriori once the latent variable scores are estimated<br />
by means of the linear effects PLS path model. Fourthly, I adapt an orthogonalizing<br />
approach originally suggested by Little, Bovaird, and Widaman (2006) to nonlinear<br />
PLS path modeling. Finally, I compare the performance of these four approaches<br />
by means of a Monte Carlo simulation, and derive guidelines for users of PLS path<br />
modeling.<br />
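The two-stage idea can be illustrated with ordinary least squares on latent variable scores; here the scores are simulated directly, standing in for the stage-one PLS estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
xi1, xi2 = rng.normal(size=n), rng.normal(size=n)
# structural model with a nonlinear (interaction) effect of 0.5
eta = 0.4 * xi1 + 0.3 * xi2 + 0.5 * xi1 * xi2 + rng.normal(scale=0.1, size=n)

# stage 2: OLS on the latent variable scores plus their product term
Z = np.column_stack([np.ones(n), xi1, xi2, xi1 * xi2])
beta, *_ = np.linalg.lstsq(Z, eta, rcond=None)
```

In the genuine two-stage procedure the scores xi1, xi2 would first be estimated from a linear-effects PLS path model; the a-posteriori regression step is as above.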
Key words: partial least squares, PLS path modeling, nonlinear terms<br />
References<br />
Chin, W. W., Marcolin, B. L., and Newsted, P. N. (2003): A Partial Least Squares<br />
Latent Variable Modeling Approach for Measuring Interaction Effects: Results<br />
from a Monte Carlo Simulation Study and an Electronic-mail Emotion/Adoption<br />
Study. Information Systems Research, 14, 189–217.<br />
Henseler, J. and Fassott, G. (2008): Testing Moderating Effects in PLS Path
Models: An Illustration of Available Procedures. In: V. E. Vinzi, W. W. Chin, J.<br />
Henseler, and H. Wang (Eds.): Handbook Partial Least Squares Path Modeling.<br />
Springer, Heidelberg, forthcoming.<br />
Little, T. D., Bovaird, J. A., and Widaman, K. F. (2006): On the Merits of<br />
Orthogonalizing Powered and Product Terms: Implications for Modeling Interactions<br />
Among Latent Variables, Structural Equation Modeling, 13, 497–519.<br />
Wold, H. (1982): Soft Modeling. The Basic Design and Some Extensions. In: K.<br />
G. Jöreskog and H. Wold (Eds.): Systems under Indirect Observation. Causality,<br />
Structure, Prediction, Part I. North-Holland, Amsterdam, 1–54.<br />
− 63 −
Classification of text processing components:<br />
The Tesla Role System<br />
Jürgen Hermes and Stephan Schwiebert<br />
Linguistic Data Processing, Department of Linguistics, University of Cologne<br />
{jhermes, sschwieb}@spinfo.uni-koeln.de<br />
Abstract. The analysis of sequences of discrete tokens (i.e., texts) is a major research subject of several essentially different disciplines such as corpus linguistics, literary studies and bioinformatics. Though differing in both their data and its interpretation, these disciplines share some intermediate processing steps. Following these considerations, the obvious
procedure is to encapsulate text processing tasks into components and create a<br />
framework that enables component interaction. The component arrangement within<br />
a workflow is comparable to an experimental setup: it allows a gradual modification<br />
of experiments, e.g., rerunning an experiment with a modified configuration or a<br />
replaced component.<br />
The Text Engineering Software Laboratory (Tesla) is an implementation of a<br />
framework that supports the development and deployment of text processing components<br />
as well as the execution of experiments on textual data. One of its main<br />
ideas is reducing the framework’s restrictions on data modeling to a minimum, allowing<br />
developers to focus on their scientific tasks. However, this raises new issues: an extensible way of defining database access, data exchange between components, and data conversion during visualization. If, for instance, the annotations produced
by a component cannot be related sequentially to single text elements but do instead<br />
represent more complex relations between these elements, as generally in graphs or<br />
matrices, the information contained in such data types can only be extracted with<br />
knowledge about their internal structure and its meaning, thus violating a basic<br />
principle of component frameworks.<br />
Addressing these concerns, the concept of a role is introduced in Tesla. A role<br />
adopted by a component specifies the type as well as the access methods of the<br />
produced data. As the role system implicitly exhibits a hierarchical structure, this<br />
finally leads to a dynamic classification of text processing components.<br />
− 64 −
Strengths and Weaknesses of Ant Colony<br />
Clustering<br />
Lutz Herrmann and Alfred Ultsch<br />
Databionics Research Group<br />
University of Marburg, Germany<br />
{lherrmann,ultsch}@informatik.uni-marburg.de<br />
Abstract. Ant colony clustering (ACC) is a promising nature-inspired technique<br />
where stochastic agents perform the task of clustering high-dimensional data on a<br />
low-dimensional output space. Most ACC methods are derivatives of the approach<br />
proposed by Lumer and Faieta. These methods usually perform poorly in terms of topographic mapping and cluster formation, in particular when compared to clustering on Emergent Self-Organizing Maps (ESOM).
In order to address this issue, a unifying representation for both ACC methods
and Emergent Self-Organizing Maps is derived in a brief yet formal manner. ACC<br />
terms are related to corresponding mechanisms of the Self-Organizing Map. This<br />
leads to insights into both algorithms: ACC methods can be considered first-degree relatives of the ESOM, which explains benefits and shortcomings of both. Furthermore, the proposed unification allows one to judge whether modifications improve an algorithm's clustering abilities or not. This is demonstrated using a set of cardinal
clustering problems.<br />
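Most Lumer–Faieta derivatives share the same pick/drop rule: an ant picks up an item where the local similarity density f is low and drops it where f is high. A sketch (k1 and k2 are the usual threshold constants; the values here are illustrative):

```python
def pick_prob(f, k1=0.1):
    # Lumer-Faieta style: an unladen ant picks up an item with high
    # probability where the local similarity density f is low
    return (k1 / (k1 + f)) ** 2

def drop_prob(f, k2=0.15):
    # ...and a laden ant drops its item where f is high
    return (f / (k2 + f)) ** 2
```

It is exactly this local, density-driven update that the abstract relates to the neighbourhood adaptation of the Self-Organizing Map.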
Key words: Clustering, Emergent Self-Organizing Maps, Swarm Intelligence<br />
References<br />
Deneubourg, J.-L., Goss, S., Franks, N., Sendova-Franks, A., Detrain, C. and Chretien,<br />
L. (1991): The dynamics of collective sorting: Robot-like ants and ant-like<br />
robots. In: Proc. of the First International Conference on Simulation of Adaptive<br />
Behaviour: From Animals to Animats 1. MIT Press, Cambridge, 356-365.<br />
Handl, J., Knowles, J. and Dorigo, M. (2005): Ant-Based Clustering and Topographic<br />
Mapping. Artificial Life 12(1), MIT Press, Cambridge.<br />
Kohonen, T. (1995): Self-Organizing Maps. Springer, Berlin, Heidelberg, New York.<br />
Lumer, E. and Faieta, B. (1994): Diversity and adaptation in populations of clustering
ants. In: Proc. of the Third International Conference on Simulation of Adaptive<br />
Behaviour: From Animals to Animats 3. MIT Press, Cambridge, 501–508.<br />
Ultsch, A. and Herrmann, L. (2005): The architecture of emergent self-organizing<br />
maps to reduce projection errors. In: Verleysen M. (Eds): Proc. of the European<br />
Symposium on Artificial Neural Networks (ESANN 2005).<br />
− 65 −
Reconstructing Central Places and Settlement Groups
Irmela Herzog<br />
The Rhineland Regional Council / The Rhineland Commission for Archaeological<br />
Monuments and Sites<br />
Bonn, Germany<br />
i.herzog@LVR.de<br />
Abstract. If (i) the location of settlements is known for a certain period in time<br />
and (ii) the settlements are distributed in such a way that cluster centres with high<br />
settlement densities are present, then a variant of the density clustering algorithm<br />
using basin spanning trees can be applied to (i) reconstruct the location of the<br />
cluster centres and (ii) group the settlements. The model of this approach is based<br />
on the assumption that the exchange rate of products is high where the settlements<br />
are close to each other and/or the settlement size is large. If people living in a<br />
settlement with a low exchange rate wanted to buy or sell something they would<br />
walk to one of the settlements with higher exchange rates in their neighbourhood.<br />
This can be modelled by several variants of the density clustering algorithm using<br />
basin spanning trees. Which of the neighbouring settlements is determined as the<br />
preferred location for product exchange depends on the algorithm variant chosen.<br />
This method is used to reconstruct trade networks, and all settlements connected<br />
by direct or indirect trade links constitute a group. While extending the original<br />
clustering algorithm to support different settlement sizes could be accomplished<br />
easily, the adjustments needed to take into account the costs of walking between<br />
two locations in prehistoric times are by no means trivial. Examples from the river<br />
Main area with Bronze and Iron Age settlements will be presented.<br />
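One simple variant of the density-clustering idea links every settlement to its nearest neighbour of higher density, so that local density maxima become the centres and following the links groups the settlements; a sketch (assuming uniform settlement size and plain Euclidean distance rather than prehistoric walking costs):

```python
import math

def basin_clusters(points, density):
    # each settlement links to its nearest neighbour of strictly higher
    # density; local density maxima become cluster centres, and the links
    # form a spanning tree of each basin
    n = len(points)
    parent = list(range(n))
    for i in range(n):
        best, best_d = i, math.inf
        for j in range(n):
            if density[j] > density[i]:
                d = math.dist(points[i], points[j])
                if d < best_d:
                    best, best_d = j, d
        parent[i] = best

    def centre(i):
        # follow the links up to the basin's density maximum
        while parent[i] != i:
            i = parent[i]
        return i
    return [centre(i) for i in range(n)]
```

Replacing `math.dist` with a cost-surface distance is exactly the non-trivial adjustment the abstract describes.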
References<br />
Hader, S. and Hamprecht, F.A. (2003): Efficient Density Clustering Using Basin Spanning Trees. In: M. Schader, W. Gaul and M. Vichi (Eds.): Between Data Science and Applied Data Analysis. Springer, Berlin, 39–48.
− 66 −
On the prognostic value of gene expression<br />
signatures for censored data<br />
Thomas Hielscher, Manuela Zucknick, Wiebke Werft and Axel Benner<br />
Division of Biostatistics, German Cancer Research Center, Heidelberg, Germany<br />
t.hielscher@dkfz.de<br />
Abstract. As part of the validation of a new gene expression signature it is good<br />
statistical practice to quantify the amount of prognostic information represented by<br />
the signature. Open questions are how to measure the gain in prognostic information<br />
compared to established clinical parameters or biomarkers and the additional<br />
predictive accuracy especially when dealing with censored data. To answer these<br />
questions it is required to use consistent and interpretable measures.<br />
Several measures of prediction accuracy and proportion of explained variation<br />
have been suggested for right-censored event times. The underlying mechanisms of<br />
these measures are as different as the use of Schoenfeld residuals, model likelihoods<br />
or the variation of the individual survival curves. Consequently, these measures vary<br />
in their assumptions and properties and it remains unclear under which conditions<br />
and to which extent they are comparable. Moreover, explained variation for survival<br />
data can be considered as a function of time and therefore strongly depends on the<br />
available follow-up time and the time range of interest.<br />
We present a comparison of several common measures such as the Brier Score<br />
(Graf et al., 1999), the V measure (Schemper and Henderson, 2000) and the method<br />
of O’Quigley and Xu (2001) to illustrate their application to simulated and real<br />
clinical data. A presentation of existing and possible approaches to estimate the<br />
variability of these measures will be provided. An overview of available software<br />
implementations in R will be given.<br />
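As a small illustration of one such measure (our own Python sketch, not the authors' R implementation), the time-dependent Brier score of Graf et al. (1999) can be computed with inverse-probability-of-censoring weights taken from a Kaplan-Meier estimate of the censoring distribution:<br />

```python
import numpy as np

def km_censoring(times, events):
    """Kaplan-Meier estimate of the censoring survival function G(t).
    A censoring event is the complement of an outcome event."""
    order = np.argsort(times)
    t = np.asarray(times)[order]
    d = 1 - np.asarray(events)[order]      # 1 marks a censored observation
    at_risk = len(t) - np.arange(len(t))   # risk-set size at each ordered time
    surv = np.cumprod(1.0 - d / at_risk)

    def G(x):
        idx = np.searchsorted(t, x, side="right") - 1
        return 1.0 if idx < 0 else float(surv[idx])
    return G

def brier_score(times, events, pred_surv, t0):
    """Time-dependent Brier score at t0 with IPCW weights.
    pred_surv[i] is the predicted survival probability S(t0 | x_i).
    (For simplicity G is evaluated at T_i rather than its left limit.)"""
    G = km_censoring(times, events)
    bs = 0.0
    for Ti, di, Si in zip(times, events, pred_surv):
        if Ti <= t0 and di == 1:        # event occurred: true status at t0 is 0
            bs += Si ** 2 / G(Ti)
        elif Ti > t0:                   # still at risk: true status at t0 is 1
            bs += (1.0 - Si) ** 2 / G(t0)
        # observations censored before t0 receive weight zero
    return bs / len(times)
```

With no censoring all weights equal one and the score reduces to the mean squared difference between the predicted survival probabilities and the observed status at t0.<br />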
Key words: Survival, Predictive Accuracy, Gene Expression<br />
References<br />
Graf, E., Schmoor, C., Sauerbrei, W., and Schumacher, M. (1999): Assessment<br />
and comparison of prognostic classification schemes for survival data. Statistics<br />
in Medicine, 18, 2529–2545.<br />
O’Quigley, J. and Xu, R. (2001): Explained variation in proportional hazards regression.<br />
In: J. Crowley and D.P. Ankerst (Eds.): Handbook of Statistics in Clinical<br />
Oncology, Second Edition. Chapman & Hall/CRC Press, 347–363.<br />
Schemper, M. and Henderson, R. (2000): Predictive accuracy and explained variation<br />
in Cox regression. Biometrics, 56, 249–255.<br />
− 67 −
Likelihood ratio testing for hidden Markov<br />
models<br />
Hajo Holzmann and Jörn Dannemann<br />
University of Karlsruhe<br />
Germany<br />
Abstract. When a mixture arises as the marginal distribution of a stationary process,<br />
the dependency structure can be incorporated by assuming that the underlying<br />
regime forms a finite state Markov chain. This leads to the class of hidden Markov<br />
models (HMMs), which are also called Markov dependent mixtures. We shall discuss<br />
maximum likelihood inference in HMMs. In particular, we investigate the problem<br />
of testing for the number of states via the likelihood ratio test (LRT). We propose<br />
a modified LRT for two against more states in an HMM, which is based on the<br />
so-called likelihood function under the independence assumption, and derive its asymptotic<br />
distribution under the null hypothesis. Simulation results and applications to<br />
financial and biological time series illustrate the practical use of the methods.<br />
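For background, the HMM likelihood entering such a ratio test can be evaluated with the standard scaled forward recursion; the following Python sketch (our illustration, assuming discrete emissions) shows the computation:<br />

```python
import numpy as np

def hmm_loglik(obs, pi, A, B):
    """Log-likelihood of a symbol sequence under an HMM, computed with
    the scaled forward algorithm.
    pi: initial state distribution, shape (m,)
    A:  state transition matrix, shape (m, m)
    B:  emission probabilities, shape (m, n_symbols)
    obs: sequence of symbol indices."""
    alpha = pi * B[:, obs[0]]              # forward variables at time 0
    loglik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()            # rescale to avoid underflow
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # propagate through A, then emit
        s = alpha.sum()
        loglik += np.log(s)
        alpha = alpha / s
    return loglik
```

For short sequences the result can be checked against brute-force summation over all state paths.<br />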
− 68 −
Rule-Based Learning of Reliable Classifiers<br />
Jens Hühn and Eyke Hüllermeier<br />
Department of Mathematics and Computer Science, University of Marburg<br />
{huehnj,eyke}@mathematik.uni-marburg.de<br />
Abstract. This paper introduces a fuzzy rule-based classification method called<br />
FR3, which is short for Fuzzy Round Robin RIPPER. As the name suggests, FR3<br />
builds upon the RIPPER algorithm, a state-of-the-art rule learner. More specifically,<br />
in the context of polychotomous classification, it uses a fuzzy extension of RIPPER<br />
as a base learner within a round robin scheme and, thus, can be seen as a fuzzy<br />
variant of the R3 learner that has recently been introduced in the literature. A<br />
key feature of FR3, in comparison with its non-fuzzy counterpart, is its ability to<br />
represent different facets of uncertainty involved in a classification decision in a more<br />
faithful way. FR3 thus provides the basis for implementing “reliable classifiers” that<br />
may abstain from a decision when not being sure enough, or at least indicate that<br />
a classification is not fully supported by the empirical evidence at hand. Besides,<br />
our experimental results show that FR3 outperforms R3 in terms of classification<br />
accuracy and, therefore, suggest that it produces predictions that are not only more<br />
reliable but also more accurate.<br />
Key words: Machine learning, classification, rule induction, uncertainty, fuzzy sets.<br />
References<br />
William W. Cohen (1995). Fast effective rule induction. In Armand Prieditis and<br />
Stuart Russell, editors, Proceedings of the 12th International Conference on<br />
Machine Learning, pages 115–123, Tahoe City, CA. Morgan Kaufmann.<br />
Johannes Fürnkranz (2003). Round robin ensembles. Intell. Data Anal., 7(5):385–<br />
403.<br />
Eyke Hüllermeier and Klaus Brinker (<strong>2008</strong>). Learning valued preference structures<br />
for solving classification problems, Fuzzy Sets and Systems (to appear).<br />
− 69 −
Combining Predictions in Pairwise<br />
Classification: An Adaptive Voting Strategy<br />
and Its Relation to Weighted Voting<br />
Eyke Hüllermeier and Stijn Vanderlooy<br />
Department of Mathematics and Computer Science, University of Marburg<br />
{eyke,vanderlooy}@mathematik.uni-marburg.de<br />
Abstract. Learning by pairwise comparison is a well-known decomposition technique<br />
which allows one to transform a polychotomous classification problem into<br />
a number of binary problems. To aggregate the predictions from the ensemble of<br />
binary models into a final classification, various aggregation strategies have been<br />
proposed. The most commonly used strategy is weighted voting, in which the prediction<br />
of each model is counted as a (weighted) “vote” for a class label, and the<br />
class with the highest sum of votes is predicted as the label of the query instance.<br />
Even though weighted voting turned out to perform very well in practice, it remains<br />
ad-hoc to some extent and lacks a sound theoretical basis.<br />
In this regard, the current paper makes the following contributions. First, we<br />
propose a formal framework in which the aforementioned aggregation problem can be<br />
studied in a convenient way. This framework is based on the setting of label ranking<br />
which has recently received attention in the machine learning literature. Second,<br />
within this framework, we develop a new aggregation strategy called adaptive voting.<br />
This strategy allows one to take the strength of individual learners into consideration<br />
and, under certain assumptions, is provably optimal in the sense that it yields a MAP<br />
prediction of the class label. Third, we show that weighted voting can be seen as<br />
an approximation of adaptive voting and, hence, approximates a MAP prediction.<br />
This theoretical justification of weighted voting is confirmed by strong empirical<br />
evidence showing that it is (at least) competitive in practice.<br />
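The weighted voting rule described above fits in a few lines; in the Python sketch below (ours), the dictionary `pairwise_probs` is a name we introduce for illustration and is assumed to map a class pair (i, j) to the binary model's estimated probability that i beats j:<br />

```python
def weighted_voting(pairwise_probs, n_classes):
    """Aggregate pairwise predictions by weighted voting: each binary
    model contributes its probability estimate as a fractional vote for
    one class and the complement for the other; the class with the
    highest vote total is predicted."""
    votes = [0.0] * n_classes
    for (i, j), p in pairwise_probs.items():
        votes[i] += p           # weighted vote for class i
        votes[j] += 1.0 - p     # complementary vote for class j
    return max(range(n_classes), key=lambda c: votes[c])
```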
Key words: Machine learning, pairwise classification, weighted voting, label ranking,<br />
MAP prediction.<br />
− 70 −
Using Cluster Networks to Represent<br />
Non-Compatible Sets of Clusters<br />
Daniel H. Huson and Regula Rupp<br />
Center for Bioinformatics ZBIT, Tübingen University, Sand 14, 72076 Tübingen,<br />
Germany<br />
{huson,rupp}@informatik.uni-tuebingen.de<br />
Abstract. A set of clusters is called compatible (or hierarchical) if it can be represented<br />
by a rooted tree. In many applications, such as multiple-gene phylogenetic<br />
analysis, sets of clusters arise that are not compatible, and the question is how<br />
to represent such sets in a useful way, in particular emphasizing the parts of the cluster<br />
system that are tree-like and showing where the incompatibilities lie.<br />
The result of a multiple gene tree analysis is usually a number of different tree<br />
topologies that are each supported by a significant proportion of the genes. We<br />
introduce the concept of a cluster network that can be used to combine such trees<br />
into a single rooted network, which can be drawn either as a cladogram or phylogram.<br />
In contrast to split networks, which can grow exponentially in the size of the input,<br />
cluster networks grow only quadratically. A cluster network is easily computed using<br />
a modification of the tree-popping algorithm, which we call network-popping. The<br />
approach will be made available as part of the Dendroscope tree-drawing program<br />
and its application will be illustrated using data and results from recent studies on<br />
large numbers of gene trees.<br />
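The defining property is easy to check directly: a cluster set is compatible exactly when every two clusters are disjoint or nested. A minimal Python sketch of that check (ours, not the network-popping algorithm itself):<br />

```python
def is_compatible(clusters):
    """True if the cluster set can be represented by a rooted tree,
    i.e. every pair of clusters is either disjoint or nested."""
    sets = [frozenset(c) for c in clusters]
    for a in sets:
        for b in sets:
            inter = a & b
            if inter and inter != a and inter != b:
                return False    # proper overlap: incompatible pair
    return True
```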
Key words: clusters, networks, trees, phylogenetics<br />
References<br />
D.H. Huson and D. Bryant. Application of phylogenetic networks in evolutionary<br />
studies. Molecular Biology and Evolution, 23:254–267, 2006. Software available<br />
from www.splitstree.org.<br />
D.H. Huson, D.C. Richter, C. Rausch, T. Dezulian, M. Franz, and R. Rupp. Dendroscope:<br />
An interactive viewer for large phylogenetic trees. BMC Bioinformatics,<br />
8:460, doi:10.1186/1471-2105-8-460, 2007. Software available from<br />
www.dendroscope.org.<br />
− 71 −
Genome phylogeny based on short-range<br />
correlations in DNA sequences<br />
Marc-Thorsten Hütt 1<br />
Jacobs University Bremen<br />
School of Engineering and Science<br />
Campus Ring 1<br />
m.huett@jacobs-university.de<br />
Abstract. The surprising fact that global statistical properties computed on a<br />
genome-wide scale may reveal species information was first observed in studies<br />
of dinucleotide frequencies. In this presentation I will look at the same phenomenon<br />
with a completely different statistical approach. We show that patterns in the short-range<br />
statistical correlations in DNA sequences serve as evolutionary fingerprints of<br />
eukaryotes. All chromosomes of a species display the same characteristic pattern,<br />
markedly different from those of other species. The chromosomes of a species are<br />
sorted onto the same branch of a phylogenetic tree due to this correlation pattern.<br />
The average correlation between nucleotides at a distance k is quantified in two independent<br />
ways: (i) by estimating it from a higher-order Markov process and (ii) by<br />
computing the mutual information function at a distance k. We show how the quality<br />
of phylogenetic reconstruction depends on the range of correlation strengths and<br />
on the length of the underlying sequence segment. This concept of the correlation<br />
pattern as a phylogenetic signature of eukaryote species combines two rather distant<br />
domains of research, namely phylogenetic analysis based on molecular observation<br />
and the study of the correlation structure of DNA sequences.<br />
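The mutual information at distance k in (ii) can be estimated directly from empirical symbol-pair frequencies; a minimal Python sketch (our own illustration):<br />

```python
import math
from collections import Counter

def mutual_information(seq, k):
    """Mutual information (in bits) between symbols at distance k,
    estimated from empirical pair frequencies in the sequence."""
    pairs = [(seq[i], seq[i + k]) for i in range(len(seq) - k)]
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(p[0] for p in pairs)   # marginal counts, first position
    py = Counter(p[1] for p in pairs)   # marginal counts, second position
    mi = 0.0
    for (x, y), c in joint.items():
        pxy = c / n
        mi += pxy * math.log2(pxy * n * n / (px[x] * py[y]))
    return mi
```

A constant sequence has zero mutual information at every distance, while a strictly alternating two-letter sequence approaches one bit at distance one.<br />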
− 72 −
Dimensionality Reduction of Similarity Matrix<br />
Tadashi Imaizumi<br />
Tama University, imaizumi@tama.ac.jp<br />
Abstract. It has become easy to collect a similarity matrix for a large number of<br />
objects, for example in basket analysis in data mining, and we then apply various<br />
unsupervised methods to these data according to the purpose of the analysis. Many<br />
research fields have shown that geometric models such as multidimensional scaling<br />
(MDS) or the self-organizing map (SOM) are applicable. However, two problems<br />
arise when we want to apply these methods to a large similarity matrix. The first is<br />
the change of the dimensions in focus; the second is how to incorporate our prior<br />
information about the data. The latent dimensions of the similarity evaluation<br />
process may be common to all objects when the attributes of the objects are<br />
unambiguous and the number of objects is not too large. However, we would not<br />
agree that the similarity evaluation between Hamburg and Tokyo works in the same<br />
way as that between Hamburg and Berlin, which raises the question of how to model<br />
this process. The second problem is how to incorporate the researcher’s knowledge<br />
into the model as prior information. We have some knowledge about the data, and<br />
the gathered data contain this knowledge as hidden information; it is therefore<br />
worthwhile to propose a supervised geometric model that treats this information as<br />
model parameters. I will discuss these two problems, compare the dimensionality<br />
reduction methods, and propose a model for the change of the dimensions in focus<br />
and for the prior information about the data.<br />
Key words: supervised models, latent dimensions, attribute focus<br />
References<br />
Kohonen, T. (1995): Self-Organizing Maps. Springer, Berlin, Heidelberg.<br />
Roweis, S. T. and Saul, L. K. (2000): Nonlinear dimensionality reduction by locally<br />
linear embedding. Science 290, 2323-2326.<br />
Sammon, J. W., Jr. (1969): A nonlinear mapping for data structure analysis. IEEE<br />
Transactions on Computers, C-18, 5-28.<br />
Tenenbaum, J. B., de Silva, V. and Langford, J. C. (2000): A global geometric framework<br />
for nonlinear dimensionality reduction. Science 290, 2319-2323.<br />
− 73 −
Settlement Behaviour during the 7th to 11th<br />
Centuries along the Ems: A GIS-Based<br />
Settlement-Archaeological Analysis of the Area<br />
between Warendorf and Rheine<br />
Katrin Jaspers<br />
Universität Münster, Germany<br />
Abstract. With the help of GIS, the chronological, topographical and historical<br />
relationships between the various settlements of the study area are to be documented,<br />
taking pedological aspects into account as well. In this way, the processes and<br />
developments of settlement could be made clear, possibly allowing conclusions to<br />
be drawn about the infrastructure between the settlements.<br />
Since this work is still at an early stage, only a preliminary thematic outline<br />
can be given here.<br />
− 74 −
Benchmarking Bicluster Algorithms<br />
Sebastian Kaiser and Friedrich Leisch<br />
Department of Statistics, Ludwig-Maximilians-Universität München,<br />
Ludwigstrasse 33, 80539 München, Germany,<br />
firstname.lastname@stat.uni-muenchen.de<br />
Abstract. Over the last decade, bicluster methods have become increasingly popular<br />
in different fields of two way data analysis, and a large variety of algorithms<br />
and analysis methods have been published; see Madeira and Oliveira (2004) for<br />
a survey. In this presentation, we show how the general benchmarking framework<br />
by Hothorn et al. (2005) can be adapted to the special case of biclustering. A key<br />
issue is the development of bootstrap strategies for two-way data, which do not only<br />
resample cases, but also variables.<br />
All methods presented have been implemented in the open-source R package<br />
biclust, which is available on http://cran.r-project.org. Both artificial and<br />
real-world microarray data are used for benchmark experiments. The resulting<br />
benchmark data are explored using new graphical techniques and analyzed by means<br />
of statistical inference.<br />
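The two-way bootstrap idea mentioned above, resampling both cases and variables, can be sketched in a few lines of Python (our illustration, not the biclust implementation):<br />

```python
import numpy as np

def two_way_bootstrap(X, seed=None):
    """One bootstrap replicate of a data matrix obtained by resampling
    both cases (rows) and variables (columns) with replacement."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    rows = rng.integers(0, n, size=n)   # resampled case indices
    cols = rng.integers(0, p, size=p)   # resampled variable indices
    return X[np.ix_(rows, cols)]
```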
Key words: Biclustering, Two-Way-Clustering, Validation, R<br />
References<br />
Hothorn, T., Leisch, F., Zeileis, A., and Hornik, K. (2005): The design and analysis<br />
of benchmark experiments. Journal of Computational and Graphical Statistics,<br />
14(3), 675–699.<br />
Madeira, S. C. and Oliveira, A. L. (2004): Biclustering algorithms for biological data<br />
analysis: A survey. IEEE/ACM Transactions on Computational Biology and<br />
Bioinformatics, 1(1), 24–45.<br />
Santamaria, R., Theron, R., and Quintales, L. (2007): A framework to analyze biclustering<br />
results on microarray experiments. In: 8th International Conference on<br />
Intelligent Data Engineering and Automated Learning (IDEAL’07), Springer,<br />
Berlin, 770–779.<br />
Turner, H., Bailey, T., and Krzanowski, W. (2005): Improved biclustering of microarray<br />
data demonstrated through systematic performance tests. Computational<br />
Statistics and Data Analysis, 48, 235–254.<br />
− 75 −
Nonparametric distribution analysis for text<br />
mining<br />
Alexandros Karatzoglou 1 , Ingo Feinerer 2,3 , and Kurt Hornik 3<br />
1 INSA de Rouen, France alexis@ci.tuwien.ac.at<br />
2 Theory and Logic Group, Institute of Computer Languages<br />
Vienna University of Technology, Austria feinerer@logic.at<br />
3 Department für Statistik und Mathematik,<br />
Wirtschaftsuniversität Wien, Austria kurt.hornik@wu-wien.ac.at<br />
Abstract. A number of new algorithms for non-parametric distribution analysis<br />
based on Maximum Mean Discrepancy measures and the Hilbert-Schmidt norm have<br />
been introduced recently. These novel algorithms operate in Hilbert space and can be<br />
used for Two-Sample Tests, Hierarchical Clustering and Dimensionality Reduction.<br />
Coupled with recent advances in string kernels, these methods extend the scope of<br />
kernel-based methods in the area of text mining.<br />
We review this group of kernel methods focusing on text mining where we will<br />
propose novel applications and present an efficient implementation in the kernlab<br />
package. We also present an efficient and integrated environment for applying modern<br />
machine learning methods to complex text mining problems through the combined<br />
use of the tm (for text mining) and the kernlab (for kernel-based learning) R<br />
packages.<br />
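As background, the Maximum Mean Discrepancy compares the mean kernel embeddings of two samples; the Python sketch below (ours, with a Gaussian kernel assumed in place of a string kernel) computes the biased estimate of MMD²:<br />

```python
import numpy as np

def mmd2_biased(X, Y, sigma=1.0):
    """Biased estimate of the squared Maximum Mean Discrepancy between
    samples X (n, d) and Y (m, d) using a Gaussian RBF kernel."""
    def k(A, B):
        # pairwise squared Euclidean distances, then RBF kernel
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()
```

Replacing the Gaussian kernel with a string kernel extends the same statistic to text data, as discussed in the abstract.<br />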
Key words: kernel methods, text mining, R<br />
References<br />
Karatzoglou, A., Smola, A., Hornik, K., Zeileis, A. (2004): kernlab - An S4 Package<br />
for Kernel Methods in R. Journal of Statistical Software, 11(9).<br />
Smola, A., A. Gretton, L. Song and B. Schölkopf (2007): A Hilbert Space Embedding<br />
for Distributions. Proceedings of the 18th International Conference on<br />
Algorithmic Learning Theory (ALT 2007), 13-31, Springer, Berlin, Germany<br />
Song, L., A. J. Smola, K. Borgwardt and A. Gretton (2007): Colored Maximum Variance<br />
Unfolding. Proceedings of the Twenty-First Annual Conference on Neural<br />
Information Processing Systems (NIPS 2007), 1-8, MIT Press, Cambridge,<br />
Mass., USA<br />
− 76 −
Index-Tracking Securities –<br />
A Comparative Analysis Using the Example<br />
of the DAX<br />
Christian Klein 1 and Dennis Kundisch 2<br />
1 Universität Hohenheim, Lehrstuhl für Rechnungswesen und Finanzierung, 70593<br />
Stuttgart, cklein@uni-hohenheim.de<br />
2 Universität Augsburg, Lehrstuhl für BWL, Wirtschaftsinformatik und Financial<br />
Engineering, 86135 Augsburg, dennis.kundisch@wiwi.uni-augsburg.de<br />
Abstract. In this paper we compare various index-tracking securities. Products of<br />
this kind promise the investor a performance that matches an underlying index as<br />
closely as possible. Our study considers several aspects, among them the replication<br />
quality of the securities. We thus provide a differentiated picture of both the quality<br />
of the products and the strengths and weaknesses of the commonly applied<br />
evaluation methods.<br />
Key words: stock index, index replication<br />
− 77 −
Polyphasic genomic approach for the taxonomy<br />
of archaea and bacteria<br />
Hans-Peter Klenk<br />
DSMZ - German Collection of Microorganisms and Cell Cultures, 38124<br />
Braunschweig, Germany, hpk@dsmz.de<br />
Abstract. Contemporary taxonomic classification of prokaryotes is primarily based<br />
on the analysis of 16S rDNA sequences, extended by chemotaxonomical analyses,<br />
e.g. whole-cell fatty acids or amino acid analysis of cell walls. Although rDNAs<br />
are excellent taxonomic markers, they represent far less than 1% of a genome.<br />
With 637 published prokaryotic genomes and more than 1850 ongoing archaeal and<br />
bacterial genome sequencing projects to date, the future of systematics will clearly be based on the<br />
analysis of whole genome sequences. The major imminent problems on the way to<br />
a genome-based systematic classification of prokaryotes are: 1) uneven phylogenetic<br />
distribution of the sequenced genomes; 2) large variation of the phylogenetic value<br />
in different fractions of the genomes; and 3) affordable technology for rapid sequence<br />
generation combined with highly automated analysis of the information. A massive<br />
generation of genome sequences from phylogenetically isolated archaea and bacteria<br />
in a collaboration between Joint Genome Institute with DSMZ aims for rapid filling<br />
of the deep phylogenetic gaps, soon to be followed by sequenced genomes of all<br />
type strains. The fast variation between genes or sets of genes in view of sequence<br />
conservation and genetic stability is problematic for global approaches to universal<br />
phylogenies, but provides suitable novel taxonomic markers for more restricted areas<br />
within the diversity of micro-organisms. New technologies for sequence generation<br />
have already sharply decreased the price of producing microbial genome<br />
sequences and will continue to do so until the genome of any cultivated species of<br />
archaea or bacteria becomes affordable. The more complex problem to be solved<br />
is the automated processing of the genomes within an endlessly growing sequence<br />
space.<br />
References<br />
Klenk, H.-P. (2007) Genomic future for the taxonomy of prokaryotes. In: E Stackebrandt,<br />
M Wozniczka, V Weihs & J Sikorski (eds) Connections between Collections.<br />
Proceedings of the 11th International Conference on Culture Collections.<br />
ISBN 978-3-00-022417-1. DSMZ, Braunschweig, Germany. pp 117-119<br />
− 78 −
Exploiting synergetic and redundant features<br />
for multimedia document classification<br />
Jana Kludas, Eric Bruno and Stephane Marchand-Maillet<br />
University of Geneva, Switzerland<br />
kludas|bruno|marchand@cui.unige.ch<br />
Abstract. Multimedia data handling in all kinds of applications has received a<br />
lot of attention from the research communities in the last decade, due to the<br />
’multimediatisation’ of, e.g., the WWW and other data collections in everyday<br />
life. The most important problems identified in multimedia-based classification are,<br />
amongst others, the high dimensionality of the multimodal feature space, the unknown<br />
and varying relevance of features and modalities towards the class label, noise<br />
and missing values in the input data, and the semantic gap between low-level<br />
features and high-level semantic meanings.<br />
We are working on a promising way to tackle many of these problems at once:<br />
the calculation and exploitation of feature information interactions for feature selection<br />
and construction in high-dimensional feature spaces towards more efficient<br />
information fusion and hence improved multimedia document classification. This<br />
information-theoretic dependence measure finds the exact, irreducible attribute interactions<br />
in a multivariate feature subset. Its definition is a stable relation because<br />
information interactions are described by the information exclusively shared by this<br />
subset’s variables. For subsets of size N = 2, the interaction reduces to the<br />
well-known mutual information.<br />
For higher-order subsets N > 2, feature information interaction develops<br />
its most important characteristic: it can take positive and negative values. This<br />
allows two different types of feature relationships to be discriminated: (1) synergy,<br />
indicated by positive interactions, and (2) redundancy, indicated by negative ones.<br />
This can be used to treat the features involved in each type of interaction separately,<br />
with the help of specialized feature selection and construction strategies.<br />
With the help of artificial data sets we will show which relationships information<br />
interactions can detect. Classification experiments on real-world data will also<br />
show the superiority of preprocessing based on N-way interactions over the pairwise<br />
dependence measures, such as correlation and mutual information, that are often<br />
used in recent feature selection approaches.<br />
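For three discrete variables the quantity described above is the interaction information; the Python sketch below (ours) uses the convention I(X;Y;Z) = I(X;Y|Z) - I(X;Y), under which synergy (e.g. XOR) comes out positive and redundancy negative (sign conventions differ in the literature):<br />

```python
import math
from collections import Counter

def entropy(cells):
    """Shannon entropy in bits of a Counter of outcome frequencies."""
    n = sum(cells.values())
    return -sum(c / n * math.log2(c / n) for c in cells.values())

def H(*vars_):
    """Joint entropy of one or more aligned discrete sequences."""
    return entropy(Counter(zip(*vars_)))

def interaction_information(x, y, z):
    """I(X;Y;Z) = I(X;Y|Z) - I(X;Y)
    = -H(X) - H(Y) - H(Z) + H(XY) + H(XZ) + H(YZ) - H(XYZ)."""
    return (-H(x) - H(y) - H(z)
            + H(x, y) + H(x, z) + H(y, z) - H(x, y, z))
```

On the XOR relation the value is +1 bit (pure synergy); on three identical binary variables it is -1 bit (pure redundancy).<br />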
Key words: feature selection, multi modal information fusion, multimedia object<br />
classification<br />
− 79 −
Time-Varying Parameters in Brand Choice<br />
Models<br />
Thomas Kneib 1 , Bernhard Baumgartner 2 , and Winfried J. Steiner 3<br />
1 Department of Statistics, University of Munich, Germany<br />
thomas.kneib@stat.uni-muenchen.de<br />
2 Department of Marketing, University of Regensburg, Germany<br />
bernhard.baumgartner@wiwi.uni-regensburg.de<br />
3 Department of Marketing, Technical University of Clausthal, Germany<br />
winfried.steiner@tu-clausthal.de<br />
Abstract. Brand Choice Models are frequently used in marketing research. In most<br />
applications, estimated parameters representing customers’ reactions to, e.g., price<br />
and promotional activities or brand-specific effects are assumed to be constant over<br />
time. Marketing theories as well as experiences of marketing practitioners, however,<br />
suggest the existence of trends and/or short-term fluctuations in brand choice behavior.<br />
For example, price elasticities or preferences for certain brands may change in the<br />
run-up to special events like Christmas or Mother’s Day (e.g., Baumgartner 2003).<br />
In this contribution, we employ multinomial logit models with varying coefficients to<br />
estimate time-varying parameters in brand choice models. Both time-varying preferences<br />
(trends) and time-varying effects of covariates are modeled based on penalised<br />
splines, a flexible yet parsimonious nonparametric smoothing technique (e.g., Eilers<br />
and Marx 1996). The estimation procedure is fully data-driven, determining the flexible<br />
function estimates as well as the corresponding degree of smoothness in a unified<br />
approach (e.g., Kneib 2006). Preliminary results suggest that the model considering<br />
time-variable parameters outperforms models assuming constant parameters in<br />
terms of fit and predictive validity.<br />
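To illustrate the penalty idea behind P-splines in its simplest form, the Python sketch below (our own reduction, not the authors' varying-coefficient model) fits a Whittaker-type smoother: a P-spline with identity basis and the second-order difference penalty of Eilers and Marx (1996):<br />

```python
import numpy as np

def penalized_smooth(y, lam=10.0):
    """Minimise ||y - z||^2 + lam * ||D z||^2, where D forms second
    differences of z; lam controls the degree of smoothness."""
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)   # (n-2, n) second-difference matrix
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)
```

A straight line has zero second differences, so it passes through unchanged; noisy data come out with strictly smaller roughness.<br />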
Key words: Brand Choice, Multinomial logit model, Time-varying effects, Semiparametric<br />
regression, P-splines<br />
References<br />
BAUMGARTNER, B. (2003): Measuring Changes in Brand Choice Behavior.<br />
Schmalenbach Business Review, 55, 242–256.<br />
EILERS, P.H.C. and MARX, B.D. (1996): Flexible Smoothing Using B-Splines and<br />
Penalized Likelihood (with Comments and Rejoinder). Statistical Science, 11(2),<br />
89–121.<br />
KNEIB, T. (2006): Mixed Model Based Inference in Structured Additive Regression.<br />
Dr. Hut-Verlag, München.<br />
− 80 −
Multivariate comparative analysis of stock<br />
exchanges - the European perspective<br />
Julia Koralun-Bereźnicka<br />
Maritime University in Gdynia, Morska 81-87, 81-225 Gdynia, Poland<br />
koral@am.gdynia.pl<br />
Abstract. The aim of the research is to perform a multivariate comparative analysis<br />
of 20 European stock exchanges in order to identify the main similarities between<br />
the objects. The basis of comparison is a set of 48 monthly variables from the period<br />
01.2003–12.2005. The variables are classified into three categories: size of the market,<br />
equity trading and bonds. The paper aims at identifying the clusters of alike<br />
stock exchanges and at finding the characteristic features of each of the distinguished<br />
groups. The obtained categorization to some extent corresponds with the division<br />
of the European Union into ‘new’ and ‘old’ member countries. Cluster analysis,<br />
performed for each quarter separately, also reveals that the classification is fairly<br />
stable over time. The factor analysis, which was carried out to reduce the number of<br />
variables, reveals three major factors behind the data, which are related to the<br />
aforementioned categories of variables.<br />
Key words: stock exchanges, cluster analysis, factor analysis<br />
References<br />
Boillat, P., de Skowronsky, N., Tuchschmid, N. (2002): Cluster analysis: application<br />
to sector indices and empirical validation. Swiss Society for Financial Market<br />
Research, 16, 467–486.<br />
Kearney, C. and Lucey, B. M. (2004): International equity market integration: Theory,<br />
evidence and implications. International Review of Financial Analysis, 13,<br />
571–583.<br />
Kim, S. J., Moshirian, F. and Wu, E. (2005): Dynamic stock market integration driven by<br />
the European Monetary Union: An empirical analysis. Journal of Banking &<br />
Finance, 29, 2475–2502.<br />
Krzanowski, W. J. (1988): Principles of multivariate analysis. Oxford University<br />
Press, Oxford.<br />
Morrison, D. (1967): Multivariate statistical methods. McGraw-Hill, New York.<br />
Pascual, A. G. (2003): Assessing European stock markets (co)integration. Economics<br />
Letters, 78, 197–203.<br />
− 81 −
Strategies of model construction for<br />
the analysis of judgment data<br />
Sabine Krolak-Schwerdt<br />
Faculty of Humanities, Arts and Educational Science, University of Luxembourg<br />
sabine.krolak@uni.lu<br />
Abstract. This paper is concerned with the types of models researchers use to<br />
analyze empirical data in the domain of social judgments and decisions. Examples of<br />
this research domain are organizational or medical expert judgments, court decisions<br />
or judgments in private everyday life.<br />
Models for the analysis of judgment data may be divided into two classes depending<br />
on the criteria they optimize. The first class consists of approaches which<br />
optimize an internal (mathematical) criterion function. The aim is to minimize the<br />
discrepancy of values predicted by the model from obtained data by use of, e.g., a<br />
least squares approach. The second class comprises approaches which incorporate<br />
a substantive underlying theory into the model. These accounts were developed to<br />
satisfy external validity criteria, especially construct validity. Model parameters are<br />
not only formally defined, but they represent specified components of judgments.<br />
Several models from both classes are applied to a number of empirical data sets<br />
and comparatively evaluated as to goodness-of-fit, variance accounted for by the<br />
models and construct validity. Results exhibit considerable differences between the<br />
two model classes in construct validity, but not in internal validity criteria.<br />
It may be concluded that any model for the analysis of judgment data implies<br />
the selection of a formal theory about judgments. Hence, optimizing a mathematical<br />
criterion function does not constitute a theory-free rationale or a neutral tool.<br />
Rather, this approach yields another formal theory about judgments which may not<br />
correspond to substantive theories and, in this respect, may yield artefacts. As a<br />
consequence, models satisfying construct validity seem superior in the domain of<br />
judgments and decisions.<br />
Key words: Models of data analysis, external validity, internal validity, model<br />
comparison<br />
− 82 −
An application of copula functions to market<br />
risk management<br />
Katarzyna Kuziak<br />
Department of Financial Investments and Risk Management<br />
Wroclaw University of Economics<br />
ul. Komandorska 118/120, 53-345 Wroclaw, Poland<br />
katarzyna.kuziak@ae.wroc.pl<br />
Abstract. Modeling dependence is one of the main issues in risk management. From<br />
a risk management point of view, failure to model tail dependence correctly may<br />
cause many problems (under- or overestimation of the risk level). The most popular<br />
approach to modeling dependence between individual risks is based on classical<br />
correlation, but in recent years interest in applying copula functions has grown.<br />
Copula functions, a powerful concept for aggregating risks, were introduced into<br />
finance by Embrechts, McNeil, and Straumann. The aim of this paper is to provide<br />
simple applications illustrating the practical use of copulas for risk management<br />
from a market risk point of view. First, we introduce the copula concept. Then,<br />
some applications of copulas to market risk are given. Two Value at Risk estimation<br />
approaches are compared for a portfolio of risks: one based on the classical<br />
covariance approach and one based on copulas. The criterion for evaluating the<br />
performance of the two approaches is the result of a VaR backtesting procedure.<br />
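The comparison described above can be sketched in a few lines. The portfolio, the<br />
Gaussian copula with Student-t marginals, and all parameter values below are<br />
illustrative assumptions, not the paper's actual setup:<br />

```python
import numpy as np
from scipy.stats import norm, t

rng = np.random.default_rng(42)
w = np.array([0.5, 0.5])                 # portfolio weights (assumed)
sig = np.array([0.01, 0.02])             # daily volatilities (assumed)
rho = 0.6
corr = np.array([[1.0, rho], [rho, 1.0]])
cov = np.outer(sig, sig) * corr

# Approach 1: classical variance-covariance VaR at 99% (zero-mean normal).
var_normal = norm.ppf(0.99) * np.sqrt(w @ cov @ w)

# Approach 2: copula-based VaR by simulation, here a Gaussian copula with
# heavier Student-t(4) marginals (both choices purely illustrative).
z = rng.multivariate_normal(np.zeros(2), corr, size=200_000)
u = norm.cdf(z)                           # the copula sample on [0,1]^2
scale = sig / np.sqrt(4.0 / (4.0 - 2.0))  # match each marginal's st. dev.
x = t.ppf(u, df=4) * scale                # heavy-tailed marginal returns
var_copula = np.quantile(-(x @ w), 0.99)  # 99% quantile of portfolio loss
print(var_normal, var_copula)
```

With heavy-tailed marginals the simulated copula-based VaR typically exceeds the<br />
variance-covariance figure, the kind of underestimation the abstract warns about.<br />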
Key words: financial dependence, copula functions, risk management, market risk,<br />
Value at Risk<br />
References<br />
Cherubini U., Luciano E., Vecchiato W. (2004): Copula Methods in Finance, John<br />
Wiley & Sons, New York.<br />
Embrechts P., Frey R., McNeil A. (2005): Quantitative Risk Management: Concepts,<br />
Techniques, and Tools, Princeton University Press, Princeton.<br />
Embrechts P., Lindskog F., McNeil A. (2001): Modelling dependence with copulas<br />
and applications to risk management, report, ETH Zurich.<br />
Embrechts P., McNeil A., Straumann D. (1999): Correlation and dependence in risk<br />
management: properties and pitfalls. In: Risk Management: Value at Risk and<br />
Beyond (M. Dempster, Ed.) Cambridge University Press, Cambridge, 176-223.<br />
Nelsen R. (1999): An introduction to copulas, Springer Verlag, New York.<br />
− 83 −
Testing preference rankings<br />
Kar Yin Lam 1 , Alex J. Koning 2 , and Philip Hans Franses 2<br />
1 ERIM & Econometric Institute, Erasmus University Rotterdam, The Netherlands<br />
kylam@few.eur.nl<br />
2 Econometric Institute, Erasmus University Rotterdam, The Netherlands<br />
koning@few.eur.nl, franses@few.eur.nl<br />
Abstract. Preference rankings are a common tool in consumer surveys. Such rankings<br />
are easy to perform and the outcomes are easy to understand. In this study<br />
we propose a method to examine whether observed rankings imply statistically<br />
significant differences across the products. If there is statistical evidence of such<br />
differences, the next question is which products are involved. We use multiple<br />
comparison procedures to test which products differ significantly from each other.<br />
Our method addresses the often-encountered practical situation in which consumers<br />
evaluate N products but give preference rankings only for a subset that each<br />
consumer selects, since the literature shows that the task of comparing all N<br />
products can be too demanding. It may also be that the assignment of ranks<br />
itself is problematic. For instance, ties may occur, that is, the consumer is indifferent<br />
between products, and hence two or more products have the same rank. There<br />
may also be missing values, that is, the consumer excludes a certain product in the<br />
consideration set, and thus does not evaluate it. As a consequence the consumer is<br />
not able to assign a rank to this product. The method we propose and analyze in<br />
this paper does not suffer from these drawbacks. We illustrate it for 93 individuals<br />
who rank 10 movies released in 2007 and who indicate preferences for only 4 of these<br />
10 movies.<br />
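A minimal sketch of the setting: partial rankings with ties and missing values,<br />
reduced to pairwise comparisons with a Bonferroni correction. The data and the<br />
simple win-count comparison are invented for illustration and are not the authors'<br />
exact procedure:<br />

```python
import numpy as np
from itertools import combinations

# Six consumers rank a self-selected subset of 4 products
# (1 = best, NaN = not considered, equal values = tie).
R = np.array([
    [1, 2, np.nan, 3],
    [2, 1, 3, np.nan],
    [1, 1, 2, np.nan],      # tie between products 0 and 1
    [np.nan, 1, 2, 3],
    [1, 3, 2, np.nan],
    [2, np.nan, 1, 3],
])

pairs = list(combinations(range(R.shape[1]), 2))
alpha = 0.05 / len(pairs)   # Bonferroni correction across all comparisons
results = {}
for i, j in pairs:
    both = ~np.isnan(R[:, i]) & ~np.isnan(R[:, j])   # shared rankings only
    wins_i = int(np.sum(R[both, i] < R[both, j]))    # lower rank = preferred
    wins_j = int(np.sum(R[both, i] > R[both, j]))    # ties drop out
    results[(i, j)] = (wins_i, wins_j, int(both.sum()))
print(results)
```

Each pair is then tested at level alpha on its win counts, so ties and missing<br />
evaluations simply reduce the effective sample size of that comparison.<br />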
Key words: Rankings, Multiple comparisons, Ties, Missing observations<br />
− 84 −
Bayesian Methods for Graph Clustering<br />
Pierre Latouche, Christophe Ambroise, and Etienne Birmelé<br />
Laboratoire Statistique et Génome (UMR CNRS 8071, INRA 1152, UEVE), La<br />
Genopole Tour Evry 2, 523 place des Terrasses, 91000 Evry, France<br />
firstname.lastname@genopole.cnrs.fr<br />
Abstract. Networks are used in many scientific fields such as biology, social science,<br />
and information technology. They model, with edges, the way objects of<br />
interest, represented by vertices, are related to each other. Looking for clusters<br />
of highly connected vertices, also called communities or modules, has proved to<br />
be a powerful approach to capturing the underlying structure of a network.<br />
Recently, the Erdős-Rényi Mixture model for Graph (ERMG) for community<br />
detection was proposed by Daudin et al. (2006) with an associated algorithm, based<br />
on variational techniques, for maximum likelihood estimation. Given a network, the<br />
number of clusters is estimated and, for every vertex, the algorithm infers the<br />
probability of membership in each cluster.<br />
Following Hofman and Wiggins (2007), we show how the ERMG model can be<br />
described in a full Bayesian framework. Then, we apply two families of approximation<br />
techniques, called Variational Bayes (VB) and Expectation Propagation (EP),<br />
for the inference procedure. Using simulated and real data sets, we compare both<br />
the number and the quality of the estimated clusters obtained with the different<br />
approaches.<br />
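For readers unfamiliar with the ERMG, a graph can be sampled from such a mixture<br />
in a few lines; the cluster proportions and connection probabilities below are<br />
illustrative choices:<br />

```python
import numpy as np

rng = np.random.default_rng(7)
# Each vertex draws a latent cluster; an edge between two vertices appears
# with a probability that depends only on their clusters.
alpha = np.array([0.5, 0.5])              # cluster proportions
Pi = np.array([[0.25, 0.02],
               [0.02, 0.25]])             # connection probabilities
n = 60
Z = rng.choice(len(alpha), size=n, p=alpha)   # latent memberships
P = Pi[Z][:, Z]                                # per-pair edge probabilities
A = (rng.random((n, n)) < P).astype(int)
A = np.triu(A, 1)
A = A + A.T                                    # undirected, no self-loops

within = A[np.ix_(Z == 0, Z == 0)].mean()
between = A[np.ix_(Z == 0, Z == 1)].mean()
print(within, between)
```

Inference (variational EM, VB, or EP) then tries to recover Z, alpha, and Pi from<br />
the adjacency matrix A alone.<br />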
Key words: Graph clustering, Variational Bayes, Expectation Propagation<br />
References<br />
Daudin, J. and Picard, F. and Robin, S. (2006): A Mixture Model for Random<br />
Graphs. Tech. rep, INRIA.<br />
Hofman, J.M. and Wiggins, C.H. (2007): A Bayesian Approach to Network Modularity.<br />
ArXiv e-prints.<br />
Jordan, M. and Ghahramani, Z. and Jaakkola, T. (1998): An introduction to variational<br />
methods for graphical models. In: Jordan, M.: Learning in Graphical<br />
Models. MIT Press.<br />
− 85 −
Fundamental Indexation - testing the concept in<br />
the German stock market<br />
Hermann Locarek-Junge 1 and Max Mihm 1<br />
Lehrstuhl für Finanzwirtschaft und Finanzdienstleistungen,<br />
TU Dresden, D-01062 Dresden, Germany, locarekj@finance.wiwi.tu-dresden.de<br />
Abstract. In Germany, Fundamental Indexation is a rather new concept of portfolio<br />
management, creating portfolios based not on market capitalization but on other<br />
economic figures such as revenues, number of employees, dividends, or book value.<br />
So far the concept has been implemented in only a few mutual investment funds<br />
worldwide. However, backward calculations of portfolios using the concept of<br />
fundamental indexation (CFI) on time series from 1961 to 2004 for stock portfolios<br />
in the US capital market, as well as other studies, show significant above-average<br />
returns and impressive Sharpe ratios for this period (see Arnott/Sautter/Siegel 2007).<br />
Explaining these above-average returns with factor models has not yet been<br />
accomplished in a way that is compatible with traditional capital market theory.<br />
The pros and cons of the approach are debated between scientists and practitioners,<br />
e.g.: “[CFI] are a triumph of marketing, not of new ideas” (Fama 2007), “With the<br />
advent of fundamental indexes we’re at the brink of a huge paradigm shift.<br />
... [They] are the next wave of investing.” (Siegel 2006), and “Fundamental Indexing<br />
is just a new label on old wine.” (Asness 2006).<br />
We use data from 1987 to 2007 in the German stock market and several indexing<br />
concepts to test the CFI for the German market. We create and compare portfolio<br />
clusters of market weighted, equally weighted and fundamentally weighted stocks.<br />
We use Fama’s 3-factor-model to analyze and explain returns and anomalies, and<br />
we question the persistence of investment returns using the CFI.<br />
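The weighting idea can be sketched with made-up numbers; averaging the weights<br />
implied by several fundamentals is one common variant, and real fundamental<br />
indexes differ in the details:<br />

```python
import numpy as np

# Toy illustration of fundamental vs. market-cap weighting for four
# stocks; all figures are invented.
market_cap = np.array([80.0, 40.0, 20.0, 10.0])   # e.g. bn EUR
fundamentals = np.array([                          # rows: revenue, book value, dividends
    [30.0, 35.0, 20.0, 15.0],
    [25.0, 30.0, 25.0, 20.0],
    [20.0, 30.0, 30.0, 20.0],
])

w_cap = market_cap / market_cap.sum()
# CFI-style weight: average the weight a stock would receive under each
# fundamental measure separately.
w_fund = (fundamentals / fundamentals.sum(axis=1, keepdims=True)).mean(axis=0)

returns = np.array([0.05, 0.02, 0.08, 0.10])       # one-period returns (assumed)
print(w_cap @ returns, w_fund @ returns)
```

Note how the fundamentally weighted portfolio trims the largest-capitalization<br />
stock relative to cap weighting, which is the source of the debated return difference.<br />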
Key words: fundamental indexation, market index, portable alpha<br />
References<br />
Arnott, R., Sautter, G., Siegel, J. (2007): Fundamental Indexing Smackdown, in:<br />
Journal of Indexes, Vol. 10, No. 5, pp. 10–15.<br />
− 86 −
Identifying Atypical Cases in Kernel Fisher<br />
Discriminant Analysis by using the Smallest<br />
Enclosing Hypersphere<br />
Nelmarie Louw, Morne Lamont and Sarel Steel<br />
Department of Statistics and Actuarial Science, University of Stellenbosch, Private<br />
Bag X1, 7602 Matieland, South Africa. nlouw@sun.ac.za<br />
Abstract. Kernel methods are fast becoming standard tools for solving classification<br />
and regression problems in statistics. An example of a kernel based classification<br />
method is Kernel Fisher discriminant analysis (KFDA). Conceptually KFDA entails<br />
transforming the data in the input space to a high-dimensional feature space, followed<br />
by linear discriminant analysis (LDA) performed in feature space. Although<br />
the resulting classifier is linear in feature space, it corresponds to a non-linear classifier<br />
in input space. However, as in the case of LDA, the classification performance<br />
of KFDA deteriorates in the presence of atypical data points. Louw et al. (2007)<br />
proposed several criteria for identification of atypical cases in KFDA. In extensive<br />
simulation studies these criteria have been found to be successful, in the sense that<br />
the error rate of the KFD classifier based on the data set after removal of atypical<br />
cases is lower than the error rate of the KFD classifier based on the entire data<br />
set. A disadvantage is that these criteria are calculated on a leave-one-out basis,<br />
which becomes computationally prohibitive when dealing with large data sets. In<br />
this paper we propose a two-step procedure for identifying atypical cases in large<br />
data sets. Firstly, a subset of potentially atypical data cases is found by constructing<br />
the smallest enclosing hypersphere (for each group) in feature space. Secondly, the<br />
proposed criteria are employed to identify atypical cases, but only cases in the subset<br />
are considered on a leave-one-out basis, leading to a substantial reduction in computation<br />
time. We investigate the merit of this new proposal in a simulation study,<br />
and compare the results to the results obtained when not using the hypersphere as<br />
a first step. We conclude that the new proposal has merit.<br />
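The first, cheap screening step can be approximated via the kernel trick. As a<br />
simplified stand-in for the smallest enclosing hypersphere, the sketch below ranks<br />
cases by their feature-space distance to the class centroid; kernel, bandwidth, and<br />
data are all assumed for illustration:<br />

```python
import numpy as np

def rbf(X, Y, gamma=0.5):
    """RBF kernel matrix between the rows of X and the rows of Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
X[0] = [6.0, 6.0]                        # one planted atypical case

K = rbf(X, X)
n = len(X)
# Squared feature-space distance to the class centroid, computed from the
# kernel matrix alone:
# ||phi(x_i) - m||^2 = K_ii - (2/n) sum_j K_ij + (1/n^2) sum_jl K_jl
d2 = np.diag(K) - 2 * K.mean(axis=1) + K.mean()
suspects = np.argsort(d2)[-5:]           # shortlist of potentially atypical cases
print(suspects)
```

Only the shortlisted cases would then enter the expensive leave-one-out step,<br />
which is the computational saving the two-step procedure aims at.<br />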
Key words: Classification, Discriminant Analysis, Kernel Methods<br />
References<br />
Louw, N., Lamont, M.C. and Steel, S.J. (2007): Identification of Influential Cases<br />
in Kernel Fisher Discriminant Analysis. In: P. Mantovan, A. Pastore and S.<br />
Tonellato (Eds.): Complex Models and Computational Intensive Methods for<br />
Estimation and Prediction. CLEUP EDITORE, 296–301.<br />
− 87 −
Latent growth models for analyzing a<br />
multi-partner reward program<br />
Karsten Lübke 1 and Heike Papenhoff 2<br />
1 Customer Intelligence, Karstadt Warenhaus GmbH, Theodor-Althoff-Strasse 2,<br />
45133 Essen karsten.luebke@karstadt.de<br />
2 Ruhr-Universität Bochum, Lehrstuhl für Betriebswirtschaftslehre, insbesondere<br />
Marketing, Universitätsstraße 150, 44780 Bochum<br />
Abstract. In recent years, multi-partner reward programs (MPRPs) have enjoyed a<br />
steady increase in popularity. However, one main advantage of MPRPs has not been<br />
sufficiently researched: participating customers are expected not only to prefer the<br />
focal card-issuing company over its competitors, but also to prefer the other MPRP<br />
partner companies over their respective competitors outside the program. This<br />
so-called extended cross-buying (CB) effect is crucial for suppliers when they<br />
evaluate program participation. As CB is a dynamic process that may change over<br />
time, we apply Latent Growth Models to analyze the effects of an MPRP on cross-buying.<br />
Key words: Latent Growth Models, Structural Equation Modeling, Cross Buying<br />
− 88 −
Applying Statistical Models and Parametric<br />
Distance Measures for Music Similarity Search<br />
Hanna Lukashevich, Christian Dittmar, and Christoph Bastuck<br />
Fraunhofer IDMT, Langewiesener Str. 22, 98693 Ilmenau, Germany<br />
{lkh;dmr;bsk}@idmt.fraunhofer.de<br />
Abstract. Content-based music similarity search implies methods that can be used<br />
for finding music pieces close in perceptual semantic meaning. It is an inherent part<br />
of automatic music recommendation systems and playlist generation. Most state-of-the-art<br />
music similarity techniques use short-term acoustic features. Defining a<br />
similarity measure between two audio signals consisting of multiple feature vector<br />
frames still remains a challenging task. A multitude of related studies propose<br />
the application of parametric statistical models (e.g. Gaussian Mixture Models -<br />
GMMs) in conjunction with suitable model distance measures. This approach has<br />
several advantages: it enables a very compact and informative representation of an<br />
audio signal and it allows similarity estimation based solely on the parameters of<br />
the models. In this paper we concentrate only on those distance measures that do<br />
not require computationally demanding sampling (like Monte Carlo or likelihood ratio<br />
tests). A good example of such a parametric distance measure is the Kullback-Leibler<br />
divergence (KL-divergence), which describes the distance between two single Gaussians.<br />
Unfortunately, the KL-divergence between GMMs is not analytically tractable. In a<br />
recent ICASSP paper, Hershey and Olsen presented several approximations of the<br />
KL-divergence between two GMMs with very promising results. Hélen and Virtanen<br />
proposed a Euclidean distance between GMMs that avoids the KL-divergence.<br />
We present a KL-Euclidean Hybrid distance between GMMs. We compare it to<br />
other state-of-the-art distance measures and show that it significantly outperforms<br />
the others for several features and models. Rather than trying to find the best<br />
theoretical approximation, our focus is on the best performance for the music<br />
similarity task. Besides that, we investigate the influence of the model parameter<br />
estimation on the performance in music similarity search. Here we compare the<br />
performance for several versions of GMMs: a trivial model having just one Gaussian<br />
per music piece, GMMs with a fixed number of Gaussians, and GMMs where the<br />
number of components is estimated using model selection techniques. We also find<br />
promising results using semantic information such as song segmentation. In the<br />
latter case, we model each segment of the song with a single Gaussian and represent<br />
the song as a GMM, with weights depending on the durations of the segments.<br />
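The analytically tractable building block mentioned above, the KL-divergence<br />
between two single Gaussians, has a well-known closed form; the parameters in the<br />
sketch below are illustrative:<br />

```python
import numpy as np

def kl_gauss(mu0, S0, mu1, S1):
    """Closed-form KL( N(mu0,S0) || N(mu1,S1) ); no sampling required."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

# Unit mean shift between two standard 2-d Gaussians: KL = 0.5 * ||mu||^2
mu0, S0 = np.zeros(2), np.eye(2)
mu1, S1 = np.array([1.0, 0.0]), np.eye(2)
print(kl_gauss(mu0, S0, mu1, S1))
```

For mixtures of such Gaussians no closed form exists, which is what motivates the<br />
approximations and the hybrid distance discussed in the paper.<br />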
Key words: music information retrieval, music similarity, Gaussian mixture models,<br />
Kullback-Leibler divergence<br />
− 89 −
Determining the number of components in<br />
mixture models for hierarchical data<br />
Olga Lukociene 1 and Jeroen K. Vermunt 2<br />
1 Tilburg University, PO Box 90153, 5000 LE Tilburg, The Netherlands<br />
o.lukociene@uvt.nl<br />
2 Tilburg University, PO Box 90153, 5000 LE Tilburg, The Netherlands<br />
j.k.vermunt@uvt.nl<br />
Abstract. Recently, various types of mixture models have been developed for data<br />
sets having a hierarchical or multilevel structure (see, e.g., Vermunt 2003, 2007).<br />
Most of these models include finite mixture distributions at multiple levels of a<br />
hierarchical structure. In the case of two levels, there are, for example, mixture<br />
distributions for individuals (lower-level units) and for groups (higher-level units).<br />
In multilevel mixture models, selecting the number of mixture components is more<br />
complex than in standard mixture models because one has to determine the number<br />
of mixture components at multiple levels.<br />
The most popular measure for determining the number of mixture components<br />
is the BIC. A problem in the application of this criterion in the context of multilevel<br />
mixture models is that it contains the sample size as one of its terms. In multilevel<br />
mixture models, it is not clear which sample size should be used in the BIC formula:<br />
the number of groups or the number of individuals, depending on whether one wishes<br />
to determine the number of components at the higher or at the lower level.<br />
In this study we investigate the performance of various model selection methods<br />
in the context of multilevel mixture models. We not only look at BIC with different<br />
definitions of the sample size, but also at AIC and AIC3, as well as at other<br />
criteria such as ICOMP, validation log-likelihood, and LR tests with bootstrapped<br />
p values.<br />
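The ambiguity can be made concrete: with the log-likelihood and parameter count<br />
below (both invented for illustration), the two candidate sample sizes give clearly<br />
different BIC values:<br />

```python
from math import log

# BIC = -2*loglik + k*log(n). In a multilevel mixture it is not obvious
# which n belongs in the formula; the gap below shows how much the
# choice matters.
loglik = -5234.7        # maximized log-likelihood (assumed)
k = 17                  # number of free parameters (assumed)
n_groups, n_individuals = 100, 2500

bic_groups = -2 * loglik + k * log(n_groups)
bic_individuals = -2 * loglik + k * log(n_individuals)
print(bic_groups, bic_individuals)
```

Since the penalty gap is k*log(n_individuals/n_groups), the two definitions can<br />
easily favor different numbers of components.<br />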
Key words: Multilevel mixture models, Hierarchical models, BIC, AIC, AIC3<br />
References<br />
Vermunt, J.K.(2003): Multilevel latent class models. Sociological Methodology, 33,<br />
213-239.<br />
Vermunt, J.K. (2007): A hierarchical mixture model for clustering three-way data<br />
sets. Computational Statistics and Data Analysis, 51, 5368-5376.<br />
− 90 −
Exploring the Interaction Structure of Weblogs<br />
Martin Klaus and Ralf Wagner<br />
SVI Chair for International Direct Marketing<br />
DMCC - Dialog Marketing Competence Center<br />
University of Kassel, Germany<br />
{mklaus,rwagner}@wirtschaft.uni-kassel.de<br />
Abstract. Weblogs, as a medium of the Web 2.0, have fundamentally changed the way<br />
people communicate and have created a new form of social interaction. Users worldwide<br />
make up a huge, permanently growing conversation database covering various<br />
topics (Blood (2002)). An interesting feature of this virtual communication is<br />
the opportunity to reference other blogs by setting hyperlinks between<br />
weblogs in the course of the dialog (Chin & Chignell (2006); Leskovec et al. (2007)).<br />
Weblogs have no standardized document format, and no tags mark them as such.<br />
Thus, it turns out to be challenging to identify and collect weblogs from the web<br />
with a crawler, spider, or bot (Anjewierden, Brussee & Efimova (2004)).<br />
In this study we introduce different approaches to crawl weblogs and try to combine<br />
them. Subsequently we use social network analysis to uncover the structure<br />
between weblogs (Borgatti, Carley & Krackhardt (2006)). This structure provides<br />
us with an assessment of the blogs and their relevance for marketing communication.<br />
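Once the hyperlinks between blogs are collected, a first relevance ranking is<br />
straightforward; the sketch below uses in-degree as a simple centrality on invented<br />
link data (a full social network analysis would add further measures):<br />

```python
from collections import Counter

# Hypothetical hyperlinks harvested by a crawler: (source blog, linked blog).
links = [
    ("blogA", "blogB"), ("blogA", "blogC"), ("blogD", "blogB"),
    ("blogE", "blogB"), ("blogC", "blogB"), ("blogE", "blogC"),
]

# In-degree centrality: blogs that many others link to are candidates
# for high relevance in marketing communication.
in_degree = Counter(dst for _, dst in links)
ranking = sorted(in_degree.items(), key=lambda kv: -kv[1])
print(ranking)
```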
Key words: Marketing Communication, Social Network Analysis, Web Mining,<br />
Weblog<br />
References<br />
Anjewierden, A., Brussee, R., and Efimova, L. (2004): Shared Conceptualizations in<br />
Weblogs. In: T.N. Burg (Ed.) BlogTalk 2.0, Vienna.<br />
Blood, R. (2002): We’ve Got Blog: How Weblogs are Changing our Culture. Perseus,<br />
Cambridge.<br />
Borgatti, S.P., Carley, K.M., and Krackhardt, D. (2006): Robustness of Centrality<br />
Measures Under Conditions of Imperfect Data. Social Networks, 28, 234–236.<br />
Chin, A. and Chignell, M. (2006): Finding Evidence of Community from Blogging<br />
Co-citations: A Social Network Analytic Approach. In: Proceedings of 3rd<br />
IADIS International Conference Web Based Communities 2006. San Sebastian,<br />
Spain, 191–200.<br />
Leskovec, J., McGlohon, M., Faloutsos, C., Glance, N.S., and Hurst, M. (2007):<br />
Cascading Behavior in Large Blog Graphs. In: SDM ’07: SIAM Conference on<br />
Data Mining.<br />
− 91 −
ChIPmix: Mixture model of regressions for<br />
ChIP-chip experiment analysis<br />
Marie-Laure Martin-Magniette 1,2 and Tristan Mary-Huard 1 and Caroline<br />
Bérard 1,2 and Stéphane Robin 1<br />
1 UMR AgroParisTech/INRA MIA 518<br />
2 URGV UMR INRA/CNRS/UEVE<br />
Abstract. The chromatin immunoprecipitation on chip (ChIP on chip) technology<br />
is used to investigate proteins associated with DNA by hybridization to a microarray.<br />
In a two-color ChIP-chip experiment, two samples are compared: DNA fragments<br />
crosslinked to a protein of interest (IP), and genomic DNA (Input). The two samples<br />
are differentially labeled and then co-hybridized on a single array. The goal is<br />
then to identify actual binding targets of the protein of interest, i.e. probes whose<br />
IP intensity is significantly larger than the Input intensity.<br />
We propose a new method called ChIPmix to analyse ChIP-chip data based on<br />
mixture model of regressions. Let (xi, Yi) be the Input and IP intensities of probe i,<br />
respectively. The (unknown) status of the probe is characterized through a label Zi<br />
which is 1 if the probe is enriched and 0 if it is normal (not enriched). We assume<br />
the Input-IP relationship to be:<br />
Yi = a0 + b0 xi + ɛi   if Zi = 0 (normal),<br />
Yi = a1 + b1 xi + ɛi   if Zi = 1 (enriched),<br />
where ɛi is a Gaussian random variable with mean 0 and variance σ^2. The marginal<br />
distribution of Yi for a given level of Input xi is<br />
(1 − π) φ0(Yi|xi) + π φ1(Yi|xi), (1)<br />
where π is the proportion of enriched probes, and φj(·|x) stands for the probability<br />
density function of a Gaussian distribution with mean aj + bj x and variance σ^2.<br />
The mixture parameters (proportion, intercepts, slopes and variance) are estimated<br />
using the EM algorithm. Posterior probabilities are used to classify probes into the<br />
normal or enriched class. In hypothesis testing theory, false discoveries are<br />
controlled by bounding the probability of wrongly rejecting the null hypothesis.<br />
We propose an analogous concept in the mixture model framework. Our aim is to<br />
control the probability for a probe to be wrongly assigned to the enriched class.<br />
Therefore we control Pr{τi > s | xi, Zi = 0} = α for a predefined level α.<br />
We present several applications of ChIPmix to promoter DNA methylation and<br />
histone modification data and show that ChIPmix competes with classical methods<br />
such as NimbleGen and ChIPOTle.<br />
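The model above can be fitted with a standard EM algorithm for mixtures of<br />
regressions. The sketch below simulates probes from two assumed regression lines<br />
and recovers the parameters; it follows the model description, not necessarily the<br />
authors' implementation:<br />

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated probes (true values assumed): normal probes follow
# y = 0.5 + 0.9 x, enriched probes y = 1.0 + 1.2 x, sigma = 0.3.
n, pi_true = 2000, 0.2
x = rng.uniform(6, 12, n)
z_true = rng.random(n) < pi_true
y = np.where(z_true, 1.0 + 1.2 * x, 0.5 + 0.9 * x) + rng.normal(0, 0.3, n)

# EM for a two-component mixture of regressions with common variance.
X = np.column_stack([np.ones(n), x])
pi, coef, s2 = 0.5, np.array([[0.0, 1.0], [2.0, 1.0]]), 1.0
for _ in range(100):
    # E step: tau_i = posterior probability of being enriched; the common
    # Gaussian normalizing constant cancels in the ratio.
    resid = y[None, :] - coef @ X.T
    dens = np.exp(-resid ** 2 / (2 * s2))
    w = np.array([1 - pi, pi])[:, None] * dens
    tau = w[1] / w.sum(axis=0)
    # M step: weighted least squares per component, then pi and sigma^2.
    for j, g in enumerate([1 - tau, tau]):
        Xg = X * g[:, None]
        coef[j] = np.linalg.solve(Xg.T @ X, Xg.T @ y)
    resid = y[None, :] - coef @ X.T
    s2 = np.sum([g * r ** 2 for g, r in zip([1 - tau, tau], resid)]) / n
    pi = tau.mean()

print(pi, coef)
```

Probes with τ above a threshold s are then called enriched, with s chosen to control<br />
the misassignment probability Pr{τi > s | Zi = 0} described above.<br />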
Key words: Classification, Mixture models, ChIP-chip<br />
− 92 −
Clustering of High-Dimensional Data Via<br />
Finite Mixture Models<br />
Geoff McLachlan<br />
Department of Mathematics & Institute for Molecular Bioscience<br />
University of Queensland<br />
Summary. There has been a proliferation of applications in which the number<br />
of experimental units n is comparatively small but the underlying dimension p<br />
is extremely large, as, for example, in microarray-based genomics and other high-throughput<br />
experimental approaches. Hence there has been increasing attention<br />
given not only in bioinformatics and machine learning, but also in mainstream statistics,<br />
to the analysis of complex data in this situation where n is small relative to p.<br />
In this talk, we focus on the clustering of high-dimensional (continuous) data, using<br />
normal mixture models. Their use in this context is not straightforward, as the normal<br />
mixture model is a highly parameterized one with each component-covariance<br />
matrix consisting of p(p + 1)/2 distinct parameters in the unrestricted case. Hence<br />
some restrictions must be imposed and/or a variable selection method applied beforehand.<br />
We shall review the existing literature and consider some new approaches<br />
that have been proposed recently.<br />
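The parameter count that makes the unrestricted model infeasible is easy to<br />
tabulate; the sketch below contrasts full with diagonal component covariances (one<br />
possible restriction) for illustrative g and p:<br />

```python
# Free parameters in a g-component normal mixture on p variables:
# mixing weights (g-1) + means g*p + covariance parameters.
def n_params(g, p, covariance="full"):
    per_cov = p * (p + 1) // 2 if covariance == "full" else p
    return (g - 1) + g * p + g * per_cov

print(n_params(3, 10_000))              # unrestricted: ~150 million
print(n_params(3, 10_000, "diagonal"))  # restricted: ~60 thousand
```

With n in the tens or hundreds, the unrestricted count dwarfs the sample size,<br />
which is why restrictions or prior variable selection are unavoidable.<br />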
− 93 −
Majority-rule consensus: from preferences<br />
(social choice) to trees (biology and<br />
classification theory)<br />
F.R. McMorris<br />
Professor of Applied Mathematics<br />
Professor of Computer Science<br />
Illinois Institute of Technology<br />
Chicago, IL 60616, USA<br />
mcmorris@iit.edu<br />
Abstract: The problem of aggregating the individual preferences of a group<br />
of “voters” into a group consensus preference has been studied for many<br />
years. Indeed, mathematical investigations of consensus problems go back<br />
to the contributions of Borda (1784), of Condorcet (1785), and of Pareto<br />
(1896) and are still frequently cited today. One method, the compelling<br />
majority-rule consensus, is so simple (stick something in the output if it is in<br />
more than half of the input) that it seems nothing really interesting can be<br />
said about it. This presentation will give some historic background from the<br />
classical preference case (e.g., voting), and then point out some new and old<br />
mathematical and computational complexity results pertaining to the use of<br />
the majority-rule paradigm for finding consensus phylogenetic trees (biology)<br />
and classification structures (data analysis).<br />
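The rule itself fits in a few lines; in this toy sketch each input ("voter" or tree) is<br />
represented by the set of clusters it contains, and the data are invented:<br />

```python
from collections import Counter

# Majority rule: keep every cluster that occurs in more than half of the inputs.
inputs = [
    {frozenset("ab"), frozenset("abc")},
    {frozenset("ab"), frozenset("cd")},
    {frozenset("ab"), frozenset("abc")},
]

counts = Counter(cl for voter in inputs for cl in voter)
majority = {cl for cl, c in counts.items() if c > len(inputs) / 2}
print(majority)
```

A classical result for trees is that the majority clusters are pairwise compatible,<br />
so they always assemble into a consensus tree.<br />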
− 94 −
Optimization Methods with Evolutionary<br />
Algorithms and Artificial Neural Networks<br />
Rene Meier and Franz Joos<br />
Helmut-Schmidt-University<br />
University of the Federal Armed Forces Hamburg<br />
Power Engineering<br />
Laboratory of Turbomachinery<br />
Abstract. In order to optimize turbomachinery components it is necessary to<br />
describe the behaviour of multimodal objective functions (OF). But it is time-consuming<br />
to evaluate the characteristics of these OF with a three-dimensional<br />
Navier-Stokes solver. Instead, an Artificial Neural Network (ANN) is used as an<br />
interpolator, based on information contained in a database, to correlate the performance<br />
to the geometrical parameters as a compressible three-dimensional<br />
Reynolds-averaged Navier-Stokes solver would do. With a computerized optimization system<br />
an existing centrifugal impeller will be redesigned using an Evolutionary Algorithm<br />
(EA) and an ANN. The ANN allows the evaluation of the OF for many geometries<br />
generated by the EA with less effort than a Navier-Stokes solver. Yet sometimes the<br />
prediction is not accurate and must be verified by means of the more accurate but<br />
time-consuming Navier-Stokes solver. The results of this verification are added to<br />
the database, and a new optimization cycle is started with the expectation that<br />
learning on a larger database will result in a more accurate ANN.<br />
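The optimization cycle can be sketched generically. Below, a cheap test function<br />
stands in for the Navier-Stokes solver, nearest-neighbour interpolation over the<br />
database stands in for the ANN, and random candidates stand in for the EA; only<br />
the loop structure reflects the approach described above:<br />

```python
import numpy as np

rng = np.random.default_rng(5)

def expensive_of(x):
    """Stand-in for the expensive solver evaluation (assumed objective)."""
    return np.sum((x - 0.3) ** 2, axis=-1)

db_x = rng.random((20, 3))          # initial geometry database
db_y = expensive_of(db_x)

for cycle in range(5):
    pop = rng.random((200, 3))                       # candidate geometries
    d = np.linalg.norm(pop[:, None] - db_x[None], axis=-1)
    surrogate_y = db_y[d.argmin(axis=1)]             # cheap surrogate prediction
    best = pop[surrogate_y.argmin()]                 # most promising candidate
    y_true = expensive_of(best)                      # verify with expensive solver
    db_x = np.vstack([db_x, best])                   # grow the database ...
    db_y = np.append(db_y, y_true)                   # ... and retrain next cycle

print(db_y.min())
```

Each cycle spends exactly one expensive evaluation, exactly as in the described<br />
verify-and-extend strategy.<br />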
− 95 −
Finding Music Fads by Clustering Online Radio<br />
Data with Emergent Self-Organizing Maps<br />
Florian Meyer and Alfred Ultsch<br />
Databionics Research Group<br />
University of Marburg, Germany<br />
{meyer,ultsch}@informatik.uni-marburg.de<br />
Abstract. Music charts provide a simple statistic of records sold. Due to Web 2.0<br />
and its social networks, detailed information from listeners is available. In particular,<br />
there are user-generated keywords, so-called tags, that group songs into genres. An<br />
important topic for the music industry is music fads, i.e., short time intervals of a<br />
few weeks with a strong persistence of similar music. A distance measure on weekly<br />
music charts and tags is used. The sequence of music charts is visualized using<br />
Emergent Self-Organizing Maps (ESOM). Fads are automatically found by clustering<br />
the charts with the U*C clustering algorithm on the ESOM. U*C does not need an<br />
estimate of the number of clusters. Machine-learned decision rules describe fads<br />
using the dominant genres.<br />
Key words: ESOM, U*C, Clustering, Tagged Music, Knowledge Representation<br />
− 96 −
Deviant box and dual clusters for the analysis<br />
of conceptual contexts<br />
Boris Mirkin<br />
School of Computer Science and Information Systems<br />
Birkbeck University of London, Malet street, London, WC1E 7HX, UK<br />
mirkin@dcs.bbk.ac.uk<br />
Summary. This work relates to the frameworks of biclustering (Madeira and<br />
Oliveira 2004, Mirkin 1996) and formal concept analysis (Ganter and Wille 1999).<br />
A formal concept over a 1/0 rectangular matrix r, whose row-set is I and column-set<br />
is J, is a maximal pair (V, W) such that V ⊂ I and W ⊂ J and all r-elements<br />
within V × W are unities. The lattices of formal concepts have found interesting applications<br />
in such areas as association between itemsets and post-processing of web-search<br />
results. However, in many applications the notion of formal concept seems overly<br />
rigid because it does not allow any errors or peculiarities in 1/0 encoding (Pensa<br />
and Boullicaut 2005). This is why we take the concept of a data-approximating box<br />
V × W (Mirkin, Arabie and Hubert 1995) and use it in the framework of a disjunctive<br />
biclustering model approximating the data matrix r.<br />
We develop a method, Box(a), for fitting the model with possibly overlapping<br />
boxes by using a local search algorithm for finding an optimal box starting from a<br />
pre-specified row or column a and using a parameter b shifting the values of r to<br />
r − b. It is proven that the method leads to highly deviant boxes, which is accounted<br />
for by a variance measure.<br />
We further proceed to develop a dual clustering framework by multiplying the<br />
original model equation by its transposed version both on the right and on the<br />
left. The two equations lead to disjunctive decompositions of similarity matrices,<br />
(r − b) ∗ (r − b) ′ and (r − b) ′ ∗ (r − b) over clusters on row set I and column set<br />
J, respectively. This dual clustering framework formalizes the notion that good<br />
concepts should relate only such row and column sets that are similarity clusters<br />
on their own. A local search method for simultaneously fitting the dual clustering<br />
models, Dual(i, j), is developed using an evolutionary algorithm for optimizing<br />
the common intensity value of the clusters.<br />
Results of experiments on generated and real data sets are reported, supporting<br />
the effectiveness of the algorithms.<br />
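The underlying notion of a formal concept can be computed directly on a small<br />
context via the closure operator; the matrix and starting column set below are<br />
invented:<br />

```python
import numpy as np

# A small 1/0 context: rows = objects, columns = attributes.
r = np.array([[1, 1, 0, 1],
              [1, 1, 0, 0],
              [0, 1, 1, 0]])

def concept_from_cols(W):
    """Close a column set W: take all rows with ones on W (the extent),
    then all columns with ones on those rows (the closed intent)."""
    V = np.where(r[:, W].all(axis=1))[0]
    W_closed = np.where(r[V].all(axis=0))[0]
    return set(V), set(W_closed)

print(concept_from_cols([0, 1]))
```

A box-based method relaxes exactly this all-ones requirement, tolerating errors and<br />
peculiarities in the 0/1 encoding.<br />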
Key words: Formal concept, Biclustering, Dual clustering, Scale shift<br />
− 97 −
Clustering a Contingency Table Accompanied<br />
by Visualization<br />
Hans-Joachim Mucha<br />
Weierstraß-Institut für Angewandte Analysis und Stochastik (WIAS),<br />
D-10117 Berlin, Germany, mucha@wias-berlin.de<br />
Abstract. Clustering techniques can be used for segmenting a heterogeneous two-way<br />
contingency table into smaller, homogeneous parts. Following the paper of<br />
Greenacre (1988), the focus here is on chi-square decompositions of the Pearson<br />
chi-square statistic by clustering the rows and/or the columns of a contingency table.<br />
Especially the hierarchical Ward method as well as a generalization of Ward’s method<br />
will be considered. The latter can find clusters of different volume. Additionally, one<br />
can show that it is also possible to carry out partitional cluster analysis by starting<br />
from pairwise chi-square distances. Partitional clustering techniques optimize<br />
some numerical criterion with respect to a fixed number of clusters K. Often<br />
partitional cluster analysis attains better solutions than hierarchical cluster analysis.<br />
In any case, correspondence analysis is the appropriate visualization tool for<br />
both the contingency table and the clusters of the rows and/or columns. Moreover,<br />
the correspondence analysis plots become more informative through an additional<br />
projection of a dendrogram, which shows the hierarchy of clusters. An application<br />
from the field of ecology illustrates the segmentation of a contingency table<br />
using different cluster analysis techniques.<br />
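As a hedged illustration of the chi-square geometry this abstract builds on (not the author’s own implementation — the function name and NumPy sketch below are assumptions), the pairwise chi-square distances between the row profiles of a contingency table, which can serve as input to hierarchical or partitional clustering, could be computed like this:<br />

```python
import numpy as np

def chi2_row_distances(N):
    """Pairwise chi-square distances between the row profiles
    of a two-way contingency table N of nonnegative counts."""
    P = N / N.sum()              # correspondence matrix
    r = P.sum(axis=1)            # row masses
    c = P.sum(axis=0)            # column masses
    profiles = P / r[:, None]    # row profiles (each row sums to 1)
    # the squared chi-square distance weights each column by 1 / column mass
    diff = profiles[:, None, :] - profiles[None, :, :]
    return np.sqrt((diff ** 2 / c).sum(axis=2))
```

Rows with identical profiles get distance zero, so such a matrix could feed a Ward-type hierarchical method or a partitional method starting from pairwise chi-square distances, as described above.<br />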
Key words: chi-square distance, hierarchical clustering, partitional clustering, correspondence<br />
analysis, dendrogram<br />
References<br />
Greenacre, M. J. (1988): Clustering the Rows and Columns of a Contingency Table.<br />
Journal of Classification 5, 39–51.<br />
− 98 −
Predictive classification trees<br />
Ulrich Müller-Funk and Stephan Dlugosz<br />
Institut für Wirtschaftsinformatik<br />
University of Münster<br />
Germany<br />
Abstract. Tree-based algorithms for classification and regression are highly popular<br />
because they give rise to results that are easy to interpret and to communicate.<br />
(Some people argue, moreover, that factor selection comes along automatically. This<br />
point, too, will be challenged in the paper.) CART and (exhaustive) CHAID figure<br />
prominently among the procedures actually used in data-based management.<br />
CART is a well-established, nonlinear and nonparametric procedure that produces<br />
binary trees. CHAID, in contrast, admits multiple splits, a feature that allows one<br />
to exploit the splitting variable more extensively. On the other hand, that procedure<br />
depends on premises that are questionable in practical applications. This can be<br />
attributed to the fact that CHAID relies on simultaneous chi-square or F-tests.<br />
Both types of procedures – as implemented in SPSS, for instance – do not take into<br />
account ordinal dependent variables. In the paper we suggest a tree-algorithm that<br />
• requires categorical variables,<br />
• chooses splitting attributes by means of predictive measures of association,<br />
• determines the cells to be united, and hence the number of splits,<br />
with the help of their conditional predictive power,<br />
• takes ordinal dependent variables into consideration.<br />
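One classical "predictive measure of association" of the kind the list above refers to is the Goodman–Kruskal lambda; the sketch below is an illustrative assumption, not the authors’ actual splitting criterion:<br />

```python
from collections import Counter

def goodman_kruskal_lambda(pairs):
    """Goodman-Kruskal lambda: proportional reduction in error when
    predicting y from x, given (x, y) observations."""
    y_counts = Counter(y for _, y in pairs)
    baseline_error = len(pairs) - max(y_counts.values())
    by_x = {}
    for x, y in pairs:
        by_x.setdefault(x, Counter())[y] += 1
    # prediction error when the modal y within each x-cell is predicted
    error_given_x = sum(sum(c.values()) - max(c.values()) for c in by_x.values())
    if baseline_error == 0:
        return 0.0
    return (baseline_error - error_given_x) / baseline_error
```

Lambda is 1 when x predicts y perfectly and 0 when knowing x does not improve the modal prediction at all, which makes it a natural candidate for ranking splitting attributes.<br />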
− 99 −
Efficient Media Exploitation towards Collective<br />
Intelligence<br />
Phivos Mylonas 1 , Vassilios Solachidis 2 , Andreas Geyer-Schulz 3 , Bettina<br />
Hoser 3 , Sam Chapman 4 , Fabio Ciravegna 4 , Steffen Staab 5 , Pavel Smrz 6 ,<br />
Yiannis Kompatsiaris 2 , and Yannis Avrithis 1<br />
1 National Technical University of Athens, Image, Video and Multimedia Systems<br />
Laboratory, Iroon Polytechneiou 9, Zographou Campus, Athens, GR 157 80,<br />
Greece, {fmylonas, iavr}@image.ntua.gr<br />
2 Centre of Research and Technology Hellas, Informatics and Telematics Institute,<br />
1st Km Thermi-Panorama Road, Thermi-Thessaloniki, GR 570 01, Greece,<br />
{vsol, ikom}@iti.gr<br />
3 Department of Economics and Business Engineering, Information Service and<br />
Electronic Markets, Kaiserstraße 12, Karlsruhe 76128, Germany<br />
{andreas.geyer-schulz, bettina.hoser}@kit.edu<br />
4 University of Sheffield, Department of Computer Science, Regent Court, 211<br />
Portobello Street, S1 4DP, Sheffield, UK {s.chapman, fabio}@dcs.shef.ac.uk<br />
5 Universität Koblenz-Landau, Information Systems and Semantic Web,<br />
Universitätsstraße 1, 57070 Koblenz, Germany, staab@uni-koblenz.de<br />
6 Brno University of Technology, Faculty of Information Technology, Bozetechova<br />
2, CZ-61266 Brno, Czech Republic smrz@fit.vutbr.cz<br />
Abstract. In this work we propose intelligent, automated content analysis techniques<br />
for different media to extract knowledge from the multimedia content. Information<br />
derived from different sources/modalities will be analyzed and fused, in<br />
terms of spatiotemporal, personal and even social contextual information. In order<br />
to achieve this goal, semantic analysis will be applied to the content items, taking<br />
into account the content itself (e.g. text, images and video), as well as existing<br />
personal, social and contextual information (e.g. semantic and machine-processable<br />
metadata and tags). The above process exploits the so-called “Media Intelligence”<br />
towards the ultimate goal of identifying “Collective Intelligence”, emerging from<br />
the collaboration and competition among people, empowering innovative services<br />
and user interactions. The utilization of “Media Intelligence” constitutes a departure<br />
from traditional methods for information sharing, since semantic multimedia<br />
analysis has to fuse information from both the content itself and the social context,<br />
while at the same time the social dynamics have to be taken into account. Such<br />
intelligence provides added-value to the available multimedia content and renders<br />
existing procedures and research efforts more efficient.<br />
− 100 −
Support Vector Machines in the Dual using<br />
Majorization and Kernels<br />
Georgi Nalbantov 1 , Patrick J.F. Groenen 2 , and Cor Bioch 3<br />
1 MICC, Maastricht University and<br />
Econometric Institute, Erasmus University Rotterdam, The Netherlands<br />
nalbantov@few.eur.nl<br />
2 groenen@few.eur.nl<br />
3 bioch@few.eur.nl<br />
Abstract. Recently, Support Vector Machines (SVMs) have proved to be a quite<br />
successful method for classification. From a practical point of view, one bottleneck of<br />
this approach is that existing solvers are rather slow. Usually, SVM<br />
solvers use specialized iterative optimization algorithms to solve the SVM optimization<br />
problem, and these algorithms are quite slow, especially in the so-called dual SVM formulation.<br />
Here, we propose to use another iterative method, which is a majorization method.<br />
It has already been applied successfully for solving the primal SVM formulation (see<br />
Groenen, Nalbantov, and Bioch, 2007, 2008). The contribution of this paper is to<br />
extend it to the dual formulation. This opens the door for, first of all, using different<br />
so-called kernel functions, which allow for nonlinear decision functions, and second,<br />
for handling more efficiently linear problems where the number of input variables is<br />
bigger than the number of observations.<br />
Key words: Support vector machines, Iterative majorization, Binary classification<br />
problem, Kernels<br />
References<br />
Groenen, P.J.F., Nalbantov, G., and Bioch, J.C. (2007): Nonlinear support vector<br />
machines through iterative majorization and I-splines. In: R.Decker, H-.J. Lenz<br />
(Eds.): Advances in data analysis. Springer, Berlin, 149–162.<br />
Groenen, P.J.F., Nalbantov, G., and Bioch, J.C. (2008, in press): SVM-Maj: A<br />
Majorization Approach to Linear Support Vector Machines with Different Hinge<br />
Errors. Advances in Data Analysis and Classification.<br />
− 101 −
Approach for Dynamic Problems in Clustering<br />
Anneke Neumann, Klaus Ambrosi, and Felix Hahne<br />
Institut für Betriebswirtschaft und Wirtschaftsinformatik<br />
Stiftung Universität Hildesheim<br />
{aneumann,ambrosi,hahne}@bwl.uni-hildesheim.de<br />
Abstract. In cluster analysis, a variety of methods has been developed for different<br />
areas of application (e.g. economics, biology, medicine, psychology), some of<br />
which were implemented in data evaluation software packages (e.g. SPSS, SAS). In<br />
many scenarios, particularly economic ones, special methods are required in order<br />
to analyze the development of clusters over time. While there are such methodical<br />
extensions for factor analysis and multidimensional scaling, hardly any dynamic<br />
approaches exist in the field of cluster analysis.<br />
In this talk, special attention will be paid to dynamic fuzzy clustering problems.<br />
Known approaches will be reviewed critically concerning their applicability in dynamic<br />
problems, and a new fuzzy clustering approach will be introduced which can<br />
be applied to dynamic problems.<br />
Key words: Clustering, Fuzzy Clustering, Dynamic Data Analysis<br />
References<br />
Basford, K.E. and McLachlan, G.J. (1985): The Mixture Method of Clustering Applied<br />
to Three-Way Data. Journal of Classification, 2, 109–125.<br />
Höppner, F., Klawonn, F., Kruse, R., and Runkler, T. (1999): Fuzzy Cluster Analysis.<br />
Wiley, Chichester, New York.<br />
Joentgen, A., Mikenina, L., Weber, B., and Zimmermann, H.-J. (1999): Dynamic<br />
fuzzy data analysis based on similarity between functions. Fuzzy Sets and Systems,<br />
105, 81–90.<br />
Tucker, L.R. (1966): Some mathematical notes on three-mode factor analysis. Psychometrika,<br />
31, 279–311.<br />
− 102 −
Robust fitting of mixtures: The approach based<br />
on the Trimmed Likelihood Estimator<br />
Neyko Neykov 1 , Peter Filzmoser 2 , and Plamen Neytchev 1<br />
1 National Institute of Meteorology and Hydrology, Bulgarian Academy of<br />
Sciences, Sofia, Bulgaria, {Neyko.Neykov}{Plamen.Neytchev}@meteo.bg<br />
2 Department of Statistics and Probability Theory, Vienna University of<br />
Technology, Austria, P.Filzmoser@tuwien.ac.at<br />
Abstract. The Maximum Likelihood Estimator (MLE) has commonly been used to<br />
estimate the unknown parameters in a finite mixture of distributions. However, the<br />
MLE can be very sensitive to outliers in the data. In order to overcome this problem,<br />
Neykov et al. (2007) adapted the trimmed likelihood methodology developed by<br />
Vandev and Neykov (1998) and Neykov and Müller (2003) to estimate mixtures<br />
in a robust way. The superiority of this approach in comparison with the MLE<br />
is illustrated by examples and simulation studies. The behavior of the widely used<br />
classical criteria for the assessment of the number of components in a mixture model<br />
and their robustified versions is also studied in the presence of outliers.<br />
Key words: Trimmed likelihood estimator, Finite mixtures of distributions<br />
References<br />
Maronna, R., Martin, R.D. and Yohai, V.J. (2006): Robust Statistics: Theory and<br />
Methods. Wiley, New York.<br />
McLachlan, G.J. and Peel, D. (2000): Finite Mixture Models. Wiley, New York.<br />
Neykov, N. and Müller, C. (2003): Breakdown Point and Computation of Trimmed<br />
Likelihood Estimators in GLMs. In: R. Dutter et al., (eds), Developments in<br />
robust statistics, pp. 277–286, Physica Verlag, Heidelberg.<br />
Neykov, N.M., Filzmoser, P., Dimova, R. and Neytchev, P.N. (2004): Mixture of<br />
Generalized Linear Models and the Trimmed Likelihood Methodology. In: J.<br />
Antoch (Ed.): Proceedings in Computational Statistics. Physica-Verlag, Heidelberg,<br />
1585–1592.<br />
Neykov, N., Filzmoser, P., Dimova, R. and Neytchev, P. (2007): Robust Fitting of<br />
Mixtures Using the Trimmed Likelihood Estimator. Computational Statistics &<br />
Data Analysis, 17(3), 299–308.<br />
Vandev, D.L. and Neykov, N.M. (1998): About Regression Estimators with High<br />
Breakdown Point. Statistics, 32, 111–129.<br />
− 103 −
Cluster Tree Estimation using a Generalized<br />
Single Linkage Method<br />
Rebecca Nugent 1 and Werner Stuetzle 2<br />
1 Department of Statistics, Carnegie Mellon University, Baker Hall, Pittsburgh,<br />
PA 15213, USA. rnugent@stat.cmu.edu<br />
2 Department of Statistics, University of Washington, Box 354322, Seattle, WA<br />
98195, USA wxs@stat.washington.edu<br />
Abstract. The goal of clustering is to detect the presence of distinct groups in a<br />
data set and assign group labels to the observations. In nonparametric clustering,<br />
we regard the observations as a sample from an underlying density and assume that<br />
groups correspond to modes of this density. The goal then is to find the modes<br />
and assign each observation to the domain of attraction of a mode. The (possibly<br />
hierarchical) modal structure of a density is summarized by its cluster tree; modes<br />
of the density correspond to leaves in the cluster tree. Estimating this cluster tree<br />
is the fundamental goal of nonparametric cluster analysis.<br />
We adopt a plug-in approach: estimate the cluster tree of the underlying density<br />
by the cluster tree of a density estimate. For density estimates that are piecewise<br />
constant (and so have computationally tractable level sets), the cluster tree can<br />
be computed exactly. However, for other density estimates, particularly in high dimensions,<br />
we have to be content with an approximation. We present a graph-based<br />
method that approximates the cluster tree for any density estimate and includes<br />
the introduction of a density-based similarity measure between observations. After<br />
motivating the method, we show results that allow us to reduce the graph to a<br />
spanning tree and then sketch an algorithm that allows the exact computation of the<br />
spanning tree whose edge weights are not of closed form. We point out mathematical<br />
and algorithmic similarities to single linkage clustering and illustrate our approach<br />
on several examples.<br />
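The similarity to single linkage mentioned above can be made concrete: single-linkage merge heights coincide with the sorted edge weights of a Euclidean minimum spanning tree. The Prim-style sketch below is an illustrative assumption, not the authors’ graph-based algorithm:<br />

```python
import math

def mst_edges(points):
    """Prim's algorithm: edges (i, j, weight) of the Euclidean minimum
    spanning tree; sorted weights equal single-linkage merge heights."""
    n = len(points)
    in_tree = {0}
    # best[k] = (distance to the tree, nearest tree vertex)
    best = {i: (math.dist(points[0], points[i]), 0) for i in range(1, n)}
    edges = []
    while len(in_tree) < n:
        j = min(best, key=lambda k: best[k][0])
        d, parent = best.pop(j)
        edges.append((parent, j, d))
        in_tree.add(j)
        for k in best:
            dk = math.dist(points[j], points[k])
            if dk < best[k][0]:
                best[k] = (dk, j)
    return sorted(edges, key=lambda e: e[2])
```

Cutting all MST edges longer than a threshold yields exactly the single-linkage clusters at that height, which is the structural link the abstract exploits when reducing the graph to a spanning tree.<br />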
Key words: cluster analysis, single linkage clustering, level sets, minimum density<br />
similarity measure, nearest neighbor density estimation<br />
References<br />
Stuetzle, W. and Nugent, R. (2007): A generalized single linkage method for estimating<br />
the cluster tree of a density. Technical Report 514, Department of Statistics,<br />
University of Washington.<br />
− 104 −
Multi-Class Extension of Verifiable Ensemble<br />
Models for Safety-Related Applications<br />
Sebastian Nusser 1,2 , Clemens Otte 1 , and Werner Hauptmann 1<br />
1 Siemens AG, Corporate Technology, Otto-Hahn-Ring 6, 81730 Munich, Germany,<br />
{sebastian.nusser.ext,clemens.otte,werner.hauptmann}@siemens.com<br />
2 School of Computer Science, Otto-von-Guericke-University of Magdeburg,<br />
Universitätsplatz 2, 39106 Magdeburg, Germany<br />
Abstract. For safety-related applications, models learned from data must be verifiable<br />
and, thus, interpretable by domain experts. In a previous work (Nusser et al.,<br />
2007) we developed a sequential covering algorithm for binary classification problems<br />
in safety-related domains. It is based on ensembles of low-dimensional submodels,<br />
where each submodel as well as the overall ensemble model can be verified. Thus, the<br />
correct interpolation and extrapolation behavior of the complete model can be guaranteed.<br />
In the present contribution we extend the approach to multi-class problems.<br />
The extension is not straightforward since common methods like one-against-one or<br />
one-against-rest voting (Friedman, 1996; Hsu and Lin, 2002) may introduce inconsistencies.<br />
We show that inconsistencies can be avoided by introducing a hierarchy<br />
of misclassification costs. Such a hierarchy is used to define a strict ordering of the<br />
kind: “class c1 should never be misclassified, class c2 might only be misclassified as<br />
c1, class c3 might be misclassified as c1 or c2.” Our method follows a sequential<br />
covering concept also for multi-class classification: low-dimensional submodels are<br />
trained to separate the samples of the class with the minimal misclassification costs<br />
from the samples of all remaining classes. If the problem is solved for this class or<br />
no further improvements are possible, all remaining samples of this class are removed<br />
from the training data set and the procedure is repeated for the next class<br />
within the hierarchy of misclassification costs. Experimental evaluation carried out<br />
on benchmark data sets from the UCI Machine Learning Repository shows a good<br />
trade-off between interpretability and prediction accuracy of our method.<br />
Key words: multi-class, ensemble learning, local modeling, interpretability<br />
References<br />
Friedman, J. H. (1996). Another approach to polychotomous classification. Technical<br />
report, Department of Statistics, Stanford University.<br />
Hsu, C.-W. and Lin, C.-J. (2002). A comparison of methods for multiclass support<br />
vector machines. Neural Networks, IEEE Trans. on, 13(2):415–425.<br />
Nusser, S., Otte, C., and Hauptmann, W. (2007). Learning binary classifiers for<br />
applications in safety-related domains. In Proceedings of 17th Workshop Computational<br />
Intelligence, pages 139–151. Universitätsverlag Karlsruhe.<br />
− 105 −
Analysis of Borrowing and Guaranteeing<br />
Relationships among Government Officials in<br />
the Eighth Century in the Old Capital of Japan<br />
Akinori Okada 1 and Towao Sakaehara 2<br />
1 Graduate School of Management and Information Sciences, Tama University,<br />
4-1-4 Hijirigaoka, Tama-shi, Tokyo 206-0022, Japan, okada@tama.ac.jp<br />
2 Department of History, Faculty and Graduate School of Literature and Human<br />
Sciences, Osaka City University,<br />
Sugimoto-cho 3, Sumiyoshi-ku, Osaka City 558-8585, Japan,<br />
sakaehar@lit.osaka-cu.ac.jp<br />
Abstract. In the present study, relationships among lower-ranked government officials<br />
working in the old capital of Japan, called Heijo-kyo, in the eighth century are<br />
analyzed. They were engaged in copying Buddhist sutras in the capital. The documents<br />
which show the borrowing and guaranteeing relationships among them have<br />
been kept in the governmental warehouse called Shoso-in (Sakaehara, 1987). The<br />
documents tell (a) the borrower, (b) the amount of money borrowed, (c) who stood<br />
guarantee for the borrower, (d) the date of borrowing. From these documents, the<br />
table which shows the borrowing and guaranteeing relationships among government<br />
officials was derived. The (j, k) element of the table shows the amount of money<br />
the government official corresponding to row j borrowed which was guaranteed by<br />
the government official corresponding to column k. One who stood guarantee for<br />
his colleague seem more dominant than one who borrowed. These relationships are<br />
asymmetric. The table was derived for the years 772, 773, and 774 (including the<br />
beginning of 775). The table was analyzed by the asymmetric multidimensional scaling<br />
(Okada and Imaizumi, 1997). The obtained configuration shows the dominance<br />
relationships among government officials and groups of them.<br />
Key words: Asymmetry, Borrowing and guaranteeing relationships, Historical<br />
data, Multidimensional scaling<br />
References<br />
Okada, A. and Imaizumi, T. (1997): Asymmetric multidimensional scaling of twomode<br />
three-way proximities. Journal of Classification, 14, 195–224.<br />
Sakaehara, T. (1987): People’s Life Styles in the Capital City. In: T. Kishi<br />
(Ed.): Modes of Life in the Capital Cities. Chuokoron-sha, Tokyo, 187–266.<br />
(in Japanese)<br />
− 106 −
Variable Selection for kernel classifiers:<br />
A Feature-to-Input Space Approach<br />
Surette Oosthuizen and Sarel Steel<br />
Department of Statistics and Actuarial Science, University of Stellenbosch, Private<br />
Bag X1, 7602 Matieland, South Africa (surette@sun.ac.za; sjst@sun.ac.za)<br />
Abstract. Consider using values of input variables X1, X2, · · · , Xp to classify entities<br />
into one of two groups. Kernel classifiers, e.g. support vector machines (SVMs)<br />
and kernel Fisher discriminant analysis (KFDA), are known to be exceptionally well<br />
suited for this task. In general the classification accuracy of SVMs and KFDA can<br />
however be improved substantially if instead of the comprehensive set of p input<br />
variables, a smaller subset of (say m) input variables is used. Let the space in which<br />
the training patterns reside, be called the input space. Also, let Φ map the input<br />
space to a higher-dimensional so-called feature space. An aspect which complicates<br />
variable selection for non-linear kernel classifiers is that they make implicit use of<br />
Φ: they are linear functions in a higher-dimensional feature space. Since Φ is usually<br />
unknown, and the feature space can be infinite-dimensional, the implicit transformation<br />
step obscures the contributions of variables to the kernel discriminant function.<br />
In this paper we propose a new variable selection approach for kernel classifiers,<br />
viz. so-called feature-to-input space (FI) selection. The basic idea underlying this<br />
approach is to combine the information obtained from feature space with the easy<br />
interpretation in input space. We discuss several approaches and evaluate the resulting<br />
selection criteria in a fairly extensive simulation study.<br />
Key words: Variable Selection, Kernel Based Classification, Kernel Methods<br />
References<br />
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Smola, A.J. and Müller, K.-R. (1999).<br />
Fisher discriminant analysis with kernels. Proceedings of Neural Networks for<br />
Signal Processing, 9, 41-48.<br />
Shawe-Taylor, J. and Cristianini, N. (2004). Kernel methods for pattern analysis.<br />
Cambridge University Press, Cambridge.<br />
Rakotomamonjy, A. (2002). Variable selection using SVM-based criteria. Perception<br />
Systeme Information, Insa de Rouen, Technical Report PSI 2002-04.<br />
Rakotomamonjy, A. (2003). Variable selection using SVM-based criteria. Journal of<br />
Machine Learning Research, 3, 1357-1370.<br />
− 107 −
Classifying hospitals with respect to their<br />
diagnostic diversity using Shannon’s entropy<br />
Thomas Ostermann 1 , Reinhard Schuster 2 , and Christoph Erben 2<br />
1 Department of Medical Theory and Complementary Medicine, University of<br />
Witten/Herdecke, Gerhard-Kienle-Weg 4, 58313 Herdecke, Germany,<br />
thomaso@uni-wh.de<br />
2 Medical Review Board of the Statutory Health Insurance in North Germany,<br />
Katharinenstr 11a, 23554 Luebeck, Germany, Reinhard.Schuster@mdk-nord.de<br />
Abstract. Background: In Germany hospital comparisons are part of health status<br />
reporting. However, some methodological problems arise in classifying hospitals<br />
by means of their reported data. This article presents the application of Shannon’s<br />
entropy measure for hospital comparisons. Material and Methods: We used Shannon’s<br />
entropy given by E(p1, . . . , pn) = −Σk=1,...,n pk log pk as an approach to measure<br />
the diagnostic diversity of a hospital department. Based on a data set of aggregated<br />
three-digit ICD-9 codes from the L4 hospital statistics of 1998 in Schleswig-Holstein<br />
we compared the resulting measures for diagnostic diversity with respect to<br />
the hospital departments area (e.g. surgery, gynecology) and to the hospital status<br />
(primary, secondary, tertiary or specialized hospital). Results: Highly specialized<br />
departments like obstetrics (0.44) or ophthalmology (0.46) generate significantly lower entropy<br />
values than area-spanning departments like radiology (0.52) or general gynaecology<br />
(0.56). Discussion: We showed how entropy<br />
can be used as a measure for classifying hospitals. Our approach can in principle be<br />
applied in all fields of health services research where categorical data arise.<br />
Especially for DRG data, this approach is quite promising and should be applied.<br />
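As a minimal sketch of the entropy computation above (the function name and the use of the natural logarithm are assumptions; the paper does not state the base of the logarithm):<br />

```python
import math
from collections import Counter

def diagnostic_entropy(codes):
    """Shannon entropy E = -sum_k p_k log p_k of the empirical
    distribution of diagnostic codes in one hospital department."""
    counts = Counter(codes)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())
```

A department concentrated on a single diagnosis scores 0, while one with a uniform spread over n codes scores log n, matching the interpretation of low values for specialized departments and higher values for area-spanning ones.<br />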
Key words: Entropy, diagnostic diversity, hospital comparison, classification<br />
References<br />
Brindle, G.W. and Gibson, C.J. (2007): Entropy as a measure of diversity in an<br />
inventory of medical devices. Medical Engineering & Physics, 29, Epub.<br />
Elayat, H.A., Murphy, B.B. and Prabhakar, N.D. (1978): Entropy in the hierarchical<br />
cluster analysis of hospitals. Health Serv Res, 13, 395–403.<br />
Erben, C.M. (2000): The concept of entropy as a possibility for gathering mass data<br />
for nominal scaled data in health status reporting. Stud Health Technol Inform,<br />
77, 118–119.<br />
− 108 −
Clustering and Dimensionality Reduction to<br />
Discover Interesting Patterns in Binary Data<br />
Francesco Palumbo<br />
Dipartimento di Istituzioni Economiche e Finanziarie - University of Macerata<br />
Via Crescimbeni, 20 - I-62100, Italy<br />
francesco.palumbo@unimc.it<br />
Abstract. A key element in the success of data analysis is the strong contribution<br />
of visualization: dendrograms and factorial plans are intuitive ways to display<br />
association relationships within and among sets of variables and groups of units.<br />
In Association Rules (AR) mining we refer to an n × p data matrix, where n<br />
indicates the number of statistical units and p the number of attributes, which are<br />
also called items. The problem consists in analyzing links between attributes. Sets<br />
of attributes that co-occur throughout the whole data matrix are referred to as patterns.<br />
Scanning the whole data set and analyzing all the relationships is an interesting<br />
and promising approach, yet it leads to an NP-hard problem and becomes<br />
infeasible when dealing with a large number of attributes.<br />
Moreover, in some cases, the most interesting relationships refer to subpopulations<br />
in the data, and they are hidden by the obvious ones and cannot be identified<br />
by the classical descriptive and inferential statistical methods.<br />
The joint use of factorial and clustering methods in a unitary exploratory approach<br />
copes with these issues. It allows the analyst to identify the most interesting<br />
groups of units and sets of attributes; by focusing attention only on these, interesting<br />
patterns are identified more easily in large and huge binary data bases.<br />
Key words: Dimensionality Reduction, Binary Data, AR mining<br />
References<br />
Iodice D’Enza A., Palumbo F. and Greenacre M. (2007): Exploratory data analysis<br />
leading towards the most interesting simple association rules. Comput. Statist.<br />
Data Anal., Corrected Proof, doi:10.1016/j.csda.2007.10.006.<br />
Mizuta, M. (2004): Dimension reduction methods. In J.E. Gentle, W. Hardle and<br />
Y. Mori (Eds.): Handbooks of Computational Statistics. Concepts and Methods.<br />
Springer-Verlag, Heidelberg, pp. 565-589.<br />
Plasse M., Niang N., Saporta G., Villeminot A. and Leblond L. (2007): Combined<br />
use of association rules mining and clustering methods to find relevant links<br />
between binary rare attributes in a large data set. Comput. Statist. Data Anal.,<br />
doi: 10.1016/j.csda.2007.02.020.<br />
− 109 −
Linear Encoding of Multiple Inheritance Hierarchies:<br />
Reviving an Ancient Classification Method<br />
Wiebke Petersen<br />
Institut für Sprache und Information, Heinrich-Heine-Universität Düsseldorf,<br />
petersew@uni-duesseldorf.de<br />
Abstract. The formal methods employed in Pāṇini’s more than two-thousand-year-old<br />
Sanskrit grammar are astonishing in their modernity. This talk is devoted in particular<br />
to the method of representing sets as intervals of a list, which is used for the<br />
classification of sound classes. This method is distinguished by the fact that it allows<br />
certain polyhierarchies (i.e., hierarchies in which a class may have more than one<br />
direct superclass) to be encoded linearly. Monohierarchies can be represented linearly<br />
as nested lists, since they always form a tree structure. For general polyhierarchies,<br />
a linear representation method is still lacking; they are frequently described by means<br />
of a set of constraints, whereby the hierarchy is usually decomposed into the individual<br />
elements of its binary neighborhood relation. As a consequence, many queries, for<br />
example for a hierarchical substructure, have to be processed laboriously via recursive<br />
calls. A further disadvantage of polyhierarchies is that they are often hard to read<br />
because of numerous edge crossings. Since crossing-free hierarchies are better accepted<br />
and understood by the users of a system, many current ontology systems exclude<br />
polyhierarchies, or at least hide them from the user. For these reasons, numerous<br />
formalisms admit only tree-shaped hierarchies.<br />
The talk first gives a complete characterization of those classifications whose<br />
classes can be represented as intervals of a list according to Pāṇini’s method. Such a<br />
classification is called S-representable. It is further shown formally that the Hasse<br />
diagrams of S-representable classifications can always be drawn without crossings.<br />
Finally, it is examined to what extent it is appropriate, for certain applications, to<br />
extend the class of admissible hierarchies from tree-shaped to S-representable ones,<br />
in order to permit at least restricted multiple inheritance on the one hand, without<br />
on the other hand losing the advantages of a crossing-free drawing and of an efficient<br />
linear encoding and processing of hierarchical relations.<br />
Key words: Pāṇini, hierarchies, crossing-free drawing, linear encoding<br />
− 110 −
A Formal Concept Analysis Approach to<br />
Qualitative Citation Analysis<br />
Wiebke Petersen and Petja Heinrich<br />
Institut für Sprache und Information, Heinrich-Heine-Universität Düsseldorf,<br />
petersew@uni-duesseldorf.de<br />
Abstract. The tasks of bibliometrics include citation analysis (Kessler 1963), that<br />
is, the analysis of co-citations (two texts are co-cited if there is a text in which both<br />
are cited) and of bibliographic coupling (two texts are bibliographically coupled if<br />
they share a common citation).<br />
The talk will show that Formal Concept Analysis (FCA) provides suitable means<br />
for a qualitative citation analysis. A special property of FCA is that it permits the<br />
combination of attributes of different kinds (qualitative and scalar). By employing<br />
suitable scales, one can also counter the problem that the large number of texts to<br />
be analyzed in qualitative approaches typically leads to unwieldy citation graphs<br />
whose content cannot be grasped.<br />
The relation of bibliographic coupling is closely related to the neighborhood<br />
contexts developed by Priss, which are used for the analysis of lexical databases.<br />
By means of several example analyses, the most important notions of citation<br />
analysis are modeled in formal contexts and concept lattices. It turns out that the<br />
hierarchical concept lattices of FCA are superior to ordinary citation graphs in many<br />
respects, since their hierarchical lattice structure captures certain regularities explicitly.<br />
Furthermore, it is shown how frequent sources of error such as courtesy citations,<br />
habitual citations, etc. can be countered by combining suitable attributes (doctoral<br />
advisor, institute, department, citation frequency, keywords) and scales.<br />
Key words: Bibliographic coupling, Co-citation, Formal Concept Analysis<br />
References<br />
Ganter, B. and Wille, R. (1999): Formal Concept Analysis. Mathematical Foundations.<br />
Springer, Berlin.<br />
Kessler, M.M. (1963): Bibliographic coupling between scientific papers. American<br />
Documentation, 14, 10–25.<br />
Priss, U. and Old, J. (2004): Modelling Lexical Databases with Formal Concept<br />
Analysis. Journal of Universal Computer Science, 10(8), 967–984.<br />
− 111 −
Analysis of the Power of Some Chosen VaR<br />
Backtesting Procedures: A Simulation Approach<br />
Krzysztof Piontek<br />
Department of Financial Investments and Risk Management<br />
Wroclaw University of Economics<br />
ul. Komandorska 118/120, 53-345 Wroclaw, Poland<br />
krzysztof.piontek@ae.wroc.pl<br />
Abstract. The definition of Value at Risk is quite general, and there are many approaches which can give different VaR values. The challenge is not to suggest yet another method but to distinguish between good and bad models. Backtesting is the statistical procedure necessary to evaluate the performance of VaR models and to select the best one. If the power of a test is low, it is likely to mis-classify an inaccurate VaR model as well-specified, which can be a threat to financial institutions.
The aim of this article is to analyze backtesting methodologies, focusing on limited data sets and the power of the tests. There are three groups of methods for validating VaR models: those based on the frequency of failures, those based on various loss functions, and those based on the adherence of a VaR model to asset return distributions. This article presents and summarizes some frequently used methods from each group (proposed by Kupiec, Christoffersen, Lopez and Berkowitz).
The main part of this work, however, is a statistical evaluation of the most widely applied tests for the small data sets usually observed in practice. We analyze the performance of the tests in terms of the type II error, in order to select the best one for different numbers of observations and model mis-specifications. Asset return simulations are used for this verification. The results indicate that some tests are not adequate for small samples, even with 1000 observations, which is a very important issue when acceptance of internal models for market risk management is considered.
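One of the frequency-of-failures tests named above, Kupiec's proportion-of-failures (POF) test, can be sketched in a few lines. This is a minimal illustrative Python sketch; the function name and the numeric example are our own and are not taken from the paper.

```python
from math import log

def kupiec_lr(T, x, p):
    """Kupiec proportion-of-failures likelihood ratio.

    T: number of backtesting observations
    x: number of observed VaR exceedances (failures)
    p: nominal failure probability of the VaR model (e.g. 0.01)

    Under H0 the statistic is asymptotically chi-squared with 1 df,
    so values above 3.84 reject the model at the 5% level.
    """
    def loglik(prob):
        # binomial log-likelihood of x failures in T trials,
        # guarding the 0 * log(0) cases
        s = 0.0
        if T - x:
            s += (T - x) * log(1.0 - prob)
        if x:
            s += x * log(prob)
        return s
    return -2.0 * (loglik(p) - loglik(x / T))
```

With T = 250 daily observations and a 99% VaR, ten exceedances yield LR of about 13.0 (rejection at the 5% level, critical value 3.84), while three exceedances yield about 0.09 (no rejection); the small-sample power problem discussed above arises because moderate mis-specifications often fall between such extremes.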
Key words: risk measurement, Value at Risk, backtesting, power of tests<br />
References<br />
Haas, M. (2001): New Methods in Backtesting, CAESAR,
www.caesar.de/uploads/media/cae pp 0010 haas 2002-02-05 01.pdf<br />
Piontek, K. (2007): A Survey and a Comparison of Backtesting Procedures<br />
(in Polish), In: P. Chrzan: Metody matematyczne, ekonometryczne..., Katowice.<br />
Sarma, M., Thomas, S., Shah, A. (2003): Selection of Value-at-Risk Models,<br />
ideas.repec.org/s/jof/jforec.html<br />
− 112 −
Testing distributions in errors-in-variables models
Denys Pommeret 1<br />
Aix-Marseille 2 University pommeret@iml.univ-mrs.fr<br />
Abstract. Within the framework of errors-in-variables models we consider the sum of two independent random variables, X = W + Z, where W is the variable of interest with known distribution Π, and where the error Z has unknown density f. We present a smooth goodness-of-fit test for the distribution of the error Z. For that purpose we observe an i.i.d. sample X1, · · · , Xn with mixture density function

g(x) = ∫ f(x, m) Π(dm),

where Π is a real probability distribution and the f(x, m) are real m-parameterized density functions, for m in some set M ⊂ R. We assume that Π, the distribution of W, is known, and we want to test

H0 : f(x, m) = f0(x, m), for all m in M,

where f0 is a specified probability density function. An adaptation of the Neyman smooth test is proposed.
Key words: Mixture models, Neyman’s test, Score statistic, Schwarz’s criterion
References<br />
Hart, J.D. (1997): Nonparametric smoothing and lack-of-fit tests, Springer Series in<br />
Statistics. New York, NY.<br />
Kallenberg, W.C.M. and Ledwina, T. (1995): Consistency and Monte Carlo simulation<br />
of data driven version of smooth goodness of fit tests, Ann. Statist, 23,<br />
1594–1608.<br />
Ledwina, T. (1994): Data-Driven Version of Neyman’s Smooth Test of Fit, Journal
of the American Statistical Association, 89, 1000–1005.
Lehmann, E.L. and Romano, J.P. (2005): Testing statistical hypotheses, 3rd ed.<br />
Springer Texts in Statistics. New York, NY: Springer.<br />
− 113 −
Classification with an increasing number of<br />
components<br />
Odile Pons 1<br />
INRA, Mathematics, Jouy-en-Josas, France Odile.Pons@jouy.inra.fr<br />
Abstract. After estimating the rn actual components of a mixture whose number of components increases with the sample size, the question is to determine to which group a given observation Xi, i = 1, . . . , n, belongs. A classification consists in mapping an observation Xi (or a value x of X) to a class k̂n(Xi) (or k̂n(x)) in {1, . . . , rn}, which may either be uniquely defined (fixed case) or related to a random distribution. In both cases, the component k̂n is chosen by maximum likelihood with a penalization.
A random classification avoids the misclassification of observations with overlapping densities: k(Xi) = j with an estimated probability and k(x) = j with a fixed probability, 1 ≤ j ≤ rn. They are estimated by

µ̂n,k̂n(x) f̂n,k̂n(x) = max_{1≤k≤rn} Qn(µ̂n,k f̂n,k; x), with

Qn(µ̂n,k f̂n,k; x) = µ̂n,k f̂n,k(x) − nλ²n Σ_{1≤j≤rn} π(f̂n,k − f̂n,j) − ν²n Σ_{j=1}^{q} p(µ̂n,j),

where the penalization coefficients λn and νn tend to zero as n → ∞, and π and p are smooth functions. For a random classification, k̂n(Xi) is defined in the same way with some probabilities. The random procedure preserves all the classes: the proportions of observations assigned to the classes {1, . . . , rn} are asymptotically identical to the mixture probabilities.
Key words: Mixture, classification, asymptotics
References<br />
Lemdani, M. and Pons, O. (2007): Large mixture models with an increasing number
of components. Unpublished manuscript.
Pons, O. (2008): Asymptotic distributions in finite semi-parametric mixture models.
To appear.
− 114 −
Bagging with different split criteria
Sergej Potapov and Berthold Lausen<br />
Department of Biometry and Epidemiology, Friedrich-Alexander-University<br />
Erlangen-Nuremberg, Waldstraße 6, D-91054 Erlangen, Germany<br />
Sergej.Potapov@imbe.imed.uni-erlangen.de<br />
Abstract. In recent years many papers have discussed boosting- and bagging-based methods for supervised learning or machine learning. Both concepts aggregate sets of estimated trees, which are derived by split criteria that do not adjust for variables measured on different scales. Breiman et al. (1984) observed that quantitative variables tend to be selected more often than binary variables. As a solution, Lausen et al. (1994, 2004) introduced p-value adjusted classification and regression trees, which use the p-value of maximally selected test statistics as the split criterion. The p-value adjustment avoids the possible selection bias between variables measured on different scales. The R package TWIX of Potapov et al. (2008) offers p-value adjusted classification trees and bagging of p-value adjusted classification trees. In our paper we compare bagging and double-bagging (Hothorn and Lausen, 2003) with and without p-value adjustment by means of simulation. Moreover, we illustrate our approach using a clinical study involving microarray data.
Key words: bagging, CART, machine learning, trees, microarray data
References<br />
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984): Classification and<br />
regression trees. Wadsworth Press.<br />
Hothorn, T., Lausen, B. (2003): Double-bagging: Combining classifiers by bootstrap
aggregation. Pattern Recognition 36(6), 1303–1309.
Lausen, B., Hothorn, T., Bretz, F., Schumacher, M. (2004): Assessment of optimal
selected prognostic factors. Biometrical Journal 46, 364–374.
Lausen, B., Sauerbrei, W., Schumacher, M. (1994): Classification and regression trees<br />
(CART) used for the exploration of prognostic factors measured on different<br />
scales, in: Dirschedl, P., and Ostermann, R. (eds.), Computational Statistics,<br />
Physica-Verlag, Heidelberg, 483–496.<br />
Potapov, S., Theus, M. (2008): The TWIX package (Version 0.2.4.). http://cran.r-project.org
− 115 −
Remarks on the Existence of CML Estimates<br />
for the PCM by means of the R Package eRm<br />
Antonio Punzo 1<br />
University of Milano-Bicocca, Department of Quantitative Methods for Business<br />
and Economic Sciences, a punzo@libero.it<br />
Abstract. Mair and Hatzinger (2007) have recently proposed, in the Journal of Statistical Software, the R package eRm (extended Rasch models) for computing Rasch models and several extensions. Undoubtedly, within the eRm class the partial credit model (PCM) is, for practical testing purposes, one of the best known. The package estimates the item parameters of the above-mentioned models using a unitary conditional maximum likelihood (CML) procedure.
Although the eRm models belong to the Rasch family and share its distinguishing characteristics, they suffer from the problem of possible non-existence of estimates. In the literature, both in the joint and in the conditional ML approach, the configurations and conditions of non-existence are well known for the RM (Fischer, 1981), and the eRm package performs a preliminary data check only for the RM. For the PCM the conditions of non-existence are known only in the joint case (Bertoli-Barsotti, 2005).
In this article the main focus is on the PCM, and the above-mentioned JML non-existence configurations for this model are the starting point. A class of counterexamples is illustrated which leads to “false” CML estimates with the eRm package, i.e., values that appear to be estimates but, on a more accurate analysis of the function being maximized, are rather a clear signal of non-existence. Moreover, the results obtained emphasize the presence of additional CML non-existence configurations, compared to those valid in the JML case.
Key words: Rasch models, Partial Credit model, Conditional Maximum Likelihood<br />
estimate, R package eRm<br />
References<br />
Bertoli-Barsotti, L. (2005): On the Existence and Uniqueness of JML Estimates for<br />
the Partial Credit Model. Psychometrika, 70, 3, 517–531.<br />
Fischer, G. H. (1981): On the Existence and Uniqueness of Maximum-Likelihood<br />
Estimates in the Rasch Model. Psychometrika, 46, 1, 59–77.<br />
Mair P. and Hatzinger R. (2007): Extended Rasch Modeling: The eRm Package for<br />
the Application of IRT Models in R. Journal of Statistical Software, 20, 9, 1–20.<br />
− 116 −
Dynamic disturbances in BTA deep-hole
drilling - Identification of spiralling as a<br />
regenerative effect<br />
Nils Raabe, Dirk Enk, Claus Weihs, and Dirk Biermann<br />
Technische Universität Dortmund
Germany<br />
Abstract. One serious problem in deep-hole drilling is the formation of a dynamic disturbance called spiralling, which causes holes with several lobes. Since such lobes are a severe impairment of the bore hole, the formation of spiralling has to be prevented. One common explanation for the occurrence of spiralling is the intersection of time-varying bending eigenfrequencies with multiples of the tool’s rotational frequency. Little is known about which specific eigenfrequencies are crucial. Furthermore, an underlying assumption of this explanation is that the resulting holes, in cross-sectional view, show as a curve of constant width. This assumption implicitly supposes spiralling to result from a parallel displacement of the drill head.
We in fact observed spiralling in experiments designed to force it by planning crucial frequency intersections using a statistical-physical model proposed in earlier work. However, not every intersection of an eigenfrequency with a multiple of the rotational frequency led to spiralling. Furthermore, we also found cases of spiralling with two or four lobes, contradicting the common assumption. Inspection of the eigenmodes corresponding to the frequencies which caused the spiralling revealed that these modes commonly show a clear tilt at the drill head instead of a parallel displacement. On the one hand, this tilt in general allows ordering the eigenfrequencies by their relevance with respect to spiralling. On the other hand, we are now able to give a geometrical explanation of the development of spiralling as a regenerative effect.
We use this explanation to extend our statistical-physical model by a process model of the chips cut during the process. This model is the basis of a system for the simulation of spiralling. Since the model contains the machine parameters, it can be used to evaluate the probability and extent of spiralling in different settings. In this way different settings can be classified into stable and unstable processes, and strategies for the avoidance of spiralling can be derived. Since the statistical-physical model includes a statistical estimation procedure for the unknown parameters, these strategies can finally be tested in real processes.
− 117 −
Statistical processes under change - Enhancing<br />
data quality with pretests<br />
Walter Radermacher<br />
President of the Federal Statistical Office, Germany<br />
walter.radermacher@destatis.de<br />
Summary. The production of high-quality statistics is the main task of the Federal Statistical Office. Technological progress, globalisation, and the increasing significance and diversification of information and its distribution are only some general terms for the changes we are faced with today. Needless to say, those changes strongly affect the statistical work of the FSO and pose challenges that can only be met with innovative and appropriate methods, to name only a few: cooperation and networks, multiple sources, mixed-mode designs, standardisation of processes, metadata for quality control and the use of administrative information. The point is to maximise data quality and minimise the cost and the burden for the participants in surveys.
A prominent method for improving data quality in surveys is the use of pretests within - or ideally before - the actual data production process. Pretests fulfill a number of functions: they minimise non-sampling errors, they reduce the burden that comprehensive questionnaires place on respondents, and they test the feasibility of a concept in practice. Combining quantitative and qualitative methods for pretesting leads to significant increases in data quality. For instance, cognitive interviewing, an accepted method mainly used in social science research, when applied to test household surveys as well as business surveys, enables the detection of reporting errors caused by the underlying cognitive processes through which respondents generate their answers to survey questions. Some examples from practice will illustrate the benefits of pretests in official statistics.
Changing conditions call for changing procedures and methods. In our work supplying official statistics we react to the increasing demand for reliable data. Pretests are an important example of a method which meets this need in two ways: they improve quality control and they contribute to the user-friendliness of our surveys.
− 118 −
Automatic Dictionary Expansion Using<br />
Non-parallel Corpora<br />
Reinhard Rapp 1 and Michael Zock 2<br />
1 University of Tarragona reinhard.rapp@urv.cat<br />
2 LIF-CNRS, Marseille michael.zock@lif.univ-mrs.fr<br />
Abstract. Automatically deriving bilingual dictionaries from manually translated<br />
texts is an established technique that works well in practice. However, translated<br />
texts are a scarce resource. Therefore, it is also desirable to be able to generate<br />
dictionaries from pairs of unrelated monolingual corpora. To achieve this, we suggest<br />
an approach that considers the crosslingual correlations between the co-occurrence<br />
patterns of translated words. If, for example, two words X and Y co-occur more often<br />
than expected by chance in the source language, then their translations T(X) and<br />
T(Y) should also co-occur more frequently than expected in the target language. It<br />
is further assumed that a small dictionary is available at the beginning, and that<br />
the aim is to expand this base lexicon.<br />
The approach is as follows: Using a corpus of the target language, first a co-occurrence
matrix is computed with the rows being word types from the corpus and
the columns being target words from the base lexicon. Next a word of the source<br />
language is considered whose translation is to be determined. Using the source-language
corpus, a co-occurrence vector for this word is computed. Then, using the
dictionary, all known words in this vector are translated into the target language,<br />
thereby discarding unknown words. The resulting vector is compared to all vectors<br />
in the co-occurrence matrix of the target language. The vector with the highest<br />
similarity is considered to be the translation of the source-language word.<br />
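The procedure just described can be sketched in a few lines. This is an illustrative Python sketch over toy corpora; the window size, the cosine similarity measure and all names are our assumptions, not details from the talk.

```python
from math import sqrt

def cooc_vector(corpus, word, vocab, window=3):
    """Count co-occurrences of `word` with each vocabulary entry
    within a +/- `window` token span."""
    index = {w: i for i, w in enumerate(vocab)}
    vec = [0.0] * len(vocab)
    for k, token in enumerate(corpus):
        if token != word:
            continue
        for neigh in corpus[max(0, k - window):k + window + 1]:
            if neigh != word and neigh in index:
                vec[index[neigh]] += 1.0
    return vec

def cosine(a, b):
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (na * nb) if na and nb else 0.0

def translate(word, src_corpus, tgt_corpus, base_lexicon):
    """Return the target word whose co-occurrence vector is most
    similar to the translated co-occurrence vector of `word`."""
    src_vocab = sorted(base_lexicon)                 # known source words
    tgt_vocab = sorted(set(base_lexicon.values()))   # their translations
    # source vector over the base lexicon, translated column by column
    src_vec = cooc_vector(src_corpus, word, src_vocab)
    mapped = [0.0] * len(tgt_vocab)
    for i, w in enumerate(src_vocab):
        mapped[tgt_vocab.index(base_lexicon[w])] += src_vec[i]
    # compare with the vector of every target-language word type
    return max(sorted(set(tgt_corpus)),
               key=lambda t: cosine(mapped,
                                    cooc_vector(tgt_corpus, t, tgt_vocab)))
```

On realistic corpora the vectors would of course be built from millions of tokens and an association measure such as log-likelihood would replace raw counts.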
In our experiments this method gave an accuracy in the order of 50%. To improve<br />
the results, we perform an automatic cross-check which utilizes the dictionaries’<br />
property of transitivity. What we mean by this is that if we have two dictionaries, one<br />
translating from language A to language B, the other from B to C, then we can also<br />
translate from language A to C by using the intermediate language (or interlingua)<br />
B. That is, the property of transitivity, although having some limitations due to<br />
word ambiguities, can be exploited to automatically generate a raw dictionary for<br />
A to C. One might think that this is unnecessary as our corpus-based approach<br />
also allows us to generate this dictionary directly from the respective comparable<br />
corpora. However, having two different ways of generating the same dictionary has<br />
the advantage that we can validate one via the other. Furthermore, by considering
several languages, additional possibilities for mutual cross-validation arise.
Key words: dictionary generation, comparable texts, translation<br />
− 119 −
FIMIX-PLS Segmentation of Data for Path<br />
Models with Multiple Endogenous LVs<br />
Christian M. Ringle<br />
University of Hamburg, Institute of Industrial Management, Von-Melle-Park 5,<br />
20146 Hamburg, Germany, cringle@econ.uni-hamburg.de<br />
Abstract. When applying a causal modeling approach such as partial least squares<br />
(PLS) path modeling in empirical studies, the assumption that the data has been<br />
collected from a single homogeneous population is often unrealistic. Unobserved<br />
heterogeneity in the PLS estimates for the aggregate data level may result in misleading<br />
interpretations. Finite mixture partial least squares (FIMIX-PLS; Hahn et<br />
al., 2002) allows classifying data based on the heterogeneity of the estimates in the<br />
inner path model. Experimental as well as empirical examples (Esposito Vinzi et al.,
2007; Ringle et al., <strong>2008</strong>) illustrate the application of FIMIX-PLS for path models<br />
that only involve a single latent endogenous variable. This research uses a systematic<br />
approach (Ringle, 2007) to apply the FIMIX-PLS methodology and presents<br />
FIMIX-PLS computational experiments for a path model which includes multiple<br />
endogenous latent variables (LVs). The results of this analysis further substantiate the reliability of the systematic FIMIX-PLS application in more realistic situations and provide researchers and practitioners with the certainty they require to effectively evaluate their PLS path modeling results. If the procedure uncovers significant heterogeneity, the analysis results in further differentiated path modeling outcomes and thus allows forming more precise conclusions.
Key words: PLS Path Modeling, Heterogeneity, Finite Mixture, Segmentation<br />
References<br />
Esposito Vinzi, V., Ringle, C.M., Squillacciotti, S. and Trinchera, L. (2007): Capturing
and Treating Unobserved Heterogeneity by Response Based Segmentation<br />
in PLS Path Modeling: A Comparison of Alternative Methods by Computational<br />
Experiments. ESSEC Research Center, Working Paper No. 07019. ES-<br />
SEC Business School Paris-Singapore.<br />
Hahn, C., Johnson, M.D., Herrmann, A. and Huber, F. (2002): Capturing Customer<br />
Heterogeneity using a Finite Mixture PLS Approach. Schmalenbach Business<br />
Review, 54, 243–269.<br />
Ringle, C.M. (2007): Segmentation for path models and unobserved heterogeneity:<br />
The finite mixture partial least squares approach, Research Papers on Marketing<br />
and Retailing No. 035. University of Hamburg.<br />
− 120 −
Extreme unconditional dependence vs.<br />
multivariate GARCH effect in the analysis of<br />
dependence between high losses on Polish and<br />
German stock indexes<br />
Pawel Rokita, Krzysztof Piontek<br />
Department of Financial Investments and Risk Management<br />
Wroclaw University of Economics<br />
ul. Komandorska 118/120, 53-345 Wroclaw, Poland<br />
pawel.rokita@ae.wroc.pl, krzysztof.piontek@ae.wroc.pl<br />
Abstract. Classical portfolio diversification methods do not take account of any dependence between extreme returns (losses). Many researchers provide, however, empirical evidence that extreme losses for various assets co-occur. If the co-occurrence is frequent enough to be statistically significant, it may seriously influence portfolio risk. Such effects may result from a few different properties of financial time series, for instance: (1) extreme dependence in a (long-term) unconditional distribution, (2) extreme dependence in subsequent conditional distributions, (3) time-varying conditional covariance, (4) time-varying (long-term) unconditional covariance, (5) market contagion. Moreover, a mix of these properties may be present in return time series. Modeling each of them requires a different approach. It seems reasonable to investigate whether distinguishing between the properties is highly significant for portfolio risk measurement. If it is, identifying the effect responsible for high-loss co-occurrence would be of great importance. If it is not, the best solution would be to select the easiest-to-apply model. This article concentrates on two of the aforementioned properties: extreme dependence (in a long-term unconditional distribution) and time-varying conditional covariance.
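For the unconditional extreme-dependence property, a simple empirical estimate of the upper tail dependence coefficient (TDC) can be sketched as follows. This is an illustrative Python sketch; the quantile level, the rank-based construction and the names are our choices, not the authors'.

```python
def ranks(values):
    """0-based ranks of the observations (no tie handling, for brevity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for pos, i in enumerate(order):
        r[i] = pos
    return r

def empirical_utdc(x, y, q=0.9):
    """Estimate P(Y extreme | X extreme) at quantile level q:
    the fraction of joint exceedances among the exceedances of x."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    # pseudo-observations on (0, 1) via normalized ranks
    ux = [(r + 1) / (n + 1) for r in rx]
    uy = [(r + 1) / (n + 1) for r in ry]
    exceed_x = [i for i in range(n) if ux[i] > q]
    if not exceed_x:
        return 0.0
    joint = sum(1 for i in exceed_x if uy[i] > q)
    return joint / len(exceed_x)
```

For perfectly comonotone losses the estimate is 1, for counter-monotone losses it is 0, and independent losses give values near 1 − q; the TDC proper is the limit of this quantity as q approaches 1.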
Key words: extreme dependence, TDC, multivariate GARCH<br />
References<br />
Coles S., Heffernan J., Tawn J. (1999): Dependence Measures for Extreme Value<br />
Analyses. Extremes, 2:4, 339–365.<br />
Gouriéroux C. (1997): ARCH Models and Financial Applications. Springer.<br />
Rokita P. (2008): Comparing extreme dependence and varying conditional covariance
concept for portfolio risk modeling (in Polish). To be published in: Taksonomia,
15.
− 121 −
Outline of a generative corpus linguistics
Jürgen Rolshoven<br />
Linguistic Data Processing, Department of Linguistics, University of Cologne<br />
rols@spinfo.uni-koeln.de<br />
Abstract. The processing of long texts has been strongly stimulated by bioinformatics, as shown, among others, by Gusfield (1997), Böckenhauer and Bongartz (2003), and Haubold and Wiehe (2006). As a data structure, suffix trees play a central role. For the linguist, however, suffix trees are of interest not for substring search; rather, they enable the acquisition of linguistic knowledge through discovery procedures. Suffixes are potential morphemes in text corpora. Discovery procedures select, from the set of potential morphemes, those which carry function or meaning. For this, structuralist procedures such as substitution, omission and movement are employed. Formally, the following procedure results: suffix trees are equivalent to finite automata and correspond to Type-3 languages in the Chomsky hierarchy. From this, simple production rules are obtained, which are transformed into Type-2 rules with the help of further information from the suffix trees. These in turn are transformed into Type-1 rules. With this approach, language is parsed not sentence by sentence but, as it were, holistically with respect to the text. The procedures sketched here lead to overgeneration and are in this sense generative. Their descriptive potential extends beyond the texts from which the rules were derived. This is what the term generative corpus linguistics is meant to express. A generative corpus linguistics combines, through the use of bioinformatic methods, the strengths of the data-driven corpus-linguistic approach with the hypothesis-driven approach of generative grammars.
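The first step of such a discovery procedure, treating frequent word-final strings as candidate morphemes, can be illustrated as follows. This is a Python toy sketch, not the suffix-tree machinery of the cited works; thresholds and names are our assumptions.

```python
from collections import Counter

def candidate_suffixes(word_types, max_len=4, min_count=3):
    """Count word-final substrings over distinct word types and keep
    those frequent enough to be candidate morphemes."""
    counts = Counter()
    for w in set(word_types):
        # suffixes of length 1..max_len, leaving at least one stem character
        for k in range(1, min(max_len, len(w) - 1) + 1):
            counts[w[-k:]] += 1
    return {s: c for s, c in counts.items() if c >= min_count}
```

Frequency alone also promotes fragments such as single final letters; the structuralist substitution and omission tests described above are what single out the functional suffix among the candidates.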
References<br />
Böckenhauer, H-J., Bongartz, D. (2003): Algorithmische Grundlagen der Bioinformatik.<br />
Teubner Verlag, Wiesbaden.<br />
Gusfield, D. (1997): Algorithms on Strings, Trees and Sequences: Computer Science<br />
and Computational Biology. Cambridge University Press, Cambridge, Mass.<br />
Haubold, B., T. Wiehe (2006): Introduction to computational biology: an evolutionary<br />
approach. Birkhäuser Verlag, Basel; Boston.<br />
− 122 −
Cluster ensemble based on co-occurrence data<br />
Dorota Rozmus<br />
Department of Statistics,<br />
Katowice University of Economics, Bogucicka 14, 40-226 Katowice<br />
drozmus@ae.katowice.pl<br />
Abstract. The ensemble approach has been successfully applied in the context of supervised learning to increase the accuracy and stability of classification. Recently, analogous techniques for cluster analysis have been suggested. Research has shown that, by combining a collection of different clusterings, an improved solution can be obtained.
In the traditional way of learning from a data set, the classifiers are built in a feature space. Alternatively, decision rules can be constructed on similarity or dissimilarity representations instead. In such a recognition process an object is described by a distance matrix showing its similarity to the rest of the training samples.
This research focuses on exploiting the additional information provided by a collection of diverse clusterings to generate a co-association (similarity) matrix. Taking the co-occurrences of pairs of patterns in the same cluster as votes for their association, the data partitions are mapped into a co-association matrix of patterns. This n × n matrix represents a new similarity measure between patterns. The final data partition is obtained by clustering this matrix.
In the experiments, the behavior of partitions built on co-occurrence data is studied.
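The vote-counting construction described above can be sketched as follows. This is a minimal Python sketch; the majority threshold and the connected-components consensus step are illustrative choices, since the abstract does not fix the method used to cluster the co-association matrix.

```python
from itertools import combinations

def co_association(partitions, n):
    """Fraction of base clusterings that place each pair of objects
    in the same cluster (the n x n co-association matrix)."""
    m = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    m[i][j] += 1.0 / len(partitions)
    return m

def consensus_clusters(partitions, n, threshold=0.5):
    """Final partition: connected components of the graph linking
    pairs whose co-association exceeds `threshold` (majority vote)."""
    m = co_association(partitions, n)
    labels = list(range(n))          # each object starts in its own cluster
    def find(i):                     # union-find with path halving
        while labels[i] != i:
            labels[i] = labels[labels[i]]
            i = labels[i]
        return i
    for i, j in combinations(range(n), 2):
        if m[i][j] > threshold:
            labels[find(i)] = find(j)
    return [find(i) for i in range(n)]
```

Any clustering algorithm that accepts a similarity matrix, for example single-link hierarchical clustering, could replace the simple thresholded components used here.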
Key words: Cluster analysis, Cluster ensemble, Co-association matrix, (Dis)similarity<br />
representation.<br />
References<br />
Fred, A. and Jain, A.K. (2002): Evidence accumulation clustering based on the
k-means algorithm. Structural, Syntactic, and Statistical Pattern Recognition,
2396, 442–451.
Strehl, A. and Ghosh, J. (2002): Cluster ensembles - a knowledge reuse framework<br />
for combining partitionings. Journal of Machine Learning Research, 3, 583–617.<br />
Pekalska, E. and Duin, R.P.W. (2000): Classifiers for dissimilarity-based pattern
recognition. In: A. Sanfeliu, J.J. Villanueva, M. Vanrell, R. Alquezar, A.K.<br />
Jain and J. Kittler (Eds.): Proc. 15th Int. Conf. on Pattern Recognition, IEEE<br />
Computer Society Press, Los Alamitos, 12–16.<br />
− 123 −
Dyadic Interactions in Service Encounter -<br />
Bayesian SEM Approach<br />
Adam Sagan 1 and Magdalena Kowalska-Musiał 2
1 Chair of Market Analysis and Marketing Research Cracow University of<br />
Economics, Rakowicka 27, 31-510 Cracow, Poland sagana@ae.krakow.pl<br />
2 The School of Banking and Management, Armii Krajowej 4, 30-115 Cracow,<br />
Poland m.kowalska@wszib.edu.pl<br />
Abstract. Dyadic multirelational and sequential interactions are important aspects of service encounters. They can be observed in B2B distribution channels, professional services, buying centers, family decision making and WOM communications. The networks consist of dyadic bonds that form dense but weak ties among actors. The aim of this paper is the identification of latent properties of dyadic interactions in the mobile phone service market. Latent variable models in relational marketing often either concentrate on the effects of relations or treat the relationship dimensions as psychological constructs on the individual-trait level.
We propose an approach based on Bayesian latent variable modeling of social networks with dyads as the units of analysis. This approach makes it possible to model emergent and relational properties of actors’ interactions in dyads that are irreducible to individual latent traits or psychological constructs.
Several competing models are developed and compared using Bayesian structural equation models of dyadic data. Bayesian SEM helps to overcome the limitations of the more traditional solutions based on ML or WLS estimation. It is robust to the small samples that are common in social network analysis, and can also be applied to non-normal data as well as non-linear relations between latent variables.
Key words: Relationship marketing, Dyadic data, Bayesian SEM<br />
References<br />
Anderson, J. C. and Håkansson, H. and Johanson, J. (1994): Dyadic Business Relationships
within a Business Network Context, Journal of Marketing, October,<br />
1–15.<br />
Iacobucci, D. and Hopkins, N. (1992): Modeling Dyadic Interactions and Networks<br />
in Marketing, Journal of Marketing Research, February, 5–17.<br />
Kenny, D.A. and Kashy, D.A and Cook, W.L. (2006): Dyadic Data Analysis. Guilford<br />
Press, New York.<br />
Lee, S. Y. (2007): Structural Equation Modeling: A Bayesian Approach. John Wiley
and Sons, Chichester.
− 124 −
− 125 −
Nonnegative Matrix Factorization for Binary<br />
Data to Extract Elementary Failure Maps from<br />
Wafer Test Images<br />
Reinhard Schachtner 1,2 , Gerhard Pöppel 1 and Elmar Lang 2<br />
1 Infineon Technologies AG, 93049 Regensburg, Germany<br />
reinhard.schachtner@infineon.com<br />
2 Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany<br />
Abstract. We introduce a probabilistic variant of non-negative matrix factorization<br />
(NMF) for binary data sets: binary coded images are considered a probabilistic<br />
superposition of underlying continuous-valued basic patterns. An extension<br />
of the well-known NMF procedure to binary-valued data sets is provided to solve the<br />
related optimization problem under non-negativity constraints. We demonstrate the<br />
performance of our method by applying it to the detection and characterization of<br />
hidden causes of failures during wafer processing: binary coded (pass/fail) wafer<br />
test data are decomposed into underlying elementary failure patterns, and their<br />
influence on the quality of single wafers is studied.<br />
Key words: Nonnegative matrix factorization, binary data, failure patterns<br />
References<br />
Lee, D. and Seung, H. (1999): Learning the parts of objects by non-negative matrix<br />
factorization. Nature, 401, 788–791.<br />
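The abstract does not spell out the probabilistic binary-data variant; as background, the classical multiplicative-update NMF of the cited Lee and Seung (1999) paper can be sketched as follows (the toy 0/1 "wafer maps", the rank and the iteration count are illustrative assumptions, not the authors' data or algorithm):

```python
import numpy as np

def nmf(V, rank, n_iter=500, eps=1e-9, seed=0):
    """Classical NMF (Lee and Seung, 1999) with multiplicative updates,
    minimizing the squared reconstruction error ||V - WH||^2."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps          # per-image weights
    H = rng.random((rank, m)) + eps          # basic patterns
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update patterns
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update weights
    return W, H

# toy example: three binary "wafer maps"; the third superimposes
# two elementary failure patterns (purely illustrative data)
pattern_a = np.array([1, 1, 0, 0, 0, 0])
pattern_b = np.array([0, 0, 0, 0, 1, 1])
V = np.array([pattern_a, pattern_b, pattern_a | pattern_b], dtype=float)
W, H = nmf(V, rank=2)
print(np.round(W @ H, 2))  # reconstruction close to V
```

In the authors' setting, the rows of H would play the role of elementary failure patterns and W would give their per-wafer contributions.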
− 126 −
Quality-Based Clustering of Functional Data:<br />
Applications to Time Course Microarray Data<br />
Theresa Scharl 1 and Friedrich Leisch 2<br />
1 Institut für Statistik und Wahrscheinlichkeitstheorie, Technische Universität<br />
Wien, Wiedner Hauptstr. 8-10, A-1040 Wien, Austria; Scharl@ci.tuwien.ac.at<br />
2 Institut für Statistik, Ludwig-Maximilians-Universität München, Ludwigstraße<br />
33, D-80539 München, Germany; Friedrich.Leisch@stat.uni-muenchen.de<br />
Abstract. Cluster methods are typically applied to time course gene expression<br />
data to find co-regulated genes, which can finally help to reveal pathways and interactions<br />
between genes. Clustering is either carried out on the raw data or on<br />
functional data. In functional data analysis (e.g. Serban and Wasserman, 2005;<br />
Tarpey, 2007) a curve is fit to each observation in order to account for time dependency.<br />
Gene expression over time is biologically a continuous process and can<br />
therefore be represented by a continuous function. The different curve shapes found<br />
in a dataset can have important interpretations and characteristic patterns can be<br />
found by clustering the estimated regression coefficients.<br />
In this study the raw data are clustered using the well-known K-means algorithm<br />
as well as the quality-based cluster algorithm Stochastic QT-Clust (Scharl and<br />
Leisch, 2006). Further, the parameters obtained by representing each gene expression<br />
profile by a curve are clustered. Additionally, mixtures of spline regression models<br />
and mixed-effects models are applied to the data. All cluster algorithms used are<br />
implemented in R. The different cluster methods are compared in a simulation study<br />
on various datasets.<br />
Key words: Cluster analysis, functional data, time course gene expression data, R<br />
References<br />
SCHARL, T. and LEISCH, F. (2006): The stochastic QT-clust algorithm: evaluation<br />
of stability and variance on time-course microarray data. In: Rizzi, A. and Vichi,<br />
M. (Eds.): Compstat 2006 - Proceedings in Computational Statistics, 1015–1022,<br />
Physica Verlag, Heidelberg, Germany.<br />
SERBAN, N. and WASSERMAN, L. (2005): CATS: Clustering after transformation<br />
and smoothing. Journal of the American Statistical Association, 100(471), 990–<br />
999.<br />
TARPEY, T. (2007): Linear transformations and the k–Means clustering algorithm:<br />
Applications to clustering curves. The American Statistician, 61(1), 34–40.<br />
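The coefficient-clustering idea can be sketched as follows; the straight-line fits and the minimal Lloyd-style k-means below are illustrative stand-ins for the spline models, Stochastic QT-Clust and the R implementations named in the abstract:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal Lloyd's k-means on the rows of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# toy expression profiles over 5 time points: two "up" and two "down" genes
t = np.linspace(0, 1, 5)
profiles = np.array([2 * t, 2 * t + 0.1, -2 * t, -2 * t - 0.1])
# represent each profile by its fitted curve coefficients, then cluster those
coefs = np.array([np.polyfit(t, y, deg=1) for y in profiles])
labels = kmeans(coefs, k=2)
print(labels)  # up-regulated genes in one cluster, down-regulated in the other
```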
− 127 −
Multilingual knowledge-based concept<br />
recognition in textual data<br />
Martin Schierle 1 and Daniel Trabold 2<br />
1 Daimler AG martin.schierle@daimler.com<br />
2 Daimler AG daniel.trabold@daimler.com<br />
Abstract. Given the increasing volume of textual data available through digital<br />
resources today, identifying the main concepts in those texts becomes more and<br />
more important and can be seen as a vital step in the analysis of unstructured<br />
information.<br />
Research in this area has focused on the detection of named entities such as person<br />
or organization names, which cover only a very small part of the concepts in texts.<br />
In particular, the unique mapping between concepts in different languages requires<br />
parallel corpora, which are rarely available in industrial settings.<br />
We therefore propose a powerful new knowledge-based model that recognizes various<br />
kinds of concepts even in very short and specialized texts, using linguistic information<br />
for synonym handling and word sense disambiguation.<br />
We evaluate the proposed model on texts from the automotive domain.<br />
− 128 −
Localized Logistic Regression for Discrete<br />
Influential Factors<br />
Julia Schiffner, Gero Szepannek, Thierry Monthé, and Claus Weihs<br />
Faculty of Statistics, Dortmund University of Technology, 44221 Dortmund,<br />
Germany,<br />
schiffner@statistik.uni-dortmund.de,<br />
weihs@statistik.uni-dortmund.de<br />
Abstract. The two-class localized logistic regression of Tutz and Binder (2005)<br />
is generalized to discrete explanatory variables, and applied to data from a breast<br />
cancer study. In order to obtain a distance measure between observations of the<br />
discrete factors a combination of the simple and the flexible matching coefficient<br />
(Ickstadt et al., 2006) is taken. Applying the method of Tutz and Binder (2005)<br />
with this distance measure to the example data, localized models lead to smaller<br />
misclassification rates than the corresponding global ones. Moreover, the best classification<br />
rule found gives one of the smallest misclassification rates ever obtained<br />
for the example data. The results of Monthé (2008) are extended by automatic<br />
variable selection.<br />
Key words: Localized logistic regression, Matching coefficients, SNP data<br />
References<br />
Ickstadt, K., Mueller, T., and Schwender, H. (2006): Analyzing SNPs: Are There<br />
Needles in the Haystack? CHANCE, 19(3), 22–27.<br />
Monthé, Th. (2008): Lokalisierte Logistische Regression bei diskreten Variablen.<br />
Master's thesis, Faculty of Statistics, Dortmund University of Technology.<br />
Tutz, G. and Binder, H. (2005): Localized Classification. Statistics and Computing,<br />
15, 155–166.<br />
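As background on the distance ingredient: the simple matching coefficient counts agreeing positions, so the proportion of disagreements yields a distance between observations of discrete factors. The sketch below shows only this simple variant with a made-up 0/1/2 genotype coding; the flexible coefficient of Ickstadt et al. (2006) and its combination with the simple one are not reproduced here:

```python
def simple_matching_distance(x, y):
    """Proportion of positions at which two vectors of discrete factors
    disagree; 0 means identical, 1 means fully distinct."""
    if len(x) != len(y):
        raise ValueError("vectors must have equal length")
    return sum(a != b for a, b in zip(x, y)) / len(x)

# two genotype vectors coded as 0/1/2 copies of the minor allele (illustrative)
g1 = [0, 1, 2, 0, 1]
g2 = [0, 1, 0, 2, 1]
print(simple_matching_distance(g1, g2))  # 2 of 5 loci differ -> 0.4
```

In localized logistic regression such a distance controls how strongly each training observation is weighted when a local model is fit around a query point.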
− 129 −
Localized Classification Using Mixture Models<br />
Julia Schiffner and Claus Weihs<br />
Faculty of Statistics, Dortmund University of Technology, 44221 Dortmund,<br />
Germany,<br />
schiffner@statistik.uni-dortmund.de<br />
Abstract. In the literature a variety of classification methods can be found that can<br />
be called ‘local’ because they concentrate – in different senses – on one or multiple<br />
small regions of the data space. One type of local method that may be beneficial<br />
in the case of heterogeneous classes is based on mixture models. It is assumed that<br />
data are generated by a finite number of sources and that each source can produce<br />
data of one or multiple classes. Models valid for single sources can be referred to<br />
as ‘local models’ that can be aggregated to a global mixture model. Mixture based<br />
classification methods have been described by several authors (see references), but<br />
the relationships and differences between the underlying models are not clear. A<br />
consistent description of these models and the resulting Bayes classification rules is<br />
presented. Moreover, it is shown how Bayes rules can be derived if in distinct local<br />
models different variable subsets separate the classes. Finally, several methods for<br />
class posterior estimation are described and an application to sound data is shown,<br />
where the register of different instruments is predicted by timbre.<br />
Key words: Local classification methods, Mixture models, Bayes rules<br />
References<br />
Hastie, T. J. and Tibshirani, R. J. (1996): Discriminant Analysis by Gaussian Mixtures.<br />
Journal of the Royal Statistical Society B, 58(1), 155–176.<br />
Szepannek, G. and Weihs, C. (2006): Local Modelling in Classification on Different<br />
Feature Subspaces. In: P. Perner (Ed.): Advances in Data Mining. Springer,<br />
Berlin, 226-238.<br />
Titsias, M. K. and Likas, A. C. (2001): Shared Kernel Models for Class Conditional<br />
Density Estimation. IEEE Transactions on Neural Networks, 12(5), 987–997.<br />
Titsias, M. K. and Likas, A. C. (2002): Mixture of Experts Classification Using a<br />
Hierarchical Mixture Model. Neural Computation, 14, 2221–2244.<br />
Weihs, C., Szepannek, G., Ligges, U., Luebke, K., and Raabe, N. (2006): Local Models<br />
in Register Classification by Timbre. In: V. Batagelj, H.-H. Bock, A. Ferligoj,<br />
and A. Ziberna (Eds.): Data Science and Classification. Springer, Berlin, 315-<br />
322.<br />
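The aggregation of local models into a global Bayes rule can be illustrated numerically: each source contributes to the class posterior in proportion to its responsibility for the observation. The one-dimensional Gaussian sources and class probabilities below are invented for illustration and are not the models of the paper:

```python
import math

def gauss(x, mu, sigma):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# two local sources (mixture components); each may emit both classes
components = [
    {"weight": 0.5, "mu": -2.0, "sigma": 1.0, "class_probs": {"A": 0.9, "B": 0.1}},
    {"weight": 0.5, "mu":  2.0, "sigma": 1.0, "class_probs": {"A": 0.2, "B": 0.8}},
]

def posterior(x):
    """Class posteriors of the global mixture: aggregate the local models,
    weighting each source by its responsibility for x."""
    resp = [c["weight"] * gauss(x, c["mu"], c["sigma"]) for c in components]
    total = sum(resp)
    return {k: sum(r * c["class_probs"][k] for r, c in zip(resp, components)) / total
            for k in ("A", "B")}

print(posterior(-2.0))  # dominated by the first source, so class A is favoured
```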
− 130 −
Comparison of four estimators of the<br />
heterogeneity variance for meta-analysis<br />
Peter Schlattmann<br />
Dept. of Biostatistics and Clinical Epidemiology<br />
Charité Universitätsmedizin, Charitéplatz 1, 10117 Berlin<br />
peter.schlattmann@charite.de<br />
Summary. The analysis of heterogeneity is a crucial part of every meta-analysis.<br />
To analyze heterogeneity, a random effects model which incorporates variation<br />
between studies is often considered. It is assumed that each study has its own<br />
(true) exposure or therapy effect and that there is a random distribution of these true<br />
exposure effects around a central effect. The variability between studies is quantified<br />
by the heterogeneity variance.<br />
In order to compare the performance of four estimators of the heterogeneity variance,<br />
a simulation study was performed. This study compared the DerSimonian-Laird<br />
(1986) estimator with the maximum likelihood estimator based on a normal distribution<br />
for the random effects. A further comparator was the simple heterogeneity<br />
(SH) variance estimator proposed by Sidik and Jonkman (2005).<br />
All of the aforementioned methods assume a normal distribution for the random<br />
effects. This assumption may or may not hold; thus an alternative estimator of<br />
the heterogeneity variance is based on a finite mixture model (Böhning, Dietz, and<br />
Schlattmann, 1998).<br />
This simulation study investigates these four estimators when sampling from<br />
discrete distributions, i.e. when the major assumption of a normal distribution for the<br />
random effects is not fulfilled. In this setting, bias, standard deviation and mean<br />
squared error (MSE) of all four estimators are examined.<br />
Key words: Meta-Analysis, Heterogeneity, Simulation, Finite mixture model<br />
References<br />
Böhning, D., Dietz, E. and Schlattmann, P. (1998): Recent developments in computer<br />
assisted mixture analysis. Biometrics, 54, 283–303.<br />
DerSimonian, R. and Laird, N. (1986): Meta-analysis in clinical trials. Controlled<br />
Clinical Trials, 7, 177–188.<br />
Sidik, K. and Jonkman, J. (2005): Simple heterogeneity variance estimation for<br />
meta-analysis. JRSS Series C, 54, 367–384.<br />
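Of the compared methods, the DerSimonian-Laird estimator has a simple closed form: a method-of-moments estimate built from Cochran's Q and the inverse-variance weights, truncated at zero. The effect sizes in the sketch below are made up for illustration:

```python
def dersimonian_laird(effects, variances):
    """Method-of-moments estimate of the between-study heterogeneity
    variance tau^2 (DerSimonian and Laird, 1986), truncated at zero."""
    w = [1.0 / v for v in variances]                 # inverse-variance weights
    s1 = sum(w)
    pooled = sum(wi * y for wi, y in zip(w, effects)) / s1
    q = sum(wi * (y - pooled) ** 2 for wi, y in zip(w, effects))  # Cochran's Q
    denom = s1 - sum(wi ** 2 for wi in w) / s1
    return max(0.0, (q - (len(effects) - 1)) / denom)

# made-up example: five study effects with equal within-study variances
effects = [-0.2, 0.1, 0.4, 0.8, 1.1]
variances = [0.04] * 5
print(round(dersimonian_laird(effects, variances), 3))  # -> 0.233
```

With homogeneous effects Q falls below its expectation k − 1 and the estimate is truncated to zero, which is one source of the bias the simulation study examines.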
− 131 −
Machine learning applications of positive<br />
definite kernels<br />
Prof. Dr. Bernhard Schölkopf<br />
MPI for Biological Cybernetics<br />
Spemannstrasse 38<br />
72076 Tübingen<br />
bernhard.schoelkopf@tuebingen.mpg.de<br />
Summary. Support vector machines and other kernel methods have become one of<br />
the most widely used techniques in the field of machine learning. I will present my<br />
thoughts on what made them popular and what may (or may not) keep them going.<br />
I will also discuss applications in different domains, including computer graphics.<br />
− 132 −
Age Distributions for costs in drug prescription by<br />
practitioners and for DRG-based hospital treatment<br />
Reinhard Schuster, Eva v. Arnstedt<br />
Medical Review Board of the Statutory Health Insurance in North Germany,<br />
23554 Lübeck, Germany, Reinhard.Schuster@mdk-nord.de<br />
Abstract. Purpose: We analyse age-dependent fractions of patients with costs above a threshold<br />
value, as a function of that value, both in drug prescription outside hospitals and in<br />
DRG-based hospital treatment. We compare the results of different German regions and<br />
different statutory insurances. The age-dependency of costs is highly important with<br />
respect to demographic changes. Design/Methodology/Approach/Algorithm: We use drug<br />
prescription data of practitioners and data of DRG-based hospital treatment from several statutory<br />
insurances and several regions. We use a nonparametric functional equation with a geometric<br />
background which generates a one-parametric family of log-concave distributions including<br />
the normal distribution. The mentioned functional equation is also related to Verhulst<br />
growth. Results: The data can be fitted by log-concave distributions and we obtain numerically<br />
stable computations. The respective logarithms are concave with respect to both variables, age<br />
and costs. We find that, independent of the absolute threshold value, there is always a decrease<br />
in the fraction of high-cost patients above a certain age, so we do not find a monotone increase<br />
of costs with age. Research Limitations/Implications: The statistically reported data<br />
basis for age-dependent costs is poor in general with respect to specific details, especially<br />
if a (pseudonymized) patient identifier is necessary. Practical Implications: Demographic<br />
changes are important for a large range of induced implications. It is often assumed that<br />
costs increase strictly with age. If this turns out not to be true in general, costs<br />
depend much more sensitively on the exact (demographically changing) age distribution<br />
of the population, which should be analysed in that direction. Originality/Value: The age-dependent<br />
resolution of officially published statistical reports is poor in general. We state a stable<br />
non-parametric model with high resolution.<br />
Key words: Drug Application, Age Distribution, DRG-System, Statutory Health Insurance<br />
References<br />
SCHUSTER, R. (2003): Komponentenzerlegungen, Strukturen und Invarianten zu GKV-<br />
Arzneimittelverordnungsdaten. Journal of Public Health, 4, 293–305.<br />
− 133 −
The Late Neolithic flint axe production on the<br />
Lousberg (Aachen, Germany) — An<br />
extrapolation of supply and demand and<br />
population density<br />
Daniel Schyle<br />
Institut für Ur- und Frühgeschichte, Universität zu Köln<br />
daniel.schyle@uni-koeln.de<br />
Abstract. The tabular flint seams within the cretaceous limestone slab once covering<br />
the Lousberg in Aachen (Germany) were completely exploited by systematic<br />
opencast mining during the time between approximately 3800 and 3000 years<br />
CalBC. The Lousberg-flint, easily identifiable by its tabular shape and its characteristic<br />
colours, was processed on-site almost exclusively for the production of<br />
axe-roughouts, which were distributed over distances up to 280 km mainly to Westphalia,<br />
but also to Hessen, Rheinland-Pfalz and into Belgium and the Netherlands.<br />
An excavation at the Lousberg was carried out under the direction of J. Weiner<br />
between 1978 and 1981. This contribution presents an extrapolation of the total<br />
amount of axe-roughouts produced at the site, based on the results of refittings and<br />
the counts of random samples of the knapping waste excavated from the mining<br />
dumps. The corresponding demand for axes per household and generation is estimated<br />
from axe distributions and frequencies in several well dated and preserved<br />
lakeshore dwellings of Southern Germany and Switzerland. To estimate the population<br />
density within the distribution area of Lousberg-axes, which is almost devoid of<br />
Late Neolithic settlement traces other than only roughly dated surface assemblages,<br />
the approximate size of the core-distribution area is determined by the site density<br />
mapping method (”Isolinien-Fundstellendichtekartierung”) recently developed by A.<br />
Zimmermann and collaborators of the Institut für Ur- und Frühgeschichte at the University<br />
of Cologne. The contribution will focus on the problems in comparing the<br />
results based on the distribution of Lousberg-axes to the results recently obtained<br />
on settlement distributions of the Linearbandkeramik (LBK) in the Rhineland. The<br />
research is part of a project aimed at the final publication of the Lousberg finds,<br />
which was funded by the Deutsche Forschungsgemeinschaft (DFG).<br />
− 134 −
Time Related Features for Alarm Classification<br />
in Intensive Care Monitoring<br />
Wiebke Sieben<br />
Department of Statistics, Technische Universität Dortmund, 44227 Dortmund,<br />
Germany sieben@statistik.tu-dortmund.de<br />
Abstract. Traditional patient monitoring systems in intensive care are based on<br />
simple threshold alarms. These systems compare the measurement of a vital sign<br />
with a threshold set by the clinical staff and trigger an alarm when the threshold<br />
is crossed. Although there are some more sophisticated rules already incorporated<br />
in modern monitoring devices, the false alarm rate has remained very high (Tsien<br />
and Fackler, 1997; Chambrin, 2001). Machine learning techniques, particularly decision<br />
trees, have proven suitable for alarm classification. As the misclassification rate<br />
of non life-threatening situations is to be minimized under the constraint that the<br />
misclassification rate of life-threatening situations is close to zero, standard techniques<br />
need to be improved. Modified Random Forests (Sieben, Gather 2007) have<br />
been shown to do this successfully. So far only the measurements of the point in<br />
time when an alarm was triggered were used for classification. As physicians always<br />
take the character of changes over time in a patient's health status into account for<br />
a diagnosis, there might be valuable information to be extracted from the time<br />
series. We study the use of time related features in combination with the modified<br />
Random Forest approach in terms of improvements in the classification results.<br />
References<br />
CHAMBRIN, M.-C. (2001): Alarms in the Intensive Care Unit: How Can the Number<br />
of False Alarms Be Reduced?. Critical Care, 5 (4), 184–188.<br />
SIEBEN, W. and GATHER, U. (2007): Classifying Alarms in Intensive Care - Analogy<br />
to Hypothesis Testing. In: R. Bellazzi, A. Abu-Hanna and J. Hunter (Eds.):<br />
Proceedings of the 11th Conference on Artificial Intelligence in Medicine, LNCS<br />
Vol. 4594/2007, Springer, Berlin/Heidelberg, 130–138.<br />
TSIEN, C.L., FACKLER, C. (1997): Poor Prognosis for Existing Monitors in the<br />
Intensive Care Unit. Critical Care Medicine, 25 (4), 614–619.<br />
Key words: Classification, Intensive care monitoring, False alarms<br />
− 135 −
’CMA’ - Steps in developing a comprehensive<br />
R-toolbox for classification with microarray<br />
data and other high-dimensional problems<br />
Martin Slawski, Anne-Laure Boulesteix, and Martin Daumer<br />
Sylvia Lawry Centre for MS Research, Hohenlindenerstr. 1, D-81677 München<br />
Martin.Slawski@campus.lmu.de, boulesteix@slcmsr.org, daumer@slcmsr.org<br />
Abstract. Microarray studies have stimulated the development of new approaches<br />
and motivated the adaptation of known traditional methods for class prediction with<br />
high-dimensional data. There already exist numerous software packages implementing<br />
single methods for microarray-based classification and, in addition, two synthesis<br />
packages: MLInterfaces by V. Carey and R. Gentleman (2007) and MCRestimate<br />
by Ruschhaupt et al. (Stat Appl Genet Mol Biol 2004, 3:37), available from the<br />
www.bioconductor.org platform. Conceptually, the R package CMA is more related<br />
to the second one, focussing on comparative model evaluation according to accepted<br />
'good practice' standards/guidelines (Dupuy and Simon, J Natl Cancer Inst 2007,<br />
99:147-157), an aspect neglected by the still widely used MLInterfaces. In<br />
a nutshell, CMA provides a uniform interface to a total of more than 20 supervised<br />
classification methods, comprising classical approaches such as discriminant analysis<br />
or penalized multinomial logistic regression, dimension reduction by Partial Least<br />
Squares, and more sophisticated methods, e.g. Support Vector Machines, Neural<br />
Networks or boosting techniques.<br />
The evaluation of the constructed classifiers is based on repeated splittings into<br />
learning and test sets or related approaches (e.g. bootstrap). For each learning set<br />
separately, variable selection can be performed optionally, either by a collection of<br />
simple tests or by advanced techniques such as the lasso, elastic net or componentwise<br />
boosting. In the last step, hyperparameter optimization and model evaluation<br />
are carried out via a ’nested’ cross-validation procedure. The outer loop is used for<br />
classifier evaluation while appropriate values for the hyperparameters are determined<br />
in the inner loop.<br />
CMA is implemented entirely in S4 classes (J. Chambers, Programming with data,<br />
1998). Its modular construction makes the incorporation of new methods easy. Furthermore,<br />
it is intended to be user-friendly by providing a multitude of pre-defined<br />
methods for summarizing and visualizing classifier evaluation and comparison.<br />
A preliminary version of CMA is planned to be available in the next Bioconductor<br />
release in April 2008.<br />
Key words: High-dimensional data, Classification, Validation, Statistical software<br />
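The nested scheme is package-independent and can be mocked up in a few lines. The sketch below is in Python (CMA itself is an R package) and uses a toy one-dimensional nearest-neighbour classifier whose neighbourhood size k stands in for the hyperparameters; data, seeds and fold counts are illustrative assumptions:

```python
import random

def kfold(n, k, seed=0):
    """Split the indices 0..n-1 into k shuffled folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_error(train, test, data, labels, k):
    """Misclassification rate of a k-nearest-neighbour vote on 1-D data."""
    errors = 0
    for i in test:
        nearest = sorted(train, key=lambda j: abs(data[i] - data[j]))[:k]
        votes = [labels[j] for j in nearest]
        errors += max(set(votes), key=votes.count) != labels[i]
    return errors / len(test)

def nested_cv(data, labels, ks=(1, 3, 5), outer=5, inner=4):
    """Outer loop: classifier evaluation. Inner loop, run on the outer
    training part only: hyperparameter choice."""
    fold_errors = []
    for fold in kfold(len(data), outer):
        train = [i for i in range(len(data)) if i not in fold]
        inner_folds = kfold(len(train), inner, seed=1)
        def inner_error(k):
            return sum(
                knn_error([train[i] for i in range(len(train)) if i not in f],
                          [train[i] for i in f], data, labels, k)
                for f in inner_folds) / inner
        best_k = min(ks, key=inner_error)   # tuned without touching the held-out fold
        fold_errors.append(knn_error(train, fold, data, labels, best_k))
    return sum(fold_errors) / outer

# toy two-class data with well-separated class means
random.seed(1)
data = [random.gauss(0, 1) for _ in range(30)] + [random.gauss(3, 1) for _ in range(30)]
labels = [0] * 30 + [1] * 30
print(round(nested_cv(data, labels), 3))  # low error on separated classes
```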
− 136 −
Generating Collective Intelligence<br />
Vassilios Solachidis 1 , Phivos Mylonas 2 , Andreas Geyer-Schulz 3 , Bettina<br />
Hoser 3 , Sam Chapman 4 , Fabio Ciravegna 4 , Steffen Staab 5 , Costis<br />
Kontopoulos 6 , Ioanna Gkika 6 , Pavel Smrz 7 , Yiannis Kompatsiaris 1 , and<br />
Yannis Avrithis 2<br />
1 Centre of Research and Technology Hellas, Informatics and Telematics Institute,<br />
Km Thermi-Panorama Road, Thermi-Thessaloniki, GR 570 01, Greece, {vsol,<br />
ikom}@iti.gr<br />
2 National Technical University of Athens, Image, Video and Multimedia Systems<br />
Laboratory, Iroon Polytechneiou 9, Zographou Campus, Athens, GR 157 80,<br />
Greece, {fmylonas, iavr}@image.ntua.gr<br />
3 Department of Economics and Business Engineering, Information Service and<br />
Electronic Markets, Kaiserstraße 12, Karlsruhe 76128, Germany,<br />
{andreas.geyer-schulz, bettina.hoser}@kit.edu<br />
4 University of Sheffield, Department of Computer Science, Regent Court, 211<br />
Portobello Street, S1 4DP, Sheffield, UK, {s.chapman, fabio}@dcs.shef.ac.uk<br />
5 Universität Koblenz-Landau, Information Systems and Semantic Web,<br />
Universitätsstraße 1, 57070 Koblenz, Germany, staab@uni-koblenz.de<br />
6 Vodafone-Panafon (Greece), Technology Strategic Planning - R&D Dept.,<br />
Tzavella 1-3, Halandri, 152 31, Greece, {Costis.Kontopoulos,<br />
Ioanna.Gkika}@vodafone.com<br />
7 Brno University of Technology, Faculty of Information Technology, Bozetechova<br />
2, CZ-61266 Brno, Czech Republic, smrz@fit.vutbr.cz<br />
Abstract. In this paper we provide a foundation for a new generation of services<br />
and tools. We define new ways of capturing, sharing and reusing information and<br />
intelligence provided by single users and communities, as well as organizations by<br />
enabling the extraction, generation, interpretation and management of Collective<br />
Intelligence from user generated digital multimedia content. Different layers of intelligence<br />
will be generated, which together constitute the notion of Collective Intelligence.<br />
The latter emerges from the collaboration and competition among many<br />
individuals and forms an intelligence that seemingly has a mind of its own. The<br />
automatic generation of Collective Intelligence constitutes a departure from traditional<br />
methods for information sharing, since information from both the multimedia<br />
content and social aspects will be merged, while at the same time the social dynamics<br />
will be taken into account. In the context of this work, we shall present two case<br />
studies. Initially, an Emergency Response case study will be tackled, where users<br />
provide intelligence about large scale emergencies, empowering a more effective and<br />
informed emergency action and at the same time receive information on how to act.<br />
A Consumers Social Group case study will follow, providing enhanced publishing<br />
tools to support group activities (e.g. organization of team events) and the ability<br />
to extract meta-information from content sources and group discussions. Both use<br />
cases demonstrate the important effect of Collective Intelligence as well as its leverage<br />
for private, commercial and public purposes.<br />
− 137 −
Analysis of polyphonic musical time series<br />
Katrin Sommer and Claus Weihs<br />
Lehrstuhl für Computergestützte Statistik<br />
Technische Universität Dortmund, D-44221 Dortmund<br />
sommer@statistik.tu-dortmund.de<br />
Abstract. A general model for pitch tracking of polyphonic musical time series will<br />
be introduced. Based on a model of Davy and Godsill (2002), the different pitches<br />
of the musical sound are estimated simultaneously with MCMC methods. Additionally,<br />
a preprocessing step is designed to improve the estimation of the fundamental<br />
frequencies (Sommer and Weihs, 2008). The preprocessing step compares real audio<br />
data with an alphabet that is constructed from the McGill Master Samples (Opolko<br />
and Wapnick, 1987) and consists of tones of different instruments. The tones with<br />
minimal Itakura-Saito distortion (Gray et al., 1980) are chosen as first estimates<br />
and as starting points for the MCMC algorithms. Furthermore, the implementation<br />
of the alphabet provides an approach to the recognition of the instruments generating<br />
the musical time series. Results are presented for mixed monophonic data from McGill<br />
and for self-recorded polyphonic audio data.<br />
Key words: MCMC, musical time series, polyphony, alphabet<br />
References<br />
Davy, M. and Godsill, S. J. (2002): Bayesian Harmonic Models for Musical Pitch<br />
Estimation and Analysis. Technical Report 431. Cambridge University Engineering<br />
Department.<br />
Gray, R., Buzo, A., Gray, A. and Matsuyama, Y. (1980): Distortion Measures for<br />
Speech Processing. IEEE Transactions on Acoustics, Speech, and Signal Processing<br />
ASSP-28, 367–376.<br />
Opolko, F. and Wapnick, J. (1987): McGill University Master Samples [Compact<br />
disc]. Montreal, Quebec: McGill University.<br />
Sommer, K. and Weihs, C. (2006): Using MCMC as a stochastic optimization procedure<br />
for music time series. In: V. Batagelj, H.-H. Bock, A. Ferligoj and A. Ziberna<br />
(Eds.): Data Science and Classification. Springer, Heidelberg, 307–314.<br />
Sommer, K. and Weihs, C. (2008): A comparative study on polyphonic musical<br />
time series using MCMC methods. In: C. Preisach, H. Burkhardt, L. Schmidt-<br />
Thieme and R. Decker (Eds.): Data Analysis, Machine Learning, and Applications.<br />
Springer, Berlin.<br />
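The alphabet-matching step relies on the Itakura-Saito distortion (Gray et al., 1980), whose common per-bin form is r - log r - 1 averaged over the spectrum; the "tone spectra" below are invented for illustration and are not the McGill samples:

```python
import math

def itakura_saito(p, q):
    """Itakura-Saito distortion between two power spectra p and q
    (non-negative; zero exactly when the spectra coincide)."""
    return sum(pi / qi - math.log(pi / qi) - 1 for pi, qi in zip(p, q)) / len(p)

# toy "alphabet": spectra of two reference tones, matched against an observation
tone_a = [1.0, 0.5, 0.25, 0.1]
tone_b = [0.1, 0.25, 0.5, 1.0]
observed = [0.9, 0.55, 0.2, 0.12]
best = min([tone_a, tone_b], key=lambda t: itakura_saito(observed, t))
print(best is tone_a)  # the observation is closest to tone A
```

In the paper's scheme, the alphabet entry with minimal distortion supplies the first pitch estimate and the starting point for the MCMC run.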
− 138 −
Trust as a Key Determinant of Loyalty and its<br />
Moderators<br />
Angela Sommerfeld 1<br />
Institut für Marketing, Humboldt-Universität zu Berlin angelaso@umich.edu<br />
Abstract. Theorizing that successful relational exchanges are motivated by trust<br />
and commitment, theory implicitly assumes that transactional and weak relational<br />
exchanges are not similarly motivated. Accordingly, Garbarino & Johnson (1999) showed<br />
that trust is a peripheral evaluation, predictive of purchase intentions not in weak<br />
relationships (individual ticket buyers) but only in strong ones (theatre subscribers).<br />
Extending their work, we take a more theory-based approach to develop and test<br />
moderating hypotheses of the trust-purchase intention relation beyond their moderator,<br />
the type of contractual relationship. Based on a survey of 575 business-to-business<br />
customers, we test the following proposed moderators: two facets of perceived risk,<br />
notably performance risk and consequentiality, perceived switching cost, and length<br />
of the relationship between companies. Different methods have been employed to test<br />
the moderations (Multiple-Groups, Kenny-Judd-Models, and Quasi-ML). Especially<br />
the Quasi-ML method, which we apply for a simultaneous test of several moderation<br />
hypotheses, represents a statistically efficient estimation method for SEMs with<br />
multiple latent interaction effects (Klein & Muthen 2007). Depending on the method<br />
and their varying properties, several hypotheses could be confirmed. The paper seeks<br />
to make three key contributions. First, it gives a theory-based account of boundary<br />
conditions for the relevance of trust in exchange relationships between companies.<br />
Since there have been conflicting opinions on the role of risk in exchange between<br />
companies, a second contribution of the paper is to clarify this role by thoroughly<br />
testing both moderating and mediating hypotheses. Testing interactions in a structural<br />
equation framework is not a straightforward task. Thus a third contribution is<br />
to illustrate strengths and weaknesses of different methods for a substantive research<br />
question with a real world data set.<br />
Key words: Trust, Risk, Switching Cost, Multiple<br />
References<br />
Klein, A.G. and Muthen, B.O. (2007): Quasi Maximum Likelihood Estimation of<br />
Structural Equation Models With Multiple Interaction and Quadratic Effects.<br />
Multivariate Behavioral Research, 42, 647–673.<br />
− 139 −
Generating Fictitious Training Data for Credit<br />
Client Classification<br />
Klaus B. Schebesch 1 and Ralf Stecking 2<br />
1 Faculty of Economics, University "Vasile Goldis", Arad, Romania
kbsbase@gmx.de
2 Faculty of Economics, University of Oldenburg, D-26111 Oldenburg
ralf.w.stecking@uni-oldenburg.de
Abstract. In recent work we started investigating the effects of using fictitious<br />
training examples in addition to the empirical training examples for a credit scoring<br />
problem. Fictitious training points added by a very simple procedure lead to<br />
some interesting effects in the context of SVM (support vector machine) classifier<br />
modeling. For instance, the resulting out-of-sample performance measures of such<br />
preliminary models are not entirely obvious. However, by using SVMs, we can also
observe the change in support vector formation subject to fictitious training points.
Such information may prove instrumental in producing fictitious training points
which are (more) problem dependent. We also explore connections to generative,
similarity-based and template-based learning, which have received some attention
in the recent classification literature in a related context. We then report on the results of
using different types of fictitious training examples in SVM credit client classification.<br />
Finally, in order to generalize these results, an evaluation of SVMs with different
kernel functions on various fictitious training data sets is presented.
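A minimal sketch of one "very simple procedure" for generating fictitious training examples; Gaussian jittering of the empirical points while keeping their labels is an assumption for illustration, since the abstract does not spell out the procedure used:

```python
import random

def fictitious_points(X, y, scale=0.1, copies=1, seed=0):
    """Create jittered copies of empirical credit-client records,
    keeping each copy's class label."""
    rng = random.Random(seed)
    Xf, yf = [], []
    for _ in range(copies):
        for row, label in zip(X, y):
            Xf.append([v + rng.gauss(0.0, scale) for v in row])
            yf.append(label)
    return Xf, yf

X = [[0.2, 1.0], [0.8, 0.1]]   # two toy credit clients
y = [1, -1]                    # good / bad
Xf, yf = fictitious_points(X, y, copies=3)
X_train, y_train = X + Xf, y + yf   # empirical + fictitious training set
```

An SVM trained on `X_train` can then be compared with one trained on `X` alone to observe the change in support vector formation.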
Key words: Fictitious training data, Data similarity, Support vector machine,<br />
Credit scoring<br />
References<br />
DUIN, R.P.W. and PEKALSKA, E. (2007): The Science of Pattern Recognition.<br />
Achievements and Perspectives. In: W. Duch, J. Mandziuk (eds.), Challenges<br />
for Computational Intelligence, Studies in Computational Intelligence, Springer<br />
HOCHREITER, S. and OBERMAYER, K. (2006). Support vector machines for<br />
dyadic data. Neural Computation, 18, 1472-1510<br />
LAUB, J., ROTH, V., BUHMANN, J.M. and MÜLLER, K. (2006): On the information
and representation of non-Euclidean pairwise data. Pattern Recognition,
39, 1815-1826.
STECKING, R. and SCHEBESCH, K.B. (2007): Improving Classifier Performance
by Using Fictitious Training Data: A Case Study. Accepted for publication in
Operations Research Proceedings 2007.
− 140 −
Clustering Association Rules with<br />
Fuzzy Concepts<br />
Matthias Steinbrecher 1 and Rudolf Kruse 1<br />
Department of Knowledge Processing and Language Engineering<br />
Otto-von-Guericke University of Magdeburg<br />
Universitätsplatz 2, 39106 Magdeburg, Germany<br />
{msteinbr,kruse}@iws.cs.uni-magdeburg.de<br />
Abstract. Association rules constitute a widely accepted technique to identify frequent
patterns inside huge volumes of data. Practitioners prefer the straightforward
interpretability of rules; however, depending on the nature of the underlying data,
the number of induced rules can be intractably large. Even reasonably sized result
sets may contain a large number of rules that are uninteresting to the user because
they are too general, are already known, or do not match other user-related intuitive
criteria. We allow the user to model his conception of interestingness by means of linguistic
expressions on rule evaluation measures and compound propositions of higher
order (i.e., temporal or spatial changes of rule properties). Multiple such linguistic
concepts can be considered a set of fuzzy patterns [?] and allow for the partition of
the initial rule set into fuzzy fragments that contain rules of similar membership to a
user's concept [?,?,?]. With appropriate visualization methods that extend previous
rule set visualizations [?], we allow the user to instantly assess the matching of his
concepts against the rule set.
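A linguistic concept on a rule evaluation measure can be sketched as a fuzzy set; the trapezoidal shape, the concept "high confidence", and the rules below are illustrative assumptions, not the paper's own concepts:

```python
def trapezoid(x, a, b, c, d):
    """Membership of x in a trapezoidal fuzzy set:
    rising on [a,b], flat on [b,c], falling on [c,d]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

# hypothetical association rules with a confidence measure
rules = {"A=>B": 0.95, "C=>D": 0.70, "E=>F": 0.40}

# linguistic concept "high confidence" (d just above 1.0 so that
# a confidence of exactly 1.0 keeps full membership)
high = {r: trapezoid(conf, 0.5, 0.8, 1.0, 1.01) for r, conf in rules.items()}

# fuzzy fragment of rules matching the concept, via an alpha-cut at 0.5
fragment = {r for r, m in high.items() if m >= 0.5}
```

Several such concepts partition the rule set into fuzzy fragments of similar membership, as described above.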
Key words: Association Rules, Fuzzy Clustering, Exploratory Data Analysis<br />
References<br />
1.Dubois, D., Prade, H., Testemale, C.: Weighted Fuzzy Pattern Matching. Fuzzy<br />
Sets and Systems 28(3) (1988) 313–331<br />
2.Höppner, F., Klawonn, F., Kruse, R., Runkler, T.: Fuzzy Clustering. Wiley,<br />
Chichester, United Kingdom (1999)<br />
3.Döring, C., Lesot, M.J., Kruse, R.: Data Analysis with Fuzzy Clustering Methods.<br />
Computational Statistics & Data Analysis 51(1) (2006) 192–214<br />
4.Kruse, R., Döring, C., Lesot, M.J.: Fundamentals of fuzzy clustering. In<br />
de Oliveira, J.V., Pedrycz, W., eds.: Advances in Fuzzy Clustering and its Applications.<br />
John Wiley & Sons (2007) 3–30<br />
5.Steinbrecher, M., Kruse, R.: Visualization of Possibilistic Potentials. In: Foundations<br />
of Fuzzy Logic and Soft Computing. Volume 4529 of Lecture Notes in<br />
Computer Science., Springer Berlin / Heidelberg (2007) 295–303<br />
− 141 −
Who’s Afraid of Statistics? – Measurement and<br />
Predictors of Statistics Anxiety in German<br />
University Students<br />
Carolin Strobl 1 and Friedrich Leisch 2<br />
1 Institut für Statistik, Ludwig-Maximilians-Universität München<br />
carolin.strobl@stat.uni-muenchen.de<br />
2 friedrich.leisch@stat.uni-muenchen.de<br />
Abstract. The measurement of statistics anxiety and the relationship between
statistics anxiety and several socio-demographic and educational factors were investigated
in a survey of over 600 German university students. The attitude towards
statistics was measured by means of the Affect and Cognitive Competence scales of<br />
the Survey of Attitudes Towards Statistics (SATS, Schau et al., 1995). Additional<br />
items covered, amongst others, prior mathematics experience and achievement, time<br />
and activity since high school graduation as well as items on the students’ strategy<br />
applied in mathematics courses, which was not considered in earlier studies. An<br />
anxiety indicator was derived from the SATS scales by means of cluster analysis<br />
in order to separate a group of students with high levels of statistics anxiety from<br />
those with moderate and low levels of anxiety. Using this anxiety indicator as the<br />
response, a set of relevant predictor variables was identified by means of random forest<br />
variable importance scores and further explored in a logistic regression model.<br />
Our results show that the SATS Affect and Cognitive Competence scales are well
suited for identifying students with high levels of negative attitudes towards statistics,
even though potential effects of the translation into German were noticeable<br />
for the positively worded items. Predictors found relevant for statistics anxiety were<br />
gender, mathematics taken as an intensive course in high school, prior (perceived)<br />
mathematics achievement, prior mathematics experience as well as two of the newly<br />
included items on students’ strategy applied in mathematics courses in high school:<br />
Students who named practicing as their strategy were less likely, while students who<br />
named memorizing as their strategy were more likely to show statistics anxiety.<br />
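Deriving a binary anxiety indicator from composite scores via cluster analysis can be sketched with a simple one-dimensional 2-means (the scores below are hypothetical; the study clusters the actual SATS scales):

```python
def two_means_1d(scores, iters=20):
    """Split scores into two clusters (low / high) with a 1-D 2-means."""
    lo, hi = min(scores), max(scores)
    for _ in range(iters):
        a = [s for s in scores if abs(s - lo) <= abs(s - hi)]
        b = [s for s in scores if abs(s - lo) > abs(s - hi)]
        lo, hi = sum(a) / len(a), sum(b) / len(b)
    # 1 = high-anxiety cluster, 0 = moderate/low cluster
    return [int(abs(s - lo) > abs(s - hi)) for s in scores]

scores = [1.2, 1.5, 1.1, 4.0, 4.3, 3.8]  # hypothetical anxiety composites
indicator = two_means_1d(scores)
```

The resulting binary indicator can then serve as the response in a random forest or logistic regression, as in the abstract.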
Key words: Attitude towards statistics, SATS, Statistics education<br />
References<br />
Schau, C., Stevens, J., Dauphinee, T. L. and Vecchio, A. D. (1995): The development<br />
and validation of the survey of attitudes toward statistics. Educational and<br />
Psychological Measurement, 55 (5), 868–875.<br />
− 142 −
A New, Conditional Variable Importance<br />
Measure for Random Forests<br />
Carolin Strobl 1 and Achim Zeileis 2<br />
1 Department of Statistics, Ludwig-Maximilians-Universität München<br />
carolin.strobl@stat.uni-muenchen.de<br />
2 Department of Statistics and Mathematics, Wirtschaftsuniversität Wien<br />
Achim.Zeileis@wu-wien.ac.at<br />
Abstract. Random forests are becoming increasingly popular in many scientific<br />
fields for assessing the importance of predictor variables (cf., e.g., Lunetta et al.,<br />
2004) because they can cope with “small n large p” problems, complex interactions<br />
and even with highly correlated predictor variables. Their variable importance can<br />
help identify relevant predictors even if they are highly correlated, while in classical<br />
regression models often only one representative of a group of correlated predictors is<br />
included. However, currently used variable importance measures can be biased, e.g.,
towards variables with many categories (Strobl et al., 2007) or correlated predictor
variables (Archer and Kimes, 2008). While the former issue can be addressed by
changing the resampling scheme in the tree growing process (Strobl et al., 2007),<br />
the latter is due to the permutation scheme employed in the computation of the<br />
variable importance. Here we suggest a new, conditional permutation scheme that<br />
is more suited to measure the degree of association of each predictor variable with<br />
the response. The resulting conditional variable importance can be used to rank the<br />
predictor variables more reliably.<br />
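The core idea of the conditional permutation scheme, permuting a predictor only within strata defined by correlated covariates so that its association with those covariates is preserved, can be sketched as follows (toy data; the actual proposal operates inside the random forest's tree partitions):

```python
import random

def permute_within(values, groups, seed=0):
    """Permute `values` only within the strata given by `groups` --
    the core idea of conditional permutation importance."""
    rng = random.Random(seed)
    out = list(values)
    for g in set(groups):
        idx = [i for i, gr in enumerate(groups) if gr == g]
        vals = [values[i] for i in idx]
        rng.shuffle(vals)
        for i, v in zip(idx, vals):
            out[i] = v
    return out

x      = [1, 2, 3, 10, 11, 12]
strata = [0, 0, 0,  1,  1,  1]   # levels of a correlated covariate
xp = permute_within(x, strata)
```

After conditional permutation, each value stays inside its stratum, so the predictor's dependence on the stratifying covariate is preserved while its link to the response is broken; the drop in model accuracy then measures only the predictor's own contribution.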
Key words: Feature selection, Correlation, Variable importance, Permutation tests<br />
References<br />
Archer, K. and Kimes, R. (2008): Empirical characterization of random forest variable
importance measures. Computational Statistics & Data Analysis, 52(4),<br />
2249–2260.<br />
Lunetta, K.L., Hayward, L.B., Segal, J., Eerdewegh, P.V. (2004): Screening largescale<br />
association study data: Exploiting interactions using random forests. BMC<br />
Genetics, 5:32.<br />
Strobl, C., Boulesteix, A.-L., Zeileis, A. and Hothorn, T. (2007): Bias in random<br />
forest variable importance measures: Illustrations, sources and a solution. BMC<br />
Bioinformatics, 8:25.<br />
− 143 −
Conjoint Analysis within the Field of Customer Satisfaction Problems:
A Model of Composite Product/Service
Piotr Tarka<br />
School of Banking in Poznan<br />
Department of Organization and Management<br />
Poland<br />
piotr.tarka@wsb.poznan.pl<br />
Abstract. This paper describes how the benefits of conjoint analysis can be
adapted to measuring performance criteria in the customer service area. The paper
points out how a single composite model can be built, incorporating a wide
range of key customer choice criteria, including service. The author draws attention
to some specific problems. One of them is the apparent distortion in computed utility
values that arises in circumstances where global macro variables are traded off against
more micro topics. This can lead to dramatic underestimation of the overall contribution
or importance of macro issues. To address this concern, the author discusses an approach
known as dual scaling for eliminating the bias. Another drawback to the approach
in customer service studies is the limited number of variables that can be addressed
by a typical conjoint study. This makes it difficult to cover the large range of service
topics typically examined in a customer satisfaction study. The paper argues
that this limits the scope of both classical conjoint studies and current customer
satisfaction approaches.
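One common way to summarize the overall contribution of an attribute in conjoint analysis is the range of its part-worth utilities relative to the sum of all ranges; the part-worths below are hypothetical:

```python
def attribute_importance(part_worths):
    """Relative importance of each attribute:
    range of its part-worth utilities / sum of all attribute ranges."""
    ranges = {a: max(u) - min(u) for a, u in part_worths.items()}
    total = sum(ranges.values())
    return {a: r / total for a, r in ranges.items()}

# hypothetical part-worth utilities per attribute level
pw = {"price":   [0.9, 0.3, -1.2],
      "service": [0.4, -0.4],
      "brand":   [0.2, -0.2]}
imp = attribute_importance(pw)
```

The distortion discussed in the abstract concerns exactly such importance shares: when a macro variable is traded off against micro topics, its range, and hence its computed importance, can be understated.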
Key words: Conjoint Analysis, Customer satisfaction problems<br />
− 144 −
Optimal VDSL Expansion Taking into Account
Infrastructure Restrictions
and Marketing Requirements
Klaus Thiel 1<br />
T-Online, T-Online-Allee 1, 64295 Darmstadt k.thiel@t-online.net<br />
Abstract. The expansion of the Very High Speed Subscriber Line (VDSL) network
in Germany is a prestigious infrastructure project worth billions. VDSL enables
a transfer rate of 50 megabits per second. With it, for example, so-called entertainment
customers can receive two movies in parallel in High Definition Television
(HDTV) quality while internet surfing and telephoning remain possible. Within the
B2B sector, many new applications, such as telecommuting in virtual teams around the
world and telemedicine, have become feasible. Currently Deutsche Telekom has deployed
VDSL in 27 cities, and for 2008 the VDSL expansion is planned for a further 23
cities. The optimal choice of the VDSL expansion areas primarily depends on infrastructure
restrictions as well as on marketing requirements. In order to execute a
spatial optimisation procedure, all the important infrastructure and marketing information
must be converted to vector data by digitizing and subsequently imported
into a Geo-Information-System (GIS).
The most important GIS providers in Germany are Microm with MicromGEO
and ESRI with ArcGIS. In order to select the most suitable system for the
problem at hand, both systems have to be evaluated on the basis of objective
test criteria. Test criteria are the quality of geo-referencing of address data and
the mapping quality of different spatial levels. In order to choose the most suitable
spatial level (e.g. city, post code, dialling code, municipality), several analyses have
been carried out. In the next step, a spatial scoring has been developed and imported
into the GIS in order to ensure that those areas with the highest VDSL customer
equity potential will be the first to be expanded. Finally, using the spatial scoring,
a spatial potential ranking has been calculated, on the basis of which the optimal
VDSL expansion can be planned and executed.
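The spatial scoring and potential-ranking step can be sketched as a weighted sum of normalized area indicators; the indicator names and weights below are illustrative assumptions, not the scoring actually used:

```python
def potential_ranking(areas, weights):
    """Rank areas by a weighted spatial score,
    highest customer-equity potential first."""
    def score(indicators):
        return sum(weights[k] * v for k, v in indicators.items())
    return sorted(areas, key=lambda a: score(a["indicators"]), reverse=True)

areas = [
    {"name": "A-town", "indicators": {"purchasing_power": 0.9, "density": 0.8}},
    {"name": "B-town", "indicators": {"purchasing_power": 0.4, "density": 0.9}},
    {"name": "C-town", "indicators": {"purchasing_power": 0.2, "density": 0.3}},
]
weights = {"purchasing_power": 0.7, "density": 0.3}
ranking = [a["name"] for a in potential_ranking(areas, weights)]
```

In the GIS, each area's score would additionally be constrained by the infrastructure restrictions mentioned above before the ranking is drawn up.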
Key words: Customer Equity, Geo-Information-System, Optimal VDSL Expansion<br />
− 145 −
Evaluating the Data Structure and Identifying
Homogeneous Spatial Units in the Database
"Sustainability issues in sensitive areas" of the
EU-FP6 Integrated Project SENSOR
Nguyen Xuan Thinh 1 , Leander Küttner 1 , and Gotthard Meinel 1<br />
Leibniz Institute of Ecological and Regional Development (IOER), Weberplatz 1,<br />
01217 Dresden, Germany, ng.thinh@ioer.de, l.kuettner@ioer.de,<br />
g.meinel@ioer.de<br />
Abstract. SENSOR (Sustainability Impact Assessment: Tools for Environmental,
Social and Economic Effects of Multifunctional Land Use in European Regions) is
an Integrated Project within the 6th Framework Research Programme of the European
Commission (33 research partners from 15 countries). The SENSOR project
is structured into seven interrelated modules M1-M7. For Module M6 "Sustainability
issues in sensitive areas", a database with more than 800 000 entries has
been established, in which Lusatia, Silesia, Eisenwurzen, the High Tatra, Valais, the Estonian
coastal zone, and Malta were selected as sensitive area case studies (SACS).
Using ACCESS, SPSS and ArcMap, we conduct a comparative analysis and evaluate
this M6 database with a view to the theoretical sustainability indicators defined
in Module M2. We then determine similarities and dissimilarities between data
from different SACS. By applying adequate cluster analysis we identify homogeneous
spatial units of selected SACS in order to find generalisable and specific sustainability
characteristics in the seven case studies. As an example, we describe the case
study of Lusatia in more detail. The area of Lusatia is divided into several local area
units (LAU2), which are qualified for a statistical examination by a high number of
entries and available variables. As a basis we choose a set of 25 variables related to
sustainable land use issues. Using factor analysis we determine the significant
variables and use them as representatives to characterise typical land use clusters. In
a next step the clusters were identified by a combination of hierarchical and k-means
cluster analysis methods. To describe the situation at different points in time and the
development in the period between them, we repeat the cross-sectional
analysis of 1996 for 2004. The results of the statistical analysis are presented in
ArcMap visualisations. Although the procedure and the variable base are the same,
the results differ and thus reveal the relevant land use trends within the general social
transformation process of the 1990s.
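The combination of hierarchical and k-means clustering can be sketched on one-dimensional toy values (the study works with 25 variables): agglomerative merging picks the number of clusters and the initial centroids, which k-means then refines.

```python
def hier_init(points, k):
    """Agglomerative (centroid-linkage) merging down to k clusters;
    returns the k initial centroids."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = sum(clusters[i]) / len(clusters[i])
                cj = sum(clusters[j]) / len(clusters[j])
                d = abs(ci - cj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters.pop(j)  # merge closest pair
    return [sum(c) / len(c) for c in clusters]

def kmeans_1d(points, centroids, iters=10):
    """k-means refinement of the hierarchically chosen centroids."""
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            groups[nearest].append(p)
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return sorted(centroids)

data = [0.1, 0.2, 0.15, 5.0, 5.2, 4.9]  # hypothetical LAU2 indicator values
cents = kmeans_1d(data, hier_init(data, 2))
```

The hierarchical step avoids k-means' sensitivity to random starting centroids, which is the usual motivation for combining the two methods.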
Key words: EU Integrated Project SENSOR, Comparative Analysis, Similarity,
Dissimilarity, SENSOR M6 Indicators, Cluster Analysis
− 146 −
Mining ideas from textual information<br />
Dirk Thorleuchter<br />
Fraunhofer Institut für Naturwissenschaftlich-Technische Trendanalysen,<br />
D-53879 Euskirchen, Appelsgarten 2, Germany<br />
dirk.thorleuchter@int.fraunhofer.de<br />
Abstract. This paper describes an approach to automatically find new technological
ideas in textual information. On the basis of Thorleuchter (2008), the existing
theoretical algorithm is extended with text mining techniques
such as stemming and term frequency weighting (Ferber (2003)) and with "creativity technique" approaches
from the literature (Dean et al. (2001)). The aim of the new algorithm is to
find ideas using a general stop word list, because up to now the existing approach
has been based on the inefficient use of a domain-specific stop word list created
specifically for the analyzed text.
This new approach is evaluated with non-proprietary data and is realized as a web-based
application, named "Technological Idea Miner", which can be used for further
testing and evaluation. The identified ideas are displayed taking into
account cognitive research findings, as described in Puppe et al. (2003).
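The stop-word-based preprocessing mentioned above can be sketched as follows (a tiny illustrative stop word list; the approach additionally uses stemming, which is omitted here):

```python
from collections import Counter

# tiny illustrative general stop word list, not the one actually used
GENERAL_STOP_WORDS = {"a", "an", "the", "of", "is", "and", "to", "in"}

def term_frequencies(text, stop_words=GENERAL_STOP_WORDS):
    """Tokenize, drop general stop words, count term frequencies."""
    tokens = [t for t in text.lower().split() if t.isalpha()]
    return Counter(t for t in tokens if t not in stop_words)

tf = term_frequencies("The sensor is an idea and the idea is new")
```

A general list like this replaces the domain-specific stop word list criticized in the abstract, so the same preprocessing works for any analyzed text.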
Key words: Text mining, Knowledge Discovery, Ideas
References<br />
Dean, G., Hender, J.M., Nunamaker, J.F. and Rodgers, T.L. (2001): Improving Group
Creativity. In: Sprague, R. (Ed.): Proceedings of the 34th Hawaii International
Conference on System Sciences - 2001. IEEE Publishing, Maui (USA),
1070.
Ferber, R. (2003): Information Retrieval. dpunkt.verlag, Heidelberg, 41.<br />
Puppe, F., Stoyan, H. and Studer, R. (2003): Knowledge Engineering. In: G. Görz,
C.-R. Rollinger and J. Schneeberger (Eds.): Handbuch der Künstlichen Intelligenz.
4th edition, Oldenbourg, München, 612.
Thorleuchter, D. (2007): Finding new technological ideas and inventions with text<br />
mining and technique philosophy. In: C. Preisach, H. Burkhardt, L. Schmidt-<br />
Thieme, R. Decker (Eds.): Data Analysis, Machine Learning, and Applications.<br />
Springer, Heidelberg-Berlin.<br />
− 147 −
Mining technologies in security and defense<br />
Dirk Thorleuchter<br />
Fraunhofer Institut für Naturwissenschaftlich-Technische Trendanalysen,<br />
D-53879 Euskirchen, Appelsgarten 2, Germany<br />
dirk.thorleuchter@int.fraunhofer.de<br />
Abstract. In recent years, the rising asymmetrical threat has caused governments
to pay more attention to security, especially in technological areas. New and ever
more complex tasks in areas concerned with defense against these new types of threat
require additional research and development of new techniques. For this reason,
national and European governments are increasingly funding security and defense
(S&D) related technological research.
In this paper, we give an overview of the technological landscape of S&D
by presenting different S&D technologies and their relationships, as described in
Geschka et al. (2005) and Reiß (2006). To this end, we first identify technologies
from different technological S&D taxonomies and secondly identify innovative
S&D research projects. The research projects are classified according to technologies,
and on that basis the relationships between the technologies are presented.
In detail, text documents are represented as vectors in a vector space model using
term frequency and corpus-based term co-occurrence data. We use Jaccard's coefficient
(Ferber (2003)) to measure similarity and a fuzzy alpha-cut method for
classification. Structured documents (XML) are used as data source and sink.
To realize this approach, we present a web application, the "S&D Technology Miner",
which supports research program planners and researchers who acquire
funding in this area, and which also serves for testing and evaluating the approach.
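Jaccard's coefficient on term sets and an alpha-cut for multi-label classification can be sketched as follows (the technology profiles and the threshold are illustrative assumptions):

```python
def jaccard(a, b):
    """Jaccard's coefficient between two term sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def alpha_cut_classify(doc_terms, technology_profiles, alpha=0.3):
    """Assign every technology whose similarity passes the alpha-cut."""
    sims = {t: jaccard(doc_terms, terms)
            for t, terms in technology_profiles.items()}
    return {t for t, s in sims.items() if s >= alpha}

profiles = {  # hypothetical S&D technology term profiles
    "sensors":  {"radar", "infrared", "detection"},
    "robotics": {"autonomous", "vehicle", "control"},
}
doc = {"radar", "detection", "network"}   # terms of a research project
labels = alpha_cut_classify(doc, profiles)
```

The alpha-cut turns the fuzzy similarity scores into a crisp, possibly multi-valued assignment of projects to technologies.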
Key words: Security, Defense, Technology, Text mining, Classification
References<br />
Ferber, R. (2003): Information Retrieval. dpunkt.verlag, Heidelberg, 78.<br />
Geschka, H., Schauffele, J. and Zimmer, C. (2005): Explorative Technologie-Roadmaps
- Eine Methodik zur Erkundung technologischer Entwicklungslinien
und Potenziale. In: M.G. Möhrle and R. Isenmann (Eds.): Technologie-Roadmapping.
Springer, Berlin, Heidelberg et al., 165.
Reiß, T. (2006): Innovationssysteme im Wandel - Herausforderungen für die Innovationspolitik.<br />
In: B. Müller and U. Glutsch (Eds.): Fraunhofer-Institut für<br />
System- und Innovationsforschung - Jahresbericht 2006. Karlsruhe, 10<br />
− 148 −
Multilevel Simultaneous Component Analysis<br />
for Studying Inter-individual and<br />
Intra-individual Variabilities<br />
Marieke E. Timmerman 1 , Anna Lichtwarck-Aschoff 1 , and Eva Ceulemans 2<br />
1 Heymans Institute for Psychology, University of Groningen,
Grote Kruisstraat 2/1, 9712 TS Groningen, The Netherlands
m.e.timmerman@rug.nl
2 Centre for Methodology of Educational Research, University of Leuven,
Belgium
Abstract. All psychological processes are dynamic. To fully understand those processes<br />
it is necessary to consider the intra-individual variation of individuals over<br />
time. In doing so, it is important to recognize that the nature of the processes may
differ across individuals. This intricate matter requires new modelling approaches.
We focus on the exploratory modelling of multivariate data that have been repeatedly<br />
gathered from more than one individual. We aim at identifying meaningful<br />
sources of both the inter-individual variability and the intra-individual variability<br />
in the observed variables, while expressing the similarities and differences in those<br />
sources across individuals. To this end, we use multilevel simultaneous component<br />
analysis (MLSCA; Timmerman, 2006).<br />
In essence, MLSCA specifies separate component models to account for inter-individual
and intra-individual variabilities. The latter may entail differences across
individuals, which are expressed via the covariances of the individuals' within-component
scores. The common within-loadings ensure comparability across
individuals. The relationships between MLSCA and the related multilevel and multigroup<br />
structural equation models will be discussed. The usefulness of MLSCA to<br />
grasp inter-individual and intra-individual variabilities is illustrated with an empirical<br />
example from a diary study focusing on emotions involved in daily conflicts<br />
between adolescent girls and their mothers.<br />
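The first step of such a multilevel decomposition, splitting each individual's data into a between part (individual means) and a within part (deviations), can be sketched before any components are extracted (hypothetical diary data):

```python
def between_within_split(data):
    """Split each individual's time series (rows = time points,
    cols = variables) into an individual-mean (between) part
    and a deviation (within) part."""
    between, within = {}, {}
    for person, rows in data.items():
        n = len(rows)
        means = [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]
        between[person] = means
        within[person] = [[r[j] - means[j] for j in range(len(r))] for r in rows]
    return between, within

# hypothetical diary data: 3 days x 2 emotion variables per girl
data = {"girl1": [[1.0, 2.0], [3.0, 4.0], [2.0, 3.0]],
        "girl2": [[5.0, 0.0], [7.0, 2.0], [6.0, 1.0]]}
between, within = between_within_split(data)
```

MLSCA then fits one component model to the between part and one, with common loadings, to the stacked within parts; the sketch covers only the preceding decomposition.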
Key words: exploratory modelling of longitudinal data, multivariate analysis<br />
References<br />
Timmerman, M.E. (2006). Multilevel Component Analysis. British Journal of Mathematical<br />
and Statistical Psychology, 59, 301–320.<br />
− 149 −
Issues Related to the Implementation of a<br />
Dynamic Logistic Model for Classifier<br />
Combination<br />
Amber Tomas<br />
The University of Oxford, 1 South Parks Road, Oxford OX2 3TG, United<br />
Kingdom tomas@stats.ox.ac.uk<br />
Abstract. We consider a model for classification of sequentially received observations,<br />
when the population of interest is not assumed to be stationary. The model<br />
we propose combines the outputs of a fixed set of component classifiers (chosen in<br />
advance), and the parameters of the combination are allowed to change over time.<br />
Specifically, we use a logistic Dynamic Generalized Linear Model [1] for combining<br />
the classifier outputs, and take a predictive approach towards estimation of the posterior<br />
class probabilities. The dynamics are incorporated through the equation for<br />
parameter evolution<br />
βt+1 = βt + ωt,   ωt ∼ N(0, Σt).   (1)
The implementation of this model when the distribution of the parameters is<br />
not assumed to be normal is not straightforward. In addition to computational<br />
complexity, there arise complications related to the identifiability of the parameters<br />
βt which are unique to classification problems. Specifically, although the classifications
produced as a result of using the model with parameters βt are equivalent to
the classifications when using parameters αβt, α > 0, the posterior class probabilities
are more extreme in the second case. This results in increased volatility of the<br />
classification rule when using a sequential MCMC method to estimate the posterior<br />
distribution of the parameters. We discuss why there is no simple constraint for the<br />
parameters which will alleviate this identifiability problem, and discuss an alternative<br />
approach. In addition, we consider the related problems of adaptively changing<br />
the effective value of Σt, and the consequences of using the model (1) when it is not<br />
assumed to be correct.<br />
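The scaling non-identifiability described above, namely that βt and αβt yield the same classifications while the posterior probabilities become more extreme as the parameters are scaled up, can be checked numerically:

```python
import math

def posterior(x, beta):
    """Logistic posterior probability of class 1 for feature vector x."""
    z = sum(xi * bi for xi, bi in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-z))

x, beta, alpha = [1.0, -0.5], [0.8, 1.2], 3.0
p1 = posterior(x, beta)                       # original parameters
p2 = posterior(x, [alpha * b for b in beta])  # scaled parameters
same_class = (p1 > 0.5) == (p2 > 0.5)         # classification unchanged
```

Because only the probabilities change, not the decision boundary, no likelihood-based constraint on the scale is available, which is the identifiability problem discussed in the abstract.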
Key words: Multiple Classifier Systems, Dynamic Classification, Identifiability<br />
References<br />
1.West, M., Harrison, J. and Migon, H. (1985): Dynamic Generalized Linear Models<br />
and Bayesian Forecasting. Journal of the American Statistical Association, 80,<br />
73–83.<br />
− 150 −
A Comprehensive Partial Least Squares<br />
Approach to Component-Based Structural<br />
Equation Modeling ⋆<br />
Laura Trinchera 1 and Vincenzo Esposito Vinzi 2<br />
1 Dipartimento di Matematica e Statistica, Università degli Studi di Napoli
Federico II. ltrinche@unina.it
2 ESSEC Business School of Paris and Singapore. vinzi@essec.fr
Abstract. PLS Path Modeling (PLS-PM) is generally meant as a component-based
approach to structural equation modeling that privileges a prediction-oriented
discovery process over the statistical testing of causal hypotheses. Differently
from covariance-based structural equation modeling (i.e. LISREL-type methods), in
PLS-PM latent variables are estimated as linear combinations of the manifest variables.<br />
Thus they are more naturally defined as emergent constructs (with formative<br />
indicators) rather than latent constructs (with reflective indicators). Nowadays, formative<br />
relationships are more and more used in real applications but pose a few<br />
problems for the statistical estimation and interpretation. As of today, formative<br />
relationships in PLS-PM imply multiple OLS regressions between each latent variable<br />
and its own formative indicators. As is well known, OLS regression may yield unstable
results in the presence of strong correlations between explanatory variables, and it is not
feasible when the number of statistical units is smaller than the number of variables
or when missing data affect the dataset. Thus, it seems quite natural to introduce
a PLS Regression (PLS-R) external estimation mode within the PLS-PM algorithm<br />
so as to overcome the mentioned problems, preserve the formative relationships and<br />
still remain coherent with the component-based and prediction-oriented nature of<br />
PLS-PM. Here, the main issues concerning the use of formative indicators in PLS-<br />
PM are investigated. Furthermore, the features of PLS-R may be fruitfully exploited<br />
in the internal estimation phase as well as for estimating path coefficients upon convergence<br />
of the PLS-PM algorithm when classical OLS estimates become unstable<br />
or even unfeasible. Finally, the case of formative indicators will be considered also<br />
with respect to clustering techniques recently proposed for latent class detection in<br />
PLS-PM.<br />
Key words: Formative Indicators, PLS Regression, Latent Factor Scores<br />
⋆ The participation of L. Trinchera to this research was supported by the MURST<br />
grant “Multivariate statistical models for the ex-ante and the ex-post analysis<br />
of regulatory impact”, coordinated by C. Lauro (2006). The participation of V.<br />
Esposito Vinzi to this research was supported by CERESSEC, Research Center<br />
of the ESSEC Business School.<br />
− 151 −
Relative Importance of Predictor Variables
in Support Vector Machine Models
Michał Trzesiok
Department of Mathematics,<br />
Katowice University of Economics, ul. Bogucicka 14, 40-226 Katowice<br />
trzesiok@ae.katowice.pl<br />
Abstract. Models resulting from Support Vector Machines suffer from a lack
of interpretability. It is usually very hard to extract knowledge about the analyzed
phenomenon from the classification model obtained by using SVMs because the<br />
classification task is realized in a high dimensional feature space. Although the<br />
method identifies the observations which are crucial for the form of the decision<br />
function, it does not show which variables are relevant and which are redundant.<br />
Vapnik claims that feature selection is not necessary for SVMs, i.e. building the<br />
model on a set of variables including some redundant variables does not change the<br />
generalization ability. Once the model is built, it is still valuable to recognize the<br />
relative importance of predictor variables. The method we propose uses sampling
techniques, backward selection and the Rand index for evaluating whether a particular
variable is redundant or not. We also try to extend the idea to obtain a ranking
of the predictor variables reflecting the relative importance of the inputs.
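The Rand index used for the redundancy evaluation measures the agreement of two partitions, here sketched on the predictions of a full model and of a model without one variable (toy labels):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of object pairs on which two partitions agree
    (both pairs together, or both pairs apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

full_model    = [0, 0, 1, 1]  # predictions of the full SVM model
reduced_model = [0, 0, 1, 0]  # predictions without one candidate variable
ri = rand_index(full_model, reduced_model)
```

A Rand index near 1 after removing a variable suggests the variable is redundant, since the induced partition of the observations barely changes.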
Key words: Support Vector Machines, redundancy, relevant attributes<br />
References<br />
Abe, S. (2005): Support Vector Machines for Pattern Classification, Springer, London.<br />
Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984): Classification and Regression<br />
Trees, Wadsworth, Monterey.<br />
Schölkopf, B. and Smola, A.J. (2002): Learning with Kernels, MIT Press, Cambridge.
Vapnik, V. (1998): Statistical Learning Theory, John Wiley & Sons, N.Y.<br />
− 152 −
Comparison of Algorithms to Find Differentially
Expressed Genes in Microarray Data
Alfred Ultsch<br />
Databionics Research Lab, Department of Computer Science<br />
University of Marburg, D-35032 Marburg, Germany<br />
ultsch@informatik.uni-marburg.de<br />
Summary. Several algorithms have been published for the identification of
differentially expressed genes in DNA microarray experiments. The microarrays
in this type of experiment come from two different populations (groups) of specimens.
Among the many genes on the microarrays, those genes are sought that are most
relevant for the distinction between the two populations. Usually such algorithms
produce ordered lists of genes. In this work a method to compare the performance
of such algorithms is proposed. In order to compare different methods for the
identification of significant genes, a data set with known properties is published. This
benchmark data set is used to compare the performance of different algorithms with
that of a newly designed one, called PUL. The comparison is based on established
measures from information retrieval. Surprisingly, a clear ordering in performance of the
algorithms was observed: PUL outperformed the other algorithms by a factor of two.
PUL has been applied successfully in several practical applications, in which
the importance of the genes proposed by PUL was independently verified.
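The abstract does not name the specific retrieval measures used; precision at k over a ranked gene list is one standard example of such a measure:

```python
def precision_at_k(ranked_genes, relevant, k):
    # Information-retrieval view of an ordered gene list: the fraction
    # of the k top-ranked genes that are truly differentially expressed.
    return sum(1 for g in ranked_genes[:k] if g in relevant) / k
```

On benchmark data with known properties, `relevant` is the set of genes planted as differentially expressed, so precision at k can be computed exactly for each algorithm's list.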
References<br />
1. Gebhard, S., Bergmann, E., Weber, A., Berwanger, B., Eilers, M., Ultsch, A.,<br />
Christiansen, H.: Classification of stage 3 neuroblastomas by artificial neural<br />
networks based analysis of cDNA microarrays. (submitted)<br />
2. Dudoit, S., Fridlyand, J., Speed, T. (2000). Comparison of discrimination methods<br />
for the classification of tumors using gene expression data. Technical report<br />
576, Department of Statistics, University of California, Berkeley.<br />
3. Pallasch, C.P., Schwamb, J., Schulz, A., Königs, S., Debey, S., Kofler, D., Schultze,
J.L., Hallek, M., Ultsch, A. and Wendtner, C. (2007): Targeting lipid metabolism by
the lipoprotein lipase inhibitor orlistat results in apoptosis in chronic lymphocytic
leukemia. Leukemia, accepted.
4. Tusher, V., Tibshirani, R. and Chu, G. (2001): Significance analysis of microarrays<br />
applied to the ionizing radiation response, PNAS 2001 98: 5116-5121.<br />
5. Ultsch, A. (2005): Improving the identification of differentially expressed genes in
cDNA microarray experiments. In: Weihs, C. and Gaul, W. (Eds.): Classification - the
Ubiquitous Challenge. Springer, Heidelberg, pp. 378-385.
− 153 −
Is log ratio a good value for measuring return<br />
in stock investments?<br />
Alfred Ultsch<br />
Databionics Research Group<br />
Philipps-University of Marburg, Germany<br />
ultsch@informatik.uni-marburg.de<br />
Abstract. Measuring the rate of return is an important issue for the theory and practice
of investments in the stock market. A common measure for the rate of return is the
logarithm of the ratio of successive prices (LogRatio). In this paper it is shown that
LogRatio as well as the arithmetic return rate (ROI) have several disadvantages. As an
alternative, relative differences (RelDiff) are proposed to measure rates of return.
The stability of RelDiff against numerical and rounding errors is demonstrated
to be much better than that of LogRatio and ROI. RelDiff values are practically identical to
LogRatio and ROI for the interesting ranges of return rates. Relative differences map
return rates to a finite range, which is a big advantage for most subsequent analyses.
The usefulness of the approach is demonstrated on daily return rates of a large set
of stocks.
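The abstract gives no formulas; the three measures can be sketched as follows, where the exact definition of RelDiff (here a symmetric relative difference, bounded in (-2, 2)) is our assumption, not taken from the paper:

```python
import math

def log_ratio(prev, cur):
    # LogRatio: logarithm of the ratio of successive prices.
    return math.log(cur / prev)

def roi(prev, cur):
    # Arithmetic return rate (return on investment).
    return (cur - prev) / prev

def rel_diff(prev, cur):
    # Symmetric relative difference (assumed definition): bounded in
    # (-2, 2) for any positive prices, unlike LogRatio and ROI.
    return 2.0 * (cur - prev) / (cur + prev)
```

For a one-percent price move all three values nearly coincide, while for extreme moves `rel_diff` stays bounded.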
Key words: Rate of Return, Return on Investment, Financial time series,
Black-Scholes Model
References<br />
Bodie, Z., Kane, A. and Marcus, A.J. (2004): Essentials of Investments, 5th Edition.
McGraw-Hill/Irwin, New York.
Brealey, R.A., Myers, S.C. and Allen, F. (2006): Principles of Corporate Finance,
8th Edition. McGraw-Hill/Irwin.
Feibel, B.J. (2003): Investment Performance Measurement. Wiley, New York.
Franke, J., Haerdle, W. and Hafner, C. (2004): Einführung in die Statistik der
Finanzmärkte, 2nd Edition. Springer, Berlin.
Ultsch, A. (2005): Improving the identification of differentially expressed genes in cDNA
microarray experiments. In: Weihs, C. and Gaul, W. (Eds.): Classification - the
Ubiquitous Challenge. Springer, Heidelberg, pp. 378-385.
Ultsch, A. (2003): Is log ratio a good value for identifying differential expressed genes in
microarray experiments? Technical Report No. 35, Dept. of Mathematics and
Computer Science, University of Marburg, Germany.
− 154 −
Mosaic Plots and Knowledge Structures<br />
Ali Ünlü<br />
Department of Mathematics, University of Augsburg, Germany<br />
ali.uenlue@math.uni-augsburg.de<br />
Abstract. Mosaic plots are state-of-the-art graphics for multivariate categorical<br />
data (Hofmann (2008)). Knowledge structures are mathematical models that belong
to the recent theory of knowledge spaces in psychometrics (Doignon and Falmagne<br />
(1999)). This paper presents an application of mosaic plots and variants such as<br />
fluctuation diagrams and multiple barcharts to psychometric data arising from underlying
knowledge structure models. In simulation trials, the scope of this graphing<br />
method in knowledge space theory is investigated.<br />
Key words: Mosaic plot, Visualization, Knowledge structure, Psychometrics<br />
References<br />
Doignon, J.-P. and Falmagne, J.-Cl. (1999): Knowledge Spaces. Springer, Berlin.<br />
Hofmann, H. (2008): Mosaic Plots and Their Variants. In: C.H. Chen, W. Haerdle
and A.R. Unwin (Eds.): Handbook of Data Visualization. Springer, Heidelberg,<br />
617–642.<br />
− 155 −
Visualizing preferences using minimum<br />
variance nonmetric unfolding<br />
Michel van de Velden, Alain de Beuckelaer, Patrick Groenen, and Frank<br />
Busing<br />
Abstract. In multidimensional unfolding one wishes to obtain a map with subjects<br />
(e.g. consumers) and objects (e.g. products), in such a way that distances<br />
between subjects and objects in the map best represent the preferences as indicated<br />
in the data. Unfolding models are particularly adequate when the data (e.g. consumers’<br />
preferences) are not unidirectional but exhibit an inverted U-shape. If the<br />
alternatives are rated on an interval (or ratio) scale, the ‘metric’ unfolding model is
appropriate. If, however, the alternatives are rank ordered or rated on an ordinal
(e.g. Likert-type) scale, one needs the ‘nonmetric’ unfolding model. Until
recently, nonmetric unfolding was not feasible because of degeneracy problems.
Degenerate solutions are solutions where the extent of ‘misfit’ can be made arbitrarily
small; existing algorithms consistently produced such degenerate solutions.
Recently, Busing, Groenen and Heiser (Psychometrika, 2005, pp. 71-98) proposed a solution
to this long-standing methodological problem by including a penalty in the<br />
algorithm. The resulting PREFSCAL algorithm is available in SPSS. In PREFSCAL<br />
two parameters are introduced that determine the strength of the penalty that leads<br />
the algorithm away from degenerate solutions. No clear directions concerning the<br />
choice of the penalty parameters are given. In this paper, we propose a minimum<br />
variance criterion to choose the penalty parameters. By studying the stability of the<br />
unfolding solutions as a function of the penalty parameters, we are able to determine<br />
the penalty in such a way that a minimum variance, non-degenerate solution<br />
is obtained. The data used in our analysis stem from a consumer study in which<br />
consumers were asked to rank-order new product ideas for soups.<br />
− 156 −
Selection of items for tests and questionnaires<br />
using Mokken scale analysis<br />
L. Andries van der Ark and J. Hendrik Straat<br />
Department of Methodology and Statistics<br />
Tilburg University<br />
P.O. Box 90153<br />
5000 LE Tilburg<br />
The Netherlands<br />
a.vdark@uvt.nl<br />
Abstract. Tests or questionnaires are often used to measure personality traits,<br />
attitudes, opinions, skills, and abilities. These tests and questionnaires consist of<br />
questions, statements, problems, games, or rating scales, which are generically called<br />
items. An important step in the construction of a test or questionnaire is a careful
selection of items. A well-known approach for selecting qualitatively good items in a<br />
test is Mokken scale analysis. In this presentation, Mokken scale analysis is explained<br />
and recent developments are discussed. Special attention is given to a comparison<br />
of automated item selection algorithms used in Mokken scale analysis.<br />
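As background (not taken from the presentation itself), item selection in Mokken scale analysis is driven by Loevinger's scalability coefficient H; for a pair of dichotomous items it can be sketched as:

```python
def loevinger_h(item_a, item_b):
    # Loevinger's H for a pair of dichotomous items: 1 minus the ratio
    # of observed Guttman errors (passing the hard item while failing
    # the easy one) to the number expected under independence.
    n = len(item_a)
    easy, hard = (item_a, item_b) if sum(item_a) >= sum(item_b) else (item_b, item_a)
    observed = sum(1 for e, h in zip(easy, hard) if e == 0 and h == 1)
    expected = n * (sum(1 for e in easy if e == 0) / n) * (sum(hard) / n)
    return 1.0 - observed / expected
```

Automated item selection procedures grow scales from item pairs whose H exceeds a user-chosen lower bound; H = 1 corresponds to a perfect Guttman pattern.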
− 157 −
Estimating the prevalence of rule transgression<br />
using data collected by randomized response<br />
Peter G.M. van der Heijden<br />
Department of Methodology and Statistics<br />
Utrecht University<br />
PO Box 80140<br />
3508 TC Utrecht<br />
The Netherlands<br />
P.G.M.vanderheijden@uu.nl<br />
Abstract. In criminology, self-report studies are a means to obtain prevalence estimates
of rule transgressions, violations of the law, and so on. In surveys individuals
are interviewed about their behaviour. An obvious problem is, of course,
that for reasons such as social desirability people do not always answer honestly
about their behaviour.
For this reason about forty years ago randomized response was introduced to<br />
collect data about sensitive issues. Our research group has worked in this area for<br />
about 10 years and I will give an overview of our results. The results are:<br />
• a “best practice” for asking randomized response questions
• a meta-analysis showing that randomized response is the most valid method for
answering questions about sensitive topics
• accommodating existing models for multivariate data, such as logistic regression,
item response theory, and models for count data, so that they can handle
randomized response data
• accommodating these models for the potential presence of respondents who do not
follow the randomized response design.
We present these results and illustrate them using surveys into social benefit fraud
that we conducted for the Ministry of Social Affairs on a two-yearly basis
from 1998 to 2006.
− 158 −
Clustering Consumers with Respect to Their<br />
Marketing Reactance Behavior<br />
Ralf Wagner and Erik Sauerwald<br />
SVI Chair for International Direct Marketing<br />
DMCC - Dialog Marketing Competence Center<br />
University of Kassel, Germany<br />
rwagner@wirtschaft.uni-kassel.de<br />
erik sauerwald@arcor.de<br />
Abstract. The recent paradigm shift in modern marketing practices (Coviello et<br />
al. (2002), Vargo & Lusch (2004)) in concurrence with the increasing popularity of<br />
digital marketing measures (Wagner & Meißner (forthcoming)) adds a new quality
to the discussion of marketing intrusiveness (Morimoto & Chang (2006)).<br />
Despite the comprehensive research on international differences in media usage
(e.g., Krafft et al. (2007)), related previous research frequently neglects the cultural
differences in recipients’ assessment of the marketing measures. In this study we
utilize the Item Response Theory approach for an assessment of individuals’ reactance<br />
to unsolicited marketing communications. The study is based on a survey of
recipients from China, Germany, Russia, and the United States of America.
Key words: Advertising, Culture, Item Response Theory, Reactance<br />
References<br />
Coviello, N.E., Brodie, R.J., Danaher, P.J., and Johnston, W.J. (2002): How Firms<br />
Relate to Their Markets: An Empirical Examination of Contemporary Marketing<br />
Practices. Journal of Marketing, 66, 33–46.<br />
Krafft, M., Hesse, J., Höfling, J., Peters, K. and Rinas, D. (2007): International Direct
Marketing. Principles, Best Practices, Marketing Facts. Springer, Berlin.<br />
Morimoto, M. and Chang, S. (2006): Consumers’ Attitudes toward Unsolicited Commercial<br />
E-mail and Postal Direct Mail Marketing Methods: Intrusiveness, Perceived<br />
Loss of Control, and Irritation. Journal of Interactive Advertising, 7,<br />
8–20.<br />
Vargo, S.L. and Lusch, R.F. (2004): Evolving to a New Dominant Logic for Marketing.<br />
Journal of Marketing, 68, 1–17.<br />
Wagner, R. and Meißner, M. (forthcoming): Multimedia for Direct Marketing. In:<br />
M. Pagani (Ed.): Encyclopedia of Multimedia Technology and Networking, 2nd
Edition. Idea Publishing, Hershey.
− 159 −
Supervised Self-Organising Maps and More<br />
Ron Wehrens<br />
IMM, Analytical Chemistry<br />
P.O. Box 9010, 6500 GL Nijmegen<br />
The Netherlands<br />
r.wehrens@science.ru.nl<br />
Abstract. Self-organising maps (SOMs) have been applied in many different areas<br />
of science. In a typical application, large numbers of objects (thousands or more) are<br />
mapped to a two-dimensional grid in such a way that very similar objects end up in<br />
the same area. If several different types of information are available, one can combine<br />
these in one feature vector, used to determine the similarity with each of the map<br />
units, but this presents scaling difficulties. We have, e.g., mapped several thousand<br />
steroid crystal structures from the Cambridge Crystallographic Database, based<br />
on their diffraction patterns and a specific distance function. For these structures,<br />
several other types of information are available as well, such as space group and cell<br />
volume.<br />
To take extra information into account, we have extended the basic principle
of SOMs to accommodate extra layers, one for each type of feature vector [1]. The
closest unit is then determined by summing distances per layer, where each layer<br />
can be assigned a weight. This makes it possible to perform supervised mapping:<br />
the second layer then contains the class information. The result of including class<br />
information is that classes are more likely to form contiguous units in the map. This<br />
behaviour can be enforced by choosing a larger weight for the class information. One<br />
does not have to stop at two layers: it is possible to create several layers, each layer<br />
corresponding to another type of data.
This is implemented in an R package, called “kohonen” [2], available from CRAN
(http://cran.r-project.org). Several examples will be shown highlighting the<br />
possibilities of the technique.<br />
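The layer-weighted winner search can be sketched as follows (a plain Python sketch of the idea, not the kohonen package's actual implementation):

```python
def closest_unit(units, obj, weights):
    # Each unit and each object carry one feature vector per layer.
    # The winning unit minimises the weighted sum of per-layer
    # Euclidean distances; with weights like (0, 1) only the second
    # (e.g. class) layer decides, giving supervised mapping.
    def dist(unit):
        total = 0.0
        for w, u_layer, x_layer in zip(weights, unit, obj):
            d = sum((a - b) ** 2 for a, b in zip(u_layer, x_layer)) ** 0.5
            total += w * d
        return total
    return min(range(len(units)), key=lambda i: dist(units[i]))
```

Raising the weight of the class layer makes classes more likely to occupy contiguous regions of the map, as described above.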
Key words: Self-organising maps, Data fusion, Supervised mapping<br />
References<br />
1. W.J. Melssen, R. Wehrens, and L.M.C. Buydens. Supervised Kohonen networks
for classification problems. Chemom. Intell. Lab. Syst., 83:99–113, 2006.<br />
2. R. Wehrens and L.M.C. Buydens. Self- and super-organising maps in R: the
kohonen package. Journal of Statistical Software, 21(5), 2007.
− 160 −
Multi-Item Versus Single-Item Measures:<br />
A Review and Future Research Directions<br />
Petra Wilczynski and Marko Sarstedt<br />
Institute for Market-based Management, Munich School of Management,<br />
D-80539 Munich, Germany
wilczynski@bwl.lmu.de
Abstract. With their widely discussed Journal of Marketing Research article,<br />
Bergkvist and Rossiter (2007) resume a long-lasting interdisciplinary discussion on<br />
the benefits and limitations of multi-item versus single-item measures. Whereas<br />
multi-item measures of theoretical constructs have been the norm in marketing research<br />
for over 20 years, practitioners seem to favour single-item measures on the
practical grounds of minimizing non-response and costs. This practice is often
seen as a fatal error, because single-item measures are believed to be unreliable
and invalid. During the last decades several studies appeared in different disciplines<br />
such as social sciences, marketing or psychology that critically compare these two<br />
approaches, yielding sometimes contradictory results in terms of validity or reliability.<br />
Thus, the objective of this paper is to develop an integrated overview of the<br />
present status of research in this field, taking into account various disciplines.
In doing so, advantages and disadvantages, analytical approaches, and results are
compared and critically evaluated. The findings suggest several areas for future research
in this important field, which is necessary to close the gap between theoretical<br />
and practical requirements.<br />
Key words: Single Item, Multi Item, Scale Development<br />
References<br />
Bergkvist, L. and Rossiter, J.R. (2007): The Predictive Validity of Multiple-Item<br />
Versus Single-Item Measures of the Same Constructs. Journal of Marketing<br />
Research, 44, 175–184.<br />
Drolet, A.L. and Morrison, D.G. (2001): Do We Really Need Multiple-Item Measures<br />
in Service Research? Journal of Service Research, 3, 196–204.<br />
Wanous, J.P., Reichers, A.E. and Hudy, M.J. (1997): Overall Job Satisfaction:
How Good Are Single-Item Measures? Journal of Applied Psychology, 82,<br />
247–252.<br />
− 161 −
Management and methods: How to do market<br />
segmentation projects<br />
Raimund Wildner<br />
GfK Group<br />
Nürnberg, Germany<br />
Abstract. Market segmentation projects are often strategic projects with top management<br />
attention and high budgets. Nevertheless, many of them fail. This can be
due to poor methodology as well as to poor management.
From the management perspective it is essential that the objectives of the segmentation
are clear from the beginning. Product development, media advertising,
or sales are possible objectives and each of them requires specific variables as<br />
an input. Furthermore it is essential that all stakeholders of a segmentation project<br />
are involved from the beginning. During the segmentation project a close cooperation<br />
between marketing experts, market research experts, and statistics experts is
necessary. Special problems arise in international segmentation projects. Finally it<br />
is important to sell the segmentation within the organization through workshops, leaflets and
other instruments that help to get a clear picture of the segments.<br />
From a methodological standpoint it is important that the result is stable so it<br />
can be reproduced in other data sets as well. A test for stability will be shown. Outliers
can cause instability, so a special method to identify them will be shown. Faked
interviews have to be excluded from the segmentation. A procedure for detection of<br />
faked interviews will be discussed. Finally the cluster procedure that proved to be<br />
superior in practical terms is discussed.<br />
− 162 −
Clustering with Repulsive Prototypes<br />
Roland Winkler 1 , Frank Rehm 2 , and Rudolf Kruse 3<br />
1 German Aerospace Center, Braunschweig roland.winkler@dlr.de<br />
2 German Aerospace Center, Braunschweig frank.rehm@dlr.de<br />
3 Otto von Guericke University, Magdeburg kruse@iws.cs.uni-magdeburg.de<br />
Abstract. Although there is no exact definition of the term cluster, in the 2D
case it is fairly easy for human beings to decide which objects belong together. For
machines, on the other hand, it is hard to determine which objects form a cluster.
Depending on the problem, the success of a clustering algorithm depends on its
creators’ idea of what a cluster should be. Likewise, each clustering
algorithm comprises a characteristic idea of the term cluster. For example the fuzzy<br />
c-means algorithm tends to find spherical clusters with equal numbers of objects.<br />
Noise clustering focuses on finding spherical clusters of user-defined diameter.<br />
If a certain amount of knowledge is available about how clusters are shaped, it
is possible to include more information into a clustering algorithm. In this paper, we<br />
present an extension to noise clustering that tries to maximize the distances between<br />
prototypes. For that purpose, the prototypes behave like repulsive magnets that<br />
have an inertia depending on their sum of membership values. Using this repulsive<br />
extension, it is possible to prevent groups of objects from being divided into more
than one cluster. Due to the repulsion and inertia, it is also possible to determine
the number of clusters in a data set. Roughly speaking, having information about<br />
cluster shapes (i.e. the diameter) may help to cope with the absence of knowledge<br />
concerning the exact number of clusters.<br />
The results of repulsive clustering can be used as an initialization for other clustering<br />
techniques. We successfully applied this method to air traffic management tasks.
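For reference, plain fuzzy c-means (Bezdek (1981)) is the algorithm the proposed method builds on; the repulsion and inertia terms themselves are not reproduced here, and this 1-D sketch is ours:

```python
import random

def fuzzy_c_means(points, c, m=2.0, iters=100, seed=0):
    # Standard fuzzy c-means on 1-D data.  The repulsive extension
    # described above would add a force pushing the prototypes
    # (centers) away from each other during the update step.
    rnd = random.Random(seed)
    centers = rnd.sample(points, c)
    for _ in range(iters):
        # Membership update: u[k][i] from relative distances.
        u = []
        for x in points:
            d = [abs(x - v) + 1e-12 for v in centers]
            u.append([1.0 / sum((d[i] / d[j]) ** (2.0 / (m - 1.0))
                                for j in range(c)) for i in range(c)])
        # Prototype update: fuzzy weighted mean of the data.
        centers = [
            sum(u[k][i] ** m * points[k] for k in range(len(points))) /
            sum(u[k][i] ** m for k in range(len(points)))
            for i in range(c)
        ]
    return sorted(centers)
```

The repulsive variant would modify the prototype update so that nearby prototypes repel each other, which is what prevents one group of objects from being split across clusters.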
Key words: fuzzy clustering, cluster shapes, cluster validity, air traffic management<br />
References<br />
Bezdek, J.C.(1981): Pattern recognition with fuzzy objective function algorithms.<br />
Plenum Press, New York.<br />
Dave, R.N. and Sen, S. (1998): Generalized noise clustering as a robust fuzzy
c-M-estimators model. 17th Annual Conference of the North American Fuzzy
Information Processing Society (NAFIPS-98), Pensacola Beach, Florida, 256–260.
Rehm, F., Klawonn, F., Kruse, R.(2007): A novel approach to noise clustering for<br />
outlier detection. Soft Computing - A Fusion of Foundations, Methodologies and<br />
Applications, Berlin/Heidelberg, Vol. 11, No 5, 489-494.<br />
− 163 −
On the Effects of Enhanced Selection Models<br />
on Quality and Comparability of Classifiers<br />
Produced by Genetic Programming<br />
Stephan Winkler, Michael Affenzeller, Stefan Wagner, and Gabriel<br />
Kronberger<br />
Fachhochschule Oberösterreich, Research Center Hagenberg<br />
{swinkler,maffenze,swagner,gkronber}@heuristiclab.com<br />
Abstract. The use of genetic programming (GP) in machine learning enables the<br />
automated search for classification models that are evolved by an evolutionary process<br />
using the principles of selection, crossover and mutation. The use of enhanced<br />
selection models in GP ([1], [2]) is able to significantly increase the quality of classifiers<br />
produced by GP; detailed analysis can be found for example in [3].<br />
Algorithmic reliability can be assessed by comparing the results produced by repeated
runs of a machine learning algorithm; due to the stochastic element that is intrinsic to any
evolutionary process, GP cannot guarantee the generation of identical or even similar
models in each GP process execution. In [4] we have presented a method for
comparing time series models produced by GP; in this paper we analyze the classifiers
returned by GP based machine learning for medical benchmark data sets (taken<br />
from the UCI repository). We mainly focus on comparing standard GP techniques<br />
to those using enhanced selection models with respect to results similarity analysis.<br />
The effects of pruning mechanisms (applied to the final results) are also discussed.<br />
Key words: Evolutionary Learning, Genetic Programming, Results Comparability<br />
References<br />
[1] Affenzeller, M. and Wagner, S. (2005): Offspring selection: A new self-adaptive
selection scheme for genetic algorithms. Adaptive and Natural Computing Algorithms,
218–221.
[2] Wagner, S. and Affenzeller, M. (2005): SexualGA: Gender-Specific Selection for
Genetic Algorithms. Proceedings of the 9th World Multi-Conference on Systemics,
Cybernetics and Informatics (WMSCI) 2005, 4: 76–81.
[3] Winkler, S., Affenzeller, M. and Wagner, S. (2007): Advanced genetic programming
based machine learning. Journal of Mathematical Modelling and Algorithms,
6(3): 455–480.
[4] Winkler, S., Affenzeller, M. and Wagner, S. (2008): On the Reliability of Nonlinear
Modeling Using Enhanced Genetic Programming Techniques. Proceedings of
the Chaotic Modeling and Simulation International Conference CHAOS 2008.
− 164 −
Analysis of massive emigration from Poland –<br />
the model–based clustering approach<br />
Ewa Witek<br />
Department of Statistics,<br />
Katowice University of Economics, Bogucicka 14, 40–226 Katowice<br />
ewitek@ekonom.ae.katowice.pl<br />
Abstract. The model–based approach assumes that the data are generated by a finite
mixture of underlying probability distributions, such as the multivariate normal distribution.
In finite mixture models, each component of the probability distribution corresponds
to a cluster. The problem of determining the number of clusters and choosing
an appropriate clustering method becomes a statistical model choice problem. Hence,
the model–based approach provides a key advantage over heuristic clustering algorithms:
it selects both the correct model and the number of clusters.
Model–based clustering has shown promise in a number of practical applications,<br />
including tissue segmentation, character recognition, minefield and seismic fault detection<br />
and classification of astronomical data. The article presents an application
of model–based clustering in economic analysis, where it is still comparatively rare.
The moment Poland joined the EU, its citizens rushed out of the country. Since<br />
1 May 2004 Poland has been facing the problem of increased emigration. We used<br />
the model–based clustering approach for grouping and for detecting inhomogeneities
among Polish emigrants from different regions of Poland.
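The BIC named in the key words, in the convention of Fraley and Raftery (2002) where larger values are better (the function name and the toy numbers below are illustrative):

```python
import math

def bic(loglik, n_params, n_obs):
    # BIC in the model-based-clustering convention of Fraley and
    # Raftery (2002): 2*loglik minus a penalty of (number of free
    # parameters) * log(number of observations).  The candidate
    # (model, number of clusters) with the largest BIC is selected.
    return 2.0 * loglik - n_params * math.log(n_obs)
```

Adding mixture components always raises the log-likelihood, so a model with more clusters is chosen only when the gain outweighs the log(n) penalty on its extra parameters.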
Key words: Model–based clustering, EM algorithm, BIC<br />
References<br />
Fraley, C. and Raftery, A.E. (2002): Model–based clustering, discriminant analysis,<br />
and density estimation. Journal of the American Statistical Association, 97,<br />
611–631.<br />
McLachlan, G.J. and Peel, D. (2000): Finite Mixture Models. Wiley, New York.
− 165 −
Image Based Mail Piece Identification using<br />
Unsupervised Learning<br />
Katja Worm 1 and Beate Meffert 2<br />
1 Siemens ElectroCom Postautomation GmbH, Rudower Chaussee 29, 12489<br />
Berlin, Germany katja.worm@siemens.com<br />
2 Humboldt Universität zu Berlin, Rudower Chaussee 25, 12489 Berlin, Germany<br />
meffert@informatik.hu-berlin.de<br />
Abstract. Based on the uniqueness of a mail piece surface, postal sorting machines<br />
use mail piece image characteristics to reuse once-extracted mail piece addresses in
different sorting steps. During the first sorting step mail piece image characteristics<br />
are extracted and stored together with the target address in a database. In subsequent
sorting steps the mail piece address is accessed by determining the corresponding
mail piece characteristics in the database. In a previous work appropriate mail piece<br />
image characteristics and procedures for their distance measurement were presented.<br />
Image based mail piece identification is complicated by a constantly changing<br />
and unknown mail piece spectrum as well as by the differentiation of nearly identical
mass mailings. In particular, the rejection of unknown mail pieces requires the definition
of specific rejection classes depending on the current mail piece spectrum.<br />
In this paper we present an approach to distance-based mail piece identification
using a two-stage classification process. The different handling of mass and collection
mail is facilitated by an unsupervised learning process, performed beforehand, which
clusters similar mail piece characteristics. Based on these clusters, a specific rejection
class can be estimated within each cluster. The first step in the identification process
is the determination of the corresponding cluster for a given mail piece. Based on<br />
the cluster specific rejection class the mail piece can be either identified or rejected.<br />
Experimental results obtained on real-world data sets show the applicability of the<br />
proposed method.<br />
References<br />
Worm, K. and Meffert, B. (2008): Robust Image Based Document Comparison Using
Attributed Relational Graphs. Proceedings of the International Conference on
Signal Processing, Pattern Recognition and Applications (SPPRA), accepted.
Key words: Document identification, Unsupervised learning, Minimum distance
classification
− 166 −
Factor Analysis of Incomplete Disjunctive<br />
Tables<br />
Amaya Zárraga 1 and Beatriz Goitisolo 1<br />
Departamento de Economía Aplicada III. UPV/EHU. Bilbao. Spain<br />
Amaya.Zarraga@ehu.es and Beatriz.Goitisolo@ehu.es<br />
Abstract. Multiple Correspondence Analysis (MCA) studies the relationship between<br />
several categorical variables defined with respect to a certain population.<br />
However, one of the main sources of information are surveys, in which it is
usual to find a certain amount of missing data as well as conditioned questions that do not
need to be answered by the whole population. In these cases, codifying the data in
a complete disjunctive table requires the inclusion of non-answer categories that can
alter the results. For example, the χ² distance between two row profiles increases
with the common answers when the individuals do not answer the same number
of questions. And in the χ² distance between two column profiles, each individual
could have a different weight according to the number of answers previously chosen.
Therefore, the direct application of standard MCA is not appropriate for the study
of an incomplete disjunctive table (IDT). We propose to analyse the incomplete
disjunctive table by substituting a suitable imposed marginal for the real marginal
of the table over the individuals.
Key words: Multiple Correspondence Analysis, Complete Disjunctive Tables, Incomplete<br />
Disjunctive Tables<br />
References<br />
Zárraga, A. and Goitisolo, B. (1999): Independence Between Questions in the Factor
Analysis of Incomplete Disjunctive Tables with Conditioned Questions.
Qüestiió, 23(3), 465–488.
Zárraga, A. and Goitisolo, B. (2008): Factorial Analysis of a Set of Contingency Tables.
In: C. Preisach, H. Burkhardt, L. Schmidt-Thieme and R. Decker (Eds.):
Data Analysis, Machine Learning and Applications. Studies in Classification,
Data Analysis, and Knowledge Organization. Proceedings of the 31st
Annual Conference of the Gesellschaft für Klassifikation e.V., Albert-Ludwigs-
Universität Freiburg, March 7-9, 2007. Springer, Berlin, forthcoming.
− 167 −
Recursive Partitioning of Economic<br />
Regressions: Trees of Costly Journals and<br />
Beautiful Professors<br />
Achim Zeileis 1 and Christian Kleiber 2<br />
1 Department of Statistics and Mathematics, Wirtschaftsuniversität Wien
Achim.Zeileis@wu-wien.ac.at
2 Wirtschaftswissenschaftliches Zentrum, Universität Basel<br />
Christian.Kleiber@unibas.ch<br />
Abstract. The linear regression model is the workhorse for empirical economic<br />
analyses. For a wide variety of standard analysis problems, there are useful specifications<br />
of linear regression models, validated by economic theory and prior successful<br />
empirical studies. However, in non-standard problems or in situations where data on<br />
additional variables is available, a useful specification of a regression model involving<br />
all variables of interest might not be available. Here, we explore how recursive<br />
partitioning techniques can be used in such situations for modeling the relationship<br />
between the dependent variable and the available regressors. Linear regression is<br />
embedded into the model-based recursive partitioning framework of Zeileis et al.<br />
(2008). The resulting regression trees are grown by recursively applying techniques
for testing and dating structural changes in linear regressions. They are compared<br />
to classical modeling approaches in two empirical applications: Following Stock and<br />
Watson (2007), the demand for economic journals (Bergstrom, 2001) is investigated.<br />
Furthermore, the impact of professors’ beauty on their class evaluations (Hamermesh<br />
and Parker, 2005) is assessed.<br />
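A toy surrogate for the recursive step (an exhaustive SSE-based split search rather than the structural-change tests actually used in the model-based partitioning framework; all names are illustrative):

```python
def ols_sse(xs, ys):
    # Residual sum of squares of a simple regression of y on x.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    a = my - b * mx
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def best_split(xs, ys, z, min_size=3):
    # Choose the cut point in the partitioning variable z that most
    # reduces the total SSE of the per-node regressions of y on x;
    # return None if no admissible cut improves on the unsplit node.
    best_cut, best_sse = None, ols_sse(xs, ys)
    for cut in sorted(set(z))[:-1]:
        left = [i for i in range(len(z)) if z[i] <= cut]
        right = [i for i in range(len(z)) if z[i] > cut]
        if len(left) < min_size or len(right) < min_size:
            continue
        sse = (ols_sse([xs[i] for i in left], [ys[i] for i in left]) +
               ols_sse([xs[i] for i in right], [ys[i] for i in right]))
        if sse < best_sse:
            best_cut, best_sse = cut, sse
    return best_cut
```

Applied recursively to each resulting node, this grows a regression tree whose leaves carry their own linear models, which is the structure the abstract describes.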
Key words: Regression trees, Model-based recursive partitioning, Structural change<br />
References<br />
Bergstrom, T.C. (2001): Free Labor for Costly Journals? Journal of Economic Perspectives,<br />
15, 183–198.
Hamermesh, D.S. and Parker, A. (2005): Beauty in the Classroom: Instructors’<br />
Pulchritude and Putative Pedagogical Productivity. Economics of Education<br />
Review, 24, 369–376.<br />
Stock, J.H. and Watson, M.W. (2007): Introduction to Econometrics. 2nd edition,<br />
Addison Wesley.<br />
Zeileis, A., Hothorn, T. and Hornik, K. (2008): Model-based Recursive Partitioning.
Journal of Computational and Graphical Statistics, accepted for publication.<br />
− 168 −