
GfKl 2008<br />

German Classification Society<br />

32nd Annual Conference<br />

Advances in Data Analysis, Data Handling<br />

and Business Intelligence<br />

Joint Conference with the<br />

British Classification Society (BCS) and the<br />

Dutch/Flemish Classification Society (VOC)<br />

July 16−18, 2008<br />

Hamburg<br />

Program and<br />

Abstract Volume<br />

http://gfkl2008.hsu-hh.de


Main contact:<br />

Prof. Dr. Wilfried Seidel<br />

Helmut-Schmidt-University<br />

Holstenhofweg 85<br />

22043 Hamburg<br />

Germany<br />

+49 (0)40 6541 2315<br />

gfkl2008@hsu-hh.de<br />

− ii −


Contents<br />

Welcome, v<br />

Sponsors, vi<br />

Scientific Program Committee, vii<br />

Organizing Committee, vii<br />

Plenary and Semi-plenary Lectures, viii<br />

Invited Sessions, x<br />

Detailed Schedule, xi<br />

List of Contributions, xxviii<br />

Author Index, xxxvii<br />

Abstracts, 1<br />

− iii −


− iv −


Welcome<br />

On behalf of the Helmut-Schmidt-University Hamburg,<br />
we welcome you to GfKl 2008 − Advances in Data Analysis,<br />
Data Handling and Business Intelligence − the 32nd Annual<br />
Conference of the German Classification Society, organized<br />
in cooperation with the British Classification Society (BCS)<br />
and the Dutch/Flemish Classification Society (VOC).<br />
The conference features 13 invited lectures (3 plenary<br />
speeches and 10 semi-plenary lectures), 166 contributed<br />
talks, 4 invited sessions, and 2 workshops.<br />
We are indebted to those who suggested and supported<br />
holding GfKl 2008 in Hamburg. We are grateful to those<br />
who volunteered in the conference organization, and we<br />
acknowledge the generous financial backing of our sponsors.<br />
We wish you a very pleasant and stimulating GfKl 2008.<br />

Claudia Fantapié Altobelli<br />

Andreas Fink<br />

Hartmut Hebbel<br />

Wilfried Seidel<br />

Detlef Steuer<br />

Ulrich Tüshaus<br />

(Organizing committee,<br />

Helmut-Schmidt-University Hamburg)<br />

Berthold Lausen<br />

Alfred Ultsch<br />

(Co-chairs of the program committee)<br />

− v −


Sponsors<br />

The organizers would like to express their appreciation to the<br />

following organizations for providing financial help and other<br />

support:<br />

Deutsche Forschungsgemeinschaft<br />
Hamburg-Mannheimer Versicherungen<br />
Springer-Verlag<br />
Vattenfall<br />
Gesellschaft für Konsumforschung<br />
Hamburger Sparkasse<br />
StatSoft GmbH<br />
Volksfürsorge Deutsche Lebensversicherung AG<br />

− vi −


Scientific Program Committee<br />

H.-H. Bock (RWTH Aachen)<br />

R. Decker (Uni Bielefeld)<br />

V. Esposito Vinzi (ESSEC Paris)<br />

W. Esswein (TU Dresden)<br />

C. Fantapié Altobelli (HSU Hamburg)<br />

A. Fink (HSU Hamburg)<br />

W. Gaul (Uni Karlsruhe)<br />

H. Hebbel (HSU Hamburg)<br />

Ch. Hennig (Uni London)<br />

K. Jajuga (Wroclaw Univ. of Economics)<br />

H.-P. Klenk (DSMZ Braunschweig)<br />

B. Lausen (Uni Erlangen-Nürnberg, Co-Chair)<br />

H. Locarek-Junge (TU Dresden)<br />

F. Murtagh (Uni London)<br />

A. Okada (Uni Tokyo)<br />

L. Schmidt-Thieme (Uni Hildesheim)<br />

W. Seidel (HSU Hamburg)<br />

D. Steuer (HSU Hamburg)<br />

U. Tüshaus (HSU Hamburg)<br />

A. Ultsch (Uni Marburg, Co-Chair)<br />

M. van de Velden (Uni Rotterdam)<br />

D. van den Poel (Uni Ghent)<br />

I. van Mechelen (Uni Leuven)<br />

R. Wehrens (Uni Nijmegen)<br />

C. Weihs (Uni Dortmund)<br />

Organizing Committee<br />

Claudia Fantapié Altobelli<br />

Andreas Fink<br />

Hartmut Hebbel<br />

Wilfried Seidel<br />

Detlef Steuer<br />

Ulrich Tüshaus<br />

− vii −


Plenary and Semi-plenary Lectures<br />

Wednesday, July 16, 10:00–10:45:<br />

Walter Radermacher, President Federal Statistical Office of<br />

Germany, Wiesbaden, Germany<br />

“Statistical Processes Under Change – Enhancing Data Quality<br />

with Pretests” (Room 5)<br />

Wednesday, July 16, 11:00–11:40:<br />

Geoffrey John McLachlan, University of Queensland, Brisbane, Australia<br />

“Clustering of High-Dimensional Data Via Finite Mixture Models”<br />

(Room 5)<br />

Fred Hamprecht, University Heidelberg, Germany<br />

“Segmentation of Neural Tissue” (Room 3)<br />

Wednesday, July 16, 14:00–14:45:<br />

Bernhard Schölkopf, Max-Planck-Institute, Tübingen, Germany<br />

“Machine Learning Applications of Positive Definite Kernels”<br />

(Room 5)<br />

Thursday, July 17, 09:00–09:40:<br />

Patrick Groenen, University Rotterdam, The Netherlands<br />

“Support Vector Machines in the Primal using Majorization and<br />

Kernels” (Room 5)<br />

Gilles Bisson, La Tronche, France<br />

“Clustering of Molecules and Structured Data” (Room 3)<br />

− viii −


Thursday, July 17, 15:55–16:35:<br />

Sabine Krolak-Schwerdt, University Wuppertal, Germany<br />

“Strategies of Model Construction for the Analysis of Judgment<br />

Data” (Room 3)<br />

Gilles Celeux, INRIA, France<br />

“Choosing the Number of Clusters in the Latent Class Model”<br />

(Room 5)<br />

Friday, July 18, 09:00–09:40:<br />

Francesco Palumbo, University of Macerata, Italy<br />

“Clustering and Dimensionality Reduction to Discover Interesting<br />

Patterns in Binary Data” (Room 5)<br />

Raimund Wildner, GfK, Nürnberg, Germany<br />

“Management and Methods: How to do Market Segmentation<br />

Projects” (Room 3)<br />

Friday, July 18, 11:20–12:00:<br />

Adi Ben-Israel, Rutgers University, USA<br />

“Probabilistic Distance Clustering” (Room 5)<br />

Tadashi Imaizumi, Tama University, Tokyo, Japan<br />

“Dimensionality Reduction of Similarity Matrix” (Room 3)<br />

Friday, July 18, 13:15–14:00:<br />

Fred R. McMorris, Illinois Institute of Technology, USA<br />

“Majority-rule Consensus: From Preferences (Social Choice) to<br />

Trees (Biology and Classification Theory)” (Room 5)<br />

− ix −


Invited Sessions<br />

Wednesday, July 16, 14:50–16:05 (Room 2)<br />

VOC<br />

(Chairs: van de Velden, Wehrens)<br />

Wednesday, July 16, 16:50–18:30 (Room 2)<br />

PLS Path Modeling<br />

(Chair: Esposito Vinzi)<br />

Thursday, July 17, 09:45–11:00 (Room 2)<br />

Microarrays in Clinical Research<br />

(Chairs: Lausen, Ultsch)<br />

Thursday, July 17, 14:00–15:40 (Room 2)<br />

BCS<br />

(Chairs: Hennig, Murtagh)<br />

− x −


Detailed Schedule<br />

Tuesday July 15, 2008<br />
13:30-17:30 Pre-conference Workshop (Room 4): Lenz, Hans-J.: Data Quality: Defining, Measuring and Improving<br />
20:00 Informal Get Together (Hotel Baseler Hof, Esplanade 11)<br />

Wednesday July 16, 2008<br />
09:00-10:00 Opening Ceremony (Chair: Seidel), Room 5<br />
09:00 Welcome: Claus Weihs (President of GfKl), Wilfried Seidel (Local Organizers)<br />
09:05 Welcome: Herlind Gundelach (Senator for Science and Research, State Hamburg)<br />
09:15 Welcome: Hans Christoph Zeidler (President of Helmut-Schmidt-University Hamburg)<br />
09:30 GfKl Best Paper Award 2007: Presentation and Laudation: Claus Weihs (President of GfKl), N.N.<br />
09:50 Program Overview: Berthold Lausen, Alfred Ultsch (Co-Chairs Program Committee)<br />
10:00-10:45 Plenary Lecture (Chair: Seidel), Room 5: Radermacher, Walter: Statistical processes under change − Enhancing data quality with pretests (p. 118)<br />

− xi −<br />


10:00-18:00 Workshop: Libraries (see separate schedule), Room 403<br />
10:45-11:00 Coffee<br />
11:00-11:40 Semi-plenary Lectures:<br />
McLachlan, Geoffrey John: Clustering of High-Dimensional Data Via Finite Mixture Models (Chair: McMorris), Room 5 (p. 93)<br />
Hamprecht, Fred A. et al.: Segmentation of Neural Tissue (Chair: Wehrens), Room 3 (p. 5)<br />

Session Mixture Analysis I: Testing (Chair: Seidel), Room 3<br />
11:45-12:10 Gassiat, Elisabeth: Likelihood ratio test for general mixture models (p. 49)<br />
12:10-12:35 Holzmann, Hajo; Dannemann, Jörn: Likelihood ratio testing for hidden Markov models (p. 68)<br />
12:35-13:00 Pommeret, Denys: Testing distribution in errors in variables models (p. 113)<br />

Session Pattern Recognition and Machine Learning I (Chair: Groenen), Room 405/6<br />
11:45-12:10 Haasdonk, Bernard; Pekalska, Elzbieta: Classification with Regularized Kernel Mahalanobis-Distances (p. 57)<br />
12:10-12:35 Louw, Nelmarie; Lamont, Morne; Steel, Sarel: Identifying Atypical Cases in Kernel Fisher Discriminant Analysis by using the Smallest Enclosing Hypersphere (p. 87)<br />
12:35-13:00 Trzesiok, Michal: Relevant Importance of Predictor Variables in Support Vector Machines Models (p. 152)<br />

Session Collective Intelligence (Chair: Geyer-Schulz), Room 101/3<br />
11:45-12:10 Geyer-Schulz, Andreas; Hoser, Bettina: The Potential of Social Intelligence for Collective Intelligence (p. 51)<br />

− xii −<br />


12:10-12:35 Mylonas, Phivos; Solachidis, Vassilios; Geyer-Schulz, Andreas; Hoser, Bettina; Chapman, Sam; Ciravegna, Fabio; Staab, Steffen; Smrz, Pavel; Kompatsiaris, Yiannis; Avrithis, Yannis: Efficient Media Exploitation towards Collective Intelligence (p. 100)<br />
12:35-13:00 Solachidis, Vassilios; Mylonas, Phivos; Geyer-Schulz, Andreas; Hoser, Bettina; Chapman, Sam; Ciravegna, Fabio; Staab, Steffen; Contopoulos, Costis; Gkika, Ioanna; Smrz, Pavel; Kompatsiaris, Yiannis; Avrithis, Yannis: Generating Collective Intelligence (p. 137)<br />

Session Evaluation of Clustering Algorithms and Data Structures (Chair: Leisch), Room 2<br />
11:45-12:10 Kaiser, Sebastian; Leisch, Friedrich: Benchmarking Bicluster Algorithms (p. 75)<br />
12:10-12:35 Bade, Korinna; Benz, Dominik: Evaluation Strategies for Learning Algorithms of Hierarchical Structures (p. 8)<br />
12:35-13:00 Grün, Bettina; Leisch, Friedrich: Model diagnostics of finite mixtures using bootstrapping (p. 56)<br />

Session Genome and DNA Analysis (Chair: Klenk), Room 6<br />
11:45-12:10 Klenk, Hans-Peter: Polyphasic genomic approach for the taxonomy of archaea and bacteria (p. 78)<br />
12:10-12:35 Huson, Daniel H.; Rupp, Regula: Using Cluster Networks to Represent Non-Compatible Sets of Clusters (p. 71)<br />
12:35-13:00 Hütt, Marc-Thorsten: Genome phylogeny based on short-range correlations in DNA sequences (p. 72)<br />

− xiii −<br />


Session Market Risk and Credit Risk (Chair: Locarek-Junge), Room 4<br />
11:45-12:10 Bravo, Cristian; Maldonado, Sebastian; Weber, Richard: Practical experiences from Credit Scoring projects for Chilean financial organizations (p. 21)<br />
12:10-12:35 Kuziak, Katarzyna: An application of copula functions to market risk management (p. 83)<br />
12:35-13:00 Rokita, Pawel; Piontek, Krzysztof: Extreme unconditional dependence vs. multivariate GARCH effect in the analysis of dependence between high losses on Polish and German stock indexes (p. 121)<br />

13:00-14:00 Lunch (and Meetings)<br />
14:00-14:45 Plenary Lecture (Chair: Lausen), Room 5: Schölkopf, Bernhard: Machine Learning applications of positive definite kernels (p. 132)<br />

Session Mixture Analysis II: Clustering and Classification (Chair: Montanari), Room 3<br />
14:50-15:15 Pons, Odile: Classification with an increasing number of components (p. 114)<br />
15:15-15:40 Lukociene, Olga; Vermunt, Jeroen K.: Determining the number of components in mixture models for hierarchical data (p. 90)<br />
15:40-16:05 Calò, Daniela G.; Viroli, Cinzia: Visualizing data in Gaussian mixture model classification (p. 24)<br />
16:05-16:30 Latouche, Pierre J.; Ambroise, Christophe; Birmelé, Etienne: Bayesian Methods for Graph Clustering (p. 85)<br />

Session Pattern Recognition and Machine Learning II (Chair: Nalbantov), Room 405/6<br />
14:50-15:15 Stecking, Ralf; Schebesch, Klaus B.: Generating Fictitious Training Data for Credit Client Classification (p. 140)<br />
15:15-15:40 Huellermeier, Eyke; Vanderlooy, Stijn: Combining Predictions in Pairwise Classification: An Adaptive Voting Strategy and Its Relation to Weighted Voting (p. 70)<br />
15:40-16:05 Hühn, Jens; Hüllermeier, Eyke: Rule-Based Learning of Reliable Classifiers (p. 69)<br />

− xiv −<br />


Session Invited Session: VOC (Chairs: van de Velden, Wehrens), Room 2<br />
14:50-15:15 Timmerman, Marieke E.; Lichtwarck-Aschoff, Anna; Ceulemans, Eva: Multilevel Simultaneous Component Analysis for Studying Inter-individual and Intra-individual Variabilities (p. 149)<br />
15:15-15:40 van der Heijden, Peter G.M.: Estimating the prevalence of rule transgression (p. 158)<br />
15:40-16:05 van der Ark, Andries L.; Straat, J. Hendrik: Selection of items for tests and questionnaires using Mokken scale analysis (p. 157)<br />

Session Ensemble Methods and Other Subjects (Chair: Boulesteix), Room 6<br />
14:50-15:15 Häberle, Lothar: On classification of species of representation rings (p. 58)<br />
15:15-15:40 Strobl, Carolin; Zeileis, Achim: A New, Conditional Variable Importance Measure for Random Forests (p. 143)<br />
15:40-16:05 Potapov, Sergej; Lausen, Berthold: Bagging with different split criteria (p. 115)<br />
16:05-16:30 Adler, Werner; Brenning, Alexander; Lausen, Berthold: Classification of Paired Data Using Ensemble Methods (p. 4)<br />

Session Market Risk and Credit Risk (Chair: Locarek-Junge), Room 4<br />
14:50-15:15 Piontek, Krzysztof: The Analysis of the power for some chosen VaR backtesting procedures - simulation approach (p. 112)<br />
15:15-15:40 Koralun-Bereznicka, Julia: Multivariate comparative analysis of stock exchanges - the European perspective (p. 81)<br />
15:40-16:05 Dias, José G.; Vermunt, Jeroen K.; Ramos, Sofia: Mixture Hidden Markov Models in Finance Research (p. 33)<br />
16:05-16:30 Sardet, Laure; Patilea, Valentin: Beta-kernel density estimation using mixture-based transformations: an application to claims distribution (p. 125)<br />

16:30-16:50 Coffee<br />

− xv −<br />


Session Mixture Analysis III: Model Fitting, Estimation and Applications (Chair: McLachlan), Room 3<br />
16:50-17:15 Greselin, Francesca; Ingrassia, Salvatore: A note on constrained EM algorithms for mixtures of elliptical distributions (p. 53)<br />
17:15-17:40 Neykov, Neyko; Filzmoser, Peter; Neytchev, Plamen: Robust fitting of mixtures: The approach based on the Trimmed Likelihood Estimator (p. 103)<br />
17:40-18:05 Garel, Bernard; Boucharel, Julien; Dewitte, Boris; du Penhoat, Yves: Non-Gaussian nature of ENSO signals and climate shifts: implications for regional studies off the western coast of South America (p. 48)<br />
18:05-18:30 Schlattmann, Peter: Comparison of four estimators of the heterogeneity variance for meta-analysis (p. 131)<br />
18:30-18:55 Schiffner, Julia; Weihs, Claus: Localized Classification Using Mixture Models (p. 130)<br />

Session Pattern Recognition and Machine Learning III (Chair: Hüllermeier), Room 405/6<br />
16:50-17:15 Wehrens, Ron: Supervised Self-Organising Maps and More (p. 160)<br />
17:15-17:40 Worm, Katja; Meffert, Beate: Image Based Mail Piece Identification using Unsupervised Learning (p. 166)<br />
17:40-18:05 Barbosa, Rui Pedro; Belo, Orlando: Autonomous Forex Trading Agents (p. 9)<br />
18:05-18:30 Chiou, Hua-Kai; Huang, Yong-Ting; Liu, Gia-Shie: Applying Rough Set Theory to Constructing Knowledge Base for Critical Military Commodity Management (p. 27)<br />

Session Invited Session: PLS Path Modeling (Chair: Esposito Vinzi), Room 2<br />
16:50-17:15 Ringle, Christian: FIMIX-PLS Segmentation of Data for Path Models with Multiple Endogenous LVs (p. 120)<br />
17:15-17:40 Trinchera, L.; Esposito Vinzi, Vincenzo: A Comprehensive Partial Least Squares Approach to Component-Based Structural Equation Modeling (p. 151)<br />
17:40-18:05 Henseler, Jörg: Nonlinear Effects in PLS Path Models: A Comparison of Available Approaches (p. 63)<br />

− xvi −<br />


18:05-18:30 Betzin, Jörg: Categorical Data in PLS Path Modeling (p. 14)<br />

Session Microarray Data Analysis (Chair: Benner), Room 6<br />
16:50-17:15 Boulesteix, Anne-Laure; Slawski, Martin: On optimistic bias in reporting microarray-based classification accuracy (p. 20)<br />
17:15-17:40 Slawski, Martin; Boulesteix, Anne-Laure; Daumer, Martin: 'CMA' - Steps in developing a comprehensive R-toolbox for classification with microarray data and other high-dimensional problems (p. 136)<br />
17:40-18:05 Scharl, Theresa; Leisch, Friedrich: Quality-Based Clustering of Functional Data: Applications to Time Course Microarray Data (p. 127)<br />
18:05-18:30 Martin-Magniette, Marie-Laure; Mary-Huard, Tristan; Bérard, Caroline; Robin, Stéphane: ChIPmix: Mixture model of regressions for ChIP-chip experiment analysis (p. 92)<br />

Session Investments and Capital Markets (Chair: Ultsch), Room 4<br />
16:50-17:15 Locarek-Junge, Hermann; Mihm, Max: Fundamental Indexation - testing the concept in the German stock market (p. 86)<br />
17:15-17:40 Ultsch, Alfred: Is log ratio a good value for measuring return in stock investments? (p. 154)<br />
17:40-18:05 Klein, Christian; Kundisch, Dennis: Index-Based Investment Vehicles - A Comparative Study for the German DAX (p. 77)<br />
18:05-18:30 Bessler, Wolfgang; Holler, Julian: Hedge Funds in a Bayesian Asset Allocation Framework: Incorporating Information on market states and manager's ability (p. 13)<br />

19:00 Reception (Building “M1”)<br />

− xvii −<br />


Thursday July 17, 2008<br />
09:00-09:40 Semi-plenary Lectures:<br />
Groenen, Patrick J.F. et al.: Support Vector Machines in the Primal using Majorization and Kernels (Chair: Okada), Room 5 (p. 54)<br />
Bisson, Gilles: Clustering of molecules and structured data (Chair: Hennig), Room 3 (p. 16)<br />
09:45-17:00 Workshop: Decimal Classification (see separate schedule), Room 403<br />

Session Clustering and Classification I (Chair: Bock), Room 3<br />
9:45-10:10 Godehardt, Erhard; Jaworski, Jerzy; Rybarczyk, Katarzyna: Isolated vertices in random intersection graphs (p. 52)<br />
10:10-10:35 Rozmus, Dorota: Cluster ensemble based on co-occurrence data (p. 123)<br />
10:35-11:00 Enyukov, Igor: Regression-autoregression based clustering (p. 37)<br />

Session Bayesian, Neural, and Fuzzy Clustering I (Chair: Kruse), Room 405/6<br />
9:45-10:10 Borgelt, Christian: Weighting and Selecting Features in Fuzzy Clustering (p. 18)<br />
10:10-10:35 Neumann, Anneke; Ambrosi, Klaus; Hahne, Felix: Approach for Dynamic Problems in Clustering (p. 102)<br />
10:35-11:00 Winkler, Roland; Rehm, Frank; Kruse, Rudolf: Clustering with Repulsive Prototypes (p. 163)<br />

Session Invited Session: Microarrays in Clinical Research (Chairs: Lausen, Ultsch), Room 2<br />
9:45-10:25 Ultsch, Alfred: Comparison of Algorithms to find differentially expressed Genes in Microarray Data (p. 153)<br />
10:25-11:00 Hielscher, Thomas; Zucknick, Manuela; Werft, Wiebke; Benner, Axel: On the prognostic value of gene expression signatures for censored data (p. 67)<br />

− xviii −<br />


Session Statistical Musicology I (Chair: Weihs), Room 6<br />
9:45-10:10 Eigenfeldt, Arne; Kapur, Ajay: Multimodal Performance Analysis of Electronic Sitar (p. 35)<br />
10:10-10:35 Sommer, Katrin; Weihs, Claus: Analysis of polyphonic musical time series (p. 138)<br />
10:35-11:00 Desmet, Frank Michel; Leman, Marc; Lesaffre, Micheline: Statistical analysis of human body movement and group interactions in response to music (p. 32)<br />

Session Marketing and Management Science I (Chair: van den Poel), Room 4<br />
9:45-10:10 Wagner, Ralf; Sauerwald, Erik: Clustering Consumers with Respect to Their Marketing Reactance Behavior (p. 159)<br />
10:10-10:35 Sagan, Adam; Kowalska-Musial, Magdalena: Dyadic Interactions in Service Encounter - Bayesian SEM Approach (p. 124)<br />
10:35-11:00 Becker, Niels; Werners, Brigitte: Improving Product Line Design with Bundling (p. 10)<br />

11:00-11:20 Coffee<br />

Session Clustering and Classification II (Chair: Vichi), Room 3<br />
11:20-11:45 Nugent, Rebecca; Stuetzle, Werner: Cluster Tree Estimation using a Generalized Single Linkage Method (p. 104)<br />
11:45-12:10 Herrmann, Lutz; Ultsch, Alfred: Strengths and Weaknesses of Ant Colony Clustering (p. 65)<br />
12:15-12:45 Software Presentation: Eichenberg, Thilo (StatSoft): STATISTICA, Room 3<br />

Session Bayesian, Neural, and Fuzzy Clustering II (Chair: Kruse), Room 405/6<br />
11:20-11:45 Gabriel, Thomas R.; Thiel, Kilian; Berthold, Michael R.: Multi-Dimensional Scaling applied to Hierarchical Fuzzy Rule Systems (p. 45)<br />
11:45-12:10 Fritsch, Arno; Ickstadt, Katja: An Improved Criterion for Clustering Based on the Posterior Similarity Matrix (p. 43)<br />
12:10-12:35 Steinbrecher, Matthias; Kruse, Rudolf: Clustering Association Rules with Fuzzy Concepts (p. 141)<br />

− xix −<br />


Session Text Mining (Chair: Schmidt-Thieme), Room 101/3<br />
11:20-11:45 Karatzoglou, Alexandros; Feinerer, Ingo; Hornik, Kurt: Nonparametric distribution analysis for text mining (p. 76)<br />
11:45-12:10 Schierle, Martin; Trabold, Daniel: Multilingual knowledge based concept recognition in textual data (p. 128)<br />
12:10-12:35 Hermes, Jürgen; Schwiebert, Stephan: Classification of text processing components: The Tesla Role System (p. 64)<br />
12:35-13:00 Thorleuchter, Dirk: Mining ideas from textual information (p. 147)<br />

Session Modelling Exchange from Archaeological Evidence (Chair: Kerig), Room 2<br />
11:20-11:45 Schyle, Daniel: The Late Neolithic flint axe production on the Lousberg (Aachen, Germany) – An extrapolation of supply and demand and population density (p. 134)<br />
11:45-12:10 Dolata, Jens; Mucha, Hans-Joachim; Bartel, Hans-Georg: Mapping Findspots of Roman Military Brickstamps in Mogontiacum (Mainz) and Archaeometrical Analysis (p. 34)<br />
12:10-12:35 Herzog, Irmela: Reconstructing Central Places and Settlements Groups (p. 66)<br />

Session Statistical Musicology II (Chair: Weihs), Room 6<br />
11:20-11:45 Meyer, Florian; Ultsch, Alfred: Finding Music Fads by clustering Online Radio Data with Emergent Self-Organizing Maps (p. 96)<br />
11:45-12:10 Lukashevich, Hanna; Dittmar, Christian; Bastuck, Christoph: Applying Statistical Models and Parametric Distance Measures for Music Similarity Search (p. 89)<br />
12:10-12:35 Fricke, Jobst P.: A statistical theory of musical consonance proved in praxis (p. 42)<br />

Session Marketing and Management Science II (Chair: Decker), Room 4<br />
11:20-11:45 Gazda, Vladimir: On a Location of the Retail Units and Equilibrium Price Determination (p. 50)<br />
11:45-12:10 Zeileis, Achim; Kleiber, Christian: Recursive Partitioning of Economic Regressions: Trees of Costly Journals and Beautiful Professors (p. 168)<br />

− xx −<br />


12:10-12:35 van de Velden, Michel; de Beuckelaer, Alain; Groenen, Patrick; Busing, Frank: Visualizing preferences using minimum variance nonmetric unfolding (p. 156)<br />
13:00-14:00 Lunch (and Meetings)<br />

Session Linguistics (Chairs: Goebl, Grzybek), Room 3<br />
14:00-14:25 Rapp, Reinhard; Zock, Michael: Automatic Dictionary Expansion Using Non-parallel Corpora (p. 119)<br />
14:25-14:50 Fenk-Oczlon, Gertraud; Fenk, August: Cross-linguistic regularities in the monosyllabic system (p. 40)<br />
14:50-15:15 Rolshoven, Jürgen: Grundzüge einer generativen Korpuslinguistik [Outlines of a generative corpus linguistics] (p. 122)<br />
15:15-15:40 Petersen, Wiebke: Lineare Kodierung multipler Vererbungshierarchien: Wiederbelebung einer antiken Klassifikationsmethode [Linear encoding of multiple inheritance hierarchies: reviving an ancient classification method] (p. 110)<br />

Session Invited Session: BCS (Chairs: Hennig, Murtagh), Room 2<br />
14:00-14:25 Dean, Nema; Nugent, Rebecca: Augmenting Model-Based Clustering with Generalized Linkage methods (p. 31)<br />
14:25-14:50 Mirkin, Boris: Deviant box and dual clusters for the analysis of conceptual contexts (p. 97)<br />
14:50-15:15 Critchley, Frank; Pires, Ana; Amado, Conceicao: Principal Axis Analysis – with HDLSS bonuses! (p. 30)<br />
15:15-15:40 Hennig, Christian; Hausdorf, Bernhard: Using cluster analysis for species delimitation (p. 62)<br />

Session Processes in Industry (Chair: Joos), Room 6<br />
14:00-14:25 Hahlweg, Cornelius; Rothe, Hendrik: Auswertung hochaufgelöster Streulichtdaten mit Methoden der multivariaten Statistik [Analysis of high-resolution scattered-light data with multivariate statistical methods] (p. 59)<br />
14:25-14:50 Raabe, Nils; Enk, Dirk; Weihs, Claus; Biermann, Dirk: Dynamic disturbances in BTA deep-hole drilling - Identification of spiralling as a regenerative effect (p. 117)<br />
14:50-15:15 Meier, René; Joos, Franz: Optimization Methods with Evolutionary Algorithms and Artificial Neural Networks (p. 95)<br />
15:15-15:40 Große, Lars; Joos, Franz: Usage of Artificial Neural Networks for Data Handling (p. 55)<br />

− xxi −<br />


Session Marketing and Management Science III (Chair: van den Poel), Room 4<br />
14:00-14:25 Lübke, Karsten; Papenhoff, Heike: Latent growth models for analyzing a multi partner reward program (p. 88)<br />
14:25-14:50 Wilczynski, Petra; Sarstedt, Marko: Multi-Item Versus Single-Item Measures: A Review and Future Research Directions (p. 161)<br />
14:50-15:15 Sommerfeld, Angela: Trust as a Key Determinant of Loyalty and its Moderators (p. 139)<br />
15:15-15:40 Kneib, Thomas; Baumgartner, Bernhard; Steiner, Winfried J.: Time-Varying Parameters in Brand Choice Models (p. 80)<br />

15:40-15:55 Coffee<br />
15:55-16:35 Semi-plenary Lectures:<br />
Celeux, Gilles Paul et al.: Choosing the number of clusters in the latent class model (Chair: Bock), Room 5 (p. 15)<br />
Krolak-Schwerdt, Sabine: Strategies of model construction for the analysis of judgement data (Chair: Decker), Room 3 (p. 82)<br />

Session Clustering and Classification III (Chair: Geyer-Schulz), Room 3<br />
16:40-17:05 Müller-Funk, Ulrich; Dlugosz, Stephan: Predictive classification trees (p. 99)<br />
17:05-17:30 Azam, Muhammad; Ostermann, Alexander; Pfeiffer, Karl-Peter: Evaluation Criteria for the Construction of Binary Classification Trees with Two or More Classes (p. 7)<br />
17:30-17:55 Gantner, Zeno; Schmidt-Thieme, Lars: Scalable and Incrementally Updated Hybrid Recommender Systems (p. 47)<br />

Session Optimization in Statistics (Chair: Ritter), Room 405/6<br />
16:40-17:05 Hansohm, Jürgen: Algorithms for Computing the Multivariate Isotonic Regression (p. 60)<br />
17:05-17:30 Schachtner, Reinhard; Pöppel, Gerhard; Lang, Elmar: Nonnegative Matrix Factorization for Binary Data to Extract Elementary Failure Maps from Wafer Test Images (p. 126)<br />

− xxii −<br />


17:30-17:55  Nalbantov, Georgi Ilkov; Groenen, Patrick J.F.; Bioch, Cor: Support Vector Machines in the Dual using Majorization and Kernels (p. 101)

Session Computational Intelligence and Metaheuristics (Chair: Fink), Room 101/3
16:40-17:05  Winkler, Stephan; Affenzeller, Michael; Wagner, Stefan; Kronberger, Gabriel: On the Effects of Enhanced Selection Models on Quality and Comparability of Classifiers Produced by Genetic Programming (p. 164)
17:05-17:30  Caserta, Marco; Lessmann, Stefan: A novel approach to construct discrete support vector machine classifiers (p. 25)
17:30-17:55  Thorleuchter, Dirk: Mining technologies in security and defense (p. 148)

Session Miscellaneous Models (Archeology) (Chair: Posluschny), Room 2
16:40-17:05  Okada, Akinori; Sakaehara, Towao: Analysis of Borrowing and Guaranteeing Relationships among Government Officials in the Eighth Century in the Old Capital of Japan (p. 106)
17:05-17:30  Gans, Ulrich-Walter; Lang, Matthias: ArcheoInf - Leistungszentrum für die digitale Unterstützung feldarchäologischer Projekte (p. 46)

Session Education and Psychology (Chair: Krolak-Schwerdt), Room 6
16:40-17:05  Fuchs, Sebastian; Sarstedt, Marko: On the Use of Student Samples in Major Marketing Research Journals. A Meta-Study (p. 44)
17:05-17:30  Strobl, Carolin; Leisch, Friedrich: Who's Afraid of Statistics? - Measurement and Predictors of Statistics Anxiety in German University Students (p. 142)
17:30-17:55  Ünlü, Ali: Mosaic Plots and Knowledge Structures (p. 155)

Session Marketing and Management Science IV (Chair: Decker), Room 4
17:05-17:30  Lam, Kar Yin; Koning, Alex J.; Franses, Philip Hans: Testing preference rankings (p. 84)
17:30-17:55  Wagner, Ralf; Klaus, Martin: Exploring the Interaction Structure of Weblogs (p. 91)


18:00-19:00  General Assembly of GfKl, Room 5

20:00        Conference Dinner (Handwerkskammer, Holstenwall 12; bus transfer 19:30)

Friday July 18, 2008

09:00-09:40  Semi-plenary Lectures
             Palumbo, Francesco: Clustering and Dimensionality Reduction to Discover Interesting Patterns in Binary Data (Chair: Ultsch), Room 5 (p. 109)
             Wildner, Raimund: Management and Methods: How to do Market Segmentation Projects (Chair: Fantapié Altobelli), Room 3 (p. 162)

Coffee

Session Clustering and Classification IV (Chair: Groenen), Room 3
10:00-10:25  Buza, Krisztian Antal; Schmidt-Thieme, Lars: Motif-based Classification of Time Series with Bayesian Networks and SVMs (p. 23)
10:25-10:50  Tomas, Amber: Issues related to the implementation of a dynamic logistic model for classifier combination (p. 150)
10:50-11:15  Oosthuizen, Surette; Steel, Sarel J.: Variable selection for kernel classifiers: a feature-to-input space approach (p. 107)

Session Visualization and Scaling Methods I (Chair: Hennig), Room 405/6
10:00-10:25  Mucha, Hans-Joachim: Clustering a Contingency Table Accompanied by Visualization (p. 98)
10:25-10:50  Bocci, Laura; Vichi, Maurizio: The K-INDSCAL Model for Heterogeneous Three-way Dissimilarity Data (p. 17)
10:50-11:15  Cortina-Borja, Mario: Extending Multivariate Planing (p. 29)


Session Exploratory Data Analysis I (Chair: Wehrens), Room 101/3
10:00-10:25  Chiou, Hua-Kai; Yuan, Benjamin J.C.; Wang, Yen-Wen: Correspondence Analysis for Exploring the Implementation of One Village One Product Programs in Taiwan (p. 28)
10:25-10:50  Cernian, Alexandra; Carstoiu, Dorin; Ionescu, Tudor: Modeling the Classification of Heterogeneous Data (p. 26)
10:50-11:15  Einbeck, Jochen; Evers, Ludger: Data compression and regression based on local principal curves (p. 36)

Session Spatial Planning I (Chair: Behnisch), Room 2
10:00-10:25  Behnisch, Martin; Ultsch, Alfred: Estimating the number of buildings in Germany (p. 11)
10:25-10:50  Thiel, Klaus: Optimal VDSL Expansion taking into Consideration of Infrastructure Restrictions and Marketing Requirements (p. 145)
10:50-11:15  Aden, Christian; Mucha, Hans-Joachim; Schmidt, Gunther; Schröder, Winfried: WaldIS - a web based reference system for the forest monitoring in North Rhine-Westphalia (p. 3)

Session Medical and Health Sciences I (Chair: Lausen), Room 6
10:00-10:25  Augustin, Thomas; Wallner, Matthias: On the power of corrected score functions to adjust for measurement error (p. 6)
10:25-10:50  Sieben, Wiebke: Time Related Features for Alarm Classification in Intensive Care Monitoring (p. 135)
10:50-11:15  Ostermann, Thomas; Schuster, Reinhard; Erben, Christoph: Classifying hospitals with respect to their diagnostic diversity using Shannon's entropy (p. 108)

Session Market Research, Controlling, OR I (Chair: Baier), Room 4
10:00-10:25  Brusch, Michael; Baier, Daniel: Analyzing the Stability of Price Response Functions - Measuring the Influence of Different Parameters in a Monte Carlo Comparison (p. 22)
10:25-10:50  Tarka, Piotr: Conjoint Analysis within the field of customer satisfaction problems – a model of composite product/service (p. 144)


10:50-11:15  Punzo, Antonio: Considerations on the impact of JML-ill-conditioned configurations in the CML approach (p. 116)

11:20-12:00  Semi-plenary Lectures
             Ben-Israel, Adi: Probabilistic Distance Clustering (Chair: Vichi), Room 5 (p. 12)
             Imaizumi, Tadashi: Dimensionality Reduction of Similarity Matrix (Chair: Gaul), Room 3 (p. 73)

12:00-12:20  Coffee

Session Clustering and Classification V (Chair: Godehardt), Room 3
12:20-12:45  Kludas, Jana; Bruno, Eric; Marchand-Maillet, Stéphane: Exploiting synergetic and redundant features for multimedia document classification (p. 79)
12:45-13:30  Schiffner, Julia; Szepannek, Gero; Monthé, Thierry; Weihs, Claus: Localized Logistic Regression for Discrete Influential Factors (p. 129)

Session Visualization and Scaling Methods II (Chair: van de Velden), Room 405/6
12:20-12:45  Adachi, Kohei: Joint Procrustes Analysis with Constrained Simplimax Rotation: Nonsingular Transformation of Component Score and Loading Matrices Toward Simple Structure (p. 2)
12:45-13:30  Fernández-Aguirre, Karmele; Garín-Martín, María Araceli: Validity of images from binary coding tables. Student motivation surveys: some evidence (p. 41)

Session Exploratory Data Analysis II (Chair: Wehrens), Room 101/3
12:20-12:45  Zarraga, Amaya; Goitisolo, Beatriz: Factor Analysis of Incomplete Disjunctive Tables (p. 167)
12:45-13:30  Nusser, Sebastian; Otte, Clemens; Hauptmann, Werner: Multi-Class Extension of Verifiable Ensemble Models for Safety-Related Applications (p. 105)


Session Spatial Planning II (Chair: Behnisch), Room 2
12:20-12:45  Witek, Ewa: Analysis of massive emigration from Poland - the model-based clustering approach (p. 165)
12:45-13:30  Thinh, Nguyen Xuan; Küttner, Leander; Meinel, Gotthard: Evaluate the data structure and identify homogenous spatial units in the data base "Sustainability issues in sensitive areas" of the EU-FP6 Integrated Project SENSOR (p. 146)

Session Medical and Health Sciences II (Chair: Lausen), Room 6
12:20-12:45  Henker, Uwe; Ultsch, Alfred; Petersohn, Uwe: Die präzise und effiziente Erkennung von medizinischen Anforderungsformularen (p. 61)
12:45-13:30  Schuster, Reinhard; von Arnstedt, Eva: Age Distributions for costs in drug prescription by practitioners and for DRG-based hospital treatment (p. 133)

Session Market Research, Controlling, OR II (Chair: Baier), Room 4
12:20-12:45  Esber, Said; Baier, Daniel: Realoptionen bei der Bewertung von neuen Produkten (p. 38)
12:45-13:30  Abu Assab, Samah; Baier, Daniel: Designing Products Using Quality Function Deployment and Conjoint Analysis: A Comparison in a Market for Elderly People (p. 1)

13:15-14:00  Plenary Lecture (Chair: Weihs), Room 5
             McMorris, Fred R.: Majority-rule consensus: from preferences (social choice) to trees (biology and classification theory) (p. 94)

14:00-15:00  Informal Farewell (Conference site)


List of Contributions

Authors: Title (Page)

Abu Assab, Samah; Baier, Daniel: Designing Products Using Quality Function Deployment and Conjoint Analysis: A Comparison in a Market for Elderly People (p. 1)
Adachi, Kohei: Joint Procrustes Analysis with Constrained Simplimax Rotation: Nonsingular Transformation of Component Score and Loading Matrices Toward Simple Structure (p. 2)
Aden, Christian; Mucha, Hans-Joachim; Schmidt, Gunther; Schröder, Winfried: WaldIS - a web based reference system for the forest monitoring in North Rhine-Westphalia (p. 3)
Adler, Werner; Brenning, Alexander; Lausen, Berthold: Classification of Paired Data Using Ensemble Methods (p. 4)
Andres, Bjoern; Koethe, Ullrich; Helmstaedter, Moritz; Denk, Winfried; Hamprecht, Fred: Segmentation of Neural Tissue (p. 5)
Augustin, Thomas; Wallner, Matthias: On the power of corrected score functions to adjust for measurement error (p. 6)
Azam, Muhammad; Ostermann, Alexander; Pfeiffer, Karl-Peter: Evaluation Criteria for the Construction of Binary Classification Trees with Two or More Classes (p. 7)
Bade, Korinna; Benz, Dominik: Evaluation Strategies for Learning Algorithms of Hierarchical Structures (p. 8)
Barbosa, Rui Pedro; Belo, Orlando: Autonomous Forex Trading Agents (p. 9)
Becker, Niels; Werners, Brigitte: Improving Product Line Design with Bundling (p. 10)
Behnisch, Martin; Ultsch, Alfred: Estimating the number of buildings in Germany (p. 11)
Ben-Israel, Adi: Probabilistic Distance Clustering (p. 12)
Bessler, Wolfgang; Holler, Julian: Hedge Funds in a Bayesian Asset Allocation Framework: Incorporating Information on market states and manager's ability (p. 13)
Betzin, Jörg: Categorical Data in PLS Path modeling (p. 14)
Biernacki, Christophe; Celeux, Gilles Paul; Govaert, Gérard: Choosing the number of clusters in the latent class model (p. 15)
Bisson, Gilles: Clustering of molecules and structured data (p. 16)
Bocci, Laura; Vichi, Maurizio: The K-INDSCAL Model for Heterogeneous Three-way Dissimilarity Data (p. 17)
Borgelt, Christian: Weighting and Selecting Features in Fuzzy Clustering (p. 18)
Boulesteix, Anne-Laure; Slawski, Martin: On optimistic bias in reporting microarray-based classification accuracy (p. 20)
Bravo, Cristian; Maldonado, Sebastian; Weber, Richard: Practical experiences from Credit Scoring projects for Chilean financial organizations (p. 21)
Brusch, Michael; Baier, Daniel: Analyzing the Stability of Price Response Functions - Measuring the Influence of Different Parameters in a Monte Carlo Comparison (p. 22)
Buza, Krisztian Antal; Schmidt-Thieme, Lars: Motif-based Classification of Time Series with Bayesian Networks and SVMs (p. 23)
Calò, Daniela G.; Viroli, Cinzia: Visualizing data in Gaussian mixture model classification (p. 24)
Caserta, Marco; Lessmann, Stefan: A novel approach to construct discrete support vector machine classifiers (p. 25)
Cernian, Alexandra; Carstoiu, Dorin; Ionescu, Tudor: Modeling the Classification of Heterogeneous Data (p. 26)
Chiou, Hua-Kai; Huang, Yong-Ting; Liu, Gia-Shie: Applying Rough Set Theory to Constructing Knowledge Base for Critical Military Commodity Management (p. 27)
Chiou, Hua-Kai; Yuan, Benjamin J.C.; Wang, Yen-Wen: Correspondence Analysis for Exploring the Implementation of One Village One Product Programs in Taiwan (p. 28)
Cortina-Borja, Mario: Extending Multivariate Planing (p. 29)
Critchley, Frank; Pires, Ana; Amado, Conceicao: Principal Axis Analysis – with HDLSS bonuses! (p. 30)
Dean, Nema; Nugent, Rebecca: Augmenting Model-Based Clustering with Generalized Linkage methods (p. 31)
Desmet, Frank Michel; Leman, Marc; Lesaffre, Micheline: Statistical analysis of human body movement and group interactions in response to music (p. 32)
Dias, José G.; Vermunt, Jeroen K.; Ramos, Sofia: Mixture Hidden Markov Models in Finance Research (p. 33)
Dolata, Jens; Mucha, Hans-Joachim; Bartel, Hans-Georg: Mapping Findspots of Roman Military Brickstamps in Mogontiacum (Mainz) and Archaeometrical Analysis (p. 34)
Eigenfeldt, Arne; Kapur, Ajay: Multimodal Performance Analysis of Electronic Sitar (p. 35)
Einbeck, Jochen; Evers, Ludger: Data compression and regression based on local principal curves (p. 36)
Enyukov, Igor: Regression-autoregression based clustering (p. 37)
Esber, Said; Baier, Daniel: Realoptionen bei der Bewertung von neuen Produkten (p. 38)
Fenk-Oczlon, Gertraud; Fenk, August: Cross-linguistic regularities in the monosyllabic system (p. 40)
Fernández-Aguirre, Karmele; Garín-Martín, María Araceli: Validity of images from binary coding tables. Student motivation surveys: some evidence (p. 41)
Fricke, Jobst P.: A statistical theory of musical consonance proved in praxis (p. 42)


Fritsch, Arno; Ickstadt, Katja: An Improved Criterion for Clustering Based on the Posterior Similarity Matrix (p. 43)
Fuchs, Sebastian; Sarstedt, Marko: On the Use of Student Samples in Major Marketing Research Journals. A Meta-Study (p. 44)
Gabriel, Thomas R.; Thiel, Kilian; Berthold, Michael R.: Multi-Dimensional Scaling applied to Hierarchical Fuzzy Rule Systems (p. 45)
Gans, Ulrich-Walter; Lang, Matthias: ArcheoInf - Leistungszentrum für die digitale Unterstützung feldarchäologischer Projekte (p. 46)
Gantner, Zeno; Schmidt-Thieme, Lars: Scalable and Incrementally Updated Hybrid Recommender Systems (p. 47)
Garel, Bernard; Boucharel, Julien; Dewitte, Boris; du Penhoat, Yves: Non-Gaussian nature of ENSO signals and climate shifts: implications for regional studies off the western coast of South America (p. 48)
Gassiat, Elisabeth: Likelihood ratio test for general mixture models (p. 49)
Gazda, Vladimir: On a Location of the Retail Units and Equilibrium Price Determination (p. 50)
Geyer-Schulz, Andreas; Hoser, Bettina: The Potential of Social Intelligence for Collective Intelligence (p. 51)
Godehardt, Erhard; Jaworski, Jerzy; Rybarczyk, Katarzyna: Isolated vertices in random intersection graphs (p. 52)
Greselin, Francesca; Ingrassia, Salvatore: A note on constrained EM algorithms for mixtures of elliptical distributions (p. 53)
Groenen, Patrick J.F.; Nalbantov, Georgi; Bioch, Cor: Support Vector Machines in the Primal using Majorization and Kernels (p. 54)
Große, Lars; Joos, Franz: Usage of Artificial Neural Networks for Data Handling (p. 55)
Grün, Bettina; Leisch, Friedrich: Model diagnostics of finite mixtures using bootstrapping (p. 56)
Haasdonk, Bernard; Pekalska, Elzbieta: Classification with Regularized Kernel Mahalanobis-Distances (p. 57)
Häberle, Lothar: On classification of species of representation rings (p. 58)
Hahlweg, Cornelius; Rothe, Hendrik: Auswertung hochaufgelöster Streulichtdaten mit Methoden der multivariaten Statistik (p. 59)
Hansohm, Jürgen: Algorithms for Computing the Multivariate Isotonic Regression (p. 60)
Henker, Uwe; Ultsch, Alfred; Petersohn, Uwe: Die präzise und effiziente Erkennung von medizinischen Anforderungsformularen (p. 61)
Hennig, Christian; Hausdorf, Bernhard: Using cluster analysis for species delimitation (p. 62)
Henseler, Jörg: Nonlinear Effects in PLS Path Models: A Comparison of Available Approaches (p. 63)
Hermes, Jürgen; Schwiebert, Stephan: Classification of text processing components: The Tesla Role System (p. 64)


Herrmann, Lutz; Ultsch, Alfred: Strengths and Weaknesses of Ant Colony Clustering (p. 65)
Herzog, Irmela: Reconstructing Central Places and Settlements Groups (p. 66)
Hielscher, Thomas; Zucknick, Manuela; Werft, Wiebke; Benner, Axel: On the prognostic value of gene expression signatures for censored data (p. 67)
Holzmann, Hajo; Dannemann, Jörn: Likelihood ratio testing for hidden Markov models (p. 68)
Hühn, Jens; Hüllermeier, Eyke: Rule-Based Learning of Reliable Classifiers (p. 69)
Hüllermeier, Eyke; Vanderlooy, Stijn: Combining Predictions in Pairwise Classification: An Adaptive Voting Strategy and Its Relation to Weighted Voting (p. 70)
Huson, Daniel H.; Rupp, Regula: Using Cluster Networks to Represent Non-Compatible Sets of Clusters (p. 71)
Hütt, Marc-Thorsten: Genome phylogeny based on short-range correlations in DNA sequences (p. 72)
Imaizumi, Tadashi: Dimensionality Reduction of Similarity Matrix (p. 73)
Kaiser, Sebastian; Leisch, Friedrich: Benchmarking Bicluster Algorithms (p. 75)
Karatzoglou, Alexandros; Feinerer, Ingo; Hornik, Kurt: Nonparametric distribution analysis for text mining (p. 76)
Klein, Christian; Kundisch, Dennis: Index-Based Investment Vehicles - A Comparative Study for the German DAX (p. 77)
Klenk, Hans-Peter: Polyphasic genomic approach for the taxonomy of archaea and bacteria (p. 78)
Kludas, Jana; Bruno, Eric; Marchand-Maillet, Stéphane: Exploiting synergetic and redundant features for multimedia document classification (p. 79)
Kneib, Thomas; Baumgartner, Bernhard; Steiner, Winfried J.: Time-Varying Parameters in Brand Choice Models (p. 80)
Koralun-Bereznicka, Julia: Multivariate comparative analysis of stock exchanges - the European perspective (p. 81)
Krolak-Schwerdt, Sabine: Strategies of model construction for the analysis of judgement data (p. 82)
Kuziak, Katarzyna: An application of copula functions to market risk management (p. 83)
Lam, Kar Yin; Koning, Alex J.; Franses, Philip Hans: Testing preference rankings (p. 84)
Latouche, Pierre J.; Ambroise, Christophe; Birmelé, Etienne: Bayesian Methods for Graph Clustering (p. 85)
Locarek-Junge, Hermann; Mihm, Max: Fundamental Indexation - testing the concept in the German stock market (p. 86)
Louw, Nelmarie; Lamont, Morne; Steel, Sarel: Identifying Atypical Cases in Kernel Fisher Discriminant Analysis by using the Smallest Enclosing Hypersphere (p. 87)


Lübke, Karsten; Papenhoff, Heike: Latent growth models for analyzing a multi partner reward program (p. 88)
Lukashevich, Hanna; Dittmar, Christian; Bastuck, Christoph: Applying Statistical Models and Parametric Distance Measures for Music Similarity Search (p. 89)
Lukociene, Olga; Vermunt, Jeroen K.: Determining the number of components in mixture models for hierarchical data (p. 90)
Klaus, Martin; Wagner, Ralf: Exploring the Interaction Structure of Weblogs (p. 91)
Martin-Magniette, Marie-Laure; Mary-Huard, Tristan; Bérard, Caroline; Robin, Stéphane: ChIPmix: Mixture model of regressions for ChIP-chip experiment analysis (p. 92)
McLachlan, Geoffrey John: Clustering of High-Dimensional Data Via Finite Mixture Models (p. 93)
McMorris, F. R.: Majority-rule consensus: from preferences (social choice) to trees (biology and classification theory) (p. 94)
Meier, René; Joos, Franz: Optimization Methods with Evolutionary Algorithms and Artificial Neural Networks (p. 95)
Meyer, Florian; Ultsch, Alfred: Finding Music Fads by clustering Online Radio Data with Emergent Self-Organizing Maps (p. 96)
Mirkin, Boris: Deviant box and dual clusters for the analysis of conceptual contexts (p. 97)
Mucha, Hans-Joachim: Clustering a Contingency Table Accompanied by Visualization (p. 98)
Müller-Funk, Ulrich; Dlugosz, Stephan: Predictive classification trees (p. 99)
Mylonas, Phivos; Solachidis, Vassilios; Geyer-Schulz, Andreas; Hoser, Bettina; Chapman, Sam; Ciravegna, Fabio; Staab, Steffen; Smrz, Pavel; Kompatsiaris, Yiannis; Avrithis, Yannis: Efficient Media Exploitation towards Collective Intelligence (p. 100)
Nalbantov, Georgi Ilkov; Groenen, Patrick J.F.; Bioch, Cor: Support Vector Machines in the Dual using Majorization and Kernels (p. 101)
Neumann, Anneke; Ambrosi, Klaus; Hahne, Felix: Approach for Dynamic Problems in Clustering (p. 102)
Neykov, Neyko; Filzmoser, Peter; Neytchev, Plamen: Robust fitting of mixtures: The approach based on the Trimmed Likelihood Estimator (p. 103)
Nugent, Rebecca; Stuetzle, Werner: Cluster Tree Estimation using a Generalized Single Linkage Method (p. 104)
Nusser, Sebastian; Otte, Clemens; Hauptmann, Werner: Multi-Class Extension of Verifiable Ensemble Models for Safety-Related Applications (p. 105)
Okada, Akinori; Sakaehara, Towao: Analysis of Borrowing and Guaranteeing Relationships among Government Officials in the Eighth Century in the Old Capital of Japan (p. 106)
Oosthuizen, Surette; Steel, Sarel J.: Variable selection for kernel classifiers: a feature-to-input space approach (p. 107)
Ostermann, Thomas; Schuster, Reinhard; Erben, Christoph: Classifying hospitals with respect to their diagnostic diversity using Shannon's entropy (p. 108)
Palumbo, Francesco: Clustering and Dimensionality Reduction to Discover Interesting Patterns in Binary Data (p. 109)
Petersen, Wiebke: Lineare Kodierung multipler Vererbungshierarchien: Wiederbelebung einer antiken Klassifikationsmethode (p. 110)
Petersen, Wiebke; Heinrich, Petja: Begriffsanalytischer Ansatz zur qualitativen Zitationsanalyse (p. 111)
Piontek, Krzysztof: The Analysis of the power for some chosen VaR backtesting procedures - simulation approach (p. 112)
Pommeret, Denys: Testing distribution in errors in variables models (p. 113)
Pons, Odile: Classification with an increasing number of components (p. 114)
Potapov, Sergej; Lausen, Berthold: Bagging with different split criteria (p. 115)
Punzo, Antonio: Considerations on the impact of JML-ill-conditioned configurations in the CML approach (p. 116)
Raabe, Nils; Enk, Dirk; Weihs, Claus; Biermann, Dirk: Dynamic disturbances in BTA deephole drilling - Identification of spiralling as a regenerative effect (p. 117)
Radermacher, Walter: Statistical processes under change - Enhancing data quality with pretests (p. 118)
Rapp, Reinhard; Zock, Michael: Automatic Dictionary Expansion Using Non-parallel Corpora (p. 119)
Ringle, Christian M.: FIMIX-PLS Segmentation of Data for Path Models with Multiple Endogenous LVs (p. 120)
Rokita, Pawel; Piontek, Krzysztof: Extreme unconditional dependence vs. multivariate GARCH effect in the analysis of dependence between high losses on Polish and German stock indexes (p. 121)
Rolshoven, Jürgen: Grundzüge einer generativen Korpuslinguistik (p. 122)
Rozmus, Dorota: Cluster ensemble based on co-occurrence data (p. 123)
Sagan, Adam; Kowalska-Musial, Magdalena: Dyadic Interactions in Service Encounter - Bayesian SEM Approach (p. 124)


Sardet, Laure; Patilea, Valentin: Beta-kernel density estimation using mixture-based transformations: an application to claims distribution (p. 125)
Schachtner, Reinhard; Pöppel, Gerhard; Lang, Elmar: Nonnegative Matrix Factorization for Binary Data to Extract Elementary Failure Maps from Wafer Test Images (p. 126)
Scharl, Theresa; Leisch, Friedrich: Quality-Based Clustering of Functional Data: Applications to Time Course Microarray Data (p. 127)
Schierle, Martin; Trabold, Daniel: Multilingual knowledge based concept recognition in textual data (p. 128)
Schiffner, Julia; Szepannek, Gero; Monthé, Thierry; Weihs, Claus: Localized Logistic Regression for Discrete Influential Factors (p. 129)
Schiffner, Julia; Weihs, Claus: Localized Classification Using Mixture Models (p. 130)
Schlattmann, Peter: Comparison of four estimators of the heterogeneity variance for meta-analysis (p. 131)
Schölkopf, Bernhard: Machine Learning applications of positive definite kernels (p. 132)
Schuster, Reinhard; von Arnstedt, Eva: Age Distributions for costs in drug prescription by practitioners and for DRG-based hospital treatment (p. 133)
Schyle, Daniel: The Late Neolithic flint axe production on the Lousberg (Aachen, Germany) – An extrapolation of supply and demand and population density (p. 134)
Sieben, Wiebke: Time Related Features for Alarm Classification in Intensive Care Monitoring (p. 135)
Slawski, Martin; Boulesteix, Anne-Laure; Daumer, Martin: 'CMA' - Steps in developing a comprehensive R-toolbox for classification with microarray data and other high-dimensional problems (p. 136)
Solachidis, Vassilios; Mylonas, Phivos; Geyer-Schulz, Andreas; Hoser, Bettina; Chapman, Sam; Ciravegna, Fabio; Staab, Steffen; Contopoulos, Costis; Gkika, Ioanna; Smrz, Pavel; Kompatsiaris, Yiannis; Avrithis, Yannis: Generating Collective Intelligence (p. 137)
Sommer, Katrin; Weihs, Claus: Analysis of polyphonic musical time series (p. 138)
Sommerfeld, Angela: Trust as a Key Determinant of Loyalty and its Moderators (p. 139)
Stecking, Ralf; Schebesch, Klaus B.: Generating Fictitious Training Data for Credit Client Classification (p. 140)
Steinbrecher, Matthias; Kruse, Rudolf: Clustering Association Rules with Fuzzy Concepts (p. 141)


Strobl, Carolin; Leisch,<br />

Friedrich<br />

Who's Afraid of Statistics? - Measurement<br />

and Predictors of Statistics Anxiety in<br />

German University Students<br />

Strobl, Carolin; Zeileis, Achim: A New, Conditional Variable Importance Measure for Random Forests (143)
Tarka, Piotr: Conjoint Analysis within the field of customer satisfaction problems – a model of composite product/service (144)
Thiel, Klaus: Optimal VDSL Expansion Taking into Consideration Infrastructure Restrictions and Marketing Requirements (145)
Thinh, Nguyen Xuan; Küttner, Leander; Meinel, Gotthard: Evaluate the data structure and identify homogenous spatial units in the data base "Sustainability issues in sensitive areas" of the EU-FP6 Integrated Project SENSOR (146)
Thorleuchter, Dirk: Mining ideas from textual information (147)
Thorleuchter, Dirk: Mining technologies in security and defense (148)
Timmerman, Marieke E.; Lichtwarck-Aschoff, Anna; Ceulemans, Eva: Multilevel Simultaneous Component Analysis for Studying Inter-individual and Intra-individual Variabilities (149)
Tomas, Amber: Issues related to the implementation of a dynamic logistic model for classifier combination (150)
Trinchera, Laura; Esposito Vinzi, Vincenzo: A Comprehensive Partial Least Squares Approach to Component-Based Structural Equation Modeling (151)
Trzesiok, Michal: Relevant Importance of Predictor Variables in Support Vector Machines Models (152)
Ultsch, Alfred: Comparison of Algorithms to find differentially expressed Genes in Microarray Data (153)
Ultsch, Alfred: Is log ratio a good value for measuring return in stock investments? (154)
Ünlü, Ali: Mosaic Plots and Knowledge Structures (155)
van de Velden, Michel; de Beuckelaer, Alain; Groenen, Patrick; Busing, Frank: Visualizing preferences using minimum variance nonmetric unfolding (156)
van der Ark, Andries L.; Straat, J. Hendrik: Selection of items for tests and questionnaires using Mokken scale analysis (157)
van der Heijden, Peter G.M.: Estimating the prevalence of rule transgression (158)
Wagner, Ralf; Sauerwald, Erik: Clustering Consumers with Respect to Their Marketing Reactance Behavior (159)
Wehrens, Ron: Supervised Self-Organising Maps and More (160)
Wilczynski, Petra; Sarstedt, Marko: Multi-Item Versus Single-Item Measures: A Review and Future Research Directions (161)

− xxxv −


Wildner, Raimund: Management and methods: How to do market segmentation projects (162)
Winkler, Roland; Rehm, Frank; Kruse, Rudolf: Clustering with Repulsive Prototypes (163)
Winkler, Stephan; Affenzeller, Michael; Wagner, Stefan; Kronberger, Gabriel: On the Effects of Enhanced Selection Models on Quality and Comparability of Classifiers Produced by Genetic Programming (164)
Witek, Ewa: Analysis of massive emigration from Poland - the model-based clustering approach (165)
Worm, Katja; Meffert, Beate: Image Based Mail Piece Identification using Unsupervised Learning (166)
Zarraga, Amaya; Goitisolo, Beatriz: Factor Analysis of Incomplete Disjunctive Tables (167)
Zeileis, Achim; Kleiber, Christian: Recursive Partitioning of Economic Regressions: Trees of Costly Journals and Beautiful Professors (168)

− xxxvi −


Author Index

Abu Assab, Samah 1
Adachi, Kohei 2
Aden, Christian 3
Adler, Werner 4
Affenzeller, Michael 164
Amado, Conceicao 30
Ambroise, Christophe 85
Ambrosi, Klaus 102
Andres, Bjoern 5
Augustin, Thomas 6
Avrithis, Yannis 100, 137
Azam, Muhammad 7
Bade, Korinna 8
Baier, Daniel 1, 22, 38
Barbosa, Rui Pedro 9
Bartel, Hans-Georg 34
Bastuck, Christoph 89
Baumgartner, Bernhard 80
Becker, Niels 10
Behnisch, Martin 11
Belo, Orlando 9
Ben-Israel, Adi 12
Benner, Axel 67
Benz, Dominik 8
Bérard, Caroline 92
Berthold, Michael R. 45
Bessler, Wolfgang 13
Betzin, Jörg 14
Biermann, Dirk 117
Biernacki, Christophe 15
Bioch, Cor 54, 101
Birmelé, Etienne 85
Bisson, Gilles 16
Bocci, Laura 17
Borgelt, Christian 18
Boucharel, Julien 48
Boulesteix, Anne-Laure 20, 136

− xxxvii −

Bravo, Cristian 21
Brenning, Alexander 4
Bruno, Eric 79
Brusch, Michael 22
Busing, Frank 156
Buza, Krisztian Antal 23
Calò, Daniela G. 24
Carstoiu, Dorin 26
Caserta, Marco 25
Celeux, Gilles Paul 15
Cernian, Alexandra 26
Ceulemans, Eva 149
Chiou, Hua-Kai 27, 28
Ciravegna, Fabio 100, 137
Contopoulos, Costis 137
Cortina-Borja, Mario 29
Critchley, Frank 30
Dannemann, Jörn 68
Daumer, Martin 136
de Beuckelaer, Alain 156
Dean, Nema 31
Denk, Winfried 5
Desmet, Frank Michel 32
Dewitte, Boris 48
Dias, José G. 33
Dittmar, Christian 89
Dlugosz, Stephan 99
Dolata, Jens 34
du Penhoat, Yves 48
Eigenfeldt, Arne 35
Einbeck, Jochen 36
Enk, Dirk 117
Enyukov, Igor 37
Erben, Christoph 108
Esber, Said 38
Esposito Vinzi, Vincenzo 151
Evers, Ludger 36
Feinerer, Ingo 76
Fenk, August 40
Fenk-Oczlon, Gertraud 40
Fernández-Aguirre, Karmele 41
Filzmoser, Peter 103
Franses, Philip Hans 84
Fricke, Jobst P. 42
Fritsch, Arno 43
Fuchs, Sebastian 44
Gabriel, Thomas R. 45
Gans, Ulrich-Walter 46
Gantner, Zeno 47
Garel, Bernard 48
Garín-Martín, María Araceli 41
Gassiat, Elisabeth 49
Gazda, Vladimir 50
Geyer-Schulz, Andreas 51, 100, 137
Gkika, Ioanna 137
Godehardt, Erhard 52
Goitisolo, Beatriz 167
Govaert, Gérard 15
Greselin, Francesca 53
Groenen, Patrick J.F. 54, 101, 156
Große, Lars 55
Grün, Bettina 56
Haasdonk, Bernard 57
Häberle, Lothar 58
Hahlweg, Cornelius 59
Hahne, Felix 102
Hamprecht, Fred A. 5
Hansohm, Jürgen 60
Hauptmann, Werner 105
Hausdorf, Bernhard 62
Heinrich, Petja 111
Helmstaedter, Moritz 5
Henker, Uwe 61
Hennig, Christian 62
Henseler, Jörg 63

− xxxviii −

Hermes, Jürgen 64
Herrmann, Lutz 65
Herzog, Irmela 66
Hielscher, Thomas 67
Holler, Julian 13
Holzmann, Hajo 68
Hornik, Kurt 76
Hoser, Bettina 51, 100, 137
Huang, Yong-Ting 27
Hühn, Jens 69
Hüllermeier, Eyke 69, 70
Hütt, Marc-Thorsten 72
Huson, Daniel H. 71
Ickstadt, Katja 43
Imaizumi, Tadashi 73
Ingrassia, Salvatore 53
Ionescu, Tudor 26
Jaworski, Jerzy 52
Joos, Franz 55, 95
Kapur, Ajay 35
Karatzoglou, Alexandros 76
Klaus, Martin 91
Kleiber, Christian 168
Klein, Christian 77
Klenk, Hans-Peter 78
Kludas, Jana 79
Kneib, Thomas 80
Koethe, Ullrich 5
Kompatsiaris, Yiannis 100, 137
Koning, Alex J. 84
Koralun-Bereznicka, Julia 81
Kowalska-Musial, Magdal. 124
Krolak-Schwerdt, Sabine 82
Kronberger, Gabriel 164
Kruse, Rudolf 141, 163
Küttner, Leander 146
Kundisch, Dennis 77
Kuziak, Katarzyna 83
Lam, Kar Yin 84
Lamont, Morne 87
Lang, Elmar 126
Lang, Matthias 46
Latouche, Pierre J. 85
Lausen, Berthold 4, 115
Leisch, Friedrich 56, 75, 127, 142
Leman, Marc 32
Lesaffre, Micheline 32
Lessmann, Stefan 25
Lichtwarck-Aschoff, Anna 149
Locarek-Junge, Hermann 86
Louw, Nelmarie 87
Lübke, Karsten 88
Lukashevich, Hanna 89
Lukociene, Olga 90
Maldonado, Sebastian 21
Marchand-Maillet, Stephane 79
Martin-Magniette, Marie-L. 92
Mary-Huard, Tristan 92
McLachlan, Geoffrey John 93
McMorris, Fred R. 94
Meffert, Beate 166
Meier, René 95
Meinel, Gotthard 146
Mihm, Max 86
Mirkin, Boris 97
Monthé, Thierry 129
Mucha, Hans-Joachim 3, 34, 98
Müller-Funk, Ulrich 99
Mylonas, Phivos 100, 137
Nalbantov, Georgi Ilkov 54, 101
Neumann, Anneke 102
Neykov, Neyko 103
Neytchev, Plamen 103
Nugent, Rebecca 31, 104
Nusser, Sebastian 105
Okada, Akinori 106
Oosthuizen, Surette 107
Ostermann, Alexander 7

− xxxix −

Ostermann, Thomas 108
Otte, Clemens 105
Palumbo, Francesco 109
Papenhoff, Heike 88
Patilea, Valentin 125
Pekalska, Elzbieta 57
Petersen, Wiebke 110, 111
Petersohn, Uwe 61
Pfeiffer, Karl-Peter 7
Piontek, Krzysztof 112, 121
Pires, Ana 30
Pöppel, Gerhard 126
Pommeret, Denys 113
Pons, Odile 114
Potapov, Sergej 115
Punzo, Antonio 116
Raabe, Nils 117
Radermacher, Walter 118
Ramos, Sofia 33
Rapp, Reinhard 119
Rehm, Frank 163
Ringle, Christian M. 120
Robin, Stéphane 92
Rokita, Pawel 121
Rolshoven, Jürgen 122
Rothe, Hendrik 59
Rozmus, Dorota 123
Rupp, Regula 71
Rybarczyk, Katarzyna 52
Sagan, Adam 124
Sakaehara, Towao 106
Sardet, Laure 125
Sarstedt, Marko 44, 161
Sauerwald, Erik 159
Schachtner, Reinhard 126
Scharl, Theresa 127
Schebesch, Klaus B. 140
Schierle, Martin 128
Schiffner, Julia 129, 130
Schlattmann, Peter 131
Schmidt, Gunther 3
Schmidt-Thieme, Lars 23, 47
Schölkopf, Bernhard 132
Schröder, Winfried 3
Schuster, Reinhard 108, 133
Schwiebert, Stephan 64
Schyle, Daniel 134
Sieben, Wiebke 135
Slawski, Martin 20, 136
Smrz, Pavel 100, 137
Solachidis, Vassilios 100, 137
Sommer, Katrin 138
Sommerfeld, Angela 139
Staab, Steffen 100, 137
Stecking, Ralf 140
Steel, Sarel J. 87, 107
Steinbrecher, Matthias 141
Steiner, Winfried J. 80
Straat, J. Hendrik 157
Stuetzle, Werner 104
Szepannek, Gero 129
Tarka, Piotr 144
Thiel, Kilian 45
Thiel, Klaus 145
Thinh, Nguyen Xuan 146
Thorleuchter, Dirk 147, 148
Timmerman, Marieke E. 149
Tomas, Amber 150
Trabold, Daniel 128
Trinchera, Laura 151
Trzesiok, Michal 152
Ünlü, Ali 155
Ultsch, Alfred 11, 61, 65, 96, 153, 154
van de Velden, Michel 156

− xl −

van der Ark, Andries L. 157
van der Heijden, Peter G.M. 158
Vanderlooy, Stijn 70
Vermunt, Jeroen K. 33, 90
Vichi, Maurizio 17
Viroli, Cinzia 24
von Arnstedt, Eva 133
Wagner, Ralf 91, 159
Wagner, Stefan 164
Wallner, Matthias 6
Weber, Richard 21
Wehrens, Ron 160
Weihs, Claus 117, 129, 130, 138
Werft, Wiebke 67
Werners, Brigitte 10
Wilczynski, Petra 161
Wildner, Raimund 162
Winkler, Roland 163
Winkler, Stephan 164
Witek, Ewa 165
Worm, Katja 166
Yuan, Benjamin J.C. 28
Zarraga, Amaya 167
Zeileis, Achim 143, 168
Zock, Michael 119
Zucknick, Manuela 67


Designing Products Using Quality Function Deployment and Conjoint Analysis: A Comparison in a Market for Elderly People

Samah Abu Assab and Daniel Baier

Chair of Marketing and Innovation Management, Brandenburg University of Technology, Erich-Weinert-Str. 1, 03046 Cottbus, Germany
samah.assab@tu-cottbus.de, baier@tu-cottbus.de

Abstract. In this paper, we compare two product design approaches, quality function deployment (QFD) and conjoint analysis (CA), using the example of mobile phones for elderly people as a target group. We then compare our results with those of earlier comparisons (e.g., Pullman et al. (2002), Katz (2004)). In this work, the same procedures and conditions are used as in the paper by Pullman et al. (2002).

Pullman et al. (2002) view the relation between the two methods, QFD and CA, as a complementary one in which both should be implemented simultaneously, each providing feedback to the other. They concluded that CA is more efficient in reflecting the end-users' present preferences for the product attributes, whereas QFD is clearly better at satisfying end-users' needs from the developers' point of view. Katz (2004), responding from a practitioner's point of view, agreed with Pullman et al. However, he concluded that the two methods are better used sequentially and that QFD should precede conjoint analysis. We test these results in a market for elderly people.

Key words: Conjoint analysis, Quality function deployment, new product design, elderly people

References

Baier, D. and Brusch, M. (2005): Linking Quality Function Deployment and Conjoint Analysis for New Product Design. In: D. Baier, R. Decker and L. Schmidt-Thieme (Eds.): Data Analysis and Decision Support. Springer, Berlin, 189-198.
Katz, G.M. (2004): A Response to Pullman et al.'s (2002) Comparison of Quality Function Deployment versus Conjoint Analysis. Journal of Product Innovation Management, 21, 61-63.
Pullman, M.E., Moore, W.L. and Wardell, D.G. (2002): A Comparison of Quality Function Deployment and Conjoint Analysis in New Product Design. The Journal of Product Innovation Management, 19, 354-364.

− 1 −
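Conjoint analysis rests on additive part-worth utilities, which can be estimated by ordinary least squares on dummy-coded product profiles. A minimal sketch of that idea follows; the attributes, levels and ratings are invented for illustration and are not taken from the study:

```python
import numpy as np

# Hypothetical mini conjoint study: four mobile-phone profiles described
# by two binary attributes. Columns: [intercept, large_buttons, loud_speaker].
X = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
], dtype=float)

# Made-up preference ratings for the four profiles.
y = np.array([2.0, 5.0, 4.0, 7.0])

# Part-worth utilities via ordinary least squares.
partworths, *_ = np.linalg.lstsq(X, y, rcond=None)

# The utility of a profile is the sum of the part-worths of its levels.
u = X[3] @ partworths
```

With these toy ratings the fit is exact: the part-worths are 2 (base), 3 (large buttons) and 2 (loud speaker), so the full-featured profile scores 7.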


Joint Procrustes Analysis with Constrained Simplimax Rotation: Nonsingular Transformation of Component Score and Loading Matrices Toward Simple Structure

Kohei Adachi

Graduate School of Human Sciences, Osaka University, Japan

Abstract. The solution of component analysis is indeterminate up to a nonsingular transformation: post-multiplying the component score and loading matrices by a nonsingular matrix and its transposed inverse, respectively, does not change the goodness of fit. To obtain a nonsingular matrix that gives simple structure to both the transformed score and loading matrices, we propose joint Procrustes analysis with constrained simplimax rotation, which consists of two phases. First, the score and loading matrices are rotated orthogonally so as to match target score and loading matrices, respectively, which contain zero elements; the number of zero elements is predetermined, but their placement and the values of the non-zero elements are unknown. Second, with the placement of the zero elements fixed at the result of the first phase, a nonsingular matrix is obtained that matches the transformed score and loading matrices to the target score and loading matrices, respectively, where the values of the non-zero elements are unknown. This procedure is argued to be useful for cases where the score and loading matrices play symmetric roles, for example, when component analysis is performed on a data matrix of input signals by output responses.

References

Adachi, K. (2005): Simultaneous Procrustes transformation of components and loadings obtained from three-way data. P. 77 in http://www.psychometrika.org/PDFs/IMPS2005_Abstracts.pdf.
Kiers, H. A. L. (1994): Simplimax: Oblique rotation to an optimal target with simple structure. Psychometrika, 59, 567-579.

− 2 −
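The transformational indeterminacy the abstract starts from is easy to verify numerically. A small sketch (matrix sizes and the transformation are arbitrary illustrations, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy component model X = F L^T with scores F and loadings L.
F = rng.standard_normal((10, 2))   # component score matrix
L = rng.standard_normal((4, 2))    # component loading matrix
X = F @ L.T

# Post-multiply the scores by any nonsingular T and the loadings by
# its transposed inverse: the reconstruction F2 @ L2.T equals X, so
# the goodness of fit is unchanged.
T = np.array([[2.0, 1.0], [0.5, 1.5]])
F2 = F @ T
L2 = L @ np.linalg.inv(T).T
```

This freedom is exactly what the proposed two-phase procedure exploits to pick one T that makes both matrices simple.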


WaldIS - a web-based reference system for forest monitoring in North Rhine-Westphalia

Christian Aden 1, Hans Mucha 2, Gunther Schmidt 1, and Winfried Schröder 1

1 Lehrstuhl für Landschaftsökologie, Hochschule Vechta, D-49377 Vechta, Germany, caden@iuw.uni-vechta.de
2 Weierstraß-Institut für Angewandte Analysis und Stochastik (WIAS), D-10117 Berlin, Germany, mucha@wias-berlin.de

Abstract. In Germany, a multi-level forest monitoring system has been established since the mid-1980s. This hierarchical monitoring system consists of the annual forest condition surveys, the forest soil survey, and the intensive long-term monitoring of forest ecosystems. In North Rhine-Westphalia, these forest monitoring programmes are supplemented by the monitoring of the foliar chemistry. Within these programmes, the monitoring data are recorded and evaluated separately by the respective federal authorities. An integrative statistical analysis of all data collected in the different monitoring programmes has not yet been realised. To overcome these constraints, the German Research Foundation (DFG) sponsors a research project that aims at the compilation of the monitoring data by use of WebGIS techniques and at integrated statistical analyses by use of geostatistics, time series analysis and multivariate statistics. Currently, the reference data system WaldIS is being developed for integrating the survey data interactively and for visualising the data via a WebGIS. In addition, tools for logical data queries and for downloads, as well as some GIS functions, were included. WaldIS was realised using open source software components instead of proprietary software: the UMN MapServer was combined with the WebGIS client suite Mapbender and the database management system PostgreSQL. Furthermore, WaldIS relies on standards for processing geo-objects published by the Open Geospatial Consortium (Pesch et al. 2007). Moreover, WaldIS will be used to visualise statistical results such as clusters or principal components. With the help of stable statistical analysis based on rank-order data, the aim is to find areas of homogeneous environmental and forest conditions.

Key words: forest monitoring, WebGIS, multivariate rank analysis, stability

References

Pesch, R., Schmidt, G., Schröder, W., Aden, C., Kleppin, L. and Holy, M. (2007): Development, Implementation and Application of the WebGIS MossMet. In: A. Scharl and K. Tochtermann (Eds.): The Geospatial Web. Springer, London, 191–200.

− 3 −


Classification of Paired Data Using Ensemble Methods

Werner Adler 1, Alexander Brenning 2, and Berthold Lausen 1

1 Chair for Biometry and Epidemiology, University of Erlangen-Nuremberg, Germany
werner.adler@imbe.imed.uni-erlangen.de, berthold.lausen@rzmail.uni-erlangen.de
2 Department of Geography, University of Waterloo, Canada
brenning@fesmail.uwaterloo.ca

Abstract. In glaucoma classification, the underlying data have a paired structure that is often accounted for by simply using only one eye per subject. Brenning and Lausen (2008) showed that the proper use of both eyes in paired cross-validation decreases the variance of the estimation, compared to cross-validation using only one eye per subject.

We discuss and compare different strategies to generate the bootstrap samples for training AdaBoost (Freund and Schapire, 1996), Random Forest (Breiman, 2001), and Double Bagging (Hothorn and Lausen, 2005). The simplest approach is to ignore the paired data structure and proceed as usual. Adapting the idea of Brenning and Lausen, we also perform subject-based sampling. In a first step, subjects are drawn with replacement. In a second step, for each drawn subject either both eyes or one randomly selected eye are chosen, or two eyes are drawn with replacement. The subjects not selected for training the base learners constitute the out-of-bag samples. We compare the error rates resulting from these different approaches in a simulation study.

Key words: Bootstrap, Classification, Glaucoma, Paired Organs

References

Breiman, L. (2001): Random forests. Machine Learning, 45, 5–32.
Brenning, A. and Lausen, B. (2008): Estimating error rates in the classification of paired organs. Statistics in Medicine, submitted.
Freund, Y. and Schapire, R. (1996): Experiments with a new boosting algorithm. Proceedings of the 13th International Conference on Machine Learning, 148–156.
Hothorn, T. and Lausen, B. (2005): Bundling classifiers by bagging trees. Computational Statistics & Data Analysis, 49, 1068–1078.

− 4 −
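The subject-based resampling described in the abstract can be sketched as follows; the function name, the mode labels and the toy data are ours, not the authors':

```python
import random

def subject_bootstrap(subjects, rng, mode="both"):
    """Bootstrap paired-organ data at the subject level.

    subjects maps a subject id to its two eye records. Subjects are
    drawn with replacement; for each drawn subject, 'both' keeps both
    eyes, 'one' keeps one randomly selected eye, and 'resample' draws
    two eyes with replacement. Subjects never drawn form the
    out-of-bag sample.
    """
    ids = list(subjects)
    drawn = [rng.choice(ids) for _ in ids]      # subjects, with replacement
    sample = []
    for sid in drawn:
        eyes = subjects[sid]
        if mode == "both":
            sample.extend(eyes)
        elif mode == "one":
            sample.append(rng.choice(eyes))
        else:  # mode == "resample"
            sample.extend(rng.choices(eyes, k=2))
    oob = [sid for sid in ids if sid not in set(drawn)]
    return sample, oob

eyes = {s: [f"{s}-left", f"{s}-right"] for s in ("s1", "s2", "s3", "s4")}
sample, oob = subject_bootstrap(eyes, random.Random(42), mode="both")
```

Because sampling happens at the subject level, both eyes of a subject end up on the same side of the train/out-of-bag split, which is the point of the strategy.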


Segmentation of Neural Tissue

Bjoern Andres 1, Ullrich Koethe 1, Moritz Helmstaedter 2, Winfried Denk 2, and Fred Hamprecht 1

1 Interdisciplinary Center for Scientific Computing, University of Heidelberg
2 Max Planck Institute for Medical Research, Heidelberg

Abstract. Three-dimensional electron-microscopic image stacks with almost isotropic resolution allow, for the first time, the determination of the complete connectivity matrix of parts of the brain. In spite of major advances in staining, correct segmentation of these stacks remains challenging, because very few local mistakes can lead to severe global errors. We propose a hierarchical segmentation procedure based on statistical learning and topology-preserving grouping. First, edge probability maps are computed by a random forest classifier and partitioned into supervoxels by the watershed transform. Over-segmentation is then resolved by constructing an irregular graphical model on these supervoxels and inferring the most likely global segmentation. Careful validation shows that the results of our algorithm are close to human labelings.

− 5 −


On the power of corrected score functions to adjust for measurement error

Thomas Augustin and Matthias Wallner

Department of Statistics, University of Munich (LMU)
augustin@stat.uni-muenchen.de

Abstract. Measurement error modelling, also called errors-in-variables modelling, is a generic term for all situations where additional uncertainty in the variables has to be taken into account in order to avoid severe bias in the statistical analysis. The problem is omnipresent in technical statistics, when data from imperfect measurement instruments are analyzed, as well as in biometrics, econometrics and the social sciences, where operationalizations (surrogates) are used instead of complex theoretical constructs.

After a brief introduction to the area of measurement error modelling, the talk discusses the power and some limitations of Nakamura's general principle of corrected score functions, mainly in the context of failure time data. Starting with classical covariate measurement error in Cox's PH model, it is shown how the Breslow likelihood can be corrected, while according to results by Stefanski and by Nakamura himself no corrected score function for the partial likelihood can exist. We then turn to parametric failure time models and extend consideration to additionally error-prone lifetimes. Finally, some ideas for handling Berkson-type errors (as occurring, e.g., in radon studies) and rounding errors will be sketched.

Key words: Measurement error, errors-in-variables, survival analysis, Cox model, rounding

− 6 −
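One standard way to formalise the corrected-score idea mentioned in the abstract (our summary of Nakamura's general principle, not a formula from the talk): under classical additive error, the true covariate X is observed only through a surrogate W, and a corrected score U* is required to recover the true-data score U in conditional expectation,

```latex
\[
  W = X + U, \qquad E[U \mid X] = 0,
\]
\[
  E\bigl[\,U^{*}(Y, W; \beta)\,\big|\; Y, X\,\bigr] = U(Y, X; \beta)
  \quad\Longrightarrow\quad
  E\bigl[\,U^{*}(Y, W; \beta_{0})\,\bigr]
  = E\bigl[\,U(Y, X; \beta_{0})\,\bigr] = 0 .
\]
```

so the corrected estimating equation remains unbiased even though X itself is never observed.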


Evaluation Criteria for the Construction of Binary Classification Trees with Two or More Classes

Muhammad Azam 1, Alexander Ostermann 2, and Karl-Peter Pfeiffer 3

1 Department of Medical Statistics, Informatics and Health Economics, Medical University Innsbruck, csag2533@uibk.ac.at
2 Institute for Mathematics, University of Innsbruck, Technikerstrasse 25/7, 6020 Innsbruck, alexander.ostermann@uibk.ac.at
3 Department of Medical Statistics, Informatics and Health Economics, Medical University Innsbruck, Karl-Peter.Pfeiffer@i-med.ac.at

Abstract. Classification trees are built by top-down recursive partitioning of labelled sampling units into end nodes. Each end node represents the labelled units that form the majority there; the remaining units are considered misclassified. In the top-down induction process, an evaluation criterion plays an important role in sending as many units with the same label as possible to the same node. To achieve this goal, a "goodness of split" measure is calculated using an evaluation criterion (e.g., the Gini function or the twoing rule) for each distinct value of each variable, and the split that most enhances purity is chosen. For a small number of classes, almost all evaluation criteria provide the same results in terms of misclassified units, deviance and number of end nodes, but the choice matters for a larger number of classes, where the best criterion is the one yielding the smallest number of misclassified units. Here we propose an impurity-based evaluation criterion that fulfils all the required properties of an evaluation criterion (Breiman et al., 1984): (i) the node impurity function achieves its maximum value when the units in a node are distributed equally over the J classes; (ii) a node is pure when all observations in the node belong to a single class; (iii) the node impurity function is symmetric. We conducted a simulation study to test the performance of the proposed criterion on many real-life datasets available in the UCI repository and observed that the proposed strategy provides improved results.

Key words: Classification trees, Evaluation criteria, Misclassification rate, Deviance

References

Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984): Classification and regression trees. Wadsworth International Group, Belmont, CA.

− 7 −
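The goodness-of-split computation sketched in the abstract can be illustrated with the Gini function (a generic sketch of the standard measure, not the authors' proposed criterion):

```python
from collections import Counter

def gini(labels):
    """Gini impurity i(t) = 1 - sum_j p(j|t)^2 for a node's labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def goodness_of_split(parent, left, right):
    """Impurity decrease: i(t) - p_L * i(t_L) - p_R * i(t_R)."""
    n = len(parent)
    return (gini(parent)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))

# A split that separates a balanced two-class node perfectly removes
# all impurity: the decrease equals the parent impurity of 0.5.
parent = ["a"] * 5 + ["b"] * 5
delta = goodness_of_split(parent, parent[:5], parent[5:])
```

Note how the Gini function already satisfies the three Breiman properties listed above: it is maximal for an equal class distribution, zero for a pure node, and symmetric in the class proportions.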


Evaluation Strategies for Learning Algorithms of Hierarchical Structures

Korinna Bade 1 and Dominik Benz 2

1 Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, D-39106 Magdeburg, Germany, Email: korinna.bade@ovgu.de
2 Department of Electrical Engineering/Computer Science, University of Kassel, D-34121 Kassel, Germany, Email: benz@cs.uni-kassel.de

Abstract. The idea to automatically induce a hierarchical structure among a set of objects, or to integrate a given hierarchy into the learning process, is common to a number of disciplines such as hierarchical clustering (Bade and Nürnberger, 2008) and classification or ontology learning. A crucial aspect here is how to assess the quality of the learned hierarchical scheme. Existing evaluation approaches can be broadly classified into methods defining quality metrics on the resulting scheme alone and methods which invoke an external "gold standard" for comparison. We focus on the latter case, for which various similarity metrics have been proposed, mostly depending on the characteristics of the applied learning procedure.

This work aims at bringing together the different disciplines by presenting and comparing existing gold-standard based evaluation methods for learning algorithms that generate hierarchical structures. We present an interdisciplinary framework in order to enable comparison across the different contexts from which the metrics originate. Our goal is to emphasize the strong similarities of evaluation tasks in different disciplines and to create a general pool of evaluation methods. Based on prior work (Dellschaft and Staab, 2006), we analyze properties of (good) evaluation measures. Different types of structural errors in the learned hierarchies are identified and their effects on existing measures are shown. Observing the strengths and weaknesses of existing methods, we also suggest some new methods.

Key words: evaluation metrics, hierarchical clustering, ontology learning, gold standard

References

Bade, K. and Nürnberger, A. (2008): Creating a Cluster Hierarchy under Constraints of a Partially Known Hierarchy. In: Proceedings of the 2008 SIAM International Conference on Data Mining. (to appear)
Dellschaft, K. and Staab, S. (2006): On How to Perform a Gold Standard Based Evaluation of Ontology Learning. In: Proc. of 5th Int. Semantic Web Conference, 228–241.

− 8 −
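One simple gold-standard comparison, in the spirit of the taxonomic-overlap measures analysed by Dellschaft and Staab (2006), scores each concept by how well its ancestor set in the learned hierarchy matches the gold standard. A simplified sketch (function names and the toy hierarchies are ours):

```python
def ancestors(parent_of, node):
    """Ancestor set of `node` in a child -> parent mapping."""
    out = set()
    while node in parent_of:
        node = parent_of[node]
        out.add(node)
    return out

def taxonomic_f1(learned, gold, concepts):
    """Average ancestor-set overlap between two hierarchies."""
    prec = rec = 0.0
    for c in concepts:
        a, g = ancestors(learned, c), ancestors(gold, c)
        if not a and not g:          # the root agrees trivially
            prec += 1.0
            rec += 1.0
            continue
        common = len(a & g)
        prec += common / len(a) if a else 1.0
        rec += common / len(g) if g else 1.0
    p, r = prec / len(concepts), rec / len(concepts)
    return 2 * p * r / (p + r) if p + r else 0.0

gold = {"B": "A", "C": "A", "D": "B"}      # gold standard: D under B
learned = {"B": "A", "C": "A", "D": "C"}   # structural error: D under C
score = taxonomic_f1(learned, gold, ["A", "B", "C", "D"])
```

Misplacing a single leaf only partially penalises the score (here the shared ancestor A still counts), which is exactly the kind of behaviour the paper examines across different structural error types.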


Autonomous Forex Trading Agents

Rui Pedro Barbosa 1 and Orlando Belo 2

1 Department of Informatics, University of Minho, 4710-057 Braga, Portugal, rui.barbosa@di.uminho.pt
2 Department of Informatics, University of Minho, 4710-057 Braga, Portugal, obelo@di.uminho.pt

Abstract. Trading in financial markets is undergoing a radical transformation, one in which algorithmic methods are becoming increasingly important. The development of intelligent agents that can act as autonomous traders of financial instruments seems like a logical step forward in this "algorithms arms race". With this in mind, our study proposes an infrastructure for implementing hybrid intelligent agents with the ability to trade in the Forex market without requiring human supervision. This infrastructure is composed of three modules. The Intuition Module, implemented using an Ensemble Model, is responsible for performing pattern recognition and predicting the direction of the exchange rate. The A Posteriori Knowledge Module, implemented using a Case-Based Reasoning System, enables the agents to learn from empirical experience and is responsible for suggesting how much to invest in each trade. Finally, the A Priori Knowledge Module, implemented using a Rule-Based Expert System, enables the agents to incorporate non-experiential knowledge in their trading decisions. This infrastructure was used to implement two agents, one capable of trading the USD/JPY currency pair and the other capable of trading the EUR/USD currency pair, both on a 6-hour timeframe. Using 12 months of out-of-sample data, the USD/JPY agent performed 826 simulated trades and obtained an average profit per trade of 6.88 pips. It accurately predicted the direction of the price in 54.72% of the trades, 65.74% of which were profitable. Over the same period, the EUR/USD agent performed 885 trades, with an average profit of 6.06 pips per trade. Its accuracy in predicting the direction of the price was 52.99%, and 60.45% of its trades were profitable. These agents were integrated with an Electronic Communication Network and have been trading live for the past several months. So far their live trading results are consistent with the simulated results, which leads us to believe that our infrastructure can be of practical interest to the traditional trading community.

Key words: Forex trading, Hybrid agents, Autonomy

− 9 −


Improving Product Line Design with Bundling<br />

Niels Becker and Brigitte Werners<br />

Faculty of Economics and Business Administration,<br />

Ruhr-University Bochum, 44780 Bochum, Germany<br />

niels.becker@ruhr-uni-bochum.de and or@ruhr-uni-bochum.de<br />

Abstract. Designing and pricing new products is of particular importance in many<br />

industries. In order to meet heterogeneous customer needs, many companies offer different<br />

variants of every product type. To support these product line design decisions,<br />

various mathematical programming approaches have been developed (Steiner and<br />

Hruschka, 2003). Most models are based on part-worth utilities, estimated within a<br />

conjoint framework. Besides, bundling is an important tool in marketing. It has been<br />

shown that bundling can transfer customers’ willingness to pay from one product to<br />

another. Therefore, prices can be differentiated so that higher profits are obtainable<br />

(Simon and Wübker, 1999). For determining optimal bundles and prices, Hanson<br />

and Martin (1990) have suggested a well-known linear programming model.<br />

Here the problem of optimally designing, bundling and pricing new products<br />

is investigated. One of the questions is, at which point in time bundling decisions<br />

should be made. Therefore, we compare product line design without bundling with<br />

sequential bundling, which means bundling subsequent to product line decisions, and<br />

simultaneous bundling, which means determining optimal bundles and product lines<br />

simultaneously. We developed a combined product line design and bundling model<br />

and present the impact on profits using simulated data. For this example, optimal<br />

results can be obtained using MILP-Software. Our studies show that simultaneous<br />

bundling leads to differently designed products and can improve profits substantially.<br />

Key words: Product Line Design, Pricing, Bundling, Optimization<br />

References<br />

Hanson, W. and Martin, K. (1990): Optimal Bundle Pricing. Management Science,<br />

36(2), 155–174.<br />

Simon, H. and Wübker, G. (1999): Bundling - A Powerful Method to Better Exploit<br />

Profit Potential. In: R. Füderer, A. Hermann and G. Wübker (Eds.): Optimal<br />

Bundling, Springer-Verlag, Heidelberg-Berlin, 7–28.<br />

Steiner, W. and Hruschka, H. (2003): Genetic Algorithms for Product Design: How<br />

Well Do They Really Work? Int. Journal of Market Research, 45(2), 229–240.<br />

− 10 −


Estimating the number of buildings in<br />

Germany<br />

Martin Behnisch 1 and Alfred Ultsch 2<br />

1 Institute of Historic Building Research and Conservation, ETH Hoenggerberg,<br />

HIL D 25.9, CH-8093 Zurich. Behnisch@arch.ethz.ch<br />

2 Datenbionic Research Group, Hans-Meerwein-Strasse, Philipps-University<br />

Marburg, D-35032 Marburg. Ultsch@Mathematik.Uni-Marburg.de<br />

Abstract. The building stock can be considered the largest physical, economical<br />

and cultural capital of a society. For the German building stock, many institutions<br />

record different kinds of data. Unfortunately, there are only a few basic statistics<br />

about the number of buildings. Data collection is therefore very complicated and<br />

often expensive, and the handling of missing data is one of the biggest obstacles.<br />

With the exception of data about residential buildings and, in particular, monuments,<br />

determining the total number of buildings is an unsolved problem. The main<br />

issue of this article is the description of an estimation procedure for this task. Using<br />

methods from the so-called Urban Knowledge Discovery approach, the authors find<br />

unsuspected relationships in the urban data which can be used for the estimation.<br />

The developed estimation procedure relies on 12430 municipalities and refers to data<br />

from the Cadaster of Real Estates and the Federal Bureau of Statistics. With this<br />

estimation it is possible to use statistical data from well known and easily accessible<br />

institutions. The number of buildings is estimated for regions with missing data.<br />

The quality of the estimation is assessed using training and test data sets. Information<br />

optimization leads to the conclusion that 20% of the municipalities hold 80% of all<br />

buildings. Therefore, to improve the estimation it is essential to refine<br />

the amount and quality of data in the larger municipalities.<br />

Key words: Spatial Planning, Engineering, Knowledge Discovery, Data Mining,<br />

Building Stock<br />


− 11 −


Probabilistic Distance Clustering<br />

Adi Ben-Israel<br />

Rutgers University, RUTCOR<br />

Summary. A new iterative method [1] for probabilistic clustering of data is presented.<br />

Given clusters, their centers, and the distances of data points from these<br />

centers, the probability of cluster membership at any point is assumed inversely<br />

proportional to the distance from (the center of) the cluster in question.<br />

The resulting method is a generalization, to several centers, of the Weiszfeld<br />

method for solving the Fermat–Weber location problem. At each iteration, the<br />

distances (Euclidean, Mahalanobis, etc.) from the cluster centers are computed for<br />

all data points, and the centers are updated as convex combinations of the data<br />

points, with weights determined by the above principle. Computations stop when<br />

the centers stop moving.<br />

This approach also works for problems where the cluster sizes are unknown (to<br />

be estimated), giving a viable alternative to the EM method [2].<br />

Progress is monitored by the joint distance function (JDF), a measure of<br />

distance from all cluster centers, which evolves during the iterations and captures the<br />

data in its low contours. This is a new concept in data reduction and representation.<br />

A duality theory for the JDF is given in [3].<br />

The method is simple, fast (requiring a small number of cheap iterations) and<br />

is not sensitive to outliers.<br />
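
The iteration described above can be sketched in a few lines (a minimal sketch, assuming Euclidean distances and Weiszfeld-type weights proportional to p²/d; the exact weights are given in [1], so this choice is an assumption, and all names are illustrative):<br />

```python
import numpy as np

def pd_cluster(X, centers, max_iter=100, tol=1e-8):
    """Probabilistic distance clustering: membership probabilities are
    inversely proportional to the distance from each center, and centers
    are updated as convex combinations of the data points (a multi-center
    Weiszfeld step)."""
    C = np.asarray(centers, dtype=float)
    p = None
    for _ in range(max_iter):
        # distances of all data points from all centers
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)                    # guard: point sitting on a center
        inv = 1.0 / d
        p = inv / inv.sum(axis=1, keepdims=True)    # p_k(x) proportional to 1/d_k(x)
        w = p ** 2 / d                              # assumed Weiszfeld-type weights
        new_C = (w.T @ X) / w.sum(axis=0)[:, None]  # convex combinations of the points
        if np.linalg.norm(new_C - C) < tol:         # "centers stop moving"
            C = new_C
            break
        C = new_C
    return C, p
```

Each iteration is cheap (one distance matrix and one weighted average per cluster), which is the source of the method's speed.<br />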

Key words: Probabilistic clustering, Weiszfeld method, Joint distance function<br />

References<br />

[1] Ben-Israel, A. and Iyigun, C.: Probabilistic Distance Clustering. Journal of<br />

Classification (to appear). http://benisrael.net/J-CLASSIFICATION-07.pdf<br />

[2] Iyigun, C. and Ben-Israel, A.: Probabilistic Distance Clustering Adjusted for<br />

Cluster Size. Probability in the Engineering and Informational Sciences (to appear).<br />

http://benisrael.net/PEIS-07.pdf<br />

[3] Iyigun, C. and Ben-Israel, A.: Contour Approximation of Data: A Duality<br />

Theory (submitted). http://benisrael.net/DUAL-12-20-07.pdf<br />

− 12 −


Hedge Funds in a Bayesian Asset Allocation<br />

Framework: Incorporating information on<br />

market states and manager’s ability<br />

Wolfgang Bessler 1 and Julian Holler 2<br />

1 Center for Finance and Banking, Licher Strasse 74, 35394 Giessen<br />

Wolfgang.Bessler@wirtschaft.uni-giessen.de<br />

2 Center for Finance and Banking, Licher Strasse 74, 35394 Giessen<br />

Julian.Holler@wirtschaft.uni-giessen.de<br />

Abstract. A growing number of private and institutional investors make significant<br />

allocations to hedge funds in order to improve the risk-return trade-off of their portfolios.<br />

However, in a portfolio context there are a number of issues that are special<br />

to hedge funds. We attempt to address these issues in a Bayesian asset allocation<br />

framework (Pastor 2000). In particular, we focus on the returns of two representative<br />

equity hedge fund strategies constructed by replication of two well-known<br />

statistical arbitrage strategies. Importantly, this approach allows us to obtain daily<br />

return observations despite the fact that most funds only report at a monthly interval.<br />

Using this framework, we investigate the following two research questions.<br />

First, we address the issue that many arbitrage strategies exhibit substantial exposures<br />

to financial crises, which is reflected in their high levels of kurtosis and negative<br />

skewness. By including relevant state variables in the prior distribution, we infer<br />

whether investors can improve the risk-adjusted performance of their portfolios by<br />

reducing their exposures prior to the onset of a crisis. Second, investors should only<br />

pay high fees to hedge fund managers if they earn additional alpha above the returns<br />

generated by our dynamic trading strategy. Thus, we attempt to analyze how much<br />

confidence an investor has to put into a manager’s abilities by varying her prior<br />

beliefs about alpha.<br />

References<br />

Pastor, L. (2000): Portfolio Selection and Asset Pricing Models. Journal of Finance,<br />

55, 179–223<br />

Key words: Asset Allocation, Alternative Investments, Hedge Funds<br />

− 13 −


Categorical Data in PLS Path modeling<br />

Jörg Betzin<br />

German Centre of Gerontology (DZA), Berlin, Germany<br />

joerg.betzin@dza.de<br />

Summary. There are a lot of surveys with categorical data where the relationships<br />

between the variables should be modeled in a path model with latent variables, but so<br />

far there are only a few possibilities to do so. We present a way using categorical<br />

manifest variables (MV’s) in PLS.<br />

The main idea is, on the one hand, to think of PLS as a generalization of PCA<br />

(principal component analysis) or canonical correlation analysis and, on the other hand,<br />

to use the framework of Correspondence Analysis (CorA) as a generalization of PCA for<br />

categorical variables, and to put these two approaches together.<br />

In the basic PLS algorithm the latent variables (LV’s) ηm (m = 1, ..., M) are<br />

estimated as weighted sums of their manifest variables (with data matrices Ym),<br />

ηm = Ymωm, where the pooled weight vector ω = (ω′1, ..., ω′M)′ is the result of an<br />

iteration algorithm of the form<br />

ω = ((Y′Y) ∗ P) ω<br />

with an additional normalization constraint, where Y = (Y1, ..., YM), P is a weight<br />

matrix that changes between iteration steps, and ’∗’ denotes the elementwise<br />

matrix product. The key point is the use of the covariance matrix Y′Y inside the<br />

iteration algorithm.<br />

Now, one main aspect of CorA is the transformation of the raw data matrix Ym<br />

into an indicator matrix Gm and the analysis of a kind of correlation matrix for Gm.<br />

Let Q̃m denote a suitable transformation of Gm such that the elements of Q̃′mQ̃m are<br />

roots of χ2-components from two-dimensional contingency tables for the columns of Gm.<br />

Then, in short, Q̃ = (Q̃1, ..., Q̃M) will be used as an equivalent for the covariance<br />

matrix Y′Y in the PLS iteration algorithm.<br />
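
The generic weight iteration can be sketched as follows (a sketch only: the inner-weight matrix P is treated as fixed here, whereas in a full PLS algorithm it is recomputed from the path model at every step; all names are illustrative):<br />

```python
import numpy as np

def pls_weight_iteration(Y_blocks, P, tol=1e-10, max_iter=500):
    """Generic weight iteration omega <- ((Y'Y) * P) omega with the pooled
    weight vector renormalized at every step; '*' is the elementwise
    (Hadamard) product.  P is treated as constant here for simplicity."""
    Y = np.hstack(Y_blocks)            # pooled data matrix Y = (Y1, ..., YM)
    C = Y.T @ Y                        # the matrix Y'Y driving the iteration
    omega = np.ones(Y.shape[1])
    omega /= np.linalg.norm(omega)     # normalization constraint
    for _ in range(max_iter):
        new = (C * P) @ omega          # elementwise product, then matrix-vector
        new /= np.linalg.norm(new)
        done = np.linalg.norm(new - omega) < tol
        omega = new
        if done:
            break
    # latent variables eta_m = Y_m omega_m, one per block
    etas, start = [], 0
    for Ym in Y_blocks:
        k = Ym.shape[1]
        etas.append(Ym @ omega[start:start + k])
        start += k
    return omega, etas
```

For categorical data, C would be replaced by the matrix Q̃′Q̃ built from the transformed indicator matrices, as described above.<br />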

We will show results for different examples using basic PLS algorithms and<br />

using PLS algorithms adapted for categorical manifest variables, together with interpretations<br />

of the weights ωm in the case of categorical data and of the other model<br />

parameters like correlations and regression coefficients.<br />

Key words: Partial Least Squares, Correspondence Analysis, Categorical Data<br />

− 14 −


Choosing the number of clusters in the latent<br />

class model<br />

Christophe Biernacki 1 , Gilles Celeux 2 , and Gérard Govaert 3<br />

1 Université Lille 1 UMR CNRS 8524, France<br />

Christophe.Biernacki@math.univ-lille1.fr<br />

2 INRIA Saclay, France Gilles.Celeux@inria.fr<br />

3 UTC Compiègne UMR CNRS 6599 Heudiasyc, France Gerard.Govaert@utc.fr<br />

Abstract. The latent class model or multivariate multinomial mixture is a powerful<br />

model for clustering discrete data. This model is expected to be useful to represent<br />

nonhomogeneous populations. It uses a conditional independence assumption given<br />

the latent class to which a statistical unit belongs. In this presentation, we exploit<br />

the fact that a fully Bayesian analysis of the latent class model with Jeffreys<br />

non-informative prior distributions allows one to derive, without technical difficulty,<br />

the exact integrated complete likelihood. Then, we exploit this integrated complete<br />

likelihood as a criterion to assess the number of mixture components in a cluster<br />

analysis perspective. We highlight with numerical experiments how this exact criterion<br />

could outperform the BIC-like asymptotic approximation generally used to<br />

choose a sensible number of clusters derived from the latent class model.<br />
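
The closed-form computation behind such a criterion can be sketched as follows (a sketch under the assumptions stated above: Jeffreys Dirichlet(1/2) priors and conditional independence given the class; the integer-coded data encoding and the function name are illustrative):<br />

```python
from math import lgamma

def exact_icl(z, X, K, n_levels):
    """Exact integrated complete likelihood for the latent class model with
    Jeffreys Dirichlet(1/2) priors: the complete-data likelihood integrates
    in closed form into ratios of Gamma functions, one factor for the mixing
    proportions and one per (class, variable) conditional multinomial."""
    n, d = len(X), len(X[0])
    n_k = [0] * K
    for zi in z:
        n_k[zi] += 1
    # factor for the mixing proportions (Dirichlet(1/2, ..., 1/2) prior)
    icl = lgamma(K / 2) - K * lgamma(0.5)
    icl += sum(lgamma(nk + 0.5) for nk in n_k) - lgamma(n + K / 2)
    # one multinomial factor per class and per variable
    for k in range(K):
        for j in range(d):
            m = n_levels[j]
            cnt = [0] * m
            for zi, xi in zip(z, X):
                if zi == k:
                    cnt[xi[j]] += 1
            icl += lgamma(m / 2) - m * lgamma(0.5)
            icl += sum(lgamma(c + 0.5) for c in cnt) - lgamma(n_k[k] + m / 2)
    return icl
```

In a cluster analysis one would maximize this quantity over partitions z for each candidate K, then compare the resulting values across K.<br />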

Key words: Latent Class model, Integrated complete likelihood, Model selection<br />


− 15 −


Clustering of molecules and structured data<br />

Gilles Bisson<br />

Laboratoire TIMC-IMAG - Equipe AMA , Faculté de Médecine<br />

Summary. The discovery or the synthesis of molecules that activate or inhibit some<br />

biological systems is a central issue for biological research and health care. The objective<br />

of High Throughput Screening (HTS) is to rapidly evaluate, through automated<br />

approaches, the activity of a given collection of molecules on a given biological target<br />

that can be an enzyme or a whole cell. In practice, the results of an HTS test allow one to<br />

highlight some tens of active molecules, called the ”hits”, representing a very small<br />

percentage of the initial collection. However, these tests are just the beginning of the<br />

work, since the identified molecules generally do not have good characteristics in<br />

terms of sensitivity and specificity (a relevant molecule must be specific to the biological<br />

target and should be effective at a small concentration). In such a context, it<br />

is crucial to provide the chemists with tools to explore the contents of their<br />

chemical libraries and especially to make easier the search for molecules that are<br />

structurally similar to the ”hits”. A possible approach, given a relevant distance, is<br />

to seek the nearest neighbours of those hits. More broadly, chemists have a need for<br />

methods to automatically organize the collections of molecules in order to locate the<br />

active molecules within the chemical space. Above all, they would like to evaluate<br />

the real diversity of the chemical structures contained in a collection. Clustering<br />

methods are well suited to carry out this kind of task. However, with structurally<br />

complex objects such as molecules, it is obvious that the quality of the results depends<br />

on the capacity of the distance used by the clustering method to grasp the<br />

structural similarities and also to take into account all the background knowledge<br />

of the chemists. The search for a structural distance between molecules is clearly<br />

related (but not totally equivalent) to the search for isomorphic partial subgraphs,<br />

which is a NP-complete problem. To overcome this problem many methods use an<br />

”a priori” molecular linearization: a molecule is represented by a vector of descriptors,<br />

each one corresponding to a molecular fragment, and well-known distances can<br />

be used. However, over the last ten years, kernel functions, comparable to distances<br />

between graphs, have been proposed in the Support Vector Machines framework.<br />

In these approaches, molecular representation is more accurate. It can be based on<br />

a set of paths (i.e. molecular fragments specifically chosen or randomly selected), or,<br />

more interestingly, use the whole molecule to try to assess structural distances by<br />

dynamically exploring the mapping that can be established between two molecules.<br />

− 16 −


The K -INDSCAL Model for Heterogeneous<br />

Three-way Dissimilarity Data<br />

Laura Bocci 1 and Maurizio Vichi 2<br />

1 Department of Sociology and Communication, University of Rome “La<br />

Sapienza”, Rome, Italy laura.bocci@uniroma1.it<br />

2 Department of Statistics, Probability and Applied Statistics, University of Rome<br />

“La Sapienza”, Rome, Italy maurizio.vichi@uniroma1.it<br />

Abstract. The weighted Euclidean model proposed by Carroll and Chang (1970) is<br />

the most well-known and used model of multidimensional scaling of three-way data.<br />

INDSCAL states a unique representation for the objects (common configuration<br />

space) and for each occasion weights for the dimensions of this representation (individual<br />

differences weights), thus assuming that there are no systematic<br />

“strong” differences between the dissimilarity data sources. However, when heterogeneous<br />

occasions are observed, it is shown that INDSCAL may fail to identify a<br />

common space representative of the observed data structure. In such a frequent and<br />

realistic situation it is reasonable to assume that there are systematic differences<br />

among some, say, K clusters of occasions in the evaluation of the dissimilarities,<br />

so that within each cluster of occasions the evaluation may differ only because of<br />

sampling or measurement errors; while between clusters of occasions dissimilarities<br />

are really different. The heterogeneous INDSCAL in K classes model, simply called<br />

K -INDSCAL, is proposed to handle the above described heterogeneity in the data.<br />

The model includes the individual weights in order to preserve the rotational invariance<br />

of the INDSCAL model. The high number of parameters of INDSCAL, and<br />

consequently of K -INDSCAL, may produce instability of the estimates; thus a parsimonious<br />

model, which drastically reduces the number of parameters, is also discussed.<br />

The parameters of the model are estimated in a least-squares fitting context and an<br />

efficient coordinate descent algorithm is given. The usefulness of K -INDSCAL is<br />

demonstrated by both artificial and real data analyses.<br />
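
For reference, the model distances of the weighted Euclidean (INDSCAL) model can be computed as follows (a short sketch; X is the common configuration and W holds the individual dimension weights, one row per occasion):<br />

```python
import numpy as np

def indscal_distances(X, W):
    """Model distances of the weighted Euclidean (INDSCAL) model:
    occasion k sees d_k(i, j) = sqrt(sum_s w_ks (x_is - x_js)^2), i.e. the
    common configuration X stretched per occasion by the weights in W."""
    diff2 = (X[:, None, :] - X[None, :, :]) ** 2        # n x n x r squared gaps
    return np.sqrt(np.einsum('ijs,ks->kij', diff2, W))  # one distance matrix per occasion
```

In the K-INDSCAL setting, occasions within a cluster share (up to error) the same weight row, while rows differ systematically between clusters.<br />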

Key words: Three-way dissimilarity data, INDSCAL, heterogeneous data dissimilarities<br />

References<br />

Carroll, J.D. and Chang, J.J. (1970): Analysis of individual differences in multidimensional<br />

scaling via an N-generalization of the Eckart-Young decomposition.<br />

Psychometrika, 35, 283–319.<br />

− 17 −


Weighting and Selecting Features<br />

in Fuzzy Clustering<br />

Christian Borgelt<br />

European Center for Soft Computing<br />

c/ Gonzalo Gutiérrez Quirós s/n, 33600 Mieres, Spain<br />

christian.borgelt@softcomputing.es<br />

Abstract. A serious problem in distance-based clustering is that the more dimensions<br />

(attributes) a datasets has, the more the distances between data points—and<br />

thus also the distances between data points and constructed cluster centers—tend to<br />

become uniform. This, of course, impedes the effectiveness of clustering, as distance-based<br />

clustering exploits that these distances differ. In addition, in practice often<br />

only a subset of the available attributes is relevant for forming clusters, even though<br />

this may not be known beforehand. In such cases it is desirable to have a clustering<br />

algorithm that automatically weights the attributes or even selects a proper subset.<br />

In this contribution I study the problem of weighting and selecting features in<br />

clustering and in particular in fuzzy clustering. Apart from reviewing straightforward<br />

modifications of Gustafson–Kessel fuzzy clustering (Gustafson and Kessel 1979) and<br />

attribute weighting fuzzy clustering (Keller and Klawonn 2000) that lead to simple,<br />

but effective attribute weighting schemes, I introduce a new feature selection method<br />

by applying the idea of an alternative to the fuzzifier (Klawonn and Höppner 2003)<br />

to the latter scheme. The resulting combined feature weighting and selection method<br />

has the advantage that the obtained clustering result on the chosen subspace coincides<br />

with the projection of the result obtained on the full data space. Finally I<br />

discuss an extension of this scheme to principal axes selection.<br />
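
A minimal sketch of attribute-weighting fuzzy c-means in the spirit of Keller and Klawonn (2000) is given below; the update rules follow the standard alternating-optimization derivation and are an assumption here, not the authors' verbatim algorithm:<br />

```python
import numpy as np

def weighted_fcm(X, c, m=2.0, t=2.0, n_iter=100, V0=None, seed=0):
    """Fuzzy c-means with global attribute weights w_j (sum_j w_j = 1):
    alternates the usual membership and center updates, computed with the
    weighted distance sum_j w_j^t (x_ij - v_kj)^2, with a closed-form
    update of the attribute weights."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    V = np.array(V0, float) if V0 is not None else X[rng.choice(n, c, replace=False)]
    w = np.full(d, 1.0 / d)                        # start from uniform weights
    for _ in range(n_iter):
        D2 = (X[:, None, :] - V[None, :, :]) ** 2  # n x c x d squared distances
        dw = np.maximum((D2 * w ** t).sum(axis=2), 1e-12)
        ratio = (dw[:, :, None] / dw[:, None, :]) ** (1.0 / (m - 1))
        U = 1.0 / ratio.sum(axis=2)                # memberships, rows sum to 1
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]   # center update
        s = (Um[:, :, None] * D2).sum(axis=(0, 1)) # per-attribute cost
        w = np.maximum(s, 1e-12) ** (-1.0 / (t - 1))
        w /= w.sum()                               # keep sum_j w_j = 1
    return U, V, w
```

Attributes with small within-cluster cost receive large weights, so irrelevant attributes are automatically downweighted, which is the effect the abstract builds on.<br />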

Key words: fuzzy clustering, feature weighting, feature selection<br />

References<br />

1. Gustafson, E.E., and Kessel, W.C. (1979): Fuzzy Clustering with a Fuzzy Covariance<br />

Matrix. Proc. IEEE Conf. on Decision and Control (CDC 1979, San<br />

Diego, CA), 761–766. IEEE Press, Piscataway, NJ, USA.<br />

2. Keller, A., and Klawonn, F. (2000): Fuzzy Clustering with Weighting of Data<br />

Variables. Int. Journal of Uncertainty, Fuzziness and Knowledge-based Systems<br />

8:735-746. World Scientific, Hackensack, NJ, USA.<br />

3. Klawonn, F., and Höppner, F. (2003): What is Fuzzy about Fuzzy Clustering?<br />

Understanding and Improving the Concept of the Fuzzifier. Proc. 5th Int.<br />

Symposium on Intelligent Data Analysis (IDA 2003, Berlin, Germany), 254–<br />

264. Springer-Verlag, Berlin, Germany.<br />

− 18 −


Hidden Markov Model Based<br />

Classification of Natural Objects<br />

in Aerial Pictures<br />

Mohamed El Yazid Boudaren 1 , Abdenour Labed 1 , Adel Aziz Boulfekhar 1 ,<br />

and Yacine Amara 1<br />

Military Polytechnic School, Algiers boudarenyazid@hotmail.com<br />

Abstract. This work is part of a larger project that consists in creating virtual<br />

environments from aerial pictures combined with altimetry data. In such environments,<br />

when getting close to the ground, one has to solve the problem of limited<br />

texture resolution. So, these textures have to be amplified to get more realistic<br />

scenes. Texture amplification must take into account the nature of the objects. This paper deals<br />

with the supervised classification of picture pixels in order to amplify texture resolution.<br />

For this purpose, we propose a hidden Markov model based approach that takes<br />

into account the spatial dependencies between natural objects present in the area<br />

of interest. HMMs have long been used to efficiently model one-dimensional data,<br />

in particular in speech recognition systems. In theory, HMMs can be applied as<br />

well to multi-dimensional data. However, the complexity of the algorithms grows<br />

exponentially in higher dimensions, so that, even in dimension 2, the usage of plain<br />

HMM becomes prohibitive in practice. To overcome the 2D-HMM complexity, we<br />

propose a two-level HMM, where the higher layer comprises a single HMM constituted<br />

of super-states, each associated with one low-level HMM. Our model differs<br />

from classic embedded HMM in that it deals with pixel blocks instead of pixel lines<br />

as elementary symbols. Another difference is that our high-level HMM is ergodic;<br />

this enables our model to accurately model spatial dependencies between natural<br />

objects. The training of our HMM models is done in two steps: first, the low-level<br />

HMMs are trained on uni-textured pictures. Second, the high-level one is trained<br />

on multi-textured pictures of the same region using the parameters of the HMMs of the<br />

first step, according to the Baum-Welch algorithm with slight modifications. For our<br />

experiments, we used real world aerial pictures of a relatively large area, with a resolution<br />

of 50 centimeters. Our results were then used to generate a virtual interactive<br />

3D scene. This showed that our classifier was able to satisfactorily reproduce the<br />

original terrain.<br />

Key words: Hidden Markov models, supervised classification of aerial pictures, texture<br />

recognition<br />

− 19 −


On optimistic bias in reporting<br />

microarray-based classification accuracy<br />

Anne-Laure Boulesteix 1 and Martin Slawski 1<br />

1 Sylvia Lawry Centre for Multiple Sclerosis Research, Hohenlindenerstr. 1,<br />

81677-München, Germany, boulesteix@slcmsr.org<br />

Abstract. Almost all published medical studies present positive research results.<br />

In the special case of microarray studies, which often focus on, e.g., the identification<br />

of differentially expressed genes or the construction of outcome prediction<br />

rules, this means that almost all studies report at least a few significantly differentially<br />

expressed genes or a small prediction error, respectively. Authors are virtually urged<br />

to “find something significant” in their data, which encourages the publication of<br />

wrong research findings due to multiple comparison effects. If authors try a large<br />

number of different analysis methods and designs on their data, they are likely to<br />

obtain “acceptable results” with at least one of them. Microarray-based class prediction<br />

is particularly affected by this problem. Whereas logistic regression is routinely<br />

applied as the standard class prediction approach in the simple case where only a<br />

small number of predictors are available, there is no consensus on the procedure to<br />

be applied for classification using high-dimensional microarray data.<br />

It is well-known that, if several statistical methods are tried on the same microarray<br />

data set, one should report all results, not only the best ones (Dupuy and<br />

Simon, 2007). Through simulations and real data studies, we address this problem<br />

quantitatively and determine the effect of not respecting this “good practice” rule.<br />

Our approach consists of applying a large number of well-known classifiers combined<br />

with several variable selection procedures and different numbers of selected variables,<br />

and evaluating them following different schemes (see Boulesteix et al, <strong>2008</strong>, for an<br />

overview). The considered data sets are real publicly available microarray data sets,<br />

with or without random permutation of the class labels. The output of our study is<br />

the distribution of the minimally selected error rate and the bias resulting from this<br />

optimal selection in the different settings.<br />
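
The mechanism can be illustrated with a toy simulation (an illustration only, not the authors' study design): if every analysis strategy has true error 0.5, as with permuted class labels, reporting only the smallest estimated error rate is biased well below 0.5.<br />

```python
import numpy as np

def optimal_selection_bias(n_test=50, n_strategies=20, n_sim=2000, seed=0):
    """Mean of the minimum of n_strategies estimated error rates, each
    measured on n_test cases when the true error of every strategy is 0.5.
    Without selection the expectation would be 0.5; taking the minimum
    over strategies produces an optimistic bias."""
    rng = np.random.default_rng(seed)
    # estimated error rates are binomial counts divided by the test size
    errs = rng.binomial(n_test, 0.5, size=(n_sim, n_strategies)) / n_test
    return errs.min(axis=1).mean()
```

The bias grows with the number of strategies tried and shrinks with the test set size, which is why the abstract's large grid of classifiers, selection procedures, and evaluation schemes matters.<br />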

References<br />

Boulesteix, A.-L., Strobl, C., Augustin, T. and Daumer, M. (<strong>2008</strong>): Evaluating<br />

microarray-based classifiers: An overview. Cancer Informatics, 4.<br />

Dupuy, A. and Simon, R. (2007): Critical Review of Published Microarray Studies for<br />

Cancer Outcome and Guidelines on Statistical Analysis and Reporting. Journal<br />

of the National Cancer Institute, 99, 147–157.<br />

− 20 −


Practical experiences from Credit Scoring<br />

projects for Chilean financial organizations.<br />

Cristian Bravo, Sebastian Maldonado and Richard Weber<br />

Department of Industrial Engineering, University of Chile.<br />

cbravo@dii.uchile.cl, semaldon@ing.uchile.cl, rweber@dii.uchile.cl<br />

Abstract. All financial organizations that offer loans to their customers face the<br />

problem of determining whether the loaned money will be returned. Credit scoring systems<br />

have been successfully applied to determine the probability that a certain customer<br />

will fail in paying back the received credit. In many cases these systems are based<br />

on experience but offer a “closed solution” where the user has few possibilities to<br />

influence the decision process.<br />

We have developed credit scoring systems for several Chilean financial organizations<br />

mapping the KDD process (Knowledge Discovery in Databases) to their<br />

special needs. This paper presents our experiences from these projects and explains<br />

in detail how we solved the problems in each step of the KDD process.<br />

In the data mining step we applied Logistic Regression and Support Vector<br />

Machines as classification techniques, comparing both their performance and their<br />

flexibility. A particular wrapper approach for feature selection using Support Vector<br />

Machines has been developed. Comparing this approach with alternative schemes<br />

underlines its strengths in terms of classification performance and selected features.<br />

Since most KDD projects propose just static solutions we had to develop a<br />

module for model updating that will be described in detail. In particular we propose<br />

to apply statistical techniques in order to determine changes in feature weights and<br />

structural changes in the respective universe.<br />

During the development of our solutions the users gained important insights into<br />

their customers’ behavior; some of these were surprising, while others just confirmed notions<br />

the respective experts had before. By using the systems in daily operation the<br />

rate of false positives as well as false negatives could be reduced leading to a higher<br />

coverage of the respective market.<br />

Key words: Credit scoring, Classification, Support Vector Machines.<br />


− 21 −


Analyzing the Stability of Price Response<br />

Functions - Measuring the Influence of<br />

Different Parameters in a Monte Carlo<br />

Comparison<br />

Michael Brusch and Daniel Baier<br />

Institute of Business Administration and Economics,<br />

Brandenburg University of Technology Cottbus, Postbox 101344,<br />

D-03013 Cottbus, Germany {m.brusch|daniel.baier}@tu-cottbus.de<br />

Abstract. The usage and therefore the estimation of price response functions (see,<br />

e.g., Steiner et al. 2007) is very important for strategic marketing decisions. Typically<br />

price response functions with an empirical basis are used (see, e.g., Balderjahn<br />

1998). However, such price response functions are subject to a lot of disturbing influence<br />

factors, e.g. the assumed profit maximal price and the assumed corresponding<br />

quantity of sales.<br />

In such cases, the question how stable the found price response function is was<br />

not answered sufficiently up to now. In this paper, the question will be pursued how<br />

much (and what kind of) errors in market research are pardonable for a stable price<br />

response function. Innovative technologies and systems of house power engineering<br />

are used as application example (see Brusch et al. 2003). For the comparisons, a<br />

factorial design with synthetically generated and disturbed data is used.<br />

Key words: Monte Carlo comparison, Price response functions<br />

References<br />

Balderjahn, I. (1998): Empirical analysis of price response functions. In: I. Balderjahn,<br />

C. Mennicken, E. Vernette (Eds.): New Developments and Approaches in<br />

Consumer Behavior Research. Schäffer-Poeschel/Macmillan, 185–200.<br />

Brusch, M., Zühlsdorff, D., Baier, D. and Kessler, A. (2003): Neue Technologien und<br />

erneuerbare Energiequellen auf dem Vormarsch. Energiewirtschaftliche Tagesfragen,<br />

53, 12, 825–829.<br />

Steiner, W. J., Brezger, A. and Belitz, Ch. (2007): Flexible estimation of price<br />

response functions using retail scanner data. Journal of retailing and consumer<br />

services, 14, afl. 6 (11), 383–393.<br />

− 22 −


Motif-based Classification of Time Series with<br />

Bayesian Networks and SVMs<br />

Krisztian Antal Buza 1 and Lars Schmidt-Thieme 1<br />

University of Hildesheim, Information Systems and Machine Learning Lab<br />

{buza,schmidt-thieme}@ismll.uni-hildesheim.de<br />

Abstract. Classification of time series is of crucial importance in a wide range of<br />

applications. One of the possible solutions for this problem is based on characteristic<br />

local patterns of time series, so-called motifs [Patel 2002].<br />

We present a novel technique to make the classification of (multivariate) time<br />

series more accurate. We define different types of motifs. The simplest ones are frequent<br />

subseries. In the case of noisy time series, as well as in several application domains, these<br />

simple motifs are not sufficient; more complex ones are necessary. Complex motifs<br />

used in our work may consist of several subseries, continuous and non-continuous<br />

parts and “joker” parts. We show an efficient algorithm for mining complex motifs in<br />

time series. We extend the highly efficient implementation of the algorithm Apriori<br />

described in [Borgelt 2003] to our task.<br />

We evaluate our method on real medical data, which consist of time series of<br />

dialysis sessions. We compare different types of motifs according to their ability to<br />

predict the class of (multivariate) time series. We show that additional<br />

motif features significantly improve the accuracy of Bayesian Networks and Support<br />

Vector Machines for the classification of time series.<br />
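
The simplest motif type, frequent contiguous subseries, can be mined with the Apriori pruning idea roughly as follows (a sketch over a discretized series; the complex motif types with gaps and ”joker” parts require a richer candidate generation, which is not shown):<br />

```python
from collections import Counter

def frequent_motifs(symbols, max_len, min_count):
    """Apriori-style mining of frequent contiguous subseries (the "easy"
    motifs) from a discretized time series.  Candidates of length L+1 are
    generated only by extending frequent motifs of length L, which is the
    Apriori pruning idea."""
    frequent = {}
    # length-1 motifs
    counts = Counter((s,) for s in symbols)
    current = {m: c for m, c in counts.items() if c >= min_count}
    frequent.update(current)
    length = 1
    while current and length < max_len:
        counts = Counter()
        for i in range(len(symbols) - length):
            cand = tuple(symbols[i:i + length + 1])
            if cand[:-1] in current:           # Apriori pruning on the prefix
                counts[cand] += 1
        current = {m: c for m, c in counts.items() if c >= min_count}
        frequent.update(current)
        length += 1
    return frequent
```

The indicator "motif occurs in this series" (or its occurrence count) then becomes an additional feature for the Bayesian network or SVM classifier.<br />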

Key words: Time Series, Complex motifs, Bayesian Networks, SVM<br />

References<br />

Borgelt, C. (2003): Efficient Implementations of Apriori and Eclat. 1st Workshop of<br />

Frequent Item Set Mining Implementations (FIMI 2003, Melbourne, FL, USA).<br />

Kunik, V., Solan, Z., Edelman, S., Ruppin, E. and Horn, D. (2005): Motif<br />

Extraction and Protein Classification. IEEE Computational Systems Bioinformatics<br />

Conference (CSB’05), 80–85.<br />

Ferreira, P. G. and Azevedo, P. (2005): Protein Sequence Classification through<br />

Relevant Sequence Mining and Bayes Classifiers. Proceedings of the 12th Portuguese<br />

Conference on Artificial Intelligence, 236–247, LNAI 3808, Springer-Verlag.<br />

Patel, P., Keogh, E., Lin, J. and Lonardi, S. (2002): Mining Motifs in Massive<br />

Time Series Databases. Proceedings of the 2002 IEEE International Conference<br />

on Data Mining (ICDM 2002).<br />

− 23 −


Visualizing data in Gaussian mixture model<br />

classification<br />

Daniela G. Calò and Cinzia Viroli<br />

Department of Statistics - University of Bologna<br />

via Belle Arti, 41 - 40126 Bologna, Italy<br />

danielagiovanna.calo@unibo.it, cinzia.viroli@unibo.it<br />

Abstract. The paper presents a post-processing strategy for producing low-dimensional<br />

summary plots of the data after a Gaussian mixture classification model has been<br />

fitted. The most revealing projections are those along which the class-conditional<br />

densities are maximally separable. We consider a particular probability product kernel<br />

as a measure of similarity or affinity between class-conditional distributions. It<br />

takes an appealing closed form in the case of Gaussian mixture components. The<br />

performance of the proposed strategy has been evaluated on simulated and real data.<br />
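As a hedged numerical illustration of the closed form alluded to above (assuming the expected likelihood kernel, i.e. the probability product kernel of Jebara and Kondor with power 1, and showing only the univariate case): for Gaussian densities the kernel reduces to evaluating another Gaussian density, since ∫ N(x; μ1, v1) N(x; μ2, v2) dx = N(μ1 − μ2; 0, v1 + v2).<br />

```python
import math

def gaussian_product_kernel(mu1, var1, mu2, var2):
    """Expected likelihood kernel between two univariate Gaussians:
    the integral of N(x; mu1, var1) * N(x; mu2, var2) over x, which has
    the closed form N(mu1 - mu2; 0, var1 + var2)."""
    v = var1 + var2
    return math.exp(-0.5 * (mu1 - mu2) ** 2 / v) / math.sqrt(2 * math.pi * v)
```

In the multivariate case v1 + v2 becomes Σ1 + Σ2; for mixtures, affinities between class-conditional densities are sums of such component kernels weighted by the mixing proportions.<br />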

Key words: Gaussian mixture models, Low-dimensional plots, Normalized expected<br />

likelihood kernel, Bayes error<br />

References<br />

Chan, A.B., Vasconcelos, N. and Moreno, P.J. (2004): A Family of Probabilistic<br />

Kernels Based on Information Divergence. Technical Report, University of California,<br />

San Diego.<br />

Jebara, T. and Kondor, R. (2004): Probability Product Kernels. Journal of Machine<br />

Learning Research, 5, 819–844.<br />

McLachlan, G.J. and Peel, D. (2000): Finite Mixture Models. Wiley, New York.<br />

− 24 −


A novel approach to construct discrete support<br />

vector machine classifiers<br />

Marco Caserta and Stefan Lessmann<br />

Institute of Information Systems<br />

University of Hamburg, Germany<br />

Abstract. The support of managerial decision making by means of data mining<br />

has received considerable attention in the academic literature as well as corporate<br />

practice. This paper considers support vector machines (SVMs) which represent a<br />

popular classification method that may be used in data mining to, e.g., guide the selection<br />

of customers for a direct marketing campaign or assess the credibility of loan<br />

applications in financial applications. Recently, Orsenigo and Vercellis proposed a<br />

novel, discrete support vector machine (DSVM) and demonstrate its effectiveness in<br />

several empirical studies. Building a respective classifier involves solving an integer<br />

program which is a challenging computational task in general and in large-scale data<br />

mining settings in particular. This paper strives to improve upon a linear programming<br />

based heuristic, originally proposed by Orsenigo and Vercellis for solving the<br />

DSVM program. The core of the suggested procedure consists of a recursive algorithm<br />

that solves the (linear) relaxation of DSVM and exploits dual information to<br />

construct a smaller sized sub-program with integer constraints that may be solved<br />

to optimality. The sequence of linear and integer programs solved during the course<br />

of the algorithm provides upper and lower bounds on the final solution, which are<br />

employed as a termination criterion. Empirical experiments are conducted to scrutinize<br />

the suitability of the proposed procedure and examine the problem size (i.e.<br />

the number of examples and features) that can be processed with state-of-the-art<br />

integer programming techniques.<br />

− 25 −


Modeling the Classification of Heterogeneous<br />

Data<br />

Dorin Carstoiu, Tudor Ionescu, and Alexandra Cernian<br />

University of Bucharest<br />

Faculty of Automatic Control and Computer Science<br />

Bucharest, Romania<br />

{dorin.carstoiu,tudor.ionescu,alexcernian}@yahoo.com<br />

Abstract. The goal of this work is to study the feasibility of a Heterogeneous Data<br />

Classification and Search (HDCS) system and to provide a possible design for its<br />

implementation. In order to design an HDCS system we propose an actor-oriented<br />

modeling technique, for which we show the information flow. We have identified six<br />

different actors (subsystems) which collaborate to construct a file sheet and produce<br />

the final search result. The first five actors add information to the file sheet, which is<br />

afterwards used by the final actor to produce the desired result.<br />

Given the vast quantity of data and the variety of formats and encodings in which<br />

it exists, a semantic approach based on metadata has been chosen. Instead of digging<br />

into the actual data to extract information, we used the context of the file to collect its<br />

metadata. The metadata is afterwards used for the classification process. The reason<br />

for this approach is that data are made available by people who are interested in<br />

other people understanding what the respective data are about. This observation<br />

provided the confidence needed to pursue the presented approach.<br />

The HDCS system we propose combines techniques from conventional search<br />

systems, classification systems, search results clustering systems, while also providing<br />

original solutions, such as an innovative data sampling method.<br />

References<br />

Weiss, D. (2006): Descriptive Clustering as a Method for Exploring Text Collections.<br />

Boulanger, F. and Vidal-Naquet, G.: A Primitive Execution Model for Heterogeneous<br />

Modeling.<br />

Mbobi, M., Boulanger, F. and Feredj, M.: Issues of Hierarchical Heterogeneous<br />

Modeling in Component Reusability.<br />

Hardebolle, C., Boulanger, F., Marcadet, D. and Vidal-Naquet, G.: A<br />

Generic Execution Framework for Models of Computation.<br />

Zeng, H.-J., Chen, Z. and Ma, W.-Y.: A Unified Framework for Clustering<br />

Heterogeneous Web Objects.<br />

− 26 −


Applying Rough Set Theory to Constructing<br />

Knowledge Base for Critical Military<br />

Commodity Management<br />

Hua-Kai Chiou, 1 Yong-Ting Huang 2 and Gia-Shie Liu 3<br />

1 Department of International Business, China Institute of Technology. 245 Sec.3,<br />

Academia Rd., Nangang Taipei 11581, Taiwan. hkchiou@cc.chit.edu.tw<br />

2 Graduate School of Resources Management and Decision Science,National<br />

Defense University. 70, Sec.2, Central North Rd., Beitou Taipei 11258 Taiwan.<br />

coby777.tw@yahoo.com.tw<br />

3 Department of Information Management,Lunghwa University of Science and<br />

Technology. 300, Sec.1, Wanshou Rd., Gueishan Shiang, Taoyuan County 33306,<br />

Taiwan. liugtw@yahoo.com.tw<br />

Abstract. Reduction of pattern dimensionality via feature extraction and feature<br />

selection is among the most fundamental steps in data preprocessing. Feature selection<br />

is a valuable technique in data analysis for information-preserving data reduction.<br />

Features constituting an object's pattern may be relevant or irrelevant. A large<br />

number of methods, such as discriminant analysis, logit analysis and recursive partitioning<br />

algorithms, have been used in the past for prediction problems in pattern recognition.<br />

These traditional approaches suffer from some limitations, often due to<br />

unrealistic statistical assumptions or due to a confusing language of<br />

communication with the decision makers. In this paper, we present applications of<br />

rough set methods for feature selection in pattern recognition. Firstly, we employ a<br />

Delphi process to generate 39 critical attributes with respect to 6 key factors for<br />

evaluation. We then utilize the rough set approach to discover a set of rules able to<br />

discriminate among the considered attributes for critical military commodity management.<br />

Ten reduct sets and 226 rules were derived from our proposed model;<br />

concluding remarks are stated in the final section. Through this research<br />

we found that rough set theory is an efficient technique for pattern recognition in<br />

solving real-world decision-making problems.<br />
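A minimal, self-contained sketch of the reduct idea (our own toy example, not the proposed model): an attribute subset is a reduct candidate if it preserves the positive region, i.e. the set of objects whose decision is determined unambiguously by their attribute values.<br />

```python
def indiscernibility(table, attrs):
    """Partition object names by equality on the given attributes."""
    blocks = {}
    for name, row in table.items():
        blocks.setdefault(tuple(row[a] for a in attrs), set()).add(name)
    return list(blocks.values())

def positive_region(table, attrs, decision):
    """Objects classified unambiguously by attrs: blocks with one decision value."""
    pos = set()
    for block in indiscernibility(table, attrs):
        if len({table[o][decision] for o in block}) == 1:
            pos |= block
    return pos

def preserves_positive_region(table, attrs, all_attrs, decision):
    """True if attrs discriminates the decision as well as the full attribute set."""
    return positive_region(table, attrs, decision) == \
           positive_region(table, all_attrs, decision)
```

A reduct is then a minimal attribute subset for which `preserves_positive_region` holds; decision rules can be read off the blocks of its indiscernibility relation.<br />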

Key words: Pattern Recognition, Rough Set Theory, Delphi Process, Discriminant<br />

Analysis<br />

− 27 −


Correspondence Analysis for Exploring the<br />

Implementation of One Village One Product<br />

Programs in Taiwan<br />

Hua-Kai Chiou, 1 Benjamin J.C. Yuan 2 and Yen-Wen Wang 2,3<br />

1 Department of International Business, China Institute of Technology. 245 Sec.3,<br />

Academia Rd., Nangang Taipei 11581, Taiwan. hkchiou@cc.chit.edu.tw<br />

2 Institute of Management of Technology, National Chiao Tung University. 1001,<br />

Ta-Hsueh Rd., Hsinchu 30010, Taiwan. benjamin@cc.nctu.edu.tw<br />

3 Industrial Economics & Knowledge Center,Industrial Technology Research<br />

Institute. 195 Sec.4, Chung Hsing Rd., Chutung Hsinchu 31040, Taiwan.<br />

stevenwang@itri.org.tw<br />

Abstract. One Village One Product (OVOP) is a community-centered,<br />

demand-driven local economic development approach initiated by Oita Prefecture<br />

in Japan in the 1970s. The uniqueness of the approach is that it aims<br />

to achieve regional economic development by adding value to products<br />

made from locally available resources through processing, quality control and marketing.<br />

OVOP programs in Taiwan are supported by the government with the objective of promoting<br />

economic development and cooperative relationships among counties and<br />

regions through localization and innovation. Firstly, we employ a Delphi process to<br />

converge on 18 critical factors with respect to 3 key dimensions for evaluation. We then<br />

utilize correspondence analysis to explore the implementation of OVOP programs<br />

and derive some meaningful suggestions on policy direction from these empirical<br />

cases. Through this study we demonstrate that correspondence<br />

analysis is an effective technique for industrial analysis and strategy management<br />

in the real world.<br />

Key words: OVOP, Correspondence Analysis, Delphi Process, Industrial Analysis<br />

− 28 −


Extending Multivariate Planing<br />

Mario Cortina–Borja 1<br />

Centre for Paediatric Epidemiology and Biostatistics<br />

Institute of Child Health, University College London<br />

30 Guilford Street, London WC1N 1EH, UK<br />

M.Cortina@ich.ucl.ac.uk<br />

Abstract. Friedman and Rafsky (1981) introduced planing, a visualization technique<br />

based on a triangulation procedure for constructing 2–dimensional representations<br />

from a set of n multivariate observations. It preserves exactly a relatively small<br />

number of distances from the original distance matrix and plots the observations on the<br />

plane. The way these distances are selected and the order in which the observations<br />

are positioned into the plane are induced by a minimal spanning tree (MST ) of the<br />

data.<br />

Other spanning trees could be used to provide the set of distances to be preserved<br />

exactly in a 2–dimensional configuration. One is the exodic tree (ET ) (Gilbert,<br />

1965), which is a not quite minimal spanning tree, though it may be regarded as a<br />

close approximation to the MST (Cortina–Borja and Robinson, 2000). To construct<br />

an ET we choose any point of the dataset as the root and label it as P0; we then<br />

label the rest of the n points as {P1, P2, · · · , Pn−1}, the indices being assigned to<br />

order the points according to their increasing distance from P0. Next, we link each<br />

point Pi to the point Pj chosen from {P0, P1, · · · , Pi−1} in order to minimize its<br />

distance to Pi (i ≥ 1).<br />
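A minimal sketch of the ET construction described above (our own illustration, assuming a precomputed distance matrix with point 0 taken as the root P0):<br />

```python
def exodic_tree(dist):
    """Build an exodic tree (Gilbert, 1965) from a full distance matrix.

    dist[i][j] is the distance between points i and j. Point 0 is taken as
    the root P0, the remaining points are ordered by increasing distance
    from it, and each point is linked to its nearest predecessor in that
    ordering. Returns the list of edges (predecessor, point).
    """
    n = len(dist)
    order = sorted(range(n), key=lambda i: dist[0][i])  # P0, P1, ..., Pn-1
    edges = []
    for k in range(1, n):
        i = order[k]
        # nearest point among those already placed in the tree
        j = min(order[:k], key=lambda p: dist[p][i])
        edges.append((j, i))
    return edges
```

The edges of this tree supply the small set of distances that the planing configuration preserves exactly.<br />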

This paper extends two aspects of planing: first, obtaining 3–dimensional configurations;<br />

second, using the ET as the structure defining the distances to be preserved<br />

in the low–dimensional representation.<br />

Key words: Exodic Tree, Minimal Spanning Tree, Planing, Visualization<br />

References<br />

Cortina–Borja, M. and Robinson, T. (2000) Estimating the Asymptotic Constants of<br />

the Total Length of the Euclidean Minimal Spanning Tree with Power–Weighted<br />

Edges. Statistics and Probability Letters, 47, 125–128.<br />

Friedman, J.H. and Rafsky, L.C. (1981) Graphics for the Multivariate Two–Sample<br />

Problem. Journal of the American Statistical Association, 76, 277–287.<br />

Gilbert, E.N. (1965) Random Minimal Trees. SIAM Journal of Applied Mathematics,<br />

13, 376–387.<br />

− 29 −


Principal Axis Analysis with HDLSS bonuses!<br />

Frank Critchley 1 , Ana Pires 2 , and Conceição Amado 2<br />

1 Open University, UK<br />

F.Critchley@open.ac.uk<br />

2 IST, Lisbon<br />

Abstract. Principal axis analysis rotates standardised principal components to optimally<br />

detect subgroup structure, rotation being based on preferred directions in<br />

the spherised data. As such, it is a computationally efficient method of exploratory<br />

data analysis, particularly well-suited to detecting mixtures of elliptically contoured<br />

distributions. High dimensional, low sample size (HDLSS) data are also discussed.<br />

Overall, principal axis analysis exemplifies the maxim: two decompositions are better<br />

than one. More technically, it is an example of invariant coordinate selection<br />

(ICS).<br />

− 30 −


Augmenting Model-Based Clustering with<br />

Generalized Linkage methods<br />

Nema Dean 1 and Rebecca Nugent 2<br />

1 Department of Statistics, University of Glasgow, 15 University Gardens,<br />

Glasgow G12 8QW, UK. nema@stats.gla.ac.uk<br />

2 Department of Statistics, Carnegie Mellon University, Baker Hall, Pittsburgh,<br />

PA 15213, USA. rnugent@stat.cmu.edu<br />

Abstract. The fundamental assumption made by model-based clustering (Fraley<br />

and Raftery 1998) is that the groups or sub-populations underlying the data have<br />

(multivariate) Gaussian distributions, giving the overall population a finite mixture<br />

model distribution. A further assumption is that the number and type<br />

of components found to best fit the data are a good estimate of the number and type<br />

of true groups in the data. Given the shape assumptions implicit in the choice of<br />

Gaussian distributions - elliptical, symmetric contours - in cases of skewed, curved<br />

or more generally complex-shaped groups, the equivalence of the mixture model<br />

components and the underlying groups is likely false.<br />

Since general continuous densities can be modelled arbitrarily well by mixtures<br />

of Gaussian densities, the mixture model chosen may still be a good estimate of<br />

the density of the data but it is likely that more than one component is identified<br />

with each group. Generalized single linkage methods (Stuetzle and Nugent 2007)<br />

use density estimates to create density-based similarity (or dissimilarity) measures<br />

which can then be used as a replacement for Euclidean (or other types of) distance<br />

in hierarchical agglomerative methods. Applying generalized single linkage to the<br />

model-based clustering density estimate, we can use the resulting dendrogram to visualize the hierarchical<br />

structure of the components of the mixture model and make decisions about<br />

combining components to estimate groups. Since it is difficult to summarize<br />

information about complex-shaped groups concisely, offering a summary that is essentially<br />

a subset of components of the original mixture model with means and covariance<br />

matrices is an attractive alternative.<br />

Key words: Model-Based Clustering, Generalized Single Linkage Clustering<br />

References<br />

Fraley, C. and Raftery, A. E. (1998): How many clusters? which clustering method?<br />

- answers via model-based cluster analysis. The Computer Journal, 41, 578–588.<br />

Stuetzle, W. and Nugent, R. (2007): A generalized single linkage method for estimating<br />

the cluster tree of a density. Technical Report 514, Department of Statistics,<br />

University of Washington.<br />

− 31 −


Statistical analysis of human body movement<br />

and group interactions in response to music<br />

Frank Desmet, Marc Leman and Micheline Lesaffre<br />

IPEM, Department of Musicology, Ghent University, Belgium fm.desmet@ugent.be<br />

Abstract. The quantification of time series that relate to physiological data is a<br />

challenging research topic for music research. Up to now, most studies have focused<br />

on time dependent responses of individual subjects. However, little is known about<br />

time dependent responses of between-subject interactions. At IPEM, Ghent University,<br />

a large scale multidisciplinary research project targets the development of<br />

innovative music interaction based on the movement of groups of subjects. Based on<br />

a recent pilot experiment, we report new findings concerning the statistical analysis<br />

of group synchronicity in response to musical stimuli. The aim was to refine future<br />

experimental designs and to generate statistical pathways as practical guidelines for<br />

researchers. The experiment was carried out in the context of the ACCENTA 2007<br />

Fair in Ghent. 16 groups of 4 subjects took part in an experiment where they had<br />

to move a wireless Wii sensor in response to music. In the first condition, the subjects<br />

were blindfolded, while in the second condition, the subjects could see each<br />

other. The movements of the subjects were recorded as acceleration data on a PC.<br />

Fourier coefficients, total intensity, intra-group correlations and sample entropy<br />

characteristics were derived from these raw acceleration data. Combined with pre-<br />

and post-survey data of the participants, we generated a multivariate dataset for<br />

analysis. The statistical methods used in this study are basic descriptive statistics,<br />

paired correlation analysis, auto correlation analysis, regression analysis and GLM<br />

(general linear modeling). The different empirical methodologies were validated as<br />

potential tools for the study of social embodied music interaction. It was found that<br />

the synchronicity of the human-human interactions increases significantly in the<br />

social context. The type of music is the predominant factor for the human-music<br />

interaction in both the individual and the social context.<br />

Key words: Human Movement, Social Interaction, Statistical Analysis, Music<br />

− 32 −


Mixture Hidden Markov Models in Finance<br />

Research<br />

José G. Dias 1 , Jeroen K. Vermunt 2 and Sofia Ramos 3<br />

1<br />

Department of Quantitative Methods, ISCTE – Higher Institute of Social<br />

Sciences and Business Studies, Edifício ISCTE, Av. das Forças Armadas,<br />

1649–026 Lisboa, Portugal<br />

jose.dias@iscte.pt<br />

2<br />

Department of Methodology and Statistics, Tilburg University, P.O. Box 90153,<br />

5000 LE Tilburg, The Netherlands,<br />

J.K.Vermunt@uvt.nl<br />

3<br />

Department of Finance, ISCTE – Higher Institute of Social Sciences and<br />

Business Studies, Edifício ISCTE, Av. das Forças Armadas, 1649–026 Lisboa,<br />

Portugal<br />

sofia.ramos@iscte.pt<br />

Abstract. Latent class or finite mixture modeling has proven to be a powerful<br />

tool for analyzing unobserved heterogeneity in a wide range of applications (see, for<br />

example, McLachlan and Peel (2000) or Dias and Vermunt (2007)). We introduce<br />

in finance research the Mixture Hidden Markov Model (MHMM) that takes into<br />

account both time-constant unobserved heterogeneity between and hidden regimes<br />

within time series. This approach is flexible in the sense that it can deal with the<br />

specific features of financial time series data, such as asymmetry, kurtosis, and unobserved<br />

heterogeneity, aspects that are almost always ignored in finance research.<br />

This methodology is applied to model simultaneously 12 time series of the returns of<br />

Asian stock markets. Because we selected a heterogeneous sample of countries including<br />

both developed and emerging countries, we expect that heterogeneity in market<br />

returns due to country idiosyncrasies will show up in the results. The best fitting<br />

model was the one with two latent classes or clusters at the country level, which clearly<br />

differ in their switching dynamics between the two regimes.<br />
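As a hedged, self-contained illustration of the hidden-regime component (our own sketch, not the authors' model, which additionally mixes over latent classes): the forward algorithm computes the likelihood of a return series under a regime-switching model.<br />

```python
def forward_likelihood(obs_probs, trans, init):
    """Forward algorithm: likelihood of an observation sequence under an HMM.

    obs_probs[t][s]: probability (density) of the observation at time t
                     given hidden regime s.
    trans[s][s2]:    regime transition probabilities.
    init[s]:         initial regime distribution.
    """
    k = len(init)
    alpha = [init[s] * obs_probs[0][s] for s in range(k)]
    for t in range(1, len(obs_probs)):
        alpha = [obs_probs[t][s2] * sum(alpha[s] * trans[s][s2] for s in range(k))
                 for s2 in range(k)]
    return sum(alpha)
```

In a mixture HMM, the total likelihood of a series is the sum over latent classes of the class weight times this forward likelihood under the class-specific parameters.<br />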

Key words: latent class model, finite mixture model, hidden Markov model, model-based<br />

clustering, stock indexes<br />

References<br />

Dias, J.G., Vermunt, J.K. (2007): Latent class modeling of website users’ search<br />

patterns: Implications for online market segmentation. Journal of Retailing and<br />

Consumer Services, 14(6), 359–368.<br />

McLachlan, G.J., Peel, D. (2000): Finite Mixture Models. John Wiley & Sons, New<br />

York.<br />

− 33 −


Mapping Findspots of Roman Military<br />

Brickstamps in Mogontiacum (Mainz)<br />

and Archaeometrical Analysis<br />

Jens Dolata 1 , Hans-Joachim Mucha 2 , and Hans-Georg Bartel 3<br />

1 Generaldirektion Kulturelles Erbe Rheinland-Pfalz, Direktion Archäologie<br />

Mainz, Große Langgasse 29, D-55116 Mainz, Germany,<br />

dolata@ziegelforschung.de<br />

2 Weierstraß-Institut für Angewandte Analysis und Stochastik (WIAS),<br />

D-10117 Berlin, Germany, mucha@wias-berlin.de<br />

3 Institut für Chemie, Humboldt-Universität zu Berlin, Brook-Taylor-Straße 2,<br />

D-12489 Berlin, Germany, hg.bartel@yahoo.de<br />

Abstract. 1775 Roman military brickstamps dating to the 1st century A.D.<br />

have been found in archaeological excavations in Mainz, the ancient Mogontiacum.<br />

In the course of cataloguing them for a paper on Roman military archaeology, the<br />

stamps have been classified and new types of stamps have been defined. All in all, 238<br />

findspots of bricks and tiles of the 1st century have been investigated. Additionally,<br />

the findspots are described by survey-coordinates. The mapping of the brickstamps<br />

visualizes the size of the ancient city and gives details for the localization of military<br />

camps and of civil settlement. Dating the brickstamps by epigraphical investigation<br />

or by assigning them to a military brickyard based on geochemical analysis allows<br />

the mapping of different periods. Two main maps have been plotted: (A) The earliest<br />

brickstamps found in Mainz are from the period of Emperors Claudius and Nero (41 -<br />

68 A.D., n = 932). They were manufactured by soldiers of legiones XXII Primigenia<br />

and IIII Macedonica. (B) Brickstamps of legiones I Adiutrix, XIV Gemina, VII<br />

Gemina, and XXI Rapax belong to the Flavian period (69 - 96 A.D., n = 843).<br />

These two main maps can be compared with some maps showing a selection<br />

of brickstamps from 3rd and 4th centuries (Emperors Caracalla, Constantine I or<br />

Julian, and Valentinian I, n = 102). Thus the maps show a total of 1877 brickstamps<br />

from 246 sites. In this paper we try to improve the evaluation of all these<br />

maps for the history of settlement and urban development. Using statistical methods<br />

we compare the different entries of the maps. Not every entry carries the same<br />

weight, depending on the radius of its finding area. These different weights can be<br />

taken into account in statistical analysis, for instance, in nonparametric density<br />

estimation. The intention is to get a mapping of densities of brickstamps for different<br />

urban regions. By combining mapping with the search for dated brickstamps there<br />

is a good chance of obtaining new sources for the ancient history of Mogontiacum.<br />

Key words: Roman bricks, archaeometry, nonparametric density, mapping<br />

− 34 −


Multimodal Performance Analysis of<br />

Electronic Sitar<br />

Arne Eigenfeldt 1 and Ajay Kapur 2<br />

1 Simon Fraser University<br />

Burnaby, BC, Canada<br />

2 California Institute of the Arts<br />

Valencia, CA, USA<br />

Abstract. This paper describes a custom-built system which extracts high-level<br />

musical information from real-time sensor data received from an Esitar [1]. Data is<br />

collected from sensors during rehearsal using one program, GATHER, and, combined<br />

with audio analysis, is used to derive statistical coefficients which are used to identify<br />

three different playing styles of North Indian music: Alap, Ghat, and Jhala. A realtime<br />

program, LISTEN, uses these coefficients in a multi-agent analysis to determine<br />

the current playing style.<br />

The ESitar is an instrument which gathers gesture data from a performing artist<br />

using sensors embedded on the traditional instrument. A number of different performance<br />

parameters are captured including fret detection (based on position of left<br />

hand on the neck of the instrument), thumb pressure (based on right hand strumming),<br />

and 3-dimensions of neck acceleration. Audio features from the instrument<br />

are computed as well, and include Root Mean Square (rms), spectral centroid, spectral<br />

flux, and spectral rolloff at 85%.<br />

Analysis is done on sample data to derive statistical information (minimum,<br />

maximum, mean, standard deviation) for each sensor and audio feature in order to derive<br />

unique class coefficients. These coefficients are compared to incoming performance<br />

data, which is also statistically analysed, by a multi-agent system.<br />

Interactive and real-time computer music is becoming more complex, and the<br />

requirements placed upon the software are equally increasing. Composers, hoping to<br />

derive more understanding about a performer’s actions, are looking not just at incoming<br />

audio, but also to sensor data, for such information. Our research has focused<br />

upon a high-level musical cognition - playing style - rather than detail recognition<br />

- beat or pitch tracking, for example - by augmenting real-time audio analysis with<br />

sensor data analysis.<br />

References<br />

[1] Kapur, A., Tindale, A, Benning, M. S. and P. Driessen. ”The KiOm: A Paradigm<br />

for Collaborative Controller Design”, Proceedings of the International Computer<br />

Music Conference, New Orleans, USA, 2006.<br />

− 35 −


Data compression and regression based on<br />

local principal curves<br />

Jochen Einbeck 1 and Ludger Evers 2<br />

1 Department of Mathematical Sciences, Durham University, Durham, UK<br />

2 School of Mathematics, University of Bristol, Bristol, UK<br />

Abstract. Frequently the predictor space of a multivariate regression problem of<br />

the type y = f(x1, . . . , xp) + ɛ is in fact much lower-dimensional than p, often even<br />

(approximately) one-dimensional. Usual modeling attempts such as the additive<br />

model y = f1(x1) + . . . + fp(xp) + ɛ, which try to reduce the complexity of the<br />

regression problem by making additional structural assumptions, are then inefficient<br />

as they ignore the inherent structure of the predictor space and involve complicated<br />

model and variable selection stages.<br />

In a fundamentally different approach, one may consider first approximating<br />

the predictor space by a (usually nonlinear) curve passing through it, and then<br />

regressing the response only against the one-dimensional projections onto this curve.<br />

This entails the reduction from a p− to a one-dimensional regression problem.<br />

As a tool for the compression of the predictor space we apply local principal<br />

curves [1], which form a more flexible alternative to earlier proposed principal curve<br />

algorithms [2] as they also allow for branched or disconnected curves. Taking things<br />

on from the results presented in [1], we show how local principal curves can be<br />

parameterized and how the projections are obtained. The regression step can then<br />

be carried out using any nonparametric smoother. We illustrate the technique using<br />

16- and higher dimensional data from astrophysical applications. Possible extensions<br />

to more than one-dimensional nonparametric summaries of the predictor space are<br />

discussed.<br />
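A hedged sketch of the two-stage idea (a hand-rolled polyline projection plus a kernel smoother, standing in for the local-principal-curve algorithm of [1], which is not reproduced here):<br />

```python
import math

def project_onto_polyline(x, vertices):
    """Arc-length position of the orthogonal projection of 2-D point x onto a
    polyline given by a list of vertices (a stand-in for a fitted curve).
    Assumes no zero-length segments."""
    best_t, best_d2, offset = 0.0, float("inf"), 0.0
    for (ax, ay), (bx, by) in zip(vertices, vertices[1:]):
        vx, vy = bx - ax, by - ay
        seg = math.hypot(vx, vy)
        # clamp the projection parameter to the segment
        u = max(0.0, min(1.0, ((x[0] - ax) * vx + (x[1] - ay) * vy) / seg ** 2))
        px, py = ax + u * vx, ay + u * vy
        d2 = (x[0] - px) ** 2 + (x[1] - py) ** 2
        if d2 < best_d2:
            best_d2, best_t = d2, offset + u * seg
        offset += seg
    return best_t

def smooth(t_train, y_train, t0, h=0.5):
    """Nadaraya-Watson estimate of y at position t0 along the curve."""
    w = [math.exp(-0.5 * ((t - t0) / h) ** 2) for t in t_train]
    return sum(wi * yi for wi, yi in zip(w, y_train)) / sum(w)
```

Regressing the response on the one-dimensional projection index t replaces the p-dimensional regression problem by a univariate smooth.<br />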

References<br />

[1] EINBECK, J., TUTZ, G., and EVERS, L. (2005). Exploring multivariate data<br />

structures with local principal curves. In: Weihs, C. and Gaul, W. (Eds.): Classification<br />

- The Ubiquitous Challenge. Springer, Heidelberg, pages 256-263.<br />

[2] CHANG, K. and GHOSH, J. (1998). Principal curves for nonlinear feature extraction<br />

and classification. SPIE Applications of Artificial Neural Networks in<br />

Image Processing III, 3307, 120–129.<br />

− 36 −


Regression-autoregression based clustering<br />

Igor Enyukov<br />

StatPoint Ltd., Moscow<br />

Abstract. The usual approach to clustering cases into k groups (for example,<br />

k-means) can be regarded as a regression of the source variables on a set of k dummy<br />

binary (indicator) variables which satisfy the following conditions:<br />

• for the i-th case, only one of these variables has value 1;<br />

• if the j-th such variable has value 1 for the i-th case, the case belongs<br />

to the j-th group.<br />

These dummy variables are in turn nonlinear functions of the source<br />

variables, and their values are evaluated by performing a classification procedure.<br />

The set of dependent variables and predictors is the same in this approach,<br />

so it may be regarded as an autoregression approach. Such an approach may lead to<br />

problems when we work with spatially distributed data (such as objects with geographical<br />

coordinates), because objects that are close in their physical properties (for example,<br />

in the seasonal properties of river stocks) may be situated rather far apart in geographical<br />

space. In this paper we suggest using a smoothed variant of such indicator<br />

nonlinear functions. For this purpose radial basis functions (RBFs) are used, whose<br />

centers are the group centers (at the end of the procedure). In<br />

this case the set of dependent variables (source variables which we want to explain by<br />

clustering) may be distinct from the set of explaining (independent or<br />

predictor) variables. For example, the latter may be geographical coordinates.<br />

Such an approach may be regarded as regression-autoregression. The regression problem<br />

is linear in the RBFs. This approach makes it possible to determine the class<br />

centers and to estimate their number as well. For this purpose a kind of “step-wise”<br />

regression algorithm can be used.<br />
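A minimal sketch of the smoothed indicators (our own illustration, with Gaussian RBFs as the assumed basis): replacing the hard 0/1 membership by RBF values centred at the group centres gives functions that vary smoothly over, e.g., geographical coordinates.<br />

```python
import math

def rbf_memberships(x, centers, h=1.0):
    """Smoothed cluster-indicator values for point x: one Gaussian RBF per
    group centre, normalized to sum to 1 (a soft version of the 0/1 dummies)."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    w = [math.exp(-sqdist(x, c) / (2 * h ** 2)) for c in centers]
    total = sum(w)
    return [wi / total for wi in w]
```

Because the fitted values are linear in these basis functions, adding or removing a centre can be assessed as in step-wise regression.<br />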

− 37 −


Real Options in the Valuation of New<br />

Products<br />

Said Esber and Daniel Baier<br />

Lehrstuhl für Marketing und Innovationsmanagement, Erich-Weinert-Str. 1,<br />

D-03046 Cottbus<br />

saidesber1@yahoo.com, baier@tu-cottbus.de<br />

Abstract. When developing new products, it is very important for R&E management<br />
to take technical, market-related and competition-related uncertainties into account.<br />
Increasing changes in markets and in the business environment mean that investment<br />
decisions must increasingly be made under high uncertainty. Real options valuation<br />
provides a better understanding of uncertainties in R&D projects, of the flexibility<br />
available to management over the project life cycle, and of the selection of the best<br />
project alternative. This paper describes the application of the real options approach<br />
in the field of information technology (IT) by the product development management<br />
of a video conferencing system. First, an overview is given of the real options<br />
approach, the properties of different types of real options, and their relation to other<br />
additional or alternative methods for valuing R&D projects (sensitivity analysis,<br />
scenario analysis, Monte Carlo simulation and decision trees). Then a decision-support,<br />
Excel-based tool is used to compute the real options, to build the decision trees<br />
and to run the Monte Carlo simulations. Finally, the decision to introduce the video<br />
conferencing system (BRAVIS) into the production process at German VW plants<br />
is analysed as an application example.<br />
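As a hedged illustration of the Monte Carlo building block, the value of a simple deferral option on a project can be estimated by simulation (the lognormal dynamics, the parameters and the `real_option_value` helper are assumptions for illustration; this is not the Excel tool described in the abstract):

```python
import numpy as np

def real_option_value(v0, invest, sigma, r, t, n=200_000, seed=0):
    """Monte Carlo value of the option to invest `invest` at time t
    in a project whose value follows a lognormal process."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)
    vt = v0 * np.exp((r - 0.5 * sigma ** 2) * t + sigma * np.sqrt(t) * z)
    payoff = np.maximum(vt - invest, 0.0)   # invest only if worthwhile
    return np.exp(-r * t) * payoff.mean()

value = real_option_value(v0=100.0, invest=100.0, sigma=0.3, r=0.05, t=1.0)
```

The positive value of waiting, compared with the zero net present value of investing immediately, is exactly the flexibility premium that real options valuation captures.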

Key words: Real options, option valuation methods, decisions<br />
under uncertainty, IT projects, risk analysis methods<br />

References<br />

Dixit, A. K. and Pindyck, R. S. (1995): The Options Approach to Capital Investment.<br />

Harvard Business Review, May/June, 105–115.<br />

Pritsch, G. (2000): Realoptionen als Controlling-Instrument - Das Beispiel pharmazeutische<br />

Forschung und Entwicklung. Gabler, Wiesbaden.<br />

Rese, A. and Baier, D. (2007): Deciding on new products using a computer-assisted<br />

real options approach. Int. J. of Techn. Intelligence & Planning, 3(3), 292–303.<br />

− 38 −


Regression and Classification using Bayesian<br />

Additive Voronoi Tessellation Models<br />

Ludger Evers 1<br />

School of Mathematics, University of Bristol, Bristol, UK, l.evers@bris.ac.uk<br />

Abstract. Voronoi-tessellation-based regression and classification models are based<br />

on approximating the unknown regression function by a discontinuous, piecewise<br />

constant (or linear) function. The discontinuities are modeled by a Voronoi tessellation<br />

of the covariate space. This distinguishes them from recursive partitioning<br />

models like CART which model the discontinuities by (typically axis aligned) hierarchical<br />

splits. Voronoi-tessellation-based regression and classification models are<br />

typically considered in a Bayesian framework, where inference is done using Reversible<br />

Jump MCMC techniques.<br />

These methods however possess two important drawbacks. In many situations<br />

only a small proportion of the covariates studied are relevant to the regression or<br />

classification task at hand. The pairwise distances, which the Voronoi tessellation<br />

is based on, are then dominated by the irrelevant covariates, i.e. it is increasingly<br />

difficult to find a Voronoi tessellation that is informative for the problem at hand.<br />

Second, the estimated regression function is, due to its high-dimensional and complex<br />

nature, typically difficult to interpret.<br />

We propose to use an additive model of the form ∑I fI(xI) that addresses these<br />

two shortcomings. Each fI(·) is based on a Voronoi tessellation of a subspace of the<br />

covariates. A hierarchical model is used for the inclusion of the covariates in order<br />

to ensure that the model makes sparing use of the covariates. In many situations,<br />

each of the fI(·) involves only a small number of covariates and thus allows for easy<br />

interpretation. A further benefit of this approach is that it allows for constructing<br />

faster mixing MCMC compared to the basic non-additive model.<br />

A case study and an empirical comparison with competing methods like CART,<br />

BART, MARS, and Support Vector Machines, is presented both for regression and<br />

classification tasks.<br />
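The prediction step of such an additive model can be sketched as follows (a piecewise-constant toy illustration with fixed tessellations; the covariate subsets, centres and levels are assumptions, and the reversible jump MCMC inference is omitted):

```python
import numpy as np

def voronoi_component(x_sub, centers, levels):
    """Piecewise-constant f_I: value attached to the nearest centre."""
    d2 = ((x_sub[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return levels[np.argmin(d2, axis=1)]

def additive_predict(x, components):
    """Sum of Voronoi components, each on its own covariate subset I."""
    total = np.zeros(len(x))
    for cols, centers, levels in components:
        total += voronoi_component(x[:, cols], centers, levels)
    return total

# two components: f on covariate 0, f on covariates 1 and 2
components = [
    ((0,), np.array([[-1.0], [1.0]]), np.array([0.0, 2.0])),
    ((1, 2), np.array([[0.0, 0.0], [3.0, 3.0]]), np.array([-1.0, 1.0])),
]
x = np.array([[-1.0, 0.0, 0.0],
              [1.0, 3.0, 3.0]])
preds = additive_predict(x, components)
```

Because each component only looks at its own small covariate subset, irrelevant covariates no longer dominate the pairwise distances.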

Keywords<br />

Additive model, random basis, transdimensional simulation.<br />

− 39 −


Cross-linguistic regularities in the monosyllabic<br />

system<br />

Gertraud Fenk-Oczlon and August Fenk<br />

Alps-Adriatic University of Klagenfurt<br />

Abstract. We assumed cross-linguistic correlations between the number of monosyllabic<br />

words (a), of syllable types (b), of phonemes per syllable (c), and of the size<br />
of the phonemic inventory (d). Menzerath's (1954: 112–121) descriptions of 8 Indo-<br />
European languages and Campbell's (1991) data regarding their phonemic inventory<br />

offered the basis for a statistical evaluation. All correlations between a, b, and c<br />

turned out to be significant, those between these three parameters and d almost<br />

significant (Fenk-Oczlon & Fenk, to appear). The discussion of these results within<br />

a systems-theoretical framework includes: (I) Diachronic changes: A comparison of<br />

the Beowulf Prologue in Old English (OE) with its translation into Modern English<br />

(ME) shows a remarkable increase of monosyllables from 105 in OE to 312 in ME<br />

and a concomitant increase of the mean syllable complexity from 2.63 phonemes<br />

in OE to 2.88 in ME. (II) Semantic functions: The verb as well as the adverb or<br />

preposition forming a phrasal verb are often polysemous. In a short analysis of a collection<br />

of 1406 English phrasal verbs we found that 1367 or 97 % of the verbs that<br />

were part of the phrasal verb construction were monosyllabic. (39 phrasal verbs<br />

included a bisyllabic verb and only one was found with a trisyllabic verb.) (III)<br />

General relations between a language's phonemic inventory, the number of conceivable<br />

combinatorial possibilities and the number of those phonotactic possibilities actually<br />

realized in the monosyllables of typologically different languages.<br />

References<br />

Campbell, G. L. (1991): Compendium of the World's Languages. London: Routledge.<br />

Fenk-Oczlon, G. & Fenk, A. (to appear). Complexity trade-offs between the subsystems<br />

of language, In M. Miestamo, K. Sinnemaeki & F. Karlsson, (eds.) Language<br />

Complexity: Typology, Contact, Change, pp. 43-65. Amsterdam: John<br />

Benjamins.<br />

Menzerath, P. (1954). Die Architektonik des deutschen Wortschatzes. Hannover/Stuttgart:<br />

Duemmler.<br />

− 40 −


Validity of images from binary coding tables.<br />

Student motivation surveys: some evidence<br />

K. Fernández-Aguirre 1 and M. A. Garín-Martín 1<br />

University of the Basque Country (UPV/EHU), Bilbao, Spain<br />

karmele.fernandez@ehu.es<br />

Abstract. Using both artificial and real data, this paper analyses the superiority<br />

of Correspondence Analysis (CA) over Principal Component Analysis (PCA) as a<br />

procedure for displaying and exploring data in the processing of contingency tables<br />

or binary tables.<br />

Simple and Multiple Correspondence Analysis (CA and MCA) are becoming<br />

more and more widely used in many areas of science. However, PCA is much better<br />

known and is accessible in software packages, so it continues to be used even at the<br />

risk of obtaining poor results when it is not applied to quantitative data. A second<br />

point examined is the obtaining of low percentages of projected variance on the first<br />

factorial axes in CA applications. This may deter users from applying these methods,<br />

as it suggests that the visualization obtained will be of poor quality. This problem<br />

has been widely discussed: Benzécri (1979) and Greenacre (1994) propose corrections<br />
to projected variance rates. Moreover, Lebart (1984, 1998, 2000) considers the case<br />

of a matrix associated with a symmetric graph and analytically study the variations<br />

in representation depending on the different codifications of the associated matrix.<br />

On the one hand, our study shows the superiority of CA for the reconstitution<br />

and visualization of an M matrix associated with a G symmetric graph over the<br />

visualization obtained with PCA. On the other hand, we present four examples which<br />

show that projected variance rates are a highly pessimistic measure of the quality<br />

of a representation. These examples of application to survey data examine various<br />

aspects of the motivation of university students in the field of education, and in<br />

particular present extremely low percentages of projected variance on the first axes.<br />

The application of MCA enables us to obtain apparently fragile but interpretable<br />

images. Moreover, classification analysis provides various types of student with clear<br />

interpretations, leading to robust conclusions.<br />
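The CA computation underlying the comparison can be sketched as follows (the standard SVD-of-standardized-residuals formulation from textbooks, with a toy contingency table as an assumption; this is not the authors' code):

```python
import numpy as np

def correspondence_analysis(table):
    """Simple CA: SVD of the matrix of standardized residuals."""
    p = table / table.sum()
    r = p.sum(axis=1)              # row masses
    c = p.sum(axis=0)              # column masses
    s = (p - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    u, sv, vt = np.linalg.svd(s, full_matrices=False)
    inertia = sv ** 2              # principal inertias per axis
    rows = (u * sv) / np.sqrt(r)[:, None]     # principal row coordinates
    cols = (vt.T * sv) / np.sqrt(c)[:, None]  # principal column coordinates
    return rows, cols, inertia

table = np.array([[20.0, 5.0, 5.0],
                  [5.0, 20.0, 5.0],
                  [5.0, 5.0, 20.0]])
rows, cols, inertia = correspondence_analysis(table)
```

The total inertia equals the chi-squared statistic of the table divided by its grand total, which is why CA, unlike PCA, respects the nature of count data.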

Key words: Binary tables, Visualization, Correspondence Analysis, Clustering<br />

References<br />

Lebart, L., Morineau, A. and Warwick, K. M. (1984): Multivariate Descriptive Statistical<br />
Analysis. John Wiley & Sons, New York.<br />

− 41 −


A Statistical Theory of Musical Consonance<br />

Proved in Praxis<br />

Jobst Fricke<br />

Universität zu Köln<br />

Abstract. One of the recent models of consonance perception of musical intervals<br />

is based on the simulation of neural autocorrelation. It is assumed that the shape of<br />

the coinciding neural spikes resembles Dirac delta impulses. Then musical<br />
intervals are recognized as consonant if and only if the frequencies of the interval<br />
tones form an exact simple numerical proportion. But intervals are also perceived as<br />
consonant when they deviate slightly from the simple numerical proportions.<br />

The models of Tramo et al. (2001) and Ebeling (2007) are for the first time<br />

in accordance with the reality of perception by introducing larger impulses. Both of<br />

them just realize the autocorrelation of interval tones that consist of impulses with a<br />

width different from zero. In fact, the time window for the spikes’ coincidence has a<br />

width different from zero. This is the latency which is relevant in cognitive processes.<br />

In order to adapt the model to reality, the width of the statistical distribution of<br />

neural impulses should be considered.<br />

It is investigated to what extent the behavior of the model corresponds to the auditory<br />

perception in a realistic environment. The experimental data were taken from<br />

a study dealing with the judgment of intervals in a musical context (Fricke 2005b).<br />

Keywords<br />

MUSIC, CONSONANCE THEORY<br />

References<br />

EBELING, M. (2007): Verschmelzung und neuronale Autokorrelation als Grundlage<br />
einer Konsonanztheorie. Frankfurt/M. et al.<br />
FRICKE, J. (2005b): Classification of Perceived Musical Intervals, in: Claus<br />
Weihs and Wolfgang Gaul (Eds.): Classification - The Ubiquitous Challenge,<br />
Berlin/Heidelberg/New York: Springer, pp. 585–592.<br />
TRAMO, M., CARIANI, P., DELGUTTE, B. and BRAIDA, L. (2001): Neurobiological<br />
Foundations for the Theory of Harmony in Western Tonal Music, in: R.<br />
Zatorre and I. Peretz (Eds.): Biological Foundations of Music, Annals of the<br />

New York Academy of Sciences, Vol. 930.<br />

− 42 −


An Improved Criterion for Clustering Based<br />

on the Posterior Similarity Matrix<br />

Arno Fritsch and Katja Ickstadt<br />

Technische Universität Dortmund, Fakultät Statistik<br />

Vogelpothsweg 87, 44221 Dortmund<br />

arno.fritsch@tu-dortmund.de, ickstadt@statistik.uni-dortmund.de<br />

Abstract. Complex Bayesian cluster models are often fitted using Markov Chain<br />

Monte Carlo (MCMC) algorithms. A problem is then how to summarize the MCMC<br />

sample c(1), . . ., c(M) from the posterior distribution of clusterings p(c|y) with a single<br />

estimated clustering ĉ. The problem is complicated by the fact that the labels<br />

associated with the clusters can switch during the MCMC run. One way to overcome<br />

this is to derive the estimate ĉ based on the posterior similarity matrix, a matrix<br />

with entries πij = P (ci = cj|y), the posterior probabilities that the observations i<br />

and j are in the same cluster. This approach is taken for example in the Bayesian<br />

cluster models for gene expression microarray data by Medvedovic et al. (2004) and<br />

Dahl (2006). The former applies hierarchical clustering to the matrix of (1 − πij),<br />

while the latter tries to minimize a loss function proposed by Binder (1978). We<br />

show that this minimization is equivalent to maximizing the Rand index between<br />

estimated and true clustering and propose a new criterion for choosing ĉ, the posterior<br />

expected adjusted Rand index with the true clustering. In a simulation study<br />

with a Dirichlet process mixture model it is shown that our new criterion leads to<br />

estimated clusterings closer to the true one than the other two approaches and that<br />

it also outperforms the usage of the maximum a posteriori (MAP) clustering.<br />
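The posterior similarity matrix and a selection criterion of this type can be sketched as follows (a simplified illustration using the unadjusted Rand index on a tiny hand-made set of draws; the adjusted version proposed in the paper and the MCMC sampler itself are omitted):

```python
import numpy as np

def posterior_similarity(draws):
    """pi_ij = fraction of MCMC clusterings with c_i == c_j."""
    draws = np.asarray(draws)
    same = draws[:, :, None] == draws[:, None, :]
    return same.mean(axis=0)

def expected_rand(c, psm):
    """Posterior expected (unadjusted) Rand index of c with the truth:
    average pairwise agreement over all pairs i < j, via the PSM."""
    c = np.asarray(c)
    same = (c[:, None] == c[None, :]).astype(float)
    iu = np.triu_indices(len(c), k=1)
    agree = same[iu] * psm[iu] + (1 - same[iu]) * (1 - psm[iu])
    return agree.mean()

# choose the estimate among the sampled clusterings
draws = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
psm = posterior_similarity(draws)
best = max(draws, key=lambda c: expected_rand(c, psm))
```

Because the criterion only uses the PSM, it is unaffected by label switching across MCMC iterations.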

Key words: Adjusted Rand index, Bayesian cluster analysis, Markov Chain Monte<br />

Carlo, Loss functions<br />

References<br />

Binder, D.A. (1978): Bayesian Cluster Analysis. Biometrika, 65, 31–38.<br />

Dahl, D.B. (2006): Model-based Clustering for Expression Data via a Dirichlet Process<br />

Mixture Model. In: K.A. Do, P. Müller and M. Vannucci (Eds.): Bayesian<br />

Inference for Gene Expression and Proteomics, Cambridge University Press,<br />

New York, 201–216.<br />

Medvedovic, M., Yeung, K. and Bumgarner, R. (2004): Bayesian Mixture Model<br />

Based Clustering of Replicated Microarray Data. Bioinformatics, 20, 1222–<br />

1232.<br />

− 43 −


On the Use of Student Samples in Major<br />

Marketing Research Journals. A Meta-Study<br />

Sebastian Fuchs and Marko Sarstedt<br />

Institute for Market-based Management, Munich School of Management, D-80539<br />

Munich, Germany imm@bwl.lmu.de<br />

Abstract. In recent years, almost every marketing research journal has experienced<br />
a sharp rise in the number of high-quality paper submissions. This has led to<br />

an increased competition among contributing authors and heightened requirements<br />

for manuscript submissions. In this context, the manuscript evaluation criteria for<br />

almost every marketing journal underline the importance of the sample’s characteristics<br />

and how well it represents the population being studied. However, the predominance<br />

of student samples in empirical marketing research documents the divergence<br />

between these theoretical requirements and practical implementation. Despite theoretical<br />

and empirical objections, several authors claim that a silent acceptance of<br />

the usage of student samples has become observable, even in top research societies.<br />

According to Peterson (2001), this development is problematic as generalizations are<br />

only feasible if replicating research with non-student subjects is carried out. Thus,<br />

the objective of this paper is to analyze the development of the usage of student<br />

samples in the most reputable marketing research journals. For this purpose, all<br />

eleven marketing journals rated A or A+ in the ranking developed on behalf of<br />
the Association of University Professors of Management in German-speaking countries<br />
(VHB) were investigated. A total of 1,491 studies that appeared since<br />

2005 were analyzed with regard to samples used, measures evaluated and limitations<br />

addressed. The results show vast differences between the various journals.<br />

Key words: Student Samples, Sampling, Marketing Research<br />

References<br />

Peterson, R.A. (2001): On the Use of College Students in Social Science Research:<br />

Insights from a Second-Order Meta-analysis. Journal of Consumer Research, 28<br />

(3), 450–461.<br />

Völckner, F. and Sattler, H. (2007): Empirical Generalizability of Consumer Evaluations<br />

of Brand Extensions. International Journal of Research in Marketing, 24<br />

(2), 149–162.<br />

− 44 −


Multi-Dimensional Scaling applied to<br />

Hierarchical Fuzzy Rule Systems<br />

Thomas R. Gabriel, Kilian Thiel, and Michael R. Berthold<br />

Chair for Bioinformatics and Information Mining<br />

Department of Computer and Information Science<br />

University of Konstanz, Box 712, 78457 Konstanz, Germany<br />

{gabriel|thiel|berthold}@inf.uni-konstanz.de<br />

Abstract. This paper presents an approach for visualizing high-dimensional fuzzy<br />

rules arranged in a hierarchy together with the training patterns they cover. A standard<br />

multi-dimensional scaling method is used to map the rule centers of the top<br />

hierarchy level to one coherent picture. Rules of the underlying levels are projected<br />

relatively to their superior level(s). In addition to the rules, all patterns are mapped<br />

onto the two dimensional projection in relation to the positions of the corresponding<br />

rule centers. The visualization is further extended by showing hierarchical relationships<br />

between overlapping rules of different levels as generated by a hierarchical rule<br />

learner. This delivers interesting insights into the rule hierarchy and makes the model<br />

more explorable. Additionally, rules can be highlighted interactively emphasizing the<br />

subsequent rules at all underlying levels together with their covered patterns. We<br />

demonstrate that this technique allows investigation of interesting rules at different<br />

levels of granularity, which even makes this approach applicable to a large number<br />

of rule sets. The proposed technique is illustrated and discussed on a number<br />

of hierarchical rule model visualizations generated on well-known benchmark data<br />

sets.<br />
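The flat core of the projection can be sketched with classical (Torgerson) MDS (a generic implementation on toy rule centres; the relative projection of lower hierarchy levels described above is not reproduced):

```python
import numpy as np

def classical_mds(x, dim=2):
    """Classical MDS: double-centre squared distances, eigendecompose."""
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
    n = len(x)
    j = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    b = -0.5 * j @ d2 @ j
    w, v = np.linalg.eigh(b)                     # ascending eigenvalues
    idx = np.argsort(w)[::-1][:dim]              # keep the top `dim`
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# toy high-dimensional rule centres mapped to one coherent 2-D picture
rule_centers = np.array([[0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0, 0.0], [5.0, 5.0, 5.0, 5.0]])
coords = classical_mds(rule_centers)
```

Patterns covered by a rule would then be plotted relative to the projected position of that rule's centre.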

Key words: Multi-Dimensional Scaling, Fuzzy Rule Induction, Rule Hierarchy<br />

References<br />

Berthold, M.R. (2003): Mixed Fuzzy Rule Formation. International Journal of Approximate<br />

Reasoning, 32:67–84.<br />

Gabriel, T.R. and Berthold, M.R. (2003): Constructing Hierarchical Rule Systems.<br />

In: M.R. Berthold, H.-J. Lenz, E. Bradley, R. Kruse, C. Borgelt (Eds.): Proc. 5th<br />

International Symposium on Intelligent Data Analysis. Springer, Berlin, 76–87.<br />

Gabriel, T.R. and Thiel, K. and Berthold, M.R. (2006): Rule Visualization based on<br />

Multi-Dimensional Scaling. In: IEEE International Conference on Fuzzy Systems.<br />

IEEE Press, Vancouver, 66–71.<br />

− 45 −


A Centre of Excellence for the Digital Support of<br />
Conducting, Analysing and Publishing<br />
Archaeological Field Projects (Excavation, Survey)<br />

Ulrich-Walter Gans and Matthias Lang<br />

Institut für Archäologische Wissenschaften der RUB, Fakultät für<br />

Geschichtswissenschaft, E-Mail: johannes.bergemann@rub.de<br />

Abstract. For about two decades, the very large quantities of information generated<br />
in archaeological fieldwork have been recorded predominantly or exclusively in<br />
digital form. Extensive databases have been created in numerous places, but their<br />
structures are extremely heterogeneous; isolated solutions without interconnections<br />
are the rule. Further fundamental problems concern the long-term preservation and<br />
maintenance of existing data pools as well as rapidly changing operating systems.<br />
An interdisciplinary team from archaeology, information management, software<br />
development and geoinformatics aims to enable archaeologists to work with digital<br />
media in an entirely new way. So far it has only been possible to query the databases<br />
distributed across the servers of universities, research institutes and museums<br />
individually over the Internet. ArcheoInf will build a mediator capable of searching<br />
numerous archaeological fieldwork databases simultaneously, without users having<br />
to switch between query interfaces. The mediator is based on a thesaurus covering<br />
as many areas of archaeology as possible and a corresponding ontology, which<br />
allows convenient searching of fieldwork data. This creates a repository committed<br />
to the open-access idea, in which archaeological object and image data can be<br />
accessed free of charge. The embedding of bibliographic data, citations, holdings<br />
records and electronic full texts is also planned. At the same time, a central WebGIS<br />
server is to be set up that links the archaeologists' geoinformatic applications to<br />
text and image data. Archaeologists of the Ruhr-Universität Bochum, computer<br />
scientists of the Technische Universität Dortmund, geoinformatics specialists of the<br />
Hochschule Bochum, and the university libraries of Bochum and Dortmund collaborate<br />
on the project, which is funded by the Deutsche Forschungsgemeinschaft. Further<br />
archaeological project partners in Berlin, Bochum, Cottbus, Darmstadt, Karlsruhe,<br />
Köln, Frankfurt and Tübingen are associated.<br />

− 46 −


Scalable and Incrementally Updated Hybrid<br />

Recommender Systems<br />

Zeno Gantner and Lars Schmidt-Thieme<br />

Information Systems and Machine Learning Lab (ISMLL)<br />

University of Hildesheim, Germany<br />

{gantner,schmidt-thieme}@ismll.uni-hildesheim.de<br />

Abstract. A typical approach in collaborative filtering is to treat the ratings as a<br />

matrix with many unknown entries, and to use the known data to approximate the<br />

complete matrix, including the unknown ratings. George and Merugu demonstrated<br />

that using a co-clustering algorithm to approximate the ratings matrix achieves prediction<br />

accuracies comparable to techniques like SVD, NMF, or kNN, while having<br />

computational properties which allow the use of the method in dynamic scenarios,<br />

where incoming data has to be incorporated instantly. However, pure collaborative<br />

filtering approaches suffer from insufficient data, especially in cases when there are<br />

only a few users, or when there are new items which have not yet been rated.<br />

To overcome this problem, we combine co-clustering with a Naïve Bayes classifier<br />

to predict ratings based on both the ratings matrix and item attributes. The simplicity<br />

of the classifier allows us to preserve the desirable properties (parallelization,<br />

scalability, incremental updates). Our evaluation indicates that the hybrid recommender<br />

system performs better than pure co-clustering.<br />

Key words: hybrid recommender systems, collaborative filtering, content-based<br />

filtering, co-clustering, naïve bayes<br />

References<br />

Banerjee, A., Dhillon, I., Ghosh, J., Merugu, S. and Modha, D. (2007): A Generalized<br />

Maximum Entropy Approach to Bregman Co-clustering and Matrix<br />

Approximation. The Journal of Machine Learning Research, 8, 1919–1986<br />

Hauger, S., Tso, K. and Schmidt-Thieme, L. (2007): Comparison of Recommender<br />

System Algorithms focusing on the New-Item and User-Bias Problem. In: The<br />

31st Annual Conference of the German Classification Society on Data Analysis,<br />

Machine Learning, and Applications.<br />

George, T. and Merugu, S. (2005): A Scalable Collaborative Filtering Framework<br />

Based on Co-Clustering. In: Proceedings of the 5th IEEE Conference on Data<br />

Mining (ICDM). IEEE Computer Society, Los Alamitos, CA, USA, 625–628<br />

− 47 −


Non-Gaussian nature of ENSO signals and<br />

climate shifts: implications for regional<br />

studies off the western coast of South America<br />

Bernard Garel 1 , J. Boucharel, B. Dewitte and Y. du Penhoat 2<br />

1 Institut de Mathématiques de Toulouse (IMT-LSP)<br />

bernard.garel@math.univ-toulouse.fr<br />

2 Laboratoire d’Études en Géophysique et Océanographie Spatiales (LEGOS)<br />

julien.boucharel@legos.obs-mip.fr<br />

Abstract. El Niño/Southern Oscillation (ENSO) exhibits a significant modulation<br />
at decadal timescales which is also associated with changes in its characteristics (amplitude,<br />

frequency, propagation, predictability). Among these characteristics, some<br />

of them are generally ignored in ENSO regional studies, such as asymmetry (number<br />

of warm and cold events is not equal) and deviation of its statistics from those<br />

of an assumed Gaussian distribution. They tend to reduce ENSO prediction skill.<br />

Empirical variance shifts (assumed to be an index of low frequency variability) first<br />

detected in the western tropical Pacific, propagate (with propagation characteristics<br />

related to the unstable modes of ENSO) and grow eastward along the equator,<br />

leading to enhanced SST anomalies and asymmetry.<br />

Statistical tests are used to quantify the non-Gaussian nature and asymmetry of<br />

ENSO typical indices from in situ data and a variety of models (from intermediate<br />

complexity models to full physics coupled general circulation models). It is tested<br />

if ENSO can be accounted for by a non-Gaussian alpha-stable law (i.e. a more<br />

heavy-tailed distribution than Gaussian), by a mixture of distributions or by a non<br />

stationary process dominated by mean state and empirical variance abrupt changes.<br />

This last issue is achieved by a shift detection method applied to ENSO typical<br />

indices. Implications for the interpretation of proxies of the upwelling variability off<br />

the coast of Peru are discussed.<br />
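Moment-based diagnostics of the kind such tests formalise can be sketched as follows (sample skewness and excess kurtosis on synthetic series; the alpha-stable fits and the shift-detection method of the study are not reproduced):

```python
import numpy as np

def moments_diagnostics(x):
    """Sample skewness (asymmetry) and excess kurtosis (heavy tails);
    both are approximately zero for Gaussian data."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3), np.mean(z ** 4) - 3.0

rng = np.random.default_rng(1)
gauss = rng.normal(size=20_000)             # Gaussian benchmark
heavy = rng.standard_t(df=3, size=20_000)   # heavier tails than Gaussian
```

Large positive excess kurtosis on an ENSO index would point toward a heavy-tailed (e.g. alpha-stable) description rather than a Gaussian one.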

Key words: alpha-stable distributions, mixtures, non-stationary process<br />

− 48 −


Likelihood ratio test for general mixture<br />

models<br />

Elisabeth Gassiat 1<br />

Université Paris-Sud 11, Bâtiment 425, 91405 Orsay Cedex, France<br />

elisabeth.gassiat@math.u-psud.fr<br />

Abstract. We investigate the likelihood ratio test (LRT) for testing hypotheses on<br />

the mixing measure in mixture models with possibly structural parameter. The main<br />

result gives the asymptotic distribution of the LRT statistics under some conditions<br />

that are proved to be almost necessary. Asymptotic distribution of the LRT statistics<br />

under contiguous alternatives may be derived. This applies to various testing<br />

problems: the test of a single distribution against any mixture, with application to<br />

Gaussian, Poisson and binomial distributions; the test of the number of populations<br />

in a finite mixture with possibly structural parameter. This allows us to prove that, for<br />

the simple contamination model, the asymptotic local power under contiguous hypotheses<br />

may be arbitrarily close to the asymptotic level when the set of parameters<br />

is large enough.<br />
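For the simplest setting — a single Gaussian against a two-component Gaussian mixture with unit variances — the LRT statistic can be computed as follows (a minimal EM sketch on a toy sample; the fixed-variance model is an assumption, and the asymptotic theory of the paper is of course not captured by this):

```python
import numpy as np

def loglik_single(x):
    """Log-likelihood under H0: one N(mu, 1), with mu at its MLE."""
    mu = x.mean()
    return -0.5 * np.sum((x - mu) ** 2) - 0.5 * len(x) * np.log(2 * np.pi)

def loglik_mixture(x, n_iter=300):
    """EM for p*N(mu1,1) + (1-p)*N(mu2,1); returns the fitted log-likelihood."""
    p, mu1, mu2 = 0.5, x.min(), x.max()
    for _ in range(n_iter):
        d1 = p * np.exp(-0.5 * (x - mu1) ** 2)
        d2 = (1 - p) * np.exp(-0.5 * (x - mu2) ** 2)
        r = d1 / (d1 + d2)                    # responsibilities
        p = r.mean()
        mu1 = np.sum(r * x) / np.sum(r)
        mu2 = np.sum((1 - r) * x) / np.sum(1 - r)
    dens = (p * np.exp(-0.5 * (x - mu1) ** 2)
            + (1 - p) * np.exp(-0.5 * (x - mu2) ** 2)) / np.sqrt(2 * np.pi)
    return np.sum(np.log(dens))

def lrt_statistic(x):
    """2 * (log-likelihood under H1 - log-likelihood under H0)."""
    return 2.0 * (loglik_mixture(x) - loglik_single(x))
```

On clearly bimodal data the statistic is large; its distribution under H0 is exactly the nonstandard object the paper characterises.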

Key words: Likelihood ratio test, mixture models, number of components, local<br />

power, contiguity<br />

References<br />

Azais, J.-M., Gassiat, E. and Mercadier, C. (2006): Asymptotic distribution and<br />

power of the likelihood ratio test for mixtures: bounded and unbounded case.<br />

Bernoulli, 12(5), 775–799.<br />

Azais, J.-M., Gassiat, E. and Mercadier, C. (2007): The likelihood ratio test for<br />

general mixture models with possibly unknown structural parameters. ESAIM<br />

P. and S., submitted.<br />

Gassiat, E. (2002): Likelihood ratio inequalities with applications to various mixtures.<br />

Ann. Inst. H. Poincaré Probab. Statist., 6, 897–906.<br />

− 49 −


On the Location of Retail Units and<br />
Equilibrium Price Determination<br />

Vladimir Gazda 1<br />

Technical University in Kosice, Nemcovej Str., 040 01 Kosice, Slovakia<br />

vladimir.gazda@tuke.sk<br />

Abstract. The classical view of price determination is based on the assumption of<br />
perfect competition, which implies the existence of a single price accepted by all<br />
retailers. Traditionally, we suppose that all consumers neglect the search costs<br />
incurred in looking for the most suitable purchasing opportunity. New views on<br />
price-location competition among retailers were presented by Hotelling and, subsequently,<br />
by d'Aspremont et al.; Gabszewicz and Thisse; Dobson and Waterson; Martinez et<br />
al., who mainly stressed a continuous approach. A discrete model describing the relation<br />
between search costs and the selling price was given by Stigler (1961)<br />
in his search theory. His approach is based on the sequential search of particular<br />
retail places and focuses more on the search process itself than on equilibrium<br />
price determination. We propose a price equilibrium problem formulation for a<br />
more complicated (non-sequential) structure of consumers and retailers.<br />

The model describes the behaviour of cost-minimising homogeneous consumers<br />
and the retailers' price policy. The union of the set of consumers V1, the set of<br />
retailers V2, a virtual source s and a virtual sink u gives the set V of digraph<br />
nodes. Then E = {s} × V1 ∪ V1 × V2 ∪ V2 × {u} is the set of digraph edges. We define<br />
the unit cost function c : E → N0 by cs,i = 0, ci,j = di,j, and cj,u = pj, where<br />
di,j ∈ N is the search cost spent by the i-th consumer to reach the j-th retail<br />
centre and pj ∈ πj is the price of the j-th retailer. It is proved that the min-cost flow<br />
in the digraph G = (V, E, c) models the optimal behaviour of all consumers. Then<br />
the normal-form game of retailers S = (V2, ∏j πj, ∏j µj) describes the behaviour<br />
of retailers, with their payoff functions µj derived from the min-cost decisions of the<br />
consumers. The Nash equilibrium price strategies of the retailers are discussed in<br />
the article.<br />
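With unit demand per consumer and uncapacitated retailers, the min-cost flow in G decomposes consumer by consumer: each consumer i routes its unit of flow through the retailer j minimising the total edge cost ci,j + cj,u = di,j + pj. A sketch with assumed toy numbers:

```python
def consumer_choices(search_costs, prices):
    """Optimal consumer behaviour in G = (V, E, c): consumer i picks
    the retailer j with minimal total cost d_ij + p_j."""
    choices = []
    for d_i in search_costs:
        total = [d + p for d, p in zip(d_i, prices)]
        choices.append(total.index(min(total)))
    return choices

def retailer_payoffs(search_costs, prices):
    """mu_j: revenue of retailer j given the consumers' min-cost decisions."""
    counts = [0] * len(prices)
    for j in consumer_choices(search_costs, prices):
        counts[j] += 1
    return [n * p for n, p in zip(counts, prices)]

d = [[1, 4], [3, 1], [2, 2]]   # d[i][j]: search cost of consumer i to retailer j
p = [2, 1]                      # p[j]: price of retailer j
```

Evaluating `retailer_payoffs` over the price strategy sets is exactly the payoff computation that the Nash analysis of the retailer game rests on.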

Key words: Graph, Game, Consumption, Retailer, Min-cost Flow, Nash Equilibrium<br />

− 50 −


The Potential of Social Intelligence for<br />

Collective Intelligence<br />

Andreas Geyer-Schulz 1 and Bettina Hoser 2<br />

1 Information Service and Electronic Markets, andreas.geyer-schulz@kit.edu<br />

2 bettina.hoser@kit.edu<br />

Abstract. In this contribution we review the history and potential of social intelligence<br />

for driving collective intelligence. Different social networks generated by<br />

computer-mediated communication have been researched extensively in the past.<br />

The development of new technologies and applications on the Internet has resulted<br />

in a recent dramatic rise of user participation within these networks. We systematically<br />

explore the possibility of cross-usage of information about social network<br />

structures for personal, community or organisational services. In the other direction,<br />

we investigate the potential of improving community services by integrating<br />

information on the network structure with personal, organisational and mass information.<br />

Key words: social networks, social network analysis, social intelligence, collective<br />

intelligence<br />

References<br />

Hoser, B. and Geyer-Schulz, A. (2005): Eigenspectralanalysis of Hermitian Adjacency<br />

Matrices for the Analysis of Group Substructures. Journal of Mathematical<br />

Sociology, 29(4), 265–294.<br />

Wasserman, S. and Faust, K. (1994): Social Network Analysis, Methods and Applications.<br />

Cambridge University Press, Cambridge.<br />

− 51 −


Isolated vertices in random intersection graphs<br />

Erhard Godehardt 1 , Jerzy Jaworski 2 and Katarzyna Rybarczyk 2<br />

1 Clinic of Thoracic and Cardiovascular Surgery, Heinrich Heine University, 40225<br />

Düsseldorf, Germany; godehard@uni-duesseldorf.de<br />

2 Faculty of Mathematics and Computer Science, Adam Mickiewicz University,<br />

60769 Poznań, Poland; jaworski@amu.edu.pl, kryba@amu.edu.pl<br />

Abstract. In applications like the analysis of non-metric data, it is a natural approach<br />

to classify objects according to the properties they possess. Often it is useful<br />

to consider two objects similar if they share at least s properties (with a given arbitrary<br />

number s). Then an effective model to analyze the structure of similarities<br />

between objects is the random intersection graph G(m, n, P(m)), with a set of vertices<br />

V generated by the random bipartite graph BG(m, n, P(m)) with bipartition (V, W),<br />

where clusters are defined as given subgraphs of the generated intersection graph.<br />

In BG(m, n, P(m)) the number of neighbors (properties) of a vertex v ∈ V (object) is<br />

assigned according to the probability distribution P(m) and an edge between v ∈ V<br />

and w ∈ W means that the object v has the property w. Moreover in G(m, n, P(m)),<br />

an edge connects v1 and v2 (v1, v2 ∈ V) if and only if in BG(m, n, P(m)) they have<br />

at least s common properties. The models were introduced in Godehardt and Jaworski<br />

(2002). Using specific properties of such graphs, we can test the hypothesis<br />

of randomness of the underlying data set.<br />

Our main purpose is to study the number of isolated vertices (objects similar<br />

to no other) in G(m, n, P(m)). Previous results concerning this problem considered<br />

only the case where each vertex had the same number of properties and s = 1. In<br />

our new approach we manage to cope with dependencies between edge appearances<br />

for s ≥ 1 (which is important from the application point of view). Moreover, we give<br />

results for the case where the number of properties differs between objects (different<br />

distributions P(m)). We give the asymptotics for the probability of nonexistence of<br />

isolated vertices in G(m, n, P(m)) and conditions for asymptotic convergence of the<br />

number of isolated vertices to the Poisson distribution.<br />

Key words: Random Intersection Graphs, Isolated Vertices, Non-metric Data<br />

Analysis<br />

References<br />

Godehardt, E. and Jaworski, J. (2002): Two Models of Random Intersection Graphs<br />

for Classification. In: M. Schwaiger and O. Opitz (Eds.): Exploratory Data Analysis<br />

in Empirical Research. Springer, Berlin, 68–81.<br />
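A small simulation makes the model concrete. The sketch below samples properties for each object, builds the s-intersection graph and counts isolated vertices; the parameter names (n objects, m properties, prop_count_dist for P(m)) are our own shorthand, not the authors' notation or code:

```python
import random
from itertools import combinations

def random_intersection_graph(n, m, prop_count_dist, s=1, rng=random):
    """Sample an s-intersection graph: each of n objects draws a number of
    properties from prop_count_dist() (the distribution P(m)) out of m
    available ones; two objects are adjacent iff they share >= s properties."""
    props = [set(rng.sample(range(m), min(prop_count_dist(), m)))
             for _ in range(n)]
    edges = {(u, v) for u, v in combinations(range(n), 2)
             if len(props[u] & props[v]) >= s}
    return props, edges

def isolated_vertices(n, edges):
    """Objects similar to no other object, i.e. touched by no edge."""
    touched = {x for e in edges for x in e}
    return [v for v in range(n) if v not in touched]

props, E = random_intersection_graph(n=12, m=40, prop_count_dist=lambda: 2)
iso = isolated_vertices(12, E)
```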

− 52 −


A note on constrained EM algorithms<br />

for mixtures of elliptical distributions<br />

Francesca Greselin 1 and Salvatore Ingrassia 2<br />

1 Dipartimento di Metodi Quantitativi per le Scienze Economiche e Aziendali,<br />

Università di Milano Bicocca (Italy) francesca.greselin@unimib.it<br />

2 Dipartimento di Economia e Metodi Quantitativi, Università di Catania (Italy)<br />

s.ingrassia@unict.it<br />

Abstract. We extend some theoretical results about the likelihood maximization<br />

on constrained parameter spaces to mixtures of multivariate elliptical distributions.<br />

In particular, mixtures of multivariate t distributions provide a robust parametric<br />

extension to the fitting of data with respect to normal mixtures. In this framework,<br />

the degrees of freedom can act as a robustness parameter, tuning the heaviness of<br />

the tails and down-weighting the effect of outliers on parameter estimation.<br />

Further, a constrained monotone algorithm implementing maximum likelihood mixture<br />

decomposition of multivariate t distributions is proposed, to achieve improved<br />

convergence capabilities and robustness. Numerical studies are presented in order<br />

to demonstrate the better performance of the algorithm, comparing it to earlier<br />

proposals.<br />

Key words: Mixture models, Robust Clustering, EM algorithm, elliptical distributions,<br />

t-distribution.<br />

References<br />

Hennig, C. (2004): Breakdown points for maximum likelihood estimators of location-scale<br />

mixtures. The Annals of Statistics, 32, 1313–1340.<br />

Hathaway, R.J. (1985): A constrained formulation of maximum-likelihood estimation<br />

for normal mixture distributions. The Annals of Statistics, 13, 795–800.<br />

Ingrassia, S. and Rocci, R. (2007): Constrained monotone EM algorithms for finite<br />

mixture of multivariate Gaussians. Computational Statistics & Data Analysis,<br />

51, 5339–5351.<br />

McLachlan, G. J. and Peel, D. (2000): Finite Mixture Models, John Wiley & Sons,<br />

New York.<br />
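The flavour of such constraints can be sketched in a few lines: project the eigenvalues of a covariance estimate into a fixed interval, which keeps the likelihood bounded and the EM sequence away from degenerate solutions. This is an illustrative projection step in the spirit of Hathaway-type constraints, not the authors' algorithm:

```python
import numpy as np

def constrain_covariance(sigma, lo, hi):
    """Project a (possibly degenerate) covariance estimate so that all of its
    eigenvalues lie in [lo, hi]: decompose, clip the spectrum, reassemble.
    Illustrative sketch of a constrained M-step ingredient."""
    vals, vecs = np.linalg.eigh(sigma)
    vals = np.clip(vals, lo, hi)
    return vecs @ np.diag(vals) @ vecs.T

# an indefinite "estimate" that an unconstrained step might produce
S = np.array([[4.0, 1.9], [1.9, 0.05]])
S_c = constrain_covariance(S, 0.5, 3.0)
```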

− 53 −


Support Vector Machines in the Primal using<br />

Majorization and Kernels<br />

Patrick J.F. Groenen 1 , Georgi Nalbantov 2 , and Cor Bioch 3<br />

1 Econometric Institute, Erasmus University Rotterdam,<br />

Rotterdam, The Netherlands groenen@few.eur.nl<br />

2 Econometric Institute, Erasmus University Rotterdam,<br />

Rotterdam, The Netherlands and MICC, Maastricht University, Maastricht<br />

nalbantov@few.eur.nl<br />

3 Econometric Institute, Erasmus University Rotterdam,<br />

Rotterdam, The Netherlands bioch@few.eur.nl<br />

Abstract. Support vector machines have become one of the main stream methods<br />

for two-group classification. At the 2006 GfKl meeting in Berlin, we proposed SVM-<br />

Maj, a majorization algorithm that minimizes the SVM loss function (see Groenen,<br />

Nalbantov, and Bioch, 2007, 2008). A big advantage of majorization is that in each<br />

iteration, the SVM-Maj algorithm is guaranteed to decrease the loss until the global<br />

minimum is reached. Nonlinearity was achieved by replacing the predictor variables<br />

by their monotone spline bases and then doing a linear SVM. A disadvantage of<br />

the method so far is that if the number of predictor variables m is large, SVM-Maj<br />

becomes slow.<br />

In this paper, we extend the SVM-Maj algorithm in the primal to handle efficiently<br />

cases where the number of observations n is (much) smaller than m. We<br />

show that the SVM-Maj algorithm can be adapted to handle this case of n ≪ m as<br />

well. In addition, the use of kernels instead of splines for handling the nonlinearity<br />

also becomes possible while still maintaining the guaranteed descent properties of<br />

SVM-Maj.<br />

Key words: Support vector machines, Iterative majorization, Binary classification<br />

problem, Kernel<br />

References<br />

Groenen, P.J.F., Nalbantov, G., and Bioch, J.C. (2007): Nonlinear support vector<br />

machines through iterative majorization and I-splines. In: R. Decker and H.-J. Lenz<br />

(Eds.): Advances in data analysis. Springer, Berlin, 149–162.<br />

Groenen, P.J.F., Nalbantov, G., and Bioch, J.C. (2008, in press): SVM-Maj: A<br />

Majorization Approach to Linear Support Vector Machines with Different Hinge<br />

Errors. Advances in Data Analysis and Classification.<br />
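The guaranteed-descent property of iterative majorization can be seen in a toy problem: minimising f(x) = sum_i |x - a_i| via the quadratic majorizer |u| <= u^2/(2|u_0|) + |u_0|/2. This sketches the principle only, not the SVM-Maj loss:

```python
import numpy as np

def majorize_minimize_median(a, x0=0.0, iters=50, eps=1e-9):
    """Minimize f(x) = sum_i |x - a_i| by iterative majorization: each step
    replaces |u| by its quadratic majorizer at the current iterate and solves
    the resulting weighted least-squares problem.  The loss never increases."""
    a = np.asarray(a, float)
    x, hist = float(x0), []
    for _ in range(iters):
        w = 1.0 / np.maximum(np.abs(x - a), eps)  # majorizer curvature weights
        x = np.sum(w * a) / np.sum(w)             # minimizer of the quadratic
        hist.append(float(np.sum(np.abs(x - a))))
    return x, hist

x, hist = majorize_minimize_median([1.0, 2.0, 7.0], x0=10.0)
```

The iterates converge to the median (here 2.0) and the recorded loss sequence is monotonically nonincreasing, the same guarantee SVM-Maj gives for the SVM loss.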

− 54 −


Usage of Artificial Neural Networks for Data<br />

Handling<br />

Lars Frank Groe and Franz Joos<br />

Helmut-Schmidt-University<br />

University of the Federal Armed Forces Hamburg<br />

Power Engineering<br />

Laboratory of Turbomachinery<br />

Hamburg, Germany<br />

Abstract. To reduce environmental pollution, it is essential to increase the<br />

efficiency of commercially available combustion engines. If the combustion process,<br />

in particular its chemical reactions, can be modelled, it becomes feasible to<br />

replace experiments by computer simulations. Complex chemical reaction mechanisms<br />

like GRI 3.0 consist of 325 reactions with 53 species. The computational<br />

hardware costs limit the evaluation of integrals of stiff equations to simple problems<br />

(2-D, low Reynolds numbers) or to very small numbers of species. Turbulent<br />

combustion, however, for example in the combustion chambers of gas turbines, often<br />

involves complex geometries with a wide spectrum of chemical states and proceeds<br />

at high Reynolds numbers. The use of databases for storing chemical reactions<br />

is widely described in the literature [Pope]. Several storage-based techniques<br />

have therefore been implemented for data mining (look-up tables, in situ adaptive<br />

tabulation). The use of artificial neural networks (ANN) to simulate complex chemistry<br />

with the full GRI 3.0 mechanism is suggested in this paper. ANN can represent the<br />

chemical reactions by creating a non-linear multivariate model of the dataset. The<br />

information of the dataset is stored in the weights of the connected neurons in the<br />

ANN. The net is able to find the optimum approximation of the presented data by a<br />

supervised learning method called back-propagation. The purpose of this work is the<br />

modelling and generalisation of a large number of chemical states by means of ANN,<br />

with a view to complex combustion simulations.<br />

References<br />

[Pope] Pope, S.B. (1997): Computationally efficient implementation of combustion<br />

chemistry using in situ adaptive tabulation. Combustion Theory and Modelling, 1,<br />

41–63.<br />

Smith, G.P., Golden, D.M., Frenklach, M., Moriarty, N.W., Eiteneer, B., Goldenberg,<br />

M., Bowman, C.T., Hanson, R.K., Song, S., Gardiner, W.C., Jr., Lissianski, V.V.<br />

and Qin, Z.: GRI-Mech 3.0, http://www.me.berkeley.edu/gri_mech/.<br />
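The core mechanism described above, a supervised network trained by back-propagation to approximate a nonlinear mapping, can be sketched with a minimal one-hidden-layer network. The target function sin(3x) is a stand-in for a chemical lookup, not GRI 3.0 data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 1))
y = np.sin(3 * X)                      # stand-in for a tabulated chemical state

# one hidden tanh layer of 16 units, linear output
W1 = rng.normal(0, 0.5, (1, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.5, (16, 1)); b2 = np.zeros(1)
lr, losses = 0.1, []

for _ in range(5000):                  # full-batch gradient descent
    H = np.tanh(X @ W1 + b1)           # forward pass
    out = H @ W2 + b2
    err = out - y
    losses.append(float(np.mean(err ** 2)))
    g_out = 2 * err / len(X)           # back-propagation of the MSE gradient
    gW2 = H.T @ g_out; gb2 = g_out.sum(0)
    gH = (g_out @ W2.T) * (1 - H ** 2)
    gW1 = X.T @ gH; gb1 = gH.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2; W1 -= lr * gW1; b1 -= lr * gb1
```

After training, the mapping is stored entirely in the weights, which is the "lookup replacement" idea of the abstract.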

− 55 −


Model diagnostics of finite mixtures using<br />

bootstrapping<br />

Bettina Grün 1 and Friedrich Leisch 2<br />

1 Department für Statistik und Mathematik, Wirtschaftsuniversität Wien<br />

Augasse 2-6, 1090 Wien, Austria; Bettina.Gruen@wu-wien.ac.at<br />

2 Institut für Statistik, Ludwig-Maximilians-Universität München, Ludwigstraße<br />

33, D-80539 München, Germany; Friedrich.Leisch@stat.uni-muenchen.de<br />

Abstract. The EM algorithm provides a common framework for maximum likelihood<br />

estimation of finite mixture models. The fitted models can differ with respect<br />

to the component specific models and may also allow for concomitant variables to<br />

model the component weights. The use of resampling methods to analyze finite<br />

mixture models fitted with the EM algorithm is appealing, because the bootstrap<br />

similarly to the EM algorithm constitutes a common framework for these models.<br />

We will outline various possibilities to use resampling methods for model diagnostics<br />

such as for determining the number of components, checking model identifiability<br />

and analyzing the stability of induced clusterings.<br />

The R package flexmix implements the EM algorithm for ML estimation of finite<br />

mixture models. It provides the E-step and all data handling, while arbitrary mixture<br />

models can be fitted by modifying the M-step. The implementation of bootstrap<br />

techniques that allow for model diagnostics of models fitted with the package is<br />

presented.<br />

Key words: Bootstrap, Finite mixture, Model diagnostics, Resampling<br />

References<br />

Grün, B. and Leisch, F. (2004): Bootstrapping Finite Mixture Models. In: J. Antoch<br />

(Ed.): Compstat 2004—Proceedings in Computational Statistics. Springer,<br />

Heidelberg, 1115–1122.<br />

Hothorn, T., Leisch, F., Zeileis, A. and Hornik, K. (2005): The Design and Analysis<br />

of Benchmark Experiments. Journal of Computational and Graphical Statistics,<br />

14(3), 1–25.<br />

Leisch, F. (2004): FlexMix: A general framework for finite mixture models and latent<br />

class regression in R. Journal of Statistical Software, 11(8).<br />

McLachlan, G.J. (1987): On Bootstrapping the Likelihood Ratio Test Statistic for<br />

the Number of Components in a Normal Mixture. Applied Statistics, 36(3),<br />

318–324.<br />
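The resampling idea can be sketched outside R as well. The fragment below fits a deliberately minimal equal-variance two-component normal mixture by EM and bootstraps the larger component mean; it illustrates the bootstrap-around-EM pattern only and is unrelated to the flexmix implementation:

```python
import numpy as np

def em_2gauss(x, iters=200):
    """Minimal EM for a two-component univariate normal mixture with a shared
    variance (kept simple on purpose); returns (pi, means, sigma)."""
    mu = np.percentile(x, [25, 75]).astype(float)
    pi, s = 0.5, float(np.std(x))
    for _ in range(iters):
        d0 = np.exp(-0.5 * ((x - mu[0]) / s) ** 2)
        d1 = np.exp(-0.5 * ((x - mu[1]) / s) ** 2)
        r = pi * d1 / ((1 - pi) * d0 + pi * d1 + 1e-300)      # E-step
        pi = float(r.mean())                                   # M-step
        mu = np.array([np.sum((1 - r) * x) / np.sum(1 - r),
                       np.sum(r * x) / np.sum(r)])
        s = float(np.sqrt(np.sum((1 - r) * (x - mu[0]) ** 2
                          + r * (x - mu[1]) ** 2) / len(x)))
    return pi, mu, s

def bootstrap_se(x, stat, B=50, seed=0):
    """Nonparametric bootstrap standard error of a statistic of the EM fit."""
    rng = np.random.default_rng(seed)
    vals = [stat(rng.choice(x, size=len(x), replace=True)) for _ in range(B)]
    return float(np.std(vals))

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 150), rng.normal(4, 1, 150)])
se_mu = bootstrap_se(x, lambda xb: float(np.sort(em_2gauss(xb)[1])[1]))
```

Sorting the fitted means inside the statistic is a crude guard against label switching, one of the diagnostics issues the abstract alludes to.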

− 56 −


Classification with Regularized Kernel<br />

Mahalanobis-Distances<br />

Bernard Haasdonk 1 and Elżbieta Pękalska 2<br />

1 Institute of Numerical and Applied Mathematics, University of Münster,<br />

Germany haasdonk@math.uni-muenster.de<br />

2 School of Computer Science, University of Manchester, United Kingdom<br />

pekalska@cs.man.ac.uk<br />

Abstract. Linear discriminant analysis has been demonstrated to be successful in<br />

kernel-induced feature spaces. In particular, in terms of accuracy, the kernel Fisher<br />

discriminant (KFDA) can frequently compete with or even outperform the support<br />

vector machine (SVM) (Mika et al. 2000). In situations, where linear discrimination<br />

in kernel feature space is suboptimal, nonlinear techniques offer a better solution<br />

(Huang et al. 2005). An example is quadratic classification in the kernel space, based<br />

on kernelized versions of class-related Mahalanobis distances.<br />

In this presentation, we introduce two different formulations for quadratic classifiers<br />

in kernel-induced feature spaces, depending on whether the class-related covariance<br />

operator has to be regularized or not. Experimental results on a toy data set enable<br />

us to draw comparisons to SVM and KFDA. More importantly, these results provide<br />

a proof of principle that nonlinear discriminants can be beneficial in the kernel<br />

space.<br />

Key words: Kernel Methods, Quadratic Discriminants, Kernel Mahalanobis-Distance<br />

References<br />

Mika, S., Rätsch, G., Schölkopf, B., Smola, A., Weston, J. and Müller, K.-R. (2000):<br />

“Invariant feature extraction and classification in kernel spaces.” In S.A. Solla,<br />

T.K. Leen, and K.-R. Müller (Eds.): Advances in Neural Information Processing<br />

Systems 12. MIT Press, Cambridge, MA, 526–532.<br />

Huang, S.-Y., Hwang, C.-R. and Lin, M.-H. (2005): “Kernel Fisher’s discriminant<br />

analysis in Gaussian reproducing kernel Hilbert space,” Academia Sinica,<br />

Taipei, Taiwan, Technical Report.<br />
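In the plain (non-kernelized) input space, a regularized class-related Mahalanobis classifier reduces to a familiar quadratic discriminant. The sketch below shows that baseline, with the `reg` ridge term standing in for the covariance regularization discussed above; the paper's actual contribution, the kernelized version, is not reproduced here:

```python
import numpy as np

def fit_class(X, reg=1e-2):
    """Per-class mean, regularized precision matrix and log-determinant."""
    mu = X.mean(axis=0)
    cov = np.cov(X.T) + reg * np.eye(X.shape[1])   # ridge-regularized covariance
    return mu, np.linalg.inv(cov), np.linalg.slogdet(cov)[1]

def predict(x, classes):
    """Quadratic discriminant: smallest Mahalanobis distance plus log|Sigma|."""
    scores = [(x - mu) @ P @ (x - mu) + ld for mu, P, ld in classes]
    return int(np.argmin(scores))

rng = np.random.default_rng(0)
A = rng.normal(0, 1, (100, 2))
B = rng.normal(5, 1, (100, 2))
classes = [fit_class(A), fit_class(B)]
```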

− 57 −


On classification of species of representation<br />

rings<br />

Lothar Häberle<br />

Department of Biometry and Epidemiology, University of Erlangen-Nuremberg,<br />

Waldstr. 6, 91054 Erlangen<br />

Abstract. In biology and chemistry crystal structures and symmetries of molecules,<br />

for example, are classified by mathematical groups. The assigned groups can then<br />

be used to determine physical properties such as polarity and chirality.<br />

Representations of groups as linear transformations of vector spaces and, more<br />

generally, as modules enable many group-theoretical problems to be reduced to problems<br />

of linear algebra, which is a well-understood theory. Defining addition and<br />

multiplication via direct sum and tensor product on the set of these modules and<br />

then considering them as elements of a ring, the representation ring, is an approach<br />

to examine such modules. In order to investigate representation rings one may study<br />

their structure preserving maps to the complex numbers, which are called species.<br />

We consider finite groups whose largest subgroup of prime power order is cyclic<br />

for some prime number and study the corresponding representation ring. The indecomposable<br />

modules are stated and the species are classified. The proposed way of<br />

classification may be applied to other classes of groups in the future and then be used<br />

in natural sciences. Throughout the paper we illustrate the theoretical statements<br />

with examples.<br />

Key words: mathematical group, species, representation ring, indecomposable<br />

module<br />

References<br />

Benson, D.J. (1991): Representations and Cohomology I. Cambridge University Press.<br />

Fotsing, B. and Külshammer, B. (2005): Modular species and prime ideals for the<br />

ring of monomial representations of a finite group. Communications in Algebra,<br />

33, 3667–3677.<br />

Green, J.A. (1962): The modular representation algebra of a finite group. Illinois<br />

Journal of Mathematics, 6, 607–619.<br />

Häberle, L. (submitted): The species and idempotents of the Green algebra of a finite<br />

group with a cyclic Sylow subgroup.<br />

Shriver, D.F. and Atkins, P.W. (2006): Inorganic Chemistry. Oxford University<br />

Press.<br />

− 58 −


Analysis of High-Resolution Scattered-Light Data Using<br />

Methods of Multivariate Statistics<br />

Cornelius Hahlweg and Hendrik Rothe<br />

Helmut-Schmidt-University<br />

Hamburg<br />

Abstract. The development of scattered-light measurement techniques has advanced<br />

over the past decades, particularly with a view to their use in the quality inspection<br />

of surfaces. Scattered-light methods prove especially powerful for surface inspection,<br />

since they work without contact, probe a scalable section of the surface and<br />

thereby resolve the finest surface structures.<br />

In principle, scattered-light distributions provide spectral information about the<br />

properties of the surface under investigation. While for very smooth surfaces they<br />

indeed reflect the surface function itself, for surfaces above the so-called Rayleigh<br />

limit methods of multivariate statistics prove useful. In particular, the higher<br />

moments of the scattering distribution can serve as features for classification<br />

procedures. While these moments were proposed in earlier publications as a<br />

description of the scattering distribution itself on rather heuristic grounds, a<br />

physical meaning can now also be assigned to them. For the preprocessing and<br />

reduction of the often very large two-dimensional data sets, principal component<br />

analysis (PCA) is applied first. For the classification of different samples, e.g.<br />

in the sense of quality control, linear canonical discriminant analysis is used.<br />

The contribution gives an insight into the fundamentals of the methods employed<br />

in relation to scattered-light measurement and offers examples from their<br />

application to the classification of technical surfaces.<br />


− 59 −


Algorithms for Computing the Multivariate<br />

Isotonic Regression<br />

Jürgen Hansohm 1<br />

University of the Federal Armed Forces, Munich, Germany<br />

Juergen.Hansohm@UniBw-Muenchen.de<br />

Abstract. Sasabuchi et al. (1983) introduced a multivariate version of the well-known<br />

univariate isotonic regression, which plays a key role in the field of statistical<br />

inference under order restrictions. Their proposed algorithm for computing the multivariate<br />

isotonic regression, however, is guaranteed to converge only under special<br />

conditions (Sasabuchi et al. 2003). In this paper, a more general framework for multivariate<br />

isotonic regression is given and an algorithm based on Dykstra’s method is<br />

used to compute the multivariate isotonic regression. Two numerical examples are<br />

given to illustrate the algorithm and to compare the results with the Monte Carlo<br />

simulation published by Fernando and Kulatunga (2007).<br />

Key words: multivariate isotonic regression, projection, Dykstra’s algorithm, partial<br />

order, least squares solution<br />

References<br />

Sasabuchi, S., Inutsuka, M. and Kulatunga, D.D.S. (1983): A multivariate version of<br />

isotonic regression. Biometrika, 70, 465–472.<br />

Sasabuchi, S., Miura, T. and Oda, H. (2003): Estimation and test of several multivariate<br />

normal means under an order restriction when the dimension is larger than two.<br />

Journal of Statistical Computation and Simulation, 73, 619–641.<br />

Fernando, W.T.P.S. and Kulatunga, D.D.S. (2007): On the computation and some<br />

applications of multivariate isotonic regression. Computational Statistics and Data<br />

Analysis, 52, 702–712.<br />
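Dykstra's method itself is easy to state: cycle through the constraint sets, projecting with a per-set correction increment, and the iterates converge to the projection onto the intersection. A sketch on two simple convex sets (a box and a hyperplane, chosen for illustration rather than an isotonic cone):

```python
import numpy as np

def dykstra(x0, projections, iters=1000):
    """Dykstra's algorithm: cyclic projections with correction increments.
    Unlike plain alternating projections, it converges to the *projection*
    of x0 onto the intersection of the sets, not just to some feasible point."""
    x = np.asarray(x0, float)
    p = [np.zeros_like(x) for _ in projections]
    for _ in range(iters):
        for i, proj in enumerate(projections):
            y = proj(x + p[i])
            p[i] = x + p[i] - y      # remember what the projection removed
            x = y
    return x

proj_box = lambda z: np.clip(z, 0.0, 1.0)                 # onto [0, 1]^n
proj_plane = lambda z: z - (z.sum() - 1.0) / len(z)       # onto {sum z = 1}
x = dykstra(np.array([2.0, -1.0, 0.5]), [proj_box, proj_plane])
```

For the paper's setting the box and plane would be replaced by projections onto order-restriction cones, but the correction mechanism is identical.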

− 60 −


Precise and Efficient Recognition of<br />

Medical Request Forms<br />

Uwe Henker 1 , Alfred Ultsch 2 , and Uwe Petersohn 3<br />

1 DOCexpert Computer GmbH, Bamberg; u.henker@docexpert.de<br />

2 Databionics Research Group, Philipps-University of Marburg, Germany;<br />

ultsch@informatik.uni-marburg.de<br />

3 Institute of Artificial Intelligence, TU Dresden; peterson@inf.tu-dresden.de<br />

Abstract. Forms for requesting medical and/or diagnostic services play a major<br />

role in current medical practice. Such forms are used to request medical or<br />

laboratory services for a patient that may be life-critical. The requests entered<br />

into such a form by the physician by hand are transferred to the laboratory or<br />

hospital information systems by machine recognition methods (Optical Mark<br />

Recognition, OMR). Here, a changing set of different forms (prototypes) has to<br />

be assumed, all of which must be recognized reliably.<br />

The contribution describes the knowledge representation of such forms in a case<br />

base of prototypes by means of case-based reasoning (CBR). The central idea is<br />

to compare the preprocessed and abstracted images of scanned forms with the<br />

prototypes in a way that tolerates errors. If a new prototype with a large overlap<br />

with existing prototypes is added to the knowledge base, additional decision<br />

knowledge for the form classifier is represented in the knowledge base in a<br />

multi-stage procedure.<br />

The approach achieves a recognition rate of 97% with no false-positive cases.<br />

Compared to other published approaches, a substantial improvement in recognition<br />

performance can be observed. In particular, the special requirement that no<br />

false-positive results be produced is met in full. The system was tested with real<br />

forms which, in the current application, carry a unique feature (a barcode) for<br />

identification. The need for this barcode on every form is a considerable<br />

restriction, which the approach described here removes.<br />

Key words: Classification, Knowledge Representation, Optical Mark Recognition,<br />

Image Processing, Medical Information Systems<br />

− 61 −


Using cluster analysis for species delimitation<br />

Christian Hennig 1 and Bernhard Hausdorf 2<br />

1 Department of Statistical Science, University College London, Gower St, London<br />

WC1E 6BT, United Kingdom chrish@stats.ucl.ac.uk<br />

2 Zoologisches Museum der Universität Hamburg, Martin-Luther-King-Platz 3,<br />

20146 Hamburg, Germany hausdorf@zoologie.uni-hamburg.de<br />

Abstract. Species delimitation is a fundamental task in biology. Operationally,<br />

species can be conceived as continuously varying groups of organisms that are separate<br />

from other such groups. This suggests methods of cluster analysis to delimit<br />

species empirically for given data. However, in the literature there is no agreement<br />

about the species concept (see Mayden, 1997, for an overview), which affects the<br />

choice of the appropriate data, cluster analysis method, and the interpretation of<br />

the results. A particular problem arises because of the hierarchical nature of evolution.<br />

Clusters occur at many levels and may represent, besides species, intrapopulation<br />

polymorphisms, populations, regional variation or higher taxa. We present<br />

a methodology for delimiting putative species based on codominant and dominant<br />

genetic markers. The method combines the definition of an appropriate dissimilarity<br />

measure, multidimensional scaling and model-based cluster analysis. We propose a<br />

null model taking into account spatial autocorrelation in order to check whether<br />

inhomogeneities in the data can be explained from regional variation alone. The<br />

methodology is a generalization of the techniques presented in Hennig and Hausdorf<br />

(2004) to categorical genetic data. The methodology is compatible with most species<br />

concepts. We discuss some general issues such as the choice of the clustering method<br />

and the joining of poorly separated clusters, which indicate inhomogeneity at<br />

levels lower than the species level.<br />

Key words: Model-based cluster analysis, genotypes, spatial autocorrelation<br />

References<br />

Hennig, C. and Hausdorf, B. (2004): Distance-based parametric bootstrap tests for<br />

clustering of species ranges. Computational Statistics and Data Analysis, 45,<br />

875–896.<br />

Mayden, R.L. (1997): A hierarchy of species concepts: the denouement in the saga<br />

of the species problem. In: M.F. Claridge, H.A. Dawah, M.R. Wilson (Eds.):<br />

The Units of Biodiversity. Chapman and Hall, London, 381–424.<br />
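One building block of the pipeline, classical (Torgerson) multidimensional scaling, can be sketched directly from a dissimilarity matrix; this is a generic sketch, not the authors' implementation:

```python
import numpy as np

def classical_mds(D, k=2):
    """Torgerson scaling: embed an n x n dissimilarity matrix D in k
    dimensions via double-centring and an eigendecomposition, the step
    between the genetic dissimilarity measure and model-based clustering."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J            # double-centred squared distances
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:k]       # largest eigenvalues first
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))
```

For genuinely Euclidean dissimilarities the embedding reproduces the distances exactly; for genetic dissimilarities it gives the approximate configuration that is then clustered.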

− 62 −


Nonlinear Effects in PLS Path Models:<br />

A Comparison of Available Approaches<br />

Jörg Henseler 1<br />

Institute of Management Research, Radboud University Nijmegen, Thomas van<br />

Aquinostraat 1, 6525 GD Nijmegen, The Netherlands, J.Henseler@fm.ru.nl<br />

Summary. Along with the development of scientific disciplines, researchers in business<br />

and social sciences are increasingly interested in investigating nonlinear effects<br />

between latent variables. In this contribution, I present four approaches to modeling<br />

nonlinear effects with PLS: Firstly, Wold’s (1982) original approach takes the nonlinearity<br />

in the structural model into account during the iterative PLS algorithm.<br />

Secondly, the product indicator approach developed by Chin, Marcolin, and Newsted<br />

(2003) requires that the nonlinear function be applied a priori on the indicator<br />

level. Thirdly, a two-stage approach as suggested by Henseler and Fassott (<strong>2008</strong>) estimates<br />

the nonlinear effect a posteriori once the latent variable scores are estimated<br />

by means of the linear effects PLS path model. Fourthly, I adapt an orthogonalizing<br />

approach originally suggested by Little, Bovaird, and Widaman (2006) to nonlinear<br />

PLS path modeling. Finally, I compare the performance of these four approaches<br />

by means of a Monte Carlo simulation, and derive guidelines for users of PLS path<br />

modeling.<br />

Key words: partial least squares, PLS path modeling, nonlinear terms<br />

References<br />

Chin, W. W., Marcolin, B. L., and Newsted, P. N. (2003): A Partial Least Squares<br />

Latent Variable Modeling Approach for Measuring Interaction Effects: Results<br />

from a Monte Carlo Simulation Study and an Electronic-mail Emotion/Adoption<br />

Study. Information Systems Research, 14, 189–217.<br />

Henseler, J. and Fassott, G. (<strong>2008</strong>): Testing Moderating Effects in PLS Path<br />

Models: An Illustration of Available Procedures. In: V. E. Vinzi, W. W. Chin, J.<br />

Henseler, and H. Wang (Eds.): Handbook Partial Least Squares Path Modeling.<br />

Springer, Heidelberg, forthcoming.<br />

Little, T. D., Bovaird, J. A., and Widaman, K. F. (2006): On the Merits of<br />

Orthogonalizing Powered and Product Terms: Implications for Modeling Interactions<br />

Among Latent Variables, Structural Equation Modeling, 13, 497–519.<br />

Wold, H. (1982): Soft Modeling. The Basic Design and Some Extensions. In: K.<br />

G. Jöreskog and H. Wold (Eds.): Systems under Indirect Observation. Causality,<br />

Structure, Prediction, Part I. North-Holland, Amsterdam, 1–54.<br />
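The second stage of such a two-stage approach is ordinary regression on latent-variable scores plus their product; the sketch below assumes the scores xi and eta have already been obtained from a linear PLS run, and is an illustration of the idea rather than any package's implementation:

```python
import numpy as np

def two_stage_interaction(xi, eta, y):
    """Stage 2 of a two-stage moderation analysis: regress y on the latent
    scores xi, eta and their product to estimate the nonlinear (interaction)
    effect a posteriori.  Returns [intercept, b_xi, b_eta, b_interaction]."""
    X = np.column_stack([np.ones_like(xi), xi, eta, xi * eta])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# synthetic scores with a known interaction effect of 0.5
rng = np.random.default_rng(0)
xi = rng.normal(size=500)
eta = rng.normal(size=500)
y = 1.0 + 2.0 * xi - 1.0 * eta + 0.5 * xi * eta + 0.1 * rng.normal(size=500)
beta = two_stage_interaction(xi, eta, y)
```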

− 63 −


Classification of text processing components:<br />

The Tesla Role System<br />

Jürgen Hermes and Stephan Schwiebert<br />

Linguistic Data Processing, Department of Linguistics, University of Cologne<br />

{jhermes, sschwieb}@spinfo.uni-koeln.de<br />

Abstract. The analysis of sequences of discrete tokens (i.e., texts) is a major research<br />

subject of several essentially different sciences such as corpus linguistics,<br />

literary studies and bioinformatics. Though differing in both the data and their interpretation,<br />

these sciences share some intermediate steps. Following these considerations, the obvious<br />

procedure is to encapsulate text processing tasks into components and create a<br />

framework that enables component interaction. The component arrangement within<br />

a workflow is comparable to an experimental setup: it allows a gradual modification<br />

of experiments, e.g., rerunning an experiment with a modified configuration or a<br />

replaced component.<br />

The Text Engineering Software Laboratory (Tesla) is an implementation of a<br />

framework that supports the development and deployment of text processing components<br />

as well as the execution of experiments on textual data. One of its main<br />

ideas is reducing the framework’s restrictions on data modeling to a minimum, allowing<br />

developers to focus on their scientific tasks. However, this results in new issues:<br />

an extensible way of database access definition, data exchange between components<br />

and data conversion during visualization. If, for instance, the annotations produced<br />

by a component cannot be related sequentially to single text elements but do instead<br />

represent more complex relations between these elements, as generally in graphs or<br />

matrices, the information contained in such data types can only be extracted with<br />

knowledge about their internal structure and its meaning, thus violating a basic<br />

principle of component frameworks.<br />

Addressing these concerns, the concept of a role is introduced in Tesla. A role<br />

adopted by a component specifies the type as well as the access methods of the<br />

produced data. As the role system implicitly exhibits a hierarchical structure, this<br />

finally leads to a dynamic classification of text processing components.<br />

− 64 −


Strengths and Weaknesses of Ant Colony<br />

Clustering<br />

Lutz Herrmann and Alfred Ultsch<br />

Databionics Research Group<br />

University of Marburg, Germany<br />

{lherrmann,ultsch}@informatik.uni-marburg.de<br />

Abstract. Ant colony clustering (ACC) is a promising nature-inspired technique<br />

where stochastic agents perform the task of clustering high-dimensional data on a<br />

low-dimensional output space. Most ACC methods are derivatives of the approach<br />

proposed by Lumer and Faieta. These methods usually perform poorly in terms<br />

of topographic mapping and cluster formation. In particular when compared to<br />

clustering on Emergent Self-Organizing Maps (ESOM).<br />

In order to address this issue, an unifying representation for both ACC methods<br />

and Emergent Self-Organizing Maps is derived in a brief yet formal manner. ACC<br />

terms are related to corresponding mechanisms of the Self-Organizing Map. This<br />

leads to insights on both algorithms. ACC methods are considered first-degree relatives of<br />

the ESOM. This explains benefits and shortcomings of ACC and ESOM. Furthermore,<br />

the proposed unification allows one to judge whether modifications improve an<br />

algorithm’s clustering abilities or not. This is demonstrated using a set of cardinal<br />

clustering problems.<br />
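The pick/drop dynamics common to Lumer–Faieta-style ACC methods can be sketched as follows; the threshold constants k1, k2 and the similarity scale alpha below are illustrative defaults, not values taken from the abstract or the cited papers.<br />

```python
def neighbourhood_density(item, patch, dist, alpha=0.5, s=3):
    """Perceived density f(i): average similarity of `item` to the data
    items on the ant's s x s grid patch, clipped at zero."""
    total = sum(1.0 - dist(item, other) / alpha for other in patch)
    return max(0.0, total / (s * s))

def pick_drop_probabilities(f, k1=0.1, k2=0.15):
    """Lumer-Faieta pick-up and drop probabilities given density f."""
    p_pick = (k1 / (k1 + f)) ** 2
    p_drop = 2.0 * f if f < k2 else 1.0
    return p_pick, p_drop
```

An isolated item (f = 0) is picked up with probability 1 and never dropped, while items in dense neighbourhoods of similar items are dropped with high probability; this is the mechanism that slowly aggregates clusters on the low-dimensional output grid.<br />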

Key words: Clustering, Emergent Self-Organizing Maps, Swarm Intelligence<br />

References<br />

Deneubourg, J.-L., Goss, S., Franks, N., Sendova-Franks, A., Detrain, C. and Chretien,<br />

L. (1991): The dynamics of collective sorting: Robot-like ants and ant-like<br />

robots. In: Proc. of the First International Conference on Simulation of Adaptive<br />

Behaviour: From Animals to Animats 1. MIT Press, Cambridge, 356-365.<br />

Handl, J., Knowles, J. and Dorigo, M. (2005): Ant-Based Clustering and Topographic<br />

Mapping. Artificial Life 12(1), MIT Press, Cambridge.<br />

Kohonen, T. (1995): Self-Organizing Maps. Springer, Berlin, Heidelberg, New York.<br />

Lumer, E. and Faieta, B. (1994): Diversity and adaption in populations of clustering<br />

ants. In: Proc. of the Third International Conference on Simulation of Adaptive<br />

Behaviour: From Animals to Animats 3. MIT Press, Cambridge, 501–508.<br />

Ultsch, A. and Herrmann, L. (2005): The architecture of emergent self-organizing<br />

maps to reduce projection errors. In: Verleysen M. (Eds): Proc. of the European<br />

Symposium on Artificial Neural Networks (ESANN 2005).<br />

− 65 −


Reconstructing Central Places and Settlements<br />

Groups<br />

Irmela Herzog<br />

The Rhineland Regional Council / The Rhineland Commission for Archaeological<br />

Monuments and Sites<br />

Bonn, Germany<br />

i.herzog@LVR.de<br />

Abstract. If (i) the location of settlements is known for a certain period in time<br />

and (ii) the settlements are distributed in such a way that cluster centres with high<br />

settlement densities are present, then a variant of the density clustering algorithm<br />

using basin spanning trees can be applied to (i) reconstruct the location of the<br />

cluster centres and (ii) group the settlements. The model of this approach is based<br />

on the assumption that the exchange rate of products is high where the settlements<br />

are close to each other and/or the settlement size is large. If people living in a<br />

settlement with a low exchange rate wanted to buy or sell something they would<br />

walk to one of the settlements with higher exchange rates in their neighbourhood.<br />

This can be modelled by several variants of the density clustering algorithm using<br />

basin spanning trees. Which of the neighbouring settlements is determined as the<br />

preferred location for product exchange depends on the algorithm variant chosen.<br />

This method is used to reconstruct trade networks, and all settlements connected<br />

by direct or indirect trade links constitute a group. While extending the original<br />

clustering algorithm to support different settlement sizes could be accomplished<br />

easily, the adjustments needed to take into account the costs of walking between<br />

two locations in prehistoric times are by no means trivial. Examples from the river<br />

Main area with Bronze and Iron Age settlements will be presented.<br />
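The tree-building core of density clustering with basin spanning trees can be sketched as below: every site links to its closest neighbour with a higher score, and link chains end at local maxima, the candidate centres. The score function (standing in for the exchange-rate model) and the neighbour relation are placeholders for the variants discussed in the talk.<br />

```python
def basin_forest(points, score, neighbours):
    # Link every site to its closest neighbour with a higher score;
    # sites without a better neighbour become roots (candidate centres).
    # `neighbours(p)` yields (site, distance) pairs.
    parent = {}
    for p in points:
        better = [(d, q) for q, d in neighbours(p) if score[q] > score[p]]
        parent[p] = min(better)[1] if better else p
    return parent

def settlement_groups(parent):
    # Sites whose parent chains end at the same root form one group.
    def root(p):
        while parent[p] != p:
            p = parent[p]
        return p
    groups = {}
    for p in parent:
        groups.setdefault(root(p), []).append(p)
    return groups
```

All sites reachable along parent links belong to the basin of one root, which plays the role of the reconstructed central place.<br />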

References<br />

Hader, S. and Hamprecht, F. A. (2003): Efficient Density Clustering Using Basin Spanning<br />

Trees. In: M. Schader, W. Gaul, M. Vichi (Eds.): Between Data Science and<br />

Applied Data Analysis. Studies in Classification, Data Analysis, and Knowledge<br />

Organization. Springer, Berlin, Heidelberg, New York, 39–48.<br />

− 66 −


On the prognostic value of gene expression<br />

signatures for censored data<br />

Thomas Hielscher, Manuela Zucknick, Wiebke Werft and Axel Benner<br />

Division of Biostatistics, German Cancer Research Center, Heidelberg, Germany<br />

t.hielscher@dkfz.de<br />

Abstract. As part of the validation of a new gene expression signature it is good<br />

statistical practice to quantify the amount of prognostic information represented by<br />

the signature. Open questions are how to measure the gain in prognostic information<br />

compared to established clinical parameters or biomarkers and the additional<br />

predictive accuracy especially when dealing with censored data. To answer these<br />

questions it is required to use consistent and interpretable measures.<br />

Several measures of prediction accuracy and proportion of explained variation<br />

have been suggested for right-censored event times. The underlying mechanisms of<br />

these measures are as different as the use of Schoenfeld residuals, model likelihoods<br />

or the variation of the individual survival curves. Consequently, these measures vary<br />

in their assumptions and properties and it remains unclear under which conditions<br />

and to which extent they are comparable. Moreover, explained variation for survival<br />

data can be considered as a function of time and therefore strongly depends on the<br />

available follow-up time and the time range of interest.<br />

We present a comparison of several common measures such as the Brier Score<br />

(Graf et al., 1999), the V measure (Schemper and Henderson, 2000) and the method<br />

of O’Quigley and Xu (2001) to illustrate their application to simulated and real<br />

clinical data. A presentation of existing and possible approaches to estimate the<br />

variability of these measures will be provided. An overview of available software<br />

implementations in R will be given.<br />
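As an illustration, the expected Brier score of Graf et al. (1999) at a fixed time t can be computed with inverse-probability-of-censoring weights; in this sketch the censoring survival function G (in practice a Kaplan–Meier estimate) is assumed to be supplied by the caller.<br />

```python
def brier_score(t, times, events, surv_prob, G):
    # Expected Brier score at time t (Graf et al., 1999) with inverse-
    # probability-of-censoring weights.  surv_prob[i] is the model's
    # predicted P(T_i > t); G is the censoring survival function.
    total = 0.0
    for Ti, di, S in zip(times, events, surv_prob):
        if Ti <= t and di == 1:      # event before t: true status is 0
            total += (0.0 - S) ** 2 / G(Ti)
        elif Ti > t:                 # still event-free at t: status is 1
            total += (1.0 - S) ** 2 / G(t)
        # observations censored before t contribute weight zero
    return total / len(times)
```

With no censoring G is identically 1 and the expression reduces to the ordinary Brier score of the binary survival status at t.<br />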

Key words: Survival, Predictive Accuracy, Gene Expression<br />

References<br />

Graf, E., Schmoor, C., Sauerbrei, W., and Schumacher, M. (1999): Assessment<br />

and comparison of prognostic classification schemes for survival data. Statistics<br />

in Medicine,18, 2529–2545.<br />

O’Quigley, J. and Xu, R. (2001): Explained variation in proportional hazards regression.<br />

In: J. Crowley and D.P. Ankerst (Eds.): Handbook of Statistics in Clinical<br />

Oncology, Second Edition. Chapman & Hall/CRC Press, 347–363.<br />

Schemper, M. and Henderson, R. (2000): Predictive accuracy and explained variation<br />

in Cox regression. Biometrics, 56, 249–255.<br />

− 67 −


Likelihood ratio testing for hidden Markov<br />

models<br />

Hajo Holzmann and Jörn Dannemann<br />

University of Karlsruhe<br />

Germany<br />

Abstract. When a mixture arises as the marginal distribution of a stationary process,<br />

the dependency structure can be incorporated by assuming that the underlying<br />

regime forms a finite state Markov chain. This leads to the class of hidden Markov<br />

models (HMMs), which are also called Markov dependent mixtures. We shall discuss<br />

maximum likelihood inference in HMMs. In particular, we investigate the problem<br />

of testing for the number of states via the likelihood ratio test (LRT). We propose<br />

a modified LRT for two against more states in an HMM, which is based on the<br />

so-called likelihood function under the independence assumption, and derive its asymptotic<br />

distribution under the null hypothesis. Simulation results and applications to<br />

financial and biological time series illustrate the practical use of the methods.<br />
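The likelihood at the heart of the LRT can be evaluated with the standard scaled forward recursion; this sketch takes the state densities as plain callables and illustrates the generic HMM likelihood, not the modified test statistic proposed here.<br />

```python
import math

def hmm_loglik(obs, pi, A, dens):
    # Scaled forward recursion: alpha is renormalised at every step and
    # the log normalisation constants accumulate to the log-likelihood.
    # pi: initial distribution, A[i][j]: transition probabilities,
    # dens[j](x): state-j emission density evaluated at x.
    m = len(pi)
    alpha = [pi[j] * dens[j](obs[0]) for j in range(m)]
    c = sum(alpha)
    loglik = math.log(c)
    alpha = [a / c for a in alpha]
    for x in obs[1:]:
        alpha = [dens[j](x) * sum(alpha[i] * A[i][j] for i in range(m))
                 for j in range(m)]
        c = sum(alpha)
        loglik += math.log(c)
        alpha = [a / c for a in alpha]
    return loglik
```

The scaling avoids the numerical underflow of the plain forward algorithm on long time series.<br />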

− 68 −


Rule-Based Learning of Reliable Classifiers<br />

Jens Hühn and Eyke Hüllermeier<br />

Department of Mathematics and Computer Science, University of Marburg<br />

{huehnj,eyke}@mathematik.uni-marburg.de<br />

Abstract. This paper introduces a fuzzy rule-based classification method called<br />

FR3, which is short for Fuzzy Round Robin RIPPER. As the name suggests, FR3<br />

builds upon the RIPPER algorithm, a state-of-the-art rule learner. More specifically,<br />

in the context of polychotomous classification, it uses a fuzzy extension of RIPPER<br />

as a base learner within a round robin scheme and, thus, can be seen as a fuzzy<br />

variant of the R3 learner that has recently been introduced in the literature. A<br />

key feature of FR3, in comparison with its non-fuzzy counterpart, is its ability to<br />

represent different facets of uncertainty involved in a classification decision in a more<br />

faithful way. FR3 thus provides the basis for implementing “reliable classifiers” that<br />

may abstain from a decision when not being sure enough, or at least indicate that<br />

a classification is not fully supported by the empirical evidence at hand. Besides,<br />

our experimental results show that FR3 outperforms R3 in terms of classification<br />

accuracy and, therefore, suggest that it produces predictions that are not only more<br />

reliable but also more accurate.<br />

Key words: Machine learning, classification, rule induction, uncertainty, fuzzy sets.<br />

References<br />

William W. Cohen (1995). Fast effective rule induction. In Armand Prieditis and<br />

Stuart Russell, editors, Proceedings of the 12th International Conference on<br />

Machine Learning, pages 115–123, Tahoe City, CA. Morgan Kaufmann.<br />

Johannes Fürnkranz (2003). Round robin ensembles. Intell. Data Anal., 7(5):385–<br />

403.<br />

Eyke Hüllermeier and Klaus Brinker (<strong>2008</strong>). Learning valued preference structures<br />

for solving classification problems, Fuzzy Sets and Systems (to appear).<br />

− 69 −


Combining Predictions in Pairwise<br />

Classification: An Adaptive Voting Strategy<br />

and Its Relation to Weighted Voting<br />

Eyke Hüllermeier and Stijn Vanderlooy<br />

Department of Mathematics and Computer Science, University of Marburg<br />

{eyke,vanderlooy}@mathematik.uni-marburg.de<br />

Abstract. Learning by pairwise comparison is a well-known decomposition technique<br />

which allows one to transform a polychotomous classification problem into<br />

a number of binary problems. To aggregate the predictions from the ensemble of<br />

binary models into a final classification, various aggregation strategies have been<br />

proposed. The most commonly used strategy is weighted voting, in which the prediction<br />

of each model is counted as a (weighted) “vote” for a class label, and the<br />

class with the highest sum of votes is predicted as the label of the query instance.<br />

Even though weighted voting turned out to perform very well in practice, it remains<br />

ad-hoc to some extent and lacks a sound theoretical basis.<br />

In this regard, the current paper makes the following contributions. First, we<br />

propose a formal framework in which the aforementioned aggregation problem can be<br />

studied in a convenient way. This framework is based on the setting of label ranking<br />

which has recently received attention in the machine learning literature. Second,<br />

within this framework, we develop a new aggregation strategy called adaptive voting.<br />

This strategy allows one to take the strength of individual learners into consideration<br />

and, under certain assumptions, is provably optimal in the sense that it yields a MAP<br />

prediction of the class label. Thirdly, we show that weighted voting can be seen as<br />

an approximation of adaptive voting and, hence, approximates a MAP prediction.<br />

This theoretical justification of weighted voting is confirmed by strong empirical<br />

evidence showing that it is (at least) competitive in practice.<br />
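For reference, the weighted-voting baseline against which adaptive voting is compared can be sketched as follows; adaptive voting itself additionally adjusts the votes for the strength of the individual binary learners and is not reproduced here.<br />

```python
def weighted_voting(pairwise, labels):
    # pairwise[(a, b)] is the learned estimate s_ab in [0, 1] that the
    # query belongs to class a rather than b; the complement 1 - s_ab
    # counts as a vote for b.  The label with the highest vote sum wins.
    votes = {c: 0.0 for c in labels}
    for (a, b), s in pairwise.items():
        votes[a] += s
        votes[b] += 1.0 - s
    return max(votes, key=votes.get)
```

With 0/1 predictions this reduces to plain (unweighted) voting over all pairwise comparisons.<br />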

Key words: Machine learning, pairwise classification, weighted voting, label ranking,<br />

MAP prediction.<br />

− 70 −


Using Cluster Networks to Represent<br />

Non-Compatible Sets of Clusters<br />

Daniel H. Huson and Regula Rupp<br />

Center for Bioinformatics ZBIT, Tübingen University, Sand 14, 72076 Tübingen,<br />

Germany<br />

{huson,rupp}@informatik.uni-tuebingen.de<br />

Abstract. A set of clusters is called compatible (or hierarchical), if it can be represented<br />

by a rooted tree. In many applications, such as multiple gene phylogenetic<br />

analysis, sets of clusters arise that are not compatible and the question arises how<br />

to represent such sets in a useful way, in particular emphasizing parts of the cluster<br />

system that are tree-like and where the incompatibilities lie.<br />

The result of a multiple gene tree analysis is usually a number of different tree<br />

topologies that are each supported by a significant proportion of the genes. We<br />

introduce the concept of a cluster network that can be used to combine such trees<br />

into a single rooted network, which can be drawn either as a cladogram or phylogram.<br />

In contrast to split networks, which can grow exponentially in the size of the input,<br />

cluster networks grow only quadratically. A cluster network is easily computed using<br />

a modification of the tree-popping algorithm, which we call network-popping. The<br />

approach will be made available as part of the Dendroscope tree-drawing program<br />

and its application will be illustrated using data and results from recent studies on<br />

large numbers of gene trees.<br />
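The notion of compatibility used above has a simple operational form: two clusters are compatible iff they are disjoint or nested, and a cluster set is representable by a rooted tree iff all pairs are compatible. A minimal sketch of this check (the network-popping construction itself is not reproduced here):<br />

```python
def compatible(c1, c2):
    # Two clusters are compatible iff they are disjoint or nested.
    c1, c2 = set(c1), set(c2)
    return c1.isdisjoint(c2) or c1 <= c2 or c2 <= c1

def is_hierarchy(clusters):
    # A cluster set can be represented by a rooted tree iff every
    # pair of clusters is compatible.
    cl = [set(c) for c in clusters]
    return all(compatible(a, b) for i, a in enumerate(cl) for b in cl[i + 1:])
```

Incompatible pairs are exactly the places where a cluster network needs reticulate (non-tree) edges.<br />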

Key words: clusters, networks, trees, phylogenetics<br />

References<br />

D.H. Huson and D. Bryant. Application of phylogenetic networks in evolutionary<br />

studies. Molecular Biology and Evolution, 23:254–267, 2006. Software available<br />

from www.splitstree.org.<br />

D.H. Huson, D.C. Richter, C. Rausch, T. Dezulian, M. Franz, and R. Rupp. Dendroscope:<br />

An interactive viewer for large phylogenetic trees. BMC Bioinformatics,<br />

8:460, doi:10.1186/1471-2105-8-460, 2007. Software available from<br />

www.dendroscope.org.<br />

− 71 −


Genome phylogeny based on short-range<br />

correlations in DNA sequences<br />

Marc-Thorsten Hütt<br />

Jacobs University Bremen<br />

School of Engineering and Science<br />

Campus Ring 1<br />

m.huett@jacobs-university.de<br />

Abstract. The surprising fact that global statistical properties computed on a<br />

genome-wide scale may reveal species information was first observed in studies<br />

of dinucleotide frequencies. In this presentation I will look at the same phenomenon<br />

with a totally different statistical approach. We show that patterns in the short-range<br />

statistical correlations in DNA sequences serve as evolutionary fingerprints of<br />

eukaryotes. All chromosomes of a species display the same characteristic pattern,<br />

markedly different from those of other species. The chromosomes of a species are<br />

sorted onto the same branch of a phylogenetic tree due to this correlation pattern.<br />

The average correlation between nucleotides at a distance k is quantified in two independent<br />

ways: (i) by estimating it from a higher-order Markov process and (ii) by<br />

computing the mutual information function at a distance k. We show how the quality<br />

of phylogenetic reconstruction depends on the range of correlation strengths and<br />

on the length of the underlying sequence segment. This concept of the correlation<br />

pattern as a phylogenetic signature of eukaryote species combines two rather distant<br />

domains of research, namely phylogenetic analysis based on molecular observation<br />

and the study of the correlation structure of DNA sequences.<br />
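The second of the two correlation measures, the mutual information function at distance k, can be estimated directly from symbol and pair frequencies; a minimal plug-in sketch:<br />

```python
import math
from collections import Counter

def mutual_information(seq, k):
    # Mutual information I(k) between symbols k positions apart,
    # estimated from empirical single-symbol and pair frequencies.
    pairs = [(seq[i], seq[i + k]) for i in range(len(seq) - k)]
    n = len(pairs)
    p_pair = {ab: c / n for ab, c in Counter(pairs).items()}
    p_a = {a: c / n for a, c in Counter(a for a, _ in pairs).items()}
    p_b = {b: c / n for b, c in Counter(b for _, b in pairs).items()}
    return sum(p * math.log2(p / (p_a[a] * p_b[b]))
               for (a, b), p in p_pair.items())
```

Evaluating I(k) over a range of small k yields the short-range correlation profile that serves as the species fingerprint.<br />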

− 72 −


Dimensionality Reduction of Similarity Matrix<br />

Tadashi Imaizumi<br />

Tama University imaizumi@tama.ac.jp<br />

Abstract. It has become easy to collect a similarity matrix for a large number of<br />

objects, for example in basket analysis in data mining. Several unsupervised methods<br />

can then be applied to such data, according to the purpose of the analysis, and it has<br />

been shown in many research fields that geometric models such as MultiDimensional<br />

Scaling (MDS) or the Self-Organizing Map (SOM) are applicable. However, two problems<br />

arise when we want to apply these methods to a large similarity matrix. The first is<br />

the change of the dimensions in focus. The latent dimensions of the similarity<br />

evaluation process may be common to all objects when the attributes of the objects<br />

are unambiguous and the number of objects is not too large. However, one will not<br />

agree that the similarity evaluation between Hamburg and Tokyo rests on the same<br />

dimensions as that between Hamburg and Berlin, and this raises the question of how to<br />

model the process. The second problem is how to employ the researcher’s knowledge as<br />

prior information in the model. We have some knowledge about the data, and the<br />

gathered data contain it as hidden information, so it is worthwhile to propose a<br />

supervised geometric model that treats this information as model parameters. I will<br />

discuss these two problems, compare the dimensionality reduction methods, and propose<br />

a model of the change of the dimensions in focus and of the prior information about<br />

the data.<br />
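As a baseline for the geometric models mentioned above, classical (Torgerson) MDS embeds objects directly from a similarity matrix; this sketch treats the similarities as inner products and converts them to squared distances before double-centring.<br />

```python
import numpy as np

def classical_mds(S, dim=2):
    # Classical (Torgerson) MDS from a symmetric similarity matrix S.
    # Similarities are read as inner products, converted to squared
    # distances, double-centred and eigendecomposed.
    n = S.shape[0]
    d2 = np.diag(S)[:, None] + np.diag(S)[None, :] - 2 * S
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ d2 @ J
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]
    return V[:, idx] * np.sqrt(np.clip(w[idx], 0, None))
```

The embedding is unique up to rotation and reflection; pairwise distances of the recovered configuration match those implied by S.<br />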

Key words: supervised, latent dimensions, attribute focus<br />

References<br />

Kohonen, T. (1995): Self-Organizing Maps. Springer, Berlin, Heidelberg.<br />

Roweis, S. T. and Saul, L. K. (2000): Nonlinear dimensionality reduction by locally<br />

linear embedding. Science 290, 2323-2326.<br />

Sammon, J. W., Jr. (1969): A nonlinear mapping for data structure analysis. IEEE<br />

Transactions on Computers, C-18(5), 401–409.<br />

Tenenbaum, J. B., de Silva, V. and Langford, J. C. (2000): A global geometric framework<br />

for nonlinear dimensionality reduction. Science 290, 2319-2323.<br />

− 73 −


Settlement Behaviour Along the Ems During<br />

the 7th to 11th Centuries: A GIS-Based<br />

Settlement-Archaeological Analysis of the<br />

Area Between Warendorf and Rheine<br />

Katrin Jaspers<br />

Universität Münster, Germany<br />

Abstract. With the help of GIS, the chronological, topographical and historical<br />

relationships between the various settlements of the study area are to be documented,<br />

taking pedological aspects into account as well. In this way, the processes and<br />

developments of the settlement history could be made visible, and conclusions about<br />

the infrastructure between the settlements could possibly be drawn.<br />

Since this work is still in its development phase, only a preliminary thematic<br />

outline can be given here.<br />

− 74 −


Benchmarking Bicluster Algorithms<br />

Sebastian Kaiser and Friedrich Leisch<br />

Department of Statistics, Ludwig-Maximilians-Universität München,<br />

Ludwigstrasse 33, 80539 München, Germany,<br />

firstname.lastname@stat.uni-muenchen.de<br />

Abstract. Over the last decade, bicluster methods have become increasingly popular<br />

in different fields of two way data analysis, and a large variety of algorithms<br />

and analysis methods have been published, see (Madeira and Oliveira, 2004) for<br />

a survey. In this presentation, we show how the general benchmarking framework<br />

by Hothorn et al. (2005) can be adapted to the special case of biclustering. A key<br />

issue is the development of bootstrap strategies for two-way data, which do not only<br />

resample cases, but also variables.<br />

All methods presented have been implemented in the open source R package<br />

biclust, which is available on http://cran.r-project.org. Both artificial and<br />

real-world microarray data are used for benchmark experiments. The resulting<br />

benchmark data are explored using new graphical techniques and analyzed by means<br />

of statistical inference.<br />
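The two-way bootstrap mentioned above can be sketched as follows: rows (cases) and columns (variables) are resampled independently with replacement, giving one replicate data matrix per draw. The biclust package is the R reference; this is an illustrative Python sketch.<br />

```python
import numpy as np

def twoway_bootstrap(X, rng=None):
    # Resample rows (cases) and columns (variables) independently with
    # replacement to obtain one bootstrap replicate of a two-way matrix.
    if rng is None:
        rng = np.random.default_rng()
    rows = rng.integers(0, X.shape[0], X.shape[0])
    cols = rng.integers(0, X.shape[1], X.shape[1])
    return X[np.ix_(rows, cols)], rows, cols
```

Running the bicluster algorithm on many such replicates and comparing the recovered biclusters against the originals yields the stability measures used in the benchmark.<br />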

Key words: Biclustering, Two-Way-Clustering, Validation, R<br />

References<br />

Hothorn, T., Leisch, F., Zeileis, A., and Hornik, K. (2005): The design and analysis<br />

of benchmark experiments. Journal of Computational and Graphical Statistics,<br />

14(3), 675–699.<br />

Madeira, S. C. and A. L. Oliveira (2004): Biclustering algorithms for biological data<br />

analysis: A survey. IEEE/ACM Transactions on Computational Biology and<br />

Bioinformatics, 1(1),24–45.<br />

Santamaria, R., Theron, R., and Quintales, L. (2007): A framework to analyze biclustering<br />

results on microarray experiments. In: 8th International Conference on<br />

Intelligent Data Engineering and Automated Learning (IDEAL’07) ,Springer,<br />

Berlin, 770–779.<br />

Turner, H., Bailey, T., and Krzanowski, W. (2005): Improved biclustering of microarray<br />

data demonstrated through systematic performance tests. Computational<br />

Statistics and Data Analysis, 48,235–254.<br />

− 75 −


Nonparametric distribution analysis for text<br />

mining<br />

Alexandros Karatzoglou 1 , Ingo Feinerer 2,3 , and Kurt Hornik 3<br />

1 INSA de Rouen, France alexis@ci.tuwien.ac.at<br />

2 Theory and Logic Group, Institute of Computer Languages<br />

Vienna University of Technology, Austria feinerer@logic.at<br />

3 Department für Statistik und Mathematik,<br />

Wirtschaftsuniversität Wien, Austria kurt.hornik@wu-wien.ac.at<br />

Abstract. A number of new algorithms for non-parametric distribution analysis<br />

based on Maximum Mean Discrepancy measures and the Hilbert-Schmidt norm have<br />

been recently introduced. These novel algorithms operate in Hilbert space and can be<br />

used for Two-Sample Tests, Hierarchical Clustering and Dimensionality Reduction.<br />

Coupled with recent advances in string kernels, these methods extend the scope of<br />

kernel-based methods in the area of text mining.<br />

We review this group of kernel methods focusing on text mining where we will<br />

propose novel applications and present an efficient implementation in the kernlab<br />

package. We also present an efficient and integrated environment for applying modern<br />

machine learning methods to complex text mining problems through the combined<br />

use of the tm (for text mining) and the kernlab (for kernel-based learning) R<br />

packages.<br />
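The core quantity behind these two-sample tests, the (biased) squared maximum mean discrepancy, takes only a few lines under an RBF kernel. kernlab provides this functionality in R; the following is an illustrative re-implementation in Python, with the kernel width gamma as an assumed free parameter.<br />

```python
import numpy as np

def mmd2(X, Y, gamma=1.0):
    # Biased estimate of the squared maximum mean discrepancy between
    # samples X and Y (rows = observations) under a Gaussian RBF kernel.
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
```

For text mining, the Euclidean RBF kernel would be replaced by a string kernel over documents, leaving the MMD formula unchanged.<br />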

Key words: kernel methods, text mining, R<br />

References<br />

Karatzoglou, A., Smola, A., Hornik, K. and Zeileis, A. (2004): kernlab - An S4 Package<br />

for Kernel Methods in R. Journal of Statistical Software, 11(9).<br />

Smola, A., A. Gretton, L. Song and B. Schölkopf (2007): A Hilbert Space Embedding<br />

for Distributions. Proceedings of the 18th International Conference on<br />

Algorithmic Learning Theory (ALT 2007), 13-31, Springer, Berlin, Germany<br />

Song, L., A. J. Smola, K. Borgwardt and A. Gretton (2007): Colored Maximum Variance<br />

Unfolding. Proceedings of the Twenty-First Annual Conference on Neural<br />

Information Processing Systems (NIPS 2007), 1-8, MIT Press, Cambridge,<br />

Mass., USA<br />

− 76 −


Index-Tracking Securities –<br />

A Comparative Analysis Using the Example<br />

of the DAX<br />

Christian Klein 1 and Dennis Kundisch 2<br />

1 Universität Hohenheim, Lehrstuhl für Rechnungswesen und Finanzierung, 70593<br />

Stuttgart cklein@uni-hohenheim.de<br />

2 Universität Augsburg, Lehrstuhl für BWL, Wirtschaftsinformatik und Financial<br />

Engineering, 86135 Augsburg dennis.kundisch@wiwi.uni-augsburg.de<br />

Abstract. In this paper we compare various index-tracking securities. Products of<br />

this kind promise the investor a performance that matches an underlying index as<br />

exactly as possible. In our study we consider several aspects, among them the<br />

replication quality of the securities. We thus provide a differentiated picture both<br />

of the quality of the products and of the strengths and weaknesses of the commonly<br />

applied evaluation methods.<br />

Key words: stock index, index replication<br />

− 77 −


Polyphasic genomic approach for the taxonomy<br />

of archaea and bacteria<br />

Hans-Peter Klenk<br />

DSMZ - German Collection of Microorganisms and Cell Cultures, 38124<br />

Braunschweig, Germany hpk@dsmz.de<br />

Abstract. Contemporary taxonomic classification of prokaryotes is primarily based<br />

on the analysis of 16S rDNA sequences, extended by chemotaxonomical analyses,<br />

e.g. whole cell fatty acids or amino acid analysis of cell walls. Although rDNAs<br />

are excellent taxonomic markers, they represent far less than 1With meanwhile 637<br />

published prokaryotic genomes and more than 1850 ongoing archaeal and bacterial<br />

genome sequencing projects, the future of systematics will clearly be based on the<br />

analysis of whole genome sequences. The major imminent problems on the way to<br />

a genome-based systematic classification of prokaryotes are: 1) uneven phylogenetic<br />

distribution of the sequenced genomes; 2) large variation of the phylogenetic value<br />

in different fractions of the genomes; and 3) affordable technology for rapid sequence<br />

generation combined with highly automated analysis of the information. A massive<br />

generation of genome sequences from phylogenetically isolated archaea and bacteria<br />

in a collaboration between the Joint Genome Institute and DSMZ aims for a rapid filling<br />

of the deep phylogenetic gaps, soon to be followed by sequenced genomes of all<br />

type strains. The fast variation between genes or sets of genes in view of sequence<br />

conservation and genetic stability is problematic for global approaches to universal<br />

phylogenies, but provides suitable novel taxonomic markers for more restricted areas<br />

within the diversity of micro-organisms. New technologies for sequence generation<br />

have already sharply decreased the price for the production of microbial genome<br />

sequences and will continue to do so till the genome of any cultivated species of<br />

archaea or bacteria will become affordable. The more complex problem to be solved<br />

is the automated processing of the genomes within an endlessly growing sequence<br />

space.<br />

References<br />

Klenk, H.-P. (2007) Genomic future for the taxonomy of prokaryotes. In: E Stackebrandt,<br />

M Wozniczka, V Weihs & J Sikorski (eds) Connections between Collections.<br />

Proceedings of the 11th International Conference on Culture Collections.<br />

ISBN 978-3-00-022417-1. DSMZ, Braunschweig, Germany. pp 117-119<br />

− 78 −


Exploiting synergetic and redundant features<br />

for multimedia document classification<br />

Jana Kludas, Eric Bruno and Stephane Marchand-Maillet<br />

University of Geneva, Switzerland<br />

kludas|bruno|marchand@cui.unige.ch<br />

Summary. Multimedia data handling in all kinds of applications has received a lot<br />

of attention from the research community over the last decade, due to the ‘multimediatisation’<br />

of, e.g., the WWW and other data collections in everyday life. The most<br />

important problems identified in multimedia-based classification are, amongst others,<br />

the high dimensionality of the multi-modal feature space, the unknown and varying<br />

relevance of features and modalities towards the class label, noise and missing values<br />

in the input data, and the semantic gap between low-level features and high-level<br />

semantic meanings.<br />

We are working on a promising way to tackle many of these problems at once:<br />

the calculation and exploitation of feature information interactions for feature selection<br />

and construction in high dimensional feature spaces towards more efficient<br />

information fusion and hence improved multimedia document classification. This<br />

information-theoretic dependence measure finds the exact, irreducible attribute interactions<br />

in a multivariate feature subset. Its definition is a stable relation because<br />

information interactions are described by the information exclusively shared by this<br />

subset’s variables. For subsets of size N = 2 the interaction reduces to the<br />

well-known mutual information.<br />

Then for higher order subsets N > 2, feature information interaction develops<br />

its most important characteristic, it can result in positive and negative values. This<br />

allows one to discriminate two different types of feature relationships: (1) synergy given<br />

by positive interactions and (2) redundancy indicated by negative ones. This can be<br />

used to treat the features of each of the types of interactions separately with the<br />

help of specialized feature selection and construction strategies.<br />

With the help of artificial data sets we will show what relationships information<br />

interactions can detect. Classification experiments on real-world data will also<br />

show the superiority of preprocessing based on N-way interactions over pair-wise<br />

dependence measures that are often used in recent feature selection approaches like<br />

correlation and mutual information.<br />
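The sign behaviour described above can be made concrete with a plug-in estimate of the three-way interaction information, here taken in the convention I(X;Y;Z) = I(X;Y|Z) − I(X;Y), which is positive for synergy and negative for redundancy:<br />

```python
import math
from collections import Counter

def entropy(samples):
    # Plug-in Shannon entropy (bits) of a list of discrete observations.
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())

def interaction_information(xs, ys, zs):
    # I(X;Y;Z) = I(X;Y|Z) - I(X;Y): positive = synergy, negative = redundancy.
    H = entropy
    i_xy_given_z = (H(list(zip(xs, zs))) + H(list(zip(ys, zs)))
                    - H(list(zip(xs, ys, zs))) - H(list(zs)))
    i_xy = H(list(xs)) + H(list(ys)) - H(list(zip(xs, ys)))
    return i_xy_given_z - i_xy
```

With Z = X xor Y (pure synergy) the estimate is +1 bit; with three identical variables (pure redundancy) it is −1 bit.<br />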

Key words: feature selection, multi modal information fusion, multimedia object<br />

classification<br />

− 79 −


Time-Varying Parameters in Brand Choice<br />

Models<br />

Thomas Kneib 1 , Bernhard Baumgartner 2 , and Winfried J. Steiner 3<br />

1 Department of Statistics, University of Munich, Germany<br />

thomas.kneib@stat.uni-muenchen.de<br />

2 Department of Marketing, University of Regensburg, Germany<br />

bernhard.baumgartner@wiwi.uni-regensburg.de<br />

3 Department of Marketing, Technical University of Clausthal, Germany<br />

winfried.steiner@tu-clausthal.de<br />

Abstract. Brand Choice Models are frequently used in marketing research. In most<br />

applications, estimated parameters representing customers’ reactions to, e.g., price<br />

and promotional activities or brand-specific effects are assumed to be constant over<br />

time. Marketing theories as well as experiences of marketing practitioners, however,<br />

suggest the existence of trends and/or short-term fluctuations in brand choice behavior.<br />

For example, price elasticities or preferences for certain brands may change in the<br />

run-up to special events like Christmas or Mother’s day (e.g., Baumgartner 2003).<br />

In this contribution, we employ multinomial logit models with varying coefficients to<br />

estimate time-varying parameters in brand choice models. Both time-varying preferences<br />

(trends) and time-varying effects of covariates are modeled based on penalised<br />

splines, a flexible yet parsimonious nonparametric smoothing technique (e.g., Eilers<br />

and Marx 1996). The estimation procedure is fully data-driven, determining the flexible<br />

function estimates as well as the corresponding degree of smoothness in a unified<br />

approach (e.g., Kneib 2006). Preliminary results suggest that the model considering<br />

time-variable parameters outperforms models assuming constant parameters in<br />

terms of fit and predictive validity.<br />
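A minimal illustration of the penalised-spline idea of Eilers and Marx (1996), simplified here to a truncated-power basis with a plain ridge penalty on the knot coefficients (rather than their B-spline basis with difference penalty) and a Gaussian response (rather than the multinomial logit of the paper):<br />

```python
import numpy as np

def pspline_fit(x, y, knots=20, degree=3, lam=10.0):
    # Rich truncated-power basis: polynomial part (unpenalised) plus
    # one truncated power function per interior knot (ridge-penalised).
    ks = np.linspace(x.min(), x.max(), knots)[1:-1]
    B = np.column_stack([x ** p for p in range(degree + 1)] +
                        [np.clip(x - k, 0, None) ** degree for k in ks])
    P = np.diag([0.0] * (degree + 1) + [1.0] * len(ks))
    coef = np.linalg.solve(B.T @ B + lam * P, B.T @ y)
    return B @ coef
```

The penalty weight lam controls the degree of smoothness; in the unified approach referred to above it is estimated from the data rather than fixed.<br />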

Key words: Brand Choice, Multinomial logit model, Time-varying effects, Semiparametric<br />

regression, P-splines<br />

References<br />

BAUMGARTNER, B. (2003): Measuring Changes in Brand Choice Behavior.<br />

Schmalenbach Business Review, 55, 242–256.<br />

EILERS, P.H.C. and MARX, B.D. (1996): Flexible Smoothing Using B-Splines and<br />

Penalized Likelihood (with Comments and Rejoinder). Statistical Science, 11(2),<br />

89–121.<br />

KNEIB, T. (2006): Mixed Model Based Inference in Structured Additive Regression.<br />

Dr. Hut-Verlag, München.<br />

− 80 −


Multivariate comparative analysis of stock<br />

exchanges - the European perspective<br />

Julia Koralun-Bereźnicka<br />

Maritime University in Gdynia, Morska 81-87, 81-225 Gdynia, Poland<br />

koral@am.gdynia.pl<br />

Abstract. The aim of the research is to perform a multivariate comparative analysis<br />

of 20 European stock exchanges in order to identify the main similarities between<br />

the objects. The basis of comparison is a set of 48 monthly variables from the period<br />

01.2003–12.2005. The variables are classified into three categories: size of the market,<br />

equity trading and bonds. The paper aims at identifying the clusters of alike<br />

stock exchanges and at finding the characteristic features of each of the distinguished<br />

groups. The obtained categorization to some extent corresponds with the division<br />

of the European Union into ‘new’ and ‘old’ member countries. Clustering method,<br />

performed for each quarter separately, also reveals that the classification is fairly<br />

stable in time. The factor analysis, which was carried out to reduce the number of<br />

variables, reveals three major factors behind the data, which are related with the<br />

earlier mentioned categories of variables.<br />
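As an illustration of the clustering step, here is a minimal Ward-linkage sketch on a standardised indicator matrix. The data below are simulated stand-ins (the study itself uses 48 observed monthly variables for 20 exchanges), so the group structure planted here is an assumption for demonstration only:<br />

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# hypothetical indicator matrix: 20 exchanges x 48 monthly variables,
# drawn from two artificial groups to mimic an 'old'/'new' market split
X = np.vstack([rng.normal(0.0, 1.0, size=(12, 48)),
               rng.normal(3.0, 1.0, size=(8, 48))])
X = (X - X.mean(axis=0)) / X.std(axis=0)        # z-score each variable

Z = linkage(X, method='ward')                   # agglomerative Ward clustering
labels = fcluster(Z, t=2, criterion='maxclust') # cut the dendrogram at 2 clusters
```

Repeating the same cut on each quarter's sub-matrix and comparing the label vectors is one simple way to check the stability over time reported in the abstract.<br />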

Key words: stock exchanges, cluster analysis, factor analysis<br />

References<br />

Boillat, P., de Skowronsky, N., Tuchschmid, N. (2002) Cluster analysis: application<br />

to sector indices and empirical validation, Swiss Society for Financial Market<br />

Research, 16, 467–486.<br />

Kearney C., Lucey B. M., (2004) International equity market integration: Theory,<br />

evidence and implications, “International Review of Financial Analysis”, 13,<br />

571–583.<br />

Kim S. J., Moshirian F., Wu E. (2005) Dynamic stock market integration driven by<br />

the European Monetary Union: An empirical analysis, “Journal of Banking &<br />

Finance”, 29, 2475–2502.<br />

Krzanowski, W. J. (1988) Principles of multivariate analysis, Oxford University<br />

Press, Oxford.<br />

Morrison, D., (1967), Multivariate statistical methods, New York: McGraw-Hill.<br />

Pascual A. G., (2003) Assessing European stock markets (co)integration, “Economics<br />

Letters”, 78, 197–203.<br />

− 81 −


Strategies of model construction for<br />

the analysis of judgment data<br />

Sabine Krolak-Schwerdt<br />

Faculty of Humanities, Arts and Educational Science, University of Luxembourg<br />

sabine.krolak@uni.lu<br />

Abstract. This paper is concerned with the types of models researchers use to<br />

analyze empirical data in the domain of social judgments and decisions. Examples of<br />

this research domain are organizational or medical expert judgments, court decisions<br />

or judgments in private everyday life.<br />

Models for the analysis of judgment data may be divided into two classes depending<br />

on the criteria they optimize. The first class consists of approaches which<br />

optimize an internal (mathematical) criterion function. The aim is to minimize the<br />

discrepancy of values predicted by the model from obtained data by use of, e.g., a<br />

least squares approach. The second class comprises approaches which incorporate<br />

a substantive underlying theory into the model. These accounts were developed to<br />

satisfy external validity criteria, especially construct validity. Model parameters are<br />

not only formally defined, but they represent specified components of judgments.<br />

Several models from both classes are applied to a number of empirical data sets<br />

and comparatively evaluated as to goodness-of-fit, variance accounted for by the<br />

models and construct validity. Results exhibit considerable differences between the<br />

two model classes in construct validity, but not in internal validity criteria.<br />

It may be concluded that any model for the analysis of judgment data implies<br />

the selection of a formal theory about judgments. Hence, optimizing a mathematical<br />

criterion function does not induce a non-theoretical rationale or neutral tool.<br />

Rather, this approach yields another formal theory about judgments which may not<br />

correspond to substantive theories and, in this respect, may yield artefacts. As a<br />

consequence, models satisfying construct validity seem superior in the domain of<br />

judgments and decisions.<br />

Key words: Models of data analysis, external validity, internal validity, model<br />

comparison<br />

− 82 −


An application of copula functions to market<br />

risk management<br />

Katarzyna Kuziak<br />

Department of Financial Investments and Risk Management<br />

Wroclaw University of Economics<br />

ul. Komandorska 118/120, 53-345 Wroclaw, Poland<br />

katarzyna.kuziak@ae.wroc.pl<br />

Abstract. Modeling dependence is one of the main issues in risk management. From<br />

a risk management point of view, failure to correctly model tail dependence may cause<br />

many problems (under- or overestimation of risk level). The most popular approach<br />

to model dependence between individual risks is based on classical correlation, but<br />

in recent years an increasing interest in applying copula functions has arisen. Copula<br />

functions, a powerful concept for aggregating risks, were introduced into finance by<br />

Embrechts, McNeil, and Straumann. The aim of this paper is to provide simple<br />

applications for the practical use of copulas for risk management from a market risk<br />

point of view. First, we introduce the copula concept. Then, some applications of copulas<br />

for market risk are given. Two Value at Risk estimation approaches are compared for<br />

a portfolio of risks: one utilizing the classical covariance and a copula-based one. The criterion<br />

for evaluating the performance of the two approaches is the result of a VaR backtesting<br />

procedure.<br />
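The two estimation routes can be sketched side by side. The snippet below is a simplified stand-in, not the paper's setup: a hypothetical two-asset portfolio, normal margins, and a Student-t copula chosen to illustrate tail dependence; all parameter values are assumptions:<br />

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# hypothetical two-asset portfolio: equal weights, normal margins, 1-day horizon
w = np.array([0.5, 0.5])
sigma = np.array([0.01, 0.02])                 # daily return volatilities
rho, df, n, alpha = 0.5, 4, 100_000, 0.99

# (a) variance-covariance VaR: the portfolio return is Gaussian
cov = np.array([[sigma[0]**2, rho * sigma[0] * sigma[1]],
                [rho * sigma[0] * sigma[1], sigma[1]**2]])
var_cov = -stats.norm.ppf(1 - alpha) * np.sqrt(w @ cov @ w)

# (b) t-copula Monte Carlo VaR: same margins, tail-dependent copula
L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
z = rng.standard_normal((n, 2)) @ L.T
chi = rng.chisquare(df, size=(n, 1))
t_samples = z / np.sqrt(chi / df)              # multivariate Student-t draws
u = stats.t.cdf(t_samples, df)                 # copula (uniform) scale
returns = stats.norm.ppf(u) * sigma            # back to normal margins
port = returns @ w
var_copula = -np.quantile(port, 1 - alpha)
```

Backtesting would then count exceedances of each VaR figure against realised portfolio losses, which is the comparison criterion named in the abstract.<br />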

Key words: financial dependence, copula functions, risk management, market risk,<br />

Value at Risk<br />

References<br />

Cherubini U., Luciano E., Vecchiato W. (2004): Copula Methods in Finance, John<br />

Wiley & Sons, New York.<br />

McNeil A., Frey R., Embrechts P. (2005): Quantitative Risk Management: Concepts,<br />

Techniques, and Tools, Princeton University Press.<br />

Embrechts P., Lindskog F., McNeil A. (2001): Modelling dependence with copulas<br />

and applications to risk management, report, ETH Zurich.<br />

Embrechts P., McNeil A., Straumann D. (1999): Correlation and dependence in risk<br />

management: properties and pitfalls. In: Risk Management: Value at Risk and<br />

Beyond (M. Dempster, Ed.) Cambridge University Press, Cambridge, 176-223.<br />

Nelsen R. (1999): An introduction to copulas, Springer Verlag, New York.<br />

− 83 −


Testing preference rankings<br />

Kar Yin Lam 1 , Alex J. Koning 2 , and Philip Hans Franses 2<br />

1<br />

ERIM & Econometric Institute, Erasmus University Rotterdam, The<br />

Netherlands<br />

kylam@few.eur.nl<br />

2<br />

Econometric Institute, Erasmus University Rotterdam, The Netherlands<br />

koning@few.eur.nl<br />

franses@few.eur.nl<br />

Abstract. Preference rankings are a common tool in consumer surveys. Such rankings<br />

are easy to perform and the outcomes are easy to understand. In this study<br />

we propose a method to examine if observed rankings imply statistically significant<br />

differences across the products. If there is statistical evidence of differences across<br />

products, the question is which products it concerns. We use multiple comparison<br />

procedures to test which products are significantly different from each other. Our<br />

method concerns the often-encountered practical situation that consumers evaluate<br />

N products but only give preference rankings for a subset that is selected by each<br />

consumer. The literature shows that the task of comparing<br />

all N products can be too demanding. It may also be that the assignment of ranks<br />

itself is problematic. For instance, ties may occur, that is, the consumer is indifferent<br />

between products, and hence two or more products have the same rank. There<br />

may also be missing values, that is, the consumer excludes a certain product in the<br />

consideration set, and thus does not evaluate it. As a consequence the consumer is<br />

not able to assign a rank to this product. The method we propose and analyze in<br />

this paper does not suffer from these drawbacks. We illustrate it for 93 individuals<br />

who rank 10 movies released in 2007 and who indicate preferences for only 4 of these<br />

10 movies.<br />
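For orientation, here is a baseline for the complete-rankings case: a Friedman test for overall differences followed by Bonferroni-corrected pairwise comparisons. This sketch deliberately assumes complete rankings without ties or missing values — exactly the restriction the proposed method removes — and the data are simulated:<br />

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# hypothetical complete-rankings data: 30 consumers rank 4 products;
# product 0 is made systematically more preferred via a score shift
scores = rng.normal(size=(30, 4)) + np.array([1.0, 0.0, 0.0, 0.0])
ranks = scores.argsort(axis=1).argsort(axis=1) + 1   # rank 4 = most preferred

# global test: do the rank distributions differ across products?
stat, pval = stats.friedmanchisquare(*[ranks[:, j] for j in range(4)])

# follow-up: pairwise Wilcoxon signed-rank tests, Bonferroni-corrected
pairs = [(i, j) for i in range(4) for j in range(i + 1, 4)]
adj = {p: min(1.0, stats.wilcoxon(ranks[:, p[0]], ranks[:, p[1]]).pvalue * len(pairs))
       for p in pairs}
```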

Key words: Rankings, Multiple comparisons, Ties, Missing observations<br />

− 84 −


Bayesian Methods for Graph Clustering<br />

Pierre Latouche, Christophe Ambroise, and Etienne Birmelé<br />

Laboratoire Statistique et Génome (UMR CNRS 8071, INRA 1152, UEVE), La<br />

Genopole Tour Evry 2, 523 place des Terrasses, 91000 Evry, France<br />

firstname.lastname@genopole.cnrs.fr<br />

Abstract. Networks are used in many scientific fields such as biology, social science,<br />

and information technology. They aim at modeling, with edges, the way objects of<br />

interest, represented by vertices, are related to each others. Looking for clusters,<br />

also called communities or modules, of highly connected vertices, has proved to<br />

be a powerful approach to capture the underlying structure of a network.<br />

Recently, the Erdős-Rényi Mixture model for Graph (ERMG) for community<br />

detection was proposed by Daudin et al. (2006) with an associated algorithm, based<br />

on variational techniques, for maximum likelihood estimation. Given a network, the<br />

number of clusters is estimated and for all the vertices, the algorithm infers the<br />

probability of membership in each cluster.<br />

Following Hofman and Wiggins (2007), we show how the ERMG model can be<br />

described in a full Bayesian framework. Then, we apply two families of approximation<br />

techniques, called Variational Bayes (VB) and Expectation Propagation (EP),<br />

for the inference procedure. Using simulated and real data sets, we compare both<br />

the number and the quality of the estimated clusters obtained with the different<br />

approaches.<br />

Key words: Graph clustering, Variational Bayes, Expectation Propagation<br />

References<br />

Daudin, J. and Picard, F. and Robin, S. (2006): A Mixture Model for Random<br />

Graphs. Tech. rep., INRIA.<br />

Hofman, J.M. and Wiggins, C.H. (2007): A Bayesian Approach to Network Modularity.<br />

ArXiv e-prints.<br />

Jordan, M. and Ghahramani, Z. and Jaakkola, T. (1998): An introduction to variational<br />

methods for graphical models. In: Jordan, M.: Learning in Graphical<br />

Models. MIT Press.<br />

− 85 −


Fundamental Indexation - testing the concept in<br />

the German stock market<br />

Hermann Locarek-Junge 1 and Max Mihm 1<br />

Lehrstuhl für Finanzwirtschaft und Finanzdienstleistungen,<br />

TU Dresden, D-01062 Dresden, Germany, locarekj@finance.wiwi.tu-dresden.de<br />

Abstract. In Germany, Fundamental Indexation is a rather new concept of portfolio<br />

management, creating portfolios based not on market capitalization but on other<br />

economic figures such as revenues, employees, dividends, or book value. The concept<br />

has so far been implemented in only a few mutual investment funds<br />

worldwide. However, backward calculation of portfolios using the concept of<br />

fundamental indexation (CFI) on time series from 1961 to 2004 for stock portfolios<br />

in the US capital market and other studies show potential significant returns and<br />

impressive Sharpe ratios for this period (see Arnott/Sautter/Siegel 2007).<br />

Trying to explain above average returns using factor models has not yet been<br />

accomplished in a way that is compatible with traditional capital market theory. The<br />

pros and cons of the approach are controversially discussed by scientists and<br />

practitioners, e.g.: “[CFI] are a triumph of marketing, not of new ideas” (Fama 2007),<br />

“With the advent of fundamental indexes we’re at the brink of a huge paradigm shift.<br />

... [They] are the next wave of investing.” (Siegel 2006), and “Fundamental Indexing<br />

is just a new label on old wine.” (Asness 2006)<br />

We use data from 1987 to 2007 for the German stock market and several indexing<br />

concepts to test the CFI for the German market. We create and compare portfolio<br />

clusters of market weighted, equally weighted and fundamentally weighted stocks.<br />

We use Fama’s 3-factor-model to analyze and explain returns and anomalies, and<br />

we question the persistence of investment returns using the CFI.<br />
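The factor-model step can be sketched as an ordinary least-squares regression of portfolio excess returns on the three Fama-French factors. The data below are simulated placeholders (the study uses actual German market returns), and the factor loadings are assumed for illustration:<br />

```python
import numpy as np

rng = np.random.default_rng(6)
T = 240                                          # e.g. 20 years of monthly data
factors = rng.normal(0.0, 0.04, size=(T, 3))     # simulated Mkt-RF, SMB, HML
true_beta = np.array([1.1, 0.4, 0.3])            # assumed loadings for the toy run
alpha_true = 0.002
r_excess = alpha_true + factors @ true_beta + rng.normal(0.0, 0.02, T)

# OLS fit of the 3-factor model: r_excess = alpha + factors @ beta + error
X = np.column_stack([np.ones(T), factors])
coef, _, _, _ = np.linalg.lstsq(X, r_excess, rcond=None)
alpha_hat, betas = coef[0], coef[1:]
```

A persistent positive `alpha_hat` after controlling for the three factors is what would count as a return anomaly of the fundamentally weighted portfolios.<br />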

Key words: fundamental indexation, market index, portable alpha<br />

References<br />

Arnott, R., Sautter, G., Siegel, J. (2007): Fundamental Indexing Smackdown, in:<br />

Journal of Indexes, Vol. 10, No. 5, pp. 10–15.<br />

− 86 −


Identifying Atypical Cases in Kernel Fisher<br />

Discriminant Analysis by using the Smallest<br />

Enclosing Hypersphere<br />

Nelmarie Louw, Morne Lamont and Sarel Steel<br />

Department of Statistics and Actuarial Science, University of Stellenbosch, Private<br />

Bag X1, 7602 Matieland, South Africa. nlouw@sun.ac.za<br />

Abstract. Kernel methods are fast becoming standard tools for solving classification<br />

and regression problems in statistics. An example of a kernel based classification<br />

method is Kernel Fisher discriminant analysis (KFDA). Conceptually KFDA entails<br />

transforming the data in the input space to a high-dimensional feature space, followed<br />

by linear discriminant analysis (LDA) performed in feature space. Although<br />

the resulting classifier is linear in feature space, it corresponds to a non-linear classifier<br />

in input space. However, as in the case of LDA, the classification performance<br />

of KFDA deteriorates in the presence of atypical data points. Louw et al. (2007)<br />

proposed several criteria for identification of atypical cases in KFDA. In extensive<br />

simulation studies these criteria have been found to be successful, in the sense that<br />

the error rate of the KFD classifier based on the dataset after removal of atypical<br />

cases is lower than the error rate of the KFD classifier based on the entire data<br />

set. A disadvantage is that these criteria are calculated on a leave-one-out basis,<br />

which becomes computationally prohibitive when dealing with large data sets. In<br />

this paper we propose a two-step procedure for identifying atypical cases in large<br />

data sets. Firstly, a subset of potentially atypical data cases is found by constructing<br />

the smallest enclosing hypersphere (for each group) in feature space. Secondly, the<br />

proposed criteria are employed to identify atypical cases, but only cases in the subset<br />

are considered on a leave-one-out basis, leading to a substantial reduction in computation<br />

time. We investigate the merit of this new proposal in a simulation study,<br />

and compare the results to the results obtained when not using the hypersphere as<br />

a first step. We conclude that the new proposal has merit.<br />
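The flavour of the first screening step can be conveyed with a cheap stand-in: the exact smallest enclosing hypersphere requires a small quadratic program (as in support vector data description), so the sketch below instead ranks cases by their feature-space distance to the kernel centroid — a simplification, not the authors' procedure — using the standard kernel-trick identity for that distance:<br />

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    # k(x, y) = exp(-gamma * ||x - y||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def kernel_centroid_distances(X, gamma=0.5):
    """Squared feature-space distance of each point to the class centroid:
    ||phi(x_j) - m||^2 = k(x_j,x_j) - (2/n) sum_i k(x_j,x_i) + (1/n^2) sum_il k(x_i,x_l)."""
    K = rbf_kernel(X, X, gamma)
    return np.diag(K) - 2.0 * K.mean(axis=1) + K.mean()

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 2))
X[0] = [8.0, 8.0]                      # plant one atypical case
d2 = kernel_centroid_distances(X)
suspects = np.argsort(d2)[-6:]         # screen only the 10% most distant cases
```

Only the cases in `suspects` would then be passed to the expensive leave-one-out criteria, which is the source of the computational saving described above.<br />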

Key words: Classification, Discriminant Analysis, Kernel Methods<br />

References<br />

Louw, N., Lamont, M.C. and Steel, S.J. (2007): Identification of Influential Cases<br />

in Kernel Fisher Discriminant Analysis. In: P. Mantovan, A. Pastore and S.<br />

Tonellato (Eds.): Complex Models and Computational Intensive Methods for<br />

Estimation and Prediction. CLEUP EDITORE, 296–301.<br />

− 87 −


Latent growth models for analyzing a multi<br />

partner reward program<br />

Karsten Lübke 1 and Heike Papenhoff 2<br />

1 Customer Intelligence, Karstadt Warenhaus GmbH, Theodor-Althoff-Strasse 2,<br />

45133 Essen karsten.luebke@karstadt.de<br />

2 Ruhr-Universität Bochum, Lehrstuhl für Betriebswirtschaftslehre, insbesondere<br />

Marketing, Universitätsstraße 150, 44780 Bochum<br />

Abstract. In recent years, multi partner reward programs (MPRP) have enjoyed a<br />

steady increase in popularity. However, one main advantage of MPRPs has not been<br />

sufficiently researched: participating customers are expected to not only prefer their<br />

focal card-issuing company over its competitors, but also to prefer other MPRP<br />

partner companies over their respective competitors outside the program. This so-called<br />

extended cross-buying (CB) effect is crucial for suppliers when they evaluate program<br />

participation. As this CB is a dynamic process which may change over time we<br />

applied Latent Growth Models to analyze the effects of a MPRP on cross-buying.<br />

Key words: Latent Growth Models, Structural Equation Modeling, Cross Buying<br />

− 88 −


Applying Statistical Models and Parametric<br />

Distance Measures for Music Similarity Search<br />

Hanna Lukashevich, Christian Dittmar, and Christoph Bastuck<br />

Fraunhofer IDMT, Langewiesener Str. 22, 98693 Ilmenau, Germany<br />

{lkh;dmr;bsk}@idmt.fraunhofer.de<br />

Abstract. Content-based music similarity search implies methods that can be used<br />

for finding music pieces close in perceptual semantic meaning. It is an inherent part<br />

of automatic music recommendation systems and playlist generation. Most state-of-the-art<br />

music similarity techniques use short-term acoustic features. Defining a<br />

similarity measure between two audio signals consisting of multiple feature vector<br />

frames still remains a challenging task. A multitude of related studies propose<br />

the application of parametric statistical models (e.g. Gaussian Mixture Models -<br />

GMMs) in conjunction with suitable model distance measures. This approach has<br />

several advantages: it enables a very compact and informative representation of an<br />

audio signal and it allows similarity estimation solely based on the parameters of<br />

the models. In this paper we concentrate only on those distance measures that do<br />

not use computationally demanding sampling (like Monte Carlo or likelihood ratio<br />

tests). A good example of such parametric distance measures is the Kullback-Leibler<br />

divergence (KL-divergence), describing the distance between two single Gaussians.<br />

Unfortunately, the KL-divergence between GMMs is not analytically tractable. In a<br />

recent ICASSP paper Hershey and Olsen presented several approximations of the<br />

KL-divergence between two GMMs with very promising results. Helén and Virtanen<br />

proposed a Euclidean distance between GMMs, omitting the KL-divergence.<br />

We present a KL-Euclidean Hybrid distance between GMMs. We compare it to<br />

other state-of-the-art distance measures and show that it significantly outperforms<br />

the others for several features and models. Rather than trying to find the best theoretical<br />

approximation, our focus is on the best performance for the music similarity<br />

task. Besides that, we investigate the influence of the model parameter estimation<br />

on the performance in music similarity search. Here we compare the performance<br />

for several versions of GMMs: a trivial model having just one Gaussian per music<br />

piece, GMMs with a fixed number of Gaussians, and GMMs where the number of<br />

components is estimated using model selection techniques. We also find promising<br />

results using semantic information like song segmentation. In the latter case, we<br />

model each segment of the song with a single Gaussian and represent it as a GMM,<br />

depending on the duration of the segments.<br />
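The closed-form KL-divergence that underlies all of these measures is worth writing out for the single-Gaussian case (the GMM-to-GMM case needs the approximations discussed above). The following sketch also symmetrises it, a common move when a distance-like quantity is wanted; the example parameters are arbitrary:<br />

```python
import numpy as np

def kl_gauss(mu0, cov0, mu1, cov1):
    """Closed-form KL( N(mu0, cov0) || N(mu1, cov1) )."""
    d = len(mu0)
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - d
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

def symmetric_kl(mu0, cov0, mu1, cov1):
    # KL is asymmetric; the symmetrised sum is a common distance surrogate
    return kl_gauss(mu0, cov0, mu1, cov1) + kl_gauss(mu1, cov1, mu0, cov0)

# two illustrative feature-distribution models
mu_a, cov_a = np.zeros(2), np.eye(2)
mu_b, cov_b = np.array([1.0, 0.0]), 2.0 * np.eye(2)
```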

Key words: music information retrieval, music similarity, Gaussian mixture models,<br />

Kullback-Leibler divergence<br />

− 89 −


Determining the number of components in<br />

mixture models for hierarchical data<br />

Olga Lukociene 1 and Jeroen K. Vermunt 2<br />

1<br />

Tilburg University, PO Box 90153, 5000 LE Tilburg, The Netherlands<br />

o.lukociene@uvt.nl<br />

2<br />

Tilburg University, PO Box 90153, 5000 LE Tilburg, The Netherlands<br />

j.k.vermunt@uvt.nl<br />

Abstract. Recently, various types of mixture models have been developed for data<br />

sets having hierarchical or multilevel structure (see, e.g., Vermunt 2003, 2007). Most<br />

of these models include finite mixture distributions at multiple levels of a hierarchical<br />

structure. In the case of two levels, there are, for example, mixture distributions for<br />

individuals (lower-level units) and for groups (higher-level units). In multilevel<br />

mixture models, selection of the number of mixture components is more complex<br />

than in standard mixture models because one has to determine the number of mixture<br />

components at multiple levels.<br />

The most popular measure for determining the number of mixture components<br />

is the BIC. A problem in the application of this criterion in the context of multilevel<br />

mixture models is that it contains the sample size as one of its terms. In multilevel<br />

mixture models, it is not clear which sample size should be used in the BIC formula.<br />

This could be the number of groups or the number of<br />

individuals, depending on whether one wishes to determine the<br />

number of components at the higher or at the lower level.<br />

In this study we investigate the performance of various model selection methods<br />

in the context of multilevel mixture models. We will not only look at BIC with different<br />

definitions of the sample size, but also at AIC and AIC3, as well as at other<br />

criteria such as ICOMP, validation log-likelihood, and LR tests with bootstrapped<br />

p values.<br />
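The sample-size ambiguity is easy to make concrete: BIC penalises with log of the sample size, so the group-level and individual-level definitions give different penalties for the same fitted model. The numbers below are illustrative placeholders, not results from the study:<br />

```python
import numpy as np

def bic(loglik, n_params, sample_size):
    """BIC = -2 log L + n_params * log(sample_size)."""
    return -2.0 * loglik + n_params * np.log(sample_size)

# hypothetical two-level design: 50 groups of 20 individuals each
n_groups, n_individuals = 50, 50 * 20
loglik, k = -12_345.6, 18   # illustrative log-likelihood and parameter count

bic_group_level = bic(loglik, k, n_groups)        # penalty uses number of groups
bic_individual_level = bic(loglik, k, n_individuals)
```

Because log(1000) > log(50), the individual-level definition penalises complexity more heavily, so the two definitions can point to different numbers of components.<br />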

Key words: Multilevel mixture models, Hierarchical models, BIC, AIC, AIC3<br />

References<br />

Vermunt, J.K.(2003): Multilevel latent class models. Sociological Methodology, 33,<br />

213-239.<br />

Vermunt, J.K. (2007): A hierarchical mixture model for clustering three-way data<br />

sets. Computational Statistics and Data Analysis, 51, 5368-5376.<br />

− 90 −


Exploring the Interaction Structure of Weblogs<br />

Martin Klaus and Ralf Wagner<br />

SVI Chair for International Direct Marketing<br />

DMCC - Dialog Marketing Competence Center<br />

University of Kassel, Germany<br />

{mklaus,rwagner}@wirtschaft.uni-kassel.de<br />

Abstract. Weblogs, as a medium of the Web 2.0, have fundamentally changed the way people<br />

communicate and have created a new form of social interaction. Worldwide,<br />

users make up a huge, permanently growing conversation database including various<br />

topics (Blood (2002)). An interesting feature of this virtual communication is<br />

the opportunity of providing reference to other blogs by setting hyperlinks between<br />

weblogs in the course of the dialog (Chin & Chignell (2006); Leskovec et al. (2007)).<br />

Weblogs have no standardized document format, and no tags mark them as such.<br />

Thus, it turns out to be challenging to identify and collect Weblogs from the web<br />

with a crawler, spider, or bot (Anjewierden, Brussee & Efimova (2004)).<br />

In this study we introduce different approaches to crawl weblogs and try to combine<br />

them. Subsequently we use social network analysis to uncover the structure<br />

between weblogs (Borgatti, Carley & Krackhardt (2006)). This structure provides<br />

us with an assessment of the blogs and their relevance for marketing communication.<br />

Key words: Marketing Communication, Social Network Analysis, Web Mining,<br />

Weblog<br />

References<br />

Anjewierden, A., Brussee, R., and Efimova, L. (2004): Shared Conceptualizations in<br />

Weblogs. In: T.N. Burg (Ed.) BlogTalk 2.0, Vienna.<br />

Blood, R. (2002): You‘ve Got Blog: How Weblogs are Changing our Culture. Perseus,<br />

Cambridge.<br />

Borgatti, S.P., Carley, K.M., and Krackhardt, D. (2006): Robustness of Centrality<br />

Measures Under Conditions of Imperfect Data. Social Networks, 28, 234–236.<br />

Chin, A. and Chignell, M. (2006): Finding Evidence of Community from Blogging<br />

Co-citations: A Social Network Analytic Approach. In: Proceedings of 3rd<br />

IADIS International Conference Web Based Communities 2006. San Sebastian,<br />

Spain, 191–200.<br />

Leskovec, J., McGlohon, M., Faloutsos, C., Glance, N.S., and Hurst, M. (2007):<br />

Cascading Behavior in Large Blog Graphs. In: SDM ’07: SIAM Conference on<br />

Data Mining.<br />

− 91 −


ChIPmix : Mixture model of regressions for<br />

ChIP-chip experiment analysis<br />

Marie-Laure Martin-Magniette 1,2 and Tristan Mary-Huard 1 and Caroline<br />

Bérard 1,2 and Stéphane Robin 1<br />

1 UMR AgroParisTech/INRA MIA 518<br />

2 URGV UMR INRA/CNRS/UEVE<br />

Abstract. The Chromatin immunoprecipitation on chip (ChIP on chip) technology<br />

is used to investigate proteins associated with DNA by hybridization to a microarray. In<br />

a two-color ChIP-chip experiment, two samples are compared: DNA fragments<br />

crosslinked to a protein of interest (IP), and genomic DNA (Input). The two samples<br />

are differentially labeled and then co-hybridized on a single array. The goal is<br />

then to identify actual binding targets of the protein of interest, i.e. probes whose<br />

IP intensity is significantly larger than the Input intensity.<br />

We propose a new method called ChIPmix to analyse ChIP-chip data based on<br />

mixture model of regressions. Let (xi, Yi) be the Input and IP intensities of probe i,<br />

respectively. The (unknown) status of the probe is characterized through a label Zi<br />

which is 1 if the probe is enriched and 0 if it is normal (not enriched). We assume<br />

the Input-IP relationship to be:<br />

Yi = a0 + b0 xi + ɛi if Zi = 0 (normal)<br />

Yi = a1 + b1 xi + ɛi if Zi = 1 (enriched)<br />

where ɛi is a Gaussian random variable with mean 0 and variance σ². The marginal<br />

distribution of Yi for a given level of Input xi is<br />

(1 − π)φ0(Yi|xi) + πφ1(Yi|xi), (1)<br />

where π is the proportion of enriched probes, and φj(·|x) stands for the probability<br />

density function of a Gaussian distribution with mean aj + bjx and variance σ².<br />

The mixture parameters (proportion, intercepts, slopes and variance) are estimated<br />

using the EM algorithm. Posterior probabilities are used to classify probes into the<br />

normal or enriched class. In hypothesis testing theory, false discovery control<br />

is performed by controlling the probability of wrongly rejecting the null hypothesis.<br />

We propose an analogous concept in the mixture model framework. Our aim is to<br />

control the probability for a probe to be wrongly assigned to the enriched class.<br />

Therefore we control Pr{τi > s | xi, Zi = 0} = α for a predefined level α.<br />

We present several applications of ChIPmix to promoter DNA methylation and<br />

histone modification data and show that ChIPmix competes with classical methods<br />

such as NimbleGen and ChIPOTle.<br />
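The two-line mixture above can be fitted with a short EM loop. The sketch below is a simplified illustration with simulated data, a common variance, and a naive residual-based initialisation — assumptions of this sketch, not the authors' implementation:<br />

```python
import numpy as np

def wls(X, y, w):
    # weighted least squares via sqrt-weighted ordinary least squares
    sw = np.sqrt(w + 1e-12)
    return np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]

def chipmix_em(x, y, n_iter=100):
    """EM for a two-component mixture of regressions with common variance."""
    n = len(x)
    X = np.column_stack([np.ones(n), x])
    # initialise responsibilities from the residuals of a pooled regression
    r = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    tau = (r > np.quantile(r, 0.7)).astype(float)
    for _ in range(n_iter):
        # M step: weighted regressions, mixing proportion, common variance
        b0, b1 = wls(X, y, 1 - tau), wls(X, y, tau)
        pi = tau.mean()
        d0, d1 = (y - X @ b0) ** 2, (y - X @ b1) ** 2
        sigma2 = ((1 - tau) * d0 + tau * d1).mean()
        # E step: posterior probability tau_i that probe i is enriched
        m = np.minimum(d0, d1)                     # stabilise the exponentials
        f0 = (1 - pi) * np.exp(-(d0 - m) / (2 * sigma2))
        f1 = pi * np.exp(-(d1 - m) / (2 * sigma2))
        tau = f1 / (f0 + f1)
    return pi, b0, b1, sigma2, tau

# simulated two-colour data: 30% enriched probes on a higher, steeper line
rng = np.random.default_rng(7)
x = rng.uniform(5.0, 12.0, size=500)
z = rng.random(500) < 0.3
y = np.where(z, 1.5 + 1.3 * x, 0.5 + 1.0 * x) + rng.normal(0.0, 0.3, 500)
pi_hat, b_normal, b_enriched, s2_hat, tau_hat = chipmix_em(x, y)
```

Probes would then be called enriched when their posterior `tau_hat` exceeds the threshold s chosen to control the misassignment probability of equation-style criterion described above.<br />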

Key words: Classification, Mixture models, ChIP-chip<br />

− 92 −


Clustering of High-Dimensional Data Via<br />

Finite Mixture Models<br />

Geoff McLachlan<br />

Department of Mathematics & Institute for Molecular Bioscience<br />

University of Queensland<br />

Summary. There has been a proliferation of applications in which the number<br />

of experimental units n is comparatively small but the underlying dimension p<br />

is extremely large as, for example, in microarray-based genomics and other high-throughput<br />

experimental approaches. Hence there has been increasing attention<br />

given not only in bioinformatics and machine learning, but also in mainstream statistics,<br />

to the analysis of complex data in this situation where n is small relative to p.<br />

In this talk, we focus on the clustering of high-dimensional (continuous) data, using<br />

normal mixture models. Their use in this context is not straightforward, as the normal<br />

mixture model is a highly parameterized one with each component-covariance<br />

matrix consisting of p(p + 1)/2 distinct parameters in the unrestricted case. Hence<br />

some restrictions must be imposed and/or a variable selection method applied beforehand.<br />

We shall review the existing literature and consider some new approaches<br />

that have been proposed recently.<br />

− 93 −


Majority-rule consensus: from preferences<br />

(social choice) to trees (biology and<br />

classification theory)<br />

F.R. McMorris<br />

Professor of Applied Mathematics<br />

Professor of Computer Science<br />

Illinois Institute of Technology<br />

Chicago, IL 60616, USA<br />

mcmorris@iit.edu<br />

Abstract: The problem of aggregating the individual preferences of a group<br />

of “voters” into a group consensus preference has been studied for many<br />

years. Indeed, mathematical investigations of consensus problems go back<br />

to the contributions of Borda (1784), of Condorcet (1785), and of Pareto<br />

(1896) and are still frequently cited today. One method, the compelling<br />

majority-rule consensus, is so simple (stick something in the output if it is in<br />

more than half of the input) that it seems nothing really interesting can be<br />

said about it. This presentation will give some historic background from the<br />

classical preference case (e.g., voting), and then point out some new and old<br />

mathematical and computational complexity results pertaining to the use of<br />

the majority-rule paradigm for finding consensus phylogenetic trees (biology)<br />

and classification structures (data analysis).<br />
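The "stick something in the output if it is in more than half of the input" rule is simple enough to state in code. The toy profile below is an assumption for illustration; the same rule applies whether the items are clusters, tree splits, or preference pairs:<br />

```python
from collections import Counter

def majority_rule(inputs):
    """Majority-rule consensus: keep an item iff it occurs in more than
    half of the input collections."""
    counts = Counter(item for coll in inputs for item in set(coll))
    half = len(inputs) / 2.0
    return {item for item, c in counts.items() if c > half}

# toy profile: three 'voters' each contribute a set of clusters (or splits)
profiles = [{'AB', 'CD', 'EF'}, {'AB', 'CD'}, {'AB', 'EF'}]
consensus = majority_rule(profiles)
```

For tree splits this simple rule has a nontrivial payoff: any two majority splits must co-occur in at least one input tree, so they are pairwise compatible and the consensus splits always define a tree.<br />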

− 94 −


Optimization Methods with Evolutionary<br />

Algorithms and Artificial Neural Networks<br />

Rene Meier and Franz Joos<br />

Helmut-Schmidt-University<br />

University of the Federal Armed Forces Hamburg<br />

Power Engineering<br />

Laboratory of Turbomachinery<br />

Abstract. In order to optimize turbomachinery components it is necessary to<br />

describe the behaviour of multimodal objective functions (OF). But it is time-consuming<br />

to evaluate the characteristics of these OF with a three-dimensional<br />

Navier-Stokes solver. Instead, an Artificial Neural Network (ANN) is used as an<br />

interpolator based on information contained in a database to correlate the performance<br />

to the geometrical parameters as is done by a compressible three-dimensional<br />

Reynolds-averaged Navier-Stokes solver. With a computerized optimization system<br />

an existing centrifugal impeller will be redesigned using an Evolutionary Algorithm<br />

(EA) and an ANN. The ANN allows the evaluation of the OF for many geometries<br />

generated by the EA with less effort than a Navier-Stokes solver. Yet sometimes the<br />

prediction is not accurate and must be verified by means of a more accurate but<br />

time-consuming Navier-Stokes solver. The results of this verification are added to<br />

the database. So a new optimization cycle is started with the expectation that the<br />

new learning on a larger database will result in a more accurate ANN.<br />
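The surrogate-assisted loop described above can be sketched compactly. Everything in the snippet is a stand-in: a cheap multimodal test function replaces the Navier-Stokes evaluation, an inverse-distance-weighted interpolator replaces the ANN, and the EA is reduced to mutation of the current best design:<br />

```python
import numpy as np

def expensive(x):
    # stand-in for the Navier-Stokes evaluation: a multimodal test function
    return np.sum(x**2 - 3.0 * np.cos(2 * np.pi * x), axis=-1)

def surrogate(db_x, db_y, q, eps=1e-9):
    # inverse-distance-weighted interpolation over the database (ANN stand-in)
    d = np.linalg.norm(db_x[None, :, :] - q[:, None, :], axis=2)
    w = 1.0 / (d + eps)
    w /= w.sum(axis=1, keepdims=True)
    return w @ db_y

rng = np.random.default_rng(5)
dim = 3
db_x = rng.uniform(-4.0, 4.0, size=(40, dim))     # initial design database
db_y = expensive(db_x)

for cycle in range(15):
    parent = db_x[np.argmin(db_y)]
    pop = parent + rng.normal(0.0, 0.5, size=(60, dim))  # EA: mutate best design
    best = pop[np.argmin(surrogate(db_x, db_y, pop))]    # cheap surrogate ranking
    db_x = np.vstack([db_x, best])                       # verify expensively,
    db_y = np.append(db_y, expensive(best))              # extend the database
```

Each cycle adds one expensively verified design to the database, mirroring the expectation in the abstract that retraining on a larger database yields a more accurate surrogate.<br />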

− 95 −


Finding Music Fads by clustering Online Radio<br />

Data with Emergent Self-Organizing Maps<br />

Florian Meyer and Alfred Ultsch<br />

Databionics Research Group<br />

University of Marburg, Germany<br />

{meyer,ultsch}@informatik.uni-marburg.de<br />

Abstract. Music charts provide a simple statistic of records sold. Due to web 2.0<br />

and its social networks, detailed information from listeners is available. In particular,<br />

there are user-generated keywords, so called tags, that group songs into genres. An<br />

important topic for the music industry is music fads, i.e., small time intervals of a few<br />

weeks with a strong persistence of similar music. A distance measure on weekly music<br />

charts and tags is used. The sequence of music charts is visualized using Emergent<br />

Self-Organizing Maps (ESOM). Fads are automatically found by clustering the charts<br />

with the U*C clustering algorithm on ESOM. U*C does not need an estimation<br />

of the number of clusters. Machine learned decision rules describe fads using the<br />

dominant genres.<br />

Key words: ESOM, U*C, Clustering, Tagged Music, Knowledge Representation<br />

References<br />

Ultsch, A. (2003): Maps for the Visualization of high dimensional Data Spaces. In:<br />

T. Yamakawa (Ed.): Proceedings of the 4th Workshop on Self-Organizing Maps,<br />

225–230.<br />

Mörchen, F., Ultsch, A., Nöcker, M. and Stamm, C. (2005): Visual mining in music<br />

collections. In: Proceedings 29th Annual Conference of the German Classification<br />

Society (<strong>GfKl</strong> 2005), Magdeburg, Germany. Springer, Heidelberg.<br />

Lehwark, P., Risi, S. and Ultsch, A. (2007): Visualization and Clustering of Tagged<br />

Music Data. In: Proceedings Workshop on Self-Organizing Maps (WSOM ’07),<br />

Bielefeld, Germany.<br />

Pampalk, E. (2001): Islands of Music: Analysis, Organization, and Visualization of<br />

Music Archives.<br />

Mörchen, F. et al. (2005): Databionic visualization of music collections according to<br />

perceptual distance. In: J. D. Reiss, G. A. Wiggins (Eds.): Proceedings 6th International<br />

Conference on Music Information Retrieval (ISMIR 2005), London, UK,<br />

396–403.<br />

Adamic, L. and Adar, E. (2003): Friends and Neighbors on the Web. Social Networks,<br />

25(3), 211–230.<br />

− 96 −


Deviant box and dual clusters for the analysis<br />

of conceptual contexts<br />

Boris Mirkin<br />

School of Computer Science and Information Systems<br />

Birkbeck University of London, Malet street, London, WC1E 7HX, UK<br />

mirkin@dcs.bbk.ac.uk<br />

Summary. This work relates to the frameworks of biclustering (Madeira and<br />

Oliveira 2004, Mirkin 1996) and formal concept analysis (Ganter and Wille 1999).<br />

A formal concept over a 1/0 rectangular matrix r, whose row-set is I and<br />

column-set is J, is a maximal pair (V, W) such that V ⊂ I and W ⊂ J and all r-elements<br />

within V × W are unities. The lattices of formal concepts have found interesting applications<br />

in such areas as association between itemsets and post-processing of web-search<br />

results. However, in many applications the notion of formal concept seems overly<br />

rigid because it does not allow any errors or peculiarities in 1/0 encoding (Pensa<br />

and Boullicaut 2005). This is why we take the concept of data approximating box<br />

V × W (Mirkin, Arabie and Hubert 1995) and use it in the framework of a disjunctive<br />

biclustering model approximating the data matrix r.<br />

We develop a method, Box(a), for fitting the model with possibly overlapping<br />

boxes by using a local search algorithm for finding an optimal box starting from a<br />

pre-specified row or column a and using a parameter b shifting the values of r to<br />

r − b. It is proven that the method leads to highly deviant boxes, which is accounted<br />

for by a variance measure.<br />

We further proceed to develop a dual clustering framework by multiplying the<br />

original model equation by its transposed version both on the right and on the<br />

left. The two equations lead to disjunctive decompositions of similarity matrices,<br />

(r − b) ∗ (r − b) ′ and (r − b) ′ ∗ (r − b) over clusters on row set I and column set<br />

J, respectively. This dual clustering framework formalizes the notion that good<br />

concepts should relate only such row and column sets that are similarity clusters<br />

on their own. A local search method for simultaneously fitting the dual clustering<br />

models, Dual(i, j), is developed using an evolutionary algorithm for optimizing<br />

the common intensity value of the clusters.<br />

Results of experiments on generated and real data sets are reported, supporting<br />

the effectiveness of the algorithms.<br />
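The defining property of a formal concept stated above can be sketched directly in<br />
code (an illustrative check only, not the Box(a) or Dual(i, j) algorithms themselves):<br />

```python
def is_formal_concept(r, V, W):
    """Check the defining property: all entries of r on V x W are 1,
    and (V, W) is maximal -- no further row or column can be added."""
    n, m = len(r), len(r[0])
    if not all(r[i][j] == 1 for i in V for j in W):
        return False
    can_grow_V = any(i not in V and all(r[i][j] == 1 for j in W) for i in range(n))
    can_grow_W = any(j not in W and all(r[i][j] == 1 for i in V) for j in range(m))
    return not (can_grow_V or can_grow_W)
```

A box, in contrast, is allowed to contain some zeros, which is exactly the relaxation the abstract motivates.<br />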

Key words: Formal concept, Biclustering, Dual clustering, Scale shift<br />

− 97 −


Clustering a Contingency Table Accompanied<br />

by Visualization<br />

Hans-Joachim Mucha<br />

Weierstraß-Institut für Angewandte Analysis und Stochastik (WIAS),<br />

D-10117 Berlin, Germany, mucha@wias-berlin.de<br />

Abstract. Clustering techniques can be used for segmenting a heterogeneous<br />

two-way contingency table into smaller, homogeneous parts. Following the paper of<br />

Greenacre (1988), here the focus is on decompositions of the Pearson chi-square<br />

statistic by clustering the rows and/or the columns of a contingency table. Especially<br />

the hierarchical Ward method as well as a generalization of Ward’s method<br />

will be considered. The latter can find clusters of different volume. Additionally, one<br />

can show that it is also possible to carry out partitional cluster analysis by starting<br />

from pairwise chi-square distances. Partitional clustering techniques optimize<br />

some numerical criterion with respect to a fixed number of clusters K. Often the<br />

partitional cluster analysis attains better solutions than hierarchical cluster analysis.<br />

In any case, the correspondence analysis is the appropriate visualization tool for<br />

both the contingency table and the clusters of the rows and/or columns. Moreover,<br />

the correspondence analysis plots will become more informative by an additional<br />

projection of a dendrogram. The latter shows the hierarchy of clusters. An application<br />

from the field of ecology illustrates the segmentation of a contingency table<br />

using different cluster analysis techniques.<br />
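The pairwise chi-square distances from which such a partitional clustering can start<br />
can be sketched as follows (a minimal illustration of the standard correspondence-analysis<br />
row distance, not the author's implementation):<br />

```python
import numpy as np

def row_chi2_distances(N):
    """Pairwise chi-square distances between the row profiles of a
    contingency table N, as used in correspondence analysis."""
    P = N / N.sum()                   # correspondence matrix
    r = P.sum(axis=1)                 # row masses
    c = P.sum(axis=0)                 # column masses
    profiles = P / r[:, None]         # row profiles (conditional distributions)
    diff = profiles[:, None, :] - profiles[None, :, :]
    return np.sqrt((diff**2 / c).sum(axis=2))
```

The resulting symmetric distance matrix can be fed to Ward-type hierarchical or partitional algorithms alike.<br />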

Key words: chi-square distance, hierarchical clustering, partitional clustering, correspondence<br />

analysis, dendrogram<br />

References<br />

Greenacre, M. J. (1988): Clustering the Rows and Columns of a Contingency Table.<br />

Journal of Classification 5, 39–51.<br />

− 98 −


Predictive classification trees<br />

Ulrich Müller-Funk and Stephan Dlugosz<br />

Institut für Wirtschaftsinformatik<br />

University of Münster<br />

Germany<br />

Abstract. Tree-based algorithms for classification and regression are highly popular<br />

because they give rise to results that are easy to interpret and to communicate.<br />

(Some people argue, moreover, that factor selection comes along automatically. This<br />

point, too, will be challenged in the paper.) CART and (exhaustive) CHAID figure<br />

prominently among the procedures actually used in data-based management etc.<br />

CART is a well-established, nonlinear and nonparametric procedure that produces<br />

binary trees. CHAID, in contrast, admits multiple splittings, a feature that allows one to<br />

exploit the splitting variable more extensively. On the other hand, that procedure<br />

depends on premises that are questionable in practical applications. This can be<br />

put down to the fact that CHAID relies on simultaneous chi-square and F-tests.<br />

Both types of procedures – as implemented in SPSS, for instance – do not take into<br />

account ordinal dependent variables. In the paper we suggest a tree-algorithm that<br />

• requires categorical variables,<br />

• chooses splitting attributes by means of predictive measures of association,<br />

• determines the cells to be united – and hence the number of splits – with the<br />

help of their conditional predictive power,<br />

• takes ordinal dependent variables into consideration.<br />
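As an illustration of a predictive measure of association that such a split criterion<br />
could use, the Goodman–Kruskal lambda (our example; the paper's exact measure may<br />
differ) can be computed as:<br />

```python
from collections import Counter

def goodman_kruskal_lambda(xs, ys):
    """Lambda(Y|X): proportional reduction in the error of predicting Y
    when the category of X is known."""
    n = len(ys)
    baseline = max(Counter(ys).values())      # modal prediction without X
    within = 0
    for x in set(xs):
        sub = [y for xi, y in zip(xs, ys) if xi == x]
        within += max(Counter(sub).values())  # modal prediction within each X-cell
    if n == baseline:
        return 0.0                            # Y is constant; nothing to predict
    return (within - baseline) / (n - baseline)
```

A splitting attribute with a larger lambda is, in this sense, more predictive of the dependent variable.<br />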

− 99 −


Efficient Media Exploitation towards Collective<br />

Intelligence<br />

Phivos Mylonas 1 , Vassilios Solachidis 2 , Andreas Geyer-Schulz 3 , Bettina<br />

Hoser 3 , Sam Chapman 4 , Fabio Ciravegna 4 , Steffen Staab 5 , Pavel Smrz 6 ,<br />

Yiannis Kompatsiaris 2 , and Yannis Avrithis 1<br />

1 National Technical University of Athens, Image, Video and Multimedia Systems<br />

Laboratory, Iroon Polytechneiou 9, Zographou Campus, Athens, GR 157 80,<br />

Greece, {fmylonas, iavr}@image.ntua.gr<br />

2 Centre of Research and Technology Hellas, Informatics and Telematics Institute,<br />

1st Km Thermi-Panorama Road, Thermi-Thessaloniki, GR 570 01, Greece,<br />

{vsol, ikom}@iti.gr<br />

3 Department of Economics and Business Engineering, Information Service and<br />

Electronic Markets, Kaiserstraße 12, Karlsruhe 76128, Germany<br />

{andreas.geyer-schulz, bettina.hoser}@kit.edu<br />

4 University of Sheffield, Department of Computer Science, Regent Court, 211<br />

Portobello Street, S1 4DP, Sheffield, UK {s.chapman, fabio}@dcs.shef.ac.uk<br />

5 Universität Koblenz-Landau, Information Systems and Semantic Web,<br />

Universitätsstraße 1, 57070 Koblenz, Germany, staab@uni-koblenz.de<br />

6 Brno University of Technology, Faculty of Information Technology, Bozetechova<br />

2, CZ-61266 Brno, Czech Republic smrz@fit.vutbr.cz<br />

Abstract. In this work we propose intelligent, automated content analysis techniques<br />

for different media to extract knowledge from the multimedia content. Information<br />

derived from different sources/modalities will be analyzed and fused, in<br />

terms of spatiotemporal, personal and even social contextual information. In order<br />

to achieve this goal, semantic analysis will be applied to the content items, taking<br />

into account the content itself (e.g. text, images and video), as well as existing<br />

personal, social and contextual information (e.g. semantic and machine-processable<br />

metadata and tags). The above process exploits the so-called “Media Intelligence”<br />

towards the ultimate goal of identifying “Collective Intelligence”, emerging from<br />

the collaboration and competition among people, empowering innovative services<br />

and user interactions. The utilization of “Media Intelligence” constitutes a departure<br />

from traditional methods for information sharing, since semantic multimedia<br />

analysis has to fuse information from both the content itself and the social context,<br />

while at the same time the social dynamics have to be taken into account. Such<br />

intelligence provides added-value to the available multimedia content and renders<br />

existing procedures and research efforts more efficient.<br />

− 100 −


Support Vector Machines in the Dual using<br />

Majorization and Kernels<br />

Georgi Nalbantov 1 , Patrick J.F. Groenen 2 , and Cor Bioch 3<br />

1 MICC, Maastricht University and<br />

Econometric Institute, Erasmus University Rotterdam, The Netherlands<br />

nalbantov@few.eur.nl<br />

2 groenen@few.eur.nl<br />

3 bioch@few.eur.nl<br />

Abstract. Recently, Support Vector Machines (SVMs) have proved to be a quite<br />

successful method for classification. One of the bottlenecks with this approach from a<br />

practical point of view is that existing solvers are rather slow. Usually, SVM<br />

solvers use specialized iterative optimization algorithms to solve the SVM optimization<br />

problem that are quite slow, especially in the so-called dual SVM formulation.<br />

Here, we propose to use another iterative method, which is a majorization method.<br />

It has already been applied successfully for solving the primal SVM formulation (see,<br />

Groenen, Nalbantov, and Bioch, 2007, <strong>2008</strong>). The contribution of this paper is to<br />

extend it to the dual formulation. This opens the door for, first of all, using different<br />

so-called kernel functions, which allow for nonlinear decision functions, and second,<br />

for handling more efficiently linear problems where the number of input variables is<br />

bigger than the number of observations.<br />

Key words: Support vector machines, Iterative majorization, Binary classification<br />

problem, Kernels<br />

References<br />

Groenen, P.J.F., Nalbantov, G., and Bioch, J.C. (2007): Nonlinear support vector<br />

machines through iterative majorization and I-splines. In: R.Decker, H-.J. Lenz<br />

(Eds.): Advances in data analysis. Springer, Berlin, 149–162.<br />

Groenen, P.J.F., Nalbantov, G., and Bioch, J.C. (<strong>2008</strong>, in press): SVM-Maj: A<br />

Majorization Approach to Linear Support Vector Machines with Different Hinge<br />

Errors. Advances in Data Analysis and Classification.<br />

− 101 −


Approach for Dynamic Problems in Clustering<br />

Anneke Neumann, Klaus Ambrosi, and Felix Hahne<br />

Institut für Betriebswirtschaft und Wirtschaftsinformatik<br />

Stiftung Universität Hildesheim<br />

{aneumann,ambrosi,hahne}@bwl.uni-hildesheim.de<br />

Abstract. In cluster analysis, a variety of methods has been developed for different<br />

areas of application (e.g. economics, biology, medicine, psychology), some of<br />

which were implemented in data evaluation software packages (e.g. SPSS, SAS). In<br />

many scenarios, particularly economic ones, special methods are required in order<br />

to analyze the development of clusters over time. While there are such methodical<br />

extensions for factor analysis and multidimensional scaling, hardly any dynamic<br />

approaches exist in the field of cluster analysis.<br />

In this talk, special attention will be paid to dynamic fuzzy clustering problems.<br />

Known approaches will be reviewed critically concerning their applicability in dynamic<br />

problems, and a new fuzzy clustering approach will be introduced which can<br />

be applied to dynamic problems.<br />
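For reference, the static fuzzy c-means baseline that dynamic extensions build on can<br />
be sketched as follows (the standard algorithm, not the new approach of the talk):<br />

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    """Standard (static) fuzzy c-means: alternate center and membership
    updates for a fixed number of sweeps."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)          # memberships sum to 1 per point
    for _ in range(n_iter):
        W = U ** m                             # fuzzified memberships
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))          # inverse-distance weights
        U = inv / inv.sum(axis=1, keepdims=True)
    return centers, U
```

A dynamic variant would let memberships and centers evolve over the time index instead of refitting from scratch.<br />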

Key words: Clustering, Fuzzy Clustering, Dynamic Data Analysis<br />

References<br />

Basford, K.E. and McLachlan, G.J. (1985): The Mixture Method of Clustering Applied<br />

to Three-Way Data. Journal of Classification, 2, 109–125.<br />

Höppner, F., Klawonn, F., Kruse, R., and Runkler, T. (1999): Fuzzy Cluster Analysis.<br />

Wiley, Chichester, New York.<br />

Joentgen, A., Mikenina, L., Weber, B., and Zimmermann, H.-J. (1999): Dynamic<br />

fuzzy data analysis based on similarity between functions. Fuzzy Sets and Systems,<br />

105, 81–90.<br />

Tucker, L.R. (1966): Some mathematical notes on three-mode factor analysis. Psychometrika,<br />

31, 279–311.<br />

− 102 −


Robust fitting of mixtures: The approach based<br />

on the Trimmed Likelihood Estimator<br />

Neyko Neykov 1 , Peter Filzmoser 2 , and Plamen Neytchev 1<br />

1 National Institute of Meteorology and Hydrology, Bulgarian Academy of<br />

Sciences, Sofia, Bulgaria, {Neyko.Neykov, Plamen.Neytchev}@meteo.bg<br />

2 Department of Statistics and Probability Theory, Vienna University of<br />

Technology, Austria, P.Filzmoser@tuwien.ac.at<br />

Abstract. The Maximum Likelihood Estimator (MLE) has commonly been used to<br />

estimate the unknown parameters in a finite mixture of distributions. However, the<br />

MLE can be very sensitive to outliers in the data. In order to overcome this problem,<br />

Neykov et al. (2007) adapted the trimmed likelihood methodology developed by<br />

Vandev and Neykov (1998) and Neykov and Müller (2003) to estimate mixtures<br />

in a robust way. The superiority of this approach in comparison with the MLE<br />

is illustrated by examples and simulation studies. The behavior of the widely used<br />

classical criteria for the assessment of the number of components in a mixture model<br />

and of their robustified versions is also studied in the presence of outliers.<br />
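The concentration idea behind trimmed likelihood estimation can be sketched for the<br />
simplest case, a single univariate normal (a toy illustration; the paper itself treats<br />
mixtures):<br />

```python
import statistics

def trimmed_likelihood_normal(xs, h, n_iter=20):
    """Concentration steps in the spirit of FAST-TLE: fit the normal MLE
    on a subset of size h, then re-select the h observations with the
    highest likelihood (here: smallest |x - mu|) under the current fit."""
    subset = sorted(xs)[:h]                  # crude starting subset
    for _ in range(n_iter):
        mu = statistics.fmean(subset)
        subset = sorted(xs, key=lambda x: abs(x - mu))[:h]
    mu = statistics.fmean(subset)
    sigma = statistics.pstdev(subset)
    return mu, sigma
```

Because the n − h worst-fitting points are trimmed, gross outliers cannot drag the estimate away, in contrast to the plain MLE.<br />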

Key words: Trimmed likelihood estimator, Finite mixtures of distributions<br />

References<br />

Maronna, R., Martin, R.D. and Yohai, V.J. (2006): Robust Statistics: Theory and<br />

Methods. Wiley, New York.<br />

McLachlan, G.J. and Peel, D. (2000): Finite Mixture Models. Wiley, New York.<br />

Neykov, N. and Müller, C. (2003): Breakdown Point and Computation of Trimmed<br />

Likelihood Estimators in GLMs. In: R. Dutter et al. (Eds.): Developments in<br />

robust statistics, pp. 277–286, Physica Verlag, Heidelberg.<br />

Neykov, N.M., Filzmoser, P., Dimova, R. and Neytchev, P.N. (2004): Mixture of<br />

Generalized Linear Models and the Trimmed Likelihood Methodology. In: J.<br />

Antoch (Ed.): Proceedings in Computational Statistics. Physica-Verlag, Heidelberg,<br />

1585–1592.<br />

Neykov, N., Filzmoser, P., Dimova, R. and Neytchev, P. (2007): Robust Fitting of<br />

Mixtures Using the Trimmed Likelihood Estimator. Computational Statistics &<br />

Data Analysis, 17(3), 299–308.<br />

Vandev, D.L. and Neykov, N.M. (1998): About Regression Estimators with High<br />

Breakdown Point. Statistics, 32, 111–129.<br />

− 103 −


Cluster Tree Estimation using a Generalized<br />

Single Linkage Method<br />

Rebecca Nugent 1 and Werner Stuetzle 2<br />

1 Department of Statistics, Carnegie Mellon University, Baker Hall, Pittsburgh,<br />

PA 15213, USA. rnugent@stat.cmu.edu<br />

2 Department of Statistics, University of Washington, Box 354322, Seattle, WA<br />

98195, USA wxs@stat.washington.edu<br />

Abstract. The goal of clustering is to detect the presence of distinct groups in a<br />

data set and assign group labels to the observations. In nonparametric clustering,<br />

we regard the observations as a sample from an underlying density and assume that<br />

groups correspond to modes of this density. The goal then is to find the modes<br />

and assign each observation to the domain of attraction of a mode. The (possibly<br />

hierarchical) modal structure of a density is summarized by its cluster tree; modes<br />

of the density correspond to leaves in the cluster tree. Estimating this cluster tree<br />

is the fundamental goal of nonparametric cluster analysis.<br />

We adopt a plug-in approach: estimate the cluster tree of the underlying density<br />

by the cluster tree of a density estimate. For density estimates that are piecewise<br />

constant (and so have computationally tractable level sets), the cluster tree can<br />

be computed exactly. However, for other density estimates, particularly in high<br />

dimensions, we have to be content with an approximation. We present a graph-based<br />

method that approximates the cluster tree for any density estimate and includes<br />

the introduction of a density-based similarity measure between observations. After<br />

motivating the method, we show results that allow us to reduce the graph to a<br />

spanning tree and then sketch an algorithm that allows the exact computation of the<br />

spanning tree whose edge weights are not of closed form. We point out mathematical<br />

and algorithmic similarities to single linkage clustering and illustrate our approach<br />

on several examples.<br />

Key words: cluster analysis, single linkage clustering, level sets, minimum density<br />

similarity measure, nearest neighbor density estimation<br />

References<br />

Stuetzle, W. and Nugent, R. (2007): A generalized single linkage method for estimating<br />

the cluster tree of a density. Technical Report 514, Department of Statistics,<br />

University of Washington.<br />

− 104 −


Multi-Class Extension of Verifiable Ensemble<br />

Models for Safety-Related Applications<br />

Sebastian Nusser 1,2 , Clemens Otte 1 , and Werner Hauptmann 1<br />

1 Siemens AG, Corporate Technology, Otto-Hahn-Ring 6, 81730 Munich, Germany,<br />

{sebastian.nusser.ext,clemens.otte,werner.hauptmann}@siemens.com<br />

2 School of Computer Science, Otto-von-Guericke-University of Magdeburg,<br />

Universitätsplatz 2, 39106 Magdeburg, Germany<br />

Abstract. For safety-related applications, models learned from data must be verifiable<br />

and, thus, interpretable by domain experts. In a previous work (Nusser et al.,<br />

2007) we developed a sequential covering algorithm for binary classification problems<br />

in safety-related domains. It is based on ensembles of low-dimensional submodels,<br />

where each submodel as well as the overall ensemble model can be verified. Thus, the<br />

correct interpolation and extrapolation behavior of the complete model can be guaranteed.<br />

In the present contribution we extend the approach to multi-class problems.<br />

The extension is not straightforward since common methods like one-against-one or<br />

one-against-rest voting (Friedman, 1996; Hsu and Lin, 2002) may introduce inconsistencies.<br />

We show that inconsistencies can be avoided by introducing a hierarchy<br />

of misclassification costs. Such a hierarchy is used to define a strict ordering of the<br />

kind: “class c1 should never be misclassified, class c2 might only be misclassified as<br />

c1, class c3 might be misclassified as c1 or c2.” Our method follows a sequential<br />

covering concept also for multi-class classification: low-dimensional submodels are<br />

trained to separate the samples of the class with the minimal misclassification costs<br />

from the samples of all remaining classes. If the problem is solved for this class or<br />

no further improvements are possible, all remaining samples of this class are removed<br />

from the training data set and the procedure is repeated for the next class<br />

within the hierarchy of misclassification costs. Experimental evaluation carried out<br />

on benchmark data sets from the UCI Machine Learning Repository shows a good<br />

trade-off between interpretation and prediction accuracy of our method.<br />
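The sequential covering scheme described above can be sketched as follows (a schematic<br />
outline with a caller-supplied binary fitter, not the verified submodels of the paper):<br />

```python
def sequential_covering(samples, labels, class_order, train_binary):
    """Walk the misclassification-cost hierarchy: separate the currently
    most critical class from all remaining samples, then drop that class
    and repeat; the last class in the order acts as the default."""
    remaining = list(zip(samples, labels))
    submodels = []
    for cls in class_order[:-1]:
        xs = [x for x, _ in remaining]
        ys = [1 if y == cls else 0 for _, y in remaining]
        submodels.append((cls, train_binary(xs, ys)))
        remaining = [(x, y) for x, y in remaining if y != cls]
    def predict(x):
        for cls, model in submodels:  # first firing submodel wins
            if model(x):
                return cls
        return class_order[-1]
    return predict
```

Trying the submodels in cost order is what rules out the inconsistencies that plain one-against-one or one-against-rest voting can produce.<br />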

Key words: multi-class, ensemble learning, local modeling, interpretability<br />

References<br />

Friedman, J. H. (1996). Another approach to polychotomous classification. Technical<br />

report, Department of Statistics, Stanford University.<br />

Hsu, C.-W. and Lin, C.-J. (2002). A comparison of methods for multiclass support<br />

vector machines. Neural Networks, IEEE Trans. on, 13(2):415–425.<br />

Nusser, S., Otte, C., and Hauptmann, W. (2007). Learning binary classifiers for<br />

applications in safety-related domains. In Proceedings of 17th Workshop Computational<br />

Intelligence, pages 139–151. Universitätsverlag Karlsruhe.<br />

− 105 −


Analysis of Borrowing and Guaranteeing<br />

Relationships among Government Officials in<br />

the Eighth Century in the Old Capital of Japan<br />

Akinori Okada 1 and Towao Sakaehara 2<br />

1 Graduate School of Management and Information Sciences, Tama University,<br />

4-1-4 Hijirigaoka, Tama-shi, Tokyo 206-0022, Japan, okada@tama.ac.jp<br />

2 Department of History, Faculty and Graduate School of Literature and Human<br />

Sciences, Osaka City University,<br />

Sugimoto-cho 3, Sumiyoshi-ku, Osaka City 558-8585, Japan,<br />

sakaehar@lit.osaka-cu.ac.jp<br />

Abstract. In the present study relationships among lower-ranked government officials,<br />

working in the old capital of Japan called Heijo-kyo in the eighth century, are<br />

analyzed. They were engaged in copying the Buddhist sutra in the capital. The documents<br />

which show the borrowing and guaranteeing relationships among them have<br />

been kept in the governmental warehouse called Shoso-in (Sakaehara, 1987). The<br />

documents tell (a) the borrower, (b) the amount of money borrowed, (c) who stood<br />

guarantee for the borrower, (d) the date of borrowing. From these documents, the<br />

table which shows the borrowing and guaranteeing relationships among government<br />

officials was derived. The (j, k) element of the table shows the amount of money<br />

the government official corresponding to row j borrowed which was guaranteed by<br />

the government official corresponding to column k. One who stood guarantee for<br />

his colleague seems more dominant than one who borrowed. These relationships are<br />

asymmetric. The table was derived for the years 772, 773, and 774 (including the<br />

beginning of 775). The table was analyzed by the asymmetric multidimensional scaling<br />

(Okada and Imaizumi, 1997). The obtained configuration shows the dominance<br />

relationships among government officials and groups of them.<br />
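Deriving the asymmetric table from the documents can be sketched as follows (the<br />
record format (borrower, guarantor, amount) is a hypothetical simplification of the<br />
Shoso-in entries):<br />

```python
def borrowing_table(records, officials):
    """Build the asymmetric table whose (j, k) entry is the total amount
    official j borrowed under the guarantee of official k."""
    idx = {name: i for i, name in enumerate(officials)}
    T = [[0] * len(officials) for _ in officials]
    for borrower, guarantor, amount in records:
        T[idx[borrower]][idx[guarantor]] += amount
    return T
```

Since guaranteeing and borrowing are not interchangeable roles, T is in general not symmetric, which is what the asymmetric multidimensional scaling exploits.<br />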

Key words: Asymmetry, Borrowing and guaranteeing relationships, Historical<br />

data, Multidimensional scaling<br />

References<br />

Okada, A. and Imaizumi, T. (1997): Asymmetric multidimensional scaling of<br />

two-mode three-way proximities. Journal of Classification, 14, 195–224.<br />

Sakaehara, T. (1987): People’s Life Styles in the Capital City. In: T. Kishi<br />

(Ed.): Modes of Life in the Capital Cities. Chuokoron-sha, Tokyo, 187–266.<br />

(in Japanese)<br />

− 106 −


Variable Selection for kernel classifiers:<br />

A Feature-to-Input Space Approach<br />

Surette Oosthuizen and Sarel Steel<br />

Department of Statistics and Actuarial Science, University of Stellenbosch, Private<br />

Bag X1, 7602 Matieland, South Africa (surette@sun.ac.za; sjst@sun.ac.za)<br />

Abstract. Consider using values of input variables X1, X2, · · · , Xp to classify entities<br />

into one of two groups. Kernel classifiers, e.g. support vector machines (SVMs)<br />

and kernel Fisher discriminant analysis (KFDA), are known to be exceptionally well<br />

suited for this task. In general the classification accuracy of SVMs and KFDA can<br />

however be improved substantially if instead of the comprehensive set of p input<br />

variables, a smaller subset of (say m) input variables is used. Let the space in which<br />

the training patterns reside, be called the input space. Also, let Φ map the input<br />

space to a higher-dimensional so-called feature space. An aspect which complicates<br />

variable selection for non-linear kernel classifiers is that they make implicit use of<br />

Φ: they are linear functions in a higher-dimensional feature space. Since Φ is usually<br />

unknown, and the feature space can be infinite-dimensional, the implicit transformation<br />

step obscures the contributions of variables to the kernel discriminant function.<br />

In this paper we propose a new variable selection approach for kernel classifiers,<br />

viz. so-called feature-to-input space (FI) selection. The basic idea underlying this<br />

approach is to combine the information obtained from feature space with the easy<br />

interpretation in input space. We discuss several approaches and evaluate the resulting<br />

selection criteria in a fairly extensive simulation study.<br />

Key words: Variable Selection, Kernel Based Classification, Kernel Methods<br />

References<br />

Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Smola, A.J. and Müller, K.-R. (1999).<br />

Fisher discriminant analysis with kernels. Proceedings of Neural Networks for<br />

Signal Processing, 9, 41-48.<br />

Shawe-Taylor, J. and Cristianini, N. (2004). Kernel methods for pattern analysis.<br />

Cambridge University Press, Cambridge.<br />

Rakotomamonjy, A. (2002). Variable selection using SVM-based criteria. Perception<br />

Systeme Information, Insa de Rouen, Technical Report PSI 2002-04.<br />

Rakotomamonjy, A. (2003). Variable selection using SVM-based criteria. Journal of<br />

Machine Learning Research, 3, 1357-1370.<br />

− 107 −


Classifying hospitals with respect to their<br />

diagnostic diversity using Shannon’s entropy<br />

Thomas Ostermann 1 , Reinhard Schuster 2 , and Christoph Erben 2<br />

1 Department of Medical Theory and Complementary Medicine, University of<br />

Witten/Herdecke, Gerhard-Kienle-Weg 4, 58313 Herdecke, Germany,<br />

thomaso@uni-wh.de<br />

2 Medical Review Board of the Statutory Health Insurance in North Germany,<br />

Katharinenstr 11a, 23554 Luebeck, Germany, Reinhard.Schuster@mdk-nord.de<br />

Abstract. Background: In Germany hospital comparisons are part of health status<br />

reporting. However, some methodological problems arise in classifying hospitals<br />

by means of their reported data. This article presents the application of Shannon’s<br />

entropy measure for hospital comparisons. Material and Methods: We used Shannon’s<br />

entropy given by E(p1, . . . , pn) = Pn k=1 pk log pk as an approach to measure<br />

the diagnostic diversity of a hospital department. Based on a data set of aggregated<br />

three-digit ICD-9 codes from the L4 hospital statistics of 1998 in Schleswig-Holstein,<br />

we compared the resulting measures for diagnostic diversity with respect to<br />

the hospital department’s area (e.g. surgery, gynecology) and to the hospital status<br />

(primary, secondary, tertiary or specialized hospital). Results: Highly specialized<br />

departments like obstetrics (0.44) or ophthalmology (0.46) do generate lower entropy<br />

values than area-spanning departments like radiology (0.52) or general gynaecology<br />

(0.56), which have significantly higher values. Discussion: We showed how entropy<br />

can be used as a measure for classifying hospitals. Our approach can basically be<br />

implemented in all fields of health services research where categorical data emerges.<br />

Especially in DRG-data this approach is quite promising and should be applied.<br />
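The entropy measure can be sketched as follows (dividing by log n to make departments<br />
of different size comparable is our assumption; the paper's exact scaling may differ):<br />

```python
import math

def shannon_entropy(counts, normalize=True):
    """Shannon entropy of a diagnosis frequency distribution; with
    normalize=True it is divided by log(n), giving values in [0, 1]."""
    total = sum(counts)
    ps = [c / total for c in counts if c > 0]
    h = -sum(p * math.log(p) for p in ps)
    if normalize and len(ps) > 1:
        h /= math.log(len(ps))
    return h
```

A department whose cases concentrate on few diagnoses thus scores low, an area-spanning department high, matching the ordering reported in the abstract.<br />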

Key words: Entropy, diagnostic diversity, hospital comparison, classification<br />

References<br />

Brindle, G.W. and Gibson, C.J. (2007): Entropy as a measure of diversity in an<br />

inventory of medical devices. Medical Engineering & Physics,29, Epub.<br />

Elayat, H.A., Murphy, B.B. and Prabhakar, N.D. (1978): Entropy in the hierarchical<br />

cluster analysis of hospitals. Health Serv Res, 13, 395–403.<br />

Erben, C.M. (2000): The concept of entropy as a possibility for gathering mass data<br />

for nominal scaled data in heath status reporting. Stud Health Technol Inform,<br />

77, 118–9.<br />

− 108 −


Clustering and Dimensionality Reduction to<br />

Discover Interesting Patterns in Binary Data<br />

Francesco Palumbo<br />

Dipartimento di Istituzioni Economiche e Finanziarie - University of Macerata<br />

Via Crescimbeni, 20 - I-62100, Italy<br />

francesco.palumbo@unimc.it<br />

Abstract. A key element in the success of data analysis is the strong contribution<br />

of visualization: dendrograms and factorial plans are intuitive ways to display<br />

association relationships within and among sets of variables and groups of units.<br />

In the Association Rules (AR) mining we refer to a n × p data matrix, where n<br />

indicates the number of statistical units and p the number of attributes, which are<br />

also called items. The problem consists in analyzing links between attributes. Sets<br />

of attributes that co-occur through the whole data matrix are referred to as patterns.<br />

Scanning the whole data set and analyzing all the relationships is an interesting<br />

and promising approach, yet this approach leads to an NP-hard problem and gets<br />

no solution when dealing with a large number of attributes.<br />

Moreover, in some cases, the most interesting relationships refer to subpopulations<br />

in the data, and they are hidden by the obvious ones and cannot be identified<br />

by the classical descriptive and inferential statistical methods.<br />

The joint use of factorial and clustering methods in a unitary exploratory approach<br />

copes with these issues. It allows the analyst to identify the most interesting<br />

groups of units and sets of attributes; by focusing the attention only on them,<br />

interesting patterns are more easily identified in large and huge binary data bases.<br />
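The support counting that underlies AR mining of co-occurring attributes can be<br />
sketched as follows (the simplest case of pairwise patterns only):<br />

```python
from itertools import combinations

def frequent_pairs(rows, min_support):
    """Count co-occurring attribute pairs in an n x p binary data matrix
    and keep those whose support (relative frequency) reaches min_support."""
    counts = {}
    for row in rows:
        items = [j for j, v in enumerate(row) if v]
        for pair in combinations(items, 2):
            counts[pair] = counts.get(pair, 0) + 1
    n = len(rows)
    return {p: c / n for p, c in counts.items() if c / n >= min_support}
```

Already for pairs the candidate space is quadratic in p, which illustrates why scanning all patterns becomes infeasible and why narrowing the search to interesting subgroups pays off.<br />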

Key words: Dimensionality Reduction, Binary Data, AR mining<br />

References<br />

Iodice D’Enza A., Palumbo F. and Greenacre M. (2007): Exploratory data analysis<br />

leading towards the most interesting simple association rules. Comput. Statist.<br />

Data Anal., Corrected Proof, doi:10.1016/j.csda.2007.10.006.<br />

Mizuta, M. (2004): Dimension reduction methods. In J.E. Gentle, W. Hardle and<br />

Y. Mori (Eds.): Handbooks of Computational Statistics. Concepts and Methods.<br />

Springer-Verlag, Heidelberg, pp. 565-589.<br />

Plasse M., Niang N., Saporta G., Villeminot A. and Leblond L. (2007): Combined<br />

use of association rules mining and clustering methods to find relevant links<br />

between binary rare attributes in a large data set. Comput. Statist. Data Anal.,<br />

doi: 10.1016/j.csda.2007.02.020.<br />

− 109 −


Linear Encoding of Multiple Inheritance Hierarchies:<br />

Revival of an Ancient Classification Method<br />

Wiebke Petersen<br />

Institut für Sprache und Information, Heinrich-Heine-Universität Düsseldorf,<br />

petersew@uni-duesseldorf.de<br />

Abstract. The formal methods employed in Pāṇini's more than two-thousand-year-old<br />

Sanskrit grammar are astonishing in their modernity. This talk is devoted in particular<br />

to the method of representing sets as intervals of a list, which is used there to<br />

classify the sound classes. The method is distinguished by the fact that it allows<br />

certain polyhierarchies (i.e., hierarchies in which a class may have more than one<br />

direct superclass) to be encoded linearly. Monohierarchies can be represented linearly<br />

as nested lists, since they always form a tree structure. For general polyhierarchies<br />

no linear representation method is available yet; they are usually described by means<br />

of a set of constraints, with the hierarchy mostly decomposed into the individual<br />

elements of its binary neighbourhood relation. As a consequence, many queries, for<br />

example for a hierarchical substructure, have to be processed laboriously through<br />

recursive calls. A further drawback of polyhierarchies is that they are often hard to<br />

read owing to numerous edge crossings. Since crossing-free hierarchies are better<br />

accepted and understood by the users of a system, many current ontology systems<br />

exclude polyhierarchies, or at least hide them from the user. For these reasons,<br />

numerous formalisms admit only tree-shaped hierarchies.<br />

The talk first gives a complete characterization of the classifications whose classes<br />

can be represented as intervals of a list according to Pāṇini's method. Such a<br />

classification is called S-representable. It is further shown formally that the Hasse<br />

diagrams of S-representable classifications can always be drawn without crossings.<br />

Finally, it is examined to what extent it is appropriate, for certain applications, to<br />

extend the class of admissible hierarchies from tree-shaped to S-representable ones,<br />

so as to permit at least restricted multiple inheritance without losing the advantages<br />

of a crossing-free drawing and of an efficient linear encoding and processing of<br />

hierarchical relations.<br />

Key words: Pāṇini, hierarchies, crossing-free drawing, linear encoding<br />
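The interval property at the heart of S-representability can be checked mechanically once an ordering is fixed. The following Python sketch is an illustration only, testing whether every class of a classification is contiguous in a given list; it does not implement Pāṇini's full marker-based encoding or the search for a suitable ordering:

```python
def is_interval_representable(order, classes):
    """Check whether every class occupies a contiguous interval of the
    list `order`: the positions of its elements must form a gap-free
    run, i.e. span exactly len(class) consecutive slots."""
    pos = {x: i for i, x in enumerate(order)}
    for cls in classes:
        idx = sorted(pos[x] for x in cls)
        if idx[-1] - idx[0] + 1 != len(idx):
            return False
    return True
```

Nested classes such as {a, b} within {a, b, c, d} pass the check, whereas a class whose members are separated in the list fails it.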

− 110 −


A Formal Concept Analysis Approach to Qualitative<br />

Citation Analysis<br />

Wiebke Petersen und Petja Heinrich<br />

Institut für Sprache und Information, Heinrich-Heine-Universität Düsseldorf,<br />

petersew@uni-duesseldorf.de<br />

Abstract. Among the tasks of bibliometrics is citation analysis (Kessler 1963),<br />

that is, the analysis of co-citations (two texts are co-cited if there is a text in<br />

which both are cited) and of bibliographic coupling (two texts are bibliographically<br />

coupled if both share a common citation).<br />

The talk shows that Formal Concept Analysis (FCA) provides suitable tools for a<br />

qualitative citation analysis. A special property of FCA is that it allows combining<br />

attributes of different kinds (qualitative and scalar). By employing suitable scales,<br />

one can also counter the problem that the large number of texts to be analyzed<br />

typically leads, in qualitative approaches, to cluttered citation graphs whose content<br />

cannot be grasped.<br />

The bibliographic coupling relation is closely related to the neighbourhood contexts<br />

developed by Priss, which are used for the analysis of lexical databases.<br />

By means of several example analyses, the central notions of citation analysis<br />

are modelled in formal contexts and concept lattices. It turns out<br />

that the hierarchical concept lattices of FCA are superior to ordinary citation<br />

graphs in many respects, since their hierarchical<br />

lattice structure makes certain regularities explicit. It is also shown<br />

how combining suitable attributes (doctoral advisor, institute, department,<br />

citation frequency, keywords) and scales can counter frequent sources of error such<br />

as courtesy citations, habitual citations, and so on.<br />

Key words: bibliographic coupling, co-citation, Formal Concept Analysis<br />

References<br />

B. Ganter & R. Wille (1999): Formal Concept Analysis. Mathematical Foundations.<br />

Berlin: Springer.<br />

M.M. Kessler (1963): Bibliographic coupling between scientific papers. American Documentation,<br />

Vol. 14, 10–25.<br />

U. Priss & J. Old (2004): Modelling Lexical Databases with Formal Concept Analysis.<br />

Journal of Universal Computer Science, Vol. 10, Nr. 8, 967–984.<br />

− 111 −


The Analysis of the Power of Some Chosen VaR<br />

Backtesting Procedures - A Simulation Approach<br />

Krzysztof Piontek<br />

Department of Financial Investments and Risk Management<br />

Wroclaw University of Economics<br />

ul. Komandorska 118/120, 53-345 Wroclaw, Poland<br />

krzysztof.piontek@ae.wroc.pl<br />

Abstract. The definition of Value at Risk is quite general. There are many approaches<br />

which can give different VaR values. The challenge is not to suggest a new<br />

method but to distinguish between good and bad models. Backtesting is the necessary<br />

statistical procedure to evaluate the performance of VaR models and to select the<br />

best one. If the power of the test is low, an inaccurate VaR model is likely to be<br />

misclassified as well specified, which can be a threat to financial institutions.<br />

The aim of this article is to analyze backtesting methodologies, focusing on<br />

limited data sets and the power of the tests. There are three groups of methods<br />

for validating VaR models: those based on the frequency of failures, those based on<br />

various loss functions, and those based on the adherence of a VaR model to the asset<br />

return distribution. This article presents and summarizes some of the frequently used<br />

methods from each group (proposed by Kupiec, Christoffersen, Lopez and Berkowitz).<br />

The main part of this work, however, is a statistical evaluation of the most widely<br />

applied tests for small data sets (as usually observed in practice). We analyze the<br />

performance of the tests based on the type II error, in order to select the best one<br />

for different numbers of observations and model mis-specifications. This verification<br />

uses simulated asset returns. The results indicate that some tests are not<br />

adequate for small samples, even with 1000 observations, which is a very important<br />

issue when acceptance of internal models for market risk management is considered.<br />

Key words: risk measurement, Value at Risk, backtesting, power of tests<br />
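Of the frequency-of-failures backtests named above, Kupiec's proportion-of-failures test is the simplest. A minimal Python sketch of its likelihood-ratio statistic, compared against a chi-square distribution with one degree of freedom, might look as follows (an illustration, not the article's simulation study):

```python
import math

def kupiec_pof(n_obs, n_fail, p):
    """Kupiec's proportion-of-failures LR statistic for a VaR model
    with nominal coverage p, given n_fail exceptions in n_obs days.
    Under the null the statistic is asymptotically chi2(1); compare
    with e.g. 3.84 for a 5% test."""
    pi_hat = n_fail / n_obs

    def loglik(q):
        # binomial log-likelihood; guard against log(0) at the boundary
        q0 = min(max(q, 1e-300), 1 - 1e-300)
        return (n_obs - n_fail) * math.log(1 - q0) + n_fail * math.log(q0)

    return -2.0 * (loglik(p) - loglik(pi_hat))
```

The power problem discussed in the abstract is visible directly: with 250 observations and a 1% VaR, five exceptions (twice the expected number) give a statistic of only about 1.96, far below the 3.84 critical value.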

References<br />

Haas, M. (2001): New Methods in Backtesting, CAESAR,<br />

www.caesar.de/uploads/media/cae pp 0010 haas 2002-02-05 01.pdf<br />

Piontek, K. (2007): A Survey and a Comparison of Backtesting Procedures<br />

(in Polish), In: P. Chrzan: Metody matematyczne, ekonometryczne..., Katowice.<br />

Sarma, M., Thomas, S., Shah, A. (2003): Selection of Value-at-Risk Models,<br />

ideas.repec.org/s/jof/jforec.html<br />

− 112 −


Testing distributions in errors-in-variables<br />

models<br />

Denys Pommeret 1<br />

Aix-Marseille 2 University, pommeret@iml.univ-mrs.fr<br />

Abstract. Within the frame of errors-in-variables models we consider the sum of<br />

two independent random variables, X = W + Z, where W is the variable of interest<br />

with known distribution Π and the error Z has unknown density f. We<br />

present a smooth goodness-of-fit test for the distribution of the error Z. For<br />

that we observe an i.i.d. sample X1, . . . , Xn with mixture density function<br />

g(x) = ∫ f(x, m) Π(dm),<br />

where Π is a real probability distribution and the f(x, m) are real m-parameterized<br />

density functions, for m in some set M ⊂ R. Here Π plays the role of the known<br />

mixing distribution, and we want to test<br />

H0 : f(x, m) = f0(x, m) for all m in M,<br />

where f0 is a specified probability density function. An adaptation of the Neyman<br />

smooth test is proposed.<br />

Key words: Mixture models, Neyman’s test, Score statistic, Schwarz’s criteria<br />
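For orientation, the classical Neyman smooth test that the proposal adapts can be sketched as follows, here in Python for a fully specified null after the probability-integral transform u = F0(x). The errors-in-variables adaptation of the abstract is more involved; this is only the textbook version:

```python
import math

def neyman_smooth_stat(u, d=3):
    """Classical Neyman smooth test statistic of order d for
    H0: the transformed sample u = F0(x) is Uniform(0,1).
    Uses orthonormal shifted Legendre polynomials on [0,1]; under H0
    the statistic is asymptotically chi-square with d degrees of
    freedom."""
    legendre = [
        lambda t: math.sqrt(3.0) * (2 * t - 1),
        lambda t: math.sqrt(5.0) * (6 * t * t - 6 * t + 1),
        lambda t: math.sqrt(7.0) * (20 * t**3 - 30 * t * t + 12 * t - 1),
    ]
    n = len(u)
    stat = 0.0
    for phi in legendre[:d]:
        comp = sum(phi(t) for t in u) / math.sqrt(n)  # score component
        stat += comp * comp
    return stat
```

A near-uniform sample yields a statistic close to zero, while a sample concentrated away from uniformity inflates the score components; a data-driven choice of d via Schwarz's criterion gives the Ledwina (1994) variant cited below.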

References<br />

Hart, J.D. (1997): Nonparametric smoothing and lack-of-fit tests, Springer Series in<br />

Statistics. New York, NY.<br />

Kallenberg, W.C.M. and Ledwina, T. (1995): Consistency and Monte Carlo simulation<br />

of data driven version of smooth goodness of fit tests, Ann. Statist, 23,<br />

1594–1608.<br />

Ledwina, T. (1994): Data-Driven Version of Neyman's Smooth Test of Fit, Journal<br />

of the American Statistical Association, 89, 1000–1005.<br />

Lehmann, E.L. and Romano, J.P. (2005): Testing statistical hypotheses, 3rd ed.<br />

Springer Texts in Statistics. New York, NY: Springer.<br />

− 113 −


Classification with an increasing number of<br />

components<br />

Odile Pons 1<br />

INRA, Mathematics, Jouy-en-Josas, France, Odile.Pons@jouy.inra.fr<br />

Abstract. After estimating the rn actual components of a mixture whose number<br />

of components increases with the sample size, the question is to determine to which<br />

group a given observation Xi, i = 1, . . . , n, belongs. A classification consists in<br />

mapping an observation Xi (or a value x of X) into a class k̂n(Xi)<br />

(or k̂n(x)) in {1, . . . , rn}, which may either be uniquely defined (fixed case) or<br />

related to a random distribution. In both cases, the component k̂n is chosen by<br />

maximum likelihood with a penalization.<br />

A random classification avoids misclassification of observations with overlapping<br />

densities: k(Xi) = j with an estimated probability and k(x) = j with a fixed<br />

probability, 1 ≤ j ≤ rn. They are estimated by<br />

μ̂n,k̂n(x) f̂n,k̂n(x) = max1≤k≤rn Qn(μ̂n,k f̂n,k; x), with<br />

Qn(μ̂n,k f̂n,k; x) = μ̂n,k f̂n,k(x) − nλn² Σj=1..q p(μ̂n,j) − νn² Σ1≤j≤r π(f̂n,k − f̂n,j),<br />

where the penalization coefficients λn and νn tend to zero as n → ∞, and p and π<br />

are smooth functions. For a random classification, k̂n(Xi) is defined in the same<br />

way with some probabilities. The random procedure preserves all the classes: the<br />

proportions of observations belonging to {1, . . . , rn} are asymptotically identical to<br />

the mixture probabilities.<br />

Key words: Mixture, classification, asymptotics<br />

References<br />

Lemdani, M. and Pons, O. (2007): Large mixture models with an increasing number<br />

of components. Unpublished.<br />

Pons, O. (<strong>2008</strong>): Asymptotic distributions in finite semi-parametric mixture<br />

models. To appear.<br />

− 114 −


Bagging with different split criteria<br />

Sergej Potapov and Berthold Lausen<br />

Department of Biometry and Epidemiology, Friedrich-Alexander-University<br />

Erlangen-Nuremberg, Waldstraße 6, D-91054 Erlangen, Germany<br />

Sergej.Potapov@imbe.imed.uni-erlangen.de<br />

Abstract. In recent years many papers have discussed boosting- and bagging-based<br />

methods for supervised learning or machine learning. Both concepts aggregate sets of<br />

estimated trees, which are derived by split criteria without adjusting for variables<br />

measured on different scales. Breiman et al. (1984) observed that quantitative<br />

variables tend to be selected more often than binary variables. As a solution, Lausen<br />

et al. (1994, 2004) introduced p-value adjusted classification and regression trees,<br />

which use the p-value of maximally selected test statistics as split criterion. The<br />

p-value adjustment avoids the possible selection bias between variables measured on<br />

different scales. The R package TWIX of Potapov et al. (<strong>2008</strong>) offers p-value<br />

adjusted classification trees and bagging of p-value adjusted classification trees.<br />

In our paper we compare bagging and double-bagging (Hothorn and Lausen, 2003)<br />

without and with p-value adjustment by means of simulation. Moreover, we illustrate<br />

our approach using a clinical study involving microarray data.<br />

Key words: bagging, CART, machine learning, trees, microarray data<br />
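The bagging step itself, independently of which split criterion is being compared, reduces to fitting base learners on bootstrap resamples and aggregating by majority vote. A self-contained Python sketch with decision stumps standing in for trees (the paper itself uses the R package TWIX; this is only a generic illustration):

```python
import random

def fit_stump(X, y):
    """Exhaustively pick the (feature, threshold, sign) split that
    minimises misclassification on the training sample; labels are
    -1/+1. A stand-in for a full tree learner."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted(set(row[j] for row in X)):
            for sign in (1, -1):
                pred = [sign if row[j] > thr else -sign for row in X]
                err = sum(p != t for p, t in zip(pred, y))
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    _, j, thr, sign = best
    return lambda row: sign if row[j] > thr else -sign

def bagging(X, y, n_trees=25, seed=0):
    """Bagging: fit each base learner on a bootstrap resample of the
    training set, then predict by majority vote."""
    rng = random.Random(seed)
    n = len(X)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))

    def predict(row):
        votes = sum(s(row) for s in stumps)
        return 1 if votes >= 0 else -1

    return predict
```

Double-bagging additionally feeds out-of-bag discriminant variables into each tree; swapping `fit_stump` for a p-value adjusted learner is where the paper's comparison comes in.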

References<br />

Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984): Classification and<br />

regression trees. Wadsworth Press.<br />

Hothorn, T., Lausen, B. (2003): Double-bagging: Combining classifiers by bootstrap<br />

aggregation. Pattern Recognition 36(6), 1303–1309.<br />

Lausen, B., Hothorn, T., Bretz, F., Schumacher, M. (2004): Assessment of optimal<br />

selected prognostic factors. Biometrical Journal, 46, 364–374.<br />

Lausen, B., Sauerbrei, W., Schumacher, M. (1994): Classification and regression trees<br />

(CART) used for the exploration of prognostic factors measured on different<br />

scales, in: Dirschedl, P., and Ostermann, R. (eds.), Computational Statistics,<br />

Physica-Verlag, Heidelberg, 483–496.<br />

Potapov, S., Theus, M. (<strong>2008</strong>): The TWIX package (Version 0.2.4). http://cran.r-project.org<br />

− 115 −


Remarks on the Existence of CML Estimates<br />

for the PCM by means of the R Package eRm<br />

Antonio Punzo 1<br />

University of Milano-Bicocca, Department of Quantitative Methods for Business<br />

and Economic Sciences, a punzo@libero.it<br />

Abstract. Mair and Hatzinger (2007) have recently proposed, in the Journal of<br />

Statistical Software, the R package eRm (extended Rasch models) for computing<br />

Rasch models and several extensions. Undoubtedly, in the eRm class the partial<br />

credit model (PCM) is, for practical testing purposes, one of the best known.<br />

The package estimates the item parameters of the above-mentioned models using<br />

a unitary conditional maximum likelihood (CML) procedure.<br />

Although the eRm models belong to the Rasch family and share its distinguishing<br />

characteristics, they suffer from the problem of possible non-existence<br />

of estimates. In the literature, both in the joint and in the conditional ML approach,<br />

the configurations and conditions of non-existence for the RM are well known<br />

(Fischer, 1981). The eRm package performs a preliminary data check only for the<br />

RM, and the conditions of non-existence are known for the PCM only in the joint<br />

case (Bertoli-Barsotti, 2005).<br />

In this article the main focus is on the PCM; the above-mentioned JML<br />

non-existence configurations for this model are the starting point. A class of<br />

counterexamples is illustrated that leads to “false” CML estimates with the eRm<br />

package, i.e., values that appear to be estimates but that, on closer analysis of the<br />

maximized function, are rather a clear signal of non-existence. Moreover,<br />

the results obtained emphasize the presence of additional CML non-existence<br />

configurations compared to those valid in the JML case.<br />

Key words: Rasch models, Partial Credit model, Conditional Maximum Likelihood<br />

estimate, R package eRm<br />

References<br />

Bertoli-Barsotti, L. (2005): On the Existence and Uniqueness of JML Estimates for<br />

the Partial Credit Model. Psychometrika, 70, 3, 517–531.<br />

Fischer, G. H. (1981): On the Existence and Uniqueness of Maximum-Likelihood<br />

Estimates in the Rasch Model. Psychometrika, 46, 1, 59–77.<br />

Mair P. and Hatzinger R. (2007): Extended Rasch Modeling: The eRm Package for<br />

the Application of IRT Models in R. Journal of Statistical Software, 20, 9, 1–20.<br />

− 116 −


Dynamic disturbances in BTA deep-hole<br />

drilling - Identification of spiralling as a<br />

regenerative effect<br />

Nils Raabe, Dirk Enk, Claus Weihs, and Dirk Biermann<br />

Technische Universität Dortmund<br />

Germany<br />

Abstract. One serious problem in deep-hole drilling is the formation of a dynamic<br />

disturbance called spiralling, which causes holes with several lobes. Since such lobes<br />

are a severe impairment of the bore hole, the formation of spiralling has to be<br />

prevented. One common explanation for the occurrence of spiralling is the intersection<br />

of time-varying bending eigenfrequencies with multiples of the tool's rotational<br />

frequency. Little is known about which specific eigenfrequencies are crucial.<br />

Furthermore, an underlying assumption of this explanation is that the resulting holes<br />

show, in cross-sectional view, a curve of constant width. This assumption<br />

implicitly supposes spiralling to result from a parallel displacement of the drill head.<br />

We in fact observed spiralling in experiments designed to force it by planning<br />

crucial frequency intersections using a statistical-physical model proposed in earlier<br />

work. However, not every intersection of an eigenfrequency with a multiple of the<br />

rotational frequency led to spiralling. Furthermore, we also found cases of spiralling<br />

with two or four lobes, contradicting the common assumption. After inspecting the<br />

eigenmodes corresponding to the frequencies which caused the spiralling, it turned<br />

out that these modes commonly show a clear tilt at the drill head instead of a parallel<br />

displacement. This tilt on the one hand allows ordering the eigenfrequencies<br />

by their relevance with respect to spiralling. On the other hand, we are now able to<br />

give a geometrical explanation of the development of spiralling as a regenerative<br />

effect.<br />

We use this explanation to extend our statistical-physical model by a process<br />

model of the chips cut during the process. This model is the basis of a system for<br />

the simulation of spiralling. Since the model contains the machine parameters, it can<br />

be used to evaluate the probability and extent of spiralling in different settings.<br />

In this way different settings can be classified into stable and unstable processes,<br />

and strategies for the avoidance of spiralling can be derived. Since the<br />

statistical-physical model includes a statistical estimation procedure for the unknown<br />

parameters, these strategies can finally be tested in real processes.<br />

− 117 −


Statistical processes under change - Enhancing<br />

data quality with pretests<br />

Walter Radermacher<br />

President of the Federal Statistical Office, Germany<br />

walter.radermacher@destatis.de<br />

Summary. The production of high-quality statistics is the main task of the Federal<br />

Statistical Office (FSO). Technological progress, globalisation, and the increasing<br />

significance and diversification of information and its distribution are only some<br />

general terms for the changes we are faced with today. Needless to say, those changes<br />

strongly affect the statistical work of the FSO and pose challenges that can only be<br />

met with innovative and appropriate methods, to name only a few: cooperation and<br />

networks, multiple sources, mixed-mode designs, standardisation of processes,<br />

metadata for quality control and the use of administrative information. The point is<br />

to maximise data quality and minimise the cost and the burden for the participants<br />

in surveys.<br />

A prominent method for improving data quality in surveys is the use of pretests<br />

within - or ideally before - the actual data production process. Pretests fulfill a<br />

number of functions: they minimise non-sampling errors, they reduce the burden<br />

that comprehensive questionnaires place on respondents, and they test the<br />

feasibility of a concept in practice. Combining quantitative and qualitative methods<br />

for pretesting leads to significant increases in data quality. For instance, cognitive<br />

interviewing, an accepted method mainly used in social science research, when<br />

applied to test household surveys as well as business surveys, enables the detection<br />

of reporting errors caused by the underlying cognitive processes through which<br />

respondents generate their answers to survey questions. Some examples from<br />

practice will illustrate the benefits of pretests in official statistics.<br />

Changing conditions call for changing procedures and methods. In our work<br />

supplying official statistics we react to the increasing demand for reliable data.<br />

Pretests are an important example of a method which meets this need in two ways:<br />

they improve quality control and they contribute to the user-friendliness of our<br />

surveys.<br />

− 118 −


Automatic Dictionary Expansion Using<br />

Non-parallel Corpora<br />

Reinhard Rapp 1 and Michael Zock 2<br />

1 University of Tarragona reinhard.rapp@urv.cat<br />

2 LIF-CNRS, Marseille michael.zock@lif.univ-mrs.fr<br />

Abstract. Automatically deriving bilingual dictionaries from manually translated<br />

texts is an established technique that works well in practice. However, translated<br />

texts are a scarce resource. Therefore, it is also desirable to be able to generate<br />

dictionaries from pairs of unrelated monolingual corpora. To achieve this, we suggest<br />

an approach that considers the crosslingual correlations between the co-occurrence<br />

patterns of translated words. If, for example, two words X and Y co-occur more often<br />

than expected by chance in the source language, then their translations T(X) and<br />

T(Y) should also co-occur more frequently than expected in the target language. It<br />

is further assumed that a small dictionary is available at the beginning, and that<br />

the aim is to expand this base lexicon.<br />

The approach is as follows: Using a corpus of the target language, first a co-occurrence<br />

matrix is computed with the rows being word types from the corpus and<br />

the columns being target words from the base lexicon. Next a word of the source<br />

language is considered whose translation is to be determined. Using the source-language<br />

corpus, a co-occurrence vector for this word is computed. Then, using the<br />

dictionary, all known words in this vector are translated into the target language,<br />

thereby discarding unknown words. The resulting vector is compared to all vectors<br />

in the co-occurrence matrix of the target language. The vector with the highest<br />

similarity is considered to be the translation of the source-language word.<br />

In our experiments this method gave an accuracy in the order of 50%. To improve<br />

the results, we perform an automatic cross-check which utilizes the dictionaries’<br />

property of transitivity. What we mean by this is that if we have two dictionaries, one<br />

translating from language A to language B, the other from B to C, then we can also<br />

translate from language A to C by using the intermediate language (or interlingua)<br />

B. That is, the property of transitivity, although having some limitations due to<br />

word ambiguities, can be exploited to automatically generate a raw dictionary for<br />

A to C. One might think that this is unnecessary as our corpus-based approach<br />

also allows us to generate this dictionary directly from the respective comparable<br />

corpora. However, having two different ways of generating the same dictionary has<br />

the advantage that we can validate one via the other. Furthermore, by considering<br />

several languages, additional possibilities for mutual cross-validation arise.<br />

Key words: dictionary generation, comparable texts, translation<br />
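The core matching step, mapping a source co-occurrence vector through the seed dictionary and ranking target words by cosine similarity, can be sketched in a few lines of Python. Data structures and names here are toy assumptions for illustration, not the authors' implementation:

```python
import math

def translate(word, src_cooc, tgt_matrix, dictionary):
    """Map a source word's co-occurrence vector into the target space
    via a seed dictionary (unknown context words are discarded), then
    return the target word whose co-occurrence vector is most
    cosine-similar. Vectors are dicts: context word -> weight."""
    mapped = {}
    for ctx, weight in src_cooc[word].items():
        if ctx in dictionary:
            tgt_ctx = dictionary[ctx]
            mapped[tgt_ctx] = mapped.get(tgt_ctx, 0.0) + weight

    def cosine(u, v):
        dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in set(u) | set(v))
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    return max(tgt_matrix, key=lambda w: cosine(mapped, tgt_matrix[w]))
```

The transitivity-based cross-check described above would then compare the output of two such translation chains (A→C directly versus A→B→C) and keep only agreeing entries.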

− 119 −


FIMIX-PLS Segmentation of Data for Path<br />

Models with Multiple Endogenous LVs<br />

Christian M. Ringle<br />

University of Hamburg, Institute of Industrial Management, Von-Melle-Park 5,<br />

20146 Hamburg, Germany, cringle@econ.uni-hamburg.de<br />

Abstract. When applying a causal modeling approach such as partial least squares<br />

(PLS) path modeling in empirical studies, the assumption that the data has been<br />

collected from a single homogeneous population is often unrealistic. Unobserved<br />

heterogeneity in the PLS estimates for the aggregate data level may result in misleading<br />

interpretations. Finite mixture partial least squares (FIMIX-PLS; Hahn et<br />

al., 2002) allows classifying data based on the heterogeneity of the estimates in the<br />

inner path model. Experimental as well as empirical examples (Esposito Vinzi et al.,<br />

2007; Ringle et al., <strong>2008</strong>) illustrate the application of FIMIX-PLS for path models<br />

that only involve a single latent endogenous variable. This research uses a systematic<br />

approach (Ringle, 2007) to apply the FIMIX-PLS methodology and presents<br />

FIMIX-PLS computational experiments for a path model which includes multiple<br />

endogenous latent variables (LVs). The results of this analysis further substantiate<br />

the reliability of the systematic FIMIX-PLS application in more realistic situations<br />

and provide researchers and practitioners with the certainty they require to effectively<br />

evaluate their PLS path modeling results. If the procedure uncovers significant<br />

heterogeneity, the analysis results in further differentiated path modeling outcomes<br />

and, thus, allows forming more precise conclusions.<br />

Key words: PLS Path Modeling, Heterogeneity, Finite Mixture, Segmentation<br />

References<br />

Esposito Vinzi, V., Ringle, C.M., Squillacciotti, S. and Trinchera, L. (2007): Capturing<br />

and Treating Unobserved Heterogeneity by Response Based Segmentation<br />

in PLS Path Modeling: A Comparison of Alternative Methods by Computational<br />

Experiments. ESSEC Research Center, Working Paper No. 07019. ES-<br />

SEC Business School Paris-Singapore.<br />

Hahn, C., Johnson, M.D., Herrmann, A. and Huber, F. (2002): Capturing Customer<br />

Heterogeneity using a Finite Mixture PLS Approach. Schmalenbach Business<br />

Review, 54, 243–269.<br />

Ringle, C.M. (2007): Segmentation for path models and unobserved heterogeneity:<br />

The finite mixture partial least squares approach, Research Papers on Marketing<br />

and Retailing No. 035. University of Hamburg.<br />

− 120 −


Extreme unconditional dependence vs.<br />

multivariate GARCH effect in the analysis of<br />

dependence between high losses on Polish and<br />

German stock indexes<br />

Pawel Rokita, Krzysztof Piontek<br />

Department of Financial Investments and Risk Management<br />

Wroclaw University of Economics<br />

ul. Komandorska 118/120, 53-345 Wroclaw, Poland<br />

pawel.rokita@ae.wroc.pl, krzysztof.piontek@ae.wroc.pl<br />

Abstract. Classical portfolio diversification methods do not take account of any<br />

dependence between extreme returns (losses). Many researchers provide, however,<br />

some empirical evidence that extreme losses for various assets co-occur. If the co-occurrence<br />

is frequent enough to be statistically significant, it may seriously influence<br />

portfolio risk. Such effects may result from a few different properties of financial time<br />

series, like for instance: (1) extreme dependence in a (long-term) unconditional<br />

distribution, (2) extreme dependence in subsequent conditional distributions, (3)<br />

time-varying conditional covariance, (4) time-varying (long-term) unconditional covariance,<br />

(5) market contagion. Moreover, a mix of these properties may be present<br />

in return time series. Modeling each of them requires different approaches. It seems<br />

reasonable to investigate whether distinguishing between the properties is highly<br />

significant for portfolio risk measurement. If it is, identifying the effect responsible<br />

for high loss co-occurrence would be of great importance. If it is not, the best solution<br />

would be selecting the easiest-to-apply model. This article concentrates on two<br />

of the aforementioned properties: extreme dependence (in a long-term unconditional<br />

distribution) and time-varying conditional covariance.<br />

Key words: extreme dependence, TDC, multivariate GARCH<br />
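One ingredient of such comparisons is a simple nonparametric estimate of upper tail dependence, which conditions on joint exceedances of a high quantile. A Python sketch for illustration (not one of the models compared in the article):

```python
def empirical_utdc(x, y, q=0.9):
    """Nonparametric estimate of the upper tail dependence coefficient
    at level q: the conditional frequency with which y exceeds its own
    q-quantile given that x exceeds its q-quantile, computed from
    ranks (empirical copula margins)."""
    n = len(x)
    rank_x = sorted(range(n), key=lambda i: x[i])
    rank_y = sorted(range(n), key=lambda i: y[i])
    ux = [0.0] * n
    uy = [0.0] * n
    for r, i in enumerate(rank_x):
        ux[i] = (r + 1) / n
    for r, i in enumerate(rank_y):
        uy[i] = (r + 1) / n
    joint = sum(1 for i in range(n) if ux[i] > q and uy[i] > q)
    marg = sum(1 for i in range(n) if ux[i] > q)
    return joint / marg if marg else 0.0
```

Letting q approach 1 gives the tail dependence coefficient (TDC) of the keywords; a comonotone pair yields 1, an antithetic pair 0.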

References<br />

Coles S., Heffernan J., Tawn J. (1999): Dependence Measures for Extreme Value<br />

Analyses. Extremes, 2:4, 339–365.<br />

Gouriéroux C. (1997): ARCH Models and Financial Applications. Springer.<br />

Rokita P. (<strong>2008</strong>): Comparing extreme dependence and varying conditional covariance<br />

concept for portfolio risk modeling (in Polish). To be published in: Taksonomia,<br />

15.<br />

− 121 −


Foundations of a Generative Corpus Linguistics<br />

Jürgen Rolshoven<br />

Linguistic Data Processing, Department of Linguistics, University of Cologne<br />

rols@spinfo.uni-koeln.de<br />

Abstract. The processing of long texts has been strongly stimulated by<br />

bioinformatics, as shown, among others, by Gusfield (1997), Böckenhauer and<br />

Bongartz (2003), and Haubold and Wiehe (2006). As a data structure, suffix trees<br />

play a central role. For the linguist, however, suffix trees matter not for the search<br />

for substrings. Rather, they make it possible to acquire linguistic knowledge through<br />

discovery procedures. Suffixes are potential morphemes in text corpora. Discovery<br />

procedures select from the set of potential morphemes those that carry function or<br />

meaning. To this end, structuralist procedures such as substitution, deletion and<br />

permutation are employed. Formally, the following procedure results: suffix trees are<br />

equivalent to finite automata and correspond to type-3 languages in the Chomsky<br />

hierarchy. From this, simple production rules are obtained, which, with the help of<br />

further information from the suffix trees, are transformed into type-2 rules. These in<br />

turn are transformed into type-1 rules. With this approach, language is parsed not<br />

sentence by sentence but, as it were, holistically with respect to the whole text. The<br />

procedures sketched here lead to overgeneration and are generative in this sense.<br />

Their descriptive potential reaches beyond the texts from which the rules were<br />

derived. This is what the term generative corpus linguistics is meant to express. By<br />

employing bioinformatic methods, a generative corpus linguistics combines the<br />

strengths of the data-driven corpus-linguistic approach with the hypothesis-driven<br />

approach of generative grammars.<br />
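A crude stand-in for the suffix-based discovery step, counting word-final substrings and the distinct stems they attach to, can be written without a suffix tree at all. This toy Python sketch only illustrates the idea of selecting morpheme candidates; real suffix-tree methods scale to whole corpora:

```python
from collections import Counter

def candidate_suffixes(words, max_len=4, min_count=3):
    """Count word-final substrings of length < max_len; a suffix is a
    morpheme candidate if it is frequent and attaches to many distinct
    stems (a rough proxy for being function- or meaning-bearing)."""
    counts = Counter()
    stems = {}
    for w in words:
        for k in range(1, min(max_len, len(w))):
            suf = w[-k:]
            counts[suf] += 1
            stems.setdefault(suf, set()).add(w[:-k])
    return sorted(
        (s for s, c in counts.items()
         if c >= min_count and len(stems[s]) >= min_count),
        key=lambda s: -counts[s])
```

The substitution test of the abstract corresponds here to requiring many distinct stems per suffix; deletion and permutation tests would prune this candidate set further.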

References<br />

Böckenhauer, H.-J. and Bongartz, D. (2003): Algorithmische Grundlagen der Bioinformatik.<br />

Teubner Verlag, Wiesbaden.<br />

Gusfield, D. (1997): Algorithms on Strings, Trees and Sequences: Computer Science<br />

and Computational Biology. Cambridge University Press, Cambridge.<br />

Haubold, B. and Wiehe, T. (2006): Introduction to Computational Biology: An Evolutionary<br />

Approach. Birkhäuser Verlag, Basel; Boston.<br />

− 122 −


Cluster ensemble based on co-occurrence data<br />

Dorota Rozmus<br />

Department of Statistics,<br />

Katowice University of Economics, Bogucicka 14, 40-226 Katowice<br />

drozmus@ae.katowice.pl<br />

Abstract. The ensemble approach has been successfully applied in the context of<br />

supervised learning to increase the accuracy and stability of classification. Recently,<br />

analogous techniques for cluster analysis have been suggested. Research has shown<br />

that, by combining a collection of different clusterings, an improved solution can be<br />

obtained.<br />

In the traditional way of learning from a data set, classifiers are built in a feature<br />

space. Alternatively, decision rules can be constructed on similarity or dissimilarity<br />

representations instead. In such a recognition process an object is described by a<br />

distance matrix giving its similarity to the rest of the training samples.<br />

This research has focused on exploiting the additional information provided by<br />

a collection of diverse clusterings to generate a co-association (similarity) matrix.<br />

Taking the co-occurrences of pairs of patterns in the same cluster as votes for their<br />

association, the data partitions are mapped into a co-association matrix of patterns.<br />

This n × n matrix represents a new similarity measure between patterns. The final<br />

data partition is obtained by clustering this matrix.<br />

In the experiments, the behavior of partitions built on co-occurrence data is<br />

studied.<br />

Key words: Cluster analysis, Cluster ensemble, Co-association matrix, (Dis)similarity<br />

representation.<br />
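The evidence-accumulation scheme described above can be sketched in plain Python; the base partitions would normally come from repeated k-means runs (Jain and Fred, 2002), and the 0.5 vote threshold below is an illustrative choice:

```python
from itertools import combinations

def co_association(partitions):
    """Map an ensemble of partitions (lists of cluster labels) into an
    n x n co-association matrix: entry (i, j) is the fraction of
    partitions that place objects i and j in the same cluster."""
    n = len(partitions[0])
    C = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    C[i][j] += 1.0 / len(partitions)
    return C

def final_partition(C, threshold=0.5):
    """Consensus clustering by single linkage over the co-association
    matrix: objects co-clustered in more than `threshold` of the base
    partitions end up in the same final group (union-find)."""
    n = len(C)
    group = list(range(n))
    def find(x):
        while group[x] != x:
            x = group[x]
        return x
    for i, j in combinations(range(n), 2):
        if C[i][j] > threshold:
            group[find(j)] = find(i)
    return [find(x) for x in range(n)]

# three base clusterings of five objects; object 2 is ambiguous
parts = [[0, 0, 0, 1, 1], [0, 0, 1, 1, 1], [0, 0, 0, 1, 1]]
labels = final_partition(co_association(parts))
```

Object 2 joins the first group because two of the three base partitions vote for that association.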

References<br />

Jain, A.K. and Fred, A. (2002): Evidence accumulation clustering based on the<br />

k-means algorithm. Structural, Syntactic, and Statistical Pattern Recognition,<br />

2396, 442–451.<br />

Strehl, A. and Ghosh, J. (2002): Cluster ensembles - a knowledge reuse framework<br />

for combining partitionings. Journal of Machine Learning Research, 3, 583–617.<br />

Pekalska, E. and Duin, R.P.W. (2000): Classifiers for dissimilarity-based pattern<br />

recognition. In: A. Sanfeliu, J.J. Villanueva, M. Vanrell, R. Alquezar, A.K.<br />

Jain and J. Kittler (Eds.): Proc. 15th Int. Conf. on Pattern Recognition, IEEE<br />

Computer Society Press, Los Alamitos, 12–16.<br />

− 123 −


Dyadic Interactions in Service Encounter -<br />

Bayesian SEM Approach<br />

Adam Sagan 1 and Magdalena Kowalska-Musia̷l 2<br />

1 Chair of Market Analysis and Marketing Research Cracow University of<br />

Economics, Rakowicka 27, 31-510 Cracow, Poland sagana@ae.krakow.pl<br />

2 The School of Banking and Management, Armii Krajowej 4, 30-115 Cracow,<br />

Poland m.kowalska@wszib.edu.pl<br />

Abstract. Dyadic multirelational and sequential interactions are important aspects<br />

of service encounters. They can be observed in B2B distribution channels, professional<br />

services, buying centers, family decision making and WOM communications. The networks<br />

consist of dyadic bonds that form dense but weak ties among actors.<br />

The aim of this paper is to identify latent properties of dyadic interactions in the<br />

mobile phone service market. Latent variable models in relational marketing often<br />

concentrate either on the effects of relations, or treat the relationship dimensions<br />

as psychological constructs on the individual-trait level.<br />

We propose an approach based on Bayesian latent variable modeling of social networks<br />

with dyads as units of analysis. This approach makes it possible to model emergent and<br />

relational properties of actors' interactions in dyads that are irreducible to individual<br />

latent traits or psychological constructs.<br />

Several competing models are developed and compared using Bayesian structural<br />

equation models of dyadic data. Bayesian SEM helps to overcome the limitations<br />

of more traditional solutions based on ML or WLS estimation. It is robust to the<br />

small samples that are common in social network analysis, and it can also be applied to<br />

non-normal data as well as non-linear relations between latent variables.<br />

Key words: Relationship marketing, Dyadic data, Bayesian SEM<br />

References<br />

Anderson, J. C., Håkanson, H. and Johanson, J. (1994): Dyadic Business Relationships<br />

within a Business Network Context, Journal of Marketing, October,<br />

1–15.<br />

Iacobucci, D. and Hopkins, N. (1992): Modeling Dyadic Interactions and Networks<br />

in Marketing, Journal of Marketing Research, February, 5–17.<br />

Kenny, D.A. and Kashy, D.A and Cook, W.L. (2006): Dyadic Data Analysis. Guilford<br />

Press, New York.<br />

Lee, S. Y. (2007): Structural Equation Modeling: A Bayesian Approach. John Wiley<br />

and Sons, Chichester.<br />

− 124 −



− 125 −


Nonnegative Matrix Factorization for Binary<br />

Data to Extract Elementary Failure Maps from<br />

Wafer Test Images<br />

Reinhard Schachtner 1,2 , Gerhard Pöppel 1 and Elmar Lang 2<br />

1 Infineon Technologies AG, 93049 Regensburg, Germany<br />

reinhard.schachtner@infineon.com<br />

2 Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany<br />

Abstract. We introduce a probabilistic variant of non-negative matrix factorization<br />

(NMF) for binary data sets, considering binary coded images as a probabilistic<br />

superposition of underlying continuous-valued basic patterns. An extension<br />

of the well-known NMF procedure to binary-valued data sets is provided to solve the<br />

related optimization problem with non-negativity constraints. We demonstrate the<br />

performance of our method by applying it to the detection and characterization of<br />

hidden causes of failures during wafer processing. To this end, we decompose binary<br />

coded (pass/fail) wafer test data into underlying elementary failure patterns and<br />

study their influence on the quality of single wafers.<br />

Key words: Nonnegative matrix factorization, binary data, failure patterns<br />
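For orientation, the classical real-valued NMF that the abstract extends can be sketched with the multiplicative updates of Lee and Seung (1999); this is not the binary variant proposed here, and the matrix sizes and seed below are illustrative:

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def nmf(V, k, steps=200, eps=1e-9):
    """Multiplicative updates minimising ||V - WH||_F (Lee and Seung, 1999)."""
    random.seed(0)
    m, n = len(V), len(V[0])
    W = [[random.random() for _ in range(k)] for _ in range(m)]
    H = [[random.random() for _ in range(n)] for _ in range(k)]
    for _ in range(steps):
        WT = [list(r) for r in zip(*W)]
        num, den = matmul(WT, V), matmul(matmul(WT, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(k)]
        HT = [list(r) for r in zip(*H)]
        num, den = matmul(V, HT), matmul(W, matmul(H, HT))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(m)]
    return W, H

# a tiny 3 x 4 "wafer map" built from two elementary patterns
V = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1]]
W, H = nmf(V, k=2)
R = matmul(W, H)
err = sum((V[i][j] - R[i][j]) ** 2 for i in range(3) for j in range(4))
```

The binary extension in the talk replaces the squared-error objective with a likelihood for 0/1 observations while keeping the non-negativity constraints.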

References<br />

Lee, D. and Seung, H. (1999): Learning the parts of objects by non-negative matrix<br />

factorization. Nature, 401, 788–791.<br />

− 126 −


Quality–Based Clustering of Functional Data:<br />

Applications to Time Course Microarray Data<br />

Theresa Scharl 1 and Friedrich Leisch 2<br />

1 Institut für Statistik und Wahrscheinlichkeitstheorie, Technische Universität<br />

Wien, Wiedner Hauptstr. 8-10, A-1040 Wien, Austria; Scharl@ci.tuwien.ac.at<br />

2 Institut für Statistik, Ludwig-Maximilians-Universität München, Ludwigstraße<br />

33, D-80539 München, Germany; Friedrich.Leisch@stat.uni-muenchen.de<br />

Abstract. Cluster methods are typically applied to time course gene expression<br />

data to find co–regulated genes which can finally help to reveal pathways and interactions<br />

between genes. Clustering is either carried out on the raw data or on<br />

functional data. In functional data analysis (e.g. Serban and Wasserman, 2005;<br />

Tarpey, 2007) a curve is fit to each observation in order to account for time dependency.<br />

Gene expression over time is biologically a continuous process and can<br />

therefore be represented by a continuous function. The different curve shapes found<br />

in a dataset can have important interpretations and characteristic patterns can be<br />

found by clustering the estimated regression coefficients.<br />

In this study the raw data is clustered using the well–known K–Means algorithm<br />

as well as the quality–based cluster algorithm Stochastic QT–Clust (Scharl and<br />

Leisch, 2006). Further, the parameters obtained by representing each gene expression<br />

profile by a curve are clustered. Additionally, mixtures of spline regression models<br />

and mixed–effects models are applied to the data. All cluster algorithms used are<br />

implemented in R. The different cluster methods are compared in a simulation study<br />

on various datasets.<br />

Key words: Cluster analysis, functional data, time course gene expression data, R<br />
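The idea of clustering fitted coefficients rather than raw series can be illustrated with a straight-line fit per expression profile; the study itself uses spline and mixed-effects models and K-means/Stochastic QT-Clust in R, so this is only a toy analogue with invented data:

```python
def linear_fit(y):
    """Least-squares slope and intercept of y over time points 0..T-1."""
    t = list(range(len(y)))
    n = len(y)
    tbar, ybar = sum(t) / n, sum(y) / n
    slope = sum((ti - tbar) * (yi - ybar) for ti, yi in zip(t, y)) / \
            sum((ti - tbar) ** 2 for ti in t)
    return slope, ybar - slope * tbar

# toy time-course profiles: two rising genes, two falling genes
profiles = [[0, 1, 2, 3], [0.1, 1.1, 1.9, 3.2],
            [3, 2, 1, 0], [2.9, 2.1, 0.8, 0.1]]
coefs = [linear_fit(p) for p in profiles]
# cluster in coefficient space; here trivially by the sign of the slope
clusters = [0 if slope > 0 else 1 for slope, _ in coefs]
```

Working in coefficient space makes profiles with the same shape but different noise realisations fall into the same cluster.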

References<br />

SCHARL, T. and LEISCH, F. (2006): The stochastic qt–clust algorithm: evaluation<br />

of stability and variance on time–course microarray data. In Rizzi, A. and Vichi<br />

M., editors, Compstat 2006—Proceedings in Computational Statistics, 1015–<br />

1022, Physica Verlag, Heidelberg, Germany.<br />

SERBAN, N. and WASSERMAN, L. (2005): CATS: Clustering after transformation<br />

and smoothing. Journal of the American Statistical Association, 100(471), 990–<br />

999.<br />

TARPEY, T. (2007): Linear transformations and the k–Means clustering algorithm:<br />

Applications to clustering curves. The American Statistician, 61(1), 34–40.<br />

− 127 −


Multilingual knowledge based concept<br />

recognition in textual data<br />

Martin Schierle 1 and Daniel Trabold 2<br />

1 Daimler AG martin.schierle@daimler.com<br />

2 Daimler AG daniel.trabold@daimler.com<br />

Abstract. Given the increasing volume of textual data available through digital<br />

resources today, the identification of the main concepts in those texts<br />

becomes more and more important and can be seen as a vital step in the analysis<br />

of unstructured information.<br />

Research in this area has focused on the detection of named entities like person<br />

names or organization names, which cover only a very small part of the concepts in texts.<br />

In particular, a unique mapping between concepts in different languages requires<br />

parallel corpora, which are rarely available in industrial settings.<br />

We therefore propose a powerful new knowledge based model to recognize various<br />

kinds of concepts even in very short and specialized texts using linguistic information<br />

for synonym handling and word sense disambiguation.<br />

We evaluate the proposed model on texts from the automotive domain.<br />
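A toy stand-in for the lexicon lookup at the core of such a knowledge-based recognizer (the real model adds linguistic preprocessing, cross-language mapping and word sense disambiguation; the lexicon entries below are invented):

```python
# hypothetical miniature concept lexicon: surface forms -> canonical concept
LEXICON = {
    "engine": "ENGINE", "motor": "ENGINE",
    "brake": "BRAKE", "brakes": "BRAKE",
    "noise": "NOISE", "sound": "NOISE",
}

def recognize_concepts(text):
    """Map tokens to canonical concepts via a synonym lexicon.

    A toy stand-in for the knowledge-based model in the abstract; it
    only normalises synonyms and does no disambiguation."""
    tokens = text.lower().split()
    return [LEXICON[t] for t in tokens if t in LEXICON]

found = recognize_concepts("Strange noise from the motor when the brakes are applied")
# yields the canonical concepts NOISE, ENGINE, BRAKE in text order
```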

− 128 −


Localized Logistic Regression for Discrete<br />

Influential Factors<br />

Julia Schiffner, Gero Szepannek, Thierry Monthé, and Claus Weihs<br />

Faculty of Statistics, Dortmund University of Technology, 44221 Dortmund,<br />

Germany,<br />

schiffner@statistik.uni-dortmund.de,<br />

weihs@statistik.uni-dortmund.de<br />

Abstract. The two-class localized logistic regression of Tutz and Binder (2005)<br />

is generalized to discrete explanatory variables, and applied to data from a breast<br />

cancer study. In order to obtain a distance measure between observations of the<br />

discrete factors a combination of the simple and the flexible matching coefficient<br />

(Ickstadt et al., 2006) is taken. Applying the method of Tutz and Binder (2005)<br />

with this distance measure to the example data, localized models lead to smaller<br />

misclassification rates than the corresponding global ones. Moreover, the best classification<br />

rule found gives one of the smallest misclassification rates ever obtained<br />

for the example data. The results of Monthé (2008) are extended by an automatic<br />

variable selection.<br />

Key words: Localized logistic regression, Matching coefficients, SNP data<br />
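The simple matching coefficient entering the distance measure can be sketched as follows; the flexible matching coefficient of Ickstadt et al. (2006) and the combination used in the paper are omitted, and the genotype vectors are invented:

```python
def simple_matching_distance(x, y):
    """Distance between two observations of discrete factors:
    one minus the proportion of positions with identical categories
    (the simple matching coefficient)."""
    if len(x) != len(y):
        raise ValueError("observations must have the same factors")
    matches = sum(a == b for a, b in zip(x, y))
    return 1.0 - matches / len(x)

# two hypothetical SNP genotype vectors over four loci
d = simple_matching_distance(["AA", "AG", "GG", "AA"],
                             ["AA", "GG", "GG", "AG"])
# 2 of 4 loci agree, so the distance is 0.5
```

In localized logistic regression such distances weight the training observations, so that the local model is fitted mainly to observations resembling the target case.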

References<br />

Ickstadt, K., Mueller, T., and Schwender, H. (2006): Analyzing SNPs: Are There<br />

Needles in the Haystack? CHANCE, 19(3), 22–27.<br />

Monthé, Th. (2008): Lokalisierte Logistische Regression bei diskreten Variablen. Master<br />

thesis, Faculty of Statistics, Dortmund University of Technology.<br />

Tutz, G. and Binder, H. (2005): Localized Classification. Statistics and Computing,<br />

15, 155–166.<br />

− 129 −


Localized Classification Using Mixture Models<br />

Julia Schiffner and Claus Weihs<br />

Faculty of Statistics, Dortmund University of Technology, 44221 Dortmund,<br />

Germany,<br />

schiffner@statistik.uni-dortmund.de<br />

Abstract. In the literature a variety of classification methods can be found that can<br />

be called ‘local’ because they concentrate – in different senses – on one or multiple<br />

small regions of the data space. One type of local methods that may be beneficial<br />

in case of heterogeneous classes is based on mixture models. It is assumed that<br />

data are generated by a finite number of sources and that each source can produce<br />

data of one or multiple classes. Models valid for single sources can be referred to<br />

as ‘local models’ that can be aggregated to a global mixture model. Mixture based<br />

classification methods have been described by several authors (see references), but<br />

the relationships and differences between the underlying models are not clear. A<br />

consistent description of these models and the resulting Bayes classification rules is<br />

presented. Moreover, it is shown how Bayes rules can be derived if in distinct local<br />

models different variable subsets separate the classes. Finally, several methods for<br />

class posterior estimation are described and an application to sound data is shown,<br />

where the register of different instruments is predicted by timbre.<br />

Key words: Local classification methods, Mixture models, Bayes rules<br />
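A minimal sketch of a mixture-based Bayes rule in one dimension, assuming known Gaussian local models; estimation of the components and the variable-subset refinements discussed above are omitted, and the component values are invented:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def class_posteriors(x, classes, priors):
    """Bayes rule where each class density is a mixture of 'local'
    Gaussian components given as (weight, mean, sd) triples."""
    dens = [sum(w * normal_pdf(x, m, s) for w, m, s in comps)
            for comps in classes]
    joint = [p * d for p, d in zip(priors, dens)]
    total = sum(joint)
    return [j / total for j in joint]

# class 0 is heterogeneous: two well-separated local sources
classes = [
    [(0.5, -3.0, 1.0), (0.5, 3.0, 1.0)],   # class 0
    [(1.0, 0.0, 1.0)],                      # class 1
]
post = class_posteriors(0.0, classes, priors=[0.5, 0.5])
# near x = 0 the single-source class 1 dominates the posterior
```

Aggregating the local components into one mixture per class is what turns the collection of local models into a single global Bayes classifier.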

References<br />

Hastie, T. J. and Tibshirani, R. J. (1996): Discriminant Analysis by Gaussian Mixtures.<br />

Journal of the Royal Statistical Society B, 58(1), 155–176.<br />

Szepannek, G. and Weihs, C. (2006): Local Modelling in Classification on Different<br />

Feature Subspaces. In: P. Perner (Ed.): Advances in Data Mining. Springer,<br />

Berlin, 226-238.<br />

Titsias, M. K. and Likas, A. C. (2001): Shared Kernel Models for Class Conditional<br />

Density Estimation. IEEE Transactions on Neural Networks, 12(5), 987–997.<br />

Titsias, M. K. and Likas, A. C. (2002): Mixture of Experts Classification Using a<br />

Hierarchical Mixture Model. Neural Computation, 14, 2221–2244.<br />

Weihs, C., Szepannek, G., Ligges, U., Luebke, K., and Raabe, N. (2006): Local Models<br />

in Register Classification by Timbre. In: V. Batagelj, H.-H. Bock, A. Ferligoj,<br />

and A. Ziberna (Eds.): Data Science and Classification. Springer, Berlin, 315-<br />

322.<br />

− 130 −


Comparison of four estimators of the<br />

heterogeneity variance for meta-analysis<br />

Peter Schlattmann<br />

Dept. of Biostatistics and Clinical Epidemiology<br />

Charité Universitätsmedizin Charitéplatz 1, 10117 Berlin<br />

peter.schlattmann@charite.de<br />

Summary. The analysis of heterogeneity is a crucial part of each meta-analysis.<br />

In order to analyze heterogeneity often a random effects model which incorporates<br />

variation between studies is considered. It is assumed that each study has its own<br />

(true) exposure or therapy effect and that there is a random distribution of these true<br />

exposure effects around a central effect. The variability between studies is quantified<br />

by the heterogeneity variance.<br />

In order to compare the performance of four estimators of the heterogeneity variance<br />

a simulation study was performed. This study compared the DerSimonian-Laird<br />

(1986) estimator with the maximum-likelihood estimator based on a normal distribution<br />

for the random effects. A further comparator was the simple heterogeneity<br />

(SH) variance estimator proposed by Sidik and Jonkman (2005).<br />

All of the aforementioned methods assume a normal distribution for the random<br />

effects. This assumption may or may not hold. Thus an alternative estimator of<br />

the heterogeneity variance is based on a finite mixture model (Böhning, Dietz,and<br />

Schlattmann, 1998).<br />

This simulation study investigates these four estimators, when sampling from<br />

discrete distributions, i.e. the major assumption of a normal distribution for the<br />

random effects is not fulfilled. In this setting the simulation study investigates bias,<br />

standard deviation and mean square error (MSE) of all four estimators.<br />

Key words: Meta-Analysis, Heterogeneity, Simulation, Finite mixture model<br />
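Of the four estimators, the DerSimonian-Laird moment estimator has a simple closed form that can be sketched directly (the study effects and variances below are invented, not the simulation settings of the talk):

```python
def dersimonian_laird(effects, variances):
    """DerSimonian-Laird (1986) moment estimator of the heterogeneity
    variance tau^2 from study effects and within-study variances."""
    w = [1.0 / v for v in variances]          # inverse-variance weights
    ybar = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    Q = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, effects))
    k = len(effects)
    denom = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    return max(0.0, (Q - (k - 1)) / denom)    # truncate at zero

# five hypothetical study effects with equal within-study variance 0.04
tau2 = dersimonian_laird([0.1, 0.3, -0.2, 0.5, 0.0], [0.04] * 5)
# here Q = 7.3 with k - 1 = 4, giving tau^2 = 0.033
```

When the studies are homogeneous, Q falls below k - 1 and the truncation returns zero, which is one source of the bias the simulation study examines.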

References<br />

Böhning, D., Dietz, E. and Schlattmann, P. (1998): Recent developments in computer<br />

assisted mixture analysis. Biometrics, 54, 283-303.<br />

DerSimonian, R. and Laird, N. (1986): Meta-analysis in clinical trials. Controlled<br />

Clinical Trials, 7, 177-188.<br />

Sidik, K. and Jonkman, J. (2005): Simple heterogeneity variance for meta-analysis.<br />

Journal of the Royal Statistical Society, Series C, 54, 367-384.<br />

− 131 −


Machine learning applications of positive<br />

definite kernels<br />

Prof. Dr. Bernhard Schölkopf<br />

MPI for Biological Cybernetics<br />

Spemannstrasse 38<br />

72076 Tübingen<br />

bernhard.schoelkopf@tuebingen.mpg.de<br />

Summary. Support vector machines and other kernel methods have become one of<br />

the most widely used techniques in the field of machine learning. I will present my<br />

thoughts on what made them popular and what may (or may not) keep them going.<br />

I will also discuss applications in different domains, including computer graphics.<br />

− 132 −


Age Distributions for costs in drug prescription by<br />

practitioners and for DRG-based hospital treatment<br />

Reinhard Schuster, Eva v. Arnstedt<br />

Medical Review Board of the Statutory Health Insurance in North Germany,<br />

23554 Lübeck, Germany, Reinhard.Schuster@mdk-nord.de<br />

Abstract. Purpose: We analyse the age-dependent fraction of patients with costs above a threshold<br />

value, as a function of that value, both in drug prescription outside hospitals and<br />

in DRG-based hospital treatment. We compare the results of different German regions and<br />

different statutory insurances. The age-dependency of costs is highly important with<br />

respect to demographic changes. Design/Methodology/Approach/Algorithm: We use drug<br />

prescription data of practitioners and data of DRG-based hospital treatment from several statutory<br />

insurances and several regions. We use a nonparametric functional equation with a geometric<br />

background which generates a one-parameter family of log-concave distributions including<br />

the normal distribution. The functional equation is also related to Verhulst<br />

growth. Results: The data can be fitted by log-concave distributions and we obtain numerically<br />

stable computations. The respective logarithms are concave with respect to both variables, age<br />

and costs. We find that, independent of the absolute threshold value, there is always a decrease<br />

in the fraction of high-cost patients above a certain age. Thus we do not find a monotone increase<br />

of costs with age. Research Limitations/Implications: The statistically reported data<br />

basis for age-dependent costs is generally poor with respect to specific details, especially<br />

if a (pseudonymized) patient identifier is necessary. Practical Implications: Demographic<br />

changes are important for a large range of induced implications. It is often assumed that<br />

costs increase strictly with age. If this turns out not to be true in general, costs<br />

depend much more sensitively on the exact (demographically changing) age distribution<br />

of the population, which should be analysed in that direction. Originality/Value: The age-dependent<br />

resolution of officially published statistical reports is generally poor. We state a stable<br />

non-parametric model with high resolution.<br />

Key words: Drug Application, Age Distribution, DRG-System, Statutory Health Insurance<br />

References<br />

Schuster, R. (2003): Komponentenzerlegungen, Strukturen und Invarianten zu GKV-<br />

Arzneimittelverordnungsdaten. Journal of Public Health, 4, 293-305.<br />

− 133 −


The Late Neolithic flint axe production on the<br />

Lousberg (Aachen, Germany) — An<br />

extrapolation of supply and demand and<br />

population density<br />

Daniel Schyle<br />

Institut für Ur- und Frühgeschichte, Universität zu Köln<br />

daniel.schyle@uni-koeln.de<br />

Abstract. The tabular flint seams within the cretaceous limestone slab once covering<br />

the Lousberg in Aachen (Germany) were completely exploited by systematic<br />

opencast mining during the period between approximately 3800 and 3000<br />

cal BC. The Lousberg-flint, easily identifiable by its tabular shape and its characteristic<br />

colours, was processed on-site almost exclusively for the production of<br />

axe-roughouts, which were distributed over distances up to 280 km mainly to Westphalia,<br />

but also to Hessen, Rheinland-Pfalz and into Belgium and the Netherlands.<br />

An excavation at the Lousberg was carried out under the direction of J. Weiner<br />

between 1978 and 1981. This contribution presents an extrapolation of the total<br />

amount of axe-roughouts produced at the site, based on the results of refittings and<br />

the counts of random samples of the knapping waste excavated from the mining<br />

dumps. The corresponding demand for axes per household and generation is estimated<br />

from axe distributions and frequencies in several well dated and preserved<br />

lakeshore dwellings of Southern Germany and Switzerland. To estimate the population<br />

density within the distribution area of Lousberg-axes, which is almost devoid of<br />

Late Neolithic settlement traces other than only roughly dated surface assemblages,<br />

the approximate size of the core-distribution area is determined by the site density<br />

mapping method (”Isolinien-Fundstellendichtekartierung”) recently developed by A.<br />

Zimmermann and collaborators of the Institut für Ur- und Frühgeschichte at the University<br />

of Cologne. The contribution will focus on the problems in comparing the<br />

results based on the distribution of Lousberg-axes to the results recently obtained<br />

on settlement distributions of the Linearbandkeramik (LBK) in the Rhineland. The<br />

research is part of a project aimed at the final publication of the Lousberg finds,<br />

which was funded by the Deutsche Forschungsgemeinschaft (DFG).<br />


− 134 −


Time Related Features for Alarm Classification<br />

in Intensive Care Monitoring<br />

Wiebke Sieben<br />

Department of Statistics, Technische Universität Dortmund, 44227 Dortmund,<br />

Germany sieben@statistik.tu-dortmund.de<br />

Abstract. Traditional patient monitoring systems in intensive care are based on<br />

simple threshold alarms. These systems compare the measurement of a vital sign<br />

with a threshold set by the clinical staff and trigger an alarm when the threshold<br />

is crossed. Although some more sophisticated rules are already incorporated<br />

in modern monitoring devices, the false alarm rate has remained very high (Tsien<br />

and Fackler 1997; Chambrin 2001). Machine learning techniques, particularly decision<br />

trees, have proven suitable for alarm classification. As the misclassification rate<br />

of non life-threatening situations is to be minimized under the constraint that the<br />

misclassification rate of life-threatening situations is close to zero, standard techniques<br />

need to be improved. Modified Random Forests (Sieben, Gather 2007) have<br />

been shown to do this successfully. So far only the measurements of the point in<br />

time when an alarm was triggered were used for classification. As physicians always<br />

take the character of changes over time in a patient's health status into account for<br />

a diagnosis there might exist valuable information to be extracted from the time<br />

series. We study the use of time related features in combination with the modified<br />

Random Forest approach in terms of improvements in the classification results.<br />
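The kind of time-related features meant here can be illustrated with a short sketch; the feature set and the heart-rate values below are invented for illustration and are not the study's actual choices:

```python
def time_features(series, window=5):
    """Illustrative time-related features from the last `window`
    measurements before an alarm: level, range and linear trend."""
    w = series[-window:]
    n = len(w)
    mean = sum(w) / n
    t = list(range(n))
    tbar = sum(t) / n
    slope = sum((ti - tbar) * (xi - mean) for ti, xi in zip(t, w)) / \
            sum((ti - tbar) ** 2 for ti in t)
    return {"mean": mean, "range": max(w) - min(w), "slope": slope}

# invented heart-rate series drifting upwards before a threshold alarm
feats = time_features([72, 74, 73, 80, 85, 90, 96, 103])
# a large positive slope signals a sustained trend rather than an artifact
```

Such features would be appended to the point-in-time measurements before training the modified Random Forest.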

References<br />

CHAMBRIN, M.-C. (2001): Alarms in the Intensive Care Unit: How Can the Number<br />

of False Alarms Be Reduced?. Critical Care, 5 (4), 184–188.<br />

SIEBEN, W., GATHER, U. (2007): Classifying Alarms in Intensive Care - Analogy<br />

to Hypothesis Testing. In LNCS Series: Proceedings of the 11th Conference on<br />

Artificial Intelligence in Medicine, Vol.4594/2007, eds. R. Bellazzi, A. Abu-<br />

Hanna, J. Hunter, Springer, Berlin/Heidelberg, 130–138.<br />

TSIEN, C.L., FACKLER, C. (1997): Poor Prognosis for Existing Monitors in the<br />

Intensive Care Unit. Critical Care Medicine, 25 (4), 614–619.<br />

Keywords<br />

CLASSIFICATION, INTENSIVE CARE MONITORING, FALSE ALARMS<br />

− 135 −


’CMA’ - Steps in developing a comprehensive<br />

R-toolbox for classification with microarray<br />

data and other high-dimensional problems<br />

Martin Slawski, Anne-Laure Boulesteix, and Martin Daumer<br />

Sylvia Lawry Centre for MS Research, Hohenlindenerstr. 1, D-81677 München<br />

Martin.Slawski@campus.lmu.de, boulesteix@slcmsr.org, daumer@slcmsr.org<br />

Abstract. Microarray studies have stimulated the development of new approaches<br />

and motivated the adaptation of known traditional methods for class prediction with<br />

high-dimensional data. There already exist numerous software packages implementing<br />

single methods for microarray-based classification and in addition two synthesis<br />

packages: MLInterfaces by V. Carey and R. Gentleman (2007) and MCRestimate<br />

by Ruschhaupt et al. (Stat Appl Genet Mol Biol 2004, 3:37), available from the<br />

www.bioconductor.org platform. Conceptually, the R package CMA is more related<br />

to the second one, focussing on comparative model evaluation according to accepted<br />

’good practice’ standards/guidelines (Dupuy and Simon, J Natl Cancer Inst 2007,<br />

99:147-157 ), an aspect neglected by MLInterfaces, though still widely used. In<br />

a nutshell, CMA provides a uniform interface to a total of more than 20 supervised<br />

classification methods, comprising classical approaches such as discriminant analysis<br />

or penalized multinomial logistic regression, dimension reduction by Partial Least<br />

Squares, and more sophisticated methods, e.g. Support Vector Machines, Neural<br />

Networks or boosting techniques.<br />

The evaluation of the constructed classifiers is based on repeated splittings into<br />

learning and test sets or related approaches (e.g. bootstrap). For each learning set<br />

separately, variable selection can be performed optionally, either by a collection of<br />

simple tests or by advanced techniques such as the lasso, elastic net or componentwise<br />

boosting. In the last step, hyperparameter optimization and model evaluation<br />

are carried out via a ’nested’ cross-validation procedure. The outer loop is used for<br />

classifier evaluation while appropriate values for the hyperparameters are determined<br />

in the inner loop.<br />
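The nested cross-validation scheme just described can be sketched generically; this is a skeleton of the idea, not CMA's actual interface, and the toy threshold classifier and data are invented:

```python
import random

def nested_cv(X, y, params, train_fn, outer=5, inner=3):
    """Nested cross-validation: the inner loop selects a hyperparameter
    for each outer training set; the outer loop estimates accuracy of
    the resulting classifier on held-out data."""
    idx = list(range(len(X)))
    random.seed(1)
    random.shuffle(idx)
    folds = [idx[i::outer] for i in range(outer)]
    accs = []
    for f in range(outer):
        test = folds[f]
        train = [i for g in range(outer) if g != f for i in folds[g]]

        def inner_score(p):
            # inner CV on the outer training set only
            ifolds = [train[i::inner] for i in range(inner)]
            score = 0.0
            for k in range(inner):
                itest = ifolds[k]
                itrain = [i for g in range(inner) if g != k for i in ifolds[g]]
                clf = train_fn([X[i] for i in itrain], [y[i] for i in itrain], p)
                score += sum(clf(X[i]) == y[i] for i in itest) / len(itest)
            return score / inner

        best = max(params, key=inner_score)  # hyperparameter tuning
        clf = train_fn([X[i] for i in train], [y[i] for i in train], best)
        accs.append(sum(clf(X[i]) == y[i] for i in test) / len(test))
    return sum(accs) / outer

# invented toy classifier: predict class 1 if x exceeds threshold p
def train_fn(Xtr, ytr, p):
    return lambda x: int(x > p)

X = list(range(20))
y = [int(x >= 10) for x in X]
acc = nested_cv(X, y, params=[4.5, 9.5, 14.5], train_fn=train_fn)
```

Keeping tuning strictly inside the inner loop is what prevents the optimistic bias that the cited 'good practice' guidelines warn against.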

CMA is implemented entirely in S4 classes (J. Chambers, Programming with data,<br />

1998). Its modular construction makes the incorporation of new methods easy. Furthermore,<br />

it is intended to be user-friendly by providing a multitude of pre-defined<br />

methods for summarizing and visualizing classifier evaluation and comparison.<br />

A preliminary version of CMA is planned to be available in the next Bioconductor<br />

release in April 2008.<br />

Keywords<br />

High-dimensional data, classification, validation, statistical software<br />

− 136 −


Generating Collective Intelligence<br />

Vassilios Solachidis 1 , Phivos Mylonas 2 , Andreas Geyer-Schulz 3 , Bettina<br />

Hoser 3 , Sam Chapman 4 , Fabio Ciravegna 4 , Steffen Staab 5 , Costis<br />

Contopoulos 6 , Ioanna Gkika 6 , Pavel Smrz 7 , Yiannis Kompatsiaris 1 , and<br />

Yannis Avrithis 2<br />

1 Centre of Research and Technology Hellas, Informatics and Telematics Institute,<br />

Km Thermi-Panorama Road, Thermi-Thessaloniki, GR 570 01, Greece {vsol,<br />

ikom}@iti.gr<br />

2 National Technical University of Athens, Image, Video and Multimedia Systems<br />

Laboratory, Iroon Polytechneiou 9, Zographou Campus, Athens, GR 157 80,<br />

Greece {fmylonas, iavr}@image.ntua.gr<br />

3 Department of Economics and Business Engineering, Information Service and<br />

Electronic Markets, Kaiserstraße 12, Karlsruhe 76128, Germany,<br />

{andreas.geyer-schulz, bettina.hoser}@kit.edu<br />

4 University of Sheffield, Department of Computer Science, Regent Court, 211<br />

Portobello Street, S1 4DP, Sheffield, UK, {s.chapman, fabio}@dcs.shef.ac.uk<br />

5 Universität Koblenz-Landau, Information Systems and Semantic Web,<br />

Universitätsstraße 1, 57070 Koblenz, Germany, staab@uni-koblenz.de<br />

6 Vodafone-Panafon (Greece), Technology Strategic Planning - R&D Dept.,<br />

Tzavella 1-3, Halandri, 152 31, Greece {Costis.Kontopoulos,<br />

Ioanna.Gkika}@vodafone.com<br />

7 Brno University of Technology, Faculty of Information Technology, Bozetechova<br />

2, CZ-61266 Brno, Czech Republic, smrz@fit.vutbr.cz<br />

Abstract. In this paper we provide a foundation for a new generation of services<br />

and tools. We define new ways of capturing, sharing and reusing information and<br />

intelligence provided by single users and communities, as well as organizations by<br />

enabling the extraction, generation, interpretation and management of Collective<br />

Intelligence from user generated digital multimedia content. Different layers of intelligence<br />

will be generated, which together constitute the notion of Collective Intelligence.<br />

The latter emerges from the collaboration and competition among many<br />

individuals and forms an intelligence that seemingly has a mind of its own. The<br />

automatic generation of Collective Intelligence constitutes a departure from traditional<br />

methods for information sharing, since information from both the multimedia<br />

content and social aspects will be merged, while at the same time the social dynamics<br />

will be taken into account. In the context of this work, we shall present two case<br />

studies. Initially, an Emergency Response case study will be tackled, where users<br />

provide intelligence about large scale emergencies, empowering a more effective and<br />

informed emergency action and at the same time receive information on how to act.<br />

A Consumers Social Group case study will follow, providing enhanced publishing<br />

tools to support group activities (e.g. organization of team events) and the ability<br />

to extract meta-information from content sources and group discussions. Both Use<br />

Cases denote the important effect of Collective Intelligence as well as its leverage<br />

for private, commercial and public purposes.<br />

− 137 −


Analysis of polyphonic musical time series

Katrin Sommer and Claus Weihs

Lehrstuhl für Computergestützte Statistik, Technische Universität Dortmund, D-44221 Dortmund, sommer@statistik.tu-dortmund.de

Abstract. A general model for pitch tracking of polyphonic musical time series will be introduced. Based on a model of Davy and Godsill (2002), the different pitches of the musical sound are estimated simultaneously with MCMC methods. Additionally, a preprocessing step is designed to improve the estimation of the fundamental frequencies (Sommer and Weihs (2008)). The preprocessing step compares real audio data with an alphabet that is constructed from the McGill Master Samples (Opolko and Wapnick (1987)) and consists of tones of different instruments. The tones with minimal Itakura-Saito distortion (Gray et al. (1980)) are chosen as first estimates and as starting points for the MCMC algorithms. Furthermore, the implementation of the alphabet is an approach to recognizing the instruments generating the musical time series. Results are presented for mixed monophonic data from McGill and for self-recorded polyphonic audio data.

Key words: MCMC, musical time series, polyphony, alphabet

References

Davy, M. and Godsill, S. J. (2002): Bayesian Harmonic Models for Musical Pitch Estimation and Analysis. Technical Report 431, Cambridge University Engineering Department.

Gray, R., Buzo, A., Gray, A. and Matsuyama, Y. (1980): Distortion Measures for Speech Processing. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-28, 367–376.

Opolko, F. and Wapnick, J. (1987): McGill University Master Samples [Compact disc]. Montreal, Quebec: McGill University.

Sommer, K. and Weihs, C. (2006): Using MCMC as a stochastic optimization procedure for music time series. In: V. Batagelj, H.H. Bock, A. Ferligoj and A. Ziberna (Eds.): Data Science and Classification. Springer, Heidelberg, 307–314.

Sommer, K. and Weihs, C. (2008): A comparative study on polyphonic musical time series using MCMC methods. In: C. Preisach, H. Burkhardt, L. Schmidt-Thieme and R. Decker (Eds.): Data Analysis, Machine Learning, and Applications. Springer, Berlin.
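The alphabet-matching step in the abstract above rests on the Itakura-Saito distortion between power spectra. A minimal sketch (illustrative Python; the spectra and instrument names are made up and stand in for the McGill alphabet entries):

```python
# Itakura-Saito distortion between two power spectra; the alphabet tone
# with minimal distortion is chosen as the first pitch estimate.
import math

def itakura_saito(p, p_hat):
    # d_IS = sum( p/p_hat - log(p/p_hat) - 1 ); zero iff the spectra agree
    return sum(a / b - math.log(a / b) - 1 for a, b in zip(p, p_hat))

spectrum = [1.0, 2.0, 0.5, 1.5]                 # observed power spectrum
alphabet = {"flute_a4": [1.0, 2.0, 0.5, 1.5],   # hypothetical alphabet
            "piano_a4": [2.0, 1.0, 1.0, 1.0]}   # entries for illustration

best = min(alphabet, key=lambda t: itakura_saito(spectrum, alphabet[t]))
print(best)
```

Each term x - log(x) - 1 is nonnegative and vanishes only at x = 1, so the distortion is zero exactly when the two spectra coincide, which makes it a natural matching criterion.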

− 138 −


Trust as a Key Determinant of Loyalty and its Moderators

Angela Sommerfeld

Institut für Marketing, Humboldt-Universität zu Berlin, angelaso@umich.edu

Abstract. In theorizing that successful relational exchanges are motivated by trust and commitment, theory implicitly assumes that transactional and weak relational exchanges are not similarly motivated. Accordingly, Garbarino & Johnson (1999) showed that trust is a peripheral evaluation that is not predictive of purchase intentions in weak relationships (individual ticket buyers) but is in strong ones (theatre subscribers). Extending their work, we take a more theory-based approach to develop and test moderating hypotheses of the trust–purchase intention relation beyond their variable type of contractual relationship. Based on a survey of 575 business-to-business customers, we test the following proposed moderators: two facets of perceived risk, namely performance risk and consequentiality, perceived switching cost, and the length of the relationship between the companies. Different methods have been employed to test the moderations (multiple groups, Kenny-Judd models, and Quasi-ML). Especially the Quasi-ML method, which we apply for a simultaneous test of several moderation hypotheses, represents a statistically efficient estimation method for SEMs with multiple latent interaction effects (Klein & Muthén 2007). Depending on the method and its properties, several hypotheses could be confirmed. The paper seeks to make three key contributions. First, it gives a theory-based account of boundary conditions for the relevance of trust in exchange relationships between companies. Since there have been conflicting opinions on the role of risk in exchange between companies, a second contribution of the paper is to clarify this role by thoroughly testing both moderating and mediating hypotheses. Testing interactions in a structural equation framework is not a straightforward task; thus a third contribution is to illustrate the strengths and weaknesses of different methods for a substantive research question with a real-world data set.

Key words: Trust, Risk, Switching Cost, Multiple

References

Klein, A.G. and Muthén, B.O. (2007): Quasi-Maximum Likelihood Estimation of Structural Equation Models With Multiple Interaction and Quadratic Effects. Multivariate Behavioral Research, 42, 647–673.

− 139 −


Generating Fictitious Training Data for Credit Client Classification

Klaus B. Schebesch 1 and Ralf Stecking 2

1 Faculty of Economics, University "Vasile Goldis", Arad, Romania, kbsbase@gmx.de
2 Faculty of Economics, University of Oldenburg, D-26111 Oldenburg, ralf.w.stecking@uni-oldenburg.de

Abstract. In recent work we started investigating the effects of using fictitious training examples in addition to the empirical training examples for a credit scoring problem. Fictitious training points added by a very simple procedure lead to some interesting effects in the context of SVM (support vector machine) classifier modeling. For instance, the resulting out-of-sample performance measures of such preliminary models are not entirely obvious. However, by using SVMs, we can also observe the change in support vector formation subject to fictitious training points. Such information may prove instrumental in producing fictitious training points which are (more) problem dependent. We also explore connections to generative, similarity-based and template-based learning, which, in a related context, have received some attention in the recent classification literature. We then report on the results of using different types of fictitious training examples in SVM credit client classification. Finally, in order to generalize these results, an evaluation of SVMs with different kernel functions using various fictitious training data sets is presented.

Key words: Fictitious training data, Data similarity, Support vector machine, Credit scoring

References

Duin, R.P.W. and Pekalska, E. (2007): The Science of Pattern Recognition. Achievements and Perspectives. In: W. Duch and J. Mandziuk (Eds.): Challenges for Computational Intelligence. Studies in Computational Intelligence, Springer.

Hochreiter, S. and Obermayer, K. (2006): Support vector machines for dyadic data. Neural Computation, 18, 1472–1510.

Laub, J., Roth, V., Buhmann, J.M. and Müller, K. (2006): On the information and representation of non-Euclidean pairwise data. Pattern Recognition, 39, 1815–1826.

Stecking, R. and Schebesch, K.B. (2007): Improving Classifier Performance by Using Fictitious Training Data? A Case Study. Accepted for publication in Operations Research Proceedings 2007.
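One "very simple procedure" of the kind the abstract alludes to can be sketched as follows (an illustrative assumption on our part, not the authors' actual generator): perturb empirical clients with small noise and keep their class labels, then train the SVM on the augmented set.

```python
# Illustrative generator of fictitious training examples: jitter empirical
# feature vectors, keep the original class label, append to the training set.
import random

def make_fictitious(examples, n_new, scale=0.1, seed=0):
    rng = random.Random(seed)
    fictitious = []
    for _ in range(n_new):
        x, label = rng.choice(examples)                  # pick an empirical client
        jittered = [v + rng.gauss(0, scale) for v in x]  # perturb the features
        fictitious.append((jittered, label))             # keep the class label
    return fictitious

empirical = [([0.2, 1.0], 0), ([0.8, 0.3], 1)]  # toy credit clients
augmented = empirical + make_fictitious(empirical, n_new=4)
print(len(augmented))  # the SVM would then be fit on these 6 points
```

Comparing which points become support vectors with and without the fictitious examples is exactly the kind of diagnostic the abstract describes for making the generation more problem dependent.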

− 140 −


Clustering Association Rules with Fuzzy Concepts

Matthias Steinbrecher and Rudolf Kruse

Department of Knowledge Processing and Language Engineering, Otto-von-Guericke University of Magdeburg, Universitätsplatz 2, 39106 Magdeburg, Germany, {msteinbr,kruse}@iws.cs.uni-magdeburg.de

Abstract. Association rules constitute a widely accepted technique for identifying frequent patterns inside huge volumes of data. Practitioners prefer the straightforward interpretability of rules; however, depending on the nature of the underlying data, the number of induced rules can be intractably large. Even reasonably sized result sets may contain a large number of rules that are uninteresting to the user because they are too general, are already known or do not match other user-related intuitive criteria. We allow the user to model his conception of interestingness by means of linguistic expressions on rule evaluation measures and compound propositions of higher order (i.e., temporal or spatial changes of rule properties). Multiple such linguistic concepts can be considered a set of fuzzy patterns [1] and allow for the partition of the initial rule set into fuzzy fragments that contain rules of similar membership to a user's concept [2-4]. With appropriate visualization methods that extend previous rule set visualizations [5], we allow the user to instantly assess the matching of his concepts against the rule set.

Key words: Association Rules, Fuzzy Clustering, Exploratory Data Analysis

References

1. Dubois, D., Prade, H., Testemale, C.: Weighted Fuzzy Pattern Matching. Fuzzy Sets and Systems 28(3) (1988) 313–331

2. Höppner, F., Klawonn, F., Kruse, R., Runkler, T.: Fuzzy Clustering. Wiley, Chichester, United Kingdom (1999)

3. Döring, C., Lesot, M.J., Kruse, R.: Data Analysis with Fuzzy Clustering Methods. Computational Statistics & Data Analysis 51(1) (2006) 192–214

4. Kruse, R., Döring, C., Lesot, M.J.: Fundamentals of fuzzy clustering. In: de Oliveira, J.V., Pedrycz, W. (eds.): Advances in Fuzzy Clustering and its Applications. John Wiley & Sons (2007) 3–30

5. Steinbrecher, M., Kruse, R.: Visualization of Possibilistic Potentials. In: Foundations of Fuzzy Logic and Soft Computing. Volume 4529 of Lecture Notes in Computer Science. Springer, Berlin/Heidelberg (2007) 295–303
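A linguistic concept on rule evaluation measures, as described above, can be sketched with standard fuzzy membership functions (the measure names, thresholds and the minimum t-norm are illustrative choices, not the authors' exact model):

```python
# Sketch: a user concept "high confidence AND moderate support" as fuzzy sets
# over rule evaluation measures; each rule gets a graded membership in it.

def trapezoid(x, a, b, c, d):
    # standard trapezoidal membership function: 0 outside [a, d], 1 on [b, c]
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def concept_membership(rule):
    high_conf = trapezoid(rule["confidence"], 0.6, 0.8, 1.0, 1.001)
    mod_supp = trapezoid(rule["support"], 0.05, 0.1, 0.3, 0.5)
    return min(high_conf, mod_supp)   # fuzzy conjunction (minimum t-norm)

rules = [{"id": 1, "confidence": 0.9, "support": 0.2},
         {"id": 2, "confidence": 0.5, "support": 0.2},
         {"id": 3, "confidence": 0.9, "support": 0.6}]
fragments = {r["id"]: concept_membership(r) for r in rules}
print(fragments)
```

Sorting or thresholding the rules by this membership yields exactly the "fuzzy fragments" of the rule set that the abstract mentions: rules with similar membership to the user's concept end up together.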

− 141 −


Who's Afraid of Statistics? – Measurement and Predictors of Statistics Anxiety in German University Students

Carolin Strobl 1 and Friedrich Leisch 2

1 Institut für Statistik, Ludwig-Maximilians-Universität München, carolin.strobl@stat.uni-muenchen.de
2 friedrich.leisch@stat.uni-muenchen.de

Abstract. The measurement of statistics anxiety and the relationship between statistics anxiety and several socio-demographic and educational factors were investigated in a survey of over 600 German university students. The attitude towards statistics was measured by means of the Affect and Cognitive Competence scales of the Survey of Attitudes Towards Statistics (SATS; Schau et al., 1995). Additional items covered, amongst others, prior mathematics experience and achievement, time and activity since high school graduation, as well as items on the strategy students applied in mathematics courses, which had not been considered in earlier studies. An anxiety indicator was derived from the SATS scales by means of cluster analysis in order to separate a group of students with high levels of statistics anxiety from those with moderate and low levels of anxiety. Using this anxiety indicator as the response, a set of relevant predictor variables was identified by means of random forest variable importance scores and further explored in a logistic regression model. Our results show that the SATS Affect and Cognitive Competence scales are well suited for identifying students with high levels of negative attitude towards statistics, even though potential effects of the translation into German were noticeable for the positively worded items. Predictors found relevant for statistics anxiety were gender, mathematics taken as an intensive course in high school, prior (perceived) mathematics achievement and prior mathematics experience, as well as two of the newly included items on the strategy students applied in mathematics courses in high school: students who named practicing as their strategy were less likely, while students who named memorizing as their strategy were more likely, to show statistics anxiety.

Key words: Attitude towards statistics, SATS, Statistics education

References

Schau, C., Stevens, J., Dauphinee, T. L. and Vecchio, A. D. (1995): The development and validation of the survey of attitudes toward statistics. Educational and Psychological Measurement, 55 (5), 868–875.
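The first step of the pipeline above, deriving a binary anxiety indicator by clustering scale scores, can be sketched with a two-cluster 1-D k-means (the scores and the interpretation "low attitude score = high anxiety" are illustrative assumptions, not the study's actual data or coding):

```python
# Sketch: split students into two clusters on a combined SATS attitude
# score; the low-scoring cluster serves as the "high anxiety" group.

def kmeans_1d(xs, iters=20):
    c0, c1 = min(xs), max(xs)          # initial centroids at the extremes
    for _ in range(iters):
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        if not g0 or not g1:
            break
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return c0, c1

scores = [1.2, 1.5, 1.4, 4.0, 4.3, 3.9, 4.1]   # made-up mean attitude scores
c_low, c_high = sorted(kmeans_1d(scores))
anxious = [s for s in scores if abs(s - c_low) < abs(s - c_high)]
print(anxious)
```

The resulting binary indicator would then serve as the response for the random forest variable importance screening and the logistic regression model.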

− 142 −


A New, Conditional Variable Importance Measure for Random Forests

Carolin Strobl 1 and Achim Zeileis 2

1 Department of Statistics, Ludwig-Maximilians-Universität München, carolin.strobl@stat.uni-muenchen.de
2 Department of Statistics and Mathematics, Wirtschaftsuniversität Wien, Achim.Zeileis@wu-wien.ac.at

Abstract. Random forests are becoming increasingly popular in many scientific fields for assessing the importance of predictor variables (cf., e.g., Lunetta et al., 2004) because they can cope with "small n, large p" problems, complex interactions and even highly correlated predictor variables. Their variable importance measures can help identify relevant predictors even if they are highly correlated, whereas classical regression models often include only one representative of a group of correlated predictors. However, currently used variable importance measures can be biased, e.g., towards variables with many categories (Strobl et al., 2007) or towards correlated predictor variables (Archer and Kimes, 2008). While the former issue can be addressed by changing the resampling scheme in the tree growing process (Strobl et al., 2007), the latter is due to the permutation scheme employed in the computation of the variable importance. Here we suggest a new, conditional permutation scheme that is better suited to measuring the degree of association of each predictor variable with the response. The resulting conditional variable importance can be used to rank the predictor variables more reliably.

Key words: Feature selection, Correlation, Variable importance, Permutation tests

References

Archer, K. and Kimes, R. (2008): Empirical characterization of random forest variable importance measures. Computational Statistics & Data Analysis, 52(4), 2249–2260.

Lunetta, K.L., Hayward, L.B., Segal, J. and Van Eerdewegh, P. (2004): Screening large-scale association study data: Exploiting interactions using random forests. BMC Genetics, 5:32.

Strobl, C., Boulesteix, A.-L., Zeileis, A. and Hothorn, T. (2007): Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8:25.
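The core of the conditional scheme, permuting a predictor only within strata defined by the remaining (correlated) predictors rather than globally, can be sketched as follows (a simplified illustration; in the authors' proposal the strata come from the cutpoints of the fitted trees):

```python
# Sketch of conditional permutation: shuffle X_j only inside each stratum
# of a correlated covariate Z, so the joint pattern of X_j and Z is kept.
import random
from collections import defaultdict

def conditional_permute(xj, strata, seed=0):
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for idx, s in enumerate(strata):
        by_stratum[s].append(idx)
    out = list(xj)
    for idx_list in by_stratum.values():
        vals = [xj[i] for i in idx_list]
        rng.shuffle(vals)               # permute only within the stratum
        for i, v in zip(idx_list, vals):
            out[i] = v
    return out

xj = [1, 2, 3, 10, 20, 30]
strata = ["a", "a", "a", "b", "b", "b"]   # e.g. cutpoint regions of Z
perm = conditional_permute(xj, strata)
print(perm)
```

A global permutation would mix small and large values across the strata and thereby destroy the association between X_j and Z, which is exactly the source of the bias towards correlated predictors; the within-stratum shuffle breaks only the conditional association with the response.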

− 143 −


Conjoint Analysis within the Field of Customer Satisfaction Problems: A Model of a Composite Product/Service

Piotr Tarka

School of Banking in Poznan, Department of Organization and Management, Poland, piotr.tarka@wsb.poznan.pl

Abstract. This paper describes how the benefits of conjoint analysis can be adapted to measuring performance criteria in the customer service area. The paper points out how a single composite model can be built, incorporating a wide range of customers' key choice criteria, including service. The author draws attention to specific problems. One of them is the apparent distortion in computed utility values that arises in circumstances where global macro variables are traded off against more micro topics. This can lead to a dramatic underestimation of the overall contribution or importance of macro issues. To address this concern, the author discusses an approach known as dual scaling for eliminating the bias. Another drawback to the approach in customer service studies is the limited number of variables that can be addressed by a typical conjoint study. This makes it difficult to cover the large range of service topics typically examined in a customer satisfaction study. The paper argues that this limits the scope of both classical conjoint studies and current customer satisfaction approaches.

Key words: Conjoint Analysis, Customer satisfaction problems

− 144 −


Optimal VDSL Expansion Taking into Consideration Infrastructure Restrictions and Marketing Requirements

Klaus Thiel

T-Online, T-Online-Allee 1, 64295 Darmstadt, k.thiel@t-online.net

Abstract. The expansion of the Very High Speed Digital Subscriber Line (VDSL) network in Germany is a prestigious infrastructure project worth billions. VDSL enables a transfer rate of 50 megabits per second. With it, for example, so-called entertainment customers can receive two movies in parallel in High Definition Television (HDTV) while surfing the internet and telephoning. Within the B2B sector, many new applications, such as telecommuting in virtual teams around the world and telemedicine, have become feasible. Currently Deutsche Telekom has deployed VDSL in 27 cities, and for 2008 the VDSL expansion is planned for a further 23 cities. The optimal choice of the VDSL expansion areas primarily depends on infrastructure restrictions as well as on marketing requirements. In order to execute a spatial optimisation procedure, all the important infrastructure and marketing information must be converted to vector data by digitizing and subsequently imported into a Geo-Information System (GIS).

The most important GIS providers in Germany are Microm with MicromGEO and ESRI with ArcGIS. In order to select the system most suitable for the problem described, both systems have to be evaluated on the basis of objective test criteria. Test criteria are the quality of geo-referencing of address data and the mapping quality of different spatial levels. In order to choose the most suitable spatial level (e.g. city, post code, dialling code, municipality), several analyses have been carried out. In the next step, a spatial scoring has been developed and imported into the GIS in order to ensure that those areas with the highest VDSL customer equity potential will be the first to be expanded. Finally, using the spatial scoring, a spatial potential ranking has been calculated, on the basis of which the optimal VDSL expansion can be planned and executed.

Key words: Customer Equity, Geo-Information-System, Optimal VDSL Expansion

− 145 −


Evaluating the Data Structure and Identifying Homogeneous Spatial Units in the Database "Sustainability Issues in Sensitive Areas" of the EU-FP6 Integrated Project SENSOR

Nguyen Xuan Thinh, Leander Küttner, and Gotthard Meinel

Leibniz Institute of Ecological and Regional Development (IOER), Weberplatz 1, 01217 Dresden, Germany, ng.thinh@ioer.de, l.kuettner@ioer.de, g.meinel@ioer.de

Abstract. SENSOR (Sustainability Impact Assessment: Tools for Environmental, Social and Economic Effects of Multifunctional Land Use in European Regions) is an Integrated Project within the 6th Framework Research Programme of the European Commission (33 research partners from 15 countries). The SENSOR project is structured into seven interrelated modules M1-M7. For Module M6, "Sustainability issues in sensitive areas", a database with more than 800,000 entries has been established, with Lusatia, Silesia, Eisenwurzen, the High Tatras, Valais, the Estonian coastal zone, and Malta selected as sensitive area case studies (SACS). Using ACCESS, SPSS and ArcMap, we conduct a comparative analysis and evaluate this M6 database with a view to the theoretical sustainability indicators defined in Module M2. We then determine similarities and dissimilarities between data from different SACS. By applying adequate cluster analyses we identify homogeneous spatial units of selected SACS in order to find generalisable and specific sustainability characteristics in the seven case studies. As an example, we describe the case study of Lusatia in more detail. The area of Lusatia is divided into several local area units (LAU2), which are qualified for a statistical examination by a high number of entries and available variables. As a basis, we choose a set of 25 variables related to sustainable land use issues. Using a factor analysis, we determine the significant variables and use them as representatives to characterise typical land use clusters. In a next step, the clusters are identified by a combination of hierarchical and k-means cluster analysis methods. To describe the situation at different points in time and the development in the period between them, we repeat the cross-sectional analysis of 1996 for 2004. The results of the statistical analysis are presented in ArcMap visualisations. Although the procedure and the variable base are the same, the results differ and thus reveal the relevant land use trends within the general social transformation process of the 1990s.

Key words: EU Integrated Project SENSOR, Comparative Analysis, Similarity, Dissimilarity, SENSOR M6 Indicators, Cluster Analysis
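The combination of hierarchical and k-means clustering mentioned above usually means: let an agglomerative pass suggest the number of clusters and their starting centroids, then refine with k-means. A toy 1-D sketch (the data are invented; the project works on multivariate LAU2 indicator profiles):

```python
# Sketch: agglomerative merging supplies k starting centroids,
# which a k-means refinement then polishes (1-D toy data).

def agglomerative_centroids(xs, k):
    clusters = [[x] for x in sorted(xs)]
    while len(clusters) > k:
        # merge the pair of neighbouring clusters with the closest means
        means = [sum(c) / len(c) for c in clusters]
        i = min(range(len(means) - 1), key=lambda j: means[j + 1] - means[j])
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return [sum(c) / len(c) for c in clusters]

def kmeans(xs, centroids, iters=10):
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for x in xs:
            m = min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
            groups[m].append(x)
        centroids = [sum(g) / len(g) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids, groups

data = [0.1, 0.2, 0.15, 5.0, 5.2, 9.8, 10.1]
start = agglomerative_centroids(data, k=3)
centroids, groups = kmeans(data, start)
print(sorted(round(c, 2) for c in centroids))
```

The hierarchical step avoids the sensitivity of plain k-means to random initialisation, which matters when the clusters must be interpretable as land use types.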

− 146 −


Mining ideas from textual information

Dirk Thorleuchter

Fraunhofer Institut für Naturwissenschaftlich-Technische Trendanalysen, Appelsgarten 2, D-53879 Euskirchen, Germany, dirk.thorleuchter@int.fraunhofer.de

Abstract. This paper describes an approach to automatically find new technological ideas in textual information. On the basis of Thorleuchter (2008), the existing theoretical algorithm is extended to take into account text mining approaches such as stemming and term frequency (Ferber (2003)) and "creativity technique" approaches from the literature (Dean et al. (2001)). The aim of the new algorithm is to find ideas by using a general stop word list, because up to now the existing approach has been based on the inefficient use of a domain-specific stop word list created specifically for the analyzed text.

This new approach is evaluated with non-proprietary data, and it is realized as a web-based application named "Technological Idea Miner" that can be used for further testing and evaluation. The identified ideas are presented taking into account findings from cognitive research, as described in Puppe et al. (2003).

Key words: Text mining, Knowledge discovery, Ideas

References

Dean, G., Hender, J.M., Nunamaker, J.F. and Rodgers, T.L. (2001): Improving Group Creativity. In: R. Sprague (Ed.): Proceedings of the 34th Hawaii International Conference on System Sciences - 2001. IEEE Publishing, Maui (USA), 1070.

Ferber, R. (2003): Information Retrieval. dpunkt.verlag, Heidelberg, 41.

Puppe, F., Stoyan, H. and Studer, R. (2003): Knowledge Engineering. In: G. Görz, C.-R. Rollinger and J. Schneeberger (Eds.): Handbuch der Künstlichen Intelligenz. 4. Auflage, Oldenbourg, München, 612.

Thorleuchter, D. (2008): Finding new technological ideas and inventions with text mining and technique philosophy. In: C. Preisach, H. Burkhardt, L. Schmidt-Thieme and R. Decker (Eds.): Data Analysis, Machine Learning, and Applications. Springer, Heidelberg-Berlin.
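The filtering step at the heart of the approach, keeping only non-stop-word, stemmed tokens and ranking them by frequency, can be sketched as follows (the stop word list and the "stemmer" are tiny stand-ins for the real resources, and the example text is invented):

```python
# Sketch: candidate idea terms = stemmed tokens that survive a general
# stop word list, ranked by term frequency.
from collections import Counter

GENERAL_STOP_WORDS = {"the", "a", "of", "and", "for", "with", "to", "is"}

def crude_stem(token):
    # toy stemmer: strip a plural "s" (a real system would use e.g. Porter)
    return token[:-1] if token.endswith("s") else token

def candidate_terms(text, top=3):
    tokens = [t.strip(".,").lower() for t in text.split()]
    kept = [crude_stem(t) for t in tokens if t and t not in GENERAL_STOP_WORDS]
    return Counter(kept).most_common(top)

text = ("A coating of nanotubes for solar cells improves solar cells "
        "and the coating is cheap to apply.")
print(candidate_terms(text))
```

The point of the paper's change is visible even in this sketch: the stop word list is general-purpose and fixed, so no per-document list has to be handcrafted before the analysis can run.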

− 147 −


Mining technologies in security and defense

Dirk Thorleuchter

Fraunhofer Institut für Naturwissenschaftlich-Technische Trendanalysen, Appelsgarten 2, D-53879 Euskirchen, Germany, dirk.thorleuchter@int.fraunhofer.de

Abstract. In recent years, the rising asymmetric threat has caused governments to pay more attention to security, especially in technological areas. New and ever more complex tasks in areas concerned with defense against these new types of threat require additional research and the development of new techniques. For this reason, national and European governments are increasingly funding security and defense (S&D) based technological research.

In this paper, we give an overview of the technological landscape of S&D by presenting different S&D technologies and their relationships, as described in Geschka et al. (2005) and Reiß (2006). To this end, we first identify technologies from different technological S&D taxonomies, and second we identify innovative S&D research projects. The research projects are classified according to technologies, and on that basis the relationships between technologies are presented.

In detail, text documents are represented as vectors in a vector space model using term frequency and corpus-based term co-occurrence data. We use Jaccard's coefficient (Ferber (2003)) to measure similarity, and we use the fuzzy alpha-cut method for classification. Structured documents (XML) are used as data source and sink.

To realize this approach, we present a web application, "S&D Technology Miner", for planning support to research program planners and to researchers who acquire funding in this area, but also for testing and evaluating the approach.

Key words: Security, Defense, Technology, Text mining, Classification

References

Ferber, R. (2003): Information Retrieval. dpunkt.verlag, Heidelberg, 78.

Geschka, H., Schauffele, J. and Zimmer, C. (2005): Explorative Technologie-Roadmaps - Eine Methodik zur Erkundung technologischer Entwicklungslinien und Potenziale. In: M.G. Möhrle and R. Isenmann (Eds.): Technologie-Roadmapping. Springer, Berlin, Heidelberg et al., 165.

Reiß, T. (2006): Innovationssysteme im Wandel - Herausforderungen für die Innovationspolitik. In: B. Müller and U. Glutsch (Eds.): Fraunhofer-Institut für System- und Innovationsforschung - Jahresbericht 2006. Karlsruhe, 10.
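The similarity and classification steps named above, Jaccard's coefficient followed by a fuzzy alpha-cut, can be sketched directly on term sets (the taxonomy entries, project terms and the alpha threshold are invented for illustration):

```python
# Sketch: Jaccard similarity between a project's term set and each
# technology's term set gives graded memberships; an alpha-cut keeps
# only the sufficiently similar technologies.

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B| for two term sets
    return len(a & b) / len(a | b)

def classify(project_terms, taxonomy, alpha=0.3):
    memberships = {tech: jaccard(project_terms, terms)
                   for tech, terms in taxonomy.items()}
    # fuzzy alpha-cut: keep technologies with membership >= alpha
    return {t: m for t, m in memberships.items() if m >= alpha}

taxonomy = {"sensors":  {"radar", "lidar", "detection", "signal"},
            "robotics": {"autonomous", "vehicle", "control", "signal"}}
project = {"radar", "detection", "signal", "threat"}
result = classify(project, taxonomy)
print(result)
```

Because the alpha-cut turns graded memberships into a crisp set of assigned technologies, a project can legitimately end up in several technologies at once, which is what makes the cross-technology relationship map possible.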

− 148 −


Multilevel Simultaneous Component Analysis for Studying Inter-individual and Intra-individual Variabilities

Marieke E. Timmerman 1, Anna Lichtwarck-Aschoff 1, and Eva Ceulemans 2

1 Heymans Institute for Psychology, University of Groningen, Grote Kruisstraat 2/1, 9712 TS Groningen, The Netherlands, m.e.timmerman@rug.nl
2 Centre for Methodology of Educational Research, University of Leuven, Belgium

Abstract. All psychological processes are dynamic. To fully understand these processes it is necessary to consider the intra-individual variation of individuals over time. Here, it is important to recognize that the nature of the processes may differ across individuals. This intricate matter requires new modelling approaches. We focus on the exploratory modelling of multivariate data that have been repeatedly gathered from more than one individual. We aim at identifying meaningful sources of both the inter-individual variability and the intra-individual variability in the observed variables, while expressing the similarities and differences in those sources across individuals. To this end, we use multilevel simultaneous component analysis (MLSCA; Timmerman, 2006).

In essence, MLSCA specifies separate component models to account for inter-individual and intra-individual variabilities. The latter may entail differences across individuals, which are expressed via the covariances of the individuals' within-component scores. The common within-loadings ensure comparability across individuals. The relationships between MLSCA and the related multilevel and multigroup structural equation models will be discussed. The usefulness of MLSCA for grasping inter-individual and intra-individual variabilities is illustrated with an empirical example from a diary study focusing on the emotions involved in daily conflicts between adolescent girls and their mothers.

Key words: exploratory modelling of longitudinal data, multivariate analysis

References

Timmerman, M.E. (2006): Multilevel Component Analysis. British Journal of Mathematical and Statistical Psychology, 59, 301–320.
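The starting point of such a multilevel decomposition is the split of each observation into the individual's mean profile (the between-individual part) and the deviation from it (the within-individual part); the separate component models mentioned above are then fitted to the two parts. A toy sketch (two hypothetical individuals, three occasions, two variables):

```python
# Sketch: decompose repeated multivariate measurements into a
# between-individual part (person means) and a within-individual part
# (occasion-level deviations), which sum back to the raw data.

def decompose(data):
    between, within = {}, {}
    for person, rows in data.items():
        n, p = len(rows), len(rows[0])
        mean = [sum(r[j] for r in rows) / n for j in range(p)]
        between[person] = mean                       # inter-individual part
        within[person] = [[r[j] - mean[j] for j in range(p)] for r in rows]
    return between, within

data = {"girl_1": [[3, 1], [5, 3], [4, 2]],   # made-up diary scores
        "girl_2": [[8, 6], [6, 4], [7, 5]]}
between, within = decompose(data)
print(between["girl_1"], within["girl_1"][0])
```

By construction the within part sums to zero per person and variable, so the two parts capture orthogonal sources of variability, which is what allows separate component models for each level.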

− 149 −


Issues Related to the Implementation of a<br />

Dynamic Logistic Model for Classifier<br />

Combination<br />

Amber Tomas<br />

The University of Oxford, 1 South Parks Road, Oxford OX2 3TG, United<br />

Kingdom tomas@stats.ox.ac.uk<br />

Abstract. We consider a model for classification of sequentially received observations,<br />

when the population of interest is not assumed to be stationary. The model<br />

we propose combines the outputs of a fixed set of component classifiers (chosen in<br />

advance), and the parameters of the combination are allowed to change over time.<br />

Specifically, we use a logistic Dynamic Generalized Linear Model [1] for combining<br />

the classifier outputs, and take a predictive approach towards estimation of the posterior<br />

class probabilities. The dynamics are incorporated through the equation for<br />

parameter evolution<br />

β_{t+1} = β_t + ω_t, ω_t ∼ N(0, Σ_t). (1)<br />

The implementation of this model when the distribution of the parameters is<br />

not assumed to be normal is not straightforward. In addition to computational<br />

complexity, there arise complications related to the identifiability of the parameters<br />

β_t, which are unique to classification problems. Specifically, although the classifications<br />

produced as a result of using the model with parameters β_t are equivalent to<br />

the classifications when using parameters αβ_t, α > 0, the posterior class probabilities<br />

are more extreme in the second case. This results in increased volatility of the<br />

classification rule when using a sequential MCMC method to estimate the posterior<br />

distribution of the parameters. We discuss why there is no simple constraint for the<br />

parameters which will alleviate this identifiability problem, and present an alternative<br />

approach. In addition, we consider the related problems of adaptively changing<br />

the effective value of Σt, and the consequences of using the model (1) when it is not<br />

assumed to be correct.<br />
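The scaling non-identifiability described above is easy to verify numerically. In the toy check below, the classifier outputs `f` and the parameter vector `beta` are arbitrary illustrative values, not taken from the paper:<br />

```python
import numpy as np

def posterior_prob(beta, f):
    """Posterior probability of class 1 under a logistic combination
    of the component classifiers' outputs f."""
    return 1.0 / (1.0 + np.exp(-float(beta @ f)))

rng = np.random.default_rng(0)
f = rng.normal(size=5)       # outputs of five component classifiers
beta = rng.normal(size=5)
alpha = 3.0                  # any scaling factor alpha > 0

p = posterior_prob(beta, f)
p_scaled = posterior_prob(alpha * beta, f)

# beta and alpha*beta yield the same classification (same side of 0.5) ...
same_class = (p > 0.5) == (p_scaled > 0.5)
# ... but the scaled parameters give a more extreme posterior probability.
more_extreme = abs(p_scaled - 0.5) >= abs(p - 0.5)
```

Since multiplying β by any α > 0 preserves the sign of β·f, every such scaling produces the same decision while pushing the probability further from 0.5, which is exactly why no simple normalisation constraint resolves the problem without distorting the posterior probabilities.<br />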

Key words: Multiple Classifier Systems, Dynamic Classification, Identifiability<br />

References<br />

1. West, M., Harrison, J. and Migon, H. (1985): Dynamic Generalized Linear Models<br />

and Bayesian Forecasting. Journal of the American Statistical Association, 80,<br />

73–83.<br />

− 150 −


A Comprehensive Partial Least Squares<br />

Approach to Component-Based Structural<br />

Equation Modeling ⋆<br />

Laura Trinchera 1 and Vincenzo Esposito Vinzi 2<br />

1 Dipartimento di Matematica e Statistica, Università degli Studi di Napoli<br />

Federico II. ltrinche@unina.it<br />

2 ESSEC Business School of Paris and Singapore. vinzi@essec.fr<br />

Abstract. PLS Path Modeling (PLS-PM) is generally meant as a component-based<br />

approach to structural equation modeling that privileges a prediction-oriented<br />

discovery process over the statistical testing of causal hypotheses. Unlike<br />

covariance-based structural equation modeling (i.e. LISREL-type methods), in<br />

PLS-PM latent variables are estimated as linear combinations of the manifest variables.<br />

Thus they are more naturally defined as emergent constructs (with formative<br />

indicators) rather than latent constructs (with reflective indicators). Nowadays, formative<br />

relationships are increasingly used in real applications but pose several<br />

problems for statistical estimation and interpretation. As of today, formative<br />

relationships in PLS-PM imply multiple OLS regressions between each latent variable<br />

and its own formative indicators. As is known, OLS regression may yield unstable<br />

results in the presence of strong correlations between explanatory variables; it is not<br />

feasible when the number of statistical units is smaller than the number of variables,<br />

nor when missing data affect the dataset. Thus, it seems quite natural to introduce<br />

a PLS Regression (PLS-R) external estimation mode within the PLS-PM algorithm<br />

so as to overcome the mentioned problems, preserve the formative relationships and<br />

still remain coherent with the component-based and prediction-oriented nature of<br />

PLS-PM. Here, the main issues concerning the use of formative indicators in PLS-<br />

PM are investigated. Furthermore, the features of PLS-R may be fruitfully exploited<br />

in the internal estimation phase as well as for estimating path coefficients upon convergence<br />

of the PLS-PM algorithm when classical OLS estimates become unstable<br />

or even unfeasible. Finally, the case of formative indicators will be considered also<br />

with respect to clustering techniques recently proposed for latent class detection in<br />

PLS-PM.<br />

Key words: Formative Indicators, PLS Regression, Latent Factor Scores<br />

⋆ The participation of L. Trinchera in this research was supported by the MURST<br />

grant “Multivariate statistical models for the ex-ante and the ex-post analysis<br />

of regulatory impact”, coordinated by C. Lauro (2006). The participation of V.<br />

Esposito Vinzi in this research was supported by CERESSEC, Research Center<br />

of the ESSEC Business School.<br />

− 151 −


Relevant Importance of Predictor Variables<br />

in Support Vector Machines Models<br />

Michał Trzesiok<br />

Department of Mathematics,<br />

Katowice University of Economics, ul. Bogucicka 14, 40-226 Katowice<br />

trzesiok@ae.katowice.pl<br />

Abstract. Models resulting from Support Vector Machines (SVMs) suffer from a lack<br />

of interpretability. It is usually very hard to extract knowledge about the analyzed<br />

phenomenon from the classification model obtained by using SVMs because the<br />

classification task is realized in a high dimensional feature space. Although the<br />

method identifies the observations which are crucial for the form of the decision<br />

function, it does not show which variables are relevant and which are redundant.<br />

Vapnik claims that feature selection is not necessary for SVMs, i.e. building the<br />

model on a set of variables including some redundant variables does not change the<br />

generalization ability. Once the model is built, it is still valuable to recognize the<br />

relative importance of predictor variables. The method we propose uses sampling<br />

techniques, backward selection and the Rand index for evaluating whether a particular<br />

variable is redundant or not. We also try to extend the idea to obtain a ranking<br />

of the predictor variables reflecting the relative importance of the inputs.<br />
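The basic ingredient, comparing the classifications obtained before and after perturbing a variable via the Rand index, can be sketched as follows. This permutation-based score is our minimal illustration; the proposed method additionally combines it with repeated sampling and backward selection:<br />

```python
import numpy as np
from itertools import combinations

def rand_index(a, b):
    """Rand index: share of object pairs on which two labelings agree
    (grouped together in both, or separated in both)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def permutation_relevance(predict, X, seed=0):
    """Score each variable by how much the classification changes when
    that variable's column is randomly permuted (1 - Rand index between
    the original and the perturbed predictions).  A score near zero
    suggests the variable is redundant for this classifier."""
    rng = np.random.default_rng(seed)
    base = predict(X)
    scores = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        scores.append(1.0 - rand_index(base, predict(Xp)))
    return scores
```

For a classifier that ignores a variable entirely, the permuted predictions are unchanged and the score is exactly zero, so sorting the scores yields a relevance ranking of the inputs.<br />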

Key words: Support Vector Machines, redundancy, relevant attributes<br />

References<br />

Abe, S. (2005): Support Vector Machines for Pattern Classification, Springer, London.<br />

Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984): Classification and Regression<br />

Trees, Wadsworth, Monterey.<br />

Schölkopf, B. and Smola, A.J. (2002): Learning with Kernels, MIT Press, Cambridge.<br />

Vapnik, V. (1998): Statistical Learning Theory, John Wiley & Sons, N.Y.<br />

− 152 −


Comparison of Algorithms to Find Differentially<br />

Expressed Genes in Microarray Data<br />

Alfred Ultsch<br />

Databionics Research Lab, Department of Computer Science<br />

University of Marburg, D-35032 Marburg, Germany<br />

ultsch@informatik.uni-marburg.de<br />

Summary. There are several different algorithms published for the identification<br />

of differentially expressed genes in DNA microarray experiments. The microarrays<br />

in this type of experiment are from two different populations (groups) of specimens.<br />

Among the many genes on the microarrays, those genes are sought that are the most<br />

relevant for the distinction between the two populations. Usually such algorithms<br />

produce ordered lists of genes. In this work a method to compare the performance<br />

of such algorithms is proposed. In order to compare different methods for the identification<br />

of significant genes, a data set with known properties is published. This<br />

benchmark data is used to compare the performance of different algorithms with a<br />

newly designed one, called PUL. The comparison is based on established measures<br />

from information retrieval. Surprisingly, a clear ordering in the performance of the<br />

algorithms was observed: PUL outperformed the other algorithms by a factor of two.<br />

PUL has been applied successfully in several practical applications. For these experiments<br />

the importance of the genes proposed by PUL was independently verified.<br />

References<br />

1. Gebhard, S., Bergmann, E., Weber, A., Berwanger, B., Eilers, M., Ultsch, A.,<br />

Christiansen, H.: Classification of stage 3 neuroblastomas by artificial neural<br />

networks based analysis of cDNA microarrays. (submitted)<br />

2. Dudoit, S., Fridlyand, J., Speed, T. (2000). Comparison of discrimination methods<br />

for the classification of tumors using gene expression data. Technical report<br />

576, Department of Statistics, University of California, Berkeley.<br />

3. Pallasch, C.P., Schwamb, J., Schulz, A., Königs, S., Debey, S., Kofler, D., Schultze,<br />

J.L., Hallek, M., Ultsch, A. and Wendtner, C. (2007): Targeting lipid metabolism by<br />

the lipoprotein lipase inhibitor orlistat results in apoptosis in chronic lymphocytic<br />

leukemia, accepted for Leukemia.<br />

4. Tusher, V., Tibshirani, R. and Chu, G. (2001): Significance analysis of microarrays<br />

applied to the ionizing radiation response, PNAS 2001 98: 5116-5121.<br />

5. Ultsch, A.(2005): Improving the identification of differentially expressed genes in<br />

cDNA microarray experiments, In Weihs, C., Gaul, W. (Eds): Classification- the<br />

Ubiquitous Challenge, Springer, Heidelberg, pp. 378-385.<br />

− 153 −


Is log ratio a good value for measuring return<br />

in stock investments?<br />

Alfred Ultsch<br />

Databionics Research Group<br />

Philipps-University of Marburg, Germany<br />

ultsch@informatik.uni-marburg.de<br />

Abstract. Measuring the rate of return is an important issue for theory and practice<br />

of investments in the stock market. A common measure for rate of return is the<br />

logarithm of the ratio of successive prices (LogRatio). In this paper it is shown that<br />

LogRatio as well as arithmetic return rate (ROI) have several disadvantages. As an<br />

alternative relative differences (RelDiff) are proposed to measure rates of return.<br />

The stability against numerical and rounding errors of RelDiff is demonstrated<br />

to be much better than for LogRatios and ROI. RelDiff values are identical to<br />

LogRatios and ROI for interesting ranges of return rates. Relative differences map<br />

return rates to a finite range. For most subsequent analyses this is a big advantage.<br />

The usefulness of the approach is demonstrated on daily return rates of a large set<br />

of stocks.<br />
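The three measures can be compared directly in code. RelDiff is taken here as the symmetric relative difference 2(p1 − p0)/(p1 + p0), a common definition that matches the properties claimed above (agreement with LogRatio and ROI for ordinary return rates, and a finite range), though the paper's exact formula may differ:<br />

```python
import math

def log_ratio(p0, p1):
    return math.log(p1 / p0)            # LogRatio

def roi(p0, p1):
    return (p1 - p0) / p0               # arithmetic return rate (ROI)

def rel_diff(p0, p1):
    return 2.0 * (p1 - p0) / (p1 + p0)  # RelDiff, bounded in (-2, 2)

# For everyday price moves the three measures nearly coincide:
p0, p1 = 100.0, 101.0                   # a 1% move
returns = (log_ratio(p0, p1), roi(p0, p1), rel_diff(p0, p1))

# For extreme moves only RelDiff stays in a finite range:
extreme = (log_ratio(1.0, 1000.0), roi(1.0, 1000.0), rel_diff(1.0, 1000.0))
```

For the 1% move all three values agree to within about 5·10⁻⁵, while for the extreme move ROI explodes to 999 and RelDiff remains below 2.<br />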

Key words: Rate of Return, Return on Investment, Financial time series, Black-<br />

Scholes-Model<br />

References<br />

Bodie, Z., Kane, A. and Marcus, A.J. Essentials of Investments, 5th Edition. New York:<br />

McGraw-Hill/Irwin, 2004.<br />

Brealey, R.A., Myers, S.C. and Allen, F. Principles of Corporate Finance,<br />

8th Edition. McGraw-Hill/Irwin, 2006.<br />

Feibel, B.J. Investment Performance Measurement. New York: Wiley, 2003.<br />

Franke, J., Härdle, W. and Hafner, C. Einführung in die Statistik der Finanzmärkte. Berlin<br />

u.a.: Springer, 2nd Edition, 2004.<br />

Ultsch, A.: Improving the identification of differentially expressed genes in cDNA<br />

microarray experiments. In: Weihs, C., Gaul, W. (Eds): Classification - the Ubiquitous<br />

Challenge, Springer, Heidelberg, (2005), pp. 378-385<br />

Ultsch, A.: Is log ratio a good value for identifying differential expressed genes in<br />

microarray experiments?, Technical Report No. 35, Dept. of Mathematics and<br />

Computer Science, University of Marburg, Germany, (2003)<br />

− 154 −


Mosaic Plots and Knowledge Structures<br />

Ali Ünlü<br />

Department of Mathematics, University of Augsburg, Germany<br />

ali.uenlue@math.uni-augsburg.de<br />

Abstract. Mosaic plots are state-of-the-art graphics for multivariate categorical<br />

data (Hofmann (2008)). Knowledge structures are mathematical models that belong<br />

to the recent theory of knowledge spaces in psychometrics (Doignon and Falmagne<br />

(1999)). This paper presents an application of mosaic plots and variants such as<br />

fluctuations diagrams and multiple barcharts to psychometric data arising from underlying<br />

knowledge structure models. In simulation trials, the scope of this graphing<br />

method in knowledge space theory is investigated.<br />

Key words: Mosaic plot, Visualization, Knowledge structure, Psychometrics<br />

References<br />

Doignon, J.-P. and Falmagne, J.-Cl. (1999): Knowledge Spaces. Springer, Berlin.<br />

Hofmann, H. (2008): Mosaic Plots and Their Variants. In: C.H. Chen, W. Härdle<br />

and A.R. Unwin (Eds.): Handbook of Data Visualization. Springer, Heidelberg,<br />

617–642.<br />

− 155 −


Visualizing preferences using minimum<br />

variance nonmetric unfolding<br />

Michel van de Velden, Alain de Beuckelaer, Patrick Groenen, and Frank<br />

Busing<br />

No Institute Given<br />

Abstract. In multidimensional unfolding one wishes to obtain a map with subjects<br />

(e.g. consumers) and objects (e.g. products), in such a way that distances<br />

between subjects and objects in the map best represent the preferences as indicated<br />

in the data. Unfolding models are particularly adequate when the data (e.g. consumers’<br />

preferences) are not unidirectional but exhibit an inverted U-shape. If the<br />

alternatives are rated on an interval (or ratio) scale the ‘metric’ unfolding model is<br />

appropriate. If, however, the alternatives are rank ordered or rated on an ordinal<br />

(e.g. Likert-type) scale, one would need the ‘nonmetric’ unfolding model. Until<br />

recently, nonmetric unfolding was not feasible because of degeneracy problems.<br />

Degenerate solutions are solutions where the extent of ‘misfit’ can be made arbitrarily<br />

small. Existing algorithms consistently produced such degenerate solutions.<br />

Recently, Busing, Groenen & Heiser (Psychometrika, 2005, pp. 71–98) proposed a solution<br />

to this long-standing methodological problem by including a penalty in the<br />

algorithm. The resulting PREFSCAL algorithm is available in SPSS. In PREFSCAL<br />

two parameters are introduced that determine the strength of the penalty that leads<br />

the algorithm away from degenerate solutions. No clear directions concerning the<br />

choice of the penalty parameters are given. In this paper, we propose a minimum<br />

variance criterion to choose the penalty parameters. By studying the stability of the<br />

unfolding solutions as a function of the penalty parameters, we are able to determine<br />

the penalty in such a way that a minimum variance, non-degenerate solution<br />

is obtained. The data used in our analysis stem from a consumer study in which<br />

consumers were asked to rank-order new product ideas for soups.<br />

− 156 −


Selection of items for tests and questionnaires<br />

using Mokken scale analysis<br />

L. Andries van der Ark and J. Hendrik Straat<br />

Department of Methodology and Statistics<br />

Tilburg University<br />

P.O. Box 90153<br />

5000 LE Tilburg<br />

The Netherlands<br />

a.vdark@uvt.nl<br />

Abstract. Tests or questionnaires are often used to measure personality traits,<br />

attitudes, opinions, skills, and abilities. These tests and questionnaires consist of<br />

questions, statements, problems, games, or rating scales, which are generically called<br />

items. An important step in the construction of a test or questionnaire is a careful<br />

selection of items. A well-known approach for selecting qualitatively good items in a<br />

test is Mokken scale analysis. In this presentation, Mokken scale analysis is explained<br />

and recent developments are discussed. Special attention is given to a comparison<br />

of automated item selection algorithms used in Mokken scale analysis.<br />

− 157 −


Estimating the prevalence of rule transgression<br />

using data collected by randomized response<br />

Peter G.M. van der Heijden<br />

Department of Methodology and Statistics<br />

Utrecht University<br />

PO Box 80140<br />

3508 TC Utrecht<br />

The Netherlands<br />

P.G.M.vanderheijden@uu.nl<br />

Abstract. In criminology self-report studies are a means to obtain prevalence estimates<br />

of rule transgressions, violations of the law, and so on. In surveys individuals<br />

are interviewed about their behaviour. An obvious problem is, of course,<br />

that due to reasons such as social desirability people do not always answer honestly<br />

about their behaviour.<br />

For this reason about forty years ago randomized response was introduced to<br />

collect data about sensitive issues. Our research group has worked in this area for<br />

about 10 years and I will give an overview of our results. The results are:<br />

• a “best practice” for asking randomized response questions<br />

• a meta-analysis showing that randomized response is the most valid method for<br />

answering questions about sensitive topics<br />

• accommodating existing models for the multivariate data so that they can handle<br />

randomized response data, such as logistic regression, item response theory, and<br />

randomized response count data<br />

• accommodating these models for the potential presence of respondents who do not<br />

follow the randomized response design.<br />

We present these results and illustrate them using surveys that we conducted for the<br />

Ministry of Social Affairs into social benefit fraud, conducted on a two-yearly<br />

basis from 1998 to 2006.<br />
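A minimal simulation shows how a prevalence estimate is recovered from randomized answers. The design probabilities below (answer truthfully with probability 4/6, forced "yes" with 1/6, forced "no" with 1/6) are the classical forced-response dice illustration, not necessarily the design used in the surveys described:<br />

```python
import random

P_TRUE, P_YES = 4 / 6, 1 / 6   # truthful / forced-"yes" probabilities

def simulate_yes_rate(pi, n, rng):
    """Fraction of 'yes' answers under a forced-response design, for a
    sensitive trait with true prevalence pi."""
    yes = 0
    for _ in range(n):
        u = rng.random()
        if u < P_TRUE:                 # respondent answers truthfully
            yes += rng.random() < pi
        elif u < P_TRUE + P_YES:       # forced "yes"
            yes += 1
        # otherwise: forced "no"
    return yes / n

def estimate_prevalence(yes_rate):
    # E[yes_rate] = P_TRUE * pi + P_YES, so invert the design:
    return (yes_rate - P_YES) / P_TRUE

rng = random.Random(42)
pi_hat = estimate_prevalence(simulate_yes_rate(pi=0.2, n=100_000, rng=rng))
```

The randomization protects each individual respondent (a "yes" may always be a forced answer), yet the population prevalence remains estimable by inverting the known design probabilities.<br />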

− 158 −


Clustering Consumers with Respect to Their<br />

Marketing Reactance Behavior<br />

Ralf Wagner and Erik Sauerwald<br />

SVI Chair for International Direct Marketing<br />

DMCC - Dialog Marketing Competence Center<br />

University of Kassel, Germany<br />

rwagner@wirtschaft.uni-kassel.de<br />

erik sauerwald@arcor.de<br />

Abstract. The recent paradigm shift in modern marketing practices (Coviello et<br />

al. (2002), Vargo & Lusch (2004)) in concurrence with the increasing popularity of<br />

digital marketing measures (Wagner & Meißner (forthcoming)) add a new quality<br />

to the discussion of marketing intrusiveness (Morimoto & Chang (2006)).<br />

Despite the comprehensive research in international differences on media usage<br />

(e.g., Krafft et al. (2007)), related previous research frequently neglects the cultural<br />

differences in recipients’ assessment of the marketing measures. In this study we<br />

utilize the Item Response Theory approach for an assessment of individuals’ reactance<br />

to unsolicited marketing communications. The study is based on a survey of<br />

recipients from China, Germany, Russia, and the United States of America.<br />

Key words: Advertising, Culture, Item Response Theory, Reactance<br />

References<br />

Coviello, N.E., Brodie, R.J., Danaher, P.J., and Johnston, W.J. (2002): How Firms<br />

Relate to Their Markets: An Empirical Examination of Contemporary Marketing<br />

Practices. Journal of Marketing, 66, 33–46.<br />

Krafft, M.; Hesse, J., Höfling, J., Peters, K., Rinas, D. (2007): International Direct<br />

Marketing. Principles, Best Practices, Marketing Facts. Springer, Berlin.<br />

Morimoto, M. and Chang, S. (2006): Consumers’ Attitudes toward Unsolicited Commercial<br />

E-mail and Postal Direct Mail Marketing Methods: Intrusiveness, Perceived<br />

Loss of Control, and Irritation. Journal of Interactive Advertising, 7,<br />

8–20.<br />

Vargo, S.L. and Lusch, R.F. (2004): Evolving to a New Dominant Logic for Marketing.<br />

Journal of Marketing, 68, 1–17.<br />

Wagner, R. and Meißner, M. (forthcoming): Multimedia for Direct Marketing. In:<br />

M. Pagani (Ed.): Encyclopedia of Multimedia Technology and Networking, 2 nd<br />

Edition. Idea Publishing, Hershey.<br />

− 159 −


Supervised Self-Organising Maps and More<br />

Ron Wehrens<br />

IMM, Analytical Chemistry<br />

P.O. Box 9010, 6500 GL Nijmegen<br />

The Netherlands<br />

r.wehrens@science.ru.nl<br />

Abstract. Self-organising maps (SOMs) have been applied in many different areas<br />

of science. In a typical application, large numbers of objects (thousands or more) are<br />

mapped to a two-dimensional grid in such a way that very similar objects end up in<br />

the same area. If several different types of information are available, one can combine<br />

these in one feature vector, used to determine the similarity with each of the map<br />

units, but this presents scaling difficulties. We have, e.g., mapped several thousand<br />

steroid crystal structures from the Cambridge Crystallographic Database, based<br />

on their diffraction patterns and a specific distance function. For these structures,<br />

several other types of information are available as well, such as space group and cell<br />

volume.<br />

To take extra information into account, we have extended the basic principle<br />

of SOMs to accommodate extra layers, one for each type of feature vector [1]. The<br />

closest unit is then determined by summing distances per layer, where each layer<br />

can be assigned a weight. This makes it possible to perform supervised mapping:<br />

the second layer then contains the class information. The result of including class<br />

information is that classes are more likely to form contiguous units in the map. This<br />

behaviour can be enforced by choosing a larger weight for the class information. One<br />

does not have to stop at two layers: it is possible to create several layers, each layer<br />

corresponding with another type of data.<br />

This is implemented in an R package, called “kohonen” [2], available from CRAN<br />

(http://cran.r-project.org). Several examples will be shown highlighting the<br />

possibilities of the technique.<br />
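The layer-weighted distance that determines the best-matching unit can be sketched as follows. This is a simplified Python rendering of the idea only; the actual kohonen package implements it in R/C with further details such as per-layer distance scaling:<br />

```python
import numpy as np

def best_matching_unit(sample_layers, codebook_layers, weights):
    """Each object and each map unit are described by one vector per
    layer (e.g. a diffraction pattern plus a class indicator).  The
    distance to a unit is the weighted sum of per-layer squared
    Euclidean distances; a large weight on the class layer enforces
    supervised mapping."""
    n_units = codebook_layers[0].shape[0]
    total = np.zeros(n_units)
    for x, codes, w in zip(sample_layers, codebook_layers, weights):
        total += w * ((codes - x) ** 2).sum(axis=1)
    return int(np.argmin(total))
```

Shifting the weight vector towards the class layer makes the class information dominate the choice of the winning unit, which is what produces contiguous class regions in the trained map.<br />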

Key words: Self-organising maps, Data fusion, Supervised mapping<br />

References<br />

1. W.J. Melssen, R. Wehrens, and L.M.C. Buydens. Supervised Kohonen networks<br />

for classification problems. Chemom. Intell. Lab. Syst., 83:99–113, 2006.<br />

2. R. Wehrens and L.M.C. Buydens. Self- and super-organising maps in R: the<br />

kohonen package. Journal of Statistical Software, 21(5), 2007.<br />

− 160 −


Multi-Item Versus Single-Item Measures:<br />

A Review and Future Research Directions<br />

Petra Wilczynski and Marko Sarstedt<br />

Institute for Market-based Management, Munich School of Management,<br />

D-80539 Munich, Germany wilczynski@bwl.lmu.de<br />

Abstract. With their widely discussed Journal of Marketing Research article,<br />

Bergkvist and Rossiter (2007) resume a long-lasting interdisciplinary discussion on<br />

the benefits and limitations of multi-item versus single-item measures. Whereas<br />

multi-item measures of theoretical constructs have been the norm in marketing research<br />

for over 20 years, practitioners seem to favour single-item measures on the<br />

practical grounds of minimizing non-response and costs. This practice is often<br />

seen as a fatal error, because single-item measures are believed to be unreliable<br />

and invalid. During the last decades, several studies have appeared in disciplines<br />

such as the social sciences, marketing and psychology that critically compare these two<br />

approaches, sometimes yielding contradictory results in terms of validity or reliability.<br />

Thus, the objective of this paper is to develop an integrated overview of the<br />

present status of research in this field, taking into account various disciplines. Here,<br />

advantages and disadvantages, analytical approaches as well as the results are<br />

compared and critically evaluated. The findings suggest several areas for future research<br />

in this important field, which is necessary to close the gap between theoretical<br />

and practical requirements.<br />

Key words: Single Item, Multi Item, Scale Development<br />

References<br />

Bergkvist, L. and Rossiter, J.R. (2007): The Predictive Validity of Multiple-Item<br />

Versus Single-Item Measures of the Same Constructs. Journal of Marketing<br />

Research, 44, 175–184.<br />

Drolet, A.L. and Morrison, D.G. (2001): Do We Really Need Multiple-Item Measures<br />

in Service Research? Journal of Service Research, 3, 196–204.<br />

Wanous, J.P., Reichers, A.E. and Hudy, M.J. (1997): Overall Job Satisfaction:<br />

How Good Are Single-Item Measures? Journal of Applied Psychology, 82,<br />

247–252.<br />

− 161 −


Management and methods: How to do market<br />

segmentation projects<br />

Raimund Wildner<br />

GfK Group<br />

Nürnberg, Germany<br />

Abstract. Market segmentation projects are often strategic projects with top management<br />

attention and high budgets. Nevertheless, many of them fail. This can be<br />

due to poor methodology as well as to poor management.<br />

From the management perspective it is essential that the objectives of the segmentation<br />

are clear from the beginning. Product development, media advertising,<br />

or sales are possible objectives and each of them requires specific variables as<br />

an input. Furthermore it is essential that all stakeholders of a segmentation project<br />

are involved from the beginning. During the segmentation project a close cooperation<br />

between marketing experts, market research experts, and statistics experts is<br />

necessary. Special problems arise in international segmentation projects. Finally it<br />

is important to sell the segmentation in the organization by workshops, leaflets and<br />

other instruments that help to get a clear picture of the segments.<br />

From a methodological standpoint it is important that the result is stable so it<br />

can be reproduced in other data sets as well. A test for stability will be shown. Outliers<br />

can cause instability, so a special method to identify them will be shown. Faked<br />

interviews have to be excluded from the segmentation. A procedure for detection of<br />

faked interviews will be discussed. Finally the cluster procedure that proved to be<br />

superior in practical terms is discussed.<br />

− 162 −


Clustering with Repulsive Prototypes<br />

Roland Winkler 1 , Frank Rehm 2 , and Rudolf Kruse 3<br />

1 German Aerospace Center, Braunschweig roland.winkler@dlr.de<br />

2 German Aerospace Center, Braunschweig frank.rehm@dlr.de<br />

3 Otto von Guericke University, Magdeburg kruse@iws.cs.uni-magdeburg.de<br />

Abstract. Although there is no exact definition for the term cluster, in the 2D<br />

case, it is fairly easy for human beings to decide which objects belong together. For<br />

machines on the other hand, it is hard to determine which objects form a cluster.<br />

Depending on the problem, the success of a clustering algorithm depends on the<br />

idea of its creators about what a cluster should be. Likewise, each clustering<br />

algorithm comprises a characteristic idea of the term cluster. For example the fuzzy<br />

c-means algorithm tends to find spherical clusters with equal numbers of objects.<br />

Noise clustering focuses on finding spherical clusters of user-defined diameter.<br />

If there is a certain amount of knowledge available about how clusters are shaped, it<br />

is possible to include more information into a clustering algorithm. In this paper, we<br />

present an extension to noise clustering that tries to maximize the distances between<br />

prototypes. For that purpose, the prototypes behave like repulsive magnets that<br />

have an inertia depending on their sum of membership values. Using this repulsive<br />

extension, it is possible to prevent groups of objects from being divided into more<br />

than one cluster. Due to the repulsion and inertia, it is also possible to determine<br />

the number of clusters in a data set. Roughly speaking, having information about<br />

cluster shapes (i.e. the diameter) may help to cope with the absence of knowledge<br />

concerning the exact number of clusters.<br />

The results of repulsive clustering can be used as an initialization for other clustering<br />

techniques. We successfully applied this method to air traffic management tasks.<br />
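The repulsion-with-inertia idea can be illustrated with a single update step. The step size and the inverse-square force below are our own illustrative choices, not the authors' exact update rule:<br />

```python
import numpy as np

def repel_step(prototypes, memberships, eta=0.1):
    """One repulsion step: every prototype is pushed away from every
    other prototype, damped by its inertia (its column sum of
    membership values), so prototypes that represent many objects
    barely move while weakly supported ones are driven away."""
    inertia = memberships.sum(axis=0) + 1e-12   # one inertia per prototype
    new = prototypes.copy()
    k = len(prototypes)
    for i in range(k):
        for j in range(k):
            if i != j:
                diff = prototypes[i] - prototypes[j]
                dist2 = float((diff ** 2).sum()) + 1e-12
                new[i] += eta * diff / (dist2 * inertia[i])
    return new
```

A prototype with little membership support is repelled out of an already-covered group instead of splitting it, which is how the mechanism helps determine the number of clusters.<br />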

Key words: fuzzy clustering, cluster shapes, cluster validity, air traffic management<br />

References<br />

Bezdek, J.C.(1981): Pattern recognition with fuzzy objective function algorithms.<br />

Plenum Press, New York.<br />

Dave, R.N. and Sen, S. (1998): Generalized noise clustering as a robust fuzzy c-M-estimators<br />

model. 17th Annual Conference of the North American Fuzzy Information<br />

Processing Society (NAFIPS-98), Pensacola Beach, Florida, 256–260.<br />

Rehm, F., Klawonn, F., Kruse, R.(2007): A novel approach to noise clustering for<br />

outlier detection. Soft Computing - A Fusion of Foundations, Methodologies and<br />

Applications, Berlin/Heidelberg, Vol. 11, No 5, 489-494.<br />

− 163 −


On the Effects of Enhanced Selection Models<br />

on Quality and Comparability of Classifiers<br />

Produced by Genetic Programming<br />

Stephan Winkler, Michael Affenzeller, Stefan Wagner, and Gabriel<br />

Kronberger<br />

Fachhochschule Oberösterreich, Research Center Hagenberg<br />

{swinkler,maffenze,swagner,gkronber}@heuristiclab.com<br />

Abstract. The use of genetic programming (GP) in machine learning enables the<br />

automated search for classification models that are evolved by an evolutionary process<br />

using the principles of selection, crossover and mutation. The use of enhanced<br />

selection models in GP ([1], [2]) is able to significantly increase the quality of classifiers<br />

produced by GP; detailed analysis can be found for example in [3].<br />

Algorithmic reliability can be assessed by comparing the results produced by a<br />

machine learning algorithm; due to the stochastic element that is intrinsic to any<br />

evolutionary process, GP cannot guarantee the generation of identical or even similar<br />

models in each GP process execution. In [4] we have presented a method for<br />

comparing time series models produced by GP; in this paper we analyze the classifiers<br />

returned by GP based machine learning for medical benchmark data sets (taken<br />

from the UCI repository). We mainly focus on comparing standard GP techniques<br />

to those using enhanced selection models with respect to results similarity analysis.<br />

The effects of pruning mechanisms (applied to the final results) are also discussed.<br />

Key words: Evolutionary Learning, Genetic Programming, Results Comparability<br />

References<br />

[1] Affenzeller, M. and Wagner, S. (2005): Offspring selection: A new self-adaptive<br />

selection scheme for genetic algorithms. Adaptive and Natural Computing Algorithms,<br />

218—221.<br />

[2] Wagner, S. and Affenzeller, M. (2005): SexualGA: Gender-Specific Selection for<br />

Genetic Algorithms. Proceedings of the 9th World Multi-Conference on Systemics,<br />

Cybernetics and Informatics (WMSCI) 2005, 4: 76–81.<br />

[3] Winkler, S., Affenzeller, M. and Wagner, S. (2007): Advanced genetic programming<br />

based machine learning. Journal of Mathematical Modelling and Algorithms,<br />

6(3): 455—480. Springer, Berlin.<br />

[4] Winkler, S., Affenzeller, M. and Wagner, S. (2008): On the Reliability of Nonlinear<br />

Modeling Using Enhanced Genetic Programming Techniques. Proceedings of<br />

the Chaotic Modeling and Simulation International Conference CHAOS 2008.<br />

− 164 −


Analysis of massive emigration from Poland –<br />

the model–based clustering approach<br />

Ewa Witek<br />

Department of Statistics,<br />

Katowice University of Economics, Bogucicka 14, 40–226 Katowice<br />

ewitek@ekonom.ae.katowice.pl<br />

Abstract. The model–based approach assumes that data is generated by a finite<br />

mixture of underlying probability distributions, such as the multivariate normal distribution.<br />

In finite mixture models, each component of the probability distribution corresponds<br />

to a cluster. The problem of determining the number of clusters and choosing<br />

an appropriate clustering method becomes a statistical model choice problem. Hence,<br />

the model–based approach provides a key advantage over heuristic clustering algorithms<br />

by selecting both the correct model and the number of clusters.<br />

Model–based clustering has shown promise in a number of practical applications,<br />

including tissue segmentation, character recognition, minefield and seismic fault detection<br />

and classification of astronomical data. The article presents an application<br />

of model–based clustering in economic analysis, where such applications are comparatively rare.<br />

The moment Poland joined the EU, its citizens rushed out of the country. Since<br />

1 May 2004 Poland has been facing the problem of increased emigration. We used<br />

the model–based clustering approach for grouping and detecting inhomogeneities of<br />

Polish emigrants from different regions of Poland.<br />
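The pipeline sketched in the abstract (Gaussian mixtures fitted by EM, with the number of clusters chosen by BIC) can be illustrated in a few lines of Python. This is a minimal one-dimensional sketch under assumed simplifications (deterministic initialization, one variance per component), not the full multivariate framework of Fraley and Raftery (2002):

```python
import math

def em_gmm_1d(x, k, iters=60):
    """Fit a 1-D Gaussian mixture with EM and return (weights, means,
    variances, log-likelihood). Means are initialized evenly over the
    data range to keep the sketch deterministic."""
    n = len(x)
    lo, hi = min(x), max(x)
    mu = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    var = [1.0] * k
    w = [1.0 / k] * k

    def comp_pdf(xi, j):
        return (w[j] * math.exp(-(xi - mu[j]) ** 2 / (2 * var[j]))
                / math.sqrt(2 * math.pi * var[j]))

    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for xi in x:
            p = [comp_pdf(xi, j) for j in range(k)]
            s = sum(p) or 1e-300
            resp.append([pj / s for pj in p])
        # M-step: re-estimate weights, means and variances
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            w[j] = nj / n
            mu[j] = sum(r[j] * xi for r, xi in zip(resp, x)) / nj
            var[j] = max(sum(r[j] * (xi - mu[j]) ** 2
                             for r, xi in zip(resp, x)) / nj, 1e-6)
    loglik = sum(math.log(sum(comp_pdf(xi, j) for j in range(k)) or 1e-300)
                 for xi in x)
    return w, mu, var, loglik

def bic(loglik, k, n):
    """BIC for a 1-D GMM: (k-1) weights + k means + k variances."""
    return -2 * loglik + (3 * k - 1) * math.log(n)
```

Fitting several candidate values of k and keeping the one with the smallest BIC is the model choice step the abstract refers to.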

Key words: Model–based clustering, EM algorithm, BIC<br />

References<br />

Fraley, C. and Raftery, A.E. (2002): Model–based clustering, discriminant analysis,<br />

and density estimation. Journal of the American Statistical Association, 97,<br />

611–631.<br />

McLachlan, G.J. and Peel, D. (2000): Finite Mixture Models. Wiley, New York.<br />

− 165 −


Image Based Mail Piece Identification using<br />

Unsupervised Learning<br />

Katja Worm 1 and Beate Meffert 2<br />

1 Siemens ElectroCom Postautomation GmbH, Rudower Chaussee 29, 12489<br />

Berlin, Germany katja.worm@siemens.com<br />

2 Humboldt Universität zu Berlin, Rudower Chaussee 25, 12489 Berlin, Germany<br />

meffert@informatik.hu-berlin.de<br />

Abstract. Based on the uniqueness of a mail piece surface, postal sorting machines<br />

use mail piece image characteristics to reuse addresses, extracted once, in<br />

different sorting steps. During the first sorting step, mail piece image characteristics<br />

are extracted and stored together with the target address in a database. In subsequent<br />

sorting steps the mail piece address is retrieved by determining the corresponding<br />

mail piece characteristics in the database. In previous work, appropriate mail piece<br />

image characteristics and procedures for their distance measurement were presented.<br />

Image-based mail piece identification is complicated by a constantly changing<br />

and unknown mail piece spectrum, as well as by the differentiation of nearly identical<br />

mass mail. In particular, the rejection of unknown mail pieces requires the definition<br />

of specific rejection classes depending on the current mail piece spectrum.<br />

In this paper we present an approach to distance-based mail piece identification<br />

using a two-stage classification process. The different handling of mass and collection<br />

mail is facilitated by an unsupervised learning process, performed beforehand, which<br />

clusters similar mail piece characteristics. A specific rejection<br />

class can be estimated within each cluster. The first step in the identification process<br />

is the determination of the corresponding cluster for a given mail piece. Based on<br />

the cluster-specific rejection class, the mail piece is either identified or rejected.<br />

Experimental results obtained on real-world data sets show the applicability of the<br />

proposed method.<br />
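The two-stage process can be rendered schematically. The Python sketch below is a hypothetical illustration only: the Euclidean distance and the `centers`/`db`/`thresholds` structures are illustrative assumptions, not the system's actual feature distance or database layout:

```python
import math

def euclid(a, b):
    """Plain Euclidean distance between two feature vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def identify(feature, centers, db, thresholds):
    """Two-stage identification sketch:
    1) assign the mail piece feature vector to its nearest cluster;
    2) within that cluster, find the closest stored mail piece and
       accept it only if the distance stays below the cluster-specific
       rejection threshold; otherwise the piece is rejected (None)."""
    # stage 1: nearest cluster center
    c = min(range(len(centers)), key=lambda j: euclid(feature, centers[j]))
    # stage 2: nearest stored mail piece within that cluster
    stored = db.get(c, [])
    if not stored:
        return None
    d, item_id = min((euclid(feature, f), item_id) for item_id, f in stored)
    return item_id if d <= thresholds[c] else None
```

Restricting the nearest-neighbor search to one cluster is what makes a per-cluster rejection class possible: each threshold can adapt to the density of that cluster's mail piece spectrum.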

References<br />

Worm, K. and Meffert, B. (2008): Robust Image Based Document Comparison Using<br />

Attributed Relational Graphs. Proceedings of the International Conference on<br />

Signal Processing, Pattern Recognition and Applications (SPPRA), accepted.<br />

Key words: Document Identification, Unsupervised Learning, Minimum Distance<br />

Classification<br />

− 166 −


Factor Analysis of Incomplete Disjunctive<br />

Tables<br />

Amaya Zárraga 1 and Beatriz Goitisolo 1<br />

Departamento de Economía Aplicada III. UPV/EHU. Bilbao. Spain<br />

Amaya.Zarraga@ehu.es and Beatriz.Goitisolo@ehu.es<br />

Abstract. Multiple Correspondence Analysis (MCA) studies the relationship between<br />

several categorical variables defined with respect to a certain population.<br />

However, one of the main sources of information is surveys, in which it is<br />

common to find a certain amount of missing data and conditioned questions that do not<br />

need to be answered by the whole population. In these cases, coding the data in<br />

a complete disjunctive table requires the inclusion of non-answer categories that can<br />

alter the results. For example, the χ² distance between two row profiles increases<br />

with the common answers when the individuals do not answer the same number<br />

of questions. And in the χ² distance between two column profiles, each individual<br />

could have a different weight according to the number of answers previously chosen.<br />

Therefore, the direct application of the standard MCA is not appropriate to<br />

the study of an incomplete disjunctive table (IDT). We propose analyzing the<br />

incomplete disjunctive table by replacing the table's actual marginal over the<br />

individuals with a suitably imposed marginal.<br />
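The χ² distance effect described above is easy to reproduce. The following minimal Python sketch computes the generic χ² distance between row profiles of an indicator table (it is not the authors' proposed correction) and shows that two individuals with identical common answers are no longer at distance zero once one of them leaves a question unanswered:

```python
def chi2_row_distance(Z, i1, i2):
    """Chi-square distance between two row profiles of a disjunctive
    (indicator) table Z, as used in correspondence analysis: profiles
    are rows divided by their row totals, and each squared difference
    is weighted by the inverse column mass."""
    total = sum(sum(row) for row in Z)
    ncol = len(Z[0])
    colmass = [sum(row[j] for row in Z) / total for j in range(ncol)]
    r1, r2 = sum(Z[i1]), sum(Z[i2])
    d2 = 0.0
    for j in range(ncol):
        if colmass[j] == 0:
            continue
        p1 = Z[i1][j] / r1 if r1 else 0.0
        p2 = Z[i2][j] / r2 if r2 else 0.0
        d2 += (p1 - p2) ** 2 / colmass[j]
    return d2 ** 0.5
```

Because each profile is normalized by the individual's own number of answers, a non-response changes the weight of the common answers and inflates the distance, which is exactly the distortion that motivates the imposed marginal.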

Key words: Multiple Correspondence Analysis, Complete Disjunctive Tables, Incomplete<br />

Disjunctive Tables<br />

References<br />

Zárraga, A. and Goitisolo, B. (1999): Independence Between Questions in the Factor<br />

Analysis of Incomplete Disjunctive Tables with Conditioned Questions.<br />

Qüestiió, 23(3), 465–488.<br />

Zárraga, A. and Goitisolo, B. (2008): Factorial Analysis of a Set of Contingency Tables.<br />

In: C. Preisach, H. Burkhardt, L. Schmidt-Thieme and R. Decker (Eds.):<br />

Data Analysis, Machine Learning and Applications. Studies in Classification,<br />

Data Analysis, and Knowledge Organization. Proceedings of the 31st<br />

Annual Conference of the Gesellschaft für Klassifikation e.V., Albert-Ludwigs-<br />

Universität Freiburg, March 7–9, 2007. Springer, Berlin, forthcoming.<br />

− 167 −


Recursive Partitioning of Economic<br />

Regressions: Trees of Costly Journals and<br />

Beautiful Professors<br />

Achim Zeileis 1 and Christian Kleiber 2<br />

1 Department of Statistics and Mathematics,<br />

Wirtschaftsuniversität Wien, Achim.Zeileis@wu-wien.ac.at<br />

2 Wirtschaftswissenschaftliches Zentrum, Universität Basel<br />

Christian.Kleiber@unibas.ch<br />

Abstract. The linear regression model is the workhorse for empirical economic<br />

analyses. For a wide variety of standard analysis problems, there are useful specifications<br />

of linear regression models, validated by economic theory and prior successful<br />

empirical studies. However, in non-standard problems or in situations where data on<br />

additional variables is available, a useful specification of a regression model involving<br />

all variables of interest might not be available. Here, we explore how recursive<br />

partitioning techniques can be used in such situations for modeling the relationship<br />

between the dependent variable and the available regressors. Linear regression is<br />

embedded into the model-based recursive partitioning framework of Zeileis et al.<br />

(2008). The resulting regression trees are grown by recursively applying techniques<br />

for testing and dating structural changes in linear regressions. They are compared<br />

to classical modeling approaches in two empirical applications: Following Stock and<br />

Watson (2007), the demand for economic journals (Bergstrom, 2001) is investigated.<br />

Furthermore, the impact of professors’ beauty on their class evaluations (Hamermesh<br />

and Parker, 2005) is assessed.<br />
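A bare-bones illustration of the idea: fit a linear regression, and split the sample along a partitioning variable wherever doing so improves the fit enough. The Python sketch below replaces the structural change tests of Zeileis et al. (2008) with a crude RSS-gain criterion, so it is only a hypothetical stand-in for the actual method:

```python
def ols_rss(xy):
    """Residual sum of squares of a simple regression y = a + b*x."""
    n = len(xy)
    sx = sum(x for x, _ in xy); sy = sum(y for _, y in xy)
    sxx = sum(x * x for x, _ in xy); sxy = sum(x * y for x, y in xy)
    denom = n * sxx - sx * sx
    b = (n * sxy - sx * sy) / denom if denom else 0.0
    a = (sy - b * sx) / n
    return sum((y - a - b * x) ** 2 for x, y in xy)

def mob(data, z, min_size=10, gain=0.1):
    """Grow a tiny regression tree: split on the partitioning variable z
    whenever a split reduces the pooled RSS by more than a fraction
    `gain` of it (a crude stand-in for the structural change tests)."""
    if len(data) < 2 * min_size:
        return ("leaf", data)
    pooled = ols_rss(data)
    order = sorted(range(len(data)), key=lambda i: z[i])
    best = None
    for cut in range(min_size, len(data) - min_size):
        left = [data[i] for i in order[:cut]]
        right = [data[i] for i in order[cut:]]
        rss = ols_rss(left) + ols_rss(right)
        if pooled - rss > gain * pooled and (best is None or rss < best[0]):
            best = (rss, cut)
    if best is None:
        return ("leaf", data)
    cut = best[1]
    zl = [z[i] for i in order]
    return ("split", zl[cut - 1],
            mob([data[i] for i in order[:cut]], zl[:cut], min_size, gain),
            mob([data[i] for i in order[cut:]], zl[cut:], min_size, gain))
```

In the real framework the split decision comes from testing parameter instability in the fitted regression, which controls the significance level instead of relying on an ad-hoc RSS threshold.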

Key words: Regression trees, Model-based recursive partitioning, Structural change<br />

References<br />

Bergstrom, T.C. (2001): Free Labor for Costly Journals? Journal of Economic Perspectives,<br />

15, 183–198.<br />

Hamermesh, D.S. and Parker, A. (2005): Beauty in the Classroom: Instructors’<br />

Pulchritude and Putative Pedagogical Productivity. Economics of Education<br />

Review, 24, 369–376.<br />

Stock, J.H. and Watson, M.W. (2007): Introduction to Econometrics. 2nd edition,<br />

Addison Wesley.<br />

Zeileis, A., Hothorn, T. and Hornik, K. (2008): Model-based Recursive Partitioning.<br />

Journal of Computational and Graphical Statistics, accepted for publication.<br />

− 168 −
