<strong>GfKl 2008</strong><br />
German Classification Society<br />
32nd Annual Conference<br />
Advances in Data Analysis, Data Handling<br />
and Business Intelligence<br />
Joint Conference with the<br />
British Classification Society (BCS) and the<br />
Dutch/Flemish Classification Society (VOC)<br />
July 16−18, 2008<br />
Hamburg<br />
Program and<br />
Abstract Volume<br />
http://gfkl2008.hsu-hh.de
Main contact:<br />
Prof. Dr. Wilfried Seidel<br />
Helmut-Schmidt-University<br />
Holstenhofweg 85<br />
22043 Hamburg<br />
Germany<br />
+49 (0)40 6541 2315<br />
gfkl2008@hsu-hh.de<br />
Contents<br />
Welcome, v<br />
Sponsors, vi<br />
Scientific Program Committee, vii<br />
Organizing Committee, vii<br />
Plenary and Semi-plenary Lectures, viii<br />
Invited Sessions, x<br />
Detailed Schedule, xi<br />
List of Contributions, xxviii<br />
Author Index, xxxvii<br />
Abstracts, 1<br />
Welcome<br />
On behalf of the Helmut-Schmidt-University Hamburg,<br />
we welcome you to GfKl 2008 − Advances in Data Analysis,<br />
Data Handling and Business Intelligence − the 32nd Annual<br />
Conference of the German Classification Society, organized<br />
in cooperation with the British Classification Society (BCS)<br />
and the Dutch/Flemish Classification Society (VOC).<br />
The conference features 13 invited lectures (3 plenary<br />
speeches and 10 semi-plenary lectures), 166 contributed<br />
talks, 4 invited sessions, and 2 workshops.<br />
We are indebted to those who proposed and supported<br />
holding GfKl 2008 in Hamburg. We are grateful to everyone<br />
who volunteered in organizing the conference, and we<br />
acknowledge the generous financial backing of our sponsors.<br />
We wish you a very pleasant and stimulating GfKl 2008.<br />
Claudia Fantapié Altobelli<br />
Andreas Fink<br />
Hartmut Hebbel<br />
Wilfried Seidel<br />
Detlef Steuer<br />
Ulrich Tüshaus<br />
(Organizing committee,<br />
Helmut-Schmidt-University Hamburg)<br />
Berthold Lausen<br />
Alfred Ultsch<br />
(Co-chairs of the program committee)<br />
Sponsors<br />
The organizers would like to express their appreciation to the<br />
following organizations for providing financial help and other<br />
support:<br />
Deutsche Forschungsgemeinschaft<br />
Gesellschaft für Konsumforschung<br />
Hamburg-Mannheimer Versicherungen<br />
Hamburger Sparkasse<br />
Springer-Verlag<br />
StatSoft GmbH<br />
Vattenfall<br />
Volksfürsorge Deutsche Lebensversicherung AG<br />
Scientific Program Committee<br />
H.-H. Bock (RWTH Aachen)<br />
R. Decker (Uni Bielefeld)<br />
V. Esposito Vinzi (ESSEC Paris)<br />
W. Esswein (TU Dresden)<br />
C. Fantapié Altobelli (HSU Hamburg)<br />
A. Fink (HSU Hamburg)<br />
W. Gaul (Uni Karlsruhe)<br />
H. Hebbel (HSU Hamburg)<br />
Ch. Hennig (Uni London)<br />
K. Jajuga (Wroclaw Univ. of Economics)<br />
H.-P. Klenk (DSMZ Braunschweig)<br />
B. Lausen (Uni Erlangen-Nürnberg, Co-Chair)<br />
H. Locarek-Junge (TU Dresden)<br />
F. Murtagh (Uni London)<br />
A. Okada (Uni Tokyo)<br />
L. Schmidt-Thieme (Uni Hildesheim)<br />
W. Seidel (HSU Hamburg)<br />
D. Steuer (HSU Hamburg)<br />
U. Tüshaus (HSU Hamburg)<br />
A. Ultsch (Uni Marburg, Co-Chair)<br />
M. van de Velden (Uni Rotterdam)<br />
D. van den Poel (Uni Ghent)<br />
I. van Mechelen (Uni Leuven)<br />
R. Wehrens (Uni Nijmegen)<br />
C. Weihs (Uni Dortmund)<br />
Organizing Committee<br />
Claudia Fantapié Altobelli<br />
Andreas Fink<br />
Hartmut Hebbel<br />
Wilfried Seidel<br />
Detlef Steuer<br />
Ulrich Tüshaus<br />
Plenary and Semi-plenary Lectures<br />
Wednesday, July 16, 10:00–10:45:<br />
Walter Radermacher, President Federal Statistical Office of<br />
Germany, Wiesbaden, Germany<br />
“Statistical Processes Under Change – Enhancing Data Quality<br />
with Pretests” (Room 5)<br />
Wednesday, July 16, 11:00–11:40:<br />
Geoffrey John McLachlan, University of Queensland, Brisbane, Australia<br />
“Clustering of High-Dimensional Data Via Finite Mixture Models”<br />
(Room 5)<br />
Fred Hamprecht, University Heidelberg, Germany<br />
“Segmentation of Neural Tissue” (Room 3)<br />
Wednesday, July 16, 14:00–14:45:<br />
Bernhard Schölkopf, Max-Planck-Institute, Tübingen, Germany<br />
“Machine Learning Applications of Positive Definite Kernels”<br />
(Room 5)<br />
Thursday, July 17, 09:00–09:40:<br />
Patrick Groenen, University Rotterdam, The Netherlands<br />
“Support Vector Machines in the Primal using Majorization and<br />
Kernels” (Room 5)<br />
Gilles Bisson, La Tronche, France<br />
“Clustering of Molecules and Structured Data” (Room 3)<br />
Thursday, July 17, 15:55–16:35:<br />
Sabine Krolak-Schwerdt, University Wuppertal, Germany<br />
“Strategies of Model Construction for the Analysis of Judgment<br />
Data” (Room 3)<br />
Gilles Celeux, INRIA, France<br />
“Choosing the Number of Clusters in the Latent Class Model”<br />
(Room 5)<br />
Friday, July 18, 09:00–09:40:<br />
Francesco Palumbo, University of Macerata, Italy<br />
“Clustering and Dimensionality Reduction to Discover Interesting<br />
Patterns in Binary Data” (Room 5)<br />
Raimund Wildner, GfK, Nürnberg, Germany<br />
“Management and Methods: How to do Market Segmentation<br />
Projects” (Room 3)<br />
Friday, July 18, 11:20–12:00:<br />
Adi Ben-Israel, Rutgers University, USA<br />
“Probabilistic Distance Clustering” (Room 5)<br />
Tadashi Imaizumi, Tama University, Tokyo, Japan<br />
“Dimensionality Reduction of Similarity Matrix” (Room 3)<br />
Friday, July 18, 13:15–14:00:<br />
Fred R. McMorris, Illinois Institute of Technology, USA<br />
“Majority-rule Consensus: From Preferences (Social Choice) to<br />
Trees (Biology and Classification Theory)” (Room 5)<br />
Invited Sessions<br />
Wednesday, July 16, 14:50–16:05 (Room 2)<br />
VOC<br />
(Chairs: van de Velden, Wehrens)<br />
Wednesday, July 16, 16:50–18:30 (Room 2)<br />
PLS Path Modeling<br />
(Chair: Esposito Vinzi)<br />
Thursday, July 17, 09:45–11:00 (Room 2)<br />
Microarrays in Clinical Research<br />
(Chairs: Lausen, Ultsch)<br />
Thursday, July 17, 14:00–15:40 (Room 2)<br />
BCS<br />
(Chairs: Hennig, Murtagh)<br />
Detailed Schedule<br />
Tuesday July 15, 2008<br />
13:30-17:30 Pre-conference Workshop, Room 4<br />
Lenz, Hans-J.: Data Quality: Defining, Measuring and Improving<br />
20:00 Informal Get Together (Hotel Baseler Hof, Esplanade 11)<br />
Wednesday July 16, 2008<br />
09:00-10:00 Opening Ceremony (Chair: Seidel), Room 5<br />
09:00 Welcome: Claus Weihs (President of GfKl); Wilfried Seidel (Local Organizers)<br />
09:05 Welcome: Herlind Gundelach (Senator for Science and Research, State Hamburg)<br />
09:15 Welcome: Hans Christoph Zeidler (President of Helmut-Schmidt-University Hamburg)<br />
09:30 GfKl Best Paper Award 2007, Presentation and Laudation: Claus Weihs (President of GfKl); N.N.<br />
09:50 Program Overview: Berthold Lausen, Alfred Ultsch (Co-Chairs Program Committee)<br />
10:00-10:45 Plenary Lecture (Chair: Seidel), Room 5<br />
Radermacher, Walter: Statistical processes under change − Enhancing data quality with pretests (p. 118)<br />
10:00-18:00 Workshop: Libraries (see separate schedule), Room 403<br />
10:45-11:00 Coffee<br />
11:00-11:40 Semi-plenary Lectures<br />
McLachlan, Geoffrey John: Clustering of High-Dimensional Data Via Finite Mixture Models (Chair: McMorris), Room 5 (p. 93)<br />
Hamprecht, Fred A. et al.: Segmentation of Neural Tissue (Chair: Wehrens), Room 3 (p. 5)<br />
Session Mixture Analysis I: Testing (Chair: Seidel), Room 3<br />
11:45-12:10 Gassiat, Elisabeth: Likelihood ratio test for general mixture models (p. 49)<br />
12:10-12:35 Holzmann, Hajo; Dannemann, Jörn: Likelihood ratio testing for hidden Markov models (p. 68)<br />
12:35-13:00 Pommeret, Denys: Testing distribution in errors in variables models (p. 113)<br />
Session Pattern Recognition and Machine Learning I (Chair: Groenen), Room 405/6<br />
11:45-12:10 Haasdonk, Bernard; Pekalska, Elzbieta: Classification with Regularized Kernel Mahalanobis-Distances (p. 57)<br />
12:10-12:35 Louw, Nelmarie; Lamont, Morne; Steel, Sarel: Identifying Atypical Cases in Kernel Fisher Discriminant Analysis by using the Smallest Enclosing Hypersphere (p. 87)<br />
12:35-13:00 Trzesiok, Michal: Relevant Importance of Predictor Variables in Support Vector Machines Models (p. 152)<br />
Session Collective Intelligence (Chair: Geyer-Schulz), Room 101/3<br />
11:45-12:10 Geyer-Schulz, Andreas; Hoser, Bettina: The Potential of Social Intelligence for Collective Intelligence (p. 51)<br />
12:10-12:35 Mylonas, Phivos; Solachidis, Vassilios; Geyer-Schulz, Andreas; Hoser, Bettina; Chapman, Sam; Ciravegna, Fabio; Staab, Steffen; Smrz, Pavel; Kompatsiaris, Yiannis; Avrithis, Yannis: Efficient Media Exploitation towards Collective Intelligence (p. 100)<br />
12:35-13:00 Solachidis, Vassilios; Mylonas, Phivos; Geyer-Schulz, Andreas; Hoser, Bettina; Chapman, Sam; Ciravegna, Fabio; Staab, Steffen; Contopoulos, Costis; Gkika, Ioanna; Smrz, Pavel; Kompatsiaris, Yiannis; Avrithis, Yannis: Generating Collective Intelligence (p. 137)<br />
Session Evaluation of Clustering Algorithms and Data Structures (Chair: Leisch), Room 2<br />
11:45-12:10 Kaiser, Sebastian; Leisch, Friedrich: Benchmarking Bicluster Algorithms (p. 75)<br />
12:10-12:35 Bade, Korinna; Benz, Dominik: Evaluation Strategies for Learning Algorithms of Hierarchical Structures (p. 8)<br />
12:35-13:00 Grün, Bettina; Leisch, Friedrich: Model diagnostics of finite mixtures using bootstrapping (p. 56)<br />
Session Genome and DNA Analysis (Chair: Klenk), Room 6<br />
11:45-12:10 Klenk, Hans-Peter: Polyphasic genomic approach for the taxonomy of archaea and bacteria (p. 78)<br />
12:10-12:35 Huson, Daniel H.; Rupp, Regula: Using Cluster Networks to Represent Non-Compatible Sets of Clusters (p. 71)<br />
12:35-13:00 Hütt, Marc-Thorsten: Genome phylogeny based on short-range correlations in DNA sequences (p. 72)<br />
Session Market Risk and Credit Risk (Chair: Locarek-Junge), Room 4<br />
11:45-12:10 Bravo, Cristian; Maldonado, Sebastian; Weber, Richard: Practical experiences from Credit Scoring projects for Chilean financial organizations (p. 21)<br />
12:10-12:35 Kuziak, Katarzyna: An application of copula functions to market risk management (p. 83)<br />
12:35-13:00 Rokita, Pawel; Piontek, Krzysztof: Extreme unconditional dependence vs. multivariate GARCH effect in the analysis of dependence between high losses on Polish and German stock indexes (p. 121)<br />
13:00-14:00 Lunch (and Meetings)<br />
14:00-14:45 Plenary Lecture (Chair: Lausen), Room 5<br />
Schölkopf, Bernhard: Machine Learning applications of positive definite kernels (p. 132)<br />
Session Mixture Analysis II: Clustering and Classification (Chair: Montanari), Room 3<br />
14:50-15:15 Pons, Odile: Classification with an increasing number of components (p. 114)<br />
15:15-15:40 Lukociene, Olga; Vermunt, Jeroen K.: Determining the number of components in mixture models for hierarchical data (p. 90)<br />
15:40-16:05 Calò, Daniela G.; Viroli, Cinzia: Visualizing data in Gaussian mixture model classification (p. 24)<br />
16:05-16:30 Latouche, Pierre J.; Ambroise, Christophe; Birmelé, Etienne: Bayesian Methods for Graph Clustering (p. 85)<br />
Session Pattern Recognition and Machine Learning II (Chair: Nalbantov), Room 405/6<br />
14:50-15:15 Stecking, Ralf; Schebesch, Klaus B.: Generating Fictitious Training Data for Credit Client Classification (p. 140)<br />
15:15-15:40 Hüllermeier, Eyke; Vanderlooy, Stijn: Combining Predictions in Pairwise Classification: An Adaptive Voting Strategy and Its Relation to Weighted Voting (p. 70)<br />
15:40-16:05 Hühn, Jens; Hüllermeier, Eyke: Rule-Based Learning of Reliable Classifiers (p. 69)<br />
Session Invited Session: VOC (Chairs: van de Velden, Wehrens), Room 2<br />
14:50-15:15 Timmerman, Marieke E.; Lichtwarck-Aschoff, Anna; Ceulemans, Eva: Multilevel Simultaneous Component Analysis for Studying Inter-individual and Intra-individual Variabilities (p. 149)<br />
15:15-15:40 van der Heijden, Peter G.M.: Estimating the prevalence of rule transgression (p. 158)<br />
15:40-16:05 van der Ark, Andries L.; Straat, J. Hendrik: Selection of items for tests and questionnaires using Mokken scale analysis (p. 157)<br />
Session Ensemble Methods and Other Subjects (Chair: Boulesteix), Room 6<br />
14:50-15:15 Häberle, Lothar: On classification of species of representation rings (p. 58)<br />
15:15-15:40 Strobl, Carolin; Zeileis, Achim: A New, Conditional Variable Importance Measure for Random Forests (p. 143)<br />
15:40-16:05 Potapov, Sergej; Lausen, Berthold: Bagging with different split criteria (p. 115)<br />
16:05-16:30 Adler, Werner; Brenning, Alexander; Lausen, Berthold: Classification of Paired Data Using Ensemble Methods (p. 4)<br />
Session Market Risk and Credit Risk (Chair: Locarek-Junge), Room 4<br />
14:50-15:15 Piontek, Krzysztof: The Analysis of the power for some chosen VaR backtesting procedures - simulation approach (p. 112)<br />
15:15-15:40 Koralun-Bereznicka, Julia: Multivariate comparative analysis of stock exchanges - the European perspective (p. 81)<br />
15:40-16:05 Dias, José G.; Vermunt, Jeroen K.; Ramos, Sofia: Mixture Hidden Markov Models in Finance Research (p. 33)<br />
16:05-16:30 Sardet, Laure; Patilea, Valentin: Beta-kernel density estimation using mixture-based transformations: an application to claims distribution (p. 125)<br />
16:30-16:50 Coffee<br />
Session Mixture Analysis III: Model Fitting, Estimation and Applications (Chair: McLachlan), Room 3<br />
16:50-17:15 Greselin, Francesca; Ingrassia, Salvatore: A note on constrained EM algorithms for mixtures of elliptical distributions (p. 53)<br />
17:15-17:40 Neykov, Neyko; Filzmoser, Peter; Neytchev, Plamen: Robust fitting of mixtures: The approach based on the Trimmed Likelihood Estimator (p. 103)<br />
17:40-18:05 Garel, Bernard; Boucharel, Julien; Dewitte, Boris; du Penhoat, Yves: Non-Gaussian nature of ENSO signals and climate shifts: implications for regional studies off the western coast of South America (p. 48)<br />
18:05-18:30 Schlattmann, Peter: Comparison of four estimators of the heterogeneity variance for meta-analysis (p. 131)<br />
18:30-18:55 Schiffner, Julia; Weihs, Claus: Localized Classification Using Mixture Models (p. 130)<br />
Session Pattern Recognition and Machine Learning III (Chair: Hüllermeier), Room 405/6<br />
16:50-17:15 Wehrens, Ron: Supervised Self-Organising Maps and More (p. 160)<br />
17:15-17:40 Worm, Katja; Meffert, Beate: Image Based Mail Piece Identification using Unsupervised Learning (p. 166)<br />
17:40-18:05 Barbosa, Rui Pedro; Belo, Orlando: Autonomous Forex Trading Agents (p. 9)<br />
18:05-18:30 Chiou, Hua-Kai; Huang, Yong-Ting; Liu, Gia-Shie: Applying Rough Set Theory to Constructing Knowledge Base for Critical Military Commodity Management (p. 27)<br />
Session Invited Session: PLS Path Modeling (Chair: Esposito Vinzi), Room 2<br />
16:50-17:15 Ringle, Christian: FIMIX-PLS Segmentation of Data for Path Models with Multiple Endogenous LVs (p. 120)<br />
17:15-17:40 Trinchera, L.; Esposito Vinzi, Vincenzo: A Comprehensive Partial Least Squares Approach to Component-Based Structural Equation Modeling (p. 151)<br />
17:40-18:05 Henseler, Jörg: Nonlinear Effects in PLS Path Models: A Comparison of Available Approaches (p. 63)<br />
18:05-18:30 Betzin, Jörg: Categorical Data in PLS Path modeling (p. 14)<br />
Session Microarray Data Analysis (Chair: Benner), Room 6<br />
16:50-17:15 Boulesteix, Anne-Laure; Slawski, Martin: On optimistic bias in reporting microarray-based classification accuracy (p. 20)<br />
17:15-17:40 Slawski, Martin; Boulesteix, Anne-Laure; Daumer, Martin: 'CMA' - Steps in developing a comprehensive R-toolbox for classification with microarray data and other high-dimensional problems (p. 136)<br />
17:40-18:05 Scharl, Theresa; Leisch, Friedrich: Quality-Based Clustering of Functional Data: Applications to Time Course Microarray Data (p. 127)<br />
18:05-18:30 Martin-Magniette, Marie-Laure; Mary-Huard, Tristan; Bérard, Caroline; Robin, Stéphane: ChIPmix: Mixture model of regressions for ChIP-chip experiment analysis (p. 92)<br />
Session Investments and Capital Markets (Chair: Ultsch), Room 4<br />
16:50-17:15 Locarek-Junge, Hermann; Mihm, Max: Fundamental Indexation - testing the concept in the German stock market (p. 86)<br />
17:15-17:40 Ultsch, Alfred: Is log ratio a good value for measuring return in stock investments? (p. 154)<br />
17:40-18:05 Klein, Christian; Kundisch, Dennis: Index-Based Investment Vehicles - A Comparative Study for the German DAX (p. 77)<br />
18:05-18:30 Bessler, Wolfgang; Holler, Julian: Hedge Funds in a Bayesian Asset Allocation Framework: Incorporating Information on market states and manager's ability (p. 13)<br />
19:00 Reception (Building “M1”)<br />
Thursday July 17, 2008<br />
09:00-09:40 Semi-plenary Lectures<br />
Groenen, Patrick J.F. et al.: Support Vector Machines in the Primal using Majorization and Kernels (Chair: Okada), Room 5 (p. 54)<br />
Bisson, Gilles: Clustering of molecules and structured data (Chair: Hennig), Room 3 (p. 16)<br />
09:45-17:00 Workshop: Decimal Classification (see separate schedule), Room 403<br />
Session Clustering and Classification I (Chair: Bock), Room 3<br />
09:45-10:10 Godehardt, Erhard; Jaworski, Jerzy; Rybarczyk, Katarzyna: Isolated vertices in random intersection graphs (p. 52)<br />
10:10-10:35 Rozmus, Dorota: Cluster ensemble based on co-occurrence data (p. 123)<br />
10:35-11:00 Enyukov, Igor: Regression-autoregression based clustering (p. 37)<br />
Session Bayesian, Neural, and Fuzzy Clustering I (Chair: Kruse), Room 405/6<br />
09:45-10:10 Borgelt, Christian: Weighting and Selecting Features in Fuzzy Clustering (p. 18)<br />
10:10-10:35 Neumann, Anneke; Ambrosi, Klaus; Hahne, Felix: Approach for Dynamic Problems in Clustering (p. 102)<br />
10:35-11:00 Winkler, Roland; Rehm, Frank; Kruse, Rudolf: Clustering with Repulsive Prototypes (p. 163)<br />
Session Invited Session: Microarrays in Clinical Research (Chairs: Lausen, Ultsch), Room 2<br />
09:45-10:25 Ultsch, Alfred: Comparison of Algorithms to find differentially expressed Genes in Microarray Data (p. 153)<br />
10:25-11:00 Hielscher, Thomas; Zucknick, Manuela; Werft, Wiebke; Benner, Axel: On the prognostic value of gene expression signatures for censored data (p. 67)<br />
Session Statistical Musicology I (Chair: Weihs), Room 6<br />
09:45-10:10 Eigenfeldt, Arne; Kapur, Ajay: Multimodal Performance Analysis of Electronic Sitar (p. 35)<br />
10:10-10:35 Sommer, Katrin; Weihs, Claus: Analysis of polyphonic musical time series (p. 138)<br />
10:35-11:00 Desmet, Frank Michel; Leman, Marc; Lesaffre, Micheline: Statistical analysis of human body movement and group interactions in response to music (p. 32)<br />
Session Marketing and Management Science I (Chair: van den Poel), Room 4<br />
09:45-10:10 Wagner, Ralf; Sauerwald, Erik: Clustering Consumers with Respect to Their Marketing Reactance Behavior (p. 159)<br />
10:10-10:35 Sagan, Adam; Kowalska-Musial, Magdalena: Dyadic Interactions in Service Encounter - Bayesian SEM Approach (p. 124)<br />
10:35-11:00 Becker, Niels; Werners, Brigitte: Improving Product Line Design with Bundling (p. 10)<br />
11:00-11:20 Coffee<br />
Session Clustering and Classification II (Chair: Vichi), Room 3<br />
11:20-11:45 Nugent, Rebecca; Stuetzle, Werner: Cluster Tree Estimation using a Generalized Single Linkage Method (p. 104)<br />
11:45-12:10 Herrmann, Lutz; Ultsch, Alfred: Strengths and Weaknesses of Ant Colony Clustering (p. 65)<br />
12:15-12:45 Software Presentation: Eichenberg, Thilo (StatSoft): STATISTICA, Room 3<br />
Session Bayesian, Neural, and Fuzzy Clustering II (Chair: Kruse), Room 405/6<br />
11:20-11:45 Gabriel, Thomas R.; Thiel, Kilian; Berthold, Michael R.: Multi-Dimensional Scaling applied to Hierarchical Fuzzy Rule Systems (p. 45)<br />
11:45-12:10 Fritsch, Arno; Ickstadt, Katja: An Improved Criterion for Clustering Based on the Posterior Similarity Matrix (p. 43)<br />
12:10-12:35 Steinbrecher, Matthias; Kruse, Rudolf: Clustering Association Rules with Fuzzy Concepts (p. 141)<br />
Session Text Mining (Chair: Schmidt-Thieme), Room 101/3<br />
11:20-11:45 Karatzoglou, Alexandros; Feinerer, Ingo; Hornik, Kurt: Nonparametric distribution analysis for text mining (p. 76)<br />
11:45-12:10 Schierle, Martin; Trabold, Daniel: Multilingual knowledge based concept recognition in textual data (p. 128)<br />
12:10-12:35 Hermes, Jürgen; Schwiebert, Stephan: Classification of text processing components: The Tesla Role System (p. 64)<br />
12:35-13:00 Thorleuchter, Dirk: Mining ideas from textual information (p. 147)<br />
Session Modelling Exchange from Archaeological Evidence (Chair: Kerig), Room 2<br />
11:20-11:45 Schyle, Daniel: The Late Neolithic flint axe production on the Lousberg (Aachen, Germany) – An extrapolation of supply and demand and population density (p. 134)<br />
11:45-12:10 Dolata, Jens; Mucha, Hans-Joachim; Bartel, Hans-Georg: Mapping Findspots of Roman Military Brickstamps in Mogontiacum (Mainz) and Archaeometrical Analysis (p. 34)<br />
12:10-12:35 Herzog, Irmela: Reconstructing Central Places and Settlements Groups (p. 66)<br />
Session Statistical Musicology II (Chair: Weihs), Room 6<br />
11:20-11:45 Meyer, Florian; Ultsch, Alfred: Finding Music Fads by clustering Online Radio Data with Emergent Self-Organizing Maps (p. 96)<br />
11:45-12:10 Lukashevich, Hanna; Dittmar, Christian; Bastuck, Christoph: Applying Statistical Models and Parametric Distance Measures for Music Similarity Search (p. 89)<br />
12:10-12:35 Fricke, Jobst P.: A statistical theory of musical consonance proved in praxis (p. 42)<br />
Session Marketing and Management Science II (Chair: Decker), Room 4<br />
11:20-11:45 Gazda, Vladimir: On a Location of the Retail Units and Equilibrium Price Determination (p. 50)<br />
11:45-12:10 Zeileis, Achim; Kleiber, Christian: Recursive Partitioning of Economic Regressions: Trees of Costly Journals and Beautiful Professors (p. 168)<br />
12:10-12:35 van de Velden, Michel; de Beuckelaer, Alain; Groenen, Patrick; Busing, Frank: Visualizing preferences using minimum variance nonmetric unfolding (p. 156)<br />
13:00-14:00 Lunch (and Meetings)<br />
Session Linguistics (Chairs: Goebl, Grzybek), Room 3<br />
14:00-14:25 Rapp, Reinhard; Zock, Michael: Automatic Dictionary Expansion Using Non-parallel Corpora (p. 119)<br />
14:25-14:50 Fenk-Oczlon, Gertraud; Fenk, August: Cross-linguistic regularities in the monosyllabic system (p. 40)<br />
14:50-15:15 Rolshoven, Jürgen: Grundzüge einer generativen Korpuslinguistik [Foundations of a Generative Corpus Linguistics] (p. 122)<br />
15:15-15:40 Petersen, Wiebke: Lineare Kodierung multipler Vererbungshierarchien: Wiederbelebung einer antiken Klassifikationsmethode [Linear Coding of Multiple Inheritance Hierarchies: Reviving an Ancient Classification Method] (p. 110)<br />
Session Invited Session: BCS (Chairs: Hennig, Murtagh), Room 2<br />
14:00-14:25 Dean, Nema; Nugent, Rebecca: Augmenting Model-Based Clustering with Generalized Linkage methods (p. 31)<br />
14:25-14:50 Mirkin, Boris: Deviant box and dual clusters for the analysis of conceptual contexts (p. 97)<br />
14:50-15:15 Critchley, Frank; Pires, Ana; Amado, Conceicao: Principal Axis Analysis – with HDLSS bonuses! (p. 30)<br />
15:15-15:40 Hennig, Christian; Hausdorf, Bernhard: Using cluster analysis for species delimitation (p. 62)<br />
Session Processes in Industry (Chair: Joos), Room 6<br />
14:00-14:25 Hahlweg, Cornelius; Rothe, Hendrik: Auswertung hochaufgelöster Streulichtdaten mit Methoden der multivariaten Statistik [Analysis of High-Resolution Scattered-Light Data with Multivariate Statistical Methods] (p. 59)<br />
14:25-14:50 Raabe, Nils; Enk, Dirk; Weihs, Claus; Biermann, Dirk: Dynamic disturbances in BTA deep-hole drilling - Identification of spiralling as a regenerative effect (p. 117)<br />
14:50-15:15 Meier, René; Joos, Franz: Optimization Methods with Evolutionary Algorithms and Artificial Neural Networks (p. 95)<br />
15:15-15:40 Große, Lars; Joos, Franz: Usage of Artificial Neural Networks for Data Handling (p. 55)<br />
Session Marketing and Management Science III (Chair: van den Poel), Room 4<br />
14:00-14:25 Lübke, Karsten; Papenhoff, Heike: Latent growth models for analyzing a multi partner reward program (p. 88)<br />
14:25-14:50 Wilczynski, Petra; Sarstedt, Marko: Multi-Item Versus Single-Item Measures: A Review and Future Research Directions (p. 161)<br />
14:50-15:15 Sommerfeld, Angela: Trust as a Key Determinant of Loyalty and its Moderators (p. 139)<br />
15:15-15:40 Kneib, Thomas; Baumgartner, Bernhard; Steiner, Winfried J.: Time-Varying Parameters in Brand Choice Models (p. 80)<br />
15:40-15:55 Coffee<br />
15:55-16:35 Semi-plenary Lectures<br />
Celeux, Gilles Paul et al.: Choosing the number of clusters in the latent class model (Chair: Bock), Room 5 (p. 15)<br />
Krolak-Schwerdt, Sabine: Strategies of model construction for the analysis of judgement data (Chair: Decker), Room 3 (p. 82)<br />
Session Clustering and Classification III (Chair: Geyer-Schulz), Room 3<br />
16:40-17:05 Müller-Funk, Ulrich; Dlugosz, Stephan: Predictive classification trees (p. 99)<br />
17:05-17:30 Azam, Muhammad; Ostermann, Alexander; Pfeiffer, Karl-Peter: Evaluation Criteria for the Construction of Binary Classification Trees with Two or More Classes (p. 7)<br />
17:30-17:55 Gantner, Zeno; Schmidt-Thieme, Lars: Scalable and Incrementally Updated Hybrid Recommender Systems (p. 47)<br />
Session Optimization in Statistics (Chair: Ritter), Room 405/6<br />
16:40-17:05 Hansohm, Jürgen: Algorithms for Computing the Multivariate Isotonic Regression (p. 60)<br />
17:05-17:30 Schachtner, Reinhard; Pöppel, Gerhard; Lang, Elmar: Nonnegative Matrix Factorization for Binary Data to Extract Elementary Failure Maps from Wafer Test Images (p. 126)<br />
17:30-17:55 Nalbantov, Georgi Ilkov; Groenen, Patrick J.F.; Bioch, Cor: Support Vector Machines in the Dual using Majorization and Kernels (p. 101)<br />
Session Computational Intelligence and Metaheuristics (Chair: Fink), Room 101/3<br />
16:40-17:05 Winkler, Stephan; Affenzeller, Michael; Wagner, Stefan; Kronberger, Gabriel: On the Effects of Enhanced Selection Models on Quality and Comparability of Classifiers Produced by Genetic Programming (p. 164)<br />
17:05-17:30 Caserta, Marco; Lessmann, Stefan: A novel approach to construct discrete support vector machine classifiers (p. 25)<br />
17:30-17:55 Thorleuchter, Dirk: Mining technologies in security and defense (p. 148)<br />
Session Miscellaneous Models (Archeology) (Chair: Posluschny), Room 2<br />
16:40-17:05 Okada, Akinori; Sakaehara, Towao: Analysis of Borrowing and Guaranteeing Relationships among Government Officials at the Eighth Century in the Old Capital of Japan (p. 106)<br />
17:05-17:30 Gans, Ulrich-Walter; Lang, Matthias: ArcheoInf - Leistungszentrum für die digitale Unterstützung feldarchäologischer Projekte [ArcheoInf - A Competence Center for the Digital Support of Field-Archaeology Projects] (p. 46)<br />
Session Education and Psychology (Chair: Krolak-Schwerdt), Room 6<br />
16:40-17:05 Fuchs, Sebastian; Sarstedt, Marko: On the Use of Student Samples in Major Marketing Research Journals. A Meta-Study (p. 44)<br />
17:05-17:30 Strobl, Carolin; Leisch, Friedrich: Who's Afraid of Statistics? - Measurement and Predictors of Statistics Anxiety in German University Students (p. 142)<br />
17:30-17:55 Ünlü, Ali: Mosaic Plots and Knowledge Structures (p. 155)<br />
Session Marketing and Management Science IV (Chair: Decker), Room 4<br />
17:05-17:30 Lam, Kar Yin; Koning, Alex J.; Franses, Philip Hans: Testing preference rankings (p. 84)<br />
17:30-17:55 Wagner, Ralf; Klaus, Martin: Exploring the Interaction Structure of Weblogs (p. 91)<br />
18:00-19:00 General Assembly of <strong>GfKl</strong>, Room 5<br />
20:00 Conference Dinner (Handwerkskammer, Holstenwall 12, bus transfer 19:30)<br />

Friday July 18, <strong>2008</strong><br />

09:00-09:40 Semi-plenary Lectures<br />
Palumbo, Francesco: Clustering and Dimensionality Reduction to Discover Interesting Patterns in Binary Data (Chair: Ultsch), Room 5 (p. 109)<br />
Wildner, Raimund: Management and Methods: How to do Market Segmentation Projects (Chair: Fantapié Altobelli), Room 3 (p. 162)<br />

Coffee<br />

Session Clustering and Classification IV (Chair: Groenen), Room 3<br />
10:00-10:25 Buza, Krisztian Antal; Schmidt-Thieme, Lars: Motif-based Classification of Time Series with Bayesian Networks and SVMs (p. 23)<br />
10:25-10:50 Tomas, Amber: Issues related to the implementation of a dynamic logistic model for classifier combination (p. 150)<br />
10:50-11:15 Oosthuizen, Surette; Steel, Sarel J.: Variable selection for kernel classifiers: a feature-to-input space approach (p. 107)<br />

Session Visualization and Scaling Methods I (Chair: Hennig), Room 405/6<br />
10:00-10:25 Mucha, Hans-Joachim: Clustering a Contingency Table Accompanied by Visualization (p. 98)<br />
10:25-10:50 Bocci, Laura; Vichi, Maurizio: The K-INDSCAL Model for Heterogeneous Three-way Dissimilarity Data (p. 17)<br />
10:50-11:15 Cortina-Borja, Mario: Extending Multivariate Planing (p. 29)<br />
− xxiv −
Session Exploratory Data Analysis I (Chair: Wehrens), Room 101/3<br />
10:00-10:25 Chiou, Hua-Kai; Yuan, Benjamin J.C.; Wang, Yen-Wen: Correspondence Analysis for Exploring the Implementation of One Village One Product Programs in Taiwan (p. 28)<br />
10:25-10:50 Cernian, Alexandra; Carstoiu, Dorin; Ionescu, Tudor: Modeling the Classification of Heterogeneous Data (p. 26)<br />
10:50-11:15 Einbeck, Jochen; Evers, Ludger: Data compression and regression based on local principal curves (p. 36)<br />

Session Spatial Planning I (Chair: Behnisch), Room 2<br />
10:00-10:25 Behnisch, Martin; Ultsch, Alfred: Estimating the number of buildings in Germany (p. 11)<br />
10:25-10:50 Thiel, Klaus: Optimal VDSL Expansion taking into Consideration of Infrastructure Restrictions and Marketing Requirements (p. 145)<br />
10:50-11:15 Aden, Christian; Mucha, Hans-Joachim; Schmidt, Gunther; Schröder, Winfried: WaldIS - a web based reference system for the forest monitoring in North Rhine-Westphalia (p. 3)<br />

Session Medical and Health Sciences I (Chair: Lausen), Room 6<br />
10:00-10:25 Augustin, Thomas; Wallner, Matthias: On the power of corrected score functions to adjust for measurement error (p. 6)<br />
10:25-10:50 Sieben, Wiebke: Time Related Features for Alarm Classification in Intensive Care Monitoring (p. 135)<br />
10:50-11:15 Ostermann, Thomas; Schuster, Reinhard; Erben, Christoph: Classifying hospitals with respect to their diagnostic diversity using Shannon's entropy (p. 108)<br />

Session Market Research, Controlling, OR I (Chair: Baier), Room 4<br />
10:00-10:25 Brusch, Michael; Baier, Daniel: Analyzing the Stability of Price Response Functions - Measuring the Influence of Different Parameters in a Monte Carlo Comparison (p. 22)<br />
10:25-10:50 Tarka, Piotr: Conjoint Analysis within the field of customer satisfaction problems – a model of composite product/service (p. 144)<br />
− xxv −
10:50-11:15 Punzo, Antonio: Considerations on the impact of JML-ill-conditioned configurations in the CML approach (p. 116)<br />

11:20-12:00 Semi-plenary Lectures<br />
Ben-Israel, Adi: Probabilistic Distance Clustering (Chair: Vichi), Room 5 (p. 12)<br />
Imaizumi, Tadashi: Dimensionality Reduction of Similarity Matrix (Chair: Gaul), Room 3 (p. 73)<br />

12:00-12:20 Coffee<br />

Session Clustering and Classification V (Chair: Godehardt), Room 3<br />
12:20-12:45 Kludas, Jana; Bruno, Eric; Marchand-Maillet, Stéphane: Exploiting synergetic and redundant features for multimedia document classification (p. 79)<br />
12:45-13:30 Schiffner, Julia; Szepannek, Gero; Monthé, Thierry; Weihs, Claus: Localized Logistic Regression for Discrete Influential Factors (p. 129)<br />

Session Visualization and Scaling Methods II (Chair: van de Velden), Room 405/6<br />
12:20-12:45 Adachi, Kohei: Joint Procrustes Analysis with Constrained Simplimax Rotation: Nonsingular Transformation of Component Score and Loading Matrices Toward Simple Structure (p. 2)<br />
12:45-13:30 Fernández-Aguirre, Karmele; Garín-Martín, María Araceli: Validity of images from binary coding tables. Student motivation surveys: some evidence (p. 41)<br />

Session Exploratory Data Analysis II (Chair: Wehrens), Room 101/3<br />
12:20-12:45 Zarraga, Amaya; Goitisolo, Beatriz: Factor Analysis of Incomplete Disjunctive Tables (p. 167)<br />
12:45-13:30 Nusser, Sebastian; Otte, Clemens; Hauptmann, Werner: Multi-Class Extension of Verifiable Ensemble Models for Safety-Related Applications (p. 105)<br />
− xxvi −
Session Spatial Planning II (Chair: Behnisch), Room 2<br />
12:20-12:45 Witek, Ewa: Analysis of massive emigration from Poland - the model-based clustering approach (p. 165)<br />
12:45-13:30 Thinh, Nguyen Xuan; Küttner, Leander; Meinel, Gotthard: Evaluate the data structure and identify homogenous spatial units in the data base "Sustainability issues in sensitive areas" of the EU-FP6 Integrated Project SENSOR (p. 146)<br />

Session Medical and Health Sciences II (Chair: Lausen), Room 6<br />
12:20-12:45 Henker, Uwe; Ultsch, Alfred; Petersohn, Uwe: Die präzise und effiziente Erkennung von medizinischen Anforderungsformularen (p. 61)<br />
12:45-13:30 Schuster, Reinhard; von Arnstedt, Eva: Age Distributions for costs in drug prescription by practitioners and for DRG-based hospital treatment (p. 133)<br />

Session Market Research, Controlling, OR II (Chair: Baier), Room 4<br />
12:20-12:45 Esber, Said; Baier, Daniel: Realoptionen bei der Bewertung von neuen Produkten (p. 38)<br />
12:45-13:30 Abu Assab, Samah; Baier, Daniel: Designing Products Using Quality Function Deployment and Conjoint Analysis: A Comparison in a Market for Elderly People (p. 1)<br />

13:15-14:00 Plenary Lecture (Chair: Weihs), Room 5<br />
McMorris, Fred R.: Majority-rule consensus: from preferences (social choice) to trees (biology and classification theory) (p. 94)<br />

14:00-15:00 Informal Farewell (Conference site)<br />
− xxvii −
List of Contributions<br />
Authors: Title (Page)<br />
Abu Assab, Samah; Baier, Daniel: Designing Products Using Quality Function Deployment and Conjoint Analysis: A Comparison in a Market for Elderly People (p. 1)<br />
Adachi, Kohei: Joint Procrustes Analysis with Constrained Simplimax Rotation: Nonsingular Transformation of Component Score and Loading Matrices Toward Simple Structure (p. 2)<br />
Aden, Christian; Mucha, Hans-Joachim; Schmidt, Gunther; Schröder, Winfried: WaldIS - a web based reference system for the forest monitoring in North Rhine-Westphalia (p. 3)<br />
Adler, Werner; Brenning, Alexander; Lausen, Berthold: Classification of Paired Data Using Ensemble Methods (p. 4)<br />
Andres, Bjoern; Koethe, Ullrich; Helmstaedter, Moritz; Denk, Winfried; Hamprecht, Fred: Segmentation of Neural Tissue (p. 5)<br />
Augustin, Thomas; Wallner, Matthias: On the power of corrected score functions to adjust for measurement error (p. 6)<br />
Azam, Muhammad; Ostermann, Alexander; Pfeiffer, Karl-Peter: Evaluation Criteria for the Construction of Binary Classification Trees with Two or More Classes (p. 7)<br />
Bade, Korinna; Benz, Dominik: Evaluation Strategies for Learning Algorithms of Hierarchical Structures (p. 8)<br />
Barbosa, Rui Pedro; Belo, Orlando: Autonomous Forex Trading Agents (p. 9)<br />
Becker, Niels; Werners, Brigitte: Improving Product Line Design with Bundling (p. 10)<br />
Behnisch, Martin; Ultsch, Alfred: Estimating the number of buildings in Germany (p. 11)<br />
Ben-Israel, Adi: Probabilistic Distance Clustering (p. 12)<br />
Bessler, Wolfgang; Holler, Julian: Hedge Funds in a Bayesian Asset Allocation Framework: Incorporating Information on market states and manager's ability (p. 13)<br />
Betzin, Jörg: Categorical Data in PLS Path modeling (p. 14)<br />
Biernacki, Christophe; Celeux, Gilles Paul; Govaert, Gérard: Choosing the number of clusters in the latent class model (p. 15)<br />
Bisson, Gilles: Clustering of molecules and structured data (p. 16)<br />
Bocci, Laura; Vichi, Maurizio: The K-INDSCAL Model for Heterogeneous Three-way Dissimilarity Data (p. 17)<br />
Borgelt, Christian: Weighting and Selecting Features in Fuzzy Clustering (p. 18)<br />
Boulesteix, Anne-Laure; Slawski, Martin: On optimistic bias in reporting microarray-based classification accuracy (p. 20)<br />
− xxviii −
Bravo, Cristian; Maldonado, Sebastian; Weber, Richard: Practical experiences from Credit Scoring projects for Chilean financial organizations (p. 21)<br />
Brusch, Michael; Baier, Daniel: Analyzing the Stability of Price Response Functions - Measuring the Influence of Different Parameters in a Monte Carlo Comparison (p. 22)<br />
Buza, Krisztian Antal; Schmidt-Thieme, Lars: Motif-based Classification of Time Series with Bayesian Networks and SVMs (p. 23)<br />
Calò, Daniela G.; Viroli, Cinzia: Visualizing data in Gaussian mixture model classification (p. 24)<br />
Caserta, Marco; Lessmann, Stefan: A novel approach to construct discrete support vector machine classifiers (p. 25)<br />
Cernian, Alexandra; Carstoiu, Dorin; Ionescu, Tudor: Modeling the Classification of Heterogeneous Data (p. 26)<br />
Chiou, Hua-Kai; Huang, Yong-Ting; Liu, Gia-Shie: Applying Rough Set Theory to Constructing Knowledge Base for Critical Military Commodity Management (p. 27)<br />
Chiou, Hua-Kai; Yuan, Benjamin J.C.; Wang, Yen-Wen: Correspondence Analysis for Exploring the Implementation of One Village One Product Programs in Taiwan (p. 28)<br />
Cortina-Borja, Mario: Extending Multivariate Planing (p. 29)<br />
Critchley, Frank; Pires, Ana; Amado, Conceicao: Principal Axis Analysis – with HDLSS bonuses! (p. 30)<br />
Dean, Nema; Nugent, Rebecca: Augmenting Model-Based Clustering with Generalized Linkage methods (p. 31)<br />
Desmet, Frank Michel; Leman, Marc; Lesaffre, Micheline: Statistical analysis of human body movement and group interactions in response to music (p. 32)<br />
Dias, José G.; Vermunt, Jeroen K.; Ramos, Sofia: Mixture Hidden Markov Models in Finance Research (p. 33)<br />
Dolata, Jens; Mucha, Hans-Joachim; Bartel, Hans-Georg: Mapping Findspots of Roman Military Brickstamps in Mogontiacum (Mainz) and Archaeometrical Analysis (p. 34)<br />
Eigenfeldt, Arne; Kapur, Ajay: Multimodal Performance Analysis of Electronic Sitar (p. 35)<br />
Einbeck, Jochen; Evers, Ludger: Data compression and regression based on local principal curves (p. 36)<br />
Enyukov, Igor: Regression-autoregression based clustering (p. 37)<br />
Esber, Said; Baier, Daniel: Realoptionen bei der Bewertung von neuen Produkten (p. 38)<br />
Fenk-Oczlon, Gertraud; Fenk, August: Cross-linguistic regularities in the monosyllabic system (p. 40)<br />
Fernández-Aguirre, Karmele; Garín-Martín, María Araceli: Validity of images from binary coding tables. Student motivation surveys: some evidence (p. 41)<br />
Fricke, Jobst P.: A statistical theory of musical consonance proved in praxis (p. 42)<br />
− xxix −
Fritsch, Arno; Ickstadt, Katja: An Improved Criterion for Clustering Based on the Posterior Similarity Matrix (p. 43)<br />
Fuchs, Sebastian; Sarstedt, Marko: On the Use of Student Samples in Major Marketing Research Journals. A Meta-Study (p. 44)<br />
Gabriel, Thomas R.; Thiel, Kilian; Berthold, Michael R.: Multi-Dimensional Scaling applied to Hierarchical Fuzzy Rule Systems (p. 45)<br />
Gans, Ulrich-Walter; Lang, Matthias: ArcheoInf - Leistungszentrum für die digitale Unterstützung feldarchäologischer Projekte (p. 46)<br />
Gantner, Zeno; Schmidt-Thieme, Lars: Scalable and Incrementally Updated Hybrid Recommender Systems (p. 47)<br />
Garel, Bernard; Boucharel, Julien; Dewitte, Boris; du Penhoat, Yves: Non-Gaussian nature of ENSO signals and climate shifts: implications for regional studies off the western coast of South America (p. 48)<br />
Gassiat, Elisabeth: Likelihood ratio test for general mixture models (p. 49)<br />
Gazda, Vladimir: On a Location of the Retail Units and Equilibrium Price Determination (p. 50)<br />
Geyer-Schulz, Andreas; Hoser, Bettina: The Potential of Social Intelligence for Collective Intelligence (p. 51)<br />
Godehardt, Erhard; Jaworski, Jerzy; Rybarczyk, Katarzyna: Isolated vertices in random intersection graphs (p. 52)<br />
Greselin, Francesca; Ingrassia, Salvatore: A note on constrained EM algorithms for mixtures of elliptical distributions (p. 53)<br />
Groenen, Patrick J.F.; Nalbantov, Georgi; Bioch, Cor: Support Vector Machines in the Primal using Majorization and Kernels (p. 54)<br />
Große, Lars; Joos, Franz: Usage of Artificial Neural Networks for Data Handling (p. 55)<br />
Grün, Bettina; Leisch, Friedrich: Model diagnostics of finite mixtures using bootstrapping (p. 56)<br />
Haasdonk, Bernard; Pekalska, Elzbieta: Classification with Regularized Kernel Mahalanobis-Distances (p. 57)<br />
Häberle, Lothar: On classification of species of representation rings (p. 58)<br />
Hahlweg, Cornelius; Rothe, Hendrik: Auswertung hochaufgelöster Streulichtdaten mit Methoden der multivariaten Statistik (p. 59)<br />
Hansohm, Jürgen: Algorithms for Computing the Multivariate Isotonic Regression (p. 60)<br />
Henker, Uwe; Ultsch, Alfred; Petersohn, Uwe: Die präzise und effiziente Erkennung von medizinischen Anforderungsformularen (p. 61)<br />
Hennig, Christian; Hausdorf, Bernhard: Using cluster analysis for species delimitation (p. 62)<br />
Henseler, Jörg: Nonlinear Effects in PLS Path Models: A Comparison of Available Approaches (p. 63)<br />
Hermes, Jürgen; Schwiebert, Stephan: Classification of text processing components: The Tesla Role System (p. 64)<br />
− xxx −
Herrmann, Lutz; Ultsch, Alfred: Strengths and Weaknesses of Ant Colony Clustering (p. 65)<br />
Herzog, Irmela: Reconstructing Central Places and Settlements Groups (p. 66)<br />
Hielscher, Thomas; Zucknick, Manuela; Werft, Wiebke; Benner, Axel: On the prognostic value of gene expression signatures for censored data (p. 67)<br />
Holzmann, Hajo; Dannemann, Jörn: Likelihood ratio testing for hidden Markov models (p. 68)<br />
Hühn, Jens; Hüllermeier, Eyke: Rule-Based Learning of Reliable Classifiers (p. 69)<br />
Huellermeier, Eyke; Vanderlooy, Stijn: Combining Predictions in Pairwise Classification: An Adaptive Voting Strategy and Its Relation to Weighted Voting (p. 70)<br />
Huson, Daniel H.; Rupp, Regula: Using Cluster Networks to Represent Non-Compatible Sets of Clusters (p. 71)<br />
Hütt, Marc-Thorsten: Genome phylogeny based on short-range correlations in DNA sequences (p. 72)<br />
Imaizumi, Tadashi: Dimensionality Reduction of Similarity Matrix (p. 73)<br />
Kaiser, Sebastian; Leisch, Friedrich: Benchmarking Bicluster Algorithms (p. 75)<br />
Karatzoglou, Alexandros; Feinerer, Ingo; Hornik, Kurt: Nonparametric distribution analysis for text mining (p. 76)<br />
Klein, Christian; Kundisch, Dennis: Index-Based Investment Vehicles - A Comparative Study for the German DAX (p. 77)<br />
Klenk, Hans-Peter: Polyphasic genomic approach for the taxonomy of archaea and bacteria (p. 78)<br />
Kludas, Jana; Bruno, Eric; Marchand-Maillet, Stéphane: Exploiting synergetic and redundant features for multimedia document classification (p. 79)<br />
Kneib, Thomas; Baumgartner, Bernhard; Steiner, Winfried J.: Time-Varying Parameters in Brand Choice Models (p. 80)<br />
Koralun-Bereznicka, Julia: Multivariate comparative analysis of stock exchanges - the European perspective (p. 81)<br />
Krolak-Schwerdt, Sabine: Strategies of model construction for the analysis of judgement data (p. 82)<br />
Kuziak, Katarzyna: An application of copula functions to market risk management (p. 83)<br />
Lam, Kar Yin; Koning, Alex J.; Franses, Philip Hans: Testing preference rankings (p. 84)<br />
Latouche, Pierre J.; Ambroise, Christophe; Birmelé, Etienne: Bayesian Methods for Graph Clustering (p. 85)<br />
Locarek-Junge, Hermann; Mihm, Max: Fundamental Indexation - testing the concept in the German stock market (p. 86)<br />
Louw, Nelmarie; Lamont, Morne; Steel, Sarel: Identifying Atypical Cases in Kernel Fisher Discriminant Analysis by using the Smallest Enclosing Hypersphere (p. 87)<br />
− xxxi −
Lübke, Karsten; Papenhoff, Heike: Latent growth models for analyzing a multi partner reward program (p. 88)<br />
Lukashevich, Hanna; Dittmar, Christian; Bastuck, Christoph: Applying Statistical Models and Parametric Distance Measures for Music Similarity Search (p. 89)<br />
Lukociene, Olga; Vermunt, Jeroen K.: Determining the number of components in mixture models for hierarchical data (p. 90)<br />
Klaus, Martin; Wagner, Ralf: Exploring the Interaction Structure of Weblogs (p. 91)<br />
Martin-Magniette, Marie-Laure; Mary-Huard, Tristan; Bérard, Caroline; Robin, Stéphane: ChIPmix: Mixture model of regressions for ChIP-chip experiment analysis (p. 92)<br />
McLachlan, Geoffrey John: Clustering of High-Dimensional Data Via Finite Mixture Models (p. 93)<br />
McMorris, F. R.: Majority-rule consensus: from preferences (social choice) to trees (biology and classification theory) (p. 94)<br />
Meier, René; Joos, Franz: Optimization Methods with Evolutionary Algorithms and Artificial Neural Networks (p. 95)<br />
Meyer, Florian; Ultsch, Alfred: Finding Music Fads by clustering Online Radio Data with Emergent Self-Organizing Maps (p. 96)<br />
Mirkin, Boris: Deviant box and dual clusters for the analysis of conceptual contexts (p. 97)<br />
Mucha, Hans-Joachim: Clustering a Contingency Table Accompanied by Visualization (p. 98)<br />
Müller-Funk, Ulrich; Dlugosz, Stephan: Predictive classification trees (p. 99)<br />
Mylonas, Phivos; Solachidis, Vassilios; Geyer-Schulz, Andreas; Hoser, Bettina; Chapman, Sam; Ciravegna, Fabio; Staab, Steffen; Smrz, Pavel; Kompatsiaris, Yiannis; Avrithis, Yannis: Efficient Media Exploitation towards Collective Intelligence (p. 100)<br />
Nalbantov, Georgi Ilkov; Groenen, Patrick J.F.; Bioch, Cor: Support Vector Machines in the Dual using Majorization and Kernels (p. 101)<br />
Neumann, Anneke; Ambrosi, Klaus; Hahne, Felix: Approach for Dynamic Problems in Clustering (p. 102)<br />
Neykov, Neyko; Filzmoser, Peter; Neytchev, Plamen: Robust fitting of mixtures: The approach based on the Trimmed Likelihood Estimator (p. 103)<br />
Nugent, Rebecca; Stuetzle, Werner: Cluster Tree Estimation using a Generalized Single Linkage Method (p. 104)<br />
Nusser, Sebastian; Otte, Clemens; Hauptmann, Werner: Multi-Class Extension of Verifiable Ensemble Models for Safety-Related Applications (p. 105)<br />
Okada, Akinori; Sakaehara, Towao: Analysis of Borrowing and Guaranteeing Relationships among Government Officials at the Eighth Century in the Old Capital of Japan (p. 106)<br />
− xxxii −<br />
Oosthuizen, Surette; Steel, Sarel J.: Variable selection for kernel classifiers: a feature-to-input space approach (p. 107)<br />
Ostermann, Thomas; Schuster, Reinhard; Erben, Christoph: Classifying hospitals with respect to their diagnostic diversity using Shannon's entropy (p. 108)<br />
Palumbo, Francesco: Clustering and Dimensionality Reduction to Discover Interesting Patterns in Binary Data (p. 109)<br />
Petersen, Wiebke: Lineare Kodierung multipler Vererbungshierarchien: Wiederbelebung einer antiken Klassifikationsmethode (p. 110)<br />
Petersen, Wiebke; Heinrich, Petja: Begriffsanalytischer Ansatz zur qualitativen Zitationsanalyse (p. 111)<br />
Piontek, Krzysztof: The Analysis of the power for some chosen VaR backtesting procedures - simulation approach (p. 112)<br />
Pommeret, Denys: Testing distribution in errors in variables models (p. 113)<br />
Pons, Odile: Classification with an increasing number of components (p. 114)<br />
Potapov, Sergej; Lausen, Berthold: Bagging with different split criteria (p. 115)<br />
Punzo, Antonio: Considerations on the impact of JML-ill-conditioned configurations in the CML approach (p. 116)<br />
Raabe, Nils; Enk, Dirk; Weihs, Claus; Biermann, Dirk: Dynamic disturbances in BTA deephole drilling - Identification of spiralling as a regenerative effect (p. 117)<br />
Radermacher, Walter: Statistical processes under change - Enhancing data quality with pretests (p. 118)<br />
Rapp, Reinhard; Zock, Michael: Automatic Dictionary Expansion Using Non-parallel Corpora (p. 119)<br />
Ringle, Christian M.: FIMIX-PLS Segmentation of Data for Path Models with Multiple Endogenous LVs (p. 120)<br />
Rokita, Pawel; Piontek, Krzysztof: Extreme unconditional dependence vs. multivariate GARCH effect in the analysis of dependence between high losses on Polish and German stock indexes (p. 121)<br />
Rolshoven, Jürgen: Grundzüge einer generativen Korpuslinguistik (p. 122)<br />
Rozmus, Dorota: Cluster ensemble based on co-occurrence data (p. 123)<br />
Sagan, Adam; Kowalska-Musial, Magdalena: Dyadic Interactions in Service Encounter - Bayesian SEM Approach (p. 124)<br />
− xxxiii −
Sardet, Laure; Patilea, Valentin: Beta-kernel density estimation using mixture-based transformations: an application to claims distribution (p. 125)<br />
Schachtner, Reinhard; Pöppel, Gerhard; Lang, Elmar: Nonnegative Matrix Factorization for Binary Data to Extract Elementary Failure Maps from Wafer Test Images (p. 126)<br />
Scharl, Theresa; Leisch, Friedrich: Quality-Based Clustering of Functional Data: Applications to Time Course Microarray Data (p. 127)<br />
Schierle, Martin; Trabold, Daniel: Multilingual knowledge based concept recognition in textual data (p. 128)<br />
Schiffner, Julia; Szepannek, Gero; Monthé, Thierry; Weihs, Claus: Localized Logistic Regression for Discrete Influential Factors (p. 129)<br />
Schiffner, Julia; Weihs, Claus: Localized Classification Using Mixture Models (p. 130)<br />
Schlattmann, Peter: Comparison of four estimators of the heterogeneity variance for meta-analysis (p. 131)<br />
Schölkopf, Bernhard: Machine Learning applications of positive definite kernels (p. 132)<br />
Schuster, Reinhard; von Arnstedt, Eva: Age Distributions for costs in drug prescription by practitioners and for DRG-based hospital treatment (p. 133)<br />
Schyle, Daniel: The Late Neolithic flint axe production on the Lousberg (Aachen, Germany) – An extrapolation of supply and demand and population density (p. 134)<br />
Sieben, Wiebke: Time Related Features for Alarm Classification in Intensive Care Monitoring (p. 135)<br />
Slawski, Martin; Boulesteix, Anne-Laure; Daumer, Martin: 'CMA' - Steps in developing a comprehensive R-toolbox for classification with microarray data and other high-dimensional problems (p. 136)<br />
Solachidis, Vassilios; Mylonas, Phivos; Geyer-Schulz, Andreas; Hoser, Bettina; Chapman, Sam; Ciravegna, Fabio; Staab, Steffen; Contopoulos, Costis; Gkika, Ioanna; Smrz, Pavel; Kompatsiaris, Yiannis; Avrithis, Yannis: Generating Collective Intelligence (p. 137)<br />
Sommer, Katrin; Weihs, Claus: Analysis of polyphonic musical time series (p. 138)<br />
Sommerfeld, Angela: Trust as a Key Determinant of Loyalty and its Moderators (p. 139)<br />
Stecking, Ralf; Schebesch, Klaus B.: Generating Fictitious Training Data for Credit Client Classification (p. 140)<br />
Steinbrecher, Matthias; Kruse, Rudolf: Clustering Association Rules with Fuzzy Concepts (p. 141)<br />
− xxxiv −
Strobl, Carolin; Leisch, Friedrich: Who's Afraid of Statistics? - Measurement and Predictors of Statistics Anxiety in German University Students (p. 142)<br />
Strobl, Carolin; Zeileis, Achim: A New, Conditional Variable Importance Measure for Random Forests (p. 143)<br />
Tarka, Piotr: Conjoint Analysis within the field of customer satisfaction problems – a model of composite product/service (p. 144)<br />
Thiel, Klaus: Optimal VDSL Expansion taking into Consideration of Infrastructure Restrictions and Marketing Requirements (p. 145)<br />
Thinh, Nguyen Xuan; Küttner, Leander; Meinel, Gotthard: Evaluate the data structure and identify homogenous spatial units in the data base "Sustainability issues in sensitive areas" of the EU-FP6 Integrated Project SENSOR (p. 146)<br />
Thorleuchter, Dirk: Mining ideas from textual information (p. 147)<br />
Thorleuchter, Dirk: Mining technologies in security and defense (p. 148)<br />
Timmerman, Marieke E.; Lichtwarck-Aschoff, Anna; Ceulemans, Eva: Multilevel Simultaneous Component Analysis for Studying Inter-individual and Intra-individual Variabilities (p. 149)<br />
Tomas, Amber: Issues related to the implementation of a dynamic logistic model for classifier combination (p. 150)<br />
Trinchera, Laura; Esposito Vinzi, Vincenzo: A Comprehensive Partial Least Squares Approach to Component-Based Structural Equation Modeling (p. 151)<br />
Trzesiok, Michal: Relevant Importance of Predictor Variables in Support Vector Machines Models (p. 152)<br />
Ultsch, Alfred: Comparison of Algorithms to find differentially expressed Genes in Microarray Data (p. 153)<br />
Ultsch, Alfred: Is log ratio a good value for measuring return in stock investments? (p. 154)<br />
Ünlü, Ali: Mosaic Plots and Knowledge Structures (p. 155)<br />
van de Velden, Michel; de Beuckelaer, Alain; Groenen, Patrick; Busing, Frank: Visualizing preferences using minimum variance nonmetric unfolding (p. 156)<br />
van der Ark, Andries L.; Straat, J. Hendrik: Selection of items for tests and questionnaires using Mokken scale analysis (p. 157)<br />
van der Heijden, Peter G.M.: Estimating the prevalence of rule transgression (p. 158)<br />
Wagner, Ralf; Sauerwald, Erik: Clustering Consumers with Respect to Their Marketing Reactance Behavior (p. 159)<br />
Wehrens, Ron: Supervised Self-Organising Maps and More (p. 160)<br />
Wilczynski, Petra; Sarstedt, Marko: Multi-Item Versus Single-Item Measures: A Review and Future Research Directions (p. 161)<br />
− xxxv −
Wildner, Raimund: Management and methods: How to do market segmentation projects (p. 162)<br />
Winkler, Roland; Rehm, Frank; Kruse, Rudolf: Clustering with Repulsive Prototypes (p. 163)<br />
Winkler, Stephan; Affenzeller, Michael; Wagner, Stefan; Kronberger, Gabriel: On the Effects of Enhanced Selection Models on Quality and Comparability of Classifiers Produced by Genetic Programming (p. 164)<br />
Witek, Ewa: Analysis of massive emigration from Poland - the model-based clustering approach (p. 165)<br />
Worm, Katja; Meffert, Beate: Image Based Mail Piece Identification using Unsupervised Learning (p. 166)<br />
Zarraga, Amaya; Goitisolo, Beatriz: Factor Analysis of Incomplete Disjunctive Tables (p. 167)<br />
Zeileis, Achim; Kleiber, Christian: Recursive Partitioning of Economic Regressions: Trees of Costly Journals and Beautiful Professors (p. 168)<br />
− xxxvi −
Author Index<br />
Abu Assab, Samah 1<br />
Adachi, Kohei 2<br />
Aden, Christian 3<br />
Adler, Werner 4<br />
Affenzeller, Michael 164<br />
Amado, Conceicao 30<br />
Ambroise, Christophe 85<br />
Ambrosi, Klaus 102<br />
Andres, Bjoern 5<br />
Augustin, Thomas 6<br />
Avrithis, Yannis 100, 137<br />
Azam, Muhammad 7<br />
Bade, Korinna 8<br />
Baier, Daniel 1, 22, 38<br />
Barbosa, Rui Pedro 9<br />
Bartel, Hans-Georg 34<br />
Bastuck, Christoph 89<br />
Baumgartner, Bernhard 80<br />
Becker, Niels 10<br />
Behnisch, Martin 11<br />
Belo, Orlando 9<br />
Ben-Israel, Adi 12<br />
Benner, Axel 67<br />
Benz, Dominik 8<br />
Bérard, Caroline 92<br />
Berthold, Michael R. 45<br />
Bessler, Wolfgang 13<br />
Betzin, Jörg 14<br />
Biermann, Dirk 117<br />
Biernacki, Christophe 15<br />
Bioch, Cor 54, 101<br />
Birmelé, Etienne 85<br />
Bisson, Gilles 16<br />
Bocci, Laura 17<br />
Borgelt, Christian 18<br />
Boucharel, Julien 48<br />
Boulesteix, Anne-Laure 20, 136<br />
− xxxvii −<br />
Bravo, Cristian 21<br />
Brenning, Alexander 4<br />
Bruno, Eric 79<br />
Brusch, Michael 22<br />
Busing, Frank 156<br />
Buza, Krisztian Antal 23<br />
Calò, Daniela G. 24<br />
Carstoiu, Dorin 26<br />
Caserta, Marco 25<br />
Celeux, Gilles Paul 15<br />
Cernian, Alexandra 26<br />
Ceulemans, Eva 149<br />
Chiou, Hua-Kai 27, 28<br />
Ciravegna, Fabio 100, 137<br />
Contopoulos, Costis 137<br />
Cortina-Borja, Mario 29<br />
Critchley, Frank 30<br />
Dannemann, Jörn 68<br />
Daumer, Martin 136<br />
de Beuckelaer, Alain 156<br />
Dean, Nema 31<br />
Denk, Winfried 5<br />
Desmet, Frank Michel 32<br />
Dewitte, Boris 48<br />
Dias, José G. 33<br />
Dittmar, Christian 89<br />
Dlugosz, Stephan 99<br />
Dolata, Jens 34<br />
du Penhoat, Yves 48<br />
Eigenfeldt, Arne 35<br />
Einbeck, Jochen 36<br />
Enk, Dirk 117<br />
Enyukov, Igor 37<br />
Erben, Christoph 108<br />
Esber, Said 38<br />
Esposito Vinzi, Vincenzo 151<br />
Evers, Ludger 36
Feinerer, Ingo 76<br />
Fenk, August 40<br />
Fenk-Oczlon, Gertraud 40<br />
Fernández-Aguirre,<br />
Karmele<br />
41<br />
Filzmoser, Peter 103<br />
Franses, Philip Hans 84<br />
Fricke, Jobst P. 42<br />
Fritsch, Arno 43<br />
Fuchs, Sebastian 44<br />
Gabriel, Thomas R. 45<br />
Gans, Ulrich-Walter 46<br />
Gantner, Zeno 47<br />
Garel, Bernard 48<br />
Garín-Martín, María Araceli 41<br />
Gassiat, Elisabeth 49<br />
Gazda, Vladimir 50<br />
Geyer-Schulz, Andreas 51, 100,<br />
137<br />
Gkika, Ioanna 137<br />
Godehardt, Erhard 52<br />
Goitisolo, Beatriz 167<br />
Govaert, Gérard 15<br />
Greselin, Francesca 53<br />
Groenen, Patrick J.F. 54, 101,<br />
156<br />
Große, Lars 55<br />
Grün, Bettina 56<br />
Haasdonk, Bernard 57<br />
Häberle, Lothar 58<br />
Hahlweg, Cornelius 59<br />
Hahne, Felix 102<br />
Hamprecht, Fred A. 5<br />
Hansohm, Jürgen 60<br />
Hauptmann, Werner 105<br />
Hausdorf, Bernhard 62<br />
Heinrich, Petja 111<br />
Helmstaedter, Moritz 5<br />
Henker, Uwe 61<br />
Hennig, Christian 62<br />
Henseler, Jörg 63<br />
− xxxviii −<br />
Hermes, Jürgen 64<br />
Herrmann, Lutz 65<br />
Herzog, Irmela 66<br />
Hielscher, Thomas 67<br />
Holler, Julian 13<br />
Holzmann, Hajo 68<br />
Hornik, Kurt 76<br />
Hoser, Bettina 51, 100,<br />
137<br />
Huang, Yong-Ting 27<br />
Hühn, Jens 69<br />
Hüllermeier, Eyke 69, 70<br />
Hütt, Marc-Thorsten 72<br />
Huson, Daniel H. 71<br />
Ickstadt, Katja 43<br />
Imaizumi, Tadashi 73<br />
Ingrassia, Salvatore 53<br />
Ionescu, Tudor 26<br />
Jaworski, Jerzy 52<br />
Joos, Franz 55, 95<br />
Kapur, Ajay 35<br />
Karatzoglou, Alexandros 76<br />
Klaus, Martin 91<br />
Kleiber, Christian 168<br />
Klein, Christian 77<br />
Klenk, Hans-Peter 78<br />
Kludas, Jana 79<br />
Kneib, Thomas 80<br />
Koethe, Ullrich 5<br />
Kompatsiaris, Yiannis 100, 137<br />
Koning, Alex J. 84<br />
Koralun-Bereznicka, Julia 81<br />
Kowalska-Musial, Magdal. 124<br />
Krolak-Schwerdt, Sabine 82<br />
Kronberger, Gabriel 164<br />
Kruse, Rudolf 141, 163<br />
Küttner, Leander 146<br />
Kundisch, Dennis 77<br />
Kuziak, Katarzyna 83<br />
Lam, Kar Yin 84
Lamont, Morne 87<br />
Lang, Elmar 126<br />
Lang, Matthias 46<br />
Latouche, Pierre J. 85<br />
Lausen, Berthold 4, 115<br />
Leisch, Friedrich 56, 75,<br />
127, 142<br />
Leman, Marc 32<br />
Lesaffre, Micheline 32<br />
Lessmann, Stefan 25<br />
Lichtwarck-Aschoff, Anna 149<br />
Locarek-Junge, Hermann 86<br />
Louw, Nelmarie 87<br />
Lübke, Karsten 88<br />
Lukashevich, Hanna 89<br />
Lukociene, Olga 90<br />
Maldonado, Sebastian 21<br />
Marchand-Maillet, Stéphane 79<br />
Martin-Magniette, Marie-L. 92<br />
Mary-Huard, Tristan 92<br />
McLachlan, Geoffrey John 93<br />
McMorris, Fred R. 94<br />
Meffert, Beate 166<br />
Meier, René 95<br />
Meinel, Gotthard 146<br />
Mihm, Max 86<br />
Mirkin, Boris 97<br />
Monthé, Thierry 129<br />
Mucha, Hans-Joachim 3, 34, 98<br />
Müller-Funk, Ulrich 99<br />
Mylonas, Phivos 100, 137<br />
Nalbantov, Georgi Ilkov 54, 101<br />
Neumann, Anneke 102<br />
Neykov, Neyko 103<br />
Neytchev, Plamen 103<br />
Nugent, Rebecca 31, 104<br />
Nusser, Sebastian 105<br />
Okada, Akinori 106<br />
Oosthuizen, Surette 107<br />
Ostermann, Alexander 7<br />
− xxxix −<br />
Ostermann, Thomas 108<br />
Otte, Clemens 105<br />
Palumbo, Francesco 109<br />
Papenhoff, Heike 88<br />
Patilea, Valentin 125<br />
Pekalska, Elzbieta 57<br />
Petersen, Wiebke 110, 111<br />
Petersohn, Uwe 61<br />
Pfeiffer, Karl-Peter 7<br />
Piontek, Krzysztof 112, 121<br />
Pires, Ana 30<br />
Pöppel, Gerhard 126<br />
Pommeret, Denys 113<br />
Pons, Odile 114<br />
Potapov, Sergej 115<br />
Punzo, Antonio 116<br />
Raabe, Nils 117<br />
Radermacher, Walter 118<br />
Ramos, Sofia 33<br />
Rapp, Reinhard 119<br />
Rehm, Frank 163<br />
Ringle, Christian M. 120<br />
Robin, Stéphane 92<br />
Rokita, Pawel 121<br />
Rolshoven, Jürgen 122<br />
Rothe, Hendrik 59<br />
Rozmus, Dorota 123<br />
Rupp, Regula 71<br />
Rybarczyk, Katarzyna 52<br />
Sagan, Adam 124<br />
Sakaehara, Towao 106<br />
Sardet, Laure 125<br />
Sarstedt, Marko 44, 161<br />
Sauerwald, Erik 159<br />
Schachtner, Reinhard 126<br />
Scharl, Theresa 127<br />
Schebesch, Klaus B. 140<br />
Schierle, Martin 128<br />
Schiffner, Julia 129, 130<br />
Schlattmann, Peter 131
Schmidt, Gunther 3<br />
Schmidt-Thieme, Lars 23, 47<br />
Schölkopf, Bernhard 132<br />
Schröder, Winfried 3<br />
Schuster, Reinhard 108, 133<br />
Schwiebert, Stephan 64<br />
Schyle, Daniel 134<br />
Sieben, Wiebke 135<br />
Slawski, Martin 20, 136<br />
Smrz, Pavel 100, 137<br />
Solachidis, Vassilios 100, 137<br />
Sommer, Katrin 138<br />
Sommerfeld, Angela 139<br />
Staab, Steffen 100, 137<br />
Stecking, Ralf 140<br />
Steel, Sarel J. 87, 107<br />
Steinbrecher, Matthias 141<br />
Steiner, Winfried J. 80<br />
Straat, J. Hendrik 157<br />
Stuetzle, Werner 104<br />
Szepannek, Gero 129<br />
Tarka, Piotr 144<br />
Thiel, Kilian 45<br />
Thiel, Klaus 145<br />
Thinh, Nguyen Xuan 146<br />
Thorleuchter, Dirk 147, 148<br />
Timmerman, Marieke E. 149<br />
Tomas, Amber 150<br />
Trabold, Daniel 128<br />
Trinchera, Laura 151<br />
Trzesiok, Michal 152<br />
Ünlü, Ali 155<br />
Ultsch, Alfred 11, 61,<br />
65, 96,<br />
153, 154<br />
van de Velden, Michel 156<br />
− xl −<br />
van der Ark, Andries L. 157<br />
van der Heijden, Peter G.M. 158<br />
Vanderlooy, Stijn 70<br />
Vermunt, Jeroen K. 33, 90<br />
Vichi, Maurizio 17<br />
Viroli, Cinzia 24<br />
von Arnstedt, Eva 133<br />
Wagner, Ralf 91, 159<br />
Wagner, Stefan 164<br />
Wallner, Matthias 6<br />
Weber, Richard 21<br />
Wehrens, Ron 160<br />
Weihs, Claus 117,<br />
129,<br />
130, 138<br />
Werft, Wiebke 67<br />
Werners, Brigitte 10<br />
Wilczynski, Petra 161<br />
Wildner, Raimund 162<br />
Winkler, Roland 163<br />
Winkler, Stephan 164<br />
Witek, Ewa 165<br />
Worm, Katja 166<br />
Yuan, Benjamin J.C. 28<br />
Zarraga, Amaya 167<br />
Zeileis, Achim 143, 168<br />
Zock, Michael 119<br />
Zucknick, Manuela 67
Designing Products Using Quality Function<br />
Deployment and Conjoint Analysis: A<br />
Comparison in a Market for Elderly People<br />
Samah Abu Assab and Daniel Baier<br />
Chair of Marketing and Innovation Management, Brandenburg University of<br />
Technology, Erich-Weinert-Str. 1, 03046 Cottbus, Germany<br />
samah.assab@tu-cottbus.de, baier@tu-cottbus.de<br />
Abstract. In this paper, we compare two product design approaches, quality function<br />
deployment (QFD) and conjoint analysis (CA), using mobile phones for elderly people<br />
as an example. We then compare our results with those of earlier, similar comparisons<br />
(e.g., Pullman et al. (2002), Katz (2004)), following the same procedures and conditions<br />
as Pullman et al. (2002).<br />
Pullman et al. (2002) view the relation between the two methods: QFD and<br />
CA as a complementary one in which both should be simultaneously implemented<br />
and each providing feedback to the other. They concluded that CA is more efficient<br />
in reflecting the end-users’ present preferences for the product attributes, whereas<br />
QFD is definitely better in satisfying end-users’ needs from the developers’ point of<br />
view. Katz (2004) in his response from a practitioner’s point of view agreed with<br />
Pullman’s. However, he concluded that the two methods are better used sequentially<br />
and that QFD should precede conjoint analysis. We test these results in a market<br />
for elderly people<br />
Key words: Conjoint analysis, Quality function deployment, new product design,<br />
elderly people<br />
References<br />
Baier, D. and Brusch, M. (2005): Linking Quality Function Deployment and Conjoint<br />
Analysis for New Product Design. In: D. Baier, R. Decker and L. Schmidt-<br />
Thieme (Eds.): Data Analysis and Decision Support. Springer, Berlin, 189-198.<br />
Katz, G.M. (2004): A Response to Pullman et al.’s (2002) Comparison of Quality<br />
Function Deployment versus Conjoint Analysis. Journal of Product Innovation<br />
Management, 21, 61-63.<br />
Pullman, M.E., Moore, W.L. and Wardell, D.G. (2002): A Comparison of Quality<br />
Function Deployment and Conjoint Analysis in New Product Design. The Journal<br />
of Product Innovation Management, 19, 354-364.<br />
− 1 −
Joint Procrustes Analysis with Constrained<br />
Simplimax Rotation: Nonsingular<br />
Transformation of Component Score and<br />
Loading Matrices Toward Simple Structure<br />
Kohei Adachi<br />
Graduate School of Human Sciences<br />
Osaka University, Japan<br />
Abstract. The solution of component analysis is indeterminate up to a nonsingular<br />
transformation: post-multiplying the component score and loading matrices by a<br />
nonsingular matrix and its transposed inverse, respectively, does not change the<br />
goodness of fit. To obtain a nonsingular matrix which gives simple structure to both<br />
the transformed score and loading matrices, we propose joint Procrustes analysis<br />
with constrained simplimax rotation, which consists of two phases. First, the score<br />
and loading matrices are rotated orthogonally so as to match target score and loading<br />
matrices that contain zero elements; the number of zero elements is predetermined,<br />
but their placement and the values of the non-zero elements are unknown. Second,<br />
with the placement of the zero elements fixed at the first-phase result, a nonsingular<br />
matrix is obtained which matches the transformed score and loading matrices to the<br />
target matrices, where the values of the non-zero elements are unknown. This<br />
procedure is argued to be useful<br />
for the cases where score and loading matrices have symmetric roles, for example,<br />
a case where component analysis is performed for a data matrix of input signals by<br />
output responses.<br />
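The orthogonal matching step in the first phase is, in essence, the classical Procrustes rotation to a fixed target, which has a closed-form solution via the singular value decomposition. The following sketch (our Python illustration, not the authors' implementation; the simplimax-style target update with unknown zero placements is omitted) shows that step:<br />

```python
import numpy as np

def orthogonal_procrustes(A, T):
    """Rotation Q minimizing ||A Q - T||_F over orthogonal Q.

    Classical solution: with the SVD A^T T = U S V^T, the optimizer
    is Q = U V^T.
    """
    U, _, Vt = np.linalg.svd(A.T @ T)
    return U @ Vt
```

In the procedure described above, `T` would be the current target matrix with its fixed zero pattern, re-estimated between rotations.<br />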
References<br />
Adachi, K. (2005). Simultaneous Procrustes transformation of components and loadings<br />
obtained from three-way data. P. 77 in http://www.psychometrika.org/<br />
PDFs/IMPS2005_Abstracts.pdf.<br />
Kiers, H. A. L. (1994). Simplimax: Oblique rotation to an optimal target with simple<br />
structure. Psychometrika, 59, 567-579.<br />
− 2 −
WaldIS - a web based reference system for the<br />
forest monitoring in North Rhine-Westphalia<br />
Christian Aden 1 , Hans Mucha 2 , Gunther Schmidt 1 , and Winfried Schröder 1<br />
1 Lehrstuhl für Landschaftsökologie, Hochschule Vechta, D-49377 Vechta,<br />
Germany, caden@iuw.uni-vechta.de<br />
2 Weierstraß-Institut für Angewandte Analysis und Stochastik (WIAS),<br />
D-10117 Berlin, Germany, mucha@wias-berlin.de<br />
Abstract. In Germany, a multi-level forest monitoring system has been in place since<br />
the mid-1980s. This hierarchical monitoring system consists of the annual forest<br />
condition surveys, the forest soil survey, and the intensive long-term monitoring<br />
of forest ecosystems. In North Rhine-Westphalia, these forest monitoring programmes<br />
are supplemented by monitoring of the foliar chemistry. Within these programmes,<br />
the monitoring data are recorded and evaluated separately by the respective federal<br />
authorities. An integrative statistical analysis of all data collected in the different<br />
monitoring programmes has not yet been realised. To overcome these constraints,<br />
the German Research Foundation (DFG) sponsors a research project that aims at<br />
compiling the monitoring data by means of WebGIS techniques and at integrated<br />
statistical analyses using geostatistics, time series analysis and multivariate statistics.<br />
Currently, the reference data system WaldIS is being developed for integrating the<br />
survey data interactively and for visualising them via a WebGIS. In addition, tools<br />
for logical data queries and downloads, as well as some GIS functions, were included.<br />
WaldIS was realised using open source software components instead of proprietary<br />
software: the UMN Mapserver was combined with the WebGIS client suite Mapbender<br />
and the database management system PostgreSQL. Furthermore, WaldIS relies on<br />
standards for processing geo-objects published by the Open Geospatial Consortium<br />
(Pesch et al. 2007). Moreover, WaldIS will be used to visualise statistical results such<br />
as clusters or principal components. With the help of stable statistical analyses based<br />
on rank-order data, the aim is to find areas of homogeneous environmental and forest<br />
conditions.<br />
Key words: forest monitoring, WebGIS, multivariate rank analysis, stability<br />
References<br />
Pesch, R., Schmidt, G., Schröder, W., Aden, C., Kleppin, L. and Holy, M. (2007):<br />
Development, Implementation and Application of the WebGIS MossMet. In:<br />
A. Scharl and K. Tochtermann (Eds.): The Geospatial Web. Springer, London,<br />
191–200.<br />
− 3 −
Classification of Paired Data Using Ensemble<br />
Methods<br />
Werner Adler 1 , Alexander Brenning 2 , and Berthold Lausen 1<br />
1 Chair for Biometry and Epidemiology, University of Erlangen-Nuremberg, Germany,<br />
werner.adler@imbe.imed.uni-erlangen.de, berthold.lausen@rzmail.uni-erlangen.de<br />
2 Department of Geography, University of Waterloo, Canada, brenning@fesmail.uwaterloo.ca<br />
Abstract. In glaucoma classification, the underlying data have a paired structure<br />
that often is accounted for by simply using only one eye per subject. Brenning and<br />
Lausen (2008) showed that the proper use of both eyes in paired cross-validation<br />
decreases the variance of the estimation, compared to cross-validation using only<br />
one eye per subject.<br />
We discuss and compare different strategies to generate the bootstrap samples<br />
for training Adaboost (Freund and Schapire, 1996), Random Forest (Breiman, 2001),<br />
and Double Bagging (Hothorn and Lausen, 2005). The simplest approach is to ignore<br />
the paired data structure and proceed as usual. Adapting the idea by Brenning and<br />
Lausen, we also perform subject based sampling. In a first step, subjects are drawn<br />
with replacement. In a second step, for each drawn subject either both eyes or<br />
one randomly selected eye are chosen, or two eyes are drawn with replacement.<br />
The subjects not selected for training the base learners constitute the out-of-bag<br />
samples. We compare error rates resulting from these different approaches obtained<br />
by a simulation study.<br />
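To make the subject-based sampling strategies concrete, a minimal sketch (our own illustration; the function name, parameters and use of NumPy are assumptions, not the authors' code):<br />

```python
import numpy as np

def paired_bootstrap(n_subjects, mode="both", rng=None):
    """Draw a bootstrap sample of subjects, then select eyes per subject.

    Returns a list of (subject, eye) pairs for training and the set of
    out-of-bag subjects. mode: "both" eyes, "one" randomly selected eye,
    or "resample" two eyes drawn with replacement.
    """
    rng = np.random.default_rng(rng)
    # Step 1: draw subjects with replacement.
    drawn = rng.integers(0, n_subjects, size=n_subjects)
    sample = []
    for s in drawn:
        # Step 2: choose eyes for each drawn subject.
        if mode == "both":
            sample += [(s, 0), (s, 1)]
        elif mode == "one":
            sample.append((s, rng.integers(0, 2)))
        elif mode == "resample":
            sample += [(s, rng.integers(0, 2)), (s, rng.integers(0, 2))]
    # Subjects never drawn form the out-of-bag sample.
    oob = set(range(n_subjects)) - set(drawn.tolist())
    return sample, oob
```

Ignoring the paired structure would instead resample the 2n eyes directly, the baseline the abstract compares against.<br />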
Key words: Bootstrap, Classification, Glaucoma, Paired Organs<br />
References<br />
Breiman, L. (2001): Random forests. Machine Learning, 45, 5–32.<br />
Brenning, A. and Lausen, B. (2008): Estimating error rates in the classification of<br />
paired organs. Statistics in Medicine, submitted.<br />
Freund, Y. and Schapire, R. (1996): Experiments with a new boosting algorithm.<br />
Proceedings of the 13th International Conference on Machine Learning, 148–<br />
156.<br />
Hothorn, T. and Lausen, B. (2005): Bundling classifiers by bagging trees. Computational<br />
Statistics & Data Analysis, 49, 1068–1078.<br />
− 4 −
Segmentation of Neural Tissue<br />
Bjoern Andres 1 , Ullrich Koethe 1 , Moritz Helmstaedter 2 , Winfried Denk 2 ,<br />
and Fred Hamprecht 1<br />
1 Interdisciplinary Center for Scientific Computing, University of Heidelberg<br />
2 Max Planck Institute for Medical Research, Heidelberg<br />
Abstract. Three-dimensional electron-microscopic image stacks with almost isotropic<br />
resolution allow, for the first time, the determination of the complete connectivity<br />
matrix of parts of the brain. In spite of major advances in staining, correct segmentation<br />
of these stacks remains challenging, because very few local mistakes can lead to<br />
severe global errors. We propose a hierarchical segmentation procedure based on<br />
statistical learning and topology-preserving grouping. First, edge probability maps<br />
are computed by a random forest classifier, and are partitioned into supervoxels by<br />
the watershed transform. Over-segmentation is then resolved by constructing an<br />
irregular graphical model on these supervoxels and inferring the most likely global<br />
segmentation. Careful validation shows that the results of our algorithm are close<br />
to human labelings.<br />
− 5 −
On the power of corrected score functions to<br />
adjust for measurement error<br />
Thomas Augustin and Matthias Wallner<br />
Department of Statistics, University of Munich (LMU)<br />
augustin@stat.uni-muenchen.de<br />
Abstract. Measurement error modeling, also called errors-in-variables-modeling,<br />
is a generic term for all situations where additional uncertainty in the variables<br />
has to be taken into account, in order to avoid severe bias in the statistical analysis.<br />
The problem is omnipresent in technical statistics, when data from imperfect<br />
measurement instruments are analyzed, as well as in biometrics, econometrics or<br />
social science, where operationalizations (surrogates) are used instead of complex<br />
theoretical constructs.<br />
After a brief introduction to the area of measurement error modelling, the talk<br />
discusses the power and some limitations of Nakamura’s general principle of corrected<br />
score functions, mainly in the context of failure time data. Starting with classical<br />
covariate measurement error in Cox’s PH model, it is shown how the Breslow<br />
likelihood can be corrected, while according to results by Stefanski and Nakamura<br />
himself no corrected score function for the partial likelihood can exist. We then turn<br />
to parametric failure time models and extend consideration to additionally error-prone<br />
lifetimes. Finally, some ideas for handling Berkson-type errors (as occurring,<br />
e.g., in Radon studies) and rounding errors will be sketched.<br />
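For intuition, consider the simplest corrected-score example (classical additive measurement error in linear regression, our illustration rather than the survival setting of the talk), where correcting a biased moment removes the attenuation bias of the naive estimator:<br />

```latex
% Model: Y = X\beta + \varepsilon, but only W = X + U is observed,
% with U \sim N(0, \Sigma_u) independent of (X, \varepsilon).
% Since E[W^\top W] = E[X^\top X] + n\,\Sigma_u, the naive OLS estimator
% based on W is biased; correcting the moment yields
\hat{\beta}_{\mathrm{corr}} \;=\; \bigl(W^\top W - n\,\Sigma_u\bigr)^{-1} W^\top Y .
```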
Key words: Measurement error, error-in-variables, survival analysis, Cox model,<br />
rounding<br />
− 6 −
Evaluation Criteria for the Construction of<br />
Binary Classification Trees with Two or More<br />
Classes<br />
Muhammad Azam 1 , Alexander Ostermann 2 , and Karl-Peter Pfeiffer 3<br />
1 Department of Medical Statistics, Informatics and Health Economics, Medical<br />
University Innsbruck, csag2533@uibk.ac.at<br />
2 Institute for Mathematics, University of Innsbruck, Technikerstrasse 25/7, 6020<br />
Innsbruck, alexander.ostermann@uibk.ac.at<br />
3 Department of Medical Statistics, Informatics and Health Economics, Medical<br />
University Innsbruck, Karl-Peter.Pfeiffer@i-med.ac.at<br />
Abstract. Classification trees recursively partition labelled sampling units in a<br />
top-down fashion until end nodes are reached. Each end node is labelled with the<br />
class of the majority of its units; the remaining units are counted as misclassified.<br />
In the top-down induction process, the evaluation criterion plays an important role<br />
in sending as many units with the same label as possible to the same node. To<br />
achieve this, a “goodness of split” measure is calculated using an evaluation criterion,<br />
e.g. the Gini function or the Twoing rule, for each distinct value of each variable,<br />
and the split that most enhances purity is chosen. For a small number of classes,<br />
almost all evaluation criteria provide the same results in terms of misclassified units,<br />
deviance and number of end nodes; for a larger number of classes, however, the choice<br />
matters, and the best criterion is the one that yields the fewest misclassified units.<br />
Here we propose an impurity-based evaluation criterion which fulfils all the required<br />
properties of an evaluation criterion (Breiman et al., 1984): (i) the node impurity<br />
function achieves its maximum value when the units in a node are distributed equally<br />
over the J classes; (ii) a node is pure when all its observations belong to a single<br />
class; (iii) the node impurity function is symmetric. We conducted a simulation<br />
study to test the performance of the proposed criterion on several real-life datasets<br />
from the UCI repository and observed that the proposed strategy provides improved<br />
results.<br />
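As a point of reference (the authors' proposed criterion is not specified in the abstract), a standard “goodness of split” search with the Gini function can be sketched as:<br />

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Scan each distinct value as a threshold and return the split that
    most reduces the weighted child impurity (maximum purity gain)."""
    n = len(labels)
    parent = gini(labels)
    best = (None, 0.0)
    for t in sorted(set(values))[:-1]:  # last value cannot split
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        gain = parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
        if gain > best[1]:
            best = (t, gain)
    return best
```

Property (i) above corresponds to `gini` being maximal for equal class counts, and (ii) to `gini` vanishing on a pure node.<br />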
Key words: Classification trees, Evaluation criteria, Misclassification rate, Deviance<br />
References<br />
Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984): Classification<br />
and Regression Trees. Wadsworth International Group, Belmont, CA.<br />
− 7 −
Evaluation Strategies for Learning Algorithms<br />
of Hierarchical Structures<br />
Korinna Bade 1 and Dominik Benz 2<br />
1 Faculty of Computer Science, Otto-von-Guericke-University Magdeburg,<br />
D-39106 Magdeburg, Germany, Email: korinna.bade@ovgu.de<br />
2 Department of Electrical Engineering/Computer Science, University of Kassel,<br />
D-34121 Kassel, Germany, Email: benz@cs.uni-kassel.de<br />
Abstract. The idea to automatically induce a hierarchical structure among a set<br />
of objects, or to integrate a given hierarchy into the learning process, is common to<br />
a number of disciplines such as hierarchical clustering (Bade and Nürnberger, 2008),<br />
classification, and ontology learning. A crucial aspect is how to assess the quality<br />
of the learned hierarchical scheme. Existing evaluation approaches can be broadly<br />
classified into methods defining quality metrics on the resulting scheme alone<br />
and methods which invoke an external “gold standard” for comparison. We focus<br />
on the latter case, for which various similarity metrics have been proposed, mostly<br />
depending on the characteristics of the applied learning procedure.<br />
This work aims at bringing together the different disciplines by presenting and<br />
comparing existing gold-standard based evaluation methods for learning algorithms<br />
that generate hierarchical structures. We present an interdisciplinary framework in<br />
order to enable comparison across the different contexts, from which the metrics<br />
originate. Our goal is to emphasize the strong similarities of evaluation tasks in different<br />
disciplines and to create a general pool of evaluation methods. Based on prior<br />
work (Dellschaft and Staab, 2006), we analyze properties of (good) evaluation measures.<br />
Different types of structural errors in the learned hierarchies are identified and<br />
their effects on existing measures are shown. Observing strengths and weaknesses of<br />
existing methods, we also suggest some new methods.<br />
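One simple instance of a gold-standard based measure, an ancestor-overlap F1 (our illustrative choice, only one of the many metric families such work surveys), can be sketched as:<br />

```python
def ancestors(parent, node):
    """Set of ancestors of `node` in a hierarchy given as a child->parent map."""
    out = set()
    while node in parent:
        node = parent[node]
        out.add(node)
    return out

def hierarchy_f1(learned, gold):
    """Average ancestor-overlap F1 over nodes present in both hierarchies.

    A node placed far from its gold-standard position shares few
    ancestors with it and is penalised accordingly.
    """
    shared = set(learned) & set(gold)
    scores = []
    for n in shared:
        a, b = ancestors(learned, n), ancestors(gold, n)
        p = len(a & b) / len(a) if a else 0.0
        r = len(a & b) / len(b) if b else 0.0
        scores.append(2 * p * r / (p + r) if p + r else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

A structural error such as attaching a leaf to the wrong parent lowers the score only for the affected subtree, one of the error-sensitivity properties such a comparison framework examines.<br />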
Key words: evaluation metrics, hierarchical clustering, ontology learning, gold-standard<br />
References<br />
Bade, K. and Nürnberger, A. (2008): Creating a Cluster Hierarchy under Constraints<br />
of a Partially Known Hierarchy. In: Proceedings of the 2008 SIAM International<br />
Conference on Data Mining. (to appear)<br />
Dellschaft, K. and Staab, S. (2006): On How to Perform a Gold Standard Based<br />
Evaluation of Ontology Learning. In: Proc. of 5 th Int. Semantic Web Conference.<br />
228–241.<br />
− 8 −
Autonomous Forex Trading Agents<br />
Rui Pedro Barbosa 1 and Orlando Belo 2<br />
1 Department of Informatics, University of Minho, 4710-057 Braga, Portugal,<br />
rui.barbosa@di.uminho.pt<br />
2 Department of Informatics, University of Minho, 4710-057 Braga, Portugal,<br />
obelo@di.uminho.pt<br />
Abstract. Trading in financial markets is undergoing a radical transformation,<br />
one in which algorithmic methods are becoming increasingly more important. The<br />
development of intelligent agents that can act as autonomous traders of financial<br />
instruments seems like a logical step forward in this “algorithms arms race”. With<br />
this in mind, our study proposes an infrastructure for implementing hybrid intelligent<br />
agents with the ability to trade in the Forex Market without requiring human<br />
supervision. This infrastructure is composed of three modules. The Intuition Module,<br />
implemented using an Ensemble Model, is responsible for performing pattern<br />
recognition and predicting the direction of the exchange rate. The A Posteriori<br />
Knowledge Module, implemented using a Case-Based Reasoning System, enables<br />
the agents to learn from empirical experience and is responsible for suggesting<br />
how much to invest in each trade. Finally, the A Priori Knowledge Module, implemented<br />
using a Rule-Based Expert System, enables the agents to incorporate<br />
non-experiential knowledge in their trading decisions. This infrastructure was used<br />
to implement two agents, one trading the USD/JPY currency pair and the other<br />
the EUR/USD currency pair, both on a 6-hour timeframe. Using 12 months of<br />
out-of-sample data, the USD/JPY agent performed<br />
826 simulated trades and obtained an average profit per trade of 6.88 pips. It accurately<br />
predicted the direction of the price in 54.72% of the trades, 65.74% of which<br />
were profitable. Over the same period, the EUR/USD agent performed 885 trades,<br />
with an average profit of 6.06 pips per trade. Its accuracy in predicting the direction<br />
of the price was 52.99%, and 60.45% of its trades were profitable. These agents were<br />
integrated with an Electronic Communication Network and have been trading live<br />
for the past several months. So far their live trading results are consistent with the<br />
simulated results, which leads us to believe our infrastructure can be of practical<br />
interest to the traditional trading community.<br />
Key words: Forex trading, Hybrid agents, Autonomy<br />
− 9 −
Improving Product Line Design with Bundling<br />
Niels Becker and Brigitte Werners<br />
Faculty of Economics and Business Administration,<br />
Ruhr-University Bochum, 44780 Bochum, Germany<br />
niels.becker@ruhr-uni-bochum.de and or@ruhr-uni-bochum.de<br />
Abstract. Designing and pricing new products is of particular importance in many<br />
industries. In order to meet heterogeneous customer needs, many companies offer different<br />
variants of every product type. To support these product line design decisions,<br />
various mathematical programming approaches have been developed (Steiner and<br />
Hruschka, 2003). Most models are based on part-worth utilities, estimated within a<br />
conjoint framework. Besides, bundling is an important tool in marketing. It has been<br />
shown that bundling can transfer customers’ willingness to pay from one product to<br />
another. Therefore, prices can be differentiated so that higher profits are obtainable<br />
(Simon and Wübker, 1999). For determining optimal bundles and prices, Hanson<br />
and Martin (1990) have suggested a well-known linear programming model.<br />
Here the problem of optimally designing, bundling and pricing new products<br />
is investigated. One of the questions is, at which point in time bundling decisions<br />
should be made. Therefore, we compare product line design without bundling with<br />
sequential bundling, which means bundling subsequent to product line decisions, and<br />
simultaneous bundling, which means determining optimal bundles and product lines<br />
simultaneously. We developed a combined product line design and bundling model<br />
and present the impact on profits using simulated data. For this example, optimal<br />
results can be obtained using MILP-Software. Our studies show that simultaneous<br />
bundling leads to differently designed products and can improve profits substantially.<br />
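The economic intuition can be illustrated with a deliberately tiny brute-force version of the bundle-pricing problem (our sketch, far simpler than the Hanson and Martin (1990) MILP; it assumes additive bundle valuations and that each customer buys at most one offer):<br />

```python
from itertools import product

def best_bundle_prices(reservations, price_grid):
    """Brute-force optimal prices for products A, B and the bundle AB.

    reservations: list of (r_a, r_b) willingness-to-pay per customer;
    the bundle is valued at r_a + r_b. Each customer picks the option
    (A, B, AB, or nothing) with the largest non-negative surplus,
    ties broken in favour of the higher-priced option.
    """
    best_profit, best_prices = 0.0, None
    for pa, pb, pab in product(price_grid, repeat=3):
        profit = 0.0
        for ra, rb in reservations:
            options = [(ra - pa, pa), (rb - pb, pb), (ra + rb - pab, pab), (0.0, 0.0)]
            surplus, paid = max(options)  # "nothing" guarantees surplus >= 0
            profit += paid
        if profit > best_profit:
            best_profit, best_prices = profit, (pa, pb, pab)
    return best_prices, best_profit
```

With negatively correlated valuations, the bundle price extracts the full willingness to pay, the transfer effect the abstract refers to.<br />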
Key words: Product Line Design, Pricing, Bundling, Optimization<br />
References<br />
Hanson, W. and Martin, K. (1990): Optimal Bundle Pricing. Management Science,<br />
36(2), 155–174.<br />
Simon, H. and Wübker, G. (1999): Bundling - A Powerful Method to Better Exploit<br />
Profit Potential. In: R. Füderer, A. Hermann and G. Wübker (Eds.): Optimal<br />
Bundling, Springer-Verlag, Heidelberg-Berlin, 7–28.<br />
Steiner, W. and Hruschka, H. (2003): Genetic Algorithms for Product Design: How<br />
Well Do They Really Work. Int. Journal of Market Research, 45(2), 229–240.<br />
− 10 −
Estimating the number of buildings in<br />
Germany<br />
Martin Behnisch 1 and Alfred Ultsch 2<br />
1 Institute of Historic Building Research and Conservation, ETH Hoenggerberg,<br />
HIL D 25.9, CH-8093 Zurich. Behnisch@arch.ethz.ch<br />
2 Datenbionic Research Group, Hans-Meerwein-Strasse, Philipps-University<br />
Marburg, D-35032 Marburg. Ultsch@Mathematik.Uni-Marburg.de<br />
Abstract. The building stock can be considered the largest physical, economic<br />
and cultural capital of a society. For German building stocks, many institutions<br />
record different kinds of data. Unfortunately, there are only a few basic statistics<br />
on the number of buildings. Data collection is therefore very complicated and often<br />
expensive, and the handling of missing data is one of the biggest obstacles.<br />
With the exception of data on residential buildings and, in particular, monuments,<br />
determining the total number of buildings is an unsolved problem. The main<br />
contribution of this article is an estimation procedure for this number. Using<br />
methods from the so-called Urban Knowledge Discovery approach, the authors find<br />
unsuspected relationships in the urban data which can be used for the estimation.<br />
The estimation procedure covers 12,430 municipalities and refers to data<br />
from the Cadaster of Real Estates and the Federal Bureau of Statistics. With this<br />
estimation it is possible to use statistical data from well-known and easily accessible<br />
institutions. The number of buildings is estimated for regions with missing data,<br />
and the quality of the estimation is analyzed with training and test data sets.<br />
Information optimization leads to the conclusion that 20% of the municipalities<br />
hold 80% of all buildings. To improve the estimation, it is therefore essential to<br />
refine the amount and quality of data in the larger municipalities.<br />
Key words: Spatial Planning, Engineering, Knowledge Discovery, Data Mining,<br />
Building Stock<br />
References<br />
Aachener Institut für Bauschadensforschung und Angewandte Bauphysik, Hrsg:<br />
Hofman, F. (2001): Urban heritage - building maintenance. Final report. COST<br />
Action C5, European Commission.<br />
Becher, St. (1995): Klassifikation der regionalen Immobilienmärkte der Bundesrepublik<br />
Deutschland (Dissertation). Universität, Mainz.<br />
Behnisch, M. (2007): Urban Knowledge Discovery (Doctoral thesis). Universitätsverlag,<br />
Karlsruhe.<br />
− 11 −
Probabilistic Distance Clustering<br />
Adi Ben-Israel<br />
Rutgers University, RUTCOR<br />
Summary. A new iterative method [1] for probabilistic clustering of data is presented.<br />
Given clusters, their centers, and the distances of data points from these<br />
centers, the probability of cluster membership at any point is assumed inversely<br />
proportional to the distance from (the center of) the cluster in question.<br />
The resulting method is a generalization, to several centers, of the Weiszfeld<br />
method for solving the Fermat–Weber location problem. At each iteration, the<br />
distances (Euclidean, Mahalanobis, etc.) from the cluster centers are computed for<br />
all data points, and the centers are updated as convex combinations of the data<br />
points, with weights determined by the above principle. Computations stop when<br />
the centers stop moving.<br />
This approach also works for problems where the cluster sizes are unknowns (to<br />
be estimated), giving a viable alternative to the EM method [2].<br />
Progress is monitored by the joint distance function (JDF), a measure of<br />
distance from all cluster centers, that evolves during the iterations and captures the<br />
data in its low contours. This is a new concept in data reduction and representation.<br />
A duality theory for the JDF is given in [3].<br />
The method is simple, fast (requiring a small number of cheap iterations), and<br />
insensitive to outliers.<br />
Key words: Probabilistic clustering, Fermat–Weber problem, Joint distance function<br />
References<br />
Ben-Israel, A. and Iyigun, C. (to appear): Probabilistic Distance Clustering. Journal<br />
of Classification. http://benisrael.net/J-CLASSIFICATION-07.pdf.<br />
Iyigun, C. and Ben-Israel, A. (to appear): Probabilistic Distance Clustering Adjusted<br />
for Cluster Size. Probability in the Engineering and Informational Sciences.<br />
http://benisrael.net/PEIS-07.pdf.<br />
Iyigun, C. and Ben-Israel, A. (submitted): Contour Approximation of Data: A Duality<br />
Theory. http://benisrael.net/DUAL-12-20-07.pdf.<br />
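The update rule summarized above (membership probabilities inversely proportional to distance, centers recomputed Weiszfeld-style as convex combinations of the data points, stopping when the centers stop moving) can be sketched as follows. This is an illustrative sketch following reference [1], not the authors' implementation; the deterministic initialization and the squared-probability weights are our reading of the method:

```python
import numpy as np

def pd_cluster(X, K, iters=100, tol=1e-6):
    """Sketch of probabilistic distance clustering: p_k(x) is inversely
    proportional to the distance d_k(x) from center k, and each center is
    updated as a convex combination of the data points with
    Weiszfeld-style weights w = p^2 / d."""
    # simple deterministic spread initialization (an assumption of this sketch)
    centers = X[np.linspace(0, len(X) - 1, K).astype(int)].astype(float)
    for _ in range(iters):
        # Euclidean distances of every point from every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # p_k(x) proportional to 1/d_k(x)
        p = (1.0 / d) / (1.0 / d).sum(axis=1, keepdims=True)
        # convex-combination center update with weights p^2 / d
        w = p**2 / d
        new_centers = (w[:, :, None] * X[:, None, :]).sum(axis=0) / w.sum(axis=0)[:, None]
        moved = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if moved < tol:          # "computations stop when the centers stop moving"
            break
    return centers, p
```

On two well-separated blobs the centers converge to the blob locations and each point's membership concentrates on its own cluster.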
− 12 −
Hedge Funds in a Bayesian Asset Allocation<br />
Framework: Incorporating information on<br />
market states and manager’s ability<br />
Wolfgang Bessler 1 and Julian Holler 2<br />
1 Center for Finance and Banking, Licher Strasse 74, 35394 Giessen<br />
Wolfgang.Bessler@wirtschaft.uni-giessen.de<br />
2 Center for Finance and Banking, Licher Strasse 74, 35394 Giessen<br />
Julian.Holler@wirtschaft.uni-giessen.de<br />
Abstract. A growing number of private and institutional investors make significant<br />
allocations to hedge funds in order to improve the risk-return trade-off of their portfolios.<br />
However, in a portfolio context there are a number of issues that are specific<br />
to hedge funds. We attempt to address these issues in a Bayesian asset allocation<br />
framework (Pastor 2000). In particular, we focus on the returns of two representative<br />
equity hedge fund strategies constructed by replication of two well-known<br />
statistical arbitrage strategies. Importantly, this approach allows us to obtain daily<br />
return observations despite the fact that most funds only report at a monthly interval.<br />
Using this framework, we investigate the following two research questions.<br />
First, we address the issue that many arbitrage strategies exhibit substantial exposures<br />
to financial crises, which is reflected in their high levels of kurtosis and negative<br />
skewness. By including relevant state variables in the prior distribution, we infer<br />
whether investors can improve the risk-adjusted performance of their portfolios by<br />
reducing their exposures prior to the onset of a crisis. Second, investors should only<br />
pay high fees to hedge fund managers if they earn additional alpha above the returns<br />
generated by our dynamic trading strategy. Thus, we attempt to analyze how much<br />
confidence an investor has to put into a manager’s abilities by varying her prior<br />
beliefs about alpha.<br />
References<br />
Pastor, L. (2000): Portfolio Selection and Asset Pricing Models. Journal of Finance,<br />
55, 179–223<br />
Key words: Asset Allocation, Alternative Investments, Hedge Funds<br />
− 13 −
Categorical Data in PLS Path modeling<br />
Jörg Betzin<br />
German Centre of Gerontology (DZA), Berlin, Germany<br />
joerg.betzin@dza.de<br />
Summary. There are many surveys with categorical data in which the relationships<br />
between the variables should be used in a path model with latent variables, but so<br />
far there are only a few possibilities to do so. We present a way of using categorical<br />
manifest variables (MV's) in PLS.<br />
The main idea is, on the one hand, to think of PLS as a generalization of PCA<br />
(principal component analysis) or canonical correlation and, on the other hand, to use<br />
the framework of Correspondence Analysis (CorA) as a generalization of PCA for<br />
categorical variables, and to put these two approaches together.<br />
In the basic PLS algorithm the latent variables (LV's) ηm (m = 1, ..., M) are<br />
estimated as weighted sums of their manifest variables (with data matrices Ym),<br />
ηm = Ym ωm, where the pooled weight vector ω = (ω′1, ..., ω′M)′ is the result of an<br />
iteration algorithm like<br />
ω = ((Y′Y) ∗ P) ω<br />
with an additional normalization constraint, where Y = (Y1, ..., YM), P is a weight<br />
matrix changing with the iteration steps, and '∗' denotes the elementwise<br />
matrix product. The key point is the use of the covariance matrix Y′Y inside the<br />
iteration algorithm.<br />
Now, one main aspect of CorA is the transformation of the raw data matrix Ym<br />
into an indicator matrix Gm and the analysis of a kind of correlation matrix for Gm.<br />
Let Q̃m denote a suitable transformation of Gm such that the elements of Q̃′m Q̃m are<br />
roots of χ²-components from two-dimensional contingency tables for columns in Gm.<br />
Then, in short, Q̃ = (Q̃1, ..., Q̃M) is used as an equivalent of the covariance<br />
matrix Y′Y in the PLS iteration algorithm.<br />
We will show results for different examples using the basic PLS algorithms and<br />
PLS algorithms adapted for categorical manifest variables, together with interpretations<br />
of the weights ωm in the case of categorical data and of the other model<br />
parameters such as correlations and regression coefficients.<br />
Key words: Partial Least Squares, Correspondence Analysis, Categorical Data<br />
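A minimal numerical sketch of the basic PLS weight iteration ω = ((Y′Y) ∗ P) ω with renormalization. This is our own simplification, not the author's method: the scheme-dependent weight matrix P is replaced by a fixed 0/1 block mask expanded from a block adjacency. With a single block and P ≡ 1 the iteration reduces to the power method for the first principal component, which illustrates the "PLS as a generalization of PCA" view mentioned in the summary:

```python
import numpy as np

def pls_weights(blocks, adjacency, iters=100, tol=1e-8):
    """Sketch of the weight iteration w <- ((Y'Y) * P) w ('*' elementwise),
    with the pooled weight vector renormalized after each step.
    blocks: list of (n, p_m) data matrices; adjacency: 0/1 block links
    (a crude, fixed stand-in for the scheme-dependent P)."""
    Y = np.hstack(blocks)
    Y = Y - Y.mean(axis=0)                 # column-center the pooled data
    sizes = [b.shape[1] for b in blocks]
    # expand the block adjacency to a variable-by-variable mask P
    P = np.block([[np.full((p, q), float(adjacency[i][j]))
                   for j, q in enumerate(sizes)]
                  for i, p in enumerate(sizes)])
    C = Y.T @ Y                            # covariance-type matrix Y'Y
    w = np.ones(sum(sizes))
    w /= np.linalg.norm(w)
    for _ in range(iters):
        w_new = (C * P) @ w                # elementwise mask, then multiply
        w_new /= np.linalg.norm(w_new)     # normalization constraint
        if np.linalg.norm(w_new - w) < tol:
            w = w_new
            break
        w = w_new
    # latent variable scores eta_m = Y_m w_m, block by block
    scores, start = [], 0
    for b, p in zip(blocks, sizes):
        bc = b - b.mean(axis=0)
        scores.append(bc @ w[start:start + p])
        start += p
    return w, scores
```

The categorical variant described in the summary would substitute the χ²-based matrix Q̃ for Y′Y inside the same loop.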
− 14 −
Choosing the number of clusters in the latent<br />
class model<br />
Christophe Biernacki 1 , Gilles Celeux 2 , and Gérard Govaert 3<br />
1 Université Lille 1 UMR CNRS 8524, France<br />
Christophe.Biernacki@math.univ-lille1.fr<br />
2 INRIA Saclay, France Gilles.Celeux@inria.fr<br />
3 UTC Compiègne UMR CNRS 6599 Heudiasyc, France Gerard.Govaert@utc.fr<br />
Abstract. The latent class model or multivariate multinomial mixture is a powerful<br />
model for clustering discrete data. This model is expected to be useful to represent<br />
nonhomogeneous populations. It uses a conditional independence assumption given<br />
the latent class to which a statistical unit belongs. In this presentation, we exploit<br />
the fact that a fully Bayesian analysis of the latent class model with Jeffreys<br />
non-informative prior distributions involves no technical difficulty in deriving<br />
the exact integrated complete likelihood. We then exploit this integrated complete<br />
likelihood as a criterion to assess the number of mixture components in a cluster<br />
analysis perspective. We highlight with numerical experiments how this exact criterion<br />
can outperform the BIC-like asymptotic approximation generally used to<br />
choose a sensible number of clusters derived from the latent class model.<br />
Key words: Latent Class model, Integrated complete likelihood, Model selection<br />
References<br />
Biernacki, C., Celeux, G. and Govaert, G. (2000). Assessing a mixture model for<br />
clustering with the integrated completed likelihood. IEEE Trans. on PAMI,<br />
22, 719-725.<br />
Goodman, L. A. (1974). Exploratory latent structure analysis using both identifiable<br />
and unidentifiable models. Biometrika, 61, 215-231.<br />
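As a sketch of the exact criterion referred to above: with Jeffreys Dirichlet(1/2, ..., 1/2) priors the integrated complete likelihood of the latent class model factors into closed-form Dirichlet-multinomial terms, one for the mixing proportions and one per (class, variable) pair. The following illustrative implementation of that closed form is ours (function name and data layout are assumptions, not taken from the paper):

```python
import numpy as np
from math import lgamma

def exact_icl(X, z, K):
    """Exact integrated complete likelihood of the latent class model
    under Jeffreys Dirichlet(1/2) priors.  X: (n, d) array of 0-based
    category codes; z: (n,) array of class labels in {0, ..., K-1}."""
    n, d = X.shape
    icl = 0.0
    # Dirichlet-multinomial term for the mixing proportions
    nk = np.array([(z == k).sum() for k in range(K)])
    icl += lgamma(K / 2) - K * lgamma(0.5)
    icl += sum(lgamma(c + 0.5) for c in nk) - lgamma(n + K / 2)
    # one Dirichlet-multinomial term per (class, variable) pair
    for j in range(d):
        m = int(X[:, j].max()) + 1          # number of categories of variable j
        for k in range(K):
            counts = np.bincount(X[z == k, j], minlength=m)
            icl += lgamma(m / 2) - m * lgamma(0.5)
            icl += sum(lgamma(c + 0.5) for c in counts) - lgamma(nk[k] + m / 2)
    return icl
```

On perfectly separated binary data the criterion prefers the two-class partition over lumping everything into one class, which is the model-selection behavior the abstract exploits.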
− 15 −
Clustering of molecules and structured data<br />
Gilles Bisson<br />
Laboratoire TIMC-IMAG - Equipe AMA, Faculté de Médecine<br />
Summary. The discovery or the synthesis of molecules that activate or inhibit some<br />
biological systems is a central issue for biological research and health care. The objective<br />
of High Throughput Screening (HTS) is to rapidly evaluate, through automated<br />
approaches, the activity of a given collection of molecules on a given biological target<br />
that can be an enzyme or a whole cell. In practice, the results of an HTS test<br />
highlight some tens of active molecules, named the "hits", representing a very small<br />
percentage of the initial collection. However, these tests are just the beginning of the<br />
work, since the identified molecules generally lack desirable characteristics in<br />
terms of sensitivity and specificity (a relevant molecule must be specific to the biological<br />
target and should be effective at a low concentration). In such a context, it<br />
is crucial to provide chemists with tools to explore the contents of their<br />
chemical libraries and especially to ease the search for molecules that are<br />
structurally similar to the "hits". A possible approach, given a relevant distance, is<br />
to seek the nearest neighbours of those hits. More broadly, chemists have a need for<br />
methods to automatically organize the collections of molecules in order to locate the<br />
active molecules within the chemical space. Above all, they would like to evaluate<br />
the real diversity of the chemical structures contained in a collection. Clustering<br />
methods are well suited to carry out this kind of task. However, with structurally<br />
complex objects such as molecules, it is obvious that the quality of the results depends<br />
on the capacity of the distance used by the clustering method to grasp the<br />
structural similarities and also to take into account all the background knowledge<br />
of the chemists. The search for a structural distance between molecules is clearly<br />
related (but not totally equivalent) to the search for isomorphic partial subgraphs,<br />
which is an NP-complete problem. To overcome this problem, many methods use an<br />
"a priori" molecular linearization: a molecule is represented by a vector of descriptors,<br />
each one corresponding to a molecular fragment, and well-known distances can<br />
then be used. However, over the last ten years, kernel functions comparable to distances<br />
between graphs have been proposed in the Support Vector Machines framework.<br />
In these approaches, the molecular representation is more accurate. It can be based on<br />
a set of paths (i.e. molecular fragments specifically chosen or randomly selected), or,<br />
more interestingly, use the whole molecule to estimate structural distances by<br />
dynamically exploring the mapping that can be done between two molecules.<br />
− 16 −
The K-INDSCAL Model for Heterogeneous<br />
Three-way Dissimilarity Data<br />
Laura Bocci 1 and Maurizio Vichi 2<br />
1 Department of Sociology and Communication, University of Rome “La<br />
Sapienza”, Rome, Italy laura.bocci@uniroma1.it<br />
2 Department of Statistics, Probability and Applied Statistics, University of Rome<br />
“La Sapienza”, Rome, Italy maurizio.vichi@uniroma1.it<br />
Abstract. The weighted Euclidean model proposed by Carroll and Chang (1970) is<br />
the best-known and most widely used model for multidimensional scaling of three-way data.<br />
INDSCAL assumes a unique representation of the objects (common configuration<br />
space) and, for each occasion, weights for the dimensions of this representation (individual<br />
differences weights), thus assuming that there are no systematic<br />
"strong" differences between the data dissimilarity sources. However, when heterogeneous<br />
occasions are observed, it is shown that INDSCAL may fail to identify a<br />
common space representative of the observed data structure. In such a frequent and<br />
realistic situation it is reasonable to assume that there are systematic differences<br />
among some, say, K clusters of occasions in the evaluation of the dissimilarities,<br />
so that within each cluster of occasions the evaluations may differ only because of<br />
sampling or measurement errors, while between clusters of occasions the dissimilarities<br />
are really different. The heterogeneous INDSCAL in K classes model, simply called<br />
K-INDSCAL, is proposed to handle the above-described heterogeneity in the data.<br />
The model includes the individual weights in order to preserve the rotational invariance<br />
of the INDSCAL model. The high number of parameters of INDSCAL, and<br />
consequently of K-INDSCAL, may produce instability of the estimates; thus a parsimonious<br />
model, which drastically reduces the number of parameters, is also discussed.<br />
The parameters of the model are estimated in a least-squares fitting context and an<br />
efficient coordinate descent algorithm is given. The usefulness of K-INDSCAL is<br />
demonstrated by both artificial and real data analyses.<br />
Key words: Three-way dissimilarity data, INDSCAL, heterogeneous data dissimilarities<br />
References<br />
Carroll, J.D. and Chang, J.J. (1970): Analysis of individual differences in multidimensional<br />
scaling via an N-generalization of the Eckart-Young decomposition.<br />
Psychometrika, 35, 283–319.<br />
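For readers unfamiliar with the weighted Euclidean model, a small sketch of the INDSCAL distances and the least-squares loss referred to above. This illustrates the model equations only, not the authors' K-INDSCAL estimation algorithm; function names are ours:

```python
import numpy as np

def indscal_distances(X, W):
    """Weighted Euclidean (INDSCAL) model distances.
    X: (n, r) common configuration; W: (H, r) nonnegative dimension
    weights, one row per occasion.  Returns an (H, n, n) distance array
    with d[h, i, j] = sqrt(sum_a W[h, a] * (X[i, a] - X[j, a])**2)."""
    diff2 = (X[:, None, :] - X[None, :, :]) ** 2        # (n, n, r)
    return np.sqrt(np.einsum('ha,ija->hij', W, diff2))

def stress(D_obs, X, W):
    """Least-squares loss between observed dissimilarities and model
    distances -- the quantity the fitting procedure minimizes (K-INDSCAL
    additionally constrains W through K clusters of occasions)."""
    return float(((D_obs - indscal_distances(X, W)) ** 2).sum())
```

Occasion weights stretch or shrink the common dimensions: doubling the weight on one axis doubles that axis's contribution to the squared distance for that occasion.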
− 17 −
Weighting and Selecting Features<br />
in Fuzzy Clustering<br />
Christian Borgelt<br />
European Center for Soft Computing<br />
c/ Gonzalo Gutiérrez Quirós s/n, 33600 Mieres, Spain<br />
christian.borgelt@softcomputing.es<br />
Abstract. A serious problem in distance-based clustering is that the more dimensions<br />
(attributes) a dataset has, the more the distances between data points—and<br />
thus also the distances between data points and constructed cluster centers—tend to<br />
become uniform. This, of course, impedes the effectiveness of clustering, as distance-based<br />
clustering exploits that these distances differ. In addition, in practice often<br />
only a subset of the available attributes is relevant for forming clusters, even though<br />
this may not be known beforehand. In such cases it is desirable to have a clustering<br />
algorithm that automatically weights the attributes or even selects a proper subset.<br />
In this contribution I study the problem of weighting and selecting features in<br />
clustering and in particular in fuzzy clustering. Apart from reviewing straightforward<br />
modifications of Gustafson–Kessel fuzzy clustering (Gustafson and Kessel 1979) and<br />
attribute weighting fuzzy clustering (Keller and Klawonn 2000) that lead to simple,<br />
but effective attribute weighting schemes, I introduce a new feature selection method<br />
by applying the idea of an alternative to the fuzzifier (Klawonn and Höppner 2003)<br />
to the latter scheme. The resulting combined feature weighting and selection method<br />
has the advantage that the obtained clustering result on the chosen subspace coincides<br />
with the projection of the result obtained on the full data space. Finally I<br />
discuss an extension of this scheme to principal axes selection.<br />
Key words: fuzzy clustering, feature weighting, feature selection<br />
References<br />
1. Gustafson, E.E. and Kessel, W.C. (1979): Fuzzy Clustering with a Fuzzy Covariance<br />
Matrix. Proc. IEEE Conf. on Decision and Control (CDC 1979, San<br />
Diego, CA), 761–766. IEEE Press, Piscataway, NJ, USA.<br />
2. Keller, A. and Klawonn, F. (2000): Fuzzy Clustering with Weighting of Data<br />
Variables. Int. Journal of Uncertainty, Fuzziness and Knowledge-based Systems,<br />
8, 735–746. World Scientific, Hackensack, NJ, USA.<br />
3. Klawonn, F. and Höppner, F. (2003): What is Fuzzy about Fuzzy Clustering?<br />
Understanding and Improving the Concept of the Fuzzifier. Proc. 5th Int.<br />
Symposium on Intelligent Data Analysis (IDA 2003, Berlin, Germany), 254–<br />
264. Springer-Verlag, Berlin, Germany.<br />
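A compact sketch of the attribute-weighting idea reviewed above, in the spirit of Keller and Klawonn (2000): alternate the usual fuzzy c-means updates with an update of one global weight per attribute, so that attributes with large within-cluster dispersion are down-weighted. This is illustrative code under our own simplifications, not the author's combined weighting-and-selection scheme:

```python
import numpy as np

def weighted_fcm(X, K, m=2.0, t=2.0, iters=50):
    """Attribute-weighted fuzzy c-means sketch.  m is the fuzzifier, t the
    weighting exponent; alpha holds one weight per attribute (summing to 1)."""
    n, d = X.shape
    # simple deterministic spread initialization (an assumption of this sketch)
    centers = X[np.linspace(0, n - 1, K).astype(int)].astype(float)
    alpha = np.full(d, 1.0 / d)
    for _ in range(iters):
        # attribute-weighted squared distances to the centers
        d2 = ((alpha ** t) * (X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) + 1e-12
        # standard fuzzy c-means membership update
        u = 1.0 / ((d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1))).sum(axis=2)
        um = u ** m
        centers = (um.T @ X) / um.sum(axis=0)[:, None]
        # per-attribute within-cluster dispersion, then weight update
        s = (um[:, :, None] * (X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=(0, 1)) + 1e-12
        alpha = 1.0 / ((s[:, None] / s[None, :]) ** (1.0 / (t - 1))).sum(axis=1)
    return centers, u, alpha
```

On data with one cluster-defining attribute and one pure-noise attribute, the weight of the informative attribute dominates.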
− 18 −
Hidden Markov Model Based<br />
Classification of Natural Objects<br />
in Aerial Pictures<br />
Mohamed El Yazid Boudaren 1 , Abdenour Labed 1 , Adel Aziz Boulfekhar 1 ,<br />
and Yacine Amara 1<br />
Military Polytechnic School, Algiers boudarenyazid@hotmail.com<br />
Abstract. This work is part of a more global one that consists in creating virtual<br />
environments from aerial pictures combined with altimetry data. In such environments,<br />
while getting too close to the ground, one has to solve the problem of limited<br />
texture resolution. So, these textures have to be amplified to get more realistic<br />
scenes. Texture amplification must take account of object nature. This paper deals<br />
with the supervised classification of picture pixels in order to amplify texture resolution.<br />
For this purpose, we propose a hidden Markov model based approach that takes<br />
into account the spatial dependencies between natural objects present in the area<br />
of interest. HMMs have long been used to efficiently model one-dimensional data,<br />
in particular in speech recognition systems. In theory, HMMs can be applied as<br />
well to multi-dimensional data. However, the complexity of the algorithms grows<br />
exponentially in higher dimensions, so that, even in dimension 2, the usage of plain<br />
HMM becomes prohibitive in practice. To overcome the 2D-HMM complexity, we<br />
propose a two-level HMM, where the higher layer comprises one unique HMM consisting<br />
of super-states, each associated with one low-level HMM. Our model differs<br />
from the classic embedded HMM in that it deals with pixel blocks instead of pixel lines<br />
as elementary symbols. Another difference is that our high-level HMM is ergodic;<br />
this enables our model to accurately capture spatial dependencies between natural<br />
objects. The training of our HMM models is done in two steps: first, the low-level<br />
HMMs are trained on unitextured pictures. Second, the high-level one is trained<br />
on multitextured pictures of the same region using the parameters of the HMMs of the<br />
first step, according to the Baum-Welch algorithm with slight modifications. For our<br />
experiments, we used real-world aerial pictures of a relatively large area, with a resolution<br />
of 50 centimeters. Our results were then used to generate a virtual interactive<br />
3D scene. This showed that our classifier was able to satisfactorily reproduce the<br />
original terrain.<br />
Key words: Hidden Markov models, Aerial pictures supervised classification, texture<br />
recognition<br />
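The abstract's starting point — that the machinery making one-dimensional HMMs efficient scales as O(T·N²) while plain two-dimensional generalizations blow up — can be illustrated with a minimal sketch of the forward algorithm. This is illustrative only and does not reproduce the authors' two-level, block-based model:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm for a discrete HMM, returning P(obs | model)
    in O(T * N^2) time.  pi: (N,) initial distribution; A: (N, N)
    transition matrix; B: (N, M) emission matrix; obs: symbol indices."""
    alpha = pi * B[:, obs[0]]          # initialize with the first emission
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate, then emit
    return float(alpha.sum())
```

A quick sanity check: summed over every possible observation sequence of a fixed length, the likelihoods must add up to 1.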
− 19 −
On optimistic bias in reporting<br />
microarray-based classification accuracy<br />
Anne-Laure Boulesteix 1 and Martin Slawski 1<br />
1 Sylvia Lawry Centre for Multiple Sclerosis Research, Hohenlindenerstr. 1,<br />
81677-München, Germany, boulesteix@slcmsr.org<br />
Abstract. Almost all published medical studies present positive research results.<br />
In the special case of microarray studies, which often focus on, e.g., the identification<br />
of differentially expressed genes or the construction of outcome prediction<br />
rules, it means that almost all studies report at least a few significant differentially<br />
expressed genes or a small prediction error, respectively. Authors are virtually urged<br />
to “find something significant” in their data, which encourages the publication of<br />
wrong research findings due to multiple comparison effects. If authors try a large<br />
number of different analysis methods and designs on their data, they are likely to<br />
obtain “acceptable results” with at least one of them. Microarray-based class prediction<br />
is particularly affected by this problem. Whereas logistic regression is routinely<br />
applied as the standard class prediction approach in the simple case where only a<br />
small number of predictors are available, there is no consensus on the procedure to<br />
be applied for classification using high-dimensional microarray data.<br />
It is well-known that, if several statistical methods are tried on the same microarray<br />
data set, one should report all results, not only the best ones (Dupuy and<br />
Simon, 2007). Through simulations and real data studies, we address this problem<br />
quantitatively and determine the effect of not respecting this “good practice” rule.<br />
Our approach consists of applying a large number of well-known classifiers combined<br />
with several variable selection procedures and different numbers of selected variables,<br />
and evaluating them following different schemes (see Boulesteix et al. 2008 for an<br />
overview). The considered data sets are real publicly available microarray data sets,<br />
with or without random permutation of the class labels. The output of our study is<br />
the distribution of the minimally selected error rate and the bias resulting from this<br />
optimal selection in the different settings.<br />
References<br />
Boulesteix, A.-L., Strobl, C., Augustin, T. and Daumer, M. (2008): Evaluating<br />
microarray-based classifiers: An overview. Cancer Informatics, 4.<br />
Dupuy, A. and Simon, R. (2007): Critical Review of Published Microarray Studies for<br />
Cancer Outcome and Guidelines on Statistical Analysis and Reporting. Journal<br />
of the National Cancer Institute, 99, 147–157.<br />
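The effect studied above can be reproduced in a toy simulation: on data whose labels are pure noise (so every classifier's true error is 50%), reporting only the smallest cross-validated error over many analysis variants is optimistically biased. In the sketch below, a nearest-centroid rule on random feature subsets stands in for the classifier and variable-selection combinations of the study; all names and parameters are ours:

```python
import numpy as np

def optimistic_bias_demo(n=40, p=50, n_methods=20, folds=5, seed=0):
    """Compare the minimum cross-validated error over many 'methods'
    with the average error, on label-permuted (pure noise) data."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    y = np.tile([0, 1], n // 2)            # balanced labels, independent of X
    errors = []
    for _ in range(n_methods):
        feats = rng.choice(p, 5, replace=False)   # one 'analysis variant'
        idx = rng.permutation(n)
        wrong = 0
        for f in range(folds):                    # simple cross-validation
            test = idx[f::folds]
            train = np.setdiff1d(idx, test)
            Xt, yt = X[train][:, feats], y[train]
            c0, c1 = Xt[yt == 0].mean(0), Xt[yt == 1].mean(0)
            d0 = np.linalg.norm(X[test][:, feats] - c0, axis=1)
            d1 = np.linalg.norm(X[test][:, feats] - c1, axis=1)
            wrong += ((d1 < d0).astype(int) != y[test]).sum()
        errors.append(wrong / n)
    return min(errors), float(np.mean(errors))
```

The gap between the minimum and the mean is exactly the selection bias the abstract quantifies: the best-of-many error understates the true 50% error of every individual method.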
− 20 −
Practical experiences from Credit Scoring<br />
projects for Chilean financial organizations<br />
Cristian Bravo, Sebastian Maldonado and Richard Weber<br />
Department of Industrial Engineering, University of Chile.<br />
cbravo@dii.uchile.cl, semaldon@ing.uchile.cl, rweber@dii.uchile.cl<br />
Abstract. All financial organizations that offer loans to their customers face the<br />
problem of determining whether the loaned money will be returned. Credit scoring systems<br />
have been successfully applied to determine the probability that a certain customer<br />
will fail to pay back the received credit. In many cases these systems are based<br />
on experience but offer a "closed solution" in which the user has few possibilities to<br />
influence the decision process.<br />
We have developed credit scoring systems for several Chilean financial organizations<br />
mapping the KDD process (Knowledge Discovery in Databases) to their<br />
special needs. This paper presents our experiences from these projects and explains<br />
in detail how we solved the problems in each step of the KDD process.<br />
In the data mining step we applied Logistic Regression and Support Vector<br />
Machines as classification techniques, comparing both their performance and their<br />
flexibility. A particular wrapper approach for feature selection using Support Vector<br />
Machines has been developed. Comparing this approach with alternative schemes<br />
underlines its strengths in terms of classification performance and selected features.<br />
Since most KDD projects propose just static solutions we had to develop a<br />
module for model updating that will be described in detail. In particular we propose<br />
to apply statistical techniques in order to determine changes in feature weights and<br />
structural changes in the respective universe.<br />
During the development of our solutions the users gained important insights into<br />
their customers' behavior; some of these were surprising, others merely confirmed notions<br />
the respective experts had held before. By using the systems in daily operation, the<br />
rates of false positives as well as false negatives could be reduced, leading to higher<br />
coverage of the respective market.<br />
Key words: Credit scoring, Classification, Support Vector Machines.<br />
References<br />
Famili, A., Shen, W.-M., Weber, R., Simoudis, E. (1997): Data Preprocessing and<br />
Intelligent Data Analysis. Intelligent Data Analysis 1, No. 1, 3-23.<br />
− 21 −
Analyzing the Stability of Price Response<br />
Functions - Measuring the Influence of<br />
Different Parameters in a Monte Carlo<br />
Comparison<br />
Michael Brusch and Daniel Baier<br />
Institute of Business Administration and Economics,<br />
Brandenburg University of Technology Cottbus, Postbox 101344,<br />
D-03013 Cottbus, Germany {m.brusch|daniel.baier}@tu-cottbus.de<br />
Abstract. The usage, and therefore the estimation, of price response functions (see,<br />
e.g., Steiner et al. 2007) is very important for strategic marketing decisions. Typically,<br />
price response functions with an empirical basis are used (see, e.g., Balderjahn<br />
1998). However, such price response functions are subject to many disturbing influence<br />
factors, e.g. the assumed profit-maximal price and the assumed corresponding<br />
quantity of sales.<br />
In such cases, the question of how stable the estimated price response function is<br />
has not been answered sufficiently up to now. In this paper, we pursue the question of how<br />
large (and what kind of) errors in market research are tolerable for a stable price<br />
response function. Innovative technologies and systems of house power engineering<br />
are used as an application example (see Brusch et al. 2003). For the comparisons, a<br />
factorial design with synthetically generated and disturbed data is used.<br />
Key words: Monte Carlo comparison, Price response functions<br />
References<br />
Balderjahn, I. (1998): Empirical analysis of price response functions. In: I. Balderjahn,<br />
C. Mennicken, E. Vernette (Eds.): New Developments and Approaches in<br />
Consumer Behavior Research. Schäffer-Poeschel/Macmillan, 185–200.<br />
Brusch, M., Zühlsdorff, D., Baier, D. and Kessler, A. (2003): Neue Technologien und<br />
erneuerbare Energiequellen auf dem Vormarsch. Energiewirtschaftliche Tagesfragen,<br />
53, 12, 825–829.<br />
Steiner, W. J., Brezger, A. and Belitz, Ch. (2007): Flexible estimation of price<br />
response functions using retail scanner data. Journal of retailing and consumer<br />
services, 14, afl. 6 (11), 383–393.<br />
− 22 −
Motif-based Classification of Time Series with<br />
Bayesian Networks and SVMs<br />
Krisztian Antal Buza 1 and Lars Schmidt-Thieme 1<br />
University of Hildesheim, Information Systems and Machine Learning Lab<br />
{buza,schmidt-thieme}@ismll.uni-hildesheim.de<br />
Abstract. Classification of time series is of crucial importance in a wide range of<br />
applications. One possible solution for this problem is based on characteristic<br />
local patterns of time series, so-called motifs [Patel 2002].<br />
We present a novel technique to make the classification of (multivariate) time<br />
series more accurate. We define different types of motifs. The simplest ones are frequent<br />
subseries. In the case of noisy time series, as well as in several application domains, these<br />
simple motifs are not sufficient; more complex ones are necessary. Complex motifs<br />
used in our work may consist of several subseries, continuous and non-continuous<br />
parts, and "joker" parts. We present an efficient algorithm for mining complex motifs in<br />
time series. We extend the highly efficient implementation of the Apriori algorithm<br />
described in [Borgelt 2003] to our task.<br />
We evaluate our method on real medical data, which consists of time series of<br />
dialysis sessions. We compare different types of motifs according to their ability to<br />
predict the class of (multivariate) time series. We show that additional<br />
motif features significantly improve the accuracy of Bayesian Networks and Support<br />
Vector Machines for the classification of time series.<br />
Key words: Time Series, Complex motifs, Bayesian Networks, SVM<br />
References<br />
Borgelt, C. (2003): Efficient Implementations of Apriori and Eclat. 1st Workshop of<br />
Frequent Item Set Mining Implementations (FIMI 2003, Melbourne, FL, USA).<br />
Kunik, V., Solan, Z., Edelman, S., Ruppin, E. and Horn, D. (2005): Motif<br />
Extraction and Protein Classification. IEEE Computational Systems Bioinformatics<br />
Conference (CSB'05), pp. 80-85.<br />
Ferreira, P. G. and Azevedo, P. (2005): Protein Sequence Classification through Relevant<br />
Sequence Mining and Bayes Classifiers. Proceedings of the 12th Portuguese<br />
Conference on Artificial Intelligence, pp. 236-247, LNAI 3808, Springer-Verlag.<br />
Patel, P., Keogh, E., Lin, J. and Lonardi, S. (2002): Mining Motifs in Massive<br />
Time Series Databases. Proceedings of the 2002 IEEE International Conference<br />
on Data Mining (ICDM 2002).<br />
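The simplest motif type mentioned above, frequent contiguous subseries, can be sketched in a few lines, together with the derived binary motif features that could augment the inputs of a Bayesian network or SVM. This is an illustration under the assumption that the series is already discretized into symbols; the paper's complex motifs with non-continuous and "joker" parts are not covered, and all names are ours:

```python
from collections import Counter

def frequent_motifs(series, length, min_count):
    """Mine the simplest motif notion: contiguous subseries of the given
    length occurring at least min_count times in a symbol sequence."""
    counts = Counter(
        tuple(series[i:i + length]) for i in range(len(series) - length + 1)
    )
    return {m: c for m, c in counts.items() if c >= min_count}

def motif_features(series, motifs, length):
    """Binary feature vector ('does motif occur in this series?') used to
    augment the inputs of a downstream classifier."""
    present = {tuple(series[i:i + length]) for i in range(len(series) - length + 1)}
    return [int(m in present) for m in motifs]
```

Mining complex, gapped motifs would replace the sliding-window counter with an Apriori-style candidate generation over subseries combinations, as the abstract describes.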
− 23 −
Visualizing data in Gaussian mixture model<br />
classification<br />
Daniela G. Calo’ and Cinzia Viroli<br />
Department of Statistics - University of Bologna<br />
via Belle Arti, 41 - 40126 Bologna, Italy<br />
danielagiovanna.calo@unibo.it, cinzia.viroli@unibo.it<br />
Abstract. The paper presents a post-processing strategy for producing low-dimensional<br />
summary plots of the data after a Gaussian mixture classification model has been<br />
fitted. The most revealing projections are those along which the class-conditional<br />
densities are maximally separable. We consider a particular probability product kernel<br />
as a measure of similarity or affinity between class-conditional distributions. It<br />
takes an appealing closed form in the case of Gaussian mixture components. The<br />
performance of the proposed strategy has been evaluated on simulated and real data.<br />
Key words: Gaussian mixture models, Low-dimensional plots, Normalized expected<br />
likelihood kernel, Bayes error<br />
References<br />
Chan, A.B., Vasconcelos, N. and Moreno, P.J. (2004): A Family of Probabilistic<br />
Kernels Based on Information Divergence. Technical Report, University of California,<br />
San Diego.<br />
Jebara, T. and Kondor, R. (2004): Probability Product Kernels. Journal of Machine<br />
Learning Research, 5, 819–844.<br />
McLachlan, G.J. and Peel, D. (2000): Finite Mixture Models. Wiley, New York.<br />
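The "appealing closed form" mentioned in the abstract can be made concrete: for the probability product kernel with ρ = 1 (the expected likelihood kernel), the kernel between two Gaussians is itself a Gaussian density, N(μ1; μ2, Σ1 + Σ2), and the kernel between two Gaussian mixtures is the weighted sum of pairwise component kernels. A sketch (function names are ours, not from the paper):

```python
import numpy as np

def expected_likelihood_kernel(mu1, S1, mu2, S2):
    """Probability product kernel with rho = 1 between two Gaussians:
    the closed form is the Gaussian density N(mu1; mu2, S1 + S2)."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    S = np.asarray(S1, float) + np.asarray(S2, float)
    d = len(mu1)
    diff = mu1 - mu2
    expo = -0.5 * diff @ np.linalg.solve(S, diff)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(S))
    return float(norm * np.exp(expo))

def mixture_class_kernel(w1, mus1, Ss1, w2, mus2, Ss2):
    """Kernel between two Gaussian mixtures (e.g. two class-conditional
    densities): the weighted sum of pairwise component kernels."""
    return sum(a * b * expected_likelihood_kernel(m1, S1, m2, S2)
               for a, m1, S1 in zip(w1, mus1, Ss1)
               for b, m2, S2 in zip(w2, mus2, Ss2))
```

Projections along which this affinity between the class-conditional mixtures is small are the maximally separating directions the summary plots aim for.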
− 24 −
A novel approach to construct discrete support<br />
vector machine classifiers<br />
Marco Caserta and Stefan Lessmann<br />
Institute of Information Systems<br />
University of Hamburg, Germany<br />
Abstract. The support of managerial decision making by means of data mining<br />
has received considerable attention in the academic literature as well as corporate<br />
practice. This paper considers support vector machines (SVMs) which represent a<br />
popular classification method that may be used in data mining to, e.g., guide the selection<br />
of customers for a direct marketing campaign or assess the credibility of loan<br />
applications. Recently, Orsenigo and Vercellis proposed a<br />
novel discrete support vector machine (DSVM) and demonstrated its effectiveness in<br />
several empirical studies. Building a respective classifier involves solving an integer<br />
program which is a challenging computational task in general and in large-scale data<br />
mining settings in particular. This paper strives to improve upon a linear programming<br />
based heuristic, originally proposed by Orsenigo and Vercellis for solving the<br />
DSVM program. The core of the suggested procedure consists of a recursive algorithm<br />
that solves the (linear) relaxation of DSVM and exploits dual information to<br />
construct a smaller sized sub-program with integer constraints that may be solved<br />
to optimality. The sequence of linear and integer programs solved during the course<br />
of the algorithm provides upper and lower bounds of the final solution which are<br />
employed as termination criterion. Empirical experiments are conducted to scrutinize<br />
the suitability of the proposed procedure and examine the problem size (i.e.<br />
the number of examples and features) that can be processed with state-of-the-art<br />
integer programming techniques.<br />
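The bounding principle behind such relaxation-based heuristics can be illustrated on a much smaller integer program than the DSVM itself; the sketch below (not the authors' algorithm) uses a 0/1 knapsack, whose LP relaxation happens to be solved exactly by a greedy fractional fill:

```python
from itertools import combinations

def knapsack_bounds(values, weights, capacity):
    """Upper bound from the LP relaxation (greedy fractional fill) and
    lower bound from the best integer solution (brute force, small n).
    For a maximization problem the relaxation bounds from above."""
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    upper, room = 0.0, float(capacity)
    for i in order:
        take = min(1.0, room / weights[i])   # fractional amount of item i
        upper += take * values[i]
        room -= take * weights[i]
        if room <= 0:
            break
    best, n = 0, len(values)
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            if sum(weights[i] for i in subset) <= capacity:
                best = max(best, sum(values[i] for i in subset))
    return best, upper   # feasible integer value <= optimum <= LP bound

lo, up = knapsack_bounds([60, 100, 120], [10, 20, 30], 50)
```

When the gap between the two bounds closes, the procedure can terminate with a proven (near-)optimal solution; this is the role the LP/IP sequence plays in the heuristic described above.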
− 25 −
Modeling the Classification of Heterogeneous<br />
Data<br />
Dorin Carstoiu, Tudor Ionescu, and Alexandra Cernian<br />
University of Bucharest<br />
Faculty of Automatic Control and Computer Science
Bucharest, Romania<br />
{dorin.carstoiu,tudor.ionescu,alexcernian}@yahoo.com<br />
Abstract. The goal of this work is to study the feasibility of a Heterogeneous Data
Classification and Search (HDCS) system and to provide a possible design for its
implementation. In order to design an HDCS system we propose an actor-oriented
modeling technique, for which we show the information flow. We have identified 6
different actors (subsystems) which collaborate to construct a file sheet and produce
the final search result. The first 5 actors add information to the file sheet, which is
afterwards used by the final actor to produce the desired result.
Given the vast quantity of data and the variety of formats and encodings in which
it exists, a semantic approach based on metadata has been chosen. Instead of digging
into the actual data to extract information, we use the context of the file to collect its
metadata. The metadata is afterwards used for the classification process. The reason
for this approach is that data are made available by people who are interested in
other people understanding what the respective data are about. This observation
provided the confidence needed to pursue the presented approach.
The HDCS system we propose combines techniques from conventional search
systems, classification systems, and search-results clustering systems, while also
providing original solutions, such as an innovative data sampling method.
− 26 −
Applying Rough Set Theory to Constructing<br />
Knowledge Base for Critical Military<br />
Commodity Management<br />
Hua-Kai Chiou, 1 Yong-Ting Huang 2 and Gia-Shie Liu 3<br />
1 Department of International Business, China Institute of Technology. 245 Sec.3,<br />
Academia Rd., Nangang Taipei 11581, Taiwan. hkchiou@cc.chit.edu.tw<br />
2 Graduate School of Resources Management and Decision Science,National<br />
Defense University. 70, Sec.2, Central North Rd., Beitou Taipei 11258 Taiwan.<br />
coby777.tw@yahoo.com.tw<br />
3 Department of Information Management,Lunghwa University of Science and<br />
Technology. 300, Sec.1, Wanshou Rd., Gueishan Shiang, Taoyuan County 33306,<br />
Taiwan. liugtw@yahoo.com.tw<br />
Abstract. Reduction of pattern dimensionality via feature extraction and feature<br />
selection belongs to the most fundamental steps in data preprocessing. Feature selection<br />
is a valuable technique in data analysis for information-preserving data reduction.<br />
Features constituting the object's pattern may be relevant or irrelevant. A large
number of methods, like discriminant analysis, logit analysis, recursive partitioning
algorithms, etc., have been used in the past for prediction problems in pattern
recognition. These traditional approaches suffer from some limitations, often due to
unrealistic statistical assumptions or to a confusing language of communication
with the decision makers. In this paper, we present applications of rough set
methods for feature selection in pattern recognition. Firstly, we employ a Delphi
process to generate 39 critical attributes with respect to 6 key factors for
evaluation. We then utilize the rough set approach to discover a set of rules able to
discriminate among the considered attributes for critical military commodity
management. Ten reduct sets and 226 rules were derived from our proposed model;
some concluding remarks are stated in the final section. Through this research
we found that rough set theory is an efficient technique for pattern recognition in
solving real-world decision-making problems.
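For illustration, the core rough-set quantities (indiscernibility classes, positive region, dependency degree, and reducts) can be sketched on a toy decision table; the table and attribute values below are hypothetical and far smaller than the 39-attribute model of the paper:

```python
from itertools import combinations

# Toy decision table: each row is (condition attribute values, decision).
rows = [
    ((0, 1, 0), 'keep'),
    ((0, 1, 1), 'keep'),
    ((1, 0, 0), 'drop'),
    ((1, 0, 1), 'drop'),
    ((1, 1, 0), 'keep'),
]

def partition(attrs):
    """Indiscernibility classes induced by a subset of attribute indices."""
    classes = {}
    for idx, (cond, _) in enumerate(rows):
        classes.setdefault(tuple(cond[a] for a in attrs), []).append(idx)
    return classes.values()

def dependency(attrs):
    """Fraction of objects in the positive region: objects whose
    indiscernibility class is consistent with respect to the decision."""
    pos = sum(len(cls) for cls in partition(attrs)
              if len({rows[i][1] for i in cls}) == 1)
    return pos / len(rows)

full = dependency((0, 1, 2))
# Reducts: minimal attribute subsets preserving the full dependency degree.
reducts = [s for r in range(1, 4) for s in combinations(range(3), r)
           if dependency(s) == full
           and not any(dependency(t) == full
                       for t in combinations(s, len(s) - 1) if len(s) > 1)]
```

Decision rules can then be read off the indiscernibility classes of a reduct, since each consistent class maps a condition pattern to a single decision.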
Key words: Pattern Recognition, Rough Set Theory, Delphi Process, Discriminant<br />
Analysis<br />
− 27 −
Correspondence Analysis for Exploring the<br />
Implementation of One Village One Product<br />
Programs in Taiwan<br />
Hua-Kai Chiou, 1 Benjamin J.C. Yuan 2 and Yen-Wen Wang 2,3<br />
1 Department of International Business, China Institute of Technology. 245 Sec.3,<br />
Academia Rd., Nangang Taipei 11581, Taiwan. hkchiou@cc.chit.edu.tw<br />
2 Institute of Management of Technology, National Chiao Tung University. 1001,<br />
Ta-Hsueh Rd., Hsinchu 30010, Taiwan. benjamin@cc.nctu.edu.tw<br />
3 Industrial Economics & Knowledge Center,Industrial Technology Research<br />
Institute. 195 Sec.4, Chung Hsing Rd., Chutung Hsinchu 31040, Taiwan.<br />
stevenwang@itri.org.tw<br />
Abstract. The One Village One Product (OVOP) program is a community-centered,
demand-driven local economic development approach initiated by Oita Prefecture
in Japan in the 1970s. The uniqueness of the approach is the intention to achieve
regional economic development by adding value to products made from locally
available resources through processing, quality control and marketing. The OVOP
programs in Taiwan are supported by the government with the objective of
promoting the economic development and cooperative relationships of each county
and region through localization and innovation. Firstly, we employ a Delphi process
to converge on 18 critical factors with respect to 3 key dimensions for evaluation.
We then utilize correspondence analysis to explore the implementation of the OVOP
programs and derive some meaningful suggestions on policy direction from these
empirical cases. Through this study we successfully demonstrate that correspondence
analysis is an efficient technique for real-world industrial analysis and strategy
management.
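As a sketch of the underlying computation (not the paper's data), classical correspondence analysis can be obtained from the SVD of the standardized residuals of a contingency table:

```python
import numpy as np

def correspondence_analysis(table, k=2):
    """Classical CA of a two-way contingency table: SVD of the matrix of
    standardized residuals, returning row and column principal coordinates."""
    P = table / table.sum()
    r = P.sum(axis=1)             # row masses
    c = P.sum(axis=0)             # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    rows = U[:, :k] * sv[:k] / np.sqrt(r)[:, None]
    cols = Vt.T[:, :k] * sv[:k] / np.sqrt(c)[:, None]
    inertia = sv ** 2             # principal inertia per axis
    return rows, cols, inertia

# Illustrative table: villages (rows) by evaluation dimensions (columns).
table = np.array([[20., 10., 5.], [10., 15., 10.], [5., 10., 20.]])
rows, cols, inertia = correspondence_analysis(table)
```

The sum of the squared singular values equals the total inertia, i.e. the chi-square statistic of the table divided by the grand total.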
Key words: OVOP, Correspondence Analysis, Delphi Process, Industrial Analysis<br />
− 28 −
Extending Multivariate Planing<br />
Mario Cortina–Borja 1<br />
Centre for Paediatric Epidemiology and Biostatistics<br />
Institute of Child Health, University College London<br />
30 Guilford Street, London WC1N 1EH, UK<br />
M.Cortina@ich.ucl.ac.uk<br />
Abstract. Friedman and Rafsky (1981) introduced planing, a visualization technique
based on a triangulation procedure for constructing 2–dimensional representations
from a set of n multivariate observations: relatively few distances from the original
distance matrix are preserved exactly, and the observations are plotted in the plane
accordingly. The way these distances are selected and the order in which the
observations are positioned in the plane are induced by a minimal spanning tree
(MST) of the data.
Other spanning trees could be used to provide the set of distances to be preserved<br />
exactly in a 2–dimensional configuration. One is the exodic tree (ET ) (Gilbert,<br />
1965), which is a not quite minimal spanning tree, though it may be regarded as a<br />
close approximation to the MST (Cortina–Borja and Robinson, 2000). To construct<br />
an ET we choose any point of the dataset as the root and label it P0; we then label
the remaining n − 1 points as {P1, P2, · · · , Pn−1}, the indices being assigned so
that the points are ordered by increasing distance from P0. Next, we link each point
Pi (i ≥ 1) to the point Pj chosen from {P0, P1, · · · , Pi−1} that minimizes the
distance to Pi.
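The construction just described can be sketched directly (hypothetical coordinates; `math.dist` requires Python 3.8+):

```python
import math

def exodic_tree(points):
    """Order points by distance from a chosen root, then link each point
    to its nearest predecessor in that order (Gilbert's exodic tree)."""
    root = points[0]                # any point may serve as P0
    order = sorted(range(len(points)),
                   key=lambda i: math.dist(points[i], root))
    edges = []
    for pos in range(1, len(order)):
        i = order[pos]
        # nearest point among those already placed in the tree
        j = min(order[:pos], key=lambda k: math.dist(points[i], points[k]))
        edges.append((j, i))
    return edges

pts = [(0, 0), (1, 0), (0, 2), (1, 2.1), (5, 5)]
edges = exodic_tree(pts)
```

The result is a spanning tree with n − 1 edges whose preserved distances can then serve as the skeleton of the low-dimensional configuration.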
This paper extends two aspects of planing: first, obtaining 3–dimensional configurations;<br />
second, using the ET as the structure defining the distances to be preserved<br />
in the low–dimensional representation.<br />
Key words: Exodic Tree, Minimal Spanning Tree, Planing, Visualization<br />
References<br />
Cortina–Borja, M. and Robinson, T. (2000) Estimating the Asymptotic Constants of<br />
the Total Length of the Euclidean Minimal Spanning Tree with Power–Weighted<br />
Edges. Statistics and Probability Letters, 47, 125–128.<br />
Friedman, J.H. and Rafsky, L.C. (1981) Graphics for the Multivariate Two–Sample
Problem. Journal of the American Statistical Association, 76, 277–287.
Gilbert, E.N. (1965) Random Minimal Trees. SIAM Journal on Applied Mathematics,
13, 376–387.
− 29 −
Principal Axis Analysis with HDLSS bonuses!<br />
Frank Critchley 1 , Ana Pires 2 , and Conceição Amado 2<br />
1 Open University, UK<br />
F.Critchley@open.ac.uk<br />
2 IST, Lisbon<br />
Abstract. Principal axis analysis rotates standardised principal components to optimally<br />
detect subgroup structure, rotation being based on preferred directions in<br />
the spherised data. As such, it is a computationally efficient method of exploratory<br />
data analysis, particularly well-suited to detecting mixtures of elliptically contoured<br />
distributions. High dimensional, low sample size (HDLSS) data are also discussed.<br />
Overall, principal axis analysis exemplifies the maxim: two decompositions are better<br />
than one. More technically, it is an example of invariant coordinate selection<br />
(ICS).<br />
− 30 −
Augmenting Model-Based Clustering with<br />
Generalized Linkage methods<br />
Nema Dean 1 and Rebecca Nugent 2<br />
1 Department of Statistics, University of Glasgow, 15 University Gardens,<br />
Glasgow G12 8QW, UK. nema@stats.gla.ac.uk<br />
2 Department of Statistics, Carnegie Mellon University, Baker Hall, Pittsburgh,<br />
PA 15213, USA. rnugent@stat.cmu.edu<br />
Abstract. The fundamental assumption made by model-based clustering (Fraley<br />
and Raftery 1998) is that the groups or sub-populations underlying the data have<br />
(multivariate) Gaussian distributions, giving the overall population a finite mixture<br />
model distribution. An additional assumption is that the number and type of
components found to best fit the data are a good estimate of the number and type
of true groups in the data. Given the shape assumptions implicit in the choice of
Gaussian distributions - elliptical, symmetric contours - in cases of skewed, curved
or more generally complex-shaped groups, the equivalence of the mixture model
components and the underlying groups is likely false.
Since general continuous densities can be modelled arbitrarily well by mixtures<br />
of Gaussian densities, the mixture model chosen may still be a good estimate of<br />
the density of the data but it is likely that more than one component is identified<br />
with each group. Generalized single linkage methods (Stuetzle and Nugent 2007)<br />
use density estimates to create density-based similarity (or dissimilarity) measures<br />
which can then be used as a replacement for Euclidean (or other types of) distance<br />
in hierarchical agglomerative methods. Applying this to the density estimate from
the model-based clustering, we can use the resulting dendrogram to visualize the
hierarchical structure of the components of the mixture model and make decisions
about combining components to estimate groups. Since it is difficult to easily summarize
information about complex shaped groups, offering a summary that is essentially<br />
a subset of components of the original mixture model with means and covariance<br />
matrices is an attractive alternative.<br />
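The agglomeration step can be sketched generically; the dissimilarity matrix below is a hypothetical stand-in (e.g. distances between fitted component means), whereas the paper uses a density-based measure:

```python
def single_linkage(dissim):
    """Agglomerative single-linkage clustering from a dissimilarity matrix;
    returns the merge history: which cluster pair merged, at what height."""
    clusters = {i: {i} for i in range(len(dissim))}
    history = []
    while len(clusters) > 1:
        # single linkage: cluster distance = minimum pairwise dissimilarity
        (a, b), h = min(
            (((a, b), min(dissim[i][j]
                          for i in clusters[a] for j in clusters[b]))
             for a in clusters for b in clusters if a < b),
            key=lambda t: t[1])
        clusters[a] |= clusters.pop(b)
        history.append(((a, b), h))
    return history

# Stand-in dissimilarities between four fitted mixture components.
D = [[0, 1, 6, 7],
     [1, 0, 5, 6],
     [6, 5, 0, 2],
     [7, 6, 2, 0]]
merges = single_linkage(D)
```

Cutting the resulting dendrogram at a chosen height yields the decision about which components to combine into estimated groups.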
Key words: Model-Based Clustering, Generalized Single Linkage Clustering<br />
References<br />
Fraley, C. and Raftery, A. E. (1998): How many clusters? Which clustering method?
Answers via model-based cluster analysis. The Computer Journal, 41, 578–588.
Stuetzle, W. and Nugent, R. (2007): A generalized single linkage method for estimating<br />
the cluster tree of a density. Technical Report 514, Department of Statistics,<br />
University of Washington.
− 31 −
Statistical analysis of human body movement<br />
and group interactions in response to music<br />
Frank Desmet, Marc Leman and Micheline Lesaffre<br />
IPEM, Department of Musicology, Ghent University, Belgium fm.desmet@ugent.be<br />
Abstract. The quantification of time series that relate to physiological data is a<br />
challenging research topic for music research. Up to now, most studies have focused<br />
on time dependent responses of individual subjects. However, little is known about<br />
time dependent responses of between-subject interactions. At IPEM, Ghent University,<br />
a large scale multidisciplinary research project targets the development of<br />
innovative music interaction based on the movement of groups of subjects. Based on<br />
a recent pilot experiment, we report new findings concerning the statistical analysis<br />
of group synchronicity in response to musical stimuli. The aim was to refine future<br />
experimental designs and to generate statistical pathways as practical guidelines for<br />
researchers. The experiment was carried out in the context of the ACCENTA 2007
Fair in Ghent. 16 groups of 4 subjects took part in an experiment where they had
to move a wireless Wii sensor in response to music. In the first condition, the subjects
were blindfolded, while in the second condition, the subjects could see each
other. The movements of the subjects were recorded as acceleration data on a PC.
Fourier coefficients, total intensity, intra-group correlations and sample entropy
characteristics were derived from these raw acceleration data. Combined with pre-
and post-survey data of the participants, we generated a multivariate dataset for
analysis. The statistical methods used in this study are basic descriptive statistics,<br />
paired correlation analysis, auto correlation analysis, regression analysis and GLM<br />
(general linear modeling). The different empirical methodologies were validated as<br />
potential tools for the study of social embodied music interaction. It was found that<br />
the synchronicity of the human-human interactions increases significantly in the<br />
social context. The type of music is the predominant factor for the human-music<br />
interaction in both the individual and the social context.<br />
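A simple synchronicity index of the kind used in such analyses can be sketched as the mean pairwise correlation within a group (the four series below are hypothetical):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def group_synchronicity(signals):
    """Mean pairwise correlation of the subjects' movement signals:
    a simple index of how synchronously a group moves."""
    pairs = [(i, j) for i in range(len(signals))
             for j in range(i + 1, len(signals))]
    return sum(pearson(signals[i], signals[j]) for i, j in pairs) / len(pairs)

# Four hypothetical movement-intensity series for one group of subjects.
group = [[1, 2, 3, 4, 5],
         [1.1, 2.0, 2.9, 4.2, 5.1],
         [2, 4, 6, 8, 10],
         [1, 2, 3, 4, 5.5]]
sync = group_synchronicity(group)
```

Comparing such an index between the blindfolded and the seeing condition is one way to quantify the reported increase in synchronicity in the social context.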
Key words: Human Movement, Social Interaction, Statistical Analysis, Music
− 32 −
Mixture Hidden Markov Models in Finance<br />
Research<br />
José G. Dias 1 , Jeroen K. Vermunt 2 and Sofia Ramos 3<br />
1 Department of Quantitative Methods, ISCTE – Higher Institute of Social
Sciences and Business Studies, Edifício ISCTE, Av. das Forças Armadas,
1649–026 Lisboa, Portugal, jose.dias@iscte.pt
2 Department of Methodology and Statistics, Tilburg University, P.O. Box 90153,
5000 LE Tilburg, The Netherlands, J.K.Vermunt@uvt.nl
3 Department of Finance, ISCTE – Higher Institute of Social Sciences and
Business Studies, Edifício ISCTE, Av. das Forças Armadas, 1649–026 Lisboa,
Portugal, sofia.ramos@iscte.pt
Abstract. Latent class or finite mixture modeling has proven to be a powerful<br />
tool for analyzing unobserved heterogeneity in a wide range of applications (see, for<br />
example, McLachlan and Peel (2000) or Dias and Vermunt (2007)). We introduce
into finance research the mixture hidden Markov model (MHMM), which takes into
account both time-constant unobserved heterogeneity between time series and
hidden regimes within them. This approach is flexible in the sense that it can deal
with the specific features of financial time series data, such as asymmetry, kurtosis,
and unobserved heterogeneity, aspects that are almost always ignored in finance
research. The methodology is applied to model simultaneously 12 time series of the
returns of Asian stock markets. Because we selected a heterogeneous sample of
countries including both developed and emerging countries, we expect that
heterogeneity in market returns due to country idiosyncrasies will show up in the
results. The best fitting model was the one with two latent classes or clusters at the
country level, which clearly differ in the switching dynamics between the two regimes.
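The within-series regime component of such models rests on standard hidden Markov machinery; below is a minimal sketch of the scaled forward recursion for a discrete-observation HMM (hypothetical two-regime parameters, not the fitted model of the paper):

```python
import numpy as np

def forward_loglik(pi, A, B, obs):
    """Log-likelihood of a discrete-observation HMM via the scaled forward
    recursion; pi: initial state probabilities, A: transition matrix,
    B[state, symbol]: emission probabilities, obs: observed symbol indices."""
    alpha = pi * B[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate, then weight by emission
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()            # rescale to avoid underflow
    return loglik

# Hypothetical two-regime model: a calm and a turbulent market state
# emitting discretized returns {down, flat, up}.
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.1, 0.6, 0.3], [0.4, 0.2, 0.4]])
ll = forward_loglik(pi, A, B, [1, 1, 0, 2])
```

In a mixture HMM, one such likelihood is computed per latent class and the class-specific likelihoods are combined with the class weights.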
Key words: latent class model, finite mixture model, hidden Markov model,
model-based clustering, stock indexes
References<br />
Dias, J.G., Vermunt, J.K. (2007): Latent class modeling of website users’ search<br />
patterns: Implications for online market segmentation. Journal of Retailing and<br />
Consumer Services, 14(6), 359–368.<br />
McLachlan, G.J., Peel, D. (2000): Finite Mixture Models. John Wiley & Sons, New<br />
York.<br />
− 33 −
Mapping Findspots of Roman Military<br />
Brickstamps in Mogontiacum (Mainz)<br />
and Archaeometrical Analysis<br />
Jens Dolata 1 , Hans-Joachim Mucha 2 , and Hans-Georg Bartel 3<br />
1 Generaldirektion Kulturelles Erbe Rheinland-Pfalz, Direktion Archäologie<br />
Mainz, Große Langgasse 29, D-55116 Mainz, Germany,<br />
dolata@ziegelforschung.de<br />
2 Weierstraß-Institut für Angewandte Analysis und Stochastik (WIAS),<br />
D-10117 Berlin, Germany, mucha@wias-berlin.de<br />
3 Institut für Chemie, Humboldt-Universität zu Berlin, Brook-Taylor-Straße 2,<br />
D-12489 Berlin, Germany, hg.bartel@yahoo.de<br />
Abstract. 1775 Roman military brickstamps dating to the 1st century A.D.
have been found in archaeological excavations in Mainz, the ancient Mogontiacum.
In the course of cataloguing them for a paper on Roman military archaeology, the
stamps have been classified and new types of stamps have been defined. All in all, 238
findspots of bricks and tiles of the 1st century have been investigated. Additionally,<br />
the findspots are described by survey-coordinates. The mapping of the brickstamps<br />
visualizes the size of the ancient city and gives details for the localization of military<br />
camps and of civil settlement. Dating the brickstamps by epigraphical investigation<br />
or by assigning them to a military brickyard based on geochemical analysis allows<br />
the mapping of different periods. Two main maps have been plotted: (A) The earliest<br />
brickstamps found in Mainz are from the period of Emperors Claudius and Nero (41 -<br />
68 A.D., n = 932). They were manufactured by soldiers of legiones XXII Primigenia<br />
and IIII Macedonica. (B) Brickstamps of legiones I Adiutrix, XIV Gemina, VII<br />
Gemina, and XXI Rapax belong to the Flavian period (69 - 96 A.D., n = 843).<br />
These two main maps can be compared with some maps showing a selection<br />
of brickstamps from 3rd and 4th centuries (Emperors Caracalla, Constantine I or<br />
Julian, and Valentinian I, n = 102). Thus the maps show a total of 1877 brickstamps<br />
from 246 sites. In this paper we try to improve the evaluation of all these maps for
the history of settlement and urban development. Using statistical methods we
compare the different entries of the maps. The single entries are not of equal weight,
depending on the radius of the finding area. These different weights can be taken
into account in statistical analysis, for instance, in nonparametric density
estimation. The intention is to obtain a mapping of densities of brickstamps for
different urban regions. By combining mapping with the search for dated brickstamps
there is a good chance of obtaining new sources for the ancient history of Mogontiacum.
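A weighted kernel density estimate of the kind described can be sketched as follows (hypothetical coordinates and weights):

```python
import math

def weighted_kde(points, weights, query, bandwidth):
    """Weighted Gaussian kernel density estimate at a 2-d query location;
    findspots with a large search radius would receive smaller weights."""
    total_w = sum(weights)
    dens = 0.0
    for (x, y), w in zip(points, weights):
        d2 = (query[0] - x) ** 2 + (query[1] - y) ** 2
        dens += w * math.exp(-d2 / (2 * bandwidth ** 2))
    return dens / (total_w * 2 * math.pi * bandwidth ** 2)

# Hypothetical findspot coordinates with precision weights.
spots = [(0, 0), (1, 0), (0, 1), (10, 10)]
w = [1.0, 1.0, 0.5, 1.0]
near = weighted_kde(spots, w, (0.5, 0.3), 1.0)
far = weighted_kde(spots, w, (20.0, 20.0), 1.0)
```

Evaluating such an estimate on a grid over the survey coordinates yields the density map of brickstamps for different urban regions.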
Key words: Roman bricks, archaeometry, nonparametric density, mapping<br />
− 34 −
Multimodal Performance Analysis of<br />
Electronic Sitar<br />
Arne Eigenfeldt 1 and Ajay Kapur 2<br />
1 Simon Fraser University<br />
Burnaby, BC, Canada<br />
2 California Institute of the Arts<br />
Valencia, CA, USA<br />
Abstract. This paper describes a custom-built system which extracts high-level
musical information from real-time sensor data received from an ESitar [1]. Data is
collected from sensors during rehearsal using one program, GATHER, and, combined
with audio analysis, is used to derive statistical coefficients which identify
three different playing styles of North Indian music: Alap, Ghat, and Jhala. A real-time
program, LISTEN, uses these coefficients in a multi-agent analysis to determine
the current playing style.
The ESitar is an instrument which gathers gesture data from a performing artist<br />
using sensors embedded on the traditional instrument. A number of different
performance parameters are captured, including fret detection (based on the position
of the left hand on the neck of the instrument), thumb pressure (based on right-hand
strumming), and 3 dimensions of neck acceleration. Audio features from the
instrument are computed as well, and include root mean square (RMS), spectral
centroid, spectral flux, and spectral rolloff at 85%.
Analysis is done on sample data to derive statistical information (minimum,
maximum, mean, standard deviation) for each sensor and audio feature, yielding
unique class coefficients. These coefficients are compared to incoming performance
data, which is also statistically analysed, by a multi-agent system.
Interactive and real-time computer music is becoming more complex, and the
requirements placed upon the software are increasing accordingly. Composers, hoping
to gain more understanding of a performer's actions, are looking not just at incoming
audio, but also at sensor data, for such information. Our research has focused
upon high-level musical cognition - playing style - rather than detail recognition
- beat or pitch tracking, for example - by augmenting real-time audio analysis with
sensor data analysis.
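The class-coefficient idea can be sketched as follows; the feature values and style profiles below are hypothetical, and the real system combines many sensors and audio features rather than a single series:

```python
def features(series):
    """Per-sensor statistical coefficients used as a style signature."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series) / n
    return [min(series), max(series), mean, var ** 0.5]

def classify(sample, class_profiles):
    """Assign the incoming window to the style whose stored feature
    profile is closest in (unweighted) Euclidean distance."""
    f = features(sample)
    def d(profile):
        return sum((a - b) ** 2 for a, b in zip(f, profile))
    return min(class_profiles, key=lambda style: d(class_profiles[style]))

# Hypothetical thumb-pressure profiles for the three playing styles.
profiles = {
    "Alap":  [0.0, 0.3, 0.1, 0.05],
    "Ghat":  [0.1, 0.7, 0.4, 0.15],
    "Jhala": [0.3, 1.0, 0.8, 0.2],
}
window = [0.75, 0.85, 0.8, 0.9, 0.7]
style = classify(window, profiles)
```

In the multi-agent setting, one such comparison runs per sensor or audio feature, and the agents' votes are combined into the final style decision.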
References<br />
[1] Kapur, A., Tindale, A., Benning, M. S. and P. Driessen. ”The KiOm: A Paradigm
for Collaborative Controller Design”, Proceedings of the International Computer
Music Conference, New Orleans, USA, 2006.
− 35 −
Data compression and regression based on<br />
local principal curves<br />
Jochen Einbeck 1 and Ludger Evers 2<br />
1 Department of Mathematical Sciences, Durham University, Durham, UK<br />
2 School of Mathematics, University of Bristol, Bristol, UK<br />
Abstract. Frequently the predictor space of a multivariate regression problem of<br />
the type y = f(x1, . . . , xp) + ɛ is in fact much lower-dimensional than p, often even<br />
(approximately) one-dimensional. Usual modeling attempts such as the additive<br />
model y = f1(x1) + . . . + fp(xp) + ɛ, which try to reduce the complexity of the<br />
regression problem by making additional structural assumptions, are then inefficient<br />
as they ignore the inherent structure of the predictor space and involve complicated<br />
model and variable selection stages.<br />
In a fundamentally different approach, one may consider first approximating<br />
the predictor space by a (usually nonlinear) curve passing through it, and then<br />
regressing the response only against the one-dimensional projections onto this curve.<br />
This entails the reduction from a p− to a one-dimensional regression problem.<br />
As a tool for the compression of the predictor space we apply local principal<br />
curves [1], which form a more flexible alternative to earlier proposed principal curve<br />
algorithms [2] as they also allow for branched or disconnected curves. Building on
the results presented in [1], we show how local principal curves can be
parameterized and how the projections are obtained. The regression step can then
be carried out using any nonparametric smoother. We illustrate the technique using
16- and higher-dimensional data from astrophysical applications. Possible extensions
to more than one-dimensional nonparametric summaries of the predictor space are<br />
discussed.<br />
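The projection step can be sketched for a curve given as a polyline (illustrative data; the actual parameterization of local principal curves is described in the paper):

```python
import numpy as np

def project_to_polyline(X, verts):
    """Project each point onto a polyline (the fitted principal curve,
    here given by its vertices) and return the arc-length parameter."""
    t = np.empty(len(X))
    seg_len = np.linalg.norm(np.diff(verts, axis=0), axis=1)
    seg_start = np.concatenate(([0.0], np.cumsum(seg_len)))[:-1]
    best = np.full(len(X), np.inf)
    for s, (a, b) in enumerate(zip(verts[:-1], verts[1:])):
        d = b - a
        L = seg_len[s]
        u = np.clip((X - a) @ d / L ** 2, 0.0, 1.0)  # position along segment
        proj = a + u[:, None] * d
        dist = np.linalg.norm(X - proj, axis=1)
        closer = dist < best                         # keep the nearest segment
        best[closer] = dist[closer]
        t[closer] = seg_start[s] + u[closer] * L
    return t

# Noisy points around a bent curve in the plane (illustrative data).
verts = np.array([[0., 0.], [1., 0.], [1., 1.]])
X = np.array([[0.5, 0.1], [0.9, -0.1], [1.1, 0.5], [0.95, 0.9]])
t = project_to_polyline(X, verts)
```

The returned arc-length parameters can then serve as the single predictor in any nonparametric smoother, completing the reduction to a one-dimensional regression problem.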
References<br />
[1] EINBECK, J., TUTZ, G., and EVERS, L. (2005). Exploring multivariate data<br />
structures with local principal curves. In: Weihs, C. and Gaul, W. (Eds.): Classification<br />
- The Ubiquitous Challenge. Springer, Heidelberg, pages 256-263.<br />
[2] CHANG, K. and GHOSH, J. (1998). Principal curves for nonlinear feature extraction<br />
and classification. SPIE Applications of Artificial Neural Networks in<br />
Image Processing III, 3307, 120–129.<br />
− 36 −
Regression-autoregression based clustering<br />
Igor Enyukov<br />
StatPoint Ltd., Moscow<br />
Abstract. The usual approach to clustering cases into k groups (for example,
k-means) can be regarded as a regression of the source variables on a set of k dummy
binary (indicator) variables which satisfy the following conditions:
• for the i-th case only one of these variables has value 1
• if the j-th such variable has value 1 for the i-th case, the case belongs
to the j-th group.
These dummy variables are in turn nonlinear functions of the source variables,
and their values are evaluated by performing a classification procedure.
The set of dependent variables and predictors is the same in this approach,
so it may be regarded as an autoregression approach. Such an approach may lead to
problems when we work with spatially distributed data (like objects with geographical
coordinates), because objects that are close in physical properties (for example,
in seasonal properties of river stocks) may be situated rather far apart in the
geographical space. In this paper it is suggested to use a smoothed variant of such
indicator functions. For this purpose radial basis functions (RBF) are used, with
their centers at the group centers obtained at the end of the procedure. In
this case the set of dependent variables (the source variables which we want to explain
by the clustering) may be distinct from the set of explanatory (independent, or
predictor) variables; for example, the latter may be geographical coordinates.
Such an approach may be regarded as regression-autoregression. The regression
problem is linear in the RBF. This approach makes it possible to determine the class
centers and to evaluate their number as well. For this purpose a kind of ”step-wise”
regression algorithm can be used.
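The linearity of the regression problem in the RBFs can be sketched as follows (hypothetical coordinates, responses, and centers):

```python
import numpy as np

def rbf_design(coords, centers, width):
    """Design matrix of Gaussian radial basis functions centred at the
    (current) group centres in predictor space."""
    d2 = ((coords[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2))

# Geographic coordinates (predictors) and a source variable to explain.
coords = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
y = np.array([1.0, 1.2, 4.0, 4.1])
centers = np.array([[0., 0.5], [5., 5.5]])   # two tentative group centres

Phi = rbf_design(coords, centers, 1.0)
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # regression is linear in the RBFs
fitted = Phi @ beta
```

A step-wise variant would add or remove candidate centres and re-fit this linear problem, using the fit quality to choose the number of classes.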
− 37 −
Real Options in the Valuation of New Products
Said Esber and Daniel Baier<br />
Lehrstuhl für Marketing und Innovationsmanagement, Erich-Weinert-Str. 1,<br />
D-03046 Cottbus<br />
saidesber1@yahoo.com, baier@tu-cottbus.de<br />
Abstract. When developing new products, it is very important for R&D management
to take technical, market and competitive uncertainties into account. Increasing
changes in markets and in the business environment mean that investment decisions
must more and more be taken under high uncertainty. Real options valuation
provides a better understanding of the uncertainties in R&D projects, of the
flexibility of management during the project life cycle, and of the selection of the
best project alternative. This contribution describes the application of the real
options approach in the field of information technology (IT) by the product
development management of a video conferencing system. First, an overview is given
of the real options approach, the properties of different types of real options, and
their relationship to other applicable supplementary or alternative valuation
methods for R&D projects (sensitivity analysis, scenario analysis, Monte Carlo
simulation and decision trees). Subsequently, a decision-supporting Excel-based
tool is used to compute the real options, to develop the decision trees and to carry
out the Monte Carlo simulations. Finally, the decision to introduce the video
conferencing system (BRAVIS) into the production process of German VW companies
is analysed as an application example.
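For illustration, a minimal binomial-lattice valuation of an option to invest can be sketched as follows (hypothetical figures, not the BRAVIS case; the Excel-based tool of the paper is not reproduced):

```python
def binomial_option_value(v0, cost, r, u, d, steps):
    """Value of a (European-style) option to invest: pay `cost` at the end
    to receive the project value, on a Cox-Ross-Rubinstein lattice."""
    q = (1 + r - d) / (u - d)           # risk-neutral up probability
    # Terminal payoffs after k up-moves and (steps - k) down-moves.
    values = [max(v0 * u ** k * d ** (steps - k) - cost, 0.0)
              for k in range(steps + 1)]
    for step in range(steps, 0, -1):    # roll back through the lattice
        values = [(q * values[k + 1] + (1 - q) * values[k]) / (1 + r)
                  for k in range(step)]
    return values[0]

# Hypothetical numbers for a product-development project.
option = binomial_option_value(v0=100.0, cost=100.0, r=0.0, u=1.1, d=0.9, steps=1)
```

With these numbers the static net present value of investing immediately is zero, while the option value of 5 reflects the value of managerial flexibility under uncertainty.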
Key words: Real Options, Option Valuation Methods, Decisions under Uncertainty,
IT Projects, Risk Analysis Methods
References<br />
Dixit, A. K. and Pindyck, R. S. (1995): The Options Approach to Capital Investment.
Harvard Business Review, May/June, 105–115.
Pritsch, G. (2000): Realoptionen als Controlling-Instrument - Das Beispiel pharmazeutische<br />
Forschung und Entwicklung. Gabler, Wiesbaden.<br />
Rese, A. and Baier, D. (2007): Deciding on new products using a computer-assisted<br />
real options approach. Int. J. of Techn. Intelligence & Planning, 3(3), 292–303.<br />
− 38 −
Regression and Classification using Bayesian<br />
Additive Voronoi Tessellation Models<br />
Ludger Evers 1<br />
School of Mathematics, University of Bristol, Bristol, UK, l.evers@bris.ac.uk<br />
Abstract. Voronoi-tessellation-based regression and classification models are based<br />
on approximating the unknown regression function by a discontinuous, piecewise<br />
constant (or linear) function. The discontinuities are modeled by a Voronoi tessellation<br />
of the covariate space. This distinguishes them from recursive partitioning<br />
models like CART which model the discontinuities by (typically axis aligned) hierarchical<br />
splits. Voronoi-tessellation-based regression and classification models are<br />
typically considered in a Bayesian framework, where inference is done using Reversible<br />
Jump MCMC techniques.<br />
These methods however possess two important drawbacks. In many situations<br />
only a small proportion of the covariates studied are relevant to the regression or<br />
classification task at hand. The pairwise distances, which the Voronoi tessellation<br />
is based on, are then dominated by the irrelevant covariates, i.e. it is increasingly<br />
difficult to find a Voronoi tessellation that is informative for the problem at hand.<br />
Second, the estimated regression function is, due to its high-dimensional and complex<br />
nature, typically difficult to interpret.<br />
We propose to use an additive model of the form ∑I fI(xI) that addresses these
two shortcomings. Each fI(·) is based on a Voronoi tessellation of a subspace of the
covariates. A hierarchical model is used for the inclusion of the covariates in order<br />
to ensure that the model makes sparing use of the covariates. In many situations,<br />
each of the fI(·) involves only a small number of covariates and thus allows for easy<br />
interpretation. A further benefit of this approach is that it allows for constructing<br />
faster mixing MCMC compared to the basic non-additive model.<br />
A case study and an empirical comparison with competing methods like CART, BART, MARS, and Support Vector Machines are presented for both regression and classification tasks.
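The prediction side of such a model can be sketched as follows (a minimal illustration only: centers, cell values, and covariate subsets are made up, and the reversible jump MCMC inference over tessellations is not shown):

```python
import numpy as np

def voronoi_predict(x, centers, values):
    """Piecewise-constant prediction: each point receives the value of the
    Voronoi cell (nearest center) it falls into."""
    d = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
    return values[np.argmin(d, axis=1)]

def additive_voronoi_predict(x, components):
    """Additive model: a sum of Voronoi models, each defined on a small
    subset of the covariates."""
    pred = np.zeros(len(x))
    for cols, centers, values in components:
        pred += voronoi_predict(x[:, cols], centers, values)
    return pred

# Two one-covariate components f1(x1) and f2(x2) with made-up parameters.
components = [
    ([0], np.array([[0.0], [1.0]]), np.array([1.0, 3.0])),
    ([1], np.array([[0.0], [1.0]]), np.array([0.5, 2.0])),
]
x = np.array([[0.1, 0.9], [0.8, 0.2]])
print(additive_voronoi_predict(x, components))  # -> [3.  3.5]
```

Each component involving only one or two covariates is what makes the fitted function easy to inspect and plot.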
Keywords<br />
Additive model, random basis, transdimensional simulation.<br />
− 39 −
Cross-linguistic regularities in the monosyllabic system
Gertraud Fenk-Oczlon and August Fenk<br />
Alps-Adriatic University of Klagenfurt<br />
Abstract. We assumed cross-linguistic correlations between the number of monosyllabic words (a), of syllable types (b), of phonemes per syllable (c), and of the size of the phonemic inventory (d). Menzerath's (1954: 112–121) descriptions of 8 Indo-European languages and Campbell's (1991) data regarding their phonemic inventories offered the basis for a statistical evaluation. All correlations between a, b, and c turned out to be significant, those between these three parameters and d almost significant (Fenk-Oczlon & Fenk, to appear). The discussion of these results within
a systems-theoretical framework includes: (I) Diachronic changes: A comparison of<br />
the Beowulf Prologue in Old English (OE) with its translation into Modern English<br />
(ME) shows a remarkable increase of monosyllables from 105 in OE to 312 in ME<br />
and a concomitant increase of the mean syllable complexity from 2.63 phonemes<br />
in OE to 2.88 in ME. (II) Semantic functions: The verb as well as the adverb or<br />
preposition forming a phrasal verb are often polysemous. In a short analysis of a collection<br />
of 1406 English phrasal verbs we found that 1367 or 97 % of the verbs that<br />
were part of the phrasal verb construction were monosyllabic. (39 phrasal verbs<br />
included a bisyllabic verb and only one was found with a trisyllabic verb.) (III)<br />
General relations between a language's phonemic inventory, the number of conceivable combinatorial possibilities, and the number of those phonotactic possibilities actually realized in the monosyllables of typologically different languages.
References<br />
Campbell, G. L. (1991): Compendium of the World's Languages. Routledge, London.
Fenk-Oczlon, G. and Fenk, A. (to appear): Complexity trade-offs between the subsystems of language. In: M. Miestamo, K. Sinnemäki and F. Karlsson (Eds.): Language Complexity: Typology, Contact, Change. John Benjamins, Amsterdam, 43–65.
Menzerath, P. (1954): Die Architektonik des deutschen Wortschatzes. Dümmler, Hannover/Stuttgart.
− 40 −
Validity of images from binary coding tables.<br />
Student motivation surveys: some evidence<br />
K. Fernández-Aguirre 1 and M. A. Garín-Martín 1<br />
University of the Basque Country (UPV/EHU), Bilbao, Spain<br />
karmele.fernandez@ehu.es<br />
Abstract. Using both artificial and real data, this paper analyses the superiority<br />
of Correspondence Analysis (CA) over Principal Component Analysis (PCA) as a<br />
procedure for displaying and exploring data in the processing of contingency tables<br />
or binary tables.<br />
Simple and Multiple Correspondence Analysis (CA and MCA) are becoming<br />
more and more widely used in many areas of science. However, PCA is much better known and more accessible in software packages, so it continues to be used even at the risk of poor results when the data are not quantitative. A second point examined is the low percentages of projected variance obtained on the first factorial axes in CA applications. This may deter users from applying these methods, as it suggests that the visualization obtained will be of poor quality. This problem has been widely discussed: Benzécri (1979) and Greenacre (1994) propose corrections to projected variance rates. Moreover, Lebart and co-authors (1984, 1998, 2000) consider the case of a matrix associated with a symmetric graph and analytically study the variations in representation depending on the different codifications of the associated matrix.
On the one hand, our study shows the superiority of CA over PCA for the reconstitution and visualization of a matrix M associated with a symmetric graph G. On the other hand, we present four examples which
show that projected variance rates are a highly pessimistic measure of the quality<br />
of a representation. These examples of application to survey data examine various<br />
aspects of the motivation of university students in the field of education, and in<br />
particular present extremely low percentages of projected variance on the first axes.<br />
The application of MCA enables us to obtain apparently fragile but interpretable images. Moreover, classification analysis identifies various types of student with clear interpretations, leading to robust conclusions.
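The core CA computation discussed above can be sketched as follows (a standard SVD-of-standardized-residuals formulation on a toy 2×2 table, not the authors' survey data):

```python
import numpy as np

def correspondence_analysis(N):
    """CA via SVD of the standardized residuals of a contingency table."""
    P = N / N.sum()                          # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)      # row and column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U * sv) / np.sqrt(r)[:, None]    # principal row coordinates
    cols = (Vt.T * sv) / np.sqrt(c)[:, None] # principal column coordinates
    return rows, cols, sv ** 2               # sv**2 = principal inertias

N = np.array([[20.0, 5.0], [4.0, 21.0]])     # toy contingency table
rows, cols, inertia = correspondence_analysis(N)
print(inertia[0] / inertia.sum())            # share of inertia on axis 1
```

The total inertia equals chi-squared divided by the table total; the share on the first axes is exactly the "projected variance" rate whose pessimism the abstract discusses.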
Key words: Binary tables, Visualization, Correspondence Analysis, Clustering<br />
References<br />
Lebart, L., Morineau, A. and Warwick, K. M. (1984): Multivariate Descriptive Statistical Analysis. John Wiley & Sons, New York.
− 41 −
A Statistical Theory of Musical Consonance<br />
Proved in Praxis<br />
Jobst Fricke<br />
Universität zu Köln<br />
Abstract. One of the recent models of consonance perception of musical intervals<br />
is based on the simulation of neural autocorrelation. It is assumed that the shape of<br />
the coinciding neural spikes resembles Dirac delta impulses. Then, musical intervals are recognized as consonant if and only if the frequencies of the interval tones exactly form a simple numerical proportion. But intervals are also perceived as consonant when they deviate slightly from these simple numerical proportions. The models of Tramo et al. (2001) and Ebeling (2007), which introduce impulses of larger width, are the first to be in accordance with the reality of perception. Both of them compute the autocorrelation of interval tones that consist of impulses with a width different from zero. In fact, the time window for the spikes' coincidence has a width different from zero; this is the latency which is relevant in cognitive processes. In order to adapt the model to reality, the width of the statistical distribution of neural impulses should be considered.
It is investigated to what extent the behavior of the model corresponds to the auditory<br />
perception in a realistic environment. The experimental data were taken from<br />
a study dealing with the judgment of intervals in a musical context (Fricke 2005b).<br />
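The basic idea, autocorrelation of impulse trains whose spikes have nonzero width, can be sketched as follows (a generic illustration, not the Tramo et al. or Ebeling model itself; sampling rate and pulse width are arbitrary choices):

```python
import numpy as np

def pulse_train(freq, width, t):
    """Periodic train of Gaussian pulses: Dirac spikes smeared to a
    nonzero width, mimicking the latency of neural impulses."""
    phase = (t * freq) % 1.0
    dist = np.minimum(phase, 1.0 - phase) / freq  # seconds to nearest spike
    return np.exp(-0.5 * (dist / width) ** 2)

t = np.linspace(0.0, 0.1, 4000)                   # 0.1 s, ~40 kHz sampling
signal = pulse_train(200.0, 5e-4, t) + pulse_train(300.0, 5e-4, t)  # 2:3 fifth
s = signal - signal.mean()
ac = np.correlate(s, s, mode="full")
ac = ac[ac.size // 2:] / ac[ac.size // 2]         # normalized, lags >= 0
lag = int(round(0.01 / (t[1] - t[0])))            # 10 ms: common period of 2:3
print(ac[lag])  # pronounced autocorrelation peak at the interval's period
```

Because the pulses have nonzero width, the peak at the common period survives a slight mistuning of the 2:3 ratio, which is the point of contact with perceptual reality.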
Keywords<br />
Music, consonance theory
References<br />
Ebeling, M. (2007): Verschmelzung und neuronale Autokorrelation als Grundlage einer Konsonanztheorie. Frankfurt/M. et al.
Fricke, J. (2005b): Classification of Perceived Musical Intervals. In: C. Weihs and W. Gaul (Eds.): Classification - The Ubiquitous Challenge. Springer, Berlin/Heidelberg/New York, 585–592.
Tramo, M., Cariani, P., Delgutte, B. and Braida, L. (2001): Neurobiological Foundations for the Theory of Harmony in Western Tonal Music. In: R. Zatorre and I. Peretz (Eds.): Biological Foundations of Music. Annals of the New York Academy of Sciences, Vol. 930.
− 42 −
An Improved Criterion for Clustering Based<br />
on the Posterior Similarity Matrix<br />
Arno Fritsch and Katja Ickstadt<br />
Technische Universität Dortmund, Fakultät Statistik<br />
Vogelpothsweg 87, 44221 Dortmund<br />
arno.fritsch@tu-dortmund.de, ickstadt@statistik.uni-dortmund.de<br />
Abstract. Complex Bayesian cluster models are often fitted using Markov Chain<br />
Monte Carlo (MCMC) algorithms. A problem is then how to summarize the MCMC<br />
sample c(1), . . . , c(M) from the posterior distribution of clusterings p(c|y) with a single
estimated clustering ĉ. The problem is complicated by the fact that the labels<br />
associated with the clusters can switch during the MCMC run. One way to overcome<br />
this is to derive the estimate ĉ based on the posterior similarity matrix, a matrix<br />
with entries πij = P (ci = cj|y), the posterior probabilities that the observations i<br />
and j are in the same cluster. This approach is taken for example in the Bayesian<br />
cluster models for gene expression microarray data by Medvedovic et al. (2004) and<br />
Dahl (2006). The former applies hierarchical clustering to the matrix of (1 − πij),<br />
while the latter tries to minimize a loss function proposed by Binder (1978). We<br />
show that this minimization is equivalent to maximizing the Rand index between<br />
estimated and true clustering and propose a new criterion for choosing ĉ, the posterior<br />
expected adjusted Rand index with the true clustering. In a simulation study<br />
with a Dirichlet process mixture model it is shown that our new criterion leads to<br />
estimated clusterings closer to the true one than the other two approaches and that<br />
it also outperforms the usage of the maximum a posteriori (MAP) clustering.<br />
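The posterior similarity matrix, together with the Binder/least-squares style summary it supports, can be sketched as follows (toy MCMC draws; the paper's proposed posterior expected adjusted Rand criterion is not reproduced here):

```python
import numpy as np

def posterior_similarity(draws):
    """pi[i, j]: fraction of MCMC clusterings in which observations i and j
    share a cluster -- invariant to label switching by construction."""
    draws = np.asarray(draws)                 # shape (M, n)
    pi = np.zeros((draws.shape[1],) * 2)
    for c in draws:
        pi += c[:, None] == c[None, :]
    return pi / draws.shape[0]

def least_squares_clustering(draws, pi):
    """Dahl-style summary: the sampled clustering whose co-membership
    matrix is closest in squared error to the similarity matrix."""
    losses = [(((c[:, None] == c[None, :]) - pi) ** 2).sum()
              for c in np.asarray(draws)]
    return np.asarray(draws)[int(np.argmin(losses))]

draws = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]  # labels switch freely
pi = posterior_similarity(draws)
print(pi[0, 1], least_squares_clustering(draws, pi))
```

Note that the first two draws are the same partition under different labels; the similarity matrix treats them identically, which is exactly why summaries built on it avoid the label-switching problem.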
Key words: Adjusted Rand index, Bayesian cluster analysis, Markov Chain Monte<br />
Carlo, Loss functions<br />
References<br />
Binder, D.A. (1978): Bayesian Cluster Analysis. Biometrika, 65, 31–38.<br />
Dahl, D.B. (2006): Model-based Clustering for Expression Data via a Dirichlet Process<br />
Mixture Model. In: K.A. Do, P. Müller and M. Vannucci (Eds.): Bayesian<br />
Inference for Gene Expression and Proteomics, Cambridge University Press,<br />
New York, 201–216.<br />
Medvedovic, M., Yeung, K. and Bumgarner, R. (2004): Bayesian Mixture Model<br />
Based Clustering of Replicated Microarray Data. Bioinformatics, 20, 1222–<br />
1232.<br />
− 43 −
On the Use of Student Samples in Major<br />
Marketing Research Journals. A Meta-Study<br />
Sebastian Fuchs and Marko Sarstedt<br />
Institute for Market-based Management, Munich School of Management, D-80539<br />
Munich, Germany imm@bwl.lmu.de<br />
Abstract. In recent years, almost every marketing research journal has experienced a sharp rise in the number of high-quality paper submissions. This has led to increased competition among contributing authors and heightened requirements for manuscript submissions. In this context, the manuscript evaluation criteria of almost every marketing journal underline the importance of the sample's characteristics and how well it represents the population being studied. However, the predominance of student samples in empirical marketing research documents the divergence between these theoretical requirements and practical implementation. Despite theoretical and empirical objections, several authors claim that a silent acceptance of the usage of student samples has become observable, even in top research societies. According to Peterson (2001), this development is problematic, as generalizations are only feasible if replicating research with non-student subjects is carried out. Thus, the objective of this paper is to analyze the development of the usage of student samples in the most reputable marketing research journals. For this purpose, all eleven marketing journals rated A or A+ in the ranking developed on behalf of the Association of University Professors of Management in German-speaking countries (VHB) were investigated. A total of 1,491 studies that have appeared since 2005 were analyzed with regard to the samples used, the measures evaluated and the limitations addressed. The results show vast differences between the various journals.
Key words: Student Samples, Sampling, Marketing Research<br />
References<br />
Peterson, R.A. (2001): On the Use of College Students in Social Science Research: Insights from a Second-Order Meta-analysis. Journal of Consumer Research, 28(3), 450–461.
Völckner, F. and Sattler, H. (2007): Empirical Generalizability of Consumer Evaluations of Brand Extensions. International Journal of Research in Marketing, 24(2), 149–162.
− 44 −
Multi-Dimensional Scaling applied to<br />
Hierarchical Fuzzy Rule Systems<br />
Thomas R. Gabriel, Kilian Thiel, and Michael R. Berthold<br />
Chair for Bioinformatics and Information Mining<br />
Department of Computer and Information Science<br />
University of Konstanz, Box 712, 78457 Konstanz, Germany<br />
{gabriel|thiel|berthold}@inf.uni-konstanz.de<br />
Abstract. This paper presents an approach for visualizing high-dimensional fuzzy<br />
rules arranged in a hierarchy together with the training patterns they cover. A standard<br />
multi-dimensional scaling method is used to map the rule centers of the top<br />
hierarchy level to one coherent picture. Rules of the underlying levels are projected<br />
relatively to their superior level(s). In addition to the rules, all patterns are mapped<br />
onto the two-dimensional projection in relation to the positions of the corresponding
rule centers. The visualization is further extended by showing hierarchical relationships<br />
between overlapping rules of different levels as generated by a hierarchical rule<br />
learner. This delivers interesting insights into the rule hierarchy and makes the model<br />
more explorable. Additionally, rules can be highlighted interactively emphasizing the<br />
subsequent rules at all underlying levels together with their covered patterns. We<br />
demonstrate that this technique allows investigation of interesting rules at different<br />
levels of granularity, which even makes this approach applicable to a large number<br />
of rule sets. The proposed technique is illustrated and discussed on a number<br />
of hierarchical rule model visualizations generated on well-known benchmark data<br />
sets.<br />
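The top-level mapping step can be sketched with classical (Torgerson) MDS (a generic formulation with hypothetical rule centers; the relative projection of the lower hierarchy levels and the interactive highlighting are not shown):

```python
import numpy as np

def classical_mds(D, dim=2):
    """Classical (Torgerson) MDS from a distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    B = -0.5 * J @ (D ** 2) @ J             # double-centered squared distances
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]         # largest eigenvalues first
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Hypothetical top-level rule centers in a 4-dimensional feature space.
centers = np.array([[0.0, 0, 0, 0], [1.0, 0, 0, 0], [0.0, 1, 0, 0]])
D = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)
X = classical_mds(D)                        # 2-D map of the rule centers
D2 = np.linalg.norm(X[:, None] - X[None, :], axis=2)
print(np.allclose(D, D2))                   # exact for Euclidean distances
```

In this toy case the three centers lie in a plane, so the two-dimensional map reproduces their distances exactly; with many high-dimensional rules the map is only an approximation.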
Key words: Multi-Dimensional Scaling, Fuzzy Rule Induction, Rule Hierarchy<br />
References<br />
Berthold, M.R. (2003): Mixed Fuzzy Rule Formation. International Journal of Approximate Reasoning, 32, 67–84.
Gabriel, T.R. and Berthold, M.R. (2003): Constructing Hierarchical Rule Systems.<br />
In: M.R. Berthold, H.-J. Lenz, E. Bradley, R. Kruse, C. Borgelt (Eds.): Proc. 5th<br />
International Symposium on Intelligent Data Analysis. Springer, Berlin, 76–87.<br />
Gabriel, T.R., Thiel, K. and Berthold, M.R. (2006): Rule Visualization based on
Multi-Dimensional Scaling. In: IEEE International Conference on Fuzzy Systems.<br />
IEEE Press, Vancouver, 66–71.<br />
− 45 −
A Center of Excellence for the Digital Support of the Execution, Analysis and Publication of Archaeological Field Projects (Excavation, Survey)
Ulrich-Walter Gans and Matthias Lang
Institut für Archäologische Wissenschaften der RUB, Fakultät für Geschichtswissenschaft, e-mail: johannes.bergemann@rub.de
Abstract. For about two decades, the information produced in very large quantities by archaeological fieldwork has been recorded predominantly or even exclusively in digital form. Extensive databases have been built in numerous places, but their structures are extremely heterogeneous; isolated solutions without interconnections are the rule. Further fundamental problems concern the long-term preservation and maintenance of existing data pools as well as rapidly changing operating systems. An interdisciplinary team from the fields of archaeology, information management, software development and geoinformatics aims to give archaeologists an entirely new way of working with digital media. So far, the databases distributed over the servers of universities, research institutes and museums can only be queried individually over the Internet. ArcheoInf will build a mediator capable of searching numerous archaeological fieldwork databases simultaneously, without users having to switch between query interfaces. The mediator rests on a thesaurus covering as many areas of archaeology as possible and on a corresponding ontology, which together enable a convenient search of fieldwork data. In this way, a repository committed to the open-access principle will be created, offering free access to archaeological object and image data. The embedding of bibliographic data, citations, holdings records and electronic full texts is also planned. At the same time, a central WebGIS server is to be set up that links the archaeologists' geoinformatic applications to text and image data. The project, funded by the Deutsche Forschungsgemeinschaft, involves archaeologists of the Ruhr-Universität Bochum, computer scientists of the Technische Universität Dortmund, geoinformation scientists of the Hochschule Bochum, and the university libraries of Bochum and Dortmund. Further archaeological project partners in Berlin, Bochum, Cottbus, Darmstadt, Karlsruhe, Cologne, Frankfurt and Tübingen are associated.
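The mediator idea, one query normalized through a shared thesaurus and fanned out over heterogeneous project databases, can be caricatured as follows (all names, terms and records are invented; ArcheoInf's actual ontology-based mediator is far richer):

```python
# All vocabulary and records below are hypothetical illustrations.
THESAURUS = {"amphore": "amphora", "amphora": "amphora", "krug": "jug"}

DATABASES = {
    "excavation_a": [{"find": "Amphore", "site": "Trench 3"}],
    "survey_b": [{"find": "amphora", "site": "Field 7"}],
}

def normalize(term):
    """Map a project-specific term onto the shared thesaurus vocabulary."""
    return THESAURUS.get(term.lower(), term.lower())

def mediator_query(term):
    """One query, fanned out over all registered databases at once."""
    target = normalize(term)
    return [(db, r["site"]) for db, records in DATABASES.items()
            for r in records if normalize(r["find"]) == target]

print(mediator_query("Amphora"))  # hits from both databases
```

The point of the sketch is only the indirection: the user never addresses an individual database, and heterogeneous local terms are reconciled through the shared vocabulary.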
− 46 −
Scalable and Incrementally Updated Hybrid<br />
Recommender Systems<br />
Zeno Gantner and Lars Schmidt-Thieme<br />
Information Systems and Machine Learning Lab (ISMLL)<br />
University of Hildesheim, Germany<br />
{gantner,schmidt-thieme}@ismll.uni-hildesheim.de<br />
Abstract. A typical approach in collaborative filtering is to treat the ratings as a<br />
matrix with many unknown entries, and to use the known data to approximate the<br />
complete matrix, including the unknown ratings. George and Merugu demonstrated<br />
that using a co-clustering algorithm to approximate the ratings matrix achieves prediction<br />
accuracies comparable to techniques like SVD, NMF, or kNN, while having<br />
computational properties which allow the use of the method in dynamic scenarios,<br />
where incoming data has to be incorporated instantly. However, pure collaborative<br />
filtering approaches suffer from insufficient data, especially in cases when there are<br />
only a few users, or when there are new items which have not yet been rated.<br />
To overcome this problem, we combine co-clustering with a Naïve Bayes classifier<br />
to predict ratings based on both the ratings matrix and item attributes. The simplicity<br />
of the classifier allows us to preserve the desirable properties (parallelization,<br />
scalability, incremental updates). Our evaluation indicates that the hybrid recommender<br />
system performs better than pure co-clustering.<br />
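The combination rule can be sketched as follows (a deliberately simplified stand-in: the co-cluster assignments are taken as given rather than learned iteratively, and the content-based prediction is passed in as a number instead of being computed by an actual Naïve Bayes classifier):

```python
import numpy as np

def cocluster_predict(R, row_c, col_c, u, i):
    """Mean rating of the (user-cluster, item-cluster) block of u and i."""
    rated = ~np.isnan(R)
    block = rated & (row_c[:, None] == row_c[u]) & (col_c[None, :] == col_c[i])
    return R[block].mean() if block.any() else np.nan

def hybrid_predict(R, row_c, col_c, u, i, nb_prediction):
    """Fall back to the content-based prediction when the co-cluster
    estimate has no ratings to draw on (e.g. a new item)."""
    p = cocluster_predict(R, row_c, col_c, u, i)
    return nb_prediction if np.isnan(p) else p

R = np.array([[5.0, 3.0, np.nan],
              [1.0, 2.0, np.nan]])           # item 2 is unrated (new item)
row_c = np.array([0, 1])                     # user cluster assignments
col_c = np.array([0, 0, 1])                  # item cluster assignments
print(hybrid_predict(R, row_c, col_c, 0, 2, nb_prediction=4.5))  # -> 4.5
print(hybrid_predict(R, row_c, col_c, 0, 1, nb_prediction=4.5))  # -> 4.0
```

Because the fallback only needs item attributes, the hybrid keeps working for items no one has rated, which is precisely the new-item scenario described above.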
Key words: hybrid recommender systems, collaborative filtering, content-based<br />
filtering, co-clustering, naïve Bayes
References<br />
Banerjee, A., Dhillon, I., Ghosh, J., Merugu, S. and Modha, D. (2007): A Generalized<br />
Maximum Entropy Approach to Bregman Co-clustering and Matrix<br />
Approximation. The Journal of Machine Learning Research, 8, 1919–1986.
Hauger, S., Tso, K. and Schmidt-Thieme, L. (2007): Comparison of Recommender<br />
System Algorithms focusing on the New-Item and User-Bias Problem. In: The<br />
31st Annual Conference of the German Classification Society on Data Analysis,<br />
Machine Learning, and Applications.<br />
George, T. and Merugu, S. (2005): A Scalable Collaborative Filtering Framework<br />
Based on Co-Clustering. In: Proceedings of the 5th IEEE Conference on Data<br />
Mining (ICDM). IEEE Computer Society, Los Alamitos, CA, USA, 625–628.
− 47 −
Non-Gaussian nature of ENSO signals and<br />
climate shifts: implications for regional<br />
studies off the western coast of South America<br />
Bernard Garel 1 , J. Boucharel, B. Dewitte and Y. du Penhoat 2<br />
1 Institut de Mathématiques de Toulouse (IMT-LSP)
bernard.garel@math.univ-toulouse.fr
2 Laboratoire d'Études en Géophysique et Océanographie Spatiales (LEGOS)
julien.boucharel@legos.obs-mip.fr
Abstract. El Niño/Southern Oscillation (ENSO) exhibits a significant modulation at decadal timescales which is also associated with changes of its characteristics (amplitude, frequency, propagation, predictability). Some of these characteristics are generally ignored in ENSO regional studies, such as asymmetry (the numbers of warm and cold events are not equal) and the deviation of its statistics from those of an assumed Gaussian distribution. They tend to reduce ENSO prediction skill.
Empirical variance shifts (assumed to be an index of low-frequency variability), first detected in the western tropical Pacific, propagate (with propagation characteristics related to the unstable modes of ENSO) and grow eastward along the equator, leading to enhanced SST anomalies and asymmetry.
Statistical tests are used to quantify the non-Gaussian nature and asymmetry of<br />
ENSO typical indices from in situ data and a variety of models (from intermediate<br />
complexity models to full-physics coupled general circulation models). It is tested whether ENSO can be accounted for by a non-Gaussian alpha-stable law (i.e. a distribution with heavier tails than the Gaussian), by a mixture of distributions, or by a non-stationary process dominated by abrupt changes in mean state and empirical variance. This last question is addressed by a shift detection method applied to ENSO typical
indices. Implications for the interpretation of proxies of the upwelling variability off<br />
the coast of Peru are discussed.<br />
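Standard moment-based tests of the kind alluded to can be sketched on synthetic, skewed "anomalies" (an illustration only; the actual study works with ENSO indices, alpha-stable fits and shift detection, none of which are reproduced here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Stand-in "index": positively skewed, heavy-tailed anomalies.
index = rng.gamma(shape=2.0, scale=1.0, size=2000) - 2.0

print(stats.skewtest(index).pvalue)      # asymmetry of warm vs. cold events
print(stats.kurtosistest(index).pvalue)  # tails heavier than Gaussian
print(stats.normaltest(index).pvalue)    # omnibus departure from normality
```

All three p-values are essentially zero for this sample, i.e. a Gaussian description of such a signal would be firmly rejected.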
Key words: alpha-stable distributions, mixtures, non-stationary process<br />
− 48 −
Likelihood ratio test for general mixture<br />
models<br />
Elisabeth Gassiat 1<br />
Université Paris-Sud 11, Bâtiment 425, 91405 Orsay Cédex, France<br />
elisabeth.gassiat@math.u-psud.fr<br />
Abstract. We investigate the likelihood ratio test (LRT) for testing hypotheses on<br />
the mixing measure in mixture models with a possible structural parameter. The main result gives the asymptotic distribution of the LRT statistic under conditions that are proved to be almost necessary. The asymptotic distribution of the LRT statistic under contiguous alternatives may also be derived. This applies to various testing
problems: the test of a single distribution against any mixture, with application to<br />
Gaussian, Poisson and binomial distributions; the test of the number of populations<br />
in a finite mixture with a possible structural parameter. This allows us to prove that, for the simple contamination model, the asymptotic local power under contiguous alternatives may be arbitrarily close to the asymptotic level when the set of parameters is large enough.
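The LRT statistic itself can be sketched on a toy problem (a unit-variance two-component Gaussian mixture against a single Gaussian; the point of the paper is the nonstandard asymptotic null distribution of this statistic, which the sketch does not compute):

```python
import numpy as np
from scipy import stats

def loglik_two_gaussians(x, iters=200):
    """EM for p*N(mu1, 1) + (1-p)*N(mu2, 1); returns the maximized log-likelihood."""
    p, mu1, mu2 = 0.5, x.min(), x.max()
    for _ in range(iters):
        f1 = p * stats.norm.pdf(x, mu1)
        f2 = (1 - p) * stats.norm.pdf(x, mu2)
        r = f1 / (f1 + f2)                   # E step: responsibilities
        p = r.mean()                         # M step
        mu1 = (r * x).sum() / r.sum()
        mu2 = ((1 - r) * x).sum() / (1 - r).sum()
    return np.log(p * stats.norm.pdf(x, mu1)
                  + (1 - p) * stats.norm.pdf(x, mu2)).sum()

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(2, 1, 150)])
ll0 = stats.norm.logpdf(x, x.mean(), 1.0).sum()   # null: a single N(mu, 1)
lrt = 2 * (loglik_two_gaussians(x) - ll0)
print(lrt)  # large here: the single-component null is untenable
```

Calibrating such a statistic is exactly where the paper's results are needed: under the null the usual chi-squared approximation does not apply.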
Key words: Likelihood ratio test, mixture models, number of components, local<br />
power, contiguity<br />
References<br />
Azais, J.-M., Gassiat, E. and Mercadier, C. (2006): Asymptotic distribution and power of the likelihood ratio test for mixtures: bounded and unbounded case. Bernoulli, 12(5), 775–799.
Azais, J.-M., Gassiat, E. and Mercadier, C. (2007): The likelihood ratio test for general mixture models with possibly unknown structural parameters. ESAIM P. and S., submitted.
Gassiat, E. (2002): Likelihood ratio inequalities with applications to various mixtures. Ann. Inst. H. Poincaré Probab. Statist., 6, 897–906.
− 49 −
On the Location of Retail Units and Equilibrium Price Determination
Vladimir Gazda 1<br />
Technical University in Kosice, Nemcovej Str., 040 01 Kosice, Slovakia<br />
vladimir.gazda@tuke.sk<br />
Abstract. The classical view of price determination rests on the assumption of perfect competition, which implies the existence of a single price accepted by all retailers. Traditionally, we suppose that all consumers neglect the search costs incurred in looking for the most appropriate purchasing opportunity. New views on the price-location competition among retailers were presented by Hotelling and, subsequently, by d'Aspremont et al., Gabszewicz and Thisse, Dobson and Waterson, and Martinez et al., who stressed mainly a continuous approach. The discrete model describing the relation between search costs and the selling price was described by Stigler (1961) in his search theory. His approach is based on the sequential search of particular retail places and focuses more on the searching process itself than on the equilibrium price determination. We propose a price equilibrium problem formulation for a more complicated (non-sequential) structure of consumers and retailers.
The model describes the behaviour of cost-minimising homogeneous consumers and the retailers' price policy. There, the union of the set of consumers V1, the set of retailers V2, a virtual source s and a virtual sink u gives the set V of digraph nodes. Then, E = {s} × V1 ∪ V1 × V2 ∪ V2 × {u} is the set of digraph edges. We define the unit cost function as c : E → N0, where cs,i = 0, ci,j = di,j, and cj,u = pj; here di,j ∈ N is the search cost spent by the i-th consumer to reach the j-th retail centre and pj ∈ πj is the price of the j-th retailer. It is proved that the min-cost flow in the digraph G = (V, E, c) models the optimal behaviour of all consumers. Then, the normal form game of retailers S = (V2, ∏j πj, ∏j µj) describes the behaviour of retailers with their payoff functions µj derived from the min-cost decisions of the consumers. The Nash equilibrium price strategies of the retailers are discussed in the article.
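Because the retailers in this formulation carry no capacity constraints, the min-cost flow decomposes: each consumer's unit of flow simply follows the path s → i → j → u minimizing di,j + pj. A minimal sketch on an invented instance:

```python
# Invented instance: consumers 1, 2; retailers "A", "B".
search_cost = {(1, "A"): 1, (1, "B"): 4, (2, "A"): 3, (2, "B"): 1}  # d_ij
price = {"A": 10, "B": 9}                                           # p_j
consumers, retailers = (1, 2), ("A", "B")

# Cheapest path s -> i -> j -> u for each consumer (c_si = 0 is omitted).
choice = {i: min(retailers, key=lambda j: search_cost[i, j] + price[j])
          for i in consumers}
total_cost = sum(search_cost[i, choice[i]] + price[choice[i]]
                 for i in consumers)
print(choice, total_cost)  # -> {1: 'A', 2: 'B'} 21
```

The retailers' game then sits on top of this: each retailer anticipates how changing pj re-routes the consumers' min-cost choices.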
Key words: Graph, Game, Consumption, Retailer, Min-cost Flow, Nash Equilibrium<br />
− 50 −
The Potential of Social Intelligence for<br />
Collective Intelligence<br />
Andreas Geyer-Schulz 1 and Bettina Hoser 2<br />
1 Information Service and Electronic Markets andreas.geyer-schulz@kit.edu<br />
2 bettina.hoser@kit.edu<br />
Abstract. In this contribution we review the history and potential of social intelligence<br />
for driving collective intelligence. Different social networks generated by<br />
computer-mediated communication have been researched extensively in the past.<br />
The development of new technologies and applications on the Internet has resulted<br />
in a recent dramatic rise of user participation within these networks. We systematically<br />
explore the possibility of cross-usage of information about social network<br />
structures for personal, community or organisational services. In the other direction,<br />
we investigate the potential of improving community services by integrating<br />
information on the network structure with personal, organisational and mass information.<br />
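One common way to make a directed communication network amenable to eigensystem analysis, in the spirit of Hoser and Geyer-Schulz (2005) though not necessarily their exact construction, is a Hermitian encoding:

```python
import numpy as np

# Directed communication counts among four actors (row = sender).
A = np.array([[0, 5, 1, 0],
              [4, 0, 0, 1],
              [1, 0, 0, 6],
              [0, 1, 5, 0]], dtype=float)

# Symmetric part real, antisymmetric part imaginary: H is Hermitian,
# so it retains the edge directions yet has a real spectrum.
H = (A + A.T) / 2 + 1j * (A - A.T) / 2

eigenvalues, eigenvectors = np.linalg.eigh(H)
print(eigenvalues)  # real; leading eigenvectors expose group substructure
```

In this toy matrix, actors {0, 1} and {2, 3} communicate mostly within their pair, and the magnitudes of the leading eigenvector components separate the two groups.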
Key words: social networks, social network analysis, social intelligence, collective<br />
intelligence<br />
References<br />
Hoser, B. and Geyer-Schulz, A. (2005): Eigenspectralanalysis of Hermitian Adjacency Matrices for the Analysis of Group Substructures. Journal of Mathematical Sociology, 29(4), 265–294.
Wasserman, S. and Faust, K. (1994): Social Network Analysis: Methods and Applications. Cambridge University Press, Cambridge.
− 51 −
Isolated vertices in random intersection graphs<br />
Erhard Godehardt 1 , Jerzy Jaworski 2 and Katarzyna Rybarczyk 2<br />
1 Clinic of Thoracic and Cardiovascular Surgery, Heinrich Heine University, 40225<br />
Düsseldorf, Germany; godehard@uni-duesseldorf.de<br />
2 Faculty of Mathematics and Computer Science, Adam Mickiewicz University,<br />
60769 Poznań, Poland; jaworski@amu.edu.pl, kryba@amu.edu.pl<br />
Abstract. In applications like the analysis of non-metric data, it is a natural approach<br />
to classify objects according to the properties they possess. Often it is useful<br />
to consider two objects similar if they share at least s properties (with a given arbitrary<br />
number s). Then an effective model to analyze the structure of similarities<br />
between objects is the random intersection graph G(m, n, P(m)), with a set of vertices<br />
V generated by the random bipartite graph BG(m, n, P(m)) with bipartition (V, W),<br />
where clusters are defined as given subgraphs of the generated intersection graph.<br />
In BG(m, n, P(m)) the number of neighbors (properties) of a vertex v ∈ V (object) is<br />
assigned according to the probability distribution P(m) and an edge between v ∈ V<br />
and w ∈ W means that the object v has the property w. Moreover in G(m, n, P(m)),<br />
an edge connects v1 and v2 (v1, v2 ∈ V) if and only if in BG(m, n, P(m)) they have<br />
at least s common properties. The models were introduced in Godehardt and Jaworski<br />
(2002). Using specific properties of such graphs, we can test the hypothesis<br />
of randomness of the underlying data set.<br />
Our main purpose is to study the number of isolated vertices (objects similar<br />
to no other) in G(m, n, P(m)). Previous results concerning this problem considered only the case where each vertex had the same number of properties and s = 1. In our new approach we manage to cope with dependencies between edge appearances for s ≥ 1 (which is important from the application point of view). Moreover, we give results for the case where the number of properties differs between objects (different distributions P(m)).
distributions P(m)). We give the asymptotics for the probability of nonexistence of<br />
isolated vertices in G(m, n, P(m)) and conditions for asymptotic convergence of the<br />
number of isolated vertices to the Poisson distribution.<br />
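The model itself is easy to simulate (a sketch with an arbitrary property distribution P(m) and threshold s; the asymptotic results of the paper are not reproduced):

```python
import numpy as np
from itertools import combinations

def sample_intersection_graph(m, n, draw_props, s, rng):
    """G(m, n, P(m)): object v draws draw_props() properties out of n;
    v and w are adjacent iff they share at least s properties."""
    props = [set(rng.choice(n, size=draw_props(), replace=False))
             for _ in range(m)]
    adj = {v: set() for v in range(m)}
    for v, w in combinations(range(m), 2):
        if len(props[v] & props[w]) >= s:
            adj[v].add(w)
            adj[w].add(v)
    return adj

rng = np.random.default_rng(7)
adj = sample_intersection_graph(m=60, n=200,
                                draw_props=lambda: rng.integers(2, 6),
                                s=2, rng=rng)
isolated = sum(1 for v in adj if not adj[v])   # objects similar to no other
print(isolated, "of 60 objects are isolated")
```

Comparing the empirical count of isolated vertices with the limiting Poisson behaviour described above is the basis of the randomness test mentioned in the abstract.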
Key words: Random Intersection Graphs, Isolated Vertices, Non-metric Data<br />
Analysis<br />
References<br />
Godehardt, E. and Jaworski, J. (2002): Two Models of Random Intersection Graphs for Classification. In: M. Schwaiger and O. Opitz (Eds.): Exploratory Data Analysis in Empirical Research. Springer, Berlin, 68–81.
− 52 −
A note on constrained EM algorithms<br />
for mixtures of elliptical distributions<br />
Francesca Greselin 1 and Salvatore Ingrassia 2<br />
1 Dipartimento di Metodi Quantitativi per le Scienze Economiche e Aziendali,<br />
Università di Milano Bicocca (Italy) francesca.greselin@unimib.it<br />
2 Dipartimento di Economia e Metodi Quantitativi, Università di Catania (Italy)<br />
s.ingrassia@unict.it<br />
Abstract. We extend some theoretical results about the likelihood maximization<br />
on constrained parameter spaces to mixtures of multivariate elliptical distributions.<br />
In particular, mixtures of multivariate t distributions provide a robust parametric<br />
extension to the fitting of data with respect to normal mixtures. In this framework, the degrees of freedom can act as a robustness parameter, tuning the heaviness of the tails and down-weighting the effect of outliers on parameter estimation.
Further, a constrained monotone algorithm implementing maximum likelihood mixture<br />
decomposition of multivariate t distributions is proposed, to achieve improved<br />
convergence capabilities and robustness. Numerical studies are presented in order<br />
to demonstrate the better performance of the algorithm, comparing it to earlier<br />
proposals.<br />
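The core device behind such constrained algorithms can be sketched as follows (eigenvalue bounds [a, b] chosen arbitrarily; the paper's conditions for t-mixtures are more refined than this Hathaway/Ingrassia-style clipping):

```python
import numpy as np

def constrain_covariance(S, a, b):
    """Clip the eigenvalues of a covariance estimate into [a, b]; this keeps
    the likelihood bounded and rules out degenerate, spiky components."""
    w, V = np.linalg.eigh(S)
    return (V * np.clip(w, a, b)) @ V.T

S = np.array([[1e-9, 0.0], [0.0, 4.0]])   # nearly singular M-step estimate
S_c = constrain_covariance(S, a=0.1, b=10.0)
print(np.linalg.eigvalsh(S_c))            # eigenvalues pushed into [0.1, 10]
```

Applied after every M step, such a projection preserves the monotone increase of the constrained likelihood, which is what "constrained monotone algorithm" refers to above.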
Key words: Mixture models, Robust Clustering, EM algorithm, elliptical distributions,<br />
t-distribution.<br />
References<br />
Hathaway, R.J. (1985): A constrained formulation of maximum-likelihood estimation for normal mixture distributions. The Annals of Statistics, 13, 795–800.
Hennig, C. (2004): Breakdown points for maximum likelihood estimators of location-scale mixtures. The Annals of Statistics, 32, 1313–1340.
Ingrassia, S. and Rocci, R. (2007): Constrained monotone EM algorithms for finite<br />
mixture of multivariate Gaussians. Computational Statistics & Data Analysis,<br />
51, 5339–5351.<br />
McLachlan, G. J. and Peel, D. (2000): Finite Mixture Models, John Wiley & Sons,<br />
New York.<br />
− 53 −
Support Vector Machines in the Primal using<br />
Majorization and Kernels<br />
Patrick J.F. Groenen 1 , Georgi Nalbantov 2 , and Cor Bioch 3<br />
1 Econometric Institute, Erasmus University Rotterdam,<br />
Rotterdam, The Netherlands groenen@few.eur.nl<br />
2 Econometric Institute, Erasmus University Rotterdam, Rotterdam, The Netherlands, and MICC, University Maastricht, Maastricht, The Netherlands nalbantov@few.eur.nl
3 Econometric Institute, Erasmus University Rotterdam,<br />
Rotterdam, The Netherlands bioch@few.eur.nl<br />
Abstract. Support vector machines have become one of the mainstream methods for two-group classification. At the 2006 GfKl meeting in Berlin, we proposed SVM-Maj, a majorization algorithm that minimizes the SVM loss function (see Groenen, Nalbantov, and Bioch, 2007, 2008). A big advantage of majorization is that in each iteration, the SVM-Maj algorithm is guaranteed to decrease the loss until the global minimum is reached. Nonlinearity was achieved by replacing the predictor variables by their monotone spline bases and then fitting a linear SVM. A disadvantage of the method so far is that SVM-Maj becomes slow if the number of predictor variables m is large.
In this paper, we extend the SVM-Maj algorithm in the primal to handle efficiently cases where the number of observations n is (much) smaller than m. We show that the SVM-Maj algorithm can be adapted to handle this case of n ≪ m as well. In addition, using kernels instead of splines to handle the nonlinearity also becomes possible while still maintaining the guaranteed descent properties of SVM-Maj.
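The guaranteed monotone descent comes from the majorization principle: each iteration minimizes a quadratic upper bound that touches the objective at the current point. A toy illustration on f(w) = |w − 3| + 0.5 w² (not the SVM-Maj loss itself):

```python
def majorize_descent(w=0.0, iters=100, eps=1e-12):
    # minimize f(w) = |w - 3| + 0.5*w**2 by iterative majorization: at w0,
    # |w - 3| <= (w - 3)**2 / (2*d) + d/2 with d = |w0 - 3|, so each
    # surrogate is quadratic and its minimizer w = 3 / (1 + d) is closed form
    losses = []
    for _ in range(iters):
        losses.append(abs(w - 3) + 0.5 * w * w)
        d = max(abs(w - 3), eps)   # eps guards against division by zero
        w = 3.0 / (1.0 + d)        # argmin of the quadratic surrogate
    return w, losses
```

Because the surrogate lies above the loss and touches it at the current iterate, the loss sequence can never increase — the property the abstract highlights.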
Key words: Support vector machines, Iterative majorization, Binary classification<br />
problem, Kernel<br />
References<br />
Groenen, P.J.F., Nalbantov, G., and Bioch, J.C. (2007): Nonlinear support vector machines through iterative majorization and I-splines. In: R. Decker and H.-J. Lenz (Eds.): Advances in Data Analysis. Springer, Berlin, 149–162.
Groenen, P.J.F., Nalbantov, G., and Bioch, J.C. (2008, in press): SVM-Maj: A Majorization Approach to Linear Support Vector Machines with Different Hinge Errors. Advances in Data Analysis and Classification.
− 54 −
Usage of Artificial Neural Networks for Data<br />
Handling<br />
Lars Frank Groe and Franz Joos<br />
Helmut-Schmidt-University<br />
University of the Federal Armed Forces Hamburg<br />
Power Engineering<br />
Laboratory of Turbomachinery<br />
Hamburg, Germany<br />
Abstract. To reduce environmental pollution, it is essential to increase the efficiency of commercially available combustion engines. If one succeeds in modeling the combustion process, in particular the chemical reactions, it becomes feasible to replace experiments by computer simulations. Complex chemical reaction mechanisms like GRI3.0 consist of 325 reactions with 53 species. Computational hardware costs limit the integration of the resulting stiff equations to simple problems (2-D, low Reynolds numbers) or to very small numbers of species. Turbulent combustion, however, for example in combustion chambers of gas turbines, often involves complex geometry with a wide spectrum of chemical states and proceeds at high Reynolds numbers. The use of databases for storing chemical reactions is widely described in the literature (Pope, 1997), and several storage-based techniques have been implemented for data mining (look-up tables, in situ adaptive tabulation).
This paper suggests the use of artificial neural networks (ANN) to simulate complex chemistry with the full GRI3.0 mechanism. An ANN can represent the chemical reactions by creating a nonlinear multivariate model of the dataset. The information of the dataset is stored in the weights of the connected neurons. The net finds an optimal approximation of the presented data by a supervised learning method called backpropagation. The modeling and generalisation of a large number of chemical states by means of ANN, with regard to complicated combustion simulations, is the purpose of this work.
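As a sketch of the approach — with a stand-in scalar mapping rather than GRI3.0 chemistry, and a deliberately tiny network — a one-hidden-layer net trained by backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X)                      # stand-in for a tabulated chemical state

W1, b1 = rng.normal(0, 0.5, (1, 16)), np.zeros(16)
W2, b2 = rng.normal(0, 0.5, (16, 1)), np.zeros(1)
lr, losses = 0.1, []
for _ in range(2000):
    H = np.tanh(X @ W1 + b1)           # forward pass through the hidden layer
    out = H @ W2 + b2
    err = out - y
    losses.append(float(np.mean(err ** 2)))
    g_out = 2 * err / len(X)           # backward pass (backpropagation)
    g_W2, g_b2 = H.T @ g_out, g_out.sum(0)
    g_H = (g_out @ W2.T) * (1 - H ** 2)
    g_W1, g_b1 = X.T @ g_H, g_H.sum(0)
    W2 -= lr * g_W2; b2 -= lr * g_b2   # gradient step on all weights
    W1 -= lr * g_W1; b1 -= lr * g_b1
```

The trained weights then replace the stored table: evaluating the net is cheap compared to integrating the stiff chemistry directly.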
References<br />
Pope, S.B. (1997): Computationally efficient implementation of combustion chemistry using in situ adaptive tabulation. Combustion Theory and Modelling, 1, 41–63.
Smith, G.P., Golden, D.M., Frenklach, M., Moriarty, N.W., Eiteneer, B., Goldenberg, M., Bowman, C.T., Hanson, R.K., Song, S., Gardiner, W.C., Jr., Lissianski, V.V. and Qin, Z.: GRI-Mech 3.0, http://www.me.berkeley.edu/gri_mech/.
− 55 −
Model diagnostics of finite mixtures using<br />
bootstrapping<br />
Bettina Grün 1 and Friedrich Leisch 2<br />
1 Department für Statistik und Mathematik, Wirtschaftsuniversität Wien<br />
Augasse 2-6, 1090 Wien, Austria; Bettina.Gruen@wu-wien.ac.at<br />
2 Institut für Statistik, Ludwig-Maximilians-Universität München, Ludwigstraße<br />
33, D-80539 München, Germany; Friedrich.Leisch@stat.uni-muenchen.de<br />
Abstract. The EM algorithm provides a common framework for maximum likelihood<br />
estimation of finite mixture models. The fitted models can differ with respect<br />
to the component-specific models and may also allow for concomitant variables to model the component weights. The use of resampling methods to analyze finite mixture models fitted with the EM algorithm is appealing because the bootstrap, similarly to the EM algorithm, constitutes a common framework for these models.
We will outline various possibilities to use resampling methods for model diagnostics<br />
such as for determining the number of components, checking model identifiability<br />
and analyzing the stability of induced clusterings.<br />
The R package flexmix implements the EM algorithm for ML estimation of finite mixture models. It provides the E-step and all data handling, while arbitrary mixture models can be fitted by modifying the M-step. The implementation of bootstrap
techniques to allow for model diagnostics of the models fitted with the package is<br />
presented.<br />
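The combination of EM fitting and bootstrap resampling can be sketched as follows; this minimal univariate example (synthetic data, not using the flexmix package) resamples the data and refits to gauge the stability of the component means:

```python
import numpy as np

def em_2normals(x, iters=150):
    # bare-bones EM for a two-component univariate Gaussian mixture;
    # the 1/sqrt(2*pi) constant is dropped since it cancels in the E-step
    mu = np.array([x.min(), x.max()], dtype=float)
    sd = np.array([x.std(), x.std()])
    w = np.array([0.5, 0.5])
    for _ in range(iters):
        dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / sd
        r = dens / dens.sum(axis=1, keepdims=True)       # E-step
        nk = r.sum(axis=0)                               # M-step
        w, mu = nk / len(x), (r * x[:, None]).sum(0) / nk
        sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(0) / nk)
    return w, mu, sd

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 150)])
boot_means = np.array([
    np.sort(em_2normals(rng.choice(x, len(x), replace=True))[1])
    for _ in range(20)
])  # bootstrap distribution of the (sorted) component means
```

The spread of `boot_means` across replications is one simple stability diagnostic of the induced clustering.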
Key words: Bootstrap, Finite mixture, Model diagnostics, Resampling
References<br />
Grün, B. and Leisch, F. (2004): Bootstrapping Finite Mixture Models. In: J. Antoch<br />
(Ed.): Compstat 2004—Proceedings in Computational Statistics. Springer,<br />
Heidelberg, 1115–1122.<br />
Hothorn, T., Leisch, F., Zeileis, A. and Hornik, K. (2005): The Design and Analysis<br />
of Benchmark Experiments. Journal of Computational and Graphical Statistics,<br />
14(3), 1–25.<br />
Leisch, F. (2004): FlexMix: A general framework for finite mixture models and latent<br />
class regression in R. Journal of Statistical Software, 11(8).<br />
McLachlan, G.J. (1987): On Bootstrapping the Likelihood Ratio Test Statistic for<br />
the Number of Components in a Normal Mixture. Applied Statistics, 36(3),<br />
318–324.<br />
− 56 −
Classification with Regularized Kernel<br />
Mahalanobis-Distances<br />
Bernard Haasdonk 1 and Elżbieta Pękalska 2
1 Institute of Numerical and Applied Mathematics, University of Münster,<br />
Germany haasdonk@math.uni-muenster.de<br />
2 School of Computer Science, University of Manchester, United Kingdom<br />
pekalska@cs.man.ac.uk<br />
Abstract. Linear discriminant analysis has been demonstrated to be successful in<br />
kernel-induced feature spaces. In particular, in terms of accuracy, the kernel Fisher<br />
discriminant (KFDA) can frequently compete with or even outperform the support<br />
vector machine (SVM) (Mika et al. 2000). In situations, where linear discrimination<br />
in kernel feature space is suboptimal, nonlinear techniques offer a better solution<br />
(Huang et al. 2005). An example is quadratic classification in the kernel space, based<br />
on kernelized versions of class-related Mahalanobis distances.<br />
In this presentation, we give two different formulations for quadratic classifiers in kernel-induced feature spaces, depending on whether the class-related covariance operator has to be regularized or not. Experimental results on a toy data set enable
us to draw comparisons to SVM and KFDA. More importantly, these results provide<br />
a proof of principle that nonlinear discriminants can be beneficial in the kernel<br />
space.<br />
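In the Euclidean (non-kernelized) setting, the underlying classifier assigns each point to the class with the smallest regularized Mahalanobis distance; a sketch, with an illustrative ridge term eps standing in for the operator regularization discussed above:

```python
import numpy as np

def fit_mahalanobis(X, y, eps=1e-3):
    # one regularized class-wise Mahalanobis distance per class:
    # d_c(x)^2 = (x - mu_c)' (S_c + eps*I)^{-1} (x - mu_c)
    models = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        P = np.linalg.inv(np.cov(Xc.T) + eps * np.eye(X.shape[1]))
        models[c] = (mu, P)

    def predict(x):
        # assign to the class whose regularized Mahalanobis distance is smallest
        return min(models,
                   key=lambda c: (x - models[c][0]) @ models[c][1] @ (x - models[c][0]))
    return predict
```

The kernelized version replaces the explicit covariance inverse by operations on the kernel matrix, but the decision rule has the same shape.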
Key words: Kernel Methods, Quadratic Discriminants, Kernel Mahalanobis-Distance<br />
References<br />
Mika, S., Rätsch, G., Schölkopf, B., Smola, A., Weston, J. and Müller, K.-R. (2000): Invariant feature extraction and classification in kernel spaces. In: S.A. Solla, T.K. Leen, and K.-R. Müller (Eds.): Advances in Neural Information Processing Systems 12. MIT Press, Cambridge, MA, 526–532.
Huang, S.-Y., Hwang, C.-R. and Lin, M.-H. (2005): Kernel Fisher's discriminant analysis in Gaussian reproducing kernel Hilbert space. Technical Report, Academia Sinica, Taipei, Taiwan.
− 57 −
On classification of species of representation<br />
rings<br />
Lothar Häberle<br />
Department of Biometry and Epidemiology, University of Erlangen-Nuremberg,<br />
Waldstr. 6, 91054 Erlangen<br />
Abstract. In biology and chemistry, crystal structures and symmetries of molecules, for example, are classified by mathematical groups. The assigned groups can then be used to determine physical properties such as polarity and chirality.
Representing groups as linear transformations of vector spaces and, more generally, as modules enables many group-theoretical problems to be reduced to problems of linear algebra, a well-understood theory. Defining addition and
multiplication via direct sum and tensor product on the set of these modules and<br />
then considering them as elements of a ring, the representation ring, is an approach<br />
to examine such modules. In order to investigate representation rings one may study<br />
their structure preserving maps to the complex numbers, which are called species.<br />
We consider finite groups whose largest subgroup of prime power order is cyclic<br />
for some prime number and study the corresponding representation ring. The indecomposable<br />
modules are stated and the species are classified. The proposed way of<br />
classification may be applied to other classes of groups in the future and then be used<br />
in natural sciences. Throughout the paper we illustrate the theoretical statements<br />
with examples.<br />
Key words: mathematical group, species, representation ring, indecomposable<br />
module<br />
References<br />
Benson, D.J. (1991): Representations and Cohomology I. Cambridge University Press, Cambridge.
Fotsing, B. and Külshammer, B. (2005): Modular species and prime ideals for the<br />
ring of monomial representations of a finite group. Communications in Algebra,<br />
33, 3667–3677.<br />
Green, J.A. (1962): The modular representation algebra of a finite group. Illinois<br />
Journal of Mathematics, 6, 607–619.<br />
Häberle, L. (submitted): The species and idempotents of the Green algebra of a finite group with a cyclic Sylow subgroup.
Shriver, D.F. and Atkins, P.W. (2006): Inorganic Chemistry. Oxford University<br />
Press.<br />
− 58 −
Analysis of High-Resolution Scattered Light Data with Methods of Multivariate Statistics
Cornelius Hahlweg and Hendrik Rothe
Helmut-Schmidt-Universität
Hamburg
Abstract. The development of scattered light measurement techniques has been pushed forward over the past decades, in particular with a view to their use in the quality inspection of surfaces. Scattered light methods prove especially powerful for surface inspection, since they operate without contact, probe a scalable section of the surface, and at the same time resolve the finest surface structures.
In principle, scattered light distributions provide spectral information about properties of the surface under examination. While for very smooth surfaces this information indeed reflects the surface function itself, for surfaces above the so-called Rayleigh limit methods of multivariate statistics prove useful. In particular, the higher moments of the scattering distribution can serve as features for classification procedures. While these moments were proposed in earlier publications as a description of the scattering distribution itself on rather heuristic grounds, a physical meaning can now also be assigned to them. For preprocessing and reduction of the often very large two-dimensional data sets, principal component analysis (PCA) is applied first. For the classification of different samples, e.g., in the sense of quality control, linear canonical discriminant analysis is used. The contribution gives an insight into the foundations of the methods employed in relation to scattered light measurement and offers examples from their application to the classification of technical surfaces.
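The PCA preprocessing step can be sketched as a projection onto the leading principal components obtained from an SVD of the centred data (variable names are illustrative):

```python
import numpy as np

def pca_scores(X, k):
    # centre the measured scatter data, then project onto the k leading
    # principal components (right singular vectors) for dimension reduction
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T
```

The reduced scores would then feed the linear canonical discriminant analysis used for sample classification.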
− 59 −
Algorithms for Computing the Multivariate<br />
Isotonic Regression<br />
Jürgen Hansohm 1<br />
University of the Federal Armed Forces, Munich, Germany<br />
Juergen.Hansohm@UniBw-Muenchen.de<br />
Abstract. Sasabuchi et al. (1983) introduced a multivariate version of the well-known univariate isotonic regression, which plays a key role in the field of statistical inference under order restrictions. Their proposed algorithm for computing the multivariate isotonic regression, however, is guaranteed to converge only under special conditions (Sasabuchi et al., 2003). In this paper, a more general framework for multivariate isotonic regression is given, and an algorithm based on Dykstra's method is used to compute the multivariate isotonic regression. Two numerical examples are given to illustrate the algorithm and to compare the results with the Monte Carlo simulation published by Fernando and Kulatunga (2007).
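Dykstra's method finds the projection onto an intersection of convex sets by cycling through the individual projections while carrying a correction increment for each set; a generic sketch:

```python
import numpy as np

def dykstra(x, projections, iters=100):
    # Dykstra's cyclic-projection algorithm: converges to the least-squares
    # projection of x onto the intersection of the given closed convex sets
    # (plain alternating projection, without the increments, does not)
    incr = [np.zeros_like(x, dtype=float) for _ in projections]
    y = np.asarray(x, dtype=float).copy()
    for _ in range(iters):
        for i, proj in enumerate(projections):
            z = proj(y + incr[i])
            incr[i] = y + incr[i] - z   # update the correction for set i
            y = z
    return y
```

For example, projecting (3, −1) onto the intersection of {x : x₁ = x₂} and {x : x ≥ 0} yields (1, 1); in the isotonic-regression setting the convex sets encode the partial-order constraints.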
Key words: multivariate isotonic regression, projection, Dykstra’s algorithm, partial<br />
order, least squares solution<br />
References<br />
Sasabuchi, S., Inutsuka, M. and Kulatunga, D.D.S. (1983): A multivariate version of isotonic regression. Biometrika, 70, 465–472.
Sasabuchi, S., Miura, T. and Oda, H. (2003): Estimation and test of several multivariate normal means under an order restriction when the dimension is larger than two. Journal of Statistical Computation and Simulation, 73, 619–641.
Fernando, W.T.P.S. and Kulatunga, D.D.S. (2007): On the computation and some applications of multivariate isotonic regression. Computational Statistics and Data Analysis, 52, 702–712.
− 60 −
Precise and Efficient Recognition of Medical Request Forms
Uwe Henker 1 , Alfred Ultsch 2 , and Uwe Petersohn 3
1 DOCexpert Computer GmbH, Bamberg u.henker@docexpert.de
2 Databionics Research Group, Philipps-University of Marburg, Germany ultsch@informatik.uni-marburg.de
3 Institut Künstliche Intelligenz, TU Dresden peterson@inf.tu-dresden.de
Abstract. Forms for requesting medical and/or diagnostic services play a major role in current medical practice. With such forms, potentially life-critical medical or laboratory services are requested for a patient. The requests entered into such a form by the physician by hand are transferred to the laboratory or hospital information systems by machine recognition methods (optical mark recognition, OMR). Here, a changing set of different forms (prototypes) has to be assumed, all of which must be recognized reliably.
The contribution describes the knowledge representation of such forms in a case database of prototypes by means of case-based reasoning (CBR). The central idea is to compare the preprocessed and abstracted images of scanned forms with the prototypes in such a way that error tolerances are admitted. If a new prototype that overlaps strongly with existing prototypes is inserted into the knowledge base, additional decision knowledge for the form classifier is represented in the knowledge base in a multi-stage procedure. The approach achieves a recognition rate of 97% with no false-positive cases. Compared to other published approaches, this is a substantial improvement in recognition performance; in particular, the special requirement that no false-positive results be produced is met in full. The system was tested with real forms that, in the current application, carry a unique identifying feature (a barcode). The need for this barcode on every form is a considerable restriction, which the approach described here removes.
Key words: Classification, Knowledge Representation, Optical Mark Recognition, Image Processing, Medical Information Systems
− 61 −
Using cluster analysis for species delimitation<br />
Christian Hennig 1 and Bernhard Hausdorf 2<br />
1 Department of Statistical Science, University College London, Gower St, London<br />
WC1E 6BT, United Kingdom chrish@stats.ucl.ac.uk<br />
2 Zoologisches Museum der Universität Hamburg, Martin-Luther-King-Platz 3,<br />
20146 Hamburg, Germany hausdorf@zoologie.uni-hamburg.de<br />
Abstract. Species delimitation is a fundamental task in biology. Operationally,<br />
species can be conceived as continuously varying groups of organisms that are separate<br />
from other such groups. This suggests methods of cluster analysis to delimit<br />
species empirically for given data. However, in the literature there is no agreement<br />
about the species concept (see Mayden, 1997, for an overview), which affects the<br />
choice of the appropriate data, cluster analysis method, and the interpretation of<br />
the results. A particular problem arises because of the hierarchical nature of evolution.<br />
Clusters occur at many levels and may represent, beside species, intrapopulation<br />
polymorphisms, populations, regional variation or higher taxa. We present<br />
a methodology for delimiting putative species based on codominant and dominant<br />
genetic markers. The method combines the definition of an appropriate dissimilarity<br />
measure, multidimensional scaling and model-based cluster analysis. We propose a<br />
null model taking into account spatial autocorrelation in order to check whether<br />
inhomogeneities in the data can be explained from regional variation alone. The<br />
methodology is a generalization of the techniques presented in Hennig and Hausdorf<br />
(2004) to categorical genetic data. The methodology is compatible with most species
concepts. We discuss some general issues such as the choice of the clustering method<br />
and joining of not well separated clusters, which rather indicate inhomogeneity on<br />
lower levels than species.<br />
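The multidimensional scaling step of the pipeline can be sketched with classical (Torgerson) MDS, which embeds a Euclidean distance matrix exactly:

```python
import numpy as np

def classical_mds(D, k=2):
    # Torgerson's classical MDS: double-centre the squared distances and
    # embed the points with the top-k eigenvectors of the Gram matrix
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:k]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))
```

In the proposed methodology the input distances come from the genetic dissimilarity measure, and the resulting coordinates feed the model-based cluster analysis.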
Key words: Model-based cluster analysis, genotypes, spatial autocorrelation<br />
References<br />
Hennig, C. and Hausdorf, B. (2004): Distance-based parametric bootstrap tests for<br />
clustering of species ranges. Computational Statistics and Data Analysis, 45,<br />
875–896.<br />
Mayden, R.L. (1997): A hierarchy of species concepts: the denouement in the saga<br />
of the species problem. In: M.F. Claridge, H.A. Dawah, M.R. Wilson (Eds.):<br />
The Units of Biodiversity. Chapman and Hall, London, 381–424.<br />
− 62 −
Nonlinear Effects in PLS Path Models:<br />
A Comparison of Available Approaches<br />
Jörg Henseler 1<br />
Institute of Management Research, Radboud University Nijmegen, Thomas van<br />
Aquinostraat 1, 6525 GD Nijmegen, The Netherlands, J.Henseler@fm.ru.nl<br />
Summary. Along with the development of their scientific disciplines, researchers in business and social sciences are increasingly interested in investigating nonlinear effects
between latent variables. In this contribution, I present four approaches to modeling<br />
nonlinear effects with PLS: Firstly, Wold’s (1982) original approach takes the nonlinearity<br />
in the structural model into account during the iterative PLS algorithm.<br />
Secondly, the product indicator approach developed by Chin, Marcolin, and Newsted<br />
(2003) requires that the nonlinear function be applied a priori on the indicator<br />
level. Thirdly, a two-stage approach as suggested by Henseler and Fassott (2008) estimates
the nonlinear effect a posteriori once the latent variable scores are estimated<br />
by means of the linear effects PLS path model. Fourthly, I adapt an orthogonalizing<br />
approach originally suggested by Little, Bovaird, and Widaman (2006) to nonlinear<br />
PLS path modeling. Finally, I compare the performance of these four approaches<br />
by means of a Monte Carlo simulation, and derive guidelines for users of PLS path<br />
modeling.<br />
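The two-stage idea can be illustrated with ordinary least squares on latent variable scores; here the scores are simulated directly, standing in for the stage-one PLS estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
xi1, xi2 = rng.normal(size=n), rng.normal(size=n)
# structural model with a nonlinear (interaction) effect of 0.5
eta = 0.4 * xi1 + 0.3 * xi2 + 0.5 * xi1 * xi2 + rng.normal(scale=0.1, size=n)

# stage 2: OLS on the latent variable scores plus their product term
Z = np.column_stack([np.ones(n), xi1, xi2, xi1 * xi2])
beta, *_ = np.linalg.lstsq(Z, eta, rcond=None)
```

In the genuine two-stage procedure the scores xi1, xi2 would first be estimated from a linear-effects PLS path model; the a-posteriori regression step is as above.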
Key words: partial least squares, PLS path modeling, nonlinear terms<br />
References<br />
Chin, W. W., Marcolin, B. L., and Newsted, P. N. (2003): A Partial Least Squares<br />
Latent Variable Modeling Approach for Measuring Interaction Effects: Results<br />
from a Monte Carlo Simulation Study and an Electronic-mail Emotion/Adoption<br />
Study. Information Systems Research, 14, 189–217.<br />
Henseler, J. and Fassott, G. (2008): Testing Moderating Effects in PLS Path
Models: An Illustration of Available Procedures. In: V. E. Vinzi, W. W. Chin, J.<br />
Henseler, and H. Wang (Eds.): Handbook Partial Least Squares Path Modeling.<br />
Springer, Heidelberg, forthcoming.<br />
Little, T. D., Bovaird, J. A., and Widaman, K. F. (2006): On the Merits of<br />
Orthogonalizing Powered and Product Terms: Implications for Modeling Interactions<br />
Among Latent Variables, Structural Equation Modeling, 13, 497–519.<br />
Wold, H. (1982): Soft Modeling. The Basic Design and Some Extensions. In: K.<br />
G. Jöreskog and H. Wold (Eds.): Systems under Indirect Observation. Causality,<br />
Structure, Prediction, Part I. North-Holland, Amsterdam, 1–54.<br />
− 63 −
Classification of text processing components:<br />
The Tesla Role System<br />
Jürgen Hermes and Stephan Schwiebert<br />
Linguistic Data Processing, Department of Linguistics, University of Cologne<br />
{jhermes, sschwieb}@spinfo.uni-koeln.de<br />
Abstract. The analysis of sequences of discrete tokens (i.e., texts) is a major research subject of several essentially different disciplines such as corpus linguistics, literary studies and bioinformatics. Though differing in both their data and its interpretation, these disciplines share some intermediate processing steps. Following these considerations, the obvious
procedure is to encapsulate text processing tasks into components and create a<br />
framework that enables component interaction. The component arrangement within<br />
a workflow is comparable to an experimental setup: it allows a gradual modification<br />
of experiments, e.g., rerunning an experiment with a modified configuration or a<br />
replaced component.<br />
The Text Engineering Software Laboratory (Tesla) is an implementation of a<br />
framework that supports the development and deployment of text processing components<br />
as well as the execution of experiments on textual data. One of its main<br />
ideas is reducing the framework’s restrictions on data modeling to a minimum, allowing<br />
developers to focus on their scientific tasks. However, this raises new issues: an extensible way of defining database access, data exchange between components, and data conversion during visualization. If, for instance, the annotations produced
by a component cannot be related sequentially to single text elements but do instead<br />
represent more complex relations between these elements, as generally in graphs or<br />
matrices, the information contained in such data types can only be extracted with<br />
knowledge about their internal structure and its meaning, thus violating a basic<br />
principle of component frameworks.<br />
Addressing these concerns, the concept of a role is introduced in Tesla. A role<br />
adopted by a component specifies the type as well as the access methods of the<br />
produced data. As the role system implicitly exhibits a hierarchical structure, this<br />
finally leads to a dynamic classification of text processing components.<br />
− 64 −
Strengths and Weaknesses of Ant Colony<br />
Clustering<br />
Lutz Herrmann and Alfred Ultsch<br />
Databionics Research Group<br />
University of Marburg, Germany<br />
{lherrmann,ultsch}@informatik.uni-marburg.de<br />
Abstract. Ant colony clustering (ACC) is a promising nature-inspired technique<br />
where stochastic agents perform the task of clustering high-dimensional data on a<br />
low-dimensional output space. Most ACC methods are derivatives of the approach<br />
proposed by Lumer and Faieta. These methods usually perform poorly in terms of topographic mapping and cluster formation, in particular when compared to clustering on Emergent Self-Organizing Maps (ESOM).
In order to address this issue, a unifying representation for both ACC methods
and Emergent Self-Organizing Maps is derived in a brief yet formal manner. ACC<br />
terms are related to corresponding mechanisms of the Self-Organizing Map. This<br />
leads to insights into both algorithms: ACC methods can be considered first-degree relatives of the ESOM, which explains benefits and shortcomings of both. Furthermore, the proposed unification allows one to judge whether modifications improve an algorithm's clustering abilities or not. This is demonstrated using a set of cardinal
clustering problems.<br />
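Most Lumer–Faieta derivatives share the same pick/drop rule: an ant picks up an item where the local similarity density f is low and drops it where f is high. A sketch (k1 and k2 are the usual threshold constants; the values here are illustrative):

```python
def pick_prob(f, k1=0.1):
    # Lumer-Faieta style: an unladen ant picks up an item with high
    # probability where the local similarity density f is low
    return (k1 / (k1 + f)) ** 2

def drop_prob(f, k2=0.15):
    # ...and a laden ant drops its item where f is high
    return (f / (k2 + f)) ** 2
```

It is exactly this local, density-driven update that the abstract relates to the neighbourhood adaptation of the Self-Organizing Map.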
Key words: Clustering, Emergent Self-Organizing Maps, Swarm Intelligence<br />
References<br />
Deneubourg, J.-L., Goss, S., Franks, N., Sendova-Franks, A., Detrain, C. and Chretien,<br />
L. (1991): The dynamics of collective sorting: Robot-like ants and ant-like<br />
robots. In: Proc. of the First International Conference on Simulation of Adaptive<br />
Behaviour: From Animals to Animats 1. MIT Press, Cambridge, 356-365.<br />
Handl, J., Knowles, J. and Dorigo, M. (2005): Ant-Based Clustering and Topographic<br />
Mapping. Artificial Life 12(1), MIT Press, Cambridge.<br />
Kohonen, T. (1995): Self-Organizing Maps. Springer, Berlin, Heidelberg, New York.<br />
Lumer, E. and Faieta, B. (1994): Diversity and adaptation in populations of clustering
ants. In: Proc. of the Third International Conference on Simulation of Adaptive<br />
Behaviour: From Animals to Animats 3. MIT Press, Cambridge, 501–508.<br />
Ultsch, A. and Herrmann, L. (2005): The architecture of emergent self-organizing<br />
maps to reduce projection errors. In: Verleysen M. (Eds): Proc. of the European<br />
Symposium on Artificial Neural Networks (ESANN 2005).<br />
− 65 −
Reconstructing Central Places and Settlement Groups
Irmela Herzog<br />
The Rhineland Regional Council / The Rhineland Commission for Archaeological<br />
Monuments and Sites<br />
Bonn, Germany<br />
i.herzog@LVR.de<br />
Abstract. If (i) the location of settlements is known for a certain period in time<br />
and (ii) the settlements are distributed in such a way that cluster centres with high<br />
settlement densities are present, then a variant of the density clustering algorithm<br />
using basin spanning trees can be applied to (i) reconstruct the location of the<br />
cluster centres and (ii) group the settlements. The model of this approach is based<br />
on the assumption that the exchange rate of products is high where the settlements<br />
are close to each other and/or the settlement size is large. If people living in a<br />
settlement with a low exchange rate wanted to buy or sell something they would<br />
walk to one of the settlements with higher exchange rates in their neighbourhood.<br />
This can be modelled by several variants of the density clustering algorithm using<br />
basin spanning trees. Which of the neighbouring settlements is determined as the<br />
preferred location for product exchange depends on the algorithm variant chosen.<br />
This method is used to reconstruct trade networks, and all settlements connected<br />
by direct or indirect trade links constitute a group. While extending the original<br />
clustering algorithm to support different settlement sizes could be accomplished<br />
easily, the adjustments needed to take into account the costs of walking between<br />
two locations in prehistoric times are by no means trivial. Examples from the river<br />
Main area with Bronze and Iron Age settlements will be presented.<br />
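One simple variant of the density-clustering idea links every settlement to its nearest neighbour of higher density, so that local density maxima become the centres and following the links groups the settlements; a sketch (assuming uniform settlement size and plain Euclidean distance rather than prehistoric walking costs):

```python
import math

def basin_clusters(points, density):
    # each settlement links to its nearest neighbour of strictly higher
    # density; local density maxima become cluster centres, and the links
    # form a spanning tree of each basin
    n = len(points)
    parent = list(range(n))
    for i in range(n):
        best, best_d = i, math.inf
        for j in range(n):
            if density[j] > density[i]:
                d = math.dist(points[i], points[j])
                if d < best_d:
                    best, best_d = j, d
        parent[i] = best

    def centre(i):
        # follow the links up to the basin's density maximum
        while parent[i] != i:
            i = parent[i]
        return i
    return [centre(i) for i in range(n)]
```

Replacing `math.dist` with a cost-surface distance is exactly the non-trivial adjustment the abstract describes.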
References<br />
Hader, S. and Hamprecht, F.A. (2003): Efficient Density Clustering Using Basin Spanning Trees. In: M. Schader, W. Gaul and M. Vichi (Eds.): Between Data Science and Applied Data Analysis. Springer, Berlin, 39–48.
− 66 −
On the prognostic value of gene expression<br />
signatures for censored data<br />
Thomas Hielscher, Manuela Zucknick, Wiebke Werft and Axel Benner<br />
Division of Biostatistics, German Cancer Research Center, Heidelberg, Germany<br />
t.hielscher@dkfz.de<br />
Abstract. As part of the validation of a new gene expression signature it is good<br />
statistical practice to quantify the amount of prognostic information represented by<br />
the signature. Open questions are how to measure the gain in prognostic information<br />
compared to established clinical parameters or biomarkers and the additional<br />
predictive accuracy especially when dealing with censored data. To answer these<br />
questions it is required to use consistent and interpretable measures.<br />
Several measures of prediction accuracy and proportion of explained variation<br />
have been suggested for right-censored event times. The underlying mechanisms of<br />
these measures are as different as the use of Schoenfeld residuals, model likelihoods<br />
or the variation of the individual survival curves. Consequently, these measures vary<br />
in their assumptions and properties and it remains unclear under which conditions<br />
and to which extent they are comparable. Moreover, explained variation for survival<br />
data can be considered as a function of time and therefore strongly depends on the<br />
available follow-up time and the time range of interest.<br />
We present a comparison of several common measures such as the Brier Score<br />
(Graf et al., 1999), the V measure (Schemper and Henderson, 2000) and the method<br />
of O’Quigley and Xu (2001) to illustrate their application to simulated and real<br />
clinical data. A presentation of existing and possible approaches to estimate the<br />
variability of these measures will be provided. An overview of available software<br />
implementations in R will be given.<br />
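As a small illustration of one such measure (our own Python sketch, not the authors' R implementation), the time-dependent Brier score of Graf et al. (1999) can be computed with inverse-probability-of-censoring weights taken from a Kaplan-Meier estimate of the censoring distribution:<br />

```python
import numpy as np

def km_censoring(times, events):
    """Kaplan-Meier estimate of the censoring survival function G(t).
    A censoring event is the complement of an outcome event."""
    order = np.argsort(times)
    t = np.asarray(times)[order]
    d = 1 - np.asarray(events)[order]      # 1 marks a censored observation
    at_risk = len(t) - np.arange(len(t))   # risk-set size at each ordered time
    surv = np.cumprod(1.0 - d / at_risk)

    def G(x):
        idx = np.searchsorted(t, x, side="right") - 1
        return 1.0 if idx < 0 else float(surv[idx])
    return G

def brier_score(times, events, pred_surv, t0):
    """Time-dependent Brier score at t0 with IPCW weights.
    pred_surv[i] is the predicted survival probability S(t0 | x_i).
    (For simplicity G is evaluated at T_i rather than its left limit.)"""
    G = km_censoring(times, events)
    bs = 0.0
    for Ti, di, Si in zip(times, events, pred_surv):
        if Ti <= t0 and di == 1:        # event occurred: true status at t0 is 0
            bs += Si ** 2 / G(Ti)
        elif Ti > t0:                   # still at risk: true status at t0 is 1
            bs += (1.0 - Si) ** 2 / G(t0)
        # observations censored before t0 receive weight zero
    return bs / len(times)
```

With no censoring all weights equal one and the score reduces to the mean squared difference between the predicted survival probabilities and the observed status at t0.<br />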
Key words: Survival, Predictive Accuracy, Gene Expression<br />
References<br />
Graf, E., Schmoor, C., Sauerbrei, W., and Schumacher, M. (1999): Assessment<br />
and comparison of prognostic classification schemes for survival data. Statistics<br />
in Medicine, 18, 2529–2545.<br />
O’Quigley, J. and Xu, R. (2001): Explained variation in proportional hazards regression.<br />
In: J. Crowley and D.P. Ankerst (Eds.): Handbook of Statistics in Clinical<br />
Oncology, Second Edition. Chapman & Hall/CRC Press, 347–363.<br />
Schemper, M. and Henderson, R. (2000): Predictive accuracy and explained variation<br />
in Cox regression. Biometrics, 56, 249–255.<br />
− 67 −
Likelihood ratio testing for hidden Markov<br />
models<br />
Hajo Holzmann and Jörn Dannemann<br />
University of Karlsruhe<br />
Germany<br />
Abstract. When a mixture arises as the marginal distribution of a stationary process,<br />
the dependency structure can be incorporated by assuming that the underlying<br />
regime forms a finite state Markov chain. This leads to the class of hidden Markov<br />
models (HMMs), which are also called Markov dependent mixtures. We shall discuss<br />
maximum likelihood inference in HMMs. In particular, we investigate the problem<br />
of testing for the number of states via the likelihood ratio test (LRT). We propose<br />
a modified LRT for two against more states in an HMM, which is based on the<br />
so-called likelihood function under the independence assumption, and derive its asymptotic<br />
distribution under the null hypothesis. Simulation results and applications to<br />
financial and biological time series illustrate the practical use of the methods.<br />
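For background, the HMM likelihood entering such a ratio test can be evaluated with the standard scaled forward recursion; the following Python sketch (our illustration, assuming discrete emissions) shows the computation:<br />

```python
import numpy as np

def hmm_loglik(obs, pi, A, B):
    """Log-likelihood of a symbol sequence under an HMM, computed with
    the scaled forward algorithm.
    pi: initial state distribution, shape (m,)
    A:  state transition matrix, shape (m, m)
    B:  emission probabilities, shape (m, n_symbols)
    obs: sequence of symbol indices."""
    alpha = pi * B[:, obs[0]]              # forward variables at time 0
    loglik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()            # rescale to avoid underflow
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # propagate through A, then emit
        s = alpha.sum()
        loglik += np.log(s)
        alpha = alpha / s
    return loglik
```

For short sequences the result can be checked against brute-force summation over all state paths.<br />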
− 68 −
Rule-Based Learning of Reliable Classifiers<br />
Jens Hühn and Eyke Hüllermeier<br />
Department of Mathematics and Computer Science, University of Marburg<br />
{huehnj,eyke}@mathematik.uni-marburg.de<br />
Abstract. This paper introduces a fuzzy rule-based classification method called<br />
FR3, which is short for Fuzzy Round Robin RIPPER. As the name suggests, FR3<br />
builds upon the RIPPER algorithm, a state-of-the-art rule learner. More specifically,<br />
in the context of polychotomous classification, it uses a fuzzy extension of RIPPER<br />
as a base learner within a round robin scheme and, thus, can be seen as a fuzzy<br />
variant of the R3 learner that has recently been introduced in the literature. A<br />
key feature of FR3, in comparison with its non-fuzzy counterpart, is its ability to<br />
represent different facets of uncertainty involved in a classification decision in a more<br />
faithful way. FR3 thus provides the basis for implementing “reliable classifiers” that<br />
may abstain from a decision when not being sure enough, or at least indicate that<br />
a classification is not fully supported by the empirical evidence at hand. Besides,<br />
our experimental results show that FR3 outperforms R3 in terms of classification<br />
accuracy and, therefore, suggest that it produces predictions that are not only more<br />
reliable but also more accurate.<br />
Key words: Machine learning, classification, rule induction, uncertainty, fuzzy sets.<br />
References<br />
William W. Cohen (1995). Fast effective rule induction. In Armand Prieditis and<br />
Stuart Russell, editors, Proceedings of the 12th International Conference on<br />
Machine Learning, pages 115–123, Tahoe City, CA. Morgan Kaufmann.<br />
Johannes Fürnkranz (2003). Round robin ensembles. Intell. Data Anal., 7(5):385–<br />
403.<br />
Eyke Hüllermeier and Klaus Brinker (<strong>2008</strong>). Learning valued preference structures<br />
for solving classification problems, Fuzzy Sets and Systems (to appear).<br />
− 69 −
Combining Predictions in Pairwise<br />
Classification: An Adaptive Voting Strategy<br />
and Its Relation to Weighted Voting<br />
Eyke Hüllermeier and Stijn Vanderlooy<br />
Department of Mathematics and Computer Science, University of Marburg<br />
{eyke,vanderlooy}@mathematik.uni-marburg.de<br />
Abstract. Learning by pairwise comparison is a well-known decomposition technique<br />
which allows one to transform a polychotomous classification problem into<br />
a number of binary problems. To aggregate the predictions from the ensemble of<br />
binary models into a final classification, various aggregation strategies have been<br />
proposed. The most commonly used strategy is weighted voting, in which the prediction<br />
of each model is counted as a (weighted) “vote” for a class label, and the<br />
class with the highest sum of votes is predicted as the label of the query instance.<br />
Even though weighted voting turned out to perform very well in practice, it remains<br />
ad-hoc to some extent and lacks a sound theoretical basis.<br />
In this regard, the current paper makes the following contributions. First, we<br />
propose a formal framework in which the aforementioned aggregation problem can be<br />
studied in a convenient way. This framework is based on the setting of label ranking<br />
which has recently received attention in the machine learning literature. Second,<br />
within this framework, we develop a new aggregation strategy called adaptive voting.<br />
This strategy allows one to take the strength of individual learners into consideration<br />
and, under certain assumptions, is provably optimal in the sense that it yields a MAP<br />
prediction of the class label. Third, we show that weighted voting can be seen as<br />
an approximation of adaptive voting and, hence, approximates a MAP prediction.<br />
This theoretical justification of weighted voting is confirmed by strong empirical<br />
evidence showing that it is (at least) competitive in practice.<br />
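The weighted voting rule described above fits in a few lines; in the Python sketch below (ours), the dictionary `pairwise_probs` is a name we introduce for illustration and is assumed to map a class pair (i, j) to the binary model's estimated probability that i beats j:<br />

```python
def weighted_voting(pairwise_probs, n_classes):
    """Aggregate pairwise predictions by weighted voting: each binary
    model contributes its probability estimate as a fractional vote for
    one class and the complement for the other; the class with the
    highest vote total is predicted."""
    votes = [0.0] * n_classes
    for (i, j), p in pairwise_probs.items():
        votes[i] += p           # weighted vote for class i
        votes[j] += 1.0 - p     # complementary vote for class j
    return max(range(n_classes), key=lambda c: votes[c])
```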
Key words: Machine learning, pairwise classification, weighted voting, label ranking,<br />
MAP prediction.<br />
− 70 −
Using Cluster Networks to Represent<br />
Non-Compatible Sets of Clusters<br />
Daniel H. Huson and Regula Rupp<br />
Center for Bioinformatics ZBIT, Tübingen University, Sand 14, 72076 Tübingen,<br />
Germany<br />
{huson,rupp}@informatik.uni-tuebingen.de<br />
Abstract. A set of clusters is called compatible (or hierarchical) if it can be represented<br />
by a rooted tree. In many applications, such as multiple-gene phylogenetic<br />
analysis, sets of clusters arise that are not compatible, and the question is how<br />
to represent such sets in a useful way, in particular emphasizing the parts of the cluster<br />
system that are tree-like and showing where the incompatibilities lie.<br />
The result of a multiple gene tree analysis is usually a number of different tree<br />
topologies that are each supported by a significant proportion of the genes. We<br />
introduce the concept of a cluster network that can be used to combine such trees<br />
into a single rooted network, which can be drawn either as a cladogram or phylogram.<br />
In contrast to split networks, which can grow exponentially in the size of the input,<br />
cluster networks grow only quadratically. A cluster network is easily computed using<br />
a modification of the tree-popping algorithm, which we call network-popping. The<br />
approach will be made available as part of the Dendroscope tree-drawing program<br />
and its application will be illustrated using data and results from recent studies on<br />
large numbers of gene trees.<br />
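The defining property is easy to check directly: a cluster set is compatible exactly when every two clusters are disjoint or nested. A minimal Python sketch of that check (ours, not the network-popping algorithm itself):<br />

```python
def is_compatible(clusters):
    """True if the cluster set can be represented by a rooted tree,
    i.e. every pair of clusters is either disjoint or nested."""
    sets = [frozenset(c) for c in clusters]
    for a in sets:
        for b in sets:
            inter = a & b
            if inter and inter != a and inter != b:
                return False    # proper overlap: incompatible pair
    return True
```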
Key words: clusters, networks, trees, phylogenetics<br />
References<br />
D.H. Huson and D. Bryant. Application of phylogenetic networks in evolutionary<br />
studies. Molecular Biology and Evolution, 23:254–267, 2006. Software available<br />
from www.splitstree.org.<br />
D.H. Huson, D.C. Richter, C. Rausch, T. Dezulian, M. Franz, and R. Rupp. Dendroscope:<br />
An interactive viewer for large phylogenetic trees. BMC Bioinformatics,<br />
8:460, doi:10.1186/1471-2105-8-460, 2007. Software available from<br />
www.dendroscope.org.<br />
− 71 −
Genome phylogeny based on short-range<br />
correlations in DNA sequences<br />
Marc-Thorsten Hütt 1<br />
Jacobs University Bremen<br />
School of Engineering and Science<br />
Campus Ring 1<br />
m.huett@jacobs-university.de<br />
Abstract. The surprising fact that global statistical properties computed on a<br />
genome-wide scale may reveal species information was first observed in studies<br />
of dinucleotide frequencies. In this presentation I will look at the same phenomenon<br />
with a completely different statistical approach. We show that patterns in the short-range<br />
statistical correlations in DNA sequences serve as evolutionary fingerprints of<br />
eukaryotes. All chromosomes of a species display the same characteristic pattern,<br />
markedly different from those of other species. The chromosomes of a species are<br />
sorted onto the same branch of a phylogenetic tree due to this correlation pattern.<br />
The average correlation between nucleotides at a distance k is quantified in two independent<br />
ways: (i) by estimating it from a higher-order Markov process and (ii) by<br />
computing the mutual information function at a distance k. We show how the quality<br />
of phylogenetic reconstruction depends on the range of correlation strengths and<br />
on the length of the underlying sequence segment. This concept of the correlation<br />
pattern as a phylogenetic signature of eukaryote species combines two rather distant<br />
domains of research, namely phylogenetic analysis based on molecular observation<br />
and the study of the correlation structure of DNA sequences.<br />
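The mutual information at distance k in (ii) can be estimated directly from empirical symbol-pair frequencies; a minimal Python sketch (our own illustration):<br />

```python
import math
from collections import Counter

def mutual_information(seq, k):
    """Mutual information (in bits) between symbols at distance k,
    estimated from empirical pair frequencies in the sequence."""
    pairs = [(seq[i], seq[i + k]) for i in range(len(seq) - k)]
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(p[0] for p in pairs)   # marginal counts, first position
    py = Counter(p[1] for p in pairs)   # marginal counts, second position
    mi = 0.0
    for (x, y), c in joint.items():
        pxy = c / n
        mi += pxy * math.log2(pxy * n * n / (px[x] * py[y]))
    return mi
```

A constant sequence has zero mutual information at every distance, while a strictly alternating two-letter sequence approaches one bit at distance one.<br />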
− 72 −
Dimensionality Reduction of Similarity Matrix<br />
Tadashi Imaizumi<br />
Tama University, imaizumi@tama.ac.jp<br />
Abstract. It has become easy to collect a similarity matrix for a large number of<br />
objects, for example in basket analysis in data mining, and we then apply various<br />
unsupervised methods to these data according to the purpose of the analysis. Many<br />
research fields have shown that geometric models such as multidimensional scaling<br />
(MDS) or the self-organizing map (SOM) are applicable. However, two problems<br />
arise when we want to apply these methods to a large similarity matrix. The first is<br />
the change of the dimensions in focus; the second is how to incorporate our prior<br />
information about the data. The latent dimensions of the similarity evaluation<br />
process may be common to all objects when the attributes of the objects are<br />
unambiguous and the number of objects is not too large. However, we would not<br />
agree that the similarity evaluation between Hamburg and Tokyo works in the same<br />
way as that between Hamburg and Berlin, which raises the question of how to model<br />
this process. The second problem is how to incorporate the researcher’s knowledge<br />
into the model as prior information. We have some knowledge about the data, and<br />
the gathered data contain this knowledge as hidden information; it is therefore<br />
worthwhile to propose a supervised geometric model that treats this information as<br />
model parameters. I will discuss these two problems, compare the dimensionality<br />
reduction methods, and propose a model for the change of the dimensions in focus<br />
and for the prior information about the data.<br />
Key words: supervised models, latent dimensions, attribute focus<br />
References<br />
Kohonen, T. (1995): Self-Organizing Maps. Springer, Berlin, Heidelberg.<br />
Roweis, S. T. and Saul, L. K. (2000): Nonlinear dimensionality reduction by locally<br />
linear embedding. Science 290, 2323-2326.<br />
Sammon, J. W., Jr. (1969): A nonlinear mapping for data structure analysis. IEEE<br />
Transactions on Computers, C-18, 5-28.<br />
Tenenbaum, J. B., de Silva, V. and Langford, J. C. (2000): A global geometric framework<br />
for nonlinear dimensionality reduction. Science 290, 2319-2323.<br />
− 73 −
Settlement Behaviour during the 7th to 11th<br />
Centuries along the Ems: A GIS-Based<br />
Settlement-Archaeological Analysis of the Area<br />
between Warendorf and Rheine<br />
Katrin Jaspers<br />
Universität Münster, Germany<br />
Abstract. With the help of GIS, the chronological, topographical and historical<br />
relationships between the various settlements of the study area are to be documented,<br />
taking pedological aspects into account as well. In this way, the processes and<br />
developments of settlement could be made clear, possibly allowing conclusions to<br />
be drawn about the infrastructure between the settlements.<br />
Since this work is still at an early stage, only a preliminary thematic outline<br />
can be given here.<br />
− 74 −
Benchmarking Bicluster Algorithms<br />
Sebastian Kaiser and Friedrich Leisch<br />
Department of Statistics, Ludwig-Maximilians-Universität München,<br />
Ludwigstrasse 33, 80539 München, Germany,<br />
firstname.lastname@stat.uni-muenchen.de<br />
Abstract. Over the last decade, bicluster methods have become increasingly popular<br />
in different fields of two way data analysis, and a large variety of algorithms<br />
and analysis methods have been published; see Madeira and Oliveira (2004) for<br />
a survey. In this presentation, we show how the general benchmarking framework<br />
by Hothorn et al. (2005) can be adapted to the special case of biclustering. A key<br />
issue is the development of bootstrap strategies for two-way data, which do not only<br />
resample cases, but also variables.<br />
All methods presented have been implemented in the open-source R package<br />
biclust, which is available on http://cran.r-project.org. Both artificial and<br />
real-world microarray data are used for benchmark experiments. The resulting<br />
benchmark data are explored using new graphical techniques and analyzed by means<br />
of statistical inference.<br />
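The two-way bootstrap idea mentioned above, resampling both cases and variables, can be sketched in a few lines of Python (our illustration, not the biclust implementation):<br />

```python
import numpy as np

def two_way_bootstrap(X, seed=None):
    """One bootstrap replicate of a data matrix obtained by resampling
    both cases (rows) and variables (columns) with replacement."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    rows = rng.integers(0, n, size=n)   # resampled case indices
    cols = rng.integers(0, p, size=p)   # resampled variable indices
    return X[np.ix_(rows, cols)]
```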
Key words: Biclustering, Two-Way-Clustering, Validation, R<br />
References<br />
Hothorn, T., Leisch, F., Zeileis, A., and Hornik, K. (2005): The design and analysis<br />
of benchmark experiments. Journal of Computational and Graphical Statistics,<br />
14(3), 675–699.<br />
Madeira, S. C. and Oliveira, A. L. (2004): Biclustering algorithms for biological data<br />
analysis: A survey. IEEE/ACM Transactions on Computational Biology and<br />
Bioinformatics, 1(1), 24–45.<br />
Santamaria, R., Theron, R., and Quintales, L. (2007): A framework to analyze biclustering<br />
results on microarray experiments. In: 8th International Conference on<br />
Intelligent Data Engineering and Automated Learning (IDEAL’07), Springer,<br />
Berlin, 770–779.<br />
Turner, H., Bailey, T., and Krzanowski, W. (2005): Improved biclustering of microarray<br />
data demonstrated through systematic performance tests. Computational<br />
Statistics and Data Analysis, 48, 235–254.<br />
− 75 −
Nonparametric distribution analysis for text<br />
mining<br />
Alexandros Karatzoglou 1 , Ingo Feinerer 2,3 , and Kurt Hornik 3<br />
1 INSA de Rouen, France alexis@ci.tuwien.ac.at<br />
2 Theory and Logic Group, Institute of Computer Languages<br />
Vienna University of Technology, Austria feinerer@logic.at<br />
3 Department für Statistik und Mathematik,<br />
Wirtschaftsuniversität Wien, Austria kurt.hornik@wu-wien.ac.at<br />
Abstract. A number of new algorithms for non-parametric distribution analysis<br />
based on Maximum Mean Discrepancy measures and the Hilbert-Schmidt norm have<br />
been introduced recently. These novel algorithms operate in Hilbert space and can be<br />
used for Two-Sample Tests, Hierarchical Clustering and Dimensionality Reduction.<br />
Coupled with recent advances in string kernels, these methods extend the scope of<br />
kernel-based methods in the area of text mining.<br />
We review this group of kernel methods focusing on text mining where we will<br />
propose novel applications and present an efficient implementation in the kernlab<br />
package. We also present an efficient and integrated environment for applying modern<br />
machine learning methods to complex text mining problems through the combined<br />
use of the tm (for text mining) and the kernlab (for kernel-based learning) R<br />
packages.<br />
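As background, the Maximum Mean Discrepancy compares the mean kernel embeddings of two samples; the Python sketch below (ours, with a Gaussian kernel assumed in place of a string kernel) computes the biased estimate of MMD²:<br />

```python
import numpy as np

def mmd2_biased(X, Y, sigma=1.0):
    """Biased estimate of the squared Maximum Mean Discrepancy between
    samples X (n, d) and Y (m, d) using a Gaussian RBF kernel."""
    def k(A, B):
        # pairwise squared Euclidean distances, then RBF kernel
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()
```

Replacing the Gaussian kernel with a string kernel extends the same statistic to text data, as discussed in the abstract.<br />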
Key words: kernel methods, text mining, R<br />
References<br />
Karatzoglou, A., Smola, A., Hornik, K., Zeileis, A. (2004): kernlab - An S4 Package<br />
for Kernel Methods in R. Journal of Statistical Software, 11(9).<br />
Smola, A., A. Gretton, L. Song and B. Schölkopf (2007): A Hilbert Space Embedding<br />
for Distributions. Proceedings of the 18th International Conference on<br />
Algorithmic Learning Theory (ALT 2007), 13-31, Springer, Berlin, Germany<br />
Song, L., A. J. Smola, K. Borgwardt and A. Gretton (2007): Colored Maximum Variance<br />
Unfolding. Proceedings of the Twenty-First Annual Conference on Neural<br />
Information Processing Systems (NIPS 2007), 1-8, MIT Press, Cambridge,<br />
Mass., USA<br />
− 76 −
Index-Tracking Securities –<br />
A Comparative Analysis Using the Example<br />
of the DAX<br />
Christian Klein 1 and Dennis Kundisch 2<br />
1 Universität Hohenheim, Lehrstuhl für Rechnungswesen und Finanzierung, 70593<br />
Stuttgart, cklein@uni-hohenheim.de<br />
2 Universität Augsburg, Lehrstuhl für BWL, Wirtschaftsinformatik und Financial<br />
Engineering, 86135 Augsburg, dennis.kundisch@wiwi.uni-augsburg.de<br />
Abstract. In this paper we compare various index-tracking securities. Products of<br />
this kind promise the investor a performance that matches an underlying index as<br />
closely as possible. Our study considers several aspects, among them the replication<br />
quality of the securities. We thus provide a differentiated picture of both the quality<br />
of the products and the strengths and weaknesses of the commonly applied<br />
evaluation methods.<br />
Key words: stock index, index replication<br />
− 77 −
Polyphasic genomic approach for the taxonomy<br />
of archaea and bacteria<br />
Hans-Peter Klenk<br />
DSMZ - German Collection of Microorganisms and Cell Cultures, 38124<br />
Braunschweig, Germany, hpk@dsmz.de<br />
Abstract. Contemporary taxonomic classification of prokaryotes is primarily based<br />
on the analysis of 16S rDNA sequences, extended by chemotaxonomical analyses,<br />
e.g. whole-cell fatty acids or amino acid analysis of cell walls. Although rDNAs<br />
are excellent taxonomic markers, they represent far less than 1% of a genome.<br />
With 637 published prokaryotic genomes and more than 1850 ongoing archaeal and<br />
bacterial genome sequencing projects to date, the future of systematics will clearly be based on the<br />
analysis of whole genome sequences. The major imminent problems on the way to<br />
a genome-based systematic classification of prokaryotes are: 1) uneven phylogenetic<br />
distribution of the sequenced genomes; 2) large variation of the phylogenetic value<br />
in different fractions of the genomes; and 3) affordable technology for rapid sequence<br />
generation combined with highly automated analysis of the information. A massive<br />
generation of genome sequences from phylogenetically isolated archaea and bacteria<br />
in a collaboration between Joint Genome Institute with DSMZ aims for rapid filling<br />
of the deep phylogenetic gaps, soon to be followed by sequenced genomes of all<br />
type strains. The fast variation between genes or sets of genes in view of sequence<br />
conservation and genetic stability is problematic for global approaches to universal<br />
phylogenies, but provides suitable novel taxonomic markers for more restricted areas<br />
within the diversity of micro-organisms. New technologies for sequence generation<br />
have already sharply decreased the price of producing microbial genome<br />
sequences and will continue to do so until the genome of any cultivated species of<br />
archaea or bacteria becomes affordable. The more complex problem to be solved<br />
is the automated processing of the genomes within an endlessly growing sequence<br />
space.<br />
References<br />
Klenk, H.-P. (2007) Genomic future for the taxonomy of prokaryotes. In: E Stackebrandt,<br />
M Wozniczka, V Weihs & J Sikorski (eds) Connections between Collections.<br />
Proceedings of the 11th International Conference on Culture Collections.<br />
ISBN 978-3-00-022417-1. DSMZ, Braunschweig, Germany. pp 117-119<br />
− 78 −
Exploiting synergetic and redundant features<br />
for multimedia document classification<br />
Jana Kludas, Eric Bruno and Stephane Marchand-Maillet<br />
University of Geneva, Switzerland<br />
kludas|bruno|marchand@cui.unige.ch<br />
Abstract. Multimedia data handling in all kinds of applications has received a<br />
lot of attention from the research communities in the last decade, due to the<br />
’multimediatisation’ of, e.g., the WWW and other data collections in everyday<br />
life. The most important problems identified in multimedia-based classification are,<br />
amongst others, the high dimensionality of the multimodal feature space, the unknown<br />
and varying relevance of features and modalities towards the class label, noise<br />
and missing values in the input data, and the semantic gap between low-level<br />
features and high-level semantic meanings.<br />
We are working on a promising way to tackle many of these problems at once:<br />
the calculation and exploitation of feature information interactions for feature selection<br />
and construction in high-dimensional feature spaces towards more efficient<br />
information fusion and hence improved multimedia document classification. This<br />
information-theoretic dependence measure finds the exact, irreducible attribute interactions<br />
in a multivariate feature subset. Its definition is a stable relation because<br />
information interactions are described by the information exclusively shared by this<br />
subset’s variables. For subsets of size N = 2, the interaction reduces to the<br />
well-known mutual information.<br />
For higher-order subsets N > 2, feature information interaction develops<br />
its most important characteristic: it can take positive and negative values. This<br />
allows two different types of feature relationships to be discriminated: (1) synergy,<br />
indicated by positive interactions, and (2) redundancy, indicated by negative ones.<br />
This can be used to treat the features involved in each type of interaction separately,<br />
with the help of specialized feature selection and construction strategies.<br />
With the help of artificial data sets we will show which relationships information<br />
interactions can detect. Classification experiments on real-world data will also<br />
show the superiority of preprocessing based on N-way interactions over the pairwise<br />
dependence measures, such as correlation and mutual information, that are often<br />
used in recent feature selection approaches.<br />
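For three discrete variables the quantity described above is the interaction information; the Python sketch below (ours) uses the convention I(X;Y;Z) = I(X;Y|Z) - I(X;Y), under which synergy (e.g. XOR) comes out positive and redundancy negative (sign conventions differ in the literature):<br />

```python
import math
from collections import Counter

def entropy(cells):
    """Shannon entropy in bits of a Counter of outcome frequencies."""
    n = sum(cells.values())
    return -sum(c / n * math.log2(c / n) for c in cells.values())

def H(*vars_):
    """Joint entropy of one or more aligned discrete sequences."""
    return entropy(Counter(zip(*vars_)))

def interaction_information(x, y, z):
    """I(X;Y;Z) = I(X;Y|Z) - I(X;Y)
    = -H(X) - H(Y) - H(Z) + H(XY) + H(XZ) + H(YZ) - H(XYZ)."""
    return (-H(x) - H(y) - H(z)
            + H(x, y) + H(x, z) + H(y, z) - H(x, y, z))
```

On the XOR relation the value is +1 bit (pure synergy); on three identical binary variables it is -1 bit (pure redundancy).<br />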
Key words: feature selection, multi modal information fusion, multimedia object<br />
classification<br />
− 79 −
Time-Varying Parameters in Brand Choice<br />
Models<br />
Thomas Kneib 1 , Bernhard Baumgartner 2 , and Winfried J. Steiner 3<br />
1 Department of Statistics, University of Munich, Germany<br />
thomas.kneib@stat.uni-muenchen.de<br />
2 Department of Marketing, University of Regensburg, Germany<br />
bernhard.baumgartner@wiwi.uni-regensburg.de<br />
3 Department of Marketing, Technical University of Clausthal, Germany<br />
winfried.steiner@tu-clausthal.de<br />
Abstract. Brand Choice Models are frequently used in marketing research. In most<br />
applications, estimated parameters representing customers’ reactions to, e.g., price<br />
and promotional activities or brand-specific effects are assumed to be constant over<br />
time. Marketing theories as well as experiences of marketing practitioners, however,<br />
suggest the existence of trends and/or short-term fluctuations in brand choice behavior.<br />
For example, price elasticities or preferences for certain brands may change in the<br />
run-up to special events like Christmas or Mother’s Day (e.g., Baumgartner 2003).<br />
In this contribution, we employ multinomial logit models with varying coefficients to<br />
estimate time-varying parameters in brand choice models. Both time-varying preferences<br />
(trends) and time-varying effects of covariates are modeled based on penalised<br />
splines, a flexible yet parsimonious nonparametric smoothing technique (e.g., Eilers<br />
and Marx 1996). The estimation procedure is fully data-driven, determining the flexible<br />
function estimates as well as the corresponding degree of smoothness in a unified<br />
approach (e.g., Kneib 2006). Preliminary results suggest that the model considering<br />
time-variable parameters outperforms models assuming constant parameters in<br />
terms of fit and predictive validity.<br />
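To illustrate the penalty idea behind P-splines in its simplest form, the Python sketch below (our own reduction, not the authors' varying-coefficient model) fits a Whittaker-type smoother: a P-spline with identity basis and the second-order difference penalty of Eilers and Marx (1996):<br />

```python
import numpy as np

def penalized_smooth(y, lam=10.0):
    """Minimise ||y - z||^2 + lam * ||D z||^2, where D forms second
    differences of z; lam controls the degree of smoothness."""
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)   # (n-2, n) second-difference matrix
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)
```

A straight line has zero second differences, so it passes through unchanged; noisy data come out with strictly smaller roughness.<br />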
Key words: Brand Choice, Multinomial logit model, Time-varying effects, Semiparametric<br />
regression, P-splines<br />
References<br />
BAUMGARTNER, B. (2003): Measuring Changes in Brand Choice Behavior.<br />
Schmalenbach Business Review, 55, 242–256.<br />
EILERS, P.H.C. and MARX, B.D. (1996): Flexible Smoothing Using B-Splines and<br />
Penalized Likelihood (with Comments and Rejoinder). Statistical Science, 11(2),<br />
89–121.<br />
KNEIB, T. (2006): Mixed Model Based Inference in Structured Additive Regression.<br />
Dr. Hut-Verlag, München.<br />
− 80 −
Multivariate comparative analysis of stock<br />
exchanges - the European perspective<br />
Julia Koralun-Bereźnicka<br />
Maritime University in Gdynia, Morska 81-87, 81-225 Gdynia, Poland<br />
koral@am.gdynia.pl<br />
Abstract. The aim of the research is to perform a multivariate comparative analysis<br />
of 20 European stock exchanges in order to identify the main similarities between<br />
the objects. The basis of comparison is a set of 48 monthly variables from the period<br />
01.2003–12.2005. The variables are classified into three categories: size of the market,<br />
equity trading and bonds. The paper aims at identifying the clusters of alike<br />
stock exchanges and at finding the characteristic features of each of the distinguished<br />
groups. The obtained categorization to some extent corresponds with the division<br />
of the European Union into ‘new’ and ‘old’ member countries. Cluster analysis,<br />
performed for each quarter separately, also reveals that the classification is fairly<br />
stable over time. The factor analysis, which was carried out to reduce the number of<br />
variables, reveals three major factors behind the data, which are related to the<br />
aforementioned categories of variables.<br />
Key words: stock exchanges, cluster analysis, factor analysis<br />
References<br />
Boillat, P., de Skowronsky, N., Tuchschmid, N. (2002): Cluster analysis: application<br />
to sector indices and empirical validation. Swiss Society for Financial Market<br />
Research, 16, 467–486.<br />
Kearney, C. and Lucey, B. M. (2004): International equity market integration: Theory,<br />
evidence and implications. International Review of Financial Analysis, 13,<br />
571–583.<br />
Kim, S. J., Moshirian, F. and Wu, E. (2005): Dynamic stock market integration driven by<br />
the European Monetary Union: An empirical analysis. Journal of Banking &<br />
Finance, 29, 2475–2502.<br />
Krzanowski, W. J. (1988): Principles of multivariate analysis. Oxford University<br />
Press, Oxford.<br />
Morrison, D. (1967): Multivariate statistical methods. McGraw-Hill, New York.<br />
Pascual, A. G. (2003): Assessing European stock markets (co)integration. Economics<br />
Letters, 78, 197–203.<br />
− 81 −
Strategies of model construction for<br />
the analysis of judgment data<br />
Sabine Krolak-Schwerdt<br />
Faculty of Humanities, Arts and Educational Science, University of Luxembourg<br />
sabine.krolak@uni.lu<br />
Abstract. This paper is concerned with the types of models researchers use to<br />
analyze empirical data in the domain of social judgments and decisions. Examples of<br />
this research domain are organizational or medical expert judgments, court decisions<br />
or judgments in private everyday life.<br />
Models for the analysis of judgment data may be divided into two classes depending<br />
on the criteria they optimize. The first class consists of approaches which<br />
optimize an internal (mathematical) criterion function. The aim is to minimize the<br />
discrepancy of values predicted by the model from obtained data by use of, e.g., a<br />
least squares approach. The second class comprises approaches which incorporate<br />
a substantive underlying theory into the model. These accounts were developed to<br />
satisfy external validity criteria, especially construct validity. Model parameters are<br />
not only formally defined, but they represent specified components of judgments.<br />
Several models from both classes are applied to a number of empirical data sets<br />
and comparatively evaluated as to goodness-of-fit, variance accounted for by the<br />
models and construct validity. Results exhibit considerable differences between the<br />
two model classes in construct validity, but not in internal validity criteria.<br />
It may be concluded that any model for the analysis of judgment data implies<br />
the selection of a formal theory about judgments. Hence, optimizing a mathematical<br />
criterion function does not constitute a theory-free rationale or a neutral tool.<br />
Rather, this approach yields another formal theory about judgments which may not<br />
correspond to substantive theories and, in this respect, may yield artefacts. As a<br />
consequence, models satisfying construct validity seem superior in the domain of<br />
judgments and decisions.<br />
Key words: Models of data analysis, external validity, internal validity, model<br />
comparison<br />
− 82 −
An application of copula functions to market<br />
risk management<br />
Katarzyna Kuziak<br />
Department of Financial Investments and Risk Management<br />
Wroclaw University of Economics<br />
ul. Komandorska 118/120, 53-345 Wroclaw, Poland<br />
katarzyna.kuziak@ae.wroc.pl<br />
Abstract. Modeling dependence is one of the main issues in risk management. From<br />
a risk management point of view, failure to model tail dependence correctly may<br />
cause many problems (under- or overestimation of the risk level). The most popular<br />
approach to modeling dependence between individual risks is based on classical<br />
correlation, but in recent years interest in applying copula functions has grown.<br />
Copula functions, a powerful concept for aggregating risks, were introduced into<br />
finance by Embrechts, McNeil, and Straumann. The aim of this paper is to provide<br />
simple applications illustrating the practical use of copulas for risk management<br />
from a market risk point of view. First, we introduce the copula concept. Then,<br />
some applications of copulas to market risk are given. Two Value at Risk estimation<br />
approaches are compared for a portfolio of risks: one based on the classical<br />
covariance approach and one based on copulas. The criterion for evaluating the<br />
performance of the two approaches is the result of a VaR backtesting procedure.<br />
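The comparison described above can be sketched in a few lines. The portfolio, the<br />
Gaussian copula with Student-t marginals, and all parameter values below are<br />
illustrative assumptions, not the paper's actual setup:<br />

```python
import numpy as np
from scipy.stats import norm, t

rng = np.random.default_rng(42)
w = np.array([0.5, 0.5])                 # portfolio weights (assumed)
sig = np.array([0.01, 0.02])             # daily volatilities (assumed)
rho = 0.6
corr = np.array([[1.0, rho], [rho, 1.0]])
cov = np.outer(sig, sig) * corr

# Approach 1: classical variance-covariance VaR at 99% (zero-mean normal).
var_normal = norm.ppf(0.99) * np.sqrt(w @ cov @ w)

# Approach 2: copula-based VaR by simulation, here a Gaussian copula with
# heavier Student-t(4) marginals (both choices purely illustrative).
z = rng.multivariate_normal(np.zeros(2), corr, size=200_000)
u = norm.cdf(z)                           # the copula sample on [0,1]^2
scale = sig / np.sqrt(4.0 / (4.0 - 2.0))  # match each marginal's st. dev.
x = t.ppf(u, df=4) * scale                # heavy-tailed marginal returns
var_copula = np.quantile(-(x @ w), 0.99)  # 99% quantile of portfolio loss
print(var_normal, var_copula)
```

With heavy-tailed marginals the simulated copula-based VaR typically exceeds the<br />
variance-covariance figure, the kind of underestimation the abstract warns about.<br />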
Key words: financial dependence, copula functions, risk management, market risk,<br />
Value at Risk<br />
References<br />
Cherubini U., Luciano E., Vecchiato W. (2004): Copula Methods in Finance, John<br />
Wiley & Sons, New York.<br />
Embrechts P., Frey R., McNeil A. (2005): Quantitative Risk Management: Concepts,<br />
Techniques, and Tools, Princeton University Press, Princeton.<br />
Embrechts P., Lindskog F., McNeil A. (2001): Modelling dependence with copulas<br />
and applications to risk management, report, ETH Zurich.<br />
Embrechts P., McNeil A., Straumann D. (1999): Correlation and dependence in risk<br />
management: properties and pitfalls. In: Risk Management: Value at Risk and<br />
Beyond (M. Dempster, Ed.) Cambridge University Press, Cambridge, 176-223.<br />
Nelsen R. (1999): An introduction to copulas, Springer Verlag, New York.<br />
− 83 −
Testing preference rankings<br />
Kar Yin Lam 1 , Alex J. Koning 2 , and Philip Hans Franses 2<br />
1 ERIM & Econometric Institute, Erasmus University Rotterdam, The Netherlands<br />
kylam@few.eur.nl<br />
2 Econometric Institute, Erasmus University Rotterdam, The Netherlands<br />
koning@few.eur.nl, franses@few.eur.nl<br />
Abstract. Preference rankings are a common tool in consumer surveys. Such rankings<br />
are easy to perform and the outcomes are easy to understand. In this study<br />
we propose a method to examine whether observed rankings imply statistically<br />
significant differences across the products. If there is statistical evidence of such<br />
differences, the next question is which products are involved. We use multiple<br />
comparison procedures to test which products differ significantly from each other.<br />
Our method addresses the often-encountered practical situation in which consumers<br />
evaluate N products but give preference rankings only for a subset that each<br />
consumer selects, since the literature shows that the task of comparing all N<br />
products can be too demanding. It may also be that the assignment of ranks<br />
itself is problematic. For instance, ties may occur, that is, the consumer is indifferent<br />
between products, and hence two or more products have the same rank. There<br />
may also be missing values, that is, the consumer excludes a certain product in the<br />
consideration set, and thus does not evaluate it. As a consequence the consumer is<br />
not able to assign a rank to this product. The method we propose and analyze in<br />
this paper does not suffer from these drawbacks. We illustrate it for 93 individuals<br />
who rank 10 movies released in 2007 and who indicate preferences for only 4 of these<br />
10 movies.<br />
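A minimal sketch of the setting: partial rankings with ties and missing values,<br />
reduced to pairwise comparisons with a Bonferroni correction. The data and the<br />
simple win-count comparison are invented for illustration and are not the authors'<br />
exact procedure:<br />

```python
import numpy as np
from itertools import combinations

# Six consumers rank a self-selected subset of 4 products
# (1 = best, NaN = not considered, equal values = tie).
R = np.array([
    [1, 2, np.nan, 3],
    [2, 1, 3, np.nan],
    [1, 1, 2, np.nan],      # tie between products 0 and 1
    [np.nan, 1, 2, 3],
    [1, 3, 2, np.nan],
    [2, np.nan, 1, 3],
])

pairs = list(combinations(range(R.shape[1]), 2))
alpha = 0.05 / len(pairs)   # Bonferroni correction across all comparisons
results = {}
for i, j in pairs:
    both = ~np.isnan(R[:, i]) & ~np.isnan(R[:, j])   # shared rankings only
    wins_i = int(np.sum(R[both, i] < R[both, j]))    # lower rank = preferred
    wins_j = int(np.sum(R[both, i] > R[both, j]))    # ties drop out
    results[(i, j)] = (wins_i, wins_j, int(both.sum()))
print(results)
```

Each pair is then tested at level alpha on its win counts, so ties and missing<br />
evaluations simply reduce the effective sample size of that comparison.<br />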
Key words: Rankings, Multiple comparisons, Ties, Missing observations<br />
− 84 −
Bayesian Methods for Graph Clustering<br />
Pierre Latouche, Christophe Ambroise, and Etienne Birmelé<br />
Laboratoire Statistique et Génome (UMR CNRS 8071, INRA 1152, UEVE), La<br />
Genopole Tour Evry 2, 523 place des Terrasses, 91000 Evry, France<br />
firstname.lastname@genopole.cnrs.fr<br />
Abstract. Networks are used in many scientific fields such as biology, social science,<br />
and information technology. They model, with edges, the way objects of<br />
interest, represented by vertices, are related to each other. Looking for clusters<br />
of highly connected vertices, also called communities or modules, has proved to<br />
be a powerful approach to capturing the underlying structure of a network.<br />
Recently, the Erdős-Rényi Mixture model for Graph (ERMG) for community<br />
detection was proposed by Daudin et al. (2006) with an associated algorithm, based<br />
on variational techniques, for maximum likelihood estimation. Given a network, the<br />
number of clusters is estimated and, for every vertex, the algorithm infers the<br />
probability of membership in each cluster.<br />
Following Hofman and Wiggins (2007), we show how the ERMG model can be<br />
described in a full Bayesian framework. Then, we apply two families of approximation<br />
techniques, called Variational Bayes (VB) and Expectation Propagation (EP),<br />
for the inference procedure. Using simulated and real data sets, we compare both<br />
the number and the quality of the estimated clusters obtained with the different<br />
approaches.<br />
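For readers unfamiliar with the ERMG, a graph can be sampled from such a mixture<br />
in a few lines; the cluster proportions and connection probabilities below are<br />
illustrative choices:<br />

```python
import numpy as np

rng = np.random.default_rng(7)
# Each vertex draws a latent cluster; an edge between two vertices appears
# with a probability that depends only on their clusters.
alpha = np.array([0.5, 0.5])              # cluster proportions
Pi = np.array([[0.25, 0.02],
               [0.02, 0.25]])             # connection probabilities
n = 60
Z = rng.choice(len(alpha), size=n, p=alpha)   # latent memberships
P = Pi[Z][:, Z]                                # per-pair edge probabilities
A = (rng.random((n, n)) < P).astype(int)
A = np.triu(A, 1)
A = A + A.T                                    # undirected, no self-loops

within = A[np.ix_(Z == 0, Z == 0)].mean()
between = A[np.ix_(Z == 0, Z == 1)].mean()
print(within, between)
```

Inference (variational EM, VB, or EP) then tries to recover Z, alpha, and Pi from<br />
the adjacency matrix A alone.<br />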
Key words: Graph clustering, Variational Bayes, Expectation Propagation<br />
References<br />
Daudin, J. and Picard, F. and Robin, S. (2006): A Mixture Model for Random<br />
Graphs. Tech. rep, INRIA.<br />
Hofman, J.M. and Wiggins, C.H. (2007): A Bayesian Approach to Network Modularity.<br />
ArXiv e-prints.<br />
Jordan, M. and Ghahramani, Z. and Jaakkola, T. (1998): An introduction to variational<br />
methods for graphical models. In: Jordan, M.: Learning in Graphical<br />
Models. MIT Press.<br />
− 85 −
Fundamental Indexation - testing the concept in<br />
the German stock market<br />
Hermann Locarek-Junge 1 and Max Mihm 1<br />
Lehrstuhl für Finanzwirtschaft und Finanzdienstleistungen,<br />
TU Dresden, D-01062 Dresden, Germany, locarekj@finance.wiwi.tu-dresden.de<br />
Abstract. In Germany, Fundamental Indexation is a rather new concept of portfolio<br />
management, creating portfolios based not on market capitalization but on other<br />
economic figures such as revenues, number of employees, dividends, or book value.<br />
So far the concept has been implemented in only a few mutual investment funds<br />
worldwide. However, backward calculations of portfolios using the concept of<br />
fundamental indexation (CFI) on time series from 1961 to 2004 for stock portfolios<br />
in the US capital market, as well as other studies, show significant above-average<br />
returns and impressive Sharpe ratios for this period (see Arnott/Sautter/Siegel 2007).<br />
Explaining these above-average returns with factor models has not yet been<br />
accomplished in a way that is compatible with traditional capital market theory.<br />
The pros and cons of the approach are debated between scientists and practitioners,<br />
e.g.: “[CFI] are a triumph of marketing, not of new ideas” (Fama 2007), “With the<br />
advent of fundamental indexes we’re at the brink of a huge paradigm shift.<br />
... [They] are the next wave of investing.” (Siegel 2006), and “Fundamental Indexing<br />
is just a new label on old wine.” (Asness 2006).<br />
We use data from 1987 to 2007 in the German stock market and several indexing<br />
concepts to test the CFI for the German market. We create and compare portfolio<br />
clusters of market weighted, equally weighted and fundamentally weighted stocks.<br />
We use Fama’s 3-factor-model to analyze and explain returns and anomalies, and<br />
we question the persistence of investment returns using the CFI.<br />
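The weighting idea can be sketched with made-up numbers; averaging the weights<br />
implied by several fundamentals is one common variant, and real fundamental<br />
indexes differ in the details:<br />

```python
import numpy as np

# Toy illustration of fundamental vs. market-cap weighting for four
# stocks; all figures are invented.
market_cap = np.array([80.0, 40.0, 20.0, 10.0])   # e.g. bn EUR
fundamentals = np.array([                          # rows: revenue, book value, dividends
    [30.0, 35.0, 20.0, 15.0],
    [25.0, 30.0, 25.0, 20.0],
    [20.0, 30.0, 30.0, 20.0],
])

w_cap = market_cap / market_cap.sum()
# CFI-style weight: average the weight a stock would receive under each
# fundamental measure separately.
w_fund = (fundamentals / fundamentals.sum(axis=1, keepdims=True)).mean(axis=0)

returns = np.array([0.05, 0.02, 0.08, 0.10])       # one-period returns (assumed)
print(w_cap @ returns, w_fund @ returns)
```

Note how the fundamentally weighted portfolio trims the largest-capitalization<br />
stock relative to cap weighting, which is the source of the debated return difference.<br />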
Key words: fundamental indexation, market index, portable alpha<br />
References<br />
Arnott, R., Sautter, G., Siegel, J. (2007): Fundamental Indexing Smackdown, in:<br />
Journal of Indexes, Vol. 10, No. 5, pp. 10–15.<br />
− 86 −
Identifying Atypical Cases in Kernel Fisher<br />
Discriminant Analysis by using the Smallest<br />
Enclosing Hypersphere<br />
Nelmarie Louw, Morne Lamont and Sarel Steel<br />
Department of Statistics and Actuarial Science, University of Stellenbosch, Private<br />
Bag X1, 7602 Matieland, South Africa. nlouw@sun.ac.za<br />
Abstract. Kernel methods are fast becoming standard tools for solving classification<br />
and regression problems in statistics. An example of a kernel based classification<br />
method is Kernel Fisher discriminant analysis (KFDA). Conceptually KFDA entails<br />
transforming the data in the input space to a high-dimensional feature space, followed<br />
by linear discriminant analysis (LDA) performed in feature space. Although<br />
the resulting classifier is linear in feature space, it corresponds to a non-linear classifier<br />
in input space. However, as in the case of LDA, the classification performance<br />
of KFDA deteriorates in the presence of atypical data points. Louw et al. (2007)<br />
proposed several criteria for identification of atypical cases in KFDA. In extensive<br />
simulation studies these criteria have been found to be successful, in the sense that<br />
the error rate of the KFD classifier based on the data set after removal of atypical<br />
cases is lower than the error rate of the KFD classifier based on the entire data<br />
set. A disadvantage is that these criteria are calculated on a leave-one-out basis,<br />
which becomes computationally prohibitive when dealing with large data sets. In<br />
this paper we propose a two-step procedure for identifying atypical cases in large<br />
data sets. Firstly, a subset of potentially atypical data cases is found by constructing<br />
the smallest enclosing hypersphere (for each group) in feature space. Secondly, the<br />
proposed criteria are employed to identify atypical cases, but only cases in the subset<br />
are considered on a leave-one-out basis, leading to a substantial reduction in computation<br />
time. We investigate the merit of this new proposal in a simulation study,<br />
and compare the results to the results obtained when not using the hypersphere as<br />
a first step. We conclude that the new proposal has merit.<br />
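The first, cheap screening step can be approximated via the kernel trick. As a<br />
simplified stand-in for the smallest enclosing hypersphere, the sketch below ranks<br />
cases by their feature-space distance to the class centroid; kernel, bandwidth, and<br />
data are all assumed for illustration:<br />

```python
import numpy as np

def rbf(X, Y, gamma=0.5):
    """RBF kernel matrix between the rows of X and the rows of Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
X[0] = [6.0, 6.0]                        # one planted atypical case

K = rbf(X, X)
n = len(X)
# Squared feature-space distance to the class centroid, computed from the
# kernel matrix alone:
# ||phi(x_i) - m||^2 = K_ii - (2/n) sum_j K_ij + (1/n^2) sum_jl K_jl
d2 = np.diag(K) - 2 * K.mean(axis=1) + K.mean()
suspects = np.argsort(d2)[-5:]           # shortlist of potentially atypical cases
print(suspects)
```

Only the shortlisted cases would then enter the expensive leave-one-out step,<br />
which is the computational saving the two-step procedure aims at.<br />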
Key words: Classification, Discriminant Analysis, Kernel Methods<br />
References<br />
Louw, N., Lamont, M.C. and Steel, S.J. (2007): Identification of Influential Cases<br />
in Kernel Fisher Discriminant Analysis. In: P. Mantovan, A. Pastore and S.<br />
Tonellato (Eds.): Complex Models and Computational Intensive Methods for<br />
Estimation and Prediction. CLEUP EDITORE, 296–301.<br />
− 87 −
Latent growth models for analyzing a<br />
multi-partner reward program<br />
Karsten Lübke 1 and Heike Papenhoff 2<br />
1 Customer Intelligence, Karstadt Warenhaus GmbH, Theodor-Althoff-Strasse 2,<br />
45133 Essen karsten.luebke@karstadt.de<br />
2 Ruhr-Universität Bochum, Lehrstuhl für Betriebswirtschaftslehre, insbesondere<br />
Marketing, Universitätsstraße 150, 44780 Bochum<br />
Abstract. In recent years, multi-partner reward programs (MPRPs) have enjoyed a<br />
steady increase in popularity. However, one main advantage of MPRPs has not been<br />
sufficiently researched: participating customers are expected not only to prefer the<br />
focal card-issuing company over its competitors, but also to prefer the other MPRP<br />
partner companies over their respective competitors outside the program. This<br />
so-called extended cross-buying (CB) effect is crucial for suppliers when they<br />
evaluate program participation. As CB is a dynamic process that may change over<br />
time, we apply Latent Growth Models to analyze the effects of an MPRP on cross-buying.<br />
Key words: Latent Growth Models, Structural Equation Modeling, Cross Buying<br />
− 88 −
Applying Statistical Models and Parametric<br />
Distance Measures for Music Similarity Search<br />
Hanna Lukashevich, Christian Dittmar, and Christoph Bastuck<br />
Fraunhofer IDMT, Langewiesener Str. 22, 98693 Ilmenau, Germany<br />
{lkh;dmr;bsk}@idmt.fraunhofer.de<br />
Abstract. Content-based music similarity search implies methods that can be used<br />
for finding music pieces close in perceptual semantic meaning. It is an inherent part<br />
of automatic music recommendation systems and playlist generation. Most state-of-the-art<br />
music similarity techniques use short-term acoustic features. Defining a<br />
similarity measure between two audio signals consisting of multiple feature vector<br />
frames still remains a challenging task. A multitude of related studies propose<br />
the application of parametric statistical models (e.g. Gaussian Mixture Models -<br />
GMMs) in conjunction with suitable model distance measures. This approach has<br />
several advantages: it enables a very compact and informative representation of an<br />
audio signal and it allows similarity estimation based solely on the parameters of<br />
the models. In this paper we concentrate only on those distance measures that do<br />
not require computationally demanding sampling (like Monte Carlo or likelihood ratio<br />
tests). A good example of such a parametric distance measure is the Kullback-Leibler<br />
divergence (KL-divergence), which describes the distance between two single Gaussians.<br />
Unfortunately, the KL-divergence between GMMs is not analytically tractable. In a<br />
recent ICASSP paper, Hershey and Olsen presented several approximations of the<br />
KL-divergence between two GMMs with very promising results. Hélen and Virtanen<br />
proposed a Euclidean distance between GMMs that avoids the KL-divergence.<br />
We present a KL-Euclidean Hybrid distance between GMMs. We compare it to<br />
other state-of-the-art distance measures and show that it significantly outperforms<br />
the others for several features and models. Rather than trying to find the best<br />
theoretical approximation, our focus is on the best performance for the music<br />
similarity task. Besides that, we investigate the influence of the model parameter<br />
estimation on the performance in music similarity search. Here we compare the<br />
performance for several versions of GMMs: a trivial model having just one Gaussian<br />
per music piece, GMMs with a fixed number of Gaussians, and GMMs where the<br />
number of components is estimated using model selection techniques. We also find<br />
promising results using semantic information such as song segmentation. In the<br />
latter case, we model each segment of the song with a single Gaussian and represent<br />
the song as a GMM, with weights depending on the durations of the segments.<br />
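The analytically tractable building block mentioned above, the KL-divergence<br />
between two single Gaussians, has a well-known closed form; the parameters in the<br />
sketch below are illustrative:<br />

```python
import numpy as np

def kl_gauss(mu0, S0, mu1, S1):
    """Closed-form KL( N(mu0,S0) || N(mu1,S1) ); no sampling required."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

# Unit mean shift between two standard 2-d Gaussians: KL = 0.5 * ||mu||^2
mu0, S0 = np.zeros(2), np.eye(2)
mu1, S1 = np.array([1.0, 0.0]), np.eye(2)
print(kl_gauss(mu0, S0, mu1, S1))
```

For mixtures of such Gaussians no closed form exists, which is what motivates the<br />
approximations and the hybrid distance discussed in the paper.<br />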
Key words: music information retrieval, music similarity, Gaussian mixture models,<br />
Kullback-Leibler divergence<br />
− 89 −
Determining the number of components in<br />
mixture models for hierarchical data<br />
Olga Lukociene 1 and Jeroen K. Vermunt 2<br />
1 Tilburg University, PO Box 90153, 5000 LE Tilburg, The Netherlands<br />
o.lukociene@uvt.nl<br />
2 Tilburg University, PO Box 90153, 5000 LE Tilburg, The Netherlands<br />
j.k.vermunt@uvt.nl<br />
Abstract. Recently, various types of mixture models have been developed for data<br />
sets having a hierarchical or multilevel structure (see, e.g., Vermunt 2003, 2007).<br />
Most of these models include finite mixture distributions at multiple levels of a<br />
hierarchical structure. In the case of two levels, there are, for example, mixture<br />
distributions for individuals (lower-level units) and for groups (higher-level units).<br />
In multilevel mixture models, selecting the number of mixture components is more<br />
complex than in standard mixture models because one has to determine the number<br />
of mixture components at multiple levels.<br />
The most popular measure for determining the number of mixture components<br />
is the BIC. A problem in the application of this criterion in the context of multilevel<br />
mixture models is that it contains the sample size as one of its terms. In multilevel<br />
mixture models, it is not clear which sample size should be used in the BIC formula:<br />
the number of groups or the number of individuals, depending on whether one wishes<br />
to determine the number of components at the higher or at the lower level.<br />
In this study we investigate the performance of various model selection methods<br />
in the context of multilevel mixture models. We not only look at BIC with different<br />
definitions of the sample size, but also at AIC and AIC3, as well as at other<br />
criteria such as ICOMP, validation log-likelihood, and LR tests with bootstrapped<br />
p values.<br />
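The ambiguity can be made concrete: with the log-likelihood and parameter count<br />
below (both invented for illustration), the two candidate sample sizes give clearly<br />
different BIC values:<br />

```python
from math import log

# BIC = -2*loglik + k*log(n). In a multilevel mixture it is not obvious
# which n belongs in the formula; the gap below shows how much the
# choice matters.
loglik = -5234.7        # maximized log-likelihood (assumed)
k = 17                  # number of free parameters (assumed)
n_groups, n_individuals = 100, 2500

bic_groups = -2 * loglik + k * log(n_groups)
bic_individuals = -2 * loglik + k * log(n_individuals)
print(bic_groups, bic_individuals)
```

Since the penalty gap is k*log(n_individuals/n_groups), the two definitions can<br />
easily favor different numbers of components.<br />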
Key words: Multilevel mixture models, Hierarchical models, BIC, AIC, AIC3<br />
References<br />
Vermunt, J.K.(2003): Multilevel latent class models. Sociological Methodology, 33,<br />
213-239.<br />
Vermunt, J.K. (2007): A hierarchical mixture model for clustering three-way data<br />
sets. Computational Statistics and Data Analysis, 51, 5368-5376.<br />
− 90 −
Exploring the Interaction Structure of Weblogs<br />
Martin Klaus and Ralf Wagner<br />
SVI Chair for International Direct Marketing<br />
DMCC - Dialog Marketing Competence Center<br />
University of Kassel, Germany<br />
{mklaus,rwagner}@wirtschaft.uni-kassel.de<br />
Abstract. Weblogs, as a medium of the Web 2.0, have fundamentally changed the way<br />
people communicate and have created a new form of social interaction. Users worldwide<br />
make up a huge, permanently growing conversation database covering various<br />
topics (Blood (2002)). An interesting feature of this virtual communication is<br />
the opportunity to reference other blogs by setting hyperlinks between<br />
weblogs in the course of the dialog (Chin & Chignell (2006); Leskovec et al. (2007)).<br />
Weblogs have no standardized document format, and no tags mark them as such.<br />
Thus, it turns out to be challenging to identify and collect weblogs from the web<br />
with a crawler, spider, or bot (Anjewierden, Brussee & Efimova (2004)).<br />
In this study we introduce different approaches to crawl weblogs and try to combine<br />
them. Subsequently we use social network analysis to uncover the structure<br />
between weblogs (Borgatti, Carley & Krackhardt (2006)). This structure provides<br />
us with an assessment of the blogs and their relevance for marketing communication.<br />
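Once the hyperlinks between blogs are collected, a first relevance ranking is<br />
straightforward; the sketch below uses in-degree as a simple centrality on invented<br />
link data (a full social network analysis would add further measures):<br />

```python
from collections import Counter

# Hypothetical hyperlinks harvested by a crawler: (source blog, linked blog).
links = [
    ("blogA", "blogB"), ("blogA", "blogC"), ("blogD", "blogB"),
    ("blogE", "blogB"), ("blogC", "blogB"), ("blogE", "blogC"),
]

# In-degree centrality: blogs that many others link to are candidates
# for high relevance in marketing communication.
in_degree = Counter(dst for _, dst in links)
ranking = sorted(in_degree.items(), key=lambda kv: -kv[1])
print(ranking)
```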
Key words: Marketing Communication, Social Network Analysis, Web Mining,<br />
Weblog<br />
References<br />
Anjewierden, A., Brussee, R., and Efimova, L. (2004): Shared Conceptualizations in<br />
Weblogs. In: T.N. Burg (Ed.) BlogTalk 2.0, Vienna.<br />
Blood, R. (2002): We’ve Got Blog: How Weblogs are Changing our Culture. Perseus,<br />
Cambridge.<br />
Borgatti, S.P., Carley, K.M., and Krackhardt, D. (2006): Robustness of Centrality<br />
Measures Under Conditions of Imperfect Data. Social Networks, 28, 234–236.<br />
Chin, A. and Chignell, M. (2006): Finding Evidence of Community from Blogging<br />
Co-citations: A Social Network Analytic Approach. In: Proceedings of 3rd<br />
IADIS International Conference Web Based Communities 2006. San Sebastian,<br />
Spain, 191–200.<br />
Leskovec, J., McGlohon, M., Faloutsos, C., Glance, N.S., and Hurst, M. (2007):<br />
Cascading Behavior in Large Blog Graphs. In: SDM ’07: SIAM Conference on<br />
Data Mining.<br />
− 91 −
ChIPmix: Mixture model of regressions for<br />
ChIP-chip experiment analysis<br />
Marie-Laure Martin-Magniette 1,2 and Tristan Mary-Huard 1 and Caroline<br />
Bérard 1,2 and Stéphane Robin 1<br />
1 UMR AgroParisTech/INRA MIA 518<br />
2 URGV UMR INRA/CNRS/UEVE<br />
Abstract. The chromatin immunoprecipitation on chip (ChIP on chip) technology<br />
is used to investigate proteins associated with DNA by hybridization to a microarray.<br />
In a two-color ChIP-chip experiment, two samples are compared: DNA fragments<br />
crosslinked to a protein of interest (IP), and genomic DNA (Input). The two samples<br />
are differentially labeled and then co-hybridized on a single array. The goal is<br />
then to identify actual binding targets of the protein of interest, i.e. probes whose<br />
IP intensity is significantly larger than the Input intensity.<br />
We propose a new method called ChIPmix to analyse ChIP-chip data based on<br />
mixture model of regressions. Let (xi, Yi) be the Input and IP intensities of probe i,<br />
respectively. The (unknown) status of the probe is characterized through a label Zi<br />
which is 1 if the probe is enriched and 0 if it is normal (not enriched). We assume<br />
the Input-IP relationship to be:<br />
Yi = a0 + b0 xi + ɛi   if Zi = 0 (normal),<br />
Yi = a1 + b1 xi + ɛi   if Zi = 1 (enriched),<br />
where ɛi is a Gaussian random variable with mean 0 and variance σ^2. The marginal<br />
distribution of Yi for a given level of Input xi is<br />
(1 − π) φ0(Yi|xi) + π φ1(Yi|xi), (1)<br />
where π is the proportion of enriched probes, and φj(·|x) stands for the probability<br />
density function of a Gaussian distribution with mean aj + bj x and variance σ^2.<br />
The mixture parameters (proportion, intercepts, slopes and variance) are estimated<br />
using the EM algorithm. Posterior probabilities are used to classify probes into the<br />
normal or enriched class. In hypothesis testing theory, false discoveries are<br />
controlled by bounding the probability of wrongly rejecting the null hypothesis.<br />
We propose an analogous concept in the mixture model framework. Our aim is to<br />
control the probability for a probe to be wrongly assigned to the enriched class.<br />
Therefore we control Pr{τi > s | xi, Zi = 0} = α for a predefined level α.<br />
We present several applications of ChIPmix to promoter DNA methylation and<br />
histone modification data and show that ChIPmix competes with classical methods<br />
such as NimbleGen and ChIPOTle.<br />
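The model above can be fitted with a standard EM algorithm for mixtures of<br />
regressions. The sketch below simulates probes from two assumed regression lines<br />
and recovers the parameters; it follows the model description, not necessarily the<br />
authors' implementation:<br />

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated probes (true values assumed): normal probes follow
# y = 0.5 + 0.9 x, enriched probes y = 1.0 + 1.2 x, sigma = 0.3.
n, pi_true = 2000, 0.2
x = rng.uniform(6, 12, n)
z_true = rng.random(n) < pi_true
y = np.where(z_true, 1.0 + 1.2 * x, 0.5 + 0.9 * x) + rng.normal(0, 0.3, n)

# EM for a two-component mixture of regressions with common variance.
X = np.column_stack([np.ones(n), x])
pi, coef, s2 = 0.5, np.array([[0.0, 1.0], [2.0, 1.0]]), 1.0
for _ in range(100):
    # E step: tau_i = posterior probability of being enriched; the common
    # Gaussian normalizing constant cancels in the ratio.
    resid = y[None, :] - coef @ X.T
    dens = np.exp(-resid ** 2 / (2 * s2))
    w = np.array([1 - pi, pi])[:, None] * dens
    tau = w[1] / w.sum(axis=0)
    # M step: weighted least squares per component, then pi and sigma^2.
    for j, g in enumerate([1 - tau, tau]):
        Xg = X * g[:, None]
        coef[j] = np.linalg.solve(Xg.T @ X, Xg.T @ y)
    resid = y[None, :] - coef @ X.T
    s2 = np.sum([g * r ** 2 for g, r in zip([1 - tau, tau], resid)]) / n
    pi = tau.mean()

print(pi, coef)
```

Probes with τ above a threshold s are then called enriched, with s chosen to control<br />
the misassignment probability Pr{τi > s | Zi = 0} described above.<br />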
Key words: Classification, Mixture models, ChIP-chip<br />
− 92 −
Clustering of High-Dimensional Data Via<br />
Finite Mixture Models<br />
Geoff McLachlan<br />
Department of Mathematics & Institute for Molecular Bioscience<br />
University of Queensland<br />
Summary. There has been a proliferation of applications in which the number<br />
of experimental units n is comparatively small but the underlying dimension p<br />
is extremely large, as, for example, in microarray-based genomics and other high-throughput<br />
experimental approaches. Hence there has been increasing attention<br />
given not only in bioinformatics and machine learning, but also in mainstream statistics,<br />
to the analysis of complex data in this situation where n is small relative to p.<br />
In this talk, we focus on the clustering of high-dimensional (continuous) data, using<br />
normal mixture models. Their use in this context is not straightforward, as the normal<br />
mixture model is a highly parameterized one with each component-covariance<br />
matrix consisting of p(p + 1)/2 distinct parameters in the unrestricted case. Hence<br />
some restrictions must be imposed and/or a variable selection method applied beforehand.<br />
We shall review the existing literature and consider some new approaches<br />
that have been proposed recently.<br />
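The parameter count that makes the unrestricted model infeasible is easy to<br />
tabulate; the sketch below contrasts full with diagonal component covariances (one<br />
possible restriction) for illustrative g and p:<br />

```python
# Free parameters in a g-component normal mixture on p variables:
# mixing weights (g-1) + means g*p + covariance parameters.
def n_params(g, p, covariance="full"):
    per_cov = p * (p + 1) // 2 if covariance == "full" else p
    return (g - 1) + g * p + g * per_cov

print(n_params(3, 10_000))              # unrestricted: ~150 million
print(n_params(3, 10_000, "diagonal"))  # restricted: ~60 thousand
```

With n in the tens or hundreds, the unrestricted count dwarfs the sample size,<br />
which is why restrictions or prior variable selection are unavoidable.<br />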
− 93 −
Majority-rule consensus: from preferences<br />
(social choice) to trees (biology and<br />
classification theory)<br />
F.R. McMorris<br />
Professor of Applied Mathematics<br />
Professor of Computer Science<br />
Illinois Institute of Technology<br />
Chicago, IL 60616, USA<br />
mcmorris@iit.edu<br />
Abstract: The problem of aggregating the individual preferences of a group<br />
of “voters” into a group consensus preference has been studied for many<br />
years. Indeed, mathematical investigations of consensus problems go back<br />
to the contributions of Borda (1784), of Condorcet (1785), and of Pareto<br />
(1896) and are still frequently cited today. One method, the compelling<br />
majority-rule consensus, is so simple (stick something in the output if it is in<br />
more than half of the input) that it seems nothing really interesting can be<br />
said about it. This presentation will give some historic background from the<br />
classical preference case (e.g., voting), and then point out some new and old<br />
mathematical and computational complexity results pertaining to the use of<br />
the majority-rule paradigm for finding consensus phylogenetic trees (biology)<br />
and classification structures (data analysis).<br />
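The rule itself fits in a few lines; in this toy sketch each input ("voter" or tree) is<br />
represented by the set of clusters it contains, and the data are invented:<br />

```python
from collections import Counter

# Majority rule: keep every cluster that occurs in more than half of the inputs.
inputs = [
    {frozenset("ab"), frozenset("abc")},
    {frozenset("ab"), frozenset("cd")},
    {frozenset("ab"), frozenset("abc")},
]

counts = Counter(cl for voter in inputs for cl in voter)
majority = {cl for cl, c in counts.items() if c > len(inputs) / 2}
print(majority)
```

A classical result for trees is that the majority clusters are pairwise compatible,<br />
so they always assemble into a consensus tree.<br />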
− 94 −
Optimization Methods with Evolutionary<br />
Algorithms and Artificial Neural Networks<br />
Rene Meier and Franz Joos<br />
Helmut-Schmidt-University<br />
University of the Federal Armed Forces Hamburg<br />
Power Engineering<br />
Laboratory of Turbomachinery<br />
Abstract. In order to optimize turbomachinery components it is necessary to<br />
describe the behaviour of multimodal objective functions (OF). But it is time-consuming<br />
to evaluate the characteristics of these OF with a three-dimensional<br />
Navier-Stokes solver. Instead, an Artificial Neural Network (ANN) is used as an<br />
interpolator, based on information contained in a database, to correlate the performance<br />
to the geometrical parameters as a compressible three-dimensional<br />
Reynolds-averaged Navier-Stokes solver would do. With a computerized optimization system<br />
an existing centrifugal impeller will be redesigned using an Evolutionary Algorithm<br />
(EA) and an ANN. The ANN allows the evaluation of the OF for many geometries<br />
generated by the EA with less effort than a Navier-Stokes solver. Yet sometimes the<br />
prediction is not accurate and must be verified by means of the more accurate but<br />
time-consuming Navier-Stokes solver. The results of this verification are added to<br />
the database, and a new optimization cycle is started with the expectation that<br />
learning on a larger database will result in a more accurate ANN.<br />
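The optimization cycle can be sketched generically. Below, a cheap test function<br />
stands in for the Navier-Stokes solver, nearest-neighbour interpolation over the<br />
database stands in for the ANN, and random candidates stand in for the EA; only<br />
the loop structure reflects the approach described above:<br />

```python
import numpy as np

rng = np.random.default_rng(5)

def expensive_of(x):
    """Stand-in for the expensive solver evaluation (assumed objective)."""
    return np.sum((x - 0.3) ** 2, axis=-1)

db_x = rng.random((20, 3))          # initial geometry database
db_y = expensive_of(db_x)

for cycle in range(5):
    pop = rng.random((200, 3))                       # candidate geometries
    d = np.linalg.norm(pop[:, None] - db_x[None], axis=-1)
    surrogate_y = db_y[d.argmin(axis=1)]             # cheap surrogate prediction
    best = pop[surrogate_y.argmin()]                 # most promising candidate
    y_true = expensive_of(best)                      # verify with expensive solver
    db_x = np.vstack([db_x, best])                   # grow the database ...
    db_y = np.append(db_y, y_true)                   # ... and retrain next cycle

print(db_y.min())
```

Each cycle spends exactly one expensive evaluation, exactly as in the described<br />
verify-and-extend strategy.<br />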
− 95 −
Finding Music Fads by Clustering Online Radio<br />
Data with Emergent Self-Organizing Maps<br />
Florian Meyer and Alfred Ultsch<br />
Databionics Research Group<br />
University of Marburg, Germany<br />
{meyer,ultsch}@informatik.uni-marburg.de<br />
Abstract. Music charts provide a simple statistic of records sold. Due to Web 2.0<br />
and its social networks, detailed information from listeners is available. In particular,<br />
there are user-generated keywords, so-called tags, that group songs into genres. An<br />
important topic for the music industry is music fads, i.e., short time intervals of a<br />
few weeks with a strong persistence of similar music. A distance measure on weekly<br />
music charts and tags is used. The sequence of music charts is visualized using<br />
Emergent Self-Organizing Maps (ESOM). Fads are automatically found by clustering<br />
the charts with the U*C clustering algorithm on the ESOM. U*C does not need an<br />
estimate of the number of clusters. Machine-learned decision rules describe fads<br />
using the dominant genres.<br />
Key words: ESOM, U*C, Clustering, Tagged Music, Knowledge Representation<br />
− 96 −
Deviant box and dual clusters for the analysis<br />
of conceptual contexts<br />
Boris Mirkin<br />
School of Computer Science and Information Systems<br />
Birkbeck University of London, Malet street, London, WC1E 7HX, UK<br />
mirkin@dcs.bbk.ac.uk<br />
Summary. This work relates to the frameworks of biclustering (Madeira and<br />
Oliveira 2004, Mirkin 1996) and formal concept analysis (Ganter and Wille 1999).<br />
A formal concept over a 1/0 rectangular matrix r, whose row-set is I and column-set<br />
is J, is a maximal pair (V, W) such that V ⊂ I and W ⊂ J and all r-elements<br />
within V × W are unities. The lattices of formal concepts have found interesting applications<br />
in such areas as association between itemsets and post-processing of web-search<br />
results. However, in many applications the notion of formal concept seems overly<br />
rigid because it does not allow any errors or peculiarities in 1/0 encoding (Pensa<br />
and Boullicaut 2005). This is why we take the concept of a data-approximating box<br />
V × W (Mirkin, Arabie and Hubert 1995) and use it in the framework of a disjunctive<br />
biclustering model approximating the data matrix r.<br />
We develop a method, Box(a), for fitting the model with possibly overlapping<br />
boxes by using a local search algorithm for finding an optimal box starting from a<br />
pre-specified row or column a and using a parameter b shifting the values of r to<br />
r − b. It is proven that the method leads to highly deviant boxes, which is accounted<br />
for by a variance measure.<br />
We further proceed to develop a dual clustering framework by multiplying the<br />
original model equation by its transposed version both on the right and on the<br />
left. The two equations lead to disjunctive decompositions of similarity matrices,<br />
(r − b) ∗ (r − b) ′ and (r − b) ′ ∗ (r − b) over clusters on row set I and column set<br />
J, respectively. This dual clustering framework formalizes the notion that good<br />
concepts should relate only such row and column sets that are similarity clusters<br />
on their own. A local search method for simultaneously fitting the dual clustering<br />
models, Dual(i, j), is developed using an evolutionary algorithm for optimizing<br />
the common intensity value of the clusters.<br />
Results of experiments on generated and real data sets are reported, supporting<br />
the effectiveness of the algorithms.<br />
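The underlying notion of a formal concept can be computed directly on a small<br />
context via the closure operator; the matrix and starting column set below are<br />
invented:<br />

```python
import numpy as np

# A small 1/0 context: rows = objects, columns = attributes.
r = np.array([[1, 1, 0, 1],
              [1, 1, 0, 0],
              [0, 1, 1, 0]])

def concept_from_cols(W):
    """Close a column set W: take all rows with ones on W (the extent),
    then all columns with ones on those rows (the closed intent)."""
    V = np.where(r[:, W].all(axis=1))[0]
    W_closed = np.where(r[V].all(axis=0))[0]
    return set(V), set(W_closed)

print(concept_from_cols([0, 1]))
```

A box-based method relaxes exactly this all-ones requirement, tolerating errors and<br />
peculiarities in the 0/1 encoding.<br />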
Key words: Formal concept, Biclustering, Dual clustering, Scale shift<br />
− 97 −
Clustering a Contingency Table Accompanied<br />
by Visualization<br />
Hans-Joachim Mucha<br />
Weierstraß-Institut für Angewandte Analysis und Stochastik (WIAS),<br />
D-10117 Berlin, Germany, mucha@wias-berlin.de<br />
Abstract. Clustering techniques can be used for segmenting a heterogeneous two-way<br />
contingency table into smaller, homogeneous parts. Following the paper of<br />
Greenacre (1988), the focus here is on chi-square decompositions of the Pearson<br />
chi-square statistic by clustering the rows and/or the columns of a contingency table.<br />
Especially the hierarchical Ward method as well as a generalization of Ward’s method<br />
will be considered. The latter can find clusters of different volume. Additionally, one<br />
can show that it is also possible to carry out partitional cluster analysis by starting<br />
from pairwise chi-square distances. Partitional clustering techniques optimize<br />
some numerical criterion with respect to a fixed number of clusters K. Often<br />
partitional cluster analysis attains better solutions than hierarchical cluster analysis.<br />
In any case, correspondence analysis is the appropriate visualization tool for<br />
both the contingency table and the clusters of the rows and/or columns. Moreover,<br />
the correspondence analysis plots become more informative through an additional<br />
projection of a dendrogram, which shows the hierarchy of clusters. An application<br />
from the field of ecology illustrates the segmentation of a contingency table<br />
using different cluster analysis techniques.<br />
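As a hedged illustration of the chi-square geometry this abstract builds on (not the author’s own implementation — the function name and NumPy sketch below are assumptions), the pairwise chi-square distances between the row profiles of a contingency table, which can serve as input to hierarchical or partitional clustering, could be computed like this:<br />

```python
import numpy as np

def chi2_row_distances(N):
    """Pairwise chi-square distances between the row profiles
    of a two-way contingency table N of nonnegative counts."""
    P = N / N.sum()              # correspondence matrix
    r = P.sum(axis=1)            # row masses
    c = P.sum(axis=0)            # column masses
    profiles = P / r[:, None]    # row profiles (each row sums to 1)
    # the squared chi-square distance weights each column by 1 / column mass
    diff = profiles[:, None, :] - profiles[None, :, :]
    return np.sqrt((diff ** 2 / c).sum(axis=2))
```

Rows with identical profiles get distance zero, so such a matrix could feed a Ward-type hierarchical method or a partitional method starting from pairwise chi-square distances, as described above.<br />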
Key words: chi-square distance, hierarchical clustering, partitional clustering, correspondence<br />
analysis, dendrogram<br />
References<br />
Greenacre, M. J. (1988): Clustering the Rows and Columns of a Contingency Table.<br />
Journal of Classification 5, 39–51.<br />
− 98 −
Predictive classification trees<br />
Ulrich Müller-Funk and Stephan Dlugosz<br />
Institut für Wirtschaftsinformatik<br />
University of Münster<br />
Germany<br />
Abstract. Tree-based algorithms for classification and regression are highly popular<br />
because they give rise to results that are easy to interpret and to communicate.<br />
(Some people argue, moreover, that factor selection comes along automatically. This<br />
point, too, will be challenged in the paper.) CART and (exhaustive) CHAID figure<br />
prominently among the procedures actually used in data-based management.<br />
CART is a well-established, nonlinear and nonparametric procedure that produces<br />
binary trees. CHAID, in contrast, admits multiple splits, a feature that allows one<br />
to exploit the splitting variable more extensively. On the other hand, that procedure<br />
depends on premises that are questionable in practical applications. This can be<br />
attributed to the fact that CHAID relies on simultaneous chi-square or F-tests.<br />
Both types of procedures – as implemented in SPSS, for instance – do not take into<br />
account ordinal dependent variables. In the paper we suggest a tree-algorithm that<br />
• requires categorical variables,<br />
• chooses splitting attributes by means of predictive measures of association,<br />
• determines the cells to be united, and hence the number of splits,<br />
with the help of their conditional predictive power,<br />
• takes ordinal dependent variables into consideration.<br />
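One classical "predictive measure of association" of the kind the list above refers to is the Goodman–Kruskal lambda; the sketch below is an illustrative assumption, not the authors’ actual splitting criterion:<br />

```python
from collections import Counter

def goodman_kruskal_lambda(pairs):
    """Goodman-Kruskal lambda: proportional reduction in error when
    predicting y from x, given (x, y) observations."""
    y_counts = Counter(y for _, y in pairs)
    baseline_error = len(pairs) - max(y_counts.values())
    by_x = {}
    for x, y in pairs:
        by_x.setdefault(x, Counter())[y] += 1
    # prediction error when the modal y within each x-cell is predicted
    error_given_x = sum(sum(c.values()) - max(c.values()) for c in by_x.values())
    if baseline_error == 0:
        return 0.0
    return (baseline_error - error_given_x) / baseline_error
```

Lambda is 1 when x predicts y perfectly and 0 when knowing x does not improve the modal prediction at all, which makes it a natural candidate for ranking splitting attributes.<br />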
− 99 −
Efficient Media Exploitation towards Collective<br />
Intelligence<br />
Phivos Mylonas 1 , Vassilios Solachidis 2 , Andreas Geyer-Schulz 3 , Bettina<br />
Hoser 3 , Sam Chapman 4 , Fabio Ciravegna 4 , Steffen Staab 5 , Pavel Smrz 6 ,<br />
Yiannis Kompatsiaris 2 , and Yannis Avrithis 1<br />
1 National Technical University of Athens, Image, Video and Multimedia Systems<br />
Laboratory, Iroon Polytechneiou 9, Zographou Campus, Athens, GR 157 80,<br />
Greece, {fmylonas, iavr}@image.ntua.gr<br />
2 Centre of Research and Technology Hellas, Informatics and Telematics Institute,<br />
1st Km Thermi-Panorama Road, Thermi-Thessaloniki, GR 570 01, Greece,<br />
{vsol, ikom}@iti.gr<br />
3 Department of Economics and Business Engineering, Information Service and<br />
Electronic Markets, Kaiserstraße 12, Karlsruhe 76128, Germany<br />
{andreas.geyer-schulz, bettina.hoser}@kit.edu<br />
4 University of Sheffield, Department of Computer Science, Regent Court, 211<br />
Portobello Street, S1 4DP, Sheffield, UK {s.chapman, fabio}@dcs.shef.ac.uk<br />
5 Universität Koblenz-Landau, Information Systems and Semantic Web,<br />
Universitätsstraße 1, 57070 Koblenz, Germany, staab@uni-koblenz.de<br />
6 Brno University of Technology, Faculty of Information Technology, Bozetechova<br />
2, CZ-61266 Brno, Czech Republic smrz@fit.vutbr.cz<br />
Abstract. In this work we propose intelligent, automated content analysis techniques<br />
for different media to extract knowledge from the multimedia content. Information<br />
derived from different sources/modalities will be analyzed and fused, in<br />
terms of spatiotemporal, personal and even social contextual information. In order<br />
to achieve this goal, semantic analysis will be applied to the content items, taking<br />
into account the content itself (e.g. text, images and video), as well as existing<br />
personal, social and contextual information (e.g. semantic and machine-processable<br />
metadata and tags). The above process exploits the so-called “Media Intelligence”<br />
towards the ultimate goal of identifying “Collective Intelligence”, emerging from<br />
the collaboration and competition among people, empowering innovative services<br />
and user interactions. The utilization of “Media Intelligence” constitutes a departure<br />
from traditional methods for information sharing, since semantic multimedia<br />
analysis has to fuse information from both the content itself and the social context,<br />
while at the same time the social dynamics have to be taken into account. Such<br />
intelligence provides added-value to the available multimedia content and renders<br />
existing procedures and research efforts more efficient.<br />
− 100 −
Support Vector Machines in the Dual using<br />
Majorization and Kernels<br />
Georgi Nalbantov 1 , Patrick J.F. Groenen 2 , and Cor Bioch 3<br />
1 MICC, Maastricht University and<br />
Econometric Institute, Erasmus University Rotterdam, The Netherlands<br />
nalbantov@few.eur.nl<br />
2 groenen@few.eur.nl<br />
3 bioch@few.eur.nl<br />
Abstract. Recently, Support Vector Machines (SVMs) have proved to be a quite<br />
successful method for classification. From a practical point of view, one bottleneck of<br />
this approach is that existing solvers are rather slow. Usually, SVM<br />
solvers use specialized iterative optimization algorithms to solve the SVM optimization<br />
problem, and these algorithms are quite slow, especially in the so-called dual SVM formulation.<br />
Here, we propose to use another iterative method, which is a majorization method.<br />
It has already been applied successfully for solving the primal SVM formulation (see<br />
Groenen, Nalbantov, and Bioch, 2007, 2008). The contribution of this paper is to<br />
extend it to the dual formulation. This opens the door for, first of all, using different<br />
so-called kernel functions, which allow for nonlinear decision functions, and second,<br />
for handling more efficiently linear problems where the number of input variables is<br />
bigger than the number of observations.<br />
Key words: Support vector machines, Iterative majorization, Binary classification<br />
problem, Kernels<br />
References<br />
Groenen, P.J.F., Nalbantov, G., and Bioch, J.C. (2007): Nonlinear support vector<br />
machines through iterative majorization and I-splines. In: R.Decker, H-.J. Lenz<br />
(Eds.): Advances in data analysis. Springer, Berlin, 149–162.<br />
Groenen, P.J.F., Nalbantov, G., and Bioch, J.C. (2008, in press): SVM-Maj: A<br />
Majorization Approach to Linear Support Vector Machines with Different Hinge<br />
Errors. Advances in Data Analysis and Classification.<br />
− 101 −
Approach for Dynamic Problems in Clustering<br />
Anneke Neumann, Klaus Ambrosi, and Felix Hahne<br />
Institut für Betriebswirtschaft und Wirtschaftsinformatik<br />
Stiftung Universität Hildesheim<br />
{aneumann,ambrosi,hahne}@bwl.uni-hildesheim.de<br />
Abstract. In cluster analysis, a variety of methods has been developed for different<br />
areas of application (e.g. economics, biology, medicine, psychology), some of<br />
which were implemented in data evaluation software packages (e.g. SPSS, SAS). In<br />
many scenarios, particularly economic ones, special methods are required in order<br />
to analyze the development of clusters over time. While there are such methodical<br />
extensions for factor analysis and multidimensional scaling, hardly any dynamic<br />
approaches exist in the field of cluster analysis.<br />
In this talk, special attention will be paid to dynamic fuzzy clustering problems.<br />
Known approaches will be reviewed critically concerning their applicability in dynamic<br />
problems, and a new fuzzy clustering approach will be introduced which can<br />
be applied to dynamic problems.<br />
Key words: Clustering, Fuzzy Clustering, Dynamic Data Analysis<br />
References<br />
Basford, K.E. and McLachlan, G.J. (1985): The Mixture Method of Clustering Applied<br />
to Three-Way Data. Journal of Classification, 2, 109–125.<br />
Höppner, F., Klawonn, F., Kruse, R., and Runkler, T. (1999): Fuzzy Cluster Analysis.<br />
Wiley, Chichester, New York.<br />
Joentgen, A., Mikenina, L., Weber, B., and Zimmermann, H.-J. (1999): Dynamic<br />
fuzzy data analysis based on similarity between functions. Fuzzy Sets and Systems,<br />
105, 81–90.<br />
Tucker, L.R. (1966): Some mathematical notes on three-mode factor analysis. Psychometrika,<br />
31, 279–311.<br />
− 102 −
Robust fitting of mixtures: The approach based<br />
on the Trimmed Likelihood Estimator<br />
Neyko Neykov 1 , Peter Filzmoser 2 , and Plamen Neytchev 1<br />
1 National Institute of Meteorology and Hydrology, Bulgarian Academy of<br />
Sciences, Sofia, Bulgaria, {Neyko.Neykov}{Plamen.Neytchev}@meteo.bg<br />
2 Department of Statistics and Probability Theory, Vienna University of<br />
Technology, Austria, P.Filzmoser@tuwien.ac.at<br />
Abstract. The Maximum Likelihood Estimator (MLE) has commonly been used to<br />
estimate the unknown parameters in a finite mixture of distributions. However, the<br />
MLE can be very sensitive to outliers in the data. In order to overcome this problem,<br />
Neykov et al. (2007) adapted the trimmed likelihood methodology developed by<br />
Vandev and Neykov (1998) and Neykov and Müller (2003) to estimate mixtures<br />
in a robust way. The superiority of this approach in comparison with the MLE<br />
is illustrated by examples and simulation studies. The behavior of the widely used<br />
classical criteria for the assessment of the number of components in a mixture model<br />
and their robustified versions is also studied in the presence of outliers.<br />
Key words: Trimmed likelihood estimator, Finite mixtures of distributions<br />
References<br />
Maronna, R., Martin, R.D. and Yohai, V.J. (2006): Robust Statistics: Theory and<br />
Methods. Wiley, New York.<br />
McLachlan, G.J. and Peel, D. (2000): Finite Mixture Models. Wiley, New York.<br />
Neykov, N. and Müller, C. (2003): Breakdown Point and Computation of Trimmed<br />
Likelihood Estimators in GLMs. In: R. Dutter et al., (eds), Developments in<br />
robust statistics, pp. 277–286, Physica Verlag, Heidelberg.<br />
Neykov, N.M., Filzmoser, P., Dimova, R. and Neytchev, P.N. (2004): Mixture of<br />
Generalized Linear Models and the Trimmed Likelihood Methodology. In: J.<br />
Antoch (Ed.): Proceedings in Computational Statistics. Physica-Verlag, Heidelberg,<br />
1585–1592.<br />
Neykov, N., Filzmoser, P., Dimova, R. and Neytchev, P. (2007): Robust Fitting of<br />
Mixtures Using the Trimmed Likelihood Estimator. Computational Statistics &<br />
Data Analysis, 17(3), 299–308.<br />
Vandev, D.L. and Neykov, N.M. (1998): About Regression Estimators with High<br />
Breakdown Point. Statistics, 32, 111–129.<br />
− 103 −
Cluster Tree Estimation using a Generalized<br />
Single Linkage Method<br />
Rebecca Nugent 1 and Werner Stuetzle 2<br />
1 Department of Statistics, Carnegie Mellon University, Baker Hall, Pittsburgh,<br />
PA 15213, USA. rnugent@stat.cmu.edu<br />
2 Department of Statistics, University of Washington, Box 354322, Seattle, WA<br />
98195, USA wxs@stat.washington.edu<br />
Abstract. The goal of clustering is to detect the presence of distinct groups in a<br />
data set and assign group labels to the observations. In nonparametric clustering,<br />
we regard the observations as a sample from an underlying density and assume that<br />
groups correspond to modes of this density. The goal then is to find the modes<br />
and assign each observation to the domain of attraction of a mode. The (possibly<br />
hierarchical) modal structure of a density is summarized by its cluster tree; modes<br />
of the density correspond to leaves in the cluster tree. Estimating this cluster tree<br />
is the fundamental goal of nonparametric cluster analysis.<br />
We adopt a plug-in approach: estimate the cluster tree of the underlying density<br />
by the cluster tree of a density estimate. For density estimates that are piecewise<br />
constant (and so have computationally tractable level sets), the cluster tree can<br />
be computed exactly. However, for other density estimates, particularly in high dimensions,<br />
we have to be content with an approximation. We present a graph-based<br />
method that approximates the cluster tree for any density estimate and includes<br />
the introduction of a density-based similarity measure between observations. After<br />
motivating the method, we show results that allow us to reduce the graph to a<br />
spanning tree and then sketch an algorithm that allows the exact computation of the<br />
spanning tree whose edge weights are not of closed form. We point out mathematical<br />
and algorithmic similarities to single linkage clustering and illustrate our approach<br />
on several examples.<br />
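The similarity to single linkage mentioned above can be made concrete: single-linkage merge heights coincide with the sorted edge weights of a Euclidean minimum spanning tree. The Prim-style sketch below is an illustrative assumption, not the authors’ graph-based algorithm:<br />

```python
import math

def mst_edges(points):
    """Prim's algorithm: edges (i, j, weight) of the Euclidean minimum
    spanning tree; sorted weights equal single-linkage merge heights."""
    n = len(points)
    in_tree = {0}
    # best[k] = (distance to the tree, nearest tree vertex)
    best = {i: (math.dist(points[0], points[i]), 0) for i in range(1, n)}
    edges = []
    while len(in_tree) < n:
        j = min(best, key=lambda k: best[k][0])
        d, parent = best.pop(j)
        edges.append((parent, j, d))
        in_tree.add(j)
        for k in best:
            dk = math.dist(points[j], points[k])
            if dk < best[k][0]:
                best[k] = (dk, j)
    return sorted(edges, key=lambda e: e[2])
```

Cutting all MST edges longer than a threshold yields exactly the single-linkage clusters at that height, which is the structural link the abstract exploits when reducing the graph to a spanning tree.<br />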
Key words: cluster analysis, single linkage clustering, level sets, minimum density<br />
similarity measure, nearest neighbor density estimation<br />
References<br />
Stuetzle, W. and Nugent, R. (2007): A generalized single linkage method for estimating<br />
the cluster tree of a density. Technical Report 514, Department of Statistics,<br />
University of Washington.<br />
− 104 −
Multi-Class Extension of Verifiable Ensemble<br />
Models for Safety-Related Applications<br />
Sebastian Nusser 1,2 , Clemens Otte 1 , and Werner Hauptmann 1<br />
1 Siemens AG, Corporate Technology, Otto-Hahn-Ring 6, 81730 Munich, Germany,<br />
{sebastian.nusser.ext,clemens.otte,werner.hauptmann}@siemens.com<br />
2 School of Computer Science, Otto-von-Guericke-University of Magdeburg,<br />
Universitätsplatz 2, 39106 Magdeburg, Germany<br />
Abstract. For safety-related applications, models learned from data must be verifiable<br />
and, thus, interpretable by domain experts. In a previous work (Nusser et al.,<br />
2007) we developed a sequential covering algorithm for binary classification problems<br />
in safety-related domains. It is based on ensembles of low-dimensional submodels,<br />
where each submodel as well as the overall ensemble model can be verified. Thus, the<br />
correct interpolation and extrapolation behavior of the complete model can be guaranteed.<br />
In the present contribution we extend the approach to multi-class problems.<br />
The extension is not straightforward since common methods like one-against-one or<br />
one-against-rest voting (Friedman, 1996; Hsu and Lin, 2002) may introduce inconsistencies.<br />
We show that inconsistencies can be avoided by introducing a hierarchy<br />
of misclassification costs. Such a hierarchy is used to define a strict ordering of the<br />
kind: “class c1 should never be misclassified, class c2 might only be misclassified as<br />
c1, class c3 might be misclassified as c1 or c2.” Our method follows a sequential<br />
covering concept also for multi-class classification: low-dimensional submodels are<br />
trained to separate the samples of the class with the minimal misclassification costs<br />
from the samples of all remaining classes. If the problem is solved for this class or<br />
no further improvements are possible, all remaining samples of this class are removed<br />
from the training data set and the procedure is repeated for the next class<br />
within the hierarchy of misclassification costs. Experimental evaluation carried out<br />
on benchmark data sets from the UCI Machine Learning Repository shows a good<br />
trade-off between interpretability and prediction accuracy of our method.<br />
Key words: multi-class, ensemble learning, local modeling, interpretability<br />
References<br />
Friedman, J. H. (1996). Another approach to polychotomous classification. Technical<br />
report, Department of Statistics, Stanford University.<br />
Hsu, C.-W. and Lin, C.-J. (2002). A comparison of methods for multiclass support<br />
vector machines. Neural Networks, IEEE Trans. on, 13(2):415–425.<br />
Nusser, S., Otte, C., and Hauptmann, W. (2007). Learning binary classifiers for<br />
applications in safety-related domains. In Proceedings of 17th Workshop Computational<br />
Intelligence, pages 139–151. Universitätsverlag Karlsruhe.<br />
− 105 −
Analysis of Borrowing and Guaranteeing<br />
Relationships among Government Officials in<br />
the Eighth Century in the Old Capital of Japan<br />
Akinori Okada 1 and Towao Sakaehara 2<br />
1 Graduate School of Management and Information Sciences, Tama University,<br />
4-1-4 Hijirigaoka, Tama-shi, Tokyo 206-0022, Japan, okada@tama.ac.jp<br />
2 Department of History, Faculty and Graduate School of Literature and Human<br />
Sciences, Osaka City University,<br />
Sugimoto-cho 3, Sumiyoshi-ku, Osaka City 558-8585, Japan,<br />
sakaehar@lit.osaka-cu.ac.jp<br />
Abstract. In the present study, relationships among lower-ranked government officials<br />
working in the old capital of Japan, called Heijo-kyo, in the eighth century are<br />
analyzed. They were engaged in copying Buddhist sutras in the capital. The documents<br />
which show the borrowing and guaranteeing relationships among them have<br />
been kept in the governmental warehouse called Shoso-in (Sakaehara, 1987). The<br />
documents tell (a) the borrower, (b) the amount of money borrowed, (c) who stood<br />
guarantee for the borrower, (d) the date of borrowing. From these documents, the<br />
table which shows the borrowing and guaranteeing relationships among government<br />
officials was derived. The (j, k) element of the table shows the amount of money<br />
the government official corresponding to row j borrowed which was guaranteed by<br />
the government official corresponding to column k. One who stood guarantee for<br />
his colleague seem more dominant than one who borrowed. These relationships are<br />
asymmetric. The table was derived for the years 772, 773, and 774 (including the<br />
beginning of 775). The table was analyzed by the asymmetric multidimensional scaling<br />
(Okada and Imaizumi, 1997). The obtained configuration shows the dominance<br />
relationships among government officials and groups of them.<br />
Key words: Asymmetry, Borrowing and guaranteeing relationships, Historical<br />
data, Multidimensional scaling<br />
References<br />
Okada, A. and Imaizumi, T. (1997): Asymmetric multidimensional scaling of twomode<br />
three-way proximities. Journal of Classification, 14, 195–224.<br />
Sakaehara, T. (1987): People’s Life Styles in the Capital City. In: T. Kishi<br />
(Ed.): Modes of Life in the Capital Cities. Chuokoron-sha, Tokyo, 187–266.<br />
(in Japanese)<br />
− 106 −
Variable Selection for kernel classifiers:<br />
A Feature-to-Input Space Approach<br />
Surette Oosthuizen and Sarel Steel<br />
Department of Statistics and Actuarial Science, University of Stellenbosch, Private<br />
Bag X1, 7602 Matieland, South Africa (surette@sun.ac.za; sjst@sun.ac.za)<br />
Abstract. Consider using values of input variables X1, X2, · · · , Xp to classify entities<br />
into one of two groups. Kernel classifiers, e.g. support vector machines (SVMs)<br />
and kernel Fisher discriminant analysis (KFDA), are known to be exceptionally well<br />
suited for this task. In general the classification accuracy of SVMs and KFDA can<br />
however be improved substantially if instead of the comprehensive set of p input<br />
variables, a smaller subset of (say m) input variables is used. Let the space in which<br />
the training patterns reside, be called the input space. Also, let Φ map the input<br />
space to a higher-dimensional so-called feature space. An aspect which complicates<br />
variable selection for non-linear kernel classifiers is that they make implicit use of<br />
Φ: they are linear functions in a higher-dimensional feature space. Since Φ is usually<br />
unknown, and the feature space can be infinite-dimensional, the implicit transformation<br />
step obscures the contributions of variables to the kernel discriminant function.<br />
In this paper we propose a new variable selection approach for kernel classifiers,<br />
viz. so-called feature-to-input space (FI) selection. The basic idea underlying this<br />
approach is to combine the information obtained from feature space with the easy<br />
interpretation in input space. We discuss several approaches and evaluate the resulting<br />
selection criteria in a fairly extensive simulation study.<br />
Key words: Variable Selection, Kernel Based Classification, Kernel Methods<br />
References<br />
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Smola, A.J. and Müller, K.-R. (1999).<br />
Fisher discriminant analysis with kernels. Proceedings of Neural Networks for<br />
Signal Processing, 9, 41-48.<br />
Shawe-Taylor, J. and Cristianini, N. (2004). Kernel methods for pattern analysis.<br />
Cambridge University Press, Cambridge.<br />
Rakotomamonjy, A. (2002). Variable selection using SVM-based criteria. Perception<br />
Systeme Information, Insa de Rouen, Technical Report PSI 2002-04.<br />
Rakotomamonjy, A. (2003). Variable selection using SVM-based criteria. Journal of<br />
Machine Learning Research, 3, 1357-1370.<br />
− 107 −
Classifying hospitals with respect to their<br />
diagnostic diversity using Shannon’s entropy<br />
Thomas Ostermann 1 , Reinhard Schuster 2 , and Christoph Erben 2<br />
1 Department of Medical Theory and Complementary Medicine, University of<br />
Witten/Herdecke, Gerhard-Kienle-Weg 4, 58313 Herdecke, Germany,<br />
thomaso@uni-wh.de<br />
2 Medical Review Board of the Statutory Health Insurance in North Germany,<br />
Katharinenstr 11a, 23554 Luebeck, Germany, Reinhard.Schuster@mdk-nord.de<br />
Abstract. Background: In Germany hospital comparisons are part of health status<br />
reporting. However, some methodological problems arise in classifying hospitals<br />
by means of their reported data. This article presents the application of Shannon’s<br />
entropy measure for hospital comparisons. Material and Methods: We used Shannon’s<br />
entropy given by E(p1, . . . , pn) = −Σk=1,...,n pk log pk as an approach to measure<br />
the diagnostic diversity of a hospital department. Based on a data set of aggregated<br />
three-digit ICD-9 codes from the L4 hospital statistics of 1998 in Schleswig-Holstein<br />
we compared the resulting measures for diagnostic diversity with respect to<br />
the hospital departments area (e.g. surgery, gynecology) and to the hospital status<br />
(primary, secondary, tertiary or specialized hospital). Results: Highly specialized<br />
departments like obstetrics (0.44) or ophthalmology (0.46) generate significantly lower entropy<br />
values than area-spanning departments like radiology (0.52) or general gynaecology<br />
(0.56). Discussion: We showed how entropy<br />
can be used as a measure for classifying hospitals. Our approach can in principle be<br />
applied in all fields of health services research where categorical data arise.<br />
Especially for DRG data, this approach is quite promising and should be applied.<br />
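As a minimal sketch of the entropy computation above (the function name and the use of the natural logarithm are assumptions; the paper does not state the base of the logarithm):<br />

```python
import math
from collections import Counter

def diagnostic_entropy(codes):
    """Shannon entropy E = -sum_k p_k log p_k of the empirical
    distribution of diagnostic codes in one hospital department."""
    counts = Counter(codes)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())
```

A department concentrated on a single diagnosis scores 0, while one with a uniform spread over n codes scores log n, matching the interpretation of low values for specialized departments and higher values for area-spanning ones.<br />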
Key words: Entropy, diagnostic diversity, hospital comparison, classification<br />
References<br />
Brindle, G.W. and Gibson, C.J. (2007): Entropy as a measure of diversity in an<br />
inventory of medical devices. Medical Engineering & Physics, 29, Epub.<br />
Elayat, H.A., Murphy, B.B. and Prabhakar, N.D. (1978): Entropy in the hierarchical<br />
cluster analysis of hospitals. Health Serv Res, 13, 395–403.<br />
Erben, C.M. (2000): The concept of entropy as a possibility for gathering mass data<br />
for nominal scaled data in health status reporting. Stud Health Technol Inform,<br />
77, 118–119.<br />
− 108 −
Clustering and Dimensionality Reduction to<br />
Discover Interesting Patterns in Binary Data<br />
Francesco Palumbo<br />
Dipartimento di Istituzioni Economiche e Finanziarie - University of Macerata<br />
Via Crescimbeni, 20 - I-62100, Italy<br />
francesco.palumbo@unimc.it<br />
Abstract. A key element in the success of data analysis is the strong contribution<br />
of visualization: dendrograms and factorial plans are intuitive ways to display<br />
association relationships within and among sets of variables and groups of units.<br />
In Association Rules (AR) mining we refer to an n × p data matrix, where n<br />
indicates the number of statistical units and p the number of attributes, which are<br />
also called items. The problem consists in analyzing links between attributes. Sets<br />
of attributes that co-occur throughout the whole data matrix are referred to as patterns.<br />
Scanning the whole data set and analyzing all the relationships is an interesting<br />
and promising approach, yet it leads to an NP-hard problem and becomes<br />
infeasible when dealing with a large number of attributes.<br />
Moreover, in some cases, the most interesting relationships refer to subpopulations<br />
in the data, and they are hidden by the obvious ones and cannot be identified<br />
by the classical descriptive and inferential statistical methods.<br />
The joint use of factorial and clustering methods in a unitary exploratory approach<br />
copes with these issues. It allows the analyst to identify the most interesting<br />
groups of units and sets of attributes; by focusing attention only on these, interesting<br />
patterns are identified more easily in large and huge binary data bases.<br />
Key words: Dimensionality Reduction, Binary Data, AR mining<br />
References<br />
Iodice D’Enza A., Palumbo F. and Greenacre M. (2007): Exploratory data analysis<br />
leading towards the most interesting simple association rules. Comput. Statist.<br />
Data Anal., Corrected Proof, doi:10.1016/j.csda.2007.10.006.<br />
Mizuta, M. (2004): Dimension reduction methods. In J.E. Gentle, W. Hardle and<br />
Y. Mori (Eds.): Handbooks of Computational Statistics. Concepts and Methods.<br />
Springer-Verlag, Heidelberg, pp. 565-589.<br />
Plasse M., Niang N., Saporta G., Villeminot A. and Leblond L. (2007): Combined<br />
use of association rules mining and clustering methods to find relevant links<br />
between binary rare attributes in a large data set. Comput. Statist. Data Anal.,<br />
doi: 10.1016/j.csda.2007.02.020.<br />
− 109 −
Linear Encoding of Multiple Inheritance Hierarchies:<br />
Reviving an Ancient Classification Method<br />
Wiebke Petersen<br />
Institut für Sprache und Information, Heinrich-Heine-Universität Düsseldorf,<br />
petersew@uni-duesseldorf.de<br />
Abstract. The formal methods employed in Pāṇini’s more than two-thousand-year-old<br />
Sanskrit grammar are astonishing in their modernity. This talk is devoted in particular<br />
to the method of representing sets as intervals of a list, which is used for the<br />
classification of sound classes. This method is distinguished by the fact that it allows<br />
certain polyhierarchies (i.e., hierarchies in which a class may have more than one<br />
direct superclass) to be encoded linearly. Monohierarchies can be represented linearly<br />
as nested lists, since they always form a tree structure. For general polyhierarchies,<br />
a linear representation method is still lacking; they are frequently described by means<br />
of a set of constraints, whereby the hierarchy is usually decomposed into the individual<br />
elements of its binary neighborhood relation. As a consequence, many queries, for<br />
example for a hierarchical substructure, have to be processed laboriously via recursive<br />
calls. A further disadvantage of polyhierarchies is that they are often hard to read<br />
because of numerous edge crossings. Since crossing-free hierarchies are better accepted<br />
and understood by the users of a system, many current ontology systems exclude<br />
polyhierarchies, or at least hide them from the user. For these reasons, numerous<br />
formalisms admit only tree-shaped hierarchies.<br />
The talk first gives a complete characterization of those classifications whose<br />
classes can be represented as intervals of a list according to Pāṇini’s method. Such a<br />
classification is called S-representable. It is further shown formally that the Hasse<br />
diagrams of S-representable classifications can always be drawn without crossings.<br />
Finally, it is examined to what extent it is appropriate, for certain applications, to<br />
extend the class of admissible hierarchies from tree-shaped to S-representable ones,<br />
in order to permit at least restricted multiple inheritance on the one hand, without<br />
on the other hand losing the advantages of a crossing-free drawing and of an efficient<br />
linear encoding and processing of hierarchical relations.<br />
Key words: Pāṇini, hierarchies, crossing-free drawing, linear encoding<br />
− 110 −
A Formal Concept Analysis Approach to<br />
Qualitative Citation Analysis<br />
Wiebke Petersen and Petja Heinrich<br />
Institut für Sprache und Information, Heinrich-Heine-Universität Düsseldorf,<br />
petersew@uni-duesseldorf.de<br />
Abstract. The tasks of bibliometrics include citation analysis (Kessler 1963), that<br />
is, the analysis of co-citations (two texts are co-cited if there is a text in which both<br />
are cited) and of bibliographic coupling (two texts are bibliographically coupled if<br />
they share a common citation).<br />
The talk will show that Formal Concept Analysis (FCA) provides suitable means<br />
for a qualitative citation analysis. A special property of FCA is that it permits the<br />
combination of attributes of different kinds (qualitative and scalar). By employing<br />
suitable scales, one can also counter the problem that the large number of texts to<br />
be analyzed in qualitative approaches typically leads to unwieldy citation graphs<br />
whose content cannot be grasped.<br />
The relation of bibliographic coupling is closely related to the neighborhood<br />
contexts developed by Priss, which are used for the analysis of lexical databases.<br />
By means of several example analyses, the most important notions of citation<br />
analysis are modeled in formal contexts and concept lattices. It turns out that the<br />
hierarchical concept lattices of FCA are superior to ordinary citation graphs in many<br />
respects, since their hierarchical lattice structure captures certain regularities explicitly.<br />
Furthermore, it is shown how frequent sources of error such as courtesy citations,<br />
habitual citations, etc. can be countered by combining suitable attributes (doctoral<br />
advisor, institute, department, citation frequency, keywords) and scales.<br />
Key words: Bibliographic coupling, Co-citation, Formal Concept Analysis<br />
References<br />
Ganter, B. and Wille, R. (1999): Formal Concept Analysis. Mathematical Foundations.<br />
Springer, Berlin.<br />
Kessler, M.M. (1963): Bibliographic coupling between scientific papers. American<br />
Documentation, 14, 10–25.<br />
Priss, U. and Old, J. (2004): Modelling Lexical Databases with Formal Concept<br />
Analysis. Journal of Universal Computer Science, 10(8), 967–984.<br />
− 111 −
Analysis of the Power of Some Chosen VaR<br />
Backtesting Procedures: A Simulation Approach<br />
Krzysztof Piontek<br />
Department of Financial Investments and Risk Management<br />
Wroclaw University of Economics<br />
ul. Komandorska 118/120, 53-345 Wroclaw, Poland<br />
krzysztof.piontek@ae.wroc.pl<br />
Abstract. The definition of Value at Risk is quite general, and there are many approaches which can give different VaR values. The challenge is not to suggest yet another method but to distinguish between good and bad models. Backtesting is the statistical procedure necessary to evaluate the performance of VaR models and to select the best one. If the power of a test is low, it is likely to mis-classify an inaccurate VaR model as well-specified, which can be a threat to financial institutions.
The aim of this article is to analyze backtesting methodologies, focusing on limited data sets and the power of the tests. There are three groups of methods for validating VaR models: those based on the frequency of failures, those based on various loss functions, and those based on the adherence of a VaR model to asset return distributions. This article presents and summarizes some frequently used methods from each group (proposed by Kupiec, Christoffersen, Lopez and Berkowitz).
The main part of this work, however, is a statistical evaluation of the most widely applied tests for the small data sets usually observed in practice. We analyze the performance of the tests in terms of the type II error, in order to select the best one for different numbers of observations and model mis-specifications. Asset return simulations are used for this verification. The results indicate that some tests are not adequate for small samples, even with 1000 observations, which is a very important issue when acceptance of internal models for market risk management is considered.
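One of the frequency-of-failures tests named above, Kupiec's proportion-of-failures (POF) test, can be sketched in a few lines. This is a minimal illustrative Python sketch; the function name and the numeric example are our own and are not taken from the paper.

```python
from math import log

def kupiec_lr(T, x, p):
    """Kupiec proportion-of-failures likelihood ratio.

    T: number of backtesting observations
    x: number of observed VaR exceedances (failures)
    p: nominal failure probability of the VaR model (e.g. 0.01)

    Under H0 the statistic is asymptotically chi-squared with 1 df,
    so values above 3.84 reject the model at the 5% level.
    """
    def loglik(prob):
        # binomial log-likelihood of x failures in T trials,
        # guarding the 0 * log(0) cases
        s = 0.0
        if T - x:
            s += (T - x) * log(1.0 - prob)
        if x:
            s += x * log(prob)
        return s
    return -2.0 * (loglik(p) - loglik(x / T))
```

With T = 250 daily observations and a 99% VaR, ten exceedances yield LR of about 13.0 (rejection at the 5% level, critical value 3.84), while three exceedances yield about 0.09 (no rejection); the small-sample power problem discussed above arises because moderate mis-specifications often fall between such extremes.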
Key words: risk measurement, Value at Risk, backtesting, power of tests<br />
References<br />
Haas, M. (2001): New Methods in Backtesting, CAESAR,
www.caesar.de/uploads/media/cae pp 0010 haas 2002-02-05 01.pdf<br />
Piontek, K. (2007): A Survey and a Comparison of Backtesting Procedures<br />
(in Polish), In: P. Chrzan: Metody matematyczne, ekonometryczne..., Katowice.<br />
Sarma, M., Thomas, S., Shah, A. (2003): Selection of Value-at-Risk Models,<br />
ideas.repec.org/s/jof/jforec.html<br />
− 112 −
Testing distributions in errors-in-variables models
Denys Pommeret 1<br />
Aix-Marseille 2 University pommeret@iml.univ-mrs.fr<br />
Abstract. Within the framework of errors-in-variables models we consider the sum of two independent random variables, X = W + Z, where W is the variable of interest with known distribution Π, and where the error Z has unknown density f. We present a smooth goodness-of-fit test for the distribution of the error Z. For that purpose we observe an i.i.d. sample X1, · · · , Xn with mixture density function

g(x) = ∫ f(x, m) Π(dm),

where Π is a real probability distribution and the f(x, m) are real m-parameterized density functions, for m in some set M ⊂ R. We assume that Π, the distribution of W, is known, and we want to test

H0 : f(x, m) = f0(x, m), for all m in M,

where f0 is a specified probability density function. An adaptation of the Neyman smooth test is proposed.
Key words: Mixture models, Neyman’s test, Score statistic, Schwarz’s criterion
References<br />
Hart, J.D. (1997): Nonparametric smoothing and lack-of-fit tests, Springer Series in<br />
Statistics. New York, NY.<br />
Kallenberg, W.C.M. and Ledwina, T. (1995): Consistency and Monte Carlo simulation<br />
of data driven version of smooth goodness of fit tests, Ann. Statist, 23,<br />
1594–1608.<br />
Ledwina, T. (1994): Data-Driven Version of Neyman’s Smooth Test of Fit, Journal
of the American Statistical Association, 89, 1000–1005.
Lehmann, E.L. and Romano, J.P. (2005): Testing statistical hypotheses, 3rd ed.<br />
Springer Texts in Statistics. New York, NY: Springer.<br />
− 113 −
Classification with an increasing number of<br />
components<br />
Odile Pons 1<br />
INRA, Mathematics, Jouy-en-Josas, France Odile.Pons@jouy.inra.fr<br />
Abstract. After estimating the rn actual components of a mixture whose number of components increases with the sample size, the question is to determine to which group a given observation Xi, i = 1, . . . , n, belongs. A classification consists in mapping an observation Xi (or a value x of X) to a class k̂n(Xi) (or k̂n(x)) in {1, . . . , rn}, which may either be uniquely defined (fixed case) or related to a random distribution. In both cases, the component k̂n is chosen by maximum likelihood with a penalization.
A random classification avoids the misclassification of observations with overlapping densities: k(Xi) = j with an estimated probability and k(x) = j with a fixed probability, 1 ≤ j ≤ rn. They are estimated by

µ̂n,k̂n(x) f̂n,k̂n(x) = max_{1≤k≤rn} Qn(µ̂n,k f̂n,k; x), with

Qn(µ̂n,k f̂n,k; x) = µ̂n,k f̂n,k(x) − nλ²n Σ_{1≤j≤rn} π(f̂n,k − f̂n,j) − ν²n Σ_{j=1}^{q} p(µ̂n,j),

where the penalization coefficients λn and νn tend to zero as n → ∞, and π and p are smooth functions. For a random classification, k̂n(Xi) is defined in the same way with some probabilities. The random procedure preserves all the classes: the proportions of observations assigned to the classes {1, . . . , rn} are asymptotically identical to the mixture probabilities.
Key words: Mixture, classification, asymptotics
References<br />
Lemdani, M. and Pons, O. (2007): Large mixture models with an increasing number
of components. Unpublished manuscript.
Pons, O. (2008): Asymptotic distributions in finite semi-parametric mixture models.
To appear.
− 114 −
Bagging with different split criteria
Sergej Potapov and Berthold Lausen<br />
Department of Biometry and Epidemiology, Friedrich-Alexander-University<br />
Erlangen-Nuremberg, Waldstraße 6, D-91054 Erlangen, Germany<br />
Sergej.Potapov@imbe.imed.uni-erlangen.de<br />
Abstract. In recent years many papers have discussed boosting- and bagging-based methods for supervised learning or machine learning. Both concepts aggregate sets of estimated trees, which are derived by split criteria that do not adjust for variables measured on different scales. Breiman et al. (1984) observed that quantitative variables tend to be selected more often than binary variables. As a solution, Lausen et al. (1994, 2004) introduced p-value adjusted classification and regression trees, which use the p-value of maximally selected test statistics as the split criterion. The p-value adjustment avoids the possible selection bias between variables measured on different scales. The R package TWIX of Potapov et al. (2008) offers p-value adjusted classification trees and bagging of p-value adjusted classification trees. In our paper we compare bagging and double-bagging (Hothorn and Lausen, 2003) with and without p-value adjustment by means of simulation. Moreover, we illustrate our approach using a clinical study involving microarray data.
Key words: bagging, CART, machine learning, trees, microarray data
References<br />
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984): Classification and<br />
regression trees. Wadsworth Press.<br />
Hothorn, T., Lausen, B. (2003): Double-bagging: Combining classifiers by bootstrap
aggregation. Pattern Recognition 36(6), 1303–1309.
Lausen, B., Hothorn, T., Bretz, F., Schumacher, M. (2004): Assessment of optimal
selected prognostic factors. Biometrical Journal 46, 364–374.
Lausen, B., Sauerbrei, W., Schumacher, M. (1994): Classification and regression trees<br />
(CART) used for the exploration of prognostic factors measured on different<br />
scales, in: Dirschedl, P., and Ostermann, R. (eds.), Computational Statistics,<br />
Physica-Verlag, Heidelberg, 483–496.<br />
Potapov, S., Theus, M. (2008): The TWIX package (Version 0.2.4.). http://cran.r-project.org
− 115 −
Remarks on the Existence of CML Estimates<br />
for the PCM by means of the R Package eRm<br />
Antonio Punzo 1<br />
University of Milano-Bicocca, Department of Quantitative Methods for Business<br />
and Economic Sciences, a punzo@libero.it<br />
Abstract. Mair and Hatzinger (2007) have recently proposed, in the Journal of Statistical Software, the R package eRm (extended Rasch models) for computing Rasch models and several extensions. Undoubtedly, within the eRm class the partial credit model (PCM) is, for practical testing purposes, one of the best known. The package estimates the item parameters of the above-mentioned models using a unitary conditional maximum likelihood (CML) procedure.
Although the eRm models belong to the Rasch family and share its distinguishing characteristics, they suffer from the problem of possible non-existence of estimates. In the literature, both in the joint and in the conditional ML approach, the configurations and conditions of non-existence are well known for the RM (Fischer, 1981), and the eRm package performs a preliminary data check only for the RM. For the PCM the conditions of non-existence are known only in the joint case (Bertoli-Barsotti, 2005).
In this article the main focus is on the PCM, and the above-mentioned JML non-existence configurations for this model are the starting point. A class of counterexamples is illustrated which leads to “false” CML estimates with the eRm package, i.e., values that appear to be estimates but, on a more accurate analysis of the function being maximized, are rather a clear signal of non-existence. Moreover, the results obtained emphasize the presence of additional CML non-existence configurations, compared to those valid in the JML case.
Key words: Rasch models, Partial Credit model, Conditional Maximum Likelihood<br />
estimate, R package eRm<br />
References<br />
Bertoli-Barsotti, L. (2005): On the Existence and Uniqueness of JML Estimates for<br />
the Partial Credit Model. Psychometrika, 70, 3, 517–531.<br />
Fischer, G. H. (1981): On the Existence and Uniqueness of Maximum-Likelihood<br />
Estimates in the Rasch Model. Psychometrika, 46, 1, 59–77.<br />
Mair P. and Hatzinger R. (2007): Extended Rasch Modeling: The eRm Package for<br />
the Application of IRT Models in R. Journal of Statistical Software, 20, 9, 1–20.<br />
− 116 −
Dynamic disturbances in BTA deep-hole
drilling - Identification of spiralling as a<br />
regenerative effect<br />
Nils Raabe, Dirk Enk, Claus Weihs, and Dirk Biermann<br />
Technische Universität Dortmund
Germany<br />
Abstract. One serious problem in deep-hole drilling is the formation of a dynamic disturbance called spiralling, which causes holes with several lobes. Since such lobes are a severe impairment of the bore hole, the formation of spiralling has to be prevented. One common explanation for the occurrence of spiralling is the intersection of time-varying bending eigenfrequencies with multiples of the tool’s rotational frequency. Little is known about which specific eigenfrequencies are crucial. Furthermore, an underlying assumption of this explanation is that the resulting holes, in cross-sectional view, show as a curve of constant width. This assumption implicitly supposes spiralling to result from a parallel displacement of the drill head.
We in fact observed spiralling in experiments designed to force it by planning crucial frequency intersections using a statistical-physical model proposed in earlier work. However, not every intersection of an eigenfrequency with a multiple of the rotational frequency led to spiralling. Furthermore, we also found cases of spiralling with two or four lobes, contradicting the common assumption. Inspection of the eigenmodes corresponding to the frequencies which caused the spiralling revealed that these modes commonly show a clear tilt at the drill head instead of a parallel displacement. On the one hand, this tilt in general allows ordering the eigenfrequencies by their relevance with respect to spiralling. On the other hand, we are now able to give a geometrical explanation of the development of spiralling as a regenerative effect.
We use this explanation to extend our statistical-physical model by a process model of the chips cut during the process. This model is the basis of a system for the simulation of spiralling. Since the model contains the machine parameters, it can be used to evaluate the probability and extent of spiralling in different settings. In this way different settings can be classified into stable and unstable processes, and strategies for the avoidance of spiralling can be derived. Since the statistical-physical model includes a statistical estimation procedure for the unknown parameters, these strategies can finally be tested in real processes.
− 117 −
Statistical processes under change - Enhancing<br />
data quality with pretests<br />
Walter Radermacher<br />
President of the Federal Statistical Office, Germany<br />
walter.radermacher@destatis.de<br />
Summary. The production of high-quality statistics is the main task of the Federal Statistical Office. Technological progress, globalisation, and the increasing significance and diversification of information and its distribution are only some general terms for the changes we are faced with today. Needless to say, those changes strongly affect the statistical work of the FSO and pose challenges that can only be met with innovative and appropriate methods, to name only a few: cooperation and networks, multiple sources, mixed-mode designs, standardisation of processes, metadata for quality control and the use of administrative information. The point is to maximise data quality and minimise the cost and the burden for the participants in surveys.
A prominent method for improving data quality in surveys is the use of pretests within - or ideally before - the actual data production process. Pretests fulfill a number of functions: they minimise non-sampling errors, they reduce the burden that comprehensive questionnaires place on respondents, and they test the feasibility of a concept in practice. Combining quantitative and qualitative methods for pretesting leads to significant increases in data quality. For instance, cognitive interviewing, an accepted method mainly used in social science research, when applied to test household surveys as well as business surveys, enables the detection of reporting errors caused by the underlying cognitive processes through which respondents generate their answers to survey questions. Some examples from practice will illustrate the benefits of pretests in official statistics.
Changing conditions call for changing procedures and methods. In our work supplying official statistics we react to the increasing demand for reliable data. Pretests are an important example of a method which meets this need in two ways: they improve quality control and they contribute to the user-friendliness of our surveys.
− 118 −
Automatic Dictionary Expansion Using<br />
Non-parallel Corpora<br />
Reinhard Rapp 1 and Michael Zock 2<br />
1 University of Tarragona reinhard.rapp@urv.cat<br />
2 LIF-CNRS, Marseille michael.zock@lif.univ-mrs.fr<br />
Abstract. Automatically deriving bilingual dictionaries from manually translated<br />
texts is an established technique that works well in practice. However, translated<br />
texts are a scarce resource. Therefore, it is also desirable to be able to generate<br />
dictionaries from pairs of unrelated monolingual corpora. To achieve this, we suggest<br />
an approach that considers the crosslingual correlations between the co-occurrence<br />
patterns of translated words. If, for example, two words X and Y co-occur more often<br />
than expected by chance in the source language, then their translations T(X) and<br />
T(Y) should also co-occur more frequently than expected in the target language. It<br />
is further assumed that a small dictionary is available at the beginning, and that<br />
the aim is to expand this base lexicon.<br />
The approach is as follows: Using a corpus of the target language, first a co-occurrence
matrix is computed with the rows being word types from the corpus and
the columns being target words from the base lexicon. Next a word of the source<br />
language is considered whose translation is to be determined. Using the source-language
corpus, a co-occurrence vector for this word is computed. Then, using the
dictionary, all known words in this vector are translated into the target language,<br />
thereby discarding unknown words. The resulting vector is compared to all vectors<br />
in the co-occurrence matrix of the target language. The vector with the highest<br />
similarity is considered to be the translation of the source-language word.<br />
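The procedure just described can be sketched in a few lines. This is an illustrative Python sketch over toy corpora; the window size, the cosine similarity measure and all names are our assumptions, not details from the talk.

```python
from math import sqrt

def cooc_vector(corpus, word, vocab, window=3):
    """Count co-occurrences of `word` with each vocabulary entry
    within a +/- `window` token span."""
    index = {w: i for i, w in enumerate(vocab)}
    vec = [0.0] * len(vocab)
    for k, token in enumerate(corpus):
        if token != word:
            continue
        for neigh in corpus[max(0, k - window):k + window + 1]:
            if neigh != word and neigh in index:
                vec[index[neigh]] += 1.0
    return vec

def cosine(a, b):
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (na * nb) if na and nb else 0.0

def translate(word, src_corpus, tgt_corpus, base_lexicon):
    """Return the target word whose co-occurrence vector is most
    similar to the translated co-occurrence vector of `word`."""
    src_vocab = sorted(base_lexicon)                 # known source words
    tgt_vocab = sorted(set(base_lexicon.values()))   # their translations
    # source vector over the base lexicon, translated column by column
    src_vec = cooc_vector(src_corpus, word, src_vocab)
    mapped = [0.0] * len(tgt_vocab)
    for i, w in enumerate(src_vocab):
        mapped[tgt_vocab.index(base_lexicon[w])] += src_vec[i]
    # compare with the vector of every target-language word type
    return max(sorted(set(tgt_corpus)),
               key=lambda t: cosine(mapped,
                                    cooc_vector(tgt_corpus, t, tgt_vocab)))
```

On realistic corpora the vectors would of course be built from millions of tokens and an association measure such as log-likelihood would replace raw counts.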
In our experiments this method gave an accuracy in the order of 50%. To improve<br />
the results, we perform an automatic cross-check which utilizes the dictionaries’<br />
property of transitivity. What we mean by this is that if we have two dictionaries, one<br />
translating from language A to language B, the other from B to C, then we can also<br />
translate from language A to C by using the intermediate language (or interlingua)<br />
B. That is, the property of transitivity, although having some limitations due to<br />
word ambiguities, can be exploited to automatically generate a raw dictionary for<br />
A to C. One might think that this is unnecessary as our corpus-based approach<br />
also allows us to generate this dictionary directly from the respective comparable<br />
corpora. However, having two different ways of generating the same dictionary has<br />
the advantage that we can validate one via the other. Furthermore, by considering
several languages, additional possibilities for mutual cross-validation arise.
Key words: dictionary generation, comparable texts, translation<br />
− 119 −
FIMIX-PLS Segmentation of Data for Path<br />
Models with Multiple Endogenous LVs<br />
Christian M. Ringle<br />
University of Hamburg, Institute of Industrial Management, Von-Melle-Park 5,<br />
20146 Hamburg, Germany, cringle@econ.uni-hamburg.de<br />
Abstract. When applying a causal modeling approach such as partial least squares<br />
(PLS) path modeling in empirical studies, the assumption that the data has been<br />
collected from a single homogeneous population is often unrealistic. Unobserved<br />
heterogeneity in the PLS estimates for the aggregate data level may result in misleading<br />
interpretations. Finite mixture partial least squares (FIMIX-PLS; Hahn et<br />
al., 2002) allows classifying data based on the heterogeneity of the estimates in the<br />
inner path model. Experimental as well as empirical examples (Esposito Vinzi et al.,
2007; Ringle et al., <strong>2008</strong>) illustrate the application of FIMIX-PLS for path models<br />
that only involve a single latent endogenous variable. This research uses a systematic<br />
approach (Ringle, 2007) to apply the FIMIX-PLS methodology and presents<br />
FIMIX-PLS computational experiments for a path model which includes multiple<br />
endogenous latent variables (LVs). The results of this analysis further substantiate the reliability of the systematic FIMIX-PLS application in more realistic situations and provide researchers and practitioners with the certainty they require to effectively evaluate their PLS path modeling results. If the procedure uncovers significant heterogeneity, the analysis results in further differentiated path modeling outcomes and thus allows forming more precise conclusions.
Key words: PLS Path Modeling, Heterogeneity, Finite Mixture, Segmentation<br />
References<br />
Esposito Vinzi, V., Ringle, C.M., Squillacciotti, S. and Trinchera, L. (2007): Capturing
and Treating Unobserved Heterogeneity by Response Based Segmentation<br />
in PLS Path Modeling: A Comparison of Alternative Methods by Computational<br />
Experiments. ESSEC Research Center, Working Paper No. 07019. ES-<br />
SEC Business School Paris-Singapore.<br />
Hahn, C., Johnson, M.D., Herrmann, A. and Huber, F. (2002): Capturing Customer<br />
Heterogeneity using a Finite Mixture PLS Approach. Schmalenbach Business<br />
Review, 54, 243–269.<br />
Ringle, C.M. (2007): Segmentation for path models and unobserved heterogeneity:<br />
The finite mixture partial least squares approach, Research Papers on Marketing<br />
and Retailing No. 035. University of Hamburg.<br />
− 120 −
Extreme unconditional dependence vs.<br />
multivariate GARCH effect in the analysis of<br />
dependence between high losses on Polish and<br />
German stock indexes<br />
Pawel Rokita, Krzysztof Piontek<br />
Department of Financial Investments and Risk Management<br />
Wroclaw University of Economics<br />
ul. Komandorska 118/120, 53-345 Wroclaw, Poland<br />
pawel.rokita@ae.wroc.pl, krzysztof.piontek@ae.wroc.pl<br />
Abstract. Classical portfolio diversification methods do not take account of any dependence between extreme returns (losses). Many researchers provide, however, empirical evidence that extreme losses for various assets co-occur. If the co-occurrence is frequent enough to be statistically significant, it may seriously influence portfolio risk. Such effects may result from a few different properties of financial time series, for instance: (1) extreme dependence in a (long-term) unconditional distribution, (2) extreme dependence in subsequent conditional distributions, (3) time-varying conditional covariance, (4) time-varying (long-term) unconditional covariance, (5) market contagion. Moreover, a mix of these properties may be present in return time series. Modeling each of them requires a different approach. It seems reasonable to investigate whether distinguishing between the properties is highly significant for portfolio risk measurement. If it is, identifying the effect responsible for high-loss co-occurrence would be of great importance. If it is not, the best solution would be to select the easiest-to-apply model. This article concentrates on two of the aforementioned properties: extreme dependence (in a long-term unconditional distribution) and time-varying conditional covariance.
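For the unconditional extreme-dependence property, a simple empirical estimate of the upper tail dependence coefficient (TDC) can be sketched as follows. This is an illustrative Python sketch; the quantile level, the rank-based construction and the names are our choices, not the authors'.

```python
def ranks(values):
    """0-based ranks of the observations (no tie handling, for brevity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for pos, i in enumerate(order):
        r[i] = pos
    return r

def empirical_utdc(x, y, q=0.9):
    """Estimate P(Y extreme | X extreme) at quantile level q:
    the fraction of joint exceedances among the exceedances of x."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    # pseudo-observations on (0, 1) via normalized ranks
    ux = [(r + 1) / (n + 1) for r in rx]
    uy = [(r + 1) / (n + 1) for r in ry]
    exceed_x = [i for i in range(n) if ux[i] > q]
    if not exceed_x:
        return 0.0
    joint = sum(1 for i in exceed_x if uy[i] > q)
    return joint / len(exceed_x)
```

For perfectly comonotone losses the estimate is 1, for counter-monotone losses it is 0, and independent losses give values near 1 − q; the TDC proper is the limit of this quantity as q approaches 1.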
Key words: extreme dependence, TDC, multivariate GARCH<br />
References<br />
Coles S., Heffernan J., Tawn J. (1999): Dependence Measures for Extreme Value<br />
Analyses. Extremes, 2:4, 339–365.<br />
Gouriéroux C. (1997): ARCH Models and Financial Applications. Springer.<br />
Rokita P. (2008): Comparing extreme dependence and varying conditional covariance
concept for portfolio risk modeling (in Polish). To be published in: Taksonomia,
15.
− 121 −
Outline of a generative corpus linguistics
Jürgen Rolshoven<br />
Linguistic Data Processing, Department of Linguistics, University of Cologne<br />
rols@spinfo.uni-koeln.de<br />
Abstract. The processing of long texts has been strongly stimulated by bioinformatics, as shown, among others, by Gusfield (1997), Böckenhauer and Bongartz (2003), and Haubold and Wiehe (2006). As a data structure, suffix trees play a central role. For the linguist, however, suffix trees are of interest not for substring search; rather, they enable the acquisition of linguistic knowledge through discovery procedures. Suffixes are potential morphemes in text corpora. Discovery procedures select, from the set of potential morphemes, those which carry function or meaning. For this, structuralist procedures such as substitution, omission and movement are employed. Formally, the following procedure results: suffix trees are equivalent to finite automata and correspond to Type-3 languages in the Chomsky hierarchy. From this, simple production rules are obtained, which are transformed into Type-2 rules with the help of further information from the suffix trees. These in turn are transformed into Type-1 rules. With this approach, language is parsed not sentence by sentence but, as it were, holistically with respect to the text. The procedures sketched here lead to overgeneration and are in this sense generative. Their descriptive potential extends beyond the texts from which the rules were derived. This is what the term generative corpus linguistics is meant to express. A generative corpus linguistics combines, through the use of bioinformatic methods, the strengths of the data-driven corpus-linguistic approach with the hypothesis-driven approach of generative grammars.
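The first step of such a discovery procedure, treating frequent word-final strings as candidate morphemes, can be illustrated as follows. This is a Python toy sketch, not the suffix-tree machinery of the cited works; thresholds and names are our assumptions.

```python
from collections import Counter

def candidate_suffixes(word_types, max_len=4, min_count=3):
    """Count word-final substrings over distinct word types and keep
    those frequent enough to be candidate morphemes."""
    counts = Counter()
    for w in set(word_types):
        # suffixes of length 1..max_len, leaving at least one stem character
        for k in range(1, min(max_len, len(w) - 1) + 1):
            counts[w[-k:]] += 1
    return {s: c for s, c in counts.items() if c >= min_count}
```

Frequency alone also promotes fragments such as single final letters; the structuralist substitution and omission tests described above are what single out the functional suffix among the candidates.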
References<br />
Böckenhauer, H-J., Bongartz, D. (2003): Algorithmische Grundlagen der Bioinformatik.<br />
Teubner Verlag, Wiesbaden.<br />
Gusfield, D. (1997): Algorithms on Strings, Trees and Sequences: Computer Science<br />
and Computational Biology. Cambridge University Press, Cambridge, Mass.<br />
Haubold, B., T. Wiehe (2006): Introduction to computational biology: an evolutionary<br />
approach. Birkhäuser Verlag, Basel; Boston.<br />
− 122 −
Cluster ensemble based on co-occurrence data<br />
Dorota Rozmus<br />
Department of Statistics,<br />
Katowice University of Economics, Bogucicka 14, 40-226 Katowice<br />
drozmus@ae.katowice.pl<br />
Abstract. The ensemble approach has been successfully applied in the context of supervised learning to increase the accuracy and stability of classification. Recently, analogous techniques for cluster analysis have been suggested. Research has shown that, by combining a collection of different clusterings, an improved solution can be obtained.
In the traditional way of learning from a data set, the classifiers are built in a feature space. Alternatively, decision rules can be constructed on similarity or dissimilarity representations instead. In such a recognition process an object is described by a distance matrix showing its similarity to the rest of the training samples.
This research focuses on exploiting the additional information provided by a collection of diverse clusterings to generate a co-association (similarity) matrix. Taking the co-occurrences of pairs of patterns in the same cluster as votes for their association, the data partitions are mapped into a co-association matrix of patterns. This n × n matrix represents a new similarity measure between patterns. The final data partition is obtained by clustering this matrix.
In the experiments, the behavior of partitions built on co-occurrence data is studied.
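The vote-counting construction described above can be sketched as follows. This is a minimal Python sketch; the majority threshold and the connected-components consensus step are illustrative choices, since the abstract does not fix the method used to cluster the co-association matrix.

```python
from itertools import combinations

def co_association(partitions, n):
    """Fraction of base clusterings that place each pair of objects
    in the same cluster (the n x n co-association matrix)."""
    m = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    m[i][j] += 1.0 / len(partitions)
    return m

def consensus_clusters(partitions, n, threshold=0.5):
    """Final partition: connected components of the graph linking
    pairs whose co-association exceeds `threshold` (majority vote)."""
    m = co_association(partitions, n)
    labels = list(range(n))          # each object starts in its own cluster
    def find(i):                     # union-find with path halving
        while labels[i] != i:
            labels[i] = labels[labels[i]]
            i = labels[i]
        return i
    for i, j in combinations(range(n), 2):
        if m[i][j] > threshold:
            labels[find(i)] = find(j)
    return [find(i) for i in range(n)]
```

Any clustering algorithm that accepts a similarity matrix, for example single-link hierarchical clustering, could replace the simple thresholded components used here.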
Key words: Cluster analysis, Cluster ensemble, Co-association matrix, (Dis)similarity<br />
representation.<br />
References<br />
Fred, A. and Jain, A.K. (2002): Evidence accumulation clustering based on the
k-means algorithm. Structural, Syntactic, and Statistical Pattern Recognition,
2396, 442–451.
Strehl, A. and Ghosh, J. (2002): Cluster ensembles - a knowledge reuse framework<br />
for combining partitionings. Journal of Machine Learning Research, 3, 583–617.<br />
Pekalska, E. and Duin, R.P.W. (2000): Classifiers for dissimilarity-based pattern
recognition. In: A. Sanfeliu, J.J. Villanueva, M. Vanrell, R. Alquezar, A.K.<br />
Jain and J. Kittler (Eds.): Proc. 15th Int. Conf. on Pattern Recognition, IEEE<br />
Computer Society Press, Los Alamitos, 12–16.<br />
− 123 −
Dyadic Interactions in Service Encounter -<br />
Bayesian SEM Approach<br />
Adam Sagan 1 and Magdalena Kowalska-Musiał 2
1 Chair of Market Analysis and Marketing Research Cracow University of<br />
Economics, Rakowicka 27, 31-510 Cracow, Poland sagana@ae.krakow.pl<br />
2 The School of Banking and Management, Armii Krajowej 4, 30-115 Cracow,<br />
Poland m.kowalska@wszib.edu.pl<br />
Abstract. Dyadic multirelational and sequential interactions are important aspects of service encounters. They can be observed in B2B distribution channels, professional services, buying centers, family decision making and WOM communications. The networks consist of dyadic bonds that form dense but weak ties among actors. The aim of this paper is the identification of latent properties of dyadic interactions in the mobile phone service market. Latent variable models in relational marketing often either concentrate on the effects of relations or treat the relationship dimensions as psychological constructs on the individual-trait level.
We propose an approach based on Bayesian latent variable modeling of social networks with dyads as the units of analysis. This approach makes it possible to model emergent and relational properties of actors’ interactions in dyads that are irreducible to individual latent traits or psychological constructs.
Several competing models are developed and compared using Bayesian structural equation models of dyadic data. Bayesian SEM helps to overcome the limitations of the more traditional solutions based on ML or WLS estimation. It is robust to the small samples that are common in social network analysis, and can also be applied to non-normal data as well as non-linear relations between latent variables.
Key words: Relationship marketing, Dyadic data, Bayesian SEM<br />
References<br />
Anderson, J. C. and Håkansson, H. and Johanson, J. (1994): Dyadic Business Relationships
within a Business Network Context, Journal of Marketing, October,<br />
1–15.<br />
Iacobucci, D. and Hopkins, N. (1992): Modeling Dyadic Interactions and Networks<br />
in Marketing, Journal of Marketing Research, February, 5–17.<br />
Kenny, D.A. and Kashy, D.A and Cook, W.L. (2006): Dyadic Data Analysis. Guilford<br />
Press, New York.<br />
Lee, S. Y. (2007): Structural Equation Modeling: A Bayesian Approach. John Wiley
and Sons, Chichester.
− 124 −
− 125 −
Nonnegative Matrix Factorization for Binary<br />
Data to Extract Elementary Failure Maps from<br />
Wafer Test Images<br />
Reinhard Schachtner 1,2 , Gerhard Pöppel 1 and Elmar Lang 2<br />
1 Infineon Technologies AG, 93049 Regensburg, Germany<br />
reinhard.schachtner@infineon.com<br />
2 Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany<br />
Abstract. We introduce a probabilistic variant of non-negative matrix factorization<br />
(NMF) for binary data sets: binary coded images are considered a probabilistic<br />
superposition of underlying continuous-valued basic patterns. An extension<br />
of the well-known NMF procedure to binary-valued data sets is provided to solve the<br />
related optimization problem under non-negativity constraints. We demonstrate the<br />
performance of our method by applying it to the detection and characterization of<br />
hidden causes of failures during wafer processing: binary coded (pass/fail) wafer<br />
test data are decomposed into underlying elementary failure patterns, and their<br />
influence on the quality of single wafers is studied.<br />
Key words: Nonnegative matrix factorization, binary data, failure patterns<br />
References<br />
Lee, D. and Seung, H. (1999): Learning the parts of objects by non-negative matrix<br />
factorization. Nature, 401, 788–791.<br />
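The abstract does not spell out the probabilistic binary-data variant; as background, the classical multiplicative-update NMF of the cited Lee and Seung (1999) paper can be sketched as follows (the toy 0/1 "wafer maps", the rank and the iteration count are illustrative assumptions, not the authors' data or algorithm):

```python
import numpy as np

def nmf(V, rank, n_iter=500, eps=1e-9, seed=0):
    """Classical NMF (Lee and Seung, 1999) with multiplicative updates,
    minimizing the squared reconstruction error ||V - WH||^2."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps          # per-image weights
    H = rng.random((rank, m)) + eps          # basic patterns
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update patterns
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update weights
    return W, H

# toy example: three binary "wafer maps"; the third superimposes
# two elementary failure patterns (purely illustrative data)
pattern_a = np.array([1, 1, 0, 0, 0, 0])
pattern_b = np.array([0, 0, 0, 0, 1, 1])
V = np.array([pattern_a, pattern_b, pattern_a | pattern_b], dtype=float)
W, H = nmf(V, rank=2)
print(np.round(W @ H, 2))  # reconstruction close to V
```

In the authors' setting, the rows of H would play the role of elementary failure patterns and W would give their per-wafer contributions.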
− 126 −
Quality-Based Clustering of Functional Data:<br />
Applications to Time Course Microarray Data<br />
Theresa Scharl 1 and Friedrich Leisch 2<br />
1 Institut für Statistik und Wahrscheinlichkeitstheorie, Technische Universität<br />
Wien, Wiedner Hauptstr. 8-10, A-1040 Wien, Austria; Scharl@ci.tuwien.ac.at<br />
2 Institut für Statistik, Ludwig-Maximilians-Universität München, Ludwigstraße<br />
33, D-80539 München, Germany; Friedrich.Leisch@stat.uni-muenchen.de<br />
Abstract. Cluster methods are typically applied to time course gene expression<br />
data to find co-regulated genes, which can finally help to reveal pathways and interactions<br />
between genes. Clustering is either carried out on the raw data or on<br />
functional data. In functional data analysis (e.g. Serban and Wasserman, 2005;<br />
Tarpey, 2007) a curve is fit to each observation in order to account for time dependency.<br />
Gene expression over time is biologically a continuous process and can<br />
therefore be represented by a continuous function. The different curve shapes found<br />
in a dataset can have important interpretations and characteristic patterns can be<br />
found by clustering the estimated regression coefficients.<br />
In this study the raw data are clustered using the well-known K-means algorithm<br />
as well as the quality-based cluster algorithm Stochastic QT-Clust (Scharl and<br />
Leisch, 2006). Further, the parameters obtained by representing each gene expression<br />
profile by a curve are clustered. Additionally, mixtures of spline regression models<br />
and mixed-effects models are applied to the data. All cluster algorithms used are<br />
implemented in R. The different cluster methods are compared in a simulation study<br />
on various datasets.<br />
Key words: Cluster analysis, functional data, time course gene expression data, R<br />
References<br />
SCHARL, T. and LEISCH, F. (2006): The stochastic QT-clust algorithm: evaluation<br />
of stability and variance on time-course microarray data. In: Rizzi, A. and Vichi,<br />
M. (Eds.): Compstat 2006 - Proceedings in Computational Statistics, 1015–1022,<br />
Physica Verlag, Heidelberg, Germany.<br />
SERBAN, N. and WASSERMAN, L. (2005): CATS: Clustering after transformation<br />
and smoothing. Journal of the American Statistical Association, 100(471), 990–<br />
999.<br />
TARPEY, T. (2007): Linear transformations and the k–Means clustering algorithm:<br />
Applications to clustering curves. The American Statistician, 61(1), 34–40.<br />
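The coefficient-clustering idea can be sketched as follows; the straight-line fits and the minimal Lloyd-style k-means below are illustrative stand-ins for the spline models, Stochastic QT-Clust and the R implementations named in the abstract:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal Lloyd's k-means on the rows of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# toy expression profiles over 5 time points: two "up" and two "down" genes
t = np.linspace(0, 1, 5)
profiles = np.array([2 * t, 2 * t + 0.1, -2 * t, -2 * t - 0.1])
# represent each profile by its fitted curve coefficients, then cluster those
coefs = np.array([np.polyfit(t, y, deg=1) for y in profiles])
labels = kmeans(coefs, k=2)
print(labels)  # up-regulated genes in one cluster, down-regulated in the other
```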
− 127 −
Multilingual knowledge-based concept<br />
recognition in textual data<br />
Martin Schierle 1 and Daniel Trabold 2<br />
1 Daimler AG martin.schierle@daimler.com<br />
2 Daimler AG daniel.trabold@daimler.com<br />
Abstract. Given the increasing volume of textual data available through digital<br />
resources today, identifying the main concepts in those texts becomes more and<br />
more important and can be seen as a vital step in the analysis of unstructured<br />
information.<br />
Research in this area has focused on the detection of named entities such as person<br />
or organization names, which cover only a very small part of the concepts in texts.<br />
In particular, the unique mapping between concepts in different languages requires<br />
parallel corpora, which are rarely available in industrial settings.<br />
We therefore propose a powerful new knowledge-based model that recognizes various<br />
kinds of concepts even in very short and specialized texts, using linguistic information<br />
for synonym handling and word sense disambiguation.<br />
We evaluate the proposed model on texts from the automotive domain.<br />
− 128 −
Localized Logistic Regression for Discrete<br />
Influential Factors<br />
Julia Schiffner, Gero Szepannek, Thierry Monthé, and Claus Weihs<br />
Faculty of Statistics, Dortmund University of Technology, 44221 Dortmund,<br />
Germany,<br />
schiffner@statistik.uni-dortmund.de,<br />
weihs@statistik.uni-dortmund.de<br />
Abstract. The two-class localized logistic regression of Tutz and Binder (2005)<br />
is generalized to discrete explanatory variables, and applied to data from a breast<br />
cancer study. In order to obtain a distance measure between observations of the<br />
discrete factors a combination of the simple and the flexible matching coefficient<br />
(Ickstadt et al., 2006) is taken. Applying the method of Tutz and Binder (2005)<br />
with this distance measure to the example data, localized models lead to smaller<br />
misclassification rates than the corresponding global ones. Moreover, the best classification<br />
rule found gives one of the smallest misclassification rates ever obtained<br />
for the example data. The results of Monthé (2008) are extended by automatic<br />
variable selection.<br />
Key words: Localized logistic regression, Matching coefficients, SNP data<br />
References<br />
Ickstadt, K., Mueller, T., and Schwender, H. (2006): Analyzing SNPs: Are There<br />
Needles in the Haystack? CHANCE, 19(3), 22–27.<br />
Monthé, Th. (2008): Lokalisierte Logistische Regression bei diskreten Variablen.<br />
Master's thesis, Faculty of Statistics, Dortmund University of Technology.<br />
Tutz, G. and Binder, H. (2005): Localized Classification. Statistics and Computing,<br />
15, 155–166.<br />
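As background on the distance ingredient: the simple matching coefficient counts agreeing positions, so the proportion of disagreements yields a distance between observations of discrete factors. The sketch below shows only this simple variant with a made-up 0/1/2 genotype coding; the flexible coefficient of Ickstadt et al. (2006) and its combination with the simple one are not reproduced here:

```python
def simple_matching_distance(x, y):
    """Proportion of positions at which two vectors of discrete factors
    disagree; 0 means identical, 1 means fully distinct."""
    if len(x) != len(y):
        raise ValueError("vectors must have equal length")
    return sum(a != b for a, b in zip(x, y)) / len(x)

# two genotype vectors coded as 0/1/2 copies of the minor allele (illustrative)
g1 = [0, 1, 2, 0, 1]
g2 = [0, 1, 0, 2, 1]
print(simple_matching_distance(g1, g2))  # 2 of 5 loci differ -> 0.4
```

In localized logistic regression such a distance controls how strongly each training observation is weighted when a local model is fit around a query point.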
− 129 −
Localized Classification Using Mixture Models<br />
Julia Schiffner and Claus Weihs<br />
Faculty of Statistics, Dortmund University of Technology, 44221 Dortmund,<br />
Germany,<br />
schiffner@statistik.uni-dortmund.de<br />
Abstract. In the literature a variety of classification methods can be found that can<br />
be called ‘local’ because they concentrate – in different senses – on one or multiple<br />
small regions of the data space. One type of local method that may be beneficial<br />
in the case of heterogeneous classes is based on mixture models. It is assumed that<br />
data are generated by a finite number of sources and that each source can produce<br />
data of one or multiple classes. Models valid for single sources can be referred to<br />
as ‘local models’ that can be aggregated to a global mixture model. Mixture based<br />
classification methods have been described by several authors (see references), but<br />
the relationships and differences between the underlying models are not clear. A<br />
consistent description of these models and the resulting Bayes classification rules is<br />
presented. Moreover, it is shown how Bayes rules can be derived if in distinct local<br />
models different variable subsets separate the classes. Finally, several methods for<br />
class posterior estimation are described and an application to sound data is shown,<br />
where the register of different instruments is predicted by timbre.<br />
Key words: Local classification methods, Mixture models, Bayes rules<br />
References<br />
Hastie, T. J. and Tibshirani, R. J. (1996): Discriminant Analysis by Gaussian Mixtures.<br />
Journal of the Royal Statistical Society B, 58(1), 155–176.<br />
Szepannek, G. and Weihs, C. (2006): Local Modelling in Classification on Different<br />
Feature Subspaces. In: P. Perner (Ed.): Advances in Data Mining. Springer,<br />
Berlin, 226-238.<br />
Titsias, M. K. and Likas, A. C. (2001): Shared Kernel Models for Class Conditional<br />
Density Estimation. IEEE Transactions on Neural Networks, 12(5), 987–997.<br />
Titsias, M. K. and Likas, A. C. (2002): Mixture of Experts Classification Using a<br />
Hierarchical Mixture Model. Neural Computation, 14, 2221–2244.<br />
Weihs, C., Szepannek, G., Ligges, U., Luebke, K., and Raabe, N. (2006): Local Models<br />
in Register Classification by Timbre. In: V. Batagelj, H.-H. Bock, A. Ferligoj,<br />
and A. Ziberna (Eds.): Data Science and Classification. Springer, Berlin, 315-<br />
322.<br />
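The aggregation of local models into a global Bayes rule can be illustrated numerically: each source contributes to the class posterior in proportion to its responsibility for the observation. The one-dimensional Gaussian sources and class probabilities below are invented for illustration and are not the models of the paper:

```python
import math

def gauss(x, mu, sigma):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# two local sources (mixture components); each may emit both classes
components = [
    {"weight": 0.5, "mu": -2.0, "sigma": 1.0, "class_probs": {"A": 0.9, "B": 0.1}},
    {"weight": 0.5, "mu":  2.0, "sigma": 1.0, "class_probs": {"A": 0.2, "B": 0.8}},
]

def posterior(x):
    """Class posteriors of the global mixture: aggregate the local models,
    weighting each source by its responsibility for x."""
    resp = [c["weight"] * gauss(x, c["mu"], c["sigma"]) for c in components]
    total = sum(resp)
    return {k: sum(r * c["class_probs"][k] for r, c in zip(resp, components)) / total
            for k in ("A", "B")}

print(posterior(-2.0))  # dominated by the first source, so class A is favoured
```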
− 130 −
Comparison of four estimators of the<br />
heterogeneity variance for meta-analysis<br />
Peter Schlattmann<br />
Dept. of Biostatistics and Clinical Epidemiology<br />
Charité Universitätsmedizin, Charitéplatz 1, 10117 Berlin<br />
peter.schlattmann@charite.de<br />
Summary. The analysis of heterogeneity is a crucial part of every meta-analysis.<br />
To analyze heterogeneity, a random effects model which incorporates variation<br />
between studies is often considered. It is assumed that each study has its own<br />
(true) exposure or therapy effect and that there is a random distribution of these true<br />
exposure effects around a central effect. The variability between studies is quantified<br />
by the heterogeneity variance.<br />
In order to compare the performance of four estimators of the heterogeneity variance,<br />
a simulation study was performed. This study compared the DerSimonian-Laird<br />
(1986) estimator with the maximum likelihood estimator based on a normal distribution<br />
for the random effects. A further comparator was the simple heterogeneity<br />
(SH) variance estimator proposed by Sidik and Jonkman (2005).<br />
All of the aforementioned methods assume a normal distribution for the random<br />
effects. This assumption may or may not hold; thus an alternative estimator of<br />
the heterogeneity variance is based on a finite mixture model (Böhning, Dietz, and<br />
Schlattmann, 1998).<br />
This simulation study investigates these four estimators when sampling from<br />
discrete distributions, i.e. when the major assumption of a normal distribution for the<br />
random effects is not fulfilled. In this setting, bias, standard deviation and mean<br />
squared error (MSE) of all four estimators are examined.<br />
Key words: Meta-Analysis, Heterogeneity, Simulation, Finite mixture model<br />
References<br />
Böhning, D., Dietz, E. and Schlattmann, P. (1998): Recent developments in computer<br />
assisted mixture analysis. Biometrics, 54, 283–303.<br />
DerSimonian, R. and Laird, N. (1986): Meta-analysis in clinical trials. Controlled<br />
Clinical Trials, 7, 177–188.<br />
Sidik, K. and Jonkman, J. (2005): Simple heterogeneity variance estimation for<br />
meta-analysis. JRSS Series C, 54, 367–384.<br />
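Of the compared methods, the DerSimonian-Laird estimator has a simple closed form: a method-of-moments estimate built from Cochran's Q and the inverse-variance weights, truncated at zero. The effect sizes in the sketch below are made up for illustration:

```python
def dersimonian_laird(effects, variances):
    """Method-of-moments estimate of the between-study heterogeneity
    variance tau^2 (DerSimonian and Laird, 1986), truncated at zero."""
    w = [1.0 / v for v in variances]                 # inverse-variance weights
    s1 = sum(w)
    pooled = sum(wi * y for wi, y in zip(w, effects)) / s1
    q = sum(wi * (y - pooled) ** 2 for wi, y in zip(w, effects))  # Cochran's Q
    denom = s1 - sum(wi ** 2 for wi in w) / s1
    return max(0.0, (q - (len(effects) - 1)) / denom)

# made-up example: five study effects with equal within-study variances
effects = [-0.2, 0.1, 0.4, 0.8, 1.1]
variances = [0.04] * 5
print(round(dersimonian_laird(effects, variances), 3))  # -> 0.233
```

With homogeneous effects Q falls below its expectation k − 1 and the estimate is truncated to zero, which is one source of the bias the simulation study examines.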
− 131 −
Machine learning applications of positive<br />
definite kernels<br />
Prof. Dr. Bernhard Schölkopf<br />
MPI for Biological Cybernetics<br />
Spemannstrasse 38<br />
72076 Tübingen<br />
bernhard.schoelkopf@tuebingen.mpg.de<br />
Summary. Support vector machines and other kernel methods have become one of<br />
the most widely used techniques in the field of machine learning. I will present my<br />
thoughts on what made them popular and what may (or may not) keep them going.<br />
I will also discuss applications in different domains, including computer graphics.<br />
− 132 −
Age Distributions for costs in drug prescription by<br />
practitioners and for DRG-based hospital treatment<br />
Reinhard Schuster, Eva v. Arnstedt<br />
Medical Review Board of the Statutory Health Insurance in North Germany,<br />
23554 Lübeck, Germany, Reinhard.Schuster@mdk-nord.de<br />
Abstract. Purpose: We analyse age-dependent fractions of patients with costs above a threshold<br />
value, as a function of that value, both in drug prescription outside hospitals and in<br />
DRG-based hospital treatment. We compare the results of different German regions and<br />
different statutory insurances. The age-dependency of costs is highly important with<br />
respect to demographic changes. Design/Methodology/Approach/Algorithm: We use drug<br />
prescription data of practitioners and data of DRG-based hospital treatment from several statutory<br />
insurances and several regions. We use a nonparametric functional equation with a geometric<br />
background which generates a one-parametric family of log-concave distributions including<br />
the normal distribution. The mentioned functional equation is also related to Verhulst<br />
growth. Results: The data can be fitted by log-concave distributions and we obtain numerically<br />
stable computations. The respective logarithms are concave with respect to both variables, age<br />
and costs. We find that, independent of the absolute threshold value, there is always a decrease<br />
in the fraction of high-cost patients above a certain age, so we do not find a monotone increase<br />
of costs with age. Research Limitations/Implications: The statistically reported data<br />
basis for age-dependent costs is poor in general with respect to specific details, especially<br />
if a (pseudonymized) patient identifier is necessary. Practical Implications: Demographic<br />
changes are important for a large range of induced implications. It is often assumed that<br />
costs increase strictly with age. If this turns out not to be true in general, costs<br />
depend much more sensitively on the exact (demographically changing) age distribution<br />
of the population, which should be analysed in that direction. Originality/Value: The age-dependent<br />
resolution of officially published statistical reports is poor in general. We state a stable<br />
non-parametric model with high resolution.<br />
Key words: Drug Application, Age Distribution, DRG-System, Statutory Health Insurance<br />
References<br />
SCHUSTER, R. (2003): Komponentenzerlegungen, Strukturen und Invarianten zu GKV-<br />
Arzneimittelverordnungsdaten. Journal of Public Health, 4, 293–305.<br />
− 133 −
The Late Neolithic flint axe production on the<br />
Lousberg (Aachen, Germany) — An<br />
extrapolation of supply and demand and<br />
population density<br />
Daniel Schyle<br />
Institut für Ur- und Frühgeschichte, Universität zu Köln<br />
daniel.schyle@uni-koeln.de<br />
Abstract. The tabular flint seams within the cretaceous limestone slab once covering<br />
the Lousberg in Aachen (Germany) were completely exploited by systematic<br />
opencast mining during the time between approximately 3800 and 3000 years<br />
CalBC. The Lousberg-flint, easily identifiable by its tabular shape and its characteristic<br />
colours, was processed on-site almost exclusively for the production of<br />
axe-roughouts, which were distributed over distances up to 280 km mainly to Westphalia,<br />
but also to Hessen, Rheinland-Pfalz and into Belgium and the Netherlands.<br />
An excavation at the Lousberg was carried out under the direction of J. Weiner<br />
between 1978 and 1981. This contribution presents an extrapolation of the total<br />
amount of axe-roughouts produced at the site, based on the results of refittings and<br />
the counts of random samples of the knapping waste excavated from the mining<br />
dumps. The corresponding demand for axes per household and generation is estimated<br />
from axe distributions and frequencies in several well dated and preserved<br />
lakeshore dwellings of Southern Germany and Switzerland. To estimate the population<br />
density within the distribution area of Lousberg-axes, which is almost devoid of<br />
Late Neolithic settlement traces other than only roughly dated surface assemblages,<br />
the approximate size of the core-distribution area is determined by the site density<br />
mapping method (”Isolinien-Fundstellendichtekartierung”) recently developed by A.<br />
Zimmermann and collaborators of the Institut für Ur- und Frühgeschichte at the University<br />
of Cologne. The contribution will focus on the problems in comparing the<br />
results based on the distribution of Lousberg-axes to the results recently obtained<br />
on settlement distributions of the Linearbandkeramik (LBK) in the Rhineland. The<br />
research is part of a project aimed at the final publication of the Lousberg finds,<br />
which was funded by the Deutsche Forschungsgemeinschaft (DFG).<br />
− 134 −
Time Related Features for Alarm Classification<br />
in Intensive Care Monitoring<br />
Wiebke Sieben<br />
Department of Statistics, Technische Universität Dortmund, 44227 Dortmund,<br />
Germany sieben@statistik.tu-dortmund.de<br />
Abstract. Traditional patient monitoring systems in intensive care are based on<br />
simple threshold alarms. These systems compare the measurement of a vital sign<br />
with a threshold set by the clinical staff and trigger an alarm when the threshold<br />
is crossed. Although there are some more sophisticated rules already incorporated<br />
in modern monitoring devices, the false alarm rate has remained very high (Tsien<br />
and Fackler, 1997; Chambrin, 2001). Machine learning techniques, particularly decision<br />
trees, have proven suitable for alarm classification. As the misclassification rate<br />
of non life-threatening situations is to be minimized under the constraint that the<br />
misclassification rate of life-threatening situations is close to zero, standard techniques<br />
need to be improved. Modified Random Forests (Sieben, Gather 2007) have<br />
been shown to do this successfully. So far only the measurements of the point in<br />
time when an alarm was triggered were used for classification. As physicians always<br />
take the character of changes over time in a patient's health status into account for<br />
a diagnosis, there might be valuable information to be extracted from the time<br />
series. We study the use of time related features in combination with the modified<br />
Random Forest approach in terms of improvements in the classification results.<br />
References<br />
CHAMBRIN, M.-C. (2001): Alarms in the Intensive Care Unit: How Can the Number<br />
of False Alarms Be Reduced?. Critical Care, 5 (4), 184–188.<br />
SIEBEN, W. and GATHER, U. (2007): Classifying Alarms in Intensive Care - Analogy<br />
to Hypothesis Testing. In: R. Bellazzi, A. Abu-Hanna and J. Hunter (Eds.):<br />
Proceedings of the 11th Conference on Artificial Intelligence in Medicine, LNCS<br />
Vol. 4594/2007, Springer, Berlin/Heidelberg, 130–138.<br />
TSIEN, C.L., FACKLER, C. (1997): Poor Prognosis for Existing Monitors in the<br />
Intensive Care Unit. Critical Care Medicine, 25 (4), 614–619.<br />
Key words: Classification, Intensive care monitoring, False alarms<br />
− 135 −
’CMA’ - Steps in developing a comprehensive<br />
R-toolbox for classification with microarray<br />
data and other high-dimensional problems<br />
Martin Slawski, Anne-Laure Boulesteix, and Martin Daumer<br />
Sylvia Lawry Centre for MS Research, Hohenlindenerstr. 1, D-81677 München<br />
Martin.Slawski@campus.lmu.de, boulesteix@slcmsr.org, daumer@slcmsr.org<br />
Abstract. Microarray studies have stimulated the development of new approaches<br />
and motivated the adaptation of known traditional methods for class prediction with<br />
high-dimensional data. There already exist numerous software packages implementing<br />
single methods for microarray-based classification and, in addition, two synthesis<br />
packages: MLInterfaces by V. Carey and R. Gentleman (2007) and MCRestimate<br />
by Ruschhaupt et al. (Stat Appl Genet Mol Biol 2004, 3:37), available from the<br />
www.bioconductor.org platform. Conceptually, the R package CMA is more related<br />
to the second one, focussing on comparative model evaluation according to accepted<br />
'good practice' standards/guidelines (Dupuy and Simon, J Natl Cancer Inst 2007,<br />
99:147-157), an aspect neglected by the still widely used MLInterfaces. In<br />
a nutshell, CMA provides a uniform interface to a total of more than 20 supervised<br />
classification methods, comprising classical approaches such as discriminant analysis<br />
or penalized multinomial logistic regression, dimension reduction by Partial Least<br />
Squares, and more sophisticated methods, e.g. Support Vector Machines, Neural<br />
Networks or boosting techniques.<br />
The evaluation of the constructed classifiers is based on repeated splittings into<br />
learning and test sets or related approaches (e.g. bootstrap). For each learning set<br />
separately, variable selection can be performed optionally, either by a collection of<br />
simple tests or by advanced techniques such as the lasso, elastic net or componentwise<br />
boosting. In the last step, hyperparameter optimization and model evaluation<br />
are carried out via a ’nested’ cross-validation procedure. The outer loop is used for<br />
classifier evaluation while appropriate values for the hyperparameters are determined<br />
in the inner loop.<br />
CMA is implemented entirely in S4 classes (J. Chambers, Programming with data,<br />
1998). Its modular construction makes the incorporation of new methods easy. Furthermore,<br />
it is intended to be user-friendly by providing a multitude of pre-defined<br />
methods for summarizing and visualizing classifier evaluation and comparison.<br />
A preliminary version of CMA is planned to be available in the next Bioconductor<br />
release in April 2008.<br />
Key words: High-dimensional data, Classification, Validation, Statistical software<br />
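The nested scheme is package-independent and can be mocked up in a few lines. The sketch below is in Python (CMA itself is an R package) and uses a toy one-dimensional nearest-neighbour classifier whose neighbourhood size k stands in for the hyperparameters; data, seeds and fold counts are illustrative assumptions:

```python
import random

def kfold(n, k, seed=0):
    """Split the indices 0..n-1 into k shuffled folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_error(train, test, data, labels, k):
    """Misclassification rate of a k-nearest-neighbour vote on 1-D data."""
    errors = 0
    for i in test:
        nearest = sorted(train, key=lambda j: abs(data[i] - data[j]))[:k]
        votes = [labels[j] for j in nearest]
        errors += max(set(votes), key=votes.count) != labels[i]
    return errors / len(test)

def nested_cv(data, labels, ks=(1, 3, 5), outer=5, inner=4):
    """Outer loop: classifier evaluation. Inner loop, run on the outer
    training part only: hyperparameter choice."""
    fold_errors = []
    for fold in kfold(len(data), outer):
        train = [i for i in range(len(data)) if i not in fold]
        inner_folds = kfold(len(train), inner, seed=1)
        def inner_error(k):
            return sum(
                knn_error([train[i] for i in range(len(train)) if i not in f],
                          [train[i] for i in f], data, labels, k)
                for f in inner_folds) / inner
        best_k = min(ks, key=inner_error)   # tuned without touching the held-out fold
        fold_errors.append(knn_error(train, fold, data, labels, best_k))
    return sum(fold_errors) / outer

# toy two-class data with well-separated class means
random.seed(1)
data = [random.gauss(0, 1) for _ in range(30)] + [random.gauss(3, 1) for _ in range(30)]
labels = [0] * 30 + [1] * 30
print(round(nested_cv(data, labels), 3))  # low error on separated classes
```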
− 136 −
Generating Collective Intelligence<br />
Vassilios Solachidis 1 , Phivos Mylonas 2 , Andreas Geyer-Schulz 3 , Bettina<br />
Hoser 3 , Sam Chapman 4 , Fabio Ciravegna 4 , Steffen Staab 5 , Costis<br />
Kontopoulos 6 , Ioanna Gkika 6 , Pavel Smrz 7 , Yiannis Kompatsiaris 1 , and<br />
Yannis Avrithis 2<br />
1 Centre of Research and Technology Hellas, Informatics and Telematics Institute,<br />
Km Thermi-Panorama Road, Thermi-Thessaloniki, GR 570 01, Greece, {vsol,<br />
ikom}@iti.gr<br />
2 National Technical University of Athens, Image, Video and Multimedia Systems<br />
Laboratory, Iroon Polytechneiou 9, Zographou Campus, Athens, GR 157 80,<br />
Greece, {fmylonas, iavr}@image.ntua.gr<br />
3 Department of Economics and Business Engineering, Information Service and<br />
Electronic Markets, Kaiserstraße 12, Karlsruhe 76128, Germany,<br />
{andreas.geyer-schulz, bettina.hoser}@kit.edu<br />
4 University of Sheffield, Department of Computer Science, Regent Court, 211<br />
Portobello Street, S1 4DP, Sheffield, UK, {s.chapman, fabio}@dcs.shef.ac.uk<br />
5 Universität Koblenz-Landau, Information Systems and Semantic Web,<br />
Universitätsstraße 1, 57070 Koblenz, Germany, staab@uni-koblenz.de<br />
6 Vodafone-Panafon (Greece), Technology Strategic Planning - R&D Dept.,<br />
Tzavella 1-3, Halandri, 152 31, Greece, {Costis.Kontopoulos,<br />
Ioanna.Gkika}@vodafone.com<br />
7 Brno University of Technology, Faculty of Information Technology, Bozetechova<br />
2, CZ-61266 Brno, Czech Republic, smrz@fit.vutbr.cz<br />
Abstract. In this paper we provide a foundation for a new generation of services<br />
and tools. We define new ways of capturing, sharing and reusing information and<br />
intelligence provided by single users and communities, as well as organizations by<br />
enabling the extraction, generation, interpretation and management of Collective<br />
Intelligence from user generated digital multimedia content. Different layers of intelligence<br />
will be generated, which together constitute the notion of Collective Intelligence.<br />
The latter emerges from the collaboration and competition among many<br />
individuals and forms an intelligence that seemingly has a mind of its own. The<br />
automatic generation of Collective Intelligence constitutes a departure from traditional<br />
methods for information sharing, since information from both the multimedia<br />
content and social aspects will be merged, while at the same time the social dynamics<br />
will be taken into account. In the context of this work, we shall present two case<br />
studies. Initially, an Emergency Response case study will be tackled, where users<br />
provide intelligence about large scale emergencies, empowering a more effective and<br />
informed emergency action and at the same time receive information on how to act.<br />
A Consumers Social Group case study will follow, providing enhanced publishing<br />
tools to support group activities (e.g. organization of team events) and the ability<br />
to extract meta-information from content sources and group discussions. Both use<br />
cases demonstrate the important effect of Collective Intelligence as well as its leverage<br />
for private, commercial and public purposes.<br />
− 137 −
Analysis of polyphonic musical time series<br />
Katrin Sommer and Claus Weihs<br />
Lehrstuhl für Computergestützte Statistik<br />
Technische Universität Dortmund, D-44221 Dortmund<br />
sommer@statistik.tu-dortmund.de<br />
Abstract. A general model for pitch tracking of polyphonic musical time series will<br />
be introduced. Based on a model of Davy and Godsill (2002), the different pitches<br />
of the musical sound are estimated simultaneously with MCMC methods. Additionally,<br />
a preprocessing step is designed to improve the estimation of the fundamental<br />
frequencies (Sommer and Weihs, 2008). The preprocessing step compares real audio<br />
data with an alphabet that is constructed from the McGill Master Samples (Opolko<br />
and Wapnick, 1987) and consists of tones of different instruments. The tones with<br />
minimal Itakura-Saito distortion (Gray et al., 1980) are chosen as first estimates<br />
and as starting points for the MCMC algorithms. Furthermore, the implementation<br />
of the alphabet provides an approach to the recognition of the instruments generating<br />
the musical time series. Results are presented for mixed monophonic data from McGill<br />
and for self-recorded polyphonic audio data.<br />
Key words: MCMC, musical time series, polyphony, alphabet<br />
References<br />
Davy, M. and Godsill, S. J. (2002): Bayesian Harmonic Models for Musical Pitch<br />
Estimation and Analysis. Technical Report 431. Cambridge University Engineering<br />
Department.<br />
Gray, R., Buzo, A., Gray, A. and Matsuyama, Y. (1980): Distortion Measures for<br />
Speech Processing. IEEE Transactions on Acoustics, Speech, and Signal Processing<br />
ASSP-28, 367–376.<br />
Opolko, F. and Wapnick, J. (1987): McGill University Master Samples [Compact<br />
disc]. Montreal, Quebec: McGill University.<br />
Sommer, K. and Weihs, C. (2006): Using MCMC as a stochastic optimization procedure<br />
for music time series. In: V. Batagelj, H.-H. Bock, A. Ferligoj and A. Ziberna<br />
(Eds.): Data Science and Classification. Springer, Heidelberg, 307–314.<br />
Sommer, K. and Weihs, C. (2008): A comparative study on polyphonic musical<br />
time series using MCMC methods. In: C. Preisach, H. Burkhardt, L. Schmidt-<br />
Thieme and R. Decker (Eds.): Data Analysis, Machine Learning, and Applications.<br />
Springer, Berlin.<br />
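The alphabet-matching step relies on the Itakura-Saito distortion (Gray et al., 1980), whose common per-bin form is r - log r - 1 averaged over the spectrum; the "tone spectra" below are invented for illustration and are not the McGill samples:

```python
import math

def itakura_saito(p, q):
    """Itakura-Saito distortion between two power spectra p and q
    (non-negative; zero exactly when the spectra coincide)."""
    return sum(pi / qi - math.log(pi / qi) - 1 for pi, qi in zip(p, q)) / len(p)

# toy "alphabet": spectra of two reference tones, matched against an observation
tone_a = [1.0, 0.5, 0.25, 0.1]
tone_b = [0.1, 0.25, 0.5, 1.0]
observed = [0.9, 0.55, 0.2, 0.12]
best = min([tone_a, tone_b], key=lambda t: itakura_saito(observed, t))
print(best is tone_a)  # the observation is closest to tone A
```

In the paper's scheme, the alphabet entry with minimal distortion supplies the first pitch estimate and the starting point for the MCMC run.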
− 138 −
Trust as a Key Determinant of Loyalty and its<br />
Moderators<br />
Angela Sommerfeld 1<br />
Institut für Marketing, Humboldt-Universität zu Berlin angelaso@umich.edu<br />
Abstract. Theorizing that successful relational exchanges are motivated by trust<br />
and commitment, theory implicitly assumes that transactional and weak relational<br />
exchanges are not similarly motivated. Accordingly, Garbarino & Johnson (1999) showed<br />
that trust is a peripheral evaluation, predictive of purchase intentions not in weak<br />
relationships (individual ticket buyers) but only in strong ones (theatre subscribers).<br />
Extending their work, we take a more theory-based approach to develop and test<br />
moderating hypotheses of the trust-purchase intention relation beyond their moderator,<br />
the type of contractual relationship. Based on a survey of 575 business-to-business<br />
customers, we test the following proposed moderators: two facets of perceived risk,<br />
notably performance risk and consequentiality, perceived switching cost, and length<br />
of the relationship between companies. Different methods have been employed to test<br />
the moderations (Multiple-Groups, Kenny-Judd-Models, and Quasi-ML). Especially<br />
the Quasi-ML method, which we apply for a simultaneous test of several moderation<br />
hypotheses, represents a statistically efficient estimation method for SEMs with<br />
multiple latent interaction effects (Klein & Muthen 2007). Depending on the method<br />
and their varying properties, several hypotheses could be confirmed. The paper seeks<br />
to make three key contributions. First, it gives a theory-based account of boundary<br />
conditions for the relevance of trust in exchange relationships between companies.<br />
Since there have been conflicting opinions on the role of risk in exchange between<br />
companies, a second contribution of the paper is to clarify this role by thoroughly<br />
testing both moderating and mediating hypotheses. Testing interactions in a structural<br />
equation framework is not a straightforward task. Thus a third contribution is<br />
to illustrate strengths and weaknesses of different methods for a substantive research<br />
question with a real world data set.<br />
Key words: Trust, Risk, Switching Cost, Multiple<br />
References<br />
Klein, A.G. and Muthen, B.O. (2007): Quasi Maximum Likelihood Estimation of<br />
Structural Equation Models With Multiple Interaction and Quadratic Effects.<br />
Multivariate Behavioral Research, 42, 647–673.<br />
− 139 −
Generating Fictitious Training Data for Credit<br />
Client Classification<br />
Klaus B. Schebesch 1 and Ralf Stecking 2<br />
1 Faculty of Economics, University "Vasile Goldis", Arad, Romania
kbsbase@gmx.de
2 Faculty of Economics, University of Oldenburg, D-26111 Oldenburg
ralf.w.stecking@uni-oldenburg.de
Abstract. In recent work we started investigating the effects of using fictitious<br />
training examples in addition to the empirical training examples for a credit scoring<br />
problem. Fictitious training points added by a very simple procedure lead to<br />
some interesting effects in the context of SVM (support vector machine) classifier<br />
modeling. For instance, the resulting out-of-sample performance measures of such<br />
preliminary models are not entirely obvious. However, by using SVMs, we can also
observe the change in support vector formation subject to fictitious training points.
Such information may prove instrumental in producing fictitious training points
which are (more) problem dependent. We also explore connections to generative,
similarity-based and template-based learning, which have received some attention
in the recent classification literature in a related context. We then report on the results of
using different types of fictitious training examples in SVM credit client classification.<br />
Finally, in order to generalize these results, an evaluation of SVMs with different
kernel functions on various fictitious training data sets is presented.
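A minimal sketch of one "very simple procedure" for generating fictitious training examples; Gaussian jittering of the empirical points while keeping their labels is an assumption for illustration, since the abstract does not spell out the procedure used:

```python
import random

def fictitious_points(X, y, scale=0.1, copies=1, seed=0):
    """Create jittered copies of empirical credit-client records,
    keeping each copy's class label."""
    rng = random.Random(seed)
    Xf, yf = [], []
    for _ in range(copies):
        for row, label in zip(X, y):
            Xf.append([v + rng.gauss(0.0, scale) for v in row])
            yf.append(label)
    return Xf, yf

X = [[0.2, 1.0], [0.8, 0.1]]   # two toy credit clients
y = [1, -1]                    # good / bad
Xf, yf = fictitious_points(X, y, copies=3)
X_train, y_train = X + Xf, y + yf   # empirical + fictitious training set
```

An SVM trained on `X_train` can then be compared with one trained on `X` alone to observe the change in support vector formation.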
Key words: Fictitious training data, Data similarity, Support vector machine,<br />
Credit scoring<br />
References<br />
DUIN, R.P.W. and PEKALSKA, E. (2007): The Science of Pattern Recognition.<br />
Achievements and Perspectives. In: W. Duch, J. Mandziuk (eds.), Challenges<br />
for Computational Intelligence, Studies in Computational Intelligence, Springer<br />
HOCHREITER, S. and OBERMAYER, K. (2006). Support vector machines for<br />
dyadic data. Neural Computation, 18, 1472-1510<br />
LAUB, J., ROTH, V., BUHMANN, J.M. and MÜLLER, K. (2006): On the information
and representation of non-Euclidean pairwise data. Pattern Recognition,
39, 1815-1826.
STECKING, R. and SCHEBESCH, K.B. (2007): Improving Classifier Performance
by Using Fictitious Training Data: A Case Study. Accepted for publication in
Operations Research Proceedings 2007.
− 140 −
Clustering Association Rules with<br />
Fuzzy Concepts<br />
Matthias Steinbrecher 1 and Rudolf Kruse 1<br />
Department of Knowledge Processing and Language Engineering<br />
Otto-von-Guericke University of Magdeburg<br />
Universitätsplatz 2, 39106 Magdeburg, Germany<br />
{msteinbr,kruse}@iws.cs.uni-magdeburg.de<br />
Abstract. Association rules constitute a widely accepted technique to identify frequent
patterns inside huge volumes of data. Practitioners prefer the straightforward
interpretability of rules; however, depending on the nature of the underlying data,
the number of induced rules can be intractably large. Even reasonably sized result
sets may contain a large number of rules that are uninteresting to the user because
they are too general, are already known, or do not match other user-related intuitive
criteria. We allow the user to model his conception of interestingness by means of linguistic
expressions on rule evaluation measures and compound propositions of higher
order (i.e., temporal or spatial changes of rule properties). Multiple such linguistic
concepts can be considered a set of fuzzy patterns [?] and allow for the partition of
the initial rule set into fuzzy fragments that contain rules of similar membership to a
user's concept [?,?,?]. With appropriate visualization methods that extend previous
rule set visualizations [?], we allow the user to instantly assess the matching of his
concepts against the rule set.
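A linguistic concept on a rule evaluation measure can be sketched as a fuzzy set; the trapezoidal shape, the concept "high confidence", and the rules below are illustrative assumptions, not the paper's own concepts:

```python
def trapezoid(x, a, b, c, d):
    """Membership of x in a trapezoidal fuzzy set:
    rising on [a,b], flat on [b,c], falling on [c,d]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

# hypothetical association rules with a confidence measure
rules = {"A=>B": 0.95, "C=>D": 0.70, "E=>F": 0.40}

# linguistic concept "high confidence" (d just above 1.0 so that
# a confidence of exactly 1.0 keeps full membership)
high = {r: trapezoid(conf, 0.5, 0.8, 1.0, 1.01) for r, conf in rules.items()}

# fuzzy fragment of rules matching the concept, via an alpha-cut at 0.5
fragment = {r for r, m in high.items() if m >= 0.5}
```

Several such concepts partition the rule set into fuzzy fragments of similar membership, as described above.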
Key words: Association Rules, Fuzzy Clustering, Exploratory Data Analysis<br />
References<br />
1.Dubois, D., Prade, H., Testemale, C.: Weighted Fuzzy Pattern Matching. Fuzzy<br />
Sets and Systems 28(3) (1988) 313–331<br />
2.Höppner, F., Klawonn, F., Kruse, R., Runkler, T.: Fuzzy Clustering. Wiley,<br />
Chichester, United Kingdom (1999)<br />
3.Döring, C., Lesot, M.J., Kruse, R.: Data Analysis with Fuzzy Clustering Methods.<br />
Computational Statistics & Data Analysis 51(1) (2006) 192–214<br />
4.Kruse, R., Döring, C., Lesot, M.J.: Fundamentals of fuzzy clustering. In<br />
de Oliveira, J.V., Pedrycz, W., eds.: Advances in Fuzzy Clustering and its Applications.<br />
John Wiley & Sons (2007) 3–30<br />
5.Steinbrecher, M., Kruse, R.: Visualization of Possibilistic Potentials. In: Foundations<br />
of Fuzzy Logic and Soft Computing. Volume 4529 of Lecture Notes in<br />
Computer Science., Springer Berlin / Heidelberg (2007) 295–303<br />
− 141 −
Who’s Afraid of Statistics? – Measurement and<br />
Predictors of Statistics Anxiety in German<br />
University Students<br />
Carolin Strobl 1 and Friedrich Leisch 2<br />
1 Institut für Statistik, Ludwig-Maximilians-Universität München<br />
carolin.strobl@stat.uni-muenchen.de<br />
2 friedrich.leisch@stat.uni-muenchen.de<br />
Abstract. The measurement of statistics anxiety and the relationship between
statistics anxiety and several socio-demographic and educational factors were investigated
in a survey of over 600 German university students. The attitude towards
statistics was measured by means of the Affect and Cognitive Competence scales of<br />
the Survey of Attitudes Towards Statistics (SATS, Schau et al., 1995). Additional<br />
items covered, amongst others, prior mathematics experience and achievement, time<br />
and activity since high school graduation as well as items on the students’ strategy<br />
applied in mathematics courses, which was not considered in earlier studies. An<br />
anxiety indicator was derived from the SATS scales by means of cluster analysis<br />
in order to separate a group of students with high levels of statistics anxiety from<br />
those with moderate and low levels of anxiety. Using this anxiety indicator as the<br />
response, a set of relevant predictor variables was identified by means of random forest<br />
variable importance scores and further explored in a logistic regression model.<br />
Our results show that the SATS Affect and Cognitive Competence scales are well
suited for identifying students with high levels of negative attitudes towards statistics,
even though potential effects of the translation into German were noticeable<br />
for the positively worded items. Predictors found relevant for statistics anxiety were<br />
gender, mathematics taken as an intensive course in high school, prior (perceived)<br />
mathematics achievement, prior mathematics experience as well as two of the newly<br />
included items on students’ strategy applied in mathematics courses in high school:<br />
Students who named practicing as their strategy were less likely, while students who<br />
named memorizing as their strategy were more likely to show statistics anxiety.<br />
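Deriving a binary anxiety indicator from composite scores via cluster analysis can be sketched with a simple one-dimensional 2-means (the scores below are hypothetical; the study clusters the actual SATS scales):

```python
def two_means_1d(scores, iters=20):
    """Split scores into two clusters (low / high) with a 1-D 2-means."""
    lo, hi = min(scores), max(scores)
    for _ in range(iters):
        a = [s for s in scores if abs(s - lo) <= abs(s - hi)]
        b = [s for s in scores if abs(s - lo) > abs(s - hi)]
        lo, hi = sum(a) / len(a), sum(b) / len(b)
    # 1 = high-anxiety cluster, 0 = moderate/low cluster
    return [int(abs(s - lo) > abs(s - hi)) for s in scores]

scores = [1.2, 1.5, 1.1, 4.0, 4.3, 3.8]  # hypothetical anxiety composites
indicator = two_means_1d(scores)
```

The resulting binary indicator can then serve as the response in a random forest or logistic regression, as in the abstract.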
Key words: Attitude towards statistics, SATS, Statistics education<br />
References<br />
Schau, C., Stevens, J., Dauphinee, T. L. and Vecchio, A. D. (1995): The development<br />
and validation of the survey of attitudes toward statistics. Educational and<br />
Psychological Measurement, 55 (5), 868–875.<br />
− 142 −
A New, Conditional Variable Importance<br />
Measure for Random Forests<br />
Carolin Strobl 1 and Achim Zeileis 2<br />
1 Department of Statistics, Ludwig-Maximilians-Universität München<br />
carolin.strobl@stat.uni-muenchen.de<br />
2 Department of Statistics and Mathematics, Wirtschaftsuniversität Wien<br />
Achim.Zeileis@wu-wien.ac.at<br />
Abstract. Random forests are becoming increasingly popular in many scientific<br />
fields for assessing the importance of predictor variables (cf., e.g., Lunetta et al.,<br />
2004) because they can cope with “small n large p” problems, complex interactions<br />
and even with highly correlated predictor variables. Their variable importance can<br />
help identify relevant predictors even if they are highly correlated, while in classical<br />
regression models often only one representative of a group of correlated predictors is<br />
included. However, currently used variable importance measures can be biased, e.g.,
towards variables with many categories (Strobl et al., 2007) or correlated predictor
variables (Archer and Kimes, 2008). While the former issue can be addressed by
changing the resampling scheme in the tree growing process (Strobl et al., 2007),<br />
the latter is due to the permutation scheme employed in the computation of the<br />
variable importance. Here we suggest a new, conditional permutation scheme that<br />
is more suited to measure the degree of association of each predictor variable with<br />
the response. The resulting conditional variable importance can be used to rank the<br />
predictor variables more reliably.<br />
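The core idea of the conditional permutation scheme, permuting a predictor only within strata defined by correlated covariates so that its association with those covariates is preserved, can be sketched as follows (toy data; the actual proposal operates inside the random forest's tree partitions):

```python
import random

def permute_within(values, groups, seed=0):
    """Permute `values` only within the strata given by `groups` --
    the core idea of conditional permutation importance."""
    rng = random.Random(seed)
    out = list(values)
    for g in set(groups):
        idx = [i for i, gr in enumerate(groups) if gr == g]
        vals = [values[i] for i in idx]
        rng.shuffle(vals)
        for i, v in zip(idx, vals):
            out[i] = v
    return out

x      = [1, 2, 3, 10, 11, 12]
strata = [0, 0, 0,  1,  1,  1]   # levels of a correlated covariate
xp = permute_within(x, strata)
```

After conditional permutation, each value stays inside its stratum, so the predictor's dependence on the stratifying covariate is preserved while its link to the response is broken; the drop in model accuracy then measures only the predictor's own contribution.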
Key words: Feature selection, Correlation, Variable importance, Permutation tests<br />
References<br />
Archer, K. and Kimes, R. (2008): Empirical characterization of random forest variable
importance measures. Computational Statistics & Data Analysis, 52(4),<br />
2249–2260.<br />
Lunetta, K.L., Hayward, L.B., Segal, J., Eerdewegh, P.V. (2004): Screening largescale<br />
association study data: Exploiting interactions using random forests. BMC<br />
Genetics, 5:32.<br />
Strobl, C., Boulesteix, A.-L., Zeileis, A. and Hothorn, T. (2007): Bias in random<br />
forest variable importance measures: Illustrations, sources and a solution. BMC<br />
Bioinformatics, 8:25.<br />
− 143 −
Conjoint Analysis within the Field of Customer Satisfaction Problems:
A Model of Composite Product/Service
Piotr Tarka<br />
School of Banking in Poznan<br />
Department of Organization and Management<br />
Poland<br />
piotr.tarka@wsb.poznan.pl<br />
Abstract. This paper describes how the benefits of conjoint analysis can be
adapted to measuring performance criteria in the customer service area. The paper
points out how a single composite model can be built, incorporating a wide
range of key customer choice criteria, including service. The author draws attention
to some specific problems. One of them is the apparent distortion in computed utility
values that arises in circumstances where global macro variables are traded off against
more micro topics. This can lead to dramatic underestimation of the overall contribution
or importance of macro issues. To address this concern, the author discusses an approach
known as dual scaling for eliminating the bias. Another drawback to the approach
in customer service studies is the limited number of variables that can be addressed
by a typical conjoint study. This makes it difficult to cover the large range of service
topics typically examined in a customer satisfaction study. The paper argues
that this limits the scope of both classical conjoint studies and current customer
satisfaction approaches.
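One common way to summarize the overall contribution of an attribute in conjoint analysis is the range of its part-worth utilities relative to the sum of all ranges; the part-worths below are hypothetical:

```python
def attribute_importance(part_worths):
    """Relative importance of each attribute:
    range of its part-worth utilities / sum of all attribute ranges."""
    ranges = {a: max(u) - min(u) for a, u in part_worths.items()}
    total = sum(ranges.values())
    return {a: r / total for a, r in ranges.items()}

# hypothetical part-worth utilities per attribute level
pw = {"price":   [0.9, 0.3, -1.2],
      "service": [0.4, -0.4],
      "brand":   [0.2, -0.2]}
imp = attribute_importance(pw)
```

The distortion discussed in the abstract concerns exactly such importance shares: when a macro variable is traded off against micro topics, its range, and hence its computed importance, can be understated.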
Key words: Conjoint Analysis, Customer satisfaction problems<br />
− 144 −
Optimal VDSL Expansion Taking into Account
Infrastructure Restrictions
and Marketing Requirements
Klaus Thiel 1<br />
T-Online, T-Online-Allee 1, 64295 Darmstadt k.thiel@t-online.net<br />
Abstract. The expansion of the Very High Speed Subscriber Line (VDSL) network
in Germany is a prestigious infrastructure project worth billions. VDSL enables
a transfer rate of 50 megabits per second. With it, for example, so-called entertainment
customers can receive two movies in parallel in High Definition Television
(HDTV) quality while internet surfing and telephoning remain possible. Within the
B2B sector, many new applications, such as telecommuting in virtual teams around the
world and telemedicine, have become feasible. Currently Deutsche Telekom has deployed
VDSL in 27 cities, and for 2008 the VDSL expansion is planned for a further 23
cities. The optimal choice of the VDSL expansion areas primarily depends on infrastructure
restrictions as well as on marketing requirements. In order to execute a
spatial optimisation procedure, all the important infrastructure and marketing information
must be converted to vector data by digitizing and subsequently imported
into a Geo-Information-System (GIS).
The most important GIS providers in Germany are Microm with MicromGEO
and ESRI with ArcGIS. In order to select the most suitable system for the
problem at hand, both systems have to be evaluated on the basis of objective
test criteria. Test criteria are the quality of geo-referencing of address data and
the mapping quality of different spatial levels. In order to choose the most suitable
spatial level (e.g. city, post code, dialling code, municipality), several analyses have
been carried out. In the next step, a spatial scoring has been developed and imported
into the GIS in order to ensure that those areas with the highest VDSL customer
equity potential will be the first to be expanded. Finally, using the spatial scoring,
a spatial potential ranking has been calculated, on the basis of which the optimal
VDSL expansion can be planned and executed.
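The spatial scoring and potential-ranking step can be sketched as a weighted sum of normalized area indicators; the indicator names and weights below are illustrative assumptions, not the scoring actually used:

```python
def potential_ranking(areas, weights):
    """Rank areas by a weighted spatial score,
    highest customer-equity potential first."""
    def score(indicators):
        return sum(weights[k] * v for k, v in indicators.items())
    return sorted(areas, key=lambda a: score(a["indicators"]), reverse=True)

areas = [
    {"name": "A-town", "indicators": {"purchasing_power": 0.9, "density": 0.8}},
    {"name": "B-town", "indicators": {"purchasing_power": 0.4, "density": 0.9}},
    {"name": "C-town", "indicators": {"purchasing_power": 0.2, "density": 0.3}},
]
weights = {"purchasing_power": 0.7, "density": 0.3}
ranking = [a["name"] for a in potential_ranking(areas, weights)]
```

In the GIS, each area's score would additionally be constrained by the infrastructure restrictions mentioned above before the ranking is drawn up.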
Key words: Customer Equity, Geo-Information-System, Optimal VDSL Expansion<br />
− 145 −
Evaluating the Data Structure and Identifying
Homogeneous Spatial Units in the Database
"Sustainability issues in sensitive areas" of the
EU-FP6 Integrated Project SENSOR
Nguyen Xuan Thinh 1 , Leander Küttner 1 , and Gotthard Meinel 1<br />
Leibniz Institute of Ecological and Regional Development (IOER), Weberplatz 1,<br />
01217 Dresden, Germany, ng.thinh@ioer.de, l.kuettner@ioer.de,<br />
g.meinel@ioer.de<br />
Abstract. SENSOR (Sustainability Impact Assessment: Tools for Environmental,
Social and Economic Effects of Multifunctional Land Use in European Regions) is
an Integrated Project within the 6th Framework Research Programme of the European
Commission (33 research partners from 15 countries). The SENSOR project
is structured into seven interrelated modules M1-M7. For Module M6 "Sustainability
issues in sensitive areas", a database with more than 800 000 entries has
been established, in which Lusatia, Silesia, Eisenwurzen, the High Tatra, Valais, the Estonian
coastal zone, and Malta were selected as sensitive area case studies (SACS).
Using ACCESS, SPSS and ArcMap, we conduct a comparative analysis and evaluate
this M6 database with a view to the theoretical sustainability indicators defined
in Module M2. We then determine similarities and dissimilarities between data
from different SACS. By applying adequate cluster analysis we identify homogeneous
spatial units of selected SACS in order to find generalisable and specific sustainability
characteristics in the seven case studies. As an example, we describe the case
study of Lusatia in more detail. The area of Lusatia is divided into several local area
units (LAU2), which are qualified for a statistical examination by a high number of
entries and available variables. As a basis we choose a set of 25 variables related to
sustainable land use issues. Using factor analysis we determine the significant
variables and use them as representatives to characterise typical land use clusters. In
a next step the clusters were identified by a combination of hierarchical and k-means
cluster analysis methods. To describe the situation at different points in time and the
development in the period between them, we repeat the cross-sectional
analysis of 1996 for 2004. The results of the statistical analysis are presented in
ArcMap visualisations. Although the procedure and the variable base are the same,
the results differ and thus reveal the relevant land use trends within the general social
transformation process of the 1990s.
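The combination of hierarchical and k-means clustering can be sketched on one-dimensional toy values (the study works with 25 variables): agglomerative merging picks the number of clusters and the initial centroids, which k-means then refines.

```python
def hier_init(points, k):
    """Agglomerative (centroid-linkage) merging down to k clusters;
    returns the k initial centroids."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = sum(clusters[i]) / len(clusters[i])
                cj = sum(clusters[j]) / len(clusters[j])
                d = abs(ci - cj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters.pop(j)  # merge closest pair
    return [sum(c) / len(c) for c in clusters]

def kmeans_1d(points, centroids, iters=10):
    """k-means refinement of the hierarchically chosen centroids."""
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            groups[nearest].append(p)
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return sorted(centroids)

data = [0.1, 0.2, 0.15, 5.0, 5.2, 4.9]  # hypothetical LAU2 indicator values
cents = kmeans_1d(data, hier_init(data, 2))
```

The hierarchical step avoids k-means' sensitivity to random starting centroids, which is the usual motivation for combining the two methods.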
Key words: EU Integrated Project SENSOR, Comparative Analysis, Similarity,
Dissimilarity, SENSOR M6 Indicators, Cluster Analysis
− 146 −
Mining ideas from textual information<br />
Dirk Thorleuchter<br />
Fraunhofer Institut für Naturwissenschaftlich-Technische Trendanalysen,<br />
D-53879 Euskirchen, Appelsgarten 2, Germany<br />
dirk.thorleuchter@int.fraunhofer.de<br />
Abstract. This paper describes an approach to automatically find new technological
ideas in textual information. On the basis of Thorleuchter (2008), the existing
theoretical algorithm is extended with text mining techniques
such as stemming and term frequency weighting (Ferber (2003)) and with "creativity technique" approaches
from the literature (Dean et al. (2001)). The aim of the new algorithm is to
find ideas using a general stop word list, because up to now the existing approach
has been based on the inefficient use of a domain-specific stop word list created
specifically for the analyzed text.
This new approach is evaluated with non-proprietary data and is realized as a web-based
application, named "Technological Idea Miner", which can be used for further
testing and evaluation. The identified ideas are displayed taking into
account cognitive research findings, as described in Puppe et al. (2003).
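The stop-word-based preprocessing mentioned above can be sketched as follows (a tiny illustrative stop word list; the approach additionally uses stemming, which is omitted here):

```python
from collections import Counter

# tiny illustrative general stop word list, not the one actually used
GENERAL_STOP_WORDS = {"a", "an", "the", "of", "is", "and", "to", "in"}

def term_frequencies(text, stop_words=GENERAL_STOP_WORDS):
    """Tokenize, drop general stop words, count term frequencies."""
    tokens = [t for t in text.lower().split() if t.isalpha()]
    return Counter(t for t in tokens if t not in stop_words)

tf = term_frequencies("The sensor is an idea and the idea is new")
```

A general list like this replaces the domain-specific stop word list criticized in the abstract, so the same preprocessing works for any analyzed text.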
Key words: Text mining, Knowledge Discovery, Ideas
References<br />
Dean, G., Hender, J.M., Nunamaker, J.F. and Rodgers, T.L. (2001): Improving Group
Creativity. In: Sprague, R. (Ed.): Proceedings of the 34th Hawaii International
Conference on System Sciences - 2001. IEEE Publishing, Maui (USA),
1070.
Ferber, R. (2003): Information Retrieval. dpunkt.verlag, Heidelberg, 41.<br />
Puppe, F., Stoyan, H. and Studer, R. (2003): Knowledge Engineering. In: G. Görz,
C.-R. Rollinger and J. Schneeberger (Eds.): Handbuch der Künstlichen Intelligenz.
4th edition, Oldenbourg, München, 612.
Thorleuchter, D. (2007): Finding new technological ideas and inventions with text<br />
mining and technique philosophy. In: C. Preisach, H. Burkhardt, L. Schmidt-<br />
Thieme, R. Decker (Eds.): Data Analysis, Machine Learning, and Applications.<br />
Springer, Heidelberg-Berlin.<br />
− 147 −
Mining technologies in security and defense<br />
Dirk Thorleuchter<br />
Fraunhofer Institut für Naturwissenschaftlich-Technische Trendanalysen,<br />
D-53879 Euskirchen, Appelsgarten 2, Germany<br />
dirk.thorleuchter@int.fraunhofer.de<br />
Abstract. In recent years, the rising asymmetrical threat has caused governments
to pay more attention to security, especially in technological areas. New and ever
more complex tasks in areas concerned with defense against these new types of threat
require additional research and development of new techniques. For this reason,
national and European governments are increasingly funding security and defense
(S&D) related technological research.
In this paper, we give an overview of the technological landscape of S&D
by presenting different S&D technologies and their relationships, as described in
Geschka et al. (2005) and Reiß (2006). To this end, we first identify technologies
from different technological S&D taxonomies and secondly identify innovative
S&D research projects. The research projects are classified according to technologies,
and on that basis the relationships between the technologies are presented.
In detail, text documents are represented as vectors in a vector space model using
term frequency and corpus-based term co-occurrence data. We use Jaccard's coefficient
(Ferber (2003)) to measure similarity and a fuzzy alpha-cut method for
classification. Structured documents (XML) are used as data source and sink.
To realize this approach, we present a web application, the "S&D Technology Miner",
which supports research program planners and researchers who acquire
funding in this area, and which also serves for testing and evaluating the approach.
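Jaccard's coefficient on term sets and an alpha-cut for multi-label classification can be sketched as follows (the technology profiles and the threshold are illustrative assumptions):

```python
def jaccard(a, b):
    """Jaccard's coefficient between two term sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def alpha_cut_classify(doc_terms, technology_profiles, alpha=0.3):
    """Assign every technology whose similarity passes the alpha-cut."""
    sims = {t: jaccard(doc_terms, terms)
            for t, terms in technology_profiles.items()}
    return {t for t, s in sims.items() if s >= alpha}

profiles = {  # hypothetical S&D technology term profiles
    "sensors":  {"radar", "infrared", "detection"},
    "robotics": {"autonomous", "vehicle", "control"},
}
doc = {"radar", "detection", "network"}   # terms of a research project
labels = alpha_cut_classify(doc, profiles)
```

The alpha-cut turns the fuzzy similarity scores into a crisp, possibly multi-valued assignment of projects to technologies.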
Key words: Security, Defense, Technology, Text mining, Classification
References<br />
Ferber, R. (2003): Information Retrieval. dpunkt.verlag, Heidelberg, 78.<br />
Geschka, H., Schauffele, J. and Zimmer, C. (2005): Explorative Technologie-Roadmaps
- Eine Methodik zur Erkundung technologischer Entwicklungslinien
und Potenziale. In: M.G. Möhrle and R. Isenmann (Eds.): Technologie-Roadmapping.
Springer, Berlin, Heidelberg et al., 165.
Reiß, T. (2006): Innovationssysteme im Wandel - Herausforderungen für die Innovationspolitik.<br />
In: B. Müller and U. Glutsch (Eds.): Fraunhofer-Institut für<br />
System- und Innovationsforschung - Jahresbericht 2006. Karlsruhe, 10<br />
− 148 −
Multilevel Simultaneous Component Analysis<br />
for Studying Inter-individual and<br />
Intra-individual Variabilities<br />
Marieke E. Timmerman 1 , Anna Lichtwarck-Aschoff 1 , and Eva Ceulemans 2<br />
1 Heymans Institute for Psychology, University of Groningen,
Grote Kruisstraat 2/1, 9712 TS Groningen, The Netherlands
m.e.timmerman@rug.nl
2 Centre for Methodology of Educational Research, University of Leuven,
Belgium
Abstract. All psychological processes are dynamic. To fully understand those processes<br />
it is necessary to consider the intra-individual variation of individuals over<br />
time. In doing so, it is important to recognize that the nature of the processes may
differ across individuals. This intricate matter requires new modelling approaches.
We focus on the exploratory modelling of multivariate data that have been repeatedly<br />
gathered from more than one individual. We aim at identifying meaningful<br />
sources of both the inter-individual variability and the intra-individual variability<br />
in the observed variables, while expressing the similarities and differences in those<br />
sources across individuals. To this end, we use multilevel simultaneous component<br />
analysis (MLSCA; Timmerman, 2006).<br />
In essence, MLSCA specifies separate component models to account for inter-individual
and intra-individual variabilities. The latter may entail differences across
individuals, which are expressed via the covariances of the individuals' within-component
scores. The common within-loadings ensure comparability across
individuals. The relationships between MLSCA and the related multilevel and multigroup<br />
structural equation models will be discussed. The usefulness of MLSCA to<br />
grasp inter-individual and intra-individual variabilities is illustrated with an empirical<br />
example from a diary study focusing on emotions involved in daily conflicts<br />
between adolescent girls and their mothers.<br />
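The first step of such a multilevel decomposition, splitting each individual's data into a between part (individual means) and a within part (deviations), can be sketched before any components are extracted (hypothetical diary data):

```python
def between_within_split(data):
    """Split each individual's time series (rows = time points,
    cols = variables) into an individual-mean (between) part
    and a deviation (within) part."""
    between, within = {}, {}
    for person, rows in data.items():
        n = len(rows)
        means = [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]
        between[person] = means
        within[person] = [[r[j] - means[j] for j in range(len(r))] for r in rows]
    return between, within

# hypothetical diary data: 3 days x 2 emotion variables per girl
data = {"girl1": [[1.0, 2.0], [3.0, 4.0], [2.0, 3.0]],
        "girl2": [[5.0, 0.0], [7.0, 2.0], [6.0, 1.0]]}
between, within = between_within_split(data)
```

MLSCA then fits one component model to the between part and one, with common loadings, to the stacked within parts; the sketch covers only the preceding decomposition.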
Key words: exploratory modelling of longitudinal data, multivariate analysis<br />
References<br />
Timmerman, M.E. (2006). Multilevel Component Analysis. British Journal of Mathematical<br />
and Statistical Psychology, 59, 301–320.<br />
− 149 −
Issues Related to the Implementation of a<br />
Dynamic Logistic Model for Classifier<br />
Combination<br />
Amber Tomas<br />
The University of Oxford, 1 South Parks Road, Oxford OX2 3TG, United<br />
Kingdom tomas@stats.ox.ac.uk<br />
Abstract. We consider a model for classification of sequentially received observations,<br />
when the population of interest is not assumed to be stationary. The model<br />
we propose combines the outputs of a fixed set of component classifiers (chosen in<br />
advance), and the parameters of the combination are allowed to change over time.<br />
Specifically, we use a logistic Dynamic Generalized Linear Model [1] for combining<br />
the classifier outputs, and take a predictive approach towards estimation of the posterior<br />
class probabilities. The dynamics are incorporated through the equation for<br />
parameter evolution<br />
βt+1 = βt + ωt,   ωt ∼ N(0, Σt).   (1)
The implementation of this model when the distribution of the parameters is<br />
not assumed to be normal is not straightforward. In addition to computational<br />
complexity, there arise complications related to the identifiability of the parameters<br />
βt which are unique to classification problems. Specifically, although the classifications
produced as a result of using the model with parameters βt are equivalent to
the classifications when using parameters αβt, α > 0, the posterior class probabilities
are more extreme in the second case. This results in increased volatility of the<br />
classification rule when using a sequential MCMC method to estimate the posterior<br />
distribution of the parameters. We discuss why there is no simple constraint for the<br />
parameters which will alleviate this identifiability problem, and discuss an alternative<br />
approach. In addition, we consider the related problems of adaptively changing<br />
the effective value of Σt, and the consequences of using the model (1) when it is not<br />
assumed to be correct.<br />
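The scaling non-identifiability described above, namely that βt and αβt yield the same classifications while the posterior probabilities become more extreme as the parameters are scaled up, can be checked numerically:

```python
import math

def posterior(x, beta):
    """Logistic posterior probability of class 1 for feature vector x."""
    z = sum(xi * bi for xi, bi in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-z))

x, beta, alpha = [1.0, -0.5], [0.8, 1.2], 3.0
p1 = posterior(x, beta)                       # original parameters
p2 = posterior(x, [alpha * b for b in beta])  # scaled parameters
same_class = (p1 > 0.5) == (p2 > 0.5)         # classification unchanged
```

Because only the probabilities change, not the decision boundary, no likelihood-based constraint on the scale is available, which is the identifiability problem discussed in the abstract.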
Key words: Multiple Classifier Systems, Dynamic Classification, Identifiability<br />
References<br />
1.West, M., Harrison, J. and Migon, H. (1985): Dynamic Generalized Linear Models<br />
and Bayesian Forecasting. Journal of the American Statistical Association, 80,<br />
73–83.<br />
− 150 −
A Comprehensive Partial Least Squares<br />
Approach to Component-Based Structural<br />
Equation Modeling ⋆<br />
Laura Trinchera 1 and Vincenzo Esposito Vinzi 2<br />
1 Dipartimento di Matematica e Statistica, Università degli Studi di Napoli
Federico II. ltrinche@unina.it
2 ESSEC Business School of Paris and Singapore. vinzi@essec.fr
Abstract. PLS Path Modeling (PLS-PM) is generally meant as a component-based
approach to structural equation modeling that privileges a prediction-oriented
discovery process over the statistical testing of causal hypotheses. Differently
from covariance-based structural equation modeling (i.e. LISREL-type methods), in
PLS-PM latent variables are estimated as linear combinations of the manifest variables.<br />
Thus they are more naturally defined as emergent constructs (with formative<br />
indicators) rather than latent constructs (with reflective indicators). Nowadays, formative<br />
relationships are more and more used in real applications but pose a few<br />
problems for the statistical estimation and interpretation. As of today, formative<br />
relationships in PLS-PM imply multiple OLS regressions between each latent variable<br />
and its own formative indicators. As is well known, OLS regression may yield unstable
results in the presence of strong correlations between explanatory variables, and it is not
feasible when the number of statistical units is smaller than the number of variables
or when missing data affect the dataset. Thus, it seems quite natural to introduce
a PLS Regression (PLS-R) external estimation mode within the PLS-PM algorithm<br />
so as to overcome the mentioned problems, preserve the formative relationships and<br />
still remain coherent with the component-based and prediction-oriented nature of<br />
PLS-PM. Here, the main issues concerning the use of formative indicators in PLS-<br />
PM are investigated. Furthermore, the features of PLS-R may be fruitfully exploited<br />
in the internal estimation phase as well as for estimating path coefficients upon convergence<br />
of the PLS-PM algorithm when classical OLS estimates become unstable<br />
or even unfeasible. Finally, the case of formative indicators will be considered also<br />
with respect to clustering techniques recently proposed for latent class detection in<br />
PLS-PM.<br />
Key words: Formative Indicators, PLS Regression, Latent Factor Scores<br />
⋆ The participation of L. Trinchera to this research was supported by the MURST<br />
grant “Multivariate statistical models for the ex-ante and the ex-post analysis<br />
of regulatory impact”, coordinated by C. Lauro (2006). The participation of V.<br />
Esposito Vinzi to this research was supported by CERESSEC, Research Center<br />
of the ESSEC Business School.<br />
− 151 −
Relative Importance of Predictor Variables
in Support Vector Machine Models
Michał Trzesiok
Department of Mathematics,<br />
Katowice University of Economics, ul. Bogucicka 14, 40-226 Katowice<br />
trzesiok@ae.katowice.pl<br />
Abstract. Models resulting from Support Vector Machines suffer from a lack
of interpretability. It is usually very hard to extract knowledge about the analyzed
phenomenon from the classification model obtained by using SVMs because the<br />
classification task is realized in a high dimensional feature space. Although the<br />
method identifies the observations which are crucial for the form of the decision<br />
function, it does not show which variables are relevant and which are redundant.<br />
Vapnik claims that feature selection is not necessary for SVMs, i.e. building the<br />
model on a set of variables including some redundant variables does not change the<br />
generalization ability. Once the model is built, it is still valuable to recognize the<br />
relative importance of predictor variables. The method we propose uses sampling
techniques, backward selection and the Rand index for evaluating whether a particular
variable is redundant or not. We also try to extend the idea to obtain a ranking
of the predictor variables reflecting the relative importance of the inputs.
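The Rand index used for the redundancy evaluation measures the agreement of two partitions, here sketched on the predictions of a full model and of a model without one variable (toy labels):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of object pairs on which two partitions agree
    (both pairs together, or both pairs apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

full_model    = [0, 0, 1, 1]  # predictions of the full SVM model
reduced_model = [0, 0, 1, 0]  # predictions without one candidate variable
ri = rand_index(full_model, reduced_model)
```

A Rand index near 1 after removing a variable suggests the variable is redundant, since the induced partition of the observations barely changes.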
Key words: Support Vector Machines, redundancy, relevant attributes<br />
References<br />
Abe, S. (2005): Support Vector Machines for Pattern Classification, Springer, London.<br />
Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984): Classification and Regression<br />
Trees, Wadsworth, Monterey.<br />
Schölkopf, B. and Smola, A.J. (2002): Learning with Kernels, MIT Press, Cambridge.
Vapnik, V. (1998): Statistical Learning Theory, John Wiley & Sons, N.Y.<br />
− 152 −
Comparison of Algorithms to Find Differentially
Expressed Genes in Microarray Data
Alfred Ultsch<br />
Databionics Research Lab, Department of Computer Science<br />
University of Marburg, D-35032 Marburg, Germany<br />
ultsch@informatik.uni-marburg.de<br />
Summary. Several algorithms have been published for the identification of
differentially expressed genes in DNA microarray experiments. The microarrays
in this type of experiment come from two different populations (groups) of specimens.
Among the many genes on the microarrays, those genes are sought that are most
relevant for the distinction between the two populations. Usually such algorithms
produce ordered lists of genes. In this work a method to compare the performance
of such algorithms is proposed. In order to compare different methods for the
identification of significant genes, a data set with known properties is published. This
benchmark data set is used to compare the performance of different algorithms with
that of a newly designed one, called PUL. The comparison is based on established
measures from information retrieval. Surprisingly, a clear ordering in performance of the
algorithms was observed: PUL outperformed the other algorithms by a factor of two.
PUL has been applied successfully in several practical applications, in which
the importance of the genes proposed by PUL was independently verified.
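The abstract does not name the specific retrieval measures used; precision at k over a ranked gene list is one standard example of such a measure:

```python
def precision_at_k(ranked_genes, relevant, k):
    # Information-retrieval view of an ordered gene list: the fraction
    # of the k top-ranked genes that are truly differentially expressed.
    return sum(1 for g in ranked_genes[:k] if g in relevant) / k
```

On benchmark data with known properties, `relevant` is the set of genes planted as differentially expressed, so precision at k can be computed exactly for each algorithm's list.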
References<br />
1. Gebhard, S., Bergmann, E., Weber, A., Berwanger, B., Eilers, M., Ultsch, A.,<br />
Christiansen, H.: Classification of stage 3 neuroblastomas by artificial neural<br />
networks based analysis of cDNA microarrays. (submitted)<br />
2. Dudoit, S., Fridlyand, J., Speed, T. (2000). Comparison of discrimination methods<br />
for the classification of tumors using gene expression data. Technical report<br />
576, Department of Statistics, University of California, Berkeley.<br />
3. Pallasch, C.P., Schwamb, J., Schulz, A., Königs, S., Debey, S., Kofler, D., Schultze,
J.L., Hallek, M., Ultsch, A. and Wendtner, C. (2007): Targeting lipid metabolism by
the lipoprotein lipase inhibitor orlistat results in apoptosis in chronic lymphocytic
leukemia. Leukemia, accepted.
4. Tusher, V., Tibshirani, R. and Chu, G. (2001): Significance analysis of microarrays<br />
applied to the ionizing radiation response, PNAS 2001 98: 5116-5121.<br />
5. Ultsch, A. (2005): Improving the identification of differentially expressed genes in
cDNA microarray experiments. In: Weihs, C. and Gaul, W. (Eds.): Classification - the
Ubiquitous Challenge. Springer, Heidelberg, pp. 378-385.
− 153 −
Is log ratio a good value for measuring return<br />
in stock investments?<br />
Alfred Ultsch<br />
Databionics Research Group<br />
Philipps-University of Marburg, Germany<br />
ultsch@informatik.uni-marburg.de<br />
Abstract. Measuring the rate of return is an important issue for the theory and practice
of investments in the stock market. A common measure for the rate of return is the
logarithm of the ratio of successive prices (LogRatio). In this paper it is shown that
LogRatio as well as the arithmetic return rate (ROI) have several disadvantages. As an
alternative, relative differences (RelDiff) are proposed to measure rates of return.
The stability of RelDiff against numerical and rounding errors is demonstrated
to be much better than that of LogRatio and ROI. RelDiff values are practically identical to
LogRatio and ROI for the interesting ranges of return rates. Relative differences map
return rates to a finite range, which is a big advantage for most subsequent analyses.
The usefulness of the approach is demonstrated on daily return rates of a large set
of stocks.
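The abstract gives no formulas; the three measures can be sketched as follows, where the exact definition of RelDiff (here a symmetric relative difference, bounded in (-2, 2)) is our assumption, not taken from the paper:

```python
import math

def log_ratio(prev, cur):
    # LogRatio: logarithm of the ratio of successive prices.
    return math.log(cur / prev)

def roi(prev, cur):
    # Arithmetic return rate (return on investment).
    return (cur - prev) / prev

def rel_diff(prev, cur):
    # Symmetric relative difference (assumed definition): bounded in
    # (-2, 2) for any positive prices, unlike LogRatio and ROI.
    return 2.0 * (cur - prev) / (cur + prev)
```

For a one-percent price move all three values nearly coincide, while for extreme moves `rel_diff` stays bounded.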
Key words: Rate of Return, Return on Investment, Financial time series,
Black-Scholes Model
References<br />
Bodie, Z., Kane, A. and Marcus, A.J. (2004): Essentials of Investments, 5th Edition.
McGraw-Hill/Irwin, New York.
Brealey, R.A., Myers, S.C. and Allen, F. (2006): Principles of Corporate Finance,
8th Edition. McGraw-Hill/Irwin.
Feibel, B.J. (2003): Investment Performance Measurement. Wiley, New York.
Franke, J., Haerdle, W. and Hafner, C. (2004): Einführung in die Statistik der
Finanzmärkte, 2nd Edition. Springer, Berlin.
Ultsch, A. (2005): Improving the identification of differentially expressed genes in cDNA
microarray experiments. In: Weihs, C. and Gaul, W. (Eds.): Classification - the
Ubiquitous Challenge. Springer, Heidelberg, pp. 378-385.
Ultsch, A. (2003): Is log ratio a good value for identifying differential expressed genes in
microarray experiments? Technical Report No. 35, Dept. of Mathematics and
Computer Science, University of Marburg, Germany.
− 154 −
Mosaic Plots and Knowledge Structures<br />
Ali Ünlü<br />
Department of Mathematics, University of Augsburg, Germany<br />
ali.uenlue@math.uni-augsburg.de<br />
Abstract. Mosaic plots are state-of-the-art graphics for multivariate categorical<br />
data (Hofmann (2008)). Knowledge structures are mathematical models that belong
to the recent theory of knowledge spaces in psychometrics (Doignon and Falmagne<br />
(1999)). This paper presents an application of mosaic plots and variants such as<br />
fluctuation diagrams and multiple barcharts to psychometric data arising from underlying
knowledge structure models. In simulation trials, the scope of this graphing<br />
method in knowledge space theory is investigated.<br />
Key words: Mosaic plot, Visualization, Knowledge structure, Psychometrics<br />
References<br />
Doignon, J.-P. and Falmagne, J.-Cl. (1999): Knowledge Spaces. Springer, Berlin.<br />
Hofmann, H. (2008): Mosaic Plots and Their Variants. In: C.H. Chen, W. Haerdle
and A.R. Unwin (Eds.): Handbook of Data Visualization. Springer, Heidelberg,<br />
617–642.<br />
− 155 −
Visualizing preferences using minimum<br />
variance nonmetric unfolding<br />
Michel van de Velden, Alain de Beuckelaer, Patrick Groenen, and Frank<br />
Busing<br />
Abstract. In multidimensional unfolding one wishes to obtain a map with subjects<br />
(e.g. consumers) and objects (e.g. products), in such a way that distances<br />
between subjects and objects in the map best represent the preferences as indicated<br />
in the data. Unfolding models are particularly adequate when the data (e.g. consumers’<br />
preferences) are not unidirectional but exhibit an inverted U-shape. If the<br />
alternatives are rated on an interval (or ratio) scale, the ‘metric’ unfolding model is
appropriate. If, however, the alternatives are rank ordered or rated on an ordinal
(e.g. Likert-type) scale, one needs the ‘nonmetric’ unfolding model. Until
recently, nonmetric unfolding was not feasible because of degeneracy problems.
Degenerate solutions are solutions where the extent of ‘misfit’ can be made arbitrarily
small; existing algorithms consistently produced such degenerate solutions.
Recently, Busing, Groenen and Heiser (Psychometrika, 2005, pp. 71-98) proposed a solution
to this long-standing methodological problem by including a penalty in the<br />
algorithm. The resulting PREFSCAL algorithm is available in SPSS. In PREFSCAL<br />
two parameters are introduced that determine the strength of the penalty that leads<br />
the algorithm away from degenerate solutions. No clear directions concerning the<br />
choice of the penalty parameters are given. In this paper, we propose a minimum<br />
variance criterion to choose the penalty parameters. By studying the stability of the<br />
unfolding solutions as a function of the penalty parameters, we are able to determine<br />
the penalty in such a way that a minimum variance, non-degenerate solution<br />
is obtained. The data used in our analysis stem from a consumer study in which<br />
consumers were asked to rank-order new product ideas for soups.<br />
− 156 −
Selection of items for tests and questionnaires<br />
using Mokken scale analysis<br />
L. Andries van der Ark and J. Hendrik Straat<br />
Department of Methodology and Statistics<br />
Tilburg University<br />
P.O. Box 90153<br />
5000 LE Tilburg<br />
The Netherlands<br />
a.vdark@uvt.nl<br />
Abstract. Tests or questionnaires are often used to measure personality traits,<br />
attitudes, opinions, skills, and abilities. These tests and questionnaires consist of<br />
questions, statements, problems, games, or rating scales, which are generically called<br />
items. An important step in the construction of a test or questionnaire is a careful
selection of items. A well-known approach for selecting qualitatively good items in a<br />
test is Mokken scale analysis. In this presentation, Mokken scale analysis is explained<br />
and recent developments are discussed. Special attention is given to a comparison<br />
of automated item selection algorithms used in Mokken scale analysis.<br />
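As background (not taken from the presentation itself), item selection in Mokken scale analysis is driven by Loevinger's scalability coefficient H; for a pair of dichotomous items it can be sketched as:

```python
def loevinger_h(item_a, item_b):
    # Loevinger's H for a pair of dichotomous items: 1 minus the ratio
    # of observed Guttman errors (passing the hard item while failing
    # the easy one) to the number expected under independence.
    n = len(item_a)
    easy, hard = (item_a, item_b) if sum(item_a) >= sum(item_b) else (item_b, item_a)
    observed = sum(1 for e, h in zip(easy, hard) if e == 0 and h == 1)
    expected = n * (sum(1 for e in easy if e == 0) / n) * (sum(hard) / n)
    return 1.0 - observed / expected
```

Automated item selection procedures grow scales from item pairs whose H exceeds a user-chosen lower bound; H = 1 corresponds to a perfect Guttman pattern.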
− 157 −
Estimating the prevalence of rule transgression<br />
using data collected by randomized response<br />
Peter G.M. van der Heijden<br />
Department of Methodology and Statistics<br />
Utrecht University<br />
PO Box 80140<br />
3508 TC Utrecht<br />
The Netherlands<br />
P.G.M.vanderheijden@uu.nl<br />
Abstract. In criminology, self-report studies are a means to obtain prevalence estimates
of rule transgressions, violations of the law, and so on. In surveys individuals
are interviewed about their behaviour. An obvious problem is, of course,
that for reasons such as social desirability people do not always answer honestly
about their behaviour.
For this reason about forty years ago randomized response was introduced to<br />
collect data about sensitive issues. Our research group has worked in this area for<br />
about 10 years and I will give an overview of our results. The results are:<br />
• a “best practice” for asking randomized response questions
• a meta-analysis showing that randomized response is the most valid method for
answering questions about sensitive topics
• accommodating existing models for multivariate data, such as logistic regression,
item response theory, and models for count data, so that they can handle
randomized response data
• accommodating these models for the potential presence of respondents who do not
follow the randomized response design.
We present these results and illustrate them using surveys into social benefit fraud
that we conducted for the Ministry of Social Affairs on a two-yearly basis
from 1998 to 2006.
− 158 −
Clustering Consumers with Respect to Their<br />
Marketing Reactance Behavior<br />
Ralf Wagner and Erik Sauerwald<br />
SVI Chair for International Direct Marketing<br />
DMCC - Dialog Marketing Competence Center<br />
University of Kassel, Germany<br />
rwagner@wirtschaft.uni-kassel.de<br />
erik sauerwald@arcor.de<br />
Abstract. The recent paradigm shift in modern marketing practices (Coviello et<br />
al. (2002), Vargo & Lusch (2004)) in concurrence with the increasing popularity of<br />
digital marketing measures (Wagner & Meißner (forthcoming)) adds a new quality
to the discussion of marketing intrusiveness (Morimoto & Chang (2006)).<br />
Despite the comprehensive research on international differences in media usage
(e.g., Krafft et al. (2007)), related previous research frequently neglects the cultural
differences in recipients’ assessment of the marketing measures. In this study we
utilize the Item Response Theory approach for an assessment of individuals’ reactance<br />
to unsolicited marketing communications. The study is based on a survey of
recipients from China, Germany, Russia, and the United States of America.
Key words: Advertising, Culture, Item Response Theory, Reactance<br />
References<br />
Coviello, N.E., Brodie, R.J., Danaher, P.J., and Johnston, W.J. (2002): How Firms<br />
Relate to Their Markets: An Empirical Examination of Contemporary Marketing<br />
Practices. Journal of Marketing, 66, 33–46.<br />
Krafft, M., Hesse, J., Höfling, J., Peters, K. and Rinas, D. (2007): International Direct
Marketing. Principles, Best Practices, Marketing Facts. Springer, Berlin.<br />
Morimoto, M. and Chang, S. (2006): Consumers’ Attitudes toward Unsolicited Commercial<br />
E-mail and Postal Direct Mail Marketing Methods: Intrusiveness, Perceived<br />
Loss of Control, and Irritation. Journal of Interactive Advertising, 7,<br />
8–20.<br />
Vargo, S.L. and Lusch, R.F. (2004): Evolving to a New Dominant Logic for Marketing.<br />
Journal of Marketing, 68, 1–17.<br />
Wagner, R. and Meißner, M. (forthcoming): Multimedia for Direct Marketing. In:<br />
M. Pagani (Ed.): Encyclopedia of Multimedia Technology and Networking, 2nd
Edition. Idea Publishing, Hershey.
− 159 −
Supervised Self-Organising Maps and More<br />
Ron Wehrens<br />
IMM, Analytical Chemistry<br />
P.O. Box 9010, 6500 GL Nijmegen<br />
The Netherlands<br />
r.wehrens@science.ru.nl<br />
Abstract. Self-organising maps (SOMs) have been applied in many different areas<br />
of science. In a typical application, large numbers of objects (thousands or more) are<br />
mapped to a two-dimensional grid in such a way that very similar objects end up in<br />
the same area. If several different types of information are available, one can combine<br />
these in one feature vector, used to determine the similarity with each of the map<br />
units, but this presents scaling difficulties. We have, e.g., mapped several thousand<br />
steroid crystal structures from the Cambridge Crystallographic Database, based<br />
on their diffraction patterns and a specific distance function. For these structures,<br />
several other types of information are available as well, such as space group and cell<br />
volume.<br />
To take extra information into account, we have extended the basic principle
of SOMs to accommodate extra layers, one for each type of feature vector [1]. The
closest unit is then determined by summing distances per layer, where each layer<br />
can be assigned a weight. This makes it possible to perform supervised mapping:<br />
the second layer then contains the class information. The result of including class<br />
information is that classes are more likely to form contiguous units in the map. This<br />
behaviour can be enforced by choosing a larger weight for the class information. One<br />
does not have to stop at two layers: it is possible to create several layers, each layer<br />
corresponding to another type of data.
This is implemented in an R package, called “kohonen” [2], available from CRAN
(http://cran.r-project.org). Several examples will be shown highlighting the<br />
possibilities of the technique.<br />
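The layer-weighted winner search can be sketched as follows (a plain Python sketch of the idea, not the kohonen package's actual implementation):

```python
def closest_unit(units, obj, weights):
    # Each unit and each object carry one feature vector per layer.
    # The winning unit minimises the weighted sum of per-layer
    # Euclidean distances; with weights like (0, 1) only the second
    # (e.g. class) layer decides, giving supervised mapping.
    def dist(unit):
        total = 0.0
        for w, u_layer, x_layer in zip(weights, unit, obj):
            d = sum((a - b) ** 2 for a, b in zip(u_layer, x_layer)) ** 0.5
            total += w * d
        return total
    return min(range(len(units)), key=lambda i: dist(units[i]))
```

Raising the weight of the class layer makes classes more likely to occupy contiguous regions of the map, as described above.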
Key words: Self-organising maps, Data fusion, Supervised mapping<br />
References<br />
1. W.J. Melssen, R. Wehrens, and L.M.C. Buydens. Supervised Kohonen networks
for classification problems. Chemom. Intell. Lab. Syst., 83:99–113, 2006.<br />
2. R. Wehrens and L.M.C. Buydens. Self- and super-organising maps in R: the
kohonen package. Journal of Statistical Software, 21(5), 2007.
− 160 −
Multi-Item Versus Single-Item Measures:<br />
A Review and Future Research Directions<br />
Petra Wilczynski and Marko Sarstedt<br />
Institute for Market-based Management, Munich School of Management,<br />
D-80539 Munich, Germany
wilczynski@bwl.lmu.de
Abstract. With their widely discussed Journal of Marketing Research article,<br />
Bergkvist and Rossiter (2007) resume a long-lasting interdisciplinary discussion on<br />
the benefits and limitations of multi-item versus single-item measures. Whereas<br />
multi-item measures of theoretical constructs have been the norm in marketing research<br />
for over 20 years, practitioners seem to favour single-item measures on the
practical grounds of minimizing non-response and costs. This practice is often
seen as a fatal error, because single-item measures are believed to be unreliable
and invalid. During the last decades several studies appeared in different disciplines<br />
such as social sciences, marketing or psychology that critically compare these two<br />
approaches, yielding sometimes contradictory results in terms of validity or reliability.<br />
Thus, the objective of this paper is to develop an integrated overview of the<br />
present status of research in this field, taking into account various disciplines.
In doing so, advantages and disadvantages, analytical approaches, and results are
compared and critically evaluated. The findings suggest several areas for future research
in this important field, which is necessary to close the gap between theoretical<br />
and practical requirements.<br />
Key words: Single Item, Multi Item, Scale Development<br />
References<br />
Bergkvist, L. and Rossiter, J.R. (2007): The Predictive Validity of Multiple-Item<br />
Versus Single-Item Measures of the Same Constructs. Journal of Marketing<br />
Research, 44, 175–184.<br />
Drolet, A.L. and Morrison, D.G. (2001): Do We Really Need Multiple-Item Measures<br />
in Service Research? Journal of Service Research, 3, 196–204.<br />
Wanous, J.P., Reichers, A.E. and Hudy, M.J. (1997): Overall Job Satisfaction:
How Good Are Single-Item Measures? Journal of Applied Psychology, 82,<br />
247–252.<br />
− 161 −
Management and methods: How to do market<br />
segmentation projects<br />
Raimund Wildner<br />
GfK Group<br />
Nürnberg, Germany<br />
Abstract. Market segmentation projects are often strategic projects with top management<br />
attention and high budgets. Nevertheless, many of them fail. This can be
due to poor methodology as well as to poor management.
From the management perspective it is essential that the objectives of the segmentation
are clear from the beginning. Product development, media advertising,
or sales are possible objectives and each of them requires specific variables as<br />
an input. Furthermore it is essential that all stakeholders of a segmentation project<br />
are involved from the beginning. During the segmentation project a close cooperation<br />
between marketing experts, market research experts, and statistics experts is
necessary. Special problems arise in international segmentation projects. Finally it<br />
is important to sell the segmentation within the organization through workshops, leaflets and
other instruments that help to get a clear picture of the segments.<br />
From a methodological standpoint it is important that the result is stable so it<br />
can be reproduced in other data sets as well. A test for stability will be shown. Outliers
can cause instability, so a special method to identify them will be shown. Faked
interviews have to be excluded from the segmentation. A procedure for detection of<br />
faked interviews will be discussed. Finally the cluster procedure that proved to be<br />
superior in practical terms is discussed.<br />
− 162 −
Clustering with Repulsive Prototypes<br />
Roland Winkler 1 , Frank Rehm 2 , and Rudolf Kruse 3<br />
1 German Aerospace Center, Braunschweig roland.winkler@dlr.de<br />
2 German Aerospace Center, Braunschweig frank.rehm@dlr.de<br />
3 Otto von Guericke University, Magdeburg kruse@iws.cs.uni-magdeburg.de<br />
Abstract. Although there is no exact definition of the term cluster, in the 2D
case it is fairly easy for human beings to decide which objects belong together. For
machines, on the other hand, it is hard to determine which objects form a cluster.
Depending on the problem, the success of a clustering algorithm depends on its
creators’ idea of what a cluster should be. Likewise, each clustering
algorithm comprises a characteristic idea of the term cluster. For example the fuzzy<br />
c-means algorithm tends to find spherical clusters with equal numbers of objects.<br />
Noise clustering focuses on finding spherical clusters of user-defined diameter.<br />
If a certain amount of knowledge is available about how clusters are shaped, it
is possible to include more information into a clustering algorithm. In this paper, we<br />
present an extension to noise clustering that tries to maximize the distances between<br />
prototypes. For that purpose, the prototypes behave like repulsive magnets that<br />
have an inertia depending on their sum of membership values. Using this repulsive<br />
extension, it is possible to prevent groups of objects from being divided into more
than one cluster. Due to the repulsion and inertia, it is also possible to determine
the number of clusters in a data set. Roughly speaking, having information about<br />
cluster shapes (i.e. the diameter) may help to cope with the absence of knowledge<br />
concerning the exact number of clusters.<br />
The results of repulsive clustering can be used as an initialization for other clustering<br />
techniques. We successfully applied this method to air traffic management tasks.
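For reference, plain fuzzy c-means (Bezdek (1981)) is the algorithm the proposed method builds on; the repulsion and inertia terms themselves are not reproduced here, and this 1-D sketch is ours:

```python
import random

def fuzzy_c_means(points, c, m=2.0, iters=100, seed=0):
    # Standard fuzzy c-means on 1-D data.  The repulsive extension
    # described above would add a force pushing the prototypes
    # (centers) away from each other during the update step.
    rnd = random.Random(seed)
    centers = rnd.sample(points, c)
    for _ in range(iters):
        # Membership update: u[k][i] from relative distances.
        u = []
        for x in points:
            d = [abs(x - v) + 1e-12 for v in centers]
            u.append([1.0 / sum((d[i] / d[j]) ** (2.0 / (m - 1.0))
                                for j in range(c)) for i in range(c)])
        # Prototype update: fuzzy weighted mean of the data.
        centers = [
            sum(u[k][i] ** m * points[k] for k in range(len(points))) /
            sum(u[k][i] ** m for k in range(len(points)))
            for i in range(c)
        ]
    return sorted(centers)
```

The repulsive variant would modify the prototype update so that nearby prototypes repel each other, which is what prevents one group of objects from being split across clusters.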
Key words: fuzzy clustering, cluster shapes, cluster validity, air traffic management<br />
References<br />
Bezdek, J.C.(1981): Pattern recognition with fuzzy objective function algorithms.<br />
Plenum Press, New York.<br />
Dave, R.N. and Sen, S. (1998): Generalized noise clustering as a robust fuzzy
c-M-estimators model. 17th Annual Conference of the North American Fuzzy
Information Processing Society (NAFIPS-98), Pensacola Beach, Florida, 256–260.
Rehm, F., Klawonn, F., Kruse, R.(2007): A novel approach to noise clustering for<br />
outlier detection. Soft Computing - A Fusion of Foundations, Methodologies and<br />
Applications, Berlin/Heidelberg, Vol. 11, No 5, 489-494.<br />
− 163 −
On the Effects of Enhanced Selection Models<br />
on Quality and Comparability of Classifiers<br />
Produced by Genetic Programming<br />
Stephan Winkler, Michael Affenzeller, Stefan Wagner, and Gabriel<br />
Kronberger<br />
Fachhochschule Oberösterreich, Research Center Hagenberg<br />
{swinkler,maffenze,swagner,gkronber}@heuristiclab.com<br />
Abstract. The use of genetic programming (GP) in machine learning enables the<br />
automated search for classification models that are evolved by an evolutionary process<br />
using the principles of selection, crossover and mutation. The use of enhanced<br />
selection models in GP ([1], [2]) is able to significantly increase the quality of classifiers<br />
produced by GP; detailed analysis can be found for example in [3].<br />
Algorithmic reliability can be assessed by comparing the results produced by repeated
runs of a machine learning algorithm; due to the stochastic element that is intrinsic to any
evolutionary process, GP cannot guarantee the generation of identical or even similar
models in each GP process execution. In [4] we have presented a method for
comparing time series models produced by GP; in this paper we analyze the classifiers
returned by GP based machine learning for medical benchmark data sets (taken<br />
from the UCI repository). We mainly focus on comparing standard GP techniques<br />
to those using enhanced selection models with respect to results similarity analysis.<br />
The effects of pruning mechanisms (applied to the final results) are also discussed.<br />
Key words: Evolutionary Learning, Genetic Programming, Results Comparability<br />
References<br />
[1] Affenzeller, M. and Wagner, S. (2005): Offspring selection: A new self-adaptive
selection scheme for genetic algorithms. Adaptive and Natural Computing Algorithms,
218–221.
[2] Wagner, S. and Affenzeller, M. (2005): SexualGA: Gender-Specific Selection for
Genetic Algorithms. Proceedings of the 9th World Multi-Conference on Systemics,
Cybernetics and Informatics (WMSCI) 2005, 4: 76–81.
[3] Winkler, S., Affenzeller, M. and Wagner, S. (2007): Advanced genetic programming
based machine learning. Journal of Mathematical Modelling and Algorithms,
6(3): 455–480.
[4] Winkler, S., Affenzeller, M. and Wagner, S. (2008): On the Reliability of Nonlinear
Modeling Using Enhanced Genetic Programming Techniques. Proceedings of
the Chaotic Modeling and Simulation International Conference CHAOS 2008.
− 164 −
Analysis of massive emigration from Poland –<br />
the model–based clustering approach<br />
Ewa Witek<br />
Department of Statistics,<br />
Katowice University of Economics, Bogucicka 14, 40–226 Katowice<br />
ewitek@ekonom.ae.katowice.pl<br />
Abstract. The model–based approach assumes that the data are generated by a finite
mixture of underlying probability distributions, such as the multivariate normal distribution.
In finite mixture models, each component of the probability distribution corresponds
to a cluster. The problem of determining the number of clusters and choosing
an appropriate clustering method becomes a statistical model choice problem. Hence,
the model–based approach provides a key advantage over heuristic clustering algorithms:
it selects both the correct model and the number of clusters.
Model–based clustering has shown promise in a number of practical applications,<br />
including tissue segmentation, character recognition, minefield and seismic fault detection<br />
and classification of astronomical data. The article presents an application
of model–based clustering in economic analysis, where it is still comparatively rare.
The moment Poland joined the EU, its citizens rushed out of the country. Since<br />
1 May 2004 Poland has been facing the problem of increased emigration. We used<br />
the model–based clustering approach for grouping and for detecting inhomogeneities
among Polish emigrants from different regions of Poland.
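The BIC named in the key words, in the convention of Fraley and Raftery (2002) where larger values are better (the function name and the toy numbers below are illustrative):

```python
import math

def bic(loglik, n_params, n_obs):
    # BIC in the model-based-clustering convention of Fraley and
    # Raftery (2002): 2*loglik minus a penalty of (number of free
    # parameters) * log(number of observations).  The candidate
    # (model, number of clusters) with the largest BIC is selected.
    return 2.0 * loglik - n_params * math.log(n_obs)
```

Adding mixture components always raises the log-likelihood, so a model with more clusters is chosen only when the gain outweighs the log(n) penalty on its extra parameters.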
Key words: Model–based clustering, EM algorithm, BIC<br />
References<br />
Fraley, C. and Raftery, A.E. (2002): Model–based clustering, discriminant analysis,<br />
and density estimation. Journal of the American Statistical Association, 97,<br />
611–631.<br />
McLachlan, G.J. and Peel, D. (2000): Finite Mixture Models. Wiley, New York.
− 165 −
Image Based Mail Piece Identification using<br />
Unsupervised Learning<br />
Katja Worm 1 and Beate Meffert 2<br />
1 Siemens ElectroCom Postautomation GmbH, Rudower Chaussee 29, 12489<br />
Berlin, Germany katja.worm@siemens.com<br />
2 Humboldt Universität zu Berlin, Rudower Chaussee 25, 12489 Berlin, Germany<br />
meffert@informatik.hu-berlin.de<br />
Abstract. Based on the uniqueness of a mail piece surface, postal sorting machines<br />
use mail piece image characteristics to reuse once-extracted mail piece addresses in
different sorting steps. During the first sorting step mail piece image characteristics<br />
are extracted and stored together with the target address in a database. In subsequent
sorting steps the mail piece address is accessed by determining the corresponding
mail piece characteristics in the database. In a previous work appropriate mail piece<br />
image characteristics and procedures for their distance measurement were presented.<br />
Image based mail piece identification is complicated by a constantly changing<br />
and unknown mail piece spectrum as well as by the differentiation of nearly identical
mass mailings. In particular, the rejection of unknown mail pieces requires the definition
of specific rejection classes depending on the current mail piece spectrum.<br />
In this paper we present an approach to distance-based mail piece identification
using a two-stage classification process. The different handling of mass and collection
mail is facilitated by an unsupervised learning process, performed beforehand, which
clusters similar mail piece characteristics. Based on these clusters, a specific rejection
class can be estimated within each cluster. The first step in the identification process
is the determination of the corresponding cluster for a given mail piece. Based on<br />
the cluster specific rejection class the mail piece can be either identified or rejected.<br />
Experimental results obtained on real-world data sets show the applicability of the<br />
proposed method.<br />
References<br />
Worm, K. and Meffert, B. (2008): Robust Image Based Document Comparison Using
Attributed Relational Graphs. Proceedings of the International Conference on
Signal Processing, Pattern Recognition and Applications (SPPRA), accepted.
Key words: Document identification, Unsupervised learning, Minimum distance
classification
− 166 −
Factor Analysis of Incomplete Disjunctive<br />
Tables<br />
Amaya Zárraga 1 and Beatriz Goitisolo 1<br />
Departamento de Economía Aplicada III. UPV/EHU. Bilbao. Spain<br />
Amaya.Zarraga@ehu.es and Beatriz.Goitisolo@ehu.es<br />
Abstract. Multiple Correspondence Analysis (MCA) studies the relationship between<br />
several categorical variables defined with respect to a certain population.<br />
However, one of the main sources of information are surveys, in which it is
usual to find a certain amount of missing data as well as conditioned questions that do not
need to be answered by the whole population. In these cases, codifying the data in
a complete disjunctive table requires the inclusion of non-answer categories that can
alter the results. For example, the χ² distance between two row profiles increases
with the common answers when the individuals do not answer the same number
of questions. And in the χ² distance between two column profiles, each individual
could have a different weight according to the number of answers previously chosen.
Therefore, the direct application of standard MCA is not appropriate for the study
of an incomplete disjunctive table (IDT). We propose to analyse the incomplete
disjunctive table by substituting a suitable imposed marginal for the real marginal
of the table over the individuals.
Key words: Multiple Correspondence Analysis, Complete Disjunctive Tables, Incomplete<br />
Disjunctive Tables<br />
References<br />
Zárraga, A. and Goitisolo, B. (1999): Independence Between Questions in the Factor
Analysis of Incomplete Disjunctive Tables with Conditioned Questions.
Qüestiió, 23(3), 465–488.
Zárraga, A. and Goitisolo, B. (2008): Factorial Analysis of a Set of Contingency Tables.
In: C. Preisach, H. Burkhardt, L. Schmidt-Thieme and R. Decker (Eds.):
Data Analysis, Machine Learning and Applications. Studies in Classification,
Data Analysis, and Knowledge Organization. Proceedings of the 31st
Annual Conference of the Gesellschaft für Klassifikation e.V., Albert-Ludwigs-
Universität Freiburg, March 7-9, 2007. Springer, Berlin, forthcoming.
− 167 −
Recursive Partitioning of Economic<br />
Regressions: Trees of Costly Journals and<br />
Beautiful Professors<br />
Achim Zeileis 1 and Christian Kleiber 2<br />
1 Department of Statistics and Mathematics, Wirtschaftsuniversität Wien
Achim.Zeileis@wu-wien.ac.at
2 Wirtschaftswissenschaftliches Zentrum, Universität Basel<br />
Christian.Kleiber@unibas.ch<br />
Abstract. The linear regression model is the workhorse for empirical economic<br />
analyses. For a wide variety of standard analysis problems, there are useful specifications<br />
of linear regression models, validated by economic theory and prior successful<br />
empirical studies. However, in non-standard problems or in situations where data on<br />
additional variables is available, a useful specification of a regression model involving<br />
all variables of interest might not be available. Here, we explore how recursive<br />
partitioning techniques can be used in such situations for modeling the relationship<br />
between the dependent variable and the available regressors. Linear regression is<br />
embedded into the model-based recursive partitioning framework of Zeileis et al.<br />
(2008). The resulting regression trees are grown by recursively applying techniques
for testing and dating structural changes in linear regressions. They are compared<br />
to classical modeling approaches in two empirical applications: Following Stock and<br />
Watson (2007), the demand for economic journals (Bergstrom, 2001) is investigated.<br />
Furthermore, the impact of professors’ beauty on their class evaluations (Hamermesh<br />
and Parker, 2005) is assessed.<br />
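A toy surrogate for the recursive step (an exhaustive SSE-based split search rather than the structural-change tests actually used in the model-based partitioning framework; all names are illustrative):

```python
def ols_sse(xs, ys):
    # Residual sum of squares of a simple regression of y on x.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    a = my - b * mx
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def best_split(xs, ys, z, min_size=3):
    # Choose the cut point in the partitioning variable z that most
    # reduces the total SSE of the per-node regressions of y on x;
    # return None if no admissible cut improves on the unsplit node.
    best_cut, best_sse = None, ols_sse(xs, ys)
    for cut in sorted(set(z))[:-1]:
        left = [i for i in range(len(z)) if z[i] <= cut]
        right = [i for i in range(len(z)) if z[i] > cut]
        if len(left) < min_size or len(right) < min_size:
            continue
        sse = (ols_sse([xs[i] for i in left], [ys[i] for i in left]) +
               ols_sse([xs[i] for i in right], [ys[i] for i in right]))
        if sse < best_sse:
            best_cut, best_sse = cut, sse
    return best_cut
```

Applied recursively to each resulting node, this grows a regression tree whose leaves carry their own linear models, which is the structure the abstract describes.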
Key words: Regression trees, Model-based recursive partitioning, Structural change<br />
References<br />
Bergstrom, T.C. (2001): Free Labor for Costly Journals? Journal of Economic Perspectives,<br />
15, 183–198.
Hamermesh, D.S. and Parker, A. (2005): Beauty in the Classroom: Instructors’<br />
Pulchritude and Putative Pedagogical Productivity. Economics of Education<br />
Review, 24, 369–376.<br />
Stock, J.H. and Watson, M.W. (2007): Introduction to Econometrics. 2nd edition,<br />
Addison Wesley.<br />
Zeileis, A., Hothorn, T. and Hornik, K. (2008): Model-based Recursive Partitioning.
Journal of Computational and Graphical Statistics, accepted for publication.<br />
− 168 −