
Stefan Thomas Hopmann, Gertrude Brinek, Martin Retzl (Hg./Eds.)

PISA zufolge PISA – PISA According to PISA

Schulpädagogik und Pädagogische Psychologie
edited by Univ.-Prof. Dr. Dr. h. c. Richard Olechowski (Universität Wien)
Volume 6

LIT


Stefan Thomas Hopmann, Gertrude Brinek, Martin Retzl (Hg./Eds.)

PISA zufolge PISA –
PISA According to PISA

Hält PISA, was es verspricht? –
Does PISA Keep What It Promises?

LIT


Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de.

ISBN 978-3-7000-0771-5 (Austria)
ISBN 978-3-8258-0946-1 (Germany)

A catalogue record for this book is available from the British Library.

© LIT VERLAG GmbH & Co. KG Wien 2007
Krotenthallergasse 10/8, A-1080 Wien
Tel. +43 (0) 1 / 409 56 61, Fax +43 (0) 1 / 409 56 97
e-mail: wien@lit-verlag.at, http://www.lit-verlag.at

LIT VERLAG Dr. W. Hopf, Berlin 2007
Distribution/publisher contact:
Fresnostr. 2, D-48159 Münster
Tel. +49 (0) 251 / 62 03 20, Fax +49 (0) 251 / 23 19 72
e-mail: lit@lit-verlag.de, http://www.lit-verlag.de

Distribution:
Austria: Medienlogistik Pichler-ÖBZ GmbH & Co KG, IZ-NÖ Süd, Straße 1, Objekt 34, A-2355 Wiener Neudorf
Tel. +43 (0) 2236 / 63 535-290, Fax +43 (0) 2236 / 63 535-243, e-mail: mlo@medien-logistik.at
Germany: LIT Verlag, Fresnostr. 2, D-48159 Münster
Tel. +49 (0) 251 / 620 32-22, Fax +49 (0) 251 / 922 60 99, e-mail: vertrieb@lit-verlag.de

Distributed in the UK by: Global Book Marketing, 99B Wallis Rd, London E9 5LN
Phone: +44 (0) 20 8533 5800, Fax: +44 (0) 1600 775 663
http://www.centralbooks.co.uk/acatalog/search.html

Distributed in North America by: Transaction Publishers, Rutgers University, 35 Berrue Circle, Piscataway, NJ 08854
Phone: +1 (732) 445-2280, Fax: +1 (732) 445-3138
For orders (U.S. only): toll free (888) 999-6778, e-mail: orders@transactionspub.com


Inhalt / Table of Contents

Zu diesem Buch (About This Book)

Vorwort (Preface)
Richard Olechowski

Introduction: PISA According to PISA – Does PISA Keep What It Promises?
Stefan T. Hopmann/Gertrude Brinek

What Does PISA Really Assess? What Does It Not? A French View
Antoine Bodin

Testfähigkeit – Was ist das? (Testability – What Is That?)
Wolfram Meyerhöfer

PISA – An Example of the Use and Misuse of Large-Scale Comparative Tests
Jens Dolin

Language-Based Item Analysis – Problems in Intercultural Comparisons
Markus Puchhammer

England: Poor Survey Response and No Sampling of Teaching Groups
S. J. Prais

Disappearing Students – PISA and Students With Disabilities
Bernadette Hörmann

Identification of Group Differences Using PISA Scales – Considering Effects of Inhomogeneous Items
Peter Allerup

PISA and “Real Life Challenges”: Mission Impossible?
Svein Sjøberg

PISA – Undressing the Truth or Dressing Up a Will to Govern?
Gjert Langfeldt

Uncertainties and Bias in PISA
Joachim Wuttke

Large-Scale International Comparative Achievement Studies in Education: Their Primary Purposes and Beyond
Rolf V. Olsen

The Hidden Curriculum of PISA – The Promotion of Neo-Liberal Policy By Educational Assessment
Michael Uljens

Deutsche Pisa-Folgen (German PISA Consequences)
Thomas Jahnke

PISA in Österreich: Mediale Reaktionen, öffentliche Bewertungen und politische Konsequenzen (PISA in Austria: Media Reactions, Public Assessments, and Political Consequences)
Dominik Bozkurt, Gertrude Brinek, Martin Retzl

Epilogue: No Child, No School, No State Left Behind: Comparative Research in the Age of Accountability
Stefan T. Hopmann


Zu diesem Buch (About This Book)

“PISA zufolge PISA” (PISA According to PISA) was the topic of a symposium held in March 2007 at the University of Vienna by the Research Unit for School and Education Research of the Department of Education. We had invited several critics of the existing PISA studies to this event, as well as representatives of the Austrian PISA consortium, who unfortunately had to cancel at short notice. Our aim was to bring objectivity to the debate about PISA: away from the political-ideological quarrel over PISA and toward an analysis of the methodological premises and consequences of the PISA project, more precisely the question of whether PISA, given its design, can deliver what it promises to explain in its analyses and reports. At this event the question was raised whether and how PISA is being discussed internationally as a scientific undertaking. We took this as the occasion to put together the present volume.

Despite its enormous public impact, PISA has so far prompted hardly any international scrutiny within comparative educational research, though it has triggered a number of national inquiries (cf. Hopmann & Brinek in this volume). For this volume, eighteen researchers from seven countries (Denmark, Germany, England, Finland, France, Norway, and Austria) have now come together for the first time across national borders to examine PISA critically from all sides: the entire research process, from the design and survey instruments through the administration and data analysis to the public presentation of the data. All relevant scholarly approaches to PISA were taken into account: empirical educational research, research methodology, statistics, and general as well as subject-specific didactics. Almost all contributors have many years of experience in comparative educational research or related undertakings; some were, at least temporarily, directly involved in PISA research.

With all due respect for the tremendous commitment of the OECD and the national PISA consortia, the result is very sobering: PISA does not come close to keeping what it promises, and with the means employed it cannot do so! The PISA project is evidently burdened with so many weaknesses and sources of error that at least its most popular end products, the international comparison tables and most of the national supplementary analyses of schools and school structures, instruction, school achievement, and issues such as migration, social background, gender, etc., simply cannot be scientifically sustained in the forms practiced so far. They far overstretch the load-bearing capacity of the chosen design and its theoretical and methodological foundations. Anyone who wants to decide about school structures, curricula, national tests, or the future of teacher education on this basis is ill advised.

This does not stop PISA from being one of the most important and productive projects of contemporary comparative research. Several contributions in this volume explicitly point to future possibilities of PISA research. It only seems urgently necessary to state far more clearly the limits of validity and reliability that are, in part, unavoidable in such research, and to ensure that PISA is not claimed in the future for burdens of proof that it cannot shoulder in a scientifically defensible way. One could almost say that the good in PISA, and the interested public, need to be protected against the methodologically untenable exuberance of some of those involved in PISA. Otherwise there is the danger that one day the education administrations, the schools and school heads, the teachers, and the students will not only grow weary of the constant misuse of their data but will reject all comparable measures and research projects wholesale, or even boycott or sabotage them through willful response behavior, as has already happened with state testing in several countries (among others the USA, Chile, and Norway). This would inflict lasting damage on comparative educational research as a whole, and the concern that this may happen at the end of the PISA enthusiasm is what motivates our engagement.

Of course, our own undertaking also faced clear limits:

– First, a direct re-analysis of original PISA data, PISA questions, etc. was possible only to a limited extent, only where individual data sets were accessible. In addition, we evaluated almost the entire literature on PISA methodology and its implications (cf. the references in the individual chapters). However, PISA has so far not permitted an independent examination of the complete data sets including all documentation. It may therefore be that, in a corresponding re-examination, if it one day becomes possible, individual results of our meta-analyses will turn out differently than we were able to verify. However, so many critical objections emerged in our investigations that refuting half a dozen or even a dozen of them would change nothing about the core findings of this volume. In every phase of the PISA project there are numerous design decisions and problems which, taken on their own, suffice to regard a considerable part of the currently customary presentation and use of PISA results as scientifically untenable.

– Second, we were not interested in speaking with one voice. Not only do PISA critics find different things worthy of criticism and therefore choose different approaches and lines of argument; we also wanted to present the whole range of criticism currently accessible in Europe and to exclude no one merely because one or another of us might not share individual points or conclusions. We had also invited the German and the Austrian PISA consortia to participate several times. Unfortunately, this did not come about. Fortunately, some others with PISA experience were nevertheless willing to take part in our undertaking. We thus succeed only partially in reflecting the full pros and cons of the discussion. We do not doubt, however, that the PISA consortia have enough other opportunities to take an active part in the debate.

An undertaking such as this cannot succeed without help from many sides. The then Austrian Federal Ministry for Education, Science and Culture (BMBWK), the Austrian Society for Educational Research, and the Norwegian research network “Achieving School Accountability in Practice (ASAP)”, among whose publications this volume counts, generously supported the symposium and the work on the present volume. Not to be forgotten is the help of the secretariats in Vienna (Patricia Stuhr) and Kristiansand (Inger Linn Nystad Baade, Karen Beth Lee Hansen), whose patience and language skills made success possible. Finally, we would like to thank Richard Olechowski and LIT Verlag warmly for including this volume in the series “Schulpädagogik und Pädagogische Psychologie”.

Stefan T. Hopmann, Gertrude Brinek, Martin Retzl
Vienna, September 2007


Science thrives on discussion. For this reason we warmly invite you to take part in our online discussion forum. Post your opinions, criticism, and suggestions about the book! More information is available at the following homepage: http://institut.erz.univie.ac.at/home/fe2/. We look forward to a stimulating discussion!


Vorwort (Preface)

Richard Olechowski
Austria: Universität Wien

Presumably every educational researcher who has ever borne ultimate responsibility for a large empirical project – one that, as is usually the case, was limited to the national level anyway, and perhaps additionally restricted to the language most widely spoken in the nation concerned – will have marveled at the daring of the PISA consortium. Did this consortium pull off its “salto mortale”, or did the leap end “lethally” for its members? Anyone working on a large national project (even one restricted as just sketched, or to an even greater degree) knows only too well that the necessary exactness is threatened by a whole series of impairments at every stage of the project, starting with the drawing of the sample, through the individual steps of test construction, to the question of whether uniform test administration was achieved in all subgroups. Not to be underestimated are the problems of test scoring and data analysis (in the narrower sense), especially when some of this had to be carried out decentrally, as in the PISA project, in the participating states. Nor should the effect of the chosen manner of publication be underestimated, particularly in the case of a project attracting public interest of the scale and intensity that PISA does.

Professionally competent critics of PISA accuse those responsible for the PISA publications, in this book, of having deliberately heightened the intensity of that public interest by publishing the PISA results in the form of national rank scales, thereby steering the public interest not toward the concerns of educational science but toward those of the tabloid press. The data were originally collected at a higher scale level; only for publication was the comparatively coarse measure of rank data chosen, and the “suggestive power” of the rankings, so the critics argue, was foreseeable. The operators of PISA therefore bear, according to the critics, the responsibility for the manner of the current discussion.

As professionally competent colleagues, some of whom have even worked on parts of the large PISA project themselves, report in this book, there are serious critical objections concerning every phase of the PISA project. Some examples, in addition to those already mentioned above:

– The PISA project began with the collection of test items. Although all states participating in the PISA study were invited to submit items, not all eligible states followed this invitation. This created a “cultural bias” – a distortion toward the cultural characteristics and peculiarities of those states that did respond to the call for items.

– The curricula of the relevant schools in the individual countries were, by and large, drawn up at the time (many years before the PISA project) without international coordination. Moreover, teachers in the individual countries enjoy, to varying degrees, a certain freedom with respect to the curriculum. In most countries, teacher working groups also develop “syllabus distributions” that concretize classroom work. It goes without saying that an achievement test must be constructed in close alignment with what was actually taught in the schools. This aspect was not systematically taken into account in the item construction of the PISA project. (Any deliberate renunciation of “curricular validity” would amount to knowingly accepting the risk of an argumentative impasse.)

– Every test has to be referenced to a calibration (norming) sample. The calibration sample must be as similar as possible to the sample that is tested in a concrete study as a stand-in for the population of interest. This is not the case with PISA: the individual tests were not calibrated or normed on a separate calibration sample in each individual country participating in the PISA project.

– Detailed information on the “reliability” of the tests employed is missing. (There is no information on the degree of agreement between test results obtained at different points in time from the same or “comparable” persons.)

– Likewise missing is detailed information on the “validity” of the tests employed. (There is no information on how closely the test results agree with results from other tests measuring the same or similar dimensions.)
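For reference, both quantities are standard test-theory coefficients (this formalization is supplied here for orientation; the preface itself gives only the verbal definitions): test-retest reliability is the correlation between two administrations $X_1, X_2$ of the same test, and the form of validity described here is the correlation of the test score $X$ with a score $Y$ from another instrument measuring the same or a similar dimension:

$$ r_{tt} = \frac{\operatorname{Cov}(X_1, X_2)}{\sigma_{X_1}\,\sigma_{X_2}}, \qquad r_{XY} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X\,\sigma_Y}. $$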

Deficiencies such as those listed above cannot, and on this the PISA critics who have their say in this book also agree, be “compensated for” by ever larger samples, for they are not “random errors” but so-called “systematic errors”.
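The distinction carries the argument: random error shrinks roughly with $1/\sqrt{n}$ as the sample grows, while a systematic error survives any sample size. A minimal simulation sketch (ours, not from the book; all numbers are invented for illustration, assuming a hypothetical bias of 15 score points):

```python
import random

random.seed(1)

TRUE_MEAN = 500.0  # hypothetical true mean score of the population
BIAS = -15.0       # hypothetical systematic error, e.g. a cultural item bias

for n in (100, 10_000, 1_000_000):
    # each observation = truth + systematic bias + random noise (sd = 100)
    scores = [TRUE_MEAN + BIAS + random.gauss(0, 100) for _ in range(n)]
    estimate = sum(scores) / n
    print(f"n={n:>9}: estimate {estimate:6.1f}, error {estimate - TRUE_MEAN:+6.1f}")

# The estimate converges to 485, not 500: a larger sample only pins down the
# biased value more precisely; it never removes the systematic error.
```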

Nevertheless, and on this even the strictest critics agree, PISA has broken new ground, and its results, after sometimes harsh criticism, are not simply to be brushed aside. At the present time, no one could conduct such a large international comparative study better. Nor must the points of criticism listed above, which are presented and discussed in this book in detail and with expertise, be misunderstood as expressing a fundamental reserve or even aversion toward measuring and counting where questions of education are concerned. Even if corrections were needed here and there in details, and even with regard to the question of a “national ranking” on a single dimension, educational research has profited greatly from the PISA study:

On the one hand, almost all states that do not “hold” the top ranks in most of the tested dimensions have probably been motivated to examine critically what in their school system should be reformed: the school organization, the curriculum, teacher education and in-service training, or other aspects of the respective school and education system. (The PISA surveys and analyses to date admittedly do not help the individual countries find the concrete causes of any unsatisfactory PISA results.) On the other hand, PISA has brought an aspect of particular importance into the awareness of the scientific community: comparative education, an important line of research within educational science, has all at once been almost forcibly dragged away from the global perspective of comparing school systems, curricula, and the training systems of the individual teachers, and redirected toward the essential aspect of “outcome” – through the PISA-based worldwide comparison, a comparison carried out with the help of a methodologically sophisticated set of instruments, with tests based on probabilistic models of test theory.
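To make “probabilistic models of test theory” concrete: PISA’s scaling builds on Rasch-type item response models. In the basic one-parameter (Rasch) model, the probability that person $v$ solves item $i$ depends only on the difference between the person’s ability $\theta_v$ and the item’s difficulty $\beta_i$:

$$ P(X_{vi}=1 \mid \theta_v, \beta_i) = \frac{\exp(\theta_v - \beta_i)}{1 + \exp(\theta_v - \beta_i)}. $$

(The formula is supplied here for orientation only; PISA itself uses a generalization of this model, which the preface does not spell out.)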

The merit of this book lies precisely in this balance between the criticism, the methodologically critical review of the surveys taking place in a three-year rhythm (2000, 2003, 2006), in particular of the analyses insofar as they are already available, and the look into the future. It offers not only the opportunity to see the PISA study in all its method-critical sharpness (often enough the criticism is constructive), but also opens up the possibility of recognizing PISA as a step in the further development of comparative education. For the first time, this large-scale comparative study gives international comparative educational research the possibility of producing intersubjectively comparable and thus also, by clear criteria, falsifiable results (an important point for science!), and conversely of achieving results that are replicable in the narrow sense of the word, and so of arriving at a genuine, empirically secured body of knowledge, insofar as this concept is at all compatible with the view of the provisionality of all science.

Richard Olechowski
Vienna, autumn 2007


Introduction:
PISA According to PISA – Does PISA Keep What It Promises?

Stefan T. Hopmann/Gertrude Brinek
Austria: University of Vienna

For the time being, PISA is the most successful enterprise in comparative education. Every time a new PISA wave rolls in, or an additional analysis appears, governments fear the results, newspapers fill column after column, and the public demands answers to the alleged failings in their country’s school system. Of course, such a tremendous impact evokes discussion and criticism. On the one side are those:

– who blame PISA for not covering the whole breadth of education or schooling (e.g. Fuchs 2003; Ladenthin 2004; Kraus 2005; Herrmann 2005; Dohn 2007; adding to the PISA frame: Benner 2002),
– who point to the fact that PISA is run by private companies (“PISA Incorporated”) looking for a share of the ever-growing testing market (see e.g. Bracey 2005; Flitner 2006; Lohmann 2006), or
– who depict PISA as a New Public Government outlet of the most neo-liberal kind (see e.g. Lohmann 2001; Huisken 2005; Klausnitzer 2006).

On the other side are those who praise PISA for giving us the best database ever available for comparative research, for developing new tools of research, and for PISA’s creative analysis of its data sets (for many examples see Pekrun 2003; Roeder 2003; Weigel 2004; Stack 2006; Olsen in this volume).

PISA According to PISA

However, surprisingly, and in spite of its public impact, PISA has not led to thorough methodological debates within the comparative research community, at least not internationally. There have been some critiques pointing to design or analytic shortcomings in some of the participating countries (e.g. Bonnet 2002; Romainville 2002; Nash 2003; Prais 2003, 2004; Goldstein 2004; Allerup 2005, 2006; Bodin 2005; Bottani & Vrignaud 2005; Gaeth 2005; Olsen 2005; Jahnke & Meyerhöfer 2006; Neuwirth, Ponocny & Grossmann 2006; Grisay & Monseur 2007). There has been some fundamental, and highly contested, criticism of the methodological soundness of PISA’s research as a whole (Jahnke & Meyerhöfer 2006; especially Wuttke 2006; rebuttal by Prenzel & Walter 2006).¹ However, none of this has led to an international debate on the validity claims of PISA outside the PISA community itself. It seems as if the overwhelming success of the approach has made any attempt to discuss PISA’s design, data collection, and analysis methodologically look petty-minded and irreverent. The strategy of PISA itself in not giving access to the full database, including all the questionnaires, contributes to this problem.

The present volume on “PISA According to PISA” is probably the first independent international attempt to discuss the methodological merits and shortcomings of PISA in relation to the validity and reliability claims PISA itself puts forward. Our aim is not to add to the debate for or against PISA. Most of us believe that PISA is an important milestone in the history of our field. But we do question whether some basic elements of PISA are done well enough to carry the weight of, e.g., comparative league tables or of in-depth analyses of weaknesses of educational systems. We ask whether other, and better, uses of the PISA database are warranted, and whether PISA-as-a-public-event should come under much more independent scrutiny – if only to avoid its misuse to validate claims and policies which cannot be legitimately derived from PISA.

The volume seeks to follow – as much as possible – the whole PISA research process, from the design and sampling, through the data collection and analysis, to the data presentation and impact. Our aim is not to give an overview of the different national PISA debates but rather to discuss general issues of construction and use. The contributors come from seven countries and from all walks of educational research, including specialists in empirical research methodology, statistical data analysis, general and subject-matter didactics, and educational policy analysis. We include contributors who are or have themselves been involved in PISA or similar projects (see the bios at the end of the book).

¹ The editors of the above-mentioned “PISA & Co.” volume (Jahnke & Meyerhöfer 2006) are working on a new and revised edition of that book, including an explicit discussion of the responses they received to the first edition. This book will be available by late 2007.

To highlight just a few core issues:

– Antoine Bodin (IREM de Besançon – Université de Franche-Comté) shows from a French perspective how much the PISA assessment is embedded in a certain understanding of (school) knowledge, which does not fit all.
– Wolfram Meyerhöfer (Universität Potsdam) continues this argument with an in-depth analysis of what PISA really asks for in its questionnaires, showing how little this is in touch with a comprehensive concept of “Bildung” or even current didactics.
– Jens Dolin (Syddansk Universitet) adds similar arguments from a Danish perspective, underlining how much PISA’s conceptualization of knowledge risks misrepresenting what is taught and learned in schools.
– Markus Puchhammer (Technikum Wien) shows – using the published example questions – how translation problems may affect results to a degree that makes comparisons guesswork.
– S. J. Prais (National Institute of Economic and Social Research, London) uses the example of England to demonstrate serious flaws in the response rates and sampling, which necessarily lead to biased results.
– Bernadette Hörmann (Universität Wien) points to the systematic marginalization of special-needs students by PISA and to how little has been done to deal with their role within the PISA approach, at least in Austria.
– Peter Allerup (Århus Universitet) elaborates a similar issue by showing, from Denmark, to what degree PISA’s much-acclaimed analysis of the impact of gender, migration, and similar factors depends on but a few, highly problematic items.
– Svein Sjøberg (Universitetet i Oslo) underlines how much both PISA’s design and students’ response behavior are culturally embedded, which may lead to a partial or complete mismatch.
– Gjert Langfeldt (Agder Universitet) questions the validity and reliability claims made by PISA, pointing to constructional constraints, methodological mishaps, and the cultural bias embedded in the PISA design.
– Joachim Wuttke gives a comprehensive overview of recently voiced criticism of PISA’s research conduct and of the resulting bias and uncertainties, which call into question not least its league tables and comparisons.
– Rolf Olsen (Universitetet i Oslo) outlines ways in which PISA can overcome some of its shortcomings by broadening its approach and adding new research.
– Michael Uljens (Åbo Akademi) explains the Finnish PISA success by the fact that what PISA asks for had already gained a foothold in Finnish schooling before PISA came around.
– Thomas Jahnke (Universität Potsdam) elaborates from a German perspective how PISA fails to really assess what is or should be taught in schools, and how reliance on PISA can lead to an impoverished view of the curriculum.
– Dominik Bozkurt, Gertrude Brinek, and Martin Retzl (Universität Wien) use the Austrian example to show how the public and political response to PISA unfolds irrespective of what PISA really can cover or prove.
– Finally, Stefan T. Hopmann (Universität Wien) puts both the PISA project and the PISA discourse in a comparative perspective, showing how much the design, use of, and response to PISA depends on the needs and traditions of those involved.

All in all, the contributions give a very varied picture of the PISA effort. No step in the research process seems to be without substantial problems, and several steps do not meet rigorous scholarly standards. Some of us believe that these are obstacles which can be overcome within the PISA frame (e.g. Allerup, Dolin, Olsen, Sjøberg); others tend to the conclusion that the PISA project is beyond repair (e.g. Langfeldt, Meyerhöfer, Wuttke), or so deeply embedded in a specific political purpose that it should rather be considered a type of research-based policy making, not a scholarly undertaking (e.g. Hopmann, Jahnke, Uljens, Bozkurt/Brinek/Retzl).

Almost all of the chapters raise serious doubts concerning the theoretical and methodological standards applied within PISA, and particularly concerning its most prominent by-products, the national league tables and analyses of school systems. Without access to the full set of original data, it is difficult to come to final conclusions. However, from our viewpoint, a few points seem to be evident beyond any reasonable doubt:

– PISA is by design culturally biased and methodologically constrained to a degree which prohibits accurate representations of what actually is achieved in and by schools. Nor is there any proof that what it covers is a valid conceptualization of what every student should know.
– The products of most public value, the national league tables (cf. Steiner-Khamsi 2003), are based on so many weak links that they should be abandoned right away. If only a few of the methodological issues raised in this volume are on target, the league tables depend on assumptions about their validity and reliability which are unattainable.

– The widely discussed by-products of PISA, such as the analyses of “good schools”, “good instruction”, or of differences between school systems and of issues like gender, migration, or social background, go far beyond what a cautious approach to these data allows for. They are more often than not speculative, and would at least need a wider framing by additional research looking at the aspects which PISA by design cannot cover or gets wrong.
– Any policy making based on these data (whether about school structures, standards, or the curriculum) cannot be justified. The use and misuse of PISA data in such contexts – done with or without PISA researchers’ consent or cooperation – belongs solely to the sphere of policy making. Of course PISA researchers have the same right as every citizen to pronounce their political convictions in public. However, they cannot do so claiming research as an unquestionable basis for their arguments.

This does not mean that there are no valuable lessons to be drawn from PISA. At the very least it is a highly innovative comparative study on the uneven distribution of a peculiar kind of knowledge and abilities among young people in different countries. However, the use of PISA as research on schooling by the OECD, its members, and some of the research groups connected to the effort goes far beyond what is scientific evidence or simply well-done research. PISA is not according to PISA when it comes to how it is produced and used in these cases.

PISA – The Contergan of Educational Research?

Of course, we would have loved to add to this volume commentaries on and criticism of what is presented here by members of the PISA consortium, because we believe in the necessity of broad and uninhibited scholarly exchange. However, repeated invitations to address these issues in open symposia, or to contribute to this volume, either remained unanswered or were turned down. The German PISA consortium went so far as to make an official decision not to participate in this effort; others simply kept silent. Time and again we were told in public and at meetings that most of the methodological criticism published on PISA has been proven wrong, and that every possible weakness has been taken care of. However, we could not obtain a published justification for this claim. Even an invitation to contribute a summary of the counterarguments to this volume was turned down.

As sad as this is, it was no surprise. In the preparation of this volume we exchanged quite a few notes on how the national debates around PISA unfold in ‘our’ countries. What emerged was a picture not unlike the behaviour of large companies when they encounter a potential scandal, e.g. pharmaceutical companies dealing with ill-conceived drugs (like Chemie Grünenthal in the famous Contergan/Thalidomide case or other scandals; cf. Kirk 1999; Luhmann 2000; Schulz 2001), where the strategy is one of “issue framing” (cf. Entman 1993; Sniderman & Theriault 2004). To take just the most recent German example:

– If some critique is voiced in public, the first response seems to be silence. Or, as the leader of the German consortium, Manfred Prenzel, put it in the case of this book: one doesn’t want to provide “a forum for unproven allegations” (as an answer to the invitation to participate in this book by mail 2007-05-09, which was turned down by a “unanimous” decision of the German PISA consortium confirmed by a mail 2007-05-21). He wrote this before knowing the authors and titles of all but one of the chapters contained in this volume.

– If that is not enough, the next step is often to raise doubts about the motives and the abilities of those who are critical of the enterprise. For instance, when asked about the recently published volume PISA & Co. (Jahnke & Meyerhöfer 2006), Olaf Köller, as the head of the German National Institute for Educational Progress, suggested that (1) these critics were unqualified to discuss PISA (even though they included many leading members of mathematics didactics research in Germany) and (2) they were probably driven by envy or other non-scholarly motives (Köller 2006a; Kerstan 2006).
– The next step seems to be to acknowledge some problems, but to insist that they are very limited in nature and scope, not affecting the overall picture. Alternatively, it is pointed out that these problems are well known within large-scale survey research of the kind PISA represents, and even unavoidable when working comparatively (e.g. Köller 2006b). Of course, that claim does not reduce the impact of these problems on the validity of the results.
– Finally, there is the statement that the criticism does not contain anything new, and nothing that has not been dealt with within the PISA research itself – and often this claim is accompanied by references to opaque technical reports that only insiders can understand, or to unpublished papers or reports (e.g. Prenzel & Walter 2006; Schleicher 2006).

What does not happen is what is normally considered to be “good science”: open debate on the pros and cons of the arguments. If one understands PISA as an economic enterprise, in line with the above-mentioned pharmaceutical companies, this is quite reasonable. Ignoring, silencing, or simply marginalizing a critic does less harm to the brand than a public argument: a public rebuttal carries the risk that some customers will not be totally convinced (“semper aliquid haeret”). Firmer steps become necessary only when criticism finally becomes so public that it can no longer be ignored by customers and buyers. But the first move is still to discredit the critics and their supporters as being uninformed, ill-equipped, or simply following a personal agenda. The final move rests on the claim that there is other research which proves the critics wrong – although, for a variety of reasons, the data sets on which these conclusions are based cannot be made available. By using such techniques, companies can realistically expect that even proven deficiencies will not substantially harm sales over time.

Of course, the comparison of PISA and Contergan can be seen as overreaching: Thalidomide led to thousands of severely disabled newborns, whereas PISA at worst does harm to children’s education. Additionally, the Grünenthal company directly advertised the medication for high-risk purposes, whereas the PISA consortium can argue that it is up to the people to believe or not to believe what PISA tells them. But other similarities are striking: PISA has a large “market share” to defend. Most of the public money spent on educational research nowadays is being put into PISA and similar approaches (the standards and testing business); many chairs in education have turned to related topics and issues, thus providing a significant market for collaborators in the field. This is all too big and too seductive to be put at risk just because of a few other scholars who do not support the whole enterprise or the way it is done.

The readers of this volume should expect similar responses to what is said here. But don’t worry: nobody is going to drag PISA into a court of law because of its flaws, as happened with the pharmaceutical companies. No court other than that of public reasoning is available, but with Kant we believe that this is the strongest court of all.

Discussion is an essential part of science. Therefore we invite you to take part in our discussion forum on the Internet and to post your opinions and critiques concerning the book. Find more information at http://institut.erz.univie.ac.at/home/fe2/. We are looking forward to an inspiring discussion!

References

Aktionsrat Bildung: Bildungsgerechtigkeit. Jahresgutachten 2007. (ed. Vereinigung der Bayerischen Wirtschaft e.V.; online: http://www.aktionsrat-bildung.de/fileadmin/Dokumente/Bildungsgerechtigkeit_Jahresgutachten_2007_-_Aktionsrat_Bildung.pdf; retr. 2007/07/07).

Allerup, P.: PISA præstationer – målinger med skæve målestokke? In: Dansk Pædagogisk Tidsskrift 2005-1, 68-81.

Allerup, P.: PISA 2000’s læseskala – vurdering af psykometriske egenskaber for elever med dansk og ikke-dansk sproglig baggrund. Odense (Rockwool Fondens Forskningsenhed og Syddansk Universitetsforlag) 2006.

Allerup, P.: Identification of Group Differences Using PISA Scales – Considering Effects of Inhomogeneous Items. In this volume.

Benner, D.: Die Struktur der Allgemeinbildung im Kerncurriculum moderner Bildungssysteme. Ein Vorschlag zur bildungstheoretischen Rahmung von PISA. In: Zeitschrift für Pädagogik 48-2002-1, 68-90.

Bodin, A.: What does PISA really assess? What it doesn’t? A French view. Report prepared for the joint Finnish-French conference “Teaching mathematics: Beyond the PISA survey”, Paris 2005.

Bodin, A.: What Does PISA Really Assess? What Does It Not? A French View. In this volume.

Bonnet, G.: Reflections in a Critical Eye: On the Pitfalls of International Assessment. In: Assessment in Education 2002-9, 387-400.

Bottani, N. & Vrignaud, P.: La France et les évaluations internationales. Paris 2005. online: http://lesrapports.ladocumentationfrancaise.fr/BRP/054000359/0000.pdf (retr. 2007/07/07).

Bracey, G.W.: Research: Put Out Over PISA. In: Phi Delta Kappan 86-2005-10, 797.

Dohn, N.B.: Knowledge and Skills for PISA – Assessing the Assessment. In: Journal of Philosophy of Education 41-2007-1, 1-16.

Flitner, E.: Pädagogische Wertschöpfung. Zur Rationalisierung von Schulsystemen durch public-private-partnerships am Beispiel von PISA. In: Oelkers, J. et al. (eds.): Rationalisierung und Bildung bei Max Weber. Bad Heilbrunn (Klinkhardt) 2006, 245-266.

Fuchs, H.-W.: Auf dem Wege zu einem neuen Weltcurriculum? Zum Grundbildungskonzept von PISA und der Aufgabenzuweisung an die Schule. In: Zeitschrift für Pädagogik 49-2003-2, 161-179.

Gaeth, F.: PISA (Programme for International Student Assessment) – Eine statistisch-methodische Evaluation. Berlin (Freie Universität) 2005.

Goldstein, H.: International Comparisons of Student Attainment: Some Issues Arising from the PISA Study. In: Assessment in Education – Principles, Policy, and Practice 11-2004-3, 319-330.

Grisay, A. & Monseur, C.: Measuring the Equivalence of Item Difficulty in the Various Versions of an International Test. In: Studies in Educational Evaluation 33-2007-1, 69-86.

Herrmann, U.: Fördern “Bildungsstandards” die allgemeine Schulbildung? In: Rekus, J. (ed.): Bildungsstandards, Kerncurricula und die Aufgabe der Schule. Münster (Aschendorff) 2005, 24-52.

Herrmann, U.: PISA – Welche Konsequenzen für Schule und Unterricht kann man wirklich ziehen? Diskussionsbeitrag DIDACTA Hannover 2006. FORUM BILDUNG (online: http://forum-kritische-paedagogik.de/start/download.php?view.209; retr. 2007/07/07).

Hopmann, S.T.: Restrained Teaching: The Common Core of Didaktik. In: European Educational Research Journal 6-2007-2, 109-124.

Huisken, F.: Der “PISA-Schock” und seine Bewältigung – Wieviel Dummheit braucht/verträgt die Republik? Hamburg (VSA-Verlag) 2005.

Jahnke, T. & Meyerhöfer, W. (eds.): PISA & Co – Kritik eines Programms. Hildesheim (Franzbecker) 2006.

Kerstan, K.: An PISA gescheitert. In: DIE ZEIT, 16.11.2006, Nr. 47.

Kirk, B.: Der Contergan-Fall: eine unvermeidbare Arzneimittelkatastrophe? Stuttgart (Wissenschaftliche Verlagsgesellschaft) 1999.

Klausnitzer, J.: PISA – einige offene Fragen zur OECD Bildungspolitik. 2006. online: http://www.links-netz.de/K_texte/K_klausenitzer_oecd.html (retr. 2007/07/07).

Köller, O.: Kritik an PISA unberechtigt. Interview mit bildungsklick.de 2006a. online: http://bildungsklick.de/a/50155/kritik-an-pisa-unberechtigt (retr. 2007/07/07).

Köller, O.: Stellungnahme zum Text von Joachim Wuttke: Fehler, Verzerrungen, Unsicherheiten in der PISA-Auswertung. Press release of the National Institute for Educational Progress (IQB), 2006/11/14 (2006b).

Kraus, J.: Der PISA Schwindel. Unsere Kinder sind besser als ihr Ruf. Wie Eltern und Schule Potentiale fördern können. Wien (Signum Verlag) 2005.

Ladenthin, V.: Bildung als Aufgabe der Gesellschaft. In: studia comeniana et historica 34-2004-71/72, 305-319.

Lohmann, I.: After Neoliberalism. Können nationalstaatliche Bildungssysteme den ‚freien Markt‘ überleben? 2001. online: http://www.erzwiss.uni-hamburg.de/Personal/Lohmann/AfterNeo.htm (retr. 2007/07/07).

Lohmann, I.: Was bedeutet eigentlich “Humankapital”? GEW Bezirksverband Lüneburg und Universität Lüneburg: Der brauchbare Mensch. Bildung statt Nützlichkeitswahn. Bildungstage 2007. online: http://www.erzwiss.uni-hamburg.de/Personal/Lohmann/Publik/Humankapital.pdf (retr. 2007/07/07).

Luhmann, H.-J.: Die Contergan-Katastrophe revisited – Ein Lehrstück vom Beitrag der Wissenschaft zur gesellschaftlichen Blindheit. In: Umweltmedizin in Forschung und Praxis 5-2000-5, 295-300.

Nash, R.: Is the School Composition Effect Real? A Discussion with Evidence from the UK PISA Data. In: School Effectiveness and School Improvement 14-2003-4, 441-457.

Neuwirth, E., Ponocny, I. & Grossmann, W. (eds.): PISA 2000 und PISA 2003. Graz (Leykam) 2006.

Olsen, R.V.: Achievement Tests From an Item Perspective. An Exploration of Single Item Data from the PISA and TIMSS Studies. Oslo (University of Oslo) 2005. online: http://www.duo.uio.no/publ/realfag/2005/35342/Rolf_Olsen.pdf (retr. 2007/07/07).

Pekrun, R.: Vergleichende Evaluationsstudien zu Schülerleistungen: Konsequenzen für die Bildungsforschung. In: Zeitschrift für Pädagogik 48-2002-1, 111-128.

Prais, S.J.: Cautions on OECD’s Recent Educational Survey (PISA). In: Oxford Review of Education 29-2003-2, 139-163.

Prais, S.J.: Cautions on OECD’s recent educational survey (PISA): Rejoinder to OECD’s response. In: Oxford Review of Education 30-2004-4.

Prenzel, M. & Walter, O.: Wie solide ist PISA? Oder: Ist die Kritik von Joachim Wuttke begründet? Kiel (IPN) 2006 (two pages including a one-page attachment!).

Roeder, P.M.: TIMSS und PISA – Chancen eines neuen Anfangs in Bildungspolitik, -planung, -verwaltung und Unterricht. Endlich ein Schock mit Folgen? In: Zeitschrift für Pädagogik 49-2003-2, 180-197.

Romainville, M.: On the Appropriate Use of PISA. In: La Revue Nouvelle 2002-3/4.

Schleicher, A.: Interview mit der Frankfurter Rundschau. In: Frankfurter Rundschau, 28.11.2006.

Schulz, J.: Management von Risiko- und Krisenkommunikation – zur Bestandserhaltung und Anschlussfähigkeit von Kommunikationssystemen. Berlin (Humboldt Universität) 2001.

Stack, M.: Testing, Testing, Read All About It: Canadian Press Coverage of the PISA Results. In: Canadian Journal of Education 29-2006-1, 49-69.

Steiner-Khamsi, G.: The Politics of League Tables. 2003. online: http://www.sowi-onlinejournal.de/2003-1/tables_khamsi.htm (retr. 2007/07/07).

Weigel, T.M.: Die PISA-Studie im bildungspolitischen Diskurs. Eine Untersuchung der Reaktionen auf PISA in Deutschland und im Vereinigten Königreich. Diplomarbeit, Trier (Universität) 2004.

Wuttke, J.: Fehler, Verzerrungen, Unsicherheiten in der PISA-Auswertung. In: Jahnke, T. & Meyerhöfer, W. (eds.): PISA & Co – Kritik eines Programms. Hildesheim, Berlin (Franzbecker) 2006, 101-154.

Wuttke, J.: Uncertainties and Bias in PISA. In this volume.


What Does PISA Really Assess? What Does It Not? A French View 1 2

Antoine Bodin 3

France: Université de Franche-Comté

Summary

This paper puts aside many important aspects of the PISA design to focus on the external validity of its mathematics questions. First, it seeks to position the PISA item contents against the French mathematics syllabus, trying to identify and quantify the overlap between the two. Then it compares the cognitive demands and competency levels of the PISA mathematics questions with those implied in customary French assessment and examination settings. Underlining certain differences between the general PISA design and the French mathematical curriculum and school culture, it also raises the epistemological and didactical validity issues of the PISA mathematics items.

1 This paper was partially presented in October 2005 at a French-Finnish conference jointly organized by the French and Finnish Mathematical Societies. A French-language version is available, as are two presentations used for the conference (also in English and in French – see addresses on the page entitled “References”).

2 With many thanks to Rosalind Charnaux for her kind help and advice on this English version.

3 antoinebodin@mac.com, website: http://web.mac.com/antoinebodin/iWeb/Site_Antoine_Bodin/

Introduction

The PISA studies have been organised by the OECD, which, as everyone knows, is an organisation devoted to world economic development. The main reason that led this organisation to undertake such a study lies in a strong belief that good education is the key to better development.

We will examine in this paper neither the value of this belief nor the economic and political implications of the studies.

At the same time we accept the idea that the PISA mathematics framework is consistent with the general PISA design, and that the mathematics test development has been carried out as faithfully and as accurately as possible (personally, I believe this is the case). There remains, however, the internal validity issue.

Plenty of documents about the PISA studies have been written and displayed all around the world, a certain number of them issued directly by the OECD and by the PISA consortium 4, and many others by officials, research teams and/or the media in the participating countries.

Therefore, the information is rich and full of contrasts. Most of the documents are public, and the OECD has done its utmost to allow scholars and other interested persons to obtain complete access to PISA’s general design as well as to its frameworks, complete database and international reports.

Far from producing only flimsy yet exciting, though often denounced, results (to which too much interest is generally paid), the PISA studies produce quality data of interest for a huge range of complementary studies ranging from politics to didactics.

4 ACER – Melbourne – Australia


Many international and national analyses have been undertaken which try to draw from processed data (as well as from raw data) the information of interest to all kinds of people concerned with educational matters.

Meanwhile, not much effort has been made until now to examine the set of mathematics questions from an external point of view, to try to understand more precisely what they really assess, and to determine to what degree they may be viewed as epistemologically and didactically consistent. Further research into these points would have implications for teaching and for teachers.

This paper seeks only to examine PISA’s external validity and is limited to its mathematical section, and, even narrower in scope, to a French point of view (‘French’ in the sense of being related to the French mathematics curriculum, French customary assessment settings, teacher beliefs, school culture, etc.).

Intended and implemented PISA assessment focus

First, it seems important to recall that PISA does not claim to assess the general quality of the educational systems examined. Regarding our topic, it does not pretend to assess general mathematical proficiency, but simply concentrates on what the OECD judges essential for the normal life of any citizen (the so-called ‘mathematical literacy’).

Let us quote the official report:

“PISA seeks to measure how well young adults, at age 15 and therefore approaching the end of compulsory schooling, are prepared to meet the challenges of today’s knowledge societies. The assessment is forward-looking, focusing on young people’s ability to use their knowledge and skills to meet real-life challenges, rather than merely on the extent to which they have mastered a specific school curriculum. This orientation reflects a change in the goals and objectives of curricula themselves, which are increasingly concerned with what students can do with what they learn at school, and not merely whether they can reproduce what they have learned.” 5

At any rate, individual students who do not correctly answer the PISA mathematics questions seem doomed to a troubled life, and countries that do not perform well are viewed as doing a poor job of preparing their young people for the future.

Thus, while PISA does not assess the entire body of mathematical knowledge acquired in schools, it does test at least a part of this knowledge.

5 OECD (2004): Learning for Tomorrow’s World. First Results from PISA 2003, p. 20


We will therefore first try to identify more clearly the part truly assessed by PISA and then relate this part to the entire French mathematics education offered to the country’s 15-year-olds. The relationship between this “literacy” part and the entire test is a problematic question, one that raises epistemologically and didactically complex issues.

However, first we must examine the way in which the PISA material is linked to the French mathematics curriculum.

A comparison of the PISA mathematics item content with the current French mathematical syllabus

For the moment, let us limit ourselves to the French syllabus, which most of the 15-year-old French students have studied. By this I mean the French “collège” syllabus from grade 6 to grade 9 (French “sixième” to “troisième”). At age 15, some French students attend high school (up to grade 11), while others are still lagging as far behind as grade 7, and yet a few others are in special education. However, on the whole, more than 85 % of the 15-year-olds have studied this syllabus 6.

The reader will find in annex 6 a presentation of this syllabus indicating the topics that have been addressed by at least one PISA 2003 mathematics question.

Annex 3 shows a list of analysed PISA questions.

Here we should recall that only a certain number of the PISA questions have been secured for future use. In this paper I will only quote some of the released questions, while most of the questions used have nevertheless been taken into account in the analysis.

Finally, we find that the PISA questions cover about 15 % of this French syllabus, the one followed by more than 85 % of the 15-year-old French students. This shows beyond any doubt the marginal focus of the PISA questions (but marginal does not mean unimportant!).

6 In fact, the 15-year-old official target is somewhat misleading. Let us quote the PISA technical report (page 46): “The 15-year-old international target population was slightly adapted to better fit the age structure of most of the northern hemisphere countries. As the majority of the testing was planned to occur in April, the international target population was consequently defined as all students ages 15 years and 3 (completed) months to 16 years and 2 (completed) months at the beginning of the assessment period.” As a result, 59.1 % of the French students who took the tests were in high school, in grade 10 (or, for a few of them, grade 11).
10 (or for a few of them, grade 11).


At the same time those 15 % represent only about 75 % of the PISA mathematics items. This means that about 25 % of the PISA items do not fit into the French curriculum. This is not only the case for many items in the field of uncertainty, but also for items not directly linked to our current curriculum (such as some combined items).

But an assessment setting can never completely cover 100 % of any curriculum. In order to explore further, we found it useful to compare the PISA material with some customary French examinations.

A comparison of the PISA mathematics item content with some French examination and assessment settings at the 15-year-old level

Comparison with the grade 9 national examination

We chose to analyse in the same way some forms of the mathematics part of the national examination taken by all students at the end of French middle school (grade 9).

Annex 7 shows the corresponding curriculum coverage for one of these forms, while the corresponding examination form is displayed in annex 5, with an analysis chart appearing in annex 4.

Here we found that this particular “Brevet” examination form covers about 35 % of the French syllabus presented above.

In addition, the entire set of PISA 2003 questions was planned for approximately 210 minutes of testing time, while each “Brevet” form lasts just 120 minutes. As two different “Brevet” forms address different parts of the syllabus, we can estimate that in the “Brevet” context a 210-minute testing time might cover more than 50 % of the French syllabus.

What is more striking is the fact that the coverage by PISA focuses more on the syllabus for grades 6 and 7, while the coverage by the “Brevet” concerns mainly the syllabus for grades 8 and 9 (which to a certain point contains and extends the previous syllabus).

But the Brevet examination is a poor illustration of the entire French curriculum (“programmes et instructions officielles”) as well as of the teachers’ aims and teaching practices. The “Brevet” is well known for shrinking the objectives, and preparing for the “Brevet” is not viewed as a good way to prepare for further high school studies.


The EVAPM studies

In the following sections I will refer to a series of large-scale studies organised within the “EVAPM Observatory”.

EVAPM is a 20-year-long research project conducted by the Mathematics Teacher Association (APMEP) and the National Institute for Pedagogical Research (INRP) to follow the evolution of the French mathematics curriculum (and especially the attained curriculum) from grade 6 to grade 12.

Being strongly linked to the teachers, and implicating them in the test development process, the EVAPM studies obviously reflect the authors’ beliefs and intentions. As the students are not individually assessed, there is no problem in checking competencies that are known to be just at the beginning of their development. In other words, the EVAPM questions are not limited by social expectations or political exploitation, as is the case in the national exams. That could have been the case with the PISA questions too; obviously, it is not.

In recent EVAPM studies there was strong teacher resistance when we tried to introduce some PISA items. Most of the items were considered not appropriate to the curriculum, and many of them were considered culturally biased.

It is not relevant to mention curriculum coverage here, as the EVAPM studies tend to be comprehensive (100 % coverage).

In this paper, we will make use of the EVAPM studies in order to compare the cognitive demands of PISA with actual French curriculum expectations (at least as viewed by the French teachers).

Comparison of cognitive demands

In order to compare the cognitive demands of mathematics assessment items, we will use a cognitive taxonomy, of which the main categories are the following:

– A Knowing and recognising . . .
– B Understanding . . .
– C Applying . . .
– D Creating . . .
– E Evaluating . . .

See annex 1 for a first expansion of this taxonomy.

The following chart displays the PISA levels of cognitive demands alongside those of the “Brevet” examination paper already examined.


The difference is most striking: the “Brevet” addresses mostly the recognition level, and even the classification of some items at Level C (application) might be questioned (most of them are routine procedures that might have been classified at Level A).

Without doubt, the taxonomic range of the PISA items is much more balanced than that of the French examination 7.

Figure 1

However, as we have already noted, the “Brevet” does not correctly reflect the actual French curriculum.

The following chart (figure 2) adds the classifications obtained for two EVAPM studies (grade 10 – 2003 and grade 6 – 2005).

Here, the balance across levels is closer to PISA, at least at the same age level (grade 10).

The chart (figure 2) seems to indicate that French teachers would be keen to evolve towards a more PISA-like assessment practice. The EVAPM studies have shown that most French teachers are quite torn between the need to prepare their students for formal exams like the “Brevet” presented in this paper and their conception of what a good math education should include (we also know that the conflict between exams and education is not unique to the French!).

7 Renovation of the “Brevet” is on the agenda. Perhaps PISA will help speed along this process?

Figure 2

Comparison of implied range of competencies

PISA makes use of a three-tiered competency classification:

– Class 1: Reproduction: “. . . consists of simple computations or definitions of the type most familiar in conventional mathematics assessments”.
– Class 2: Connection: “. . . requires connections to be made in order to solve straightforward problems”.
– Class 3: Reflection: “. . . consists of mathematical thinking, generalisation and insight, and requires students to engage in analysis, identify the mathematical elements in a situation and pose their own problems”.

See annex 2 for more details 8.

The following chart (figure 3) displays the competency levels of the PISA items alongside those of the “Brevet”.

PISA puts more than 70 % of the emphasis on Levels 2 and 3, while the “Brevet” exam puts less than 15 % on those levels.

Here again, we can examine some EVAPM assessment settings. The chart (figure 4) again shows a balance much closer to PISA for the EVAPM studies than for the national examination.

8 Note that for EVAPM we use a competency classification originating in Aline Robert’s work (see references), which, while based on assumptions other than those of the PISA classification, provides about the same repartition.

Figure 3

Figure 4

Towards some epistemological analysis

About Finnish and French differences

With regard to this paper, we had a special interest in the differences between the Finnish and French results. The overall results on the global scale (511 for France, 548 for Finland) hide the fact that this gap amounts to about 0.33 standard deviations on the standard normal distribution, and that at a point where the probability density is at its maximum. In France not many people know this fact, and still fewer understand it.
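To give an idea of what such a gap means, here is a minimal sketch (my illustration, not a computation from the report). The pooled student-level standard deviation of about 110 points is an assumption chosen to reproduce the quoted 0.33 figure, not an official PISA constant.

```python
# Sketch: the France-Finland gap as an effect size and as a
# "probability of superiority".
from statistics import NormalDist

france, finland, sd = 511, 548, 110  # sd is an assumption, not a PISA constant
d = (finland - france) / sd          # effect size in standard deviations

# With two normal score distributions shifted by d, the chance that a random
# Finnish student outscores a random French student is Phi(d / sqrt(2)).
p_superiority = NormalDist().cdf(d / 2 ** 0.5)
print(f"d = {d:.2f}, P(Finnish > French) = {p_superiority:.2f}")
# -> d = 0.34, P(Finnish > French) = 0.59
```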

Looking at the subscales of the study (quantity, change and relationships, space and shape, uncertainty) sheds no supplementary light. In order to help understand the observed differences, it is essential to turn to the items themselves and to the percentages of success in each country (or, for other approaches, for each of the subgroups investigated 9).

First, let us say that this examination confirms the better Finnish results – it is only the magnitude of the differences and its meaning that can be questioned.

Regarding the magnitude, let us say that, depending on the item examined, the differences in success rates range from 30 percentage points in favour of the Finnish students to 25 percentage points in favour of the French students, the average of the differences being 3.5 percentage points in favour of the Finnish students. 10

We observe that the differences are greater in favour of the Finnish students for the more “realistic” items, and that the differences tend to turn in favour of the French students for more abstract or formal items (compare, for instance, below, the results of “Apples Item 1” with the results of “Apples Item 3”; the pattern seems general).

It is important to note that the difference in results between Finland and France would totally disappear if the 10 % least successful French students (the bottom decile) were put aside.

In fact, while only 7 % of the Finnish age group score at Level 1 or below (on a proficiency scale ranging from 1 to 6), 17 % of the French students fall into those categories. This confirms the fact that France does not succeed well in providing mathematical education for all (a fact already strongly confirmed by the TIMSS studies).

The other end of the scale (Level 6) concerns 7 % of the Finnish students, but only 3 % of the French ones. This fact may be less worrying than the one concerning the low levels. Let us remember here that PISA addresses only literacy and does not pretend to assess general mathematical competency.

9 We do not address the gender question in this paper, but our analysis points out a certain amount of gender bias, at least for some countries. As the overall results are weaker for girls than for boys in all countries but two, the question invites more examination. Other subgroups might also be worth scrutinising.

10 This is only a rough estimate – only 41 items have been accounted for.


Nevertheless, this casts doubt on the assumption that French mathematics education demonstrates a high level of quality through its best students.

The PISA questions presented below (exclusively those that have been released) are displayed with results for France, Finland and all OECD countries, together with the highest and lowest observed results (OECD and all participating countries). 11

11 A more complete presentation may be downloaded from my website.

Mathematics?

The mathematical field may be extended or restricted according to different conceptions. Some PISA mathematics questions puzzle many French mathematics teachers: they do not recognise the mathematics they are striving to teach. At the same time they recognise the social usefulness of the knowledge implied by these questions. The same thing applies to mathematicians: how many of the PISA mathematics questions fit into theoretical mathematical constructions is not obvious to them.

Quantity, change and relationships, space and shape, as well as uncertainty are not only modelled in mathematical theories, but are also handled in common situations, using common sense and common language.

In its endeavour to stick to real life, PISA could not help using ordinary language to present its questions. In some cases, understanding a text which is in no way a mathematical text is the main difficulty students have to face. Certainly, this is also part of the mathematical process, but the true mathematical work begins once the problem is fully understood. Here the “devolution” process is not controlled, and it is never certain whether it is the “dressing up” or the wording that prevents students from solving the problem, or the degree of mathematical difficulty of the problem itself. These mathematical difficulties often appear trivial when compared with the structural and semantic complexity of the questions.

The strong correlation observed between individual results in reading literacy and mathematical literacy (r = 0.77) perfectly illustrates this point. This correlation is smaller than the correlations observed among the four PISA mathematical domains at the international level (which range from 0.89 to 0.92), but is much higher than what is generally observed in France (EVAPM studies) between students’ results in different mathematical domains (algebra, geometry, calculus, statistics), which usually lie in the interval [0.35; 0.60]. All this leads one to think that the PISA mathematics questions may all assess a general ability to read a text, to articulate textual and iconic information with other clues indirectly given by the question’s context, and to process the problem on the basis of this information. We could also invoke here the well-known “factor g”.
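As a rough gauge of what such a correlation means (my back-of-the-envelope reading, not a figure from the reports): squaring the coefficient gives the share of score variance the two scales have in common,

\[ r^2 = 0.77^2 \approx 0.59, \]

so on the PISA scales almost 60 % of the variance in mathematical literacy is statistically shared with reading literacy, against only about 12 % to 36 % (0.35² to 0.60²) between mathematical domains in the EVAPM settings.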

Numbers, quantity, etc. also appear in the PISA reading questions, in the science questions and in the problem-solving questions. It is not always obvious whether a PISA question should be allocated to one branch of the study rather than to another. In particular, some problem-solving questions could be analysed together with the set of mathematics questions.

Let us now examine some typical questions.

The Apples Example

This question is typical of realistic mathematics (and authentic assessment), which the OECD seeks both to assess and to promote. In this context, a good question must open up the process described in the framework:

a) Starting with a reality-based problem.
b) Organising it according to mathematical concepts.
c) Gradually trimming away the reality through processes such as making assumptions about which features of the problem are important, then generalising and formalising, transforming the problem into a mathematical problem that closely represents the situation.
d) Solving the mathematical problem.
e) Making sense of the mathematical solution in terms of the actual situation.

[The released “Apples” unit (three items) is displayed here with success rates for France, Finland and the OECD.]

For Item 1 the main point is to understand the situation and subsequently to be able to extrapolate a pattern. This may be completed by merely counting the first four lines in the chart. In the fifth case the student can either extend the drawing and then count, or identify a number pattern in the completed chart.


The 10 % difference between French and Finnish students illustrates the French students’ relative lack of confidence or initiative. They do not have a mathematical procedure on hand to treat the question, and this lack hinders a certain percentage of them from solving the problem.

Conversely, French students who overcome this initial difficulty perform much better on the second item than their Finnish counterparts (26 % to 21 % for the entire population, but 62 % to 38 % for those who successfully completed Item 1). This also seems to be rather general.

For this item, the mathematical process is quite obvious and leads to an equation to be solved: n² = 8n.

French students are used to solving this type of equation (though often in a formal, non-realistic context). We may even suppose that many of them have used a correct mathematical method: by this I mean factorising the equation into n(n − 8) = 0 and finding the two values 0 and 8, then and only then (point e above) eliminating the value 0 and retaining the value 8.

However, some students (in France as well as in Finland) may have gotten the correct answer just by making the invalid simplification n² = 8n ⇒ n = 8, that is, cancelling n · n = 8 · n into n = 8 without excluding n = 0.

Another procedural possibility consists of extending the chart until n = 8.

These procedures (of which at least one is mathematically incorrect) and others have been considered correct (full or partial credit!). This raises an epistemological issue: what kind of mathematics is at stake? What is valued?

Let us be clear on this point: it is not our purpose to deny the interest of the question, nor its relevance in a mathematical test, nor even the legitimacy of building scales which may be of some usefulness to policymakers. What is raised here is the need for complementary qualitative studies, which could analyse students’ procedures more deeply from a mathematical point of view.

Item 3 requires comparing two rates of variation. In this instance it may lead to comparing the growth of the derivatives of functions f such that f(n) = n² and g such that g(n) = 8n, and, finally, to comparing the second derivatives.
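Spelled out (my own formalisation of the route just sketched; in the released unit n² counts the apple trees and 8n the conifers, and the item expects only an informal argument):

\[ f(n) = n^2,\quad g(n) = 8n:\qquad f'(n) = 2n > 8 = g'(n)\ \text{for } n > 4,\qquad f''(n) = 2 > 0 = g''(n), \]

so from n = 4 onwards the first quantity grows faster than the second.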

Here again students are not supposed to know derivatives; they should just have a sound and personal approach to the question. Several procedures are possible that have different mathematical values, but they are all credited in the same way.

Note that the question is by no means trivial, and it is not too surprising that so few students across the world are able to cope with it.


The apples question has been used in an EVAPM study at the grade 10 level. The results of this setting also appear in the rectangles.

The 6 % success rate (France) and the 4 % success rate (Finland) concern only a correct mathematical procedure. Those rates have to be compared with the 11 % obtained in Japan and also with the 11 % obtained by EVAPM in France at grade 10.

For all countries but one, the Item 3 success rate ranges from 2 % to 12 %. The only exception (Korea, at 24 %) deserves further examination.

There is also an interesting point coming out of international studies (TIMSS shows it similarly): real mathematical difficulties, meaning difficulties linked to the concepts and not only to the presentation or the wording, seem to be experienced in the same way all over the world.

The Bookshelves Example

The following question is a typical case of a question not fitting the current French mathematical curriculum; more precisely, it would be considered more appropriate at the primary school level.
being more appropriate at the primary school level.


[The released “Bookshelves” item is displayed here.]

At the same time everyone in France (and especially French mathematics teachers) would expect 15-year-old students to be able to solve this problem. The success rate is more than 10 percentage points higher in Finland than in France, which illustrates what has been said about realistic questions.

But is it a mathematical question? Or should any question using numbers be considered a mathematical question? In some countries (especially in France) this question would more likely be asked in the technology subject area.

A mathematical solution could be:

N = Min(⌊26/4⌋; ⌊33/6⌋; ⌊200/12⌋; ⌊20/2⌋; ⌊510/14⌋)

where N is the maximum number of sets of bookshelves the carpenter can make, and where ⌊x⌋ stands for the integer part of x.
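As a quick check of the formula (a sketch of mine; the stock and per-set quantities are the ones appearing in the ratios above, with component names as in the released item):

```python
# Each ratio is available stock // amount needed for one set of bookshelves.
stock_and_needs = [(26, 4),    # long panels
                   (33, 6),    # short panels
                   (200, 12),  # small clips
                   (20, 2),    # large clips
                   (510, 14)]  # screws
n_sets = min(have // need for have, need in stock_and_needs)
print(n_sets)  # -> 5; the short panels (33 // 6 = 5) are the limiting stock
```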

Once again, students are not expected to write this complex formula. In fact, they proceed by a trial-and-error method. Meanwhile, if they had to prove their result, they would be forced to write down in everyday language the content as well as the meaning of this formula, which would be even more difficult than writing the symbolic formula.

Fortunately for PISA, no student thinks of using such a formula (neither would we, other than for this paper!), so the international results are quite high, ranging mostly from 50 % to 70 %.

But is it still mathematics? Can these kinds of realistic questions be a good preparation for more abstract mathematics? As many educational systems tend to ask teachers to stress realistic mathematics, the question is surely worth raising.

This question, along with many others, points out the weak stress given by PISA to proof (and what is mathematics without proof?). Even explaining and justifying are not much valued by the PISA marking scheme. This marks a great difference from the customary French conception of mathematical achievement.

The idea of proof is not the only main feature of mathematics that is largely absent from the PISA questions. There is a lack of any symbolism, and what is sometimes labelled as algebra (especially in national reports) is usually no more than the use of letters as substitutes for numbers, without any perspective of using them in direct computations. The PISA design insists on real life, on the concrete aspects of mathematics. So it consciously misses several fundamental aspects of the mathematical world.


Toward didactical analysis

The preceding remarks lead directly to a central didactical question: which sequence of teaching situations can help students gain proficiency both in mathematical literacy (partly common-sense knowledge) and in abstract and symbolic mathematics?

Some people would assume that the question is not relevant and that there is a continuum from common-sense knowledge to theoretical knowledge. On the contrary, we think that the whole body of work of the so-called “French didactics school” has led us to think that ruptures are necessary and constitutive of learning. So we may fear that putting too much stress on real life and concrete situations may in return have some negative effects.

Here is an example.

The Coloured Candies Question

This question belongs to the uncertainty field, and a “probability” value is requested. Probability is not part of the curriculum followed by 98 % of the French students at age 15; meanwhile, they perform at the same level as other OECD students.

We obtained the same kind of result with TIMSS at age 13. While probability was not in the curriculum, French students performed better than students in countries where probability was considered part of the curriculum. Other observations (EVAPM) show that when students have just been introduced to probability concepts (at least at the outset), they find this kind of question more difficult to answer than when they have not been taught the subject.
than when they have not been taught the subject.


[The released “Coloured candies” item, with its diagram, is displayed here.]

Once again, we can talk of common knowledge: understanding the diagram, counting the total number of candies (30), noting that 6 of them are red, and finally interpreting the 6 chances out of 30 as a probability value.
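Spelled out, the expected reasoning is nothing more than restating the count just described as a ratio:

\[ P(\text{red}) = \frac{6}{30} = \frac{1}{5} = 20\,\%. \]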

These are common language and preconceptions about a mathematical concept. Stressing this kind of task, particularly in an MCQ format, and allowing students (and many others) to think they have acquired some knowledge of probability may surely lead to serious misunderstandings.

Many other questions deserve this kind of examination.

A good example is given by a question that we are not allowed to display here (an unreleased question). The only point of this question is to identify an oblique line as being longer than a perpendicular one. Everybody feels this and can use this fact, even if they do not formally know it, and especially if they have not been taught it. Even dogs behave as if they know this fact.


The amusing point is that this question has been identified, at least in the French official report, as assessing the Pythagorean theorem! This is a widespread confusion between the fact that common sense may be mathematised and integrated into mathematical theories, and the assumption that students’ ability to make good use of this common sense proves something about their theoretical knowledge.

Some conclusions

PISA has gathered a huge amount of quality data across countries, which opens the way for further research. Aside from edumetric studies focusing on marks and scales, there is room for many interesting qualitative studies (more precisely, for studies articulating quantitative and qualitative approaches).

A large amount of resources has been put into the PISA studies, as well as a great variety of commitment and expertise, and it would be disappointing if students were not the primary beneficiaries of these contributions.

In this paper we have attempted to demonstrate that certain precautions should be taken when interpreting and using the PISA results, at least in mathematics. Nevertheless, on the whole, the PISA studies are worth being taken seriously. They can bring teachers new questions and new ideas that can help them move towards a way of teaching that fits the needs of our societies while preserving the values of which they are the conveyors. This balance is difficult to obtain; however, weak, flawed or biased interpretations of the general PISA implications and results will not help.

This paper is particularly aimed at attracting scholars’ attention and justifying the idea that some complementary studies should be undertaken by and within the mathematics education research community (and not, as is often the case, only processed and interpreted by officials strictly controlled by political bodies).

The PISA studies may help scholars in different countries distance themselves from their national or regional places of origin and acquire a more comprehensive understanding of the teaching and acquisition of mathematics for future citizens, consumers and – above all – for the advancement of mankind.

References

Adams, R.J. (2003): Response to “Cautions on OECD’s Recent Educational Survey (PISA)”. Oxford Review of Education, 29(3).

Anderson, L.W. (2001): A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s Taxonomy of Educational Objectives. Longman.

Bodin, A. & Capponi, B. (1996): Junior Secondary School Practices. In: A. Bishop & C. Laborde (eds), International Handbook of Mathematics Education, Chapter 15, Teaching and Learning Mathematics, pp. 565-613. Kluwer Academic Publishers, Dordrecht.

Bodin, A. (2003): Comment classer les questions de mathématiques? Communication au colloque international du Kangourou, Paris, 7 novembre 2003. Article à paraître.

Bodin, A., Straesser, R. & Villani, V. (2001): Niveaux de référence pour l’enseignement des mathématiques en Europe / Reference Levels in School Mathematics Education in Europe – International report.

Bodin, A. (1997): L’évaluation du savoir mathématique – Questions et méthodes. Recherches en Didactique des Mathématiques, Éditions La Pensée Sauvage, Grenoble.

Bottani, N. & Vrignaud, P. (2005): La France et les évaluations internationales. Haut Conseil de l’Évaluation de l’École.

Clarke, D. (2003): International Comparative Research in Mathematics Education: Of What, By Whom, for What, and How. Second International Handbook on Mathematics Education, Kluwer Academic Publishers.

Cytermann, J.R. & Demeuse, M. (2005): La lecture des indicateurs internationaux en France. Haut Conseil de l’Évaluation de l’École.

Demonty, I. & Fagnant, A. (2004): Évaluation de la culture mathématique des jeunes de 15 ans (PISA). Ministère de la Communauté Française, Bruxelles.

Dupé, C. & Olivier, Y. (2005): Ce que l’évaluation PISA 2003 peut nous apprendre. Bulletin de l’APMEP no. 460, octobre 2005.

French Ministry of Education (2007): L’évaluation internationale PISA 2003 . . . dossier no. 180 de la Direction de l’Évaluation, de la Prospective et de la Performance (DEPP).

Freudenthal, H. (1975): Pupils’ achievements internationally compared – The IEA. Educational Studies in Mathematics, 1975.

Gras, R. (1977): Contributions à l’étude expérimentale et à l’analyse de certaines acquisitions cognitives et de certains objectifs didactiques en mathématiques. Thèse, Université de Rennes.

Lemke, M., Sen, A., Pahlke, E., Partelow, L., Miller, D., Williams, T., Kastberg, D. & Jocelyn, L. (2004): International Outcomes of Learning in Mathematics Literacy and Problem Solving: PISA 2003 Results From the U.S. Perspective (NCES 2005-003). Washington, DC: U.S. Department of Education, National Center for Education Statistics.

Lie, S. et al. (2003): Northern Lights on PISA. Unity and Diversity in the Nordic Countries in PISA 2000. University of Oslo, Norway.

Meuret, D. (2003): Considérations sur la confiance que l’on peut faire à PISA 2000. Intervention au colloque international de l’Agence Nationale de Lutte Contre l’Illettrisme sur l’évaluation des bas niveaux de compétences, Lyon, 5 novembre 2003.

Meuret, D. (2003): Pourquoi les jeunes Français ont-ils à 15 ans des performances inférieures à celles des jeunes d’autres pays? Revue Française de Pédagogie, no. 142, 89-104.

Note DPD 04.12 (décembre): Les élèves de 15 ans – Premiers résultats de l’évaluation internationale PISA 2003.

OECD (2004): Problem Solving for Tomorrow’s World: First Measures of Cross-Curricular Competencies from PISA 2003.

OECD (2004): PISA 2003 Technical Report.

OECD (2004): First Results from PISA 2003. Executive Summary.

OECD (2004): Learning for Tomorrow’s World: First Results from PISA 2003.

OECD (2004): PISA 2003 Assessment Framework – Mathematics, Reading, Science and Problem Solving Knowledge and Skills.

Orivel, F. (2003): De l’intérêt des comparaisons internationales en éducation.

Robert, A. (2003): Tâches mathématiques et activités des élèves: une discussion sur le jeu des adaptations introduites au démarrage des exercices cherchés en classe de collège. Petit x, no. 62.

Varcher, P. (2002): Évaluation des systèmes éducatifs par des batteries d’indicateurs du type PISA: vers une régression des pratiques d’évaluation dans les classes.

Addresses and contacts

APMEP, with access to EVAPM documents as well as to a presentation displaying the released PISA questions with some results: http://www.apmep.asso.fr/spip.php?rubrique114 (presentations in English and in French).

Reference Levels in School Mathematics Education in Europe: http://www-irem.univ-fcomte.fr/Presentation_ref_levels.HTM and http://www.emis.de/projects/Ref/

IREM de Franche-Comté: http://www-irem.univ-fcomte.fr/

French official reports: http://www.educ-eval.education.fr/pisa2003.htm

International frameworks and reports: http://www.pisa.oecd.org/

Antoine Bodin’s personal website: http://web.mac.com/antoinebodin/iWeb/Site_Antoine_Bodin

Email address: bodin.antoine@nerim.fr


ANNEXES

Annexe 1: Taxonomy of cognitive demands for designing and analysing mathematical tasks – ordered by integrated level of complexity

Simplified version – see complete taxonomy on the Web (in French)

Main categories | Sub-categories
A Knowing and recognising . . . | A1 Facts; A2 Vocabulary; A3 Tools; A4 Procedures
B Understanding . . . | B1 Facts; B2 Vocabulary; B3 Tools; B4 Procedures; B5 Relations; B6 Situations
C Applying . . . | C1 in simple familiar contexts; C2 in moderately complex familiar contexts; C3 in complex familiar contexts
D Creating . . . | D1 as mobilising known mathematical tools and procedures in new situations; D2 new ideas; D3 personal tools or procedures
E Evaluating . . . | E1 as issuing judgements about external productions; E2 as assessing one’s own knowledge, process and results

Taxonomy designed by Antoine Bodin, with full acknowledgment to R. Gras’ seminal work as well as to L. W. Anderson’s later influence.


Annexe 2: Competency classes for designing and analysing mathematical tasks – ordered by integrated level of complexity

Simplified version – see expanded version in OECD documents on the Web

Level | OECD definition | In short
1 Reproduction | “The competencies in this cluster essentially involve reproduction of practised knowledge . . .” | Reproduction
2 Connection | “The connection cluster builds on the reproduction cluster competencies in taking problem solving to situations that are not simply routine, but still involve familiar or quasi-familiar settings.” | Simple mathematisation
3 Reflection | “The competencies in this cluster include an element of reflectiveness . . . about the processes needed or used to solve a problem. They relate to students’ abilities to plan solution strategies and implement them in problem settings that contain more elements and may be more ‘original’ (or unfamiliar) than those in the connection cluster . . .” | Complex mathematisation (to modelisation)


Annexe 3: PISA 2003 and 2000 – Analysed Question Set

Together with some other non-released questions taken into account for this paper, the whole analysis covers about 70 % of the PISA material (60/85).

PISA code | Item name | Mathematical content | Taxo | C | Remarks
M037Q01 | Farms 1 | Pyramid – square area | B6 | 1 | PISA 2000 only
M037Q02 | Farms 2 | Middle of the sides of a triangle | C1 | 2 | PISA 2000 only
M124Q01 | Walking 1 | Using letters and formula | C1 | 2 | & PISA 2000
M124Q02 | Walking 2 | Using letters and formula – Units . . . | B5 | 2 | & PISA 2000
M136Q01 | Apples 1 | Completing charts | B6 | 3 | & PISA 2000 & EVAPM
M136Q02 | Apples 2 | Equation | C1 | 2 | & PISA 2000 & EVAPM
M136Q03 | Apples 3 | Don’t fit | D1 | 3 | & PISA 2000 & EVAPM
M145Q01 | Cubes | Cube | B5 | 2 | & PISA 2000
M148Q02 | Continent area | Area | D1 | 3 | PISA 2000 only & EVAPM
M150Q01 | Growing up 1 | Reading graphs | B5 | 2 | & PISA 2000
M150Q02 | Growing up 2 | Reading graphs | B5 | 1 | & PISA 2000
M150Q03 | Growing up 3 | Reading graphs | B5 | 1 | & PISA 2000 – Gender bias?
M155Q02 | Number cube | Cube | B5 | 2 | & EVAPM
M159Q01 | Speed of a car 1 | Interpreting graph | B6 | 2 | PISA 2000 only
M159Q02 | Speed of a car 2 | Reading graph | A3 | 1 | PISA 2000 only
M159Q03 | Speed of a car 3 | Interpreting graph | B3 | 1 | PISA 2000 only
M159Q04 | Speed of a car 4 | Interpreting graph | D1 | 2 | PISA 2000 only
M161Q01 | Triangles | Constructing geometrical figures | B5 | 1 | PISA 2000 only
M179Q01 | Robberies | Bar charts | E1 | 3 | & TIMSS & PISA 2000 & EVAPM
M266Q01 | Carpenter | Perimeter of a rectangle | D1 | 2 | & PISA 2000 – Gender bias?
M402Q01 | Internet relay chat 1 | Don’t fit | D1 | 2 | Gender bias?
M402Q02 | Internet relay chat 2 | Don’t fit | D1 | 3 | Gender bias?
M413Q01 | Exchange rate 1 | Proportionality | C1 | 2 |
M413Q02 | Exchange rate 2 | Proportionality | A4 | 1 |
M413Q03 | Exchange rate 3 | Proportionality | C1 | 2 |
M438Q01 | Export 1 | Bar charts | A3 | 1 |
M438Q02 | Export 2 | Circle charts – Percentage | C1 | 1 |
M467Q01 | Coloured candies | Don’t fit | C1 | 1 | Probability
M468Q01 | Science test | Mean | C1 | 2 |
M484Q01 | Bookshelves | Don’t fit | D1 | 2 | & EVAPM – Gender bias?
M505Q01 | Litter | Bar charts | B6 | 2 | ? Huge diff FRA–FIN
M509Q01 | Earthquake | Don’t fit | B5 | 2 | Probability
M510Q01 | Choice | Don’t fit | D1 | 3 | Combinatorics – translation pb
M513Q01 | Test scores | Bar graph | | |
M520Q01 | Skateboard 1 | Don’t fit | C1 | 2 | EVAPM
M520Q02 | Skateboard 2 | Don’t fit | C1 | 2 |
M520Q03 | Skateboard 3 | Don’t fit | D1 | 3 |
M547Q01 | Staircase | Division | A4 | 1 |
M555Q02 | Number cubes | Cube | B5 | 2 |
M702Q01 | Support for president | Don’t fit | B6 | 2 |
M704Q01 | Best car 1 | Reading charts | C1 | 2 |
M704Q02 | Best car 2 | Reading charts | D1 | 3 |
M806Q01 | Step pattern | Don’t fit | A1 | 1 |

PISA 2003: 85 items, 31 released. PISA 2000: 32 items, 11 released.

Annexe 4: A typical mathematical examination at the final year of middle school

Each question is listed with its taxonomic level (Taxo), its competency class (Comp) and a remark.

Part I – Numerical activities
Numbers, Ex 1: 1) A4, 1, Formal and unrealistic – 2) A4, 1, id – 3) A4, 1, id – 4) A4, 1, id
Data, Ex 2: 1) B5, 1, Pseudo-realistic – 2) A4, 1, id – 3) C1, 1, id – 4) A2, 1, id
Numbers, Ex 3: 1)a A2, 1, Formal and unrealistic – 1)b A4, 1, id – 1)c A4, 1, id – 1)d A4, 1, id
Numbers – Arithmetic, Ex 4: 1) C1, 1, Formal and unrealistic – 2) C1, 1, id – 3) C1, 1, id

Part II – Geometrical activities
Space geometry, Ex 1: 1)a B1, 1, Unrealistic – 1)b B5, 1, id – 2)a A4, 1, id – 2)b B5, 1, id – 3) A4, 1, id – 4) A4, 1, id
Plane geometry – Proof – Thalès, Ex 2: 1) C1, 2, Formal and unrealistic – 2) C1, 2, id
Plane geometry – Proof – Pythagore, Ex 3: 1) A4, 1, Formal and unrealistic – 2) A4, 1, id – 3)i A4, 1, id – 3)ii A4, 1, id

Part III – Problem
Geometry – Pythagore – Trigonometry, Part I: 1) A4, 1, Pseudo-realistic dressing – 2)i A4, 1, id – 2)ii A4, 1, id – 3) A4, 1, id
Linear functions – inequations, Part II: 1)a i A3, 1, id – 1)a ii A3, 1, id – 1)b A3, 1, id – 2)a A2, 1, id – 2)b C1, 2, id – 3)a B5, 1, id – 3)b A4, 1, id – 3)c A4, 1, id
Scale, area – volume, Part III: 1) A2, 1, id – 2) A4, 1, id – 3) C1, 2, id – 4)i C1, 2, id – 4)ii C1, 2, id



Annexe 5: The examination in focus: Brevet 2005 – South of France

Wording and presentation count for 4 marks out of 40.
Handheld calculators allowed.
Test duration: 2 hours

Part I: Numerical activities



Part II: Geometrical activities


Part III: Problem




Annexe 6: Comparing PISA with the French curriculum



Annexe 7: Comparing a customary French examination with the French curriculum


Test Ability – What Is It?

Wolfram Meyerhöfer

Germany: Freie Universität Berlin

This article explores the problem of test ability (in German: Testfähigkeit) using the example of mathematics achievement tests. "Test ability" denotes those items of knowledge, abilities and skills that a test co-captures or co-measures, but that one would not subsume under the notion of "mathematical proficiency". The article first explores why the topic of test ability is of considerable importance in connection with tests. Drawing on items from TIMSS and PISA, it then uses subject-didactical and objective-hermeneutic item interpretations to work out which empirical phenomena constitute the problem of test ability. It turns out that test ability runs counter to the idea of mathematical education.

1 Test Ability in Mathematics Education

The concept of test ability¹ has so far been treated only marginally in German-speaking mathematics education, and has never been seriously discussed.

1 I will not offer definitions of terms in this paper: the method of objective hermeneutics that I use follows Wittgenstein's insight that the meaning of a text discloses itself solely from how it is used. I follow this insight with regard to the technical terms employed as well. I do not want to narrow the terms I use – test ability, education (Bildung), standardized achievement test, etc. – in advance by casting them into a definition. Nor, fundamentally, do I want to broaden them. I want to deepen them. Every member of the language community – all the more so of the specialist language community – has immediate access to these terms in their full breadth and variety (if need be via dictionaries and encyclopedias). My concern is to add to this "familiar ground" some of what has so far remained undisclosed about test ability. For this process of gaining insight it makes sense to carry along "everything the concept drags around with it". Methodologically this is connected to the fact that with objective hermeneutics one in any case reconstructs empirically anew on the one hand, and on the other processes all this "baggage" in the act of storytelling.



This may have to do, for one thing, with the concept appearing self-explanatory and therefore uninteresting: "test ability" denotes those items of knowledge, abilities and skills that are co-captured (in non-standardized achievement tests) or co-measured (in standardized achievement tests) by a test, but that one would not subsume under the notion of "mathematical proficiency". Especially where dimensions are involved that appear only because a test is being taken, the designation "test ability" or "test abilities" seems apt.

A second reason for the hitherto rather limited interest may be that only the high claims with which standardized achievement tests have been pushing into all spheres of social practice in recent years have brought the peculiar internal logic of these instruments into view. "Conventional" school achievement tests (class tests, written examinations, etc.) of course co-test test abilities as well. The fuzziness of these instruments, however, is in principle undisputed. That is why a student can argue with the teacher about the scoring of a class test, why parents can appeal against an examination grade, why no employer will hire an apprentice on the basis of grades alone, and why hardly anyone disputes that procedures for allocating university places on the basis of Abitur certificates are substantively problematic. Standardized achievement tests claim to avoid such fuzziness. They are therefore subject to a stricter demand – compared with school achievement tests – not to co-measure test ability. Their increased relevance as an instrument of power and of allocating future life chances draws attention to the fact that this claim is missed.

A third reason I want to formulate only as an impression: mathematics education – or rather, mathematics educators – have not yet habitually completed the transformation from teacher to scientist. In its sharpness this impression is visibly false, but formulating it is fruitful for reading the empirical reconstructions that follow. One recognizes this whenever one is inclined to say: "But it is like that in school too, so where is the problem?" One cannot then reject the reconstruction of the problem – as one experiences again and again precisely in discussions about tests – rather, one has then intuitively rediscovered in the test items something one knows from school.




To reject an interpretation and the reconstruction of a problem carried out in it, one must instead criticize the interpretation itself. One cannot simply assume that the interpretation is wrong because one does not like the result. One has thus intuitively gained insight about school where one wanted to gain insight about tests. The transition from teacher to scientist consists precisely in wanting insight, and therefore admitting it, instead of labelling what one finds as normal and justifying it before one has admitted any insight at all. Put more abstractly: being a scientist means distancing oneself far enough from the practice under investigation so as not to fall prey oneself to the interpretive patterns of that practice. Doing science means precisely not reproducing these interpretive patterns, but understanding them, exposing their implicit assumptions, contradictions, misreadings, distortions and so on – in short: reconstructing their interpretive patterns.

Part of this, where test ability is concerned, is the multitude of justifications claiming that test ability "belongs", that these abilities can be traced back to educational goals or have some other intrinsic value. In the empirical reconstructions this claim proves to be superficial, often false and mostly cynical.

A fourth reason for the slight interest of German-speaking mathematics education in test abilities may be a certain fear of consistently tracing and thinking through the relation of theory and practice: the components of test ability reconstructed below are also found in school achievement tests. There too they ought to be avoided. School achievement tests are, to be sure, not standardized achievement tests. They are thus not subject to the claim of being scientific instruments and hence of naming precisely what they measure – and of avoiding the co-measurement of test ability. But the cynicisms and the damage done to mathematical education by those item properties that are disclosed below under the focus of test ability point to general problems of mathematics teaching.

If, as a mathematics educator, I want to point beyond the mere reconstruction of the problem, I must therefore show what the teacher can do differently in constructing school tests, without thereby becoming a scientific test constructor. It is only too understandable that mathematics education has so far rather not wanted to pursue this difficult problem.



I too circumnavigate this problem for the time being by restricting myself to standardized achievement tests. From the reconstructions that follow, however, there already emerges the hypothesis that reconstructing and working on habitus patterns in the professional development of teachers also offers an approach to the problem of test ability.

A fifth reason for mathematics education's abstinence from the topic of test ability so far may have been methodological problems: one can analyze the co-measurement of test ability without methods, but it demands a good feel for the latent and considerable distance from one's own product. In addition, without methodological backing one faces considerable problems of legitimation, especially when tests are produced in culture-industrial contexts (cf. Meyerhöfer 2006).

With my doctoral thesis (Meyerhöfer 2004 a, 2005) I introduced the method of objective hermeneutics into mathematics education. It makes it possible to reconstruct latent text elements in a methodically controlled way, and it forces one to interpret the text systematically rather than reading one's own intentions, or the intentions of the test constructor, into the text. The method proves to be a fruitful instrument for reconstructing test abilities.

In the English-speaking world, with its long tradition of attempts to survey the human mind, a debate about test ability has naturally been under way for some time. In a positivist tradition of thought that takes measurement as the yardstick of cognition, the phenomena accompanying measurement are themselves subjected to measurement. Hembree (1987), for example, subjected 120 research studies involving mathematics achievement tests to a meta-analysis in order to examine the influence of "noncontent variables" on test performance. If one adheres to the measurement paradigm and wants to construct a mathematics achievement test, one will certainly find the influence of non-content "variables" satisfactorily and exhaustively explored here; most of the debates that are relatively new in the German-speaking world – about test formats, writing styles, item arrangements and so on – also receive a quantitative analysis here, if not in as subtle a manner as in Wuttke's (2007) analysis of PISA.

Among other things, Hembree examines a "variable" that is often conceptually equated with test ability, namely testwiseness: "Testwiseness refers to a testee's ability to use the features and formats of the test and test situation to make a higher score, independent of knowledge of content (Millman, Bishop & Ebel, 1965). The comparison of scores by a group trained in test taking and an untrained group is the effect related to testwiseness." (Hembree 1987, p. 201)



In his meta-analysis, Hembree finds that training testwiseness raises test performance.

The concept of testwiseness seems to me to open up too little to really grasp the problem of test ability. I am not concerned only with the "features and formats of the test and the test situation", which remain external in principle and then also in the quantitative analysis. This is because these concepts are meant categorially. I mean more when I speak of those items of knowledge, abilities and skills that a test co-captures or co-measures, but that one would not subsume under the notion of "mathematical proficiency"; and it will become apparent that what is reconstructed below cannot be categorized. Above all, I also speak of how the co-measurement of test abilities interweaves the measurement of mathematical abilities and damages mathematical education.

The analysis of the items presented below points to the place that the debate about test ability systematically neglects, but where the problem is first generated: test ability is only secondarily an ability of the individual. Test ability is first of all something contained in the item, for the item is the primary site where the validity of the test statement is produced: only once I have understood what the item measures does it make any sense at all to turn one's gaze to the individual being measured. That is why the question of what test ability is will here be dissected out of the items.

2 The Significance of Test Ability within the Debate about Tests

2.1 The Scientific Claim of Tests, and Test Ability

The concept of test ability used here refers only to standardized (mathematics) achievement tests. These tests claim to cure the relativity of performance assessment in school, that is, to provide a less relative or non-relative measure of (mathematical) proficiency. The reduction of subjectivity is obvious, since (i) the multiple-choice format reduces the subjectivity of interpreting the student's "answer" essentially to zero (the scanner makes no subjective decision about whether to accept a blob of ink as "ticked", and the programmer who sets the threshold value thereby makes no subjective decision about the performance to be assessed either), since (ii) in the rating of semi-open or open student answers far fewer subjects – and thus subjectivities – are involved than when every teacher marks for himself, and since (iii) the training of raters hopefully really does reduce the subjectivity of the ratings.



However, reduced subjectivity does not yet amount to reduced relativity of performance assessment in comparison with school. It must be examined whether the reduced subjectivity also leads to a more precise, more comparable measurement of performance, one that is more truthful with respect to the performance demands (perhaps "more valid"). That this is scarcely possible in principle, and is not the case for TIMSS and PISA in particular, I have discussed at length in Meyerhöfer (2004 a) and (2005).

Only from the claim of being able to cure the relativity of performance assessment does the necessity arise at all of discussing the co-measurement of test ability: if abilities are co-measured that are not the mathematical abilities to be measured, then these abilities must be named, and they must be examined as to whether they are desired. The claim of an instrument to be scientific points precisely to the obligation to make explicit what the instrument captures. I refrain here from discussing banalities, e.g. the co-measurement of verbal abilities or of the ability to start working at all.

One could now also commit oneself to the position that it is desirable to co-measure test abilities – e.g. the ability, in a mathematics item, to shovel aside a substantively meaningless accumulation of text mass in order to get to the mathematical problem, or the ability to evade the mathematical demand through sheer impudence and still obtain the point. In the following considerations, however, I assume that TIMSS and PISA are meant to measure exclusively mathematical abilities.

The additionally co-measured abilities can – once one has recognized them – be attributed to the measurement construct. But one then becomes entangled in problems of fairness and of the purpose of testing:

i) It is problematic in itself that test ability appears as mathematical ability.

ii) The more non-mathematical abilities one is prepared to co-measure, the more broadly and precisely it must be discussed what one is measuring and why one wants to measure it.

iii) One is, moreover, in danger of losing oneself in the arbitrariness of what is to be measured, and of no longer deriving what is to be measured from a desired performance construct, but of measuring everything "that the items happen to co-measure".



That was the procedure at TIMSS and PISA². One is thereby not far removed from the measurement fuzziness of a conventional class test, and thus loses the essential rationale for standardized achievement tests.³ One even worsens the student's position, for he can cure the imponderables in the wording of class-test items through the implicit or explicit knowledge about the teacher acquired in class – if need be, he can even ask.⁴ The imponderables produced by test ability in standardized tests cannot be dealt with in this way.

One should distinguish the concept of test ability from the ability to do well on a test in the sense of a class test (the latter ability is an explicit component of school performance): this ability, too, is partly a matter of successfully inferring (and that sometimes means: guessing or divining) what the teacher means by his question, and at what depth or on what level the task is to be fulfilled.

2 cf. the account of the test construction in Meyerhöfer (2004 a, pp. 98 f. and 139-157) or in Meyerhöfer (2005, chapter 4)

3 These considerations show that the deficit of school performance assessment lies only apparently in a lack of standardization. A test that co-measures "all sorts of things" can nevertheless be highly standardized. But it has almost the same measurement fuzziness as a class test. Here an error reproduces itself that we also encounter frequently in the research process, namely the belief that high standardization leads to more precise or "better", broader, deeper or at least more generally valid insights. Standardization, however, initially only ensures that all members of a population are subjected to the same conditions with respect to certain aspects. This does mean that certain framework conditions (or, for the test constructs: certain dimensions of a multidimensional causal construct) are constructed identically for all members. But it by no means follows that the production of validity thereby becomes more precise, better, broader, deeper, less ambiguous or at least more generally valid. Standardization rather passes the problem of producing validity by – although certain elements of standardization can of course support it.
An eloquent example of this problem is the PISA test: one can examine 180,000 students in a highly standardized way. If it remains unclear what is actually being measured, the test statement remains limited. Even the high voyeuristic value of a country ranking results not from high standardization, but merely from the large number of participants.

4 Nikola Leufer (U Dortmund) has drawn my attention to the fact that, conversely, a teacher who knows his student well can take the student's "test inability" with respect to a class test into account when marking, i.e. can cure it, as it were, by "good-natured marking". In general one might assume: test abilities are co-captured in school as well – but do not have such strong consequences there. Seen differently: it is an integral part of professional competence (and hence not technizable) to keep these consequences small.



In the classroom, however, imparting this ability is an explicit component of the instructional process: teaching is per se a non-standardized affair. It is thus exposed to all the advantages and disadvantages of non-standardization. This also leaves its mark on classroom tests. The resulting relativity of grading admits two polarized conclusions: on the one hand, one can strive for greater standardization of performance assessment. On the other hand, this relativity can be an occasion to meet grades with a certain equanimity, that is, among other things, to relativize their role in the allocation of future life chances as well. In any case it should be an occasion to face the multilayered causes of school success – for grades are the established and probably the best quantitative measure of school success⁵. Performance is only one cause of school success, and school success feeds back on performance in manifold ways. If one wants to give the performance principle greater force in school (and that is an implication of the trend towards testing), then the coupling of school success to performance must be secured. If, on the other hand, standardized tests co-measure abilities other than those to be performed, this in turn means a departure from the performance principle – only that now other non-performance criteria enter than in the classroom.

5 This claim would require a deeper argument that cannot be provided here. The direction of the argument would be roughly the following: if one wants to construct a measure of school success, one must define school success and transform it into a measurement construct. The attempt would be fraught with measurement fuzziness and other construction problems. Even addressing "school success" would lead to insurmountable difficulties: different social groups make different demands on "school success", the variety of school tasks would have to be brought into a weighted form, and so on. The school grade is the attempt to carry out such an overall "measurement". The "measurement construct" has emerged in a long process in which interests inside and outside school have flowed into the construct. It is hardly possible to survey which implicit and explicit elements converge here. It is, however, a construct of astonishingly high social acceptance: although the problems of the "measurement fuzziness" of grades are sufficiently well known, grades are still the primary instruments for allocating future chances in post-school fields.
In this connection it is remarkable that no study exists on the relation between grades and test performance in PISA, although the grades were collected. One can easily imagine that tests would quickly be regarded as superfluous if it turned out that the ordinal ordering is preserved, and that it would primarily be the tests that would be problematized if the ordinal ordering is not preserved.


2.2 Test Ability and Educational Goals


Tests set standards. They do so all the more, the more relevant they are for the allocation of future life chances. But they also do so by presenting themselves as scientific instruments. These standards strike through right into the classroom. It is therefore problematic when items appear in tests that one can solve without really having to possess the ability that is supposed to be tested. Conversely, it is likewise problematic when one cannot solve an item (correctly in the testers' sense) although one possesses the ability or abilities.

For the teacher it is difficult to recognize and remedy elements of test ability in the items of tests, educational standards and so on if he wants to draw exclusively on the mathematical abilities: the teacher works under pressure to act and understandably relies on standardized mathematics achievement tests really testing mathematical performance. Scientists thus bear a certain responsibility for their instrument.

In the debate about test abilities one encounters the argument that some test abilities are perfectly suitable as educational goals, or correspond to them. This argument will be discussed below with reference to the components of test ability reconstructed in the interpretations. It turns out, in essence, to be poorly founded and cynical.

2.3 Equality of Opportunity and Test Ability

Test fairness is violated when parts of the population to be measured could not acquire parts of the measured abilities, or could acquire them only to a lesser degree than other parts of the population. This can be the case, for example, when content is tested that was not taught at all in one of the school types being measured. It can also be the case when a real-world context is used that is completely unknown to one group but familiar to another. Ideal test fairness cannot exist, and one must face this in the interpretation of test data.

With respect to test ability, test fairness is violated when tests co-measure test abilities while at the same time parts of the population to be measured had more opportunity than other parts of this population to acquire these test abilities. Thus there was a brief but intense debate about test ability when the first TIMSS results were published in Germany in 1997. It was pointed out in particular that the US and Asian "national teams" had been far better prepared for the test, because in those countries a pronounced "culture" of testing for the allocation of future chances prevails.



It was thus assumed that the Asian and US parts of the measured population had had more opportunity than the German participants to acquire test abilities. Two polarized conclusions were drawn from this: on the one hand, the conclusion to take the lacking validity into account when interpreting the results, and perhaps even to forgo such tests. On the other hand, the conclusion to drill the German participants just as intensively in test abilities.

3 The Success of Test Training

Meanwhile, the practice of the German school system is tending towards increased testing of German students as well. The problem of test ability, however, is now scarcely discussed. In the earlier debate, the TIMSS group still felt compelled to claim that such tests cannot be trained for⁶ (Baumert et al. 2000, p. 108, in reply to Hagemeister 1999). Put pointedly, this would mean that test ability in the sense meant here does not exist, for test training is not about training the mathematical abilities, but about those abilities that, alongside the mathematical ones, ensure test success. Put somewhat plainly: test training (as an ideal type) does not ask the question: which mathematical abilities do we still have to elaborate? It asks the questions: how do testers tick? How does the test tick? How must you tick in order to get through as well as possible? As a counter-type one can construct practising, which asks the question: which mathematical abilities do we still have to elaborate? The construction as ideal types makes clear that some practising also contains elements of test training, and that some test training also contains elements of practising.

Since I established in my dissertation that, and how, TIMSS and PISA co-measure test abilities, I examined there in more detail how the TIMSS group arrives at the result that tests cannot be trained for, and that these test abilities therefore do not exist or can be neglected.⁷

6 The claim is made without taking account of Hembree's (1987) meta-analysis, but with a remark that kills the topic in the text: "In the USA there is a broad research literature on the limited effects of test coaching." (Baumert et al. 2000, p. 108)




It turns out that Baumert et al. (2000) present research results in a distorted way. They invoke two studies by Klieme and Maichle (1989, 1990). Klieme and Maichle developed and carried out a training for parts of the medical admissions tests. Essentially, they wanted to find out whether paid preparation courses for these tests can violate the candidates' equality of opportunity. In a certain way the question thus concerns the same debate about equality of opportunity with respect to tests as today.

Klieme and Maichle carried out a test training of six (!) hours with 21 persons. They achieved improvements in the trained components – a positive training effect thus occurred. They achieved no improvements on the actual test, but then no training for it had taken place, because for reasons of time they had trained only individual components. They then discuss the result of their training in quite a multilayered way, but conclude astonishingly: "The results of these specific . . . support measures, too, ultimately confirm the statement . . . that complex problem-solving performances in the sense of the subtests . . . cannot be trained, or only to a relatively small extent." (Klieme/Maichle 1990, p. 307) This conclusion stands in obvious contradiction to the results of the study. Perhaps it can be explained by the researchers' institutional ties to the test institute: the study was meant to find out whether test fairness can be violated by paid preparation courses. Had the answer been a "yes", or had even only the weaker "yes" of this study come out, the test would have had to be massively changed or abolished.

The TIMSS group takes up the falsified "result", although it does not even refer to the discussion about long-term effects of mass testing. The study by Klieme and Maichle is obviously put forward merely as a pretext to argue away undesired side effects of testing.

The discussion about test ability, however, concerns long-term effects on child and adolescent students whose future chances are directly or indirectly tied to test results. One must engage with these side effects, which can become main effects in the learning of mathematics, in order to be able to assess their character and their influence.

7 Meyerhöfer (2004 a, pp. 219-221; 2005, pp. 190-192)




4 Test Ability and Autonomy

This paper does not consider the training process; rather, it examines which item properties ensure that test abilities are measured alongside mathematical abilities. The aim of this examination is to help those involved to greater autonomy vis-à-vis the problem.

Training test ability seems to be one way of doing this, because test ability strengthens the student's autonomy vis-à-vis the testing process. But it also reproduces the destruction of autonomy, in that it trains the student towards abilities that lie outside mathematical abilities. The autonomy-destroying basic structure of tests cannot be circumvented⁸. It can only be broken by distanced reflection. Moreover, every test training requires a time budget that could be used for more meaningful learning content.

The expansion of autonomy can likewise take place on the side of the teacher or of the educational-policy sphere. With knowledge of components of test ability, one can decide more consciously whether one is prepared to use achievement tests that co-measure test ability for the allocation of future life chances.

But the expansion of autonomy can also take place on the side of the test developers. With knowledge of components of test ability, it can also be decided more consciously here what one wants to (co-)measure at all.

5 Test Ability – Empirical Explorations

5.1 A First Approximation

Let us first call to mind the basic structure of testing. Tests are constructed in order to capture properties of measurement objects in a measurement process. One thus first has a notion of what, in our case, "mathematical proficiency" is supposed to be. One then operationalizes this notion, that is, one creates items which, in their interplay, measure the extent to which this ability is present. The construct thus created, the "operationalized mathematical proficiency", should of course be as identical as possible with what one imagined mathematical proficiency to be before the operationalization.

8 cf. Meyerhöfer (2004 a, pp. 81-83; 2005, pp. 24-27)



The operationalized measurement construct – materialized in the form of a test booklet – meets the measurement object, that is, the student. For the student it does not matter what mathematical proficiency is, whether it has been operationalized correctly, whether the tested abilities are relevant, and so on. For the student only one thing is important: he must serve the tester's expectation. He must put his cross in the right place, he must write down the right number, he must note an answer to which the coder doing the scoring can assign a performance point. This fixes the direction for a concept of test ability: test ability is the ability to optimize one's own score within the test construct. That means in particular, first, that one is able to convert a genuinely present mathematical ability into a test point, and second, that one is able to obtain a test point even when one does not possess the mathematical ability. This shows, to begin with, that test ability becomes the more important for the individual, the more significant the test becomes for the allocation of future life chances.

But test ability also includes the ability to refuse the test in a sensible way. If, as in PISA, it is the school and not the individual that is being measured, then in the school's interest a "weak" student should refuse the test just as much as a student who is not reaching his performance optimum on that day. The school must be grateful for such a refusal and support it.⁹

9 In a review of this paper, Wolfgang Schulz (HU Berlin) suggests also treating the willingness to submit to the test, and the endeavour to complete the test as successfully as possible, as test ability. I find the suggestion fruitful, but what we have here seems to me rather to be what sociologists, following Durkheim, call the "pre-contractual foundations of the (social) contract". These pre-contractual foundations, however, are a topic of their own. I presuppose here that those tested want to do as well as possible. If the school is measured as a whole, this then implies that students and teachers understand doing well as a joint project. That my suggestion that both groups agree on the deliberate absence of test-weak students is felt to be absurd merely indicates a state in which teachers and students do not (yet?) understand themselves as a community against something external. This phenomenon, however, can only be discussed more comprehensively within the framework of a theory of the school.



5.2 Known Components of Test Ability

I want only briefly to mention general test-taking strategies that have already been presented extensively elsewhere, and to whose further description I have nothing to contribute here. These are time-management strategies, error-avoidance strategies, guessing strategies, strategies for exploiting hidden solution cues, and formal strategies for deductively inferring the supposedly correct answer:

<strong>–</strong> „Zeiteinteilungsstrategien (z.B.: das Überspringen von schwierigen Aufgaben,<br />

das Markieren von ungelösten Aufgaben oder solchen Items, bei denen<br />

man sich seiner Lösung nicht ganz sicher ist, das Markieren von Teillösungen,<br />

das Anlegen eines Arbeitspro<strong>to</strong>kolls, aus dem man ersehen kann, wie<br />

schnell man vorankommt, usw.),<br />

<strong>–</strong> Fehlervermeidungsstrategien (sorgfältiges Lesen der Instruktion, Beachten<br />

der Aufgabenstellung, Überprüfen der Antwort usw.),<br />

<strong>–</strong> Ratestrategien [ 10 ],<br />

<strong>–</strong> Strategien zur Ausnutzung versteckter Lösungshinweise (das Beachten aller<br />

Merkmale, hinsichtlich derer sich die Antworten von den Distrak<strong>to</strong>ren<br />

unterscheiden könnten <strong>–</strong> z.B. der Länge, der Position, des Stils der betreffenden<br />

Aussagen usw.)<br />

<strong>–</strong> dieBeachtung sogenannter „specific determiners“ (gemeint sind Worte wie<br />

„immer“, „niemals“, „alle“ usw., die nach Meinung der Veranstalter [von<br />

Testtrainings, W. M.] speziell die Distrak<strong>to</strong>ren, also die Falschantworten,<br />

kennzeichnen),<br />

<strong>–</strong> formale Strategien zum deduktiven Erschließen der vermeintlich richtigen<br />

Antwort (z.B. auf der Basis inhaltlicher oder formaler Abhängigkeiten<br />

zwischen den einzelnen Antwortmöglichkeiten).“ (Klieme, Maichle 1989,<br />

S. 207)<br />

Likewise, I want only to mention the following aspects, which can be inferred even without closer item interpretation: if you know nothing, you have to guess. If you have to work through multiple-choice offerings with only one correct answer, it is better to break off processing once you reach the probably correct answer and to mark the item. Only if there is time left later should you return and carry out an error check. The same holds for other uncertainties.
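To make the rationale of the guessing strategy explicit – a minimal illustration, assuming one point per correct answer and no penalty for wrong answers: on a multiple-choice item offering $k$ answer options, exactly one of which is correct, a blind guess yields an expected score of
\[ E = \frac{1}{k}, \]
so a test-taker who guesses blindly on ten five-option items can expect $10 \cdot \tfrac{1}{5} = 2$ points.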

10 For more on guessing, cf. Meyerhöfer (2004 c).



When one has to work one's way through a mass of superfluous information (e.g. the PISA item "Bauernhöfe" ["Farmhouses"] – see below), or through an accumulation of variations of the ever-same word group ("Bauernhöfe" and TIMSS item A5 – see below), or when one has to work off a collection of statements (e.g. the PISA item "Dreiecke" ["Triangles"]¹¹), then one can speak of an ability to concentrate or to persevere. One often needs this ability in life, to be sure, but it would certainly be desirable if one had to persevere because of the inner constitution and seriousness of a problem, and not because an item is badly posed, or because it measures perseverance instead of the ability actually to be tested.

5.3 Test Ability in Test Items

I now want to present components of test ability that were reconstructed in the item interpretations of PISA and TIMSS. I interpreted the items objective-hermeneutically; here only elements of the interpretations are sketched and results of the interpretations presented. The interpretations were initially guided by the question of what the items measure. This revealed not only considerable measurement problems, which led to the conclusion that both tests are unsuitable as instruments for measuring mathematical proficiency. Habitual problems also appeared¹².

The co-measurement of test ability proves to be a problem to which measurement problems as well as habitual problems adhere. The components of test ability shown here are interwoven with one another in manifold ways – in their appearance, in their character, in their background and in the causes of their occurrence. It would not do justice to the richness of the empirical reconstruction to attempt to give the components summary names, let alone to categorize them. Nor do I want to succumb to the temptation to translate the components into – then necessarily catchy – instructions to students, e.g.: Don't take the real problem seriously! Think towards the middle! This too would not do justice to the complexity of the subject matter, which is to be unfolded here, but not yet reduced.

11 cf. Deutsches PISA-Konsortium (2001, p. 178); discussion in Meyerhöfer (2004 b)

12 A manifest orientation towards technical language combined with simultaneous damage to the mathematical; distortions of the mathematical and of the real in reality-based items (failure of the intended "mediation of the real and the mathematical"); orientation towards calculation routines instead of mathematical education; the illusion of closeness to students as delusion. I have summarized this as a "turning away from the subject matter".



The headings are accordingly keywords for the phenomena worked out in each case, not names for components conceived as sharply separated. Since the components were reconstructed as item properties, the headings refer to such properties. Only at the end do I bring these properties together into "abilities".

5.3.1 Strange and Bizarre Words; Irritations; a Tendency towards the Middle

The component of test ability that is probably easiest to recognize and to remedy meets us in TIMSS item A1, in the word schattieren ("to shade"):

Look at the figure. How many of the small squares must one ADDITIONALLY shade [schattieren] so that 4/5 of the small squares are shaded?
A) 5
B) 4
C) 3
D) 2
E) 1
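The figure belonging to A1 is not reproduced in this volume. As a general solution sketch – assuming the figure shows $N$ small squares, of which $s$ are already shaded, with $\tfrac{4}{5}N$ a whole number – the number of squares still to be shaded is
\[ x = \frac{4}{5}N - s. \]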

This component of test ability is characterized by unusual, difficult, ambiguous, perhaps even wrongly used words in the item text (for the interpretation of schattieren cf. Meyerhöfer 2004 a, pp. 104 f.). The main causes of this component are pretensions, translation errors and also a lack of care in reviewing the items: these errors lie on the manifest text level and can be remedied by careful review of the texts. Here, for example, one could use the word schraffieren ("to hatch") and actually use hatching.

It can be regarded as an aggravation of this component when the text passes over into an openly bizarre form. Thus in the PISA item "Bauernhöfe" ("Farmhouses") the cuboid ("Quader") EFGHKLMN is explained as a "rechtwinkliges Prisma" (right-angled prism):


PISA item "Bauernhöfe" (Farmhouses)


Here you see a photograph of a farmhouse with a pyramid-shaped roof.
Below you see a sketch, with the corresponding measurements, which a student has drawn of the roof of the farmhouse.
The attic floor, ABCD in the sketch, is a square. The beams that support the roof are the edges of a "Quader (rechtwinkliges Prisma)" – a cuboid ("right-angled prism") – EFGHKLMN. E is the middle of AT, F is the middle of BT, G is the middle of CT, and H is the middle of DT. Each edge of the pyramid in the sketch measures 12 m.

Bauernhöfe 1. Calculate the area of the attic floor ABCD.
The area of the attic floor ABCD = ______ m².

Bauernhöfe 2. Calculate the length of EF, one of the horizontal edges of the cuboid.
The length of EF = ______ m.
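A minimal solution sketch (not part of the original item): since every edge of the pyramid measures 12 m, the attic floor is a square of side $AB = 12$ m, hence
\[ \text{area}(ABCD) = 12 \cdot 12 = 144\ \mathrm{m}^2, \]
and since $E$ and $F$ are the midpoints of $AT$ and $BT$, the midsegment theorem applied in triangle $ABT$ gives
\[ EF = \frac{1}{2}\,AB = 6\ \mathrm{m}. \]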



A further example is provided by TIMSS item A2:

The objects on the scale are in balance. On the left-hand pan there are a weight (a mass) of 1 kg and half a brick. On the right-hand side there is a whole brick.
What weight (what mass) does a whole brick have?
A) 0.5 kg
B) 1 kg
C) 2 kg
D) 3 kg
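A minimal solution sketch (assuming, as the item intends, that the two half bricks are of equal mass): if $m$ denotes the mass of a whole brick, the balance gives
\[ 1\ \mathrm{kg} + \frac{m}{2} = m \quad\Longrightarrow\quad m = 2\ \mathrm{kg}, \]
i.e. option C.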

Here, the item construction cannot make up its mind whether the issue is weight or mass, although this is irrelevant to the problem. Up to the first parenthesis, picture and text have given contradictory signals as to whether the item is to be worked on from a mathematical, a physical or an everyday point of view¹³. The outward signals orienting towards physics and measurement technology are latently contradicted. At the parenthesis, the uncertainty passes over into open confusion. The lack of clarity becomes manifest in the text. The fine-grained analysis shows that merely a schoolmasterly need for the correct use of technical language is being served. The structure can be described as an outward technicalization of language combined with a simultaneous substantive disavowal of technicality. At the same time, a moment of irritation is created, for the student must find a way of dealing with the open linguistic distortion. Concretely, he must decide whether the terminological doubling is important for the solution or not. A student who can grasp and pass over the schoolmasterly character of the text gains a time advantage here.

A third example is found in TIMSS item A5:

Which of the statements about the square EFGH is FALSE?
A) EIF and EIH are kongruent (deckungsgleich).
B) GHI and GHF are kongruent (deckungsgleich).
C) EFH and EGH are kongruent (deckungsgleich).
D) EIF and GIH are kongruent (deckungsgleich).

The use of kongruent ("congruent") and deckungsgleich (literally "coincident when superimposed") reflects a probably unsolvable conflict of "central" tests. Both words are – with respect to the problem at issue here – synonymous.

13 For the interpretation cf. Meyerhöfer (2004 a, pp. 107-113)



Both words belong to the technical language of mathematics teaching. There are classes in which the concept of Deckungsgleichheit is the learning content. There are also classes in which the concept of congruence is the learning content. In some of these, the concept of Deckungsgleichheit is in turn drawn on to explain the concept of congruence – this relates to a certain self-explanatory potential of the concept of Deckungsgleichheit. The formulation kongruent (deckungsgleich) chosen in the item now primarily takes up this last aspect: kongruent is – as a reminder, as it were – explained as deckungsgleich. At the same time, deckungsgleich is offered as a terminological alternative for those who do not know the term kongruent. For this group a new term has appeared – as the main term, as it were, because it does not stand in the parenthesis. For those who know only the term kongruent, a new term has likewise been introduced, namely in a parenthesis. For all three groups the parenthesis creates a potential for irritation: either one is confronted with a new term, or one is suddenly reminded in a parenthesis of the meaning of a term – a strange undertaking for a test. This terminological confusion is explicable for someone who knows the – at its core didactical – problem of the two terms. It is "normal" for someone who is familiar with such constructions in tests and can pass over them: a component of test ability.

In all three examples, a habitual problem of the "schoolmasterly" turns out to be the cause: value is placed, inadequately to the problem, on the use of technical language – and the technical substance is undermined. At the same time, a moment of irritation is created, for the student must find a way of dealing with this linguistic distortion. Concretely, he must decide, for example, whether the terminological doubling is important for the solution or not. A student who can grasp and pass over the schoolmasterly character of the text gains a time advantage here: he saves the time needed by someone who first thinks about the mass–weight problem, or the concept of a prism, or congruence – or who even lets it enter deeply into his reflections.

With this first component of test ability, the student's task consists in steering around the emerging reef. On the one hand, this can mean successfully grasping the content of the "strange" word – a verbal ability. Where several meanings are possible, the meaning intended by the testers must be grasped. Habitually, this requires placing oneself on the level of problem-processing envisaged by the testers, that is, thinking neither too deeply nor too superficially. Whoever thinks intellectually too far downwards or upwards is exposed to a heightened risk of failure.



Test ability thus has here a substantive and a habitual dimension, and marks a tendency towards the middle. The reef can also be steered around by passing over the strange element and avoiding terminological or substantive exactness. – The point is not to understand the problem of the item completely; the point is the solution that is correct in the sense of the test. At this point it becomes clear how tests reproduce the much-lamented orientation of students towards results (instead of towards content).

Steering around the reef must not only succeed substantively; it must also happen with as little loss of time as possible, for time is a precious resource in a test. Test ability here means knowing that the individual word does not matter so much, and that one must pass over the strange element. One takes as little notice of it as possible, or infers from the rest of the text as quickly as possible that no trap lurks here. The possibility that it is an important word, or that a term has been introduced for the sake of technical precision, is the great danger for the test-able person. But when a "strange" word or a strange construction appears in a test, it is extremely unlikely that it is a terminological precision that absolutely must be understood in order to produce the correct answer. Test ability here means wasting no time on reflection.

The component of test ability presented here can hardly be declared an educational goal (in the sense of the argument that test abilities are suitable as educational goals or correspond to them): it is true that receiving texts always also involves "reading past" the pretensions and errors occurring in them when quickly grasping their content, that is, disclosing the content despite them. But no justification for co-measuring this component of test ability can be constructed from this, because it involves a tendency to normalize errors: the student is forced to pass over deficits of test construction and thereby to accept them instead of rejecting them – which, owing to the autonomy-destroying basic structure of tests, he cannot do with impunity. The avoidance of irritation caused by schoolmasterly use of technical language, too, can be declared an educational goal only with great contortions: one would have to presuppose that the technical-linguistic has a value outside the technical. To me, however, the technical-linguistic seems to have a value only where it transports or constructs something technical. The latent undermining of the technical by the technical-linguistic seems to me ill-suited as an educational goal.



5.3.2 Irritations through Failed Artificial Accelerations and Disambiguations

A further component of test ability occurs when the attempt is made to channel the student through the test artificially faster. In some cases, the manifest construction of an unambiguity latently destroys precisely this unambiguity. The same principle occurs when a textual construction that is meant to speed up the reading of the text unfolds a potential for irritation that slows the reading down.

Das erste Beispiel findet sich in der Konstruktion Wie viele von den kleinen<br />

Quadraten . . . in der TIMSS-Aufgabe A1 (vgl. 5.3.1). Hier wird besonders<br />

auf die kleinen Quadrate verwiesen. Dieser Verweis soll vereindeutigen, denn<br />

es wird ausgeschlossen, dass sich der Schüler mit den aus den kleinen Quadraten<br />

zusammengesetzten „großen“ Quadraten auseinandersetzt. Der Verweis<br />

verwirrt aber auch, denn die Wahrscheinlichkeit, dass sich Schüler von sich<br />

aus mit den „großen“ Quadraten auseinandersetzen, ist ausgesprochen gering.<br />

Man wird also darauf ges<strong>to</strong>ßen, besondere kleine und eventuell sogar große<br />

Quadrate zu suchen. Lediglich als Hilfe für den Schüler, der nicht weiß, was<br />

Quadrate sind, könnte man sich „kleine Quadrate“ vorstellen. Dieses Argument<br />

zerbricht aber daran, dass außer den Quadraten gar nichts da ist, womit<br />

man arbeiten kann. Die Vereindeutigung zerstört sich also selbst.<br />

Testfähigkeit besteht hier darin, sich von solchen testvereindeutigenden<br />

und beschleunigenden Konstruktionen nicht irritieren zu lassen: Der testfähige<br />

Schüler ist also mit solchen Konstruktionen vertraut und weiß (implizit<br />

oder explizit), dass es lediglich um Vereindeutigung geht und dass über diese<br />

schlichte Funktion nicht hinausgedacht werden muss. Es geht darum, auf<br />

eine vielschichtige Auseinandersetzung mit der Aufgabe gerade zu verzichten,<br />

also nicht über die Rolle von großen und kleinen Quadraten und über die<br />

vielfältigen Möglichkeiten des Umgangs mit Mengen in der Zeichnung nachzudenken<br />

<strong>–</strong> wie es der explizite Verweis auf die kleinen Quadrate zunächst<br />

nahelegt. Wenn man auf vieldimensionales Nachdenken verzichtet und sich<br />

auf das Setzen des richtigen Kreuzes konzentriert, dann wird die Bearbeitung<br />

der Aufgabe durch kleine vielleicht sogar wirklich beschleunigt.<br />

Das gleiche Prinzip wiederholt sich in A1 mit . . . muss man ZUSÄTZ-<br />

LICH schattieren . . . Die Großschreibung scheint zunächst eine Hilfe zu sein,<br />

da sie vor der Angabe der insgesamt zu schattierenden Quadrate warnt. Diese<br />

Hilfe ist aber nicht notwendig, weil die durch Multiple Choice angegebenen<br />

Lösungsvarianten dem Schüler seinen Irrtum signalisieren würden. Auch


78 WOLFRAM MEYERHÖFER<br />

an dieser Stelle erfährt ein testfähiger Schüler einen Vorteil, weil er mit einer<br />

solchen testbeschleunigenden Konstruktion vertraut ist. Der testunerfahrene<br />

Schüler wird eher irritiert sein, weil in der Schriftsprache normaler Texte,<br />

auch bei schulischen Texten, Wörter in Großbuchstaben eine derart starke Exponierung<br />

erzeugen, dass ein Nachdenken über den Grund der Exponierung<br />

angezeigt ist. Testfähigkeit bedeutet hier, den Grund der Exponierung bereits<br />

zu „kennen“: Vermeidung naheliegender Fehler. Der Beschleunigungsvorteil<br />

durch diese Exponierung gilt natürlich nur für jenen, der der impliziten Aufforderung<br />

widersteht, über den Grund der Exponierung nachzudenken. Auch<br />

hier bedeutet Testfähigkeit wieder, nicht über den Text nachzudenken, sondern<br />

dem Prinzip zu folgen, dass es um das Kreuz an der richtigen Stelle geht, nicht<br />

um inhaltliche Auseinandersetzung.<br />

Die Komponente der irritationshaltigen Beschleunigung findet sich auch in<br />

der Formulierung Welche der Aussagen . . . ist FALSCH? von A5 (vgl. 5.3.1).<br />

Ursache ist hier eine Prätention: Eine mathematisch anspruchslose Fragestellung<br />

wird zunächst künstlich verkompliziert: Hier ist reine Fleiß- und Konzentrationsarbeit<br />

zu verrichten, deren Anspruch aus dem zu lösenden Problem<br />

heraus nicht zu begründen ist. Die künstliche Verkomplizierung durch die Umkehr<br />

des Anspruchs <strong>–</strong> man soll benennen, was falsch ist <strong>–</strong> motiviert wiederum<br />

die Hervorhebung durch Großbuchstaben: Das Ungewöhnliche muss hervorgehoben<br />

werden, um eine Verwechslung mit der erwartbaren Anforderung zu<br />

vermeiden. In der Variante „ . . . ist richtig“ käme der Gedanke, RICHTIG groß<br />

zu schreiben, nicht auf.<br />

Ein weiteres Beispiel findet sich in der <strong>PISA</strong>-Aufgabe „Pyramide“:<br />

Die Grundfläche einer Pyramide ist ein Quadrat. Jede Kante der<br />

skizzierten Pyramide misst 12 cm. Berechne den Flächeninhalt<br />

der Grundfläche ABCD.<br />

Im zweiten Satz wird hier die Option des Vorhandenseins<br />

zweier Pyramiden eröffnet, nämlich der Pyramide<br />

des ersten Satzes und der skizzierten Pyramide. Manifest erfolgt durch<br />

die Einfügung des skizzierten eine Lesebeschleunigung durch die explizite Verknüpfung<br />

von Text und Bild. Latent wird eine Irritation erzeugt. Die Aufgabe<br />

an den Schüler lautet, sich von dieser Irritation nicht ergreifen zu lassen, also<br />

darüber hinweg zu lesen.<br />
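A side remark on the mathematical substance of the item, using only the data quoted above: the base is a square with edge 12 cm, so

area of ABCD = 12 cm · 12 cm = 144 cm².

Whatever difficulty the item has thus lies in the text, not in the mathematics.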

The dimension of test ability described here can likewise be discussed as an educational goal: after all, such ruptures between what the text intends and the irritation it thereby produces also occur in the texts for whose reception instruction prepares students. One can declare it an educational goal to find a way of dealing with them and to learn to overcome these irritations. The argument is cynical, however: enlarging autonomy would mean thematising, in some way, the divergence of different textual levels. The student would thereby be enabled to gain distance from the text and thus also from the process of performance assessment. He could thereby emancipate himself intellectually from the processes of schooling and performance evaluation. In a test he is at the mercy of these processes, because he does not get the point even if he rejects the task with intellectual brilliance. It seems far more plausible to me that the testers are under an obligation to avoid potential for irritation by refraining from broken disambiguations and accelerations. But this is evidently possible only if one recognises these ruptures at all. The first precondition for that is a change of perspective: the tester must not concentrate solely on what he wants to hear, but must ask himself what the text really demands and whether that converges with what he wants. The second precondition is then merely a certain sensibility for texts. Objective hermeneutics offers this sensibility an instrument of methodical control.

5.3.3 Fallibility of the testers

PISA task APPLES

A farmer plants apple trees, which he arranges in a square pattern. To protect these trees against the wind, he plants conifers around the orchard.

In the following diagram you can see the pattern according to which apple trees and conifers are planted for an arbitrary number (n) of rows of apple trees:

[Diagram not reproduced.]

Apples 1:
Complete the table:

[Table not reproduced.]

Apples 2:
There are two formulas that can be used to calculate the number of apple trees and the number of conifers for the pattern described above:

number of apple trees = n²
number of conifers = 8n

where n denotes the number of rows of apple trees.

There is a value of n for which the number of apple trees is equal to the number of conifers. Determine this value and describe how you calculated it.

Apples 3:
Suppose the farmer wants to lay out a much larger orchard with many rows of trees. Which will increase faster as the farmer enlarges the orchard: the number of apple trees or the number of conifers? Explain how you arrived at your answer.

On closer examination,¹⁴ the task "Apples" has turned out to be a productive classroom task that can be extended in mathematically substantial ways, but one that is unsuitable as a test task: among other things, Apples 2 belatedly supplies a formula for solving Apples 1. Test ability here does not consist primarily in being able to recognise in what way the testers have already delivered the solution, or parts of it, along with the task. After all, it makes little sense to search tasks systematically for such opportunities – they are too rare for that.

Test ability consists, rather, in allowing the thought that the testers are fallible, and then also in exploiting the lapse. The test is, after all, an instrument that invokes the exposed precision and care of science, and that already signals by its sheer scale and appearance that many people have thought long and hard about what they are putting before the student. It takes a certain measure of autonomy, composure, detachment or audacity to allow the thought that this enormous apparatus writes the formula for solving task 1 into task 2.

14 Winter (2005); Meyerhöfer (2004a, p. 203 f.); Meyerhöfer (2005, p. 171 ff.)

5.3.4 Primacy of the desired result over the mathematical demand; audacity towards the mathematical demand; possibilities of multiple choice

The dimension just described is brought to a head in TIMSS task M7, in which the mathematical demand is undermined:

In this drawing, AB is a straight line. How many degrees does angle BCD measure?

A) 20
B) 40
C) 50
D) 80
E) 100

[Drawing not reproduced.]

Here the student is evidently supposed to recognise that 9x equals 180 degrees, and from this to determine the angle measure of 80 degrees for BCD. It is far more effective to use the multiple-choice offers to estimate that it can only be 80 degrees. That the student is actually supposed to calculate can be recognised from the first sentence and from the fact that the unusual labels 5x and 4x are attached. In a genuine estimation task such a thing would not occur.¹⁵

15 For an interpretation, see Meyerhöfer (2001)
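For the record, the intended calculation path can be reconstructed from the quoted information alone (the drawing is not reproduced here; I take it, as the text indicates, that the labelled angles 5x and 4x together make up the straight angle on the line AB):

5x + 4x = 9x = 180° ⟹ x = 20°, hence angle BCD = 4x = 80°.

The distractors A) 20 and E) 100 then correspond to x itself and to 5x – precisely the traps mentioned below.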

A student who cannot solve the problem by calculation is lucky here: he runs no risk of wasting time. His only task is to dare simply to tick what he sees. That is not trivial, for many a student does not dare to write down the obvious when he senses or notices that he is actually supposed to calculate. A student who can solve the problem by calculation likewise arrives at the correct result – provided he does not fall into the traps of answer options A or E. But he uses up a great deal of the precious resource of time. To save this time he needs a certain audacity towards the computational demand, paired with a certain cleverness in recognising the possibilities created by multiple choice. For the problem of test ability this yields a further component: one must dare to take a non-computational route even where calculation is evidently demanded. One must, that is, act audaciously against the demand, for the issue is not the mathematical content but the cross in the right place. Task M7 is almost ideally suited to identifying people who face the actual demand cleverly and effectively while audaciously acting against the manifestly intended task. Compared with this, serving the computational intention appears as the dutiful working-through of error-prone and problem-inadequate techniques of mathematics instruction.

The same dimension of test ability is co-measured in the tasks "Farmhouses" (cf. 5.3.1) and "Triangle".¹⁶ There, the use of local theorem and formula knowledge is demanded (to differing degrees). The routes via intuition or measuring tend to be faster. The greatest loss of time there is suffered by the one who thinks and acts in a genuinely mathematical way.

16 Cf. Deutsches PISA-Konsortium (2001, p. 178); for discussion, see Meyerhöfer (2004b)

In his dissertation, Reinhard Woschek (2005) examined the different ways in which German and Swiss students solve TIMSS tasks. For M7 he found that German students almost exclusively calculate, and often fail in doing so. Swiss students, by contrast, almost exclusively estimate.

There are, of course, teachers in Germany too who want their students to estimate at this point – at least when setting up and handling equations is not the current topic. Test ability here runs against the intention of the task, yet it does have a character that can correspond to teaching intentions: one may well want students not to deal with given problems in a doggedly computational way, but to solve them as effectively as the situation allows. It remains unclear, however, why one should choose precisely the artificial and static instrument of the multiple-choice task in order to approach a dynamic, problem-adequate and effective way of dealing with mathematical problems – problems which, moreover, call for computational treatment. One should also keep in mind that it would be cynical to artificially suggest or construct a computational demand that does not grow out of the matter itself.

5.3.5 However little you know, always write something down.

An elementary component of test ability can be paraphrased as the exhortation: however little you know, always write something down. The multiple-choice variant of this exhortation underscores the principle: if you know nothing, tick something anyway – preferably whatever seems most plausible to you. The discussion about guessing in tests¹⁷ can be brought to a head in the claim that all population differences in test performance can be explained by differing degrees of internalisation of this component of test ability. This claim is, admittedly, just as unverifiable as the claim that guessing plays no role. But when we learn from Wuttke (2007, section 3.12) that solution differences of just half a task are already interpreted in the PISA scaling as a relevant difference (9 points), this reveals how susceptible the construct is to problems of guessing.

17 Cf. Meyerhöfer (2004c), Lind (2004)

I want to discuss the problem here only for open answer formats (which, in the categorial procedure, are nevertheless always coded in closed form): the task "Apples 2" (cf. 5.3.3) states: There is a value of n for which the number of apple trees is equal to the number of conifers. Determine this value and describe how you calculated it.

But there are two values, namely 0 and 8. From the PISA group's solution codings it emerges that a student who gives n = 8 receives the solution point even if he provides no justification or calculation – that is, even if he does not fulfil the task as set. A student who gives only n = 0, by contrast, receives no point, even if he justifies his answer and leaves it at this value because, after all, only one value is demanded in the task text. Of course, one never knows the testers' coding instructions while being tested. But it becomes clear that the point is not in every case actually to fulfil the task. Merely writing down a partial solution already yields the point.
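That there are indeed exactly two such values is elementary: setting the two formulas quoted in 5.3.3 equal gives

n² = 8n ⟺ n(n − 8) = 0 ⟺ n = 0 or n = 8.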

It is natural to object that the solution zero is rather irrelevant for the real context. That is substantively true, but it does not suspend a core experience of mathematics instruction: there, unsystematically – that is, not necessarily justified transparently from the matter itself, but occasionally appearing to spring from the teacher's whim – such "marginal considerations" come up again and again. For the student it remains opaque, precisely in tests, to what extent he has to (and is allowed to) perform such "marginal considerations". The uncertainty is reinforced by the fact that the real is evidently not at issue at all.


This degradation of the real creates the impression, in the PISA task "Savings",¹⁸ that one could write down just anything about compound interest – possibly even without having calculated it – and still receive the point.
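For the record, the ten-pfennig difference at stake (see footnote 18 below for the task text) is quickly verified:

"Plus" savings: 1000 DM · 1.03 · 1.05 = 1081.50 DM
"Extra" savings: 1000 DM · 1.04 · 1.04 = 1081.60 DM,

a difference of 0.10 DM, i.e. ten pfennigs, after two years.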

Writing down just anything also proves sensible if one keeps the coding practice in mind: a coder – usually a poorly paid student – has to decipher, in an alienated work process under time pressure, a large quantity of barely legible student notes. He must try to wring a meaning from what is written and to reconcile this meaning with an extensive set of scoring rules which nevertheless captures reality only in woodcut fashion. He stands in the permanent conflict that, on the one hand, the production of validity he performs is exposed to the claim of scientificity, while, on the other hand, no scientific method of producing validity is available to him. (The conflict exists independently of the coder's awareness of it. The coders, however, are directly confronted with the text and are likely to sense most clearly that the categorisations are able to grasp neither the latent nor the many different manifestations of understanding or of ability.) The result of his doing is supposed to be an undifferentiated zero–one decision, and in
a certain sense it is even irrelevant whether one scores carefully and validly or not: the coder directly senses the fragility with which the validity of the test statement is produced in the coding procedure. He senses at first hand the illusion of the point value.

18 Savings
Karina has earned 1000 DM in her holiday job. Her mother advises her to deposit the money with a bank for two years to begin with (compound interest!). She has two offers for this:
a) "Plus" savings: 3% interest in the first year, then 5% interest in the second year.
b) "Extra" savings: 4% interest in both the first and the second year.
Karina says: "Both offers are equally good." What do you think? Justify your answer!
The difference between the two offers amounts to ten pfennigs, and it remains unclear what the testers now want to hear: is a difference of 10 pfennigs still "equally good" or not? Because of the unknown further conditions, the question is obviously unanswerable: even with a deposit of 100,000 DM the difference would be only 10 DM, and thus offset by any account-keeping fee or other incidental costs, by the cost of travelling to the bank, even by taking along promotional gifts.
The problem for the person being tested is always to guess successfully what the testers want to hear. Here this remains unclear. It could even be that one receives a point for any sensible argumentation, no matter which way one decides. One element of test ability here consists in writing down something anyway, despite the ten-pfennig difference whose significance for the expected answer cannot be assessed. Since the real is not taken seriously here in any case, one might receive the point by writing down just anything about compound interest – possibly even without having calculated it. (For details, cf. Meyerhöfer 2004, p. 199 f.; Meyerhöfer 2005, p. 166 f.)

The coding practice thus harbours a component of arbitrariness residing both in the principle of categorisation and in the human element – and this arbitrariness is an essential difference from the class test, after which the teacher is always under pressure to justify himself and hence under pressure of fairness. It becomes clear that one has little influence on the coder's "state of grace" and on the catalogue of categories, but that writing something – anything – down increases the chance of a positive assessment.

This component of test ability, while immanent to testing (regardless of whether a test task is well or badly made), nevertheless lies close to abilities that are needed in class tests, for there too the point is to "scrounge points" by writing down fragments, even if one knows little. One may therefore ascribe "educational value" to this component of test ability as well. But it is a purely intra-school value: the assembling of half-knowledge or fragments of knowledge serves here no approach to educational substance through gathering what is already known, through reflecting on it, working it up and extending it. It serves merely the servicing of externally imposed demands in an asymmetrical constellation whose substantive content initially serves no educational process.

5.3.6 Disrespect for the autonomy and authenticity of the mathematical as of the real; the semblance of the real; the testers' specific reality

When testers turn to the real outside mathematics, a manifold potential opens up to them for producing distortions whose handling demands test abilities. In the task "Farmhouses" (cf. 5.3.1) there are attics that are squares, there are middles of line segments, there modelling demands are asserted and destroyed.

In my study on PISA¹⁹ I discussed how these distortions can be avoided: the basic condition is to respect the real and the mathematical in their autonomy and authenticity. This stakes out the basic direction of the test ability to be described here: the disrespect for autonomy and authenticity has to be dealt with.

19 Meyerhöfer (2004, chapter 5); Meyerhöfer (2005, chapter 5)

In TIMSS task A2 (cf. 5.3.1) the text is thrown back and forth between the real, the mathematical and the physical. One possibility of failure arises there if one takes for a brick what looks like a brick, namely the "half" brick. The error suggests itself because bricks with a square cross-section are rarely encountered and because one never splits them lengthwise, as has happened here. For the problem of test ability we learn: one must not believe the semblance of the real. One must, rather, enter into the testers' specific reality. In this reality bricks are split lengthwise, mother calls out "compound interest!", and schoolgirls allegedly draw roofs of farmhouses that are not farmhouses. This world resembles the world of textbooks and certainly also the world of mathematics instruction. In this respect the ability to enter into the testers' reality probably runs together with the ability to enter into the specific reality of mathematics instruction. This component of test ability thus bears a certain resemblance to a component of ability that is also thematic in mathematics instruction. The difference, however, is a constitutive one: in mathematics instruction, it seems to me, part of the idea of education is the demand to bring the specificity of the real and the specificity of the mathematical into view. What is at issue is precisely not the unreflected adoption of task patterns. Such unreflected adoption one would ascribe to a mathematics instruction that does not fulfil its educational mandate. The disrespect for the autonomy and authenticity especially of the mathematical, but also of the real, thus cannot be justified in terms of mathematics education. That is why it is problematic when both go unrespected in tests. And it is problematic in the sense of an educational mandate that in none of the published PISA tasks, and in none of those TIMSS tasks that were presented to all students, is the specificity of the real and of the mathematical thematic. In this respect both tests work past the educational mandate of mathematics instruction. The components of test ability co-measured here merely serve adaptation to a mathematics instruction that does not fulfil its educational mandate.

It is just as worrying how frequently textbook tasks fail to face the specificity of the real and of the mathematical; in this respect the educational mandate of mathematics instruction is in part directed against the practice of textbook tasks. But this does not destroy the mandate; it merely obstructs its implementation (and illustrates how poorly the educational mandate is anchored in the field). In tests, however, there is nothing but the tasks themselves. They constitute the whole.

5.3.7 Non-seriousness of the real problem; dominance of the plain; disadvantageousness of exact considerations, of creative or intellectually demanding work; the testers' wishes as the measure of one's doing

Closely connected with the demand not to surrender to the semblance of the real is the demand not to take the real or reality-based problem seriously. If, in TIMSS task A2 (cf. 5.3.3), one takes the problem seriously and calculates it through, taking into account the distance of the bodies from the centre of the balance, one finds that the brick weighs 2.62 kg. For this, however, there is no multiple-choice offer. One would accordingly round to 3 kg and thereby obtain a "wrong" result, because the testers want to see 2 kg ticked. Here, then, is a case in which a student would arrive at a result that is wrong in the testers' sense although he solves a demanding problem and probably also masters what the testers believe they are measuring. Put simply: a student who is "too clever" arrives at the wrong result. Formulated in deficit terms: this student fails to recognise on what level he is supposed to argue here. The moment of irritation is also present if one merely "stumbles" over the picture because one takes the problem as posed seriously. Here the additional task consists in recognising that one must not take the problem seriously but is supposed to perform a plainer consideration. A student with test ability, who leaves out exact considerations from the outset, thereby gains a time advantage on this task. One could thus formulate this component of test ability as follows: you shall not take seriously and solve the problem that is posed; rather, you shall find out what the testers want you to write down or tick. From what is known about testing one can add: it is more probable that you are supposed to work plainly than that you are supposed to work creatively or in an intellectually demanding way.

Whoever takes the problem seriously in the task "Farmhouses" does not even learn which length he is supposed to determine, because the beam to be calculated would have to have a trapezoidal cross-section. Fortunately, the text destroys the modelling demand so thoroughly that one will not take the problem seriously.


6 Synthesis

"Test ability" describes those items of knowledge, abilities and skills that are co-captured or co-measured in a test but that one would not subsume under the concept of "mathematical proficiency".

When test abilities are co-measured in a test, the following problems arise:

– Not only what is supposed to be measured – mathematical proficiency – is measured. The measurement result is thus distorted.

– Tests set standards. When tests co-measure test abilities, these abilities become part of the standard. The empirical analysis has shown that this is no marginal phenomenon but concerns the core of mathematical education. It has shown at the same time that test abilities cannot be reinterpreted as educational substance. Attempts to do so prove, in the empirical analysis, to be superficial, often wrong and mostly cynical.

– It is conceivable to address the problems identified by means of test training. The main problem arises, however, when latent test training takes place through the accumulated working of test tasks. (There exists for this the at-bottom cynical euphemism "test culture".) Then the phenomena damaging the idea of education unfold their effect insidiously. I cannot deepen the thought here, but I want to point out that Adorno's "Theorie der Halbbildung" (Adorno 1972) opens the problem up further.

Against this background, the conception of the German educational standards needs to be reconsidered. They focus on the testing of "competencies". At present a test is being developed – one far exceeding the PISA test in its dimensions – that is to let the educational standards for mathematics congeal into a test form. The thought that tests set standards attains here a radicalised practice. This standards test is being constructed in a manner in which the problem of test ability cannot be dealt with. Owing to the high penetrating power of the educational-standards tests on mathematics instruction, test ability thus becomes a standard(s) phenomenon in German schools. I therefore propose stopping this test construction.

I now want to summarise the empirically elaborated components of test ability.

The supreme principle for the person being tested is: in the test situation the point is not that a mathematical problem is opened up, that a thought is unfolded or an argumentation brilliantly developed. The point is to place the cross in the spot desired by the testers (the "right" spot), to write down the number desired by the testers, or to unfold a thought just far enough for the coder to award a point for it. Occasionally, opening up the problem and doing what is desired coincide; that is, occasionally a task does test mathematical education or proficiency.

The point is to adapt to the tendency towards mediocrity that is inherent in tests. If one looks at this mediocrity "from below", the phenomenon appears unproblematic at first glance, because one can then work on the task to the best of one's knowledge. Put somewhat plainly: someone distant from education will perhaps not be damaged in his intellectuality by acquiring test ability; he will merely be hindered in its development. He has additional, unspoken and superfluous material to master. That robs capacity for relevant content. For the education-distant clientele it additionally sharpens the problem of disadvantage through the unspoken (cf. Bourdieu/Passeron 1971). If, conversely, one tends to approach the world intellectually, to question it seriously, to unfold thoughts in many layers and to work on mathematical problems all the way to an understanding of one's own, then one loses time in tests; occasionally one also ends up with a correct or "also correct" or "from this angle also correct" result, which however is not the desired and hence rewarded result. The principle reads: avoid depth and many-layeredness!

This holds with a particular colouring when testers venture into the real. Here it is important not to take the real "dressings" seriously. One must find out in which specific reality the testers are moving. One fares best by simply asking oneself: what do they want me to calculate? When the reality and what is mathematically (usually: computationally) wanted do not quite converge, what is mathematically wanted is always primary. One must again be especially careful when the real invites us to more differentiated thinking: at such points one has to find out what the testers want to hear, and not occupy oneself with what the real demands – that costs time or leads to an unrewarded result.

Conversely, one must always write something down, however little one knows. It is advisable to mark difficult tasks while working through the test. Depending on how much time one has left at the end, one has to engage in content-based guessing via plausibility considerations or, if need be, resort to lottery guessing (cf. Meyerhöfer 2004c); with texts one must write down something – anything – and, where numbers are to be inserted, the number that seems most plausible.

Standardised tests are alien and wooden instruments. They hold ready manifold irritations which are washed in during the construction and operationalisation process. One should keep in mind that (tens of) thousands are supposed to work on the test, so that many different technical terms and ways of proceeding have to be covered, and that translation problems come on top of this. There, "the odd word goes astray", sometimes more than one. It also happens that the task is meant to be made more comprehensible or more quickly graspable, and that irritating formulations arise as a result. Avoiding irritations here means: being able to read past them. Here too the pointer towards mediocrity helps: usually the less complicated meaning is intended, and usually the individual word does not matter; one can safely read past it. If one concentrates on what the testers want to hear, one also notices that what is irritating is often incidental. Incidentally, one can also always ask the test administrator. In principle he is not allowed to say anything. But he is sometimes allowed to clear up comprehension problems with words, and perhaps he will tell even more.

Empirically it emerges that test ability unavoidably comes into play

– where multiple-choice offers make guessing possible,
– where open answers are coded categorially into zero–one decisions,
– where differing handling of technical terms in different parts of the population to be measured has to be dealt with.

Test ability comes into play avoidably

– where latent and manifest textual levels running against each other lead to potential for irritation,
– where the content that appears to be at issue is not taken seriously. In the first instance it is not the student who fails to take the content seriously; rather, the task constructor who produces the text does not take the content seriously. (This is, of course, didactically veiled.)
– where ambiguities or blurrings arise concerning what is measured and what is supposed to be measured.

This empirical result is no surprise: test ability plays a role precisely where the task is badly constructed – where latent and manifest textual levels diverge, where the mathematical content is didactically distorted, where the operationalisation process has not been carried out carefully, where one's own mathematics-didactical habitus goes unreflected. This kind of test ability can therefore be dealt with in the sense of avoidance if the testers work on the convergence of latent and manifest textual levels, if they do not themselves fall for didactical illusions and veilings, and if they carry out careful operationalisations of their measurement constructs.

References

Adorno, Theodor W. (1972): Theorie der Halbbildung. In: Soziologische Schriften I (Gesammelte Schriften, vol. 8). Frankfurt: Suhrkamp

Baumert, Jürgen; Klieme, Eckhard; Lehrke, Manfred; Savelsbergh, Elwin (2000): Konzeption und Aussagekraft der TIMSS-Leistungstests. Zur Diskussion um TIMSS-Aufgaben aus der Mittelstufenphysik. In: Die Deutsche Schule, vol. 92 (2000), no. 1, pp. 102-115; no. 2, pp. 196-217

Bourdieu, Pierre; Passeron, Jean-Claude (1971): Die Illusion der Chancengleichheit. Stuttgart: Klett

Deutsches PISA-Konsortium (Ed.) (2001): PISA 2000. Basiskompetenzen von Schülerinnen und Schülern im internationalen Vergleich. Opladen: Leske + Budrich

Hagemeister, Volker (1999): Was wurde bei TIMSS erhoben? Eine Analyse der empirischen Basis von TIMSS. In: Die Deutsche Schule, vol. 91 (1999), no. 2, pp. 160-177

Hembree, Ray (1987): Effects of Noncontent Variables on Mathematics Test Performance. In: Journal for Research in Mathematics Education, vol. 18, no. 3, pp. 197-214

Klieme, Eckhard; Maichle, Ulla (1989): Zum Training von Techniken des Textverstehens und des Problemlösens in Naturwissenschaften und Medizin. In: Günter Trost (Ed.): Test für medizinische Studiengänge (TMS): Studien zur Evaluation (13. Arbeitsbericht). Bonn: Institut für Test- und Begabungsforschung, pp. 188-247

Klieme, Eckhard; Maichle, Ulla (1990): Ergebnisse eines Trainings zum Textverstehen und zum Problemlösen in Naturwissenschaften und Medizin. In: Günter Trost (Ed.): Test für medizinische Studiengänge (TMS). 14. Arbeitsbericht. Bonn: Institut für Test- und Begabungsforschung, pp. 258-307

Lind, Detlef (2004): Welches Raten ist unerwünscht? Eine Erwiderung (auf Meyerhöfer 2004c). In: JMD 1/2004, pp. 70-74

Meyerhöfer, Wolfram (2001): Was misst TIMSS? Einige Überlegungen zum Problem der Interpretierbarkeit der erhobenen Daten. http://pub.ub.uni-potsdam.de/2001meta/0012/door.htm

Meyerhöfer, Wolfram (2004a): Was testen Tests? Objektiv-hermeneutische Analysen am Beispiel von TIMSS und PISA. Dissertation, Mathematisch-Naturwissenschaftliche Fakultät der Universität Potsdam

Meyerhöfer, Wolfram (2004b): Zum Kompetenzstufenmodell von PISA. In: JMD 1/2004, pp. 294-305. Longer version at: http://www.math.uni-potsdam.de/prof/o_didaktik/mita/me/Veroe

Meyerhöfer, Wolfram (2004c): Zum Problem des Ratens bei PISA. In: JMD 1/2004, pp. 62-69

Meyerhöfer, Wolfram (2005): Tests im Test. Das Beispiel PISA. Opladen: Verlag Barbara Budrich

Meyerhöfer, Wolfram (2006): PISA & Co als kulturindustrielle Phänomene. In: Thomas Jahnke, Wolfram Meyerhöfer (Eds.): PISA & Co – Kritik eines Programms. Hildesheim: Franzbecker, pp. 63-100

Millman, J.; Bishop, C.; Ebel, R. (1965): An Analysis of Test-Wiseness. In: Educational and Psychological Measurement, 25, pp. 707-726 (cited after Hembree 1987)

Winter, Heinrich (2005): Apfelbäume und Fichten – und Isoperimetrie. In: mathematik lehren, no. 128, pp. 58-62

Woschek, Reinhard (2005): TIMSS 2 elaboriert: Eine didaktische Analyse von Schülerarbeiten im Ländervergleich Schweiz/Deutschland. Dissertation, Fachbereich Mathematik der Universität Duisburg-Essen

Wuttke, Joachim (2007): Die Insignifikanz signifikanter Unterschiede: Der Genauigkeitsanspruch von PISA ist illusorisch. In: Thomas Jahnke, Wolfram Meyerhöfer (Eds.): PISA & Co – Kritik eines Programms. 2nd, revised edition. Hildesheim: Franzbecker


PISA – An Example of the Use and Misuse of Large-Scale Comparative Tests¹

Jens Dolin

Denmark: University of Copenhagen

To an ever increasing extent, international evaluations such as PISA are both setting the agenda in the educational policy debate in the participating countries and exerting a considerable influence on their educational policy decisions. But do such surveys justify the fuss they often cause?

In Denmark, the headlines which followed the publication of the PISA 2003 survey included:

– More discipline in the schools. Discipline will help to improve Danish results in international surveys (Jyllandsposten, 7 Dec. 2004)
– Time for physics classes in country no. 31 (Jyllandsposten, 7 Dec. 2004)
– The government to introduce more tests for Danish schoolchildren (Politiken, 7 Dec. 2004).

The government used the PISA results as a lever to tighten up educational policy, while a number of leading education researchers warned against introducing drastic alterations on the basis of an international test of a character which was described as being to some extent foreign to the Danish educational culture. The tone of the debate was sharp, as illustrated by the following extract from an interview appearing in a Danish newspaper:

You have been fooled by the PISA report. The PISA report on the elementary schools is nonsense and a perverse provocation. It is based on neither knowledge nor insight. (Prof. Staf Callewaert, in the newspaper Information, 10 December 2004).

1 This paper is an updated version of a keynote given at a Nordic Conference for Science Education in 2005.


A rather barren chasm was rapidly dug which prevented large parts of the educational system from utilising the PISA results productively, and large parts of the political system from placing PISA in the necessary context. Hopefully, this article may contribute a little to both.

The article will analyse PISA – particularly the part dealing with science – as an example of a major comparative evaluation.

PISA will first be described and then analysed on the basis of test theory, which will address some detailed technical aspects of the test as well as the broader issue of validation. The purpose of this is to illustrate how the technical aspects of evaluations are not neutral practices, but rather a part of the fundamental value system on which the evaluation is based. Some apparently objective choices must necessarily be made which have consequences for the theoretical basis of the evaluation, and the technique thereby becomes part of the fundamental value system. These considerations form the basis for an evaluation of PISA's predictive power in a national context – in this case, that of Denmark. On this basis, the analysis will focus on the relationship between PISA's fundamental assumptions and the national consequences of participation. Finally, I will conclude with some reflections on how PISA may be utilised and developed.

Comparative evaluation – between politics and science

Whether or not evaluations in the form of politically-initiated surveys can be considered research as such, the designation "comparative evaluation" forms part of the lexicon of comparative educational research. Internationally, this is a major research field, organised in the World Council of Comparative Education Societies, which was founded in 1970 and now has 35 national and regional member organisations. All major international education conferences have sessions for comparative evaluation, and the field is covered by several international periodicals, of which the two largest are the British periodical Comparative Education and the American Comparative Education Review. Finally, large-scale international comparative tests such as PISA present an opportunity to conduct a growing amount of related research. This secondary research may focus on PISA itself, or it may utilise the PISA data in analyses which expand its perspectives, such as in comparisons between countries, surveys of sub-populations, correlations between different variables, etc.
sub-populations, correlations between different variables, etc.


However, the research field also has a longer tradition which, under the designation comparative educational theory, refers to comparative studies of educational matters in different countries and cultures. One of the earliest Danish comparative surveys was conducted in 1841 and compared the Danish school system with the German and French systems; an analysis which was included in the formation of the Danish upper secondary school education. During the same period, the Danish educationalist Grundtvig visited Britain and was inspired by the college system to develop the Danish folk high schools. Comparative educational theory has thus contributed to building up the educational systems of national states via inspiration and exchanges of experience. This tradition, which Winther-Jensen (2004) terms comparative educational theory in its horizontal significance, was dominant until the nineteen-sixties.

International comparative studies have grown considerably in their extent and level of interest over the past decade, but the most important point is that their actual aim has altered. A change occurred during the course of the nineteen-seventies and nineteen-eighties in the conditions and significance of the educational systems, which also altered the focus of comparative educational theory. The key words here are globalisation and marketisation; education comprises a key sector of the global knowledge society, and it therefore becomes important for politicians to know how their country is doing in the international competition to become the best knowledge society. At the same time, a marketisation of the educational system is taking place, one which causes politicians to ask: are we getting value for money? There is a need for data to determine whether a Danish school student is more costly than a foreign one, and if so, whether she is at least more skilled. The marketisation of the educational system and of the public sector in general is being implemented via New Public Management: a system of control which is based on goals and result targets on the output side, and the implementation of which requires knowledge and data obtained by means of national and international evaluations and the standards these impose. This is given precise expression in PISA:

Across the world, policymakers use PISA findings to:
– gauge the literacy skills of students in their own country in comparison with those of the other participating countries
– establish benchmarks for educational improvement . . .
– understand the relative strengths and weaknesses of their educational system (OECD 2004)
(OECD 2004)


There is still an interest in comparing oneself with other countries – the horizontal dimension – but now international concepts and standards have been established which provide a basis on which national states can assess themselves. These supranational structures make it possible to speak of comparative educational theory in the vertical sense. The EU is developing a concept of lifelong learning, UNESCO defines Education for All, and the OECD is testing a literacy concept through PISA. These international concepts become a determining factor in national policies, and the international evaluations set up a standard which is independent of the differences between the individual countries, both for these key concepts and for the actual educational systems. The goals of the educational systems thereby become harmonised, and increasing emphasis is placed on standardisation and on comparison of student performance in order to measure the extent to which a country is meeting the international requirements. Under such conditions, the horizontal dimension becomes reduced to a comparison with those countries that best fulfil the international standards.

We may, for example, ask ourselves in desperation, "What is it that Finland does that causes it to do so well in PISA?" But we ask less about what school students can do in Denmark. How is it that Denmark is doing so well in international competition, when Danish young people achieve such a mediocre score in international comparative tests? It may be that these comparative evaluations fail to capture the essence of the students' skills – or at any rate capture only an inessential subset. It is therefore important to analyse what such evaluations can really tell us, and what they cannot. What are the limitations, for example, in comparing complex matters between many countries – both from the perspective of test theory and educational theory?

The aim of this article is to evaluate the predictive power of the PISA results, and thereby provide a perspective on international comparative evaluations in general. The criticism examined here should then be compared with the advantages that PISA bestows. One problem in this context is that surveys like PISA are initiated and planned in one part of the educational system (typically at policy level) but implemented by another part of the system (typically the directly practising level), after which the results are used by the policy level to characterise and change the practising level. The situation is thereby one of attack and defence from the beginning, which makes it difficult to find a neutral standpoint from which to assess PISA.

It is, however, important to understand that PISA was designed by the OECD with the official aim of acquiring a data foundation for the use of (educational) decision-makers. As PISA's own introduction makes clear (OECD 1999):

The results of the OECD assessments, to be published every three years along with other indicators of education systems, will allow national policymakers to compare the performance of their education systems with those of other countries. They will also help to focus and motivate educational reform and school improvement, especially where schools or education systems with similar inputs achieve markedly different results. Further, they will provide a basis for better assessment and monitoring of the effectiveness of education systems at the national level (p. 7).

PISA is administered by a PISA Governing Board, which includes representatives from the governments of the participant countries, and which takes the key decisions concerning PISA's goals, content, procedures, etc.

PISA thereby comprises a different type of survey and research from that with which we are traditionally familiar from the universities. It is a commissioned, research-based survey containing questions formulated by the commissioners, and with some set frameworks, but with rather extensive freedom with regard to how these frameworks are filled (e.g. the formulation and choice of test items). Such surveys have become quite common in the research world, such as in the form of evaluations and memoranda, but they differ in crucial areas from the free research of the universities. Many of the associated debates and decisions, for example, take place in relatively closed groups, with strong influence from the administrative layer of the ministries, and thereby with the fingerprint of the present government. It is thus a blend of research, investigation, evaluation and educational policy.

The results which PISA has published, first and foremost in the form of the so-called league tables which rank countries according to the performance of their young people, have also been used in many other countries as arguments for fundamental alterations in their educational systems. In Denmark, with direct reference to the poor results in PISA 2003, the government introduced a wide range of school tests, albeit under strong protest from the teachers' organisations and education researchers (Dolin 2007). Teaching methods and so-called progressive education were identified by leading politicians as the cause of the disappointing results, which gave rise to a back-to-basics wave and greater emphasis on strong school leadership.


98 JENS DOLIN<br />

Critical choices, reliability and validity

Any evaluation requires a number of theoretical, practical and methodological choices in order to ensure the production of the results necessary to fulfil its goals. These choices are taken at various points in the PISA system on the basis of compiled foundation documents (often of a political or scientific nature). The choices relate to questions of framework and content, such as the relationship with other surveys and test item design, and they bear directly on the validity and reliability of the evaluation. Such fundamental choices set the limits of the survey's usefulness and predictive power, and define its methodological standard.

In a comparative test, reliability is crucial. Irrespective of what you measure, it must be measured correctly. You must be certain that the various countries are appraised in the same way, so that their ranking in the final evaluation will not be open to question. Reliability-related problems include, for example, sampling procedures and the scoring of responses. The most fundamental questions, however, relate to the survey's validity – the extent to which the chosen design can measure what you are actually interested in. There is a gradual transition between problems of reliability and problems of validity, so the divisions between them are as much questions of organisation as of content.

We will begin with some of the critical choices, then review a number of apparently technical, reliability-related issues, and finally use the more fundamental validity problems as the transition to a discussion that places the issue in perspective.

Critical choices

An international survey must position itself within the range of comparative tests with regard to its aim, content, target group, etc., and it must possess a design in accordance with this positioning. Some of the choices taken will have consequences for the survey's reliability and validity; certain aims, for example, call for certain test forms in order to ensure that the test is in accord with those aims and thereby valid. But the testing must also be of a type that enables a high degree of reliability. These two considerations can be difficult to unite, and the reliability consideration is often given the higher priority.

Lack of comparability with earlier surveys

In the case of PISA, no links have been established with earlier international surveys (particularly those undertaken by the IEA), which makes comparisons very difficult.

It is regrettable that PISA did not link the survey to prior surveys, for example by including some test items of the same type as were included in TIMSS (which was curriculum-based). This would have enabled comparisons over time, as well as comparisons between tests with different testing purposes. Is there, for example, agreement between the results of a curriculum-based test and those of a more general "fit-for-life" test?

Secondary analyses, however, have revealed quite significant correlations between the two surveys. Lie and Olsen (2007) compared science results for the 22 countries that had participated in both TIMSS and PISA, and found that the correlation between the scores in the two studies was as high as 0.95 at the country level.
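To make concrete what such a country-level correlation is, here is a minimal sketch of the computation; the national mean scores below are invented for illustration and are not the published TIMSS or PISA figures.

    import numpy as np

    # Hypothetical national mean scores for the same eight countries in two
    # surveys (invented numbers; Lie and Olsen's analysis covered the 22
    # countries that took part in both TIMSS and PISA).
    timss = np.array([520, 495, 530, 470, 505, 540, 480, 515])
    pisa = np.array([510, 490, 525, 465, 500, 545, 475, 505])

    # Pearson correlation at the country level.
    r = np.corrcoef(timss, pisa)[0, 1]
    print(f"country-level correlation: {r:.2f}")

Note that a high correlation between country means says nothing, by itself, about agreement at the student or item level.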

Whether this high level of consistency between different forms of measurement is good or bad is quite a delicate question, to which we will return.

Year sample instead of class sample

By selecting a representative sample of a given year's school students, we can illuminate whether society receives "value for money" from the educational system as such: Does our educational system adequately equip young people for the future? (Assuming that such "future-preparedness" can actually be measured, which is a question to which we will return later.) How many of our schools' students have which particular skills? And so on. This takes place at a highly aggregated level, where something can be said, for example, about sociocultural differences or the distribution of the results across a year group, and where some general issues can be identified which the educational system is failing to address satisfactorily. In addition, a year-based sample provides a good overview of a given school year, and the size of the sample provides an opportunity to compare different parts of the educational system.

However, if we wish to know something about the educational system which can be used to change it, we must examine the places where the education actually takes place – which is to say the classroom and the school. The problem with PISA here is that the test does not illuminate the teaching conditions which, in the final analysis, are responsible for the measured results. Data collection at this level, with, for example, entire classes representing a school, would provide an opportunity for teaching-related comparisons. The students tested would certainly have been exposed to different forms of teaching, but the Danish model, under which teachers are often permanently assigned to particular groups of students throughout the years, would enable meaningful correlations to be drawn between teaching variables and output.

Problems with the selected statistical model

The fundamental problem for all comparative evaluations is how to safeguard comparability between different cultures and educational systems. The statistical side of this process is addressed in PISA by choosing a psychometric model which assumes that differences between systems can be ascribed to variation along a single scale. PISA has chosen to rely on a technique known as "Item Response Modelling", despite the absence of (published) theoretical considerations concerning what the choice of this model might mean. The problem with this model is that it permits only one-dimensional variation along the chosen scales, and thereby risks overlooking differences between countries lying outside the scale in question. As the technical report says: "An item may be deleted from PISA altogether if it has poor psychometric characteristics in more than eight countries (a dodgy item)" (Adams and Wu 2001, p. 101). If a particular test item does not fit the one-dimensional model – i.e. it gives very different results in several countries – it is omitted, even though the reasons why it gives different results might be an expression of variation in a dimension other than the one the relevant scale is designed to measure. Potential information can thereby be suppressed, or, to put it another way: in its efforts to avoid cultural bias, PISA neglects cultural differences – the very differences that it would have been interesting to identify as explanations for the observed variations in performance between the various countries.

As Harvey Goldstein puts it:

"Perhaps the major (concern) centres around the narrowness of its focus, which remains concerned, even fixated, with the psychometric properties of a restricted class of conceptually simplistic models. ... It needs to be recognized that the reality of comparing countries is a complex multidimensional issue, well beyond the somewhat ineffectual attempt by PISA to produce subscales. With such recognition, however, it becomes difficult to promote the simple country rankings which appear to be what are demanded by policymakers." (Goldstein 2004, p. 328)

The price of such item homogeneity is that cultural differences are erased at the "profile level". It would in general have been useful to have some clearer considerations on the appropriateness of the chosen model.
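To illustrate the mechanics behind this kind of "dodgy item" screening, the following is a minimal, self-contained sketch of my own (not PISA's operational procedure, which uses more elaborate fit statistics): it simulates response data in which one item behaves differently across countries, estimates a crude Rasch-style difficulty per country, and flags the item once it misfits in more than eight countries. All numbers are invented for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    def simulate_country(n_students, difficulties, shift_item0=0.0):
        """Simulate 0/1 responses under a one-dimensional Rasch model.
        shift_item0 adds a country-specific difficulty shift to item 0
        (differential item functioning)."""
        theta = rng.normal(0.0, 1.0, size=(n_students, 1))  # latent abilities
        b = difficulties.copy()
        b[0] += shift_item0
        p = 1.0 / (1.0 + np.exp(-(theta - b)))              # Rasch success probability
        return (rng.random(p.shape) < p).astype(int)

    def item_difficulty(responses):
        """Crude per-country difficulty estimate: logit of proportion incorrect."""
        p_correct = responses.mean(axis=0).clip(0.01, 0.99)
        return np.log((1.0 - p_correct) / p_correct)

    true_b = np.linspace(-1.5, 1.5, 10)  # ten items with evenly spread difficulties
    # Twelve countries; in nine of them item 0 is one logit easier or harder
    # (alternating sign), so it misfits relative to the joint calibration.
    shifts = [(-1.0) ** i if i < 9 else 0.0 for i in range(12)]
    data = [simulate_country(2000, true_b, s) for s in shifts]

    est = np.array([item_difficulty(r) for r in data])   # countries x items
    deviation = np.abs(est - est.mean(axis=0))           # from the joint calibration
    misfit_countries = (deviation > 0.5).sum(axis=0)     # countries where an item misfits
    dodgy = np.where(misfit_countries > 8)[0]            # mimics the ">8 countries" rule
    print("misfitting countries per item:", misfit_countries)
    print("items that would be dropped:", dodgy)

The sketch makes the point visible: the flagged item is precisely the one carrying systematic between-country differences, and deleting it removes exactly that information from the survey.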

Is PISA authentic?

Authenticity is one of the crucial elements in PISA.

Figure 1: The pizza item

Let us, for example, examine the "pizza" test item from a set of published pilot items that were stated to be representative of PISA's mathematics questions (Figure 1). On the surface it appears to be an everyday situation (at any rate for city-dwellers in Western Europe), but it has been made abstract by the use of an unknown currency and "nice" numbers. It is in fact a disguised mathematics problem.

I wonder how students who are used to ordering from their local pizzeria would reply to such a question using realistic (and known) numbers. But this touches on the fundamental – and conflict-ridden – academic debate on whether mathematics should be taught as a closed, deductive system or as "realistic" mathematics. Those who belong to the first school tend to formulate questions which test the ability to perceive mathematical structures in everyday examples, while the other school prefers to focus on the skill of "managing" everyday situations – irrespective of whether approved methods are used. If you aim for the latter, it would be correct to say that the more realistic a test is – the more it is designed to reflect actual everyday situations – the less it makes sense to compile a globally comparable test! It is quite simply a fundamental conflict of principle, in which the choice of test questions reflects a particular academic and pedagogical attitude.

In PISA, it is as though backward reasoning has been used in the formulation of many of the test questions: here we have a set of school subjects – biology, physics, geography, chemistry, etc. Where can the students apply this knowledge? Where in the real world are there situations that involve the use of this knowledge? Instead of starting (authentically!) with some realistic everyday situations – the consumer, the manufacturer, the citizen, leisure activities, etc. – and then choosing some in which scientific insight might play a role. It is fair to say, though, that this would be a very difficult agenda to set up, owing to the special character of science. In everyday life we use the known and the experienced to explain the unknown. In science it is the reverse: there you explain the well-known with abstract, invisible and never-experienced concepts. It is a huge pedagogical and didactical challenge to make these two ways of knowing meet.

In this connection it is also characteristic that the answers must be based on the information supplied in the test question, which must not be combined with the students' own knowledge of the subject (see, for example, Svendsen 2005). In order to do well on a test item, it is at least as important to understand test logic as to know the subject. You have to know how tests are scored, how to optimise your answering strategy, etc. Greater familiarity with tests probably yields a higher score.

PISA's results, like those of all evaluations, are dependent on the evaluation context, both with regard to the formulation of the specific questions and with regard to the context in which the test items are solved. As an example, Kjeld Kjertmann (2000) shows how readers who have done well in a standard word reading test (US64) achieve very different results in reading tests which involve meaningful texts.

The question of reliability

The main question here is whether PISA lives up to its own premises from a test-technical point of view. Is the test performed "properly", i.e. in conformity with recognised test standards?

The reliability of PISA is probably as high as is practically possible in such an extensive survey. For each round, a Technical Report is published containing thorough documentation of the procedures used in all phases of the survey, which gives the impression that the test has been undertaken competently in every respect. In the case of PISA 2000, this is the Technical Report 2000 (Adams and Wu 2001), which outlines how the test was compiled and pilot-tested, how the respondents were selected, how the data were collected and processed, etc. The reliability of the data and processes was evaluated in all respects, and special reliability studies were also undertaken. In one such study, in which the national test scorers' scoring of reading questions was compared with that of a PISA consortium official (a so-called "verifier"), there was agreement between the OECD's verifiers and all four national scorers in 78 % of instances (p. 174). There was agreement with a majority of the national test scorers in 91.5 % of instances. However, the results revealed a large degree of scoring variation between both questions and countries. The marking of some questions showed an inter-country agreement rate of less than 0.80 (Technical Report, p. 175), and some countries showed an inconsistency rate of more than 50 % in the marking of certain questions (Technical Report, p. 177). The overall consistency rate of the individual countries varied from 80.2 % (France) to 96.5 % (New Zealand) (Technical Report, p. 178).
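The two agreement figures are easy to mis-read, so the following sketch shows, with invented data, how "agreement with all four national scorers" and "agreement with a majority" can be computed from a table of verifier and scorer codes; the 93 % per-scorer accuracy is an assumption chosen only to yield figures of a plausible magnitude.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical codes (0 = incorrect, 1 = partial, 2 = correct) for 1,000
    # student responses: one verifier and four national scorers per response.
    verifier = rng.integers(0, 3, size=1000)
    scorers = np.array([np.where(rng.random(1000) < 0.93, verifier,
                                 rng.integers(0, 3, size=1000))
                        for _ in range(4)])              # ~7 % scoring noise each

    match = scorers == verifier                          # 4 x 1000 agreement matrix
    all_four = match.all(axis=0).mean()                  # verifier agrees with all four
    majority = (match.sum(axis=0) >= 3).mean()           # verifier agrees with 3 or 4
    print(f"agreement with all four scorers: {all_four:.1%}")
    print(f"agreement with a majority:      {majority:.1%}")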

There were major variations in reliability across the various areas. In the "soft" data (background variables), reliability was significantly lower than in the test items. The reliability of measures of the quality of school resources (a subset of "physical infrastructure") was, for example, 0.70 for Denmark (Technical Report, p. 250). It is difficult to see how the figure of 0.70 was arrived at, but it is probably based on measurements of employees' classifications of the same answer. No account is taken here of the validity problems involved in questions such as "What is your father's occupation?"; an answer such as painter, teacher or office worker can mean quite different things, despite the fact that the PISA scorers classified them as identical. In Denmark, however, we have the opportunity to check such answers via data pooling.

One can always discuss whether an overall reliability rate of 92 % is good or bad, but the survey gives the appearance of being scientifically correct. As the Danish Minister of Education, Bertel Haarder, put it: when so many international experts have participated, it must be satisfactory. But as with all statistics, these figures have been collected in a particular way for a particular purpose, and in any survey, statistics can only describe a (limited) part of the issues and phenomena dealt with by the survey.

A number of education researchers and statisticians have also criticised both the theoretical background and the technical implementation of PISA. Noteworthy in this context has been the debate between Professor Prais of the National Institute of Economic and Social Research in London and Raymond Adams of the international PISA consortium (Adams 2003; Prais 2003; Prais 2004), and the critique by Harvey Goldstein, Professor of Statistical Methods at the Institute of Education, University of London (Goldstein 2004). It would be going too far here to undertake an in-depth analysis of these criticisms, which would require a rather advanced familiarity with statistical theory; the following should therefore mainly be seen as a summary of the problems identified by various authors in the technical and design-related aspects of PISA.

Translation problems

Once the test items have been selected, they must be translated into the various national languages. As the questions have often originally been formulated in English, the translation must often be worded in a more complex manner in order to convey the precise meaning. It is generally recognised that in order to represent the full meaning of a text originally produced in a foreign language, one often has to reframe and paraphrase, which produces a number of awkward and clumsy sentences. The Danish version of the test suffers to some degree from this awkwardness.

The translation also results in a number of inevitable inaccuracies, the effects of which are impossible to assess. In a questionnaire directed at school principals, for example, the English term "assessment" was translated into Danish as "standpunktsprøver" ("proficiency tests"), which has a different meaning.

The occurrence of translation problems, inelegant style and imprecise meaning causes a drop in reliability.

Measuring scale errors (lack of chronological comparability)

The Danish statistician Peter Allerup of the Danish School of Education has demonstrated that the comparability between the individual cycles, which forms an important part of PISA, does not hold, because different measuring scales were used in the two surveys (Allerup 2005).

In the scaling technique utilised by PISA, the average score of each student across all questions is not first calculated in order to assess the average score of all the students; instead, the latent item difficulty is calculated by examining the students' simultaneous item responses, i.e. the same student's answers to all the questions. In PISA, these are termed the "item parameters". By undertaking a so-called Rasch statistical analysis of all the students who answered the same question, it is then possible to see how the latent item difficulty is distributed in different surveys. It is a prerequisite for comparability that the relative level of difficulty is fixed.

Figure 2: measuring scale differences

Figure 2 shows the relative difficulty of 22 common reading test items in the years 2000 and 2003. As can be seen, the relative level of difficulty is not fixed, i.e. the same measuring scale has not been used in both cases (had it been, the lines would be vertical).

A student with a particular level of skill is awarded points as he moves to the right on the scale, i.e. as he solves test items with a greater level of difficulty. It can be seen that changes in latent item difficulty between the two test cycles produce different scores for the same average student. An above-average student with an item parameter of 0.7, for example, would be one who could solve 18 of the 22 common reading tasks in 2000, but only 16 of the same 22 items in 2003. Summed over the 22 test items, these deviations amount to a difference in latent student scores of approximately 11 scale points between the 2000 and 2003 surveys.
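The type of check Allerup describes can be illustrated with a minimal sketch (my own, with invented item difficulties): under a Rasch model, if the difficulties of the link items drift between two cycles instead of shifting by a single constant, a student of fixed ability is expected to solve a different number of the same items in the two cycles.

    import numpy as np

    rng = np.random.default_rng(2)

    # Hypothetical Rasch difficulties (logits) for 22 link items in two cycles.
    # Under scale invariance, b_2003 - b_2000 would be one constant for all items.
    b_2000 = np.linspace(-2.0, 2.0, 22)
    b_2003 = b_2000 + rng.normal(0.2, 0.4, size=22)  # item-by-item drift, not a constant

    def expected_raw_score(theta, b):
        """Expected number of the 22 items solved by a student of ability theta."""
        return (1.0 / (1.0 + np.exp(-(theta - b)))).sum()

    theta = 0.7  # the 'above-average' student of the example
    print(f"expected raw score in 2000: {expected_raw_score(theta, b_2000):.1f} / 22")
    print(f"expected raw score in 2003: {expected_raw_score(theta, b_2003):.1f} / 22")
    # When both cycles are nevertheless reported on a common scale, a raw-score
    # gap of this kind is converted into a shift of several reported scale points.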

Corresponding analyses can be undertaken with regard to gender and ethnicity. Changes in the difficulty of test items for boys and girls respectively accumulate to a scale advantage for girls of 8-10 points at the weak end of the scale (and of just 1-2 points at the strong end). Students whose Danish is poor receive a scale-conditioned advantage over ethnically Danish students amounting to approximately 12 scale points.

Eleven to twelve scale points is quite a lot. In the PISA 2000 scientific literacy test, this would be enough to lift Denmark from the group of countries scoring statistically significantly below the OECD average into the middle group of countries.

Validity

In my opinion, the validity problems of PISA are more fundamental than its weaknesses in technique and reliability.

A test can only measure what it can capture with the given test design. What the test says about the test subjects is one thing; what information can be derived from the test results about the educational system which has educated these students is something quite different. It is thus quite a complicated and extensive task to provide an adequate analysis of the validity of an international comparative test. Accordingly, a validation of PISA implies a mixture of test design analysis and comparisons between the test and the national context. There are questions regarding what one might term internal validity: Does PISA Science 2006 really measure what it is intended to, namely scientific literacy? This question has two parts: How well does the concept of scientific literacy proposed in PISA correspond to other generally accepted concepts of literacy, and to what extent can the test items and the test concept measure the proposed literacy concept?

The starting point for the PISA 2006 science test is the so-called Framework compiled by the Science Forum, a group of science researchers from the participating countries, and the Science Expert Group. Here, scientific literacy is defined as:

Scientific knowledge and use of that knowledge to identify questions, to acquire new knowledge, to explain scientific phenomena, and to draw evidence-based conclusions about science-related issues;

understanding of the characteristic features of science as a form of human knowledge and enquiry;

awareness of how science and technology shape our material, intellectual, and cultural environments; and

a willingness to engage in science-related issues, and with the ideas of science, as a reflective citizen.

(Doc: ScFor(0407)1, OECD 2004)

This concept is quite similar to other concepts of scientific literacy, as revealed in the first inspection report from an ongoing validation project (Dolin et al. 2006), so we can state with some assurance that PISA aims to test scientific literacy. It is worth noting, however, that the definition of scientific literacy changed considerably from PISA 2003 to PISA 2006. The 2006 definition places more emphasis on knowledge about science and on students' attitudes towards science. The incorporation of attitudinal aspects is handled by placing separate cognitive and attitudinal items in the same unit; by doing so, however, the possibility of testing situational interest is forfeited.

A more fundamental question is: What scientific knowledge do young people need later in life, and is this what is tested? No real analysis of this question has been undertaken by the Science Forum. Instead, the Forum has looked at the existing school curriculum and existing school traditions, and has considered which parts of these could be regarded as relevant for the young person's future life. On this basis, it produced the model of scientific literacy shown in Figure 3.

The level of literacy is thus tested via four coherent aspects, namely the answers to these questions:

What contexts are suitable for testing 15-year-olds?
What competencies are necessary for 15-year-olds?
What knowledge is it reasonable to expect 15-year-olds to have?
What affective responses is it reasonable to expect from 15-year-olds?

These four questions were thoroughly processed by the Science Forum, which undertook a mixture of academic and educational-policy weighting of the different interests. The cognitive aspect was weighed up in relation to the affective, and the various academic areas were weighted in terms of percentages in the test areas. The extent to which people in the individual countries feel that the result covers what young people might be predicted to need in their adult lives is a matter for the individual countries to assess. An analysis of PISA's framework in comparison with future demands for knowledge management, multimodality and innovation points out PISA's lack of broader contexts and more future-proof categories (Dolin 2005).

Figure 3: Scientific literacy framework

The fundamental question regarding validity is whether one can reasonably claim that sitting with paper and pencil and (casually) answering questions about imaginary situations has anything at all to do with competencies in the sense in which we normally understand them. I will return to this fundamental question later, but many of the test questions that have been published can hardly be said to test appropriate everyday actions, let alone the willingness to engage in science-related issues, and with the ideas of science, as a reflective citizen; rather, they test the students' general ability to make deductions and form hypotheses, evaluate evidence, etc. – in other words, a number of school-specific skills which, according to the logic of school, can be used later in life. And this aspect is tested very well! Seen in this light, many of the questions are diagnostically strong, inasmuch as a great deal of work has been done to investigate the use of particular cognitive processes. Let us examine a couple of examples.

Problems with test item formulation

Although the test items have been formulated in conformity with a detailed framework and subjected to a quite comprehensive selection process, there are still a few duds. It is hard to formulate "good" questions, as all teachers know, and even though only one third of the test items made it through the process to the pilot test, and even though all countries had the right to object, there will always be some less appropriate items. I will mention just one; Inge Henningsen has provided a more detailed criticism in MONA 2005 no. 1 (Henningsen 2005), and Lars Svendsen criticised some of the other published test items in the Danish newspaper Politiken on 13 January 2005 (Svendsen 2005).

Figure 4: Walking

In the test item "Walking" from the 2003 mathematics set (Figure 4), the length of the stride is indicated for the first step, but it is clearly apparent that the second step is quite a bit longer; so the length of the stride should in fact be defined as the average length of the measured strides. What is worse is that the formula provided is pure nonsense: according to the formula, larger stride lengths mean faster strides, which contradicts our experience.
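To spell out the objection: as I read the released item, it defines n as the number of steps per minute and P as the pacelength in metres, and asserts the relation below; taking walking speed as steps per minute times metres per step then makes the problem explicit.

    \[
      \frac{n}{P} = 140
      \quad\Longrightarrow\quad
      n = 140\,P
      \quad\Longrightarrow\quad
      v = n \cdot P = 140\,P^{2}\ \text{metres per minute.}
    \]

By this formula, walking speed grows with the square of the pacelength, so a longer stride always means a much faster walker; in reality, longer strides typically go together with a lower step frequency.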

Cultural bias

Despite careful attention on the part of the question compilers, it is impossible to avoid a certain amount of cultural bias. Test items which require the student to read between the lines, drawing on cultural background knowledge, are managed more easily by ethnic Danes than by Danish students from ethnic minorities. One could naturally argue that all students ought to be able to manage even culturally determined tasks, and there is a certain logic to this, given that they must be able to manage life in a (post-)modern society. But in that case, one cannot simultaneously accept that PISA aims at smoothing out cultural differences while measuring cultural deviations from a West European norm. This also applies to gender, regarded as culture.

Let us examine the racetrack test item (Figure 5) from PISA 2000.

Figure 5a: racetrack item
Figure 5b

The test item seems realistic and meaningful (at any rate to me as a man). Unfortunately, the question cannot be solved. Based on the number of curves, the lane must be B, C or D. If we look at the position of the start (at the conclusion of a straight stretch), it should be lane D, but as the first curve is followed by one which is sharper and one which is less sharp than the others, it must be lane B! What is more interesting, however, is that the responses show a gender-based imbalance:

Greece, girls: 8 % correct
Portugal, girls: 10 % correct
Australia, boys: 43 % correct
Switzerland, boys: 46 % correct

What do these answers really reveal? Are they perhaps more a reflection of society's socialisation, i.e. gender-specific interest, than of what the school has taught students (concerning the ability to create a graphic representation of movement)? Or are the girls simply so bright that they can see there is no solution?

There can hardly be a standardised version of "everyday life" valid for the whole world, and the issue of what can be regarded as everyday mathematics or science is the subject of considerable debate. Do all young people really need to learn the same things? Do we all need to know the same things in order to be "fit for life"? And should they be evaluated in the same way?

PISA and the Danish educational goals

The next general question is: Does this framework harmonise with Danish educational goals, for example as expressed in the Common Goals statement of the Ministry of Education (http://www.faellesmaal.uvm.dk/)? The answer is both yes and no. Dolin et al. (2006) have undertaken a thorough analysis of the intentions of the PISA survey and compared these with the goals of Danish education as formulated in the Common Goals. The report concludes:

To summarise, we could say that PISA's scientific literacy framework covers key parts of the formulated aims and mentality-related goals of Danish scientific school subjects. The greatest lack is the emphasis placed by Danish scientific subjects on students' practical and field work, which is not included in PISA. This also means that a number of personal qualities, such as imagination and inquisitiveness, are not tested. It is also important to point out that the personal and affective aims of science teaching are given considerable emphasis in the Danish aims and goals, while these comprise only a minor part of the overall PISA test in science.

In addition, the PISA competencies primarily relate to cognitive skills, whereas Danish goals are more holistic and aim to encourage independent problem-solving, which naturally also involves cognitive skills, but in interplay with other abilities.

The PISA framework thus covers some of the Danish goals, but far from all of them, and perhaps not even the ones which many Danes would regard as the most important, such as democratic culture, social skills, personal development, etc. Here, I feel, lies one of the key reasons for the opposition to PISA: many opponents criticise PISA for failing to test what they regard as important, but at the same time overlook what PISA does in fact test. Similarly, many of PISA's supporters focus on what PISA tests, and perhaps fail to place this in relation to what PISA does not test. The exciting question is whether there are correlations between the two areas; answering it would demand an actual field validation, i.e. a concrete examination of the PISA-tested students with the aid of evaluation methods other than PISA's. A Danish research project is doing just that at the time of writing (Dolin et al. 2006).

All in all, a wide range of important validity problems appears when we ask what it is that PISA measures, and what the actual skills of Danish students are in the areas tested by PISA. One should thus be extremely cautious about drawing overly hasty or overly firm conclusions from the PISA results.

Against the background of these considerations, I would recommend that, in the case of extensive surveys such as PISA, more aspects of the survey design and its context be taken into consideration when assessing the test's validity and consequences.

A wider view of validity

In the following, I wish to present a broader and more differentiated view of the validity issue in order to further define the problems that can arise in connection with a survey such as PISA. The issue of validity will be examined in relation to:

– the structure and design of the actual test apparatus in relation to the questions posed
– the range which defines the test's area of validity or generalisability
– the foundation upon which the test's fundamental assumptions are juxtaposed with the field's dominant assumptions.

Validity in relation to the test design

A test can naturally only measure what it is designed to measure, so the first and most fundamental validity evaluation must clarify whether the test's design is in accordance with its aims. Does the PISA test measure scientific literacy? As mentioned, I would be concerned if the idea of literacy were restricted to something that can be measured with paper and pencil, sitting at a desk in a gymnasium. The prevalent approach to literacy operates with a significantly broader view of the concept – typically the ability to manage everyday situations in which the necessary actions or considerations demand scientific insight (Roth & Désautels 2002). PISA's concept of literacy attaches great importance to deductive skills on the basis of given premises, which comprise a subset of the prevalent approaches to literacy, and such abilities are excellently tested in quite a few of the PISA test items. However, is it not the case that the more the test items and the test situation are shorn of their context and removed from ordinary everyday life, the more we tend to test levels of general intelligence? It gives one food for thought that a recent study demonstrates high consistency between performance in PISA 2003 and TIMSS 2003 (Lie & Olsen 2007). A comparison of the results for the 22 educational systems participating in both tests shows a correlation between the scores as high as 0.95 at the country level. Despite differences in focus (scientific literacy vs. curriculum test), the test results are very much the same. So perhaps the PISA test does not reflect its different focus sufficiently. PISA might have a good definition of scientific literacy, but the test items and the whole test set-up are too close to a traditional curriculum test.

Achieving complex goals will often require a blend of multi-dimensional skills, the integration of academic and personal/social skills, and the utilisation of several academic areas and subjects. Such complex goals can only be evaluated with the help of complex forms of evaluation. Much work has been done on developing process-oriented and complexity-capturing forms of evaluation (e.g. logbooks, portfolios, project reports), but these are, naturally enough, more difficult to carry out and more time-consuming, and they need to be learned, for which reason they are more costly than traditional written tests. The better an evaluation is at capturing complex skills, the more difficult it is to present the results in the form of simple, comparable data.

This brings us back to the traditional dilemma between undertaking an evaluation with a high degree of validity, which is costly to carry out and which, because of its complexity, will have low reliability, and an evaluation of simple factors, which is capable of measuring with high reliability but in which the level of validity is relatively low.


Generalisability

One cannot generalise test results beyond their area of validity. It would thus seem unreasonable, on the basis of a test of deduction and calculation, to generalise about general abilities and skills in science. A test is a very specific communicative situation in which students must answer questions in writing, under time pressure, and without the help of an interlocutor to adjust their understanding of the problem. As far as I am aware, no survey has been undertaken of the relationship between such problem-solving skills and the ability to manage later in life in situations with a scientific content.

Nonetheless, PISA measures something, and the measuring apparatus provides a fine-grained scaling of the students. There is, for example, a correlation between the PISA reading results and later educational achievements. An analysis of Danish school students who participated in PISA 2000 showed that the young people's educational position four years after elementary school was primarily determined by their reading skills and their academic self-image in the ninth grade (as established by the PISA test) (Pilegaard Jensen and Andersen 2006). Such a correlation need not, however, indicate a direct causal relation; it may rather reflect some general relationships, probably attributable to social background, which are revealed by PISA. But we also know that of the 17 % of Danish school students who were designated functionally illiterate on the basis of PISA 2000, 20 % later completed an upper secondary or vocational education. Many of these were, in other words, capable of coping with relatively high demands on reading and comprehension. This implies that we can draw no clear conclusions with regard to the generalisability of the PISA test, and it thus begins to resemble soothsaying, to put it mildly, to rank countries by the supposed ability of their school students to manage in the future, as in Figure 6, taken from the Danish PISA 2003 report.

Figure 6: Percentage of students prepared for the 21st-century labour market (Source: Mejding 2003, p. 132)

The skill requirements of the future are difficult to predict, and an exaggerated re-traditionalisation of the school system might well come at the expense of explorative, creative, communicative and playful skills, and many other skills on which the digital society of the future might come to rely – and which PISA does not test.

Naturally, this must not divert our attention from the problem that an unreasonably large proportion of Danish youth have poor reading abilities – something which it is good that PISA documents. But is it reasonable to conclude, on the basis of the PISA data, that three quarters of the school students in Finland are ready for the labour market of the 21st century, while this applies only to just under half of the Norwegians? And to what extent would it enhance the students' future-preparedness to improve their skills in traditional cultural techniques, if this requires learning decontextualised skills? In this connection, it is of crucial importance to find a reasonable balance between fundamental subject-related skills and social/personal skills, and to express this balance in relevant contexts.

Fundamental assumptions

The question of validity is closely linked with the fundamental assumptions and values upon which the test is based. If, for example, a test builds on the premise that knowledge is an objective quantity, independent of context, it might be meaningful to attempt to test the presence and extent of this knowledge in individual students in neutral contexts. And if you define competence as the ability to solve items in a test, then you can call what is measured competencies. If, on the other hand, we view knowledge as a social construction in actual contexts, such a test set-up might amount to a valid measurement of school knowledge – but not in any respect a measurement of knowledge useful in everyday life, let alone of competencies.

Consider, for example, the following view of competence and knowledge, from the perspective of situated cognition (St. Julien 1997):

Competence, understood as the ability to act on the basis of understanding, has been a fundamental goal of education. But it is a painful fact of educational life that knowledge gained in school too often does not transfer to the ability to act competently in more "worldly" settings.
...
From the viewpoint of situated cognition, competent action is not grounded in individual accumulations of knowledge but is, instead, generated in the web of social relations and human artefacts that define the context of our action.

This view of knowledge and competence shifts the focus of competence assessment from examining individual knowledge to examining authentic activities in social contexts. In the Nordic countries, we have built up a view of knowledge in an educational context which attempts to combine the process-oriented view of knowledge expressed by constructivism with the more absolute view of knowledge expressed by science. We also work to a great extent on the basis of a socio-cultural view of learning, i.e. in educational contexts we tend to emphasise the ability of individual students and the group to work towards their own view of knowledge, which then gradually approaches that of established science.

There is no room for such a view of knowledge in the PISA format. Here, concrete questions are asked to which the answer to most items is either correct or incorrect (or at most correct, partly correct, incorrect). Such questions are naturally also asked in Danish science teaching – and it is obviously important to be able to answer them – but they are not the most important questions, as the aim is to build up the students' general scientific understanding. However, it is not possible to train test scorers to assess whether a student is on the right path. In PISA, certain premises are typically presented within a specified frame, and the student is then expected to apply particular knowledge or a particular process to this frame, accepting the given terms. This is a very Anglo-Saxon approach. In a constructivist context, one would instead emphasise the students' ability to draw up frames and premises themselves, and the ability to formulate the actual problem as part of the solution. The "Walking" test item in Figure 4 would, for example, be formulated in a completely different manner under a constructivist view of education; in that case the students would be required to measure the stride length themselves and then attempt to work out the relationship between stride length and speed, evaluating whether or not it is reasonable. It would be this ability to structure the problem that was primarily tested, rather than whether the student is capable of inserting figures into a given formula (which they must naturally also be able to do).

The critical point, however, is that the actual test format itself rules out questions which are too open, and students who display independent thought, e.g. by exceeding the test item's premises or drawing upon knowledge other than that provided, risk being penalised (Svendsen 2005).

Seen in this light, the PISA test seems epistemologically conservative, and consequently more of a measuring rod for idealised skills than a tool for promoting education centred on the learning process.

Tiberghien (2007) advocates research studies on test design similar to those which have led to the development of research-based teaching sequences. Such studies would allow item construction in close connection with the desired learning process of the students, and thus provide a didactical foundation for more fine-grained scoring.

Consequences for educational policy

The results of an evaluation provide a basis for certain decisions, but it is important that these decisions do not exceed what is actually justified by the test. Accordingly, it is interesting to consider how the results of a test such as PISA can be used – and abused.

PISA in the media

The media debate following the publication of an international test often rests on a very uncertain foundation, and experience from the publication of the PISA 2000 and PISA 2003 results indicates that the loose claims advanced in the initial hectic media coverage tend to remain the main impressions of PISA. First impressions last. They thus become the truths upon which the educational debate is based in ensuing years.

In a media society, a media image can have a direct influence on political decisions. When the media construct a particular view of reality, many politicians feel obliged to act on this basis.

Figure 7: Denmark gets too little value for money from its education budget (Source: Arbejdsmarkedspolitisk Agenda (The Danish Employers' Confederation), 7 April 2005)

See, for example, the juxtaposition by the Danish Employers' Confederation of the PISA results with educational spending (Figure 7). Here a comparison is made between Denmark's ranking in PISA and its expenditure per student – and, by implication, the quality of its educational system. In this ranking, Denmark ends up in third-last place among a range of OECD countries when the Danish PISA score is compared with its educational budget. Denmark pays an average of EUR 1,000 per student to achieve just over six PISA points, while the Germans, for example, obtain ten points for the same price. The conclusion is clear. However, this analysis disregards the fact that Denmark obtains much more from its expenditure on education than PISA points.

The politicians ask whether we get "value for money", and they are accustomed to measuring value in terms of figures in columns. If the results are too low, we need more tests and measurements, and we must introduce economic rewards, grading systems, etc. The end result may well be that schools and classes alter their teaching in such a way that students become better able to manage PISA test items, but at the expense of the less detectable results of education. I do not claim that there is a direct contradiction between these two things, but with the limited time and resources available, it is a delicate matter to maintain the existing values while at the same time gearing the system to meet a number of specific requirements. However, the possibility cannot be ruled out that it could be a fruitful process.

Measurable factors as parameters of quality

In general, it would be well to exercise caution when measuring and assessing something as complex as human behaviour with figures, not to mention a country's overall performance – and in particular, when using figures procured by measuring only a limited part of the overall area. This is extreme reductionism, and an example of how one of the central scientific tools for creating knowledge – the ability to practise reductionist methods – should be used with caution outside the domain of science itself.

There is a major risk that the factors which are measurable via the test in question become the norm-setting parameters of quality, while the remainder of the large and complex educational picture imperceptibly slips out of view. This would have serious consequences for the entire educational system, including the priorities of individual schools and teachers.

We risk harmonising away the very qualities that we have built up over generations and which may be the key to our survival in the globalised world of the future. A process of cultural uniformity and harmonisation of values is occurring on the basis of the contemporary mainstream. In an interview in the Danish newspaper Information (Thorup 2005, 20 March), Microsoft CEO Steve Ballmer expresses the company's winning strategy as: "I want the whole world to be Danish." This is followed up by Mikael R. Lindholm, a member of the Innovation Council's strategic planning group, who says:

The welfare system helps to create some highly committed, dynamic, inquisitive and competent people in Denmark. And these are precisely the qualities from which we benefit, and of which the rest of the world is very envious.
[...]
But Denmark shows too little interest in these special, culturally determined competencies that the rest of the world covets. Instead, the government is trying to harmonise our strengths out of the educational system.

It is a notorious fact in educational research that the more important something is, the harder it is to see and measure!

Evaluation as construction of an area

It is well known that educational evaluations have a strong backwash effect on teaching: 'teach to the test', as it is known. This is in itself a reasonable and desirable process, if the evaluation is sensible and reflects the goals of the educational system. However, it is problematic if the evaluation fails to accord with the foundation of the educational system and its overall goals, but is instead undertaken to support a number of ideological aims.

Figure 8. The utilisation of evaluation (From: Dahler-Larsen & Larsen 2001)

The Danish educational researchers Peter Dahler-Larsen and Flemming Larsen (Dahler-Larsen and Larsen 2001) have drawn up a list of uses to which evaluations are put (figure 8). They distinguish between uses which view human actions as based on rationality and functionality, i.e. in which we act in order to achieve a particular goal (such as learning or acquiring information), and uses aimed at making the system appear "suitable" – doing what is expected – rather than at achieving a particular goal. In the latter case it is not the effects of the evaluation that are important, but rather the fact that the evaluations are undertaken at all. There is a tendency for these symbolic and constitutive uses to occupy ever more space in the evaluation landscape. By evaluating, you communicate credibility and drive; you show that you are prepared to do something; you are part of the action (just think of how the number of countries participating in PISA grows with each round). The symbolic value is politically important, and often more important than adhering to the results achieved. However, at the same time, there may be a number of unintended consequences for the content. By undertaking evaluation, you can influence the field in a particular direction and help to form it. By evaluating, we also create social relations and identities (passive students, etc.). Or the evaluation creates a view of the subject matter (such as the scientific competence of school students) for which there is insufficient evidence (e.g. that Danish school students are scientifically incompetent, even though PISA measures only their ability to perform a particular type of task in a particular context).

There is no doubt that the PISA results, together with those of a number of other surveys, such as the OECD review of elementary schools (Uddannelsesstyrelsen 2004), have contributed to an increasing focus on the apparently weak evaluation culture in Danish elementary schools. This is a result which is consistent with the general foundation of the PISA and OECD surveys, and it is to a large extent necessary and useful. An increased level of evaluation – if well-balanced, well-designed and diagnostically oriented – would undoubtedly enhance the benefits of the educational process for all groups of students.

However, there are signs that PISA, besides exerting an influence on teaching, has also had an influence on the actual objects clause of the elementary schools, so as to direct the teaching to conform to a greater degree with what PISA is capable of measuring!

PISA in perspective

Taking a critical approach tends to sharpen your argumentation, and here I have emphasised the problematic aspects of PISA. It would not be reasonable to conclude on this basis alone that PISA is unusable, worthless or the like. On the contrary, PISA encompasses a great deal of potential.

To begin with, it presents us with an enormous amount of empirical material. The figures indicate many unknown factors in the educational sector which it would be worthwhile to investigate further, as well as confirming much that we already know, such as the large gender variations in Denmark, the variation between ethnic groups, etc. It is thought-provoking that there appears to be a statistical correlation between the results achieved and the students' comments on the level of discipline and order during lessons. Moreover, it is in itself remarkable that a very large proportion of the students – more than one-third – report experiencing poor discipline and order during lessons. It is useful to know that Danish school students feel at home in their schools, and that they have a positive attitude to their studies and a positive image of their own academic skills. There are many correlations which it would be interesting to explore in more depth, and an extensive diagnostic potential in the PISA material, first and foremost in connection with finding out what young people think, for better or worse. Rolf V. Olsen (2007) suggests five generic approaches to a secondary analysis of PISA data, each of them accompanied by a comprehensive list of approaches to analysis.

It is also meaningful to undertake comparisons within the same cultural groups, which may provide some fruitful contextualisation of well-known issues. This has been done extensively in a Nordic context, for example (Lie, Linnakylä et al. 2003; Kjærnsli and Lie 2004).

Finally, it should be mentioned that PISA is a laboratory in testing techniques and test theory. Participation in PISA has provided Denmark with a much-needed test-related theoretical boost, and has also helped to place the evaluation culture in Danish elementary schools on the agenda.

But this potential must be balanced against the danger of mainstreaming and distorting the educational system and teaching which PISA could also induce. International comparative evaluations possess almost inherent re-traditionalising and standardising elements which could influence national development in a direction which is foreign to the local educational culture. Evaluations as comprehensive as PISA express themselves with great authority on the basis of what many view as incontrovertible documentation. In relation to the national research environments, the PISA system has so many resources at its disposal that it is difficult to establish genuinely critical and independent research on PISA and the PISA results, with the result that a project like PISA can rapidly become established as a representative of objective, neutral reality. Political prestige has also been invested in participation, which makes it difficult for the participating countries to distance themselves from the project at policy level; as a "member of the club", one feels obliged to show solidarity with the club's rules.

Finally, it is important to point out that, from an educational perspective, it is difficult to establish links between the findings of comparative evaluations, which describe the educational system in its entirety, and teaching in individual classes, with individual students. PISA's strength lies in its analytical and diagnostic possibilities at the overall educational policy level; when it is utilised to influence the structure of specific teaching practice, there is a risk of promoting changes on the basis of an oversimplified view of educational practice, which can have a counterproductive effect in the long run in relation to achieving the stated goals.

References

Adams, R. and M. Wu (2001). PISA 2000 technical report. Paris: OECD.
Adams, R. J. (2003). Response to "Cautions on OECD's recent educational survey (PISA)". Oxford Review of Education 29(3): 377-389.
Allerup, P. (2005). PISA Præstationer – målinger med skæve målestokke. Dansk Pædagogisk Tidsskrift (1): 68-81.
Dahler-Larsen, P. and F. Larsen (2001). Anvendelser af evaluering – Historien om et begreb, der udvider sig. In: P. Dahler-Larsen and H. K. Krogstrup. Tendenser i evaluering. Odense: Odense Universitetsforlag.
Dolin, J. (2005). PISA og fremtidens kundskabskrav. In: PISA-undersøgelsen og det danske uddannelsessystem. Folketingshøring om PISA-undersøgelsen 12. september 2005. Teknologirådet.
Dolin, J., H. Busch and L. B. Krogh (2006). En sammenlignende analyse af PISA2006 science testens grundlag og de danske målkategorier i naturfagene. Første delrapport fra VAP-projektet. Odense: IFPR/Syddansk Universitet. (With English summary.)
Dolin, J. (2007). Science education standards and their assessment in Denmark. In: Waddington, D., Nentwig, P. & Schanze, S. (eds.): Standards in Science Education. Waxmann.
Goldstein, H. (2004). International comparisons of student attainment: Some issues arising from the PISA study. Assessment in Education 11(3).
Hansen, E. J. (2005). PISA – et svagt funderet projekt. Dansk Pædagogisk Tidsskrift (1): 64-67.
Henningsen, I. (2005). PISA – et kritisk blik. MONA (1).
Kjertmann, K. (2000). Evaluering af læsning: Generelle og specifikke problemer. Forskningstidsskrift fra Danmarks Lærerhøjskole, nr. 6.
Kjærnsli, M. and S. Lie (2004). PISA and scientific literacy: Similarities and differences between the Nordic countries. Scandinavian Journal of Educational Research 48(3): 271-286.
Lie, S., P. Linnakylä, et al. (eds.) (2003). Northern lights on PISA. Unity and diversity in the Nordic countries in PISA 2000. Oslo: University of Oslo.
Lie, S. and Olsen, R. (2007). A comparison of the measures of science achievement in PISA and TIMSS. Paper presented at the ESERA 2007 Conference, Malmoe.
Mejding, J. (ed.) (2004). PISA 2003 – danske unge i en international sammenligning. København: Danmarks Pædagogiske Universitets Forlag.
Mejding, J., S. Reusch and T. Yung Andersen (2006). Leaving examination marks and PISA results – Exploring the validity of PISA scores. In: Mejding, J. and A. Roe (eds.). Northern Lights on PISA – a reflection from the Nordic countries. Copenhagen: Nordic Council of Ministers.
OECD (1999). Measuring student knowledge and skills – a new framework for assessment. Paris: OECD.
OECD (2001). Knowledge and skills for life. First results from PISA 2000. Paris: OECD.
OECD (2002). Sample tasks from the PISA 2000 assessment. Paris: OECD.
OECD (2004). Learning for tomorrow's world. First results from PISA 2003. Paris: OECD.
Olsen, R. V. (2007). Beyond the primary purpose: Potentials for secondary research in science education based on PISA 2006 data. Paper presented at the ESERA 2007 Conference, Malmoe.
Pilegaard Jensen, T. & D. Andersen (2006). Participants in PISA 2000. Four years later. In: Mejding, J. & A. Roe (eds.). Northern lights on PISA – a reflection from the Nordic countries. Copenhagen: Nordic Council of Ministers.
Prais, S. J. (2003). Cautions on OECD's recent educational survey (PISA). Oxford Review of Education 29(2): 139-163.
Prais, S. J. (2004). Cautions on OECD's recent educational survey (PISA): rejoinder to OECD's response. Oxford Review of Education 30(4): 569-573.
Roth, W.-M. and J. Désautels (eds.) (2002). Science education as/for sociopolitical action. New York: Peter Lang.
St. Julien, J. (1997). Explaining learning: The research trajectory of situated cognition and the implications of connectionism. In: D. Kirshner and J. A. Whitson. Situated Cognition. Social, Semiotic, and Psychological Perspectives. London: Lawrence Erlbaum Associates.
Svendsen, L. S. (2005). Med Klods-Hans til PISA-prøve. Politiken. København.
Thorup, M.-L. (2005, 20 March). I want the whole world to be Danish. Information. København.
Tiberghien, A. (2007). Assessing scientific literacy: The need for research to inform the future development of assessment instruments. Paper presented at the ESERA 2007 Conference, Malmoe.
Uddannelsesstyrelsen (2004). OECD-rapport om grundskolen i Danmark – 2004. Uddannelsesstyrelsens temahæfteserie nr. 5.
Winther-Jensen, T. (2004). Komparativ pædagogik – faglig tradition og global udfordring. København: Akademisk Forlag.


Language-Based Item Analysis – Problems in Intercultural Comparisons

Markus Puchhammer

Austria: University of Applied Sciences Technikum Wien

PISA was started as an instrument for checking the outputs of the education systems of several different countries against each other. To achieve an assessment across a still-growing number of participating countries, test items were developed for mathematics, reading literacy, science and problem solving. Multiple institutions located in different OECD countries contributed to the effort of creating test items. Quite similar-looking test booklets were produced, featuring the same items presented in the languages officially used in education in the participating nations. The results have been widely discussed, and rankings were often attributed to the organization of national education systems. But is it correct to argue that the same items have been presented, without taking into account the use of different languages? While different cultural backgrounds may be assumed to influence reading literacy (becoming visible in the results), areas like mathematics may be regarded as less sensitive. But quantitative evaluations (presented below) show that there are still enough factors introduced by wording and by language. Thus, the validity of the PISA assessment – whether it tests what is intended to be tested – should be watched carefully within an international frame.

Factors indicating the importance of reading and language

Given different cultural backgrounds, influences of language may be expected for areas that are linked to reading. In this case, arguing for the importance of wording has good prospects, because it seems obvious that reading fluency and reading comprehension are related to sentence structure, to the use of specific terms, or to the length of the words of a text. These factors may be regarded as less important for other areas such as mathematics; scientific competence may also be less affected. But if it can be shown that even in these areas the effects of language must not be neglected, the result obtained so far can be generalized. Thus, the following discussion focuses on PISA's mathematics assessment.

In an average mathematics textbook, the pages are full of formulas and numbers. A first look at PISA items designed for testing mathematics performance reveals a different layout. Approximately 7 % of the text (1250 characters out of 18058 in the English item sample) consisted of digits or mathematical operators like = + - % /; the rest (93 %) formed a readable text whose proper understanding is required to find the correct solution. (Some diagrams were also shown, but note that diagram texts are included in these counts.)

The influence of language is further demonstrated by a model calculated to explain the PISA 2000 results in mathematics, presented in Artelt et al. (2001, 25). To predict performance in mathematics, factors such as socioeconomic status, gender, general cognitive ability, mathematical self-image and reading competence were considered. These variables explained 76 percent of the variance of the performance in mathematics. It was expected that general cognitive ability or mathematical self-image would influence the mathematics score most strongly. Surprisingly, the most prominent influence came from reading competence, expressed in a path coefficient of 0.55, whereas the other factors contributed less: the path coefficients were 0.32 for general cognitive ability and 0.14 for mathematical self-image.

The translation process – as described by PISA-Austria (2004a) – will be considered next. Language versions have been generated for the teaching languages of the participating nations. They start from an English source text and a French source text (often derived from the English version). A so-called double translation process is used, i.e. teams of two independent translators develop the national items, which are then cross-checked and merged into a final national version. International verification steps, a training programme and item analyses serve as quality assurance of the translation process. Nevertheless, this process starts from the two source languages, and it is not clear how far the translators can free themselves from the language structures of the source language(s) to achieve good readability of the translated items. If reading comprehension is reduced by this process, then the testing process is dissimilar across languages and the comparison of results is correspondingly restricted; the influence of translation should therefore be considered in more detail.

A principal component analysis of the PISA 2003 country scores for mathematics, science, problem solving and reading literacy yields a common factor which contributes 94 % of the total variance. This observation shows that there are no distinct foci, as one might expect if some countries promoted mainly the natural sciences or mathematics while others concentrated on reading comprehension and literacy. Different interpretations are possible, but reading comprehension, and thus language understanding, being the most important factor would be a good explanation. Therefore, the influence of factors like wording, the length of item texts, etc. should be investigated in more detail.
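Such an analysis is straightforward to reproduce. The sketch below, assuming a country-by-domain score matrix, computes the share of variance captured by the first principal component; the scores in it are invented for illustration and are not the actual PISA 2003 data.

```python
import numpy as np

# Each row is a country, each column a PISA 2003 domain (mathematics,
# science, problem solving, reading). The numbers are made up.
scores = np.array([
    [544, 548, 550, 543],
    [503, 511, 506, 496],
    [466, 481, 470, 469],
    [532, 525, 530, 514],
], dtype=float)

# PCA via the singular value decomposition of the column-centred data.
centred = scores - scores.mean(axis=0)
_, singular_values, _ = np.linalg.svd(centred, full_matrices=False)
explained = singular_values**2 / np.sum(singular_values**2)

# Share of the total variance on the first component; for the real
# PISA 2003 country scores the chapter reports 94 %.
print(f"First component: {explained[0]:.1%} of the total variance")
```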

Selection of PISA items for comparison

An evaluation based on the language and wording of items should select appropriate items carefully. Language differences can be demonstrated by items that were originally considered as PISA items but were finally not chosen for the test booklets (cf. e.g. the description of the item construction process in OECD, 2001, p. 42ff.). Some of these items have been released to the public and serve as examples for the different PISA assessment areas in the "full report", available in several languages (e.g. OECD 2004a,b,c). They are presented as prototypes for different assessment areas, e.g. for several levels of mathematical skills. The major advantage of these items is that they can easily be retrieved in several languages, and their release to the public eases discussion of individual contents.

For the German version, 42 released mathematics items (with up to 3 questions each) were located (see PISA Austria, 2004b). Some of them have already been released in OECD publications (2004c). The English items were downloaded directly from OECD (2005) in a file comprising 27 PDF documents, extended by other sources (e.g. HKPISA, 2006). All of the English items (except one) are also available in the German sample and have been used for the subsequent comparisons. Items released to the public are also available for other assessment areas, but in smaller numbers. Mathematics performance items may be expected to be less language-dependent, and the sample of 31 possible comparisons yields an item number that can reasonably be used for statistical calculations. The availability of these items in other languages invites an extension of the investigation. Item identifiers start with a letter ("M" for mathematics), followed by a three-digit item number and, possibly, a question number. An abbreviation of the item contents has been taken from the file name of the electronic format.

The following comparisons first show that the German text is significantly longer than the English version, and discuss the implications for the assessment. Then the familiarity of words (which should be related to the word knowledge of the target group of 15-year-old students) is examined using an approach from quantitative linguistics. Finally, German sentence structures appear to add further complexity to German texts.

Text-length based comparisons

31 mathematics items have been evaluated in both the German and the English version. The item texts were retrieved from the PDF format and imported into a word processing program. To analyse the details, a computer program (written in Visual Basic) was developed and applied to the item texts. Table 1 summarizes the results for the number of words (units separated by spaces or similar sentence marks) and the number of characters. To eliminate the confusing effect of number tables (there were some in the item sample), the number of digits and mathematical operators (like + – / %) were counted separately; these special characters are not included in the total character count. Results for English and German are shown below.

ItemID        #words (English)   #chars (English)   #words (German)   #chars (German)
M037Farms            155                694               149                796
M124Walkg            109                634               124                761
M145Cubes             68                306                62                341
M148Cont              52                286                61                353
M150GrwUp             91                403                85                495
M159Speed            235               1063               228               1307
M161Triang            66                328                62                349
M179Robbr             64                335                54                333
M266Crntr            110                475                93                465
M402IRC              157                771               144                820
M413Excha            182               1002               157               1005
M438Expor            128                576               126                616
M467Candy             68                321                69                364
M468STest             58                296                49                330
M471SFair             98                473                92                566
M484Books             65                365                58                396
M505Littr             84                474                74                516
M509Quake            156                818               153                919
M510Choic             65                393                62                446
M513Score            132                557               155                867
M515ShKid            112                380               101                454
M520Skate            227               1143               226               1426
M521Table             80                433                66                443
M525Deacr            265               1405               278               1629
M543Space             98                488                97                532
M547Stair             41                198                35                177
M555NCube2            94                448                74                509
M555NCube3           149                723               143                951
M702Presi            158                877               145               1003
M704BestC            237               1098               226               1290
M806StepP             58                295                57                340

Tab. 1: Text-length based results for PISA example mathematics items, English and German version.
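The counting itself is easy to re-implement. The chapter's Visual Basic program is not reproduced here; the following Python sketch encodes one plausible reading of the counting rules (words as space-separated units, digits and the operators = + - % / counted separately and excluded from the character total). The file name is hypothetical.

```python
import re

# Read one item text (hypothetical file name; a plain-text export is assumed).
text = open("M037Farms_en.txt", encoding="utf-8").read()

# Digits and mathematical operators are tallied separately, as in Tab. 1.
math_chars = re.findall(r"[0-9=+\-%/]", text)

# Words are units separated by spaces or similar marks; the character
# count excludes spaces and the special characters counted above.
words = text.split()
chars = sum(len(w) for w in words) - len(math_chars)

print(f"#words: {len(words)}  #chars: {chars}  #digits/operators: {len(math_chars)}")
```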

Text lengths vary from item to item, but texts usually contain several hundred characters (most items had only one question attached, and only three items had 3; still, a lower length limit can be observed on a per-question basis as well). The average length was 583 characters for an English item and 670 characters for a German item – indicating that the German items are noticeably longer. Average word counts were more similar for the English and German versions (118.1 and 113.1 respectively – but note that some terms use two words in English whereas German often combines two words into one; this effect results in more, and shorter, words in English).

It is interesting to observe the length of the German item texts as a function of their English source counterparts (a dependency is in fact given by the translation process). This relationship is shown in Fig. 1 and is clearly visible – described by a high correlation coefficient of r = 0.98; only a few entries deviate noticeably from the regression line.

To obtain a sound estimate of the relative text lengths, the regression line (based on a least-squares fit with intercept = 0) has been calculated, represented by the formula

length(German) = 1.16 · length(English)

Testing the slope coefficient for statistical significance clearly supports the statement that German item texts are in fact longer than the English ones by nearly 1/6 (the 95 % confidence interval for the slope parameter is [1.123, 1.198]).

Fig. 1: Graphical display of English vs. German item text length. The regression line is shown.
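A through-origin least-squares fit of this kind, together with the confidence interval for its slope, can be computed as sketched below. For brevity the sketch uses only the first six item pairs from Table 1, which give a slightly larger slope than the full sample; with all 31 pairs included, the values reported above result.

```python
import numpy as np
from scipy import stats

# English (x) and German (y) character counts; first six rows of Tab. 1.
x = np.array([694, 634, 306, 286, 403, 1063], dtype=float)
y = np.array([796, 761, 341, 353, 495, 1307], dtype=float)

# Slope of the regression line through the origin: b = sum(xy) / sum(x^2).
b = np.sum(x * y) / np.sum(x**2)

# Residual variance and standard error of the slope; fixing the
# intercept at zero leaves n - 1 degrees of freedom.
n = len(x)
s2 = np.sum((y - b * x) ** 2) / (n - 1)
se = np.sqrt(s2 / np.sum(x**2))

# 95 % confidence interval via the t-distribution.
t = stats.t.ppf(0.975, df=n - 1)
print(f"slope = {b:.3f}, 95% CI = [{b - t*se:.3f}, {b + t*se:.3f}]")
```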

The relevance of these observations becomes visible when contrasting the items with the reading speed of 15-year-old pupils. For average adults, reading speeds of around 200-300 words per minute are frequently reported (depending on the amount of text comprehension required, print size, etc.). For these readers, a reading time of around half a minute per item can be expected.

Assuming that the mathematics part of a PISA assessment contains 20 items to be worked out in 30 minutes, just reading the items would consume 1/3 of the available time. On the other hand, PISA items should also be manageable for pupils at a below-average level. For slow readers, speeds of only 110 words per minute are proposed for English texts (Readingsoft, 2007; similar reading speeds are reported for "efficient words" related to understanding). For those readers, the total reading time would sum up to 21 minutes – 70 % of the 30-minute session. German texts have longer words (about 20 % longer by our data) and reading speed is even a bit slower, so the time that can actually be devoted to reflecting upon the mathematics behind the question is still shorter. High variances in reading ability can indeed be expected, e.g. from the findings of Klicpera and Gasteiger-Klicpera (1993), who reported that the least performing 15 % of pupils in the 8th school year were at the level of an average reader at the end of the 2nd or beginning of the 3rd year in school.
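The reading-time estimates above follow from simple arithmetic, using the average English word count from Table 1 (about 118 words per item) and the assumed test configuration of 20 items in 30 minutes:

```python
# Reading time as a share of a 30-minute session with 20 items of about
# 118 words each (the average English word count from Tab. 1).
words_per_item, items, session_minutes = 118, 20, 30

for label, wpm in [("average reader", 250), ("slow reader", 110)]:
    minutes = items * words_per_item / wpm
    print(f"{label}: {minutes:.0f} min reading "
          f"({minutes / session_minutes:.0%} of the session)")
```

For the average reader this yields roughly a third of the session; for the slow reader, roughly 70 per cent, matching the figures above.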

Familiarity of words

In 1932, the American linguist and philologist George Kingsley Zipf (1932) noticed that the statistical frequency of words can be linked to their rank by frequency of occurrence. A word ranked nth in frequency occurs with a probability Pn of about

Pn ≈ 1/n^a

where the exponent a is almost 1. What is now known as Zipf's law holds well for nearly all languages (except for the first few words – e.g. in English: the, and, to, of, a, ... – with probabilities deviating only slightly). Since then, word frequency tables have been constructed. Files representing the top 10 000 words of languages like English, German, French and Dutch can be downloaded (e.g., see Universität Leipzig, 2007), spanning some orders of magnitude in the relative frequencies of the words that occur in an average text of the selected language. Words that occur quite seldom (e.g. only once every 100 000 words) may not be well known, may be difficult to understand, or may even be unknown to average users of the language.
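The shape of this distribution is easy to illustrate numerically. The sketch below tabulates the model probabilities over the top 10 000 ranks for the exponent a = 1:

```python
import numpy as np

# Zipf's law: the word ranked n occurs with probability roughly 1/n^a.
a = 1.0
ranks = np.arange(1, 10_001)
weights = 1.0 / ranks**a
probs = weights / weights.sum()          # normalise to probabilities

# A rank-10 000 word is several orders of magnitude rarer than rank 1.
for n in (1, 100, 1000, 10_000):
    print(f"rank {n:>6}: P = {probs[n - 1]:.2e}")
```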

In order to detect reading disadvantages of items in the translated language, these rank lists of English and German words have been applied. Words occurring frequently have low rank numbers; rare words have high rank numbers. If frequent words in the first language are replaced by infrequent words in the second language (during translation), then the resulting text is more difficult to understand. Text translations that claim to yield similar difficulty should use words that occur with similar probability, i.e. that have similar rank numbers (according to Zipf's law). Words used seldom are more relevant for a comparison, because a shift in difficulty is easier to observe and influences the understandability of a text more distinctly.

The PISA example mathematics items have been reviewed in both versions, English and German. Words that seemed important for an item text, as well as "difficult" words (occurring less frequently), were identified and their translated counterparts were located. Then rank numbers were determined and compared. A list of (the first of) these words is shown below. It should be noted that even more complicated words (e.g. German words like Hemisphäre) can be found in other PISA assessment areas (e.g. science).

English            German                rank (English)   rank (German)   in favour of
...
footprints         Fußabdrücke           4491, 1233       2833, –         English
pacelength         Schrittlänge          2649, 2607       746, 3172       English
average height     Durchschnittsgröße    388, 5346        3259, 1784      German
interpretation     Interpretation        5246             5632            English
make a border      umranden              112, 1669        –               English
communicate        kommunizieren         3608             –               English
exchange rate      Wechselkurs           508, 207         1923, 1187      English
information        Informationen         135              472             English
exports            Exporte               1936             7452            English
probability        Wahrscheinlichkeit    6703             6161            German
probable, likely   wahrscheinlich        654, 456         1247            English
average            Durchschnitt          388              3259            English
represented        dargestellt           2422             3148            English
clips              Klammern              7963             –               English
bar graph          Balkendiagramm        2128, 5010       –, –            English
happen             passieren, passiert   2238             1820, 2799      English

Tab. 2: Rank order comparison of words extracted from PISA mathematics example items. "–" indicates that the word could not be found in the list of the top 10 000 words. Commas indicate that the constituents of a term have been selected instead of a single word; in this case the higher rank number has been used for comparison (both parts need to be understood).

The compilation suggests that in most of the cases the English original uses words with lower rank numbers (hence occurring more frequently in the English language) than their German equivalents. To obtain a total figure for an approximate comparison, average rank numbers can be calculated (substituting the rank number 10 000 for words not in the list). Then the average for English is rank 2770, but the average for German is rank 5133 (considerably higher).
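The averaging rule can be sketched as follows, using a small excerpt of Table 2 rather than the full word list (the excerpt therefore gives different averages than the 2770 and 5133 reported above, but shows the same pattern):

```python
# Average rank numbers per language; None marks words missing from the
# top-10 000 list, for which the rank 10 000 is substituted.
MISSING = 10_000
english = {"interpretation": 5246, "exports": 1936, "average": 388,
           "information": 135, "communicate": 3608, "clips": 7963}
german = {"Interpretation": 5632, "Exporte": 7452, "Durchschnitt": 3259,
          "Informationen": 472, "kommunizieren": None, "Klammern": None}

def avg_rank(ranks):
    return sum(MISSING if r is None else r for r in ranks.values()) / len(ranks)

print(f"English: {avg_rank(english):.0f}  German: {avg_rank(german):.0f}")
```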

Although only a few words have been selected, the result is striking – the words' rank numbers indicate that the German item translation can be considered more difficult to understand than the English original.

This approach also explains why persons with a foreign mother tongue (e.g. with a migration background) may sometimes face problems. Usually the most frequent words of a language are taught first, and a vocabulary of the most frequent 10 000 words may not be sufficient to understand several of the PISA mathematics items.

Further language issues

When comparing two specific languages (e.g. English and German), further language issues may be identified. Subordinate clauses inserted in the middle of sentences are more frequent in German and may impair readability. German grammar is considered to be more complicated than English grammar. Ambiguities may lead to misunderstanding, and the use of an official language in translations may cause additional difficulty when "peer slang" predominates in the target group. Still other topics can be found in a summary by Rost (2001) on reading comprehension. However, a quantitative evaluation of these aspects is beyond the scope of this contribution.

Conclusions

For the PISA sample mathematics items it has been shown, via the regression's slope coefficient, that the German items are significantly longer than the English ones (based on straightforward character counting). Some slightly difficult words become even more difficult after their translation into German, and hence do not support fast and efficient answering in a test situation. A quick look into science and problem-solving items suggests that these findings are not limited to mathematics. As a consequence, the promise of PISA to support fair international, inter-language comparisons of the output of education systems begins to fail at the language boundaries.

Three steps are therefore proposed for the future. First, rigid international ranking schemes should be interpreted more carefully, to account for potential problems. Second, investigations should take place to better understand the process of answering PISA items, including language-specific problems and a variety of other factors, extending research on open issues – discussing PISA according to PISA. Finally, improvement of the whole process of item creation should consider item translation, item formats and new item types to overcome the current problems.

References

Artelt, C., Baumert, J., Klieme, E., Neubrand, M., Prenzel, M., Schiefele, U., Schneider, W., Schümer, G., Stanat, P., Tillmann, K.-J. & Weiß, M. (Hrsg.): PISA 2000. Zusammenfassung zentraler Befunde. Berlin 2001; online: http://www.mpib-berlin.mpg.de/pisa/pdfs/ergebnisse.pdf, retr. 2004/12/03.
HKPISA Programme for International Student Assessment Hong Kong Centre: Sample Test Items PISA 2000, 2003. 2006; online: http://www.fed.cuhk.edu.hk/hkpisa/sample/files/2000_Maths_Sample.pdf, retr. 2007/09/20.
Klicpera, C. & Gasteiger-Klicpera, B.: Lesen und Schreiben – Entwicklung und Schwierigkeiten. Bern: Huber, 1993.
OECD Organisation for Economic Co-operation and Development: Learning for Tomorrow's World. First Results from PISA 2003. Paris 2004a.
OECD: Apprendre aujourd'hui, réussir demain. Premiers résultats de PISA 2003. Paris 2004b.
OECD: Lernen für die Welt von morgen – Erste Ergebnisse von PISA 2003. Paris 2004c.
OECD Organisation for Economic Co-operation and Development: PISA 2003 mathematics questions. Paris 2005; online: https://www.oecd.org/dataoecd/12/7/34993147.zip, retr. 2007/09/14.
PISA Austria: Testinstrumente. 2004a; online: http://www.pisa-austria.at/pisa2003/testinstrumente/lang/III_Testinstrumente.htm, retr. 2005/08/14.
PISA Austria: Mathematik freigegebene Aufgaben. 2004b; online: www.pisa-austria.at/pisa2003/testinstrumente/lang/mathematik_freigegebene_aufgaben.pdf, retr. 2007/09/14.
Readingsoft: Speed Reading Test Online. 2007; online: http://www.readingsoft.com, retr. 2007/09/17.
Rost, D.H.: Leseverständnis. In: Rost, D.H. (Ed.): Handwörterbuch Pädagogische Psychologie. pp. 449-456. Weinheim: PVU, 2007.
Universität Leipzig: Deutscher Wortschatz – Wortschatzportal. Institut für Informatik, Universität Leipzig, 2007; online: http://wortschatz.uni-leipzig.de/html/wliste.html, retr. 2007/09/15.
Zipf, G. K.: Selected Studies of the Principle of Relative Frequency in Language. Cambridge (Mass.) 1932.


England: Poor Survey Response and No Sampling of Teaching Groups 1

S. J. Prais

United Kingdom: National Institute of Economic and Social Research, London

Abstract:

The two recent (2003) international surveys of pupils' attainments were uncoordinated, overlapped considerably, and were costly and wasteful, especially from the point of view of England, where inadequate response rates meant that no reliable comparisons at all could be made with other countries. It is the weaker pupils who tend not to respond, and poor response rates thus tend to yield upwardly-biased results. Inadequate emphasis on classes, or on teaching groups, in designing the samples means that little progress can be made in relating success in learning to average class size or to variability among pupils. The surveys were conducted, respectively, by the OECD (Programme of International Student Assessment – PISA) and by the US-based International Educational Assessment group (Trends in International Mathematics and Science Study – TIMSS). Sources of the problem are investigated here.

Some astonishment was aroused by the recently published results of two, apparently independently organised, large-scale international questionnaire surveys of pupils' mathematical attainments towards the middle of their secondary schooling (age 14-15); nearly 50 countries participated in each survey, with some 200 schools in each country. Both surveys were carried out in the same year, 2003; previous surveys had generally been carried out at about ten-year intervals, and each of these two very recent surveys had been carried out only 3-4 years previously. Some questions on science and literacy were included in 2003, but the focus was on mathematics (and that is our focus here). A test towards the end of primary schooling, at age 10, was also carried out in association with one of these surveys. The total cost was probably over £1m for England, and probably well over $100m for all countries together, plus the time of pupils and teachers directly involved. 2 Results were published by the beginning of 2005 in several thick volumes, totalling some 2000 large (A4) pages; the two organisations behind the surveys are known as TIMSS and PISA (details of the organisations and publications are at Annex A at the end of this paper). There does not appear, from these publications, to have been any coordination between the two organisations. Much wasteful overlap and duplication is evident; the interval between recent repetitions of these surveys was so tight as not to permit adequate consultation for lessons to be learnt. 3

1 This chapter is an edited version of my paper in the Oxford Review of Education (vol. 33, no. 1, February 2007). Thanks are due to the editors of that Review for permission to reproduce.

Representativeness of samples

We shall try and assess here some of the main findings for England, ask whether further surveys of this kind are justified, and whether anything is to be learnt from these recent surveys which might improve future ones. What can be said with any confidence about English pupils' attainments towards the end of their secondary schooling is much limited by poor sample response. From the TIMSS report on 14 year-olds we learn: 'England's participation fell below the minimum requirements of 50 per cent, and so their results were annotated and placed below a line in exhibits (= statistical tables) showing achievement'. 4 For the parallel PISA report, in all tables mentioning findings for the United Kingdom a footnote was attached to the line for the UK (and only for the UK!): 'Response rate too low to ensure comparability'. 5

2 Only limited information on the costs of these surveys has been released. For England, a total of £0.5m was paid to the international coordinating bodies, but information on locally incurred costs was withheld (in reply to a Parliamentary Question on 7 March 2005) as publication could 'prejudice commercial interests' in the government's negotiating of repeat surveys in 2006-7. It is astonishing that expenditure on further surveys should have been put in hand before there has been adequate opportunity for scientific assessment of the value of the 2003 surveys and of the appropriate frequency of their repetition.

3 The PISA (Programme of International Student Assessment) inquiry of 2003 was organised by the OECD and followed their first attempt at this activity in 2000. The report on their first survey was critically reviewed in my article in the Oxford Review of Education, 29 (2) (2003); the present paper has benefited from discussion following that earlier paper. The acronym TIMSS was originally short for Third International Mathematics and Science Study; subsequently it became short for Trends in International ... The previous occasion on which it had been carried out was 1999. More of the 2003 co-ordinating costs (76 %) were incurred by PISA, making TIMSS – which covered two age-groups – the better buy for the British taxpayer.

In other words, any differences that may appear between published results for England and other countries are not to be relied on. This reservation was not, however, attached to the tests of English 10 year-olds towards the end of their primary schooling (carried out by TIMSS, following a similar survey at that age in 1995); and those results, to first appearances, appear to be the most scientifically interesting and important for educational policy. We will need to examine below whether those results are indeed robust enough – that is to say, adequately representative – to be relied upon.

But before that, a short word on the recent historical background of Britain's schooling attainments may be helpful. Britain's economic capabilities – its motor industry, its machine tool manufacturing industry, as well as other industries relying on a technically skilled workforce – led to much public concern by the 1960s: expressed subsequently, for example, in the official Cockcroft Committee's report on Mathematics Counts (HMSO, 1978), eventually leading to the National Curriculum, the National Numeracy Project, and then to nationwide annual testing of all pupils in basic school subjects at all primary and secondary schools (SATs at ages 7, 11 and 14 to supplement the longer-standing GCSE tests at 16).

Detailed empirical comparisons of productivity and workforce qualifications were made in the 1980s and 1990s by teams centred at the National Institute of Economic and Social Research (London). Site visits to comparable samples of manufacturing plants in England and Germany clarified the nature of the great gaps in workforce qualifications; these gaps were not so much at the university graduate level, but at the intermediate craft levels (City and Guilds, etc.) – the central half of the workforce. The difficulty in England in expanding that central category of trainees was traced to the secondary school-leaving stage, when the standards of mathematical attainment required for craft and technician training, especially in numeracy, were much below Germany's. The IEA's First International Mathematics Survey of 1964 (FIMS – the original predecessor of TIMSS) was one of the important sources that confirmed this gap in secondary school mathematics; it was made evident to our teams of secondary mathematics teachers and inspectors on visits to secondary schools in France, Germany, the Netherlands and Switzerland, and in discussions with heads of industrial training departments (Meister). 6 An important conclusion from visits to schools was that it was quite unrealistic to expect English secondary schools to be able to produce the numbers of students with the levels of mathematical competence that had been seen abroad if they had to start with the standards delivered by our primary schools.

4 TIMSS, Mathematics Report, p. 351.

5 See, for example, PISA, Annex B, Data Tables, pp. 340 et seq.

Shifts in research interests and in official educational policy ensued for mathematics teaching, especially at the primary level. Textbooks in England and in Europe were carefully compared; teaching methods abroad were observed by practising teachers; new teaching schemes were prepared; and nationwide tests of pupils' attainments were administered annually to all pupils at ages 2-3 years apart (SATs). Much more could be said on the details of what has amounted to a 'didactic revolution'; but perhaps the foregoing is sufficient to indicate the interest attached to the 2003 TIMSS mathematics results at age 10, which can be compared with the similar sample inquiry eight years previously at that age (the 1995 TIMSS – Third International Mathematics and Science Study). Had England now caught up with its competitors, at least by the end of primary schooling?
the end of primary schooling?<br />

The comparison was set out, clearly and apparently convincingly, in the national report for England for 2003 produced by the (English) National Foundation for Educational Research (which carried out the survey in England in coordination with the international body). It noted that England's mathematics scores showed the largest rise of any of the 15 countries that participated at the primary level in both 1995 and 2003 (the English rise was 47 standardised points, from 484 to 531, where 500 is the notional average standardised score of all countries in these international tests, and the standard deviation is standardised at 100). Most test questions asked were different in the two years, but 37 questions were the same in both years; the proportion who answered those common questions correctly in England rose very satisfactorily from 63 to 72 per cent. The rise was even a little greater in questions relating to numeracy (arithmetic); this may all be taken as reassuring, since previous deficiencies in English students' attainments were, as said, particularly marked in that area – the foundation stone of mathematics. 7 The top countries at the primary school level were, once again, those bordering the Pacific: Singapore, Hong Kong, Japan – with scores averaging about 570; England's rise in performance in the nine intervening years, by 47 points to 531, can thus be seen as approximately halving the gap with these top countries – and in hardly more than a decade.

6 See my paper with K. Wagner, Schooling Standards in England and Germany: Some summary comparisons bearing on economic performance, in National Institute Economic Review, May 1985, and in Compare: A Journal of Comparative Education, 1986, no. 1. More generally, see the series of reprints re-issued by NIESR in two compendia entitled Productivity, Education and Training (1990 and 1995). Teams of teachers and school inspectors, particularly from the London Borough of Barking and Dagenham, were invaluable in assessing school visits here and abroad.
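The 'approximately halving' claim follows directly from the reported scores:

```python
# Gap to the Pacific-rim top group (about 570) before and after.
england_1995, england_2003, pacific_top = 484, 531, 570
print(pacific_top - england_1995)   # 86 points in 1995
print(pacific_top - england_2003)   # 39 points in 2003, roughly half
```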

To first appearances, this seems a remarkably encouraging achievement – and, one must equally say, in a remarkably short time-span, given the complexity of what amounted to changing almost the whole system of mathematics didactics. But are these sample results to be relied upon? We have noted that at the secondary school level (age 14) serious reservations were attached by the surveys' sponsors to the response rates of the samples for England; at the primary level (average age 10.3, Year 5 in England) a cautionary footnote is always attached to the TIMSS results reported for England (not as serious as for the secondary school results – but not to be ignored): 'Met guidelines for sample participation rates only after replacement schools were included'. 8 With that modestly expressed caution in mind, let us next patiently re-examine the actual response rates for England, bearing in mind that if response rates were lower in 2003 than in 1995, we might expect better average scores to be recorded simply as a result of 'creaming higher up the bottle'.

We first compare the response for schools; then the response for students within responding schools; and finally, the product of these two rates. In 2003 there were 150 primary schools in the original English representative sample, of which 79 schools participated, or 53 per cent. 9 For the previous primary school inquiry of 1995, 92 out of 145 sampled schools participated at the fourth grade – 63 per cent. 10

The student participation rate (within participating schools) was 93 per cent in 2003, just a little below the 95 per cent recorded for 1995. Combining the two participation rates (schools × students), we have a participation rate of something like 50 per cent in 2003 compared with 60 per cent in 1995: there are thus grounds for worrying whether there has been a genuine improvement in scores in the population. 11

7 See G Ruddock et al., Where England Stands in the Trends in International Mathematics and Science Study (TIMSS) 2003 (NFER), 2004, pp. 8-10.

8 IVS Mullis et al., TIMSS 2003 International Mathematics Report (IEA, Boston), 2004, for example, p. 35.

9 Ibid., p. 355.

10 IVS Mullis et al., Mathematics Achievement in the Primary School Years (TIMSS), 1997, p. A 13.
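The arithmetic behind these combined rates is simple enough to set out explicitly. A minimal sketch in Python, using only the figures just quoted:

    # Combined participation rate = school rate x student rate.
    # Figures as quoted above for the English TIMSS primary samples.
    def combined_rate(schools_participating, schools_sampled, student_rate):
        school_rate = schools_participating / schools_sampled
        return school_rate * student_rate

    rate_2003 = combined_rate(79, 150, 0.93)  # roughly 0.49, i.e. about 50 per cent
    rate_1995 = combined_rate(92, 145, 0.95)  # roughly 0.60
    print(f"2003: {rate_2003:.0%}; 1995: {rate_1995:.0%}")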

But is either of these overall response rates adequate for anyone to place reliance on the representativeness of the results? Even TIMSS put the ‘minimum acceptable participation rate’ at ‘a combined rate (the product of school and student participation) of 75 per cent’; but at Year 5 in England (as also in five other countries) 12 that criterion was said to be satisfied ‘only after including replacement schools’. This brings us to a long-standing thorny dispute on acceptable sampling practices. The sampling procedure adopted in these international educational inquiries is not at all orthodox. It starts with several parallel lists of schools, each list being equally representative. 13 If an inadequate response is received from the initial list, then ‘corresponding’ schools from the second list are approached, and from a third list if necessary. For England in 2003, as said, a sample of 150 schools was drawn from the initial list; in the outcome, 79 schools from that list participated (a mere 53 per cent) and 71 schools refused. A further 71 (replacement) schools were then chosen from the second list, of which an estimated 27 schools participated (38 per cent) and 44 refused; an estimated 44 were then approached from the third list, of which 17 participated. The total number of schools participating thus came to 79+27+17=123; the total number approached was (nota bene, since the organisers of these surveys do not agree!) 150+71+44=265; the overall response rate for schools was therefore 123/265=46 per cent (a little below the 53 per cent from the first list). Taken together with a response of 93 per cent of students in participating schools, the total combined response (schools and students) was thus only 43 per cent – far below the proportion (75 per cent) originally laid down by TIMSS as acceptable.
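To make the disputed arithmetic fully transparent, the same calculation can be set down as a short sketch (the figures are those given in the paragraph above; the contrasting rate as published by TIMSS is taken up in the next paragraph):

    # Schools approached and participating, list by list (England, TIMSS 2003).
    approached = [150, 71, 44]       # initial list, second list, third list
    participating = [79, 27, 17]

    # Correct overall school response rate: participants / all schools approached.
    school_rate = sum(participating) / sum(approached)   # 123/265 = 46 per cent

    student_rate = 0.93
    combined = school_rate * student_rate                # about 43 per cent
    print(f"schools: {school_rate:.0%}; combined: {combined:.0%}")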

11 The reader will understand that the gradient of the response rate with respect to attainment level will be different according to whether it is amongst schools, at the school level, or amongst students within schools; but the point is not worth elaboration in view of what is said in the next paragraph.

12 Australia, Hong Kong, Netherlands, Scotland, United States (ibid., p. 359). For the US a response rate (before replacement) of only 66 per cent was recorded for the primary survey, and the same for the TIMSS secondary survey. For England’s secondary survey, the corresponding proportion was a mere 34 per cent!

13 For example, starting from an initial list of schools organised by geographical area, size, etc., a random start is made; subsequent schools are chosen after counting down a given total number of pupils (so, in effect, sampling schools with probability proportional to their size). A reserve list is yielded by taking schools, each one place above the schools in that initial list; and a second reserve, by going one place down the initial list.
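The selection mechanism described in footnote 13 – a systematic draw with probability proportional to size, plus reserve lists one place above and one place below each selected school – can be illustrated as follows. This is a simplified sketch; the toy school list is invented and not taken from the surveys:

    import random

    def pps_systematic_sample(schools, n):
        # schools: list of (name, pupil_count), already ordered by area, size, etc.
        # Systematic draw with probability proportional to pupil numbers.
        total = sum(pupils for _, pupils in schools)
        step = total / n
        start = random.uniform(0, step)          # random start (cf. footnote 13)
        sample, cum, idx = [], 0, 0
        for k in range(n):
            target = start + k * step
            while cum + schools[idx][1] < target:
                cum += schools[idx][1]
                idx += 1
            sample.append(idx)
        return sample

    def reserve_lists(selected, n_schools):
        # First reserve: the school one place above each selected school in the
        # ordered list; second reserve: one place below.
        first = [max(i - 1, 0) for i in selected]
        second = [min(i + 1, n_schools - 1) for i in selected]
        return first, second

    # Example with an invented list of ten schools.
    toy = [(f"school {i}", 200 + 10 * i) for i in range(10)]
    picked = pps_systematic_sample(toy, 3)
    print(picked, reserve_lists(picked, len(toy)))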



Incredible as it may seem, the statisticians at TIMSS calculated a participation rate not in relation to the total number of schools approached on the first and subsequent lists (265), but in relation to the smaller number originally aimed at (150); they consequently published a misleading response rate of 123/150 = 82 per cent for schools, and of 75 per cent for schools and students combined – just falling within their originally stipulated requirements (whereas the correctly calculated combined response rate, as just said, was only 43 per cent). Was this merely a momentary, forgivable slip? Or was it more in the nature of a scientistic trompe l’oeil, encouraging readers to believe that all was fundamentally well and had been placed in sound hands – including the hands of a Sampling Referee, an expert to whom such technical statistical details had been safely relegated? Having discussed this issue with a number of British statisticians, I have regretfully come to the conclusion – putting it as kindly as I can – that these surveys’ statisticians had misled themselves and their educationist colleagues as a result of their commercial experience with quota sampling; and that any future such inquiry needs to be advised by a broader panel of social statisticians. 14 For the sake of clarity, I repeat that such an enlarged body will need to address two issues: first (a simple arithmetical issue), what is the correct method of calculating a total response rate if ‘replacement’ samples are included; secondly, is there any substantial scientific justification for approaching a ‘replacement sample’ (rather, say, than an initially larger sample)?

Returning to the real issue on which we would all like to draw happy conclusions – namely, the tremendous rise in our pupils’ attainments at age 10 – we see from the previous paragraph that the response achieved (a combined rate of 43 per cent, not the 82 per cent for schools reported by TIMSS) has to be judged altogether too low to support any such conclusion.

14 Quota sampling is used in commercial work, and places greater emphasis on achieving the agreed total of respondents than on their representativeness; it is avoided in scientific work. On the ‘Sampling Referee’, see TIMSS 2003, p. 441. The issue of replacement sampling was questioned in my previous paper on PISA 2000 (Oxf. Rev. Education, 29, 2); see also the response by RJ Adams (ibid., 29, 3), and my rejoinder to that response (ibid., 30, 4). The need for representative sampling is so basic to scientific survey procedures that it is astonishing that those responsible for educational surveys, together with the government departments providing taxpayers’ money for such exercises, could accept such an easy-going (slack) approach to non-response. But, as it now turns out, this was not the last word – as discussed below in relation to re-weighting with population weights.

But we cannot leave the topic of response rates without noticing a considerable improvement in the way that England’s secondary school scores were calculated for TIMSS. As said at the outset, the whole of the English results were
rejected for international comparability in the international reports because they did not satisfy the originally specified sampling requirements (the rejection applied equally to TIMSS and PISA). There was, however, an additional national report on England’s TIMSS survey which outlined an alternative calculation based on re-weighting the sample results by population weights. It tells us that the TIMSS sample over-represented schools that were ‘average and above average in terms of national examination (or test) results’ (i.e. weaker schools were under-represented: SJP); ‘this sample was therefore re-weighted using this measure of performance to remove this effect’. 15 Presumably, the obligatory nationwide SAT test results were used to provide better weights, but details have not been released as to whether, for example, the re-weighting was for the country taken as a whole, or for the sampled schools or, indeed, the sampled students. The consequence of the re-weighting was that England moved down in the TIMSS mathematics ranking, to below Australia, the United States, Lithuania and Sweden (a reduction of England’s international score from 505 to 498). Nothing of very great substance, it might be thought; but the new method of estimation is of great importance for future surveys.

Such an adjustment raises the reliability of England’s estimated average scores because, to put it simply, it employs population – rather than sample – weights for the various ability-strata. When educational surveys of this kind were first attempted in 1964, no routine nationwide tests of mathematical attainments were available for England; now that they have become available, and even on an annual basis, they could be used to provide population weights for a TIMSS-type survey using internationally specified questions. 16
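The principle of the adjustment – replacing the attainment-strata shares that happen to occur in the sample by the shares known to hold in the population – amounts to simple post-stratification. A minimal sketch, with invented stratum figures purely for illustration (the published reports give no comparable detail):

    # Post-stratification: weight stratum means by known population shares
    # rather than by the shares occurring in the (biased) sample.
    # Stratum labels and numbers below are invented for illustration only.
    strata = {
        # stratum: (sample_size, sample_mean_score, population_share)
        "below average": (20, 470, 0.40),
        "average": (60, 500, 0.40),
        "above average": (40, 540, 0.20),
    }

    n = sum(size for size, _, _ in strata.values())
    sample_weighted = sum(size * mean for size, mean, _ in strata.values()) / n
    population_weighted = sum(share * mean for _, mean, share in strata.values())

    print(f"sample-weighted mean: {sample_weighted:.0f}")          # 508 here
    print(f"population-weighted mean: {population_weighted:.0f}")  # 496: corrected downwards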

15 G Ruddock et al., Where England Stands (. . . in TIMSS 2003), National Report for England (NFER, 2004), p. 25. The (previous) view expressed by PISA was very different: ‘A subsequent bias analysis provided no evidence for any significant bias of school-level performance results but did suggest there was potential non-response bias at student levels’ (PISA, p. 328, my ital.). To emphasise: this is different from the TIMSS conclusion that it was weaker schools that needed up-weighting to improve representation (pp. 9, 25).

16 It is difficult to find more than a trace of a reference to this re-weighting in the international TIMSS report, though it is quite explicit in the English national report; the same average scores for England are published in both reports. The TIMSS Technical Report (ch. 7, by M Joncas, p. 202, n. 7) offers the following light: ‘The sampling plan for England included implicit stratification of schools by a measure of school academic performance. Because the school participation rate even after including replacement schools was relatively low (54 %), it was decided to apply the school non-participation adjustment separately for each implicit stratum. Since the measure of academic performance used for stratification was strongly related to average school mathematics and science achievement on TIMSS, this served to reduce the potential for bias introduced by low school participation’. The PISA report does not discuss any such possible improved estimation procedure.
reduce the potential for bias introduced by low school participation’. The <strong>PISA</strong> report does



The upshot is, first, that while the TIMSS primary survey results for England are less reliable than would appear from the way they were reported, those for the secondary schools – after re-weighting – are more reliable. Secondly, sampling errors ought properly to be calculated for the TIMSS secondary school survey as for a stratified sample. Thirdly, the poor response rates achieved in both of these secondary school surveys might yet encourage a refusal by England – at a political level – to support any such future surveys; but we see here that what is first really required is more research into sampling design, that is, better use of population information collected in any event for general educational objectives, so enabling more accurate results to be attained at lower cost. 17

17 The above discussion of response rates has been restricted, for the sake of brevity, to the primary school survey. More or less the same applied to both secondary school surveys, as follows. For the TIMSS secondary survey, the participation rate of the 160 sampled schools (before replacements were included) was a pathetic 34 per cent (TIMSS, p. 358); for the PISA inquiry, directed to 450 schools, it was 64 per cent (PISA, p. 327, col. 1). For the US, which deserves special attention because of its greater financial sponsorship, the corresponding secondary school response rates were 66 and 65 per cent (but would its financial contribution have been as great if the true response rates had been published, i.e. after correctly allowing for replacement sampling, as explained above?).

The English Department of Education issued Notes of guidance for media-editors explaining that their ‘failure to persuade enough schools in England to participate occurred despite . . . various measures including an offer to reimburse schools for their time . . . ’ (National Statistics First Release 47/2004, p. 4, 7 December 2004). Note the term ‘reimburse’: there is no suggestion of motivating a sub-sample of schools by a substantial net financial incentive.

Objectives of international tests

When these international educational tests were introduced nearly two generations ago, it was widely understood that their main objective was not – as it seems to have become today – to produce an international ‘league table’ of countries’ schooling attainments, but to provide broader insight into the diverse factors leading to success in learning. Despite the current popular emphasis on ‘league table’ aspects (usually without a corresponding emphasis on sampling errors), much space is devoted in the present reports to students’ ‘perceptions’, attitudes towards learning and their relation to success. But the reader often finds himself questioning the direction of causation; for example, we are told such things as that students who are happy with mathematics tend to do better in that subject: but perhaps causation runs more the other way round – those who do well in that subject are happier, or more willing to declare their happiness. Similarly, much space is given to watching TV and its association with test scores; to reading books; and so on. But little space is given in these reports to what topics are taught at each age, to what level, and to what fraction of the age-group (see Annex B on the implications of a longer basic schooling life in the US); nor to such a basic ‘mechanism’ of school learning as how students, who inevitably differ in their precise levels of attainment, are grouped into ‘parallel’ differentiated classes – despite the obvious concern of this feature of schooling to teachers, parents, policy makers and, not least, students.

The relation between the size of a class and its average achievement is tabulated in one of the studies and well illustrates the issue of the direction of causation. For smaller classes of up to 24 students, an average score of 479 was recorded for England at Year 9; for medium-sized classes of 25-32 students, the average score was higher, at 511; and for larger classes of 33 or more students, the average score was higher still, at 552 (much the same applied in the other countries). 18 Higher attainments in larger classes have frequently been observed before – contrary to the presumption that smaller classes would do better; this ‘statistical relation’ has generally been attributed to the widespread recognition by schools that slower/weaker pupils should be taught in smaller ‘parallel’ classes where possible. Whether schools allocated higher-attaining pupils to larger classes as efficiently as possible can be debated; but it is clear that no one (least of all the present writer) would draw the policy implication that if children were only taught in larger classes then they would attain better results at lower costs. Much care is similarly necessary in drawing conclusions from other statistical associations noted in these studies.

18 TIMSS, Mathematics Report, p. 266; the same applied also to the primary inquiry at Year 5, p. 267.

For example, very strong conclusions were drawn by PISA on how the schooling system should deal with the variability of students’ attainments and capabilities. But let us first spell out realistically the issue of variability of pupils’ attainments in a class from the teacher’s point of view. Some variability of students’ attainments within a class is unavoidable; but, once a certain level of variability is exceeded, the pace at which the teacher can teach slows, as does the pace at which learning takes place, not least amongst those students who are weaker (weaker for whatever reason – born at the later end of the school-year,
illness last year, slow learning in a previous school, difficulties at home that weigh on the student’s mind . . . ), often with consequent ‘playing up’ in class; eventually the teacher finds it better to divide his ‘class’ into explicit sub-groups, or ‘sets’, which follow a more or less different syllabus of tasks, with consequences for the pace of learning, and the costly need for teaching assistants. All this is of course familiar; and it might have been thought that an elementary calculation of the variability of attainments within a class would have been a natural, obvious, useful – indeed essential – part of such inquiries.
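The ‘elementary calculation’ in question is no more than a decomposition of the total variance of attainments into within-class and between-class components. A minimal sketch, with invented class score lists purely for illustration:

    from statistics import pvariance

    # Each inner list holds the attainment scores of one class/teaching group
    # (invented numbers, purely for illustration).
    classes = [
        [41, 44, 47, 50, 53],   # a lower set: narrow spread
        [55, 60, 62, 65, 68],
        [70, 74, 78, 82, 86],   # a top set: wider spread
    ]

    all_scores = [score for c in classes for score in c]
    total_var = pvariance(all_scores)

    # Size-weighted average of within-class variances; the remainder is the
    # between-class component (law of total variance).
    within = sum(len(c) * pvariance(c) for c in classes) / len(all_scores)
    between = total_var - within

    print(f"total {total_var:.1f} = within {within:.1f} + between {between:.1f}")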

But the PISA sample was deliberately based not on whole classes, but on all those aged 15 in a school – whichever Year or attainment-set they were in. In England, as in other countries where promotion from one class to the next is based strictly on age, it might seem that nothing much is at issue; but to rely on that would ignore the widespread practice of ‘setting’ students within each year into groups by attainment level – a practice that becomes more widespread at higher ages. In most other countries some reference to attainment level usually influences promotion from one class to the next. But nothing of this can be investigated with the help of PISA, since its sampling was based not on whole classes but purely on age, irrespective of class or teaching-group.

The TIMSS sampling process, on the other hand, was different, since it was based on whole classes, and may thus be expected to be more relevant to our concerns; but that does not take us out of the woods. For the reality of a ‘class’ becomes tenuous in the upper reaches of secondary schooling, as ‘setting’ by attainment becomes more prevalent. In large English comprehensive secondary schools, a dozen ‘parallel’ mathematics classes for each age or ‘Year’, varying according to attainment, is not unusual; for TIMSS, usually just one of those classes was selected by some ‘equal probability’ procedure, except that when some classes were very small they were combined with another to form a ‘pseudo-classroom’ for sampling purposes. 19 A small class for very weak pupils might be combined with another class next higher in its attainments; or, for all we are told, could be combined with a small top set. In any event, no statistical analysis of the extent of student variability within teaching groups, nor even of the whole year-group within a school, seems to have been attempted as part of either of these sample inquiries, despite the central importance of that issue to success in teaching and learning, and its interest to teachers and educational planners.

19 TIMSS, [International] Technical Report, p. 121 (see also Mathematics Report, p. 349, which is also not very helpful); the English National Report has an Appendix on Sampling (p. 287) but regrettably says nothing on this vital aspect of sampling.

Despite the sampling design of both inquiries being so perverse that the variability of students within teaching groups cannot be computed (to repeat: PISA did not sample whole classes; TIMSS generally sampled only one ‘ability-set’ out of each year-group), very strong policy conclusions were voiced in the PISA report against any form of differentiation: they were against dividing secondary school pupils into different schools according to attainment level (in England: grammar schools and comprehensives); they were against dividing pupils within schools into streams or attainment sets; and they were against grade repetition, which they ‘considered as a form of differentiation’, and ipso facto evil. 20 Throughout there is the assumption that differentiation is the cause of lower average attainments, rather than seeing it the other way round – where teachers are faced with a student body that is unusually diverse, they use any organisational mechanism at their disposal to reduce diversity, and so make the group more teachable. In other words, greater variability within the class needs to be understood as the cause, rather than the effect, of lower attainments. All these conclusions were announced by PISA with great conviction – indeed, with great presumption – despite, as said, no calculations of the variability of attainments within teaching groups, classes or year-groups having been possible from their data.

20 Parents in countries with low between-school variances, we are told, ‘can be confident of high and consistent performance standards across schools in the entire education system’ (PISA, p. 163). ‘Avoiding ability grouping in mathematics classes has an overall positive effect on student performance’ (though it is conceded that ‘the effect tends not to be statistically significant at the country level’!) (p. 258). ‘Grade repetition can also be considered as a form of differentiation’ [and is therefore to be avoided] (p. 264).

The future

How was it possible, the reader will ask himself, for such large inquiries, with their endless sub-committees of expert specialists, to arrange their sampling procedures so as to exclude the possibility of calculating the variability of attainments for each class or teaching group? Any student of Kafka will readily invent his own detailed scenario; but the essence is probably that the specialists were too specialised – in particular, the statisticians did not understand, or give sufficient weight to, the pedagogics of class-based learning; and the educationists did not give sufficient attention to the implications of the sampling procedures for response rates. Perhaps most important, those in overall command were not sufficiently alive to such deficiencies in their varied specialists. Better ‘generalists’, rather than more specialists, seem to be required.

From the point of view of more representative sampling, future international inquiries of this kind, it can now be seen more clearly, need to be redesigned to incorporate sampling features of both of these recent inquiries. We need to focus (a) initially on the original variability of attainments of a complete age-group of students (variability due to socio-historical or genetic elements), perhaps estimated by the PISA approach or by sampling two (or three) adjacent school grades, as in previous TIMSS inquiries; (b) then we need to estimate the extent to which variability is reduced within teaching groups as they have been organised by schools in practice; and (c) finally, we need to estimate the separate contributions of the various institutional factors in each country to that reduction in variability – secondary school selection, ability-setting within year-groups, class repetition. Differences among countries in these elements may yield valuable and empirically-based policy conclusions.

From the point of view of the substance of the inquiries, more focus and debate would be valuable on syllabus issues within mathematics. For example, what is the proper share of arithmetic in the overall mathematics curriculum at younger ages, and how should that share vary for different attainment groups? In some countries (Switzerland, Germany), at least until recently, the less academic group of students often became more expert in mental arithmetic as a result of their different curricular emphases; has the wholesale use of calculators really made this otiose? At what ages, and to what fractions of pupils, should specific topics be introduced, such as simultaneous linear equations, quadratic equations, basic trigonometry or even basic calculus? No more than these few hints can be thrown out within the ambit of the present Note to indicate what a proper Next Step should include (see also Annex B on the anomalously low average attainments in mathematics at age 15 in the world’s economically leading country).

A final question: how much public breast-beating by the organisations that carried out the two recent inquiries will be needed before they can be considered eligible for participation in such an improved Next Step?

Acknowledgements and apologies

This Note has benefited from comments on earlier drafts by Professor G Howson (Southampton), Professor PE Hart (Reading), Professor J Micklewright (Southampton), Dr Julia Whitburn and many others at the National Institute of Economic and Social Research, London; I am also indebted to the National Institute for the provision of research facilities. Needless to say, I remain solely responsible for errors and misjudgements.

I take this opportunity also of offering apologies to the individuals who have innocently participated in carrying out the underlying inquiries here reviewed; but those who planned those inquiries must fully accept their share of blame for the inadequacies complained of here, and for too often uncritically following what was done in previous inquiries – instead of improving on those practices.

ANNEX A

Some background on the two international educational inquiries of 2003

The International Association for the Evaluation of Educational Achievement (IEA) has been active since the 1960s in sponsoring internationally comparative studies of secondary schooling – subsequently also primary schooling – involving tests set to representative samples of students. The school subjects covered were mathematics and science, plus some separate inquiries into reading/literacy. The year-groups focussed on were the eighth and fourth grades on the international grading (Europe and the United States), corresponding to Years 9 and 5 in the UK, that is, to ages of about 14 and 10. Sampling was based on school classes. Before 2003 the IEA had carried out similar inquiries in 1995 (in some countries also in 1999). The number of countries expanded over time to reach 49 in 2003; the most recent IEA inquiries go under the name of TIMSS – Trends in International Mathematics and Science Study. The studies are now managed from Boston College, Mass., with substantial financial support from the US government, mainly for the central organisation; financial support for the surveys in each country is provided locally.

Three reports were published by TIMSS on their 2003 inquiries:

IVS Mullis et al., TIMSS 2003 International Mathematics Report (Boston College, 2004), pp. 455.

IVS Mullis et al., TIMSS 2003 International Science Report (Boston College, 2004), pp. 467.

MO Martin et al. (eds), TIMSS 2003 Technical Report (Boston College, 2004), pp. 503.


The second inquiry considered here was sponsored by the OECD (Organisation for Economic Co-operation and Development), an international organisation set up in Paris to assist European post-war economic reconstruction and development, with heavy support from the United States. It conducted its first assessment of educational attainments in 2000 under the name Programme for International Student Assessment, PISA for short; a repeat was carried out in 2003. I have not been able to find any written justification for setting up an inquiry so close in its objectives to the IEA’s; but two differences – not necessarily justifications – should be noted. First, PISA focuses on a certain age, 15 – rather than on a school Year (or grade), as TIMSS does – for those included in its survey (though for some countries, such as Brazil and Mexico, that age is beyond compulsory schooling, and only about half of that age-group can be contacted). On average, the PISA age is about a year above that of TIMSS, and closer to the age of entering the workforce. Secondly, the focus of students’ questioning in PISA was said to be on the ‘ability to use their knowledge and skills to meet real-life challenges, rather than merely on the extent to which they have mastered a specific school curriculum’; whereas the focus of TIMSS is closer to the school curriculum. 21 It still remains to be shown whether, given the practicalities of written examinations held in a schoolroom, it makes any substantial difference to the outcome whether one kind of question is asked or the other.

The PISA inquiry covered mathematics and science, just as TIMSS did, and also had questions on literacy (reading). PISA’s emphasis in 2003 was on mathematics. Results were published in:

[No attributed authorship] Learning for Tomorrow’s World: First Results from PISA 2003 (OECD, Paris, 2004), pp. 476.

R Adams (ed.), PISA 2003 Technical Report (OECD, Paris, 2005).

Of the 48 countries included in PISA (49 in TIMSS, as said), 19 also participated in TIMSS. A full investigation, with access to individual questions and results in both inquiries, would be needed for a proper comparison; here we may note only that Hong Kong and Korea were near the top scorers in both inquiries (scores of 586 and 589 in TIMSS; 550 and 542 in PISA); in Europe, the Netherlands and Belgium were about equally high (536 and 537 in TIMSS – Flemish Belgium only; 538 and 529 in PISA); and the United States was very slightly above average in TIMSS (a score of 504) and more than slightly below average in PISA (483). The different mix of countries in the two samples affects the standardised marks published: such comparisons between the results of the inquiries are therefore no more than suggestive.

21 PISA (2004), p. 20.

ANNEX B

The proper objectives of international comparative educational research

That the US, the world’s top-performing economy, was found to have schooling attainments that are only middling casts fundamental doubts on the value, and the approach, of these surveys. It could be that the hyper-involved statistical methods of analysis used (known as Item Response Modelling) are, as many have suggested, wholly inappropriate (see also my comment of 2003 on the PISA 2000 survey, p. 161). Or it could be, as two US academics have suggested, that the level of schooling does not matter all that much for economic progress; rather, it is ‘Adam Smithian’ factors such as economies of scale, and minimally regulated labour markets that allow US ‘employers enormous agility in hiring, paying and allocating workers . . . ’. 22 Or – my own view – that the typical age of school-leaving in the US, at some three years above that in most European countries (say, 19 rather than 16), has the consequence that schooling attainments at 14-15 hardly provide a clear indication of the contribution of final schooling attainments to subsequent working capabilities. An older typical school-leaving age means that teachers can sequence their courses of instruction in a more graduated way; and that the kind of question set in the PISA inquiries – designed to be close to everyday life – is indeed something for which US students aged 15 are less ready than their European counterparts. But that does not mean that at later ages their schooling has not served US students as a whole at least as well as their European counterparts; more time may usefully have been spent by US students in those subsequent three years in consolidating fundamentals. No investigation, or even discussion, of such issues is to be found in the official reports on these inquiries; and the absence of a sufficient number of published individual questions makes it impossible for the reader to take the issue further.

22 See A P Carnevale and D M Desrochers, The democratization of mathematics, in Quantitative Literacy (eds. B L Madison and L A Steen, National Council on Education and the Disciplines, Princeton NJ, 2003), esp. p. 24: ‘if the United States is so bad at mathematics and science, how can we be so successful in the new high-tech global economy? If we are so dumb, why are we so rich?’


So far we have treated both surveys (TIMSS, PISA) as showing much the same schooling performance for US pupils – namely, as indifferent, or even weak, when judged in relation to the tremendous economic performance of that country. But we should also notice, and express surprise, that it is precisely in the survey whose questions emphasise practical and ‘real life’ aspects, namely the PISA survey, that average US 15-year-olds are shown as being below the world average – whereas, in the more school-task-oriented TIMSS survey, US students were – even if only modestly – above the world average. Indeed, it is not too fanciful to suppose that the undistinguished performance of US students on school-curriculum-oriented questions in the earlier TIMSS surveys provided some of the impetus for carrying out a further survey with a more practical emphasis in its questioning. But anyone who expected better results for the US via that line of questioning must have been sorely disappointed by the outcome. That outcome, it may also be concluded, casts further doubt on the value of repeating a PISA-type survey. Until wider-ranging pilot inquiries, on alternative lines, have been carried out and analysed, it is difficult to see that further inquiries of the present sort and scale are justified.


Disappearing Students

PISA and Students With Disabilities

Bernadette Hörmann

Austria: University of Vienna

1 Who is disappearing?

“Have you ever tried to get a stroller or cart into a building that did not have a ramp? Or open a door with your hands full? Or read something that has white print on a yellow background, or is printed too small to read without a magnifying glass, or has words from a different generation or culture? Have you ever listened to a speech given without a microphone?” (Johnstone/Altman/Thurlow 2006, p. 1)

Concerning student assessment, public and scientific discourse seems to be limited to questions about its conditions, possibilities and consequences. The urgent question that has to be asked concerns the role of children with disabilities in assessment tests like PISA, TIMSS, etc. Are these students included? How are they included? Is there a way to include children with special needs in assessment tests? Are these assessment tests even able to assess the abilities of students with disabilities in an adequate way?

Generally, students with disabilities (SWD) do not get the chance to participate in student assessments (e.g. Posch/Altrichter 1997, p. 41; Van Ackeren 2005, p. 26; in the USA: McGrew/Algozzine/Spiegel/Thurlow/Ysseldyke 1993; Thurlow/Elliott/Ysseldyke/Erickson 1996a; Quenemoen/Lehr/Thurlow/Massanari 2001). In most cases they are asked to stay at home when the test takes place, or they are sent to another classroom during the test. If students with disabilities are allowed to participate in the testing, their scores are usually not counted, which means that these children are not represented in the official statistics. In the case of PISA, students who attend a special school (“Sonderschule”) get special test booklets that contain easier questions and are shorter than the normal ones. But in general, students with disabilities are excluded from the testing process, which makes them disappear, as they are not represented in the results of the assessments. Children who face exclusion are students with all kinds of disabilities, immigrants (non-native speakers) and low achievers. As Wuttke observed, the states participating in PISA 2003 dealt quite differently with the “problem” of handicapped children. Turkey, for example, excluded only 0.7 per cent of its students, while the exclusion rate in Spain and the USA reached 7.3 per cent (OECD 2005, p. 169, quoted in Wuttke 2006, p. 106). PISA regulations allow the exclusion of up to five per cent of the population, a limit that has been exceeded in several states. Haider, the Austrian national coordinator of PISA 2003, gives the following advice concerning the exclusion of students in the Austrian PISA report of 2003: pupils can be excluded in the case of severe, constant physical or mental disability, insufficient language knowledge, or when the student has dropped out of school.
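Checking reported exclusion rates against PISA’s own five-per-cent ceiling is, of course, trivial to mechanise; a minimal sketch in Python, using only the rates cited above:

    # PISA's rule: at most 5 per cent of the target population may be excluded.
    # The rates below are those cited above (OECD 2005, quoted in Wuttke 2006).
    CAP = 0.05
    exclusion_rates = {"Turkey": 0.007, "Spain": 0.073, "USA": 0.073}

    for country, rate in exclusion_rates.items():
        verdict = "exceeds the cap" if rate > CAP else "within the cap"
        print(f"{country}: {rate:.1%} - {verdict}")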

In the U.S., student assessment is more common and looks back on a long tradition. For this reason, scientists have developed a distinct branch of research, one concerned with the problem of the exclusion of students with disabilities. The NCEO (National Center on Educational Outcomes) provides annual reports and detailed research studies on this topic, and it aims at raising the number of students included in the testing. In Europe, however, this issue seems not to be considered as important (cf. Hörmann 2007). There seems to be a lack of research and public interest, a topic which will be discussed later in this article.

The following lines will deal with arguments for the inclusion of SWD and will illustrate how the U.S. tries to account for the diversity of students. Afterwards, the situation in Austria and the German-speaking countries will be discussed, while prospects for the future will be given in the conclusion.
be discussed while aspects for the future will be given in the conclusion.


2 “Out of sight is out of mind” 1 : Arguments for the inclusion of students with disabilities in assessment tests

1 As Thurlow, Elliott, Ysseldyke and Erickson (1996a) remarked in a pointed way.

As Hopmann (2006) shows, student assessment has become an inevitable necessity of modern society. Since the late 1990s, the socio-political models have changed from “management by placement” to “management by expectations”. The social welfare state with its traditional institutions and form of government could not be maintained any longer, and the public gradually demanded accountability from the institutions providing public services. Under management by expectations, the risks and expectations attached to an institution are collected and then taken as the basis on which its tasks and budget are fixed (cf. Hopmann 2006).

Assessment tests in their current state also deal with expectations and are a new way of measuring risks. State school systems are now bound to account for their services; and their services are delivered to regular students as well as to students with disabilities. On what basis can the exclusion of children with disabilities from such assessments be justified any longer?

Günter Haider wrote in the Austrian national report of PISA 2003:

“Im Rahmen von PISA sollen jedoch nicht einzelne Personen geprüft, sondern die Merkmale aller Schülerinnen und Schüler eines Landes kollektiv – über große Stichproben – erfasst werden.” (Haider 2003, p. 13, emphasis in original)

Thus, PISA is not designed to test individual students, but rather to measure the characteristics of all the pupils of a specific country collectively, via large samples. Apparently, it is intended to produce a representative picture of the performance of all students of the whole nation.

According to Thurlow et al. from the NCEO, the idea of the inclusion of all students is based on the following three assumptions:

– “All students can learn.
– Schools are responsible for measuring the progress of learners.
– The learning process of all students should be measured” (Thurlow et al 1996a)

Excluding particular students from testing would mean that those children are made invisible and that they are also excluded from any political or social decisions. Most policy decisions concerning school structures are based on the results of large-scale assessments. Consequently, children
who are excluded from the test are also excluded from policy decisions which actually affect them. Thurlow even points out that excluding certain children from tests leads to invalid comparisons, which means that including those children could provide more realistic results (cf. ibid.).

From the perspective of special needs education, it has to be said that it is an obligation that every single child be granted the possibility of taking part in international student assessment. From the 1990s onwards, a new concept has been developed which should displace the old notion of “integration”: the concept of “inclusion”. The Salamanca Statement of 1994 (UNESCO 1994) proclaims the right of every single child to participate in society, which means that all children should attend the same kind of school. Disability is viewed as just one kind of diversity among many others, and the ambitious aim of the statement is not to change people, but rather social structures and institutions, so that it becomes possible to account for all people’s needs (cf. Biewer 2005, p. 102ff). From this point of view, it is not the student who is disabled, but rather the school, which “disables” certain kinds of children. Consequently, student assessment, as a part of education, has to be geared to children with disabilities; it has to be constructed in a way in which every single child has the chance to participate successfully in the test. In contrast to this conception, the concept of “integration” would mean that students with disabilities would have to be remedially instructed to an adequate degree so that they could take part in the assessment. In this case the students would have to be changed instead of the tests.

3 Research in the U.S.

Including SWD in assessment tests has gained a lot of attention and has become an important part of policy in the U.S., which is trying hard to establish a “participation policy” that should ensure that full inclusion becomes reality. In 2001, the “No Child Left Behind Act” was enacted, which dictates that every single state of the U.S. has to report the participation rates of students with disabilities and the way they participate in assessment. “Full inclusion”, however, can never fully become reality, as there will always be excluded students – at random or not (cf. Koretz/Barton 2003).

The National Center on Educational Outcomes (NCEO) publishes an annual report in which the fundamental trends of the participation of particular student groups are presented. This is a quite challenging task, because every
state has different guidelines concerning the inclusion of students with disabilities. But in general, almost every state is able to give trends and facts about the inclusion and performance of students over the past three years.

3.1 Participation and performance trends in the U.S.

The Annual Performance Report by Thurlow, Moen and Altman (2006) gives current figures on the participation and performance of students with IEP (Individualized Education Plan) enrolment in state assessments across the entire U.S. for the years 2003 and 2004. Almost every state of the US is able to provide data about its participation rates in student assessment. The figures are categorised by the different types of assessment (reading and math), the three school levels (elementary, middle and high school) and the two different types of states (regular or unique states). For example, at the elementary level, in 38 regular states (out of 50) and 2 unique states (out of 10), at least 95 per cent of all children with IEP enrolment were assessed. Most of the other states reached between 85 and 95 per cent, which can be seen as quite a high participation rate (see Fig. 1).

At the middle school level, 34 regular states and 1 unique state assessed 95 per cent or more, and at the high school level, 26 regular states and no unique state reached this limit, whereas nearly half of the states reached between 85 and 95 per cent (Thurlow et al 2006, p. 7).

Data for the math exam are quite similar to those for the reading exam (see Thurlow et al 2006, p. 10ff).

In three states, students with IEPs have the possibility of taking an alternate assessment which is based on grade-level achievement standards. The share of students who participate in those alternate tests lies between 0.1 and about 10 per cent (Table 1).

                   Elementary   Middle   High school
Massachusetts          .29        .10        .30
North Carolina        1.19        .88        .42
Texas                10.37       4.81         –

Table 1: Per cent of students with IEPs in an alternate reading assessment based on grade-level achievement standards (Thurlow et al 2006, p. 20)


[Fig. 1: Reading Assessment Participation Rates in Elementary School: Per cent participation of IEP enrolment (includes regular and alternate assessment) (Thurlow et al 2006, p. 14)]

Table 1 also shows the discrepancy between the figures of the various states (compare, for example, Massachusetts and Texas), which indicates that the regulations and their execution in the states are interpreted rather differently and are not consistent.

About 65 per cent (high school: 61 per cent) of students with IEPs take part in regular assessment, but with accommodations being made. However, quite a number of states were unable to document the number of students who took this adjusted version of the regular test (cf. ibid., p. 8 and 21).

The performance of students with IEPs in reading assessment tends to increase steadily. About 30 per cent of the students with IEPs performed at a level considered to be “proficient”, which is slightly higher than in the years 2002/2003. At the elementary school level, IEP students in 32 regular states reached more than 30 per cent, at the middle school level in 15, and at the high school level in 17 regular states (ibid., p. 30). The rates of proficiency improved in 31 states at the elementary level, in 32 at the middle school level, and in 29 regular states (out of approximately 43 states which provided data) at the high school level (ibid., p. 32).

3.2 Ways of inclusion

Basically, there are three ways of including students with disabilities in assessment:

– letting them take part in the regular assessment,
– creating accommodated versions of the regular assessment,
– creating an alternative test.

For children with severe disabilities, it is most common to create alternative tests. In cases of rather moderate impairment (e.g. learning disabilities), the students can take part in the test with special accommodations being made.

3.2.1 Testing accommodations

According to Sireci, Scarpati and Li, an accommodation is an “. . . intentional change to the testing process designed to make the tests more accessible to SWD and consequently lead to more valid interpretations of their test scores” (Sireci/Scarpati/Li 2005, p. 460). In practice, there are many ways to obtain more valid interpretations of the test scores of SWD. The kind of accommodation that is indicated depends on the special needs of each individual student. The possibilities include the following:

– providing additional time,
– providing a separate location where the student can work undisturbed,
– taking more breaks,
– reading the test directions or items to students,
– providing the test in Braille or large type,
– allowing the students to dictate their answers,
– “out-of-level” or “out-of-grade” testing (the student receives a test form actually used for a lower grade),
– deleting some items from the test (Koretz/Barton 2003, p. 6).

It is also possible to combine items of the regular version, accommodated items and alternative items. Furthermore, there is the possibility of employing access assistants, who act as “intermediaries” between the children and their special needs. They can read the test items to the students, write down their responses or communicate with the students through sign language. Sometimes they also do translation work, turn pages, transcribe or paraphrase the students’ answers.


Clapper et al. give advice on the development of guidelines for these assistants (cf. Clapper et al. 2006).

In most cases it is necessary to make more than one accommodation, as one accommodation might require a further one (e.g. a deaf student who receives the test instructions in written form needs additional time, because reading the instructions takes longer than hearing them) (cf. Koretz/Barton 2003, p. 22).

Of course, there is a heated discussion about the validity of all these adjustments. In particular, accommodations which have an impact on the basic construct of the test have caused controversial debate. Even in the U.S. there is a lack of research concerning the effects of accommodations on the validity of test scores, as Koretz and Barton observed (cf. Koretz/Barton 2003, p. 3, also Thurlow et al. 1996a). In their opinion, the main problems concerning test accommodations are the inhomogeneity of the group of SWD, the accurate and appropriate use of accommodations, construct-relevant disabilities and the design of the tests (danger of item or test bias) (cf. Koretz/Barton 2003).

3.2.2 Alternate Tests

Alternate tests are usually based on the IEPs of the respective students and are an attempt to include students with disabilities who cannot participate in the general assessment system. Basically, one has to be aware that students with disabilities are not automatically assessed by alternate tests. Thurlow et al. point out that the majority of students with disabilities should participate in the regular assessment, be it with accommodations or without. An important criterion for deciding which kind of assessment a student should participate in is the goal the student has in mind. If the student aims at achieving the same goals as students with a regular curriculum, the student should take part in the general assessment, albeit with accommodations. The most important advice is that the decision should not be based on the expected performance of the student (cf. Thurlow/Olsen/Elliott/Ysseldyke/Erickson/Ahearn 1996b).

Concerning the integration of the results of alternate tests, there is quite a lack of research. On the one hand, there is the possibility of reporting the results separately from those of the general assessment, which would make it possible to analyze the special education services. On the other hand, the attempt is made to avoid separation between students with and without disabilities, which would mean that testing results should be aggregated and combined (cf. Thurlow et al. 1996b). Once more it becomes clear that much more research is needed on this issue.

3.3 Consequences for students with disabilities in student assessment

Ysseldyke, Dennison and Nelson (2004) investigated positive consequences of large-scale assessments for students with disabilities. The increased participation of SWD led to higher expectations of their performance (from parents, teachers and the students themselves), which had usually been rather low. The students interviewed for this study even pointed out that they had the impression that their teachers paid more attention to them and gave them more support. Furthermore, the participation of students with disabilities resulted in improved test instructions, teaching strategies and performances of the respective students. There are also better chances for the respective children to graduate or to obtain diplomas, and the risk of dropping out of school decreases. The cooperation between IEP teachers and regular teachers improves, and parents of students with disabilities seem to be more interested in the performance and development of their children (cf. Ysseldyke et al. 2004, p. 4ff). Ysseldyke et al. gained all these findings from extensive research of literature and media and from interviews with people involved in student assessment.

Ruth Nelson searched for positive as well as negative consequences and found that, due to the participation of students with disabilities, there is increased exposure to the curriculum, a consequence of intensified test preparation, extra tutoring, extra lessons, etc. However, the increased exposure also causes higher levels of stress, anxiety and frustration among students, as well as limited possibilities for choosing electives. Ysseldyke et al. (2004) as well as Nelson (2006) discovered that both participation and expectations have increased. However, Nelson could not find any empirical evidence for the assumed increase in referrals to special education or in the retention of students (cf. Nelson 2006).

3.4 Universally designed assessments or: “a more accessible assessment is always an option” (Johnstone et al. 2006, p. 23)

The latest attempts to include as many students as possible in state assessment are “universally designed assessments”. This is a project of NCEO, for which a guide was published in order to provide states and their responsible representatives with information and ideas about ways of including students with disabilities in assessment tests. Universal design demands accessibility for everyone, whether he or she is disabled, a non-native speaker of English, a migrant or whatever else. When an assessment test is designed universally, it respects the diversity of the population. It is characterised by concise and readable texts, clear formats and clear visuals, and it allows changes in the format as long as they do not change the meaning or the level of difficulty. It is stressed that those tests are intended neither to change the standard of performance of assessments nor to make them easier for special groups. The ambitious aim of universal design is to create the “most valid assessment possible for the greatest number of students, including students with disabilities” (Johnstone et al. 2006, p. 1). In this guide provided by Johnstone et al., 10 steps are proposed for the best way of achieving a universally designed assessment. I will not describe each of these steps in detail, but offer an overview of the main features of this approach.

The main idea of the approach is to consider the diversity of the students from the very beginning (and not to adjust the tests afterwards to the special needs of some groups of students). For this reason, every item has to be checked in the phase of conceptualization. Content which could give an unfair advantage or disadvantage to a certain group of students should be avoided (e.g. by using large font sizes and by avoiding unnecessary linguistic complexity when it is not what is being assessed). Every single item has to be checked as to whether it allows for the diversity of the pupils (gender, age, ethnicity, socioeconomic status, region, disability, language). In order to avoid ceiling or floor effects, it is important to develop a full range of test performance, and an adequately sized item pool is needed in case items have to be eliminated. The authors are aware that this is a truly challenging and time-consuming procedure, but, as they argue, considering accessibility from the beginning can save time and effort later (cf. ibid., p. 6).

Once all items are constructed, it is necessary to have them reviewed by expert teams in the participating states. Members of several special groups, such as language minorities, disability groups, scientists, teachers, etc. should be involved in this review process, and they should examine whether the test items give some advantage or disadvantage to a certain group of students. They are required to look at the response format and decide whether it is clear enough, to check whether the item really tests what it claims to test, and whether it could induce errors that are not related to the question (cf. ibid., p. 7). If there are any items that the experts find problematic, these items are analyzed with the “Think Aloud Method” in order to find out whether they can be incorporated in the test or not, or whether they should be adjusted. After the field test, the items are analyzed statistically, in particular those which were conspicuous (cf. ibid., p. 13). When the final revision is done, the test can be carried out.

It has often been pointed out that it is not possible to create a test which is accessible to every single student. However, as mentioned, the goal is to make it as widely accessible as possible. Moreover, as the authors proclaim, all students participating in universally designed assessments benefit from having more accessible tests (cf. ibid., p. 2).

3.5 Emerging issues

Koretz and Barton summarize the most important topics which have to be considered when it comes to the inclusion of students with disabilities. At the same time, these issues represent the most important research gaps that need to be closed:

– First of all, students with disabilities have to be identified and classified. Comparisons have shown that figures on e.g. children with learning disabilities range from 3 to 9.1 percent. Therefore, it can be assumed that the lines are drawn rather differently and that the term “learning disability” does not necessarily mean the same thing to everybody (cf. also McGrew et al. 1993). Thus, identification and classification seem to be crucial for making an equitable assessment system possible.
– Appropriate use of accommodations: Accommodations, or the “corrective lenses” (Koretz/Barton 2003, p. 7), are not only an important way to increase the inclusion of students with disabilities in assessment tests, but they also tend to influence and bias the validity of the tests. Research concerning the validity of accommodations is still tremendously needed.
– The problem of construct-relevant disabilities: An assessment test can be offered in Braille to blind students; this kind of accommodation does not influence the construct of the test. But when it comes to e.g. dyslexia, it becomes quite difficult. In this case, the student is not able to understand the tasks, because most assessment tests are language-based. However, this does not mean that the student is unable to solve the task just because he cannot read it.
– Concerning test design, it is important to keep an eye on bias. Several assessment formats (like multiple choice, open response, etc.) have different effects and consequences for different students, especially for those with disabilities.

(cf. Koretz/Barton 2003, p. 3ff)

Although the US plays a leading role in including students with disabilities, there is still a huge lack of research concerning validity and alternate ways of test participation (cf. also Quenemoen et al. 2001, Thurlow et al. 1996a). As Koretz and Barton point out, the sheer inhomogeneity of the group of students with disabilities makes it tremendously difficult to create guidelines and prescriptions. Moreover, construct-relevant disabilities pose tough challenges for research (cf. Hales 2004, who shows that, and why, common tests are not able to measure the skills and proficiency of students with dyslexia in an adequate way).

According to Koretz and Barton, the most important steps are an increased collection of data on the assessment participation of students with disabilities and further research on possible item bias, test bias, and, of course, validity. To make comparisons possible, it is necessary to standardize the various definitions of disabilities and the conditions of participation (cf. Koretz/Barton 2003, p. 23ff).

Finally, Ruth Nelson emphasizes the need to identify, limit and empirically document unintended negative consequences for SWD in assessment tests (e.g., as mentioned above, increased anxiety, exposure to the curriculum, etc.). Trying to avoid these unintended consequences can be “life-changing” for the respective students, because it gives them a fair chance to show their actual abilities (Nelson 2006, p. 34f).

4 Situation in Austria and German-speaking countries

In 1995, Elliott, Shin, Thurlow and Ysseldyke searched national education encyclopedias and yearbooks of 14 states worldwide to discover whether they reported facts and figures on the inclusion of students with disabilities in assessments. These states were: Argentina, Australia, Canada, Chile, China, England and Wales, France, Japan, Korea, the Netherlands, Nigeria, Sweden, Tunisia and the U.S. What they found was that, out of these 14 states, just a few documented the inclusion of students with disabilities in assessment tests. Only the U.S. reports exact facts and figures, while some other states present a short description of their directions regarding the participation of students with disabilities (Canada, France and Korea mention that they allow accommodated tests for students with disabilities). Elliott et al. interpret their research findings as follows: there may be three possible reasons why states do not give any data about the inclusion of SWD. First, it could be that they exclude SWD arbitrarily; second, it is possible that data from disabled children are collected, but not counted and not published; third, data could be collected and counted, but not published (cf. Elliott et al. 1995).

As in most states included in the study, no attention is paid in Austria to the topic of the inclusion of SWD in assessment. Since the 1990s, and since 2000 in particular, Austria has been taking part in several assessments (CIVED, FIMS/TIMSS, PISA, PIRLS), and it now has a national screening for testing reading abilities (Salzburger Lese-Screening). Although the official test reports of PISA give some advice as to how to handle SWD, there is no real interest in how this advice is carried out in practice. As shown in Hörmann 2007, people who are involved in assessment testing processes do not consider the problem to be relevant and urgent. Eight interview partners (teachers, school directors, scientists, and an employee of the ministry of education responsible for international assessment) were asked by means of short qualitative interviews what they think about the problem, whether they have already had any experience with it and how they reacted. The main interest of this research project was to investigate the extent to which these people have already been confronted with the problem, what they know and think about it and how they deal with it (cf. Hörmann 2007, p. 58ff).

The interviews reveal that there are two ways of perceiving the problem: the administrative-organisational perspective and the perspective of the children affected by the problem. The majority of the interview partners take the administrative-organisational perspective and do not regard the problem as an important one. Some of the interview partners who work closely with children with disabilities or disadvantaged children take the position of these students and demand solutions to the problem. It is a fact for all interviewees that students with e.g. learning disabilities need more support at school, but for most of them it is not obvious that these children would also need this support when taking an assessment test (cf. Hörmann 2007, p. 85f).

For my diploma thesis, I conducted a thorough literature search in order to find information on the way Austria copes with this problem. I asked scientists and institutions, but nobody could help me. Likewise, for all German-speaking countries, I could barely find any relevant literature. As mentioned above, Wuttke published exclusion rates, and Elisabeth von Stechow dedicated a small chapter of her article to the exclusion rates of students with disabilities (cf. Stechow 2006, p. 22). The volume “Sonderpädagogik und PISA” (2006) seems to be the first German publication that responds, at least in part, to the problem of students with disabilities in assessments. Oser and Biedermann proclaim the necessity of a specific assessment for special education (their slogan: “PISA for the rest”) (cf. Oser/Biedermann 2006).

Nevertheless, it is obvious that there are children who do not fit into the “norm”, who have special needs and who cannot cope with a conventional testing situation. This especially concerns students with learning disabilities, who never get the chance to show their real abilities, because they generally fail at reading the test items. Meyerhöfer speaks of “Testfähigkeit”, meaning that every student has to develop a certain kind of ability that enables her or him to cope with the testing situation, organize the provided time, read both quickly and carefully and use clever strategies to find the right solution (Meyerhöfer 2005, p. 187 and in this book). Assessment tests as they are constructed at present are definitely not able to test students with disabilities in an adequate way. In addition, when nobody is interested in what happens to these students, they become invisible and disappear from public view.

5 Conclusion

This article set out to raise awareness of the problems students with disabilities face in relation to student assessment. Research has revealed that there is a lack of literature and discourse about this problem, which shows how unknown and unimportant it is to people who actually work with assessment tests in their profession (cf. Hörmann 2007).

PISA seems to be no exception in this respect. By assuming that there exists an average student endowed with average skills, not only within one but even across countries, it neglects by construction children deviating from that “norm”. As a result, these children are either excluded from the assessment or (if they get the chance to take part at all) are doomed to fail.

Confronted with this problem, the people behind PISA play down its importance and its impact on the respective children and show no ambition to change the situation. It probably lies in the self-interest of the people involved in the construction and conduct of this study to marginalise the problem, since this critique does not just refer to certain aspects of PISA but questions primary assumptions and thereby shakes it to its foundations.

Even though many people think that the problem of exclusion is irrelevant from a statistical point of view, I am not of the opinion that this is an argument against inclusion. Assessment tests are currently not able to test SWD in an adequate way, but this does not mean that there is no way to change the assessment tests in order to make them able to assess SWD (see 3.2 and 3.4). Results of assessments are often used for political decisions. If children with disabilities are excluded from assessment tests, they are also excluded from those political decisions, in spite of the fact that these decisions concern them just as much as all the other students. This means that SWD disappear once again. The state and society are responsible for the education of every single child, including children with disabilities, and every single child has the right to be part of society. In terms of the concept of inclusive education, it is a duty and a responsibility to accommodate the tests to the children, and not the children to the tests. Participating in assessments in an adequate and successful way can support the self-confidence and the performance of the respective students in a very positive manner.

Research and experience in the U.S. show interesting ways of including students with disabilities in assessment tests. Testing accommodations, alternate tests and “universally designed assessments” are new options for accounting for the diversity of people, although research is still in its early stages.

If PISA wants to move with the times, I suggest a revision both of the paradigm and of the construction in order to obtain results that really represent the variety of students. In order to reach this goal, it is important to raise awareness of this problem and to start a discussion in public and among scientists. Only then will it be possible to think about ways of solving this problem.

Literature

Biewer, Gottfried (2005): “Inclusive Education”. Effektivitätssteigerung von Bildungsinstitutionen oder Verlust heilpädagogischer Standards? - In: Zeitschrift für Heilpädagogik, Jahrgang 56 (2005), Heft 3, S. 101-108.

Elliott, J.L.; Shin, H.; Thurlow, M.L.; Ysseldyke, J.E. (1995): A perspective on education and assessment in other nations: Where are students with disabilities? (Synthesis Report No. 19). - Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Available at: http://education.umn.edu/NCEO/OnlinePubs/Synthesis19.html (15.6.2006)

Haider, Günter (2003): OECD/PISA – Programme for International Student Assessment (Kapitel 1.2 des nationalen Berichts von PISA 2003). Available at: http://www.pisa-austria.at/PISA2003_Kapitel_1_2_nationalerBericht.pdf (12.6.2007)

Hales, Gerald (2004): Putting in nails with a spanner: the potential effect of using false data from language-rich tests to assess dyslexic people. Available at: http://www.bdainternationalconference.org/2004/presentations/sat_s3_d_7.shtml (27.4.2006)

Hopmann, Stefan Thomas (2006): Im Durchschnitt PISA oder Alles bleibt schlechter. - In: Criblez, Lucien; Gautschi, Peter u.a. (Hrsg.): Lehrpläne und Bildungsstandards. Was Schülerinnen und Schüler lernen sollen. Festschrift zum 65. Geburtstag von Prof. Dr. Rudolf Künzli. - Bern: hep-Verlag, S. 149-172.

Hörmann, Bernadette (2007): Die Unsichtbaren in PISA, TIMSS & Co. Kinder mit Lernbehinderungen in nationalen und internationalen Schulleistungsstudien. - Wien (Diploma Thesis)

Jahnke, Thomas; Meyerhöfer, Wolfram (Hrsg.) (2006): PISA&Co. Kritik eines Programms. - Hildesheim, Berlin: Franzbecker

Johnstone, Christopher; Altman, Jason; Thurlow, Martha (2006): A state guide to the development of universally designed assessments. - Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.

Koretz, Daniel; Barton, Karen (2003): Assessing Students with Disabilities: Issues and Evidence (CSE Technical Report 587). - Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing. Available at: http://www.cse.ucla.edu/products/Reports/TR587.pdf (20.6.2007)

McGrew, K.; Algozzine, B.; Spiegel, A.; Thurlow, M.; Ysseldyke, J. (1993): The identification of people with disabilities in national databases: A failure to communicate (Technical Report No. 6). - Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Available at: http://education.umn.edu/NCEO/OnlinePubs/Technical6.html (22.1.2006)

Meyerhöfer, Wolfram (2005): Tests im Test. Das Beispiel PISA. - Opladen: Budrich.

Nelson, J. Ruth (2006): High stakes graduation exams: The intended and unintended consequences of Minnesota’s Basic Standards Tests for students with disabilities (Synthesis Report 62). - Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Available at: http://www.education.umn.edu/NCEO/OnlinePubs/Synthesis62/default.html

OECD (2005): PISA 2003 Technical Report. - Paris: OECD.

Oser, Fritz; Biedermann, Horst (2006): PISA für den Rest: Lehr- und Lernbehinderung und ihre schulische Anstrengungslogik. - In: Vierteljahresschrift für Heilpädagogik und ihre Nachbargebiete, Jahrgang 75 (2006), Heft 1, S. 4-8.

Posch, Peter; Altrichter, Herbert (1997): Möglichkeiten und Grenzen der Qualitätsevaluation und Qualitätsentwicklung im Schulwesen. Forschungsbericht des Bundesministeriums für Unterricht und kulturelle Angelegenheiten. - Innsbruck, Wien: Studien-Verlag (Bildungsforschung des Bundesministeriums für Unterricht und kulturelle Angelegenheiten; 12)

Quenemoen, R.F.; Lehr, C.A.; Thurlow, M.L.; Massanari, C.B. (2001): Students with disabilities in standards-based assessment and accountability systems: Emerging issues, strategies, and recommendations (Synthesis Report 37). - Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Available at: http://education.umn.edu/NCEO/OnlinePubs/Synthesis37.html (22.1.2006)

Sireci, Stephen G.; Scarpati, Stanley E.; Li, Shuhong (2005): Test Accommodations for Students With Disabilities: An Analysis of the Interaction Hypothesis. - In: Review of Educational Research, Vol. 75 (2005), No. 4, p. 457-490.

Stechow, Elisabeth von (2006): Soziokulturelle Benachteiligung und Bewältigung von Heterogenität – Eine sonderpädagogische Antwort auf eine Empfehlung der KMK. - In: Stechow, Elisabeth von; Hofmann, Christiane (Hrsg.): Sonderpädagogik und PISA. Kritisch-konstruktive Beiträge. - Bad Heilbrunn: Klinkhardt.

Thurlow, M.L.; Elliott, J.L.; Ysseldyke, J.E.; Erickson, R.N. (1996a): Questions and answers: Tough questions about accountability systems and students with disabilities (Synthesis Report No. 24). - Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Available at: http://education.umn.edu/NCEO/OnlinePubs/Synthesis24.html (22.1.2006)

Thurlow, M.; Olsen, K.; Elliott, J.; Ysseldyke, J.; Erickson, R.; Ahearn, E. (1996b): Alternate assessments for students with disabilities unable to participate in general large-scale assessments (Policy Directions No. 5). - Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Available at: http://education.umn.edu/NCEO/OnlinePubs/Policy5.html (22.1.2006)

Thurlow, Martha; Moen, Ross; Altman, Jason (2006): Annual Performance Report: 2003-2004. State Assessment Data. - National Center on Educational Outcomes. Available at: http://www.education.umn.edu/nceo/OnlinePubs/APR2003-04.pdf (13.5.2007)

UNESCO (1994): The Salamanca Statement and Framework for Action on Special Needs Education. World Conference on Special Needs Education: Access and Quality. - Salamanca, Spain: 7-10 June 1994. Available at: http://unesdoc.unesco.org/images/0009/000984/098427eo.pdf (18.6.2007)

Van Ackeren, Isabell (2005): Vom Daten- zum Informationsreichtum? Erfahrungen mit standardisierten Vergleichstests in ausgewählten Nachbarländern. - In: Pädagogik, Jahrgang 57 (2005), Heft 5, S. 24-28

Wuttke, Joachim (2006): Fehler, Verzerrungen, Unsicherheiten in der PISA-Auswertung. - In: Jahnke, Thomas; Meyerhöfer, Wolfram (Hrsg.): PISA&Co. Kritik eines Programms. - Hildesheim, Berlin: Franzbecker, S. 101-154

Ysseldyke, J.; Dennison, A.; Nelson, R. (2004): Large-scale assessment and accountability systems: Positive consequences for students with disabilities (Synthesis Report 51). - Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Available at: http://education.umn.edu/NCEO/OnlinePubs/Synthesis51.html (15.6.2006)


Identification of Group Differences Using PISA Scales – Considering Effects of Inhomogeneous Items

Peter Allerup
Denmark: University of Aarhus

Abstract:

PISA data have been available for analysis since the first PISA database was released from the PISA 2000 study. The two following PISA studies in 2003 and 2006 formed the basis of dynamic analyses besides the traditional cross-sectional type of analysis, where PISA performances in mathematics, science and reading are analysed in relation to student background variables. The aim of many analyses, carried out separately on the PISA 2000 and PISA 2003 data, has been to look for significant differences in PISA performance between groups of students.

Few studies, however, have been directed towards the psychometric question as to whether the PISA scales are correctly measuring the reported differences. For example, could it be that reported sex differences in mathematics are partly due to the fact that the PISA mathematics scales are not measuring the girls and the boys in a uniform or homogeneous way? In other words, using the terms of modern IRT (Item Response Theory) analyses, it is questioned whether the relative difficulty of the items is the same for girls and boys. The fact that item difficulties are not the same for girls and boys, a condition which is called item inhomogeneity, can be demonstrated to have an impact on the conclusions of comparisons between student groups, e.g. girls versus boys.

The present analyses address the problem of possible item inhomogeneity in the PISA scales from 2000 and 2003, asking specifically whether the PISA scale items are homogeneous across sex, ethnicity and the two points in time (2000 and 2003). This will be illustrated using items from all three PISA subjects: reading, mathematics and science. The main efforts will, however, be concentrated on the subject of reading. The consequences of detected item inhomogeneities for the calculation of student PISA performances (measures of ability) are demonstrated, on the individual student level as well as on a general, average student level.

Inhomogeneous items and some consequences

In order to give a precise definition of item inhomogeneity, it is useful to refer to the general framework in which items, students and responses belong and in which their mutual interactions can be made operational. Figure 1 displays the fundamental concepts behind many IRT (Item Response Theory) approaches to data analysis, the Rasch analysis in particular. The response $a_{vi}$ from student No. $v$ to item No. $i$ takes the values $a_{vi} = 0$ for an incorrect and $a_{vi} = 1$ for a correct response.

The parameters $\sigma_1, \ldots, \sigma_k$ are latent measures of item difficulty, and $\theta_1, \ldots, \theta_n$ are the students’ parameters carrying the information about student ability. These are the PISA student scores which are reported and compared internationally (or estimates thereof).

The definition of item homogeneity is now given by a manifestation of the fact that the responses $((a_{vi}))$ are determined by a fixed set of item parameters given by the framework, valid for all students, and therefore for every subgrouping of the students. Actually, the probability of obtaining a correct response $a_{vi} = 1$ for student No. $v$ to item No. $i$ is given by a special IRT model, the so-called Rasch Model (Rasch, 1960), which calculates the chances of solving the tasks behind the items by referring to the same set of item parameters regardless of which student is considered.

[Figure 1 not reproduced: an $n \times k$ matrix of responses $a_{vi}$, one row per student and one column per item, with the item difficulties $\sigma_1, \ldots, \sigma_k$ heading the columns, the student abilities $\theta_v$ labelling the rows and the student scores $r_v$ as row margins.]

Figure 1: The framework for analyzing item inhomogeneity in IRT models. Individual responses $((a_{vi}))$, latent measures of item difficulty $\sigma_i$, $i = 1, \ldots, k$, student abilities $\theta_v$, $v = 1, \ldots, n$, and student scores $(r_v)$ recording the total number of correct responses across $k$ items.



The Rasch Model is the theoretical, psychometric reference for the validation of the PISA scales, and it has been the reference for scale verification and calibration in the IEA international comparative investigations, e.g. the reading literacy study RL (Elley, 1993), TIMSS (Beaton et al., 1998), CIVIC (Torney-Purta et al., 2000) and the NAEP assessments after 1984 in the USA.

Using this model it can e.g. be shown that a correct response $a_{vi} = 1$ to an item with item difficulty $\sigma_i = 1.20$, given by a student with $\theta_v = -0.5$, takes place with probability $P(a = 1) = 0.62$, i.e. with a 62 % chance.

$$P(a_{vi} = 1) = \frac{\exp(\sigma_i + \theta_v)}{1 + \exp(\sigma_i + \theta_v)}$$
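To make the parametrization concrete, here is a minimal numerical sketch in Python (an illustration, not code from the study; it simply evaluates the logistic expression above, in the form reconstructed from the surrounding worked examples, and reproduces the item R055Q01 figures discussed below in connection with Table 1):

    import math

    def rasch_p(sigma: float, theta: float) -> float:
        # P(a = 1) = exp(sigma + theta) / (1 + exp(sigma + theta)),
        # the form reconstructed from the surrounding examples
        z = sigma + theta
        return math.exp(z) / (1.0 + math.exp(z))

    # Item R055Q01 (cf. Table 1): difficulty 1.27 in 2000, 1.23 in 2003.
    # For an average student (theta = 0.00) the chance of a correct
    # response moves from about 0.78 to 0.77, as reported below.
    print(round(rasch_p(1.27, 0.0), 2), round(rasch_p(1.23, 0.0), 2))  # 0.78 0.77
    # For an above-average student (theta = 2.00): 0.963 vs. 0.962.
    print(round(rasch_p(1.27, 2.0), 3), round(rasch_p(1.23, 2.0), 3))  # 0.963 0.962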

A major reason for the wide applicability of the Rasch Model lies in the existence of the following three equivalent characterizations of item homogeneity, proved by Rasch (see e.g. Rasch, 1971; Allerup, 1994; Fischer and Molenaar, 1995) and given here in abbreviated form:

1. The student scores (and, by the parallel property, the item scores) are sufficient statistics for the latent student abilities $\theta_1, \ldots, \theta_n$, viz. all information concerning $\theta_v$ is contained in the student score $r_v$.
2. The student abilities $\theta_1, \ldots, \theta_n$ can be calculated with the same result irrespective of which subset of items is used.
3. Data collected in the framework of figure 1 fit the Rasch Model, i.e. the model forms an adequate description of the variability of the observations $((a_{vi}))$ in figure 1.

While Rasch often referred to these properties as the analytic means for ‘specific objective comparisons’, others have adopted the notion ‘homogeneous’ for the status of items when the conditions are met. The practical power behind this, seen from the point of view of the theory of science, is that ‘objective comparison’ is in casu a requirement which can be investigated empirically by means of simple statistical techniques, i.e. statistical tests of fit of the Rasch Model (cf. property 3). It is hence not a purely theoretical concept, but rather one which requires empirical action beyond the ‘theoretical’ thought invested, from the subject matter’s point of view, into the construction of items.

From this characterization of item homogeneity it follows that ‘inhomogeneity’, or ‘inhomogeneous items’, appears when items are not homogeneous, for example when different subsets of items give rise to different measures of student abilities. This is e.g. one of the risks which might appear in PISA with the rotation of booklets, where students who are responding to different item blocks must still be compared on the same PISA scale (cf. property 2). The present analyses will focus directly on possible violations of ‘item homogeneity’ by looking, through the fit of the Rasch Model, for indications of different sets of estimated item parameters assigned to different student groups. In other words, it will be tested whether e.g. boys and girls are measured by the same set of item parameters. Two other criteria defining groups of students will be applied, these being 1) the year of testing, 2000 vs. 2003, and 2) ethnicity, Danish vs. non-Danish linguistic background. Especially in the subject of reading, the distinction by ethnicity is of interest, because different language competencies are expected to influence the understanding of, and through this the ability to reply correctly to, the reading tasks.

The consequences of item inhomogeneity are diversified and can bring about serious implications, depending on the analytic view. In a PISA context, however, one specific kind of consequence attracts attention: how are comparisons carried out by means of student PISA scores affected by inhomogeneity? If boys and girls are in fact measured by two different scales, i.e. two sets of item parameters, will this influence conclusions drawn under the use of one common, ‘average’ set of items? Will an interval of PISA points estimated to separate the average $\theta$-level of Danish students from that of non-Danish students be greater or smaller if knowledge about item inhomogeneity is introduced into the $\theta$-calculations?

Such consequences can be exposed on the $\theta$-scale either at the individual student level, using one item and one individual, or at the general level, using all students and all items.

The individual level is established in a simple way by calculating the individual change on the $\theta$-scale which is mathematically needed to compensate for a given difference in the $\sigma$-parameter under the assumption that a fixed probability of answering correctly is maintained. Suppose, for instance, that data from the boys are fitted to the Rasch Model with estimated item difficulty $\sigma_1 = 0.40$, and the same item gets an estimated difficulty $\sigma_2 = 0.75$ for the girls, a difference which can be tested to be significant (Allerup, 1995 and 1997). Then a simple calculation under the Rasch Model shows that in order for a boy and a girl to obtain equal probabilities of answering this item correctly, the boy’s $\theta$-value must be adjusted by 0.75 - 0.40 = 0.35. The item is easier for the boy than for the girl, even considering a boy and a girl with the same $\theta$-value¹ who hence should be equally capable of answering the item correctly. In order to compensate for this scale-specific advantage, a boy ‘should start lower’ by subtracting 0.35 from his $\theta$-value.² In a way this resembles the rules in golf, where the concept of ‘handicap’ plays a similar role in making comparisons between players more fair.
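The offsetting step can be checked numerically. The following sketch (under the same illustrative parametrization as above, with the $\sigma$-values from the example) verifies that shifting $\theta$ by the difference $\sigma_2 - \sigma_1 = 0.35$ exactly offsets the difference in item parameters at any ability level:

    import math

    def rasch_p(sigma, theta):
        z = sigma + theta
        return math.exp(z) / (1.0 + math.exp(z))

    sigma_boys, sigma_girls = 0.40, 0.75  # values from the example above
    shift = sigma_girls - sigma_boys       # 0.35, the 'handicap'
    for theta in (-1.0, 0.0, 1.5):
        # identical response probabilities after the compensation
        assert abs(rasch_p(sigma_girls, theta) - rasch_p(sigma_boys, theta + shift)) < 1e-12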

When moving from the individual level to the comprehensive level, including all items, two simple methods are available. The first one is based on theoretical calculations, where expected scores are compared for fixed $\theta$-values using the two sets of inhomogeneous item parameters.³ The second approach is based on summing up the individual changes for all students as an average; it suffices to summarize all individual $\theta$-changes within each group in question when using the set of item parameters specific to each group. A third strategy consists of first removing inhomogeneous items from the scale and then carrying out the statistical analyses by means of the remaining homogeneous items only, e.g. the estimation of the student PISA scores. Following this procedure, a ‘true’ difference between the groups will then be obtained. In a way this last procedure follows the traditional path of Rasch scale analysis, where the successive steps from field trials to the main study are paved by item analyses, correcting and eliminating inhomogeneous items step by step. As stated, the present analyses will focus on student groups defined by gender, year of investigation and ethnicity.
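The first, theoretical method can be sketched numerically. The fragment below (an illustration under the same assumed parametrization as above, not code from the study) computes the expected score $E(r \mid \theta)$ for two sets of item parameters and searches for the $\theta$-shift needed to equate the expected scores; the item values are the first five common reading items of Table 1 below:

    import math

    def rasch_p(sigma, theta):
        z = sigma + theta
        return math.exp(z) / (1.0 + math.exp(z))

    def expected_score(sigmas, theta):
        # expected number of correct responses at ability theta
        return sum(rasch_p(s, theta) for s in sigmas)

    def theta_compensation(sigmas_a, sigmas_b, theta, step=1e-4):
        # crude grid search: how far must theta move under item set B
        # to reproduce the expected score obtained under item set A?
        target = expected_score(sigmas_a, theta)
        t = theta - 5.0
        while expected_score(sigmas_b, t) < target:
            t += step
        return t - theta

    sigmas_2000 = [1.27, -0.66, -0.08, 0.44, 0.58]  # items 1-5, year 2000
    sigmas_2003 = [1.23, -0.79, -0.21, 0.55, 1.97]  # items 1-5, year 2003
    print(theta_compensation(sigmas_2000, sigmas_2003, theta=0.0))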

Data used

The data for these analyses were collected in different studies with no overlap. The standard PISA 2000 and 2003 data are representative samples, while the PISA Copenhagen data comprise all public schools in the community of Copenhagen; PISA E is a sample specifically addressing the participation of ethnic students and was therefore created from prior knowledge as to where this group of students attends school.

¹ The same $\theta$-value means that they are considered to be identical in the framework.
² The analytic picture is slightly more complicated, because there are constraints (a normalization) on the $\sigma$-values.
³ The expected score $E(\sum_i a_{vi} \mid \sigma_1, \ldots, \sigma_k)$ is analyzed as a function of $\theta$; solving $r_v = E(\sum_i a_{vi} \mid \sigma_1, \ldots, \sigma_k)$ with the conditional ML estimates of $\sigma_1, \ldots, \sigma_k$ inserted provides the estimate of $\theta_v$.



1. PISA 2000, N = 4209: 50 % girls, 50 % boys, 6 % ethnics
2. PISA 2003, N = 4218: 51 % girls, 49 % boys, 7 % ethnics
3. PISA E, N = 3652: 48 % girls, 52 % boys, 25 % ethnics
4. PISA Copenhagen, N = 2202: 50 % girls, 50 % boys, 24 % ethnics

In the three studies PISA 2000, PISA E and PISA Copenhagen, the same set of PISA instruments was used, i.e. the same set of items organized in nine booklets was rotated among the students. In PISA 2003 some of the items from the PISA 2000 study were reused, because common items must be available for bridging between 2000 and 2003. According to the PISA cycles, every study has a special theme: in 2000 it was reading, and in 2003 it was mathematics. In these years the respective subjects were especially heavily represented with many items. Because of this, the present analyses dealing with the 2003 data are undertaken mainly by means of the items which the two PISA studies 2000 and 2003 have in common.

Scaling PISA 2000 versus PISA 2003 in reading

One of the reasons for the interest in the PISA scaling procedures was the fact that the international PISA report from PISA 2003 comments upon the general change in the level of reading competencies between 2000 and 2003 in the following manner:

“However, mainly because of the inclusion of new countries in 2003, the overall OECD mean for reading literacy is now 494 score points and the standard deviation is 100 score points.” (PISA 2003, OECD)

It seems very unlikely that all students in the world, taught in more than 50 different school systems, should experience a common weakening of their reading capacities across three years, amounting to 6 PISA points (from 500 to 494); a further explanation given in the Danish National Report does not provide a more convincing account of this significant drop of 6 PISA points:

“The general reading score for the OECD countries dropped from 500 to 494 points. This is influenced by the fact that two countries joined PISA between 2000 and 2003, contributing to the lower end, while the Netherlands lifts the average a bit. But, considering all countries, it looks like the reading score has dropped a bit.” (PISA 2003, ed. Mejding)


Could it be that the 6-point drop was the result of item inhomogeneities across 2000 and 2003? If this question, either in full or in part, must be answered with a yes, one can still hope to conduct appropriate comparisons between student responses from 2000 and 2003. In fact, assuming that no other scale problem exists within either of the years 2000 and 2003, one can consider the two scales completely separately and apply statistical test equating techniques. The PISA 2000 reading scale has been compared to the IEA 1992 Reading Literacy scale using this technique, showing that these two scales – in spite of inhomogeneous items – are psychometrically parallel (Allerup, 2002).

PISA 2000 and PISA 2003 share 22 reading items, which are necessary for the analysis of homogeneity by means of the Rasch Model. The items are found in booklet No. 10 in PISA 2003 and booklet No. 4 in PISA 2000. Table 1 displays the (log) item difficulties, estimated under the simple one-dimensional Rasch Model.⁴

⁴ Conditional maximum likelihood estimates from $p(((a_{vi})) \mid (r_v))$, conditional on the student scores $(r_v)$, cf. figure 1.

Item difficulties on the PISA $\theta$-scale:

 item             $\sigma_i$(2000)  $\sigma_i$(2003)  difference     percent correct
                                                      (PISA points)   2000    2003
  1  R055Q01_          1.27           1.23              -3.6
  2  R055Q02_         -0.66          -0.79             -11.7
  3  R055Q03_         -0.08          -0.21             -11.7
  4  R055Q05_          0.44           0.55               9.9
  5  R067Q01_          0.58           1.97             125.1          0.64    0.88
  6  R067Q04_         -0.29           0.88             105.3          0.43    0.71
  7  R067Q05_         -0.47           1.15             145.8          0.38    0.76
  8  R102Q05_         -0.86          -1.18             -28.8
  9  R102Q07_          1.73           1.41             -28.8
 10  R102Q04A_        -1.34          -2.01             -60.3
 11  R104Q01_          0.41           0.10             -27.9
 12  R104Q02_         -0.31          -0.63             -28.8
 13  R104Q05_         -0.40          -0.72             -28.8
 14  R111Q01_         -0.99          -1.08              -8.1
 15  R111Q02B_         0.04          -0.05              -8.1
 16  R111Q06B_         1.51           1.66              13.5
 17  R219Q02_          0.28           0.44              14.4
 18  R219Q01E_         0.08           0.20              10.8
 19  R220Q01_         -0.32          -0.82             -45.0          0.42    0.31
 20  R220Q04_         -0.05          -0.60             -49.5          0.49    0.35
 21  R220Q05_          0.83           0.33             -45.0          0.70    0.58
 22  R220Q06_         -1.40          -1.82             -37.8          0.20    0.14

Table 1: Rasch Model estimates of item difficulties $\sigma_i$ for the two years of testing, 2000 and 2003, and $\theta$-scale adjustments for unequal item difficulties.

Several test statistics can be applied for testing the hypothesis that the item difficulties are equal across the years 2000 and 2003, both multivariate conditional tests (Andersen, 1973) and exact tests item by item (Allerup, 1997). The results all clearly reject the hypothesis, and, consequently, the items are inhomogeneous across the years of testing 2000 and 2003.

A visual impression of how the two PISA scales are composed by item difficulties, as marks on two ‘rulers’, is displayed in figure 2. Items connected by vertical lines tend to be homogeneous, while oblique connecting lines indicate inhomogeneous items.

The last column in table 1 lists the consequences, at the individual student level, of the estimated item inhomogeneity, transformed to quantities measured on the ordinary PISA student scale, i.e. the $\theta$-scale internationally calibrated to mean value = 500 with standard deviation = 100. As an example, the item R055Q01 changed its estimated difficulty from 1.27 in 2000 to 1.23 in 2003, a small decrease in relative difficulty of -3.6 PISA points. For an average student, i.e. with PISA ability $\theta_v = 0.00$, this means that the chance of responding correctly to this item has changed from 0.78 to 0.77, a small 1 % drop; this can be calculated from the Rasch Model. For an above-average student with $\theta_v = 2.00$ the change will be from 0.963 to 0.962, a very minor change of magnitude 1 per mille. Table 1 shows how the consequences amount to considerable PISA points for some items, especially the items of units R067 and R220. These items are the ones which distinguish themselves in figure 2 by non-vertical lines. The marginal percent correct, which is based on all booklets and students, is included in table 1 in order to give a well-known interpretation of the change from 2000 to 2003. It is tacitly assumed that the PISA items are accepted under tests of reliability.
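Incidentally, the printed $\theta$-scale differences can be reproduced, up to rounding, by converting the logit difference $\sigma_i(2003) - \sigma_i(2000)$ to PISA points with a fixed conversion factor of about 90 points per logit (an observation inferred here from the printed values, not a figure stated in the text):

    # hypothetical reconstruction of the last column of Table 1;
    # the factor 90 is inferred from the printed values, not quoted from the study
    LOGIT_TO_PISA = 90.0

    items = {  # item: (sigma_2000, sigma_2003)
        "R055Q01": (1.27, 1.23),
        "R067Q01": (0.58, 1.97),
        "R220Q04": (-0.05, -0.60),
    }
    for name, (s00, s03) in items.items():
        print(name, round((s03 - s00) * LOGIT_TO_PISA, 1))
    # prints: R055Q01 -3.6, R067Q01 125.1, R220Q04 -49.5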

Figure 2: Estimated item difficulties $\sigma_i$(2000) and $\sigma_i$(2003) for PISA 2000 (lower line) and PISA 2003 (upper line). Estimates based on booklet 4 (PISA 2000) and booklet 10 (PISA 2003) using data from all countries. [Figure not reproduced.]

The last column in table 1 indicates the advantage (difference > 0) or disadvantage (difference < 0) for the students tested in 2003, the interpretation being that 2003 students are given ‘free’ PISA points as a result of the fact that the (relative) item difficulty dropped between the years 2000 and 2003, and that this ‘advantage’ can be quantified in terms of ‘compensations’ on the $\theta$-scale, shown in the last column, which displays how much a student must change the PISA score in order to compensate for the change of difficulty of the item. This way of thinking is much like the thinking behind the construction of the so-called item maps, which visualize both the distribution of item difficulties and the student abilities, anchored in predefined probabilities of a correct response.

Table 1 pictures item inhomogeneities, item by item, in reading; some items turn out to be (relatively) more difficult between 2000 and 2003, while others become easier between the two years. A comprehensive picture involving all single-item ‘movements’ and all students is more complicated to establish⁵. The technique used in this case is to study the gap between expected score levels caused by the two item sets of (inhomogeneous) difficulties. By this, it can be shown that the general effect is approximately 11 PISA points. In other words, the average PISA 2003 student experiences a ‘loss’ of approximately 11 PISA points, purely due to psychometric scale inhomogeneities. The official drop between 2000 and 2003 was for Denmark 497 → 492, i.e. a drop of 5 points. In the light of scale-induced changes of magnitude minus 11 points, could this be switching a disappointing conclusion to the contrary?

5 Analyze the expected score E(aᵥᵢ | θ, σ₁, ..., σₖ) as a function of θ, with conditional ML estimates of σ₁, ..., σₖ inserted.
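A minimal sketch of this gap technique, under the same assumptions as above (logistic Rasch form, roughly 90 PISA points per logit): compute the expected raw score as a function of θ under each year’s item parameters and find the θ-shift that equalizes them. The parameter vectors below are items 8-13 from table 1 only; the published 11-point figure rests on the full item set and the official scaling, so this sketch illustrates the method rather than reproducing the number.

    import numpy as np
    from scipy.optimize import brentq

    def expected_score(theta, sigmas):
        # Expected raw score: sum of Rasch probabilities over the item set.
        s = np.asarray(sigmas, dtype=float)
        return float(np.sum(np.exp(theta + s) / (1.0 + np.exp(theta + s))))

    def theta_gap(theta, sig_2000, sig_2003):
        # Find the theta that gives, under the 2003 parameters, the same expected
        # score that `theta` gives under the 2000 parameters; return the shift.
        target = expected_score(theta, sig_2000)
        t_2003 = brentq(lambda t: expected_score(t, sig_2003) - target, -8.0, 8.0)
        return t_2003 - theta

    sig_2000 = [-0.86, 1.73, -1.34, 0.41, -0.31, -0.40]  # table 1, items 8-13, year 2000
    sig_2003 = [-1.18, 1.41, -2.01, 0.10, -0.63, -0.72]  # table 1, items 8-13, year 2003
    print(theta_gap(0.0, sig_2000, sig_2003) * 90)       # gap expressed in PISA points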

Scaling PISA 2003 in reading – gender and ethnicity

Whenever an analysis of item homogeneity is executed by using an external variable to define sub-groups, it is tacitly assumed that the Rasch model works within each group⁶, i.e. that the items are homogeneous within each group.

No.  Item       σᵢ (girls)  σᵢ (boys)  θ-scale diff.  σᵢ (DK)  σᵢ (ejDK)  θ-scale diff.
1    R055Q01_      1.18        1.35        15.16         1.25      2.05        72.77
2    R055Q02_     -0.71       -0.70         1.56        -1.13     -1.59       -41.76
3    R055Q03_     -0.23       -0.02        18.80        -0.53     -0.83       -26.60
4    R055Q05_      0.58        0.43       -13.32         0.20      0.28         7.38
5    R067Q01_      1.04        1.11         5.96         2.38      2.05       -29.43
6    R067Q04_      0.25        0.15        -8.83         0.08      0.01        -6.63
7    R067Q05_      0.36        0.01       -30.83         0.25      0.42        16.06
8    R102Q05_     -1.07       -0.91        14.44        -0.42     -0.24        15.51
9    R102Q07_      1.55        1.66        10.31         2.01      0.91       -99.20
10   R102Q04A     -1.75       -1.51        21.34        -1.68     -1.82       -12.02
11   R104Q01_      0.38        0.20       -15.90         0.25     -0.12       -32.89
12   R104Q02_     -0.22       -0.68       -40.96        -0.75     -0.60        13.82
13   R104Q05_     -0.39       -0.69       -26.30        -0.93     -1.70       -69.25
14   R111Q01_     -1.28       -0.74        48.33        -1.02     -1.05        -2.26
15   R111Q02B      0.04       -0.01        -4.85        -0.77     -0.37        36.59
16   R111Q06B      1.59        1.57        -1.86         1.37      2.05        61.47
17   R219Q02_      0.32        0.40         7.05         0.37      1.29        82.81
18   R219Q01E      0.05        0.25        17.95         0.45      1.09        57.84
19   R220Q01_     -0.62       -0.45        15.31        -0.09     -0.37       -24.59
20   R220Q04_     -0.16       -0.43       -24.37        -0.97     -1.27       -26.68
21   R220Q05_      0.64        0.60        -3.69         0.47      0.74        23.62
22   R220Q06_     -1.55       -1.61        -5.28        -0.75     -0.94       -16.57

Table 2: Rasch Model estimates of item difficulties σᵢ for girls and boys (international student responses) and for Danish (DK) and non-Danish, ethnic students (ejDK) (Danish student responses) for PISA 2003; θ-scale adjustments for unequal item difficulties. All items from booklet No. 10.

6 By nature, the likelihood ratio test statistic (Andersen, 1973) for item homogeneity across groups has as a prerequisite that item parameters exist within each group, i.e. that the Rasch model fits within each group.



Within the PISA 2003 data, a repetition of the statistical tests for homogeneity presented in the previous section for 2000–2003 has been undertaken across gender and ethnicity. While the international data was used for the gender analysis, only data from Denmark has been used for the ethnic grouping. This leads to table 2.

The numerical indications in table 2 regarding the degree of inhomogeneity can be illustrated in the same fashion as in figure 2, here presented as figure 3. Perfect homogeneity across the two external criteria, gender and ethnicity, would show up as perfectly vertical lines in the figures.

Figure 3: Estimated item difficulties for 22 reading items, Danish students (lower line) and non-Danish students (upper line), left part; Danish PISA 2003 data. Estimated item difficulties for 22 reading items, girls (lower line) and boys (upper line), right part; international PISA 2003 data.

Although the impression from the figures in table 2 and the graphs in figure 3 is that ethnicity creates the largest degree of inhomogeneity, the contrary is, in fact, the truth. The explanation is that the statistical tests for homogeneity across ethnicity are based on the Danish PISA 2003 set, booklet No. 10, consisting of only 325 valid student responses, providing little power behind the tests. Again, both simultaneous multivariate conditional tests (Andersen, 1973) and exact tests, item-by-item (Allerup, 1997), have been applied. While the test statistics strongly reject the homogeneity hypothesis across gender, only weaker signs of inhomogeneity are indicated across ethnicity.

Reading the crude deviations from table 2 points, for example, to items R104 and R067 favouring girls, and R111 and R102 favouring boys. Likewise, items R102 and R104 constitute challenges which favour Danish students, while items R219 and R055 seem to favour ethnic students. Details behind these suggestions of inhomogeneity, e.g. assessing didactic interpretations for these deviations, can be evaluated through a closer look at the relation between the observed and expected number of responses in specific score-groups.⁷

If the displayed inhomogeneities in table 2 are accumulated in the same way as in the PISA 2000 vs. 2003 analysis, it can be shown that poorly performing girls get a scale-specific advantage of magnitude 8-10 PISA points, which is reduced to approximately 1-2 points for high performing girls. A similar accumulation for the analysis across ethnicity shows that low performing Danish students (around 30 % correct responses) get a scale-specific advantage of approximately 12 PISA points, while very low or very high performing students do not get any ‘free’ scale points because of inhomogeneity.

Scaling PISA 2000 in reading – ethnicity

PISA 2000 data offers an excellent opportunity to study what happens if the reading of Danish students is compared with that of the ethnic students in Denmark. Before any didactic explanations can be discussed, a first approach to recognizing possible inhomogeneity is achieved by comparing the relative item difficulties for the two groups. As said, both the ordinary PISA 2000 study and the two special studies (PISA Ethnic and PISA Copenhagen) have been run on the PISA 2000 instruments, bringing the total number of student responses to approximately 10,000, 17 % of which come from ethnic students.

Item       αᵢ    cat  σᵢ (DK)  σᵢ (ejDK)  θ-scale diff.  Booklet
R055Q01    1.21   1     1.30      1.13       -15.39         2
R055Q03    1.17   1    -0.59     -1.22       -56.88         2
R061Q01    0.91   0    -0.37     -0.17        17.59         6
R076Q03    0.86   1     0.20      0.78        52.69         4
R076Q04    0.80   1    -0.67      0.22        79.94         4
R076Q05    1.08   0    -0.87     -0.63        21.96         4
R076Q05    1.15   0     0.02      0.12         8.73         5
R077Q04    0.72   1     0.67      0.77         9.32         8
R081Q05    1.00   0     0.23      0.34         9.53         1
R083Q06    1.14   0    -0.93     -0.66        24.48         5
R086Q05    1.64   1     1.93      1.36       -51.97         1
R086Q05    1.54   1     1.89      1.05       -75.48         3
R086Q05    1.21   1     2.31      1.65       -58.93         4
R091Q06    0.72   1     1.50      1.51         1.11         3
R100Q06    1.35   1     1.35      0.85       -44.77         3
R100Q06    1.31   1     1.56      0.49       -96.48         6
R101Q02    1.36   1     1.53      0.87       -58.99         5
R104Q01    1.06   1     0.64      0.90        23.86         5
R104Q01    1.46   0     2.64      2.11       -47.50         6
R110Q06    0.98   0     1.03      1.11         7.42         7
R111Q06B   1.35   1    -0.45     -1.41       -86.05         4
R119Q06    0.70   1     1.21      1.18        -2.15         3
R120Q01    1.20   1     0.63      0.67         2.99         4
R120Q01    1.47   1     0.62      0.15       -42.32         6
R120Q07T   1.32   1     0.69      0.19       -44.98         4
R219Q02    0.75   1     0.56      0.96        35.43         1
R220Q02B   1.17   1     0.49     -0.06       -49.78         4
R220Q06    0.87   0     1.20      0.96       -21.55         7
R227Q04    1.53   1    -0.46     -0.88       -37.47         3
R234Q01    1.16   0     1.40      1.38        -1.52         1
R234Q02    1.24   1    -2.16     -2.04        10.88         1
R234Q02    0.95   1    -2.04     -1.74        26.81         2
R241Q02    0.81   1    -0.70     -0.34        32.56         2

Table 3: Rasch Model estimates of significantly inhomogeneous items across ethnicity; item difficulties σᵢ for Danish (DK) and non-Danish (ejDK), ethnic students (N = 10,063 student responses); θ-scale adjustments for unequal item difficulties under the simple Rasch Model. Cat = 1 indicates significant item discrimination (αᵢ ≠ 1.00).

7 Compare aᵢ(r) – the observed number of correct responses to item No. i in score group r – with nᵣ πᵢ(r) – the expected number, where nᵣ is the number of students in score group r and πᵢ(r) is the conditional probability of a correct response to item No. i in score group r (depending on σᵢ and the so-called symmetric functions of σ₁, ..., σₖ only).
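The observed-versus-expected check described in footnote 7 can be sketched as follows. Under the Rasch model, the conditional probability πᵢ(r) of a correct response to item i in score group r depends only on the item parameters, through the elementary symmetric functions; this is a standard conditional-maximum-likelihood identity, written here in the multiplicative parameterization εᵢ = exp(σᵢ) as an illustration rather than the official PISA computation.

    import numpy as np

    def elementary_symmetric(eps):
        # gamma[r] = elementary symmetric polynomial of order r in eps.
        gamma = np.zeros(len(eps) + 1)
        gamma[0] = 1.0
        for e in eps:
            gamma[1:] = gamma[1:] + e * gamma[:-1]
        return gamma

    def pi_ir(sigmas, i, r):
        # Conditional probability of a correct response to item i given raw
        # score r (valid for 1 <= r <= number of items); it depends only on
        # the item parameters, not on the distribution of student abilities.
        eps = np.exp(np.asarray(sigmas, dtype=float))
        g_all = elementary_symmetric(eps)
        g_without_i = elementary_symmetric(np.delete(eps, i))
        return eps[i] * g_without_i[r - 1] / g_all[r]

    # Footnote 7's check: with n_r students in score group r, compare the
    # observed count a_i(r) with the expected count n_r * pi_ir(sigmas, i, r).
    print(pi_ir([1.30, -0.59, -0.37, 0.20], i=1, r=2))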

In this section, analyses are based on 140 reading items from all nine booklets, each containing around 40 items, organized with overlap in a rotation system across the booklets. This brings about 1,100 student responses per booklet.

Using these PISA 2000 data, the statistical tests for homogeneity across the two student groups defined by ethnicity (DK and ejDK) may once more be applied. Both multivariate conditional (Andersen, 1973) and exact item-by-item tests (Allerup, 1997) were applied. The results clearly reject the hypothesis of homogeneity and, consequently, the items are inhomogeneous across the two ethnic student groups.

Because of the amount of data available, statistical tests for proper Rasch model item discriminations (Allerup, 1994) have also been included; if significant, i.e. if the hypothesis αᵢ = 1.00 must be rejected, this can be taken as an indication of the validity of the so-called two-parameter Rasch model (Lord and Novick, 1968)⁸. Other, more orthodox views would claim that basic properties behind ‘objective comparisons’ are then violated because of intersecting ICC curves (Item Characteristic Curves). Hence, this would be taken as just another sign of item inhomogeneity. Table 3 lists all items (among the 140 items in total) found to be inhomogeneous in the predefined setting with unequal item difficulties only. Items with significant item discriminations are then marked with cat = 1. Some items appear twice because of the rotation, which allows items to be used in several different booklets.

The combination of high item discrimination and the existence of two slightly different θ-groups, which are compared on the general average level, can cause serious effects. Since it is expected that the ethnic student group generally performs lower than the Danish group, it could be that one item with high item discrimination acts like a ‘separator’, in the item-map sense, between the two θ-groups. This situation will artificially decrease the probability of a correct response from students in the lower latent θ-group while, at the opposite end, students from the upper θ-group will artificially enjoy enhanced probabilities of responding correctly. In this way the phenomenon of high item discrimination tends to punish the poor students and disproportionately reward the high performing students.
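A small numerical illustration of this ‘separator’ effect, using the two-parameter form quoted in footnote 8 below; the group locations and discriminations are made up for the illustration.

    import math

    def p_2pl(theta, sigma, alpha):
        # Two-parameter model: the discrimination alpha scales the
        # distance between student ability and item location.
        return 1.0 / (1.0 + math.exp(-alpha * (theta + sigma)))

    # An item located midway between a low group (theta = -1) and a
    # high group (theta = +1):
    for alpha in (1.0, 2.5):
        print(alpha, round(p_2pl(-1.0, 0.0, alpha), 2), round(p_2pl(1.0, 0.0, alpha), 2))
    # alpha = 1.0 gives 0.27 vs. 0.73; alpha = 2.5 gives 0.08 vs. 0.92:
    # higher discrimination depresses the low group and lifts the high group.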

From table 3 it can be read that, for example, item R055Q03 is (relatively) more difficult for the ethnic students compared with the Danish students. In terms of compensation on the PISA θ-scale, this means that an ethnic student experiences a PISA scale induced loss of 56.88 points. In other words, an ethnic student must be 56.88 scale points ahead of his Danish classmate if they are to have equal chances of responding correctly to the item. A Danish student with a PISA score equal to, say, 475 has the same probability of a correct response as an ethnic student with PISA score 475 + 56.88 = 531.88.

It is an interesting feature of table 3 that more than 60 % of these ethnic-significant items are administered in a multiple choice format (i.e. closed response categories, MC), while only 19 % belong to this category in the full PISA 2000 set-up. This is surprising, because an open response would be expected to call for deeper insight into linguistic details about the formulation of the reading problem, compared to just ticking a predefined box in the MC format.

The item R076Q04 is an MC item under the caption “retrieving information”, where the students examine the flying schedule of Iran Air. This item is solved far better by the ethnic students than by the Danish students, because the item doesn’t really contain complicated text at all, just numbers and figures listed in schematic form. Contrary to this example, item R100Q06 (MC) contains long and twisted Danish text, and the caption for the item is “interpreting”, which aims at ‘reading between the lines’; only if the interpretation is correct is the complete response considered to be correct.

8 The two-parameter model with item discrimination αᵢ is P(aᵥᵢ = 1) = exp(αᵢ(σᵢ + θᵥ)) / (1 + exp(αᵢ(σᵢ + θᵥ))).

In this example from reading, the accumulated effect of the individual item inhomogeneities is evaluated using a different technique from the previous sections. In fact, the more traditional step-by-step method is now applied, in which inhomogeneous items are removed before re-estimation of the PISA score takes place. The gap between Danish and ethnic students can then be studied before and after removal of inhomogeneous items.

From the joint data of PISA 2000, PISA Copenhagen and PISA Ethnic one gets the crude differences in average PISA θ-score between Danish and non-Danish students.

The crude average difference amounts to 90.54 PISA points. Since the items are spread over nine booklets, it is of interest to judge the accumulated effect for each booklet. At the same time, this is an opportunity to check one of the implications of the three equivalent characterizations of the Rasch model⁹, viz. that one should get almost the same picture irrespective of which booklet is investigated.

9 Student abilities θ₁, ..., θₙ can be calculated with the same result irrespective of which subset of items is used.

Table 4: Average θ-score differences between Danish and non-Danish students, by booklet and in total, calculated under two scenarios: (1) all items and (2) homogeneous items only, i.e. items with σᵢ(DK) ≈ σᵢ(ejDK). [Booklet-level θ-score values not preserved.]



Item       αᵢ    cat  σᵢ (2000)  σᵢ (2003)  θ-scale diff.  Booklet
M155Q04T   0.86   0     -0.37       0.27        58.11
S114Q03T   1.74   1      0.57       0.63         6.18        2, 8
S114Q04T   1.63   1      0.28       0.30         1.68
S114Q05T   1.11   0     -1.21      -1.54       -29.37
S128Q01_   0.94   0      0.53       0.70        14.91
S128Q02_   0.78   1     -0.16      -0.31       -14.08
S128Q03T   0.80   1      0.35       0.25        -8.60
S131Q02T   1.43   1      0.00      -0.20       -18.26
S131Q04T   1.60   1     -1.62      -1.54         7.62
S133Q01_   0.91   0      0.60       0.95        31.93
S133Q03_   0.51   1     -0.61      -0.70        -8.20
S133Q04T   0.56   1      0.23       0.18        -5.04
S213Q02_   0.88   0      1.21       1.19        -1.61
S213Q01T   1.21   1     -0.17       0.09        22.84

Table 5: Rasch Model estimates of item difficulties σᵢ (2000) and σᵢ (2003) for math and science items shared by PISA 2000 and PISA 2003 in four booklets; θ-scale adjustments for unequal item difficulties under the simple Rasch Model. Cat = 1 indicates significant item discrimination (αᵢ ≠ 1.00).

The test statistics applied earlier are again brought into operation, testing the hypothesis that the item difficulties for the years 2000 and 2003 are equal. In fact, both multivariate conditional (Andersen, 1973) and exact item-by-item tests (Allerup, 1997) were used. The results of the estimation are presented in table 5, together with an evaluation of the item discriminations αᵢ.

The results for mathematics show that the hypothesis must be rejected and, consequently, the math items presented in table 5 are inhomogeneous across the testing years 2000 and 2003. Item M155Q04T is an item which, systematically for all score levels, seems to have become easier between 2000 and 2003; in more familiar terms, a rise is seen from 64 % correct responses to 75 %, calculated for all students.

The results for science seem to be in accordance with the expectations behind the PISA scaling. In fact, the multivariate conditional and the exact tests for single items do not reject the hypothesis of equal item difficulties across the test years 2000 and 2003.

Since only a very few item groups from four booklets have been investigated, no attempt at calculating accumulated effects for larger groups of students and items will be made.



Scaling PISA 2000 and 2003 in mathematics

In view of the fact that the tests for homogeneity across 2000 and 2003 failed in mathematics, it could be of interest to investigate scale properties within each of the two years. Using booklet No. 5 (same booklet number in 2000 and 2003), around 400 student responses are available for an analysis of homogeneity across gender. Table 6 displays the estimates of item difficulties for the seven math items shared in 2000 and 2003 in booklet No. 5, together with the estimated item discriminations and an evaluation of the item discrimination αᵢ in relation to the Rasch model requirement αᵢ = 1.00.

Item       αᵢ    cat  σᵢ (girls)  σᵢ (boys)  θ-scale diff.  Booklet
2000:
M150Q01_   0.86   0      0.36        0.57       -18.88         5
M150Q02T   1.23   0      3.24        2.67        51.03
M150Q03T   1.00   0     -0.54       -0.71        14.79
M155Q01_   0.91   0      0.06       -0.38        39.20
M155Q02T   1.20   0      0.36        0.63       -24.46
M155Q03T   1.61   0     -3.06       -2.47       -53.63
M155Q04T   0.85   0     -0.42       -0.33        -8.04
2003:
M150Q01_   0.75   0     -0.27        0.63       -81.63         5
M150Q02T   0.95   0      2.40        3.38       -88.71
M150Q03T   0.99   0     -0.52       -1.09        51.06
M155Q01_   1.26   0      0.12       -0.15        24.37
M155Q02T   1.09   0      0.42       -0.07        43.28
M155Q03T   1.45   0     -2.41       -2.99        51.92
M155Q04T   0.85   0      0.27        0.27        -0.29

Table 6: Rasch Model estimates of item difficulties σᵢ (girls) and σᵢ (boys) for math items in PISA 2000 and PISA 2003, using two booklets; θ-scale adjustments for unequal item difficulties under the simple Rasch Model. Cat = 1 indicates significant item discrimination (αᵢ ≠ 1.00).

The statistical methods for testing the hypothesis of equal difficulties for girls and boys are brought into operation again. Both multivariate conditional (Andersen, 1973) and exact item-by-item tests (Allerup, 1997) were used.

Behind the estimates presented in table 6 lies the information that the gender-specific homogeneity hypothesis must clearly be rejected in the data from PISA 2003, while the picture is less distinct for PISA 2000 (significance probability p = 0.08 for the simultaneous test). Consequently, in PISA 2003 the seven items presented in table 6 are inhomogeneous across gender. In particular, item No. 2, M155Q02T, is one item which changes position from favouring the girls in PISA 2000 (98 % correct for girls vs. 96 % correct for boys) to the contrasting role of favouring the boys in PISA 2003 (96 % correct for girls vs. 97 % correct for boys). In terms of log-odds ratio, this is a change from 1.14 as relative ‘distance’ between girls and boys in PISA 2000 to -0.29 in PISA 2003. In PISA 2003 the items M150Q01, M150Q03T and M155Q03T attract attention, also because of the large transformed consequences on the θ-scale. However, the only item showing significant gender bias according to exact tests for single items is M150Q01.
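For reference, the log-odds ratio quoted here is simply the log of the ratio between the two groups’ odds of a correct response; a minimal sketch follows. Note that the rounded percentages above reproduce the 2003 value of -0.29, while the published 2000 value of 1.14 would require the unrounded response frequencies.

    import math

    def log_odds_ratio(p_girls, p_boys):
        # Log of the odds ratio between girls' and boys' proportions correct.
        return math.log((p_girls / (1.0 - p_girls)) / (p_boys / (1.0 - p_boys)))

    print(log_odds_ratio(0.96, 0.97))  # PISA 2003: approx. -0.29
    print(log_odds_ratio(0.98, 0.96))  # rounded PISA 2000 values give approx. 0.71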

As stated in the three equivalent characterizations of item homogeneity, rejecting the hypothesis of homogeneous items means that information about the students’ ability to solve the tasks is not accessible through the raw scores, i.e. the total number of correct responses across items. That the student raw score is not a sufficient statistic for the ability θᵥ, or that the PISA scale score does not measure the students’ competencies in solving the items, are two other ways of describing the situation under the caption ‘inhomogeneous items’. On the other hand, this does not prevent the PISA analyst from obtaining another kind of information from the responses with respect to comparing students by means of the PISA items.

With regard to the two items M150Q02 and M150Q03 above, it has been demonstrated (Allerup et al., 2005) how information from these two open-ended¹⁰ items can be handled as profiles. By this, all combinations of responses to the two items are considered, and the analysis of group differences takes place using these profiles as ‘units’ for the analyses. In principle, every combination of responses from the items entering such profiles must be labelled prior to the analysis in order to be able to interpret differences found by way of the profiles. If the number of items exceeds, say, ten, with two response levels on each item, this would in turn require approximately 1,000 different labels! In general this is far too many profiles to be able to assign different interpretations, and the profile method is, consequently, not suited for analyses built on a large number of items.
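The combinatorial point is easy to verify; a sketch of the profile construction for dichotomous items:

    from itertools import product

    # Profiles over k dichotomous items: every combination of responses is one
    # analysis 'unit' that must be labelled in advance.
    k = 2
    profiles = list(product((0, 1), repeat=k))
    print(profiles)   # [(0, 0), (0, 1), (1, 0), (1, 1)]
    print(2 ** 10)    # 1024: already about a thousand labels for ten items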

10 An item which requires a written answer, not a multiple choice item. The response is later rated and scored correct or non-correct.

One consequence of accepting an item as part of a scale for further analyses, in spite of the fact that the item was found to be inhomogeneous across gender, can be illustrated by the reports from the international TIMSS study from 1995 (Beaton et al., 1996), operated¹¹ by the IEA. In this study a general difference in mathematics performance between girls and boys was found, showing that in practically all participating countries boys performed better than girls. Although this conclusion contrasted greatly with experiences obtained nationally in many countries, the TIMSS result was generally accepted as fact. The TIMSS study was at that time designed with rotated booklets, as in PISA, but without using item blocks. Instead, a fixed set of six math items and six science items was part of every booklet, as a fixed reference for bridging between the booklets.

Unfortunately, it turned out that one of the six math reference items¹² was strongly inhomogeneous (Allerup, 2002). The girls were actually ‘punished’ by this item: even very highly performing female students, rated on the basis of responses to other items, responded incorrectly to this particular item. This could be confirmed by analysing data from all participating countries, providing high statistical power for the tests of homogeneity.

Scaling PISA 2000 – ‘not reached’ items in reading

‘Not reached’ items are the same as ‘not attempted’ items, and constitute a special kind of item which deserves attention in studies like PISA. They are usually found at the end of a booklet, because the students read the booklet from page 1 and try solving the tasks in the order they appear. In the international versions of the final data base, the ‘not reached’ items are marked by a special missing-symbol to distinguish them from omitted items, which are items where neighbouring items to the right have obviously been attempted.

It is ordinary testing practice to present the student with several tasks which are in turn properly adjusted to the complete testing time, e.g. two lessons in the case of PISA. This is a widespread practice, with exceptions seen in Nordic testing traditions. Many tests are thereby constructed in such a way as to make it possible to judge two separate aspects: proficiency and speed. In reading, it is considered crucial for relevant teaching that the teacher gets information about the students’ proficiency both in terms of ‘correctness’ and in terms of reading speed. In order for the last factor to be measurable, one usually needs a test which discriminates between students with respect to being able to reach all items, viz. one whose length exceeds the capacity of some students while being easy to get through for others.

11 IEA, The International Association for the Evaluation of Educational Achievement.

12 A math item aiming at testing the students’ knowledge of proportionality, but presented in a linguistic form which was misunderstood by the girls.

While everybody seems to agree on the statistical treatment of omitted items (they are simply scored as ‘non-correct’), there have been discussions as to how to treat ‘not reached’ items. These take place from two distinct points of view: one dealing with scaling problems, and one dealing with the problem of assigning justifiable PISA scores to the students.

One of the virtues of linking scale properties to the analysis of Rasch homogeneity is found in the second characterization of item homogeneity above, viz. that “the student abilities θ₁, ..., θₙ can be calculated with the same result, irrespective of which subset of items is used”. This strong requirement, which in PISA ensures that responses from different booklets can be compared irrespective of which items are included, in principle also paves the road for non-problematic comparisons between students who have completed all items and students who have not completed all items in a booklet. At any rate, seen from a technical point of view, the existence of ‘not reached’ items does therefore not pose a problem for the estimation of the student scores θ, because the quoted fundamental property of homogeneity has been tested in a pilot study prior to the main study, and all items included in the main study are consequently expected to enjoy this property. In the IEA reading literacy study (Elley, 1992), the discussion about which student Rasch θ-score to choose – the one based on the ‘attempted items’, considering ‘not reached’ items as ‘non-existing’, or the one considering ‘not reached’ items as ‘non-correct’ responses – was never solved, and both estimates were published. In subsequent IEA studies and in the PISA cycles to date, the ‘not reached’ items have been considered ‘non-correct’.

The second problem mentioned is the influence the ‘not reached’ items have on the statistical tests for homogeneity, an analytical phase which is undertaken prior to the estimation of the student abilities θ₁, ..., θₙ. The immediate question here is whether different management of the ‘not reached’ item responses could lead to different results as to the acceptance of the homogeneity hypothesis. The immediate answer is that it does matter how ‘not reached’ item responses are scored, as ‘not attempted’ or as ‘non-correct’. The technical details will, however, not be discussed here, but one important point is the type of estimation technique applied for the item parameters σ₁, ..., σₖ¹³.

13 Marginal estimation, with or without a prior distribution on the student scores θ₁, ..., θₙ, or conditional maximum likelihood estimation. A popular technique for estimation and testing of homogeneity proceeds by successive extension of the data, increasing the number of items, using only complete response data with no ‘not reached’ responses in each step.



           PISA study
Booklet    2000    Cop    Ethnic
1          0.02    0.02   0.01
2          0.00    0.00   0.00
3          0.01    0.01   0.00
4          0.00    0.00   0.00
5          0.00    0.00   0.01
6          0.00    0.00   0.01
7          0.01    0.02   0.03
8          0.02    0.02   0.05
9          0.05    0.07   0.17

Table 7: Frequency of ‘not reached’ items in three studies using the PISA 2000 instruments: ordinary PISA 2000, the Copenhagen study (Cop) and the Ethnic special study.

In PISA 2000, with reading as the main theme, the ‘not reached’ problem was not a significant issue. Table 7 displays the frequency of ‘not reached’ items in the main study PISA 2000. It can be read from the table that the level of ‘not reached’ varies greatly across booklets, with a maximum amounting to 5 % for booklet No. 9. Looking at the Copenhagen study and the special Ethnic study, it is, however, clear that the ‘not reached’ problem is probably most critical for the students having an ethnic minority background. In fact, using all N = 10,063 observations in the combined data from table 7, it can be shown that the average frequency of ‘not reached’ is 1.6 % for Danish students and 4.3 % for ethnic minority students. For the ethnic minority group it can furthermore be shown that the frequency of ‘not reached’ reaches a maximum of 17 % in booklet No. 9.

Before conclusions are drawn as to the evaluation of group differences in terms of different PISA θ-values, the relation between PISA θ-values and the frequency of ‘not reached’ can be shown. Using log-odds as a measure of the level of ‘not reached’, a distinct linear relationship can be detected in figure 4. As anticipated, the relation indicates a negative correlation. For the summary of conclusions as to viewing the effects of inhomogeneity and other sources influencing the θ-scaling, it is clear from figure 4 that the statistical administration of this variable can be modelled in a simple linear manner.
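A sketch of this simple linear treatment, with made-up numbers standing in for the booklet No. 9 data shown in figure 4:

    import numpy as np

    def log_odds(freq):
        # Log-odds transform of a 'not reached' frequency, as used in figure 4.
        return np.log(freq / (1.0 - freq))

    # Hypothetical (theta-score, 'not reached' frequency) pairs:
    theta = np.array([420.0, 470.0, 510.0, 560.0])
    nr_freq = np.array([0.30, 0.17, 0.08, 0.03])

    slope, intercept = np.polyfit(log_odds(nr_freq), theta, 1)
    print(slope, intercept)  # negative slope: more 'not reached', lower theta-score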




Figure 4: Relation between estimated PISA θ-scores and the frequency of ‘not reached’ (log-odds of the frequency) for booklet No. 9 in the combined data set from PISA 2000, PISA Copenhagen and PISA Ethnic.

Conclusions and summary of effects on the scaling of PISA students

It has been essential for the analyses presented above to elucidate the theoretical arguments for the use of Rasch models in the work of calibrating scales for PISA measurements. Although the two latent scales containing item difficulties and student abilities are, mathematically speaking, completely symmetrical, different concepts and different methods are associated with the practical management of the two scales.

The analyses have demonstrated that a certain degree of item inhomogeneity is present in the PISA 2000 and 2003 scales. These effects of inhomogeneity have been transformed into practical, measurable effects on the ordinary PISA ability θ-scale, which holds the internationally reported student results. One conclusion was that on the individual student level these transformed effects amounted to rather large quantities, up to 150 PISA points, although they were often below 100 points. For the standard groupings of PISA students according to gender and ethnicity, the accumulated average effect on group level amounted to around 10 PISA points.

In order to examine the effects of item inhomogeneity in relation to other systematic factors which influence comparisons between groups of students, an illustration will be used from PISA 2000 in reading (see also Allerup, 2006). From the previous analyses, a picture of item inhomogeneity across two systematic factors (gender and ethnicity) was obtained. Together with the factor booklet id and the number of ‘not reached’ items, four factors have thereby already been at work as systematic background for contrasting levels of PISA θ-scores.
-scores.<br />

The illustration aims at setting the effect of inhomogeneity in relation <strong>to</strong><br />

other systematic fac<strong>to</strong>rs when statistical analysis of -scores differences are<br />

investigated. The illustration will be using differences between the two ethnic<br />

groups, carried out as adjusted comparisons with the systematic fac<strong>to</strong>rs as controlling<br />

variables. In order <strong>to</strong> complete a typical <strong>PISA</strong> data analysis, one supplementary<br />

fac<strong>to</strong>r must be included: the socio-economic index (ESCS), aiming<br />

at measuring through a simple index the economical, educational and occupational<br />

level at home for the student 14 . The relation between <strong>PISA</strong> -scores and<br />

the index ESCS is a (weak) linear function and is usually called the ‘law of<br />

negative social heritage’. Together with the linear impression gained in figure<br />

4, an adequate statistical analysis behind the illustration will be an analysis<br />

of <strong>PISA</strong> -scores as dependent variable and (1) number of not reached items,<br />

(2) booklet id, (3) gender and (4) socio economic index ESCS as independent<br />

variables, all implemented in a generalized linear model.<br />
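A minimal sketch of such an adjusted comparison, here with ordinary least squares standing in for the generalized linear model; the data frame and its column names are invented for the illustration and do not reproduce the official PISA variable names.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Invented example data; in the real analysis each row is one student.
    df = pd.DataFrame({
        "pisa_score":  [480.0, 512.0, 455.0, 401.0, 530.0, 388.0, 472.0, 419.0],
        "not_reached": [1, 0, 2, 7, 0, 5, 1, 3],
        "booklet":     [1, 4, 4, 9, 1, 9, 4, 1],
        "gender":      ["girl", "boy", "girl", "boy", "girl", "girl", "boy", "boy"],
        "escs":        [0.3, 0.8, -0.1, -0.9, 1.1, -1.2, 0.2, -0.4],
        "ethnic":      ["DK", "DK", "DK", "ejDK", "DK", "ejDK", "DK", "ejDK"],
    })

    # theta-scores as dependent variable; the adjusted Danish vs. ethnic gap is
    # read off as the coefficient of the 'ethnic' factor.
    model = smf.ols("pisa_score ~ not_reached + C(booklet) + gender + escs + ethnic",
                    data=df).fit()
    print(model.params)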

Two kinds of PISA θ-scores enter the analysis: (1) the reported PISA scores found in the official reports from PISA 2000 (OECD, 2001), PISA Copenhagen (Egelund and Rangvid, 2004) and PISA Ethnic (Egelund and Tranæs, eds., 2007), and (2) ‘Rasch total’, i.e. estimated θ-scores based on a combined data set after removal of inhomogeneous items. By this, the composition of effects on the resulting θ-scale from item inhomogeneity and other systematic factors is illustrated, with an evaluation of their relative significance.

14 The economy is not included as exact income figures but is estimated from information from the student questionnaire.



                                                   Adjusted average difference,
Controlling variables                              Danish vs. ethnic (PISA θ-scores)
                                                   Reported    Rasch total
Not reached, booklet, gender                        56.00       47.48
Not reached, booklet, gender, socio-economy         43.89       26.74
NO adjusting variables                              90.54       80.69

Table 8: Evaluation of differences between Danish and ethnic minority students using the combined data set from PISA 2000, PISA Copenhagen and PISA Ethnic. Differences listed by means of (1) reported PISA scores from the international PISA report and (2) Rasch scores where item inhomogeneity has been removed (Rasch total).

The results of analyzing the gap between Danish and ethnic students are presented in table 8. Under ‘no adjustment’, the officially reported gap of 90.54 PISA points is listed. If inhomogeneous items are removed from the item scale, this group difference is reduced to 80.69 points, i.e. a reduction of around 10 PISA points. The inhomogeneity is therefore responsible for around 10 PISA points. If the variables ‘not reached’, ‘booklet id’ and ‘gender’ are added as systematic factors in the statistical analysis, the controlled gap is 56.00 PISA points, viewed in terms of the official PISA scores, and 47.48 if calculated after removal of inhomogeneous items. After controlling for ESCS, the socio-economic index, it is seen that the reported gap is 43.89 PISA points, while the gap comes down to 26.74 PISA points if it is measured by means of homogeneous reading items only. Ordinary least squares evaluation of the last mentioned, fully controlled difference of 26.74 shows that this difference is not far from being insignificant (p = 0.01). Notice that the part of the difference which can be attributed to the effect of inhomogeneous items varies from around 10 PISA points, constituting around 11 % of the total official interval in the case of crude comparisons without other controlling variables (last line in table 8), to approximately 20 PISA points, constituting around 50 % of the total official interval in the case where inhomogeneity is evaluated after adjusting for the other variables.

What can be seen from this example and the previous discussions and data analyses is that the effect of inhomogeneous items on the official PISA θ-scale can be substantial if the aim of the analysis is to compare either individuals or a few students at a time. The average effect on the official PISA θ-scale in the case of larger student groups depends on the environment in which comparisons are carried out. It seems to have less impact on crude comparisons of (average) PISA abilities with no other variables involved, amounting to around 10 PISA points, while more sophisticated, adjusted comparisons involving controlling variables are more affected by item inhomogeneity.

References

Allerup, P. (1994): “Rasch Measurement, Theory of”. The International Encyclopedia of Education, Vol. 8, Pergamon, 1994.

Allerup, P. (1995): “The IEA Study of Reading Literacy”. In: Owen, P. & Pumfrey, P. (eds.): Children Learning to Read: International Concerns, Vol. 2, p. 186-297, 1995.

Allerup, P. (1997): “Statistical Analysis of Data from the IEA Reading Literacy Study”. In: Applications of Latent Trait and Latent Class Models in the Social Sciences. Waxmann, 1997.

Allerup, P. (2002): “Test Equating Using IRT Models”. Proc. 7th Round Table Conference on Assessment, Canberra, November 2002.

Allerup, P. (2002): “Gender Differences in Mathematics Achievement”. In: Measurement and Multivariate Analysis. Springer Verlag, Tokyo.

Allerup, P. (2005): “PISA præstationer – målinger med skæve målestokke?” Dansk Pædagogisk Tidsskrift, vol. 1, 2005. (In Danish)

Allerup, P., Lindenskov, L. & Weng, P. (2006): “Growing up – The Story Behind Two Items in PISA 2003”. Nordic Light, Nordisk Råd 2006.

Allerup, P. (2006): “PISA 2000’s læseskala – vurdering af psykometriske egenskaber for elever med dansk og ikke-dansk sproglig baggrund”. Rockwool Fondens Forskningsenhed og Syddansk Universitetsforlag, 2006. (In Danish)

Andersen, A. et al. (2001): “Forventninger og færdigheder – danske unge i en international sammenligning”. AKF (Anvendt Kommunal Forskning), DPU (Danmarks Pædagogiske Universitet), SFI (Social Forsknings Instituttet).

Andersen, E. B. (1973): “Conditional Inference and Models for Measuring”. Copenhagen: Mentalhygiejnisk Forlag.

Beaton, A. et al. (1996): “Mathematics Achievement in the Middle School Years. IEA’s Third International Mathematics and Science Study”. Boston College, USA.

Egelund, N. & Tranæs, T. (eds.) (2007): “PISA Etnisk 2005 – kompetencer hos danske og etniske elever i 9. klasser i Danmark 2005”. Rockwool Fondens Forskningsenhed, Syddansk Universitetsforlag.

Elley, W. (1992): “How in the World Do Students Read?” The International Association for the Evaluation of Educational Achievement (IEA), The Hague, 1992.

Fischer, G. & Molenaar, I. (1995): Rasch Models – Foundations, Recent Developments, and Applications. Springer-Verlag, New York.

Lord, F. & Novick, M. (1968): “Statistical Theories of Mental Test Scores”. Addison-Wesley, Massachusetts.

OECD (2001): “Knowledge and Skills for Life – First Results from PISA 2000”. OECD, Paris.

OECD (2004): “Learning for Tomorrow’s World – First Results from PISA 2003”. OECD, Paris.

Rasch, G. (1960): “Probabilistic Models for Some Intelligence and Attainment Tests”. Munksgaard, 1960. Reprinted by Chicago University Press, 1980.

Rasch, G. (1971): “Proof that the Necessary Condition for the Validity of the Multiplicative Dichotomic Model Is Also Sufficient”. Dupl. note, Statistical Institute, Copenhagen (see Allerup, 1994).

Torney-Purta, J., Lehman, R., Oswald, H. & Schulz, W. (2001): “Citizenship and Education in Twenty-eight Countries: Civic Knowledge and Engagement at Age Fourteen”. Amsterdam: IEA, 2001.


PISA and “Real Life Challenges”: Mission Impossible?

Svein Sjøberg

Norway: University of Oslo

Introduction

The PISA project has positive as well as more problematic aspects, and it is important for educators and researchers to engage in critical public debates on this utterly important project, including its uses and misuses.

The PISA project sets the educational agenda internationally as well as within the participating countries. PISA results and advice are often considered objective and value-free scientific truths, while they are, in fact, embedded in the overall political and economic aims and priorities of the OECD. Through media coverage, PISA results create the public perception of the quality of a country’s overall school system. The lack of critical voices from academics as well as from the media gives authority to the images that are presented.

In this article, I will raise critical points from several perspectives. The main point of view is that the PISA ambition of testing “real-life skills and competencies in authentic contexts” is, by definition alone, impossible to achieve. A test is never better than the items that constitute the test. Hence, a critique of PISA should not mainly address the official rationale, ambitions and definitions, but should scrutinize the test items and the realities around the data collection. The secrecy over PISA items makes detailed critique difficult, but I will illustrate the quality of the items with two examples from the released texts.

Finally, I will raise serious questions about the credibility of the results, in particular the rankings. Reliable results assume that the respondents in all countries do their best while sitting the test. I will assert that young learners in different countries and cultures may vary in the way they behave in the PISA test situation. I claim that in many modern societies several students are unwilling to give their best performance if they find the PISA items long, unreadable, unrealistic and boring, in particular if bad test results have no negative consequences for them. I will use the concept of “perceived task value” to argue this important point.

The political importance of PISA

Whether one likes the PISA study or not, one might easily agree about the importance of the project. When the OECD embarks on such a large project, it is certainly not meant as a purely academic research undertaking. PISA is meant to provide results to be used in the shaping of future policies. After 6-7 years of living with PISA, we see that the PISA concepts, ideology, values and, not least, the results and the rankings shape international educational policies and also influence national policies in most of the participating countries. Moreover, the PISA results provide the media and the public with convincing images and perceptions about the quality of the school system, the quality of teachers’ work and the characteristics of both the school population and future citizens.

Contemporary natural science is often labelled Big Science or Technoscience: the projects are multinational, they involve thousands of researchers, and they require heavy funding. Moreover, the scientific values and ethos of such science become different from the traditional ideals of academic science (Ziman, 2000). Prime examples are CERN, the Human Genome Project, the European Space Agency etc. The PISA project has many similarities with such projects, although the scale and the costs are much lower. But the number of people involved is large, and the mere organization of the undertaking requires resources, planning and logistics unusual in the social sciences. According to Prais (2007), the total cost of the PISA and TIMSS testing in 2006 was “probably well over 100 million US dollars for all countries together, plus the time of pupils and teachers directly involved.”

Why is an organization like the OECD embarking on an ambitious task like this? The OECD is an organization for the promotion of economic growth, cooperation and development in countries that are committed to market economies. Their slogan appears on their website: “For a better world economy.”¹

1 These and other quotes in the article are taken from the OECD’s home site http://www.oecd.org/, retrieved Sept 2, 2007.



The OECD and its member countries have not embarked on the PISA project because they have an interest in basic research on education or learning theory. They have decided to invest in PISA because education is crucial for the economy. Governments need information that is supposed to be relevant to their policies and priorities in this economic perspective. Since mass education is expensive, they also most certainly want “value for money” to ensure the efficient running of the educational systems. Stating this is not meant as a critique of PISA. It is, however, meant to state the obvious but still important fact: PISA should be judged in the context of the agenda of the OECD: economic development and competition in a global market economy.

The strong influence that PISA has on national educational policies implies that all educators ought to be interested in PISA, whether they endorse its aims or not. Educators should be able to discuss and use the results with some insight into the methods, underlying assumptions, strengths and weaknesses, possibilities and limitations of the project. We need to know what we might learn from the study, as well as what we cannot learn. Moreover, we need to raise a critical (not necessarily negative!) voice in the public as well as professional debates over uses and misuses of the results.

The influence of PISA: Norway as an example

The attention given to PISA results in national media varies between countries, but in most countries it is formidable. In my country, Norway, the results from PISA 2000 as well as from PISA 2003 produced war-like headlines in most national newspapers.

Our then Minister of Education (2001-2005), Kristin Clemet (representing Høyre, the Conservative party), commented on the PISA 2000 results, released a few months after she had taken office following a Labour government: “Norway is a school loser, now it is well documented. It is like coming home from the Winter Olympics without a gold medal” (which, of course, for Norway would have been a most unthinkable disaster!). She even added: “And this time we cannot even claim that the Finnish participants have been doped!” (Aftenposten, January 2001).

The headlines in all the newspapers told us again and again that “Norway is a loser”. In fact, such headlines were misleading. Norway ended up close to the average among the OECD countries in all test domains in PISA 2000 and PISA 2003. But for some reason Norwegians had expected that we should be on top – as we often are on other indicators and in winter sports. When we are not the winners, we regard ourselves as losers.

Figure 1: PISA results are presented in the media with war-like headlines, shaping public perception of the national school system. Here are PISA results presented in the leading Norwegian newspaper Dagbladet with the heading “Norway is a school loser”.

The results from PISA (and TIMSS as well) have shaped the public image of the quality of our school system, not only for the aspects that have in fact been studied, but for more or less all other aspects of school. It has now become commonly ‘accepted’ that Norwegian schools in general have a very low level of quality, and that Norwegian classrooms are among the noisiest in the world. The media present tabloid-like and oversimplified rankings. It seems that the public as well as politicians have accepted these versions as objective scientific truths about our education system. There has been little public debate, and even the researchers behind the PISA study have done little to qualify the results and remind the public about the limitations of the study. In sum, PISA (as well as TIMSS) has created a public image of the quality of the Norwegian school that is not justified, and that may be seen to be detrimental. I assume that other countries may have similar experiences.

But PISA not only shapes the public image; it also provides a scientific legitimization of school reforms. Under Kristin Clemet as Minister of Education (2001-2005), a series of educational reforms was introduced in Norway. Most of these reforms were legitimized by reference to international testing, mainly to PISA. In 2005, we had a change in government, and Kristin Clemet's Secretary of State, Helge Ole Bergesen, published a book shortly afterwards in which he presented the "inside story" of the reforms made while they were in power. A striking feature of the book is its many references to large-scale achievement studies. He confirms that these studies provided the key arguments and rationale for curricular as well as other school reforms. Under the tabloid heading "The PISA Shock", he confirms the key role of PISA:

With the [publication of the] PISA results, the scene was set for a national battle over knowledge in our schools. [ . . . ] For those of us who had just taken over the political power in the Ministry of Education and Research, the PISA results provided a "flying start". (Bergesen 2006: 41-42. Author's translation)

Other countries may have different stories to tell. Figures 2 and 3 provide examples from the public sphere in Germany. In sum: there is no doubt that PISA has provided – and will continue to provide – results, ideologies, concepts, analyses, advice and recommendations that will shape our future educational debates and reforms, nationally as well as internationally.

PISA: Underlying values and assumptions

It is important to examine the ideas and values that underpin PISA because, like most research studies, PISA is not impartial. It builds on several assumptions, and it carries with it several value judgements. Some of these values are explicit; others are implicit and 'hidden', but nevertheless of great importance. Some value commitments are not very controversial; others may be contested.

Peter Fensham, a key scholar in international science education thinking and research for many decades, has also been heavily involved in several committees of TIMSS and PISA. He has seen all aspects of the projects from the inside over decades. In a recent book chapter, he provides an insider's overview and critique of the values that underlie these projects, drawing attention to their implications:

The design and findings from large-scale international comparisons of science learning do impact on how science education is thought about, is taught and is assessed in the participating countries. The design and the development of the instruments used and the findings that they produce send implicit messages to the curriculum authorities and, through them, to science teachers. It is thus important that these values, at all levels of existence and operations of the projects, be discussed, lest these messages act counterproductively to other sets of values that individual countries try to achieve with their science curricula. (Fensham 2007: 215-216)

Figure 2: The political agenda and the public image of the quality of the entire school system are shaped by the PISA results. This is an example from the German newspaper Die Woche after the release of the PISA2000 results.

Aims and purpose of the OECD

In the public debate as well as among politicians, advice and reports from OECD experts are often considered impartial and objective. The OECD has become an important contributor to the political battle over social, political, economic and other ideas. To a large extent, its experts shape the political landscape, and their reports and advice set the political agenda in national as well as international debates over priorities and concerns. But the OECD is certainly not an impartial group of independent educational researchers. The OECD is built on a neo-liberal political and economic ideology, and its advice should be seen in this perspective. The seemingly scientific and neutral language of expert advice conceals the fact that there are possibilities for other political choices based on different sets of social, cultural and educational values. Figure 4 shows how the OECD presents itself.

Figure 3: PISA has become a well-known concept in public debate: a bookshelf in a German airport offering bestselling books with PISA-like tests for self-assessment, very much like IQ tests.

The overall perspective of the OECD is concerned with the market economy and growth in free world trade. All the policy advice it provides is certainly coloured by such underlying value commitments. Hence, the agenda of the OECD (and PISA) does not necessarily coincide with the concerns of many educators (or other citizens, for that matter). The concerns of PISA are not about 'Bildung' or liberal education, not about solidarity with the poor, not about sustainable development, etc. – but about skills and competencies that can promote the economic goals of the OECD. Saying this is, of course, stating the obvious, but such basic facts are often forgotten in the public and political debates over PISA results.


About the OECD

The OECD brings together the governments of countries committed to democracy and the market economy from around the world to:
– Support sustainable economic growth
– Boost employment
– Raise living standards
– Maintain financial stability
– Assist other countries' economic development
– Contribute to growth in world trade

Figure 4: The basis for and commitments of the OECD as they appear on http://www.oecd.org/ (retrieved 7 September 2007).

Educational and curricular values in PISA

Quite naturally, values creep into PISA testing in several ways. PISA sets out to shed light on important (and not very controversial) questions like these:

Are students well prepared for future challenges? Can they analyse, reason and communicate effectively? Do they have the capacity to continue learning throughout life? (First words on the PISA home page at http://www.pisa.oecd.org/)

These are important concerns for most people, and it is hard to disagree with such aims. However, as is well known, PISA tests just a few areas of the school curriculum: reading, mathematics and science. These subjects are consequently considered more important than other areas of the school curriculum for reaching the brave goals quoted above. Hence, the OECD implicitly says that our future challenges are not highly dependent on subjects like history, geography, social science, ethics, foreign languages, practical skills, arts and aesthetics, etc.

PISA provides test results that are closely connected to (certain aspects of) the three subjects it tests. But when test results are communicated to the public, one receives the impression that PISA has tested the quality of the entire school system and all the competencies that are of key importance for meeting the challenges of the future.

There is one important feature of PISA that is often forgotten in the public debate: PISA (in contrast to TIMSS) does not test "school knowledge". Neither the PISA framework nor the test items claim any connection to national school curricula. This fact is in many ways the strength of the PISA undertaking; it has set out to think independently of the constraints of all the different school curricula. There is a strong contrast with the TIMSS test, as its items are meant to test knowledge that is more or less common to all curricula in the numerous participating countries. This implies, of course, that the "TIMSS curriculum" (Mullis et al 2001) may be characterized as a fossilized and old-fashioned curriculum of a type that most science educators want to eradicate. In fact, nearly all TIMSS test items could have been used 60-70 years ago. PISA's thinking has been freed from the constraints of school curricula and could in principle be more radical and forward-looking. (However, as asserted in other parts of this chapter, PISA does not manage to live up to such high expectations.)

PISA stresses that the skills and competencies assessed may stem not only from activities at school but also from experiences and influences in family life, contact with friends, etc. In spite of this, both good and bad results are most often attributed by the public and politicians to the school alone.

Values in the PISA reporting

The PISA data collection also covers a great variety of background variables. The intention is, of course, to use these to explain the variance in the test results ("explain" in a statistical sense, i.e. to establish correlations, etc.). Many interesting studies have been published on such issues. But the main focus of the public reporting is simple ranking, often in the form of league tables for the participating countries, in which the mean scores of the national samples are published. These league tables are nearly the only results that appear in the mass media. Although the PISA researchers take care to explain that many differences (say, between a mean national score of 567 and 572) are not statistically significant, the placement on the list gets most of the public attention. It is somewhat similar to sporting events: the winner takes it all. If you come in at number 8, no one asks how far you are from the winner, or how far you are from number 24. Moving up or down some places in this league table from PISA2000 to PISA2003 is accorded great importance in the public debate, although the differences may be non-significant statistically as well as educationally.
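To see why a five-point gap can be statistically invisible, consider a back-of-the-envelope check. The standard errors used here are assumptions for illustration only (national PISA means typically carry standard errors of a few score points), not figures taken from any particular PISA report. With an assumed standard error of 3 points for each country mean,

\[
z = \frac{572 - 567}{\sqrt{3^2 + 3^2}} \approx 1.18 < 1.96,
\]

so under these assumptions the difference does not reach significance at the conventional 5 % level, and the two countries' ranks are effectively interchangeable.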

The winners also become models and ideals for other countries. Many want to copy aspects of the winners' school systems. This assumes, among other things, that PISA results can be explained mainly by school factors – and not by political, historical, economic or cultural factors, or by youth culture and the values and concerns of the young learners. Peter Fensham claims:

. . . the project managers choose to have quite separate expert groups to work on the science learning and the contextual factors – a decision that was later to lead to discrepancies. Both projects have taken a positivist stance to the relationship between contextual constructs and students' achievement scores, although after the first round of TIMSS other voices suggested a more holistic or cultural approach to be more appropriate for such multi-cultural comparisons. (Fensham 2007: 218)

PISA (and even more so TIMSS) is dominated and driven by psychometric concerns, and much less by educational ones. The data that emerge from these studies provide a fantastic pool of social and educational information, collected under strictly controlled conditions – a playground for psychometricians and their models. In fact, the rather complicated statistical design of the studies decreases their intelligibility. It is, even for experts, rather difficult to understand the statistical and sampling procedures, the rationale and the models that underlie even the reported test scores. In practice, one has to take the results at face value and on trust, given that some of our best statisticians are involved. But the advanced statistics certainly reduce the transparency of the study and hinder an informed public debate.

PISA items – a critique

The secrecy

An achievement test is never better than the quality of its items. If the items are miserable, even the best statisticians in the world cannot change that fact. Subject matter educators should have a particular interest, and even a duty, to examine in detail how their subject is treated and 'operationalized' through the PISA test items. One should not just discuss the given definitions of, e.g., scientific literacy and the intentions of what PISA claims to test. In fact, the framework as well as the intentions and ideologies of the PISA testing may be considered acceptable and even progressive. The important question is: how are these brave intentions translated into actual items?

But it is not easy to address this important issue, as only very few of the items have been made publicly available. Peter Fensham, himself a member of the PISA (as well as TIMSS) subject matter expert group, deplores the secrecy:

"By their decision to maintain most items in a test secret [ . . . ] TIMSS and PISA deny to curriculum authorities and to teachers the most immediate feedback the project could make, namely the release in detail of the items, that would indicate better than framework statements, what is meant by 'science learning'. The released items are tantalizing few and can easily be misinterpreted." (Fensham 2007: 217)

The reason for this secrecy is, of course, that the items will be used in the next PISA testing round and therefore may not be made public. An informed public debate on this key issue is therefore difficult, to say the least. But we can scrutinize the relatively few items that have been made public.2

Can "real-life challenges" be assessed by wordy paper-and-pencil items?

PISA testing takes place in about 60 countries, which together (according to the PISA homepage) account for 90 % of the world economy. PISA states its intention of testing

. . . knowledge and skills that are essential for full participation in society. [ . . . ] not merely in terms of mastery of the school curriculum, but in terms of important knowledge and skills needed in adult life. [ . . . ]
The questions are reviewed by the international contractor and by participating countries and are carefully checked for cultural bias. Only those questions that are unanimously approved are used in PISA. (Quotes from pisa.oecd.org, retrieved 5 September 2007)

In each item unit, the questions are based on what is called an "authentic text". This, one may assume, means that the original text has appeared in print in one of the 60 participating countries, and that it has been translated from this original.

Many critical comments can be made to challenge the claim that PISA lives up to its high ambition of testing real-life skills. An obvious limitation is the test format itself: the test contains only paper-and-pencil items, and most items are based on reading rather lengthy pieces of text. This covers, of course, only a subset of the "knowledge and skills that are essential for full participation in society". Coping with life in modern societies requires a range of competencies and skills that cannot possibly be measured by test items in the format of the PISA units.

2 All the released items from previous PISA rounds can be retrieved from the PISA website: http://www.oecd.org/document/25/0,3343,en_32252351_32235731_38709529_1_1_1_1,00.html



Identical "real-life challenges" in 60 countries?

But the abovementioned criticism has other and equally important dimensions: the PISA test items are by necessity exactly the same in each country. The quote above assures us that any "cultural bias" has been removed and that items have to be "unanimously approved".

At first glance, this sounds positive. But there are indeed difficulties with such requirements: real life is different in different countries. Here, in alphabetical order, are the first countries on the list of participants:

Argentina*, Australia, Austria, Azerbaijan*, Belgium, Brazil*, Bulgaria*, Canada, Chile*, Colombia*, Croatia*, the Czech Republic, Denmark3

We can only imagine the deliberations towards unanimous acceptance of all items among the 60 countries, given the demands that there should be no cultural bias and that no country's context should be favoured.

The following consequences seem unavoidable: the items become decontextualised, or carry contrived 'contexts' far removed from most real-life situations in any of the participating countries. While the schools in most countries have a mandate to prepare students to meet the challenges of that particular society (depending on its level of development, climate, natural environment, culture, urgent local and national needs, etc.), PISA tests only those aspects that are shared with all other nations. This runs contrary to current curriculum trends in many countries, where providing local relevance and context has become an urgent issue. In many countries, educators argue for a more contextualized (or 'localized') curriculum, at least in the obligatory basic education for all young learners.

The item construction process also rules out the inclusion of all sorts of controversial issues, be they scientific, cultural, economic or political. It is enough that the authorities of a single participating country object.

To repeat: schools in many countries have the mandate of preparing their learners to take an active part in social and political life. While many countries encourage schools to treat controversial socio-scientific issues, such issues are unthinkable in the schools of other countries. Moreover, a controversial issue in one country may not be seen as controversial in another. In sum: the demands of the item construction process place serious limitations on the actual items that comprise the PISA test.

3 The list is from http://www.pisa.oecd.org/. Countries marked with a * are not members of the OECD (but are also assumed to agree unanimously on the inclusion of all test units).



Now, all the above considerations are simply deductions from the demands of the processes behind the construction of the PISA instrument. It is, of course, of great importance to check such conclusions against the test itself. But, as mentioned, this is not an easy task, given the secrecy over the test items. Nonetheless, the items that have been released confirm the above analysis: the PISA items are basically decontextualised and non-controversial. The PISA items are – in spite of an admirable level of ambition – nearly the negation of the skills and competencies that many educators consider important for facing future challenges in modern, democratic societies.

PISA items have also been criticized on other grounds. Many claim that the scientific content is questionable or misleading, and that the language is strange and often verbose. In the following sections, two examples of PISA units are discussed in some detail, one from mathematics, the other from science.



A PISA mathematics unit: Walking

In Figure 5 below, the complete PISA test unit called Walking is reproduced.

M124: Walking

The picture shows the footprints of a man walking. The pacelength P is the distance between the rear of two consecutive footprints.
For men, the formula, n/P=140, gives an approximate relationship between n and P where,
n= number of steps per minute, and
P= pacelength in metres

Question 1: WALKING M124Q01- 0 1 2 9
If the formula applies to Heiko's walking and Heiko takes 70 steps per minute, what is Heiko's pacelength? Show your work.

Question 3: WALKING M124Q03- 00 11 21 22 23 24 31 99
Bernard knows his pacelength is 0.80 metres. The formula applies to Bernard's walking. Calculate Bernard's walking speed in metres per minute and in kilometres per hour. Show your working out.

Figure 5: A complete PISA mathematics unit, "Walking", with the text presenting the situation and the questions relating to the situation.



Comments on Walking

Some details first: note that Question 2 is missing! (This may be an omission in the published document.) Note also the end of Q1: "Show your work." And for Q3: "Show your working out." There also seem to be several commas too many. Consider the commas in this sentence: "For men, the formula, n/P = 140, gives an approximate relationship between n and P where, etc . . . ". In my view, all four commas seem somewhat misplaced. Perhaps these are merely details, but they are not very convincing as the final outcome of serious negotiations between 60 countries!

The main comments on this unit, however, concern the content of the item. First of all: is this situation really a "real-life situation"? How real is the situation described above? Is this type of question a real challenge in the future life of young people – in any country?

But even if we accept the situation as a real problem, it seems hard to accept that the given formula is a realistic mathematization of a genuine situation. The formula implies that when you increase the frequency of your steps, your paces simultaneously become longer. In reality, a person may walk with long paces at a low frequency, and the same person may also walk with short steps at a high frequency. In fact, at least from my point of view, the two factors should be inversely proportional rather than proportional, as suggested in the "Walking" item. In any case, a respondent who tries to think critically about the formula may get confused, while those who do not think may easily solve the question simply by inserting numbers into the formula.

But the problems do not stop here. Take a careful look at the dimensions given in the figure: if the marked footstep is 80 cm (as suggested in Q3 above), then the footprint is 55 cm long! A regular man's foot is actually only about 26 cm long, so the figure is extremely misleading. Even worse, from the figure we can see (or measure) that the next footstep is 60 % longer. Given the formula above, this also implies a more rapid pace, and the man's acceleration from the first to the second footstep has to be enormous!

In conclusion: the situation is unrealistic and flawed from several points of view. Students who simply insert numbers into the formula without thinking will get it right. More critical students who start thinking will, however, be confused and get into trouble!
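For reference, the uncritical "insert the numbers" route through the two questions runs as follows – a minimal worked sketch using only the formula n/P = 140 given in the unit:

\[
\text{Q1: } n = 70 \;\Rightarrow\; P = \frac{n}{140} = \frac{70}{140} = 0.5\ \text{m}.
\]
\[
\text{Q3: } P = 0.80 \;\Rightarrow\; n = 140 \cdot 0.80 = 112\ \text{steps/min}; \quad v = n \cdot P = 112 \cdot 0.80 = 89.6\ \text{m/min} \approx 5.4\ \text{km/h}.
\]

Nothing in this calculation requires the test-taker to notice that the modelled walking behaviour is physically odd; that is precisely the point.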



A PISA science unit: Cloning

In Figure 6 below, the complete PISA test unit called Cloning is reproduced.

S128: Cloning

Read the newspaper article and answer the questions that follow.

A copying machine for living beings?

Without any doubt, if there had been elections for the animal of the year 1997, Dolly would have been the winner! Dolly is a Scottish sheep that you see in the photo. But Dolly is not just a simple sheep. She is a clone of another sheep. A clone means: a copy. Cloning means copying 'from a single master copy'. Scientists succeeded in creating a sheep (Dolly) that is identical to a sheep that functioned as a 'master copy'. It was the Scottish scientist Ian Wilmut who designed the 'copying machine' for sheep. He took a very small piece from the udder of an adult sheep (sheep 1). From that small piece he removed the nucleus, then he transferred the nucleus into the egg-cell of another (female) sheep (sheep 2). But first he removed from that egg-cell all the material that would have determined sheep 2 characteristics in a lamb produced from that egg-cell. Ian Wilmut implanted the manipulated egg-cell of sheep 2 into yet another (female) sheep (sheep 3). Sheep 3 became pregnant and had a lamb: Dolly. Some scientists think that within a few years it will be possible to clone people as well. But many governments have already decided to forbid cloning of people by law.

Question 1: CLONING
Which sheep is Dolly identical to?
A Sheep 1
B Sheep 2
C Sheep 3
D Dolly's father

Question 2: CLONING S128Q02
In line 14 the part of the udder that was used is described as "a very small piece". From the article text you can work out what is meant by "a very small piece". That "very small piece" is
A a cell.
B a gene.
C a cell nucleus.
D a chromosome.

Question 3: CLONING S128Q03
In the last sentence of the article it is stated that many governments have already decided to forbid cloning of people by law. Two possible reasons for this decision are mentioned below. Are these reasons scientific reasons? Circle either "Yes" or "No" for each.

Reason: Cloned people could be more sensitive to certain diseases than normal people. Scientific? Yes / No
Reason: People should not take over the role of a Creator. Scientific? Yes / No

Figure 6: A complete PISA science unit, "Cloning", with the text presenting the situation and the three questions relating to the situation.



Comments on Cloning

This task requires understanding a rather lengthy text of some 30 lines. In non-English-speaking countries, this text is translated into the language of instruction. The translation follows rather detailed procedures to ensure high quality. The requirement that the texts be more or less identical results in rather strange prose in many languages. The original has, we may assume, been an "authentic text" in some language, but the resulting translations cannot be considered "authentic" in the sense that they could appear in a newspaper or journal in the country in question.

PISA adheres to strict rules for the translation process, but this is not the way prose should be translated to become good, natural and readable in other languages. In my own language, Norwegian, the heading "A copying machine for living beings?" is translated word by word. The result does not make sense, and prose like this would never appear in real texts.

The scientific content of the item may also be challenged. The only accepted answer to Question 1 is that Dolly is identical to Sheep 1 (alternative A). It may seem strange to claim that two sheep of very different ages are "identical" – but this is the only acceptable answer. The other two questions are also open to criticism. Basically, they test language skills: reading as well as vocabulary. (The word 'udder' was unknown to me.)

In conclusion: although the intentions behind the PISA test are positive, it is next to impossible to produce items that are 'authentic' and close to real-life challenges – and at the same time without cultural bias and equally 'fair' in all countries. Items have to be constructed through international negotiations, and the result is that all contexts are wiped out – contrary to the ambitions of the PISA framework.

Youth culture: Who cares to concentrate on PISA tests?

In the PISA testing, students aged 15 are supposed to sit for two hours and do their best to answer the items. The data gathered in this way form the basis of all conclusions on achievement and of all the factor analyses that explain (in a statistical sense) the variation in achievement. The quality of these achievement data determines the quality of the whole PISA exercise. Good data assume, of course, that the respondents have done their best to answer the questions. For PISA results to be valid, one has to assume that students are motivated and cooperative, and that they are willing to concentrate on the items and give their best performance.

There are good reasons to question such assumptions. My assertion is that students in different countries react very differently to test situations like those of PISA (and TIMSS). These reactions are closely linked to the overall cultural environment of the country, and in particular to students' attitudes to school and education. Let me illustrate such cultures with examples from two countries scoring high on tests like PISA and TIMSS.

Testing in Taiwan and Singapore

An observer from the Times Educational Supplement watched the TIMSS testing at a school in Taiwan and noticed that pupils and parents were gathered in the schoolyard before the big event. The director of the school gave a speech in which he urged the students to perform their utmost for themselves and their country. Then the pupils marched in while the national anthem was played. Of course, they worked hard; they lived up to the expectations of their parents, school and society.

Similar observations can be made in Singapore, another high achiever on international tests. A professor of mathematics at the National University of Singapore, Helmer Aslaksen, makes the following comment: "In this country, only one thing matters: Be best – teach to the test!" He has also taken a photograph of the check-out counter in a typical Singaporean shop (see Figure 7). This is where the last-minute offers are displayed: on the lower shelf one finds painkillers, while the upper shelf displays a collection of exam papers for the important public exams in mathematics, science and English (i.e. the three PISA subjects). This is what ambitious parents may bring home to their 13-year-old kids. Good results in such exams determine the future of the student.

This is definitely not the way such testing takes place in my part of the world (Norway) and the other Scandinavian countries. Here, students have a very different attitude to schooling, and even more so to exams and testing. The students know that their performance on the PISA test has no significance for them: they are told that they will never get the results, the items will never be discussed at school, and they will not get any other form of feedback, let alone school marks, for their efforts. Given the educational and cultural milieu in, e.g., Scandinavia, it is hard to believe that all students will engage seriously in the PISA test.


Figure 7: The context of exams and testing: the check-out counter in a shop in Singapore, where the last-minute offers are displayed. On the lower shelf: medicinal painkillers. On the upper shelf: exam papers for the important public exams in mathematics, science and English (i.e. the three PISA subjects).

Task value: "Why should I answer this question?"

Several theoretical concepts and perspectives are used to describe and explain performance on tests. The concept of self-efficacy beliefs has become central to this field. By self-efficacy belief, one understands the belief and confidence that students have in their own resources and competencies when facing a task (Bandura 1997). Self-efficacy is rather specific to the type of task in question and should not be confused with more general psychological personality traits like self-confidence or self-esteem. PISA has several constructs that seek to address self-efficacy, and the researchers have noted a rather strong positive relationship between, e.g., mathematical self-efficacy beliefs and achievement on the PISA mathematics test at the individual level (Knain & Turmo 2003). (It is, however, interesting to note that such a positive correlation does not exist when countries are the unit of comparison.)
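A minimal numeric sketch (the numbers are invented purely for illustration) shows how this can happen. Suppose the students of country A lie at self-efficacy/score pairs (1, 520) and (2, 540), and those of country B at (3, 460) and (4, 480). Within each country the relationship is positive (+20 score points per unit of self-efficacy), yet the country means,

\[
\bar{A} = (1.5,\ 530), \qquad \bar{B} = (3.5,\ 470),
\]

run in the opposite direction: the country with the higher mean self-efficacy has the lower mean score. Individual-level and country-level correlations can thus diverge completely.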

There is, however, a related concept that may be of greater importance in explaining test results and students' behaviour in test situations: the concept of task value beliefs (Eccles & Wigfield 1992, 1995). While self-efficacy beliefs address the question "Am I capable of completing this task?", task value beliefs focus on the question "Why do I want to do this task?". Task value beliefs concern the importance of succeeding (or even trying to succeed) on a given task.

It has been proposed that task value beliefs have three components or dimensions: (1) attainment value, (2) intrinsic value or interest, and (3) utility value. Rhee et al (2007) explain in more detail:

Attainment value refers to the importance or salience that students place on the task. Intrinsic value (i.e. personal interest) relates to general enjoyment of the task or subject matter, which remains more stable over time. Finally, utility value concerns students' perceptions of the usefulness of the task, in terms of their daily life or for future career-related or life goals. (Rhee et al 2007: 87)

I would argue that young learners in different countries perceive the task value of the PISA testing in very different ways, as indicated in this chapter's previous sections.

Based on my knowledge of the school system and youth culture in my own part of the world, in particular Norway and Denmark, I would claim that many students in these countries assign very little value to all three of the above dimensions of the task value of the PISA test and its items. Given the nature of the PISA tasks (long, clumsy prose and contrived situations removed from everyday life), many students can hardly find the items to have high "intrinsic value"; the items are simply not interesting and do not provide joy or pleasure. Neither does the PISA test have any "utility value" for these Scandinavian students: the results have no consequences, the items will never be discussed, there is no feedback, and the results are secret and count neither for school marks nor in the students' daily lives. They do not count for students' future career-related or life goals. Given the cultural and school milieu and the values held by young learners in, e.g., Scandinavia, it is hard to understand why they should choose to push themselves in a PISA test situation.

If so, we have an additional cause for serious uncertainty about the validity and the reliability of the PISA results.



References

Bandura, A. (1997). Self-efficacy: The exercise of control. New York: Freeman.
Bergesen, H. O. (2006). Kampen om kunnskapsskolen [Eng.: The fight for a knowledge-based school]. Oslo: Universitetsforlaget.
Eccles, J. S., & Wigfield, A. (1992). The development of achievement-task values: A theoretical analysis. Developmental Review, 12, 256-273.
Eccles, J. S., & Wigfield, A. (1995). In the mind of the actor: The structure of adolescents' achievement task values and expectancy-related beliefs. Personality and Social Psychology Bulletin, 21, 215-225.
Fensham, P. (2007). Values in the measurement of students' science achievement in TIMSS and PISA. In Corrigan et al. (Eds.), The Re-Emergence of Values in Science Education (pp. 215-229). Rotterdam: Sense Publishers.
Knain, E., & Turmo, A. (2003). Self-regulated learning. In S. Lie et al. (Eds.), Northern Lights on PISA (pp. 101-112). Oslo: University of Oslo.
Mullis, I. V. S., Martin, M. O., Smith, T. A., Garden, R. A., Gregory, K. D., Gonzales, E. J., et al. (2001). TIMSS Assessment Frameworks and Specifications 2003. Boston: International Study Center, Boston College.
Prais, S. J. (2007). England: Poor survey response and no sampling of teaching groups. Oxford Review of Education, 33(1).
Rhee, C. B., Kempler, T., Zusho, A., Coppola, B., & Pintrich, P. (2005). Student learning in science classrooms: What role does motivation play? In S. Alsop (Ed.), Beyond Cartesian Dualism: Encountering Affect in the Teaching and Learning of Science. Dordrecht: Springer, Science and Technology Education Library.
Ziman, J. M. (2000). Real Science: What it is and what it means. Cambridge: Cambridge University Press.


PISA – Undressing the Truth or Dressing Up a Will to Govern?

Gjert Langfeldt
Norway: University of Agder

Background

The background for this article is a study of accountability in Europe. The testing of pupils' results is a prime mechanism in establishing an accountability-based logic of governance.1 Part of understanding accountability is studying the quality of the instruments used to measure results, among which the international comparative tests are of prime importance.

PISA – the Programme for International Student Assessment – stands out as by far the most influential of the international comparative tests. The approach to PISA used here was to collect articles on how educational researchers around Europe have reacted to PISA. This is not easy, as PISA is conducted in three-year rounds with three different focal points, taking nine years to complete a full cycle. In 2000 the focal point of PISA was reading literacy, in 2003 mathematical competence. This means that the researchers' critique was often linked to a partial theme, and a common methodological ground for critique of PISA was not always straightforward to find.

Synthesising the literature, this article is structured under three headings:

Reliability. "The International League Table" is the spearhead of PISA in attracting public interest. The differences reported between countries have huge consequences, and whether these differences are reliable must be of primary concern. So the issue of reliability will be the first theme: does the international literature indicate a concern that there are sources of "fuzziness" in the PISA results that can make the national scores appear unreliable?

1 The publication of schools' results, most crudely known as "league tables", and the sanctioning of schools based on the test results appear to be further steps in creating a more full-blown version of accountability-based regimes.

Validity. The second issue of concern is validity, and several issues appear under this theme in the literature. The angle chosen here can be stated thus: currently, 57 nations partake in PISA. In what sense is it meaningful to compare them in the form of a ladder of national results? Theoretically, how can one find a legitimate basis for comparing different nations? Closely related to this is the assumption – one which PISA is not alone in relying on – that pupils' results can be an indicator of school quality, which in turn can be proof of the quality of national educational systems. A third element in the discussion of the validity of PISA is the issue of inference: can one assess a school system on the basis of the scores of individual students?

The business model of PISA. The third issue I wish to focus on is PISA as a sociological event. The impact of PISA lies not only in how it changes the lives of pupils and teachers or makes educational policymaking change its priorities, but also in how it shapes the way we think about education, about school quality and about what aims the educational policies of a nation should fulfil. The traditional actors in this field are politicians, who can be held accountable for their views. What kind of actor is PISA? Researchers claim that there is another agenda in which PISA is a prominent actor – and the final discussion of this paper will look at the legitimacy of PISA within such a broader horizon.

Why PISA

On the European scene, two providers of international comparative knowledge tests are dominant: the IEA and the OECD.

The IEA (http://www.iea.nl/), or International Association for the Evaluation of Educational Achievement, is a foundation owned by member states and organisations, currently numbering 62, with another 20 non-member states partaking in various activities. Its most popular product is TIMSS (Trends in International Mathematics and Science Study), which currently (TIMSS 2007) is used in more than 60 countries, of which more than 20 are European. TIMSS aims to measure mastery of the curriculum provided. PIRLS (Progress in International Reading Literacy Study) aims at measuring reading literacy; 41 countries currently participate, among them 23 European. Where TIMSS is run in four-year cycles, PIRLS is run in five-year cycles. A third study is SITES (Second Information Technology in Education Study), which in 2006 was run in 20 states, half of which were European. ICCS (International Civics and Citizenship Education Study) currently has 40 participants, 25 of which are European. The fifth and last product announced on the IEA homepage is concerned with teacher education and development in mathematics (TEDS-M). This is a new test whose first results are being published in 2007; fourteen countries participate in it, of which 6 are European.

The OECD has 30 member states, accounting for about 90 % of the world's GNP and exerting a huge influence as 'the club of the world's rich countries'. Its efforts and influence in education are increasing (Jacobi 2006), partly due to the success of the definitive market leader, PISA. In addition to PISA, its tools of influence include the annual publication of education statistics (Education at a Glance). The OECD also runs a "think-tank" related to education, CERI – the Centre for Educational Research and Innovation. The OECD also manages another international knowledge test, ALL (Adult Literacy and Life Skills Survey, http://nces.ed.gov/surveys/all/), with the precursors SIALS and IALS. These are large-scale tests with the purpose of charting the adult population's skills in reading and mathematics.

PISA is a programme of assessment in the sense that it is carried out every third year with a differing focus on the three main areas. In 2000, 41 countries participated in the study, of which 25 were European; in 2006, 57 countries took part, of which 31 were European. For each round of testing, the OECD publishes results comparing countries in the form of a league table. PISA assesses 15-year-old students, as "this is normally close to the end of the initial period of basic schooling in which all young people follow a broadly common curriculum". PISA's aim is to measure literacy:

While OECD/PISA does assess students' knowledge, it also examines their ability to reflect, and to apply their knowledge and experience to real-world issues. . . . The term "literacy" is used to encapsulate this broader conception of knowledge and skills. (OECD 2003, pp. 9-10)

This approach sets PISA apart from competitors such as TIMSS, which tries to measure the degree to which pupils master the knowledge transmitted through the national curricula. PISA can thus claim not to be constrained by national curricula.

The special focus means that this theme will "take up nearly two-thirds of the testing time" (OECD 2003, p. 13). As several authors reckon approximately two minutes per item, this means that the special focus area has about 40 questions and that the total runs to about 60 items. The universe of items is, however, much larger, and the items are organised in 14 different test booklets distributed in equal proportions across the sample.
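The arithmetic behind these figures is straightforward, assuming the two-hour (120-minute) sitting mentioned in the previous chapter:

\[
\tfrac{2}{3} \cdot 120\ \text{min} = 80\ \text{min} \;\Rightarrow\; \tfrac{80}{2} = 40\ \text{items in the focus domain}; \qquad \tfrac{120}{2} = 60\ \text{items in total}.
\]

Note that this counts only what one student sees in one sitting, not the much larger total item pool spread across the rotated booklets.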

Nearly all these tests have in common that, in addition to the test booklet, the students also answer a questionnaire charting the context of their education. In addition, several forms are answered by, e.g., teachers, principals and municipal authorities in order to allow generalisations about the context of education.

In each participating country, a national PISA office is set up, spending up to two years establishing the national sample, establishing processes for administering the tests, and functioning as local quality assurance officers. It seems to be a general trait that these national offices also function as the chief PISA interpreters in their country, often undertaking not only publication but also additional research in order to enlarge the PISA impact.

The reliability issue

Starting from a textbook definition, this issue concerns how random errors can influence results. One should differentiate between random and systematic errors in all evaluations of measurement; the latter concern the issue of validity. The importance of reliability is that it constitutes a precondition for validity. Metaphorically, this can be illustrated by how noise is able to destroy a musical experience: how PISA measures is a precondition for discussing what it claims to have found.

Random variation among 450,000-500,000 15-year-olds can come from innumerable sources, and it is an important discussion in itself to assess which differences should be controlled for and which not. Neither the research community nor PISA has undertaken any systematic discussion of how real-world differences of such magnitude can be controlled for – even though such a theory is fundamental for explaining differences in scores. One example: so far, I have not found any mention of the influence of the substantial variations in the number of instructional hours 15-year-old pupils will have received.

What I have found in the survey of European research articles on PISA reliability are three arguments: two concerning sample quality and one concerning cultural bias in the items.

Of the two arguments relating to sample quality, one is a minor issue and concerns the representativity of the PISA sample. The argument runs that when many schools decline to partake and substitute schools are recruited, one cannot be sure that the properties of the substitute schools are equal to those of the schools that declined. Although a pertinent objection, this problem will probably disappear if PISA's influence keeps mounting.

However, I find the second argument to be rather an important one, as it concerns what hides under the PISA assertion that PISA tests the competence of 15-year-olds because “this is normally close to the end of the initial period of basic schooling in which all young people follow a broadly common curriculum”.2 The critics point to this being a simplification for two reasons: Whether a given grade actually contains pupils aged 15 is an empirical question, and there is every reason to believe that this will vary between countries. S. J. Prais, who first launched this argument, formulates it thus: “Perhaps most pupils were in classes for mainly 15-year-olds; others had repeated a class and – though aged 15 – were in classes for mainly 14-year-olds, others in classes for mainly 13-year-olds; and a few had ‘skipped’ and were in classes for 16-year-olds. Often (France, Germany, Switzerland . . . ) by the age of 15 hardly more than half of pupils may be in a class for pupils of that age.”3 (Prais 2004 p. 571)

Another aspect of the same fact is that when you compare 57 countries, some of these countries will not have all pupils aged 15 in class – some have already dropped out, for instance to work. The relevant issue for a discussion of reliability is whether those who drop out have the same academic proficiency as those who stay in school. Arguing the case for a better home background being decisive for academic achievement – a case so well researched that it would be trite to mention evidence – one may well assume that the pupils quitting before the age of 15 are unequal to the ones remaining in school. This leads to the conclusion that in a PISA context, some nations gain from the fact that as much as 60 % of their classmates have left school before the age of 15.

2 Actually, this is not completely precise. In the PISA technical report it says: “The 15-year-old official target was slightly adapted to better fit the age structure of most of the northern hemisphere countries. As the majority of the testing was planned to occur in April, the international target population was consequently defined as all students aged from 15 years and 3 (completed) months to 16 years and 2 (completed) months at the beginning of the assessment period.” (OECD: PISA 2000 technical report, page 46)

3 As an illustration, Bodin, referring to PISA 2000, states that “That leads to 59.1 % of the French students who took the test were in high school grades, at grade 10, or for a few of them at grade 11” (Bodin 2005 p. 4). Another illustration would be Norway, where about 95 % were in grade 9 (out of 10) at age 15, as they started school at age 7.



One researcher who has argued this is Wuttke (2006) in regard to PISA 2003. He starts by asking how representative PISA is, and his answer is that the representative aspect has far too little basis in actual numbers. This conclusion is based (as with Prais) on an appraisal of school attendance and the attendance of 15-year-olds. Wuttke points to Turkey, where the school attendance of 15-year-olds is a meagre 54 %, and to Mexico, where it is 58 %. Even within OECD countries this is a problem: Wuttke refers to Portugal, where 5 % of the sample left school between the time they were recruited and the time the test was administered. As one cannot assume that the drop-out rate is randomly distributed, he draws the conclusion that for PISA “it becomes a measure of success that the weaker pupils have dropped prematurely out of school” (Wuttke p. 105).

In addition <strong>to</strong> this, there is also the issue of the representativeness of the<br />

national samples. <strong>According</strong> <strong>to</strong> <strong>PISA</strong> procedure this is organised so that one<br />

first recruits schools, and then pupils within those schools. Under these circumstances<br />

is it vital <strong>to</strong> have a documentation of the relative size of the sample<br />

from each school. The importance of this record is that if such a list is present,<br />

one may adjust for unwanted differences; for example, if such a list opens the<br />

possibility for the statistical weighing of samples from particular schools for<br />

example because of small school size. On surveying the underlying material, 4<br />

Wuttke concludes that “This documentation is lacking from far <strong>to</strong>o many countries”<br />

<strong>–</strong> the examples he provides <strong>to</strong> illustrate this point is taken from Greece,<br />

where all pupils had <strong>to</strong> be given equal weight, as there was no information of<br />

school size at all, while the participation rate from Sweden was 102,5 %, from<br />

Toscana 107 % (ibid p. 106).<br />
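To see why this record matters, recall how design weights are built in two-stage sampling. The sketch below is a generic textbook construction with invented numbers, not PISA's actual weighting procedure (PISA draws schools with probability proportional to size); its only point is that school-size information enters the weight directly, so without it no weighting is possible.

def design_weight(n_schools, n_sampled_schools, school_size, n_sampled_students):
    # Inverse inclusion probability of one student under the simplest
    # two-stage design: schools drawn with equal probability, then a
    # fixed number of students drawn within each sampled school.
    p_school = n_sampled_schools / n_schools
    p_student = n_sampled_students / school_size  # impossible without school size
    return 1.0 / (p_school * p_student)

# A sampled student from a small school stands for about 8 pupils ...
print(design_weight(1000, 150, school_size=40, n_sampled_students=35))   # ~7.6
# ... while one from a large school stands for about 76:
print(design_weight(1000, 150, school_size=400, n_sampled_students=35))  # ~76.2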

Another set of criticisms relating to sample quality has to do with students whose results cannot be counted like those of other students, typically because they are handicapped. Wuttke asserts that the PISA report (OECD 2005 p. 183 ff) leaves the definition of handicaps up to the national committees. He refers to how the exemption rate within the OECD varies from 0.7 % for Turkey to 7.3 % for Spain and the US (OECD 2005 p. 169). In addition, Denmark, New Zealand and Canada transgress the 5 % limit (OECD 2005 p. 241 ff), but this has not had any consequence – data from all these countries are presented at face value (ibid p. 106). The conclusion that one is comparing apples and oranges is nigh at hand.

4 Wuttke refers to OECD 2005 p. 108 for an explanation of this, and he develops this argument in some detail.



Wording as a reliability issue

The reliability of PISA is affected by how the questions are worded, and the PISA technical report explains the logic of how this is handled and the thoroughness with which it has been looked into. Hemmingsen (2005) introduces a question not covered by PISA when she asks whether the wording lives up to its promises of measuring the life skills of pupils. Her assessment is that by and large it does not, and it cannot do so. Using some of the PISA items (Going hand in hand/Growing up/Semmelweis) as examples, she demonstrates how PISA items are constructed just like “school-test” items, how their difficulty is affected by their wording, and how they relate to contexts that cannot be similarly known to all students. The proof of the pudding, as it were, for Hemmingsen is that PISA rewards being “test wise” just as ordinary tests do (Hemmingsen 2005 p. 41).

The importance of these objections is not that they represent any neglect on the part of PISA. Rather the contrary – PISA has done more than any other similar undertaking in trying to establish a discussion on how reliability issues can be met. The importance of these objections is rather that they raise the issue of how fruitful the ambition of attempting to control for real-world differences on a multinational scale can be. In fact, a point of criticism might be that without a theory of which differences can be accounted for and how such control can be established, the undertaking of comparing the complex realities of 57 nations along one scale will appear high-handed.

Summing up the methodological objections raised by researchers, it appears that a methodological discussion should be encouraged to a larger degree, and that PISA itself has a central role in establishing such a discussion. It is not only legitimate but even vital that research should try to influence public debate on the quality of education, but one must not transgress the limits granted by one's tools. This is particularly so if the agenda is to contribute to greater accountability in education. The verdict of the research community raises grave questions as to whether PISA transgresses such limits.

The validity issue

From a textbook definition, the issue of validity is an issue of inference quality.5 What is the basis of the conclusions drawn? In a PISA context, one relevant definition of validity is put this way: “A total judgment rests on an holistic assessment of whether the empirical evidence and the theoretical framework form a sufficient basis to justify the actions and the consequences that are drawn from the test scores” (Jablonka 2006, p. 157). This definition goes to the core of what quality in a test like PISA is about: Does its impact rest on a solid basis, of both theory and data? It is only from this perspective that the lack of reliability finds its true importance. The issue of validity goes beyond systematic errors in the sense that errors can also accrue from a lack of theory, as well as from the quality of cohesion between theory and data.

5 Shadish, Cook and Campbell use this definition: “We use the term validity to refer to the approximate truth of an inference . . . Validity is a property of inferences.” (2002 p. 34)

The main issues of validity which have been raised by European researchers can be summed up in three arguments: the issue of cultural bias, the issue of scaling, and the issue of how to interpret PISA scores. As the issue of scaling is covered comprehensively elsewhere in this volume, I will focus on the other two issues.

Cultural bias as a validity issue

This argument was heard in the reactions to PISA 2000 from Italian (Nardi 2002), Swiss (Bain 2003) and French (Bodin 2005) sources, as well as from German ones relating to PISA 2003. The main argument challenges whether the real-world ambition of PISA refers to a world shared by all, and even the concept of “real world” competence is argued to be an Anglophone preference. This is a validity issue in two respects. First, it argues that pupils from different countries will have systematically different chances to perform equally well. Secondly, it argues that the more successful PISA is, the less will one be able to see cultural differences as an asset and diversity as a tool for improvement. This last argument raises the issue of whether the one-dimensional scale of a sum score is a valid standard for comparing nations – will it prove legitimate when the needs of globalisation put new changes on the agenda?

Some of these arguments can be contested. When Nardi (2002) doubts whether the methodology is correct when four out of the six best countries were Anglophone (the exceptions being Korea and Finland), or when Jablonka (2006) finds that out of a total of 54 questions in mathematics in PISA 2003, 13 come from Holland, 15 from Australia and 7 from Canada, while the rest stem from 9 different countries (ibid p. 167), one can still argue from PISA that when they can document that the questions are equally well understood everywhere, this argument is effectively controlled for.6

6 The problem resurfaces within PISA's context as the question of why some countries have a weak item statistic. This is reported as affecting 12 countries (the Basque Country, Brazil, Indonesia, Japan, Macao-China, Mexico, Thailand and Tunisia, as well as, to a lesser extent, Hong Kong-China, Serbia and Turkey). The explanations offered by PISA (the items may have discriminated differently in different countries, there may be concern about linguistic and cultural equivalence, or one simply has not recruited translators well enough equipped for the job) actually strengthen the argument that cultural bias is and must be present in such tests. (PISA 2003 Technical Report p. 79)



The issue of cultural differences is raised in a more radical sense by Bodin, whose argument concerns the quality of teaching as a precondition for PISA performance. He observes that “the differences are more important in favour of the Finnish students for the more ‘realistic’ items, and that the difference tends to turn in favour of the French students for more abstract or formal questions” (Bodin 2005 p. 8). He uses the bookshelves question to lament: “This question, along with many others, points to the weak stress given by PISA to the proof undertakings. Even explaining and justifying are not much valued by the PISA marking scheme. That makes a big difference with the . . . French conception of mathematical achievement” (2005 p. 12). Here Bodin illustrates how culture defines relevance: as a French mathematician, he is proud of his country's tradition in mathematics.

On the contrary I think, and all the work of the so-called French didactics school has helped me, that ruptures are necessary and constitutive to learning. So we may fear that putting too much stress on real life and actual situations may in return have some negative effects (2005, p. 13).

Would Europe become intellectually richer if this pride vanished in the face of PISA results? Is it not rather the opposite: that around the next corner, the French style in mathematical reasoning might not only be vindicated, but prove an asset to all? Cultural diversity makes for sustainable development. The choice of an approach that ends up treating cultural diversity as a measurement error makes the very undertaking of comparison repressive.

A special case of this argument relates to the PISA strategy for measuring reading skills.

Bain (2003) queries whether the conceptual framework for the reading tests is adequate, and also in what respect PISA can improve teaching. The argument Bain raises is whether reading is such a complex skill that it cannot be validly tested within the restrictions of the PISA test format. His criticism of the PISA conceptual framework is firstly linked to the fact that, at the time (PISA 2000), he finds the theory used for understanding reading literacy too empirical: “the test given can not verify the validity of a model but relies on a model to emerge from the facts” (ibid p. 61). This situation is aggravated by the fact that it is a restricted understanding of reading skills that is tested, in the sense that “of course one may agree that the pupils read texts that are about situations, but the situation they are read in is a typical school-situation”. This disfavours the weak pupils; only the clever pupils will in this situation be able to recreate the use intended for the text by the author (ibid p. 64). Bain proceeds to argue that the mastery of different genres cannot be tested within the restrictive test format (ibid p. 66).

Interpreting PISA results – dressing up the will to govern

An important validity issue is the choice of PISA to develop the results into an “international league table”, thus opening for the comparison of national results and explicitly discussing how these results can be improved. This is in PISA linked to the organisation and prioritisation of research focus areas, discussed in chapter 3 of the technical framework.

These focus areas are based on the OECD education indicators (INES) and organise data along two dimensions: firstly, data are interpreted by the level of the education system they originate from, and secondly, the indicators PISA produces are seen as outcomes or outputs, contexts or constraints. There are four levels to which the resulting indicators relate. These are specified thus: “The education system as a whole, the educational institutions and providers of educational service, the instructional setting and the learning environment within the institution and the individuals participating in the learning activities”. Each of these levels is studied in three aspects: with respect to “outputs and outcomes of education and learning”, “policy levers and contexts” (circumstances that shape the outputs and outcomes at each level) and “antecedents and constraints” (factors that define or constrain policy) (OECD 2003 technical report p. 35). Organised into a matrix, this gives a 12-cell matrix of PISA focus areas, where educational outputs can be identified on four levels, and which also specifies policy levers and contexts as well as antecedents and constraints. Such a matrix is presented in the technical framework, specifying which variables are used to produce the indicators presented in each cell.

The way the PISA focus areas are organised raises the issue of how one can interpret data from the “international league table”, and one question in particular is important: Does this framework lead PISA to offer opinions on educational matters beyond those for which it has an adequate basis?



PISA itself acknowledges a first objection to this framework: the problem of recursivity and complexity among levels. The PISA example is how at the classroom level the relation between student achievement and class size is negative, while at the class or school level the relation is positive – the explanation being that students are often intentionally grouped so that the weaker students are placed in smaller classes. PISA sees this as an example that “a differentiation between levels is not only important with regard to the collection of information, but also because many features of the education system play out quite differently at different levels of the system” (ibid p. 35). This is a fact that really should be underlined, and it can be shown to be present in many aspects of the educational system. One example often used in the literature on causality in education is the issue of the interplay between teacher and class. Carroll's (1963) model of class learning as a function of student level and time spent proved overly simplistic by not being able to give room for the interplay between class and teacher: an equal amount of time and teacher effort will produce widely different results as the students' attitude to learning differs.7

7 In fact, it is almost 30 years since Cronbach suggested that most causal links in education might preferably be understood as interactions – relations linked in such a way that causality can only be understood as probable and where the direction of causality might change.

That the interpretative matrix is riddled with difficulties is also illustrated by what PISA terms “antecedents and constraints”, which are described as follows: “Policy levers and contexts typically have antecedents, that is, factors that define or constrain policy. These are usually specific for a given level of the educational system, and antecedents at a lower level may well be policy levers at a higher level (e.g. for teachers and students in a school, teacher qualifications are a given constant, while at the level of the education system, professional development of teachers is a key policy lever)” (ibid p. 35). How is this a validity problem? What PISA does not say, but should have advised, is that the cultural traditions of different countries, which often constitute contexts as well as constraints, cannot be discussed out of context. Particularly when data are aggregated, it is of the utmost importance to present them contextualised. James S. Coleman (1990) argues that it is in principle logically invalid to deduce principles of government and management from aggregated macro-level data if such data lack substantial contextualisation. If this is omitted, two problems arise: The first is that one lacks “reality checks” and is led to opine beyond the realistic. The second, which can be seen as a correlate to this, is that the interpretation of results becomes more difficult (e.g. as of today, no real explanation for Finnish excellence in PISA has been presented).

A third argument – and once again the real-world differences crop up – must also be mentioned here, concerning the levels of the educational system in which PISA organises data. Two of the PISA levels are intuitively understandable: the student level and the level of the classroom; that is, where education as a social interaction occurs. The two last levels – the educational institutions and providers of educational services, and the education system as a whole – seem to be introduced in order to be able to differentiate between nations where schools are run by the government (and where the institution owner is the state) and nations where schools are run by a number of organisers (churches, NGOs, local communities). It is only for such settings that a differentiation between institution owner and system as a whole is appropriate. Two questions must be asked: Is such a way of organising the levels appropriate, and in the PISA case, is it used sensibly?

The first question has been addressed in a recent paper (Afzar 2007). She insists that the PISA framework is based on a theoretical assumption about a linear administrative chain of steering. This chain runs from the political level via the political body of the school owner, through the instructional setting organised within each school, to individual learning. In addition, she argues, behind the principle of aggregating data lies an action-theoretical approach reminiscent of methodological individualism. Afzar rejects the notion of a linear chain, and argues that when one tries to grasp education as a system, one must use an approach that is legitimised theoretically – a model which can also explain the complexity of the relations between the different levels of the educational system. She ventures to introduce one such model, whose importance in this context is that it allows for seeing the function of the school as being different for the individual student than for society at large. The Afzar model, interesting as it may be, is not of relevance here, but it serves to highlight in what respects the model used by PISA is legitimate.

Does PISA use its own framework sensibly? PISA 2003 neither studied teachers nor had intact classrooms as units of sampling. The level of instructional settings is therefore empty with regard to outputs and outcomes. What data it contains are data on students' learning – learning that happens in different instructional settings, settings which PISA does not explain. A similar argument is applicable to the institutional level, where data are either aggregated from the individual level or “synthesised” across institutional settings, as these are not identified as such.

At the systems level, PISA relies only on aggregates. In particular, the PISA systems outcome consists not only of aggregated individual data; the policy levers and contexts are also filled from system-level aggregates (notably from the instructional-setting level), and the same goes for the antecedents and constraints column at the systems level, albeit at these levels unspecified OECD data are also indicated as source material. The question is: Can student achievement be aggregated through such levels and still contribute to a meaningful interpretation of “national performance”?

PISA Inc.

The title of this argument is taken from a German book on PISA (Jahnke et al. 2006), and it concerns the nature of PISA as an enterprise. Here only two arguments will be discussed. The first argument is about whether PISA itself, as a tool for accountability, appears transparent: The aim of the international league table is to influence educational policies. Does PISA aim to handle this influence in a democratic way, or is PISA just another brand-building enterprise whose aim is to exert as much influence as possible, not caring to give an account of whether this influence is justified or not?

This argument has been explicitly addressed by Howie & Plomp (2005), who argue that even if it is intended that PISA will have policy implications, no study has been undertaken to systematically chart how PISA affects educational policies. They refer to Kellahan (1996) as stating that most accounts of the use of findings appear to be “limited and impressionistic”, that “detailed analyses are not available” and that “the way policy makers arrive at their conclusions is also little known”. Howie & Plomp concur with this and add that, albeit nearly a decade later, this still appears to be the case. This may be due in part to the difficulty of researchers gaining access to the policymakers' realm as well as to a lack of funding for impact studies. “In fact whilst governments are prepared often to fund data collection and the initial descriptive reports, little funding is offered for secondary analyses of the same data let alone an impact study of the release of such a rich source of data nationally or internationally” (2005 p. 93).

The second argument is that the allegations of lacking transparency seem to hold true even when looking inside how PISA is organised. Mogens Niss, a Danish member of the PISA expert group in mathematics, touches on this in an interview given in 2005: He does not agree that “leading experts in the field is a guarantee for PISA quality”, the reason being that “There is no one within PISA who keeps tabs on things. It is like the Internet: There is no all-controlling central brain in PISA”. He describes the development of mathematical literacy thus: “The expert group should clarify a description of the frame”, a job that was organised as an analytical developmental process, not a research process, and whose results were shaped also by the PISA governing board, which is constituted by officials from ministries in the involved countries. The PISA questions are shaped by expert groups, the OECD secretariat, the governing board and the international consortium together, and Professor Niss concludes that PISA is not a clear-cut object; it is a mixture of research and development work, influenced by needs for comparison and politics.

What Niss does not mention is that PISA is developed largely by enterprises either dependent on earning part of their income in the market (the Australian Council of Educational Research or the Educational Testing Service (USA)) or living wholly off the market, as exemplified by Westat (a US company) or Citogroep (a Dutch company).8 The problem with such an approach to organising is that when companies who have a vested interest in the success of PISA are to advise governments on educational policy, one cannot know whether the advice is biased or not.9

It may be relevant <strong>to</strong> mention that costs of <strong>PISA</strong> participation are not easily<br />

come by. For most of the IEA tests, however, the price is USD 30,000 per<br />

country per year, or USD 120,000 for a full cycle of a four <strong>–</strong> year annual<br />

repetition. Most of these tests are, however, sponsored by Ford, which is how<br />

8 A complete listing is available in the <strong>PISA</strong> 2003 Technical Framework appendix 2.<br />

9 This ambiguity is apparent in the chapter on data abjudication in the 2003 Technical Framework<br />

report. Using the USA as an example, this country not only did not meet the required<br />

school response rate (68,12 % after replacement), it also broke the <strong>PISA</strong> test timing window<br />

and had a <strong>to</strong>o high overall exclusion rate (7,28 %). After an evaluation (where no sources<br />

are given), it is concluded that the US data will be included in the full range of <strong>PISA</strong> reports.<br />

Another country plagued with grave problems is the United Kingdom, where the technical<br />

report concludes that “The uncertainty surrounding the sample and its bias are such that<br />

<strong>PISA</strong> 2003 scores for the UK cannot be reliably compared with other countries”, or with<br />

<strong>PISA</strong> 2000. The conclusion is still that all international averages and aggregate statistics include<br />

the data from the UK. There are apparent anomalies, such as those found in Mexico,<br />

where only 58 % of the classmates are in school, and the coverage of the national 15-yearold<br />

population was at only 49 %, or Spain, where the pupil exclusion rate was about 50 %<br />

above <strong>PISA</strong> standards, or Turkey, where the coverage of 15-year-olds was at 36 % <strong>–</strong> they are<br />

all included in the full range of <strong>PISA</strong> 2003 reports. Is this because it is scientifically sound<br />

or is it because another ruling would be bad for <strong>PISA</strong> Inc.?


It must be fair to conclude that PISA has huge unresolved issues concerning the way it is used. There seems to be an imbalance between the tools created and the eagerness to influence politics. In the long run this is detrimental to the very issue PISA seeks to promote: a sensible approach to measuring the human capital generated in the member countries.

Particular notice should be paid to PISA's relation to private enterprise, so that one does not produce a capacity for test-making which goes far beyond how such tests can be sensibly used.

References

Afzar, A. (2007). A systems theoretical critique of international comparisons. Paper presented at the 2007 AERA Convention.

Bain, D. (2003). PISA et la lecture: Un point de vue didacticien. In Schweizerische Zeitschrift für Bildungswissenschaften, vol. 25, 2003, no. 1, p. 59-78.

Bender, P. (2006). Was sagen uns PISA & Co, wenn wir uns auf sie einlassen? In Jahnke, T. and Meyerhöfer, W. (2006), PISA & Co. Kritik eines Programms. Hildesheim: Verlag Franzbecker.

Bodin, A. (2005). What does PISA really assess? What it doesn't. A French view. Report prepared for the joint Finnish-French conference “Teaching mathematics: Beyond the PISA survey”, Paris.

Folkeskolen 8.4.2005: “PISA – Der er ingen der har styr på det hele. Et sammensurium af forskning, test og politik, siger Mogens Niss fra PISAs ekspertgruppe i matematikk”. http://www.folkeskolen.dk/objectShow.aspx?ObjectId=33661

Hemmingsen, I. (2005). Et kritisk blik på opgaverne i PISA med særlig vekt på matematikk. In MONA, vol. ?, 2005, no. 1, p. 24-43.

Howie, S. and Plomp, T. (2005). International comparative studies of education and large-scale change. In Bascia, N., Cumming, A., Datnow, A. and Leithwood, K. (2005), International Handbook of Educational Policy. Springer International Handbooks of Education. Dordrecht, Holland.

Jablonka, E. (2006). Mathematical literacy: Die Verflüchtigung eines ambitionierten Testkonstrukts in bedeutungslose PISA-Punkte. In Jahnke, T. and Meyerhöfer, W. (2006), PISA & Co. Kritik eines Programms. Hildesheim: Verlag Franzbecker.

Jahnke, T. and Meyerhöfer, W. (2006). PISA & Co. Kritik eines Programms. Hildesheim: Verlag Franzbecker.

Jahnke, T. (2006). Zur Ideologie von PISA & Co. In Jahnke, T. and Meyerhöfer, W. (2006), PISA & Co. Kritik eines Programms. Hildesheim: Verlag Franzbecker.

OECD (2003). The PISA 2003 Assessment Framework. Paris: OECD.

OECD (2002). School sampling preparation manual. PISA 2003 Main Study, version one, 2002.

OECD (2002). Programme for international student assessment sample. Task from the PISA 2000 assessment of reading, mathematical and scientific literacy.

OECD: PISA 2003 Technical Report.

Prais, S. J. (2003). Cautions on OECD's recent educational survey (PISA). Oxford Review of Education, vol. 29, no. 2, 2003, p. 139-163.

Prais, S. J. (2004). Cautions on OECD's recent educational survey (PISA): Rejoinder to OECD's response. Oxford Review of Education, vol. 30, no. 4, Dec. 2004.

Romainville, M. (2002). L'enquête O.C.D.E. sur les acquis des élèves en débat. In La Revue Nouvelle, vol. 115, 2002, no. 3-4, p. 84-108.

Shadish, W., Cook, Th. and Campbell, D. (2003). Experimental and quasi-experimental designs for causal inference. Boston: Houghton Mifflin Company.

The French Ministry of Education (2002). The meetings of Desco: “Evaluation of the knowledge and skills of 15-year-old pupils: Questions and hypotheses formulated following the OECD study”. Contains Gaudemar, J.-P. (2002), Opening of the conference debate; Crowne, S. (2002), The British case; Nardi, E. (2002), The Italian case; Koch, H. C. (2002), The German case; and Cytermann, J.-R. (2002), The French case.

Wuttke, J. (2006). Fehler, Verzerrungen, Unsicherheiten in der PISA-Auswertung. In Jahnke, T. and Meyerhöfer, W. (2006), PISA & Co. Kritik eines Programms. Hildesheim: Verlag Franzbecker.


Uncertainties and Bias in PISA

Joachim Wuttke

Germany: Forschungszentrum Jülich – Munich

This is a summary of a detailed report (>100 pages, >100 references) that has appeared in German (Wuttke 2007). It will be shown that PISA's statistical significance criteria are misleading, because several sources of systematic bias and uncertainty are quantitatively more important than the standard errors communicated in the official reports.

1 Introduction

1.1 A huge framework

PISA is a long-term project. Starting in 2000, assessments are carried out every three years. One and a half years are needed for data processing until an international report entitled “First Results” (FR00, FR03) appears, and it takes even longer until a Technical Report (TR00, TR03) is published and the raw data are made available for independent analysis. Therefore, although the third assessment was carried out in spring 2006, at present (summer 2007) only PISA 2000 and 2003 can be evaluated. In the following we will concentrate on data from PISA 2003.

PISA 2003 was carried out in 30 OECD countries and in some partner countries. As data from the latter were not used in the international calibration, they will be disregarded in the following. The United Kingdom (UK), which failed to meet several criteria required for participation, was excluded from tables in the official report. However, data from the UK were fully used in calibrating the international data set and in calculating OECD averages – an inconsistency that is left unexplained (TR03: 128, FR03: 31).


PISA rules required a minimum sample size of 4,500 students per country, except in very small countries (Iceland, Luxembourg), where all fifteen-year-old students were recruited. In several countries (Australia, Belgium, Canada, Italy, Mexico, Spain, Switzerland, UK), considerably larger samples of up to nearly 30,000 students (TR03: 168) were drawn, so that separate analyses for regions or linguistic communities became possible. For the comparison of the sixteen German Länder, an even larger sample of 44,580 students was tested (Prenzel et al. 2005: 392), of which, however, only 4,660 were contributed to the international sample (TR03: 168). The Kultusministerkonferenz, fearing unauthorised cross-Länder comparisons of school types, has imposed deletion of Länder codes from public-use data files. Therefore, the inner-German comparison will not be considered further.

The bulk of PISA data comes from a three-hour student testing session. Some more information is gathered from school principals. The testing session consists of a two-hour cognitive test and of a third hour devoted to questionnaires. The main questionnaire enquires about the students' social background, educational environment, and learning habits. The questionnaire responses certainly constitute a valuable resource for studying the living and learning conditions of fifteen-year-olds in large parts of the world, even though participation rate gradients introduce some bias.

Compared <strong>to</strong> the rich empirical material obtained from the questionnaires,<br />

the outcome of the cognitive test is meagre: the official data analysis reduces<br />

it <strong>to</strong> just four scores per student, interpreted as “competences” in specific subject<br />

domains (reading, mathematics, science, problem-solving). Nevertheless,<br />

these results are at the origin of <strong>PISA</strong>’s political impact; communicated as<br />

“league tables” of national mean values, they made <strong>PISA</strong> known <strong>to</strong> the general<br />

public, causing an outright “shock” in some countries.<br />

While controversy erupted about possible causes of results perceived as<br />

unsatisfac<strong>to</strong>ry, the three-digit precision of the underlying data has rarely been<br />

questioned. This will be done in the present paper. The accuracy and validity<br />

of cognitive test results are <strong>to</strong> be reviewed from a statistical point of view.<br />

1.2 A surprisingly simple measure of competence

As a first step of data reduction, student responses are digitally coded. The Technical Report discusses inter-coder and inter-country variance at length (TR03: 218-232); the conclusion that non-uniform coding is an important source of bias and uncertainty is left up to the reader.



Some codes are kept secret because national authorities want to prevent certain analyses. In several multilingual countries the test language is kept secret. Except for such deletions, the international raw data set is available for downloading on the website of the OECD's main contractor ACER (Australian Council for Educational Research).

On the lowest level of data aggregation, single item response statistics (percentages of correct, incorrect, and invalid responses to one cognitive test item) can be generated. In the international report not even one such statistic is shown. PISA is decidedly not a study in Fachdidaktik (math education, science education, etc.). PISA does not aim at gathering information about the understanding of scientific concepts or the mastery of specific mathematical techniques. The data provide almost no handle to understand why students give incorrect responses. Only Luxembourg has scanned and published some student solutions to free-response items; these examples show that students sometimes just misunderstood what the item writer meant to ask.

PISA is designed to be analysed on a much coarser level. As anticipated above, cognitive test results are aggregated into just four “competence” values per student. The determination of these values is technically complicated because not all students worked on the same item set: thirteen different booklets were used, and in some countries some items turned out to be invalid because of misprints, translation errors, or other problems. This makes it necessary to establish an “item difficulty” scale prior to the quantification of student competences. For this calibration an elementary version of item response theory is used.

The importance of this theory tends to be overestimated by defenders and critics of PISA alike. Misunderstandings are also provoked by poor documentation in the official reports. For a functional understanding of what PISA measures, it is not important that different booklets were used, and it is plainly irrelevant that in some countries certain items were deleted. Glossing over these technicalities, pretending that all students were assigned the same item set, and ignoring the probabilistic aspect of item response theory, it becomes apparent what the competence values actually measure: no more and no less than the number of correct responses.

In the mathematics subtest of PISA 2003, a student with a competence of 500 (the OECD mean) has solved about 46 % of the items assigned to him. A competence of 400 (one standard deviation below the mean) corresponds to a correct-response rate of 23 %; 600 corresponds to 71 % (Wuttke 2007: Fig. 4). Within this span the relationship between competence value and correct-response percentage is nearly linear. The slope is about 4 competence points per 1 % of assigned items. This conversion gives the competence scale a much simpler meaning than the official reports allow one to suspect.
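As an illustration (mine, not the report's), the stated anchor point and slope give a one-line conversion rule. It reproduces the 600-point value exactly and is off by about two percentage points at 400, which is as close as a strictly linear rule can come to the slightly curved relation of Wuttke's Fig. 4:

def percent_correct(score):
    # Linear approximation from the text: about 4 competence points per
    # percentage point, anchored at 500 points = 46 % correct responses.
    # Roughly valid between 400 and 600 points.
    return 46.0 + (score - 500.0) / 4.0

for s in (400, 500, 600):
    print(s, percent_correct(s))  # 21.0, 46.0, 71.0 (text: 23 %, 46 %, 71 %)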

1.3 League Tables and Stochastic Uncertainties

Any analysis of PISA data aims at statistical statements about populations. For instance, an elementary analysis of the cognitive test yields results like the following: German students have a mean mathematics competence of 503; the standard deviation is 103; the standard error of the mean is 3.3, and the standard error of the standard deviation is 1.8 (Prenzel et al. 2004: 70). In order to make sense of such numbers, they need to be put into context. The PISA reports provide two kinds of interpretation guidance: verbal descriptions of “proficiency levels” give a rough idea of what competence differences of 60 or more points signify (see below), and comparisons between different populations insinuate that even differences of only a few points bear a message.

Since the assessment of competences within each of the four subject domains is strictly one-dimensional, any inter-population comparison implies a ranking. This explains the primordial role of league tables in PISA: they are not only a vehicle for gaining media attention, but are deeply rooted in the conception of the study (cf. Bottani/Vrignaud 2005). In the official reports almost all statistics are communicated in the form of country league tables. The ranks in these tables, especially low ranks (and every country has low ranks in some tables), are then easily turned into political messages. In this way PISA results can be interpreted without any understanding of what has actually been measured.

Of course, not all rank differences are statistically significant. This is duly noted in the official reports. For all statistics, standard errors are calculated. After processing these standard errors through a null-hypothesis testing machinery, some mean value differences are judged significant, while others are not. Complicated tables (FR03: 59, 71, 81, 88, 92, 281, 294) indicate which differences of competence means are significant and which are not. It turns out that in some cases 9 points are “sufficient to say with confidence that the higher performance by sampled students in one country holds for the entire population of enrolled 15-year-olds” (FR03: 93).

This accuracy is formidable when compared to the intra-country spread of test performances. The standard deviation of the competence distribution is 100 points in the OECD country average and not much smaller within single nations. This is an order of magnitude more than an inter-country difference of 9 points. Figure 1 illustrates the situation.

Figure 1: Two Gaussian distributions with mean values differing by 9 % of their standard deviation. Such a small difference between two populations is considered significant in PISA.
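One way to grasp how small such a difference is at the level of individuals (my illustration, not part of the original figure): for two Gaussians with common standard deviation sigma and mean difference delta, the probability that a randomly drawn member of the higher-scoring population outscores a randomly drawn member of the other is Phi(delta/(sigma*sqrt(2))).

from math import sqrt
from scipy.stats import norm

delta, sigma = 9.0, 100.0  # mean difference and common standard deviation
p = norm.cdf(delta / (sigma * sqrt(2)))
print(round(p, 3))         # 0.525 - barely better than a coin flip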

However, significant does not mean valid, let alone relevant. Statistical significance is achieved by nothing more than the law of large numbers. The standard errors on which the significance criteria are based account for only two specific sources of stochastic uncertainty: the student sampling and the item-response modeling of student behaviour. By testing more and more students on more and more items, these uncertainties can be made arbitrarily small. At some point, however, this effort becomes inefficient, because the validity of the study remains limited by non-stochastic sources of bias and uncertainty, which do not decrease with increasing sample size.

Before entering in<strong>to</strong> details, the likeliness of non-s<strong>to</strong>chastic bias will be<br />

made plausible by a simple estimate: To bring about a significant inter-country<br />

difference of 9 points, correct-response rates must differ by about 2 % of given<br />

responses. On average, a student is assigned 26 mathematics items. Hence, 9<br />

points correspond <strong>to</strong> no more than half a correct response per student. This<br />

suggests that little systematic error is needed <strong>to</strong> dis<strong>to</strong>rt test results far beyond<br />

their nominal standard errors.<br />
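Written out, the estimate simply combines the slope from Section 1.2 with the average item count (the 2 % follows the text's rounding):

\[ 9\ \text{points} \times \frac{1\ \%}{4\ \text{points}} \approx 2\ \%, \qquad 0.02 \times 26\ \text{items} \approx 0.5\ \text{correct responses per student}. \]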

In this paper, I will argue that PISA does indeed suffer from severe non-stochastic limitations, and that the large sample sizes are therefore uneconomic. Part 2 describes disparities in student sampling, Part 3 shows that the projection of cognitive test results onto a one-dimensional “competence” scale is neither technically convincing nor culturally fair, and Part 4 raises certain objections on the conceptual level.


2 Sampling disparities

In some countries it is clear from the outset that PISA cannot be representative (Sect. 2.1). But even in countries where school is obligatory beyond the age of fifteen, low participation rates are likely to introduce some bias. Several imperfections and inconsistencies of the international sample are well documented in the Technical Report. Participation rate requirements were not strict enough to prevent significant bias, and violations of predefined rules had no consequences.

2.1 Target population does not serve study objective

<strong>PISA</strong> claims <strong>to</strong> measure “outcomes of education systems in terms of student<br />

achievements”. This claim is not consistent with the choice of the target population,<br />

namely “15-year-olds enrolled full-time in educational institutions”. In<br />

some countries (Mexico, Turkey, several partner countries), enrollment is less<br />

than 60 %. Obviously, <strong>PISA</strong> says nothing about the outcome of the education<br />

system of these countries.<br />

On the other hand, in many countries school is obliga<strong>to</strong>ry beyond the age<br />

of 15. At fifteen, the ability of abstract reasoning is still in full development.<br />

<strong>PISA</strong> therefore systematically underestimates the abilities students have “near<br />

the end of compulsory schooling” (FR03: 3, 298; TR03: 46).<br />

2.2 Target population too loosely defined: unequal exclusions

Rules allowed countries to exclude up to 5 % of the target population: up to 0.5 % for organizational reasons and up to 4.5 % for intellectual or functional disabilities or limited language proficiency. Exclusions for intellectual disability depended on “the professional opinion of the school principal, or by other qualified staff” – a completely uncontrollable source of uncertainty. From the fine print in the Technical Report, it appears that some countries defined additional criteria: Denmark, Finland, Ireland, Poland, and Spain excluded students with dyslexia; Denmark also excluded students with dyscalculia; Luxembourg excluded recently immigrated students (TR03: 47, 65, 169, 183).

Actual student exclusion rates of the OECD countries varied from 0.7 % to 7.3 %. Canada, Denmark, New Zealand, Spain, and the USA exceeded the 5 % limit. Nevertheless, data from these countries were fully included in all analyses.


UNCERTAINTIES AND BIAS IN <strong>PISA</strong> 247<br />

For a first-order estimate of the impact caused by the unequal use of student exclusions, let us approximate the competence distribution in every single country by a Gaussian with standard deviation 100, and let us assume for a moment that countries exclude with perfect precision the least competent students. Under these assumptions, exclusion of the weakest 0.7 % increases the country’s mean by 2.0 points and reduces its standard deviation by 2.5 points, whereas exclusion of 7.3 % increases the mean by 15.0 and reduces the standard deviation by 12.8. Of course, exclusion criteria are only correlates of potential test achievement, and they are never applied with perfect precision. When a probabilistic cut-off, spread over a range of 100 points, is used to model soft exclusion criteria, the bias in the two countries’ competence mean difference is reduced to about half of the initial 13 points.
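
These figures follow from elementary properties of the truncated normal distribution and are easily checked; a minimal sketch (Python with scipy), assuming the idealised, perfectly selective exclusion described above:

    from scipy.stats import norm

    def exclusion_bias(p, sd=100.0):
        """Mean shift and remaining SD when the weakest fraction p of a
        Gaussian competence distribution with standard deviation sd is cut off."""
        a = norm.ppf(p)                  # truncation point in standard units
        lam = norm.pdf(a) / (1.0 - p)    # mean of the left-truncated normal
        var = 1.0 + a * lam - lam ** 2   # variance of the truncated normal
        return sd * lam, sd * var ** 0.5

    for p in (0.007, 0.073):
        shift, new_sd = exclusion_bias(p)
        print(f"excluding {p:.1%}: mean +{shift:.1f}, SD {new_sd:.1f}")
    # excluding 0.7%: mean +2.0, SD 97.5   (SD reduced by 2.5)
    # excluding 7.3%: mean +15.0, SD 87.2  (SD reduced by 12.8)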

In Germany much public attention has been drawn to the percentage of students in a so-called “risk group”, defined by test scores below an arbitrary threshold. International comparisons of such percentages are particularly unreliable, because they are extremely sensitive to non-uniform exclusion criteria.

2.3 On the fringe of the target population: unequal inclusion of learning-disabled students

The imprecision of exclusion criteria and the resulting bias are further illustrated by the unequal inclusion of students with learning disabilities. Seven countries cater to them in special schools. In these schools the cognitive test was abridged to one hour, and a special booklet with a selection of easy items was used. In all other countries student exclusions were decided case by case; but even in countries that used the special booklets, some learning-disabled students could be individually excluded (cf. Prais 2003: 149, 158).

The extent to which students were either excluded from the test or given the short booklet varies widely among the seven countries. In Austria, 1.6 % of the target population were completely excluded, and 0.9 % of the participating students got the short test. In Hungary, 3.9 % were excluded, and 6.1 % did the short test. Given this discrepancy, it is hardly surprising that Hungarian students who did the short test achieved nearly 200 points more than Austrians.

For another rough estimate of the quantitative impact of unclear exclusion criteria, one can recalculate national means without short tests. If all short tests were excluded from the PISA sample, the mean reading score of Belgium, Denmark, and Germany would increase by more than 7 points; notably, Belgium (1.5 % exclusions, 3.0 % short tests) would even remain within the 5 % limit (TR03: 169). A bias of the order of 7 points is in perfect accord with the estimate from the previous section.

2.4 Sampling problems: inconsistent input

The sampling is technically difficult. Many governments lack consistent databases. Sometimes, this leads to bewildering inconsistencies: In Sweden, 102.5 % of all 15-year-olds are reported to be enrolled in an educational institution; in the Italian region of Tuscany, 107.7 %; in the USA, in spite of a strong homeschooling movement, 100.000 % (TR03: 168, 183).

The sample is drawn in two stages: schools within strata (regions and/or school types), and students within schools. As a consequence of this stratification and of unequal participation rates, not all students are equally representative of the target population. To correct for this, students are assigned statistical weights composed of several factors. The recommended way to calculate these weights is so difficult that international rules foresee three replacement procedures. In Greece, none of the four procedures worked, so that a uniform student weight had to be used (TR03: 52).

2.5 Sampling problems: inconsistent output

In the Austrian sample of PISA 2000, students from vocational schools were underrepresented. As a consequence, average student competences were overestimated, and other statistics were distorted as well. The error was only searched for and found three years later, when the disappointing outcome of PISA 2003 induced the government (which had changed in the meantime) to order an investigation (Neuwirth et al. 2006).

In South Tyrol, a change of government is not in sight, and therefore nobody seems interested in verifying accusations that the excellent PISA results of this region are largely due to the underrepresentation of students from vocational schools (Putz 2006).

In South Korea, only 40.5 % of PISA participants are girls. In the 1980s, due to selective abortion and possibly to hepatitis B, the sex ratio at birth in South Korea had attained a historic low of 47 %, perhaps even 46 %. But even when this is taken into account, girls are still severely underrepresented in the PISA sample. According to the Technical Report, this cannot be explained by unequal enrollment or test compliance: The reported enrollment rate is 99.94 %, the school participation rate 100 %, and the student participation rate 98.81 %. Either these numbers are wrong, or the sampling scheme was inappropriate. This conclusion is also supported by an anomalous distribution of birth months.

2.6 Insufficient response rates

Rules required a school response rate of 85 %, within-school student response rates of 25 %, and a country-wide student response rate of 80 % (TR03: 48-50). The United Kingdom breached more than one criterion, which led to its superficial disqualification. Canada profited from a strange rule according to which initial response rates between 65 % and 85 % could be cured by negotiation if the 85 % quorum was not even reached after calling replacement schools (TR03: 238). With 64.9 %, the USA missed the non-negotiable initial condition, though by a narrow margin, and the response from replacement schools was overwhelmingly negative, bringing the participation rate to no more than 68.1 %. Nevertheless, US data were fully included in all analyses (note: the USA contributes 25 % of the OECD’s budget).

Non-response can cause considerable bias because the propensity of school principals and students to partake in the testing is likely to be correlated with the potential outcome. Quantitative estimates are difficult because the international database contains no information whatsoever about those who refused the test. Nevertheless, there is ample indirect evidence that the correlation is quite high. To cite just one example: In Germany schools with a student response of 100 % had a mean math score of 553. Schools with participation below 90 % achieved only 476 points. Even if the latter number is subject to some uncertainty (discussed at length in Wuttke 2007), the strong correlation between student ability and test compliance is beyond any doubt.

In the official analysis, statistical weights provide a first-order correction for the between-school variation of response rates: When schools refuse to participate, the weight of other schools from the same stratum is increased accordingly. Similarly, in schools with low student response rates, the participating students are given higher weights.

However, these corrections do not cure within-school correlations between students’ latent abilities and their propensity to partake in the test. In the absence of data from absent students, the possible bias can only roughly be estimated: In some countries, the student response rate is more than 15 % lower than in others. Assuming very conservatively that the latent ability of the missing students is only half a standard deviation below the true national average, one finds that the absence of these students increases the measured national average by 8.8 points.
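
The arithmetic behind this estimate is worth spelling out. If a fraction f of the students is missing and their mean latent ability lies Δ points below the true national mean M, the observed mean of the participants satisfies M = (1 − f) M_obs + f (M − Δ), so that under the stated assumptions (f = 0.15, Δ = 50)

    \[
    M_{\mathrm{obs}} - M \;=\; \frac{f\,\Delta}{1-f}
      \;=\; \frac{0.15 \times 50}{0.85} \;\approx\; 8.8 .
    \]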

2.7 Gender-dependent response rates

In many countries, girls are overrepresented in the PISA sample. The discrepancy is largest in France, with 52.6 % girls in PISA against an estimated 48.9 % among 15-year-olds: Compared to the age cohort, the PISA sample has more than 7 % too many girls and more than 7 % too few boys. Insofar as this is due to different enrollment, it reinforces the argument of Sect. 2.1. Otherwise, the most likely explanation is a gender-dependent propensity to participate in the testing.

2.8 Doubts about data transmission: missing missing responses

Normally, some students do not respond to all questions of the background questionnaire. Moreover, some students leave between the cognitive test and the questionnaire session. In Poland, however, such missing data are missing: Not a single student responded to fewer than 25 questionnaire items, and there are 7 items that every single student answered. Unless this anomaly is explained otherwise, one must suspect that booklets with missing data have been suppressed.

3 Ignored dimensions of the cognitive test

PISA’s “competence” scale depends on the assumption that all items from one subject domain measure essentially one and the same latent ability. In reality, any test outcome is also influenced by factors that cannot be subsumed under a subject-specific competence. While there is no generally accepted way to indicate the degree of multi-dimensionality of a test (Hattie 1985), simple first-order estimates are sufficient to demonstrate its impact: Non-competence dimensions cause arbitrariness, uncertainty, and bias in PISA’s competence measure that are by no means negligible when compared to the purely stochastic official standard errors.

3.1 Elimination of disturbing items

The evidence for multidimensionality to be presented in the following sections is even more striking against the background that the cognitive items actually used in PISA have been preselected for unidimensionality: Submissions from participating countries were streamlined by “professional item writers”, reviewed by national “subject matter experts”, tested with students in think-aloud interviews, tested in a pre-pilot study in a few countries, tested in a field trial in most participant countries, rated by expert groups, and selected by the consortium (TR03: 20-30).

Only one-third of the items that had reached the field trial were finally used in the main test. Items that did not fit into the idea that competence can be measured in a culturally neutral way on a one-dimensional scale were simply eliminated. Field test results remain unpublished, although one could imagine an open-ended analysis providing valuable insight into the diversity of education outcomes. This adds to Olsen’s (2005a: 5) observation that in PISA-like studies the major portion of information is thrown away.

However, the strong preselection did not prevent seriously flawed items from being used in the main test: In the analysis of PISA 2000, the item “Continent Area Q1” had to be disqualified, in 2003 “Room Numbers Q1”. Furthermore, some items had to be disqualified in specific countries.

3.2 Unfounded models

In PISA a probabilistic psychological model is used to calibrate item difficulties and to estimate student competences. This model, named after Georg Rasch, is the most elementary incarnation of item response theory. It assumes that the probability of a correct response depends only on the difference between the student’s competence value and the item’s difficulty value. Mislevy (1993) calls this attempt to “explain problem-solving ability in terms of a single, continuous variable” a “caricature”, rooted in “19th century psychology”. The model does not even admit the possibility that some items are easier in one subpopulation than in another. The reason for its usage in PISA is neither theoretical nor empirical, but rather pragmatic: Only one-dimensional models yield unambiguous rankings.
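
For reference, the Rasch model can be stated in one line: the probability that student v solves item i is a logistic function of the difference between the student’s competence θ_v and the item’s difficulty β_i,

    \[
    P(X_{vi}=1 \mid \theta_v, \beta_i)
      \;=\; \frac{\exp(\theta_v - \beta_i)}{1 + \exp(\theta_v - \beta_i)} ,
    \]

so that all item characteristic curves are copies of one and the same curve, merely shifted horizontally.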

Taking the Rasch model literally, there is no way to estimate the competence of students who solved all items or none. To them the test has been too easy or too difficult, respectively. In PISA, this problem is circumvented by enhancing the probability of intermediate competences through a Bayesian prior, arbitrarily assumed to be a Gaussian. As distributions of psychometric measures are never Gaussian (Micceri 1989), this inappropriate prior causes bias in the competence estimates (Molenaar in Fischer/Molenaar 1995: 48), especially at extreme values (Woods/Thissen 2006). This further undermines statements about “risk groups” with particularly low competence values.

3.3 Failure of the Rasch model

Various mathematical criteria have been developed to assist in the decision whether or not the Rasch model reasonably approximates an empirical data set. It appears that only one of them has been used to check the outcome of the PISA main test: an unexplained “item infit mean square” (TR03: 123, 278).

A much more sensitive way to test the goodness of fit is a visual inspection of appropriate plots (Hambleton et al. 1991: 66). An “item characteristic” or “score curve” is a plot of correct-response percentages as a function of competence values, each data point representing a quantile of examinees. In the Technical Report (TR03: 127), one single item characteristic is shown – an atypical one that agrees rather well with the Rasch model.

According to the model, all item characteristics from one subject domain should have strictly the same shape; the only degree of freedom is a horizontal shift, driven by the model’s only item parameter, the difficulty. This is clearly inconsistent with the variety of shapes exhibited by the four item characteristics in Figure 2. Whereas “Water Q3b” discriminates quite well between more or less “competent” students, the other three items have deficiencies that cannot be described without additional parameters.

The characteristic of “Chair Lift Q1” has almost a plateau at low competence values. This is the typical signature of guessing. On the other hand, “Freezer Q1” saturates at less than 35 %. This indicates that many students did not find out the intention of the testers. Low discrimination strengths as in “South Rainea Q2” may have several reasons: different difficulties in different subpopulations, different difficulties for different solution strategies (cf. Meyerhöfer 2004), qualified guessing, weak correlation of the latent ability measured here and in the majority of this domain’s items.

Figure 2: Some item characteristics that show pronounced deviations from the Rasch model. Solid curves in (a) are fits with a two-parameter model that accounts for different discrimination. The four-parameter fits in (b) additionally model guessing and misunderstanding.

The solid lines in Fig. 2 show that satisfactory fits of the empirical data are possible when the Rasch model is extended by parameters that allow for variable discrimination strength, for guessing, and for misunderstanding. Such multi-parameter item-response models still contain a linear shift parameter that may be interpreted as the item difficulty. However, best-fit estimates of this parameter deviate by typically 30 points from the official Rasch difficulties (Wuttke 2007: Fig. 11). This model dependence of item difficulty estimates is not compatible with a one-dimensional ranking of items as is needed for the construction of “proficiency levels” (Sect. 4.1). Furthermore, as soon as one admits more than one item parameter, any student ranking becomes arbitrary because of the ad-hoc anchoring of the difficulty and competence scales.
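
A minimal sketch of such an extended curve family (Python; the parameter names a, b, c, d are the conventional ones of the four-parameter logistic model, not taken from this chapter) shows how discrimination, guessing, and misunderstanding enter; the Rasch curve is recovered for a = 1, c = 0, d = 1:

    import numpy as np

    def icc_4pl(theta, b, a=1.0, c=0.0, d=1.0):
        """Four-parameter logistic item characteristic curve.
        b: difficulty (horizontal shift), a: discrimination (slope),
        c: lower asymptote (guessing), d: upper asymptote < 1
        (misunderstanding: even very competent students often fail)."""
        return c + (d - c) / (1.0 + np.exp(-a * (theta - b)))

    theta = np.linspace(-3.0, 3.0, 7)
    print(icc_4pl(theta, b=0.0))                 # Rasch special case
    print(icc_4pl(theta, b=0.0, a=0.6, c=0.25))  # flat curve, guessing plateau
    print(icc_4pl(theta, b=0.0, d=0.35))         # saturates below 35 %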

The first data point of the characteristics of “South Rainea” and “Chair Lift” clearly lies below the fit curves: the weakest 4 % of participants perform worse than modeled. This may be due to a lack of cooperation: yet another dimension that is not contained in elementary item-response theory. It may also be due to the inappropriateness of the Gaussian population model.

3.4 Between-booklet variance

The use of different test booklets makes it possible to employ a total of 165 different items, though every single student works on no more than 60 of them. This reduces the dependence of test results on the arbitrary choice of items. At the same time, it allows us to get an idea of how strong this dependence actually is. Calculating mathematics competence means for groups of students who have worked on the same booklet, one finds inter-booklet standard deviations between 4 (Hungary) and 18 (Mexico) points. The largest difference occurs in the USA: Students who worked on booklet 2 were estimated to have a math competence of 444, whereas those who worked on booklet 10 achieved 512 points. Eliminating either booklet 2 or booklet 10 would respectively increase or decrease the overall national mean by about three points. This variance only reflects the arbitrariness in choosing items from a pool that is already quite homogeneous due to the procedures described above (Sect. 3.1). Cultural bias in the submission, selection, and adaptation of items may have a far stronger impact.
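
The underlying computation is straightforward; a sketch, assuming a per-country student table with hypothetical columns booklet and math (the official database actually stores several plausible values per domain, which a careful reanalysis would have to respect):

    import pandas as pd

    def booklet_spread(df: pd.DataFrame):
        """Mean math score per test booklet and the between-booklet
        standard deviation, for the students of one country."""
        means = df.groupby("booklet")["math"].mean()
        return means, means.std(ddof=1)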

3.5 Imputation with wrong normalisation

Each of the thirteen regular booklets consists of four blocks. Each item appears in four different blocks, in four different positions, in four different booklets. The major subject domain, mathematics, is covered by seven of the thirteen blocks; the other three domains are tested in two blocks each.

While all thirteen booklets contain at least one mathematics block, each minor domain appears only in seven booklets. Nevertheless, in the scaled data all students are attributed competence values in all four domains. If a student has not been tested in a domain, the competence estimate is based on both his questionnaire responses and his school’s average math achievement. Such an imputation, when done correctly, reduces the standard error of population means without introducing bias.

In PISA, however, it is not done correctly. Bias is introduced because the imputation is anchored in only one of the seven booklets for which real data are available. This bias is plainly admitted in the Technical Report (TR03: 211), though it is quantified only for Canada. The case of Greece is more extreme: The official science competence mean of 481 is 16 points above the average achievement of those students who were actually tested in science (Wuttke 2007: Sect. 3.10; cf. Neuwirth in Neuwirth et al. 2006: 53). This huge bias is certainly not justified by the benefits of imputation, which consist of a slight simplification of the secondary data structure and a reduction of stochastic standard errors by probably no more than 10 %.

3.6 Timing, tactics, fatigue

Since every item occurs in four different positions, one can easily investigate how response rates vary during the two-hour testing session: Per-block response rates, averaged across booklets over all items, can be directly compared to each other.

One finds that the average rates of non-reached items, of missing responses, and of incorrect responses systematically increase from block to block. The extent of this increase varies considerably between countries. The ratio of non-reached items in the fourth block is 1 % in the Netherlands, while in Mexico it is 25.3 %. In the Netherlands the ratio of items that were reached but not answered goes up from 2.5 % in the first block to 4.0 % in the fourth block; in Greece, from 11.1 % to 24.4 %. In Austria, the ratio of correct to given responses decreases from 56.2 % in the first block to 54.4 % in the fourth block; in Iceland, from 58.5 % to 53.1 %.

All these data indicate that students lack a sufficient amount of time in the last of the four blocks. This alone is a strong argument against the applicability of one-dimensional item response theory (Rost 2004: 43). The ways students react to the lack of time vary considerably between countries:

– Dutch students try to answer almost every item. Towards the end of the test, they become hasty and increasingly resort to guessing.

– Austrian and German students skip many items, and they do so from the first block on, which leaves them enough time to finish the test without greatly accelerating their pace.

– Greek students, in contrast, seem to be taken by surprise by the time pressure near the end. In the first block, their correct-response rate is better than in Portugal and not far from those of the USA and Italy. In the last block, however, non-reached items and missing responses add up to 35 %, bringing Greece down to one of the last ranks.

Aside from such extreme cases, it is hardly possible to disentangle the effects of test-taking tactics and fatigue.

3.7 Multiple responses to multiple-choice items

In PISA 2003, 42 of 165 items are in a simple multiple-choice format. For each of these items, four or five responses are proposed, of which exactly one is meant to be the correct one. This essential rule is not clearly explained to the examinees. In some countries, for some items, a considerable number of multiple responses are given. They are denoted by a special code in the international database, but they are subsequently counted as incorrect.

In many countries, including Australia, Canada, Japan, Mexico, the Netherlands, New Zealand, and the USA, the quota of multiple responses is close to 0 % (except for one particularly flawed item). In Austria, Germany, and Luxembourg, on the other hand, the fraction of multiple responses surpasses 4 % for at least eleven items, and it reaches up to 10 % for one of them.

Such a misunderstanding of the test format not only distorts the outcome of the directly concerned item. It also costs time: it requires more effort to decide four or five times whether or not a proposed answer is correct than to choose only one alternative. Those who are familiar with the multiple-choice format sometimes do not even need to read all distractors.


3.8 Testing cultural background

If one wants to understand what a test actually measures, one has to study the manifold reasons why students give incorrect responses (cf. Kohn 2000: 11). The few student solutions of open-ended items published by Luxembourg show how much information is lost when verbal or pictorial responses are digitally coded.

                 A        B        C        D
    Slovakia   3.1 %   46.1 %   17.5 %   33.3 %
    Sweden     3.1 %   46.2 %   37.0 %   13.7 %

Table 1: Percentages for the four possible responses of the multiple-choice item “Optician Q1”. Data are shown for two countries where almost the same percentage of students chooses the correct response B. However, preferences for the distractors C and D vary by about 20 %.

In contrast, in the digital coding of multiple-choice items, most information is preserved; the codes for formally valid but incorrect responses indicate which of the three distractors was chosen. Table 1 shows the response percentages for one item and two countries. In this example distractor preferences vary by about 20 %, although the correct-response percentage is almost the same. This demonstrates quantitatively that the reasons that induce students to give a specific incorrect answer can vary enormously from country to country.

It is fairly obvious that the offer of distractors also influences correct-response rates. Had distractor D been more in the spirit of C, it would have attracted additional responses in Sweden, whereas in Slovakia many students would have reoriented their choice towards B.

Between-country variance may be due to school curricula, cultural background, test language, or to a combination of several factors. These factors are particularly influential in PISA because students have little time (about 2'20" per item), and reading texts are too long. Sometimes the stimulus material even tricks students into misclues (Ruddock et al. 2006). In this situation, test-wise students try to solve items without actually reading the introductory texts. Such qualified guessing is of course highly dependent on extrinsic knowledge and therefore particularly susceptible to cultural bias.

The released reading unit “Flu” from PISA 2000 provides a nice example. The stimulus material is an information sheet about a flu vaccination. One of the items asks how the vaccination compares to alternative or complementary means of protection. Of course, students are not asked about their personal opinion; the answer is to be sought in the reading text. Nevertheless, the distractor preferences reflect French reliance on technology and German belief in nature.

3.9 Language-related problems

The language influences the test in several ways:

Translations are prone to errors. In PISA, a complicated scheme with double translation from English and French was foreseen to minimise such errors. However, in many cases, including the German-speaking countries, the French original was not taken seriously, and final versions were produced under extreme time pressure. There are clear-cut translation errors in the released sample items. In the unit entitled “Daylight”, the English word “hemisphere” was translated by the erudite “Hemisphäre” where German schoolbooks use the word “Erdhälfte”. In the unit “Farms”, “attic floor” was rendered as “Dachboden”, which just means “attic”. The fact that the Austrian version has the correct wording “Boden des Dachgeschosses”, although all German-speaking countries had shared the translation work, indicates that uncoordinated and unchecked last-minute modifications have been made.

Blum and Guérin-Pace (2000: 113) report that changing a question (“Quels taux . . . ?”) into a prompt (“Énumérez tous les taux . . . ”) can change the rate of correct responses by 31 %. This gives an idea of how much freedom translators have either to help or confuse (cf. Freudenthal 1975: 172; Olsen et al. 2001).

Under translation, texts tend to become longer, and some languages are more concise than others. In PISA 2000, a comparison of the English and French versions of 60 stimulus texts showed that the French texts contained on average 12 % more words and 19 % more letters (TR00: 64). Of course, reading time is not simply proportional to the number of words or letters. It seems nevertheless plausible that such a huge length difference induces an important bias.

3.10 Origin of test items

A majority of test items comes from English-speaking countries; the other items were translated into English before they were streamlined by “professional item writers”. If there is cultural bias, it is clearly in favour of the English-speaking countries. This makes it difficult to separate it from the translation bias, which acts in the same direction.

The quantitative importance of cultural and/or linguistic bias can be read off from the correlation of correct-response-percentage-per-item vectors, as has been shown by Zabulionis (2001, for TIMSS), Rocher (2003), Olsen (2005), and Wuttke (2007). Cluster analyses invariably show that student behaviour is most similar for countries that share both language and cultural heritage, such as Australia and New Zealand (correlation coefficient 0.98). If the languages differ, correlations are at best about 0.96, as for the Czech and Slovak Republics. If the languages do not belong to the same stem, correlations are hardly larger than 0.94. While some countries belong to large clusters, others like Japan and Korea are quite isolated (no correlation larger than 0.90). These results have immediate implications for the validity of inter-country comparisons: The weaker the correlation of response patterns, the more a comparison depends on the arbitrary choice of items.
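
A sketch of this type of analysis, assuming a matrix P of correct-response percentages with one row per country and one column per item (a hypothetical input layout; the cited studies differ in their details):

    import numpy as np

    def country_similarity(P, names):
        """Pairwise correlations of the per-item correct-response vectors;
        returns country pairs sorted from most to least similar."""
        R = np.corrcoef(P)           # one correlation per pair of countries
        pairs = [(names[i], names[j], R[i, j])
                 for i in range(len(names)) for j in range(i + 1, len(names))]
        return sorted(pairs, key=lambda t: -t[2])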

4 Interpreting cognitive test results

4.1 Proficiency levels

Verbal descriptions of “proficiency levels” are used to guide the interpretation of numeric results (FR03: 46-56). The boundaries of these levels are arbitrarily chosen; nevertheless, they are communicated with absurd four-digit precision. Starting at a competence of 358.3, there are six proficiency levels. The width of levels 1 to 5 is about 62.1; the semi-infinite level 6 starts at 668.7. Depending on how many students gave the right response, each item is assigned to one of these levels. Based on all items assigned to one level, a verbal synthesis is given of what students with corresponding competence values “can typically do”.

By construction, the student competence distribution is approximately Gaussian. The mean of 500 and the standard deviation of 100 are imposed by an explicit (though ill-documented) renormalisation. Therefore, the percentages of students in the different proficiency levels are almost constant.

To illustrate this point, let us perform a Gedanken experiment. If the percentage of correct responses given by a single student grows by 6 %, his competence value increases by about 30 points. Suppose now that the correct-response rate grows by 6 % for all students. In this case, the competence values assigned to the students will not increase, because any uniform change of competences is immediately reversed by the renormalisation to the predefined Gaussian. Instead, the item difficulty values would be lowered by about 30 points, so that about every second item would be relegated to the next lower proficiency level. Theoretically, this should then lead to a rephrasing of the proficiency level descriptions.
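
The decisive step of this Gedanken experiment is easily verified numerically; a minimal sketch (the official scaling procedure is more involved, but it shares this invariance under uniform shifts):

    import numpy as np

    def renormalise(theta):
        """Map latent values onto the reporting scale: mean 500, SD 100."""
        return 500.0 + 100.0 * (theta - theta.mean()) / theta.std()

    rng = np.random.default_rng(0)
    theta = rng.normal(size=100_000)    # latent competences, arbitrary units
    gain = theta + 0.3                  # every student improves uniformly
    print(np.allclose(renormalise(theta), renormalise(gain)))   # True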

However, these descriptions are highly systematic. They are so systematic that they could have been derived straight from Bloom’s forty-year-old taxonomy. They are far too systematic to appear like a summary of empirical results: One would expect that not every single item fits equally well into such a scheme, but the level descriptions do not show the least irritation. As Meyerhöfer (2004) has pointed out, the very idea of proficiency levels is not consistent with the fact that test items can be solved in quite different ways, depending for instance on curricular premises, on testwiseness, and on time pressure. Therefore, the most likely outcome of our Gedanken experiment seems to be that the official level descriptions would not change at all, so that the overall increase in student achievement would pass unnoticed – as has the misfit of the Rasch model and the resulting bias and uncertainty of about 30 difficulty points.

Another fundamental objection is the lack of transparency. The proficiency level descriptions are not open to scientific discussion unless the consortium publishes the instruments on which they are based and the proceedings of the hermeneutic sessions in which the descriptions have been worked out.

In the German reports, students in and below proficiency level 1 are called “the risk group”. This deviates from the international reports that speak of “risk” only in connection with students below level 1. It has become an urban legend in Germany that nearly one quarter of all fifteen-year-olds are almost functionally illiterate, although the original report clearly states that PISA does not bother to measure fluency of reading, which is taken for granted even on level 1 (FR00: 47-48). Furthermore, as has been stressed above, the percentage of students on or below level 1 is extremely sensitive to disparities in sampling and participation.

4.2 Is PISA an intelligence test?

PISA items from different domains are quite similar in style – and sometimes even in content: Reading items are based on nontextual stimulus material such as graphics or tables, and math or science items require a lot of reading. This is intentional insofar as it reflects a certain conception of “literacy”.

It is therefore unsurprising that competence values from different domains are highly correlated. A majority of per-country inter-domain correlations is stronger than 0.80.

In such a situation, the sensible thing to do is a principal component analysis. One finds that between 75 % (Greece) and 92 % (Netherlands) of the total variance of student competences can be attributed to just one component.
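
Such an analysis requires only a few lines; a sketch, assuming an array X with one row per student and one column per domain score (hypothetical stand-ins for the scaled data):

    import numpy as np

    def first_component_share(X):
        """Fraction of the total variance of the domain scores that is
        carried by the first principal component."""
        eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
        return eigvals[-1] / eigvals.sum()   # eigvalsh sorts ascending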

However, no such analysis has been published by the consortium, and when Rindermann (2006) did publish one, members of PISA Germany tried to dismiss and even to ridicule it. The ideological and strategic reasons for this opposition are obvious: Once it is found that PISA mainly measures one general factor per examinee, it is hard not to make a connection to the g factor of cognitive psychology. PISA members, who avoid the term “intelligence” throughout their writings, must see this as a sacrilege and a threat. The word is taboo in much of the pedagogical mainstream, and no government would spend millions to be informed about the intelligence of students.

4.3 Uncontrolled variables

PISA aims at monitoring “outcomes of education systems”. However, the education system is just one of many variables that influence the outcome of the cognitive test. As we have seen, sampling, exclusions, participation rates, test-taking habits, culture, and language are quantitatively important. Since all these variables are country-dependent, there is no way to separate them from the variable “education system”.

But even in the hypothetical case of a technically and culturally fair test, it would not be clear that differences in test outcome are due to differences in education systems. There are certainly country-dependent educational influences that are not part of what is generally understood by “education system”, such as the subtitled TV programs prevalent in small language communities. Furthermore, equating test achievement with the outcome of schooling is highly ideological in that it dismisses differences in genetic endowment, preschool education, and out-of-school environment.

The importance of extrinsic parameters becomes obvious when subpopulations are compared that share the same education system. One example is the two language communities in Finland. In the major domain of PISA 2000, reading, students in Finnish-speaking schools achieve 548 points, in Swedish-speaking schools only 513 – slightly less than Sweden’s national average of 516 (Wuttke 2007: Sect. 4.8). A national report (Brunell 2004) suggests that much of the difference between the two communities can be explained by two factors, namely by the language spoken at home and by the social, economic, and cultural background.

If student-dependent background variables have such a huge impact in an otherwise comparatively homogeneous country like Finland, they can even more severely distort international comparisons. As several authors have already noted, one of the most important background variables is the language spoken at home. Except in a few bilingual regions, a non-test language spoken at home is typically linked to immigration. The immigration status is accessible through the questionnaire, which asks for the country of birth of the student and his parents. Excluding first- and second-generation immigrant students from the national averages considerably reshuffles the country league tables: On top of the 2003 mathematics league table, Finland is replaced by the Netherlands and Belgium, and it is closely followed by Switzerland. The superiority of the Finnish school system, one of the most publicised “results” of PISA, vanishes as soon as one single background variable is controlled.

5 Conclusions

One line of defense offered by PISA proponents reads: PISA is state-of-the-art; at present nobody can do it better. This is probably true. If there were one outstanding source of bias, one could hope to improve PISA by fighting this specific problem. However, it rather appears that there is a plethora of inaccuracies of similar magnitude. Reducing a few of them will have very little effect on the overall uncertainty. Therefore, one has to live with the unsatisfactory state of the art and draw the right consequences.

Firstly, the outcome of PISA must be reassessed. The official significance criteria, based only on stochastic errors, are irrelevant and misleading. The accuracy of country rankings is largely overestimated. Statistics are particularly distorted if they depend on response rates among weak students; statements about “risk groups” are untenable.

Secondly, the large sample sizes of PISA are uneconomic. Since the accuracy of the study is determined by other factors, the effort currently invested in minimising stochastic errors is unjustified.

Thirdly, it is clear from the outset that little can be learned when something as complex as a school system is characterised by something as simple as the average number of solved test items.


References

Blum, A./Guérin-Pace, F. (2000): De Lettres et des Chiffres. Des tests d’intelligence à l’évaluation du “savoir lire”, un siècle de polémiques. Paris: Fayard.

Bottani, N./Vrignaud, P. (2005): La France et les évaluations internationales. Rapport établi à la demande du Haut Conseil de l’évaluation de l’école. http://lesrapports.ladocumentationfrancaise.fr/BRP/054000359/0000.pdf.

Brunell, V. (2004): Utmärkta PISA-resultat också i Svenskfinland. Pedagogiska Forskningsinstitutet, Jyväskylä Universitet. http://ktl.jyu.fi/pisa/Langt_pressmeddelande.pdf.

Fischer, G. H./Molenaar, I. W. (1995): Rasch Models. Foundations, Recent Developments, and Applications. New York: Springer.

Freudenthal, H. (1975): Pupils’ achievements internationally compared – the IEA. In: Educ. Stud. Math. 6, 127-186.

FR00: OECD, ed. (2001): Knowledge and Skills for Life. First Results from the OECD Programme for International Student Assessment (PISA) 2000. Paris: OECD.

FR03: OECD, ed. (2004): Learning for Tomorrow’s World. First Results from PISA 2003. Paris: OECD.

Hambleton, R. K./Swaminathan, H./Rogers, H. J. (1991): Fundamentals of Item Response Theory. Newbury Park: Sage.

Hattie, J. (1985): Methodology Review: Assessing Unidimensionality of Tests and Items. In: Appl. Psych. Meas. 9 (2) 139-164.

Kohn, A. (2000): The Case Against Standardized Testing. Raising the Scores, Ruining the Schools. Portsmouth NH: Heinemann.

Meyerhöfer, W. (2004): Zum Problem des Ratens bei PISA. In: J. Math.-did. 25 (1) 62-69.

Micceri, T. (1989): The Unicorn, the Normal Curve, and other Improbable Creatures. In: Psychol. Bull. 105 (1) 156-166.

Mislevy, R. J. (1993): Foundations of a New Test Theory. In: Frederiksen, N./Mislevy, R. J./Bejar, I. I., eds.: Test Theory for a New Generation of Tests. Hillsdale: Lawrence Erlbaum.

Neuwirth, E./Ponocny, I./Grossmann, W., eds. (2006): PISA 2000 und PISA 2003: Vertiefende Analysen und Beiträge zur Methodik. Graz: Leykam.

Olsen, R. V./Turmo, A./Lie, S. (2001): Learning about students’ knowledge and thinking in science through large-scale quantitative studies. In: Eur. J. Psychol. Educ. 16 (3) 403-420.

Olsen, R. V. (2005a): Achievement tests from an item perspective. An exploration of single item data from the PISA and TIMSS studies, and how such data can inform us about students’ knowledge and thinking in science. Dissertation, Universität Oslo.

Olsen, R. V. (2005b): An exploration of cluster structure in scientific literacy in PISA: Evidence for a Nordic dimension? In: NorDiNa 1 (1) 81-94.

Prenzel, M. et al. [PISA-Konsortium Deutschland], eds. (2004): PISA 2003. Der Bildungsstand der Jugendlichen in Deutschland – Ergebnisse des zweiten internationalen Vergleichs. Münster: Waxmann.

Prenzel, M. et al. [PISA-Konsortium Deutschland], eds. (2005): PISA 2003. Der zweite Vergleich der Länder in Deutschland – Was wissen und können Jugendliche. Münster: Waxmann.

Putz, M. (2006): PISA: Zu schön um wahr zu sein? Liegt das Traumergebnis an Rechenfehlern? Unpublished.

Rindermann, H. (2006): Was messen internationale Schulleistungsstudien? Schulleistungen, Schülerfähigkeiten, kognitive Fähigkeiten, Wissen oder allgemeine Intelligenz? In: Psychol. Rundsch. 57 (2) 69-86. See also comments and reply in vol. 58 (2).

Rocher, T. (2003): La méthodologie des évaluations internationales de compétences. In: Psychologie et Psychométrie 24 (2-3) [Numéro spécial: Mesure et Éducation], 117-146.

Rost, J. (2004): Lehrbuch Testtheorie – Testkonstruktion. 2nd edition. Bern: Hans Huber.

TR00: Adams, R./Wu, M., eds. (2002): PISA 2000 Technical Report. Paris: OECD.

TR03: OECD, ed. (2005): PISA 2003 Technical Report. Paris: OECD.

Woods, C. M./Thissen, D. (2006): Item Response Theory with Estimation of the Latent Population Distribution Using Spline-Based Densities. In: Psychometrika 71 (2) 281-301.

Wuttke, J. (2007): Die Insignifikanz signifikanter Unterschiede: Der Genauigkeitsanspruch von PISA ist illusorisch. In: Jahnke, T./Meyerhöfer, W., eds.: Pisa & Co. Kritik eines Programms. 2nd edition [note: my contribution to the 1st edition is outdated]. Hildesheim: Franzbecker.

Zabulionis, A. (2001): Similarity of Mathematics and Science Achievement of Various Nations. In: Educ. Policy Analysis Arch. 9 (33).


Large-Scale International Comparative Achievement Studies in Education: Their Primary Purposes and Beyond

Rolf V. Olsen

Norway: University of Oslo

Abstract:

This chapter argues that PISA is more than a driver for policy decisions in many countries. The study also provides unique data with the potential to engage educational researchers across the world in conducting a range of secondary analyses. The first section of the chapter describes how the primary purpose of such studies in general has gradually evolved. This description reflects how the studies have typically related to educational research. This section of the chapter is used as the general background for the second and major section, which presents a rationale for why educational researchers could or should be motivated to engage in analytical work relating to these studies. This is followed up by a provisional framework for how educational researchers may approach and make use of the data from these studies in secondary analyses. This framework is based on six generic analytical approaches derived from the study of a large number of examples of published secondary analyses.

Introduction

The overall purpose of this article is to argue that PISA, like a range of other studies often referred to as large-scale international comparative achievement studies in education (LINCAS) (Bos, 2002), is not only an important driver for policy decisions in many countries, but also provides unique data with the potential to engage educational researchers across the world in conducting a range of secondary analyses. The first part of the chapter describes how the primary purpose of such studies in general has gradually evolved. This description reflects how the studies have typically related to educational research. This section of the chapter is used as a general background for the second and major section, which provides arguments for why educational researchers could or should be motivated to engage in work relating to these studies. Furthermore, this section presents how educational researchers may approach and make use of the data from these studies in secondary analyses by suggesting six generic analytical approaches. The six suggested generic approaches probably do not make up an exhaustive list of possibilities for secondary analytical work. Instead, they should, when taken together, be regarded as a provisional framework to be used as a starting point for a more comprehensive and systematic review of the available literature presenting secondary analyses relating to the PISA study.

Even though PISA is the main case to be discussed in this book, the theme of this chapter is of a more overarching and general nature. Many of the references made throughout the chapter to specific secondary analytical work will therefore be to work relating to other studies, and particularly to TIMSS (Trends in International Mathematics and Science Study), since this study has been around for a much longer time. Furthermore, given the author’s background as a researcher in science education, a majority of the examples will be related to this subject. However, the discussion offered is not subject-specific, and the arguments are thus equally relevant for other international studies as well as for other subject areas.

Part I: The primary purposes of the comparative studies

In order to start describing the main features of LINCAS, it is relevant to note that they include one or several measures of achievement, either in specific school subjects or in more overarching competencies transcending the traditional borders set up by school subjects. Furthermore, these measures have been developed under the requirement that they should allow for meaningful international comparison. In addition, an essential design component in the studies is that differences between countries can be studied as effects of contextual factors. It is also important to underscore the fact that these studies are large-scale, which implies that the aim of these studies is to find measures which can be generalised to schools and educational systems. In order to obtain reliable measures which can be generalised to the system level, rigorous procedures for sampling a large number of schools/classes and students must be employed.

Although the aims of, the use of data from, the organisation of, and the methodology applied by these studies have developed gradually (Porter & Gamoran, 2002), there have been two distinctly different and partly competing overall visions underlying the studies. They were first conceived as a specific design or method for conducting research into education with a cross-country comparative perspective. This initial idea will be labelled Purpose I – the research purpose. Gradually, the focus has shifted, influenced by the increasing attention of policymakers towards monitoring the outcomes of educational systems and the study of possible determinants of such outcomes. This rationale for the comparative studies will be labelled Purpose II – the effective policy purpose.

The labels Purpose I and II are only suggested as being useful heuristic devices for understanding some of the ideological tensions with which these studies have to live.² However, using this dichotomy does not suggest that the research purpose and the effective policy purpose are incompatible. On the contrary, I will offer the perspective that the studies may be considered as arenas where researchers in education and educational policymakers can exchange ideas, developing in turn mutual interest in and acceptance of each other’s engagement in educational issues on both the national and international levels.

² These labels are inspired by the way Roberts (2007) uses the terms Vision I and Vision II in his review of the concept of scientific literacy.

Purpose I: The research purpose

Today the label ‘comparative studies in education’ refers to various types of research, ranging from issues of the more philosophical and methodological aspects of comparing across cultures to very specific studies of narrowly defined aspects of education across countries, regions or classrooms. This label also covers studies with a great variety of designs and scales, and in general it is fair to say that the label ‘comparative studies’ is used with different meanings, as there is no generally accepted definition of the term (see for instance Alexander, Broadfoot, & Phillips, 1999; Alexander, Osborn, & Phillips, 2000; Carnoy, 2006). The idea of the large-scale comparative studies receiving focus here was created and defined as a research agenda with the establishment of the IEA – the International Association for the Evaluation of Educational Achievement – in 1961 under the auspices of the UNESCO Institute for Education (Husén & Tuijnman, 1994; Keeves, 1992). The fundamental idea of the founders of the IEA is very clearly expressed by one of the pioneers, Torsten Husén (1973):

We, the researchers who . . . decided <strong>to</strong> cooperate in developing internationally valid<br />

evaluation instruments, conceived of the world as one big educational labora<strong>to</strong>ry<br />

where a great variety of practices in terms of school structure and curriculum were<br />

tried out. We simply wanted <strong>to</strong> take advantage of the international variability with<br />

regard both <strong>to</strong> the outcomes of the educational systems and the fac<strong>to</strong>rs which caused<br />

differences in those outcomes. (p. 10)<br />

The term “laboratory” in this quote is used only as a metaphor, since laboratory conditions with controlled experiments are, taken literally, hardly feasible in educational research due to both practical and ethical considerations. The alternative to the experiment would therefore be survey designs in which the variables of interest could be studied under a great variety of different conditions. In this way “differences between education systems would provide the opportunity to examine the impact of different variables on educational outcome” (Bos, 2002, p. 5). “Thus the studies were envisaged as having a research perspective . . . , as well as policy implications” (Kellaghan & Greaney, 2001, p. 92). The assumption is, in other words, that educational organisation and practice affect educational opportunities and outcomes, and that this can be the subject of empirical research with the following aim:

. . . go beyond the purely descriptive identification of salient factors which account for cross-national differences and to explain how they operate. Thus the ambition has been the one prevalent in the social sciences in general, that is to say, to explain and predict, and to arrive at generalizations. (Husén, 1973, pp. 10-11)

The two quotes above taken from Husén should be seen as typical of the time and of the prevailing optimism regarding how the social sciences could contribute to a better understanding of the causal relationships between different types of factors in society, a vision that in retrospect is often referred to as “social engineering”. The importance of the quotes in this context is, however, to identify the fact that the studies originally came from researchers in education who aimed to use them in order to find answers to what they saw as important research questions. Furthermore, they considered that an international comparative design gave particularly good opportunities for answering such questions.

Purpose II: The effective policy purpose

Policymakers are required to establish overall plans for the nation’s educational system, e.g. to accomplish the following:

– decide the amount and the distribution of resources;
– specify the overall purpose of education as part of the wider social context, as well as specific goals of achievement; and
– determine the organisation of the progression of schooling from childhood to adolescence and beyond.

To a large extent, comparative studies and other internationally comparative data have been regarded by policymakers as providing information that is relevant to their continuous evaluation of such overall plans. What was initially the formulation of a platform for comparative educational research coincided with a growing recognition among politicians, industrial leaders and others that education was one of the most central agents in realising long-term political, societal or economic visions, such as the following:

– developing a society with a better distribution of resources across class, race, gender or any other social group;
– fulfilling the need for a highly competent workforce in order to succeed in the international marketplace;
– enhancing and further developing democracy by giving all citizens basic and further education so that they are enabled to fulfil their own life-agenda and become full-fledged participants in the democratic process.

These were just a few examples of visions of the ideal society that to a large degree were, and still are, shared throughout large parts of the world. At the same time, during the post-Second World War period, international organisations such as the United Nations, the World Bank, the OECD and the European Union were established and quickly grew in size and influence. These are organisations with different (and to some degree conflicting) agendas. However, they all to various degrees invest resources in the study of education in their member countries, and several of these organisations are linked to each other through joint projects concerning educational issues.

IEA became a provider of educational data and analyses, not only to national policymakers, but also to several international organisations. In addition to UNESCO, which was involved in the establishment of the IEA, the OECD (before PISA was established) used data from IEA studies in its publication Education at a Glance (e.g. the use of TIMSS data in OECD, 1996, 1997, 1998). Since the first studies conducted in the early 1960s, the IEA has been in charge of a great number of comparative studies in different subjects, and over the years the studies have grown to include a great number of countries throughout the world. At the same time, the methodological challenges have been a driving force in the development of new designs and psychometric procedures (Porter & Gamoran, 2002).

During the last few decades, the growth of comparative studies has probably also been fuelled by the reform of public services often referred to as ‘new public management’. This is characterised by deregulation of the public sector and a drive towards a higher degree of privatisation of those parts of the public sector that can be thought of as the infrastructure of society. Deregulation implies a transfer of responsibility from the central government to the local authorities. Nevertheless, important decisions related to schools are still to be made by policymakers at administrative levels above the local community or local school level. A consequence in most countries where deregulation took place was therefore to reinforce the central government’s role by installing a national assessment system. This was a shift from the regulation of inputs (e.g. specification of the use of resources or the number of students per class) to controlling the output (achievement, surveys of students and parents). In this way the service providers were made accountable both to the central government and to the users of these services. On the one hand, the central government could control and direct the services by connecting measures of the output to incentives, or by intervening in and manipulating the system to make it work as intended. On the other hand, the users could make use of the output measures in personal decisions regarding the public services.

In this context the studies provide many indicators considered relevant, especially for the policymaker:

– They produce measures of some of the outputs, most importantly achievement measures.
– They produce indicators for systemic factors that may be directly linked to policy, such as average class size, availability of resources (e.g. computers), teacher education and allocation of time to different subjects. They also offer the possibility of relating such factors to achievement.
– They provide indicators of relationships between variables that policy seeks to change in a certain direction, e.g. the aim of schooling to provide an equal opportunity for everyone to learn regardless of background.
– Some studies (for instance, PISA) show how these indicators and their relationships to various demographic characteristics change over time by repeating the surveys at regular intervals.

Moreover, indicators used to monitor a country’s educational system are not based on absolute measures. Placing such measures in an international context provides a comparative frame for the interpretation of many of them. Comparison is a fundamental concept in measurement (Andrich, 1988). In an assessment with no comparative component, it is usually possible to establish whether various effects are statistically significant, but it would be very difficult to establish whether the effects are small or large. Even though the international variation cannot be used to draw causal inferences, it provides a description of what is possible and a context in which national data can be compared. One specific example of how international variation improves the potential for interpreting results in the national context relates to the issue of equity: it is often expressed in policy documents that large systematic differences in achievement between pupils from different socio-economic levels indicate that school systems fail in providing equal opportunities for all pupils. Moreover, a large standard deviation in achievement in the total population is often considered an indicator of inequity. For both these types of effects, the international context provides an opportunity for the policymaker to evaluate whether the differences between students or groups of students are large or small compared to other systems perceived as relevant for comparison. There will always be differences between students, but without a contrast it would be impossible to evaluate or provide a substantial interpretation of the size of the effect.
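To make the distinction between statistical significance and substantive size concrete, consider a minimal sketch in Python. All numbers are invented and merely placed on a PISA-like scale with mean 500 and standard deviation 100; this is an illustration, not an actual result:

# Hedged illustration: in a sample of thousands of students, a mean
# difference of 10 score points is highly significant, yet amounts to
# only a tenth of a standard deviation. Whether that is 'large' can
# only be judged against a comparative yardstick, e.g. the range of
# differences observed across countries.
import math

mean_a, mean_b = 505.0, 495.0   # hypothetical group means
sd = 100.0                      # scale standard deviation
n_a = n_b = 2000                # hypothetical group sizes

d = (mean_a - mean_b) / sd      # standardised effect size (Cohen's d)

# Naive standard error of the mean difference; the real studies would
# account for the complex sampling design (replicate weights) instead.
se = sd * math.sqrt(1.0 / n_a + 1.0 / n_b)
z = (mean_a - mean_b) / se

print(f"effect size d = {d:.2f}; z = {z:.1f} (significant despite small d)")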

Common ground for the two purposes?

In order to better understand the possible tensions between Purpose I and II, Jenkins (2000), building on Loving & Cobern (2000) and Huberman (1994), offers an interesting starting point. He suggests that educational researchers and policymakers not only have different agendas, but also live within different knowledge systems, and therefore:

The knowledge produced within one system and for the one set of purposes cannot normally be readily transferred to another. (Jenkins 2000, p. 18)


Jenkins does not provide a definition of the concept of a knowledge system, and he does not identify more specific aspects of the two knowledge systems claimed to be very different. Furthermore, he does not offer a solution for how the problem identified in the above quote may be remedied.

One very obvious difference in the way researchers and policymakers approach knowledge is of course that the latter are to a much larger extent confronted with decision-making. This entails at least two characteristics of the knowledge seen as relevant. Firstly, decisions are bound by time. The pace of decision-making is usually much faster than the timelines of most researchers. It is therefore likely that, due to the pressure to produce policy in a short time (before the next election), knowledge that may be digested and understood without occupying too much time is considered more relevant by the policymaker. Secondly, knowledge that is likely to be true (analogous to evidence that will ‘hold up in court’) is generally more appreciated when confronted with the realities of decision-making. Given these criteria, it is easy to imagine that numbers and quantitative measures are deemed more appropriate than thick and rich qualitative descriptions.

The OECD, the UN and other international organisations play important roles by being engaged in both these knowledge systems. In contrast to the national policy level, these international organisations have been given mandates that are relatively stable across time, and, among several functions, they have been given the role of providing continuous policy analysis within a longer time frame. This role is particularly visible in the mandate of the OECD. PISA is therefore an interesting case regarding how educational researchers and policymakers may operate in a joint knowledge space. Through many of its educational initiatives, the OECD aims at establishing procedures and arenas for the dissemination of educational research to policymakers. Conversely, through the same arenas, policymakers are able to communicate their needs for information on which to base their decisions. This is at least part of the solution for how an efficient transfer of information back and forth between the two knowledge systems might be achieved. It means that the overall aim of the PISA study is very much aligned with how policymakers define and justify educational outcomes. It also means that the cognitive measures are contextualised by variables perceived by the policy level to be of importance.

In summary, the second purpose of effective policy development is in many respects compatible with the aims of the researchers who established the IEA and conducted the first surveys (Purpose I). Policy issues such as how one can improve the conditions for comprehensive and equitable education are, for instance, also core issues in much pedagogical and didactical research. The difference is that within Purpose II the comparative studies are no longer principally considered basic research in education. This is not to say that they can no longer be used to study fundamental issues in educational research. However, in the international comparative large-scale studies, such research issues have gradually been awarded lower priority in the shaping of the studies. The research purpose has to some extent become secondary to the primary purpose, which is to monitor and benchmark the outcomes of educational systems in order to inform policymakers.

Part II: Beyond the primary purpose of the comparative studies

In the first part of this chapter, I have argued that educational researchers would be well-advised to engage in studies like PISA since they provide communicative tools for the exchange of relevant knowledge with policymakers. However, there is also a more direct argument as to why researchers in education could be highly motivated to take part in or follow up these studies, which is the issue to be discussed in the remaining sections of the chapter: these studies provide valuable and unique data that may be used as a basis for secondary analyses. This research activity may range from theoretical contributions to secondary analysis of the data or of the documents accompanying these data (e.g. analyses of instruments and items, or of the theoretical framework and rationale underlying the studies).

A number of slightly different definitions of the term ‘secondary analysis’ have been suggested in the literature on research designs in the social sciences. They usually focus on the fact that secondary analyses are analyses of already existing data, conducted by researchers other than those who originally collected the data, and with a purpose that most likely was not included in the original design leading to the data collection. The definition best suited for the following discussion is probably the one suggested by Bryman (2004):

Secondary analysis is the analysis of data by researchers who will probably not have been involved in the collection of those data for purposes that in all likelihood were not envisaged by those responsible for the data collection. (p. 201)


This definition also opens up the possibility that the original researchers may be involved in secondary analysis and, furthermore, that the purpose of the secondary analysis may have been included in the original research design. The latter point is highly relevant for many of the large-scale official surveys of different aspects of social life (e.g. the different types of household surveys conducted regularly in many nations), many of which may be considered as having multiple purposes (Burton, 2000; The BMS, 1994), and where the potential for secondary analysis by social scientists is an important part of the primary design.

There are a number of perfectly sound reasons why many researchers give priority to collecting their own data instead of analysing already collected data. The primary reason is that ‘the scientific approach’ may to some extent be pragmatically defined by a methodology that starts with the posing of research questions and hypotheses. Data collected by others were collected with other specific questions or hypotheses in mind, and it may therefore be difficult to use these data to analyse other issues. Secondly, there are often many technical obstacles to using data collected by others: they might not be publicly available; they may lack the documentation necessary to understand the data (e.g. a comprehensive codebook); or the data may require technical analytical skills beyond those of most researchers. Thirdly, there may be ideological reasons for not wanting to base research on data collected by national or international organisations primarily for policy analyses. Some of these issues are also conditions that limit the potential for using data from the comparative studies in secondary research.

However, I would argue that the benefits of such secondary analysis strongly outweigh the limiting conditions. First of all, the data provided by these studies have qualities not often seen in educational research. The primary reason for this claim is that the quality is documented in unprecedented detail. In the technical reports for the PISA surveys (Adams & Wu, 2002; OECD, 2005b), all the procedures for instrument development, sampling, marking and data adjudication are thoroughly described. By studying such reports, it is clear that the PISA study (and other LINCAS) is based on the following:

– very clearly defined populations and adequate routines for sampling these populations in all participating countries;
– well-developed frameworks and instruments, including documentation of the quality of the translation into the different languages;
– well-developed and controlled routines for ensuring that the administration of the test was equal in all countries; and
– well-developed routines and quality monitoring of how student responses were scored as well as how the data were entered and further processed.

Gathering data with procedures like these is not usually possible in ordinary (low-cost) research, which brings us to the second argument for why the data should be used more: millions of dollars or euros have been spent on producing these high-quality databases. Samples have been established and the instruments distributed to the students and back to the research centres in a way that ensures a certain degree of quality and comparability. Furthermore, the data have been assembled and restructured through the skilful work of experts in data processing and measurement to further secure the quality of the information available. Nevertheless, relatively little money is spent on the analysis of the data, as most of it goes into gathering and processing. Evidently, investing more in further analyses of the data would be a financially sound idea.

Thirdly, data from PISA have been made publicly available (although some of the achievement items are kept secure for future use), and researchers interested in using the data can get access to them through a number of channels.3 In order to make the database accessible, PISA has even developed a thorough manual for how to analyse the data (OECD, 2005a). An even better approach would be to engage in a dialogue with the national centre. Through this contact it may be possible to get advice and access to material that is otherwise not so readily available.

3 For access to the PISA data, see http://www.pisa.oecd.org
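As a rough indication of what such analyses involve, the following Python sketch mimics the estimation logic described in the analysis manual: a statistic is computed once for each of the five plausible values and the results are then combined. The column names imitate those of the public PISA files but should be treated as assumptions here, and the sampling variance in the real studies would additionally be computed from replicate weights, which are omitted for brevity:

import numpy as np
import pandas as pd

def weighted_mean(df, value_col, weight_col="W_FSTUWT"):
    # Weighted mean using the final student weight
    return np.average(df[value_col], weights=df[weight_col])

def pv_estimate(df, pv_cols):
    # One estimate per plausible value, averaged into a final estimate;
    # the spread between them gives the imputation variance component
    estimates = np.array([weighted_mean(df, pv) for pv in pv_cols])
    m = len(estimates)
    point = estimates.mean()
    imputation_var = (1 + 1 / m) * estimates.var(ddof=1)
    return point, imputation_var

# Hypothetical miniature data set standing in for a national PISA file
rng = np.random.default_rng(0)
df = pd.DataFrame({"W_FSTUWT": rng.uniform(10, 30, 1000)})
for i in range(1, 6):
    df[f"PV{i}MATH"] = rng.normal(500, 100, 1000)

point, imp_var = pv_estimate(df, [f"PV{i}MATH" for i in range(1, 6)])
print(f"estimated mean: {point:.1f} (imputation variance {imp_var:.2f})")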

A fourth argument for why researchers in education should be keen to use data from PISA and similar studies is the fact that these data are perhaps the single most influential knowledge base for decision-making and political argumentation about educational issues in many countries, and as such they should be scrutinised from a multitude of perspectives. Even if one suspects that the data may be affected and encased by certain ideologies, secondary analysis of the data can in many cases be used to document such a relationship (Pole & Lampard, 2002). Data from LINCAS and documents describing or reporting outcomes of the studies need informed reviews from scholars who can frame the data and documents differently and thus offer both new interpretations and criticism. To a large extent LINCAS are regularly exposed to such criticism. Some of this feedback concerns ideological aspects of the studies (e.g. Atkin & Black, 1997; Brown, 1998; Goldstein, 2004a; Keitel & Kilpatrick, 1999; Kellaghan & Greaney, 2001; Orpwood, 2000; Reddy, 2005). Other critical remarks are more specifically related to methodological issues (e.g. Blum, Goldstein, & Guerin-Pace, 2001; Bonnet, 2002; Freudenthal, 1975; Goldstein, 1995, 2004b; Harlow & Jones, 2004; Prais, 2003; Wang, 2001), and no doubt this book adds to this collection of critical notes.

Finally, since the results from the studies are mainly used to inform policy at the national level, it is necessary to conduct discussions on how the results may be used to evaluate the national school system. In order for comparative studies to provide an even better basis of information for this discussion, it may be necessary to develop specific national designs. This would ensure that information seen as vital in the national context could be obtained. Germany is the prime example of a country which has emphasised the national dimension by implementing several national extensions to the PISA study. In Germany, participating students respond to additional nationally developed instruments, and the country also has an extended sample in order to cover the educational system in each of the partially autonomous states (Länder) (Stanat et al., 2002). These extended efforts in Germany have increased researchers’ engagement with the data, as can be seen from the number of articles discussing PISA in German academic journals in education. They have also boosted public awareness and debate about educational issues in general. To a somewhat lesser extent, the situation is similar in Norway.

Targeting research questions in education

The above discussion mainly presented arguments emanating from the studies themselves for why data from large-scale international comparative achievement studies should be the subject of secondary analyses. Nevertheless, the main reason why educational researchers could be motivated to invest their own time and resources in secondary analyses of such data is that these data may be used to address research questions of importance. In the remainder of the chapter, I will therefore turn to the more specific question of how these data may be used to target research questions in education.

I will suggest that most of the secondary research using data from these studies can be classified into one of six generic types of research designs or methodological approaches. The sequence in which the six generic research designs are presented, and the space devoted to each, does not imply any priority. Furthermore, the intention is not to provide an exhaustive list of possible secondary research issues to be addressed with these data, nor is it suggested that the generic types form a typology of mutually exclusive categories; typically, a piece of secondary analysis would relate to several of the six headings. Finally, the author’s own background accounts for why studies within science education prevail among the references given in the forthcoming discussion. The purpose of the six suggested categories is rather to provide a provisional framework, at some level of generality, for what secondary research relating to studies like PISA may look like.

Using data, results, or interpretations as a background

Secondary analysis of already existing data, results or their interpretations may be included as a somewhat peripheral part of a research design. The original study referred to in this type of research may provide the background or major referent for generating hypotheses and research questions; it may provide data or findings with which to contrast or triangulate other data or findings; or it may be part of the basis for theoretical argumentation or deliberation on educational issues. In this type of research the aim is usually to go behind the data in order to develop thicker and richer descriptions and analyses of issues derived from findings of the international studies.

One example of this type of work (albeit in a Norwegian context) is the research project entitled PISA+.4 The researchers involved in the project use transcripts of videotapes from classrooms covering several hours of activities as their primary data source. Therefore, it is clearly not a secondary analysis of data from PISA. But as the title of the research project reflects, it was triggered by some of the findings from PISA in a Norwegian context needing follow-up (hence the plus sign in the title). Other types of research in which the focus is on how phenomena change over time, or how one group of respondents compares to another group, may also use data or findings from comparative studies as a background. In some of these cases the international comparative studies can provide data that may be used as a baseline or benchmark to which the researchers’ own data may be related. For such a purpose it would, strictly speaking, be necessary to use partly identical instruments and similar routines for collecting and processing the data. One specific example of this is the use of items from TIMSS 1995 in an evaluation of scientific achievement in Norway before and after the curriculum reform in 1997 (Almendingen, Tveita, & Klepaker, 2003). In addition, as mentioned above, findings from PISA may be used as one of the key referents for a theoretical deliberation on educational issues. This seems to be the case for a substantial number of articles discussing educational (and more general social) issues in the German context during recent years (e.g. Opitz, 2006; Pongratz, 2006; Sacher, 2003; von Stechow, 2006).

4 See http://www.pfi.uio.no/forskning/forskningsprosjekter/pisa+/ for a description

The next two types in this generic scheme relate to the fact that the primary units of analysis in the comparative studies are broad and overarching aggregates of the two main dimensions in the data matrix (persons and items). The persons are sampled to study the population of interest, and these populations are described by composite, broad measures constructed by aggregating several items. These constructs or traits are measures of students’ achievements in broadly defined domains (e.g. science, mathematics and reading) as well as contextual descriptors (e.g. socio-economic status, interest, motivation and learning strategies). It is therefore natural to suggest two classes of designs for secondary analyses, related to the deconstruction of the two respective axes of the data matrix.

In-depth analyses of certain variables

Among the most frequently reported secondary analyses are those aiming to present a more finely tuned picture by studying more narrowly defined traits or even single items. This type of analysis utilises information in the data that is not included in the analyses of the total test scores (Olsen, 2005). Several relevant examples may be mentioned. Turmo (2003b) reported on qualitative aspects of students’ responses to a few single cognitive items from the PISA 2000 study related to the environmental issue of the depletion of the ozone layer, relating the types of responses to published research in science education. Similar studies of data from TIMSS 1995 exist in abundance (e.g. Angell, 2004; Dossey, Jones, & Martin, 2002; Kjærnsli, Angell, & Lie, 2002).

In the same manner, the student questionnaire data may be analysed in depth by selecting one or a few variables for a more narrowly targeted analysis, including discussions and alternative interpretations in light of other theoretical or methodological positions. Papanastasiou et al. (2003) have, for instance, carried out an in-depth analysis, based on data from PISA 2000, of the relationship between the use of computers and scientific literacy in the US. Gorard & Smith (2004) used other data from the same study to compute several indexes of segregation within the European Union countries, and these indexes supplement the selection of indicators reported in the official OECD publications of the PISA data. Thematically, both of these latter studies belong to the primary intent of the comparative study to which they are related, and as such they exemplify that the line of demarcation between ‘secondary analysis’ and ‘primary analysis’ is not easy to draw.

In-depth analyses of a sub-sample of students

Another type of secondary research is, as briefly stated above, analyses in which the person axis of the data matrix is deconstructed. This is a very fruitful approach for targeting many specific issues in educational research. Many of the data sets from these studies are so large that the researcher may extract a subset of respondents with similar characteristics.

One may, for instance, conduct an in-depth analysis of ethnic minority groups (Roe & Hvistendahl, 2006; Sohn & Ozcan, 2006). The fact that the OECD has recently published a supplementary thematic report on this issue (OECD, 2006) exemplifies that a clear-cut line of demarcation between primary and secondary analysis does not always exist. Another study which illustrates extremely well the possibilities of using the samples from LINCAS to address issues specific to marginal groups in the population is the study by Mullis and Stemler (2002). They used the original sample from TIMSS 1995 to identify gender differences for high-achieving upper secondary students in mathematics. The reason why such highly specific subgroups can be studied is, of course, that the samples used in the studies are very large, so one may actually select the students above the 75th percentile and further divide these by gender (as was done in this particular study) and still have subgroups of adequate size and reasonable statistical power. Thus, using data describing students’ backgrounds, attitudes and/or achievements, it is possible to construct a number of subgroups relevant for the specific research issue at hand.
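A sketch of this sub-sampling logic in Python (with invented data and hypothetical variable names; the Mullis and Stemler analysis itself was based on the TIMSS 1995 files) might look as follows:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
students = pd.DataFrame({
    "country": rng.choice(["NOR", "SWE", "DEU"], size=9000),
    "gender": rng.choice(["girl", "boy"], size=9000),
    "math_score": rng.normal(500, 100, size=9000),   # invented scores
})

# Keep only students above the 75th percentile; because the original
# samples are large, the resulting subgroups remain analysable.
cutoff = students["math_score"].quantile(0.75)
high_achievers = students[students["math_score"] > cutoff]

# Compare high-achieving girls and boys within each country
print(high_achievers.groupby(["country", "gender"])["math_score"]
      .agg(["count", "mean"]))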

A more specific and narrow comparative outlook on the data

The fourth class of secondary analysis consists of studies aiming to give a more finely tuned comparative view by selecting only a few countries (or merely one country) as the unit of analysis. These studies relate to a long-standing tradition within comparative research in education.

When comparing a smaller selection of countries, two rather different strategies for selecting countries have proven fruitful. In one strategy, countries are selected to represent rather divergent educational systems. Having a sample of educational systems that differ from each other along policy-relevant variables is at the heart of the idea of comparative studies. One of the most powerful international studies using this strategy is the TIMSS videotape study (Stigler & Hiebert, 1999). Even if this was not a secondary analysis of data from TIMSS, but rather an independent study conducted and analysed simultaneously, it is a prime example of how studying divergent educational systems may uncover hidden assumptions and tacit features of the participating countries’ teaching practice. One example of a secondary analysis of PISA data is the comparison of mathematics teaching in Finland and France, to which a recent conference was entirely devoted.5 A third illustrative and interesting example is the comparison of mathematics achievement in PISA in Brazil, Japan and Norway (Güzel & Berberoglu, 2005).

5 See http://smf.emath.fr/VieSociete/Rencontres/France-Finlande-2005/

The other strategy is to compare convergent educational systems. Naturally, such studies are often regional studies of neighbouring countries. Examples of regionally focused studies are the reports issued by Nordic researchers working on the PISA data (Lie, Linnakylä, & Roe, 2003; Mejding & Roe, 2006), and a similar report by researchers in several Eastern European countries based on data from TIMSS 1995 (Vári, 1997). A third example is the study by Wößmann (2005), using data from TIMSS, on the impact of family background in the East Asian countries. There are several reasons why regionally focused reports and studies are valuable. First of all, comparing countries with certain common cultural features in a wider sense (be they historical, political and/or linguistic) implies that more factors may be controlled, which is imperative when studying naturally occurring phenomena with a comparative survey design. Secondly, in comparisons between neighbouring or linguistically similar countries, the possible measurement errors due to item-by-country interactions are reduced (Wolfe, 1999). Therefore, from a policy perspective such comparisons are more likely to produce fruitful recommendations for decision-making, since neighbouring countries, such as the Nordic ones, often have an institutionalised and continuous exchange of policy.

The comparative basis may be reduced even further, to case studies of single countries. Obviously, the national reports that are developed in most participating countries are to a large degree examples of such studies. However, these analyses are mainly presented in public reports targeting a wide group of prospective readers. Hence, parts of the analyses presented in these reports should be transformed into a format aimed at an international audience of researchers and capable of withstanding the scrutiny of peer review. This type of secondary analysis also includes studies aiming at linking the international studies to either the national curriculum or prevalent ideologies in the participating country. Numerous examples of such studies exist. One recent French contribution is the chapter by Bodin in this book, in which he analyses the degree of correspondence between the French mathematics syllabus, the French grade 9 national exam (Brevet) and the mathematics assessment in PISA. Yet other examples are papers discussing the case of Finland, which has quite understandably received a lot of attention given its performance on the assessments in PISA (Linnakylä & Välijärvi, 2005; Sahlberg, 2007; Simola, 2005; Välijärvi et al., 2002).

The two last examples highlight why a national scope on the data from these studies may clearly have wider and broader implications. Ichilov (2004) used data from the IEA Civic Education Study (CivEd) to report on civic orientations in Hebrew and Arab schools in Israel, an issue which (unfortunately) is extremely relevant for the international community. Howie (2004) and Reddy (2005) have used the case of TIMSS in South Africa to reflect upon and question the value of participation in international comparative studies for developing countries, particularly when students’ mother tongues are not used as the language of the test. What these examples have in common is that issues which at the outset seem to be primarily of national interest may be highly relevant contributions to educational research in general.

Combining data from one study with other sources of information

Many countries participate in several studies, and secondary analyses seeking to combine, contrast or synthesise information across studies would be valuable contributions. Furthermore, efforts should be made to combine quantitative results from a study like PISA with other, supplemental pieces of information. These supplements may well be of a qualitative nature. However, this is methodologically challenging, since it is not always clear how to combine the information formally through a common unit of analysis.

The most obvious possibility for linking different international surveys is to use the results aggregated to the country level. One successful example is the study reported by Kirkcaldy et al. (2004) on the relationship between health efficacy, educational attainment and well-being. This study combined data from PISA with data provided by the World Health Organisation, the United Nations and other sources.
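The mechanics of such country-level linking are simple; the substantive difficulty lies in the comparability of the sources. A minimal Python sketch, with entirely invented figures and hypothetical column names, might be:

import pandas as pd

# Invented country-level aggregates standing in for two data sources
pisa = pd.DataFrame({
    "country": ["NOR", "FIN", "DEU", "JPN"],
    "reading_mean": [505, 546, 484, 522],
})
external = pd.DataFrame({
    "country": ["NOR", "FIN", "DEU", "JPN"],
    "life_expectancy": [79.5, 78.9, 78.7, 81.8],
})

# The country is the common unit of analysis joining the two sources
merged = pisa.merge(external, on="country")
print(merged[["reading_mean", "life_expectancy"]].corr())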

Another possibility for combining data (perhaps the primary candidate, given the discussion above) would be to find ways of combining the data in PISA and TIMSS to address issues in mathematics or science education (Olsen, 2006; Olsen & Grønmo, 2006; Olsen, Kjærnsli, & Lie, 2007), and data from PISA and PIRLS6 to address the issue of reading skills (e.g. Becker & Schubert, 2006). This is not a straightforward task, since these studies, even if they partially overlap in the content assessed, differ in many other ways, including the ages and grade levels of the students. However, it is in principle possible to use data aggregated to the country level in order to explore and describe typical features of students’ achievements, attitudes, motivation and background in different countries. Furthermore, it is highly recommended to gather complementary data to help establish links between different studies. This was, for instance, done in a Danish study in which the PISA reading literacy measure from 2000 was formally linked to the IEA Reading Literacy study (which has later become known as PIRLS) from 1991. This made it possible to compare the two measurements; more significantly, it made it possible to develop a measure of change in reading literacy for Danish students from 1991 to 2000 (Allerup & Mejding, 2003). A less stringent way of linking the studies to each other would be to compare the documents describing the studies and to examine the match between the different studies’ frameworks and item pools. For example, several comparisons of TIMSS and PISA have documented how these two surveys differ in their conceptualisation of mathematics and science (Grønmo & Olsen, in press; Hutchison & Schagen, in press; Neidorf, Binkley, Gattis, & Nohara, 2006; Neidorf, Binkley, & Stephens, 2006; Olsen, 2005; Olsen et al., 2007).

6 Progress in International Reading Literacy Study
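Formally linking two measures, as in the Danish study mentioned above, requires common items or common persons. One simple possibility, sketched below in Python with invented numbers, is a linear transformation that maps scores from one study onto the scale of the other via their means and standard deviations on a shared item set. This is only an illustration of the general idea, not the procedure actually used in that study:

import numpy as np

rng = np.random.default_rng(3)

# Invented scores of one linking sample on the common item set,
# expressed on the scales of the two studies
scores_old = rng.normal(250, 40, 500)                        # 1991-style scale
scores_new = 1.8 * scores_old + 55 + rng.normal(0, 10, 500)  # 2000-style scale

# Mean-sigma linking: match the means and standard deviations
a = scores_new.std(ddof=1) / scores_old.std(ddof=1)
b = scores_new.mean() - a * scores_old.mean()

def to_new_scale(x):
    # Place a score from the old study on the scale of the new one
    return a * x + b

print(f"old score 250 maps to {to_new_scale(250.0):.1f} on the new scale")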

Approaching the data with other methodological tools

Unlike the previous categories, this is not a specific methodological approach; rather, it is a category collecting the many studies that apply different methodological tools to the data. Although these studies often result in alternative interpretations of the data, their additional aim is often to comment on the consequences of the methods used. In recent years there has, for instance, been a growing recognition of the hierarchical structure of educational achievement data, in which students are located within classes that are located within schools, which in turn are located within regions, etc. (e.g. Malin, 2005; O’Dwyer, 2002; Ramírez, 2006; Schagen, 2004). By applying specialised statistical tools, it is possible to impose this structure while modelling the data.
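As a minimal sketch of what imposing such a structure looks like in practice, the following Python fragment fits a two-level random-intercept model to synthetic data; the cited studies use comparable multilevel models, usually with specialised software and with the survey weights taken into account:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_schools, n_per_school = 50, 30
school = np.repeat(np.arange(n_schools), n_per_school)

# Synthetic data: a student-level covariate plus a school-level effect
ses = rng.normal(0, 1, n_schools * n_per_school)
school_effect = rng.normal(0, 30, n_schools)[school]
score = 500 + 20 * ses + school_effect + rng.normal(0, 80, len(ses))

df = pd.DataFrame({"score": score, "ses": ses, "school": school})

# The random intercept for schools separates within-school from
# between-school variance instead of treating students as independent
result = smf.mixedlm("score ~ ses", data=df, groups=df["school"]).fit()
print(result.summary())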


Furthermore, some observers of the comparative studies question the requirement of uni-dimensionality of the measurements, and have analysed the data sets from some of these studies by allowing for multiple dimensions in the data (Blum et al., 2001; Gustafsson & Rosén, 2004). Related to this are a number of studies using cluster analysis in order to study distinct differences in achievement profiles across the cognitive items for clusters of empirically related countries (e.g. Angell, Kjærnsli, & Lie, 2006; Grønmo, Kjærnsli, & Lie, 2004; Lie & Roe, 2003; Olsen, 2006).
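A sketch of this cluster-analytic idea in Python (with an invented matrix of proportions correct; the cited studies work with the real country-by-item matrices) could look like this:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

countries = ["NOR", "SWE", "FIN", "DEU", "AUT", "JPN", "KOR"]
rng = np.random.default_rng(7)
p_correct = rng.uniform(0.2, 0.9, size=(len(countries), 40))  # 40 items

# Centre each country's profile so that clusters reflect relative
# strengths and weaknesses across items rather than overall ability
profiles = p_correct - p_correct.mean(axis=1, keepdims=True)

tree = linkage(profiles, method="ward")
labels = fcluster(tree, t=3, criterion="maxclust")
for country, label in zip(countries, labels):
    print(f"{country}: cluster {label}")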

Others have used latent variable or latent group modelling techniques in their approach to the data (e.g. Hansen, Rosén, & Gustafsson, 2004; C. Papanastasiou & Papanastasiou, 2006; Wolff, 2004). These are merely some examples of the use of alternative methodological approaches. The fruitfulness of such studies lies both in the fact that they may utilise aspects of the data set beyond its primary purpose, and in the fact that they may be considered as competing hypotheses regarding how to model and interpret such data.

Concluding remarks

In this chapter I have argued that data from the large-scale international achievement studies should be valued as an important resource for researchers in the educational sciences. In the first part I gave a condensed presentation of how these large-scale international studies of students’ achievements were created and are still affected by two visions or purposes. Originally, the studies were conceived of as tools for conducting fundamental research (Purpose I). This vision was gradually adopted, absorbed and transformed by educational policymakers into a vision in which these studies were regarded as one of the primary tools for monitoring educational systems (Purpose II). As a consequence, I argued that researchers who engage themselves in these studies get access to an arena for the exchange of ideas and thoughts with educational policymakers. This argument is particularly valid for PISA, which to a larger extent than other similar studies is defined as a joint venture between policymakers and researchers, set up within the organisational frame of the OECD with the active involvement of both.

In the second part of this chapter, the call for researchers to engage in secondary analysis of data from the international comparative achievement studies was argued from within the studies themselves. Specifically, it was argued that the data sets offered by the studies are complex and multifaceted, and thus it should be possible to target a range of fundamental issues in educational research through secondary analyses of the data. This argument was augmented by the fact that the data are of an unprecedented, thoroughly documented quality. Furthermore, the data are publicly available. Moreover, it was argued from an economic perspective that, since so much money has been spent on creating these data sets, any additional resources put into secondary analytical research would ensure an even better return on the investment. This argumentation is equally valid for several studies.

Having presented an argument for why researchers in education could or should be interested in utilising the data, the second part turned to the issue of the types of secondary analysis that are possible or viable. Six generic approaches to the secondary use of the data in research were suggested, accompanied by references to a diverse range of secondary analyses in order to document and exemplify the possibilities for conducting such analyses. These references are only a fraction of the available academic literature that utilises information from PISA and other similar studies. Searching international bibliographical databases for the term PISA in the title, keywords or abstract gave more than 600 hits when combining two of the most comprehensive databases of literature in educational research (ERIC and the ISI Web of Knowledge). Deleting duplicates and other false positives brings this down to approximately 250 hits in the period 2001-2007. Out of these, approximately 50 entries refer to what should be labelled primary analysis (national and international reports written as an intended part of the study), bringing the total down to about 200. At the same time, it is obvious that not all published secondary analytical work is included in the database (false negatives); several of the references included in this chapter are, for instance, not found in this bibliography. It is therefore reasonable to claim that secondary analysis of the PISA data is a vital field of research. I doubt that any other data set within educational research has been analysed by so many people from so many diverse perspectives. A more detailed and systematic analysis of this bibliographical database will be conducted in the future in order to give a more comprehensive synthesis of how data from PISA are used by researchers.

In order to take further advantage of the PISA data, governments should consider allocating resources for further analyses of the data sets, especially analyses that help relate the data to the national context. For instance, in Norway funds have been made available so that the primary researchers may spend some time developing and publishing research going beyond the commissioned reports. Furthermore, in many countries funds have been allocated to facilitate students’ use of data from PISA as the basis for their Master’s or doctoral theses, particularly in countries where the national institution responsible for the study is located at a university. To continue with the case of Norway as an example, several doctoral dissertations have been produced based on what could be labelled secondary analysis of data from TIMSS and PISA (Angell, 1996; Isager, 1996; Kind, 1996; Olsen, 2005; Turmo, 2003a).

Hopefully, this chapter may motivate and help engage researchers to explore the possibilities of making use of the resources offered by the large-scale international studies in education and, subsequently, to make such analyses available through internationally recognised publications. Furthermore, the suggested framework of types of analyses, ranging from using the data merely as a referent for the collection of other data to sophisticated modelling of the data, will hopefully provide a certain amount of guidance with respect to what such secondary analyses may look like. The many examples provided should also be regarded as an initial source of inspiration for researchers who would welcome taking on this challenge.

Acknowledgement

This manuscript has been adapted from the article Olsen, R. V. & Lie, S. (2006). Les évaluations internationales et la recherche en éducation: principaux objectifs et perspectives. Revue française de pédagogie, No 157, pp. 11-26, and I wish to thank the editors of the journal for giving their permission for this adaptation. My fellow author Svein Lie has also kindly agreed that I may publish the present adapted version.

References

Adams, R., & Wu, M. (Eds.). (2002). PISA 2000 technical report. Paris: OECD Publications.
Alexander, R., Broadfoot, P., & Phillips, D. (Eds.). (1999). Learning from comparing: New directions in comparative educational research. Volume 1: Contexts, classrooms and outcomes. Oxford: Symposium Books.
Alexander, R., Osborn, M., & Phillips, D. (Eds.). (2000). Learning from comparing: New directions in comparative educational research. Volume 2: Policy, professionals and developments. Oxford: Symposium Books.
Allerup, P., & Mejding, J. (2003). Reading achievement in 1991 and 2000. In S. Lie, P. Linnakylä & A. Roe (Eds.), Northern lights on PISA (pp. 133-145). Oslo: Department of Teacher Education and School Development, University of Oslo.
Almendingen, S. B. M. F., Tveita, J., & Klepaker, T. (2003). Tenke det, ønske det, ville det med, men gjøre det . . . ?: En evaluering av natur- og miljøfag etter Reform 97. Nesna: Høgskolen i Nesna.
Andrich, D. (1988). Rasch models for measurement. Newbury Park/London/New Delhi: Sage Publications.
Angell, C. (1996). Elevers fysikkforståelse. En studie basert på utvalgte fysikkoppgaver i TIMSS. Oslo: Det matematisk-naturvitenskapelige fakultet, Universitetet i Oslo.
Angell, C. (2004). Exploring students’ intuitive ideas based on physics items in TIMSS – 1995. In C. Papanastasiou (Ed.), Proceedings of the IRC-2004 TIMSS (pp. 108-123). Nicosia: Cyprus University Press.
Angell, C., Kjærnsli, M., & Lie, S. (2006). Curricular and cultural effects in patterns of students’ responses to TIMSS science items. In S. J. Howie & T. Plomp (Eds.), Contexts of learning mathematics and science: Lessons learned from TIMSS (pp. 277-290). London: Routledge.
Atkin, J. M., & Black, P. (1997). Policy perils of international comparisons: The TIMSS case. Phi Delta Kappan, 79(1), 22-28.
Becker, R., & Schubert, F. (2006). Soziale Ungleichheit von Lesekompetenzen: Eine Matching-Analyse im Längsschnitt mit Querschnittsdaten von PIRLS 2001 und PISA 2000 / Social inequality of reading literacy: A longitudinal analysis with cross-sectional data of PIRLS 2001 and PISA 2000 utilizing the pair-wise matching procedure. Kölner Zeitschrift für Soziologie und Sozialpsychologie, 58(2), 253-284.
Blum, A., Goldstein, H., & Guerin-Pace, F. (2001). International adult literacy survey (IALS): An analysis of international comparisons of adult literacy. Assessment in Education, 8(2), 225-246.
Bonnet, G. (2002). Reflections in a critical eye: On pitfalls of international assessment. Assessment in Education, 9(3), 387-399.
Bos, K. T. (2002). Benefits and limitations of large-scale international comparative achievement studies: The case of IEA’s TIMSS study. Unpublished PhD thesis, University of Twente.
Brown, M. (1998). The tyranny of the international horse race. In R. Slee, G. Weiner & S. Tomlinson (Eds.), School effectiveness for whom? Challenges to the school effectiveness and school improvement movements (pp. 33-47). London: Falmer Press.
Bryman, A. (2004). Social research methods (2nd ed.). Oxford: Oxford University Press.
Burton, D. (2000). Secondary data analysis. In D. Burton (Ed.), Research training for social scientists (pp. 347-360). London: Sage Publications.
Carnoy, M. (2006). Rethinking the comparative – and the international. Comparative Education Review, 50(4), 551-570.
Dossey, J. A., Jones, C. O., & Martin, T. S. (2002). Analyzing student responses in mathematics using two-digit rubrics. In D. F. Robitaille & A. E. Beaton (Eds.), Secondary analysis of the TIMSS data (pp. 21-45). Dordrecht: Kluwer Academic Publishers.
Freudenthal, H. (1975). Pupils’ achievements internationally compared – The IEA. Educational Studies in Mathematics, 6, 127-186.
Goldstein, H. (1995). Interpreting international comparisons of student achievement (Vol. 63). Paris: UNESCO Publishing.
Goldstein, H. (2004a). Education for all: The globalization of learning targets. Comparative Education, 40(1), 7-14.
Goldstein, H. (2004b). International comparative assessment: How far have we really come? Assessment in Education, 11(2), 227-234.
Gorard, S., & Smith, E. (2004). An international comparison of equity in education systems. Comparative Education, 40(1), 15-28.
Grønmo, L. S., Kjærnsli, M., & Lie, S. (2004). Looking for cultural and geographical factors in patterns of response to TIMSS items. In C. Papanastasiou (Ed.), Proceedings of the IRC-2004 TIMSS (Vol. 1, pp. 99-112). Nicosia: Cyprus University Press.
Grønmo, L. S., & Olsen, R. V. (in press). TIMSS versus PISA: The case of pure and applied mathematics. In Unknown (Ed.), Unknown. Washington, DC.
Gustafsson, J.-E., & Rosén, M. (2004). The IEA 10-year trend study of reading literacy: A multivariate reanalysis. In C. Papanastasiou (Ed.), Proceedings of the IRC-2004 (Vol. 3, pp. 1-16). Nicosia: Cyprus University Press.
Güzel, C. I., & Berberoglu, G. (2005). An analysis of the Programme for International Student Assessment 2000 (PISA 2000) mathematical literacy data for Brazilian, Japanese and Norwegian students. Studies in Educational Evaluation, 31(4), 283-314.
Hansen, K. Y., Rosén, M., & Gustafsson, J.-E. (2004). Effects of socioeconomic status on reading achievement at collective and individual levels in Sweden in 1991 and 2001. In C. Papanastasiou (Ed.), Proceedings of the IRC-2004 PIRLS (Vol. 3, pp. 123-139). Nicosia: Cyprus University Press.
Harlow, A., & Jones, A. (2004). Why students answer TIMSS science test items the way they do. Research in Science Education, 34(2), 221-238.
Howie, S. J. (2004). TIMSS in South Africa: The value of international comparative studies for a developing country. In D. Shorrocks-Taylor & E. W. Jenkins (Eds.), Learning from others (Vol. 8). Dordrecht: Kluwer Academic Publishers.
Huberman, M. (1994). The OERI/CERI seminar on educational research and development: A synthesis and commentary. In T. M. Tomlinson & A. C. Tuijnman (Eds.), Education research and reform: An international perspective (pp. 45-66). Washington, DC: OECD Centre for Educational Research and Innovation/US Department of Education.
Husén, T. (1973). Foreword. In L. C. Comber & J. P. Keeves (Eds.), Science achievement in nineteen countries (pp. 13-24). Stockholm/New York: Almqvist & Wiksell/John Wiley & Sons.
Husén, T., & Tuijnman, A. (1994). Monitoring standards in education: Why and how it came about. In A. C. Tuijnman & T. N. Postlethwaite (Eds.), Monitoring the standards of education. Papers in honor of John P. Keeves (pp. 1-21). Oxford: Pergamon.
Hutchison, D., & Schagen, I. (in press). Comparison between PISA and TIMSS – Are we the man with two watches? In T. Loveless (Ed.), Lessons learned: What international assessments tell us about math achievement. Washington, DC: Brookings Institution Press.
Ichilov, O. (2004). Becoming citizens in Israel: A deeply divided society. Civic orientations in Hebrew and Arab schools. In C. Papanastasiou (Ed.), Proceedings of the IRC-2004 CivEd-Sites (Vol. 4, pp. 69-86). Nicosia: Cyprus University Press.
Isager, O. A. (1996). Den norske grunnskolens biologi i et historisk og komparativt perspektiv. Oslo: Det matematisk-naturvitenskapelige fakultet, Universitetet i Oslo.
Jenkins, E. W. (2000). Research in science education: Time for a health check? Studies in Science Education, 35, 1-25.
Keeves, J. P. (Ed.). (1992). The IEA study of science III: Changes in science education and achievement: 1970 to 1984. New York: Pergamon Press.
Keitel, C., & Kilpatrick, J. (1999). The rationality and irrationality of international comparative studies. In G. Kaiser, E. Luna & I. Huntley (Eds.), International comparisons in mathematics education (pp. 241-256). London: Falmer Press.
Kellaghan, T., & Greaney, V. (2001). The globalisation of assessment in the 20th century. Assessment in Education, 8(1), 87-102.
Kind, P. M. (1996). Exploring performance assessment in science. Oslo: Det matematisk-naturvitenskapelige fakultet, Universitetet i Oslo.
Kirkcaldy, B., Furnham, A., & Siefen, G. (2004). The relationship between health efficacy, educational attainment, and well-being among 30 nations. European Psychologist, 9(2), 107-119.
Kjærnsli, M., Angell, C., & Lie, S. (2002). Exploring Population 2 students’ ideas about science. In D. F. Robitaille & A. E. Beaton (Eds.), Secondary analysis of the TIMSS data (pp. 127-144). Dordrecht: Kluwer Academic Publishers.
Lie, S., Linnakylä, P., & Roe, A. (Eds.). (2003). Northern lights on PISA: Unity and diversity in the Nordic countries in PISA 2000. Oslo: Department of Teacher Education and School Development, University of Oslo.
Lie, S., & Roe, A. (2003). Unity and diversity of reading literacy profiles. In S. Lie, P. Linnakylä & A. Roe (Eds.), Northern lights on PISA (pp. 147-157). Oslo: Department of Teacher Education and School Development, University of Oslo.
Linnakylä, P., & Välijärvi, J. (2005). Secrets to literacy success: The Finnish story. Education Canada, 45(3), 34-37.
Loving, C. C., & Cobern, W. W. (2000). Invoking Thomas Kuhn: What citation analysis reveals about science education. Science & Education, 9(1-2), 187-206.
Malin, A. (2005). School differences and inequities in educational outcomes. Jyväskylä: Jyväskylä University Press.
Mejding, J., & Roe, A. (Eds.). (2006). Northern lights on PISA 2003 – a reflection from the Nordic countries. Copenhagen: Nordic Council of Ministers.
Mullis, I. V. S., & Stemler, S. E. (2002). Analyzing gender differences for high-achieving students on TIMSS. In D. F. Robitaille & A. E. Beaton (Eds.), Secondary analysis of the TIMSS data (pp. 287-290). Dordrecht: Kluwer Academic Publishers.
Neidorf, T. S., Binkley, M., Gattis, K., & Nohara, D. (2006). Comparing mathematics content in the National Assessment of Educational Progress (NAEP), Trends in International Mathematics and Science Study (TIMSS), and Program for International Student Assessment (PISA) 2003 assessments. Washington, DC: National Center for Education Statistics.
Neidorf, T. S., Binkley, M., & Stephens, M. (2006). Comparing science content in the National Assessment of Educational Progress (NAEP) 2000 and Trends in International Mathematics and Science Study (TIMSS) 2003 assessments. Washington, DC: National Center for Education Statistics.
O’Dwyer, L. M. (2002). Extending the application of multilevel modelling to data from TIMSS. In D. F. Robitaille & A. E. Beaton (Eds.), Secondary analysis of the TIMSS data (pp. 359-373). Dordrecht: Kluwer Academic Publishers.
OECD (1996). Education at a glance. Paris: OECD Publications.
OECD (1997). Education at a glance. Paris: OECD Publications.
OECD (1998). Education at a glance. Paris: OECD Publications.
OECD (2005a). PISA 2003 data analysis manual. Paris: OECD Publishing.
OECD (2005b). PISA 2003: Technical report. Paris: OECD Publications.
OECD (2006). Where immigrant students succeed: A comparative review of performance and engagement in PISA 2003. Paris: OECD Publications.
Olsen, R. V. (2005). Achievement tests from an item perspective. An exploration of single item data from the PISA and TIMSS studies, and how such data can inform us about students’ knowledge and thinking in science. Oslo: Unipub forlag.
Olsen, R. V. (2006). A Nordic profile of mathematics achievement: Myth or reality? In J. Mejding & A. Roe (Eds.), Northern lights on PISA 2003 – a reflection from the Nordic countries (pp. 33-45). Copenhagen: Nordic Council of Ministers.
Olsen, R. V., & Grønmo, L. S. (2006). What are the characteristics of the Nordic profile in mathematical literacy? In J. Mejding & A. Roe (Eds.), Northern lights on PISA 2003 – a reflection from the Nordic countries (pp. 47-57). Copenhagen: Nordic Council of Ministers.
Olsen, R. V., Kjærnsli, M., & Lie, S. (2007, 21-25 August). A comparison of the measures of science achievement in PISA and TIMSS. Paper presented at ESERA 2007, Malmö, Sweden.
Olsen, R. V., & Lie, S. (2006). Les évaluations internationales et la recherche en éducation: principaux objectifs et perspectives. Revue française de pédagogie, 157, 11-26.
Opitz, E.-M. (2006). PISA und Bildungsstandards: Stein des Anstoßes oder Anstoß für die Sonderpädagogik? / PISA and education standards: Stumbling block or impulse for special education? Vierteljahresschrift für Heilpädagogik und ihre Nachbargebiete, 75(2), 110-120.
Orpwood, G. (2000). Diversity of purpose in international assessments: Issues arising from the TIMSS tests of mathematics and science. In D. Shorrocks-Taylor & E. W. Jenkins (Eds.), Learning from others: International comparisons in education (pp. 49-62). Dordrecht/Boston/London: Kluwer Academic Publishers.
Papanastasiou, C., & Papanastasiou, E. C. (2006). Modelling mathematics achievement in Cyprus. In S. J. Howie & T. Plomp (Eds.), Contexts of learning mathematics and science. London: Routledge.
Papanastasiou, E. C., Zembylas, M., & Vrasidas, C. (2003). Can computer use hurt science achievement? The USA results from PISA. Journal of Science Education and Technology, 12(3), 325-332.
Pole, C., & Lampard, R. (2002). Practical social investigation: Qualitative and quantitative methods in social research. Essex: Pearson Education Limited.
Pongratz, L.-A. (2006). Voluntary self-control: Education reform as a governmental strategy. Educational Philosophy and Theory, 38(4), 471-482.
Porter, A. C., & Gamoran, A. (Eds.). (2002). Methodological advances in cross-national surveys of educational achievement. Washington, DC: National Academy Press.
Prais, S. J. (2003). Cautions on OECD’s recent educational survey (PISA). Oxford Review of Education, 29(2), 139-163.
Ramírez, M. J. (2006). Factors related to mathematics achievement in Chile. In S. J. Howie & T. Plomp (Eds.), Contexts of learning mathematics and science (pp. 97-111). London: Routledge.
Reddy, V. (2005). Cross-national achievement studies: Learning from South Africa’s participation in the Trends in International Mathematics and Science Study (TIMSS). Compare, 35(1), 63-77.
Roberts, D. A. (2007). Scientific literacy/science literacy. In S. K. Abell & N. G. Lederman (Eds.), Handbook of research in science education (pp. 729-780). Mahwah: Lawrence Erlbaum Associates.
Roe, A., & Hvistendahl, R. (2006). Nordic minority students’ literacy achievement and home background. In J. Mejding & A. Roe (Eds.), Northern lights on PISA 2003 – a reflection from the Nordic countries. Copenhagen: Nordic Council of Ministers.
Sacher, W. (2003). Schulleistungsdiagnose – pädagogisch oder nach dem Modell PISA? Pädagogische Rundschau, 57(4), 399-417.
Sahlberg, P. (2007). Education policies for raising student learning: The Finnish approach. Journal of Education Policy, 22(2), 147-171.
Schagen, I. (2004). Multilevel analysis of PIRLS data for England. In C. Papanastasiou (Ed.), Proceedings of the IRC-2004 PIRLS (Vol. 3, pp. 82-102). Nicosia: Cyprus University Press.
Simola, H. (2005). The Finnish miracle of PISA: Historical and sociological remarks on teaching and teacher education. Comparative Education, 41(4), 455-470.
Sohn, J., & Ozcan, V. (2006). The educational attainment of Turkish migrants in Germany. Turkish Studies, 7(1), 101-124.
Stanat, P., Artelt, C., Baumert, J., Klieme, E., Neubrand, M., Prenzel, M., et al. (2002). PISA 2000: Overview of the study. Design, method and results. Berlin: Max Planck Institute for Human Development.
Stigler, J. W., & Hiebert, J. (1999). The teaching gap: Best ideas from the world’s teachers for improving education in the classroom. New York: Free Press.
The BMS. (1994). Correspondence analysis: A history and French sociological perspective. In M. J. Greenacre & J. Blasius (Eds.), Correspondence analysis in the social sciences (pp. 128-137). London: Academic Press.
Turmo, A. (2003a). Naturfagdidaktikk og internasjonale studier. Store internasjonale studier som ramme for naturfagdidaktisk forskning: En drøfting med eksempler på hvordan data fra PISA 2000 kan belyse sider ved begrepet naturfaglig allmenndannelse. Oslo: Unipub AS.
Turmo, A. (2003b). Understanding a newsletter article on ozone – a cross-national comparison of the scientific literacy of 15-year-olds in a specific context. Paper presented at the 4th ESERA conference “Research and the Quality of Science Education”, August 2003, Noordwijkerhout, The Netherlands.
Välijärvi, J., Linnakylä, P., Kupari, P., Reinikainen, P., & Arffman, I. (2002). The Finnish success in PISA – and some reasons behind it: PISA 2000. Jyväskylä: Institute for Educational Research, University of Jyväskylä.
Vári, P. (Ed.). (1997). Are we similar in math and science? A study of grade 8 in nine Central and Eastern European countries. Amsterdam: International Association for the Evaluation of Educational Achievement.
von Stechow, E. (2006). PISA und die Folgen für schwache Schülerinnen und Schüler / PISA and the consequences for pupils with learning disabilities. Vierteljahresschrift für Heilpädagogik und ihre Nachbargebiete, 75(4), 285-292.
Wang, J. (2001). TIMSS primary and middle school data: Some technical concerns. Educational Researcher, 30(6), 17-21.
Wolfe, R. G. (1999). Measurement obstacles to international comparisons and the need for regional design and analysis in mathematics surveys. In G. Kaiser, E. Luna & I. Huntley (Eds.), International comparisons in mathematics education. London: Falmer Press.
Wolff, U. (2004). Different patterns of reading performance: A latent profile analysis. In C. Papanastasiou (Ed.), Proceedings of the IRC-2004 PIRLS (Vol. 3, pp. 188-202). Nicosia: Cyprus University Press.
Wößmann, L. (2005). Educational production in East Asia: The impact of family background and schooling policies on student performance. German Economic Review, 6(3), 331-353.


The Hidden Curriculum of PISA – The Promotion of Neo-Liberal Policy by Educational Assessment

Michael Uljens

Finland: Åbo Akademi University

Introduction

The aim of the present chapter is to contextualise the PISA evaluation as an exponent of an ongoing shift in the educational policy of many countries participating in PISA. The shift is considered to reflect a neoliberally oriented understanding of the relation between the state, the market and education. From a Finnish perspective, this shift was initiated at the end of the 1980s and the beginning of the 1990s. It has been referred to as the educational policy of “the third republic”.

Even if the PISA project strengthens the development of a neoliberal educational discourse, both nationally and globally, the project was prepared for by developments within many nations during the 1990s. Movements and actions that preceded the PISA project are described here from the perspective of Finland. The argument is that these preceding operations made PISA appear a natural continuation of a change process already initiated on the national level.

The chapter also points out some of the mechanisms through which the PISA evaluations operate in order to promote the neoliberal interests of the OECD. This is considered important, as it often appears to be forgotten that the OECD is the organisation behind and running the PISA project. PISA is interpreted as a specific kind of transnational, semi-global educational evaluation technique not experienced before. PISA is thus interpreted as having been prepared by previous actions on the local and national levels, but PISA in turn promotes and strengthens the readiness to uphold a competition-oriented cooperation within and between nations.


Doing the groundwork for PISA – the silent educational revolution

The point of departure of the present chapter is the view that large-scale changes in education, especially, must be understood as socially, culturally and historically developed. Consequently, the claim here is that the PISA project cannot be correctly understood without acknowledging it as an exponent of an ongoing shift in European and global educational policy. Today this shift concerns and covers all levels and areas of the western educational system, although to varying degrees in different countries.

In Finland, this shift has been called a movement towards the educational policy of the “third republic”. The first republic refers to the period from Finnish independence (1917) up to the Second World War. The second republic started in 1945 and lasted up to the mid-1980s. This period focused on educational expansion, solidarity, basic education for all students, equal opportunities, regional balance, and education for civil society. In a word, it was the educational doctrine of the welfare state, assuming mutually positive effects between economic growth, welfare and political participation (see e.g. Siljander, 2007).

The period of the “third republic” started towards the end of the 20th century or, symbolically, when the previous century “ended” in 1989, i.e. after the collapse of the Soviet Union and the fall of the Berlin Wall. The political mentality in Finland had already started to change in a more conservative direction in the 1980s, and has since then continued to develop in this direction, even though the movement was even more obvious in other countries.

In contrast to the period of the second republic, the educational mentality of the third republic initiated a discourse on excellence, efficiency, productivity, competition, internationalisation, increased individual freedom and responsibility as well as deregulation in all societal areas (e.g. communication, health care, infrastructure), including the educational sector (education law, curriculum planning and educational administration). The direction was clearly manifested in the governmental programme in Finland after the elections in 1990. The project could be called the creation of the educational policy of the global post-industrial knowledge economy and information society. New Public Management ideas were introduced in the late 1980s, along with a so-called agency-theoretical approach, according to which the role of the state is changed from producing services to buying services. The model included, as we know, the lowering of taxes as well as techniques for “quality assurance”. Attention also turned towards profiling individual schools and institutions and towards increasing flexibility, e.g. in educational career planning. Extended freedom of choice on the local level was supported by, for example, decentralising curriculum planning, first to the community level in the 1980s and then to the school level in the 1990s. Parents were included in school boards. Salaries determined according to achievement were later introduced in the public sector.

This mentality supported a kind of commodification of knowledge and marketisation of schooling, as well as a much stronger view of national education as a vehicle for international competition. The use of national tests for ranking schools was introduced in the 1990s as a means of promoting a competition-oriented climate. The education of gifted students became acknowledged in addition to the strong emphasis on traditional special education. Today, limiting school drop-out is motivated by its societal costs rather than by the many other reasonable arguments.

Despite all of these changes, the idea of educational equality has remained the guiding principle, although it has become weakened. The process by and large reflects a view of students or parents as “customers”, according to which parents were offered enlarged opportunities to choose which schools their children attended on the basis of the success of schools and their profiles. The view of citizens as customers is also obvious in various EU documents (Heikkinen, 2004) (Finland joined the EU in 1995). Education has increasingly come to be considered a private good rather than a public good. During the past decade, movements in this direction have been very obvious within the university system (law, financing models, productivity, etc.) in Finland.

The changes pointed out above reflect a silent but ongoing “revolution” in educational ideology and policy. The development in Finland is similar to that in other European countries. Seen globally, it is difficult not to consider the collapse of the former Soviet Union as the starting point for the development of a new ideological and economic world order.

The conclusion of what has been said thus far is that the PISA evaluations, organised by the OECD, were in many ways prepared for by the developments described above. The argument of the following section is that, although the international ranking of countries with respect to pupils’ success at testing is not a new phenomenon, taking into account how PISA has been constructed and governed, and how its results have been distributed, interpreted and made use of, makes the PISA process an organic part of an ongoing “silent revolution” in western educational thinking.


Governing technologies used by the OECD

It is important to observe that the PISA evaluations are coordinated by the OECD (Organisation for Economic Co-operation and Development). The OECD was founded in Paris in 1960 in order to stimulate economic growth and employment. It was founded by 20 countries but was extended in 2000 to 30 countries. A growing number of non-OECD countries have also participated in the PISA evaluations.

The overall logic behind the strategy of the OECD seems to be to support an increased competitive mentality combined with a system of common standards for nations, as this is expected to be beneficial for a common market. The intention thus seems to be to combine competition with cooperation. The question is through which mechanisms, operations or technologies this is put into practice. In the following, some major strategies are identified that have been applied in and through the PISA evaluations in order to promote a competitive mentality combined with cooperation.

First, using transnational evaluation procedures that follow one single measurement standard (common to all and independent of every participating country) supports, in the end, the development of increased homogeneity. The argument is that this occurs through a self-adjusting process. More precisely, the strategy applied is the following: as PISA is mainly focused on the ranking of participating countries and is not very interested in explaining the differences between them, the burden of producing explanations is left to the participating nations, their governments, educational administrations and the media. We saw this occurring after the launch of the results of PISA 2000, and even more clearly after PISA 2003.

By not offering systematic explanations for the reported differences in school achievement, the development of a self-adjusting mentality or a certain mode of self-reflection was promoted. Through this process the countries themselves begin to orientate towards certain types of questions and topics, i.e. looking for the keys to success. We all know that ranking the participating countries created an unforeseen alertness among politicians and within the educational administration to explain either their students’ success or lack thereof.

From an OECD perspective this is the best anyone can hope for – getting nations engaged in the right issues, so to speak. By leaving the task of explaining differences to the participating nations, media people and the like, national experts, governmental representatives and politicians are also free to draw different kinds of conclusions from the results. Thus, the policies emanating from the process vary between countries. However, this process limits the agenda for the educational politics of a specific country. Instrumental policy issues, i.e. the means by which things should be carried out and corrected, then become the main topic, while reflection on the orientation and aims of education and schooling as such diminishes. Nonetheless, it would be wrong to say that the question of educational aims has moved to the background during this process, as it is obvious that all levels of education strongly emphasise that education, research and developmental work are core strategies for creating economic growth. As the aims are so obvious, there is a risk that educational policymaking on the national level becomes a kind of educational managerialism or “procedurology”.

A second strategy applied for promoting the interests dominating the OECD is related to the construction of the tests and their relation to national curricula. One of the fundamental differences between the PISA evaluation and, for example, the IEA evaluations is that the IEA took the national curriculum, its intentions and content, as the point of departure. As it is quite natural to consider the national curriculum as the frame of reference when evaluating pedagogical efforts, it becomes important to try to understand why PISA did not evaluate what teachers in the respective countries were expected to strive for. But what if the point was also something else in addition to primarily evaluating the effectiveness of the educational system? What if the idea was rather to use international evaluation as a technique for homogenising the participating educational systems and creating a competition-oriented mentality?

If homogenisation (or increased coherence) may be seen as one aim to be reached from an OECD perspective, then the promotion of a competition-oriented mentality is another, equally important aim. Having accepted this, the main question is not concerned with the aims of education but with the means of how to reach or hold a leading position.

A mentality that accepts never-ending competition is deceptive, as one can never reach either the goal or certainty. The only point that is clear is that one has to struggle to keep or improve one’s position. Competition is always accompanied by insecurity, and this insecure identity or mentality continuously strives to reach safety. The mentality supported is one of continuous angst or a feeling of insufficiency. Lifelong learning, which was first hailed as a liberating policy, has quickly turned out to be more like a life sentence than something emancipating. The individual is not allowed to reach “heaven on earth”, but is rather expected to try to learn to live with the idea that a continuous learning process is the closest we can come to fulfilment in life. In fact, this construction is not a recent or new one. In some respects it is a fundamental feature of the European tradition of Bildung. At the risk of oversimplifying, we could say that while the Bildung tradition emphasises learning as emancipation, independence, self-awareness and maturity (Mündigkeit), the lifelong learning ideology or dogma explicates learning activity as something that the individual has to exhibit in order to meet the “legitimate” expectations of those towards whom one is considered to be responsible. A learning attitude is the ethos of an “alert readiness to change” according to what the situation requires, but where one does not define this situation oneself. In this sense the lifelong learning dogma is the opposite of the concept of Bildung.

Conclusions

The intention in this chapter has been to develop an interpretation of the possible logic behind the PISA evaluation compared to previous international evaluations. Moreover, the aim has been to analyse some of the mechanisms or governmental strategies utilised or operating in the PISA process. This general logic was considered simple but intelligent. It was interpreted as aiming at uniting intercultural communicative activities oriented towards learning from each other with a simultaneous or parallel competition-oriented mentality – a logic of competing and competitive cooperation. As this has not been formulated as an explicit aim of the PISA programme, it may be interpreted as a part of the hidden curriculum of PISA. International evaluations, in the shape they have taken in the PISA process, thus include a kind of hidden curriculum, aiming at developing the educational systems of the participating countries in a neo-liberal direction.

The analysis did not focus on what was in fact measured by the tests themselves, or on whether the theoretical foundation of the project was weak or not, e.g. with respect to how comparative educational research was understood. The point was rather, first and foremost, to pay attention to how the educational policy landscape in Finland for its part prepared for PISA and, secondly, to point out the effects that this kind of evaluation procedure may have on the educational thinking of the participating countries. PISA was thus understood more as an instrument or technique used by the OECD to support the development of a specific type of national educational policy. Expressed in the terminology of Michel Foucault, the PISA evaluation is viewed as a good example of how evaluation operates not by directly governing behaviour but by governing the self-government or self-conduct of individuals.

However, it has not been claimed here that supporting countries’ competitive capacity by educational means is a new feature of Finnish or European educational policy. It may, in fact, be argued that living with uncertainty and openness is a fundamental feature of the modern European tradition of Bildung (Uljens, 2007). Furthermore, the educational policy of the welfare state was, and (at least in Finland, to the extent it still exists) still is, built upon the conviction of positive mutual effects between economic progress, educational equality, social justice and welfare, and active, participatory citizenship.

In order to avoid misunderstanding, it should be stated that it is in this context that the term ‘neo-liberalism’ is distinguished from ‘classical liberalism’ (A. Smith). Classical liberalism is taken to refer to the idea that the state should not intervene in market-related issues, as the market regulates itself and is automatically beneficial for all. Neo-liberalism is taken to refer to the view that the state does and should intervene in the market by laws and regulations of all kinds. In the neo-liberal model, politics, economics and education are seen as mutually dependent on each other. The international development of market-oriented economic thinking after 1989 may thus be considered a renewed neo-liberal politics in which the relative impact of politics on the economy has diminished. This has created a dissonance in the “school-state-market triangle” (education, politics, economics), which is most clearly visible in and through the contemporary discussions on the crisis of citizenship and citizenship education.

In conclusion, understood in the sense defined above, the PISA process is coherent with the kind of educational policy in Finland that has been evolving over the past 15-20 years. The relation may also be seen the other way around: the educational policy of Finland, as it developed from the end of the 1980s and the beginning of the 1990s, moulded the national scene so that the strategies and technologies used in the PISA evaluation appeared as a reasonable continuation of the national policy.

It was pointed out that this preparatory work was mainly carried out by applying three policy technologies: a) economisation, referring to the measurement of value primarily in economic terms; b) privatisation, as a movement towards the partial deconstruction of collective, societal institutions in favour of private actors, towards deregulating laws and increasing the flexibility of educational administration, and towards increased individual responsibility and freedom; and, finally, c) productivity, referring to the fact that activities that effectively stimulate economic growth are supported.

One of the anomalies resulting from the international PISA discussion is how to explain the fact that an educational system like the Finnish comprehensive school was indeed able to produce better results and a smaller variation compared to parallel school systems, like those in Germany or Great Britain. One reason why this raised so much confusion was the fact that the ideology behind the comprehensive school fundamentally differed from the OECD ideology, which emphasises more individual freedom and less state intervention.

PISA has also resulted in increased expectations of continued and extended success. In Finland, the PISA success of the compulsory school system turned attention towards the universities: why are our universities not doing equally well in international rankings? During the last few years many different steps have been taken in order to push for Finnish universities’ international success. One example is that it is no longer certain that the decentralised model of higher education, initiated at the end of the 1960s, will remain. According to the unquestioned rhetoric of today, large university units are considered capable of being successful in many ways, not least when it comes to raising research funding and offering stimulating study programmes. This also happens on the EU level (e.g. the establishment of the European Institute of Technology, EIT).

A final comment – or query – concerning where PISA is or has been discussed during the past years: compared with the immense attention PISA issues have received in the public debate all over the world, and the impact they have had on governmental policies and school practices, it is fascinating how seldom educational researchers touch upon the topic in international research conferences and journal articles. If the observation is correct, which I do think it is, then it seems that we have two different worlds of educational debate which are not necessarily in touch with each other. Is this how things should be?

References

Heikkinen, A. (2004): Evaluation in the transnational ‘Management by Projects’ policies. In: European Educational Research Journal, 3(2), 486-500.

Siljander, P. (2007): Education and ‘Bildung’ in modern society – Developmental trends of Finnish educational and sociocultural processes. In: R. Jakku-Sihvonen & H. Niemi (Eds), Education as a societal contributor (pp. 71-90). Frankfurt am Main: Peter Lang.

Uljens, M. (2007): Education and societal change in the global age. In: R. Jakku-Sihvonen & H. Niemi (Eds), Education as a societal contributor (pp. 23-49). Frankfurt am Main: Peter Lang.


Deutsche Pisa-Folgen

Thomas Jahnke

Deutschland: Universität Potsdam

In dieser Note werden die Beschlüsse der Kultusministerkonferenz zum ‚Bildungsmonitoring‘ und zu den ‚Bildungsstandards‘ in Mathematik als nationale Pisa-Folgen identifiziert. Eine Auseinandersetzung mit der Testforschung in den USA und eine Ernüchterung der veröffentlichten Meinung zu der Testwirklichkeit können die Geltungsmacht von Pisa & Co in Deutschland möglicherweise eindämmen.

Die Teilnahme an der Dritten Mathematik- und Naturwissenschaftsstudie (TIMSS) und dem ersten Durchgang des Programme for International Student Assessment (PISA) hat zu einer grundlegenden Wende in der deutschen Bildungspolitik geführt. Das Unbehagen an der deutschen Schule ist messbar geworden, und mit diesen Messungen ist auch der Weg, die Verhältnisse zu bessern, vorgezeichnet: Die Messwerte müssen höher werden, dann wird es besser. Die Wucht, mit der dieser Gedanke die mediale und politische Öffentlichkeit durchrollte und vereinzelte Kritik an solchen Erkenntnissen und der einzuschlagenden Kur unter sich begrub, hatte lawinenartigen Charakter. Die Messergebnisse scheinen wirklicher als jede Theorie, und den „deskriptiven Befunden“ haftet eine quasi-naturwissenschaftliche Objektivität und damit unwiderlegbare Wahrheit an: So liegen die Dinge – im Rahmen der Messgenauigkeit. Der Triumph empirischen Denkens: Die Wirklichkeit ist beziffert, digitalisiert, das Menetekel hat Dezimalen bekommen und kann nun Steuerungsprozessen unterworfen werden, deren Ergebnisse wieder zu messen sind, und so weiter.

In Deutschland wird kaum diskutiert, dass auch solchen Messungen eine – möglicherweise holprige, unausgesprochene, wenig durchdachte – Theorie zugrunde liegt und dass Begriffe wie ‚Kompetenzstufen‘ oder ‚Grundbildung‘ sich nicht messtechnisch ergeben, sondern „Realität“ eher hervorbringen als beschreiben. Ferner ist der Glaube, durch periodisierte Testungen würden die Leistungen deutscher Schülerinnen und Schüler steigen, weit und auch in der Bildungsadministration verbreitet. Kritik an Pisa wird häufig damit zurückgewiesen, dass ein Unternehmen dieser Größenordnung natürlich auch Schwächen und Ungereimtheiten aufweise, dass der ‚Pisa-Schock‘ aber grundsätzlich doch das Augenmerk auf die Schulwirklichkeit gelenkt und schon damit Bewegung gebracht und diverse Reformbestrebungen in Gang gesetzt habe. Verkannt wird dabei, dass es sich bei Pisa nicht um eine einmalige Testung handelt, deren Ergebnisse – in ihrer Aussagekraft möglicherweise überschätzt – einem reflektierenden Betrachter schon etwas erzählen könnten, sondern um ein Programm, das keineswegs den Blick für verschiedenste Reformansätze und -anstrengungen öffnet, sondern im Gegenteil den Weg durch das Ziel schon festgeschrieben hat: Deutsche Schülerinnen und Schüler sollen bei den künftigen Tests besser abschneiden.

„Bildungsmonitoring“

Die Gesamtstrategie der Kultusministerkonferenz zum Bildungsmonitoring liegt in doppelter Form vor: einmal als Beschluss der Kultusministerkonferenz vom 02.06.2006¹, zum anderen als Broschüre², die vom Sekretariat der Ständigen Konferenz der Kultusminister der Länder der Bundesrepublik Deutschland in Zusammenarbeit mit dem Institut zur Qualitätsentwicklung im Bildungswesen (IQB) 2006 herausgegeben wurde. Die Broschüre ist – schon auf dem Umschlag – mit ganzseitigen Farbfotos von Schülerinnen und Schülern illustriert, enthält ein Vorwort der Präsidentin der Kultusministerkonferenz und ein Inhaltsverzeichnis mit geänderter Nummerierung der Abschnitte und ist um einen Abschnitt mit Aufgabenbeispielen angereichert. Offensichtlich hat man dem IQB zugestanden, sein Aufgaben- und Pflichtenbuch selbst zu überarbeiten und die Formulierungen des zugrunde liegenden Beschlusses sich passend zu glätten und auszulegen. Dies geschieht tatsächlich Absatz für Absatz.

¹ Kultusministerkonferenz (KMK): Gesamtstrategie der Kultusministerkonferenz zum Bildungsmonitoring (Beschlüsse der Kultusministerkonferenz vom 02.06.2006).
² Kultusministerkonferenz (KMK): Gesamtstrategie der Kultusministerkonferenz zum Bildungsmonitoring (2006).

Aus der Formulierung „… in eine Reihe von Beschlüssen der KMK einzuordnen, die entsprechende Handlungsfelder beschreiben und gemeinsame zentrale Arbeitsbereiche nach Pisa 2003 festlegen“ in dem Beschluss der KMK wird der Bezug auf Pisa 2003 gestrichen. Aus dem Arbeitsbereich „Bereitstellung von Fortbildungskonzeptionen und -materialien zur kompetenz- bzw. standardbasierten Unterrichtsentwicklung, vor allem Lesen, Geometrie, Stochastik“ werden der Bezug zum Lesen und der – nicht nachvollziehbare, verwunderliche – Bezug auf spezielle mathematische Bereiche gestrichen, den man wohl nur durch die Abwesenheit und damit auch Verzichtbarkeit von ‚Fachkompetenz‘ erklären kann. Wir zitieren im Folgenden die etwas schlankeren Formulierungen des ursprünglichen Beschlusses. Schon der erste Absatz lässt wenig Zweifel, unter welchen Auspizien Bildung heute betrachtet wird:

Bildung nimmt eine Schlüsselrolle für die individuelle Entwicklung, für gesellschaftliche Teilhabe sowie berufliches Fortkommen, aber auch für den wirtschaftlichen Erfolg eines Landes ein. Die globalen Entwicklungen der vergangenen Jahrzehnte haben die grundlegende Bedeutung von Bildung für Deutschland noch einmal unterstrichen. Die Ausschöpfung aller Begabungspotentiale und die Sicherung und Entwicklung von Qualität im Bildungswesen sind daher zentrale Aufgaben der Bildungspolitik. (S. 1)³

Als zentrale Instrumente der Kultusministerkonferenz für das Bildungsmonitoring werden dann benannt:

– Internationale Schulleistungsuntersuchungen
– Zentrale Überprüfung des Erreichens der Bildungsstandards in einem Ländervergleich
– Vergleichsarbeiten in Anbindung oder Ankoppelung an die Bildungsstandards zur landesweiten Überprüfung der Leistungsfähigkeit einzelner Schulen
– Gemeinsame Bildungsberichterstattung von Bund und Ländern. (S. 1/2)

Ob die bisher veröffentlichten ‚Bildungsstandards‘ (s.u.!) solchen Überprüfungen und Belastungen standhalten, kann man bezweifeln. In jedem Fall wird Deutschland durch diesen Beschluss zum Testland ausgerufen und erklärt: Für die Jahre 2006 bis 2018 (!) werden in einer Tabelle 17 Testungen und 19 Berichterstattungen über diese terminiert, die sich allein durch die Teilnahme an PIRLS, TIMSS und PISA⁴ sowie die Ländervergleiche bundesweit ergeben. Dazu kommen noch die länderspezifischen und länderübergreifenden Vergleichsarbeiten in Anbindung oder Anlehnung an die Bildungsstandards in den Jahrgangsstufen 3 und 4 für Deutsch und Mathematik, in den Jahrgangsstufen 8 und 9 für den Hauptschulabschluss in Deutsch, Mathematik, Erste Fremdsprache (Englisch, Französisch) und in den Jahrgangsstufen 9 und 10 für den Mittleren Schulabschluss in Deutsch, Mathematik, Erste Fremdsprache (Englisch, Französisch), Biologie, Chemie, Physik.

³ In ihrem Tenor und Jargon erinnert eine solche Funktionsbeschreibung von ‚Bildung‘ an die entsprechenden Verlautbarungen der OECD. Überraschenderweise ist in der Überarbeitung des Textes (siehe die o.a. Broschüre) in dem angeführten Zitat das Wort ‚Bildung‘ durch ‚Das Bildungssystem‘ ersetzt, als seien diese Begriffe synonym.
⁴ PIRLS ist die Abkürzung für Progress in International Reading Literacy Study, die in Deutschland auch mit IGLU für Internationale Grundschul-Lese-Untersuchung bezeichnet wird. TIMSS war ursprünglich ein Akronym für Third International Mathematics and Science Study; seit TIMSS 2003 steht das Akronym für Trends in International Mathematics and Science Study; PISA steht für Programme for International Student Assessment.

Dass schulische Bildung in Deutschland solchem ‚Monitoring‘ nicht mehr entkommen kann, wird schließlich im letzten Abschnitt Bildungsberichterstattung gesichert:

Kern der Bildungsberichterstattung ist ein überschaubarer, systematischer, regelmäßig aktualisierter Satz von Indikatoren, d.h. statistischen Kennziffern, die jeweils für ein zentrales Merkmal von Bildungsprozessen bzw. einen zentralen Aspekt von Bildungsqualität stehen. Diese Indikatoren werden aus amtlichen Daten und sozialwissenschaftlichen Erhebungen in Zeitreihe dargestellt, wenn möglich im internationalen Vergleich und aufgeschlüsselt nach Ländern. Um den Vergleich mit Entwicklungen in den Mitgliedstaaten der Europäischen Union und der OECD zu ermöglichen, wird Anschlussfähigkeit und Kompatibilität mit internationalen Berichtssystemen (…) angestrebt. (…)

Durch die Verfügbarkeit individueller Verlaufsdaten und die regelmäßige Erfassung erworbener Kompetenzen soll die Leitidee der Bildungsberichterstattung „Bildung im Lebenslauf“ umgesetzt werden. Für einen einheitlichen Satz schulstatistischer Daten und die Sicherung der Anschlussfähigkeit an die internationale Bildungsstatistik haben die Länder bereits grundlegende Beschlüsse gefasst. So haben die Länder am 22.09.2005 vereinbart, längerfristig ihre Daten entsprechend den im Kerndatensatz vereinbarten Merkmalsausprägungen zur Verfügung zu stellen. Zumindest Daten der öffentlichen Schulen sollen für das Schuljahr 2008/2009 von allen Ländern vorliegen. (S. 14)

In der überarbeiteten Broschüre zum Bildungsmonitoring wurde der zuletzt zitierte Absatz gestrichen. Es ist aber wohl kaum davon auszugehen, dass damit auch die angestrebte Datenbank nicht eingerichtet wird.

“Teaching to the Test”

One must note that the criticism of these testing procedures, and of the idea that the 'yields' of school education could meaningfully be measured and increased through periodic tests, has not reached Germany, and that not even a careful and honest discussion of this idea, so self-evident at first glance, has taken place here. 5 That is hardly surprising, since the test institutes involved, and the researchers cooperating with them, can scarcely be expected to bring test criticism to market along with their testing know-how.

5 It is instructive, and presumably not without consequences, that the PISA data are processed in Australia and, as it were, never set foot on German (or European) soil. In an age of research that understands itself as global, such spatial distances appear to have no effect whatsoever, among other reasons because access to data servers is ubiquitous and possible without delay. Nevertheless it matters whether, and in what setting and intellectual space, the procedures for preparing the data are developed, discussed and criticised; whether they are understood as instruments that shape the results in form and content, or merely appear as more or less poorly documented routines in software packages; whether they are scientifically discussed at all or simply taken as necessary yet arbitrary essences of a 'state of the art'; whether the researchers involved care to sell their procedures or to introduce them into the discussion and legitimise them as instruments of knowledge, etc.

In the fourteen-page resolution of the Kultusministerkonferenz on educational monitoring of 02.02.2006, this problem is briefly addressed on page 13, under the subheading 'Further development of education, but no teaching to the test', as follows:

Besides the function of describing performance requirements and measuring performance, the Bildungsstandards serve primarily the further development of teaching and above all the individual support of all pupils. The Länder agree that, with the setting of the Bildungsstandards as an overarching frame of reference, a development towards 'teaching to the test', or a narrowing of instruction to the requirements of the standards, must be prevented. (p. 13)

This brevity is astonishing, notwithstanding the invoked unanimity of the Länder. If one wishes to promote or decree the 'further development of teaching and above all the individual support of all pupils' through Bildungsstandards whose attainment is checked essentially through tests, it would seem natural to take note of the experience of countries, above all the USA, that have pursued such a policy for years or decades.

For several decades, some measurement experts have warned that high-stakes testing could lead to inappropriate forms of test preparation and score inflation, which we define as a gain in scores that substantially overstates the improvement in learning it implies. (p. 99)

Thus Daniel Koretz, education researcher at Harvard University and associate director of the National Center for Research on Evaluation, Standards, and Student Testing (CRESST), opens his paper Alignment, High Stakes, and the Inflation of Test Scores 6 and describes a starting point beyond which public discussion in Germany has so far barely advanced:

One common response to this problem has been to seek "tests worth teaching to". The search for such tests has led reformers in several directions over the years, but currently, many argue that tests well aligned with standards meet this criterion. If tests are aligned with standards, the argument runs, they test material deemed important, and teaching to the test therefore teaches what is important. If students are being taught what is important, how can the resulting score gains be misleading? (p. 99)

Koretz grounds his objection to such naivety theoretically and empirically, strikingly among other things with sawtooth curves ("sawtooth pattern") for the measured achievement of the same or a comparable population, which came out in the most varied ways in different surveys depending on the tests used. He also contradicts the hope that such effects might be attributed solely to test construction and testing circumstances:

The problem is not confined to commercial, off-the-shelf, multiple-choice tests. It has appeared as well with standards-based tests and with tests using no multiple-choice items. (p. 106)

The notion that student achievement can be measured in a test objectively, or with specifiable margins of error, as it were physically, is simply misleading, and conclusions drawn from such a notion are more than questionable. Where this is denied, concealed, or the opposite pretended, massive vested interests of the commissioners or contractors of the testing are as a rule at play.

6 Koretz, D.: Alignment, High Stakes, and the Inflation of Test Scores. Yearbook of the National Society for the Study of Education (2005) 104 (2), 99–118. (Available online at: http://www.blackwell-synergy.com/doi/abs/10.1111/j.1744-7984.2005.00027.x)


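The mechanism behind such score inflation can be made vivid in a toy model. The following sketch is not Koretz's analysis and all its parameters are invented; it merely illustrates how a sawtooth can arise when gains on a familiar high-stakes form outpace real learning and vanish whenever a new form is introduced, while an uncoached audit test barely moves:

    # Toy model of the "sawtooth pattern" (invented parameters, not Koretz's data).
    TRUE_GAIN = 0.5   # assumed real learning gain per year, in score points
    INFLATION = 4.0   # assumed familiarity/coaching gain per year on a known form
    FORM_LIFE = 4     # assumed interval (years) after which a new form is introduced

    true_level = 500.0
    for year in range(12):
        years_on_form = year % FORM_LIFE           # resets to 0 with each new form
        true_level += TRUE_GAIN
        high_stakes = true_level + INFLATION * years_on_form
        audit = true_level                         # the audit test is not taught to
        print(f"year {year:2d}: high-stakes {high_stakes:6.1f} | audit {audit:6.1f}")

The high-stakes series climbs and collapses with each change of form; the persistent gap between the two series is precisely the inflation Koretz defines.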

The effects of testing on instruction have likewise been studied in the USA for decades. Koretz, for example, describes and characterises reallocation, alignment and coaching in the paper cited:

Reallocation. Reallocation refers to shifts in instructional resources among the elements of performance. Research has shown that when scores on a test are important to teachers, many of them will reallocate their instructional time to focus more on the material emphasized by the test. (...) Many observers believe that reallocation is among the most important factors causing the sawtooth pattern (...).

Alignment. Content and performance standards comprise material – performance elements, in the terminology used here – that someone (not necessarily the ultimate user of scores) has decided are important. If the material is emphasized in the standards, that implies that users should give this material substantial weight in the inferences they draw about student performance. Alignment gives this same material high weights in the test as well. (...)

Coaching. The term "coaching" is used in a variety of different ways in writings about test preparation. Here it is used to refer to two specific, related types of test preparation, called substantive and non-substantive coaching.

Substantive coaching is an emphasis on narrow, substantive aspects of a test that capitalizes on the particular style or emphasis of test items. The aspects of the tests that are emphasized may be either intended or unintended by the test designers. For example, in one study of the author's, a teacher noted that the state's test always used regular polygons in test items and suggested that teachers should focus solely on those and ignore irregular polygons. The intended inferences, however, were about polygons, not specifically regular polygons. (...)

Nonsubstantive coaching refers to the same process when focused on nonsubstantive aspects of a test, such as characteristics of distracters (incorrect answers to multiple-choice items), substantively unimportant aspects of scoring rubrics, and so on. Teaching test-taking tricks (process of elimination, plug-in, etc.) can also be seen as nonsubstantive coaching. In some cases – for example, when first introducing young children to the op-scan answer sheets used with multiple-choice tests – a modest amount of certain types of nonsubstantive coaching can increase scores and improve validity by removing irrelevant barriers to performance. In most cases, however, it either wastes time or inflates scores. (p. 110-112)

Similar criticism can be found elsewhere. Brian M. Stecher, for instance, summarises his chapter 4, Consequences of large-scale, high-stakes testing on school and classroom practices, in the book he co-edited, Making Sense of Test-Based Accountability in Education, 7 as follows:

The net effect of high-stakes testing on policy and practice is uncertain. Researchers have not documented the desirable consequences of testing – providing more instruction, working harder, and working more effectively – as clearly as the undesirable ones – such as negative reallocation, negative alignment of classroom time to emphasize topics covered by a test, excessive coaching, and cheating. More important, researchers have not generally measured the extent or magnitude of the shifts in practice that they identified as a result of high-stakes testing.

Overall, the evidence suggests that large-scale high-stakes testing has been a relatively potent policy in terms of bringing about changes within schools and classrooms. Many of these changes appear to diminish students' exposure to curriculum, which undermines the meaning of the test scores. (p. 99/100)

7 Stecher, B. M.: Consequences of large-scale, high-stakes testing on school and classroom practices. In L. S. Hamilton, B. M. Stecher, and S. P. Klein (Eds.): Making Sense of Test-Based Accountability in Education. RAND. Santa Monica 2002, pp. 79–100. (Online at: http://www.rand.org/pubs/monograph_reports/MR1554/index.html)

The antagonism addressed in the last paragraph seems to have been withheld from the German Kultusministerkonferenz, possibly by its advisers. The same presumably holds for the Position Statement on High Stakes Testing in PreK-12 Education of the American Evaluation Association (AEA), which states:

High stakes testing leads to under-serving or mis-serving all students, especially the most needy and vulnerable, thereby violating the principle of "do no harm." The American Evaluation Association opposes the use of tests as the sole or primary criterion for making decisions with serious negative consequences for students, educators, and schools. The AEA supports systems of assessment and accountability that help education.

Recent years have seen an increased reliance on high stakes testing (the use of tests to make critical decisions about students, teachers, and schools) without full validation throughout the United States. The rationale for increased uses of testing is often based on a need for solid information to help policy makers shape policies and practices to insure the academic success of all students. Our reading of the accumulated evidence over the past two decades indicates that high stakes testing does not lead to better educational policies and practices. There is evidence that such testing often leads to educationally unjust consequences and unsound practices, even though it occasionally upgrades teaching and learning conditions in some classrooms and schools. The consequences that concern us most are increased drop out rates, teacher and administrator deprofessionalization, loss of curricular integrity, increased cultural insensitivity, and disproportionate allocation of educational resources into testing programs and not into hiring qualified teachers and providing sound educational programs. The deleterious effects of high stakes testing need further study, but the evidence of injury is compelling enough that AEA does not support continuation of the practice.

While the shortcomings of contemporary schooling are serious, the simplistic application of single tests or test batteries to make high stakes decisions about individuals and groups impedes rather than improves student learning. Comparisons of schools and students based on test scores promote teaching to the test, especially in ways that do not constitute an improvement in teaching and learning. Although used for more than two decades, state mandated high stakes testing has not improved the quality of schools; nor diminished disparities in academic achievement along gender, race or class lines; nor moved the country forward in moral, social, or economic terms. The American Evaluation Association (AEA) is a staunch supporter of accountability, but not test driven accountability. AEA joins many other professional associations in opposing the inappropriate use of tests to make high stakes decisions.

An endnote to this text points to further organisations that likewise oppose basing far-reaching decisions on test results.

AEA joins many other professional associations, teacher unions, parent advocacy groups in opposing the inappropriate use of tests to make high stakes decisions. These include, but are not limited to the American Educational Research Association, the National Council for Teachers of English, the National Council for Teachers of Mathematics, the International Reading Association, the College and University Faculty Assembly of the National Council for the Social Studies, and the National Education Association. 8

For the German observer it is hard to comprehend with what expectations of improvement, if not of salvation, in whatever direction, education policy here introduces the most extensive testing programmes, while Anglo-Saxon evaluation pragmatism, certainly no shrinking violet, distances itself from such endeavours, after more than twenty years of experience, with a clarity that could hardly be surpassed.

8 American Evaluation Association (AEA): Position Statement on HIGH STAKES TESTING In PreK-12 Education. 2002. (Online at: http://www.eval.org/hst3.htm)

“Bildungsstandards”

While the resolution of the Kultusministerkonferenz on educational monitoring, as quoted above, still speaks of the Bildungsstandards serving primarily the further development of teaching and above all the individual support of all pupils, the Vereinbarung über Bildungsstandards für den Mittleren Schulabschluss 9 of the same organisation puts it more plainly and with less pedagogical veiling:

The Kultusministerkonferenz regards it as a central task to secure the quality of school education, the comparability of school-leaving qualifications and the permeability of the education system. Bildungsstandards are of particular importance here. They form part of a comprehensive system of quality assurance that also encompasses school development and external and internal evaluation. Bildungsstandards describe expected learning outcomes. Their application provides indications for necessary support and assistance measures. (p. 3)

9 Vereinbarung über Bildungsstandards für den Mittleren Schulabschluss (Jahrgangsstufe 10) (resolution of the Kultusministerkonferenz of 4.12.2003), in: Kultusministerkonferenz (KMK): Bildungsstandards im Fach Mathematik für den Mittleren Schulabschluss. Resolution of 4.12.2003.

Standards 10 and tests certify each other's necessity to such a degree that they come into being, as it were, out of pure logic: tests need standards to which, or towards which, they are geared; standards need tests to check their observance, attainment or failure.

10 Out of respect for the great German theorists of Bildung of the 18th, 19th and 20th centuries, I try to avoid the word Bildungsstandard. As if ear and mind were not already sufficiently tormented by the compound 'Bildungsstandards', the resolution of the Kultusministerkonferenz on educational monitoring speaks in numerous places of the necessary norming and re-norming of the Bildungsstandards.

A short, less logical, history of the standards in Germany was sketched by Hans Dieter Sill in 2006. 11 He comes to the conclusion:

The standards did not emerge as the result of thorough scientific analyses of international and national developments; they are the outcome of a politically motivated resolution at ministerial level that had to be implemented in a very short time. There were neither temporal nor personnel resources to shape the scientifically exceptionally demanding process of developing national standards with the necessary depth and thoroughness. (pp. 299/300)

11 Sill, H. D.: PISA und die Bildungsstandards. In: Jahnke, Th.; Meyerhöfer, W. (eds.): Pisa & Co – Kritik eines Programms. Franzbecker Verlag. Hildesheim 2006, pp. 293–330.

The results of such scarcity do not, at least arithmetically, mark the Bildungsstandards im Fach Mathematik für den Mittleren Schulabschluss adopted by the Kultusministerkonferenz on 4.12.2003. By positing six competencies that overlap without any discriminatory power or even a character of their own, five mathematical guiding ideas (Leitideen) that have been familiar for years, and three demand levels (Anforderungsbereiche), simple multiplication yields ninety different ways of labelling a task. If, as is probably to be expected in most cases, several competencies or guiding ideas are involved, several hundred such classifications result.
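The arithmetic can be checked directly. A minimal sketch in Python; the labels K1-K6, L1-L5 and I-III are placeholders of my own, not the official wording of the standards:

    from itertools import combinations

    competencies = [f"K{i}" for i in range(1, 7)]   # six general competencies
    leitideen = [f"L{i}" for i in range(1, 6)]      # five guiding ideas
    levels = ["I", "II", "III"]                     # three demand levels

    # One competency, one guiding idea, one demand level per task:
    print(len(competencies) * len(leitideen) * len(levels))   # 6 * 5 * 3 = 90

    # If a task may address any non-empty subset of the six competencies,
    # the count runs into the hundreds:
    subsets = sum(1 for r in range(1, 7) for _ in combinations(competencies, r))
    print(subsets * len(leitideen) * len(levels))   # 63 * 5 * 3 = 945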

On 24 of the brochure's 36 pages, sample tasks and solution sketches are printed, with guiding ideas and general mathematical competencies indicated and assigned to demand levels. While the classification of the tasks is hardly compelling or illuminating, rather self-evident and of negligible relevance for actually working them, the meagre quality of the tasks is alarming: they do not even begin to reach the standard of material found in recent, well worked-out and well prepared textbooks.

In task (1), the modelling is inappropriate.

In task (2), a graphic artist's blunder, presumably the result of mishandling spreadsheet software, is not addressed but simply accepted.

In task (3), a star that is not drawn symmetrically is called symmetric, and the number of its axes of symmetry is then asked for.

In task (4), the artificial question and the classification are astonishing.

In task (5), an axis labelled 'pay rise in EURO' is marked in steps of ten from 0 to 50 but at the same time divided into 30 parts, so that one subdivision corresponds to 1 2/3 € and the axis labels cannot sit on the tick marks.

In task (6), a point is denoted P(y;x), and it is then remarked that 'x is the first coordinate of the point P'.

In task (7), sub-questions c) and d) are astonishing.

In task (8), what shows above all is the strain of accommodating a guiding idea.

In task (9), question c) is opaque.

In task (10), the calculator is used in a questionable way.

In task (11), one is to engage with five pupil statements in speech bubbles that hardly anyone outside school would work through mathematically.

In task (12), questions on linear functions are treated from which little sense can be extracted.

In task (13), a modelling step would for once be possible in principle, were it not already prescribed in the text.

Task (14) laboriously combines questions that have little in common.

The tasks are throughout rather woodenly formulated, the graphics careless and faulty, the solution sketches of little help and in part wrong (gravely so, for instance, in task (3) and task (5)). No innovative stimulus emanates from such material. Why are these tasks, which are supposed to exemplify the 'Bildungsstandards für den Mittleren Schulabschluss' throughout Germany and which, beyond the palely contoured competencies, constitute their very incarnation, so full of defects? The only rational answer to this question is that the standards are not in fact about competencies, guiding ideas and demand levels, and that what is exemplary about these tasks does not refer to their content. The point is not at all to look closely at them and their possible solutions, that is, to take them seriously, but to make clear to teachers and pupils that there is now a new, administratively binding concept, namely that of standards, which is to be observed without back-talk and which tolerates no objection, whether against tests or comparative assessments and their contents. What is to be taken seriously, then, is not the tasks but the curb on which teachers and pupils alike are being reined in: you must be able to do this now, or else, be it through publication of the meagre results of the pupils, the teachers or the school, be it through other coercive measures. Now things are getting serious, and this seriousness goes by the name of standard. It may well be that the striking up of these new tones, as it were in praise of the seriousness of state educational prescriptions, suits some people, and that some profit from it, for instance as state-appointed educational researchers or test developers, but it is not mathematics didactics. The Vereinbarung über Bildungsstandards für den Mittleren Schulabschluss (Jahrgangsstufe 10) states (on page 4 of the brochure cited):

The standards and their observance will be reviewed, taking into account developments in the academic disciplines, in subject didactics and in school practice, by a scientific institution jointly commissioned by the Länder, and will be developed further on the basis of validated tests. (p. 4)

No substantive further development has taken place since; evidently there is no felt need whatsoever for a broad and deep discussion in the discipline, in subject didactics or in school practice.


Cracks in Public Authority


At the carefully staged publication of the first PISA 'results' in Germany, the media, as to a lesser extent already with the Third International Mathematics and Science Study (TIMSS), essentially did nothing but proclaim and stage horror at the performance of German pupils ('PISA shock'). The results themselves, their interpretation and the procedures used to obtain them were not subjected to even the simplest plausibility checks at the press conferences or in the accompanying reports and commentaries. A moose test that would have probed this complex apparatus of investigation even in outline, or merely the technical details of the test in schools (duration, kind and number of questions) and its curious claim to validity, never took place. All that mattered was to lament the extent of the German failure and to ponder remedies according to the commentator's stripe and organisational affiliation. Rather sobering reports from pupils and teachers directly involved in the test could occasionally be read, but such eyewitness accounts were labelled local slip-ups against the manuals and the prescribed international procedural rules, and their mention or consideration was branded unscientific. They were drowned in the drama and force of the global study. Any criticism of PISA was seen in the media merely as an unfit attempt to talk the misery of German education into something pretty, or even to deny it altogether. PISA's public authority had the media, and politics too, firmly in its grip.

With the second PISA wave, no comparable drama could be built up in the media any more. Even half-hearted attempts, pushed by education policy, to draw conclusions from the comparison of the two rounds' results about a first effect of German measures proved procedurally daring, substantively neither credible nor even plausible, and in tendency downright counterproductive. Not even the (nonsensically spectacular) Länder rankings could still be exploited, so a new debacle had to secure media attention: namely, that in Germany territorial and social origin 'strikes through' to educational opportunity to a particular degree. Here too, incidentally, simple questions went unasked: how this research finding had come about, which quantities or indicators had been measured, computed or plotted against one another, and in what way the German results exceeded or fell short of those of comparable countries. In media terms, then, this was not the result of a complex investigation whose procedures ought at least to be explained in rough outline, but a moral catastrophe on whose removal one was to work without question or delay. 12

12 The point here is by no means to deny German deficits in dealing with and schooling pupils with a migration background (and the like), but rather to question the acceptance of their fierce moralisation as an essential warrant for the raison d'être of PISA & Co.

By now the dazzle of this news has also faded. The following article shows, by way of example, that the sheer faith that the abrupt horror at the surveyed German schooling debacle would now be followed, with the same full-throated certainty and connoisseurship, by an improvement of conditions is beginning to crumble in the media.

Langer Anlauf ohne Sprung (A Long Run-Up with No Jump)

The true PISA winners are not the Finns at all. The true PISA winners sit in Berlin, Dortmund and Bielefeld. One rarely sees them in the schools. Mostly they pore over test sheets, devise examination questions or devotedly research the effects of their own research. 'We have never had so much data,' rejoices the Bielefeld education researcher Klaus Jürgen Tillmann, a member of the German PISA consortium, 'as an empirical educational researcher I am of course delighted.' There is no shortage of funding, and new research centres are being founded, such as the Institut zur Qualitätsentwicklung im Bildungswesen (IQB) at the Humboldt-Universität in Berlin. Only the object of research itself still dampens the researchers' euphoria: 'Unfortunately it does nothing for the schools,' says the educationist Tillmann.

Germany's school ministers seem to believe that the best remedy for lousy test results is, above all, testing. True, as a reaction to the PISA shock the Kultusministerkonferenz resolved on seven improvement strategies, among them language courses for migrant children, more all-day schools and targeted reading support, but so far it has consistently implemented only one of them: tests. 'There are developments in all seven areas,' says Tillmann, 'but only the central examinations have arrived in the schools across the board in all Länder.' (...)

Indeed, German schools are being evaluated, compared and inspected as never before. Even before starting school, four-year-olds must often take a German-language test; in seven Länder the third-graders then sweat over 'Vera' ('Vergleichsarbeiten') tests, and in the middle grades further comparative assessments follow in many places. In between, year after year, come international studies such as PISA, IGLU or TIMSS and, depending on the Land, surveys with imaginative names such as 'Quasum', 'Desi', 'Tosca', 'Markus', 'Ulme' or 'Lau'. (...)

'After PISA, no education minister wanted to be accused of not focusing on achievement,' explains the researcher Tillmann. 'Behind it stands the vague hope that testing will somehow make everything better.' But teaching staffs still lack the know-how to derive concepts from the flood of data. 'Something urgently needs to happen there,' says Tillmann, 'otherwise the whole thing remains a long run-up with no jump.' (...)

North Rhine-Westphalia's education minister Sommer believes her Land to be particularly far along the way to drawing sensible lessons from the many tests. She extols her school system as the 'most modern in Germany'. NRW thus intends to be the first Land to introduce school rankings, still within this legislative period. At the same time, parents on the Rhine and Ruhr can now choose where to enrol their child, and may well take their bearings from the lists. 'We want fair competition,' says Sommer.

Yet it is precisely here that many researchers see the greatest danger of the new testing culture: 'If schools only look at their places on the list, no school development takes place at all any more,' warns Wilfried Bos, head of the Dortmund Institut für Schulentwicklungsforschung. Fair rankings, taking account for instance of the social background of the pupil body, are hardly possible when, as with Vera, asking about the children's origin and family is not even permitted.

Moreover, last year's award for the best Vera results already proved a flop: many schools had cheated their way to good results by rehearsing the test items with their pupils beforehand (SPIEGEL 27/2006). 'Once there are real rankings,' believes headteacher Borns from Münster, 'there will be far more cheating.'

Julia Koch in Der SPIEGEL 24/2007

Such articles will presumably shake and erode PISA's public authority in Germany more than any scholarly critique of the study's methods and procedures, which can be dismissed as an insignificant intra-scientific squabble stirred up by laypeople, hardly reaches the public, and cannot unsettle an education policy that has, as it were, sworn itself to PISA.

References

American Evaluation Association (AEA): Position Statement on HIGH STAKES TESTING In PreK-12 Education. 2002. (Online at: http://www.eval.org/hst3.htm)

Kultusministerkonferenz (KMK): Gesamtstrategie der Kultusministerkonferenz zum Bildungsmonitoring (Beschlüsse der Kultusministerkonferenz vom 02.06.2006). (Online at: http://www.kmk.org/aktuell/Gesamtstrategie%20Dokumentation.pdf)

Kultusministerkonferenz (ed.), in cooperation with the Institut zur Qualitätsentwicklung im Bildungswesen: Gesamtstrategie der Kultusministerkonferenz zum Bildungsmonitoring. Berlin 2006. (Online at: http://www.kmk.org/schul/Bildungsmonitoring_Brosch%FCre_Endf.pdf)

Kultusministerkonferenz (KMK): Bildungsstandards im Fach Mathematik für den Mittleren Schulabschluss. Beschluss vom 4.12.2003. (Online at: http://www.kmk.org/schul/Bildungsstandards/Mathematik_MSA_BS_04-12-2003.pdf)

Koch, Julia: Langer Anlauf ohne Sprung. Der SPIEGEL 24/2007.

Koretz, D.: Alignment, High Stakes, and the Inflation of Test Scores. Yearbook of the National Society for the Study of Education (2005) 104 (2), 99–118. (Online at: http://www.blackwell-synergy.com/doi/abs/10.1111/j.1744-7984.2005.00027.x)

Sill, H. D.: PISA und die Bildungsstandards. In: Jahnke, Th.; Meyerhöfer, W. (eds.): Pisa & Co – Kritik eines Programms. Franzbecker Verlag. Hildesheim 2006, pp. 293–330.

Stecher, B. M.: Consequences of large-scale, high-stakes testing on school and classroom practices. In: L. S. Hamilton, B. M. Stecher, and S. P. Klein (eds.): Making Sense of Test-Based Accountability in Education. RAND. Santa Monica 2002, pp. 79–100. (Online at: http://www.rand.org/pubs/monograph_reports/MR1554/index.html)


PISA in Austria: Media Reactions, Public Assessments and Political Consequences

Dominik Bozkurt, Gertrude Brinek, Martin Retzl

Austria: Universität Wien

Abstract:

After juxtaposing the results of the two PISA test rounds of 2000 and 2003, this contribution presents the public media reactions as well as the reactions of the political organisations and the education-policy consequences they drew from PISA. The aim is to make clear how the results were received and interpreted in public discourse and which political mandates for action were derived from them. Agreements and divergences between public and political reactions and the official results are discussed, and shifts between the reactions to the PISA results of 2000 and of 2003 are made visible. The media analysis and the education-policy assessment show the density of the resonance as well as the kind and degree of societal 'agitation', not only in the scientific community.

1 Austrian PISA Results

This chapter presents and compares the official results of PISA 2000 and PISA 2003 for Austria, as published by the members of the Austrian PISA consortium in various publications. These results were received rather uncritically by the public. That they do not, or only partially, live up to their claim is shown by scholarly contributions published in Germany, for example, as well as by the various contributions in this volume. The statements and reactions of media and politics, however, were made on the basis of precisely these results. For our purposes, drawing on them therefore seems sensible and appropriate.

1.1 The Results of the PISA 2000 Study

In December 2001 the OECD published the results of PISA 2000, a test round in which 31 states, Austria among them, had taken part. PISA (Programme for International Student Assessment) surveys, in a three-year cycle, the reading comprehension and the basic mathematical and scientific knowledge of 15-/16-year-old pupils (cf. Reiter, Haider 2002).

The first test domain of the PISA 2000 study comprised the 'reading competence profile' (ibid., 13), in which the young people had to understand, use and reflect on the contents of written texts (cf. ibid., 13). To measure these attributes, the OECD employed 129 test items, which in turn were grouped into 'five ascending levels of reading competence' (ibid., 13). About 9 % of the Austrian 15-/16-year-old pupils were to be assigned 'to the top competence level', and around 14 % to the very poor readers 'of the two lowest levels' (ibid., 13).

In reading competence, the focus domain of PISA 2000 (cf. ibid., 21), the Austrian participants achieved 507 points and thus place 10 among the 27 OECD states tested (cf. Haider, Reiter 2004, 77). Austria thus lay 'just above the OECD average of 500' points in this domain (cf. Reiter, Haider 2002, 13).

In the course of PISA 2000, the mathematical knowledge of the 15-/16-year-old pupils was also measured. To ascertain it, the tested youths had to demonstrate their ability in the most varied areas, e.g. 'problem solving' and 'modelling' (ibid., 21). Austria achieved '515 points' (ibid., 21) in mathematical competence and thus 11th place among 27 OECD states (cf. Haider, Reiter 2004, 63). It should be noted, however, that the results within Austria vary greatly on account of the strongly differentiated school system here: pupils of the Allgemeinbildende Höhere Schule achieved 565 points overall, while the youths of the Allgemeine Pflichtschulen reached only '438 points' (Reiter, Haider 2002, 23) in the mathematics ranking.

In the third and final test domain of PISA 2000, the science competencies of the Austrian pupils were tested. Particular attention was paid to the recognition of scientific questions and the application of scientific knowledge. The pupils were called upon, among other things, 'to distinguish claims supported by evidence from mere opinions' (ibid., 29). Haider points out that this test domain centres on recognising scientific questions, applying scientific knowledge and drawing conclusions from evidence (cf. ibid., 29). In science, the Austrian pupils achieved a total of '519 points' (cf. ibid., 29) and thus eighth rank among the 27 OECD states taking part in the test round (cf. Haider, Reiter 2004, 89).

In summary, Austria did best in science, with 519 points and rank 8. In mathematics, 515 points were achieved, and thus rank 11. In reading, the fewest points, 507, were scored, which nonetheless yielded 10th rank among all OECD states. The performances of the Austrian youths, above the OECD average in every domain, meant that Austria took 10th place in the overall ranking of PISA 2000 and thus ranged in the front third of the study.

1.2 The Results of the PISA 2003 Study

At the end of 2004 the results of PISA 2003 were published. The focus of the PISA 2003 round, in which as many as 41 states took part, of which however only 40 were included in the scoring (cf. Haider, Reiter 2004, 18), lay on the mathematical competencies, which were declared the 'main domain' (ibid., 13). In the 'minor domains', reading competence and scientific knowledge were again examined. For the first time, the problem-solving competencies of the 15-/16-year-old pupils were also investigated (cf. ibid., 13).

As the focus discipline, the tested domain of mathematics comprised two thirds of all PISA 2003 test items. Officially, the Austrian pupils achieved 506 points in solving mathematical tasks and problems, and thus 15th rank among 29 OECD states (cf. Haider, Reiter 2004, 63). However, two not insignificant problems emerged when comparing the PISA 2000 results in mathematics with those of the 2003 round. Haider points out that in PISA 2000 'only two of the four mathematical sub-areas tested in PISA 2003' (ibid., 45) were examined, and hence only these two areas are directly comparable. For the mathematical areas of 'uncertainty' 1 and 'quantity' 2, newly created for PISA 2003, there is accordingly no possibility of comparison (cf. ibid., 45).

1 The mathematical subgroup uncertainty comprises 'tasks and the presentation of data as well as probabilities, uncertainties and inferences' (Haider, Reiter 2004, 52).

2 The area of quantity covers those mathematical tasks 'that deal with numerical phenomena and patterns as well as quantitative relationships' (ibid., 53).

Reading competence of the 15-/16-year-old pupils was examined anew in PISA 2003. Officially, the Austrian participants achieved, with 491 points, 19th rank 'within the 29 OECD states' and 22nd rank among all 40 'PISA participant states', whereby Austria does 'not differ significantly' from the OECD average of 494 points (cf. ibid., 76). Haider sums up that in reading competence Austria fell back by 9 ranks, or 16 points, in the OECD comparison at PISA 2003 and thus performed significantly worse than three years before. He relativises this decline, however, for 'taking into account the shared ranks according to statistical bandwidth, this means: PISA 2000: ranks 10-16; PISA 2003: ranks 12-21' 3 (ibid., 77).

3 Cf. the information page: Wichtige Informationen zur Interpretation der Ergebnisse. In: Haider, Reiter 2004, 43.

As a further minor domain of PISA 2003, the scientific skills of the youths were tested. The 15-/16-year-old pupils had to demonstrate their ability and show to what extent they could apply 'the relevant physical, chemical or biological subject knowledge' to 'practical problem solutions', represent problems and argue for proposed solutions (cf. ibid., 78).

In this domain Austria stands, with 491 points, at 20th rank within the 29 OECD states, provided one considers only 'the point estimate of the mean' (cf. ibid., 79; 89). It must be noted, however, that once the confidence interval is taken into account (only part of the Austrian pupil population was actually covered in the course of PISA), the Austrian pupils reach ranks 16 to 23 among the 29 OECD states (cf. ibid., 79).


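Why a single point estimate ('20th rank') widens into a range ('ranks 16 to 23') is a direct consequence of sampling error. A minimal sketch with invented country scores (not the published PISA data), using the simplified rule that one country can be placed ahead of another with confidence only if their 95 % intervals do not overlap; the OECD's own procedure tests score differences, but the effect is the same:

    # Invented example scores; PISA reports a mean and a standard error per country.
    Z = 1.96  # 95 % confidence level
    countries = {  # name: (mean score, standard error of the mean)
        "A": (506, 3.3), "B": (503, 2.9), "C": (499, 3.5),
        "D": (495, 3.1), "E": (491, 3.4), "F": (484, 3.0),
    }

    def interval(name):
        mean, se = countries[name]
        return (mean - Z * se, mean + Z * se)

    for name in countries:
        lo, hi = interval(name)
        surely_above = sum(1 for o in countries if o != name and interval(o)[0] > hi)
        surely_below = sum(1 for o in countries if o != name and interval(o)[1] < lo)
        print(f"{name}: 95% CI [{lo:.1f}, {hi:.1f}], "
              f"rank range {surely_above + 1} to {len(countries) - surely_below}")

With only part of the population tested, every mean carries such an interval, and the honest statement is a band of attainable ranks rather than a single place.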

This cannot, however, obscure the fact that in science Austria fell back from 8th rank (among 27 OECD states) and 519 points in PISA 2000 to 20th place (of 29 OECD states) with 491 points in the 2003 study (cf. ibid., 89).

The three previous PISA test domains (mathematics, reading, science) were extended in the 2003 study by 'problem-solving competence'. Here pupils were asked to confront 'non-routine (novel) problems' (ibid., 90); in the course of this, proposed solutions were to be prompted by processes of reasoning. In problem solving, the 15-/16-year-old pupils from Austria achieved 506 points, placing them above the OECD average of 500 points and (without taking the confidence interval 4 into account) in 15th place among 29 OECD countries (cf. ibid., 90; 91).

4 Taking the confidence interval into account, Austria would achieve 'ranks 13 to 17 among 29 OECD states' (ibid., 91).

It remains to be noted that at PISA 2003 the Austrian pupils produced their best performance in mathematics, with 506 points. In reading and science, only 491 points each were achieved. In 'problem solving', 506 points were scored. In all test domains, the Austrian 15-/16-year-old pupils thus occupied placements in the middle or rear third.

1.3 Corrected Main Results

The PISA results cited above, and their interpretations, do not withstand every scrutiny. If the corrected main results published by Neuwirth et al. are drawn into the comparison of the PISA results of 2000 and 2003, the results achieved by Austria in 2003 are put into perspective considerably. Neuwirth reports that on 'closer inspection of the data material', 'inconsistencies' were soon identified in the Austrian data (cf. Neuwirth et al. 2004, 11). The 'crash' read off from the published PISA 2003 data thus did not take place; the point is rather that the PISA data of 2000 and 2003 are not directly comparable (cf. ibid., 62ff). Neuwirth gives the following reasons:

– in PISA 2000 the participation of female pupils was higher than that of male pupils (cf. ibid., 11)

– the achievement results of Austrian vocational-school pupils (BerufsschülerInnen) were weighted less heavily in PISA 2000 than in PISA 2003. In a direct, i.e. unreflected, comparison, the second study shows a worsening of results, since vocational-school pupils lie at the 'lower end of the performance spectrum' (cf. ibid., 28ff); a small numerical sketch of this weighting effect follows below.
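How a mere change of weighting can masquerade as a performance decline is easy to see numerically. The subgroup means and sample shares below are invented for illustration, not the Austrian figures; note that nothing about any subgroup's performance changes between the two rounds:

    # Invented subgroup means and sample shares; not the Austrian PISA data.
    strata = {"academic": 540, "general": 470, "vocational": 430}

    def national_mean(shares):
        # shares: weight of each stratum in the weighted sample, summing to 1
        return sum(strata[s] * w for s, w in shares.items())

    shares_2000 = {"academic": 0.40, "general": 0.45, "vocational": 0.15}
    shares_2003 = {"academic": 0.35, "general": 0.40, "vocational": 0.25}

    print(national_mean(shares_2000))  # 492.0
    print(national_mean(shares_2003))  # 484.5

The 7.5-point 'drop' here is produced entirely by the heavier weighting of the weakest stratum, which is the kind of artefact Neuwirth describes.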

On the basis of these findings, Neuwirth states that 'the achievements of the Austrian pupils in reading have hardly changed, and both the reading values and the mathematics values lie close to the OECD average' (ibid., 62). Neuwirth concedes, however, that in science 'a clear decline in the Austrian values is discernible' (cf. ibid., 62).

Additional difficulties in comparing the achievement data of PISA 2000 and 2003 arise from the fact that some test areas (as in mathematics, for example) were newly created for PISA 2003, so that no comparison is possible for them (cf. ibid., 63). It should further be mentioned that considerable latitude of interpretation attaches to the PISA 2003 results once the confidence interval is (as it must be) taken into account (see above).

2 Public Media Reactions to the PISA Results

After this introductory account of the PISA results for Austria in international comparison, the public reactions to the results are now discussed. This matters not least because, as Uljens too recalls in this volume, PISA aims primarily at promoting competition (in the education sector) between the participant states and at promoting uniform educational standards in the participating nations, while entirely refraining from any explanation of how the differing results in the various countries come about. That task is left to the governments, the school systems and the media of the respective states (cf. Uljens in this volume). In Austria this has led to a media and political flood of PISA explanations which, though understandable in view of the 'Pisan' reticence with offers of explanation and interpretation, broke over the population shaped by preconceived convictions and correspondingly undifferentiated.



In what follows, the media reactions to Austria's performance in the two PISA rounds will therefore be traced through the most widely read daily newspapers in Austria (Kronen Zeitung, Kurier, Standard, Presse, Kleine Zeitung, Oberösterreichische Nachrichten, Salzburger Nachrichten, Tiroler Tageszeitung, Vorarlberger Nachrichten, Wirtschaftsblatt, Neues Volksblatt and Wiener Zeitung), which together could claim a net reach of about 75 % in 2001 and about 74 % in 2004. At least one of the newspapers named was read daily by roughly three quarters of the Austrian population over 14 (cf. Mediaanalyse 2007, 2007a). In addition, the original press releases of the Austria Presse Agentur (APA) on the subject of the PISA study following the publication of the results are presented, analysed and interpreted. The results of the two PISA rounds were each published at the beginning of December of the calendar year following their administration, i.e. in December 2001 and December 2004.

The period of presentation and observation accordingly extends in each case from the date of publication to 31 January 2002 and 16 January 2005, respectively. Examined in these periods are the articles in the daily newspapers and, for roughly three months after publication, the press releases of the APA that address, among other things, the Austrian results of the study. The periods were chosen because the strongest reactions are to be expected in the first weeks after publication of the results, though it must be recorded that the topic of the PISA study (or studies) was taken up in public again and again afterwards as well. The criteria by which the media reports are examined are the following:

– number of articles and press releases on the PISA studies in the periods in question
– author of the piece and persons given a voice in it
– assessment of the Austrian results in the piece (positive-neutral-negative)
– attribution of causes in the piece (who/what is to blame, is responsible)
– measures demanded in the piece (what must be done)
<strong>–</strong> Geforderte Maßnahmen im Beitrag (was muss getan werden)



2.1 The Reactions of the Daily Newspapers to the PISA Results of 2000 and 2003 Compared

The articles in these daily newspapers were retrieved from the papers' electronic archives and from the collection of the Austrian National Library, were made available by the newspapers themselves, or come from the collection of newspaper reports on schooling kept by the former Federal Ministry of Education, Science and Culture. Included were all articles that appeared between the official publication of the first PISA results on 4 December 2001 and 31 January 2002, or between the official publication of the second PISA results on 6 December 2004 and 16 January 2005 inclusive, that contain the word PISA, OECD or STUDIE, and that relate in some way to the Austrian PISA results. The categories into which the causes or reasons given for Austria's performance and the solutions or measures demanded in the newspapers were sorted are largely taken over from Schwarzgruber, who had already carried out an intensive analysis of the coverage of the 2003 results in Austrian daily newspapers. His categories are also suitable, in substance, for the newspaper reports on PISA 2000 and thus permit a sound comparison with the reports on PISA 2003.
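Stated as a procedure, the inclusion rule combines a date window, a keyword match and a topical check. The following Python sketch is purely illustrative (the record fields "published", "text" and "refers_to_austria" are hypothetical stand-ins; the actual screening was done by hand against the archives), but it makes the three conditions explicit:

```python
from datetime import date

# Hypothetical record fields; the real screening of the archives was manual.
WINDOWS = {
    2000: (date(2001, 12, 4), date(2002, 1, 31)),  # first PISA results
    2003: (date(2004, 12, 6), date(2005, 1, 16)),  # second PISA results
}
KEYWORDS = ("PISA", "OECD", "STUDIE")

def include(article: dict, wave: int) -> bool:
    """Keep an article if it falls within the wave's window, mentions one
    of the keywords, and refers to the Austrian PISA results."""
    start, end = WINDOWS[wave]
    in_window = start <= article["published"] <= end
    has_keyword = any(k in article["text"].upper() for k in KEYWORDS)
    return in_window and has_keyword and article["refers_to_austria"]
```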

The number of articles that the above newspapers devoted to the results differs strikingly between the two test years. As Table 1 shows, a total of 36 reports appeared in these papers in reaction to the first PISA results within the stated observation period, but 231 in reaction to the second (Schwarzgruber 2006, 69). If one takes into account that the observation window for reactions to the first PISA wave was two weeks longer, the real difference is even larger: the second PISA results received more than six times as much coverage in the Austrian daily newspapers as the first.
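The size of this gap can be checked with simple arithmetic. A minimal sketch using the window boundaries stated above: the raw ratio is 231/36, roughly 6.4, and once the longer first window is taken into account, the per-day reporting rate after PISA 2003 is roughly nine times that after PISA 2000:

```python
from datetime import date

articles = {2000: 36, 2003: 231}
days = {
    2000: (date(2002, 1, 31) - date(2001, 12, 4)).days + 1,  # 59 days
    2003: (date(2005, 1, 16) - date(2004, 12, 6)).days + 1,  # 42 days
}
print(articles[2003] / articles[2000])   # ~6.4  (raw ratio)
rate = {year: articles[year] / days[year] for year in articles}
print(rate[2003] / rate[2000])           # ~9.0  (per observation day)
```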

Table 1: cf. Schwarzgruber 2006, 69; Bozkurt/Brinek/Retzl 2007

Table 2: cf. Schwarzgruber 2006, 72; Bozkurt/Brinek/Retzl 2007

In half of the 36 newspaper articles on the PISA 2000 results (Tab. 2), journalists alone report. Politicians are quoted in every fifth report and scientists in every twelfth; 22.2% of the articles refer to other persons. In the 231 articles on the PISA 2003 results, by contrast, 45.5% reproduce the views of politicians, and only in every fifth report do journalists alone take a position. Scientists, at 16.9%, are cited proportionally twice as often as in the reports on PISA 2000. Representatives of industry appear in 2.5% of the articles, and comments by other persons are found in almost every seventh article.

Table 3: cf. Schwarzgruber 2006, 76; Bozkurt/Brinek/Retzl 2007

69.4% of the articles on PISA 2000 rate the Austrian results positively. Every fourth article is neutral towards them, and in only 5.6% of the articles are the results interpreted negatively. Austria's performance in PISA 2003, by contrast, is viewed negatively in half of the articles; the other half remain neutral. Not a single item found anything positive in the results. This shows clearly how polarised the presentation of the PISA results in the daily press was: it can fairly be described as a positive, at times euphoric attitude towards PISA 2000 and an atmosphere of catastrophe after PISA 2003.

Eight of the 36 reports (22%) on the PISA 2000 results put forward presumed causes. Of the 231 reports on the PISA 2003 results, 106 (46%) name causes or reasons for Austria's performance. A single report may name several causes; 10 cause mentions can be counted for the PISA 2000 results and 156 in total for the PISA 2003 results (cf. Schwarzgruber 2006, 85f). For 2003, then, every second report speculated about possible causes; for 2000, barely every fifth.
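Because several categories can be assigned to a single report, the unit of counting matters: reports naming at least one cause and cause mentions are tallied separately. A minimal sketch of this coding step, assuming hand-coded records with a hypothetical "causes" field:

```python
from collections import Counter

# Illustrative only: "causes" stands in for the hand-assigned
# Schwarzgruber categories (several per report are possible, which is
# why mentions can exceed the number of reports).
def tally_causes(reports: list) -> tuple:
    mentions = Counter()
    reports_with_causes = 0
    for report in reports:
        if report["causes"]:
            reports_with_causes += 1
            mentions.update(report["causes"])
    return reports_with_causes, mentions

# For the 2003 wave, e.g., the text reports 106 of 231 articles naming
# causes, with 156 cause mentions in total.
```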


Table 4: cf. Schwarzgruber 2006, 86; Bozkurt/Brinek/Retzl 2007

Held chiefly responsible for the thoroughly positively rated results of 2000 are the teachers, the school and the school system, as well as politics. For the negatively rated results of 2003, responsibility is attributed mainly to the school system (40 mentions), followed by politics (27) and the teachers or the school (24 mentions). Migrants, parents and reading ability in general are also frequently named as causes. Pupils and their performance or performance behaviour are considered as a cause with nine mentions, and are thus held responsible for the results comparatively rarely.

Solutions or measures were demanded in 19 (52.8%) of the reports on PISA 2000 and in 193 (83.5%) of the reports on PISA 2003, with 28 mentions of measures for PISA 2000 and 493 for PISA 2003 (cf. Schwarzgruber 2006, 111). Schwarzgruber's category system was extended by three categories, since the measures demanded after the PISA 2000 results could not all be assigned to the PISA 2003 categories (Tab. 5). The demands for more tests and evaluation and for a change of policy or more budget were each mentioned three times in the reports on PISA 2000, but never in the reports on PISA 2003. Measures such as more autonomy for schools, language courses and pre-school programmes, conversely, were named and demanded only in the reports on PISA 2003, not in those on PISA 2000.

Table 5: cf. Schwarzgruber 2006, 112; Bozkurt/Brinek/Retzl 2007

The demands voiced most often after PISA 2003 are general calls for reform (101 mentions), the comprehensive school (96 mentions) and, clearly less often but still frequently, the improvement of teaching and teaching quality (56 mentions). After PISA 2003 there were also 49 calls for the all-day school, 44 for changes in teacher education and in-service training, 34 for improving reading ability (literacy, reading tests) and 14 for a neutral analysis of the results. The measures demanded most often after PISA 2000 are, with five mentions each, the improvement of teaching and teaching quality and the improvement of reading ability (literacy, reading tests). General reform proposals were made three times after PISA 2000, and the comprehensive school was named twice as a suitable measure. The all-day school, a neutral analysis and the improvement of teacher education and in-service training were each mentioned once even after PISA 2000.

2.2 The reactions in the press releases of the Austria Presse Agentur (APA) to the PISA results of 2000 and 2003

The press releases analysed below come from the APA-OTS online archive; they contain the words PISA, OECD or STUDIE and establish a link to the Austrian PISA results by reporting on possible causes of the results or on measures and changes following from them, or by evaluating the results. APA Originaltext-Service GmbH (OTS) distributes press releases verbatim, with the sender bearing responsibility for the content (cf. APA-OTS 2007). These releases reach more than 650 Austrian newsrooms and press offices (all Austrian daily newspapers except the Kronen Zeitung, public and private television and radio, periodicals, publishers, international news agencies, ministries and press offices, politics, organisations and interest groups, and many more), 7,600 professional users of the platform APA OnlineManager (AOM), 12,500 subscribers to the mailing list APA-OTS Mailabo, around 15,000 users of the APA Online Pressespiegel and customer-specific selections, as well as web portals and WAP services (cf. APA-OTS 2007a). The reactions to the first PISA wave are examined between the publication of the first PISA results on 4 December 2001 and 1 March 2002. The reactions to the results of the second PISA wave are examined from as early as 1 December 2004 until 1 March 2005, because the PISA results became known before the official publication on 6 December 2004, so that heated media discussions had already broken out at the end of November.

Table 6: Bozkurt/Brinek/Retzl 2007


The trend already noted, namely that the second PISA wave attracted far more public interest than the first, is confirmed by the press releases as well: within the stated periods there were more than five times as many press releases in reaction to the second PISA results as to the first.

Table 7: Bozkurt/Brinek/Retzl 2007

Of the 14 releases on the PISA 2000 results, 64.2% come from politicians; of the 77 releases on the PISA 2003 results, as many as 87%. In both observation periods SPÖ politicians were heard most often, ahead of ÖVP politicians; for PISA 2003, more than half of all releases refer to SPÖ politicians, while only 15.6% reflect the views of ÖVP politicians. Reactions from teachers', pupils' and parents' associations, as well as from social organisations and interest groups, appear in both observation periods. In addition, the press releases reproduce opinions of industry representatives on PISA 2000 and the views of scientists on PISA 2003.

Table 8: Bozkurt/Brinek/Retzl 2007

Table 8 shows that the evaluations of the PISA results of 2000 and 2003 vary considerably. In 71.4% of the press releases on PISA 2000 the results are judged positively, and in only 7.1% negatively; no rating, or a neutral one, is given in 21.4% of the releases on PISA 2000. The picture is quite different for the PISA 2003 results: not a single release judges the 2003 results positively, while almost half judge them negatively, and more than half of the releases on PISA 2003 contain evaluative statements.

As causes of the largely positively rated PISA 2000 results, SPÖ politicians name the teachers once and the SPÖ governments (before the ÖVP-FPÖ coalition) twice. ÖVP politicians, in turn, credit Federal Minister Gehrer and the ÖVP-FPÖ government once, the teachers twice, and the differentiated school system and its permeability once. Representatives of teachers', pupils' and parents' associations, by contrast, see the cause of the negatively or neutrally rated PISA 2000 results once in Federal Minister Gehrer (and her austerity policy) and once in the selective school system (early division into strong and weak pupils, SS, HS, AHS; selection at the age of ten; an ossified system). No other group names a cause for the results of the first PISA tests in the releases examined.

Table 9: Bozkurt/Brinek/Retzl 2007

Table 10: Bozkurt/Brinek/Retzl 2007

The predominantly negatively rated PISA 2003 results are attributed by SPÖ politicians 21 times to Federal Minister Gehrer and the austerity policy of the ÖVP-FPÖ government (cuts in lessons, staff etc.). Three times SPÖ politicians send out the message that the selective school system, the division into strong and weak (HS-SS-AHS), the separation at the age of ten and the ossified system are to blame for the results; twice they name the teachers. ÖVP politicians say hardly anything about the causes of the 2003 results. Politicians of the Greens likewise name Federal Minister Gehrer and the austerity policy of the ÖVP-FPÖ government as a negative factor in one release, and in two of the 77 releases they deplore the selective school system, the large amount of time school consumes, the division into strong and weak (SS-HS-AHS), the separation at ten and the ossified system. FPÖ politicians name as causes once the SPÖ governments before the ÖVP-FPÖ coalition and once the migrants. Teachers', pupils' and parents' associations name Federal Minister Gehrer and the austerity policy of the ÖVP-FPÖ government in three releases. Representatives of social organisations and interest groups see causes of the mediocre PISA 2003 results in the selective school system and the large amount of time school consumes, in the division into strong and weak (HS-SS-AHS), the separation at ten and the ossified system. Scientists do not comment on possible causes or reasons for the PISA 2003 results.

2.3 Measures demanded in reaction to PISA 2000 and PISA 2003

Demands in the APA-OTS releases. Cells give mentions after PISA 2000 / after PISA 2003; – = none.

| Demand | SPÖ | ÖVP | Grüne | FPÖ | Ind./Wiss. | L-S-E | soz.Org./Int.v. | total |
| Abolition of achievement tests | 1/– | –/– | –/– | –/– | –/– | 1/– | –/– | 2/– |
| ÖVP-FPÖ government (BM Gehrer) | 3/2 | –/– | 1/– | 1/– | –/– | 1/– | –/– | 6/2 |
| All-day school | –/13 | –/2 | –/– | –/1 | –/– | –/3 | 1/1 | 1/20 |
| Comprehensive school | –/14 | –/– | –/2 | –/2 | –/– | 1/2 | 1/1 | 2/21 |
| Support (Förderung) | 1/10 | 1/1 | –/– | –/2 | –/– | –/– | 1/1 | 3/14 |
| Teacher education and in-service training | 1/2 | –/1 | –/– | –/1 | –/– | –/2 | 1/– | 2/6 |
| Infrastructure | 1/– | –/– | –/– | –/– | –/– | –/– | –/– | 1/– |
| Tests, evaluation | –/2 | 1/– | –/– | 1/– | 1/– | –/– | –/– | 3/2 |
| No tuition fees | –/– | –/– | –/– | –/– | –/– | –/– | 1/– | 1/– |
| Technical education | –/– | –/– | –/– | –/– | 1/– | –/– | –/– | 1/– |
| Joint reforms | 1/10 | –/7 | 1/– | –/3 | –/1 | –/2 | –/– | 2/23 |
| No school-structure debate | –/– | 2/1 | –/– | –/– | –/– | –/– | –/– | 2/1 |
| School autonomy | –/3 | –/– | –/– | –/– | –/– | –/– | –/1 | –/4 |
| Upper-secondary reform | 1/– | –/– | –/– | –/– | –/– | –/– | –/– | 1/– |
| SPÖ education programme | 1/7 | –/– | –/– | –/– | –/1 | –/– | –/– | 1/8 |
| Zukunftskommission | –/7 | –/1 | –/– | –/– | –/– | –/– | –/1 | –/9 |
| School partnership | –/– | –/2 | –/– | –/– | –/– | –/1 | –/– | –/3 |
| Curriculum reform | –/– | –/– | –/– | –/– | –/– | –/1 | –/– | –/1 |
| No marks | –/1 | –/– | –/– | –/– | –/1 | –/1 | –/– | –/3 |
| No adoption of another model | –/1 | –/– | –/1 | –/– | –/– | –/– | –/– | –/2 |
| Abolition of the two-thirds majority for school legislation | –/4 | –/– | –/– | –/2 | –/– | –/– | –/– | –/6 |
| New approaches instead of old ones | –/– | –/2 | –/– | –/– | –/– | –/– | –/– | –/2 |
| Links between school and working life | –/2 | –/1 | –/– | –/2 | –/– | –/– | –/– | –/5 |
| other | –/17 | 1/7 | –/2 | –/4 | –/1 | –/1 | 1/1 | 2/33 |
| total | 10/95 | 5/25 | 2/5 | 2/17 | 2/4 | 3/13 | 6/6 | 30/165 |

(Ind./Wiss.: industry representatives spoke up only after PISA 2000, scientists only after PISA 2003; L-S-E: teachers', pupils' and parents' associations; soz.Org./Int.v.: social organisations and interest groups.)

Table 11: Bozkurt/Brinek/Retzl 2007
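Read column-wise, Table 11 is a simple cross-tabulation of coded demand records with marginal totals. A schematic sketch of how such a table can be accumulated (the record format is hypothetical; the actual coding was done by hand):

```python
from collections import Counter

def crosstab(records):
    """records: iterable of (demand, group, year) triples,
    one per coded demand mention."""
    cells = Counter()
    for demand, group, year in records:
        cells[(demand, group, year)] += 1
        cells[(demand, "total", year)] += 1   # row totals
        cells[("total", group, year)] += 1    # column totals
        cells[("total", "total", year)] += 1  # grand total per year
    return cells

# e.g. one coded record:
# crosstab([("Gesamtschule", "SPÖ", 2003)])[("total", "SPÖ", 2003)] == 1
```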

Table 11 lists the demands voiced after PISA 2000 and PISA 2003 in the APA original-text releases under study, and also shows who raised a given demand and how often. It is thus readily apparent that the various parties, associations, organisations and interest groups often raise different demands which, hardly surprisingly, are very much to be understood in terms of the ideology or interests of the respective group.

The absence of professionally sound interpretations and scientific conclusions from the PISA consortium contributed substantially to conclusions and assessments being left, in many cases, to the respective representatives or "advocates" of ready-made opinions. Obvious political opponents are therefore easy to recognise by the contrariness of their demands (cf. the next section of this chapter). Pre-formed, ideologically nourished convictions and plans were thereby consolidated, and hardly any side offered occasion or impetus for rational argument; the relevant scholarship, too, largely refrained from doing so. On the whole, the 14 press releases after the publication of the PISA 2000 results contain a total of 30 demanded measures, the 77 press releases after the publication of the PISA 2003 results a total of 165. Representatives of industry speak up only in the releases on PISA 2000, scientists only in those on PISA 2003.

The demand voiced most often in 2000, with six mentions, is that Minister Gehrer and the federal government should "take their hat", "not rest on their laurels", show more commitment, stop the cuts and work out reforms. Demanded three times each were:

– (early) reading and language support (a pre-school year for all) and support for the gifted (from kindergarten on), reading tests in primary school and support for pupils at risk (vocational schools), and
– tests, benchmarking, performance comparisons, quality management, evaluation.

Notably, the abolition or reduction of achievement tests is also seen twice as a suitable path. Mentioned twice each as demands are furthermore:

– the comprehensive school, the merging of school types and no selection at the age of ten
– teacher education and in-service training: for instance, diagnosis and therapy of reading difficulties, raising training to academic level (university or PH)
– joint reforms (government together with the opposition), a parliamentary education enquete, a crisis summit (precise data analysis plus research into causes), a look at other countries (Finland)
– no school-structure debate, but retention of the differentiated system and improvement of teaching.

Demanded once each in reaction to the PISA 2000 results are:

– the all-day school
– the improvement of school infrastructure (computers)
– the abolition of tuition fees
– more educational provision in the technical field (HTL, FH)
– an upper-secondary reform
– the realisation of the SPÖ education programme.

The demands voiced most often after PISA 2003 are, with 23 mentions, "joint reforms (government together with the opposition), a parliamentary education enquete, a crisis summit (precise data analysis plus research into causes), a look at other countries (Finland)". Close behind follow, with 21 mentions, the demand for "a comprehensive school, the merging of school types and no selection at the age of ten" and, with 20 mentions, the demand for an "all-day school". (Early) reading and language support (a pre-school year for all), support for the gifted (from kindergarten on), reading tests in primary school and support for pupils at risk (vocational schools) are also urged emphatically (14 mentions).

Frequently expressed, moreover, is the wish for the "implementation of the proposals of the Zukunftskommission 5" (9 mentions) and for the implementation of the SPÖ education programme (8 mentions). It is striking here that it is mainly the SPÖ that urges implementation of the Zukunftskommission's proposals, although this commission was set up by the ÖVP minister Gehrer. Named six times each are:

– teacher education and in-service training: for instance, diagnosis and therapy of reading difficulties, raising training to academic level (university or PH)
– the abolition of the two-thirds majority for school legislation.

The demand for "closer cooperation between school and the world of work" is mentioned five times, the "extension of school autonomy (schools and municipalities decide)" four times. Three times each the APA original-text releases call for:

– the abolition of marks and of grade repetition, and
– the development of constructive school partnership.

Twice each it is concluded from the PISA results that:

– Minister Gehrer and the federal government should "take their hat", "not rest on their laurels", show more commitment, stop the cuts and work out reforms
– the use of tests, benchmarking, performance comparisons, quality management and evaluation should be extended
– no other school models (such as the Scandinavian ones) should be adopted unquestioned
– new paths should be taken and no "old hats" dug out.

Mentioned once each is the demand that:

– no school-structure debate be conducted, but the differentiated system retained and teaching improved, and
– the curriculum be reformed.

5 For more on the Zukunftskommission, see section 3.2


2.4 Summary

The number of newspaper reports and APA original-text releases on the Austrian PISA results in the period after their publication was many times higher in 2003 than in 2000; PISA thus moved into the focus of the Austrian public only after the second wave. The evaluation of the results in the two test years differs just as sharply: while Austria's performance in PISA 2000 was rated predominantly positively, the Austrian PISA 2003 results were judged predominantly negatively. This points to a strongly polarising presentation of the PISA results of 2000 and 2003 in public.

With the exception of the journalists, who naturally appear most often in the daily newspapers but hardly at all in the APA original-text releases, politicians are the most strongly represented group in both media. Scientists are cited more often in the newspapers (and for both PISA assessments) and also appear in the press releases on PISA 2003. A few views of industry representatives can be found in the press releases on PISA 2000 and in the newspaper reports on PISA 2003. Parents', teachers' and pupils' associations as well as social organisations and interest groups repeatedly have their say in press releases, but are hardly considered explicitly in the newspaper reports.

The clear dominance of politicians among the persons featured in the media under study indicates that PISA is perceived in public mainly as a political event. This creates the mistaken impression that it can and must be responded to politically, and that politics also bears the responsibility for the test results. Handing the interpretation of the PISA results over to politics and various interest groups leads to an ideologisation, and thus to a considerable overestimation, of the study in the public eye, suppressing the use of factual arguments and foundations. Such a factual engagement with PISA must therefore be made up for after the fact, in collected volumes such as this one, and thereby opened up to scholarly discourse.

The transfer of responsibility for the PISA results of 2000 and 2003 to individual groups, e.g. the teachers, politics or the school system as a whole, is itself already a consequence of an ideologised discussion.



For the strongly negatively rated performance in the PISA 2003 study, both the newspapers and the press releases frequently also name migrants and their proficiency in the language of instruction, without entering into any discussion of proposals for improvement. Deficient reading ability among pupils is pointed out in the newspapers, but is not regarded in the press releases as a cause of the PISA results; "fear of school", conversely, is mentioned only in the press releases.

After the publication of the PISA 2000 results, a change of policy, more budget and an end to the austerity measures are demanded most often in the daily newspapers and are also articulated comparatively often in the press releases. The demand for more tests and evaluation, as well as for a comprehensive school and an all-day school, is likewise voiced often in both media. Furthermore, the call for general reforms, to be worked out jointly after closer analysis, can be identified in both media, as can the demand for improving or changing teacher education and in-service training. The improvement of teaching and teaching quality as well as literacy and the improvement of reading ability, which are demanded most often in the newspapers, hardly appear in the press releases at all. All other measures are named in only one of the two media.

The picture is somewhat different for the frequency of the individual demands after the publication of the PISA 2003 results. In the daily newspapers as well as in the press releases, the demand for general, joint reforms is named most often, and the comprehensive school is the second most frequently demanded measure in both media. The all-day school, demanded very often in the press releases with 20 mentions, also ranks high in the newspaper reports with 49 mentions, i.e. among the top five.

The demand for support, (early) reading and language support (a pre-school year for all), support for the gifted (from kindergarten on), reading tests in primary school and support for pupils at risk (vocational schools), with 14 mentions in the press releases, belongs, together with the newspaper categories "pre-school measures" (50 mentions), "language courses" (28 mentions) and improvement of reading ability or literacy (34 mentions), to the most frequently named measures in both media. Likewise regularly represented in both media is the demand for changes in teacher education and in-service training; less frequent, but still present, is the call for more school autonomy. The implementation of the SPÖ education programme (8 mentions) and of the proposals of the Zukunftskommission (9 mentions), the abolition of the two-thirds majority for school legislation (6 mentions) and improved links between work and school (5 mentions) are relatively frequent demands in the press releases, but do not appear in the newspaper reports. Conversely, the improvement of teaching and teaching quality is mentioned very often in the newspapers, with 56 mentions, but only once, and then merely in passing, in the press releases.

It is striking that in both test years the demands for general, joint reforms, for a comprehensive school and for an all-day school are among those named most often. Moreover, after PISA 2000 as well as after PISA 2003, the demand for reading and language support (literacy through pre-school measures to eliminate the at-risk-pupil phenomenon) occurs frequently, and the improvement of teacher education and in-service training as well as of teaching and teaching quality is regularly named as a demand in both test years.

It is also worth mentioning that after PISA 2000 a change of policy, more budget, an end to the austerity measures, and more tests and evaluations are demanded very often, whereas after PISA 2003 these demands are of no, or at best subordinate, importance.

3 The PISA results – their assessments and consequences in education policy

Alongside the annotated presentation of the PISA results of 2000 and 2003 and the account and discussion of their reflection in the media (cf. sections 1 and 2), the education-policy assessments show the nature and degree of the public agitation on a further level outside the "scientific community". Dispensing with any reflection on what the OECD itself named as the explicit and implicit aims of the testing, and bearing in mind the specifically Austrian tradition of "agitation", one can speak in Austria of the debate taking on a political and argumentative life of its own, one that has not really subsided to this day. The basis for a serious discussion of the further development of the school system had in fact been laid, for instance with the findings of the ministerial Zukunftskommission, which, however, were discussed with only moderate engagement.

Only in recent weeks, prompted by the symposium on which this volume is based, and less on the basis that shaped the discussion in Germany, has there been a cautious pause in the thoughtless singing of PISA's praises. Reflecting on the question of what good teaching is and how it can succeed, and recalling Humboldt, the point is by no means to train "test artists" (Hartmut von Hentig), but to be able to meet the challenges of tomorrow, whatever these may turn out to be.

The PISA results were received differently in the countries examined and were discussed and interpreted, both pedagogically and in terms of education policy, in either an agitated or a relaxed manner. Like hardly any other study and assessment of pupil performance, they supplied occasion for analyses and conclusions and, at least in Austria and Germany, were soon overlaid by speculation and political reflexes. The education-page editors of the dailies and weeklies mustered "school experts" who focused on various aspects and promptly claimed to know why the country in question had performed as it had and what consequences were to be drawn.

Discussions about the pedagogical or didactic efforts needed to improve teaching, or systematic reassurance within educational science as the consequence of careful analyses, were largely crowded out by the politics of attention and ended up the exception rather than the rule . . .

What was overlooked is that the PISA study "can complement and deepen the respective national perspective by placing national results in a larger context for better interpretation and by (allowing) the respective strengths and weaknesses (to be gauged) in the light of the performance of other education systems". PISA, it is said, has created the basis for dialogue and cooperation in defining and implementing educational goals, with the competencies relevant for later life in the foreground (Schleicher n.d., 9). There is no talk of inferences about (general) education as it is conceived in the comprehensive Central European sense, nor of any causal relation between PISA results and the school system.


3.1 Didactic improvements

Some countries acted decidedly in accordance with this brief. "The alarming and disquieting findings of the first PISA study" led in Germany, very soon after the publication of the first results, to an international conference of the Gesellschaft für Fachdidaktik (the umbrella organisation of all the academic subject-didactics societies), with the aim of "developing perspectives for an improvement of subject-specific as well as cross-curricular learning and teaching (...)" (Bayrhuber/Vollmer 2004, 7).

In her programmatic lecture, Federal Minister Edelgard Bulmahn (ibid., 25f) presents approaches to educational reform. She calls for more "educational optimism" and, with reference to Finland, concentrates on the principle of individual support; she can also immediately offer additional funds from the German federal government (4 billion euros) to be invested in all-day school programmes, so that, unlike under the old "lockstep pedagogy" (G.B.), teaching can now be done differently on the basis of larger time budgets. What remains open is the inner form of the "all-day school", since the bet is placed on partnership and cooperation with sports clubs, music schools, parents' initiatives and the like. 6 The point, she argues, is the early identification of deficits and strengths in children, above all in the area of language, reading and writing competence, and a focus on subject didactics: "A better quality of teaching in our schools can only be achieved through a didactic transformation" (ibid., 28), that is, through a coherent combination of training in the academic discipline and in educational science and didactics.

6 In Austria the difference between the Ganztagesschule and all-day school forms is essential: in the former, lessons, practice periods and leisure periods alternate across the day and attendance for the whole day is compulsory; in the latter, compulsory lessons are scheduled essentially in the morning, while consolidation, practice, leisure and sport are offered in the afternoon hours, with attendance voluntary.


Staatsministerin Karin Wolff, president of the Kultusministerkonferenz, points to the swift and results-oriented action taken after the presentation of the PISA results, which pursues a concentration on support for children from migrant backgrounds, 7 on reading ability and on improvement in coping with complex tasks. All of these dimensions, she holds, concern subject didactics, since what is at stake is the improvement of teaching quality. Under this aspect the Kultusministerkonferenz has singled out educational standards as the central means of securing the quality of school education (more detailed definitions are supplied on pp. 36ff of the book in question).

7 In Hesse, only pupils who have a command of the German language are enrolled in school. In Finland, too, special preparatory language classes have been set up for children from migrant backgrounds; Sweden and other countries have preparatory classes for children who need to catch up in the language of instruction. Austrian schools know only the possibility of extraordinary enrolment, which is legally limited to one year, and supplementary instruction limited to a few hours per week, offered during regular lessons.

With the evaluation by federal state (Land), however, the education-policy and school-organisation debate became correspondingly charged in Germany as well, since the southern Länder, with their tracked school system, performed, roughly speaking, better in PISA than the northern ones.

Peter Bender, University of Paderborn, in a contribution "für die GDM Nr. 81", takes up the PISA comparative studies and those articles that deal with the critique of the PISA critics and with consequences for school organisation: "The Bavarians were by far the best, (...) the next three places (went to) Baden-Württemberg, Saxony, Thuringia. The integrated comprehensive school, which had fared badly in the direct comparison of school forms in PISA 2000, had now been taken out of that comparison (...) for statistical reasons. But the PISA weakness of the comprehensive school is still recognisable (...). Nor can any honey be sucked from the international PISA and TIMSS figures for a unitary school system. The top countries do all have one, but, and this is persistently and conveniently ignored, so do all the countries in the lower half of the table. The few countries with an early-tracked school system (Belgium, Germany, Austria, Switzerland, Slovakia, the Czech Republic), by contrast, are all found in the upper half. The international PISA figures thus likewise tend to speak against the unitary school. I believe, however, that they do not speak for or against school systems at all, but are an expression of the cultural-technical stage of development, the degree of achievement orientation and the migration structure of the respective society, and this essentially independently of the school system" (Bender 2007). In particular, the Scandinavian countries have been lost as models or, Finland apart, are on a par with Germany; the conditions in Sweden, more favourable than Germany's in terms of migration policy, are likewise pointed out once again.


As a further aspect of interpretation, Bender brings the question of equality (and inequality) of opportunity into his statement and points to methodological errors connected with the conclusions drawn from "educational participation" and "economic-social-cultural status".

Direct OECD statements, made by Andreas Schleicher in various places, are said to bring the motives of the PISA study to light. A few years ago, according to Bender, the school system was linked to growth in gross national product, in ignorance of other essential determinants. 8

8 Transferred to Austria, extremely reassuring conclusions about the school system could be drawn from this, since the most recent economic data show comparatively very good results.

Similarly coarsely "carved", he argues, are the competition-stimulating tables on the (increase in) educational expenditure: by that measure Mexico would sit alone at the top, according to Bender.

In an open letter to the deputy chair of the GEW, Marianne Demmer, Bender also takes a position on the critics of the book "PISA & Co – Kritik eines Programms", and on the reflex the critique triggered in Germany, which would, in a manner of speaking, prove that an analysis based on arguments leads (or led) to vilification and persecution. More on this in the contribution by Stefan Hopmann.

In this context one may also interpret the statement made in Germany in 2006 by Vernor Munoz of Costa Rica, the special rapporteur of the UN Commission on Human Rights, in which he criticised the tracked school system and found, in effect, that language is not decisive for the integration of families from migrant backgrounds . . . 9

9 Who, here, has failed to understand, or simply read past, the core of the PISA tests: literacy?

3.2 The "Zukunftskommission"

In the Austrian report PISA 2000 – Lernen für das Leben, the responsible federal minister, Elisabeth Gehrer, sums up the necessary consequences and steps in her foreword to the results report: "The detailed evaluation now available gives important indications of the areas in which the efforts to raise the quality of education should be stepped up further. Austrian children should be able to read reliably and with comprehension by the end of the third year of primary school at the latest. Reading is the cultural competence, even in the age of automation. The Ministry of Education has therefore launched the project ›Lesefit‹ under the motto 'being able to read means being able to learn'. With the involvement of parents and the book club, it must be achieved that all children leave primary school with excellent reading skills." With reference to the "thematic reports", attention is drawn to the "different competencies of girls and boys as well as of German-speaking and non-German-speaking pupils", and the detailed evaluation is appreciated overall for its feedback function for raising quality in the education system. The foreword to the PISA 2003 study recalls ›Lesefit‹ and refers to the extension of the project IMST (Innovations in Mathematics, Science and Technology Teaching). With the initiative ›klasse:zukunft‹, the development of educational standards and the purposeful continuation of internal school reform, Austria is on the right path to improving and securing the quality of teaching in a lasting way, according to Gehrer. She also remarks in closing, however, that "performance measurements such as PISA" supply "important snapshots" but embody only a "part of the achievements (...) produced (...) at our schools".

At the education minister's request, the so-called Zukunftskommission was set up by decision of the Council of Ministers of 1 April 2003 under Günter Haider, the Austrian head of PISA; it was to formulate education-policy consequences from the OECD study (its further members were Christiane Spiel, Ferdinand Eder, Werner Specht and Manfred Wimmer). In 2005 the commission presented its findings, an analysis-and-measures paper whose substance is not so far removed from the German conclusion: it aims at the improvement of teaching quality.

The reform goal named is: improving school and teaching systematically (emphases by the authors in each case). "Both the results of recent performance measurements (above all PISA) and the deliberations on quality improvement in schools that have been under way for more than a decade, as well as the analysis of the framework conditions in Austria, suggest placing the teaching and learning processes in the classroom, the contents of teaching and the teaching methods, in short 'good teaching', at the centre of the reform measures. Reform strategy: quality development before structural reform.

In its first report the Zukunftskommission placed the main weight of its proposals on quality assurance, quality development and the expansion of a dependable school, and less on the rebuilding of structural features and organisational elements. It stays on the same line in this follow-up report. The measures proposed by the Zukunftskommission therefore aim at improving teaching through school development and quality assurance, through teacher education and support systems, and not through a rebuilding of the system.

The overall strategy is oriented to the following four principles:

1. Systematic quality management: promotion of quality development and quality assurance at all levels. (...)
2. More autonomy and more self-responsibility: greater room for action with transparent performance and accountability. (...)
3. Professionalisation of teachers: criterion-based selection, competence-oriented training, performance-oriented career paths. (...)
4. More research & development and better support systems. (...)" (cf. Abschlußbericht – Zusammenfassung).

In the summarising recommendations, five fields of action (with individual sub-fields) and priority, overarching fields of research & development were formulated for this purpose and set out in detail.

These activities, which were received quite positively in public (cf. also the parliamentary debates cited in this chapter), were followed by several more:

On 9 February 2005, in an expert opinion for Federal Minister Gehrer written in the light of the PISA results, Peter Posch discusses "some possible reasons for the weaknesses of the Austrian school system and approaches to overcoming them: What is essential (...) is the recognition that improvements are to be expected only from a complex ensemble of measures."

Among his ten points he does touch on the "question of the fragmentation" of the school system and its consequences for weaker pupils and for those from difficult social backgrounds; in 2000 and 2003 this group is said to have performed alarmingly badly because the performance stimulus provided by stronger pupils was missing. His first point, however, is quality assurance: "The introduction of an obligation to have a school programme, in which schools are required at set intervals to give an account of initiatives, and of their results, for the further development of teaching quality and of the school's framework conditions, was not anchored in law, although a detailed proposal had been worked out as early as 2002 (...)." A further point for Posch is the gaining of school time, i.e. morning lessons alone do not suffice. Besides the improvement and professionalisation of teacher education, the author urges that current teaching be developed further methodically, so that demanding tasks and thinking performances can be mastered better in future (cf. TIMSS). More transparency in the formulation and assessment of performance expectations should result, among other things, from continuous in-service teacher training and professional cooperation in subject-group teams. Finally, the qualification of school heads and a strengthening of the management level are needed, and the "supervision vacuum" between school heads (responsible for the school programme) and the school inspectorate (responsible for the quality of self-evaluation) is to be eliminated.

As an essential point, Posch highlights the probably disadvantageous effect of migrant children's inadequate knowledge of German. On this: "Establishment of programmes to secure the German skills of children from migrant backgrounds, not merely alongside their school career but (...) before they enter school." A high concentration of children whose mother tongue is not German should be avoided in the process. He closes with a reference to Finland, where Finnish is taught systematically from kindergarten age.

3.3 Public reactions

In the mirror of media reception, the public discussion about consequences from PISA and comparative school achievement tests looks different. Whereas the first PISA report met with much, at times also undifferentiated, satisfaction, self-praise and calm, along the lines of "everything in the green zone" or "got away with it once more, and above all did better than Germany" (more on this in section 2), the reactions to the second PISA report condensed into reports of catastrophe.

Finland often serves as the watchword for all that is good in pedagogy, and the reluctance to charge it morally and overstrain it is fading rapidly; nor can other international educational analyses put this into perspective, for example the OECD country report 2005 cited by Education Minister Gehrer, which is quoted for press briefings and related information: "A particular characteristic of Austria can be seen in the great variety of school types. The high level of vertical and horizontal differentiation shows advantages, but also limitations. The school system gives parents a great freedom of choice, especially in Vienna and the large cities, although this could also lead to fragmentation and high costs." After the appearance of this country report, the argument is supported by an analysis in the FAZ: "What is decisive is not the particular school system, but the intelligent and considered handling of the existing school tradition. After this country comparison there is less reason than ever to sacrifice the three-track school system established in Germany to a unitary school system on the Scandinavian model. Rather, the countries that have relied on a three-track school system with high quality standards in all school types are in the lead (...)" (press information of the BMBWK, most recently of 16 Dec. 2005). In the same release the ministry also points to the successful performance of Bavaria, a Land with a tracked school system, as well as to Austria's low youth unemployment and to the WHO report on well-being at school: the results for the question "Do you feel comfortable at school?" put Austria in 3rd place and Finland in 34th (last) place.

None of this, however, contributes (any longer) to making the debate more objective. It shows only a pedagogically and systematically motivated flicker of a serious attempt to put the present-day school and its function into perspective, not to overstrain it as a comprehensive "apparatus for producing justice", but to underline the school's contribution to the creation of a humane society as modestly as, on the basis of the current state of research, it may and must be stated . . .

Members of the Zukunftskommission, too, take a similar position on a possible rebuilding of the school system; its chairman Haider, for example, repeatedly points, above all orally, to the Austrian tradition and school culture (e.g. Haider in the parliamentary Education Committee and at an ÖVP study conference in Alpbach).

3.4 Parliamentary resonance

On 3 July 2003, "the PISA study and the activities of the Zukunftskommission" were up for discussion within the framework of a so-called Aktuelle Aussprache (topical debate) in the parliamentary education committee (in the committees, legislative work is prepared, discussed and adopted by majority in order to be voted on afterwards in public plenary sessions; topical debates are part of the discussions in the committee meetings, to which experts are also invited). Günter Haider was invited to this as an expert. Federal Minister Gehrer opened by outlining the aim of the Zukunftskommission; the Austrian PISA chairman followed on and underlined that the Zukunftskommission would "deal in particular with quality and quality development, with the emphasis placed on the improvement of instruction. According to Haider, 80 % of the quality improvement can be achieved in this way, with teacher education as the main point of leverage. Only 20 % of the improvements, he estimates, could be achieved by organizational measures. Particular attention will be paid to preparation for lifelong learning and to a secure and comprehending reading ability, for without it no independent acquisition of education is possible" – which would be attainable only through corresponding instruction (Parlamentskorrespondenz No. 536). According to Haider's remarks, the work of the Zukunftskommission aims at an overall concept, which can only be implemented in the long term; short-term effects are ascribed to the implementation of quality management. As a long-term project the expert named the turn towards an orientation on accountability as well as the realization of strengthened autonomy...

In the subsequent debate "the different emphases placed by the opposition and the government factions" then became apparent. From the statements of the SPÖ and Green deputies the view could be heard that "organizational measures could very well exert a stronger influence on the quality and the output of instruction." For Werner Amon of the ÖVP – thus the exemplary contribution – the point was to develop the relatively good school system further through a concentration on the improvement of instruction and the expansion of autonomy. After an extensive debate Haider also underlined the necessity of extending all-day schooling and instruction time and of dividing the curriculum into core and extension subject matter.

On 1 December 2004 the results of the second PISA study, which had become known shortly before, were discussed in the parliamentary education committee within the framework of the topical debate (a final official result was not yet available at that time; until 7 December the results remained the property of the OECD). Federal Minister Gehrer conceded that Austria had fallen back in the ranking. At the present moment, however, she considered it "mistaken to make premature attributions of blame and to propagate a particular school system as a remedy" (Parlamentskorrespondenz No. 893). Besides pointing to the intended analysis, with which she wanted to commission Haider, she recalled measures already taken as well as the situation in youth unemployment, which was positive compared, for instance, with Finland.

In order to do justice to the pending reforms in the education system, the work of the parliamentary education committee on 20 April 2005 was devoted to the decision on abolishing the two-thirds majority for school laws; for this the support of the opposition was necessary. The basis for the discussion within the topical debate was the final report of the Zukunftskommission; Günter Haider was once again available for questions and contributions to the discussion and once more described the systematic improvement of instruction as the central element of school reform. Part of this was to foster learning and achievement capacities optimally, to raise the qualification of teachers, and to improve the sustainability of instruction. "Quality development has precedence over structural reform," he said, and also took the view that early language support should be implemented as quickly as possible (Parlamentskorrespondenz No. 272). As was to be expected, the opposition deputies responded with the wish for a structural reform, just as the ÖVP deputies regarded the qualitative improvement of instruction as the priority. After a further statement by Federal Minister Gehrer – she recounted all the improvement and school-development measures taken so far – the discussion was resumed after a short interruption, followed by the debate on new majorities for school laws. Reforms were to be implemented faster and more simply than before. Reference was furthermore made to the planned, and indeed budgeted, roughly 300,000 additional support lessons and to the likewise envisaged expansion of all-day care.

In the course of the following months, however, the political discussion about PISA – or rather the publicly conducted debate – becomes argumentatively ever narrower and concentrates, ideologically charged, on the question of "comprehensive school or tracked school system".

That from 2006 onward the approaching federal election campaign is reflected in this, though not only in this, becomes obvious, for instance, in the fact that now, among others, the Austrian PISA chief Haider increasingly behaves differently in public – that is, deviating from his substantive line and his original recommendations, crossing the boundaries of science, and making political value judgements – and accuses the Education Minister, in effect, of grave education-policy omissions, immobility and weariness of office.

3.5 In-Depth Analysis

In 2005 a discussion also develops in Austria about the evident methodological weaknesses of the PISA survey, i.e. the design and the evaluation of the achievement measurement. Federal Minister Gehrer commissions Erich Neuwirth and his team at the University of Vienna with in-depth analyses and contributions on methodology. Scientifically validated statements were to help clarify and interpret the differences. The section "Corrected main results" (Neuwirth, 62 ff.) addresses the now possible sound comparison of the PISA results of 2000 and 2003. "It turns out that the achievements of Austrian students in reading have hardly changed and that both the reading scores and the mathematics scores lie close to the OECD average. In the sciences, by contrast, a clear decline of the Austrian scores is discernible" (Neuwirth, 62). It should also be noted that the mathematics scores are not directly comparable, because the domains covered in the mathematics test were greatly extended in 2003, so that the field of competence examined was no longer exactly the same as in 2000.
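A note for orientation, drawing on the standard PISA reporting methodology rather than on Neuwirth's text itself: PISA results are reported on a scale that was normed, in the base cycle of each domain, to a mean of 500 and a standard deviation of 100 across the OECD student population. A proficiency estimate theta is mapped linearly onto the reporting scale as

    score = 500 + 100 * (theta - mu_OECD) / sigma_OECD

where mu_OECD and sigma_OECD are the mean and standard deviation of the OECD norming population. "Close to the OECD average" thus means close to 500 on this relative scale; it says nothing about an absolute level of competence.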

The analysis of gender differences yields no noteworthy result either in reading or in the sciences. If one analyses the response formats in the sciences, one arrives at a revealing finding: on tasks with open, free (verbal) response formats Austria's students performed worse than in 2000 (cf. Neuwirth, 71, 75). This applies above all to female students at vocational schools (Berufsschulen) and intermediate vocational schools (Berufsbildende Mittlere Schulen).¹⁰ "Of the publicly discussed 'PISA crash' in all disciplines and of the drastic divergence of the genders' reading scores, nothing remains once the corrected data are analysed," the author states (Neuwirth, 64).

¹⁰ One relation may be articulated in this context, namely to the development of the number of children with a migration background and to the experience gained in integration work: in this period Austrian federal policy was implementing its programme of family reunification, i.e. it was mainly children who were coming to Austria...

The team of statisticians comes to the conclusion that the data material does not stand up to closer scrutiny with regard to consistency (cf. chapter 1.2.1) and that there can be no talk of an Austrian "crash". For example, in PISA 2000 the share of girls in Austria was higher than the share of boys, which contradicts all basic demographic knowledge. More precise analyses using additional, not publicly available data then showed that the data of the two PISA surveys of 2000 and 2003 are not directly comparable, specifically for Austria... (Neuwirth, 11).
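The consistency argument just cited can be illustrated with a minimal sketch in Python; the function name and all figures below are purely hypothetical, not the actual PISA sample sizes or cohort shares. Under simple random sampling, the share of girls in a national sample should not deviate markedly from the share of girls in the corresponding school population.

    import math

    def gender_share_pvalue(n_girls: int, n_total: int, p_pop: float) -> float:
        """Two-sided test of whether an observed share of girls is compatible
        with the population share under simple random sampling (normal
        approximation to the binomial, adequate for large samples)."""
        p_hat = n_girls / n_total
        se = math.sqrt(p_pop * (1.0 - p_pop) / n_total)
        z = (p_hat - p_pop) / se
        # two-sided p-value from the standard normal distribution
        return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

    # Hypothetical illustration only: 52.7 % girls in a sample of 4,500
    # students, against an assumed cohort share of 48.8 % girls.
    print(gender_share_pvalue(n_girls=2372, n_total=4500, p_pop=0.488))

A vanishingly small p-value of this kind does not by itself prove an error in a particular data set; in the logic of the argument above it merely signals that such a sample composition points to a sampling or weighting problem rather than to a substantive finding, so that comparisons built on it require correction.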

3.6 "PISA Has Something for Everyone"

Neuwirth's analysis is scarcely able to change anything about the generally overstretched interpretation of the PISA results in Austria. The political opposition charges the discussion morally and demands less selection and more justice from the school system; it is communicated that a comprehensive school system will deliver this. The original intention of PISA – to fuel (economic) competition among ever more countries with unmanageable masses of data (the third PISA report presents data from 60 countries) – goes unconsidered. As far as the school system is concerned, opinion polls continue to show a constant assessment by Austrians:

From the recent past up to the present, the tracked school system clearly led the comprehensive school in Austrian surveys (mostly between 65 and 75 %). Attitudes towards school reforms, however, developed into a kind of collective awareness of the necessity of improvements. Detailed polling, though, yielded no conclusive picture.

In spring 2007 it is rumoured in teachers' circles that Austrians now voted for the comprehensive school after all – once the question had been preceded by the information that PISA had shown the comprehensive school to be the better school model. The sources cannot be traced; the resonance remains modest overall...

In accordance with the SPÖ-ÖVP government programme of January 2007, the Education Minister sets up a reform commission to identify model regions for trying out forms of comprehensive schooling. The majority of its members are not experts from the fields of educational theory and science or school research, but have in the past essentially made themselves heard as opponents of the tracked school system. Günter Haider acts as the minister's adviser on education.



Individual federal provinces present their variants of a reformed school of the future. Upper Austria, for example, supports a one-year extension of primary school, at the end of which the decision on the further educational path is to be taken. Lower Austria relies on a kind of two-year orientation stage after primary school and invokes high acceptance for it: according to the Fessel+GFK institute (July 2007), 78 % would want to stay with the existing school system and only 18 % would vote for a single school type (3 % no answer); 62 % consider the above-mentioned orientation stage a good idea. Other provinces have signalled their cooperation in implementing a comprehensive-school model region (e.g. Carinthia, Burgenland). The published results of the trials with models of the integrated comprehensive school over recent decades go unconsidered.

At present – though no longer as heatedly as in the years before – PISA stands on the one hand, as it were, for a national insult with which people have not yet really learned to cope; on the other hand, above all on academic ground, the "OECD machinery" is associated with what it wanted to be itself, namely a licence to shake up the international worlds of schooling and business. Wolfgang Horvath quotes the OECD itself on this: PISA serves "(...) a better understanding of the returns to education in the most developed countries and in those still at an earlier stage of economic development" (Horvath, 208), and it focuses on "the competences relevant for later life" (Schleicher, 9). In public discussion, however, this motive is not taken into account – not least because PISA is technically designed in such a way that a result determined in this manner, to vary the slogan of the Österreichische Post AG, "has something for everyone".

Beyond this it has so far remained largely unreflected that behind PISA lies an educational norm, a conception of (general) education that is by no means self-evident in countries with the tradition of Humboldt and Schleiermacher. "This concept of education, in itself thoroughly worthy of discussion, whose roots are to be sought in a concrete historical and cultural space, is rendered absolute in its claim to validity by the notorious silence about its history of origin. It is presented as universal, as given by nature, and is thereby set as a norm – or rather, strictly speaking, this norm is precisely not declared as such. It is tacitly presupposed as the only possible because the only thinkable one. It thus bears the character of a natural law (...). General education is thereby trimmed for usability; the measured educational value serves as an indicator of economic clout" (Horvath, 210). In this respect the improvement proposals articulated by the Zukunftskommission are coherent in themselves: comprehensibly couched in a largely management-technical terminology, they aim at improved qualifications and their efficient deployment in the service of the students and of the world of business and work.

It therefore seems downright paradoxical when representatives of the political left want to do even more justice to the PISA results: with the comprehensive school, and thus with the global economic orientation of the education system, rather than with a (necessarily reformed) tracked school system that also realizes the Gymnasium. That the chances and strengths of the European Union could lie in taking an enlightened understanding of education into account is articulated by hardly anyone outside academia.

That not all countries occupy themselves equally intensively with the "ranking spectacle" (Schirlbauer 2007, 6) is shown, it is argued, by the OECD's PISA homepage. Two months after the publication of the PISA study, the winner Finland devoted 8 pages of print-media coverage to the event, the UK (with results in the upper quarter) 88 pages, France (in the upper third) 32 pages, Germany (below average) 774 pages, Italy (behind Germany) 16 pages. Austria does not appear in the list (cf. Gruber). Schirlbauer does not follow Gruber's judgement (ibid.) that Italy's media public is marked by Berlusconi's politics of suppression; rather, he points to another national focus of attention: football. Likewise, the capacity to deal constructively, in terms of education policy, with international comparative studies is developed to different degrees in different countries, and is thus the reason for public agitation or the lack of it...

Schirlbauer's interpretation of the current educational standards, which are also a kind of consequence of the project ›klasse:zukunft‹ (a proposal of the Zukunftskommission), is oriented towards a necessary but still concealed comeback of the curriculum (Lehrplan) – resulting on the one hand from the PISA results, on the other from the ever more obvious failure of the educational reform efforts of the 1970s, which wanted to dispense with content and knew how to celebrate the methodical with pseudo-emancipatory intent, in order to free students and teachers from the "pressure of subject matter" (cf. Schirlbauer 1992, 27).

"What is good instruction?" is, consistently, also the guiding question, the guiding motif, of the Zukunftskommission's reform proposals. The endeavour thus stands in the, as it were, re-emerging tradition of an orientation towards school and instructional quality – more out of a return to didactics and the modern art of teaching, and less as a reaction to the globalization pressure of the so-called knowledge society (cf. the conclusions of the German education officials, in: Bayrhuber et al.).

Ewald Terhart, one of the authoritative didactic scholars of the German-speaking world, placed the emphasis on instruction as early as 2002: "The quality of a teacher is decided by the quality of the instruction. School quality arises to a high degree from instructional quality. The school culture as a whole – school as a space of experience – does form an important background of learning and socialization for students and teachers; nevertheless, the quality of instruction and of instructional development is ultimately the decisive area." The question of 'good instruction' is decided, he holds, on three fields: the shaping/preparation of the context, the conduct of the instruction itself, and the subsequent analysis and evaluation (cf. Terhart, 99 f.).

The perennially unfinished awareness of the (actual?) telos of the teacher, fashionable marketing expectations, the view that laborious detail work remains unthanked because it long remains invisible, a legitimation pressure perceived as rising, and a palpable competition among schools (cf. also the declining numbers of students) have led, and still lead, to the inclination to neglect "inner school reform" and to bet instead on structural (i.e. presentable) changes. This stands in confrontation with the enduring insight that the "execution of the pedagogical core business" is incumbent on the teacher, and that he or she has to answer for it – however challenging the path may be, and however comfortable the (self-)deception may feel.

The harbingers of the PISA report 2007 give reason to suspect that politically schematic interpretations will follow the undifferentiated and one-sided judgements of the past. With every such assumption, however, there also goes the hope that progress towards argumentative sensitivity and intellectual care might yet become reality – in the evidence-based knowledge society of twenty-first-century Europe.


4 Conclusion

The results of PISA 2000 and PISA 2003 were perceived in the media public quite in accordance with the published results. Without taking the re-analyses by Neuwirth et al. into account, a decline in the achievement of Austrian students can be observed between PISA 2000 and PISA 2003 (see chapter 1). This decline was mirrored in the media, but dramatized and exaggerated in its extent. Thus the results of PISA 2000 are regarded in the media as extremely positive and the results of PISA 2003 as predominantly negative. If the re-analyses by Neuwirth et al. are taken into account, Austrian students lie at the OECD average in reading and in mathematics in both assessments. Strictly speaking, then, nothing changed in these two domains, so that the much-cited drastic decline in achievement turns out to be a fiction. Only in the sciences is there an actually ascertainable decline – to which the corresponding media response has been all but absent.

Interestingly, the subject-specific demands made in the media after both PISA waves concern almost exclusively the promotion of reading and language. This certainly has its justification in view of the alarmingly high number of students who must be classified as very poor readers in PISA, and in view of the elementary importance of reading in our world; but it is somewhat surprising in view of the far greater decline in the sciences.

All other demands made after PISA 2000 and after PISA 2003 differ little in essence and repeat themselves independently of the results. They mirror, rather, those education-policy debates shaped by familiar convictions (e.g. concerning school structure), concern family- and socio-political course-settings (all-day schooling) and contain calls for reforms, but can in no way be derived from the PISA study, nor be refuted or justified by it. It is no different with the putative causes and reasons advanced for the PISA performance. These, too, stem in many cases from subjective assessments and can in no way be proven by or through the PISA results. Let it be warned once more that the arbitrary voicing of attributions of guilt and responsibility without a sufficient pedagogical basis must not strike the weakest groups of a society.



Overall it remains to be noted that the media attention for PISA 2000 differs strongly from that for PISA 2003, and that the assessment (of the consequences) by some persons changes under altered conditions of time and politics. While PISA 2000 functioned as a marginal media topic, PISA 2003 became a media "spectacle".

In view of these findings it can be concluded that PISA does not constitute a framework that enables a systematic, rational exploration of causes or the development of measures in the school and education sector, nor does it suggest any particular school-policy conclusion. PISA itself offers no point of reference for meaningful concepts, measures or interventions in the education sector of the individual states. Both PISA 2000 and PISA 2003 served at most to stimulate "competitive consciousness" among the OECD states and to give education-policy positions, convictions and projects a seemingly scientific justification and to spread them, media-effectively, among the public.

Paradoxically, the fact that everyone can find their own ideas and convictions confirmed by PISA is the reason for PISA's assertiveness and success – though at the same time it is the programme's curse.

At least one positive aspect of PISA is that education moves back into the centre of public and political discussion, and thus there is a chance that pedagogically legitimizable developments can be initiated – even if the reactions to PISA so far have largely been characterized by the taking-up of "dusty" educational concepts and foreign "school copies", which, however, can be all the more of an incentive to develop new ideas of one's own on a scientifically secured basis.

References

APA-OTS (Ed.) (2007): Über APA-OTS. Online publication [http://service.ots.at/standard.php?channel=CH0171&document=CMS1096293925986&sc=pt], download 20.7.2007.

APA-OTS (Ed.) (2007a): APA-OTS Empfänger. Online publication [http://service.ots.at/standard.php?channel=CH0171&document=CMS1135843558515&sc=pt&sb=pt3], download 20.7.2007.

Bayrhuber, Horst / Ralle, Bernd / Reiss, Kristina / Schön, Lutz-Helmut / Vollmer, Johannes (Eds.) (2004): Konsequenzen aus PISA. Perspektiven der Fachdidaktiken. Innsbruck.

Bender, Peter (2007): Leserbrief. Online publication [www.uni-paderborn.de/bender/LeserbriefPISAkritik.pdf], download 19.7.2007.

Buhlmann, Edelgard (2004): "Konsequenzen aus PISA – Perspektiven der Fachdidaktiken". In: Bayrhuber, Horst / Ralle, Bernd / Reiss, Kristina / Schön, Lutz-Helmut / Vollmer, Johannes (Eds.): Konsequenzen aus PISA. Perspektiven der Fachdidaktiken. Innsbruck.

Gruber, Karl Heinz (2004): Bildungsstandards: "World class", PISA-Durchschnitt und österreichische Mindeststandards. In: Erziehung und Unterricht, Jg. 2004.

Haider, G. / Reiter, C. (2004): PISA 2003. Internationaler Vergleich von Schülerleistungen. Graz: Leykam.

Horvath, Wolfgang (2006): PISA-Studie. In: Dzierzbicka, Agnieszka / Schirlbauer, Alfred (Eds.): Pädagogisches Glossar der Gegenwart. Wien.

Mediaanalyse (Ed.) (2007): Jahresbericht 2001. Tageszeitungen. Total. Online publication [http://www.media-analyse.at/frmdata2001.html], download 25.6.2007.

Mediaanalyse (Ed.) (2007a): Jahresbericht 2004. Tageszeitungen. Total. Online publication [http://www.media-analyse.at/frmdata2004.html], download 25.6.2007.

Neuwirth, E. / Ponocny, I. / Grossmann, W. (2004): PISA 2000 und 2003: Vertiefende Analysen und Beiträge zur Methodik. Graz: Leykam.

Parlamentskorrespondenz (2007): Online publication [http://www.parlament.gv.at/portal/page?_pageid=607,78669&_dad=portal&_schema], download 24.7.2007.

Posch, Peter (2005): "Einige mögliche Gründe für die Schwächen des österreichischen Schulsystems und Ansätze zu ihrer Überwindung". Unpublished working paper of the Bundesministerium für Bildung, Wissenschaft und Kultur.

Reiter, C. / Haider, G. (2002): PISA 2000 – Lernen für das Leben. Österreichische Perspektiven des internationalen Vergleichs. Innsbruck-Wien-München-Bozen: Studien Verlag.

Schirlbauer, Alfred (1992): Junge Bitternis. Eine Kritik der Didaktik. Wien.

Schirlbauer, Alfred (2007): Sollen wir uns vor den Bildungsstandards fürchten, oder dürfen wir uns über sie freuen? (unpublished manuscript). Wien.

Schleicher, Andreas (2004): Vorwort des Leiters der Abteilung für Indikatoren und Analysen im OECD-Direktorat für Bildung. In: Neuwirth, E. / Ponocny, I. / Grossmann, W. (Eds.): PISA 2000 und 2003: Vertiefende Analysen und Beiträge zur Methodik. Graz: Leykam.

Schwarzgruber, Manfred (2006): Die PISA-Studie und ihre mediale Darstellung. Eine Inhaltsanalyse der Berichterstattung über die PISA-Studie 2003 in österreichischen Tageszeitungen. Universität Salzburg: Diplomarbeit.

Terhart, Ewald (2002): "Wie können die Ergebnisse von vergleichenden Leistungsstudien systematisch zur Qualitätsverbesserung in Schulen genutzt werden?" In: Zeitschrift für Pädagogik, Jg. 48, Heft 1.

zukunft:schule (2005): Strategien und Maßnahmen zur Qualitätsentwicklung. Abschlußbericht der Zukunftskommission. Wien: Bundesministerium für Bildung, Wissenschaft und Kultur.


Epilogue: No Child, No School, No State Left Behind: Comparative Research in the Age of Accountability

Stefan T. Hopmann

Austria: University of Vienna

Why, and under what conditions, is PISA successful? How is it that PISA results can simultaneously serve as justification for utterly opposed education-policy options, and that in some countries the whole of education policy falls under the skewed shadow of PISA, while elsewhere PISA is only one voice among many? Starting from the contributions to this volume and complemented by results of historical-comparative research, the following chapter analyses PISA as a case study of the fundamental changes that are gradually taking hold of all public services (including, for instance, the health sector). It turns out that there are good historical and current reasons for the achievements and weaknesses of the PISA project and for their highly divergent uses.

Why is it that a comparative project like PISA can gain so much public attention in so many countries at the same time? What makes some governments tremble, parliaments discuss, journalists write, parents nervous, and teachers angry when PISA announces new results? Why are educational administrations and political committees eager to align their curriculum concepts to the one implicit in the PISA tests? Why is PISA in some places big news, in others news appropriate for a short notice on page five or in the education section? PISA is not the first project of its kind: what is different about PISA?

Of course there is no single explanation for this mind-boggling success story. PISA has obviously hit something in the public mind, or in the political mind, at least in Western societies. This makes it "knowledge of most worth". It is unlikely that this success is a result only of the quality and scope of PISA itself. If one accepts at least some of the criticism voiced in this volume, the opposite seems to be the case. What is of "most worth" in the eyes of the public is not the complicated and often overstretched research techniques or the specific design, but the simple messages which front the public appearance of PISA: the league tables and the summaries, which indicate what PISA sees as the weaknesses or strengths of the respective systems of schooling.

But even news of this kind has been around before – ever since the IEA began its comparative research in 1959 – yet it never gained a similar impact. Thus it is not enough to look at PISA itself to understand this story. It is necessary to understand, at the same time, how the social environment has changed: schooling, policies, and the public. The question is how PISA and its methodology fit into a larger frame of social transformation and could thus achieve the influence they have now. Moreover, one has to ask if there is one PISA achieving all this, or whether it is more appropriate to talk about the "multiple realities of PISA", i.e., the manifold ways in which PISA is enacted and experienced, which have been crucial for its success, and which the methodological mix utilized by PISA allows for.

In my view, the rise of PISA owes much to what I would call the emerging "age of accountability", i.e., a fundamental transformation ongoing in at least the Western world, which centres on how societies deal with welfare problems like security, health, resurrection and education (see Hopmann 2003, 2006, 2007). PISA fits this context in many different ways, depending on how accountability issues unfold in different societies. In some places the same fit is equally expressed by national policies like the "No Child Left Behind" legislation in the US, or by the development of national education standards, as has been the case in countries as diverse as Sweden, Germany, Switzerland or New Zealand. What is important to note here is that the education sector is rather late in addressing such issues when compared to other areas such as health or security. In this perspective PISA is but one example, a kind of collateral damage sparked off by the intrusion of accountability mechanisms into the social fabric of schooling.

To explain this observation, I will first outline how the transformation called "the age of accountability" can be understood (1). In the following section (2) I will try to illuminate three basic modes of accountability, namely the strategies of "no child left behind", "no school left behind" and "no state left behind", and how PISA fits into these different settings. In the last section (3), implications of the multiple realities of accountability for current and future school development are discussed. In doing so I rely on the results of the Norwegian research project "Achieving School Accountability in Practice" (ASAP), which I initiated in 2003 and which will present its results in another volume later this year (Langfeldt, Elstad & Hopmann 2007). Other sources are comparative projects which I have been involved with in recent years, such as "Organizing Curriculum Change" (OCC), including research projects in Finland, Norway, Switzerland, Germany and the US (cf. e.g. Künzli & Hopmann 1998), and the dialogue project "Didaktik meets Curriculum", which involved scholars from about twenty countries throughout the 1990s (cf. e.g. Westbury, Hopmann & Riquarts 2000; Gundem & Hopmann 2002), and – last but not least – the research done in preparation of this volume on PISA, in cooperation with colleagues from seven different countries (Austria, Denmark, Germany, Finland, France, Norway, and the UK).

1 The Age of Accountability

Social scientists, economists, politicians, educators and the public seem to agree that something fundamental is going on, something which changes at least in principle the social fabric of Western societies. However, they differ widely in what they see as the core of this transition. To name but a few more recent examples:

– "The Modern World System" is, according to Immanuel Wallerstein, characterized by the ever-expanding commodification (in Marx's terms: "Verdinglichung") of all natural resources, human relations, labour, knowledge, etc., forcing a lasting division of labour in and between nations and turning ordinary citizens into alienated tools of a globalized economy (cf. e.g. Wallerstein 2004).

– Neoinstitutionalists like John W. Meyer speak about globalization as well, however in organizational or structural terms, defining the current transition process as an outcome of the rapidly growing influx of international "institutions", i.e. common ways of seeing and dealing with society, as provided by international organizations (such as the UN, the World Bank, the OECD) and the emerging "world polity", which supersedes national histories and policies alike (cf. e.g. Meyer 2006).

– Theories of "reflexive modernity", as provided by Anthony Giddens and others, would agree that the change is global, but they pinpoint the special implications for the members of society, e.g. the need to develop a reflexive stance towards the structures of society and the embedded risks for society as a whole as well as for the individual (cf. e.g. Beck, Giddens & Lash 1996; Beck 2006).

– Governmentality theories, drawing on Foucault's famous 1977/78 Collège de France lectures (cf. Foucault 2004), point similarly to the impact the transition has on the state and its institutions as well as on the public and its members, but they see a growing transfer and diffusion of power relations into self-control mechanisms, making citizens internalize the (more or less alienated) mentality necessary to govern them(selves) (cf. e.g. Bröckling, Lassmann & Lemke 2000; Lange & Schimank 2004; Gottweis 2006).

– New Public Management (NPM) supporters would not disagree that there is a diffusion of power and a change of the habits required, but they rather see it as a positive force making societies and their institutions and members more effective in a globalized world as customer-centred management and control techniques are introduced (cf. e.g. Buschor & Schedler 1994; Pollitt & Bouckaert 2004).

– More recent theories of the welfare state discuss similar issues, but rather as a question of how the modern "intervention state" is forced to dismantle its traditional comprehensive strategies governing resources, the law and the social sphere in a more and more post-national world, and of how welfare is re-modelled within this "unravelling" of the state and its institutions (cf. e.g. Esping-Andersen 1996; Scharpf & Schmidt 2000; Leibfried & Zürn 2006).

– Finally, systems theories based on the work of Niklas Luhmann (cf. e.g. Luhmann 1998) argue that the current transition grows from within, from the need of social systems to deal with an ever-growing complexity and contingency that forces a reflexive re-design of the ways and means of social communication (which, according to Luhmann, constitutes the fabric of social systems; cf. e.g. Akerstrøm Andersen 2003; Rasmussen 2006).

Of course, this is but a small selection of the staggering number of transition theories, which flourish despite the obviously prematurely proclaimed "end of history" (Fukuyama 1992). Moreover, these approaches vary widely. Some of them see the current change as a late consequence of processes started with the invention of the modern state (e.g. Foucault, Wallerstein), whereas others point to more recent changes, for instance to the crisis of the welfare state or the rapid globalization process (e.g. Leibfried & Zürn, Meyer). Some of them look at it primarily as a top-down process by which global developments overpower local traditions (e.g. Meyer, Wallerstein); others stress the role of intermediate levels such as the nation state and its institutions (e.g. Giddens, Leibfried & Zürn, Pollitt & Bouckaert); whereas some see the main issue at the level of the impact of the transition process on those involved (e.g. Beck, Foucault). Some theories stress institutional patterns or social systems as the prime force (e.g. Meyer, Luhmann), others believe in actors and their policies as the defining elements (e.g. Pollitt & Bouckaert; Wallerstein), whereas some try to sketch a third perspective, in which actors and structures are seen as inextricably intertwined (e.g. Giddens, Foucault).

One should not complain about this amazing diversity of approaches: it is an expression of the difficulty of finding more common ground at a time when the transition is still unfolding with growing, but uneven, speed in different places. Additionally, many of these authors and their followers use a similar pool of examples in spite of their differences. Even though they do not agree on all the why questions, they point to much of the same kind of evidence – for instance, to examples

– of the redistribution of resources, risks and responsibilities within and across societies,

– of the destabilization, or at least the restructuring, of most public institutions and of their relations to, or competition with, the private sector,

– of the re-tooling of legitimation and control patterns within the public as well as the private sphere, and of its impact,

– of the pressure on systems and actors towards taking a reflexive stance towards themselves and taking responsibility for their own "well-being".

Accountability

Looking at the narrower issue of "accountability", a similar wealth of models and approaches can be observed. Besides the more or less implicit accountability concepts within the general transition theories mentioned above (mostly constructed as 'being made responsible' in one way or another, e.g. by Giddens, Foucault or the NPM theories), different models of accountability have emerged, based on the areas in which accountability is observed:

– In economic theories (e.g. Laffont 2003), where the concept originated, accountability is nowadays often constructed as a means by which a principal (the resource-giver), under conditions of limited information, tries to multiply the ends by giving the agent (the resource-taker) incentives and/or forcing him by other means to account for the efficiency, quality and results of his deliveries.

– In government research (e.g. Hood 1991, 1995; Hood, Rothstein & Baldwin 2004), accountability is often seen as a key tool of the New Public Management movement to ensure that units and persons provide services according to the goals set for them or agreed with them. According to this approach, it unfolds as a combination of risk management and blame avoidance, by which those held accountable try to limit the scope of possible failure.

– In research on social policy, the same phenomenon is described as a "quasi-market revolution" (Bartlett, Roberts & Le Grand 1998), i.e. as the intrusion of market-like mechanisms of distribution and control into the public sector: elements of competition, contractualization and finally auditing are introduced into the rendering of services – as Hood (2004) has pointed out, often in the form of "double whamming", i.e. as the co-existence of the traditional bureaucratic modes with the administrative tool kit fostered by NPM.

– Similarly, some educational and health-care researchers see the rise of "the age of accountability" as a "revolutionary" move towards "evidence-based" practice, i.e. the growing expectation that professionals can present data to prove that they have performed professionally and efficiently (for education, cf. e.g. Slavin 2007; for health care, cf. e.g. Muir Gray 2001).

– Generalized beyond the realms of public service, this leads to the concept of the emergence of an "audit society" (e.g. Power 1997), the assumption that more and more areas of social life are being made "verifiable", i.e. subjugated to regimes of counting what can be counted, and thus become part of a measurable accountability.

– Interaction and transaction theories construe the personal costs of such a transition, looking at accountability as an interpersonal relation in which we deal with "accounts", "excuses" and "apologies", i.e. strategies to explain ourselves in ways which give a sustainable account of our efforts (e.g. Benoit 1995).

– Finally, psychological approaches to accountability (e.g. Sedikides 2002) look at the personal ways of dealing with accountability, at how one develops mechanisms to attribute or to reject the accountability embedded in the roles and functions we have to perform.

Like the transition theories, accountability theories provide a wide array of possible causes and implications. Some see this process primarily as an effect of a growing "economization" of all parts of society (e.g. social policy and audit theories), whereas others see accountability as inevitably embedded in the social fabric of modern societies (e.g. the psychological explanations). Some see accountability primarily as a politically initiated restructuring effort (e.g. quasi-market theories), whereas others see accountability simply as a legitimate means to ensure that customers or clients get what they have paid for (economic and NPM theories). Additionally, accountability is viewed on rather different levels. Utilizing a model developed by Melvin Dubnick (2006), one can discern:

– a first-order accountability, i.e. accountability arising in face-to-face relations (as described by the psychological models);

– a second-order accountability, which is characterized by how well one follows the rules and standards set by a resource-giver (as described by government theories);

– a third-order accountability, which can be seen as "managerial accountability", i.e. the use of accountability by a principal as a means to achieve better service and effectiveness of the agent; and finally

– a fourth-order accountability, which rests on the one held accountable internalizing the norms, values and expectations of the stakeholders, which puts him or her into action (as pointed out e.g. by theories of governmentality or of professionalism).

In practice, all of these can be intertwined. However, the dividing line is how this interaction is assumed to come into being, and which levels rule compared to the others (if they are not seen, as they are in economic theories, as an embedded rationale of social actors at all levels).

In addition, accountability concepts change over time and differ from place to place. A good indicator of this is that (1) there is no common translation of the concept available in most of the non-English-speaking countries, and (2) public agencies or policy makers do not employ similar definitions of the elements and limits of accountability when "accounting for accountability" (cf. Dubnick & Justice 2004; Birkeland 2007). Nevertheless, most accountability analysts agree with the above-mentioned transition theories on some core issues, namely:

– that accountability procedures more and more permeate at least all Western societies, and thereby change the ways and means by which societies deal with themselves;

– that the rapid rise of accountability affects all areas of the public sector, from education to health, and their relation to the private sector;

– that this transition enforces a vast redistribution of resources and responsibilities, and thereby a fundamental change in the interplay between resource-providers and resource-users, often described as a kind of implicit (values, norms) or explicit (standards, contracts) fixation of what is supposed to shape their relations;

– that this process unfolds at different speeds and with different patterns, depending on what kind of social setting it becomes a part of.

For the purpose of this chapter, it is not necessary to decide which of these theories and models carries the most theoretical or empirical evidence. Rather, the common features shared by most of them should be enough as a starting point, even though this puts some of the "why questions" aside temporarily and moves the focus to the question of how the emergence of the age of accountability can be observed in action. In my view, its common core can be described as a slow but steady transition from what I call a "management of placements" (Verortung) towards a "management of expectations" (Vermessung), by which the ways and means of dealing with "ill-defined" problems, such as health, education, security and resurrection, are changed fundamentally (see Hopmann 2000, 2003, 2006, 2007).

Managing transition

Following in the footsteps of Max Weber (cf. Weber 1923; Breuer 1991), the rise of the modern state can be described as the successive unfolding of a management of placements, by which the risks of being born (e.g., how to get an education, who takes care of me when I am ill or old, who gives me security in my everyday life and my dealings, how to be at peace with myself and my neighbours) were taken care of by institutions run by professionals with a specific education in how to deal with such ill-defined problems. These institutions (such as schools, hospitals, prisons, armies, bureaucracies, churches) had a comprehensive mission in that their professionals needed leeway to define which of these problems required what kind of treatment. The institutionalized problem-sharing allowed for taking on more risks and for moving beyond the care for immediate needs. Of course, which problems were considered ill- or well-defined changed over time, as did the resources available. But the internal distribution of resources and the evaluation of outcomes were mostly left to the professionals themselves, or to the emerging professional communities, who defined and controlled the education, licensing and practice of their members (cf. Abbott 1988; Hopmann 2003).

However, this comprehensive institutionalization had no fixed boundaries (which would have required the transformation of ill-defined into well-defined problems), thus opening a continuing process of broadening the scope and differentiating the means whenever new aspects of the problems seemed to become urgent. Thus each and every field underwent a massive expansion, multiplying its tasks and treatments. In the past, for example, a couple of years in school (and, for a few, in universities) was all the public education available. Today we spend twenty and more years of our lives in all kinds of professionalized educational settings, from childcare to elder hostels. Where once we met a doctor at the beginning and at the end of life, and maybe a few other times under extraordinary circumstances, today we spend, each and every year, a lot of time with medical doctors, nurses and other health-care specialists in waiting, treatment or emergency rooms. In short, the management of placements was extremely successful – so successful that Western societies spend most of their public budgets on dealing with these problems. As long as the differentiation of the institutions did not outspend the resources available, differentiation could go on and on, and with ever-growing speed.

This success story seems to come to an end in what social policy theory calls "the crisis of the welfare state", i.e. as resource limits and boundaries for further expansion become more and more visible (wherever they stem from). The legitimacy of the whole placement strategy relied on its ability to cover new ill-defined problems by expansion and sophistication; but there is now mistrust, and there are anxieties, about whether this comprehensive help will be sustainable in the future. A very visible impact of this loss of trust is the rise of welfare patriotism on the right and the left in almost all Western societies, articulating, and maybe misusing, much of the unease citizens feel about the future and the security of the inherited places and treatments (our welfare is said to be at risk because of immigrants, globalization, outsourcing, etc.). One of the important responses to this is a stepwise transition from a management of placements towards a management of expectations. Instead of guaranteeing comprehensive institutions, there is an attempt to transform ill-defined problems into better-defined expectations as to what can be achieved with a given amount of resources. Standards, benchmarks, indicator-based budgets etc. are examples of how this transition is managed. In that they do not necessarily imply long-term commitments, expectations can remain transient and volatile, open to shifts in the prevailing mix of expectations. This allows for more target-oriented management and accountability, which, however, comes at the price that whatever does not fit into the expectation regime of the time becomes marginalized. Comprehensive coverage is replaced by a fragmented system of treatments available under certain conditions. The left-over, not least the still ill-defined general issues – what does it mean to be well-educated, healthy, secure, to feel well etc.? – is either still connected to the former placements and/or transformed into temporary programs seemingly better equipped for addressing what remains immediately ("the patient in focus", "fighting crime", "strengthening social education" etc.).

Take the example of schooling: in earlier times public education was provided by "a place called school" (Goodlad 1983), run by professionals called teachers who decided, within sketchy limits and based on professional and local traditions, how to teach and what achievement seemed to be sufficient. There was no external public evaluation of the quality of the services provided – except in extraordinary cases of failure – as long as the normal procedures seemed to be professionally acceptable. Accordingly, good instruction was defined not primarily by its measurable outcomes, but rather by the professional judgement of the adequacy of what was done. Expectation management changes the picture dramatically. The core focus shifts to more or less well-defined expectations of what has to be achieved by whom. Good instruction is whatever meets the overlapping expectations, and it can be provided outside the traditional institutions and professions; in fact, everybody is welcome to provide it as long as the expectations are met. Of course, there are issues which are not (yet) covered by identifiable expectations; however – in the case of conflicting goals – the balance will always tip towards those expectations which are well-defined enough to become part of the implied accountability of the treatment providers. The rest, that which is not addressed but seems to need to be taken into account (e.g. issues such as mobbing/bullying, gender, migration etc.), is embedded into transient intervention programs of limited scope, enough to assure the public that no ill-defined problem is left behind.

It is important to remember that this is expectation management: it is not about outputs or outcomes or "efficiency" as such – as, for instance, NPM theories see it. Only those results which can be "verified" according to the stakes given and which do not meet expectations become problematic, and only those outcomes which meet the predefined criteria are considered a success. In fact, any caretaker of an ill-defined problem will always produce many more effects than any accountability system can observe and measure. Some of them may be simply by-products or minor collateral damage, but some impact may indeed be a major contribution (e.g. inclusion into society, regulating biographies etc.) which is beyond the short-sighted reach of the management of expectations. The line is drawn by the ever-changing fabric of expectations on the one hand and, on the other, by the simple fact that accountability needs something which can be counted, or where it is at least possible to measure the distance between expectations and results (cf. Slavin 2007).

The emergence and spreading of accountability is a signifying hallmark of the whole transition process. PISA fits nicely into this transition, as we will see in the following sections. Seen as part of a management of placements, PISA would be a disaster: it covers only a few aspects of the place, of schooling and the curriculum, and even these are covered in ways that capture them at best very indirectly (cf. the preceding chapters on PISA's limits). However, as a tool of expectation management, PISA fosters a transformation of what had been ill-defined issues (e.g. curriculum contents) into seemingly well-defined attainment goals. It delivers, at the same time, a parameter for holding schooling accountable – for delivering according to the expectations embedded into its questionnaires. It contributes to the fragmentation of the field by transforming the conditions and constraints of this delivery into independent factors (e.g. social background, gender, migration etc.), whose impact has to be minimized by way of teaching if expectations are to be fully met. The best representation of this is given by the "production functions" by which PISA-using economists calculate the transaction costs of schooling and the ways and means by which the principals (parents, the state) might maximize the effectiveness of the chosen agents (i.e. teachers, schools or school systems; cf. e.g. Bishop & Woessmann 2004; Fuchs & Woessmann 2004; Micklewright & Schnepf 2004, 2006; Sutherland & Price 2007).
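To make concrete what such a "production function" looks like, here is a minimal sketch of the regression form typically estimated in studies of this kind (the notation is illustrative and not taken from any of the works cited): the test score $T_{is}$ of student $i$ in school $s$ is modelled as a linear function of family background, school resources and institutional features,

\[
T_{is} = \alpha + \beta' B_{is} + \gamma' R_{is} + \delta' I_{s} + \varepsilon_{is},
\]

where $B_{is}$, $R_{is}$ and $I_{s}$ are vectors of background, resource and institutional variables. Each estimated coefficient is then read as the marginal "output" of one input, while everything the regressors do not capture is relegated to the error term $\varepsilon_{is}$ – which is precisely how the conditions of schooling become "independent factors" in the sense described above.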

Constitutional Mindsets

When it comes to the public sphere, the transition from a management of placements towards a management of expectations meets different constitutional mindsets, i.e. deeply engrained ways of understanding the relation between the public and its institutions (cf. for the basics of the following e.g. Haft & Hopmann 1990; Hopmann & Wulff 1993; Zweigert & Kötz 1997; Lepsius 2006). For example, the American constitution is constructed as a protection of the individual against the misuse of power by governments and others. It sees the rights of individuals as a given and the intervention of government as limited by these rights, with government obliged to protect citizens against any infringements of their constitutional freedoms. The First Amendment, for instance, states: "Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the government for a redress of grievances." Within the Prussian or the Austrian tradition, which comes from the opposite direction (not least Roman law), civil rights are something constituted and limited by the law, i.e. it is the state and its (more or less enlightened) institutions which create and define the boundaries of social and individual life. Religious freedom, for instance, may be granted, but the freedom is closely connected to state supervision of religious organizations and institutions, which can set limits on the conditions for the 'full' exercise of a religion (which creates problems for non-institutionalized traditions such as Islam). The Scandinavian constitutional tradition settled (at least in its beginnings) somewhere between these fundamentally opposed starting points: it acknowledges the right of the state to impose a constitution, but originally limited its reach by making it subsidiary to local and regional law traditions. That has changed gradually, but there is still no unified code of law, rather a pragmatic approach to regulating fields of interest based on practical experience and home-grown traditions. The local constituency is still seen as the core of the social fabric. Citizens are empowered to define a community life based on their own traditions within a broad constitutional setting. Thus, while there are state churches in Norway and Denmark, there was plenty of leeway to establish new local traditions (e.g. as "free churches"); today any group of a certain size and permanence and with a discernible creed of its own can establish itself as a "church" with a right to receive state subsidies (cf. Repstad 2000²).

Of course, the constitutional and legal structures are much more mixed, the patterns much more blurred, than these different starting points indicate. But the mindsets on which they are founded seem to be alive and well, and have a strong impact on how the public and its institutions conceptualize the legal and structural implications of social change. At least, this is the case when it comes to how accountability measures are embedded into the public system as a whole, and especially into the school system. There the main questions are: Is accountability about protecting the individual citizen (student) against bad service-rendering? Or is the primary goal to strengthen the ability of local communities to run their institutions according to their own needs and aspirations? Or is it about holding the public system accountable for its contribution to the state's welfare? Put in the current educational context, one accordingly has to ask where the main focus of accountability is situated: at "no child left behind", "no school left behind", or "no state left behind"?



2 The Multiple Realities of Accountability

In a historical perspective, PISA and its like are heavily indebted to the legacy of the assessment movement in the United States and its internationalization by the International Association for the Evaluation of Educational Achievement (IEA), beginning in the late 1950s. Seminal works such as Caswell's City School Surveys (1929), the Eight Year Study (1942) or Benjamin Bloom's groundbreaking work on the "taxonomy of educational objectives" (1956), the nationwide spread of the Scholastic Aptitude Test (SAT) from the 1950s onwards, and the establishment of the National Assessment of Educational Progress (NAEP) paved the way for an understanding in which student achievements were seen as the prime indicator of the quality of schooling. The rapid rise of assessment and evaluation as key tools of educational control was fuelled by a constant flow of critical works on the poor state of the Nation's schools. From Conant's report The American High School (1959), Rickover's American Education – a National Failure (1963) and Coleman's report on the "Equality of Educational Opportunity" (1966) to the national report A Nation at Risk (1983) and the Nation's Report Card (Lamar & Thomas 1987), the basic tenor was the same: the American system is failing many of its students – as demonstrated by the test scores achieved in local, state-wide and national testing.

From the late 1980s onwards, this seemingly constant failing led to a more generalized approach to assessment, testing and "reform", now called "standards-based reform" (cf. Achieve 1998; Ahearn 2000; Fuhrman 2001). State after state introduced state standards for the curriculum and – if it had not already done so – state-wide assessment of student achievement, to assure that these standards were applied. It would be a wild exaggeration to pretend that this approach was an immediate success. In some cases the introduction of state standards obviously spelled disaster (cf., e.g., the Kentucky experience; Whitford & Jones 2000); in others at best modest gains could be reported, but their validity was, and is, heavily disputed (cf. e.g. Cannell 1987; Dorn 1998; Saunders 1999; Linn 2000; Haney 2000; Watson & Suppovitz 2001; Amrein & Berliner 2002; Haney 2002; Ladd & Walsh 2002; Swanson & Stevenson 2002; Darling-Hammond 2003; Braun 2004). However, despite some 50 years of mixed experience with assessment and rather shallow results (cf. Cook 1997; Mehrens 1998; McNeil 2000; Herman & Haertel 2005), the next move was to introduce national legislation aiming at a unified approach to assessment and accountability: the "No Child Left Behind Act" of 2001, enacted under the Bush presidency and supported by an almost united Congress (cf. Peterson & West 2003).

No Child Left Behind (NCLB)

It is worthwhile to give the provisions of the NCLB act a closer look, as they are paradigmatic for how accountability is constructed within the American tradition. Already the comprehensive "statement of purpose" unfolds a wide array of issues:

"The purpose of this title is to ensure that all children have a fair, equal, and significant opportunity to obtain a high-quality education and reach, at a minimum, proficiency on challenging State academic achievement standards and state academic assessments. This purpose can be accomplished by –

(1) ensuring that high-quality academic assessments, accountability systems, teacher preparation and training, curriculum, and instructional materials are aligned with challenging State academic standards so that students, teachers, parents, and administrators can measure progress against common expectations for student academic achievement;

(2) meeting the educational needs of low-achieving children in our Nation's highest-poverty schools, limited English proficient children, migratory children, children with disabilities, Indian children, neglected or delinquent children, and young children in need of reading assistance;

(3) closing the achievement gap between high- and low-performing children, especially the achievement gaps between minority and non-minority students, and between disadvantaged children and their more advantaged peers;

(4) holding schools, local educational agencies, and States accountable for improving the academic achievement of all students, and identifying and turning around low-performing schools that have failed to provide a high-quality education to their students, while providing alternatives to students in such schools to enable the students to receive a high-quality education;

(5) distributing and targeting resources sufficiently to make a difference to local educational agencies and schools where needs are greatest;

(6) improving and strengthening accountability, teaching, and learning by using State assessment systems designed to ensure that students are meeting challenging State academic achievement and content standards and increasing achievement overall, but especially for the disadvantaged;

(7) providing greater decision making authority and flexibility to schools and teachers in exchange for greater responsibility for student performance;

(8) providing children an enriched and accelerated educational program, including the use of school-wide programs or additional services that increase the amount and quality of instructional time;

(9) promoting school-wide reform and ensuring the access of children to effective, scientifically based instructional strategies and challenging academic content;

(10) significantly elevating the quality of instruction by providing staff in participating schools with substantial opportunities for professional development;

(11) coordinating services under all parts of this title with each other, with other educational services, and, to the extent feasible, with other agencies providing services to youth, children, and families; and

(12) affording parents substantial and meaningful opportunities to participate in the education of their children." (Section 1001)

But this complexity is right away reduced to more specific expectations when it comes to which goals are in focus and how accountability is supposed to foster these goals. Academic standards are, according to NCLB, the following:

"Standards under this paragraph shall include—

(i) challenging academic content standards in academic subjects that—
(I) specify what children are expected to know and be able to do;
(II) contain coherent and rigorous content; and
(III) encourage the teaching of advanced skills; and

(ii) challenging student academic achievement standards that—
(I) are aligned with the State's academic content standards;
(II) describe two levels of high achievement (proficient and advanced) that determine how well children are mastering the material in the State academic content standards; and
(III) describe a third level of achievement (basic) to provide complete information about the progress of the lower-achieving children toward mastering the proficient and advanced levels of achievement." (Section 1111)

Accountability is then based on these standards:

"Each State plan shall demonstrate that the State has developed and is implementing a single, statewide State accountability system that will be effective in ensuring that all local educational agencies, public elementary schools, and public secondary schools make adequate yearly progress as defined under this paragraph. Each State accountability system shall –

(i) be based on the academic standards and academic assessments . . . and other academic indicators consistent . . . , and shall take into account the achievement of all public elementary school and secondary school students;

(ii) be the same accountability system the State uses for all public elementary schools and secondary schools or all local educational agencies in the State, except that public elementary schools, secondary schools, and local educational agencies not participating under this part . . . and

(iii) include sanctions and rewards, such as bonuses and recognition, the State will use to hold local educational agencies and public elementary schools and secondary schools accountable for student achievement and for ensuring that they make adequate yearly progress . . . ". (ibid.)

Finally, what is meant by "Adequate Yearly Progress" is defined in the subsequent paragraph:

"(B) ADEQUATE YEARLY PROGRESS.—Each State plan shall demonstrate, based on academic assessments described in paragraph (3), and in accordance with this paragraph, what constitutes adequate yearly progress of the State, and of all public elementary schools, secondary schools, and local educational agencies in the State, toward enabling all public elementary school and secondary school students to meet the State's student academic achievement standards, while working toward the goal of narrowing the achievement gaps in the State, local educational agencies, and schools.

(C) DEFINITION.—'Adequate yearly progress' shall be defined by the State in a manner that—

(i) applies the same high standards of academic achievement to all public elementary school and secondary school students in the State;
(ii) is statistically valid and reliable;
(iii) results in continuous and substantial academic improvement for all students;
(iv) measures the progress of public elementary schools, secondary schools and local educational agencies and the State based primarily on the academic assessments described in paragraph (3);
(v) includes separate measurable annual objectives for continuous and substantial improvement for each of the following:
(I) The achievement of all public elementary school and secondary school students.
(II) The achievement of—
(aa) economically disadvantaged students;
(bb) students from major racial and ethnic groups;
(cc) students with disabilities; and
(dd) students with limited English proficiency;
except that disaggregation of data under subclause (II) shall not be required in a case in which the number of students in a category is insufficient to yield statistically reliable information or the results would reveal personally identifiable information about an individual student." (ibid.)

I have quoted the NCLB act at such length because it provides a concise definition of what the management of expectations, in this perspective, is about. The core of accountability is narrowly focused on student achievements measured by "academic standards". Other functions of schooling (such as the role school plays for local communities or in shaping society) are hardly mentioned, and if at all, they are constructed as minority problems. At the same time academic achievement is reduced to that which can be reported as "statistically valid and reliable", leaving out any educational or social achievements which cannot be counted as required. Within this frame, responsibility is passed from the federal top through intermediate levels such as state and district administrations down to teachers and local school leaders. They are expected to improve the test results by "evidence-based teaching" or even by "data-driven decision making" (cf. ECS 2002; Marsh & Hamilton 2006), which collapses the complexities of classroom work or school leadership into single-minded framesets of statistically significant achievement gains (cf. the comments by e.g. Koretz 2002; Berliner 2005; Hargreaves 2006; Ingersoll 2006). The original idea of the "basic principles of curriculum and instruction" (Tyler 1949), which was embedded in the methodologically much broader approach of, e.g., the Eight Year Study (Aikin 1942), and which called for a comprehensive understanding of schooling as a social and local institution, has – as it seems – dwindled to a concept of measurable yearly progress.

The response to NCLB within the education community has been almost evenly divided. While a majority of politicians and economists, and certain parts of the public, seem to support NCLB wholeheartedly, or at least its core concept of accountability, many educators are less enthusiastic. The public response seems to be fragmented along lines of social class, level of education, and political orientation (cf. Loveless 2006). The professional reactions have much to do with the question of whether or not one accepts the narrow focus of NCLB as reasonable. On the one side are those who see NCLB at least as a starting point for a possible school revolution, finally solving the American school crisis (cf. e.g. Ladd & Walsh 2002; Peterson & West 2003; Irons & Harris 2006). Some economists have even begun to calculate the economic spin-off of NCLB if modest gains can be sustained (Hanushek 2002, 2006; Hanushek & Raymond 2003, 2005). Others are sceptical, pointing to the "impoverished" scope of the provisions (cf. e.g. Berliner 2005) or to the obvious implementation problems of the current approach (for a variety of such problems cf. e.g. Eberts, Hollenbeck & Stone 2002; Mintrop 2003; Chubb 2005; Gorard 2006; Martineau 2006; Apple 2007; Deretchin & Craig 2007; Zimmer et al. 2007). The construct of "adequate yearly progress" (AYP) in particular has created a tremendous challenge. The expectations of how much progress could be reached went far beyond the reality of slowness and instability in school change – and adding the minority criteria worsened the situation. Some critics fear that almost all American schools will end up on the watch lists of failing schools (cf. Linn, Baker & Betebenner 2002; Linn & Haug 2002; Herman & Haertel 2005; Linn 2005).
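A rough calculation shows why this fear is plausible (the starting value is illustrative; the deadline was set by the law, which required all students to reach proficiency by the 2013/14 school year): a school that in 2002 had, say, 30 per cent of its students at the proficient level would need average gains of

\[
\frac{100 - 30}{12\ \text{years}} \approx 5.8\ \text{percentage points per year},
\]

sustained over more than a decade and, given the disaggregation rules quoted above, in every single subgroup – a pace far beyond the "reality of slowness and instability in school change" just described.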

And many researchers and practitioners have pointed out that NCLB will fail as long as schools and their leadership do not have the required "capacities", i.e. the ability to identify their local mix of problems and to deal with these professionally (cf. e.g. Elmore 2006). However, what is almost never challenged in this debate is the basic assumption of the whole enterprise, namely that it is the students' academic achievement which best reflects the quality of schooling, and that it is the poor quality of instruction, provided by poor teaching, which is to blame for the fact that children are left behind. Holding states, school districts and schools accountable is reduced to the requirement to do whatever is necessary to pass this accountability along to classrooms and teachers and, finally, to the students themselves.

Similar criticism has been voiced about PISA's role in the US (cf. e.g. Bracey 2005). However, even though it uses a similar approach to mapping school achievements, and even though the US has been one of the driving forces behind it, PISA does not have much of an American audience in the shadow of NCLB, and no significant impact on the wider public or the educational science community. For instance, a recent search of the Education Resources Information Center (ERIC) brought up fewer than 150 articles and books about PISA, most of them from outside the US – nothing compared to the general issue of accountability (more than 18,000 hits) or NCLB (about 2,000 hits). It does not even match the impact of TIMSS (more than 400 hits) and is far below what the German equivalent of ERIC, the FIS Bildung, reports on PISA from Germany (more than 2,500 hits). Arguably, this reflects NCLB's status as a national law, whereas PISA is an international enterprise with no direct obligations for the states and the schools participating.

In general, PISA seems to have less impact where a national achievement control of some kind is already in place (as, e.g., in Sweden), which bodes ill for the future of PISA as more and more countries introduce accountability systems of their own making. However, this would not explain why earlier studies like TIMSS have had more visibility in the US. In my view, this shortcoming of PISA has much to do with the key fallacy of its design, namely to present itself mainly as a cross-national comparison, even though it shares with NCLB the same starting point, student achievement, which only to a very small degree (not least in the case of the US with its manifold states) can be attributed to a specific "national" fabric of schooling (cf. the critique brought forward in this volume). As a cross-national comparison PISA is not much of an eye-opener for the US public; it only confirms what was known from earlier international studies, i.e., that US students don't do very well in such tests compared to the students of many other nations. Given that the "winner" of PISA 2000 and 2003 was tiny Finland, and not an international competitor like Japan (which succeeded in TIMSS, and did well on PISA too), there is, seemingly, not much for Americans to learn from successful PISA nations.

No School Left Behind

When the accountability wave hit Nordic shores for the first time, the spontaneous reaction of the political and educational establishments was almost the opposite of what had happened in the US. While there accountability became a tool to centralize important elements of educational control, first at the state and later at the national level, the spontaneous reaction of the Scandinavians was decentralization. Although government offices and administrative departments (in Norway and Sweden) were created to satisfy the discourse of New Public Management, and many national reports and white papers were commissioned on public service rendering and administration, none of this – except maybe in Finland, which was under much more economic strain following the fall of the Berlin Wall (cf. Simola 2005; Uljens in this volume) – led to a sustained and comprehensive accountability reform of the US kind (cf. Bogason 1996; Irjala & Eikås 1996; Prahl & Olsen 1997; Pollitt & Bouckaert 2004²). The concurrent debate about joining or not joining the EU may have had a share in this decision-making, in that EU participation was often framed in terms of the risk of more centralization (cf. Karlsen 1994). However, tackling challenges by way of an issue-focused and pragmatic step-by-step approach, with special regard to how lower levels of government, such as districts and municipalities, could deal with any emerging tool kit, was consistent with what I am calling their constitutional mindset.

Thus, as background for understanding the Nordic education sector, one has to know that schools and their teachers played a pivotal role in the nation-building processes across the region, and in the shaping of national identities (cf. e.g. Slagstad 1998; Telhaug & Mediås 2003; Korsgaard 2004; Werler 2004; Telhaug 2005). Moreover, schools are seen not only as places for the young, but as the cultural core of the local community – which turns the local and regional distribution of schooling into an always contested issue.

Until the 1990s the main tool used to govern the school curriculum was curriculum guidelines, developed mainly by the state administrations by way of committees largely consisting of experienced teachers and subject matter specialists (cf. Gundem 1992, 1993, 1997; Sivesind, Bachmann & Afzar 2003; Bachmann, Sivesind & Hopmann 2004). Local schools and teachers had considerable leeway to pick and choose within this curriculum frame in order to develop locally adapted teaching programs. There was no regular state-run evaluation of the outcomes of teaching, and, indeed, outside research circles not even the concept was familiar (cf. Hopmann 2003). In a Nordic perspective schools were seen as places run by highly educated and esteemed teachers, who knew best how to do their job. Curriculum change was primarily seen as a matter of dialogue between local experience and national needs; changes were typically introduced by way of lengthy try-out periods, and with an often extraordinary involvement of all levels of schooling and administration. Of course, this was by no means a paradise of peaceful change: each and every curriculum reform has had its proponents and opponents, and the interplay between the school sector, research, politics and the public was at times pretty contentious (cf. Sivesind, forthcoming). However, this played out within the context of school systems that enjoyed, for most of the time, broad support at all levels of society.

In this context it was no surprise that the first reaction to sharp national and international criticism of schooling was a re-doing of what had been successful. In the case of Norway, for instance, the first contemporary criticism of the school system was voiced by an OECD panel (1988) and by a national committee commissioned by the parliament (NOU 1988:22). Reflecting the emerging NPM discourse, both concluded that the weaknesses of the national school system were, significantly, an outcome of an underperforming school governance structure, which was not able to ensure that the goals of the curriculum guidelines were being reached. Two conclusions were drawn: On the one side, a sweeping reform of the whole school curriculum was launched, beginning with a new general curriculum frame (L93), followed by new comprehensive guidelines for the upper secondary sector (R94) and for the elementary and lower secondary schools (L97). The frame stressed the double purpose of schooling as caretaker of the national and local heritage and as knowledge-promoter (cf. STM 29 1994/95). The subsequent curriculum guidelines received a new structure: they were to focus on the most important requirements and to state these expectations in terms of goals (a kind of management-by-objectives approach) that could be reached by average schools. On the other side, a re-make of the governance structure was inaugurated, constructing a double-faced reform combining a re-focussing of national steering with a stress on the importance of local autonomy and responsibility for reaching these goals (cf. STM 37 1990-1991; STM 47 1995/96; KUF 1997). The reform was supported by numerous in-service and research programs to help districts, municipalities, schools and teachers identify the major obstacles and prepare for the enactment of the new guidelines. This new orientation was complemented by initiatives to develop school-based and peer-guided school improvement (cf. e.g. Granheim, Kogan & Lundgren 1990; Karlsen 1993; Ålvik 1994; KUF 1994; Haug & Monsen 2002; Nesje & Hopmann 2003). However, this first take-up of NPM-like measures was to infuriate many educators, politicians and practitioners alike; these critics felt that the tool-kit of accountability was an "instrumentalist mistake" that did not fit the national traditions of schooling and that challenged the former strategy of placement, i.e. the compulsory comprehensive school (cf. e.g. Hovdenak 2000; Koritzinsky 2000; Lindblad, Johannesson & Simola 2003).

When a new liberal-conservative government felt that these first steps of reform were still not enough to ensure adequate school development, it commissioned a new national report to recommend additional measures. What emerged was a peculiar understanding of school development as development of "quality", in which "quality" represents a rather vague and all-encompassing understanding of whatever might affect the outcomes of schooling (cf. STM 30 2004; Birkeland 2007; Sivesind forthcoming). The then-secretary of education, Kristin Clemet, expressed the basic rationale of this approach as follows:

Society's reasons for having schools, and the community tasks imposed on them, are still relevant today: Education is an institution that binds us together. We all share it. It has its roots in the past and is meant to equip us for the future. It transfers knowledge, culture and values from one generation to the next. It promotes social mobility and ensures the creation of values and welfare for all. For the individual, education is to contribute to cultural and moral growth, mastering social skills and learning self-sufficiency. It passes on values and imparts knowledge and tools that allow everyone to make full use of their abilities and realize their talents. It is meant to cultivate and educate, so that individuals can accept personal responsibility for themselves and their fellows. Education must make it possible for pupils to develop so that they can make well-founded decisions and influence their own futures. At the same time, schools must change when society changes. New knowledge and understanding, new surroundings and new challenges influence schools and the way they carry out the tasks they have been given. Schools must also prepare pupils for looking farther afield than the Norwegian frontiers and being part of a larger, international community. We must nourish and further develop the best aspects of Norwegian schools and at the same time make them better equipped for meeting the challenges of the knowledge society. Our vision is to create a better culture for learning. If we are to succeed, we must be more able and willing to learn. Schools themselves must be learning organizations. Only then can they offer attractive jobs and stimulate pupils' curiosity and motivation for learning. . . . We will equip schools to meet a greater diversity amongst pupils and parents/guardians. Schools are already ideals for the rest of society in the way they include everybody. However, in the future we must increasingly appreciate variety and deal with differences. Schools must have as their ambition to exploit and adapt to this diversity in a positive manner.

If schools are to be able to achieve this, it is necessary to change the system by which schools are administered. National authorities must allow greater diversity in the solutions and working methods chosen, so that these can be adapted and customized to the situation of each individual pupil, teacher and school. The national authorities must define the objectives and contribute with good framework conditions, support and guidance. At the same time, we must have confidence in schools and teachers as professionals. We wish to mobilize to greater creativity and commitment by allowing greater freedom to accept responsibility. . . .

All plans for developing and improving schools will fail without competent, committed and ambitious teachers and school administrators. They are the school system's most important assets. It is therefore an important task to strengthen and further develop the teachers' professional and pedagogical expertise and to motivate for improvements and changes. This Report heralds comprehensive efforts regarding competence development in schools. Education must be developed through a dialogue with those who have their daily work in and for schools. (Introduction to STM 30 2004)

The difference from the accountability rhetoric of NCLB is striking. Where NCLB is focused solely on "academic standards" and on allocating responsibilities to states, districts, teachers and students, the Norwegians talk about allowing for "greater diversity", about the core role of the teachers and – above all – about "confidence in schools and teachers as professionals". The "comprehensive effort" announced is built around three dimensions – structure, process, and outcomes – and both the minister and the committee stress time and again that one cannot expect better results without improving the structures and the processes, and without considerable help from all sides (cf. STM 30 2004). The committee tried to embed the new tools in a way that is less offensive to traditionalists, by integrating the new into the familiar concepts of local monitoring and school autonomy. The proposals included the establishment of a national testing procedure to ensure that basic competencies are achieved, but stressing the "basic" and seeing the tests first and foremost as a helping hand to assist schools in diagnosing where they may have a need for improvement (cf. NOU 2003:16).

The introduction of the national testing has been very difficult and is not yet a finished task; it is disputed by researchers and practitioners alike, and still far from anything resembling NCLB (cf. Langfeldt, Elstad & Hopmann 2007). Nobody speaks about "evidence-based teaching" or "data-driven decision making" as prime tools to make school improvement work; the data are seen as a limited indicator, which has to be embedded in a wider understanding of a school's program and needs. But even this limited aspiration has put tremendous stress on both the national test developers and the local communities and schools as they try to meet the new expectation regime. Because of their poor technical quality, the first wave of national tests was met by sharp criticism, even from the supporters of their use. This forced the government to take a one-year break and completely redo the tool-kit of assessment (cf. Lie 2005; MMI 2005; Telhaug 2005; Langfeldt 2007b). As a result, many schools and municipalities felt more confused than controlled by the new measures. It seems that it will take some time before a more coherent pattern of working with national monitoring emerges and the different levels find sustainable strategies for dealing with the new tool kit of expectation management (cf. Møller 2003; Riksrevisjonen 2006; Sivesind, Langfeldt & Skedsmo 2006; Elstad 2007; Elstad & Langfeldt 2007; Engeland, Roald & Langfeldt 2007; Isaksen 2007). However, the prevailing attitude towards what might be expected can be illustrated by what the principal of a top-scoring school said at a national leaders' conference: "One shouldn't put too much into these results"; they reflected, he said, only a small part of his school's program and did not inform his school about the challenges it faced, not least in relation to special education. In any event, one should not expect his school to be on top next year; the next year's class wasn't close to the quality of this one. This was not just a fine display of public Norwegian humbleness ("you shouldn't believe you are someone"). He seemed genuinely concerned that the unexpected success would divert attention from the more pressing problems of his school and mislead parents and local politicians, with the implication of less support in tackling his school's problems. This reaction is similar to the one seen in Finland as the Finns discuss their country's overwhelming PISA success and its more or less unintended side-effects (cf. e.g. Simola 2005; Kivirauma, Klemelä & Rinne 2006; Uljens in this volume).

PISA and its predecessors like TIMSS played an important, but not a key, role in this development in Norway. The move towards a policy change had started long before PISA came into being. The TIMSS and PISA data underlined that there were some substantial shortcomings to address, but PISA was not taken as a sufficient description of the challenges ahead, either in the relevant committees or in the parliament. Nor did PISA lead to a fundamental change in the course of action, with the one exception that the new generation of curriculum guidelines tries to adapt some of PISA's competence conceptualizations. But this was not by chance. Most Nordic PISA researchers were scrupulous in outlining the reach of their results, pointing to the limited scope of PISA's material, admonishing against any attempt to simplify the complexities, and warning against any expectation of comprehensive political solutions based on PISA (cf. e.g. Mejding & Roe 2006). The most substantial criticism of PISA's reach came from within, from researchers with close connections to the project. They have analysed in particular the match and mismatch of PISA constructs with their nations' traditions of knowledge culture and schooling (cf. Olsen 2005 and in this volume; Sjøberg in this volume; Dolin in this volume). They have discussed if and how PISA reflects the social and cultural diversity of student achievements (cf. e.g. Allerup 2005, 2006 and in this volume). In addition, the PISA project tried from its beginning to place a main focus on schools as the decisive units of action. This was not easy: PISA does not provide comprehensive, independently cross-checked school data, but relies instead on the descriptions of school climate and classroom practice provided by the students and the teachers themselves – a weak source, because of the well-known variance in the ways students describe the same experienced curriculum (cf. Turmo & Lie 2004).



It would exceed the scope of this chapter to address the subtle differences between the Nordic countries, with their different levels and shades of public debate on PISA and national testing (cf. Langfeldt, Elstad & Hopmann 2007). What is important, however, is another fundamental difference from the NCLB approach which the Nordic countries have in common. As a recent survey of teacher education in the Nordic countries shows (Skagen 2006), they share a fundamental trust in the quality of their teachers and of the underlying teacher education. It is not that nothing might be improved. Rather, they feel that teachers are well enough educated to do what is necessary if they are given the means and the challenges to do so. The core issue then becomes how to improve the local communities' "room to move", their ability to unleash teachers' energies and to monitor progress in a supportive way (cf. Engeland, Roald & Langfeldt 2007).

No State Left Behind

What a difference PISA can make was nowhere more visible than in Germany and – with a typical delay – in Austria. In Germany PISA was "big news" from the beginning, filling newspapers, forcing political responses, engaging everyone interested in school affairs (cf. summarizing Weigel 2004). The Austrian reaction was somewhat slower; Austria seemed to have fared better in PISA 2000, at least better than Germany, which counts for quite a lot in Austria (cf. Bozkurt, Brinek & Retzl in this volume). When it turned out that Austria scored worse in PISA 2003, and that the better results of 2000 might have been an artefact of flawed sampling (cf. Neuwirth 2006; Neuwirth, Ponocny & Grossmann 2006), the discussion climate changed dramatically. Now both school systems were seen to be in a deep crisis, not least a crisis of their traditional school structures and their outworn forms of teaching (cf. summarizing Terhart 2004; Bozkurt, Brinek & Retzl in this volume).

The response pattern as such was no surprise. Both countries have had, since the school reforms of the late 18th century (cf. Melton 1988), recurrent "big school debates" every 20 to 30 years. Every debate is a struggle about the national curriculum, and (about) every second debate is more specifically focused on the structures of schooling and their implications (as was the case for Prussia/Germany in the early 19th century, the 1850s, the 1890s, the 1920s, and finally in the 1960s and early 1970s; cf. summarizing Hopmann 1988, 2000).



The important role of school-structure issues within this pattern results from the understanding, in both countries, that at least since the reforms of the late 18th century schools are state-owned and state-run systems – at the national level in Austria, at the state level in the Federal Republic of Germany. Local municipalities have some responsibilities for "outer" school matters, such as buildings and equipment, but the curriculum, the hiring and firing of teachers, the licensing of school books, the day-by-day control of all "internal" school matters etc. are seen as being within the realm of the state's school administration. Moreover, both countries have stratified school systems, in which secondary schools are divided into different strands for "high" and "low" achievers, providing, e.g., different schools for "academic achievers" (Gymnasium), for more "practically oriented" youth (Realschule, Hauptschule, Berufsschule), and for children with "special needs" (Sonderschule). The decision about which kind of school a student should attend is normally made following 4th or 6th grade (in the 19th century the division was from the first grade onwards). Both countries have a system of vocational education combining school with on-the-job training, sometimes beginning at the lower secondary level, but more usually covering those who do not attend a Gymnasium or the like for upper-secondary education. However, in both countries the rate of attainment of the highest academic qualification (Abitur, Matura), and thereby of access to universities, is considered the key indicator of social equity (cf. Becker & Lauterbach 2004).

In that it is the state, and the state alone, that regulates schools, school structures can be understood as institutionalized expressions of the state's view on social class and stratification. The proverbial example of inequality in the school debates of the 1960s was the Catholic working-class girl from a rural area attending a Hauptschule; she has now been replaced by the Muslim daughter of an immigrant family living in a poor inner-city district who also attends a Hauptschule or a Sonderschule (cf. summarizing Berger & Kahlert 2005). Within this frame, school-structure debates tend to become debates on social division; the stratified school system is regularly defended by conservatives and economists, whereas the move towards a comprehensive school system is an affair of the heart for social democrats and the labour movement, without regard to whether one system or the other has a better record in terms of social equity. In both countries the core argument is the assumed, yet not proven, effect stratification might have on human capital: does stratification lead to a structural underperformance of lower-class students, or does a comprehensive school limit the space and speed of development of high-achievers, and vice versa (cf. Bozkurt, Brinek & Retzl in this volume)?

How PISA fits into this frame is easily understood if one takes its official purpose as stated by its owner, the OECD:

Quality education is the most valuable asset for present and future generations. Achieving it requires a strong commitment from everyone, including governments, teachers, parents and students themselves. The OECD is contributing to this goal through PISA, which monitors results in education within an agreed framework, allowing for valid international comparisons. By showing that some countries succeed in providing both high quality and equitable learning outcomes, PISA sets ambitious goals for others. (Angel Gurría, OECD Secretary-General, as introduction to PISA 2006, 3)

According to the same source, PISA’s “key features” have been so far:

– Its policy orientation, with design and reporting methods determined by the need of governments to draw policy lessons.
– Its innovative “literacy” concept, which is concerned with the capacity of students to apply knowledge and skills in key subject areas and to analyse, reason and communicate effectively as they pose, solve and interpret problems in a variety of situations.
– Its relevance to lifelong learning, which does not limit PISA to assessing students’ curricular and cross-curricular competencies but also asks them to report on their own motivation to learn, their beliefs about themselves and their learning strategies.
– Its regularity, which will enable countries to monitor their progress in meeting key learning objectives.
– Its contextualisation within the system of OECD education indicators, which examine the quality of learning outcomes, the policy levers and contextual factors that shape these outcomes, and the broader private and social returns to investments in education.
– Its breadth of geographical coverage and collaborative nature, with more than 60 countries (covering roughly nine-tenths of the world economy) having participated in PISA assessments to date, including all 30 OECD countries. (ibid., 7)

The “policy orientation, with design and reporting methods determined (sic!) by the need of governments to draw policy lessons” has led to a wealth of national and OECD reports using PISA data as a means to assess the quality of school structures and schooling, issues of social inequality, gender, migration etc., and, not least, to comparisons, again and again, of countries and their PISA performance in relation to other OECD indicators (most of this available online at http://www.pisa.oecd.org).



Of course this approach rests on a number of implicit assumptions, which are anything but self-evident:

– The assumption that what PISA measures is somehow important knowledge for the future: There is no research available which proves this assertion beyond the truism that knowing something is always good and knowing more is always better. There is not even research showing that PISA covers enough to be representative of the school subjects involved or of the general school knowledge base. PISA items are based on the practical reasoning of its researchers and on pre-tests of what works in all or most settings – not on systematic research on current or future knowledge structures and needs (cf. Dohn 2007; Bodin, Jahnke, Meyerhöfer, Sjøberg in this volume).
– The assumption that the economic future is dependent on the knowledge base monitored by PISA: The little research on this theme – which assumes that there is a direct relation between test scores and future economic development – relies on strong and unproven arguments which have no basis when, for instance, comparing success in PISA’s predecessors with later economic development (cf. Fertig 2004; Heynemann 2006).
– The assumption that PISA measures what is learned in schools: This is not PISA’s own starting point, which is precisely not to use national curricula as the point of reference (as, e.g., TIMSS does; cf. Sjøberg in this volume). The decision to focus on a small number of issues and topics which can be expected to be present in all participating countries leaves open the question of how these items represent the school curriculum as a whole (cf. Benner 2002; Fuchs 2003; Ladenthin 2004; Hopmann 2001, 2006; Dolin, Sjøberg, Meyerhöfer in this volume) – beyond the fact that those who are successful in school do, on average, better on PISA, which is hardly a surprise inasmuch as PISA requires cognitive and not least language skills which are helpful in schools as well. Some even argue that PISA first and foremost monitors whatever intelligence testing monitors (cf. Rindermann 2006), which could lead to the somewhat irritating implication that, according to PISA, e.g. Finns are more “intelligent” than Germans or Austrians.
– The assumption that PISA measures the competitiveness of schooling: One has to keep in mind that at best 5–15 percent of the variance in the PISA results can be attributed to lasting qualities provided by the schools studied (see the sketch after this list; cf. already Watermann et al. 2003; for the principal problems of re-constructing schooling and teaching based on such data see Rauin 2004). Most of the variance can be attributed to outside factors that are mostly beyond the reach of schooling (such as social background; cf. Baumert, Stanat & Watermann 2006).
– The assumption that PISA thus measures and compares the quality of national school provision, not least of school structures, teacher quality, the curriculum, etc.: Although school effects as such have a very limited role in the results of PISA, one has to add that a) PISA has a considerable sampling and cultural-match problem, which reduces its trustworthiness as an indicator for national systems, at least for systems with the small differences seen between Western countries (see the contributions to this volume), and b) since Coleman’s seminal study (1967) it is well known that schools have only a very limited impact on the social distribution of educational success when compared to factors such as the social fabric of the surrounding society (cf. Shavit & Blossfeldt 1993; Becker & Lauterbach 2004). Moreover, by its very design PISA is forced to drop most of what might indeed indicate specifics of national systems (cf. Dolin, Langfeldt in this volume).
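To give a concrete sense of the order of magnitude involved, claims of this kind are usually discussed via a two-level decomposition of test-score variance: whatever is attributable to lasting school qualities has to fit inside the between-school component. The following is an editorial sketch of that arithmetic, with invented numbers chosen only to fall inside the band cited above; it is not a computation from PISA data:

% Illustrative two-level decomposition (editorial sketch, not from the PISA reports):
% the score y_{is} of student i in school s splits into a between-school
% and a within-school component.
\[
\operatorname{Var}(y_{is}) = \sigma^2_{\mathrm{between}} + \sigma^2_{\mathrm{within}},
\qquad
\rho = \frac{\sigma^2_{\mathrm{between}}}{\sigma^2_{\mathrm{between}} + \sigma^2_{\mathrm{within}}}
\]
% With the invented values sigma^2_between = 10 and sigma^2_within = 90 (arbitrary
% units), rho = 10/100 = 0.10: ten percent of the variance lies between schools,
% inside the 5-15 percent band. Even this share is only an upper bound on what
% schools themselves contribute, since intake differences are part of the
% between-school component.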

In short: PISA relies on “strong assumptions” (Fertig 2004) based on weak data (cf. e.g. Allerup, Langfeldt, Wuttke in this volume) that appeal to conventional wisdom (“education does matter, doesn’t it?”; “school structures make a difference, don’t they?”), but on almost no empirical and historical research supporting its implied causalities.

But this has not kept either PISA researchers or the public from using PISA as if such causal relations were given. Otherwise one could not explain the two main impacts which PISA has had on school administration and policy making in Austria and Germany, both directly referring to this frame of reference, albeit using it somewhat differently. Thus PISA’s approach to competency measuring has been a sweeping success in both countries, in part fuelled by the “national expertise” (Klieme et al. 2003) produced by researchers close to the PISA efforts, who argue that a national monitoring of student achievement based on an approach similar to PISA is both necessary and feasible (cf. Jahnke in this volume). Based on this, the German education ministers of the states have established a process towards such national standards and given a helping hand to the founding of a national institute for progress in education (Institut zur Qualitätsentwicklung im Bildungswesen) with functions similar to those of the US National Assessment of Educational Progress (NAEP). Both of these accomplishments are significant in Germany, given that curriculum matters are normally considered to be state, not federal, responsibilities, and that there was no previous tradition of state-run outcome controls (except for some standardizations of final exams in a few states). Prior to this point there had been more than 4000 different state curriculum guidelines that were the main road to defining expected results, without any regular control of whether they were achieved (as in the Nordic countries; cf. Hopmann 2003). Similarly, the Austrian government has initiated a not yet finished project to develop and implement national competency standards as an alternative to the former guidelines and to combine this with regular testing (cf. the material collected at the official site http://www.gemeinsamlernen.at). All this in spite of the fact that the impact of the use of national or state assessment on what PISA and similar projects measure is at best weak in either direction (cf. Amrein & Berliner 2002; Bishop & Woessmann 2004; Fuchs & Woessmann 2004), and that the overall importance of meeting the goals which PISA and the corresponding state standards happen to measure is at best a good guess without a solid research foundation.
foundation.<br />

Whereas this approach seems to have support across the whole political spectrum, the second impact has proven to be rather divisive: Based on PISA and similar studies, researchers and politicians have – as mentioned above – reopened the debate on school structures. Interestingly, both sides – proponents and opponents of a comprehensive system, proponents and opponents of an early school start, proponents and opponents of an integrated teacher education, etc. – feel themselves encouraged by the very same PISA data, which the other faction uses as well. The most prominent example of this is a national report on schooling, written by a number of “leading experts” (i.e. researchers utilizing PISA and the like) on behalf of the Confederation of Bavarian Industry (VBW), which – focused on equity issues – argues that there is ample research evidence for re-organizing the whole school system as a two-tier organization (cf. Aktionsrat Bildung 2007). On the other hand, the leader of the PISA effort at the OECD, Andreas Schleicher, is totally convinced that PISA proves the advantages of a comprehensive system. The leader of the national PISA effort in Austria, Günther Haider, managed first to support a continuation of the current stratified structures, then a transition towards a comprehensive system, in both cases claiming PISA data as evidence for his recommendations (cf. Bozkurt, Brinek & Retzl in this volume).

Public criticism of the empirical evidence provided by PISA has been weak in both countries. The devastating results were all too much in line with the political need to find good causes at the end of the economically painful reunification process in Germany, and at a time when both countries felt themselves economically underperforming compared to other European countries, not least those seemingly more successful in PISA, such as Finland. Even the scientific discourse took the economic reasoning backing PISA for granted, arguing that PISA reduced “Bildung” to economic necessities and the needs of globalization, thereby acknowledging the unfounded premises of PISA’s ability to monitor and guide the school curriculum (cf. e.g. Huisken 2005; Lohmann 2007). Except for the obvious case of the Austrian sampling problems (Neuwirth 2006), the few methodological objections that were voiced were either ignored or ridiculed by the PISA community and its supporters, and have had – at least up to now – no substantial impact on the public standing of PISA in either Austria or Germany (cf. the introduction to this volume).

The “no state left behind” approach of the OECD and its German and Austrian consorts leads to the somewhat paradoxical effect that PISA has the most impact by way of the by-products of the PISA research, which in design and methodology are most probably the weakest links of the whole enterprise. But even this is not without precedent. Re-reading Georg Picht’s volume on the “education catastrophe” (1964), which started the last “big school debate” in the mid-1960s, one is amazed how little of his evidence could stand the test of time and how much of it was simply speculative. However, the book was the single most important lever for the ensuing debate on how the school system should adapt to the social changes at the end of the post-war reconstruction period, a process which ended with the biggest expansion of the educational system and of public expenditure in the history of schooling. This process also included the temporary transfer of a tool-kit, scientific curriculum development, from the US – in spite of its then self-pronounced “moribund” state on its home turf (cf. Hopmann 1988). At least in Austria and Germany, PISA seems to have achieved something similar: to help politicians, educators and the public to get the educational field in touch with the transition processes going on in the whole public sector, by providing them with a sense of what “manageable expectations” might be, and with tools to monitor their success – or failure.

3 Comparative Accountability

Thus the overall picture of the accountability approaches I have reviewed shows three very different basic philosophies of what this transition is about (cf. fig. 1):



                           | No Child (NCLB)                  | No School (Nordic)                                   | No State (PISA)
Core Data                  | Student achievement              | Aggregated school achievement data                   | Aggregated national student achievement
Main Tools                 | Standards controlled by testing  | Testing with regard to opportunities to learn (OTL)  | Competencies measured by random testing
Stakes                     | High stakes                      | Low stakes                                           | No stakes (PISA); low or high stakes (standards)
Driving Force              | Blame                            | Community                                            | Competition
Main levels of attribution | Classroom & teaching             | Local school management                              | School systems/society at large
Spirit                     | Data-driven “best practice”      | Customized                                           | Research-based
Accountability             | Bottom up                        | Bridging the gap                                     | Top down
The role of PISA           | Almost none                      | Supporting act                                       | Main act

Fig. 1: Basics of the No Child, No School, No State Left Behind Strategies

Of course, this table only pinpoints the main assumptions and entry points of each approach. In the nature of public schooling, each approach carries elements of the others. Additional analysis of more countries would show that there are mixed patterns combining elements of different modes of accountability (e.g. the case of Switzerland would probably reveal a mixture of “no school” and “no canton” strategies, Canada a mixture of “no child” and “no school”, etc.; cf. e.g. BFS 2005; Rhyn 2007; Stack 2006; Klatt, Murphy & Irvine 2003; Ma & Crocker 2007). And there is a fourth pattern, where “no accountability has yet arrived”, and where the public sector is in the early stages of a transition towards accountability policies, and therefore not yet open to the influx of international accountability measures (as, for instance, in Italy, where PISA has been no real issue and even the government has treated it as almost non-existent; cf. Nardi 2004).



Emerging issues

Each strategy has its own strengths and carries its own risks, depending on the larger concept of expectation management it is a part of:

The “no child” approach has the advantage of a clear focus: Everybody knows what counts and how it is measured. But the price for this is what David Berliner (2005) calls “an impoverished view of educational reform”, a system of accountability checks which places “statistical significance” above all other ways of looking at individual and institutional achievements. The very narrow conceptualization lends itself to reduced remedial strategies: “evidence-based” or “best-practice” models, or “data-driven decision making”, only make sense if it is assumed that assessment data are all that counts and that local conditions do not play a significant role, or at least can be overcome if one does as the successful do. But while NCLB’s “blueprint” (US Department of Education 2007) honours a rather naïve empiricism, much of the NCLB-induced research provides more complex insights into the complexities of school life, using a whole range of mixed methods and avoiding the fallacies of an engineering approach to social transition (cf. e.g. Elmore 2006; O’Day 2007).

However, it seems unlikely that the insights produced by this research will have any lasting impact on the NCLB movement: the prime implication of this research – the importance of capacity-building in local schooling, with special regard to the unique mix of challenges at hand – is contrary to any belief that the same high stakes for everyone, and distributing blame and shame in large portions, is a reasonable approach to making accountability work.

Moreover, it is this one-sided focus which allows the transformation of the apparent problems with equity and equality – with minorities, special needs, gender etc. – into individualized liabilities, whose impact on achievement has to be minimized, if not eradicated. If high stakes as the sole approach to this fails (and research suggests that it will; cf. Linn 2007), there are a number of technical options to ease the burden, such as lowering the ceilings, adding opt-out clauses for the worst students and/or schools, inflating the number of stakes such that everybody can succeed in something, and not least creating more school choice and vouchers, which leaves the responsibility for choosing the right school with the parents. All of these options are under consideration in the current debate on the renewal of NCLB (cf. US Department of Education 2007). Choice is, of course, the core of a strategy of passing the basket on to the next in line of the accountability chain, i.e. the ones who seemingly bring the liabilities to school: the parents, the minorities, the poor, those with special needs, etc. We can expect more accountability tools, e.g. contractual attainment goals and/or connecting welfare subsidies or other sanctions with them, to make these families directly responsible for the outcomes. The achievement gap will not disappear with these moves; rather, what once was considered a failure of the school system to cope with the diversity of society (cf. e.g. Coleman et al. 1967) will be turned step by step into a problem of individual customers failing to meet rising expectations.

The “no school left behind” approach of Norway, and of most of the Nordic countries, is far away from such reductionism, but also pays a heavy price. The double task of embedding the new strategies in the traditional tool-kit, and of doing so in close co-operation with the local level, obviously leaves many wondering if there is a real change process going on, and if there is, what it consists of. No real sense of the new obligations has emerged in schools and municipalities, and it seems as if they respond to the new accountability expectations with classic Nordic “muddling through”: planning, coordinating and reporting on a local level time and again, with no real stakes and inconclusive outcomes (cf. Engeland, Roald & Langfeldt 2007; Elstad 2007). That the first national tests were a technical disaster (cf. Lie et al. 2005) and prompted a break in the whole process did not really help, nor did the new curriculum guidelines of 2006, which, in spite of much of the rhetoric, do not require more adjustments than earlier guidelines – i.e. most teaching does not change significantly as a result of their adoption, and the prime concern of teachers remains local adaptation, not national outcomes (cf. Bachmann & Sivesind 2007).

But this will not be the end of the story! The key question is what will happen if the current effort, which even from a Norwegian perspective is quite expensive, fails to achieve significant and sustainable gains beyond those which come as the system gets used to the new tools. Social and economic inequality are rising rapidly, and knowing how this can affect both schools and students, growing achievement disparities and gaps will be no surprise. Two response patterns seem likely: The first would be to move even more rapidly towards more radical accountability strategies, i.e. raising stakes, adding more national testing and, most importantly, adding sanctions for those who continue to fail. This would put tremendous pressure on the comprehensive school system: homogeneous schools without too many non-achievers will succeed and tell the public that the time of an all-encompassing school has come to an end. The other strategy would introduce more choice and private options into the system (as is already the case in Sweden and Denmark), thus allowing schools to remove their most challenging parents and most challenged children, leaving the public school as the main route for ‘average’ people (cf. Kvale 2007). Both strategies would imply a definite end of the “one school for all” notion. This idea is deeply engrained in the social fabric of Nordic societies, and the move will be no easy task (and will lead in Norway to a continuing back and forth over who is allowed to opt out, and why). But there are no other ways to reconcile the former management of places with the new needs of accountability, even if it takes time before the “muddling through” is forced to accept this consequence as inevitable.

None of this applies to the two leading examples of the “no state left behind” strategy, Austria and Germany. Both have fragmented school systems in which comprehensive schools play no significant role. Moreover, for the moment, both have easy access to knowing whether their school improvement works. All they seemingly have to do is wait for the next PISA wave; it will then be clear who has lost or won in the interstate competition (of course only if one believes that PISA is indeed able to tell something about that). The main risk lies in the deeply engrained traditions of how to deal with “big school debates”, because these traditions transform the achievement problem into one of school structure and other institutional change. The issues at stake are more or less the same in both countries (cf. Aktionsrat Bildung 2007; Retzl, Bozkurt & Brinek in this volume): Comprehensive schools or different tracks? Compulsory pre-school education, and if so for whom? Unified teacher education or different routes for different types of schooling? Special schools for special needs or inclusive education? Keeping the double structure of vocational education (school plus training on the job) or integrating vocational education into some general kind of upper-secondary schooling?

If the attempts to force a comprehensive re-structuring fail (and there is no empirical or political evidence indicating that this could turn out otherwise), then at least two possible outcomes are likely: The first would be to move towards a more NCLB-like approach to accountability, i.e. adding more stakes and tests (e.g. unified entrance and exit exams), including all levels and becoming more all-encompassing than is possible within PISA, i.e. requiring more data on single schools, school districts and the different federal states, eventually extending the screening beyond student achievement towards indicators of teaching patterns, teaching materials, teacher qualifications, student-career data and the like. But, at least in Germany, such an approach faces the obstacle that schooling is constitutionally a matter for the states, not the Federal government – which means that there are no means of enforcing alignment beyond that which all states agree upon. In Austria the Federal government has the necessary constitutional backing for federal involvement; but given that the country has, since its reconstruction after World War II, depended on compromise between the two largest political wings (the social democrats and the conservatives), each controlling about half of the states, it is unlikely that any lasting agreement on a comprehensive accountability approach is feasible. This brings the second option to the forefront, namely to dissolve, or embed, the national accountability measures in an internal competition between the different federal states. Those confident of their success would prove the advantages of their chosen solutions with their own data; those not meeting the standards would have to answer with their own explanations of why a mismatch was unavoidable. In the end, there would be an inextricable hodgepodge of testing, controlling, monitoring etc., with each state having its own tool-kit of accountability measures.

But at least two problems would be left behind by either option: On the one hand, none of this addresses the problems of sustainable inner-school development and capacity-building over the long run. A race to match changing expectations by restructuring the system will not leave energy and resources to address the tricky problems of “no teaching left behind” (cf. Terhart 2005). Secondly, both approaches would lead to a further marginalization of the special-needs students who are already invisible, or turned into liabilities, in the PISA approach (cf. Hörmann 2007 and in this volume). They promise to become even more marginalized inasmuch as they don’t help to win a competition that has individualized academic achievement as its basic rationale (cf. Hopmann 2007). Each and every new round of testing would only reaffirm their “lower” abilities and the “superiority” of the schools dealing with high achievers, thus petrifying the hierarchy of schools. Given that this hierarchy has always been experienced as an expression of the social fabric of society, and of the state’s position towards it, one can only imagine how rapidly this will lead to inner tensions in a school system surrounded by a society with rapidly growing social inequalities.

PISA in Transition

Most of the emerging issues stem from inner tensions between the former management of placement and the new expectation regime. The data-driven NCLB disintegrates the former “place called school” (Goodlad) into concurrent, but not intertwined, individual challenges of meeting the standards. The “no school left behind” approach has difficulties in embracing a coherent set of expectations, as it dissolves the idea of accountability with its old routines of institutionalized muddling-through. The “no state left behind” strategy is at risk of answering the new expectations by functionalizing them for a renewal of the conventional restructuring game, without really changing what is going on inside schools and classrooms.

The success of PISA within this transition is made possible by a certain “fuzziness” of design and self-presentation. It treats the links between student, school and national achievements as evident, thus allowing for a black-box approach to schooling itself (economists call that a production function) in which the coincidence of results and factors is transformed into correlations and causalities, without proving how exactly this linearity comes into being.
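In its simplest textbook form (sketched here only to unpack the economists’ shorthand, not as a description of PISA’s actual estimation machinery), such a production function treats measured achievement as the output of a black box of inputs:

% Generic education production function (editorial sketch; the symbols are
% placeholders, not variables taken from the PISA framework):
\[
A_{is} = f(S_s, F_i, P_i) + \varepsilon_{is}
\]
% A_{is}: tested achievement of student i in school s; S_s: school inputs;
% F_i: family background; P_i: peer and individual factors;
% epsilon_{is}: everything unobserved.

The critique above is precisely that the co-occurrence of the left- and right-hand sides is read off as if f were known and causal, without showing how that linearity comes into being.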

Within a management of placements, PISA and the national testing inspired by it would be dysfunctional, in that it covers only a few aspects of schooling, and these in a way which does not allow for research-based decision-making concerning the whole school, or even teaching and learning under given conditions. As a tool of expectation management, however, PISA allows each setting to address problems as they are framed within the respective constitutional mindsets, using PISA as “evidence”. Thus PISA refreshes the never-ending dispute in Germany and Austria on school structures and their relation to social class and diversity, reinvigorates the co-dependency of national government and local community in the Nordic countries, and reaffirms the starting point of the NCLB discourse on failing schools and teachers as the main culprits for the uneven distribution of knowledge and cultural capital in Western societies.

The irony of this story is, of course, that PISA achieves this not in spite of, but because of, its shortcomings. Although it uses advanced statistical tools, PISA stays methodologically within the frame of a pre-Popperian positivism, which takes item responses for the realities addressed by them. There is no theory of schooling or of the curriculum which would allow for a non-affirmative stance towards the policy-driven expectations which, according to the OECD, “determine” “the design and reporting methods” of PISA (OECD 2007). There is no systematically embedded concept of how as yet unheard voices and non-standardized needs could be recognized as equally valid expressions of what schooling is about. Accordingly, none of the newer developments in educational research addressing the situatedness, multi-perspectivity, non-linearity or contingency of social action plays a significant role in PISA’s design (cf. Hopmann 2007). Even so, there are many quite advanced options for using PISA data within mixed-methods or other more comprehensive research designs, which could address some of PISA’s inherent weaknesses as well (cf. Olsen in this volume). But to incorporate such developments on a large scale would be close to impossible: they do not lend themselves to such generalized bottom lines as league tables, and to include them in a large-scale study of the size of PISA would require resources far beyond those available even to PISA.

As an entry to the commencing accountability transition, PISA has done a significant job in facilitating and illustrating the difficulties any approach to these issues will have to face. But it might be that the PISA frenzy has already reached its peak, or is very close to doing so (the next wave of results, coming in December 2007, will show whether this is the case). If the PISA frenzy is drawing to a close, however, it will not be because of the technical mishaps and fallacies discussed in this volume. Such details go unnoticed by politicians and the public. If PISA loses its unique position, it will happen because of its success, because of the multiplying of PISA-like tools in national and state accountability programmes. If the NCLB experience holds true, PISA will be reduced to being just one voice in the polyphonic concert of assessment results, and – having no sanctions other than statistical blame – will be overcome by accountability measures that carry more immediate risks for those involved. The important question for the future of educational research is how much of PISA will then be left behind: to what extent will its methodological reductionism prevail as the state of the art of comparative research? But the more pressing question centers on the long-term effect its conceptualization of student achievement will have on the public understanding of what schooling is about. What will happen to the school subjects left out, to the special needs that are marginalized, to school tasks which have nothing to do with higher-order academic achievement, to the school functions which move beyond a one-dimensional kind of knowledge distribution? Perhaps there are new, not yet seen possibilities hidden in the multiple realities of the transition from the management of placement towards the management of expectations, even some which make research, policy and schooling accountable for not leaving their social conscience behind on their march into the emerging age of accountability.
their social conscience behind on their march in<strong>to</strong> the emerging age of accountability.


References


Abbott, A.: The System of Professions. Chicago (University of Chicago Press) 1988.
Achieve: Aiming higher: 1998 annual report. Cambridge (Achieve, Inc.) 1998.
Ahearn, E.M.: Educational Accountability: A Synthesis of Literature and Review of a Balanced Model of Accountability. Washington D.C. (Department of Education) 2000.
Aikin, W.M. et al.: The Eight Year Study. Vol. I–V. New York/London (Harper & Brothers) 1942.
Akerstrøm Andersen, N.: Borgerens kontraktliggørelse. Kopenhagen (Reitzel) 2003.
Aktionsrat Bildung: Bildungsgerechtigkeit. Jahresgutachten 2007. Wiesbaden (VS) 2007.
Allerup, P.: PISA præstationer – målinger med skæve målestokke? In: Dansk Pædagogisk Tidsskrift 2005-1, 68-81.
Allerup, P.: PISA 2000’s læseskala – vurdering af psykometriske egenskaber for elever med dansk og ikke-dansk sproglig baggrund. Odense (Rockwool Fondens Forskningsenhed og Syddansk Universitetsforlag) 2006.
Allerup, P.: Identification of Group Differences Using PISA Scales – Considering Effects of Inhomogeneous Items. In this volume.
Ålvik, T. (ed.): Skolebasert vurdering – en artikkelsamling. Oslo (Ad notam) 1994.
Amrein, A.L. & Berliner, D.C.: High-stakes testing, uncertainty, and student learning. Education Policy Analysis Archives 10-2002-18. Online: http://epaa.asu.edu/epaa/v10n18 (10.03.2007).
Apple, M.: Ideological Success, Educational Failure? On the Politics of No Child Left Behind. In: Journal of Teacher Education 58-2007-2, 108-116.
Bachmann, K., Sivesind, K. & Hopmann, S.T.: Hvordan formidles læreplanen. Kristiansand (Høyskoleforlag) 2004.
Bachmann, K. & Sivesind, K.: Regn med meg! Evaluering og ansvarligjøring i skolen. In: Langfeldt, G., Elstad, E. & Hopmann, S.T. (eds.): Ansvarlighet i skolen. Oslo (Cappelen) 2007 (forthcoming).
Bartlett, W., Roberts, J.A. & Le Grand, J.: A revolution in social policy: Quasi-market reforms in the 1990s. Bristol (Policy Press) 1998.
Baumert, J., Stanat, P. & Watermann, R. (Hrsg.): Herkunftsbedingte Disparitäten im Bildungswesen. Vertiefende Analysen im Rahmen von PISA 2000. Wiesbaden (VS) 2004.



Beck, U.: Weltrisikogesellschaft. Frankfurt (Suhrkamp) 2007.
Beck, U., Giddens, A. & Lash, S.: Reflexive Modernisierung. Frankfurt (Suhrkamp) 1996.
Benner, D.: Die Struktur der Allgemeinbildung im Kerncurriculum moderner Bildungssysteme. Ein Vorschlag zur bildungstheoretischen Rahmung von PISA. In: Zeitschrift für Pädagogik 48-2002-1, 68-90.
Benoit, W.L.: Accounts, Excuses and Apologies: A Theory of Image Restoration. Albany (SUNY) 1995.
Berliner, D.C.: Our impoverished view of educational reform. In: Teachers College Record 2005. Online: http://www.tcrecord.org, ID no. 12106 (2007/07/07).
Birkeland, N.: Ansvarlig, jeg? Accountability på norsk. In: Langfeldt, G., Elstad, E. & Hopmann, S.T. (eds.): Ansvarlighet i skolen. Oslo (Cappelen) 2007 (forthcoming).
Bishop, J.H. & Woessmann, L.: Institutional Effects in a Simple Model of Educational Production. In: Education Economics 12-2004-1, 17-38.
Bloom, B.S. (ed.): Taxonomy of Educational Objectives, the classification of educational goals – Handbook I: Cognitive Domain. New York (McKay) 1956.
Bodin, A.: What does PISA really assess? What it doesn’t? A French view. Report prepared for the joint Finnish-French conference “Teaching mathematics: Beyond the PISA survey”, Paris 2005.
Bodin, A.: What does PISA really assess? What it doesn’t? A French view. In this volume.
Bogason, P. (ed.): New Modes of Local Political Organization: Local Government Fragmentation in Scandinavia. Commack (Nova Sciences) 1996.
Bracey, G.W.: Research: Put Out Over PISA. In: Phi Delta Kappan 86-2005-10, 797.
Braun, H.: Reconsidering the impact of high-stakes testing. Education Policy Analysis Archives 12-2004-1. Online: http://epaa.asu.edu/epaa/v12n1/ (2006/01/20).
Buschor, E. & Schedler, K. (eds.): Perspectives on Performance Measurement and Public Sector Accounting. Bern (Haupt) 1994.
Cannell, J.J.: Nationally Normed Elementary Achievement Testing in America’s Public Schools: How All 50 States are Above National Average. Daniels (Friends of Education) 1987.



Caswell, H.L.: City School Surveys: An Interpretation and Analysis. New York (Teachers College) 1929.
Chubb, J.E. (ed.): Within Our Reach: How America Can Educate Every Child. Lanham (Rowman & Littlefield) 2005.
Coleman, J.S. et al.: Equality of Educational Opportunity. Washington (U.S. Department of Health, Education and Welfare) 1966.
Conant, J.B.: The American High School Today; A First Report to Interested Citizens. New York (McGraw Hill) 1959.
Cook, T.D.: Lessons Learned in Evaluation Over the Past 25 Years. In: Chelimsky, E. & Shadish, W.R. (eds.): Evaluation for the 21st Century. Thousand Oaks, London, New Delhi (Sage Publications) 1997, 30-52.
Darling-Hammond, L.: Standards and Assessment: Where We Are and What We Need. Teachers College Record 16-2003-2.
Deretchin, L.F. & Craig, C.J. (eds.): International Research on the Impact of Accountability Systems (Teacher Education Yearbook XV). Lanham (Rowman & Littlefield) 2007.
Dewey, J.: Democracy and Education (1916). Online: http://www.ilt.columbia.edu/publications/dewey.html (2007/01/07).
Dohn, N.B.: Knowledge and Skills for PISA – Assessing the Assessment. In: Journal of Philosophy of Education 41-2007-1, 1-16.
Dolin, J.: PISA – an Example of the Use and Misuse of Large-scale Comparative Tests. In this volume.
Dorn, S.: The Political Legacy of School Accountability Systems. In: Education Policy Analysis Archives 6-1998-1. Online: http://epaa.asu.edu/epaa/v6n1/ (2007/03/02).
Dubnick, M.J.: Accountability and the Promise of Performance. Paper presented at the 2003 Annual Meeting of the American Political Science Association, Philadelphia.
Dubnick, M.J. & Justice, J.B.: Accounting for Accountability. Paper presented at the Annual Meeting of the American Political Science Association 2004. Online: http://pubpages.unh.edu/dubnick/papers/2004/dubjusacctg2004.pdf (2007/07/07).
Dubnick, M.J.: Orders of Accountability. Paper presented at the World Ethics Forum in Oxford 2006. Online: http://pubpages.unh.edu/dubnick/papers/2006/oxford2006.pdf (2007/07/07).
Eberts, R., Hollenbeck, K. & Stone, J.: Teacher Performance Incentives and Student Outcomes. The Journal of Human Resources 37-2002-4, 913-927.
Education Commission of the States (ECS): No Child Left Behind Issue Brief: Data-Driven Decisionmaking. 2002. Online: http://www.nsba.org/site/docs/9200/9153.pdf (2007/07/07).
Elmore, R.F.: School Reform From the Inside Out. Cambridge (Harvard University Press) 2006.
Elstad, E.: Hvordan forholder skoler seg til ansvarliggjøring av skolens bidrag til elevenes læringsresultater? In: Langfeldt, G., Elstad, E. & Hopmann, S.T. (eds.): Ansvarlighet i skolen. Oslo (Cappelen) 2007 (forthcoming).
Elstad, E. & Langfeldt, G.: Hvordan forholder skoler seg til målinger av kvalitetsaspekter ved lærernes undervisning og elevenes læringsprosesser? In: Langfeldt, G., Elstad, E. & Hopmann, S.T. (eds.): Ansvarlighet i skolen. Oslo (Cappelen) 2007 (forthcoming).
Engeland, Ø.: Skolen i kommunalt eie – politisk styrt eller profesjonell ledet skoleutvikling? Avhandling til dr. polit.-graden. Oslo (Det utdanningsvitenskapelige fakultet, Universitetet i Oslo) 2000.
Engeland, Ø., Langfeldt, G. & Roald, K.: Kommunalt handlingsrom – hvordan møter norske kommuner ansvarsstyring i skolen? In: Langfeldt, G., Elstad, E. & Hopmann, S.T. (eds.): Ansvarlighet i skolen. Oslo (Cappelen) 2007 (forthcoming).
Esping-Andersen, G. (ed.): Welfare States in Transition. National Adaptations to Global Economies. London (Sage) 1996.
Evers, A., Rauch, U. & Stitz, U. (eds.): Von öffentlichen Einrichtungen zu sozialen Unternehmen. Hybride Organisationsformen im Bereich sozialer Dienstleistungen. Berlin (Edition Sigma) 2002.
Fertig, M.: What Can We Learn From International Student Performance Studies? Some Methodological Remarks. RWI Discussion Papers No. 23. Essen (RWI) 2004.
Foucault, M.: Geschichte der Gouvernementalität 1: Sicherheit, Territorium, Bevölkerung. Vorlesung am Collège de France 1977/1978. Frankfurt (Suhrkamp) 2006.
Fuchs, H.-W.: Auf dem Wege zu einem neuen Weltcurriculum? Zum Grundbildungskonzept von PISA und der Aufgabenzuweisung an die Schule. In: Zeitschrift für Pädagogik 49-2003-2, 161-179.
Fuchs, T. & Woessmann, L.: What Accounts for International Differences in Student Performance? A Re-Examination Using PISA Data. Bonn (IZA) 2004.

Fuhrmann, S. (ed.): From the Capi<strong>to</strong>l <strong>to</strong> the Classroom: Standards-Based Reform<br />

in the States. Vol. I & II. Chicago (National Society for the Study of<br />

Education Yearbooks) 2001.<br />

Fukuyama, F.: The End of His<strong>to</strong>ry and the Last Man. New York (Free Press)<br />

1992.<br />

Goodlad, J.I.: A Place Called School. New York (McGraw-Hill) 1983.<br />

Gorard, S.: Value-Addedis of Little Value. In: Journal of Education Policy 21-<br />

2006-2, 235-243.<br />

Gottweis, H. et al.: Verwaltete Körper. Strategien der Gesundheitspolitik im<br />

internationalen Vergleich. Wien (Böhlau) 2005.<br />

Granheim, M.; Kogan, M. & Lundgren, U.P. (eds.): Evaluation as Policymaking:<br />

Introducing Evaluation In<strong>to</strong> a National Decentralized Educational<br />

System. London (Jessica Kingsley Publishers) 1990.<br />

Grisay, A. & Monseur, C.: Measuring the Equivalence of Item Difficulty in<br />

the Various Versions of an International Test. In: Studies in Educational<br />

Evaluation 33-2007-1, 69-86.<br />

Gundem, B. B. & Hopmann, S. (eds.): Didaktik and/or Curriculum: An International<br />

Dialogue. New York, Bern etc. (Lang) 2002 2 .<br />

Gundem, B.B.: Læreplanadministrering: fremvækst og utvikling i et<br />

sentraliserings-desentraliseringsperspektiv. Oslo (UiO/PFI) 1992.<br />

Gundem, B.B.: Mot en ny skolevirkelighet? Læreplanen i et sentraliseringsdesentraliseringsperspektiv.<br />

Oslo (Ad Notam) 1993.<br />

Gundem, B.B.: Læreplanhis<strong>to</strong>rie <strong>–</strong> his<strong>to</strong>rien om skolens innhold <strong>–</strong> som forskningsfelt:<br />

en innføring og noen eksempler. Oslo (UiO/PFI) 1997.<br />

Haft, H./Hopmann, S.T. (eds.): Case Studies in Curriculum Administration<br />

His<strong>to</strong>ry. London/New York 1990.<br />

Haney, W. (2000): The Myth of the Texas Miracle in Education. Eductional<br />

Policy Analysis Archives 8-2000-41. Online: http://epaa.asu.edu/epaa/<br />

v8n41/ (2007/07/07).<br />

Haney, W.: Lake Wobegon Guaranteed. Educational Policy Analysis Archives<br />

10-2002-24. Online: http://epaa.asu.edu/epaa/v10n24/ (2007/07/07)<br />

Hanushek, E.A. & Raymond, M.E. (2003): Lessons about the Design of State<br />

Accountability Systems. In: No Child Left Behind. Petersen, P. & West,<br />

M.R. Hrsg. (Brookings) Washing<strong>to</strong>n, 127-151<br />

Hanushek, E.A. & Raymond, M.E.: Does School Accountability Lead <strong>to</strong> Im-


406 STEFAN T. HOPMANN<br />

proved Student Performance? In: Journal of Policy Analysis and Management<br />

24-2005-2, 297-327.<br />

Hanushek, E.A.: The Failure of Input-based Schooling Policies. Working Paper<br />

9040. Cambridge, MA (National Bureau of Economic Research).<br />

2002.<br />

Hanushek, E.A.: Alternative School Policies ad the Benefits of General Cognitive<br />

Skills. In: Economics of Education Review 25-2006-4, 447-462.<br />

Hanushek, E.A.: The Long Run Importance of School Quality. NBER Working<br />

Paper No. 9071. Cambridge, MA (NBER) 2002.<br />

Hargreaves, A.: Teaching in the Knowledge Society: Education in the Age of<br />

Insecurity. New York, NY (Teachers College Press) 2003.<br />

Haug, P. & Monsen, L. (eds.): Skolebasert vurdering: erfaringer og utfordringer.<br />

Oslo (abstract) 2002.<br />

Herman, J.L. & Haertel, E.H. Hrsg.: Uses and Misuses of Data for Educational<br />

Accountability and Improvement (The 104 th Yearbook of NSSE Part 2).<br />

Malden (Blackwell) 2005.<br />

Hood, C.: A Public Management for all Seasons. Public Administration 69-<br />

1991-1, 3-20.<br />

Hood, C.: Contemporary Public Management: A New Global Paradigm? Public<br />

Policy and Administration. 10-1995-2, 104-117.<br />

Hood, C.: Institutions, Blame Avoidance and Negativity Bias: Where Public<br />

Management Reform Meets the Blame Cuulture. Paper presented at the<br />

CMPO Conference on Public Organisation and the New Public Management.<br />

Bris<strong>to</strong>l 2004.<br />

Hood, C., Rothstein, H. & Baldwin, R.: The Government of Risk: Understanding<br />

Risk Regulation Regimes. Oxford (University Press) 2004.<br />

Hopmann, S.T.: Lehrplanarbeit als Verwaltungshandeln (Curriculum making<br />

as administration). Kiel (IPN) 1988.<br />

Hopmann, S.T.: Lehrplan des Abendlandes <strong>–</strong> am Ende seiner Geschichte?<br />

Geschichte der Lehrplanarbeit und des Lehrplans seit 1900. (The curriculum<br />

of the occident <strong>–</strong> at the end of its his<strong>to</strong>ry? Curriculum development<br />

and the curriculum since 1800). In: Keck, Rudolf et al. (eds.): Lehrplan<br />

des Abendlandes <strong>–</strong> revisited. (Hohengrefe) Braunschweig 2000a.<br />

Hopmann, S.T.: Die Schule von morgen <strong>–</strong> Entwicklungsperspektiven für<br />

einen nachhaltigen Unterricht. In: Die Schweizer Schule 2000-3, 13-19.<br />

(2000b)<br />

Hopmann, S.T.: Von der gutbürgerlichen Küche zu McDonald’s: Beabsichtigte


EPILOGUE: NO CHILD, NO SCHOOL, NO STATE LEFT BEHIND 407<br />

und unbeabsichtigte Folgen der Internationalisierung der Erwartungen<br />

an Schule und Unterricht. In: Keiner, E. (Hrsg.): Evaluation in den<br />

Erziehungswissenschaften. Weinheim (Beltz) 2001, 207-224.<br />

Hopmann, S.T.: On the Evaluation of Curriculum Reforms. In: Journal of Curriculum<br />

Studies 2003-4, 459-478.<br />

Hopmann S.T.: Im Durchschnitt Pisa oder: Alles bleibt schlechter. In Criblez,<br />

L. et al. (eds) Lehrpläne und Bildungsstandards. Bern (hep) 2006, 149-<br />

172<br />

Hopmann, S.T.: Keine Ausnahme für Hotten<strong>to</strong>tten. Methoden der vergleichenden<br />

Bildungswissenschaften für die heilpädagogische Forschung.<br />

In: Biewer, G. & Schwinge, M. (Hrsg.): Internationale Sonderpädagogik.<br />

Bad Heilbrunn: Klinkhardt (in print).<br />

Hörmann, B. (2007): Die Unsichtbaren in <strong>PISA</strong>, TIMSS & Co. Diplomarbeit.<br />

Wien: Institut für Bildungswissenschaft der Universität Wien<br />

Hörmann, B.: Disappearing Students. <strong>PISA</strong> and students with disabilities. In<br />

this volume.<br />

Hovdenak, S.S.: 90-tallsreformene – et instrumentalistisk mistak? Oslo (Gyldendal Akademisk) 2000.
Huisken, F.: Der “PISA-Schock” und seine Bewältigung – Wieviel Dummheit braucht/verträgt die Republik? Hamburg (VSA-Verlag) 2005.
Ingersoll, R.: Who Controls Teachers’ Work? Power and Accountability in America’s Schools. Cambridge (Harvard University Press) 2006.
Irjala, A. & Eikås, M.: State Culture and Decentralization: a Comparative Study of Decentralization Processes in Nordic Cultural Politics. Sogndal & Helsinki (Western Norway Research Institute/Arts Council of Finland) 1996.
Irons, J.E. & Harris, S.: The Challenges of No Child Left Behind. Blue Ridge Summit (Rowman & Littlefield) 2006.
Isaksen, L.: Skoler i gapestokken. In: Langfeldt, G., Elstad, E. & Hopmann, S.T. (eds.): Ansvarlighet i skolen. Oslo (Cappelen) 2007 (forthcoming).
Jahnke, T.: Deutsche Pisa-Folgen. In this volume.
Jahnke, T. & Meyerhöfer, W. (eds.): PISA & Co – Kritik eines Programms. Hildesheim (Franzbecker) 2006.
Karlsen, G.E.: Desentralisering – løsning eller oppløsning: søkelys på norsk skoleutvikling og utdanningspolitikk. Oslo (Ad notam) 1993.
Karlsen, G.E.: EU, EØS og utdanning. Oslo (Tano) 1994.
Kirke-, utdannings- og forskningsdepartementet (KUF): Underveis: Håndbok i skolebasert vurdering: grunnskole og videregående skole. Oslo (KUF) 1994.
Kivirauma, J., Klemelä, K. & Rinne, R.: Segregation, Integration, Inclusion – The Ideology and Reality in Finland. In: European Journal of Special Needs Education 21-2006-2, 117-133.
Klatt, B., Murphy, S. & Irvine, D.: Accountability: Getting a Grip on Results. Calgary (Bow River) 2003².
Klieme, E. et al.: Expertise zur Entwicklung nationaler Bildungsstandards. Berlin (BMBF) 2003. Online: http://www.bmbf.de/pub/zur_entwicklung_nationaler_bildungsstandards.pdf (2007/07/07).
Koretz, D.: Limitations in the Use of Achievement Tests as Measures of Educators’ Productivity. In: The Journal of Human Resources 37-2002-4, 752-777.
Koritzinsky, T.: Pedagogikk og politikk i L97: Læreplanens innhold og beslutningsprosessene. Oslo (Universitetsforlaget) 2000.
Korsgaard, O.: Kampen om folket: et dannelsesperspektiv på dansk historie gennem 500 år. Copenhagen (Gyldendal) 2004.
KUF: Rapport om nasjonalt vurderingssystem (Moe-utvalget). Forslag fra utvalg oppnevnt av Kirke-, Utdannings- og Forskningsdepartementet. Oslo (KUF) 1997.
Künzli, R. & Hopmann, S.T. (eds.): Lehrpläne: Wie sie entwickelt werden und was von ihnen erwartet wird. Forschungsstand, Zugänge und Ergebnisse aus der Schweiz und der Bundesrepublik Deutschland. Zürich (Ruegger) 1998.
Kvale, G.: “Det er ditt val!” – om fritt skuleval i to norske kommunar. In: Langfeldt, G., Elstad, E. & Hopmann, S.T. (eds.): Ansvarlighet i skolen. Oslo (Cappelen) 2007 (forthcoming).
Ladd, H.F. & Walsh, R.P.: Implementing Value-added Measures of School Effectiveness. In: Economics of Education Review 21-2002-1, 1-17.
Ladenthin, V.: Bildung als Aufgabe der Gesellschaft. In: Studia Comeniana et Historica 34-2004-71/72, 305-319.
Laffont, J.-J. (ed.): The Principal Agent Model: The Economic Theory of Incentives. Cheltenham (Edward Elgar Publishing) 2003.
Lamar, A. & Thomas, J.A.: The Nation’s Report Card: Improving the Assessment of Student Achievement. Stanford, CA (National Academy of Education) 1987.
Lange, S. & Schimank, U. (eds.): Governance und gesellschaftliche Integration. Opladen (VS) 2004.
Langfeldt, G.: Resultatstyring som verktøy og ideologi. Statlige styringsstrategier i utdanningssektoren. In: Langfeldt, G., Elstad, E. & Hopmann, S.T. (eds.): Ansvarlighet i skolen. Oslo (Cappelen) 2007 (forthcoming).
Langfeldt, G.: PISA – Undressing the Truth or Dressing Up a Will to Govern. In this volume.
Langfeldt, G., Elstad, E. & Hopmann, S.T. (eds.): Ansvarlighet i skolen. Oslo (Cappelen) 2007 (forthcoming).
Leibfried, S. & Zürn, M.: Transformation des Staates. Frankfurt (Suhrkamp) 2006.
Lie, S. et al.: Nasjonale prøver på ny prøve. Rapport fra en utvalgsundersøkelse for å analysere og vurdere kvaliteten på oppgaver og resultater til nasjonale prøver våren 2005. Oslo (UiO, ILS) 2005.
Lindblad, S., Johannesson, I. & Simola, H.: Education governance in transition. In: Scandinavian Journal of Educational Research 2003-2.

Linn, R.L. & Haug, C.: Stability of School Building Accountability Scores and Gains. In: Educational Evaluation and Policy Analysis 24-2002-1, 29-36.
Linn, R.L.: Assessments and Accountability. In: Educational Researcher 29-2000-2, 4-16.
Linn, R.L., Baker, E.L. & Betebenner, D.W.: Accountability Systems: Implications of Requirements of the No Child Left Behind Act of 2001. In: Educational Researcher 31-2002-6, 3-16.
Linn, R.L.: Issues in the Design of Accountability Systems. In: Herman, J.L. & Haertel, E.H. (eds.): Uses and Misuses of Data for Educational Accountability and Improvement (The 104th Yearbook of the NSSE, Part 2). Malden (Blackwell) 2005, 78-98.
Lohmann, I.: After Neoliberalism. Können nationalstaatliche Bildungssysteme den ‚freien Markt‘ überleben? 2001. Online: http://www.erzwiss.uni-hamburg.de/Personal/Lohmann/AfterNeo.htm (2007/07/07).
Lohmann, I.: Was bedeutet eigentlich “Humankapital”? GEW Bezirksverband Lüneburg und Universität Lüneburg: Der brauchbare Mensch. Bildung statt Nützlichkeitswahn. Bildungstage 2007. Online: http://www.erzwiss.uni-hamburg.de/Personal/Lohmann/Publik/Humankapital.pdf (2007/07/07).
Loveless, T.: The Peculiar Politics of No Child Left Behind. Washington (Brookings) 2006.
Luhmann, N.: Die Gesellschaft der Gesellschaft. Frankfurt (Suhrkamp) 1998.
Marsh, J., Pane, J. & Hamilton, L.: Making Sense of Data-Driven Decision Making in Education: Evidence from Recent RAND Research. Santa Monica (RAND) 2006.
Martineau, J.A.: Distorting Value Added: The Use of Longitudinal, Vertically Scaled Student Achievement Data for Growth-Based, Value-Added Accountability. In: Journal of Educational and Behavioral Statistics 31-2006-1, 35-62.
McNeil, L.: Contradictions of School Reform: Educational Costs of Standardized Testing. New York (Routledge) 2000.
Mediås, O.A. & Telhaug, A.O.: Fra sentral til desentralisert styring: statlig og regional styring av utdanningen i Skandinavia fram mot år 2000. Steinkjer (Projekt: Utdanning som nasjonsbygging) 2000.
Mehrens, W.A.: Consequences of Assessment: What is the Evidence? In: Education Policy Analysis Archives 6-1998-13. Online: http://epaa.asu.edu/epaa/v6n13.html (2007/07/07).
Mejding, J. & Roe, A. (eds.): Northern Lights on PISA 2003 – A Reflection from the Nordic Countries. Copenhagen (Nordic Council) 2006.
Melton, J. Van Horn: Absolutism and the Eighteenth-Century Origins of Compulsory Schooling in Prussia and Austria. Cambridge (University Press) 1988.
Meyer, J.W.: Weltkultur. Wie die westlichen Prinzipien die Welt durchdringen. Frankfurt (Suhrkamp) 2005.
Meyerhöfer, W.: Tests im Test. Das Beispiel PISA. Opladen (Budrich) 2005.
Meyerhöfer, W.: Testfähigkeit – Was ist das? In this volume.
Micklewright, J. & Schnepf, S.S.: Educational Achievement in the English Speaking Countries: Do Different Surveys Tell the Same Story? Bonn (IZA) 2004.
Micklewright, J. & Schnepf, S.S.: Inequality of Learning in Industrialised Countries. Bonn (IZA) 2006.
Mintrop, H.: The Limit of Sanctions in Low-Performing Schools: A Study of Maryland and Kentucky Schools on Probation. In: Education Policy Analysis Archives 11-2003-3. Online: http://epaa.asu.edu/epaa/v11n3.html (2007/07/07).
MMI (Markeds- og mediainstituttet AS): Evaluering av gjennomføring av de nasjonale prøvene i 2005. Online: http://www.utdanningsdirektoratet.no/eway/library/forms/showmessage.aspx?oid=338 (2007/07/07).
Møller, J.: Coping with Accountability – A Tension between Reason and Emotion. In: Passionate Principalship: Learning from Life Histories of School Leaders. London (Falmer) 2003.
Muir Gray, J.A.: Evidence Based Health Care. Oxford (Elsevier) 2001².
National Commission on Excellence in Education: A Nation at Risk: The Imperative for Educational Reform. Washington, DC (U.S. Government Printing Office) 1983.
Nesje, K. & Hopmann, S.T. (eds.): En lærende skole: L97 i Skolepraksis. Oslo (Cappelen) 2002.
Neuwirth, E., Ponocny, I. & Grossmann, W. (eds.): PISA 2000 und PISA 2003. Graz (Leykam) 2006.
Neuwirth, E.: PISA 2000. Sample Weight Problems in Austria. OECD Education Working Papers No. 5. Paris (OECD) 2006.
NOU 1988:22: Med viten og vilje. White paper commissioned by the Norwegian Government. Oslo 1988.
NOU 2002:10: Førsteklasses fra første klasse. White paper commissioned by the Norwegian Government. Oslo 2002.
NOU 2003:16: I første rekke. White paper commissioned by the Norwegian Government. Oslo 2003.
OECD: Reviews of National Policies for Education: Norway. Paris (OECD) 1987.
OECD: Public Management Developments. Paris (OECD) 1995.
OECD: Knowledge and Skills for Life. First Results from PISA 2000. Paris (OECD) 2001.
OECD: The PISA 2003 Assessment Framework. Paris (OECD) 2003.
OECD: Education at a Glance. Paris (OECD) 2005. Online: http://www.oecd.org/document/34/0,2340,en_2649_34515_35289570_1_1_1_1,00.html (2007/07/07).
Olsen, R.V.: Achievement Tests From an Item Perspective. An Exploration of Single Item Data from the PISA and TIMSS studies. Thesis (University of Oslo) Oslo 2005. Online: http://www.duo.uio.no/publ/realfag/2005/35342/Rolf_Olsen.pdf (2007/07/07).
Olsen, R.V.: Large-scale international comparative achievement studies in education: Their primary purposes and beyond. In this volume.
Peterson, P.E. & West, M.R. (eds.): No Child Left Behind. The Politics and Practice of School Accountability. Washington (Brookings) 2003.
Picht, G.: Die deutsche Bildungskatastrophe. Olten (Walter Verlag) 1964.
PISA 2006: PISA – The OECD Programme for International Student Assessment. Leaflet produced by the OECD in 2006. Online: http://www.pisa.oecd.org/dataoecd/51/27/37474503.pdf (2007/07/07).
Pollitt, C. & Bouckaert, G.: Public Management Reform. A Comparative Analysis. Oxford (University Press) 2004².
Power, M.: The Audit Society: Rituals of Verification. Oxford (University Press) 1997.
Prahl, A. & Olsen, C.B.: Lokalsamfundet som samarbejdspartner: sammenhænge mellem decentralisering og lokalsamfundsudvikling i de nordiske lande. Copenhagen (Nordisk Ministerråd) 1997.
Prais, S.J.: Cautions on OECD’s Recent Educational Survey (PISA). In: Oxford Review of Education 29-2003-2, 139-163.
Prais, S.J.: Cautions on OECD’s Recent Educational Survey (PISA): Rejoinder to OECD’s Response. In: Oxford Review of Education 30-2004-4, 377-389.
Prais, S.J.: England: Poor Survey Response and No Sampling. In this volume.
Rasmussen, J.: Undervisning i det refleksivt moderne. Copenhagen (Reitzel) 2006.
Rauin, U.: Die Pädagogik im Bann empirischer Mythen – Wie aus empirischen Vermutungen scheinbare pädagogische Gewissheit wird. In: Pädagogische Korrespondenz 2004-32, 39-49.
Rickover, H.G.: American Education – a National Failure: The Problem of Our Schools and What We Can Learn from England. New York (E. P. Dutton) 1963.
Riksrevisjonen: Riksrevisjonens undersøkelse av opplæringen i grunnskolen. Dokument nr. 3:10 (2005-2006). Oslo (Riksrevisjonen) 2006.
Rindermann, H.: Was messen internationale Schulleistungsstudien? Schulleistungen, Schülerfähigkeiten, kognitive Fähigkeiten, Wissen oder allgemeine Intelligenz? In: Psychologische Rundschau 57-2006-1, 69-86.
Robin, S.R. & Sprietsma, M.: Characteristics of Teaching Institutions and Students’ Performance: New Empirical Evidence from OECD Data. Lille (CRESGE) 2003.
Saunders, L.: A Brief History of Educational ‘Value-Added’: How Did We Get to Where We Are? In: School Effectiveness and School Improvement 10-1999-2, 233-256.
Scharpf, F.W. & Schmidt, V. (eds.): Welfare and Work in the Open Economy. Vol. 1 & 2. Oxford (University Press) 2000.
Schedler, K. & Proeller, I.: New Public Management. Bern (Haupt/UTB) 2006³.
Sedikides, C. et al.: Accountability as a Deterrent to Self-Enhancement: The Search for Mechanisms. In: Journal of Personality and Social Psychology 83-2002-3, 592-605.
Shavit, Y. & Blossfeld, H.-P. (eds.): Persistent Inequality. Boulder (Westview Press) 1993.
Simola, H.: The Finnish miracle of PISA: historical and sociological remarks on teaching and teacher education. In: Comparative Education 41-2005-4, 455-470.

Sivesind, K.: Reformulating Reforms. Oslo (UiO, ILS), forthcoming.
Sivesind, K.: Task and Themes in the Communication about the Curriculum. The Norwegian Compulsory School Reform in Perspective. In: Rosenmund, M. et al. (eds.): Comparing Curriculum Making Processes. Bern (Lang) 2002.
Sivesind, K., Bachmann, K. & Afsar, A.: Nordiske læreplaner. Oslo (Læringssenteret) 2003.
Sivesind, K., Langfeldt, G. & Skedsmo, G.: Utdanningsledelse. Oslo (Cappelen Akademisk) 2006.
Slagstad, R.: De nasjonale strateger. Oslo (Pax forlag) 1996.
Slavin, R.E.: Educational Research in an Age of Accountability. Boston (Pearson) 2006.
Stack, M.: Testing, Testing, Read All About It: Canadian Press Coverage of the PISA Results. In: Canadian Journal of Education 29-2006-1, 49-69.
STM Stortingsmelding 37 (1990-1991): Om organisering og styring i utdanningssektoren. Report to the Norwegian Parliament.
STM Stortingsmelding 29 (1994-1995): Om prinsipper og retningslinjer for tiårig grunnskole – ny læreplan. Report to the Norwegian Parliament.
STM Stortingsmelding 47 (1995-1996): Om elevvurdering, skolebasert vurdering og nasjonalt vurderingssystem. Report to the Norwegian Parliament.
STM Stortingsmelding 28 (1998-99): Mot rikare mål. Nasjonalt vurderingssystem for grunnskolen. Report to the Norwegian Parliament.
STM Stortingsmelding 17 (2002-2003): Om statlige tilsyn. Report to the Norwegian Parliament.
STM Stortingsmelding 30 (2003-2004): Kultur for læring. Report to the Norwegian Parliament. (A shortened English version at: http://www.regjeringen.no/en/dep/kd/Documents/Brochures-and-handbooks/2004/Report-no-30-to-the-Storting-2003-2004.html?id=419442.)
Sutherland, D. & Price, R.: Linkages Between Performance and Institutions in the Primary and Secondary Education Sector. OECD Economics Department Working Papers No. 558. Paris (OECD) 2007.
Swanson, C.B. & Stevenson, D.L.: Standards-Based Reform in Practice: Evidence on State Policy and Classroom Instruction from the NAEP State Assessments. In: Educational Evaluation and Policy Analysis 24-2002-1, 1-27.
Swiss Federal Statistical Office (BFS): PISA 2003 – Einflussfaktoren auf die kantonalen Ergebnisse. Neuchâtel (BFS) 2005.
Telhaug, A.O. & Mediås, O.A.: Grunnskolen som nasjonsbygger: fra statspietisme til nyliberalisme. Oslo (Abstrakt) 2003.
Telhaug, A.O.: Kunnskapsløftet – Ny eller Gammel skole? Oslo (Cappelen Akademisk) 2005.
Telhaug, A.O.: Skolen mellom stat og marked: norsk skoletenkning fra år til år 1990-2005. Oslo (Didakta) 2005.
TNS Gallup: Undersøkelse blant rektorer og lærere om gjennomføring av de nasjonale prøvene våren 2005. Rapport. Oslo 2005.
Turmo, A. & Lie, S.: Hva kjennetegner norske skoler som skårer høyt i PISA 2000? Oslo (UiO/ILS) 2004.
Tyler, R.: Basic Principles of Curriculum and Instruction. Chicago (University Press) 1949.
Uljens, M.: The Hidden Curriculum of PISA – The Promotion of Neo-liberal Policy by Educational Assessment. In this volume.
U.S. Department of Education: Building on Results: A Blueprint for Strengthening the No Child Left Behind Act. Washington, D.C. 2007.
Wallerstein, I.: World-Systems Analysis: An Introduction. Durham, NC (Duke University Press) 2004.
Watermann, R. et al.: Schulrückmeldungen im Rahmen von Schulleistungsuntersuchungen: Das Disseminationskonzept von PISA-2000. In: Zeitschrift für Pädagogik 49-2003-1, 92-111.
Watson, S. & Supovitz, J.: Autonomy and Accountability in the Context of Standards-based Reform. In: Education Policy Analysis Archives 9-2001-32. Online: http://epaa.asu.edu/epaa/v9n32.html (2007/07/07).
Weber, M.: Wirtschaft und Gesellschaft (1923). Online: http://www.textlog.de/weber_wirtschaft.html (2007/03/19).
Weigel, T.M.: Die PISA-Studie im bildungspolitischen Diskurs. Eine Untersuchung der Reaktionen auf PISA in Deutschland und im Vereinigten Königreich. Trier (Universität) 2004. Online: http://www.oecd.org/dataoecd/46/23/34805090.pdf (2007/07/07).
Werler, T.: Nation, Gemeinschaft, Bildung: die Evolution des modernen skandinavischen Wohlfahrtsstaates und das Schulsystem. Baltmannsweiler (Schneider Verlag) 2004.
Westbury, I.: Didaktik and Curriculum Studies. In: Gundem, B.B. & Hopmann, S.T. (eds.): Didaktik and/or Curriculum. New York (Lang) 2002², 47-78.
Westbury, I., Hopmann, S. & Riquarts, K. (eds.): Teaching as Reflective Practice: The German Didaktik Tradition. Mahwah, NJ (Lawrence Erlbaum Associates) 2000.
Whitford, B.L. & Jones, K.: Accountability, Assessment, and Teacher Commitment: Lessons from Kentucky’s Reform Efforts. Albany (SUNY) 2000.
Wuttke, J.: Uncertainties and Bias in PISA. In this volume.
Zimmer, R. et al.: State and Local Implementation of the “No Child Left Behind Act”. Washington (Department of Education) 2007.



Über die Autoren/About the Authors

Allerup, Peter Nimmo:
Peter Allerup graduated in Mathematical Statistics from the University of Copenhagen in 1970. His preferred fields of interest today are mathematical statistics, psychometrics and quantitative research methods in general. From 1994 to 2002 he was Senior Research Scientist at the Royal Danish Institute for Educational Research; since 2002 he has held a professorship at the Danish University of Education, later Aarhus University, School of Education. He has been involved in the majority of the empirical international studies conducted by the university, OECD’s PISA and IEA’s comparative investigations in mathematics and science and in civic education, TIMSS and CIVIC. He has specialized experience in the field of IRT models (Item Response Theory), viz. the Rasch models in particular, with emphasis on applications where psychometric scaling properties are essential. He has long-standing experience with data from multilevel specifications of the research framework.
Contact: nimmo@dpu.dk

Bodin, Antoine:
Graduated in pure mathematics and in mathematics education (didactics of mathematics). Successively, or at the same time, secondary mathematics teacher, teacher trainer, researcher in mathematics education, evaluation specialist, mathematics textbook author, and international consultant (World Bank and other national and international agencies).
Antoine Bodin was much involved in the IREM network (French Institutes of Research in Mathematics Education) and in the APMEP (French Mathematics Teachers’ Association), where he created the EVAPM observatory and led it for 20 years.
He was a member of the TIMSS Subject Matter Advisory Committee and, for a few months, of the PISA 2003 Math Expert Group. He was also a member of the mathematics curriculum expert group and of the test development unit in the French Ministry of Education.
He has published numerous papers in mathematics education; see his website: http://web.mac.com/antoinebodin/iWeb/Site_Antoine_Bodin/
Contact: antoinebodin@mac.com

Bozkurt, Dominik:
Dominik Bozkurt was born on 10 November 1975 in Wels. He took his Matura at the Realgymnasium am Henriettenplatz in Vienna. In 2001 he began studying education (school and social pedagogy) and Romance studies (French). In January 2007 he completed his studies with a diploma thesis on school quality and the Kooperative Mittelschule and a board examination. From 2005 to 2007 he worked as a student assistant in the research unit for school and educational research. Since 2007 he has been working as a social pedagogue in the prevention department of Aids Hilfe Wien.
Contact: bozkurt@aids.at


Brinek, Gertrude:
Taught at primary and lower secondary schools in Vienna (1973 to 1983); studied art history, education and psychology at the University of Vienna (degree: Dr. phil. in education); student assistant and later university assistant at the Institute of Educational Sciences of the University of Vienna.
Research, teaching and publications in the areas of school climate/school anxiety, museum education, and educational/school theory and research, among others.
Since 2003 assistant professor at the Department of Education, Faculty of Philosophy and Education, University of Vienna (on partial leave to hold a seat in the federal legislature).
Contact: gertrude.brinek@univie.ac.at

Dolin, Jens:
Head of the Department of Science Education at the University of Copenhagen. He has done research in the teaching and learning of science (with a focus on dialogical processes, forms of representation and the development of competencies), general pedagogical issues (Bildung, competencies, assessment and evaluation) and organizational change (reform processes, curriculum development, teacher conceptions). He has been engaged in the development and implementation of the new 2005 science curriculum for the Danish upper secondary school.
He was a member of the PISA Science Forum 2006, which formulated the Science Literacy Framework for the PISA 2006 science test, and he is currently leading a research project on the validation of PISA science in a Danish context.
Contact: dolin@ind.ku.dk

Hopmann, Stefan Thomas:
University of Vienna. Previously worked in Kiel, Potsdam, Oslo, Trondheim and Kristiansand, among other places. Research areas: historical and comparative research on schooling and education, especially with regard to Didaktik, school development, school administration and teacher education.
Contact: stefan.hopmann@univie.ac.at

Hörmann, Bernadette:
Born in 1983, studied educational science at the University of Vienna. She currently works as an assistant at the Department of Educational Science at the University of Vienna, where she is writing her dissertation. Her core themes are school accountability and school structures.
Contact: bernadette.hoermann@univie.ac.at

Jahnke, Thomas:
(b. 1949), Diplom in mathematics 1974 (University of Marburg), doctorate in mathematics 1979 (University of Freiburg), Habilitation in mathematics education 1988 (University of Siegen). Since 1994 chair of mathematics education (University of Potsdam). Numerous scholarly publications; editor and author of mathematics textbooks for the Gymnasium. Fields of work: subject-matter didactics; critique of didactic ideologies; curriculum development for teacher education programmes; philosophy, history and culture of mathematics.
Contact: jahnke@math.uni-potsdam.de

Langfeldt, Gjert:
Dr. Gjert Langfeldt is a tenured associate professor at the University of Agder, Norway. His main research areas are issues linked to efficiency and equity, and the question of how didactics can be transformed into an empirical discipline. He has pursued these interests through an empirical approach and an engagement with methodological issues.
Langfeldt is currently involved in two research projects funded by national authorities: together with Stefan Hopmann he is engaged in a project charting how schools and teachers can come to grips with the new logic of accountability in education, and he is also involved in evaluating the National System of Quality Assurance in Education.
Contact: gjert.langfeldt@uia.no

Meyerhöfer, Wolfram:
1990-1995: studied teaching of mathematics and physics at the University of Potsdam
1996-1998: teaching traineeship at the Studienseminar Potsdam
1998-2007: University of Potsdam, mathematics education
Doctorate May 2004: Was testen Tests? Objektiv-hermeneutische Analysen am Beispiel von TIMSS und PISA.
Since 2007: visiting professor at FU Berlin
Contact: meyerhof@math.uni-potsdam.de

Olechowski, Richard:
Born 7 May 1936 in Vienna; from autumn 1955 studied psychology at the University of Vienna; 1962: Dr. phil.; worked as a psychologist in the justice system (resocialization); from 1966: assistant at the University of Vienna; 1970: Habilitation in education (with special regard to educational psychology); 1972: full professor of education at the University of Salzburg; 1977: full professor of education at the University of Vienna (with special regard to school pedagogy and general didactics); from 1986 additionally scientific and administrative director of the Ludwig Boltzmann Institute for School Development and International Comparative School Research. Since 1988 member of the editorial board of the journal „Erziehung und Unterricht“. 2004: Dr. h.c. and Prof. h.c. of Eötvös Loránd University in Budapest; autumn 2004: retirement as professor emeritus. Publications: Das alternde Gedächtnis (1969), Das Sprachlabor (1970, 1973²; translated into Japanese); more than 100 contributions to Austrian and international journals, encyclopedias and handbooks; editor of the series „Schule-Wissenschaft-Politik“, „Erziehungswissenschaftliche Forschung – Pädagogische Praxis“ and „Schulpädagogik und Pädagogische Psychologie“. Speciality: quantitative empirical educational research (especially on problems of school organization); from 1992 to 1997 directed a large empirical project, the longitudinal evaluation of the school model „Kooperative Mittelschule“.
Contact: richard.olechowski@univie.ac.at


Olsen, Rolf V.:
Rolf V. Olsen holds a postdoc position at the Department of Teacher Education and School Development at the University of Oslo, where he also received his PhD in 2005 with a thesis on secondary analysis of the science data in PISA and TIMSS. His current research activities are extensions of the work presented in his thesis. He is a member of a research group which, among other activities, is responsible for the Norwegian activities in a range of similar studies (Unit for Quantitative Analysis in Education). Besides the analytical work presented in his publications, he has extensive practical experience with item development in both international and national studies, and he previously worked for several years as a teacher of science, physics and mathematics in upper secondary education.
Contact: r.v.olsen@ils.uio.no

Prais, S.J.:
S.J. Prais (b. 1928) has spent most of his career in economic research, mostly at the National Institute of Economic and Social Research.
The economic analysis of consumer behaviour formed his initial research field, followed by a study of the growth of industrial concentration in Britain. Work on international differences in industrial productivity and their relation to the vocational training of the workforce was at the centre of subsequent extended empirical comparisons of British and German industries. This led to international comparisons of school-leaving standards, particularly in mathematics, and, in due course, to membership of the National Curriculum committee in that subject. Schooling standards and teaching methods have been compared, particularly with Germany and Switzerland. He was elected Fellow of the British Academy in 1985, and was awarded a D.Litt. (hon.) by City University in 1989 and a D.Sc. (hon.) by the University of Birmingham in 2006.
Contact: c/o m.ockenden@niesr.ac.uk

Puchhammer, Markus:
Dipl.-Ing. Dr. phil. Dr. techn.; worked as a research assistant and graduated in physics at the Technical University of Vienna; graduated in psychology and educational science at the University of Vienna. He worked for several years as a software developer, systems analyst and project manager in the telecommunications industry, then in the electronic transaction processing business; gave EDP training courses at WIFI Vienna; taught at a vocational education college (HTL); lecturer at FH Joanneum Graz, then at the University of Applied Sciences Technikum Wien for science and research, statistics and data analysis; teleteaching survey.
Contact: puchhammer@gmx.at

Retzl, Martin:
(b. 1980); assistant at the Department of Educational Science at the University of Vienna (since 1 September 2007)
– Student assistant at the Department of Educational Science at the University of Vienna (March 2006-July 2007)
– Master’s degree in educational science (Mag. phil.)
– Graduate of a teachers’ college: diploma for teaching at “Hauptschule” (lower secondary school)
– Projects and research interests: development of teaching material, empirical research (diploma thesis: teacher study), school capacity research, governance of the school and educational system.
Contact: martin.retzl@univie.ac.at

Sjøberg, Svein:
Professor in science education at Oslo University. He was educated as a nuclear physicist, later also in education and social science. Current research interests: social, cultural and ethical aspects of science education; science education and development; gender and science education in developing countries; a critical approach to issues of scientific literacy and public understanding of science. Currently organizer of ROSE (The Relevance of Science Education), a comparative project on pupils’ interests, attitudes, perceptions etc. of importance to science teaching and learning.
Information and articles at http://folk.uio.no/sveinsj/
Contact: svein.sjoberg@ils.uio.no

Uljens, Michael:
Michael Uljens (b. 1962), Prof. Dr., Vice Dean at Åbo Akademi University and Dozent at Helsinki University, has been working on a wide range of educational topics, but his main field of research over the years has been the theory and philosophy of education (books: “School Didactics and Learning” and “Allmän pedagogik”). Since 2005 he has been running a four-year research project (“Bildung and learning in the late-modern society”) with six doctoral students working full time. He has worked as a visiting scholar at the University of Göteborg with Prof. Marton and at Humboldt University with Prof. Benner. From 2000 to 2003 he was professor of general education at Helsinki University.
Contact: muljens@abo.fi

Wuttke, Joachim:
Joachim Wuttke studied physics in München and Grenoble. He holds a state certificate for teaching mathematics and physics in secondary schools, a PhD in physical chemistry, and a habilitation in experimental physics. He has worked in the telecommunication industry, as a school teacher, and as a group leader in academic research. Besides 25 research papers in statistical physics, he has published on scientific instrumentation and computing. Joachim Wuttke is a staff scientist at the Munich outstation of Forschungszentrum Jülich.
Contact: wuttke1@web.de


