
Stefan Thomas Hopmann, Gertrude Brinek, Martin Retzl (Hg./Eds.)

PISA zufolge PISA – PISA According to PISA

Schulpädagogik und Pädagogische Psychologie
edited by Univ.-Prof. Dr. Dr. h. c. Richard Olechowski (Universität Wien)
Volume 6

LIT


Stefan Thomas Hopmann, Gertrude Brinek, Martin Retzl (Hg./Eds.)

PISA zufolge PISA –
PISA According to PISA

Hält PISA, was es verspricht? –
Does PISA Keep What It Promises?

LIT


Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de.

ISBN 978-3-7000-0771-5 (Austria)
ISBN 978-3-8258-0946-1 (Germany)

A catalogue record for this book is available from the British Library.

© LIT VERLAG GmbH & Co. KG Wien 2007
Krotenthallergasse 10/8, A-1080 Wien
Tel. +43 (0) 1 / 409 56 61, Fax +43 (0) 1 / 409 56 97
e-mail: wien@lit-verlag.at, http://www.lit-verlag.at

LIT VERLAG Dr. W. Hopf, Berlin 2007
Distribution/publisher contact:
Fresnostr. 2, D-48159 Münster
Tel. +49 (0) 251 / 62 03 20, Fax +49 (0) 251 / 23 19 72
e-mail: lit@lit-verlag.de, http://www.lit-verlag.de

Distribution:
Austria: Medienlogistik Pichler-ÖBZ GmbH & Co KG, IZ-NÖ Süd, Straße 1, Objekt 34, A-2355 Wiener Neudorf
Tel. +43 (0) 2236 / 63 535-290, Fax +43 (0) 2236 / 63 535-243, e-mail: mlo@medien-logistik.at
Germany: LIT Verlag, Fresnostr. 2, D-48159 Münster
Tel. +49 (0) 251 / 620 32-22, Fax +49 (0) 251 / 922 60 99, e-mail: vertrieb@lit-verlag.de

Distributed in the UK by: Global Book Marketing, 99B Wallis Rd, London E9 5LN
Phone: +44 (0) 20 8533 5800, Fax: +44 (0) 1600 775 663
http://www.centralbooks.co.uk/acatalog/search.html

Distributed in North America by: Transaction Publishers, Rutgers University, 35 Berrue Circle, Piscataway, NJ 08854
Phone: +1 (732) 445-2280, Fax: +1 (732) 445-3138
For orders (U.S. only): toll free (888) 999-6778, e-mail: orders@transactionspub.com


Inhalt / Table of Contents

Zu diesem Buch (About This Book)

Vorwort (Preface)
Richard Olechowski

Introduction: PISA According to PISA – Does PISA Keep What It Promises?
Stefan T. Hopmann/Gertrude Brinek

What Does PISA Really Assess? What Does It Not? A French View
Antoine Bodin

Testfähigkeit – Was ist das? (Testability – What Is That?)
Wolfram Meyerhöfer

PISA – An Example of the Use and Misuse of Large-Scale Comparative Tests
Jens Dolin

Language-Based Item Analysis – Problems in Intercultural Comparisons
Markus Puchhammer

England: Poor Survey Response and No Sampling of Teaching Groups
S. J. Prais

Disappearing Students – PISA and Students With Disabilities
Bernadette Hörmann

Identification of Group Differences Using PISA Scales – Considering Effects of Inhomogeneous Items
Peter Allerup

PISA and “Real Life Challenges”: Mission Impossible?
Svein Sjøberg

PISA – Undressing the Truth or Dressing Up a Will to Govern?
Gjert Langfeldt

Uncertainties and Bias in PISA
Joachim Wuttke

Large-Scale International Comparative Achievement Studies in Education: Their Primary Purposes and Beyond
Rolf V. Olsen

The Hidden Curriculum of PISA – The Promotion of Neo-Liberal Policy By Educational Assessment
Michael Uljens

Deutsche Pisa-Folgen (German PISA Consequences)
Thomas Jahnke

PISA in Österreich: Mediale Reaktionen, öffentliche Bewertungen und politische Konsequenzen (PISA in Austria: Media Reactions, Public Assessments, and Political Consequences)
Dominik Bozkurt, Gertrude Brinek, Martin Retzl

Epilogue: No Child, No School, No State Left Behind: Comparative Research in the Age of Accountability
Stefan T. Hopmann


Zu diesem Buch (About This Book)

“PISA zufolge PISA” (PISA According to PISA) was the topic of a symposium held in March 2007 at the University of Vienna by the Research Unit for School and Education Research of the Department of Education. We had invited several critics of the existing PISA studies to this event, as well as representatives of the Austrian PISA consortium, who unfortunately had to cancel at short notice. Our aim was to bring objectivity to the debate about PISA: away from the political-ideological quarrel over PISA and toward an analysis of the methodological premises and consequences of the PISA project, more precisely the question of whether PISA, given its design, can deliver what it promises to explain in its analyses and reports. At this event the question was raised whether and how PISA is being discussed internationally as a scientific undertaking. We took this as the occasion to put together the present volume.

Despite its enormous public impact, PISA has so far prompted hardly any international scrutiny within comparative educational research, though it has triggered a number of national inquiries (cf. Hopmann & Brinek in this volume). For this volume, eighteen researchers from seven countries (Denmark, Germany, England, Finland, France, Norway, and Austria) have now come together for the first time across national borders to examine PISA critically from all sides: the entire research process, from the design and survey instruments through the administration and data analysis to the public presentation of the data. All relevant scholarly approaches to PISA were taken into account: empirical educational research, research methodology, statistics, and general as well as subject-specific didactics. Almost all contributors have many years of experience in comparative educational research or related undertakings; some were, at least temporarily, directly involved in PISA research.

With all due respect for the tremendous commitment of the OECD and the national PISA consortia, the result is very sobering: PISA does not come close to keeping what it promises, and with the means employed it cannot do so! The PISA project is evidently burdened with so many weaknesses and sources of error that at least its most popular end products, the international comparison tables and most of the national supplementary analyses of schools and school structures, instruction, school achievement, and issues such as migration, social background, gender, etc., simply cannot be scientifically sustained in the forms practiced so far. They far overstretch the load-bearing capacity of the chosen design and its theoretical and methodological foundations. Anyone who wants to decide about school structures, curricula, national tests, or the future of teacher education on this basis is ill advised.

This does not stop PISA from being one of the most important and productive projects of contemporary comparative research. Several contributions in this volume explicitly point to future possibilities of PISA research. It only seems urgently necessary to state far more clearly the limits of validity and reliability that are, in part, unavoidable in such research, and to ensure that PISA is not claimed in the future for burdens of proof that it cannot shoulder in a scientifically defensible way. One could almost say that the good in PISA, and the interested public, need to be protected against the methodologically untenable exuberance of some of those involved in PISA. Otherwise there is the danger that one day the education administrations, the schools and school heads, the teachers, and the students will not only grow weary of the constant misuse of their data but will reject all comparable measures and research projects wholesale, or even boycott or sabotage them through willful response behavior, as has already happened with state testing in several countries (among others the USA, Chile, and Norway). This would inflict lasting damage on comparative educational research as a whole, and the concern that this may happen at the end of the PISA enthusiasm is what motivates our engagement.

Of course, our own undertaking also faced clear limits:

– First, a direct re-analysis of original PISA data, PISA questions, etc. was possible only to a limited extent, only where individual data sets were accessible. In addition, we evaluated almost the entire literature on PISA methodology and its implications (cf. the references in the individual chapters). However, PISA has so far not permitted an independent examination of the complete data sets including all documentation. It may therefore be that, in a corresponding re-examination, if it one day becomes possible, individual results of our meta-analyses will turn out differently than we were able to verify. However, so many critical objections emerged in our investigations that refuting half a dozen or even a dozen of them would change nothing about the core findings of this volume. In every phase of the PISA project there are numerous design decisions and problems which, taken on their own, suffice to regard a considerable part of the currently customary presentation and use of PISA results as scientifically untenable.

– Second, we were not interested in speaking with one voice. Not only do PISA critics find different things worthy of criticism and therefore choose different approaches and lines of argument; we also wanted to present the whole range of criticism currently accessible in Europe and to exclude no one merely because one or another of us might not share individual points or conclusions. We had also invited the German and the Austrian PISA consortia to participate several times. Unfortunately, this did not come about. Fortunately, some others with PISA experience were nevertheless willing to take part in our undertaking. We thus succeed only partially in reflecting the full pros and cons of the discussion. We do not doubt, however, that the PISA consortia have enough other opportunities to take an active part in the debate.

An undertaking such as this cannot succeed without help from many sides. The then Austrian Federal Ministry for Education, Science and Culture (BMBWK), the Austrian Society for Educational Research, and the Norwegian research network “Achieving School Accountability in Practice (ASAP)”, among whose publications this volume counts, generously supported the symposium and the work on the present volume. Not to be forgotten is the help of the secretariats in Vienna (Patricia Stuhr) and Kristiansand (Inger Linn Nystad Baade, Karen Beth Lee Hansen), whose patience and language skills made success possible. Finally, we would like to thank Richard Olechowski and LIT Verlag warmly for including this volume in the series “Schulpädagogik und Pädagogische Psychologie”.

Stefan T. Hopmann, Gertrude Brinek, Martin Retzl
Vienna, September 2007


Science thrives on discussion. For this reason we warmly invite you to take part in our online discussion forum. Post your opinions, criticism, and suggestions about the book! More information is available at the following homepage: http://institut.erz.univie.ac.at/home/fe2/. We look forward to a stimulating discussion!


Vorwort (Preface)

Richard Olechowski
Austria: Universität Wien

Presumably every educational researcher who has ever borne ultimate responsibility for a large empirical project – one that, as is usually the case, was limited to the national level anyway, and perhaps additionally restricted to the language most widely spoken in the nation concerned – will have marveled at the daring of the PISA consortium. Did this consortium pull off its “salto mortale”, or did the leap end “lethally” for its members? Anyone working on a large national project (even one restricted as just sketched, or to an even greater degree) knows only too well that the necessary exactness is threatened by a whole series of impairments at every stage of the project, starting with the drawing of the sample, through the individual steps of test construction, to the question of whether uniform test administration was achieved in all subgroups. Not to be underestimated are the problems of test scoring and data analysis (in the narrower sense), especially when some of this had to be carried out decentrally, as in the PISA project, in the participating states. Nor should the effect of the chosen manner of publication be underestimated, particularly in the case of a project attracting public interest of the scale and intensity that PISA does.

Professionally competent critics of PISA accuse those responsible for the PISA publications, in this book, of having deliberately heightened the intensity of that public interest by publishing the PISA results in the form of national rank scales, thereby steering the public interest not toward the concerns of educational science but toward those of the tabloid press. The data were originally collected at a higher scale level; only for publication was the comparatively coarse measure of rank data chosen, and the “suggestive power” of the rankings, so the critics argue, was foreseeable. The operators of PISA therefore bear, according to the critics, the responsibility for the manner of the current discussion.

As professionally competent colleagues, some of whom have even worked on parts of the large PISA project themselves, report in this book, there are serious critical objections concerning every phase of the PISA project. Some examples, in addition to those already mentioned above:

– The PISA project began with the collection of test items. Although all states participating in the PISA study were invited to submit items, not all eligible states followed this invitation. This created a “cultural bias” – a distortion toward the cultural characteristics and peculiarities of those states that did respond to the call for items.

– The curricula of the relevant schools in the individual countries were, by and large, drawn up at the time (many years before the PISA project) without international coordination. Moreover, teachers in the individual countries enjoy, to varying degrees, a certain freedom with respect to the curriculum. In most countries, teacher working groups also develop “syllabus distributions” that concretize classroom work. It goes without saying that an achievement test must be constructed in close alignment with what was actually taught in the schools. This aspect was not systematically taken into account in the item construction of the PISA project. (Any deliberate renunciation of “curricular validity” would amount to knowingly accepting the risk of an argumentative impasse.)

– Every test has to be referenced to a calibration (norming) sample. The calibration sample must be as similar as possible to the sample that is tested in a concrete study as a stand-in for the population of interest. This is not the case with PISA: the individual tests were not calibrated or normed on a separate calibration sample in each individual country participating in the PISA project.

– Detailed information on the “reliability” of the tests employed is missing. (There is no information on the degree of agreement between test results obtained at different points in time from the same or “comparable” persons.)

– Likewise missing is detailed information on the “validity” of the tests employed. (There is no information on how closely the test results agree with results from other tests measuring the same or similar dimensions.)
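For reference, both quantities are standard test-theory coefficients (this formalization is supplied here for orientation; the preface itself gives only the verbal definitions): test-retest reliability is the correlation between two administrations $X_1, X_2$ of the same test, and the form of validity described here is the correlation of the test score $X$ with a score $Y$ from another instrument measuring the same or a similar dimension:

$$ r_{tt} = \frac{\operatorname{Cov}(X_1, X_2)}{\sigma_{X_1}\,\sigma_{X_2}}, \qquad r_{XY} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X\,\sigma_Y}. $$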

Deficiencies such as those listed above cannot, and on this the PISA critics who have their say in this book also agree, be “compensated for” by ever larger samples, for they are not “random errors” but so-called “systematic errors”.
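The distinction carries the argument: random error shrinks roughly with $1/\sqrt{n}$ as the sample grows, while a systematic error survives any sample size. A minimal simulation sketch (ours, not from the book; all numbers are invented for illustration, assuming a hypothetical bias of 15 score points):

```python
import random

random.seed(1)

TRUE_MEAN = 500.0  # hypothetical true mean score of the population
BIAS = -15.0       # hypothetical systematic error, e.g. a cultural item bias

for n in (100, 10_000, 1_000_000):
    # each observation = truth + systematic bias + random noise (sd = 100)
    scores = [TRUE_MEAN + BIAS + random.gauss(0, 100) for _ in range(n)]
    estimate = sum(scores) / n
    print(f"n={n:>9}: estimate {estimate:6.1f}, error {estimate - TRUE_MEAN:+6.1f}")

# The estimate converges to 485, not 500: a larger sample only pins down the
# biased value more precisely; it never removes the systematic error.
```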

Nevertheless, and on this even the strictest critics agree, PISA has broken new ground, and its results, after sometimes harsh criticism, are not simply to be brushed aside. At the present time, no one could conduct such a large international comparative study better. Nor must the points of criticism listed above, which are presented and discussed in this book in detail and with expertise, be misunderstood as expressing a fundamental reserve or even aversion toward measuring and counting where questions of education are concerned. Even if corrections were needed here and there in details, and even with regard to the question of a “national ranking” on a single dimension, educational research has profited greatly from the PISA study:

On the one hand, almost all states that do not “hold” the top ranks in most of the tested dimensions have probably been motivated to examine critically what in their school system should be reformed: the school organization, the curriculum, teacher education and in-service training, or other aspects of the respective school and education system. (The PISA surveys and analyses to date admittedly do not help the individual countries find the concrete causes of any unsatisfactory PISA results.) On the other hand, PISA has brought an aspect of particular importance into the awareness of the scientific community: comparative education, an important line of research within educational science, has all at once been almost forcibly dragged away from the global perspective of comparing school systems, curricula, and the training systems of the individual teachers, and redirected toward the essential aspect of “outcome” – through the PISA-based worldwide comparison, a comparison carried out with the help of a methodologically sophisticated set of instruments, with tests based on probabilistic models of test theory.
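To make “probabilistic models of test theory” concrete: PISA’s scaling builds on Rasch-type item response models. In the basic one-parameter (Rasch) model, the probability that person $v$ solves item $i$ depends only on the difference between the person’s ability $\theta_v$ and the item’s difficulty $\beta_i$:

$$ P(X_{vi}=1 \mid \theta_v, \beta_i) = \frac{\exp(\theta_v - \beta_i)}{1 + \exp(\theta_v - \beta_i)}. $$

(The formula is supplied here for orientation only; PISA itself uses a generalization of this model, which the preface does not spell out.)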

The merit of this book lies precisely in this balance between the criticism, the methodologically critical review of the surveys taking place in a three-year rhythm (2000, 2003, 2006), in particular of the analyses insofar as they are already available, and the look into the future. It offers not only the opportunity to see the PISA study in all its method-critical sharpness (often enough the criticism is constructive), but also opens up the possibility of recognizing PISA as a step in the further development of comparative education. For the first time, this large-scale comparative study gives international comparative educational research the possibility of producing intersubjectively comparable and thus also, by clear criteria, falsifiable results (an important point for science!), and conversely of achieving results that are replicable in the narrow sense of the word, and so of arriving at a genuine, empirically secured body of knowledge, insofar as this concept is at all compatible with the view of the provisionality of all science.

Richard Olechowski
Vienna, autumn 2007


Introduction:
PISA According to PISA – Does PISA Keep What It Promises?

Stefan T. Hopmann/Gertrude Brinek
Austria: University of Vienna

For the time being, PISA is the most successful enterprise in comparative education. Every time a new PISA wave rolls in, or an additional analysis appears, governments fear the results, newspapers fill column after column, and the public demands answers to the alleged failings in their country’s school system. Of course, such a tremendous impact evokes discussion and criticism. On the one side are those:

– who blame PISA for not covering the whole breadth of education or schooling (e.g. Fuchs 2003; Ladenthin 2004; Kraus 2005; Herrmann 2005; Dohn 2007; adding to the PISA frame: Benner 2002),
– who point to the fact that PISA is run by private companies (“PISA Incorporated”) looking for a share of the ever-growing testing market (see e.g. Bracey 2005; Flitner 2006; Lohmann 2006), or
– who depict PISA as a New Public Government outlet of the most neo-liberal kind (see e.g. Lohmann 2001; Huisken 2005; Klausnitzer 2006).

On the other side are those who praise PISA for giving us the best database ever available for comparative research, for developing new tools of research, and for PISA’s creative analysis of its data sets (for many examples see Pekrun 2003; Roeder 2003; Weigel 2004; Stack 2006; Olsen in this volume).

PISA According to PISA

However, surprisingly, and in spite of its public impact, PISA has not led to thorough methodological debates within the comparative research community, at least not internationally. There have been some critiques pointing to design or analytic shortcomings in some of the participating countries (e.g. Bonnet 2002; Romainville 2002; Nash 2003; Prais 2003, 2004; Goldstein 2004; Allerup 2005, 2006; Bodin 2005; Bottani & Vrignaud 2005; Gaeth 2005; Olsen 2005; Jahnke & Meyerhöfer 2006; Neuwirth, Ponocny & Grossmann 2006; Grisay & Monseur 2007). There has been some fundamental, and highly contested, criticism of the methodological soundness of PISA’s research as a whole (Jahnke & Meyerhöfer 2006; especially Wuttke 2006; rebuttal by Prenzel & Walter 2006).¹ However, none of this has led to an international debate on the validity claims of PISA outside the PISA community itself. It seems as if the overwhelming success of the approach has made any attempt to discuss PISA’s design, data collection, and analysis methodologically look petty-minded and irreverent. The strategy of PISA itself in not giving access to the full database, including all the questionnaires, contributes to this problem.

The present volume on “PISA According to PISA” is probably the first independent international attempt to discuss the methodological merits and shortcomings of PISA in relation to the validity and reliability claims PISA itself puts forward. Our aim is not to add to the debate for or against PISA. Most of us believe that PISA is an important milestone in the history of our field. But we do question whether some basic elements of PISA are done well enough to carry the weight of, e.g., comparative league tables or of in-depth analyses of weaknesses of educational systems. We ask whether other, and better, uses of the PISA database are warranted, and whether PISA-as-a-public-event should come under much more independent scrutiny – if only to avoid its misuse to validate claims and policies which cannot be legitimately derived from PISA.

The volume seeks to follow – as much as possible – the whole PISA research process, from the design and sampling, through the data collection and analysis, to the data presentation and impact. Our aim is not to give an overview of the different national PISA debates but rather to discuss general issues of construction and use. The contributors come from seven countries and from all walks of educational research, including specialists in empirical research methodology, statistical data analysis, general and subject-matter didactics, and educational policy analysis. We include contributors who are or have themselves been involved in PISA or similar projects (see the bios at the end of the book).

¹ The editors of the above-mentioned “PISA & Co.” volume (Jahnke & Meyerhöfer 2006) are working on a new and revised edition of that book, including an explicit discussion of the responses they received to the first edition. This book will be available by late 2007.

To highlight just a few core issues:

– Antoine Bodin (IREM de Besançon – Université de Franche-Comté) shows from a French perspective how much the PISA assessment is embedded in a certain understanding of (school) knowledge, which does not fit all.
– Wolfram Meyerhöfer (Universität Potsdam) continues this argument with an in-depth analysis of what PISA really asks for in its questionnaires, showing how little this is in touch with a comprehensive concept of “Bildung” or even current didactics.
– Jens Dolin (Syddansk Universitet) adds similar arguments from a Danish perspective, underlining how much PISA’s conceptualization of knowledge risks misrepresenting what is taught and learned in schools.
– Markus Puchhammer (Technikum Wien) shows – using the published example questions – how translation problems may affect results to a degree that makes comparisons guesswork.
– S. J. Prais (National Institute of Economic and Social Research, London) uses the example of England to demonstrate serious flaws in the response rates and sampling, which necessarily lead to biased results.
– Bernadette Hörmann (Universität Wien) points to the systematic marginalization of special-needs students by PISA and to how little has been done to deal with their role within the PISA approach, at least in Austria.
– Peter Allerup (Århus Universitet) elaborates a similar issue by showing, from Denmark, to what degree PISA’s much-acclaimed analysis of the impact of gender, migration, and similar factors depends on but a few, highly problematic items.
– Svein Sjøberg (Universitetet i Oslo) underlines how much both PISA’s design and students’ response behavior are culturally embedded, which may lead to a partial or complete mismatch.
– Gjert Langfeldt (Agder Universitet) questions the validity and reliability claims made by PISA, pointing to constructional constraints, methodological mishaps, and the cultural bias embedded in the PISA design.
– Joachim Wuttke gives a comprehensive overview of recently voiced criticism of PISA’s research conduct and of the resulting bias and uncertainties, which call into question not least its league tables and comparisons.
– Rolf Olsen (Universitetet i Oslo) outlines ways in which PISA can overcome some of its shortcomings by broadening its approach and adding new research.
– Michael Uljens (Åbo Akademi) explains the Finnish PISA success by the fact that what PISA asks for had already gained a foothold in Finnish schooling before PISA came around.
– Thomas Jahnke (Universität Potsdam) elaborates from a German perspective how PISA fails to really assess what is or should be taught in schools, and how reliance on PISA can lead to an impoverished view of the curriculum.
– Dominik Bozkurt, Gertrude Brinek, and Martin Retzl (Universität Wien) use the Austrian example to show how the public and political response to PISA unfolds irrespective of what PISA really can cover or prove.
– Finally, Stefan T. Hopmann (Universität Wien) puts both the PISA project and the PISA discourse in a comparative perspective, showing how much the design, use of, and response to PISA depends on the needs and traditions of those involved.

All in all, the contributions give a very varied picture of the PISA effort. No step in the research process seems to be without substantial problems, and several steps do not meet rigorous scholarly standards. Some of us believe that these are obstacles which can be overcome within the PISA frame (e.g. Allerup, Dolin, Olsen, Sjøberg); others tend to the conclusion that the PISA project is beyond repair (e.g. Langfeldt, Meyerhöfer, Wuttke), or so deeply embedded in a specific political purpose that it should rather be considered a type of research-based policy making, not a scholarly undertaking (e.g. Hopmann, Jahnke, Uljens, Bozkurt/Brinek/Retzl).

Almost all of the chapters raise serious doubts concerning the theoretical and methodological standards applied within PISA, and particularly concerning its most prominent by-products, the national league tables and analyses of school systems. Without access to the full set of original data, it is difficult to come to final conclusions. However, from our viewpoint, a few points seem to be evident beyond any reasonable doubt:

– PISA is by design culturally biased and methodologically constrained to a degree which prohibits accurate representations of what actually is achieved in and by schools. Nor is there any proof that what it covers is a valid conceptualization of what every student should know.
– The products of most public value, the national league tables (cf. Steiner-Khamsi 2003), are based on so many weak links that they should be abandoned right away. If only a few of the methodological issues raised in this volume are on target, the league tables depend on assumptions about their validity and reliability which are unattainable.

– The widely discussed by-products of PISA, such as the analyses of “good schools”, “good instruction”, or of differences between school systems and of issues like gender, migration, or social background, go far beyond what a cautious approach to these data allows for. They are more often than not speculative, and would at least need a wider framing by additional research looking at the aspects which PISA by design cannot cover or gets wrong.
– Any policy making based on these data (whether about school structures, standards, or the curriculum) cannot be justified. The use and misuse of PISA data in such contexts – done with or without PISA researchers’ consent or cooperation – belongs solely to the sphere of policy making. Of course PISA researchers have the same right as every citizen to pronounce their political convictions in public. However, they cannot do so claiming research as an unquestionable basis for their arguments.

This does not mean that there are no valuable lessons to be drawn from PISA. At the very least it is a highly innovative comparative study on the uneven distribution of a peculiar kind of knowledge and abilities among young people in different countries. However, the use of PISA as research on schooling by the OECD, its members, and some of the research groups connected to the effort goes far beyond what is scientific evidence or simply well-done research. PISA is not according to PISA when it comes to how it is produced and used in these cases.

PISA – The Contergan of Educational Research?

Of course, we would have loved to add to this volume commentaries on and criticism of what is presented here by members of the PISA consortium, because we believe in the necessity of broad and uninhibited scholarly exchange. However, repeated invitations to address these issues in open symposia, or to contribute to this volume, either remained unanswered or were turned down. The German PISA consortium went so far as to make an official decision not to participate in this effort; others simply kept silent. Time and again we were told in public and at meetings that most of the methodological criticism published on PISA has been proven wrong, and that every possible weakness has been taken care of. However, we could not obtain a published justification for this claim. Even an invitation to contribute a summary of the counterarguments to this volume was turned down.

As sad as this is, it was no surprise. In the preparation of this volume we exchanged quite a few notes on how the national debates around PISA unfold in ‘our’ countries. What emerged was a picture not unlike the behaviour of large companies when they encounter a potential scandal, e.g. pharmaceutical companies dealing with ill-conceived drugs (like Chemie Grünenthal in the famous Contergan/Thalidomide case or other scandals; cf. Kirk 1999; Luhmann 2000; Schulz 2001), where the strategy is one of “issue framing” (cf. Entman 1993; Sniderman & Theriault 2004). To take just the most recent German example:

– If some critique is voiced in public, the first response seems to be silence. Or, as the leader of the German consortium, Manfred Prenzel, put it in the case of this book: one doesn’t want to provide “a forum for unproven allegations” (as an answer to the invitation to participate in this book by mail 2007-05-09, which was turned down by a “unanimous” decision of the German PISA consortium confirmed by a mail 2007-05-21). He wrote this before knowing the authors and titles of all but one of the chapters contained in this volume.

– If that is not enough, the next step is often to raise doubts about the motives and the abilities of those who are critical of the enterprise. For instance, when asked about the recently published volume PISA & Co. (Jahnke & Meyerhöfer 2006), Olaf Köller, as the head of the German National Institute for Educational Progress, suggested that (1) these critics were unqualified to discuss PISA (even though they included many leading members of mathematics didactics research in Germany) and (2) they were probably driven by envy or other non-scholarly motives (Köller 2006a; Kerstan 2006).
– The next step seems to be to acknowledge some problems, but to insist that they are very limited in nature and scope, not affecting the overall picture. Alternatively, it is pointed out that these problems are well known within large-scale survey research of the kind PISA represents, and even unavoidable when working comparatively (e.g. Köller 2006b). Of course, that claim does not reduce the impact of these problems on the validity of the results.
– Finally, there is the statement that the criticism does not contain anything new, and nothing that has not been dealt with within the PISA research itself – and often this claim is accompanied by references to opaque technical reports that only insiders can understand, or to unpublished papers or reports (e.g. Prenzel & Walter 2006; Schleicher 2006).

What does not happen is what is normally considered to be “good science”: open debate on the pros and cons of the arguments. If one understands PISA as an economic enterprise, in line with the above-mentioned pharmaceutical companies, this is quite reasonable. Ignoring, silencing, or simply marginalizing a critic does less harm to the brand than a public argument: a public rebuttal carries the risk that some customers will not be totally convinced (“semper aliquid haeret”). Firmer steps become necessary only when criticism finally becomes so public that it can no longer be ignored by customers and buyers. But the first move is still to discredit the critics and their supporters as being uninformed, ill-equipped, or simply following a personal agenda. The final move rests on the claim that there is other research which proves the critics wrong – although, for a variety of reasons, the data sets on which these conclusions are based cannot be made available. By using such techniques, companies can realistically expect that even proven deficiencies will not substantially harm sales over time.

Of course, the comparison of PISA and Contergan can be seen as overreaching: Thalidomide led to thousands of severely disabled newborns, whereas PISA at worst does harm to children’s education. Additionally, the Grünenthal company directly advertised the medication for high-risk purposes, whereas the PISA consortium can argue that it is up to the people to believe or not to believe what PISA tells them. But other similarities are striking: PISA has a large “market share” to defend. Most of the public money spent on educational research nowadays is being put into PISA and similar approaches (the standards and testing business); many chairs in education have turned to related topics and issues, thus providing a significant market for collaborators in the field. This is all too big and too seductive to be put at risk just because of a few other scholars who do not support the whole enterprise or the way it is done.

The readers of this volume should expect similar responses to what is said here. But don’t worry: nobody is going to drag PISA into a court of law because of its flaws, as happened with the pharmaceutical companies. No court other than that of public reasoning is available, but with Kant we believe that this is the strongest court of all.

Discussion is an essential part of science. Therefore we invite you to take part in our discussion forum on the Internet and to post your opinions and critiques concerning the book. Find more information at http://institut.erz.univie.ac.at/home/fe2/. We are looking forward to an inspiring discussion!

References

Aktionsrat Bildung: Bildungsgerechtigkeit. Jahresgutachten 2007. (ed. Vereinigung der Bayerischen Wirtschaft e.V.; online: http://www.aktionsrat-bildung.de/fileadmin/Dokumente/Bildungsgerechtigkeit_Jahresgutachten_2007_-_Aktionsrat_Bildung.pdf; retr. 2007/07/07).

Allerup, P.: PISA præstationer – målinger med skæve målestokke? In: Dansk Pædagogisk Tidsskrift 2005-1, 68-81.

Allerup, P.: PISA 2000’s læseskala – vurdering af psykometriske egenskaber for elever med dansk og ikke-dansk sproglig baggrund. Odense (Rockwool Fondens Forskningsenhed og Syddansk Universitetsforlag) 2006.

Allerup, P.: Identification of Group Differences Using PISA Scales – Considering Effects of Inhomogeneous Items. In this volume.

Benner, D.: Die Struktur der Allgemeinbildung im Kerncurriculum moderner Bildungssysteme. Ein Vorschlag zur bildungstheoretischen Rahmung von PISA. In: Zeitschrift für Pädagogik 48-2002-1, 68-90.

Bodin, A.: What does PISA really assess? What it doesn’t? A French view. Report prepared for the joint Finnish-French conference “Teaching mathematics: Beyond the PISA survey”, Paris 2005.

Bodin, A.: What Does PISA Really Assess? What Does It Not? A French View. In this volume.

Bonnet, G.: Reflections in a Critical Eye: On the Pitfalls of International Assessment. In: Assessment in Education 2002-9, 387-400.

Bottani, N. & Vrignaud, P.: La France et les évaluations internationales. Paris 2005. online: http://lesrapports.ladocumentationfrancaise.fr/BRP/054000359/0000.pdf (retr. 2007/07/07).

Bracey, G.W.: Research: Put Out Over PISA. In: Phi Delta Kappan 86-2005-10, 797.

Dohn, N.B.: Knowledge and Skills for PISA – Assessing the Assessment. In: Journal of Philosophy of Education 41-2007-1, 1-16.

Flitner, E.: Pädagogische Wertschöpfung. Zur Rationalisierung von Schulsystemen durch public-private-partnerships am Beispiel von PISA. In: Oelkers, J. et al. (eds.): Rationalisierung und Bildung bei Max Weber. Bad Heilbrunn (Klinkhardt) 2006, 245-266.

Fuchs, H.-W.: Auf dem Wege zu einem neuen Weltcurriculum? Zum Grundbildungskonzept von PISA und der Aufgabenzuweisung an die Schule. In: Zeitschrift für Pädagogik 49-2003-2, 161-179.

Gaeth, F.: PISA (Programme for International Student Assessment) – Eine statistisch-methodische Evaluation. Berlin (Freie Universität) 2005.

Goldstein, H.: International Comparisons of Student Attainment: Some Issues Arising from the PISA Study. In: Assessment in Education – Principles, Policy, and Practice 11-2004-3, 319-330.

Grisay, A. & Monseur, C.: Measuring the Equivalence of Item Difficulty in the Various Versions of an International Test. In: Studies in Educational Evaluation 33-2007-1, 69-86.

Herrmann, U.: Fördern “Bildungsstandards” die allgemeine Schulbildung? In: Rekus, J. (ed.): Bildungsstandards, Kerncurricula und die Aufgabe der Schule. Münster (Aschendorff) 2005, 24-52.

Herrmann, U.: PISA – Welche Konsequenzen für Schule und Unterricht kann man wirklich ziehen? Diskussionsbeitrag DIDACTA Hannover 2006. FORUM BILDUNG (online: http://forum-kritische-paedagogik.de/start/download.php?view.209; retr. 2007/07/07).

Hopmann, S.T.: Restrained Teaching: The Common Core of Didaktik. In: European Educational Research Journal 6-2007-2, 109-124.

Huisken, F.: Der “PISA-Schock” und seine Bewältigung – Wieviel Dummheit braucht/verträgt die Republik? Hamburg (VSA-Verlag) 2005.

Jahnke, T. & Meyerhöfer, W. (eds.): PISA & Co – Kritik eines Programms. Hildesheim (Franzbecker) 2006.

Kerstan, K.: An PISA gescheitert. In: DIE ZEIT, 16.11.2006, Nr. 47.

Kirk, B.: Der Contergan-Fall: eine unvermeidbare Arzneimittelkatastrophe? Stuttgart (Wissenschaftliche Verlagsgesellschaft) 1999.

Klausnitzer, J.: PISA – einige offene Fragen zur OECD Bildungspolitik. 2006. online: http://www.links-netz.de/K_texte/K_klausenitzer_oecd.html (retr. 2007/07/07).

Köller, O.: Kritik an PISA unberechtigt. Interview mit bildungsklick.de 2006a. online: http://bildungsklick.de/a/50155/kritik-an-pisa-unberechtigt (retr. 2007/07/07).

Köller, O.: Stellungnahme zum Text von Joachim Wuttke: Fehler, Verzerrungen, Unsicherheiten in der PISA-Auswertung. Press release of the National Institute for Educational Progress (IQB), 2006/11/14 (2006b).

Kraus, J.: Der PISA Schwindel. Unsere Kinder sind besser als ihr Ruf. Wie Eltern und Schule Potentiale fördern können. Wien (Signum Verlag) 2005.

Ladenthin, V.: Bildung als Aufgabe der Gesellschaft. In: studia comeniana et historica 34-2004-71/72, 305-319.

Lohmann, I.: After Neoliberalism. Können nationalstaatliche Bildungssysteme den ‚freien Markt‘ überleben? 2001. online: http://www.erzwiss.uni-hamburg.de/Personal/Lohmann/AfterNeo.htm (retr. 2007/07/07).

Lohmann, I.: Was bedeutet eigentlich “Humankapital”? GEW Bezirksverband Lüneburg und Universität Lüneburg: Der brauchbare Mensch. Bildung statt Nützlichkeitswahn. Bildungstage 2007. online: http://www.erzwiss.uni-hamburg.de/Personal/Lohmann/Publik/Humankapital.pdf (retr. 2007/07/07).

Luhmann, H.-J.: Die Contergan-Katastrophe revisited – Ein Lehrstück vom Beitrag der Wissenschaft zur gesellschaftlichen Blindheit. In: Umweltmedizin in Forschung und Praxis 5-2000-5, 295-300.

Nash, R.: Is the School Composition Effect Real? A Discussion with Evidence from the UK PISA Data. In: School Effectiveness and School Improvement 14-2003-4, 441-457.

Neuwirth, E., Ponocny, I. & Grossmann, W. (eds.): PISA 2000 und PISA 2003. Graz (Leykam) 2006.

Olsen, R.V.: Achievement Tests From an Item Perspective. An Exploration of Single Item Data from the PISA and TIMSS Studies. Oslo (University of Oslo) 2005. online: http://www.duo.uio.no/publ/realfag/2005/35342/Rolf_Olsen.pdf (retr. 2007/07/07).

Pekrun, R.: Vergleichende Evaluationsstudien zu Schülerleistungen: Konsequenzen für die Bildungsforschung. In: Zeitschrift für Pädagogik 48-2002-1, 111-128.

Prais, S.J.: Cautions on OECD’s Recent Educational Survey (PISA). In: Oxford Review of Education 29-2003-2, 139-163.

Prais, S.J.: Cautions on OECD’s recent educational survey (PISA): Rejoinder to OECD’s response. In: Oxford Review of Education 30-2004-4.

Prenzel, M. & Walter, O.: Wie solide ist PISA? Oder: Ist die Kritik von Joachim Wuttke begründet? Kiel (IPN) 2006 (two pages including a one-page attachment!).

Roeder, P.M.: TIMSS und PISA – Chancen eines neuen Anfangs in Bildungspolitik, -planung, -verwaltung und Unterricht. Endlich ein Schock mit Folgen? In: Zeitschrift für Pädagogik 49-2003-2, 180-197.

Romainville, M.: On the Appropriate Use of PISA. In: La Revue Nouvelle 2002-3/4.

Schleicher, A.: Interview mit der Frankfurter Rundschau. In: Frankfurter Rundschau, 28.11.2006.

Schulz, J.: Management von Risiko- und Krisenkommunikation – zur Bestandserhaltung und Anschlussfähigkeit von Kommunikationssystemen. Berlin (Humboldt Universität) 2001.

Stack, M.: Testing, Testing, Read All About It: Canadian Press Coverage of the PISA Results. In: Canadian Journal of Education 29-2006-1, 49-69.

Steiner-Khamsi, G.: The Politics of League Tables. 2003. online: http://www.sowi-onlinejournal.de/2003-1/tables_khamsi.htm (retr. 2007/07/07).

Weigel, T.M.: Die PISA-Studie im bildungspolitischen Diskurs. Eine Untersuchung der Reaktionen auf PISA in Deutschland und im Vereinigten Königreich. Diplomarbeit, Trier (Universität) 2004.

Wuttke, J.: Fehler, Verzerrungen, Unsicherheiten in der PISA-Auswertung. In: Jahnke, T. & Meyerhöfer, W. (eds.): PISA & Co – Kritik eines Programms. Hildesheim, Berlin (Franzbecker) 2006, 101-154.

Wuttke, J.: Uncertainties and Bias in PISA. In this volume.


What Does PISA Really Assess? What Does It Not? A French View 1 2

Antoine Bodin 3

France: Université de Franche-Comté

Summary

This paper puts aside many important aspects of the PISA design to focus on the external validity of its mathematics questions. First, it seeks to position the PISA item contents against the French mathematics syllabus, trying to identify and quantify the overlap between the two. Then it compares the cognitive demands and competency levels of the PISA mathematics questions with those implied in customary French assessment and examination settings. Underlining certain differences between the general PISA design and the French mathematical curriculum and school culture, it also raises the epistemological and didactical validity issues of the PISA mathematics items.

1 This paper was partially presented in October 2005 at a French-Finnish conference jointly organized by the French and Finnish Mathematical Societies. A French-language version is available, as are two presentations used for the conference (also in English and in French – see addresses on the page entitled “References”).

2 With many thanks to Rosalind Charnaux for her kind help and advice on this English version.

3 antoinebodin@mac.com, website: http://web.mac.com/antoinebodin/iWeb/Site_Antoine_Bodin/

Introduction

The PISA studies have been organised by the OECD, which, as everyone knows, is an organisation devoted to world economic development. The main reason that led this organisation to undertake such a study lies in a strong belief that good education is the key to better development.

We will examine in this paper neither the value of this belief nor the economic and political implications of the studies.

At the same time we accept the idea that the PISA mathematics framework is consistent with the general PISA design, and that the mathematics test development has been carried out as faithfully and as accurately as possible (personally, I believe this is the case). There remains, however, the internal validity issue.

Plenty of documents about the PISA studies have been written and displayed all around the world, a certain number of them issued directly by the OECD and by the PISA consortium 4, and many others by officials, research teams and/or the media in the participating countries.

Therefore, the information is rich and full of contrasts. Most of the documents are public, and the OECD has done its utmost to allow scholars and other interested persons to obtain complete access to PISA’s general design as well as to its frameworks, complete database and international reports.

Far from producing only flimsy yet exciting, though often denounced, results (to which too much interest is generally paid), the PISA studies produce quality data of interest for a huge range of complementary studies ranging from politics to didactics.

4 ACER – Melbourne – Australia


Many international and national analyses have been undertaken which try to draw from processed data (as well as from raw data) the information of interest to all kinds of people concerned with educational matters.

Meanwhile, not much effort has been made until now to examine the set of mathematics questions from an external point of view, to try to understand more precisely what they really assess, and to determine to what degree they may be viewed as epistemologically and didactically consistent. Further research into these points would have implications for teaching and for teachers.

This paper seeks only to examine PISA’s external validity and is limited to its mathematical section, and, even narrower in scope, to a French point of view (‘French’ in the sense of being related to the French mathematics curriculum, French customary assessment settings, teacher beliefs, school culture, etc.).

Intended and implemented PISA assessment focus

First, it seems important to recall that PISA does not claim to assess the general quality of the educational systems examined. Regarding our topic, it does not pretend to assess general mathematical proficiency, but simply concentrates on what the OECD judges essential for the normal life of any citizen (the so-called ‘mathematical literacy’).

Let us quote the official report:

“PISA seeks to measure how well young adults, at age 15 and therefore approaching the end of compulsory schooling, are prepared to meet the challenges of today’s knowledge societies. The assessment is forward-looking, focusing on young people’s ability to use their knowledge and skills to meet real-life challenges, rather than merely on the extent to which they have mastered a specific school curriculum. This orientation reflects a change in the goals and objectives of curricula themselves, which are increasingly concerned with what students can do with what they learn at school, and not merely whether they can reproduce what they have learned.” 5

At any rate, individual students who do not correctly answer the PISA mathematics questions seem doomed to a troubled life, and countries that do not perform well are viewed as doing a poor job of preparing their young people for the future.

Thus, while PISA does not assess the entire body of mathematical knowledge acquired in schools, it does test at least a part of this knowledge.

5 OECD (2004): Learning for Tomorrow’s World. First Results from PISA 2003, p. 20


We will therefore first try to identify more clearly the part truly assessed by PISA and then relate this part to the entire French mathematics education offered to the country’s 15-year-olds. The relationship between this “literacy” part and the entire test is a problematic question, one that raises epistemologically and didactically complex issues.

However, first we must examine the way in which the PISA material is linked to the French mathematics curriculum.

A comparison of the PISA mathematics item content with the current French mathematical syllabus

For the moment, let us limit ourselves to the French syllabus, which most of the 15-year-old French students have studied. By this I mean the French “collège” syllabus from grade 6 to grade 9 (French “sixième” to “troisième”). At age 15, some French students attend high school (up to grade 11), while others are still lagging as far behind as grade 7, and yet a few others are in special education. However, on the whole, more than 85 % of the 15-year-olds have studied this syllabus 6.

The reader will find in annex 6 a presentation of this syllabus indicating the topics that have been addressed by at least one PISA 2003 mathematics question.

Annex 3 shows a list of analysed PISA questions.

Here we should recall that only a certain number of the PISA questions have been secured for future use. In this paper I will only quote some of the released questions, while most of the questions used have nevertheless been taken into account in the analysis.

Finally, we find that the PISA questions cover about 15 % of this French syllabus, the one followed by more than 85 % of the 15-year-old French students. This shows beyond any doubt the marginal focus of the PISA questions (but marginal does not mean unimportant!).

6 In fact, the 15-year-old official target is somewhat misleading. Let us quote the PISA technical report (page 46): “The 15-year-old international target population was slightly adapted to better fit the age structure of most of the northern hemisphere countries. As the majority of the testing was planned to occur in April, the international target population was consequently defined as all students ages 15 years and 3 (completed) months to 16 years and 2 (completed) months at the beginning of the assessment period.” As a result, 59.1 % of the French students who took the tests were in high school, in grade 10 (or, for a few of them, grade 11).
10 (or for a few of them, grade 11).


At the same time those 15 % represent only about 75 % of the PISA mathematics items. This means that about 25 % of the PISA items do not fit into the French curriculum. This is not only the case for many items in the field of uncertainty, but also for items not directly linked to our current curriculum (such as some combined items).

But an assessment setting can never completely cover 100 % of any curriculum. In order to explore further, we found it useful to compare the PISA material with some customary French examinations.

A comparison of the PISA mathematics item content with some French examination and assessment settings at the 15-year-old level

Comparison with the grade 9 national examination

We chose to analyse in the same way some forms of the mathematics part of the national examination taken by all students at the end of French middle school (grade 9).

Annex 7 shows the corresponding curriculum coverage for one of these forms, while the corresponding examination form is displayed in annex 5, with an analysis chart appearing in annex 4.

Here we found that this particular “Brevet” examination form covers about 35 % of the French syllabus presented above.

In addition, the entire set of PISA 2003 questions was planned for approximately 210 minutes of testing time, while each “Brevet” form lasts just 120 minutes. As two different “Brevet” forms address different parts of the syllabus, we can estimate that in the “Brevet” context a 210-minute testing time might cover more than 50 % of the French syllabus.

What is more striking is the fact that the coverage by PISA focuses more on the syllabus for grades 6 and 7, while the coverage by the “Brevet” concerns mainly the syllabus for grades 8 and 9 (which to a certain point contains and extends the previous syllabus).

But the Brevet examination is a poor illustration of the entire French curriculum (“programmes et instructions officielles”) as well as of the teachers’ aims and teaching practices. The “Brevet” is well known for shrinking the objectives, and preparing for the “Brevet” is not viewed as a good way to prepare for further high school studies.


The EVAPM studies

In the following sections I will refer to a series of large-scale studies organised within the “EVAPM Observatory”.

EVAPM is a 20-year-long research project conducted by the Mathematics Teacher Association (APMEP) and the National Institute for Pedagogical Research (INRP) to follow the evolution of the French mathematics curriculum (and especially the attained curriculum) from grade 6 to grade 12.

Being strongly linked to the teachers, and implicating them in the test development process, the EVAPM studies obviously reflect the authors’ beliefs and intentions. As the students are not individually assessed, there is no problem in checking competencies that are known to be just at the beginning of their development. In other words, the EVAPM questions are not limited by social expectations or political exploitation, as is the case in the national exams. That could have been the case with the PISA questions too; obviously, it is not.

In recent EVAPM studies there was strong teacher resistance when we tried to introduce some PISA items. Most of the items were considered not appropriate to the curriculum, and many of them were considered culturally biased.

It is not relevant to mention curriculum coverage here, as the EVAPM studies tend to be comprehensive (100 % coverage).

In this paper, we will make use of the EVAPM studies in order to compare the cognitive demands of PISA with actual French curriculum expectations (at least as viewed by the French teachers).

Comparison of cognitive demands

In order to compare the cognitive demands of mathematics assessment items, we will use a cognitive taxonomy, of which the main categories are the following:

– A Knowing and recognising . . .
– B Understanding . . .
– C Applying . . .
– D Creating . . .
– E Evaluating . . .

See annex 1 for a first expansion of this taxonomy.

The following chart displays the PISA levels of cognitive demands alongside those of the “Brevet” examination paper already examined.


The difference is most striking: the “Brevet” addresses mostly the recognition level, and even the classification of some items at Level C (application) might be questioned (most of them are routine procedures that might have been classified at Level A).

Without doubt, the taxonomic range of the PISA items is much more balanced than that of the French examination 7.

Figure 1

However, as we have already noted, the “Brevet” does not correctly reflect the actual French curriculum.

The following chart (figure 2) adds the classifications obtained for two EVAPM studies (grade 10 – 2003 and grade 6 – 2005).

Here, the balance across levels is closer to PISA, at least at the same age level (grade 10).

The chart (figure 2) seems to indicate that French teachers would be keen to evolve towards a more PISA-like assessment practice. The EVAPM studies have shown that most French teachers are quite torn between the need to prepare their students for formal exams like the “Brevet” presented in this paper and their conception of what a good math education should include (we also know that the conflict between exams and education is not unique to the French!).

7 Renovation of the “Brevet” is on the agenda. Perhaps PISA will help speed along this process?

Figure 2

Comparison of implied range of competencies

PISA makes use of a three-tiered competency classification:

– Class 1: Reproduction: “. . . consists of simple computations or definitions of the type most familiar in conventional mathematics assessments”.
– Class 2: Connection: “. . . requires connections to be made in order to solve straightforward problems”.
– Class 3: Reflection: “. . . consists of mathematical thinking, generalisation and insight, and requires students to engage in analysis, identify the mathematical elements in a situation and pose their own problems”.

See annex 2 for more details 8.

The following chart (figure 3) displays the competency levels of the PISA items alongside those of the “Brevet”.

PISA puts more than 70 % of the emphasis on Levels 2 and 3, while the “Brevet” exam puts less than 15 % on those levels.

Here again, we can examine some EVAPM assessment settings. The chart (figure 4) again shows a balance much closer to PISA for the EVAPM studies than for the national examination.

8 Note that for EVAPM we use a competency classification originating in Aline Robert’s work (see references), which, while based on assumptions other than those of the PISA classification, provides about the same repartition.

Figure 3

Figure 4

Towards some epistemological analysis

About Finnish and French differences

With regard to this paper, we had a special interest in the differences between the Finnish and French results. The overall results on the global scale (511 for France, 548 for Finland) hide the fact that this gap amounts to about 0.33 standard deviations on the standard normal distribution, and that at a point where the probability density is at its maximum. In France not many people know this fact, and still fewer understand it.
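To give an idea of what such a gap means, here is a minimal sketch (my illustration, not a computation from the report). The pooled student-level standard deviation of about 110 points is an assumption chosen to reproduce the quoted 0.33 figure, not an official PISA constant.

```python
# Sketch: the France-Finland gap as an effect size and as a
# "probability of superiority".
from statistics import NormalDist

france, finland, sd = 511, 548, 110  # sd is an assumption, not a PISA constant
d = (finland - france) / sd          # effect size in standard deviations

# With two normal score distributions shifted by d, the chance that a random
# Finnish student outscores a random French student is Phi(d / sqrt(2)).
p_superiority = NormalDist().cdf(d / 2 ** 0.5)
print(f"d = {d:.2f}, P(Finnish > French) = {p_superiority:.2f}")
# -> d = 0.34, P(Finnish > French) = 0.59
```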

Looking at the subscales of the study (quantity, change and relationships, space and shape, uncertainty) sheds no supplementary light. In order to help understand the observed differences, it is essential to turn to the items themselves and to the percentages of success in each country (or, for other approaches, for each of the subgroups investigated 9).

First, let us say that this examination confirms the better Finnish results – it is only the magnitude of the differences and its meaning that can be questioned.

Regarding the magnitude, let us say that, depending on the item examined, the differences in success rates range from 30 percentage points in favour of the Finnish students to 25 percentage points in favour of the French students, the average of the differences being 3.5 percentage points in favour of the Finnish students. 10

We observe that the differences are greater in favour of the Finnish students for the more “realistic” items, and that the differences tend to turn in favour of the French students for more abstract or formal items (compare, for instance, below, the results of “Apples Item 1” with the results of “Apples Item 3”; the pattern seems general).

It is important to note that the difference in results between Finland and France would totally disappear if the 10 % least successful French students (the bottom decile) were put aside.

In fact, while only 7 % of the Finnish age group score at Level 1 or below (on a proficiency scale ranging from 1 to 6), 17 % of the French students fall into those categories. This confirms the fact that France does not succeed well in providing mathematical education for all (a fact already strongly confirmed by the TIMSS studies).

The other end of the scale (Level 6) concerns 7 % of the Finnish students, but only 3 % of the French ones. This fact may be less worrying than the one concerning the low levels. Let us remember here that PISA addresses only literacy and does not pretend to assess general mathematical competency.

9 We do not address the gender question in this paper, but our analysis points out a certain amount of gender bias, at least for some countries. As the overall results are weaker for girls than for boys in all countries but two, the question invites more examination. Other subgroups might also be worth scrutinising.

10 This is only a rough estimate – only 41 items have been accounted for.


Nevertheless, this casts doubt on the assumption that French mathematics education demonstrates a high level of quality through its best students.

The PISA questions presented below (exclusively those that have been released) are displayed with results for France, Finland and all OECD countries, together with the highest and lowest observed results (OECD and all participating countries). 11

11 A more complete presentation may be downloaded from my website.

Mathematics?

The mathematical field may be extended or restricted according to different conceptions. Some PISA mathematics questions puzzle many French mathematics teachers: they do not recognise the mathematics they are striving to teach. At the same time they recognise the social usefulness of the knowledge implied by these questions. The same thing applies to mathematicians: how many of the PISA mathematics questions fit into theoretical mathematical constructions is not obvious to them.

Quantity, change and relationships, space and shape, as well as uncertainty are not only modelled in mathematical theories, but are also handled in common situations, using common sense and common language.

In its endeavour to stick to real life, PISA could not help using ordinary language to present its questions. In some cases, understanding a text which is in no way a mathematical text is the main difficulty students have to face. Certainly, this is also part of the mathematical process, but the true mathematical work begins once the problem is fully understood. Here the “devolution” process is not controlled, and it is never certain whether it is the “dressing up” or the wording that prevents students from solving the problem, or the degree of mathematical difficulty of the problem itself. These mathematical difficulties often appear trivial when compared with the structural and semantic complexity of the questions.

The strong correlation observed between individual results in reading literacy and mathematical literacy (r = 0.77) perfectly illustrates this point. This correlation is smaller than the correlations observed among the four PISA mathematical domains at the international level (which range from 0.89 to 0.92), but is much higher than what is generally observed in France (EVAPM studies) between students’ results in different mathematical domains (algebra, geometry, calculus, statistics), which usually lie in the interval [0.35; 0.60]. All this leads one to think that the PISA mathematics questions may all assess a general ability to read a text, to articulate textual and iconic information with other clues indirectly given by the question’s context, and to process the problem on the basis of this information. We could also invoke here the well-known “factor g”.
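As a rough gauge of what such a correlation means (my back-of-the-envelope reading, not a figure from the reports): squaring the coefficient gives the share of score variance the two scales have in common,

\[ r^2 = 0.77^2 \approx 0.59, \]

so on the PISA scales almost 60 % of the variance in mathematical literacy is statistically shared with reading literacy, against only about 12 % to 36 % (0.35² to 0.60²) between mathematical domains in the EVAPM settings.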

Numbers, quantity, etc. also appear in the PISA reading questions, in the science questions and in the problem-solving questions. It is not always obvious whether a PISA question should be allocated to one branch of the study rather than to another. In particular, some problem-solving questions could be analysed together with the set of mathematics questions.

Let us now examine some typical questions.

The Apples Example

This question is typical of realistic mathematics (and authentic assessment), which the OECD seeks both to assess and to promote. In this context, a good question must open up the process described in the framework:

a) Starting with a reality-based problem.
b) Organising it according to mathematical concepts.
c) Gradually trimming away the reality through processes such as making assumptions about which features of the problem are important, then generalising and formalising, transforming the problem into a mathematical problem that closely represents the situation.
d) Solving the mathematical problem.
e) Making sense of the mathematical solution in terms of the actual situation.

[The released “Apples” unit (three items) is displayed here with success rates for France, Finland and the OECD.]

For Item 1 the main point is to understand the situation and subsequently to be able to extrapolate a pattern. This may be completed by merely counting the first four lines in the chart. In the fifth case the student can either extend the drawing and then count, or identify a number pattern in the completed chart.


The 10 % difference between French and Finnish students illustrates the French students’ relative lack of confidence or initiative. They do not have a mathematical procedure on hand to treat the question, and this lack hinders a certain percentage of them from solving the problem.

Conversely, French students who overcome this initial difficulty perform much better on the second item than their Finnish counterparts (26 % to 21 % for the entire population, but 62 % to 38 % for those who successfully completed Item 1). This also seems to be rather general.

For this item, the mathematical process is quite obvious and leads to an equation to be solved: n² = 8n.

French students are used to solving this type of equation (though often in a formal, non-realistic context). We may even suppose that many of them have used a correct mathematical method: by this I mean factorising the equation into n(n − 8) = 0 and finding the two values 0 and 8, then and only then (point e above) eliminating the value 0 and retaining the value 8.

However, some students (in France as well as in Finland) may have gotten the correct answer just by making the invalid simplification n² = 8n ⇒ n = 8, that is, cancelling n · n = 8 · n into n = 8 without excluding n = 0.

Another procedural possibility consists of extending the chart until n = 8.

These procedures (of which at least one is mathematically incorrect) and others have been considered correct (full or partial credit!). This raises an epistemological issue: what kind of mathematics is at stake? What is valued?

Let us be clear on this point: it is not our purpose to deny the interest of the question, nor its relevance in a mathematical test, nor even the legitimacy of building scales which may be of some usefulness to policymakers. What is raised here is the need for complementary qualitative studies, which could analyse students’ procedures more deeply from a mathematical point of view.

Item 3 requires comparing two rates of variation. In this instance it may lead to comparing the growth of the derivatives of functions f such that f(n) = n² and g such that g(n) = 8n, and, finally, to comparing the second derivatives.
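Spelled out (my own formalisation of the route just sketched; in the released unit n² counts the apple trees and 8n the conifers, and the item expects only an informal argument):

\[ f(n) = n^2,\quad g(n) = 8n:\qquad f'(n) = 2n > 8 = g'(n)\ \text{for } n > 4,\qquad f''(n) = 2 > 0 = g''(n), \]

so from n = 4 onwards the first quantity grows faster than the second.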

Here again students are not supposed to know derivatives; they should just have a sound and personal approach to the question. Several procedures are possible that have different mathematical values, but they are all credited in the same way.

Note that the question is by no means trivial, and it is not too surprising that so few students across the world are able to cope with it.


The apples question has been used in an EVAPM study at the grade 10 level. The results of this setting also appear in the rectangles.

The 6 % success rate (France) and the 4 % success rate (Finland) concern only a correct mathematical procedure. Those rates have to be compared with the 11 % obtained in Japan and also with the 11 % obtained by EVAPM in France at grade 10.

For all countries but one, the Item 3 success rate ranges from 2 % to 12 %. The only exception (Korea, at 24 %) deserves further examination.

There is also an interesting point coming out of international studies (TIMSS shows it similarly): real mathematical difficulties, meaning difficulties linked to the concepts and not only to the presentation or the wording, seem to be experienced in the same way all over the world.

The Bookshelves Example

The following question is a typical case of a question not fitting the current French mathematical curriculum; more precisely, it would be considered more appropriate at the primary school level.
being more appropriate at the primary school level.


[The released “Bookshelves” item is displayed here.]

At the same time everyone in France (and especially French mathematics teachers) would expect 15-year-old students to be able to solve this problem. The success rate is more than 10 percentage points higher in Finland than in France, which illustrates what has been said about realistic questions.

But is it a mathematical question? Or should any question using numbers be considered a mathematical question? In some countries (especially in France) this question would more likely be asked in the technology subject area.

A mathematical solution could be:

N = Min(⌊26/4⌋; ⌊33/6⌋; ⌊200/12⌋; ⌊20/2⌋; ⌊510/14⌋)

where N is the maximum number of sets of bookshelves the carpenter can make, and where ⌊x⌋ stands for the integer part of x.
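As a quick check of the formula (a sketch of mine; the stock and per-set quantities are the ones appearing in the ratios above, with component names as in the released item):

```python
# Each ratio is available stock // amount needed for one set of bookshelves.
stock_and_needs = [(26, 4),    # long panels
                   (33, 6),    # short panels
                   (200, 12),  # small clips
                   (20, 2),    # large clips
                   (510, 14)]  # screws
n_sets = min(have // need for have, need in stock_and_needs)
print(n_sets)  # -> 5; the short panels (33 // 6 = 5) are the limiting stock
```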

Once again, students are not expected to write this complex formula. In fact, they proceed by a trial-and-error method. Meanwhile, if they had to prove their result, they would be forced to write down in everyday language the content as well as the meaning of this formula, which would be even more difficult than writing the symbolic formula.

Fortunately for PISA, no student thinks of using such a formula (neither would we, other than for this paper!), so the international results are quite high, ranging mostly from 50 % to 70 %.

But is it still mathematics? Can these kinds of realistic questions be a good preparation for more abstract mathematics? As many educational systems tend to ask teachers to stress realistic mathematics, the question is surely worth raising.

This question, along with many others, points out the weak stress given by PISA to proof (and what is mathematics without proof?). Even explaining and justifying are not much valued by the PISA marking scheme. This marks a great difference from the customary French conception of mathematical achievement.

The idea of proof is not the only main feature of mathematics that is largely absent from the PISA questions. There is a lack of any symbolism, and what is sometimes labelled as algebra (especially in national reports) is usually no more than the use of letters as substitutes for numbers, without any perspective of using them in direct computations. The PISA design insists on real life, on the concrete aspects of mathematics. So it consciously misses several fundamental aspects of the mathematical world.


Toward didactical analysis

The preceding remarks lead directly to a central didactical question: which sequence of teaching situations can help students gain proficiency both in mathematical literacy (partly common-sense knowledge) and in abstract and symbolic mathematics?

Some people would assume that the question is not relevant and that there is a continuum from common-sense knowledge to theoretical knowledge. On the contrary, we think that the whole body of work of the so-called “French didactics school” has led us to think that ruptures are necessary and constitutive of learning. So we may fear that putting too much stress on real life and concrete situations may in return have some negative effects.

Here is an example.

The Coloured Candies Question

This question belongs to the uncertainty field, and a “probability” value is requested. Probability is not part of the curriculum followed by 98 % of the French students at age 15; meanwhile, they perform at the same level as other OECD students.

We obtained the same kind of result with TIMSS at age 13. While probability was not in the curriculum, French students performed better than students in countries where probability was considered part of the curriculum. Other observations (EVAPM) show that when students have just been introduced to probability concepts (at least at the outset), they find this kind of question more difficult to answer than when they have not been taught the subject.
than when they have not been taught the subject.


[The released “Coloured candies” item, with its diagram, is displayed here.]

Once again, we can talk of common knowledge: understanding the diagram, counting the total number of candies (30), noting that 6 of them are red, and finally interpreting the 6 chances out of 30 as a probability value.
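Spelled out, the expected reasoning is nothing more than restating the count just described as a ratio:

\[ P(\text{red}) = \frac{6}{30} = \frac{1}{5} = 20\,\%. \]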

These are common language and preconceptions about a mathematical concept. Stressing this kind of task, particularly in an MCQ format, and allowing students (and many others) to think they have acquired some knowledge of probability may surely lead to serious misunderstandings.

Many other questions deserve this kind of examination.

A good example is given by a question that we are not allowed to display here (an unreleased question). The only point of this question is to identify an oblique line as being longer than a perpendicular one. Everybody feels this and can use this fact, even if they do not formally know it, and especially if they have not been taught it. Even dogs behave as if they know this fact.


The amusing point is that this question has been identified, at least in the French official report, as assessing the Pythagorean theorem! This is a widespread confusion between the fact that common sense may be mathematised and integrated into mathematical theories, and the assumption that students’ ability to make good use of this common sense proves something about their theoretical knowledge.

Some conclusions

PISA has gathered a huge amount of quality data across countries, which opens the way for further research. Aside from edumetric studies focusing on marks and scales, there is room for many interesting qualitative studies (more precisely, for studies articulating quantitative and qualitative approaches).

A large amount of resources has been put into the PISA studies, as well as a great variety of commitment and expertise, and it would be disappointing if students were not the primary beneficiaries of these contributions.

In this paper we have attempted to demonstrate that certain precautions should be taken when interpreting and using the PISA results, at least in mathematics. Nevertheless, on the whole, the PISA studies are worth being taken seriously. They can bring teachers new questions and new ideas that can help them move towards a way of teaching that fits the needs of our societies while preserving the values of which they are the conveyors. This balance is difficult to obtain; however, weak, flawed or biased interpretations of the general PISA implications and results will not help.

This paper is particularly aimed at attracting scholars’ attention and justifying the idea that some complementary studies should be undertaken by and within the mathematics education research community (and not, as is often the case, only processed and interpreted by officials strictly controlled by political bodies).

The PISA studies may help scholars in different countries distance themselves from their national or regional places of origin and acquire a more comprehensive understanding of the teaching and acquisition of mathematics for future citizens, consumers and – above all – for the advancement of mankind.

References

Adams, R.J. (2003): Response to “Cautions on OECD’s Recent Educational Survey (PISA)”. Oxford Review of Education, 29(3).

Anderson, L.W. (2001): A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s Taxonomy of Educational Objectives. Longman.

Bodin, A. & Capponi, B. (1996): Junior Secondary School Practices. In: A. Bishop & C. Laborde (eds), International Handbook of Mathematics Education, Chapter 15, Teaching and Learning Mathematics, pp. 565-613. Kluwer Academic Publishers, Dordrecht.

Bodin, A. (2003): Comment classer les questions de mathématiques? Communication au colloque international du Kangourou, Paris, 7 novembre 2003. Article à paraître.

Bodin, A., Straesser, R. & Villani, V. (2001): Niveaux de référence pour l’enseignement des mathématiques en Europe / Reference Levels in School Mathematics Education in Europe – International report.

Bodin, A. (1997): L’évaluation du savoir mathématique – Questions et méthodes. Recherches en Didactique des Mathématiques, Éditions La Pensée Sauvage, Grenoble.

Bottani, N. & Vrignaud, P. (2005): La France et les évaluations internationales. Haut Conseil de l’Évaluation de l’École.

Clarke, D. (2003): International Comparative Research in Mathematics Education: Of What, By Whom, for What, and How. Second International Handbook on Mathematics Education, Kluwer Academic Publishers.

Cytermann, J.R. & Demeuse, M. (2005): La lecture des indicateurs internationaux en France. Haut Conseil de l’Évaluation de l’École.

Demonty, I. & Fagnant, A. (2004): Évaluation de la culture mathématique des jeunes de 15 ans (PISA). Ministère de la Communauté Française, Bruxelles.

Dupé, C. & Olivier, Y. (2005): Ce que l’évaluation PISA 2003 peut nous apprendre. Bulletin de l’APMEP no. 460, octobre 2005.

French Ministry of Education (2007): L’évaluation internationale PISA 2003 . . . dossier no. 180 de la Direction de l’Évaluation, de la Prospective et de la Performance (DEPP).

Freudenthal, H. (1975): Pupils’ achievements internationally compared – The IEA. Educational Studies in Mathematics, 1975.

Gras, R. (1977): Contributions à l’étude expérimentale et à l’analyse de certaines acquisitions cognitives et de certains objectifs didactiques en mathématiques. Thèse, Université de Rennes.

Lemke, M., Sen, A., Pahlke, E., Partelow, L., Miller, D., Williams, T., Kastberg, D. & Jocelyn, L. (2004): International Outcomes of Learning in Mathematics Literacy and Problem Solving: PISA 2003 Results From the U.S. Perspective (NCES 2005-003). Washington, DC: U.S. Department of Education, National Center for Education Statistics.

Lie, S. et al. (2003): Northern Lights on PISA. Unity and Diversity in the Nordic Countries in PISA 2000. University of Oslo, Norway.

Meuret, D. (2003): Considérations sur la confiance que l’on peut faire à PISA 2000. Intervention au colloque international de l’Agence Nationale de Lutte Contre l’Illettrisme sur l’évaluation des bas niveaux de compétences, Lyon, 5 novembre 2003.

Meuret, D. (2003): Pourquoi les jeunes Français ont-ils à 15 ans des performances inférieures à celles des jeunes d’autres pays? Revue Française de Pédagogie, no. 142, 89-104.

Note DPD 04.12 (décembre): Les élèves de 15 ans – Premiers résultats de l’évaluation internationale PISA 2003.

OECD (2004): Problem Solving for Tomorrow’s World: First Measures of Cross-Curricular Competencies from PISA 2003.

OECD (2004): PISA 2003 Technical Report.

OECD (2004): First Results from PISA 2003. Executive Summary.

OECD (2004): Learning for Tomorrow’s World: First Results from PISA 2003.

OECD (2004): PISA 2003 Assessment Framework – Mathematics, Reading, Science and Problem Solving Knowledge and Skills.

Orivel, F. (2003): De l’intérêt des comparaisons internationales en éducation.

Robert, A. (2003): Tâches mathématiques et activités des élèves: une discussion sur le jeu des adaptations introduites au démarrage des exercices cherchés en classe de collège. Petit x, no. 62.

Varcher, P. (2002): Évaluation des systèmes éducatifs par des batteries d’indicateurs du type PISA: vers une régression des pratiques d’évaluation dans les classes.

Addresses and contacts

APMEP, with access to EVAPM documents as well as to a presentation displaying the released PISA questions with some results: http://www.apmep.asso.fr/spip.php?rubrique114 (presentations in English and in French).

Reference Levels in School Mathematics Education in Europe: http://www-irem.univ-fcomte.fr/Presentation_ref_levels.HTM and http://www.emis.de/projects/Ref/

IREM de Franche-Comté: http://www-irem.univ-fcomte.fr/

French official reports: http://www.educ-eval.education.fr/pisa2003.htm

International frameworks and reports: http://www.pisa.oecd.org/

Antoine Bodin’s personal website: http://web.mac.com/antoinebodin/iWeb/Site_Antoine_Bodin

Email address: bodin.antoine@nerim.fr


ANNEXES

Annexe 1: Taxonomy of cognitive demands for designing and analysing mathematical tasks – ordered by integrated level of complexity

Simplified version – see complete taxonomy on the Web (in French)

Main categories | Sub-categories
A Knowing and recognising . . . | A1 Facts; A2 Vocabulary; A3 Tools; A4 Procedures
B Understanding . . . | B1 Facts; B2 Vocabulary; B3 Tools; B4 Procedures; B5 Relations; B6 Situations
C Applying . . . | C1 in simple familiar contexts; C2 in moderately complex familiar contexts; C3 in complex familiar contexts
D Creating . . . | D1 as mobilising known mathematical tools and procedures in new situations; D2 new ideas; D3 personal tools or procedures
E Evaluating . . . | E1 as issuing judgements about external productions; E2 as assessing one’s own knowledge, process and results

Taxonomy designed by Antoine Bodin, with full acknowledgment to R. Gras’ seminal work as well as to L. W. Anderson’s later influence.


Annexe 2: Competency classes for designing and analysing mathematical tasks – ordered by integrated level of complexity

Simplified version – see expanded version in OECD documents on the Web

Level | OECD definition | In short
1 Reproduction | “The competencies in this cluster essentially involve reproduction of practised knowledge . . .” | Reproduction
2 Connection | “The connection cluster builds on the reproduction cluster competencies in taking problem solving to situations that are not simply routine, but still involve familiar or quasi-familiar settings.” | Simple mathematisation
3 Reflection | “The competencies in this cluster include an element of reflectiveness . . . about the processes needed or used to solve a problem. They relate to students’ abilities to plan solution strategies and implement them in problem settings that contain more elements and may be more ‘original’ (or unfamiliar) than those in the connection cluster . . .” | Complex mathematisation (to modelisation)


Annexe 3: PISA 2003 and 2000 – Analysed Question Set

Together with some other non-released questions taken into account for this paper, the whole analysis covers about 70 % of the PISA material (60/85).

PISA code | Item name | Mathematical content | Taxo | C | Remarks
M037Q01 | Farms 1 | Pyramid – square area | B6 | 1 | PISA 2000 only
M037Q02 | Farms 2 | Middle of the sides of a triangle | C1 | 2 | PISA 2000 only
M124Q01 | Walking 1 | Using letters and formula | C1 | 2 | & PISA 2000
M124Q02 | Walking 2 | Using letters and formula – Units . . . | B5 | 2 | & PISA 2000
M136Q01 | Apples 1 | Completing charts | B6 | 3 | & PISA 2000 & EVAPM
M136Q02 | Apples 2 | Equation | C1 | 2 | & PISA 2000 & EVAPM
M136Q03 | Apples 3 | Don’t fit | D1 | 3 | & PISA 2000 & EVAPM
M145Q01 | Cubes | Cube | B5 | 2 | & PISA 2000
M148Q02 | Continent area | Area | D1 | 3 | PISA 2000 only & EVAPM
M150Q01 | Growing up 1 | Reading graphs | B5 | 2 | & PISA 2000
M150Q02 | Growing up 2 | Reading graphs | B5 | 1 | & PISA 2000
M150Q03 | Growing up 3 | Reading graphs | B5 | 1 | & PISA 2000 – Gender bias?
M155Q02 | Number cube | Cube | B5 | 2 | & EVAPM
M159Q01 | Speed of a car 1 | Interpreting graph | B6 | 2 | PISA 2000 only
M159Q02 | Speed of a car 2 | Reading graph | A3 | 1 | PISA 2000 only
M159Q03 | Speed of a car 3 | Interpreting graph | B3 | 1 | PISA 2000 only
M159Q04 | Speed of a car 4 | Interpreting graph | D1 | 2 | PISA 2000 only
M161Q01 | Triangles | Constructing geometrical figures | B5 | 1 | PISA 2000 only
M179Q01 | Robberies | Bar charts | E1 | 3 | & TIMSS & PISA 2000 & EVAPM
M266Q01 | Carpenter | Perimeter of a rectangle | D1 | 2 | & PISA 2000 – Gender bias?
M402Q01 | Internet relay chat 1 | Don’t fit | D1 | 2 | Gender bias?
M402Q02 | Internet relay chat 2 | Don’t fit | D1 | 3 | Gender bias?
M413Q01 | Exchange rate 1 | Proportionality | C1 | 2 |
M413Q02 | Exchange rate 2 | Proportionality | A4 | 1 |
M413Q03 | Exchange rate 3 | Proportionality | C1 | 2 |
M438Q01 | Export 1 | Bar charts | A3 | 1 |
M438Q02 | Export 2 | Circle charts – Percentage | C1 | 1 |
M467Q01 | Coloured candies | Don’t fit | C1 | 1 | Probability
M468Q01 | Science test | Mean | C1 | 2 |
M484Q01 | Bookshelves | Don’t fit | D1 | 2 | & EVAPM – Gender bias?
M505Q01 | Litter | Bar charts | B6 | 2 | ? Huge diff FRA–FIN
M509Q01 | Earthquake | Don’t fit | B5 | 2 | Probability
M510Q01 | Choice | Don’t fit | D1 | 3 | Combinatorics – translation pb
M513Q01 | Test scores | Bar graph | | |
M520Q01 | Skateboard 1 | Don’t fit | C1 | 2 | EVAPM
M520Q02 | Skateboard 2 | Don’t fit | C1 | 2 |
M520Q03 | Skateboard 3 | Don’t fit | D1 | 3 |
M547Q01 | Staircase | Division | A4 | 1 |
M555Q02 | Number cubes | Cube | B5 | 2 |
M702Q01 | Support for president | Don’t fit | B6 | 2 |
M704Q01 | Best car 1 | Reading charts | C1 | 2 |
M704Q02 | Best car 2 | Reading charts | D1 | 3 |
M806Q01 | Step pattern | Don’t fit | A1 | 1 |

PISA 2003: 85 items, 31 released. PISA 2000: 32 items, 11 released.

Annexe 4: A typical mathematical examination at the final year of middle school

Each question is listed with its taxonomic level (Taxo), its competency class (Comp) and a remark.

Part I – Numerical activities
Numbers, Ex 1: 1) A4, 1, Formal and unrealistic – 2) A4, 1, id – 3) A4, 1, id – 4) A4, 1, id
Data, Ex 2: 1) B5, 1, Pseudo-realistic – 2) A4, 1, id – 3) C1, 1, id – 4) A2, 1, id
Numbers, Ex 3: 1)a A2, 1, Formal and unrealistic – 1)b A4, 1, id – 1)c A4, 1, id – 1)d A4, 1, id
Numbers – Arithmetic, Ex 4: 1) C1, 1, Formal and unrealistic – 2) C1, 1, id – 3) C1, 1, id

Part II – Geometrical activities
Space geometry, Ex 1: 1)a B1, 1, Unrealistic – 1)b B5, 1, id – 2)a A4, 1, id – 2)b B5, 1, id – 3) A4, 1, id – 4) A4, 1, id
Plane geometry – Proof – Thalès, Ex 2: 1) C1, 2, Formal and unrealistic – 2) C1, 2, id
Plane geometry – Proof – Pythagore, Ex 3: 1) A4, 1, Formal and unrealistic – 2) A4, 1, id – 3)i A4, 1, id – 3)ii A4, 1, id

Part III – Problem
Geometry – Pythagore – Trigonometry, Part I: 1) A4, 1, Pseudo-realistic dressing – 2)i A4, 1, id – 2)ii A4, 1, id – 3) A4, 1, id
Linear functions – inequations, Part II: 1)a i A3, 1, id – 1)a ii A3, 1, id – 1)b A3, 1, id – 2)a A2, 1, id – 2)b C1, 2, id – 3)a B5, 1, id – 3)b A4, 1, id – 3)c A4, 1, id
Scale, area – volume, Part III: 1) A2, 1, id – 2) A4, 1, id – 3) C1, 2, id – 4)i C1, 2, id – 4)ii C1, 2, id



Annexe 5: The examination in focus: Brevet 2005 – South of France

Wording and presentation count for 4 marks out of 40.
Handheld calculators allowed.
Test duration: 2 hours

Part I: Numerical activities



Part II: Geometrical activities


Part III: Problem




Annexe 6: Comparing PISA with the French curriculum



Annexe 7: Comparing a customary French examination with the French curriculum


Test Ability – What Is It?

Wolfram Meyerhöfer

Germany: Freie Universität Berlin

This article explores the problem of test ability (in German: Testfähigkeit) using the example of mathematics achievement tests. "Test ability" denotes those items of knowledge, abilities and skills that a test co-captures or co-measures, but that one would not subsume under the notion of "mathematical proficiency". The article first explores why the topic of test ability is of considerable importance in connection with tests. Drawing on items from TIMSS and PISA, it then uses subject-didactical and objective-hermeneutic item interpretations to work out which empirical phenomena constitute the problem of test ability. It turns out that test ability runs counter to the idea of mathematical education.

1 Test Ability in Mathematics Education

The concept of test ability¹ has so far been treated only marginally in German-speaking mathematics education, and has never been seriously discussed.

1 I will not offer definitions of terms in this paper: the method of objective hermeneutics that I use follows Wittgenstein's insight that the meaning of a text discloses itself solely from how it is used. I follow this insight with regard to the technical terms employed as well. I do not want to narrow the terms I use – test ability, education (Bildung), standardized achievement test, etc. – in advance by casting them into a definition. Nor, fundamentally, do I want to broaden them. I want to deepen them. Every member of the language community – all the more so of the specialist language community – has immediate access to these terms in their full breadth and variety (if need be via dictionaries and encyclopedias). My concern is to add to this "familiar ground" some of what has so far remained undisclosed about test ability. For this process of gaining insight it makes sense to carry along "everything the concept drags around with it". Methodologically this is connected to the fact that with objective hermeneutics one in any case reconstructs empirically anew on the one hand, and on the other processes all this "baggage" in the act of storytelling.



This may have to do, for one thing, with the concept appearing self-explanatory and therefore uninteresting: "test ability" denotes those items of knowledge, abilities and skills that are co-captured (in non-standardized achievement tests) or co-measured (in standardized achievement tests) by a test, but that one would not subsume under the notion of "mathematical proficiency". Especially where dimensions are involved that appear only because a test is being taken, the designation "test ability" or "test abilities" seems apt.

A second reason for the hitherto rather limited interest may be that only the high claims with which standardized achievement tests have been pushing into all spheres of social practice in recent years have brought the peculiar internal logic of these instruments into view. "Conventional" school achievement tests (class tests, written examinations, etc.) of course co-test test abilities as well. The fuzziness of these instruments, however, is in principle undisputed. That is why a student can argue with the teacher about the scoring of a class test, why parents can appeal against an examination grade, why no employer will hire an apprentice on the basis of grades alone, and why hardly anyone disputes that procedures for allocating university places on the basis of Abitur certificates are substantively problematic. Standardized achievement tests claim to avoid such fuzziness. They are therefore subject to a stricter demand – compared with school achievement tests – not to co-measure test ability. Their increased relevance as an instrument of power and of allocating future life chances draws attention to the fact that this claim is missed.

A third reason I want to formulate only as an impression: mathematics education – or rather, mathematics educators – have not yet habitually completed the transformation from teacher to scientist. In its sharpness this impression is visibly false, but formulating it is fruitful for reading the empirical reconstructions that follow. One recognizes this whenever one is inclined to say: "But it is like that in school too, so where is the problem?" One cannot then reject the reconstruction of the problem – as one experiences again and again precisely in discussions about tests – rather, one has then intuitively rediscovered in the test items something one knows from school.




To reject an interpretation and the reconstruction of a problem carried out in it, one must instead criticize the interpretation itself. One cannot simply assume that the interpretation is wrong because one does not like the result. One has thus intuitively gained insight about school where one wanted to gain insight about tests. The transition from teacher to scientist consists precisely in wanting insight, and therefore admitting it, instead of labelling what one finds as normal and justifying it before one has admitted any insight at all. Put more abstractly: being a scientist means distancing oneself far enough from the practice under investigation so as not to fall prey oneself to the interpretive patterns of that practice. Doing science means precisely not reproducing these interpretive patterns, but understanding them, exposing their implicit assumptions, contradictions, misreadings, distortions and so on – in short: reconstructing their interpretive patterns.

Part of this, where test ability is concerned, is the multitude of justifications claiming that test ability "belongs", that these abilities can be traced back to educational goals or have some other intrinsic value. In the empirical reconstructions this claim proves to be superficial, often false and mostly cynical.

A fourth reason for the slight interest of German-speaking mathematics education in test abilities may be a certain fear of consistently tracing and thinking through the relation of theory and practice: the components of test ability reconstructed below are also found in school achievement tests. There too they ought to be avoided. School achievement tests are, to be sure, not standardized achievement tests. They are thus not subject to the claim of being scientific instruments and hence of naming precisely what they measure – and of avoiding the co-measurement of test ability. But the cynicisms and the damage done to mathematical education by those item properties that are disclosed below under the focus of test ability point to general problems of mathematics teaching.

If, as a mathematics educator, I want to point beyond the mere reconstruction of the problem, I must therefore show what the teacher can do differently in constructing school tests, without thereby becoming a scientific test constructor. It is only too understandable that mathematics education has so far rather not wanted to pursue this difficult problem.



I too circumnavigate this problem for the time being by restricting myself to standardized achievement tests. From the reconstructions that follow, however, there already emerges the hypothesis that reconstructing and working on habitus patterns in the professional development of teachers also offers an approach to the problem of test ability.

A fifth reason for mathematics education's abstinence from the topic of test ability so far may have been methodological problems: one can analyze the co-measurement of test ability without methods, but it demands a good feel for the latent and considerable distance from one's own product. In addition, without methodological backing one faces considerable problems of legitimation, especially when tests are produced in culture-industrial contexts (cf. Meyerhöfer 2006).

With my doctoral thesis (Meyerhöfer 2004 a, 2005) I introduced the method of objective hermeneutics into mathematics education. It makes it possible to reconstruct latent text elements in a methodically controlled way, and it forces one to interpret the text systematically rather than reading one's own intentions, or the intentions of the test constructor, into the text. The method proves to be a fruitful instrument for reconstructing test abilities.

In the English-speaking world, with its long tradition of attempts to survey the human mind, a debate about test ability has naturally been under way for some time. In a positivist tradition of thought that takes measurement as the yardstick of cognition, the phenomena accompanying measurement are themselves subjected to measurement. Hembree (1987), for example, subjected 120 research studies involving mathematics achievement tests to a meta-analysis in order to examine the influence of "noncontent variables" on test performance. If one adheres to the measurement paradigm and wants to construct a mathematics achievement test, one will certainly find the influence of non-content "variables" satisfactorily and exhaustively explored here; most of the debates that are relatively new in the German-speaking world – about test formats, writing styles, item arrangements and so on – also receive a quantitative analysis here, if not in as subtle a manner as in Wuttke's (2007) analysis of PISA.

Among other things, Hembree examines a "variable" that is often conceptually equated with test ability, namely testwiseness: "Testwiseness refers to a testee's ability to use the features and formats of the test and test situation to make a higher score, independent of knowledge of content (Millman, Bishop & Ebel, 1965). The comparison of scores by a group trained in test taking and an untrained group is the effect related to testwiseness." (Hembree 1987, p. 201)



In his meta-analysis, Hembree finds that training testwiseness raises test performance.

The concept of testwiseness seems to me to open up too little to really grasp the problem of test ability. I am not concerned only with the "features and formats of the test and the test situation", which remain external in principle and then also in the quantitative analysis. This is because these concepts are meant categorially. I mean more when I speak of those items of knowledge, abilities and skills that a test co-captures or co-measures, but that one would not subsume under the notion of "mathematical proficiency"; and it will become apparent that what is reconstructed below cannot be categorized. Above all, I also speak of how the co-measurement of test abilities interweaves the measurement of mathematical abilities and damages mathematical education.

The analysis of the items presented below points to the place that the debate about test ability systematically neglects, but where the problem is first generated: test ability is only secondarily an ability of the individual. Test ability is first of all something contained in the item, for the item is the primary site where the validity of the test statement is produced: only once I have understood what the item measures does it make any sense at all to turn one's gaze to the individual being measured. That is why the question of what test ability is will here be dissected out of the items.

2 The Significance of Test Ability within the Debate about Tests

2.1 The Scientific Claim of Tests, and Test Ability

The concept of test ability used here refers only to standardized (mathematics) achievement tests. These tests claim to cure the relativity of performance assessment in school, that is, to provide a less relative or non-relative measure of (mathematical) proficiency. The reduction of subjectivity is obvious, since (i) the multiple-choice format reduces the subjectivity of interpreting the student's "answer" essentially to zero (the scanner makes no subjective decision about whether to accept a blob of ink as "ticked", and the programmer who sets the threshold value thereby makes no subjective decision about the performance to be assessed either), since (ii) in the rating of semi-open or open student answers far fewer subjects – and thus subjectivities – are involved than when every teacher marks for himself, and since (iii) the training of raters hopefully really does reduce the subjectivity of the ratings.



However, reduced subjectivity does not yet amount to reduced relativity of performance assessment in comparison with school. It must be examined whether the reduced subjectivity also leads to a more precise, more comparable measurement of performance, one that is more truthful with respect to the performance demands (perhaps "more valid"). That this is scarcely possible in principle, and is not the case for TIMSS and PISA in particular, I have discussed at length in Meyerhöfer (2004 a) and (2005).

Only from the claim of being able to cure the relativity of performance assessment does the necessity arise at all of discussing the co-measurement of test ability: if abilities are co-measured that are not the mathematical abilities to be measured, then these abilities must be named, and they must be examined as to whether they are desired. The claim of an instrument to be scientific points precisely to the obligation to make explicit what the instrument captures. I refrain here from discussing banalities, e.g. the co-measurement of verbal abilities or of the ability to start working at all.

One could now also commit oneself to the position that it is desirable to co-measure test abilities – e.g. the ability, in a mathematics item, to shovel aside a substantively meaningless accumulation of text mass in order to get to the mathematical problem, or the ability to evade the mathematical demand through sheer impudence and still obtain the point. In the following considerations, however, I assume that TIMSS and PISA are meant to measure exclusively mathematical abilities.

The additionally co-measured abilities can – once one has recognized them – be attributed to the measurement construct. But one then becomes entangled in problems of fairness and of the purpose of testing:

i) It is problematic in itself that test ability appears as mathematical ability.

ii) The more non-mathematical abilities one is prepared to co-measure, the more broadly and precisely it must be discussed what one is measuring and why one wants to measure it.

iii) One is, moreover, in danger of losing oneself in the arbitrariness of what is to be measured, and of no longer deriving what is to be measured from a desired performance construct, but of measuring everything "that the items happen to co-measure".



That was the procedure at TIMSS and PISA². One is thereby not far removed from the measurement fuzziness of a conventional class test, and thus loses the essential rationale for standardized achievement tests.³ One even worsens the student's position, for he can cure the imponderables in the wording of class-test items through the implicit or explicit knowledge about the teacher acquired in class – if need be, he can even ask.⁴ The imponderables produced by test ability in standardized tests cannot be dealt with in this way.

One should distinguish the concept of test ability from the ability to do well on a test in the sense of a class test (the latter ability is an explicit component of school performance): this ability, too, is partly a matter of successfully inferring (and that sometimes means: guessing or divining) what the teacher means by his question, and at what depth or on what level the task is to be fulfilled.

2 cf. the account of the test construction in Meyerhöfer (2004 a, pp. 98 f. and 139-157) or in Meyerhöfer (2005, chapter 4)

3 These considerations show that the deficit of school performance assessment lies only apparently in a lack of standardization. A test that co-measures "all sorts of things" can nevertheless be highly standardized. But it has almost the same measurement fuzziness as a class test. Here an error reproduces itself that we also encounter frequently in the research process, namely the belief that high standardization leads to more precise or "better", broader, deeper or at least more generally valid insights. Standardization, however, initially only ensures that all members of a population are subjected to the same conditions with respect to certain aspects. This does mean that certain framework conditions (or, for the test constructs: certain dimensions of a multidimensional causal construct) are constructed identically for all members. But it by no means follows that the production of validity thereby becomes more precise, better, broader, deeper, less ambiguous or at least more generally valid. Standardization rather passes the problem of producing validity by – although certain elements of standardization can of course support it.
An eloquent example of this problem is the PISA test: one can examine 180,000 students in a highly standardized way. If it remains unclear what is actually being measured, the test statement remains limited. Even the high voyeuristic value of a country ranking results not from high standardization, but merely from the large number of participants.

4 Nikola Leufer (U Dortmund) has drawn my attention to the fact that, conversely, a teacher who knows his student well can take the student's "test inability" with respect to a class test into account when marking, i.e. can cure it, as it were, by "good-natured marking". In general one might assume: test abilities are co-captured in school as well – but do not have such strong consequences there. Seen differently: it is an integral part of professional competence (and hence not technizable) to keep these consequences small.



In the classroom, however, imparting this ability is an explicit component of the instructional process: teaching is per se a non-standardized affair. It is thus exposed to all the advantages and disadvantages of non-standardization. This also leaves its mark on classroom tests. The resulting relativity of grading admits two polarized conclusions: on the one hand, one can strive for greater standardization of performance assessment. On the other hand, this relativity can be an occasion to meet grades with a certain equanimity, that is, among other things, to relativize their role in the allocation of future life chances as well. In any case it should be an occasion to face the multilayered causes of school success – for grades are the established and probably the best quantitative measure of school success⁵. Performance is only one cause of school success, and school success feeds back on performance in manifold ways. If one wants to give the performance principle greater force in school (and that is an implication of the trend towards testing), then the coupling of school success to performance must be secured. If, on the other hand, standardized tests co-measure abilities other than those to be performed, this in turn means a departure from the performance principle – only that now other non-performance criteria enter than in the classroom.

5 This claim would require a deeper argument that cannot be provided here. The direction of the argument would be roughly the following: if one wants to construct a measure of school success, one must define school success and transform it into a measurement construct. The attempt would be fraught with measurement fuzziness and other construction problems. Even addressing "school success" would lead to insurmountable difficulties: different social groups make different demands on "school success", the variety of school tasks would have to be brought into a weighted form, and so on. The school grade is the attempt to carry out such an overall "measurement". The "measurement construct" has emerged in a long process in which interests inside and outside school have flowed into the construct. It is hardly possible to survey which implicit and explicit elements converge here. It is, however, a construct of astonishingly high social acceptance: although the problems of the "measurement fuzziness" of grades are sufficiently well known, grades are still the primary instruments for allocating future chances in post-school fields.
In this connection it is remarkable that no study exists on the relation between grades and test performance in PISA, although the grades were collected. One can easily imagine that tests would quickly be regarded as superfluous if it turned out that the ordinal ordering is preserved, and that it would primarily be the tests that would be problematized if the ordinal ordering is not preserved.


2.2 Test Ability and Educational Goals


Tests set standards. They do so all the more, the more relevant they are for the allocation of future life chances. But they also do so by presenting themselves as scientific instruments. These standards strike through right into the classroom. It is therefore problematic when items appear in tests that one can solve without really having to possess the ability that is supposed to be tested. Conversely, it is likewise problematic when one cannot solve an item (correctly in the testers' sense) although one possesses the ability or abilities.

For the teacher it is difficult to recognize and remedy elements of test ability in the items of tests, educational standards and so on if he wants to draw exclusively on the mathematical abilities: the teacher works under pressure to act and understandably relies on standardized mathematics achievement tests really testing mathematical performance. Scientists thus bear a certain responsibility for their instrument.

In the debate about test abilities one encounters the argument that some test abilities are perfectly suitable as educational goals, or correspond to them. This argument will be discussed below with reference to the components of test ability reconstructed in the interpretations. It turns out, in essence, to be poorly founded and cynical.

2.3 Equality of Opportunity and Test Ability

Test fairness is violated when parts of the population to be measured could not acquire parts of the measured abilities, or could acquire them only to a lesser degree than other parts of the population. This can be the case, for example, when content is tested that was not taught at all in one of the school types being measured. It can also be the case when a real-world context is used that is completely unknown to one group but familiar to another. Ideal test fairness cannot exist, and one must face this in the interpretation of test data.

With respect to test ability, test fairness is violated when tests co-measure test abilities while at the same time parts of the population to be measured had more opportunity than other parts of this population to acquire these test abilities. Thus there was a brief but intense debate about test ability when the first TIMSS results were published in Germany in 1997. It was pointed out in particular that the US and Asian "national teams" had been far better prepared for the test, because in those countries a pronounced "culture" of testing for the allocation of future chances prevails.



It was thus assumed that the Asian and US parts of the measured population had had more opportunity than the German participants to acquire test abilities. Two polarized conclusions were drawn from this: on the one hand, the conclusion to take the lacking validity into account when interpreting the results, and perhaps even to forgo such tests. On the other hand, the conclusion to drill the German participants just as intensively in test abilities.

3 The Success of Test Training

Meanwhile, the practice of the German school system is tending towards increased testing of German students as well. The problem of test ability, however, is now scarcely discussed. In the earlier debate, the TIMSS group still felt compelled to claim that such tests cannot be trained for⁶ (Baumert et al. 2000, p. 108, in reply to Hagemeister 1999). Put pointedly, this would mean that test ability in the sense meant here does not exist, for test training is not about training the mathematical abilities, but about those abilities that, alongside the mathematical ones, ensure test success. Put somewhat plainly: test training (as an ideal type) does not ask the question: which mathematical abilities do we still have to elaborate? It asks the questions: how do testers tick? How does the test tick? How must you tick in order to get through as well as possible? As a counter-type one can construct practising, which asks the question: which mathematical abilities do we still have to elaborate? The construction as ideal types makes clear that some practising also contains elements of test training, and that some test training also contains elements of practising.

Since I established in my dissertation that, and how, TIMSS and PISA co-measure test abilities, I examined there in more detail how the TIMSS group arrives at the result that tests cannot be trained for, and that these test abilities therefore do not exist or can be neglected.⁷

6 The claim is made without taking account of Hembree's (1987) meta-analysis, but with a remark that kills the topic in the text: "In the USA there is a broad research literature on the limited effects of test coaching." (Baumert et al. 2000, p. 108)




It turns out that Baumert et al. (2000) present research results in a distorted way. They invoke two studies by Klieme and Maichle (1989, 1990). Klieme and Maichle developed and carried out a training for parts of the medical admissions tests. Essentially, they wanted to find out whether paid preparation courses for these tests can violate the candidates' equality of opportunity. In a certain way the question thus concerns the same debate about equality of opportunity with respect to tests as today.

Klieme and Maichle carried out a test training of six (!) hours with 21 persons. They achieved improvements in the trained components – a positive training effect thus occurred. They achieved no improvements on the actual test, but then no training for it had taken place, because for reasons of time they had trained only individual components. They then discuss the result of their training in quite a multilayered way, but conclude astonishingly: "The results of these specific . . . support measures, too, ultimately confirm the statement . . . that complex problem-solving performances in the sense of the subtests . . . cannot be trained, or only to a relatively small extent." (Klieme/Maichle 1990, p. 307) This conclusion stands in obvious contradiction to the results of the study. Perhaps it can be explained by the researchers' institutional ties to the test institute: the study was meant to find out whether test fairness can be violated by paid preparation courses. Had the answer been a "yes", or had even only the weaker "yes" of this study come out, the test would have had to be massively changed or abolished.

The TIMSS group takes up the falsified "result", although it does not even refer to the discussion about long-term effects of mass testing. The study by Klieme and Maichle is obviously put forward merely as a pretext to argue away undesired side effects of testing.

The discussion about test ability, however, concerns long-term effects on child and adolescent students whose future chances are directly or indirectly tied to test results. One must engage with these side effects, which can become main effects in the learning of mathematics, in order to be able to assess their character and their influence.

7 Meyerhöfer (2004 a, pp. 219-221; 2005, pp. 190-192)




4 Test Ability and Autonomy

This paper does not consider the training process; rather, it examines which item properties ensure that test abilities are measured alongside mathematical abilities. The aim of this examination is to help those involved to greater autonomy vis-à-vis the problem.

Training test ability seems to be one way of doing this, because test ability strengthens the student's autonomy vis-à-vis the testing process. But it also reproduces the destruction of autonomy, in that it trains the student towards abilities that lie outside mathematical abilities. The autonomy-destroying basic structure of tests cannot be circumvented⁸. It can only be broken by distanced reflection. Moreover, every test training requires a time budget that could be used for more meaningful learning content.

The expansion of autonomy can likewise take place on the side of the teacher or of the educational-policy sphere. With knowledge of components of test ability, one can decide more consciously whether one is prepared to use achievement tests that co-measure test ability for the allocation of future life chances.

But the expansion of autonomy can also take place on the side of the test developers. With knowledge of components of test ability, it can also be decided more consciously here what one wants to (co-)measure at all.

5 Test Ability – Empirical Explorations

5.1 A First Approximation

Let us first call to mind the basic structure of testing. Tests are constructed in order to capture properties of measurement objects in a measurement process. One thus first has a notion of what, in our case, "mathematical proficiency" is supposed to be. One then operationalizes this notion, that is, one creates items which, in their interplay, measure the extent to which this ability is present. The construct thus created, the "operationalized mathematical proficiency", should of course be as identical as possible with what one imagined mathematical proficiency to be before the operationalization.

8 cf. Meyerhöfer (2004 a, pp. 81-83; 2005, pp. 24-27)



The operationalized measurement construct – materialized in the form of a test booklet – meets the measurement object, that is, the student. For the student it does not matter what mathematical proficiency is, whether it has been operationalized correctly, whether the tested abilities are relevant, and so on. For the student only one thing is important: he must serve the tester's expectation. He must put his cross in the right place, he must write down the right number, he must note an answer to which the coder doing the scoring can assign a performance point. This fixes the direction for a concept of test ability: test ability is the ability to optimize one's own score within the test construct. That means in particular, first, that one is able to convert a genuinely present mathematical ability into a test point, and second, that one is able to obtain a test point even when one does not possess the mathematical ability. This shows, to begin with, that test ability becomes the more important for the individual, the more significant the test becomes for the allocation of future life chances.

But test ability also includes the ability to refuse the test in a sensible way. If, as in PISA, it is the school and not the individual that is being measured, then in the school's interest a "weak" student should refuse the test just as much as a student who is not reaching his performance optimum on that day. The school must be grateful for such a refusal and support it.⁹

9 In a review of this paper, Wolfgang Schulz (HU Berlin) suggests also treating the willingness to submit to the test, and the endeavour to complete the test as successfully as possible, as test ability. I find the suggestion fruitful, but what we have here seems to me rather to be what sociologists, following Durkheim, call the "pre-contractual foundations of the (social) contract". These pre-contractual foundations, however, are a topic of their own. I presuppose here that those tested want to do as well as possible. If the school is measured as a whole, this then implies that students and teachers understand doing well as a joint project. That my suggestion that both groups agree on the deliberate absence of test-weak students is felt to be absurd merely indicates a state in which teachers and students do not (yet?) understand themselves as a community against something external. This phenomenon, however, can only be discussed more comprehensively within the framework of a theory of the school.



5.2 Known Components of Test Ability

I want only briefly to mention general test-taking strategies that have already been presented extensively elsewhere, and to whose further description I have nothing to contribute here. These are time-management strategies, error-avoidance strategies, guessing strategies, strategies for exploiting hidden solution cues, and formal strategies for deductively inferring the supposedly correct answer:

<strong>–</strong> „Zeiteinteilungsstrategien (z.B.: das Überspringen von schwierigen Aufgaben,<br />

das Markieren von ungelösten Aufgaben oder solchen Items, bei denen<br />

man sich seiner Lösung nicht ganz sicher ist, das Markieren von Teillösungen,<br />

das Anlegen eines Arbeitspro<strong>to</strong>kolls, aus dem man ersehen kann, wie<br />

schnell man vorankommt, usw.),<br />

<strong>–</strong> Fehlervermeidungsstrategien (sorgfältiges Lesen der Instruktion, Beachten<br />

der Aufgabenstellung, Überprüfen der Antwort usw.),<br />

<strong>–</strong> Ratestrategien [ 10 ],<br />

<strong>–</strong> Strategien zur Ausnutzung versteckter Lösungshinweise (das Beachten aller<br />

Merkmale, hinsichtlich derer sich die Antworten von den Distrak<strong>to</strong>ren<br />

unterscheiden könnten <strong>–</strong> z.B. der Länge, der Position, des Stils der betreffenden<br />

Aussagen usw.)<br />

<strong>–</strong> dieBeachtung sogenannter „specific determiners“ (gemeint sind Worte wie<br />

„immer“, „niemals“, „alle“ usw., die nach Meinung der Veranstalter [von<br />

Testtrainings, W. M.] speziell die Distrak<strong>to</strong>ren, also die Falschantworten,<br />

kennzeichnen),<br />

<strong>–</strong> formale Strategien zum deduktiven Erschließen der vermeintlich richtigen<br />

Antwort (z.B. auf der Basis inhaltlicher oder formaler Abhängigkeiten<br />

zwischen den einzelnen Antwortmöglichkeiten).“ (Klieme, Maichle 1989,<br />

S. 207)<br />

Likewise, I want only to mention the following aspects, which can be inferred even without closer item interpretation: if you know nothing, you have to guess. If you have to work through multiple-choice offerings with only one correct answer, it is better to break off processing once you reach the probably correct answer and to mark the item. Only if there is time left later should you return and carry out an error check. The same holds for other uncertainties.
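To make the rationale of the guessing strategy explicit – a minimal illustration, assuming one point per correct answer and no penalty for wrong answers: on a multiple-choice item offering $k$ answer options, exactly one of which is correct, a blind guess yields an expected score of
\[ E = \frac{1}{k}, \]
so a test-taker who guesses blindly on ten five-option items can expect $10 \cdot \tfrac{1}{5} = 2$ points.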

10 For more on guessing, cf. Meyerhöfer (2004 c).



When one has to work one's way through a mass of superfluous information (e.g. the PISA item "Bauernhöfe" ["Farmhouses"] – see below), or through an accumulation of variations of the ever-same word group ("Bauernhöfe" and TIMSS item A5 – see below), or when one has to work off a collection of statements (e.g. the PISA item "Dreiecke" ["Triangles"]¹¹), then one can speak of an ability to concentrate or to persevere. One often needs this ability in life, to be sure, but it would certainly be desirable if one had to persevere because of the inner constitution and seriousness of a problem, and not because an item is badly posed, or because it measures perseverance instead of the ability actually to be tested.

5.3 Test Ability in Test Items

I now want to present components of test ability that were reconstructed in the item interpretations of PISA and TIMSS. I interpreted the items objective-hermeneutically; here only elements of the interpretations are sketched and results of the interpretations presented. The interpretations were initially guided by the question of what the items measure. This revealed not only considerable measurement problems, which led to the conclusion that both tests are unsuitable as instruments for measuring mathematical proficiency. Habitual problems also appeared¹².

The co-measurement of test ability proves to be a problem to which measurement problems as well as habitual problems adhere. The components of test ability shown here are interwoven with one another in manifold ways – in their appearance, in their character, in their background and in the causes of their occurrence. It would not do justice to the richness of the empirical reconstruction to attempt to give the components summary names, let alone to categorize them. Nor do I want to succumb to the temptation to translate the components into – then necessarily catchy – instructions to students, e.g.: Don't take the real problem seriously! Think towards the middle! This too would not do justice to the complexity of the subject matter, which is to be unfolded here, but not yet reduced.

11 cf. Deutsches PISA-Konsortium (2001, p. 178); discussion in Meyerhöfer (2004 b)

12 A manifest orientation towards technical language combined with simultaneous damage to the mathematical; distortions of the mathematical and of the real in reality-based items (failure of the intended "mediation of the real and the mathematical"); orientation towards calculation routines instead of mathematical education; the illusion of closeness to students as delusion. I have summarized this as a "turning away from the subject matter".



The headings are accordingly keywords for the phenomena worked out in each case, not names for components conceived as sharply separated. Since the components were reconstructed as item properties, the headings refer to such properties. Only at the end do I bring these properties together into "abilities".

5.3.1 Strange and Bizarre Words; Irritations; a Tendency towards the Middle

The component of test ability that is probably easiest to recognize and to remedy meets us in TIMSS item A1, in the word schattieren ("to shade"):

Look at the figure. How many of the small squares must one ADDITIONALLY shade [schattieren] so that 4/5 of the small squares are shaded?
A) 5
B) 4
C) 3
D) 2
E) 1
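The figure belonging to A1 is not reproduced in this volume. As a general solution sketch – assuming the figure shows $N$ small squares, of which $s$ are already shaded, with $\tfrac{4}{5}N$ a whole number – the number of squares still to be shaded is
\[ x = \frac{4}{5}N - s. \]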

This component of test ability is characterized by unusual, difficult, ambiguous, perhaps even wrongly used words in the item text (for the interpretation of schattieren cf. Meyerhöfer 2004 a, pp. 104 f.). The main causes of this component are pretensions, translation errors and also a lack of care in reviewing the items: these errors lie on the manifest text level and can be remedied by careful review of the texts. Here, for example, one could use the word schraffieren ("to hatch") and actually use hatching.

It can be regarded as an aggravation of this component when the text passes over into an openly bizarre form. Thus in the PISA item "Bauernhöfe" ("Farmhouses") the cuboid ("Quader") EFGHKLMN is explained as a "rechtwinkliges Prisma" (right-angled prism):


PISA item "Bauernhöfe" (Farmhouses)


Here you see a photograph of a farmhouse with a pyramid-shaped roof.
Below you see a sketch, with the corresponding measurements, which a student has drawn of the roof of the farmhouse.
The attic floor, ABCD in the sketch, is a square. The beams that support the roof are the edges of a "Quader (rechtwinkliges Prisma)" – a cuboid ("right-angled prism") – EFGHKLMN. E is the middle of AT, F is the middle of BT, G is the middle of CT, and H is the middle of DT. Each edge of the pyramid in the sketch measures 12 m.

Bauernhöfe 1. Calculate the area of the attic floor ABCD.
The area of the attic floor ABCD = ______ m².

Bauernhöfe 2. Calculate the length of EF, one of the horizontal edges of the cuboid.
The length of EF = ______ m.
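A minimal solution sketch (not part of the original item): since every edge of the pyramid measures 12 m, the attic floor is a square of side $AB = 12$ m, hence
\[ \text{area}(ABCD) = 12 \cdot 12 = 144\ \mathrm{m}^2, \]
and since $E$ and $F$ are the midpoints of $AT$ and $BT$, the midsegment theorem applied in triangle $ABT$ gives
\[ EF = \frac{1}{2}\,AB = 6\ \mathrm{m}. \]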



A further example is provided by TIMSS item A2:

The objects on the scale are in balance. On the left-hand pan there are a weight (a mass) of 1 kg and half a brick. On the right-hand side there is a whole brick.
What weight (what mass) does a whole brick have?
A) 0.5 kg
B) 1 kg
C) 2 kg
D) 3 kg
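A minimal solution sketch (assuming, as the item intends, that the two half bricks are of equal mass): if $m$ denotes the mass of a whole brick, the balance gives
\[ 1\ \mathrm{kg} + \frac{m}{2} = m \quad\Longrightarrow\quad m = 2\ \mathrm{kg}, \]
i.e. option C.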

Here, the item construction cannot make up its mind whether the issue is weight or mass, although this is irrelevant to the problem. Up to the first parenthesis, picture and text have given contradictory signals as to whether the item is to be worked on from a mathematical, a physical or an everyday point of view¹³. The outward signals orienting towards physics and measurement technology are latently contradicted. At the parenthesis, the uncertainty passes over into open confusion. The lack of clarity becomes manifest in the text. The fine-grained analysis shows that merely a schoolmasterly need for the correct use of technical language is being served. The structure can be described as an outward technicalization of language combined with a simultaneous substantive disavowal of technicality. At the same time, a moment of irritation is created, for the student must find a way of dealing with the open linguistic distortion. Concretely, he must decide whether the terminological doubling is important for the solution or not. A student who can grasp and pass over the schoolmasterly character of the text gains a time advantage here.

A third example is found in TIMSS item A5:

Which of the statements about the square EFGH is FALSE?
A) EIF and EIH are kongruent (deckungsgleich).
B) GHI and GHF are kongruent (deckungsgleich).
C) EFH and EGH are kongruent (deckungsgleich).
D) EIF and GIH are kongruent (deckungsgleich).

The use of kongruent ("congruent") and deckungsgleich (literally "coincident when superimposed") reflects a probably unsolvable conflict of "central" tests. Both words are – with respect to the problem at issue here – synonymous.

13 For the interpretation cf. Meyerhöfer (2004 a, pp. 107-113)



Both words belong to the technical language of mathematics teaching. There are classes in which the concept of Deckungsgleichheit is the learning content. There are also classes in which the concept of congruence is the learning content. In some of these, the concept of Deckungsgleichheit is in turn drawn on to explain the concept of congruence – this relates to a certain self-explanatory potential of the concept of Deckungsgleichheit. The formulation kongruent (deckungsgleich) chosen in the item now primarily takes up this last aspect: kongruent is – as a reminder, as it were – explained as deckungsgleich. At the same time, deckungsgleich is offered as a terminological alternative for those who do not know the term kongruent. For this group a new term has appeared – as the main term, as it were, because it does not stand in the parenthesis. For those who know only the term kongruent, a new term has likewise been introduced, namely in a parenthesis. For all three groups the parenthesis creates a potential for irritation: either one is confronted with a new term, or one is suddenly reminded in a parenthesis of the meaning of a term – a strange undertaking for a test. This terminological confusion is explicable for someone who knows the – at its core didactical – problem of the two terms. It is "normal" for someone who is familiar with such constructions in tests and can pass over them: a component of test ability.

In all three examples, a habitual problem of the "schoolmasterly" turns out to be the cause: value is placed, inadequately to the problem, on the use of technical language – and the technical substance is undermined. At the same time, a moment of irritation is created, for the student must find a way of dealing with this linguistic distortion. Concretely, he must decide, for example, whether the terminological doubling is important for the solution or not. A student who can grasp and pass over the schoolmasterly character of the text gains a time advantage here: he saves the time needed by someone who first thinks about the mass–weight problem, or the concept of a prism, or congruence – or who even lets it enter deeply into his reflections.

With this first component of test ability, the student's task consists in steering around the emerging reef. On the one hand, this can mean successfully grasping the content of the "strange" word – a verbal ability. Where several meanings are possible, the meaning intended by the testers must be grasped. Habitually, this requires placing oneself on the level of problem-processing envisaged by the testers, that is, thinking neither too deeply nor too superficially. Whoever thinks intellectually too far downwards or upwards is exposed to a heightened risk of failure.



Test ability thus has here a substantive and a habitual dimension, and marks a tendency towards the middle. The reef can also be steered around by passing over the strange element and avoiding terminological or substantive exactness. – The point is not to understand the problem of the item completely; the point is the solution that is correct in the sense of the test. At this point it becomes clear how tests reproduce the much-lamented orientation of students towards results (instead of towards content).

Steering around the reef must not only succeed substantively; it must also happen with as little loss of time as possible, for time is a precious resource in a test. Test ability here means knowing that the individual word does not matter so much, and that one must pass over the strange element. One takes as little notice of it as possible, or infers from the rest of the text as quickly as possible that no trap lurks here. The possibility that it is an important word, or that a term has been introduced for the sake of technical precision, is the great danger for the test-able person. But when a "strange" word or a strange construction appears in a test, it is extremely unlikely that it is a terminological precision that absolutely must be understood in order to produce the correct answer. Test ability here means wasting no time on reflection.

The component of test ability presented here can hardly be declared an educational goal (in the sense of the argument that test abilities are suitable as educational goals or correspond to them): it is true that receiving texts always also involves "reading past" the pretensions and errors occurring in them when quickly grasping their content, that is, disclosing the content despite them. But no justification for co-measuring this component of test ability can be constructed from this, because it involves a tendency to normalize errors: the student is forced to pass over deficits of test construction and thereby to accept them instead of rejecting them – which, owing to the autonomy-destroying basic structure of tests, he cannot do with impunity. The avoidance of irritation caused by schoolmasterly use of technical language, too, can be declared an educational goal only with great contortions: one would have to presuppose that the technical-linguistic has a value outside the technical. To me, however, the technical-linguistic seems to have a value only where it transports or constructs something technical. The latent undermining of the technical by the technical-linguistic seems to me ill-suited as an educational goal.



5.3.2 Irritations through Failed Artificial Accelerations and Disambiguations

A further component of test ability occurs when the attempt is made to channel the student through the test artificially faster. In some cases, the manifest construction of an unambiguity latently destroys precisely this unambiguity. The same principle occurs when a textual construction that is meant to speed up the reading of the text unfolds a potential for irritation that slows the reading down.

Das erste Beispiel findet sich in der Konstruktion Wie viele von den kleinen<br />

Quadraten . . . in der TIMSS-Aufgabe A1 (vgl. 5.3.1). Hier wird besonders<br />

auf die kleinen Quadrate verwiesen. Dieser Verweis soll vereindeutigen, denn<br />

es wird ausgeschlossen, dass sich der Schüler mit den aus den kleinen Quadraten<br />

zusammengesetzten „großen“ Quadraten auseinandersetzt. Der Verweis<br />

verwirrt aber auch, denn die Wahrscheinlichkeit, dass sich Schüler von sich<br />

aus mit den „großen“ Quadraten auseinandersetzen, ist ausgesprochen gering.<br />

Man wird also darauf ges<strong>to</strong>ßen, besondere kleine und eventuell sogar große<br />

Quadrate zu suchen. Lediglich als Hilfe für den Schüler, der nicht weiß, was<br />

Quadrate sind, könnte man sich „kleine Quadrate“ vorstellen. Dieses Argument<br />

zerbricht aber daran, dass außer den Quadraten gar nichts da ist, womit<br />

man arbeiten kann. Die Vereindeutigung zerstört sich also selbst.<br />

Testfähigkeit besteht hier darin, sich von solchen testvereindeutigenden<br />

und beschleunigenden Konstruktionen nicht irritieren zu lassen: Der testfähige<br />

Schüler ist also mit solchen Konstruktionen vertraut und weiß (implizit<br />

oder explizit), dass es lediglich um Vereindeutigung geht und dass über diese<br />

schlichte Funktion nicht hinausgedacht werden muss. Es geht darum, auf<br />

eine vielschichtige Auseinandersetzung mit der Aufgabe gerade zu verzichten,<br />

also nicht über die Rolle von großen und kleinen Quadraten und über die<br />

vielfältigen Möglichkeiten des Umgangs mit Mengen in der Zeichnung nachzudenken<br />

<strong>–</strong> wie es der explizite Verweis auf die kleinen Quadrate zunächst<br />

nahelegt. Wenn man auf vieldimensionales Nachdenken verzichtet und sich<br />

auf das Setzen des richtigen Kreuzes konzentriert, dann wird die Bearbeitung<br />

der Aufgabe durch kleine vielleicht sogar wirklich beschleunigt.<br />

Das gleiche Prinzip wiederholt sich in A1 mit . . . muss man ZUSÄTZ-<br />

LICH schattieren . . . Die Großschreibung scheint zunächst eine Hilfe zu sein,<br />

da sie vor der Angabe der insgesamt zu schattierenden Quadrate warnt. Diese<br />

Hilfe ist aber nicht notwendig, weil die durch Multiple Choice angegebenen<br />

Lösungsvarianten dem Schüler seinen Irrtum signalisieren würden. Auch


78 WOLFRAM MEYERHÖFER<br />

an dieser Stelle erfährt ein testfähiger Schüler einen Vorteil, weil er mit einer<br />

solchen testbeschleunigenden Konstruktion vertraut ist. Der testunerfahrene<br />

Schüler wird eher irritiert sein, weil in der Schriftsprache normaler Texte,<br />

auch bei schulischen Texten, Wörter in Großbuchstaben eine derart starke Exponierung<br />

erzeugen, dass ein Nachdenken über den Grund der Exponierung<br />

angezeigt ist. Testfähigkeit bedeutet hier, den Grund der Exponierung bereits<br />

zu „kennen“: Vermeidung naheliegender Fehler. Der Beschleunigungsvorteil<br />

durch diese Exponierung gilt natürlich nur für jenen, der der impliziten Aufforderung<br />

widersteht, über den Grund der Exponierung nachzudenken. Auch<br />

hier bedeutet Testfähigkeit wieder, nicht über den Text nachzudenken, sondern<br />

dem Prinzip zu folgen, dass es um das Kreuz an der richtigen Stelle geht, nicht<br />

um inhaltliche Auseinandersetzung.<br />

Die Komponente der irritationshaltigen Beschleunigung findet sich auch in<br />

der Formulierung Welche der Aussagen . . . ist FALSCH? von A5 (vgl. 5.3.1).<br />

Ursache ist hier eine Prätention: Eine mathematisch anspruchslose Fragestellung<br />

wird zunächst künstlich verkompliziert: Hier ist reine Fleiß- und Konzentrationsarbeit<br />

zu verrichten, deren Anspruch aus dem zu lösenden Problem<br />

heraus nicht zu begründen ist. Die künstliche Verkomplizierung durch die Umkehr<br />

des Anspruchs <strong>–</strong> man soll benennen, was falsch ist <strong>–</strong> motiviert wiederum<br />

die Hervorhebung durch Großbuchstaben: Das Ungewöhnliche muss hervorgehoben<br />

werden, um eine Verwechslung mit der erwartbaren Anforderung zu<br />

vermeiden. In der Variante „ . . . ist richtig“ käme der Gedanke, RICHTIG groß<br />

zu schreiben, nicht auf.<br />

Ein weiteres Beispiel findet sich in der <strong>PISA</strong>-Aufgabe „Pyramide“:<br />

Die Grundfläche einer Pyramide ist ein Quadrat. Jede Kante der<br />

skizzierten Pyramide misst 12 cm. Berechne den Flächeninhalt<br />

der Grundfläche ABCD.<br />

Im zweiten Satz wird hier die Option des Vorhandenseins<br />

zweier Pyramiden eröffnet, nämlich der Pyramide<br />

des ersten Satzes und der skizzierten Pyramide. Manifest erfolgt durch<br />

die Einfügung des skizzierten eine Lesebeschleunigung durch die explizite Verknüpfung<br />

von Text und Bild. Latent wird eine Irritation erzeugt. Die Aufgabe<br />

an den Schüler lautet, sich von dieser Irritation nicht ergreifen zu lassen, also<br />

darüber hinweg zu lesen.<br />
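A side remark on the mathematical substance of the item, using only the data quoted above: the base is a square with edge 12 cm, so

area of ABCD = 12 cm · 12 cm = 144 cm².

Whatever difficulty the item has thus lies in the text, not in the mathematics.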

The dimension of test ability described here can likewise be discussed as an educational goal: after all, such ruptures between what the text intends and the irritation it thereby produces also occur in the texts for whose reception instruction prepares students. One can declare it an educational goal to find a way of dealing with them and to learn to overcome these irritations. The argument is cynical, however: enlarging autonomy would mean thematising, in some way, the divergence of different textual levels. The student would thereby be enabled to gain distance from the text and thus also from the process of performance assessment. He could thereby emancipate himself intellectually from the processes of schooling and performance evaluation. In a test he is at the mercy of these processes, because he does not get the point even if he rejects the task with intellectual brilliance. It seems far more plausible to me that the testers are under an obligation to avoid potential for irritation by refraining from broken disambiguations and accelerations. But this is evidently possible only if one recognises these ruptures at all. The first precondition for that is a change of perspective: the tester must not concentrate solely on what he wants to hear, but must ask himself what the text really demands and whether that converges with what he wants. The second precondition is then merely a certain sensibility for texts. Objective hermeneutics offers this sensibility an instrument of methodical control.

5.3.3 Fallibility of the testers

PISA task APPLES

A farmer plants apple trees, which he arranges in a square pattern. To protect these trees against the wind, he plants conifers around the orchard.

In the following diagram you can see the pattern according to which apple trees and conifers are planted for an arbitrary number (n) of rows of apple trees:

[Diagram not reproduced.]

Apples 1:
Complete the table:

[Table not reproduced.]

Apples 2:
There are two formulas that can be used to calculate the number of apple trees and the number of conifers for the pattern described above:

number of apple trees = n²
number of conifers = 8n

where n denotes the number of rows of apple trees.

There is a value of n for which the number of apple trees is equal to the number of conifers. Determine this value and describe how you calculated it.

Apples 3:
Suppose the farmer wants to lay out a much larger orchard with many rows of trees. Which will increase faster as the farmer enlarges the orchard: the number of apple trees or the number of conifers? Explain how you arrived at your answer.

On closer examination,¹⁴ the task "Apples" has turned out to be a productive classroom task that can be extended in mathematically substantial ways, but one that is unsuitable as a test task: among other things, Apples 2 belatedly supplies a formula for solving Apples 1. Test ability here does not consist primarily in being able to recognise in what way the testers have already delivered the solution, or parts of it, along with the task. After all, it makes little sense to search tasks systematically for such opportunities – they are too rare for that.

Test ability consists, rather, in allowing the thought that the testers are fallible, and then also in exploiting the lapse. The test is, after all, an instrument that invokes the exposed precision and care of science, and that already signals by its sheer scale and appearance that many people have thought long and hard about what they are putting before the student. It takes a certain measure of autonomy, composure, detachment or audacity to allow the thought that this enormous apparatus writes the formula for solving task 1 into task 2.

14 Winter (2005); Meyerhöfer (2004a, p. 203 f.); Meyerhöfer (2005, p. 171 ff.)

5.3.4 Primacy of the desired result over the mathematical demand; audacity towards the mathematical demand; possibilities of multiple choice

The dimension just described is brought to a head in TIMSS task M7, in which the mathematical demand is undermined:

In this drawing, AB is a straight line. How many degrees does angle BCD measure?

A) 20
B) 40
C) 50
D) 80
E) 100

[Drawing not reproduced.]

Here the student is evidently supposed to recognise that 9x equals 180 degrees, and from this to determine the angle measure of 80 degrees for BCD. It is far more effective to use the multiple-choice offers to estimate that it can only be 80 degrees. That the student is actually supposed to calculate can be recognised from the first sentence and from the fact that the unusual labels 5x and 4x are attached. In a genuine estimation task such a thing would not occur.¹⁵

15 For an interpretation, see Meyerhöfer (2001)
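For the record, the intended calculation path can be reconstructed from the quoted information alone (the drawing is not reproduced here; I take it, as the text indicates, that the labelled angles 5x and 4x together make up the straight angle on the line AB):

5x + 4x = 9x = 180° ⟹ x = 20°, hence angle BCD = 4x = 80°.

The distractors A) 20 and E) 100 then correspond to x itself and to 5x – precisely the traps mentioned below.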

A student who cannot solve the problem by calculation is lucky here: he runs no risk of wasting time. His only task is to dare simply to tick what he sees. That is not trivial, for many a student does not dare to write down the obvious when he senses or notices that he is actually supposed to calculate. A student who can solve the problem by calculation likewise arrives at the correct result – provided he does not fall into the traps of answer options A or E. But he uses up a great deal of the precious resource of time. To save this time he needs a certain audacity towards the computational demand, paired with a certain cleverness in recognising the possibilities created by multiple choice. For the problem of test ability this yields a further component: one must dare to take a non-computational route even where calculation is evidently demanded. One must, that is, act audaciously against the demand, for the issue is not the mathematical content but the cross in the right place. Task M7 is almost ideally suited to identifying people who face the actual demand cleverly and effectively while audaciously acting against the manifestly intended task. Compared with this, serving the computational intention appears as the dutiful working-through of error-prone and problem-inadequate techniques of mathematics instruction.

The same dimension of test ability is co-measured in the tasks "Farmhouses" (cf. 5.3.1) and "Triangle".¹⁶ There, the use of local theorem and formula knowledge is demanded (to differing degrees). The routes via intuition or measuring tend to be faster. The greatest loss of time there is suffered by the one who thinks and acts in a genuinely mathematical way.

16 Cf. Deutsches PISA-Konsortium (2001, p. 178); for discussion, see Meyerhöfer (2004b)

In his dissertation, Reinhard Woschek (2005) examined the different ways in which German and Swiss students solve TIMSS tasks. For M7 he found that German students almost exclusively calculate, and often fail in doing so. Swiss students, by contrast, almost exclusively estimate.

There are, of course, teachers in Germany too who want their students to estimate at this point – at least when setting up and handling equations is not the current topic. Test ability here runs against the intention of the task, yet it does have a character that can correspond to teaching intentions: one may well want students not to deal with given problems in a doggedly computational way, but to solve them as effectively as the situation allows. It remains unclear, however, why one should choose precisely the artificial and static instrument of the multiple-choice task in order to approach a dynamic, problem-adequate and effective way of dealing with mathematical problems – problems which, moreover, call for computational treatment. One should also keep in mind that it would be cynical to artificially suggest or construct a computational demand that does not grow out of the matter itself.

5.3.5 However little you know, always write something down.

An elementary component of test ability can be paraphrased as the exhortation: however little you know, always write something down. The multiple-choice variant of this exhortation underscores the principle: if you know nothing, tick something anyway – preferably whatever seems most plausible to you. The discussion about guessing in tests¹⁷ can be brought to a head in the claim that all population differences in test performance can be explained by differing degrees of internalisation of this component of test ability. This claim is, admittedly, just as unverifiable as the claim that guessing plays no role. But when we learn from Wuttke (2007, section 3.12) that solution differences of just half a task are already interpreted in the PISA scaling as a relevant difference (9 points), this reveals how susceptible the construct is to problems of guessing.

17 Cf. Meyerhöfer (2004c), Lind (2004)

I want to discuss the problem here only for open answer formats (which, in the categorial procedure, are nevertheless always coded in closed form): the task "Apples 2" (cf. 5.3.3) states: There is a value of n for which the number of apple trees is equal to the number of conifers. Determine this value and describe how you calculated it.

But there are two values, namely 0 and 8. From the PISA group's solution codings it emerges that a student who gives n = 8 receives the solution point even if he provides no justification or calculation – that is, even if he does not fulfil the task as set. A student who gives only n = 0, by contrast, receives no point, even if he justifies his answer and leaves it at this value because, after all, only one value is demanded in the task text. Of course, one never knows the testers' coding instructions while being tested. But it becomes clear that the point is not in every case actually to fulfil the task. Merely writing down a partial solution already yields the point.
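That there are indeed exactly two such values is elementary: setting the two formulas quoted in 5.3.3 equal gives

n² = 8n ⟺ n(n − 8) = 0 ⟺ n = 0 or n = 8.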

It is natural to object that the solution zero is rather irrelevant for the real context. That is substantively true, but it does not suspend a core experience of mathematics instruction: there, unsystematically – that is, not necessarily justified transparently from the matter itself, but occasionally appearing to spring from the teacher's whim – such "marginal considerations" come up again and again. For the student it remains opaque, precisely in tests, to what extent he has to (and is allowed to) perform such "marginal considerations". The uncertainty is reinforced by the fact that the real is evidently not at issue at all.


This degradation of the real creates the impression, in the PISA task "Savings",¹⁸ that one could write down just anything about compound interest – possibly even without having calculated it – and still receive the point.
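For the record, the ten-pfennig difference at stake (see footnote 18 below for the task text) is quickly verified:

"Plus" savings: 1000 DM · 1.03 · 1.05 = 1081.50 DM
"Extra" savings: 1000 DM · 1.04 · 1.04 = 1081.60 DM,

a difference of 0.10 DM, i.e. ten pfennigs, after two years.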

Writing down just anything also proves sensible if one keeps the coding practice in mind: a coder – usually a poorly paid student – has to decipher, in an alienated work process under time pressure, a large quantity of barely legible student notes. He must try to wring a meaning from what is written and to reconcile this meaning with an extensive set of scoring rules which nevertheless captures reality only in woodcut fashion. He stands in the permanent conflict that, on the one hand, the production of validity he performs is exposed to the claim of scientificity, while, on the other hand, no scientific method of producing validity is available to him. (The conflict exists independently of the coder's awareness of it. The coders, however, are directly confronted with the text and are likely to sense most clearly that the categorisations are able to grasp neither the latent nor the many different manifestations of understanding or of ability.) The result of his doing is supposed to be an undifferentiated zero–one decision, and in
a certain sense it is even irrelevant whether one scores carefully and validly or not: the coder directly senses the fragility with which the validity of the test statement is produced in the coding procedure. He senses at first hand the illusion of the point value.

18 Savings
Karina has earned 1000 DM in her holiday job. Her mother advises her to deposit the money with a bank for two years to begin with (compound interest!). She has two offers for this:
a) "Plus" savings: 3% interest in the first year, then 5% interest in the second year.
b) "Extra" savings: 4% interest in both the first and the second year.
Karina says: "Both offers are equally good." What do you think? Justify your answer!
The difference between the two offers amounts to ten pfennigs, and it remains unclear what the testers now want to hear: is a difference of 10 pfennigs still "equally good" or not? Because of the unknown further conditions, the question is obviously unanswerable: even with a deposit of 100,000 DM the difference would be only 10 DM, and thus offset by any account-keeping fee or other incidental costs, by the cost of travelling to the bank, even by taking along promotional gifts.
The problem for the person being tested is always to guess successfully what the testers want to hear. Here this remains unclear. It could even be that one receives a point for any sensible argumentation, no matter which way one decides. One element of test ability here consists in writing down something anyway, despite the ten-pfennig difference whose significance for the expected answer cannot be assessed. Since the real is not taken seriously here in any case, one might receive the point by writing down just anything about compound interest – possibly even without having calculated it. (For details, cf. Meyerhöfer 2004, p. 199 f.; Meyerhöfer 2005, p. 166 f.)

The coding practice thus harbours a component of arbitrariness residing both in the principle of categorisation and in the human element – and this arbitrariness is an essential difference from the class test, after which the teacher is always under pressure to justify himself and hence under pressure of fairness. It becomes clear that one has little influence on the coder's "state of grace" and on the catalogue of categories, but that writing something – anything – down increases the chance of a positive assessment.

This component of test ability, while immanent to testing (regardless of whether a test task is well or badly made), nevertheless lies close to abilities that are needed in class tests, for there too the point is to "scrounge points" by writing down fragments, even if one knows little. One may therefore ascribe "educational value" to this component of test ability as well. But it is a purely intra-school value: the assembling of half-knowledge or fragments of knowledge serves here no approach to educational substance through gathering what is already known, through reflecting on it, working it up and extending it. It serves merely the servicing of externally imposed demands in an asymmetrical constellation whose substantive content initially serves no educational process.

5.3.6 Disrespect for the autonomy and authenticity of the mathematical as of the real; the semblance of the real; the testers' specific reality

When testers turn to the real outside mathematics, a manifold potential opens up to them for producing distortions whose handling demands test abilities. In the task "Farmhouses" (cf. 5.3.1) there are attics that are squares, there are middles of line segments, there modelling demands are asserted and destroyed.

In my study on PISA¹⁹ I discussed how these distortions can be avoided: the basic condition is to respect the real and the mathematical in their autonomy and authenticity. This stakes out the basic direction of the test ability to be described here: the disrespect for autonomy and authenticity has to be dealt with.

19 Meyerhöfer (2004, chapter 5); Meyerhöfer (2005, chapter 5)

In TIMSS task A2 (cf. 5.3.1) the text is thrown back and forth between the real, the mathematical and the physical. One possibility of failure arises there if one takes for a brick what looks like a brick, namely the "half" brick. The error suggests itself because bricks with a square cross-section are rarely encountered and because one never splits them lengthwise, as has happened here. For the problem of test ability we learn: one must not believe the semblance of the real. One must, rather, enter into the testers' specific reality. In this reality bricks are split lengthwise, mother calls out "compound interest!", and schoolgirls allegedly draw roofs of farmhouses that are not farmhouses. This world resembles the world of textbooks and certainly also the world of mathematics instruction. In this respect the ability to enter into the testers' reality probably runs together with the ability to enter into the specific reality of mathematics instruction. This component of test ability thus bears a certain resemblance to a component of ability that is also thematic in mathematics instruction. The difference, however, is a constitutive one: in mathematics instruction, it seems to me, part of the idea of education is the demand to bring the specificity of the real and the specificity of the mathematical into view. What is at issue is precisely not the unreflected adoption of task patterns. Such unreflected adoption one would ascribe to a mathematics instruction that does not fulfil its educational mandate. The disrespect for the autonomy and authenticity especially of the mathematical, but also of the real, thus cannot be justified in terms of mathematics education. That is why it is problematic when both go unrespected in tests. And it is problematic in the sense of an educational mandate that in none of the published PISA tasks, and in none of those TIMSS tasks that were presented to all students, is the specificity of the real and of the mathematical thematic. In this respect both tests work past the educational mandate of mathematics instruction. The components of test ability co-measured here merely serve adaptation to a mathematics instruction that does not fulfil its educational mandate.

It is just as worrying how frequently textbook tasks fail to face the specificity of the real and of the mathematical; in this respect the educational mandate of mathematics instruction is in part directed against the practice of textbook tasks. But this does not destroy the mandate; it merely obstructs its implementation (and illustrates how poorly the educational mandate is anchored in the field). In tests, however, there is nothing but the tasks themselves. They constitute the whole.

5.3.7 Non-seriousness of the real problem; dominance of the plain; disadvantageousness of exact considerations, of creative or intellectually demanding work; the testers' wishes as the measure of one's doing

Closely connected with the demand not to surrender to the semblance of the real is the demand not to take the real or reality-based problem seriously. If, in TIMSS task A2 (cf. 5.3.3), one takes the problem seriously and calculates it through, taking into account the distance of the bodies from the centre of the balance, one finds that the brick weighs 2.62 kg. For this, however, there is no multiple-choice offer. One would accordingly round to 3 kg and thereby obtain a "wrong" result, because the testers want to see 2 kg ticked. Here, then, is a case in which a student would arrive at a result that is wrong in the testers' sense although he solves a demanding problem and probably also masters what the testers believe they are measuring. Put simply: a student who is "too clever" arrives at the wrong result. Formulated in deficit terms: this student fails to recognise on what level he is supposed to argue here. The moment of irritation is also present if one merely "stumbles" over the picture because one takes the problem as posed seriously. Here the additional task consists in recognising that one must not take the problem seriously but is supposed to perform a plainer consideration. A student with test ability, who leaves out exact considerations from the outset, thereby gains a time advantage on this task. One could thus formulate this component of test ability as follows: you shall not take seriously and solve the problem that is posed; rather, you shall find out what the testers want you to write down or tick. From what is known about testing one can add: it is more probable that you are supposed to work plainly than that you are supposed to work creatively or in an intellectually demanding way.

Whoever takes the problem seriously in the task "Farmhouses" does not even learn which length he is supposed to determine, because the beam to be calculated would have to have a trapezoidal cross-section. Fortunately, the text destroys the modelling demand so thoroughly that one will not take the problem seriously.


6 Synthesis

"Test ability" describes those items of knowledge, abilities and skills that are co-captured or co-measured in a test but that one would not subsume under the concept of "mathematical proficiency".

When test abilities are co-measured in a test, the following problems arise:

– Not only what is supposed to be measured – mathematical proficiency – is measured. The measurement result is thus distorted.

– Tests set standards. When tests co-measure test abilities, these abilities become part of the standard. The empirical analysis has shown that this is no marginal phenomenon but concerns the core of mathematical education. It has shown at the same time that test abilities cannot be reinterpreted as educational substance. Attempts to do so prove, in the empirical analysis, to be superficial, often wrong and mostly cynical.

– It is conceivable to address the problems identified by means of test training. The main problem arises, however, when latent test training takes place through the accumulated working of test tasks. (There exists for this the at-bottom cynical euphemism "test culture".) Then the phenomena damaging the idea of education unfold their effect insidiously. I cannot deepen the thought here, but I want to point out that Adorno's "Theorie der Halbbildung" (Adorno 1972) opens the problem up further.

Against this background, the conception of the German educational standards needs to be reconsidered. They focus on the testing of "competencies". At present a test is being developed – one far exceeding the PISA test in its dimensions – that is to let the educational standards for mathematics congeal into a test form. The thought that tests set standards attains here a radicalised practice. This standards test is being constructed in a manner in which the problem of test ability cannot be dealt with. Owing to the high penetrating power of the educational-standards tests on mathematics instruction, test ability thus becomes a standard(s) phenomenon in German schools. I therefore propose stopping this test construction.

I now want to summarise the empirically elaborated components of test ability.

The supreme principle for the person being tested is: in the test situation the point is not that a mathematical problem is opened up, that a thought is unfolded or an argumentation brilliantly developed. The point is to place the cross in the spot desired by the testers (the "right" spot), to write down the number desired by the testers, or to unfold a thought just far enough for the coder to award a point for it. Occasionally, opening up the problem and doing what is desired coincide; that is, occasionally a task does test mathematical education or proficiency.

The point is to adapt to the tendency towards mediocrity that is inherent in tests. If one looks at this mediocrity "from below", the phenomenon appears unproblematic at first glance, because one can then work on the task to the best of one's knowledge. Put somewhat plainly: someone distant from education will perhaps not be damaged in his intellectuality by acquiring test ability; he will merely be hindered in its development. He has additional, unspoken and superfluous material to master. That robs capacity for relevant content. For the education-distant clientele it additionally sharpens the problem of disadvantage through the unspoken (cf. Bourdieu/Passeron 1971). If, conversely, one tends to approach the world intellectually, to question it seriously, to unfold thoughts in many layers and to work on mathematical problems all the way to an understanding of one's own, then one loses time in tests; occasionally one also ends up with a correct or "also correct" or "from this angle also correct" result, which however is not the desired and hence rewarded result. The principle reads: avoid depth and many-layeredness!

This holds with a particular colouring when testers venture into the real. Here it is important not to take the real "dressings" seriously. One must find out in which specific reality the testers are moving. One fares best by simply asking oneself: what do they want me to calculate? When the reality and what is mathematically (usually: computationally) wanted do not quite converge, what is mathematically wanted is always primary. One must again be especially careful when the real invites us to more differentiated thinking: at such points one has to find out what the testers want to hear, and not occupy oneself with what the real demands – that costs time or leads to an unrewarded result.

Conversely, one must always write something down, however little one knows. It is advisable to mark difficult tasks while working through the test. Depending on how much time one has left at the end, one has to engage in content-based guessing via plausibility considerations or, if need be, resort to lottery guessing (cf. Meyerhöfer 2004c); with texts one must write down something – anything – and, where numbers are to be inserted, the number that seems most plausible.

Standardised tests are alien and wooden instruments. They hold ready manifold irritations which are washed in during the construction and operationalisation process. One should keep in mind that (tens of) thousands are supposed to work on the test, so that many different technical terms and ways of proceeding have to be covered, and that translation problems come on top of this. There, "the odd word goes astray", sometimes more than one. It also happens that the task is meant to be made more comprehensible or more quickly graspable, and that irritating formulations arise as a result. Avoiding irritations here means: being able to read past them. Here too the pointer towards mediocrity helps: usually the less complicated meaning is intended, and usually the individual word does not matter; one can safely read past it. If one concentrates on what the testers want to hear, one also notices that what is irritating is often incidental. Incidentally, one can also always ask the test administrator. In principle he is not allowed to say anything. But he is sometimes allowed to clear up comprehension problems with words, and perhaps he will tell even more.

Empirically it emerges that test ability unavoidably comes into play

– where multiple-choice offers make guessing possible,
– where open answers are coded categorially into zero–one decisions,
– where differing handling of technical terms in different parts of the population to be measured has to be dealt with.

Test ability comes into play avoidably

– where latent and manifest textual levels running against each other lead to potential for irritation,
– where the content that appears to be at issue is not taken seriously. In the first instance it is not the student who fails to take the content seriously; rather, the task constructor who produces the text does not take the content seriously. (This is, of course, didactically veiled.)
– where ambiguities or blurrings arise concerning what is measured and what is supposed to be measured.

This empirical result is no surprise: test ability plays a role precisely where the task is badly constructed – where latent and manifest textual levels diverge, where the mathematical content is didactically distorted, where the operationalisation process has not been carried out carefully, where one's own mathematics-didactical habitus goes unreflected. This kind of test ability can therefore be dealt with in the sense of avoidance if the testers work on the convergence of latent and manifest textual levels, if they do not themselves fall for didactical illusions and veilings, and if they carry out careful operationalisations of their measurement constructs.

References

Adorno, Theodor W. (1972): Theorie der Halbbildung. In: Soziologische Schriften I (Gesammelte Schriften, vol. 8). Frankfurt: Suhrkamp

Baumert, Jürgen; Klieme, Eckhard; Lehrke, Manfred; Savelsbergh, Elwin (2000): Konzeption und Aussagekraft der TIMSS-Leistungstests. Zur Diskussion um TIMSS-Aufgaben aus der Mittelstufenphysik. In: Die Deutsche Schule, vol. 92 (2000), no. 1, pp. 102-115; no. 2, pp. 196-217

Bourdieu, Pierre; Passeron, Jean-Claude (1971): Die Illusion der Chancengleichheit. Stuttgart: Klett

Deutsches PISA-Konsortium (Ed.) (2001): PISA 2000. Basiskompetenzen von Schülerinnen und Schülern im internationalen Vergleich. Opladen: Leske + Budrich

Hagemeister, Volker (1999): Was wurde bei TIMSS erhoben? Eine Analyse der empirischen Basis von TIMSS. In: Die Deutsche Schule, vol. 91 (1999), no. 2, pp. 160-177

Hembree, Ray (1987): Effects of Noncontent Variables on Mathematics Test Performance. In: Journal for Research in Mathematics Education, vol. 18, no. 3, pp. 197-214

Klieme, Eckhard; Maichle, Ulla (1989): Zum Training von Techniken des Textverstehens und des Problemlösens in Naturwissenschaften und Medizin. In: Günter Trost (Ed.): Test für medizinische Studiengänge (TMS): Studien zur Evaluation (13. Arbeitsbericht). Bonn: Institut für Test- und Begabungsforschung, pp. 188-247

Klieme, Eckhard; Maichle, Ulla (1990): Ergebnisse eines Trainings zum Textverstehen und zum Problemlösen in Naturwissenschaften und Medizin. In: Günter Trost (Ed.): Test für medizinische Studiengänge (TMS). 14. Arbeitsbericht. Bonn: Institut für Test- und Begabungsforschung, pp. 258-307

Lind, Detlef (2004): Welches Raten ist unerwünscht? Eine Erwiderung (auf Meyerhöfer 2004c). In: JMD 1/2004, pp. 70-74

Meyerhöfer, Wolfram (2001): Was misst TIMSS? Einige Überlegungen zum Problem der Interpretierbarkeit der erhobenen Daten. http://pub.ub.uni-potsdam.de/2001meta/0012/door.htm

Meyerhöfer, Wolfram (2004a): Was testen Tests? Objektiv-hermeneutische Analysen am Beispiel von TIMSS und PISA. Dissertation, Mathematisch-Naturwissenschaftliche Fakultät der Universität Potsdam

Meyerhöfer, Wolfram (2004b): Zum Kompetenzstufenmodell von PISA. In: JMD 1/2004, pp. 294-305. Longer version at: http://www.math.uni-potsdam.de/prof/o_didaktik/mita/me/Veroe

Meyerhöfer, Wolfram (2004c): Zum Problem des Ratens bei PISA. In: JMD 1/2004, pp. 62-69

Meyerhöfer, Wolfram (2005): Tests im Test. Das Beispiel PISA. Opladen: Verlag Barbara Budrich

Meyerhöfer, Wolfram (2006): PISA & Co als kulturindustrielle Phänomene. In: Thomas Jahnke, Wolfram Meyerhöfer (Eds.): PISA & Co – Kritik eines Programms. Hildesheim: Franzbecker, pp. 63-100

Millman, J.; Bishop, C.; Ebel, R. (1965): An Analysis of Test-Wiseness. In: Educational and Psychological Measurement, 25, pp. 707-726 (cited after Hembree 1987)

Winter, Heinrich (2005): Apfelbäume und Fichten – und Isoperimetrie. In: mathematik lehren, no. 128, pp. 58-62

Woschek, Reinhard (2005): TIMSS 2 elaboriert: Eine didaktische Analyse von Schülerarbeiten im Ländervergleich Schweiz/Deutschland. Dissertation, Fachbereich Mathematik der Universität Duisburg-Essen

Wuttke, Joachim (2007): Die Insignifikanz signifikanter Unterschiede: Der Genauigkeitsanspruch von PISA ist illusorisch. In: Thomas Jahnke, Wolfram Meyerhöfer (Eds.): PISA & Co – Kritik eines Programms. 2nd, revised edition. Hildesheim: Franzbecker


PISA – An Example of the Use and Misuse of Large-Scale Comparative Tests¹

Jens Dolin

Denmark: University of Copenhagen

To an ever increasing extent, international evaluations such as PISA are both setting the agenda in the educational policy debate in the participating countries and exerting a considerable influence on their educational policy decisions. But do such surveys justify the fuss they often cause?

In Denmark, the headlines which followed the publication of the PISA 2003 survey included:

– More discipline in the schools. Discipline will help to improve Danish results in international surveys (Jyllandsposten, 7 Dec. 2004)
– Time for physics classes in country no. 31 (Jyllandsposten, 7 Dec. 2004)
– The government to introduce more tests for Danish schoolchildren (Politiken, 7 Dec. 2004).

The government used the PISA results as a lever to tighten up educational policy, while a number of leading education researchers warned against introducing drastic alterations on the basis of an international test of a character which was described as being to some extent foreign to the Danish educational culture. The tone of the debate was sharp, as illustrated by the following extract from an interview appearing in a Danish newspaper:

You have been fooled by the PISA report. The PISA report on the elementary schools is nonsense and a perverse provocation. It is based on neither knowledge nor insight. (Prof. Staf Callewaert, in the newspaper Information, 10 December 2004).

1 This paper is an updated version of a keynote given at a Nordic Conference for Science Education in 2005.


A rather barren chasm was rapidly dug which prevented large parts of the educational system from utilising the PISA results productively, and large parts of the political system from placing PISA in the necessary context. Hopefully, this article may contribute a little to both.

The article will analyse PISA – particularly the part dealing with science – as an example of a major comparative evaluation.

PISA will first be described and then analysed on the basis of test theory, which will address some detailed technical aspects of the test as well as the broader issue of validation. The purpose of this is to illustrate how the technical aspects of evaluations are not neutral practices, but rather a part of the fundamental value system on which the evaluation is based. Some apparently objective choices must necessarily be made which have consequences for the theoretical basis of the evaluation, and the technique thereby becomes part of the fundamental value system. These considerations form the basis for an evaluation of PISA's predictive power in a national context – in this case, that of Denmark. On this basis, the analysis will focus on the relationship between PISA's fundamental assumptions and the national consequences of participation. Finally, I will conclude with some reflections on how PISA may be utilised and developed.

Comparative evaluation – between politics and science

Whether or not evaluations in the form of politically-initiated surveys can be considered research as such, the designation "comparative evaluation" forms part of the lexicon of comparative educational research. Internationally, this is a major research field, organised in the World Council of Comparative Education Societies, which was founded in 1970 and now has 35 national and regional member organisations. All major international education conferences have sessions for comparative evaluation, and the field is covered by several international periodicals, of which the two largest are the British periodical Comparative Education and the American Comparative Education Review. Finally, large-scale international comparative tests such as PISA present an opportunity to conduct a growing amount of related research. This secondary research may focus on PISA itself, or it may utilise the PISA data in analyses which expand its perspectives, such as in comparisons between countries, surveys of sub-populations, correlations between different variables, etc.
sub-populations, correlations between different variables, etc.


However, the research field also has a longer tradition which, under the designation comparative educational theory, refers to comparative studies of educational matters in different countries and cultures. One of the earliest Danish comparative surveys was conducted in 1841 and compared the Danish school system with the German and French systems; an analysis which was included in the formation of the Danish upper secondary school education. During the same period, the Danish educationalist Grundtvig visited Britain and was inspired by the college system to develop the Danish folk high schools. Comparative educational theory has thus contributed to building up the educational systems of national states via inspiration and exchanges of experience. This tradition, which Winther-Jensen (2004) terms comparative educational theory in its horizontal significance, was dominant until the nineteen-sixties.

International comparative studies have grown considerably in their extent and level of interest over the past decade, but the most important point is that their actual aim has altered. A change occurred during the course of the nineteen-seventies and nineteen-eighties in the conditions and significance of the educational systems, which also altered the focus of comparative educational theory. The key words here are globalisation and marketisation; education comprises a key sector of the global knowledge society, and it therefore becomes important for politicians to know how their country is doing in the international competition to become the best knowledge society. At the same time, a marketisation of the educational system is taking place, one which causes politicians to ask: are we getting value for money? There is a need for data to determine whether a Danish school student is more costly than a foreign one, and if so, whether she is at least more skilled. The marketisation of the educational system and of the public sector in general is being implemented via New Public Management: a system of control which is based on goals and result targets on the output side, and the implementation of which requires knowledge and data obtained by means of national and international evaluations and the standards these impose. This is given precise expression in PISA:

Across the world, policymakers use PISA findings to:
– gauge the literacy skills of students in their own country in comparison with those of the other participating countries
– establish benchmarks for educational improvement . . .
– understand the relative strengths and weaknesses of their educational system (OECD 2004)
(OECD 2004)


There is still an interest in comparing oneself with other countries – the horizontal dimension – but now international concepts and standards have been established which provide a basis on which national states can assess themselves. These supranational structures make it possible to speak of comparative educational theory in the vertical sense. The EU is developing a concept of lifelong learning, UNESCO defines Education for All, and the OECD is testing a literacy concept through PISA. These international concepts become a determining factor in national policies, and the international evaluations set up a standard which is independent of the differences between the individual countries, both for these key concepts and for the actual educational systems. The goals of the educational systems thereby become harmonised, and increasing emphasis is placed on standardisation and on comparison of student performance in order to measure the extent to which a country is meeting the international requirements. Under such conditions, the horizontal dimension becomes reduced to a comparison with those countries that best fulfil the international standards.

We may, for example, ask ourselves in desperation, "What is it that Finland does that causes it to do so well in PISA?" But we ask less about what school students can do in Denmark. How is it that Denmark is doing so well in international competition, when Danish young people achieve such a mediocre score in international comparative tests? It may be that these comparative evaluations fail to capture the essence of the students' skills – or at any rate capture only an inessential subset. It is therefore important to analyse what such evaluations can really tell us, and what they cannot. What are the limitations, for example, in comparing complex matters between many countries – both from the perspective of test theory and educational theory?

The aim of this article is to evaluate the predictive power of the PISA results, and thereby provide a perspective on international comparative evaluations in general. The criticism examined here should then be compared with the advantages that PISA bestows. One problem in this context is that surveys like PISA are initiated and planned in one part of the educational system (typically at policy level) but implemented by another part of the system (typically the directly practising level), after which the results are used by the policy level to characterise and change the practising level. The situation is thereby one of attack and defence from the beginning, which makes it difficult to find a neutral standpoint from which to assess PISA.

It is, however, important to understand that PISA was designed by the OECD with the official aim of acquiring a data foundation for the use of (educational) decision-makers. As PISA's own introduction makes clear (OECD 1999):

The results of the OECD assessments, to be published every three years along with other indicators of education systems, will allow national policymakers to compare the performance of their education systems with those of other countries. They will also help to focus and motivate educational reform and school improvement, especially where schools or education systems with similar inputs achieve markedly different results. Further, they will provide a basis for better assessment and monitoring of the effectiveness of education systems at the national level (p. 7).

PISA is administered by a PISA Governing Board, which includes representatives from the governments of the participant countries, and which takes the key decisions concerning PISA's goals, content, procedures, etc.

PISA thereby comprises a different type of survey and research from that with which we are traditionally familiar from the universities. It is a commissioned, research-based survey containing questions formulated by the commissioners, and with some set frameworks, but with rather extensive freedom with regard to how these frameworks are filled (e.g. the formulation and choice of test items). Such surveys have become quite common in the research world, such as in the form of evaluations and memoranda, but they differ in crucial areas from the free research of the universities. Many of the associated debates and decisions, for example, take place in relatively closed groups, with strong influence from the administrative layer of the ministries, and thereby with the fingerprint of the present government. It is thus a blend of research, investigation, evaluation and educational policy.

The results which PISA has published, first and foremost in the form of the so-called league tables which rank countries according to the performance of their young people, have also been used in many other countries as arguments for fundamental alterations in their educational systems. In Denmark, with direct reference to the poor results in PISA 2003, the government introduced a wide range of school tests, albeit under strong protest from the teachers' organisations and education researchers (Dolin 2007). Teaching methods and so-called progressive education were identified by leading politicians as the cause of the disappointing results, which gave rise to a back-to-basics wave and greater emphasis on strong school leadership.


98 JENS DOLIN<br />

Critical choices, reliability and validity

Any evaluation requires a number of theoretical, practical and methodological choices in order to ensure the production of the results necessary to fulfil its goals. These choices are taken at various points in the PISA system on the basis of compiled foundation documents (often of a political or scientific nature). The choices relate to questions of framework and content, such as the relationship with other surveys and test item design, and they bear directly on the validity and reliability of the evaluation. Such fundamental choices set the limits of the survey's usefulness and predictive power, and define its methodological standard.

In a comparative test, reliability is crucial. Irrespective of what you measure, it must be measured correctly. You must be certain that the various countries are appraised in the same way, so that their ranking in the final evaluation will not be open to question. Reliability-related problems include, for example, sampling procedures and the scoring of responses. The most fundamental questions, however, relate to the survey's validity – the extent to which the chosen design can measure what you are actually interested in. There is a gradual transition between problems of reliability and problems of validity, so the divisions between them are as much questions of organisation as of content.

We will begin with some of the critical choices, then review a number of apparently technical, reliability-related issues, and finally use the more fundamental validity problems as the transition to a discussion that places the issue in perspective.

Critical choices

An international survey must position itself within the range of comparative tests with regard to its aim, content, target group, etc., and it must possess a design in accordance with this positioning. Some of the choices taken will have consequences for the survey's reliability and validity; certain aims, for example, call for certain test forms in order to ensure that the test is in accord with those aims and thereby valid. But the testing must also be of a type that enables a high degree of reliability. These two considerations can be difficult to unite, and the reliability consideration is often given the higher priority.

Lack of comparability with earlier surveys

In the case of PISA, no links have been established with earlier international surveys (particularly those undertaken by the IEA), which makes comparisons very difficult.

It is regrettable that PISA did not link the survey to prior surveys, for example by including some test items of the same type as were included in TIMSS (which was curriculum-based). This would have enabled comparisons over time, as well as comparisons between tests with different testing purposes. Is there, for example, agreement between the results of a curriculum-based test and those of a more general "fit-for-life" test?

Secondary analyses, however, have revealed quite significant correlations between the two surveys. Lie and Olsen (2007) compared science results for the 22 countries that had participated in both TIMSS and PISA, and found that the correlation between the scores in the two studies was as high as 0.95 at the country level.
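To make concrete what such a country-level correlation is, here is a minimal sketch of the computation; the national mean scores below are invented for illustration and are not the published TIMSS or PISA figures.

    import numpy as np

    # Hypothetical national mean scores for the same eight countries in two
    # surveys (invented numbers; Lie and Olsen's analysis covered the 22
    # countries that took part in both TIMSS and PISA).
    timss = np.array([520, 495, 530, 470, 505, 540, 480, 515])
    pisa = np.array([510, 490, 525, 465, 500, 545, 475, 505])

    # Pearson correlation at the country level.
    r = np.corrcoef(timss, pisa)[0, 1]
    print(f"country-level correlation: {r:.2f}")

Note that a high correlation between country means says nothing, by itself, about agreement at the student or item level.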

Whether this high level of consistency between different forms of measurement is good or bad is quite a delicate question, to which we will return.

Year sample instead of class sample

By selecting a representative sample of a given year's school students, we can illuminate whether society receives "value for money" from the educational system as such: Does our educational system adequately equip young people for the future? (Assuming that such "future-preparedness" can actually be measured, which is a question to which we will return later.) How many of our schools' students have which particular skills? And so on. This takes place at a highly aggregated level, where something can be said, for example, about sociocultural differences or the distribution of the results across a year group, and where some general issues can be identified which the educational system is failing to address satisfactorily. In addition, a year-based sample provides a good overview of a given school year, and the size of the sample provides an opportunity to compare different parts of the educational system.

However, if we wish to know something about the educational system which can be used to change it, we must examine the places where the education actually takes place – which is to say the classroom and the school. The problem with PISA here is that the test does not illuminate the teaching conditions which, in the final analysis, are responsible for the measured results. Data collection at this level, with, for example, entire classes representing a school, would provide an opportunity for teaching-related comparisons. The students tested would certainly have been exposed to different forms of teaching, but the Danish model, under which teachers are often permanently assigned to particular groups of students throughout the years, would enable meaningful correlations to be drawn between teaching variables and output.

Problems with the selected statistical model

The fundamental problem for all comparative evaluations is how to safeguard comparability between different cultures and educational systems. The statistical side of this process is addressed in PISA by choosing a psychometric model which assumes that differences between systems can be ascribed to variation along a single scale. PISA has chosen to rely on a technique known as "Item Response Modelling", despite the absence of (published) theoretical considerations concerning what the choice of this model might mean. The problem with this model is that it permits only one-dimensional variation along the chosen scales, and thereby risks overlooking differences between countries lying outside the scale in question. As the technical report says: "An item may be deleted from PISA altogether if it has poor psychometric characteristics in more than eight countries (a dodgy item)" (Adams and Wu 2001, p. 101). If a particular test item does not fit the one-dimensional model – i.e. it gives very different results in several countries – it is omitted, even though the reasons why it gives different results might be an expression of variation in a dimension other than the one the relevant scale is designed to measure. Potential information can thereby be suppressed, or, to put it another way: in its efforts to avoid cultural bias, PISA neglects cultural differences – the very differences that it would have been interesting to identify as explanations for the observed variations in performance between the various countries.

As Harvey Goldstein puts it:

"Perhaps the major (concern) centres around the narrowness of its focus, which remains concerned, even fixated, with the psychometric properties of a restricted class of conceptually simplistic models. ... It needs to be recognized that the reality of comparing countries is a complex multidimensional issue, well beyond the somewhat ineffectual attempt by PISA to produce subscales. With such recognition, however, it becomes difficult to promote the simple country rankings which appear to be what are demanded by policymakers." (Goldstein 2004, p. 328)

The price of such item homogeneity is that cultural differences are erased at the "profile level". It would in general have been useful to have some clearer considerations on the appropriateness of the chosen model.
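To illustrate the mechanics behind this kind of "dodgy item" screening, the following is a minimal, self-contained sketch of my own (not PISA's operational procedure, which uses more elaborate fit statistics): it simulates response data in which one item behaves differently across countries, estimates a crude Rasch-style difficulty per country, and flags the item once it misfits in more than eight countries. All numbers are invented for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    def simulate_country(n_students, difficulties, shift_item0=0.0):
        """Simulate 0/1 responses under a one-dimensional Rasch model.
        shift_item0 adds a country-specific difficulty shift to item 0
        (differential item functioning)."""
        theta = rng.normal(0.0, 1.0, size=(n_students, 1))  # latent abilities
        b = difficulties.copy()
        b[0] += shift_item0
        p = 1.0 / (1.0 + np.exp(-(theta - b)))              # Rasch success probability
        return (rng.random(p.shape) < p).astype(int)

    def item_difficulty(responses):
        """Crude per-country difficulty estimate: logit of proportion incorrect."""
        p_correct = responses.mean(axis=0).clip(0.01, 0.99)
        return np.log((1.0 - p_correct) / p_correct)

    true_b = np.linspace(-1.5, 1.5, 10)  # ten items with evenly spread difficulties
    # Twelve countries; in nine of them item 0 is one logit easier or harder
    # (alternating sign), so it misfits relative to the joint calibration.
    shifts = [(-1.0) ** i if i < 9 else 0.0 for i in range(12)]
    data = [simulate_country(2000, true_b, s) for s in shifts]

    est = np.array([item_difficulty(r) for r in data])   # countries x items
    deviation = np.abs(est - est.mean(axis=0))           # from the joint calibration
    misfit_countries = (deviation > 0.5).sum(axis=0)     # countries where an item misfits
    dodgy = np.where(misfit_countries > 8)[0]            # mimics the ">8 countries" rule
    print("misfitting countries per item:", misfit_countries)
    print("items that would be dropped:", dodgy)

The sketch makes the point visible: the flagged item is precisely the one carrying systematic between-country differences, and deleting it removes exactly that information from the survey.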

Is PISA authentic?

Authenticity is one of the crucial elements in PISA.

Figure 1: The pizza item

Let us, for example, examine the "pizza" test item from a set of published pilot items that were stated to be representative of PISA's mathematics questions (Figure 1). On the surface it appears to be an everyday situation (at any rate for city-dwellers in Western Europe), but it has been made abstract by the use of an unknown currency and "nice" numbers. It is in fact a disguised mathematics problem.

I wonder how students who are used to ordering from their local pizzeria would reply to such a question using realistic (and known) numbers. But this touches on the fundamental – and conflict-ridden – academic debate on whether mathematics should be taught as a closed, deductive system or as "realistic" mathematics. Those who belong to the first school tend to formulate questions which test the ability to perceive mathematical structures in everyday examples, while the other school prefers to focus on the skill of "managing" everyday situations – irrespective of whether approved methods are used. If you aim for the latter, it would be correct to say that the more realistic a test is – the more it is designed to reflect actual everyday situations – the less it makes sense to compile a globally comparable test! It is quite simply a fundamental conflict of principle, in which the choice of test questions reflects a particular academic and pedagogical attitude.

In PISA, it is as though backward reasoning has been used in the formulation of many of the test questions: here we have a set of school subjects – biology, physics, geography, chemistry, etc. Where can the students apply this knowledge? Where in the real world are there situations that involve the use of this knowledge? Instead of starting (authentically!) with some realistic everyday situations – the consumer, the manufacturer, the citizen, leisure activities, etc. – and then choosing some in which scientific insight might play a role. It is fair to say, though, that this would be a very difficult agenda to set up, owing to the special character of science. In everyday life we use the known and the experienced to explain the unknown. In science it is the reverse: there you explain the well-known with abstract, invisible and never-experienced concepts. It is a huge pedagogical and didactical challenge to make these two ways of knowing meet.

In this connection it is also characteristic that the answers must be based on the information supplied in the test question, which must not be combined with the students' own knowledge of the subject (see, for example, Svendsen 2005). In order to do well on a test item, it is at least as important to understand test logic as to know the subject. You have to know how tests are scored, how to optimise your answering strategy, etc. Greater familiarity with tests probably yields a higher score.

PISA's results, like those of all evaluations, are dependent on the evaluation context, both with regard to the formulation of the specific questions and with regard to the context in which the test items are solved. As an example, Kjeld Kjertmann (2000) shows how readers who have done well in a standard word reading test (US64) achieve very different results in reading tests which involve meaningful texts.

The question of reliability

The main question here is whether PISA lives up to its own premises from a test-technical point of view. Is the test performed "properly", i.e. in conformity with recognised test standards?

The reliability of PISA is probably as high as is practically possible in such an extensive survey. For each round, a Technical Report is published containing thorough documentation of the procedures used in all phases of the survey, which gives the impression that the test has been undertaken competently in every respect. In the case of PISA 2000, this is the Technical Report 2000 (Adams and Wu 2001), which outlines how the test was compiled and pilot-tested, how the respondents were selected, how the data were collected and processed, etc. The reliability of the data and processes was evaluated in all respects, and special reliability studies were also undertaken. In one such study, in which the national test scorers' scoring of reading questions was compared with that of a PISA consortium official (a so-called "verifier"), there was agreement between the OECD's verifiers and all four national scorers in 78 % of instances (p. 174). There was agreement with a majority of the national test scorers in 91.5 % of instances. However, the results revealed a large degree of scoring variation between both questions and countries. The marking of some questions showed an inter-country agreement rate of less than 0.80 (Technical Report, p. 175), and some countries showed an inconsistency rate of more than 50 % in the marking of certain questions (Technical Report, p. 177). The overall consistency rate of the individual countries varied from 80.2 % (France) to 96.5 % (New Zealand) (Technical Report, p. 178).
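The two agreement figures are easy to mis-read, so the following sketch shows, with invented data, how "agreement with all four national scorers" and "agreement with a majority" can be computed from a table of verifier and scorer codes; the 93 % per-scorer accuracy is an assumption chosen only to yield figures of a plausible magnitude.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical codes (0 = incorrect, 1 = partial, 2 = correct) for 1,000
    # student responses: one verifier and four national scorers per response.
    verifier = rng.integers(0, 3, size=1000)
    scorers = np.array([np.where(rng.random(1000) < 0.93, verifier,
                                 rng.integers(0, 3, size=1000))
                        for _ in range(4)])              # ~7 % scoring noise each

    match = scorers == verifier                          # 4 x 1000 agreement matrix
    all_four = match.all(axis=0).mean()                  # verifier agrees with all four
    majority = (match.sum(axis=0) >= 3).mean()           # verifier agrees with 3 or 4
    print(f"agreement with all four scorers: {all_four:.1%}")
    print(f"agreement with a majority:      {majority:.1%}")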

There were major variations in reliability across the various areas. In the "soft" data (background variables), reliability was significantly lower than in the test items. The reliability of measures of the quality of school resources (a subset of "physical infrastructure") was, for example, 0.70 for Denmark (Technical Report, p. 250). It is difficult to see how the figure of 0.70 was arrived at, but it is probably based on measurements of employees' classifications of the same answer. No account is taken here of the validity problems involved in questions such as "What is your father's occupation?"; an answer such as painter, teacher or office worker can mean quite different things, despite the fact that the PISA scorers classified them as identical. In Denmark, however, we have the opportunity to check such answers via data pooling.

One can always discuss whether an overall reliability rate of 92 % is good or bad, but the survey gives the appearance of being scientifically correct. As the Danish Minister of Education, Bertel Haarder, put it: when so many international experts have participated, it must be satisfactory. But as with all statistics, these figures have been collected in a particular way for a particular purpose, and in any survey, statistics can only describe a (limited) part of the issues and phenomena dealt with by the survey.

A number of education researchers and statisticians have also criticised both the theoretical background and the technical implementation of PISA. Noteworthy in this context has been the debate between Professor Prais of the National Institute of Economic and Social Research in London and Raymond Adams of the international PISA consortium (Adams 2003; Prais 2003; Prais 2004), and the critique by Harvey Goldstein, Professor of Statistical Methods at the Institute of Education, University of London (Goldstein 2004). It would be going too far here to undertake an in-depth analysis of these criticisms, which would require a rather advanced familiarity with statistical theory; the following should therefore mainly be seen as a summary of the problems identified by various authors in the technical and design-related aspects of PISA.

Translation problems

Once the test items have been selected, they must be translated into the various national languages. As the questions have often originally been formulated in English, the translation must often be worded in a more complex manner in order to convey the precise meaning. It is generally recognised that in order to represent the full meaning of a text originally produced in a foreign language, one often has to reframe and paraphrase, which produces a number of awkward and clumsy sentences. The Danish version of the test suffers to some degree from this awkwardness.

The translation also results in a number of inevitable inaccuracies, the effects of which are impossible to assess. In a questionnaire directed at school principals, for example, the English term "assessment" was translated into Danish as "standpunktsprøver" ("proficiency tests"), which has a different meaning.

The occurrence of translation problems, inelegant style and imprecise meaning causes a drop in reliability.

Measuring scale errors (lack of chronological comparability)

The Danish statistician Peter Allerup of the Danish School of Education has demonstrated that the comparability between the individual cycles, which forms an important part of PISA, does not hold, because different measuring scales were used in the two surveys (Allerup 2005).

In the scaling technique utilised by PISA, the average score of each student across all questions is not first calculated in order to assess the average score of all the students; instead, the latent item difficulty is calculated by examining the students' simultaneous item responses, i.e. the same student's answers to all the questions. In PISA, these are termed the "item parameters". By undertaking a so-called Rasch statistical analysis of all the students who answered the same question, it is then possible to see how the latent item difficulty is distributed in different surveys. It is a prerequisite for comparability that the relative level of difficulty is fixed.

Figure 2: measuring scale differences

Figure 2 shows the relative difficulty of 22 common reading test items in the years 2000 and 2003. As can be seen, the relative level of difficulty is not fixed, i.e. the same measuring scale has not been used in both cases (had it been, the lines would be vertical).

A student with a particular level of skill is awarded points as he moves to the right on the scale, i.e. as he solves test items with a greater level of difficulty. It can be seen that changes in latent item difficulty between the two test cycles produce different scores for the same average student. An above-average student with an item parameter of 0.7, for example, would be one who could solve 18 of the 22 common reading tasks in 2000, but only 16 of the same 22 items in 2003. Summed over the 22 test items, these deviations amount to a difference in latent student scores of approximately 11 scale points between the 2000 and 2003 surveys.
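The type of check Allerup describes can be illustrated with a minimal sketch (my own, with invented item difficulties): under a Rasch model, if the difficulties of the link items drift between two cycles instead of shifting by a single constant, a student of fixed ability is expected to solve a different number of the same items in the two cycles.

    import numpy as np

    rng = np.random.default_rng(2)

    # Hypothetical Rasch difficulties (logits) for 22 link items in two cycles.
    # Under scale invariance, b_2003 - b_2000 would be one constant for all items.
    b_2000 = np.linspace(-2.0, 2.0, 22)
    b_2003 = b_2000 + rng.normal(0.2, 0.4, size=22)  # item-by-item drift, not a constant

    def expected_raw_score(theta, b):
        """Expected number of the 22 items solved by a student of ability theta."""
        return (1.0 / (1.0 + np.exp(-(theta - b)))).sum()

    theta = 0.7  # the 'above-average' student of the example
    print(f"expected raw score in 2000: {expected_raw_score(theta, b_2000):.1f} / 22")
    print(f"expected raw score in 2003: {expected_raw_score(theta, b_2003):.1f} / 22")
    # When both cycles are nevertheless reported on a common scale, a raw-score
    # gap of this kind is converted into a shift of several reported scale points.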

Corresponding analyses can be undertaken with regard to gender and ethnicity. Changes in the difficulty of test items for boys and girls respectively accumulate to a scale advantage for girls of 8-10 points at the weak end of the scale (and of just 1-2 points at the strong end). Students whose Danish is poor receive a scale-conditioned advantage over ethnically Danish students amounting to approximately 12 scale points.

Eleven to twelve scale points is quite a lot. In the PISA 2000 scientific literacy test, this would be enough to lift Denmark from the group of countries scoring statistically significantly below the OECD average into the middle group of countries.

Validity

In my opinion, the validity problems of PISA are more fundamental than its weaknesses in technique and reliability.

A test can only measure what it can capture with the given test design. What the test says about the test subjects is one thing; what information can be derived from the test results about the educational system which has educated these students is something quite different. It is thus quite a complicated and extensive task to provide an adequate analysis of the validity of an international comparative test. Accordingly, a validation of PISA implies a mixture of test design analysis and comparisons between the test and the national context. There are questions regarding what one might term internal validity: Does PISA Science 2006 really measure what it is intended to, namely scientific literacy? This question has two parts: How well does the concept of scientific literacy proposed in PISA correspond to other generally accepted concepts of literacy, and to what extent can the test items and the test concept measure the proposed literacy concept?

The starting point for the PISA 2006 science test is the so-called Framework compiled by the Science Forum, a group of science researchers from the participating countries, and the Science Expert Group. Here, scientific literacy is defined as:

Scientific knowledge and use of that knowledge to identify questions, to acquire new knowledge, to explain scientific phenomena, and to draw evidence-based conclusions about science-related issues;

understanding of the characteristic features of science as a form of human knowledge and enquiry;

awareness of how science and technology shape our material, intellectual, and cultural environments; and

a willingness to engage in science-related issues, and with the ideas of science, as a reflective citizen.

(Doc: ScFor(0407)1, OECD 2004)

This concept is quite similar to other concepts of scientific literacy, as revealed in the first inspection report from an ongoing validation project (Dolin et al. 2006), so we can state with some assurance that PISA aims to test scientific literacy. It is worth noting, however, that the definition of scientific literacy changed considerably from PISA 2003 to PISA 2006. The 2006 definition places more emphasis on knowledge about science and on students' attitudes towards science. The incorporation of attitudinal aspects is handled by placing separate cognitive and attitudinal items in the same unit; by doing so, however, the possibility of testing situational interest is forfeited.

A more fundamental question is: What scientific knowledge do young people need later in life, and is this what is tested? No real analysis of this question has been undertaken by the Science Forum. Instead, the Forum has looked at the existing school curriculum and existing school traditions, and has considered which parts of these could be regarded as relevant for the young person's future life. On this basis, it produced the model of scientific literacy shown in Figure 3.

The level of literacy is thus tested via four coherent aspects, namely the answers to these questions:

What contexts are suitable for testing 15-year-olds?
What competencies are necessary for 15-year-olds?
What knowledge is it reasonable to expect 15-year-olds to have?
What affective responses is it reasonable to expect from 15-year-olds?

These four questions were thoroughly processed by the Science Forum, which undertook a mixture of academic and educational-policy weighting of the different interests. The cognitive aspect was weighed up in relation to the affective, and the various academic areas were weighted in terms of percentages in the test areas. The extent to which people in the individual countries feel that the result covers what young people might be predicted to need in their adult lives is a matter for the individual countries to assess. An analysis of PISA's framework in comparison with future demands for knowledge management, multimodality and innovation points out PISA's lack of broader contexts and more future-proof categories (Dolin 2005).

Figure 3: Scientific literacy framework

The fundamental question regarding validity is whether one can reasonably claim that sitting with paper and pencil and (casually) answering questions about imaginary situations has anything at all to do with competencies in the sense in which we normally understand them. I will return to this fundamental question later, but many of the test questions that have been published can hardly be said to test appropriate everyday actions, let alone the willingness to engage in science-related issues, and with the ideas of science, as a reflective citizen; rather, they test the students' general ability to make deductions and form hypotheses, evaluate evidence, etc. – in other words, a number of school-specific skills which, according to the logic of school, can be used later in life. And this aspect is tested very well! Seen in this light, many of the questions are diagnostically strong, inasmuch as a great deal of work has been done to investigate the use of particular cognitive processes. Let us examine a couple of examples.

Problems with test item formulation

Although the test items have been formulated in conformity with a detailed framework and subjected to a quite comprehensive selection process, there are still a few duds. It is hard to formulate "good" questions, as all teachers know, and even though only one third of the test items made it through the process to the pilot test, and even though all countries had the right to object, there will always be some less appropriate items. I will mention just one; Inge Henningsen has provided a more detailed criticism in MONA 2005 no. 1 (Henningsen 2005), and Lars Svendsen criticised some of the other published test items in the Danish newspaper Politiken on 13 January 2005 (Svendsen 2005).

Figure 4: Walking

In the test item "Walking" from the 2003 mathematics set (Figure 4), the length of the stride is indicated for the first step, but it is clearly apparent that the second step is quite a bit longer; so the length of the stride should in fact be defined as the average length of the measured strides. What is worse is that the formula provided is pure nonsense: according to the formula, larger stride lengths mean faster strides, which contradicts our experience.
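To spell out the objection: as I read the released item, it defines n as the number of steps per minute and P as the pacelength in metres, and asserts the relation below; taking walking speed as steps per minute times metres per step then makes the problem explicit.

    \[
      \frac{n}{P} = 140
      \quad\Longrightarrow\quad
      n = 140\,P
      \quad\Longrightarrow\quad
      v = n \cdot P = 140\,P^{2}\ \text{metres per minute.}
    \]

By this formula, walking speed grows with the square of the pacelength, so a longer stride always means a much faster walker; in reality, longer strides typically go together with a lower step frequency.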

Cultural bias

Despite careful attention on the part of the question compilers, it is impossible to avoid a certain amount of cultural bias. Test items which require the student to read between the lines, drawing on cultural background knowledge, are managed more easily by ethnic Danes than by Danish students from ethnic minorities. One could naturally argue that all students ought to be able to manage even culturally determined tasks, and there is a certain logic to this, given that they must be able to manage life in a (post-)modern society. But in that case, one cannot simultaneously accept that PISA aims at smoothing out cultural differences while measuring cultural deviations from a West European norm. This also applies to gender, regarded as culture.

Let us examine the racetrack test item (Figure 5) from PISA 2000.

Figure 5a: racetrack item
Figure 5b

The test item seems realistic and meaningful (at any rate to me as a man). Unfortunately, the question cannot be solved. Based on the number of curves, the lane must be B, C or D. If we look at the position of the start (at the conclusion of a straight stretch), it should be lane D, but as the first curve is followed by one which is sharper and one which is less sharp than the others, it must be lane B! What is more interesting, however, is that the responses show a gender-based imbalance:

Greece, girls: 8 % correct
Portugal, girls: 10 % correct
Australia, boys: 43 % correct
Switzerland, boys: 46 % correct

What do these answers really reveal? Are they perhaps more a reflection of society's socialisation, i.e. gender-specific interest, than of what the school has taught students (concerning the ability to create a graphic representation of movement)? Or are the girls simply so bright that they can see there is no solution?

There can hardly be a standardised version of "everyday life" valid for the whole world, and the issue of what can be regarded as everyday mathematics or science is the subject of considerable debate. Do all young people really need to learn the same things? Do we all need to know the same things in order to be "fit for life"? And should they be evaluated in the same way?

PISA and the Danish educational goals

The next general question is: Does this framework harmonise with Danish educational goals, for example as expressed in the Common Goals statement of the Ministry of Education (http://www.faellesmaal.uvm.dk/)? The answer is both yes and no. Dolin et al. (2006) have undertaken a thorough analysis of the intentions of the PISA survey and compared these with the goals of Danish education as formulated in the Common Goals. The report concludes:

To summarise, we could say that PISA's scientific literacy framework covers key parts of the formulated aims and mentality-related goals of Danish scientific school subjects. The greatest lack is the emphasis placed by Danish scientific subjects on students' practical and field work, which is not included in PISA. This also means that a number of personal qualities, such as imagination and inquisitiveness, are not tested. It is also important to point out that the personal and affective aims of science teaching are given considerable emphasis in the Danish aims and goals, while these comprise only a minor part of the overall PISA test in science.

In addition, the PISA competencies primarily relate to cognitive skills, whereas Danish goals are more holistic and aim to encourage independent problem-solving, which naturally also involves cognitive skills, but in interplay with other abilities.

The PISA framework thus covers some of the Danish goals, but far from all of them, and perhaps not even the ones which many Danes would regard as the most important, such as democratic culture, social skills, personal development, etc. Here, I feel, lies one of the key reasons for the opposition to PISA: many opponents criticise PISA for failing to test what they regard as important, but at the same time overlook what PISA does in fact test. Similarly, many of PISA's supporters focus on what PISA tests, and perhaps fail to place this in relation to what PISA does not test. The exciting question is whether there are correlations between the two areas; answering it would demand an actual field validation, i.e. a concrete examination of the PISA-tested students with the aid of evaluation methods other than PISA's. A Danish research project is doing just that at the time of writing (Dolin et al. 2006).

All in all, a wide range of important validity problems appears when we ask what it is that PISA measures, and what the actual skills of Danish students are in the areas tested by PISA. One should thus be extremely cautious about drawing overly hasty or overly firm conclusions from the PISA results.

Against the background of these considerations, I would recommend that, in the case of extensive surveys such as PISA, more aspects of the survey design and its context be taken into consideration when assessing the test's validity and consequences.

A wider view of validity

In the following, I wish to present a broader and more differentiated view of the validity issue in order to further define the problems that can arise in connection with a survey such as PISA. The issue of validity will be examined in relation to:

– the structure and design of the actual test apparatus in relation to the questions posed
– the range which defines the test's area of validity or generalisability
– the foundation upon which the test's fundamental assumptions are juxtaposed with the field's dominant assumptions.

Validity in relation to the test design

A test can naturally only measure what it is designed to measure, so the first and most fundamental validity evaluation must clarify whether the test's design is in accordance with its aims. Does the PISA test measure scientific literacy? As mentioned, I would be concerned if the idea of literacy were restricted to something that can be measured with paper and pencil, sitting at a desk in a gymnasium. The prevalent approach to literacy operates with a significantly broader view of the concept – typically the ability to manage everyday situations in which the necessary actions or considerations demand scientific insight (Roth & Désautels 2002). PISA's concept of literacy attaches great importance to deductive skills on the basis of given premises, which comprise a subset of the prevalent approaches to literacy, and such abilities are excellently tested in quite a few of the PISA test items. However, is it not the case that the more the test items and the test situation are shorn of their context and removed from ordinary everyday life, the more we tend to test levels of general intelligence? It gives one food for thought that a recent study demonstrates high consistency between performance in PISA 2003 and TIMSS 2003 (Lie & Olsen 2007). A comparison of the results for the 22 educational systems participating in both tests shows a correlation between the scores as high as 0.95 at the country level. Despite differences in focus (scientific literacy vs. curriculum test), the test results are very much the same. So perhaps the PISA test does not reflect its different focus sufficiently. PISA might have a good definition of scientific literacy, but the test items and the whole test set-up are too close to a traditional curriculum test.

Achieving complex goals will often require a blend of multi-dimensional skills, the integration of academic and personal/social skills, and the utilisation of several academic areas and subjects. Such complex goals can only be evaluated with the help of complex forms of evaluation. Much work has been done on developing process-oriented and complexity-capturing forms of evaluation (e.g. logbooks, portfolios, project reports), but these are, naturally enough, more difficult to carry out and more time-consuming, and they need to be learned, for which reason they are more costly than traditional written tests. The better an evaluation is at capturing complex skills, the more difficult it is to present the results in the form of simple, comparable data.

This brings us back to the traditional dilemma between undertaking an evaluation with a high degree of validity, which is costly to carry out and which, because of its complexity, will have low reliability, and an evaluation of simple factors, which is capable of measuring with high reliability but in which the level of validity is relatively low.


Generalisability

One cannot generalise test results beyond their area of validity. It would thus seem unreasonable, on the basis of a test of deduction and calculation, to generalise about general abilities and skills in science. A test is a very specific communicative situation in which students must answer questions in writing, under time pressure, and without the help of an interlocutor to adjust their understanding of the problem. As far as I am aware, no survey has been undertaken of the relationship between such problem-solving skills and the ability to manage later in life in situations with a scientific content.

Nonetheless, PISA measures something, and the measuring apparatus provides a fine-grained scaling of the students. There is, for example, a correlation between the PISA reading results and later educational achievements. An analysis of Danish school students who participated in PISA 2000 showed that the young people's educational position four years after elementary school was primarily determined by their reading skills and their academic self-image in the ninth grade (as established by the PISA test) (Pilegaard Jensen and Andersen 2006). Such a correlation need not, however, indicate a direct causal relation; it may rather reflect some general relationships, probably attributable to social background, which are revealed by PISA. But we also know that of the 17 % of Danish school students who were designated functionally illiterate on the basis of PISA 2000, 20 % later completed an upper secondary or vocational education. Many of these were, in other words, capable of coping with relatively high demands on reading and comprehension. This implies that we can draw no clear conclusions with regard to the generalisability of the PISA test, and it thus begins to resemble soothsaying, to put it mildly, to rank countries by the supposed ability of their school students to manage in the future, as in Figure 6, taken from the Danish PISA 2003 report.

Figure 6: Percentage of students prepared for the 21st-century labour market (Source: Mejding 2003, p. 132)

The skill requirements of the future are difficult to predict, and an exaggerated re-traditionalisation of the school system might well come at the expense of explorative, creative, communicative and playful skills, and many other skills on which the digital society of the future might come to rely – and which PISA does not test.

Naturally, this must not divert our attention from the problem that an unreasonably large proportion of Danish youth have poor reading abilities – something which it is good that PISA documents. But is it reasonable to conclude, on the basis of the PISA data, that three quarters of the school students in Finland are ready for the labour market of the 21st century, while this applies only to just under half of the Norwegians? And to what extent would it enhance the students' future-preparedness to improve their skills in traditional cultural techniques, if this requires learning decontextualised skills? In this connection, it is of crucial importance to find a reasonable balance between fundamental subject-related skills and social/personal skills, and to express this balance in relevant contexts.

Fundamental assumptions

The question of validity is closely linked with the fundamental assumptions and values upon which the test is based. If, for example, a test builds on the premise that knowledge is an objective quantity, independent of context, it might be meaningful to attempt to test the presence and extent of this knowledge in individual students in neutral contexts. And if you define competence as the ability to solve items in a test, then you can call what is measured competencies. If, on the other hand, we view knowledge as a social construction in actual contexts, such a test set-up might amount to a valid measurement of school knowledge – but not in any respect a measurement of knowledge useful in everyday life, let alone of competencies.

Consider, for example, the following view of competence and knowledge, from the perspective of situated cognition (St. Julien 1997):

Competence, understood as the ability to act on the basis of understanding, has been a fundamental goal of education. But it is a painful fact of educational life that knowledge gained in school too often does not transfer to the ability to act competently in more "worldly" settings.
...
From the viewpoint of situated cognition, competent action is not grounded in individual accumulations of knowledge but is, instead, generated in the web of social relations and human artefacts that define the context of our action.

This view of knowledge and competence shifts the focus of competence assessment from examining individual knowledge to examining authentic activities in social contexts. In the Nordic countries, we have built up a view of knowledge in an educational context which attempts to combine the process-oriented view of knowledge expressed by constructivism with the more absolute view of knowledge expressed by science. We also work to a great extent on the basis of a socio-cultural view of learning, i.e. in educational contexts we tend to emphasise the ability of individual students and the group to work towards their own view of knowledge, which then gradually approaches that of established science.

There is no room for such a view of knowledge in the PISA format. Here, concrete questions are asked to which the answer to most items is either correct or incorrect (or at most correct, partly correct, incorrect). Such questions are naturally also asked in Danish science teaching – and it is obviously important to be able to answer them – but they are not the most important questions, as the aim is to build up the students' general scientific understanding. However, it is not possible to train test scorers to assess whether a student is on the right path. In PISA, certain premises are typically presented within a specified frame, and the student is then expected to apply particular knowledge or a particular process to this frame, accepting the given terms. This is a very Anglo-Saxon approach. In a constructivist context, one would instead emphasise the students' ability to draw up frames and premises themselves, and the ability to formulate the actual problem as part of the solution. The "Walking" test item in Figure 4 would, for example, be formulated in a completely different manner under a constructivist view of education; in that case the students would be required to measure the stride length themselves and then attempt to work out the relationship between stride length and speed, evaluating whether or not it is reasonable. It would be this ability to structure the problem that was primarily tested, rather than whether the student is capable of inserting figures into a given formula (which they must naturally also be able to do).

The critical point, however, is that the actual test format itself rules out questions which are too open, and students who display independent thought, e.g. by exceeding the test item's premises or drawing upon knowledge other than that provided, risk being penalised (Svendsen 2005).

Seen in this light, the PISA test seems epistemologically conservative, and consequently more of a measuring rod for idealised skills than a tool for promoting education centred on the learning process.

Tiberghien (2007) advocates research studies on test design similar to those which have led to the development of research-based teaching sequences. Such studies would allow item construction in close connection with the desired learning process of the students, and thus provide a didactical foundation for more fine-grained scoring.

Consequences for educational policy

The results of an evaluation provide a basis for certain decisions, but it is important that these decisions do not exceed what is actually justified by the test. Accordingly, it is interesting to consider how the results of a test such as PISA can be used – and abused.

PISA in the media

The media debate following the publication of an international test often rests on a very uncertain foundation, and experience from the publication of the PISA 2000 and PISA 2003 results indicates that the loose claims advanced in the initial hectic media coverage tend to remain the main impressions of PISA. First impressions last. They thus become the truths upon which the educational debate is based in ensuing years.

In a media society, a media image can have a direct influence on political decisions. When the media construct a particular view of reality, many politicians feel obliged to act on this basis.

Figure 7: Denmark gets too little value for money from its education budget (Source: Arbejdsmarkedspolitisk Agenda (The Danish Employers' Confederation), 7 April 2005)

See, for example, the juxtaposition by the Danish Employers' Confederation of the PISA results with educational spending (Figure 7). Here a comparison is made between Denmark's ranking in PISA and its expenditure per student – and, by implication, the quality of its educational system. In this ranking, Denmark ends up in third-last place among a range of OECD countries when the Danish PISA score is compared with its educational budget. Denmark pays an average of EUR 1,000 per student to achieve just over six PISA points, while the Germans, for example, obtain ten points for the same price. The conclusion is clear. However, this analysis disregards the fact that Denmark obtains much more from its expenditure on education than PISA points.

The politicians ask whether we get "value for money", and they are accustomed to measuring value in terms of figures in columns. If the results are too low, we need more tests and measurements, and we must introduce economic rewards, grading systems, etc. The end result may well be that schools and classes alter their teaching in such a way that students become better able to manage PISA test items, but at the expense of the less detectable results of education. I do not claim that there is a direct contradiction between these two things, but with the limited time and resources available, it is a delicate matter to maintain the existing values while at the same time gearing the system to meet a number of specific requirements. However, the possibility cannot be ruled out that it could be a fruitful process.

Measurable factors as parameters of quality

In general, it would be well to exercise caution when measuring and assessing something as complex as human behaviour with figures, not to mention a country's overall performance – and in particular, when using figures procured by measuring only a limited part of the overall area. This is extreme reductionism, and an example of how one of the central scientific tools for creating knowledge – the ability to practise reductionist methods – should be used with caution outside the domain of science itself.

There is a major risk that the factors which are measurable via the test in question become the norm-setting parameters of quality, while the remainder of the large and complex educational picture imperceptibly slips out of view. This would have serious consequences for the entire educational system, including the priorities of individual schools and teachers.

We risk harmonising away the very qualities that we have built up over generations and which may be the key to our survival in the globalised world of the future. A process of cultural uniformity and harmonisation of values is occurring on the basis of the contemporary mainstream. In an interview in the Danish newspaper Information (Thorup 2005, 20 March), Microsoft CEO Steve Ballmer expresses the company's winning strategy as: "I want the whole world to be Danish." This is followed up by Mikael R. Lindholm, a member of the Innovation Council's strategic planning group, who says:

The welfare system helps to create some highly committed, dynamic, inquisitive and competent people in Denmark. And these are precisely the qualities from which we benefit, and of which the rest of the world is very envious.
[...]
But Denmark shows too little interest in these special, culturally determined competencies that the rest of the world covets. Instead, the government is trying to harmonise our strengths out of the educational system.

It is a notorious fact in educational research that the more important something is, the harder it is to see and measure!

Evaluation as construction of an area

It is well known that educational evaluations have a strong backwash effect on teaching: 'teach to the test', as it is known. This is in itself a reasonable and desirable process, if the evaluation is sensible and reflects the goals of the educational system. However, it is problematic if the evaluation fails to accord with the foundation of the educational system and its overall goals, but is instead undertaken to support a number of ideological aims.

Figure 8. The utilisation of evaluation (From: Dahler-Larsen & Larsen 2001)

The Danish educational researchers Peter Dahler-Larsen and Flemming Larsen (Dahler-Larsen and Larsen 2001) have drawn up a list of uses to which evaluations are put (figure 8). They distinguish between uses which view human actions as based on rationality and functionality, i.e. in which we act in order to achieve a particular goal (such as learning or acquiring information), and uses aimed at making the system appear "suitable" – doing what is expected – rather than at achieving a particular goal. In the latter case it is not the effects of the evaluation that are important, but rather the fact that the evaluations are undertaken at all. There is a tendency for these symbolic and constitutive uses to occupy ever more space in the evaluation landscape. By evaluating, you communicate credibility and drive; you show that you are prepared to do something; you are part of the action (just think of how the number of countries participating in PISA grows with each round). The symbolic value is politically important, and often more important than adhering to the results achieved. However, at the same time, there may be a number of unintended consequences for the content. By undertaking evaluation, you can influence the field in a particular direction and help to form it. By evaluating, we also create social relations and identities (passive students, etc.). Or the evaluation creates a view of the subject matter (such as the scientific competence of school students) for which there is insufficient evidence (e.g. that Danish school students are scientifically incompetent, even though PISA measures only their ability to perform a particular type of task in a particular context).

There is no doubt that the PISA results, together with those of a number of other surveys, such as the OECD review of elementary schools (Uddannelsesstyrelsen 2004), have contributed to an increasing focus on the apparently weak evaluation culture in Danish elementary schools. This is a result which is consistent with the general foundation of the PISA and OECD surveys, and it is to a large extent necessary and useful. An increased level of evaluation – if well-balanced, well-designed and diagnostically oriented – would undoubtedly enhance the benefits of the educational process for all groups of students.

However, there are signs that PISA, besides exerting an influence on teaching, has also had an influence on the actual objects clause of the elementary schools, so as to direct the teaching to conform to a greater degree with what PISA is capable of measuring!

PISA in perspective

Taking a critical approach tends to sharpen your argumentation, and here I have emphasised the problematic aspects of PISA. It would not be reasonable to conclude on this basis alone that PISA is unusable, worthless or the like. On the contrary, PISA encompasses a great deal of potential.

To begin with, it presents us with an enormous amount of empirical material. The figures indicate many unknown factors in the educational sector which it would be worthwhile to investigate further, as well as confirming much that we already know, such as the large gender variations in Denmark, the variation between ethnic groups, etc. It is thought-provoking that there appears to be a statistical correlation between the results achieved and the students' comments on the level of discipline and order during lessons. Moreover, it is in itself remarkable that a very large proportion of the students – more than one-third – report experiencing poor discipline and order during lessons. It is useful to know that Danish school students feel at home in their schools, and that they have a positive attitude to their studies and a positive image of their own academic skills. There are many correlations which it would be interesting to explore in more depth, and an extensive diagnostic potential in the PISA material, first and foremost in connection with finding out what young people think, for better or worse. Rolf V. Olsen (2007) suggests five generic approaches to a secondary analysis of PISA data, each of them accompanied by a comprehensive list of approaches to analysis.

It is also meaningful to undertake comparisons within the same cultural groups, which may provide some fruitful contextualisation of well-known issues. This has been done extensively in a Nordic context, for example (Lie, Linnakylä et al. 2003; Kjærnsli and Lie 2004).

Finally, it should be mentioned that PISA is a laboratory in testing techniques and test theory. Participation in PISA has provided Denmark with a much-needed test-related theoretical boost, and has also helped to place the evaluation culture in Danish elementary schools on the agenda.

But this potential must be balanced against the danger of mainstreaming and distorting the educational system and teaching which PISA could also induce. International comparative evaluations possess almost inherent re-traditionalising and standardising elements which could influence national development in a direction which is foreign to the local educational culture. Evaluations as comprehensive as PISA express themselves with great authority on the basis of what many view as incontrovertible documentation. In relation to the national research environments, the PISA system has so many resources at its disposal that it is difficult to establish genuinely critical and independent research on PISA and the PISA results, with the result that a project like PISA can rapidly become established as a representative of objective, neutral reality. Political prestige has also been invested in participation, which makes it difficult for the participating countries to distance themselves from the project at policy level; as a "member of the club", one feels obliged to show solidarity with the club's rules.

Finally, it is important to point out that, from an educational perspective, it is difficult to establish links between the findings of comparative evaluations, which describe the educational system in its entirety, and teaching in individual classes, with individual students. PISA's strength lies in its analytical and diagnostic possibilities at the overall educational policy level; when it is utilised to influence the structure of specific teaching practice, there is a risk of promoting changes on the basis of an oversimplified view of educational practice, which can have a counterproductive effect in the long run in relation to achieving the stated goals.

References

Adams, R. and M. Wu (2001). PISA 2000 technical report. Paris: OECD.
Adams, R. J. (2003). Response to "Cautions on OECD's recent educational survey (PISA)". Oxford Review of Education 29(3): 377-389.
Allerup, P. (2005). PISA Præstationer – målinger med skæve målestokke. Dansk Pædagogisk Tidsskrift (1): 68-81.
Dahler-Larsen, P. and F. Larsen (2001). Anvendelser af evaluering – Historien om et begreb, der udvider sig. In: P. Dahler-Larsen and H. K. Krogstrup. Tendenser i evaluering. Odense: Odense Universitetsforlag.
Dolin, J. (2005). PISA og fremtidens kundskabskrav. In: PISA-undersøgelsen og det danske uddannelsessystem. Folketingshøring om PISA-undersøgelsen 12. september 2005. Teknologirådet.
Dolin, J., H. Busch and L. B. Krogh (2006). En sammenlignende analyse af PISA2006 science testens grundlag og de danske målkategorier i naturfagene. Første delrapport fra VAP-projektet. Odense: IFPR/Syddansk Universitet. (With English summary.)
Dolin, J. (2007). Science education standards and their assessment in Denmark. In: Waddington, D., Nentwig, P. & Schanze, S. (eds.): Standards in Science Education. Waxmann.
Goldstein, H. (2004). International comparisons of student attainment: Some issues arising from the PISA study. Assessment in Education 11(3).
Hansen, E. J. (2005). PISA – et svagt funderet projekt. Dansk Pædagogisk Tidsskrift (1): 64-67.
Henningsen, I. (2005). PISA – et kritisk blik. MONA (1).
Kjertmann, K. (2000). Evaluering af læsning: Generelle og specifikke problemer. Forskningstidsskrift fra Danmarks Lærerhøjskole, nr. 6.
Kjærnsli, M. and S. Lie (2004). PISA and scientific literacy: Similarities and differences between the Nordic countries. Scandinavian Journal of Educational Research 48(3): 271-286.
Lie, S., P. Linnakylä, et al. (eds.) (2003). Northern lights on PISA. Unity and diversity in the Nordic countries in PISA 2000. Oslo: University of Oslo.
Lie, S. and Olsen, R. (2007). A comparison of the measures of science achievement in PISA and TIMSS. Paper presented at the ESERA 2007 Conference, Malmoe.
Mejding, J. (ed.) (2004). PISA 2003 – danske unge i en international sammenligning. København: Danmarks Pædagogiske Universitets Forlag.
Mejding, J., S. Reusch and T. Yung Andersen (2006). Leaving examination marks and PISA results – Exploring the validity of PISA scores. In: Mejding, J. and A. Roe (eds.). Northern Lights on PISA – a reflection from the Nordic countries. Copenhagen: Nordic Council of Ministers.
OECD (1999). Measuring student knowledge and skills – a new framework for assessment. Paris: OECD.
OECD (2001). Knowledge and skills for life. First results from PISA 2000. Paris: OECD.
OECD (2002). Sample tasks from the PISA 2000 assessment. Paris: OECD.
OECD (2004). Learning for tomorrow's world. First results from PISA 2003. Paris: OECD.
Olsen, R. V. (2007). Beyond the primary purpose: Potentials for secondary research in science education based on PISA 2006 data. Paper presented at the ESERA 2007 Conference, Malmoe.
Pilegaard Jensen, T. & D. Andersen (2006). Participants in PISA 2000. Four years later. In: Mejding, J. & A. Roe (eds.). Northern lights on PISA – a reflection from the Nordic countries. Copenhagen: Nordic Council of Ministers.
Prais, S. J. (2003). Cautions on OECD's recent educational survey (PISA). Oxford Review of Education 29(2): 139-163.
Prais, S. J. (2004). Cautions on OECD's recent educational survey (PISA): rejoinder to OECD's response. Oxford Review of Education 30(4): 569-573.
Roth, W.-M. and J. Désautels (eds.) (2002). Science education as/for sociopolitical action. New York: Peter Lang.
St. Julien, J. (1997). Explaining learning: The research trajectory of situated cognition and the implications of connectionism. In: D. Kirshner and J. A. Whitson. Situated Cognition. Social, Semiotic, and Psychological Perspectives. London: Lawrence Erlbaum Associates.
Svendsen, L. S. (2005). Med Klods-Hans til PISA-prøve. Politiken. København.
Thorup, M.-L. (2005, 20 March). I want the whole world to be Danish. Information. København.
Tiberghien, A. (2007). Assessing scientific literacy: The need for research to inform the future development of assessment instruments. Paper presented at the ESERA 2007 Conference, Malmoe.
Uddannelsesstyrelsen (2004). OECD-rapport om grundskolen i Danmark – 2004. Uddannelsesstyrelsens temahæfteserie nr. 5.
Winther-Jensen, T. (2004). Komparativ pædagogik – faglig tradition og global udfordring. København: Akademisk Forlag.


Language-Based Item Analysis – Problems in Intercultural Comparisons

Markus Puchhammer

Austria: University of Applied Sciences Technikum Wien

PISA was started as an instrument for checking the outputs of the education systems of several different countries against each other. To achieve an assessment across a still-growing number of participating countries, test items were developed for mathematics, reading literacy, science and problem solving. Multiple institutions located in different OECD countries contributed to the effort of creating test items. Quite similar-looking test booklets were produced, featuring the same items presented in the languages officially used in education in the participating nations. The results have been widely discussed, and rankings were often attributed to the organization of national education systems. But is it correct to argue that the same items have been presented, without taking into account the use of different languages? While different cultural backgrounds may be assumed to influence reading literacy (becoming visible in the results), areas like mathematics may be regarded as less sensitive. But quantitative evaluations (presented below) show that there are still enough factors introduced by wording and by language. Thus, the validity of the PISA assessment – whether it tests what is intended to be tested – should be watched carefully within an international frame.

Factors indicating the importance of reading and language

Given different cultural backgrounds, influences of language may be expected for areas that are linked to reading. In this case, arguing for the importance of wording has good prospects, because it seems obvious that reading fluency and reading comprehension are related to sentence structure, to the use of specific terms, or to the length of the words of a text. These factors may be regarded as less important for other areas such as mathematics; scientific competence may also be less affected. But if it can be shown that even in these areas the effects of language must not be neglected, the result obtained so far can be generalized. Thus, the following discussion focuses on PISA's mathematics assessment.

In an average mathematics textbook, the pages are full of formulas and numbers. A first look at PISA items designed for testing mathematics performance reveals a different layout. Approximately 7 % of the text (1250 characters out of 18058 in the English item sample) consisted of digits or mathematical operators like = + - % /; the rest (93 %) formed a readable text whose proper understanding is required to find the correct solution. (Some diagrams were also shown, but note that diagram texts are included in these counts.)

The influence of language is further demonstrated by a model calculated to explain the PISA 2000 results in mathematics, presented in Artelt et al. (2001, 25). To predict performance in mathematics, factors such as socioeconomic status, gender, general cognitive ability, mathematical self-image and reading competence were considered. These variables explained 76 percent of the variance of the performance in mathematics. It was expected that general cognitive ability or mathematical self-image would influence the mathematics score most strongly. Surprisingly, the most prominent influence came from reading competence, expressed in a path coefficient of 0.55, whereas the other factors contributed less: the path coefficients were 0.32 for general cognitive ability and 0.14 for mathematical self-image.

The translation process – as described by PISA-Austria (2004a) – will be considered next. Language versions have been generated for the teaching languages of the participating nations. They start from an English source text and a French source text (often derived from the English version). A so-called double translation process is used, i.e. teams of two independent translators develop the national items, which are then cross-checked and merged into a final national version. International verification steps, a training programme and item analyses serve as quality assurance of the translation process. Nevertheless, this process starts from the two source languages, and it is not clear how far the translators can free themselves from the language structures of the source language(s) to achieve good readability of the translated items. If reading comprehension is reduced by this process, then the testing process is dissimilar across languages and the comparison of results is correspondingly restricted; the influence of translation should therefore be considered in more detail.

A principal component analysis of the PISA 2003 country scores for mathematics, science, problem solving and reading literacy yields a common factor which contributes 94 % of the total variance. This observation shows that there are no distinct foci, as one might expect if some countries promoted mainly the natural sciences or mathematics while others concentrated on reading comprehension and literacy. Different interpretations are possible, but reading comprehension, and thus language understanding, being the most important factor would be a good explanation. Therefore, the influence of factors like wording, the length of item texts, etc. should be investigated in more detail.
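Such an analysis is straightforward to reproduce. The sketch below, assuming a country-by-domain score matrix, computes the share of variance captured by the first principal component; the scores in it are invented for illustration and are not the actual PISA 2003 data.

```python
import numpy as np

# Each row is a country, each column a PISA 2003 domain (mathematics,
# science, problem solving, reading). The numbers are made up.
scores = np.array([
    [544, 548, 550, 543],
    [503, 511, 506, 496],
    [466, 481, 470, 469],
    [532, 525, 530, 514],
], dtype=float)

# PCA via the singular value decomposition of the column-centred data.
centred = scores - scores.mean(axis=0)
_, singular_values, _ = np.linalg.svd(centred, full_matrices=False)
explained = singular_values**2 / np.sum(singular_values**2)

# Share of the total variance on the first component; for the real
# PISA 2003 country scores the chapter reports 94 %.
print(f"First component: {explained[0]:.1%} of the total variance")
```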

Selection of PISA items for comparison

An evaluation based on the language and wording of items should select appropriate items carefully. Language differences can be demonstrated by items that were originally considered as PISA items but were finally not chosen for the test booklets (cf. e.g. the description of the item construction process in OECD, 2001, p. 42ff.). Some of these items have been released to the public and serve as examples for the different PISA assessment areas in the "full report", available in several languages (e.g. OECD 2004a,b,c). They are presented as prototypes for different assessment areas, e.g. for several levels of mathematical skills. The major advantage of these items is that they can easily be retrieved in several languages, and their release to the public eases discussion of individual contents.

For the German version, 42 released mathematics items (with up to 3 questions each) were located (see PISA Austria, 2004b). Some of them have already been released in OECD publications (2004c). The English items were downloaded directly from OECD (2005) in a file comprising 27 PDF documents, extended by other sources (e.g. HKPISA, 2006). All of the English items (except one) are also available in the German sample and have been used for the subsequent comparisons. Items released to the public are also available for other assessment areas, but in smaller numbers. Mathematics performance items may be expected to be less language-dependent, and the sample of 31 possible comparisons yields an item number that can reasonably be used for statistical calculations. The availability of these items in other languages invites an extension of the investigation. Item identifiers start with a letter ("M" for mathematics), followed by a three-digit item number and, possibly, a question number. An abbreviation of the item contents has been taken from the file name of the electronic format.

The following comparisons first show that the German text is significantly longer than the English version, and discuss the implications for the assessment. Then the familiarity of words (which should be related to the word knowledge of the target group of 15-year-old students) is examined using an approach from quantitative linguistics. Finally, German sentence structures appear to add further complexity to German texts.

Text-length based comparisons

31 mathematics items have been evaluated in both the German and the English version. The item texts were retrieved from the PDF format and imported into a word processing program. To analyse the details, a computer program (written in Visual Basic) was developed and applied to the item texts. Table 1 summarizes the results for the number of words (units separated by spaces or similar sentence marks) and the number of characters. To eliminate the confusing effect of number tables (there were some in the item sample), the number of digits and mathematical operators (like + – / %) were counted separately; these special characters are not included in the total character count. Results for English and German are shown below.

ItemID        #words (English)   #chars (English)   #words (German)   #chars (German)
M037Farms            155                694               149                796
M124Walkg            109                634               124                761
M145Cubes             68                306                62                341
M148Cont              52                286                61                353
M150GrwUp             91                403                85                495
M159Speed            235               1063               228               1307
M161Triang            66                328                62                349
M179Robbr             64                335                54                333
M266Crntr            110                475                93                465
M402IRC              157                771               144                820
M413Excha            182               1002               157               1005
M438Expor            128                576               126                616
M467Candy             68                321                69                364
M468STest             58                296                49                330
M471SFair             98                473                92                566
M484Books             65                365                58                396
M505Littr             84                474                74                516
M509Quake            156                818               153                919
M510Choic             65                393                62                446
M513Score            132                557               155                867
M515ShKid            112                380               101                454
M520Skate            227               1143               226               1426
M521Table             80                433                66                443
M525Deacr            265               1405               278               1629
M543Space             98                488                97                532
M547Stair             41                198                35                177
M555NCube2            94                448                74                509
M555NCube3           149                723               143                951
M702Presi            158                877               145               1003
M704BestC            237               1098               226               1290
M806StepP             58                295                57                340

Tab. 1: Text-length based results for PISA example mathematics items, English and German version.
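The counting itself is easy to re-implement. The chapter's Visual Basic program is not reproduced here; the following Python sketch encodes one plausible reading of the counting rules (words as space-separated units, digits and the operators = + - % / counted separately and excluded from the character total). The file name is hypothetical.

```python
import re

# Read one item text (hypothetical file name; a plain-text export is assumed).
text = open("M037Farms_en.txt", encoding="utf-8").read()

# Digits and mathematical operators are tallied separately, as in Tab. 1.
math_chars = re.findall(r"[0-9=+\-%/]", text)

# Words are units separated by spaces or similar marks; the character
# count excludes spaces and the special characters counted above.
words = text.split()
chars = sum(len(w) for w in words) - len(math_chars)

print(f"#words: {len(words)}  #chars: {chars}  #digits/operators: {len(math_chars)}")
```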

Text lengths vary from item to item, but texts usually contain several hundred characters (most items had only one question attached, and only three items had 3; still, a lower length limit can be observed on a per-question basis as well). The average length was 583 characters for an English item and 670 characters for a German item – indicating that the German items are noticeably longer. Average word counts were more similar for the English and German versions (118.1 and 113.1 respectively – but note that some terms use two words in English whereas German often combines two words into one; this effect results in more, and shorter, words in English).

It is interesting to observe the length of the German item texts as a function of their English source counterparts (a dependency is in fact given by the translation process). This relationship is shown in Fig. 1 and is clearly visible – described by a high correlation coefficient of r = 0.98; only a few entries deviate noticeably from the regression line.

To obtain a sound estimate of the relative text lengths, the regression line (based on a least-squares fit with intercept = 0) has been calculated, represented by the formula

length(German) = 1.16 · length(English)

Testing the slope coefficient for statistical significance clearly supports the statement that German item texts are in fact longer than the English ones by nearly 1/6 (the 95 % confidence interval for the slope parameter is [1.123, 1.198]).

Fig. 1: Graphical display of English vs. German item text length. The regression line is shown.
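A through-origin least-squares fit of this kind, together with the confidence interval for its slope, can be computed as sketched below. For brevity the sketch uses only the first six item pairs from Table 1, which give a slightly larger slope than the full sample; with all 31 pairs included, the values reported above result.

```python
import numpy as np
from scipy import stats

# English (x) and German (y) character counts; first six rows of Tab. 1.
x = np.array([694, 634, 306, 286, 403, 1063], dtype=float)
y = np.array([796, 761, 341, 353, 495, 1307], dtype=float)

# Slope of the regression line through the origin: b = sum(xy) / sum(x^2).
b = np.sum(x * y) / np.sum(x**2)

# Residual variance and standard error of the slope; fixing the
# intercept at zero leaves n - 1 degrees of freedom.
n = len(x)
s2 = np.sum((y - b * x) ** 2) / (n - 1)
se = np.sqrt(s2 / np.sum(x**2))

# 95 % confidence interval via the t-distribution.
t = stats.t.ppf(0.975, df=n - 1)
print(f"slope = {b:.3f}, 95% CI = [{b - t*se:.3f}, {b + t*se:.3f}]")
```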

The relevance of these observations becomes visible when contrasting the items with the reading speed of 15-year-old pupils. For average adults, reading speeds of around 200-300 words per minute are frequently reported (depending on the amount of text comprehension required, print size, etc.). For these readers, a reading time of around half a minute per item can be expected.

Assuming that the mathematics part of a PISA assessment contains 20 items to be worked out in 30 minutes, just reading the items would consume 1/3 of the available time. On the other hand, PISA items should also be manageable for pupils at a below-average level. For slow readers, speeds of only 110 words per minute are proposed for English texts (Readingsoft, 2007; similar reading speeds are reported for "efficient words" related to understanding). For those readers, the total reading time would sum up to 21 minutes – 70 % of the 30-minute session. German texts have longer words (about 20 % longer by our data) and reading speed is even a bit slower, so the time that can actually be devoted to reflecting upon the mathematics behind the question is still shorter. High variances in reading ability can indeed be expected, e.g. from the findings of Klicpera and Gasteiger-Klicpera (1993), who reported that the least performing 15 % of pupils in the 8th school year were at the level of an average reader at the end of the 2nd or beginning of the 3rd year in school.
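The reading-time estimates above follow from simple arithmetic, using the average English word count from Table 1 (about 118 words per item) and the assumed test configuration of 20 items in 30 minutes:

```python
# Reading time as a share of a 30-minute session with 20 items of about
# 118 words each (the average English word count from Tab. 1).
words_per_item, items, session_minutes = 118, 20, 30

for label, wpm in [("average reader", 250), ("slow reader", 110)]:
    minutes = items * words_per_item / wpm
    print(f"{label}: {minutes:.0f} min reading "
          f"({minutes / session_minutes:.0%} of the session)")
```

For the average reader this yields roughly a third of the session; for the slow reader, roughly 70 per cent, matching the figures above.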

Familiarity of words

In 1932, the American linguist and philologist George Kingsley Zipf (1932) noticed that the statistical frequency of words can be linked to their rank by frequency of occurrence. A word ranked nth in frequency occurs with a probability Pn of about

Pn ≈ 1/n^a

where the exponent a is almost 1. What is now known as Zipf's law holds well for nearly all languages (except for the first few words – e.g. in English: the, and, to, of, a, ... – with probabilities deviating only slightly). Since then, word frequency tables have been constructed. Files representing the top 10 000 words of languages like English, German, French and Dutch can be downloaded (e.g., see Universität Leipzig, 2007), spanning some orders of magnitude in the relative frequencies of the words that occur in an average text of the selected language. Words that occur quite seldom (e.g. only once every 100 000 words) may not be well known, may be difficult to understand, or may even be unknown to average users of the language.
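The shape of this distribution is easy to illustrate numerically. The sketch below tabulates the model probabilities over the top 10 000 ranks for the exponent a = 1:

```python
import numpy as np

# Zipf's law: the word ranked n occurs with probability roughly 1/n^a.
a = 1.0
ranks = np.arange(1, 10_001)
weights = 1.0 / ranks**a
probs = weights / weights.sum()          # normalise to probabilities

# A rank-10 000 word is several orders of magnitude rarer than rank 1.
for n in (1, 100, 1000, 10_000):
    print(f"rank {n:>6}: P = {probs[n - 1]:.2e}")
```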

In order to detect reading disadvantages of items in the translated language, these rank lists of English and German words have been applied. Words occurring frequently have low rank numbers; rare words have high rank numbers. If frequent words in the first language are replaced by infrequent words in the second language (during translation), then the resulting text is more difficult to understand. Text translations that claim to yield similar difficulty should use words that occur with similar probability, i.e. that have similar rank numbers (according to Zipf's law). Words used seldom are more relevant for a comparison, because a shift in difficulty is easier to observe and influences the understandability of a text more distinctly.

The PISA example mathematics items have been reviewed in both versions, English and German. Words that seemed important for an item text, as well as "difficult" words (occurring less frequently), were identified and their translated counterparts were located. Then rank numbers were determined and compared. A list of (the first of) these words is shown below. It should be noted that even more complicated words (e.g. German words like Hemisphäre) can be found in other PISA assessment areas (e.g. science).

English            German                rank (English)   rank (German)   in favour of
...
footprints         Fußabdrücke           4491, 1233       2833, –         English
pacelength         Schrittlänge          2649, 2607       746, 3172       English
average height     Durchschnittsgröße    388, 5346        3259, 1784      German
interpretation     Interpretation        5246             5632            English
make a border      umranden              112, 1669        –               English
communicate        kommunizieren         3608             –               English
exchange rate      Wechselkurs           508, 207         1923, 1187      English
information        Informationen         135              472             English
exports            Exporte               1936             7452            English
probability        Wahrscheinlichkeit    6703             6161            German
probable, likely   wahrscheinlich        654, 456         1247            English
average            Durchschnitt          388              3259            English
represented        dargestellt           2422             3148            English
clips              Klammern              7963             –               English
bar graph          Balkendiagramm        2128, 5010       –, –            English
happen             passieren, passiert   2238             1820, 2799      English

Tab. 2: Rank order comparison of words extracted from PISA mathematics example items. "–" indicates that the word could not be found in the list of the top 10 000 words. Commas indicate that the constituents of a term have been selected instead of a single word; in this case the higher rank number has been used for comparison (both parts need to be understood).

The compilation suggests that in most of the cases the English original uses words with lower rank numbers (hence occurring more frequently in the English language) than their German equivalents. To obtain a total figure for an approximate comparison, average rank numbers can be calculated (substituting the rank number 10 000 for words not in the list). Then the average for English is rank 2770, but the average for German is rank 5133 (considerably higher).
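The averaging rule can be sketched as follows, using a small excerpt of Table 2 rather than the full word list (the excerpt therefore gives different averages than the 2770 and 5133 reported above, but shows the same pattern):

```python
# Average rank numbers per language; None marks words missing from the
# top-10 000 list, for which the rank 10 000 is substituted.
MISSING = 10_000
english = {"interpretation": 5246, "exports": 1936, "average": 388,
           "information": 135, "communicate": 3608, "clips": 7963}
german = {"Interpretation": 5632, "Exporte": 7452, "Durchschnitt": 3259,
          "Informationen": 472, "kommunizieren": None, "Klammern": None}

def avg_rank(ranks):
    return sum(MISSING if r is None else r for r in ranks.values()) / len(ranks)

print(f"English: {avg_rank(english):.0f}  German: {avg_rank(german):.0f}")
```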

Although only a few words have been selected, the result is striking – the words' rank numbers indicate that the German item translation can be considered more difficult to understand than the English original.

This approach also explains why persons with a foreign mother tongue (e.g. with a migration background) may sometimes face problems. Usually the most frequent words of a language are taught first, and a vocabulary of the most frequent 10 000 words may not be sufficient to understand several of the PISA mathematics items.

Further language issues

When comparing two specific languages (e.g. English and German), further language issues may be identified. Subordinate clauses inserted in the middle of sentences are more frequent in German and may impair readability. German grammar is considered to be more complicated than English grammar. Ambiguities may lead to misunderstanding, and the use of an official language in translations may cause additional difficulty when "peer slang" predominates in the target group. Still other topics can be found in a summary by Rost (2001) on reading comprehension. However, a quantitative evaluation of these aspects is beyond the scope of this contribution.

Conclusions

For the PISA sample mathematics items it has been shown, via the regression's slope coefficient, that the German items are significantly longer than the English ones (based on straightforward character counting). Some slightly difficult words become even more difficult after their translation into German, and hence do not support fast and efficient answering in a test situation. A quick look into science and problem-solving items suggests that these findings are not limited to mathematics. As a consequence, the promise of PISA to support fair international, inter-language comparisons of the output of education systems begins to fail at the language boundaries.

Three steps are therefore proposed for the future. First, rigid international ranking schemes should be interpreted more carefully, to account for potential problems. Second, investigations should take place to better understand the process of answering PISA items, including language-specific problems and a variety of other factors, extending research on open issues – discussing PISA according to PISA. Finally, improvement of the whole process of item creation should consider item translation, item formats and new item types to overcome the current problems.

References

Artelt, C., Baumert, J., Klieme, E., Neubrand, M., Prenzel, M., Schiefele, U., Schneider, W., Schümer, G., Stanat, P., Tillmann, K.-J. & Weiß, M. (Hrsg.): PISA 2000. Zusammenfassung zentraler Befunde. Berlin 2001; online: http://www.mpib-berlin.mpg.de/pisa/pdfs/ergebnisse.pdf, retr. 2004/12/03.
HKPISA Programme for International Student Assessment Hong Kong Centre: Sample Test Items PISA 2000, 2003. 2006; online: http://www.fed.cuhk.edu.hk/hkpisa/sample/files/2000_Maths_Sample.pdf, retr. 2007/09/20.
Klicpera, C. & Gasteiger-Klicpera, B.: Lesen und Schreiben – Entwicklung und Schwierigkeiten. Bern: Huber, 1993.
OECD Organisation for Economic Co-operation and Development: Learning for Tomorrow's World. First Results from PISA 2003. Paris 2004a.
OECD: Apprendre aujourd'hui, réussir demain. Premiers résultats de PISA 2003. Paris 2004b.
OECD: Lernen für die Welt von morgen – Erste Ergebnisse von PISA 2003. Paris 2004c.
OECD Organisation for Economic Co-operation and Development: PISA 2003 mathematics questions. Paris 2005; online: https://www.oecd.org/dataoecd/12/7/34993147.zip, retr. 2007/09/14.
PISA Austria: Testinstrumente. 2004a; online: http://www.pisa-austria.at/pisa2003/testinstrumente/lang/III_Testinstrumente.htm, retr. 2005/08/14.
PISA Austria: Mathematik freigegebene Aufgaben. 2004b; online: www.pisa-austria.at/pisa2003/testinstrumente/lang/mathematik_freigegebene_aufgaben.pdf, retr. 2007/09/14.
Readingsoft: Speed Reading Test Online. 2007; online: http://www.readingsoft.com, retr. 2007/09/17.
Rost, D.H.: Leseverständnis. In: Rost, D.H. (Ed.): Handwörterbuch Pädagogische Psychologie. pp. 449-456. Weinheim: PVU, 2007.
Universität Leipzig: Deutscher Wortschatz – Wortschatzportal. Institut für Informatik, Universität Leipzig, 2007; online: http://wortschatz.uni-leipzig.de/html/wliste.html, retr. 2007/09/15.
Zipf, G. K.: Selected Studies of the Principle of Relative Frequency in Language. Cambridge (Mass.) 1932.


England: Poor Survey Response and No Sampling of Teaching Groups 1

S. J. Prais

United Kingdom: National Institute of Economic and Social Research, London

Abstract:

The two recent (2003) international surveys of pupils' attainments were uncoordinated, overlapped considerably, and were costly and wasteful, especially from the point of view of England, where inadequate response rates meant that no reliable comparisons at all could be made with other countries. It is the weaker pupils who tend not to respond, and poor response rates thus tend to yield upwardly-biased results. Inadequate emphasis on classes, or on teaching groups, in designing the samples means that little progress can be made in relating success in learning to average class size or to variability among pupils. The surveys were conducted, respectively, by the OECD (Programme of International Student Assessment – PISA) and by the US-based International Educational Assessment group (Trends in International Mathematics and Science Study – TIMSS). Sources of the problem are investigated here.

Some astonishment was aroused by the recently published results of two, apparently independently organised, large-scale international questionnaire surveys of pupils' mathematical attainments towards the middle of their secondary schooling (age 14-15); nearly 50 countries participated in each survey, with some 200 schools in each country. Both surveys were carried out in the same year, 2003; previous surveys had generally been carried out at about ten-year intervals, and each of these two very recent surveys had been carried out only 3-4 years previously. Some questions on science and literacy were included in 2003, but the focus was on mathematics (and that is our focus here). A test towards the end of primary schooling, at age 10, was also carried out in association with one of these surveys. The total cost was probably over £1m for England, and probably well over $100m for all countries together, plus the time of pupils and teachers directly involved. 2 Results were published by the beginning of 2005 in several thick volumes, totalling some 2000 large (A4) pages; the two organisations behind the surveys are known as TIMSS and PISA (details of the organisations and publications are at Annex A at the end of this paper). There does not appear, from these publications, to have been any coordination between the two organisations. Much wasteful overlap and duplication is evident; the interval between recent repetitions of these surveys was so tight as not to permit adequate consultation for lessons to be learnt. 3

1 This chapter is an edited version of my paper in the Oxford Review of Education (vol. 33, no. 1, February 2007). Thanks are due to the editors of that Review for permission to reproduce.

Representativeness of samples

We shall try and assess here some of the main findings for England, ask whether further surveys of this kind are justified, and whether anything is to be learnt from these recent surveys which might improve future ones. What can be said with any confidence about English pupils' attainments towards the end of their secondary schooling is much limited by poor sample response. From the TIMSS report on 14 year-olds we learn: 'England's participation fell below the minimum requirements of 50 per cent, and so their results were annotated and placed below a line in exhibits (= statistical tables) showing achievement'. 4 For the parallel PISA report, in all tables mentioning findings for the United Kingdom a footnote was attached to the line for the UK (and only for the UK!): 'Response rate too low to ensure comparability'. 5

2 Only limited information on the costs of these surveys has been released. For England, a total of £0.5m was paid to the international coordinating bodies, but information on locally incurred costs was withheld (in reply to a Parliamentary Question on 7 March 2005) as publication could 'prejudice commercial interests' in the government's negotiating of repeat surveys in 2006-7. It is astonishing that expenditure on further surveys should have been put in hand before there has been adequate opportunity for scientific assessment of the value of the 2003 surveys and of the appropriate frequency of their repetition.

3 The PISA (Programme of International Student Assessment) inquiry of 2003 was organised by the OECD and followed their first attempt at this activity in 2000. The report on their first survey was critically reviewed in my article in the Oxford Review of Education, 29 (2) (2003); the present paper has benefited from discussion following that earlier paper. The acronym TIMSS was originally short for Third International Mathematics and Science Study; subsequently it became short for Trends in International ... The previous occasion on which it had been carried out was 1999. More of the 2003 co-ordinating costs (76 %) were incurred by PISA, making TIMSS – which covered two age-groups – the better buy for the British taxpayer.

In other words, any differences that may appear between published results for England and other countries are not to be relied on. This reservation was not, however, attached to the tests of English 10 year-olds towards the end of their primary schooling (carried out by TIMSS, following a similar survey at that age in 1995); and those results, to first appearances, appear to be the most scientifically interesting and important for educational policy. We will need to examine below whether those results are indeed robust enough – that is to say, adequately representative – to be relied upon.

But before that, a short word on the recent historical background of Britain's schooling attainments may be helpful. Britain's economic capabilities – its motor industry, its machine tool manufacturing industry, as well as other industries relying on a technically skilled workforce – led to much public concern by the 1960s: expressed subsequently, for example, in the official Cockcroft Committee's report on Mathematics Counts (HMSO, 1978), eventually leading to the National Curriculum, the National Numeracy Project, and then to nationwide annual testing of all pupils in basic school subjects at all primary and secondary schools (SATs at ages 7, 11 and 14 to supplement the longer-standing GCSE tests at 16).

Detailed empirical comparisons of productivity and workforce qualifications were made in the 1980s and 1990s by teams centred at the National Institute of Economic and Social Research (London). Site visits to comparable samples of manufacturing plants in England and Germany clarified the nature of the great gaps in workforce qualifications; these gaps were not so much at the university graduate level, but at the intermediate craft levels (City and Guilds, etc.) – the central half of the workforce. The difficulty in England in expanding that central category of trainees was traced to the secondary school-leaving stage, when the standards of mathematical attainment required for craft and technician training, especially in numeracy, were much below Germany's. The IEA's First International Mathematics Survey of 1964 (FIMS – the original predecessor of TIMSS) was one of the important sources that confirmed this gap in secondary school mathematics; it was made evident to our teams of secondary mathematics teachers and inspectors on visits to secondary schools in France, Germany, the Netherlands and Switzerland, and in discussions with heads of industrial training departments (Meister). 6 An important conclusion from visits to schools was that it was quite unrealistic to expect English secondary schools to be able to produce the numbers of students with the levels of mathematical competence that had been seen abroad if they had to start with the standards delivered by our primary schools.

4 TIMSS, Mathematics Report, p. 351.

5 See, for example, PISA, Annex B, Data Tables, pp. 340 et seq.

Shifts in research interests and in official educational policy ensued for mathematics teaching, especially at the primary level. Textbooks in England and in Europe were carefully compared; teaching methods abroad were observed by practising teachers; new teaching schemes were prepared; and nationwide tests of pupils' attainments were administered annually to all pupils at ages 2-3 years apart (SATs). Much more could be said on the details of what has amounted to a 'didactic revolution'; but perhaps the foregoing is sufficient to indicate the interest attached to the 2003 TIMSS mathematics results at age 10, which can be compared with the similar sample inquiry eight years previously at that age (the 1995 TIMSS – Third International Mathematics and Science Study). Had England now caught up with its competitors, at least by the end of primary schooling?
the end of primary schooling?<br />

The comparison was set out, clearly and apparently convincingly, in the national report for England for 2003 produced by the (English) National Foundation for Educational Research (which carried out the survey in England in coordination with the international body). It noted that England's mathematics scores showed the largest rise of any of the 15 countries that participated at the primary level in both 1995 and 2003 (the English rise was 47 standardised points, from 484 to 531, where 500 is the notional average standardised score of all countries in these international tests, and the standard deviation is standardised at 100). Most test questions asked were different in the two years, but 37 questions were the same in both years; the proportion who answered those common questions correctly in England rose very satisfactorily from 63 to 72 per cent. The rise was even a little greater in questions relating to numeracy (arithmetic); this may all be taken as reassuring, since previous deficiencies in English students' attainments were, as said, particularly marked in that area – the foundation stone of mathematics. 7 The top countries at the primary school level were, once again, those bordering the Pacific: Singapore, Hong Kong, Japan – with scores averaging about 570; England's rise in performance in the nine intervening years, by 47 points to 531, can thus be seen as approximately halving the gap with these top countries – and in hardly more than a decade.

6 See my paper with K. Wagner, Schooling Standards in England and Germany: Some summary comparisons bearing on economic performance, in National Institute Economic Review, May 1985, and in Compare: A Journal of Comparative Education, 1986, no. 1. More generally, see the series of reprints re-issued by NIESR in two compendia entitled Productivity, Education and Training (1990 and 1995). Teams of teachers and school inspectors, particularly from the London Borough of Barking and Dagenham, were invaluable in assessing school visits here and abroad.
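The 'approximately halving' claim follows directly from the reported scores:

```python
# Gap to the Pacific-rim top group (about 570) before and after.
england_1995, england_2003, pacific_top = 484, 531, 570
print(pacific_top - england_1995)   # 86 points in 1995
print(pacific_top - england_2003)   # 39 points in 2003, roughly half
```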

To first appearances, this seems a remarkably encouraging achievement – and, one must equally say, in a remarkably short time-span, given the complexity of what amounted to changing almost the whole system of mathematics didactics. But are these sample results to be relied upon? We have noted that at the secondary school level (age 14) serious reservations were attached by the surveys' sponsors to the response rates of the samples for England; at the primary level (average age 10.3, Year 5 in England) a cautionary footnote is always attached to the TIMSS results reported for England (not as serious as for the secondary school results – but not to be ignored): 'Met guidelines for sample participation rates only after replacement schools were included'. 8 With that modestly expressed caution in mind, let us next patiently re-examine the actual response rates for England, bearing in mind that if response rates were lower in 2003 than in 1995, we might expect better average scores to be recorded simply as a result of 'creaming higher up the bottle'.

We first compare the response for schools; then the response for students within responding schools; and finally, the product of these two rates. In 2003 there were 150 primary schools in the original English representative sample, of which 79 schools participated, or 53 per cent. 9 For the previous primary school inquiry of 1995, 92 out of 145 sampled schools participated at the fourth grade – 63 per cent. 10

The student participation rate (within participating schools) was 93 per cent in 2003, just a little below the 95 per cent recorded for 1995. Combining the two participation rates (schools × students), we have a participation rate of something like 50 per cent in 2003 compared with 60 per cent in 1995: there are thus grounds for worrying whether there has been a genuine improvement in scores in the population. 11

7 See G Ruddock et al., Where England Stands in the Trends in International Mathematics and Science Study (TIMSS) 2003 (NFER), 2004, pp. 8-10.

8 IVS Mullis et al., TIMSS 2003 International Mathematics Report (IEA, Boston), 2004, for example, p. 35.

9 Ibid., p. 355.

10 IVS Mullis et al., Mathematics Achievement in the Primary School Years (TIMSS), 1997, p. A 13.
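The arithmetic behind these combined rates is simple enough to set out explicitly. A minimal sketch in Python, using only the figures just quoted:

    # Combined participation rate = school rate x student rate.
    # Figures as quoted above for the English TIMSS primary samples.
    def combined_rate(schools_participating, schools_sampled, student_rate):
        school_rate = schools_participating / schools_sampled
        return school_rate * student_rate

    rate_2003 = combined_rate(79, 150, 0.93)  # roughly 0.49, i.e. about 50 per cent
    rate_1995 = combined_rate(92, 145, 0.95)  # roughly 0.60
    print(f"2003: {rate_2003:.0%}; 1995: {rate_1995:.0%}")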

But is either of these overall response rates adequate for anyone to place reliance on the representativeness of the results? Even TIMSS put the ‘minimum acceptable participation rate’ at ‘a combined rate (the product of school and student participation) of 75 per cent’; but at Year 5 in England (as also in five other countries) 12 that criterion was said to be satisfied ‘only after including replacement schools’. This brings us to a long-standing thorny dispute on acceptable sampling practices. The sampling procedure adopted in these international educational inquiries is not at all orthodox. It starts with several parallel lists of schools, each list being equally representative. 13 If an inadequate response is received from the initial list, then ‘corresponding’ schools from the second list are approached, and from a third list if necessary. For England in 2003, as said, a sample of 150 schools was drawn from the initial list; in the outcome, 79 schools from that list participated (a mere 53 per cent) and 71 schools refused. A further 71 (replacement) schools were then chosen from the second list, of which an estimated 27 schools participated (38 per cent) and 44 refused; an estimated 44 were then approached from the third list, of which 17 participated. The total number of schools participating thus came to 79+27+17=123; the total number approached was (nota bene, since the organisers of these surveys do not agree!) 150+71+44=265; the overall response rate for schools was therefore 123/265=46 per cent (a little below the 53 per cent from the first list). Taken together with a response of 93 per cent of students in participating schools, the total combined response (schools and students) was thus only 43 per cent – far below the proportion (75 per cent) originally laid down by TIMSS as acceptable.
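To make the disputed arithmetic fully transparent, the same calculation can be set down as a short sketch (the figures are those given in the paragraph above; the contrasting rate as published by TIMSS is taken up in the next paragraph):

    # Schools approached and participating, list by list (England, TIMSS 2003).
    approached = [150, 71, 44]       # initial list, second list, third list
    participating = [79, 27, 17]

    # Correct overall school response rate: participants / all schools approached.
    school_rate = sum(participating) / sum(approached)   # 123/265 = 46 per cent

    student_rate = 0.93
    combined = school_rate * student_rate                # about 43 per cent
    print(f"schools: {school_rate:.0%}; combined: {combined:.0%}")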

11 The reader will understand that the gradient of the response rate with respect to attainment level will be different according to whether it is amongst schools, at the school level, or amongst students within schools; but the point is not worth elaboration in view of what is said in the next paragraph.

12 Australia, Hong Kong, Netherlands, Scotland, United States (ibid., p. 359). For the US a response rate (before replacement) of only 66 per cent was recorded for the primary survey, and the same for the TIMSS secondary survey. For England’s secondary survey, the corresponding proportion was a mere 34 per cent!

13 For example, starting from an initial list of schools organised by geographical area, size, etc., a random start is made; subsequent schools are chosen after counting down a given total number of pupils (so, in effect, sampling schools with probability proportional to their size). A reserve list is yielded by taking schools, each one place above the schools in that initial list; and a second reserve, by going one place down the initial list.
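The selection mechanism described in footnote 13 – a systematic draw with probability proportional to size, plus reserve lists one place above and one place below each selected school – can be illustrated as follows. This is a simplified sketch; the toy school list is invented and not taken from the surveys:

    import random

    def pps_systematic_sample(schools, n):
        # schools: list of (name, pupil_count), already ordered by area, size, etc.
        # Systematic draw with probability proportional to pupil numbers.
        total = sum(pupils for _, pupils in schools)
        step = total / n
        start = random.uniform(0, step)          # random start (cf. footnote 13)
        sample, cum, idx = [], 0, 0
        for k in range(n):
            target = start + k * step
            while cum + schools[idx][1] < target:
                cum += schools[idx][1]
                idx += 1
            sample.append(idx)
        return sample

    def reserve_lists(selected, n_schools):
        # First reserve: the school one place above each selected school in the
        # ordered list; second reserve: one place below.
        first = [max(i - 1, 0) for i in selected]
        second = [min(i + 1, n_schools - 1) for i in selected]
        return first, second

    # Example with an invented list of ten schools.
    toy = [(f"school {i}", 200 + 10 * i) for i in range(10)]
    picked = pps_systematic_sample(toy, 3)
    print(picked, reserve_lists(picked, len(toy)))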



Incredible as it may seem, the statisticians at TIMSS calculated a participation rate not in relation to the total number of schools approached on the first and subsequent lists (265), but in relation to the smaller number originally aimed at (150); they consequently published a misleading response rate of 123/150 = 82 per cent for schools, and of 75 per cent for schools and students combined – just falling within their originally stipulated requirements (whereas the correctly calculated combined response rate, as just said, was only 43 per cent). Was this merely a momentary, forgivable slip? Or was it more in the nature of a scientistic trompe l’oeil, encouraging readers to believe that all was fundamentally well and had been placed in sound hands – including the hands of a Sampling Referee, an expert to whom such technical statistical details had been safely relegated? Having discussed this issue with a number of British statisticians, I have regretfully come to the conclusion – putting it as kindly as I can – that these surveys’ statisticians had misled themselves and their educationist colleagues as a result of their commercial experience with quota sampling; and that any future such inquiry needs to be advised by a broader panel of social statisticians. 14 For the sake of clarity, I repeat that such an enlarged body will need to address two issues: first (a simple arithmetical issue), what is the correct method of calculating a total response rate if ‘replacement’ samples are included; secondly, is there any substantial scientific justification for approaching a ‘replacement sample’ (rather, say, than an initially larger sample)?

Returning to the real issue on which we would all like to draw happy conclusions – namely, the tremendous rise in our pupils’ attainments at age 10 – we see from the previous paragraph that the response achieved (a combined rate of 43 per cent, not the 82 per cent for schools reported by TIMSS) has to be judged altogether too low to support any such conclusion.

14 Quota sampling is used in commercial work, and places greater emphasis on achieving the agreed total of respondents than on their representativeness; it is avoided in scientific work. On the ‘Sampling Referee’, see TIMSS 2003, p. 441. The issue of replacement sampling was questioned in my previous paper on PISA 2000 (Oxf. Rev. Education, 29, 2); see also the response by RJ Adams (ibid., 29, 3), and my rejoinder to that response (ibid., 30, 4). The need for representative sampling is so basic to scientific survey procedures that it is astonishing that those responsible for educational surveys, together with the government departments providing taxpayers’ money for such exercises, could accept such an easy-going (slack) approach to non-response. But, as it now turns out, this was not the last word – as discussed below in relation to re-weighting with population weights.

But we cannot leave the topic of response rates without noticing a considerable improvement in the way that England’s secondary school scores were calculated for TIMSS. As said at the outset, the whole of the English results were
rejected for international comparability in the international reports because they did not satisfy the originally specified sampling requirements (the rejection applied equally to TIMSS and PISA). There was, however, an additional national report on England’s TIMSS survey which outlined an alternative calculation based on re-weighting the sample results by population weights. It tells us that the TIMSS sample over-represented schools that were ‘average and above average in terms of national examination (or test) results’ (i.e. weaker schools were under-represented: SJP); ‘this sample was therefore re-weighted using this measure of performance to remove this effect’. 15 Presumably, the obligatory nationwide SAT test results were used to provide better weights, but details have not been released as to whether, for example, the re-weighting was for the country taken as a whole, or for the sampled schools or, indeed, the sampled students. The consequence of the re-weighting was that England moved down in the TIMSS mathematics ranking, to below Australia, the United States, Lithuania and Sweden (a reduction of England’s international score from 505 to 498). Nothing of very great substance, it might be thought; but the new method of estimation is of great importance for future surveys.

Such an adjustment raises the reliability of England’s estimated average scores because, to put it simply, it employs population – rather than sample – weights for the various ability-strata. When educational surveys of this kind were first attempted in 1964, no routine nationwide tests of mathematical attainments were available for England; now that they have become available, and even on an annual basis, they could be used to provide population weights for a TIMSS-type survey using internationally specified questions. 16
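The principle of the adjustment – replacing the attainment-strata shares that happen to occur in the sample by the shares known to hold in the population – amounts to simple post-stratification. A minimal sketch, with invented stratum figures purely for illustration (the published reports give no comparable detail):

    # Post-stratification: weight stratum means by known population shares
    # rather than by the shares occurring in the (biased) sample.
    # Stratum labels and numbers below are invented for illustration only.
    strata = {
        # stratum: (sample_size, sample_mean_score, population_share)
        "below average": (20, 470, 0.40),
        "average": (60, 500, 0.40),
        "above average": (40, 540, 0.20),
    }

    n = sum(size for size, _, _ in strata.values())
    sample_weighted = sum(size * mean for size, mean, _ in strata.values()) / n
    population_weighted = sum(share * mean for _, mean, share in strata.values())

    print(f"sample-weighted mean: {sample_weighted:.0f}")          # 508 here
    print(f"population-weighted mean: {population_weighted:.0f}")  # 496: corrected downwards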

15 G Ruddock et al., Where England Stands (. . . in TIMSS 2003), National Report for England (NFER, 2004), p. 25. The (previous) view expressed by PISA was very different: ‘A subsequent bias analysis provided no evidence for any significant bias of school-level performance results but did suggest there was potential non-response bias at student levels’ (PISA, p. 328, my ital.). To emphasise: this is different from the TIMSS conclusion that it was weaker schools that needed up-weighting to improve representation (pp. 9, 25).

16 It is difficult to find more than a trace of a reference to this re-weighting in the international TIMSS report, though it is quite explicit in the English national report; the same average scores for England are published in both reports. The TIMSS Technical Report (ch. 7, by M Joncas, p. 202, n. 7) offers the following light: ‘The sampling plan for England included implicit stratification of schools by a measure of school academic performance. Because the school participation rate even after including replacement schools was relatively low (54 %), it was decided to apply the school non-participation adjustment separately for each implicit stratum. Since the measure of academic performance used for stratification was strongly related to average school mathematics and science achievement on TIMSS, this served to reduce the potential for bias introduced by low school participation’. The PISA report does not discuss any such possible improved estimation procedure.
reduce the potential for bias introduced by low school participation’. The <strong>PISA</strong> report does



The upshot is, first, that while the TIMSS primary survey results for England are less reliable than would appear from the way they were reported, those for the secondary schools – after re-weighting – are more reliable. Secondly, sampling errors ought properly to be calculated for the TIMSS secondary school survey as for a stratified sample. Thirdly, the poor response rates achieved in both of these secondary school surveys might yet encourage a refusal by England – at a political level – to support any such future surveys; but we see here that what is first really required is more research into sampling design, that is, better use of population information collected in any event for general educational objectives, so enabling more accurate results to be attained at lower cost. 17

17 The above discussion of response rates has been restricted, for the sake of brevity, to the primary school survey. More or less the same applied to both secondary school surveys, as follows. For the TIMSS secondary survey, the participation rate of the 160 sampled schools (before replacements were included) was a pathetic 34 per cent (TIMSS, p. 358); for the PISA inquiry, directed to 450 schools, it was 64 per cent (PISA, p. 327, col. 1). For the US, which deserves special attention because of its greater financial sponsorship, the corresponding secondary school response rates were 66 and 65 per cent (but would its financial contribution have been as great if the true response rates had been published, i.e. after correctly allowing for replacement sampling, as explained above?).

The English Department of Education issued Notes of guidance for media-editors explaining that their ‘failure to persuade enough schools in England to participate occurred despite . . . various measures including an offer to reimburse schools for their time . . . ’ (National Statistics First Release 47/2004, p. 4, 7 December 2004). Note the term ‘reimburse’: there is no suggestion of motivating a sub-sample of schools by a substantial net financial incentive.

Objectives of international tests

When these international educational tests were introduced nearly two generations ago, it was widely understood that their main objective was not – as it seems to have become today – to produce an international ‘league table’ of countries’ schooling attainments, but to provide broader insight into the diverse factors leading to success in learning. Despite the current popular emphasis on ‘league table’ aspects (usually without a corresponding emphasis on sampling errors), much space is devoted in the present reports to students’ ‘perceptions’, attitudes towards learning and their relation to success. But the reader often finds himself questioning the direction of causation; for example, we are told such things as that students who are happy with mathematics tend to do better in that subject: but perhaps causation runs more the other way round – those who do well in that subject are happier, or more willing to declare their happiness. Similarly, much space is given to watching TV and its association with test scores; to reading books; and so on. But little space is given in these reports to what topics are taught at each age, to what level, and to what fraction of the age-group (see Annex B on the implications of a longer basic schooling life in the US); nor to such a basic ‘mechanism’ of school learning as how students, who inevitably differ in their precise levels of attainment, are grouped into ‘parallel’ differentiated classes – despite the obvious concern of this feature of schooling to teachers, parents, policy makers and, not least, students.

The relation between the size of a class and its average achievement is tabulated in one of the studies and well illustrates the issue of the direction of causation. For smaller classes of up to 24 students, an average score of 479 was recorded for England at Year 9; for medium-sized classes of 25-32 students, the average score was higher, at 511; and for larger classes of 33 or more students, the average score was higher still, at 552 (much the same applied in the other countries). 18 Higher attainments in larger classes have frequently been observed before – contrary to the presumption that smaller classes would do better; this ‘statistical relation’ has generally been attributed to the widespread recognition by schools that slower/weaker pupils should be taught in smaller ‘parallel’ classes where possible. Whether schools allocated higher-attaining pupils to larger classes as efficiently as possible can be debated; but it is clear that no one (least of all the present writer) would draw the policy implication that if children were only taught in larger classes then they would attain better results at lower costs. Much care is similarly necessary in drawing conclusions from other statistical associations noted in these studies.

18 TIMSS, Mathematics Report, p. 266; the same applied also to the primary inquiry at Year 5, p. 267.

For example, very strong conclusions were drawn by PISA on how the schooling system should deal with the variability of students’ attainments and capabilities. But let us first spell out realistically the issue of variability of pupils’ attainments in a class from the teacher’s point of view. Some variability of students’ attainments within a class is unavoidable; but, once a certain level of variability is exceeded, the pace at which the teacher can teach slows, as does the pace at which learning takes place, not least amongst those students who are weaker (weaker for whatever reason – born at the later end of the school-year,
illness last year, slow learning in a previous school, difficulties at home that weigh on the student’s mind . . . ), often with consequent ‘playing up’ in class; eventually the teacher finds it better to divide his ‘class’ into explicit sub-groups, or ‘sets’, which follow a more or less different syllabus of tasks, with consequences for the pace of learning, and the costly need for teaching assistants. All this is of course familiar; and it might have been thought that an elementary calculation of the variability of attainments within a class would have been a natural, obvious, useful – indeed essential – part of such inquiries.
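The ‘elementary calculation’ in question is no more than a decomposition of the total variance of attainments into within-class and between-class components. A minimal sketch, with invented class score lists purely for illustration:

    from statistics import pvariance

    # Each inner list holds the attainment scores of one class/teaching group
    # (invented numbers, purely for illustration).
    classes = [
        [41, 44, 47, 50, 53],   # a lower set: narrow spread
        [55, 60, 62, 65, 68],
        [70, 74, 78, 82, 86],   # a top set: wider spread
    ]

    all_scores = [score for c in classes for score in c]
    total_var = pvariance(all_scores)

    # Size-weighted average of within-class variances; the remainder is the
    # between-class component (law of total variance).
    within = sum(len(c) * pvariance(c) for c in classes) / len(all_scores)
    between = total_var - within

    print(f"total {total_var:.1f} = within {within:.1f} + between {between:.1f}")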

But the PISA sample was deliberately based not on whole classes, but on all those aged 15 in a school – whichever Year or attainment-set they were in. In England, as in other countries where promotion from one class to the next is based strictly on age, it might seem that nothing much is at issue; but to rely on that would ignore the widespread practice of ‘setting’ students within each year into groups by attainment level – a practice that becomes more widespread at higher ages. In most other countries some reference to attainment level usually influences promotion from one class to the next. But nothing of this can be investigated with the help of PISA, since its sampling was based not on whole classes but purely on age, irrespective of class or teaching-group.

The TIMSS sampling process, on the other hand, was different, since it was based on whole classes, and may thus be expected to be more relevant to our concerns; but that does not take us out of the woods. For the reality of a ‘class’ becomes tenuous in the upper reaches of secondary schooling, as ‘setting’ by attainment becomes more prevalent. In large English comprehensive secondary schools, a dozen ‘parallel’ mathematics classes for each age or ‘Year’, varying according to attainment, is not unusual; for TIMSS, usually just one of those classes was selected by some ‘equal probability’ procedure, except that when some classes were very small they were combined with another to form a ‘pseudo-classroom’ for sampling purposes. 19 A small class for very weak pupils might be combined with another class next higher in its attainments; or, for all we are told, could be combined with a small top set. In any event, no statistical analysis of the extent of student variability within teaching groups, nor even of the whole year-group within a school, seems to have been attempted as part of either of these sample inquiries, despite the central importance of that issue to success in teaching and learning, and its interest to teachers and educational planners.

19 TIMSS, [International] Technical Report, p. 121 (see also Mathematics Report, p. 349, which is also not very helpful); the English National Report has an Appendix on Sampling (p. 287) but regrettably says nothing on this vital aspect of sampling.

Despite the sampling design of both inquiries being so perverse that the variability of students within teaching groups cannot be computed (to repeat: PISA did not sample whole classes; TIMSS generally sampled only one ‘ability-set’ out of each year-group), very strong policy conclusions were voiced in the PISA report against any form of differentiation: they were against dividing secondary school pupils into different schools according to attainment level (in England: grammar schools and comprehensives); they were against dividing pupils within schools into streams or attainment sets; and they were against grade repetition, which they ‘considered as a form of differentiation’, and ipso facto evil. 20 Throughout there is the assumption that differentiation is the cause of lower average attainments, rather than seeing it the other way round – where teachers are faced with a student body that is unusually diverse, they use any organisational mechanism at their disposal to reduce diversity, and so make the group more teachable. In other words, greater variability within the class needs to be understood as the cause, rather than the effect, of lower attainments. All these conclusions were announced by PISA with great conviction – indeed, with great presumption – despite, as said, no calculations of the variability of attainments within teaching groups, classes or year-groups having been possible from their data.

20 Parents in countries with low between-school variances, we are told, ‘can be confident of high and consistent performance standards across schools in the entire education system’ (PISA, p. 163). ‘Avoiding ability grouping in mathematics classes has an overall positive effect on student performance’ (though it is conceded that ‘the effect tends not to be statistically significant at the country level’!) (p. 258). ‘Grade repetition can also be considered as a form of differentiation’ [and is therefore to be avoided] (p. 264).

The future

How was it possible, the reader will ask himself, for such large inquiries, with their endless sub-committees of expert specialists, to arrange their sampling procedures so as to exclude the possibility of calculating the variability of attainments for each class or teaching group? Any student of Kafka will readily invent his own detailed scenario; but the essence is probably that the specialists were too specialised – in particular, the statisticians did not understand, or give sufficient weight to, the pedagogics of class-based learning; and the educationists did not give sufficient attention to the implications of the sampling procedures for response rates. Perhaps most important, those in overall command were not sufficiently alive to such deficiencies in their varied specialists. Better ‘generalists’, rather than more specialists, seem to be required.

From the point of view of more representative sampling, future international inquiries of this kind, it can now be seen more clearly, need to be redesigned to incorporate sampling features of both of these recent inquiries. We need to focus (a) initially on the original variability of attainments of a complete age-group of students (variability due to socio-historical or genetic elements), perhaps estimated by the PISA approach or by sampling two (or three) adjacent school grades, as in previous TIMSS inquiries; (b) then we need to estimate the extent to which variability is reduced within teaching groups as they have been organised by schools in practice; and (c) finally, we need to estimate the separate contributions of the various institutional factors in each country to that reduction in variability – secondary school selection, ability-setting within year-groups, class repetition. Differences among countries in these elements may yield valuable and empirically-based policy conclusions.

From the point of view of the substance of the inquiries, more focus and debate would be valuable on syllabus issues within mathematics. For example, what is the proper share of arithmetic in the overall mathematics curriculum at younger ages, and how should that share vary for different attainment groups? In some countries (Switzerland, Germany), at least until recently, the less academic group of students often became more expert in mental arithmetic as a result of their different curricular emphases; has the wholesale use of calculators really made this otiose? At what ages, and to what fractions of pupils, should specific topics be introduced, such as simultaneous linear equations, quadratic equations, basic trigonometry or even basic calculus? No more than these few hints can be thrown out within the ambit of the present Note to indicate what a proper Next Step should include (see also Annex B on the anomalously low average attainments in mathematics at age 15 in the world’s economically leading country).

A final question: how much public breast-beating by the organisations that carried out the two recent inquiries will be needed before they can be considered eligible for participation in such an improved Next Step?

Acknowledgements and apologies

This Note has benefited from comments on earlier drafts by Professor G Howson (Southampton), Professor PE Hart (Reading), Professor J Micklewright (Southampton), Dr Julia Whitburn and many others at the National Institute of Economic and Social Research, London; I am also indebted to the National Institute for the provision of research facilities. Needless to say, I remain solely responsible for errors and misjudgements.

I take this opportunity also of offering apologies to the individuals who have innocently participated in carrying out the underlying inquiries here reviewed; but those who planned those inquiries must fully accept their share of blame for the inadequacies complained of here, and for too often uncritically following what was done in previous inquiries – instead of improving on those practices.

ANNEX A

Some background on the two international educational inquiries of 2003

The International Association for the Evaluation of Educational Achievement (IEA) has been active since the 1960s in sponsoring internationally comparative studies of secondary schooling – subsequently also primary schooling – involving tests set to representative samples of students. The school subjects covered were mathematics and science, plus some separate inquiries into reading/literacy. The year-groups focussed on were the eighth and fourth grades on the international grading (Europe and the United States), corresponding to Years 9 and 5 in the UK, that is, to ages of about 14 and 10. Sampling was based on school classes. Before 2003 the IEA had carried out similar inquiries in 1995 (in some countries also in 1999). The number of countries expanded over time to reach 49 in 2003; the most recent IEA inquiries go under the name of TIMSS – Trends in International Mathematics and Science Study. The studies are now managed from Boston College, Mass., with substantial financial support from the US government, mainly for the central organisation; financial support for the surveys in each country is provided locally.

Three reports were published by TIMSS on their 2003 inquiries:

IVS Mullis et al., TIMSS 2003 International Mathematics Report (Boston College, 2004), pp. 455.

IVS Mullis et al., TIMSS 2003 International Science Report (Boston College, 2004), pp. 467.

MO Martin et al. (eds), TIMSS 2003 Technical Report (Boston College, 2004), pp. 503.


The second inquiry considered here was sponsored by the OECD (Organisation for Economic Co-operation and Development), an international organisation set up in Paris to assist European post-war economic reconstruction and development, with heavy support from the United States. It conducted its first assessment of educational attainments in 2000 under the name Programme for International Student Assessment, PISA for short; a repeat was carried out in 2003. I have not been able to find any written justification for setting up an inquiry so close in its objectives to the IEA’s; but two differences – not necessarily justifications – should be noted. First, PISA focuses on a certain age, 15 – rather than on a school Year (or grade), as TIMSS does – for those included in its survey (though for some countries, such as Brazil and Mexico, that age is beyond compulsory schooling, and only about half of that age-group can be contacted). On average, the PISA age is about a year above that of TIMSS, and closer to the age of entering the workforce. Secondly, the focus of students’ questioning in PISA was said to be on the ‘ability to use their knowledge and skills to meet real-life challenges, rather than merely on the extent to which they have mastered a specific school curriculum’; whereas the focus of TIMSS is closer to the school curriculum. 21 It still remains to be shown whether, given the practicalities of written examinations held in a schoolroom, it makes any substantial difference to the outcome whether one kind of question is asked or the other.

The PISA inquiry covered mathematics and science, just as TIMSS did, and also had questions on literacy (reading). PISA’s emphasis in 2003 was on mathematics. Results were published in:

[No attributed authorship] Learning for Tomorrow’s World: First Results from PISA 2003 (OECD, Paris, 2004), pp. 476.

R Adams (ed.), PISA 2003 Technical Report (OECD, Paris, 2005).

Of the 48 countries included in PISA (49 in TIMSS, as said), 19 also participated in TIMSS. A full investigation, with access to individual questions and results in both inquiries, would be needed for a proper comparison; here we may note only that Hong Kong and Korea were near the top scorers in both inquiries (scores of 586 and 589 in TIMSS; 550 and 542 in PISA); in Europe, the Netherlands and Belgium were about equally high (536 and 537 in TIMSS – Flemish Belgium only; 538 and 529 in PISA); and the United States was very slightly above average in TIMSS (a score of 504) and more than slightly below average in PISA (483). The different mix of countries in the two samples affects the standardised marks published: such comparisons between the results of the inquiries are therefore no more than suggestive.

21 PISA (2004), p. 20.

ANNEX B

The proper objectives of international comparative educational research

That the US, the world’s top-performing economy, was found to have schooling attainments that are only middling casts fundamental doubts on the value, and the approach, of these surveys. It could be that the hyper-involved statistical methods of analysis used (known as Item Response Modelling) are, as many have suggested, wholly inappropriate (see also my comment of 2003 on the PISA 2000 survey, p. 161). Or it could be, as two US academics have suggested, that the level of schooling does not matter all that much for economic progress; rather, it is ‘Adam Smithian’ factors such as economies of scale, and minimally regulated labour markets that allow US ‘employers enormous agility in hiring, paying and allocating workers . . . ’. 22 Or – my own view – that the typical age of school-leaving in the US, at some three years above that in most European countries (say, 19 rather than 16), has the consequence that schooling attainments at 14-15 hardly provide a clear indication of the contribution of final schooling attainments to subsequent working capabilities. An older typical school-leaving age means that teachers can sequence their courses of instruction in a more graduated way; and that the kind of question set in the PISA inquiries – designed to be close to everyday life – is indeed something for which US students aged 15 are less ready than their European counterparts. But that does not mean that at later ages their schooling has not served US students as a whole at least as well as their European counterparts; more time may usefully have been spent by US students in those subsequent three years in consolidating fundamentals. No investigation, or even discussion, of such issues is to be found in the official reports on these inquiries; and the absence of a sufficient number of published individual questions makes it impossible for the reader to take the issue further.

22 See A P Carnevale and D M Desrochers, The democratization of mathematics, in Quantitative Literacy (eds. B L Madison and L A Steen, National Council on Education and the Disciplines, Princeton NJ, 2003), esp. p. 24: ‘if the United States is so bad at mathematics and science, how can we be so successful in the new high-tech global economy? If we are so dumb, why are we so rich?’


So far we have treated both surveys (TIMSS, PISA) as showing much the same schooling performance for US pupils – namely, as indifferent, or even weak, when judged in relation to the tremendous economic performance of that country. But we should also notice, and express surprise, that it is precisely in the survey whose questions emphasise practical and ‘real life’ aspects, namely the PISA survey, that average US 15-year-olds are shown as being below the world average – whereas, in the more school-task-oriented TIMSS survey, US students were – even if only modestly – above the world average. Indeed, it is not too fanciful to suppose that the undistinguished performance of US students on school-curriculum-oriented questions in the earlier TIMSS surveys provided some of the impetus for carrying out a further survey with a more practical emphasis in its questioning. But anyone who expected better results for the US via that line of questioning must have been sorely disappointed by the outcome. That outcome, it may also be concluded, casts further doubt on the value of repeating a PISA-type survey. Until wider-ranging pilot inquiries, on alternative lines, have been carried out and analysed, it is difficult to see that further inquiries of the present sort and scale are justified.


Disappearing Students

PISA and Students With Disabilities

Bernadette Hörmann

Austria: University of Vienna

1 Who is disappearing?

“Have you ever tried to get a stroller or cart into a building that did not have a ramp? Or open a door with your hands full? Or read something that has white print on a yellow background, or is printed too small to read without a magnifying glass, or has words from a different generation or culture? Have you ever listened to a speech given without a microphone?” (Johnstone/Altman/Thurlow 2006, p. 1)

Concerning student assessment, public and scientific discourse seems to be limited to questions about its conditions, possibilities and consequences. The urgent question that has to be asked concerns the role of children with disabilities in assessment tests like PISA, TIMSS, etc. Are these students included? How are they included? Is there a way to include children with special needs in assessment tests? Are these assessment tests even able to assess the abilities of students with disabilities in an adequate way?

Generally, students with disabilities (SWD) do not get the chance to participate in student assessments (e.g. Posch/Altrichter 1997, p. 41; Van Ackeren 2005, p. 26; in the USA: McGrew/Algozzine/Spiegel/Thurlow/Ysseldyke 1993; Thurlow/Elliott/Ysseldyke/Erickson 1996a; Quenemoen/Lehr/Thurlow/Massanari 2001). In most cases they are asked to stay at home when the test takes place, or they are sent to another classroom during the test. If students with disabilities are allowed to participate in the testing, their scores are usually not counted, which means that these children are not represented in the official statistics. In the case of PISA, students who attend a special school (“Sonderschule”) get special test booklets that contain easier questions and are shorter than the normal ones. But in general, students with disabilities are excluded from the testing process, which makes them disappear, as they are not represented in the results of the assessments. Children who face exclusion are students with all kinds of disabilities, immigrants (non-native speakers) and low achievers. As Wuttke observed, the states participating in PISA 2003 dealt quite differently with the “problem” of handicapped children. Turkey, for example, excluded only 0.7 per cent of its students, while the exclusion rate in Spain and the USA reached 7.3 per cent (OECD 2005, p. 169, quoted in Wuttke 2006, p. 106). PISA regulations allow the exclusion of up to five per cent of the population, a limit that has been exceeded in several states. Haider, the Austrian national coordinator of PISA 2003, gives the following advice concerning the exclusion of students in the Austrian PISA report of 2003: pupils can be excluded in the case of severe, constant physical or mental disability, insufficient language knowledge, or when the student has dropped out of school.
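Checking reported exclusion rates against PISA’s own five-per-cent ceiling is, of course, trivial to mechanise; a minimal sketch in Python, using only the rates cited above:

    # PISA's rule: at most 5 per cent of the target population may be excluded.
    # The rates below are those cited above (OECD 2005, quoted in Wuttke 2006).
    CAP = 0.05
    exclusion_rates = {"Turkey": 0.007, "Spain": 0.073, "USA": 0.073}

    for country, rate in exclusion_rates.items():
        verdict = "exceeds the cap" if rate > CAP else "within the cap"
        print(f"{country}: {rate:.1%} - {verdict}")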

In the U.S., student assessment is more common and looks back on a long tradition. For this reason, scientists have developed a distinct branch of research, one concerned with the problem of the exclusion of students with disabilities. The NCEO (National Center on Educational Outcomes) provides annual reports and detailed research studies on this topic, and it aims at raising the number of students included in the testing. In Europe, however, this issue seems not to be considered as important (cf. Hörmann 2007). There seems to be a lack of research and public interest, a topic which will be discussed later in this article.

The following lines will deal with arguments for the inclusion of SWD and will illustrate how the U.S. tries to account for the diversity of students. Afterwards, the situation in Austria and the German-speaking countries will be discussed, while prospects for the future will be given in the conclusion.
be discussed while aspects for the future will be given in the conclusion.


2 “Out of sight is out of mind” 1 : Arguments for the inclusion of students with disabilities in assessment tests

1 As Thurlow, Elliott, Ysseldyke and Erickson (1996a) remarked in a pointed way.

As Hopmann (2006) shows, student assessment has become an inevitable necessity of modern society. Since the late 1990s, the socio-political models have changed from “management by placement” to “management by expectations”. The social welfare state with its traditional institutions and form of government could not be maintained any longer, and the public gradually demanded accountability from the institutions providing public services. Under management by expectations, the risks and expectations attached to an institution are collected and then taken as the basis on which its tasks and budget are fixed (cf. Hopmann 2006).

Assessment tests in their current state also deal with expectations and are a new way of measuring risks. State school systems are now bound to account for their services; and their services are delivered to regular students as well as to students with disabilities. On what basis can the exclusion of children with disabilities from such assessments be justified any longer?

Günter Haider wrote in the Austrian national report of PISA 2003:

“Im Rahmen von PISA sollen jedoch nicht einzelne Personen geprüft, sondern die Merkmale aller Schülerinnen und Schüler eines Landes kollektiv – über große Stichproben – erfasst werden.” (Haider 2003, p. 13, emphasis in original)

Thus, PISA is not designed to test individual students, but rather to measure the characteristics of all the pupils of a specific country collectively, via large samples. Apparently, it is intended to produce a representative picture of the performance of all students of the whole nation.

According to Thurlow et al. from the NCEO, the idea of the inclusion of all students is based on the following three assumptions:

– “All students can learn.
– Schools are responsible for measuring the progress of learners.
– The learning process of all students should be measured” (Thurlow et al 1996a)

Excluding particular students from testing would mean that those children are made invisible and that they are also excluded from any political or social decisions. Most policy decisions concerning school structures are based on the results of large-scale assessments. Consequently, children
who are excluded from the test are also excluded from policy decisions which actually affect them. Thurlow even points out that excluding certain children from tests leads to invalid comparisons, which means that including those children could provide more realistic results (cf. ibid.).

From the perspective of special needs education, it has to be said that it is an obligation that every single child be granted the possibility of taking part in international student assessment. From the 1990s onwards, a new concept has been developed which should displace the old notion of “integration”: the concept of “inclusion”. The Salamanca Statement of 1994 (UNESCO 1994) proclaims the right of every single child to participate in society, which means that all children should attend the same kind of school. Disability is viewed as just one kind of diversity among many others, and the ambitious aim of the statement is not to change people, but rather social structures and institutions, so that it becomes possible to account for all people’s needs (cf. Biewer 2005, p. 102ff). From this point of view, it is not the student who is disabled, but rather the school, which “disables” certain kinds of children. Consequently, student assessment, as a part of education, has to be geared to children with disabilities; it has to be constructed in a way in which every single child has the chance to participate successfully in the test. In contrast to this conception, the concept of “integration” would mean that students with disabilities would have to be remedially instructed to an adequate degree so that they could take part in the assessment. In this case the students would have to be changed instead of the tests.

3 Research in the U.S.

Including SWD in assessment tests has gained a lot of attention and has become an important part of policy in the U.S., which is trying hard to establish a “participation policy” that should ensure that full inclusion becomes reality. In 2001, the “No Child Left Behind Act” was enacted, which dictates that every single state of the U.S. has to report the participation rates of students with disabilities and the way they participate in assessment. “Full inclusion”, however, can never fully become reality, as there will always be excluded students – at random or not (cf. Koretz/Barton 2003).

The National Center on Educational Outcomes (NCEO) publishes an annual report in which the fundamental trends of the participation of particular student groups are presented. This is a quite challenging task, because every
state has different guidelines concerning the inclusion of students with disabilities. But in general, almost every state is able to give trends and facts about the inclusion and performance of students over the past three years.

3.1 Participation and performance trends in the U.S.

The Annual Performance Report by Thurlow, Moen and Altman (2006) gives current figures on the participation and performance of students with IEP (Individualized Education Plan) enrolment in state assessments across the entire U.S. for the years 2003 and 2004. Almost every state of the US is able to provide data about its participation rates in student assessment. The figures are categorised by the different types of assessment (reading and math), the three school levels (elementary, middle and high school) and the two different types of states (regular or unique states). For example, at the elementary level, in 38 regular states (out of 50) and 2 unique states (out of 10), at least 95 per cent of all children with IEP enrolment were assessed. Most of the other states reached between 85 and 95 per cent, which can be seen as quite a high participation rate (see Fig. 1).

At the middle school level, 34 regular states and 1 unique state assessed 95 per cent or more, and at the high school level, 26 regular states and no unique state reached this limit, whereas nearly half of the states reached between 85 and 95 per cent (Thurlow et al 2006, p. 7).

Data for the math exam are quite similar to those for the reading exam (see Thurlow et al 2006, p. 10ff).

In three states, students with IEPs have the possibility of taking an alternate assessment which is based on grade-level achievement standards. The share of students who participate in those alternate tests lies between 0.1 and about 10 per cent (Table 1).

                   Elementary   Middle   High school
Massachusetts          .29        .10        .30
North Carolina        1.19        .88        .42
Texas                10.37       4.81         –

Table 1: Per cent of students with IEPs in an alternate reading assessment based on grade-level achievement standards (Thurlow et al 2006, p. 20)


[Fig. 1: Reading Assessment Participation Rates in Elementary School: Per cent participation of IEP enrolment (includes regular and alternate assessment) (Thurlow et al 2006, p. 14)]

Table 1 also shows the discrepancy between the figures of the various states (compare, for example, Massachusetts and Texas), which indicates that the regulations and their execution in the states are interpreted rather differently and are not consistent.

About 65 per cent (high school: 61 per cent) of students with IEPs take part in regular assessment, but with accommodations being made. However, quite a number of states were unable to document the number of students who took this adjusted version of the regular test (cf. ibid., p. 8 and 21).

The performance of students with IEPs in reading assessment tends to increase steadily. About 30 per cent of the students with IEPs performed at a level considered to be “proficient”, which is slightly higher than in the years 2002/2003. At the elementary school level, IEP students in 32 regular states reached more than 30 per cent, at the middle school level in 15, and at the high school level in 17 regular states (ibid., p. 30). The rates of proficiency improved in 31 states at the elementary level, in 32 at the middle school level, and in 29 regular states (out of approximately 43 states which provided data) at the high school level (ibid., p. 32).

3.2 Ways of inclusion

Basically, there are three ways of including students with disabilities in assessment:

– letting them take part in the regular assessment,
– creating accommodated versions of the regular assessment,
– creating an alternative test.

For children with severe disabilities, it is most common to create alternative tests. In cases of rather moderate impairment (e.g. learning disabilities), the students can take part in the test with special accommodations being made.

3.2.1 Testing accommodations

According to Sireci, Scarpati and Li, an accommodation is an “. . . intentional change to the testing process designed to make the tests more accessible to SWD and consequently lead to more valid interpretations of their test scores” (Sireci/Scarpati/Li 2005, p. 460). In practice, there are many ways to obtain more valid interpretations of the test scores of SWD. The kind of accommodation that is indicated depends on the special needs of each individual student. The possibilities include the following:

– providing additional time,
– providing a separate location where the student can work undisturbed,
– taking more breaks,
– reading the test directions or items to students,
– providing the test in Braille or large type,
– allowing the students to dictate their answers,
– “out-of-level” or “out-of-grade” testing (the student receives a test form actually used for a lower grade),
– deleting some items from the test (Koretz/Barton 2003, p. 6).

It is also possible to combine items of the regular version, accommodated items and alternative items. Furthermore, there is the possibility of employing access assistants, who act as “intermediaries” between the children and their special needs. They can read the test items to the students, write down their responses or communicate with the students through sign language. Sometimes they also do translation work, turn pages, transcribe or paraphrase the students’ answers.


Clapper et al. give advice on the development of guidelines for these assistants (cf. Clapper et al. 2006).

In most cases it is necessary to make more than one accommodation, as one accommodation might require a further one (e.g. a deaf student who receives the test instructions in written form needs additional time, because reading the instructions takes longer than hearing them) (cf. Koretz/Barton 2003, p. 22).

Of course, there is a heated discussion about the validity of all these adjustments. In particular, accommodations which have an impact on the basic construct of the test have caused controversial debate. Even in the U.S. there is a lack of research concerning the effects of accommodations on the validity of test scores, as Koretz and Barton observed (cf. Koretz/Barton 2003, p. 3, also Thurlow et al. 1996a). In their opinion, the main problems concerning test accommodations are the inhomogeneity of the group of SWD, the accurate and appropriate use of accommodations, construct-relevant disabilities and the design of the tests (danger of item or test bias) (cf. Koretz/Barton 2003).

3.2.2 Alternate Tests

Alternate tests are usually based on the IEPs of the respective students and are an attempt to include students with disabilities who cannot participate in the general assessment system. Basically, one has to be aware that students with disabilities are not automatically assessed by alternate tests. Thurlow et al. point out that the majority of students with disabilities should participate in the regular assessment, be it with accommodations or without. An important criterion for deciding which kind of assessment a student should participate in is the goal the student has in mind. If the student aims at achieving the same goals as students with a regular curriculum, the student should take part in the general assessment, albeit with accommodations. The most important advice is that the decision should not be based on the expected performance of the student (cf. Thurlow/Olsen/Elliott/Ysseldyke/Erickson/Ahearn 1996b).

Concerning the integration of the results of alternate tests, there is quite a lack of research. On the one hand, there is the possibility of reporting the results separately from those of the general assessment, which would make it possible to analyze the special education services. On the other hand, the attempt is made to avoid separation between students with and without disabilities, which would mean that testing results should be aggregated and combined (cf. Thurlow et al. 1996b). Once more it becomes clear that much more research is needed on this issue.

3.3 Consequences for students with disabilities in student assessment

Ysseldyke, Dennison and Nelson (2004) investigated positive consequences of large-scale assessments for students with disabilities. The increased participation of SWD led to higher expectations of their performance (from parents, teachers and the students themselves), which had usually been rather low. The students interviewed for this study even pointed out that they had the impression that their teachers paid more attention to them and gave them more support. Furthermore, the participation of students with disabilities resulted in improved test instructions, teaching strategies and performances of the respective students. There are also better chances for the respective children to graduate or to obtain diplomas, and the risk of dropping out of school decreases. The cooperation between IEP teachers and regular teachers improves, and parents of students with disabilities seem to be more interested in the performance and development of their children (cf. Ysseldyke et al. 2004, p. 4ff). Ysseldyke et al. gained all these findings from extensive research of literature and media and from interviews with people involved in student assessment.

Ruth Nelson searched for positive as well as negative consequences and found that, due to the participation of students with disabilities, there is increased exposure to the curriculum, a consequence of intensified test preparation, extra tutoring, extra lessons, etc. However, the increased exposure also causes higher levels of stress, anxiety and frustration among students, as well as limited possibilities for choosing electives. Ysseldyke et al. (2004) as well as Nelson (2006) discovered that both participation and expectations have increased. However, Nelson could not find any empirical evidence for the assumed increase in referrals to special education or in the retention of students (cf. Nelson 2006).

3.4 Universally designed assessments or: “a more accessible assessment is always an option” (Johnstone et al. 2006, p. 23)

The latest attempts to include as many students as possible in state assessment are “universally designed assessments”. This is a project of NCEO, for which a guide was published in order to provide states and their responsible representatives with information and ideas about ways of including students with disabilities in assessment tests. Universal design demands accessibility for everyone, whether he or she is disabled, a non-native speaker of English, a migrant or whatever else. When an assessment test is designed universally, it respects the diversity of the population. It is characterised by concise and readable texts, clear formats and clear visuals, and it allows changes in the format as long as they do not change the meaning or the level of difficulty. It is stressed that those tests are intended neither to change the standard of performance of assessments nor to make them easier for special groups. The ambitious aim of universal design is to create the “most valid assessment possible for the greatest number of students, including students with disabilities” (Johnstone et al. 2006, p. 1). In this guide provided by Johnstone et al., 10 steps are proposed for the best way of achieving a universally designed assessment. I will not describe each of these steps in detail, but offer an overview of the main features of this approach.

The main idea of the approach is to consider the diversity of the students from the very beginning (and not to adjust the tests afterwards to the special needs of some groups of students). For this reason, every item has to be checked in the phase of conceptualization. Content which could give an unfair advantage or disadvantage to a certain group of students should be avoided (e.g. by using large font sizes and by avoiding unnecessary linguistic complexity when it is not what is being assessed). Every single item has to be checked as to whether it allows for the diversity of the pupils (gender, age, ethnicity, socioeconomic status, region, disability, language). In order to avoid ceiling or floor effects, it is important to develop a full range of test performance, and an adequately sized item pool is needed in case items have to be eliminated. The authors are aware that this is a truly challenging and time-consuming procedure, but, as they argue, considering accessibility from the beginning can save time and effort later (cf. ibid., p. 6).

Once all items are constructed, it is necessary to have them reviewed by expert teams in the participating states. Members of several special groups, such as language minorities, disability groups, scientists, teachers, etc. should be involved in this review process, and they should examine whether the test items give some advantage or disadvantage to a certain group of students. They are required to look at the response format and decide whether it is clear enough, to check whether the item really tests what it claims to test, and whether it could induce errors that are not related to the question (cf. ibid., p. 7). If there are any items that the experts find problematic, these items are analyzed with the “Think Aloud Method” in order to find out whether they can be incorporated in the test or not, or whether they should be adjusted. After the field test, the items are analyzed statistically, in particular those which were conspicuous (cf. ibid., p. 13). When the final revision is done, the test can be carried out.

It has often been pointed out that it is not possible to create a test which is accessible to every single student. However, as mentioned, the goal is to make it as widely accessible as possible. Moreover, as the authors proclaim, all students participating in universally designed assessments benefit from having more accessible tests (cf. ibid., p. 2).

3.5 Emerging issues

Koretz and Barton summarize the most important topics which have to be considered when it comes to the inclusion of students with disabilities. At the same time, these issues represent the most important research gaps that need to be closed:

– First of all, students with disabilities have to be identified and classified. Comparisons have shown that figures on e.g. children with learning disabilities range from 3 to 9.1 percent. Therefore, it can be assumed that the lines are drawn rather differently and that the term “learning disability” does not necessarily mean the same thing to everybody (cf. also McGrew et al. 1993). Thus, identification and classification seem to be crucial for making an equitable assessment system possible.
– Appropriate use of accommodations: Accommodations, or the “corrective lenses” (Koretz/Barton 2003, p. 7), are not only an important way to increase the inclusion of students with disabilities in assessment tests, but they also tend to influence and bias the validity of the tests. Research concerning the validity of accommodations is still tremendously needed.
– The problem of construct-relevant disabilities: An assessment test can be offered in Braille to blind students; this kind of accommodation does not influence the construct of the test. But when it comes to e.g. dyslexia, it becomes quite difficult. In this case, the student is not able to understand the tasks, because most assessment tests are language-based. However, this does not mean that the student is unable to solve the task just because he cannot read it.
– Concerning test design, it is important to keep an eye on bias. Several assessment formats (like multiple choice, open response, etc.) have different effects and consequences for different students, especially for those with disabilities.

(cf. Koretz/Barton 2003, p. 3ff)

Although the US plays a leading role in including students with disabilities, there is still a huge lack of research concerning validity and alternate ways of test participation (cf. also Quenemoen et al. 2001, Thurlow et al. 1996a). As Koretz and Barton point out, the sheer inhomogeneity of the group of students with disabilities makes it tremendously difficult to create guidelines and prescriptions. Moreover, construct-relevant disabilities pose tough challenges for research (cf. Hales 2004, who shows that, and why, common tests are not able to measure the skills and proficiency of students with dyslexia in an adequate way).

According to Koretz and Barton, the most important steps are an increased collection of data on the assessment participation of students with disabilities and further research on possible item bias, test bias, and, of course, validity. To make comparisons possible, it is necessary to standardize the various definitions of disabilities and the conditions of participation (cf. Koretz/Barton 2003, p. 23ff).

Finally, Ruth Nelson emphasizes the need to identify, limit and empirically document unintended negative consequences for SWD in assessment tests (e.g., as mentioned above, increased anxiety, exposure to the curriculum, etc.). Trying to avoid these unintended consequences can be “life-changing” for the respective students, because it gives them a fair chance to show their actual abilities (Nelson 2006, p. 34f).

4 Situation in Austria and German-speaking countries

In 1995, Elliott, Shin, Thurlow and Ysseldyke searched national education encyclopedias and yearbooks of 14 states worldwide to discover whether they reported facts and figures on the inclusion of students with disabilities in assessments. These states were: Argentina, Australia, Canada, Chile, China, England and Wales, France, Japan, Korea, the Netherlands, Nigeria, Sweden, Tunisia and the U.S. What they found was that, out of these 14 states, just a few documented the inclusion of students with disabilities in assessment tests. Only the U.S. reports exact facts and figures, while some other states present a short description of their directions regarding the participation of students with disabilities (Canada, France and Korea mention that they allow accommodated tests for students with disabilities). Elliott et al. interpret their research findings as follows: there may be three possible reasons why states do not give any data about the inclusion of SWD. First, it could be that they exclude SWD arbitrarily; second, it is possible that data from disabled children are collected, but not counted and not published; third, data could be collected and counted, but not published (cf. Elliott et al. 1995).

As in most states included in the study, no attention is paid in Austria to the topic of the inclusion of SWD in assessment. Since the 1990s, and since 2000 in particular, Austria has been taking part in several assessments (CIVED, FIMS/TIMSS, PISA, PIRLS), and it now has a national screening for testing reading abilities (Salzburger Lese-Screening). Although the official test reports of PISA give some advice as to how to handle SWD, there is no real interest in how this advice is carried out in practice. As shown in Hörmann 2007, people who are involved in assessment testing processes do not consider the problem to be relevant and urgent. Eight interview partners (teachers, school directors, scientists, and an employee of the ministry of education responsible for international assessment) were asked by means of short qualitative interviews what they think about the problem, whether they have already had any experience with it and how they reacted. The main interest of this research project was to investigate the extent to which these people have already been confronted with the problem, what they know and think about it and how they deal with it (cf. Hörmann 2007, p. 58ff).

The interviews reveal that there are two ways of perceiving the problem: the administrative-organisational perspective and the perspective of the children affected by the problem. The majority of the interview partners take the administrative-organisational perspective and do not regard the problem as an important one. Some of the interview partners who work closely with children with disabilities or disadvantaged children take the position of these students and demand solutions to the problem. It is a fact for all interviewees that students with e.g. learning disabilities need more support at school, but for most of them it is not obvious that these children would also need this support when taking an assessment test (cf. Hörmann 2007, p. 85f).

For my diploma thesis, I conducted a thorough literature search in order to find information on the way Austria copes with this problem. I asked scientists and institutions, but nobody could help me. Likewise, for all German-speaking countries, I could barely find any relevant literature. As mentioned above, Wuttke published exclusion rates, and Elisabeth von Stechow dedicated a small chapter of her article to the exclusion rates of students with disabilities (cf. Stechow 2006, p. 22). The volume “Sonderpädagogik und PISA” (2006) seems to be the first German publication that responds, at least in part, to the problem of students with disabilities in assessments. Oser and Biedermann proclaim the necessity of a specific assessment for special education (their slogan: “PISA for the rest”) (cf. Oser/Biedermann 2006).

Nevertheless, it is obvious that there are children who do not fit into the “norm”, who have special needs and who cannot cope with a conventional testing situation. This especially concerns students with learning disabilities, who never get the chance to show their real abilities, because they generally fail at reading the test items. Meyerhöfer speaks of “Testfähigkeit”, meaning that every student has to develop a certain kind of ability that enables her or him to cope with the testing situation, organize the provided time, read both quickly and carefully and use clever strategies to find the right solution (Meyerhöfer 2005, p. 187 and in this book). Assessment tests as they are constructed at present are definitely not able to test students with disabilities in an adequate way. In addition, when nobody is interested in what happens to these students, they become invisible and disappear from public view.

5 Conclusion

This article set out to raise awareness of the problems students with disabilities face in relation to student assessment. Research has revealed that there is a lack of literature and discourse about this problem, which shows how unknown and unimportant it is to people who actually work with assessment tests in their profession (cf. Hörmann 2007).

PISA seems to be no exception in this respect. By assuming that there exists an average student endowed with average skills, not only within one but even across countries, it neglects by construction children deviating from that “norm”. As a result, these children are either excluded from the assessment or (if they get the chance to take part at all) are doomed to fail.

Confronted with this problem, the people behind PISA play down its importance and its impact on the respective children and show no ambition to change the situation. It probably lies in the self-interest of the people involved in the construction and conduct of this study to marginalise the problem, since this critique does not just refer to certain aspects of PISA but questions primary assumptions and thereby shakes it to its foundations.

Even though many people think that the problem of exclusion is irrelevant from a statistical point of view, I am not of the opinion that this is an argument against inclusion. Assessment tests are currently not able to test SWD in an adequate way, but this does not mean that there is no way to change the assessment tests in order to make them able to assess SWD (see 3.2 and 3.4). Results of assessments are often used for political decisions. If children with disabilities are excluded from assessment tests, they are also excluded from those political decisions, in spite of the fact that these decisions concern them just as much as all the other students. This means that SWD disappear once again. The state and society are responsible for the education of every single child, including children with disabilities, and every single child has the right to be part of society. In terms of the concept of inclusive education, it is a duty and a responsibility to accommodate the tests to the children, and not the children to the tests. Participating in assessments in an adequate and successful way can support the self-confidence and the performance of the respective students in a very positive manner.

Research and experience in the U.S. show interesting ways of including students with disabilities in assessment tests. Testing accommodations, alternate tests and “universally designed assessments” are new options for accounting for the diversity of people, although research is still in its early stages.

If PISA wants to move with the times, I suggest a revision both of the paradigm and of the construction in order to obtain results that really represent the variety of students. In order to reach this goal, it is important to raise awareness of this problem and to start a discussion in public and among scientists. Only then will it be possible to think about ways of solving this problem.

Literature

Biewer, Gottfried (2005): “Inclusive Education”. Effektivitätssteigerung von Bildungsinstitutionen oder Verlust heilpädagogischer Standards? - In: Zeitschrift für Heilpädagogik, Jahrgang 56 (2005), Heft 3, S. 101-108.

Elliott, J.L.; Shin, H.; Thurlow, M.L.; Ysseldyke, J.E. (1995): A perspective on education and assessment in other nations: Where are students with disabilities? (Synthesis Report No. 19). - Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Available at: http://education.umn.edu/NCEO/OnlinePubs/Synthesis19.html (15.6.2006)

Haider, Günter (2003): OECD/PISA – Programme for International Student Assessment (Kapitel 1.2 des nationalen Berichts von PISA 2003). Available at: http://www.pisa-austria.at/PISA2003_Kapitel_1_2_nationalerBericht.pdf (12.6.2007)

Hales, Gerald (2004): Putting in nails with a spanner: the potential effect of using false data from language-rich tests to assess dyslexic people. Available at: http://www.bdainternationalconference.org/2004/presentations/sat_s3_d_7.shtml (27.4.2006)

Hopmann, Stefan Thomas (2006): Im Durchschnitt PISA oder Alles bleibt schlechter. - In: Criblez, Lucien; Gautschi, Peter u.a. (Hrsg.): Lehrpläne und Bildungsstandards. Was Schülerinnen und Schüler lernen sollen. Festschrift zum 65. Geburtstag von Prof. Dr. Rudolf Künzli. - Bern: hep-Verlag, S. 149-172.

Hörmann, Bernadette (2007): Die Unsichtbaren in PISA, TIMSS & Co. Kinder mit Lernbehinderungen in nationalen und internationalen Schulleistungsstudien. - Wien (Diploma Thesis)

Jahnke, Thomas; Meyerhöfer, Wolfram (Hrsg.) (2006): PISA&Co. Kritik eines Programms. - Hildesheim, Berlin: Franzbecker

Johnstone, Christopher; Altman, Jason; Thurlow, Martha (2006): A state guide to the development of universally designed assessments. - Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.

Koretz, Daniel; Barton, Karen (2003): Assessing Students with Disabilities: Issues and Evidence (CSE Technical Report 587). - Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing. Available at: http://www.cse.ucla.edu/products/Reports/TR587.pdf (20.6.2007)

McGrew, K.; Algozzine, B.; Spiegel, A.; Thurlow, M.; Ysseldyke, J. (1993): The identification of people with disabilities in national databases: A failure to communicate (Technical Report No. 6). - Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Available at: http://education.umn.edu/NCEO/OnlinePubs/Technical6.html (22.1.2006)

Meyerhöfer, Wolfram (2005): Tests im Test. Das Beispiel PISA. - Opladen: Budrich.

Nelson, J. Ruth (2006): High stakes graduation exams: The intended and unintended consequences of Minnesota’s Basic Standards Tests for students with disabilities (Synthesis Report 62). - Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Available at: http://www.education.umn.edu/NCEO/OnlinePubs/Synthesis62/default.html

OECD (2005): PISA 2003 Technical Report. - Paris: OECD.

Oser, Fritz; Biedermann, Horst (2006): PISA für den Rest: Lehr- und Lernbehinderung und ihre schulische Anstrengungslogik. - In: Vierteljahresschrift für Heilpädagogik und ihre Nachbargebiete, Jahrgang 75 (2006), Heft 1, S. 4-8.

Posch, Peter; Altrichter, Herbert (1997): Möglichkeiten und Grenzen der Qualitätsevaluation und Qualitätsentwicklung im Schulwesen. Forschungsbericht des Bundesministeriums für Unterricht und kulturelle Angelegenheiten. - Innsbruck, Wien: Studien-Verlag (Bildungsforschung des Bundesministeriums für Unterricht und kulturelle Angelegenheiten; 12)

Quenemoen, R.F.; Lehr, C.A.; Thurlow, M.L.; Massanari, C.B. (2001): Students with disabilities in standards-based assessment and accountability systems: Emerging issues, strategies, and recommendations (Synthesis Report 37). - Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Available at: http://education.umn.edu/NCEO/OnlinePubs/Synthesis37.html (22.1.2006)

Sireci, Stephen G.; Scarpati, Stanley E.; Li, Shuhong (2005): Test Accommodations for Students With Disabilities: An Analysis of the Interaction Hypothesis. - In: Review of Educational Research, Vol. 75 (2005), No. 4, p. 457-490.

Stechow, Elisabeth von (2006): Soziokulturelle Benachteiligung und Bewältigung von Heterogenität – Eine sonderpädagogische Antwort auf eine Empfehlung der KMK. - In: Stechow, Elisabeth von; Hofmann, Christiane (Hrsg.): Sonderpädagogik und PISA. Kritisch-konstruktive Beiträge. - Bad Heilbrunn: Klinkhardt.

Thurlow, M.L.; Elliott, J.L.; Ysseldyke, J.E.; Erickson, R.N. (1996a): Questions and answers: Tough questions about accountability systems and students with disabilities (Synthesis Report No. 24). - Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Available at: http://education.umn.edu/NCEO/OnlinePubs/Synthesis24.html (22.1.2006)

Thurlow, M.; Olsen, K.; Elliott, J.; Ysseldyke, J.; Erickson, R.; Ahearn, E. (1996b): Alternate assessments for students with disabilities unable to participate in general large-scale assessments (Policy Directions No. 5). - Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Available at: http://education.umn.edu/NCEO/OnlinePubs/Policy5.html (22.1.2006)

Thurlow, Martha; Moen, Ross; Altman, Jason (2006): Annual Performance Report: 2003-2004. State Assessment Data. - National Center on Educational Outcomes. Available at: http://www.education.umn.edu/nceo/OnlinePubs/APR2003-04.pdf (13.5.2007)

UNESCO (1994): The Salamanca Statement and Framework for Action on Special Needs Education. World Conference on Special Needs Education: Access and Quality. - Salamanca, Spain: 7-10 June 1994. Available at: http://unesdoc.unesco.org/images/0009/000984/098427eo.pdf (18.6.2007)

Van Ackeren, Isabell (2005): Vom Daten- zum Informationsreichtum? Erfahrungen mit standardisierten Vergleichstests in ausgewählten Nachbarländern. - In: Pädagogik, Jahrgang 57 (2005), Heft 5, S. 24-28

Wuttke, Joachim (2006): Fehler, Verzerrungen, Unsicherheiten in der PISA-Auswertung. - In: Jahnke, Thomas; Meyerhöfer, Wolfram (Hrsg.): PISA&Co. Kritik eines Programms. - Hildesheim, Berlin: Franzbecker, S. 101-154

Ysseldyke, J.; Dennison, A.; Nelson, R. (2004): Large-scale assessment and accountability systems: Positive consequences for students with disabilities (Synthesis Report 51). - Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Available at: http://education.umn.edu/NCEO/OnlinePubs/Synthesis51.html (15.6.2006)


Identification of Group Differences Using PISA Scales – Considering Effects of Inhomogeneous Items

Peter Allerup
Denmark: University of Aarhus

Abstract:

PISA data have been available for analysis since the first PISA database was released from the PISA 2000 study. The two following PISA studies in 2003 and 2006 formed the basis of dynamic analyses besides the traditional cross-sectional type of analysis, where PISA performances in mathematics, science and reading are analysed in relation to student background variables. The aim of many analyses, carried out separately on the PISA 2000 and PISA 2003 data, has been to look for significant differences in PISA performance between groups of students.

Few studies, however, have been directed towards the psychometric question as to whether the PISA scales are correctly measuring the reported differences. For example, could it be that reported sex differences in mathematics are partly due to the fact that the PISA mathematics scales are not measuring the girls and the boys in a uniform or homogeneous way? In other words, using the terms of modern IRT (Item Response Theory) analyses, it is questioned whether the relative difficulty of the items is the same for girls and boys. The fact that item difficulties are not the same for girls and boys, a condition which is called item inhomogeneity, can be demonstrated to have an impact on the conclusions of comparisons between student groups, e.g. girls versus boys.

The present analyses address the problem of possible item inhomogeneity in the PISA scales from 2000 and 2003, asking specifically whether the PISA scale items are homogeneous across sex, ethnicity and the two points in time (2000 and 2003). This will be illustrated using items from all three PISA subjects: reading, mathematics and science. The main efforts will, however, be concentrated on the subject of reading. The consequences of detected item inhomogeneities for the calculation of student PISA performances (measures of ability) are demonstrated, on the individual student level as well as on a general, average student level.

Inhomogeneous items and some consequences

In order to give a precise definition of item inhomogeneity, it is useful to refer to the general framework in which items, students and responses belong and in which their mutual interactions can be made operational. Figure 1 displays the fundamental concepts behind many IRT (Item Response Theory) approaches to data analysis, the Rasch analysis in particular. The response $a_{vi}$ from student No. $v$ to item No. $i$ takes the values $a_{vi} = 0$ for an incorrect and $a_{vi} = 1$ for a correct response.

The parameters $\sigma_1, \ldots, \sigma_k$ are latent measures of item difficulty, and $\theta_1, \ldots, \theta_n$ are the students’ parameters carrying the information about student ability. These are the PISA student scores which are reported and compared internationally (or estimates thereof).

The definition of item homogeneity is now given by a manifestation of the fact that the responses $((a_{vi}))$ are determined by a fixed set of item parameters given by the framework, valid for all students, and therefore for every subgrouping of the students. Actually, the probability of obtaining a correct response $a_{vi} = 1$ for student No. $v$ to item No. $i$ is given by a special IRT model, the so-called Rasch Model (Rasch, 1960), which calculates the chances of solving the tasks behind the items by referring to the same set of item parameters regardless of which student is considered.

[Figure 1 not reproduced: an $n \times k$ matrix of responses $a_{vi}$, one row per student and one column per item, with the item difficulties $\sigma_1, \ldots, \sigma_k$ heading the columns, the student abilities $\theta_v$ labelling the rows and the student scores $r_v$ as row margins.]

Figure 1: The framework for analyzing item inhomogeneity in IRT models. Individual responses $((a_{vi}))$, latent measures of item difficulty $\sigma_i$, $i = 1, \ldots, k$, student abilities $\theta_v$, $v = 1, \ldots, n$, and student scores $(r_v)$ recording the total number of correct responses across $k$ items.



The Rasch Model is the theoretical, psychometric reference for the validation of the PISA scales, and it has been the reference for scale verification and calibration in the IEA international comparative investigations, e.g. the reading literacy study RL (Elley, 1993), TIMSS (Beaton et al., 1998), CIVIC (Torney-Purta et al., 2000) and the NAEP assessments after 1984 in the USA.

Using this model it can e.g. be shown that a correct response $a_{vi} = 1$ to an item with item difficulty $\sigma_i = 1.20$, given by a student with $\theta_v = -0.5$, takes place with probability $P(a = 1) = 0.62$, i.e. with a 62 % chance.

$$P(a_{vi} = 1) = \frac{\exp(\sigma_i + \theta_v)}{1 + \exp(\sigma_i + \theta_v)}$$
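To make the parametrization concrete, here is a minimal numerical sketch in Python (an illustration, not code from the study; it simply evaluates the logistic expression above, in the form reconstructed from the surrounding worked examples, and reproduces the item R055Q01 figures discussed below in connection with Table 1):

    import math

    def rasch_p(sigma: float, theta: float) -> float:
        # P(a = 1) = exp(sigma + theta) / (1 + exp(sigma + theta)),
        # the form reconstructed from the surrounding examples
        z = sigma + theta
        return math.exp(z) / (1.0 + math.exp(z))

    # Item R055Q01 (cf. Table 1): difficulty 1.27 in 2000, 1.23 in 2003.
    # For an average student (theta = 0.00) the chance of a correct
    # response moves from about 0.78 to 0.77, as reported below.
    print(round(rasch_p(1.27, 0.0), 2), round(rasch_p(1.23, 0.0), 2))  # 0.78 0.77
    # For an above-average student (theta = 2.00): 0.963 vs. 0.962.
    print(round(rasch_p(1.27, 2.0), 3), round(rasch_p(1.23, 2.0), 3))  # 0.963 0.962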

A major reason for the wide applicability of the Rasch Model lies in the existence of the following three equivalent characterizations of item homogeneity, proved by Rasch (see e.g. Rasch, 1971; Allerup, 1994; Fischer and Molenaar, 1995) and given here in abbreviated form:

1. The student scores (and, by the parallel property, the item scores) are sufficient statistics for the latent student abilities $\theta_1, \ldots, \theta_n$, viz. all information concerning $\theta_v$ is contained in the student score $r_v$.
2. The student abilities $\theta_1, \ldots, \theta_n$ can be calculated with the same result irrespective of which subset of items is used.
3. Data collected in the framework of figure 1 fit the Rasch Model, i.e. the model forms an adequate description of the variability of the observations $((a_{vi}))$ in figure 1.

While Rasch often referred to these properties as the analytic means for ‘specific objective comparisons’, others have adopted the notion ‘homogeneous’ for the status of items when the conditions are met. The practical power behind this, seen from the point of view of the theory of science, is that ‘objective comparison’ is in casu a requirement which can be investigated empirically by means of simple statistical techniques, i.e. statistical tests of fit of the Rasch Model (cf. property 3). It is hence not a purely theoretical concept, but rather one which requires empirical action beyond the ‘theoretical’ thought invested, from the subject matter’s point of view, into the construction of items.

From this characterization of item homogeneity it follows that ‘inhomogeneity’, or ‘inhomogeneous items’, appears when items are not homogeneous, for example when different subsets of items give rise to different measures of student abilities. This is e.g. one of the risks which might appear in PISA with the rotation of booklets, where students who are responding to different item blocks must still be compared on the same PISA scale (cf. property 2). The present analyses will focus directly on possible violations of ‘item homogeneity’ by looking, through the fit of the Rasch Model, for indications of different sets of estimated item parameters assigned to different student groups. In other words, it will be tested whether e.g. boys and girls are measured by the same set of item parameters. Two other criteria defining groups of students will be applied, these being 1) the year of testing, 2000 vs. 2003, and 2) ethnicity, Danish vs. non-Danish linguistic background. Especially in the subject of reading, the distinction by ethnicity is of interest, because different language competencies are expected to influence the understanding of, and through this the ability to reply correctly to, the reading tasks.

The consequences of item inhomogeneity are diversified and can bring about serious implications, depending on the analytic view. In a PISA context, however, one specific kind of consequence attracts attention: how are comparisons carried out by means of student PISA scores affected by inhomogeneity? If boys and girls are in fact measured by two different scales, i.e. two sets of item parameters, will this influence conclusions drawn under the use of one common, ‘average’ set of items? Will an interval of PISA points estimated to separate the average $\theta$-level of Danish students from that of non-Danish students be greater or smaller if knowledge about item inhomogeneity is introduced into the $\theta$-calculations?

Such consequences can be exposed on the $\theta$-scale either at the individual student level, using one item and one individual, or at the general level, using all students and all items.

The individual level is established in a simple way by calculating the individual change on the $\theta$-scale which is mathematically needed to compensate for a given difference in the $\sigma$-parameter under the assumption that a fixed probability of answering correctly is maintained. Suppose, for instance, that data from the boys are fitted to the Rasch Model with estimated item difficulty $\sigma_1 = 0.40$, and the same item gets an estimated difficulty $\sigma_2 = 0.75$ for the girls, a difference which can be tested to be significant (Allerup, 1995 and 1997). Then a simple calculation under the Rasch Model shows that in order for a boy and a girl to obtain equal probabilities of answering this item correctly, the boy’s $\theta$-value must be adjusted by 0.75 - 0.40 = 0.35. The item is easier for the boy than for the girl, even considering a boy and a girl with the same $\theta$-value¹ who hence should be equally capable of answering the item correctly. In order to compensate for this scale-specific advantage, a boy ‘should start lower’ by subtracting 0.35 from his $\theta$-value.² In a way this resembles the rules in golf, where the concept of ‘handicap’ plays a similar role in making comparisons between players more fair.
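The offsetting step can be checked numerically. The following sketch (under the same illustrative parametrization as above, with the $\sigma$-values from the example) verifies that shifting $\theta$ by the difference $\sigma_2 - \sigma_1 = 0.35$ exactly offsets the difference in item parameters at any ability level:

    import math

    def rasch_p(sigma, theta):
        z = sigma + theta
        return math.exp(z) / (1.0 + math.exp(z))

    sigma_boys, sigma_girls = 0.40, 0.75  # values from the example above
    shift = sigma_girls - sigma_boys       # 0.35, the 'handicap'
    for theta in (-1.0, 0.0, 1.5):
        # identical response probabilities after the compensation
        assert abs(rasch_p(sigma_girls, theta) - rasch_p(sigma_boys, theta + shift)) < 1e-12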

When moving from the individual level to the comprehensive level, including all items, two simple methods are available. The first one is based on theoretical calculations, where expected scores are compared for fixed $\theta$-values using the two sets of inhomogeneous item parameters.³ The second approach is based on summing up the individual changes for all students as an average; it suffices to summarize all individual $\theta$-changes within each group in question when using the set of item parameters specific to each group. A third strategy consists of first removing inhomogeneous items from the scale and then carrying out the statistical analyses by means of the remaining homogeneous items only, e.g. the estimation of the student PISA scores. Following this procedure, a ‘true’ difference between the groups will then be obtained. In a way this last procedure follows the traditional path of Rasch scale analysis, where the successive steps from field trials to the main study are paved by item analyses, correcting and eliminating inhomogeneous items step by step. As stated, the present analyses will focus on student groups defined by gender, year of investigation and ethnicity.
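The first, theoretical method can be sketched numerically. The fragment below (an illustration under the same assumed parametrization as above, not code from the study) computes the expected score $E(r \mid \theta)$ for two sets of item parameters and searches for the $\theta$-shift needed to equate the expected scores; the item values are the first five common reading items of Table 1 below:

    import math

    def rasch_p(sigma, theta):
        z = sigma + theta
        return math.exp(z) / (1.0 + math.exp(z))

    def expected_score(sigmas, theta):
        # expected number of correct responses at ability theta
        return sum(rasch_p(s, theta) for s in sigmas)

    def theta_compensation(sigmas_a, sigmas_b, theta, step=1e-4):
        # crude grid search: how far must theta move under item set B
        # to reproduce the expected score obtained under item set A?
        target = expected_score(sigmas_a, theta)
        t = theta - 5.0
        while expected_score(sigmas_b, t) < target:
            t += step
        return t - theta

    sigmas_2000 = [1.27, -0.66, -0.08, 0.44, 0.58]  # items 1-5, year 2000
    sigmas_2003 = [1.23, -0.79, -0.21, 0.55, 1.97]  # items 1-5, year 2003
    print(theta_compensation(sigmas_2000, sigmas_2003, theta=0.0))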

Data used

The data for these analyses were collected in different studies with no overlap. The standard PISA 2000 and 2003 data are representative samples, while the PISA Copenhagen data comprise all public schools in the community of Copenhagen; PISA E is a sample specifically addressing the participation of ethnic students and was therefore created from prior knowledge as to where this group of students attends school.

¹ The same $\theta$-value means that they are considered to be identical in the framework.
² The analytic picture is slightly more complicated, because there are constraints (a normalization) on the $\sigma$-values.
³ The expected score $E(\sum_i a_{vi} \mid \sigma_1, \ldots, \sigma_k)$ is analyzed as a function of $\theta$; solving $r_v = E(\sum_i a_{vi} \mid \sigma_1, \ldots, \sigma_k)$ with the conditional ML estimates of $\sigma_1, \ldots, \sigma_k$ inserted provides the estimate of $\theta_v$.



1. PISA 2000, N = 4209: 50 % girls, 50 % boys, 6 % ethnics
2. PISA 2003, N = 4218: 51 % girls, 49 % boys, 7 % ethnics
3. PISA E, N = 3652: 48 % girls, 52 % boys, 25 % ethnics
4. PISA Copenhagen, N = 2202: 50 % girls, 50 % boys, 24 % ethnics

In the three studies PISA 2000, PISA E and PISA Copenhagen, the same set of PISA instruments was used, i.e. the same set of items organized in nine booklets was rotated among the students. In PISA 2003 some of the items from the PISA 2000 study were reused, because common items must be available for bridging between 2000 and 2003. According to the PISA cycles, every study has a special theme: in 2000 it was reading, and in 2003 it was mathematics. In these years the respective subjects were especially heavily represented with many items. Because of this, the present analyses dealing with the 2003 data are undertaken mainly by means of the items which the two PISA studies 2000 and 2003 have in common.

Scaling PISA 2000 versus PISA 2003 in reading

One of the reasons for the interest in the PISA scaling procedures was the fact that the international PISA report from PISA 2003 comments upon the general change in the level of reading competencies between 2000 and 2003 in the following manner:

“However, mainly because of the inclusion of new countries in 2003, the overall OECD mean for reading literacy is now 494 score points and the standard deviation is 100 score points.” (PISA 2003, OECD)

It seems very unlikely that all students in the world, taught in more than 50 different school systems, should experience a common weakening of their reading capacities across three years, amounting to 6 PISA points (from 500 to 494); a further explanation given in the Danish National Report does not provide a more convincing account of this significant drop of 6 PISA points:

“The general reading score for the OECD countries dropped from 500 to 494 points. This is influenced by the fact that two countries joined PISA between 2000 and 2003, contributing to the lower end, while the Netherlands lifts the average a bit. But, considering all countries, it looks like the reading score has dropped a bit.” (PISA 2003, ed. Mejding)


Could it be that the 6-point drop was the result of item inhomogeneities across 2000 and 2003? If this question, either in full or in part, must be answered with a yes, one can still hope to conduct appropriate comparisons between student responses from 2000 and 2003. In fact, assuming that no other scale problem exists within either of the years 2000 and 2003, one can consider the two scales completely separately and apply statistical test equating techniques. The PISA 2000 reading scale has been compared to the IEA 1992 Reading Literacy scale using this technique, showing that these two scales – in spite of inhomogeneous items – are psychometrically parallel (Allerup, 2002).

PISA 2000 and PISA 2003 share 22 reading items, which are necessary for the analysis of homogeneity by means of the Rasch Model. The items are found in booklet No. 10 in PISA 2003 and booklet No. 4 in PISA 2000. Table 1 displays the (log) item difficulties, estimated under the simple one-dimensional Rasch Model.⁴

⁴ Conditional maximum likelihood estimates from $p(((a_{vi})) \mid (r_v))$, conditional on the student scores $(r_v)$, cf. figure 1.

Item difficulties on the PISA $\theta$-scale:

 item             $\sigma_i$(2000)  $\sigma_i$(2003)  difference     percent correct
                                                      (PISA points)   2000    2003
  1  R055Q01_          1.27           1.23              -3.6
  2  R055Q02_         -0.66          -0.79             -11.7
  3  R055Q03_         -0.08          -0.21             -11.7
  4  R055Q05_          0.44           0.55               9.9
  5  R067Q01_          0.58           1.97             125.1          0.64    0.88
  6  R067Q04_         -0.29           0.88             105.3          0.43    0.71
  7  R067Q05_         -0.47           1.15             145.8          0.38    0.76
  8  R102Q05_         -0.86          -1.18             -28.8
  9  R102Q07_          1.73           1.41             -28.8
 10  R102Q04A_        -1.34          -2.01             -60.3
 11  R104Q01_          0.41           0.10             -27.9
 12  R104Q02_         -0.31          -0.63             -28.8
 13  R104Q05_         -0.40          -0.72             -28.8
 14  R111Q01_         -0.99          -1.08              -8.1
 15  R111Q02B_         0.04          -0.05              -8.1
 16  R111Q06B_         1.51           1.66              13.5
 17  R219Q02_          0.28           0.44              14.4
 18  R219Q01E_         0.08           0.20              10.8
 19  R220Q01_         -0.32          -0.82             -45.0          0.42    0.31
 20  R220Q04_         -0.05          -0.60             -49.5          0.49    0.35
 21  R220Q05_          0.83           0.33             -45.0          0.70    0.58
 22  R220Q06_         -1.40          -1.82             -37.8          0.20    0.14

Table 1: Rasch Model estimates of item difficulties $\sigma_i$ for the two years of testing, 2000 and 2003, and $\theta$-scale adjustments for unequal item difficulties.

Several test statistics can be applied for testing the hypothesis that the item difficulties are equal across the years 2000 and 2003, both multivariate conditional tests (Andersen, 1973) and exact tests item by item (Allerup, 1997). The results all clearly reject the hypothesis, and, consequently, the items are inhomogeneous across the years of testing 2000 and 2003.

A visual impression of how the two PISA scales are composed by item difficulties, as marks on two ‘rulers’, is displayed in figure 2. Items connected by vertical lines tend to be homogeneous, while oblique connecting lines indicate inhomogeneous items.

The last column in table 1 lists the consequences, at the individual student level, of the estimated item inhomogeneity, transformed to quantities measured on the ordinary PISA student scale, i.e. the $\theta$-scale internationally calibrated to mean value = 500 with standard deviation = 100. As an example, the item R055Q01 changed its estimated difficulty from 1.27 in 2000 to 1.23 in 2003, a small decrease in relative difficulty of -3.6 PISA points. For an average student, i.e. with PISA ability $\theta_v = 0.00$, this means that the chance of responding correctly to this item has changed from 0.78 to 0.77, a small 1 % drop; this can be calculated from the Rasch Model. For an above-average student with $\theta_v = 2.00$ the change will be from 0.963 to 0.962, a very minor change of magnitude 1 per mille. Table 1 shows how the consequences amount to considerable PISA points for some items, especially the items of units R067 and R220. These items are the ones which distinguish themselves in figure 2 by non-vertical lines. The marginal percent correct, which is based on all booklets and students, is included in table 1 in order to give a well-known interpretation of the change from 2000 to 2003. It is tacitly assumed that the PISA items are accepted under tests of reliability.
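Incidentally, the printed $\theta$-scale differences can be reproduced, up to rounding, by converting the logit difference $\sigma_i(2003) - \sigma_i(2000)$ to PISA points with a fixed conversion factor of about 90 points per logit (an observation inferred here from the printed values, not a figure stated in the text):

    # hypothetical reconstruction of the last column of Table 1;
    # the factor 90 is inferred from the printed values, not quoted from the study
    LOGIT_TO_PISA = 90.0

    items = {  # item: (sigma_2000, sigma_2003)
        "R055Q01": (1.27, 1.23),
        "R067Q01": (0.58, 1.97),
        "R220Q04": (-0.05, -0.60),
    }
    for name, (s00, s03) in items.items():
        print(name, round((s03 - s00) * LOGIT_TO_PISA, 1))
    # prints: R055Q01 -3.6, R067Q01 125.1, R220Q04 -49.5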

Figure 2: Estimated item difficulties $\sigma_i$(2000) and $\sigma_i$(2003) for PISA 2000 (lower line) and PISA 2003 (upper line). Estimates based on booklet 4 (PISA 2000) and booklet 10 (PISA 2003) using data from all countries. [Figure not reproduced.]

The last column in table 1 indicates the advantage (difference > 0) or disadvantage (difference < 0) for the students tested in 2003, the interpretation being that 2003 students are given ‘free’ PISA points as a result of the fact that the (relative) item difficulty dropped between the years 2000 and 2003, and that this ‘advantage’ can be quantified in terms of ‘compensations’ on the $\theta$-scale, shown in the last column, which displays how much a student must change the PISA score in order to compensate for the change of difficulty of the item. This way of thinking is much like the thinking behind the construction of the so-called item maps, which visualize both the distribution of item difficulties and the student abilities, anchored in predefined probabilities of a correct response.

Table 1 pictures item inhomogeneities, item by item, in reading; some items turn out to be (relatively) more difficult between 2000 and 2003, while others become easier between the two years. A comprehensive picture involving all single-item ‘movements’ and all students is more complicated to establish⁵. The technique used in this case is to study the gap between expected score levels caused by the two item sets of (inhomogeneous) difficulties. By this, it can be shown that the general effect is approximately 11 PISA points. In other words, the average PISA 2003 student experiences a ‘loss’ of approximately 11 PISA points, purely due to psychometric scale inhomogeneities. The official drop between 2000 and 2003 was for Denmark 497 → 492, i.e. a drop of 5 points. In the light of scale-induced changes of magnitude minus 11 points, could this be switching a disappointing conclusion to the contrary?

5 Analyze the expected score E(aᵥᵢ | θ, σ₁, ..., σₖ) as a function of θ, with conditional ML estimates of σ₁, ..., σₖ inserted.
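A minimal sketch of this gap technique, under the same assumptions as above (logistic Rasch form, roughly 90 PISA points per logit): compute the expected raw score as a function of θ under each year’s item parameters and find the θ-shift that equalizes them. The parameter vectors below are items 8-13 from table 1 only; the published 11-point figure rests on the full item set and the official scaling, so this sketch illustrates the method rather than reproducing the number.

    import numpy as np
    from scipy.optimize import brentq

    def expected_score(theta, sigmas):
        # Expected raw score: sum of Rasch probabilities over the item set.
        s = np.asarray(sigmas, dtype=float)
        return float(np.sum(np.exp(theta + s) / (1.0 + np.exp(theta + s))))

    def theta_gap(theta, sig_2000, sig_2003):
        # Find the theta that gives, under the 2003 parameters, the same expected
        # score that `theta` gives under the 2000 parameters; return the shift.
        target = expected_score(theta, sig_2000)
        t_2003 = brentq(lambda t: expected_score(t, sig_2003) - target, -8.0, 8.0)
        return t_2003 - theta

    sig_2000 = [-0.86, 1.73, -1.34, 0.41, -0.31, -0.40]  # table 1, items 8-13, year 2000
    sig_2003 = [-1.18, 1.41, -2.01, 0.10, -0.63, -0.72]  # table 1, items 8-13, year 2003
    print(theta_gap(0.0, sig_2000, sig_2003) * 90)       # gap expressed in PISA points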

Scaling PISA 2003 in reading – gender and ethnicity

Whenever an analysis of item homogeneity is executed by using an external variable to define sub-groups, it is tacitly assumed that the Rasch model works within each group⁶, i.e. that the items are homogeneous within each group.

No.  Item       σᵢ (girls)  σᵢ (boys)  θ-scale diff.  σᵢ (DK)  σᵢ (ejDK)  θ-scale diff.
1    R055Q01_      1.18        1.35        15.16         1.25      2.05        72.77
2    R055Q02_     -0.71       -0.70         1.56        -1.13     -1.59       -41.76
3    R055Q03_     -0.23       -0.02        18.80        -0.53     -0.83       -26.60
4    R055Q05_      0.58        0.43       -13.32         0.20      0.28         7.38
5    R067Q01_      1.04        1.11         5.96         2.38      2.05       -29.43
6    R067Q04_      0.25        0.15        -8.83         0.08      0.01        -6.63
7    R067Q05_      0.36        0.01       -30.83         0.25      0.42        16.06
8    R102Q05_     -1.07       -0.91        14.44        -0.42     -0.24        15.51
9    R102Q07_      1.55        1.66        10.31         2.01      0.91       -99.20
10   R102Q04A     -1.75       -1.51        21.34        -1.68     -1.82       -12.02
11   R104Q01_      0.38        0.20       -15.90         0.25     -0.12       -32.89
12   R104Q02_     -0.22       -0.68       -40.96        -0.75     -0.60        13.82
13   R104Q05_     -0.39       -0.69       -26.30        -0.93     -1.70       -69.25
14   R111Q01_     -1.28       -0.74        48.33        -1.02     -1.05        -2.26
15   R111Q02B      0.04       -0.01        -4.85        -0.77     -0.37        36.59
16   R111Q06B      1.59        1.57        -1.86         1.37      2.05        61.47
17   R219Q02_      0.32        0.40         7.05         0.37      1.29        82.81
18   R219Q01E      0.05        0.25        17.95         0.45      1.09        57.84
19   R220Q01_     -0.62       -0.45        15.31        -0.09     -0.37       -24.59
20   R220Q04_     -0.16       -0.43       -24.37        -0.97     -1.27       -26.68
21   R220Q05_      0.64        0.60        -3.69         0.47      0.74        23.62
22   R220Q06_     -1.55       -1.61        -5.28        -0.75     -0.94       -16.57

Table 2: Rasch Model estimates of item difficulties σᵢ for girls and boys (international student responses) and for Danish (DK) and non-Danish, ethnic students (ejDK) (Danish student responses) for PISA 2003; θ-scale adjustments for unequal item difficulties. All items from booklet No. 10.

6 By nature, the likelihood ratio test statistic (Andersen, 1973) for item homogeneity across groups has as a prerequisite that item parameters exist within each group, i.e. that the Rasch model fits within each group.



Within the PISA 2003 data, a repetition of the statistical tests for homogeneity presented in the previous section for 2000–2003 has been undertaken across gender and ethnicity. While the international data was used for the gender analysis, only data from Denmark has been used for the ethnic grouping. This leads to table 2.

The numerical indications in table 2 regarding the degree of inhomogeneity can be illustrated in the same fashion as in figure 2, here presented as figure 3. Perfect homogeneity across the two external criteria, gender and ethnicity, would show up as perfectly vertical lines in the figures.

Figure 3: Estimated item difficulties for 22 reading items, Danish students (lower line) and non-Danish students (upper line), left part; Danish PISA 2003 data. Estimated item difficulties for 22 reading items, girls (lower line) and boys (upper line), right part; international PISA 2003 data.

Although the impression from the figures in table 2 and the graphs in figure 3 is that ethnicity creates the largest degree of inhomogeneity, the contrary is, in fact, the truth. The explanation is that the statistical tests for homogeneity across ethnicity are based on the Danish PISA 2003 set, booklet No. 10, consisting of only 325 valid student responses, providing little power behind the tests. Again, both simultaneous multivariate conditional tests (Andersen, 1973) and exact tests, item-by-item (Allerup, 1997), have been applied. While the test statistics strongly reject the homogeneity hypothesis across gender, only weaker signs of inhomogeneity are indicated across ethnicity.

Reading the crude deviations from table 2 points, for example, to items R104 and R067 favouring girls, and R111 and R102 favouring boys. Likewise, items R102 and R104 constitute challenges which favour Danish students, while items R219 and R055 seem to favour ethnic students. Details behind these suggestions of inhomogeneity, e.g. assessing didactic interpretations for these deviations, can be evaluated through a closer look at the relation between the observed and expected number of responses in specific score-groups.⁷

If the displayed inhomogeneities in table 2 are accumulated in the same way as in the PISA 2000 vs. 2003 analysis, it can be shown that poorly performing girls get a scale-specific advantage of magnitude 8-10 PISA points, which is reduced to approximately 1-2 points for high performing girls. A similar accumulation for the analysis across ethnicity shows that low performing Danish students (around 30 % correct responses) get a scale-specific advantage of approximately 12 PISA points, while very low or very high performing students do not get any ‘free’ scale points because of inhomogeneity.

Scaling PISA 2000 in reading – ethnicity

PISA 2000 data offers an excellent opportunity to study what happens if the reading of Danish students is compared with that of the ethnic students in Denmark. Before any didactic explanations can be discussed, a first approach to recognizing possible inhomogeneity is achieved by comparing the relative item difficulties for the two groups. As said, both the ordinary PISA 2000 study and the two special studies (PISA Ethnic and PISA Copenhagen) have been run on the PISA 2000 instruments, bringing the total number of student responses to approximately 10,000, 17 % of which come from ethnic students.

Item       αᵢ    cat  σᵢ (DK)  σᵢ (ejDK)  θ-scale diff.  Booklet
R055Q01    1.21   1     1.30      1.13       -15.39         2
R055Q03    1.17   1    -0.59     -1.22       -56.88         2
R061Q01    0.91   0    -0.37     -0.17        17.59         6
R076Q03    0.86   1     0.20      0.78        52.69         4
R076Q04    0.80   1    -0.67      0.22        79.94         4
R076Q05    1.08   0    -0.87     -0.63        21.96         4
R076Q05    1.15   0     0.02      0.12         8.73         5
R077Q04    0.72   1     0.67      0.77         9.32         8
R081Q05    1.00   0     0.23      0.34         9.53         1
R083Q06    1.14   0    -0.93     -0.66        24.48         5
R086Q05    1.64   1     1.93      1.36       -51.97         1
R086Q05    1.54   1     1.89      1.05       -75.48         3
R086Q05    1.21   1     2.31      1.65       -58.93         4
R091Q06    0.72   1     1.50      1.51         1.11         3
R100Q06    1.35   1     1.35      0.85       -44.77         3
R100Q06    1.31   1     1.56      0.49       -96.48         6
R101Q02    1.36   1     1.53      0.87       -58.99         5
R104Q01    1.06   1     0.64      0.90        23.86         5
R104Q01    1.46   0     2.64      2.11       -47.50         6
R110Q06    0.98   0     1.03      1.11         7.42         7
R111Q06B   1.35   1    -0.45     -1.41       -86.05         4
R119Q06    0.70   1     1.21      1.18        -2.15         3
R120Q01    1.20   1     0.63      0.67         2.99         4
R120Q01    1.47   1     0.62      0.15       -42.32         6
R120Q07T   1.32   1     0.69      0.19       -44.98         4
R219Q02    0.75   1     0.56      0.96        35.43         1
R220Q02B   1.17   1     0.49     -0.06       -49.78         4
R220Q06    0.87   0     1.20      0.96       -21.55         7
R227Q04    1.53   1    -0.46     -0.88       -37.47         3
R234Q01    1.16   0     1.40      1.38        -1.52         1
R234Q02    1.24   1    -2.16     -2.04        10.88         1
R234Q02    0.95   1    -2.04     -1.74        26.81         2
R241Q02    0.81   1    -0.70     -0.34        32.56         2

Table 3: Rasch Model estimates of significantly inhomogeneous items across ethnicity; item difficulties σᵢ for Danish (DK) and non-Danish (ejDK), ethnic students (N = 10,063 student responses); θ-scale adjustments for unequal item difficulties under the simple Rasch Model. Cat = 1 indicates significant item discrimination (αᵢ ≠ 1.00).

7 Compare aᵢ(r) – the observed number of correct responses to item No. i in score group r – with nᵣ πᵢ(r) – the expected number, where nᵣ is the number of students in score group r and πᵢ(r) is the conditional probability of a correct response to item No. i in score group r (depending on σᵢ and the so-called symmetric functions of σ₁, ..., σₖ only).
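The observed-versus-expected check described in footnote 7 can be sketched as follows. Under the Rasch model, the conditional probability πᵢ(r) of a correct response to item i in score group r depends only on the item parameters, through the elementary symmetric functions; this is a standard conditional-maximum-likelihood identity, written here in the multiplicative parameterization εᵢ = exp(σᵢ) as an illustration rather than the official PISA computation.

    import numpy as np

    def elementary_symmetric(eps):
        # gamma[r] = elementary symmetric polynomial of order r in eps.
        gamma = np.zeros(len(eps) + 1)
        gamma[0] = 1.0
        for e in eps:
            gamma[1:] = gamma[1:] + e * gamma[:-1]
        return gamma

    def pi_ir(sigmas, i, r):
        # Conditional probability of a correct response to item i given raw
        # score r (valid for 1 <= r <= number of items); it depends only on
        # the item parameters, not on the distribution of student abilities.
        eps = np.exp(np.asarray(sigmas, dtype=float))
        g_all = elementary_symmetric(eps)
        g_without_i = elementary_symmetric(np.delete(eps, i))
        return eps[i] * g_without_i[r - 1] / g_all[r]

    # Footnote 7's check: with n_r students in score group r, compare the
    # observed count a_i(r) with the expected count n_r * pi_ir(sigmas, i, r).
    print(pi_ir([1.30, -0.59, -0.37, 0.20], i=1, r=2))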

In this section, analyses are based on 140 reading items from all nine booklets, each containing around 40 items, organized with overlap in a rotation system across the booklets. This brings about 1,100 student responses per booklet.

Using these PISA 2000 data, the statistical tests for homogeneity across the two student groups defined by ethnicity (DK and ejDK) may once more be applied. Both multivariate conditional (Andersen, 1973) and exact item-by-item tests (Allerup, 1997) were applied. The results clearly reject the hypothesis of homogeneity and, consequently, the items are inhomogeneous across the two ethnic student groups.

Because of the amount of data available, statistical tests for proper Rasch model item discriminations (Allerup, 1994) have also been included; if significant, i.e. if the hypothesis αᵢ = 1.00 must be rejected, this can be taken as an indication of the validity of the so-called two-parameter Rasch model (Lord and Novick, 1968)⁸. Other, more orthodox views would claim that basic properties behind ‘objective comparisons’ are then violated because of intersecting ICC curves (Item Characteristic Curves). Hence, this would be taken as just another sign of item inhomogeneity. Table 3 lists all items (among the 140 items in total) found to be inhomogeneous in the predefined setting with unequal item difficulties only. Items with significant item discriminations are then marked with cat = 1. Some items appear twice because of the rotation, which allows items to be used in several different booklets.

The combination of high item discrimination and the existence of two slightly different θ-groups, which are compared on the general average level, can cause serious effects. Since it is expected that the ethnic student group generally performs lower than the Danish group, it could be that one item with high item discrimination acts like a ‘separator’, in the item-map sense, between the two θ-groups. This situation will artificially decrease the probability of a correct response from students in the lower latent θ-group while, at the opposite end, students from the upper θ-group will artificially enjoy enhanced probabilities of responding correctly. In this way the phenomenon of high item discrimination tends to punish the poor students and disproportionately reward the high performing students.
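A small numerical illustration of this ‘separator’ effect, using the two-parameter form quoted in footnote 8 below; the group locations and discriminations are made up for the illustration.

    import math

    def p_2pl(theta, sigma, alpha):
        # Two-parameter model: the discrimination alpha scales the
        # distance between student ability and item location.
        return 1.0 / (1.0 + math.exp(-alpha * (theta + sigma)))

    # An item located midway between a low group (theta = -1) and a
    # high group (theta = +1):
    for alpha in (1.0, 2.5):
        print(alpha, round(p_2pl(-1.0, 0.0, alpha), 2), round(p_2pl(1.0, 0.0, alpha), 2))
    # alpha = 1.0 gives 0.27 vs. 0.73; alpha = 2.5 gives 0.08 vs. 0.92:
    # higher discrimination depresses the low group and lifts the high group.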

From table 3 it can be read that, for example, item R055Q03 is (relatively) more difficult for the ethnic students compared with the Danish students. In terms of compensation on the PISA θ-scale, this means that an ethnic student experiences a PISA scale induced loss of 56.88 points. In other words, an ethnic student must be 56.88 scale points ahead of his Danish classmate if they are to have equal chances of responding correctly to the item. A Danish student with a PISA score equal to, say, 475 has the same probability of a correct response as an ethnic student with PISA score 475 + 56.88 = 531.88.

It is an interesting feature of table 3 that more than 60 % of these ethnic-significant items are administered in a multiple choice format (i.e. closed response categories, MC), while only 19 % belong to this category in the full PISA 2000 set-up. This is surprising, because an open response would be expected to call for deeper insight into linguistic details about the formulation of the reading problem, compared to just ticking a predefined box in the MC format.

The item R076Q04 is an MC item under the caption “retrieving information”, where the students examine the flying schedule of Iran Air. This item is solved far better by the ethnic students than by the Danish students, because the item doesn’t really contain complicated text at all, just numbers and figures listed in schematic form. Contrary to this example, item R100Q06 (MC) contains long and twisted Danish text, and the caption for the item is “interpreting”, which aims at ‘reading between the lines’; only if the interpretation is correct is the complete response considered to be correct.

8 The two-parameter model with item discrimination αᵢ is P(aᵥᵢ = 1) = exp(αᵢ(σᵢ + θᵥ)) / (1 + exp(αᵢ(σᵢ + θᵥ))).

In this example from reading, the accumulated effect of the individual item inhomogeneities is evaluated using a different technique from the previous sections. In fact, the more traditional step-by-step method is now applied, in which inhomogeneous items are removed before re-estimation of the PISA score takes place. The gap between Danish and ethnic students can then be studied before and after removal of inhomogeneous items.

From the joint data of PISA 2000, PISA Copenhagen and PISA Ethnic one gets the crude differences in average PISA θ-score between Danish and non-Danish students.

The crude average difference amounts to 90.54 PISA points. Since the items are spread over nine booklets, it is of interest to judge the accumulated effect for each booklet. At the same time, this is an opportunity to check one of the implications of the three equivalent characterizations of the Rasch model⁹, viz. that one should get almost the same picture irrespective of which booklet is investigated.

9 Student abilities θ₁, ..., θₙ can be calculated with the same result irrespective of which subset of items is used.

Table 4: Average θ-score differences between Danish and non-Danish students, by booklet and in total, calculated under two scenarios: (1) all items and (2) homogeneous items only, i.e. items with σᵢ(DK) ≈ σᵢ(ejDK). [Booklet-level θ-score values not preserved.]



Item       αᵢ    cat  σᵢ (2000)  σᵢ (2003)  θ-scale diff.  Booklet
M155Q04T   0.86   0     -0.37       0.27        58.11
S114Q03T   1.74   1      0.57       0.63         6.18        2, 8
S114Q04T   1.63   1      0.28       0.30         1.68
S114Q05T   1.11   0     -1.21      -1.54       -29.37
S128Q01_   0.94   0      0.53       0.70        14.91
S128Q02_   0.78   1     -0.16      -0.31       -14.08
S128Q03T   0.80   1      0.35       0.25        -8.60
S131Q02T   1.43   1      0.00      -0.20       -18.26
S131Q04T   1.60   1     -1.62      -1.54         7.62
S133Q01_   0.91   0      0.60       0.95        31.93
S133Q03_   0.51   1     -0.61      -0.70        -8.20
S133Q04T   0.56   1      0.23       0.18        -5.04
S213Q02_   0.88   0      1.21       1.19        -1.61
S213Q01T   1.21   1     -0.17       0.09        22.84

Table 5: Rasch Model estimates of item difficulties σᵢ (2000) and σᵢ (2003) for math and science items shared by PISA 2000 and PISA 2003 in four booklets; θ-scale adjustments for unequal item difficulties under the simple Rasch Model. Cat = 1 indicates significant item discrimination (αᵢ ≠ 1.00).

The test statistics applied earlier are again brought into operation, testing the hypothesis that the item difficulties for the years 2000 and 2003 are equal. In fact, both multivariate conditional (Andersen, 1973) and exact item-by-item tests (Allerup, 1997) were used. The results of the estimation are presented in table 5, together with an evaluation of the item discriminations αᵢ.

The results for mathematics show that the hypothesis must be rejected and, consequently, the math items presented in table 5 are inhomogeneous across the testing years 2000 and 2003. Item M155Q04T is an item which, systematically for all score levels, seems to have become easier between 2000 and 2003; in more familiar terms, a rise is seen from 64 % correct responses to 75 %, calculated for all students.

The results for science seem to be in accordance with the expectations behind the PISA scaling. In fact, the multivariate conditional and the exact tests for single items do not reject the hypothesis of equal item difficulties across the test years 2000 and 2003.

Since only a very few item groups from four booklets have been investigated, no attempt at calculating accumulated effects for larger groups of students and items will be made.



Scaling PISA 2000 and 2003 in mathematics

In view of the fact that the tests for homogeneity across 2000 and 2003 failed in mathematics, it could be of interest to investigate scale properties within each of the two years. Using booklet No. 5 (same booklet number in 2000 and 2003), around 400 student responses are available for an analysis of homogeneity across gender. Table 6 displays the estimates of item difficulties for the seven math items shared in 2000 and 2003 in booklet No. 5, together with the estimated item discriminations and an evaluation of the item discrimination αᵢ in relation to the Rasch model requirement αᵢ = 1.00.

Item       αᵢ    cat  σᵢ (girls)  σᵢ (boys)  θ-scale diff.  Booklet
2000:
M150Q01_   0.86   0      0.36        0.57       -18.88         5
M150Q02T   1.23   0      3.24        2.67        51.03
M150Q03T   1.00   0     -0.54       -0.71        14.79
M155Q01_   0.91   0      0.06       -0.38        39.20
M155Q02T   1.20   0      0.36        0.63       -24.46
M155Q03T   1.61   0     -3.06       -2.47       -53.63
M155Q04T   0.85   0     -0.42       -0.33        -8.04
2003:
M150Q01_   0.75   0     -0.27        0.63       -81.63         5
M150Q02T   0.95   0      2.40        3.38       -88.71
M150Q03T   0.99   0     -0.52       -1.09        51.06
M155Q01_   1.26   0      0.12       -0.15        24.37
M155Q02T   1.09   0      0.42       -0.07        43.28
M155Q03T   1.45   0     -2.41       -2.99        51.92
M155Q04T   0.85   0      0.27        0.27        -0.29

Table 6: Rasch Model estimates of item difficulties σᵢ (girls) and σᵢ (boys) for math items in PISA 2000 and PISA 2003, using two booklets; θ-scale adjustments for unequal item difficulties under the simple Rasch Model. Cat = 1 indicates significant item discrimination (αᵢ ≠ 1.00).

The statistical methods for testing the hypothesis of equal difficulties for girls and boys are brought into operation again. Both multivariate conditional (Andersen, 1973) and exact item-by-item tests (Allerup, 1997) were used.

Behind the estimates presented in table 6 lies the information that the gender-specific homogeneity hypothesis must clearly be rejected in the data from PISA 2003, while the picture is less distinct for PISA 2000 (significance probability p = 0.08 for the simultaneous test). Consequently, in PISA 2003 the seven items presented in table 6 are inhomogeneous across gender. In particular, item No. 2, M155Q02T, is one item which changes position from favouring the girls in PISA 2000 (98 % correct for girls vs. 96 % correct for boys) to the contrasting role of favouring the boys in PISA 2003 (96 % correct for girls vs. 97 % correct for boys). In terms of log-odds ratio, this is a change from 1.14 as relative ‘distance’ between girls and boys in PISA 2000 to -0.29 in PISA 2003. In PISA 2003 the items M150Q01, M150Q03T and M155Q03T attract attention, also because of the large transformed consequences on the θ-scale. However, the only item showing significant gender bias according to exact tests for single items is M150Q01.
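For reference, the log-odds ratio quoted here is simply the log of the ratio between the two groups’ odds of a correct response; a minimal sketch follows. Note that the rounded percentages above reproduce the 2003 value of -0.29, while the published 2000 value of 1.14 would require the unrounded response frequencies.

    import math

    def log_odds_ratio(p_girls, p_boys):
        # Log of the odds ratio between girls' and boys' proportions correct.
        return math.log((p_girls / (1.0 - p_girls)) / (p_boys / (1.0 - p_boys)))

    print(log_odds_ratio(0.96, 0.97))  # PISA 2003: approx. -0.29
    print(log_odds_ratio(0.98, 0.96))  # rounded PISA 2000 values give approx. 0.71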

As stated in the three equivalent characterizations of item homogeneity, rejecting the hypothesis of homogeneous items means that information about the students’ ability to solve the tasks is not accessible through the raw scores, i.e. the total number of correct responses across items. That the student raw score is not a sufficient statistic for the ability θᵥ, or that the PISA scale score does not measure the students’ competencies in solving the items, are two other ways of describing the situation under the caption ‘inhomogeneous items’. On the other hand, this does not prevent the PISA analyst from obtaining another kind of information from the responses with respect to comparing students by means of the PISA items.

With regard to the two items M150Q02 and M150Q03 above, it has been demonstrated (Allerup et al., 2005) how information from these two open-ended¹⁰ items can be handled as profiles. By this, all combinations of responses to the two items are considered, and the analysis of group differences takes place using these profiles as ‘units’ for the analyses. In principle, every combination of responses from the items entering such profiles must be labelled prior to the analysis in order to be able to interpret differences found by way of the profiles. If the number of items exceeds, say, ten, with two response levels on each item, this would in turn require approximately 1,000 different labels! In general this is far too many profiles to be able to assign different interpretations, and the profile method is, consequently, not suited for analyses built on a large number of items.
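The combinatorial point is easy to verify; a sketch of the profile construction for dichotomous items:

    from itertools import product

    # Profiles over k dichotomous items: every combination of responses is one
    # analysis 'unit' that must be labelled in advance.
    k = 2
    profiles = list(product((0, 1), repeat=k))
    print(profiles)   # [(0, 0), (0, 1), (1, 0), (1, 1)]
    print(2 ** 10)    # 1024: already about a thousand labels for ten items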

10 An item which requires a written answer, not a multiple choice item. The response is later rated and scored correct or non-correct.

One consequence of accepting an item as part of a scale for further analyses, in spite of the fact that the item was found to be inhomogeneous across gender, can be illustrated by the reports from the international TIMSS study from 1995 (Beaton et al., 1996), operated¹¹ by the IEA. In this study a general difference in mathematics performance between girls and boys was found, showing that in practically all participating countries boys performed better than girls. Although this conclusion contrasted greatly with experiences obtained nationally in many countries, the TIMSS result was generally accepted as fact. The TIMSS study was at that time designed with rotated booklets, as in PISA, but without using item blocks. Instead, a fixed set of six math items and six science items was part of every booklet, as a fixed reference for bridging between the booklets.

Unfortunately, it turned out that one of the six math reference items¹² was strongly inhomogeneous (Allerup, 2002). The girls were actually ‘punished’ by this item: even very highly performing female students, rated on the basis of responses to other items, responded incorrectly to this particular item. This could be confirmed by analysing data from all participating countries, providing high statistical power for the tests of homogeneity.

Scaling PISA 2000 – ‘not reached’ items in reading

‘Not reached’ items are the same as ‘not attempted’ items, and constitute a special kind of item which deserves attention in studies like PISA. They are usually found at the end of a booklet, because the students read the booklet from page 1 and try solving the tasks in the order they appear. In the international versions of the final data base, the ‘not reached’ items are marked by a special missing-symbol to distinguish them from omitted items, which are items where neighbouring items to the right have obviously been attempted.

It is ordinary testing practice to present the student with several tasks which are in turn properly adjusted to the complete testing time, e.g. two lessons in the case of PISA. This is a widespread practice, with exceptions seen in Nordic testing traditions. Many tests are thereby constructed in such a way as to make it possible to judge two separate aspects: proficiency and speed. In reading, it is considered crucial for relevant teaching that the teacher gets information about the students’ proficiency both in terms of ‘correctness’ and in terms of reading speed. In order for the last factor to be measurable, one usually needs a test which discriminates between students with respect to being able to reach all items, viz. one whose length exceeds the capacity of some students while being easy to get through for others.

11 IEA, The International Association for the Evaluation of Educational Achievement.

12 A math item aiming at testing the students’ knowledge of proportionality, but presented in a linguistic form which was misunderstood by the girls.

While everybody seems to agree on the statistical treatment of omitted items (they are simply scored as ‘non-correct’), there have been discussions as to how to treat ‘not reached’ items. These take place from two distinct points of view: one dealing with scaling problems, and one dealing with the problem of assigning justifiable PISA scores to the students.

One of the virtues of linking scale properties to the analysis of Rasch homogeneity is found in the second characterization of item homogeneity above, viz. that “the student abilities θ₁, ..., θₙ can be calculated with the same result, irrespective of which subset of items is used”. This strong requirement, which in PISA ensures that responses from different booklets can be compared irrespective of which items are included, in principle also paves the road for non-problematic comparisons between students who have completed all items and students who have not completed all items in a booklet. At any rate, seen from a technical point of view, the existence of ‘not reached’ items does therefore not pose a problem for the estimation of the student scores θ, because the quoted fundamental property of homogeneity has been tested in a pilot study prior to the main study, and all items included in the main study are consequently expected to enjoy this property. In the IEA reading literacy study (Elley, 1992), the discussion about which student Rasch θ-score to choose – the one based on the ‘attempted items’, considering ‘not reached’ items as ‘non-existing’, or the one considering ‘not reached’ items as ‘non-correct’ responses – was never solved, and both estimates were published. In subsequent IEA studies and in the PISA cycles to date, the ‘not reached’ items have been considered ‘non-correct’.

The second problem mentioned is the influence the ‘not reached’ items have on the statistical tests for homogeneity, an analytical phase which is undertaken prior to the estimation of the student abilities θ₁, ..., θₙ. The immediate question here is whether different management of the ‘not reached’ item responses could lead to different results as to the acceptance of the homogeneity hypothesis. The immediate answer is that it does matter how ‘not reached’ item responses are scored, as ‘not attempted’ or as ‘non-correct’. The technical details will, however, not be discussed here, but one important point is the type of estimation technique applied for the item parameters σ₁, ..., σₖ¹³.

13 Marginal estimation, with or without a prior distribution on the student scores θ₁, ..., θₙ, or conditional maximum likelihood estimation. A popular technique for estimation and testing of homogeneity proceeds by successive extension of the data, increasing the number of items, using only complete response data with no ‘not reached’ responses in each step.



           PISA study
Booklet    2000    Cop    Ethnic
1          0.02    0.02   0.01
2          0.00    0.00   0.00
3          0.01    0.01   0.00
4          0.00    0.00   0.00
5          0.00    0.00   0.01
6          0.00    0.00   0.01
7          0.01    0.02   0.03
8          0.02    0.02   0.05
9          0.05    0.07   0.17

Table 7: Frequency of ‘not reached’ items in three studies using the PISA 2000 instruments: ordinary PISA 2000, the Copenhagen study (Cop) and the Ethnic special study.

In PISA 2000, with reading as the main theme, the ‘not reached’ problem was not a significant issue. Table 7 displays the frequency of ‘not reached’ items in the main study PISA 2000. It can be read from the table that the level of ‘not reached’ varies greatly across booklets, with a maximum amounting to 5 % for booklet No. 9. Looking at the Copenhagen study and the special Ethnic study, it is, however, clear that the ‘not reached’ problem is probably most critical for the students having an ethnic minority background. In fact, using all N = 10,063 observations in the combined data from table 7, it can be shown that the average frequency of ‘not reached’ is 1.6 % for Danish students and 4.3 % for ethnic minority students. For the ethnic minority group it can furthermore be shown that the frequency of ‘not reached’ reaches a maximum of 17 % in booklet No. 9.

Before conclusions are drawn as to the evaluation of group differences in terms of different PISA θ-values, the relation between PISA θ-values and the frequency of ‘not reached’ can be shown. Using log-odds as a measure of the level of ‘not reached’, a distinct linear relationship can be detected in figure 4. As anticipated, the relation indicates a negative correlation. For the summary of conclusions as to viewing the effects of inhomogeneity and other sources influencing the θ-scaling, it is clear from figure 4 that the statistical administration of this variable can be modelled in a simple linear manner.
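A sketch of this simple linear treatment, with made-up numbers standing in for the booklet No. 9 data shown in figure 4:

    import numpy as np

    def log_odds(freq):
        # Log-odds transform of a 'not reached' frequency, as used in figure 4.
        return np.log(freq / (1.0 - freq))

    # Hypothetical (theta-score, 'not reached' frequency) pairs:
    theta = np.array([420.0, 470.0, 510.0, 560.0])
    nr_freq = np.array([0.30, 0.17, 0.08, 0.03])

    slope, intercept = np.polyfit(log_odds(nr_freq), theta, 1)
    print(slope, intercept)  # negative slope: more 'not reached', lower theta-score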




Figure 4: Relation between estimated PISA θ-scores and the frequency of ‘not reached’ (log-odds of the frequency) for booklet No. 9 in the combined data set from PISA 2000, PISA Copenhagen and PISA Ethnic.

Conclusions and summary of effects on the scaling of PISA students

It has been essential for the analyses presented above to elucidate the theoretical arguments for the use of Rasch models in the work of calibrating scales for PISA measurements. Although the two latent scales containing item difficulties and student abilities are, mathematically speaking, completely symmetrical, different concepts and different methods are associated with the practical management of the two scales.

The analyses have demonstrated that a certain degree of item inhomogeneity is present in the PISA 2000 and 2003 scales. These effects of inhomogeneity have been transformed into practical, measurable effects on the ordinary PISA ability θ-scale, which holds the internationally reported student results. One conclusion was that on the individual student level these transformed effects amounted to rather large quantities, up to 150 PISA points, although they were often below 100 points. For the standard groupings of PISA students according to gender and ethnicity, the accumulated average effect on group level amounted to around 10 PISA points.

In order to examine the effects of item inhomogeneity in relation to other systematic factors which influence comparisons between groups of students, an illustration will be used from PISA 2000 in reading (see also Allerup, 2006). From the previous analyses, a picture of item inhomogeneity across two systematic factors (gender and ethnicity) was obtained. Together with the factor booklet id and the number of ‘not reached’ items, four factors have thereby already been at work as systematic background for contrasting levels of PISA θ-scores.
-scores.<br />

The illustration aims at setting the effect of inhomogeneity in relation <strong>to</strong><br />

other systematic fac<strong>to</strong>rs when statistical analysis of -scores differences are<br />

investigated. The illustration will be using differences between the two ethnic<br />

groups, carried out as adjusted comparisons with the systematic fac<strong>to</strong>rs as controlling<br />

variables. In order <strong>to</strong> complete a typical <strong>PISA</strong> data analysis, one supplementary<br />

fac<strong>to</strong>r must be included: the socio-economic index (ESCS), aiming<br />

at measuring through a simple index the economical, educational and occupational<br />

level at home for the student 14 . The relation between <strong>PISA</strong> -scores and<br />

the index ESCS is a (weak) linear function and is usually called the ‘law of<br />

negative social heritage’. Together with the linear impression gained in figure<br />

4, an adequate statistical analysis behind the illustration will be an analysis<br />

of <strong>PISA</strong> -scores as dependent variable and (1) number of not reached items,<br />

(2) booklet id, (3) gender and (4) socio economic index ESCS as independent<br />

variables, all implemented in a generalized linear model.<br />
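A minimal sketch of such an adjusted comparison, here with ordinary least squares standing in for the generalized linear model; the data frame and its column names are invented for the illustration and do not reproduce the official PISA variable names.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Invented example data; in the real analysis each row is one student.
    df = pd.DataFrame({
        "pisa_score":  [480.0, 512.0, 455.0, 401.0, 530.0, 388.0, 472.0, 419.0],
        "not_reached": [1, 0, 2, 7, 0, 5, 1, 3],
        "booklet":     [1, 4, 4, 9, 1, 9, 4, 1],
        "gender":      ["girl", "boy", "girl", "boy", "girl", "girl", "boy", "boy"],
        "escs":        [0.3, 0.8, -0.1, -0.9, 1.1, -1.2, 0.2, -0.4],
        "ethnic":      ["DK", "DK", "DK", "ejDK", "DK", "ejDK", "DK", "ejDK"],
    })

    # theta-scores as dependent variable; the adjusted Danish vs. ethnic gap is
    # read off as the coefficient of the 'ethnic' factor.
    model = smf.ols("pisa_score ~ not_reached + C(booklet) + gender + escs + ethnic",
                    data=df).fit()
    print(model.params)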

Two kinds of PISA θ-scores enter the analysis: (1) the reported PISA scores found in the official reports from PISA 2000 (OECD, 2001), PISA Copenhagen (Egelund and Rangvid, 2004) and PISA Ethnic (Egelund and Tranæs, eds., 2007), and (2) ‘Rasch total’, i.e. estimated θ-scores based on a combined data set after removal of inhomogeneous items. By this, the composition of effects on the resulting θ-scale from item inhomogeneity and other systematic factors is illustrated, with an evaluation of their relative significance.

14 The economy is not included as exact income figures but is estimated from information from the student questionnaire.



                                                   Adjusted average difference,
Controlling variables                              Danish vs. ethnic (PISA θ-scores)
                                                   Reported    Rasch total
Not reached, booklet, gender                        56.00       47.48
Not reached, booklet, gender, socio-economy         43.89       26.74
NO adjusting variables                              90.54       80.69

Table 8: Evaluation of differences between Danish and ethnic minority students using the combined data set from PISA 2000, PISA Copenhagen and PISA Ethnic. Differences listed by means of (1) reported PISA scores from the international PISA report and (2) Rasch scores where item inhomogeneity has been removed (Rasch total).

The results of analyzing the gap between Danish and ethnic students are presented in table 8. Under ‘no adjustment’, the officially reported gap of 90.54 PISA points is listed. If inhomogeneous items are removed from the item scale, this group difference is reduced to 80.69 points, i.e. a reduction of around 10 PISA points. The inhomogeneity is therefore responsible for around 10 PISA points. If the variables ‘not reached’, ‘booklet id’ and ‘gender’ are added as systematic factors in the statistical analysis, the controlled gap is 56.00 PISA points, viewed in terms of the official PISA scores, and 47.48 if calculated after removal of inhomogeneous items. After controlling for ESCS, the socio-economic index, it is seen that the reported gap is 43.89 PISA points, while the gap comes down to 26.74 PISA points if it is measured by means of homogeneous reading items only. Ordinary least squares evaluation of the last mentioned, fully controlled difference of 26.74 shows that this difference is not far from being insignificant (p = 0.01). Notice that the part of the difference which can be attributed to the effect of inhomogeneous items varies from around 10 PISA points, constituting around 11 % of the total official interval in the case of crude comparisons without other controlling variables (last line in table 8), to approximately 20 PISA points, constituting around 50 % of the total official interval in the case where inhomogeneity is evaluated after adjusting for the other variables.

What can be seen from this example and the previous discussions and data analyses is that the effect of inhomogeneous items on the official PISA θ-scale can be substantial if the aim of the analysis is to compare either individuals or a few students at a time. The average effect on the official PISA θ-scale in the case of larger student groups depends on the environment in which comparisons are carried out. It seems to have less impact on crude comparisons of (average) PISA abilities with no other variables involved, amounting to around 10 PISA points, while more sophisticated, adjusted comparisons involving controlling variables are more affected by item inhomogeneity.

References

Allerup, P. (1994): “Rasch Measurement, Theory of”. The International Encyclopedia of Education, Vol. 8, Pergamon, 1994.

Allerup, P. (1995): “The IEA Study of Reading Literacy”. In: Owen, P. & Pumfrey, P. (eds.): Children Learning to Read: International Concerns, Vol. 2, p. 186-297, 1995.

Allerup, P. (1997): “Statistical Analysis of Data from the IEA Reading Literacy Study”. In: Applications of Latent Trait and Latent Class Models in the Social Sciences. Waxmann, 1997.

Allerup, P. (2002): “Test Equating Using IRT Models”. Proc. 7th Round Table Conference on Assessment, Canberra, November 2002.

Allerup, P. (2002): “Gender Differences in Mathematics Achievement”. In: Measurement and Multivariate Analysis. Springer Verlag, Tokyo.

Allerup, P. (2005): “PISA præstationer – målinger med skæve målestokke?” Dansk Pædagogisk Tidsskrift, vol. 1, 2005. (In Danish)

Allerup, P., Lindenskov, L. & Weng, P. (2006): “Growing up – The Story Behind Two Items in PISA 2003”. Nordic Light, Nordisk Råd 2006.

Allerup, P. (2006): “PISA 2000’s læseskala – vurdering af psykometriske egenskaber for elever med dansk og ikke-dansk sproglig baggrund”. Rockwool Fondens Forskningsenhed og Syddansk Universitetsforlag, 2006. (In Danish)

Andersen, A. et al. (2001): “Forventninger og færdigheder – danske unge i en international sammenligning”. AKF (Anvendt Kommunal Forskning), DPU (Danmarks Pædagogiske Universitet), SFI (Social Forsknings Instituttet).

Andersen, E. B. (1973): “Conditional Inference and Models for Measuring”. Copenhagen: Mentalhygiejnisk Forlag.

Beaton, A. et al. (1996): “Mathematics Achievement in the Middle School Years. IEA’s Third International Mathematics and Science Study”. Boston College, USA.

Egelund, N. & Tranæs, T. (eds.) (2007): “PISA Etnisk 2005 – kompetencer hos danske og etniske elever i 9. klasser i Danmark 2005”. Rockwool Fondens Forskningsenhed, Syddansk Universitetsforlag.

Elley, W. (1992): “How in the World Do Students Read?” The International Association for the Evaluation of Educational Achievement (IEA), The Hague, 1992.

Fischer, G. & Molenaar, I. (1995): Rasch Models – Foundations, Recent Developments, and Applications. Springer-Verlag, New York.

Lord, F. & Novick, M. (1968): “Statistical Theories of Mental Test Scores”. Addison-Wesley, Massachusetts.

OECD (2001): “Knowledge and Skills for Life – First Results from PISA 2000”. OECD, Paris.

OECD (2004): “Learning for Tomorrow’s World – First Results from PISA 2003”. OECD, Paris.

Rasch, G. (1960): “Probabilistic Models for Some Intelligence and Attainment Tests”. Munksgaard, 1960. Reprinted by Chicago University Press, 1980.

Rasch, G. (1971): “Proof that the Necessary Condition for the Validity of the Multiplicative Dichotomic Model Is Also Sufficient”. Dupl. note, Statistical Institute, Copenhagen (see Allerup, 1994).

Torney-Purta, J., Lehman, R., Oswald, H. & Schulz, W. (2001): “Citizenship and Education in Twenty-eight Countries: Civic Knowledge and Engagement at Age Fourteen”. Amsterdam: IEA, 2001.


PISA and “Real Life Challenges”: Mission Impossible?

Svein Sjøberg

Norway: University of Oslo

Introduction

The PISA project has positive as well as more problematic aspects, and it is important for educators and researchers to engage in critical public debates on this utterly important project, including its uses and misuses.

The PISA project sets the educational agenda internationally as well as within the participating countries. PISA results and advice are often considered objective and value-free scientific truths, while they are, in fact, embedded in the overall political and economic aims and priorities of the OECD. Through media coverage, PISA results create the public perception of the quality of a country’s overall school system. The lack of critical voices from academics as well as from the media gives authority to the images that are presented.

In this article, I will raise critical points from several perspectives. The main point of view is that the PISA ambition of testing “real-life skills and competencies in authentic contexts” is, by definition alone, impossible to achieve. A test is never better than the items that constitute the test. Hence, a critique of PISA should not mainly address the official rationale, ambitions and definitions, but should scrutinize the test items and the realities around the data collection. The secrecy over PISA items makes detailed critique difficult, but I will illustrate the quality of the items with two examples from the released texts.

Finally, I will raise serious questions about the credibility of the results, in particular the rankings. Reliable results assume that the respondents in all countries do their best while sitting the test. I will assert that young learners in different countries and cultures may vary in the way they behave in the PISA test situation. I claim that in many modern societies several students are unwilling to give their best performance if they find the PISA items long, unreadable, unrealistic and boring, in particular if bad test results have no negative consequences for them. I will use the concept of “perceived task value” to argue this important point.

The political importance of PISA

Whether one likes the PISA study or not, one might easily agree about the importance of the project. When the OECD embarks on such a large project, it is certainly not meant as a purely academic research undertaking. PISA is meant to provide results to be used in the shaping of future policies. After 6-7 years of living with PISA, we see that the PISA concepts, ideology, values and, not least, the results and the rankings shape international educational policies and also influence national policies in most of the participating countries. Moreover, the PISA results provide the media and the public with convincing images and perceptions about the quality of the school system, the quality of teachers’ work and the characteristics of both the school population and future citizens.

Contemporary natural science is often labelled Big Science or Technoscience: the projects are multinational, they involve thousands of researchers, and they require heavy funding. Moreover, the scientific values and ethos of such science become different from the traditional ideals of academic science (Ziman, 2000). Prime examples are CERN, the Human Genome Project, the European Space Agency etc. The PISA project has many similarities with such projects, although the scale and the costs are much lower. But the number of people involved is large, and the mere organization of the undertaking requires resources, planning and logistics unusual in the social sciences. According to Prais (2007), the total cost of the PISA and TIMSS testing in 2006 was “probably well over 100 million US dollars for all countries together, plus the time of pupils and teachers directly involved.”

Why is an organization like the OECD embarking on an ambitious task like this? The OECD is an organization for the promotion of economic growth, cooperation and development in countries that are committed to market economies. Their slogan appears on their website: “For a better world economy.”¹

1 These and other quotes in the article are taken from the OECD’s home site http://www.oecd.org/, retrieved Sept 2, 2007.



The OECD and its member countries have not embarked on the PISA project because they have an interest in basic research on education or learning theory. They have decided to invest in PISA because education is crucial for the economy. Governments need information that is supposed to be relevant to their policies and priorities in this economic perspective. Since mass education is expensive, they also most certainly want “value for money” to ensure the efficient running of the educational systems. Stating this is not meant as a critique of PISA. It is, however, meant to state the obvious but still important fact: PISA should be judged in the context of the agenda of the OECD: economic development and competition in a global market economy.

The strong influence that PISA has on national educational policies implies that all educators ought to be interested in PISA, whether they endorse its aims or not. Educators should be able to discuss and use the results with some insight into the methods, underlying assumptions, strengths and weaknesses, possibilities and limitations of the project. We need to know what we might learn from the study, as well as what we cannot learn. Moreover, we need to raise a critical (not necessarily negative!) voice in the public as well as professional debates over uses and misuses of the results.

The influence of PISA: Norway as an example

The attention given to PISA results in national media varies between countries, but in most countries it is formidable. In my country, Norway, the results from PISA 2000 as well as from PISA 2003 produced war-like headlines in most national newspapers.

Our then Minister of Education (2001-2005), Kristin Clemet (representing Høyre, the Conservative party), commented on the PISA 2000 results, released a few months after she had taken office following a Labour government: “Norway is a school loser, now it is well documented. It is like coming home from the Winter Olympics without a gold medal” (which, of course, for Norway would have been a most unthinkable disaster!). She even added: “And this time we cannot even claim that the Finnish participants have been doped!” (Aftenposten, January 2001).

The headlines in all the newspapers told us again and again that “Norway is a loser”. In fact, such headlines were misleading. Norway ended up close to the average among the OECD countries in all test domains in PISA 2000 and PISA 2003. But for some reason Norwegians had expected that we should be on top – as we often are on other indicators and in winter sports. When we are not the winners, we regard ourselves as losers.

Figure 1: PISA results are presented in the media with war-like headlines, shaping public perception of the national school system. Here are PISA results presented in the leading Norwegian newspaper Dagbladet with the heading “Norway is a school loser”.

The results from PISA (and TIMSS as well) have shaped the public image of the quality of our school system, not only for the aspects that have in fact been studied, but for more or less all other aspects of school. It has now become commonly ‘accepted’ that Norwegian schools in general have a very low level of quality, and that Norwegian classrooms are among the noisiest in the world. The media present tabloid-like and oversimplified rankings. It seems that the public as well as politicians have accepted these versions as objective scientific truths about our education system. There has been little public debate, and even the researchers behind the PISA study have done little to qualify the results and remind the public about the limitations of the study. In sum, PISA (as well as TIMSS) has created a public image of the quality of the Norwegian school that is not justified, and that may be seen to be detrimental. I assume that other countries may have similar experiences.

But PISA not only shapes the public image; it also provides a scientific legitimization of school reforms. Under Kristin Clemet as Minister of Education (2001-2005), a series of educational reforms was introduced in Norway. Most of these reforms were legitimized by reference to international testing, mainly to PISA. In 2005, we had a change in government, and Kristin Clemet's Secretary of State, Helge Ole Bergesen, published a book shortly afterwards in which he presented the "inside story" of the reforms made while they were in power. A striking feature of the book is its many references to large-scale achievement studies. He confirms that these studies provided the key arguments and rationale for curricular as well as other school reforms. Under the tabloid heading "The PISA Shock", he confirms the key role of PISA:

With the [publication of the] PISA results, the scene was set for a national battle over knowledge in our schools. [ . . . ] For those of us who had just taken over the political power in the Ministry of Education and Research, the PISA results provided a "flying start". (Bergesen 2006: 41-42. Author's translation)

Other countries may have different stories to tell. Figures 2 and 3 provide examples from the public sphere in Germany. In sum: there is no doubt that PISA has provided – and will continue to provide – results, ideologies, concepts, analyses, advice and recommendations that will shape our future educational debates and reforms, nationally as well as internationally.

PISA: Underlying values and assumptions

It is important to examine the ideas and values that underpin PISA because, like most research studies, PISA is not impartial. It builds on several assumptions, and it carries with it several value judgements. Some of these values are explicit; others are implicit and 'hidden', but nevertheless of great importance. Some value commitments are not very controversial; others may be contested.

Peter Fensham, a key scholar in international science education thinking and research for many decades, has also been heavily involved in several committees of TIMSS and PISA. He has seen all aspects of the projects from the inside over decades. In a recent book chapter, he provides an insider's overview and critique of the values that underlie these projects, drawing attention to their implications:

The design and findings from large-scale international comparisons of science learning do impact on how science education is thought about, is taught and is assessed in the participating countries. The design and the development of the instruments used and the findings that they produce send implicit messages to the curriculum authorities and, through them, to science teachers. It is thus important that these values, at all levels of existence and operations of the projects, be discussed, lest these messages act counterproductively to other sets of values that individual countries try to achieve with their science curricula. (Fensham 2007: 215-216)

Figure 2: The political agenda and the public image of the quality of the entire school system are shaped by the PISA results. This is an example from the German newspaper Die Woche after the release of the PISA2000 results.

Aims and purpose of the OECD

In the public debate as well as among politicians, advice and reports from OECD experts are often considered impartial and objective. The OECD has become an important contributor to the political battle over social, political, economic and other ideas. To a large extent, its experts shape the political landscape, and their reports and advice set the political agenda in national as well as international debates over priorities and concerns. But the OECD is certainly not an impartial group of independent educational researchers. The OECD is built on a neo-liberal political and economic ideology, and its advice should be seen in this perspective. The seemingly scientific and neutral language of expert advice conceals the fact that there are possibilities for other political choices based on different sets of social, cultural and educational values. Figure 4 shows how the OECD presents itself.

Figure 3: PISA has become a well-known concept in public debate: a bookshelf in a German airport offering bestselling books with PISA-like tests for self-assessment, very much like IQ tests.

The overall perspective of the OECD is concerned with the market economy and growth in free world trade. All the policy advice it provides is certainly coloured by such underlying value commitments. Hence, the agenda of the OECD (and PISA) does not necessarily coincide with the concerns of many educators (or other citizens, for that matter). The concerns of PISA are not about 'Bildung' or liberal education, not about solidarity with the poor, not about sustainable development, etc. – but about skills and competencies that can promote the economic goals of the OECD. Saying this is, of course, stating the obvious, but such basic facts are often forgotten in the public and political debates over PISA results.


About the OECD

The OECD brings together the governments of countries committed to democracy and the market economy from around the world to:
– Support sustainable economic growth
– Boost employment
– Raise living standards
– Maintain financial stability
– Assist other countries' economic development
– Contribute to growth in world trade

Figure 4: The basis for and commitments of the OECD as they appear on http://www.oecd.org/ (retrieved 7 September 2007).

Educational and curricular values in PISA

Quite naturally, values creep into PISA testing in several ways. PISA sets out to shed light on important (and not very controversial) questions like these:

Are students well prepared for future challenges? Can they analyse, reason and communicate effectively? Do they have the capacity to continue learning throughout life? (First words on the PISA home page at http://www.pisa.oecd.org/)

These are important concerns for most people, and it is hard to disagree with such aims. However, as is well known, PISA tests just a few areas of the school curriculum: reading, mathematics and science. These subjects are consequently considered more important than other areas of the school curriculum for reaching the brave goals quoted above. Hence, the OECD implicitly says that our future challenges are not highly dependent on subjects like history, geography, social science, ethics, foreign languages, practical skills, arts and aesthetics, etc.

PISA provides test results that are closely connected to (certain aspects of) the three subjects it tests. But when test results are communicated to the public, one receives the impression that PISA has tested the quality of the entire school system and all the competencies that are of key importance for meeting the challenges of the future.

There is one important feature of PISA that is often forgotten in the public debate: PISA (in contrast to TIMSS) does not test "school knowledge". Neither the PISA framework nor the test items claim any connection to national school curricula. This fact is in many ways the strength of the PISA undertaking; it has set out to think independently of the constraints of all the different school curricula. There is a strong contrast with the TIMSS test, as its items are meant to test knowledge that is more or less common to all curricula in the numerous participating countries. This implies, of course, that the "TIMSS curriculum" (Mullis et al 2001) may be characterized as a fossilized and old-fashioned curriculum of a type that most science educators want to eradicate. In fact, nearly all TIMSS test items could have been used 60-70 years ago. PISA's thinking has been freed from the constraints of school curricula and could in principle be more radical and forward-looking. (However, as asserted in other parts of this chapter, PISA does not manage to live up to such high expectations.)

PISA stresses that the skills and competencies assessed may stem not only from activities at school but also from experiences and influences in family life, contact with friends, etc. In spite of this, both good and bad results are most often attributed by the public and politicians to the school alone.

Values in the PISA reporting

The PISA data collection also covers a great variety of background variables. The intention is, of course, to use these to explain the variance in the test results ("explain" in a statistical sense, i.e. to establish correlations, etc.). Many interesting studies have been published on such issues. But the main focus of the public reporting is simple ranking, often in the form of league tables for the participating countries, in which the mean scores of the national samples are published. These league tables are nearly the only results that appear in the mass media. Although the PISA researchers take care to explain that many differences (say, between a mean national score of 567 and 572) are not statistically significant, the placement on the list gets most of the public attention. It is somewhat similar to sporting events: the winner takes it all. If you come in at number 8, no one asks how far you are from the winner, or how far you are from number 24. Moving up or down some places in this league table from PISA2000 to PISA2003 is accorded great importance in the public debate, although the differences may be non-significant statistically as well as educationally.
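To see why a five-point gap can be statistically invisible, consider a back-of-the-envelope check. The standard errors used here are assumptions for illustration only (national PISA means typically carry standard errors of a few score points), not figures taken from any particular PISA report. With an assumed standard error of 3 points for each country mean,

\[
z = \frac{572 - 567}{\sqrt{3^2 + 3^2}} \approx 1.18 < 1.96,
\]

so under these assumptions the difference does not reach significance at the conventional 5 % level, and the two countries' ranks are effectively interchangeable.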

The winners also become models and ideals for other countries. Many want to copy aspects of the winners' school systems. This assumes, among other things, that PISA results can be explained mainly by school factors – and not by political, historical, economic or cultural factors, or by youth culture and the values and concerns of the young learners. Peter Fensham claims:

. . . the project managers choose to have quite separate expert groups to work on the science learning and the contextual factors – a decision that was later to lead to discrepancies. Both projects have taken a positivist stance to the relationship between contextual constructs and students' achievement scores, although after the first round of TIMSS other voices suggested a more holistic or cultural approach to be more appropriate for such multi-cultural comparisons. (Fensham 2007: 218)

PISA (and even more so TIMSS) is dominated and driven by psychometric concerns, and much less by educational ones. The data that emerge from these studies provide a fantastic pool of social and educational information, collected under strictly controlled conditions – a playground for psychometricians and their models. In fact, the rather complicated statistical design of the studies decreases their intelligibility. It is, even for experts, rather difficult to understand the statistical and sampling procedures, the rationale and the models that underlie even the reported test scores. In practice, one has to take the results at face value and on trust, given that some of our best statisticians are involved. But the advanced statistics certainly reduce the transparency of the study and hinder an informed public debate.

PISA items – a critique

The secrecy

An achievement test is never better than the quality of its items. If the items are miserable, even the best statisticians in the world cannot change that fact. Subject matter educators should have a particular interest, and even a duty, to examine in detail how their subject is treated and 'operationalized' through the PISA test items. One should not just discuss the given definitions of, e.g., scientific literacy and the intentions of what PISA claims to test. In fact, the framework as well as the intentions and ideologies of the PISA testing may be considered acceptable and even progressive. The important question is: how are these brave intentions translated into actual items?

But it is not easy to address this important issue, as only very few of the items have been made publicly available. Peter Fensham, himself a member of the PISA (as well as TIMSS) subject matter expert group, deplores the secrecy:

"By their decision to maintain most items in a test secret [ . . . ] TIMSS and PISA deny to curriculum authorities and to teachers the most immediate feedback the project could make, namely the release in detail of the items, that would indicate better than framework statements, what is meant by 'science learning'. The released items are tantalizing few and can easily be misinterpreted." (Fensham 2007: 217)

The reason for this secrecy is, of course, that the items will be used in the next PISA testing round and therefore may not be made public. An informed public debate on this key issue is therefore difficult, to say the least. But we can scrutinize the relatively few items that have been made public.2

Can "real-life challenges" be assessed by wordy paper-and-pencil items?

PISA testing takes place in about 60 countries, which together (according to the PISA homepage) account for 90 % of the world economy. PISA states its intention of testing

. . . knowledge and skills that are essential for full participation in society. [ . . . ] not merely in terms of mastery of the school curriculum, but in terms of important knowledge and skills needed in adult life. [ . . . ]
The questions are reviewed by the international contractor and by participating countries and are carefully checked for cultural bias. Only those questions that are unanimously approved are used in PISA. (Quotes from pisa.oecd.org, retrieved 5 September 2007)

In each item unit, the questions are based on what is called an "authentic text". This, one may assume, means that the original text has appeared in print in one of the 60 participating countries, and that it has been translated from this original.

Many critical comments can be made to challenge the claim that PISA lives up to its high ambition of testing real-life skills. An obvious limitation is the test format itself: the test contains only paper-and-pencil items, and most items are based on reading rather lengthy pieces of text. This covers, of course, only a subset of the "knowledge and skills that are essential for full participation in society". Coping with life in modern societies requires a range of competencies and skills that cannot possibly be measured by test items in the format of the PISA units.

2 All the released items from previous PISA rounds can be retrieved from the PISA website: http://www.oecd.org/document/25/0,3343,en_32252351_32235731_38709529_1_1_1_1,00.html



Identical "real-life challenges" in 60 countries?

But the abovementioned criticism has other and equally important dimensions: the PISA test items are by necessity exactly the same in each country. The quote above assures us that any "cultural bias" has been removed and that items have to be "unanimously approved".

At first glance, this sounds positive. But there are indeed difficulties with such requirements: real life is different in different countries. Here, in alphabetical order, are the first countries on the list of participants:

Argentina*, Australia, Austria, Azerbaijan*, Belgium, Brazil*, Bulgaria*, Canada, Chile*, Colombia*, Croatia*, the Czech Republic, Denmark3

We can only imagine the deliberations towards unanimous acceptance of all items among the 60 countries, given the demands that there should be no cultural bias and that no country's context should be favoured.

The following consequences seem unavoidable: the items become decontextualised, or carry contrived 'contexts' far removed from most real-life situations in any of the participating countries. While the schools in most countries have a mandate to prepare students to meet the challenges of that particular society (depending on its level of development, climate, natural environment, culture, urgent local and national needs, etc.), PISA tests only those aspects that are shared with all other nations. This runs contrary to current curriculum trends in many countries, where providing local relevance and context has become an urgent issue. In many countries, educators argue for a more contextualized (or 'localized') curriculum, at least in the obligatory basic education for all young learners.

The item construction process also rules out the inclusion of all sorts of controversial issues, be they scientific, cultural, economic or political. It is enough that the authorities of a single participating country object.

To repeat: schools in many countries have the mandate of preparing their learners to take an active part in social and political life. While many countries encourage schools to treat controversial socio-scientific issues, such issues are unthinkable in the schools of other countries. Moreover, a controversial issue in one country may not be seen as controversial in another. In sum: the demands of the item construction process place serious limitations on the actual items that comprise the PISA test.

3 The list is from http://www.pisa.oecd.org/. Countries marked with a * are not members of the OECD (but are also assumed to agree unanimously on the inclusion of all test units).



Now, all the above considerations are simply deductions from the demands of the processes behind the construction of the PISA instrument. It is, of course, of great importance to check such conclusions against the test itself. But, as mentioned, this is not an easy task, given the secrecy over the test items. Nonetheless, the items that have been released confirm the above analysis: the PISA items are basically decontextualised and non-controversial. The PISA items are – in spite of an admirable level of ambition – nearly the negation of the skills and competencies that many educators consider important for facing future challenges in modern, democratic societies.

PISA items have also been criticized on other grounds. Many claim that the scientific content is questionable or misleading, and that the language is strange and often verbose. In the following sections, two examples of PISA units are discussed in some detail, one from mathematics, the other from science.



A PISA mathematics unit: Walking

In Figure 5 below, the complete PISA test unit called Walking is reproduced.

M124: Walking

The picture shows the footprints of a man walking. The pacelength P is the distance between the rear of two consecutive footprints.
For men, the formula, n/P=140, gives an approximate relationship between n and P where,
n= number of steps per minute, and
P= pacelength in metres

Question 1: WALKING M124Q01- 0 1 2 9
If the formula applies to Heiko's walking and Heiko takes 70 steps per minute, what is Heiko's pacelength? Show your work.

Question 3: WALKING M124Q03- 00 11 21 22 23 24 31 99
Bernard knows his pacelength is 0.80 metres. The formula applies to Bernard's walking. Calculate Bernard's walking speed in metres per minute and in kilometres per hour. Show your working out.

Figure 5: A complete PISA mathematics unit, "Walking", with the text presenting the situation and the questions relating to the situation.



Comments on Walking

Some details first: note that Question 2 is missing! (This may be an omission in the published document.) Note also the end of Q1: "Show your work." And for Q3: "Show your working out." There also seem to be several commas too many. Consider the commas in this sentence: "For men, the formula, n/P = 140, gives an approximate relationship between n and P where, etc . . . ". In my view, all four commas seem somewhat misplaced. Perhaps these are merely details, but they are not very convincing as the final outcome of serious negotiations between 60 countries!

The main comments on this unit, however, concern the content of the item. First of all: is this situation really a "real-life situation"? How real is the situation described above? Is this type of question a real challenge in the future life of young people – in any country?

But even if we accept the situation as a real problem, it seems hard to accept that the given formula is a realistic mathematization of a genuine situation. The formula implies that when you increase the frequency of your steps, your paces simultaneously become longer. In reality, a person may walk with long paces at a low frequency, and the same person may also walk with short steps at a high frequency. In fact, at least from my point of view, the two factors should be inversely proportional rather than proportional, as suggested in the "Walking" item. In any case, a respondent who tries to think critically about the formula may get confused, while those who do not think may easily solve the question simply by inserting numbers into the formula.

But the problems do not stop here. Take a careful look at the dimensions given in the figure: if the marked footstep is 80 cm (as suggested in Q3 above), then the footprint is 55 cm long! A regular man's foot is actually only about 26 cm long, so the figure is extremely misleading. Even worse, from the figure we can see (or measure) that the next footstep is 60 % longer. Given the formula above, this also implies a more rapid pace, and the man's acceleration from the first to the second footstep has to be enormous!

In conclusion: the situation is unrealistic and flawed from several points of view. Students who simply insert numbers into the formula without thinking will get it right. More critical students who start thinking will, however, be confused and get into trouble!
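For reference, the uncritical "insert the numbers" route through the two questions runs as follows – a minimal worked sketch using only the formula n/P = 140 given in the unit:

\[
\text{Q1: } n = 70 \;\Rightarrow\; P = \frac{n}{140} = \frac{70}{140} = 0.5\ \text{m}.
\]
\[
\text{Q3: } P = 0.80 \;\Rightarrow\; n = 140 \cdot 0.80 = 112\ \text{steps/min}; \quad v = n \cdot P = 112 \cdot 0.80 = 89.6\ \text{m/min} \approx 5.4\ \text{km/h}.
\]

Nothing in this calculation requires the test-taker to notice that the modelled walking behaviour is physically odd; that is precisely the point.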



A PISA science unit: Cloning

In Figure 6 below, the complete PISA test unit called Cloning is reproduced.

S128: Cloning

Read the newspaper article and answer the questions that follow.

A copying machine for living beings?

Without any doubt, if there had been elections for the animal of the year 1997, Dolly would have been the winner! Dolly is a Scottish sheep that you see in the photo. But Dolly is not just a simple sheep. She is a clone of another sheep. A clone means: a copy. Cloning means copying 'from a single master copy'. Scientists succeeded in creating a sheep (Dolly) that is identical to a sheep that functioned as a 'master copy'. It was the Scottish scientist Ian Wilmut who designed the 'copying machine' for sheep. He took a very small piece from the udder of an adult sheep (sheep 1). From that small piece he removed the nucleus, then he transferred the nucleus into the egg-cell of another (female) sheep (sheep 2). But first he removed from that egg-cell all the material that would have determined sheep 2 characteristics in a lamb produced from that egg-cell. Ian Wilmut implanted the manipulated egg-cell of sheep 2 into yet another (female) sheep (sheep 3). Sheep 3 became pregnant and had a lamb: Dolly. Some scientists think that within a few years it will be possible to clone people as well. But many governments have already decided to forbid cloning of people by law.

Question 1: CLONING
Which sheep is Dolly identical to?
A Sheep 1
B Sheep 2
C Sheep 3
D Dolly's father

Question 2: CLONING S128Q02
In line 14 the part of the udder that was used is described as "a very small piece". From the article text you can work out what is meant by "a very small piece". That "very small piece" is
A a cell.
B a gene.
C a cell nucleus.
D a chromosome.

Question 3: CLONING S128Q03
In the last sentence of the article it is stated that many governments have already decided to forbid cloning of people by law. Two possible reasons for this decision are mentioned below. Are these reasons scientific reasons? Circle either "Yes" or "No" for each.

Reason: Cloned people could be more sensitive to certain diseases than normal people. Scientific? Yes / No
Reason: People should not take over the role of a Creator. Scientific? Yes / No

Figure 6: A complete PISA science unit, "Cloning", with the text presenting the situation and the three questions relating to the situation.



Comments on Cloning

This task requires understanding a rather lengthy text of some 30 lines. In non-English-speaking countries, this text is translated into the language of instruction. The translation follows rather detailed procedures to ensure high quality. The requirement that the texts be more or less identical results in rather strange prose in many languages. The original has, we may assume, been an "authentic text" in some language, but the resulting translations cannot be considered "authentic" in the sense that they could appear in a newspaper or journal in the country in question.

PISA adheres to strict rules for the translation process, but this is not the way prose should be translated to become good, natural and readable in other languages. In my own language, Norwegian, the heading "A copying machine for living beings?" is translated word by word. The result does not make sense, and prose like this would never appear in real texts.

The scientific content of the item may also be challenged. The only accepted answer to Question 1 is that Dolly is identical to Sheep 1 (alternative A). It may seem strange to claim that two sheep of very different ages are "identical" – but this is the only acceptable answer. The other two questions are also open to criticism. Basically, they test language skills: reading as well as vocabulary. (The word 'udder' was unknown to me.)

In conclusion: although the intentions behind the PISA test are positive, it is next to impossible to produce items that are 'authentic' and close to real-life challenges – and at the same time without cultural bias and equally 'fair' in all countries. Items have to be constructed through international negotiations, and the result is that all contexts are wiped out – contrary to the ambitions of the PISA framework.

Youth culture: Who cares to concentrate on PISA tests?

In the PISA testing, students aged 15 are supposed to sit for two hours and do their best to answer the items. The data gathered in this way form the basis of all conclusions on achievement and of all the factor analyses that explain (in a statistical sense) the variation in achievement. The quality of these achievement data determines the quality of the whole PISA exercise. Good data assume, of course, that the respondents have done their best to answer the questions. For PISA results to be valid, one has to assume that students are motivated and cooperative, and that they are willing to concentrate on the items and give their best performance.

There are good reasons to question such assumptions. My assertion is that students in different countries react very differently to test situations like those of PISA (and TIMSS). These reactions are closely linked to the overall cultural environment of the country, and in particular to students' attitudes to school and education. Let me illustrate such cultures with examples from two countries scoring high on tests like PISA and TIMSS.

Testing in Taiwan and Singapore

An observer from the Times Educational Supplement watched the TIMSS testing at a school in Taiwan and noticed that pupils and parents were gathered in the schoolyard before the big event. The director of the school gave a speech in which he urged the students to perform their utmost for themselves and their country. Then the pupils marched in while the national anthem was played. Of course, they worked hard; they lived up to the expectations of their parents, school and society.

Similar observations can be made in Singapore, another high achiever on international tests. A professor of mathematics at the National University of Singapore, Helmer Aslaksen, makes the following comment: "In this country, only one thing matters: Be best – teach to the test!" He has also taken a photograph of the check-out counter in a typical Singaporean shop (see Figure 7). This is where the last-minute offers are displayed: on the lower shelf one finds painkillers, while the upper shelf displays a collection of exam papers for the important public exams in mathematics, science and English (i.e. the three PISA subjects). This is what ambitious parents may bring home to their 13-year-old kids. Good results in such exams determine the future of the student.

This is definitely not the way such testing takes place in my part of the world (Norway) and the other Scandinavian countries. Here, students have a very different attitude to schooling, and even more so to exams and testing. The students know that their performance on the PISA test has no significance for them: they are told that they will never get the results, the items will never be discussed at school, and they will not get any other form of feedback, let alone school marks, for their efforts. Given the educational and cultural milieu in, e.g., Scandinavia, it is hard to believe that all students will engage seriously in the PISA test.


Figure 7: The context of exams and testing: the check-out counter in a shop in Singapore, where the last-minute offers are displayed. On the lower shelf: medicinal painkillers. On the upper shelf: exam papers for the important public exams in mathematics, science and English (i.e. the three PISA subjects).

Task value: "Why should I answer this question?"

Several theoretical concepts and perspectives are used to describe and explain performance on tests. The concept of self-efficacy beliefs has become central to this field. By self-efficacy belief, one understands the belief and confidence that students have in their own resources and competencies when facing a task (Bandura 1997). Self-efficacy is rather specific to the type of task in question and should not be confused with more general psychological personality traits like self-confidence or self-esteem. PISA has several constructs that seek to address self-efficacy, and the researchers have noted a rather strong positive relationship between, e.g., mathematical self-efficacy beliefs and achievement on the PISA mathematics test at the individual level (Knain & Turmo 2003). (It is, however, interesting to note that such a positive correlation does not exist when countries are the unit of comparison.)
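A minimal numeric sketch (the numbers are invented purely for illustration) shows how this can happen. Suppose the students of country A lie at self-efficacy/score pairs (1, 520) and (2, 540), and those of country B at (3, 460) and (4, 480). Within each country the relationship is positive (+20 score points per unit of self-efficacy), yet the country means,

\[
\bar{A} = (1.5,\ 530), \qquad \bar{B} = (3.5,\ 470),
\]

run in the opposite direction: the country with the higher mean self-efficacy has the lower mean score. Individual-level and country-level correlations can thus diverge completely.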

There is, however, a related concept that may be of greater importance in explaining test results and students' behaviour in test situations: the concept of task value beliefs (Eccles & Wigfield 1992, 1995). While self-efficacy beliefs address the question "Am I capable of completing this task?", task value beliefs focus on the question "Why do I want to do this task?". Task value beliefs concern the importance of succeeding (or even trying to succeed) on a given task.

It has been proposed that task value beliefs have three components or dimensions: (1) attainment value, (2) intrinsic value or interest, and (3) utility value. Rhee et al (2007) explain in more detail:

Attainment value refers to the importance or salience that students place on the task. Intrinsic value (i.e. personal interest) relates to general enjoyment of the task or subject matter, which remains more stable over time. Finally, utility value concerns students' perceptions of the usefulness of the task, in terms of their daily life or for future career-related or life goals. (Rhee et al 2007: 87)

I would argue that young learners in different countries perceive the task value of the PISA testing in very different ways, as indicated in this chapter's previous sections.

Based on my knowledge of the school system and youth culture in my own part of the world, in particular Norway and Denmark, I would claim that many students in these countries assign very little value to all three of the above dimensions of the task value of the PISA test and its items. Given the nature of the PISA tasks (long, clumsy prose and contrived situations removed from everyday life), many students can hardly find the items to have high "intrinsic value"; the items are simply not interesting and do not provide joy or pleasure. Neither does the PISA test have any "utility value" for these Scandinavian students: the results have no consequences, the items will never be discussed, there is no feedback, and the results are secret and count neither for school marks nor in the students' daily lives. They do not count for students' future career-related or life goals. Given the cultural and school milieu and the values held by young learners in, e.g., Scandinavia, it is hard to understand why they should choose to push themselves in a PISA test situation.

If so, we have an additional cause for serious uncertainty about the validity and the reliability of the PISA results.



References

Bandura, A. (1997). Self-efficacy: The exercise of control. New York: Freeman.
Bergesen, H. O. (2006). Kampen om kunnskapsskolen [Eng.: The fight for a knowledge-based school]. Oslo: Universitetsforlaget.
Eccles, J. S., & Wigfield, A. (1992). The development of achievement-task values: A theoretical analysis. Developmental Review, 12, 256-273.
Eccles, J. S., & Wigfield, A. (1995). In the mind of the actor: The structure of adolescents' achievement task values and expectancy-related beliefs. Personality and Social Psychology Bulletin, 21, 215-225.
Fensham, P. (2007). Values in the measurement of students' science achievement in TIMSS and PISA. In Corrigan et al. (Eds.), The Re-Emergence of Values in Science Education (pp. 215-229). Rotterdam: Sense Publishers.
Knain, E., & Turmo, A. (2003). Self-regulated learning. In S. Lie et al. (Eds.), Northern Lights on PISA (pp. 101-112). Oslo: University of Oslo.
Mullis, I. V. S., Martin, M. O., Smith, T. A., Garden, R. A., Gregory, K. D., Gonzales, E. J., et al. (2001). TIMSS Assessment Frameworks and Specifications 2003. Boston: International Study Center, Boston College.
Prais, S. J. (2007). England: Poor survey response and no sampling of teaching groups. Oxford Review of Education, 33(1).
Rhee, C. B., Kempler, T., Zusho, A., Coppola, B., & Pintrich, P. (2005). Student learning in science classrooms: What role does motivation play? In S. Alsop (Ed.), Beyond Cartesian Dualism: Encountering Affect in the Teaching and Learning of Science. Dordrecht: Springer, Science and Technology Education Library.
Ziman, J. M. (2000). Real Science: What it is and what it means. Cambridge: Cambridge University Press.


PISA – Undressing the Truth or Dressing Up a Will to Govern?

Gjert Langfeldt
Norway: University of Agder

Background

The background for this article is a study of accountability in Europe. The testing of pupils' results is a prime mechanism in establishing an accountability-based logic of governance.1 Part of understanding accountability is studying the quality of the instruments used to measure results, among which the international comparative tests are of prime importance.

PISA – the Programme for International Student Assessment – stands out as by far the most influential of the international comparative tests. The approach to PISA used here was to collect articles on how educational researchers around Europe have reacted to PISA. This is not easy, as PISA is conducted in three-year rounds with three different focal points, taking nine years to complete a full cycle. In 2000 the focal point of PISA was reading literacy, in 2003 mathematical competence. This means that the researchers' critique was often linked to a partial theme, and a common methodological ground for critique of PISA was not always straightforward to find.

Synthesising the literature, this article is structured under three headings:

Reliability. "The International League Table" is the spearhead of PISA in attracting public interest. The differences reported between countries have huge consequences, and whether these differences are reliable must be of primary concern. So the issue of reliability will be the first theme: does the international literature indicate a concern that there are sources of "fuzziness" in the PISA results that can make the national scores appear unreliable?

1 The publication of schools' results, most crudely known as "league tables", and the sanctioning of schools based on the test results appear to be further steps in creating a more full-blown version of accountability-based regimes.

Validity. The second issue of concern is validity, and several issues appear under this theme in the literature. The angle chosen here can be stated thus: currently, 57 nations partake in PISA. In what sense is it meaningful to compare them in the form of a ladder of national results? Theoretically, how can one find a legitimate basis for comparing different nations? Closely related to this is the assumption – one which PISA is not alone in relying on – that pupils' results can be an indicator of school quality, which in turn can be proof of the quality of national educational systems. A third element in the discussion of the validity of PISA is the issue of inference: can one assess a school system on the basis of the scores of individual students?

The business model of PISA. The third issue I wish to focus on is PISA as a sociological event. The impact of PISA lies not only in how it changes the lives of pupils and teachers or makes educational policymaking change its priorities, but also in how it shapes the way we think about education, about school quality and about what aims the educational policies of a nation should fulfil. The traditional actors in this field are politicians, who can be held accountable for their views. What kind of actor is PISA? Researchers claim that there is another agenda in which PISA is a prominent actor – and the final discussion of this paper will look at the legitimacy of PISA within such a broader horizon.

Why PISA

On the European scene, two providers of international comparative knowledge tests are dominant: the IEA and the OECD.

The IEA (http://www.iea.nl/), or International Association for the Evaluation of Educational Achievement, is a foundation owned by member states and organisations, currently numbering 62, with another 20 non-member states partaking in various activities. Its most popular product is TIMSS (Trends in International Mathematics and Science Study), which currently (TIMSS 2007) is used in more than 60 countries, of which more than 20 are European. TIMSS aims to measure mastery of the curriculum provided. PIRLS (Progress in International Reading Literacy Study) aims at measuring reading literacy; 41 countries currently participate, among them 23 European. Where TIMSS is run in four-year cycles, PIRLS is run in five-year cycles. A third study is SITES (Second Information Technology in Education Study), which in 2006 was run in 20 states, half of which were European. ICCS (International Civics and Citizenship Education Study) currently has 40 participants, 25 of which are European. The fifth and last product announced on the IEA homepage is concerned with teacher education and development in mathematics (TEDS-M). This is a new test whose first results are being published in 2007; fourteen countries participate in it, of which 6 are European.

The OECD has 30 member states, accounting for about 90 % of the world's GNP and exerting a huge influence as 'the club of the world's rich countries'. Its efforts and influence in education are increasing (Jacobi 2006), partly due to the success of the definitive market leader, PISA. In addition to PISA, its tools of influence include the annual publication of education statistics (Education at a Glance). The OECD also runs a "think-tank" related to education, CERI – the Centre for Educational Research and Innovation. The OECD also manages another international knowledge test, ALL (Adult Literacy and Life Skills Survey, http://nces.ed.gov/surveys/all/), with the precursors SIALS and IALS. These are large-scale tests with the purpose of charting the adult population's skills in reading and mathematics.

PISA is a programme of assessment in the sense that it is carried out every third year with a differing focus on the three main areas. In 2000, 41 countries participated in the study, of which 25 were European; in 2006, 57 countries took part, of which 31 were European. For each round of testing, the OECD publishes results comparing countries in the form of a league table. PISA assesses 15-year-old students, as "this is normally close to the end of the initial period of basic schooling in which all young people follow a broadly common curriculum". PISA's aim is to measure literacy:

While OECD/PISA does assess students' knowledge, it also examines their ability to reflect, and to apply their knowledge and experience to real-world issues. . . . The term "literacy" is used to encapsulate this broader conception of knowledge and skills. (OECD 2003, pp. 9-10)

This approach sets PISA apart from competitors such as TIMSS, which tries to measure the degree to which pupils master the knowledge transmitted through the national curricula. PISA can thus claim not to be constrained by national curricula.

The special focus means that this theme will "take up nearly two-thirds of the testing time" (OECD 2003, p. 13). As several authors reckon approximately two minutes per item, this means that the special focus area has about 40 questions and that the total runs to about 60 items. The universe of items is, however, much larger, and the items are organised in 14 different test booklets distributed in equal proportions across the sample.
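The arithmetic behind these figures is straightforward, assuming the two-hour (120-minute) sitting mentioned in the previous chapter:

\[
\tfrac{2}{3} \cdot 120\ \text{min} = 80\ \text{min} \;\Rightarrow\; \tfrac{80}{2} = 40\ \text{items in the focus domain}; \qquad \tfrac{120}{2} = 60\ \text{items in total}.
\]

Note that this counts only what one student sees in one sitting, not the much larger total item pool spread across the rotated booklets.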

Nearly all these tests have in common that, in addition to the test booklet, the students also answer a questionnaire charting the context of their education. In addition, several forms are answered by, e.g., teachers, principals and municipal authorities in order to allow generalisations about the context of education.

In each participating country, a national PISA office is set up, spending up to two years establishing the national sample, establishing processes for administering the tests, and functioning as local quality assurance officers. It seems to be a general trait that these national offices also function as the chief PISA interpreters in their country, often undertaking not only publication but also additional research in order to enlarge the PISA impact.

The reliability issue

Starting from a textbook definition, this issue concerns how random errors can influence results. One should differentiate between random and systematic errors in all evaluations of measurement; the latter concern the issue of validity. The importance of reliability is that it constitutes a precondition for validity. Metaphorically, this can be illustrated by how noise is able to destroy a musical experience: how PISA measures is a precondition for discussing what it claims to have found.

Random variation among 450,000-500,000 15-year-olds can come from innumerable sources, and it is an important discussion in itself to assess which differences should be controlled for and which not. Neither the research community nor PISA has undertaken any systematic discussion of how real-world differences of such magnitude can be controlled for – even though such a theory is fundamental for explaining differences in scores. One example: so far, I have not found any mention of the influence of the substantial variations in the number of instructional hours 15-year-old pupils will have received.

What I have found in the survey of European research articles on PISA reliability are three arguments: two concerning sample quality and one concerning cultural bias in the items.

Of the two arguments relating to sample quality, one is a minor issue and concerns the representativity of the PISA sample. The argument runs that when many schools decline to partake and substitute schools are recruited, one cannot be sure that the properties of the substitute schools are equal to those of the schools that declined. Although a pertinent objection, this problem will probably disappear if PISA's influence keeps mounting.

However, I find the second argument to be rather an important one, as it concerns what hides under the PISA assertion that PISA tests the competence of 15-year-olds because “this is normally close to the end of the initial period of basic schooling in which all young people follow a broadly common curriculum”.2 The critics point to this being a simplification for two reasons: Whether a given grade actually contains pupils aged 15 is an empirical question, and there is every reason to believe that this will vary between countries. S. J. Prais, who first launched this argument, formulates it thus: “Perhaps most pupils were in classes for mainly 15-year-olds; others had repeated a class and – though aged 15 – were in classes for mainly 14-year-olds, others in classes for mainly 13-year-olds; and a few had ‘skipped’ and were in classes for 16-year-olds. Often (France, Germany, Switzerland . . . ) by the age of 15 hardly more than half of pupils may be in a class for pupils of that age.”3 (Prais 2004 p. 571)

Another aspect of the same fact is that when you compare 57 countries, some of these countries will not have all pupils aged 15 in class – some have already dropped out, for instance to work. The relevant issue for a discussion of reliability is whether those who drop out have the same academic proficiency as those who stay in school. Arguing the case for a better home background being decisive for academic achievement – a case so well researched that it would be trite to mention evidence – one may well assume that the pupils quitting before the age of 15 are unequal to the ones remaining in school. This leads to the conclusion that in a PISA context, some nations gain from the fact that as much as 60 % of their classmates have left school before the age of 15.

2 Actually, this is not completely precise. In the PISA technical report it says: “The 15-year-old official target was slightly adapted to better fit the age structure of most of the northern hemisphere countries. As the majority of the testing was planned to occur in April, the international target population was consequently defined as all students aged from 15 years and 3 (completed) months to 16 years and 2 (completed) months at the beginning of the assessment period.” (OECD: PISA 2000 technical report, page 46)

3 As an illustration, Bodin, referring to PISA 2000, states that “That leads to 59.1 % of the French students who took the test were in high school grades, at grade 10, or for a few of them at grade 11” (Bodin 2005 p. 4). Another illustration would be Norway, where about 95 % were in grade 9 (out of 10) at age 15, as they started school at age 7.



One researcher who has argued this is Wuttke (2006) in regard to PISA 2003. He starts by asking how representative PISA is, and his answer is that the representative aspect has far too little basis in actual numbers. This conclusion is based (as with Prais) on an appraisal of school attendance and the attendance of 15-year-olds. Wuttke points to Turkey, where the school attendance of 15-year-olds is a meagre 54 %, and to Mexico, where it is 58 %. Even within OECD countries this is a problem: Wuttke refers to Portugal, where 5 % of the sample left school between the time they were recruited and the time the test was administered. As one cannot assume that the drop-out rate is randomly distributed, he draws the conclusion that for PISA “it becomes a measure of success that the weaker pupils have dropped prematurely out of school” (Wuttke p. 105).

In addition <strong>to</strong> this, there is also the issue of the representativeness of the<br />

national samples. <strong>According</strong> <strong>to</strong> <strong>PISA</strong> procedure this is organised so that one<br />

first recruits schools, and then pupils within those schools. Under these circumstances<br />

is it vital <strong>to</strong> have a documentation of the relative size of the sample<br />

from each school. The importance of this record is that if such a list is present,<br />

one may adjust for unwanted differences; for example, if such a list opens the<br />

possibility for the statistical weighing of samples from particular schools for<br />

example because of small school size. On surveying the underlying material, 4<br />

Wuttke concludes that “This documentation is lacking from far <strong>to</strong>o many countries”<br />

<strong>–</strong> the examples he provides <strong>to</strong> illustrate this point is taken from Greece,<br />

where all pupils had <strong>to</strong> be given equal weight, as there was no information of<br />

school size at all, while the participation rate from Sweden was 102,5 %, from<br />

Toscana 107 % (ibid p. 106).<br />
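To see why this record matters, recall how design weights are built in two-stage sampling. The sketch below is a generic textbook construction with invented numbers, not PISA's actual weighting procedure (PISA draws schools with probability proportional to size); its only point is that school-size information enters the weight directly, so without it no weighting is possible.

def design_weight(n_schools, n_sampled_schools, school_size, n_sampled_students):
    # Inverse inclusion probability of one student under the simplest
    # two-stage design: schools drawn with equal probability, then a
    # fixed number of students drawn within each sampled school.
    p_school = n_sampled_schools / n_schools
    p_student = n_sampled_students / school_size  # impossible without school size
    return 1.0 / (p_school * p_student)

# A sampled student from a small school stands for about 8 pupils ...
print(design_weight(1000, 150, school_size=40, n_sampled_students=35))   # ~7.6
# ... while one from a large school stands for about 76:
print(design_weight(1000, 150, school_size=400, n_sampled_students=35))  # ~76.2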

Another set of criticisms relating to sample quality has to do with students whose results cannot be counted like those of other students, typically because they are handicapped. Wuttke asserts that the PISA report (OECD 2005 p. 183 ff) leaves the definition of handicaps up to the national committees. He refers to how the exemption rate within the OECD varies from 0.7 % for Turkey to 7.3 % for Spain and the US (OECD 2005 p. 169). In addition, Denmark, New Zealand and Canada transgress the 5 % limit (OECD 2005 p. 241 ff), but this has not had any consequence – data from all these countries are presented at face value (ibid p. 106). The conclusion that one is comparing apples and oranges is nigh at hand.

4 Wuttke refers to OECD 2005 p. 108 for an explanation of this, and he develops this argument in some detail.



Wording as a reliability issue

The reliability of PISA is affected by how the questions are worded, and the PISA technical report explains the logic of how this is handled and the thoroughness with which it has been looked into. Hemmingsen (2005) introduces a question not covered by PISA when she asks whether the wording lives up to its promises of measuring the life skills of pupils. Her assessment is that by and large it does not, and it cannot do so. Using some of the PISA items (Going hand in hand/Growing up/Semmelweis) as examples, she demonstrates how PISA items are constructed just like “school-test” items, how their difficulty is affected by their wording, and how they relate to contexts that cannot be similarly known to all students. The proof of the pudding, as it were, for Hemmingsen is that PISA rewards being “test wise” just as ordinary tests do (Hemmingsen 2005 p. 41).

The importance of these objections is not that they represent any neglect on the part of PISA. Rather the contrary – PISA has done more than any other similar undertaking in trying to establish a discussion on how reliability issues can be met. The importance of these objections is rather that they raise the issue of how fruitful the ambition of attempting to control for real-world differences on a multinational scale can be. In fact, a point of criticism might be that without a theory of which differences can be accounted for and how such control can be established, the undertaking of comparing the complex realities of 57 nations along one scale will appear high-handed.

Summing up the methodological objections raised by researchers, it appears that a methodological discussion should be encouraged to a larger degree, and that PISA itself has a central role in establishing such a discussion. It is not only legitimate but even vital that research should try to influence public debate on the quality of education, but one must not transgress the limits granted by one's tools. This is particularly so if the agenda is to contribute to greater accountability in education. The verdict of the research community raises grave questions as to whether PISA transgresses such limits.

The validity issue

From a textbook definition, the issue of validity is an issue of inference quality.5 What is the basis of the conclusions drawn? In a PISA context, one relevant definition of validity is put this way: “A total judgment rests on an holistic assessment of whether the empirical evidence and the theoretical framework form a sufficient basis to justify the actions and the consequences that are drawn from the test scores” (Jablonka 2006, p. 157). This definition goes to the core of what quality in a test like PISA is about: Does its impact rest on a solid basis, of both theory and data? It is only from this perspective that the lack of reliability finds its true importance. The issue of validity goes beyond systematic errors in the sense that errors can also accrue from a lack of theory, as well as from the quality of cohesion between theory and data.

5 Shadish, Cook and Campbell use this definition: “We use the term validity to refer to the approximate truth of an inference . . . Validity is a property of inferences.” (2002 p. 34)

The main issues of validity which have been raised by European researchers can be summed up in three arguments: the issue of cultural bias, the issue of scaling, and the issue of how to interpret PISA scores. As the issue of scaling is covered comprehensively elsewhere in this volume, I will focus on the other two issues.

Cultural bias as a validity issue

This argument was heard in the reactions to PISA 2000 from Italian (Nardi 2002), Swiss (Bain 2003) and French (Bodin 2005) sources, as well as from German ones relating to PISA 2003. The main argument challenges whether the real-world ambition of PISA refers to a world shared by all, and even the concept of “real world” competence is argued to be an Anglophone preference. This is a validity issue in two respects. First, it argues that pupils from different countries will have systematically different chances to perform equally well. Secondly, it argues that the more successful PISA is, the less will one be able to see cultural differences as an asset and diversity as a tool for improvement. This last argument raises the issue of whether the one-dimensional scale of a sum score is a valid standard for comparing nations – will it prove legitimate when the needs of globalisation put new changes on the agenda?

Some of these arguments can be contested. When Nardi (2002) doubts whether the methodology is correct when four out of the six best countries were Anglophone (the exceptions being Korea and Finland), or when Jablonka (2006) finds that out of a total of 54 questions in mathematics in PISA 2003, 13 come from Holland, 15 from Australia and 7 from Canada, while the rest stem from 9 different countries (ibid p. 167), one can still argue from PISA that when they can document that the questions are equally well understood everywhere, this argument is effectively controlled for.6

6 The problem resurfaces within PISA's context as the question of why some countries have a weak item statistic. This is reported as affecting 12 countries (the Basque Country, Brazil, Indonesia, Japan, Macao-China, Mexico, Thailand and Tunisia, as well as, to a lesser extent, Hong Kong-China, Serbia and Turkey). The explanations offered by PISA (the items may have discriminated differently in different countries, there may be concern about linguistic and cultural equivalence, or one simply has not recruited translators well enough equipped for the job) actually strengthen the argument that cultural bias is and must be present in such tests. (PISA 2003 Technical Report p. 79)



The issue of cultural differences is raised in a more radical sense by Bodin, whose argument concerns the quality of teaching as a precondition for PISA performance. He observes that “the differences are more important in favour of the Finnish students for the more ‘realistic’ items, and that the difference tends to turn in favour of the French students for more abstract or formal questions” (Bodin 2005 p. 8). He uses the bookshelves question to lament: “This question, along with many others, points to the weak stress given by PISA to the proof undertakings. Even explaining and justifying are not much valued by the PISA marking scheme. That makes a big difference with the . . . French conception of mathematical achievement” (2005 p. 12). Here Bodin illustrates how culture defines relevance: as a French mathematician, he is proud of his country's tradition in mathematics.

On the contrary I think, and all the work of the so-called French didactics school has helped me, that ruptures are necessary and constitutive to learning. So we may fear that putting too much stress on real life and actual situations may in return have some negative effects (2005, p. 13).

Would Europe become intellectually richer if this pride vanished in the face of PISA results? Is it not rather the opposite: that around the next corner, the French style in mathematical reasoning might not only be vindicated, but prove an asset to all? Cultural diversity makes for sustainable development. The choice of an approach that ends up treating cultural diversity as a measurement error makes the very undertaking of comparison repressive.

A special case of this argument relates to the PISA strategy for measuring reading skills.

Bain (2003) queries whether the conceptual framework for the reading tests is adequate, and also in what respect PISA can improve teaching. The argument Bain raises is whether reading is such a complex skill that it cannot be validly tested within the restrictions of the PISA test format. His criticism of the PISA conceptual framework is firstly linked to the fact that, at the time (PISA 2000), he finds the theory used for understanding reading literacy too empirical: “the test given can not verify the validity of a model but relies on a model to emerge from the facts” (ibid p. 61). This situation is aggravated by the fact that it is a restricted understanding of reading skills that is tested, in the sense that “of course one may agree that the pupils read texts that are about situations, but the situation they are read in is a typical school-situation”. This disfavours the weak pupils; only the clever pupils will in this situation be able to recreate the use intended for the text by the author (ibid p. 64). Bain proceeds to argue that the mastery of different genres cannot be tested within the restrictive test format (ibid p. 66).

Interpreting PISA results – dressing up the will to govern

An important validity issue is the choice of PISA to develop the results into an “international league table”, thus opening for the comparison of national results and explicitly discussing how these results can be improved. This is in PISA linked to the organisation and prioritisation of research focus areas, discussed in chapter 3 of the technical framework.

These focus areas are based on the OECD education indicators (INES) and organise data along two dimensions: firstly, data are interpreted by the level of the education system they originate from, and secondly, the indicators PISA produces are seen as outcomes or outputs, contexts or constraints. There are four levels to which the resulting indicators relate. These are specified thus: “The education system as a whole, the educational institutions and providers of educational service, the instructional setting and the learning environment within the institution and the individuals participating in the learning activities”. Each of these levels is studied in three aspects: with respect to “outputs and outcomes of education and learning”, “policy levers and contexts” (circumstances that shape the outputs and outcomes at each level) and “antecedents and constraints” (factors that define or constrain policy) (OECD 2003 technical report p. 35). Organised into a matrix, this gives a 12-cell matrix of PISA focus areas, where educational outputs can be identified on four levels, and which also specifies policy levers and contexts as well as antecedents and constraints. Such a matrix is presented in the technical framework, specifying which variables are used to produce the indicators presented in each cell.

The way the PISA focus areas are organised raises the issue of how one can interpret data from the “international league table”, and one question in particular is important: Does this framework lead PISA to offer opinions on educational matters beyond those for which it has an adequate basis?



PISA itself acknowledges a first objection to this framework: the problem of recursivity and complexity among levels. The PISA example is how at the classroom level the relation between student achievement and class size is negative, while at the class or school level the relation is positive – the explanation being that students are often intentionally grouped so that the weaker students are placed in smaller classes. PISA sees this as an example that “a differentiation between levels is not only important with regard to the collection of information, but also because many features of the education system play out quite differently at different levels of the system” (ibid p. 35). This is a fact that really should be underlined, and it can be shown to be present in many aspects of the educational system. One example often used in the literature on causality in education is the issue of the interplay between teacher and class. Carroll's (1963) model of class learning as a function of student level and time spent proved overly simplistic by not being able to give room for the interplay between class and teacher: an equal amount of time and teacher effort will produce widely different results as the students' attitude to learning differs.7

7 In fact, it is almost 30 years since Cronbach suggested that most causal links in education might preferably be understood as interactions – relations linked in such a way that causality can only be understood as probable and where the direction of causality might change.

That the interpretative matrix is riddled with difficulties is also illustrated by what PISA terms “antecedents and constraints”, which are described as follows: “Policy levers and contexts typically have antecedents, that is, factors that define or constrain policy. These are usually specific for a given level of the educational system, and antecedents at a lower level may well be policy levers at a higher level (e.g. for teachers and students in a school, teacher qualifications are a given constant, while at the level of the education system, professional development of teachers is a key policy lever)” (ibid p. 35). How is this a validity problem? What PISA does not say, but should have advised, is that the cultural traditions of different countries, which often constitute contexts as well as constraints, cannot be discussed out of context. Particularly when data are aggregated, it is of the utmost importance to present them contextualised. James S. Coleman (1990) argues that it is in principle logically invalid to deduce principles of government and management from aggregated macro-level data if such data lack substantial contextualisation. If this is omitted, two problems arise: The first is that one lacks “reality checks” and is led to opine beyond the realistic. The second, which can be seen as a correlate to this, is that the interpretation of results becomes more difficult (e.g. as of today, no real explanation for Finnish excellence in PISA has been presented).

A third argument – and once again the real-world differences crop up – must also be mentioned here, concerning the levels of the educational system in which PISA organises data. Two of the PISA levels are intuitively understandable: the student level and the level of the classroom; that is, where education as a social interaction occurs. The two last levels – the educational institutions and providers of educational services, and the education system as a whole – seem to be introduced in order to be able to differentiate between nations where schools are run by the government (and where the institution owner is the state) and nations where schools are run by a number of organisers (churches, NGOs, local communities). It is only for such settings that a differentiation between institution owner and system as a whole is appropriate. Two questions must be asked: Is such a way of organising the levels appropriate, and in the PISA case, is it used sensibly?

The first question has been addressed in a recent paper (Afzar 2007). She insists that the PISA framework is based on a theoretical assumption about a linear administrative chain of steering. This chain runs from the political level via the political body of the school owner, through the instructional setting organised within each school, to individual learning. In addition, she argues, behind the principle of aggregating data lies an action-theoretical approach reminiscent of methodological individualism. Afzar rejects the notion of a linear chain, and argues that when one tries to grasp education as a system, one must use an approach that is legitimised theoretically – a model which can also explain the complexity of the relations between the different levels of the educational system. She ventures to introduce one such model, whose importance in this context is that it allows for seeing the function of the school as being different for the individual student than for society at large. The Afzar model, interesting as it may be, is not of relevance here, but it serves to highlight in what respects the model used by PISA is legitimate.

Does PISA use its own framework sensibly? PISA 2003 neither studied teachers nor had intact classrooms as units of sampling. The level of instructional settings is therefore empty with regard to outputs and outcomes. What data it contains are data on students' learning – learning that happens in different instructional settings, settings which PISA does not explain. A similar argument is applicable to the institutional level, where data are either aggregated from the individual level or “synthesised” across institutional settings, as these are not identified as such.

At the systems level, PISA relies only on aggregates. In particular, the PISA systems outcome consists not only of aggregated individual data; the policy levers and contexts are also filled from system-level aggregates (notably from the instructional-setting level), and the same goes for the antecedents and constraints column at the systems level, albeit at these levels unspecified OECD data are also indicated as source material. The question is: Can student achievement be aggregated through such levels and still contribute to a meaningful interpretation of “national performance”?

PISA Inc.

The title of this argument is taken from a German book on PISA (Jahnke et al. 2006), and it concerns the nature of PISA as an enterprise. Here only two arguments will be discussed. The first argument is about whether PISA itself, as a tool for accountability, appears transparent: The aim of the international league table is to influence educational policies. Does PISA aim to handle this influence in a democratic way, or is PISA just another brand-building enterprise whose aim is to exert as much influence as possible, not caring to give an account of whether this influence is justified or not?

This argument has been explicitly addressed by Howie & Plomp (2005), who argue that even if it is intended that PISA will have policy implications, no study has been undertaken to systematically chart how PISA affects educational policies. They refer to Kellahan (1996) as stating that most accounts of the use of findings appear to be “limited and impressionistic”, that “detailed analyses are not available” and that “the way policy makers arrive at their conclusions is also little known”. Howie & Plomp concur with this and add that, albeit nearly a decade later, this still appears to be the case. This may be due in part to the difficulty of researchers gaining access to the policymakers' realm as well as to a lack of funding for impact studies. “In fact whilst governments are prepared often to fund data collection and the initial descriptive reports, little funding is offered for secondary analyses of the same data let alone an impact study of the release of such a rich source of data nationally or internationally” (2005 p. 93).

The second argument is that the allegations of lacking transparency seem to hold true even when looking inside how PISA is organised. Mogens Niss, a Danish member of the PISA expert group in mathematics, touches on this in an interview given in 2005: He does not agree that “leading experts in the field is a guarantee for PISA quality”, the reason being that “There is no one within PISA who keeps tabs on things. It is like the Internet: There is no all-controlling central brain in PISA”. He describes the development of mathematical literacy thus: “The expert group should clarify a description of the frame”, a job that was organised as an analytical developmental process, not a research process, and whose results were shaped also by the PISA governing board, which is constituted by officials from ministries in the involved countries. The PISA questions are shaped by expert groups, the OECD secretariat, the governing board and the international consortium together, and Professor Niss concludes that PISA is not a clear-cut object; it is a mixture of research and development work, influenced by needs for comparison and politics.

What Niss does not mention is that PISA is developed largely by enterprises either dependent on earning part of their income in the market (the Australian Council of Educational Research or the Educational Testing Service (USA)) or living wholly off the market, as exemplified by Westat (a US company) or Citogroep (a Dutch company).8 The problem with such an approach to organising is that when companies who have a vested interest in the success of PISA are to advise governments on educational policy, one cannot know whether the advice is biased or not.9

It may be relevant <strong>to</strong> mention that costs of <strong>PISA</strong> participation are not easily<br />

come by. For most of the IEA tests, however, the price is USD 30,000 per<br />

country per year, or USD 120,000 for a full cycle of a four <strong>–</strong> year annual<br />

repetition. Most of these tests are, however, sponsored by Ford, which is how<br />

8 A complete listing is available in the <strong>PISA</strong> 2003 Technical Framework appendix 2.<br />

9 This ambiguity is apparent in the chapter on data abjudication in the 2003 Technical Framework<br />

report. Using the USA as an example, this country not only did not meet the required<br />

school response rate (68,12 % after replacement), it also broke the <strong>PISA</strong> test timing window<br />

and had a <strong>to</strong>o high overall exclusion rate (7,28 %). After an evaluation (where no sources<br />

are given), it is concluded that the US data will be included in the full range of <strong>PISA</strong> reports.<br />

Another country plagued with grave problems is the United Kingdom, where the technical<br />

report concludes that “The uncertainty surrounding the sample and its bias are such that<br />

<strong>PISA</strong> 2003 scores for the UK cannot be reliably compared with other countries”, or with<br />

<strong>PISA</strong> 2000. The conclusion is still that all international averages and aggregate statistics include<br />

the data from the UK. There are apparent anomalies, such as those found in Mexico,<br />

where only 58 % of the classmates are in school, and the coverage of the national 15-yearold<br />

population was at only 49 %, or Spain, where the pupil exclusion rate was about 50 %<br />

above <strong>PISA</strong> standards, or Turkey, where the coverage of 15-year-olds was at 36 % <strong>–</strong> they are<br />

all included in the full range of <strong>PISA</strong> 2003 reports. Is this because it is scientifically sound<br />

or is it because another ruling would be bad for <strong>PISA</strong> Inc.?


It must be fair to conclude that PISA has huge unresolved issues concerning the way it is used. There seems to be an imbalance between the tools created and the eagerness to influence politics. In the long run this is detrimental to the very issue PISA seeks to promote: a sensible approach to measuring the human capital generated in the member countries.

Particular notice should be paid to PISA's relation to private enterprise, so that one does not produce a capacity for test-making which goes far beyond how such tests can be sensibly used.

References

Afzar, A. (2007). A systems theoretical critique of international comparisons. Paper presented at the 2007 AERA Convention.

Bain, D. (2003). PISA et la lecture: Un point de vue didacticien. In Schweizerische Zeitschrift für Bildungswissenschaften, vol. 25, 2003, no. 1, p. 59-78.

Bender, P. (2006). Was sagen uns PISA & Co, wenn wir uns auf sie einlassen? In Jahnke, T. and Meyerhöfer, W. (2006), PISA & Co. Kritik eines Programms. Hildesheim: Verlag Franzbecker.

Bodin, A. (2005). What does PISA really assess? What it doesn't. A French view. Report prepared for the joint Finnish-French conference “Teaching mathematics: Beyond the PISA survey”, Paris.

Folkeskolen 8.4.2005: “PISA – Der er ingen der har styr på det hele. Et sammensurium af forskning, test og politik, siger Mogens Niss fra PISAs ekspertgruppe i matematikk”. http://www.folkeskolen.dk/objectShow.aspx?ObjectId=33661

Hemmingsen, I. (2005). Et kritisk blik på opgaverne i PISA med særlig vekt på matematikk. In MONA, vol. ?, 2005, no. 1, p. 24-43.

Howie, S. and Plomp, T. (2005). International comparative studies of education and large-scale change. In Bascia, N., Cumming, A., Datnow, A. and Leithwood, K. (2005), International Handbook of Educational Policy. Springer International Handbooks of Education. Dordrecht, Holland.

Jablonka, E. (2006). Mathematical literacy: Die Verflüchtigung eines ambitionierten Testkonstrukts in bedeutungslose PISA-Punkte. In Jahnke, T. and Meyerhöfer, W. (2006), PISA & Co. Kritik eines Programms. Hildesheim: Verlag Franzbecker.

Jahnke, T. and Meyerhöfer, W. (2006). PISA & Co. Kritik eines Programms. Hildesheim: Verlag Franzbecker.

Jahnke, T. (2006). Zur Ideologie von PISA & Co. In Jahnke, T. and Meyerhöfer, W. (2006), PISA & Co. Kritik eines Programms. Hildesheim: Verlag Franzbecker.

OECD (2003). The PISA 2003 Assessment Framework. Paris: OECD.

OECD (2002). School sampling preparation manual. PISA 2003 Main Study, version one, 2002.

OECD (2002). Programme for international student assessment sample. Task from the PISA 2000 assessment of reading, mathematical and scientific literacy.

OECD: PISA 2003 Technical Report.

Prais, S. J. (2003). Cautions on OECD's recent educational survey (PISA). Oxford Review of Education, vol. 29, no. 2, 2003, p. 139-163.

Prais, S. J. (2004). Cautions on OECD's recent educational survey (PISA): Rejoinder to OECD's response. Oxford Review of Education, vol. 30, no. 4, Dec. 2004.

Romainville, M. (2002). L'enquête O.C.D.E. sur les acquis des élèves en débat. In La Revue Nouvelle, vol. 115, 2002, no. 3-4, p. 84-108.

Shadish, W., Cook, Th. and Campbell, D. (2003). Experimental and quasi-experimental designs for causal inference. Boston: Houghton Mifflin Company.

The French Ministry of Education (2002). The meetings of Desco: “Evaluation of the knowledge and skills of 15-year-old pupils: Questions and hypotheses formulated following the OECD study”. Contains Gaudemar, J.-P. (2002), Opening of the conference debate; Crowne, S. (2002), The British case; Nardi, E. (2002), The Italian case; Koch, H. C. (2002), The German case; and Cytermann, J.-R. (2002), The French case.

Wuttke, J. (2006). Fehler, Verzerrungen, Unsicherheiten in der PISA-Auswertung. In Jahnke, T. and Meyerhöfer, W. (2006), PISA & Co. Kritik eines Programms. Hildesheim: Verlag Franzbecker.


Uncertainties and Bias in PISA

Joachim Wuttke

Germany: Forschungszentrum Jülich – Munich

This is a summary of a detailed report (>100 pages, >100 references) that has appeared in German (Wuttke 2007). It will be shown that PISA's statistical significance criteria are misleading, because several sources of systematic bias and uncertainty are quantitatively more important than the standard errors communicated in the official reports.

1 Introduction

1.1 A huge framework

PISA is a long-term project. Starting in 2000, assessments are carried out every three years. One and a half years are needed for data processing until an international report entitled “First Results” (FR00, FR03) appears, and it takes even longer until a Technical Report (TR00, TR03) is published and the raw data are made available for independent analysis. Therefore, although the third assessment was carried out in spring 2006, at present (summer 2007) only PISA 2000 and 2003 can be evaluated. In the following we will concentrate on data from PISA 2003.

PISA 2003 was carried out in 30 OECD countries and in some partner countries. As data from the latter were not used in the international calibration, they will be disregarded in the following. The United Kingdom (UK), which failed to meet several criteria required for participation, was excluded from tables in the official report. However, data from the UK were fully used in calibrating the international data set and in calculating OECD averages – an inconsistency that is left unexplained (TR03: 128, FR03: 31).


PISA rules required a minimum sample size of 4,500 students per country, except in very small countries (Iceland, Luxembourg), where all fifteen-year-old students were recruited. In several countries (Australia, Belgium, Canada, Italy, Mexico, Spain, Switzerland, UK), considerably larger samples of up to nearly 30,000 students (TR03: 168) were drawn, so that separate analyses for regions or linguistic communities became possible. For the comparison of the sixteen German Länder, an even larger sample of 44,580 students was tested (Prenzel et al. 2005: 392), of which, however, only 4,660 were contributed to the international sample (TR03: 168). The Kultusministerkonferenz, fearing unauthorised cross-Länder comparisons of school types, has imposed deletion of Länder codes from public-use data files. Therefore, the inner-German comparison will not be considered further.

The bulk of PISA data comes from a three-hour student testing session. Some more information is gathered from school principals. The testing session consists of a two-hour cognitive test and of a third hour devoted to questionnaires. The main questionnaire enquires about the students' social background, educational environment, and learning habits. The questionnaire responses certainly constitute a valuable resource for studying the living and learning conditions of fifteen-year-olds in large parts of the world, even though participation rate gradients introduce some bias.

Compared <strong>to</strong> the rich empirical material obtained from the questionnaires,<br />

the outcome of the cognitive test is meagre: the official data analysis reduces<br />

it <strong>to</strong> just four scores per student, interpreted as “competences” in specific subject<br />

domains (reading, mathematics, science, problem-solving). Nevertheless,<br />

these results are at the origin of <strong>PISA</strong>’s political impact; communicated as<br />

“league tables” of national mean values, they made <strong>PISA</strong> known <strong>to</strong> the general<br />

public, causing an outright “shock” in some countries.<br />

While controversy erupted about possible causes of results perceived as<br />

unsatisfac<strong>to</strong>ry, the three-digit precision of the underlying data has rarely been<br />

questioned. This will be done in the present paper. The accuracy and validity<br />

of cognitive test results are <strong>to</strong> be reviewed from a statistical point of view.<br />

1.2 A surprisingly simple measure of competence

As a first step of data reduction, student responses are digitally coded. The Technical Report discusses inter-coder and inter-country variance at length (TR03: 218-232); the conclusion that non-uniform coding is an important source of bias and uncertainty is left up to the reader.



Some codes are kept secret because national authorities want to prevent certain analyses. In several multilingual countries the test language is kept secret. Except for such deletions, the international raw data set is available for downloading on the website of the OECD's main contractor ACER (Australian Council for Educational Research).

On the lowest level of data aggregation, single item response statistics (percentages of correct, incorrect, and invalid responses to one cognitive test item) can be generated. In the international report not even one such statistic is shown. PISA is decidedly not a study in Fachdidaktik (math education, science education, etc.). PISA does not aim at gathering information about the understanding of scientific concepts or the mastery of specific mathematical techniques. The data provide almost no handle to understand why students give incorrect responses. Only Luxembourg has scanned and published some student solutions to free-response items; these examples show that students sometimes just misunderstood what the item writer meant to ask.

PISA is designed to be analysed on a much coarser level. As anticipated above, cognitive test results are aggregated into just four “competence” values per student. The determination of these values is technically complicated because not all students worked on the same item set: thirteen different booklets were used, and in some countries some items turned out to be invalid because of misprints, translation errors, or other problems. This makes it necessary to establish an “item difficulty” scale prior to the quantification of student competences. For this calibration an elementary version of item response theory is used.

The importance of this theory tends to be overestimated by defenders and critics of PISA alike. Misunderstandings are also provoked by poor documentation in the official reports. For a functional understanding of what PISA measures, it is not important that different booklets were used, and it is plainly irrelevant that in some countries certain items were deleted. Glossing over these technicalities, pretending that all students were assigned the same item set, and ignoring the probabilistic aspect of item response theory, it becomes apparent what the competence values actually measure: no more and no less than the number of correct responses.

In the mathematics subtest of PISA 2003, a student with a competence of 500 (the OECD mean) has solved about 46 % of the items assigned to him. A competence of 400 (one standard deviation below the mean) corresponds to a correct-response rate of 23 %; 600 corresponds to 71 % (Wuttke 2007: Fig. 4). Within this span the relationship between competence value and correct-response percentage is nearly linear. The slope is about 4 competence points per 1 % of assigned items. This conversion gives the competence scale a much simpler meaning than the official reports allow one to suspect.
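As an illustration (mine, not the report's), the stated anchor point and slope give a one-line conversion rule. It reproduces the 600-point value exactly and is off by about two percentage points at 400, which is as close as a strictly linear rule can come to the slightly curved relation of Wuttke's Fig. 4:

def percent_correct(score):
    # Linear approximation from the text: about 4 competence points per
    # percentage point, anchored at 500 points = 46 % correct responses.
    # Roughly valid between 400 and 600 points.
    return 46.0 + (score - 500.0) / 4.0

for s in (400, 500, 600):
    print(s, percent_correct(s))  # 21.0, 46.0, 71.0 (text: 23 %, 46 %, 71 %)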

1.3 League Tables and Stochastic Uncertainties

Any analysis of PISA data aims at statistical statements about populations. For instance, an elementary analysis of the cognitive test yields results like the following: German students have a mean mathematics competence of 503; the standard deviation is 103; the standard error of the mean is 3.3, and the standard error of the standard deviation is 1.8 (Prenzel et al. 2004: 70). In order to make sense of such numbers, they need to be put into context. The PISA reports provide two kinds of interpretation guidance: verbal descriptions of “proficiency levels” give a rough idea of what competence differences of 60 or more points signify (see below), and comparisons between different populations insinuate that even differences of only a few points bear a message.

Since the assessment of competences within each of the four subject domains is strictly one-dimensional, any inter-population comparison implies a ranking. This explains the primordial role of league tables in PISA: they are not only a vehicle for gaining media attention, but are deeply rooted in the conception of the study (cf. Bottani/Vrignaud 2005). In the official reports almost all statistics are communicated in the form of country league tables. The ranks in these tables, especially low ranks (and every country has low ranks in some tables), are then easily turned into political messages. In this way PISA results can be interpreted without any understanding of what has actually been measured.

Of course, not all rank differences are statistically significant. This is duly noted in the official reports. For all statistics, standard errors are calculated. After processing these standard errors through a null-hypothesis testing machinery, some mean value differences are judged significant, while others are not. Complicated tables (FR03: 59, 71, 81, 88, 92, 281, 294) indicate which differences of competence means are significant and which are not. It turns out that in some cases 9 points are “sufficient to say with confidence that the higher performance by sampled students in one country holds for the entire population of enrolled 15-year-olds” (FR03: 93).

This accuracy is formidable when compared to the intra-country spread of test performances. The standard deviation of the competence distribution is 100 points in the OECD country average and not much smaller within single nations. This is an order of magnitude more than an inter-country difference of 9 points. Figure 1 illustrates the situation.

Figure 1: Two Gaussian distributions with mean values differing by 9 % of their standard deviation. Such a small difference between two populations is considered significant in PISA.
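One way to grasp how small such a difference is at the level of individuals (my illustration, not part of the original figure): for two Gaussians with common standard deviation sigma and mean difference delta, the probability that a randomly drawn member of the higher-scoring population outscores a randomly drawn member of the other is Phi(delta/(sigma*sqrt(2))).

from math import sqrt
from scipy.stats import norm

delta, sigma = 9.0, 100.0  # mean difference and common standard deviation
p = norm.cdf(delta / (sigma * sqrt(2)))
print(round(p, 3))         # 0.525 - barely better than a coin flip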

However, significant does not mean valid, let alone relevant. Statistical significance is achieved by nothing more than the law of large numbers. The standard errors on which the significance criteria are based account for only two specific sources of stochastic uncertainty: the student sampling and the item-response modeling of student behaviour. By testing more and more students on more and more items, these uncertainties can be made arbitrarily small. At some point, however, this effort becomes inefficient, because the validity of the study remains limited by non-stochastic sources of bias and uncertainty, which do not decrease with increasing sample size.

Before entering in<strong>to</strong> details, the likeliness of non-s<strong>to</strong>chastic bias will be<br />

made plausible by a simple estimate: To bring about a significant inter-country<br />

difference of 9 points, correct-response rates must differ by about 2 % of given<br />

responses. On average, a student is assigned 26 mathematics items. Hence, 9<br />

points correspond <strong>to</strong> no more than half a correct response per student. This<br />

suggests that little systematic error is needed <strong>to</strong> dis<strong>to</strong>rt test results far beyond<br />

their nominal standard errors.<br />
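Written out, the estimate simply combines the slope from Section 1.2 with the average item count (the 2 % follows the text's rounding):

\[ 9\ \text{points} \times \frac{1\ \%}{4\ \text{points}} \approx 2\ \%, \qquad 0.02 \times 26\ \text{items} \approx 0.5\ \text{correct responses per student}. \]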

In this paper, I will argue that PISA does indeed suffer from severe non-stochastic limitations, and that the large sample sizes are therefore uneconomic. Part 2 describes disparities in student sampling, Part 3 shows that the projection of cognitive test results onto a one-dimensional “competence” scale is neither technically convincing nor culturally fair, and Part 4 raises certain objections on the conceptual level.


2 Sampling disparities

In some countries it is clear from the outset that PISA cannot be representative (Sect. 2.1). But even in countries where school is obligatory beyond the age of fifteen, low participation rates are likely to introduce some bias. Several imperfections and inconsistencies of the international sample are well documented in the Technical Report. Participation rate requirements were not strict enough to prevent significant bias, and violations of predefined rules had no consequences.

2.1 Target population does not serve study objective

<strong>PISA</strong> claims <strong>to</strong> measure “outcomes of education systems in terms of student<br />

achievements”. This claim is not consistent with the choice of the target population,<br />

namely “15-year-olds enrolled full-time in educational institutions”. In<br />

some countries (Mexico, Turkey, several partner countries), enrollment is less<br />

than 60 %. Obviously, <strong>PISA</strong> says nothing about the outcome of the education<br />

system of these countries.<br />

On the other hand, in many countries school is obliga<strong>to</strong>ry beyond the age<br />

of 15. At fifteen, the ability of abstract reasoning is still in full development.<br />

<strong>PISA</strong> therefore systematically underestimates the abilities students have “near<br />

the end of compulsory schooling” (FR03: 3, 298; TR03: 46).<br />

2.2 Target population too loosely defined: unequal exclusions

Rules allowed countries to exclude up to 5 % of the target population: up to 0.5 % for organizational reasons and up to 4.5 % for intellectual or functional disabilities or limited language proficiency. Exclusions for intellectual disability depended on “the professional opinion of the school principal, or by other qualified staff” – a completely uncontrollable source of uncertainty. From the fine print in the Technical Report, it appears that some countries defined additional criteria: Denmark, Finland, Ireland, Poland, and Spain excluded students with dyslexia; Denmark also excluded students with dyscalculia; Luxembourg excluded recently immigrated students (TR03: 47, 65, 169, 183).

Actual student exclusion rates of the OECD countries varied from 0.7 % to 7.3 %. Canada, Denmark, New Zealand, Spain, and the USA exceeded the 5 % limit. Nevertheless, data from these countries were fully included in all analyses.


UNCERTAINTIES AND BIAS IN <strong>PISA</strong> 247<br />

For a first-order estimate of the impact caused by the unequal use of student exclusions, let us approximate the competence distribution in every single country by a Gaussian with standard deviation 100, and let us assume for a moment that countries exclude with perfect precision the least competent students. Under these assumptions, exclusion of the weakest 0.7 % increases the country’s mean by 2.0 points and reduces its standard deviation by 2.5 points, whereas exclusion of 7.3 % increases the mean by 15.0 and reduces the standard deviation by 12.8. Of course, exclusion criteria are only correlates of potential test achievement, and they are never applied with perfect precision. When a probabilistic cut-off, spread over a range of 100 points, is used to model soft exclusion criteria, the bias in the two countries’ competence mean difference is reduced to about half of the initial 13 points.
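
These figures follow from elementary properties of the truncated normal distribution and are easily checked; a minimal sketch (Python with scipy), assuming the idealised, perfectly selective exclusion described above:

    from scipy.stats import norm

    def exclusion_bias(p, sd=100.0):
        """Mean shift and remaining SD when the weakest fraction p of a
        Gaussian competence distribution with standard deviation sd is cut off."""
        a = norm.ppf(p)                  # truncation point in standard units
        lam = norm.pdf(a) / (1.0 - p)    # mean of the left-truncated normal
        var = 1.0 + a * lam - lam ** 2   # variance of the truncated normal
        return sd * lam, sd * var ** 0.5

    for p in (0.007, 0.073):
        shift, new_sd = exclusion_bias(p)
        print(f"excluding {p:.1%}: mean +{shift:.1f}, SD {new_sd:.1f}")
    # excluding 0.7%: mean +2.0, SD 97.5   (SD reduced by 2.5)
    # excluding 7.3%: mean +15.0, SD 87.2  (SD reduced by 12.8)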

In Germany much public attention has been drawn to the percentage of students in a so-called “risk group”, defined by test scores below an arbitrary threshold. International comparisons of such percentages are particularly unreliable, because they are extremely sensitive to non-uniform exclusion criteria.

2.3 On the fringe of the target population: unequal inclusion of learning-disabled students

The imprecision of exclusion criteria and the resulting bias are further illustrated by the unequal inclusion of students with learning disabilities. Seven countries cater to them in special schools. In these schools the cognitive test was abridged to one hour, and a special booklet with a selection of easy items was used. In all other countries student exclusions were decided case by case; but even in countries that used the special booklets, some learning-disabled students could be individually excluded (cf. Prais 2003: 149, 158).

The extent to which students were either excluded from the test or given the short booklet varies widely among the seven countries. In Austria, 1.6 % of the target population were completely excluded, and 0.9 % of the participating students got the short test. In Hungary, 3.9 % were excluded, and 6.1 % did the short test. Given this discrepancy, it is hardly surprising that Hungarian students who did the short test achieved nearly 200 points more than Austrians.

For another rough estimate of the quantitative impact of unclear exclusion criteria, one can recalculate national means without short tests. If all short tests were excluded from the PISA sample, the mean reading score of Belgium, Denmark, and Germany would increase by more than 7 points; notably, Belgium (1.5 % exclusions, 3.0 % short tests) would even remain within the 5 % limit (TR03: 169). A bias of the order of 7 points is in perfect accord with the estimate from the previous section.

2.4 Sampling problems: inconsistent input

The sampling is technically difficult. Many governments lack consistent databases. Sometimes, this leads to bewildering inconsistencies: In Sweden, 102.5 % of all 15-year-olds are reported to be enrolled in an educational institution; in the Italian region of Tuscany, 107.7 %; in the USA, in spite of a strong homeschooling movement, 100.000 % (TR03: 168, 183).

The sample is drawn in two stages: schools within strata (regions and/or school types), and students within schools. As a consequence of this stratification and of unequal participation rates, not all students are equally representative of the target population. To correct for this, students are assigned statistical weights composed of several factors. The recommended way to calculate these weights is so difficult that international rules foresee three replacement procedures. In Greece, none of the four procedures worked, so that a uniform student weight had to be used (TR03: 52).

2.5 Sampling problems: inconsistent output

In the Austrian sample of PISA 2000, students from vocational schools were underrepresented. As a consequence, average student competences were overestimated, and other statistics were distorted as well. The error was only searched for and found three years later, when the disappointing outcome of PISA 2003 induced the government (which had changed in the meantime) to order an investigation (Neuwirth et al. 2006).

In South Tyrol, a change of government is not in sight, and therefore nobody seems interested in verifying accusations that the excellent PISA results of this region are largely due to the underrepresentation of students from vocational schools (Putz 2006).

In South Korea, only 40.5 % of PISA participants are girls. In the 1980s, due to selective abortion and possibly to hepatitis B, the sex ratio at birth in South Korea had attained a historic low of 47 %, perhaps even 46 %. But even when this is taken into account, girls are still severely underrepresented in the PISA sample. According to the Technical Report, this cannot be explained by unequal enrollment or test compliance: The reported enrollment rate is 99.94 %, the school participation rate 100 %, and the student participation rate 98.81 %. Either these numbers are wrong, or the sampling scheme was inappropriate. This conclusion is also supported by an anomalous distribution of birth months.

2.6 Insufficient response rates

Rules required a school response rate of 85 %, within-school student response rates of 25 %, and a country-wide student response rate of 80 % (TR03: 48-50). The United Kingdom breached more than one criterion, which led to its superficial disqualification. Canada profited from a strange rule according to which initial response rates between 65 % and 85 % could be cured by negotiation if the 85 % quorum was not even reached after calling replacement schools (TR03: 238). With 64.9 %, the USA missed the non-negotiable initial condition, though by a narrow margin, and the response from replacement schools was overwhelmingly negative, bringing the participation rate to no more than 68.1 %. Nevertheless, US data were fully included in all analyses (note: the USA contributes 25 % of the OECD’s budget).

Non-response can cause considerable bias because the propensity of school principals and students to partake in the testing is likely to be correlated with the potential outcome. Quantitative estimates are difficult because the international database contains no information whatsoever about those who refused the test. Nevertheless, there is ample indirect evidence that the correlation is quite high. To cite just one example: In Germany schools with a student response of 100 % had a mean math score of 553. Schools with participation below 90 % achieved only 476 points. Even if the latter number is subject to some uncertainty (discussed at length in Wuttke 2007), the strong correlation between student ability and test compliance is beyond any doubt.

In the official analysis, statistical weights provide a first-order correction for the between-school variation of response rates: When schools refuse to participate, the weight of other schools from the same stratum is increased accordingly. Similarly, in schools with low student response rates, the participating students are given higher weights.

However, these corrections do not cure within-school correlations between students’ latent abilities and their propensity to partake in the test. In the absence of data from absent students, the possible bias can only roughly be estimated: In some countries, the student response rate is more than 15 % lower than in others. Assuming very conservatively that the latent ability of the missing students is only half a standard deviation below the true national average, one finds that the absence of these students increases the measured national average by 8.8 points.
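
The arithmetic behind this estimate is worth spelling out. If a fraction f of the students is missing and their mean latent ability lies Δ points below the true national mean M, the observed mean of the participants satisfies M = (1 − f) M_obs + f (M − Δ), so that under the stated assumptions (f = 0.15, Δ = 50)

    \[
    M_{\mathrm{obs}} - M \;=\; \frac{f\,\Delta}{1-f}
      \;=\; \frac{0.15 \times 50}{0.85} \;\approx\; 8.8 .
    \]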

2.7 Gender-dependent response rates

In many countries, girls are overrepresented in the PISA sample. The discrepancy is largest in France, with 52.6 % girls in PISA against an estimated 48.9 % among 15-year-olds: Compared to the age cohort, the PISA sample has more than 7 % too many girls and more than 7 % too few boys. Insofar as this is due to different enrollment, it reinforces the argument of Sect. 2.1. Otherwise, the most likely explanation is a gender-dependent propensity to participate in the testing.

2.8 Doubts about data transmission: missing missing responses

Normally, some students do not respond to all questions of the background questionnaire. Moreover, some students leave between the cognitive test and the questionnaire session. In Poland, however, such missing data are missing: Not a single student responded to fewer than 25 questionnaire items, and there are 7 items that every single student answered. Unless this anomaly is explained otherwise, one must suspect that booklets with missing data have been suppressed.

3 Ignored dimensions of the cognitive test

PISA’s “competence” scale depends on the assumption that all items from one subject domain measure essentially one and the same latent ability. In reality, any test outcome is also influenced by factors that cannot be subsumed under a subject-specific competence. While there is no generally accepted way to indicate the degree of multi-dimensionality of a test (Hattie 1985), simple first-order estimates are sufficient to demonstrate its impact: Non-competence dimensions cause arbitrariness, uncertainty, and bias in PISA’s competence measure that are by no means negligible when compared to the purely stochastic official standard errors.

3.1 Elimination of disturbing items

The evidence for multidimensionality to be presented in the following sections is even more striking against the background that the cognitive items actually used in PISA have been preselected for unidimensionality: Submissions from participating countries were streamlined by “professional item writers”, reviewed by national “subject matter experts”, tested with students in think-aloud interviews, tested in a pre-pilot study in a few countries, tested in a field trial in most participant countries, rated by expert groups, and selected by the consortium (TR03: 20-30).

Only one-third of the items that had reached the field trial were finally used in the main test. Items that did not fit into the idea that competence can be measured in a culturally neutral way on a one-dimensional scale were simply eliminated. Field test results remain unpublished, although one could imagine an open-ended analysis providing valuable insight into the diversity of education outcomes. This adds to Olsen’s (2005a: 5) observation that in PISA-like studies the major portion of information is thrown away.

However, the strong preselection did not prevent seriously flawed items from being used in the main test: In the analysis of PISA 2000, the item “Continent Area Q1” had to be disqualified, in 2003 “Room Numbers Q1”. Furthermore, some items had to be disqualified in specific countries.

3.2 Unfounded models

In PISA a probabilistic psychological model is used to calibrate item difficulties and to estimate student competences. This model, named after Georg Rasch, is the most elementary incarnation of item response theory. It assumes that the probability of a correct response depends only on the difference between the student’s competence value and the item’s difficulty value. Mislevy (1993) calls this attempt to “explain problem-solving ability in terms of a single, continuous variable” a “caricature”, rooted in “19th century psychology”. The model does not even admit the possibility that some items are easier in one subpopulation than in another. The reason for its usage in PISA is neither theoretical nor empirical, but rather pragmatic: Only one-dimensional models yield unambiguous rankings.
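
For reference, the Rasch model can be stated in one line: the probability that student v solves item i is a logistic function of the difference between the student’s competence θ_v and the item’s difficulty β_i,

    \[
    P(X_{vi}=1 \mid \theta_v, \beta_i)
      \;=\; \frac{\exp(\theta_v - \beta_i)}{1 + \exp(\theta_v - \beta_i)} ,
    \]

so that all item characteristic curves are copies of one and the same curve, merely shifted horizontally.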

Taking the Rasch model literally, there is no way to estimate the competence of students who solved all items or none. To them the test has been too easy or too difficult, respectively. In PISA, this problem is circumvented by enhancing the probability of intermediate competences through a Bayesian prior, arbitrarily assumed to be a Gaussian. As distributions of psychometric measures are never Gaussian (Micceri 1989), this inappropriate prior causes bias in the competence estimates (Molenaar in Fischer/Molenaar 1995: 48), especially at extreme values (Woods/Thissen 2006). This further undermines statements about “risk groups” with particularly low competence values.

3.3 Failure of the Rasch model

Various mathematical criteria have been developed to assist in the decision whether or not the Rasch model reasonably approximates an empirical data set. It appears that only one of them has been used to check the outcome of the PISA main test: an unexplained “item infit mean square” (TR03: 123, 278).

A much more sensitive way to test the goodness of fit is a visual inspection of appropriate plots (Hambleton et al. 1991: 66). An “item characteristic” or “score curve” is a plot of correct-response percentages as a function of competence values, each data point representing a quantile of examinees. In the Technical Report (TR03: 127), one single item characteristic is shown – an atypical one that agrees rather well with the Rasch model.

According to the model, all item characteristics from one subject domain should have strictly the same shape; the only degree of freedom is a horizontal shift, driven by the model’s only item parameter, the difficulty. This is clearly inconsistent with the variety of shapes exhibited by the four item characteristics in Figure 2. Whereas “Water Q3b” discriminates quite well between more or less “competent” students, the other three items have deficiencies that cannot be described without additional parameters.

The characteristic of “Chair Lift Q1” has almost a plateau at low competence values. This is the typical signature of guessing. On the other hand, “Freezer Q1” saturates at less than 35 %. This indicates that many students did not find out the intention of the testers. Low discrimination strengths as in “South Rainea Q2” may have several reasons: different difficulties in different subpopulations, different difficulties for different solution strategies (cf. Meyerhöfer 2004), qualified guessing, weak correlation of the latent ability measured here and in the majority of this domain’s items.

Figure 2: Some item characteristics that show pronounced deviations from the Rasch model. Solid curves in (a) are fits with a two-parameter model that accounts for different discrimination. The four-parameter fits in (b) additionally model guessing and misunderstanding.

The solid lines in Fig. 2 show that satisfactory fits of the empirical data are possible when the Rasch model is extended by parameters that allow for variable discrimination strength, for guessing, and for misunderstanding. Such multi-parameter item-response models still contain a linear shift parameter that may be interpreted as the item difficulty. However, best-fit estimates of this parameter deviate by typically 30 points from the official Rasch difficulties (Wuttke 2007: Fig. 11). This model dependence of item difficulty estimates is not compatible with a one-dimensional ranking of items as is needed for the construction of “proficiency levels” (Sect. 4.1). Furthermore, as soon as one admits more than one item parameter, any student ranking becomes arbitrary because of the ad-hoc anchoring of the difficulty and competence scales.
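
A minimal sketch of such an extended curve family (Python; the parameter names a, b, c, d are the conventional ones of the four-parameter logistic model, not taken from this chapter) shows how discrimination, guessing, and misunderstanding enter; the Rasch curve is recovered for a = 1, c = 0, d = 1:

    import numpy as np

    def icc_4pl(theta, b, a=1.0, c=0.0, d=1.0):
        """Four-parameter logistic item characteristic curve.
        b: difficulty (horizontal shift), a: discrimination (slope),
        c: lower asymptote (guessing), d: upper asymptote < 1
        (misunderstanding: even very competent students often fail)."""
        return c + (d - c) / (1.0 + np.exp(-a * (theta - b)))

    theta = np.linspace(-3.0, 3.0, 7)
    print(icc_4pl(theta, b=0.0))                 # Rasch special case
    print(icc_4pl(theta, b=0.0, a=0.6, c=0.25))  # flat curve, guessing plateau
    print(icc_4pl(theta, b=0.0, d=0.35))         # saturates below 35 %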

The first data point of the characteristics of “South Rainea” and “Chair Lift” clearly lies below the fit curves: the weakest 4 % of participants perform worse than modeled. This may be due to a lack of cooperation: yet another dimension that is not contained in elementary item-response theory. It may also be due to the inappropriateness of the Gaussian population model.

3.4 Between-booklet variance

The use of different test booklets makes it possible to employ a total of 165 different items, though every single student works on no more than 60 of them. This reduces the dependence of test results on the arbitrary choice of items. At the same time, it allows us to get an idea of how strong this dependence actually is. Calculating mathematics competence means for groups of students who have worked on the same booklet, one finds inter-booklet standard deviations between 4 (Hungary) and 18 (Mexico) points. The largest difference occurs in the USA: Students who worked on booklet 2 were estimated to have a math competence of 444, whereas those who worked on booklet 10 achieved 512 points. Eliminating either booklet 2 or booklet 10 would respectively increase or decrease the overall national mean by about three points. This variance only reflects the arbitrariness in choosing items from a pool that is already quite homogeneous due to the procedures described above (Sect. 3.1). Cultural bias in the submission, selection, and adaptation of items may have a far stronger impact.
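
The underlying computation is straightforward; a sketch, assuming a per-country student table with hypothetical columns booklet and math (the official database actually stores several plausible values per domain, which a careful reanalysis would have to respect):

    import pandas as pd

    def booklet_spread(df: pd.DataFrame):
        """Mean math score per test booklet and the between-booklet
        standard deviation, for the students of one country."""
        means = df.groupby("booklet")["math"].mean()
        return means, means.std(ddof=1)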

3.5 Imputation with wrong normalisation

Each of the thirteen regular booklets consists of four blocks. Each item appears in four different blocks, in four different positions, in four different booklets. The major subject domain, mathematics, is covered by seven of the thirteen blocks; the other three domains are tested in two blocks each.

While all thirteen booklets contain at least one mathematics block, each minor domain appears only in seven booklets. Nevertheless, in the scaled data all students are attributed competence values in all four domains. If a student has not been tested in a domain, the competence estimate is based on both his questionnaire responses and his school’s average math achievement. Such an imputation, when done correctly, reduces the standard error of population means without introducing bias.

In PISA, however, it is not done correctly. Bias is introduced because the imputation is anchored in only one of the seven booklets for which real data are available. This bias is plainly admitted in the Technical Report (TR03: 211), though it is quantified only for Canada. The case of Greece is more extreme: The official science competence mean of 481 is 16 points above the average achievement of those students who were actually tested in science (Wuttke 2007: Sect. 3.10; cf. Neuwirth in Neuwirth et al. 2006: 53). This huge bias is certainly not justified by the benefits of imputation, which consist of a slight simplification of the secondary data structure and a reduction of stochastic standard errors by probably no more than 10 %.

3.6 Timing, tactics, fatigue

Since every item occurs in four different positions, one can easily investigate how response rates vary during the two-hour testing session: Per-block response rates, averaged across booklets over all items, can be directly compared to each other.

One finds that the average rates of non-reached items, of missing responses, and of incorrect responses systematically increase from block to block. The extent of this increase varies considerably between countries. The ratio of non-reached items in the fourth block is 1 % in the Netherlands, while in Mexico it is 25.3 %. In the Netherlands the ratio of items that were reached but not answered goes up from 2.5 % in the first block to 4.0 % in the fourth block; in Greece, from 11.1 % to 24.4 %. In Austria, the ratio of correct to given responses decreases from 56.2 % in the first block to 54.4 % in the fourth block; in Iceland, from 58.5 % to 53.1 %.

All these data indicate that students lack a sufficient amount of time in the last of the four blocks. This alone is a strong argument against the applicability of one-dimensional item response theory (Rost 2004: 43). The ways students react to the lack of time vary considerably between countries:

– Dutch students try to answer almost every item. Towards the end of the test, they become hasty and increasingly resort to guessing.

– Austrian and German students skip many items, and they do so from the first block on, which leaves them enough time to finish the test without greatly accelerating their pace.

– Greek students, in contrast, seem to be taken by surprise by the time pressure near the end. In the first block, their correct-response rate is better than in Portugal and not far from those of the USA and Italy. In the last block, however, non-reached items and missing responses add up to 35 %, bringing Greece down to one of the last ranks.

Aside from such extreme cases, it is hardly possible to disentangle the effects of test-taking tactics and fatigue.

3.7 Multiple responses to multiple-choice items

In PISA 2003, 42 of 165 items are in a simple multiple-choice format. For each of these items, four or five responses are proposed, of which exactly one is meant to be the correct one. This essential rule is not clearly explained to the examinees. In some countries, for some items, a considerable number of multiple responses are given. They are denoted by a special code in the international database, but they are subsequently counted as incorrect.

In many countries, including Australia, Canada, Japan, Mexico, the Netherlands, New Zealand, and the USA, the quota of multiple responses is close to 0 % (except for one particularly flawed item). In Austria, Germany, and Luxembourg, on the other hand, the fraction of multiple responses surpasses 4 % for at least eleven items, and it reaches up to 10 % for one of them.

Such a misunderstanding of the test format not only distorts the outcome of the directly concerned item. It also costs time: it requires more effort to decide four or five times whether or not a proposed answer is correct than to choose only one alternative. Those who are familiar with the multiple-choice format sometimes do not even need to read all distractors.


3.8 Testing cultural background

If one wants to understand what a test actually measures, one has to study the manifold reasons why students give incorrect responses (cf. Kohn 2000: 11). The few student solutions of open-ended items published by Luxembourg show how much information is lost when verbal or pictorial responses are digitally coded.

                 A        B        C        D
    Slovakia   3.1 %   46.1 %   17.5 %   33.3 %
    Sweden     3.1 %   46.2 %   37.0 %   13.7 %

Table 1: Percentages for the four possible responses of the multiple-choice item “Optician Q1”. Data are shown for two countries where almost the same percentage of students chooses the correct response B. However, preferences for the distractors C and D vary by about 20 %.

In contrast, in the digital coding of multiple-choice items, most information is preserved; the codes for formally valid but incorrect responses indicate which of the three distractors was chosen. Table 1 shows the response percentages for one item and two countries. In this example distractor preferences vary by about 20 %, although the correct-response percentage is almost the same. This demonstrates quantitatively that the reasons that induce students to give a specific incorrect answer can vary enormously from country to country.

It is fairly obvious that the offer of distractors also influences correct-response rates. Had distractor D been more in the spirit of C, it would have attracted additional responses in Sweden, whereas in Slovakia many students would have reoriented their choice towards B.

Between-country variance may be due to school curricula, cultural background, test language, or to a combination of several factors. These factors are particularly influential in PISA because students have little time (about 2'20" per item), and reading texts are too long. Sometimes the stimulus material even tricks students into misclues (Ruddock et al. 2006). In this situation, test-wise students try to solve items without actually reading the introductory texts. Such qualified guessing is of course highly dependent on extrinsic knowledge and therefore particularly susceptible to cultural bias.

The released reading unit “Flu” from PISA 2000 provides a nice example. The stimulus material is an information sheet about a flu vaccination. One of the items asks how the vaccination compares to alternative or complementary means of protection. Of course, students are not asked about their personal opinion; the answer is to be sought in the reading text. Nevertheless, the distractor preferences reflect French reliance on technology and German belief in nature.

3.9 Language-related problems

The language influences the test in several ways:

Translations are prone to errors. In PISA, a complicated scheme with double translation from English and French was foreseen to minimise such errors. However, in many cases, including the German-speaking countries, the French original was not taken seriously, and final versions were produced under extreme time pressure. There are clear-cut translation errors in the released sample items. In the unit entitled “Daylight”, the English word “hemisphere” was translated by the erudite “Hemisphäre” where German schoolbooks use the word “Erdhälfte”. In the unit “Farms”, “attic floor” was rendered as “Dachboden”, which just means “attic”. The fact that the Austrian version has the correct wording “Boden des Dachgeschosses”, although all German-speaking countries had shared the translation work, indicates that uncoordinated and unchecked last-minute modifications have been made.

Blum and Guérin-Pace (2000: 113) report that changing a question (“Quels taux . . . ?”) into a prompt (“Énumérez tous les taux . . . ”) can change the rate of correct responses by 31 %. This gives an idea of how much freedom translators have either to help or confuse (cf. Freudenthal 1975: 172; Olsen et al. 2001).

Under translation, texts tend to become longer, and some languages are more concise than others. In PISA 2000, a comparison of the English and French versions of 60 stimulus texts showed that the French texts contained on average 12 % more words and 19 % more letters (TR00: 64). Of course, reading time is not simply proportional to the number of words or letters. It seems nevertheless plausible that such a huge length difference induces an important bias.

3.10 Origin of test items

A majority of test items comes from English-speaking countries; the other items were translated into English before they were streamlined by “professional item writers”. If there is cultural bias, it is clearly in favour of the English-speaking countries. This makes it difficult to separate it from the translation bias, which acts in the same direction.

The quantitative importance of cultural and/or linguistic bias can be read off from the correlation of correct-response-percentage-per-item vectors, as has been shown by Zabulionis (2001, for TIMSS), Rocher (2003), Olsen (2005), and Wuttke (2007). Cluster analyses invariably show that student behaviour is most similar for countries that share both language and cultural heritage, such as Australia and New Zealand (correlation coefficient 0.98). If the languages differ, correlations are at best about 0.96, as for the Czech and Slovak Republics. If the languages do not belong to the same stem, correlations are hardly larger than 0.94. While some countries belong to large clusters, others like Japan and Korea are quite isolated (no correlation larger than 0.90). These results have immediate implications for the validity of inter-country comparisons: The weaker the correlation of response patterns, the more a comparison depends on the arbitrary choice of items.
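
A sketch of this type of analysis, assuming a matrix P of correct-response percentages with one row per country and one column per item (a hypothetical input layout; the cited studies differ in their details):

    import numpy as np

    def country_similarity(P, names):
        """Pairwise correlations of the per-item correct-response vectors;
        returns country pairs sorted from most to least similar."""
        R = np.corrcoef(P)           # one correlation per pair of countries
        pairs = [(names[i], names[j], R[i, j])
                 for i in range(len(names)) for j in range(i + 1, len(names))]
        return sorted(pairs, key=lambda t: -t[2])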

4 Interpreting cognitive test results

4.1 Proficiency levels

Verbal descriptions of “proficiency levels” are used to guide the interpretation of numeric results (FR03: 46-56). The boundaries of these levels are arbitrarily chosen; nevertheless, they are communicated with absurd four-digit precision. Starting at a competence of 358.3, there are six proficiency levels. The width of levels 1 to 5 is about 62.1; the semi-infinite level 6 starts at 668.7. Depending on how many students gave the right response, each item is assigned to one of these levels. Based on all items assigned to one level, a verbal synthesis is given of what students with corresponding competence values “can typically do”.

By construction, the student competence distribution is approximately Gaussian. The mean of 500 and the standard deviation of 100 are imposed by an explicit (though ill-documented) renormalisation. Therefore, the percentages of students in the different proficiency levels are almost constant.

To illustrate this point, let us perform a Gedanken experiment. If the percentage of correct responses given by a single student grows by 6 %, his competence value increases by about 30 points. Suppose now that the correct-response rate grows by 6 % for all students. In this case, the competence values assigned to the students will not increase, because any uniform change of competences is immediately reversed by the renormalisation to the predefined Gaussian. Instead, the item difficulty values would be lowered by about 30 points, so that about every second item would be relegated to the next lower proficiency level. Theoretically, this should then lead to a rephrasing of the proficiency level descriptions.
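
The decisive step of this Gedanken experiment is easily verified numerically; a minimal sketch (the official scaling procedure is more involved, but it shares this invariance under uniform shifts):

    import numpy as np

    def renormalise(theta):
        """Map latent values onto the reporting scale: mean 500, SD 100."""
        return 500.0 + 100.0 * (theta - theta.mean()) / theta.std()

    rng = np.random.default_rng(0)
    theta = rng.normal(size=100_000)    # latent competences, arbitrary units
    gain = theta + 0.3                  # every student improves uniformly
    print(np.allclose(renormalise(theta), renormalise(gain)))   # True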

However, these descriptions are highly systematic. They are so systematic that they could have been derived straight from Bloom’s forty-year-old taxonomy. They are far too systematic to appear like a summary of empirical results: One would expect that not every single item fits equally well into such a scheme, but the level descriptions do not show the least irritation. As Meyerhöfer (2004) has pointed out, the very idea of proficiency levels is not consistent with the fact that test items can be solved in quite different ways, depending for instance on curricular premises, on testwiseness, and on time pressure. Therefore, the most likely outcome of our Gedanken experiment seems to be that the official level descriptions would not change at all, so that the overall increase in student achievement would pass unnoticed – as has the misfit of the Rasch model and the resulting bias and uncertainty of about 30 difficulty points.

Another fundamental objection is the lack of transparency. The proficiency level descriptions are not open to scientific discussion unless the consortium publishes the instruments on which they are based and the proceedings of the hermeneutic sessions in which the descriptions have been worked out.

In the German reports, students in and below proficiency level 1 are called “the risk group”. This deviates from the international reports that speak of “risk” only in connection with students below level 1. It has become an urban legend in Germany that nearly one quarter of all fifteen-year-olds are almost functionally illiterate, although the original report clearly states that PISA does not bother to measure fluency of reading, which is taken for granted even on level 1 (FR00: 47-48). Furthermore, as has been stressed above, the percentage of students on or below level 1 is extremely sensitive to disparities in sampling and participation.

4.2 Is PISA an intelligence test?

PISA items from different domains are quite similar in style – and sometimes even in content: Reading items are based on nontextual stimulus material such as graphics or tables, and math or science items require a lot of reading. This is intentional insofar as it reflects a certain conception of “literacy”.

It is therefore unsurprising that competence values from different domains are highly correlated. A majority of per-country inter-domain correlations is stronger than 0.80.

In such a situation, the sensible thing to do is a principal component analysis. One finds that between 75 % (Greece) and 92 % (Netherlands) of the total variance of student competences can be attributed to just one component.
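
Such an analysis requires only a few lines; a sketch, assuming an array X with one row per student and one column per domain score (hypothetical stand-ins for the scaled data):

    import numpy as np

    def first_component_share(X):
        """Fraction of the total variance of the domain scores that is
        carried by the first principal component."""
        eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
        return eigvals[-1] / eigvals.sum()   # eigvalsh sorts ascending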

However, no such analysis has been published by the consortium, and when Rindermann (2006) did publish one, members of PISA Germany tried to dismiss and even to ridicule it. The ideological and strategic reasons for this opposition are obvious: Once it is found that PISA mainly measures one general factor per examinee, it is hard not to make a connection to the g factor of cognitive psychology. PISA members, who avoid the term “intelligence” throughout their writings, must see this as a sacrilege and a threat. The word is taboo in much of the pedagogical mainstream, and no government would spend millions to be informed about the intelligence of students.

4.3 Uncontrolled variables

PISA aims at monitoring “outcomes of education systems”. However, the education system is just one of many variables that influence the outcome of the cognitive test. As we have seen, sampling, exclusions, participation rates, test-taking habits, culture, and language are quantitatively important. Since all these variables are country-dependent, there is no way to separate them from the variable “education system”.

But even in the hypothetical case of a technically and culturally fair test, it would not be clear that differences in test outcome are due to differences in education systems. There are certainly country-dependent educational influences that are not part of what is generally understood by “education system”, such as the subtitled TV programs prevalent in small language communities. Furthermore, equating test achievement with the outcome of schooling is highly ideological in that it dismisses differences in genetic endowment, preschool education, and out-of-school environment.

The importance of extrinsic parameters becomes obvious when subpopulations are compared that share the same education system. One example is the two language communities in Finland. In the major domain of PISA 2000, reading, students in Finnish-speaking schools achieve 548 points, in Swedish-speaking schools only 513 – slightly less than Sweden’s national average of 516 (Wuttke 2007: Sect. 4.8). A national report (Brunell 2004) suggests that much of the difference between the two communities can be explained by two factors, namely by the language spoken at home and by the social, economic, and cultural background.

If student-dependent background variables have such a huge impact in an otherwise comparatively homogeneous country like Finland, they can even more severely distort international comparisons. As several authors have already noted, one of the most important background variables is the language spoken at home. Except in a few bilingual regions, a non-test language spoken at home is typically linked to immigration. The immigration status is accessible through the questionnaire, which asks for the country of birth of the student and his parents. Excluding first- and second-generation immigrant students from the national averages considerably reshuffles the country league tables: On top of the 2003 mathematics league table, Finland is replaced by the Netherlands and Belgium, and it is closely followed by Switzerland. The superiority of the Finnish school system, one of the most publicised “results” of PISA, vanishes as soon as one single background variable is controlled.

5 Conclusions

One line of defense offered by PISA proponents reads: PISA is state-of-the-art; at present nobody can do it better. This is probably true. If there were one outstanding source of bias, one could hope to improve PISA by fighting this specific problem. However, it rather appears that there is a plethora of inaccuracies of similar magnitude. Reducing a few of them will have very little effect on the overall uncertainty. Therefore, one has to live with the unsatisfactory state of the art and draw the right consequences.

Firstly, the outcome of PISA must be reassessed. The official significance criteria, based only on stochastic errors, are irrelevant and misleading. The accuracy of country rankings is largely overestimated. Statistics are particularly distorted if they depend on response rates among weak students; statements about “risk groups” are untenable.

Secondly, the large sample sizes of PISA are uneconomic. Since the accuracy of the study is determined by other factors, the effort currently invested in minimising stochastic errors is unjustified.

Thirdly, it is clear from the outset that little can be learned when something as complex as a school system is characterised by something as simple as the average number of solved test items.


References

Blum, A./Guérin-Pace, F. (2000): De Lettres et des Chiffres. Des tests d’intelligence à l’évaluation du “savoir lire”, un siècle de polémiques. Paris: Fayard.

Bottani, N./Vrignaud, P. (2005): La France et les évaluations internationales. Rapport établi à la demande du Haut Conseil de l’évaluation de l’école. http://lesrapports.ladocumentationfrancaise.fr/BRP/054000359/0000.pdf.

Brunell, V. (2004): Utmärkta PISA-resultat också i Svenskfinland. Pedagogiska Forskningsinstitutet, Jyväskylä Universitet. http://ktl.jyu.fi/pisa/Langt_pressmeddelande.pdf.

Fischer, G. H./Molenaar, I. W. (1995): Rasch Models. Foundations, Recent Developments, and Applications. New York: Springer.

Freudenthal, H. (1975): Pupils’ achievements internationally compared – the IEA. In: Educ. Stud. Math. 6, 127-186.

FR00: OECD, ed. (2001): Knowledge and Skills for Life. First Results from the OECD Programme for International Student Assessment (PISA) 2000. Paris: OECD.

FR03: OECD, ed. (2004): Learning for Tomorrow’s World. First Results from PISA 2003. Paris: OECD.

Hambleton, R. K./Swaminathan, H./Rogers, H. J. (1991): Fundamentals of Item Response Theory. Newbury Park: Sage.

Hattie, J. (1985): Methodology Review: Assessing Unidimensionality of Tests and Items. In: Appl. Psych. Meas. 9 (2) 139-164.

Kohn, A. (2000): The Case Against Standardized Testing. Raising the Scores, Ruining the Schools. Portsmouth NH: Heinemann.

Meyerhöfer, W. (2004): Zum Problem des Ratens bei PISA. In: J. Math.-did. 25 (1) 62-69.

Micceri, T. (1989): The Unicorn, the Normal Curve, and other Improbable Creatures. In: Psychol. Bull. 105 (1) 156-166.

Mislevy, R. J. (1993): Foundations of a New Test Theory. In: Frederiksen, N./Mislevy, R. J./Bejar, I. I., eds.: Test Theory for a New Generation of Tests. Hillsdale: Lawrence Erlbaum.

Neuwirth, E./Ponocny, I./Grossmann, W., eds. (2006): PISA 2000 und PISA 2003: Vertiefende Analysen und Beiträge zur Methodik. Graz: Leykam.

Olsen, R. V./Turmo, A./Lie, S. (2001): Learning about students’ knowledge and thinking in science through large-scale quantitative studies. In: Eur. J. Psychol. Educ. 16 (3) 403-420.

Olsen, R. V. (2005a): Achievement tests from an item perspective. An exploration of single item data from the PISA and TIMSS studies, and how such data can inform us about students’ knowledge and thinking in science. Dissertation, Universität Oslo.

Olsen, R. V. (2005b): An exploration of cluster structure in scientific literacy in PISA: Evidence for a Nordic dimension? In: NorDiNa 1 (1) 81-94.

Prenzel, M. et al. [PISA-Konsortium Deutschland], eds. (2004): PISA 2003. Der Bildungsstand der Jugendlichen in Deutschland – Ergebnisse des zweiten internationalen Vergleichs. Münster: Waxmann.

Prenzel, M. et al. [PISA-Konsortium Deutschland], eds. (2005): PISA 2003. Der zweite Vergleich der Länder in Deutschland – Was wissen und können Jugendliche. Münster: Waxmann.

Putz, M. (2006): PISA: Zu schön um wahr zu sein? Liegt das Traumergebnis an Rechenfehlern? Unpublished.

Rindermann, H. (2006): Was messen internationale Schulleistungsstudien? Schulleistungen, Schülerfähigkeiten, kognitive Fähigkeiten, Wissen oder allgemeine Intelligenz? In: Psychol. Rundsch. 57 (2) 69-86. See also comments and reply in vol. 58 (2).

Rocher, T. (2003): La méthodologie des évaluations internationales de compétences. In: Psychologie et Psychométrie 24 (2-3) [Numéro spécial: Mesure et Éducation], 117-146.

Rost, J. (2004): Lehrbuch Testtheorie – Testkonstruktion. 2nd edition. Bern: Hans Huber.

TR00: Adams, R./Wu, M., eds. (2002): PISA 2000 Technical Report. Paris: OECD.

TR03: OECD, ed. (2005): PISA 2003 Technical Report. Paris: OECD.

Woods, C. M./Thissen, D. (2006): Item Response Theory with Estimation of the Latent Population Distribution Using Spline-Based Densities. In: Psychometrika 71 (2) 281-301.

Wuttke, J. (2007): Die Insignifikanz signifikanter Unterschiede: Der Genauigkeitsanspruch von PISA ist illusorisch. In: Jahnke, T./Meyerhöfer, W., eds.: Pisa & Co. Kritik eines Programms. 2nd edition [note: my contribution to the 1st edition is outdated]. Hildesheim: Franzbecker.

Zabulionis, A. (2001): Similarity of Mathematics and Science Achievement of Various Nations. In: Educ. Policy Analysis Arch. 9 (33).


Large-Scale International Comparative Achievement Studies in Education: Their Primary Purposes and Beyond

Rolf V. Olsen

Norway: University of Oslo

Abstract:

This chapter argues that PISA is more than a driver for policy decisions in many countries. The study also provides unique data with the potential to engage educational researchers across the world in conducting a range of secondary analyses. The first section of the chapter describes how the primary purpose of such studies in general has gradually evolved. This description reflects how the studies have typically related to educational research. This section of the chapter is used as the general background for the second and major section, which presents a rationale for why educational researchers could or should be motivated to engage in analytical work relating to these studies. This is followed up by a provisional framework for how educational researchers may approach and make use of the data from these studies in secondary analyses. This framework is based on six generic analytical approaches derived from the study of a large number of examples of published secondary analyses.

Introduction

The overall purpose of this article is to argue that PISA, like a range of other studies often referred to as large-scale international comparative achievement studies in education (LINCAS) (Bos, 2002), is not only an important driver for policy decisions in many countries, but also provides unique data with the potential to engage educational researchers across the world in conducting a range of secondary analyses. The first part of the chapter describes how the primary purpose of such studies in general has gradually evolved. This description reflects how the studies have typically related to educational research. This section of the chapter is used as a general background for the second and major section, which provides arguments for why educational researchers could or should be motivated to engage in work relating to these studies. Furthermore, this section presents how educational researchers may approach and make use of the data from these studies in secondary analyses by suggesting six generic analytical approaches. The six suggested generic approaches probably do not make up an exhaustive list of possibilities for secondary analytical work. Instead, they should, when taken together, be regarded as a provisional framework to be used as a starting point for a more comprehensive and systematic review of the available literature presenting secondary analyses relating to the PISA study.

Even though PISA is the main case to be discussed in this book, the theme of this chapter is of a more overarching and general nature. Many of the references made throughout the chapter to specific secondary analytical work will therefore be to work relating to other studies, and particularly to TIMSS (Trends in International Mathematics and Science Study), since this study has been around for a much longer time. Furthermore, given the author’s background as a researcher in science education, a majority of the examples will be related to this subject. However, the discussion offered is not subject-specific, and the arguments are thus equally relevant for other international studies as well as for other subject areas.

Part I: The primary purposes of the comparative studies

In order to start describing the main features of LINCAS, it is relevant to note that they include one or several measures of achievement, either in specific school subjects or in more overarching competencies transcending the traditional borders set up by school subjects. Furthermore, these measures have been developed under the requirement that they should allow for meaningful international comparison. In addition, an essential design component in the studies is that differences between countries can be studied as effects of contextual factors. It is also important to underscore the fact that these studies are large-scale, which implies that the aim of these studies is to find measures which can be generalised to schools and educational systems. In order to obtain reliable measures which can be generalised to the system level, rigorous procedures for sampling a large number of schools/classes and students must be employed.

Although the aims of, the use of data from, the organisation of, and the methodology applied by these studies have developed gradually (Porter & Gamoran, 2002), there have been two distinctly different and partly competing overall visions underlying the studies. They were first conceived as a specific design or method for conducting research into education with a cross-country comparative perspective. This initial idea will be labelled Purpose I – the research purpose. Gradually, the focus has shifted, influenced by the increasing attention of policymakers towards monitoring the outcomes of educational systems and the study of possible determinants of such outcomes. This rationale for the comparative studies will be labelled Purpose II – the effective policy purpose.

The labels Purpose I and II are only suggested as being useful heuristic devices for understanding some of the ideological tensions with which these studies have to live.² However, using this dichotomy does not suggest that the research purpose and the effective policy purpose are incompatible. On the contrary, I will offer the perspective that the studies may be considered as arenas where researchers in education and educational policymakers can exchange ideas, developing in turn mutual interest in and acceptance of each other’s engagement in educational issues on both the national and international levels.

² These labels are inspired by the way Roberts (2007) uses the terms Vision I and Vision II in his review of the concept of scientific literacy.

Purpose I: The research purpose

Today the label ‘comparative studies in education’ refers to various types of research, ranging from issues of the more philosophical and methodological aspects of comparing across cultures to very specific studies of narrowly defined aspects of education across countries, regions or classrooms. This label also covers studies with a great variety of designs and scales, and in general it is fair to say that the label ‘comparative studies’ is used with different meanings, as there is no generally accepted definition of the term (see for instance Alexander, Broadfoot, & Phillips, 1999; Alexander, Osborn, & Phillips, 2000; Carnoy, 2006). The idea of the large-scale comparative studies receiving focus here was created and defined as a research agenda with the establishment of the IEA – the International Association for the Evaluation of Educational Achievement – in 1961 under the auspices of the UNESCO Institute for Education (Husén & Tuijnman, 1994; Keeves, 1992). The fundamental idea of the founders of the IEA is very clearly expressed by one of the pioneers, Torsten Husén (1973):

We, the researchers who . . . decided <strong>to</strong> cooperate in developing internationally valid<br />

evaluation instruments, conceived of the world as one big educational labora<strong>to</strong>ry<br />

where a great variety of practices in terms of school structure and curriculum were<br />

tried out. We simply wanted <strong>to</strong> take advantage of the international variability with<br />

regard both <strong>to</strong> the outcomes of the educational systems and the fac<strong>to</strong>rs which caused<br />

differences in those outcomes. (p. 10)<br />

The term “laboratory” in this quote is used only as a metaphor, since laboratory conditions with controlled experiments are, taken literally, hardly feasible in educational research due to both practical and ethical considerations. The alternative to the experiment would therefore be survey designs in which the variables of interest could be studied under a great variety of different conditions. In this way “differences between education systems would provide the opportunity to examine the impact of different variables on educational outcome” (Bos, 2002, p. 5). “Thus the studies were envisaged as having a research perspective . . . , as well as policy implications” (Kellaghan & Greaney, 2001, p. 92). The assumption is, in other words, that educational organisation and practice affect educational opportunities and outcomes, and that this can be the subject of empirical research with the following aim:

. . . go beyond the purely descriptive identification of salient factors which account for cross-national differences and to explain how they operate. Thus the ambition has been the one prevalent in the social sciences in general, that is to say, to explain and predict, and to arrive at generalizations. (Husén, 1973, pp. 10-11)

The two quotes above taken from Husén should be seen as typical of the time and of the prevailing optimism regarding how the social sciences could contribute to a better understanding of the causal relationships between different types of factors in society, a vision that in retrospect is often referred to as “social engineering”. The importance of the quotes in this context is, however, to identify the fact that the studies originally came from researchers in education who aimed to use them in order to find answers to what they saw as important research questions. Furthermore, they considered that an international comparative design gave particularly good opportunities for answering such questions.

Purpose II: The effective policy purpose

Policymakers are required to establish overall plans for the nation’s educational system, e.g. to accomplish the following:

– decide the amount and the distribution of resources;
– specify the overall purpose of education as part of the wider social context, as well as specific goals of achievement; and
– determine the organisation of the progression of schooling from childhood to adolescence and beyond.

To a large extent, comparative studies and other internationally comparative data have been regarded by policymakers as providing information that is relevant to their continuous evaluation of such overall plans. What was initially the formulation of a platform for comparative educational research coincided with a growing recognition among politicians, industrial leaders and others that education was one of the most central agents in realising long-term political, societal or economic visions, such as the following:

– developing a society with a better distribution of resources across class, race, gender or any other social group;
– fulfilling the need for a highly competent workforce in order to succeed in the international marketplace;
– enhancing and further developing democracy by giving all citizens basic and further education so that they are enabled to fulfil their own life-agenda and become full-fledged participants in the democratic process.

These were just a few examples of visions of the ideal society that to a large degree were, and still are, shared throughout large parts of the world. At the same time, during the post-Second World War period, international organisations such as the United Nations, the World Bank, the OECD and the European Union were established and quickly grew in size and influence. These are organisations with different (and to some degree conflicting) agendas. However, they all to various degrees invest resources in the study of education in their member countries, and several of these organisations are linked to each other through joint projects concerning educational issues.

IEA became a provider of educational data and analyses, not only to national policymakers, but also to several international organisations. In addition to UNESCO, which was involved in the establishment of the IEA, the OECD (before PISA was established) used data from IEA studies in its publication Education at a Glance (e.g. the use of TIMSS data in OECD, 1996, 1997, 1998). Since the first studies conducted in the early 1960s, the IEA has been in charge of a great number of comparative studies in different subjects, and over the years the studies have grown to include a great number of countries throughout the world. At the same time, the methodological challenges have been a driving force in the development of new designs and psychometric procedures (Porter & Gamoran, 2002).

During the last few decades, the growth of comparative studies has probably also been fuelled by the reform of public services often referred to as ‘new public management’. This is characterised by deregulation of the public sector and a drive towards a higher degree of privatisation of those parts of the public sector that can be thought of as the infrastructure of society. Deregulation implies a transfer of responsibility from the central government to the local authorities. Nevertheless, important decisions related to schools are still to be made by policymakers at administrative levels above the local community or local school level. A consequence in most countries where deregulation took place was therefore to reinforce the central government’s role by installing a national assessment system. This was a shift from the regulation of inputs (e.g. specification of the use of resources or the number of students per class) to controlling the output (achievement, surveys of students and parents). In this way the service providers were made accountable both to the central government and to the users of these services. On the one hand, the central government could control and direct the services by connecting measures of the output to incentives, or by intervening in and manipulating the system to make it work as intended. On the other hand, the users could make use of the output measures in personal decisions regarding the public services.

In this context the studies provide many indicators considered relevant, especially for the policymaker:

– They produce measures of some of the outputs, most importantly achievement measures.
– They produce indicators for systemic factors that may be directly linked to policy, such as average class size, availability of resources (e.g. computers), teacher education and allocation of time to different subjects. They also offer the possibility of relating such factors to achievement.
– They provide indicators of relationships between variables that policy seeks to change in a certain direction, e.g. the aim of schooling to provide an equal opportunity for everyone to learn regardless of background.
– Some studies (for instance, PISA) show how these indicators and their relationships to various demographic characteristics change over time by repeating the surveys at regular intervals.

Moreover, indicators used to monitor a country’s educational system are not based on absolute measures. Placing such measures in an international context provides a comparative frame for the interpretation of many of them. Comparison is a fundamental concept in measurement (Andrich, 1988). In an assessment with no comparative component, it is usually possible to establish whether various effects are statistically significant, but it would be very difficult to establish whether the effects are small or large. Even though the international variation cannot be used to draw causal inferences, it provides a description of what is possible and a context in which national data can be compared. One specific example of how international variation improves the potential for interpreting results in the national context relates to the issue of equity: it is often expressed in policy documents that large systematic differences in achievement between pupils from different socio-economic levels indicate that school systems fail in providing equal opportunities for all pupils. Moreover, a large standard deviation in achievement in the total population is often considered an indicator of inequity. For both these types of effects, the international context provides an opportunity for the policymaker to evaluate whether the differences between students or groups of students are large or small compared to other systems perceived as relevant for comparison. There will always be differences between students, but without a contrast it would be impossible to evaluate or provide a substantial interpretation of the size of the effect.
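To make the distinction between statistical significance and substantive size concrete, consider a minimal sketch in Python. All numbers are invented and merely placed on a PISA-like scale with mean 500 and standard deviation 100; this is an illustration, not an actual result:

# Hedged illustration: in a sample of thousands of students, a mean
# difference of 10 score points is highly significant, yet amounts to
# only a tenth of a standard deviation. Whether that is 'large' can
# only be judged against a comparative yardstick, e.g. the range of
# differences observed across countries.
import math

mean_a, mean_b = 505.0, 495.0   # hypothetical group means
sd = 100.0                      # scale standard deviation
n_a = n_b = 2000                # hypothetical group sizes

d = (mean_a - mean_b) / sd      # standardised effect size (Cohen's d)

# Naive standard error of the mean difference; the real studies would
# account for the complex sampling design (replicate weights) instead.
se = sd * math.sqrt(1.0 / n_a + 1.0 / n_b)
z = (mean_a - mean_b) / se

print(f"effect size d = {d:.2f}; z = {z:.1f} (significant despite small d)")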

Common ground for the two purposes?

In order to better understand the possible tensions between Purpose I and II, Jenkins (2000), building on Loving & Cobern (2000) and Huberman (1994), offers an interesting starting point. He suggests that educational researchers and policymakers not only have different agendas, but also live within different knowledge systems, and therefore:

The knowledge produced within one system and for the one set of purposes cannot normally be readily transferred to another. (Jenkins 2000, p. 18)


Jenkins does not provide a definition of the concept of a knowledge system, and he does not identify more specific aspects of the two knowledge systems claimed to be very different. Furthermore, he does not offer a solution for how the problem identified in the above quote may be remedied.

One very obvious difference in the way researchers and policymakers approach knowledge is of course that the latter are to a much larger extent confronted with decision-making. This entails at least two characteristics of the knowledge seen as relevant. Firstly, decisions are bound by time. The pace of decision-making is usually much faster than the timelines of most researchers. It is therefore likely that, due to the pressure to produce policy in a short time (before the next election), knowledge that may be digested and understood without occupying too much time is considered more relevant by the policymaker. Secondly, knowledge that is likely to be true (analogous to evidence that will ‘hold up in court’) is generally more appreciated when confronted with the realities of decision-making. Given these criteria, it is easy to imagine that numbers and quantitative measures are deemed more appropriate than thick and rich qualitative descriptions.

The OECD, the UN and other international organisations play important roles by being engaged in both these knowledge systems. In contrast to the national policy level, these international organisations have been given mandates that are relatively stable across time, and, among several functions, they have been given the role of providing continuous policy analysis within a longer time frame. This role is particularly visible in the mandate of the OECD. PISA is therefore an interesting case regarding how educational researchers and policymakers may operate in a joint knowledge space. Through many of its educational initiatives, the OECD aims at establishing procedures and arenas for the dissemination of educational research to policymakers. Conversely, through the same arenas, policymakers are able to communicate their needs for information on which to base their decisions. This is at least part of the solution for how an efficient transfer of information back and forth between the two knowledge systems might be achieved. It means that the overall aim of the PISA study is very much aligned with how policymakers define and justify educational outcomes. It also means that the cognitive measures are contextualised by variables perceived by the policy level to be of importance.

In summary, the second purpose of effective policy development is in many respects compatible with the aims of the researchers who established the IEA and conducted the first surveys (Purpose I). Policy issues such as how one can improve the conditions for comprehensive and equitable education are, for instance, also core issues in much pedagogical and didactical research. The difference is that within Purpose II the comparative studies are no longer principally considered basic research in education. This is not to say that they can no longer be used to study fundamental issues in educational research. However, in the international comparative large-scale studies, such research issues have gradually been awarded lower priority in the shaping of the studies. The research purpose has to some extent become secondary to the primary purpose, which is to monitor and benchmark the outcomes of educational systems in order to inform policymakers.

Part II: Beyond the primary purpose of the comparative studies

In the first part of this chapter, I have argued that educational researchers would be well-advised to engage in studies like PISA since they provide communicative tools for the exchange of relevant knowledge with policymakers. However, there is also a more direct argument as to why researchers in education could be highly motivated to take part in or follow up these studies, which is the issue to be discussed in the remaining sections of the chapter: these studies provide valuable and unique data that may be used as a basis for secondary analyses. This research activity may range from theoretical contributions to secondary analysis of the data or of the documents accompanying these data (e.g. analyses of instruments and items, or of the theoretical framework and rationale underlying the studies).

A number of slightly different definitions of the term ‘secondary analysis’ have been suggested in the literature on research designs in the social sciences. They usually focus on the fact that secondary analyses are analyses of already existing data, conducted by researchers other than those who originally collected the data, and with a purpose that most likely was not included in the original design leading to the data collection. The definition best suited for the following discussion is probably the one suggested by Bryman (2004):

Secondary analysis is the analysis of data by researchers who will probably not have been involved in the collection of those data for purposes that in all likelihood were not envisaged by those responsible for the data collection. (p. 201)


This definition also opens up the possibility that the original researchers may be involved in secondary analysis and, furthermore, that the purpose of the secondary analysis may have been included in the original research design. The latter point is highly relevant for many of the large-scale official surveys of different aspects of social life (e.g. the different types of household surveys conducted regularly in many nations), many of which may be considered as having multiple purposes (Burton, 2000; The BMS, 1994), and where the potential for secondary analysis by social scientists is an important part of the primary design.

There are a number of perfectly sound reasons why many researchers give priority to collecting their own data instead of analysing already collected data. The primary reason is that ‘the scientific approach’ may to some extent be pragmatically defined by a methodology that starts with the posing of research questions and hypotheses. Data collected by others were collected with other specific questions or hypotheses in mind, and it may therefore be difficult to use these data to analyse other issues. Secondly, there are often many technical obstacles to using data collected by others: they might not be publicly available; they may lack the documentation necessary to understand the data (e.g. a comprehensive codebook); or the data may require technical analytical skills beyond those of most researchers. Thirdly, there may be ideological reasons for not wanting to base research on data collected by national or international organisations primarily for policy analyses. Some of these issues are also conditions that limit the potential for using data from the comparative studies in secondary research.

However, I would argue that the benefits of such secondary analysis strongly outweigh the limiting conditions. First of all, the data provided by these studies have qualities not often seen in educational research. The primary reason for this claim is that the quality is documented in unprecedented detail. In the technical reports for the PISA surveys (Adams & Wu, 2002; OECD, 2005b), all the procedures for instrument development, sampling, marking and data adjudication are thoroughly described. By studying such reports, it is clear that the PISA study (and other LINCAS) is based on the following:

– very clearly defined populations and adequate routines for sampling these populations in all participating countries;
– well-developed frameworks and instruments, including documentation of the quality of the translation into the different languages;
– well-developed and controlled routines for ensuring that the administration of the test was equal in all countries; and
– well-developed routines and quality monitoring of how student responses were scored as well as how the data were entered and further processed.

Gathering data with procedures like these is not usually possible in ordinary (low-cost) research, which brings us to the second argument for why the data should be used more: millions of dollars or euros have been spent on producing these high-quality databases. Samples have been established and the instruments distributed to the students and back to the research centres in a way that ensures a certain degree of quality and comparability. Furthermore, the data have been assembled and restructured through the skilful work of experts in data processing and measurement to further secure the quality of the information available. Nevertheless, relatively little money is spent on the analysis of the data, as most of it goes into gathering and processing. Evidently, investing more in further analyses of the data would be a financially sound idea.

Thirdly, data from PISA have been made publicly available (although some of the achievement items are kept secure for future use), and researchers interested in using the data can get access to them through a number of channels.3 In order to make the database accessible, PISA has even developed a thorough manual for how to analyse the data (OECD, 2005a). An even better approach would be to engage in a dialogue with the national centre. Through this contact it may be possible to get advice and access to material that is otherwise not so readily available.

3 For access to the PISA data, see http://www.pisa.oecd.org
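As a rough indication of what such analyses involve, the following Python sketch mimics the estimation logic described in the analysis manual: a statistic is computed once for each of the five plausible values and the results are then combined. The column names imitate those of the public PISA files but should be treated as assumptions here, and the sampling variance in the real studies would additionally be computed from replicate weights, which are omitted for brevity:

import numpy as np
import pandas as pd

def weighted_mean(df, value_col, weight_col="W_FSTUWT"):
    # Weighted mean using the final student weight
    return np.average(df[value_col], weights=df[weight_col])

def pv_estimate(df, pv_cols):
    # One estimate per plausible value, averaged into a final estimate;
    # the spread between them gives the imputation variance component
    estimates = np.array([weighted_mean(df, pv) for pv in pv_cols])
    m = len(estimates)
    point = estimates.mean()
    imputation_var = (1 + 1 / m) * estimates.var(ddof=1)
    return point, imputation_var

# Hypothetical miniature data set standing in for a national PISA file
rng = np.random.default_rng(0)
df = pd.DataFrame({"W_FSTUWT": rng.uniform(10, 30, 1000)})
for i in range(1, 6):
    df[f"PV{i}MATH"] = rng.normal(500, 100, 1000)

point, imp_var = pv_estimate(df, [f"PV{i}MATH" for i in range(1, 6)])
print(f"estimated mean: {point:.1f} (imputation variance {imp_var:.2f})")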

A fourth argument for why researchers in education should be keen to use data from PISA and similar studies is the fact that these data are perhaps the single most influential knowledge base for decision-making and political argumentation about educational issues in many countries, and as such they should be scrutinised from a multitude of perspectives. Even if one suspects that the data may be affected and encased by certain ideologies, secondary analysis of the data can in many cases be used to document such a relationship (Pole & Lampard, 2002). Data from LINCAS and documents describing or reporting outcomes of the studies need informed reviews from scholars who can frame the data and documents differently and thus offer both new interpretations and criticism. To a large extent LINCAS are regularly exposed to such criticism. Some of this feedback concerns ideological aspects of the studies (e.g. Atkin & Black, 1997; Brown, 1998; Goldstein, 2004a; Keitel & Kilpatrick, 1999; Kellaghan & Greaney, 2001; Orpwood, 2000; Reddy, 2005). Other critical remarks are more specifically related to methodological issues (e.g. Blum, Goldstein, & Guerin-Pace, 2001; Bonnet, 2002; Freudenthal, 1975; Goldstein, 1995, 2004b; Harlow & Jones, 2004; Prais, 2003; Wang, 2001), and no doubt this book adds to this collection of critical notes.

Finally, since the results from the studies are mainly used to inform policy at the national level, it is necessary to conduct discussions on how the results may be used to evaluate the national school system. In order for comparative studies to provide an even better basis of information for this discussion, it may be necessary to develop specific national designs. This would ensure that information seen as vital in the national context could be obtained. Germany is the prime example of a country which has emphasised the national dimension by implementing several national extensions to the PISA study. In Germany, participating students respond to additional nationally developed instruments, and the country also has an extended sample in order to cover the educational system in each of the partially autonomous states (Länder) (Stanat et al., 2002). These extended efforts in Germany have increased researchers’ engagement with the data, as can be seen from the number of articles discussing PISA in German academic journals in education. They have also boosted public awareness and debate about educational issues in general. To a somewhat lesser extent, the situation is similar in Norway.

Targeting research questions in education

The above discussion mainly presented arguments emanating from the studies themselves for why data from large-scale international comparative achievement studies should be the subject of secondary analyses. Nevertheless, the main reason why educational researchers could be motivated to invest their own time and resources in secondary analyses of such data is that these data may be used to address research questions of importance. In the remainder of the chapter, I will therefore turn to the more specific question of how these data may be used to target research questions in education.

I will suggest that most of the secondary research using data from these studies can be classified into one of six generic types of research designs or methodological approaches. The sequence in which the six generic research designs are presented, and the space devoted to each, does not imply any priority. Furthermore, the intention is not to provide an exhaustive list of possible secondary research issues to be addressed with these data, nor is it suggested that the generic types form a typology of mutually exclusive categories; typically, a piece of secondary analysis would relate to several of the six headings. Finally, the author’s own background accounts for why studies within science education prevail among the references given in the forthcoming discussion. The purpose of the six suggested categories is rather to provide a provisional framework, at some level of generality, for what secondary research relating to studies like PISA may look like.

Using data, results, or interpretations as a background

Secondary analysis of already existing data, results or their interpretations may be included as a somewhat peripheral part of a research design. The original study referred to in this type of research may provide the background or major referent for generating hypotheses and research questions; it may provide data or findings with which to contrast or triangulate other data or findings; or it may be part of the basis for theoretical argumentation or deliberation on educational issues. In this type of research the aim is usually to go behind the data in order to develop thicker and richer descriptions and analyses of issues derived from findings of the international studies.

One example of this type of work (albeit in a Norwegian context) is the research project entitled PISA+.4 The researchers involved in the project use transcripts of videotapes from classrooms covering several hours of activities as their primary data source. Therefore, it is clearly not a secondary analysis of data from PISA. But as the title of the research project reflects, it was triggered by some of the findings from PISA in a Norwegian context needing follow-up (hence the plus sign in the title). Other types of research in which the focus is on how phenomena change over time, or how one group of respondents compares to another group, may also use data or findings from comparative studies as a background. In some of these cases the international comparative studies can provide data that may be used as a baseline or benchmark to which the researchers’ own data may be related. For such a purpose it would, strictly speaking, be necessary to use partly identical instruments and similar routines for collecting and processing the data. One specific example of this is the use of items from TIMSS 1995 in an evaluation of scientific achievement in Norway before and after the curriculum reform in 1997 (Almendingen, Tveita, & Klepaker, 2003). In addition, as mentioned above, findings from PISA may be used as one of the key referents for a theoretical deliberation on educational issues. This seems to be the case for a substantial number of articles discussing educational (and more general social) issues in the German context during recent years (e.g. Opitz, 2006; Pongratz, 2006; Sacher, 2003; von Stechow, 2006).

4 See http://www.pfi.uio.no/forskning/forskningsprosjekter/pisa+/ for a description

The next two types in this generic scheme relate to the fact that the primary units of analysis in the comparative studies are broad and overarching aggregates of the two main dimensions in the data matrix (persons and items). The persons are sampled to study the population of interest, and these populations are described by composite, broad measures constructed by aggregating several items. These constructs or traits are measures of students’ achievements in broadly defined domains (e.g. science, mathematics and reading) as well as contextual descriptors (e.g. socio-economic status, interest, motivation and learning strategies). It is therefore natural to suggest two classes of designs for secondary analyses, related to the deconstruction of the two respective axes of the data matrix.

In-depth analyses of certain variables

Among the most frequently reported secondary analyses are those aiming to present a more finely tuned picture by studying more narrowly defined traits or even single items. This type of analysis utilises information in the data that is not included in the analyses of the total test scores (Olsen, 2005). Several relevant examples may be mentioned. Turmo (2003b) reported on qualitative aspects of students’ responses to a few single cognitive items from the PISA 2000 study related to the environmental issue of the depletion of the ozone layer, relating the types of responses to published research in science education. Similar studies of data from TIMSS 1995 exist in abundance (e.g. Angell, 2004; Dossey, Jones, & Martin, 2002; Kjærnsli, Angell, & Lie, 2002).

In the same manner, the student questionnaire data may be analysed in depth by selecting one or a few variables for a more narrowly targeted analysis, including discussions and alternative interpretations in light of other theoretical or methodological positions. Papanastasiou et al. (2003) have, for instance, carried out an in-depth analysis, based on data from PISA 2000, of the relationship between the use of computers and scientific literacy in the US. Gorard & Smith (2004) used other data from the same study to compute several indexes of segregation within the European Union countries, and these indexes supplement the selection of indicators reported in the official OECD publications of the PISA data. Thematically, both of these latter studies belong to the primary intent of the comparative study to which they are related, and as such they exemplify that the line of demarcation between ‘secondary analysis’ and ‘primary analysis’ is not easy to draw.

In-depth analyses of a sub-sample of students

Another type of secondary research is, as briefly stated above, analyses in which the person axis of the data matrix is deconstructed. This is a very fruitful approach for targeting many specific issues in educational research. Many of the data sets from these studies are so large that the researcher may extract a subset of respondents with similar characteristics.

One may, for instance, conduct an in-depth analysis of ethnic minority groups (Roe & Hvistendahl, 2006; Sohn & Ozcan, 2006). The fact that the OECD has recently published a supplementary thematic report on this issue (OECD, 2006) exemplifies that a clear-cut line of demarcation between primary and secondary analysis does not always exist. Another study which illustrates extremely well the possibilities of using the samples from LINCAS to address issues specific to marginal groups in the population is the study by Mullis and Stemler (2002). They used the original sample from TIMSS 1995 to identify gender differences for high-achieving upper secondary students in mathematics. The reason why such highly specific subgroups can be studied is, of course, that the samples used in the studies are very large, so one may actually select the students above the 75th percentile and further divide these by gender (as was done in this particular study) and still have subgroups of adequate size and reasonable statistical power. Thus, using data describing students’ backgrounds, attitudes and/or achievements, it is possible to construct a number of subgroups relevant for the specific research issue at hand.
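A sketch of this sub-sampling logic in Python (with invented data and hypothetical variable names; the Mullis and Stemler analysis itself was based on the TIMSS 1995 files) might look as follows:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
students = pd.DataFrame({
    "country": rng.choice(["NOR", "SWE", "DEU"], size=9000),
    "gender": rng.choice(["girl", "boy"], size=9000),
    "math_score": rng.normal(500, 100, size=9000),   # invented scores
})

# Keep only students above the 75th percentile; because the original
# samples are large, the resulting subgroups remain analysable.
cutoff = students["math_score"].quantile(0.75)
high_achievers = students[students["math_score"] > cutoff]

# Compare high-achieving girls and boys within each country
print(high_achievers.groupby(["country", "gender"])["math_score"]
      .agg(["count", "mean"]))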

A more specific and narrow comparative outlook on the data

The fourth class of secondary analysis consists of studies aiming to give a more finely tuned comparative view by selecting only a few countries (or merely one country) as the unit of analysis. These studies relate to a long-standing tradition within comparative research in education.

When comparing a smaller selection of countries, two rather different strategies for selecting countries have proven fruitful. In one strategy, countries are selected to represent rather divergent educational systems. Having a sample of educational systems that differ from each other along policy-relevant variables is at the heart of the idea of comparative studies. One of the most powerful international studies using this strategy is the TIMSS videotape study (Stigler & Hiebert, 1999). Even if this was not a secondary analysis of data from TIMSS, but rather an independent study conducted and analysed simultaneously, it is a prime example of how studying divergent educational systems may uncover hidden assumptions and tacit features of the participating countries’ teaching practice. One example of a secondary analysis of PISA data is the comparison of mathematics teaching in Finland and France, to which a recent conference was entirely devoted.5 A third illustrative and interesting example is the comparison of mathematics achievement in PISA in Brazil, Japan and Norway (Güzel & Berberoglu, 2005).

5 See http://smf.emath.fr/VieSociete/Rencontres/France-Finlande-2005/

The other strategy is to compare convergent educational systems. Naturally, such studies are often regional studies of neighbouring countries. Examples of regionally focused studies are the reports issued by Nordic researchers working on the PISA data (Lie, Linnakylä, & Roe, 2003; Mejding & Roe, 2006), and a similar report by researchers in several Eastern European countries based on data from TIMSS 1995 (Vári, 1997). A third example is the study by Wößmann (2005), using data from TIMSS, on the impact of family background in the East Asian countries. There are several reasons why regionally focused reports and studies are valuable. First of all, comparing countries with certain common cultural features in a wider sense (be they historical, political and/or linguistic) implies that more factors may be controlled, which is imperative when studying naturally occurring phenomena with a comparative survey design. Secondly, in comparisons between neighbouring or linguistically similar countries, the possible measurement errors due to item-by-country interactions are reduced (Wolfe, 1999). Therefore, from a policy perspective such comparisons are more likely to produce fruitful recommendations for decision-making, since neighbouring countries, such as the Nordic ones, often have an institutionalised and continuous exchange of policy.

The comparative basis may be reduced even further, to case studies of single countries. Obviously, the national reports that are developed in most participating countries are to a large degree examples of such studies. However, these analyses are mainly presented in public reports targeting a wide group of prospective readers. Hence, parts of the analyses presented in these reports should be transformed into a format aimed at an international audience of researchers and capable of withstanding the scrutiny of peer review. This type of secondary analysis also includes studies aiming at linking the international studies to either the national curriculum or prevalent ideologies in the participating country. Numerous examples of such studies exist. One recent French contribution is the chapter by Bodin in this book, in which he analyses the degree of correspondence between the French mathematics syllabus, the French grade 9 national exam (Brevet) and the mathematics assessment in PISA. Yet other examples are papers discussing the case of Finland, which has quite understandably received a lot of attention given its performance on the assessments in PISA (Linnakylä & Välijärvi, 2005; Sahlberg, 2007; Simola, 2005; Välijärvi et al., 2002).

The two last examples highlight why a national scope on the data from these studies may clearly have wider and broader implications. Ichilov (2004) used data from the IEA Civic Education Study (CivEd) to report on civic orientations in Hebrew and Arab schools in Israel, an issue which (unfortunately) is extremely relevant for the international community. Howie (2004) and Reddy (2005) have used the case of TIMSS in South Africa to reflect upon and question the value of participation in international comparative studies for developing countries, particularly when students’ mother tongues are not used as the language of the test. What these examples have in common is that issues which at the outset seem to be primarily of national interest may be highly relevant contributions to educational research in general.

Combining data from one study with other sources of information

Many countries participate in several studies, and secondary analyses seeking to combine, contrast or synthesise information across studies would be valuable contributions. Furthermore, efforts should be made to combine quantitative results from a study like PISA with other, supplemental pieces of information. These supplements may well be of a qualitative nature. However, this is methodologically challenging, since it is not always clear how to combine the information formally through a common unit of analysis.

The most obvious possibility for linking different international surveys is to use the results aggregated to the country level. One successful example is the study reported by Kirkcaldy et al. (2004) on the relationship between health efficacy, educational attainment and well-being. This study combined data from PISA with data provided by the World Health Organisation, the United Nations and other sources.
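The mechanics of such country-level linking are simple; the substantive difficulty lies in the comparability of the sources. A minimal Python sketch, with entirely invented figures and hypothetical column names, might be:

import pandas as pd

# Invented country-level aggregates standing in for two data sources
pisa = pd.DataFrame({
    "country": ["NOR", "FIN", "DEU", "JPN"],
    "reading_mean": [505, 546, 484, 522],
})
external = pd.DataFrame({
    "country": ["NOR", "FIN", "DEU", "JPN"],
    "life_expectancy": [79.5, 78.9, 78.7, 81.8],
})

# The country is the common unit of analysis joining the two sources
merged = pisa.merge(external, on="country")
print(merged[["reading_mean", "life_expectancy"]].corr())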

Another possibility for combining data (perhaps the primary candidate, given the discussion above) would be to find ways of combining the data in PISA and TIMSS to address issues in mathematics or science education (Olsen, 2006; Olsen & Grønmo, 2006; Olsen, Kjærnsli, & Lie, 2007), and data from PISA and PIRLS6 to address the issue of reading skills (e.g. Becker & Schubert, 2006). This is not a straightforward task, since these studies, even if they partially overlap in the content assessed, differ in many other ways, including the ages and grade levels of the students. However, it is in principle possible to use data aggregated to the country level in order to explore and describe typical features of students’ achievements, attitudes, motivation and background in different countries. Furthermore, it is highly recommended to gather complementary data to help establish links between different studies. This was, for instance, done in a Danish study in which the PISA reading literacy measure from 2000 was formally linked to the IEA Reading Literacy study (which has later become known as PIRLS) from 1991. This made it possible to compare the two measurements; more significantly, it made it possible to develop a measure of change in reading literacy for Danish students from 1991 to 2000 (Allerup & Mejding, 2003). A less stringent way of linking the studies to each other would be to compare the documents describing the studies and to examine the match between the different studies’ frameworks and item pools. For example, several comparisons of TIMSS and PISA have documented how these two surveys differ in their conceptualisation of mathematics and science (Grønmo & Olsen, in press; Hutchison & Schagen, in press; Neidorf, Binkley, Gattis, & Nohara, 2006; Neidorf, Binkley, & Stephens, 2006; Olsen, 2005; Olsen et al., 2007).

6 Progress in International Reading Literacy Study
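Formally linking two measures, as in the Danish study mentioned above, requires common items or common persons. One simple possibility, sketched below in Python with invented numbers, is a linear transformation that maps scores from one study onto the scale of the other via their means and standard deviations on a shared item set. This is only an illustration of the general idea, not the procedure actually used in that study:

import numpy as np

rng = np.random.default_rng(3)

# Invented scores of one linking sample on the common item set,
# expressed on the scales of the two studies
scores_old = rng.normal(250, 40, 500)                        # 1991-style scale
scores_new = 1.8 * scores_old + 55 + rng.normal(0, 10, 500)  # 2000-style scale

# Mean-sigma linking: match the means and standard deviations
a = scores_new.std(ddof=1) / scores_old.std(ddof=1)
b = scores_new.mean() - a * scores_old.mean()

def to_new_scale(x):
    # Place a score from the old study on the scale of the new one
    return a * x + b

print(f"old score 250 maps to {to_new_scale(250.0):.1f} on the new scale")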

Approaching the data with other methodological tools

Unlike the previous categories, this is not a specific methodological approach; rather, it is a category collecting the many studies that apply different methodological tools to the data. Although these studies often result in alternative interpretations of the data, their additional aim is often to comment on the consequences of the methods used. In recent years there has, for instance, been a growing recognition of the hierarchical structure of educational achievement data, in which students are located within classes that are located within schools, which in turn are located within regions, etc. (e.g. Malin, 2005; O’Dwyer, 2002; Ramírez, 2006; Schagen, 2004). By applying specialised statistical tools, it is possible to impose this structure while modelling the data.
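As a minimal sketch of what imposing such a structure looks like in practice, the following Python fragment fits a two-level random-intercept model to synthetic data; the cited studies use comparable multilevel models, usually with specialised software and with the survey weights taken into account:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_schools, n_per_school = 50, 30
school = np.repeat(np.arange(n_schools), n_per_school)

# Synthetic data: a student-level covariate plus a school-level effect
ses = rng.normal(0, 1, n_schools * n_per_school)
school_effect = rng.normal(0, 30, n_schools)[school]
score = 500 + 20 * ses + school_effect + rng.normal(0, 80, len(ses))

df = pd.DataFrame({"score": score, "ses": ses, "school": school})

# The random intercept for schools separates within-school from
# between-school variance instead of treating students as independent
result = smf.mixedlm("score ~ ses", data=df, groups=df["school"]).fit()
print(result.summary())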


Furthermore, some observers of the comparative studies question the requirement of uni-dimensionality of the measurements, and have analysed the data sets from some of these studies by allowing for multiple dimensions in the data (Blum et al., 2001; Gustafsson & Rosén, 2004). Related to this are a number of studies using cluster analysis in order to study distinct differences in achievement profiles across the cognitive items for clusters of empirically related countries (e.g. Angell, Kjærnsli, & Lie, 2006; Grønmo, Kjærnsli, & Lie, 2004; Lie & Roe, 2003; Olsen, 2006).
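A sketch of this cluster-analytic idea in Python (with an invented matrix of proportions correct; the cited studies work with the real country-by-item matrices) could look like this:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

countries = ["NOR", "SWE", "FIN", "DEU", "AUT", "JPN", "KOR"]
rng = np.random.default_rng(7)
p_correct = rng.uniform(0.2, 0.9, size=(len(countries), 40))  # 40 items

# Centre each country's profile so that clusters reflect relative
# strengths and weaknesses across items rather than overall ability
profiles = p_correct - p_correct.mean(axis=1, keepdims=True)

tree = linkage(profiles, method="ward")
labels = fcluster(tree, t=3, criterion="maxclust")
for country, label in zip(countries, labels):
    print(f"{country}: cluster {label}")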

Others have used latent variable or latent group modelling techniques in their approach to the data (e.g. Hansen, Rosén, & Gustafsson, 2004; C. Papanastasiou & Papanastasiou, 2006; Wolff, 2004). These are merely some examples of the use of alternative methodological approaches. The fruitfulness of such studies lies both in the fact that they may utilise aspects of the data set beyond its primary purpose, and in the fact that they may be considered as competing hypotheses regarding how to model and interpret such data.

Concluding remarks

In this chapter I have argued that data from the large-scale international achievement studies should be valued as an important resource for researchers in the educational sciences. In the first part I gave a condensed presentation of how these large-scale international studies of students’ achievements were created and are still affected by two visions or purposes. Originally, the studies were conceived of as tools for conducting fundamental research (Purpose I). This vision was gradually adopted, absorbed and transformed by educational policymakers into a vision in which these studies were regarded as one of the primary tools for monitoring educational systems (Purpose II). As a consequence, I argued that researchers who engage themselves in these studies get access to an arena for the exchange of ideas and thoughts with educational policymakers. This argument is particularly valid for PISA, which to a larger extent than other similar studies is defined as a joint venture between policymakers and researchers, set up within the organisational frame of the OECD with the active involvement of both.

In the second part of this chapter, the call for researchers to engage in secondary analysis of data from the international comparative achievement studies was argued from within the studies themselves. Specifically, it was argued that the data sets offered by the studies are complex and multifaceted, and thus it should be possible to target a range of fundamental issues in educational research through secondary analyses of the data. This argument was augmented by the fact that the data are of an unprecedented, thoroughly documented quality. Furthermore, the data are publicly available. Moreover, it was argued from an economic perspective that, since so much money has been spent on creating these data sets, any additional resources put into secondary analytical research would ensure an even better return on the investment. This argumentation is equally valid for several studies.

Having presented an argument for why researchers in education could or should be interested in utilising the data, the second part turned to the issue of the types of secondary analysis that are possible or viable. Six generic approaches to the secondary use of the data in research were suggested, accompanied by references to a diverse range of secondary analyses in order to document and exemplify the possibilities for conducting such analyses. These references are only a fraction of the available academic literature that utilises information from PISA and other similar studies. Searching international bibliographical databases for the term PISA in the title, keywords or abstract gave more than 600 hits when combining two of the most comprehensive databases of literature in educational research (ERIC and the ISI Web of Knowledge). Deleting duplicates and other false positives brings this down to approximately 250 hits in the period 2001-2007. Out of these, approximately 50 entries refer to what should be labelled primary analysis (national and international reports written as an intended part of the study), bringing the total down to about 200. At the same time, it is obvious that not all published secondary analytical work is included in the database (false negatives); several of the references included in this chapter are, for instance, not found in this bibliography. It is therefore reasonable to claim that secondary analysis of the PISA data is a vital field of research. I doubt that any other data set within educational research has been analysed by so many people from so many diverse perspectives. A more detailed and systematic analysis of this bibliographical database will be conducted in the future in order to give a more comprehensive synthesis of how data from PISA are used by researchers.

In order to take further advantage of the PISA data, governments should consider allocating resources for further analyses of the data sets, especially analyses that help relate the data to the national context. For instance, in Norway funds have been made available so that the primary researchers may spend some time developing and publishing research going beyond the commissioned reports. Furthermore, in many countries funds have been allocated to facilitate students’ use of data from PISA as the basis for their Master’s or doctoral theses, particularly in countries where the national institution responsible for the study is located at a university. To continue with the case of Norway as an example, several doctoral dissertations have been produced based on what could be labelled secondary analysis of data from TIMSS and PISA (Angell, 1996; Isager, 1996; Kind, 1996; Olsen, 2005; Turmo, 2003a).

Hopefully, this chapter may motivate and help engage researchers to explore the possibilities of making use of the resources offered by the large-scale international studies in education and, subsequently, to make such analyses available through internationally recognised publications. Furthermore, the suggested framework of types of analyses, ranging from using the data merely as a referent for the collection of other data to sophisticated modelling of the data, will hopefully provide a certain amount of guidance with respect to what such secondary analyses may look like. The many examples provided should also be regarded as an initial source of inspiration for researchers who would welcome taking on this challenge.

Acknowledgement

This manuscript has been adapted from the article Olsen, R. V. & Lie, S. (2006). Les évaluations internationales et la recherche en éducation: principaux objectifs et perspectives. Revue française de pédagogie, No 157, pp. 11-26, and I wish to thank the editors of the journal for giving their permission for this adaptation. My fellow author Svein Lie has also kindly agreed that I may publish the present adapted version.

References

Adams, R., & Wu, M. (Eds.). (2002). PISA 2000 technical report. Paris: OECD Publications.
Alexander, R., Broadfoot, P., & Phillips, D. (Eds.). (1999). Learning from comparing: New directions in comparative educational research. Volume 1: Contexts, classrooms and outcomes. Oxford: Symposium Books.
Alexander, R., Osborn, M., & Phillips, D. (Eds.). (2000). Learning from comparing: New directions in comparative educational research. Volume 2: Policy, professionals and developments. Oxford: Symposium Books.
Allerup, P., & Mejding, J. (2003). Reading achievement in 1991 and 2000. In S. Lie, P. Linnakylä & A. Roe (Eds.), Northern lights on PISA (pp. 133-145). Oslo: Department of Teacher Education and School Development, University of Oslo.
Almendingen, S. B. M. F., Tveita, J., & Klepaker, T. (2003). Tenke det, ønske det, ville det med, men gjøre det . . . ?: En evaluering av natur- og miljøfag etter Reform 97. Nesna: Høgskolen i Nesna.
Andrich, D. (1988). Rasch models for measurement. Newbury Park/London/New Delhi: Sage Publications.
Angell, C. (1996). Elevers fysikkforståelse. En studie basert på utvalgte fysikkoppgaver i TIMSS. Oslo: Det matematisk-naturvitenskapelige fakultet, Universitetet i Oslo.
Angell, C. (2004). Exploring students’ intuitive ideas based on physics items in TIMSS – 1995. In C. Papanastasiou (Ed.), Proceedings of the IRC-2004 TIMSS (pp. 108-123). Nicosia: Cyprus University Press.
Angell, C., Kjærnsli, M., & Lie, S. (2006). Curricular and cultural effects in patterns of students’ responses to TIMSS science items. In S. J. Howie & T. Plomp (Eds.), Contexts of learning mathematics and science: Lessons learned from TIMSS (pp. 277-290). London: Routledge.
Atkin, J. M., & Black, P. (1997). Policy perils of international comparisons: The TIMSS case. Phi Delta Kappan, 79(1), 22-28.
Becker, R., & Schubert, F. (2006). Soziale Ungleichheit von Lesekompetenzen: Eine Matching-Analyse im Längsschnitt mit Querschnittsdaten von PIRLS 2001 und PISA 2000 / Social inequality of reading literacy: A longitudinal analysis with cross-sectional data of PIRLS 2001 and PISA 2000 utilizing the pair-wise matching procedure. Kölner Zeitschrift für Soziologie und Sozialpsychologie, 58(2), 253-284.
Blum, A., Goldstein, H., & Guerin-Pace, F. (2001). International adult literacy survey (IALS): An analysis of international comparisons of adult literacy. Assessment in Education, 8(2), 225-246.
Bonnet, G. (2002). Reflections in a critical eye: On pitfalls of international assessment. Assessment in Education, 9(3), 387-399.
Bos, K. T. (2002). Benefits and limitations of large-scale international comparative achievement studies: The case of IEA’s TIMSS study. Unpublished PhD thesis, University of Twente.
Brown, M. (1998). The tyranny of the international horse race. In R. Slee, G. Weiner & S. Tomlinson (Eds.), School effectiveness for whom? Challenges to the school effectiveness and school improvement movements (pp. 33-47). London: Falmer Press.
Bryman, A. (2004). Social research methods (2nd ed.). Oxford: Oxford University Press.
Burton, D. (2000). Secondary data analysis. In D. Burton (Ed.), Research training for social scientists (pp. 347-360). London: Sage Publications.
Carnoy, M. (2006). Rethinking the comparative – and the international. Comparative Education Review, 50(4), 551-570.
Dossey, J. A., Jones, C. O., & Martin, T. S. (2002). Analyzing student responses in mathematics using two-digit rubrics. In D. F. Robitaille & A. E. Beaton (Eds.), Secondary analysis of the TIMSS data (pp. 21-45). Dordrecht: Kluwer Academic Publishers.
Freudenthal, H. (1975). Pupils’ achievements internationally compared – The IEA. Educational Studies in Mathematics, 6, 127-186.
Goldstein, H. (1995). Interpreting international comparisons of student achievement (Vol. 63). Paris: UNESCO Publishing.
Goldstein, H. (2004a). Education for all: The globalization of learning targets. Comparative Education, 40(1), 7-14.
Goldstein, H. (2004b). International comparative assessment: How far have we really come? Assessment in Education, 11(2), 227-234.
Gorard, S., & Smith, E. (2004). An international comparison of equity in education systems. Comparative Education, 40(1), 15-28.
Grønmo, L. S., Kjærnsli, M., & Lie, S. (2004). Looking for cultural and geographical factors in patterns of response to TIMSS items. In C. Papanastasiou (Ed.), Proceedings of the IRC-2004 TIMSS (Vol. 1, pp. 99-112). Nicosia: Cyprus University Press.
Grønmo, L. S., & Olsen, R. V. (in press). TIMSS versus PISA: The case of pure and applied mathematics. In Unknown (Ed.), Unknown. Washington, DC.
Gustafsson, J.-E., & Rosén, M. (2004). The IEA 10-year trend study of reading literacy: A multivariate reanalysis. In C. Papanastasiou (Ed.), Proceedings of the IRC-2004 (Vol. 3, pp. 1-16). Nicosia: Cyprus University Press.
Güzel, C. I., & Berberoglu, G. (2005). An analysis of the Programme for International Student Assessment 2000 (PISA 2000) mathematical literacy data for Brazilian, Japanese and Norwegian students. Studies in Educational Evaluation, 31(4), 283-314.
Hansen, K. Y., Rosén, M., & Gustafsson, J.-E. (2004). Effects of socioeconomic status on reading achievement at collective and individual levels in Sweden in 1991 and 2001. In C. Papanastasiou (Ed.), Proceedings of the IRC-2004 PIRLS (Vol. 3, pp. 123-139). Nicosia: Cyprus University Press.
Harlow, A., & Jones, A. (2004). Why students answer TIMSS science test items the way they do. Research in Science Education, 34(2), 221-238.
Howie, S. J. (2004). TIMSS in South Africa: The value of international comparative studies for a developing country. In D. Shorrocks-Taylor & E. W. Jenkins (Eds.), Learning from others (Vol. 8). Dordrecht: Kluwer Academic Publishers.
Huberman, M. (1994). The OERI/CERI seminar on educational research and development: A synthesis and commentary. In T. M. Tomlinson & A. C. Tuijnman (Eds.), Education research and reform: An international perspective (pp. 45-66). Washington, DC: OECD Centre for Educational Research and Innovation/US Department of Education.
Husén, T. (1973). Foreword. In L. C. Comber & J. P. Keeves (Eds.), Science achievement in nineteen countries (pp. 13-24). Stockholm/New York: Almqvist & Wiksell/John Wiley & Sons.
Husén, T., & Tuijnman, A. (1994). Monitoring standards in education: Why and how it came about. In A. C. Tuijnman & T. N. Postlethwaite (Eds.), Monitoring the standards of education. Papers in honor of John P. Keeves (pp. 1-21). Oxford: Pergamon.
Hutchison, D., & Schagen, I. (in press). Comparison between PISA and TIMSS – Are we the man with two watches? In T. Loveless (Ed.), Lessons learned: What international assessments tell us about math achievement. Washington, DC: Brookings Institution Press.
Ichilov, O. (2004). Becoming citizens in Israel: A deeply divided society. Civic orientations in Hebrew and Arab schools. In C. Papanastasiou (Ed.), Proceedings of the IRC-2004 CivEd-Sites (Vol. 4, pp. 69-86). Nicosia: Cyprus University Press.
Isager, O. A. (1996). Den norske grunnskolens biologi i et historisk og komparativt perspektiv. Oslo: Det matematisk-naturvitenskapelige fakultet, Universitetet i Oslo.
Jenkins, E. W. (2000). Research in science education: Time for a health check? Studies in Science Education, 35, 1-25.
Keeves, J. P. (Ed.). (1992). The IEA study of science III: Changes in science education and achievement: 1970 to 1984. New York: Pergamon Press.
Keitel, C., & Kilpatrick, J. (1999). The rationality and irrationality of international comparative studies. In G. Kaiser, E. Luna & I. Huntley (Eds.), International comparisons in mathematics education (pp. 241-256). London: Falmer Press.
Kellaghan, T., & Greaney, V. (2001). The globalisation of assessment in the 20th century. Assessment in Education, 8(1), 87-102.
Kind, P. M. (1996). Exploring performance assessment in science. Oslo: Det matematisk-naturvitenskapelige fakultet, Universitetet i Oslo.
Kirkcaldy, B., Furnham, A., & Siefen, G. (2004). The relationship between health efficacy, educational attainment, and well-being among 30 nations. European Psychologist, 9(2), 107-119.
Kjærnsli, M., Angell, C., & Lie, S. (2002). Exploring Population 2 students’ ideas about science. In D. F. Robitaille & A. E. Beaton (Eds.), Secondary analysis of the TIMSS data (pp. 127-144). Dordrecht: Kluwer Academic Publishers.
Lie, S., Linnakylä, P., & Roe, A. (Eds.). (2003). Northern lights on PISA: Unity and diversity in the Nordic countries in PISA 2000. Oslo: Department of Teacher Education and School Development, University of Oslo.
Lie, S., & Roe, A. (2003). Unity and diversity of reading literacy profiles. In S. Lie, P. Linnakylä & A. Roe (Eds.), Northern lights on PISA (pp. 147-157). Oslo: Department of Teacher Education and School Development, University of Oslo.
Linnakylä, P., & Välijärvi, J. (2005). Secrets to literacy success: The Finnish story. Education Canada, 45(3), 34-37.
Loving, C. C., & Cobern, W. W. (2000). Invoking Thomas Kuhn: What citation analysis reveals about science education. Science & Education, 9(1-2), 187-206.
Malin, A. (2005). School differences and inequities in educational outcomes. Jyväskylä: Jyväskylä University Press.
Mejding, J., & Roe, A. (Eds.). (2006). Northern lights on PISA 2003 – a reflection from the Nordic countries. Copenhagen: Nordic Council of Ministers.
Mullis, I. V. S., & Stemler, S. E. (2002). Analyzing gender differences for high-achieving students on TIMSS. In D. F. Robitaille & A. E. Beaton (Eds.), Secondary analysis of the TIMSS data (pp. 287-290). Dordrecht: Kluwer Academic Publishers.
Neidorf, T. S., Binkley, M., Gattis, K., & Nohara, D. (2006). Comparing mathematics content in the National Assessment of Educational Progress (NAEP), Trends in International Mathematics and Science Study (TIMSS), and Program for International Student Assessment (PISA) 2003 assessments. Washington, DC: National Center for Education Statistics.
Neidorf, T. S., Binkley, M., & Stephens, M. (2006). Comparing science content in the National Assessment of Educational Progress (NAEP) 2000 and Trends in International Mathematics and Science Study (TIMSS) 2003 assessments. Washington, DC: National Center for Education Statistics.
O’Dwyer, L. M. (2002). Extending the application of multilevel modelling to data from TIMSS. In D. F. Robitaille & A. E. Beaton (Eds.), Secondary analysis of the TIMSS data (pp. 359-373). Dordrecht: Kluwer Academic Publishers.
OECD (1996). Education at a glance. Paris: OECD Publications.
OECD (1997). Education at a glance. Paris: OECD Publications.
OECD (1998). Education at a glance. Paris: OECD Publications.
OECD (2005a). PISA 2003 data analysis manual. Paris: OECD Publishing.
OECD (2005b). PISA 2003: Technical report. Paris: OECD Publications.
OECD (2006). Where immigrant students succeed: A comparative review of performance and engagement in PISA 2003. Paris: OECD Publications.
Olsen, R. V. (2005). Achievement tests from an item perspective. An exploration of single item data from the PISA and TIMSS studies, and how such data can inform us about students’ knowledge and thinking in science. Oslo: Unipub forlag.
Olsen, R. V. (2006). A Nordic profile of mathematics achievement: Myth or reality? In J. Mejding & A. Roe (Eds.), Northern lights on PISA 2003 – a reflection from the Nordic countries (pp. 33-45). Copenhagen: Nordic Council of Ministers.
Olsen, R. V., & Grønmo, L. S. (2006). What are the characteristics of the Nordic profile in mathematical literacy? In J. Mejding & A. Roe (Eds.), Northern lights on PISA 2003 – a reflection from the Nordic countries (pp. 47-57). Copenhagen: Nordic Council of Ministers.
Olsen, R. V., Kjærnsli, M., & Lie, S. (2007, 21-25 August). A comparison of the measures of science achievement in PISA and TIMSS. Paper presented at ESERA 2007, Malmö, Sweden.
Olsen, R. V., & Lie, S. (2006). Les évaluations internationales et la recherche en éducation: principaux objectifs et perspectives. Revue française de pédagogie, 157, 11-26.
Opitz, E.-M. (2006). PISA und Bildungsstandards: Stein des Anstoßes oder Anstoß für die Sonderpädagogik? / PISA and education standards: Stumbling block or impulse for special education? Vierteljahresschrift für Heilpädagogik und ihre Nachbargebiete, 75(2), 110-120.
Orpwood, G. (2000). Diversity of purpose in international assessments: Issues arising from the TIMSS tests of mathematics and science. In D. Shorrocks-Taylor & E. W. Jenkins (Eds.), Learning from others: International comparisons in education (pp. 49-62). Dordrecht/Boston/London: Kluwer Academic Publishers.
Papanastasiou, C., & Papanastasiou, E. C. (2006). Modelling mathematics achievement in Cyprus. In S. J. Howie & T. Plomp (Eds.), Contexts of learning mathematics and science. London: Routledge.
Papanastasiou, E. C., Zembylas, M., & Vrasidas, C. (2003). Can computer use hurt science achievement? The USA results from PISA. Journal of Science Education and Technology, 12(3), 325-332.
Pole, C., & Lampard, R. (2002). Practical social investigation: Qualitative and quantitative methods in social research. Essex: Pearson Education Limited.
Pongratz, L.-A. (2006). Voluntary self-control: Education reform as a governmental strategy. Educational Philosophy and Theory, 38(4), 471-482.
Porter, A. C., & Gamoran, A. (Eds.). (2002). Methodological advances in cross-national surveys of educational achievement. Washington, DC: National Academy Press.
Prais, S. J. (2003). Cautions on OECD’s recent educational survey (PISA). Oxford Review of Education, 29(2), 139-163.
Ramírez, M. J. (2006). Factors related to mathematics achievement in Chile. In S. J. Howie & T. Plomp (Eds.), Contexts of learning mathematics and science (pp. 97-111). London: Routledge.
Reddy, V. (2005). Cross-national achievement studies: Learning from South Africa’s participation in the Trends in International Mathematics and Science Study (TIMSS). Compare, 35(1), 63-77.
Roberts, D. A. (2007). Scientific literacy/science literacy. In S. K. Abell & N. G. Lederman (Eds.), Handbook of research in science education (pp. 729-780). Mahwah: Lawrence Erlbaum Associates.
Roe, A., & Hvistendahl, R. (2006). Nordic minority students’ literacy achievement and home background. In J. Mejding & A. Roe (Eds.), Northern lights on PISA 2003 – a reflection from the Nordic countries. Copenhagen: Nordic Council of Ministers.
Sacher, W. (2003). Schulleistungsdiagnose – pädagogisch oder nach dem Modell PISA? Pädagogische Rundschau, 57(4), 399-417.
Sahlberg, P. (2007). Education policies for raising student learning: The Finnish approach. Journal of Education Policy, 22(2), 147-171.
Schagen, I. (2004). Multilevel analysis of PIRLS data for England. In C. Papanastasiou (Ed.), Proceedings of the IRC-2004 PIRLS (Vol. 3, pp. 82-102). Nicosia: Cyprus University Press.
Simola, H. (2005). The Finnish miracle of PISA: Historical and sociological remarks on teaching and teacher education. Comparative Education, 41(4), 455-470.
Sohn, J., & Ozcan, V. (2006). The educational attainment of Turkish migrants in Germany. Turkish Studies, 7(1), 101-124.
Stanat, P., Artelt, C., Baumert, J., Klieme, E., Neubrand, M., Prenzel, M., et al. (2002). PISA 2000: Overview of the study. Design, method and results. Berlin: Max Planck Institute for Human Development.
Stigler, J. W., & Hiebert, J. (1999). The teaching gap: Best ideas from the world’s teachers for improving education in the classroom. New York: Free Press.
The BMS. (1994). Correspondence analysis: A history and French sociological perspective. In M. J. Greenacre & J. Blasius (Eds.), Correspondence analysis in the social sciences (pp. 128-137). London: Academic Press.
Turmo, A. (2003a). Naturfagdidaktikk og internasjonale studier. Store internasjonale studier som ramme for naturfagdidaktisk forskning: En drøfting med eksempler på hvordan data fra PISA 2000 kan belyse sider ved begrepet naturfaglig allmenndannelse. Oslo: Unipub AS.
Turmo, A. (2003b). Understanding a newsletter article on ozone – a cross-national comparison of the scientific literacy of 15-year-olds in a specific context. Paper presented at the 4th ESERA conference “Research and the Quality of Science Education”, August 2003, Noordwijkerhout, The Netherlands.
Välijärvi, J., Linnakylä, P., Kupari, P., Reinikainen, P., & Arffman, I. (2002). The Finnish success in PISA – and some reasons behind it: PISA 2000. Jyväskylä: Institute for Educational Research, University of Jyväskylä.
Vári, P. (Ed.). (1997). Are we similar in math and science? A study of grade 8 in nine Central and Eastern European countries. Amsterdam: International Association for the Evaluation of Educational Achievement.
von Stechow, E. (2006). PISA und die Folgen für schwache Schülerinnen und Schüler / PISA and the consequences for pupils with learning disabilities. Vierteljahresschrift für Heilpädagogik und ihre Nachbargebiete, 75(4), 285-292.
Wang, J. (2001). TIMSS primary and middle school data: Some technical concerns. Educational Researcher, 30(6), 17-21.
Wolfe, R. G. (1999). Measurement obstacles to international comparisons and the need for regional design and analysis in mathematics surveys. In G. Kaiser, E. Luna & I. Huntley (Eds.), International comparisons in mathematics education. London: Falmer Press.
Wolff, U. (2004). Different patterns of reading performance: A latent profile analysis. In C. Papanastasiou (Ed.), Proceedings of the IRC-2004 PIRLS (Vol. 3, pp. 188-202). Nicosia: Cyprus University Press.
Wößmann, L. (2005). Educational production in East Asia: The impact of family background and schooling policies on student performance. German Economic Review, 6(3), 331-353.


The Hidden Curriculum of PISA – The Promotion of Neo-Liberal Policy by Educational Assessment

Michael Uljens

Finland: Åbo Akademi University

Introduction

The aim of the present chapter is to contextualise the PISA evaluation as an exponent of an ongoing shift in the educational policy of many countries participating in PISA. The shift is considered to reflect a neoliberally oriented understanding of the relation between the state, the market and education. From a Finnish perspective, this shift was initiated at the end of the 1980s and the beginning of the 1990s. It has been referred to as the educational policy of “the third republic”.

Even if the PISA project strengthens the development of a neoliberal educational discourse, both nationally and globally, the project was prepared for by developments within many nations during the 1990s. Movements and actions that preceded the PISA project are described here from the perspective of Finland. The argument is that these preceding operations made PISA appear a natural continuation of a change process already initiated on the national level.

The chapter also points out some of the mechanisms through which the PISA evaluations operate in order to promote the neoliberal interests of the OECD. This is considered important, as it often appears to be forgotten that the OECD is the organisation behind and running the PISA project. PISA is interpreted as a specific kind of transnational, semi-global educational evaluation technique not experienced before. PISA is thus interpreted as having been prepared by previous actions on the local and national levels, but PISA in turn promotes and strengthens the readiness to uphold a competition-oriented cooperation within and between nations.


Doing the groundwork for PISA – the silent educational revolution

The point of departure of the present chapter is the view that large-scale changes in education, especially, must be understood as socially, culturally and historically developed. Consequently, the claim here is that the PISA project cannot be correctly understood without acknowledging it as an exponent of an ongoing shift in European and global educational policy. Today this shift concerns and covers all levels and areas of the western educational system, although to varying degrees in different countries.

In Finland, this shift has been called a movement towards the educational policy of the “third republic”. The first republic refers to the period from Finnish independence (1917) up to the Second World War. The second republic started in 1945 and lasted up to the mid-1980s. This period focused on educational expansion, solidarity, basic education for all students, equal opportunities, regional balance, and education for civil society. In a word, it was the educational doctrine of the welfare state, assuming mutually positive effects between economic growth, welfare and political participation (see e.g. Siljander, 2007).

The period of the “third republic” started towards the end of the 20th century or, symbolically, when the previous century “ended” in 1989, i.e. after the collapse of the Soviet Union and the fall of the Berlin Wall. The political mentality in Finland had already started to change in a more conservative direction in the 1980s, and has since then continued to develop in this direction, even though the movement was even more obvious in other countries.

In contrast to the period of the second republic, the educational mentality of the third republic initiated a discourse on excellence, efficiency, productivity, competition, internationalisation, increased individual freedom and responsibility as well as deregulation in all societal areas (e.g. communication, health care, infrastructure), including the educational sector (education law, curriculum planning and educational administration). The direction was clearly manifested in the governmental programme in Finland after the elections in 1990. The project could be called the creation of the educational policy of the global post-industrial knowledge economy and information society. New Public Management ideas were introduced in the late 1980s, along with a so-called agency-theoretical approach, according to which the role of the state is changed from producing services to buying services. The model included, as we know, the lowering of taxes as well as techniques for “quality assurance”. Attention also turned towards profiling individual schools and institutions and towards increasing flexibility, e.g. in educational career planning. Extended freedom of choice on the local level was supported by, for example, decentralising curriculum planning, first to the community level in the 1980s and then to the school level in the 1990s. Parents were included in school boards. Salaries determined according to achievement were later introduced in the public sector.

This mentality supported a kind of commodification of knowledge and marketisation of schooling, as well as a much stronger view of national education as a vehicle for international competition. The use of national tests for ranking schools was introduced in the 1990s as a means of promoting a competition-oriented climate. The education of gifted students became acknowledged in addition to the strong emphasis on traditional special education. Today, limiting school drop-out is motivated by its societal costs rather than by the many other reasonable arguments.

Despite all of these changes, the idea of educational equality has remained the guiding principle, although it has become weakened. The process by and large reflects a view of students or parents as “customers”, according to which parents were offered enlarged opportunities to choose which schools their children attended on the basis of the success of schools and their profiles. The view of citizens as customers is also obvious in various EU documents (Heikkinen, 2004) (Finland joined the EU in 1995). Education has increasingly come to be considered a private good rather than a public good. During the past decade, movements in this direction have been very obvious within the university system (law, financing models, productivity, etc.) in Finland.

The changes pointed out above reflect a silent but ongoing “revolution” in educational ideology and policy. The development in Finland is similar to that in other European countries. Seen globally, it is difficult not to consider the collapse of the former Soviet Union as the starting point for the development of a new ideological and economic world order.

The conclusion of what has been said thus far is that the PISA evaluations, organised by the OECD, were in many ways prepared for by the developments described above. The argument of the following section is that, although the international ranking of countries with respect to pupils’ success at testing is not a new phenomenon, taking into account how PISA has been constructed and governed, and how its results have been distributed, interpreted and made use of, makes the PISA process an organic part of an ongoing “silent revolution” in western educational thinking.


Governing technologies used by the OECD

It is important to observe that the PISA evaluations are coordinated by the OECD (Organisation for Economic Co-operation and Development). The OECD was founded in Paris in 1960 in order to stimulate economic growth and employment. It was founded by 20 countries but was extended in 2000 to 30 countries. A growing number of non-OECD countries have also participated in the PISA evaluations.

The overall logic behind the strategy of the OECD seems to be to support an increased competitive mentality combined with a system of common standards for nations, as this is expected to be beneficial for a common market. The intention thus seems to be to combine competition with cooperation. The question is through which mechanisms, operations or technologies this is put into practice. In the following, some major strategies are identified that have been applied in and through the PISA evaluations in order to promote a competitive mentality combined with cooperation.

First, using transnational evaluation procedures that follow one single measurement standard (common to all and independent of every participating country) supports, in the end, the development of increased homogeneity. The argument is that this occurs through a self-adjusting process. More precisely, the strategy applied is the following: as PISA is mainly focused on the ranking of participating countries and is not very interested in explaining the differences between them, the burden of producing explanations is left to the participating nations, their governments, educational administrations and the media. We saw this occurring after the launch of the results of PISA 2000, and even more clearly after PISA 2003.

By not offering systematic explanations for the reported differences in school achievement, the development of a self-adjusting mentality or a certain mode of self-reflection was promoted. Through this process the countries themselves begin to orientate towards certain types of questions and topics, i.e. looking for the keys to success. We all know that ranking the participating countries created an unforeseen alertness among politicians and within the educational administration to explain either their students’ success or lack thereof.

From an OECD perspective this is the best anyone can hope for – getting nations engaged in the right issues, so to speak. By leaving the task of explaining differences to the participating nations, media people and the like, national experts, governmental representatives and politicians are also free to draw different kinds of conclusions from the results. Thus, the policies emanating from the process vary between countries. However, this process limits the agenda for the educational politics of a specific country. Instrumental policy issues, i.e. the means by which things should be carried out and corrected, then become the main topic, while reflection on the orientation and aims of education and schooling as such diminishes. Nonetheless, it would be wrong to say that the question of educational aims has moved to the background during this process, as it is obvious that all levels of education strongly emphasise that education, research and developmental work are core strategies for creating economic growth. As the aims are so obvious, there is a risk that educational policymaking on the national level becomes a kind of educational managerialism or “procedurology”.

A second strategy applied for promoting the interests dominating the OECD is related to the construction of the tests and their relation to national curricula. One of the fundamental differences between the PISA evaluation and, for example, the IEA evaluations is that the IEA took the national curriculum, its intentions and content, as the point of departure. As it is quite natural to consider the national curriculum as the frame of reference when evaluating pedagogical efforts, it becomes important to try to understand why PISA did not evaluate what teachers in the respective countries were expected to strive for. But what if the point was also something else in addition to primarily evaluating the effectiveness of the educational system? What if the idea was rather to use international evaluation as a technique for homogenising the participating educational systems and creating a competition-oriented mentality?

If homogenisation (or increased coherence) may be seen as one aim to be reached from an OECD perspective, then the promotion of a competition-oriented mentality is another, equally important aim. Having accepted this, the main question is not concerned with the aims of education but with the means of how to reach or hold a leading position.

A mentality that accepts never-ending competition is deceptive, as one can never reach either the goal or certainty. The only point that is clear is that one has to struggle to keep or improve one’s position. Competition is always accompanied by insecurity, and this insecure identity or mentality continuously strives to reach safety. The mentality supported is one of continuous angst or a feeling of insufficiency. Lifelong learning, which was first hailed as a liberating policy, has quickly turned out to be more like a life sentence than something emancipating. The individual is not allowed to reach “heaven on earth”, but is rather expected to try to learn to live with the idea that a continuous learning process is the closest we can come to fulfilment in life. In fact, this construction is not a recent or new one. In some respects it is a fundamental feature of the European tradition of Bildung. At the risk of oversimplifying, we could say that while the Bildung tradition emphasises learning as emancipation, independence, self-awareness and maturity (Mündigkeit), the lifelong learning ideology or dogma explicates learning activity as something that the individual has to exhibit in order to meet the “legitimate” expectations of those towards whom one is considered to be responsible. A learning attitude is the ethos of an “alert readiness to change” according to what the situation requires, but where one does not define this situation oneself. In this sense the lifelong learning dogma is the opposite of the concept of Bildung.

Conclusions

The intention in this chapter has been to develop an interpretation of the possible logic behind the PISA evaluation compared to previous international evaluations. Moreover, the aim has been to analyse some of the mechanisms or governmental strategies utilised or operating in the PISA process. This general logic was considered simple but intelligent. It was interpreted as aiming at uniting intercultural communicative activities oriented towards learning from each other with a simultaneous or parallel competition-oriented mentality – a logic of competing and competitive cooperation. As this has not been formulated as an explicit aim of the PISA programme, it may be interpreted as a part of the hidden curriculum of PISA. International evaluations, in the shape they have taken in the PISA process, thus include a kind of hidden curriculum, aiming at developing the educational systems of the participating countries in a neo-liberal direction.

The analysis did not focus on what was in fact measured by the tests themselves, or on whether the theoretical foundation of the project was weak or not, e.g. with respect to how comparative educational research was understood. The point was rather, first and foremost, to pay attention to how the educational policy landscape in Finland for its part prepared for PISA and, secondly, to point out the effects that this kind of evaluation procedure may have on the educational thinking of the participating countries. PISA was thus understood more as an instrument or technique used by the OECD to support the development of a specific type of national educational policy. Expressed in the terminology of Michel Foucault, the PISA evaluation is viewed as a good example of how evaluation operates not by directly governing behaviour but by governing the self-government or self-conduct of individuals.

However, it has not been claimed here that supporting countries’ competitive capacity by educational means is a new feature of Finnish or European educational policy. It may, in fact, be argued that living with uncertainty and openness is a fundamental feature of the modern European tradition of Bildung (Uljens, 2007). Furthermore, the educational policy of the welfare state was, and (at least in Finland, to the extent it still exists) still is, built upon the conviction of positive mutual effects between economic progress, educational equality, social justice and welfare, and active, participatory citizenship.

In order to avoid misunderstanding, it should be stated that it is in this context that the term ‘neo-liberalism’ is distinguished from ‘classical liberalism’ (A. Smith). Classical liberalism is taken to refer to the idea that the state should not intervene in market-related issues, as the market regulates itself and is automatically beneficial for all. Neo-liberalism is taken to refer to the view that the state does and should intervene in the market by laws and regulations of all kinds. In the neo-liberal model, politics, economics and education are seen as mutually dependent on each other. The international development of market-oriented economic thinking after 1989 may thus be considered a renewed neo-liberal politics in which the relative impact of politics on the economy has diminished. This has created a dissonance in the “school-state-market triangle” (education, politics, economics), which is most clearly visible in and through the contemporary discussions on the crisis of citizenship and citizenship education.

In conclusion, understood in the sense defined above, the PISA process is coherent with the kind of educational policy in Finland that has been evolving over the past 15-20 years. The relation may also be seen the other way around: the educational policy of Finland, as it developed from the end of the 1980s and the beginning of the 1990s, moulded the national scene so that the strategies and technologies used in the PISA evaluation appeared as a reasonable continuation of the national policy.

It was pointed out that this preparatory work was mainly carried out by applying three policy technologies: a) economisation, referring to the measurement of value primarily in economic terms; b) privatisation, as a movement towards the partial deconstruction of collective, societal institutions in favour of private actors, towards deregulating laws and increasing the flexibility of educational administration, and towards increased individual responsibility and freedom; and, finally, c) productivity, referring to the fact that activities that effectively stimulate economic growth are supported.

One of the anomalies resulting from the international PISA discussion is how to explain the fact that an educational system like the Finnish comprehensive school was indeed able to produce better results and a smaller variation compared to parallel school systems, like those in Germany or Great Britain. One reason why this raised so much confusion was the fact that the ideology behind the comprehensive school fundamentally differed from the OECD ideology, which emphasises more individual freedom and less state intervention.

PISA has also resulted in increased expectations of continued and extended success. In Finland, the PISA success of the compulsory school system turned attention towards the universities: why are our universities not doing equally well in international rankings? During the last few years many different steps have been taken in order to push for Finnish universities’ international success. One example is that it is no longer certain that the decentralised model of higher education, initiated at the end of the 1960s, will remain. According to the unquestioned rhetoric of today, large university units are considered capable of being successful in many ways, not least when it comes to raising research funding and offering stimulating study programmes. This also happens on the EU level (e.g. the establishment of the European Institute of Technology, EIT).

A final comment – or query – concerning where PISA is or has been discussed during the past years: compared with the immense attention PISA issues have received in the public debate all over the world, and the impact they have had on governmental policies and school practices, it is fascinating how seldom educational researchers touch upon the topic in international research conferences and journal articles. If the observation is correct, which I do think it is, then it seems that we have two different worlds of educational debate which are not necessarily in touch with each other. Is this how things should be?

References

Heikkinen, A. (2004): Evaluation in the transnational ‘Management by Projects’ policies. In: European Educational Research Journal, 3(2), 486-500.

Siljander, P. (2007): Education and ‘Bildung’ in modern society – Developmental trends of Finnish educational and sociocultural processes. In: R. Jakku-Sihvonen & H. Niemi (Eds), Education as a societal contributor (pp. 71-90). Frankfurt am Main: Peter Lang.

Uljens, M. (2007): Education and societal change in the global age. In: R. Jakku-Sihvonen & H. Niemi (Eds), Education as a societal contributor (pp. 23-49). Frankfurt am Main: Peter Lang.


Deutsche Pisa-Folgen

Thomas Jahnke

Deutschland: Universität Potsdam

In dieser Note werden die Beschlüsse der Kultusministerkonferenz zum ‚Bildungsmonitoring‘ und zu den ‚Bildungsstandards‘ in Mathematik als nationale Pisa-Folgen identifiziert. Eine Auseinandersetzung mit der Testforschung in den USA und eine Ernüchterung der veröffentlichten Meinung zu der Testwirklichkeit können die Geltungsmacht von Pisa & Co in Deutschland möglicherweise eindämmen.

Die Teilnahme an der Dritten Mathematik- und Naturwissenschaftsstudie (TIMSS) und dem ersten Durchgang des Programme for International Student Assessment (PISA) hat zu einer grundlegenden Wende in der deutschen Bildungspolitik geführt. Das Unbehagen an der deutschen Schule ist messbar geworden, und mit diesen Messungen ist auch der Weg, die Verhältnisse zu bessern, vorgezeichnet: Die Messwerte müssen höher werden, dann wird es besser. Die Wucht, mit der dieser Gedanke die mediale und politische Öffentlichkeit durchrollte und vereinzelte Kritik an solchen Erkenntnissen und der einzuschlagenden Kur unter sich begrub, hatte lawinenartigen Charakter. Die Messergebnisse scheinen wirklicher als jede Theorie, und den „deskriptiven Befunden“ haftet eine quasi-naturwissenschaftliche Objektivität und damit unwiderlegbare Wahrheit an: So liegen die Dinge – im Rahmen der Messgenauigkeit. Der Triumph empirischen Denkens: Die Wirklichkeit ist beziffert, digitalisiert, das Menetekel hat Dezimalen bekommen und kann nun Steuerungsprozessen unterworfen werden, deren Ergebnisse wieder zu messen sind, und so weiter.

In Deutschland wird kaum diskutiert, dass auch solchen Messungen eine – möglicherweise holprige, unausgesprochene, wenig durchdachte – Theorie zugrunde liegt und dass Begriffe wie ‚Kompetenzstufen‘ oder ‚Grundbildung‘ sich nicht messtechnisch ergeben, sondern „Realität“ eher hervorbringen als beschreiben. Ferner ist der Glaube, durch periodisierte Testungen würden die Leistungen deutscher Schülerinnen und Schüler steigen, weit und auch in der Bildungsadministration verbreitet. Kritik an Pisa wird häufig damit zurückgewiesen, dass ein Unternehmen dieser Größenordnung natürlich auch Schwächen und Ungereimtheiten aufweise, dass der ‚Pisa-Schock‘ aber grundsätzlich doch das Augenmerk auf die Schulwirklichkeit gelenkt und schon damit Bewegung gebracht und diverse Reformbestrebungen in Gang gesetzt habe. Verkannt wird dabei, dass es sich bei Pisa nicht um eine einmalige Testung handelt, deren Ergebnisse – in ihrer Aussagekraft möglicherweise überschätzt – einem reflektierenden Betrachter schon etwas erzählen könnten, sondern um ein Programm, das keineswegs den Blick für verschiedenste Reformansätze und -anstrengungen öffnet, sondern im Gegenteil den Weg durch das Ziel schon festgeschrieben hat: Deutsche Schülerinnen und Schüler sollen bei den künftigen Tests besser abschneiden.

„Bildungsmonitoring“

Die Gesamtstrategie der Kultusministerkonferenz zum Bildungsmonitoring liegt in doppelter Form vor: einmal als Beschluss der Kultusministerkonferenz vom 02.06.2006¹, zum anderen als Broschüre², die vom Sekretariat der Ständigen Konferenz der Kultusminister der Länder der Bundesrepublik Deutschland in Zusammenarbeit mit dem Institut zur Qualitätsentwicklung im Bildungswesen (IQB) 2006 herausgegeben wurde. Die Broschüre ist – schon auf dem Umschlag – mit ganzseitigen Farbfotos von Schülerinnen und Schülern illustriert, enthält ein Vorwort der Präsidentin der Kultusministerkonferenz und ein Inhaltsverzeichnis mit geänderter Nummerierung der Abschnitte und ist um einen Abschnitt mit Aufgabenbeispielen angereichert. Offensichtlich hat man dem IQB zugestanden, sein Aufgaben- und Pflichtenbuch selbst zu überarbeiten und die Formulierungen des zugrunde liegenden Beschlusses sich passend zu glätten und auszulegen. Dies geschieht tatsächlich Absatz für Absatz.

¹ Kultusministerkonferenz (KMK): Gesamtstrategie der Kultusministerkonferenz zum Bildungsmonitoring (Beschlüsse der Kultusministerkonferenz vom 02.06.2006).
² Kultusministerkonferenz (KMK): Gesamtstrategie der Kultusministerkonferenz zum Bildungsmonitoring (2006).

Aus der Formulierung „… in eine Reihe von Beschlüssen der KMK einzuordnen, die entsprechende Handlungsfelder beschreiben und gemeinsame zentrale Arbeitsbereiche nach Pisa 2003 festlegen“ in dem Beschluss der KMK wird der Bezug auf Pisa 2003 gestrichen. Aus dem Arbeitsbereich „Bereitstellung von Fortbildungskonzeptionen und -materialien zur kompetenz- bzw. standardbasierten Unterrichtsentwicklung, vor allem Lesen, Geometrie, Stochastik“ werden der Bezug zum Lesen und der – nicht nachvollziehbare, verwunderliche – Bezug auf spezielle mathematische Bereiche gestrichen, den man wohl nur durch die Abwesenheit und damit auch Verzichtbarkeit von ‚Fachkompetenz‘ erklären kann. Wir zitieren im Folgenden die etwas schlankeren Formulierungen des ursprünglichen Beschlusses. Schon der erste Absatz lässt wenig Zweifel, unter welchen Auspizien Bildung heute betrachtet wird:

Bildung nimmt eine Schlüsselrolle für die individuelle Entwicklung, für gesellschaftliche Teilhabe sowie berufliches Fortkommen, aber auch für den wirtschaftlichen Erfolg eines Landes ein. Die globalen Entwicklungen der vergangenen Jahrzehnte haben die grundlegende Bedeutung von Bildung für Deutschland noch einmal unterstrichen. Die Ausschöpfung aller Begabungspotentiale und die Sicherung und Entwicklung von Qualität im Bildungswesen sind daher zentrale Aufgaben der Bildungspolitik. (S. 1)³

Als zentrale Instrumente der Kultusministerkonferenz für das Bildungsmonitoring werden dann benannt:

– Internationale Schulleistungsuntersuchungen
– Zentrale Überprüfung des Erreichens der Bildungsstandards in einem Ländervergleich
– Vergleichsarbeiten in Anbindung oder Ankoppelung an die Bildungsstandards zur landesweiten Überprüfung der Leistungsfähigkeit einzelner Schulen
– Gemeinsame Bildungsberichterstattung von Bund und Ländern. (S. 1/2)

Ob die bisher veröffentlichten ‚Bildungsstandards‘ (s.u.!) solchen Überprüfungen und Belastungen standhalten, kann man bezweifeln. In jedem Fall wird Deutschland durch diesen Beschluss zum Testland ausgerufen und erklärt: Für die Jahre 2006 bis 2018 (!) werden in einer Tabelle 17 Testungen und 19 Berichterstattungen über diese terminiert, die sich allein durch die Teilnahme an PIRLS, TIMSS und PISA⁴ sowie die Ländervergleiche bundesweit ergeben. Dazu kommen noch die länderspezifischen und länderübergreifenden Vergleichsarbeiten in Anbindung oder Anlehnung an die Bildungsstandards in den Jahrgangsstufen 3 und 4 für Deutsch und Mathematik, in den Jahrgangsstufen 8 und 9 für den Hauptschulabschluss in Deutsch, Mathematik, Erste Fremdsprache (Englisch, Französisch) und in den Jahrgangsstufen 9 und 10 für den Mittleren Schulabschluss in Deutsch, Mathematik, Erste Fremdsprache (Englisch, Französisch), Biologie, Chemie, Physik.

³ In ihrem Tenor und Jargon erinnert eine solche Funktionsbeschreibung von ‚Bildung‘ an die entsprechenden Verlautbarungen der OECD. Überraschenderweise ist in der Überarbeitung des Textes (siehe die o.a. Broschüre) in dem angeführten Zitat das Wort ‚Bildung‘ durch ‚Das Bildungssystem‘ ersetzt, als seien diese Begriffe synonym.
⁴ PIRLS ist die Abkürzung für Progress in International Reading Literacy Study, die in Deutschland auch mit IGLU für Internationale Grundschul-Lese-Untersuchung bezeichnet wird. TIMSS war ursprünglich ein Akronym für Third International Mathematics and Science Study; seit TIMSS 2003 steht das Akronym für Trends in International Mathematics and Science Study; PISA steht für Programme for International Student Assessment.

Dass schulische Bildung in Deutschland solchem ‚Monitoring‘ nicht mehr entkommen kann, wird schließlich im letzten Abschnitt Bildungsberichterstattung gesichert:

Kern der Bildungsberichterstattung ist ein überschaubarer, systematischer, regelmäßig aktualisierter Satz von Indikatoren, d.h. statistischen Kennziffern, die jeweils für ein zentrales Merkmal von Bildungsprozessen bzw. einen zentralen Aspekt von Bildungsqualität stehen. Diese Indikatoren werden aus amtlichen Daten und sozialwissenschaftlichen Erhebungen in Zeitreihe dargestellt, wenn möglich im internationalen Vergleich und aufgeschlüsselt nach Ländern. Um den Vergleich mit Entwicklungen in den Mitgliedstaaten der Europäischen Union und der OECD zu ermöglichen, wird Anschlussfähigkeit und Kompatibilität mit internationalen Berichtssystemen (…) angestrebt. (…)

Durch die Verfügbarkeit individueller Verlaufsdaten und die regelmäßige Erfassung erworbener Kompetenzen soll die Leitidee der Bildungsberichterstattung „Bildung im Lebenslauf“ umgesetzt werden. Für einen einheitlichen Satz schulstatistischer Daten und die Sicherung der Anschlussfähigkeit an die internationale Bildungsstatistik haben die Länder bereits grundlegende Beschlüsse gefasst. So haben die Länder am 22.09.2005 vereinbart, längerfristig ihre Daten entsprechend den im Kerndatensatz vereinbarten Merkmalsausprägungen zur Verfügung zu stellen. Zumindest Daten der öffentlichen Schulen sollen für das Schuljahr 2008/2009 von allen Ländern vorliegen. (S. 14)

In der überarbeiteten Broschüre zum Bildungsmonitoring wurde der zuletzt zitierte Absatz gestrichen. Es ist aber wohl kaum davon auszugehen, dass damit auch die angestrebte Datenbank nicht eingerichtet wird.

“Teaching to the Test”

One must note that the criticism of these testing procedures, and of the idea that the 'yields' of school education could meaningfully be measured and increased through periodic tests, has not reached Germany, and that not even a careful and honest discussion of this idea, so self-evident at first glance, has taken place here. 5 That is hardly surprising, since the test institutes involved, and the researchers cooperating with them, can scarcely be expected to bring test criticism to market along with their testing know-how.

5 It is instructive, and presumably not without consequences, that the PISA data are processed in Australia and, as it were, never set foot on German (or European) soil. In an age of research that understands itself as global, such spatial distances appear to have no effect whatsoever, among other reasons because access to data servers is ubiquitous and possible without delay. Nevertheless it matters whether, and in what setting and intellectual space, the procedures for preparing the data are developed, discussed and criticised; whether they are understood as instruments that shape the results in form and content, or merely appear as more or less poorly documented routines in software packages; whether they are scientifically discussed at all or simply taken as necessary yet arbitrary essences of a 'state of the art'; whether the researchers involved care to sell their procedures or to introduce them into the discussion and legitimise them as instruments of knowledge, etc.

In the fourteen-page resolution of the Kultusministerkonferenz on educational monitoring of 02.02.2006, this problem is briefly addressed on page 13, under the subheading 'Further development of education, but no teaching to the test', as follows:

Besides the function of describing performance requirements and measuring performance, the Bildungsstandards serve primarily the further development of teaching and above all the individual support of all pupils. The Länder agree that, with the setting of the Bildungsstandards as an overarching frame of reference, a development towards 'teaching to the test', or a narrowing of instruction to the requirements of the standards, must be prevented. (p. 13)

This brevity is astonishing, notwithstanding the invoked unanimity of the Länder. If one wishes to promote or decree the 'further development of teaching and above all the individual support of all pupils' through Bildungsstandards whose attainment is checked essentially through tests, it would seem natural to take note of the experience of countries, above all the USA, that have pursued such a policy for years or decades.

For several decades, some measurement experts have warned that high-stakes testing could lead to inappropriate forms of test preparation and score inflation, which we define as a gain in scores that substantially overstates the improvement in learning it implies. (p. 99)

Thus Daniel Koretz, education researcher at Harvard University and associate director of the National Center for Research on Evaluation, Standards, and Student Testing (CRESST), opens his paper Alignment, High Stakes, and the Inflation of Test Scores 6 and describes a starting point beyond which public discussion in Germany has so far barely advanced:

One common response to this problem has been to seek "tests worth teaching to". The search for such tests has led reformers in several directions over the years, but currently, many argue that tests well aligned with standards meet this criterion. If tests are aligned with standards, the argument runs, they test material deemed important, and teaching to the test therefore teaches what is important. If students are being taught what is important, how can the resulting score gains be misleading? (p. 99)

Koretz grounds his objection to such naivety theoretically and empirically, strikingly among other things with sawtooth curves ("sawtooth pattern") for the measured achievement of the same or a comparable population, which came out in the most varied ways in different surveys depending on the tests used. He also contradicts the hope that such effects might be attributed solely to test construction and testing circumstances:

The problem is not confined to commercial, off-the-shelf, multiple-choice tests. It has appeared as well with standards-based tests and with tests using no multiple-choice items. (p. 106)

The notion that student achievement can be measured in a test objectively, or with specifiable margins of error, as it were physically, is simply misleading, and conclusions drawn from such a notion are more than questionable. Where this is denied, concealed, or the opposite pretended, massive vested interests of the commissioners or contractors of the testing are as a rule at play.

6 Koretz, D.: Alignment, High Stakes, and the Inflation of Test Scores. Yearbook of the National Society for the Study of Education (2005) 104 (2), 99–118. (Available online at: http://www.blackwell-synergy.com/doi/abs/10.1111/j.1744-7984.2005.00027.x)


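The mechanism behind such score inflation can be made vivid in a toy model. The following sketch is not Koretz's analysis and all its parameters are invented; it merely illustrates how a sawtooth can arise when gains on a familiar high-stakes form outpace real learning and vanish whenever a new form is introduced, while an uncoached audit test barely moves:

    # Toy model of the "sawtooth pattern" (invented parameters, not Koretz's data).
    TRUE_GAIN = 0.5   # assumed real learning gain per year, in score points
    INFLATION = 4.0   # assumed familiarity/coaching gain per year on a known form
    FORM_LIFE = 4     # assumed interval (years) after which a new form is introduced

    true_level = 500.0
    for year in range(12):
        years_on_form = year % FORM_LIFE           # resets to 0 with each new form
        true_level += TRUE_GAIN
        high_stakes = true_level + INFLATION * years_on_form
        audit = true_level                         # the audit test is not taught to
        print(f"year {year:2d}: high-stakes {high_stakes:6.1f} | audit {audit:6.1f}")

The high-stakes series climbs and collapses with each change of form; the persistent gap between the two series is precisely the inflation Koretz defines.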

The effects of testing on instruction have likewise been studied in the USA for decades. Koretz, for example, describes and characterises reallocation, alignment and coaching in the paper cited:

Reallocation. Reallocation refers to shifts in instructional resources among the elements of performance. Research has shown that when scores on a test are important to teachers, many of them will reallocate their instructional time to focus more on the material emphasized by the test. (...) Many observers believe that reallocation is among the most important factors causing the sawtooth pattern (...).

Alignment. Content and performance standards comprise material – performance elements, in the terminology used here – that someone (not necessarily the ultimate user of scores) has decided are important. If the material is emphasized in the standards, that implies that users should give this material substantial weight in the inferences they draw about student performance. Alignment gives this same material high weights in the test as well. (...)

Coaching. The term "coaching" is used in a variety of different ways in writings about test preparation. Here it is used to refer to two specific, related types of test preparation, called substantive and non-substantive coaching.

Substantive coaching is an emphasis on narrow, substantive aspects of a test that capitalizes on the particular style or emphasis of test items. The aspects of the tests that are emphasized may be either intended or unintended by the test designers. For example, in one study of the author's, a teacher noted that the state's test always used regular polygons in test items and suggested that teachers should focus solely on those and ignore irregular polygons. The intended inferences, however, were about polygons, not specifically regular polygons. (...)

Nonsubstantive coaching refers to the same process when focused on nonsubstantive aspects of a test, such as characteristics of distracters (incorrect answers to multiple-choice items), substantively unimportant aspects of scoring rubrics, and so on. Teaching test-taking tricks (process of elimination, plug-in, etc.) can also be seen as nonsubstantive coaching. In some cases – for example, when first introducing young children to the op-scan answer sheets used with multiple-choice tests – a modest amount of certain types of nonsubstantive coaching can increase scores and improve validity by removing irrelevant barriers to performance. In most cases, however, it either wastes time or inflates scores. (p. 110-112)

Similar criticism can be found elsewhere. Brian M. Stecher, for instance, summarises his chapter 4, Consequences of large-scale, high-stakes testing on school and classroom practices, in the book he co-edited, Making Sense of Test-Based Accountability in Education, 7 as follows:

The net effect of high-stakes testing on policy and practice is uncertain. Researchers have not documented the desirable consequences of testing – providing more instruction, working harder, and working more effectively – as clearly as the undesirable ones – such as negative reallocation, negative alignment of classroom time to emphasize topics covered by a test, excessive coaching, and cheating. More important, researchers have not generally measured the extent or magnitude of the shifts in practice that they identified as a result of high-stakes testing.

Overall, the evidence suggests that large-scale high-stakes testing has been a relatively potent policy in terms of bringing about changes within schools and classrooms. Many of these changes appear to diminish students' exposure to curriculum, which undermines the meaning of the test scores. (p. 99/100)

7 Stecher, B. M.: Consequences of large-scale, high-stakes testing on school and classroom practices. In L. S. Hamilton, B. M. Stecher, and S. P. Klein (Eds.): Making Sense of Test-Based Accountability in Education. RAND. Santa Monica 2002, pp. 79–100. (Online at: http://www.rand.org/pubs/monograph_reports/MR1554/index.html)

The antagonism addressed in the last paragraph seems to have been withheld from the German Kultusministerkonferenz, possibly by its advisers. The same presumably holds for the Position Statement on High Stakes Testing in PreK-12 Education of the American Evaluation Association (AEA), which states:

High stakes testing leads to under-serving or mis-serving all students, especially the most needy and vulnerable, thereby violating the principle of "do no harm." The American Evaluation Association opposes the use of tests as the sole or primary criterion for making decisions with serious negative consequences for students, educators, and schools. The AEA supports systems of assessment and accountability that help education.

Recent years have seen an increased reliance on high stakes testing (the use of tests to make critical decisions about students, teachers, and schools) without full validation throughout the United States. The rationale for increased uses of testing is often based on a need for solid information to help policy makers shape policies and practices to insure the academic success of all students. Our reading of the accumulated evidence over the past two decades indicates that high stakes testing does not lead to better educational policies and practices. There is evidence that such testing often leads to educationally unjust consequences and unsound practices, even though it occasionally upgrades teaching and learning conditions in some classrooms and schools. The consequences that concern us most are increased drop out rates, teacher and administrator deprofessionalization, loss of curricular integrity, increased cultural insensitivity, and disproportionate allocation of educational resources into testing programs and not into hiring qualified teachers and providing sound educational programs. The deleterious effects of high stakes testing need further study, but the evidence of injury is compelling enough that AEA does not support continuation of the practice.

While the shortcomings of contemporary schooling are serious, the simplistic application of single tests or test batteries to make high stakes decisions about individuals and groups impedes rather than improves student learning. Comparisons of schools and students based on test scores promote teaching to the test, especially in ways that do not constitute an improvement in teaching and learning. Although used for more than two decades, state mandated high stakes testing has not improved the quality of schools; nor diminished disparities in academic achievement along gender, race or class lines; nor moved the country forward in moral, social, or economic terms. The American Evaluation Association (AEA) is a staunch supporter of accountability, but not test driven accountability. AEA joins many other professional associations in opposing the inappropriate use of tests to make high stakes decisions.

An endnote to this text points to further organisations that likewise oppose basing far-reaching decisions on test results.

AEA joins many other professional associations, teacher unions, parent advocacy groups in opposing the inappropriate use of tests to make high stakes decisions. These include, but are not limited to the American Educational Research Association, the National Council for Teachers of English, the National Council for Teachers of Mathematics, the International Reading Association, the College and University Faculty Assembly of the National Council for the Social Studies, and the National Education Association. 8

For the German observer it is hard to comprehend with what expectations of improvement, if not of salvation, in whatever direction, education policy here introduces the most extensive testing programmes, while Anglo-Saxon evaluation pragmatism, certainly no shrinking violet, distances itself from such endeavours, after more than twenty years of experience, with a clarity that could hardly be surpassed.

8 American Evaluation Association (AEA): Position Statement on HIGH STAKES TESTING In PreK-12 Education. 2002. (Online at: http://www.eval.org/hst3.htm)

“Bildungsstandards”

While the resolution of the Kultusministerkonferenz on educational monitoring, as quoted above, still speaks of the Bildungsstandards serving primarily the further development of teaching and above all the individual support of all pupils, the Vereinbarung über Bildungsstandards für den Mittleren Schulabschluss 9 of the same organisation puts it more plainly and with less pedagogical veiling:

The Kultusministerkonferenz regards it as a central task to secure the quality of school education, the comparability of school-leaving qualifications and the permeability of the education system. Bildungsstandards are of particular importance here. They form part of a comprehensive system of quality assurance that also encompasses school development and external and internal evaluation. Bildungsstandards describe expected learning outcomes. Their application provides indications for necessary support and assistance measures. (p. 3)

9 Vereinbarung über Bildungsstandards für den Mittleren Schulabschluss (Jahrgangsstufe 10) (resolution of the Kultusministerkonferenz of 4.12.2003), in: Kultusministerkonferenz (KMK): Bildungsstandards im Fach Mathematik für den Mittleren Schulabschluss. Resolution of 4.12.2003.

Standards 10 and tests certify each other's necessity to such a degree that they come into being, as it were, out of pure logic: tests need standards to which, or towards which, they are geared; standards need tests to check their observance, attainment or failure.

10 Out of respect for the great German theorists of Bildung of the 18th, 19th and 20th centuries, I try to avoid the word Bildungsstandard. As if ear and mind were not already sufficiently tormented by the compound 'Bildungsstandards', the resolution of the Kultusministerkonferenz on educational monitoring speaks in numerous places of the necessary norming and re-norming of the Bildungsstandards.

A short, less logical, history of the standards in Germany was sketched by Hans Dieter Sill in 2006. 11 He comes to the conclusion:

The standards did not emerge as the result of thorough scientific analyses of international and national developments; they are the outcome of a politically motivated resolution at ministerial level that had to be implemented in a very short time. There were neither temporal nor personnel resources to shape the scientifically exceptionally demanding process of developing national standards with the necessary depth and thoroughness. (pp. 299/300)

11 Sill, H. D.: PISA und die Bildungsstandards. In: Jahnke, Th.; Meyerhöfer, W. (eds.): Pisa & Co – Kritik eines Programms. Franzbecker Verlag. Hildesheim 2006, pp. 293–330.

The results of such scarcity do not, at least arithmetically, mark the Bildungsstandards im Fach Mathematik für den Mittleren Schulabschluss adopted by the Kultusministerkonferenz on 4.12.2003. By positing six competencies that overlap without any discriminatory power or even a character of their own, five mathematical guiding ideas (Leitideen) that have been familiar for years, and three demand levels (Anforderungsbereiche), simple multiplication yields ninety different ways of labelling a task. If, as is probably to be expected in most cases, several competencies or guiding ideas are involved, several hundred such classifications result.
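The arithmetic can be checked directly. A minimal sketch in Python; the labels K1-K6, L1-L5 and I-III are placeholders of my own, not the official wording of the standards:

    from itertools import combinations

    competencies = [f"K{i}" for i in range(1, 7)]   # six general competencies
    leitideen = [f"L{i}" for i in range(1, 6)]      # five guiding ideas
    levels = ["I", "II", "III"]                     # three demand levels

    # One competency, one guiding idea, one demand level per task:
    print(len(competencies) * len(leitideen) * len(levels))   # 6 * 5 * 3 = 90

    # If a task may address any non-empty subset of the six competencies,
    # the count runs into the hundreds:
    subsets = sum(1 for r in range(1, 7) for _ in combinations(competencies, r))
    print(subsets * len(leitideen) * len(levels))   # 63 * 5 * 3 = 945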

On 24 of the brochure's 36 pages, sample tasks and solution sketches are printed, with guiding ideas and general mathematical competencies indicated and assigned to demand levels. While the classification of the tasks is hardly compelling or illuminating, rather self-evident and of negligible relevance for actually working them, the meagre quality of the tasks is alarming: they do not even begin to reach the standard of material found in recent, well worked-out and well prepared textbooks.

In task (1), the modelling is inappropriate.

In task (2), a graphic artist's blunder, presumably the result of mishandling spreadsheet software, is not addressed but simply accepted.

In task (3), a star that is not drawn symmetrically is called symmetric, and the number of its axes of symmetry is then asked for.

In task (4), the artificial question and the classification are astonishing.

In task (5), an axis labelled 'pay rise in EURO' is marked in steps of ten from 0 to 50 but at the same time divided into 30 parts, so that one subdivision corresponds to 1 2/3 € and the axis labels cannot sit on the tick marks.

In task (6), a point is denoted P(y;x), and it is then remarked that 'x is the first coordinate of the point P'.

In task (7), sub-questions c) and d) are astonishing.

In task (8), what shows above all is the strain of accommodating a guiding idea.

In task (9), question c) is opaque.

In task (10), the calculator is used in a questionable way.

In task (11), one is to engage with five pupil statements in speech bubbles that hardly anyone outside school would work through mathematically.

In task (12), questions on linear functions are treated from which little sense can be extracted.

In task (13), a modelling step would for once be possible in principle, were it not already prescribed in the text.

Task (14) laboriously combines questions that have little in common.

The tasks are throughout rather woodenly formulated, the graphics careless and faulty, the solution sketches of little help and in part wrong (gravely so, for instance, in task (3) and task (5)). No innovative stimulus emanates from such material. Why are these tasks, which are supposed to exemplify the 'Bildungsstandards für den Mittleren Schulabschluss' throughout Germany and which, beyond the palely contoured competencies, constitute their very incarnation, so full of defects? The only rational answer to this question is that the standards are not in fact about competencies, guiding ideas and demand levels, and that what is exemplary about these tasks does not refer to their content. The point is not at all to look closely at them and their possible solutions, that is, to take them seriously, but to make clear to teachers and pupils that there is now a new, administratively binding concept, namely that of standards, which is to be observed without back-talk and which tolerates no objection, whether against tests or comparative assessments and their contents. What is to be taken seriously, then, is not the tasks but the curb on which teachers and pupils alike are being reined in: you must be able to do this now, or else, be it through publication of the meagre results of the pupils, the teachers or the school, be it through other coercive measures. Now things are getting serious, and this seriousness goes by the name of standard. It may well be that the striking up of these new tones, as it were in praise of the seriousness of state educational prescriptions, suits some people, and that some profit from it, for instance as state-appointed educational researchers or test developers, but it is not mathematics didactics. The Vereinbarung über Bildungsstandards für den Mittleren Schulabschluss (Jahrgangsstufe 10) states (on page 4 of the brochure cited):

The standards and their observance will be reviewed, taking into account developments in the academic disciplines, in subject didactics and in school practice, by a scientific institution jointly commissioned by the Länder, and will be developed further on the basis of validated tests. (p. 4)

No substantive further development has taken place since; evidently there is no felt need whatsoever for a broad and deep discussion in the discipline, in subject didactics or in school practice.


Cracks in Public Authority


At the carefully staged publication of the first PISA 'results' in Germany, the media, as to a lesser extent already with the Third International Mathematics and Science Study (TIMSS), essentially did nothing but proclaim and stage horror at the performance of German pupils ('PISA shock'). The results themselves, their interpretation and the procedures used to obtain them were not subjected to even the simplest plausibility checks at the press conferences or in the accompanying reports and commentaries. A moose test that would have probed this complex apparatus of investigation even in outline, or merely the technical details of the test in schools (duration, kind and number of questions) and its curious claim to validity, never took place. All that mattered was to lament the extent of the German failure and to ponder remedies according to the commentator's stripe and organisational affiliation. Rather sobering reports from pupils and teachers directly involved in the test could occasionally be read, but such eyewitness accounts were labelled local slip-ups against the manuals and the prescribed international procedural rules, and their mention or consideration was branded unscientific. They were drowned in the drama and force of the global study. Any criticism of PISA was seen in the media merely as an unfit attempt to talk the misery of German education into something pretty, or even to deny it altogether. PISA's public authority had the media, and politics too, firmly in its grip.

With the second PISA wave, no comparable drama could be built up in the media any more. Even half-hearted attempts, pushed by education policy, to draw conclusions from the comparison of the two rounds' results about a first effect of German measures proved procedurally daring, substantively neither credible nor even plausible, and in tendency downright counterproductive. Not even the (nonsensically spectacular) Länder rankings could still be exploited, so a new debacle had to secure media attention: namely, that in Germany territorial and social origin 'strikes through' to educational opportunity to a particular degree. Here too, incidentally, simple questions went unasked: how this research finding had come about, which quantities or indicators had been measured, computed or plotted against one another, and in what way the German results exceeded or fell short of those of comparable countries. In media terms, then, this was not the result of a complex investigation whose procedures ought at least to be explained in rough outline, but a moral catastrophe on whose removal one was to work without question or delay. 12

12 The point here is by no means to deny German deficits in dealing with and schooling pupils with a migration background (and the like), but rather to question the acceptance of their fierce moralisation as an essential warrant for the raison d'être of PISA & Co.

By now the dazzle of this news has also faded. The following article shows, by way of example, that the sheer faith that the abrupt horror at the surveyed German schooling debacle would now be followed, with the same full-throated certainty and connoisseurship, by an improvement of conditions is beginning to crumble in the media.

Langer Anlauf ohne Sprung (A Long Run-Up with No Jump)

The true PISA winners are not the Finns at all. The true PISA winners sit in Berlin, Dortmund and Bielefeld. One rarely sees them in the schools. Mostly they pore over test sheets, devise examination questions or devotedly research the effects of their own research. 'We have never had so much data,' rejoices the Bielefeld education researcher Klaus Jürgen Tillmann, a member of the German PISA consortium, 'as an empirical educational researcher I am of course delighted.' There is no shortage of funding, and new research centres are being founded, such as the Institut zur Qualitätsentwicklung im Bildungswesen (IQB) at the Humboldt-Universität in Berlin. Only the object of research itself still dampens the researchers' euphoria: 'Unfortunately it does nothing for the schools,' says the educationist Tillmann.

Germany's school ministers seem to believe that the best remedy for lousy test results is, above all, testing. True, as a reaction to the PISA shock the Kultusministerkonferenz resolved on seven improvement strategies, among them language courses for migrant children, more all-day schools and targeted reading support, but so far it has consistently implemented only one of them: tests. 'There are developments in all seven areas,' says Tillmann, 'but only the central examinations have arrived in the schools across the board in all Länder.' (...)

Indeed, German schools are being evaluated, compared and inspected as never before. Even before starting school, four-year-olds must often take a German-language test; in seven Länder the third-graders then sweat over 'Vera' ('Vergleichsarbeiten') tests, and in the middle grades further comparative assessments follow in many places. In between, year after year, come international studies such as PISA, IGLU or TIMSS and, depending on the Land, surveys with imaginative names such as 'Quasum', 'Desi', 'Tosca', 'Markus', 'Ulme' or 'Lau'. (...)

'After PISA, no education minister wanted to be accused of not focusing on achievement,' explains the researcher Tillmann. 'Behind it stands the vague hope that testing will somehow make everything better.' But teaching staffs still lack the know-how to derive concepts from the flood of data. 'Something urgently needs to happen there,' says Tillmann, 'otherwise the whole thing remains a long run-up with no jump.' (...)

North Rhine-Westphalia's education minister Sommer believes her Land to be particularly far along the way to drawing sensible lessons from the many tests. She extols her school system as the 'most modern in Germany'. NRW thus intends to be the first Land to introduce school rankings, still within this legislative period. At the same time, parents on the Rhine and Ruhr can now choose where to enrol their child, and may well take their bearings from the lists. 'We want fair competition,' says Sommer.

Yet it is precisely here that many researchers see the greatest danger of the new testing culture: 'If schools only look at their places on the list, no school development takes place at all any more,' warns Wilfried Bos, head of the Dortmund Institut für Schulentwicklungsforschung. Fair rankings, taking account for instance of the social background of the pupil body, are hardly possible when, as with Vera, asking about the children's origin and family is not even permitted.

Moreover, last year's award for the best Vera results already proved a flop: many schools had cheated their way to good results by rehearsing the test items with their pupils beforehand (SPIEGEL 27/2006). 'Once there are real rankings,' believes headteacher Borns from Münster, 'there will be far more cheating.'

Julia Koch in Der SPIEGEL 24/2007

Such articles will presumably shake and erode PISA's public authority in Germany more than any scholarly critique of the study's methods and procedures, which can be dismissed as an insignificant intra-scientific squabble stirred up by laypeople, hardly reaches the public, and cannot unsettle an education policy that has, as it were, sworn itself to PISA.

References

American Evaluation Association (AEA): Position Statement on HIGH STAKES TESTING In PreK-12 Education. 2002. (Online at: http://www.eval.org/hst3.htm)

Kultusministerkonferenz (KMK): Gesamtstrategie der Kultusministerkonferenz zum Bildungsmonitoring (Beschlüsse der Kultusministerkonferenz vom 02.06.2006). (Online at: http://www.kmk.org/aktuell/Gesamtstrategie%20Dokumentation.pdf)

Kultusministerkonferenz (ed.), in cooperation with the Institut zur Qualitätsentwicklung im Bildungswesen: Gesamtstrategie der Kultusministerkonferenz zum Bildungsmonitoring. Berlin 2006. (Online at: http://www.kmk.org/schul/Bildungsmonitoring_Brosch%FCre_Endf.pdf)

Kultusministerkonferenz (KMK): Bildungsstandards im Fach Mathematik für den Mittleren Schulabschluss. Beschluss vom 4.12.2003. (Online at: http://www.kmk.org/schul/Bildungsstandards/Mathematik_MSA_BS_04-12-2003.pdf)

Koch, Julia: Langer Anlauf ohne Sprung. Der SPIEGEL 24/2007.

Koretz, D.: Alignment, High Stakes, and the Inflation of Test Scores. Yearbook of the National Society for the Study of Education (2005) 104 (2), 99–118. (Online at: http://www.blackwell-synergy.com/doi/abs/10.1111/j.1744-7984.2005.00027.x)

Sill, H. D.: PISA und die Bildungsstandards. In: Jahnke, Th.; Meyerhöfer, W. (eds.): Pisa & Co – Kritik eines Programms. Franzbecker Verlag. Hildesheim 2006, pp. 293–330.

Stecher, B. M.: Consequences of large-scale, high-stakes testing on school and classroom practices. In: L. S. Hamilton, B. M. Stecher, and S. P. Klein (eds.): Making Sense of Test-Based Accountability in Education. RAND. Santa Monica 2002, pp. 79–100. (Online at: http://www.rand.org/pubs/monograph_reports/MR1554/index.html)


PISA in Austria: Media Reactions, Public Assessments and Political Consequences

Dominik Bozkurt, Gertrude Brinek, Martin Retzl

Austria: Universität Wien

Abstract:

After juxtaposing the results of the two PISA test rounds of 2000 and 2003, this contribution presents the public media reactions as well as the reactions of the political organisations and the education-policy consequences they drew from PISA. The aim is to make clear how the results were received and interpreted in public discourse and which political mandates for action were derived from them. Agreements and divergences between public and political reactions and the official results are discussed, and shifts between the reactions to the PISA results of 2000 and of 2003 are made visible. The media analysis and the education-policy assessment show the density of the resonance as well as the kind and degree of societal 'agitation', not only in the scientific community.

1 Austrian PISA Results

This chapter presents and compares the official results of PISA 2000 and PISA 2003 for Austria, as published by the members of the Austrian PISA consortium in various publications. These results were received rather uncritically by the public. That they do not, or only partially, live up to their claim is shown by scholarly contributions published in Germany, for example, as well as by the various contributions in this volume. The statements and reactions of media and politics, however, were made on the basis of precisely these results. For our purposes, drawing on them therefore seems sensible and appropriate.

1.1 The Results of the PISA 2000 Study

In December 2001 the OECD published the results of PISA 2000, a test round in which 31 states, Austria among them, had taken part. PISA (Programme for International Student Assessment) surveys, in a three-year cycle, the reading comprehension and the basic mathematical and scientific knowledge of 15-/16-year-old pupils (cf. Reiter, Haider 2002).

The first test domain of the PISA 2000 study comprised the 'reading competence profile' (ibid., 13), in which the young people had to understand, use and reflect on the contents of written texts (cf. ibid., 13). To measure these attributes, the OECD employed 129 test items, which in turn were grouped into 'five ascending levels of reading competence' (ibid., 13). About 9 % of the Austrian 15-/16-year-old pupils were to be assigned 'to the top competence level', and around 14 % to the very poor readers 'of the two lowest levels' (ibid., 13).

In reading competence, the focus domain of PISA 2000 (cf. ibid., 21), the Austrian participants achieved 507 points and thus place 10 among the 27 OECD states tested (cf. Haider, Reiter 2004, 77). Austria thus lay 'just above the OECD average of 500' points in this domain (cf. Reiter, Haider 2002, 13).

In the course of PISA 2000, the mathematical knowledge of the 15-/16-year-old pupils was also measured. To ascertain it, the tested youths had to demonstrate their ability in the most varied areas, e.g. 'problem solving' and 'modelling' (ibid., 21). Austria achieved '515 points' (ibid., 21) in mathematical competence and thus 11th place among 27 OECD states (cf. Haider, Reiter 2004, 63). It should be noted, however, that the results within Austria vary greatly on account of the strongly differentiated school system here: pupils of the Allgemeinbildende Höhere Schule achieved 565 points overall, while the youths of the Allgemeine Pflichtschulen reached only '438 points' (Reiter, Haider 2002, 23) in the mathematics ranking.

In the third and final test domain of PISA 2000, the science competencies of the Austrian pupils were tested. Particular attention was paid to the recognition of scientific questions and the application of scientific knowledge. The pupils were called upon, among other things, 'to distinguish claims supported by evidence from mere opinions' (ibid., 29). Haider points out that this test domain centres on recognising scientific questions, applying scientific knowledge and drawing conclusions from evidence (cf. ibid., 29). In science, the Austrian pupils achieved a total of '519 points' (cf. ibid., 29) and thus eighth rank among the 27 OECD states taking part in the test round (cf. Haider, Reiter 2004, 89).

In summary, Austria did best in science, with 519 points and rank 8. In mathematics, 515 points were achieved, and thus rank 11. In reading, the fewest points, 507, were scored, which nonetheless yielded 10th rank among all OECD states. The performances of the Austrian youths, above the OECD average in every domain, meant that Austria took 10th place in the overall ranking of PISA 2000 and thus ranged in the front third of the study.

1.2 The Results of the PISA 2003 Study

At the end of 2004 the results of PISA 2003 were published. The focus of the PISA 2003 round, in which as many as 41 states took part, of which however only 40 were included in the scoring (cf. Haider, Reiter 2004, 18), lay on the mathematical competencies, which were declared the 'main domain' (ibid., 13). In the 'minor domains', reading competence and scientific knowledge were again examined. For the first time, the problem-solving competencies of the 15-/16-year-old pupils were also investigated (cf. ibid., 13).

As the focus discipline, the tested domain of mathematics comprised two thirds of all PISA 2003 test items. Officially, the Austrian pupils achieved 506 points in solving mathematical tasks and problems, and thus 15th rank among 29 OECD states (cf. Haider, Reiter 2004, 63). However, two not insignificant problems emerged when comparing the PISA 2000 results in mathematics with those of the 2003 round. Haider points out that in PISA 2000 'only two of the four mathematical sub-areas tested in PISA 2003' (ibid., 45) were examined, and hence only these two areas are directly comparable. For the mathematical areas of 'uncertainty' 1 and 'quantity' 2, newly created for PISA 2003, there is accordingly no possibility of comparison (cf. ibid., 45).

1 The mathematical subgroup uncertainty comprises 'tasks and the presentation of data as well as probabilities, uncertainties and inferences' (Haider, Reiter 2004, 52).

2 The area of quantity covers those mathematical tasks 'that deal with numerical phenomena and patterns as well as quantitative relationships' (ibid., 53).

Reading competence of the 15-/16-year-old pupils was examined anew in PISA 2003. Officially, the Austrian participants achieved, with 491 points, 19th rank 'within the 29 OECD states' and 22nd rank among all 40 'PISA participant states', whereby Austria does 'not differ significantly' from the OECD average of 494 points (cf. ibid., 76). Haider sums up that in reading competence Austria fell back by 9 ranks, or 16 points, in the OECD comparison at PISA 2003 and thus performed significantly worse than three years before. He relativises this decline, however, for 'taking into account the shared ranks according to statistical bandwidth, this means: PISA 2000: ranks 10-16; PISA 2003: ranks 12-21' 3 (ibid., 77).

3 Cf. the information page: Wichtige Informationen zur Interpretation der Ergebnisse. In: Haider, Reiter 2004, 43.

As a further minor domain of PISA 2003, the scientific skills of the youths were tested. The 15-/16-year-old pupils had to demonstrate their ability and show to what extent they could apply 'the relevant physical, chemical or biological subject knowledge' to 'practical problem solutions', represent problems and argue for proposed solutions (cf. ibid., 78).

In this domain Austria stands, with 491 points, at 20th rank within the 29 OECD states, provided one considers only 'the point estimate of the mean' (cf. ibid., 79; 89). It must be noted, however, that once the confidence interval is taken into account (only part of the Austrian pupil population was actually covered in the course of PISA), the Austrian pupils reach ranks 16 to 23 among the 29 OECD states (cf. ibid., 79).


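Why a single point estimate ('20th rank') widens into a range ('ranks 16 to 23') is a direct consequence of sampling error. A minimal sketch with invented country scores (not the published PISA data), using the simplified rule that one country can be placed ahead of another with confidence only if their 95 % intervals do not overlap; the OECD's own procedure tests score differences, but the effect is the same:

    # Invented example scores; PISA reports a mean and a standard error per country.
    Z = 1.96  # 95 % confidence level
    countries = {  # name: (mean score, standard error of the mean)
        "A": (506, 3.3), "B": (503, 2.9), "C": (499, 3.5),
        "D": (495, 3.1), "E": (491, 3.4), "F": (484, 3.0),
    }

    def interval(name):
        mean, se = countries[name]
        return (mean - Z * se, mean + Z * se)

    for name in countries:
        lo, hi = interval(name)
        surely_above = sum(1 for o in countries if o != name and interval(o)[0] > hi)
        surely_below = sum(1 for o in countries if o != name and interval(o)[1] < lo)
        print(f"{name}: 95% CI [{lo:.1f}, {hi:.1f}], "
              f"rank range {surely_above + 1} to {len(countries) - surely_below}")

With only part of the population tested, every mean carries such an interval, and the honest statement is a band of attainable ranks rather than a single place.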

This cannot, however, obscure the fact that in science Austria fell back from 8th rank (among 27 OECD states) and 519 points in PISA 2000 to 20th place (of 29 OECD states) with 491 points in the 2003 study (cf. ibid., 89).

The three previous PISA test domains (mathematics, reading, science) were extended in the 2003 study by 'problem-solving competence'. Here pupils were asked to confront 'non-routine (novel) problems' (ibid., 90); in the course of this, proposed solutions were to be prompted by processes of reasoning. In problem solving, the 15-/16-year-old pupils from Austria achieved 506 points, placing them above the OECD average of 500 points and (without taking the confidence interval 4 into account) in 15th place among 29 OECD countries (cf. ibid., 90; 91).

4 Taking the confidence interval into account, Austria would achieve 'ranks 13 to 17 among 29 OECD states' (ibid., 91).

It remains to be noted that at PISA 2003 the Austrian pupils produced their best performance in mathematics, with 506 points. In reading and science, only 491 points each were achieved. In 'problem solving', 506 points were scored. In all test domains, the Austrian 15-/16-year-old pupils thus occupied placements in the middle or rear third.

1.3 Corrected Main Results

The PISA results cited above, and their interpretations, do not withstand every scrutiny. If the corrected main results published by Neuwirth et al. are drawn into the comparison of the PISA results of 2000 and 2003, the results achieved by Austria in 2003 are put into perspective considerably. Neuwirth reports that on 'closer inspection of the data material', 'inconsistencies' were soon identified in the Austrian data (cf. Neuwirth et al. 2004, 11). The 'crash' read off from the published PISA 2003 data thus did not take place; the point is rather that the PISA data of 2000 and 2003 are not directly comparable (cf. ibid., 62ff). Neuwirth gives the following reasons:

– in PISA 2000 the participation of female pupils was higher than that of male pupils (cf. ibid., 11)

– the achievement results of Austrian vocational-school pupils (BerufsschülerInnen) were weighted less heavily in PISA 2000 than in PISA 2003. In a direct, i.e. unreflected, comparison, the second study shows a worsening of results, since vocational-school pupils lie at the 'lower end of the performance spectrum' (cf. ibid., 28ff); a small numerical sketch of this weighting effect follows below.
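How a mere change of weighting can masquerade as a performance decline is easy to see numerically. The subgroup means and sample shares below are invented for illustration, not the Austrian figures; note that nothing about any subgroup's performance changes between the two rounds:

    # Invented subgroup means and sample shares; not the Austrian PISA data.
    strata = {"academic": 540, "general": 470, "vocational": 430}

    def national_mean(shares):
        # shares: weight of each stratum in the weighted sample, summing to 1
        return sum(strata[s] * w for s, w in shares.items())

    shares_2000 = {"academic": 0.40, "general": 0.45, "vocational": 0.15}
    shares_2003 = {"academic": 0.35, "general": 0.40, "vocational": 0.25}

    print(national_mean(shares_2000))  # 492.0
    print(national_mean(shares_2003))  # 484.5

The 7.5-point 'drop' here is produced entirely by the heavier weighting of the weakest stratum, which is the kind of artefact Neuwirth describes.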

On the basis of these findings, Neuwirth states that 'the achievements of the Austrian pupils in reading have hardly changed, and both the reading values and the mathematics values lie close to the OECD average' (ibid., 62). Neuwirth concedes, however, that in science 'a clear decline in the Austrian values is discernible' (cf. ibid., 62).

Additional difficulties in comparing the achievement data of PISA 2000 and 2003 arise from the fact that some test areas (as in mathematics, for example) were newly created for PISA 2003, so that no comparison is possible for them (cf. ibid., 63). It should further be mentioned that considerable latitude of interpretation attaches to the PISA 2003 results once the confidence interval is (as it must be) taken into account (see above).

2 Public Media Reactions to the PISA Results

After this introductory account of the PISA results for Austria in international comparison, the public reactions to the results are now discussed. This matters not least because, as Uljens too recalls in this volume, PISA aims primarily at promoting competition (in the education sector) between the participant states and at promoting uniform educational standards in the participating nations, while entirely refraining from any explanation of how the differing results in the various countries come about. That task is left to the governments, the school systems and the media of the respective states (cf. Uljens in this volume). In Austria this has led to a media and political flood of PISA explanations which, though understandable in view of the 'Pisan' reticence with offers of explanation and interpretation, broke over the population shaped by preconceived convictions and correspondingly undifferentiated.



In what follows, the media reactions to Austria's performance in the two PISA rounds will therefore be traced through the most widely read daily newspapers in Austria (Kronen Zeitung, Kurier, Standard, Presse, Kleine Zeitung, Oberösterreichische Nachrichten, Salzburger Nachrichten, Tiroler Tageszeitung, Vorarlberger Nachrichten, Wirtschaftsblatt, Neues Volksblatt and Wiener Zeitung), which together could claim a net reach of about 75 % in 2001 and about 74 % in 2004. At least one of the newspapers named was read daily by roughly three quarters of the Austrian population over 14 (cf. Mediaanalyse 2007, 2007a). In addition, the original press releases of the Austria Presse Agentur (APA) on the subject of the PISA study following the publication of the results are presented, analysed and interpreted. The results of the two PISA rounds were each published at the beginning of December of the calendar year following their administration, i.e. in December 2001 and December 2004.

The period of presentation and observation accordingly extends in each case from the date of publication to 31 January 2002 and 16 January 2005, respectively. Examined in these periods are the articles in the daily newspapers and, for roughly three months after publication, the press releases of the APA that address, among other things, the Austrian results of the study. The periods were chosen because the strongest reactions are to be expected in the first weeks after publication of the results, though it must be recorded that the topic of the PISA study (or studies) was taken up in public again and again afterwards as well. The criteria by which the media reports are examined are the following:

– number of articles and press releases on the PISA studies in the periods in question
– author of the piece and persons given a voice in it
– assessment of the Austrian results in the piece (positive-neutral-negative)
– attribution of causes in the piece (who/what is to blame, is responsible)
– measures demanded in the piece (what must be done)
<strong>–</strong> Geforderte Maßnahmen im Beitrag (was muss getan werden)



2.1 The Reactions of the Daily Newspapers to the PISA Results of 2000 and 2003 Compared

The articles in these daily newspapers were retrieved from the papers' electronic archives and from the collection of the Austrian National Library, were made available by the newspapers themselves, or come from the collection of newspaper reports on schooling kept by the former Federal Ministry of Education, Science and Culture. Included were all articles that appeared between the official publication of the first PISA results on 4 December 2001 and 31 January 2002, or between the official publication of the second PISA results on 6 December 2004 and 16 January 2005 inclusive, that contain the word PISA, OECD or STUDIE, and that relate in some way to the Austrian PISA results. The categories into which the causes or reasons given for Austria's performance and the solutions or measures demanded in the newspapers were sorted are largely taken over from Schwarzgruber, who had already carried out an intensive analysis of the coverage of the 2003 results in Austrian daily newspapers. His categories are also suitable, in substance, for the newspaper reports on PISA 2000 and thus permit a sound comparison with the reports on PISA 2003.
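Stated as a procedure, the inclusion rule combines a date window, a keyword match and a topical check. The following Python sketch is purely illustrative (the record fields "published", "text" and "refers_to_austria" are hypothetical stand-ins; the actual screening was done by hand against the archives), but it makes the three conditions explicit:

```python
from datetime import date

# Hypothetical record fields; the real screening of the archives was manual.
WINDOWS = {
    2000: (date(2001, 12, 4), date(2002, 1, 31)),  # first PISA results
    2003: (date(2004, 12, 6), date(2005, 1, 16)),  # second PISA results
}
KEYWORDS = ("PISA", "OECD", "STUDIE")

def include(article: dict, wave: int) -> bool:
    """Keep an article if it falls within the wave's window, mentions one
    of the keywords, and refers to the Austrian PISA results."""
    start, end = WINDOWS[wave]
    in_window = start <= article["published"] <= end
    has_keyword = any(k in article["text"].upper() for k in KEYWORDS)
    return in_window and has_keyword and article["refers_to_austria"]
```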

The number of articles that the above newspapers devoted to the results differs strikingly between the two test years. As Table 1 shows, a total of 36 reports appeared in these papers in reaction to the first PISA results within the stated observation period, but 231 in reaction to the second (Schwarzgruber 2006, 69). If one takes into account that the observation window for reactions to the first PISA wave was two weeks longer, the real difference is even larger: the second PISA results received more than six times as much coverage in the Austrian daily newspapers as the first.
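The size of this gap can be checked with simple arithmetic. A minimal sketch using the window boundaries stated above: the raw ratio is 231/36, roughly 6.4, and once the longer first window is taken into account, the per-day reporting rate after PISA 2003 is roughly nine times that after PISA 2000:

```python
from datetime import date

articles = {2000: 36, 2003: 231}
days = {
    2000: (date(2002, 1, 31) - date(2001, 12, 4)).days + 1,  # 59 days
    2003: (date(2005, 1, 16) - date(2004, 12, 6)).days + 1,  # 42 days
}
print(articles[2003] / articles[2000])   # ~6.4  (raw ratio)
rate = {year: articles[year] / days[year] for year in articles}
print(rate[2003] / rate[2000])           # ~9.0  (per observation day)
```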

Table 1: cf. Schwarzgruber 2006, 69; Bozkurt/Brinek/Retzl 2007

Table 2: cf. Schwarzgruber 2006, 72; Bozkurt/Brinek/Retzl 2007

In half of the 36 newspaper articles on the PISA 2000 results (Tab. 2), journalists alone report. Politicians are quoted in every fifth report and scientists in every twelfth; 22.2% of the articles refer to other persons. In the 231 articles on the PISA 2003 results, by contrast, 45.5% reproduce the views of politicians, and only in every fifth report do journalists alone take a position. Scientists, at 16.9%, are cited proportionally twice as often as in the reports on PISA 2000. Representatives of industry appear in 2.5% of the articles, and comments by other persons are found in almost every seventh article.

Table 3: cf. Schwarzgruber 2006, 76; Bozkurt/Brinek/Retzl 2007

69.4% of the articles on PISA 2000 rate the Austrian results positively. Every fourth article is neutral towards them, and in only 5.6% of the articles are the results interpreted negatively. Austria's performance in PISA 2003, by contrast, is viewed negatively in half of the articles; the other half remain neutral. Not a single item found anything positive in the results. This shows clearly how polarised the presentation of the PISA results in the daily press was: it can fairly be described as a positive, at times euphoric attitude towards PISA 2000 and an atmosphere of catastrophe after PISA 2003.

Eight of the 36 reports (22%) on the PISA 2000 results put forward presumed causes. Of the 231 reports on the PISA 2003 results, 106 (46%) name causes or reasons for Austria's performance. A single report may name several causes; 10 cause mentions can be counted for the PISA 2000 results and 156 in total for the PISA 2003 results (cf. Schwarzgruber 2006, 85f). For 2003, then, every second report speculated about possible causes; for 2000, barely every fifth.
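Because several categories can be assigned to a single report, the unit of counting matters: reports naming at least one cause and cause mentions are tallied separately. A minimal sketch of this coding step, assuming hand-coded records with a hypothetical "causes" field:

```python
from collections import Counter

# Illustrative only: "causes" stands in for the hand-assigned
# Schwarzgruber categories (several per report are possible, which is
# why mentions can exceed the number of reports).
def tally_causes(reports: list) -> tuple:
    mentions = Counter()
    reports_with_causes = 0
    for report in reports:
        if report["causes"]:
            reports_with_causes += 1
            mentions.update(report["causes"])
    return reports_with_causes, mentions

# For the 2003 wave, e.g., the text reports 106 of 231 articles naming
# causes, with 156 cause mentions in total.
```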


Table 4: cf. Schwarzgruber 2006, 86; Bozkurt/Brinek/Retzl 2007

Held chiefly responsible for the thoroughly positively rated results of 2000 are the teachers, the school and the school system, as well as politics. For the negatively rated results of 2003, responsibility is attributed mainly to the school system (40 mentions), followed by politics (27) and the teachers or the school (24 mentions). Migrants, parents and reading ability in general are also frequently named as causes. Pupils and their performance or performance behaviour are considered as a cause with nine mentions, and are thus held responsible for the results comparatively rarely.

Solutions or measures were demanded in 19 (52.8%) of the reports on PISA 2000 and in 193 (83.5%) of the reports on PISA 2003, with 28 mentions of measures for PISA 2000 and 493 for PISA 2003 (cf. Schwarzgruber 2006, 111). Schwarzgruber's category system was extended by three categories, since the measures demanded after the PISA 2000 results could not all be assigned to the PISA 2003 categories (Tab. 5). The demands for more tests and evaluation and for a change of policy or more budget were each mentioned three times in the reports on PISA 2000, but never in the reports on PISA 2003. Measures such as more autonomy for schools, language courses and pre-school programmes, conversely, were named and demanded only in the reports on PISA 2003, not in those on PISA 2000.

Table 5: cf. Schwarzgruber 2006, 112; Bozkurt/Brinek/Retzl 2007

The demands voiced most often after PISA 2003 are general calls for reform (101 mentions), the comprehensive school (96 mentions) and, clearly less often but still frequently, the improvement of teaching and teaching quality (56 mentions). After PISA 2003 there were also 49 calls for the all-day school, 44 for changes in teacher education and in-service training, 34 for improving reading ability (literacy, reading tests) and 14 for a neutral analysis of the results. The measures demanded most often after PISA 2000 are, with five mentions each, the improvement of teaching and teaching quality and the improvement of reading ability (literacy, reading tests). General reform proposals were made three times after PISA 2000, and the comprehensive school was named twice as a suitable measure. The all-day school, a neutral analysis and the improvement of teacher education and in-service training were each mentioned once even after PISA 2000.

2.2 The reactions in the press releases of the Austria Presse Agentur (APA) to the PISA results of 2000 and 2003

The press releases analysed below come from the APA-OTS online archive; they contain the words PISA, OECD or STUDIE and establish a link to the Austrian PISA results by reporting on possible causes of the results or on measures and changes following from them, or by evaluating the results. APA Originaltext-Service GmbH (OTS) distributes press releases verbatim, with the sender bearing responsibility for the content (cf. APA-OTS 2007). These releases reach more than 650 Austrian newsrooms and press offices (all Austrian daily newspapers except the Kronen Zeitung, public and private television and radio, periodicals, publishers, international news agencies, ministries and press offices, politics, organisations and interest groups, and many more), 7,600 professional users of the platform APA OnlineManager (AOM), 12,500 subscribers to the mailing list APA-OTS Mailabo, around 15,000 users of the APA Online Pressespiegel and customer-specific selections, as well as web portals and WAP services (cf. APA-OTS 2007a). The reactions to the first PISA wave are examined between the publication of the first PISA results on 4 December 2001 and 1 March 2002. The reactions to the results of the second PISA wave are examined from as early as 1 December 2004 until 1 March 2005, because the PISA results became known before the official publication on 6 December 2004, so that heated media discussions had already broken out at the end of November.

Table 6: Bozkurt/Brinek/Retzl 2007


The trend already noted, namely that the second PISA wave attracted far more public interest than the first, is confirmed by the press releases as well: within the stated periods there were more than five times as many press releases in reaction to the second PISA results as to the first.

Table 7: Bozkurt/Brinek/Retzl 2007

Of the 14 releases on the PISA 2000 results, 64.2% come from politicians; of the 77 releases on the PISA 2003 results, as many as 87%. In both observation periods SPÖ politicians were heard most often, ahead of ÖVP politicians; for PISA 2003, more than half of all releases refer to SPÖ politicians, while only 15.6% reflect the views of ÖVP politicians. Reactions from teachers', pupils' and parents' associations, as well as from social organisations and interest groups, appear in both observation periods. In addition, the press releases reproduce opinions of industry representatives on PISA 2000 and the views of scientists on PISA 2003.

Table 8: Bozkurt/Brinek/Retzl 2007

Table 8 shows that the evaluations of the PISA results of 2000 and 2003 vary considerably. In 71.4% of the press releases on PISA 2000 the results are judged positively, and in only 7.1% negatively; no rating, or a neutral one, is given in 21.4% of the releases on PISA 2000. The picture is quite different for the PISA 2003 results: not a single release judges the 2003 results positively, while almost half judge them negatively, and more than half of the releases on PISA 2003 contain evaluative statements.

As causes of the largely positively rated PISA 2000 results, SPÖ politicians name the teachers once and the SPÖ governments (before the ÖVP-FPÖ coalition) twice. ÖVP politicians, in turn, credit Federal Minister Gehrer and the ÖVP-FPÖ government once, the teachers twice, and the differentiated school system and its permeability once. Representatives of teachers', pupils' and parents' associations, by contrast, see the cause of the negatively or neutrally rated PISA 2000 results once in Federal Minister Gehrer (and her austerity policy) and once in the selective school system (early division into strong and weak pupils, SS, HS, AHS; selection at the age of ten; an ossified system). No other group names a cause for the results of the first PISA tests in the releases examined.

Table 9: Bozkurt/Brinek/Retzl 2007

Table 10: Bozkurt/Brinek/Retzl 2007

The predominantly negatively rated PISA 2003 results are attributed by SPÖ politicians 21 times to Federal Minister Gehrer and the austerity policy of the ÖVP-FPÖ government (cuts in lessons, staff etc.). Three times SPÖ politicians send out the message that the selective school system, the division into strong and weak (HS-SS-AHS), the separation at the age of ten and the ossified system are to blame for the results; twice they name the teachers. ÖVP politicians say hardly anything about the causes of the 2003 results. Politicians of the Greens likewise name Federal Minister Gehrer and the austerity policy of the ÖVP-FPÖ government as a negative factor in one release, and in two of the 77 releases they deplore the selective school system, the large amount of time school consumes, the division into strong and weak (SS-HS-AHS), the separation at ten and the ossified system. FPÖ politicians name as causes once the SPÖ governments before the ÖVP-FPÖ coalition and once the migrants. Teachers', pupils' and parents' associations name Federal Minister Gehrer and the austerity policy of the ÖVP-FPÖ government in three releases. Representatives of social organisations and interest groups see causes of the mediocre PISA 2003 results in the selective school system and the large amount of time school consumes, in the division into strong and weak (HS-SS-AHS), the separation at ten and the ossified system. Scientists do not comment on possible causes or reasons for the PISA 2003 results.

2.3 Measures demanded in reaction to PISA 2000 and PISA 2003

Demands in the APA-OTS releases. Cells give mentions after PISA 2000 / after PISA 2003; – = none.

| Demand | SPÖ | ÖVP | Grüne | FPÖ | Ind./Wiss. | L-S-E | soz.Org./Int.v. | total |
| Abolition of achievement tests | 1/– | –/– | –/– | –/– | –/– | 1/– | –/– | 2/– |
| ÖVP-FPÖ government (BM Gehrer) | 3/2 | –/– | 1/– | 1/– | –/– | 1/– | –/– | 6/2 |
| All-day school | –/13 | –/2 | –/– | –/1 | –/– | –/3 | 1/1 | 1/20 |
| Comprehensive school | –/14 | –/– | –/2 | –/2 | –/– | 1/2 | 1/1 | 2/21 |
| Support (Förderung) | 1/10 | 1/1 | –/– | –/2 | –/– | –/– | 1/1 | 3/14 |
| Teacher education and in-service training | 1/2 | –/1 | –/– | –/1 | –/– | –/2 | 1/– | 2/6 |
| Infrastructure | 1/– | –/– | –/– | –/– | –/– | –/– | –/– | 1/– |
| Tests, evaluation | –/2 | 1/– | –/– | 1/– | 1/– | –/– | –/– | 3/2 |
| No tuition fees | –/– | –/– | –/– | –/– | –/– | –/– | 1/– | 1/– |
| Technical education | –/– | –/– | –/– | –/– | 1/– | –/– | –/– | 1/– |
| Joint reforms | 1/10 | –/7 | 1/– | –/3 | –/1 | –/2 | –/– | 2/23 |
| No school-structure debate | –/– | 2/1 | –/– | –/– | –/– | –/– | –/– | 2/1 |
| School autonomy | –/3 | –/– | –/– | –/– | –/– | –/– | –/1 | –/4 |
| Upper-secondary reform | 1/– | –/– | –/– | –/– | –/– | –/– | –/– | 1/– |
| SPÖ education programme | 1/7 | –/– | –/– | –/– | –/1 | –/– | –/– | 1/8 |
| Zukunftskommission | –/7 | –/1 | –/– | –/– | –/– | –/– | –/1 | –/9 |
| School partnership | –/– | –/2 | –/– | –/– | –/– | –/1 | –/– | –/3 |
| Curriculum reform | –/– | –/– | –/– | –/– | –/– | –/1 | –/– | –/1 |
| No marks | –/1 | –/– | –/– | –/– | –/1 | –/1 | –/– | –/3 |
| No adoption of another model | –/1 | –/– | –/1 | –/– | –/– | –/– | –/– | –/2 |
| Abolition of the two-thirds majority for school legislation | –/4 | –/– | –/– | –/2 | –/– | –/– | –/– | –/6 |
| New approaches instead of old ones | –/– | –/2 | –/– | –/– | –/– | –/– | –/– | –/2 |
| Links between school and working life | –/2 | –/1 | –/– | –/2 | –/– | –/– | –/– | –/5 |
| other | –/17 | 1/7 | –/2 | –/4 | –/1 | –/1 | 1/1 | 2/33 |
| total | 10/95 | 5/25 | 2/5 | 2/17 | 2/4 | 3/13 | 6/6 | 30/165 |

(Ind./Wiss.: industry representatives spoke up only after PISA 2000, scientists only after PISA 2003; L-S-E: teachers', pupils' and parents' associations; soz.Org./Int.v.: social organisations and interest groups.)

Table 11: Bozkurt/Brinek/Retzl 2007
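Read column-wise, Table 11 is a simple cross-tabulation of coded demand records with marginal totals. A schematic sketch of how such a table can be accumulated (the record format is hypothetical; the actual coding was done by hand):

```python
from collections import Counter

def crosstab(records):
    """records: iterable of (demand, group, year) triples,
    one per coded demand mention."""
    cells = Counter()
    for demand, group, year in records:
        cells[(demand, group, year)] += 1
        cells[(demand, "total", year)] += 1   # row totals
        cells[("total", group, year)] += 1    # column totals
        cells[("total", "total", year)] += 1  # grand total per year
    return cells

# e.g. one coded record:
# crosstab([("Gesamtschule", "SPÖ", 2003)])[("total", "SPÖ", 2003)] == 1
```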

Table 11 lists the demands voiced after PISA 2000 and PISA 2003 in the APA original-text releases under study, and also shows who raised a given demand and how often. It is thus readily apparent that the various parties, associations, organisations and interest groups often raise different demands which, hardly surprisingly, are very much to be understood in terms of the ideology or interests of the respective group.

The absence of professionally sound interpretations and scientific conclusions from the PISA consortium contributed substantially to conclusions and assessments being left, in many cases, to the respective representatives or "advocates" of ready-made opinions. Obvious political opponents are therefore easy to recognise by the contrariness of their demands (cf. the next section of this chapter). Pre-formed, ideologically nourished convictions and plans were thereby consolidated, and hardly any side offered occasion or impetus for rational argument; the relevant scholarship, too, largely refrained from doing so. On the whole, the 14 press releases after the publication of the PISA 2000 results contain a total of 30 demanded measures, the 77 press releases after the publication of the PISA 2003 results a total of 165. Representatives of industry speak up only in the releases on PISA 2000, scientists only in those on PISA 2003.

The demand voiced most often in 2000, with six mentions, is that Minister Gehrer and the federal government should "take their hat", "not rest on their laurels", show more commitment, stop the cuts and work out reforms. Demanded three times each were:

– (early) reading and language support (a pre-school year for all) and support for the gifted (from kindergarten on), reading tests in primary school and support for pupils at risk (vocational schools), and
– tests, benchmarking, performance comparisons, quality management, evaluation.

Notably, the abolition or reduction of achievement tests is also seen twice as a suitable path. Mentioned twice each as demands are furthermore:

– the comprehensive school, the merging of school types and no selection at the age of ten
– teacher education and in-service training: for instance, diagnosis and therapy of reading difficulties, raising training to academic level (university or PH)
– joint reforms (government together with the opposition), a parliamentary education enquete, a crisis summit (precise data analysis plus research into causes), a look at other countries (Finland)
– no school-structure debate, but retention of the differentiated system and improvement of teaching.

Demanded once each in reaction to the PISA 2000 results are:

– the all-day school
– the improvement of school infrastructure (computers)
– the abolition of tuition fees
– more educational provision in the technical field (HTL, FH)
– an upper-secondary reform
– the realisation of the SPÖ education programme.

The demands voiced most often after PISA 2003 are, with 23 mentions, "joint reforms (government together with the opposition), a parliamentary education enquete, a crisis summit (precise data analysis plus research into causes), a look at other countries (Finland)". Close behind follow, with 21 mentions, the demand for "a comprehensive school, the merging of school types and no selection at the age of ten" and, with 20 mentions, the demand for an "all-day school". (Early) reading and language support (a pre-school year for all), support for the gifted (from kindergarten on), reading tests in primary school and support for pupils at risk (vocational schools) are also urged emphatically (14 mentions).

Frequently expressed, moreover, is the wish for the "implementation of the proposals of the Zukunftskommission 5" (9 mentions) and for the implementation of the SPÖ education programme (8 mentions). It is striking here that it is mainly the SPÖ that urges implementation of the Zukunftskommission's proposals, although this commission was set up by the ÖVP minister Gehrer. Named six times each are:

– teacher education and in-service training: for instance, diagnosis and therapy of reading difficulties, raising training to academic level (university or PH)
– the abolition of the two-thirds majority for school legislation.

The demand for "closer cooperation between school and the world of work" is mentioned five times, the "extension of school autonomy (schools and municipalities decide)" four times. Three times each the APA original-text releases call for:

– the abolition of marks and of grade repetition, and
– the development of constructive school partnership.

Twice each it is concluded from the PISA results that:

– Minister Gehrer and the federal government should "take their hat", "not rest on their laurels", show more commitment, stop the cuts and work out reforms
– the use of tests, benchmarking, performance comparisons, quality management and evaluation should be extended
– no other school models (such as the Scandinavian ones) should be adopted unquestioned
– new paths should be taken and no "old hats" dug out.

Mentioned once each is the demand that:

– no school-structure debate be conducted, but the differentiated system retained and teaching improved, and
– the curriculum be reformed.

5 For more on the Zukunftskommission, see section 3.2


2.4 Summary

The number of newspaper reports and APA original-text releases on the Austrian PISA results in the period after their publication was many times higher in 2003 than in 2000; PISA thus moved into the focus of the Austrian public only after the second wave. The evaluation of the results in the two test years differs just as sharply: while Austria's performance in PISA 2000 was rated predominantly positively, the Austrian PISA 2003 results were judged predominantly negatively. This points to a strongly polarising presentation of the PISA results of 2000 and 2003 in public.

With the exception of the journalists, who naturally appear most often in the daily newspapers but hardly at all in the APA original-text releases, politicians are the most strongly represented group in both media. Scientists are cited more often in the newspapers (and for both PISA assessments) and also appear in the press releases on PISA 2003. A few views of industry representatives can be found in the press releases on PISA 2000 and in the newspaper reports on PISA 2003. Parents', teachers' and pupils' associations as well as social organisations and interest groups repeatedly have their say in press releases, but are hardly considered explicitly in the newspaper reports.

The clear dominance of politicians among the persons featured in the media under study indicates that PISA is perceived in public mainly as a political event. This creates the mistaken impression that it can and must be responded to politically, and that politics also bears the responsibility for the test results. Handing the interpretation of the PISA results over to politics and various interest groups leads to an ideologisation, and thus to a considerable overestimation, of the study in the public eye, suppressing the use of factual arguments and foundations. Such a factual engagement with PISA must therefore be made up for after the fact, in collected volumes such as this one, and thereby opened up to scholarly discourse.

The transfer of responsibility for the PISA results of 2000 and 2003 to individual groups, e.g. the teachers, politics or the school system as a whole, is itself already a consequence of an ideologised discussion.



For the strongly negatively rated performance in the PISA 2003 study, both the newspapers and the press releases frequently also name migrants and their proficiency in the language of instruction, without entering into any discussion of proposals for improvement. Deficient reading ability among pupils is pointed out in the newspapers, but is not regarded in the press releases as a cause of the PISA results; "fear of school", conversely, is mentioned only in the press releases.

After the publication of the PISA 2000 results, a change of policy, more budget and an end to the austerity measures are demanded most often in the daily newspapers and are also articulated comparatively often in the press releases. The demand for more tests and evaluation, as well as for a comprehensive school and an all-day school, is likewise voiced often in both media. Furthermore, the call for general reforms, to be worked out jointly after closer analysis, can be identified in both media, as can the demand for improving or changing teacher education and in-service training. The improvement of teaching and teaching quality as well as literacy and the improvement of reading ability, which are demanded most often in the newspapers, hardly appear in the press releases at all. All other measures are named in only one of the two media.

The picture is somewhat different for the frequency of the individual demands after the publication of the PISA 2003 results. In the daily newspapers as well as in the press releases, the demand for general, joint reforms is named most often, and the comprehensive school is the second most frequently demanded measure in both media. The all-day school, demanded very often in the press releases with 20 mentions, also ranks high in the newspaper reports with 49 mentions, i.e. among the top five.

The demand for support, (early) reading and language support (a pre-school year for all), support for the gifted (from kindergarten on), reading tests in primary school and support for pupils at risk (vocational schools), with 14 mentions in the press releases, belongs, together with the newspaper categories "pre-school measures" (50 mentions), "language courses" (28 mentions) and improvement of reading ability or literacy (34 mentions), to the most frequently named measures in both media. Likewise regularly represented in both media is the demand for changes in teacher education and in-service training; less frequent, but still present, is the call for more school autonomy. The implementation of the SPÖ education programme (8 mentions) and of the proposals of the Zukunftskommission (9 mentions), the abolition of the two-thirds majority for school legislation (6 mentions) and improved links between work and school (5 mentions) are relatively frequent demands in the press releases, but do not appear in the newspaper reports. Conversely, the improvement of teaching and teaching quality is mentioned very often in the newspapers, with 56 mentions, but only once, and then merely in passing, in the press releases.

It is striking that in both test years the demands for general, joint reforms, for a comprehensive school and for an all-day school are among those named most often. Moreover, after PISA 2000 as well as after PISA 2003, the demand for reading and language support (literacy through pre-school measures to eliminate the at-risk-pupil phenomenon) occurs frequently, and the improvement of teacher education and in-service training as well as of teaching and teaching quality is regularly named as a demand in both test years.

It is also worth mentioning that after PISA 2000 a change of policy, more budget, an end to the austerity measures, and more tests and evaluations are demanded very often, whereas after PISA 2003 these demands are of no, or at best subordinate, importance.

3 The PISA results – their assessments and consequences in education policy

Alongside the annotated presentation of the PISA results of 2000 and 2003 and the account and discussion of their reflection in the media (cf. sections 1 and 2), the education-policy assessments show the nature and degree of the public agitation on a further level outside the "scientific community". Dispensing with any reflection on what the OECD itself named as the explicit and implicit aims of the testing, and bearing in mind the specifically Austrian tradition of "agitation", one can speak in Austria of the debate taking on a political and argumentative life of its own, one that has not really subsided to this day. The basis for a serious discussion of the further development of the school system had in fact been laid, for instance with the findings of the ministerial Zukunftskommission, which, however, were discussed with only moderate engagement.

Only in recent weeks, prompted by the symposium on which this volume is based, and less on the basis that shaped the discussion in Germany, has there been a cautious pause in the thoughtless singing of PISA's praises. Reflecting on the question of what good teaching is and how it can succeed, and recalling Humboldt, the point is by no means to train "test artists" (Hartmut von Hentig), but to be able to meet the challenges of tomorrow, whatever these may turn out to be.

The PISA results were received differently in the countries examined and were discussed and interpreted, both pedagogically and in terms of education policy, in either an agitated or a relaxed manner. Like hardly any other study and assessment of pupil performance, they supplied occasion for analyses and conclusions and, at least in Austria and Germany, were soon overlaid by speculation and political reflexes. The education-page editors of the dailies and weeklies mustered "school experts" who focused on various aspects and promptly claimed to know why the country in question had performed as it had and what consequences were to be drawn.

Discussions about the pedagogical or didactic efforts needed to improve teaching, or systematic reassurance within educational science as the consequence of careful analyses, were largely crowded out by the politics of attention and ended up the exception rather than the rule . . .

What was overlooked is that the PISA study "can complement and deepen the respective national perspective by placing national results in a larger context for better interpretation and by (allowing) the respective strengths and weaknesses (to be gauged) in the light of the performance of other education systems". PISA, it is said, has created the basis for dialogue and cooperation in defining and implementing educational goals, with the competencies relevant for later life in the foreground (Schleicher n.d., 9). There is no talk of inferences about (general) education as it is conceived in the comprehensive Central European sense, nor of any causal relation between PISA results and the school system.


3.1 Didactic improvements

Some countries acted decidedly in accordance with this brief. "The alarming and disquieting findings of the first PISA study" led in Germany, very soon after the publication of the first results, to an international conference of the Gesellschaft für Fachdidaktik (the umbrella organisation of all the academic subject-didactics societies), with the aim of "developing perspectives for an improvement of subject-specific as well as cross-curricular learning and teaching (...)" (Bayrhuber/Vollmer 2004, 7).

In her programmatic lecture, Federal Minister Edelgard Bulmahn (ibid., 25f) presents approaches to educational reform. She calls for more "educational optimism" and, with reference to Finland, concentrates on the principle of individual support; she can also immediately offer additional funds from the German federal government (4 billion euros) to be invested in all-day school programmes, so that, unlike under the old "lockstep pedagogy" (G.B.), teaching can now be done differently on the basis of larger time budgets. What remains open is the inner form of the "all-day school", since the bet is placed on partnership and cooperation with sports clubs, music schools, parents' initiatives and the like. 6 The point, she argues, is the early identification of deficits and strengths in children, above all in the area of language, reading and writing competence, and a focus on subject didactics: "A better quality of teaching in our schools can only be achieved through a didactic transformation" (ibid., 28), that is, through a coherent combination of training in the academic discipline and in educational science and didactics.

6 In Austria the difference between the Ganztagesschule and all-day school forms is essential: in the former, lessons, practice periods and leisure periods alternate across the day and attendance for the whole day is compulsory; in the latter, compulsory lessons are scheduled essentially in the morning, while consolidation, practice, leisure and sport are offered in the afternoon hours, with attendance voluntary.


Staatsministerin Karin Wolff, president of the Kultusministerkonferenz, points to the swift and results-oriented action taken after the presentation of the PISA results, which pursues a concentration on support for children from migrant backgrounds, 7 on reading ability and on improvement in coping with complex tasks. All of these dimensions, she holds, concern subject didactics, since what is at stake is the improvement of teaching quality. Under this aspect the Kultusministerkonferenz has singled out educational standards as the central means of securing the quality of school education (more detailed definitions are supplied on pp. 36ff of the book in question).

7 In Hesse, only pupils who have a command of the German language are enrolled in school. In Finland, too, special preparatory language classes have been set up for children from migrant backgrounds; Sweden and other countries have preparatory classes for children who need to catch up in the language of instruction. Austrian schools know only the possibility of extraordinary enrolment, which is legally limited to one year, and supplementary instruction limited to a few hours per week, offered during regular lessons.

With the evaluation by federal state (Land), however, the education-policy and school-organisation debate became correspondingly charged in Germany as well, since the southern Länder, with their tracked school system, performed, roughly speaking, better in PISA than the northern ones.

Peter Bender, University of Paderborn, in a contribution "für die GDM Nr. 81", takes up the PISA comparative studies and those articles that deal with the critique of the PISA critics and with consequences for school organisation: "The Bavarians were by far the best, (...) the next three places (went to) Baden-Württemberg, Saxony, Thuringia. The integrated comprehensive school, which had fared badly in the direct comparison of school forms in PISA 2000, had now been taken out of that comparison (...) for statistical reasons. But the PISA weakness of the comprehensive school is still recognisable (...). Nor can any honey be sucked from the international PISA and TIMSS figures for a unitary school system. The top countries do all have one, but, and this is persistently and conveniently ignored, so do all the countries in the lower half of the table. The few countries with an early-tracked school system (Belgium, Germany, Austria, Switzerland, Slovakia, the Czech Republic), by contrast, are all found in the upper half. The international PISA figures thus likewise tend to speak against the unitary school. I believe, however, that they do not speak for or against school systems at all, but are an expression of the cultural-technical stage of development, the degree of achievement orientation and the migration structure of the respective society, and this essentially independently of the school system" (Bender 2007). In particular, the Scandinavian countries have been lost as models or, Finland apart, are on a par with Germany; the conditions in Sweden, more favourable than Germany's in terms of migration policy, are likewise pointed out once again.


As a further aspect of interpretation, Bender brings the question of equality (and inequality) of opportunity into his statement and points to methodological errors connected with the conclusions drawn from "educational participation" and "economic-social-cultural status".

Direct OECD statements, made by Andreas Schleicher in various places, are said to bring the motives of the PISA study to light. A few years ago, according to Bender, the school system was linked to growth in gross national product, in ignorance of other essential determinants. 8

8 Transferred to Austria, extremely reassuring conclusions about the school system could be drawn from this, since the most recent economic data show comparatively very good results.

Similarly coarsely "carved", he argues, are the competition-stimulating tables on the (increase in) educational expenditure: by that measure Mexico would sit alone at the top, according to Bender.

In an open letter to the deputy chair of the GEW, Marianne Demmer, Bender also takes a position on the critics of the book "PISA & Co – Kritik eines Programms", and on the reflex the critique triggered in Germany, which would, in a manner of speaking, prove that an analysis based on arguments leads (or led) to vilification and persecution. More on this in the contribution by Stefan Hopmann.

In this context one may also interpret the statement made in Germany in 2006 by Vernor Munoz of Costa Rica, the special rapporteur of the UN Commission on Human Rights, in which he criticised the tracked school system and found, in effect, that language is not decisive for the integration of families from migrant backgrounds . . . 9

9 Who, here, has failed to understand, or simply read past, the core of the PISA tests: literacy?

3.2 The "Zukunftskommission"

In the Austrian report PISA 2000 – Lernen für das Leben, the responsible federal minister, Elisabeth Gehrer, sums up the necessary consequences and steps in her foreword to the results report: "The detailed evaluation now available gives important indications of the areas in which the efforts to raise the quality of education should be stepped up further. Austrian children should be able to read reliably and with comprehension by the end of the third year of primary school at the latest. Reading is the cultural competence, even in the age of automation. The Ministry of Education has therefore launched the project ›Lesefit‹ under the motto 'being able to read means being able to learn'. With the involvement of parents and the book club, it must be achieved that all children leave primary school with excellent reading skills." With reference to the "thematic reports", attention is drawn to the "different competencies of girls and boys as well as of German-speaking and non-German-speaking pupils", and the detailed evaluation is appreciated overall for its feedback function for raising quality in the education system. The foreword to the PISA 2003 study recalls ›Lesefit‹ and refers to the extension of the project IMST (Innovations in Mathematics, Science and Technology Teaching). With the initiative ›klasse:zukunft‹, the development of educational standards and the purposeful continuation of internal school reform, Austria is on the right path to improving and securing the quality of teaching in a lasting way, according to Gehrer. She also remarks in closing, however, that "performance measurements such as PISA" supply "important snapshots" but embody only a "part of the achievements (...) produced (...) at our schools".

At the education minister's request, the so-called Zukunftskommission was set up by decision of the Council of Ministers of 1 April 2003 under Günter Haider, the Austrian head of PISA; it was to formulate education-policy consequences from the OECD study (its further members were Christiane Spiel, Ferdinand Eder, Werner Specht and Manfred Wimmer). In 2005 the commission presented its findings, an analysis-and-measures paper whose substance is not so far removed from the German conclusion: it aims at the improvement of teaching quality.

The reform goal named is: improving school and teaching systematically (emphases by the authors in each case). "Both the results of recent performance measurements (above all PISA) and the deliberations on quality improvement in schools that have been under way for more than a decade, as well as the analysis of the framework conditions in Austria, suggest placing the teaching and learning processes in the classroom, the contents of teaching and the teaching methods, in short 'good teaching', at the centre of the reform measures. Reform strategy: quality development before structural reform.

In its first report the Zukunftskommission placed the main weight of its proposals on quality assurance, quality development and the expansion of a dependable school, and less on the rebuilding of structural features and organisational elements. It stays on the same line in this follow-up report. The measures proposed by the Zukunftskommission therefore aim at improving teaching through school development and quality assurance, through teacher education and support systems, and not through a rebuilding of the system.

The overall strategy is oriented to the following four principles:

1. Systematic quality management: promotion of quality development and quality assurance at all levels. (...)
2. More autonomy and more self-responsibility: greater room for action with transparent performance and accountability. (...)
3. Professionalisation of teachers: criterion-based selection, competence-oriented training, performance-oriented career paths. (...)
4. More research & development and better support systems. (...)" (cf. Abschlußbericht – Zusammenfassung).

In the summarising recommendations, five fields of action (with individual sub-fields) and priority, overarching fields of research & development were formulated for this purpose and set out in detail.

These activities, which were received quite positively in public (cf. also the parliamentary debates cited in this chapter), were followed by several more:

On 9 February 2005, in an expert opinion for Federal Minister Gehrer written in the light of the PISA results, Peter Posch discusses "some possible reasons for the weaknesses of the Austrian school system and approaches to overcoming them: What is essential (...) is the recognition that improvements are to be expected only from a complex ensemble of measures."

Among his ten points he does touch on the "question of the fragmentation" of the school system and its consequences for weaker pupils and for those from difficult social backgrounds; in 2000 and 2003 this group is said to have performed alarmingly badly because the performance stimulus provided by stronger pupils was missing. His first point, however, is quality assurance: "The introduction of an obligation to have a school programme, in which schools are required at set intervals to give an account of initiatives, and of their results, for the further development of teaching quality and of the school's framework conditions, was not anchored in law, although a detailed proposal had been worked out as early as 2002 (...)." A further point for Posch is the gaining of school time, i.e. morning lessons alone do not suffice. Besides the improvement and professionalisation of teacher education, the author urges that current teaching be developed further methodically, so that demanding tasks and thinking performances can be mastered better in future (cf. TIMSS). More transparency in the formulation and assessment of performance expectations should result, among other things, from continuous in-service teacher training and professional cooperation in subject-group teams. Finally, the qualification of school heads and a strengthening of the management level are needed, and the "supervision vacuum" between school heads (responsible for the school programme) and the school inspectorate (responsible for the quality of self-evaluation) is to be eliminated.

As an essential point, Posch highlights the probably disadvantageous effect of migrant children's inadequate knowledge of German. On this: "Establishment of programmes to secure the German skills of children from migrant backgrounds, not merely alongside their school career but (...) before they enter school." A high concentration of children whose mother tongue is not German should be avoided in the process. He closes with a reference to Finland, where Finnish is taught systematically from kindergarten age.

3.3 Public reactions

In the mirror of media reception, the public discussion about consequences from PISA and comparative school achievement tests looks different. Whereas the first PISA report met with much, at times also undifferentiated, satisfaction, self-praise and calm, along the lines of "everything in the green zone" or "got away with it once more, and above all did better than Germany" (more on this in section 2), the reactions to the second PISA report condensed into reports of catastrophe.

Finland often serves as the watchword for all that is good in pedagogy, and the reluctance to charge it morally and overstrain it is fading rapidly; nor can other international educational analyses put this into perspective, for example the OECD country report 2005 cited by Education Minister Gehrer, which is quoted for press briefings and related information: "A particular characteristic of Austria can be seen in the great variety of school types. The high level of vertical and horizontal differentiation shows advantages, but also limitations. The school system gives parents a great freedom of choice, especially in Vienna and the large cities, although this could also lead to fragmentation and high costs." After the appearance of this country report, the argument is supported by an analysis in the FAZ: "What is decisive is not the particular school system, but the intelligent and considered handling of the existing school tradition. After this country comparison there is less reason than ever to sacrifice the three-track school system established in Germany to a unitary school system on the Scandinavian model. Rather, the countries that have relied on a three-track school system with high quality standards in all school types are in the lead (...)" (press information of the BMBWK, most recently of 16 Dec. 2005). In the same release the ministry also points to the successful performance of Bavaria, a Land with a tracked school system, as well as to Austria's low youth unemployment and to the WHO report on well-being at school: the results for the question "Do you feel comfortable at school?" put Austria in 3rd place and Finland in 34th (last) place.

None of this, however, contributes (any longer) to making the debate more objective. It shows only a pedagogically and systematically motivated flicker of a serious attempt to put the present-day school and its function into perspective, not to overstrain it as a comprehensive "apparatus for producing justice", but to underline the school's contribution to the creation of a humane society as modestly as, on the basis of the current state of research, it may and must be stated . . .

Members of the Zukunftskommission, too, take a similar position on a possible rebuilding of the school system; its chairman Haider, for example, repeatedly points, above all orally, to the Austrian tradition and school culture (e.g. Haider in the parliamentary Education Committee and at an ÖVP study conference in Alpbach).

3.4 Parliamentary resonance

On 3 July 2003, "the PISA study and the activities of the Zukunftskommission" were up for discussion within the framework of a so-called Aktuelle Aussprache (topical debate) in the parliamentary education committee (in the committees, legislative work is prepared, discussed and adopted by majority in order to be voted on afterwards in public plenary sessions; topical debates are part of the discussions in the committee meetings, to which experts are also invited). Günter Haider was invited to this as an expert. Federal Minister Gehrer opened by outlining the aim of the Zukunftskommission; the Austrian PISA chairman followed on and underlined that the Zukunftskommission would "deal in particular with quality and quality development, with the emphasis placed on the improvement of instruction. According to Haider, 80 % of the quality improvement can be achieved in this way, with teacher education as the main point of leverage. Only 20 % of the improvements, he estimates, could be achieved by organizational measures. Particular attention will be paid to preparation for lifelong learning and to a secure and comprehending reading ability, for without it no independent acquisition of education is possible" – which would be attainable only through corresponding instruction (Parlamentskorrespondenz No. 536). According to Haider's remarks, the work of the Zukunftskommission aims at an overall concept, which can only be implemented in the long term; short-term effects are ascribed to the implementation of quality management. As a long-term project the expert named the turn towards an orientation on accountability as well as the realization of strengthened autonomy...

In the subsequent debate "the different emphases placed by the opposition and the government factions" then became apparent. From the statements of the SPÖ and Green deputies the view could be heard that "organizational measures could very well exert a stronger influence on the quality and the output of instruction." For Werner Amon of the ÖVP – thus the exemplary contribution – the point was to develop the relatively good school system further through a concentration on the improvement of instruction and the expansion of autonomy. After an extensive debate Haider also underlined the necessity of extending all-day schooling and instruction time and of dividing the curriculum into core and extension subject matter.

On 1 December 2004 the results of the second PISA study, which had become known shortly before, were discussed in the parliamentary education committee within the framework of the topical debate (a final official result was not yet available at that time; until 7 December the results remained the property of the OECD). Federal Minister Gehrer conceded that Austria had fallen back in the ranking. At the present moment, however, she considered it "mistaken to make premature attributions of blame and to propagate a particular school system as a remedy" (Parlamentskorrespondenz No. 893). Besides pointing to the intended analysis, with which she wanted to commission Haider, she recalled measures already taken as well as the situation in youth unemployment, which was positive compared, for instance, with Finland.

In order to do justice to the pending reforms in the education system, the work of the parliamentary education committee on 20 April 2005 was devoted to the decision on abolishing the two-thirds majority for school laws; for this the support of the opposition was necessary. The basis for the discussion within the topical debate was the final report of the Zukunftskommission; Günter Haider was once again available for questions and contributions to the discussion and once more described the systematic improvement of instruction as the central element of school reform. Part of this was to foster learning and achievement capacities optimally, to raise the qualification of teachers, and to improve the sustainability of instruction. "Quality development has precedence over structural reform," he said, and also took the view that early language support should be implemented as quickly as possible (Parlamentskorrespondenz No. 272). As was to be expected, the opposition deputies responded with the wish for a structural reform, just as the ÖVP deputies regarded the qualitative improvement of instruction as the priority. After a further statement by Federal Minister Gehrer – she recounted all the improvement and school-development measures taken so far – the discussion was resumed after a short interruption, followed by the debate on new majorities for school laws. Reforms were to be implemented faster and more simply than before. Reference was furthermore made to the planned, and indeed budgeted, roughly 300,000 additional support lessons and to the likewise envisaged expansion of all-day care.

In the course of the following months, however, the political discussion about PISA – or rather the publicly conducted debate – becomes argumentatively ever narrower and concentrates, ideologically charged, on the question of "comprehensive school or tracked school system".

That from 2006 onward the approaching federal election campaign is reflected in this, though not only in this, becomes obvious, for instance, in the fact that now, among others, the Austrian PISA chief Haider increasingly behaves differently in public – that is, deviating from his substantive line and his original recommendations, crossing the boundaries of science, and making political value judgements – and accuses the Education Minister, in effect, of grave education-policy omissions, immobility and weariness of office.

3.5 In-Depth Analysis

In 2005 a discussion also develops in Austria about the evident methodological weaknesses of the PISA survey, i.e. the design and the evaluation of the achievement measurement. Federal Minister Gehrer commissions Erich Neuwirth and his team at the University of Vienna with in-depth analyses and contributions on methodology. Scientifically validated statements were to help clarify and interpret the differences. The section "Corrected main results" (Neuwirth, 62 ff.) addresses the now possible sound comparison of the PISA results of 2000 and 2003. "It turns out that the achievements of Austrian students in reading have hardly changed and that both the reading scores and the mathematics scores lie close to the OECD average. In the sciences, by contrast, a clear decline of the Austrian scores is discernible" (Neuwirth, 62). It should also be noted that the mathematics scores are not directly comparable, because the domains covered in the mathematics test were greatly extended in 2003, so that the field of competence examined was no longer exactly the same as in 2000.
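A note for orientation, drawing on the standard PISA reporting methodology rather than on Neuwirth's text itself: PISA results are reported on a scale that was normed, in the base cycle of each domain, to a mean of 500 and a standard deviation of 100 across the OECD student population. A proficiency estimate theta is mapped linearly onto the reporting scale as

    score = 500 + 100 * (theta - mu_OECD) / sigma_OECD

where mu_OECD and sigma_OECD are the mean and standard deviation of the OECD norming population. "Close to the OECD average" thus means close to 500 on this relative scale; it says nothing about an absolute level of competence.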

The analysis of gender differences yields no noteworthy result either in reading or in the sciences. If one analyses the response formats in the sciences, one arrives at a revealing finding: on tasks with open, free (verbal) response formats Austria's students performed worse than in 2000 (cf. Neuwirth, 71, 75). This applies above all to female students at vocational schools (Berufsschulen) and intermediate vocational schools (Berufsbildende Mittlere Schulen).¹⁰ "Of the publicly discussed 'PISA crash' in all disciplines and of the drastic divergence of the genders' reading scores, nothing remains once the corrected data are analysed," the author states (Neuwirth, 64).

¹⁰ One relation may be articulated in this context, namely to the development of the number of children with a migration background and to the experience gained in integration work: in this period Austrian federal policy was implementing its programme of family reunification, i.e. it was mainly children who were coming to Austria...

The team of statisticians comes to the conclusion that the data material does not stand up to closer scrutiny with regard to consistency (cf. chapter 1.2.1) and that there can be no talk of an Austrian "crash". For example, in PISA 2000 the share of girls in Austria was higher than the share of boys, which contradicts all basic demographic knowledge. More precise analyses using additional, not publicly available data then showed that the data of the two PISA surveys of 2000 and 2003 are not directly comparable, specifically for Austria... (Neuwirth, 11).
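The consistency argument just cited can be illustrated with a minimal sketch in Python; the function name and all figures below are purely hypothetical, not the actual PISA sample sizes or cohort shares. Under simple random sampling, the share of girls in a national sample should not deviate markedly from the share of girls in the corresponding school population.

    import math

    def gender_share_pvalue(n_girls: int, n_total: int, p_pop: float) -> float:
        """Two-sided test of whether an observed share of girls is compatible
        with the population share under simple random sampling (normal
        approximation to the binomial, adequate for large samples)."""
        p_hat = n_girls / n_total
        se = math.sqrt(p_pop * (1.0 - p_pop) / n_total)
        z = (p_hat - p_pop) / se
        # two-sided p-value from the standard normal distribution
        return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

    # Hypothetical illustration only: 52.7 % girls in a sample of 4,500
    # students, against an assumed cohort share of 48.8 % girls.
    print(gender_share_pvalue(n_girls=2372, n_total=4500, p_pop=0.488))

A vanishingly small p-value of this kind does not by itself prove an error in a particular data set; in the logic of the argument above it merely signals that such a sample composition points to a sampling or weighting problem rather than to a substantive finding, so that comparisons built on it require correction.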

3.6 "PISA Has Something for Everyone"

Neuwirth's analysis is scarcely able to change anything about the generally overstretched interpretation of the PISA results in Austria. The political opposition charges the discussion morally and demands less selection and more justice from the school system; it is communicated that a comprehensive school system will deliver this. The original intention of PISA – to fuel (economic) competition among ever more countries with unmanageable masses of data (the third PISA report presents data from 60 countries) – goes unconsidered. As far as the school system is concerned, opinion polls continue to show a constant assessment by Austrians:

From the recent past up to the present, the tracked school system clearly led the comprehensive school in Austrian surveys (mostly between 65 and 75 %). Attitudes towards school reforms, however, developed into a kind of collective awareness of the necessity of improvements. Detailed polling, though, yielded no conclusive picture.

In spring 2007 it is rumoured in teachers' circles that Austrians now voted for the comprehensive school after all – once the question had been preceded by the information that PISA had shown the comprehensive school to be the better school model. The sources cannot be traced; the resonance remains modest overall...

In accordance with the SPÖ-ÖVP government programme of January 2007, the Education Minister sets up a reform commission to identify model regions for trying out forms of comprehensive schooling. The majority of its members are not experts from the fields of educational theory and science or school research, but have in the past essentially made themselves heard as opponents of the tracked school system. Günter Haider acts as the minister's adviser on education.



Individual federal provinces present their variants of a reformed school of the future. Upper Austria, for example, supports a one-year extension of primary school, at the end of which the decision on the further educational path is to be taken. Lower Austria relies on a kind of two-year orientation stage after primary school and invokes high acceptance for it: according to the Fessel+GFK institute (July 2007), 78 % would want to stay with the existing school system and only 18 % would vote for a single school type (3 % no answer); 62 % consider the above-mentioned orientation stage a good idea. Other provinces have signalled their cooperation in implementing a comprehensive-school model region (e.g. Carinthia, Burgenland). The published results of the trials with models of the integrated comprehensive school over recent decades go unconsidered.

At present – though no longer as heatedly as in the years before – PISA stands on the one hand, as it were, for a national insult with which people have not yet really learned to cope; on the other hand, above all on academic ground, the "OECD machinery" is associated with what it wanted to be itself, namely a licence to shake up the international worlds of schooling and business. Wolfgang Horvath quotes the OECD itself on this: PISA serves "(...) a better understanding of the returns to education in the most developed countries and in those still at an earlier stage of economic development" (Horvath, 208), and it focuses on "the competences relevant for later life" (Schleicher, 9). In public discussion, however, this motive is not taken into account – not least because PISA is technically designed in such a way that a result determined in this manner, to vary the slogan of the Österreichische Post AG, "has something for everyone".

Beyond this it has so far remained largely unreflected that behind PISA lies an educational norm, a conception of (general) education that is by no means self-evident in countries with the tradition of Humboldt and Schleiermacher. "This concept of education, in itself thoroughly worthy of discussion, whose roots are to be sought in a concrete historical and cultural space, is rendered absolute in its claim to validity by the notorious silence about its history of origin. It is presented as universal, as given by nature, and is thereby set as a norm – or rather, strictly speaking, this norm is precisely not declared as such. It is tacitly presupposed as the only possible because the only thinkable one. It thus bears the character of a natural law (...). General education is thereby trimmed for usability; the measured educational value serves as an indicator of economic clout" (Horvath, 210). In this respect the improvement proposals articulated by the Zukunftskommission are coherent in themselves: comprehensibly couched in a largely management-technical terminology, they aim at improved qualifications and their efficient deployment in the service of the students and of the world of business and work.

It therefore seems downright paradoxical when representatives of the political left want to do even more justice to the PISA results: with the comprehensive school, and thus with the global economic orientation of the education system, rather than with a (necessarily reformed) tracked school system that also realizes the Gymnasium. That the chances and strengths of the European Union could lie in taking an enlightened understanding of education into account is articulated by hardly anyone outside academia.

That not all countries occupy themselves equally intensively with the "ranking spectacle" (Schirlbauer 2007, 6) is shown, it is argued, by the OECD's PISA homepage. Two months after the publication of the PISA study, the winner Finland devoted 8 pages of print-media coverage to the event, the UK (with results in the upper quarter) 88 pages, France (in the upper third) 32 pages, Germany (below average) 774 pages, Italy (behind Germany) 16 pages. Austria does not appear in the list (cf. Gruber). Schirlbauer does not follow Gruber's judgement (ibid.) that Italy's media public is marked by Berlusconi's politics of suppression; rather, he points to another national focus of attention: football. Likewise, the capacity to deal constructively, in terms of education policy, with international comparative studies is developed to different degrees in different countries, and is thus the reason for public agitation or the lack of it...

Schirlbauer's interpretation of the current educational standards, which are also a kind of consequence of the project ›klasse:zukunft‹ (a proposal of the Zukunftskommission), is oriented towards a necessary but still concealed comeback of the curriculum (Lehrplan) – resulting on the one hand from the PISA results, on the other from the ever more obvious failure of the educational reform efforts of the 1970s, which wanted to dispense with content and knew how to celebrate the methodical with pseudo-emancipatory intent, in order to free students and teachers from the "pressure of subject matter" (cf. Schirlbauer 1992, 27).

"What is good instruction?" is, consistently, also the guiding question, the guiding motif, of the Zukunftskommission's reform proposals. The endeavour thus stands in the, as it were, re-emerging tradition of an orientation towards school and instructional quality – more out of a return to didactics and the modern art of teaching, and less as a reaction to the globalization pressure of the so-called knowledge society (cf. the conclusions of the German education officials, in: Bayrhuber et al.).

Ewald Terhart, one of the authoritative didactic scholars of the German-speaking world, placed the emphasis on instruction as early as 2002: "The quality of a teacher is decided by the quality of the instruction. School quality arises to a high degree from instructional quality. The school culture as a whole – school as a space of experience – does form an important background of learning and socialization for students and teachers; nevertheless, the quality of instruction and of instructional development is ultimately the decisive area." The question of 'good instruction' is decided, he holds, on three fields: the shaping/preparation of the context, the conduct of the instruction itself, and the subsequent analysis and evaluation (cf. Terhart, 99 f.).

The perennially unfinished awareness of the (actual?) telos of the teacher, fashionable marketing expectations, the view that laborious detail work remains unthanked because it long remains invisible, a legitimation pressure perceived as rising, and a palpable competition among schools (cf. also the declining numbers of students) have led, and still lead, to the inclination to neglect "inner school reform" and to bet instead on structural (i.e. presentable) changes. This stands in confrontation with the enduring insight that the "execution of the pedagogical core business" is incumbent on the teacher, and that he or she has to answer for it – however challenging the path may be, and however comfortable the (self-)deception may feel.

The harbingers of the PISA report 2007 give reason to suspect that politically schematic interpretations will follow the undifferentiated and one-sided judgements of the past. With every such assumption, however, there also goes the hope that progress towards argumentative sensitivity and intellectual care might yet become reality – in the evidence-based knowledge society of twenty-first-century Europe.


4 Conclusion

The results of PISA 2000 and PISA 2003 were perceived in the media public quite in accordance with the published results. Without taking the re-analyses by Neuwirth et al. into account, a decline in the achievement of Austrian students can be observed between PISA 2000 and PISA 2003 (see chapter 1). This decline was mirrored in the media, but dramatized and exaggerated in its extent. Thus the results of PISA 2000 are regarded in the media as extremely positive and the results of PISA 2003 as predominantly negative. If the re-analyses by Neuwirth et al. are taken into account, Austrian students lie at the OECD average in reading and in mathematics in both assessments. Strictly speaking, then, nothing changed in these two domains, so that the much-cited drastic decline in achievement turns out to be a fiction. Only in the sciences is there an actually ascertainable decline – to which the corresponding media response has been all but absent.

Interestingly, the subject-specific demands made in the media after both PISA waves concern almost exclusively the promotion of reading and language. This certainly has its justification in view of the alarmingly high number of students who must be classified as very poor readers in PISA, and in view of the elementary importance of reading in our world; but it is somewhat surprising in view of the far greater decline in the sciences.

All other demands made after PISA 2000 and after PISA 2003 differ little in essence and repeat themselves independently of the results. They mirror, rather, those education-policy debates shaped by familiar convictions (e.g. concerning school structure), concern family- and socio-political course-settings (all-day schooling) and contain calls for reforms, but can in no way be derived from the PISA study, nor be refuted or justified by it. It is no different with the putative causes and reasons advanced for the PISA performance. These, too, stem in many cases from subjective assessments and can in no way be proven by or through the PISA results. Let it be warned once more that the arbitrary voicing of attributions of guilt and responsibility without a sufficient pedagogical basis must not strike the weakest groups of a society.



Overall it remains to be noted that the media attention for PISA 2000 differs strongly from that for PISA 2003, and that the assessment (of the consequences) by some persons changes under altered conditions of time and politics. While PISA 2000 functioned as a marginal media topic, PISA 2003 became a media "spectacle".

In view of these findings it can be concluded that PISA does not constitute a framework that enables a systematic, rational exploration of causes or the development of measures in the school and education sector, nor does it suggest any particular school-policy conclusion. PISA itself offers no point of reference for meaningful concepts, measures or interventions in the education sector of the individual states. Both PISA 2000 and PISA 2003 served at most to stimulate "competitive consciousness" among the OECD states and to give education-policy positions, convictions and projects a seemingly scientific justification and to spread them, media-effectively, among the public.

Paradoxically, the fact that everyone can find their own ideas and convictions confirmed by PISA is the reason for PISA's assertiveness and success – though at the same time it is the programme's curse.

At least one positive aspect of PISA is that education moves back into the centre of public and political discussion, and thus there is a chance that pedagogically legitimizable developments can be initiated – even if the reactions to PISA so far have largely been characterized by the taking-up of "dusty" educational concepts and foreign "school copies", which, however, can be all the more of an incentive to develop new ideas of one's own on a scientifically secured basis.

References

APA-OTS (Ed.) (2007): Über APA-OTS. Online publication [http://service.ots.at/standard.php?channel=CH0171&document=CMS1096293925986&sc=pt], download 20.7.2007.

APA-OTS (Ed.) (2007a): APA-OTS Empfänger. Online publication [http://service.ots.at/standard.php?channel=CH0171&document=CMS1135843558515&sc=pt&sb=pt3], download 20.7.2007.

Bayrhuber, Horst / Ralle, Bernd / Reiss, Kristina / Schön, Lutz-Helmut / Vollmer, Johannes (Eds.) (2004): Konsequenzen aus PISA. Perspektiven der Fachdidaktiken. Innsbruck.

Bender, Peter (2007): Leserbrief. Online publication [www.uni-paderborn.de/bender/LeserbriefPISAkritik.pdf], download 19.7.2007.

Buhlmann, Edelgard (2004): "Konsequenzen aus PISA – Perspektiven der Fachdidaktiken". In: Bayrhuber, Horst / Ralle, Bernd / Reiss, Kristina / Schön, Lutz-Helmut / Vollmer, Johannes (Eds.): Konsequenzen aus PISA. Perspektiven der Fachdidaktiken. Innsbruck.

Gruber, Karl Heinz (2004): Bildungsstandards: "World class", PISA-Durchschnitt und österreichische Mindeststandards. In: Erziehung und Unterricht, Jg. 2004.

Haider, G. / Reiter, C. (2004): PISA 2003. Internationaler Vergleich von Schülerleistungen. Graz: Leykam.

Horvath, Wolfgang (2006): PISA-Studie. In: Dzierzbicka, Agnieszka / Schirlbauer, Alfred (Eds.): Pädagogisches Glossar der Gegenwart. Wien.

Mediaanalyse (Ed.) (2007): Jahresbericht 2001. Tageszeitungen. Total. Online publication [http://www.media-analyse.at/frmdata2001.html], download 25.6.2007.

Mediaanalyse (Ed.) (2007a): Jahresbericht 2004. Tageszeitungen. Total. Online publication [http://www.media-analyse.at/frmdata2004.html], download 25.6.2007.

Neuwirth, E. / Ponocny, I. / Grossmann, W. (2004): PISA 2000 und 2003: Vertiefende Analysen und Beiträge zur Methodik. Graz: Leykam.

Parlamentskorrespondenz (2007): Online publication [http://www.parlament.gv.at/portal/page?_pageid=607,78669&_dad=portal&_schema], download 24.7.2007.

Posch, Peter (2005): "Einige mögliche Gründe für die Schwächen des österreichischen Schulsystems und Ansätze zu ihrer Überwindung". Unpublished working paper of the Bundesministerium für Bildung, Wissenschaft und Kultur.

Reiter, C. / Haider, G. (2002): PISA 2000 – Lernen für das Leben. Österreichische Perspektiven des internationalen Vergleichs. Innsbruck-Wien-München-Bozen: Studien Verlag.

Schirlbauer, Alfred (1992): Junge Bitternis. Eine Kritik der Didaktik. Wien.

Schirlbauer, Alfred (2007): Sollen wir uns vor den Bildungsstandards fürchten, oder dürfen wir uns über sie freuen? (unpublished manuscript). Wien.

Schleicher, Andreas (2004): Vorwort des Leiters der Abteilung für Indikatoren und Analysen im OECD-Direktorat für Bildung. In: Neuwirth, E. / Ponocny, I. / Grossmann, W. (Eds.): PISA 2000 und 2003: Vertiefende Analysen und Beiträge zur Methodik. Graz: Leykam.

Schwarzgruber, Manfred (2006): Die PISA-Studie und ihre mediale Darstellung. Eine Inhaltsanalyse der Berichterstattung über die PISA-Studie 2003 in österreichischen Tageszeitungen. Universität Salzburg: Diplomarbeit.

Terhart, Ewald (2002): "Wie können die Ergebnisse von vergleichenden Leistungsstudien systematisch zur Qualitätsverbesserung in Schulen genutzt werden?" In: Zeitschrift für Pädagogik, Jg. 48, Heft 1.

zukunft:schule (2005): Strategien und Maßnahmen zur Qualitätsentwicklung. Abschlußbericht der Zukunftskommission. Wien: Bundesministerium für Bildung, Wissenschaft und Kultur.


Epilogue: No Child, No School, No State Left Behind: Comparative Research in the Age of Accountability

Stefan T. Hopmann

Austria: University of Vienna

Why, and under what conditions, is PISA successful? How is it that PISA results can simultaneously serve as justification for utterly opposed education-policy options, and that in some countries the whole of education policy falls under the skewed shadow of PISA, while elsewhere PISA is only one voice among many? Starting from the contributions to this volume and complemented by results of historical-comparative research, the following chapter analyses PISA as a case study of the fundamental changes that are gradually taking hold of all public services (including, for instance, the health sector). It turns out that there are good historical and current reasons for the achievements and weaknesses of the PISA project and for their highly divergent uses.

Why is it that a comparative project like PISA can gain so much public attention in so many countries at the same time? What makes some governments tremble, parliaments discuss, journalists write, parents nervous, and teachers angry when PISA announces new results? Why are educational administrations and political committees eager to align their curriculum concepts to the one implicit in the PISA tests? Why is PISA in some places big news, in others news appropriate for a short notice on page five or in the education section? PISA is not the first project of its kind: what is different about PISA?

Of course there is no single explanation for this mind-boggling success story. PISA has obviously hit something in the public mind, or in the political mind, at least in Western societies. This makes it "knowledge of most worth". It is unlikely that this success is a result only of the quality and scope of PISA itself. If one accepts at least some of the criticism voiced in this volume, the opposite seems to be the case. What is of "most worth" in the eyes of the public is not the complicated and often overstretched research techniques or the specific design, but the simple messages which front the public appearance of PISA: the league tables and the summaries, which indicate what PISA sees as the weaknesses or strengths of the respective systems of schooling.

But even news of this kind has been around before – ever since the IEA began its comparative research in 1959 – yet it never gained a similar impact. Thus it is not enough to look at PISA itself to understand this story. It is necessary to understand, at the same time, how the social environment has changed: schooling, policies, and the public. The question is how PISA and its methodology fit into a larger frame of social transformation and could thus achieve the influence they have now. Moreover, one has to ask if there is one PISA achieving all this, or whether it is more appropriate to talk about the "multiple realities of PISA", i.e., the manifold ways in which PISA is enacted and experienced, which have been crucial for its success, and which the methodological mix utilized by PISA allows for.

In my view, the rise of PISA owes much to what I would call the emerging "age of accountability", i.e., a fundamental transformation ongoing in at least the Western world, which centres on how societies deal with welfare problems like security, health, resurrection and education (see Hopmann 2003, 2006, 2007). PISA fits this context in many different ways, depending on how accountability issues unfold in different societies. In some places the same fit is equally expressed by national policies like the "No Child Left Behind" legislation in the US, or by the development of national education standards, as has been the case in countries as diverse as Sweden, Germany, Switzerland or New Zealand. What is important to note here is that the education sector is rather late in addressing such issues when compared to other areas such as health or security. In this perspective PISA is but one example, a kind of collateral damage sparked off by the intrusion of accountability mechanisms into the social fabric of schooling.

To explain this observation, I will first outline how the transformation called "the age of accountability" can be understood (1). In the following section (2) I will try to illuminate three basic modes of accountability, namely the strategies of "no child left behind", "no school left behind" and "no state left behind", and how PISA fits into these different settings. In the last section (3), implications of the multiple realities of accountability for current and future school development are discussed. In doing so I rely on the results of the Norwegian research project "Achieving School Accountability in Practice" (ASAP), which I initiated in 2003 and which will present its results in another volume later this year (Langfeldt, Elstad & Hopmann 2007). Other sources are comparative projects which I have been involved with in recent years, such as "Organizing Curriculum Change" (OCC), including research projects in Finland, Norway, Switzerland, Germany and the US (cf. e.g. Künzli & Hopmann 1998), and the dialogue project "Didaktik meets Curriculum", which involved scholars from about twenty countries throughout the 1990s (cf. e.g. Westbury, Hopmann & Riquarts 2000; Gundem & Hopmann 2002), and – last but not least – the research done in preparation of this volume on PISA, in cooperation with colleagues from seven different countries (Austria, Denmark, Germany, Finland, France, Norway, and the UK).

1 The Age of Accountability

Social scientists, economists, politicians, educators and the public seem to agree that something fundamental is going on, something which changes at least in principle the social fabric of Western societies. However, they differ widely in what they see as the core of this transition. To name but a few more recent examples:

– "The Modern World System" is, according to Immanuel Wallerstein, characterized by the ever-expanding commodification (in Marx's terms: "Verdinglichung") of all natural resources, human relations, labour, knowledge, etc., forcing a lasting division of labour in and between nations and turning ordinary citizens into alienated tools of a globalized economy (cf. e.g. Wallerstein 2004).

– Neoinstitutionalists like John W. Meyer speak about globalization as well, however in organizational or structural terms, defining the current transition process as an outcome of the rapidly growing influx of international "institutions", i.e. common ways of seeing and dealing with society, as provided by international organizations (such as the UN, the World Bank, the OECD) and the emerging "world polity", which supersedes national histories and policies alike (cf. e.g. Meyer 2006).

– Theories of "reflexive modernity", as provided by Anthony Giddens and others, would agree that the change is global, but they pinpoint the special implications for the members of society, e.g. the need to develop a reflexive stance towards the structures of society and the embedded risks for society as a whole as well as for the individual (cf. e.g. Beck, Giddens & Lash 1996; Beck 2006).

– Governmentality theories, drawing on Foucault's famous 1977/78 Collège de France lectures (cf. Foucault 2004), point similarly to the impact the transition has on the state and its institutions as well as on the public and its members, but they see a growing transfer and diffusion of power relations into self-control mechanisms, making citizens internalize the (more or less alienated) mentality necessary to govern them(selves) (cf. e.g. Bröckling, Lassmann & Lemke 2000; Lange & Schimank 2004; Gottweis 2006).

– New Public Management (NPM) supporters would not disagree that there is a diffusion of power and a change of the habits required, but they rather see it as a positive force making societies and their institutions and members more effective in a globalized world as customer-centred management and control techniques are introduced (cf. e.g. Buschor & Schedler 1994; Pollitt & Bouckaert 2004).

– More recent theories of the welfare state discuss similar issues, but rather as a question of how the modern "intervention state" is forced to dismantle its traditional comprehensive strategies governing resources, the law and the social sphere in a more and more post-national world, and of how welfare is re-modelled within this "unravelling" of the state and its institutions (cf. e.g. Esping-Andersen 1996; Scharpf & Schmidt 2000; Leibfried & Zürn 2006).

– Finally, systems theories based on the work of Niklas Luhmann (cf. e.g. Luhmann 1998) argue that the current transition grows from within, from the need of social systems to deal with an ever-growing complexity and contingency that forces a reflexive re-design of the ways and means of social communication (which, according to Luhmann, constitutes the fabric of social systems; cf. e.g. Akerstrøm Andersen 2003; Rasmussen 2006).

Of course, this is but a small selection of the staggering number of transition theories, which flourish despite the obviously prematurely proclaimed "end of history" (Fukuyama 1992). Moreover, these approaches vary widely. Some of them see the current change as a late consequence of processes started with the invention of the modern state (e.g. Foucault, Wallerstein), whereas others point to more recent changes, for instance to the crisis of the welfare state or the rapid globalization process (e.g. Leibfried & Zürn, Meyer). Some of them look at it primarily as a top-down process by which global developments overpower local traditions (e.g. Meyer, Wallerstein); others stress the role of intermediate levels such as the nation state and its institutions (e.g. Giddens, Leibfried & Zürn, Pollitt & Bouckaert); whereas some see the main issue at the level of the impact of the transition process on those involved (e.g. Beck, Foucault). Some theories stress institutional patterns or social systems as the prime force (e.g. Meyer, Luhmann), others believe in actors and their policies as the defining elements (e.g. Pollitt & Bouckaert; Wallerstein), whereas some try to sketch a third perspective, in which actors and structures are seen as inextricably intertwined (e.g. Giddens, Foucault).

One should not complain about this amazing diversity of approaches: it is an expression of the difficulty of finding more common ground at a time when the transition is still unfolding with growing, but uneven, speed in different places. Additionally, many of these authors and their followers use a similar pool of examples in spite of their differences. Even though they do not agree on all the why questions, they point to much of the same kind of evidence – for instance, to examples

– of the redistribution of resources, risks and responsibilities within and across societies,

– of the destabilization, or at least the restructuring, of most public institutions and of their relations to, or competition with, the private sector,

– of the re-tooling of legitimation and control patterns within the public as well as the private sphere, and of its impact,

– of the pressure on systems and actors towards taking a reflexive stance towards themselves and taking responsibility for their own "well-being".

Accountability

Looking at the narrower issue of "accountability", a similar wealth of models and approaches can be observed. Besides the more or less implicit accountability concepts within the general transition theories mentioned above (mostly constructed as 'being made responsible' in one way or another, e.g. by Giddens, Foucault or the NPM theories), different models of accountability have emerged, based on the areas in which accountability is observed:

– In economic theories (e.g. Laffont 2003), where the concept originated, accountability is nowadays often constructed as a means by which a principal (the resource-giver), under conditions of limited information, tries to multiply the ends by giving the agent (the resource-taker) incentives and/or forcing him by other means to account for the efficiency, quality and results of his deliveries.

– In government research (e.g. Hood 1991, 1995; Hood, Rothstein & Baldwin 2004), accountability is often seen as a key tool of the New Public Management movement to ensure that units and persons provide services according to the goals set for them or agreed with them. According to this approach, it unfolds as a combination of risk management and blame avoidance, by which those held accountable try to limit the scope of possible failure.

– In research on social policy, the same phenomenon is described as a "quasi-market revolution" (Bartlett, Roberts & Le Grand 1998), i.e. as the intrusion of market-like mechanisms of distribution and control into the public sector: elements of competition, contractualization and finally auditing are introduced into the rendering of services – as Hood (2004) has pointed out, often in the form of "double whamming", i.e. as the co-existence of the traditional bureaucratic modes with the administrative tool kit fostered by NPM.

– Similarly, some educational and health-care researchers see the rise of "the age of accountability" as a "revolutionary" move towards "evidence-based" practice, i.e. the growing expectation that professionals can present data to prove that they have performed professionally and efficiently (for education, cf. e.g. Slavin 2007; for health care, cf. e.g. Muir Gray 2001).

– Generalized beyond the realms of public service, this leads to the concept of the emergence of an "audit society" (e.g. Power 1997), the assumption that more and more areas of social life are being made "verifiable", i.e. subjugated to regimes of counting what can be counted, and thus become part of a measurable accountability.

– Interaction and transaction theories construe the personal costs of such a transition, looking at accountability as an interpersonal relation in which we deal with "accounts", "excuses" and "apologies", i.e. strategies to explain ourselves in ways which give a sustainable account of our efforts (e.g. Benoit 1995).

– Finally, psychological approaches to accountability (e.g. Sedikides 2002) look at the personal ways of dealing with accountability, at how one develops mechanisms to attribute or to reject the accountability embedded in the roles and functions we have to perform.

Like the transition theories, accountability theories provide a wide array of possible causes and implications. Some see this process primarily as an effect of a growing "economization" of all parts of society (e.g. social policy and audit theories), whereas others see accountability as inevitably embedded in the social fabric of modern societies (e.g. the psychological explanations). Some see accountability primarily as a politically initiated restructuring effort (e.g. quasi-market theories), whereas others see accountability simply as a legitimate means to ensure that customers or clients get what they have paid for (economic and NPM theories). Additionally, accountability is viewed on rather different levels. Utilizing a model developed by Melvin Dubnick (2006), one can discern:

– a first-order accountability, i.e. accountability arising in face-to-face relations (as described by the psychological models);

– a second-order accountability, which is characterized by how well one follows the rules and standards set by a resource-giver (as described by government theories);

– a third-order accountability, which can be seen as "managerial accountability", i.e. the use of accountability by a principal as a means to achieve better service and effectiveness of the agent; and finally

– a fourth-order accountability, which rests on the one held accountable internalizing the norms, values and expectations of the stakeholders, which puts him or her into action (as pointed out e.g. by theories of governmentality or of professionalism).

In practice, all of these can be intertwined. However, the dividing line is how this interaction is assumed to come into being, and which levels rule compared to the others (if they are not seen, as they are in economic theories, as an embedded rationale of social actors at all levels).

In addition, accountability concepts change over time and differ from place to place. A good indicator of this is that (1) there is no common translation of the concept available in most of the non-English-speaking countries, and (2) public agencies or policy makers do not employ similar definitions of the elements and limits of accountability when "accounting for accountability" (cf. Dubnick & Justice 2004; Birkeland 2007). Nevertheless, most accountability analysts agree with the above-mentioned transition theories on some core issues, namely:

– that accountability procedures more and more permeate at least all Western societies, and thereby change the ways and means by which societies deal with themselves;

– that the rapid rise of accountability affects all areas of the public sector, from education to health, and their relation to the private sector;

– that this transition enforces a vast redistribution of resources and responsibilities, and thereby a fundamental change in the interplay between resource-providers and resource-users, often described as a kind of implicit (values, norms) or explicit (standards, contracts) fixation of what is supposed to shape their relations;

– that this process unfolds at different speeds and with different patterns, depending on what kind of social setting it becomes a part of.

For the purpose of this chapter, it is not necessary to decide which of these theories and models carries the most theoretical or empirical evidence. Rather, the common features shared by most of them should be enough as a starting point, even though this puts some of the "why questions" aside temporarily and moves the focus to the question of how the emergence of the age of accountability can be observed in action. In my view, its common core can be described as a slow but steady transition from what I call a "management of placements" (Verortung) towards a "management of expectations" (Vermessung), by which the ways and means of dealing with "ill-defined" problems, such as health, education, security and resurrection, are changed fundamentally (see Hopmann 2000, 2003, 2006, 2007).

Managing transition

Following in the footsteps of Max Weber (cf. Weber 1923; Breuer 1991), the rise of the modern state can be described as the successive unfolding of a management of placements, by which the risks of being born (e.g., how to get an education, who takes care of me when I am ill or old, who gives me security in my everyday life and my dealings, how to be at peace with myself and my neighbours) were taken care of by institutions run by professionals with a specific education in how to deal with such ill-defined problems. These institutions (such as schools, hospitals, prisons, armies, bureaucracies, churches) had a comprehensive mission in that their professionals needed leeway to define which of these problems required what kind of treatment. The institutionalized problem-sharing allowed for taking on more risks and for moving beyond the care for immediate needs. Of course, which problems were considered ill- or well-defined changed over time, as did the resources available. But the internal distribution of resources and the evaluation of outcomes were mostly left to the professionals themselves, or to the emerging professional communities, who defined and controlled the education, licensing and practice of their members (cf. Abbott 1988; Hopmann 2003).

However, this comprehensive institutionalization had no fixed boundaries (which would have required the transformation of ill-defined into well-defined problems), thus opening a continuing process of broadening the scope and differentiating the means whenever new aspects of the problems seemed to become urgent. Thus each and every field underwent a massive expansion, multiplying its tasks and treatments. In the past, for example, a couple of years in school (and, for a few, in universities) was all the public education available. Today we spend twenty and more years of our lives in all kinds of professionalized educational settings, from childcare to elder hostels. Where once we met a doctor at the beginning and at the end of life, and maybe a few other times under extraordinary circumstances, today we spend, each and every year, a lot of time with medical doctors, nurses and other health-care specialists in waiting, treatment or emergency rooms. In short, the management of placements was extremely successful – so successful that Western societies spend most of their public budgets on dealing with these problems. As long as the differentiation of the institutions did not outspend the resources available, differentiation could go on and on, and with ever-growing speed.

This success story seems to come to an end in what social policy theory calls "the crisis of the welfare state", i.e. as resource limits and boundaries for further expansion become more and more visible (wherever they stem from). The legitimacy of the whole placement strategy relied on its ability to cover new ill-defined problems by expansion and sophistication; but there is now mistrust, and there are anxieties, about whether this comprehensive help will be sustainable in the future. A very visible impact of this loss of trust is the rise of welfare patriotism on the right and the left in almost all Western societies, articulating, and maybe misusing, much of the unease citizens feel about the future and the security of the inherited places and treatments (our welfare is said to be at risk because of immigrants, globalization, outsourcing, etc.). One of the important responses to this is a stepwise transition from a management of placements towards a management of expectations. Instead of guaranteeing comprehensive institutions, there is an attempt to transform ill-defined problems into better-defined expectations as to what can be achieved with a given amount of resources. Standards, benchmarks, indicator-based budgets etc. are examples of how this transition is managed. In that they do not necessarily imply long-term commitments, expectations can remain transient and volatile, open to shifts in the prevailing mix of expectations. This allows for more target-oriented management and accountability, which, however, comes at the price that whatever does not fit into the expectation regime of the time becomes marginalized. Comprehensive coverage is replaced by a fragmented system of treatments available under certain conditions. The left-over, not least the still ill-defined general issues – what does it mean to be well-educated, healthy, secure, to feel well etc.? – is either still connected to the former placements and/or transformed into temporary programs seemingly better equipped for addressing what remains immediately ("the patient in focus", "fighting crime", "strengthening social education" etc.).

Take the example of schooling: in earlier times public education was provided by "a place called school" (Goodlad 1983), run by professionals called teachers who decided, within sketchy limits and based on professional and local traditions, how to teach and what achievement seemed to be sufficient. There was no external public evaluation of the quality of the services provided – except in extraordinary cases of failure – as long as the normal procedures seemed to be professionally acceptable. Accordingly, good instruction was defined not primarily by its measurable outcomes, but rather by the professional judgement of the adequacy of what was done. Expectation management changes the picture dramatically. The core focus shifts to more or less well-defined expectations of what has to be achieved by whom. Good instruction is whatever meets the overlapping expectations, and it can be provided outside the traditional institutions and professions; in fact, everybody is welcome to provide it as long as the expectations are met. Of course, there are issues which are not (yet) covered by identifiable expectations; however – in the case of conflicting goals – the balance will always tip towards those expectations which are well-defined enough to become part of the implied accountability of the treatment providers. The rest, that which is not addressed but seems to need to be taken into account (e.g. issues such as mobbing/bullying, gender, migration etc.), is embedded into transient intervention programs of limited scope, enough to assure the public that no ill-defined problem is left behind.

It is important to remember that this is expectation management: it is not about outputs or outcomes or "efficiency" as such – as, for instance, NPM theories see it. Only those results which can be "verified" according to the stakes given and which do not meet expectations become problematic, and only those outcomes which meet the predefined criteria are considered a success. In fact, any caretaker of an ill-defined problem will always produce many more effects than any accountability system can observe and measure. Some of them may be simply by-products or minor collateral damage, but some impact may indeed be a major contribution (e.g. inclusion into society, regulating biographies etc.) which is beyond the short-sighted reach of the management of expectations. The line is drawn by the ever-changing fabric of expectations on the one hand and, on the other, by the simple fact that accountability needs something which can be counted, or where it is at least possible to measure the distance between expectations and results (cf. Slavin 2007).

The emergence and spreading of accountability is a signifying hallmark of the whole transition process. PISA fits nicely into this transition, as we will see in the following sections. Seen as part of a management of placements, PISA would be a disaster: it covers only a few aspects of the place, of schooling and the curriculum, and even these are covered in ways that capture them at best very indirectly (cf. the preceding chapters on PISA's limits). However, as a tool of expectation management, PISA fosters a transformation of what had been ill-defined issues (e.g. curriculum contents) into seemingly well-defined attainment goals. It delivers, at the same time, a parameter for holding schooling accountable – for delivering according to the expectations embedded into its questionnaires. It contributes to the fragmentation of the field by transforming the conditions and constraints of this delivery into independent factors (e.g. social background, gender, migration etc.), whose impact has to be minimized by way of teaching if expectations are to be fully met. The best representation of this is given by the "production functions" by which PISA-using economists calculate the transaction costs of schooling and the ways and means by which the principals (parents, the state) might maximize the effectiveness of the chosen agents (i.e. teachers, schools or school systems; cf. e.g. Bishop & Woessmann 2004; Fuchs & Woessmann 2004; Micklewright & Schnepf 2004, 2006; Sutherland & Price 2007).
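To make concrete what such a "production function" looks like, here is a minimal sketch of the regression form typically estimated in studies of this kind (the notation is illustrative and not taken from any of the works cited): the test score $T_{is}$ of student $i$ in school $s$ is modelled as a linear function of family background, school resources and institutional features,

\[
T_{is} = \alpha + \beta' B_{is} + \gamma' R_{is} + \delta' I_{s} + \varepsilon_{is},
\]

where $B_{is}$, $R_{is}$ and $I_{s}$ are vectors of background, resource and institutional variables. Each estimated coefficient is then read as the marginal "output" of one input, while everything the regressors do not capture is relegated to the error term $\varepsilon_{is}$ – which is precisely how the conditions of schooling become "independent factors" in the sense described above.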

Constitutional Mindsets

When it comes to the public sphere, the transition from a management of placements towards a management of expectations meets different constitutional mindsets, i.e. deeply engrained ways of understanding the relation between the public and its institutions (cf. for the basics of the following e.g. Haft & Hopmann 1990; Hopmann & Wulff 1993; Zweigert & Kötz 1997; Lepsius 2006). For example, the American constitution is constructed as a protection of the individual against the misuse of power by governments and others. It sees the rights of individuals as a given and the intervention of government as limited by these rights, with government obliged to protect citizens against any infringements of their constitutional freedoms. The First Amendment, for instance, states: "Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the government for a redress of grievances." Within the Prussian or the Austrian tradition, which comes from the opposite direction (not least Roman law), civil rights are something constituted and limited by the law, i.e. it is the state and its (more or less enlightened) institutions which create and define the boundaries of social and individual life. Religious freedom, for instance, may be granted, but the freedom is closely connected to state supervision of religious organizations and institutions, which can set limits on the conditions for the 'full' exercise of a religion (which creates problems for non-institutionalized traditions such as Islam). The Scandinavian constitutional tradition settled (at least in its beginnings) somewhere between these fundamentally opposed starting points: it acknowledges the right of the state to impose a constitution, but originally limited its reach by making it subsidiary to local and regional law traditions. That has changed gradually, but there is still no unified code of law, rather a pragmatic approach to regulating fields of interest based on practical experience and home-grown traditions. The local constituency is still seen as the core of the social fabric. Citizens are empowered to define a community life based on their own traditions within a broad constitutional setting. Thus, while there are state churches in Norway and Denmark, there was plenty of leeway to establish new local traditions (e.g. as "free churches"); today any group of a certain size and permanence and with a discernible creed of its own can establish itself as a "church" with a right to receive state subsidies (cf. Repstad 2000²).

Of course, the constitutional and legal structures are much more mixed, the patterns much more blurred, than these different starting points indicate. But the mindsets on which they are founded seem to be alive and well, and have a strong impact on how the public and its institutions conceptualize the legal and structural implications of social change. At least, this is the case when it comes to how accountability measures are embedded into the public system as a whole, and especially into the school system. There the main questions are: Is accountability about protecting the individual citizen (student) against bad service-rendering? Or is the primary goal to strengthen the ability of local communities to run their institutions according to their own needs and aspirations? Or is it about holding the public system accountable for its contribution to the state's welfare? Put in the current educational context, one accordingly has to ask where the main focus of accountability is situated: at "no child left behind", "no school left behind", or "no state left behind"?



2 The Multiple Realities of Accountability

In a historical perspective, PISA and its like are heavily indebted to the legacy of the assessment movement in the United States and its internationalization by the International Association for the Evaluation of Educational Achievement (IEA), beginning in the late 1950s. Seminal works such as Caswell's City School Surveys (1929), the Eight Year Study (1942) or Benjamin Bloom's groundbreaking work on the "taxonomy of educational objectives" (1956), the nationwide spread of the Scholastic Aptitude Test (SAT) from the 1950s onwards, and the establishment of the National Assessment of Educational Progress (NAEP) paved the way for an understanding in which student achievements were seen as the prime indicator of the quality of schooling. The rapid rise of assessment and evaluation as key tools of educational control was fuelled by a constant flow of critical works on the poor state of the Nation's schools. From Conant's report The American High School (1959), Rickover's American Education – a National Failure (1963) and Coleman's report on the "Equality of Educational Opportunity" (1966) to the national report A Nation at Risk (1983) and the Nation's Report Card (Lamar & Thomas 1987), the basic tenor was the same: the American system is failing many of its students – as demonstrated by the test scores achieved in local, state-wide and national testing.

From the late 1980s onwards, this seemingly constant failing led to a more generalized approach to assessment, testing and "reform", now called "standards-based reform" (cf. Achieve 1998; Ahearn 2000; Fuhrman 2001). State after state introduced state standards for the curriculum and – if it had not already done so – state-wide assessment of student achievement, to assure that these standards were applied. It would be a wild exaggeration to pretend that this approach was an immediate success. In some cases the introduction of state standards obviously spelled disaster (cf., e.g., the Kentucky experience; Whitford & Jones 2000); in others at best modest gains could be reported, but their validity was, and is, heavily disputed (cf. e.g. Cannell 1987; Dorn 1998; Saunders 1999; Linn 2000; Haney 2000; Watson & Suppovitz 2001; Amrein & Berliner 2002; Haney 2002; Ladd & Walsh 2002; Swanson & Stevenson 2002; Darling-Hammond 2003; Braun 2004). However, despite some 50 years of mixed experience with assessment and rather shallow results (cf. Cook 1997; Mehrens 1998; McNeil 2000; Herman & Haertel 2005), the next move was to introduce national legislation aiming at a unified approach to assessment and accountability: the "No Child Left Behind Act" of 2001, enacted under the Bush presidency and supported by an almost united Congress (cf. Peterson & West 2003).

No Child Left Behind (NCLB)

It is worthwhile to give the provisions of the NCLB act a closer look, as they are paradigmatic for how accountability is constructed within the American tradition. Already the comprehensive "statement of purpose" unfolds a wide array of issues:

"The purpose of this title is to ensure that all children have a fair, equal, and significant opportunity to obtain a high-quality education and reach, at a minimum, proficiency on challenging State academic achievement standards and state academic assessments. This purpose can be accomplished by –

(1) ensuring that high-quality academic assessments, accountability systems, teacher preparation and training, curriculum, and instructional materials are aligned with challenging State academic standards so that students, teachers, parents, and administrators can measure progress against common expectations for student academic achievement;

(2) meeting the educational needs of low-achieving children in our Nation's highest-poverty schools, limited English proficient children, migratory children, children with disabilities, Indian children, neglected or delinquent children, and young children in need of reading assistance;

(3) closing the achievement gap between high- and low-performing children, especially the achievement gaps between minority and non-minority students, and between disadvantaged children and their more advantaged peers;

(4) holding schools, local educational agencies, and States accountable for improving the academic achievement of all students, and identifying and turning around low-performing schools that have failed to provide a high-quality education to their students, while providing alternatives to students in such schools to enable the students to receive a high-quality education;

(5) distributing and targeting resources sufficiently to make a difference to local educational agencies and schools where needs are greatest;

(6) improving and strengthening accountability, teaching, and learning by using State assessment systems designed to ensure that students are meeting challenging State academic achievement and content standards and increasing achievement overall, but especially for the disadvantaged;

(7) providing greater decision making authority and flexibility to schools and teachers in exchange for greater responsibility for student performance;

(8) providing children an enriched and accelerated educational program, including the use of school-wide programs or additional services that increase the amount and quality of instructional time;

(9) promoting school-wide reform and ensuring the access of children to effective, scientifically based instructional strategies and challenging academic content;

(10) significantly elevating the quality of instruction by providing staff in participating schools with substantial opportunities for professional development;

(11) coordinating services under all parts of this title with each other, with other educational services, and, to the extent feasible, with other agencies providing services to youth, children, and families; and

(12) affording parents substantial and meaningful opportunities to participate in the education of their children." (Section 1001)

But this complexity is right away reduced to more specific expectations when it comes to which goals are in focus and how accountability is supposed to foster these goals. Academic standards are, according to NCLB, the following:

"Standards under this paragraph shall include—

(i) challenging academic content standards in academic subjects that—
(I) specify what children are expected to know and be able to do;
(II) contain coherent and rigorous content; and
(III) encourage the teaching of advanced skills; and

(ii) challenging student academic achievement standards that—
(I) are aligned with the State's academic content standards;
(II) describe two levels of high achievement (proficient and advanced) that determine how well children are mastering the material in the State academic content standards; and
(III) describe a third level of achievement (basic) to provide complete information about the progress of the lower-achieving children toward mastering the proficient and advanced levels of achievement." (Section 1111)

Accountability is then based on these standards:

"Each State plan shall demonstrate that the State has developed and is implementing a single, statewide State accountability system that will be effective in ensuring that all local educational agencies, public elementary schools, and public secondary schools make adequate yearly progress as defined under this paragraph. Each State accountability system shall –

(i) be based on the academic standards and academic assessments . . . and other academic indicators consistent . . . , and shall take into account the achievement of all public elementary school and secondary school students;

(ii) be the same accountability system the State uses for all public elementary schools and secondary schools or all local educational agencies in the State, except that public elementary schools, secondary schools, and local educational agencies not participating under this part . . . and

(iii) include sanctions and rewards, such as bonuses and recognition, the State will use to hold local educational agencies and public elementary schools and secondary schools accountable for student achievement and for ensuring that they make adequate yearly progress . . . ". (ibid.)

Finally, what is meant by "Adequate Yearly Progress" is defined in the subsequent paragraph:

"(B) ADEQUATE YEARLY PROGRESS.—Each State plan shall demonstrate, based on academic assessments described in paragraph (3), and in accordance with this paragraph, what constitutes adequate yearly progress of the State, and of all public elementary schools, secondary schools, and local educational agencies in the State, toward enabling all public elementary school and secondary school students to meet the State's student academic achievement standards, while working toward the goal of narrowing the achievement gaps in the State, local educational agencies, and schools.

(C) DEFINITION.—'Adequate yearly progress' shall be defined by the State in a manner that—

(i) applies the same high standards of academic achievement to all public elementary school and secondary school students in the State;
(ii) is statistically valid and reliable;
(iii) results in continuous and substantial academic improvement for all students;
(iv) measures the progress of public elementary schools, secondary schools and local educational agencies and the State based primarily on the academic assessments described in paragraph (3);
(v) includes separate measurable annual objectives for continuous and substantial improvement for each of the following:
(I) The achievement of all public elementary school and secondary school students.
(II) The achievement of—
(aa) economically disadvantaged students;
(bb) students from major racial and ethnic groups;
(cc) students with disabilities; and
(dd) students with limited English proficiency;
except that disaggregation of data under subclause (II) shall not be required in a case in which the number of students in a category is insufficient to yield statistically reliable information or the results would reveal personally identifiable information about an individual student." (ibid.)

I have quoted the NCLB act at such length because it provides a concise definition of what the management of expectations, in this perspective, is about. The core of accountability is narrowly focused on student achievements measured by "academic standards". Other functions of schooling (such as the role school plays for local communities or in shaping society) are hardly mentioned, and if at all, they are constructed as minority problems. At the same time academic achievement is reduced to that which can be reported as "statistically valid and reliable", leaving out any educational or social achievements which cannot be counted as required. Within this frame, responsibility is passed from the federal top through intermediate levels such as state and district administrations down to teachers and local school leaders. They are expected to improve the test results by "evidence-based teaching" or even by "data-driven decision making" (cf. ECS 2002; Marsh & Hamilton 2006), which collapses the complexities of classroom work or school leadership into single-minded framesets of statistically significant achievement gains (cf. the comments by e.g. Koretz 2002; Berliner 2005; Hargreaves 2006; Ingersoll 2006). The original idea of the "basic principles of curriculum and instruction" (Tyler 1949), which was embedded in the methodologically much broader approach of, e.g., the Eight Year Study (Aikin 1942), and which called for a comprehensive understanding of schooling as a social and local institution, has – as it seems – dwindled to a concept of measurable yearly progress.

The response to NCLB within the education community has been almost evenly divided. While a majority of politicians and economists, and certain parts of the public, seem to support NCLB wholeheartedly, or at least its core concept of accountability, many educators are less enthusiastic. The public response seems to be fragmented along lines of social class, level of education, and political orientation (cf. Loveless 2006). The professional reactions have much to do with the question of whether or not one accepts the narrow focus of NCLB as reasonable. On the one side are those who see NCLB at least as a starting point for a possible school revolution, finally solving the American school crisis (cf. e.g. Ladd & Walsh 2002; Peterson & West 2003; Irons & Harris 2006). Some economists have even begun to calculate the economic spin-off of NCLB if modest gains can be sustained (Hanushek 2002, 2006; Hanushek & Raymond 2003, 2005). Others are sceptical, pointing to the "impoverished" scope of the provisions (cf. e.g. Berliner 2005) or to the obvious implementation problems of the current approach (for a variety of such problems cf. e.g. Eberts, Hollenbeck & Stone 2002; Mintrop 2003; Chubb 2005; Gorard 2006; Martineau 2006; Apple 2007; Deretchin & Craig 2007; Zimmer et al. 2007). The construct of "adequate yearly progress" (AYP) in particular has created a tremendous challenge. The expectations of how much progress could be reached went far beyond the reality of slowness and instability in school change – and adding the minority criteria worsened the situation. Some critics fear that almost all American schools will end up on the watch lists of failing schools (cf. Linn, Baker & Betebenner 2002; Linn & Haug 2002; Herman & Haertel 2005; Linn 2005).
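A rough calculation shows why this fear is plausible (the starting value is illustrative; the deadline was set by the law, which required all students to reach proficiency by the 2013/14 school year): a school that in 2002 had, say, 30 per cent of its students at the proficient level would need average gains of

\[
\frac{100 - 30}{12\ \text{years}} \approx 5.8\ \text{percentage points per year},
\]

sustained over more than a decade and, given the disaggregation rules quoted above, in every single subgroup – a pace far beyond the "reality of slowness and instability in school change" just described.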

And many researchers and practitioners have pointed out that NCLB will fail as long as schools and their leadership do not have the required "capacities", i.e. the ability to identify their local mix of problems and to deal with these professionally (cf. e.g. Elmore 2006). However, what is almost never challenged in this debate is the basic assumption of the whole enterprise, namely that it is the students' academic achievement which best reflects the quality of schooling, and that it is the poor quality of instruction, provided by poor teaching, which is to blame for the fact that children are left behind. Holding states, school districts and schools accountable is reduced to the requirement to do whatever is necessary to pass this accountability along to classrooms and teachers and, finally, to the students themselves.

Similar criticism has been voiced about PISA's role in the US (cf. e.g. Bracey 2005). However, even though it uses a similar approach to mapping school achievements, and even though the US has been one of the driving forces behind it, PISA does not have much of an American audience in the shadow of NCLB, and no significant impact on the wider public or the educational science community. For instance, a recent search of the Education Resources Information Center (ERIC) brought up fewer than 150 articles and books about PISA, most of them from outside the US – nothing compared to the general issue of accountability (more than 18,000 hits) or NCLB (about 2,000 hits). It does not even match the impact of TIMSS (more than 400 hits) and is far below what the German equivalent of ERIC, the FIS Bildung, reports on PISA from Germany (more than 2,500 hits). Arguably, this reflects NCLB's status as a national law, whereas PISA is an international enterprise with no direct obligations for the states and the schools participating.

In general, PISA seems to have less impact where a national achievement control of some kind is already in place (as, e.g., in Sweden), which bodes ill for the future of PISA as more and more countries introduce accountability systems of their own making. However, this would not explain why earlier studies like TIMSS have had more visibility in the US. In my view, this shortcoming of PISA has much to do with the key fallacy of its design, namely to present itself mainly as a cross-national comparison, even though it shares with NCLB the same starting point, student achievement, which only to a very small degree (not least in the case of the US with its manifold states) can be attributed to a specific "national" fabric of schooling (cf. the critique brought forward in this volume). As a cross-national comparison PISA is not much of an eye-opener for the US public; it only confirms what was known from earlier international studies, i.e., that US students don't do very well in such tests compared to the students of many other nations. Given that the "winner" of PISA 2000 and 2003 was tiny Finland, and not an international competitor like Japan (which succeeded in TIMSS, and did well on PISA too), there is, seemingly, not much for Americans to learn from successful PISA nations.

No School Left Behind

When the accountability wave hit Nordic shores for the first time, the spontaneous reaction of the political and educational establishments was almost the opposite of what had happened in the US. While there accountability became a tool to centralize important elements of educational control, first at the state and later at the national level, the spontaneous reaction of the Scandinavians was decentralization. Although government offices and administrative departments (in Norway and Sweden) were created to satisfy the discourse of New Public Management, and many national reports and white papers were commissioned on public service rendering and administration, none of this – except maybe in Finland, which was under much more economic strain following the fall of the Berlin Wall (cf. Simola 2005; Uljens in this volume) – led to a sustained and comprehensive accountability reform of the US kind (cf. Bogason 1996; Irjala & Eikås 1996; Prahl & Olsen 1997; Pollitt & Bouckaert 2004²). The concurrent debate about joining or not joining the EU may have had a share in this decision-making, in that EU participation was often framed in terms of the risk of more centralization (cf. Karlsen 1994). However, tackling challenges by way of an issue-focused and pragmatic step-by-step approach, with special regard to how lower levels of government, such as districts and municipalities, could deal with any emerging tool kit, was consistent with what I am calling their constitutional mindset.

Thus, as background for understanding the Nordic education sector, one has to know that schools and their teachers played a pivotal role in the nation-building processes across the region, and in the shaping of national identities (cf. e.g. Slagstad 1998; Telhaug & Mediås 2003; Korsgaard 2004; Werler 2004; Telhaug 2005). Moreover, schools are seen not only as places for the young, but as the cultural core of the local community – which turns the local and regional distribution of schooling into an always contested issue.

Until the 1990s the main tool used to govern the school curriculum was curriculum guidelines, developed mainly by the state administrations by way of committees largely consisting of experienced teachers and subject matter specialists (cf. Gundem 1992, 1993, 1997; Sivesind, Bachmann & Afzar 2003; Bachmann, Sivesind & Hopmann 2004). Local schools and teachers had considerable leeway to pick and choose within this curriculum frame in order to develop locally adapted teaching programs. There was no regular state-run evaluation of the outcomes of teaching, and, indeed, outside research circles not even the concept was familiar (cf. Hopmann 2003). In a Nordic perspective schools were seen as places run by highly educated and esteemed teachers, who knew best how to do their job. Curriculum change was primarily seen as a matter of dialogue between local experience and national needs; changes were typically introduced by way of lengthy try-out periods, and with an often extraordinary involvement of all levels of schooling and administration. Of course, this was by no means a paradise of peaceful change: each and every curriculum reform has had its proponents and opponents, and the interplay between the school sector, research, politics and the public was at times pretty contentious (cf. Sivesind, forthcoming). However, this played out within the context of school systems that enjoyed, for most of the time, broad support at all levels of society.

In this context it was no surprise that the first reaction to sharp national and international criticism of schooling was a re-doing of what had been successful. In the case of Norway, for instance, the first contemporary criticism of the school system was voiced by an OECD panel (1988) and by a national committee commissioned by the parliament (NOU 1988:22). Reflecting the emerging NPM discourse, both concluded that the weaknesses of the national school system were, significantly, an outcome of an underperforming school governance structure, which was not able to ensure that the goals of the curriculum guidelines were being reached. Two conclusions were drawn: On the one side, a sweeping reform of the whole school curriculum was launched, beginning with a new general curriculum frame (L93), followed by new comprehensive guidelines for the upper secondary sector (R94) and for the elementary and lower secondary schools (L97). The frame stressed the double purpose of schooling as caretaker of the national and local heritage and as knowledge-promoter (cf. STM 29 1994/95). The subsequent curriculum guidelines received a new structure: they were to focus on the most important requirements and to state these expectations in terms of goals (a kind of management-by-objectives approach) that could be reached by average schools. On the other side, a re-make of the governance structure was inaugurated, constructing a double-faced reform combining a re-focussing of national steering with a stress on the importance of local autonomy and responsibility for reaching these goals (cf. STM 37 1990-1991; STM 47 1995/96; KUF 1997). The reform was supported by numerous in-service and research programs to help districts, municipalities, schools and teachers identify the major obstacles and prepare for the enactment of the new guidelines. This new orientation was complemented by initiatives to develop school-based and peer-guided school improvement (cf. e.g. Granheim, Kogan & Lundgren 1990; Karlsen 1993; Ålvik 1994; KUF 1994; Haug & Monsen 2002; Nesje & Hopmann 2003). However, this first take-up of NPM-like measures was to infuriate many educators, politicians and practitioners alike; these critics felt that the tool-kit of accountability was an "instrumentalist mistake" that did not fit the national traditions of schooling and that challenged the former strategy of placement, i.e. the compulsory comprehensive school (cf. e.g. Hovdenak 2000; Koritzinsky 2000; Lindblad, Johannesson & Simola 2003).

When a new liberal-conservative government felt that these first steps of reform were still not enough to ensure adequate school development, it commissioned a new national report to recommend additional measures. What emerged was a peculiar understanding of school development as development of "quality", in which "quality" represents a rather vague and all-encompassing understanding of whatever might affect the outcomes of schooling (cf. STM 30 2004; Birkeland 2007; Sivesind forthcoming). The then-secretary of education, Kristin Clemet, expressed the basic rationale of this approach as follows:

Society's reasons for having schools, and the community tasks imposed on them, are still relevant today: Education is an institution that binds us together. We all share it. It has its roots in the past and is meant to equip us for the future. It transfers knowledge, culture and values from one generation to the next. It promotes social mobility and ensures the creation of values and welfare for all. For the individual, education is to contribute to cultural and moral growth, mastering social skills and learning self-sufficiency. It passes on values and imparts knowledge and tools that allow everyone to make full use of their abilities and realize their talents. It is meant to cultivate and educate, so that individuals can accept personal responsibility for themselves and their fellows. Education must make it possible for pupils to develop so that they can make well-founded decisions and influence their own futures. At the same time, schools must change when society changes. New knowledge and understanding, new surroundings and new challenges influence schools and the way they carry out the tasks they have been given. Schools must also prepare pupils for looking farther afield than the Norwegian frontiers and being part of a larger, international community. We must nourish and further develop the best aspects of Norwegian schools and at the same time make them better equipped for meeting the challenges of the knowledge society. Our vision is to create a better culture for learning. If we are to succeed, we must be more able and willing to learn. Schools themselves must be learning organizations. Only then can they offer attractive jobs and stimulate pupils' curiosity and motivation for learning. . . . We will equip schools to meet a greater diversity amongst pupils and parents/guardians. Schools are already ideals for the rest of society in the way they include everybody. However, in the future we must increasingly appreciate variety and deal with differences. Schools must have as their ambition to exploit and adapt to this diversity in a positive manner.

If schools are to be able to achieve this, it is necessary to change the system by which schools are administered. National authorities must allow greater diversity in the solutions and working methods chosen, so that these can be adapted and customized to the situation of each individual pupil, teacher and school. The national authorities must define the objectives and contribute with good framework conditions, support and guidance. At the same time, we must have confidence in schools and teachers as professionals. We wish to mobilize to greater creativity and commitment by allowing greater freedom to accept responsibility. . . .

All plans for developing and improving schools will fail without competent, committed and ambitious teachers and school administrators. They are the school system's most important assets. It is therefore an important task to strengthen and further develop the teachers' professional and pedagogical expertise and to motivate for improvements and changes. This Report heralds comprehensive efforts regarding competence development in schools. Education must be developed through a dialogue with those who have their daily work in and for schools. (Introduction to STM 30 2004)

The difference from the accountability rhetoric of NCLB is striking. Where NCLB is focused solely on "academic standards" and on allocating responsibilities to states, districts, teachers and students, the Norwegians talk about allowing for "greater diversity", about the core role of the teachers and – above all – about "confidence in schools and teachers as professionals". The "comprehensive effort" announced is built around three dimensions – structure, process, and outcomes – and both the minister and the committee stress time and again that one cannot expect better results without improving the structures and the processes, and without considerable help from all sides (cf. STM 30 2004). The committee tried to embed the new tools in a way that is less offensive to traditionalists, by integrating the new into the familiar concepts of local monitoring and school autonomy. The proposals included the establishment of a national testing procedure to ensure that basic competencies are achieved, but stressing the "basic" and seeing the tests first and foremost as a helping hand to assist schools in diagnosing where they may have a need for improvement (cf. NOU 2003:16).

The introduction of the national testing has been very difficult and is not yet a finished task; it is disputed by researchers and practitioners alike, and still far from anything resembling NCLB (cf. Langfeldt, Elstad & Hopmann 2007). Nobody speaks about "evidence-based teaching" or "data-driven decision making" as prime tools to make school improvement work; the data are seen as a limited indicator, which has to be embedded in a wider understanding of a school's program and needs. But even this limited aspiration has put tremendous stress on both the national test developers and the local communities and schools as they try to meet the new expectation regime. Because of their poor technical quality, the first wave of national tests was met by sharp criticism, even from the supporters of their use. This forced the government to take a one-year break and completely redo the tool-kit of assessment (cf. Lie 2005; MMI 2005; Telhaug 2005; Langfeldt 2007b). As a result, many schools and municipalities felt more confused than controlled by the new measures. It seems that it will take some time before a more coherent pattern of working with national monitoring emerges and the different levels find sustainable strategies for dealing with the new tool kit of expectation management (cf. Møller 2003; Riksrevisjonen 2006; Sivesind, Langfeldt & Skedsmo 2006; Elstad 2007; Elstad & Langfeldt 2007; Engeland, Roald & Langfeldt 2007; Isaksen 2007). However, the prevailing attitude towards what might be expected can be illustrated by what the principal of a top-scoring school said at a national leaders' conference: "One shouldn't put too much into these results"; they reflected, he said, only a small part of his school's program and did not inform his school about the challenges it faced, not least in relation to special education. In any event, one should not expect his school to be on top next year; the next year's class wasn't close to the quality of this one. This was not just a fine display of public Norwegian humbleness ("you shouldn't believe you are someone"). He seemed genuinely concerned that the unexpected success would divert attention from the more pressing problems of his school and mislead parents and local politicians, with the implication of less support in tackling his school's problems. This reaction is similar to the one seen in Finland as the Finns discuss their country's overwhelming PISA success and its more or less unintended side-effects (cf. e.g. Simola 2005; Kivirauma, Klemelä & Rinne 2006; Uljens in this volume).

PISA and its predecessors like TIMSS played an important, but not a key, role in this development in Norway. The move towards a policy change had started long before PISA came into being. The TIMSS and PISA data underlined that there were some substantial shortcomings to address, but PISA was not taken as a sufficient description of the challenges ahead, either in the relevant committees or in the parliament. Nor did PISA lead to a fundamental change in the course of action, with the one exception that the new generation of curriculum guidelines tries to adapt some of PISA's competence conceptualizations. But this was not by chance. Most Nordic PISA researchers were scrupulous in outlining the reach of their results, pointing to the limited scope of PISA's material, admonishing against any attempt to simplify the complexities, and warning against any expectation of comprehensive political solutions based on PISA (cf. e.g. Mejding & Roe 2006). The most substantial criticism of PISA's reach came from within, from researchers with close connections to the project. They have analysed in particular the match and mismatch of PISA constructs with their nations' traditions of knowledge culture and schooling (cf. Olsen 2005 and in this volume; Sjøberg in this volume; Dolin in this volume). They have discussed if and how PISA reflects the social and cultural diversity of student achievements (cf. e.g. Allerup 2005, 2006 and in this volume). In addition, the PISA project tried from its beginning to place a main focus on schools as the decisive units of action. This was not easy: PISA does not provide comprehensive, independently cross-checked school data, but relies instead on the descriptions of school climate and classroom practice provided by the students and the teachers themselves – a weak source, because of the well-known variance in the ways students describe the same experienced curriculum (cf. Turmo & Lie 2004).



It would exceed the scope of this chapter to address the subtle differences between the Nordic countries, with their different levels and shades of public debate on PISA and national testing (cf. Langfeldt, Elstad & Hopmann 2007). What is important, however, is another fundamental difference from the NCLB approach which the Nordic countries have in common. As a recent survey of teacher education in the Nordic countries shows (Skagen 2006), they share a fundamental trust in the quality of their teachers and of the underlying teacher education. It is not that nothing might be improved. Rather, they feel that teachers are well enough educated to do what is necessary if they are given the means and the challenges to do so. The core issue then becomes how to improve the local communities' "room to move", their ability to unleash teachers' energies and to monitor progress in a supportive way (cf. Engeland, Roald & Langfeldt 2007).

No State Left Behind

What a difference PISA can make was nowhere more visible than in Germany and – with a typical delay – in Austria. In Germany PISA was "big news" from the beginning, filling newspapers, forcing political responses, engaging everyone interested in school affairs (cf. summarizing Weigel 2004). The Austrian reaction was somewhat slower; Austria seemed to have fared better in PISA 2000, at least better than Germany, which counts for quite a lot in Austria (cf. Bozkurt, Brinek & Retzl in this volume). When it turned out that Austria scored worse in PISA 2003, and that the better results of 2000 might have been an artefact of flawed sampling (cf. Neuwirth 2006; Neuwirth, Ponocny & Grossmann 2006), the discussion climate changed dramatically. Now both school systems were seen to be in a deep crisis, not least a crisis of their traditional school structures and their outworn forms of teaching (cf. summarizing Terhart 2004; Bozkurt, Brinek & Retzl in this volume).

The response pattern as such was no surprise. Both countries have had, since the school reforms of the late 18th century (cf. Melton 1988), recurrent "big school debates" every 20 to 30 years. Every debate is a struggle about the national curriculum, and (about) every second debate is more specifically focused on the structures of schooling and their implications (as was the case for Prussia/Germany in the early 19th century, the 1850s, the 1890s, the 1920s, and finally in the 1960s and early 1970s; cf. summarizing Hopmann 1988, 2000).



The important role of school-structure issues within this pattern results from the understanding, in both countries, that at least since the reforms of the late 18th century schools are state-owned and state-run systems – at the national level in Austria, at the state level in the Federal Republic of Germany. Local municipalities have some responsibilities for "outer" school matters, such as buildings and equipment, but the curriculum, the hiring and firing of teachers, the licensing of school books, the day-by-day control of all "internal" school matters etc. are seen as being within the realm of the state's school administration. Moreover, both countries have stratified school systems, in which secondary schools are divided into different strands for "high" and "low" achievers, providing, e.g., different schools for "academic achievers" (Gymnasium), for more "practically oriented" youth (Realschule, Hauptschule, Berufsschule), and for children with "special needs" (Sonderschule). The decision about which kind of school a student should attend is normally made following 4th or 6th grade (in the 19th century the division was from the first grade onwards). Both countries have a system of vocational education combining school with on-the-job training, sometimes beginning at the lower secondary level, but more usually covering those who do not attend a Gymnasium or the like for upper-secondary education. However, in both countries the rate of attainment of the highest academic qualification (Abitur, Matura), and thereby of access to universities, is considered the key indicator of social equity (cf. Becker & Lauterbach 2004).

In that it is the state, and the state alone, that regulates schools, school structures can be understood as institutionalized expressions of the state's view on social class and stratification. The proverbial example of inequality in the school debates of the 1960s was the Catholic working-class girl from a rural area attending a Hauptschule; she has now been replaced by the Muslim daughter of an immigrant family living in a poor inner-city district who also attends a Hauptschule or a Sonderschule (cf. summarizing Berger & Kahlert 2005). Within this frame, school-structure debates tend to become debates on social division; the stratified school system is regularly defended by conservatives and economists, whereas the move towards a comprehensive school system is an affair of the heart for social democrats and the labour movement, without regard to whether one system or the other has a better record in terms of social equity. In both countries the core argument is the assumed, yet not proven, effect stratification might have on human capital: does stratification lead to a structural underperformance of lower-class students, or does a comprehensive school limit the space and speed of development of high-achievers, and vice versa (cf. Bozkurt, Brinek & Retzl in this volume)?

How PISA fits into this frame is easily understood if one takes its official purpose as stated by its owner, the OECD:

Quality education is the most valuable asset for present and future generations. Achieving it requires a strong commitment from everyone, including governments, teachers, parents and students themselves. The OECD is contributing to this goal through PISA, which monitors results in education within an agreed framework, allowing for valid international comparisons. By showing that some countries succeed in providing both high quality and equitable learning outcomes, PISA sets ambitious goals for others. (Angel Gurría, OECD Secretary-General, as introduction to PISA 2006, 3)

According to the same source, PISA’s “key features” have been so far:

– Its policy orientation, with design and reporting methods determined by the need of governments to draw policy lessons.
– Its innovative “literacy” concept, which is concerned with the capacity of students to apply knowledge and skills in key subject areas and to analyse, reason and communicate effectively as they pose, solve and interpret problems in a variety of situations.
– Its relevance to lifelong learning, which does not limit PISA to assessing students’ curricular and cross-curricular competencies but also asks them to report on their own motivation to learn, their beliefs about themselves and their learning strategies.
– Its regularity, which will enable countries to monitor their progress in meeting key learning objectives.
– Its contextualisation within the system of OECD education indicators, which examine the quality of learning outcomes, the policy levers and contextual factors that shape these outcomes, and the broader private and social returns to investments in education.
– Its breadth of geographical coverage and collaborative nature, with more than 60 countries (covering roughly nine-tenths of the world economy) having participated in PISA assessments to date, including all 30 OECD countries. (ibid., 7)

The “policy orientation, with design and reporting methods determined (sic!) by the need of governments to draw policy lessons” has led to a wealth of national and OECD reports using PISA data as a means to assess the quality of school structures and schooling, issues of social inequality, gender, migration etc., and, not least, to comparisons, again and again, of countries and their PISA performance in relation to other OECD indicators (most of this available online at http://www.pisa.oecd.org).



Of course this approach rests on a number of implicit assumptions, which are anything but self-evident:

– The assumption that what PISA measures is somehow important knowledge for the future: There is no research available which proves this assertion beyond the truism that knowing something is always good and knowing more is always better. There is not even research showing that PISA covers enough to be representative of the school subjects involved or of the general school knowledge base. PISA items are based on the practical reasoning of its researchers and on pre-tests of what works in all or most settings – not on systematic research on current or future knowledge structures and needs (cf. Dohn 2007; Bodin, Jahnke, Meyerhöfer, Sjøberg in this volume).
– The assumption that the economic future is dependent on the knowledge base monitored by PISA: The little research on this theme – which assumes that there is a direct relation between test scores and future economic development – relies on strong and unproven arguments which have no basis when, for instance, comparing success in PISA’s predecessors with later economic development (cf. Fertig 2004; Heynemann 2006).
– The assumption that PISA measures what is learned in schools: This is not PISA’s own starting point, which is precisely not to use national curricula as the point of reference (as, e.g., TIMSS does; cf. Sjøberg in this volume). The decision to focus on a small number of issues and topics which can be expected to be present in all participating countries leaves open the question of how these items represent the school curriculum as a whole (cf. Benner 2002; Fuchs 2003; Ladenthin 2004; Hopmann 2001, 2006; Dolin, Sjøberg, Meyerhöfer in this volume) – beyond the fact that those who are successful in school do, on average, better on PISA, which is hardly a surprise inasmuch as PISA requires cognitive and not least language skills which are helpful in schools as well. Some even argue that PISA first and foremost monitors whatever intelligence testing monitors (cf. Rindermann 2006), which could lead to the somewhat irritating implication that, according to PISA, e.g. Finns are more “intelligent” than Germans or Austrians.
– The assumption that PISA measures the competitiveness of schooling: One has to keep in mind that at best 5–15 percent of the variance in the PISA results can be attributed to lasting qualities provided by the schools studied (see the sketch after this list; cf. already Watermann et al. 2003; for the principal problems of re-constructing schooling and teaching based on such data see Rauin 2004). Most of the variance can be attributed to outside factors that are mostly beyond the reach of schooling (such as social background; cf. Baumert, Stanat & Watermann 2006).
– The assumption that PISA thus measures and compares the quality of national school provision, not least of school structures, teacher quality, the curriculum, etc.: Although school effects as such have a very limited role in the results of PISA, one has to add that a) PISA has a considerable sampling and cultural-match problem, which reduces its trustworthiness as an indicator for national systems, at least for systems with the small differences seen between Western countries (see the contributions to this volume), and b) since Coleman’s seminal study (1967) it is well known that schools have only a very limited impact on the social distribution of educational success when compared to factors such as the social fabric of the surrounding society (cf. Shavit & Blossfeldt 1993; Becker & Lauterbach 2004). Moreover, by its very design PISA is forced to drop most of what might indeed indicate specifics of national systems (cf. Dolin, Langfeldt in this volume).
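To give a concrete sense of the order of magnitude involved, claims of this kind are usually discussed via a two-level decomposition of test-score variance: whatever is attributable to lasting school qualities has to fit inside the between-school component. The following is an editorial sketch of that arithmetic, with invented numbers chosen only to fall inside the band cited above; it is not a computation from PISA data:

% Illustrative two-level decomposition (editorial sketch, not from the PISA reports):
% the score y_{is} of student i in school s splits into a between-school
% and a within-school component.
\[
\operatorname{Var}(y_{is}) = \sigma^2_{\mathrm{between}} + \sigma^2_{\mathrm{within}},
\qquad
\rho = \frac{\sigma^2_{\mathrm{between}}}{\sigma^2_{\mathrm{between}} + \sigma^2_{\mathrm{within}}}
\]
% With the invented values sigma^2_between = 10 and sigma^2_within = 90 (arbitrary
% units), rho = 10/100 = 0.10: ten percent of the variance lies between schools,
% inside the 5-15 percent band. Even this share is only an upper bound on what
% schools themselves contribute, since intake differences are part of the
% between-school component.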

In short: PISA relies on “strong assumptions” (Fertig 2004) based on weak data (cf. e.g. Allerup, Langfeldt, Wuttke in this volume) that appeal to conventional wisdom (“education does matter, doesn’t it?”; “school structures make a difference, don’t they?”), but on almost no empirical and historical research supporting its implied causalities.

But this has not kept either PISA researchers or the public from using PISA as if such causal relations were given. Otherwise one could not explain the two main impacts which PISA has had on school administration and policy making in Austria and Germany, both directly referring to this frame of reference, albeit using it somewhat differently. Thus PISA’s approach to competency measuring has been a sweeping success in both countries, in part fuelled by the “national expertise” (Klieme et al. 2003) produced by researchers close to the PISA efforts, who argue that a national monitoring of student achievement based on an approach similar to PISA is both necessary and feasible (cf. Jahnke in this volume). Based on this, the German education ministers of the states have established a process towards such national standards and given a helping hand to the founding of a national institute for progress in education (Institut zur Qualitätsentwicklung im Bildungswesen) with functions similar to those of the US National Assessment of Educational Progress (NAEP). Both of these accomplishments are significant in Germany, given that curriculum matters are normally considered to be state, not federal, responsibilities, and that there was no previous tradition of state-run outcome controls (except for some standardizations of final exams in a few states). Prior to this point there had been more than 4000 different state curriculum guidelines that were the main road to defining expected results, without any regular control of whether they were achieved (as in the Nordic countries; cf. Hopmann 2003). Similarly, the Austrian government has initiated a not yet finished project to develop and implement national competency standards as an alternative to the former guidelines and to combine this with regular testing (cf. the material collected at the official site http://www.gemeinsamlernen.at). All this in spite of the fact that the impact of the use of national or state assessment on what PISA and similar projects measure is at best weak in either direction (cf. Amrein & Berliner 2002; Bishop & Woessmann 2004; Fuchs & Woessmann 2004), and that the overall importance of meeting the goals which PISA and the corresponding state standards happen to measure is at best a good guess without a solid research foundation.
foundation.<br />

Whereas this approach seems to have support across the whole political spectrum, the second impact has proven to be rather divisive: Based on PISA and similar studies, researchers and politicians have – as mentioned above – reopened the debate on school structures. Interestingly, both sides – proponents and opponents of a comprehensive system, proponents and opponents of an early school start, proponents and opponents of an integrated teacher education, etc. – feel themselves encouraged by the very same PISA data, which the other faction uses as well. The most prominent example of this is a national report on schooling, written by a number of “leading experts” (i.e. researchers utilizing PISA and the like) on behalf of the Confederation of Bavarian Industry (VBW), which – focused on equity issues – argues that there is ample research evidence for re-organizing the whole school system as a two-tier organization (cf. Aktionsrat Bildung 2007). On the other hand, the leader of the PISA effort at the OECD, Andreas Schleicher, is totally convinced that PISA proves the advantages of a comprehensive system. The leader of the national PISA effort in Austria, Günther Haider, managed first to support a continuation of the current stratified structures, then a transition towards a comprehensive system, in both cases claiming PISA data as evidence for his recommendations (cf. Bozkurt, Brinek & Retzl in this volume).

Public criticism of the empirical evidence provided by PISA has been weak in both countries. The devastating results were all too much in line with the political need to find good causes at the end of the economically painful reunification process in Germany, and at a time when both countries felt themselves economically underperforming compared to other European countries, not least those seemingly more successful in PISA, such as Finland. Even the scientific discourse took the economic reasoning backing PISA for granted, arguing that PISA reduced “Bildung” to economic necessities and the needs of globalization, thereby acknowledging the unfounded premises of PISA’s ability to monitor and guide the school curriculum (cf. e.g. Huisken 2005; Lohmann 2007). Except for the obvious case of the Austrian sampling problems (Neuwirth 2006), the few methodological objections that were voiced were either ignored or ridiculed by the PISA community and its supporters, and have had – at least up to now – no substantial impact on the public standing of PISA in either Austria or Germany (cf. the introduction to this volume).

The “no state left behind” approach of the OECD and its German and Austrian consorts leads to the somewhat paradoxical effect that PISA has the most impact by way of the by-products of the PISA research, which in design and methodology are most probably the weakest links of the whole enterprise. But even this is not without precedent. Re-reading Georg Picht’s volume on the “education catastrophe” (1964), which started the last “big school debate” in the mid-1960s, one is amazed how little of his evidence could stand the test of time and how much of it was simply speculative. However, the book was the single most important lever for the ensuing debate on how the school system should adapt to the social changes at the end of the post-war reconstruction period, a process which ended with the biggest expansion of the educational system and of public expenditure in the history of schooling. This process also included the temporary transfer of a tool-kit, scientific curriculum development, from the US – in spite of its then self-pronounced “moribund” state on its home turf (cf. Hopmann 1988). At least in Austria and Germany, PISA seems to have achieved something similar: to help politicians, educators and the public to get the educational field in touch with the transition processes going on in the whole public sector, by providing them with a sense of what “manageable expectations” might be, and with tools to monitor their success – or failure.

3 Comparative Accountability

Thus the overall picture of the accountability approaches I have reviewed shows three very different basic philosophies of what this transition is about (cf. fig. 1):



                           | No Child (NCLB)                  | No School (Nordic)                                   | No State (PISA)
Core Data                  | Student achievement              | Aggregated school achievement data                   | Aggregated national student achievement
Main Tools                 | Standards controlled by testing  | Testing with regard to opportunities to learn (OTL)  | Competencies measured by random testing
Stakes                     | High stakes                      | Low stakes                                           | No stakes (PISA); low or high stakes (standards)
Driving Force              | Blame                            | Community                                            | Competition
Main levels of attribution | Classroom & teaching             | Local school management                              | School systems/society at large
Spirit                     | Data-driven “best practice”      | Customized                                           | Research-based
Accountability             | Bottom up                        | Bridging the gap                                     | Top down
The role of PISA           | Almost none                      | Supporting act                                       | Main act

Fig. 1: Basics of the No Child, No School, No State Left Behind Strategies

Of course, this table only pinpoints the main assumptions and entry points of each approach. In the nature of public schooling, each approach carries elements of the others. Additional analysis of more countries would show that there are mixed patterns combining elements of different modes of accountability (e.g. the case of Switzerland would probably reveal a mixture of “no school” and “no canton” strategies, Canada a mixture of “no child” and “no school”, etc.; cf. e.g. BFS 2005; Rhyn 2007; Stack 2006; Klatt, Murphy & Irvine 2003; Ma & Crocker 2007). And there is a fourth pattern, where “no accountability has yet arrived”, and where the public sector is in the early stages of a transition towards accountability policies, and therefore not yet open to the influx of international accountability measures (as, for instance, in Italy, where PISA has been no real issue and even the government has treated it as almost non-existent; cf. Nardi 2004).



Emerging issues

Each strategy has its own strengths and carries its own risks, depending on the larger concept of expectation management it is a part of:

The “no child” approach has the advantage of a clear focus: Everybody knows what counts and how it is measured. But the price for this is what David Berliner (2005) calls “an impoverished view of educational reform”, a system of accountability checks which places “statistical significance” above all other ways of looking at individual and institutional achievements. The very narrow conceptualization lends itself to reduced remedial strategies: “evidence-based” or “best-practice” models, or “data-driven decision making”, only make sense if it is assumed that assessment data are all that counts and that local conditions do not play a significant role, or at least can be overcome if one does as the successful do. But while NCLB’s “blueprint” (US Department of Education 2007) honours a rather naïve empiricism, much of the NCLB-induced research provides more complex insights into the complexities of school life, using a whole range of mixed methods and avoiding the fallacies of an engineering approach to social transition (cf. e.g. Elmore 2006; O’Day 2007).

However, it seems unlikely that the insights produced by this research will have any lasting impact on the NCLB movement: the prime implication of this research – the importance of capacity-building in local schooling, with special regard to the unique mix of challenges at hand – is contrary to any belief that the same high stakes for everyone, and distributing blame and shame in large portions, is a reasonable approach to making accountability work.

Moreover, it is this one-sided focus which allows the transformation of the apparent problems with equity and equality – with minorities, special needs, gender etc. – into individualized liabilities, whose impact on achievement has to be minimized, if not eradicated. If high stakes as the sole approach to this fails (and research suggests that it will; cf. Linn 2007), there are a number of technical options to ease the burden, such as lowering the ceilings, adding opt-out clauses for the worst students and/or schools, inflating the number of stakes such that everybody can succeed in something, and not least creating more school choice and vouchers, which leaves the responsibility for choosing the right school with the parents. All of these options are under consideration in the current debate on the renewal of NCLB (cf. US Department of Education 2007). Choice is, of course, the core of a strategy of passing the basket on to the next in line of the accountability chain, i.e. the ones who seemingly bring the liabilities to school: the parents, the minorities, the poor, those with special needs, etc. We can expect more accountability tools, e.g. contractual attainment goals and/or connecting welfare subsidies or other sanctions with them, to make these families directly responsible for the outcomes. The achievement gap will not disappear with these moves; rather, what once was considered a failure of the school system to cope with the diversity of society (cf. e.g. Coleman et al. 1967) will be turned step by step into a problem of individual customers failing to meet rising expectations.

The “no school left behind” approach of Norway, and of most of the Nordic countries, is far away from such reductionism, but also pays a heavy price. The double task of embedding the new strategies in the traditional tool-kit, and of doing so in close co-operation with the local level, obviously leaves many wondering if there is a real change process going on, and if there is, what it consists of. No real sense of the new obligations has emerged in schools and municipalities, and it seems as if they respond to the new accountability expectations with classic Nordic “muddling through”: planning, coordinating and reporting on a local level time and again, with no real stakes and inconclusive outcomes (cf. Engeland, Roald & Langfeldt 2007; Elstad 2007). That the first national tests were a technical disaster (cf. Lie et al. 2005) and prompted a break in the whole process did not really help, nor did the new curriculum guidelines of 2006, which, in spite of much of the rhetoric, do not require more adjustments than earlier guidelines – i.e. most teaching does not change significantly as a result of their adoption, and the prime concern of teachers remains local adaptation, not national outcomes (cf. Bachmann & Sivesind 2007).

But this will not be the end of the story! The key question is what will happen if the current effort, which even from a Norwegian perspective is quite expensive, fails to achieve significant and sustainable gains beyond those which come as the system gets used to the new tools. Social and economic inequality are rising rapidly, and knowing how this can affect both schools and students, growing achievement disparities and gaps will be no surprise. Two response patterns seem likely: The first would be to move even more rapidly towards more radical accountability strategies, i.e. raising stakes, adding more national testing and, most importantly, adding sanctions for those who continue to fail. This would put tremendous pressure on the comprehensive school system: homogeneous schools without too many non-achievers will succeed and tell the public that the time of an all-encompassing school has come to an end. The other strategy would introduce more choice and private options into the system (as is already the case in Sweden and Denmark), thus allowing schools to remove their most challenging parents and most challenged children, leaving the public school as the main route for ‘average’ people (cf. Kvale 2007). Both strategies would imply a definite end of the “one school for all” notion. This idea is deeply engrained in the social fabric of Nordic societies, and the move will be no easy task (and will lead in Norway to a continuing back and forth over who is allowed to opt out, and why). But there are no other ways to reconcile the former management of places with the new needs of accountability, even if it takes time before the “muddling through” is forced to accept this consequence as inevitable.

None of this applies to the two leading examples of the “no state left behind” strategy, Austria and Germany. Both have fragmented school systems in which comprehensive schools play no significant role. Moreover, for the moment, both have easy access to knowing whether their school improvement works. All they seemingly have to do is wait for the next PISA wave; it will then be clear who has lost or won in the interstate competition (of course only if one believes that PISA is indeed able to tell something about that). The main risk lies in the deeply engrained traditions of how to deal with “big school debates”, because these traditions transform the achievement problem into one of school structure and other institutional change. The issues at stake are more or less the same in both countries (cf. Aktionsrat Bildung 2007; Retzl, Bozkurt & Brinek in this volume): Comprehensive schools or different tracks? Compulsory pre-school education, and if so for whom? Unified teacher education or different routes for different types of schooling? Special schools for special needs or inclusive education? Keeping the double structure of vocational education (school plus training on the job) or integrating vocational education into some general kind of upper-secondary schooling?

If the attempts to force a comprehensive re-structuring fail (and there is no empirical or political evidence indicating that this could turn out otherwise), then at least two possible outcomes are likely: The first would be to move towards a more NCLB-like approach to accountability, i.e. adding more stakes and tests (e.g. unified entrance and exit exams), including all levels and becoming more all-encompassing than is possible within PISA, i.e. requiring more data on single schools, school districts and the different federal states, eventually extending the screening beyond student achievement towards indicators of teaching patterns, teaching materials, teacher qualifications, student-career data and the like. But, at least in Germany, such an approach faces the obstacle that schooling is constitutionally a matter for the states, not the Federal government – which means that there are no means of enforcing alignment beyond that which all states agree upon. In Austria the Federal government has the necessary constitutional backing for federal involvement; but given that the country has, since its reconstruction after World War II, depended on compromise between the two largest political wings (the social democrats and the conservatives), each controlling about half of the states, it is unlikely that any lasting agreement on a comprehensive accountability approach is feasible. This brings the second option to the forefront, namely to dissolve, or embed, the national accountability measures in an internal competition between the different federal states. Those confident of their success would prove the advantages of their chosen solutions with their own data; those not meeting the standards would have to answer with their own explanations of why a mismatch was unavoidable. In the end, there would be an inextricable hodgepodge of testing, controlling, monitoring etc., with each state having its own tool-kit of accountability measures.

But at least two problems would be left behind by either option: On the one hand, none of this addresses the problems of sustainable inner-school development and capacity-building over the long run. A race to match changing expectations by restructuring the system will not leave energy and resources to address the tricky problems of “no teaching left behind” (cf. Terhart 2005). Secondly, both approaches would lead to a further marginalization of the special-needs students who are already invisible, or turned into liabilities, in the PISA approach (cf. Hörmann 2007 and in this volume). They promise to become even more marginalized inasmuch as they don’t help to win a competition that has individualized academic achievement as its basic rationale (cf. Hopmann 2007). Each and every new round of testing would only reaffirm their “lower” abilities and the “superiority” of the schools dealing with high achievers, thus petrifying the hierarchy of schools. Given that this hierarchy has always been experienced as an expression of the social fabric of society, and of the state’s position towards it, one can only imagine how rapidly this will lead to inner tensions in a school system surrounded by a society with rapidly growing social inequalities.

PISA in Transition

Most of the emerging issues stem from inner tensions between the former management of placement and the new expectation regime. The data-driven NCLB disintegrates the former “place called school” (Goodlad) into concurrent, but not intertwined, individual challenges of meeting the standards. The “no school left behind” approach has difficulties in embracing a coherent set of expectations, as it dissolves the idea of accountability with its old routines of institutionalized muddling-through. The “no state left behind” strategy is at risk of answering the new expectations by functionalizing them for a renewal of the conventional restructuring game, without really changing what is going on inside schools and classrooms.

The success of PISA within this transition is made possible by a certain “fuzziness” of design and self-presentation. It treats the links between student, school and national achievements as evident, thus allowing for a black-box approach to schooling itself (economists call that a production function) in which the coincidence of results and factors is transformed into correlations and causalities, without proving how exactly this linearity comes into being.
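In its simplest textbook form (sketched here only to unpack the economists’ shorthand, not as a description of PISA’s actual estimation machinery), such a production function treats measured achievement as the output of a black box of inputs:

% Generic education production function (editorial sketch; the symbols are
% placeholders, not variables taken from the PISA framework):
\[
A_{is} = f(S_s, F_i, P_i) + \varepsilon_{is}
\]
% A_{is}: tested achievement of student i in school s; S_s: school inputs;
% F_i: family background; P_i: peer and individual factors;
% epsilon_{is}: everything unobserved.

The critique above is precisely that the co-occurrence of the left- and right-hand sides is read off as if f were known and causal, without showing how that linearity comes into being.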

Within a management of placements, PISA and the national testing inspired by it would be dysfunctional, in that it covers only a few aspects of schooling, and these in a way which does not allow for research-based decision-making concerning the whole school, or even teaching and learning under given conditions. As a tool of expectation management, however, PISA allows each setting to address problems as they are framed within the respective constitutional mindsets, using PISA as “evidence”. Thus PISA refreshes the never-ending dispute in Germany and Austria on school structures and their relation to social class and diversity, reinvigorates the co-dependency of national government and local community in the Nordic countries, and reaffirms the starting point of the NCLB discourse on failing schools and teachers as the main culprits for the uneven distribution of knowledge and cultural capital in Western societies.

The irony of this story is, of course, that PISA achieves this not in spite of, but because of, its shortcomings. Although it uses advanced statistical tools, PISA stays methodologically within the frame of a pre-Popperian positivism, which takes item responses for the realities addressed by them. There is no theory of schooling or of the curriculum which would allow for a non-affirmative stance towards the policy-driven expectations which, according to the OECD, “determine” “the design and reporting methods” of PISA (OECD 2007). There is no systematically embedded concept of how as yet unheard voices and non-standardized needs could be recognized as equally valid expressions of what schooling is about. Accordingly, none of the newer developments in educational research addressing the situatedness, multi-perspectivity, non-linearity or contingency of social action plays a significant role in PISA’s design (cf. Hopmann 2007). Even so, there are many quite advanced options for using PISA data within mixed-methods or other more comprehensive research designs, which could address some of PISA’s inherent weaknesses as well (cf. Olsen in this volume). But to incorporate such developments on a large scale would be close to impossible: they do not lend themselves to such generalized bottom lines as league tables, and to include them in a large-scale study of the size of PISA would require resources far beyond those available even to PISA.

As an entry to the commencing accountability transition, PISA has done a significant job in facilitating and illustrating the difficulties any approach to these issues will have to face. But it might be that the PISA frenzy has already reached its peak, or is very close to doing so (the next wave of results, coming in December 2007, will show whether this is the case). If the PISA frenzy is drawing to a close, however, it will not be because of the technical mishaps and fallacies discussed in this volume. Such details go unnoticed by politicians and the public. If PISA loses its unique position, it will happen because of its success, because of the multiplying of PISA-like tools in national and state accountability programmes. If the NCLB experience holds true, PISA will be reduced to being just one voice in the polyphonic concert of assessment results, and – having no sanctions other than statistical blame – will be overcome by accountability measures that carry more immediate risks for those involved. The important question for the future of educational research is how much of PISA will then be left behind: to what extent will its methodological reductionism prevail as the state of the art of comparative research? But the more pressing question centers on the long-term effect its conceptualization of student achievement will have on the public understanding of what schooling is about. What will happen to the school subjects left out, to the special needs that are marginalized, to school tasks which have nothing to do with higher-order academic achievement, to the school functions which move beyond a one-dimensional kind of knowledge distribution? Perhaps there are new, not yet seen possibilities hidden in the multiple realities of the transition from the management of placement towards the management of expectations, even some which make research, policy and schooling accountable for not leaving their social conscience behind on their march into the emerging age of accountability.
their social conscience behind on their march in<strong>to</strong> the emerging age of accountability.


References


Abbott, A.: The System of Professions. Chicago (University of Chicago Press) 1988.
Achieve: Aiming higher: 1998 annual report. Cambridge (Achieve, Inc.) 1998.
Ahearn, E.M.: Educational Accountability: A Synthesis of Literature and Review of a Balanced Model of Accountability. Washington D.C. (Department of Education) 2000.
Aikin, W.M. et al.: The Eight Year Study. Vol. I–V. New York/London (Harper & Brothers) 1942.
Akerstrøm Andersen, N.: Borgerens kontraktliggørelse. Kopenhagen (Reitzel) 2003.
Aktionsrat Bildung: Bildungsgerechtigkeit. Jahresgutachten 2007. Wiesbaden (VS) 2007.
Allerup, P.: PISA præstationer – målinger med skæve målestokke? In: Dansk Pædagogisk Tidsskrift 2005-1, 68-81.
Allerup, P.: PISA 2000’s læseskala – vurdering af psykometriske egenskaber for elever med dansk og ikke-dansk sproglig baggrund. Odense (Rockwool Fondens Forskningsenhed og Syddansk Universitetsforlag) 2006.
Allerup, P.: Identification of Group Differences Using PISA Scales – Considering Effects of Inhomogeneous Items. In this volume.
Ålvik, T. (ed.): Skolebasert vurdering – en artikkelsamling. Oslo (Ad notam) 1994.
Amrein, A.L. & Berliner, D.C.: High-stakes testing, uncertainty, and student learning. Education Policy Analysis Archives 10-2002-18. Online: http://epaa.asu.edu/epaa/v10n18 (10.03.2007).
Apple, M.: Ideological Success, Educational Failure? On the Politics of No Child Left Behind. In: Journal of Teacher Education 58-2007-2, 108-116.
Bachmann, K., Sivesind, K. & Hopmann, S.T.: Hvordan formidles læreplanen. Kristiansand (Høyskoleforlag) 2004.
Bachmann, K. & Sivesind, K.: Regn med meg! Evaluering og ansvarligjøring i skolen. In: Langfeldt, G., Elstad, E. & Hopmann, S.T. (eds.): Ansvarlighet i skolen. Oslo (Cappelen) 2007 (forthcoming).
Bartlett, W., Roberts, J.A. & Le Grand, J.: A revolution in social policy: Quasi-market reforms in the 1990s. Bristol (Policy Press) 1998.
Baumert, J., Stanat, P. & Watermann, R. (Hrsg.): Herkunftsbedingte Disparitäten im Bildungswesen. Vertiefende Analysen im Rahmen von PISA 2000. Wiesbaden (VS) 2004.



Beck, U.: Weltrisikogesellschaft. Frankfurt (Suhrkamp) 2007.
Beck, U., Giddens, A. & Lash, S.: Reflexive Modernisierung. Frankfurt (Suhrkamp) 1996.
Benner, D.: Die Struktur der Allgemeinbildung im Kerncurriculum moderner Bildungssysteme. Ein Vorschlag zur bildungstheoretischen Rahmung von PISA. In: Zeitschrift für Pädagogik 48-2002-1, 68-90.
Benoit, W.L.: Accounts, Excuses and Apologies: A Theory of Image Restoration. Albany (SUNY) 1995.
Berliner, D.C.: Our impoverished view of educational reform. In: Teachers College Record 2005. Online: http://www.tcrecord.org, ID no. 12106 (2007/07/07).
Birkeland, N.: Ansvarlig, jeg? Accountability på norsk. In: Langfeldt, G., Elstad, E. & Hopmann, S.T. (eds.): Ansvarlighet i skolen. Oslo (Cappelen) 2007 (forthcoming).
Bishop, J.H. & Woessmann, L.: Institutional Effects in a Simple Model of Educational Production. In: Education Economics 12-2004-1, 17-38.
Bloom, B.S. (ed.): Taxonomy of Educational Objectives, the classification of educational goals – Handbook I: Cognitive Domain. New York (McKay) 1956.
Bodin, A.: What does PISA really assess? What it doesn’t? A French view. Report prepared for the joint Finnish-French conference “Teaching mathematics: Beyond the PISA survey”, Paris 2005.
Bodin, A.: What does PISA really assess? What it doesn’t? A French view. In this volume.
Bogason, P. (ed.): New Modes of Local Political Organization: Local Government Fragmentation in Scandinavia. Commack (Nova Sciences) 1996.
Bracey, G.W.: Research: Put Out Over PISA. In: Phi Delta Kappan 86-2005-10, 797.
Braun, H.: Reconsidering the impact of high-stakes testing. Education Policy Analysis Archives 12-2004-1. Online: http://epaa.asu.edu/epaa/v12n1/ (2006/01/20).
Buschor, E. & Schedler, K. (eds.): Perspectives on Performance Measurement and Public Sector Accounting. Bern (Haupt) 1994.
Cannell, J.J.: Nationally Normed Elementary Achievement Testing in America’s Public Schools: How All 50 States are Above National Average. Daniels (Friends of Education) 1987.



Caswell, H.L.: City School Surveys: An Interpretation and Analysis. New York (Teachers College) 1929.
Chubb, J.E. (ed.): Within Our Reach: How America Can Educate Every Child. Lanham (Rowman & Littlefield) 2005.
Coleman, J.S. et al.: Equality of Educational Opportunity. Washington (U.S. Department of Health, Education and Welfare) 1966.
Conant, J.B.: The American High School Today; A First Report to Interested Citizens. New York (McGraw Hill) 1959.
Cook, T.D.: Lessons Learned in Evaluation Over the Past 25 Years. In: Chelimsky, E. & Shadish, W.R. (eds.): Evaluation for the 21st Century. Thousand Oaks, London, New Delhi (Sage Publications) 1997, 30-52.
Darling-Hammond, L.: Standards and Assessment: Where We Are and What We Need. Teachers College Record 16-2003-2.
Deretchin, L.F. & Craig, C.J. (eds.): International Research on the Impact of Accountability Systems (Teacher Education Yearbook XV). Lanham (Rowman & Littlefield) 2007.
Dewey, J.: Democracy and Education (1916). Online: http://www.ilt.columbia.edu/publications/dewey.html (2007/01/07).
Dohn, N.B.: Knowledge and Skills for PISA – Assessing the Assessment. In: Journal of Philosophy of Education 41-2007-1, 1-16.
Dolin, J.: PISA – an Example of the Use and Misuse of Large-scale Comparative Tests. In this volume.
Dorn, S.: The Political Legacy of School Accountability Systems. In: Education Policy Analysis Archives 6-1998-1. Online: http://epaa.asu.edu/epaa/v6n1/ (2007/03/02).
Dubnick, M.J.: Accountability and the Promise of Performance. Paper presented at the 2003 Annual Meeting of the American Political Science Association, Philadelphia.
Dubnick, M.J. & Justice, J.B.: Accounting for Accountability. Paper presented at the Annual Meeting of the American Political Science Association 2004. Online: http://pubpages.unh.edu/dubnick/papers/2004/dubjusacctg2004.pdf (2007/07/07).
Dubnick, M.J.: Orders of Accountability. Paper presented at the World Ethics Forum in Oxford 2006. Online: http://pubpages.unh.edu/dubnick/papers/2006/oxford2006.pdf (2007/07/07).
Eberts, R., Hollenbeck, K. & Stone, J.: Teacher Performance Incentives and Student Outcomes. The Journal of Human Resources 37-2002-4, 913-927.
Education Commission of the States (ECS): No Child Left Behind Issue Brief: Data-Driven Decisionmaking. 2002. Online: http://www.nsba.org/site/docs/9200/9153.pdf (2007/07/07).
Elmore, R.F.: School Reform From the Inside Out. Cambridge (Harvard University Press) 2006.
Elstad, E.: Hvordan forholder skoler seg til ansvarliggjøring av skolens bidrag til elevenes læringsresultater? In: Langfeldt, G., Elstad, E. & Hopmann, S.T. (eds.): Ansvarlighet i skolen. Oslo (Cappelen) 2007 (forthcoming).
Elstad, E. & Langfeldt, G.: Hvordan forholder skoler seg til målinger av kvalitetsaspekter ved lærernes undervisning og elevenes læringsprosesser? In: Langfeldt, G., Elstad, E. & Hopmann, S.T. (eds.): Ansvarlighet i skolen. Oslo (Cappelen) 2007 (forthcoming).
Engeland, Ø.: Skolen i kommunalt eie – politisk styrt eller profesjonell ledet skoleutvikling? Avhandling til dr. polit.-graden. Oslo (Det utdanningsvitenskapelige fakultet, Universitetet i Oslo) 2000.
Engeland, Ø., Langfeldt, G. & Roald, K.: Kommunalt handlingsrom – hvordan møter norske kommuner ansvarsstyring i skolen? In: Langfeldt, G., Elstad, E. & Hopmann, S.T. (eds.): Ansvarlighet i skolen. Oslo (Cappelen) 2007 (forthcoming).
Esping-Andersen, G. (ed.): Welfare States in Transition. National Adaptations to Global Economies. London (Sage) 1996.
Evers, A., Rauch, U. & Stitz, U. (eds.): Von öffentlichen Einrichtungen zu sozialen Unternehmen. Hybride Organisationsformen im Bereich sozialer Dienstleistungen. Berlin (Edition Sigma) 2002.
Fertig, M.: What Can We Learn From International Student Performance Studies? Some Methodological Remarks. RWI Discussion Papers No. 23. Essen (RWI) 2004.
Foucault, M.: Geschichte der Gouvernementalität 1: Sicherheit, Territorium, Bevölkerung. Vorlesung am Collège de France 1977/1978. Frankfurt (Suhrkamp) 2006.
Fuchs, H.-W.: Auf dem Wege zu einem neuen Weltcurriculum? Zum Grundbildungskonzept von PISA und der Aufgabenzuweisung an die Schule. In: Zeitschrift für Pädagogik 49-2003-2, 161-179.
Fuchs, T. & Woessmann, L.: What Accounts for International Differences in Student Performance? A Re-Examination Using PISA Data. Bonn (IZA) 2004.

Fuhrmann, S. (ed.): From the Capi<strong>to</strong>l <strong>to</strong> the Classroom: Standards-Based Reform<br />

in the States. Vol. I & II. Chicago (National Society for the Study of<br />

Education Yearbooks) 2001.<br />

Fukuyama, F.: The End of His<strong>to</strong>ry and the Last Man. New York (Free Press)<br />

1992.<br />

Goodlad, J.I.: A Place Called School. New York (McGraw-Hill) 1983.<br />

Gorard, S.: Value-Addedis of Little Value. In: Journal of Education Policy 21-<br />

2006-2, 235-243.<br />

Gottweis, H. et al.: Verwaltete Körper. Strategien der Gesundheitspolitik im<br />

internationalen Vergleich. Wien (Böhlau) 2005.<br />

Granheim, M.; Kogan, M. & Lundgren, U.P. (eds.): Evaluation as Policymaking:<br />

Introducing Evaluation In<strong>to</strong> a National Decentralized Educational<br />

System. London (Jessica Kingsley Publishers) 1990.<br />

Grisay, A. & Monseur, C.: Measuring the Equivalence of Item Difficulty in<br />

the Various Versions of an International Test. In: Studies in Educational<br />

Evaluation 33-2007-1, 69-86.<br />

Gundem, B. B. & Hopmann, S. (eds.): Didaktik and/or Curriculum: An International<br />

Dialogue. New York, Bern etc. (Lang) 2002 2 .<br />

Gundem, B.B.: Læreplanadministrering: fremvækst og utvikling i et<br />

sentraliserings-desentraliseringsperspektiv. Oslo (UiO/PFI) 1992.<br />

Gundem, B.B.: Mot en ny skolevirkelighet? Læreplanen i et sentraliseringsdesentraliseringsperspektiv.<br />

Oslo (Ad Notam) 1993.<br />

Gundem, B.B.: Læreplanhis<strong>to</strong>rie <strong>–</strong> his<strong>to</strong>rien om skolens innhold <strong>–</strong> som forskningsfelt:<br />

en innføring og noen eksempler. Oslo (UiO/PFI) 1997.<br />

Haft, H./Hopmann, S.T. (eds.): Case Studies in Curriculum Administration<br />

His<strong>to</strong>ry. London/New York 1990.<br />

Haney, W. (2000): The Myth of the Texas Miracle in Education. Eductional<br />

Policy Analysis Archives 8-2000-41. Online: http://epaa.asu.edu/epaa/<br />

v8n41/ (2007/07/07).<br />

Haney, W.: Lake Wobegon Guaranteed. Educational Policy Analysis Archives<br />

10-2002-24. Online: http://epaa.asu.edu/epaa/v10n24/ (2007/07/07)<br />

Hanushek, E.A. & Raymond, M.E. (2003): Lessons about the Design of State<br />

Accountability Systems. In: No Child Left Behind. Petersen, P. & West,<br />

M.R. Hrsg. (Brookings) Washing<strong>to</strong>n, 127-151<br />

Hanushek, E.A. & Raymond, M.E.: Does School Accountability Lead <strong>to</strong> Im-


406 STEFAN T. HOPMANN<br />

proved Student Performance? In: Journal of Policy Analysis and Management<br />

24-2005-2, 297-327.<br />

Hanushek, E.A.: The Failure of Input-based Schooling Policies. Working Paper<br />

9040. Cambridge, MA (National Bureau of Economic Research).<br />

2002.<br />

Hanushek, E.A.: Alternative School Policies ad the Benefits of General Cognitive<br />

Skills. In: Economics of Education Review 25-2006-4, 447-462.<br />

Hanushek, E.A.: The Long Run Importance of School Quality. NBER Working<br />

Paper No. 9071. Cambridge, MA (NBER) 2002.<br />

Hargreaves, A.: Teaching in the Knowledge Society: Education in the Age of<br />

Insecurity. New York, NY (Teachers College Press) 2003.<br />

Haug, P. & Monsen, L. (eds.): Skolebasert vurdering: erfaringer og utfordringer.<br />

Oslo (abstract) 2002.<br />

Herman, J.L. & Haertel, E.H. Hrsg.: Uses and Misuses of Data for Educational<br />

Accountability and Improvement (The 104 th Yearbook of NSSE Part 2).<br />

Malden (Blackwell) 2005.<br />

Hood, C.: A Public Management for all Seasons. Public Administration 69-<br />

1991-1, 3-20.<br />

Hood, C.: Contemporary Public Management: A New Global Paradigm? Public<br />

Policy and Administration. 10-1995-2, 104-117.<br />

Hood, C.: Institutions, Blame Avoidance and Negativity Bias: Where Public<br />

Management Reform Meets the Blame Cuulture. Paper presented at the<br />

CMPO Conference on Public Organisation and the New Public Management.<br />

Bris<strong>to</strong>l 2004.<br />

Hood, C., Rothstein, H. & Baldwin, R.: The Government of Risk: Understanding<br />

Risk Regulation Regimes. Oxford (University Press) 2004.<br />

Hopmann, S.T.: Lehrplanarbeit als Verwaltungshandeln (Curriculum making<br />

as administration). Kiel (IPN) 1988.<br />

Hopmann, S.T.: Lehrplan des Abendlandes <strong>–</strong> am Ende seiner Geschichte?<br />

Geschichte der Lehrplanarbeit und des Lehrplans seit 1900. (The curriculum<br />

of the occident <strong>–</strong> at the end of its his<strong>to</strong>ry? Curriculum development<br />

and the curriculum since 1800). In: Keck, Rudolf et al. (eds.): Lehrplan<br />

des Abendlandes <strong>–</strong> revisited. (Hohengrefe) Braunschweig 2000a.<br />

Hopmann, S.T.: Die Schule von morgen <strong>–</strong> Entwicklungsperspektiven für<br />

einen nachhaltigen Unterricht. In: Die Schweizer Schule 2000-3, 13-19.<br />

(2000b)<br />

Hopmann, S.T.: Von der gutbürgerlichen Küche zu McDonald’s: Beabsichtigte


EPILOGUE: NO CHILD, NO SCHOOL, NO STATE LEFT BEHIND 407<br />

und unbeabsichtigte Folgen der Internationalisierung der Erwartungen<br />

an Schule und Unterricht. In: Keiner, E. (Hrsg.): Evaluation in den<br />

Erziehungswissenschaften. Weinheim (Beltz) 2001, 207-224.<br />

Hopmann, S.T.: On the Evaluation of Curriculum Reforms. In: Journal of Curriculum<br />

Studies 2003-4, 459-478.<br />

Hopmann S.T.: Im Durchschnitt Pisa oder: Alles bleibt schlechter. In Criblez,<br />

L. et al. (eds) Lehrpläne und Bildungsstandards. Bern (hep) 2006, 149-<br />

172<br />

Hopmann, S.T.: Keine Ausnahme für Hotten<strong>to</strong>tten. Methoden der vergleichenden<br />

Bildungswissenschaften für die heilpädagogische Forschung.<br />

In: Biewer, G. & Schwinge, M. (Hrsg.): Internationale Sonderpädagogik.<br />

Bad Heilbrunn: Klinkhardt (in print).<br />

Hörmann, B. (2007): Die Unsichtbaren in <strong>PISA</strong>, TIMSS & Co. Diplomarbeit.<br />

Wien: Institut für Bildungswissenschaft der Universität Wien<br />

Hörmann, B.: Disappearing Students. <strong>PISA</strong> and students with disabilities. In<br />

this volume.<br />

Hovdenak, S.S.: 90-tallsreformene – et instrumentalistisk mistak? Oslo (Gyldendal Akademisk) 2000.
Huisken, F.: Der “PISA-Schock” und seine Bewältigung – Wieviel Dummheit braucht/verträgt die Republik? Hamburg (VSA-Verlag) 2005.
Ingersoll, R.: Who Controls Teachers’ Work? Power and Accountability in America’s Schools. Cambridge (Harvard University Press) 2006.
Irjala, A. & Eikås, M.: State Culture and Decentralization: a Comparative Study of Decentralization Processes in Nordic Cultural Politics. Sogndal & Helsinki (Western Norway Research Institute/Arts Council of Finland) 1996.
Irons, J.E. & Harris, S.: The Challenges of No Child Left Behind. Blue Ridge Summit (Rowman & Littlefield) 2006.
Isaksen, L.: Skoler i gapestokken. In: Langfeldt, G., Elstad, E. & Hopmann, S.T. (eds.): Ansvarlighet i skolen. Oslo (Cappelen) 2007 (forthcoming).
Jahnke, T.: Deutsche Pisa-Folgen. In this volume.
Jahnke, T. & Meyerhöfer, W. (eds.): PISA & Co – Kritik eines Programms. Hildesheim (Franzbecker) 2006.
Karlsen, G.E.: Desentralisering – løsning eller oppløsning: søkelys på norsk skoleutvikling og utdanningspolitikk. Oslo (Ad notam) 1993.
Karlsen, G.E.: EU, EØS og utdanning. Oslo (Tano) 1994.
Kirke-, utdannings- og forskningsdepartementet (KUF): Underveis: Håndbok i skolebasert vurdering: grunnskole og videregående skole. Oslo (KUF) 1994.
Kivirauma, J., Klemelä, K. & Rinne, R.: Segregation, Integration, Inclusion – The Ideology and Reality in Finland. In: European Journal of Special Needs Education 21-2006-2, 117-133.
Klatt, B., Murphy, S. & Irvine, D.: Accountability: Getting a Grip on Results. Calgary (Bow River) 2003².
Klieme, E. et al.: Expertise zur Entwicklung nationaler Bildungsstandards. Berlin (BMBF) 2003. Online: http://www.bmbf.de/pub/zur_entwicklung_nationaler_bildungsstandards.pdf (2007/07/07).
Koretz, D.: Limitations in the Use of Achievement Tests as Measures of Educators’ Productivity. In: The Journal of Human Resources 37-2002-4, 752-777.
Koritzinsky, T.: Pedagogikk og politikk i L97: Læreplanens innhold og beslutningsprosessene. Oslo (Universitetsforlaget) 2000.
Korsgaard, O.: Kampen om folket: et dannelsesperspektiv på dansk historie gennem 500 år. Copenhagen (Gyldendal) 2004.
KUF: Rapport om nasjonalt vurderingssystem (Moe-utvalget). Forslag fra utvalg oppnevnt av Kirke-, Utdannings- og Forskningsdepartementet. Oslo (KUF) 1997.
Künzli, R. & Hopmann, S.T. (eds.): Lehrpläne: Wie sie entwickelt werden und was von ihnen erwartet wird. Forschungsstand, Zugänge und Ergebnisse aus der Schweiz und der Bundesrepublik Deutschland. Zürich (Ruegger) 1998.
Kvale, G.: “Det er ditt val!” – om fritt skuleval i to norske kommunar. In: Langfeldt, G., Elstad, E. & Hopmann, S.T. (eds.): Ansvarlighet i skolen. Oslo (Cappelen) 2007 (forthcoming).
Ladd, H.F. & Walsh, R.P.: Implementing Value-added Measures of School Effectiveness. In: Economics of Education Review 21-2002-1, 1-17.
Ladenthin, V.: Bildung als Aufgabe der Gesellschaft. In: Studia Comeniana et Historica 34-2004-71/72, 305-319.
Laffont, J.-J. (ed.): The Principal Agent Model: The Economic Theory of Incentives. Cheltenham (Edward Elgar Publishing) 2003.
Lamar, A. & Thomas, J.A.: The Nation’s Report Card: Improving the Assessment of Student Achievement. Stanford, CA (National Academy of Education) 1987.
Lange, S. & Schimank, U. (eds.): Governance und gesellschaftliche Integration. Opladen (VS) 2004.
Langfeldt, G.: Resultatstyring som verktøy og ideologi. Statlige styringsstrategier i utdanningssektoren. In: Langfeldt, G., Elstad, E. & Hopmann, S.T. (eds.): Ansvarlighet i skolen. Oslo (Cappelen) 2007 (forthcoming).
Langfeldt, G.: PISA – Undressing the Truth or Dressing Up a Will to Govern. In this volume.
Langfeldt, G., Elstad, E. & Hopmann, S.T. (eds.): Ansvarlighet i skolen. Oslo (Cappelen) 2007 (forthcoming).
Leibfried, S. & Zürn, M.: Transformation des Staates. Frankfurt (Suhrkamp) 2006.
Lie, S. et al.: Nasjonale prøver på ny prøve. Rapport fra en utvalgsundersøkelse for å analysere og vurdere kvaliteten på oppgaver og resultater til nasjonale prøver våren 2005. Oslo (UiO, ILS) 2005.
Lindblad, S., Johannesson, I. & Simola, H.: Education governance in transition. In: Scandinavian Journal of Educational Research 2003-2.

Linn, R.L. & Haug, C.: Stability of School Building Accountability Scores and Gains. In: Educational Evaluation and Policy Analysis 24-2002-1, 29-36.
Linn, R.L.: Assessments and Accountability. In: Educational Researcher 29-2000-2, 4-16.
Linn, R.L., Baker, E.L. & Betebenner, D.W.: Accountability Systems: Implications of Requirements of the No Child Left Behind Act of 2001. In: Educational Researcher 31-2002-6, 3-16.
Linn, R.L.: Issues in the Design of Accountability Systems. In: Herman, J.L. & Haertel, E.H. (eds.): Uses and Misuses of Data for Educational Accountability and Improvement (The 104th Yearbook of the NSSE, Part 2). Malden (Blackwell) 2005, 78-98.
Lohmann, I.: After Neoliberalism. Können nationalstaatliche Bildungssysteme den ‚freien Markt‘ überleben? 2001. Online: http://www.erzwiss.uni-hamburg.de/Personal/Lohmann/AfterNeo.htm (2007/07/07).
Lohmann, I.: Was bedeutet eigentlich “Humankapital”? GEW Bezirksverband Lüneburg und Universität Lüneburg: Der brauchbare Mensch. Bildung statt Nützlichkeitswahn. Bildungstage 2007. Online: http://www.erzwiss.uni-hamburg.de/Personal/Lohmann/Publik/Humankapital.pdf (2007/07/07).
Loveless, T.: The Peculiar Politics of No Child Left Behind. Washington (Brookings) 2006.
Luhmann, N.: Die Gesellschaft der Gesellschaft. Frankfurt (Suhrkamp) 1998.
Marsh, J., Pane, J. & Hamilton, L.: Making Sense of Data-Driven Decision Making in Education: Evidence from Recent RAND Research. Santa Monica (RAND) 2006.
Martineau, J.A.: Distorting Value Added: The Use of Longitudinal, Vertically Scaled Student Achievement Data for Growth-Based, Value-Added Accountability. In: Journal of Educational and Behavioral Statistics 31-2006-1, 35-62.
McNeil, L.: Contradictions of School Reform: Educational Costs of Standardized Testing. New York (Routledge) 2000.
Mediås, O.A. & Telhaug, A.O.: Fra sentral til desentralisert styring: statlig og regional styring av utdanningen i Skandinavia fram mot år 2000. Steinkjer (Projekt: Utdanning som nasjonsbygging) 2000.
Mehrens, W.A.: Consequences of Assessment: What is the Evidence? In: Education Policy Analysis Archives 6-1998-13. Online: http://epaa.asu.edu/epaa/v6n13.html (2007/07/07).
Mejding, J. & Roe, A. (eds.): Northern Lights on PISA 2003 – A Reflection from the Nordic Countries. Copenhagen (Nordic Council) 2006.
Melton, J. Van Horn: Absolutism and the Eighteenth-Century Origins of Compulsory Schooling in Prussia and Austria. Cambridge (University Press) 1988.
Meyer, J.W.: Weltkultur. Wie die westlichen Prinzipien die Welt durchdringen. Frankfurt (Suhrkamp) 2005.
Meyerhöfer, W.: Tests im Test. Das Beispiel PISA. Opladen (Budrich) 2005.
Meyerhöfer, W.: Testfähigkeit – Was ist das? In this volume.
Micklewright, J. & Schnepf, S.S.: Educational Achievement in the English Speaking Countries: Do Different Surveys Tell the Same Story? Bonn (IZA) 2004.
Micklewright, J. & Schnepf, S.S.: Inequality of Learning in Industrialised Countries. Bonn (IZA) 2006.
Mintrop, H.: The Limit of Sanctions in Low-Performing Schools: A Study of Maryland and Kentucky Schools on Probation. In: Education Policy Analysis Archives 11-2003-3. Online: http://epaa.asu.edu/epaa/v11n3.html (2007/07/07).
MMI (Markeds- og mediainstituttet AS): Evaluering av gjennomføring av de nasjonale prøvene i 2005. Online: http://www.utdanningsdirektoratet.no/eway/library/forms/showmessage.aspx?oid=338 (2007/07/07).
Møller, J.: Coping with Accountability – A Tension between Reason and Emotion. In: Passionate Principalship: Learning from Life Histories of School Leaders. London (Falmer) 2003.
Muir Gray, J.A.: Evidence Based Health Care. Oxford (Elsevier) 2001².
National Commission on Excellence in Education: A Nation at Risk: The Imperative for Educational Reform. Washington, DC (U.S. Government Printing Office) 1983.
Nesje, K. & Hopmann, S.T. (eds.): En lærende skole: L97 i Skolepraksis. Oslo (Cappelen) 2002.
Neuwirth, E., Ponocny, I. & Grossmann, W. (eds.): PISA 2000 und PISA 2003. Graz (Leykam) 2006.
Neuwirth, E.: PISA 2000. Sample Weight Problems in Austria. OECD Education Working Papers No. 5. Paris (OECD) 2006.
NOU 1988:22: Med viten og vilje. White paper commissioned by the Norwegian Government. Oslo 1988.
NOU 2002:10: Førsteklasses fra første klasse. White paper commissioned by the Norwegian Government. Oslo 2002.
NOU 2003:16: I første rekke. White paper commissioned by the Norwegian Government. Oslo 2003.
OECD: Reviews of National Policies for Education: Norway. Paris (OECD) 1987.
OECD: Public Management Developments. Paris (OECD) 1995.
OECD: Knowledge and Skills for Life. First Results from PISA 2000. Paris (OECD) 2001.
OECD: The PISA 2003 Assessment Framework. Paris (OECD) 2003.
OECD: Education at a Glance. Paris (OECD) 2005. Online: http://www.oecd.org/document/34/0,2340,en_2649_34515_35289570_1_1_1_1,00.html (2007/07/07).
Olsen, R.V.: Achievement Tests From an Item Perspective. An Exploration of Single Item Data from the PISA and TIMSS studies. Thesis (University of Oslo) Oslo 2005. Online: http://www.duo.uio.no/publ/realfag/2005/35342/Rolf_Olsen.pdf (2007/07/07).
Olsen, R.V.: Large-scale international comparative achievement studies in education: Their primary purposes and beyond. In this volume.
Peterson, P.E. & West, M.R. (eds.): No Child Left Behind. The Politics and Practice of School Accountability. Washington (Brookings) 2003.
Picht, G.: Die deutsche Bildungskatastrophe. Olten (Walter Verlag) 1964.
PISA 2006: PISA – The OECD Programme for International Student Assessment. Leaflet produced by the OECD in 2006. Online: http://www.pisa.oecd.org/dataoecd/51/27/37474503.pdf (2007/07/07).
Pollitt, C. & Bouckaert, G.: Public Management Reform. A Comparative Analysis. Oxford (University Press) 2004².
Power, M.: The Audit Society: Rituals of Verification. Oxford (University Press) 1997.
Prahl, A. & Olsen, C.B.: Lokalsamfundet som samarbejdspartner: sammenhænge mellem decentralisering og lokalsamfundsudvikling i de nordiske lande. Copenhagen (Nordisk Ministerråd) 1997.
Prais, S.J.: Cautions on OECD’s Recent Educational Survey (PISA). In: Oxford Review of Education 29-2003-2, 139-163.
Prais, S.J.: Cautions on OECD’s Recent Educational Survey (PISA): Rejoinder to OECD’s Response. In: Oxford Review of Education 30-2004-4, 377-389.
Prais, S.J.: England: Poor Survey Response and No Sampling. In this volume.
Rasmussen, J.: Undervisning i det refleksivt moderne. Copenhagen (Reitzel) 2006.
Rauin, U.: Die Pädagogik im Bann empirischer Mythen – Wie aus empirischen Vermutungen scheinbare pädagogische Gewissheit wird. In: Pädagogische Korrespondenz 2004-32, 39-49.
Rickover, H.G.: American Education – a National Failure: The Problem of Our Schools and What We Can Learn from England. New York (E. P. Dutton) 1963.
Riksrevisjonen: Riksrevisjonens undersøkelse av opplæringen i grunnskolen. Dokument nr. 3:10 (2005-2006). Oslo (Riksrevisjonen) 2006.
Rindermann, H.: Was messen internationale Schulleistungsstudien? Schulleistungen, Schülerfähigkeiten, kognitive Fähigkeiten, Wissen oder allgemeine Intelligenz? In: Psychologische Rundschau 57-2006-1, 69-86.
Robin, S.R. & Sprietsma, M.: Characteristics of Teaching Institutions and Students’ Performance: New Empirical Evidence from OECD Data. Lille (CRESGE) 2003.
Saunders, L.: A Brief History of Educational ‘Value-Added’: How Did We Get to Where We Are? In: School Effectiveness and School Improvement 10-1999-2, 233-256.
Scharpf, F.W. & Schmidt, V. (eds.): Welfare and Work in the Open Economy. Vol. 1 & 2. Oxford (University Press) 2000.
Schedler, K. & Proeller, I.: New Public Management. Bern (Haupt/UTB) 2006³.
Sedikides, C. et al.: Accountability as a Deterrent to Self-Enhancement: The Search for Mechanisms. In: Journal of Personality and Social Psychology 83-2002-3, 592-605.
Shavit, Y. & Blossfeld, H.-P. (eds.): Persistent Inequality. Boulder (Westview Press) 1993.
Simola, H.: The Finnish miracle of PISA: historical and sociological remarks on teaching and teacher education. In: Comparative Education 41-2005-4, 455-470.

Sivesind, K.: Reformulating Reforms. Oslo (UiO, ILS), forthcoming.
Sivesind, K.: Task and Themes in the Communication about the Curriculum. The Norwegian Compulsory School Reform in Perspective. In: Rosenmund, M. et al. (eds.): Comparing Curriculum Making Processes. Bern (Lang) 2002.
Sivesind, K., Bachmann, K. & Afsar, A.: Nordiske læreplaner. Oslo (Læringssenteret) 2003.
Sivesind, K., Langfeldt, G. & Skedsmo, G.: Utdanningsledelse. Oslo (Cappelen Akademisk) 2006.
Slagstad, R.: De nasjonale strateger. Oslo (Pax forlag) 1996.
Slavin, R.E.: Educational Research in an Age of Accountability. Boston (Pearson) 2006.
Stack, M.: Testing, Testing, Read All About It: Canadian Press Coverage of the PISA Results. In: Canadian Journal of Education 29-2006-1, 49-69.
STM Stortingsmelding 37 (1990-1991): Om organisering og styring i utdanningssektoren. Report to the Norwegian Parliament.
STM Stortingsmelding 29 (1994-1995): Om prinsipper og retningslinjer for tiårig grunnskole – ny læreplan. Report to the Norwegian Parliament.
STM Stortingsmelding 47 (1995-1996): Om elevvurdering, skolebasert vurdering og nasjonalt vurderingssystem. Report to the Norwegian Parliament.
STM Stortingsmelding 28 (1998-99): Mot rikare mål. Nasjonalt vurderingssystem for grunnskolen. Report to the Norwegian Parliament.
STM Stortingsmelding 17 (2002-2003): Om statlige tilsyn. Report to the Norwegian Parliament.
STM Stortingsmelding 30 (2003-2004): Kultur for læring. Report to the Norwegian Parliament. (A shortened English version at: http://www.regjeringen.no/en/dep/kd/Documents/Brochures-and-handbooks/2004/Report-no-30-to-the-Storting-2003-2004.html?id=419442.)
Sutherland, D. & Price, R.: Linkages Between Performance and Institutions in the Primary and Secondary Education Sector. OECD Economics Department Working Papers No. 558. Paris (OECD) 2007.
Swanson, C.B. & Stevenson, D.L.: Standards-Based Reform in Practice: Evidence on State Policy and Classroom Instruction from the NAEP State Assessments. In: Educational Evaluation and Policy Analysis 24-2002-1, 1-27.
Swiss Federal Statistical Office (BFS): PISA 2003 – Einflussfaktoren auf die kantonalen Ergebnisse. Neuchâtel (BFS) 2005.
Telhaug, A.O. & Mediås, O.A.: Grunnskolen som nasjonsbygger: fra statspietisme til nyliberalisme. Oslo (Abstrakt) 2003.
Telhaug, A.O.: Kunnskapsløftet – Ny eller Gammel skole? Oslo (Cappelen Akademisk) 2005.
Telhaug, A.O.: Skolen mellom stat og marked: norsk skoletenkning fra år til år 1990-2005. Oslo (Didakta) 2005.
TNS Gallup: Undersøkelse blant rektorer og lærere om gjennomføring av de nasjonale prøvene våren 2005. Rapport. Oslo 2005.
Turmo, A. & Lie, S.: Hva kjennetegner norske skoler som skårer høyt i PISA 2000? Oslo (UiO/ILS) 2004.
Tyler, R.: Basic Principles of Curriculum and Instruction. Chicago (University Press) 1949.
Uljens, M.: The Hidden Curriculum of PISA – The Promotion of Neo-liberal Policy by Educational Assessment. In this volume.
U.S. Department of Education: Building on Results: A Blueprint for Strengthening the No Child Left Behind Act. Washington, D.C. 2007.
Wallerstein, I.: World-Systems Analysis: An Introduction. Durham, NC (Duke University Press) 2004.
Watermann, R. et al.: Schulrückmeldungen im Rahmen von Schulleistungsuntersuchungen: Das Disseminationskonzept von PISA-2000. In: Zeitschrift für Pädagogik 49-2003-1, 92-111.
Watson, S. & Supovitz, J.: Autonomy and Accountability in the Context of Standards-based Reform. In: Education Policy Analysis Archives 9-2001-32. Online: http://epaa.asu.edu/epaa/v9n32.html (2007/07/07).
Weber, M.: Wirtschaft und Gesellschaft (1923). Online: http://www.textlog.de/weber_wirtschaft.html (2007/03/19).
Weigel, T.M.: Die PISA-Studie im bildungspolitischen Diskurs. Eine Untersuchung der Reaktionen auf PISA in Deutschland und im Vereinigten Königreich. Trier (Universität) 2004. Online: http://www.oecd.org/dataoecd/46/23/34805090.pdf (2007/07/07).
Werler, T.: Nation, Gemeinschaft, Bildung: die Evolution des modernen skandinavischen Wohlfahrtsstaates und das Schulsystem. Baltmannsweiler (Schneider Verlag) 2004.
Westbury, I.: Didaktik and Curriculum Studies. In: Gundem, B.B. & Hopmann, S.T. (eds.): Didaktik and/or Curriculum. New York (Lang) 2002², 47-78.
Westbury, I., Hopmann, S. & Riquarts, K. (eds.): Teaching as Reflective Practice: The German Didaktik Tradition. Mahwah, NJ (Lawrence Erlbaum Associates) 2000.
Whitford, B.L. & Jones, K.: Accountability, Assessment, and Teacher Commitment: Lessons from Kentucky’s Reform Efforts. Albany (SUNY) 2000.
Wuttke, J.: Uncertainties and Bias in PISA. In this volume.
Zimmer, R. et al.: State and Local Implementation of the “No Child Left Behind Act”. Washington (Department of Education) 2007.



Über die Autoren/About the Authors

Allerup, Peter Nimmo:
Peter Allerup graduated in Mathematical Statistics from the University of Copenhagen in 1970. His preferred fields of interest today are mathematical statistics, psychometrics and quantitative research methods in general. From 1994 to 2002 he was Senior Research Scientist at the Royal Danish Institute for Educational Research; since 2002 he has held a professorship at the Danish University of Education, later Aarhus University, School of Education. He has been involved in the majority of the empirical international studies conducted by the university, OECD’s PISA and IEA’s comparative investigations in mathematics and science and in civic education, TIMSS and CIVIC. He has specialized experience in the field of IRT models (Item Response Theory), viz. the Rasch models in particular, with emphasis on applications where psychometric scaling properties are essential. He has long-standing experience with data from multilevel specifications of the research framework.
Contact: nimmo@dpu.dk

Bodin, Antoine:
Graduated in pure mathematics and in mathematics education (didactics of mathematics). Successively, or at the same time, secondary mathematics teacher, teacher trainer, researcher in mathematics education, evaluation specialist, mathematics textbook author, and international consultant (World Bank and other national and international agencies).
Antoine Bodin was much involved in the IREM network (French Institutes of Research in Mathematics Education) and in the APMEP (French Mathematics Teachers’ Association), where he created the EVAPM observatory and led it for 20 years.
He was a member of the TIMSS Subject Matter Advisory Committee and, for a few months, of the PISA 2003 Math Expert Group. He was also a member of the mathematics curriculum expert group and of the test development unit in the French Ministry of Education.
He has published numerous papers in mathematics education; see his website: http://web.mac.com/antoinebodin/iWeb/Site_Antoine_Bodin/
Contact: antoinebodin@mac.com

Bozkurt, Dominik:
Dominik Bozkurt was born on 10 November 1975 in Wels. He took his Matura at the Realgymnasium am Henriettenplatz in Vienna. In 2001 he began studying education (school and social pedagogy) and Romance studies (French). In January 2007 he completed his studies with a diploma thesis on school quality and the Kooperative Mittelschule and a board examination. From 2005 to 2007 he worked as a student assistant in the research unit for school and educational research. Since 2007 he has been working as a social pedagogue in the prevention department of Aids Hilfe Wien.
Contact: bozkurt@aids.at


Brinek, Gertrude:
Taught at primary and lower secondary schools in Vienna (1973 to 1983); studied art history, education and psychology at the University of Vienna (degree: Dr. phil. in education); student assistant and later university assistant at the Institute of Educational Sciences of the University of Vienna.
Research, teaching and publications in the areas of school climate/school anxiety, museum education, and educational/school theory and research, among others.
Since 2003 assistant professor at the Department of Education, Faculty of Philosophy and Education, University of Vienna (on partial leave to hold a seat in the federal legislature).
Contact: gertrude.brinek@univie.ac.at

Dolin, Jens:
Head of the Department of Science Education at the University of Copenhagen. He has done research in the teaching and learning of science (with a focus on dialogical processes, forms of representation and the development of competencies), general pedagogical issues (Bildung, competencies, assessment and evaluation) and organizational change (reform processes, curriculum development, teacher conceptions). He has been engaged in the development and implementation of the new 2005 science curriculum for the Danish upper secondary school.
He was a member of the PISA Science Forum 2006, which formulated the Science Literacy Framework for the PISA 2006 science test, and he is currently leading a research project on the validation of PISA science in a Danish context.
Contact: dolin@ind.ku.dk

Hopmann, Stefan Thomas:
University of Vienna. Previously worked in Kiel, Potsdam, Oslo, Trondheim and Kristiansand, among other places. Research areas: historical and comparative research on schooling and education, especially with regard to Didaktik, school development, school administration and teacher education.
Contact: stefan.hopmann@univie.ac.at

Hörmann, Bernadette:
Born in 1983, studied educational science at the University of Vienna. She currently works as an assistant at the Department of Educational Science at the University of Vienna, where she is writing her dissertation. Her core themes are school accountability and school structures.
Contact: bernadette.hoermann@univie.ac.at

Jahnke, Thomas:
(b. 1949), Diplom in mathematics 1974 (University of Marburg), doctorate in mathematics 1979 (University of Freiburg), Habilitation in mathematics education 1988 (University of Siegen). Since 1994 chair of mathematics education (University of Potsdam). Numerous scholarly publications; editor and author of mathematics textbooks for the Gymnasium. Fields of work: subject-matter didactics; critique of didactic ideologies; curriculum development for teacher education programmes; philosophy, history and culture of mathematics.
Contact: jahnke@math.uni-potsdam.de

Langfeldt, Gjert:
Dr. Gjert Langfeldt is a tenured associate professor at the University of Agder, Norway. His main research areas are issues linked to efficiency and equity, and the question of how didactics can be transformed into an empirical discipline. He has pursued these interests through an empirical approach and an engagement with methodological issues.
Langfeldt is currently involved in two research projects funded by national authorities: together with Stefan Hopmann he is engaged in a project charting how schools and teachers can come to grips with the new logic of accountability in education, and he is also involved in evaluating the National System of Quality Assurance in Education.
Contact: gjert.langfeldt@uia.no

Meyerhöfer, Wolfram:
1990-1995: studied teaching of mathematics and physics at the University of Potsdam
1996-1998: teaching traineeship at the Studienseminar Potsdam
1998-2007: University of Potsdam, mathematics education
Doctorate May 2004: Was testen Tests? Objektiv-hermeneutische Analysen am Beispiel von TIMSS und PISA.
Since 2007: visiting professor at FU Berlin
Contact: meyerhof@math.uni-potsdam.de

Olechowski, Richard:
Born 7 May 1936 in Vienna; from autumn 1955 studied psychology at the University of Vienna; 1962: Dr. phil.; worked as a psychologist in the justice system (resocialization); from 1966: assistant at the University of Vienna; 1970: Habilitation in education (with special regard to educational psychology); 1972: full professor of education at the University of Salzburg; 1977: full professor of education at the University of Vienna (with special regard to school pedagogy and general didactics); from 1986 additionally scientific and administrative director of the Ludwig Boltzmann Institute for School Development and International Comparative School Research. Since 1988 member of the editorial board of the journal „Erziehung und Unterricht“. 2004: Dr. h.c. and Prof. h.c. of Eötvös Loránd University in Budapest; autumn 2004: retirement as professor emeritus. Publications: Das alternde Gedächtnis (1969), Das Sprachlabor (1970, 1973²; translated into Japanese); more than 100 contributions to Austrian and international journals, encyclopedias and handbooks; editor of the series „Schule-Wissenschaft-Politik“, „Erziehungswissenschaftliche Forschung – Pädagogische Praxis“ and „Schulpädagogik und Pädagogische Psychologie“. Speciality: quantitative empirical educational research (especially on problems of school organization); from 1992 to 1997 directed a large empirical project, the longitudinal evaluation of the school model „Kooperative Mittelschule“.
Contact: richard.olechowski@univie.ac.at


Olsen, Rolf V.:
Rolf V. Olsen holds a postdoc position at the Department of Teacher Education and School Development at the University of Oslo, where he also received his PhD in 2005 with a thesis on secondary analysis of the science data in PISA and TIMSS. His current research activities are extensions of the work presented in his thesis. He is a member of a research group which, among other activities, is responsible for the Norwegian activities in a range of similar studies (Unit for Quantitative Analysis in Education). Besides the analytical work presented in his publications, he has extensive practical experience with item development in both international and national studies, and he previously worked for several years as a teacher of science, physics and mathematics in upper secondary education.
Contact: r.v.olsen@ils.uio.no

Prais, S.J.:
S.J. Prais (b. 1928) has spent most of his career in economic research, mostly at the National Institute of Economic and Social Research.
The economic analysis of consumer behaviour formed his initial research field, followed by a study of the growth of industrial concentration in Britain. Work on international differences in industrial productivity and their relation to the vocational training of the workforce was at the centre of subsequent extended empirical comparisons of British and German industries. This led to international comparisons of school-leaving standards, particularly in mathematics, and, in due course, to membership of the National Curriculum committee in that subject. Schooling standards and teaching methods have been compared, particularly with Germany and Switzerland. He was elected Fellow of the British Academy in 1985, and was awarded a D.Litt. (hon.) by City University in 1989 and a D.Sc. (hon.) by the University of Birmingham in 2006.
Contact: c/o m.ockenden@niesr.ac.uk

Puchhammer, Markus:
Dipl.-Ing. Dr. phil. Dr. techn.; worked as a research assistant and graduated in physics at the Technical University of Vienna; graduated in psychology and educational science at the University of Vienna. He worked for several years as a software developer, systems analyst and project manager in the telecommunications industry, then in the electronic transaction processing business; gave EDP training courses at WIFI Vienna; taught at a vocational education college (HTL); lecturer at FH Joanneum Graz, then at the University of Applied Sciences Technikum Wien for science and research, statistics and data analysis; teleteaching survey.
Contact: puchhammer@gmx.at

Retzl, Martin:
(b. 1980); assistant at the Department of Educational Science at the University of Vienna (since 1 September 2007)
– Student assistant at the Department of Educational Science at the University of Vienna (March 2006-July 2007)
– Master’s degree in educational science (Mag. phil.)
– Graduate of a teachers’ college: diploma for teaching at “Hauptschule” (lower secondary school)
– Projects and research interests: development of teaching material, empirical research (diploma thesis: teacher study), school capacity research, governance of the school and educational system.
Contact: martin.retzl@univie.ac.at

Sjøberg, Svein:
Professor in science education at Oslo University. He was educated as a nuclear physicist, later also in education and social science. Current research interests: social, cultural and ethical aspects of science education; science education and development; gender and science education in developing countries; a critical approach to issues of scientific literacy and public understanding of science. Currently organizer of ROSE (The Relevance of Science Education), a comparative project on pupils’ interests, attitudes, perceptions etc. of importance to science teaching and learning.
Information and articles at http://folk.uio.no/sveinsj/
Contact: svein.sjoberg@ils.uio.no

Uljens, Michael:
Michael Uljens (b. 1962), Prof. Dr., Vice Dean at Åbo Akademi University and Dozent at Helsinki University, has been working on a wide range of educational topics, but his main field of research over the years has been the theory and philosophy of education (books: “School Didactics and Learning” and “Allmän pedagogik”). Since 2005 he has been running a four-year research project (“Bildung and learning in the late-modern society”) with six doctoral students working full time. He has worked as a visiting scholar at the University of Göteborg with Prof. Marton and at Humboldt University with Prof. Benner. From 2000 to 2003 he was professor of general education at Helsinki University.
Contact: muljens@abo.fi

Wuttke, Joachim:
Joachim Wuttke studied physics in München and Grenoble. He holds a state certificate for teaching mathematics and physics in secondary schools, a PhD in physical chemistry, and a habilitation in experimental physics. He has worked in the telecommunication industry, as a school teacher, and as a group leader in academic research. Besides 25 research papers in statistical physics, he has published on scientific instrumentation and computing. Joachim Wuttke is a staff scientist at the Munich outstation of Forschungszentrum Jülich.
Contact: wuttke1@web.de


