PISA zufolge PISA – PISA According to PISA
Stefan Thomas Hopmann, Gertrude Brinek,
Martin Retzl (Hg./Eds.)
PISA zufolge PISA – PISA According to PISA

Schulpädagogik und Pädagogische Psychologie
edited by
Univ.-Prof. Dr. Dr. h. c. Richard Olechowski
(Universität Wien)
Volume 6

LIT
Stefan Thomas Hopmann, Gertrude Brinek,
Martin Retzl (Hg./Eds.)

PISA zufolge PISA –
PISA According to PISA

Hält PISA, was es verspricht? –
Does PISA Keep What It Promises?

LIT
Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche
Nationalbibliografie; detailed bibliographic data are available on the Internet at
http://dnb.d-nb.de.

ISBN 978-3-7000-0771-5 (Austria)
ISBN 978-3-8258-0946-1 (Germany)

A catalogue record for this book is available from the British Library.

© LIT VERLAG GmbH & Co. KG Wien 2007
Krotenthallergasse 10/8, A-1080 Wien
Tel. +43 (0) 1 / 409 56 61
Fax +43 (0) 1 / 409 56 97
e-mail: wien@lit-verlag.at
http://www.lit-verlag.at

LIT VERLAG Dr. W. Hopf, Berlin 2007
Distribution/publisher contact:
Fresnostr. 2, D-48159 Münster
Tel. +49 (0) 251 / 62 03 20
Fax +49 (0) 251 / 23 19 72
e-mail: lit@lit-verlag.de
http://www.lit-verlag.de

Distribution:
Austria: Medienlogistik Pichler-ÖBZ GmbH & Co KG
IZ-NÖ Süd, Straße 1, Objekt 34, A-2355 Wiener Neudorf
Tel. +43 (0) 2236 / 63 535-290, Fax +43 (0) 2236 / 63 535-243, e-mail: mlo@medien-logistik.at
Germany: LIT Verlag, Fresnostr. 2, D-48159 Münster
Tel. +49 (0) 251 / 620 32-22, Fax +49 (0) 251 / 922 60 99, e-mail: vertrieb@lit-verlag.de

Distributed in the UK by: Global Book Marketing, 99B Wallis Rd, London, E9 5LN
Phone: +44 (0) 20 8533 5800, Fax: +44 (0) 1600 775 663
http://www.centralbooks.co.uk/acatalog/search.html

Distributed in North America by:
Transaction Publishers
Rutgers University
35 Berrue Circle
Piscataway, NJ 08854
Phone: +1 (732) 445-2280
Fax: +1 (732) 445-3138
For orders (U.S. only): toll free (888) 999-6778
e-mail: orders@transactionspub.com
Inhalt / Table of Contents
Zu diesem Buch 1

Vorwort 5
Richard Olechowski

Introduction: PISA According to PISA – Does PISA Keep What It
Promises? 9
Stefan T. Hopmann/Gertrude Brinek

What Does PISA Really Assess? What Does It Not? A French View 21
Antoine Bodin

Testfähigkeit – Was ist das? 57
Wolfram Meyerhöfer

PISA – An Example of the Use and Misuse of Large-Scale
Comparative Tests 93
Jens Dolin

Language-Based Item Analysis – Problems in Intercultural
Comparisons 127
Markus Puchhammer

England: Poor Survey Response and No Sampling of Teaching Groups 139
S. J. Prais

Disappearing Students – PISA and Students With Disabilities 157
Bernadette Hörmann

Identification of Group Differences Using PISA Scales – Considering
Effects of Inhomogeneous Items 175
Peter Allerup

PISA and “Real Life Challenges”: Mission Impossible? 203
Svein Sjøberg
PISA – Undressing the Truth or Dressing Up a Will to Govern? 225
Gjert Langfeldt

Uncertainties and Bias in PISA 241
Joachim Wuttke

Large-Scale International Comparative Achievement Studies in
Education: Their Primary Purposes and Beyond 265
Rolf V. Olsen

The Hidden Curriculum of PISA – The Promotion of Neo-Liberal
Policy By Educational Assessment 295
Michael Uljens

Deutsche Pisa-Folgen 305
Thomas Jahnke

PISA in Österreich: Mediale Reaktionen, öffentliche Bewertungen und
politische Konsequenzen 321
Dominik Bozkurt, Gertrude Brinek, Martin Retzl

Epilogue: No Child, No School, No State Left Behind: Comparative
Research in the Age of Accountability 363
Stefan T. Hopmann
Zu diesem Buch (About This Book)
“PISA zufolge PISA” (PISA According to PISA) was the topic of a symposium held at the University of Vienna in March 2007 by the Research Unit for School and Education Research (Forschungseinheit für Schul- und Bildungsforschung) of the Institut für Bildungswissenschaft. For this event we had invited several critics of the existing PISA studies, but also representatives of the Austrian PISA consortium, who unfortunately had to cancel at short notice. Our aim was to make the debate about PISA more matter-of-fact: away from the political and ideological quarrel over PISA, and towards an analysis of the methodological premises and consequences of the PISA project, or, more precisely, the question of whether PISA, given its design, can deliver what it promises to explain in its analyses and reports. At this event the question was raised whether, and how, PISA is being discussed internationally as a scientific undertaking. We took this as the occasion to put together the present volume.
Despite its enormous public impact, PISA has so far prompted hardly any international follow-up inquiries in comparative educational research, though it has prompted a number of national ones (cf. Hopmann & Brinek in this volume). For this volume, eighteen researchers from seven countries (Denmark, Germany, England, Finland, France, Norway and Austria) have now come together for the first time across national borders to put PISA under the microscope from all sides, examining the entire research process from the design and the survey instruments, through the administration and data analysis, to the public presentation of the data. All relevant scholarly approaches to PISA were taken into account: empirical educational research, research methodology, statistics, and general as well as subject-specific didactics. Almost all contributors have many years of experience in comparative educational research or related undertakings; some were also directly involved in PISA research, at least for a time.
With all due respect for the tremendous commitment of the OECD and the national PISA consortia, the result is very sobering: PISA does not come close to keeping what it promises, and with the means employed it cannot do so. The PISA project is evidently burdened with so many weaknesses and sources of error that at least its most popular end products, the international comparison tables and most of the national supplementary analyses of schools and school structures, teaching, school achievement, and issues such as migration, social background, gender and so on, simply cannot be scientifically sustained in the forms practised so far. They far overstretch what the chosen design and its theoretical and methodological foundations can bear. Anyone who wants to decide about school structures, curricula, national tests or the future of teacher education on this basis is ill advised.
This does not stop PISA from being one of the most important and most fruitful projects in contemporary comparative research. Individual contributions in this volume explicitly point to future possibilities for PISA research. It merely seems urgently necessary to mark out far more clearly the limits of validity and reliability that are, in part, unavoidable in such research, and to ensure that PISA is not invoked in future to carry burdens of proof that it cannot shoulder in a scientifically defensible way. One might almost say that what is good about PISA, and the interested public, need to be protected against the methodologically untenable exuberance of some of those involved in PISA. Otherwise there is a danger that one day education administrations, schools and school principals, teachers and students will not only grow weary of the constant misuse of their data, but will reject all comparable measures and research projects wholesale, or even boycott them or damage them through deliberately wayward response behavior, as has already happened with state tests in some countries (among them the USA, Chile and Norway). This would inflict lasting damage on comparative educational research as a whole, and the concern that the PISA enthusiasm may end in just this way is what motivates our engagement.
Of course, our own undertaking also faced clear limits:
– First, a direct re-analysis of original PISA data, PISA questions and so on was possible only to a limited extent, only where individual data sets were accessible. In addition, we have evaluated almost the entire literature on PISA methodology and its implications (cf. the references in the individual chapters). So far, however, PISA does not permit an independent examination of the complete data sets, including all accompanying materials. It may therefore be that a later re-examination, if it one day becomes possible, will show individual results of our meta-analyses in a different light from what we were able to verify. However, so many critical objections have emerged in our investigations that refuting half a dozen, or even a dozen, of them would change nothing in the core conclusions of this volume. In every phase of the PISA project there are numerous design decisions and problems which, taken on their own, suffice to regard a considerable part of the currently customary presentation and use of the PISA results as scientifically untenable.
– Second, we did not aim to speak with one voice. Not only do PISA critics find different things worthy of criticism, and therefore choose different approaches and lines of argument; we also wanted to present the whole range of criticism currently accessible in Europe and to exclude no one merely because one contributor or another might not share individual points or conclusions. We had also invited the German and the Austrian PISA consortia to take part on several occasions. Unfortunately, this participation did not come about. Fortunately, some others with PISA experience were nevertheless willing to join our undertaking. As a result, however, we succeed only in part in reflecting the full range of arguments for and against. We do not doubt, though, that the PISA consortia have plenty of other opportunities to take an active part in the debate.
An undertaking such as this cannot succeed without help from many sides. The then Austrian Federal Ministry of Education, Science and Culture (BMBWK), the Österreichische Gesellschaft für Bildungsforschung (Austrian Society for Educational Research) and the Norwegian research network “Achieving School Accountability in Practice (ASAP)”, among whose publications this volume also counts, generously supported the symposium and the work on the present volume. Not to be forgotten is the help of the secretariats in Vienna (Patricia Stuhr) and in Kristiansand (Inger Linn Nystad Baade, Karen Beth Lee Hansen), whose patience and linguistic skill made the completion of this book possible. Finally, we would like to thank Richard Olechowski and LIT Verlag most warmly for accepting the book into the series “Schulpädagogik und Pädagogische Psychologie”.

Stefan T. Hopmann, Gertrude Brinek, Martin Retzl
Vienna, September 2007
Scholarship thrives on discussion. For this reason we warmly invite you to take part in our online discussion forum. Post your opinions, criticisms and suggestions about the book! Further information is available on the following homepage:
http://institut.erz.univie.ac.at/home/fe2/.
We look forward to a stimulating discussion!
Vorwort (Preface)

Richard Olechowski
Austria: University of Vienna
Presumably every educational researcher who has ever borne ultimate responsibility for a large-scale empirical project (one that, as is usually the case, was in any event confined to the national level, and perhaps additionally narrowed to the language most widely spoken in the nation concerned) will have marvelled at the daring of the PISA consortium. Has this consortium pulled off its salto mortale, or has the leap ended “fatally” for its members? Everyone working on a large national project (even one restricted as just sketched, or even more narrowly) knows only too well that the necessary exactitude is at risk of a whole series of impairments at every stage of the project, beginning with the drawing of the sample, through the individual steps of test construction, to the question of whether the test was administered in the same way in all subgroups. Not to be underestimated are the problems of test scoring and of data analysis (in the narrower sense), especially when, as in the PISA project, some of this had to be carried out in decentralized fashion in the participating states. Nor should the effect of the chosen mode of publication be underrated, particularly for a project attracting public interest of the extent and intensity that PISA does. Professionally competent critics of PISA reproach those responsible for the PISA publications, in this book, with having deliberately heightened the intensity of public interest by publishing the PISA results in the form of national rank orders, thereby steering public interest not towards an educational-scientific interest but towards that of the tabloid press. The data were originally collected at a higher level of measurement; only for publication was the comparatively coarse measure of rank data chosen, and the “suggestive power” of the rankings, they argue, was foreseeable. The proponents of PISA, the critics conclude, therefore bear the responsibility for the present manner of the discussion.
As professionally competent colleagues (some of whom have even worked on parts of the large-scale PISA project themselves) report in this book, there are serious critical objections concerning every phase of the PISA project. Some examples, in addition to those already mentioned above:
– At the beginning of the PISA project stood the collection of test items. All states participating in the PISA study were invited to submit suitable items, but not all of the states concerned accepted this invitation. This produced a “cultural bias”: a distortion towards the cultural characteristics of those states that did respond to the call for items.
– The curricula of the relevant schools in the individual countries were, by and large, drawn up at the time (many years before the PISA project) without international coordination. Moreover, teachers in the individual countries enjoy a certain curricular freedom, to varying degrees. In most countries, teachers’ working groups also develop “schedules of instruction” (Lehrstoffverteilungen) that concretize the work of teaching. It goes without saying that a school achievement test must be constructed in close alignment with what was actually taught in the schools. This aspect was not systematically taken into account in the construction of the PISA items. (Any deliberate renunciation of “curricular validity” would be tantamount to knowingly accepting the risk of an argumentative impasse.)
– Every test must be referenced to a norming sample, and the norming sample must be as similar as possible to the sample that is tested in a concrete study as representative of the population of interest. This is not the case with PISA: the individual tests were not calibrated or normed on a separate norming sample in each of the countries taking part in the PISA project.
– Detailed information on the “reliability” of the tests employed is missing. (There is no information on the degree of agreement between test results obtained at different points in time from the same or “comparable” persons.)
– Likewise, detailed information on the “validity” of the tests employed is missing. (There is no information on how similar the test results are to results from tests that measure the same or similar dimensions.)
Deficiencies such as those listed above cannot, as the critics of PISA who have their say in this book also agree, be “compensated for” by samples however large, for they are not “random errors” but so-called “systematic errors”.
Nevertheless (and even the sternest critics agree on this too), PISA has broken new ground, and its results, after sometimes harsh criticism, cannot simply be brushed aside. At the present time, no one could carry out such a large international comparative study any better. Nor may the points of criticism listed above, which are presented and discussed thoroughly and expertly in this book, be misunderstood as expressing a fundamental reserve or even an aversion towards measuring and counting where questions of education are concerned. Even if corrections were needed now and then in details (and even with regard to the question of a “national ranking” on a single dimension), educational research has drawn great profit from the PISA study:
On the one hand, nearly all states that do not “hold” the top ranks in most of the tested dimensions have presumably been motivated to examine critically what in their school system should be reformed: the school organization, the curriculum, initial and in-service teacher education, or other aspects of their school and education system. (Admittedly, the PISA surveys and analyses carried out so far do not help individual countries find the concrete causes of any unsatisfactory PISA results.) On the other hand, PISA has brought a particularly significant point into the consciousness of the scientific community: comparative education, an important line of research within educational science, has all at once been almost forcibly pulled away from the global perspective of comparing school systems, curricula and the systems for training individual teachers, and redirected towards the essential aspect of “outcome”, through a PISA-based worldwide comparison carried out with the help of methodologically sophisticated instruments, with tests based on probabilistic models of test theory.
The merit of this book lies precisely in its balance between criticism, the methodologically critical review of the surveys conducted in a three-year cycle (2000, 2003, 2006), in particular of the analyses insofar as they are already available, and the look towards the future. It offers not only the opportunity to see the PISA study in the full sharpness of methodological criticism (often enough a constructive criticism), but also the opportunity to recognize PISA as a step in the further development of comparative education. For the first time, this large-scale comparative study gives internationally comparative educational research the possibility of producing intersubjectively comparable, and thus also (by clear criteria) falsifiable, results (an important point for science!), and conversely of achieving results that are replicable in the strict sense of the word, and so of arriving at a genuine, empirically secured body of knowledge, insofar as that notion is compatible in principle with the view that all science is provisional.
Richard Olechowski
Vienna, autumn 2007
Introduction:
PISA According to PISA – Does PISA Keep What It
Promises?

Stefan T. Hopmann/Gertrude Brinek
Austria: University of Vienna
For the time being, PISA is the most successful enterprise in comparative education. Every time a new PISA wave rolls in, or an additional analysis appears, governments fear the results, newspapers fill column after column, and the public demands answers to the claimed failings in their country’s school system. Of course, such a tremendous impact evokes discussion and criticism. On the one side are those:
– who blame PISA for not covering the whole breadth of education or schooling (e.g. Fuchs 2003; Ladenthin 2004; Kraus 2005; Herrmann 2005; Dohn 2007; adding to the PISA frame: Benner 2002),
– who point to the fact that PISA is run by private companies (“PISA Incorporated”) looking for a share of the ever-growing testing market (see e.g. Bracey 2005; Flitner 2006; Lohmann 2006), or
– who depict PISA as a New Public Government outlet of the most neo-liberal kind (see e.g. Lohmann 2001; Huisken 2005; Klausnitzer 2006).
On the other side are those who praise PISA for giving us the best database ever available for comparative research, for developing new tools of research, and for PISA’s creative analysis of its data sets (for many examples see Pekrun 2003; Roeder 2003; Weigel 2004; Stack 2006; Olsen in this volume).
PISA According to PISA
However, surprisingly, and in spite of its public impact, PISA has not led to thorough methodological debates within the comparative research community, at least not internationally. There have been some critiques pointing to design or analytic shortcomings in some of the participating countries (e.g. Bonnet 2002; Romainville 2002; Nash 2003; Prais 2003, 2004; Goldstein 2004; Allerup 2005, 2006; Bodin 2005; Bottani & Virgnaud 2005; Gaeth 2005; Olsen 2005; Jahnke & Meyerhöfer 2006; Neuwirth, Ponocny & Grossmann 2006; Grisay & Monseur 2007). There has been some fundamental, and highly contested, criticism of the methodological soundness of PISA’s research as a whole (Jahnke & Meyerhöfer 2006; especially Wuttke 2006; rebuttal by Prenzel & Walter 2006) 1. However, none of this has led to an international debate on the validity claims of PISA outside the PISA community itself. It seems as if the overwhelming success of the approach has made any attempt to discuss PISA’s design, data collection and analysis methodologically look petty-minded and irreverent. PISA’s own strategy of not giving access to the full database, including all the questionnaires, contributes to this problem.
The present volume on “PISA According to PISA” is probably the first independent international attempt to discuss the methodological merits and shortcomings of PISA in relation to the validity and reliability claims PISA itself puts forward. Our aim is not to add to the debate for or against PISA. Most of us believe that PISA is an important milestone in the history of our field. But we do question whether some basic elements of PISA are done well enough to carry the weight of, e.g., comparative league tables or of in-depth analyses of weaknesses of educational systems. We ask whether other, and better, uses of the PISA database are warranted, and whether PISA-as-a-public-event should come under much more independent scrutiny, if only to avoid its misuse to validate claims and policies which cannot be legitimately derived from PISA.
The volume seeks to follow, as much as possible, the whole PISA research process from design and sampling, through data collection and analysis, to data presentation and impact. Our aim is not to give an overview of the different national PISA debates, but rather to discuss general issues of construction and use. The contributors come from seven countries and from all walks of educational research, including specialists in empirical research methodology, statistical data analysis, general and subject matter didactics, and educational policy analysis. We include contributors who are or have themselves been involved in PISA or similar projects (see the bios at the end of the book).

1 The editors of the above-mentioned “PISA & Co.” volume (Jahnke & Meyerhöfer 2006) are working on a new and revised edition of that book, including an explicit discussion of the response they received to the first edition. That book will be available by late 2007.
To highlight just a few core issues:
<strong>–</strong> An<strong>to</strong>ine Bodin (IREM de Besançon <strong>–</strong> Université de Franche-Comté) shows<br />
from a French perspective how much the <strong>PISA</strong> <strong>–</strong> assessment is embedded in<br />
a certain understanding of (school) knowledge, which doesn’t fit all.<br />
<strong>–</strong> Wolfram Meyerhöfer (Universität Potsdam) continues this argument by an<br />
in-depth-analysis of what <strong>PISA</strong> really asks for in its questionnaires, showing<br />
how little this is in <strong>to</strong>uch with a comprehensive concept of “Bildung” or even<br />
current didactics.<br />
<strong>–</strong> Jens Dolin (Syddansk Universitet) adds similar arguments from a Danish<br />
perspective, underlining how much <strong>PISA</strong>’s conceptualization of knowledge<br />
is at risk <strong>to</strong> misrepresent what is taught and learned in schools.<br />
– Markus Puchhammer (Technikum Wien) shows, using the published example questions, how translation problems may affect results to a degree that makes comparisons guesswork.
– S. J. Prais (National Institute of Economic and Social Research, London) uses the example of England to demonstrate serious flaws in the response rates and sampling, which necessarily lead to biased results.
– Bernadette Hörmann (Universität Wien) points to the systematic marginalization of special-needs students by PISA, and to how little has been done, at least in Austria, to address their role within the PISA approach.
– Peter Allerup (Århus Universitet) elaborates on a similar issue by showing, with Danish data, to what degree PISA's much-acclaimed analyses of the impact of gender, migration and similar factors depend on but a few highly problematic items.
– Svein Sjøberg (Universitetet i Oslo) underlines how much both PISA's design on the one hand and student response behavior on the other are culturally embedded, which may lead to a partial or complete mismatch.
– Gjert Langfeldt (Agder Universitet) questions the validity and reliability claims made by PISA, pointing to constructional constraints, methodological mishaps and the cultural bias embedded in the PISA design.
– Joachim Wuttke gives a comprehensive overview of recently voiced criticism of PISA's research conduct and of the resulting biases and uncertainties, which render not least its league tables and comparisons essentially arbitrary.
– Rolf Olsen (Universitetet i Oslo) outlines ways in which PISA can overcome
some of its shortcomings by broadening its approach and adding new research.
– Michael Uljens (Åbo Akademi) explains the Finnish PISA success by the fact that what PISA asks for had already gained a foothold in Finnish schooling before PISA came along.
– Thomas Jahnke (Universität Potsdam) elaborates from a German perspective how PISA fails to really assess what is or should be taught in schools, and how reliance on PISA can lead to an impoverished view of the curriculum.
– Dominik Bozkurt, Gertrude Brinek and Martin Retzl (Universität Wien) use the Austrian example to show how the public and political response to PISA unfolds irrespective of what PISA can actually cover or prove.
– Finally, Stefan T. Hopmann (Universität Wien) puts both the PISA project and the PISA discourse in a comparative perspective, showing how much the design, use of, and response to PISA depend on the needs and traditions of those involved.
All in all, the contributions give a very varied picture of the PISA effort. No step in the research process seems to be without substantial problems, and several steps do not meet rigorous scholarly standards. Some of us believe that these are obstacles which can be overcome within the PISA frame (e.g. Allerup, Dolin, Olsen, Sjøberg); others tend toward the conclusion that the PISA project is beyond repair (e.g. Langfeldt, Meyerhöfer, Wuttke), or that it is so embedded in a specific political purpose that it should rather be considered a type of research-based policy making, not a scholarly undertaking (e.g. Hopmann, Jahnke, Uljens, Bozkurt/Brinek/Retzl).
Almost all of the chapters raise serious doubts concerning the theoretical and methodological standards applied within PISA, particularly with regard to its most prominent by-products, the national league tables and analyses of school systems. Without access to the full set of original data, it is difficult to come to final conclusions. From our viewpoint, however, a few points seem evident beyond any reasonable doubt:
– PISA is by design culturally biased and methodologically constrained to a degree that prohibits accurate representation of what is actually achieved in and by schools. Nor is there any proof that what it covers is a valid conceptualization of what every student should know.
– The national league tables, the product of greatest public value (cf. Steiner-Khamsi 2003), are based on so many weak links that they should be abandoned right away. If only a few of the methodological issues raised in this
volume are on target, the league tables depend on assumptions about their validity and reliability which are unattainable.
– The widely discussed by-products of PISA, such as the analyses of "good schools" and "good instruction", or of differences between school systems and of issues like gender, migration, or social background, go far beyond what a cautious approach to these data allows. They are more often than not speculative, and would at least need a wider framing by additional research looking at the aspects which PISA by design cannot cover or gets wrong.
– Any policy making based on these data (whether about school structures, standards or the curriculum) cannot be justified. The use and misuse of PISA data in such contexts, with or without the consent or cooperation of PISA researchers, belongs solely to the sphere of policy making. Of course PISA researchers have the same right as every citizen to voice their political convictions in public. However, they cannot do so while claiming research as an unquestionable basis for their arguments.
This does not mean that there are no valuable lessons to be drawn from PISA. At the very least, it is a very innovative comparative study of the uneven distribution of a peculiar kind of knowledge and abilities among young people in different countries. However, the use of PISA as research on schooling by the OECD, its members and some of the research groups connected to the effort goes far beyond what counts as scientific evidence, or simply as well-done research. PISA is not according to PISA when it comes to how it is produced and used in these cases.
PISA – The Contergan of Educational Research?
Of course, we would have loved to add to this volume commentaries on, and criticism of, what is presented here by members of the PISA consortium, because we believe in the necessity of broad and uninhibited scholarly exchange. However, repeated invitations to address these issues in open symposia, or to contribute to this volume, either remained unanswered or were turned down. The German PISA consortium went so far as to make an official decision not to participate in this effort; others simply kept silent. Time and again we were told in public and at meetings that most of the methodological criticism published on PISA has been proven wrong, and that every possible weakness has been taken care of. However, we could not obtain a published justification for this
claim. Even an invitation to contribute a summary of the counterarguments to this volume was turned down.
As sad as this is, it came as no surprise. In preparing this volume we exchanged quite a few notes on how the national debates around PISA unfold in 'our' countries. What emerged was a picture not unlike the behaviour of large companies when they encounter a potential scandal, e.g. pharmaceutical companies dealing with ill-conceived drugs (like Chemie Grünenthal in the famous Contergan/Thalidomide case, or in other scandals; cf. Kirk 1999; Luhmann 2000; Schulz 2001), where the strategy is one of "issue framing" (cf. Entman 1993; Sniderman & Theriault 2004). To take just the most recent German example:
– If some critique is voiced in public, the first response seems to be silence. Or, as the leader of the German consortium, Manfred Prenzel, put it in the case of this book: one does not want to provide "a forum for unproven allegations" (in reply to the invitation to participate in this book, by mail of 2007-05-09; the invitation was turned down by a "unanimous" decision of the German PISA consortium, confirmed by mail of 2007-05-21). He wrote this before knowing the authors and titles of all but one of the chapters contained in this volume.
– If that is not enough, the next step is often to raise doubts about the motives and abilities of those who are critical of the enterprise. For instance, when asked about the recently published volume PISA & Co. (Jahnke & Meyerhöfer 2006), Olaf Köller, as head of the German National Institute for Educational Progress, suggested that (1) these critics were unqualified to discuss PISA (even though they included many leading members of mathematics didactics research in Germany) and (2) they were probably driven by envy or other non-scholarly motives (Köller 2006a; Kerstan 2006).
– The next step seems to be to acknowledge some problems, but to insist that they are very limited in nature and scope and do not affect the overall picture. Alternatively, it is pointed out that these problems are well known within large-scale survey research of the PISA kind, and even unavoidable when working comparatively (e.g. Köller 2006b). Of course, that claim does not reduce the impact of these problems on the validity of the results.
– Finally, there is the statement that the criticism contains nothing new, and nothing that has not already been dealt with within the PISA research itself; often this claim is accompanied by references to opaque technical
reports that only insiders can understand, or to unpublished papers and reports (e.g. Prenzel & Walter 2006; Schleicher 2006).
What does not happen is what is normally considered "good science": open debate on the pros and cons of the arguments. If one understands PISA as an economic enterprise, in line with the above-mentioned pharmaceutical companies, this is quite reasonable. Ignoring, silencing, or simply marginalizing a critic does less harm to the brand than a public argument. A public rebuttal carries the risk that some customers would not be totally convinced ("semper aliquid haeret"). Firmer steps become necessary only when criticism finally becomes so public that it can no longer be ignored by customers and buyers. Even then, the first move is still to discredit the critics and their supporters as uninformed, ill-equipped, or simply following a personal agenda. The final move rests on the claim that there is other research which proves the critics wrong, although for a variety of reasons the data sets on which these conclusions are based cannot be made available. By using such techniques, companies can realistically expect that even proven deficiencies will not harm sales substantially, even over time.
Of course, the comparison of PISA and Contergan can be seen as overreaching: Thalidomide led to thousands of severely disabled newborns, whereas PISA, even in the worst case, only does harm to children's education. Additionally, the Grünenthal company directly advertised the medication for high-risk purposes, whereas the PISA consortium can argue that it is up to people to believe or not to believe what PISA tells them. But other similarities are striking. PISA has a large "market share" to defend: most of the public money spent on educational research nowadays is being put into PISA and similar approaches (the standards-and-testing business), and many chairs in education have turned to related topics and issues, thus providing a significant market for collaborators in the field. This is all too big and too seductive to be put at risk just because of a few other scholars who do not support the whole enterprise or the way it is done.
The readers of this volume should expect similar responses to what is said here. But don't worry: nobody is going to drag PISA into a court of law because of its flaws, as happened with the pharmaceutical companies. No court other than that of public reasoning is available; but with Kant, we believe that this is the strongest court of all.
Discussion is an essential part of science. We therefore invite you to take
part in our discussion forum on the Internet and to post your opinions and critiques concerning the book. More information can be found at http://institut.erz.univie.ac.at/home/fe2/. We are looking forward to an inspiring discussion!
References
Aktionsrat Bildung: Bildungsgerechtigkeit. Jahresgutachten 2007. (ed. Vereinigung der Bayerischen Wirtschaft e.V.; online: http://www.aktionsrat-bildung.de/fileadmin/Dokumente/Bildungsgerechtigkeit_Jahresgutachten_2007_-_Aktionsrat_Bildung.pdf; retr. 2007/07/07).
Allerup, P.: PISA præstationer – målinger med skæve målestokke? In: Dansk Pædagogisk Tidsskrift 2005-1, 68-81.
Allerup, P.: PISA 2000's læseskala – vurdering af psykometriske egenskaber for elever med dansk og ikke-dansk sproglig baggrund. (Rockwool Fondens Forskningsenhed og Syddansk Universitetsforlag) Odense 2006.
Allerup, P.: Identification of Group Differences Using PISA Scales – Considering Effects of Inhomogeneous Items. In this volume.
Benner, D.: Die Struktur der Allgemeinbildung im Kerncurriculum moderner Bildungssysteme. Ein Vorschlag zur bildungstheoretischen Rahmung von PISA. In: Zeitschrift für Pädagogik 48-2002-1, 68-90.
Bodin, A.: What does PISA really assess? What it doesn't. A French view. Report prepared for the joint Finnish-French conference "Teaching mathematics: beyond the PISA survey", Paris 2005.
Bodin, A.: What does PISA really assess? What does it not? A French view. In this volume.
Bonnet, G.: Reflections in a Critical Eye: On the Pitfalls of International Assessment. In: Assessment in Education 2002-9, 387-400.
Bottani, N. & Vrignaud, P.: La France et les évaluations internationales. Paris 2005. Online: http://lesrapports.ladocumentationfrancaise.fr/BRP/054000359/0000.pdf (retr. 2007/07/07).
Bracey, G.W.: Research: Put Out Over PISA. In: Phi Delta Kappan 86-2005-10, 797.
Dohn, N.B.: Knowledge and Skills for PISA – Assessing the Assessment. In: Journal of Philosophy of Education 41-2007-1, 1-16.
Flitner, E.: Pädagogische Wertschöpfung. Zur Rationalisierung von Schulsystemen durch public-private-partnerships am Beispiel von PISA. In: Oelkers, J. et al. (eds.): Rationalisierung und Bildung bei Max Weber. Bad Heilbrunn (Klinkhardt) 2006, 245-266.
Fuchs, H.-W.: Auf dem Wege zu einem neuen Weltcurriculum? Zum Grundbildungskonzept von PISA und der Aufgabenzuweisung an die Schule. In: Zeitschrift für Pädagogik 49-2003-2, 161-179.
Gaeth, F.: PISA (Programme for International Student Assessment). Eine statistisch-methodische Evaluation. Berlin (Freie Universität) 2005.
Goldstein, H.: International Comparisons of Student Attainment: Some Issues Arising from the PISA Study. In: Assessment in Education – Principles, Policy, and Practice 11-2004-3, 319-330.
Grisay, A. & Monseur, C.: Measuring the Equivalence of Item Difficulty in the Various Versions of an International Test. In: Studies in Educational Evaluation 33-2007-1, 69-86.
Herrmann, U.: Fördern "Bildungsstandards" die allgemeine Schulbildung? In: Rekus, J. (ed.): Bildungsstandards, Kerncurricula und die Aufgabe der Schule. Münster (Aschendorff) 2005, 24-52.
Herrmann, U.: PISA – Welche Konsequenzen für Schule und Unterricht kann man wirklich ziehen? Diskussionsbeitrag DIDACTA Hannover 2006. FORUM BILDUNG (online: http://forum-kritische-paedagogik.de/start/download.php?view.209; retr. 2007/07/07).
Hopmann, S.T.: Restrained Teaching: The Common Core of Didaktik. In: European Educational Research Journal 6-2007-2, 109-124.
Huisken, F.: Der "PISA-Schock" und seine Bewältigung – Wieviel Dummheit braucht/verträgt die Republik? Hamburg (VSA-Verlag) 2005.
Jahnke, T. & Meyerhöfer, W. (eds.): PISA & Co – Kritik eines Programms. Hildesheim (Franzbecker) 2006.
Kerstan, K.: An PISA gescheitert. In: DIE ZEIT, 16.11.2006, Nr. 47.
Kirk, B.: Der Contergan-Fall: eine unvermeidbare Arzneimittelkatastrophe? Stuttgart (Wissenschaftliche Verlagsgesellschaft) 1999.
Klausnitzer, J.: PISA – einige offene Fragen zur OECD-Bildungspolitik. 2006. Online: http://www.links-netz.de/K_texte/K_klausenitzer_oecd.html (retr. 2007/07/07).
Köller, O.: Kritik an PISA unberechtigt. Interview mit bildungsklick.de, 2006a. Online: http://bildungsklick.de/a/50155/kritik-an-pisa-unberechtigt (retr. 2007/07/07).
Köller, O.: Stellungnahme zum Text von Joachim Wuttke: Fehler, Verzerrungen, Unsicherheiten in der PISA-Auswertung. Press release of the National Institute for Educational Progress (IQB), 2006/11/14 (2006b).
Kraus, J.: Der PISA-Schwindel. Unsere Kinder sind besser als ihr Ruf. Wie Eltern und Schule Potentiale fördern können. Wien (Signum Verlag) 2005.
Ladenthin, V.: Bildung als Aufgabe der Gesellschaft. In: Studia Comeniana et Historica 34-2004-71/72, 305-319.
Lohmann, I.: After Neoliberalism. Können nationalstaatliche Bildungssysteme den 'freien Markt' überleben? 2001. Online: http://www.erzwiss.uni-hamburg.de/Personal/Lohmann/AfterNeo.htm (retr. 2007/07/07).
Lohmann, I.: Was bedeutet eigentlich "Humankapital"? GEW Bezirksverband Lüneburg und Universität Lüneburg: Der brauchbare Mensch. Bildung statt Nützlichkeitswahn. Bildungstage 2007. Online: http://www.erzwiss.uni-hamburg.de/Personal/Lohmann/Publik/Humankapital.pdf (retr. 2007/07/07).
Luhmann, H.-J.: Die Contergan-Katastrophe revisited – Ein Lehrstück vom Beitrag der Wissenschaft zur gesellschaftlichen Blindheit. In: Umweltmedizin in Forschung und Praxis 5-2000-5, 295-300.
Nash, R.: Is the School Composition Effect Real? A Discussion with Evidence from the UK PISA Data. In: School Effectiveness and School Improvement 14-2003-4, 441-457.
Neuwirth, E., Ponocny, I. & Grossmann, W. (eds.): PISA 2000 und PISA 2003. Graz (Leykam) 2006.
Olsen, R.V.: Achievement Tests from an Item Perspective. An Exploration of Single Item Data from the PISA and TIMSS Studies. Oslo (University of Oslo) 2005. Online: http://www.duo.uio.no/publ/realfag/2005/35342/Rolf_Olsen.pdf (retr. 2007/07/07).
Pekrun, R.: Vergleichende Evaluationsstudien zu Schülerleistungen: Konsequenzen für die Bildungsforschung. In: Zeitschrift für Pädagogik 48-2002-1, 111-128.
Prais, S.J.: Cautions on OECD's Recent Educational Survey (PISA). In: Oxford Review of Education 29-2003-2, 139-163.
Prais, S.J.: Cautions on OECD's recent educational survey (PISA): Rejoinder to OECD's response. In: Oxford Review of Education 30-2004-4.
Prenzel, M. & Walter, O.: Wie solide ist PISA? Oder: Ist die Kritik von Joachim Wuttke begründet? Kiel (IPN) 2006 (two pages, including a one-page attachment!).
Roeder, P.M.: TIMSS und PISA – Chancen eines neuen Anfangs in Bildungspolitik, -planung, -verwaltung und Unterricht. Endlich ein Schock mit Folgen? In: Zeitschrift für Pädagogik 49-2003-2, 180-197.
Romainville, M.: On the Appropriate Use of PISA. In: La Revue Nouvelle 2002-3/4.
Schleicher, A.: Interview mit der Frankfurter Rundschau. In: Frankfurter Rundschau, 28.11.2006.
Schulz, J.: Management von Risiko- und Krisenkommunikation – zur Bestandserhaltung und Anschlussfähigkeit von Kommunikationssystemen. Berlin (Humboldt-Universität) 2001.
Stack, M.: Testing, Testing, Read All About It: Canadian Press Coverage of the PISA Results. In: Canadian Journal of Education 29-2006-1, 49-69.
Steiner-Khamsi, G.: The Politics of League Tables. Online: http://www.sowi-onlinejournal.de/2003-1/tables_khamsi.htm (retr. 2007/07/07).
Weigel, T.M.: Die PISA-Studie im bildungspolitischen Diskurs. Eine Untersuchung der Reaktionen auf PISA in Deutschland und im Vereinigten Königreich. Diplomarbeit, Trier (Universität) 2004.
Wuttke, J.: Fehler, Verzerrungen, Unsicherheiten in der PISA-Auswertung. In: Jahnke, T. & Meyerhöfer, W. (eds.): PISA & Co. Kritik eines Programms. Hildesheim, Berlin (Franzbecker) 2006, 101-154.
Wuttke, J.: Uncertainties and Bias in PISA. In this volume.
What Does PISA Really Assess? What Does It Not? 1 2
A French View
Antoine Bodin 3
France: Université de Franche-Comté
Summary
This paper puts aside many important aspects of the PISA design to focus on the external validity of its mathematics questions. First, it seeks to position the PISA item contents against the French mathematics syllabus, trying to identify the overlap between the two. It then tries to compare the PISA mathematical cognitive demands and competency levels with those implied in some French assessment and examination settings. Underlining certain differences between the general PISA design and the French mathematics curriculum and school culture, it also tackles the epistemological and didactical validity issues raised by the PISA mathematics items.
Cet article laisse de côté de nombreux points importants des études PISA pour se centrer sur l'examen de la validité externe des questions du domaine mathématique.
1 This paper was partially presented in October 2005 at a French-Finnish conference jointly organized by the French and Finnish Mathematical Societies. A French-language version is available, as are two presentations used for the conference (also in English and in French; see addresses on the page entitled "References").
2 With many thanks to Rosalind Charnaux for her kind help and advice on this English version.
3 antoinebodin@mac.com; website: http://web.mac.com/antoinebodin/iWeb/Site_Antoine_Bodin/
Tout d’abord il cherche à situer les contenus mathématiques des questions<br />
par rapport au curriculum français, et essaie de quantifier le recouvrement par<br />
<strong>PISA</strong> de ce curriculum.<br />
Ensuite il tente de comparer la complexité cognitive des questions mathématiques<br />
de <strong>PISA</strong> avec celle des questions d’examens et d’évaluations<br />
courantes en France.<br />
Pointant des différences entre les conceptions liées aux études <strong>PISA</strong> et les<br />
attendus du curriculum mathématique et de la culture scolaire de notre pays,<br />
il soulève des questions relatives à la validité épistémologique et didactique de<br />
l’étude.<br />
Introduction
The PISA studies have been organised by the OECD which, as everyone knows, is an organisation devoted to world economic development. The main reason that led this organisation to undertake such a study lies in a strong belief that good education is the key to better development. We will examine in this paper neither the value of this belief nor the economic and political implications of the studies.
At the same time, we accept the idea that the PISA mathematics framework is consistent with the general PISA design, and that the mathematics test development has been carried out as faithfully and as accurately as possible (personally, I believe this is the case). There remains, however, the internal validity issue.
Plenty of documents about the PISA studies have been written and circulated all around the world, a certain number of them issued directly by the OECD and by the PISA consortium 4, and many others by officials, research teams and/or the media in the participating countries.
The information available is therefore rich and full of contrasts. Most of the documents are public, and the OECD has done its utmost to allow scholars and other interested persons to obtain complete access to PISA's general design as well as to its frameworks, complete database and international reports.
Far from producing only the flimsy yet exciting, and often denounced, results to which too much interest is generally paid, the PISA studies produce quality data of interest for a huge range of complementary studies, ranging from politics to didactics.
4 ACER – Melbourne – Australia
Many international and national analyses have been undertaken which try to draw from processed data (as well as from raw data) the information of interest to all kinds of people concerned by educational matters.
Meanwhile, not much effort has been made until now to examine the set of mathematics questions from an external point of view, to understand more fully what they really assess and to what degree they may be viewed as epistemologically and didactically consistent. Further research into these points would have implications for teaching and for teachers.
This paper seeks only to examine PISA's external validity. It is limited to the mathematics section and, narrower still, to a French point of view ('French' in the sense of being related to the French mathematics curriculum, customary French assessment settings, teacher beliefs, school culture, etc.).
Intended and implemented PISA assessment focus
First, it seems important to recall that PISA does not claim to assess the general quality of the educational systems examined. Regarding our topic, it does not pretend to assess general mathematical proficiency, but simply concentrates on what the OECD judges essential for the normal life of any citizen (the so-called 'mathematical literacy').
Let us quote the official report:
"PISA seeks to measure how well young adults, at age 15 and therefore approaching the end of compulsory schooling, are prepared to meet the challenges of today's knowledge societies. The assessment is forward-looking, focusing on young people's ability to use their knowledge and skills to meet real-life challenges, rather than merely on the extent to which they have mastered a specific school curriculum. This orientation reflects a change in the goals and objectives of curricula themselves, which are increasingly concerned with what students can do with what they learn at school, and not merely whether they can reproduce what they have learned." 5
At any rate, individual students who do not correctly answer the PISA mathematics questions seem doomed to a troubled life, and countries that do not perform well are viewed as doing a poor job of preparing their young people for the future.
Thus, while PISA does not assess the entire body of mathematical knowledge acquired in schools, it does test at least a part of this knowledge.
5 OECD (2004): Learning for Tomorrow’s World. First Results from PISA 2003, p. 20
24 ANTOINE BODIN
We will therefore first try to identify more clearly the part truly assessed by PISA and then relate this part to the entire French mathematics education offered to the country’s 15-year-olds. The relationship between this “literacy” part and the entire test is a problematic question, one that raises epistemologically and didactically complex issues.
However, first we must examine the way in which the PISA material is linked to the French mathematics curriculum.
A comparison of the PISA mathematics item content with the current French mathematical syllabus
For the moment, let us limit ourselves to the French syllabus, which most of the 15-year-old French students have studied. By this I mean the French “collège” syllabus from grade 6 to grade 9 (French “sixième” to “troisième”). At age 15, some French students attend high school (up to grade 11), while others are still lagging as far behind as grade 7, and yet a few others are in special education. However, on the whole, more than 85 % of the 15-year-olds have studied this syllabus 6.
The reader will find in annex 6 a presentation of this syllabus indicating the topics that have been addressed by at least one PISA 2003 mathematics question.
Annex 3 shows a list of analysed PISA questions.
Here we should recall that a certain number of the PISA questions have been kept secure for future use. In this paper I will quote only some of the released questions, although most of the questions used have nevertheless been taken into account in the analysis.
Finally, we find that the PISA questions cover about 15 % of the French syllabus studied by more than 85 % of the 15-year-old French students. This shows beyond any doubt the marginal focus of the PISA questions (but marginal does not mean unimportant!).
6 In fact, the 15-year-old official target is somewhat misleading. Let us quote the PISA technical report (page 46): “The 15-year-old international target population was slightly adapted to better fit the age structure of most of the northern hemisphere countries. As the majority of the testing was planned to occur in April, the international target population was consequently defined as all students aged 15 years and 3 (completed) months to 16 years and 2 (completed) months at the beginning of the assessment period.”
As a result, 59.1 % of the French students who took the tests were in high school, in grade 10 (or, for a few of them, grade 11).
WHAT DOES PISA REALLY ASSESS? WHAT DOES IT NOT? 25
At the same time those 15 % represent only about 75 % of the PISA mathematics items. This means that about 25 % of the PISA items do not fit into the French curriculum. This is not only the case for many items in the field of uncertainty, but also for items not directly linked to our current curriculum (such as some combined items).
But an assessment setting can never completely cover 100 % of any curriculum. In order to explore further, we found it useful to compare the PISA material with some customary French examinations.
A comparison of the PISA mathematics item content with some French examination and assessment settings at the 15-year-old level
Comparison with the grade 9 national examination
We chose to analyse in the same way several forms of the mathematics paper of the national examination taken by all students at the end of French middle school (grade 9).
Annex 6 shows the corresponding curriculum coverage for one of these forms, while the corresponding examination paper is displayed in annex 5, with an analysis chart appearing in annex 4.
Here we found that this particular “Brevet” examination form covers about 35 % of the French syllabus presented above.
In addition, the entire set of PISA 2003 questions was planned for approximately 210 minutes of testing time, while each “Brevet” paper takes just 120 minutes. As different “Brevet” forms address different parts of the syllabus, we can estimate that, in the “Brevet” context, 210 minutes of testing time might cover more than 50 % of the French syllabus.
What is more striking is that PISA’s coverage focuses more on the syllabus for grades 6 and 7, while the “Brevet” mainly covers the syllabus for grades 8 and 9 (which to a certain extent contains and extends the previous syllabus).
But the Brevet examination is a poor illustration of the entire French curriculum (“programmes et instructions officielles”) as well as of teachers’ aims and teaching practices. The “Brevet” is well known for shrinking the objectives, and preparing for the “Brevet” is not viewed as a good way to prepare for further high school studies.
The EVAPM studies
In the following sections I will refer to a series of large-scale studies organised in the “EVAPM Observatory”.
EVAPM is a 20-year-long research project conducted by the Mathematics Teacher Association (APMEP) and the National Institute for Pedagogical Research (INRP) to follow the evolution of the French mathematics curriculum (and especially the attained curriculum) from grade 6 to grade 12.
Being strongly linked to the teachers, and involving them in the test development process, the EVAPM studies obviously reflect their authors’ beliefs and intentions. As the students are not individually assessed, there is no problem in checking competencies known to be only at the beginning of their development. In other words, the EVAPM questions are not limited by social expectations or political exploitation, as is the case in the national exams. That could have been the case with the PISA questions; obviously, it is not.
In recent EVAPM studies there was strong teacher resistance when we tried to introduce some PISA items. Most of the items were considered inappropriate to the curriculum, and many of them were considered culturally biased.
It is not relevant to mention curriculum coverage here, as the EVAPM studies tend to be comprehensive (100 % coverage).
In this paper, we will use the EVAPM studies to compare the cognitive demands of PISA with actual French curriculum expectations (at least as viewed by French teachers).
Comparison of cognitive demands
In order to compare the cognitive demands of mathematics assessment items, we will use a cognitive taxonomy whose main categories are the following:
– A Knowing and recognising . . .
– B Understanding . . .
– C Applying . . .
– D Creating . . .
– E Evaluating . . .
See annex 1 for a first expansion of this taxonomy.
The following chart displays the PISA levels of cognitive demand alongside those of the “Brevet” examination paper already examined.
The difference is most striking: the “Brevet” mostly addresses the recognition level, and even the classification of some items at Level C (application) might be questioned (most of them are routine procedures that might have been classified at Level A).
Without doubt, the taxonomic range of the PISA items is much more balanced than that of the French examination 7.
Figure 1
However, as we have already noted, the “Brevet” does not correctly reflect the actual French curriculum.
The following chart (figure 2) adds the classifications obtained for two EVAPM studies (grade 10 – 2003 and grade 6 – 2005).
Here, the balance across levels is closer to PISA’s, at least at the same age level (grade 10).
The chart (figure 2) seems to indicate that French teachers would be keen to evolve towards a more PISA-like assessment practice. The EVAPM studies have shown that most French teachers are quite torn between the need to prepare their students for formal exams like the “Brevet” presented in this paper and their conception of what a good mathematics education should include (we also know that the conflict between exams and education is not unique to the French!).
7 Renovation of the “Brevet” is on the agenda. Perhaps PISA will help speed along this process?
Figure 2
Comparison of implied range of competencies
PISA makes use of a three-tiered competency classification:
– Class 1: Reproduction: “. . . consists of simple computations or definitions of the type most familiar in conventional mathematics assessments”.
– Class 2: Connection: “. . . requires connections to be made in order to solve straightforward problems”.
– Class 3: Reflection: “. . . consists of mathematical thinking, generalisation and insight, and requires students to engage in analysis, identify the mathematical elements in a situation and pose their own problems”.
See annex 5 for more details 8.
The following chart (figure 3) displays the competency levels of the PISA items alongside those of the “Brevet”.
PISA puts more than 70 % of the emphasis on Levels 2 and 3, while the “Brevet” exam puts less than 15 % on those levels.
Here again, we can examine some EVAPM assessment settings.
The chart (figure 4) again shows a balance much closer to PISA for the EVAPM studies than for the national examination.
8 Note that for EVAPM we use a competency classification originating in Aline Robert’s work (see references), which, while based on other assumptions than the PISA classification, provides about the same partition.
Figure 3
Figure 4
Towards some epistemological analysis
About Finnish and French differences
With regard to this paper, we had a special interest in the differences between the Finnish and French results. The overall results on the global scale (511 for France, 548 for Finland) hide the fact that this gap amounts to a difference of 0.33 standard deviations on the standard normal distribution, and at the point where the probability density is at its maximum. In France not many people know this fact, and still fewer understand it.
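To make the size of this gap more concrete, the 0.33 standard-deviation difference can be translated into two intuitive quantities. The sketch below is illustrative only: it assumes both national score distributions are roughly normal with equal spread, which is a simplifying assumption, not a PISA result.

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function, via erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

d = 0.33  # the effect size quoted in the text (Finland vs. France)

# Probability that a randomly chosen Finnish student outscores a randomly
# chosen French student: Phi(d / sqrt(2)) under the normality assumption.
p_superiority = normal_cdf(d / math.sqrt(2.0))

# Share of the French distribution lying below the Finnish mean: Phi(d).
share_below_finnish_mean = normal_cdf(d)

print(f"P(Finnish > French)          = {p_superiority:.2f}")
print(f"French below the Finnish mean = {share_below_finnish_mean:.2f}")
```

Under these assumptions, a 0.33 standard-deviation gap still leaves substantial overlap: roughly four French students in ten outscore a randomly chosen Finnish student.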
Looking at the subscales of the study (quantity, change and relationships, space and shape, uncertainty) sheds no supplementary light. In order to help understand the observed differences, it is essential to turn to the items themselves and to the percentages of success in each country (or, for other approaches, for each of the subgroups investigated 9).
First, let us say that this examination confirms the better Finnish results – it is only the magnitude of the differences and its meaning that can be questioned.
Regarding the magnitude, let us say that, depending on the items being examined, the differences in success rates range from +30 % to the Finnish advantage to +25 % to the French advantage, the average of the differences being 3.5 % to the Finnish advantage. 10
We observe that the differences are more important in favour of the Finnish students for the more “realistic” items, and that the differences tend to turn in favour of the French students for more abstract or formal items (compare, for instance, below, the results of “Apples Item 1” with the results of “Apples Item 3”; but the case seems general).
It is important to note that the difference in results between Finland and France would totally disappear if the 10 % least successful French students (the first 10 percent) were set aside.
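The mechanics of setting aside the weakest decile can be illustrated with synthetic data. The samples below are invented for the sketch and are not the real PISA data; the “France-like” sample is simply given a heavier low-scoring tail, mimicking the larger share of weak French students discussed in the text.

```python
import random
import statistics

def mean_without_bottom(scores, fraction):
    """Mean after discarding the lowest `fraction` of the scores."""
    cutoff = int(len(scores) * fraction)
    return statistics.fmean(sorted(scores)[cutoff:])

random.seed(0)
# Invented, illustrative samples -- NOT the actual PISA distributions.
finland_like = [random.gauss(548, 80) for _ in range(20_000)]
france_like = ([random.gauss(530, 80) for _ in range(18_000)]    # main body
               + [random.gauss(380, 60) for _ in range(2_000)])  # heavy low tail

gap_full = statistics.fmean(finland_like) - statistics.fmean(france_like)
gap_trimmed = statistics.fmean(finland_like) - mean_without_bottom(france_like, 0.10)

print(round(gap_full), round(gap_trimmed))
```

Trimming the lowest 10 % of the left-heavy sample closes most of the synthetic gap, which is the mechanism behind the observation in the text; with the real data the author reports that the gap disappears entirely.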
In fact, in the case of the Finnish students only 7 % of the age group score at Level 1 or below (on a proficiency scale ranging from 1 to 6), while 17 % of the French students fall into those categories. This confirms the fact that France does not succeed well in providing mathematical education for all (a fact already strongly confirmed by the TIMSS studies).
The other end of the scale (Level 6) concerns 7 % of the Finnish students, but only 3 % of the French ones. This fact may be less worrying than the one concerning the low levels. Let us remember here that PISA addresses only literacy and does not claim to assess general mathematical competency.
9 We do not address the gender question in this paper, but our analysis points out a certain amount of gender bias, at least for some countries. As the overall results are weaker for girls than for boys in all countries but two, the question invites more examination. Other subgroups might also be worth scrutinising.
10 This is only a rough estimate – only 41 items have been accounted for.
Nevertheless, this casts doubt on the assumption that French mathematics education is of high quality as demonstrated by its best students.
The PISA questions presented below (exclusively those that have been released) are displayed with results for France, Finland and all OECD countries, together with the highest and lowest observed results (OECD and all participating countries). 11
Mathematics?
The mathematical field may be extended or restricted according to different conceptions. Some PISA mathematics questions puzzle many French mathematics teachers: they do not recognise in them the mathematics they are striving to teach. At the same time they recognise the social usefulness of the knowledge these questions imply. The same applies to mathematicians: the place of many PISA mathematics questions within theoretical mathematical constructions is not obvious to them.
Quantity, change and relationships, space and shape, as well as uncertainty are not only modelled in mathematical theories, but are also used in common situations, using common sense and common language.
In its endeavour to stick to real life, PISA could not help using everyday language to present its questions. In some cases, understanding a text which is in no way a mathematical text is the main difficulty students have to face. Certainly, this is also part of the mathematical process, but the true mathematical work begins once the problem is fully understood. Here the “devolution” process is not controlled, and it is never certain whether it is the “dressing up” or the wording that prevents students from solving the problem, or the degree of mathematical difficulty of the problem itself. These mathematical difficulties often appear trivial when compared with the structural and semantic complexity of the questions.
The strong correlation observed between individual results in reading literacy and mathematical literacy (r = 0.77) perfectly illustrates this point. This correlation is smaller than the correlations observed among the four PISA mathematical domains at the international level (which range from 0.89 to 0.92), but much higher than what is generally observed in France (EVAPM studies) between students’ results in different mathematical domains (algebra, geometry, calculus, statistics), which usually lie in the interval [0.35; 0.60]. All this leads one to think that the PISA mathematics questions may all assess a general ability to read a text, to move between textual information, iconic information and other clues indirectly given by the question’s context, and to process on the basis of this information. We could also invoke here the well-known “factor g”.
11 A more complete presentation may be downloaded from my website.
Numbers, quantities, etc. also appear in the PISA reading questions, in the science questions and in the problem-solving questions. It is not always obvious whether a PISA question should be allocated to one branch of the study rather than to another. In particular, some problem-solving questions could be analysed and grouped with the set of mathematics questions.
Let us now examine some typical questions.
The Apples Example
This question is typical of realistic mathematics (and authentic assessment), which the OECD seeks both to assess and to promote. In this context, a good question must open up the process described in the framework:
a) Starting with a reality-based problem.
b) Organising it according to mathematical concepts.
c) Gradually trimming away the reality through processes such as making assumptions about which features of the problem are important, then generalising and formalising, transforming the problem into a mathematical problem that closely represents the situation.
d) Solving the mathematical problem.
e) Making sense of the mathematical solution in terms of the actual situation.
For Item 1 the main point is to understand the situation and subsequently be able to extrapolate a pattern. This may be accomplished by merely counting the first four lines in the chart. For the fifth example the student can either extend the drawing and then count, or identify a number pattern in the completed chart.
The 10 % difference between French and Finnish students illustrates the French students’ relative lack of confidence or lack of initiative. They do not have a mathematical procedure at hand to treat the question, and this lack hinders a certain percentage of them from solving the problem.
Conversely, French students who overcome this initial difficulty perform much better on the second item than their Finnish counterparts (26 % to 21 % for the entire population, but 62 % to 38 % for those who successfully completed Item 1). This also seems to be rather general.
For this item, the mathematical process is quite obvious and leads to an equation to be solved: n² = 8n.
French students are used to solving this type of equation (though often in a formal, non-realistic context). We may even suppose that many of them have used a correct mathematical method: by this I mean factorising, n(n − 8) = 0, finding the two values 0 and 8, and then and only then (point e above) eliminating the value 0 and retaining the value 8.
However, some students (in France as well as in Finland) may have obtained the correct answer just by making the invalid simplification n² = 8n, hence n = 8 (that is, n · n = 8 · n, so n = 8, cancelling n without ruling out n = 0).
Another procedural possibility consists of extending the chart until n = 8.
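The chart-extension procedure can be sketched directly, assuming the configuration of the released Apples item (n² apple trees in a square, surrounded by 8n conifers):

```python
# Apples item: a square of n x n apple trees is surrounded by 8n conifers.
# Extending the chart row by row and looking for equal counts reproduces
# the procedural (non-algebraic) solution of n^2 = 8n.
chart = [(n, n * n, 8 * n) for n in range(1, 11)]
equal_sizes = [n for n, apples, conifers in chart if apples == conifers]
print(equal_sizes)  # -> [8]
```

Note that the tabulation quietly sidesteps the n = 0 root that the algebraic route must explicitly discard.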
These procedures (at least one of which is mathematically incorrect) and others have been considered correct (full or partial credit!). This raises an epistemological issue: which kind of mathematics is at stake? What is valued?
Let us be clear on this point: it is not our purpose to deny the interest of the question, nor its relevance in a mathematical test, nor even the legitimacy of building scales which may be of some use to policymakers. What is raised here is the need for complementary qualitative studies, which could analyse students’ procedures more deeply from a mathematical point of view.
Item 3 requires comparing two rates of variation. In this instance it may lead to comparing the growth of the functions f, with f(n) = n², and g, with g(n) = 8n, through their derivatives and, finally, to comparing the second derivatives.
Here again students are not supposed to know derivatives; they should just have a sound and personal approach to the question. Several procedures are possible that have different mathematical values, but are credited the same.
Note that the question is by no means trivial, and it is not too surprising that so few students across the world are able to cope with it.
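The discrete analogue of the derivative comparison — first differences of f(n) = n² and g(n) = 8n — can be tabulated without any calculus; the sketch below mirrors how a student without derivatives might reason:

```python
def f(n):  # apple trees
    return n * n

def g(n):  # conifers
    return 8 * n

# First differences play the role of derivatives on integer inputs:
# f grows by 2n + 1 at step n, while g grows by a constant 8.
f_growth = [f(n + 1) - f(n) for n in range(1, 9)]
g_growth = [g(n + 1) - g(n) for n in range(1, 9)]

# The apple-tree count starts growing faster once 2n + 1 > 8, i.e. n >= 4.
first_faster = next(n for n in range(1, 9)
                    if f(n + 1) - f(n) > g(n + 1) - g(n))
print(f_growth)      # -> [3, 5, 7, 9, 11, 13, 15, 17]
print(g_growth)      # -> [8, 8, 8, 8, 8, 8, 8, 8]
print(first_faster)  # -> 4
```

The constant second difference of f (always 2) versus the zero second difference of g is the discrete counterpart of the second-derivative comparison mentioned above.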
The apples question has been used in an EVAPM study at the tenth-grade level. The results of this setting also appear in the rectangles.
The 6 % success rate (France) and the 4 % success rate (Finland) concern only a correct mathematical procedure. Those rates have to be compared with the 11 % obtained in Japan and also with the 11 % obtained by EVAPM in France at grade 10.
For all countries but one, the Item 3 success rate ranges from 2 % to 12 %. The only exception (Korea at 24 %) deserves further examination.
There is also an interesting point coming out of international studies (similar to TIMSS): real mathematical difficulties, meaning difficulties linked to the concepts and not only to the presentation or the wording, seem to be experienced in the same way all over the world.
The Bookshelves Example
The following question is typically a case of a question not fitting the current French mathematics curriculum; more precisely, it would be considered more appropriate at the primary school level.
At the same time, everyone in France (and especially French mathematics teachers) would expect 15-year-old students to be able to solve this problem. The success rate is more than 10 % higher in Finland than in France, which illustrates what has been said about realistic questions.
But is it a mathematical question? Or should any question using numbers be considered a mathematical question? In some countries (especially in France) this question would more likely be asked in the technological subject area.
A mathematical solution could be:
N = Min(⌊26/4⌋; ⌊33/6⌋; ⌊200/12⌋; ⌊20/2⌋; ⌊510/14⌋)
where N is the maximum number of bookshelves the carpenter can make, and where ⌊x⌋ stands for the integer part of x.
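The arithmetic of this formula can be checked mechanically. The component labels below are assumptions taken from our reading of the released Bookshelves item; the quantities are those of the formula in the text.

```python
# One set of bookshelves requires the components in `needed`; the
# carpenter's stock is in `stock` (figures from the formula in the text).
needed = {"long panels": 4, "short panels": 6, "small clips": 12,
          "large clips": 2, "screws": 14}
stock = {"long panels": 26, "short panels": 33, "small clips": 200,
         "large clips": 20, "screws": 510}

# N = Min over components of floor(stock / needed); // is floor division.
per_component = {part: stock[part] // needed[part] for part in needed}
n_sets = min(per_component.values())

print(per_component)
print(n_sets)  # -> 5
```

The binding constraint is the component allowing only five sets, which is exactly the observation a trial-and-error solver arrives at without ever writing the formula.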
Once again, students are not expected to write this complex formula. In fact, they proceed by trial and error. Meanwhile, if they had to prove their result, they would be forced to write down in everyday language the content as well as the meaning of this formula, which would be even more difficult than writing the symbolic formula.
Fortunately for PISA, no student thinks of using such a formula (nor would we, other than for this paper!), so the international results are quite high, ranging mostly from 50 % to 70 %.
But is it still mathematics? Can these kinds of realistic questions be a good preparation for more abstract mathematics? As many educational systems tend to ask teachers to stress realistic mathematics, the question is surely worth raising.
This question, along with many others, points out the weak stress PISA puts on proof (and what is mathematics without proof?). Even explaining and justifying are not much valued by the PISA marking scheme. This makes a great difference with the usual French conception of mathematical achievement.
The idea of proof is not the only main feature of mathematics that is quite absent from the PISA questions. There is a lack of any symbolism, and what is sometimes labelled as algebra (especially in national reports) is usually no more than the use of letters as substitutes for numbers, without any perspective of using them in direct computations. The PISA design insists on real life, the concrete aspects of mathematics. It thus consciously misses several fundamental aspects of the mathematical world.
Toward didactical analysis
The preceding remarks lead directly to a central didactical question: which sequence of teaching situations can help students gain proficiency both in mathematical literacy (partly common-sense knowledge) and in abstract and symbolic mathematics?
Some people would assume that the question is not relevant and that there is a continuum from common-sense knowledge to theoretical knowledge.
On the contrary, we think that all the work of the so-called “French didactics school” has led us to think that ruptures are necessary and constitutive of learning. So we may fear that putting too much stress on real life and concrete situations may in return have some negative effects.
Here is an example.
The Coloured Candies Question
This question belongs to the uncertainty field, and a “probability” value is requested. Probability is not part of the curriculum followed by 98 % of French students at age 15; nevertheless, they perform at the same level as other OECD students.
We obtained the same kind of result with TIMSS at age 13. While probability was not in the curriculum, French students performed better than students in countries where probability was considered part of the curriculum. Other observations (EVAPM) show that when students have been introduced to probability concepts (at least at the outset), they find it more difficult to answer this kind of question than when they have not been taught the subject.
Once again, we can speak of common knowledge: understanding the diagram, counting the total number of candies (30), noting that 6 of them are red, and finally interpreting the 6 chances out of 30 as a probability value.
These are common language and preconceptions about a mathematical concept. Stressing this kind of task, particularly in an MCQ format, and allowing students (and many others) to think they have acquired some knowledge of probability may well lead to serious misunderstandings.
Many other questions deserve this kind of examination.
A good example is given by a question that we are not allowed to display here (an unreleased question). The only point of this question is to identify an oblique line as being longer than a perpendicular one. Everybody feels this and can use this fact, even if they do not formally know it, and especially if they have not been taught it. Even dogs behave as if they knew this fact.
The amusing point is that this question has been identified, at least in the French official report, as assessing the Pythagorean theorem! This reflects a widespread confusion between the fact that common sense may be mathematized and integrated into mathematical theories and the idea that students’ ability to make good use of this common sense proves something about their theoretical knowledge.
Some conclusions
PISA has gathered a huge amount of quality data across countries, which opens the way for further research. Aside from edumetric studies focusing on marks and scales, there is room for many interesting qualitative studies (more precisely, for studies articulating quantitative and qualitative approaches).
A large amount of resources has been put into the PISA studies, along with a great variety of commitment and expertise, and it would be disappointing if students were not the primary beneficiaries of these contributions.
In this paper we have attempted <strong>to</strong> demonstrate that certain precautions<br />
should be taken when interpreting and using the <strong>PISA</strong> results, at least in mathematics.<br />
Moreover, on the whole, the <strong>PISA</strong> studies are worth being taken seriously.<br />
They can bring new questions and new ideas <strong>to</strong> teachers that can help<br />
them <strong>to</strong> go ahead with a way of teaching that fits the needs of our societies as<br />
well as preserving the values of which they are conveyors. This balance is difficult<br />
<strong>to</strong> obtain; however, weak, flawed or biased interpretations of the general<br />
<strong>PISA</strong> implications and results will not help.<br />
This paper is particularly aimed at attracting scholars’ attention and justifying<br />
the idea that some complementary studies should be undertaken by and<br />
within research in the mathematics education community (and not, as is often<br />
the case, only processed and interpreted by officials strictly controlled by<br />
political bodies).<br />
The <strong>PISA</strong> studies may help scholars in different countries distance themselves<br />
from their national or regional places of origin and acquire a more comprehensive<br />
understanding of the teaching and acquisition of mathematics for<br />
future citizens, consumers and <strong>–</strong> above all <strong>–</strong> for the advancement of mankind.<br />
References

Adams, R.J.: 2003, Response to "Cautions on OECD's Recent Educational Survey (PISA)". Oxford Review of Education, 29(3).

Anderson, W. A.: 2001, A taxonomy for learning, teaching, and assessing: a revision of Bloom's taxonomy of educational objectives. Longman.

Bodin, A. & Capponi, B.: 1996, Junior secondary school practices. In A. Bishop & C. Laborde (eds), International Handbook of Mathematics Education, Chapter 15 (Teaching and Learning Mathematics), pp. 565–613. Kluwer Academic Publishers, Dordrecht.

Bodin, A.: 1997, L'évaluation du savoir mathématique – Questions et méthodes. Recherches en Didactique des Mathématiques, Éditions La Pensée Sauvage, Grenoble.

Bodin, A.: 2003, Comment classer les questions de mathématiques? Communication au colloque international du Kangourou, Paris, 7 novembre 2003. Article à paraître.

Bodin, A., Straesser, R. & Villani, V.: 2001, Niveaux de référence pour l'enseignement des mathématiques en Europe – Rapport international / Reference Levels in School Mathematics Education in Europe – International Report.

Bottani, N. & Vrignaud, P.: 2005, La France et les évaluations internationales. Haut Conseil de l'Évaluation de l'École.

Clarke, D.: 2003, International comparative research in mathematics education: of what, by whom, for what, and how. Second International Handbook of Mathematics Education, Kluwer Academic Publishers.

Cytermann, J.R. & Demeuse, M.: 2005, La lecture des indicateurs internationaux en France. Haut Conseil de l'Évaluation de l'École.

Demonty, I. & Fagnant, A.: 2004, Évaluation de la culture mathématique des jeunes de 15 ans (PISA). Ministère de la Communauté Française, Bruxelles.

Dupé, C. & Olivier, Y.: 2005, Ce que l'évaluation PISA 2003 peut nous apprendre. Bulletin de l'APMEP, n° 460, octobre 2005.

French Ministry of Education: 2007, L'évaluation internationale PISA 2003 . . . Dossier n° 180 de la Direction de l'Évaluation, de la Prospective et de la Performance (DEPP).

Freudenthal, H.: 1975, Pupils' achievements internationally compared – the IEA. Educational Studies in Mathematics, 1975.

Gras, R.: 1977, Contributions à l'étude expérimentale et à l'analyse de certaines acquisitions cognitives et de certains objectifs didactiques en mathématiques. Thèse, Université de Rennes.

Lemke, M., Sen, A., Pahlke, E., Partelow, L., Miller, D., Williams, T., Kastberg, D. & Jocelyn, L.: 2004, International Outcomes of Learning in Mathematics Literacy and Problem Solving: PISA 2003 Results From the U.S. Perspective (NCES 2005–003). Washington, DC: U.S. Department of Education, National Center for Education Statistics.

Lie, S. et al.: 2003, Northern Lights on PISA. Unity and diversity in the Nordic countries in PISA 2000. University of Oslo, Norway.

Meuret, D.: 2003, Considérations sur la confiance que l'on peut faire à PISA 2000. Intervention au colloque international de l'Agence Nationale de Lutte Contre l'Illettrisme sur l'évaluation des bas niveaux de compétences, Lyon, 5 novembre 2003.

Meuret, D.: 2003, Pourquoi les jeunes Français ont-ils à 15 ans des performances inférieures à celles des jeunes d'autres pays? Revue Française de Pédagogie, n° 142, 89–104.

Note DPD 04.12 (décembre): Les élèves de 15 ans – Premiers résultats de l'évaluation internationale PISA 2003.

OECD: 2004, Problem Solving for Tomorrow's World: First Measures of Cross-Curricular Competencies from PISA 2003.

OECD: 2004, Technical Report.

OECD: 2004, First Results from PISA 2003. Executive Summary.

OECD: 2004, Learning for Tomorrow's World: First Results from PISA 2003.

OECD: 2004, PISA 2003 Assessment Framework – Mathematics, Reading, Science and Problem Solving Knowledge and Skills.

Orivel, F.: 2003, De l'intérêt des comparaisons internationales en éducation.

Robert, A.: 2003, Tâches mathématiques et activités des élèves: une discussion sur le jeu des adaptations introduites au démarrage des exercices cherchés en classe de collège. Petit x, n° 62.

Varcher, P.: 2002, Évaluation des systèmes éducatifs par des batteries d'indicateurs du type PISA: vers une régression des pratiques d'évaluation dans les classes.
Addresses and contacts

APMEP, with access to EVAPM documents as well as to a show displaying the released PISA questions with some results:
http://www.apmep.asso.fr/spip.php?rubrique114 (presentations in English and in French)

Reference Levels in School Mathematics Education in Europe:
http://www-irem.univ-fcomte.fr/Presentation_ref_levels.HTM and
http://www.emis.de/projects/Ref/

IREM de Franche-Comté: http://www-irem.univ-fcomte.fr/

French official reports: http://www.educ-eval.education.fr/pisa2003.htm

International frameworks and reports: http://www.pisa.oecd.org/

Antoine Bodin's personal website:
http://web.mac.com/antoinebodin/iWeb/Site_Antoine_Bodin

Email address: bodin.antoine@nerim.fr
ANNEXES

Annexe 1: Taxonomy of cognitive demands for designing and analysing mathematical tasks – ordered by integrated level of complexity

Simplified version – see the complete taxonomy on the Web (in French)

A – Knowing and recognising . . .
    A1 Facts; A2 Vocabulary; A3 Tools; A4 Procedures

B – Understanding . . .
    B1 Facts; B2 Vocabulary; B3 Tools; B4 Procedures; B5 Relations; B6 Situations

C – Applying . . .
    C1 in simple familiar contexts; C2 in moderately complex familiar contexts; C3 in complex familiar contexts

D – Creating . . .
    D1 as mobilizing known mathematical tools and procedures in new situations; D2 new ideas; D3 personal tools or procedures

E – Evaluating . . .
    E1 as issuing judgements about external productions; E2 as assessing one's own knowledge, process and results

Taxonomy designed by Antoine Bodin, with full acknowledgment of R. Gras' seminal work as well as of W. A. Anderson's later influence.
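For readers who code items, the simplified taxonomy lends itself to a lookup table. The rendering below is a hypothetical illustration (the dict structure and the `describe` helper are not part of the annexe); the category letters and sub-codes follow the table above:

```python
# A hypothetical encoding of the simplified Bodin taxonomy as a lookup table,
# e.g. for coding tasks in an item bank. Codes and labels follow the annexe;
# the data structure itself is an assumption for illustration.
taxonomy = {
    "A": ("Knowing and recognising",
          {"A1": "Facts", "A2": "Vocabulary", "A3": "Tools", "A4": "Procedures"}),
    "B": ("Understanding",
          {"B1": "Facts", "B2": "Vocabulary", "B3": "Tools",
           "B4": "Procedures", "B5": "Relations", "B6": "Situations"}),
    "C": ("Applying",
          {"C1": "in simple familiar contexts",
           "C2": "in moderately complex familiar contexts",
           "C3": "in complex familiar contexts"}),
    "D": ("Creating",
          {"D1": "mobilizing known tools and procedures in new situations",
           "D2": "new ideas", "D3": "personal tools or procedures"}),
    "E": ("Evaluating",
          {"E1": "issuing judgements about external productions",
           "E2": "assessing one's own knowledge, process and results"}),
}

def describe(code: str) -> str:
    """Expand a sub-category code such as 'B5' into a readable label."""
    main, subs = taxonomy[code[0]]
    return f"{main}: {subs[code]}"

print(describe("B5"))  # Understanding: Relations
```

Such a table makes the codings in Annexe 3 mechanically checkable (every Taxo entry there is a key of one of the sub-dictionaries).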
Annexe 2: Competency classes for designing and analysing mathematical tasks – ordered by integrated level of complexity

Simplified version – see the expanded version in OECD documents on the Web

Level | OECD definition | 
1 Reproduction | "The competencies in this cluster essentially involve reproduction of practised knowledge . . ." | Reproduction
2 Connection | "The connection cluster builds on the reproduction cluster competencies in taking problem solving to situations that are not simply routine, but still involve familiar or quasi-familiar settings." | Simple mathematisation
3 Reflection | "The competencies in this cluster include an element of reflectiveness . . . about the processes needed or used to solve a problem. They relate to students' abilities to plan solution strategies and implement them in problem settings that contain more elements and may be more 'original' (or unfamiliar) than those in the connection cluster . . ." | Complex mathematisation (to modelisation)
Annexe 3: PISA 2003 and 2000 – Analysed Question Set

Along with some other non-released questions taken into account for this paper, the whole analysis covers about 70% of the PISA material (60/85).

PISA code | Item name | Mathematical content | Taxo | C | Remarks
M037Q01 | Farms 1 | Pyramid – square area | B6 | 1 | PISA2000 only
M037Q02 | Farms 2 | Middle of the sides of a triangle | C1 | 2 | PISA2000 only
M124Q01 | Walking 1 | Using letters and formula | C1 | 2 | & PISA2000
M124Q02 | Walking 2 | Using letters and formula – Units . . . | B5 | 2 | & PISA2000
M136Q01 | Apple 1 | Completing charts | B6 | 3 | & PISA2000 & EVAPM
M136Q02 | Apple 2 | Equation | C1 | 2 | & PISA2000 & EVAPM
M136Q03 | Apple 3 | Don't fit | D1 | 3 | & PISA2000 & EVAPM
M145Q01 | Cubes | Cube | B5 | 2 | & PISA2000
M148Q02 | Continent area | Area | D1 | 3 | PISA2000 only & EVAPM
M150Q01 | Growing up 1 | Reading graphs | B5 | 2 | & PISA2000
M150Q02 | Growing up 2 | Reading graphs | B5 | 1 | & PISA2000
M150Q03 | Growing up 3 | Reading graphs | B5 | 1 | & PISA2000 – Gender bias?
M155Q02 | Number cube | Cube | B5 | 2 | & EVAPM
M159Q01 | Speed of a car 1 | Interpreting graph | B6 | 2 | PISA2000 only
M159Q02 | Speed of a car 2 | Reading graph | A3 | 1 | PISA2000 only
M159Q03 | Speed of a car 3 | Interpreting graph | B3 | 1 | PISA2000 only
M159Q04 | Speed of a car 4 | Interpreting graph | D1 | 2 | PISA2000 only
M161Q01 | Triangles | Constructing geometrical figures | B5 | 1 | PISA2000 only
M179Q01 | Robberies | Bar charts | E1 | 3 | & TIMSS & PISA2000 & EVAPM
M266Q01 | Carpenter | Perimeter of a rectangle | D1 | 2 | & PISA2000 – Gender bias?
M402Q01 | Internet relay chat 1 | Don't fit | D1 | 2 | Gender bias?
M402Q02 | Internet relay chat 2 | Don't fit | D1 | 3 | Gender bias?
M413Q01 | Exchange rate 1 | Proportionality | C1 | 2 |
M413Q02 | Exchange rate 2 | Proportionality | A4 | 1 |
M413Q03 | Exchange rate 3 | Proportionality | C1 | 2 |
M438Q01 | Export – 1 | Bar charts | A3 | 1 |
M438Q02 | Export – 2 | Circle charts – Percentage | C1 | 1 |
M467Q01 | Coloured candies | Don't fit | C1 | 1 | Probability
M468Q01 | Science test | Mean | C1 | 2 |
M484Q01 | Bookshelves | Don't fit | D1 | 2 | & EVAPM – Gender bias?
M505Q01 | Litter | Bar charts | B6 | 2 | ? Huge diff FRA–FIN
M509Q01 | Earthquake | Don't fit | B5 | 2 | Probability
M510Q01 | Choice | Don't fit | D1 | 3 | Combinatory – translation pb
M513Q01 | Test scores | Bar graph | | |
M520Q01 | Skateboard 1 | Don't fit | C1 | 2 | EVAPM
M520Q02 | Skateboard 2 | Don't fit | C1 | 2 |
M520Q03 | Skateboard 3 | Don't fit | D1 | 3 |
M547Q01 | Staircase | Division | A4 | 1 |
M555Q02 | Number cubes | Cube | B5 | 2 |
M702Q01 | Support for president | Don't fit | B6 | 2 |
M704Q01 | Best car 1 | Reading charts | C1 | 2 |
M704Q02 | Best car 2 | Reading charts | D1 | 3 |
M806Q01 | Step pattern | Don't fit | A1 | 1 |

PISA 2003: 85 items, of which 31 released
PISA 2000: 32 items, of which 11 released
Annexe 4: A typical mathematical examination at the final year of middle school

(Each entry gives: question – taxonomy level – competency class – remark.)

Part I – Numerical activities
Numbers, Ex 1: 1) A4 1 Formal and unrealistic; 2) A4 1 id.; 3) A4 1 id.; 4) A4 1 id.
Data, Ex 2: 1) B5 1 Pseudo-realistic; 2) A4 1 id.; 3) C1 1 id.; 4) A2 1 id.
Numbers, Ex 3: 1)a A2 1 Formal and unrealistic; 1)b A4 1 id.; 1)c A4 1 id.; 1)d A4 1 id.
Numbers – Arithmetic, Ex 4: 1) C1 1 Formal and unrealistic; 2) C1 1 id.; 3) C1 1 id.

Part II – Geometrical activities
Space geometry, Ex 1: 1)a B1 1 Unrealistic; 1)b B5 1 id.; 2)a A4 1 id.; 2)b B5 1 id.; 3) A4 1 id.; 4) A4 1 id.
Plane geometry – Proof – Thales, Ex 2: 1) C1 2 Formal and unrealistic; 2) C1 2 id.
Plane geometry – Proof – Pythagoras, Ex 3: 1) A4 1 Formal and unrealistic; 2) A4 1 id.; 3)i A4 1 id.; 3)ii A4 1 id.

Part III – Problem
Geometry – Pythagoras – Trigonometry, Part I: 1) A4 1 Pseudo-realistic dressing; 2)i A4 1 id.; 2)ii A4 1 id.; 3) A4 1 id.
Linear functions – inequations, Part II: 1)a.i A3 1 id.; 1)a.ii A3 1 id.; 1)b A3 1 id.; 2)a A2 1 id.; 2)b C1 2 id.; 3)a B5 1 id.; 3)b A4 1 id.; 3)c A4 1 id.
Scale, area – volume, Part III: 1) A2 1 id.; 2) A4 1 id.; 3) C1 2 id.; 4)i C1 2 id.; 4)ii C1 2 id.
Annexe 5: The examination under review: Brevet 2005 – South of France

Wording and appearance count for 4 marks out of 40.
Handheld calculators allowed.
Test duration: 2 hours

Part I: Numerical activities
Part II: Geometrical activities
Part III: Problem

[The examination papers are reproduced as facsimiles and are not rendered here.]

Annexe 6: Comparing PISA with the French curriculum

Annexe 7: Comparing a customary French examination with the French curriculum

[Annexes 6 and 7 consist of comparison charts that are not rendered here.]
Testfähigkeit – Was ist das? (Test Ability – What Is It?)

Wolfram Meyerhöfer
Germany: Freie Universität Berlin

This article explores the problem of test ability ("Testfähigkeit") using the example of standardized mathematics achievement tests. "Test ability" denotes those items of knowledge, abilities and skills that a test captures or measures along with the intended construct, but that one would not subsume under the notion of "mathematical proficiency". The article first explores why test ability is of considerable importance in connection with testing. Using tasks from TIMSS and PISA, and drawing on content-didactical and objective-hermeneutic task interpretations, it works out which empirical phenomena constitute the problem of test ability. It turns out that test ability runs counter to the idea of mathematical education (Bildung).
1 Test Ability in Mathematics Education

The concept of test ability¹ has so far been treated only in passing in German-language mathematics education (Mathematikdidaktik); it has never been seriously discussed.

¹ I will not give definitions of terms in this contribution: the method of Objective Hermeneutics that I use follows Wittgenstein's insight that the meaning of a text emerges exclusively from how it is used. I follow this insight with regard to the technical terms I employ as well. I do not want to narrow the terms I use – test ability, education (Bildung), standardized achievement test, and so on – in advance by pinning them down in definitions. Nor, fundamentally, do I want to widen them. I want to deepen them. Every member of the language community – all the more so of the disciplinary language community – has direct access to these terms in their full breadth and variety (if need be via dictionaries and encyclopaedias). My concern is to add to this "familiar ground" some things about test ability that have so far remained unexplored. For this process of gaining insight it makes sense to carry along "everything the concept drags along with it". Methodologically, this is connected with the fact that with Objective Hermeneutics one in any case reconstructs empirically anew on the one hand and, on the other, processes everything "dragged along" in the act of telling stories about the text.

58 WOLFRAM MEYERHÖFER

This may be due, first, to the concept appearing self-explanatory and therefore uninteresting: "test ability" denotes those items of knowledge, abilities and skills that a test captures (in non-standardized achievement tests) or measures (in standardized achievement tests) along with the intended construct, but that one would not subsume under the notion of "mathematical proficiency". Especially when these are dimensions that arise only because the instrument is a test, the label "test ability" or "test abilities" seems apt.

A second reason for the hitherto rather limited interest may be that only the high aspirations with which standardized achievement tests have been pushing into all spheres of social practice in recent years bring the inherent logic of these instruments into view: "conventional" school achievement tests (class tests, written examinations and so on) of course also measure test abilities along the way. But the imprecision of these instruments is, in principle, undisputed. That is why a student can argue with the teacher about the points awarded in a class test, why parents can appeal against an examination grade, why no employer will hire an apprentice on the basis of grades alone, and why hardly anyone disputes that procedures for allocating university places on the basis of Abitur certificates are substantively problematic. Standardized achievement tests claim to avoid such imprecision. They are therefore subject to a – compared with school achievement tests – heightened requirement not to measure test ability along the way. Their increased relevance as an instrument of governance and of allocating future life chances draws attention to the fact that this requirement is not met.

A third reason I want to formulate only as an impression: mathematics education – or rather mathematics educators – have not yet habitually completed the transformation from teacher to scientist. In its bluntness this impression is visibly wrong, but formulating it is fruitful for reading the empirical reconstructions that follow. One recognizes this whenever one is inclined to say: "But that is just how it is in school too, so where is the problem?" One cannot then reject the reconstruction of the problem – as one experiences again and again precisely in discussions about tests – rather, one has then intuitively found again in the test items something familiar from school.

TESTFÄHIGKEIT – WAS IST DAS? 59

To reject an interpretation, and the reconstruction of a problem it accomplishes, one must instead criticize the interpretation itself. One cannot simply assume that the interpretation is wrong because one dislikes the result.

One has thus intuitively gained insight about school where one wanted to gain insight about tests. The transition from teacher to scientist consists precisely in wanting insight – and hence allowing it – instead of labelling and justifying what one finds as normal before allowing any insight at all. Put more abstractly: being a scientist means distancing oneself far enough from the practice under investigation not to fall prey oneself to that practice's patterns of interpretation. Doing science means not reproducing these interpretive patterns but understanding them, exposing their implicit assumptions, contradictions, misinterpretations, distortions and so on – in short: reconstructing these interpretive patterns.
In the case of test ability, this also includes the multitude of justifications claiming that test ability "belongs", that these abilities can be traced back to educational goals or have value of their own in other respects. In the empirical reconstructions this claim proves to be superficial, often wrong and mostly cynical.

A fourth reason for the limited interest of German-language mathematics education in test abilities may be a certain fear of consistently tracing and thinking through the relationship between theory and practice: the components of test ability reconstructed below are also found in school achievement tests. There, too, they ought to be avoided. School achievement tests are, to be sure, not standardized achievement tests. They are therefore not subject to the claim of being scientific instruments and thus of stating precisely what they measure – and of avoiding the incidental measurement of test ability. But the cynicism, and the damage done to mathematical education by those task properties explored below under the heading of test ability, point to general problems of mathematics teaching.

If, as a mathematics educator, I want to go beyond the mere reconstruction of the problem, I must therefore show what teachers can do differently when constructing school tests, without thereby becoming scientific test constructors. It is only too understandable that mathematics education has so far preferred not to pursue this difficult problem. I too circumnavigate this problem for the time being by confining myself to standardized achievement tests. From the reconstructions that follow, however, there already emerges the hypothesis that reconstructing and working on habitus patterns in teachers' professional development offers one approach to the problem of test ability as well.
A fifth reason for the abstinence of mathematics education to date regarding test ability may have been methodological problems: one can analyse the incidental measurement of test ability without methods, but doing so demands a good feel for the latent and a considerable distance from one's own product. In addition, without methodological backing one faces considerable problems of legitimation, especially when tests are produced in culture-industry contexts (compare Meyerhöfer 2006).

With my doctoral thesis (Meyerhöfer 2004a, 2005) I introduced the method of Objective Hermeneutics into mathematics education. It makes it possible to reconstruct latent text elements in a methodically controlled way, and it forces one to interpret the text systematically rather than reading one's own intentions, or the intentions of the test constructor, into the text. The method proves to be a fruitful instrument for reconstructing test abilities.

In the English-speaking world, with its long tradition of attempts to measure the human mind, a debate about test ability has naturally been under way for some time. In a positivist tradition of thought that takes measurement as the yardstick of knowledge, the phenomena accompanying measurement are themselves subjected to measurement. Hembree (1987), for example, subjected 120 research studies involving mathematics achievement tests to a meta-analysis in order to examine the influence of "noncontent variables" on test performance. If one adheres to the measurement paradigm and wants to construct a mathematics achievement test, one will certainly find the influence of non-content "variables" satisfactorily and exhaustively explored here; most of the debates about test formats, writing styles, task ordering and so on, which are relatively new in the German-speaking world, also receive a quantitative analysis here, though not in so meticulous a manner as in Wuttke's (2007) PISA analysis.

Hembree examines, among other things, a "variable" that is often conceptually equated with test ability: testwiseness. "Testwiseness refers to a testee's ability to use the features and formats of the test and test situation to make a higher score, independent of knowledge of content (Millman, Bishop & Ebel, 1965). The comparison of scores by a group trained in test taking and an untrained group is the effect related to testwiseness." (Hembree 1987, p. 201) In his meta-analysis, Hembree finds that training in testwiseness raises test performance.
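The trained-versus-untrained comparison that Hembree aggregates is conventionally expressed as a standardized mean difference. A minimal sketch with invented scores (these are illustrative numbers, not Hembree's data):

```python
# Standardized mean difference (Cohen's d) between a group trained in test
# taking and an untrained group -- the kind of effect a meta-analysis pools.
# The score lists below are invented for illustration.
import statistics

def cohens_d(trained: list[float], untrained: list[float]) -> float:
    """Pooled-SD standardized mean difference between two groups."""
    m1, m2 = statistics.mean(trained), statistics.mean(untrained)
    v1, v2 = statistics.variance(trained), statistics.variance(untrained)
    n1, n2 = len(trained), len(untrained)
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (m1 - m2) / pooled_sd

trained = [52.0, 55.0, 58.0, 61.0]    # invented test scores
untrained = [48.0, 50.0, 53.0, 55.0]
print(round(cohens_d(trained, untrained), 2))
```

A positive d here corresponds to the finding reported above: the trained group scores higher, independently of content knowledge.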
The concept of testwiseness seems to me to open up too little to really capture the problem of test ability. I am not concerned only with the "features and formats of the test and the test situation", which remain external in principle, and then also in the quantitative analysis. That is because these concepts are meant categorially. I mean more when I speak of those items of knowledge, abilities and skills that a test captures or measures along the way, but that one would not subsume under the notion of "mathematical proficiency"; and it will become apparent that what is reconstructed below cannot be categorized. Above all, I am also talking about how the incidental measurement of test abilities interweaves the measurement of mathematical abilities and damages mathematical education.

The analysis of tasks presented below points to the place that the debate about test ability systematically neglects, but where the problem is first created: test ability is only secondarily an ability of the individual. Test ability is first of all something that resides in the task, for the task is the primary site at which the validity of the test's claims is generated: only once I have understood what the task measures does it make any sense at all to direct one's gaze at the individual being measured. That is why the question of what test ability is will here be extracted from the tasks.
2 The Significance of Test Ability within the Debate about Tests

2.1 The scientific claim of tests, and test ability

The concept of test ability used here refers only to standardized (mathematics) achievement tests. These tests claim to remedy the relativity of performance assessment in school, that is, to provide a less relative or non-relative measure of (mathematical) proficiency. The reduction of subjectivity is obvious, since (i) the multiple-choice format reduces the subjectivity of interpreting the student's "answer" essentially to zero (the scanner makes no subjective decision about whether to accept a blot of ink as "ticked", and the programmer who sets the threshold value is thereby not making a subjective decision about the performance to be assessed either), since (ii) far fewer subjects, and hence subjectivities, are involved in the rating of half-open or open student answers than when every teacher corrects for himself, and since (iii) the training of raters hopefully really does reduce the subjectivity of the ratings. But reduced subjectivity does not yet amount to performance assessment that is less relative than that of school. It must be examined whether the reduced subjectivity also leads to performance measurement that is more precise, more comparable and – in terms of the performance demands – more truthful (perhaps "more valid"). That this is scarcely possible in principle, and is not the case for TIMSS and PISA in particular, I have discussed at length in Meyerhöfer (2004a) and (2005).
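The scanner point in (i) can be made concrete: once a darkness threshold has been set, the scoring of a multiple-choice bubble is entirely mechanical. The threshold value and the pixel model below are invented for illustration:

```python
# Sketch of the point in (i): after a threshold is fixed, whether a bubble
# counts as "ticked" is a deterministic decision with no rater judgement.
# The threshold value and darkness model are assumptions for illustration.
THRESHOLD = 0.5  # assumed fraction of dark pixels needed to count as "marked"

def is_marked(dark_pixel_fraction: float, threshold: float = THRESHOLD) -> bool:
    """Deterministic decision: no subjective judgement enters here."""
    return dark_pixel_fraction >= threshold

print(is_marked(0.8), is_marked(0.1))  # True False
```

The subjectivity is not eliminated, only displaced: it sits in the prior choice of the threshold, not in any individual scoring decision.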
Only the claim to be able to remedy the relativity of performance assessment gives rise at all to the need to discuss the incidental measurement of test ability: if abilities are measured along the way that are not the mathematical abilities to be measured, then these abilities must be named, and it must be examined whether they are desired. An instrument's claim to be scientific points precisely to the obligation to make explicit what the instrument captures. I refrain here from discussing banalities, such as the incidental measurement of verbal abilities or of the ability to get started on the work at all.

One could, of course, also commit oneself to the position that measuring test abilities along the way is desirable – for example the ability, in a mathematics task, to shovel aside a mass of text devoid of mathematical substance in order to get to the mathematical problem, or the ability to evade the mathematical demand through sheer audacity and still obtain the point. In the following considerations, however, I assume that TIMSS and PISA are meant to measure exclusively mathematical abilities.

The additionally measured abilities can – once they have been recognized – be assigned to the measurement construct. But one then becomes entangled in problems of fairness and of the purpose of testing:

i) It is problematic in itself that test ability appears as mathematical ability.

ii) The more non-mathematical abilities one is prepared to measure along the way, the more broadly and precisely one must discuss what one is measuring and why one wants to measure it.

iii) One is, moreover, in danger of losing oneself in the arbitrariness of what is to be measured and of no longer deriving what is to be measured from a desired performance construct, but of measuring everything "that the
Items so mitmessen“.<br />
So war die Vorgehensweise bei TIMSS und <strong>PISA</strong>. 2 Man ist damit von<br />
der Messunschärfe einer herkömmlichen Klassenarbeit nicht weit entfernt, verliert<br />
also den wesentlichen Grund für standardisierte Leistungstests. 3 Man verschlechtert<br />
damit sogar die Position des Schülers, denn die Unwägbarkeiten in<br />
den Aufgabenformulierungen einer Klassenarbeit kann er durch sein im Unterricht<br />
erworbenes implizites oder explizites Wissen über den Lehrer heilen,<br />
notfalls kann er sogar fragen. 4 Die durch Testfähigkeit entstehenden Unwägbarkeiten<br />
bei standardisierten Tests sind auf diese Weise nicht zu bearbeiten.<br />
Man sollte den Begriff Testfähigkeit abgrenzen von der Fähigkeit, bei einem<br />
Test im Sinne einer Klassenarbeit gut abzuschneiden (letztere Fähigkeit<br />
ist expliziter Bestandteil schulischer Leistungserbringung): Auch bei dieser<br />
Fähigkeit geht es zwar z.B. darum, erfolgreich zu erschließen (und das heißt<br />
2 vgl. die Darstellung der Testkonstruktion in Meyerhöfer (2004 a, S. 98 f. und 139-157) oder<br />
in Meyerhöfer (2005, Kapitel 4)<br />
3 An diesen Überlegungen ist zu erkennen, dass das Defizit von schulischer Leistungsbewertung<br />
nur scheinbar in mangelnder Standardisierung liegt. Ein Test, der „alles mögliche“<br />
mitmisst, kann trotzdem hochstandardisiert sein. Er hat aber fast die gleiche Messunschärfe<br />
wie eine Klassenarbeit. Hier reproduziert sich ein Irrtum, der uns auch im Forschungsprozess<br />
oft begegnet, nämlich der Glaube, dass hohe Standardisierung zu präziseren oder<br />
„besseren“, breiteren, tieferen oder wenigstens allgemeiner gültigen Erkenntnissen führen<br />
würde. Standardisierung führt aber zunächst nur dazu, dass alle Mitglieder einer Population<br />
bezüglich bestimmter Aspekte den gleichen Bedingungen unterworfen sind. Das bedeutet<br />
zwar, dass bestimmte Rahmenbedingungen (oder auch für die Testkonstrukte: bestimmte<br />
Dimensionen eines multidimensionalen Kausalkonstrukts) für alle Mitglieder gleich konstruiert<br />
sind. Das bedeutet aber noch lange nicht, dass damit die Geltungserzeugung präziser,<br />
besser, breiter, tiefer, eindeutiger oder wenigstens allgemeiner gültig ist. Am Problem<br />
der Geltungserzeugung geht die Standardisierung eher vorbei <strong>–</strong> wobei natürlich bestimmte<br />
Standardisierungselemente die Geltungserzeugung unterstützen können.<br />
Ein beredtes Beispiel für dieses Problem ist der <strong>PISA</strong>-Test: Man kann 180 000 Schüler<br />
hochstandardisiert untersuchen. Wenn dabei unklar bleibt, was eigentlich gemessen wird,<br />
bleibt die Testaussage begrenzt. Selbst der hohe voyeuristische Wert einer Länderrangskala<br />
ergibt sich nicht aus hoher Standardisierung, sondern lediglich aus der großen Anzahl der<br />
Beteiligten.<br />
4 Nikola Leufer (U Dortmund) hat mich darauf aufmerksam gemacht, dass umgekehrt ein<br />
Lehrer, der seinen Schüler gut kennt, dessen „Testunfähigkeit“ in Bezug auf eine Klassenarbeit<br />
bei der Korrektur berücksichtigen, also quasi durch „gutmütige Korrektur“ heilen<br />
kann. Allgemein könnte man annehmen: Testfähigkeiten werden auch in der Schule miterfasst<br />
<strong>–</strong> haben aber keine so starken Konsequenzen. Mit einem anderen Blick: Es ist fester<br />
Bestandteil professionellen Könnens (und damit nicht technisierbar), diese Konsequenzen<br />
gering zu halten.
manchmal: erraten oder erahnen), was der Lehrer mit seiner Frage meint und in<br />
welcher Tiefe bzw. auf welcher Ebene die Aufgabe zu erfüllen ist. In der Klasse<br />
ist aber die Vermittlung dieser Fähigkeit expliziter Bestandteil des Unterrichtsprozesses:<br />
Unterricht ist per se eine nichtstandardisierte Angelegenheit.<br />
Somit ist er allen Vor- und Nachteilen der Nichtstandardisierung ausgesetzt.<br />
Das schlägt sich auch in unterrichtlichen Tests nieder. Die daraus resultierende<br />
Relativität von Zensierungen lässt zwei polarisierte Schlussfolgerungen zu:<br />
Einerseits kann man eine größere Standardisierung von Leistungsbewertung<br />
anstreben. Andererseits kann diese Relativität Anlass sein, den Zensuren mit<br />
einer gewissen Gelassenheit zu begegnen, also u.a. ihre Rolle für die Vergabe<br />
von Zukunftschancen ebenso zu relativieren. Sie sollte jedenfalls Anlass<br />
sein, sich der Vielschichtigkeit der Ursachen von Schulerfolg zu stellen <strong>–</strong> denn<br />
Zensuren sind das gesetzte und wahrscheinlich das beste quantitative Maß für<br />
Schulerfolg 5 . Leistungen sind nur eine Ursache von Schulerfolg, und Schulerfolg<br />
wirkt vielfältig zurück auf Leistungen. Will man das Leistungsprinzip in<br />
der Schule stärker zur Geltung bringen (und das ist eine Implikation des Trends<br />
zu Tests), so muss die Kopplung von Schulerfolg an Leistung gesichert werden.<br />
Werden nun andererseits in standardisierten Tests irgendwelche anderen<br />
als die zu leistenden Fähigkeiten mitgemessen, bedeutet dies wiederum eine<br />
Abkehr vom Leistungsprinzip <strong>–</strong> nur dass jetzt andere Nicht-Leistungskriterien<br />
einfließen als in der Klasse.<br />
5 Diese Behauptung bedürfte einer tieferen Argumentation, die hier nicht geleistet werden<br />
kann. Die Argumentationsrichtung wäre etwa die folgende: Wenn man ein Maß für Schulerfolg<br />
erstellen möchte, dann muss man Schulerfolg definieren und in ein Messkonstrukt<br />
überführen. Der Versuch wäre mit Messunschärfen und anderen Konstruktionsproblemen<br />
behaftet. Bereits die Adressierung von „Schulerfolg“ würde zu unüberwindlichen Schwierigkeiten<br />
führen: Verschiedene gesellschaftliche Gruppen haben verschiedene Ansprüche<br />
an „Schulerfolg“, die Vielfalt an schulischen Aufgaben müsste in eine gewichtete Form<br />
gebracht werden usw. Die Schulzensur ist der Versuch, eine solche Gesamt„messung“ vorzunehmen.<br />
Das „Messkonstrukt“ ist in einem langen Prozess entstanden, in dem innerschulische<br />
und außerschulische Interessen in das Konstrukt eingeflossen sind. Es ist kaum zu<br />
überschauen, welche impliziten und expliziten Elemente hier zusammenfließen. Es handelt<br />
sich aber um ein Konstrukt von erstaunlich hoher gesellschaftlicher Akzeptanz: Obwohl<br />
die Probleme der „Messunschärfe“ von Zensuren hinlänglich bekannt sind, sind Zensuren<br />
nach wie vor vorrangige Instrumente der Vergabe von Zukunftschancen in nachschulischen<br />
Feldern.<br />
Im Zusammenhang damit ist bemerkenswert, dass keine Untersuchung über den Zusammenhang<br />
von Zensur und Testleistung bei <strong>PISA</strong> vorliegt, obwohl die Zensuren erhoben<br />
wurden. Man kann sich unschwer vorstellen, dass Tests schnell als überflüssig angesehen<br />
würden, wenn sich herausstellte, dass die ordinale Anordnung erhalten bleibt, und dass vorrangig<br />
die Tests problematisiert würden, wenn die ordinale Anordnung nicht erhalten bleibt.
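Wie eine solche Prüfung der ordinalen Anordnung aussehen könnte, deutet die folgende Skizze mit frei erfundenen Daten an; eine Rangkorrelation nahe −1 bedeutet hier, dass Zensuren (1 = beste Note) und Testpunkte dieselbe Rangordnung erzeugen:<br />

```python
def raenge(werte):
    """Ordnet jedem Wert seinen Rang zu (Durchschnittsränge bei Bindungen)."""
    sortiert = sorted(range(len(werte)), key=lambda i: werte[i])
    r = [0.0] * len(werte)
    i = 0
    while i < len(sortiert):
        j = i
        while j + 1 < len(sortiert) and werte[sortiert[j + 1]] == werte[sortiert[i]]:
            j += 1
        mittel = (i + j) / 2 + 1  # Durchschnittsrang der gebundenen Gruppe
        for k in range(i, j + 1):
            r[sortiert[k]] = mittel
        i = j + 1
    return r

def spearman(x, y):
    """Rangkorrelation als Pearson-Korrelation der Ränge."""
    rx, ry = raenge(x), raenge(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Fiktive Daten: Zensuren (1 = beste Note) und Testpunkte
zensuren = [1, 2, 2, 3, 4, 5]
punkte = [620, 560, 540, 500, 430, 380]
print(round(spearman(zensuren, punkte), 3))  # ungefähr -0.986, also nahe -1
```

Mit echten <strong>PISA</strong>-Daten wäre dann vor allem interessant, wo und warum die Rangordnung bricht.<br />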
2.2 Testfähigkeit und Bildungsziele<br />
Tests setzen Standards. Sie tun dies in umso größerem Maße, je relevanter sie<br />
für die Vergabe von Zukunftschancen sind. Sie tun dies aber auch dadurch, dass<br />
sie sich als wissenschaftliche Instrumente gerieren. Diese Standards schlagen<br />
bis in den Unterricht durch. Dadurch ist es problematisch, wenn in Tests Aufgaben<br />
auftauchen, die man lösen kann, ohne dass man die Fähigkeit, die getestet<br />
werden soll, wirklich besitzen muss. Umgekehrt ist es ebenfalls problematisch,<br />
wenn man eine Aufgabe nicht (richtig im Sinne der Tester) lösen kann,<br />
obwohl man die Fähigkeit(en) besitzt.<br />
Für den Lehrer ist es schwierig, Elemente von Testfähigkeit in den Aufgaben<br />
von Tests, Bildungsstandards usw. zu erkennen und zu beheben, wenn<br />
er ausschließlich auf die mathematischen Fähigkeiten rekurrieren möchte:<br />
Der Lehrer arbeitet unter Handlungsdruck und setzt verständlicherweise darauf,<br />
dass standardisierte mathematische Leistungstests wirklich mathematische<br />
Leistung testen. Wissenschaftler unterliegen also einer gewissen Verantwortung<br />
für ihr Instrument.<br />
Man trifft in der Debatte um Testfähigkeiten auf das Argument, dass manche<br />
Testfähigkeiten durchaus als Bildungsziele taugen bzw. mit ihnen korrespondieren.<br />
Dieses Argument wird unten anhand der in den Interpretationen rekonstruierten<br />
Komponenten von Testfähigkeit diskutiert werden. Es stellt sich<br />
dabei im Wesentlichen als wenig fundiert und zynisch heraus.<br />
2.3 Chancengleichheit und Testfähigkeit<br />
Testfairness ist verletzt, wenn Teile der zu vermessenden Population Teile der<br />
gemessenen Fähigkeiten nicht oder in geringerem Maße als andere Teile der<br />
Population erwerben konnten. Dies kann z.B. der Fall sein, wenn Inhalte getestet<br />
werden, welche in einer der vermessenen Schularten gar nicht unterrichtet<br />
wurden. Dies kann auch der Fall sein, wenn mit einem Realitätskontext gearbeitet<br />
wird, welcher einer Gruppe völlig unbekannt, einer anderen hingegen<br />
vertraut ist. Ideale Testfairness kann es nicht geben und man muss sich dem in<br />
der Deutung der Testdaten stellen.<br />
In Bezug auf Testfähigkeit liegt eine Verletzung von Testfairness dann vor,<br />
wenn Tests Testfähigkeiten mitmessen und gleichzeitig Teile der zu vermessenden<br />
Population mehr Gelegenheit als andere Teile dieser Population hatten,<br />
diese Testfähigkeiten zu erlangen. So gab es eine kurze, aber intensive Debatte<br />
über Testfähigkeit, als die ersten TIMSS-Ergebnisse 1997 in Deutschland<br />
veröffentlicht wurden. Insbesondere wurde darauf verwiesen, dass die USA-
und die asiatischen „Nationalauswahlen“ viel besser auf den Test vorbereitet<br />
gewesen seien, weil in diesen Ländern eine ausgeprägte „Kultur“ des Testens<br />
zur Vergabe von Zukunftschancen herrscht. Man ging also davon aus, dass die<br />
asiatischen und die USA-Teile der vermessenen Population mehr Gelegenheit<br />
als die deutschen Teilnehmer hatten, Testfähigkeiten zu erlangen. Zwei polarisierte<br />
Schlussfolgerungen wurden daraus gezogen: Einerseits die Schlussfolgerung,<br />
die mangelnde Aussagekraft bei der Interpretation der Resultate zu berücksichtigen<br />
und vielleicht sogar auf solche Tests zu verzichten. Andererseits<br />
die Schlussfolgerung, die deutschen Teilnehmer ebenso intensiv in Testfähigkeiten<br />
einzuüben.<br />
3 Erfolge von Testtraining<br />
Mittlerweile tendiert die Praxis des deutschen Schulsystems in die Richtung<br />
des verstärkten Testens auch der deutschen Schüler. Allerdings wird das Problem<br />
der Testfähigkeit dabei kaum noch diskutiert. In der früheren Debatte<br />
fühlte sich die TIMSS-Gruppe noch genötigt zu behaupten, dass man solche<br />
Tests nicht trainieren kann 6 (Baumert u.a. 2000, S. 108 in Antwort auf Hagemeister<br />
1999). Das hieße nun allerdings zugespitzt, dass es Testfähigkeit<br />
im hier gemeinten Sinne nicht gibt, denn beim Testtraining geht es nicht um<br />
das Training der mathematischen Fähigkeiten, sondern um jene Fähigkeiten,<br />
die neben den mathematischen Fähigkeiten für den Testerfolg sorgen. Etwas<br />
schlicht gesagt: Testtraining (als Idealtypus) stellt nicht die Frage: Welche mathematischen<br />
Fähigkeiten müssen wir noch elaborieren? Es stellt die Fragen:<br />
Wie ticken Tester? Wie tickt der Test? Wie musst du ticken, damit du möglichst<br />
gut durchkommst? Man kann als Gegentypus das Üben konstruieren, das<br />
die Frage stellt: Welche mathematischen Fähigkeiten müssen wir noch elaborieren?<br />
Mit der Konstruktion als Idealtypus wird deutlich, dass manches Üben<br />
auch Elemente von Testtraining enthält, und dass manches Testtraining auch<br />
Elemente von Üben enthält.<br />
Da ich in meiner Dissertation festgestellt habe, dass und wie TIMSS<br />
und <strong>PISA</strong> Testfähigkeiten mitmessen, habe ich dort näher untersucht, wie die<br />
6 Die Behauptung erfolgt unter Nichtberücksichtigung der Metaanalyse von Hembree (1987),<br />
aber unter der im Text das Thema erschlagenden Bemerkung: „In den USA gibt es eine<br />
breite Forschungsliteratur zu den begrenzten Auswirkungen von Test-Coaching.“ (Baumert<br />
u.a. 2000, S. 108)
TIMSS-Gruppe zum Resultat gelangt, dass man Tests nicht trainieren kann und<br />
es diese Testfähigkeiten also nicht gibt oder man sie vernachlässigen kann. 7<br />
Dabei stellt sich heraus, dass Baumert u.a. (2000) Forschungsergebnisse<br />
verzerrt darstellen. Sie berufen sich auf zwei Studien von Klieme und Maichle<br />
(1989, 1990). Klieme und Maichle haben ein Training für Teile der medizinischen<br />
Eingangstests entwickelt und durchgeführt. Sie wollten im Wesentlichen<br />
herausbekommen, ob bezahlte Vorbereitungskurse für diese Tests die Chancengleichheit<br />
der Kandidaten verletzen können. In gewisser Weise betrifft die<br />
Fragestellung also die gleiche Chancengleichheitsdebatte bezüglich Tests wie<br />
heute.<br />
Klieme und Maichle haben ein Testtraining von sechs (!) Zeitstunden mit<br />
21 Personen durchgeführt. Sie erreichten dabei Verbesserungen in den trainierten<br />
Komponenten <strong>–</strong> es trat also ein positiver Trainingseffekt auf. Sie erreichten<br />
keine Verbesserungen im eigentlichen Test, aber dafür hatte auch kein<br />
Training stattgefunden, weil sie aus Zeitgründen nur einzelne Komponenten<br />
trainiert hatten. Sie diskutieren das Ergebnis ihres Trainings dann auch recht<br />
vielschichtig, schließen allerdings in erstaunlicher Weise: „Auch die Resultate<br />
dieser spezifischen . . . Fördermaßnahmen bestätigen letztlich die Aussage<br />
. . . , daß komplexe Problemlöseleistungen im Sinne der Subtests . . . nicht<br />
bzw. nur in relativ geringem Ausmaß trainierbar sind.“ (Klieme/Maichle 1990,<br />
S. 307) Diese Schlussfolgerung steht in offensichtlichem Widerspruch zu den<br />
Ergebnissen der Untersuchung. Vielleicht erklärt sie sich aus der institutionellen<br />
Einbindung der Forscher im Testinstitut heraus: Die Untersuchung sollte<br />
herausfinden, ob die Testfairness durch bezahlte Vorbereitungskurse verletzt<br />
werden kann. Wäre die Antwort ein „Ja“ gewesen oder wäre auch nur das<br />
schwächere „Ja“ dieser Untersuchung herausgekommen, dann hätte der Test<br />
massiv verändert oder abgeschafft werden müssen.<br />
Die TIMSS-Gruppe nimmt das verfälschte „Resultat“ auf, obwohl es sich<br />
nicht mal auf die Diskussion um Langzeiteffekte von Massentestungen bezieht.<br />
Die Studie von Klieme und Maichle wird offensichtlich lediglich vorgeschoben,<br />
um unerwünschte Nebeneffekte des Testens wegzudiskutieren.<br />
In der Diskussion um Testfähigkeit geht es jedoch um Langzeiteffekte bei<br />
kindlichen und jugendlichen Schülern, für die direkt oder indirekt Zukunftschancen<br />
an Testergebnisse gebunden werden. Man muss sich mit diesen Nebeneffekten,<br />
die zu Haupteffekten beim Lernen von Mathematik werden kön-<br />
7 Meyerhöfer (2004 a, S. 219-221; 2005, S. 190-192)
nen, beschäftigen, um ihren Charakter und ihren Einfluss abschätzen zu können.<br />
4 Testfähigkeit und Autonomie<br />
In diesem Beitrag wird nicht der Trainingsprozess betrachtet, sondern es wird<br />
untersucht, welche Itemeigenschaften dafür sorgen, dass neben mathematischen<br />
Fähigkeiten auch Testfähigkeiten gemessen werden. Ziel dieser Betrachtung<br />
ist es, den Beteiligten zu größerer Autonomie gegenüber dem Problem zu<br />
verhelfen.<br />
Das Training von Testfähigkeit scheint eine Möglichkeit dazu zu sein, weil<br />
Testfähigkeit die Autonomie des Schülers gegenüber dem Testprozess stärkt.<br />
Sie reproduziert aber auch die Autonomiezerstörung, indem sie den Schüler<br />
auf Fähigkeiten hin trainiert, die außerhalb von mathematischen Fähigkeiten<br />
liegen. Die autonomiezerstörende Grundstruktur von Tests ist nicht hintergehbar<br />
8 . Sie kann nur durch distanzierte Reflexion gebrochen werden. Außerdem<br />
erfordert jedes Testtraining ein Zeitbudget, welches für sinnvollere Lerninhalte<br />
einsetzbar ist.<br />
Die Erweiterung von Autonomie kann ebenso auf Seiten des Lehrers oder<br />
des bildungspolitischen Raums stattfinden. Mit dem Wissen um Komponenten<br />
von Testfähigkeit kann man bewusster entscheiden, ob man bereit ist, Leistungstests,<br />
welche Testfähigkeit mitmessen, zur Vergabe von Zukunftschancen<br />
einzusetzen.<br />
Erweiterung von Autonomie kann aber auch auf Seiten der Testentwickler<br />
stattfinden. Mit dem Wissen um Komponenten von Testfähigkeit kann auch<br />
hier bewusster entschieden werden, was man alles (mit)messen möchte.<br />
5 Testfähigkeit <strong>–</strong> Empirische Erschließungen<br />
5.1 Eine erste Annäherung<br />
Führen wir uns zunächst die Grundstruktur des Testens vor Augen. Tests werden<br />
erstellt, um Eigenschaften von Messobjekten in einem Messprozess zu<br />
erfassen. Man hat also zunächst eine Vorstellung davon, was in unserem Fall<br />
„mathematische Leistungsfähigkeit“ sein soll. Nun operationalisiert man diese<br />
Vorstellung, man schafft also Items, die in ihrem Zusammenwirken messen,<br />
inwieweit diese Fähigkeit vorhanden ist. Das so entstandene Konstrukt,<br />
8 vgl. Meyerhöfer (2004 a, S. 81-83; 2005, S. 24-27)
die „operationalisierte mathematische Leistungsfähigkeit“, soll natürlich möglichst<br />
identisch sein mit dem, was man sich vor der Operationalisierung unter<br />
mathematischer Leistungsfähigkeit vorgestellt hat.<br />
Das operationalisierte Messkonstrukt trifft <strong>–</strong> materialisiert in Form eines<br />
Testheftes <strong>–</strong> auf das Messobjekt, also auf den Schüler. Für den Schüler ist egal,<br />
was mathematische Leistungsfähigkeit ist, ob sie richtig operationalisiert ist,<br />
ob die getesteten Fähigkeiten relevant sind usw. Für den Schüler ist nur eines<br />
wichtig: Er muss die Erwartung des Testers bedienen. Er muss sein Kreuz an<br />
der richtigen Stelle machen, er muss die richtige Zahl hinschreiben, er muss eine<br />
Antwort notieren, die der auswertende Kodierer mit einem Leistungspunkt<br />
belegen kann. Damit ist die Richtung für einen Begriff von Testfähigkeit festgelegt:<br />
Testfähigkeit ist die Fähigkeit der Optimierung des eigenen Punktwertes<br />
innerhalb des Testkonstrukts. Das heißt insbesondere, dass man erstens in<br />
der Lage ist, eine wirklich vorhandene mathematische Fähigkeit in einen Testpunkt<br />
umzusetzen, und dass man zweitens in der Lage ist, einen Testpunkt<br />
auch dann zu erreichen, wenn man nicht über die mathematische Fähigkeit<br />
verfügt. Das zeigt zunächst, dass für das Individuum Testfähigkeit umso wichtiger<br />
wird, je bedeutsamer der Test für die Vergabe von Zukunftschancen wird.<br />
Zur Testfähigkeit gehört aber auch die Fähigkeit, den Test sinnvoll zu verweigern.<br />
Wenn z.B. bei <strong>PISA</strong> die Schule und nicht das Individuum vermessen<br />
wird, dann sollte im Sinne der Schule ein „schlechter“ Schüler den Test ebenso<br />
verweigern wie ein Schüler, der an diesem Tag sein Leistungsoptimum nicht<br />
erreicht. Die Schule muss für eine solche Verweigerung dankbar sein und sie<br />
unterstützen. 9<br />
9 Wolfgang Schulz (HU Berlin) schlägt in einem Gutachten zu diesem Beitrag vor, auch die<br />
Bereitschaft, sich dem Test zu stellen und das Bestreben, den Test möglichst erfolgreich zu<br />
absolvieren, als Testfähigkeit zu behandeln. Ich finde den Vorschlag fruchtbar, mir scheint<br />
hier aber eher das vorzuliegen, was Soziologen in Anlehnung an Durkheim „vorvertragliche<br />
Grundlagen des (sozialen) Vertrages“ nennen. Diese vorvertraglichen Grundlagen sind<br />
aber ein eigenes Thema. Ich setze hier voraus, dass die Getesteten möglichst gut abschneiden<br />
möchten. Dazu gehört dann aber <strong>–</strong> wenn die Schule als Ganzes vermessen wird <strong>–</strong> dass<br />
Schüler und Lehrer dieses gute Abschneiden als gemeinsames Projekt begreifen. Dass man<br />
meinen Vorschlag, dass beide Gruppen sich auf die absichtliche Absenz von testschwachen<br />
Schülern einigen, als absurd empfindet, zeigt lediglich einen Zustand an, in dem sich Lehrer<br />
und Schüler (noch?) nicht als Gemeinschaft gegen etwas Äußeres verstehen. Dieses Phänomen<br />
lässt sich aber nur im Rahmen einer Schultheorie umfassender diskutieren.
5.2 Bekannte Komponenten von Testfähigkeit<br />
Nur kurz erwähnen möchte ich allgemeine Testbearbeitungsstrategien, die bereits<br />
andernorts ausgiebig dargestellt sind und zu deren weiterer Beschreibung<br />
ich hier nichts beitragen möchte. Das sind Zeiteinteilungsstrategien, Fehlervermeidungsstrategien,<br />
Ratestrategien, Strategien zur Ausnutzung versteckter<br />
Lösungshinweise und formale Strategien zum deduktiven Erschließen der vermeintlich<br />
richtigen Antwort:<br />
<strong>–</strong> „Zeiteinteilungsstrategien (z.B.: das Überspringen von schwierigen Aufgaben,<br />
das Markieren von ungelösten Aufgaben oder solchen Items, bei denen<br />
man sich seiner Lösung nicht ganz sicher ist, das Markieren von Teillösungen,<br />
das Anlegen eines Arbeitsprotokolls, aus dem man ersehen kann, wie<br />
schnell man vorankommt, usw.),<br />
<strong>–</strong> Fehlervermeidungsstrategien (sorgfältiges Lesen der Instruktion, Beachten<br />
der Aufgabenstellung, Überprüfen der Antwort usw.),<br />
<strong>–</strong> Ratestrategien [ 10 ],<br />
<strong>–</strong> Strategien zur Ausnutzung versteckter Lösungshinweise (das Beachten aller<br />
Merkmale, hinsichtlich derer sich die Antworten von den Distraktoren<br />
unterscheiden könnten <strong>–</strong> z.B. der Länge, der Position, des Stils der betreffenden<br />
Aussagen usw.)<br />
<strong>–</strong> die Beachtung sogenannter „specific determiners“ (gemeint sind Worte wie<br />
„immer“, „niemals“, „alle“ usw., die nach Meinung der Veranstalter [von<br />
Testtrainings, W. M.] speziell die Distraktoren, also die Falschantworten,<br />
kennzeichnen),<br />
<strong>–</strong> formale Strategien zum deduktiven Erschließen der vermeintlich richtigen<br />
Antwort (z.B. auf der Basis inhaltlicher oder formaler Abhängigkeiten<br />
zwischen den einzelnen Antwortmöglichkeiten).“ (Klieme, Maichle 1989,<br />
S. 207)<br />
Ebenfalls nur erwähnen möchte ich folgende Aspekte, die sich auch ohne<br />
eingehendere Aufgabeninterpretationen erschließen: Wenn man nichts weiß,<br />
muss man raten. Wenn man Multiple-Choice-Angebote mit nur einer richtigen<br />
Antwort abarbeiten muss, ist es besser, die Bearbeitung bei Erreichen der<br />
wahrscheinlich richtigen Antwort abzubrechen und die Aufgabe zu kennzeichnen.<br />
Erst wenn man später noch Zeit hat, sollte man zurückkehren und eine<br />
Fehlerkontrolle durchführen. Gleiches gilt für andere Unsicherheiten.<br />
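Dass Raten unter diesen Bedingungen rational ist, zeigt eine kleine Erwartungswert-Rechnung (eigene Illustration; TIMSS und <strong>PISA</strong> arbeiten meines Wissens ohne Punktabzug für Falschantworten):<br />

```python
# Eigenes Rechenbeispiel, nicht aus den Tests übernommen: erwarteter
# Punktwert beim blinden Raten einer Multiple-Choice-Aufgabe mit k Optionen.

def erwarteter_punktwert(k: int, punkte_richtig: float = 1.0,
                         abzug_falsch: float = 0.0) -> float:
    """Erwartungswert beim Raten; manche Tests ziehen für Falschantworten ab."""
    p = 1.0 / k
    return p * punkte_richtig - (1 - p) * abzug_falsch

# Ohne Punktabzug lohnt Raten immer:
print(erwarteter_punktwert(5))  # 0.2
# Mit einem Abzug von 1/(k-1) Punkten je Falschantwort wäre der
# Erwartungswert des blinden Ratens dagegen genau 0:
print(erwarteter_punktwert(5, abzug_falsch=0.25))  # 0.0
```

Jede noch so vage Ausschlussüberlegung hebt den Erwartungswert weiter an, weil sie die Trefferwahrscheinlichkeit über 1/k hinaus erhöht.<br />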
10 Näheres zum Raten vgl. Meyerhöfer (2004 c).
Wenn man sich durch viele überflüssige Informationen (z.B. <strong>PISA</strong>-<br />
Aufgabe „Bauernhöfe“ <strong>–</strong> siehe unten) oder durch eine Anhäufung von Variationen<br />
der immer gleichen Wortgruppe („Bauernhöfe“ und TIMSS-Aufgabe<br />
A5 <strong>–</strong> siehe unten) hindurcharbeiten muss oder wenn man eine Ansammlung<br />
von Aussagen abarbeiten muss (z.B. <strong>PISA</strong>-Aufgabe „Dreiecke“ 11 ), dann kann<br />
man von Konzentrations- bzw. Durchhaltefähigkeit sprechen. Diese Fähigkeit<br />
benötigt man zwar oft im Leben, aber es wäre sicherlich wünschenswert, wenn<br />
man aus der inneren Verfasstheit und Ernsthaftigkeit eines Problems heraus<br />
durchhalten muss und nicht, weil eine Aufgabe schlecht gestellt ist bzw. weil<br />
sie Durchhaltefähigkeit statt der eigentlich zu testenden Fähigkeit misst.<br />
5.3 Testfähigkeit in Testaufgaben<br />
Ich möchte nun Komponenten von Testfähigkeit darstellen, die in den Aufgabeninterpretationen<br />
von <strong>PISA</strong> und TIMSS rekonstruiert wurden. Ich habe die<br />
Aufgaben objektiv-hermeneutisch interpretiert, hier sind lediglich Interpretationselemente<br />
angedeutet und Interpretationsresultate dargestellt. Die Interpretationen<br />
erfolgten zunächst unter der Fragestellung, was mit den Aufgaben gemessen<br />
wird. Dabei zeigten sich nicht nur erhebliche Messprobleme, die zu der<br />
Schlussfolgerung führten, dass beide Tests als Instrument zur Messung mathematischer<br />
Leistungsfähigkeit ungeeignet sind. Es zeigten sich auch habituelle<br />
Probleme 12 .<br />
Das Mitmessen von Testfähigkeit erweist sich als ein Problem, dem Messprobleme<br />
ebenso wie habituelle Probleme anhaften. Die hier aufgezeigten<br />
Komponenten von Testfähigkeit sind vielfältig miteinander verwoben <strong>–</strong> im Erscheinungsbild,<br />
im Charakter, im Hintergrund und in den Ursachen ihres Auftretens.<br />
Es würde der Reichhaltigkeit der empirischen Rekonstruktion nicht<br />
entsprechen, wenn man versuchte, die Komponenten mit zusammenfassenden<br />
Namen zu belegen, sie gar zu kategorisieren. Ich möchte auch nicht der Versuchung<br />
erliegen, die Komponenten in <strong>–</strong> dann zwingend plakative <strong>–</strong> Schüleranweisungen<br />
zu übersetzen, z.B.: Nimm das reale Problem nicht ernst! Denke<br />
zum Mittelmaß hin! Auch dies würde der Komplexität des Gegenstandes nicht<br />
entsprechen, welche hier entfaltet, aber noch nicht reduziert werden soll. Die<br />
11 vgl. Deutsches <strong>PISA</strong>-Konsortium (2001, S. 178); Diskussion bei Meyerhöfer (2004 b)<br />
12 Manifeste Orientierung auf Fachsprachlichkeit bei gleichzeitiger Beschädigung des Mathematischen;<br />
Verwerfungen des Mathematischen und des Realen bei realitätsnahen Aufgaben<br />
(Misslingen der angestrebten „Vermittlung von Realem und Mathematischem“); Kalkülorientierung<br />
statt mathematischer Bildung; Illusion der Schülernähe als Verblendung. Ich habe<br />
das als „Abkehr von der Sache“ zusammengefasst.
Überschriften sind dementsprechend Stichworte zu den jeweils herausgearbeiteten<br />
Phänomenen, keine Benennungen für trennscharf gedachte Komponenten.<br />
Da die Komponenten als Aufgabeneigenschaften rekonstruiert wurden,<br />
verweisen die Überschriften auf solche Eigenschaften. Erst zum Schluss führe<br />
ich diese Eigenschaften zu „Fähigkeiten“ zusammen.<br />
5.3.1 Fremde und bizarre Wörter; Irritationen; Tendenz zum Mittelmaß<br />
Die wohl am einfachsten zu erkennende und zu behebende Komponente von<br />
Testfähigkeit begegnet uns in der TIMSS-Aufgabe A1 im Wort schattieren:<br />
Betrachte die Figur. Wie viele von den kleinen Quadraten muss man ZUSÄTZLICH<br />
schattieren, damit 4/5 der kleinen Quadrate schattiert sind?<br />
A) 5<br />
B) 4<br />
C) 3<br />
D) 2<br />
E) 1<br />
Diese Komponente von Testfähigkeit ist durch ungewöhnliche, schwierige,<br />
mehrdeutige, vielleicht auch falsch benutzte Wörter im Aufgabentext gekennzeichnet<br />
(Interpretation für „schattieren“ vgl. Meyerhöfer 2004 a, S. 104 f.).<br />
Als Hauptursache für das Auftreten dieser Komponente sind Prätentionen,<br />
Übersetzungsfehler und auch mangelnde Sorgfalt bei der Durchsicht der Aufgaben<br />
zu nennen: Diese Fehler bewegen sich auf der manifesten Textebene<br />
und sind durch sorgfältige Durchsicht der Texte zu beheben. Hier kann man<br />
zum Beispiel das Wort „schraffieren“ verwenden und wirklich eine Schraffur<br />
verwenden.<br />
Als Verschärfung dieser Komponente lässt es sich ansehen, wenn der Text<br />
in eine offen bizarre Form übergeht. So wird in der <strong>PISA</strong>-Aufgabe „Bauernhöfe“<br />
der Quader EFGHKLMN als rechtwinkliges Prisma erläutert:
<strong>PISA</strong>-Aufgabe Bauernhöfe<br />
Hier siehst du ein Foto eines Bauernhauses mit pyramidenförmigem Dach.<br />
Nachfolgend siehst du eine Skizze mit den entsprechenden Maßen, die eine<br />
Schülerin vom Dach des Bauernhauses gezeichnet hat.<br />
Der Dachboden, in der Skizze ABCD, ist ein Quadrat. Die Balken, die das<br />
Dach stützen, sind die Kanten eines Quaders (rechtwinkliges Prisma) EFGH-<br />
KLMN. E ist die Mitte von AT, F ist die Mitte von BT , G ist die Mitte von<br />
CT und H ist die Mitte von DT. Jede Kante der Pyramide in der Skizze misst<br />
12 m.<br />
Bauernhöfe 1. Berechne den Flächeninhalt des Dachbodens ABCD.<br />
Der Flächeninhalt des Dachbodens ABCD = ______ m 2 .<br />
Bauernhöfe 2. Berechne die Länge von EF, einer der waagerechten Kanten<br />
des Quaders.<br />
Die Länge von EF= ______ m.
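Zur Einordnung der mathematischen Substanz: beide Teilaufgaben lassen sich in je einer Zeile lösen (eigene Kurzrechnung; ABCD hat wegen der 12-m-Kanten der Pyramide die Seitenlänge 12 m, und EF ist Mittellinie im Dreieck ABT):<br />

```latex
A_{ABCD} = 12\,\mathrm{m}\cdot 12\,\mathrm{m} = 144\,\mathrm{m}^2
\qquad
\overline{EF} = \tfrac{1}{2}\,\overline{AB} = \tfrac{1}{2}\cdot 12\,\mathrm{m} = 6\,\mathrm{m}
```

Die Schwierigkeit der Aufgabe liegt also weniger im Mathematischen als im Durchdringen der Skizzenbeschreibung.<br />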
Ein weiteres Beispiel liefert die TIMSS-Aufgabe A2:<br />
Die Gegenstände auf der Waage halten sich im Gleichgewicht. Auf der linken<br />
Waagschale befinden sich ein Gewicht (eine Masse) von 1 kg und ein halber<br />
Ziegelstein. Auf der rechten Seite befindet sich ein ganzer Ziegelstein.<br />
Welches Gewicht (welche Masse) hat ein ganzer Ziegelstein?<br />
A) 0,5 kg<br />
B) 1 kg<br />
C) 2 kg<br />
D) 3 kg<br />
Hier kann man sich in der Aufgabenerstellung nicht entscheiden, ob es um Gewicht oder um Masse geht, obwohl das für das Problem belanglos ist. Bis zur ersten Klammer haben Bild und Text widersprüchliche Signale bezüglich der Frage gegeben, ob die Aufgabe unter mathematischen, physikalischen oder alltäglichen Gesichtspunkten zu bearbeiten sei.¹³ Den äußerlichen auf Physik bzw. Messtechnik orientierenden Signalen wird latent widersprochen. An der Klammer tritt die Unsicherheit in offene Konfusion über. Die Unklarheit wird im Text manifest. Die Feinanalyse zeigt, dass lediglich ein schulmeisterliches Bedürfnis nach korrektem Gebrauch von Fachsprache bedient wird. Die Struktur kann man als äußerliche Verfachsprachlichung bei gleichzeitiger inhaltlicher Dementierung von Fachlichkeit bezeichnen. Gleichzeitig wird ein Irritationsmoment geschaffen, denn der Schüler muss einen Umgang mit der offenen sprachlichen Verwerfung finden. Konkret muss er entscheiden, ob die begriffliche Doppelung für die Lösung wichtig ist oder nicht. Ein Schüler, der das Schulmeisterliche des Textes erfassen und übergehen kann, erhält hier einen Zeitvorteil.
Ein drittes Beispiel findet sich in der TIMSS-Aufgabe A5:
Welche der Aussagen über das Quadrat EFGH ist FALSCH?
A) EIF und EIH sind kongruent (deckungsgleich).
B) GHI und GHF sind kongruent (deckungsgleich).
C) EFH und EGH sind kongruent (deckungsgleich).
D) EIF und GIH sind kongruent (deckungsgleich).
In der Verwendung von kongruent und deckungsgleich spiegelt sich ein wahrscheinlich nicht lösbarer Konflikt von „zentralen“ Tests. Beide Wörter sind – bezogen auf das hier in Rede stehende Problem – gleichbedeutend. Beide Wörter gehören zur Fachsprache von Mathematikunterricht. Es gibt Klassen, in denen der Begriff der Deckungsgleichheit der Lernstoff ist. Es gibt auch Klassen, in denen der Kongruenzbegriff der Lernstoff ist. Bei einigen von diesen wird wiederum der Begriff der Deckungsgleichheit zur Erklärung des Kongruenzbegriffs herangezogen – das bezieht sich auf ein gewisses Selbsterklärungspotenzial des Begriffs der Deckungsgleichheit. Die in der Aufgabe gewählte Formulierung kongruent (deckungsgleich) nimmt nun vorrangig diesen letzten Aspekt auf: kongruent wird – quasi zur Erinnerung – als deckungsgleich erläutert. Gleichzeitig wird deckungsgleich als begriffliche Alternative für diejenigen angeboten, die den Begriff kongruent nicht kennen. Für diese Gruppe ist ein neuer Begriff – quasi als Hauptbegriff, weil nicht in der Klammer stehend – aufgetaucht. Für diejenigen, die nur den Begriff kongruent kennen, ist ebenfalls ein neuer Begriff eingeführt, und zwar in einer Klammer. Für alle drei Gruppen entsteht durch die Klammer ein Irritationspotenzial: Entweder wird man mit einem neuen Begriff konfrontiert, oder es wird plötzlich in der Klammer an die Bedeutung eines Begriffs erinnert – für einen Test ein seltsames Unterfangen. Erklärbar ist diese Begriffsverwirrung für denjenigen, der das – im Kern didaktische – Problem der zwei Begriffe kennt. „Normal“ ist es für denjenigen, der mit solchen Konstruktionen in Tests vertraut ist und sie übergehen kann: ein Bestandteil von Testfähigkeit.
¹³ Interpretation vgl. Meyerhöfer (2004a, S. 107–113)
In allen drei Beispielen stellt sich als Ursache ein habituelles Problem des „Schulmeisterlichen“ heraus: Hier wird probleminadäquat Wert auf Verwendung von Fachsprache gelegt – und das Fachliche unterminiert. Gleichzeitig wird ein Irritationsmoment geschaffen, denn der Schüler muss einen Umgang mit dieser sprachlichen Verwerfung finden. Konkret muss er z. B. entscheiden, ob die begriffliche Doppelung für die Lösung wichtig ist oder nicht. Ein Schüler, der das Schulmeisterliche des Textes erfassen und übergehen kann, erhält hier einen Zeitvorteil: Er spart die Zeit, die jemand benötigt, der erst über das Masse-Gewichts-Problem oder den Prismenbegriff oder über Kongruenz nachdenkt oder es gar tiefgründig in seine Überlegungen Einzug halten lässt.
Die Aufgabe für den Schüler besteht bei dieser ersten Testfähigkeitskomponente darin, die entstehende Klippe zu umschiffen. Das kann einerseits bedeuten, erfolgreich den Inhalt des „seltsamen“ Wortes zu erfassen – eine verbale Fähigkeit. Bei mehreren möglichen Bedeutungen ist die von den Testern intendierte Bedeutung zu erfassen. Habituell erfordert das, sich auf die von den Testern anvisierte Ebene der Problembearbeitung zu begeben, also nicht zu tiefgründig oder zu oberflächlich zu denken. Wer intellektuell zu weit nach unten oder oben denkt, ist einer erhöhten Gefahr des Scheiterns ausgesetzt.
Testfähigkeit hat hier also eine inhaltliche und eine habituelle Dimension und kennzeichnet eine Tendenz zum Mittelmaß. Die Klippe kann auch umschifft werden, indem man das Seltsame übergeht und begriffliche oder inhaltliche Exaktheit vermeidet. – Es geht nicht darum, das Problem der Aufgabe vollständig zu verstehen, sondern es geht um die im Sinne des Tests richtige Lösung. An dieser Stelle wird deutlich, wie Tests die viel beklagte Resultatsorientierung (statt Inhaltsorientierung) der Schüler reproduzieren.
Das Umschiffen der Klippe muss nicht nur inhaltlich erfolgreich geschehen, sondern es muss auch unter möglichst geringem Zeitverlust geschehen, denn Zeit ist in einem Test eine kostbare Ressource. Testfähigkeit bedeutet dabei zu wissen, dass es auf das einzelne Wort nicht so sehr ankommt und dass man das Seltsame übergehen muss. Man nimmt es möglichst gar nicht zur Kenntnis oder erschließt aus dem Rest des Textes möglichst schnell, dass hier keine Falle lauert. Die Möglichkeit, dass es sich um ein wichtiges Wort handelt bzw. dass ein Begriff der fachlichen Präzision wegen eingeführt ist, ist die große Gefahr für den Testfähigen. Beim Auftreten eines „seltsamen“ Wortes oder einer seltsamen Konstruktion in einem Test ist es aber ausgesprochen unwahrscheinlich, dass es sich um eine begriffliche Präzisierung handelt, die für die Erbringung der richtigen Antwort unbedingt verstanden werden muss. Testfähigkeit bedeutet hier, keine Zeit mit Nachdenken zu vertun.
Die dargestellte Komponente von Testfähigkeit kann kaum (im Sinne des Arguments, dass Testfähigkeiten als Bildungsziele taugen bzw. mit ihnen korrespondieren) als Bildungsziel deklariert werden: Zwar geht es beim Rezipieren von Texten immer auch darum, bei der schnellen Erfassung von Textinhalten die in den Texten auftretenden Prätentionen und Fehler zu „überlesen“, sich also an ihnen vorbei den Inhalt zu erschließen. Daraus lässt sich aber keine Rechtfertigung für das Mitmessen dieser Komponente von Testfähigkeit konstruieren, weil damit eine Tendenz zur Normalisierung von Fehlern verbunden ist: Der Schüler wird gezwungen, Defizite der Testerstellung zu übergehen und damit zu akzeptieren, statt sie zurückzuweisen – das kann er wegen der autonomiezerstörenden Grundstruktur bei Tests nicht ungestraft. Auch die Vermeidung von Irritation durch schulmeisterlichen Fachsprachengebrauch kann nur unter großen Verbiegungen als Bildungsziel deklariert werden: Man müsste dazu voraussetzen, dass das Fachsprachliche einen Wert außerhalb des Fachlichen hat. Mir scheint das Fachsprachliche aber nur einen Wert zu haben, wenn dadurch Fachliches transportiert oder konstruiert wird. Die latente Unterminierung des Fachlichen durch das Fachsprachliche scheint mir als Bildungsziel wenig geeignet.
5.3.2 Irritationen durch misslungene künstliche Beschleunigungen und Vereindeutigungen
Eine weitere Komponente von Testfähigkeit tritt auf, wenn versucht wird, den Schüler künstlich schneller durch den Test zu schleusen. Dabei wird in einigen Fällen mit der manifesten Konstruktion einer Eindeutigkeit genau diese Eindeutigkeit latent zerstört. Dasselbe Prinzip tritt auf, wenn eine textliche Konstruktion, die die Texterfassung beschleunigen soll, Irritationspotenzial entfaltet, welches die Texterfassung verzögert.
Das erste Beispiel findet sich in der Konstruktion Wie viele von den kleinen Quadraten … in der TIMSS-Aufgabe A1 (vgl. 5.3.1). Hier wird besonders auf die kleinen Quadrate verwiesen. Dieser Verweis soll vereindeutigen, denn es wird ausgeschlossen, dass sich der Schüler mit den aus den kleinen Quadraten zusammengesetzten „großen“ Quadraten auseinandersetzt. Der Verweis verwirrt aber auch, denn die Wahrscheinlichkeit, dass sich Schüler von sich aus mit den „großen“ Quadraten auseinandersetzen, ist ausgesprochen gering. Man wird also darauf gestoßen, besondere kleine und eventuell sogar große Quadrate zu suchen. Lediglich als Hilfe für den Schüler, der nicht weiß, was Quadrate sind, könnte man sich „kleine Quadrate“ vorstellen. Dieses Argument zerbricht aber daran, dass außer den Quadraten gar nichts da ist, womit man arbeiten kann. Die Vereindeutigung zerstört sich also selbst.
Testfähigkeit besteht hier darin, sich von solchen testvereindeutigenden und beschleunigenden Konstruktionen nicht irritieren zu lassen: Der testfähige Schüler ist also mit solchen Konstruktionen vertraut und weiß (implizit oder explizit), dass es lediglich um Vereindeutigung geht und dass über diese schlichte Funktion nicht hinausgedacht werden muss. Es geht darum, auf eine vielschichtige Auseinandersetzung mit der Aufgabe gerade zu verzichten, also nicht über die Rolle von großen und kleinen Quadraten und über die vielfältigen Möglichkeiten des Umgangs mit Mengen in der Zeichnung nachzudenken – wie es der explizite Verweis auf die kleinen Quadrate zunächst nahelegt. Wenn man auf vieldimensionales Nachdenken verzichtet und sich auf das Setzen des richtigen Kreuzes konzentriert, dann wird die Bearbeitung der Aufgabe durch „kleine“ vielleicht sogar wirklich beschleunigt.
Das gleiche Prinzip wiederholt sich in A1 mit … muss man ZUSÄTZLICH schattieren … Die Großschreibung scheint zunächst eine Hilfe zu sein, da sie vor der Angabe der insgesamt zu schattierenden Quadrate warnt. Diese Hilfe ist aber nicht notwendig, weil die durch Multiple Choice angegebenen Lösungsvarianten dem Schüler seinen Irrtum signalisieren würden. Auch an dieser Stelle erfährt ein testfähiger Schüler einen Vorteil, weil er mit einer solchen testbeschleunigenden Konstruktion vertraut ist. Der testunerfahrene Schüler wird eher irritiert sein, weil in der Schriftsprache normaler Texte, auch bei schulischen Texten, Wörter in Großbuchstaben eine derart starke Exponierung erzeugen, dass ein Nachdenken über den Grund der Exponierung angezeigt ist. Testfähigkeit bedeutet hier, den Grund der Exponierung bereits zu „kennen“: Vermeidung naheliegender Fehler. Der Beschleunigungsvorteil durch diese Exponierung gilt natürlich nur für jenen, der der impliziten Aufforderung widersteht, über den Grund der Exponierung nachzudenken. Auch hier bedeutet Testfähigkeit wieder, nicht über den Text nachzudenken, sondern dem Prinzip zu folgen, dass es um das Kreuz an der richtigen Stelle geht, nicht um inhaltliche Auseinandersetzung.
Die Komponente der irritationshaltigen Beschleunigung findet sich auch in der Formulierung Welche der Aussagen … ist FALSCH? von A5 (vgl. 5.3.1). Ursache ist hier eine Prätention: Eine mathematisch anspruchslose Fragestellung wird zunächst künstlich verkompliziert. Hier ist reine Fleiß- und Konzentrationsarbeit zu verrichten, deren Anspruch aus dem zu lösenden Problem heraus nicht zu begründen ist. Die künstliche Verkomplizierung durch die Umkehr des Anspruchs – man soll benennen, was falsch ist – motiviert wiederum die Hervorhebung durch Großbuchstaben: Das Ungewöhnliche muss hervorgehoben werden, um eine Verwechslung mit der erwartbaren Anforderung zu vermeiden. In der Variante „… ist richtig“ käme der Gedanke, RICHTIG großzuschreiben, nicht auf.
Ein weiteres Beispiel findet sich in der PISA-Aufgabe „Pyramide“:
Die Grundfläche einer Pyramide ist ein Quadrat. Jede Kante der skizzierten Pyramide misst 12 cm. Berechne den Flächeninhalt der Grundfläche ABCD.
Im zweiten Satz wird hier die Option des Vorhandenseins zweier Pyramiden eröffnet, nämlich der Pyramide des ersten Satzes und der skizzierten Pyramide. Manifest erfolgt durch die Einfügung des skizzierten eine Lesebeschleunigung durch die explizite Verknüpfung von Text und Bild. Latent wird eine Irritation erzeugt. Die Aufgabe an den Schüler lautet, sich von dieser Irritation nicht ergreifen zu lassen, also darüber hinwegzulesen.
Auch die hier beschriebene Dimension von Testfähigkeit lässt sich als Bildungsziel diskutieren: Schließlich gibt es solche Brechungen zwischen dem textlich Gewollten und dem damit produzierten Irritierenden auch in den Texten, auf deren Rezeption Unterricht die Schüler vorbereitet. Man kann es zum Bildungsziel erklären, einen Umgang damit zu finden und zu lernen, diese Irritationen zu überwinden. Das Argument ist allerdings zynisch: Autonomievergrößerung würde bedeuten, das Auseinanderlaufen verschiedener Textebenen in irgendeiner Weise zu thematisieren. Dem Schüler würde dabei ermöglicht, Distanz zum Text und damit auch zum Prozess der Leistungskontrolle zu erlangen. Er könnte sich damit intellektuell von schulischen und leistungsbewertenden Prozessen emanzipieren. In einem Test ist er diesen Prozessen ausgeliefert, weil er den Punkt auch dann nicht bekommt, wenn er die Aufgabe intellektuell brillant zurückweist. Mir scheint es weitaus einleuchtender, dass die Tester der Verpflichtung unterliegen, Irritationspotenzial zu vermeiden, indem sie gebrochene Vereindeutigungen und Beschleunigungen unterlassen. Dies ist aber offenbar nur möglich, wenn man diese Brüche überhaupt erkennt. Die erste Voraussetzung dafür ist ein Perspektivwechsel: Der Tester darf sich nicht nur darauf konzentrieren, was er hören will, sondern muss sich fragen, was der Text wirklich verlangt und ob das mit dem zusammenläuft, was er will. Die zweite Voraussetzung ist dann nur noch eine gewisse Textsensibilität. Objektive Hermeneutik bietet dieser Sensibilität ein Instrument methodischer Kontrolle.
5.3.3 Fehlbarkeit der Tester
PISA-Aufgabe ÄPFEL
Ein Bauer pflanzt Apfelbäume an, die er in einem quadratischen Muster anordnet. Um diese Bäume vor dem Wind zu schützen, pflanzt er Nadelbäume um den Obstgarten herum.
Im folgenden Diagramm siehst du das Muster, nach dem Apfelbäume und Nadelbäume für eine beliebige Anzahl (n) von Apfelbaumreihen gepflanzt werden:
Äpfel 1: Vervollständige die Tabelle:
Äpfel 2: Es gibt zwei Formeln, die man verwenden kann, um die Anzahl der Apfelbäume und die Anzahl der Nadelbäume für das oben beschriebene Muster zu berechnen:
Anzahl der Apfelbäume = n²
Anzahl der Nadelbäume = 8n
wobei n die Anzahl der Apfelbaumreihen bezeichnet.
Es gibt einen Wert für n, bei dem die Anzahl der Apfelbäume gleich groß ist wie die Anzahl der Nadelbäume. Bestimme diesen Wert und gib an, wie du ihn berechnet hast.
Äpfel 3: Angenommen, der Bauer möchte einen viel größeren Obstgarten mit vielen Reihen von Bäumen anlegen. Was wird schneller zunehmen, wenn der Bauer den Obstgarten vergrößert: die Anzahl der Apfelbäume oder die Anzahl der Nadelbäume? Erkläre, wie du zu deiner Antwort gekommen bist.
Die Aufgabe „Äpfel“ hat sich in der näheren Betrachtung¹⁴ als produktive und mathematisch gehaltvoll erweiterbare unterrichtliche Aufgabe herausgestellt, die aber als Testaufgabe ungeeignet ist: Unter anderem wird in Äpfel 2 eine Formel für das Lösen von Äpfel 1 nachgereicht. Testfähigkeit besteht hier nicht vorrangig darin, erkennen zu können, in welcher Weise die Tester die Lösung oder Teile der Lösung bereits mitgeliefert haben. Schließlich hat es wenig Sinn, Aufgaben gezielt auf solche Möglichkeiten hin zu durchsuchen – dazu sind diese Möglichkeiten zu selten.
Testfähigkeit besteht vielmehr darin, den Gedanken der Fehlbarkeit der Tester zuzulassen und die Fehlung dann auch auszunutzen. Immerhin handelt es sich beim Test um ein Instrumentarium, das sich auf die exponierte Zielgenauigkeit und Sorgfalt des Wissenschaftlichen beruft und bereits durch seinen Umfang und sein Auftreten signalisiert, dass hier viele Leute lange darüber nachgedacht haben, was sie an den Schüler herantragen. Es verlangt ein gewisses Maß an Autonomie, Gelassenheit, Abstand oder Unverfrorenheit, um den Gedanken zuzulassen, dass dieser riesige Apparat in Aufgabe 2 die Formel für die Lösung von Aufgabe 1 reinschreibt.
¹⁴ Winter (2005), Meyerhöfer (2004a, S. 203 f.), Meyerhöfer (2005, S. 171 ff.)
5.3.4 Primat des gewünschten Resultats vor dem mathematischen Anspruch, Unverfrorenheit gegenüber dem mathematischen Anspruch, Möglichkeiten von Multiple Choice
Eine Zuspitzung erfährt die eben beschriebene Dimension in der TIMSS-Aufgabe M7, in der die mathematische Anforderung unterminiert wird:
AB ist in dieser Zeichnung eine Gerade. Wieviel Grad mißt Winkel BCD?
A) 20
B) 40
C) 50
D) 80
E) 100
Hier soll der Schüler offenbar erkennen, dass 9x gleich 180 Grad ist, und daraus das Winkelmaß von 80 Grad für BCD bestimmen. Sehr viel effektiver ist es, mit Hilfe der Multiple-Choice-Angebote abzuschätzen, dass es nur 80 Grad sein können. Dass der Schüler eigentlich rechnen soll, erkennt man am ersten Satz und an der Tatsache, dass die außergewöhnlichen Bezeichnungen 5x und 4x angebracht sind. In einer Schätzaufgabe würde so etwas nicht vorkommen.¹⁵
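Der intendierte rechnerische Weg lässt sich kompakt nachvollziehen (eine Skizze; vorausgesetzt wird die hier nicht abgebildete Zeichnung, in der sich der gestreckte Winkel an der Geraden AB aus den Teilwinkeln 5x und 4x zusammensetzt):

```python
# Gestreckter Winkel an der Geraden AB: 5x + 4x = 9x = 180°
x = 180 / 9                  # x = 20°
angle_bcd = 4 * x            # Winkel BCD = 4x = 80°, Antwort D
print(x, angle_bcd)          # 20.0 80.0
```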
Ein Schüler, der das Problem rechnerisch nicht lösen kann, hat hier Glück, er kommt nämlich nicht in Gefahr, Zeit zu verschwenden. Für ihn besteht lediglich die Aufgabe, sich zu trauen, einfach das anzukreuzen, was er sieht. Das ist nicht trivial, denn mancher Schüler traut sich nicht, das Offensichtliche hinzuschreiben, wenn er spürt bzw. merkt, dass er eigentlich rechnen soll. Ein Schüler, der das Problem rechnerisch lösen kann, kommt ebenfalls zum richtigen Ergebnis – wenn er nicht in die Fallen der Lösungsangebote A oder E fällt. Er verbraucht aber sehr viel von der kostbaren Ressource Zeit. Um diese Zeit einzusparen, benötigt er eine gewisse Unverfrorenheit gegenüber der rechnerischen Anforderung, gepaart mit einer gewissen Cleverness im Erkennen der durch Multiple Choice geschaffenen Möglichkeiten. Für das Problem der Testfähigkeit ergibt sich damit eine weitere Komponente: Man muss sich trauen, einen nichtrechnerischen Weg zu gehen, auch wenn offenbar Rechnen verlangt ist. Man muss also unverfroren gegen die Anforderung handeln, denn es geht nicht um den mathematischen Inhalt, sondern um das Kreuz an der richtigen Stelle. Die Aufgabe M7 eignet sich geradezu ideal dazu, Menschen zu identifizieren, die sich clever und effektiv der eigentlichen Anforderung stellen und dabei unverfroren gegen die manifest intendierte Aufgabe handeln. Im Vergleich dazu erscheint das Bedienen der rechnerischen Intention als braves Abarbeiten von fehlerbehafteten und probleminadäquaten mathematikunterrichtlichen Techniken.
¹⁵ Interpretation siehe Meyerhöfer (2001)
Die gleiche Dimension von Testfähigkeit wird in den Aufgaben „Bauernhöfe“ (vgl. 5.3.1) und „Dreieck“¹⁶ mitgemessen. Dort ist (in unterschiedlich starkem Maße) die Verwendung von lokalem Satz- und Formelwissen gefragt. Tendenziell schneller sind die Wege über Intuition bzw. Messen. Den höchsten Zeitverlust hat dort derjenige, der genuin mathematisch denkt und handelt.
Reinhard Woschek (2005) hat im Rahmen seiner Dissertation untersucht, auf welch unterschiedliche Weisen deutsche und Schweizer Schüler TIMSS-Aufgaben lösen. Bei M7 stellte er fest, dass deutsche Schüler fast nur rechnen und auch oft damit scheitern. Schweizer Schüler hingegen schätzen fast nur.
Es gibt natürlich auch in Deutschland Lehrer, die möchten, dass ihre Schüler an dieser Stelle schätzen – jedenfalls wenn nicht gerade das Aufstellen von und Umgehen mit Gleichungen angesagt ist. Testfähigkeit läuft hier zwar gegen die Aufgabenintention, hat aber durchaus einen Charakter, der Lehrintentionen entsprechen kann: Man kann durchaus wollen, dass die Schüler mit gegebenen Problemen nicht stur rechnerisch umgehen, sondern sie der Situation angemessen möglichst effektiv lösen. Allerdings bleibt unklar, warum man gerade das künstliche und statische Instrument der Multiple-Choice-Aufgabe wählen sollte, um sich einem dynamischen, problemadäquaten und effektiven Umgang mit mathematischen Problemen zu nähern, die noch dazu nach rechnerischer Bearbeitung verlangen. Man sollte sich auch vor Augen halten, dass es zynisch wäre, eine rechnerische Anforderung künstlich zu suggerieren bzw. zu konstruieren, die nicht aus der Sache selbst erwächst.
¹⁶ vgl. Deutsches PISA-Konsortium (2001, S. 178); Diskussion bei Meyerhöfer (2004b)
5.3.5 Egal wie wenig du weißt, schreibe immer irgendetwas hin.
Eine elementare Komponente von Testfähigkeit lässt sich in die Aufforderung umschreiben: Egal wie wenig du weißt, schreibe immer irgendetwas hin. Die Multiple-Choice-Variante dieser Aufforderung unterstreicht das Prinzip: Wenn du nichts weißt, dann kreuze irgendetwas an, und zwar möglichst das, was dir am meisten einleuchtet. Die Diskussion um das Raten bei Tests¹⁷ kann man darauf zuspitzen, dass sich alle Populationsunterschiede in der Testleistung mit dem unterschiedlichen Grad der Verinnerlichung dieser Komponente von Testfähigkeit erklären lassen. Diese Behauptung ist zwar ebensowenig überprüfbar wie die Behauptung, Raten spiele keine Rolle. Aber wenn wir von Wuttke (2007, Abschnitt 3.12) erfahren, dass bereits Lösungsunterschiede von einer halben Aufgabe in der PISA-Skalierung als relevanter Unterschied (9 Punkte) gedeutet werden, dann offenbart das die Anfälligkeit des Konstrukts für Rateprobleme.
Ich möchte das Problem hier nur für offene Antwortformate (die allerdings im kategorialen Vorgehen immer geschlossen kodiert werden) diskutieren: In der Aufgabe „Äpfel 2“ (vgl. 5.3.3) heißt es: Es gibt einen Wert für n, bei dem die Anzahl der Apfelbäume gleich groß ist wie die Anzahl der Nadelbäume. Bestimme diesen Wert und gib an, wie du ihn berechnet hast.
Es sind aber zwei Werte, nämlich 0 und 8. Aus den Lösungskodierungen der PISA-Gruppe geht hervor, dass ein Schüler, der n = 8 angibt, den Lösungspunkt erhält, auch wenn er keine Begründung bzw. Berechnung angibt – wenn er also die Aufgabenstellung nicht erfüllt. Ein Schüler, der nur n = 0 angibt, erhält hingegen keinen Punkt, selbst wenn er seine Antwort begründet und es bei diesem Wert belässt, weil schließlich im Aufgabentext nur ein Wert gefordert ist. Man kennt natürlich nie die Kodierungsanweisungen der Tester, wenn man getestet wird. Aber es wird deutlich, dass es nicht in jedem Fall darum geht, die Aufgabe wirklich zu erfüllen. Bereits das Hinschreiben einer Teillösung führt zum Punkt.
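Dass die Gleichung n² = 8n tatsächlich genau die beiden Werte 0 und 8 liefert, lässt sich unmittelbar prüfen (eine Skizze):

```python
# n² = 8n  <=>  n·(n − 8) = 0, also n = 0 oder n = 8;
# die Aufgabenstellung spricht dagegen nur von "einem Wert"
solutions = [n for n in range(100) if n ** 2 == 8 * n]
print(solutions)             # [0, 8]
```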
Es liegt nahe einzuwenden, dass die Lösung Null für den realen Kontext eher irrelevant ist. Das stimmt zwar inhaltlich, setzt aber eine Kernerfahrung mit Mathematikunterricht nicht außer Kraft: Dort geht es unsystematisch – das heißt, nicht unbedingt durchschaubar aus der Sache heraus begründet, sondern gelegentlich aus dem Belieben des Lehrers heraus erscheinend – immer wieder um solche „Randbetrachtungen“. Für den Schüler bleibt gerade in Tests undurchschaubar, in welchem Maße er „Randbetrachtungen“ mit zu leisten hat (und leisten darf). Die Unsicherheit wird dadurch gestärkt, dass es um das Reale offensichtlich gar nicht geht.
¹⁷ vgl. Meyerhöfer (2004c), Lind (2004)
Diese Herabwürdigung des Realen lässt in der PISA-Aufgabe „Sparen“¹⁸ den Eindruck entstehen, man könne irgendetwas über den Zinseszins hinschreiben – womöglich sogar, ohne ihn berechnet zu haben – und könnte trotzdem den Punkt erhalten.
Irgendetwas hinzuschreiben erweist sich auch als sinnvoll, wenn man sich die Kodierungspraxis vor Augen hält: Ein Kodierer – meist ein schlecht bezahlter Student – muss in einem entfremdeten Arbeitsprozess unter Zeitdruck eine große Menge an schlecht lesbaren Schülernotizen entziffern. Er muss versuchen, dem Geschriebenen einen Sinn abzuringen und diesen Sinn mit einer umfangreichen, die Wirklichkeit aber doch nur holzschnittartig erfassenden Bewertungsvorschrift in Einklang zu bringen. Er steht im ständigen Konflikt, dass einerseits die von ihm geleistete Geltungserzeugung dem Anspruch der Wissenschaftlichkeit ausgesetzt ist, dass ihm andererseits aber keine wissenschaftliche Methode der Geltungserzeugung zur Verfügung steht. (Der Konflikt existiert unabhängig vom Bewusstsein des Kodierers. Allerdings sind die Kodierer direkt mit dem Text konfrontiert und dürften am deutlichsten spüren, dass die Kategorisierungen weder das Latente noch die vielen verschiedenen Ausprägungen von Verstehen oder von Können zu greifen vermögen.) Ergebnis seines Tuns soll eine undifferenzierte Null-Eins-Entscheidung sein, und in gewisser Weise ist es auch egal, ob man sorgfältig oder gültig bepunktet oder nicht: Der Kodierer spürt ja unmittelbar die Brüchigkeit, mit der im Kodierungsverfahren die Geltung der Testaussage erzeugt wird. Er spürt hautnah die Illusion des Punktwertes.
¹⁸ Sparen
Karina hat 1000 DM in ihrem Ferienjob verdient. Ihre Mutter empfiehlt ihr, das Geld zunächst bei einer Bank für 2 Jahre festzulegen (Zinseszins!). Dafür hat sie zwei Angebote:
a) „Plus“-Sparen: Im ersten Jahr 3 % Zinsen, im zweiten Jahr dann 5 % Zinsen.
b) „Extra“-Sparen: Im ersten und zweiten Jahr jeweils 4 % Zinsen.
Karina meint: „Beide Angebote sind gleich gut.“ Was meinst du dazu? Begründe deine Antwort!
Die Differenz zwischen beiden Angeboten beträgt zehn Pfennige, und es bleibt unklar, was die Tester jetzt hören wollen: Sind 10 Pfennig Unterschied noch „gleich gut“ oder nicht? Die Frage ist wegen der unbekannten sonstigen Bedingungen offensichtlich nicht beantwortbar: Selbst bei einer Anlage von 100 000 DM wäre der Unterschied ja nur 10 DM, also durch jede Kontoführungsgebühr bzw. andere Nebenkosten, Fahrtkosten zur Bank, selbst durch Mitnehmen von Werbegeschenken ausgeglichen.
Das Problem für den Getesteten ist immer, erfolgreich zu erahnen, was die Tester hören wollen. Hier bleibt das unklar. Es könnte sogar sein, dass man auf eine sinnvolle Argumentation hin einen Punkt bekommt, egal wie man sich entscheidet. Ein Element von Testfähigkeit besteht hier darin, trotz der in ihrer Bedeutung für die Antworterwartung nicht einschätzbaren Zehn-Pfennig-Differenz irgendetwas hinzuschreiben. Da das Reale hier ohnehin nicht ernst genommen wird, könnte man den Punkt erhalten, wenn man irgendetwas über Zinseszins hinschreibt – womöglich sogar, ohne ihn berechnet zu haben. (Näheres vgl. Meyerhöfer 2004, S. 199 f., Meyerhöfer 2005, S. 166 f.)
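Die in der Fußnote zur Aufgabe „Sparen“ genannte Zehn-Pfennig-Differenz der beiden Angebote lässt sich direkt nachrechnen (eine Skizze):

```python
# "Plus"-Sparen: 3 % im ersten, 5 % im zweiten Jahr (mit Zinseszins)
plus = 1000 * 1.03 * 1.05    # 1081,50 DM
# "Extra"-Sparen: zweimal 4 %
extra = 1000 * 1.04 * 1.04   # 1081,60 DM
diff = extra - plus          # zehn Pfennige Unterschied
print(round(plus, 2), round(extra, 2), round(diff, 2))  # 1081.5 1081.6 0.1
```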
Die Kodierungspraxis birgt also eine sowohl im Kategorisierungsprinzip liegende als auch eine menschliche Komponente von Willkür – und diese Willkür ist ein wesentlicher Unterschied zur Klassenarbeit, nach der der Lehrer immer unter Rechtfertigungs- und damit unter Fairnessdruck steht. Es wird deutlich, dass man wenig Einfluss auf die „Gnadenstimmung“ des Kodierers und auf den Kategorienkatalog hat, dass es aber die Chance auf eine positive Bewertung erhöht, wenn man irgendetwas hinschreibt.
Diese Komponente von Testfähigkeit ist zwar testimmanent (unabhängig davon, ob eine Testaufgabe gelungen oder misslungen ist), bewegt sich nichtsdestotrotz nah an Fähigkeiten, die in Klassenarbeiten benötigt werden, denn auch dort geht es darum, durch das Hinschreiben von Fragmenten „Punkte zu schinden“, selbst wenn man wenig weiß. Auch dieser Komponente von Testfähigkeit mag man deshalb „Bildungswert“ zuschreiben. Es ist aber ein rein innerschulischer Wert: Das Versammeln von Halbwissen oder Fragmenten von Wissen dient hier keiner Annäherung an Bildungsgut durch Versammeln des bereits Gewussten, durch seine Reflexion, Aufarbeitung und Erweiterung. Es dient lediglich dem Bedienen fremdgesetzter Anforderungen in einer asymmetrischen Konstellation, deren inhaltliche Füllung zunächst keinem Bildungsprozess dient.
5.3.6 Nichtrespektierung der Autonomie und Authentizität des Mathematischen wie des Realen; Schein des Realen; spezifische Realität der Tester
Wenn Tester sich dem Realen außerhalb der Mathematik zuwenden, öffnet sich ihnen ein mannigfaltiges Potenzial einer Produktion von Verwerfungen, deren Bearbeitung Testfähigkeiten erfordert. Da gibt es in der Aufgabe „Bauernhöfe“ (vgl. 5.3.1) Dachböden, die Quadrate sind, da gibt es Mitten von Strecken, da werden Modellierungsanforderungen behauptet und zerstört.
Ich habe in meiner Untersuchung zu PISA¹⁹ diskutiert, wie sich diese Verwerfungen vermeiden lassen: Grundbedingung ist, das Reale und das Mathematische in ihrer Autonomie und Authentizität zu respektieren. Damit ist die Grundrichtung der hier zu beschreibenden Testfähigkeit abgesteckt: Die Nichtrespektierung der Autonomie und Authentizität ist zu bearbeiten.
¹⁹ Meyerhöfer (2004, Kapitel 5), Meyerhöfer (2005, Kapitel 5)
In TIMSS item A2 (cf. 5.3.1) the task is “thrown back and forth” between the real, the mathematical and the physical. One way of failing arises there if one takes for a brick what looks like a brick, namely the “half” brick. The mistake suggests itself because bricks with a square cross-section are rarely encountered, and because one never splits them lengthwise, as has been done here. For the problem of test-wiseness we learn: do not believe the semblance of the real. One must instead enter the specific reality of the testers.

In this reality bricks are split lengthwise, mother calls out “compound interest!”, and schoolgirls allegedly draw roofs of farmhouses that are not farmhouses. This world resembles the world of textbooks, and surely also the world of mathematics teaching. To that extent, the ability to enter the testers’ reality probably coincides with the ability to enter the specific reality of mathematics teaching. This component of test-wiseness thus bears a certain resemblance to a component of ability that is also at issue in mathematics lessons. The difference, however, is a constitutive one: in mathematics teaching, it seems to me, part of the idea of Bildung is the demand to take into view the specificity of the real and the specificity of the mathematical. What is at stake is precisely not the unreflected adoption of task patterns. Such unreflected adoption one would ascribe to a mathematics teaching that fails its educational mission. Disregarding the autonomy and authenticity of the mathematical in particular, but also of the real, therefore cannot be justified in terms of mathematics education. It is accordingly problematic when tests respect neither. And it is problematic, in the sense of an educational mission, that in none of the published PISA items, and in none of the TIMSS items presented to all students, is the specificity of the real and of the mathematical at issue. In this respect both tests work past the educational mission of mathematics teaching. The components of test-wiseness they measure along the way merely serve adaptation to a mathematics teaching that does not fulfil its educational mission.
It is equally worrying how often textbook tasks fail to face up to the specificity of the real and of the mathematical; in this respect the educational mission of mathematics teaching runs partly counter to the practice of textbook tasks. That does not destroy the mission, however; it merely hampers its realisation (and illustrates how poorly the educational mission is anchored in the field). In tests, by contrast, there is nothing but the items themselves. They constitute the whole.

TESTFÄHIGKEIT – WAS IST DAS? (Test-wiseness – what is it?)
5.3.7 Not taking the real problem seriously; dominance of the simple; the penalty on exact reasoning and on creative or intellectually demanding work; the testers’ wishes as the measure of what to do
Closely tied to the demand not to surrender to the semblance of the real is the demand not to take the real or quasi-real problem seriously. If, in TIMSS item A2 (cf. 5.3.3), one takes the problem seriously and works it through taking into account the distances of the bodies from the centre of the balance, one finds that the brick weighs 2.62 kg. But there is no multiple-choice option for that. One would accordingly round to 3 kg and thereby obtain a “wrong” result, because the testers want to see 2 kg ticked. Here, then, is a case in which a student would arrive at a result that is wrong in the testers’ sense, although he solves a demanding problem and probably can also do what the testers believe they are measuring. Put simply: a student who is “too clever” arrives at the wrong result. Phrased as a deficit: this student does not recognise on which level he is supposed to argue here. The moment of irritation is also present if one merely “stumbles” over the picture because one takes the problem statement seriously. The additional task then consists in recognising that one must not take the problem seriously but should make a simpler consideration instead. A test-wise student who leaves out exact reasoning from the outset thereby gains a time advantage on this item. One could thus formulate this component of test-wiseness as follows: you are not to take seriously and solve the problem that is posed; rather, you are to find out what the testers want you to write down or tick. From what is known about testing one may add: it is more likely that you are meant to work simply than that you are meant to work creatively or in an intellectually demanding way.
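The two levels of reasoning contrasted above can be sketched in a few lines. The setup is reconstructed, not quoted from the item: a whole brick is assumed to balance half a brick plus a 1 kg weight, which yields the rewarded equal-arm answer of 2 kg; the distances used for the “too clever” torque reading are hypothetical, since the original TIMSS drawing is not reproduced here.

```python
# Sketch of the two readings of the TIMSS balance item discussed above.
# Assumed setup (reconstruction, not a quotation): a whole brick balances
# half a brick plus a 1 kg weight.

def naive_brick_weight(extra_kg=1.0):
    """Equal-arm reading the testers reward: B = B/2 + extra  =>  B = 2*extra."""
    return 2.0 * extra_kg

def torque_brick_weight(d_brick, d_half, d_weight, extra_kg=1.0):
    """Exact reading: take the drawn distances from the pivot seriously.
    Torque balance: B*d_brick = (B/2)*d_half + extra_kg*d_weight,
    hence B = extra_kg*d_weight / (d_brick - d_half/2)."""
    return extra_kg * d_weight / (d_brick - d_half / 2.0)

print(naive_brick_weight())  # 2.0, the rewarded answer
# Hypothetical distances in cm, merely to show the effect of reading the
# picture exactly; the real figure is not reproduced here.
print(round(torque_brick_weight(8.0, 6.0, 13.0), 2))  # 2.6
```

With distances measured from the printed figure, Meyerhöfer arrives at 2.62 kg; the point of the sketch is only that any serious reading of the drawing moves the answer away from the ticked option.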
Whoever takes the problem seriously in the item “Farmhouses” does not even learn which length he is supposed to determine, because the beam to be calculated must have a trapezoidal cross-section. Fortunately, the text destroys the modelling demand so thoroughly that one will not take the problem seriously.
6 Synthesis
“Test-wiseness” (Testfähigkeit) describes those items of knowledge, abilities and skills that are co-captured or co-measured in a test but that one would not subsume under the concept of “mathematical proficiency”.

When a test co-measures test-wiseness, the following problems arise:
– More is measured than what is supposed to be measured, namely mathematical proficiency. The measurement result would thus be distorted.
– Tests set standards. If tests co-measure test-wiseness, these abilities become part of the standard. The empirical analysis has shown that this is no marginal phenomenon but touches the core of mathematical Bildung. It has likewise shown that test-wiseness cannot be reinterpreted as educational substance. Attempts to do so prove, in the empirical analysis, to be superficial, often wrong, and mostly cynical.
– It is conceivable to tackle the problems identified by means of test training. The main problem arises, however, when latent test training takes place through the frequent working of test items. (For this there exists the at-bottom cynical euphemism “test culture”.) The phenomena that damage the idea of Bildung then unfold their effect insidiously. I cannot pursue this thought here, but would point out that Adorno’s “Theorie der Halbbildung” (Adorno 1972) opens the problem up further.
Against this background, the concept of the German educational standards (Bildungsstandards) needs to be reconsidered. They focus on the testing of “competences”. A test is currently being developed, far exceeding the PISA test in its dimensions, which is meant to congeal the educational standards for mathematics into test form. The idea that tests set standards here reaches a radicalised practice. This standards test is being constructed in a way in which the problem of test-wiseness cannot be dealt with. Given the great impact the educational-standards tests will have on mathematics teaching, test-wiseness will thereby become a standard phenomenon in German schools. I therefore propose stopping the construction of this test.
I would now like to summarise the empirically derived components of test-wiseness.
The supreme principle for the test-taker is: the point of the test situation is not that a mathematical problem be opened up, that a thought be unfolded, or that an argument be brilliantly developed. The point is to put the cross in the place the testers want (the “right” place), to write down the number the testers want, or to unfold a thought just far enough for the coder to award a point for it. Occasionally, genuine engagement and what is wanted coincide, i.e. occasionally an item does test mathematical Bildung or proficiency.
The point is to adapt to the tendency toward mediocrity that is inherent in tests. Looking at this mediocrity “from below”, the phenomenon appears unproblematic at first glance, because one can then simply work on the item to the best of one’s knowledge. Put somewhat plainly: someone who is remote from education may not be damaged in his intellectuality by acquiring test-wiseness; he is merely hindered in its development. He has additional, unspoken and superfluous material to master. That robs capacity for relevant content. For the educationally disadvantaged clientele it additionally sharpens the problem of disadvantage through the unspoken (cf. Bourdieu/Passeron 1971). If, conversely, one tends to approach the world intellectually, to question it seriously, to unfold thoughts in their many layers and to work on mathematical problems to the point of one’s own understanding, then one loses time in tests; occasionally one also ends up with a correct or “also correct” or “from this angle also correct” result, which however is not the desired and hence rewarded result. The principle is: avoid depth and complexity!
This holds with a particular colouring when testers venture into the real. Here it is important not to take the real “dressings” of the tasks seriously. One must find out in which specific reality the testers are moving. One fares best by simply asking: what do they want me to calculate? If reality and what is mathematically (usually: computationally) wanted do not quite converge, what is mathematically wanted always comes first. One must again be especially careful when the real invites more differentiated thinking: at such points the task is to find out what the testers want to hear, and not to occupy oneself with what the real would require; that costs time or leads to a result that earns no points.
Conversely, one must always write something down, no matter how little one knows. It is advisable to mark difficult items while working through the test. Depending on how much time remains at the end, one must engage in content-based guessing with plausibility considerations or, if need be, resort to lottery guessing (cf. Meyerhöfer 2004 c); with texts one must write down something, with numbers to be filled in, the number that seems most plausible.
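The pay-off of the “always write something” rule can be put as a small expected-value calculation. The scoring model assumed here (one point per item, no penalty for wrong answers, blanks score zero) is an illustration in line with the items discussed, not a quotation from any test manual.

```python
# Expected score gained by guessing on multiple-choice items, assuming one
# point per correct answer, no penalty for wrong answers, zero for blanks.

def expected_guess_score(n_items, n_options, n_eliminated=0):
    """Expected points from guessing uniformly among the options that remain
    after eliminating n_eliminated implausible ones (plausibility guessing)."""
    remaining = n_options - n_eliminated
    return n_items / remaining

# Ten four-option items left blank yield 0 points; lottery guessing yields
# 2.5 expected points, and eliminating two options per item yields 5.0.
print(expected_guess_score(10, 4))     # 2.5
print(expected_guess_score(10, 4, 2))  # 5.0
```

Under such a scoring model, leaving an item blank is strictly dominated by guessing, which is precisely why the co-measured ability to guess systematically counts as test-wiseness rather than as mathematical proficiency.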
Standardised tests are alien and wooden instruments. They hold ready manifold irritations which get washed in during the construction and operationalisation process. One should bear in mind that (tens of) thousands are to work on the test, that many different technical terms and ways of proceeding must therefore be covered, and that translation problems come on top. Now and then “a word goes astray”, sometimes more than one. It also happens that an item is meant to be made more comprehensible or quicker to grasp, and that irritating formulations arise as a result. Avoiding irritation here means: being able to read over it. Here too the pointer to mediocrity helps: usually the less complicated meaning is intended, and usually the individual word does not matter, so one can safely read past it. If one concentrates on what the testers want to hear, one also notices that the irritating element is often beside the point. Incidentally, one can always ask the test administrator. In principle he is not allowed to say anything. But he may sometimes clarify comprehension problems with individual words, and perhaps he will even tell you more.
Empirically, test-wiseness turns out to be unavoidable
– where multiple-choice options make guessing possible,
– where open answers are coded categorially into zero–one decisions,
– where differing uses of technical terms across different parts of the population to be measured have to be dealt with.

Test-wiseness occurs avoidably
– where a clash between the latent and the manifest level of the text creates potential for irritation,
– where the content the item appears to be about is not taken seriously. Initially it is not the student who fails to take the content seriously; rather, the item writer who produces the text does not take the content seriously. (This is, of course, didactically veiled.)
– where ambiguities or fuzziness arise concerning what is measured and what is supposed to be measured.
This empirical result is no surprise: test-wiseness plays a role precisely where an item is badly constructed, that is, where the latent and the manifest level of the text diverge, where the mathematical content is didactically distorted, where the operationalisation process has not been carried out carefully, and where the test-makers’ own didactic habitus goes unreflected. This kind of test-wiseness can therefore be dealt with, in the sense of avoidance, if the testers work on bringing the latent and the manifest level of the text together, if they do not themselves fall for didactic illusions and veilings, and if they operationalise their measurement constructs carefully.
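The zero–one coding named above as an unavoidable source of test-wiseness can be illustrated schematically. The example responses and the coding threshold are invented for the illustration; the point is only that any dichotomous coder collapses qualitatively different responses onto the same score.

```python
# Schematic illustration of dichotomous (zero-one) coding of open answers.
# Responses and the coding rule are invented for illustration only.

def dichotomous_code(response_quality):
    """Collapse a graded judgement of an answer (0.0 to 1.0) into the 0/1
    credit that a dichotomous coding scheme awards."""
    return 1 if response_quality >= 0.5 else 0

responses = {
    "full, reflected solution": 1.0,
    "fragment that happens to contain the keyword": 0.55,
    "thoughtful but differently framed answer": 0.45,
    "blank": 0.0,
}
for text, quality in responses.items():
    print(dichotomous_code(quality), text)
# The keyword fragment and the full solution both receive credit; the
# thoughtful divergent answer and the blank both receive none.
```

The sketch makes the asymmetry visible: rewriting the coding guide can move the threshold, but no threshold restores the distinctions the coding has already flattened, which is why this source of test-wiseness is classed as unavoidable rather than as a construction flaw.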
References

Adorno, Theodor W. (1972): Theorie der Halbbildung. In: Soziologische Schriften I (Gesammelte Schriften, Band 8). Frankfurt: Suhrkamp
Baumert, Jürgen; Klieme, Eckhard; Lehre, Manfred; Savelsbergh, Elwin (2000): Konzeption und Aussagekraft der TIMSS-Leistungstests. Zur Diskussion um TIMSS-Aufgaben aus der Mittelstufenphysik. In: Die Deutsche Schule, 92. Jg., Heft 1, S. 102-115, und Heft 2, S. 196-217
Bourdieu, Pierre; Passeron, Jean-Claude (1971): Die Illusion der Chancengleichheit. Stuttgart: Klett
Deutsches PISA-Konsortium (Hrsg.) (2001): PISA 2000. Basiskompetenzen von Schülerinnen und Schülern im internationalen Vergleich. Opladen: Leske + Budrich
Hagemeister, Volker (1999): Was wurde bei TIMSS erhoben? Eine Analyse der empirischen Basis von TIMSS. In: Die Deutsche Schule, 91. Jg., Heft 2, S. 160-177
Hembree, Ray (1987): Effects of Noncontent Variables on Mathematics Test Performance. In: Journal for Research in Mathematics Education, Vol. 18, No. 3, S. 197-214
Klieme, Eckhard; Maichle, Ulla (1989): Zum Training von Techniken des Textverstehens und des Problemlösens in Naturwissenschaften und Medizin. In: Trost, Günter (Hrsg.): Test für medizinische Studiengänge (TMS): Studien zur Evaluation (13. Arbeitsbericht). Bonn: Institut für Test- und Begabungsforschung, S. 188-247
Klieme, Eckhard; Maichle, Ulla (1990): Ergebnisse eines Trainings zum Textverstehen und zum Problemlösen in Naturwissenschaften und Medizin. In: Trost, Günter (Hrsg.): Test für medizinische Studiengänge (TMS). 14. Arbeitsbericht. Bonn: Institut für Test- und Begabungsforschung, S. 258-307
Lind, Detlef (2004): Welches Raten ist unerwünscht? Eine Erwiderung („Erwiderung“ auf Meyerhöfer 2004 c). In: JMD 1/2004, S. 70-74
Meyerhöfer, Wolfram (2001): Was misst TIMSS? Einige Überlegungen zum Problem der Interpretierbarkeit der erhobenen Daten. In: http://pub.ub.uni-potsdam.de/2001meta/0012/door.htm
Meyerhöfer, Wolfram (2004 a): Was testen Tests? Objektiv-hermeneutische Analysen am Beispiel von TIMSS und PISA. Dissertation an der Mathematisch-Naturwissenschaftlichen Fakultät der Universität Potsdam
Meyerhöfer, Wolfram (2004 b): Zum Kompetenzstufenmodell von PISA. In: JMD 1/2004, S. 294-305. Längere Version unter: http://www.math.uni-potsdam.de/prof/o_didaktik/mita/me/Veroe
Meyerhöfer, Wolfram (2004 c): Zum Problem des Ratens bei PISA. In: JMD 1/2004, S. 62-69
Meyerhöfer, Wolfram (2005): Tests im Test. Das Beispiel PISA. Opladen: Verlag Barbara Budrich
Meyerhöfer, Wolfram (2006): PISA & Co als kulturindustrielle Phänomene. In: Jahnke, Thomas; Meyerhöfer, Wolfram (Hrsg.): PISA & Co – Kritik eines Programms. Hildesheim: Franzbecker, S. 63-100
Millman, J.; Bishop, C.; Ebel, R. (1965): An Analysis of Test-Wiseness. In: Educational and Psychological Measurement, 25, S. 707-726 (zitiert nach Hembree 1987)
Winter, Heinrich (2005): Apfelbäume und Fichten – und Isoperimetrie. In: mathematik lehren, Heft 128, S. 58-62
Woschek, Reinhard (2005): TIMSS 2 elaboriert: Eine didaktische Analyse von Schülerarbeiten im Ländervergleich Schweiz/Deutschland. Dissertation beim Fachbereich Mathematik der Universität Duisburg-Essen
Wuttke, Joachim (2007): Die Insignifikanz signifikanter Unterschiede: Der Genauigkeitsanspruch von PISA ist illusorisch. In: Jahnke, Thomas; Meyerhöfer, Wolfram (Hrsg.): PISA & Co – Kritik eines Programms. 2., überarbeitete Auflage. Hildesheim: Franzbecker
PISA – An Example of the Use and Misuse of Large-Scale Comparative Tests 1

Jens Dolin
Denmark: University of Copenhagen
To an ever increasing extent, international evaluations such as PISA are both setting the agenda in the educational policy debate in the participating countries and exerting a considerable influence on their educational policy decisions. But do such surveys justify the fuss they often cause?

In Denmark, the headlines which followed the publication of the PISA 2003 survey included:
– More discipline in the schools. Discipline will help to improve Danish results in international surveys (Jyllandsposten, 7 Dec. 2004)
– Time for physics classes in country no. 31 (Jyllandsposten, 7 Dec. 2004)
– The government to introduce more tests for Danish schoolchildren (Politiken, 7 Dec. 2004).

The government used the PISA results as a lever to tighten up educational policy, while a number of leading education researchers warned against introducing drastic alterations on the basis of an international test whose character was described as being to some extent foreign to the Danish educational culture. The tone of the debate was sharp, as illustrated by the following extract from an interview in a Danish newspaper:

You have been fooled by the PISA report. The PISA report on the elementary schools is nonsense and a perverse provocation. It is based on neither knowledge nor insight. (Prof. Staf Callewaert, in the newspaper Information, 10 December 2004)

1 This paper is an updated version of a keynote given at a Nordic conference for science education in 2005.
A rather barren chasm was rapidly dug, which prevented large parts of the educational system from utilising the PISA results productively and large parts of the political system from placing PISA in the necessary context. Hopefully, this article may contribute a little to both.

The article will analyse PISA – particularly the part dealing with science – as an example of a major comparative evaluation.

PISA will first be described and then analysed on the basis of test theory, which will address some detailed technical aspects of the test as well as the broader issue of validation. The purpose of this is to illustrate how the technical aspects of evaluations are not neutral practices, but rather part of the fundamental value system on which the evaluation is based. Some apparently objective choices must necessarily be made which have consequences for the theoretical basis of the evaluation, and the technique thereby becomes part of the fundamental value system. These considerations form the basis for an evaluation of PISA’s predictive power in a national context – in this case, that of Denmark. On this basis, the analysis will focus on the relationship between PISA’s fundamental assumptions and the national consequences of participation. Finally, I will conclude with some reflections on how PISA may be utilised and developed.
Comparative evaluation – between politics and science

Whether or not evaluations in the form of politically initiated surveys can be considered research as such, the designation “comparative evaluation” forms part of the lexicon of comparative educational research. Internationally, this is a major research field, organised in the World Council of Comparative Education Societies, which was founded in 1970 and now has 35 national and regional member organisations. All major international education conferences have sessions on comparative evaluation, and the field is covered by several international periodicals, of which the two largest are the British Comparative Education and the American Comparative Education Review. Finally, large-scale international comparative tests such as PISA present an opportunity to conduct a growing amount of related research. This secondary research may focus on PISA itself, or it may utilise the PISA data in analyses which expand its perspectives, such as comparisons between countries, surveys of sub-populations, correlations between different variables, etc.
However, the research field also has a longer tradition which, under the designation comparative educational theory, refers to comparative studies of educational matters in different countries and cultures. One of the earliest Danish comparative surveys was conducted in 1841 and compared the Danish school system with the German and French systems, an analysis which fed into the formation of Danish upper secondary school education. During the same period, the Danish educationalist Grundtvig visited Britain and was inspired by the college system to develop the Danish folk high schools. Comparative educational theory has thus contributed to building up the educational systems of nation states via inspiration and exchanges of experience. This tradition, which Winther-Jensen (2004) terms comparative educational theory in its horizontal significance, was dominant until the nineteen-sixties.
International comparative studies have grown considerably in extent and in the level of interest they attract over the past decade, but the most important point is that their actual aim has altered. During the nineteen-seventies and nineteen-eighties, a change occurred in the conditions and significance of educational systems which also altered the focus of comparative educational theory. The key words here are globalisation and marketisation: education comprises a key sector of the global knowledge society, and it therefore becomes important for politicians to know how their country is doing in the international competition to become the best knowledge society. At the same time, a marketisation of the educational system is taking place, one which causes politicians to ask: are we getting value for money? There is a need for data to determine whether a Danish school student is more costly than a foreign one, and if so, whether she is at least more skilled. The marketisation of the educational system, and of the public sector in general, is being implemented via New Public Management: a system of control based on goals and result targets on the output side, whose implementation requires knowledge and data obtained by means of national and international evaluations and the standards these impose. This is given precise expression in PISA:

Across the world, policymakers use PISA findings to:
– gauge the literacy skills of students in their own country in comparison with those of the other participating countries
– establish benchmarks for educational improvement . . .
– understand the relative strengths and weaknesses of their educational system (OECD 2004)
There is still an interest in comparing oneself with other countries – the horizontal dimension – but now international concepts and standards have been established which provide a basis on which nation states can assess themselves. These supranational structures make it possible to speak of comparative educational theory in the vertical sense. The EU is developing a concept of lifelong learning, UNESCO defines Education for All, and the OECD is testing a literacy concept through PISA. These international concepts become a determining factor in national policies, and the international evaluations set up a standard, independent of the differences between individual countries, both for these key concepts and for the actual educational systems. The goals of the educational systems thereby become harmonised, and increasing emphasis is placed on standardisation and on comparison of student performance in order to measure the extent to which a country is meeting the international requirements. Under such conditions, the horizontal dimension is reduced to a comparison with those countries that best fulfil the international standards.

We may, for example, ask ourselves in desperation, “What is it that Finland does that causes it to do so well in PISA?” But we ask less about what school students in Denmark can actually do. How is it that Denmark is doing so well in international competition when Danish young people achieve such a mediocre score in international comparative tests? It may be that these comparative evaluations fail to capture the essence of the students’ skills – or at any rate capture only an inessential subset. It is therefore important to analyse what such evaluations can really tell us, and what they cannot. What are the limitations, for example, in comparing complex matters between many countries – both from the perspective of test theory and of educational theory?
The aim of this article is to evaluate the predictive power of the PISA results, and thereby provide a perspective on international comparative evaluations in general. The criticism examined here should then be weighed against the advantages that PISA bestows. One problem in this context is that surveys like PISA are initiated and planned in one part of the educational system (typically at the policy level) but implemented by another part of the system (typically the directly practising level), after which the results are used by the policy level to characterise and change the practising level. The situation is thereby one of attack and defence from the beginning, which makes it difficult to find a neutral standpoint from which to assess PISA.
It is, however, important to understand that PISA was designed by the OECD with the official aim of acquiring a data foundation for the use of (educational) decision-makers. As PISA’s own introduction makes clear (OECD 1999):

The results of the OECD assessments, to be published every three years along with other indicators of education systems, will allow national policymakers to compare the performance of their education systems with those of other countries. They will also help to focus and motivate educational reform and school improvement, especially where schools or education systems with similar inputs achieve markedly different results. Further, they will provide a basis for better assessment and monitoring of the effectiveness of education systems at the national level (p. 7).
PISA is administered by a PISA Governing Board, which includes representatives from the governments of the participating countries and which makes the key decisions concerning PISA's goals, content, procedures, etc.

PISA is thereby a different type of survey and research from the kind we traditionally know from the universities. It is a commissioned, research-based survey containing questions formulated by the commissioners, with some set frameworks but with rather extensive freedom as to how those frameworks are filled (e.g. the formulation and choice of test items). Such surveys have become quite common in the research world, for example in the form of evaluations and memoranda, but they differ in crucial respects from the free research of the universities. Many of the associated debates and decisions, for example, take place in relatively closed groups, with strong influence from the administrative layer of the ministries, and thereby with the fingerprint of the present government. PISA is thus a blend of research, investigation, evaluation and educational policy.
The results which PISA has published, first and foremost in the form of the so-called league tables ranking countries according to the performance of their young people, have been used in many countries as arguments for fundamental alterations to their educational systems. In Denmark, with direct reference to the poor results in PISA 2003, the government introduced a wide range of school tests, albeit under strong protest from the teachers' organisations and education researchers (Dolin 2007). Teaching methods and so-called progressive education were identified by leading politicians as the cause of the disappointing results, which gave rise to a back-to-basics wave and a greater emphasis on strong school leadership.
JENS DOLIN
Critical choices, reliability and validity

Any evaluation requires a number of theoretical, practical and methodological choices in order to produce the results needed to fulfil its goals. In the PISA system these choices are taken at various points, on the basis of compiled foundation documents (often of a political or scientific nature). The choices relate to questions of framework and content, such as the relationship to other surveys and the design of test items, and they matter for the validity and reliability of the evaluation. Such fundamental choices set the limits of the survey's usefulness and predictive power, and define its methodological standard.
In a comparative test, reliability is crucial. Irrespective of what you measure, it must be measured correctly. You must be certain that the various countries are appraised in the same way, so that their ranking in the final evaluation is not open to question. Reliability-related problems include, for example, sampling procedures and the scoring of responses. The most fundamental questions, however, concern the survey's validity – the extent to which the chosen design can measure what you are actually interested in. The transition between problems of reliability and problems of validity is gradual, so the divisions between them are as much questions of organisation as of content.
We will begin with some of the critical choices, then review a number of apparently technical and reliability-related issues, and finally use the more fundamental validity problems as the transition to a discussion that places the issue in perspective.
Critical choices

An international survey must position itself within the range of comparative tests with regard to its aim, content, target group, etc., and its design must accord with this positioning. Some of the choices taken have consequences for the survey's reliability and validity; certain aims, for example, call for certain test forms to ensure that the test is in accord with its aims and thereby valid. But the testing must also be of a type that enables a high degree of reliability. These two considerations can be difficult to unite, and often the reliability consideration is given the highest priority.
Lack of comparability with earlier surveys

In the case of PISA, no links have been established with earlier international surveys (particularly those undertaken by the IEA), which makes comparisons very difficult.
It is regrettable that PISA did not link the survey to prior surveys, for example by including some test items of the same type as were included in TIMSS (which was curriculum-based). This would have enabled comparisons over time and comparisons between tests with different testing purposes. Is there, for example, agreement between the results of a curriculum-based test and a more general "fit-for-life" test?
Secondary analyses, however, have revealed quite significant correlations between the two surveys. Lie & Olsen (2007) compared science results for 22 countries that had participated in both TIMSS and PISA, and found that the correlation between the scores in the two studies was as high as 0.95 at the country level.
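The country-level comparison reported by Lie & Olsen can be sketched as a plain Pearson correlation over one mean score per country. The scores below are invented placeholders, not the actual TIMSS/PISA figures; the point is only that a very high r at this aggregated level says nothing about agreement for individual students.

```python
# Country-level Pearson correlation, as in the Lie & Olsen (2007)
# comparison. The five country mean scores below are invented
# placeholders, NOT the published TIMSS/PISA results.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# One mean score per country in each study: aggregation to the
# country level is exactly what drives r towards 1.
timss_means = [520, 495, 560, 505, 540]
pisa_means = [515, 500, 555, 498, 545]
r = pearson(timss_means, pisa_means)
```

Even a handful of loosely similar country means produces r well above 0.9 here, which illustrates why the 0.95 figure by itself is hard to interpret.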
Whether this high level of consistency between different forms of measurement is good or bad is quite a delicate question, to which we will return.
Year sample instead of class sample

By selecting a representative sample of a given year's school students, we can illuminate whether society receives "value for money" from the educational system as such: Does our educational system adequately equip young people for the future? (Assuming that such "future-preparedness" can actually be measured, a question to which we will return later.) How many of our schools' students have which particular skills? And so on. This takes place at a highly aggregated level, where, for example, something can be said about sociocultural differences, the distribution of results across a year group, etc., and some general issues can be identified which the educational system is failing to address satisfactorily. In addition, a year-based sample provides a good overview of a given school year, and the size of the sample provides an opportunity to compare different parts of the educational system.
However, if we wish to know something about the educational system which can be used to change it, we must examine the places where the education actually takes place – that is, the classroom and the school. The problem with PISA here is that the test does not illuminate the teaching conditions which, in the final analysis, are responsible for the measured results. Data collection at this level, with, for example, entire classes representing a school, would provide an opportunity for teaching-related comparisons. The students tested would certainly have been exposed to different forms of teaching, but the Danish model, under which teachers are often permanently assigned to particular groups of students throughout the years, would enable meaningful correlations to be made between teaching variables and output.
Problems with the selected statistical model

The fundamental problem for all comparative evaluations is how to safeguard comparability between different cultures and educational systems. The statistical side of this process is addressed in PISA by choosing a psychometric model which assumes that differences between systems can be ascribed to variation along a single scale. PISA has chosen to rely on a technique known as item response modelling, despite the absence of (published) theoretical considerations concerning what the choice of this model might mean. The problem with this model is that it permits only one-dimensional variation along the chosen scales, and thereby risks overlooking differences between countries that lie outside the scale in question. As the technical report says: "An item may be deleted from PISA altogether if it has poor psychometric characteristics in more than eight countries (a dodgy item)" (Adams and Wu 2001, p. 101). If a particular test item does not fit the one-dimensional model – i.e. it gives very different results in several countries – it is omitted, even though the reasons why it gives different results might express variation in a dimension other than the one the relevant scale is designed to measure. Potential information can thereby be suppressed, or to put it another way: in its efforts to avoid cultural bias, PISA neglects cultural differences – the very differences that it would have been interesting to identify as explanations for the observed variations in performance between countries.
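As a rough illustration of the one-dimensionality at issue, here is a minimal sketch of the Rasch model that underlies PISA's item response modelling, together with a caricature of the quoted "dodgy item" rule. The per-country misfit flags are invented; real item-fit diagnostics are considerably more involved than a simple count.

```python
import math

# One-parameter (Rasch) item response model: the probability that
# a student of ability theta answers an item of difficulty b
# correctly depends on the single difference theta - b, i.e. all
# variation is forced onto one dimension.
def p_correct(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Caricature of the quoted deletion rule: drop an item flagged as
# psychometrically poor in more than eight countries. The flags
# below are invented; real item-fit statistics are more involved.
def is_dodgy(poor_fit_flags, limit=8):
    return sum(poor_fit_flags) > limit

flags_by_country = [True] * 9 + [False] * 13  # misfit in 9 of 22 countries
```

Note how the deletion rule discards precisely the items whose behaviour varies across countries, which is Dolin's point about suppressed information.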
As Harvey Goldstein puts it:

"Perhaps the major (concern) centres around the narrowness of its focus, which remains concerned, even fixated, with the psychometric properties of a restricted class of conceptually simplistic models. . . . It needs to be recognized that the reality of comparing countries is a complex multidimensional issue, well beyond the somewhat ineffectual attempt by PISA to produce subscales. With such recognition, however, it becomes difficult to promote the simple country rankings which appear to be what are demanded by policymakers." (Goldstein 2004, p. 328)
The price of such item homogeneity is that cultural differences are erased at the "profile level". It would in general have been useful to have clearer considerations of the appropriateness of the chosen model.
Is PISA authentic?

Authenticity is one of the crucial elements in PISA.

Figure 1: The pizza item

Let us, for example, examine the "pizza" test item from a set of published pilot items stated to be representative of PISA's mathematics questions (figure 1). On the surface it appears to be an everyday situation (at any rate for city-dwellers in Western Europe), but it has been made abstract by the use of an unknown currency and "nice" numbers. It is in fact a disguised mathematics problem.
I wonder how students who are used to ordering from their local pizzeria would reply to such a question posed with realistic (and familiar) numbers. But this touches on the fundamental – and conflict-ridden – academic debate on whether mathematics should be taught as a closed, deductive system or as "realistic" mathematics. Those who belong to the first school tend to formulate questions which test the ability to perceive mathematical structures in everyday examples, while the other school prefers to focus on the skill of "managing" everyday situations – irrespective of whether approved methods are used. If you aim for the latter, it would be correct to say that the more realistic a test is – the more it is designed to reflect actual everyday situations – the less sense it makes to compile a globally comparable test! It is quite simply a fundamental conflict of principle, in which the choice of test questions reflects a particular academic and pedagogical attitude.
In PISA, it is as though backward reasoning has been used in the formulation of many of the test questions: here we have a set of school subjects – biology, physics, geography, chemistry, etc. Where can the students apply this knowledge? Where in the real world are there situations that involve the use of this knowledge? Instead of starting (authentically!) with some realistic everyday situations – the consumer, the manufacturer, the citizen, leisure activities, etc. – and then choosing some in which scientific insight might play a role. But it is fair to say that this would be a very difficult agenda to set up, due to the special character of science. In everyday life we use the known and the experienced to explain the unknown. In science it is the reverse: there you explain the well-known with abstract, invisible, non-experienced concepts. And it is a huge pedagogical and didactical challenge to make the two ways of knowing meet.
In this connection it is also characteristic that the answers must be based on the information supplied in the test question, which must not be combined with the students' own knowledge of the subject (see, for example, Svendsen 2005). In order to do well on a test item, it is at least as important to understand test logic as to know the subject. You have to know how tests are scored, how to optimise your answer strategy, etc. Greater familiarity with tests probably yields a higher score.
PISA's results, like those of all evaluations, depend on the evaluation context, both with regard to the formulation of the specific questions and with regard to the context in which the test items are solved. As an example, Kjeld Kjertmann (2000) shows how readers who have done well in a standard word-reading test (US64) achieve very different results in reading tests involving meaningful texts.
The question of reliability

The main question here is whether PISA lives up to its own premises from a test-technical point of view. Is the test performed "properly", i.e. in conformity with recognised test standards?
The reliability of PISA is probably as high as is practically possible in such an extensive survey. For each round, a Technical Report is published containing thorough documentation of the procedures used in all phases of the survey, and it gives the impression that the test has been undertaken competently in every respect. For PISA 2000, this is the Technical Report 2000 (Adams and Wu 2001), which outlines how the test was compiled and pilot-tested, how the respondents were selected, how the data were collected and processed, etc. The reliability of the data and processes was evaluated in all respects, and special reliability studies were also undertaken. In one study, the scoring of reading questions by the national test scorers was compared with that of a PISA consortium official (a so-called "verifier"): there was agreement between the OECD's verifiers and all four national scorers in 78 % of instances (p. 174), and agreement with a majority of the national test scorers in 91.5 % of instances. However, the results revealed a large degree of scoring variation across both questions and countries. The marking of some questions showed an inter-country agreement rate of less than 0.80 (Technical Report, p. 175), and some countries showed an inconsistency rate of more than 50 % in the marking of certain questions (Technical Report, p. 177). The overall consistency rate of the individual countries varied from 80.2 % (France) to 96.5 % (New Zealand) (Technical Report, p. 178).
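The agreement figures above can be made concrete with a small sketch: for each response, count how many of the four national scorers give the same code as the verifier, then report the share of full agreement and of majority agreement. All scorer codes here are invented, not PISA data.

```python
# Share of responses where the verifier's code matches all four
# national scorers, and where it matches a majority (3 of 4).
# All codes below are invented, not PISA data.

def agreement_rates(verifier, national):
    """national: one list of four scorer codes per response."""
    full, majority = 0, 0
    for v, codes in zip(verifier, national):
        matches = sum(1 for c in codes if c == v)
        if matches == len(codes):
            full += 1
        if matches > len(codes) / 2:
            majority += 1
    n = len(verifier)
    return full / n, majority / n

verifier = [1, 0, 1, 2, 1]
national = [
    [1, 1, 1, 1],  # all four agree with the verifier
    [0, 0, 1, 0],  # three of four agree
    [1, 1, 1, 1],  # all four agree
    [2, 0, 0, 2],  # only two agree: no majority
    [1, 1, 0, 1],  # three of four agree
]
full_rate, majority_rate = agreement_rates(verifier, national)
```

The gap between the two rates (here 0.4 versus 0.8) mirrors the gap between the reported 78 % full-agreement and 91.5 % majority-agreement figures.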
There were major variations in reliability across the various areas. For the "soft" data (background variables), reliability was significantly lower than for the test items. The reliability of measures of the quality of school resources (a subset of "physical infrastructure") was, for example, 0.70 for Denmark (Technical Report, p. 250). It is difficult to see how the figure of 0.7 was arrived at, but it is probably based on measurements of different employees' classifications of the same answer. No account has been taken here of the validity problems involved in questions such as "What is your father's occupation?"; an answer such as painter, teacher or office worker can mean quite different things, despite the fact that the PISA scorers classified them as identical. Here in Denmark, however, we have the opportunity to check the answers via data pooling.
One can always discuss whether an overall reliability rate of 92 % is good or bad, but the survey gives the appearance of being scientifically sound. As the Danish Minister of Education, Bertel Haarder, put it: when so many international experts have participated, it must be satisfactory. But as with all statistics, the data have been collected in a particular way for a particular purpose, and in any survey, statistics can describe only a (limited) part of the issues and phenomena the survey deals with.
A number of education researchers and statisticians have also criticised both the theoretical background and the technical implementation of PISA. Noteworthy in this context has been the debate between Professor Prais of the National Institute of Economic and Social Research in London and Raymond Adams of the international PISA consortium (Adams 2003; Prais 2003; Prais 2004), and the critique by Harvey Goldstein, Professor of Statistical Methods at the Institute of Education, University of London (Goldstein 2004).
It would be going too far here to undertake an in-depth analysis of these criticisms, which would require a rather advanced familiarity with statistical theory; the following should therefore mainly be seen as a summary of the problems that various authors have identified in the technical and design-related aspects of PISA.
Translation problems

Once the test items have been selected, they must be translated into the various national languages. As the questions have usually been formulated originally in English, the translation must often be worded in a more complex manner in order to convey the precise meaning. It is generally recognised that in order to represent the full meaning of a text originally produced in a foreign language, you often have to reframe and paraphrase – producing a number of awkward and clumsy sentences. The Danish version of the text suffers to some degree from this problem.
The translation also results in a number of inevitable inaccuracies, the effects of which are impossible to assess. In a questionnaire directed at school principals, for example, the English term "assessment" was translated into Danish as "standpunktsprøver" ("proficiency tests"), which has a different meaning.

The occurrence of translation problems, inelegant style and imprecise meaning causes a drop in reliability.
Measuring scale errors (lack of chronological comparability)

The Danish statistician Peter Allerup of the Danish School of Education has demonstrated that the comparability between individual cycles, which forms an important part of PISA, is not valid, because different measuring scales were used in the two surveys (Allerup 2005).

In the scaling technique utilised by PISA, one does not first calculate each student's average score across all questions in order to assess the average score of all the students; instead, the latent item difficulty is calculated by examining the students' simultaneous item responses, i.e. the same student's answers to all the questions. In PISA, these are termed the "item parameters". By undertaking a Rasch statistical analysis of all the students who answered the same question, it is then possible to see how the latent item difficulty is distributed in different surveys. It is a prerequisite for comparability that the relative level of difficulty is fixed.
Figure 2: Measuring scale differences

Figure 2 shows the relative difficulty of 22 common reading test items in the years 2000 and 2003. As can be seen, the relative level of difficulty is not fixed, i.e. the same measuring scale has not been used in both cases (had it been, the lines would have been vertical).
A student with a particular level of skill is awarded points as he moves to the right on the scale, i.e. as he solves test items with a greater level of difficulty. It can be seen that changes in latent item difficulty between the two test cycles produce different scores for the same average student. An above-average student with an item parameter of 0.7, for example, would be one who can solve 18 out of 22 common reading tasks in 2000, but only 16 of the same 22 items in 2003. For the 22 test items, the sum of these deviations across all tasks results in a difference in latent student scores of approximately 11 scale points between the 2000 and 2003 surveys.
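Allerup's point can be caricatured in a few lines: if the (Rasch) difficulties of the common items drift between cycles, a student at a fixed latent ability solves a different number of them. The difficulties below are invented, and the hard cut at theta is a deliberate simplification of the Rasch expectation, used only to make the drift visible.

```python
# If the relative (Rasch) difficulties of the common items shift
# between cycles, a student at a fixed latent ability solves a
# different number of them. Difficulties are invented, and the
# hard cut at theta deliberately simplifies the Rasch expectation.

difficulty_2000 = [-1.2, -0.8, -0.3, 0.1, 0.5, 0.9]
difficulty_2003 = [-1.0, -0.4, -0.1, 0.3, 0.8, 1.3]  # same items, shifted scale

def items_solved(theta, difficulties):
    """Items a student at ability theta is counted as solving."""
    return sum(1 for b in difficulties if b <= theta)

theta = 0.7  # one and the same student in both cycles
solved_2000 = items_solved(theta, difficulty_2000)  # 5 of 6 items
solved_2003 = items_solved(theta, difficulty_2003)  # 4 of 6 items
```

Summed over 22 common items, deviations of this kind are what accumulate into the roughly 11 scale points described above.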
Corresponding analyses may be undertaken regarding gender and ethnicity. Changes in the difficulty of test items for boys and girls respectively accumulate to a scale advantage for girls of 8–10 points at the weak end of the scale (and of just 1–2 points at the strong end). Students whose Danish is poor receive a scale-conditioned advantage over ethnic Danish students of approximately 12 scale points.
Eleven to twelve scale points is quite a lot. In the PISA 2000 scientific literacy test, this would be enough to lift Denmark from the group of countries scoring statistically significantly below the OECD average into the medium group of countries.
Validity

In my opinion, the validity problems of PISA are more fundamental than its weaknesses in technique and reliability.
A test can only measure what its current design can capture. What the test says about the test subjects is one thing; the information which can be derived from the test results about the educational system that has educated these students is quite another. It is thus a complicated and extensive task to provide an adequate analysis of the validity of an international comparative test. Accordingly, a validation of PISA implies a mixture of test design analysis and comparisons between the test and the national context. There are questions regarding what one might term internal validity: Does PISA Science 2006 really measure what it is intended to, namely scientific literacy? This question has two parts: How well does the concept of scientific literacy proposed in PISA correspond to other generally accepted concepts of literacy, and to what extent can the test items and the test concept measure the proposed literacy concept?
The starting point for the PISA 2006 science test is the so-called Framework compiled by the Science Forum, a group of science researchers from the participating countries, and the Science Expert Group. Here, scientific literacy is defined as:

Scientific knowledge and use of that knowledge to identify questions, to acquire new knowledge, to explain scientific phenomena, and to draw evidence-based conclusions about science-related issues;
understanding of the characteristic features of science as a form of human knowledge and enquiry;
awareness of how science and technology shape our material, intellectual, and cultural environments; and
a willingness to engage in science-related issues, and with the ideas of science, as a reflective citizen.
(Doc: ScFor(0407)1, OECD 2004)
This concept is quite similar to other concepts of scientific literacy, as the first inspection report from an ongoing validation project reveals (Dolin et al. 2006), so we can state with some assurance that PISA aims to test scientific literacy. It is worth noting, however, that the definition of scientific literacy was changed considerably from PISA 2003 to PISA 2006. The 2006 definition places more emphasis on knowledge about science and on the students' attitudes towards science. Attitudinal aspects are incorporated by separating the cognitive and attitudinal items within the same unit; by doing so, however, the possibility of testing situational interest is forfeited.
A more fundamental question is: What scientific knowledge do young people need later in life, and is this what is tested? No real analysis of this question has been undertaken by the Science Forum. Instead, the Forum has looked at the existing school curriculum and existing school traditions, and has considered which parts of these could be considered relevant for the young person's future life. On this basis, it then produced the model of scientific literacy shown in Figure 3.
The level of literacy is thus tested via four coherent aspects, namely the answers to these questions:

What contexts are suitable for testing 15-year-olds?
What competencies are necessary for 15-year-olds?
What knowledge is it reasonable to expect 15-year-olds to have?
What affective responses are reasonable to expect from 15-year-olds?
These four questions have been thoroughly processed by the Science Forum, which undertook a mixture of academic and educational-policy weighting of the different interests. The cognitive aspect was weighed against the affective, and the various academic areas were assigned percentage weights in the test areas. The extent to which people in the individual countries feel that the result covers what young people can be predicted to need in their adult lives is a matter for the individual countries to assess. An analysis of PISA's framework against future demands for knowledge management, multimodality and innovation points to PISA's lack of broader contexts and more future-proof categories (Dolin 2005).
Figure 3: Scientific literacy framework

The fundamental question regarding validity is whether one can reasonably claim that sitting with paper and pencil and (casually) answering questions about imaginary situations has anything at all to do with competencies in the sense that we normally understand them. I will return to this fundamental question later, but many of the test questions that have been published can hardly be said to test appropriate everyday actions, let alone the willingness to engage in science-related issues, and with the ideas of science, as a reflective citizen; rather, they test the students' general ability to make deductions and form hypotheses, evaluate evidence, etc. – in other words, a number of school-specific skills which, according to the logic of school, can be used later in life. And this aspect is tested very well! Seen in this light, many of the questions are diagnostically strong, inasmuch as a great deal of work has been done to investigate the use of particular cognitive processes. Let us examine a couple of examples.
Problems with test item formulation

Although the test items have been formulated in conformity with a detailed framework and subjected to a quite comprehensive selection process, there are still a few duds. It is hard to formulate "good" questions, as every teacher knows, and even though only one third of the test items made it through the process to the pilot test, and even though all countries had the right to object, there will always be some less appropriate items. I will mention just one; Inge Henningsen has provided a more detailed critique in Mona 2005, no. 1 (Henningsen 2005), and Lars Svendsen criticised some of the other published test items in the Danish newspaper Politiken on 13 January 2005 (Svendsen 2005).
Figure 4: Walking<br />
In the test item “walking” from the 2003 mathematics set (fig. 4), the length<br />
of the stride is indicated for the first step, but it is clearly apparent that the<br />
second step is quite a bit longer; so in fact, the length of the stride should be<br />
defined as the average length of the measured strides. Worse still, the formula<br />
provided is pure nonsense: according to it, larger stride lengths mean faster<br />
strides, which contradicts our experience.<br />
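To make the objection concrete: the relationship given in the released OECD item is n/P = 140, where n is the number of steps per minute and P the pacelength in metres (the formula is quoted here from the published item set; it is not stated in the text above). The walking speed it implies is

```latex
\frac{n}{P} = 140
\;\Longrightarrow\; n = 140P
\;\Longrightarrow\; v = nP = 140P^{2}\ \text{metres per minute},
```

so doubling the pacelength quadruples the predicted speed: longer strides always mean faster walking, exactly the contradiction with experience noted above.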
Cultural bias<br />
Despite careful attention on the part of the question compilers, it is impossible<br />
<strong>to</strong> avoid a certain amount of cultural bias. Test items which require the student<br />
<strong>to</strong> read between the lines in references <strong>to</strong> cultural background knowledge are<br />
managed more easily by ethnic Danes than by Danish students from ethnic minorities.<br />
One could naturally argue that all students ought <strong>to</strong> be able <strong>to</strong> manage<br />
even culturally-determined tasks, and there is a certain logic <strong>to</strong> this, given that
110 JENS DOLIN<br />
they must be able <strong>to</strong> manage life in a (post-) modern society. But in this case,<br />
one cannot simultaneously accept that <strong>PISA</strong> aims at smoothing out cultural<br />
differences while measuring cultural deviations from a West European norm.<br />
This also applies <strong>to</strong> gender, regarded as culture.<br />
Let us examine the racetrack test item (fig. 5) from <strong>PISA</strong> 2000.<br />
Figure 5a: racetrack item<br />
Figure 5b<br />
The test item seems realistic and meaningful (at any rate <strong>to</strong> me as a man).<br />
Unfortunately, the question cannot be solved. Based on the number of curves,
the lane must be B, C or D. If we look at the position of the start (at the<br />
conclusion of a straight length) it should be lane D, but as the first curve is<br />
followed by one which is sharper and one that is less sharp than the others, it<br />
must be lane B! However, what is more interesting is that the responses show<br />
a gender-based imbalance:<br />
Greece, girls: 8 % correct<br />
Portugal, girls: 10 % correct<br />
Australia, boys: 43 % correct<br />
Switzerland, boys: 46 % correct<br />
What do these answers really reveal? Are they perhaps more a reflection of<br />
society’s socialisation, i.e. gender-specific interest, than of what the school has<br />
taught students (concerning the ability <strong>to</strong> create a graphic representation of<br />
movement)? Or are the girls just so bright that they can see there is no solution?<br />
There can hardly be a standardised version of “everyday life” valid for the<br />
whole world, and the issue of what can be regarded as everyday mathematics or<br />
science is the subject of considerable debate. Do all young people really need<br />
<strong>to</strong> learn the same things? Do we all need <strong>to</strong> know the same things in order <strong>to</strong><br />
be ‘fit for life’? And should they be evaluated in the same way?<br />
<strong>PISA</strong> and the Danish educational goals<br />
The next general question is: Does this framework harmonise with Danish educational<br />
goals, for example as expressed in the Common Goals statement of<br />
the Ministry of Education? (http://www.faellesmaal.uvm.dk/) The answer is<br />
both yes and no. Dolin et al (2006) have undertaken a thorough analysis of the<br />
intentions of the <strong>PISA</strong> survey and compared these with the goals of Danish<br />
education as formulated in the Common Goals. The report concludes:<br />
To summarise, we could say that <strong>PISA</strong>’s scientific literacy framework covers key parts<br />
of the formulated aims and mentality-related goals of Danish scientific school subjects.<br />
The greatest lack is the emphasis placed by Danish scientific subjects on students’<br />
practical and field work, which is not included in <strong>PISA</strong>. This also means that a<br />
number of personal qualities, such as imagination and inquisitiveness, are not tested.<br />
It is also important <strong>to</strong> point out that the personal and affective aims of science teaching<br />
are given considerable emphasis in the Danish aims and goals, while these comprise<br />
only a minor part of the overall <strong>PISA</strong> test in science.<br />
In addition, the <strong>PISA</strong> competencies primarily relate <strong>to</strong> cognitive skills, whereas Danish<br />
goals are more holistic and aim <strong>to</strong> encourage independent problem-solving, which<br />
naturally also involves cognitive skills, but in interplay with other abilities.
The <strong>PISA</strong> framework thus covers some of the Danish goals, but far from all of<br />
them, and perhaps not even the ones which many Danes would regard as being<br />
the most important, such as democratic culture, social skills, personal development,<br />
etc. Here, I feel we can find one of the key reasons for the opposition<br />
<strong>to</strong> <strong>PISA</strong>; many opponents criticise <strong>PISA</strong> for failing <strong>to</strong> test what they regard as<br />
important, but at the same time overlook what <strong>PISA</strong> does in fact test. Similarly,<br />
many of <strong>PISA</strong>’s supporters focus on what <strong>PISA</strong> tests, and perhaps fail <strong>to</strong> place<br />
this in relation <strong>to</strong> what <strong>PISA</strong> does not test. The exciting question is whether<br />
there are correlations between the two areas; this would demand an actual field<br />
validation process, i.e. a concrete examination of the <strong>PISA</strong>-tested students with<br />
the aid of other evaluation methods besides that of <strong>PISA</strong>. A Danish research<br />
project is doing so at the time of writing (Dolin et al 2006).<br />
All in all, a wide range of important validity problems appears when we ask<br />
what it is that <strong>PISA</strong> measures, and what the actual skills of Danish students are<br />
in the areas tested by <strong>PISA</strong>. One should thus be extremely cautious in drawing<br />
<strong>to</strong>o hasty or <strong>to</strong>o firm conclusions from the <strong>PISA</strong> results.<br />
Against the background of these considerations, I would recommend that<br />
in the case of extensive surveys such as <strong>PISA</strong>, more aspects of the survey<br />
design and its context should be taken in<strong>to</strong> consideration when assessing the<br />
test’s validity and consequences.<br />
A wider view of validity<br />
In the following, I wish <strong>to</strong> present a broader and more differentiated view of<br />
the validity issue in order <strong>to</strong> further define the problems that can arise in connection<br />
with a survey such as <strong>PISA</strong>. The issue of validity will be examined in<br />
relation <strong>to</strong>:<br />
<strong>–</strong> the structure and design of the actual test apparatus in relation <strong>to</strong> the questions<br />
posed<br />
<strong>–</strong> the range which defines the test’s area of validity or generalisability<br />
– the foundation of the test, i.e. how its fundamental assumptions relate to<br />
the field’s dominant assumptions.<br />
Validity in relation <strong>to</strong> the test design<br />
A test can naturally only measure that which it is designed <strong>to</strong> measure, so the<br />
first and most fundamental validity evaluation must clarify whether the test’s<br />
design is in accordance with its aims. Does the <strong>PISA</strong> test measure scientific
literacy? As mentioned, I would be concerned if the idea of literacy were to be<br />
restricted <strong>to</strong> something that can be measured using paper and pencil, sitting<br />
at a desk in a gymnasium. The prevalent approach <strong>to</strong> literacy operates with a<br />
significantly broader view of the concept of literacy <strong>–</strong> typically the ability <strong>to</strong><br />
manage everyday situations in which the necessary actions or considerations<br />
demand scientific insight (Roth & Désautels 2002). <strong>PISA</strong>’s concept of literacy<br />
attaches great importance to deductive skills applied to given premises – a<br />
subset of the prevalent approaches to literacy – and<br />
such abilities are excellently tested in quite a few of the <strong>PISA</strong> test items. However,<br />
is it not the case that the more the test items and the test situation are<br />
shorn of their context and removed from ordinary everyday life, the more we<br />
tend <strong>to</strong> test levels of general intelligence? It gives one food for thought that a<br />
recent study demonstrates high consistency in measuring performance in <strong>PISA</strong><br />
2003 and TIMSS 2003 (Lie & Olsen 2007). A comparison of the results for 22<br />
educational systems participating in the two tests shows a correlation between<br />
the scores in the two tests as high as 0.95 at the country level. Despite differences<br />
in focus (scientific literacy vs. curriculum test), the test results are very<br />
much the same. So perhaps the <strong>PISA</strong> test does not reflect the different focus<br />
sufficiently. <strong>PISA</strong> might have a good definition of scientific literacy, but the<br />
test items and the whole test setup are <strong>to</strong>o close <strong>to</strong> a traditional curriculum test.<br />
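A country-level comparison of the kind Lie & Olsen report boils down to a Pearson correlation over paired national mean scores. The sketch below shows the computation; the five pairs of scores are invented placeholders, not the actual PISA/TIMSS 2003 results:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Invented national mean scores for five hypothetical countries
# (NOT the real 2003 data), paired PISA vs. TIMSS.
pisa = [548, 525, 511, 495, 475]
timss = [552, 530, 508, 498, 470]

print(round(pearson(pisa, timss), 2))  # 0.99 for these invented scores
```

Run over the 22 real pairs of national means, the same computation yields the kind of 0.95 figure cited above.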
Achieving complex goals will often require a blend of multi-dimensional<br />
skills, the integration of academic and personal/social skills, and the utilisation<br />
of several academic areas and subjects. Such complex goals can only be evaluated<br />
with the help of complex forms of evaluation. Much work has been done<br />
on developing process-oriented and complexity-capturing forms of evaluation<br />
(e.g. logbooks, portfolios, project reports), but these are, naturally enough, difficult<br />
<strong>to</strong> carry out and more time-consuming, and they need <strong>to</strong> be learned, for<br />
which reason they are more costly than traditional written tests. The better the<br />
evaluation is at capturing complex skills, the more difficult it is <strong>to</strong> present the<br />
results in the form of simple, comparable data.<br />
This brings us back <strong>to</strong> the traditional dilemma between undertaking an<br />
evaluation with a high degree of validity which is costly <strong>to</strong> carry out and which,<br />
because of its complexity, will have low reliability, and an evaluation of simple<br />
fac<strong>to</strong>rs which is capable of measuring with high reliability, but in which the<br />
level of validity is relatively low.
Generalisability<br />
One cannot generalise test results beyond their area of validity. It would thus<br />
seem unreasonable, on the basis of a test in deduction and calculation, <strong>to</strong> generalise<br />
regarding general abilities and skills in science. A test is a very specific<br />
communicative situation in which students must answer questions in writing<br />
and under time pressure, and without the help of an interlocu<strong>to</strong>r <strong>to</strong> adjust<br />
their understanding of the problem. As far as I am aware, no survey has been<br />
undertaken of the relationship between such problem-solving skills and the<br />
ability <strong>to</strong> manage later in life in situations which include a scientific content.<br />
Nonetheless, <strong>PISA</strong> measures something, and the measuring apparatus provides<br />
a fine scaling of the students. There is, for example, a correlation between<br />
the <strong>PISA</strong> reading results and later educational achievements. An analysis of<br />
Danish school students who participated in <strong>PISA</strong> 2000 showed that the young<br />
people’s educational position four years after elementary school was primarily<br />
determined by their reading skills and their academic self-image in the ninth<br />
grade (such as these were established by the <strong>PISA</strong> test) (Pilegaard Jensen and<br />
Andersen, 2006). Such a correlation may not, however, necessarily indicate a<br />
direct causal relation, but rather reflect some general relationships, probably<br />
attributable <strong>to</strong> social background, which are revealed by <strong>PISA</strong>. But we also<br />
know that of the 17 % of Danish school students who were designated functionally<br />
illiterate on the basis of <strong>PISA</strong> 2000, 20 % later completed an upper<br />
secondary or vocational education. Many of these were in other words capable<br />
of coping with relatively high demands for reading and comprehension. This<br />
implies that we can draw no clear conclusions with regard <strong>to</strong> the generalisability<br />
of the <strong>PISA</strong> test, and thus it begins <strong>to</strong> resemble soothsaying, <strong>to</strong> put it mildly,<br />
<strong>to</strong> rank countries by the supposed ability of their school students <strong>to</strong> manage in<br />
the future, as in Figure Six taken from the Danish <strong>PISA</strong> 2003 report.<br />
The skill requirements of the future are difficult <strong>to</strong> predict, and an exaggerated<br />
re-traditionalisation of the school system might well occur at the expense<br />
of explorative, creative, communicative and playful skills, and many<br />
other skills which the digital society of the future might come <strong>to</strong> rely on <strong>–</strong> and<br />
which <strong>PISA</strong> does not test.<br />
Naturally, this must not divert our attention from the problem that an unreasonably<br />
large proportion of Danish youth have poor reading abilities – a problem which<br />
PISA usefully documents. But is it reasonable to conclude,<br />
on the basis of the <strong>PISA</strong> data, that three-quarters of the school students in Finland<br />
are ready for the labour market of the 21st century, while this applies only
Figure 6. Percentage of students prepared for the 21st century labour market<br />
Source: Mejding 2003, p. 132<br />
<strong>to</strong> just under half of the Norwegians? And <strong>to</strong> what extent would it enhance<br />
the students’ future preparedness <strong>to</strong> improve their skills in traditional cultural<br />
techniques, if this requires learning decontextualised skills? In this connection,<br />
it is of crucial importance <strong>to</strong> find a reasonable balance between fundamental<br />
subject-related skills and social/personal skills, and to ensure that this balance is<br />
expressed in relevant contexts.<br />
Fundamental assumptions<br />
The question of validity is closely linked with the fundamental assumptions<br />
and values upon which the test is based. If, for example, a test builds on the
premise that knowledge is an objective quantity, independent of context, it<br />
might be meaningful <strong>to</strong> attempt <strong>to</strong> test the presence and extent of this knowledge<br />
with individual students in neutral contexts. And if you define competence as<br />
the ability to solve items in a test, then what such a test measures may indeed be called competence. If,<br />
on the other hand, we view knowledge as a social construction in actual contexts,<br />
such a test set-up might amount <strong>to</strong> a valid measurement of school knowledge<br />
<strong>–</strong> but not in any respect a measurement of ‘everyday useful’ knowledge <strong>–</strong><br />
let alone competencies.<br />
Consider, for example, the following view of competence and knowledge,<br />
from the perspective of situated cognition (St. Julien 1997):<br />
Competence, unders<strong>to</strong>od as the ability <strong>to</strong> act on the basis of understanding, has been<br />
a fundamental goal of education. But it is a painful fact of educational life that knowledge<br />
gained in school <strong>to</strong>o often does not transfer <strong>to</strong> the ability <strong>to</strong> act competently in<br />
more “worldly” settings.<br />
...<br />
From the viewpoint of situated cognition, competent action is not grounded in individual<br />
accumulations of knowledge but is, instead, generated in the web of social<br />
relations and human artefacts that define the context of our action.<br />
This view of knowledge and competence shifts the focus when assessing competencies<br />
from a focus on examining individual knowledge <strong>to</strong> examining authentic<br />
activities in social contexts. In the Nordic countries, we have built up<br />
a view of knowledge in an educational context which attempts <strong>to</strong> combine<br />
the process-oriented view of knowledge expressed by constructivism with the<br />
more absolute view of knowledge expressed by science. We also work <strong>to</strong> a<br />
great extent on the basis of a socio-cultural view of learning, i.e. in educational<br />
contexts we tend <strong>to</strong> emphasise the ability of individual students and the<br />
group <strong>to</strong> work <strong>to</strong>wards their own view of knowledge, which then gradually<br />
approaches that of established science.<br />
There is no room for such a view of knowledge in the <strong>PISA</strong> format. Here,<br />
concrete questions are asked <strong>to</strong> which the answer <strong>to</strong> most items is either correct<br />
or incorrect (or at most correct, partly correct, incorrect). Such questions<br />
are naturally also asked in Danish science teaching <strong>–</strong> and it is obviously important<br />
<strong>to</strong> be able <strong>to</strong> answer them <strong>–</strong> but they are not the most important questions,<br />
as the aim is <strong>to</strong> build up the students’ general scientific understanding. However,<br />
it is not possible <strong>to</strong> train test scorers <strong>to</strong> assess whether a student is on<br />
the right path. In PISA, certain premises are typically presented within a specified<br />
frame, and the student is then expected to apply particular knowledge or<br />
a particular process <strong>to</strong> this frame, accepting the given terms. This is a very<br />
Anglo-Saxon approach. In a constructivist context, one would instead emphasise<br />
the students’ ability <strong>to</strong> draw up frames and premises themselves, and the<br />
ability <strong>to</strong> formulate the actual problem as part of the solution. The “walking”<br />
test item in Figure Four would, for example, be formulated in a completely different<br />
manner under a constructivist view of education; in this case the students<br />
would be required <strong>to</strong> measure the stride length themselves and then attempt <strong>to</strong><br />
work out the relationship between stride length and speed, evaluating whether<br />
or not this is reasonable. It would be this ability <strong>to</strong> structure the problem that is<br />
primarily tested, rather than whether the student is capable of inserting figures<br />
in<strong>to</strong> a given formula (which they must naturally also be able <strong>to</strong> do).<br />
The critical point, however, is that the actual test format itself rules out<br />
questions which are too open, and that students who display independent thought,<br />
e.g. by exceeding the test item’s premises or drawing upon knowledge other<br />
than that provided, risk being penalised (Svendsen 2005).<br />
Seen in this light, the <strong>PISA</strong> test seems epistemologically conservative, and<br />
consequently more of a measuring rod for idealised skills than a <strong>to</strong>ol for promoting<br />
education which is centred on the learning process.<br />
Tiberghien (2007) advocates research studies on test design similar <strong>to</strong> those<br />
which have led <strong>to</strong> the development of research-based teaching sequences. Such<br />
studies would allow items to be constructed in close connection with the desired<br />
learning process of the students, and thus provide a didactical foundation for a<br />
more fine-grained scoring.<br />
Consequences for educational policy<br />
The results of an evaluation provide a basis for certain decisions, but it is important<br />
that these decisions do not exceed what is actually justified by the test.<br />
<strong>According</strong>ly, it is interesting <strong>to</strong> consider how the results of a test such as <strong>PISA</strong><br />
can be used <strong>–</strong> and abused.<br />
<strong>PISA</strong> in the media<br />
The media debate following the publication of an international test often has<br />
a very uncertain foundation, and experience from the publication of the <strong>PISA</strong><br />
2000 and <strong>PISA</strong> 2003 results indicates that the loose claims advanced in the<br />
initial hectic media coverage tend <strong>to</strong> remain the main impressions of <strong>PISA</strong>.
First impressions last. They thus become the truths upon which the educational<br />
debate is based in ensuing years.<br />
In a media society, a media image can have a direct influence on political<br />
decisions. When the media construct a particular view of reality, many politicians<br />
feel obliged <strong>to</strong> act on this basis.<br />
Figure 7. Denmark gets <strong>to</strong>o little value for money from its education budget<br />
(Source: Arbejdsmarkedspolitisk Agenda (The Danish Employers’ Confederation) April 7th,<br />
2005)<br />
See, for example, the juxtaposition by the Danish Employers’ Confederation<br />
of the <strong>PISA</strong> results with educational spending (figure 7). Here a comparison<br />
is made between Denmark’s ranking in <strong>PISA</strong> and its expenditure per student<br />
<strong>–</strong> and by implication, the quality of its educational system. In this ranking,<br />
Denmark ends up in third last place in a range of OECD countries when the<br />
Danish <strong>PISA</strong> score is compared with its educational budget. Denmark pays an<br />
average of EUR 1,000 per student <strong>to</strong> achieve just over six <strong>PISA</strong> points, while<br />
the Germans, for example, obtain ten points for the same price. The conclusion<br />
is clear. However, this analysis disregards the fact that Denmark obtains much<br />
more from its expenditure on education than <strong>PISA</strong> points.<br />
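The figure’s ranking rests on nothing more than a division. A minimal sketch, using round numbers chosen only to match the ratios quoted above (roughly six points per EUR 1,000 for Denmark, ten for Germany) rather than the Confederation’s actual data:

```python
def points_per_1000_eur(mean_pisa_score, spend_eur_per_student):
    """PISA points 'obtained' per EUR 1,000 of educational spending per student."""
    return mean_pisa_score / (spend_eur_per_student / 1000)

# Round illustrative figures, NOT the real ones from the Agenda article.
denmark = points_per_1000_eur(497, 80000)  # ~6.2 points per EUR 1,000
germany = points_per_1000_eur(503, 50000)  # ~10.1 points per EUR 1,000
print(round(denmark, 1), round(germany, 1))
```

The criticism in the text is precisely that such a quotient treats PISA points as if they were the only output of the spending.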
The politicians ask whether we get “value for money”, and they are accus<strong>to</strong>med<br />
<strong>to</strong> measuring value in terms of figures in columns. If the results are
<strong>to</strong>o low, we need more tests and measurements, and we must introduce economic<br />
rewards, grading systems, etc. The end result may well be that schools<br />
and classes alter their teaching in such a way that students become better able<br />
<strong>to</strong> manage <strong>PISA</strong> test items, but at the expense of the less detectable results of<br />
education. I do not claim that there is a direct contradiction between these two<br />
things, but with the limited time and resources available, it is a delicate matter<br />
<strong>to</strong> maintain the existing values while at the same time gearing the system <strong>to</strong><br />
meet a number of specific requirements. However, the possibility cannot be<br />
ruled out that it could be a fruitful process.<br />
Measurable fac<strong>to</strong>rs as parameters of quality<br />
In general, it would be well <strong>to</strong> exercise caution when measuring and assessing<br />
something as complex as human behaviour with figures, not <strong>to</strong> mention a<br />
country’s overall performance <strong>–</strong> and in particular, when using figures procured<br />
via measurement of only a limited part of the overall area. This is extreme reductionism,<br />
and an example of how one of the central scientific <strong>to</strong>ols <strong>to</strong> create<br />
knowledge <strong>–</strong> the ability <strong>to</strong> practise reductionist methods <strong>–</strong> should be utilised<br />
with caution outside the domain of science itself.<br />
There is a major risk that the fac<strong>to</strong>rs which are measurable via the test in<br />
question become the norm-setting parameters of quality, while the remainder<br />
of the large and complex educational picture imperceptibly slips out of view.<br />
This would have serious consequences for the entire educational system, including<br />
the priorities of individual schools and teachers.<br />
We risk harmonising away the very qualities that we have built up over<br />
generations and which may be the key <strong>to</strong> our survival in the globalised world<br />
of the future. A process of cultural uniformity and harmonisation of values<br />
is occurring on the basis of the contemporary mainstream. In an interview in<br />
the Danish newspaper Information (Thorup 2005 (20 March)), Microsoft CEO<br />
Steve Ballmer expresses the company’s winning strategy as: “I want the whole<br />
world <strong>to</strong> be Danish.” This is followed up by Mikael R. Lindholm, a member of<br />
the Innovation Council’s strategic planning group, who says:<br />
The welfare system helps <strong>to</strong> create some highly committed, dynamic, inquisitive and<br />
competent people in Denmark. And these are precisely the qualities from which we<br />
benefit, and of which the rest of the world is very envious.<br />
[... ]
But Denmark shows <strong>to</strong>o little interest in these special, culturally-determined competencies<br />
that the rest of the world covets. Instead, the government is trying <strong>to</strong> harmonise<br />
our strengths out of the educational system.<br />
It is a no<strong>to</strong>rious fact in educational research that the more important something<br />
is, the harder it is <strong>to</strong> see and measure!<br />
Evaluation as construction of an area<br />
It is well known that educational evaluations have a strong back-wash effect<br />
on teaching: ‘Teach <strong>to</strong> the test’, as it is known. This is in itself a reasonable<br />
and desirable process, if the evaluation is sensible and reflects the goals of<br />
the educational system. However, it is problematic if the evaluation fails <strong>to</strong><br />
accord with the foundation of the educational system and its overall goals, but<br />
is instead undertaken <strong>to</strong> support a number of ideological aims.<br />
Figure 8. The utilisation of evaluation<br />
(From: Dahler-Larsen&Larsen 2001)<br />
The Danish educational researchers Peter Dahler-Larsen and Flemming<br />
Larsen (Dahler-Larsen and Larsen 2001) have drawn up a list of uses <strong>to</strong> which<br />
evaluations are put (figure eight). They distinguish between uses which view<br />
human actions as based on rationality and functionality, i.e. in which we act<br />
in order <strong>to</strong> achieve a particular goal (such as learning or acquiring information),<br />
and uses aimed at making the system appear “suitable”, doing what is expected, rather<br />
than at achieving any particular goal. In the latter case it is not the effects<br />
of the evaluation that are important, but rather the fact that the evaluations<br />
are undertaken at all. There is a tendency for these symbolic and constitutive<br />
uses <strong>to</strong> occupy ever more space in the evaluation landscape. By evaluating,<br />
you communicate credibility and drive; you show that you are prepared <strong>to</strong><br />
do something; you are part of the action (just think of how the number of<br />
countries participating in <strong>PISA</strong> grows with each round). The symbolic value
is politically important, and often more important than adhering <strong>to</strong> the results<br />
achieved. However, at the same time, there may be a number of unintended<br />
consequences for the content. By undertaking evaluation, you can influence the<br />
field in a particular direction and help <strong>to</strong> form it. By evaluating, we also create<br />
social relations and identities (passive students, etc.). Or the evaluation creates<br />
a view of the subject matter (such as the scientific competence of school students)<br />
for which there is insufficient evidence (e.g. that Danish school students<br />
are scientifically incompetent, even though <strong>PISA</strong> measures only their ability <strong>to</strong><br />
perform a particular type of task in a particular context).<br />
There is no doubt that the <strong>PISA</strong> results, <strong>to</strong>gether with those of a number<br />
of other surveys, such as the OECD review of elementary schools (Uddannelsesstyrelsen<br />
2004), have contributed <strong>to</strong> an increasing focus on the apparently<br />
weak evaluation culture in Danish elementary schools. This is a result<br />
which is consistent with the general foundation of the <strong>PISA</strong> and OECD surveys,<br />
and it is <strong>to</strong> a large extent necessary and useful. An increased level of evaluation<br />
<strong>–</strong> if well-balanced, well-designed and diagnostically-oriented <strong>–</strong> would<br />
undoubtedly enhance the benefits of the educational process for all groups of<br />
students.<br />
However, there are signs that <strong>PISA</strong>, besides exerting an influence on teaching,<br />
has also had an influence on the actual objects clause of the elementary<br />
schools, so as <strong>to</strong> direct the teaching <strong>to</strong> conform <strong>to</strong> a greater degree with what<br />
<strong>PISA</strong> is capable of measuring!<br />
<strong>PISA</strong> in perspective<br />
Taking a critical approach tends to sharpen your argumentation, and here I<br />
have emphasised the problematic aspects of <strong>PISA</strong>. It would not be reasonable<br />
<strong>to</strong> conclude on this basis alone that <strong>PISA</strong> is unusable, worthless or the like. On<br />
the contrary, <strong>PISA</strong> encompasses a great deal of potential.<br />
To begin with, it presents us with an enormous amount of empirical material.<br />
The figures indicate many unknown fac<strong>to</strong>rs in the educational sec<strong>to</strong>r<br />
which it would be worthwhile <strong>to</strong> investigate further, as well as confirming<br />
much which we already know, such as the large gender variations in Denmark,<br />
the variation between ethnic groups, etc. It is thought-provoking that there appears<br />
<strong>to</strong> be a statistical correlation between results achieved and the students’<br />
comments on the level of discipline and order during lessons. Moreover, it is<br />
in itself remarkable that a very large proportion of the students <strong>–</strong> more than
one-third – report experiencing poor discipline and order during lessons. It is useful to know that Danish school students feel at home in their schools, and that they have a positive attitude to their studies and a positive image of their own academic skills. There are many correlations which it would be interesting to explore in more depth, and there is extensive diagnostic potential in the PISA material, first and foremost in connection with finding out what young people think, for better or worse. Rolf V. Olsen (2007) suggests five generic approaches to a secondary analysis of the PISA data, each accompanied by a comprehensive list of analytical approaches.
It is also meaningful to undertake comparisons within the same cultural groups, which may provide some fruitful contextualisation of well-known issues. This has been done extensively in a Nordic context, for example (Lie, Linnakylä et al. 2003; Kjærnsli and Lie 2004).
Finally, it should be mentioned that PISA is a laboratory for testing techniques and test theory. Participation in PISA has provided Denmark with a much-needed theoretical boost in testing, and has also helped to place the evaluation culture in Danish elementary schools on the agenda.
But this potential must be balanced against the danger of mainstreaming and distortion of the educational system and of teaching which PISA could also induce. International comparative evaluations possess almost inherent re-traditionalising and standardising elements which could push national development in a direction foreign to the local educational culture. Evaluations as comprehensive as PISA express themselves with great authority on the basis of what many view as incontrovertible documentation. Relative to the national research environments, the PISA system has so many resources at its disposal that it is difficult to establish genuinely critical and independent research on PISA and the PISA results, with the consequence that a project like PISA can rapidly become established as a representative of objective, neutral reality. Political prestige has also been invested in participation, which makes it difficult for the participating countries to distance themselves from the project at policy level; as a "member of the club", one feels obliged to show solidarity with the club's rules.
Finally, it is important to point out that, from an educational perspective, it is difficult to establish links between the findings of comparative evaluations, which describe the educational system in its entirety, and teaching in individual classes with individual students. PISA's strength lies in its analytical and diagnostic possibilities at the overall educational policy level, but when utilised to
PISA – USE AND MISUSE OF LARGE-SCALE COMPARATIVE TESTS 123
influence the structure of specific teaching practice, there is a risk of promoting changes on the basis of an oversimplified view of educational practice, which can have a counterproductive effect in the long run in relation to achieving the stated goals.
References
Adams, R. and M. Wu (2001). PISA 2000 technical report. Paris: OECD.
Adams, R. J. (2003). Response to "Cautions on OECD's recent educational survey (PISA)". Oxford Review of Education 29(3): 377-389.
Allerup, P. (2005). PISA Præstationer – målinger med skæve målestokke. Dansk Pædagogisk Tidsskrift (1): 68-81.
Dahler-Larsen, P. and F. Larsen (2001). Anvendelser af evaluering – Historien om et begreb, der udvider sig. In: P. Dahler-Larsen and H. K. Krogstrup. Tendenser i evaluering. Odense: Odense Universitetsforlag.
Dolin, J. (2005). PISA og fremtidens kundskabskrav. In: PISA-undersøgelsen og det danske uddannelsessystem. Folketingshøring om PISA-undersøgelsen 12. september 2005. Teknologirådet.
Dolin, J., H. Busch og L. B. Krogh (2006). En sammenlignende analyse af PISA2006 science testens grundlag og de danske målkategorier i naturfagene. Første delrapport fra VAP-projektet. Odense: IFPR/Syddansk Universitet. (With English summary.)
Dolin, J. (2007). Science education standards and their assessment in Denmark. In: Waddington, D., Nentwig, P. & Schanze, S. (eds.): Standards in Science Education. Waxmann.
Goldstein, H. (2004). International comparisons of student attainment: Some issues arising from the PISA study. Assessment in Education 11(3).
Hansen, E. J. (2005). PISA – et svagt funderet projekt. Dansk Pædagogisk Tidsskrift (1): 64-67.
Henningsen, I. (2005). PISA – et kritisk blik. MONA (1).
Kjertmann, K. (2000). Evaluering af læsning: Generelle og specifikke problemer. Forskningstidsskrift fra Danmarks Lærerhøjskole, nr. 6.
Kjærnsli, M. and S. Lie (2004). PISA and scientific literacy: Similarities and differences between the Nordic countries. Scandinavian Journal of Educational Research 48(3): 271-286.
Lie, S., P. Linnakylä, et al. (eds.) (2003). Northern lights on PISA. Unity and diversity in the Nordic countries in PISA 2000. Oslo: University of Oslo.
Lie, S. and Olsen, R. (2007). A comparison of the measures of science achievement in PISA and TIMSS. Paper presented at the ESERA 2007 Conference, Malmoe.
Mejding, J. (ed.) (2004). PISA 2003 – danske unge i en international sammenligning. København: Danmarks Pædagogiske Universitets Forlag.
Mejding, J., S. Reusch og T. Yung Andersen (2006). Leaving examination marks and PISA results – Exploring the validity of PISA scores. In: Mejding, J. og A. Roe (eds.). Northern Lights on PISA – a reflection from the Nordic countries. Copenhagen: Nordic Council of Ministers.
OECD (1999). Measuring student knowledge and skills – a new framework for assessment. Paris: OECD.
OECD (2001). Knowledge and skills for life. First results from PISA 2000. Paris: OECD.
OECD (2002). Sample tasks from the PISA 2000 assessment. Paris: OECD.
OECD (2004). Learning for tomorrow's world. First results from PISA 2003. Paris: OECD.
Olsen, R. V. (2007). Beyond the primary purpose: Potentials for secondary research in science education based on PISA 2006 data. Paper presented at the ESERA 2007 Conference, Malmoe.
Pilegaard Jensen, T. & D. Andersen (2006). Participants in PISA 2000. Four years later. In: Mejding, J. & A. Roe (eds.). Northern lights on PISA – a reflection from the Nordic countries. Copenhagen: Nordic Council of Ministers.
Prais, S. J. (2003). Cautions on OECD's recent educational survey (PISA). Oxford Review of Education 29(2): 139-163.
Prais, S. J. (2004). Cautions on OECD's recent educational survey (PISA): rejoinder to OECD's response. Oxford Review of Education 30(4): 569-573.
Roth, W.-M. and J. Désautels (eds.) (2002). Science education as/for sociopolitical action. New York: Peter Lang.
St. Julien, J. (1997). Explaining learning: The research trajectory of situated cognition and the implications of connectionism. In: D. Kirshner and J. A. Whitson. Situated Cognition. Social, Semiotic, and Psychological Perspectives. London: Lawrence Erlbaum Associates.
Svendsen, L. S. (2005). Med Klods-Hans til PISA-prøve. Politiken. København.
Thorup, M.-L. (2005, 20 March). I want the whole world to be Danish. Information. København.
Tiberghien, A. (2007). Assessing scientific literacy: The need for research to inform the future development of assessment instruments. Paper presented at the ESERA 2007 Conference, Malmoe.
Uddannelsesstyrelsen (2004). OECD-rapport om grundskolen i Danmark – 2004. Uddannelsesstyrelsens temahæfteserie nr. 5.
Winther-Jensen, T. (2004). Komparativ pædagogik – faglig tradition og global udfordring. København: Akademisk Forlag.
Language-Based Item Analysis –
Problems in Intercultural Comparisons
Markus Puchhammer
Austria: University of Applied Sciences Technikum Wien
PISA was started as an instrument to check the outputs of the education systems of several different countries against each other. To achieve an assessment across a still growing number of participating countries, test items were developed for mathematics, reading literacy, science and problem solving. Multiple institutions located in different OECD countries contributed to the effort of creating test items. Test booklets of quite similar appearance were produced, featuring the same items presented in the languages officially used in education in the participating nations. The results have been widely discussed, and rankings were often attributed to the organization of national education systems. But is it correct to argue that the same items have been presented, without taking into account the use of different languages? While different cultural backgrounds may be assumed to influence reading literacy (becoming visible in the results), areas like mathematics may be regarded as less sensitive. But the quantitative evaluations presented below show that enough factors are still introduced by wording and by language. Thus, the validity of the PISA assessment – testing what is intended to be tested – should be watched carefully within an international frame.
Factors indicating the importance of reading and language
Based on cultural backgrounds, influences of language may be expected for areas that are linked to reading. In this case, arguing for the importance of wording has good prospects of success, because it seems obvious that reading fluency and reading comprehension are related to sentence structure, to the use
of specific terms, or to the length of the words of a text. These factors may be regarded as less important for other areas such as mathematics; scientific competence may also be less affected. But if it can be shown that even in these areas the effects of language cannot be neglected, the results obtained so far can be generalized. Thus, the following sections focus on PISA's mathematics assessment.
In an average mathematics textbook, the pages are full of formulas and numbers. A first look at PISA items designed for testing mathematics performance reveals a different layout. Approximately 7 % of the text (1250 characters out of 18058 in the English item sample) consisted of digits or mathematical operators like = + - % /; the rest (93 %) formed a readable text whose proper understanding is required to find the correct solution. (Some diagrams were also shown, but note that diagram texts are included in these counts.)
The influence of language is further demonstrated by a model calculated to explain the PISA 2000 results in mathematics, presented in Artelt et al. (2001, 25). To predict performance in mathematics, factors such as socioeconomic status, gender, general cognitive ability, mathematical self-image and reading competence were considered. These variables explained 76 percent of the variance of the performance in mathematics. It was expected that general cognitive ability or mathematical self-image would influence the mathematics score most strongly. Surprisingly, the strongest influence came from reading competence, expressed in a path coefficient of 0.55, whereas the other factors contributed less: the path coefficients were 0.32 for general cognitive ability and 0.14 for mathematical self-image.
The translation process – as described by PISA-Austria (2004a) – will be considered next. Language versions were generated for the teaching languages of the particular participating nations. They start from an English source text and a French source text (often derived from the English version). A so-called double translation process is used, i.e. teams of two independent translators develop the national items, which are then cross-checked and merged into a final national version. International verification steps, a training programme and item analyses serve as quality assurance for the translation process. Nevertheless, this process starts with the two source languages, and it is not clear how far the translators can free themselves from the language structures of the source language(s) to achieve good readability of the translated items. If reading comprehension is reduced by this process, then the testing process is dissimilar and the comparability of results may be rather restricted – the influence of translation should therefore be considered in more detail.
A principal component analysis of the PISA 2003 country scores for mathematics, science, problem solving and reading literacy yields a common factor which contributes 94 % of the total variance. This observation shows that there are no distinct foci; one might, for example, expect that some countries promote mainly the natural sciences or mathematics, while others concentrate on reading comprehension and literacy. Different interpretations are possible, but the concept of reading comprehension, and thus language understanding, being the most important factor would be a good explanation. Therefore, the influence of factors like wording, length of item texts etc. should be investigated in more detail.
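The 94 % figure can be illustrated with a small sketch of such a principal component analysis. The country scores below are purely illustrative stand-ins (not the official OECD figures); with the real PISA 2003 country means, the first component dominates in the same way.

```python
import numpy as np

# Hypothetical country-level scores (rows: countries; columns:
# mathematics, science, problem solving, reading literacy).
# Illustrative values only, not the official PISA 2003 results.
scores = np.array([
    [544, 548, 550, 543],
    [503, 491, 506, 492],
    [466, 486, 469, 476],
    [385, 405, 384, 400],
    [527, 525, 530, 522],
    [490, 495, 487, 497],
], dtype=float)

# PCA via singular value decomposition of the column-centred matrix;
# the squared singular values give each component's share of variance.
centred = scores - scores.mean(axis=0)
_, s, _ = np.linalg.svd(centred, full_matrices=False)
explained = s**2 / np.sum(s**2)

print(f"share of variance, first component: {explained[0]:.1%}")
```

Because the four domain scores rise and fall together across countries, the first component captures nearly all of the variance, mirroring the observation above.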
Selection of PISA items for comparison
An evaluation based on the language and wording of items should select appropriate items carefully. Language differences can be demonstrated by items that were originally considered as PISA items but were finally not chosen for the test booklets (cf. e.g. the description of the item construction process in OECD, 2001, p. 42ff.). Some of these items have been released to the public and serve as examples for the different PISA assessment areas in the "full report", available in several languages (e.g. OECD 2004a,b,c). They are presented as prototypes for different assessment areas, e.g. for several levels of mathematical skills. The major advantage of these items is that they can be easily retrieved in several languages, and their release to the public eases the discussion of individual contents.
For the German version, 42 released mathematics items (with up to 3 questions each) were located (see PISA Austria, 2004b). Some of them had already been released in OECD publications (2004c). The English items were downloaded directly from OECD (2005) in a file comprising 27 PDF documents, extended by other sources (e.g. HKPISA, 2006). All of the English items (except one) are also available in the German sample and have been used for the subsequent comparisons. Items released to the public are also available for other assessment areas, but in smaller numbers. Mathematics performance items can be assumed to be less language-dependent, and the sample of 31 possible comparisons yields an item number that can reasonably be used for statistical calculations. The availability of these items in other languages invites an extension of the investigation. Item identifiers start with a letter ("M" for mathematics), followed by a three-digit item number and, possibly, a question number indication. An abbreviation of the item contents has been taken from the file name of the electronic format.
The following comparisons first show that the German text is significantly longer than the English version, and discuss the implications for the assessment. Then the familiarity of words (which should be related to the word knowledge of the target group of 15-year-old students) is examined using an approach from quantitative linguistics. Furthermore, German sentence structures appear to increase the complexity of the German texts even further.
Text-length based comparisons
31 mathematics items have been evaluated, in both the German and the English version. The item texts were retrieved from the PDF format and imported into a word processing program. To analyze the details, a computer program (written in Visual Basic) was developed and applied to the item texts. Table 1 summarizes the results for the number of words (units separated by spaces or similar sentence marks) and the number of characters. To eliminate the confusing effect of number tables (there were some in the item sample), the number of digits and mathematical operators (like + – / %) was counted separately; these special characters are not included in the total character count.
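The counting procedure can be sketched as follows. The original program was written in Visual Basic, so this Python reconstruction (including the exact operator set) is our own assumption about the details.

```python
import re

# Digits and mathematical operators are tallied separately and are
# excluded from the total character count, as described in the text.
MATH_CHARS = set("0123456789=+-%/")

def count_item_text(text: str) -> dict:
    words = re.findall(r"\S+", text)  # units separated by whitespace
    math_chars = sum(1 for c in text if c in MATH_CHARS)
    # remaining non-space characters (letters and punctuation)
    chars = sum(1 for c in text if not c.isspace() and c not in MATH_CHARS)
    return {"words": len(words), "chars": chars, "math_chars": math_chars}

sample = "A skateboard costs 82 zeds, i.e. 40 + 42 = 82."
print(count_item_text(sample))  # {'words': 11, 'chars': 26, 'math_chars': 10}
```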
The results for English and German are shown below.
ItemID       #words    #chars    #words    #chars
             (English) (English) (German)  (German)
M037Farms       155       694       149       796
M124Walkg       109       634       124       761
M145Cubes        68       306        62       341
M148Cont         52       286        61       353
M150GrwUp        91       403        85       495
M159Speed       235      1063       228      1307
M161Triang       66       328        62       349
M179Robbr        64       335        54       333
M266Crntr       110       475        93       465
M402IRC         157       771       144       820
M413Excha       182      1002       157      1005
M438Expor       128       576       126       616
M467Candy        68       321        69       364
M468STest        58       296        49       330
M471SFair        98       473        92       566
M484Books        65       365        58       396
M505Littr        84       474        74       516
M509Quake       156       818       153       919
M510Choic        65       393        62       446
M513Score       132       557       155       867
M515ShKid       112       380       101       454
M520Skate       227      1143       226      1426
M521Table        80       433        66       443
M525Deacr       265      1405       278      1629
M543Space        98       488        97       532
M547Stair        41       198        35       177
M555NCube2       94       448        74       509
M555NCube3      149       723       143       951
M702Presi       158       877       145      1003
M704BestC       237      1098       226      1290
M806StepP        58       295        57       340
Tab. 1: Text-length based results for the PISA example mathematics items, English and German versions.
Text lengths vary from item to item, but the texts usually contain several hundred characters (most items had only one question attached, and only three items had 3; still, a lower length limit can also be observed on a per-question basis). The average lengths calculated were 583 characters for the English version and 670 characters for the German version – indicating that the German items are noticeably longer. Average word counts were more similar for the English and German versions (118.1 and 113.1 respectively – but note that some terms use two words in English, whereas in German two words are often combined into one; this effect results in more, and shorter, words in English).
It is interesting to observe the length of the German item texts as a function of their English source counterparts (in fact, the dependency is given by the translation process). This relationship is shown in Fig. 1 and is clearly visible, described by a high correlation coefficient of r = 0.98; only a few entries deviate noticeably from the regression line.
To obtain a well-founded estimate of the relative text lengths, the regression line (based on a least-squares fit with intercept = 0) has been calculated, represented by the formula

length(German) = 1.16 · length(English)

Testing the slope coefficient for statistical significance clearly supports the statement that German item texts are in fact longer than the English ones by nearly 1/6 (the 95 % confidence interval for the slope parameter is [1.123, 1.198]).
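The slope and its confidence interval can be reproduced from the character counts in Table 1. The sketch below uses a least-squares fit through the origin; the t-quantile for 30 degrees of freedom is approximated as 2.04.

```python
import numpy as np

# Character counts from Table 1 (31 released PISA mathematics items).
english = np.array([694, 634, 306, 286, 403, 1063, 328, 335, 475, 771,
                    1002, 576, 321, 296, 473, 365, 474, 818, 393, 557,
                    380, 1143, 433, 1405, 488, 198, 448, 723, 877,
                    1098, 295], dtype=float)
german = np.array([796, 761, 341, 353, 495, 1307, 349, 333, 465, 820,
                   1005, 616, 364, 330, 566, 396, 516, 919, 446, 867,
                   454, 1426, 443, 1629, 532, 177, 509, 951, 1003,
                   1290, 340], dtype=float)

# Least-squares fit through the origin: slope = sum(x*y) / sum(x*x).
slope = np.sum(english * german) / np.sum(english**2)

# Approximate 95 % confidence interval (t ~ 2.04 for 30 df).
residuals = german - slope * english
se = np.sqrt(np.sum(residuals**2) / (len(english) - 1) / np.sum(english**2))
ci = (slope - 2.04 * se, slope + 2.04 * se)

r = np.corrcoef(english, german)[0, 1]
print(f"slope = {slope:.3f}, 95 % CI = ({ci[0]:.3f}, {ci[1]:.3f}), r = {r:.2f}")
```

This reproduces a slope of about 1.16 and an interval close to the [1.123, 1.198] reported above.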
Fig. 1: Graphical display of English vs. German item text length. The regression line is shown.
The relevance of these observations becomes visible when contrasting the items with the reading speed of 15-year-old pupils. For average adults, reading speeds of around 200-300 words per minute are frequently reported (depending on the amount of text comprehension required, print size etc.). For these readers, a reading time of around half a minute per item can be expected.
Assuming that the mathematics part of a PISA assessment contains 20 items to be worked out in 30 minutes, just reading the items would consume 1/3 of the available time. On the other hand, PISA items should also be manageable for pupils at a below-average level. For slow readers, speeds of only 110 words per minute are proposed for English texts (Readingsoft, 2007; similar reading speeds are reported for "efficient words" related to understanding). For these readers, the total reading time would sum up to 21 minutes – 70 % of the 30-minute session. German texts have longer words (about 20 % by our data) and reading speed is even a bit slower, so the remaining time that can actually be devoted to reflecting on the mathematics behind the questions is still shorter. High variances in reading ability can indeed be expected, e.g. from the findings of Klicpera and Gasteiger-Klicpera (1993), who reported that the least performing 15 % of pupils in the 8th school year read at the level of an average reader at the end of the 2nd or the beginning of the 3rd school year.
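The arithmetic behind these estimates is simple enough to write down. The 118 words per item come from the English average above; 250 words per minute is our assumed mid-range adult reading speed within the 200-300 wpm band.

```python
# Reading-time estimate for a hypothetical 30-minute session with
# 20 items of about 118 words each (the English per-item average).
N_ITEMS, WORDS_PER_ITEM, SESSION_MIN = 20, 118, 30
total_words = N_ITEMS * WORDS_PER_ITEM  # 2360 words

for label, wpm in [("average reader", 250), ("slow reader", 110)]:
    minutes = total_words / wpm
    share = minutes / SESSION_MIN
    print(f"{label}: {minutes:.1f} min reading, {share:.0%} of the session")
```

An average reader spends roughly a third of the session just reading; a slow reader at 110 wpm spends about 21 minutes, i.e. around 70 %.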
Familiarity of words
In 1932, the American linguist and philologist George Kingsley Zipf (1932) noticed that the statistical frequency of words can be linked to their rank by frequency of occurrence. A word ranked n-th in frequency occurs with a probability P_n of about

P_n ∝ 1/n^a

where the exponent a is close to 1. What is now known as Zipf's law holds well for nearly all languages (except for the first few words – e.g. in English: the, and, to, of, a, . . . – whose probabilities deviate only slightly). Since then, word frequency tables have been constructed. Files representing the top 10 000 words of languages like English, German, French and Dutch can be downloaded (e.g., see Universität Leipzig, 2007), with the relative frequencies of the words occurring in an average text of the selected language spanning several orders of magnitude. Words that occur quite seldom (e.g. only once every 100 000 words) may not be well known, may be difficult to understand, or may even be unknown to average users of the language.
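As a sketch, Zipf's law with a = 1 over a 10 000-word vocabulary looks as follows; the vocabulary size matches the rank lists used below, while the normalisation is our own illustrative choice.

```python
# Zipf's law: the word at rank n occurs with probability roughly
# proportional to 1/n**a, with the exponent a close to 1.
def zipf_probability(rank: int, a: float = 1.0, vocab: int = 10_000) -> float:
    norm = sum(1 / n**a for n in range(1, vocab + 1))  # normalising constant
    return (1 / rank**a) / norm

p1 = zipf_probability(1)
p1000 = zipf_probability(1000)
print(f"rank 1: {p1:.4f}, rank 1000: {p1000:.6f}, ratio: {p1 / p1000:.0f}")
```

With a = 1, the most frequent word is exactly 1000 times as probable as the word at rank 1000, illustrating the orders of magnitude mentioned above.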
In order to detect reading disadvantages of items in the translated language, these rank lists of English and German words have been applied. Words occurring frequently have low rank numbers; rare words have high rank numbers.
If frequent words in the first language are replaced by infrequent words in the second language during translation, then the resulting text is more difficult to understand. Text translations that claim to yield similar difficulty should use words that occur with similar probability, i.e. that have similar rank numbers (according to Zipf's law). Seldom-used words are more relevant for a comparison, because a shift in difficulty would be easier to observe and would influence the understandability of a text more distinctly.
The PISA example mathematics items have been reviewed in both versions, English and German. Words that seemed important for an item text, as well as "difficult" words (occurring less frequently), were identified, and their translated counterparts were located. Then, rank numbers were determined and compared. A list of (the first of these) words is shown below. It should be noted that even more complicated words (e.g. German words like Hemisphäre) can be found in other PISA assessment areas (e.g. science).
English            German               rank(English)  rank(German)  in favour of
...
footprints         Fußabdrücke          4491,1233      2833,-        English
pacelength         Schrittlänge         2649,2607      746,3172      English
average height     Durchschnittsgröße   388,5346       3259,1784     German
interpretation     Interpretation       5246           5632          English
make a border      umranden             112,1669       -             English
communicate        kommunizieren        3608           -             English
exchange rate      Wechselkurs          508,207        1923,1187     English
information        Informationen        135            472           English
exports            Exporte              1936           7452          English
probability        Wahrscheinlichkeit   6703           6161          German
probable, likely   wahrscheinlich       654,456        1247          English
average            Durchschnitt         388            3259          English
represented        dargestellt          2422           3148          English
clips              Klammern             7963           -             English
bar graph          Balkendiagramm       2128,5010      -,-           English
happen             passieren, passiert  2238           1820,2799     English
Tab. 2: Rank-order comparison of words extracted from the PISA mathematics example items. "–" indicates that the word could not be found in the list of the top 10 000 words. Commas indicate that the constituents of a term have been selected instead of a single word; in this case the higher rank number has been used for comparison (both parts need to be understood).
The compilation suggests that in most cases the English original uses words with lower rank numbers (hence occurring more frequently in the English language) than their German equivalents. To obtain a total figure for an approximate comparison, average rank numbers can be calculated (substituting the rank number 10 000 for words not in the list). Then the average for English is rank 2770, whereas the average for German is rank 5133, which is considerably higher.
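The comparison rule from the caption of Table 2 (take the higher rank among a term's constituents, substitute 10 000 for words missing from the list) can be sketched as follows, using a few entries from the table.

```python
MISSING = 10_000  # substituted for words not in the top-10 000 list

def effective_rank(ranks):
    """Higher (i.e. worse) rank among a term's constituents; None = missing."""
    return max(MISSING if r is None else r for r in ranks)

# A few (English ranks, German ranks) pairs taken from Table 2.
pairs = {
    "footprints / Fußabdrücke": ([4491, 1233], [2833, None]),
    "exchange rate / Wechselkurs": ([508, 207], [1923, 1187]),
    "average / Durchschnitt": ([388], [3259]),
    "probability / Wahrscheinlichkeit": ([6703], [6161]),
}

for term, (en, de) in pairs.items():
    e, d = effective_rank(en), effective_rank(de)
    print(f"{term}: EN {e} vs. DE {d} -> in favour of "
          f"{'English' if d > e else 'German'}")
```

With these four entries the verdicts match the table: three in favour of English, one (probability) in favour of German.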
Although only a few words have been selected, the result is striking – the words' rank numbers indicate that the German item translation can be considered more difficult to understand than the English original.
This approach explains why persons with a foreign mother tongue (e.g. with a migration background) may sometimes face problems. Usually the most frequent words of a language are taught first, and a vocabulary of the most frequent 10 000 words may not be sufficient to understand several of the PISA mathematics items.
Further language issues
When comparing two specific languages (e.g. English and German), further language issues may be identified. Subordinate clauses inserted in the middle of sentences are more frequent in German and may impair readability. German grammar is considered more complicated than English grammar. Ambiguities may lead to misunderstanding, and the use of an official language register in translations may cause additional difficulty when "peer slang" is used predominantly in the target group. Still other topics can be found in a summary by Rost (2001) on reading comprehension. However, a quantitative evaluation of these aspects is beyond the scope of this contribution.
Conclusions
For the PISA sample mathematics items it has been shown, via the regression's slope coefficient, that the German items are significantly longer than the English ones (based on straightforward character counting). Some moderately difficult words become even more difficult after their translation into German, and hence do not support fast and efficient answering in a test situation. And a quick look into the science and problem-solving items suggests that these findings are not limited to mathematics. As a consequence, the promise of PISA to support fair international, inter-language comparisons of the output of education systems begins to fail at the language boundaries.
Three steps are therefore proposed for the future. First, rigid international ranking schemes should be interpreted more carefully, to account for the potential problems. Second, investigations should take place to better understand the process of answering PISA items, including language-specific problems and a variety of other factors, extending research on the open issues – discussing PISA according to PISA. Finally, improvement of the whole process of item creation should consider item translation, item formats and new item types to overcome the current problems.
References
Artelt, C., Baumert, J., Klieme, E., Neubrand, M., Prenzel, M., Schiefele, U., Schneider, W., Schümer, G., Stanat, P., Tillmann, K.-J. & Weiß, M. (Hrsg.): PISA 2000. Zusammenfassung zentraler Befunde. Berlin 2001; online: http://www.mpib-berlin.mpg.de/pisa/pdfs/ergebnisse.pdf, retr. 2004/12/03.
HKPISA Programme for International Student Assessment Hong Kong Centre: Sample Test Items PISA 2000, 2003. 2006; online: http://www.fed.cuhk.edu.hk/hkpisa/sample/files/2000_Maths_Sample.pdf, retr. 2007/09/20.
Klicpera, C. & Gasteiger-Klicpera, B.: Lesen und Schreiben – Entwicklung und Schwierigkeiten. Huber, Bern 1993.
OECD Organisation for Economic Co-operation and Development: Learning for Tomorrow's World. First Results from PISA 2003. Paris 2004a.
OECD: Apprendre aujourd'hui, réussir demain. Premiers résultats de PISA 2003. Paris 2004b.
OECD: Lernen für die Welt von morgen – Erste Ergebnisse von PISA 2003. Paris 2004c.
OECD Organisation for Economic Co-operation and Development: PISA 2003 mathematics questions. Paris 2005; online: https://www.oecd.org/dataoecd/12/7/34993147.zip, retr. 2007/09/14.
PISA Austria: Testinstrumente. 2004a; online: http://www.pisa-austria.at/pisa2003/testinstrumente/lang/III_Testinstrumente.htm, retr. 2005/08/14.
PISA Austria: Mathematik freigegebene Aufgaben. 2004b; online: www.pisa-austria.at/pisa2003/testinstrumente/lang/mathematik_freigegebene_aufgaben.pdf, retr. 2007/09/14.
Readingsoft: Speed Reading Test Online. 2007; online: http://www.readingsoft.com, retr. 2007/09/17.
Rost, D.H.: Leseverständnis. In: Rost, D.H. (Ed.): Handwörterbuch Pädagogische Psychologie, pp. 449-456. PVU, Weinheim 2007.
Universität Leipzig: Deutscher Wortschatz – Wortschatzportal. Institut für Informatik, Universität Leipzig, 2007; online: http://wortschatz.uni-leipzig.de/html/wliste.html, retr. 2007/09/15.
Zipf, G.K.: Selected Studies of the Principle of Relative Frequency in Language. Cambridge (Mass.) 1932.
England: Poor Survey Response and No Sampling of Teaching Groups 1
S.J. Prais
United Kingdom: National Institute of Economic and Social Research, London
Abstract:
The two recent (2003) international surveys of pupils' attainments were uncoordinated, overlapped considerably, and were costly and wasteful, especially from the point of view of England, where inadequate response rates meant that no reliable comparisons at all could be made with other countries. It is the weaker pupils who tend not to respond, and poor response rates thus tend to yield upwardly-biased results. Inadequate emphasis on classes, or on teaching groups, in designing the samples means that little progress can be made in relating success in learning to average class-size or to variability among pupils. The surveys were conducted, respectively, by the OECD (Programme for International Student Assessment – PISA) and by the US-based International Educational Assessment group (Trends in International Mathematics and Science Study – TIMSS). Sources of the problem are investigated here.
Some astonishment was aroused by the recently published results of two, apparently independently organised, large-scale international questionnaire surveys of pupils' mathematical attainments towards the middle of their secondary schooling (age 14-15); nearly 50 countries participated in each survey, with some 200 schools in each country. Both surveys were carried out in the same year, 2003; previous surveys had generally been carried out at about ten-year intervals, and each of these two very recent surveys had been carried out
1 This chapter is an edited version of my paper in the Oxford Review of Education (vol. 33, no. 1, February 2007). Thanks are due to the editors of that Review for permission to reproduce.
only 3-4 years previously. Some questions on science and literacy were included in 2003, but the focus was on mathematics (and that is our focus here). A test towards the end of primary schooling, at age 10, was also carried out in association with one of these surveys. The total cost was probably over £1m for England, and probably well over $100m for all countries together, plus the time of pupils and teachers directly involved. 2 Results were published by the beginning of 2005 in several thick volumes, totalling some 2000 large (A4) pages; the two organisations behind the surveys are known as TIMSS and PISA (details of the organisations and publications are at Annex A at the end of this paper). There does not appear, from these publications, to have been any coordination between the two organisations. Much wasteful overlap and duplication is evident; the interval between recent repetitions of these surveys was so tight as not to permit adequate consultation for lessons to be learnt. 3
Representativeness of samples
We shall try to assess here some of the main findings for England, ask whether further surveys of this kind are justified, and whether anything is to be learnt from these recent surveys which might improve future ones. What can be said with any confidence about English pupils' attainments towards the end of their secondary schooling is much limited by poor sample response. From the TIMSS report on 14 year-olds we learn: 'England's participation fell below the minimum requirements of 50 per cent, and so their results were annotated and placed below a line in exhibits (= statistical tables) showing
2 Only limited information on the costs of these surveys has been released. For England, a total of £0.5m was paid to the international coordinating bodies, but information on locally incurred costs was withheld (in reply to a Parliamentary Question on 7 March 2005) as publication could 'prejudice commercial interests' in the government's negotiating of repeat surveys in 2006-7. It is astonishing that expenditure on further surveys should have been put in hand before there had been adequate opportunity for scientific assessment of the value of the 2003 surveys and of the appropriate frequency of their repetition.
3 The PISA (Programme for International Student Assessment) inquiry of 2003 was organised by OECD and followed their first attempt in this activity in 2000. The report on their first survey was critically reviewed in my article in the Oxford Review of Education, 29 (2) (2003); the present paper has benefited from discussion following that earlier paper. The acronym TIMSS was originally short for Third International Mathematics and Science Study; subsequently it became short for Trends in International . . . The previous occasion on which it had been carried out was 1999. More of the 2003 co-ordinating costs (76 %) were incurred by PISA, making TIMSS – which covered two age-groups – the better buy for the British taxpayer.
achievement'. 4 For the parallel PISA report, in all tables mentioning findings for the United Kingdom, a footnote was attached to the line for the UK (and only for the UK!): 'Response rate too low to ensure comparability'. 5
In other words, any differences that may appear between published results for England and other countries are not to be relied on. This reservation was not, however, attached to the tests of English 10 year-olds towards the end of their primary schooling (carried out by TIMSS, following a similar survey at that age in 1995); and those results, at first sight, appear to be the most scientifically interesting and important for educational policy. We will need to examine below whether those results are indeed robust enough – that is to say, adequately representative – to be relied upon.
But before that, a short word on the recent historical background of Britain's schooling attainments may be helpful. Britain's economic capabilities – its motor industry, its machine tool manufacturing industry, as well as other industries relying on a technically skilled workforce – led to much public concern by the 1960s: expressed subsequently, for example, in the official Cockcroft Committee's report Mathematics Counts (HMSO, 1978), eventually leading to the National Curriculum, the National Numeracy Project, and then to nationwide annual testing of all pupils in basic school subjects at all primary and secondary schools (SATs at ages 7, 11 and 14 to supplement the longer-standing GCSE tests at 16).
Detailed empirical comparisons of productivity and workforce qualifications were made in the 1980s and 1990s by teams centred at the National Institute of Economic and Social Research (London). Site visits to comparable samples of manufacturing plants in England and Germany clarified the nature of the great gaps in workforce qualifications; these gaps were not so much at the university graduate level, but at the intermediate craft levels (City and Guilds, etc.) – the central half of the workforce. The difficulty in England in expanding that central category of trainees was traced to the secondary school-leaving stage, when the standards of mathematical attainment required for craft and technician training, especially in numeracy, were much below Germany's. The IEA's First International Mathematics Survey of 1964 (FIMS – the original predecessor of TIMSS) was one of the important sources that confirmed this gap in secondary school mathematics; it was made evident to our teams of secondary mathematics teachers and inspectors on visits to secondary schools
4 TIMSS, Mathematics Report, p. 351.
5 See, for example, PISA, Annex B, Data Tables, pp. 340 et seq.
in France, Germany, the Netherlands and Switzerland, and in discussions with heads of industrial training departments (Meister). 6 An important conclusion from visits to schools was that it was quite unrealistic to expect English secondary schools to be able to produce the numbers of students with the levels of mathematical competence that had been seen abroad if they had to start with the standards delivered by our primary schools.
Shifts in research interests and in official educational policy ensued for mathematics teaching, especially at primary level. Textbooks in England and in Europe were carefully compared; teaching methods abroad were observed by practising teachers; new teaching schemes were prepared; and nationwide tests of pupils' attainments were administered annually to all pupils at ages 2-3 years apart (SATs). Much more could be said on the details of what has amounted to a 'didactic revolution'; but perhaps the foregoing is sufficient to indicate the interest attached to the 2003 TIMSS mathematics results at age 10, which can be compared with the similar sample inquiry eight years previously at that age (the 1995 TIMSS – Third International Mathematics and Science Survey). Had England now caught up with its competitors, at least by the end of primary schooling?
The comparison was set out, clearly and apparently convincingly, in the national report for England for 2003 produced by the (English) National Foundation for Educational Research (which carried out the survey in England in coordination with the international body). It noted that England's mathematics scores showed the largest rise of any of the 15 countries that participated at the primary level in both 1995 and 2003 (the English rise was of 47 standardised points, from 484 to 531, where 500 is the notional average standardised score of all countries in these international tests, and the standard deviation is standardised at 100). Most test questions asked were different in the two years, but 37 questions were the same in both years; the proportion who answered those common questions correctly in England rose very satisfactorily from 63 to 72 per cent. The rise was even a little greater in questions relating to numeracy (arithmetic); this may all be taken as reassuring, since previous deficiencies in
6 See my paper with K Wagner, Schooling Standards in England and Germany: Some summary comparisons bearing on economic performance, in National Institute Economic Review, May 1985, and in Compare: A Journal of Comparative Education, 1986, no. 1. More generally, see the series of reprints re-issued by NIESR in two compendia entitled Productivity, Education and Training (1990 and 1995). Teams of teachers and school inspectors, particularly from the London Borough of Barking and Dagenham, were invaluable in assessing school-visits here and abroad.
English students' attainments were, as said, particularly marked in that area – the foundation stone of mathematics. 7 The top countries at the primary school level were, once again, those bordering the Pacific: Singapore, Hong Kong, Japan – with scores averaging about 570; England's rise in performance in the eight intervening years, by 47 points to 531, can thus be seen as approximately halving the gap with these top countries – and in hardly more than a decade.
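The gap arithmetic can be checked directly from the scores quoted in the text (international standardised scale: mean 500, standard deviation 100):

```python
# Gap-halving check using the scores quoted in the text.
england_1995, england_2003 = 484, 531
pacific_rim_avg = 570  # approximate average for Singapore, Hong Kong, Japan

gap_before = pacific_rim_avg - england_1995   # 86 points
gap_after = pacific_rim_avg - england_2003    # 39 points

# 39/86 is roughly 0.45, i.e. the gap is indeed approximately halved.
print(gap_before, gap_after, round(gap_after / gap_before, 2))
```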
To first appearances, this seems a remarkably encouraging achievement; and, one must equally say, in a remarkably short time-span, given the complexity of what amounted to changing almost the whole system of mathematics didactics. But are these sample results to be relied upon? We have noted that at the secondary school level (age 14) serious reservations were attached by the surveys' sponsors to the response rates for the English samples; at the primary level (average age 10.3, Year 5 in England) a cautionary footnote is always attached to the TIMSS results reported for England (not as serious as for the secondary school results – but not to be ignored): 'Met guidelines for sample participation rates only after replacement schools were included'. 8 With that modestly expressed caution in mind, let us next patiently re-examine the actual response rates for England, bearing in mind that if response rates were lower in 2003 than in 1995 we might expect better average scores to be recorded simply as a result of 'creaming higher up the bottle'.
We first compare the response for schools; then the response for students within responding schools; and finally, the product of these two rates. In 2003 there were 150 primary schools in the original English representative sample, of which 79 schools participated, or 53 per cent. 9 For the previous primary school inquiry of 1995, 92 out of 145 sampled schools participated at the fourth grade – 63 per cent. 10
The student participation rate (within participating schools) was 93 per cent in 2003, just a little below the 95 per cent recorded for 1995. Combining the two participation rates (schools × students) we have a participation rate of something like 50 per cent in 2003 compared with 60 per cent in 1995: there
7 See G Ruddock et al., Where England Stands in the Trends in International Mathematics and Science Study (TIMSS) 2003 (NFER), 2004, pp. 8-10.
8 IVS Mullis et al., TIMSS 2003 International Mathematics Report (IEA, Boston), 2004, for example, p. 35.
9 Ibid., p. 355.
10 IVS Mullis et al., Mathematics Achievement in the Primary School Years (TIMSS), 1997, p. A 13.
are thus grounds for worrying whether there has been a genuine improvement in scores in the population. 11
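The combined rates just mentioned are simply the product of the school rate and the within-school student rate, using the figures quoted above (before any replacement schools are counted):

```python
# Combined participation = school rate x student rate within participating
# schools, with the figures quoted in the text (before replacement schools).
rate_2003 = (79 / 150) * 0.93   # about 0.49 -- "something like 50 per cent"
rate_1995 = (92 / 145) * 0.95   # about 0.60

print(f"2003: {rate_2003:.0%}, 1995: {rate_1995:.0%}")
```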
But are either of these overall response rates adequate for anyone to place reliance on the representativeness of the results? Even TIMSS put the 'minimum acceptable participation rate' at 'a combined rate (the product of school and student participation) of 75 per cent'; but at Year 5 in England (as also in five other countries) 12 that criterion was said to be satisfied 'only after including replacement schools'. This brings us to a long-standing thorny dispute on acceptable sampling practices. The sampling procedure adopted in these international educational inquiries is not at all orthodox. It starts with several parallel lists of schools, each list being equally representative. 13 If an inadequate response is received from the initial list, then 'corresponding' schools from the second list are approached, and from a third list if necessary. For England in 2003, as said, a sample of 150 schools was drawn from the initial list; in the outcome, 79 schools from that list participated (a mere 53 per cent) and 71 schools refused. A further 71 (replacement) schools were then chosen from the second list, an estimated 27 of which participated (38 per cent) and 44 refused; an estimated 44 were then approached from the third list, of which 17 participated. The total number of schools now participating was 79+27+17=123; the total number approached was (nota bene, since the organisers of these surveys do not agree!) 150+71+44=265; the overall response rate for schools was therefore 123/265=46 per cent (a little below the 53 per cent from the first list). Taken together with a response of 93 per cent of students in participating schools, the total combined response (schools and students) was thus only 43 per cent – all much below the proportion (75 per cent) originally laid down by TIMSS as acceptable.
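The two competing calculations can be set side by side; the figures are exactly those quoted in the text:

```python
# Two ways of computing the 2003 English school response rate when
# 'replacement' lists are used, with the figures quoted in the text.
participated = 79 + 27 + 17        # 123 schools participated in total
approached = 150 + 71 + 44         # 265 schools approached in total

timss_rate = participated / 150              # about 0.82: relative to the original target only
corrected_rate = participated / approached   # about 0.46: relative to all schools approached

combined = corrected_rate * 0.93   # about 0.43 once the 93 % student response is included
print(f"{timss_rate:.0%}  {corrected_rate:.0%}  {combined:.0%}")
```

The difference between the two denominators (150 versus 265) is the whole substance of the dispute discussed in the next paragraph.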
11 The reader will understand that the gradient of the response rate with respect to attainment level will be different according to whether it is amongst schools, at the school level, or amongst students within schools; but the point is not worth elaboration in view of what is said in the next paragraph.
12 Australia, Hong Kong, Netherlands, Scotland, United States (ibid., p. 359). For the US a response rate (before replacement) of only 66 per cent was recorded for the primary survey, and the same for the TIMSS secondary survey. For England's secondary survey, the corresponding proportion was a mere 34 per cent!
13 For example, starting from an initial list of schools organised by geographical area, size, etc., a random start is made; subsequent schools are chosen after counting down a given total number of pupils (so, in effect, sampling schools with probability proportional to their size). A reserve list is yielded by taking schools each one place above the schools in that initial list; and a second reserve, by going one place down the initial list.
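The procedure in footnote 13 is, in effect, systematic probability-proportional-to-size (PPS) sampling with 'replacement' lists taken from neighbouring positions on the ordered frame. A minimal sketch, with an invented frame of 40 schools (the pupil counts are made up), might look like this:

```python
# Systematic PPS sampling of schools, plus 'replacement' lists formed from
# the school one place above / below each selection on the ordered frame.
import random

random.seed(1)
# Invented sampling frame: (school id, pupil count).
schools = [(f"S{i:02d}", random.randint(100, 600)) for i in range(40)]
n_sample = 5

total_pupils = sum(size for _, size in schools)
interval = total_pupils / n_sample
start = random.uniform(0, interval)
targets = [start + k * interval for k in range(n_sample)]

# Walk the cumulative pupil count; a school is selected when a target falls
# within its span, so larger schools are selected with higher probability.
sample_idx, cum, t = [], 0.0, 0
for idx, (_, size) in enumerate(schools):
    cum += size
    while t < n_sample and targets[t] < cum:
        sample_idx.append(idx)
        t += 1

primary = [schools[i][0] for i in sample_idx]
# Reserve lists: one place above, then one place below, each selection.
reserve1 = [schools[max(i - 1, 0)][0] for i in sample_idx]
reserve2 = [schools[min(i + 1, len(schools) - 1)][0] for i in sample_idx]
print(primary, reserve1, reserve2)
```

The sketch illustrates why Prais questions the practice: the reserve schools mirror the refusing schools' stratum position, but nothing guarantees they mirror the characteristic that caused the refusal.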
Incredible as it may seem, the statisticians at TIMSS calculated a participation rate not in relation to the total number of schools approached on the first and subsequent lists (265), but in relation to the smaller number originally aimed at (150); they consequently published a misleading response rate of 123/150 = 82 per cent for schools, and of 75 per cent for schools and students combined – just falling within their originally stipulated requirements (whereas the correctly calculated combined response rate, as just said, was only 43 per cent). Was this merely a momentary, forgivable slip? Or was it more in the nature of a scientistic trompe l'oeil, encouraging readers to believe that all was fundamentally well and had been placed in sound hands – including the hands of a Sampling Referee, an expert to whom such technical statistical details had been safely relegated? Having discussed this issue with a number of British statisticians, I have regretfully come to the conclusion – putting it as kindly as I can – that these surveys' statisticians had misled themselves and their educationist colleagues as a result of their commercial experience with quota sampling; and that any future such inquiry needs to be advised by a broader panel of social statisticians. 14 For the sake of clarity, I repeat that such an enlarged body will need to address two issues: first (a simple arithmetical issue), what is the correct method of calculating a total response rate if 'replacement' samples are included; secondly, is there any substantial scientific justification for approaching a 'replacement sample' (rather, say, than an initially larger sample)?
Returning to the real issue on which we would all like to draw happy conclusions, namely, the tremendous rise in our pupils' attainments at age 10, we see from the previous paragraph that the sample of responding schools (at 43 per cent, not 82 per cent as reported by TIMSS) has to be judged altogether too low to support any such conclusion.
But we cannot leave the topic of response rates without noticing a considerable improvement in the way that England's secondary school scores were calculated for TIMSS. As said at the outset, the whole of the English results were
14 Quota sampling is used in commercial work, and places greater emphasis on achieving the agreed total of respondents than on their representativeness; it is avoided in scientific work. On the 'Sampling Referee', see TIMSS 2003, p. 441. The issue of replacement sampling was questioned in my previous paper on PISA 2000 (Oxf. Rev. Education, 29, 2); see also the response by RJ Adams (ibid., 29, 3), and my rejoinder to that response (ibid., 30, 4). The need for representative sampling is so basic to scientific survey procedures that it is astonishing that those responsible for educational surveys, together with the government departments providing taxpayers' money for such exercises, could accept such an easy-going (slack) approach to non-response. But, as it now turns out, this was not the last word – as discussed below in relation to re-weighting with population weights.
rejected for international comparability in the international reports because they did not satisfy the originally specified sampling requirements (the rejection applied equally to TIMSS and PISA). There was, however, an additional national report on England's TIMSS survey which outlined an alternative calculation based on re-weighting the sample results by population weights. It tells us that the TIMSS sample over-represented schools that were 'average and above average in terms of national examination (or test) results' (i.e. weaker schools were under-represented: SJP). 'This sample was therefore re-weighted using this measure of performance to remove this effect'. 15 Presumably, the obligatory nationwide SAT test results were used to provide better weights, but details have not been released as to whether, for example, the re-weighting was for the country taken as a whole, or for the sampled schools or, indeed, the sampled students. The consequence of the re-weighting was that England moved down in the TIMSS mathematics ranking, to below Australia, the United States, Lithuania, and Sweden (a reduction of England's international score from 505 to 498). Nothing of very great substance, it might be thought; but the new method of estimation is of great importance for future surveys.
Such an adjustment raises the reliability of the estimated English average scores because, to put it simply, it employs population – rather than sample – weights for the various ability strata. When educational surveys of this kind were first attempted in 1964 no routine nationwide tests of mathematical attainments were available for England; now that they have become available, and even on an annual basis, they could be used to provide population weights for a TIMSS-type survey using internationally specified questions. 16
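The re-weighting described here amounts to post-stratification: the biased sample shares of the performance strata are replaced by their population shares. The strata shares and stratum means below are invented for illustration (they are not the NFER figures), but they reproduce the direction of the adjustment:

```python
# Post-stratification sketch: re-weight so each performance stratum counts
# with its population share rather than its (biased) sample share.
# All shares and stratum means are invented for illustration.
pop_share = {"below avg": 0.40, "average": 0.35, "above avg": 0.25}
sample_share = {"below avg": 0.25, "average": 0.40, "above avg": 0.35}  # weaker schools under-represented
stratum_mean = {"below avg": 465, "average": 505, "above avg": 545}

unweighted = sum(sample_share[s] * stratum_mean[s] for s in stratum_mean)
reweighted = sum(pop_share[s] * stratum_mean[s] for s in stratum_mean)

# Re-weighting pulls the estimated mean down, as with England's 505 -> 498.
print(round(unweighted, 1), round(reweighted, 1))
```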
15 G Ruddock et al., Where England Stands ( . . . in TIMSS 2003), National Report for England (NFER, 2004), p. 25. The (previous) view expressed by PISA was very different: 'A subsequent bias analysis provided no evidence for any significant bias of school-level performance results but did suggest there was potential non-response bias at student levels' (PISA, p. 328, my ital.). To emphasise: this is different from the TIMSS conclusion that it was the weaker schools that needed up-weighting to improve representation (pp. 9, 25).
16 It is difficult to find more than a trace of a reference to this re-weighting in the international TIMSS report, though it is quite explicit in the English national report; the same average scores for England are published in both reports. The TIMSS Technical Report (ch. 7, by M Joncas, p. 202, n. 7) offers the following light: 'The sampling plan for England included implicit stratification of schools by a measure of school academic performance. Because the school participation rate even after including replacement schools was relatively low (54 %), it was decided to apply the school non-participation adjustment separately for each implicit stratum. Since the measure of academic performance used for stratification was strongly related to average school mathematics and science achievement on TIMSS, this served to reduce the potential for bias introduced by low school participation'. The PISA report does
The upshot is that, first, while the TIMSS primary survey results for England are less reliable than would appear from the way they were reported, those for secondary schools – after re-weighting – are more reliable. Secondly, sampling errors ought properly to be calculated for the TIMSS secondary school survey as for a stratified sample. Thirdly, the poor response rates achieved in both these secondary school surveys might yet encourage a refusal by England – at a political level – to support any such future surveys; but we see here that what is first really required is more research into sampling design, that is, better use of population information collected in any event for general educational objectives, so enabling more accurate results to be attained at lower cost. 17
Objectives of international tests
When these international educational tests were introduced nearly two generations ago, it was widely understood that their main objective was not – as it seems to have become today – to produce an international 'league table' of countries' schooling attainments, but to provide broader insight into the diverse factors leading to success in learning. Despite current popular emphasis on 'league table' aspects (but usually without corresponding emphasis on sampling errors), much space is devoted in the present reports to students' 'perceptions', attitudes towards learning and their relation to success. But the reader often finds himself questioning the direction of causation; for example, we are told such things as that students who are happy with mathematics tend
not discuss any such possible improved estimation procedure.
17 The above discussion of response rates has been restricted, for the sake of brevity, to the primary school survey. More or less the same applied to both secondary school surveys, as follows. For the TIMSS secondary survey, the participation rate of the 160 sampled schools (before replacements were included) was a pathetic 34 per cent (TIMSS, p. 358); for the PISA inquiry, directed to 450 schools, it was 64 per cent (PISA, p. 327, col. 1). For the US, which deserves special attention because of its greater financial sponsorship, the corresponding secondary school response rates were 66 and 65 per cent (but would its financial contribution have been as great if the true response rates had been published, i.e. after correctly allowing for replacement sampling as explained above?).
The English Department of Education issued Notes of guidance for media-editors explaining that their 'failure to persuade enough schools in England to participate occurred despite . . . various measures including an offer to reimburse schools for their time . . . ' (National Statistics First Release 47/2004, p. 4, 7 December 2004). Note the term 'reimburse'; there is no suggestion of motivating a sub-sample of schools by a substantial net financial incentive.
148 SJPRAIS<br />
<strong>to</strong> do better in that subject: but perhaps causation is more the other way round <strong>–</strong><br />
those who do well in that subject are happier, or more willing <strong>to</strong> declare their<br />
happiness. Similarly, much space is given <strong>to</strong> watching TV, and its association<br />
with test scores; with reading books, and so on. But little space is given in<br />
these reports <strong>to</strong> what <strong>to</strong>pics are taught at each age, <strong>to</strong> what level, and <strong>to</strong> what<br />
fraction of the age-group (see Annex B on the implications of longer basic<br />
schooling life in the US); nor <strong>to</strong> such a basic ‘mechanism’ of school learning<br />
as <strong>to</strong> how students, who inevitably differ in their precise levels of attainment,<br />
are grouped in<strong>to</strong> ‘parallel’ differentiated classes <strong>–</strong> despite the obvious concern<br />
of this feature of schooling <strong>to</strong> teachers, parents, policy makers and, not least,<br />
<strong>to</strong> students.<br />
The relation between the size of a class and its average achievement is tabulated in one of the studies and well illustrates the issue of direction of causation. For smaller classes of up to 24 students, an average score of 479 was recorded for England at Year 9; for medium-sized classes of 25-32 students, the average score was higher at 511; and for larger classes of 33 or more students, the average score was higher still at 552 (much the same applied in the other countries). 18 Higher attainments in larger classes have previously been frequently observed – contrary to the presumption that smaller classes would do better; this ‘statistical relation’ has generally been attributed to widespread recognition by schools that slower/weaker pupils should be taught in smaller ‘parallel’ classes where possible. Whether schools allocated higher attaining pupils to larger classes as efficiently as possible can be debated; but it is clear that no one (least of all, the present writer) would draw the policy implication that if children were only to be taught in larger classes then they would attain better results at lower costs. Much care is similarly necessary in drawing conclusions from other statistical associations noted in these studies.
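The allocation mechanism just described can be mimicked in a toy simulation (all numbers invented): even when class size has no causal effect whatever on attainment, a school that streams its weaker pupils into smaller ‘parallel’ classes will show the same gradient of higher scores in larger classes.

```python
import random

random.seed(1)

# Toy model in which class size has NO causal effect on attainment.
# A school simply streams weaker pupils into smaller 'parallel' classes.
pupils = sorted(random.gauss(500, 100) for _ in range(900))  # weakest first

small = pupils[:200]      # weakest band, taught in classes of up to 24
medium = pupils[200:500]  # middle band, classes of 25-32
large = pupils[500:]      # strongest band, classes of 33 or more

def mean(xs):
    return sum(xs) / len(xs)

# The larger classes show the highest average score, echoing the observed
# 479/511/552 pattern, although class size here caused nothing at all.
print(round(mean(small)), round(mean(medium)), round(mean(large)))
```

The gradient emerges purely from the allocation rule, which is the direction-of-causation point made in the text.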
For example, very strong conclusions were drawn by PISA on how the schooling system should deal with variability of students’ attainments and capabilities. But let us first spell out realistically the issue of variability of pupils’ attainments in a class from the teacher’s point of view. Some variability of students’ attainments within a class is unavoidable; but, once a certain level of variability is exceeded, the pace at which the teacher can teach slows, as does the pace at which learning takes place, not least amongst those students who are weaker (weaker for whatever reason – born at the later end of the school-year, illness last year, slow learning in a previous school, difficulties at home that weigh on the student’s mind . . . ), often with consequent ‘playing up’ in class; eventually the teacher finds it better to divide his ‘class’ into explicit sub-groups, or ‘sets’, which follow a more or less different syllabus of tasks, with consequences for the pace of learning, and the costly need for teaching assistants. All this is of course familiar; and it might have been thought that an elementary calculation of variability of attainments within a class would have been a natural, obvious, useful – indeed essential – part of such inquiries.

18 TIMSS, Mathematics Report, p. 266; the same applied also to the primary inquiry at Year 5, p. 267.

ENGLAND: POOR SURVEY RESPONSE 149
But the PISA sample was deliberately based not on whole classes, but on all those aged 15 in a school – whichever Year or attainment-set they were in. In England, as in other countries where promotion from one class to the next is based strictly on age, it might seem that nothing much is at issue; but to rely on that would ignore the widespread practice of ‘setting’ students within each year into groups by attainment levels – a practice that becomes more widespread at higher ages. In most other countries some reference to attainment level usually influences promotion from one class to the next. But nothing of this can be investigated with the help of PISA since its sampling was based not on whole classes – but purely on age, irrespective of class or teaching-group.
The TIMSS sampling process, on the other hand, was different since it was based on whole classes, and thus may be expected to be more relevant to our concerns; but that does not take us out of the woods. For the reality of a ‘class’ becomes tenuous in the upper reaches of secondary schooling, as ‘setting’ by attainment becomes more prevalent. In large English comprehensive secondary schools, a dozen ‘parallel’ mathematics classes for each age or ‘Year’, varying according to attainment, is not unusual; for TIMSS, usually just one of those classes was selected by some ‘equal probability’ procedure, except that when some classes were very small they were combined with another to form a ‘pseudo-classroom’ for sampling purposes. 19 A small class for very weak pupils might be combined with another class next higher in its attainments; or, for all we are told, could be combined with a small top set. In any event, no statistical analysis of the extent of student variability within teaching groups, nor even of the whole year-group within a school, seems to have been attempted as part of either of these sample inquiries, despite the central importance of that issue to success in teaching and learning, and its interest to teachers and educational planners.

19 TIMSS, [International] Technical Report, p. 121 (see also Mathematics Report, p. 349, which is also not very helpful); the English National Report has an Appendix on Sampling (p. 287) but regrettably says nothing on this vital aspect of sampling.
Despite the sampling design of both inquiries being so perverse that variability of students within teaching groups cannot be computed (to repeat: PISA did not sample whole classes, TIMSS generally sampled only one ‘ability-set’ out of each year-group), very strong policy conclusions were voiced in the PISA report against any form of differentiation: they were against dividing secondary school pupils into different schools according to attainment levels (in England: Grammar schools and Comprehensives); they were against dividing pupils within schools into streams or attainment sets; and they were against grade repetition, which they ‘considered as a form of differentiation’, and ipso facto evil. 20 Throughout there is the assumption that differentiation is the cause of lower average attainments, rather than seeing it the other way round – where teachers are faced with a student body that is unusually diverse, they use any organisational mechanism at their disposal to reduce diversity, and so make the group more teachable. In other words, greater variability within the class needs to be understood as the cause, rather than the effect, of lower attainments. All their conclusions were announced by PISA with great conviction – indeed, with great presumption – despite, as said, no calculations having been possible from their data on the variability of attainments within teaching groups, classes or year-groups.
The future
How was it possible, the reader will ask himself, for such large inquiries, with their endless sub-committees of expert specialists, to arrange their sampling procedures to exclude the possibility of calculating the variability of attainments for each class/teaching group? Any student of Kafka will readily invent his detailed scenario; but their essence is probably that the specialists were too specialised – in particular, the statisticians did not understand, or give sufficient weight to, the pedagogics of class-based learning; and the educationists did not give sufficient attention to the implications of the sampling procedures for response-rates. Perhaps most important, those in overall command were not sufficiently alive to such deficiencies in their varied specialists. Better ‘generalists’, rather than more specialists, seem to be required.

20 Parents in countries with low between-school variances, we are told, ‘can be confident of high and consistent performance standards across schools in the entire education system’ (PISA, p. 163). ‘Avoiding ability grouping in mathematics classes has an overall positive effect on student performance’ (though it is conceded ‘the effect tends not to be statistically significant at the country level’!) (p. 258). ‘Grade repetition can also be considered as a form of differentiation’ [and therefore to be avoided] (p. 264).
From the point of view of more representative sampling, future international inquiries of this kind, it can now be seen more clearly, need to be redesigned to incorporate sampling features of both these recent inquiries. We need to focus (a) initially on the original variability of attainments of a complete age-group of students (variability due to socio-historical or genetic elements), perhaps estimated by the PISA-approach or by sampling two (? three) adjacent school-grades as in previous TIMSS inquiries; (b) then we need to estimate the extent to which variability is reduced within teaching groups as they have been organised by schools in practice; (c) finally, we need to estimate the separate contributions of various institutional factors in each country to that reduction in variability – secondary school selection, ability-setting within Year-groups, class-repetition. Differences among countries in these elements may yield valuable and empirically-based policy conclusions.
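Steps (a) and (b) amount to the standard decomposition of variance into within-group and between-group components. A minimal sketch with invented scores shows the kind of elementary calculation the surveys’ sampling designs ruled out:

```python
from statistics import mean, pvariance

# Hypothetical attainment scores for one school year-group, already
# organised by the school into three 'parallel' attainment sets
# (all numbers invented for illustration).
sets = {
    "top":    [620, 605, 590, 585, 570],
    "middle": [540, 525, 515, 500, 490],
    "bottom": [450, 435, 420, 410, 395],
}
all_scores = [s for group in sets.values() for s in group]

# (a) variability of the complete age-group
total_var = pvariance(all_scores)

# (b) variability remaining within teaching groups, on average
within_var = mean(pvariance(g) for g in sets.values())

# (c) share of variability removed by the school's grouping, the very
# quantity that neither sampling design allows one to compute.
reduction = 1 - within_var / total_var
print(f"variability reduced by {reduction:.0%} through setting")
```

With whole teaching groups in the sample, the same decomposition could be run per school and compared across countries, which is precisely the empirical basis steps (a) to (c) call for.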
From the point of view of the substance of the inquiries, more focus and debate would be valuable on syllabus issues within mathematics. For example, what is the proper share of arithmetic in the overall mathematics curriculum at younger ages, and how should that share vary for different attainment-groups? In some countries (Switzerland, Germany), at least until recently, the less academic group of students often became more expert in mental arithmetic skills as a result of their different curricular emphases; has the wholesale use of calculators really made this otiose? At what ages, and to what fractions of pupils, should specific topics be introduced such as simultaneous linear equations, quadratic equations, basic trigonometry or even basic calculus? No more than these few hints can be thrown out within the ambit of the present Note to indicate what a proper Next Step should include (see also Annex B on the anomalously low average attainments in mathematics at age 15 by the world’s economically leading country).
A final question: how much public breast-beating by the organisations that have carried out the two recent inquiries will be needed before they can be considered eligible for participation in such an improved Next Step?
Acknowledgements and apologies
This Note has benefited from comments on earlier drafts by Professor G Howson (Southampton), Professor PE Hart (Reading), Professor J Micklewright (Southampton), Dr Julia Whitburn and many others at the National Institute of Economic and Social Research, London; I am also indebted to the National Institute for the provision of research facilities. Needless to say, I remain solely responsible for errors and misjudgements.
I take this opportunity also of offering apologies to the individuals who have innocently participated in carrying out the underlying inquiries here reviewed; but those who planned those inquiries must fully accept their share of blame for the inadequacies complained of here, and for too often uncritically following what was done in previous inquiries – instead of improving on those practices.
ANNEX A
Some background on the two international educational inquiries of 2003
The International Association for the Evaluation of Educational Achievement (IEA) has been active since the 1960s in sponsoring internationally comparative studies of secondary schooling – subsequently also primary schooling – involving tests set to representative samples of students. The school subjects covered were mathematics and science, plus some separate inquiries into reading/literacy. The year-groups focussed on were eighth and fourth grades on the international grading (Europe and the United States), corresponding to Years 9 and 5 in the UK, that is, to ages of about 14 and 10. Sampling was based on school classes. Before 2003 the IEA had carried out similar inquiries in 1995 (in some countries also in 1999). The number of countries expanded over time to reach 49 in 2003; the most recent IEA inquiries go under the name of TIMSS – Trends in International Mathematics and Science Study. The studies are now managed from Boston College, Mass., with substantial financial support from the US government mainly for the central organisation; financial support for the surveys in each country is provided locally.
Three reports were published by TIMSS on their 2003 inquiries:
IVS Mullis et al., TIMSS 2003 International Mathematics Report (Boston College, 2004), pp. 455.
IVS Mullis et al., TIMSS 2003 International Science Report (Boston College, 2004), pp. 467.
MO Martin et al. (eds), TIMSS 2003 Technical Report (Boston College, 2004), pp. 503.
The second inquiry considered here was sponsored by OECD (Organisation for Economic Co-operation and Development), an international organisation set up in Paris to assist European post-war economic reconstruction and development, with heavy support from the United States. It conducted its first assessment of educational attainments in 2000 under the name Programme for International Student Assessment, PISA for short; and a repeat was carried out in 2003. I have not been able to find any written justification for setting up an inquiry so close in its objectives to the IEA’s; but two differences – not necessarily justifications – should be noted. First, PISA focuses on a certain age, 15 – rather than school Year (or grade) as for TIMSS – for those included in its survey (though for some countries, such as Brazil and Mexico, that age is beyond compulsory schooling and only about half that age-group can be contacted). On average, the PISA age is about a year above TIMSS, and closer to the age of entering the workforce. Secondly, the focus of students’ questioning in PISA was said to be on the ‘ability to use their knowledge and skills to meet real-life challenges, rather than merely on the extent to which they have mastered a specific school curriculum’; whereas the focus of TIMSS is closer to the school curriculum. 21 It still remains to be shown whether the practicalities of written examinations held in a school room make any substantial difference to the outcome, whether one kind of question is asked or the other.
The PISA inquiry covered mathematics and science, just as TIMSS; and also had questions on literacy (reading). PISA’s emphasis in 2003 was on mathematics. Results were published in:
[No attributed authorship] Learning for Tomorrow’s World: First Results from PISA 2003 (OECD, Paris, 2004), pp. 476.
R Adams (ed.), PISA 2003 Technical Report (OECD, Paris, 2005).
Of the 48 countries included in PISA (49 in TIMSS, as said), 19 also participated in TIMSS. A full investigation, with access to individual questions and results in both inquiries, would be needed for a proper comparison; here we may note only that Hong Kong and Korea were near the top scorers in both inquiries (scores of 586, 589 in TIMSS; 550, 542 in PISA); in Europe, Netherlands and Belgium were about equally high (536, 537 in TIMSS – Flemish Belgium only; 538, 529 in PISA); and the United States was very slightly above average in TIMSS (a score of 504) and more than slightly below average in PISA (483). The different mix of countries in the two samples affects the standardised marks published: such comparisons between the results of the inquiries are therefore no more than suggestive.

21 PISA (2004), p. 20.
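The caveat about the country mix can be made concrete. Both studies report scores on a scale standardised to a fixed mean (roughly 500) over the participating countries, so a country’s published mark shifts when the set of participants changes even if its pupils’ raw performance does not. A toy illustration with invented raw scores:

```python
def standardise(raw, countries):
    """Rescale raw country means to mean 500, sd 100, across 'countries'."""
    vals = [raw[c] for c in countries]
    m = sum(vals) / len(vals)
    sd = (sum((v - m) ** 2 for v in vals) / len(vals)) ** 0.5
    return {c: 500 + 100 * (raw[c] - m) / sd for c in countries}

# Invented raw mean scores for four hypothetical countries.
raw = {"A": 62, "B": 58, "C": 50, "D": 40}

mix1 = standardise(raw, ["A", "B", "C", "D"])
mix2 = standardise(raw, ["A", "B", "C"])  # the weakest participant drops out

# Country A's raw performance is unchanged, yet its published score falls
# once the low-scoring country leaves the comparison.
print(round(mix1["A"]), round(mix2["A"]))
```

Hence like-for-like comparisons between TIMSS and PISA marks would require re-standardising over the common set of countries, which the published tables do not do.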
ANNEX B
The proper objectives of international comparative educational research
That the US, the world’s top economic performing country, was found to have schooling attainments that are only middling casts fundamental doubts on the value, and approach, of these surveys. It could be that the hyper-involved statistical methods of analysis used (known as Item Response Modelling) are, as many have suggested, wholly inappropriate (see also my comment of 2003 on the PISA 2000 survey, p. 161). Or it could be, as two US academics have suggested, that the level of schooling does not matter all that much for economic progress; rather, it is ‘Adam Smithian’ factors such as economies of scale, and minimally regulated labour markets that allow US ‘employers enormous agility in hiring, paying and allocating workers . . . ’. 22 Or – my own view – that the typical age of school-leaving in the US, at some three years above that in most European countries (say, 19 rather than 16), has the consequence that schooling attainments at 14-15 hardly provide a clear indication of the contribution of final schooling attainments to subsequent working capabilities. An older typical school-leaving age means that teachers can sequence their courses of instruction in a more graduated way; and that the kind of question set in the PISA inquiries – designed to be close to everyday life – is indeed something for which US students aged 15 are less ready than their European counterparts. But that does not mean that at later ages their schooling has not served US students as a whole at least as well as their European counterparts; more time may have been usefully spent by US students in those subsequent three years in consolidating fundamentals. No investigation, or even discussion, of such issues is to be found in the official reports on these inquiries; and the absence of a sufficient number of published individual questions makes it impossible for the reader to take the issue further.
22 See A P Carnevale and D M Desrochers, The democratization of mathematics, in Quantitative Literacy (eds. B L Maddison and L A Steen, National Council on Education and the Disciplines, Princeton NJ, 2003), esp. p. 24: ‘if the United States is so bad at mathematics and science, how can we be so successful in the new high-tec global economy? If we are so dumb, why are we so rich?’
So far we have treated both surveys (TIMSS, PISA) as showing much the same schooling performance for US pupils – namely, as indifferent, or even weak, when judged in relation to the tremendous economic performance of that country. But we should also notice, and express surprise, that it is precisely in that survey with questions emphasising practical and ‘real life’ aspects, namely, the PISA survey, that average US 15 year-olds are shown as being below the world average – whereas, in the more school-task oriented TIMSS survey, US students were – even if only modestly – above the world average. Indeed, it is not too fanciful to suppose that the undistinguished performance of US students in school-curriculum oriented questions in the earlier TIMSS surveys provided some of the impetus for carrying out a further survey with a more practical emphasis in its questioning. But anyone who expected better results for the US via that line of questioning must have been sorely disappointed by the outcome. That outcome, it may also be concluded, casts further doubt on the value of repeating a PISA-type survey. Until wider-ranging pilot inquiries, on alternative lines, have been carried out and analysed, it is difficult to see that further inquiries of the present sort and scale are justified.
Disappearing Students
PISA and Students With Disabilities
Bernadette Hörmann
Austria: University of Vienna
1 Who is disappearing?
“Have you ever tried to get a stroller or cart into a building that did not have a ramp? Or open a door with your hands full? Or read something that has white print on a yellow background, or is printed too small to read without a magnifying glass, or has words from a different generation or culture? Have you ever listened to a speech given without a microphone?” (Johnstone/Altman/Thurlow 2006, p. 1)
Concerning student assessment, public and scientific discourse seems to be limited to questions about its condition, possibilities and consequences. The urgent question that has to be asked is about the role of children with disabilities in assessment tests like PISA, TIMSS, etc. Are these students included? How are they included? Is there a way to include children with special needs in assessment tests? Are these assessment tests even able to assess the abilities of students with disabilities in an adequate way?
Generally, students with disabilities (SWD) do not get the chance to participate in student assessments (e.g. Posch/Altrichter 1997, p. 41; Van Ackeren 2005, p. 26; in the USA: McGrew/Algozzine/Spiegel/Thurlow/Ysseldyke 1993; Thurlow/Elliott/Ysseldyke/Erickson 1996a; Quenemoen/Lehr/Thurlow/Massanari 2001). In most cases they are asked to stay at home when the test takes place, or they are sent to another classroom during the test. If students with disabilities are allowed to participate in the testing, their scores are usually not counted, which means that these children are not represented in the official statistics. In the case of PISA, students who attend a special school (“Sonderschule”) get special test booklets that contain easier questions and that are shorter than the normal booklets. But in general, students with disabilities are excluded from the testing process, which makes them disappear, as they are not represented in the results of the assessments. Children who face exclusion are students with all kinds of disabilities, immigrants (non-native speakers) and low-achievers. As Wuttke observed, the participating states in PISA 2003 dealt quite differently with the “problem” of handicapped children. Turkey, for example, excluded only 0.7 percent of its students, while the exclusion rate in Spain and the USA reached 7.3 percent (OECD 2005, p. 169, quoted in Wuttke 2006, p. 106). PISA regulations permit the exclusion of up to five percent of the population, a limit that has been exceeded in several states. Haider, the Austrian national coordinator of PISA 2003, gives the following criteria for the exclusion of students in the Austrian PISA report of 2003: pupils can be excluded in case of severe, constant physical or mental disability, insufficient language knowledge, or when the student has dropped out of school.
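The exclusion figures just quoted can be set against PISA’s five-percent ceiling mechanically; a trivial sketch (the rates are those reported above for PISA 2003):

```python
# Exclusion rates (per cent) for PISA 2003 as quoted above.
EXCLUSION_CAP = 5.0  # PISA regulations permit excluding up to 5% of the population

exclusion_rates = {"Turkey": 0.7, "Spain": 7.3, "USA": 7.3}

over_limit = sorted(c for c, r in exclusion_rates.items() if r > EXCLUSION_CAP)
print("exceeded the 5 per cent limit:", over_limit)
```

Even this small check makes the point of the passage: the ceiling was not merely approached but exceeded in several participating states.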
In the U.S., student assessment is more common and has a long tradition. For this reason, scientists have developed a distinct branch of research, one concerned with the problem of the exclusion of students with disabilities. The NCEO (National Center on Educational Outcomes) provides annual reports and detailed research studies on this topic, and it aims at raising the number of students included in the testing. In Europe, however, this issue seems not to be considered as important (cf. Hörmann 2007). There seems to be a lack of research and public interest, a topic which will be discussed later in this article.
The following sections present arguments for the inclusion of SWD and illustrate how the U.S. tries to account for the diversity of students. Afterwards, the situation in Austria and the German-speaking countries is discussed, and perspectives for the future are given in the conclusion.
2 “Out of sight is out of mind” 1: Arguments for the inclusion of students with disabilities in assessment tests
As Hopmann (2006) shows, student assessment has become an inevitable necessity of modern society. Since the late 1990s, the socio-political models have changed from “management by placement” to “management by expectations”. The social welfare state with its traditional institutions and form of government could not be maintained any longer, and the public gradually demanded accountability from the institutions providing public services. Under management by expectations, risks and expectations in relation to an institution are collected and then taken as the basis on which its tasks and budget are fixed (cf. Hopmann 2006).
Assessment tests in their current state also deal with expectations and are a new way of measuring risks. State school systems are now bound to account for their services; and their services are delivered to regular students as well as to students with disabilities. On what basis can the exclusion of children with disabilities from such assessments be justified any longer?
Günter Haider wrote in the Austrian national report of PISA 2003:
“Im Rahmen von PISA sollen jedoch nicht einzelne Personen geprüft, sondern die Merkmale aller Schülerinnen und Schüler eines Landes kollektiv – über große Stichproben – erfasst werden.” (Haider 2003, p. 13, emphasis in original)
Thus, PISA is not designed to test individual students, but rather to measure characteristics of all the pupils of a specific country. Apparently, it is intended to produce a representative picture of the performances of all students in the whole nation.
According to Thurlow et al. from the NCEO, the idea of including all students is based on the following three assumptions:
– “All students can learn.
– Schools are responsible for measuring the progress of learners.
– The learning process of all students should be measured” (Thurlow et al. 1996a).
Excluding particular students from testing would mean that those children are made invisible and that they are also excluded from any political or social decision. Most policy decisions concerning school structures are based on the results of large-scale assessments. Consequently, children who are excluded from the test are also excluded from the very policy decisions which actually affect them. Thurlow even points out that excluding certain children from tests leads to invalid comparisons, which means that including these children could provide more realistic results (cf. ibid.).

1 As Thurlow, Elliott, Ysseldyke and Erickson (1996a) pointedly remarked.
From the perspective of special needs education, it has to be said that it is an obligation that every single child be granted the possibility of taking part in international student assessment. From the 1990s onwards, a new concept has been developed which is meant to displace the old notion of “integration”: the concept of “inclusion”. The Salamanca Statement of 1994 (UNESCO 1994) proclaims the right of every single child to participate in society, which means that all children should attend the same kind of school. Disability is viewed as just one kind of diversity among many others, and the ambitious aim of the statement is to change not people, but rather social structures and institutions, so that it becomes possible to account for all people’s needs (cf. Biewer 2005, p. 102ff). From this point of view, it is not the student who is disabled, but rather the school, which “disables” certain kinds of children. Consequently, student assessment, as a part of education, has to be geared to children with disabilities; it has to be constructed in such a way that every single child has the chance to participate successfully in the test. In contrast, the concept of “integration” would mean that students with disabilities would have to be remedially instructed to an adequate degree so that they could take part in the assessment. In this case the students would have to be changed instead of the tests.
3 Research in the U.S.

Including SWD in assessment tests has gained a lot of attention and has become an important part of policy in the U.S. The country is trying hard to establish a “participation policy” which should ensure that full inclusion becomes reality. In 2001, the “No Child Left Behind Act” was enacted, which requires every single state of the U.S. to report the participation rates of students with disabilities and the way they participate in assessment. “Full inclusion”, however, can never fully become reality, as there will always be excluded students – whether at random or not (cf. Koretz/Barton 2003).
The National Center on Educational Outcomes (NCEO) publishes an annual report in which the fundamental trends in the participation of particular student groups are presented. This is quite a challenging task, because every state has different guidelines concerning the inclusion of students with disabilities. But in general, almost every state is able to give trends and facts about the inclusion and performance of students over the past three years.
3.1 Participation and performance trends in the U.S.

The Annual Performance Report by Thurlow, Moen and Altman (2006) gives current figures on the participation and performance of students with IEPs (Individualized Education Plans) in the state assessments of the entire U.S. for the years 2003 and 2004. Almost every U.S. state is able to provide data on its participation rates in student assessment. The figures are categorised by type of assessment (reading and math), by the three school levels (elementary, middle and high school) and by the two types of states (regular and unique states). For example, at the elementary level, in 38 regular states (out of 50) and 2 unique states (out of 10), at least 95 percent of all children with IEP enrolment were assessed. Most of the other states reached between 85 and 95 percent, which can be seen as quite a high participation rate (see Figure 1).
At the middle school level, 34 regular states and 1 unique state assessed 95 percent or more, and at the high school level, 26 regular states and no unique state reached this threshold, while nearly half of the states reached between 85 and 95 percent (Thurlow et al. 2006, p. 7). Data for the math exam are quite similar to those for the reading exam (see Thurlow et al. 2006, p. 10ff).
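The reporting categories used above are simple participation-rate bins: a state's rate is the share of its IEP enrollment that was assessed, and states are then counted per bin (95 percent or more, 85 to 95 percent, below 85 percent). A minimal sketch of this tallying, in Python with hypothetical state figures (the state names and numbers are invented for illustration and are not taken from the NCEO report):

```python
# Illustrative sketch: derive NCEO-style participation categories from
# (hypothetical) per-state counts of IEP enrollment and students assessed.

def participation_rate(assessed: int, iep_enrollment: int) -> float:
    """Percentage of students with IEPs who took part in the assessment."""
    return 100.0 * assessed / iep_enrollment

def category(rate: float) -> str:
    """Bin a participation rate into the report's three categories."""
    if rate >= 95.0:
        return ">=95%"
    if rate >= 85.0:
        return "85-95%"
    return "<85%"

# Hypothetical figures: state -> (IEP enrollment, students assessed)
states = {
    "State A": (12000, 11650),
    "State B": (8000, 7100),
    "State C": (5000, 4050),
}

for name, (enrolled, assessed) in states.items():
    rate = participation_rate(assessed, enrolled)
    print(f"{name}: {rate:.1f}% -> {category(rate)}")
```

The same binning, applied to each state's reading or math figures at each school level, yields the counts of states per category reported above.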
In three states, students with IEPs have the possibility of taking an alternate assessment based on grade-level achievement standards. The proportion of students who participate in those alternate tests lies between 0.1 and about 10 percent (see Table 1).
                 Elementary   Middle school   High school
Massachusetts        0.29          0.10           0.30
North Carolina       1.19          0.88           0.42
Texas               10.37          4.81             –

Table 1: Percent of students with IEPs in an alternate reading assessment based on grade-level achievement standards (Thurlow et al. 2006, p. 20)
Fig. 1: Reading Assessment Participation Rates in Elementary School: Percent Participation is of IEP Enrollment (Includes Regular and Alternate Assessment) (Thurlow et al. 2006, p. 14)
Table 1 also shows the discrepancy between the figures of the various states (compare, for example, Massachusetts and Texas), which indicates that the regulations and their implementation are interpreted rather differently from state to state and are not consistent.

About 65 percent (high school: 61 percent) of students with IEPs take part in the regular assessment, but with accommodations. However, quite a number of states were unable to document the number of students who took this adjusted version of the regular test (cf. ibid., p. 8 and 21).
The performance of students with IEPs in the reading assessment tends to increase steadily. About 30 percent of the students with IEPs performed at a level considered “proficient”, which is slightly higher than in the years 2002/2003. At the elementary school level, IEP students in 32 regular states reached more than 30 percent, at the middle school level in 15, and at the high school level in 17 regular states (ibid., p. 30). The rates of proficiency improved in 31 regular states at the elementary level, in 32 at the middle school level and in 29 (out of approximately 43 states which provided data) at the high school level (ibid., p. 32).
3.2 Ways of inclusion

Basically, there are three ways of including students with disabilities in assessment:
– Letting them take part in the regular assessment
– Creating accommodated versions of the regular assessment
– Creating an alternate test

For children with severe disabilities, it is most common to create alternate tests. In other cases of rather moderate impairment (e.g. learning disabilities), the students can take part in the test with special accommodations.
3.2.1 Testing accommodations

According to Sireci, Scarpati and Li, an accommodation is an “. . . intentional change to the testing process designed to make the tests more accessible to SWD and consequently lead to more valid interpretations of their test scores” (Sireci/Scarpati/Li 2005, p. 460). In practice, there are many ways to obtain more valid interpretations of the test scores of SWD. The kind of accommodation that is indicated depends on the special needs of each individual student. The possibilities include the following:
– Providing additional time
– Providing a separate location where the student can work undisturbed
– Allowing more breaks
– Reading the test directions or items to the students
– Providing the test in Braille or large type
– Allowing the students to dictate their answers
– “Out-of-level” or “out-of-grade” testing (the student receives a test form actually used for a lower grade)
– Deleting some items from the test (Koretz/Barton 2003, p. 6)
It is also possible to combine items of the regular version, accommodated items and alternate items. Furthermore, there is the possibility of employing access assistants, who act as “intermediaries” between the children and the test. They can read the test items to the students, write down their responses or communicate with the students through sign language. Sometimes they also translate, turn pages, or transcribe or paraphrase the students’ answers. Clapper et al. give advice on the development of guidelines for these assistants (cf. Clapper et al. 2006).

In most cases it is necessary to make more than one accommodation, as one accommodation may require a further one (e.g. a deaf student who receives the test instructions in written form needs additional time, because it takes more time to read the instructions than to hear them) (cf. Koretz/Barton 2003, p. 22).
Of course, there is a heated discussion going on about the validity of all these adjustments. In particular, accommodations which have an impact on the basic construct of the test have caused controversial debate. Even in the U.S. there is a lack of research on the effects of accommodations on the validity of test scores, as Koretz and Barton observe (cf. Koretz/Barton 2003, p. 3; also Thurlow et al. 1996a). In their opinion, the main problems concerning test accommodations are the inhomogeneity of the group of SWD, the accurate and appropriate use of accommodations, construct-relevant disabilities and the design of the tests (danger of item or test bias) (cf. Koretz/Barton 2003).
3.2.2 Alternate tests

Alternate tests are usually based on the IEPs of the respective students and are an attempt to include students with disabilities who cannot participate in the general assessment system. Basically, one has to be aware that students with disabilities are not automatically assessed by alternate tests. Thurlow et al. point out that the majority of students with disabilities should participate in the regular assessment, be it with accommodations or without. An important criterion for deciding which kind of assessment a student should participate in is the goal the student is pursuing. If the student aims at achieving the same goals as students following the regular curriculum, the student should take part in the general assessment, albeit with accommodations. The most important advice is that the decision should not be based on expectations about the student’s performance (cf. Thurlow/Olsen/Elliott/Ysseldyke/Erickson/Ahearn 1996b).
Concerning the integration of the results of alternate tests, there is quite a lack of research. On the one hand, the results could be reported separately from those of the general assessment, which would make it possible to analyze the special education services. On the other hand, the attempt is made to avoid separation between students with and without disabilities, which would mean that testing results should be aggregated and combined (cf. Thurlow et al. 1996b). Once more it becomes clear that much more research is needed on this issue.
3.3 Consequences for students with disabilities in student assessment

Ysseldyke, Dennison and Nelson (2004) investigated the positive consequences of large-scale assessments for students with disabilities. The increased participation of SWD has led to higher expectations of their performance (from parents, teachers and the students themselves), which had usually been rather low. The students interviewed for this study even reported the impression that their teachers paid more attention to them and gave them more support. Furthermore, the participation of students with disabilities resulted in improved test instructions, teaching strategies and performances of the respective students. The chances for these children to graduate or obtain diplomas also improve, and the risk of dropping out of school decreases. Cooperation between IEP teachers and regular teachers is improved, and parents of students with disabilities seem to be more interested in the performance and development of their children (cf. Ysseldyke et al. 2004, p. 4ff). Ysseldyke et al. derived all these findings from extensive research in the literature and the media and from interviews with people involved in student assessment.
Ruth Nelson searched for positive as well as negative consequences and found that, due to the participation of students with disabilities, there is increased exposure to the curriculum, a consequence of intensified test preparation, extra tutoring, extra lessons, etc. However, the increased exposure also causes higher levels of stress, anxiety and frustration among students, as well as limited possibilities for choosing electives. Ysseldyke et al. (2004) as well as Nelson (2006) found that both participation and expectations have increased. However, Nelson could not find any empirical evidence for the assumed increase in referrals to special education or in the retention of students (cf. Nelson 2006).
3.4 Universally designed assessments or: “a more accessible assessment is always an option”2

2 Johnstone et al. 2006, p. 23

“Universally designed assessments” are the latest attempt to include as many students as possible in state assessment. This is a project of the NCEO, for which a guide was published in order to provide states and their responsible representatives with information and ideas about ways of including students with disabilities in assessment tests. Universal design demands accessibility for everyone, whether disabled, a non-native speaker of English, a migrant, or anything else. When an assessment test is designed universally, it respects the diversity of the population. It is characterised by concise and readable texts, clear formats and clear visuals, and it allows changes in format as long as they do not alter the meaning or the level of difficulty. It is stressed that such tests are intended neither to change the performance standards of assessments nor to make them easier for special groups. The ambitious aim of universal design is to create the “most valid assessment possible for the greatest number of students, including students with disabilities” (Johnstone et al. 2006, p. 1). In the guide provided by Johnstone et al., 10 steps are proposed for the best way of achieving a universally designed assessment. I will not describe each of these steps in detail, but offer an overview of the main features of the approach.
The main idea of the approach is to consider the diversity of the students from the very beginning (and not to adjust the tests afterwards to the special needs of some groups of students). For this reason, every item has to be checked during the conceptualization phase. Content which could give an unfair advantage or disadvantage to a certain group of students should be avoided (e.g. by using large font sizes and by avoiding unnecessary linguistic complexity when language is not what is being assessed). Every single item has to be checked as to whether it takes the diversity of the pupils into account (gender, age, ethnicity, socioeconomic status, region, disability, language). In order to avoid ceiling or floor effects, it is important to cover the full range of test performance, and an adequately sized item pool is needed in case items have to be eliminated. The authors are aware of the fact that this is a truly challenging and time-consuming procedure, but, as they argue, considering accessibility from the beginning can save time and effort later (cf. ibid., p. 6).
When all items have been constructed, they have to be reviewed by expert teams in the participating states. Members of several groups, such as language minorities, disability groups, scientists, teachers, etc., should be involved in this review process, and they should examine whether the test items give an advantage or disadvantage to a certain group of students. They are required to look at the response format and decide whether it is clear enough, to check whether the item really tests what it claims to test, and to check whether errors unrelated to the question could be induced (cf. ibid., p. 7). If the experts find any items problematic, these items are analyzed with the “Think Aloud Method” in order to find out whether they can be incorporated into the test or not, or whether they should be adjusted. After the field test, the items are analyzed statistically, in particular those which were conspicuous (cf. ibid., p. 13). When the final revision is done, the test can be carried out.
It has often been pointed out that it is not possible to create a test which is accessible to every single student. However, as mentioned, the goal is to make it as widely accessible as possible. Moreover, as the authors proclaim, all students participating in universally designed assessments benefit from having more accessible tests (cf. ibid., p. 2).
3.5 Emerging issues

Koretz and Barton summarize the most important topics which have to be considered when it comes to the inclusion of students with disabilities. At the same time, these issues represent the most important research gaps that need to be closed:
– First of all, students with disabilities have to be identified and classified. Comparisons have shown that figures on e.g. children with learning disabilities range from 3 to 9.1 percent. Therefore, it can be assumed that the lines are drawn rather differently and that the term “learning disability” does not necessarily mean the same thing to everybody (cf. also McGrew et al. 1993). Thus, identification and classification seem to be crucial for making an equitable assessment system possible.
– Appropriate use of accommodations: Accommodations, or “corrective lenses” (Koretz/Barton 2003, p. 7), are not only an important way to increase the inclusion of students with disabilities in assessment tests; they also tend to influence and bias the validity of the tests. Research on the validity of accommodations is still urgently needed.
– The problem of construct-relevant disabilities: An assessment test can be offered in Braille to blind students; this kind of accommodation does not affect the construct of the test. But when it comes to e.g. dyslexia, things become quite difficult. In this case, the student may not be able to understand the tasks, because most assessment tests are language-based. However, this does not mean that the student is unable to solve the task just because he cannot read it.
– Concerning test design, it is important to keep an eye on bias. The various assessment formats (such as multiple choice, open response, etc.) have different effects and consequences for different students, especially for those with disabilities.
(cf. Koretz/Barton 2003, p. 3ff)
Although the U.S. plays a leading role in including students with disabilities, there is still a huge lack of research concerning validity and alternate ways of test participation (cf. also Quenemoen et al. 2001, Thurlow et al. 1996a). As Koretz and Barton point out, the sheer inhomogeneity of the group of students with disabilities makes it tremendously difficult to create guidelines and prescriptions. Moreover, construct-relevant disabilities pose tough challenges for research (cf. Hales 2004, who shows that, and why, the common tests are not able to measure the skills and proficiency of students with dyslexia in an adequate way).
According to Koretz and Barton, the most important steps are an increased collection of data on the assessment participation of students with disabilities and further research on possible item bias, test bias and, of course, validity. To make comparisons possible, it is necessary to standardize the various definitions of disabilities and the participation conditions (cf. Koretz/Barton 2003, p. 23ff).
Finally, Ruth Nelson emphasizes the need to identify and limit unintended negative consequences for SWD in assessment tests (e.g., as mentioned above, increased anxiety, exposure to the curriculum, etc.) and to document them empirically. Avoiding these unintended consequences can be “life-changing” for the students concerned, because it gives them a fair chance to show their actual abilities (Nelson 2006, p. 34f).
4 The situation in Austria and the German-speaking countries

In 1995, Elliott, Shin, Thurlow and Ysseldyke searched the national education encyclopedias and yearbooks of 14 states worldwide to discover whether they reported facts and figures on the inclusion of students with disabilities in assessments. These states were the following: Argentina, Australia, Canada, Chile, China, England and Wales, France, Japan, Korea, the Netherlands, Nigeria, Sweden, Tunisia and the U.S. What they found was that, of these 14 states, only a few documented the inclusion of students with disabilities in assessment tests. Only the U.S. reports exact facts and figures, while some other states present a short description of their directives regarding the participation of students with disabilities (Canada, France and Korea mention that they allow accommodated tests for students with disabilities). Elliott et al. interpret their research findings as follows: there may be three possible reasons why states do not give any data about the inclusion of SWD. First, it could be that they exclude SWD arbitrarily; second, it is possible that data on disabled children are collected, but neither counted nor published; third, data could be collected and counted, but not published (cf. Elliott et al. 1995).
As in most of the states included in the study, no attention is paid in Austria to the topic of the inclusion of SWD in assessment. Since the 1990s, and since 2000 in particular, Austria has been taking part in several assessments (CIVED, FIMS/TIMSS, PISA, PIRLS), and it now has a national screening for testing reading abilities (Salzburger Lese-Screening). Although the official PISA test reports give some advice as to how to handle SWD, there is no real interest in how this advice is carried out in practice. As shown in Hörmann 2007, people who are involved in assessment testing processes do not consider the problem relevant or urgent. Eight interview partners (teachers, school directors, scientists, and an employee of the ministry of education responsible for international assessment) were asked in short qualitative interviews what they think about the problem, whether they have already had any experience with it and how they have reacted. The main interest of this research project was to investigate the extent to which these people have already been confronted with the problem, what they know and think about it and how they deal with it (cf. Hörmann 2007, p. 58ff).
The interviews reveal two ways of perceiving the problem: the administrative-organisational perspective and the perspective of the children concerned. The majority of the interview partners take the administrative-organisational perspective and do not regard the problem as important. Some interview partners who work closely with children with disabilities or disadvantaged children take the position of these students and call for solutions to the problem. All interviewees agree that students with e.g. learning disabilities need more support at school, but for most of them it is not obvious that these children would also need this support when taking an assessment test (cf. Hörmann 2007, p. 85f).
In my diploma thesis, I conducted a thorough literature search in order to find information on the way Austria copes with this problem. I asked scientists and institutions, but nobody could help me. Likewise, in all the German-speaking countries, I could barely find any relevant literature. As mentioned above, Wuttke published exclusion rates, and Elisabeth von Stechow dedicated a small chapter of her article to the exclusion rates of students with disabilities (cf. Stechow 2006, p. 22). Her book “Sonderpädagogik und PISA” (2006) seems to be the first German publication that responds, at least in part, to the problem of students with disabilities in assessments. Oser and Biedermann proclaim the necessity of a specific assessment for special education (their slogan: “PISA for the rest”) (cf. Oser/Biedermann 2006).
Nevertheless, it is obvious that there are children who do not fit the “norm”, who have special needs and who cannot cope with a conventional testing situation. This especially concerns students with learning disabilities, who never get the chance to show their real abilities, because they generally fail at reading the test items. Meyerhöfer speaks of “Testfähigkeit” (“test-taking ability”): every student has to develop a certain ability that enables her or him to cope with the testing situation, organize the time provided, read both quickly and carefully, and use clever strategies to find the right solution (Meyerhöfer 2005, p. 187, and in this book). Assessment tests as they are constructed at present are definitely not able to test students with disabilities in an adequate way. In addition, when nobody is interested in what happens to these students, they become invisible and disappear from public view.
5 Conclusion

This article sets out to raise awareness of the problems students with disabilities face in relation to student assessment. Research has revealed a lack of literature and discourse on this problem, which shows how unknown and unimportant it is to the people who actually work with assessment tests in their profession (cf. Hörmann 2007).
PISA seems to be no exception in this respect. By assuming that there exists an average student endowed with average skills, not only within one country but even across countries, it neglects by construction children deviating from that “norm”. As a result, these children are either excluded from the assessment or (if they get the chance to take part at all) are doomed to fail.
Confronted with this problem, people behind <strong>PISA</strong> play down the importance<br />
of the problem, the impact on the respective children and show no ambition<br />
<strong>to</strong> change the situation. It probably lies in the self-interest of the people<br />
involved in the construction and conduct of this study <strong>to</strong> marginalise the<br />
problem, since this critique does not just refer <strong>to</strong> certain aspects of <strong>PISA</strong> but<br />
questions primary assumptions and thereby shakes it <strong>to</strong> its foundations.<br />
Even though many people think that the problem of exclusion is irrelevant<br />
from a statistical point of view, I am not of the opinion that this is an argument
DISAPPEARING STUDENTS – PISA AND STUDENTS WITH DISABILITIES 171
against inclusion. Assessment tests are currently not able to test SWD in an adequate way, but this does not mean that there is no way to change the assessment tests in order to make them able to assess SWD (see 3.2 and 3.4). Results of assessments are often used for political decisions. If children with disabilities are excluded from assessment tests, they are also excluded from those political decisions, in spite of the fact that these decisions concern them just as much as all the other students. This means that SWD disappear once again. The state and society are responsible for the education of every single child, including children with disabilities, and every single child has the right to be part of society. In terms of the concept of inclusive education, it is a duty and a responsibility to accommodate the tests to the children, and not the children to the tests. Participating in assessments in an adequate and successful way can support the self-confidence and the performance of the respective students in a very positive manner.

Research and experience in the U.S. show interesting ways of including students with disabilities in assessment tests. Testing accommodations, alternate tests and “universally designed assessments” are new options to account for the diversity of people, although research is still in its early stages.

If PISA wants to move with the times, I suggest a revision of both the paradigm and the test construction in order to gain results that really represent the variety of students. In order to reach this goal, it is important to raise awareness of this problem and to start a discussion in public and among scientists. Only then will it be possible to think about ways of solving this problem.
Literature

Biewer, Gottfried (2005): “Inclusive Education”. Effektivitätssteigerung von Bildungsinstitutionen oder Verlust heilpädagogischer Standards?- In: Zeitschrift für Heilpädagogik, Jahrgang 56 (2005), Heft 3, S. 101-108

Elliott, J.L.; Shin, H.; Thurlow, M.L.; Ysseldyke, J.E. (1995): A perspective on education and assessment in other nations: Where are students with disabilities? (Synthesis Report No. 19).- Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Available at: http://education.umn.edu/NCEO/OnlinePubs/Synthesis19.html (15.6.2006)

Haider, Günter (2003): OECD/PISA – Programme for International Student Assessment (Kapitel 1.2 des nationalen Berichts von PISA 2003). Available at: http://www.pisa-austria.at/PISA2003_Kapitel_1_2_nationalerBericht.pdf (12.6.2007)

172 BERNADETTE HÖRMANN
Hales, Gerald (2004): Putting in nails with a spanner: the potential effect of using false data from language-rich tests to assess dyslexic people. Available at: http://www.bdainternationalconference.org/2004/presentations/sat_s3_d_7.shtml (27.4.2006)

Hopmann, Stefan Thomas (2006): Im Durchschnitt PISA oder Alles bleibt schlechter.- In: Criblez, Lucien; Gautschi, Peter u.a. (Hrsg.): Lehrpläne und Bildungsstandards. Was Schülerinnen und Schüler lernen sollen. Festschrift zum 65. Geburtstag von Prof. Dr. Rudolf Künzli.- Bern: hep-Verlag, S. 149-172

Hörmann, Bernadette (2007): Die Unsichtbaren in PISA, TIMSS & Co. Kinder mit Lernbehinderungen in nationalen und internationalen Schulleistungsstudien.- Wien (Diploma Thesis)

Jahnke, Thomas; Meyerhöfer, Wolfram (Hrsg.) (2006): PISA & Co. Kritik eines Programms.- Hildesheim, Berlin: Franzbecker

Johnstone, Christopher; Altman, Jason; Thurlow, Martha (2006): A state guide to the development of universally designed assessments.- Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.

Koretz, Daniel; Barton, Karen (2003): Assessing Students with Disabilities: Issues and Evidence (CSE Technical Report 587).- Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing. Available at: http://www.cse.ucla.edu/products/Reports/TR587.pdf (20.6.2007)

McGrew, K.; Algozzine, B.; Spiegel, A.; Thurlow, M.; Ysseldyke, J. (1993): The identification of people with disabilities in national databases: A failure to communicate (Technical Report No. 6).- Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Available at: http://education.umn.edu/NCEO/OnlinePubs/Technical6.html (22.1.2006)

Meyerhöfer, Wolfram (2005): Tests im Test. Das Beispiel PISA.- Opladen: Budrich

Nelson, J. Ruth (2006): High stakes graduation exams: The intended and unintended consequences of Minnesota's Basic Standards Tests for students with disabilities (Synthesis Report 62).- Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Available at: http://www.education.umn.edu/NCEO/OnlinePubs/Synthesis62/default.html
OECD (2005): PISA 2003 Technical Report.- Paris: OECD

Oser, Fritz; Biedermann, Horst (2006): PISA für den Rest: Lehr- und Lernbehinderung und ihre schulische Anstrengungslogik.- In: Vierteljahresschrift für Heilpädagogik und ihre Nachbargebiete, Jahrgang 75 (2006), Heft 1, S. 4-8

Posch, Peter; Altrichter, Herbert (1997): Möglichkeiten und Grenzen der Qualitätsevaluation und Qualitätsentwicklung im Schulwesen. Forschungsbericht des Bundesministeriums für Unterricht und kulturelle Angelegenheiten.- Innsbruck, Wien: Studien-Verlag (Bildungsforschung des Bundesministeriums für Unterricht und kulturelle Angelegenheiten; 12)

Quenemoen, R.F.; Lehr, C.A.; Thurlow, M.L.; Massanari, C.B. (2001): Students with disabilities in standards-based assessment and accountability systems: Emerging issues, strategies, and recommendations (Synthesis Report 37).- Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Available at: http://education.umn.edu/NCEO/OnlinePubs/Synthesis37.html (22.1.2006)

Sireci, Stephen G.; Scarpati, Stanley E.; Li, Shuhong (2005): Test Accommodations for Students With Disabilities: An Analysis of the Interaction Hypothesis.- In: Review of Educational Research, Vol. 75 (2005), No. 4, p. 457-490

Stechow, Elisabeth von (2006): Soziokulturelle Benachteiligung und Bewältigung von Heterogenität – Eine sonderpädagogische Antwort auf eine Empfehlung der KMK.- In: Stechow, Elisabeth von; Hofmann, Christiane (Hrsg.): Sonderpädagogik und PISA. Kritisch-konstruktive Beiträge.- Bad Heilbrunn: Klinkhardt

Thurlow, M.L.; Elliott, J.L.; Ysseldyke, J.E.; Erickson, R.N. (1996a): Questions and answers: Tough questions about accountability systems and students with disabilities (Synthesis Report No. 24).- Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Available at: http://education.umn.edu/NCEO/OnlinePubs/Synthesis24.html (22.1.2006)

Thurlow, M.; Olsen, K.; Elliott, J.; Ysseldyke, J.; Erickson, R.; Ahearn, E. (1996b): Alternate assessments for students with disabilities for students unable to participate in general large-scale assessments (Policy Directions No. 5).- Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Available at: http://education.umn.edu/NCEO/OnlinePubs/Policy5.html (22.1.2006)
Thurlow, Martha; Moen, Ross; Altman, Jason (2006): Annual Performance Report: 2003-2004. State Assessment Data.- National Center on Educational Outcomes. Available at: http://www.education.umn.edu/nceo/OnlinePubs/APR2003-04.pdf (13.5.2007)

UNESCO (1994): The Salamanca Statement and Framework for Action on Special Needs Education. World Conference on Special Needs Education: Access and Quality.- Salamanca, Spain: 7-10 June 1994. Available at: http://unesdoc.unesco.org/images/0009/000984/098427eo.pdf (18.6.2007)

Van Ackeren, Isabell (2005): Vom Daten- zum Informationsreichtum? Erfahrungen mit standardisierten Vergleichstests in ausgewählten Nachbarländern.- In: Pädagogik, Jahrgang 57 (2005), Heft 5, S. 24-28

Wuttke, Joachim (2006): Fehler, Verzerrungen, Unsicherheiten in der PISA-Auswertung.- In: Jahnke, Thomas; Meyerhöfer, Wolfram (Hrsg.): PISA & Co. Kritik eines Programms.- Hildesheim, Berlin: Franzbecker, S. 101-154

Ysseldyke, J.; Dennison, A.; Nelson, R. (2004): Large-scale assessment and accountability systems: Positive consequences for students with disabilities (Synthesis Report 51).- Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Available at: http://education.umn.edu/NCEO/OnlinePubs/Synthesis51.html (15.6.2006)
Identification of Group Differences Using PISA Scales – Considering Effects of Inhomogeneous Items

Peter Allerup
Denmark: University of Aarhus
Abstract:

PISA data have been available for analysis since the first PISA database was released from the PISA 2000 study. The two following PISA studies in 2003 and 2006 formed the basis of dynamic analyses besides the traditional cross-sectional type of analysis, where PISA performances in mathematics, science and reading are analysed in relation to student background variables. The aim of many analyses, carried out separately on the PISA 2000 and PISA 2003 data, has been to look for significant differences created by PISA performances for groups of students.

Few studies have, however, been directed towards the psychometric question as to whether the PISA scales are correctly measuring the reported differences. For example, could it be that reported sex differences in mathematics are partly due to the fact that the PISA mathematics scales are not measuring the girls and the boys in a uniform or homogeneous way? In other words, using the terms of modern IRT analyses (Item Response Theory), it is questioned whether the relative difficulty of the items is the same for girls and boys. The fact that item difficulties are not the same for girls and boys, a condition which is called item inhomogeneity, can be demonstrated to have an impact on the conclusions of comparisons of student groups, e.g. girls versus boys.

The present analyses address the problem of possible item inhomogeneity in PISA scales from 2000 and 2003, asking specifically if the PISA scale items are homogeneous across sex, ethnicity and the two points in time (2000 and 2003). This will be illustrated using items from all three PISA subjects: reading, mathematics and science. Main efforts will, however, be concentrated
176 PETER ALLERUP
on the subject of reading. The consequences of detected item inhomogeneities for the calculation of student PISA performances (measures of ability) are demonstrated, at the individual student level as well as at a general, average student level.
Inhomogeneous items and some consequences
In order to give a precise definition of item inhomogeneity, it is useful to refer to the general framework to which items, students and responses belong and in which their mutual interactions can be made operational. In fact, figure 1 displays the fundamental concepts behind many IRT (Item Response Theory) approaches to data analysis, the Rasch analysis in particular. The response a_vi from student No. v to item No. i takes the values a_vi = 0 for a non-correct and a_vi = 1 for a correct response.

The parameters σ_1, …, σ_k are latent measures of item difficulty, and θ_1, …, θ_n are the students' parameters carrying the information about student ability. These are the PISA student scores which are reported and compared internationally (or estimates thereof).

The definition of item homogeneity is now given by a manifestation of the fact that the responses ((a_vi)) are determined by a fixed set of item parameters given by the framework, valid for all students, and therefore for every subgrouping of the students. Actually, the probability of obtaining a correct response a_vi = 1 for student No. v to item No. i is given by the special IRT model, the so-called Rasch Model (Rasch, 1960), which calculates the chances of solving the tasks behind the items by referring to the same set of item parameters regardless of which student is considered.
Figure 1: The framework for analyzing item inhomogeneity in IRT models. Individual responses ((a_vi)), latent measures of item difficulty (σ_i), i = 1, …, k, student abilities (θ_v), v = 1, …, n, and student scores (r_v) recording the total number of correct responses across the k items.
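The framework in figure 1 can be sketched concretely as a small response matrix; the following Python sketch uses invented 0/1 responses purely for illustration:

```python
# Sketch of the figure 1 framework: a response matrix a[v][i] with
# 0/1 entries, student scores r_v as row sums and item scores as
# column sums. The responses are invented for illustration only.
a = [
    [1, 0, 1, 1],  # student 1
    [0, 1, 1, 0],  # student 2
    [1, 1, 0, 1],  # student 3
]

# student scores r_v: total number of correct responses per student
r = [sum(row) for row in a]

# item scores: total number of correct responses per item
item_scores = [sum(col) for col in zip(*a)]

print(r)            # [3, 2, 3]
print(item_scores)  # [2, 2, 2, 2]
```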
IDENTIFICATION OF GROUP DIFFERENCES USING PISA SCALES 177
The Rasch Model is the theoretical, psychometric reference for the validation of the PISA scales, and it has been the reference for scale verification and calibration in the IEA international comparative investigations, e.g. the reading literacy study RL (Elley, 1993), TIMSS (Beaton et al., 1998), CIVIC (Torney-Purta et al., 2000) and the NAEP assessments after 1984 in the USA.

Using this model it can e.g. be shown that a correct response a_vi = 1 to an item with item difficulty σ_i = 1.20, given by a student with θ_v = -0.5, takes place with probability P(a = 1) = 0.62, i.e. with a 62 % chance.

P(a_vi = 1) = exp(σ_i + θ_v) / (1 + exp(σ_i + θ_v))
A major reason for the wide applicability of the Rasch Model lies in the existence of the following three equivalent characterizations of item homogeneity, proved by Rasch (see e.g. Rasch, 1971; Allerup, 1994; Fischer and Molenaar, 1995) and given here in abbreviated form:

1. The student scores (and, by a parallel property, the item scores) are sufficient statistics for the latent student abilities θ_1, …, θ_n, viz. all information concerning θ_v is contained in the student score r_v.

2. The student abilities θ_1, …, θ_n can be calculated with the same result irrespective of which subset of items is used.

3. Data collected in the framework in figure 1 fit the Rasch Model, i.e. the model forms an adequate description of the variability of the observations ((a_vi)) in figure 1.

While Rasch often referred to these properties as the analytic means for 'specific objective comparisons', others have adopted the notion 'homogeneous' for the status of items when the conditions are met. The practical power behind this, seen from the point of view of theory of science, is that 'objective comparisons' is in casu a requirement which can be investigated empirically by means of simple statistical techniques, i.e. statistical tests of fit of the Rasch Model (cf. property 3). It is hence not a purely theoretical concept but rather one which requires empirical actions to be taken beyond the 'theoretical' thoughts invested, from the subject matter's point of view, into the construction of items.
By the characterization of item homogeneity, it follows that 'inhomogeneity', or 'inhomogeneous items', appears when items are not homogeneous, for example when different subsets of items give rise to different measures of student abilities. This is e.g. one of the risks which might appear in PISA when using rotation of booklets, where students who are responding to different item blocks must still be compared on the same PISA scale (cf. property 2). The present analyses will focus directly on possible violations of 'item homogeneity' by looking for indications of different sets of estimated item parameters assigned to different student groups, through the fit of the Rasch Model. In other words, it will be tested whether e.g. boys and girls are measured by the same set of item parameters. Two other criteria defining groups of students will be applied, these being 1) the year of testing, 2000 vs. 2003, and 2) ethnicity, Danish vs. non-Danish linguistic background. Especially in the subject of reading, the distinction by ethnicity is of interest, because different language competencies are expected to influence the understanding and, through this, the ability to reply correctly to the reading tasks.
The consequences of item inhomogeneity are diversified and can bring about serious implications, depending on the analytic view. In a PISA context, however, one specific kind of consequence attracts attention: How are comparisons carried out by means of student PISA scores affected by inhomogeneity? If boys and girls are in fact measured by two different scales, i.e. two sets of item parameters, will this influence conclusions carried out under the use of one common 'average' set of items? Will an interval of PISA points estimated to separate the average θ-level for Danish students from the non-Danish students be greater or smaller, if knowledge as to item inhomogeneity is introduced into the θ-calculations?

Such consequences can be exposed on the θ-scale either at the individual student level, using one item and one individual, or at the general level, using all students and all items.
The individual level is established in a simple way by calculating the individual change on the θ-scale which is mathematically needed to compensate for a given difference in the σ-parameter, under the assumption that a fixed probability for answering correctly is maintained. Suppose for instance that data from the boys are fitted to the Rasch Model with estimated item difficulty σ_1 = 0.40, and the same item gets an estimated difficulty σ_2 = 0.75 for the girls, a difference which can be tested to be significantly different from the first item parameter (Allerup, 1995 and 1997). Then a simple calculation under the Rasch Model shows that in order for a boy and a girl to obtain equal probabilities of answering this item correctly, the boy's θ-value must be adjusted by 0.75 - 0.40 = 0.35. This item is easier for the boy compared to the girl, even considering a boy and a girl with the same θ-value¹ who hence should be equally capable of answering the item correctly. In order to compensate for this scale-specific advantage, a boy 'should start lower' by subtracting 0.35 from his θ-value.² In a way it resembles the rules in golf, where the concept of 'handicap' plays a similar role in making comparisons between players more fair.
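The compensation argument can be checked numerically. The sketch below assumes the logistic Rasch form P = exp(σ + θ)/(1 + exp(σ + θ)), consistent with the worked probabilities elsewhere in this chapter; θ = -0.2 is an arbitrary illustrative ability value, while σ = 0.40 and σ = 0.75 are the estimates quoted in the text:

```python
import math

def rasch_p(sigma, theta):
    """P(correct) under the logistic Rasch form assumed here:
    exp(sigma + theta) / (1 + exp(sigma + theta))."""
    z = math.exp(sigma + theta)
    return z / (1 + z)

sigma_boys, sigma_girls = 0.40, 0.75   # estimates quoted in the text
theta = -0.2                           # arbitrary common ability value

# Shifting theta by sigma_girls - sigma_boys = 0.35 exactly compensates
# for the unequal item parameters (the 'handicap' idea):
p_girl = rasch_p(sigma_girls, theta)
p_boy_adjusted = rasch_p(sigma_boys, theta + (sigma_girls - sigma_boys))
print(abs(p_girl - p_boy_adjusted) < 1e-9)  # True
```

Because P depends on σ and θ only through their sum, the equality holds for any choice of θ, not just the value used here.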
When moving from the individual level to the comprehensive level, including all items, two simple methods are available. The first one is based on theoretical calculations, where expected scores are compared for fixed θ-values using the two sets of inhomogeneous item parameters³. The second approach is based on summing up the individual changes for all students as an average; it suffices to summarize all individual θ-changes within each group in question when using the set of item parameters specific for each group. A third strategy consists of first removing inhomogeneous items from the scale, then carrying out the statistical analyses by means of the remaining homogeneous items only, e.g. the estimation of the student PISA scores. Following this procedure, a 'true' difference between the groups will then be obtained. In a way this last procedure follows the traditional path of Rasch scale analysis, where the successive steps from field trials to the main study are paved by item analyses, correcting and eliminating inhomogeneous items step by step. As stated, the present analyses will focus on student groups defined by gender, year of investigation and ethnicity.
Data used

Data for these analyses were collected under different studies with no overlap. The standard PISA 2000 and 2003 data are representative samples, while the PISA Copenhagen data comprise all public schools in the community of Copenhagen, and PISA E is a sample specifically addressing the participation of ethnic students, which was therefore created from prior knowledge as to where this group of students attend school.

¹ Same θ-value means that they are considered to be identical in the framework.
² The analytic picture is slightly more complicated, because there are constraints (a normalization) on the σ-values: Σ_i σ_i = 1.00.
³ E(a_vi | σ_1, …, σ_k) as a function of θ; r_v = E(Σ_i a_vi | σ_1, …, σ_k), with conditional ML estimates of σ_1, …, σ_k inserted, provides the estimate of θ.
1. PISA 2000, N = 4209: 50 % girls, 50 % boys, 6 % ethnic students
2. PISA 2003, N = 4218: 51 % girls, 49 % boys, 7 % ethnic students
3. PISA E, N = 3652: 48 % girls, 52 % boys, 25 % ethnic students
4. PISA Copenhagen, N = 2202: 50 % girls, 50 % boys, 24 % ethnic students
In the three studies PISA 2000, E and Copenhagen, the same set of PISA instruments has been used, i.e. the same set of items organized in nine booklets has been rotated among the students. In PISA 2003 some of the items from the PISA 2000 study were reused, because items in common must be available for bridging between 2000 and 2003. According to the PISA cycles, every study has a special theme; in 2000 it was reading, and in 2003 it was mathematics. In these years the two subjects were accordingly especially heavily represented by many items. Because of this, the present analyses dealing with the 2003 data are undertaken mainly by means of items which are common to the two PISA studies 2000 and 2003.
Scaling PISA 2000 versus PISA 2003 in reading

One of the reasons for the interest in the PISA scaling procedures was the fact that the international PISA report from PISA 2003 comments upon the general change in the level of reading competencies between 2000 and 2003 in the following manner:

“However, mainly because of the inclusion of new countries in 2003, the overall OECD mean for reading literacy is now 494 score points and the standard deviation is 100 score points.” (PISA 2003, OECD)

It seems very unlikely that all students in the world, taught in more than 50 different school systems, should experience a common weakening of their reading capacities across three years amounting to 6 PISA points (from 500 to 494); a further explanation given in the Danish national report does not provide a convincing account of this significant drop of 6 PISA points:

“The general reading score for the OECD countries dropped from 500 to 494 points. This is influenced by the fact that two countries joined PISA between 2000 and 2003, contributing to the lower end, while the Netherlands lifts the average a bit. But, considering all countries, it looks like the reading score has dropped a bit.” (PISA 2003, ed. Mejding)
Could it be that the 6-point drop was the result of item inhomogeneities across 2000 and 2003? If this question, in full or in part, must be answered with a yes, one can still hope to conduct appropriate comparisons between student responses from 2000 and 2003. In fact, assuming that no other scale problem exists within each of the years 2000 and 2003, one can consider the two scales completely separately and apply statistical test equating techniques. The PISA 2000 reading scale has been compared to the IEA 1992 Reading Literacy scale using this technique, showing that these two scales – in spite of inhomogeneous items – are psychometrically parallel (Allerup, 2002).

PISA 2000 and PISA 2003 share 22 reading items, which are necessary for the analysis of homogeneity by means of the Rasch Model. The items are found in booklet No. 10 in PISA 2003 and booklet No. 4 in PISA 2000. Table 1 displays the (log) item difficulties, estimated under the simple one-dimensional Rasch Model⁴.
Item difficulties on the PISA θ-scale

item               σ_i(2000)  σ_i(2003)  difference   percent correct
                                                       2000    2003
 1  R055Q01_          1.27       1.23        -3.6
 2  R055Q02_         -0.66      -0.79       -11.7
 3  R055Q03_         -0.08      -0.21       -11.7
 4  R055Q05_          0.44       0.55         9.9
 5  R067Q01_          0.58       1.97       125.1      0.64    0.88
 6  R067Q04_         -0.29       0.88       105.3      0.43    0.71
 7  R067Q05_         -0.47       1.15       145.8      0.38    0.76
 8  R102Q05_         -0.86      -1.18       -28.8
 9  R102Q07_          1.73       1.41       -28.8
10  R102Q04A_        -1.34      -2.01       -60.3
11  R104Q01_          0.41       0.10       -27.9
12  R104Q02_         -0.31      -0.63       -28.8
13  R104Q05_         -0.40      -0.72       -28.8
14  R111Q01_         -0.99      -1.08        -8.1
15  R111Q02B_         0.04      -0.05        -8.1
16  R111Q06B_         1.51       1.66        13.5
17  R219Q02_          0.28       0.44        14.4
18  R219Q01E_         0.08       0.20        10.8
19  R220Q01_         -0.32      -0.82       -45.0      0.42    0.31
20  R220Q04_         -0.05      -0.60       -49.5      0.49    0.35
21  R220Q05_          0.83       0.33       -45.0      0.70    0.58
22  R220Q06_         -1.40      -1.82       -37.8      0.20    0.14

⁴ Conditional maximum likelihood estimates from p(((a_vi)) | (r_v)), conditional on the student scores (r_v), cf. figure 1.

Table 1: Rasch Model estimates of item difficulties σ_i for the two years of testing, 2000 and 2003, and θ-scale adjustments for unequal item difficulties.
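As a plausibility check, the 'difference' column in table 1 is consistent with converting the change in logit difficulty into PISA points with a factor of roughly 90 points per logit; this factor is inferred here from the tabled values themselves, not taken from PISA documentation:

```python
# Plausibility check on table 1: difference ≈ (sigma_2003 - sigma_2000) * 90.
# The factor 90 is inferred from the tabled values, not a documented constant.
rows = [
    ("R055Q01", 1.27, 1.23, -3.6),
    ("R067Q01", 0.58, 1.97, 125.1),
    ("R220Q04", -0.05, -0.60, -49.5),
]
checks = []
for code, s2000, s2003, diff in rows:
    implied = round((s2003 - s2000) * 90, 1)
    checks.append(implied == diff)
    print(code, implied, diff)

print(all(checks))  # True
```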
Several test statistics can be applied for testing the hypothesis that the item difficulties are equal across the years 2000 and 2003, both multivariate conditional tests (Andersen, 1973) and exact tests item-by-item (Allerup, 1997). The results all clearly reject the hypothesis and, consequently, the items are inhomogeneous across the years of testing 2000 and 2003.
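The conditional and exact tests cited above are not reproduced here, but the flavour of an item-by-item comparison can be illustrated with a generic Wald-type z statistic; the standard errors below are hypothetical, and only the difficulty estimates come from table 1:

```python
import math

def wald_z(est1, se1, est2, se2):
    """Generic two-sample z statistic for H0: est1 = est2.
    A simplified stand-in, not the conditional tests cited in the text."""
    return (est1 - est2) / math.sqrt(se1 ** 2 + se2 ** 2)

# Item R067Q05 from table 1 (difficulties -0.47 in 2000, 1.15 in 2003);
# the standard errors of 0.05 are hypothetical.
z = wald_z(-0.47, 0.05, 1.15, 0.05)
print(round(z, 2))  # a very large |z|, clearly rejecting equal difficulty
```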
A visual impression of how the two PISA scales are composed, with item difficulties as marks on two 'rulers', is displayed in figure 2. Items connected by vertical lines tend to be homogeneous, while oblique connecting lines indicate inhomogeneous items.
The last column in table 1 lists the consequences, at the individual student level, of the estimated item inhomogeneity, transformed to quantities measured on the ordinary PISA student scale, i.e. the θ-scale internationally calibrated to mean value 500 with standard deviation 100. As an example, the item R055Q01 changed its estimated difficulty from 1.27 in 2000 to 1.23 in 2003, a small decrease in relative difficulty of −3.6 θ-scale points. For an average student, i.e. one with PISA ability v = 0.00, this means that the chance of responding correctly to this item has changed from 0.78 to 0.77, a small drop of about 1 percentage point. This can be calculated from the Rasch Model; for an above-average student with v = 2.00 the change is from 0.963 to 0.962, a very minor change of magnitude 1 per mille. Table 1 shows how the consequences amount to considerable numbers of PISA points for some items, especially items R067 and R220, which are framed in the table. These items are the ones which distinguish themselves in figure 2 by non-vertical lines. The marginal percent correct, based on all booklets and students, is included in table 1 in order to give a familiar interpretation of the change from 2000 to 2003. It is tacitly assumed that the PISA items are accepted under tests of reliability.
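The probabilities quoted for R055Q01 can be reproduced directly from the Rasch model. One caveat: reproducing 0.78 at v = 0.00 requires the tabulated parameter to enter with a positive sign, i.e. to act as an 'easiness'; that sign convention is inferred from the reported numbers, not stated in the text:

```python
from math import exp

def p_correct(v, s):
    """Rasch probability of a correct response at ability v; the tabulated
    parameter s enters with a positive sign (acts as an 'easiness'), the
    convention that reproduces the probabilities quoted in the text."""
    return exp(v + s) / (1 + exp(v + s))

# R055Q01: parameter 1.27 in 2000, 1.23 in 2003
p_avg_2000 = p_correct(0.0, 1.27)   # about 0.78
p_avg_2003 = p_correct(0.0, 1.23)   # about 0.77
p_top_2000 = p_correct(2.0, 1.27)   # about 0.963
p_top_2003 = p_correct(2.0, 1.23)   # about 0.962
```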
The last column in table 1 indicates the advantage (difference > 0) or disadvantage (difference < 0) for the students tested in 2003.
IDENTIFICATION OF GROUP DIFFERENCES USING PISA SCALES 183
Figure 2: Estimated item difficulties σi(2000) and σi(2003) for PISA 2000 (lower line) and PISA 2003 (upper line). Estimates based on booklet 4 (PISA 2000) and booklet 10 (PISA 2003), using data from all countries.
The interpretation is that 2003 students are given 'free' PISA points as a result of the fact that the (relative) item difficulty has dropped between the years 2000 and 2003, and that this 'advantage' can be quantified in terms of 'compensations' on the θ-scale shown in the last column, which displays how much a student must change the PISA score in order to compensate for the change in difficulty of the item. This way of thinking is much like the reasoning behind the construction of the so-called item maps, which visualize both the distribution of item difficulties and the student abilities, anchored in predefined probabilities of a correct response.
Table 1 pictures item inhomogeneities, item by item, in reading; some items turn out to be (relatively) more difficult between 2000 and 2003, while others become easier between the two years. A comprehensive picture involving all single-item 'movements' and all students is more complicated to establish 5. The technique used in this case is to study the gap between expected score
5 Analyze the expected score E(avi | σ1, …, σk) as a function of v, with conditional ML estimates of σ1, …, σk inserted.
184 PETER ALLERUP
levels caused by the two item sets of (inhomogeneous) difficulties. By this it can be shown that the general effect is approximately 11 PISA points. In other words, the average PISA 2003 student experiences a 'loss' of approximately 11 PISA points, purely due to psychometric scale inhomogeneities. The official drop between 2000 and 2003 was, for Denmark, from 497 to 492, i.e. a drop of 5 points. In the light of scale-induced changes of magnitude minus 11 points, could this switch a disappointing conclusion to the contrary?
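The expected-score technique of footnote 5 can be sketched as follows. Apart from R055Q01's 1.27, the difficulty vectors below are hypothetical, and the 2003 vector is simply shifted by 0.12 logits to mimic a uniform scale drift:

```python
from math import exp

def p_correct(v, s):
    # Rasch success probability; s enters with a positive sign,
    # the convention matching the probabilities quoted in the text
    return exp(v + s) / (1 + exp(v + s))

def expected_score(v, items):
    return sum(p_correct(v, s) for s in items)

items_2000 = [1.27, 0.50, -0.30, -1.10, 0.90]   # mostly hypothetical values
items_2003 = [s - 0.12 for s in items_2000]     # uniform 0.12-logit drift

def compensation(v, old, new, lo=-6.0, hi=6.0):
    """Ability shift needed on the new item set to keep the expected score
    a student of ability v had on the old set (bisection search)."""
    target = expected_score(v, old)
    for _ in range(60):
        mid = (lo + hi) / 2
        if expected_score(mid, new) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2 - v
```

For a uniform drift the compensation is exactly the drift, 0.12 logits; at roughly 90 PISA points per logit (the factor implied by the tables) a drift of that size is of the same order as the 11 points quoted above.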
Scaling PISA 2003 in reading – gender and ethnicity
Whenever an analysis of item homogeneity is executed by using an external variable to define sub-groups, it is tacitly assumed that the Rasch model works within each group 6, i.e. that the items are homogeneous within each group.
                       PISA 2003                        PISA 2003
              item difficulties  θ-scale       item difficulties  θ-scale
    item      σi(girls) σi(boys) difference    σi(DK)  σi(ejDK)   difference
 1  R055Q01_    1.18     1.35      15.16         1.25    2.05       72.77
 2  R055Q02_   -0.71    -0.70       1.56        -1.13   -1.59      -41.76
 3  R055Q03_   -0.23    -0.02      18.80        -0.53   -0.83      -26.60
 4  R055Q05_    0.58     0.43     -13.32         0.20    0.28        7.38
 5  R067Q01_    1.04     1.11       5.96         2.38    2.05      -29.43
 6  R067Q04_    0.25     0.15      -8.83         0.08    0.01       -6.63
 7  R067Q05_    0.36     0.01     -30.83         0.25    0.42       16.06
 8  R102Q05_   -1.07    -0.91      14.44        -0.42   -0.24       15.51
 9  R102Q07_    1.55     1.66      10.31         2.01    0.91      -99.20
10  R102Q04A   -1.75    -1.51      21.34        -1.68   -1.82      -12.02
11  R104Q01_    0.38     0.20     -15.90         0.25   -0.12      -32.89
12  R104Q02_   -0.22    -0.68     -40.96        -0.75   -0.60       13.82
13  R104Q05_   -0.39    -0.69     -26.30        -0.93   -1.70      -69.25
14  R111Q01_   -1.28    -0.74      48.33        -1.02   -1.05       -2.26
15  R111Q02B    0.04    -0.01      -4.85        -0.77   -0.37       36.59
16  R111Q06B    1.59     1.57      -1.86         1.37    2.05       61.47
17  R219Q02_    0.32     0.40       7.05         0.37    1.29       82.81
18  R219Q01E    0.05     0.25      17.95         0.45    1.09       57.84
19  R220Q01_   -0.62    -0.45      15.31        -0.09   -0.37      -24.59
20  R220Q04_   -0.16    -0.43     -24.37        -0.97   -1.27      -26.68
21  R220Q05_    0.64     0.60      -3.69         0.47    0.74       23.62
22  R220Q06_   -1.55    -1.61      -5.28        -0.75   -0.94      -16.57
Table 2: Rasch Model estimates of item difficulties σi for girls and boys (international student responses) and for Danish (DK) and non-Danish ethnic students (ejDK) (Danish student responses) for PISA 2003; θ-scale adjustments for unequal item difficulties. All items from booklet No. 10.
6 By its nature, the likelihood ratio test statistic (Andersen, 1973) for item homogeneity across groups has as a prerequisite that item parameters exist within each group, i.e. that the Rasch model fits within each group.
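The θ-scale 'difference' columns in table 2 appear to be the logit differences rescaled linearly to PISA points. Fitting a least-squares slope through a handful of the gender rows recovers the implied conversion factor, roughly 89 points per logit; note that this factor is inferred from the tabulated numbers, not stated in the chapter:

```python
rows = [  # (sigma_girls, sigma_boys, theta_scale_difference) from table 2
    (1.18, 1.35, 15.16),     # R055Q01_
    (-0.23, -0.02, 18.80),   # R055Q03_
    (0.58, 0.43, -13.32),    # R055Q05_
    (0.36, 0.01, -30.83),    # R067Q05_
    (-0.22, -0.68, -40.96),  # R104Q02_
    (-1.28, -0.74, 48.33),   # R111Q01_
    (0.05, 0.25, 17.95),     # R219Q01E
    (-0.16, -0.43, -24.37),  # R220Q04_
]
# least-squares slope through the origin: PISA points per logit
num = sum((b - g) * d for g, b, d in rows)
den = sum((b - g) ** 2 for g, b, d in rows)
points_per_logit = num / den  # about 89
```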
Within the PISA 2003 data, a repetition of the statistical tests presented in the previous section for homogeneity across 2000–2003 has been undertaken across gender and ethnicity. While the international data were used for the gender analysis, only data from Denmark have been used for the ethnic grouping. This leads to table 2.
The numerical indications in table 2 regarding the degree of inhomogeneity can be illustrated in the same fashion as in figure 2, here presented as figure 3. Perfect homogeneity across the two external criteria, gender and ethnicity, would appear as perfectly vertical lines in the figures.
Figure 3: Estimated item difficulties for 22 reading items, Danish students (lower line) and non-Danish students (upper line), left part. Danish PISA 2003 data. Estimated item difficulties for 22 reading items, girls (lower line) and boys (upper line), right part. International PISA 2003 data.
Although the figures in table 2 and the graphs in figure 3 give the impression that ethnicity creates the largest degree of inhomogeneity, the contrary is, in fact, true. The explanation is that the statistical tests for homogeneity across ethnicity are based on the Danish PISA 2003 set, booklet No. 10, comprising only 325 valid student responses, which provides little power behind the tests. Again, both simultaneous multivariate conditional tests (Andersen, 1973) and exact item-by-item tests (Allerup, 1997) have been applied. While the test statistics strongly reject the homogeneity hypothesis across gender, only weaker signs of inhomogeneity are indicated across ethnicity.
Reading the crude deviations from table 2 points, for example, to items R104 and R067 favouring girls and to R111 and R102 favouring boys. Likewise, items R102 and R104 constitute challenges which favour Danish students, while
items R219 and R055 seem to favour ethnic students. The details behind these suggested inhomogeneities, e.g. for assessing didactic interpretations of the deviations, can be evaluated through a closer look at the relation between the observed and expected numbers of responses in specific score groups. 7
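The observed-versus-expected comparison of footnote 7 can be sketched with the standard recurrence for the symmetric functions γr. Here eps[j] denotes the exponentiated parameter of item j; the sanity check with equal items, where the conditional probability must be r/k, is independent of the sign convention:

```python
def elementary_symmetric(eps):
    """g[r] = elementary symmetric function of order r in the values eps."""
    g = [1.0] + [0.0] * len(eps)
    for n, e in enumerate(eps, start=1):
        for r in range(n, 0, -1):
            g[r] += e * g[r - 1]
    return g

def cond_prob(i, r, eps):
    """Conditional probability of a correct response to item i given raw
    score r, under the Rasch model."""
    g_all = elementary_symmetric(eps)
    g_rest = elementary_symmetric(eps[:i] + eps[i + 1:])
    return eps[i] * g_rest[r - 1] / g_all[r]
```

For four equally difficult items, cond_prob(0, 2, [1.0]*4) returns 0.5 = r/k, as it must; the expected count compared in footnote 7 is then nr times this probability.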
If the displayed inhomogeneities in table 2 are accumulated in the same way as in the PISA 2000 vs. 2003 analysis, it can be shown that poorly performing girls get a scale-specific advantage of magnitude 8–10 PISA points, which is reduced to approximately 1–2 points for high performing girls. A similar accumulation for the analysis across ethnicity shows that low performing Danish students (around 30 % correct responses) get a scale-specific advantage of approximately 12 PISA points, while very low or very high performing students do not get any 'free' scale points because of inhomogeneity.
Scaling PISA 2000 in reading – ethnicity
PISA 2000 data offer an excellent opportunity to study what happens if the reading of Danish students is compared with that of the ethnic students in Denmark. Before any didactic explanations can be discussed, a first approach to recognizing possible inhomogeneity is achieved by comparing the relative item difficulties for the two groups. As said, both the ordinary PISA 2000 study and the two special studies (PISA Ethnic and PISA Copenhagen) have been run on the PISA 2000 instruments, bringing the total number of student responses to approximately 10,000, 17 % of which come from ethnic students.
                       PISA 2000                   PISA
Item        αi   cat   σi(DK)  σi(ejDK)   θ-scale difference   booklet
R055Q01    1.21   1     1.30     1.13         -15.39              2
R055Q03    1.17   1    -0.59    -1.22         -56.88              2
R061Q01    0.91   0    -0.37    -0.17          17.59              6
R076Q03    0.86   1     0.20     0.78          52.69              4
R076Q04    0.80   1    -0.67     0.22          79.94              4
R076Q05    1.08   0    -0.87    -0.63          21.96              4
R076Q05    1.15   0     0.02     0.12           8.73              5
R077Q04    0.72   1     0.67     0.77           9.32              8
R081Q05    1.00   0     0.23     0.34           9.53              1
7 Compare ai(r) – the observed number of correct responses to item No. i in score group r – with nr·πi(r) – the expected number, where nr is the number of students in score group r and πi(r) is the conditional probability of a correct response to item No. i in score group r (depending on σi and the so-called symmetric functions of σ1, …, σk only).
Item        αi   cat   σi(DK)  σi(ejDK)   θ-scale difference   booklet
R083Q06    1.14   0    -0.93    -0.66          24.48              5
R086Q05    1.64   1     1.93     1.36         -51.97              1
R086Q05    1.54   1     1.89     1.05         -75.48              3
R086Q05    1.21   1     2.31     1.65         -58.93              4
R091Q06    0.72   1     1.50     1.51           1.11              3
R100Q06    1.35   1     1.35     0.85         -44.77              3
R100Q06    1.31   1     1.56     0.49         -96.48              6
R101Q02    1.36   1     1.53     0.87         -58.99              5
R104Q01    1.06   1     0.64     0.90          23.86              5
R104Q01    1.46   0     2.64     2.11         -47.50              6
R110Q06    0.98   0     1.03     1.11           7.42              7
R111Q06B   1.35   1    -0.45    -1.41         -86.05              4
R119Q06    0.70   1     1.21     1.18          -2.15              3
R120Q01    1.20   1     0.63     0.67           2.99              4
R120Q01    1.47   1     0.62     0.15         -42.32              6
R120Q07T   1.32   1     0.69     0.19         -44.98              4
R219Q02    0.75   1     0.56     0.96          35.43              1
R220Q02B   1.17   1     0.49    -0.06         -49.78              4
R220Q06    0.87   0     1.20     0.96         -21.55              7
R227Q04    1.53   1    -0.46    -0.88         -37.47              3
R234Q01    1.16   0     1.40     1.38          -1.52              1
R234Q02    1.24   1    -2.16    -2.04          10.88              1
R234Q02    0.95   1    -2.04    -1.74          26.81              2
R241Q02    0.81   1    -0.70    -0.34          32.56              2
Table 3: Rasch Model estimates of significantly inhomogeneous items across ethnicity; item difficulties σi for Danish (DK) and non-Danish (ejDK) ethnic students (N = 10,063 student responses); θ-scale adjustments for unequal item difficulties under the simple Rasch Model. Cat=1 indicates significant item discrimination (αi ≠ 1.00).
In this section analyses are based on 140 reading items from all nine booklets, each containing around 40 items, organized with overlap in a rotation system across the booklets. This yields about 1,100 student responses per booklet. Using these PISA 2000 data, the statistical tests for homogeneity across the two student groups defined by ethnicity (DK and ejDK) may once more be applied. Both multivariate conditional tests (Andersen, 1973) and exact item-by-item tests (Allerup, 1997) were applied. The results clearly reject the hypothesis of homogeneity and, consequently, the items are inhomogeneous across the two ethnic student groups.
Because of the amount of data available, statistical tests for proper Rasch model item discriminations (Allerup, 1994) have also been included; if these are significant, i.e. if the hypothesis αi = 1.00 must be rejected, this can be taken as an indication of the validity of the so-called two-parameter Rasch model (Lord and
Novick, 1968) 8. Other, more orthodox views would claim that the basic properties behind 'objective comparisons' are then violated because of intersecting ICC curves (Item Characteristic Curves). Hence, this would be taken as just another sign of item inhomogeneity. Table 3 lists all items (among the 140 items in total) found to be inhomogeneous in the predefined setting with unequal item difficulties only. Items with significant item discriminations are marked with cat=1. Some items appear twice because of the rotation, which allows items to be used in several different booklets.
The combination of high item discrimination and the existence of two slightly different θ-groups, which are compared on the general average level, can cause serious effects. Since the ethnic student group is expected generally to perform lower than the Danish group, it could be that one item with high item discrimination acts like a 'separator', in the item-map sense, between the two θ-groups. This situation will artificially decrease the probability of a correct response for students in the lower latent θ-group while, at the opposite end, students from the upper θ-group will artificially enjoy enhanced probabilities of responding correctly. In a way, this phenomenon of high item discrimination tends to punish the poor students and disproportionately reward the high performing students.
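The 'separator' effect can be made concrete with the two-parameter model of footnote 8; the group abilities and the item below are hypothetical, with the item located between the two groups:

```python
from math import exp

def p2(v, s, alpha):
    """Two-parameter model: success probability at ability v for an item
    with parameter s (same sign convention as before) and discrimination
    alpha."""
    return exp(alpha * (v + s)) / (1 + exp(alpha * (v + s)))

low, high, item = -1.0, 1.0, 0.0  # hypothetical groups and item location
# raising alpha from 1.0 to 2.5 hurts the low group and helps the high group
drop_low = p2(low, item, 1.0) - p2(low, item, 2.5)     # positive
gain_high = p2(high, item, 2.5) - p2(high, item, 1.0)  # positive
```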
From table 3 it can be read that, for example, item R055Q03 is (relatively) more difficult for the ethnic students compared with the Danish students. In terms of compensation on the PISA θ-scale, this means that an ethnic student experiences a PISA scale induced loss of 56.88 points. In other words, an ethnic student must be 56.88 scale points ahead of his Danish classmate if they are to have equal chances of responding correctly to the item. A Danish student with a PISA score equal to, say, 475 has the same probability of a correct response as an ethnic student with PISA score 475 + 56.88 = 531.88.
It is an interesting feature of table 3 that more than 60 % of these ethnic-significant items are administered in a multiple choice format (i.e. closed response categories, MC), while only 19 % belong to this category in the full PISA 2000 set-up. This is surprising, because an open response would be expected to call for deeper insight into linguistic details of the formulation of the reading problem, compared to just ticking a predefined box in the MC format.
The item R076Q04 is an MC item under the caption "retrieving information", where the students examine the flight schedule of Iran Air. This item is
8 The two-parameter model with item discrimination αi is P(a = 1) = exp(αi(σi + v)) / (1 + exp(αi(σi + v))).
solved far better by the ethnic students than by the Danish students, because the item does not really contain complicated text at all, just numbers and figures listed in schematic form. Contrary to this example, item R100Q06 (MC) contains long and twisted Danish text, and the caption for the item is "interpreting", which aims at reading 'behind the lines'; only if the interpretation is correct is the complete response considered correct.
In this example from reading, the accumulated effect of the individual item inhomogeneities is evaluated using a different technique from the previous sections. In fact, the more traditional step-by-step method is now applied, in which inhomogeneous items are removed before re-estimation of the PISA score takes place. The gap between Danish and ethnic students can then be studied before and after removal of the inhomogeneous items.
From the joint data of PISA 2000, PISA Copenhagen and PISA E one gets the crude differences:
[Table: number of student responses (N) and average PISA θ-score for Danish and non-Danish students, together with their difference; the numerical entries were not recoverable.]
The crude average difference amounts to 90.54 PISA points. Since the items are spread over nine booklets, it is of interest to judge the accumulated effect for each booklet. At the same time, this provides an opportunity to check one of the implications of the three equivalent characterizations of the Rasch model 9, viz. that one should get almost the same picture irrespective of which booklet is investigated.
[Body of table 4: booklet-by-booklet θ-score differences between Danish and non-Danish students, for all items and for homogeneous items only; the numerical entries, including the total row, were not recoverable.]
9 Student abilities v1, …, vn can be calculated with the same result irrespective of which subset of items is used.
Table 4: Average differences between Danish and non-Danish students, calculated under two scenarios: (1) all items and (2) homogeneous items only, i.e. items for which σi(DK) and σi(ejDK) do not differ significantly.
Item        αi   cat   σi(2000)  σi(2003)   θ-scale difference   booklets
M155Q04T   0.86   0     -0.37      0.27           58.11
S114Q03T   1.74   1      0.57      0.63            6.18            2 8
S114Q04T   1.63   1      0.28      0.30            1.68
S114Q05T   1.11   0     -1.21     -1.54          -29.37
S128Q01_   0.94   0      0.53      0.70           14.91
S128Q02_   0.78   1     -0.16     -0.31          -14.08
S128Q03T   0.80   1      0.35      0.25           -8.60
S131Q02T   1.43   1      0.00     -0.20          -18.26
S131Q04T   1.60   1     -1.62     -1.54            7.62
S133Q01_   0.91   0      0.60      0.95           31.93
S133Q03_   0.51   1     -0.61     -0.70           -8.20
S133Q04T   0.56   1      0.23      0.18           -5.04
S213Q02_   0.88   0      1.21      1.19           -1.61
S213Q01T   1.21   1     -0.17      0.09           22.84
Table 5: Rasch Model estimates of item difficulties σi(2000) and σi(2003) for math items shared by PISA 2000 and PISA 2003 in four booklets; θ-scale adjustments for unequal item difficulties under the simple Rasch Model. Cat=1 indicates significant item discrimination (αi ≠ 1.00).
The test statistics applied earlier are again brought into operation, testing the hypothesis that the item difficulties for the years 2000 and 2003 are equal. In fact, both multivariate conditional tests (Andersen, 1973) and exact item-by-item tests (Allerup, 1997) were used. The results of the estimation are presented in table 5, together with an evaluation of the item discriminations αi.
The results for mathematics show that the hypothesis must be rejected and, consequently, the items presented in table 5 are inhomogeneous across the years of testing 2000 and 2003. Item M155Q04T is an item which, systematically for all score levels, seems to have become easier between 2000 and 2003; in more familiar terms, a rise is seen from 64 % correct responses to 75 %, calculated for all students.
The results for science seem to be in accordance with the expectations behind the PISA scaling. In fact, the multivariate conditional test and the exact tests for single items do not reject the hypothesis of equal item difficulties across the test years 2000 and 2003. Since only a very few item groups from four booklets have been investigated, no attempt at calculating accumulated effects for larger groups of students and items will be made.
Scaling PISA 2000 and 2003 in mathematics
In view of the fact that the tests for homogeneity across 2000 and 2003 failed in mathematics, it could be of interest to investigate scale properties within each of the two years. Using booklet No. 5 (the same booklet number in 2000 and 2003), around 400 student responses are available for an analysis of homogeneity across gender. Table 6 displays the estimates of item difficulties for the seven math items shared in 2000 and 2003 in booklet No. 5, together with the estimated item discriminations and an evaluation of the item discrimination αi in relation to the Rasch model requirement αi = 1.00.
                                PISA                 PISA θ-scale
Item        αi   cat   σi(girls)  σi(boys)   difference   booklet
2000:
M150Q01_   0.86   0      0.36       0.57       -18.88        5
M150Q02T   1.23   0      3.24       2.67        51.03
M150Q03T   1.00   0     -0.54      -0.71        14.79
M155Q01_   0.91   0      0.06      -0.38        39.20
M155Q02T   1.20   0      0.36       0.63       -24.46
M155Q03T   1.61   0     -3.06      -2.47       -53.63
M155Q04T   0.85   0     -0.42      -0.33        -8.04
2003:
M150Q01_   0.75   0     -0.27       0.63       -81.63        5
M150Q02T   0.95   0      2.40       3.38       -88.71
M150Q03T   0.99   0     -0.52      -1.09        51.06
M155Q01_   1.26   0      0.12      -0.15        24.37
M155Q02T   1.09   0      0.42      -0.07        43.28
M155Q03T   1.45   0     -2.41      -2.99        51.92
M155Q04T   0.85   0      0.27       0.27        -0.29
Table 6: Rasch Model estimates of item difficulties σi(girls) and σi(boys) for math items in PISA 2000 and PISA 2003, using two booklets; θ-scale adjustments for unequal item difficulties under the simple Rasch Model. Cat=1 indicates significant item discrimination (αi ≠ 1.00).
The statistical methods for testing the hypothesis of equal difficulties for girls and boys are brought into operation again. Both multivariate conditional tests (Andersen, 1973) and exact item-by-item tests (Allerup, 1997) were used. Behind the estimates presented in table 6 lies the information that the gender-specific homogeneity hypothesis must clearly be rejected in the data from PISA 2003, while the picture is less distinct for PISA 2000 (significance
probability p = 0.08 for the simultaneous test). Consequently, in PISA 2003 the seven items presented in table 6 are inhomogeneous across gender. In particular, item No. 2, M155Q02T, is an item which changes position from favouring the girls in PISA 2000 (98 % correct for girls vs. 96 % correct for boys) to the contrasting role of favouring the boys in PISA 2003 (96 % correct for girls vs. 97 % correct for boys). In terms of log-odds ratio, this is a change from 1.14 as the relative 'distance' between girls and boys in PISA 2000 to −0.29 in PISA 2003. In PISA 2003 the items M150Q01, M150Q03T and M155Q03T attract attention, also because of the large transformed consequences on the θ-scale. However, the only item showing significant gender bias according to exact tests for single items is M150Q01.
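The quoted log-odds ratios can be checked against the percent-correct figures. The 2003 value reproduces from the rounded percentages; the 2000 value of 1.14 evidently rests on unrounded percentages, since the rounded 98 %/96 % give about 0.71:

```python
from math import log

def log_odds_ratio(p_girls, p_boys):
    """Log-odds 'distance' between two proportion-correct figures."""
    return log(p_girls / (1 - p_girls)) - log(p_boys / (1 - p_boys))

lor_2003 = log_odds_ratio(0.96, 0.97)  # about -0.30, close to the quoted -0.29
lor_2000 = log_odds_ratio(0.98, 0.96)  # about 0.71 from the rounded figures
```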
As stated in the three equivalent characterizations of item homogeneity, rejecting the hypothesis of homogeneous items means that information about the students' ability to solve the tasks is not accessible through the raw scores, i.e. the total number of correct responses across items. The student raw score is not a sufficient statistic for the ability v, or: the PISA scale score does not measure the students' competencies in solving the items; these are two other ways of describing the situation under the caption 'inhomogeneous items'. On the other hand, this does not exclude the PISA analyst from obtaining another kind of information from the responses with respect to comparing students by means of the PISA items.
With regard to the two items M150Q02 and M150Q03 above, it has been demonstrated (Allerup et al., 2005) how information from these two open-ended 10 items can be handled as profiles. In this approach, all combinations of responses to the two items are considered, and the analysis of group differences takes place using these profiles as 'units'. In principle, every combination of responses from simultaneous items entering such profiles must be labelled prior to the analysis in order to be able to interpret differences found by way of the profiles. If the number of items exceeds, say, ten, with two response levels on each item, this would require approximately 1,000 different labels! In general, this is far too many profiles for assigning different interpretations, and the profile method is, consequently, not suited for analyses built on a large number of items.
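The arithmetic behind the 'about 1,000 labels' remark is simply the number of response combinations, 2^k for k dichotomous items:

```python
from itertools import product

# every profile is one combination of 0/1 responses and would need its own
# substantive label before profile-based group comparisons are possible
profiles_two_items = list(product((0, 1), repeat=2))   # 4 profiles
profiles_ten_items = list(product((0, 1), repeat=10))  # 1024 profiles
```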
One consequence of accepting an item as part of a scale for further analyses, in spite of the fact that the item was found to be inhomogeneous across
10 An item which requires a written answer, not a multiple choice item. The response is later rated and scored as correct or non-correct.
gender, can be illustrated by the reports from the international TIMSS study of 1995 (Beaton et al., 1998), operated 11 by the IEA. In this study a general difference was found in mathematics performance between girls and boys, showing that in practically all participating countries boys performed better than girls. Although this conclusion contrasted greatly with experience obtained nationally in many countries, the TIMSS result was generally accepted as fact. The TIMSS study was at that time designed with rotated booklets, as in PISA, but without using item blocks. Instead, a fixed set of six math items and six science items was part of every booklet as a fixed reference for bridging between the booklets.
Unfortunately, it turned out that one of the six math reference items 12 was strongly inhomogeneous (Allerup, 2002). The girls were actually 'punished' by this item, and even very highly performing female students, rated on the basis of their responses to the other items, responded incorrectly to this particular item. This could be confirmed by analysing data from all participating countries, providing high statistical power for the tests of homogeneity.
Scaling PISA 2000 – 'not reached' items in reading
'Not reached' items are the same as 'not attempted' items and constitute a special kind of item which deserves attention in studies like PISA. They are usually found at the end of a booklet, because the students read the booklet from page 1 and try to solve the tasks in the order in which they appear. In the international versions of the final database, the 'not reached' items are marked by a special missing-symbol to distinguish them from omitted items, which are items where neighbouring items to the right have obviously been attempted.
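The distinction can be operationalized as follows: a trailing run of missing responses counts as 'not reached', while a missing response with attempted items to its right counts as omitted. The encoding of missing responses as None is an assumption of this sketch:

```python
def classify_missing(responses):
    """Split the indices of missing responses (None) in one booklet into
    'omitted' (an attempted item follows somewhere to the right) and
    'not reached' (the trailing run of missing items)."""
    last = max((i for i, r in enumerate(responses) if r is not None),
               default=-1)
    omitted = [i for i, r in enumerate(responses[:last + 1]) if r is None]
    not_reached = list(range(last + 1, len(responses)))
    return omitted, not_reached
```

For a booklet scored [1, None, 0, None, None] the function returns ([1], [3, 4]): item 1 is omitted (and scored 'non-correct'), while items 3 and 4 are 'not reached'.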
It is ordinary testing practice to present several tasks to the student, properly adjusted to the complete testing time, e.g. two lessons in the case of PISA. This is a widespread practice, with exceptions seen in Nordic testing traditions. Many tests are thereby constructed in such a way as to make it possible to judge two separate aspects: proficiency and speed. In reading, it is considered crucial for relevant teaching that the teacher gets information about the students' proficiency both in terms of 'correctness' and in terms of reading speed. In order for the last factor to be measurable, one usually needs to have
11 IEA, the International Association for the Evaluation of Educational Achievement.
12 A math item aiming at testing the students' knowledge of proportionality, but presented in a linguistic form which was misunderstood by the girls.
a test which discriminates between students with respect to being able to reach all items, viz. one whose length exceeds the capacity of some students while being easy to complete for other students.
While everybody seems to agree on the statistical treatment of omitted items (they are simply scored as 'non-correct'), there have been discussions as to how to treat 'not reached' items. These take place from two distinct points of view: one dealing with scaling problems and one dealing with the problem of assigning justifiable PISA scores to the students.
One of the virtues of linking scale properties to the analysis of Rasch homogeneity is found in the second characterization above of item homogeneity, viz. that “the student abilities θ1, …, θn can be calculated with the same result, irrespective of which subset of items is used”. This strong requirement, which in PISA ensures that responses from different booklets can be compared, irrespective of which items are included, in principle also paves the road for unproblematic comparisons between students who have completed all items in a booklet and students who have not. At any rate, seen from a technical point of view, the existence of ‘not reached’ items therefore does not pose a problem for the estimation of the student scores, because the quoted fundamental property of homogeneity has been tested in a pilot study prior to the main study, and all items included in the main study are consequently expected to enjoy this property. In the IEA reading literacy study (Elley, 1992) the discussion about which student Rasch score to choose, the one based on the “attempted items” (considering ‘not reached’ items as ‘non-existing’) or the one considering ‘not reached’ items as ‘non-correct’ responses, was never resolved, and both estimates were published. In subsequent IEA studies and in the PISA cycles to date, ‘not reached’ items have been considered ‘non-correct’.
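The two scoring conventions can be sketched as follows; the response vector and the function are invented for illustration (a minimal sketch, not PISA's actual scoring code):

```python
# Sketch of the two conventions for scoring 'not reached' items.
# A response is 1 (correct), 0 (incorrect), "omit" (skipped, but later
# items were attempted) or "nr" (not reached).

def raw_score(responses, not_reached_as_incorrect=True):
    """Return (raw score, number of scored items) under one convention.

    Omitted items are always scored as incorrect; 'not reached' items
    are either scored as incorrect (current PISA practice) or dropped
    from the scored set entirely (treated as 'non-existing').
    """
    score = 0
    scored = 0
    for r in responses:
        if r == "nr":
            if not_reached_as_incorrect:
                scored += 1          # counts as an incorrect attempt
            continue                 # otherwise treated as non-existing
        scored += 1
        if r == 1:
            score += 1
        # r == 0 or r == "omit": scored, nothing added
    return score, scored

booklet = [1, 1, 0, "omit", 1, "nr", "nr"]
print(raw_score(booklet, True))   # (3, 7): 'not reached' scored incorrect
print(raw_score(booklet, False))  # (3, 5): 'not reached' dropped
```

The raw score itself is identical; what changes is the number of items the student is considered to have been measured on, which is exactly what matters for the Rasch estimation.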
The second problem mentioned is the influence the ‘not reached’ items have on the statistical tests for homogeneity, an analytical phase which is undertaken prior to the estimation of the student abilities θ1, …, θn. The immediate question here is whether different handling of the ‘not reached’ item responses could lead to different results as to the acceptance of the homogeneity hypothesis. The immediate answer is that it does matter how ‘not reached’ item responses are scored, ‘not attempted’ or ‘non-correct’. The technical details will not be discussed here, but one important point is the type of estimation technique applied for the item parameters σ1, …, σk.13
13 Marginal estimation with or without a prior distribution on the student scores θ1, …, θn, or
196 PETER ALLERUP
Booklet   PISA 2000   Cop    Ethnic
1         0.02        0.02   0.01
2         0.00        0.00   0.00
3         0.01        0.01   0.00
4         0.00        0.00   0.00
5         0.00        0.00   0.01
6         0.00        0.00   0.01
7         0.01        0.02   0.03
8         0.02        0.02   0.05
9         0.05        0.07   0.17

Table 7: Frequency of ‘not reached’ items in three studies using PISA 2000 instruments: ordinary PISA 2000, the Copenhagen (Cop) study and the Ethnic special study.
In PISA 2000, with reading as the main theme, the ‘not reached’ problem was not a significant issue. Table 7 displays the frequency of ‘not reached’ items in the main study PISA 2000. It can be read from the table that the level of ‘not reached’ varies greatly across booklets, with a maximum of 5 % for booklet No. 9. Looking at the Copenhagen study and the special Ethnic study, it is, however, clear that the ‘not reached’ problem is probably most critical for students with an ethnic minority background. In fact, using all N=10063 observations in the combined data from table 7, it can be shown that the average frequency of ‘not reached’ is 1.6 % for Danish students and 4.3 % for ethnic minority students. For the ethnic minority group it can furthermore be shown that the frequency of ‘not reached’ reaches a maximum of 17 % in booklet No. 9.
Before conclusions are drawn as to the evaluation of group differences in terms of different PISA θ-values, the relation between PISA θ-values and the frequency of ‘not reached’ can be shown. Using the log-odds as a measure of the level of ‘not reached’, a distinct linear relationship can be detected in figure 4. As anticipated, the relation indicates a negative correlation. For the summary of conclusions concerning the effects of inhomogeneity and other sources influencing the θ-scaling, it is clear from figure 4 that this variable can be handled statistically in a simple linear manner.
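The transformation and the fitted line can be sketched with invented numbers (these (frequency, score) pairs are illustrative only, not PISA data):

```python
# Log-odds of the 'not reached' frequency against a score, summarised
# by an ordinary least-squares line, as in the relation behind figure 4.
import math

def log_odds(p):
    """log(p / (1 - p)) for a 'not reached' frequency 0 < p < 1."""
    return math.log(p / (1.0 - p))

# Hypothetical (not-reached frequency, score) pairs for illustration
data = [(0.02, 560.0), (0.05, 530.0), (0.10, 480.0), (0.20, 430.0)]
x = [log_odds(p) for p, _ in data]
y = [s for _, s in data]

# Simple linear fit: slope = Sxy / Sxx, intercept from the means
mx = sum(x) / len(x)
my = sum(y) / len(y)
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
        sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx

print(round(slope, 1), round(intercept, 1))
# the slope is negative: more 'not reached' goes with lower scores
```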
conditional maximum likelihood estimation. A popular technique for estimation and testing of homogeneity is successive extension of the data, increasing the number of items and using only complete response data with no ‘not reached’ responses in each step.
IDENTIFICATION OF GROUP DIFFERENCES USING PISA SCALES 197
Figure 4: Relation between estimated PISA θ-scores and the frequency of ‘not reached’ (log-odds of the frequency) for booklet No. 9 in the combined data set from PISA 2000, PISA Copenhagen and PISA Ethnic.
Conclusions and summary of effects on the scaling of PISA students
It has been essential for the analyses presented above to elucidate the theoretical arguments for the use of Rasch models in the work of calibrating scales for PISA measurements. Although the two latent scales containing item difficulties and student abilities are, mathematically speaking, completely symmetrical, different concepts and different methods are associated with the practical management of the two scales.
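The symmetry of the two latent scales can be seen directly from the dichotomous Rasch model; this minimal sketch assumes the usual notation, θ for student ability and σ for item difficulty:

```python
# Dichotomous Rasch model: the probability of a correct response
# depends only on the difference between student ability (theta) and
# item difficulty (sigma), so the two scales enter symmetrically.
import math

def p_correct(theta, sigma):
    """Rasch probability that a student with ability theta answers an
    item with difficulty sigma correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - sigma)))

print(p_correct(0.0, 0.0))                          # 0.5 when ability equals difficulty
print(p_correct(1.5, 0.5) == p_correct(2.0, 1.0))   # same difference, same probability
# swapping the roles of theta and sigma yields the complementary
# probability: p(theta, sigma) + p(sigma, theta) = 1
```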
The analyses have demonstrated that a certain degree of item inhomogeneity is present in the PISA 2000 and 2003 scales. These effects of inhomogeneity have been transformed into practical, measurable effects on the ordinary PISA ability θ-scale, which holds the internationally reported student results. On the individual student level these transformed effects amounted to rather large quantities, up to 150 PISA points, though often below 100 points. For the standard groupings of PISA students according to gender and ethnicity, the accumulated average effect on group level amounted to around 10 PISA points.
In order to examine the effects of item inhomogeneity in relation to other systematic factors which influence comparisons between groups of students, an illustration will be used from PISA 2000 in reading (see also Allerup, 2006). From the previous analyses a picture of item inhomogeneity across two systematic factors (gender and ethnicity) was obtained. Together with the factor booklet id and the number of ‘not reached’ items, four factors have thereby already been at work as systematic background for contrasting levels of PISA θ-scores.
The illustration aims at setting the effect of inhomogeneity in relation to other systematic factors when a statistical analysis of θ-score differences is carried out. The illustration will use differences between the two ethnic groups, carried out as adjusted comparisons with the systematic factors as controlling variables. In order to complete a typical PISA data analysis, one supplementary factor must be included: the socio-economic index (ESCS), which aims at measuring, through a simple index, the economic, educational and occupational level of the student’s home.14 The relation between PISA θ-scores and the index ESCS is a (weak) linear function and is usually called the ‘law of negative social heritage’. Together with the linear impression gained in figure 4, an adequate statistical analysis behind the illustration is therefore an analysis with the PISA θ-scores as dependent variable and (1) number of ‘not reached’ items, (2) booklet id, (3) gender and (4) the socio-economic index ESCS as independent variables, all implemented in a generalized linear model.
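The logic of such an adjusted comparison can be sketched with invented data; for brevity the sketch uses only a group dummy and one controlling variable standing in for ESCS (the real analysis also includes ‘not reached’, booklet id and gender):

```python
# Crude vs. adjusted group difference in a linear model, illustrated
# on simulated data where the groups differ both in the outcome and
# in the controlling variable.
import numpy as np

rng = np.random.default_rng(0)
n = 400
group = rng.integers(0, 2, n)              # 1 = majority, 0 = minority
escs = rng.normal(0, 1, n) + 0.8 * group   # groups also differ in ESCS
score = 450 + 40 * group + 25 * escs + rng.normal(0, 30, n)

# Crude (unadjusted) gap: plain difference of group means
crude = score[group == 1].mean() - score[group == 0].mean()

# Adjusted gap: coefficient of the group dummy with ESCS controlled
X = np.column_stack([np.ones(n), group, escs])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
adjusted = beta[1]

print(round(crude, 1), round(adjusted, 1))
# the adjusted gap is smaller, because part of the crude gap merely
# reflects the ESCS difference between the groups
```

This mirrors the pattern reported below for table 8: each controlling variable added to the model absorbs part of the raw group difference.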
Two kinds of PISA θ-scores enter the analysis: (1) the reported PISA scores found in the official reports from PISA 2000 (OECD, 2001), PISA Copenhagen (Egelund and Rangvid, 2004) and PISA Ethnic (Egelund and Tranæs, eds., 2006) and (2) Rasch total, i.e. estimated θ-scores based on a combined data set after removal of inhomogeneous items. By this the composition of effects on the resulting θ-scale from item inhomogeneity and other systematic factors is illustrated with an evaluation of their relative significance.
14 The economic component is not based on exact income figures but is estimated from information in the student questionnaire.
Controlling variables                          PISA-scores    Adjusted average difference
                                                              Danish vs. ethnic (θ-value)
Not reached, booklet, gender                   Reported       56.00
                                               Rasch total    47.48
Not reached, booklet, gender,                  Reported       43.89
socio-economy                                  Rasch total    26.74
NO adjusting variables                         Reported       90.54
                                               Rasch total    80.69

Table 8: Evaluation of differences between Danish and ethnic minority students using the combined data set from PISA 2000, PISA Copenhagen and PISA Ethnic. Differences listed by means of (1) reported PISA scores from the international PISA report and (2) Rasch scores where item inhomogeneity has been removed (Rasch total).
The results of analyzing the gap between Danish and ethnic minority students are presented in table 8. Under ‘no adjustment’ the officially reported gap of 90.54 PISA points is listed. If inhomogeneous items are removed from the item scale, this group difference is reduced to 80.69 points, i.e. a reduction of around 10 PISA points. The inhomogeneity is therefore responsible for around 10 PISA points. If the variables ‘not reached’, ‘booklet id’ and ‘gender’ are added as systematic factors in the statistical analysis, the controlled gap becomes 56.00 PISA points, if viewed from the point of the official PISA scores, and 47.48 if calculated after removal of inhomogeneous items. After additionally controlling for ESCS, the socio-economic index, the reported gap is 43.89 PISA points, while the gap comes down to 26.74 PISA points if it is measured by means of the homogeneous reading items only. Ordinary least squares evaluation of the last mentioned, fully controlled difference of 26.74 shows that this difference is not far from being insignificant (p=0.01). Notice that the part of the difference which can be attributed to the effect of inhomogeneous items varies from around 10 PISA points, constituting around 11 % of the total official interval, in the case of crude comparisons without other controlling variables (last line in table 8), to approximately 20 PISA points, constituting around 50 % of the total official interval, when inhomogeneity is evaluated after adjusting for the other variables.
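The arithmetic behind these attributions, taken directly from the values in table 8, can be restated compactly:

```python
# Part of the Danish/ethnic gap attributable to inhomogeneous items at
# each level of adjustment: reported gap minus the gap computed after
# removing inhomogeneous items (Rasch total). Values from table 8.
gaps = {
    "no adjustment": (90.54, 80.69),
    "not reached + booklet + gender": (56.00, 47.48),
    "+ socio-economy (ESCS)": (43.89, 26.74),
}
for label, (reported, rasch_total) in gaps.items():
    print(label, round(reported - rasch_total, 2))
# 9.85, 8.52 and 17.15 PISA points respectively
```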
What can be seen from this example and the previous discussions and data analysis is that the effect of inhomogeneous items on the official PISA θ-scale can be substantial if the aim of the analysis is to compare individuals, or a few students at a time. The average effect on the official PISA θ-scale in the case of larger student groups depends on the setting in which comparisons are carried out. It seems to have less impact on crude comparisons of (average) PISA abilities with no other variables involved, amounting to around 10 PISA points, while more sophisticated, adjusted comparisons involving controlling variables are more affected by item inhomogeneity.
References

Allerup, P. (1994): “Rasch Measurement, Theory of”. The International Encyclopedia of Education, Vol. 8, Pergamon, 1994.

Allerup, P. (1995): “The IEA Study of Reading Literacy”. In Owen, P. & Pumfrey, P. (eds.): Children Learning to Read: International Concerns, Vol. 2, pp. 186-297, 1995.

Allerup, P. (1997): “Statistical Analysis of Data from the IEA Reading Literacy Study”. In Applications of Latent Trait and Latent Class Models in the Social Sciences. Waxmann, 1997.

Allerup, P. (2002): “Test Equating Using IRT Models”. Proceedings of the 7th Round Table Conference on Assessment, Canberra, November 2002.

Allerup, P. (2002): “Gender Differences in Mathematics Achievement”. In Measurement and Multivariate Analysis. Springer Verlag, Tokyo.

Allerup, P. (2005): “PISA præstationer – målinger med skæve målestokke?” Dansk Pædagogisk Tidsskrift, Vol. 1, 2005. (In Danish)

Allerup, P., Lindenskov, L. & Weng, P. (2006): “Growing Up – The Story Behind Two Items in PISA 2003”. Nordic Light, Nordisk Råd, 2006.

Allerup, P. (2006): “PISA 2000’s læseskala – vurdering af psykometriske egenskaber for elever med dansk og ikke-dansk sproglig baggrund”. Rockwool Fondens Forskningsenhed og Syddansk Universitetsforlag, 2006. (In Danish)

Andersen, A. et al. (2001): “Forventninger og færdigheder – danske unge i en international sammenligning”. AKF (Anvendt Kommunal Forskning), DPU (Danmarks Pædagogiske Universitet), SFI (Social Forsknings Instituttet).

Andersen, E. B. (1973): “Conditional Inference and Models for Measuring”. Copenhagen: Mentalhygiejnisk Forlag.

Beaton, A. et al. (1996): “Mathematics Achievement in the Middle School Years. IEA’s Third International Mathematics and Science Study”. Boston College, USA.

Egelund, N. & Tranæs, T. (eds.) (2007): “PISA Etnisk 2005 – kompetencer hos danske og etniske elever i 9. klasser i Danmark 2005”. Rockwool Fondens Forskningsenhed, Syddansk Universitetsforlag.

Elley, W. (1992): “How in the World Do Students Read?” The International Association for the Evaluation of Educational Achievement (IEA), The Hague, 1992.

Fischer, G. & Molenaar, I. (1995): Rasch Models – Foundations, Recent Developments, and Applications. Springer-Verlag, New York.

Lord, F. & Novick, M. (1968): “Statistical Theories of Mental Test Scores”. Addison-Wesley, Massachusetts.

OECD (2001): “Knowledge and Skills for Life – First Results from PISA 2000”. OECD, Paris.

OECD (2004): “Learning for Tomorrow’s World – First Results from PISA 2003”. OECD, Paris.

Rasch, G. (1960): “Probabilistic Models for Some Intelligence and Attainment Tests”. Munksgaard, 1960. Reprinted Chicago University Press, 1980.

Rasch, G. (1971): “Proof that the Necessary Condition for the Validity of the Multiplicative Dichotomic Model Is Also Sufficient”. Duplicated note, Statistical Institute, Copenhagen (see Allerup, 1994).

Torney-Purta, J., Lehmann, R., Oswald, H. & Schulz, W. (2001): “Citizenship and Education in Twenty-Eight Countries: Civic Knowledge and Engagement at Age Fourteen”. Amsterdam: IEA, 2001.
PISA and “Real Life Challenges”: Mission Impossible?

Svein Sjøberg

Norway: University of Oslo
Introduction

The PISA project has positive as well as more problematic aspects, and it is important for educators and researchers to engage in critical public debates on this highly important project, including its uses and misuses.

The PISA project sets the educational agenda internationally as well as within the participating countries. PISA results and advice are often considered objective and value-free scientific truths, while they are, in fact, embedded in the overall political and economic aims and priorities of the OECD. Through media coverage, PISA results create the public perception of the quality of a country’s overall school system. The lack of critical voices from academics as well as from the media gives authority to the images that are presented.
In this article, I will raise critical points from several perspectives. The main point of view is that the PISA ambition of testing “real-life skills and competencies in authentic contexts” is by its very definition impossible to achieve. A test is never better than the items that constitute it. Hence, a critique of PISA should not mainly address the official rationale, ambitions and definitions, but should scrutinize the test items and the realities around the data collection. The secrecy over PISA items makes detailed critique difficult, but I will illustrate the quality of the items with two examples from the released texts.
Finally, I will raise serious questions about the credibility of the results, in particular the rankings. Reliable results assume that the respondents in all countries do their best while sitting the test. I will assert that young learners in different countries and cultures may vary in the way they behave in the PISA test situation. I claim that in many modern societies, quite a few students are unwilling to give their best performance if they find the PISA items long, unreadable, unrealistic and boring, in particular if bad test results have no negative consequences for them. I will use the concept of “perceived task value” to argue this important point.
The political importance of PISA
Whether one likes the PISA study or not, one can easily agree about the importance of the project. When the OECD embarks on such a large project, it is certainly not meant as a purely academic research undertaking. PISA is meant to provide results to be used in the shaping of future policies. After 6-7 years of living with PISA, we see that the PISA concepts, ideology, values and, not least, the results and the rankings shape international educational policies and also influence national policies in most of the participating countries. Moreover, the PISA results provide the media and the public with convincing images and perceptions about the quality of the school system, the quality of the teachers’ work and the characteristics of both the school population and future citizens.
Contemporary natural science is often labelled Big Science or Technoscience: the projects are multinational, they involve thousands of researchers, and they require heavy funding. Moreover, the scientific values and ethos of such science differ from the traditional ideals of academic science (Ziman, 2000). Prime examples are CERN, the Human Genome Project, the European Space Agency etc. The PISA project has many similarities with such projects, although the scale and the costs are much lower. But the number of people involved is large, and the mere organization of the undertaking requires resources, planning and logistics unusual in the social sciences. According to Prais (2007), the total cost of the PISA and TIMSS testing in 2006 was “probably well over 100 million US dollars for all countries together, plus the time of pupils and teachers directly involved.”
Why is an organization like the OECD embarking on an ambitious task like this? The OECD is an organization for the promotion of economic growth, cooperation and development in countries that are committed to market economies. Its slogan appears on its website: “For a better world economy.”1

1 These and other quotes in the article are taken from the OECD’s home site http://www.oecd.org/, retrieved Sept 2, 2007.
The OECD and its member countries have not embarked on the PISA project because they have an interest in basic research in education or learning theory. They have decided to invest in PISA because education is crucial for the economy. Governments need information that is supposed to be relevant for their policies and priorities in this economic perspective. Since mass education is expensive, they also most certainly want “value for money” to ensure the efficient running of the educational systems. Stating this is not meant as a critique of PISA. It is, however, meant to state the obvious, but still important, fact: PISA should be judged in the context of the agenda of the OECD: economic development and competition in a global market economy.
The strong influence that PISA has on national educational policies implies that all educators ought to be interested in PISA, whether they endorse the aims of PISA or not. Educators should be able to discuss and use the results with some insight into the methods, underlying assumptions, strengths and weaknesses, possibilities and limitations of the project. We need to know what we might learn from the study, as well as what we cannot learn. Moreover, we need to raise a critical (not necessarily negative!) voice in the public as well as professional debates over uses and misuses of the results.
The influence of PISA: Norway as an example
The attention given to PISA results in national media varies between countries, but in most countries it is formidable. In my country, Norway, the results from PISA2000 as well as from PISA2003 produced war-like headlines in most national newspapers.
Our then Minister of Education (2001-2005), Kristin Clemet (representing Høyre, the Conservative party), commented on the PISA2000 results, released a few months after she had taken office following a Labour government: “Norway is a school loser, now it is well documented. It is like coming home from the Winter Olympics without a gold medal” (which, of course, for Norway would have been a most unthinkable disaster!). She even added: “And this time we cannot even claim that the Finnish participants have been doped!” (Aftenposten, January 2001).
The headlines in all the newspapers told us again and again that “Norway is a loser”. In fact, such headlines were misleading. Norway ended up close to the average among the OECD countries in all test domains in PISA2000 and PISA2003. But for some reason, Norwegians had expected that we should be
Figure 1: PISA results are presented in the media with war-like headlines, shaping public perception of the national school system. Here PISA results are presented in the leading Norwegian newspaper Dagbladet under the heading “Norway is a school loser”.
on top – as we often are on other indicators and in winter sports. When we are not the winners, we regard ourselves as losers.
The results from PISA (and TIMSS as well) have shaped the public image of the quality of our school system, not only for the aspects that have actually been studied, but for more or less all other aspects of school. It has now become commonly ‘accepted’ that Norwegian schools in general have a very low level of quality, and that Norwegian classrooms are among the noisiest in the world. The media present tabloid-like and oversimplified rankings. It seems that the public as well as politicians have accepted these versions as objective scientific truths about our education system. There has been little public debate, and even the researchers behind the PISA study have done little to modify this picture and remind the public about the limitations of the study. In sum, PISA (as well as TIMSS) has created a public image of the quality of the Norwegian school that is not justified, and that may be seen as detrimental. I assume that other countries may have similar experiences.
But PISA does not only shape the public image; it also provides a scientific legitimization of school reforms. Under Kristin Clemet as Minister of Education (2001-2005), a series of educational reforms was introduced in Norway.
mainly <strong>to</strong> <strong>PISA</strong>. In 2005, we had a change in government, and Kristin Clemet’s<br />
Secretary of State, Helge Ole Bergesen, published a book shortly afterwards in<br />
which he presented the “inside s<strong>to</strong>ry” on the reforms made while they were in<br />
power. The main perspective of the book is the many references <strong>to</strong> large-scale<br />
achievement studies. He confirms that these studies provided the key arguments<br />
and rationale for curricular as well as other school reforms. Under the<br />
tabloid heading: “The <strong>PISA</strong> Shock”, he confirms the key role of <strong>PISA</strong>:<br />
With the [publication of the] <strong>PISA</strong> results, the scene was set for a national battle over<br />
knowledge in our schools. [ . . . ] For those of us who had just taken over the political<br />
power in the Ministry of Education and Research, the <strong>PISA</strong> results provided a “flying<br />
start” (Bergesen 2006: 41-42. Author’s translation).<br />
Other countries may have different stories to tell. Figures 2 and 3 provide examples from the public sphere in Germany. In sum: there is no doubt that PISA has provided – and will continue to provide – results, ideologies, concepts, analyses, advice and recommendations that will shape our future educational debates and reforms, nationally as well as internationally.
PISA: Underlying values and assumptions
It is important to examine the ideas and values that underpin PISA because, like most research studies, PISA is not impartial. It builds on several assumptions, and it carries with it several value judgements. Some of these values are explicit; others are implicit and ‘hidden’, but nevertheless of great importance. Some value commitments are not very controversial; others may be contested.
Peter Fensham, a key scholar in international science education thinking and research for many decades, has also been heavily involved in several committees in TIMSS and PISA. He has seen all aspects of the projects from the inside over decades. In a recent book chapter, he provides an insider’s overview and critique of the values that underlie these projects and draws attention to their implications:

The design and findings from large-scale international comparisons of science learning do impact on how science education is thought about, is taught and is assessed in the participating countries. The design and the development of the instruments used
Figure 2: The political agenda and the public image of the quality of the entire school system are formed by the PISA results. This is an example from the German newspaper Die Woche after the release of the PISA2000 results.
and the findings that they produce send implicit messages to the curriculum authorities and, through them, to science teachers. It is thus important that these values, at all levels of existence and operations of the projects, be discussed, lest these messages act counterproductively to other sets of values that individual countries try to achieve with their science curricula. (Fensham 2007: 215-216)
Aims and purpose of the OECD
In the public debate as well as among politicians, advice and reports from OECD experts are often considered impartial and objective. The OECD has become an important contributor to the political battle over social, political, economic and other ideas. To a large extent, its experts shape the political landscape, and their reports and advice set the political agenda in national as well as international debates over priorities and concerns. But the OECD is certainly not an impartial group of independent educational researchers. The OECD is built on a neo-liberal political and economic ideology, and its advice should be seen in this perspective. The seemingly scientific and neutral language of expert advice conceals the fact that there are possibilities for other
Figure 3: PISA has become a well-known concept in public debate: a bookshelf in a German airport offering bestselling books with PISA-like tests for self-assessment, very much like IQ tests.
political choices based on different sets of social, cultural and educational values. Figure 4 shows how the OECD presents itself.
The overall perspective of the OECD is concerned with market economy and growth in free world trade. All policy advice it provides is certainly coloured by such underlying value commitments. Hence, the agenda of the OECD (and PISA) does not necessarily coincide with the concerns of many educators (or other citizens, for that matter). The concerns of PISA are not about ‘Bildung’ or liberal education, not about solidarity with the poor, not about sustainable development etc., but about skills and competencies that can promote the economic goals of the OECD. Saying this is, of course, stating the obvious, but such basic facts are often forgotten in the public and political debates over PISA results.
210 SVEIN SJØBERG

About the OECD

The OECD brings together the governments of countries committed to democracy and the market economy from around the world to:
– Support sustainable economic growth
– Boost employment
– Raise living standards
– Maintain financial stability
– Assist other countries’ economic development
– Contribute to growth in world trade

Figure 4: The basis for and commitments of the OECD as they appear on http://www.oecd.org/ Retrieved 7 Sept 2007
Educational and curricular values in PISA

Quite naturally, values creep into PISA testing in several ways. PISA sets out to shed light on important (and not very controversial) questions like these:

Are students well prepared for future challenges? Can they analyse, reason and communicate effectively? Do they have the capacity to continue learning throughout life? (First words on the PISA home page at: http://www.pisa.oecd.org/)

These are important concerns for most people, and it is hard to disagree with such aims. However, as is well known, PISA tests just a few areas of the school curriculum: reading, mathematics and science. These subjects are, consequently, considered more important than other areas of the school curriculum for reaching the brave goals quoted above. Hence, the OECD implicitly says that our future challenges are not highly dependent on subjects like history, geography, social science, ethics, foreign languages, practical skills, arts and aesthetics, etc.
PISA provides test results that are closely connected to (certain aspects of) the three subjects tested. But when test results are communicated to the public, one gets the impression that they have tested the quality of the entire school system and all the competencies that are of key importance for meeting the challenges of the future.

There is one important feature of PISA that is often forgotten in the public debate: PISA (in contrast to TIMSS) does not test “school knowledge”. Neither the PISA framework nor the test items claim to have any connection to national school curricula. This fact is in many ways the strength of the PISA undertaking; its designers have set out to think independently of the constraints of
all the different school curricula. There is a strong contrast with the TIMSS test, whose items are meant to test knowledge that is more or less common to all curricula in the numerous participating countries. This implies, of course, that the “TIMSS curriculum” (Mullis et al. 2001) may be characterized as a fossilized and old-fashioned curriculum of a type that most science educators want to eradicate. In fact, nearly all TIMSS test items could have been used 60-70 years ago. PISA thinking has been freed from the constraints of school curricula and could in principle be more radical and forward-looking. (However, as asserted in other parts of this chapter, PISA does not manage to live up to such high expectations.)
PISA stresses that the skills and competencies assessed may stem not only from activities at school but also from experiences and influences in family life, contact with friends, etc. In spite of this, both good and bad results are most often attributed, by the public and politicians alike, to the school alone.
Values in the PISA reporting

The PISA data collection also covers a great variety of background variables. The intention is, of course, to use these to explain the variance in the test results (“explain” in a statistical sense, i.e. to establish correlations, etc.). Many interesting studies have been published on such issues. But the main public reporting takes the form of simple rankings, often league tables of the participating countries, in which the mean scores of the national samples are published. These league tables are nearly the only results that appear in the mass media. Although the PISA researchers take care to explain that many differences (say, between a mean national score of 567 and 572) are not statistically significant, the placement on the list gets most of the public attention. It is somewhat similar to sporting events: the winner takes it all. If you end up as no. 8, no one asks how far you are from the winner, or how far you are from no. 24. Moving up or down some places in this league table from PISA 2000 to PISA 2003 is accorded great importance in the public debate, although the differences may be non-significant both statistically and educationally.
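The scale of such non-significant gaps is easy to illustrate with a back-of-the-envelope check. This is only a sketch: the standard errors of 3.0 score points are my illustrative assumption, roughly the magnitude typically reported for PISA national means, and the function name is my own.

```python
import math

def z_for_difference(mean_a, se_a, mean_b, se_b):
    """z-statistic for the gap between two independent country means."""
    return (mean_a - mean_b) / math.sqrt(se_a**2 + se_b**2)

# A 5-point gap (572 vs 567), with assumed standard errors of 3.0 points:
z = z_for_difference(572, 3.0, 567, 3.0)
print(round(z, 2))  # 1.18, below the 1.96 needed for significance at the 5 % level
```

A gap of five points with typical sampling uncertainty thus falls well short of conventional significance, yet in a league table it can separate several ranking positions.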
The winners also become models and ideals for other countries. Many want to copy aspects of the winners’ school systems. This, among other things, assumes that PISA results can be explained mainly by school factors –
and not by political, historical, economic or cultural factors, or by youth culture and the values and concerns of the young learners. Peter Fensham claims:

. . . the project managers choose to have quite separate expert groups to work on the science learning and the contextual factors—a decision that was later to lead to discrepancies. Both projects have taken a positivist stance to the relationship between contextual constructs and students’ achievement scores, although after the first round of TIMSS other voices suggested a more holistic or cultural approach to be more appropriate for such multi-cultural comparisons. (Fensham 2007: 218)
PISA (and even more so TIMSS) is dominated and driven by psychometric concerns, and much less by educational ones. The data that emerge from these studies provide a fantastic pool of social and educational data, collected under strictly controlled conditions – a playground for psychometricians and their models. In fact, the rather complicated statistical design of the studies decreases their intelligibility. It is, even for experts, rather difficult to understand the statistical and sampling procedures, and the rationale and models that underlie even the test scores themselves. In practice, one has to take the results at face value and on trust, given that some of our best statisticians are involved. But the advanced statistics certainly reduce the transparency of the study and hinder informed public debate.
PISA items – a critique

The secrecy

An achievement test is never better than the quality of its items. If the items are miserable, even the best statisticians in the world cannot change this fact. Subject matter educators should have a particular interest, and even a duty, to go into detail on how their subject is treated and ‘operationalized’ through the PISA test items. One should not just discuss the given definitions of e.g. scientific literacy and the intentions of what PISA claims to test. In fact, the framework as well as the intentions and ideologies behind the PISA testing may be considered acceptable and even progressive. The important question is: how are these brave intentions translated into actual items?

But it is not easy to address this important issue, as only very few of the items have been made publicly available. Peter Fensham, himself a member of the PISA (as well as the TIMSS) subject matter expert group, deplores the secrecy:
“By their decision to maintain most items in a test secret [ . . . ] TIMSS and PISA deny to curriculum authorities and to teachers the most immediate feedback the project could make, namely the release in detail of the items, that would indicate better than framework statements, what is meant by ‘science learning’. The released items are tantalizing few and can easily be misinterpreted” (Fensham 2007: 217)

The reason for this secrecy is, of course, that the items will be used in the next PISA testing round, and therefore they may not be made public. An informed public debate on this key issue is therefore difficult, to say the least. But we can scrutinize the relatively few items that have been made public.2
Can “real-life challenges” be assessed by wordy paper-and-pencil items?
The PISA testing takes place in about 60 countries, which together (according to the PISA homepage) account for 90 % of the world economy. PISA has the intention of testing

. . . knowledge and skills that are essential for full participation in society. [ . . . ] not merely in terms of mastery of the school curriculum, but in terms of important knowledge and skills needed in adult life. [ . . . ]

The questions are reviewed by the international contractor and by participating countries and are carefully checked for cultural bias. Only those questions that are unanimously approved are used in PISA. (Quotes from pisa.oecd.org, retrieved 5 Sept 2007)

In each item unit, the questions are based on what is called an “authentic text”. This, one may assume, means that the original text has appeared in print in one of the 60 participating countries and has been translated from this original.

There are many critical comments that can be made to challenge the claim that PISA lives up to the high ambition of testing real-life skills. An obvious limitation is the test format itself: the test contains only paper-and-pencil items, and most items are based on the reading of rather lengthy pieces of text. This is, of course, only a subset of the types of “knowledge and skills that are essential for full participation in society”. Coping with life in modern societies requires a range of competencies and skills that cannot possibly be measured by test items of the PISA units’ format.
2 All the released items from previous PISA rounds can be retrieved from the PISA website http://www.oecd.org/document/25/0,3343,en_32252351_32235731_38709529_1_1_1_1,00.html
Identical “real-life challenges” in 60 countries?

But the abovementioned criticism has other and equally important dimensions: the PISA test items are by necessity exactly the same in each country. The quote above assures us that any “cultural bias” has been removed and that items have to be “unanimously approved”.

At first glance, this sounds positive. But there are indeed difficulties with such requirements: real life is different in different countries. Here are, in alphabetical order, the first countries on the list of participating countries:

Argentina*, Australia, Austria, Azerbaijan*, Belgium, Brazil*, Bulgaria*, Canada, Chile*, Colombia*, Croatia*, the Czech Republic, Denmark 3

We can only imagine the deliberations towards unanimous acceptance of all items among the 60 countries, with the demands that there be no cultural bias and that the context of no country be favoured.
The following consequences seem unavoidable: the items become decontextualised, or acquire contrived ‘contexts’ far removed from most real-life situations in any of the participating countries. While schools in most countries have a mandate to prepare students to meet the challenges of their particular society (depending on its level of development, climate, natural environment, culture, and urgent local and national needs and challenges, etc.), PISA tests only aspects that are shared with all other nations. This runs contrary to current curriculum trends in many countries, where the issue of providing local relevance and context has become urgent. In many countries, educators argue for a more contextualized (or ‘localized’) curriculum, at least in the obligatory basic education for all young learners.
The item construction process also rules out the inclusion of all sorts of controversial issues, be they scientific, cultural, economic or political. It is enough that the authorities in just one of the participating countries have objections.

To repeat: schools in many countries have the mandate of preparing their learners to take an active part in social and political life. While many countries encourage schools to treat controversial socio-scientific issues, such issues are unthinkable in schools in other countries. Moreover, a controversial issue in one country may not be seen as controversial in another. In sum: the demands of the item construction process set serious limitations on the actual items that comprise the PISA test.
3 The list is from http://www.pisa.oecd.org/ Countries marked with a * are not members of the OECD (but are also assumed to unanimously agree on the inclusion of all test units).
Now, all the above considerations are simply deductions from the demands of the processes behind the construction of the PISA instrument. It is, of course, of great importance to check such conclusions against the test itself. But, as mentioned, this is not an easy task, given the secrecy over the test items. Nonetheless, the items that have been released confirm the above analysis: the PISA items are basically decontextualised and non-controversial. The PISA items are – in spite of an admirable level of ambition – nearly the negation of the skills and competencies that many educators consider important for facing future challenges in modern, democratic societies.

PISA items have also been criticized on other grounds. Many claim that the scientific content is questionable or misleading and that the language is strange, often verbose. In the following, two examples of PISA units are discussed in some detail, one from mathematics, the other from science.
A PISA mathematics unit: Walking

In Figure 5 below, the complete PISA test unit called Walking is reproduced.

M124: Walking

The picture shows the footprints of a man walking. The pacelength P is the distance between the rear of two consecutive footprints.

For men, the formula, n/P = 140, gives an approximate relationship between n and P where,
n = number of steps per minute, and
P = pacelength in metres

Question 1: WALKING M124Q01- 0 1 2 9
If the formula applies to Heiko’s walking and Heiko takes 70 steps per minute, what is Heiko’s pacelength? Show your work.

Question 3: WALKING M124Q03- 00 11 21 22 23 24 31 99
Bernard knows his pacelength is 0.80 metres. The formula applies to Bernard’s walking. Calculate Bernard’s walking speed in metres per minute and in kilometres per hour. Show your working out.

Figure 5: A complete PISA mathematics unit, “Walking”, with the text presenting the situation and the questions relating to the situation.
Comments on Walking

Some details first: note that Question 2 is missing! (This may be an omission in the published document.) Note also the end of Q1: “Show your work.” And for Q3: “Show your working out.” There also seem to be several commas too many. Consider the commas in this paragraph: “For men, the formula, n/P = 140, gives an approximate relationship between n and P where, etc . . . ”. In my view, all four commas seem somewhat misplaced. Perhaps these are merely details, but they are not very convincing as the final outcome of serious negotiations between 60 countries!
The main comments on this unit, however, concern its content. First of all: is this really a “real-life situation”? How real is the situation described above? Is this type of question a real challenge in the future life of young people – in any country?

But even if we accept the situation as a real problem, it is hard to accept that the given formula is a realistic mathematization of a genuine situation. The formula implies that when you increase the frequency of your walking, your paces simultaneously become longer. In reality, a person may walk with long paces at low frequency, and the same person may also walk with short steps at high frequency. In fact, at least from my point of view, the two factors should be inversely proportional rather than proportional, as suggested in the “Walking” item. In any case, a respondent who tries to think critically about the formula may get confused, but those who do not think may easily solve the question simply by inserting numbers into the formula.
But the problems do not stop here: take a careful look at the dimensions given in the figure. If the marked footstep is 80 cm (as suggested in Q3 above), then the footprint is 55 cm long! A regular man’s foot is actually only about 26 cm long, so the figure is extremely misleading! But even worse: from the figure, we can see (or measure) that the next footstep is 60 % longer. Given the formula above, this also implies a more rapid pace, and the man’s acceleration from the first to the second footstep has to be enormous!

In conclusion: the situation is unrealistic and flawed from several points of view. Students who simply insert numbers into the formula without thinking will get it right. More critical students who start thinking will, however, be confused and get into trouble!
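For reference, the arithmetic the item rewards can be sketched in a few lines, using only the item’s own formula n/P = 140. The function names are my own, and the last lines illustrate the speed jump implied by the figure’s 60 % longer footstep.

```python
# Arithmetic for the "Walking" unit, from the item's formula n / P = 140
# (n = steps per minute, P = pacelength in metres).

def pacelength(n):
    """Q1: pacelength P for a given step frequency n, from n / P = 140."""
    return n / 140

def walking_speed(P):
    """Q3: speed in metres per minute and km/h for a given pacelength P."""
    n = 140 * P               # the formula ties frequency to pacelength
    m_per_min = n * P         # speed v = n * P = 140 * P**2
    return m_per_min, m_per_min * 60 / 1000

print(pacelength(70))                    # Heiko: 0.5 (metres)

m_min, km_h = walking_speed(0.80)        # Bernard
print(round(m_min, 1), round(km_h, 2))   # 89.6 5.38

# The next footstep in the figure, some 60 % longer, would push the speed
# from about 89.6 to about 229.4 m/min within a single step:
print(round(walking_speed(0.80 * 1.6)[0], 1))
```

Note that the speed line v = 140·P² makes the author’s point explicit: under this formula, lengthening the pace necessarily raises the step frequency too, so speed grows with the square of the pacelength.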
A PISA science unit: Cloning

In Figure 6 below, the complete PISA test unit called Cloning is reproduced.

S128: Cloning

Read the newspaper article and answer the questions that follow.

A copying machine for living beings?

Without any doubt, if there had been elections for the animal of the year 1997, Dolly would have been the winner! Dolly is a Scottish sheep that you see in the photo. But Dolly is not just a simple sheep. She is a clone of another sheep. A clone means: a copy. Cloning means copying ‘from a single master copy’. Scientists succeeded in creating a sheep (Dolly) that is identical to a sheep that functioned as a ‘master copy’. It was the Scottish scientist Ian Wilmut who designed the ‘copying machine’ for sheep. He took a very small piece from the udder of an adult sheep (sheep 1). From that small piece he removed the nucleus, then he transferred the nucleus into the egg-cell of another (female) sheep (sheep 2). But first he removed from that egg-cell all the material that would have determined sheep 2 characteristics in a lamb produced from that egg-cell. Ian Wilmut implanted the manipulated egg-cell of sheep 2 into yet another (female) sheep (sheep 3). Sheep 3 became pregnant and had a lamb: Dolly. Some scientists think that within a few years it will be possible to clone people as well. But many governments have already decided to forbid cloning of people by law.
Question 1: CLONING
Which sheep is Dolly identical to?
A Sheep 1
B Sheep 2
C Sheep 3
D Dolly’s father

Question 2: CLONING S128Q02
In line 14 the part of the udder that was used is described as “a very small piece”. From the article text you can work out what is meant by “a very small piece”.
That “very small piece” is
A a cell.
B a gene.
C a cell nucleus.
D a chromosome.

Question 3: CLONING S128Q03
In the last sentence of the article it is stated that many governments have already decided to forbid cloning of people by law. Two possible reasons for this decision are mentioned below. Are these reasons scientific reasons? Circle either “Yes” or “No” for each.

Reason – Scientific?
Cloned people could be more sensitive to certain diseases than normal people. Yes/No
People should not take over the role of a Creator. Yes/No

Figure 6: A complete PISA science unit, “Cloning”, with the text presenting the situation and the three questions relating to the situation.
Comments on Cloning

This task requires understanding of a rather lengthy 30 lines of text. In non-English-speaking countries, the text is translated into the language of instruction. The translation follows rather detailed procedures to ensure high quality. The requirement that the texts be more or less identical results in rather strange prose in many languages. The original has, we assume, been an “authentic text” in some language, but the resulting translations cannot be considered “authentic” in the sense that they could appear in any newspaper or journal in that particular country.

PISA adheres to strict rules for the translation process, but this is not the way prose should be translated to become good, natural and readable in other languages. In my own language, Norwegian, the heading “A copying machine for living beings” is translated word by word. This does not make sense, and prose like this would never appear in real texts.

The scientific content of the item may also be challenged. The only accepted answer to Question 1 is that Dolly is identical to Sheep 1 (alternative A). It may seem strange to claim that two sheep of very different ages are “identical” – but this is the only acceptable answer. The other two questions are also open to criticism. Basically, they test language skills, reading as well as vocabulary. (The word ‘udder’ was unknown to me.)
In conclusion: although the intentions behind the PISA test are positive, it is next to impossible to produce items that are ‘authentic’ and close to real-life challenges – and at the same time without cultural bias and equally ‘fair’ in all countries. Items have to be constructed through international negotiations, and the result is that all contexts are wiped out – contrary to the ambitions of the PISA framework.
Youth culture: Who cares to concentrate on PISA tests?

In the PISA testing, students at the age of 15 are supposed to sit for two hours and do their best to answer the items. The data gathered in this way form the basis of all conclusions on achievement and of all forms of factor analysis that explain (in a statistical sense) the variation in achievement. The quality of these achievement data determines the quality of the whole PISA exercise. Good data assume, of course, that the respondents have done their best to answer the questions. For PISA results to be valid, one has to assume that
students are motivated and cooperative, and that they are willing to concentrate on the items and give their best performance.

There are good reasons to question such assumptions. My assertion is that students in different countries react very differently to test situations like those of PISA (and TIMSS). This reaction is closely linked to the overall cultural environment in the country, and in particular to students’ attitudes to school and education. Let me illustrate such cultures with examples from two countries scoring high on tests like PISA and TIMSS.
Testing in Taiwan and Singapore

An observer from Times Educational observed the TIMSS testing at a school in Taiwan and noticed that pupils and parents were gathered in the schoolyard before the big event, the TIMSS testing. The director of the school made an appeal in which he urged the students to perform their utmost for themselves and their country. Then they marched in while the national anthem was played. Of course, they worked hard; they lived up to the expectations of their parents, school and society.
Similar observations can be made in Singapore, another high achiever on the international tests. A professor of mathematics at the National University of Singapore (Helmer Aslaksen) makes the following comment: “In this country, only one thing matters: Be best – teach to the test!”

He has also taken a photograph of the check-out counter in a typical Singaporean shop (see Figure 7). This is where the last-minute offers are displayed: on the lower shelf one finds painkillers, while the upper shelf displays a collection of exam papers for the important public exams in mathematics, science and English (i.e. the three PISA subjects). This is what ambitious parents may bring home for their 13-year-old kids. Good results in such exams are determinants of the student’s future.
This is definitely not the way such testing takes place in my part of the world (Norway) and the other Scandinavian countries. Here, students have a very different attitude to schooling, and even more so to exams and testing. The students know that their performance on the PISA test has no significance for them: they are told that they will never get the results, the items will never be discussed at school, and they will not get any other form of feedback, let alone school marks, for their efforts. Given the educational and cultural milieu in (e.g.) Scandinavia, it is hard to believe that all students will engage seriously in the PISA test.
Figure 7: The context of exams and testing: the check-out counter in a shop in Singapore, where last-minute offers are displayed. On the lower shelf: medicinal painkillers. On the upper shelf: exam papers for the important public exams in mathematics, science and English (i.e. the three PISA subjects).
Task value: “Why should I answer this question?”

Several theoretical concepts and perspectives are used to describe and explain performance on tests. The concept of self-efficacy beliefs has become central to this field. Self-efficacy belief denotes the belief and confidence that students have in their own resources and competencies when facing a task (Bandura 1997). Self-efficacy is rather specific to the type of task in question, and should not be confused with more general psychological personality traits like self-confidence or self-esteem. PISA has several constructs that seek to address self-efficacy, and a rather strong positive relationship has been noted between e.g. mathematical self-efficacy beliefs and achievement on the PISA mathematics test at the individual level (Knain & Turmo 2003). (It
is, however, interesting to note that such a positive correlation does not exist when countries are the unit of comparison.)

There is, however, a related concept that may be of greater importance in explaining test results and students’ behaviour in test situations. This is the concept of task value beliefs (Eccles & Wigfield 1992, 1995). While self-efficacy beliefs ask the question, “Am I capable of completing this task?”, task value beliefs focus on the question, “Why do I want to do this task?”. Task value beliefs concern beliefs about the importance of succeeding (or even trying to succeed) on a given task.
It has been proposed that the task value belief has three different components or dimensions: 1. attainment value, 2. intrinsic value or interest, and 3. utility value. Rhee et al. (2007) explain in more detail:

Attainment value refers to the importance or salience that students place on the task. Intrinsic value (i.e. personal interest) relates to general enjoyment of the task or subject matter, which remains more stable over time. Finally, utility value concerns students’ perceptions of the usefulness of the task, in terms of their daily life or for future career-related or life goals. (Rhee et al. 2007: 87)
I would argue that young learners in different countries perceive the task value of the PISA testing in very different ways, as indicated in this chapter’s previous sections.
Based on my knowledge of the school system and youth culture in my own part of the world, in particular Norway and Denmark, I would claim that many students in these countries assign very little value to all three of these dimensions of the task value of the PISA test and its items. Given the nature of the PISA tasks (long, clumsy prose and contrived situations removed from everyday life), many students can hardly find these items to have high “intrinsic value”; the items are simply not interesting and do not provide joy or pleasure. Neither does the PISA test have any “utility value” for these Scandinavian students: the results have no consequence, the items will never be discussed, there is no feedback, and the results are secret and do not count, neither for school marks nor in their daily lives. They do not count for students’ future career-related or life goals. Given the cultural and school milieu and the values held by young learners in e.g. Scandinavia, it is hard to understand why they should choose to push themselves in a PISA test situation.
If so, we have an additional cause for serious uncertainty about the validity and the reliability of the PISA results.
References

Bandura, A. (1997). Self-efficacy: The exercise of control. New York: Freeman.

Bergesen, O. H. (2006). Kampen om kunnskapsskolen (Eng.: The fight for a knowledge-based school). Oslo: Universitetsforlaget.

Eccles, J. S., & Wigfield, A. (1992). The development of achievement-task values: A theoretical analysis. Developmental Review, 12, 256–273.

Eccles, J. S., & Wigfield, A. (1995). In the mind of the actor: The structure of adolescents’ achievement task values and expectancy-related beliefs. Personality and Social Psychology Bulletin, 21, 215–225.

Fensham, P. (2007). Values in the measurement of students’ science achievement in TIMSS and PISA. In Corrigan et al. (Eds.), The Re-Emergence of Values in Science Education (pp. 215–229). Rotterdam: Sense Publishers.

Knain, E., & Turmo, A. (2003). Self-regulated learning. In Lie, S. et al. (Eds.), Northern Lights on PISA (pp. 101–112). Oslo: University of Oslo.

Mullis, I. V. S., Martin, M. O., Smith, T. A., Garden, R. A., Gregory, K. D., Gonzales, E. J., et al. (2001). TIMSS Assessment Frameworks and Specifications 2003. Boston: International Study Center, Boston College.

Prais, S. J. (2007). England: Poor survey response and no sampling of teaching groups. Oxford Review of Education, 33(1).

Rhee, C. B., Kempler, T., Zusho, A., Coppola, B., & Pintrich, P. (2005). Student learning in science classrooms: What role does motivation play? In Alsop, S. (Ed.), Beyond Cartesian Dualism: Encountering Affect in the Teaching and Learning of Science. Dordrecht: Springer, Science and Technology Education Library.

Ziman, J. M. (2000). Real Science: What it is and what it means. Cambridge: Cambridge University Press.
PISA – Undressing the Truth or Dressing Up a Will to Govern?

Gjert Langfeldt
Norway: University of Agder
Background

The background for this article is a study of accountability in Europe. The testing of pupils’ results is a prime mechanism in establishing an accountability-based logic of governance.1 A part of understanding accountability is the study of the quality of the instruments used to measure results, among which the international comparative tests are of prime importance.
PISA – the Programme for International Student Assessment – stands out among these tests as being by far the most influential of the international comparative tests. The approach to PISA used here was to collect articles showing how educational researchers around Europe have reacted to PISA. This is not easy, as PISA is conducted in three-year cycles with three different focal points, taking nine years to complete a full cycle. In 2000 the focal point of PISA was reading literacy, in 2003 mathematical competence. This meant that the researchers’ critique was often linked to a partial theme. A common methodological ground for critique of PISA was not always straightforward to find.
Synthesising the literature, this article is structured under three headings:

Reliability. “The international league table” is the spearhead of PISA in attracting public interest. The differences reported between countries have huge consequences, and the issue of whether these differences are reliable must be of primary concern. So the issue of reliability will be the first theme: Does the international literature indicate a concern that there are sources of “fuzziness” in the PISA results that can make the national scores appear unreliable?

1 The publication of schools’ results, most crudely known as “league tables”, and the sanctioning of schools based on the test results, appear to be further steps in creating a more full-blown version of accountability-based regimes.
Validity. The second issue of concern is the issue of validity. Several issues appear to be discussed under this theme in the literature. The angle chosen here can be stated thus: currently, 57 nations partake in PISA. In what sense is it meaningful to compare them by presenting a ladder of national results? Theoretically, how can one find a legitimate basis for comparing different nations? Closely related to this is the assumption – on which PISA is not alone in relying – that pupils’ results can be an indicator of school quality, which in turn can serve as proof of the quality of national educational systems. A third element in the discussion of the validity of PISA is the issue of inference: can one assess a school system on the basis of the scores of individual students?
The business model of PISA. The third issue I wish to focus on is PISA as a sociological event: the impact of PISA lies not only in how it changes the lives of pupils and teachers or makes educational policymaking change priorities, but also in how it shapes the way we think about education, about school quality and about what aims the educational policies of a nation should fulfil. Traditional actors in this field are politicians, who can be held accountable for their views. What kind of actor is PISA? Researchers claim that there is another agenda in which PISA is a prominent actor – and the final discussion of this paper will look at the legitimacy of PISA within such a broader horizon.
Why PISA

On the European scene, two providers of international, comparative knowledge tests are dominant: the IEA and the OECD.
The IEA (http://www.iea.nl/), or International Association for the Evaluation of Educational Achievement, is a foundation owned by member states and organisations, currently 62, with another 20 non-member states partaking in various activities. Its most popular product is TIMSS (Trends in International Mathematics and Science Study), which currently (TIMSS 2007) is used in more than 60 countries, of which more than 20 are European. TIMSS aims to measure mastery of the curriculum provided. PIRLS (Progress in International Reading Literacy Study) aims at measuring reading literacy; 41 countries currently participate, among them 23 European ones. Where TIMSS is run in 4-year cycles, PIRLS is run in 5-year cycles. A third study is SITES (Second Information Technology in Education Study), which in 2006 was run by 20 states, half of which were European. ICCS (International Civics and Citizenship Education Study) currently has 40 participants, 25 of which are European. The fifth and last product to be announced on the IEA homepage is concerned with teacher education development in mathematics (TEDS-M). This is a new test whose first results are being published in 2007. Fourteen countries participate in this test, of which 6 are European.
The OECD has 30 member states, accounting for about 90 % of the world’s GNP, and exerts a huge influence as “the club of the world’s rich countries”. Its efforts and influence in education are increasing (Jacobi 2006), partly due to the success of the definitive market leader, PISA. In addition to PISA, its tools of influence include the annual publication of statistics on education (Education at a Glance). The OECD also runs a “think-tank” related to education, CERI – the Centre for Educational Research and Innovation. In addition to PISA, the OECD also manages another international knowledge test, ALL (Adult Literacy and Life Skills Survey) (http://nces.ed.gov/surveys/all/), with the precursors SIALS and IALS. These are large-scale tests charting the adult population’s skills in reading and mathematics.
PISA is a programme of assessment in the sense that it is carried out every third year, with a differing focus on the three main areas. In 2000, 41 countries participated in the study, of which 25 were European; in 2006, 57 countries took part, 31 of them European. For each round of testing, the OECD publishes results comparing countries in the form of a league table. PISA assesses 15-year-old students, as “this is normally close to the end of the initial period of basic schooling in which all young people follow a broadly common curriculum”. PISA’s aim is to measure literacy:

While OECD/PISA does assess students’ knowledge, it also examines their ability to reflect, and to apply their knowledge and experience to real-world issues. . . . The term “literacy” is used to encapsulate this broader conception of knowledge and skills. (OECD 2003, pp. 9–10)
This approach sets PISA apart from competitors such as TIMSS, which tries to measure the degree to which pupils master the knowledge transmitted through the national curricula. PISA can thus claim not to be constrained by national curricula.
The special focus means that this theme will “take up nearly two-thirds of the testing time” (OECD 2003, p. 13). As several authors note, at approximately 2 minutes per item this means that the special focus area has about 40 questions, and that the total runs to about 60 items. The universe of items is, however, much larger, and the items are organised in 14 different booklets distributed to equal proportions of the sample.
Nearly all these tests have in common the fact that, in addition to the test document, the students also answer a questionnaire charting the context of the education. In addition, several forms are to be answered by e.g. teachers, principals, municipal authorities, etc. in order to allow generalising to the context of education.
In each participating country a national PISA office is set up, spending up to two years establishing the national sample, setting up processes for administering the tests, and functioning as local quality-assurance officers. It seems to be a general trait that these national offices also function as the chief PISA interpreters in their country, often undertaking not only publication but also additional research in order to enlarge the PISA impact.
The reliability issue

Starting from a textbook definition, this issue concerns how random errors can influence results. One should differentiate between random and systematic errors in all evaluations of measurement; the latter belong to the issue of validity. The importance of reliability is that, in this respect, it constitutes a precondition for validity. Metaphorically, this can be illustrated by how noise is able to destroy a musical experience – how PISA measures is a precondition for discussing what it claims to have found.
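The noise metaphor can be made concrete with a toy simulation (illustrative numbers, not PISA data): the more random error a test score contains, the weaker any genuine relationship it can reveal, which is the classical sense in which reliability caps validity.

```python
import random

random.seed(7)

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

n = 5000
true_ability = [random.gauss(0, 1) for _ in range(n)]
# Some criterion the test score is supposed to tell us about.
criterion = [t + random.gauss(0, 0.5) for t in true_ability]

correlations = []
for noise_sd in (0.0, 1.0, 3.0):  # growing measurement error = falling reliability
    observed = [t + random.gauss(0, noise_sd) for t in true_ability]
    correlations.append(pearson(observed, criterion))

print([round(r, 2) for r in correlations])  # strictly decreasing
```

However well the test items are chosen, a noisy score cannot correlate strongly with anything; that is why the reliability discussion has to precede the validity discussion.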
Random variation among 450,000–500,000 15-year-olds can come from innumerable sources, and it is an important discussion in itself to assess which differences should be controlled for and which not. Neither the research community nor PISA has undertaken any systematic discussion of how real-world differences of such a magnitude can be controlled for – even though such a theory is fundamental for explaining differences in score. One example: so far, I have not found any mention of the influence resulting from the substantial variations in the number of instructional hours 15-year-old pupils will have received.
What I have found in the survey of European research articles on PISA reliability are three arguments, two concerning sample quality and one concerning the cultural bias of items.

Of the two arguments relating to sample quality, one is a minor issue and concerns the representativeness of the PISA sample. The argument runs
that when many schools decline to partake and substitute schools are recruited, one cannot be sure that the properties of the substitute schools are equal to those of the schools that declined. Although a pertinent objection, this problem will probably disappear if PISA’s influence keeps mounting.
However, I find the second argument to be rather an important one, as it concerns what hides under the PISA assertion that PISA tests the competence of 15-year-olds because “this is normally close to the end of the initial period of basic schooling in which all young people follow a broadly common curriculum”.2 The critics point to this being a simplification for two reasons. First, whether a given grade actually contains pupils aged 15 is an empirical question, and there is every reason to believe that the answer will vary between countries. S. J. Prais, who first launched this argument, formulates it thus: “Perhaps most pupils were in classes for mainly 15-year-olds; others had repeated a class and – though aged 15 – were in classes for mainly 14-year-olds, others in classes for mainly 13-year-olds; and a few had ‘skipped’ and were in classes for 16-year-olds. Often (France, Germany, Switzerland . . . ) by the age of 15 hardly more than half of pupils may be in a class for pupils of that age.”3 (Prais 2004, p. 571)
Another aspect of the same fact is that when one compares 57 countries, some of these countries will not have all pupils aged 15 in class – some have already dropped out, for instance to work. The relevant issue for a discussion of reliability is whether those who drop out have the same academic proficiency as those who stay in school. Since the case for home background being decisive for academic achievement is so well researched that it would be trite to cite evidence, one may well assume that the pupils quitting before the age of 15 are unequal to the ones remaining in school. This leads to the conclusion that, in a PISA context, some nations gain from the fact that as much as 60 % of a cohort have left school before the age of 15.
2 Actually, this is not completely precise. In the PISA technical report it says: “The 15-year-old official target was slightly adapted to better fit the age structure of most of the northern hemisphere countries. As the majority of the testing was planned to occur in April, the international target population was consequently defined as all students aged from 15 years and 3 (completed) months to 16 years and 2 (completed) months at the beginning of the assessment period.” (OECD: PISA 2000 Technical Report, p. 46)

3 As an illustration, Bodin, referring to PISA 2000, states: “That leads to 59,1 % of the French students who took the test were in high school grades, at grade 10, or for a few of them at grade 11” (Bodin 2005, p. 4). Another illustration would be Norway, where about 95 % were in grade 9 (out of 10) at age 15, as they started school at age 7.
One researcher who has argued this is Wuttke (2006), with regard to PISA 2003. He starts by asking how representative PISA is, and his answer is that the claim to representativeness has far too little basis in actual numbers. This conclusion is based (as with Prais) on an appraisal of school attendance among 15-year-olds. Wuttke points to Turkey, where school attendance for 15-year-olds is a meagre 54 %, and to Mexico, where it is 58 %. Even within OECD countries this is a problem: Wuttke refers to Portugal, where 5 % of the sample left school between the time they were recruited and the time the test was administered. As one cannot assume that the drop-out rate is randomly distributed, he draws the conclusion that for PISA “it becomes a measure of success that the weaker pupils have dropped prematurely out of school” (Wuttke 2006, p. 105).
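The direction of this bias is easy to demonstrate with a toy calculation (the score distribution is hypothetical, not PISA data): if attrition removes pupils mainly from the weak end of a cohort, the mean of those still in school rises with the drop-out rate, although nobody has learned anything more.

```python
import random

random.seed(3)

def mean(xs):
    return sum(xs) / len(xs)

# A hypothetical national cohort of 10,000 15-year-olds, scores on a
# PISA-like scale (mean 500, sd 100), sorted from weakest to strongest.
cohort = sorted(random.gauss(500, 100) for _ in range(10_000))

results = {}
for dropout_rate in (0.05, 0.46, 0.60):  # e.g. Portugal-, Turkey-, extreme-level attrition
    cut = int(len(cohort) * dropout_rate)
    enrolled = cohort[cut:]              # assume drop-outs come from the weak end
    results[dropout_rate] = mean(enrolled)

print(round(mean(cohort)), {r: round(m) for r, m in results.items()})
```

Under these assumptions the enrolled-only mean climbs by roughly 10 points at 5 % attrition and by nearly 100 points at 60 %, which is the sense in which a country can “gain” from early school leaving.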
In addition to this, there is the issue of the representativeness of the national samples. According to PISA procedure, sampling is organised so that one first recruits schools, and then pupils within those schools. Under these circumstances it is vital to have documentation of the relative size of the sample from each school. The importance of this record is that if such a list is present, one may adjust for unwanted differences; for example, such a list opens the possibility of statistically weighting the samples from particular schools, for example because of small school size. On surveying the underlying material,4 Wuttke concludes that “this documentation is lacking from far too many countries”. The examples he provides to illustrate the point are taken from Greece, where all pupils had to be given equal weight, as there was no information on school size at all, while the participation rate was 102.5 % for Sweden and 107 % for Tuscany (ibid., p. 106).
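A minimal sketch of the two-stage weighting that such documentation would support (the enrolment and score figures below are hypothetical): each sampled pupil stands in for `enrolled / sampled` pupils of his or her school, so without enrolment figures, as in the Greek case above, no such weights can be computed and every pupil counts equally.

```python
# Hypothetical schools: (15-year-olds enrolled, pupils sampled, sample mean score)
schools = [
    (400, 35, 520),
    (120, 35, 480),
    (60, 30, 450),
]

def weighted_mean(schools):
    """Each sampled pupil represents enrolled/sampled pupils of the cohort."""
    num = den = 0.0
    for enrolled, sampled, mean_score in schools:
        weight = enrolled / sampled
        num += weight * sampled * mean_score   # = enrolled * mean_score
        den += weight * sampled                # = enrolled
    return num / den

# The unweighted pupil average over-represents small schools relative to the cohort.
unweighted = sum(s * m for _, s, m in schools) / sum(s for _, s, _ in schools)

print(round(weighted_mean(schools), 1), round(unweighted, 1))  # 504.5 485.0
```

With these figures the two estimates differ by almost 20 score points, which is why missing school-size records are more than a bookkeeping lapse.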
Another set of criticisms relating to sample quality has to do with students whose results cannot be counted like those of other students, typically because they are handicapped. Wuttke asserts that the PISA report (OECD 2005, p. 183 ff.) leaves the definition of handicaps up to the national committees. He refers to how the exemption rate within the OECD varies from 0.7 % for Turkey to 7.3 % for Spain and the US (OECD 2005, p. 169). In addition, Denmark, New Zealand and Canada transgress the 5 % limit (OECD 2005, p. 241 ff.), but this has not had any consequence – data from all these countries are presented at face value (ibid., p. 106). The conclusion that one is comparing apples and oranges is near at hand.
4 Wuttke refers to OECD 2005, p. 108 for an explanation of this, and he develops the argument in some detail.
Wording as a reliability issue

The reliability of PISA is affected by how the questions are worded, and the PISA technical report explains the logic of how this is handled and the thoroughness with which it has been looked into. Hemmingsen (2005) introduces a question not covered by PISA when she asks whether the wording lives up to its promise of measuring the life skills of pupils. Her assessment is that by and large it does not, and it cannot do so. Using some of the PISA items (Going hand in hand/Growing up/Semmelweis) as examples, she demonstrates how PISA items are constructed just like “school-test” items, how their difficulty is affected by their wording, and how they relate to contexts that cannot be equally known to all students. The proof of the pudding, as it were, for Hemmingsen is that PISA is just as susceptible to “test-wiseness” as ordinary tests (Hemmingsen 2005, p. 41).
The importance of these objections is not that they reveal any neglect on the part of PISA. Rather the contrary – PISA has done more than any other similar undertaking in trying to establish a discussion of how reliability issues can be met. The importance of these objections is rather that they raise the issue of how fruitful the ambition of attempting to control for real-world differences on a multinational scale can be. In fact, a point of criticism might be that without a theory of which differences can be accounted for and how such control can be established, the undertaking of comparing the complex realities of 57 nations along one scale will appear high-handed.
Summing up the methodological objections raised by researchers, it appears that a methodological discussion should be encouraged to a larger degree, and that PISA itself has a central role in establishing such a discussion. It is not only legitimate but even vital that research should try to influence public debate on the quality of education, but one must not transgress the limits granted by one’s tools. This is particularly so if the agenda is to contribute to greater accountability in education. The verdict of the research community raises grave questions about whether PISA transgresses such limits.
The validity issue

From a textbook definition, the issue of validity is an issue of inference quality.5 What is the basis of the conclusions drawn? In a PISA context, one relevant definition of validity is put this way: “A total judgment rests on an holistic assessment of whether the empirical evidence and the theoretical framework form a sufficient basis to justify the actions and the consequences that are drawn from the test scores” (Jablonka 2005, p. 157). This definition goes to the core of what quality in a test like PISA is about: does its impact rest on a solid basis of both theory and data? It is only from this perspective that a lack of reliability finds its true importance. The issue of validity goes beyond systematic errors in the sense that errors can also accrue from a lack of theory, as well as from the quality of the cohesion between theory and data.

5 Shadish, Cook and Campbell use this definition: “We use the term validity to refer to the approximate truth of an inference . . . Validity is a property of inferences.” (2002, p. 34)
The main issues of validity raised by European researchers can be summed up in three arguments: the issue of cultural bias, the issue of scaling, and the issue of how to interpret PISA scores. As the issue of scaling is covered comprehensively elsewhere in this volume, I will focus on the other two.
Cultural bias as a validity issue

This argument was heard in the reactions to PISA 2000 from Italian (Nardi 2002), Swiss (Bain 2003) and French (Bodin 2005) sources, as well as from German ones relating to PISA 2003. The main argument challenges whether the real-world ambition of PISA refers to a world shared by all, and even the concept of “real-world” competence is argued to be an Anglophone preference. This is a validity issue in two respects. First, it argues that pupils from different countries will have systematically different chances to perform equally well. Secondly, it argues that the more successful PISA is, the less one will be able to see cultural differences as an asset, and diversity as a tool for improvement. This last argument raises the issue of whether the one-dimensional scale of a sum score is a valid standard for comparing nations – will it prove legitimate when the needs of globalisation put new challenges on the agenda?
Some of these arguments can be contested. When Nardi (2002) doubts whether the methodology can be correct when four of the six best countries are Anglophone (the exceptions being Korea and Finland), or when Jablonka (2006) finds that, of a total of 54 mathematics questions in PISA 2003, 13 come from Holland, 15 from Australia and 7 from Canada, while the rest stem from 9 different countries (ibid., p. 167), one can still argue, from PISA’s side, that as long as it can be documented that the questions are equally well understood everywhere, this argument is effectively controlled for.6
6 The problem resurfaces within PISA’s context as the question of why some countries have a weak item
The issue of cultural differences is raised in a more radical sense by Bodin, whose argument concerns the quality of teaching as a precondition for PISA performance. He observes that “the differences are more important in favour of the Finnish students for the more ‘realistic’ items, and that the difference tends to turn in favour of the French students for more abstract or formal questions” (Bodin 2005, p. 8). He uses the bookshelves question to lament: “This question, along with many others, points to the weak stress given by PISA to the proof undertakings. Even explaining and justifying are not much valued by the PISA marking scheme. That makes a big difference with the . . . French conception of mathematical achievement” (2005, p. 12). Here Bodin illustrates how culture defines relevance: as a French mathematician he is proud of his country’s tradition in mathematics.
On the contrary I think, and all the work of the so-called French didactics school has helped me, that ruptures are necessary and constitutive to learning. So we may fear that putting too much stress on real life and actual situations may in return have some negative effects. (2005, p. 13)
Would Europe become intellectually richer if this pride vanished in the face of PISA results? Is it not rather the opposite: that around the next corner, the French style of mathematical reasoning might not only be vindicated, but prove an asset to all? Cultural diversity makes for sustainable development. The choice of an approach that ends up treating cultural diversity as measurement error makes the very undertaking of comparison repressive.
A special case of this argument is related to the PISA strategy for measuring reading skills.

Bain (2003) queries whether the conceptual framework for the reading tests is adequate, and also in what respect PISA can improve teaching. The argument Bain raises is whether reading is so complex a skill that it cannot be validly tested within the restrictions of the PISA test format. His criticism of the PISA conceptual framework is firstly linked to the fact that at the time
(PISA 2000) he finds that the theory used for understanding reading literacy is too empirical: "the test given can not verify the validity of a model but relies on a model to emerge from the facts" (ibid., p. 61), a situation which is aggravated by the fact that it is a restricted understanding of reading skills that is tested, in the sense that "of course one may agree that the pupils read texts that are about situations, but the situation they are read in is a typical school-situation". This situation disfavours the weak pupils; only the clever pupils will in this situation be able to recreate the use the author intended for the text (ibid., p. 64). Bain proceeds to argue that the mastery of different genres cannot be tested within the restrictive test format (ibid., p. 66).

6 (cont.) statistic. This is reported as affecting 12 countries (the Basque Country, Brazil, Indonesia, Japan, Macao-China, Mexico, Thailand and Tunisia, as well as, to a lesser extent, Hong Kong-China, Serbia and Turkey). The explanations offered by PISA (the items may have discriminated differently in different countries, there may be concerns about linguistic and cultural equivalence, or the translators recruited may simply not have been well enough equipped for the job) actually strengthen the argument that cultural bias is and must be present in such tests (PISA 2003 Technical Report, p. 79).
Interpreting PISA results – dressing up the will to govern
An important validity issue is PISA’s choice to develop the results into an "international league table", thus opening for the comparison of national results and explicitly discussing how these results can be improved. In PISA this is linked to the organisation and prioritisation of research focus areas, discussed in chapter 3 of the technical framework.
These focus areas are based on the OECD education indicators (INES) and organise data along two dimensions: firstly, data are interpreted by the level of the education system they originate from; secondly, the indicators PISA produces are seen as outcomes or outputs, contexts or constraints. There are four levels to which the resulting indicators relate. These are specified thus: "The education system as a whole, the educational institutions and providers of educational service, the instructional setting and the learning environment within the institution and the individuals participating in the learning activities". Each of these levels is studied in three aspects: with respect to "outputs and outcomes of education and learning", "policy levers and contexts" (circumstances that shape the outputs and outcomes at each level) and "antecedents and constraints" (factors that define or constrain policy) (OECD 2003 Technical Report, p. 35). Organised as a matrix, this gives 12 cells of PISA focus areas, in which educational outputs can be identified on four levels, alongside policy levers and contexts, and antecedents and constraints. Such a matrix is presented in the technical framework, specifying which variables are used to produce the indicators presented in each cell.
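The 4 × 3 organisation described above can be written out as a simple cross-product. The labels below are paraphrased from the framework as quoted here; the empty cells merely illustrate the structure, not the actual indicator assignments:

```python
# Sketch of the INES-derived matrix of PISA focus areas: four levels of the
# education system crossed with three aspects gives the 12 cells mentioned
# in the text. Cell contents are left empty; this only shows the structure.
levels = [
    "education system as a whole",
    "educational institutions and providers",
    "instructional setting and learning environment",
    "individuals participating in learning activities",
]
aspects = [
    "outputs and outcomes",
    "policy levers and contexts",
    "antecedents and constraints",
]
matrix = {(level, aspect): None for level in levels for aspect in aspects}
print(len(matrix))  # 4 levels x 3 aspects = 12 cells
```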
The way the PISA focus areas are organised raises the issue of how one can interpret data from the "international league table", and one question in particular is important: does this framework lead PISA to offer opinions on educational matters beyond those for which it has an adequate basis?
PISA itself acknowledges a first objection to this framework: the problem of recursivity and complexity among levels. The PISA example is that at the classroom level the relation between student achievement and class size is negative, while at the class or school level the relation is positive – the explanation being that students are often intentionally grouped so that the weaker students are placed in smaller classes. PISA sees this as an example that "a differentiation between levels is not only important with regard to the collection of information, but also because many features of the education system play out quite differently at different levels of the system" (ibid., p. 35). This point really should be underlined, and it can be shown to be present in many aspects of the educational system. One example often used in the literature on causality in education is the interplay between teacher and class. Carroll’s (1963) model of class learning as a function of student level and time spent proved overly simplistic because it left no room for the interplay between class and teacher: an equal amount of time and teacher effort will produce widely different results as the students’ attitude to learning differs.⁷
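The class-size example can be made concrete with a small simulation. The numbers below are hypothetical, not PISA data: within the model, each extra classmate lowers achievement, but because weaker students are assigned to smaller classes, the raw association between class size and achievement across students comes out positive. Only conditioning on ability recovers the negative effect:

```python
# Illustrative simulation of a relation reversing between levels of analysis.
# Causal model (by construction): each extra classmate lowers achievement.
# Selection (by construction): weaker students end up in smaller classes.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
ability = rng.normal(0.0, 1.0, n)
class_size = 20.0 + 5.0 * ability + rng.normal(0.0, 1.0, n)
achievement = 10.0 * ability - 1.0 * class_size + rng.normal(0.0, 1.0, n)

# Naive aggregate association: regress achievement on class size alone.
naive_slope = np.polyfit(class_size, achievement, 1)[0]

# Conditioning on ability recovers the (negative) causal coefficient.
X = np.column_stack([class_size, ability, np.ones(n)])
adjusted_slope = np.linalg.lstsq(X, achievement, rcond=None)[0][0]

print(f"naive slope: {naive_slope:+.2f}")        # positive
print(f"adjusted slope: {adjusted_slope:+.2f}")  # negative, near -1
```

The sign reversal is purely a consequence of how students are sorted into classes, which is exactly why conclusions drawn at one level of aggregation need not hold at another.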
That the interpretative matrix is riddled with difficulties is also illustrated by what PISA terms "antecedents and constraints", which are described as follows: "Policy levers and contexts typically have antecedents, that is, factors that define or constrain policy. These are usually specific for a given level of the educational system, and antecedents at a lower level may well be policy levers at a higher level (e.g. for teachers and students in a school, teacher qualifications are a given constant, while at the level of the education system, professional development of teachers is a key policy lever)" (ibid., p. 35). How is this a validity problem? What PISA does not say, but should have advised, is that the cultural traditions of different countries, which often constitute contexts as well as constraints, cannot be discussed out of context. Particularly when data are aggregated, it is of the utmost importance to present them contextualised. James S. Coleman (1990) argues that it is in principle logically invalid to deduce principles of government and management from aggregated macro-level data if such data lack substantial contextualisation. If this is omitted, two problems arise: the first is that one lacks "reality checks" and is led to opine beyond the realistic. The second, which can be seen as a correlate of this, is that the
7 In fact, it is almost 30 years since Cronbach suggested that most causal links in education might better be understood as interactions: relations linked in such a way that causality can only be understood as probable, and where the direction of causality might change.
236 GJERT LANGFELDT<br />
interpretation of results becomes more difficult (e.g. as of today, no real explanation for the Finnish excellence in PISA has been presented).
A third argument – and once again the real-world differences crop up – must also be mentioned here, concerning the levels of the educational system into which PISA organises data. Two of the PISA levels are intuitively understandable: the student level and the level of the classroom, that is, where education as a social interaction occurs. The last two levels – the educational institutions and providers of educational services, and the education system as a whole – seem to be introduced in order to differentiate between nations where schools are run by the government (and where the institution owner is the state) and nations where schools are run by a number of organisers (churches, NGOs, local communities). It is only for such settings that a differentiation between institution owner and system as a whole is appropriate. Two questions must be asked: is such a way of organising the levels appropriate, and in the PISA case, is it used sensibly?
The first question has been addressed in a recent paper (Afzar 2007). She insists that the PISA framework is based on the theoretical assumption of a linear administrative chain of steering. This chain runs from the political level via the political body of the school owner, through the instructional setting organised within each school, to individual learning. In addition, she argues, behind the principle of aggregation of data lies an action-theoretical approach reminiscent of methodological individualism. Afzar rejects the notion of a linear chain and argues that when one tries to grasp education as a system, one must use a theoretically legitimised approach: a model which can also explain the complexity of the relations between the different levels of the educational system. She ventures to introduce one such model, whose importance in this context is that it allows the function of the school to be seen as different for the individual student than for society at large. The Afzar model, interesting as it may be, is not of relevance here, but it serves to highlight in what respects the model used by PISA is legitimate.
Does PISA use its own framework sensibly? PISA 2003 did not study teachers, nor did it have intact classrooms as sampling units. The level of instructional settings is therefore empty with regard to outputs and outcomes. What data it contains are data on students’ learning – learning that happens in different instructional settings, settings which PISA does not explain. A similar argument applies to the institutional level, where data are either aggregated from the individual level or "synthesised" across institutional settings, as these are not identified as such.
At the systems level, PISA relies only on aggregates. In particular, the PISA systems outcome consists not only of aggregated individual data, but, for policy levers and contexts, also of system-level aggregates (notably from the instructional-setting level), and the same goes for the antecedents-and-constraints column at the systems level, albeit there unspecified OECD data are also indicated as source material. The question is: can student achievement be aggregated through such levels and still contribute to a meaningful interpretation of "national performance"?
PISA Inc.
The title of this argument is taken from a German book on PISA (Jahnke et al. 2006), and it concerns the nature of PISA as an enterprise. Here only two arguments will be discussed. The first is whether PISA itself, as a tool for accountability, is transparent: the aim of the international league table is to influence educational policies. Does PISA aim to handle this influence in a democratic way, or is PISA just another brand-building enterprise, whose aim is to exert as much influence as possible, without caring to account for whether this influence is justified?
This argument has been explicitly addressed by Howie & Plomp (2005), who argue that even though PISA is intended to have policy implications, no study has been undertaken to systematically chart how PISA affects educational policies. They refer to Kellaghan (1996) as stating that most accounts of the use of findings appear to be "limited and impressionistic", that "detailed analyses are not available" and that "the way policy makers arrive at their conclusions is also little known". Howie & Plomp concur and add that, albeit nearly a decade later, this still appears to be the case. This may be due in part to the difficulty researchers have in gaining access to the policymakers’ realm, as well as to a lack of funding for impact studies: "In fact whilst government are prepared often to fund data collection and the initial descriptive reports, little funding is offered for secondary analyses of the same data let alone an impact study of the release of such a rich source of data nationally or internationally" (2005, p. 93).
The second argument is that the allegations of lacking transparency seem to hold true even when looking inside how PISA is organised. Mogens Niss, a Danish member of the PISA expert group in mathematics, touches on this in an interview given in 2005. He does not agree that "leading experts in the field is a guarantee for PISA quality", the reason being that "there is no one within PISA who keeps tabs on things. It is like the Internet: there is no all-controlling central brain in PISA". He describes the development of mathematical literacy as follows: the expert group was to clarify a description of the framework, a job that was organised as an analytical developmental process, not a research process, and whose results were also shaped by the PISA governing board, which is constituted by officials from ministries in the participating countries. The PISA questions are shaped jointly by the expert groups, the OECD secretariat, the governing board and the international consortium, and Professor Niss concludes that PISA is not a clear-cut object: it is a mixture of research and development work, influenced by needs for comparison and by politics.
What Niss does not mention is that PISA is developed largely by enterprises that either depend on earning part of their income in the market (the Australian Council for Educational Research, or the Educational Testing Service (USA)) or live wholly off the market, as exemplified by Westat (a US company) or Citogroep (a Dutch company).⁸ The problem with such a way of organising is that when companies with a vested interest in the success of PISA are to advise governments on educational policy, one cannot know whether the advice is biased or not.⁹
It may be relevant to mention that the costs of PISA participation are not easily come by. For most of the IEA tests, however, the price is USD 30,000 per country per year, or USD 120,000 for a full four-year cycle. Most of these tests are, however, sponsored by Ford, which is how
8 A complete listing is available in the PISA 2003 Technical Framework, appendix 2.
9 This ambiguity is apparent in the chapter on data adjudication in the 2003 Technical Report. Using the USA as an example: this country not only failed to meet the required school response rate (68.12 % after replacement), it also broke the PISA test timing window and had too high an overall exclusion rate (7.28 %). After an evaluation (for which no sources are given), it is concluded that the US data will be included in the full range of PISA reports. Another country plagued with grave problems is the United Kingdom, where the technical report concludes that "the uncertainty surrounding the sample and its bias are such that PISA 2003 scores for the UK cannot be reliably compared with other countries", or with PISA 2000. The conclusion is still that all international averages and aggregate statistics include the data from the UK. There are further apparent anomalies: Mexico, where only 58 % of the classmates are in school and the coverage of the national 15-year-old population was only 49 %; Spain, where the pupil exclusion rate was about 50 % above PISA standards; Turkey, where the coverage of 15-year-olds was at 36 %. They are all included in the full range of PISA 2003 reports. Is this because it is scientifically sound, or because another ruling would be bad for PISA Inc.?
the price can be so low. No data are published on the financing of the OECD tests.

It is fair to conclude that PISA has huge unresolved issues concerning the way it is used. There seems to be an imbalance between the tools created and the eagerness to influence politics. In the long run this is detrimental to the very cause PISA seeks to promote: a sensible approach to measuring the human capital generated in the member countries.

Particular notice should be paid to PISA’s relation to private enterprise, so that one does not build a capacity for test-making that goes far beyond what such tests can sensibly be used for.
References
Afzar, A. (2007). A systems theoretical critique of international comparisons. Paper presented at the 2007 AERA Convention.

Bain, D. (2003). PISA et la lecture: un point de vue didacticien. Schweizerische Zeitschrift für Bildungswissenschaften, vol. 25, no. 1, pp. 59-78.

Bender, P. (2006). Was sagen uns PISA & Co, wenn wir uns auf sie einlassen? In Jahnke, T. and Meyerhöfer, W. (eds.), PISA & Co: Kritik eines Programms. Hildesheim: Verlag Franzbecker.

Bodin, A. (2005). What does PISA really assess? What it doesn’t. A French view. Report prepared for the joint Finnish-French conference "Teaching mathematics: beyond the PISA survey", Paris.

Folkeskolen, 8 April 2005: "PISA – Der er ingen der har styr på det hele. Et sammensurium af forskning, test og politik, siger Mogens Niss fra PISAs ekspertgruppe i matematik" ["PISA – nobody has an overview of the whole thing. A hotchpotch of research, testing and politics, says Mogens Niss of PISA’s expert group in mathematics"]. http://www.folkeskolen.dk/objectShow.aspx?ObjectId=33661

Hemmingsen, I. (2005). Et kritisk blik på opgaverne i PISA med særlig vekt på matematikk [A critical look at the PISA items, with particular emphasis on mathematics]. MONA, 2005, no. 1, pp. 24-43.

Howie, S. and Plomp, T. (2005). International comparative studies of education and large-scale change. In Bascia, N., Cumming, A., Datnow, A. and Leithwood, K. (eds.), International Handbook of Educational Policy. Dordrecht: Springer.

Jablonka, E. (2006). Mathematical literacy: Die Verflüchtigung eines ambitionierten Testkonstrukts in bedeutungslose PISA-Punkte. In Jahnke, T. and Meyerhöfer, W. (eds.), PISA & Co: Kritik eines Programms. Hildesheim: Verlag Franzbecker.

Jahnke, T. and Meyerhöfer, W. (eds.) (2006). PISA & Co: Kritik eines Programms. Hildesheim: Verlag Franzbecker.

Jahnke, T. (2006). Zur Ideologie von PISA & Co. In Jahnke, T. and Meyerhöfer, W. (eds.), PISA & Co: Kritik eines Programms. Hildesheim: Verlag Franzbecker.

OECD (2003). The PISA 2003 Assessment Framework. Paris: OECD.

OECD (2002). School Sampling Preparation Manual: PISA 2003 Main Study, version one, 2002.

OECD (2002). Programme for International Student Assessment: sample tasks from the PISA 2000 assessment of reading, mathematical and scientific literacy.

OECD. PISA 2003 Technical Report. Paris: OECD.

Prais, S. J. (2003). Cautions on OECD’s recent educational survey (PISA). Oxford Review of Education, vol. 29, no. 2, pp. 139-163.

Prais, S. J. (2004). Cautions on OECD’s recent educational survey (PISA): rejoinder to OECD’s response. Oxford Review of Education, vol. 30, no. 4, December 2004.

Romainville, M. (2002). L’enquête O.C.D.E. sur les acquis des élèves en débat. La Revue Nouvelle, vol. 115, no. 3-4, pp. 84-108.

Shadish, W., Cook, T. and Campbell, D. (2003). Experimental and Quasi-Experimental Designs for Causal Inference. Boston: Houghton Mifflin Company.

The French Ministry of Education (2002). The meetings of Desco: "Evaluation of the knowledge and skills of 15-year-old pupils: questions and hypotheses formulated following the OECD study". Contains: Gaudemar, J.-P. (2002), opening of the conference debate; Crowne, S. (2002), the British case; Nardi, E. (2002), the Italian case; Koch, H. C. (2002), the German case; and Cytermann, J.-R. (2002), the French case.

Wuttke, J. (2006). Fehler, Verzerrungen, Unsicherheiten in der PISA-Auswertung. In Jahnke, T. and Meyerhöfer, W. (eds.), PISA & Co: Kritik eines Programms. Hildesheim: Verlag Franzbecker.
Uncertainties and Bias in PISA

Joachim Wuttke
Germany: Forschungszentrum Jülich – Munich

This is a summary of a detailed report (>100 pages, >100 references) that has appeared in German (Wuttke 2007). It will be shown that PISA’s statistical significance criteria are misleading, because several sources of systematic bias and uncertainty are quantitatively more important than the standard errors communicated in the official reports.
1 Introduction

1.1 A huge framework
PISA is a long-term project. Starting in 2000, assessments are carried out every three years. One and a half years of data processing are needed before an international report entitled "First Results" (FR00, FR03) appears, and it takes even longer until a Technical Report (TR00, TR03) is published and the raw data are made available for independent analysis. Therefore, although the third assessment was carried out in spring 2006, at present (summer 2007) only PISA 2000 and 2003 can be evaluated. In the following we will concentrate on data from PISA 2003.
PISA 2003 was carried out in 30 OECD countries and in some partner countries. As data from the latter were not used in the international calibration, they will be disregarded in the following. The United Kingdom (UK), which failed to meet several criteria required for participation, was excluded from tables in the official report. However, data from the UK were fully used in calibrating the international data set and in calculating OECD averages – an inconsistency that is left unexplained (TR03: 128, FR03: 31).
PISA rules required a minimum sample size of 4,500 students per country, except in very small countries (Iceland, Luxembourg), where all fifteen-year-old students were recruited. In several countries (Australia, Belgium, Canada, Italy, Mexico, Spain, Switzerland, UK), considerably larger samples of up to nearly 30,000 students (TR03: 168) were drawn, so that separate analyses for regions or linguistic communities became possible. For the comparison of the sixteen German Länder, an even larger sample of 44,580 students was tested (Prenzel et al. 2005: 392), of which, however, only 4,660 were contributed to the international sample (TR03: 168). The Kultusministerkonferenz, fearing unauthorised cross-Länder comparisons of school types, has imposed the deletion of Länder codes from public-use data files. Therefore, the inner-German comparison will not be considered further.
The bulk of PISA data comes from a three-hour student testing session; some more information is gathered from school principals. The testing session consists of a two-hour cognitive test and a third hour devoted to questionnaires. The main questionnaire enquires about the students’ social background, educational environment, and learning habits. The questionnaire responses certainly constitute a valuable resource for studying the living and learning conditions of fifteen-year-olds in large parts of the world, even though participation-rate gradients introduce some bias.
Compared to the rich empirical material obtained from the questionnaires, the outcome of the cognitive test is meagre: the official data analysis reduces it to just four scores per student, interpreted as "competences" in specific subject domains (reading, mathematics, science, problem-solving). Nevertheless, these results are at the origin of PISA’s political impact; communicated as "league tables" of national mean values, they made PISA known to the general public, causing an outright "shock" in some countries.

While controversy erupted about possible causes of results perceived as unsatisfactory, the three-digit precision of the underlying data has rarely been questioned. This will be done in the present paper: the accuracy and validity of the cognitive test results are reviewed from a statistical point of view.
1.2 A surprisingly simple measure of competence
As a first step of data reduction, student responses are digitally coded. The Technical Report discusses inter-coder and inter-country variance at length (TR03: 218-232); the conclusion that non-uniform coding is an important source of bias and uncertainty is left to the reader.
Some codes are kept secret because national authorities want to prevent certain analyses; in several multilingual countries the test language is kept secret. Except for such deletions, the international raw data set is available for download on the website of the OECD’s main contractor ACER (Australian Council for Educational Research).

On the lowest level of data aggregation, single-item response statistics (percentages of correct, incorrect, and invalid responses to one cognitive test item) can be generated. In the international report not even one such statistic is shown. PISA is decidedly not a study in Fachdidaktik (mathematics education, science education, etc.): it does not aim at gathering information about the understanding of scientific concepts or the mastery of specific mathematical techniques. The data provide almost no handle for understanding why students give incorrect responses. Only Luxembourg has scanned and published some student solutions to free-response items; these examples show that students sometimes simply misunderstood what the item writer meant to ask.
PISA is designed to be analysed on a much coarser level. As anticipated above, cognitive test results are aggregated into just four "competence" values per student. The determination of these values is technically complicated because not all students worked on the same item set: thirteen different booklets were used, and in some countries some items turned out to be invalid because of misprints, translation errors, or other problems. This makes it necessary to establish an "item difficulty" scale prior to the quantification of student competences. For this calibration an elementary version of item response theory is used.
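An "elementary version of item response theory" here means, in essence, a one-parameter logistic (Rasch-type) model, in which the probability of a correct response depends only on the difference between student competence and item difficulty. A minimal illustrative sketch, not PISA’s actual scaling software:

```python
# Rasch-type one-parameter logistic model: the chance of solving an item
# depends only on (competence theta) minus (item difficulty b).
import math

def p_correct(theta: float, b: float) -> float:
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# A student whose competence equals the item difficulty has a 50 % chance;
# easier items (lower b) are solved with higher probability.
print(p_correct(0.0, 0.0))   # 0.5
print(p_correct(0.0, -2.0))  # higher than 0.5
```

Calibration then amounts to placing all items on a common difficulty scale so that students who saw different booklets can still be compared.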
The importance of this theory tends to be overestimated by defenders and critics of PISA alike; misunderstandings are also provoked by poor documentation in the official reports. For a functional understanding of what PISA measures, it is not important that different booklets were used, and it is plainly irrelevant that in some countries certain items were deleted. Glossing over these technicalities, pretending that all students were assigned the same item set, and ignoring the probabilistic aspect of item response theory, it becomes apparent what the competence values actually measure: no more and no less than the number of correct responses.
In the mathematics subtest of PISA 2003, a student with a competence of 500 (the OECD mean) has solved about 46 % of the items assigned to him. A competence of 400 (one standard deviation below the mean) corresponds to a correct-response rate of 23 %; 600 corresponds to 71 % (Wuttke 2007: Fig. 4).

244 JOACHIM WUTTKE

Within this span the relationship between competence value and correct-response percentage is nearly linear. The slope is about 4 competence points per 1 % of assigned items. This conversion gives the competence scale a much simpler meaning than the official reports suggest.
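The near-linear relationship quoted above can be written down as a simple two-way conversion. This is only an illustrative sketch anchored at the figures given in the text (500 points ↔ 46 %, slope ≈ 4 points per percentage point); it is not an official PISA formula, and at the low end it deviates by a couple of percentage points from the tabulated value (the linear map gives 21 % at 400 points, against the reported 23 %).

```python
# Linear approximation of the PISA 2003 mathematics scale,
# anchored at 500 points ~ 46 % correct, slope ~ 4 points per 1 %.
# Reasonable only within roughly the 400-600 point span.

def competence_to_rate(competence):
    """Approximate percent-correct for a given competence value."""
    return 46.0 + (competence - 500.0) / 4.0

def rate_to_competence(rate):
    """Inverse mapping: competence value for a given percent-correct."""
    return 500.0 + 4.0 * (rate - 46.0)

print(competence_to_rate(600))   # → 71.0, matching the reported value
```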
1.3 League Tables and Stochastic Uncertainties
Any analysis of PISA data aims at statistical statements about populations. For instance, an elementary analysis of the cognitive test yields results like the following: German students have a mean mathematics competence of 503; the standard deviation is 103; the standard error of the mean is 3.3, and the standard error of the standard deviation is 1.8 (Prenzel et al. 2004: 70). In order to make sense of such numbers, they need to be put into context. The PISA reports provide two kinds of interpretation guidance: verbal descriptions of “proficiency levels” give a rough idea of what competence differences of 60 or more points signify (see below), and comparisons between different populations insinuate that even differences of only a few points bear a message.
Since the assessment of competences within each of the four subject domains is strictly one-dimensional, any inter-population comparison implies a ranking. This explains the primordial role of league tables in PISA: they are not only a vehicle for gaining media attention, but are deeply rooted in the conception of the study (cf. Bottani/Vrignaud 2005). In the official reports almost all statistics are communicated in the form of country league tables. The ranks in these tables, especially low ranks (and every country has low ranks in some tables), are then easily turned into political messages. In this way PISA results can be interpreted without any understanding of what has actually been measured.
Of course, not all rank differences are statistically significant. This is duly noted in the official reports. For all statistics, standard errors are calculated. After processing these standard errors through a null-hypothesis testing machinery, some mean value differences are judged significant, while others are not. Complicated tables (FR03: 59, 71, 81, 88, 92, 281, 294) indicate which differences of competence means are significant and which are not. It turns out that in some cases 9 points are “sufficient to say with confidence that the higher performance by sampled students in one country holds for the entire population of enrolled 15-year-olds” (FR03: 93).
This accuracy is formidable when compared to the intra-country spread of test performances.

UNCERTAINTIES AND BIAS IN PISA 245

The standard deviation of the competence distribution is 100 points in the OECD country average and not much smaller within single nations. This is an order of magnitude more than an inter-country difference of 9 points. Figure 1 illustrates the situation.

Figure 1: Two Gaussian distributions with mean values differing by 9 % of their standard deviation. Such a small difference between two populations is considered significant in PISA.
However, significant does not mean valid, let alone relevant. Statistical significance is achieved by nothing more than the law of large numbers. The standard errors on which the significance criteria are based only account for two specific sources of stochastic uncertainty: the student sampling and the item-response modeling of student behaviour. By testing more and more students on more and more items, these uncertainties can be made arbitrarily small. At some point, however, this effort becomes inefficient because the validity of the study remains limited by non-stochastic sources of bias and uncertainty, which do not decrease with increasing sample size.

Before entering into details, the likelihood of non-stochastic bias will be made plausible by a simple estimate: to bring about a significant inter-country difference of 9 points, correct-response rates must differ by about 2 % of given responses. On average, a student is assigned 26 mathematics items. Hence, 9 points correspond to no more than half a correct response per student. This suggests that little systematic error is needed to distort test results far beyond their nominal standard errors.
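The back-of-the-envelope estimate above can be reproduced in a few lines, using the two figures quoted earlier: a slope of roughly 4 competence points per percentage point of correct responses, and an average of 26 assigned mathematics items per student. A minimal sketch, not part of the original analysis:

```python
# How many correct responses per student correspond to a "significant"
# inter-country difference of 9 competence points?

POINTS_PER_PERCENT = 4.0   # slope quoted in Sect. 1.2
ITEMS_PER_STUDENT = 26     # average number of assigned mathematics items

def responses_for_difference(points):
    """Correct responses per student behind a given mean difference."""
    percent = points / POINTS_PER_PERCENT          # ~2.25 % of responses
    return ITEMS_PER_STUDENT * percent / 100.0     # ~0.6 responses

print(responses_for_difference(9))   # roughly half a correct response
```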
In this paper, I will argue that PISA does indeed suffer from severe non-stochastic limitations, and that the large sample sizes are therefore uneconomic. Part 2 describes disparities in student sampling, Part 3 shows that the projection of cognitive test results onto a one-dimensional “competence” scale is neither technically convincing nor culturally fair, and Part 4 raises certain objections on the conceptual level.
2 Sampling disparities

In some countries it is clear from the outset that PISA cannot be representative (Sect. 2.1). But even in countries where school is obligatory beyond the age of fifteen, low participation rates are likely to introduce some bias. Several imperfections and inconsistencies of the international sample are well documented in the Technical Report. Participation rate requirements were not strict enough to prevent significant bias, and violations of predefined rules had no consequences.
2.1 Target population does not serve study objective

PISA claims to measure “outcomes of education systems in terms of student achievements”. This claim is not consistent with the choice of the target population, namely “15-year-olds enrolled full-time in educational institutions”. In some countries (Mexico, Turkey, several partner countries), enrollment is less than 60 %. Obviously, PISA says nothing about the outcome of the education system of these countries.

On the other hand, in many countries school is obligatory beyond the age of 15. At fifteen, the ability for abstract reasoning is still in full development. PISA therefore systematically underestimates the abilities students have “near the end of compulsory schooling” (FR03: 3, 298; TR03: 46).
2.2 Target population too loosely defined: unequal exclusions

Rules allowed countries to exclude up to 5 % of the target population: up to 0.5 % for organizational reasons and up to 4.5 % for intellectual or functional disabilities or limited language proficiency. Exclusions for intellectual disability depended on “the professional opinion of the school principal, or by other qualified staff” – a completely uncontrollable source of uncertainty. From the fine print in the Technical Report, it appears that some countries defined additional criteria: Denmark, Finland, Ireland, Poland, and Spain excluded students with dyslexia; Denmark also excluded students with dyscalculia; Luxembourg excluded recently immigrated students (TR03: 47, 65, 169, 183).

Actual student exclusion rates of the OECD countries varied from 0.7 % to 7.3 %. Canada, Denmark, New Zealand, Spain, and the USA exceeded the 5 % limit. Nevertheless, data from these countries were fully included in all analyses.
For a first-order estimate of the impact caused by the unequal use of student exclusions, let us approximate the competence distribution in every single country by a Gaussian with standard deviation 100, and let us assume for a moment that countries exclude with perfect precision the least competent students. Under these assumptions, exclusion of the weakest 0.7 % increases the country’s mean by 2.0 points and reduces its standard deviation by 2.5 points, whereas exclusion of 7.3 % increases the mean by 15.0 and reduces the standard deviation by 12.8. Of course, exclusion criteria are only correlates of potential test achievement, and they are never applied with perfect precision. When a probabilistic cut-off, spread over a range of 100 points, is used to model soft exclusion criteria, the bias in the two countries’ competence mean difference is reduced to about half of the initial 13 points.
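The hard-cut-off part of this estimate follows from the standard formula for a left-truncated normal distribution: if the weakest fraction p is removed, the mean of the remainder rises by σ·φ(z)/(1−p), where z is the standard-normal quantile of p. A sketch of the check, using Python’s `statistics.NormalDist` (the calculation is mine, not from the chapter):

```python
from statistics import NormalDist

def exclusion_bias(p_excluded, sd=100.0):
    """Mean increase (in score points) when the weakest fraction
    p_excluded of a Gaussian population is excluded exactly."""
    nd = NormalDist()                  # standard normal
    z = nd.inv_cdf(p_excluded)        # cut-off in SD units
    return sd * nd.pdf(z) / (1.0 - p_excluded)

print(round(exclusion_bias(0.007), 1))   # → 2.0  points
print(round(exclusion_bias(0.073), 1))   # → 15.0 points
```

Both values reproduce the figures quoted in the text.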
In Germany much public attention has been drawn to the percentage of students in a so-called “risk group” defined by test scores below an arbitrary threshold. International comparisons of such percentages are particularly unreliable, because they are extremely sensitive to non-uniform exclusion criteria.
2.3 On the fringe of the target population: unequal inclusion of learning-disabled students

The imprecision of exclusion criteria and the resulting bias are further illustrated by the unequal inclusion of students with learning disabilities. Seven countries cater to them in special schools. In these schools the cognitive test was abridged to one hour, and a special booklet with a selection of easy items was used. In all other countries student exclusions were decided case by case; but even in countries that used the special booklets, some learning-disabled students could be individually excluded (cf. Prais 2003: 149, 158).

The extent to which students were either excluded from the test or given the short booklet varies widely among the seven countries. In Austria, 1.6 % of the target population were completely excluded, and 0.9 % of the participating students got the short test. In Hungary, 3.9 % were excluded, and 6.1 % did the short test. Given this discrepancy, it is hardly surprising that Hungarian students who did the short test achieved nearly 200 points more than Austrians.

For another rough estimate of the quantitative impact of unclear exclusion criteria, one can recalculate national means without short tests. If all short tests were excluded from the PISA sample, the mean reading score of Belgium, Denmark, and Germany would increase by more than 7 points; Belgium (1.5 % exclusions, 3.0 % short tests) would even remain within the 5 % limit (TR03: 169). A bias of the order of 7 points is in perfect accord with the estimate from the previous section.
2.4 Sampling problems: inconsistent input

The sampling is technically difficult. Many governments do not have consistent databases. Sometimes this leads to bewildering inconsistencies: in Sweden, 102.5 % of all 15-year-olds are reported to be enrolled in an educational institution; in the Italian region of Tuscany, 107.7 %; in the USA, in spite of a strong homeschooling movement, 100.000 % (TR03: 168, 183).

The sample is drawn in two stages: schools within strata (regions and/or school types), and students within schools. As a consequence of this stratification and of unequal participation rates, not all students are equally representative of the target population. To correct this, students are assigned statistical weights composed of several factors. The recommended way to calculate these weights is so difficult that international rules foresee three replacement procedures. In Greece, none of the four procedures worked, so that a uniform student weight had to be used (TR03: 52).
2.5 Sampling problems: inconsistent output

In the Austrian sample of PISA 2000, students from vocational schools were underrepresented. As a consequence, average student competences were overestimated, and other statistics were distorted as well. The error was only searched for and found three years later, when the disappointing outcome of PISA 2003 induced the government (which had changed in the meantime) to order an investigation (Neuwirth et al. 2006).

In South Tyrol, a change of government is not in sight, and therefore nobody seems interested in verifying accusations that the excellent PISA results of this region are largely due to the underrepresentation of students from vocational schools (Putz 2006).

In South Korea, only 40.5 % of PISA participants are girls. In the 1980s, due to selective abortion and possibly to hepatitis B, the sex ratio at birth in South Korea had reached a historic low of 47 %, perhaps even 46 %. But even when this is taken into account, girls are still severely underrepresented in the PISA sample. According to the Technical Report, this cannot be explained by unequal enrollment or test compliance: the reported enrollment rate is 99.94 %, the school participation rate 100 %, and the student participation rate 98.81 %. Either these numbers are wrong, or the sampling scheme was inappropriate. This conclusion is also supported by an anomalous distribution of birth months.
2.6 Insufficient response rates

Rules required a school response rate of 85 %, within-school student response rates of 25 %, and a country-wide student response rate of 80 % (TR03: 48-50). The United Kingdom breached more than one criterion, which led to its superficial disqualification. Canada profited from a strange rule according to which initial response rates between 65 % and 85 % could be cured by negotiation if the 85 % quorum was not reached even after calling replacement schools (TR03: 238). With 64.9 %, the USA missed the non-negotiable initial condition, though by a narrow margin, and the response from replacement schools was overwhelmingly negative, bringing the participation rate to no more than 68.1 %. Nevertheless, US data were fully included in all analyses (note: the USA contributes 25 % of the OECD’s budget).

Non-response can cause considerable bias because the propensity of school principals and students to partake in the testing is likely to be correlated with the potential outcome. Quantitative estimates are difficult because the international database contains not the least information about those who refused the test. Nevertheless, there is ample indirect evidence that the correlation is quite high. To cite just one example: in Germany, schools with a student response of 100 % had a mean math score of 553, whereas schools with participation below 90 % achieved only 476 points. Even if the latter number is subject to some uncertainty (discussed at length in Wuttke 2007), the strong correlation between student ability and test compliance is beyond any doubt.
In the official analysis, statistical weights provide a first-order correction for the between-school variation of response rates: when schools refuse to participate, the weight of other schools from the same stratum is increased accordingly. Similarly, in schools with low student response rates, the participating students are given higher weights.

However, these corrections do not cure within-school correlations between students’ latent abilities and their propensity to partake in the test. In the absence of data from absent students, the possible bias can only be estimated roughly: in some countries, the student response rate is more than 15 % lower than in others. Assuming very conservatively that the latent ability of the missing students is only half a standard deviation below the true national average, one finds that the absence of these students increases the measured national average by 8.8 points.
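The arithmetic behind the 8.8 points is elementary: if a fraction f of students whose latent ability lies d standard deviations below the true national mean stays away, the observed mean exceeds the true mean by f·d·σ/(1−f). A sketch of the calculation (my derivation, consistent with the figure in the text):

```python
def nonresponse_bias(missing_frac, deficit_sd=0.5, sd=100.0):
    """Overestimate of the national mean (in score points) when a
    fraction missing_frac of students, whose latent ability lies
    deficit_sd standard deviations below the true national mean,
    does not take the test."""
    return missing_frac * deficit_sd * sd / (1.0 - missing_frac)

print(round(nonresponse_bias(0.15), 1))   # → 8.8 points
```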
2.7 Gender-dependent response rates

In many countries, girls are overrepresented in the PISA sample. The discrepancy is largest in France, with 52.6 % girls in PISA against an estimated 48.9 % among 15-year-olds: compared to the age cohort, the PISA sample has more than 7 % too many girls and more than 7 % too few boys. Insofar as this is due to different enrollment, it reinforces the argument of Sect. 2.1. Otherwise, the most likely explanation is a gender-dependent propensity to participate in the testing.
2.8 Doubts about data transmission: missing missing responses

Normally, some students do not respond to all questions of the background questionnaire. Moreover, some students leave between the cognitive test and the questionnaire session. In Poland, however, such missing data are themselves missing: there is not a single student who responded to fewer than 25 questionnaire items, and there are 7 items that no student failed to answer. Unless this anomaly is explained otherwise, one must suspect that booklets with missing data have been suppressed.
3 Ignored dimensions of the cognitive test

PISA’s “competence” scale depends on the assumption that all items from one subject domain measure essentially one and the same latent ability. In reality, any test outcome is also influenced by factors that cannot be subsumed under a subject-specific competence. While there is no generally accepted way to quantify the degree of multi-dimensionality of a test (Hattie 1985), simple first-order estimates are sufficient to demonstrate its impact: non-competence dimensions cause an amount of arbitrariness, uncertainty, and bias in PISA’s competence measure that is by no means negligible when compared to the purely stochastic official standard errors.
3.1 Elimination of disturbing items

The evidence for multidimensionality presented in the following sections is all the more striking given that the cognitive items actually used in PISA had been preselected for unidimensionality: submissions from participating countries were streamlined by “professional item writers”, reviewed by national “subject matter experts”, tested with students in think-aloud interviews, tested in a pre-pilot study in a few countries, tested in a field trial in most participant countries, rated by expert groups, and selected by the consortium (TR03: 20-30).
Only one-third of the items that had reached the field trial were finally used in the main test. Items that did not fit the idea that competence can be measured in a culturally neutral way on a one-dimensional scale were simply eliminated. Field test results remain unpublished, although one could imagine an open-ended analysis providing valuable insight into the diversity of education outcomes. This adds to Olsen’s (2005a: 5) observation that in PISA-like studies the major portion of information is thrown away.

However, the strong preselection did not prevent seriously flawed items from being used in the main test: in the analysis of PISA 2000, the item “Continent Area Q1” had to be disqualified, in 2003 “Room Numbers Q1”. Furthermore, some items had to be disqualified in specific countries.
3.2 Unfounded models

In PISA a probabilistic psychological model is used to calibrate item difficulties and to estimate student competences. This model, named after Georg Rasch, is the most elementary incarnation of item response theory. It assumes that the probability of a correct response depends only on the difference between the student’s competence value and the item’s difficulty value. Mislevy (1993) calls this attempt to “explain problem-solving ability in terms of a single, continuous variable” a “caricature” based in “19th century psychology”. The model does not even admit the possibility that some items are easier in one subpopulation than in another. The reason for its usage in PISA is neither theoretical nor empirical, but pragmatic: only one-dimensional models yield unambiguous rankings.
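In its standard logistic form, the Rasch model makes the probability of a correct response a function of the single difference between competence and difficulty. A minimal sketch (variable names mine, not from the PISA documentation):

```python
import math

def rasch_probability(competence, difficulty):
    """Rasch model: the probability of a correct response depends only
    on the difference between student competence and item difficulty
    (both on the same logit scale)."""
    return 1.0 / (1.0 + math.exp(difficulty - competence))

# A student whose competence equals the item difficulty has a 50 % chance:
print(rasch_probability(1.2, 1.2))   # → 0.5
```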
Taking the Rasch model literally, there is no way to estimate the competence of students who solved all items or none: for them the test has been too easy or too difficult, respectively. In PISA, this problem is circumvented by enhancing the probability of intermediate competences through a Bayesian prior, arbitrarily assumed to be a Gaussian. As distributions of psychometric measures are never Gaussian (Micceri 1992), this inappropriate prior causes bias in the competence estimates (Molenaar in Fischer/Molenaar 1995: 48), especially at extreme values (Woods/Thissen 2006). This further undermines statements about “risk groups” with particularly low competence values.
3.3 Failure of the Rasch model

Various mathematical criteria have been developed to assist in the decision whether or not the Rasch model reasonably approximates an empirical data set. It appears that only one of them has been used to check the outcome of the PISA main test: an unexplained “item infit mean square” (TR03: 123, 278).

A much more sensitive way to test the goodness of fit is a visual inspection of appropriate plots (Hambleton et al. 1991: 66). An “item characteristic” or “score curve” is a plot of correct-response percentages as a function of competence values, each data point representing a quantile of examinees. In the Technical Report (TR03: 127), one single item characteristic is shown – an atypical one that agrees rather well with the Rasch model.

According to the model, all item characteristics from one subject domain should have strictly the same shape; the only degree of freedom is a horizontal shift, driven by the model’s only item parameter, the difficulty. This is clearly inconsistent with the variety of shapes exhibited by the four item characteristics in Figure 2. Whereas “Water Q3b” discriminates quite well between more or less “competent” students, the other three items have deficiencies that cannot be described without additional parameters.
The characteristic of “Chair Lift Q1” has almost a plateau at low competence values. This is the typical signature of guessing. On the other hand, “Freezer Q1” saturates at less than 35 %. This indicates that many students did not find out what the testers intended. Low discrimination strengths as in “South Rainea Q2” may have several reasons: different difficulties in different subpopulations, different difficulties for different solution strategies (cf. Meyerhöfer 2004), qualified guessing, or weak correlation between the latent ability measured here and in the majority of this domain’s items.

Figure 2: Some item characteristics that show pronounced deviations from the Rasch model. Solid curves in (a) are fits with a two-parameter model that accounts for different discrimination. The four-parameter fits in (b) additionally model guessing and misunderstanding.
The solid lines in Fig. 2 show that satisfactory fits of the empirical data are possible when the Rasch model is extended by parameters that allow for variable discrimination strength, for guessing, and for misunderstanding. Such multi-parameter item-response models still contain a linear shift parameter that may be interpreted as the item difficulty. However, best-fit estimates of this parameter deviate by typically 30 points from the official Rasch difficulties (Wuttke 2007: Fig. 11). This model dependence of item difficulty estimates is not compatible with a one-dimensional ranking of items as is needed for the construction of “proficiency levels” (Sect. 4.1). Furthermore, as soon as one admits more than one item parameter, any student ranking becomes arbitrary because of the ad-hoc anchoring of the difficulty and competence scales.

The first data point of the characteristics of “South Rainea” and “Chair Lift” clearly lies below the fit curves: the weakest 4 % of participants perform worse than modeled. This may be due to a lack of cooperation – yet another dimension that is not contained in elementary item-response theory. It may also be due to the inappropriateness of the Gaussian population model.
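The kind of extension discussed above is commonly written as a four-parameter logistic model: discrimination a, difficulty b, a guessing floor c, and an upper asymptote d below 1 for items that are systematically misunderstood. A sketch under common IRT conventions (the parameter names are mine, not PISA’s):

```python
import math

def four_parameter_probability(theta, a, b, c, d):
    """Four-parameter logistic IRT model: a = discrimination,
    b = difficulty, c = lower asymptote (guessing),
    d = upper asymptote (< 1 when some examinees misunderstand the item).
    The Rasch model is the special case a = 1, c = 0, d = 1."""
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))

# An item like "Freezer Q1" that saturates near 35 % would have d ~ 0.35:
p_strong = four_parameter_probability(5.0, a=1.0, b=0.0, c=0.0, d=0.35)
print(round(p_strong, 2))   # → 0.35, close to the ceiling
```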
3.4 Between-booklet variance

The use of different test booklets makes it possible to employ a total of 165 different items, though every single student works on no more than 60 of them. This reduces the dependence of test results on the arbitrary choice of items.

At the same time, it allows us to get an idea of how strong this dependence actually is. Calculating mathematics competence means for groups of students who have worked on the same booklet, one finds inter-booklet standard deviations between 4 (Hungary) and 18 (Mexico) points. The largest difference occurs in the USA: students who worked on booklet 2 were estimated to have a math competence of 444, whereas those who worked on booklet 10 achieved 512 points. Eliminating either booklet 2 or booklet 10 would respectively increase or decrease the overall national mean by about three points. This variance only reflects the arbitrariness in choosing items from a pool that is already quite homogeneous due to the procedures described above (Sect. 3.1). Cultural bias in the submission, selection, and adaptation of items may have a far stronger impact.
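The effect of dropping one booklet can be estimated with a simple average: removing one of thirteen equally sized booklet groups shifts the national mean by (mean − booklet mean)/12. The sketch below assumes equally sized groups and an overall US mathematics mean of 483 (the published PISA 2003 figure, not stated in the text above); only the two booklet means 444 and 512 are taken from the text.

```python
def mean_shift_after_removal(booklet_mean, national_mean, n_booklets=13):
    """Shift of the national mean if one of n_booklets equally sized
    booklet groups is dropped from the sample."""
    return (national_mean - booklet_mean) / (n_booklets - 1)

# Assumed national mean 483; booklet means 444 and 512 as quoted above.
print(round(mean_shift_after_removal(444, 483), 1))   # → 3.2  (mean rises)
print(round(mean_shift_after_removal(512, 483), 1))   # → -2.4 (mean falls)
```

Both shifts are of the order of three points, as stated in the text.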
3.5 Imputation with wrong normalisation

Each of the thirteen regular booklets consists of four blocks. Each item appears in four different blocks, in four different positions, in four different booklets. The major subject domain, mathematics, is covered by seven of the thirteen blocks; the other three domains are tested in two blocks each.

While all thirteen booklets contain at least one mathematics block, each minor domain appears only in seven booklets. Nevertheless, in the scaled data all students are attributed competence values in all four domains. If a student has not been tested in a domain, the competence estimate is based on both his questionnaire responses and his school’s average math achievement. Such an imputation, when done correctly, reduces the standard error of population means without introducing bias.

In PISA, however, it is not done correctly. Bias is introduced because the imputation is anchored in only one of the seven booklets for which real data are available. This bias is plainly admitted in the Technical Report (TR03: 211), though it is quantified only for Canada. The case of Greece is more extreme: the official science competence mean of 481 is 16 points above the average achievement of those students who were actually tested in science (Wuttke 2007: Sect. 3.10; cf. Neuwirth in Neuwirth et al. 2006: 53). This huge bias is certainly not justified by the benefits of imputation, which consist in a slight simplification of the secondary data structure and in a reduction of stochastic standard errors by probably no more than 10 %.
3.6 Timing, tactics, fatigue

Since every item occurs in four different positions, one can easily investigate how response rates vary during the two-hour testing session: per-block response rates, averaged across booklets over all items, can be directly compared to each other.

One finds that the average rates of non-reached items, of missing responses, and of incorrect responses systematically increase from block to block. The extent of this increase varies considerably between countries. The ratio of non-reached items in the fourth block is 1 % in the Netherlands, while in Mexico it is 25.3 %. In the Netherlands the ratio of items that were reached but not answered goes up from 2.5 % in the first block to 4.0 % in the fourth block; in Greece, from 11.1 % to 24.4 %. In Austria, the ratio of correct to given responses decreases from 56.2 % in the first block to 54.4 % in the fourth block; in Iceland, from 58.5 % to 53.1 %.

All these data indicate that students lack a sufficient amount of time in the last of the four blocks. This alone is a strong argument against the applicability of one-dimensional item response theory (Rost 2004: 43). The ways students react to the lack of time vary considerably between countries:

– Dutch students try to answer almost every item. Towards the end of the test, they become hasty and increasingly resort to guessing.

– Austrian and German students skip many items, and they do so from the first block on, which leaves them enough time to finish the test without greatly accelerating their pace.

– Greek students, in contrast, seem to be taken by surprise by the time pressure near the end. In the first block, their correct-response rate is better than in Portugal and not far from the USA and Italy. In the last block, however, non-reached items and missing responses add up to 35 %, bringing Greece down to one of the last ranks.

Aside from such extreme cases, it is hardly possible to disentangle the effects of test-taking tactics and fatigue.
3.7 Multiple responses to multiple-choice items

In PISA 2003, 42 of 165 items are in a simple multiple-choice format. For each of these items, four or five responses are proposed, of which exactly one is meant to be the correct one. This essential rule is not clearly explained to the examinees. In some countries, for some items, a considerable number of multiple responses are given. They are denoted by a special code in the international database, but are subsequently counted as incorrect.

In many countries, including Australia, Canada, Japan, Mexico, the Netherlands, New Zealand, and the USA, the quota of multiple responses is close to 0 % (except for one particularly flawed item). In Austria, Germany, and Luxembourg, on the other hand, the fraction of multiple responses surpasses 4 % for at least eleven items, and reaches up to 10 % for one of them.

Such a misunderstanding of the test format not only distorts the outcome of the directly affected item. It also costs time: it requires more effort to decide four or five times whether or not a proposed answer is correct than to choose only one alternative. Those who are familiar with the multiple-choice format sometimes do not even need to read all distractors.
256 JOACHIM WUTTKE
3.8 Testing cultural background

If one wants to understand what a test actually measures, one has to study the manifold reasons why students give incorrect responses (cf. Kohn 2000: 11). The few student solutions of open-ended items published by Luxembourg show how much information is lost when verbal or pictorial responses are digitally coded.
             A        B        C        D
Slovakia   3.1 %   46.1 %   17.5 %   33.3 %
Sweden     3.1 %   46.2 %   37.0 %   13.7 %

Table 1: Percentages for the four possible responses to the multiple-choice item “Optician Q1”. Data are shown for two countries where almost the same percentage of students chooses the correct response B. However, preferences for the distractors C and D vary by about 20 %.
In contrast, in the digital coding of multiple-choice items, most information is preserved; the codes for formally valid but incorrect responses indicate which of the three distractors was chosen. Table 1 shows the response percentages for one item and two countries. In this example, distractor preferences vary by about 20 %, although the correct-response percentage is almost the same. This demonstrates quantitatively that the reasons that induce students to give a specific incorrect answer can vary enormously from country to country.
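The divergence of distractor preferences can be quantified directly from the figures in Table 1; this minimal sketch simply copies the published percentages and compares them country by country:

```python
# Response percentages from Table 1 for the item "Optician Q1";
# B is the correct response, A, C, D are distractors.
table1 = {
    "Slovakia": {"A": 3.1, "B": 46.1, "C": 17.5, "D": 33.3},
    "Sweden":   {"A": 3.1, "B": 46.2, "C": 37.0, "D": 13.7},
}

# The correct-response rates are nearly identical ...
diff_correct = round(abs(table1["Slovakia"]["B"] - table1["Sweden"]["B"]), 1)
# ... while distractor preferences differ by up to about 20 points.
diff_distractors = {d: round(abs(table1["Slovakia"][d] - table1["Sweden"][d]), 1)
                    for d in ("A", "C", "D")}
print(diff_correct)      # 0.1
print(diff_distractors)  # {'A': 0.0, 'C': 19.5, 'D': 19.6}
```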
It is fairly obvious that the set of distractors offered also influences correct-response rates. Had distractor D been more in the spirit of C, it would have attracted additional responses in Sweden, whereas in Slovakia many students would have reoriented their choice towards B.
Between-country variance may be due to school curricula, cultural background, test language, or a combination of several factors. These factors are particularly influential in PISA because students have little time (about 2’20” per item), and the reading texts are too long. Sometimes the stimulus material even tricks students with misleading cues (Ruddock et al. 2006). In this situation, test-wise students try to solve items without actually reading the introductory texts. Such qualified guessing is of course highly dependent on extrinsic knowledge and therefore particularly susceptible to cultural bias.
The released reading unit “Flu” from PISA 2000 provides a nice example. The stimulus material is an information sheet about a flu vaccination. One of
the items asks how the vaccination compares to alternative or complementary means of protection. Of course, students are not asked about their personal opinion; the answer is to be sought in the reading text. Nevertheless, the distractor preferences reflect French reliance on technology and German belief in nature.
3.9 Language-related problems

Language influences the test in several ways:

Translations are prone to errors. In PISA, a complicated scheme with double translation from English and French was foreseen to minimise such errors. However, in many cases, including the German-speaking countries, the French original was not taken seriously, and final versions were produced under extreme time pressure. There are clear-cut translation errors in the released sample items. In the unit entitled “Daylight”, the English word “hemisphere” was translated by the erudite “Hemisphäre” where German schoolbooks use the word “Erdhälfte”. In the unit “Farms”, “attic floor” was rendered as “Dachboden”, which just means “attic”. The fact that the Austrian version has the correct wording “Boden des Dachgeschosses”, though all German-speaking countries had shared the translation work, indicates that uncoordinated and unchecked last-minute modifications were made.
Blum and Guérin-Pace (2000: 113) report that changing a question (“Quels taux . . . ?”) into a prompt (“Énumérez tous les taux . . . ”) can change the rate of correct responses by 31 %. This gives an idea of how much freedom translators have either to help or to confuse (cf. Freudenthal 1975: 172; Olsen et al. 2001).
Under translation, texts tend to become longer, and some languages are more concise than others. In PISA 2000, a comparison of the English and French versions of 60 stimulus texts showed that the French texts contained on average 12 % more words and 19 % more letters (TR00: 64). Of course, reading time is not simply proportional to the number of words or letters. It seems nevertheless plausible that such a huge length difference induces an important bias.
3.10 Origin of test items

A majority of test items comes from English-speaking countries; the other items were translated into English before they were streamlined by “professional item writers”. If there is cultural bias, it is clearly in favour of the
English-speaking countries. This makes it difficult to separate it from the translation bias, which acts in the same direction.
The quantitative importance of cultural and/or linguistic bias can be read off from the correlation of correct-response-percentage-per-item vectors, as has been shown by Zabulionis (2001, for TIMSS), Rocher (2003), Olsen (2005), and Wuttke (2007). Cluster analyses invariably show that student behaviour is most similar for countries that share both language and cultural heritage, such as Australia and New Zealand (correlation coefficient 0.98). If the languages differ, correlations are at best about 0.96, as for the Czech and Slovak Republics. If the languages do not belong to the same stem, correlations are hardly larger than 0.94. While some countries belong to large clusters, others like Japan and Korea are quite isolated (no correlation larger than 0.90). These results have immediate implications for the validity of inter-country comparisons: the lower the correlation of response patterns, the more a comparison depends on the arbitrary choice of items.
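The correlation analysis described above can be sketched as follows. The per-item correct-response percentages are invented for four countries and six items; real analyses use the full PISA or TIMSS item sets and all participating countries.

```python
import math

# Hypothetical per-item correct-response percentages (p-values).
p_values = {
    "AUS": [0.82, 0.55, 0.64, 0.40, 0.71, 0.33],
    "NZL": [0.80, 0.57, 0.62, 0.42, 0.70, 0.35],
    "CZE": [0.75, 0.60, 0.55, 0.50, 0.65, 0.30],
    "JPN": [0.60, 0.72, 0.48, 0.58, 0.55, 0.45],
}

def pearson(x, y):
    """Pearson correlation of two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

# Pairwise correlations of country response patterns, strongest first.
countries = sorted(p_values)
corr = {(a, b): pearson(p_values[a], p_values[b])
        for a in countries for b in countries if a < b}
for pair, r in sorted(corr.items(), key=lambda kv: -kv[1]):
    print(pair, round(r, 3))
```

A hierarchical clustering on the distance 1 − r would then group the countries with the most similar response patterns, as in the cited studies.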
4 Interpreting cognitive test results

4.1 Proficiency levels

Verbal descriptions of “proficiency levels” are used to guide the interpretation of numeric results (FR03: 46-56). The boundaries of these levels are arbitrarily chosen; nevertheless, they are communicated with absurd four-digit precision. Starting at a competence of 358.3, there are six proficiency levels. The width of levels 1 to 5 is about 62.1; the semi-infinite level 6 starts at 668.7. Depending on how many students gave the right response, each item is assigned to one of these levels. Based on all items assigned to one level, a verbal synthesis is given of what students with corresponding competence values “can typically do”.

By construction, the student competence distribution is approximately Gaussian. The mean of 500 and the standard deviation of 100 are imposed by an explicit (though ill-documented) renormalisation. Therefore, the percentages of students in the different proficiency levels are almost constant.
To illustrate this point, let us perform a Gedanken experiment. If the percentage of correct responses given by a single student grows by 6 %, his competence value increases by about 30 points. Suppose now that the correct-response rate grows by 6 % for all students. In this case, the competence values assigned to the students will not increase, because any uniform change
of competences is immediately reverted by the renormalisation to the predefined Gaussian. Instead, the item difficulty values would be lowered by about 30 points, so that about every second item would be relegated to the next lower proficiency level. Theoretically, this should then lead to a rephrasing of the proficiency level descriptions.
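The Gedanken experiment can be illustrated numerically. This sketch replaces the full Rasch scaling by a simple affine renormalisation to mean 500 and standard deviation 100, which suffices to show the cancellation; the 30-point shift and the scale parameters are taken from the text, and the abilities and difficulties are simulated, not real PISA values.

```python
import random
import statistics

random.seed(0)

def renormalise(scores, mean=500.0, sd=100.0):
    """Affine rescaling to the predefined mean and standard deviation."""
    m = statistics.mean(scores)
    s = statistics.pstdev(scores)
    return [(x - m) / s * sd + mean for x in scores]

# Simulated abilities and item difficulties on the PISA point scale.
abilities = [random.gauss(500, 100) for _ in range(10000)]
difficulties = [400.0, 500.0, 600.0]

# Gedanken experiment: every student gains 30 points ...
shifted = [a + 30 for a in abilities]
# ... but the renormalisation undoes the uniform shift, so the reported
# ability distribution is unchanged ...
renorm = renormalise(shifted)
print(round(statistics.mean(renorm)), round(statistics.pstdev(renorm)))  # 500 100
# ... while the difficulties, measured relative to the shifted abilities,
# all drop by 30 points.
relative_difficulties = [d - 30 for d in difficulties]
print(relative_difficulties)  # [370.0, 470.0, 570.0]
```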
However, these descriptions are highly systematic. They are so systematic that they could have been derived straight from Bloom’s forty-year-old taxonomy. They are far too systematic to look like a summary of empirical results: one would expect that not every single item fits equally well into such a scheme, but the level descriptions do not reflect the least irritation. As Meyerhöfer (2004) has pointed out, the very idea of proficiency levels is not consistent with the fact that test items can be solved in quite different ways, depending for instance on curricular premises, on test-wiseness, and on time pressure. Therefore, the most likely outcome of our Gedanken experiment seems to be that the official level descriptions would not change at all, so that the overall increase in student achievement would pass unnoticed – as has the misfit of the Rasch model and the resulting bias and uncertainty of about 30 difficulty points.
Another fundamental objection is the lack of transparency. The proficiency level descriptions are not scientifically discussible unless the consortium publishes the instruments on which they are based and the proceedings of the hermeneutic sessions in which the descriptions have been worked out.
In the German reports, students in and below proficiency level 1 are called “the risk group”. This deviates from the international reports, which speak of “risk” only in connection with students below level 1. It has become an urban legend in Germany that nearly one quarter of all fifteen-year-olds are almost functionally illiterate, although the original report clearly states that PISA does not bother to measure fluency of reading, which is taken for granted even on level 1 (FR00: 47-48). Furthermore, as has been stressed above, the percentage of students on or below level 1 is extremely sensitive to disparities in sampling and participation.
4.2 Is PISA an intelligence test?

PISA items from different domains are quite similar in style – and sometimes even in content: reading items are based on nontextual stimulus material such as graphics or tables, and math or science items require a lot of reading. This is intentional insofar as it reflects a certain conception of “literacy”.
It is therefore unsurprising that competence values from different domains are highly correlated. A majority of per-country inter-domain correlations is stronger than 80 %.

In such a situation, the sensible thing to do is a principal component analysis. One finds that between 75 % (Greece) and 92 % (Netherlands) of the total variance of student competences can be attributed to just one component. However, no such analysis has been published by the consortium, and when Rindermann (2006) did so, members of PISA Germany tried to dismiss and even to ridicule it. The ideological and strategic reasons for this opposition are obvious: once it is found that PISA mainly measures one general factor per examinee, it is hard not to make a connection to the g factor of cognitive psychology. This must be seen as a sacrilege and as a threat by PISA members, who avoid the term “intelligence” throughout their writings. The word is taboo in much of the pedagogical mainstream, and no government would spend millions to be informed about the intelligence of students.
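A principal component analysis of this kind can be sketched on simulated domain scores. Everything here is hypothetical: the scores are generated from one shared factor plus domain noise, so a large first-component share is built in by construction; the 75–92 % figures in the text refer to the real PISA data.

```python
import math
import random

random.seed(1)

# Simulated scores in three domains (reading, maths, science), each the
# sum of one shared component and domain-specific noise.
students = []
for _ in range(2000):
    g = random.gauss(0, 1)                       # shared component
    students.append([g + random.gauss(0, 0.5) for _ in range(3)])

# Covariance matrix of the three domain scores.
n = len(students)
means = [sum(s[j] for s in students) / n for j in range(3)]
cov = [[sum((s[i] - means[i]) * (s[j] - means[j]) for s in students) / n
        for j in range(3)] for i in range(3)]

# Leading eigenvalue by power iteration: its share of the total variance
# is the variance explained by the first principal component.
v = [1.0, 1.0, 1.0]
for _ in range(200):
    w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
    norm = math.sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]
lam = sum(v[i] * sum(cov[i][j] * v[j] for j in range(3)) for i in range(3))
share = lam / sum(cov[i][i] for i in range(3))
print(round(share, 2))  # a large share when one factor dominates
```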
4.3 Uncontrolled variables

PISA aims at monitoring “outcomes of education systems”. However, the education system is just one of many variables that influence the outcome of the cognitive test. As we have seen, sampling, exclusions, participation rates, test-taking habits, culture, and language are quantitatively important. Since all these variables are country-dependent, there is no way to separate them from the variable “education system”.

But even in the hypothetical case of a technically and culturally fair test, it would not be clear that differences in test outcome are due to differences in education systems. There are certainly country-dependent educational influences that are not part of what is generally understood by “education system”, such as the subtitled TV programmes prevalent in small language communities. Furthermore, equating test achievement with the outcome of schooling is highly ideological in that it dismisses differences in genetic endowment, pre-school education, and out-of-school environment.
The importance of extrinsic parameters becomes obvious when subpopulations are compared that share the same education system. One example is the two language communities in Finland. In the major domain of PISA 2000, reading, students in Finnish-speaking schools achieve 548 points, those in Swedish-speaking schools only 513 – slightly less than Sweden’s national average of 516 (Wuttke 2007: Sect. 4.8). A national report (Brunell 2004) suggests that
much of the difference between the two communities can be explained by two factors, namely the language spoken at home and the social, economic, and cultural background.
If student-dependent background variables have such a huge impact in an otherwise comparatively homogeneous country like Finland, they can distort international comparisons even more severely. As several authors have already noted, one of the most important background variables is the language spoken at home. Except in a few bilingual regions, a non-test language spoken at home is typically linked to immigration. The immigration status is accessible through the questionnaire, which asks for the country of birth of the student and his parents. Excluding first- and second-generation immigrant students from the national averages considerably changes the country league tables: on top of the 2003 mathematics league table, Finland is replaced by the Netherlands and Belgium, closely followed by Switzerland. The superiority of the Finnish school system, one of the most publicised “results” of PISA, vanishes as soon as one single background variable is controlled.
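The recomputation of national means without immigrant students can be sketched as follows. The records are invented; the real analysis uses the questionnaire variables on the country of birth of the student and the parents, and the full national samples with their weights.

```python
# Invented records: (country, immigrant_generation, score), where
# generation 0 = native, 1 = first generation, 2 = second generation.
students = [
    ("FIN", 0, 548), ("FIN", 1, 460), ("FIN", 0, 552),
    ("NLD", 0, 560), ("NLD", 2, 470), ("NLD", 0, 556),
]

def national_mean(records, country, exclude_immigrants=False):
    """Mean score for one country, optionally without immigrant students."""
    scores = [score for c, generation, score in records
              if c == country and not (exclude_immigrants and generation > 0)]
    return sum(scores) / len(scores)

print(national_mean(students, "FIN"))                           # 520.0
print(national_mean(students, "FIN", exclude_immigrants=True))  # 550.0
```

Comparing the two calls per country shows how strongly a single background variable can shift a league table built from the unconditional means.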
5 Conclusions

One line of defence of PISA proponents reads: PISA is state-of-the-art; at present nobody can do it better. This is probably true. If there were one outstanding source of bias, one could hope to improve PISA by fighting this specific problem. However, it rather appears that there is a plethora of inaccuracies of similar magnitude. Reducing a few of them will have very little effect on the overall uncertainty. Therefore, one has to live with the unsatisfactory state of the art and draw the right consequences.

Firstly, the outcome of PISA must be reassessed. The official significance criteria, based only on stochastic errors, are irrelevant and misleading. The accuracy of country rankings is largely overestimated. Statistics are particularly distorted if they depend on response rates among weak students; statements about “risk groups” are untenable.

Secondly, the large sample sizes of PISA are uneconomic. Since the accuracy of the study is determined by other factors, the effort currently invested in minimising stochastic errors is unjustified.

Thirdly, it is clear from the outset that little can be learned when something as complex as a school system is characterised by something as simple as the average number of solved test items.
References

Blum, A./Guérin-Pace, F. (2000): De Lettres et des Chiffres. Des tests d’intelligence à l’évaluation du “savoir lire”, un siècle de polémiques. Paris: Fayard.
Bottani, N./Vrignaud, P. (2005): La France et les évaluations internationales. Rapport établi à la demande du Haut Conseil de l’évaluation de l’école. http://lesrapports.ladocumentationfrancaise.fr/BRP/054000359/0000.pdf.
Brunell, V. (2004): Utmärkta PISA-resultat också i Svenskfinland. Pedagogiska Forskningsinstitutet, Jyväskylä Universitet. http://ktl.jyu.fi/pisa/Langt_pressmeddelande.pdf.
Fischer, G. H./Molenaar, I. W. (1995): Rasch Models. Foundations, Recent Developments, and Applications. New York: Springer.
Freudenthal, H. (1975): Pupils’ achievements internationally compared – the IEA. In: Educ. Stud. Math. 6, 127-186.
FR00: OECD, ed. (2001): Knowledge and Skills for Life. First Results from the OECD Programme for International Student Assessment (PISA) 2000. Paris: OECD.
FR03: OECD, ed. (2004): Learning for Tomorrow’s World. First Results from PISA 2003. Paris: OECD.
Hambleton, R. K./Swaminathan, H./Rogers, H. J. (1991): Fundamentals of Item Response Theory. Newbury Park: Sage.
Hattie, J. (1985): Methodology Review: Assessing Unidimensionality of Tests and Items. In: Appl. Psych. Meas. 9 (2) 139-164.
Kohn, A. (2000): The Case Against Standardized Testing. Raising the Scores, Ruining the Schools. Portsmouth NH: Heinemann.
Meyerhöfer, W. (2004): Zum Problem des Ratens bei PISA. In: J. Math.-did. 25 (1) 62-69.
Micceri, T. (1989): The Unicorn, the Normal Curve, and other Improbable Creatures. In: Psychol. Bull. 105 (1) 156-166.
Mislevy, R. J. (1993): Foundations of a New Test Theory. In: Frederiksen, N./Mislevy, R. J./Bejar, I. I., eds.: Test Theory for a New Generation of Tests. Hillsdale: Lawrence Erlbaum.
Neuwirth, E./Ponocny, I./Grossmann, W., eds. (2006): PISA 2000 und PISA 2003: Vertiefende Analysen und Beiträge zur Methodik. Graz: Leykam.
Olsen, R. V./Turmo, A./Lie, S. (2001): Learning about students’ knowledge and thinking in science through large-scale quantitative studies. In: Eur. J. Psychol. Educ. 16 (3) 403-420.
Olsen, R. V. (2005a): Achievement tests from an item perspective. An exploration of single item data from the PISA and TIMSS studies, and how such data can inform us about students’ knowledge and thinking in science. Dissertation, University of Oslo.
Olsen, R. V. (2005b): An exploration of cluster structure in scientific literacy in PISA: Evidence for a Nordic dimension? In: NorDiNa 1 (1) 81-94.
Prenzel, M. et al. [PISA-Konsortium Deutschland], eds. (2004): PISA 2003. Der Bildungsstand der Jugendlichen in Deutschland – Ergebnisse des zweiten internationalen Vergleichs. Münster: Waxmann.
Prenzel, M. et al. [PISA-Konsortium Deutschland], eds. (2005): PISA 2003. Der zweite Vergleich der Länder in Deutschland – Was wissen und können Jugendliche. Münster: Waxmann.
Putz, M. (2006): PISA: Zu schön um wahr zu sein? Liegt das Traumergebnis an Rechenfehlern? Unpublished.
Rindermann, H. (2006): Was messen internationale Schulleistungsstudien? Schulleistungen, Schülerfähigkeiten, kognitive Fähigkeiten, Wissen oder allgemeine Intelligenz? In: Psychol. Rundsch. 57 (2) 69-86. See also comments and reply in vol. 58 (2).
Rocher, T. (2003): La méthodologie des évaluations internationales de compétences. In: Psychologie et Psychométrie 24 (2-3) [Numéro spécial: Mesure et Éducation], 117-146.
Rost, J. (2004): Lehrbuch Testtheorie – Testkonstruktion. 2nd edition. Bern: Hans Huber.
TR00: Adams, R./Wu, M., eds. (2002): PISA 2000 Technical Report. Paris: OECD.
TR03: OECD, ed. (2005): PISA 2003 Technical Report. Paris: OECD.
Woods, C. M./Thissen, D. (2006): Item Response Theory with Estimation of the Latent Population Distribution Using Spline-Based Densities. In: Psychometrika 71 (2) 281-301.
Wuttke, J. (2007): Die Insignifikanz signifikanter Unterschiede: Der Genauigkeitsanspruch von PISA ist illusorisch. In: Jahnke, T./Meyerhöfer, W., eds.: Pisa & Co. Kritik eines Programms. 2nd edition [note: my contribution to the 1st edition is outdated]. Hildesheim: Franzbecker.
Zabulionis, A. (2001): Similarity of Mathematics and Science Achievement of Various Nations. In: Educ. Policy Analysis Arch. 9 (33).
Large-Scale International Comparative Achievement Studies in Education: Their Primary Purposes and Beyond

Rolf V. Olsen
Norway: University of Oslo

Abstract:
This chapter argues that PISA is more than a driver for policy decisions in many countries. The study also provides unique data with the potential to engage educational researchers across the world in conducting a range of secondary analyses. The first section of the chapter describes how the primary purpose of such studies in general has gradually evolved. This description reflects how the studies have typically related to educational research. This section of the chapter is used as the general background for the second and major section, which presents a rationale for why educational researchers could or should be motivated to engage in analytical work relating to these studies. This is followed up by a provisional framework for how educational researchers may approach and make use of the data from these studies in secondary analyses. This framework is based on six generic analytical approaches derived from the study of a large number of examples of published secondary analyses.
Introduction

The overall purpose of this article is to argue that both PISA and a range of other studies, often referred to as large-scale international comparative achievement studies in education (LINCAS) (Bos, 2002), are not only an important driver for policy decisions in many countries, but the PISA study also provides unique data with the potential to engage educational researchers across the
world in conducting a range of secondary analyses. The first part of the chapter describes how the primary purpose of such studies in general has gradually evolved. This description reflects how the studies have typically related to educational research. This section of the chapter is used as a general background for the second and major section, which provides arguments for why educational researchers could or should be motivated to engage in work relating to these studies. Furthermore, this section presents how educational researchers may approach and make use of the data from these studies in secondary analyses by suggesting six generic analytical approaches. The six suggested generic approaches probably do not make up an exhaustive list of possibilities for secondary analytical work. Instead, taken together, they should be regarded as a provisional framework to be used as a starting point for a more comprehensive and systematic review of the available literature presenting secondary analyses relating to the PISA study.
Even though PISA is the main case to be discussed in this book, the theme of this chapter is of a more overarching and general nature. Many of the references made throughout the chapter to specific secondary analytical work will therefore be to other studies, and particularly to TIMSS 1, since this study has been around for a much longer time. Furthermore, given the author’s background as a researcher in science education, a majority of the examples will be related to this subject. However, the discussion offered is not subject-specific, and the arguments are thus equally relevant for other international studies as well as for other subject areas.
Part I: The primary purposes of the comparative studies

In order to start describing the main features of LINCAS, it is relevant to note that they include one or several measures of achievement, either in specific school subjects or in more overarching competencies transcending the traditional borders set up by school subjects. Furthermore, these measures have been developed under the requirement that they should allow for meaningful international comparison. In addition, an essential design component of the studies is that differences between countries can be studied as effects of contextual factors. It is also important to underscore the fact that these studies are large-scale, which implies that their aim is to find measures which can be generalised to schools and educational systems. In order to
1 Trends in International Mathematics and Science Study
obtain reliable measures which can be generalised to the system level, rigorous procedures for sampling a large number of schools/classes and students must be employed.
Although the aims of, the use of data from, the organisation of, and the methodology applied by these studies have developed gradually (Porter & Gamoran, 2002), there have been two distinctly different and partly competing overall visions underlying the studies. They were first conceived as a specific design or method for conducting research into education with a cross-country comparative perspective. This initial idea will be labelled Purpose I – the research purpose. Gradually, the focus has shifted, influenced by the increasing attention of policymakers towards monitoring the outcomes of educational systems and the study of possible determinants of such outcomes. This rationale for the comparative studies will be labelled Purpose II – the effective policy purpose.

The labels Purpose I and II are only suggested as useful heuristic devices for understanding some of the ideological tensions with which these studies have to live. 2 However, using this dichotomy does not suggest that the research purpose and the effective policy purpose are incompatible. On the contrary, I will offer the perspective that the studies may be considered as arenas where researchers in education and educational policymakers can exchange ideas, developing in turn mutual interest in and acceptance of each other’s engagement in educational issues on both the national and international levels.
Purpose I: The research purpose
Today the label 'comparative studies in education' refers to various types of
research ranging from issues of the more philosophical and methodological
aspects of comparing across cultures to very specific studies of narrowly defined
aspects of education across countries, regions or classrooms. This label
also covers studies with a great variety of designs and scales, and in general it
is fair to say that the label 'comparative studies' is used with different meanings,
as there is no generally accepted definition of the term (see for instance
Alexander, Broadfoot, & Phillips, 1999; Alexander, Osborn, & Phillips, 2000;
Carnoy, 2006). The idea of the large-scale comparative studies in focus
here was created and defined as a research agenda with the establishment

2 These labels are inspired by the way Roberts (2007) uses the terms Vision I and Vision II in
his review of the concept of scientific literacy.
268 ROLF V. OLSEN
of the IEA – the International Association for the Evaluation of Educational
Achievement – in 1961 under the auspices of the UNESCO Institute for Education
(Husén & Tuijnman, 1994; Keeves, 1992). The fundamental idea of
the founders of the IEA is very clearly expressed by one of the pioneers,
Torsten Husén (1973):

We, the researchers who . . . decided to cooperate in developing internationally valid
evaluation instruments, conceived of the world as one big educational laboratory
where a great variety of practices in terms of school structure and curriculum were
tried out. We simply wanted to take advantage of the international variability with
regard both to the outcomes of the educational systems and the factors which caused
differences in those outcomes. (p. 10)
The term "laboratory" in this quote is used only as a metaphor, since laboratory
conditions with controlled experiments, taken literally, are hardly feasible
in educational research due to both practical and ethical considerations. The
alternative to the experiment would therefore be survey designs in which the
variables of interest could be studied under a great variety of different conditions.
In this way "differences between education systems would provide
the opportunity to examine the impact of different variables on educational
outcome" (Bos, 2002, p. 5). "Thus the studies were envisaged as having a research
perspective . . . , as well as policy implications" (Kellaghan & Greaney,
2001, p. 92). The assumption is, in other words, that educational organisation
and practice affect educational opportunities and outcomes, and this can be the
subject of empirical research with the following aim:

. . . go beyond the purely descriptive identification of salient factors which account
for cross-national differences and to explain how they operate. Thus the ambition has
been the one prevalent in the social sciences in general, that is to say, to explain and
predict, and to arrive at generalizations. (Husén, 1973, pp. 10-11)
The two quotes above taken from Husén should be seen as typical of the time
and of the prevailing optimism regarding how the social sciences could contribute
to the development of a better understanding of the causal relationships
between different types of factors in society, a vision that in retrospect is often
referred to as "social engineering". The importance of the quotes in this
context is, however, to identify the fact that the studies originally came from
researchers in education who aimed to use them in order to find answers to
what they saw as important research questions. Furthermore, they considered
that an international comparative design gave particularly good opportunities
for answering such questions.
Purpose II: The effective policy purpose
Policymakers are required to establish overall plans for the nation's educational
system, e.g. to accomplish the following:
– decide the amount of resources and the methods of their distribution;
– specify the overall purpose of education as part of the wider social context
and specific goals of achievement; and
– determine the organisation of the progression of schooling from childhood
to adolescence and beyond.
To a large extent comparative studies and other internationally comparative
data have been regarded by policymakers as providing information that is relevant
in their continuous evaluation of such overall plans. What was initially the
formulation of a platform for comparative educational research coincided with
a growing recognition among politicians, industrial leaders and others
that education was one of the most central agents for realising long-term
political, societal or economic visions, such as the following:
– developing a society with a better distribution of resources across class, race,
gender or any other social group;
– fulfilling the need for a highly competent workforce in order to succeed in
the international marketplace;
– enhancing and further developing democracy by giving all citizens basic and
further education so that they are enabled to fulfil their own life-agenda and
become full-fledged participants in the democratic process.
These are just a few examples of the visions of the ideal society that to a
large degree were, and still are, shared throughout large parts of the
world. At the same time, during the post-Second World War period, international
organisations such as the United Nations, the World Bank, the OECD,
and the European Union were established and quickly grew in size and influence.
These are organisations with different (and to some degree conflicting)
agendas. However, they all to various degrees invest resources in the study
of education in their member countries, and several of these organisations are
linked to each other through joint projects concerning educational issues.
IEA became a provider of educational data and analyses, not only to national
policymakers, but also to several international organisations. In addition
to UNESCO, which was involved in the establishment of the IEA, the OECD (before
PISA was established) used data from IEA studies in its Education at a Glance
publications (e.g. the use of TIMSS data in OECD, 1996, 1997,
1998). Since the first studies conducted in the early 1960s, the IEA has been
in charge of a great number of comparative studies in different subjects, and
over the years the studies have grown to include a great number of countries
throughout the world. At the same time the methodological challenges have
been a driving force in the development of new designs and psychometric
procedures (Porter & Gamoran, 2002).
During the last few decades the growth of comparative studies has probably
also been fuelled by the reform of public services often referred
to as 'new public management'. This is characterised by deregulation of the
public sector and a drive towards a higher degree of privatisation of those parts
of the public sector that can be thought of as the infrastructure of society.
Deregulation implies a transfer of responsibility from the central government
to the local authorities. Nevertheless, important decisions related to schools
are still to be made by policymakers at administrative levels above the local
community or local school level. A consequence in most countries where
deregulation took place was therefore to reinforce the central government's
role by installing a national assessment system. This was a shift from the regulation
of inputs (e.g. specification of the use of resources or the number of
students per class) to controlling the outputs (achievement, surveys of students
and parents). In this way the service providers were made accountable both to
the central government and to the users of these services. On the one hand, the
central government could control and direct the services by connecting measures
of the output to incentives, or by intervening in and manipulating the system
so that it works as intended. On the other hand, the users could make use of the output
measures in personal decisions regarding the public services.
In this context the studies provide many indicators considered relevant,
especially for the policymaker:
– They produce measures of some of the outputs, most importantly achievement
measures.
– They produce indicators for systemic factors that may be directly linked to
policy, such as average class size, availability of resources (e.g. computers),
teacher education and allocation of time to different subjects. They also offer
the possibility of relating such factors to achievement.
– They provide indicators of relationships between variables that policy seeks
to change in a certain direction, e.g. the aim of schooling to provide an equal
opportunity for everyone to learn regardless of background.
– Some studies (for instance, PISA) show how the indicators and their
relationships to various demographic characteristics change over
time by repeating the surveys at regular intervals.
Moreover, indicators used to monitor the educational system in a country are
not based on absolute measures. Placing such measures in an international context
provides a comparative frame for their interpretation.
Comparison is a fundamental concept in measurement (Andrich, 1988).
In an assessment with no comparative component, it is usually possible to establish
whether various effect sizes are statistically significant, but it would be
very difficult to establish whether the effects are small or large. Even though
the international variation cannot be used in order to draw causal inferences,
it provides a description of what is possible and a context in which national
data can be compared. One specific example of how international variation improves
the potential for interpreting the results in the national context relates to
the issue of equity: it is often stated in policy documents that large systematic
differences in achievement between pupils from different socio-economic
levels indicate that school systems fail in providing equal opportunities for all
pupils. Moreover, a large standard deviation in achievement in the total population
is often considered an indicator of inequity. For both these types of
effects, the international context provides an opportunity for the policymaker
to evaluate whether the differences between students or groups of students
are large or small compared to other systems perceived as relevant for
comparison. There will always be differences between students, but without a
contrast it would be impossible to evaluate or provide a substantial interpretation
of the size of the effect.
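The equity argument can be made concrete with a standardized effect size. The sketch below, with invented numbers, computes a Cohen's d style achievement gap between two socio-economic groups; only when the same quantity is computed for other systems does "large" or "small" acquire meaning.

```python
from math import sqrt
from statistics import mean, variance

def standardized_gap(group_a, group_b):
    """Mean achievement difference between two groups, expressed in
    pooled within-group standard deviation units (Cohen's d). Without
    comparable figures from other systems, the raw value is hard to
    judge as large or small."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a)
                  + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Invented scores for students from high- and low-SES homes
high_ses = [520, 540, 560, 580, 600]
low_ses = [470, 490, 510, 530, 550]
gap = standardized_gap(high_ses, low_ses)  # a 50-point gap, about 1.6 SD units
```

A gap of 1.6 standard deviations means little in isolation; placed beside the corresponding gaps of other participating systems, it becomes interpretable as comparatively large or small.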
Common ground for the two purposes?
In order to better understand the possible tensions between Purposes I and II,
Jenkins (2000), drawing on Loving & Cobern (2000) and Huberman (1994), offers
an interesting starting point. He suggests that educational researchers
and policymakers not only have different agendas, but also live within
different knowledge systems, and therefore:

The knowledge produced within one system and for the one set of purposes cannot
normally be readily transferred to another. (Jenkins, 2000, p. 18)
Jenkins does not provide a definition of the concept of a knowledge system, and he
does not identify more specific aspects of the two knowledge systems claimed
to be so different. Furthermore, he does not offer a solution for how
the problem described in the above quote may be remedied.
One very obvious difference in the way researchers and policymakers
approach knowledge is of course that the latter are to a much larger extent
confronted with decision-making. This entails at least two characteristics
of the knowledge seen as relevant. Firstly, decisions are bound by time. The
pace of decision-making is usually much faster than the timelines of most researchers.
It is therefore likely that, due to the pressure to produce policy in
a short time (before the next election), knowledge that may be digested
and understood without occupying too much time is considered more
relevant by the policymaker. Secondly, knowledge that is likely to be true (analogous
to evidence that will 'hold up in court') is generally more appreciated
when confronted with the realities of decision-making. Given these criteria,
it is easy to imagine that numbers and quantitative measures are deemed
more appropriate than thick and rich qualitative descriptions.
The OECD, the UN and other international organisations play important
roles by being engaged in both these knowledge systems. In contrast to the national
policy level, these international organisations have been given mandates
that are relatively stable across time, and among several functions, they have
been given the role of providing continuous policy analysis within a longer
time frame. This function is particularly visible in the mandate of the OECD.
PISA is therefore an interesting case regarding the issue of how educational researchers
and policymakers may operate in a joint knowledge space. Through
many of its educational initiatives, the OECD aims at establishing procedures
and arenas for the dissemination of educational research to policymakers.
Conversely, through the same arenas, policymakers are able to communicate
their needs for information on which to base their decisions. This
is at least part of the solution for how an efficient transfer of information
back and forth between the two knowledge systems might be achieved.
This means that the overall aim of the PISA study is very much aligned with
how policymakers define and justify educational outcomes. It also means
that the cognitive measures are contextualised by variables perceived by the
policy level to be of importance.
In summary, the second purpose of effective policy development is in many
respects compatible with the aims of the researchers who established the IEA
and conducted the first surveys (Purpose I). Policy issues such as how one can
improve the conditions for comprehensive and equitable education
are, for instance, also core issues in much pedagogical and didactical research.
The difference is that within Purpose II the comparative studies are no
longer principally considered as basic research in education. This is not to
say that they can no longer be used to study fundamental issues in educational
research. However, in the international comparative large-scale studies, such
research issues have gradually been awarded lower priority in the shaping of
the studies. The research purpose has to some extent become secondary to the
primary purpose, which is to monitor and benchmark the outcomes
of educational systems in order to inform policymakers.
Part II: Beyond the primary purpose of the comparative studies
In the first part of this chapter, I have argued that educational researchers would
be well advised to engage in studies like PISA since they provide communicative
tools for the interchange of relevant knowledge with policymakers. However,
there is also a more direct argument as to why researchers in education could
be highly motivated to take part in or follow up these studies, which is the
issue to be discussed in the remaining part of the chapter: these studies
provide valuable and unique data that may be used as a basis
for secondary analyses. This research activity may range from theoretical contributions
to secondary analysis of the data or the documents accompanying
these data (e.g. analyses of instruments and items, or analyses of the theoretical
framework and rationale underlying the studies).
A number of slightly different definitions of the term 'secondary analysis'
have been suggested in the literature on research designs in the social
sciences. They usually focus on the fact that secondary analyses are analyses of
already existing data, conducted by researchers other than those who originally
collected the data, and with a purpose that most likely was not included in the
original design leading to the data collection. The definition best suited
for the following discussion is most likely the one suggested
by Bryman (2004):

Secondary analysis is the analysis of data by researchers who will probably not have
been involved in the collection of those data for purposes that in all likelihood were
not envisaged by those responsible for the data collection. (p. 201)
This definition also opens up the possibility that the original researchers may
be involved in secondary analysis, and furthermore, that the purpose of the
secondary analysis may have been included in the original research design.
The latter point is highly relevant for many of the large-scale official surveys
of different aspects of social life (e.g. different types of household surveys
conducted regularly in many nations), many of which may be considered as
having multiple purposes (Burton, 2000; The BMS, 1994), and where the potential
for secondary analysis by social scientists is an important part of the
primary design.
There are a number of perfectly sound reasons why many researchers
give priority to collecting their own data instead of analysing already collected
data. The primary reason is that 'the scientific approach' may to some extent
be pragmatically defined by a methodology starting with the posing of research
questions and hypotheses. Data collected by others were collected with other specific
questions or hypotheses in mind, and it may therefore be difficult to use
these data to analyse other issues. Secondly, there are often many technical obstacles
to using data collected by others: they might not be publicly available;
they may lack the documentation necessary to understand the data (e.g.
a comprehensive codebook); or the data may require technical analytical skills
beyond those of most researchers. Thirdly, there may be ideological reasons
for not wanting to base research on data collected by national or international
organisations primarily for policy analyses. Some of these
issues are also conditions that limit the potential for using data from the comparative
studies in secondary research.
However, I would argue that the benefits of such secondary analysis
strongly outweigh the limiting conditions. First of all, the data provided by
these studies have qualities not often seen in educational research. The primary
reason for this claim is that the quality is documented in unprecedented detail.
In the technical reports for the PISA surveys (Adams & Wu, 2002; OECD,
2005b), all the procedures for instrument development, sampling, marking
and data adjudication are thoroughly described. Such reports make it
clear that the PISA study (and other LINCAS) is based on the following:
– very clearly defined populations and adequate routines for sampling these
populations in all participating countries;
– well-developed frameworks and instruments, including documentation of
the quality of the translation into the different languages;
– well-developed and controlled routines for ensuring that the administration
of the test was the same in all countries; and
– well-developed routines and quality monitoring of how student responses
were scored as well as how the data were entered and further processed.

Gathering data with procedures like these is not usually possible in ordinary
(low-cost) research, which brings us to the second argument for why the data
should be used more: millions of dollars or euros have been spent on producing
these high-quality databases. Samples have been established and the
instruments distributed to the students and back to the research centres in a
way that ensures a certain degree of quality and comparability. Furthermore,
the data has been assembled and restructured through skilful work by experts
in data processing and measurement to further secure the quality of the information
available. Nevertheless, relatively little money is spent on the analysis
of the data, as most of the money has been spent on gathering and processing it.
Evidently, investing more in further analyses of the data would be a financially
sound idea.
Thirdly, data from PISA has been made publicly available (although some
of the achievement items are kept secure for future use), and researchers interested
in using the data can get access to it through a number of channels. 3
In order to make the database accessible, PISA has even developed a thorough
manual for how to analyse the data (OECD, 2005a). An even better approach
would be to engage in a dialogue with the national centre. Through this contact
it could be possible to get advice and access to material that is otherwise
not so readily available.

A fourth argument for why researchers in education should be keen to
use data from PISA and similar studies is that these data are perhaps the
single most influential knowledge base for decision-making and political argumentation
about educational issues in many countries, and as such they
should be scrutinised from a multitude of perspectives. Even if one suspects
that the data may be affected and encased by certain ideologies, secondary
analysis of the data can in many cases be used to document such a relationship
(Pole & Lampard, 2002). Data from LINCAS and documents describing or reporting
outcomes of the studies need informed reviews from scholars who can
frame the data and documents differently and thus offer both new interpretations
and criticism. To a large extent LINCAS are regularly exposed to such criticism.
Some of this feedback concerns ideological aspects of the studies (e.g.
Atkin & Black, 1997; Brown, 1998; Goldstein, 2004a; Keitel & Kilpatrick,
1999; Kellaghan & Greaney, 2001; Orpwood, 2000; Reddy, 2005). Other critical
remarks are more specifically related to methodological issues (e.g. Blum,
Goldstein, & Guerin-Pace, 2001; Bonnet, 2002; Freudenthal, 1975; Goldstein,
1995, 2004b; Harlow & Jones, 2004; Prais, 2003; Wang, 2001), and no doubt,
this book adds to this collection of critical notes.

3 For access to the PISA data, see http://www.pisa.oecd.org
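One practical point for anyone attempting such secondary analyses: students in these databases do not enter the sample with equal probability, so population estimates must use the sampling weights shipped with the data. A minimal, hypothetical sketch (the numbers are invented; the actual weighting and variance-estimation procedures are documented in the analysis manual cited above):

```python
def weighted_mean(values, weights):
    """Population estimate of a mean from a weighted sample: each
    student's value counts in proportion to the number of students
    in the population that he or she represents."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total

# Invented micro-example: two students, the second representing
# three times as many students in the population as the first
scores = [400, 500]
weights = [1.0, 3.0]
est = weighted_mean(scores, weights)  # 475.0, not the unweighted 450.0
```

Ignoring the weights silently biases every estimate towards the over-sampled groups, which is one reason the documentation accompanying these databases deserves as much attention as the data themselves.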
Finally, since the results from the studies are mainly used to inform policy
at the national level, it is necessary to conduct discussions on how the results
may be used to evaluate the national school system. In order for comparative
studies to provide an even better basis of information for this discussion, it
may be necessary to develop specific national designs. This would ensure that
one could obtain information seen as vital in the national context. Germany is
the prime example of a country which has emphasised the national dimension
by implementing several national extensions to the PISA study. In Germany
participating students respond to additional nationally developed instruments,
and the country also has an extended sample in order to cover the educational
system in each of the partially autonomous states (Länder) (Stanat et al.,
2002). These extended efforts in Germany have increased researchers' engagement
with the data, as can be seen from the number of articles discussing PISA
in German academic journals in education. They have also boosted public
awareness and debate about educational issues in general. To a somewhat
lesser extent the situation is similar in Norway.
Targeting research questions in education
The above discussion mainly presented arguments emanating from the studies
themselves regarding why data from large-scale international comparative
achievement studies should be the subject of secondary analyses. Nevertheless,
the main reason why educational researchers could be motivated to invest their
own time and resources in secondary analyses of such data is that the data may
be used to address research questions of importance. In the remainder of the
chapter, I will therefore turn to the more specific question of how these data
may be used to target research questions in education.

I will suggest that most of the secondary research using data from these
studies can be classified into one of six generic types of research designs or
methodological approaches. The sequence in which the six generic
research designs are presented does not suggest a priority. Furthermore, the intention
is not to provide an exhaustive list of possible secondary research issues
to be addressed by this data. Nor is it suggested that these generic
types form a typology of mutually exclusive categories. Typically, much secondary
analysis would relate to several of the six headings. Finally, the author's
own background accounts for why studies within science education prevail in
the references given in the forthcoming discussion. The purpose of the six suggested
categories is rather to provide a provisional framework, at some level of
generality, for what secondary research relating to studies like PISA may look
like.
Using data, results, or interpretations as a background
Secondary analysis of already existing data, results or their interpretations may
be included as a somewhat peripheral part of a research design. The original
study referred to in this type of research may provide the background or major
referent for generating hypotheses and research questions, it may provide data
or findings with which to contrast or triangulate other data or findings, or it may
be part of the basis for theoretical argumentation or deliberation on educational
issues. In this type of research the aim is usually to go behind the data in order
to develop thicker and richer descriptions and analyses of issues derived from
findings of the international studies.
One example of this type of work (albeit in a Norwegian context) is the research project entitled PISA+.4 The researchers involved in the project use transcripts of videotapes from classrooms, covering several hours of activities, as their primary data source. It is therefore clearly not a secondary analysis of data from PISA. But as the title of the research project reflects, it was triggered by some of the findings from PISA in a Norwegian context that needed follow-up (hence the plus sign in the title). Other types of research in which the focus is on how phenomena change over time, or on how one group of respondents compares to another, may also use data or findings from comparative studies as a background. In some of these cases the international comparative studies can provide a baseline or benchmark to which the researchers’ own data may be related. For such a purpose it would, strictly speaking, be necessary to use partly identical instruments and similar routines for collecting and processing the data. One specific example is the use of items from TIMSS 1995 in an evaluation of scientific achievement in Norway before and after the curriculum reform of 1997 (Almendingen, Tveita, & Klepaker, 2003). In addition, as mentioned above, findings from PISA may
4 See http://www.pfi.uio.no/forskning/forskningsprosjekter/pisa+/ for a description
278 ROLF V. OLSEN
be used as one of the key referents for a theoretical deliberation on educational issues. This seems to be the case for a substantial number of articles discussing educational (and more general social) issues in the German context in recent years (e.g. Opitz, 2006; Pongratz, 2006; Sacher, 2003; von Stechow, 2006).
The next two types in this generic scheme are related to the fact that the primary units of analysis in the comparative studies are broad and overarching aggregates of the two main dimensions in the data matrix (persons and items). The persons are sampled to study the population of interest, and these populations are described by composite and broad measures, which have been constructed by aggregating several items. These constructs or traits are measures of students’ achievements in broadly defined domains (e.g. science, mathematics, and reading) as well as contextual descriptors (e.g. socioeconomic status, interest, motivation and learning strategies). It is therefore natural to suggest two classes of designs for secondary analyses, related to the deconstruction of the respective two axes of the data matrix.
In-depth analyses of certain variables
Among the most frequently reported secondary analyses are those aiming at presenting a more finely tuned picture by studying more narrowly defined traits or even single items. This type of analysis utilises information in the data that is not included in the analyses of the total test scores (Olsen, 2005). Several relevant examples may be mentioned. Turmo (2003b) reported on qualitative aspects of students’ responses to a few single cognitive items from the PISA 2000 study related to the environmental issue of the depletion of the ozone layer, relating the types of responses to published research in science education. Similar studies of data from TIMSS 1995 exist in abundance (e.g. Angell, 2004; Dossey, Jones, & Martin, 2002; Kjærnsli, Angell, & Lie, 2002).
In the same manner, the student questionnaire data may be analysed in depth by selecting one or a few variables in a more narrowly targeted analysis, including discussions and alternative interpretations in light of other theoretical or methodological positions. Papanastasiou et al. (2003) have, for instance, carried out an in-depth analysis, based on data from PISA 2000, of the relationship between the use of computers and scientific literacy in the US. Gorard & Smith (2004) used other data from the same study to compute several indexes of segregation within the European Union countries, and these indexes supplement the selection of indicators reported in the official OECD publications of the PISA data. Thematically, both of these latter studies belong to the primary intent of the comparative study to which they are related, and as such they exemplify that the line of demarcation between ‘secondary analysis’ and ‘primary analysis’ is not easy to draw.
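Segregation indexes of the kind Gorard & Smith computed are typically variants of measures such as the Duncan dissimilarity index. The following is a minimal sketch with invented school counts, not a reconstruction of their actual computation:

```python
# Duncan & Duncan dissimilarity index: the share of disadvantaged
# students who would have to change schools for every school to
# mirror the overall composition. Counts are made up for illustration.

def dissimilarity_index(schools):
    """schools: list of (n_disadvantaged, n_other) tuples, one per school."""
    total_d = sum(d for d, o in schools)
    total_o = sum(o for d, o in schools)
    return 0.5 * sum(abs(d / total_d - o / total_o) for d, o in schools)

schools = [(30, 70), (10, 90), (60, 40)]  # hypothetical school compositions
index = dissimilarity_index(schools)
```

The index ranges from 0 (every school mirrors the overall composition) to 1 (complete separation); the invented counts above yield 0.4.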
In-depth analyses of a sub-sample of students
Another type of secondary research is, as briefly stated above, analyses in which the person axis of the data matrix is deconstructed. This is a very fruitful approach for targeting many specific issues in educational research. Many of the data sets from these studies are so large that the researcher may extract a subset of respondents with similar characteristics.
One may, for instance, conduct an in-depth analysis of ethnic minority groups (Roe & Hvistendahl, 2006; Sohn & Ozcan, 2006). The fact that the OECD has recently published a supplementary thematic report on this issue (OECD, 2006) exemplifies that a clear-cut line of demarcation between primary and secondary analysis does not always exist. Another study which illustrates extremely well the possibilities for using the samples from LINCAS to address issues specific to marginal groups in the population is that by Mullis and Stemler (2002). They used the original sample from TIMSS 1995 to identify gender differences for high-achieving upper secondary students in mathematics. The reason such highly specific subgroups can be studied is, of course, that the samples used in the studies are very large, so one may actually select the students above the 75th percentile, further divide these by gender (as was done in this particular study), and still have samples of adequate power. Thus, using data describing students’ backgrounds, attitudes and/or achievements, it is possible to construct any number of subgroups relevant to the specific research issue at hand.
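The subgroup construction described above amounts to simple filtering on the person axis. A minimal sketch in plain Python, using invented records and hypothetical field names (`score`, `gender`) rather than the actual PISA variable layout:

```python
# Keep students above the 75th percentile of achievement, then split by
# gender. Records and field names are made up for illustration only.

def top_quartile_by_gender(students):
    scores = sorted(s["score"] for s in students)
    cutoff = scores[int(0.75 * len(scores))]  # 75th percentile, nearest-rank rule
    top = [s for s in students if s["score"] > cutoff]
    groups = {"F": [], "M": []}
    for s in top:
        groups[s["gender"]].append(s)
    return cutoff, groups

# Synthetic sample of 20 students with evenly spaced scores
students = [{"score": 400 + 10 * i, "gender": "F" if i % 2 else "M"}
            for i in range(20)]
cutoff, groups = top_quartile_by_gender(students)
```

A real analysis would additionally apply the student sampling weights before drawing conclusions about the population, but the selection logic itself is this simple.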
A more specific and narrow comparative outlook on the data
The fourth class of secondary analysis consists of studies aiming at giving a more finely tuned comparative view by selecting only a few countries (or merely one country) as the unit of analysis. These studies relate to a long-standing tradition within comparative research in education.
When comparing a smaller selection of countries, two rather different strategies for selecting countries have proven fruitful. In one strategy, countries are selected to represent rather divergent educational systems. Having a sample of educational systems that differ from each other along policy-relevant variables is at the heart of the idea of comparative studies. One of the most powerful international studies using this strategy is the TIMSS videotape study (Stigler & Hiebert, 1999). Even if this was not a secondary analysis of data from TIMSS, but rather an independent study conducted and analysed simultaneously, it is a prime example of how studying divergent educational systems may uncover hidden assumptions and tacit features of the participating countries’ teaching practice. One example of a secondary analysis of PISA data is the comparison of mathematics teaching in Finland and France, to which a recent conference was entirely devoted.5 A third illustrative and interesting example is the comparison of mathematics achievement in PISA in Brazil, Japan and Norway (Güzel & Berberoglu, 2005).
The other strategy is to compare convergent educational systems. Naturally, such studies are often regional studies of neighbouring countries. Examples of regionally focused studies are the reports issued by Nordic researchers working on the PISA data (Lie, Linnakylä, & Roe, 2003; Mejding & Roe, 2006), and a similar report by researchers in several Eastern European countries based on data from TIMSS 1995 (Vári, 1997). A third example is the study by Wößman (2005), using data from TIMSS, on the impact of family background in the East Asian countries. There are several reasons why regionally focused reports and studies are valuable. First of all, comparing countries with certain common cultural features in a wider sense (be they historical, political and/or linguistic) implies that more factors may be controlled, which is imperative when studying naturally occurring phenomena with a comparative survey design. Secondly, in comparisons between neighbouring or linguistically similar countries, the possible measurement errors due to item-by-country interactions are also reduced (Wolfe, 1999). Moreover, from a policy perspective such comparisons are more likely to produce fruitful recommendations for decision-making, since neighbouring countries, such as the Nordic ones, often have an institutionalised and continuous exchange of policy.
The comparative basis may be reduced even further, to case studies of single countries. Obviously, the national reports that are developed in most participating countries are to a large degree examples of such studies. However, these are mainly public reports targeting a wide group of prospective readers. Hence, parts of the analyses presented in these reports should be transformed into a format aimed at an international audience of researchers and capable of withstanding the scrutiny of peer review. This type of secondary analysis also
5 See http://smf.emath.fr/VieSociete/Rencontres/France-Finlande-2005/
includes studies aiming at linking the international studies to either the national curriculum or prevalent ideologies in the participating country. Numerous examples of such studies exist. One recent French contribution is the chapter by Bodin in this book, in which he analyses the degree of correspondence between the French mathematics syllabus, the French grade 9 national exam (Brevet) and the mathematics assessment in PISA. Yet other examples are papers discussing the case of Finland, which has quite understandably received a lot of attention given its performance on the PISA assessments (Linnakylä & Välijärvi, 2005; Sahlberg, 2007; Simola, 2005; Välijärvi et al., 2002).
Two final examples highlight why a national focus on the data from these studies may clearly have wider implications. Ichilov (2004) used data from the IEA Civic Education Study (CivEd) to report on civic orientations in Hebrew and Arab schools in Israel, an issue which (unfortunately) is extremely relevant for the international community. Howie (2004) and Reddy (2005) have used the case of TIMSS in South Africa to reflect upon and question the value of participating in international comparative studies for developing countries, particularly when students’ mother tongues are not used as the language of the test. What these examples have in common is that what at the outset seem to be issues primarily of national interest may be highly relevant contributions to educational research in general.
Combining data from one study with other sources of information
Many countries participate in several studies, and secondary analyses seeking to combine, contrast or synthesise information across studies would be valuable contributions. Furthermore, efforts should be made to combine quantitative results from a study like PISA with other supplementary pieces of information. These supplements may well be of a qualitative nature. However, this is methodologically challenging, since it is not always clear how to combine the information formally through a common unit of analysis.
The most obvious possibility for linking different international surveys is to use results aggregated to the country level. One successful example is the study reported by Kirkcaldy et al. (2004) on the relationship between health efficacy, educational attainment and well-being. This study combined data from PISA with data provided by the World Health Organisation, the United Nations, and other sources.
Another possibility for combining data (perhaps the primary candidate, given the discussion above) would be to find ways of combining the data from PISA and TIMSS to address issues in mathematics or science education (Olsen, 2006; Olsen & Grønmo, 2006; Olsen, Kjærnsli, & Lie, 2007), and the data from PISA and PIRLS6 to address the issue of reading skills (e.g. Becker & Schubert, 2006). This is not a straightforward task, since these studies, even if they partially overlap in the content assessed, differ in many other ways, including the ages and grade levels of the students. However, it is in principle possible to use data aggregated to the country level in order to explore and describe typical features of students’ achievements, attitudes, motivation, and background in different countries. Furthermore, it is highly recommended to gather complementary data to help establish links between different studies. This was, for instance, done in a Danish study in which the PISA reading literacy measure from 2000 was formally linked to the IEA Reading Literacy Study (which has later become known as PIRLS) from 1991. This made it possible to compare the two measurements; more significantly, it was possible to develop a measure of change in reading literacy for Danish students from 1991 to 2000 (Allerup & Mejding, 2003). A less stringent way of linking the studies to each other would be to compare the documents describing them and to examine the match between the different studies’ frameworks and item pools. For example, several comparisons of TIMSS and PISA have documented how these two surveys differ in their conceptualisation of mathematics and science (Grønmo & Olsen, in press; Hutchison & Schagen, in press; Neidorf, Binkley, Gattis, & Nohara, 2006; Neidorf, Binkley, & Stephens, 2006; Olsen, 2005; Olsen et al., 2007).
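As a minimal illustration of linking at the country level, the sketch below aggregates student records from two hypothetical studies to country means and correlates the means. All country codes and scores are synthetic; a real linking study would also have to handle sampling weights and plausible values:

```python
# Aggregate each study's (country, score) records to country means,
# then compute the Pearson correlation between the two sets of means.
# Data are invented for illustration only.
from collections import defaultdict
from statistics import mean

def country_means(records):
    by_country = defaultdict(list)
    for country, score in records:
        by_country[country].append(score)
    return {c: mean(v) for c, v in by_country.items()}

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

study_a = [("NOR", 500), ("NOR", 520), ("FIN", 540),
           ("FIN", 560), ("BRA", 380), ("BRA", 400)]
study_b = [("NOR", 495), ("NOR", 505), ("FIN", 555),
           ("FIN", 545), ("BRA", 390), ("BRA", 410)]

a, b = country_means(study_a), country_means(study_b)
common = sorted(set(a) & set(b))  # only countries present in both studies
r = pearson([a[c] for c in common], [b[c] for c in common])
```

The country mean is the common unit of analysis here; once both studies are reduced to it, any standard comparative technique applies.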
Approaching the data with other methodological tools
This is not a specific methodological approach in the way the previous categories are. It is rather a category collecting the many studies that apply a different methodological approach to the data. Although these studies often result in alternative interpretations of the data, their additional aim is often to comment on the consequences of the methods used. In recent years there has, for instance, been a growing recognition of the hierarchical structure of educational achievement data, in which students are nested within classes, which are nested within schools, which in turn are nested within regions, etc. (e.g. Malin, 2005; O’Dwyer, 2002; Ramírez, 2006; Schagen, 2004). By applying specialised statistical tools, it is possible to impose this structure while modelling the data.
6 Progress in International Reading Literacy Study
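A first quantitative handle on this hierarchical structure is the intraclass correlation: the share of score variance lying between schools rather than within them. The sketch below estimates it from a one-way ANOVA decomposition on made-up, balanced data; the multilevel analyses in the cited studies use dedicated modelling software, so this is only an illustration of the underlying idea:

```python
# Intraclass correlation (ICC) from a one-way ANOVA decomposition,
# assuming balanced groups (equal students per school) for simplicity.
# School score lists are invented for illustration.
from statistics import mean

def intraclass_correlation(schools):
    """schools: list of lists of student scores, one inner list per school."""
    grand = mean(s for school in schools for s in school)
    n = len(schools[0])  # students per school (balanced design assumed)
    k = len(schools)
    between_ms = sum(n * (mean(sch) - grand) ** 2 for sch in schools) / (k - 1)
    within_ms = sum((s - mean(sch)) ** 2 for sch in schools for s in sch) / (k * (n - 1))
    var_between = max((between_ms - within_ms) / n, 0.0)
    return var_between / (var_between + within_ms)

schools = [[480, 500, 520], [540, 560, 580], [400, 420, 440]]
icc = intraclass_correlation(schools)
```

In these invented data almost all variance lies between schools (ICC ≈ 0.92); in real achievement data the ICC differs strongly between countries, which is precisely why the nesting cannot be ignored.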
Furthermore, some observers of the comparative studies question the requirement of uni-dimensionality of the measurements, and have analysed the data sets from some of these studies by allowing for multiple dimensions in the data (Blum et al., 2001; Gustafsson & Rosén, 2004). Related to this are a number of studies using cluster analysis in order to study distinct differences in achievement profiles across the cognitive items for clusters of empirically related countries (e.g. Angell, Kjærnsli, & Lie, 2006; Grønmo, Kjærnsli, & Lie, 2004; Lie & Roe, 2003; Olsen, 2006).
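Such cluster analyses typically start from a matrix of per-item percent-correct values (p-values) for each country. A minimal sketch with invented profiles: compute pairwise Euclidean distances between country profiles and pick the most similar pair:

```python
# Find the pair of countries with the most similar achievement profile.
# Profiles are hypothetical percent-correct values on four items,
# not real PISA or TIMSS p-values.
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+

profiles = {
    "NOR": [0.62, 0.55, 0.48, 0.70],
    "SWE": [0.60, 0.57, 0.50, 0.68],
    "JPN": [0.75, 0.80, 0.72, 0.66],
}

closest = min(combinations(profiles, 2),
              key=lambda pair: dist(profiles[pair[0]], profiles[pair[1]]))
```

With these invented numbers the two Nordic profiles come out as the closest pair, which is the kind of empirically related grouping the cited studies report; a full cluster analysis simply extends this pairwise comparison into a hierarchical or k-means grouping over all countries.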
Others have used latent variable or latent group modelling techniques in their approach to the data (e.g. Hansen, Rosén, & Gustafsson, 2004; C. Papanastasiou & Papanastasiou, 2006; Wolff, 2004). These are merely some examples of alternative methodological approaches. The fruitfulness of such studies lies both in the fact that they may utilise aspects of the data beyond its primary purpose, and in the fact that they may be regarded as competing hypotheses about how to model and interpret such data.
Concluding remarks
In this chapter I have argued that data from the large-scale international achievement studies should be valued as an important resource for researchers in the educational sciences. In the first part I gave a condensed presentation of how these large-scale international studies of students’ achievements were created and are still shaped by two visions or purposes. Originally, the studies were conceived of as tools for conducting fundamental research (Purpose I). This vision was gradually adopted, absorbed and transformed by educational policymakers into a vision in which these studies were regarded as one of the primary tools for monitoring educational systems (Purpose II). As a consequence, I argued that researchers who engage themselves in these studies gain access to an arena for the exchange of ideas with educational policymakers. This argument is particularly valid for PISA, which to a larger extent than other similar studies is defined as a joint venture between policymakers and researchers, set up within the organisational frame of the OECD with the active involvement of both groups.
In the second part of this chapter, the call for researchers to engage in secondary analysis of data from the international comparative achievement studies was argued from within the studies themselves. Specifically, it was argued that the data sets offered by the studies are complex and multifaceted, and that it should thus be possible to target a range of fundamental issues in educational research through secondary analyses of the data. This argument is strengthened by the fact that the data is of exceptionally high quality and is publicly available. Moreover, it was argued from an economic perspective that, since so much money has been spent on creating these data sets, any additional resources put into secondary analytical research would ensure an even better return on the investment. This argumentation is equally valid for several studies.
Having presented an argument for why researchers in education could or should be interested in utilising the data, the second part turned to the issue of what types of secondary analysis are possible or viable. Six generic approaches to secondary use of the data in research were suggested, accompanied by references to a diverse range of secondary analyses in order to document and exemplify the possibilities for conducting such analyses. These references are only a fraction of the available academic literature that utilises information from PISA and other similar studies. Searching international bibliographic databases for the term PISA in the title, keywords or abstract gave more than 600 hits when combining two of the most comprehensive databases of literature in educational research (ERIC and the ISI Web of Knowledge). Deleting duplicates and other false positives brings this down to approximately 250 hits in the period 2001-2007. Of these, approximately 50 entries are references to what should be labelled primary analysis (national and international reports written as an intended part of the study), bringing the total down to about 200. At the same time, it is obvious that not all published secondary analytical work is included in the databases (false negatives); several of the references included in this chapter, for instance, are not found in this bibliography. It is therefore reasonable to claim that secondary analysis of the PISA data is a vital field of research. I doubt that any other data set within educational research has been analysed by so many people from so many diverse perspectives. A more detailed and systematic analysis of this bibliographical database will be conducted in the future in order to give a more comprehensive synthesis of how data from PISA is used by researchers.
In order to take further advantage of the PISA data, governments should consider allocating resources for further analyses of the data sets, especially analyses that help relate the data to the national context. For instance, in Norway funds have been made available so that the primary researchers may spend some time developing and publishing research going beyond the commissioned reports. Furthermore, in many countries funds have been allocated to facilitate the use of data from PISA by students as the basis for their Master’s or Doctoral theses, particularly in countries where the national institution responsible for the study is located at a university. To continue with the case of Norway as an example, several doctoral dissertations have been produced based on what could be labelled secondary analysis of data from TIMSS and PISA (Angell, 1996; Isager, 1996; Kind, 1996; Olsen, 2005; Turmo, 2003a).
Hopefully, this chapter may motivate and help engage researchers to explore the possibilities for making use of the resources offered by the large-scale international studies in education and, subsequently, to make such analyses available through internationally accepted publications. Furthermore, the suggested framework of types of analyses, ranging from using the data merely as a referent for the collection of other data to sophisticated modelling of the data, will hopefully provide a certain amount of guidance with respect to what such secondary analyses may look like. The many examples provided should also be regarded as an initial source of inspiration for researchers who would welcome taking on this challenge.
Acknowledgement
This manuscript has been adapted from the article Olsen, R. V. & Lie, S. (2006). Les évaluations internationales et la recherche en éducation: principaux objectifs et perspectives. Revue française de pédagogie, No 157, pp. 11-26, and I wish to thank the editors of the journal for giving their permission for this adaptation. Fellow author Svein Lie has also kindly agreed that I may publish the present adapted version.
References
Adams, R., & Wu, M. (Eds.). (2002). PISA 2000 technical report. Paris: OECD Publications.
Alexander, R., Broadfoot, P., & Phillips, D. (Eds.). (1999). Learning from comparing: New directions in comparative educational research. Volume 1: Contexts, classrooms and outcomes. Oxford: Symposium Books.
Alexander, R., Osborn, M., & Phillips, D. (Eds.). (2000). Learning from comparing: New directions in comparative educational research. Volume 2: Policy, professionals and developments. Oxford: Symposium Books.
Allerup, P., & Mejding, J. (2003). Reading achievement in 1991 and 2000. In S. Lie, P. Linnakylä & A. Roe (Eds.), Northern Lights on PISA (pp. 133-145). Oslo: Department of Teacher Education and School Development, University of Oslo.
Almendingen, S. B. M. F., Tveita, J., & Klepaker, T. (2003). Tenke det, ønske det, ville det med, men gjøre det . . . ?: en evaluering av natur- og miljøfag etter Reform 97. Nesna: Høgskolen i Nesna.
Andrich, D. (1988). Rasch models for measurement. Newbury Park/London/New Delhi: Sage Publications.
Angell, C. (1996). Elevers fysikkforståelse. En studie basert på utvalgte fysikkoppgaver i TIMSS. Oslo: Det matematisk-naturvitenskapelige fakultet, Universitetet i Oslo.
Angell, C. (2004). Exploring students’ intuitive ideas based on physics items in TIMSS 1995. In C. Papanastasiou (Ed.), Proceedings of the IRC-2004 TIMSS (pp. 108-123). Nicosia: Cyprus University Press.
Angell, C., Kjærnsli, M., & Lie, S. (2006). Curricular and cultural effects in patterns of students’ responses to TIMSS science items. In S. J. Howie & T. Plomp (Eds.), Contexts of learning mathematics and science: Lessons learned from TIMSS (pp. 277-290). London: Routledge.
Atkin, J. M., & Black, P. (1997). Policy perils of international comparisons: The TIMSS case. Phi Delta Kappan, 79(1), 22-28.
Becker, R., & Schubert, F. (2006). Soziale Ungleichheit von Lesekompetenzen: Eine Matching-Analyse im Längsschnitt mit Querschnittsdaten von PIRLS 2001 und PISA 2000 [Social inequality of reading literacy: A longitudinal analysis with cross-sectional data from PIRLS 2001 and PISA 2000 utilizing the pair-wise matching procedure]. Kölner Zeitschrift für Soziologie und Sozialpsychologie, 58(2), 253-284.
Blum, A., Goldstein, H., & Guerin-Pace, F. (2001). International Adult Literacy Survey (IALS): An analysis of international comparisons of adult literacy. Assessment in Education, 8(2), 225-246.
Bonnet, G. (2002). Reflections in a critical eye: On pitfalls of international assessment. Assessment in Education, 9(3), 387-399.
Bos, K. T. (2002). Benefits and limitations of large-scale international comparative achievement studies: The case of IEA’s TIMSS study. Unpublished PhD thesis, University of Twente.
Brown, M. (1998). The tyranny of the international horse race. In R. Slee, G. Weiner & S. Tomlinson (Eds.), School effectiveness for whom? Challenges to the school effectiveness and school improvement movements (pp. 33-47). London: Falmer Press.
Bryman, A. (2004). Social research methods (2nd ed.). Oxford: Oxford University Press.
Burton, D. (2000). Secondary data analysis. In D. Burton (Ed.), Research training for social scientists (pp. 347-360). London: Sage Publications.
Carnoy, M. (2006). Rethinking the comparative – and the international. Comparative Education Review, 50(4), 551-570.
Dossey, J. A., Jones, C. O., & Martin, T. S. (2002). Analyzing student responses in mathematics using two-digit rubrics. In D. F. Robitaille & A. E. Beaton (Eds.), Secondary analysis of the TIMSS data (pp. 21-45). Dordrecht: Kluwer Academic Publishers.
Freudenthal, H. (1975). Pupils’ achievements internationally compared – the IEA. Educational Studies in Mathematics, 6, 127-186.
Goldstein, H. (1995). Interpreting international comparisons of student achievement (Vol. 63). Paris: UNESCO Publishing.
Goldstein, H. (2004a). Education for all: The globalization of learning targets. Comparative Education, 40(1), 7-14.
Goldstein, H. (2004b). International comparative assessment: How far have we really come? Assessment in Education, 11(2), 227-234.
Gorard, S., & Smith, E. (2004). An international comparison of equity in education systems. Comparative Education, 40(1), 15-28.
Grønmo, L. S., Kjærnsli, M., & Lie, S. (2004). Looking for cultural and geographical factors in patterns of response to TIMSS items. In C. Papanastasiou (Ed.), Proceedings of the IRC-2004 TIMSS (Vol. 1, pp. 99-112). Nicosia: Cyprus University Press.
Grønmo, L. S., & Olsen, R. V. (in press). TIMSS versus PISA: The case of pure and applied mathematics. In Unknown (Ed.), Unknown. Washington, DC.
Gustafsson, J.-E., & Rosén, M. (2004). The IEA 10-year trend study of reading literacy: A multivariate reanalysis. In C. Papanastasiou (Ed.), Proceedings of the IRC-2004 (Vol. 3, pp. 1-16). Nicosia: Cyprus University Press.
Güzel, C. I., & Berberoglu, G. (2005). An analysis of the Programme for International Student Assessment 2000 (PISA 2000) mathematical literacy data for Brazilian, Japanese and Norwegian students. Studies in Educational Evaluation, 31(4), 283-314.
Hansen, K. Y., Rosén, M., & Gustafsson, J.-E. (2004). Effects of socioeconomic status on reading achievement at collective and individual levels in Sweden in 1991 and 2001. In C. Papanastasiou (Ed.), Proceedings of the IRC-2004 PIRLS (Vol. 3, pp. 123-139). Nicosia: Cyprus University Press.
Harlow, A., & Jones, A. (2004). Why students answer TIMSS science test items the way they do. Research in Science Education, 34(2), 221-238.
Howie, S. J. (2004). TIMSS in South Africa: The value of international comparative studies for a developing country. In D. Shorrocks-Taylor & E. W. Jenkins (Eds.), Learning from others (Vol. 8). Dordrecht: Kluwer Academic Publishers.
Huberman, M. (1994). The OERI/CERI seminar on educational research and development: A synthesis and commentary. In T. M. Tomlinson & A. C. Tuijnman (Eds.), Education research and reform: An international perspective (pp. 45-66). Washington, DC: OECD Centre for Educational Research and Innovation/US Department of Education.
Husén, T. (1973). Foreword. In L. C. Comber & J. P. Keeves (Eds.), Science achievement in nineteen countries (pp. 13-24). Stockholm/New York: Almqvist & Wiksell/John Wiley & Sons.
Husén, T., & Tuijnman, A. (1994). Monitoring standards in education: Why and how it came about. In A. C. Tuijnman & T. N. Postlethwaite (Eds.), Monitoring the standards of education. Papers in honor of John P. Keeves (pp. 1-21). Oxford: Pergamon.
Hutchison, D., & Schagen, I. (in press). Comparison between PISA and TIMSS: Are we the man with two watches? In T. Loveless (Ed.), Lessons learned: What international assessments tell us about math achievement. Washington, DC: Brookings Institution Press.
Ichilov, O. (2004). Becoming citizens in Israel: A deeply divided society. Civic orientations in Hebrew and Arab schools. In C. Papanastasiou (Ed.), Proceedings of the IRC-2004 CivEd-SITES (Vol. 4, pp. 69-86). Nicosia: Cyprus University Press.
Isager, O. A. (1996). Den norske grunnskolens biologi i et historisk og komparativt perspektiv. Oslo: Det matematisk-naturvitenskapelige fakultet,
Universitetet i Oslo.<br />
Jenkins, E. W. (2000). Research in science education: Time for a health check?<br />
Studies in Science Education, 35, 1-25.<br />
Keeves, J. P. (Ed.). (1992). The IEA study of science III: Changes in science<br />
education and achievement: 1970 <strong>to</strong> 1984. New York: Pergamon Press.<br />
Keitel, C., & Kilpatrick, J. (1999). The tationality and irrationality of interna-
LARGE-SCALE INTERNATIONAL COMPARATIVE ACHIEVEMENT STUDIES289<br />
tional comparative studies. In G. Kaiser, E. Luna & I. Huntley (Eds.), International<br />
Comparisons in Mathematics Education (pp. 241-256). London:<br />
Falmer Press.<br />
Kellaghan, T., & Greaney, V. (2001). The globalisation of assessment in the<br />
20th century. Assessment in Education, 8(1), 87-102.<br />
Kind, P. M. (1996). Exploring performance assessment in science. Oslo:Det<br />
matematisk-naturvitenskapelige fakultet, Universitetet i Oslo.<br />
Kirkcaldy, B., Furnham, A., & Siefen, G. (2004). The relationship between<br />
health efficacy, Educational attainment, and well-being among 30 nations.<br />
European Psychologist, 9(2), 107-119.<br />
Kjærnsli, M., Angell, C., & Lie, S. (2002). Exploring population 2 students’<br />
ideas about science. In D. F. Robitaille & A. E. Bea<strong>to</strong>n (Eds.), Secondary<br />
Analysis of the TIMSS Data (pp. 127-144). Dordrecht: Kluwer Academic<br />
Publishers.<br />
Lie, S., Linnakylä, P., & Roe, A. (Eds.). (2003). Northern lights on <strong>PISA</strong>:<br />
Unity and diversity in the Nordic countries in <strong>PISA</strong> 2000: Department<br />
of Teacher Education and School Development, University of Oslo.<br />
Lie, S., & Roe, A. (2003). Unity and diversity of reading literacy profiles. In<br />
S. Lie, P. Linnakylä & A. Roe (Eds.), Northern Lights on <strong>PISA</strong> (pp. 147-<br />
157): Department of Teacher Education and School Development, University<br />
of Oslo.<br />
Linnakyla, P., & Valijarvi, J. (2005). Secrets <strong>to</strong> literacy success: The Finnish<br />
s<strong>to</strong>ry. Education Canada, 45(3), 34-37.<br />
Loving, C. C., & Cobern, W. W. (2000). Invoking Thomas Kuhn: What citation<br />
analysis reveals about science education. Science & Education, 9(1-2),<br />
187-206.<br />
Malin, A. (2005). School differences and inequities in educational outcomes.<br />
Jyväskylä: Jyväskylä University Press.<br />
Mejding, J., & Roe, A. (Eds.). (2006). Northern lights on <strong>PISA</strong> 2003 <strong>–</strong> a reflection<br />
from the Nordic countries. Copenhagen: Nordic Council of Ministers.<br />
Mullis, I. V. S., & Stemler, S. E. (2002). Analyzing gender differences for high<br />
achieving students on TIMSS. In D. F. Robitaille & A. E. Bea<strong>to</strong>n (Eds.),<br />
Secondary Analysis of the TIMSS Data (pp. 287-290). Dordrecht: Kluwer<br />
Academic Publishers.<br />
Neidorf, T. S., Binkley, M., Gattis, K., & Nohara, D. (2006). Comparing<br />
mathematics content in the National Assessment of Educational<br />
Progress (NAEP), Trends in International Mathematics and Science Study
290 ROLF V. OLSEN<br />
(TIMSS), and Program for International Student Assessment (<strong>PISA</strong>) 2003<br />
assessments. Washing<strong>to</strong>n, DC: National Center for Education Statistics.<br />
Neidorf, T. S., Binkley, M., & Stephens, M. (2006). Comparing science content<br />
in the National Assessment of Educational Progress (NAEP) 2000 and<br />
Trends in International Mathematics and Science Study (TIMSS) 2003<br />
assessments. Washing<strong>to</strong>n, DC: National Center for Education Statistics.<br />
O’Dwyer, L. M. (2002). Extending the application of multilevel modelling <strong>to</strong><br />
data from TIMSS. In D. F. Robitaille & A. E. Bea<strong>to</strong>n (Eds.), Secondary<br />
Analysis of the TIMSS Data (pp. 359-373). Dordrecht: Kluwer Academic<br />
Publishers.<br />
OECD (1996). Education at a glance. Paris: OECD Publications.<br />
OECD (1997). Education at a glance. Paris: OECD Publications.<br />
OECD (1998). Education at a glance. Paris: OECD Publications.<br />
OECD (2005a). <strong>PISA</strong> 2003 Data analysis manual. Paris: OECD Publishing.<br />
OECD (2005b). <strong>PISA</strong> 2003: Technical report. Paris: OECD Publications.<br />
OECD (2006). Where immigrant students succed. A comparative review of<br />
performance and engagement in <strong>PISA</strong> 2003. Paris: OECD Publications.<br />
Olsen, R. V. (2005). Achievement tests from an item perspective. An exploration<br />
of single item data from the <strong>PISA</strong> and TIMSS studies, and how<br />
such data can inform us about students’ knowledge and thinking in science.<br />
Oslo: Unipub forlag.<br />
Olsen, R. V. (2006). A nordic profile of mathematics achievement: Myth or<br />
reality? In J. Mejding & A. Roe (Eds.), Northern Lights on <strong>PISA</strong> 2003 <strong>–</strong><br />
areflection from the Nordic countries (pp. 33-45). Copenhagen: Nordic<br />
Council of Ministers.<br />
Olsen, R. V., & Grønmo, L. S. (2006). What are the characteristics of the nordic<br />
profile in mathematical literacy? In J. Mejding & A. Roe (Eds.), Northern<br />
Lights on <strong>PISA</strong> 2003 <strong>–</strong> a reflection from the Nordic countries (pp. 47-57).<br />
Copenhagen: Nordic Council of Ministers.<br />
Olsen, R. V., Kjærnsli, M., & Lie, S. (2007, 21.-25. August). A comparison of<br />
the measures of science achievement in <strong>PISA</strong> and TIMSS. Paper presented<br />
at the ESERA 2007, Malmö, Sweden.<br />
Olsen, R. V., & Lie, S. (2006). Les évaluations internationales et la recherche<br />
en éducation:principaux objectifs et perspectives. Revue française de pédagogie(157),<br />
11-26.<br />
Opitz, E.-M. (2006). <strong>PISA</strong> und Bildungsstandards: Stein des Ans<strong>to</strong>sses oder<br />
Ans<strong>to</strong>ss fur die Sonderpadagogik?/<strong>PISA</strong> and Education Standards: Stum-
LARGE-SCALE INTERNATIONAL COMPARATIVE ACHIEVEMENT STUDIES291<br />
bling Block or Impulse for Special Education? Vierteljahresschrift fur<br />
Heilpadagogik und ihre Nachbargebiete, 75(2), 110-120.<br />
Orpwood, G. (2000). Diversity of purpose in international assessments: Issues<br />
arising from the TIMSS test of mathematics and science. In D.<br />
Shorrocks-Taylor & E. W. Jenkins (Eds.), Learning from Others: International<br />
Comparisons in Education (pp. 49-62). Dordrecht/Bos<strong>to</strong>n/London:<br />
Kluwer Academic Publishers.<br />
Papanastasiou, C., & Papanastasiou, E. C. (2006). Modelling mathematics<br />
achievement in Cyprus. In S. J. Howie & T. Plomp (Eds.), Contexts of<br />
Learning Mathematics and Science. London: Routledge.<br />
Papanastasiou, E. C., Zembylas, M., & Vrasidas, C. (2003). Can computer<br />
use hurt science achievement? The USA Results from <strong>PISA</strong>. Journal of<br />
Science Education and Technology, 12(3), 325-332.<br />
Pole, C., & Lampard, R. (2002). Practical social investigation: Qualitative<br />
and quantitative methods in social research. Essex: Pearson Education<br />
Limited.<br />
Pongratz, L.-A. (2006). Voluntary self-control: Education reform as a governmental<br />
strategy. Educational philosophy and theory, 38(4), 471-482.<br />
Porter, A. C., & Gamoran, A. (Eds.). (2002). Methodological advances in<br />
cross-national surveys of educational achievement. Washing<strong>to</strong>n, DC: National<br />
Academy Press.<br />
Prais, S. J. (2003). Cautions on OECD’s recent educational survey (<strong>PISA</strong>).<br />
Oxford Review of Education, 29(2), 139-163.<br />
Ramírez, M. J. (2006). Fac<strong>to</strong>rs related <strong>to</strong> mathematics cchievement in Chile.<br />
In S. J. Howie & T. Plomp (Eds.), Contexts of Learning Mathematics and<br />
Science (pp. 97-111). London: Routledge.<br />
Reddy, V. (2005). Cross-national achievement studies: learning from South<br />
Africa’s participation in the Trends in International Mathematics and Science<br />
Study (TIMSS). Compare, 35(1), 63-77.<br />
Roberts, D. A. (2007). Scientific literacy/science literacy. In S. K. Abell &<br />
N. G. Lederman (Eds.), Handbook of Research in Science Education<br />
(pp. 729-780). Mahwah: Lawrence Erlbaum Associates, Publishers.<br />
Roe, A., & Hvistendahl, R. (2006). Nordic minority students’ literacy achievement<br />
and home background. In J. Mejding & A. Roe (Eds.), Northern<br />
Lights on <strong>PISA</strong> 2003 <strong>–</strong> a reflection from the Nordic countries. Copenhagen:<br />
Nordic Council of Ministers.
292 ROLF V. OLSEN<br />
Sacher, W. (2003). Schulleistungsdiagnose <strong>–</strong> padagogisch oder nach dem Modell<br />
<strong>PISA</strong>? Padagogische Rundschau, 57(4), 399-417.<br />
Sahlberg, P. (2007). Education policies for raising student learning: the Finnish<br />
approach. Journal of Education Policy, 22(2), 147-171.<br />
Schagen, I. (2004). Multilevel analysis of PIRLS data for England. In C. Papanastasiou<br />
(Ed.), Proceedings of the IRC-2004 PIRLS (Vol. 3, pp. 82-<br />
102). Nicosia: Cyprus University Press.<br />
Simola, H. (2005). The Finnish miracle of <strong>PISA</strong>: His<strong>to</strong>rical and sociological<br />
remarks on teaching and teacher education. Comparative Education,<br />
41(4), 455-470.<br />
Sohn, J., & Ozcan, V. (2006). The educational attainment of Turkish migrants<br />
in Germany. Turkish studies, 7(1), 101-124.<br />
Stanat, Artelt, Baumert, Klieme, Neubrand, Prenzel, et al. (2002). <strong>PISA</strong> 2000:<br />
Overview of the study. Design, method and results. Berlin: Max Planck<br />
Institute for Human Development.<br />
Stigler, J. W., & Hiebert, J. (1999). The teaching gap: Best ideas from the<br />
world’s teachers for improving education in the classroom. NewYork:<br />
Free Press.<br />
The BMS. (1994). Correspondence analysis: A his<strong>to</strong>ry and french sociological<br />
perspective. In M. J. Greenacre & J. Blasius (Eds.), Correspondence<br />
Analysis in the Social Sciences (pp. 128-137). London: Academic Press.<br />
Turmo, A. (2003a). Naturfagdidaktikk og internasjonale studier. S<strong>to</strong>re internasjonale<br />
studier som ramme for naturfagdidaktisk forskning: En drøfting<br />
med eksempler på hvordan data fra <strong>PISA</strong> 2000 kan belyse sider ved<br />
begrepet naturfaglig allmenndannelse. Oslo: Unipub AS.<br />
Turmo, A. (2003b). Understanding a newsletter article on ozone <strong>–</strong> a crossnational<br />
comparison of the scientific literacy of 15-year-olds in a specific<br />
context. Paper presented at the 4th ESERA conference “Research<br />
and the Quality of Science Education”, August 2003, Noordwijkerhout,<br />
The Netherlands.<br />
Valijarvi, J., Linnakyla, P., Kupari, P., Reinikainen, P., Arffman, I., & Jyvaskyla<br />
Univ. Inst. for Educational, R. (2002). The Finnish success in <strong>PISA</strong><strong>–</strong>And<br />
some reasons behind it: <strong>PISA</strong> 2000.<br />
Vári, P. (Ed.). (1997). Are we similar in math and science? A study of grade 8 in<br />
nine Central and Eastern European countries. Amsterdam: International<br />
Association for the Evaluation of Educational Achievement.<br />
von-Stechow, E. (2006). <strong>PISA</strong> und die Folgen fur schwache Schülerinnen
LARGE-SCALE INTERNATIONAL COMPARATIVE ACHIEVEMENT STUDIES293<br />
und Schüler/<strong>PISA</strong> and the Consequences for Pupils with Learning Disabilities.<br />
Vierteljahresschrift fur Heilpädagogik und ihre Nachbargebiete,<br />
75(4), 285-292.<br />
Wang, J. (2001). TIMSS primary and middle school data: Some technical concerns.<br />
Educational Researcher, 30(6), 17-21.<br />
Wolfe, R. G. (1999). Measurement obstacles <strong>to</strong> international comparisons and<br />
the need for regional design and analysis in mathematics surveys. In G.<br />
Kaiser, E. Luna & I. Huntley (Eds.), International Comparisons in Mathematics<br />
Education. London: Falmer Press.<br />
Wolff, U. (2004). Different patterns of reading performance: A latent profile<br />
analysis. In C. Papanastasiou (Ed.), Proceedings of the IRC-2004 PIRLS<br />
(Vol. 3, pp. 188-202). Nicosia: Cyprus University Press.<br />
Wossmann, L. (2005). Educational production in east Asia: The impact of family<br />
background and schooling policies on student performance. German<br />
Economic Review, 6(3), 331-353.
The Hidden Curriculum of PISA – The Promotion of Neo-Liberal Policy by Educational Assessment

Michael Uljens
Finland: Åbo Akademi University
Introduction

The aim of the present chapter is to contextualise the PISA evaluation as an exponent of an ongoing shift in the educational policy of many countries participating in PISA. The shift is considered to reflect a neoliberally oriented understanding of the relation between the state, the market and education. From a Finnish perspective, this shift was initiated at the end of the 1980s and the beginning of the 1990s. It has been referred to as the educational policy of "the third republic".
Even if the PISA project strengthens the development of a neoliberal educational discourse, both nationally and globally, the project was prepared for by developments within many nations during the 1990s. The movements and actions that preceded the PISA project are described here from the perspective of Finland. The argument is that these preceding operations made PISA appear a natural continuation of a change process already initiated at the national level.
The chapter also points out some of the mechanisms through which the PISA evaluations operate in order to promote the neoliberal interests of the OECD. This is considered important, as it often appears to be forgotten that the OECD is the organisation behind and running the PISA project. PISA is interpreted as a specific kind of transnational, semi-global educational evaluation technique not previously experienced. PISA is thus interpreted as having been prepared by previous actions at the local and national levels, but PISA in turn promotes and strengthens the readiness to uphold a competition-oriented cooperation within and between nations.
Doing the groundwork for PISA – the silent educational revolution
The point of departure of the present chapter is the view that large-scale changes in education, especially, must be understood as socially, culturally and historically developed. Consequently, the claim here is that the PISA project cannot be correctly understood without acknowledging it as an exponent of an ongoing shift in European and global educational policy. Today this shift concerns and covers all levels and areas of the western educational system, although to varying degrees in different countries.
In Finland, this shift has been called a movement towards the educational policy of the "third republic". The first republic refers to the period from Finnish independence (1917) up to the Second World War. The second republic started in 1945 and lasted up to the mid-1980s. This period focused on educational expansion, solidarity, basic education for all students, equal opportunities, regional balance, and education for civil society. In a word, it was the educational doctrine of the welfare state, assuming mutually positive effects between economic growth, welfare and political participation (see e.g. Siljander, 2007).
The period of the "third republic" started towards the end of the 20th century or, symbolically, when the previous century "ended" in 1989, i.e. after the collapse of the Soviet Union and the fall of the Berlin Wall. The political mentality in Finland had already started to move in a more conservative direction in the 1980s, and has continued to develop in this direction since then, even though the movement was even more obvious in other countries.
In contrast to the period of the second republic, the educational mentality of the third republic initiated a discourse on excellence, efficiency, productivity, competition, internationalisation, increased individual freedom and responsibility, as well as deregulation in all societal areas (e.g. communication, health care, infrastructure), including the educational sector (education law, curriculum planning and educational administration). This direction was clearly manifested in the governmental programme in Finland after the elections of 1990. The project could be called the creation of the educational policy of the global post-industrial knowledge economy and information society. New Public Management ideas were introduced in the late 1980s, as was a so-called agency-theoretical approach, according to which the role of the state is changed from producing services to buying services. The model included, as we know, the lowering of taxes as well as techniques for "quality assurance". Attention also turned towards profiling individual schools and institutions and towards increasing flexibility, e.g. in educational career planning. Extended freedom of choice at the local level was supported by, for example, decentralising curriculum planning, first to the community level in the 1980s and then to the school level in the 1990s. Parents were included in school boards. Achievement-based salaries were later introduced in the public sector. This mentality supported a kind of commodification of knowledge and marketisation of schooling, as well as a much stronger view of national education as a vehicle for international competition. The use of national tests for ranking schools was introduced in the 1990s as a means of promoting a competition-oriented climate. The education of gifted students became acknowledged in addition to the strong emphasis on traditional special education. Today, limiting school dropout is motivated by its societal costs rather than by other reasonable arguments.
Despite all of these changes, the idea of educational equality has remained the guiding principle, although it has been weakened. The process by and large reflects a view of students or parents as "customers", according to which parents were offered enlarged opportunities to choose which schools their children attend on the basis of the schools' success and profiles. The view of citizens as customers is also obvious in various EU documents (Heikkinen, 2004); Finland joined the EU in 1995. Education has increasingly come to be considered a private good rather than a public good. During the past decade, movements in this direction have been very obvious within the Finnish university system (law, financing models, productivity, etc.).
The changes pointed out above reflect a silent but ongoing "revolution" in educational ideology and policy. The development in Finland is similar to that in other European countries. Seen globally, it is difficult not to consider the collapse of the former Soviet Union as the starting point for the development of a new ideological and economic world order.
The conclusion of what has been said thus far is that the PISA evaluations, organised by the OECD, were in many ways prepared for by the developments described above. The argument of the following section is that although the international ranking of countries with respect to pupils' test success is not a new phenomenon, taking into account how PISA has been constructed and governed, and how its results have been distributed, interpreted and made use of, makes the PISA process an organic part of an ongoing "silent revolution" in western educational thinking.
Governing technologies used by the OECD
It is important to observe that the PISA evaluations are coordinated by the OECD (Organisation for Economic Co-operation and Development). The OECD was founded in Paris in 1960 in order to stimulate economic growth and employment. It was founded by 20 countries but was extended in 2000 to 30 countries. A growing number of non-OECD countries have participated in the PISA evaluations.
The overall logic behind the OECD's strategy seems to be to support the growth of a competitive mentality combined with a system of common standards for nations, as this is expected to be beneficial for a common market. The intention thus seems to be to combine competition with cooperation. The question at hand is through what mechanisms, operations or technologies this is put into practice. In the following, some major strategies are identified that have been applied in and through the PISA evaluations in order to promote a competitive mentality combined with cooperation.
First, using transnational evaluation procedures that follow one single measurement standard (common to all and independent of every participating country) ultimately supports the development of increased homogeneity. The argument is that this occurs through a self-adjusting process. More precisely, the strategy applied is the following: as PISA is mainly focused on the ranking of participating countries and not very interested in explaining the differences between them, the burden of producing explanations is left to the participating nations, their governments, educational administrations and the media. We saw this occurring after the launch of the PISA 2000 results, and even more clearly after PISA 2003.
By not offering systematic explanations for the reported differences in school achievement, the development of a self-adjusting mentality, or a certain mode of self-reflection, was promoted. Through this process the countries themselves begin to orient towards certain types of questions and topics, i.e. looking for the keys to success. We all know that ranking the participating countries created an unforeseen alertness among politicians and within educational administrations to explain either their students' success or lack thereof.
From an OECD perspective this is the best anyone can hope for – getting nations engaged in the right issues, so to speak. By leaving the task of explaining differences to the participating nations, media people and the like, national experts, governmental representatives and politicians are also free to draw different kinds of conclusions from the results. Thus, the policies emanating from the process vary between countries. However, this process limits the agenda of educational politics in any specific country. Instrumental policy issues, i.e. the means by which things should be carried out and corrected, then become the main topic, while reflection on the orientation and aim of education and schooling as such diminishes. Nonetheless, it would be wrong to say that the question of educational aims has moved to the background during this process, as it is obvious that all levels of education strongly emphasise that education, research and developmental work are core strategies for creating economic growth. As the aims are so obvious, there is a risk that educational policymaking at the national level becomes a kind of educational managerialism or "procedurology".
A second strategy for promoting the interests dominating the OECD is related to the construction of the tests and their relation to national curricula. One of the fundamental differences between the PISA evaluations and, for example, the IEA evaluations is that the IEA took the national curriculum, its intentions and content, as the point of departure. As it is quite natural to consider the national curriculum the frame of reference when evaluating pedagogical efforts, it becomes important to try to understand why PISA did not evaluate what teachers in the respective countries were expected to strive for. But what if the point was also something else in addition to primarily evaluating the effectiveness of the educational system? What if the idea was rather to use international evaluation as a technique for homogenising the participating educational systems and creating a competition-oriented mentality?
If homogenisation (or increased coherence) may be seen as one aim to be reached from an OECD perspective, then the promotion of a competition-oriented mentality is another, equally important aim. Having accepted this, the main question is not concerned with the aims of education but with the means of reaching or holding a leading position.
A mentality that accepts never-ending competition is deceptive, as one can never reach either the goal or certainty. The only clear point is that one has to struggle to keep or improve one's position. Competition is always accompanied by insecurity, and this insecure identity or mentality continuously strives to reach safety. The mentality supported is one of continuous angst or a feeling of insufficiency. Lifelong learning, which was first hailed as a liberating policy, has quickly turned out to be more like a life sentence than something emancipating. The individual is not allowed to reach "heaven on earth", but is rather expected to try to learn to live with the idea that a continuous learning process is the closest we can come to fulfilment in life. In fact, this construction is not a recent or new one. In some respects it is a fundamental feature of the European tradition of Bildung. At the risk of oversimplifying, we could say that while the Bildung tradition emphasises learning as emancipation, independence, self-awareness and maturity (Mündigkeit), the lifelong learning ideology or dogma explicates learning activity as something the individual has to exhibit in order to meet the "legitimate" expectations of those towards whom one is considered to be responsible. A learning attitude is the ethos of an "alert readiness to change" according to what the situation needs, but where one does not define the situation oneself. In this sense the lifelong learning dogma is the opposite of the concept of Bildung.
Conclusions

The intention of this chapter has been to develop an interpretation of the possible logic behind the PISA evaluations as compared to previous international evaluations. Moreover, the aim has been to analyse some of the mechanisms or governmental strategies utilised or operating in the PISA process. This general logic was considered simple but intelligent. It was interpreted as aiming at uniting intercultural communicative activities oriented towards learning from each other with a simultaneous or parallel competition-oriented mentality – a logic of competing and competitive cooperation. As this has not been formulated as an explicit aim of the PISA programme, it may be interpreted as part of the hidden curriculum of PISA. International evaluations, in the shape they have taken in the PISA process, thus include a kind of hidden curriculum, aiming at developing the educational systems of the participating countries in a neo-liberal direction.
The analysis did not focus on what was in fact measured by the tests themselves, or on whether the theoretical foundation of the project was weak, e.g. with respect to how comparative educational research was understood. The point was rather, first and foremost, to pay attention to how the educational policy landscape in Finland for its part prepared for PISA and, secondly, to point out the effects that this kind of evaluation procedure may have on the educational thinking of the participating countries. PISA was thus understood rather as an instrument or technique used by the OECD to support the development of a specific type of national educational policy. Expressed in the terminology of Michel Foucault, the PISA evaluation is viewed as a good example of how
THE HIDDEN CURRICULUM OF <strong>PISA</strong> 301<br />
evaluation operated not by directly governing behaviour but by governing the self-government or self-conduct of individuals.
However, it has not been claimed that supporting countries’ competitive capacity by educational means is a new feature of Finnish or European educational policy. It may, in fact, be argued that living with uncertainty and openness is a fundamental feature of the modern European tradition of Bildung (Uljens, 2007). Furthermore, the educational policy of the welfare state was, and still is (at least in Finland, to the extent that it still exists), built upon the conviction of positive mutual effects between economic progress, educational equality, social justice and welfare, and active, participatory citizenship.
In order to avoid misunderstanding, it should be stated that it is in this context that the term ‘neo-liberalism’ is distinguished from ‘classical liberalism’ (A. Smith). Classical liberalism is taken to refer to the idea that the state should not intervene in market-related issues, as the market regulates itself and is automatically beneficial for all. Neo-liberalism is taken to refer to the view that the state does and should intervene in the market through laws and regulations of all kinds. In the neo-liberal model, politics, economics and education are seen as mutually dependent on each other. The international development of market-oriented economic thinking after 1989 may thus be considered a renewed neo-liberal politics in which the relative impact of politics on the economy has diminished. This has created a dissonance in the “school-state-market triangle” (education, politics, economics), which is most clearly visible in and through the contemporary discussions on the crisis of citizenship and citizenship education.
In conclusion, understood in the sense defined above, the PISA process is coherent with the kind of educational policy in Finland that has been evolving over the past 15-20 years. The relation may also be seen the other way around: the educational policy of Finland, as it developed from the end of the 1980s and the beginning of the 1990s, moulded the national scene so that the strategies and technologies used in the PISA evaluation appeared as a reasonable continuation of the national policy.
It was pointed out that this preparatory work was mainly carried out by applying three policy technologies: a) economisation, referring to the measurement of value primarily in economic terms; b) privatisation, as a movement towards the partial deconstruction of collective, societal institutions in favour of private actors, deregulation, increasing flexibility of educational administration, and increased individual responsibility and freedom; and, finally, c) productivity, referring to the fact that activities effectively stimulating economic growth are supported.
One of the anomalies arising from the international PISA discussion is how to explain the fact that an educational system like the Finnish comprehensive school was indeed able to produce better results and smaller variation than parallel school systems, such as those in Germany or Great Britain. One reason why this caused so much confusion was that the ideology behind the comprehensive school fundamentally differed from the OECD ideology, with its emphasis on more individual freedom and less state intervention.
PISA has also resulted in increased expectations of continued and extended success. In Finland, the PISA success of the compulsory school system turned attention towards the universities: why are our universities not doing equally well in international rankings? During the last few years many different steps have been taken to push for Finnish universities’ international success. One example is that the decentralised model of higher education, initiated at the end of the 1960s, no longer remains unquestioned. According to the unquestioned rhetoric of today, large university units are considered capable of being successful in many ways, not least when it comes to raising research funding and offering stimulating study programmes. The same development is occurring at the EU level (e.g. the establishment of the European Institute of Technology, EIT).
A final comment – or query – concerning where PISA is, or has been, discussed in recent years: compared with the immense attention PISA issues have received in public debate all over the world, and the impact they have had on governmental policies and school practices, it is fascinating how seldom educational researchers touch upon the topic in international research conferences and journal articles. If this observation is correct, which I do think it is, then it seems that we have two different worlds of educational debate which are not necessarily in touch with each other. Is this how things should be?
References
Heikkinen, A. (2004): Evaluation in the transnational ‘Management by projects’ policies. In: European Educational Research Journal, 3(2), 486-500.
Siljander, P. (2007): Education and ‘Bildung’ in modern society – developmental trends of Finnish educational and sociocultural processes. In: R. Jakku-Sihvonen & H. Niemi (Eds.), Education as a societal contributor (pp. 71-90). Frankfurt am Main: Peter Lang.
Uljens, M. (2007): Education and societal change in the global age. In: R. Jakku-Sihvonen & H. Niemi (Eds.), Education as a societal contributor (pp. 23-49). Frankfurt am Main: Peter Lang.
Deutsche Pisa-Folgen – German Consequences of PISA

Thomas Jahnke
Germany: Universität Potsdam

This note identifies the resolutions of the Kultusministerkonferenz (the Standing Conference of the German Ministers of Education) on ‘educational monitoring’ (‘Bildungsmonitoring’) and on ‘educational standards’ (‘Bildungsstandards’) in mathematics as national consequences of PISA. Engaging with testing research from the USA, and a sobering of published opinion about the reality of testing, may possibly contain the authority that PISA & Co. command in Germany.
Participation in the Third International Mathematics and Science Study (TIMSS) and in the first cycle of the Programme for International Student Assessment (PISA) has led to a fundamental turn in German educational policy. The unease about German schools has become measurable, and with these measurements the way to improve matters is also mapped out: the measured values must rise, and then things will get better. The force with which this idea rolled through the media and political public sphere, burying isolated criticism of such findings and of the prescribed cure, was that of an avalanche. The measurement results appear more real than any theory, and the “descriptive findings” carry a quasi-scientific objectivity and thus an irrefutable truth: this is how things stand – within the limits of measurement accuracy. The triumph of empirical thinking: reality has been quantified, digitised; the writing on the wall has acquired decimal places and can now be subjected to steering processes whose results are measured again, and so on.
In Germany there is hardly any discussion of the fact that such measurements, too, rest on a – possibly bumpy, unspoken, ill-thought-out – theory, and that concepts such as ‘competence levels’ or ‘basic literacy’ (‘Grundbildung’) do not emerge from measurement technology, but produce “reality” rather than describe it. Furthermore, the belief that periodic testing will raise the achievement of German pupils is widespread, not least within the educational administration. Criticism of PISA is frequently dismissed with the argument that an enterprise of this magnitude naturally shows weaknesses and inconsistencies, but that the ‘PISA shock’ has at least directed attention to the reality of schools, thereby already setting things in motion and triggering various reform efforts. What is overlooked here is that PISA is not a one-off test whose results – their significance possibly overestimated – might still tell a reflective observer something, but a programme that by no means opens up the view for the most diverse reform approaches and efforts, but on the contrary has already fixed the path by fixing the goal: German pupils are to perform better in the coming tests.
“Bildungsmonitoring” (Educational Monitoring)
The Kultusministerkonferenz’s overall strategy for educational monitoring exists in two forms: as a resolution of the Kultusministerkonferenz of 2 June 2006 1 and as a brochure 2 published in 2006 by the Secretariat of the Standing Conference of the Ministers of Education and Cultural Affairs of the Länder of the Federal Republic of Germany in cooperation with the Institut zur Qualitätsentwicklung im Bildungswesen (IQB). The brochure is illustrated – starting on the cover – with full-page colour photographs of pupils, contains a foreword by the president of the Kultusministerkonferenz and a table of contents with renumbered sections, and is enriched by a section with sample tasks. Evidently the IQB was permitted to revise its own book of tasks and duties itself, and to smooth and interpret the wording of the underlying resolution to suit itself. This happens, in fact, paragraph by paragraph. From the resolution’s formulation “... to be placed within a series of KMK resolutions that describe corresponding fields of action and define common central areas of work after PISA 2003”, the reference to PISA 2003 is deleted. From the area of work “provision of in-service training concepts and materials for competence- and standards-based development of teaching, above all reading, geometry, stochastics”, the reference to reading is deleted, as is the – incomprehensible, puzzling – reference to specific mathematical domains, which can probably only be explained by the absence, and hence the dispensability, of subject-matter expertise. In what follows we quote the somewhat leaner formulations of the original resolution. The very first paragraph leaves little doubt under which auspices education is viewed today:

Education plays a key role for individual development, for participation in society and for professional advancement, but also for the economic success of a country. The global developments of the past decades have underlined once more the fundamental importance of education for Germany. Exploiting all potential talent and securing and developing quality in the education system are therefore central tasks of educational policy. (p. 1) 3

1 Kultusministerkonferenz (KMK): Gesamtstrategie der Kultusministerkonferenz zum Bildungsmonitoring (resolution of the Kultusministerkonferenz of 2 June 2006)
2 Kultusministerkonferenz (KMK): Gesamtstrategie der Kultusministerkonferenz zum Bildungsmonitoring (2006)
As the central instruments of the Kultusministerkonferenz for educational monitoring, the resolution then names:

– international school achievement studies
– central verification of the attainment of the educational standards in a comparison across the Länder
– comparative tests, tied or linked to the educational standards, for the state-wide assessment of the performance of individual schools
– joint educational reporting by the federal government and the Länder. (pp. 1-2)
Whether the ‘educational standards’ published so far (see below!) will withstand such verification and strain may be doubted. In any case, this resolution proclaims and declares Germany a testing country: for the years 2006 to 2018 (!), a table schedules 17 test administrations and 19 reports on them, arising nationwide from participation in PIRLS, TIMSS and PISA 4 alone, together with the comparisons across the Länder. Added to these are the Land-specific and cross-Land comparative tests, tied or modelled on the educational standards, in grades 3 and 4 for German and mathematics, in grades 8 and 9 for the Hauptschule certificate in German, mathematics and the first foreign language (English, French), and in grades 9 and 10 for the intermediate school certificate in German, mathematics, first foreign language (English, French), biology, chemistry and physics.

3 In tenor and jargon, such a functional description of ‘education’ recalls the corresponding pronouncements of the OECD. Surprisingly, in the revision of the text (see the brochure cited above) the word ‘education’ in the quoted passage has been replaced by ‘the education system’, as if these concepts were synonymous.
4 PIRLS is the abbreviation for Progress in International Reading Literacy Study, which in Germany is also called IGLU (Internationale Grundschul-Lese-Untersuchung). TIMSS was originally an acronym for Third International Mathematics and Science Study; since TIMSS 2003 the acronym has stood for Trends in International Mathematics and Science Study. PISA stands for Programme for International Student Assessment.
That school education in Germany can no longer escape such ‘monitoring’ is finally secured in the last section, on educational reporting:

The core of educational reporting is a manageable, systematic, regularly updated set of indicators, i.e. statistical figures, each standing for a central feature of educational processes or a central aspect of educational quality. These indicators are presented as time series drawn from official data and social-science surveys, where possible in international comparison and broken down by Land. To enable comparison with developments in the member states of the European Union and the OECD, connectivity and compatibility with international reporting systems (...) is sought. (...)

Through the availability of individual longitudinal data and the regular assessment of acquired competences, the guiding idea of educational reporting, “education across the life course”, is to be realised. The Länder have already taken fundamental decisions on a uniform set of school statistics and on securing connectivity with international educational statistics. Thus on 22 September 2005 the Länder agreed to make their data available in the longer term according to the attributes agreed in the core data set. Data at least for public schools are to be available from all Länder for the school year 2008/2009. (p. 14)
In the revised brochure on educational monitoring, the paragraph last quoted has been deleted. It can hardly be assumed, however, that the planned database will therefore not be set up.
“Teaching to the Test”
One must note that the criticism of the testing procedures, and of the idea that the ‘yields’ of school education could be meaningfully measured and raised through periodic tests, has not reached Germany, and that there has not even been a careful and honest discussion in this country of this idea, which prima vista appears self-evident. 5 Nor is this surprising, since the testing institutes involved and the scientists cooperating with them can hardly be expected to bring test criticism to market along with their testing know-how.

In the fourteen-page resolution of the Kultusministerkonferenz on educational monitoring of 2 June 2006, this problem is briefly addressed on page 13, under the subheading “Further development of education, but no teaching to the test”, as follows:

Besides their function of describing performance requirements and of measuring performance, the educational standards serve primarily the further development of teaching and, above all, the individual support of all pupils. The Länder agree that, with the educational standards established as an overarching frame of reference, a development towards “teaching to the test” or a narrowing of teaching to the requirements of the standards must be prevented. (p. 13)
This brevity is astonishing, despite the invoked unanimity of the Länder. If one wants to promote or decree the ‘further development of teaching and above all the individual support of pupils’ through educational standards whose attainment is verified essentially by tests, it is natural to take note of the experience of countries – above all the USA – that have pursued such a policy for years or decades.
For several decades, some measurement experts have warned that high-stakes testing could lead to inappropriate forms of test preparation and score inflation, which we define as a gain in scores that substantially overstates the improvement in learning it implies. (p. 99)

5 It is revealing, and presumably not without consequences, that the PISA data are processed in Australia and, as it were, never set foot on German (or European) soil. In times of research that understands itself globally, such spatial distances appear to have no effect whatsoever, among other things because access to data servers is ubiquitous and possible without delay. Nevertheless it matters whether, and in what setting and intellectual space, the procedures for preparing the data are developed, discussed and criticised; whether they are understood as instruments that shape the results in form and content, or merely appear as – more or less poorly documented – routines in software packages; whether they are scientifically discussed at all, or simply taken as necessary yet arbitrary essences of a ‘state of the art’; whether the researchers involved care about selling their procedures, or about introducing and legitimising them in the discussion as instruments of knowledge, etc.
Thus Daniel Koretz, an education researcher at Harvard University and associate director of the National Center for Research on Evaluation, Standards, and Student Testing (CRESST), opens his essay Alignment, High Stakes, and the Inflation of Test Scores 6 and describes a starting point beyond which public discussion in Germany has so far barely advanced:

One common response to this problem has been to seek “tests worth teaching to”. The search for such tests has led reformers in several directions over the years, but currently, many argue that tests well aligned with standards meet this criterion. If tests are aligned with standards, the argument runs, they test material deemed important, and teaching to the test therefore teaches what is important. If students are being taught what is important, how can the resulting score gains be misleading? (p. 99)
Koretz substantiates his objection to such naivety theoretically and empirically, strikingly, among other things, with “sawtooth patterns” in the measured achievement of the same or a comparable population, which turned out very differently across surveys depending on the tests used. He also contradicts the hope that such effects can be attributed solely to test construction and testing circumstances:

The problem is not confined to commercial, off-the-shelf, multiple-choice tests. It has appeared as well with standards-based tests and with tests using no multiple-choice items. (p. 106)
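Koretz’s sawtooth pattern can be sketched with a minimal simulation. This is not Koretz’s own model; all numbers (an inflation of up to 8 score points, a four-year life span per test form, a cohort of 500) are illustrative assumptions. True skill is held constant, yet observed means climb while one test form is in use and drop when a new form is introduced:

```python
import random

def observed_score(true_skill, familiarity, noise_sd=2.0):
    # Observed score = true skill + form-specific inflation + noise.
    # 'familiarity' models narrow preparation for one particular test
    # form; it raises scores without any gain in underlying learning.
    return true_skill + 8.0 * familiarity + random.gauss(0, noise_sd)

def simulate(years=12, form_life=4, cohort_size=500, seed=1):
    random.seed(seed)
    true_skill = 50.0  # held constant: no real improvement occurs
    means = []
    for year in range(years):
        # Familiarity grows while a form is reused, then resets to
        # zero when a fresh form is introduced.
        familiarity = (year % form_life) / form_life
        cohort = [observed_score(true_skill, familiarity)
                  for _ in range(cohort_size)]
        means.append(sum(cohort) / cohort_size)
    return means

means = simulate()
# Means rise for form_life years, then fall back when the form changes,
# even though true skill never moved: a sawtooth, not real learning.
```

Plotting `means` over the twelve simulated years reproduces the rising-and-resetting shape that, in Koretz’s argument, distinguishes score inflation from genuine improvement.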
The notion that student achievement can be measured in a test objectively, or with specifiable error margins – quasi-physically, as it were – is simply misleading, and conclusions drawn from this notion are more than questionable. Wherever this is denied, passed over in silence, or the opposite is pretended, massive vested interests of those who commission or carry out the tests are, as a rule, at play.
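The claim about “specifiable error margins” can be put in perspective with a back-of-the-envelope sketch using the classical test-theory formula SEM = SD · sqrt(1 − reliability). The PISA-like scale (standard deviation 100) and the reliability of 0.90 are assumptions chosen for illustration, not figures from this chapter:

```python
import math

def sem(sd, reliability):
    # Classical test theory: standard error of measurement,
    # SEM = SD * sqrt(1 - reliability).
    return sd * math.sqrt(1.0 - reliability)

def confidence_band(score, sd, reliability, z=1.96):
    # Approximate 95% band around an individual's observed score.
    half_width = z * sem(sd, reliability)
    return score - half_width, score + half_width

# PISA-like scale: standard deviation 100 points. Even granting a
# high reliability of 0.90, an individual score of 520 carries a
# 95% band of roughly +/- 62 points.
low, high = confidence_band(520.0, 100.0, 0.90)
```

Even under these generous assumptions, the band around an individual score spans well over 100 points, which is why treating single scores as quasi-physical measurements overstates their precision.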
6 Koretz, D.: Alignment, High Stakes, and the Inflation of Test Scores. Yearbook of the National Society for the Study of Education (2005) 104 (2), 99-118. (Online at: http://www.blackwell-synergy.com/doi/abs/10.1111/j.1744-7984.2005.00027.x)
The effects of testing on teaching have likewise been studied in the USA for decades. Koretz, for example, describes and characterises reallocation, alignment and coaching in the paper cited:

Reallocation. Reallocation refers to shifts in instructional resources among the elements of performance. Research has shown that when scores on a test are important to teachers, many of them will reallocate their instructional time to focus more on the material emphasized by the test. (...) Many observers believe that reallocation is among the most important factors causing the sawtooth pattern (...).

Alignment. Content and performance standards comprise material – performance elements, in the terminology used here – that someone (not necessarily the ultimate user of scores) has decided are important. If the material is emphasized in the standards, that implies that users should give this material substantial weight in the inferences they draw about student performance. Alignment gives this same material high weights in the test as well. (...)

Coaching. The term “coaching” is used in a variety of different ways in writings about test preparation. Here it is used to refer to two specific, related types of test preparation, called substantive and non-substantive coaching.

Substantive coaching is an emphasis on narrow, substantive aspects of a test that capitalizes on the particular style or emphasis of test items. The aspects of the tests that are emphasized may be either intended or unintended by the test designers. For example, in one study of the author’s, a teacher noted that the state’s test always used regular polygons in test items and suggested that teachers should focus solely on those and ignore irregular polygons. The intended inferences, however, were about polygons, not specifically regular polygons. (...)

Nonsubstantive coaching refers to the same process when focused on nonsubstantive aspects of a test, such as characteristics of distracters (incorrect answers to multiple-choice items), substantively unimportant aspects of scoring rubrics, and so on. Teaching test-taking tricks (process of elimination, plug-in, etc.) can also be seen as nonsubstantive coaching. In some cases – for example, when first introducing young children to the op-scan answer sheets used with multiple-choice tests – a modest amount of certain types of nonsubstantive coaching can increase scores and improve validity by removing irrelevant barriers to performance. In most cases, however, it either wastes time or inflates scores. (p. 110-112)
Similar criticism is found elsewhere. Brian M. Stecher, for instance, summarises chapter 4, Consequences of large-scale, high-stakes testing on school and classroom practices, of the book he co-edited, Making Sense of Test-Based Accountability in Education 7, as follows:

The net effect of high-stakes testing on policy and practice is uncertain. Researchers have not documented the desirable consequences of testing – providing more instruction, working harder, and working more effectively – as clearly as the undesirable ones – such as negative reallocation, negative alignment of classroom time to emphasize topics covered by a test, excessive coaching, and cheating. More important, researchers have not generally measured the extent or magnitude of the shifts in practice that they identified as a result of high-stakes testing.

Overall, the evidence suggests that large-scale high-stakes testing has been a relatively potent policy in terms of bringing about changes within schools and classrooms. Many of these changes appear to diminish students’ exposure to curriculum, which undermines the meaning of the test scores. (pp. 99-100)

The antagonism addressed in this last paragraph seems possibly to have been withheld from the German Kultusministerkonferenz by its advisers. The same presumably holds for the Position Statement on High Stakes Testing in PreK-12 Education of the American Evaluation Association (AEA), which states:

High stakes testing leads to under-serving or mis-serving all students, especially the most needy and vulnerable, thereby violating the principle of “do no harm.” The American Evaluation Association opposes the use of tests as the sole or primary criterion for making decisions with serious negative consequences for students, educators, and schools. The AEA supports systems of assessment and accountability that help education.

Recent years have seen an increased reliance on high stakes testing (the use of tests to make critical decisions about students, teachers, and schools) without full validation throughout the United States. The rationale for increased uses of testing is often based on a need for solid information to help policy makers shape policies and practices to insure the academic success of all students. Our reading of the accumulated evidence over the past two decades indicates that high stakes testing does not lead to better educational policies and practices. There is evidence that such testing often leads to educationally unjust consequences and unsound practices, even though it occasionally upgrades teaching and learning conditions in some classrooms and schools. The consequences that concern us most are increased drop out rates, teacher and administrator deprofessionalization, loss of curricular integrity, increased cultural insensitivity, and disproportionate allocation of educational resources into testing programs and not into hiring qualified teachers and providing sound educational programs. The deleterious effects of high stakes testing need further study, but the evidence of injury is compelling enough that AEA does not support continuation of the practice.

While the shortcomings of contemporary schooling are serious, the simplistic application of single tests or test batteries to make high stakes decisions about individuals and groups impedes rather than improves student learning. Comparisons of schools and students based on test scores promote teaching to the test, especially in ways that do not constitute an improvement in teaching and learning. Although used for more than two decades, state mandated high stakes testing has not improved the quality of schools; nor diminished disparities in academic achievement along gender, race or class lines; nor moved the country forward in moral, social, or economic terms. The American Evaluation Association (AEA) is a staunch supporter of accountability, but not test driven accountability. AEA joins many other professional associations in opposing the inappropriate use of tests to make high stakes decisions.

7 Stecher, B. M.: Consequences of large-scale, high-stakes testing on school and classroom practices. In: L. S. Hamilton, B. M. Stecher, and S. P. Klein (Eds.), Making Sense of Test-Based Accountability in Education (pp. 79-100). Santa Monica: RAND, 2002. (Online at: http://www.rand.org/pubs/monograph_reports/MR1554/index.html)
An endnote to this text refers to further organizations that likewise oppose basing far-reaching decisions on test results:
AEA joins many other professional associations, teacher unions, parent advocacy groups in opposing the inappropriate use of tests to make high stakes decisions. These include, but are not limited to the American Educational Research Association, the National Council for Teachers of English, the National Council for Teachers of Mathematics, the International Reading Association, the College and University Faculty Assembly of the National Council for the Social Studies, and the National Education Association 8
For a German observer it is scarcely comprehensible with what expectations of improvement, if not salvation, in whatever direction, education policy here introduces the most extensive testing programs, while Anglo-Saxon evaluation pragmatism, certainly not squeamish in such matters, has distanced itself from such endeavors after more than twenty years of experience, and with a clarity that can hardly be surpassed.
„Bildungsstandards“
While the resolution of the Kultusministerkonferenz on educational monitoring, as quoted above, still speaks of the Bildungsstandards serving primarily the further development of teaching and, above all, the individual support of all students, the Vereinbarung über Bildungsstandards für den Mittleren Schulabschluss 9 (Agreement on Educational Standards for the Intermediate School-Leaving Certificate) of the same organization puts it more plainly and with less pedagogical camouflage:

8 American Evaluation Association (AEA): Position Statement on HIGH STAKES TESTING In PreK-12 Education. 2002 (Online at: http://www.eval.org/hst3.htm)

314 THOMAS JAHNKE
The Kultusministerkonferenz regards it as a central task to safeguard the quality of school education, the comparability of school-leaving qualifications, and the permeability of the education system. Bildungsstandards are of particular importance here. They are part of a comprehensive system of quality assurance that also encompasses school development and external and internal evaluation. Bildungsstandards describe expected learning outcomes. Their application provides indications for necessary support and remedial measures. (p. 3)
Standards 10 and tests certify each other's necessity to such a degree that they come into existence, as it were, out of pure logic: tests require standards against which, or towards which, they are aligned; standards require tests to check their observance, their attainment, or their failure. A short, and less logical, history of the standards in Germany was sketched by Hans Dieter Sill in 2006. 11 He comes to the conclusion:

The standards did not emerge as the result of thorough scientific analyses of international and national developments; rather, they are the product of a politically motivated decision at ministerial level that had to be implemented in a very short time. There were neither the time nor the personnel resources to shape the scientifically extraordinarily demanding process of developing national standards with the necessary depth and thoroughness. (pp. 299-300)
The results of such scarcity do not, at least arithmetically, characterize the Bildungsstandards im Fach Mathematik für den Mittleren Schulabschluss adopted by the Kultusministerkonferenz on 4 December 2003. By positing six competencies that overlap without any selectivity or even a distinct character of their own, five mathematical guiding ideas (Leitideen) that have been familiar for years, and three requirement levels (Anforderungsbereiche), multiplication alone yields ninety different ways of classifying a task. If, as is probably to be expected in most cases, several competencies or guiding ideas are involved, several hundred such classifications result.

9 Vereinbarung über Bildungsstandards für den Mittleren Schulabschluss (Jahrgangsstufe 10) (Beschluss der Kultusministerkonferenz vom 4.12.2003), in: Kultusministerkonferenz (KMK): Bildungsstandards im Fach Mathematik für den Mittleren Schulabschluss. Beschluss vom 4.12.2003.

10 Out of respect for the great German educational theorists of the 18th, 19th and 20th centuries, I try to avoid the word Bildungsstandard. As if ear and mind were not already sufficiently tormented by the compound "Bildungsstandards", the resolution of the Kultusministerkonferenz on educational monitoring speaks in numerous places of the necessary norming and re-norming of the Bildungsstandards.

11 Sill, H. D.: PISA und die Bildungsstandards. In: Jahnke, Th.; Meyerhöfer, W. (Hrsg.): Pisa & Co – Kritik eines Programms. Franzbecker Verlag. Hildesheim 2006. S. 293-330.
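The classification arithmetic described for the mathematics standards can be checked in a few lines. This is only an illustrative sketch: the grid sizes (six competencies, five guiding ideas, three requirement levels) come from the KMK brochure, while the cut-off of at most two competencies and two guiding ideas per task is an assumption made here to show how the count grows into the hundreds.

```python
from math import comb

# KMK classification grid for mathematics tasks (Mittlerer Schulabschluss):
# 6 general competencies, 5 Leitideen (guiding ideas), 3 requirement levels.
COMPETENCIES, LEITIDEEN, LEVELS = 6, 5, 3

def single_label_count():
    """One competency, one guiding idea, one requirement level per task."""
    return COMPETENCIES * LEITIDEEN * LEVELS

def multi_label_count(max_comp=2, max_ideas=2):
    """Tasks tagged with up to `max_comp` competencies and up to `max_ideas`
    guiding ideas; the cut-off of two is an illustrative assumption."""
    comp_choices = sum(comb(COMPETENCIES, k) for k in range(1, max_comp + 1))
    idea_choices = sum(comb(LEITIDEEN, k) for k in range(1, max_ideas + 1))
    return comp_choices * idea_choices * LEVELS

print(single_label_count())  # 90
print(multi_label_count())   # (6 + 15) * (5 + 10) * 3 = 945
```

Even this modest multi-label extension already produces far more pigeonholes than there are sample tasks in the 36-page brochure, which is the point of the passage above.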
On 24 of the brochure's 36 pages, sample tasks and sketched solutions are printed, with guiding ideas and general mathematical competencies indicated and assigned to requirement levels. While the classification of the tasks is hardly compelling or illuminating, rather self-evident and of negligible importance for actually working them, the meagre quality of the tasks is alarming: it does not even begin to reach the standard of material found in recent, well-worked-out and well-edited textbooks.
In task (1), the modelling is inappropriate.
In task (2), the blunder of a graphic designer, presumably caused by mishandling spreadsheet software, is not addressed but simply accepted.
In task (3), a star that is not drawn symmetrically is described as symmetric, and the number of its axes of symmetry is then asked for.
In task (4), the artificial question and the classification are astonishing.
In task (5), an axis labelled "pay rise in EURO" is marked in steps of ten from 0 to 50 but at the same time divided into 30 parts, so that one subdivision corresponds to 1 2/3 € and the axis labels cannot sit at the tick marks.
In task (6), a point is denoted P(y;x) and it is then remarked that "x is the first coordinate of the point P".
In task (7), sub-questions c) and d) are astonishing.
In task (8), what becomes apparent above all is the effort to accommodate a guiding idea.
In task (9), question c) is opaque.
In task (10), the calculator is used in a questionable way.
In task (11), one is to engage with five student utterances in speech bubbles that outside school one would hardly subject to mathematical treatment.
In task (12), questions on linear functions are treated from which little sense can be extracted.
In task (13), a modelling exercise would in principle be possible, were it not already given in the text.
Task (14) laboriously combines questions that have little in common.
Throughout, the tasks are rather woodenly formulated, the graphics careless and faulty, the solution sketches unhelpful and in part wrong (seriously so, for example, in task (3) and task (5)). No innovative impulses emanate from such material. Why are these tasks, which are meant to exemplify the 'Bildungsstandards für den Mittleren Schulabschluss' throughout Germany and, beyond the palely contoured competencies, to embody them, so full of defects? The only rational answer to this question is that the standards are not actually about competencies, guiding ideas and requirement levels, and that what is exemplary about these tasks does not refer to their content. The point is not to look closely at them and their possible solutions, that is, to take them seriously, but to make clear to teachers and students that there is now a new, administratively binding concept, namely that of standards, which is to be complied with without back-talk and which tolerates no objection, whether against tests or comparative assessments and their contents. What is to be taken seriously, then, is not the tasks but the curb on which teachers and students alike are being reined in: you must be able to do this now, or else there will be consequences, whether through publication of the meagre results of the students, the teachers or the school, or through other coercive measures. Now things are getting serious, and this seriousness is called 'standard'. It may well be that this change of tune, as praise, so to speak, of the seriousness of state educational prescriptions, suits some people, and that some profit from it, for instance as state-commissioned educational researchers or test developers; but it is not didactics of mathematics. The Vereinbarung über Bildungsstandards für den Mittleren Schulabschluss (Jahrgangsstufe 10) states (on page 4 of the brochure cited):
The standards and their observance will be reviewed, taking into account developments in the academic disciplines, in subject didactics and in school practice, by a scientific institution jointly commissioned by the Länder, and will be further developed on the basis of validated tests. (p. 4)
No further development of content has taken place since; evidently there is no desire at all for a broad and deep discussion in the discipline, in subject didactics, or in school practice.
Cracks in the Public Authority
At the carefully staged publication of the first Pisa 'results' in Germany, as to a lesser extent already with the Third International Mathematics and Science Study (TIMSS), the media essentially did nothing but proclaim and dramatize the horror at the performance of German students (the 'Pisa shock'). The results themselves, their interpretation, and the procedures used to obtain them were not subjected to even the simplest plausibility checks at the press conferences or in the accompanying reports and commentaries. There was no moose test that would have probed this complex apparatus of investigation even in outline, or merely the technical details of the test in schools (its length, the kind and number of questions) and its curious expressive power. The only business of the day was to lament the extent of the German failure and to ponder remedies according to the political colour of the commentator and his organizational affiliation. Occasionally, quite sobering reports from students and teachers directly involved in the test could be read, but such eyewitness accounts were dismissed as local slips contrary to the manuals and the prescribed international procedural rules, and mentioning or considering them was branded as unscientific. They were drowned in the drama and force of the global study. Any criticism of Pisa was seen in the media merely as an unfit attempt to talk the misery of German education into looking good, or even to deny it altogether. The public authority of Pisa held the media, and politics, firmly in its grip.
With the second Pisa wave, no comparable drama could be built up in the media. Even half-hearted attempts, pushed by education policy, to draw conclusions from a comparison of the two rounds' results about the first effects of German measures proved procedurally daring, substantively neither credible nor even plausible, and in tendency actually counterproductive. Not even the (nonsensically spectacular) country rankings could still be exploited, so a new debacle had to secure media attention: namely, that in Germany territorial and social origin 'strikes through' to educational opportunities in a special way. Here too, incidentally, simple questions were never asked: how this research result had come about, which quantities or indicators had been measured, computed, or plotted against one another, and in what way the German results exceeded or fell short of those of comparable countries. In media terms, then, this was not the result of a complex study whose procedures would have to be explained at least in rough outline, but a moral catastrophe, on whose removal one had to work without question or delay. 12
By now the dazzle of this news item, too, has faded. The following article shows, by way of example, that the sheer faith that the sudden horror at the surveyed German schooling debacle would be followed, with the same full-throated certainty and connoisseurship, by an improvement of conditions is beginning to crumble in the media.
Langer Anlauf ohne Sprung (A Long Run-up without a Jump)
The true Pisa winners are not the Finns at all. The true Pisa winners sit in Berlin, Dortmund and Bielefeld. They are rarely seen in schools. Mostly they brood over test sheets, devise examination questions, or devotedly research the effects of their own research. "We have never had so much data," rejoices the Bielefeld educational scientist Klaus Jürgen Tillmann, a member of the German Pisa consortium, "as an empirical educational researcher I am of course quite delighted." There is no lack of funding; new research centres are being founded, such as the Institut zur Qualitätsentwicklung im Bildungswesen (IQB) at Berlin's Humboldt University. Only the object of research itself still dampens the scientists' euphoria: "Unfortunately it does the schools no good," says the pedagogue Tillmann.

Against poor test results, Germany's school ministers seem to believe, the chief remedy is testing. True, in reaction to the Pisa shock the Kultusministerkonferenz resolved on seven improvement strategies, among them language courses for migrant children, more all-day schools and targeted reading support, but so far it has consistently implemented only a single one: tests. "There are developments in all seven areas," says Tillmann, "but only the central examinations have arrived across the board, in all Länder, in the schools." (...)

Indeed, German schools are being evaluated, compared and inspected as never before. Even before starting school, four-year-olds must often take a German language test; in seven Länder third-graders then sweat over "Vera" ("Vergleichsarbeiten") tests, and in the middle grades further comparative assessments follow in many places. In between, year after year, come international studies such as Pisa, Iglu or Timss and, depending on the Land, surveys with fanciful names such as "Quasum", "Desi", "Tosca", "Markus", "Ulme" or "Lau". (...)
12 The point here is by no means to deny German deficits in dealing with and schooling students with a migration background (and the like), but to question the acceptance of their vehement moralization as an essential justification for the existence of Pisa & Co.
"After Pisa, no education minister wanted to be accused of not focusing on achievement," explains the researcher Tillmann. "Behind this stands the vague hope that testing will somehow make everything better." But teaching staffs still lack the know-how to derive concepts from the flood of data. "Something urgently needs to happen there," says Tillmann, "otherwise the whole thing remains a long run-up without a jump." (...)

North Rhine-Westphalia's education minister Sommer believes herself particularly far along the road to drawing sensible lessons from the many tests. She praises her school system as "the most modern in Germany". Thus NRW wants to be the first Land to introduce school rankings within the current legislative period. At the same time, parents on the Rhine and Ruhr can now choose where to enrol their child, and may well orient themselves by the lists. "We want fair competition," says Sommer.

Yet it is precisely here that many scientists see the greatest danger of the new testing culture: "If schools only look at their places in the league table, no school development takes place at all any more," warns Wilfried Bos, head of the Dortmund Institut für Schulentwicklungsforschung. Fair rankings, taking into account, say, the social background of the student body, are hardly possible when, as with Vera, questions about the children's origin and family may not even be asked.

Moreover, last year's award for the best Vera schools already proved a flop: many schools had cheated their way to good results by practising the test items with their students beforehand (SPIEGEL 27/2006). "Once there are real rankings," believes the school principal Borns from Münster, "there will be far more cheating."

Julia Koch in Der SPIEGEL 24/2007
Presumably such articles will shake and erode the public authority of Pisa in Germany more than any scholarly critique of the study's methods and procedures, for such critique can be dismissed as an insignificant intra-academic quarrel stirred up by laypeople, barely reaches the public, and cannot unsettle an education policy that has, as it were, pledged itself to Pisa.
References
American Evaluation Association (AEA): Position Statement on HIGH STAKES TESTING In PreK-12 Education. 2002. (Online at: http://www.eval.org/hst3.htm)
Kultusministerkonferenz (KMK): Gesamtstrategie der Kultusministerkonferenz zum Bildungsmonitoring (Beschlüsse der Kultusministerkonferenz vom 02.06.2006). (Online at: http://www.kmk.org/aktuell/Gesamtstrategie%20Dokumentation.pdf)

Kultusministerkonferenz (Hrsg.) in Zusammenarbeit mit dem Institut zur Qualitätsentwicklung im Bildungswesen: Gesamtstrategie der Kultusministerkonferenz zum Bildungsmonitoring. Berlin 2006. (Online at: http://www.kmk.org/schul/Bildungsmonitoring_Brosch%FCre_Endf.pdf)
Kultusministerkonferenz (KMK): Bildungsstandards im Fach Mathematik für den Mittleren Schulabschluss. Beschluss vom 4.12.2003. (Online at: http://www.kmk.org/schul/Bildungsstandards/Mathematik_MSA_BS_04-12-2003.pdf)

Koch, Julia: Langer Anlauf ohne Sprung. Der SPIEGEL 24/2007.

Koretz, D.: Alignment, High Stakes, and the Inflation of Test Scores. Yearbook of the National Society for the Study of Education (2005) 104 (2), 99-118. (Online at: http://www.blackwell-synergy.com/doi/abs/10.1111/j.1744-7984.2005.00027.x)

Sill, H. D.: PISA und die Bildungsstandards. In: Jahnke, Th.; Meyerhöfer, W. (Hrsg.): Pisa & Co – Kritik eines Programms. Franzbecker Verlag. Hildesheim 2006. S. 293-330.

Stecher, B. M.: Consequences of large-scale, high-stakes testing on school and classroom practices. In L. S. Hamilton, B. M. Stecher, and S. P. Klein (Eds.): Making Sense of Test-Based Accountability in Education. RAND. Santa Monica 2002. P. 79-100. (Online at: http://www.rand.org/pubs/monograph_reports/MR1554/index.html)
PISA in Austria: Media Reactions, Public Assessments and Political Consequences

Dominik Bozkurt, Gertrude Brinek, Martin Retzl
Austria: University of Vienna
Abstract:
Following a juxtaposition of the results of the two PISA assessments of 2000 and 2003, this contribution presents the public media reactions as well as the reactions of the political organizations and the education-policy consequences they drew from PISA. The aim is to make clear how the results were received and interpreted in public discourse, and which political mandates for action were derived from them. Agreements and divergences between public and political reactions on the one hand and the official results on the other are discussed, and changes between the reactions to the PISA results of 2000 and of 2003 are made visible. The media analysis and the education-policy assessment show the density of the resonance as well as the kind and degree of societal "agitation", not only in the scientific community.
1 Austrian PISA Results

In this chapter, the official results of PISA 2000 and PISA 2003 for Austria, as published by the members of the Austrian PISA consortium in various publications, are presented and compared. These results were received quite uncritically by the public. That they do not, or only partially, live up to their claims is shown by scholarly contributions published, for example, in Germany, as well as by the various contributions in this volume. The statements and reactions of media and politics, however, were made on the basis of precisely these results. For our purposes, therefore, relying on them seems sensible and appropriate.
1.1 The Results of the PISA 2000 Study

In December 2001 the OECD published the results of PISA 2000, in which 31 states, among them Austria, had taken part. PISA (Programme for International Student Assessment) surveys, in a three-year cycle, the reading literacy and the basic mathematical and scientific knowledge of 15-/16-year-old students (cf. Reiter, Haider 2002).
The first test domain of the PISA 2000 study comprised the "competence profile reading" (ibid., 13), in which the young people had to understand, use and reflect on the contents of written texts (cf. ibid., 13). To measure these abilities, the OECD employed 129 test items, which in turn were grouped into "five ascending levels of reading competence" (ibid., 13). About 9% of Austrian 15-/16-year-old students could be assigned to the top competence level, and about 14% to the very poor readers of "the two lowest levels" (ibid., 13).
In reading competence, the main domain of PISA 2000 (cf. ibid., 21), the Austrian participants achieved 507 points and thus tenth place among the 27 OECD states tested (cf. Haider, Reiter 2004, 77). Austria thus lay "just above the OECD average of 500" points in this domain (cf. Reiter, Haider 2002, 13).
In the course of PISA 2000, the mathematical knowledge of the 15-/16-year-old students was also measured. To ascertain it, the young people tested had to show their ability in the most varied areas, such as "problem solving" and "modelling" (ibid., 21). In mathematical competence Austria achieved "515 points" (ibid., 21) and thus eleventh place among 27 OECD states (cf. Haider, Reiter 2004, 63). It should be pointed out, however, that the results within Austria vary considerably because of the strongly differentiated school system here: students of the Allgemeinbildende Höhere Schule achieved 565 points overall, while the young people of the general compulsory schools reached only "438 points" (Reiter, Haider 2002, 23) in the mathematics ranking.

In the third and final test domain of PISA 2000, the science competencies of the Austrian students were tested.
Particular attention was paid to recognizing scientific questions and to applying scientific knowledge. The students were called upon, among other things, "to distinguish claims supported by evidence from mere opinions" (ibid., 29). Haider points out that this test domain centres on recognizing scientific questions, applying scientific knowledge and drawing conclusions from evidence (cf. ibid., 29). In science, the Austrian students achieved a total of "519 points" (cf. ibid., 29) and thus eighth place among the 27 OECD states participating in the assessment (cf. Haider, Reiter 2004, 89).
In summary, Austria performed best in science, with 519 points and rank 8. In mathematics, 515 points and thus rank 11 were achieved. Reading yielded the fewest points, 507, though still rank 10 among all OECD states. The performance of Austrian youth, above the OECD average in every domain, ultimately meant that Austria took tenth place in the overall PISA 2000 ranking and thus ranked in the top third of the study.
1.2 The Results of the PISA 2003 Study

At the end of 2004 the results of PISA 2003 were published. The focus of the 2003 assessment, in which as many as 41 states took part, of which however only 40 were included in the scoring (cf. Haider, Reiter 2004, 18), lay on mathematical competencies, which were declared the "main domain" (ibid. 2004, 13). The "minor domains" again tested reading competence and scientific knowledge. For the first time, the problem-solving competencies of the 15-/16-year-old students were also examined (cf. ibid., 13).
As the main domain, mathematics accounted for two thirds of all PISA 2003 test items. Officially, the Austrian students achieved 506 points in solving mathematical tasks and problems, and thus rank 15 among 29 OECD states (cf. Haider, Reiter 2004, 63). Two not inconsiderable problems, however, emerged in comparing the PISA 2000 mathematics results with those of the 2003 assessment. Haider points out that PISA 2000 tested "only two of the four sub-areas of mathematics assessed in PISA 2003" (ibid., 45), so only these two areas are directly comparable. For the mathematical areas newly created in PISA 2003, "uncertainty" 1 and "quantity" 2, there is thus no possibility of comparison (cf. ibid., 45).
The reading competence of the 15-/16-year-old students was examined anew in PISA 2003. Officially, the Austrian participants achieved 491 points, rank 19 "within the 29 OECD states" and rank 22 among all 40 "PISA participating states", whereby Austria does "not differ significantly" from the OECD average of 494 points (cf. ibid., 76). Haider sums up that in reading competence Austria fell back by 9 ranks, or 16 points, in the OECD comparison at PISA 2003 and thus performed significantly worse than three years earlier. He relativizes this decline, however, for when "the shared ranks according to statistical bandwidth are taken into account, this means: PISA 2000: ranks 10-16; PISA 2003: ranks 12-21" 3 (ibid., 77).
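The verdict "significantly worse" rests on comparing two survey means together with their standard errors. The sketch below shows the usual z-test logic for such a comparison; the standard errors used here (2.4 and 3.8) are assumed values for illustration only, the official ones are published in the national reports, and PISA trend comparisons additionally involve a linking error that this simplified check ignores.

```python
import math

def significantly_different(mean_a, se_a, mean_b, se_b, z_crit=1.96):
    """Two-sided z-test for the difference of two independent survey means."""
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(mean_a - mean_b) / se_diff > z_crit

# Austria, reading: 507 points (PISA 2000) vs. 491 points (PISA 2003).
# The standard errors (2.4 and 3.8) are assumed here for illustration.
print(significantly_different(507, 2.4, 491, 3.8))  # True
```

Under these assumptions the 16-point gap is roughly 3.6 standard errors of the difference, which is why the drop is reported as significant even though the 2003 score alone does not differ significantly from the OECD average.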
Als eine weitere Nebendomäne bei <strong>PISA</strong> 2003 wurden die naturwissenschaftlichen<br />
Fertigkeiten der Jugendlichen getestet. Dabei mussten die 15-/16jährigen<br />
SchülerInnen ihre Fähigkeit unter Beweis stellen und zeigen, inwieweit<br />
sie „das jeweilige physikalische, chemische oder biologische Fachwissen“<br />
zu „praktischen Problemlösungen“ einsetzen, Probleme repräsentieren<br />
und Lösungsvorschläge argumentieren können (vgl. ebenda, 78).<br />
In this domain Austria, with 491 points, ranks 20th among the 29 OECD states, provided one considers only "the point estimate of the mean" (cf. ibid., 79; 89). It must be pointed out, however, that once the confidence interval is taken into account (PISA actually tested only a sample of the Austrian student population), the Austrian students occupy ranks 16 to 23 among the 29 OECD states (cf. ibid., 79).
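How a single point-estimate rank turns into a rank range of this kind can be sketched with a toy calculation. The country means and standard errors below are invented for illustration, not the published PISA values; the rule applied is the usual one, namely that a country can only be placed above or below another if their 95% confidence intervals do not overlap.

```python
# Illustrative sketch: point-estimate ranks vs. rank ranges under
# sampling error. All means and standard errors are invented examples.
countries = {"A": (529, 3.0), "B": (515, 2.5), "C": (506, 3.3),
             "D": (500, 2.8), "E": (491, 3.8), "F": (489, 3.1),
             "G": (478, 2.9)}

def rank_range(name, data, z=1.96):
    """Best/worst plausible rank: another country counts as surely
    better (worse) only if its 95% CI lies entirely above (below) ours."""
    m, se = data[name]
    lo, hi = m - z * se, m + z * se
    surely_above = sum(1 for k, (m2, se2) in data.items()
                       if k != name and m2 - z * se2 > hi)
    surely_below = sum(1 for k, (m2, se2) in data.items()
                       if k != name and m2 + z * se2 < lo)
    return 1 + surely_above, len(data) - surely_below

print(rank_range("E", countries))  # → (4, 7)
```

Country E sits 5th by point estimate, but its plausible range spans ranks 4 to 7 — the same effect that widens Austria's nominal 20th place to ranks 16-23.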
1 The mathematical subgroup "uncertainty" comprises "tasks and presentation of data as well as probabilities, uncertainties and inferences" (Haider, Reiter 2004, 52).
2 The domain of "quantity" refers to those mathematical tasks "that deal with numerical phenomena and patterns as well as quantitative relationships" (ibid., 53).
3 Cf. the information page "Wichtige Informationen zur Interpretation der Ergebnisse". In: Haider, Reiter 2004, 43.
PISA IN AUSTRIA 325
This cannot obscure the fact, however, that in science Austria fell from 8th place (among 27 OECD states) with 519 points in PISA 2000 to 20th place (among 29 OECD states) with 491 points in the 2003 study (cf. ibid., 89).
In the 2003 study, the three previous PISA test domains (mathematics, reading, science) were supplemented by "problem-solving competence". Here students were asked to confront "non-routine (novel) problems" (ibid., 90) and to generate proposed solutions through reasoning. In problem solving, the 15- and 16-year-old students from Austria achieved 506 points, placing them above the OECD average of 500 points and (without taking the confidence interval 4 into account) in 15th place among the 29 OECD countries (cf. ibid., 90; 91).
It remains to be noted that the Austrian students delivered their best performance at PISA 2003 in mathematics, with 506 points. In reading and science only 491 points each were achieved; in "problem solving", 506 points. The Austrian 15- and 16-year-old students thus placed in the middle or rear third in every test domain.
1.3 Corrected Main Results
The PISA results cited above, and their interpretations, do not withstand every scrutiny. If the corrected main results published by Neuwirth et al. are included in the comparison of the PISA results of 2000 and 2003, the results Austria achieved in 2003 are put into perspective considerably. Neuwirth reports that "closer examination of the data material" soon revealed "inconsistencies" in the Austrian data (cf. Neuwirth et al. 2004, 11). The "crash" read into the published PISA 2003 data thus did not take place; rather, the point is that the PISA data of 2000 and 2003 are not directly comparable (cf. ibid., 62ff). Neuwirth gives the following reasons for this:
– in PISA 2000, the participation of female students was higher than that of male students (cf. ibid., 11)
4 Taking the confidence interval into account, Austria would achieve "ranks 13 to 17 among the 29 OECD states" (ibid., 91).
326 DOMINIK BOZKURT, GERTRUDE BRINEK, MARTIN RETZL
– the results of Austrian vocational school students were weighted less heavily in PISA 2000 than in PISA 2003. A direct, i.e. unreflected, comparison therefore makes the second study look like a deterioration, since vocational school students sit at the "lower end of the performance spectrum" (cf. ibid., 28ff).
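The effect of such reweighting can be made concrete with a toy calculation. The group means and weights below are invented for illustration, not the actual PISA figures: every group's performance is held fixed, and merely raising the weight of a low-scoring group pulls the national mean down.

```python
# Toy illustration of the reweighting effect described above.
# Group means stay constant; only the sampling weights change.
# All numbers are invented examples, not actual PISA data.

def weighted_mean(means, weights):
    """Population mean as the weight-normalised average of group means."""
    return sum(m * w for m, w in zip(means, weights)) / sum(weights)

group_means = [520, 500, 450]  # e.g. academic / middle / vocational track

mean_2000 = weighted_mean(group_means, [0.40, 0.45, 0.15])  # vocational under-weighted
mean_2003 = weighted_mean(group_means, [0.35, 0.40, 0.25])  # vocational weighted more

print(round(mean_2000, 1), round(mean_2003, 1))  # → 500.5 494.5
```

The apparent six-point "decline" here stems entirely from the changed weights — precisely the comparability problem Neuwirth describes.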
Based on these findings, Neuwirth states that "the performance of Austrian students in reading has hardly changed, and both the reading scores and the mathematics scores lie close to the OECD average" (ibid., 62). Neuwirth concedes, however, that in science "a clear decline in the Austrian scores is discernible" (cf. ibid., 62).
Additional difficulties in comparing the performance data of PISA 2000 and 2003 arise because some test domains (as in mathematics, for example) were newly created for PISA 2003, so that no comparison is possible for them (cf. ibid., 63). It should also be mentioned that the PISA 2003 results leave a great deal of interpretive latitude once the confidence interval is (necessarily) taken into account (see above).
2 Public Media Reactions to the PISA Results
Following this introductory presentation of the PISA results for Austria in international comparison, the public reactions to the results will now be discussed. This matters not least because, as Uljens also recalls in this volume, PISA primarily aims at fostering competition (in the education sector) between the participating states and at promoting uniform educational standards in the participating nations. Any explanation of how the differing results in the various countries come about, however, is entirely dispensed with; that task is left to the governments, school systems and media of the respective states (cf. Uljens in this volume). In Austria this produced a flood of media and political PISA explanations which, though understandable in view of the "Pisan" restraint in offering explanations and interpretations, broke over the population in an undifferentiated way, shaped by preconceived convictions.
In the following, the media reactions to Austria's performance in the two PISA assessments will therefore be traced through the most widely read daily newspapers in Austria (Kronen Zeitung, Kurier, Standard, Presse, Kleine Zeitung, Oberösterreichische Nachrichten, Salzburger Nachrichten, Tiroler Tageszeitung, Vorarlberger Nachrichten, Wirtschaftsblatt, Neues Volksblatt and Wiener Zeitung), which together had a net reach of about 75% in 2001 and about 74% in 2004. At least one of these newspapers was read daily by roughly three quarters of the Austrian population over the age of 14 (cf. Mediaanalyse 2007, 2007a). In addition, the original-text releases on the PISA study issued by the Austria Presse Agentur (APA) after the publication of the results are presented, analysed and interpreted. The results of the two PISA assessments were each published at the beginning of December of the calendar year following their administration, i.e. in December 2001 and December 2004.
The period of presentation and observation therefore extends in each case from the publication date to 16 January or 31 January of the following year. Within these periods the newspaper articles are examined, and for roughly three months after publication the APA press releases, insofar as they address, among other things, the Austrian results of the study. The periods were chosen because the strongest reactions are to be expected in the first weeks after the publication of the results, although it must be noted that the topics surrounding the PISA study (or studies) were taken up again and again in public thereafter. The criteria by which the media reports are examined are the following:
– number of articles or press releases on the PISA studies in the periods in question
– author of the item, or persons given a voice in it
– evaluation of the Austrian results in the item (positive-neutral-negative)
– attribution of causes in the item (who/what is to blame, is responsible)
– measures demanded in the item (what must be done)
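A content analysis along these five criteria can be sketched as a simple coding tally. The article records below are invented placeholders, not actual coded articles from the study; they only show how each item is coded on the criteria and how the codes are then counted per PISA wave.

```python
# Minimal sketch of the coding scheme described above: each article is
# coded on voices, evaluation, causes and demanded measures, and the
# codes are tallied per PISA wave. The records are invented examples.
from collections import Counter

articles = [
    {"wave": 2000, "voices": ["journalist"], "evaluation": "positive",
     "causes": [], "measures": ["improve reading"]},
    {"wave": 2003, "voices": ["politician"], "evaluation": "negative",
     "causes": ["school system"], "measures": ["comprehensive school"]},
    {"wave": 2003, "voices": ["academic"], "evaluation": "neutral",
     "causes": ["politics"], "measures": ["neutral analysis"]},
]

def tally(wave, field):
    """Count how often each code of `field` occurs in articles of `wave`."""
    counts = Counter()
    for article in articles:
        if article["wave"] == wave:
            values = article[field]
            counts.update(values if isinstance(values, list) else [values])
    return counts

print(tally(2003, "evaluation"))
```

The tables that follow in this section are, in effect, such tallies over the full article corpus.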
2.1 The Reactions of the Daily Newspapers to the PISA Results 2000 and 2003 Compared
The articles in the newspapers named above were retrieved from the papers' electronic archives or from the collection of the Austrian National Library, were provided by the newspapers themselves, or come from the collection of press reports on schooling kept by the former Federal Ministry of Education, Science and Culture. Included were all articles that appeared between the official publication of the first PISA results on 4 December 2001 and 31 January 2002, or between the official publication of the second PISA results on 6 December 2004 and 16 January 2005 inclusive, that contain the word PISA, OECD or STUDIE, and that relate in some way to the Austrian PISA results. The categories into which the causes/reasons for the PISA performance and the solutions or measures demanded in the newspapers were sorted are largely adopted from Schwarzgruber, who had already devoted an intensive analysis to the 2003 results in Austrian daily newspapers. His categories are also suitable in substance for the newspaper reports on PISA 2000 and thus permit a sound comparison with the reports on PISA 2003.
The number of articles in these newspapers on the results of the two test years differs strikingly. As Table 1 shows, a total of 36 reports appeared in the newspapers mentioned in reaction to the first PISA results within the study period, but 231 in reaction to the second (Schwarzgruber 2006, 69). Considering that the study period for reactions to the first PISA wave was two weeks longer, the actual difference is even greater. The Austrian daily newspapers thus reported at least six times as much on the second PISA results as on the first.
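The adjustment for the unequal observation windows can be made explicit by comparing per-day publication rates. The window lengths are derived from the dates given above; the comparison is only a rough normalisation, since it assumes a constant publication rate across each window.

```python
# Rough normalisation of article counts by observation-window length,
# using the counts and date ranges stated in the text.
from datetime import date

count_2000, count_2003 = 36, 231
days_2000 = (date(2002, 1, 31) - date(2001, 12, 4)).days + 1   # 4 Dec 2001 - 31 Jan 2002
days_2003 = (date(2005, 1, 16) - date(2004, 12, 6)).days + 1   # 6 Dec 2004 - 16 Jan 2005

rate_2000 = count_2000 / days_2000
rate_2003 = count_2003 / days_2003

print(days_2000, days_2003, round(rate_2003 / rate_2000, 1))  # → 59 42 9.0
```

Normalised this way, the second wave produced roughly nine articles for every one on the first — consistent with the "at least six times" lower bound stated in the text.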
Table 1: cf. Schwarzgruber 2006, 69; Bozkurt/Brinek/Retzl 2007
Table 2: cf. Schwarzgruber 2006, 72; Bozkurt/Brinek/Retzl 2007

In half of the 36 newspaper articles on the PISA 2000 results (Tab. 2), journalists alone do the reporting. Politicians are given a voice in every fifth report and academics in every twelfth; 22.2% of the articles refer to other persons. In 45.5% of the 231 articles on the PISA 2003 results, by contrast, the views of politicians are reproduced, and only in every fifth report do journalists alone take a position. At 16.9%, academics are quoted proportionally twice as often as in the reports on PISA 2000. In 2.5% of the articles, representatives of industry are also heard. Almost every seventh article contains comments from other persons.
Table 3: cf. Schwarzgruber 2006, 76; Bozkurt/Brinek/Retzl 2007
69.4% of the articles on PISA 2000 evaluate the Austrian results positively. Every fourth article is neutral towards the results, and in only 5.6% of the articles are the results interpreted negatively. Austria's performance at PISA 2003, by contrast, is viewed negatively in half of the articles; the other half are neutral towards the results. Not a single item found anything positive in them. This shows clearly how massively polarised the portrayal of the PISA results in the daily press was: a positive, at times euphoric mood towards PISA 2000, and an atmosphere of catastrophe after PISA 2003.
Suspected causes are given in eight of the 36 reports (22%) on the PISA 2000 results. Of the 231 reports on the PISA 2003 results, 106 (46%) name causes or reasons for the performance. Several causes can be given per report; 10 causes can be counted for the PISA 2000 results and a total of 156 for the PISA 2003 results (cf. Schwarzgruber 2006, 85f). Regarding 2003, every second report speculated about possible causes; regarding the 2000 results, barely every fifth.
Table 4: cf. Schwarzgruber 2006, 86; Bozkurt/Brinek/Retzl 2007
Those held chiefly responsible for the consistently positively rated 2000 results are the teachers, the school and the school system, as well as politics. Responsibility for the negatively rated 2003 results is assigned mainly to the school system (40 mentions), followed by politics (27) and the teachers or the school (24 mentions). Migrants, parents and reading ability in general are also frequently named as causes. With nine mentions, the students and their performance are considered as a cause, but are held responsible for the results comparatively rarely.
Solutions or measures were demanded in 19 (52.8%) of the reports on PISA 2000 and in 193 (83.5%) of the reports on PISA 2003, with 28 mentions of measures for PISA 2000 and 493 for PISA 2003 (cf. Schwarzgruber 2006, 111). Schwarzgruber's category system was extended by three further categories, since the measures demanded in response to the PISA 2000 results could not be fully assigned to the PISA 2003 categories (Tab. 5). Calls for more tests or evaluation and for a change of policy or a larger budget each received three mentions in the reports on PISA 2000, but none in the reports on PISA 2003. Measures such as more autonomy for schools, language courses and preschool programmes, conversely, were demanded only in the reports on PISA 2003, not in those on PISA 2000.

Table 5: cf. Schwarzgruber 2006, 112; Bozkurt/Brinek/Retzl 2007

With 101 mentions, general calls for comprehensive reform, for the comprehensive school (96 mentions) and, markedly less often but still frequently (56 mentions), for improving teaching or teaching quality are the most common demands after PISA 2003. After PISA 2003, the all-day school was also demanded 49 times, changes in teacher education and in-service training 44 times, improving reading ability (literacy, reading tests) 34 times, and a neutral analysis of the results 14 times. The most frequently demanded measures after PISA 2000, with five mentions each, are improving teaching or teaching quality and improving reading ability (literacy, reading tests). General reform proposals were made three times after PISA 2000, and the comprehensive school was suggested twice as a suitable measure. The all-day school, a neutral analysis and improved teacher education and in-service training were each mentioned once after PISA 2000 as well.
2.2 The Reactions in the Press Releases of the Austria Presse Agentur (APA) to the PISA Results 2000 and 2003
The press releases analysed below come from the APA-OTS online archive; they contain the words PISA, OECD or STUDIE and relate to the Austrian PISA results by reporting on possible causes of the results or on measures and changes following from them, or by evaluating the results. The APA Originaltext-Service GmbH (OTS) distributes press releases verbatim, with editorial responsibility resting with the issuing party (cf. APA-OTS 2007). These releases go to more than 650 Austrian editorial offices and press departments (all Austrian daily newspapers except the Kronen Zeitung, public and private television and radio, periodicals, publishers, international news agencies, ministries and press offices, politics, organisations and interest groups, and many more), 7,600 professional users of the APA OnlineManager (AOM) platform, 12,500 subscribers to the APA-OTS mailing list, around 15,000 users of the APA online press reviews and customer-specific selections, as well as web portals and WAP services (cf. APA-OTS 2007a). The reactions to the first PISA wave are examined between the publication of the first PISA results on 4 December 2001 and 1 March 2002. The reactions to the results of the second PISA wave are examined from as early as 1 December 2004 until 1 March 2005, because the PISA results became known before their official publication on 6 December 2004, so that heated media debate had already broken out at the end of November.
Table 6: Bozkurt/Brinek/Retzl 2007
The trend already noted, namely that the second PISA wave aroused far more public interest than the first, is confirmed by the press releases as well: in the periods in question there were more than five times as many press releases in reaction to the second PISA results as to the first.
Table 7: Bozkurt/Brinek/Retzl 2007
Of the 14 releases on the PISA 2000 results, 64.2% come from politicians; of the 77 releases on the PISA 2003 results, as many as 87%. In both periods SPÖ politicians were quoted most often, ahead of ÖVP politicians; on PISA 2003, more than half of all releases refer to SPÖ politicians, while only 15.6% reflect the views of ÖVP politicians. Reactions from teachers', students' and parents' associations, and from social organisations and interest groups, appear in both periods. In addition, the releases reproduce opinions of industry representatives on PISA 2000 and the views of academics on PISA 2003.
Table 8: Bozkurt/Brinek/Retzl 2007

Table 8 shows that the evaluations of the PISA results of 2000 and 2003 vary greatly. In 71.4% of the press releases on PISA 2000 the results are judged positively, and in only 7.1% negatively. The PISA 2003 results are a different matter: no release judges them positively, while almost half evaluate them negatively. No evaluation, or a neutral one, was made in 21.4% of the releases on PISA 2000; more than half of the releases on PISA 2003, by contrast, contain evaluative statements.
As causes of the largely positively rated PISA 2000 results, SPÖ politicians name the teachers once and the SPÖ government (before the ÖVP-FPÖ coalition) twice. ÖVP politicians, for their part, credit Federal Minister Gehrer and the ÖVP-FPÖ government once, the teachers twice, and the differentiated school system and its permeability once. Representatives of teachers', students' and parents' associations, by contrast, see the cause of the negatively or neutrally rated PISA 2000 results once in Federal Minister Gehrer (and her austerity policy) and once in the selective school system (early division into strong and weak pupils - SS, HS, AHS; selection from age 10; an ossified system). No other group names a cause of the results of the first PISA tests in the releases concerned.
Table 9: Bozkurt/Brinek/Retzl 2007
Table 10: Bozkurt/Brinek/Retzl 2007

SPÖ politicians attribute the predominantly negatively rated PISA 2003 results 21 times to Federal Minister Gehrer and the austerity policy of the ÖVP-FPÖ government (cuts in lesson hours, staff, etc.). Three times, SPÖ politicians state that the selective school system, the division into strong and weak pupils (HS-SS-AHS), the separation at age 10 and the ossified system are to blame for the results; twice they name the teachers as a cause. ÖVP politicians hardly comment on the causes of the 2003 results. Politicians of the Greens likewise name Federal Minister Gehrer and the austerity policy of the ÖVP-FPÖ government as a negative factor in one release. In two of the 77 releases, Green politicians criticise the selective school system, the large amount of time school demands, the division into strong and weak pupils (SS-HS-AHS), the separation at age 10 and the ossified system. FPÖ politicians once name the SPÖ governments before the ÖVP-FPÖ coalition as a cause, and once the migrants. Teachers', students' and parents' associations name Federal Minister Gehrer and the austerity policy of the ÖVP-FPÖ government in three releases. Representatives of social organisations and interest groups see causes of the mediocre PISA 2003 results in the selective school system and the large amount of time school demands, in the division into strong and weak pupils (HS-SS-AHS), the separation at age 10 and the ossified system. Academics do not comment on possible causes or reasons regarding the PISA 2003 results.
2.3 Measures Demanded in Reaction to PISA 2000 and PISA 2003
Demands in APA-OTS releases               | SPÖ   | ÖVP  | Greens | FPÖ  | Ind./Acad. | L-S-E | soc.org. | total
------------------------------------------|-------|------|--------|------|------------|-------|----------|-------
Abolition of achievement tests            | 1/-   | -/-  | -/-    | -/-  | -/-        | 1/-   | -/-      | 2/-
ÖVP-FPÖ government (Min. Gehrer)          | 3/2   | -/-  | 1/-    | 1/-  | -/-        | 1/-   | -/-      | 6/2
All-day school                            | -/13  | -/2  | -/-    | -/1  | -/-        | -/3   | 1/1      | 1/20
Comprehensive school                      | -/14  | -/-  | -/2    | -/2  | -/-        | 1/2   | 1/1      | 2/21
Support measures                          | 1/10  | 1/1  | -/-    | -/2  | -/-        | -/-   | 1/1      | 3/14
Teacher education and in-service training | 1/2   | -/1  | -/-    | -/1  | -/-        | -/2   | 1/-      | 2/6
Infrastructure                            | 1/-   | -/-  | -/-    | -/-  | -/-        | -/-   | -/-      | 1/-
Tests, evaluation                         | -/2   | 1/-  | -/-    | 1/-  | 1/-        | -/-   | -/-      | 3/2
No tuition fees                           | -/-   | -/-  | -/-    | -/-  | -/-        | -/-   | 1/-      | 1/-
Technical education                       | -/-   | -/-  | -/-    | -/-  | 1/-        | -/-   | -/-      | 1/-
Joint reforms                             | 1/10  | -/7  | 1/-    | -/3  | -/1        | -/2   | -/-      | 2/23
No school-structure debate                | -/-   | 2/1  | -/-    | -/-  | -/-        | -/-   | -/-      | 2/1
School autonomy                           | -/3   | -/-  | -/-    | -/-  | -/-        | -/-   | -/1      | -/4
Upper-secondary reform                    | 1/-   | -/-  | -/-    | -/-  | -/-        | -/-   | -/-      | 1/-
SPÖ education programme                   | 1/7   | -/-  | -/-    | -/-  | -/1        | -/-   | -/-      | 1/8
Zukunftskommission                        | -/7   | -/1  | -/-    | -/-  | -/-        | -/-   | -/1      | -/9
School partnership                        | -/-   | -/2  | -/-    | -/-  | -/-        | -/1   | -/-      | -/3
Curriculum reform                         | -/-   | -/-  | -/-    | -/-  | -/-        | -/1   | -/-      | -/1
No grades                                 | -/1   | -/-  | -/-    | -/-  | -/1        | -/1   | -/-      | -/3
No adoption of another model              | -/1   | -/-  | -/1    | -/-  | -/-        | -/-   | -/-      | -/2
Abolition of 2/3 majority for school laws | -/4   | -/-  | -/-    | -/2  | -/-        | -/-   | -/-      | -/6
New instead of old                        | -/-   | -/2  | -/-    | -/-  | -/-        | -/-   | -/-      | -/2
Linking work and school                   | -/2   | -/1  | -/-    | -/2  | -/-        | -/-   | -/-      | -/5
Other                                     | -/17  | 1/7  | -/2    | -/4  | -/1        | -/1   | 1/1      | 2/33
Total                                     | 10/95 | 5/25 | 2/5    | 2/17 | 2/4        | 3/13  | 6/6      | 30/165

Table 11: Bozkurt/Brinek/Retzl 2007. In each cell, the first figure refers to PISA 2000 and the second to PISA 2003; "-" = no mention. Ind./Acad.: industry representatives commented only on PISA 2000 and academics only on PISA 2003, so this column combines the industry figures for 2000 with the academics' figures for 2003. L-S-E = teachers', students' and parents' associations; soc.org. = social organisations and interest groups.
Table 11 lists the demands voiced in the APA original-text releases after PISA 2000 and PISA 2003, and also shows who raised a given demand how often. It is thus readily apparent that the various parties, associations, organisations and interest groups often raise quite different demands, which, hardly surprisingly, are best understood in terms of the ideology or interests of the respective group.
The absence of professionally sound interpretations and scientific conclusions from the PISA consortium contributed substantially to conclusions and assessments being left, in many cases, to the representatives or "advocates" of preconceived opinions. Obvious political opponents are therefore easily recognised by the contrariness of their demands (cf. the next section of this chapter). Preconceived, ideologically nourished convictions and plans were thereby reinforced, and hardly any side offered occasion or impetus for rational argument. The relevant scholarship, too, largely failed to do so. As a basic finding, the 14 press releases issued after the publication of the PISA 2000 results contain a total of 30 demanded measures, the 77 releases issued after the publication of the PISA 2003 results a total of 165. Representatives of industry speak up only in the releases on PISA 2000, academics only in those on PISA 2003.
The most frequent demand of 2000, with six mentions, is that Federal Minister Gehrer and the federal government should "take their hat", "not rest on their laurels", show more commitment, stop the cuts, or work out reforms. Three mentions each go to
– (early) reading and language support (a preschool year for all), talent support (from kindergarten on), reading tests in primary school, support for at-risk students (vocational school), and to
– tests, benchmarking, performance comparison, quality management, evaluation.
Twice, moreover, the abolition or reduction of achievement tests is seen as a suitable path. The following demands are likewise each mentioned twice:
– the comprehensive school, merging of the school types, no selection at age 10
– education and in-service training of teachers: e.g. diagnosis and therapy of reading weaknesses, raising training to academic level (university or PH)
– joint reforms (government with opposition), a parliamentary education enquête, a crisis summit (precise data analysis plus research into causes), a look at other countries (Finland)
– no school-structure debate, but retention of the differentiated system and improvement of teaching.
Each of the following is demanded once in reaction to the PISA 2000 results:
– the all-day school
– improving school infrastructure (computers)
– abolishing tuition fees
– more educational provision in the technical field (HTL, FH)
– an upper-secondary reform
– implementation of the SPÖ education programme.
Die am häufigsten genannten Forderungen nach <strong>PISA</strong> 2003 sind mit 23<br />
Nennungen „gemeinsame Reformen (Regierung mit Opposition), parlamentarische<br />
Bildungsenquete, Krisengipfel (genaue Datenanalyse + Ursachenforschung),<br />
Blick auf andere Länder (Finnland)“. Knapp dahinter folgen mit 21<br />
Nennungen die Forderung nach einer „Gesamtschule, Zusammenführung der
340 DOMINIK BOZKURT, GERTRUDE BRINEK, MARTIN RETZL<br />
Schularten bzw. keine Selektion mit 10“ und mit 20 Nennungen die Forderung<br />
nach einer „Ganztagesschule“. Eindringlich werden auch (frühe) Lese- und<br />
Sprachförderung (Vorschuljahr für alle) bzw. Begabungsförderung (ab Kindergarten),<br />
Lesetests in der Volksschule bzw. Förderung der Risikoschüler (Berufsschule)<br />
gewünscht (14 Nennungen).<br />
Furthermore, the wish for "implementation of the proposals of the Zukunftskommission 5" (9 mentions) and for implementation of the SPÖ education programme (8 mentions) is expressed frequently. It is surprising here that it is predominantly the SPÖ that insists on implementing the proposals of the Zukunftskommission, even though this commission was appointed by the ÖVP minister Gehrer. Six mentions each go to
– teacher education and in-service training: e.g. diagnosis and therapy of reading difficulties, raising training to academic level (university or PH)
– the abolition of the two-thirds majority required for school legislation.
The demand for "closer cooperation between school and the world of work" is mentioned five times. The "expansion of school autonomy (schools and municipalities decide)" is demanded four times.

Three times each, the APA original-text releases call for
– the abolition of grades and of grade repetition, and
– the expansion of constructive school partnership.

Twice each, it is concluded from the PISA results that
– Federal Minister Gehrer and the federal government should resign ("take their hat"), "not rest on their laurels", show more commitment, stop the cuts, and work out reforms
– the use of tests, benchmarking, performance comparisons, quality management, and evaluation should be expanded
– no other school models (such as the Scandinavian ones) should be adopted without scrutiny
– new paths should be taken and no "old hats" dusted off.

Mentioned once each is that
– no school-structure debate should be conducted, but rather the differentiated system retained and teaching improved, and
– the curriculum should be reformed.

5 For more on the Zukunftskommission, see chapter 3.2.
PISA IN ÖSTERREICH 341

2.4 Summary

The number of newspaper reports and APA original-text releases on the Austrian PISA results in the period after their publication was many times higher in 2003 than in 2000. PISA thus moved into the focus of the Austrian public only after the second round. The assessment of the results in the two test years also differs sharply: while Austria's performance in PISA 2000 was rated predominantly positively, the Austrian PISA 2003 results, by contrast, were judged predominantly negatively. This points to a strongly polarizing public presentation of the PISA 2000 and 2003 results.
With the exception of journalists, who naturally have the most frequent voice in the daily newspapers but appear rarely if at all in the APA original-text releases, politicians are the most strongly represented group in both media. Scientists are cited more often in the newspapers (and for both PISA rounds) and also appear in the press releases on PISA 2003. A few views of industry representatives can be found in the press releases on PISA 2000 and in the newspaper reports on PISA 2003. Parents', teachers', and students' associations as well as social organizations and interest groups repeatedly have their say in press releases, but are scarcely considered explicitly in the newspaper reports.
The clear dominance of politicians among the persons featured in the media examined indicates that PISA is perceived by the public primarily as a political event, creating the erroneous impression that it can and must be responded to politically and that politics bears the responsibility for the test results. Handing the interpretation of the PISA results over to politics and to various interest groups leads to an ideologization, and thus to a substantial public overestimation, of the study, suppressing the use of factual arguments and foundations. Such a factual engagement with PISA must therefore be made up for after the fact, in collected volumes such as this one, and thereby opened up to scholarly discourse.

Attributing responsibility for the PISA 2000 and 2003 results to individual groups, e.g. teachers, politics, or the school system as a whole, is itself already a consequence of an ideologized discussion.
For Austria's performance in the PISA 2003 study, rated as strongly negative, both the newspapers and the press releases frequently also cite migrants and their proficiency in the language of instruction, without entering into any discussion of proposals for improvement. Students' deficient reading ability is pointed out in the newspapers, but in the press releases it is not regarded as a cause of the PISA results. "School anxiety", by contrast, is mentioned only in the press releases.
After the publication of the PISA 2000 results, a change of policy, a larger budget, and an end to the austerity measures are demanded most frequently in the daily newspapers and are also articulated comparatively often in the press releases. Likewise, the demand for more testing and evaluation, as well as for a comprehensive school and all-day schooling, is frequently voiced in both media. Furthermore, the call for general reforms, to be worked out jointly after closer analysis, can be identified in both media, as can the demand for improving or changing teacher education and in-service training. The improvement of teaching and of teaching quality, as well as literacy education and the improvement of reading ability, which are demanded most in the newspapers, however, hardly appear at all in the press releases. All other measures are mentioned in only one of the two media.
The frequency of the individual demands after the publication of the PISA 2003 results is somewhat different. In both the daily newspapers and the press releases, the demand for general, joint reforms is mentioned most often. The comprehensive school is the second most frequently demanded measure in both media. All-day schooling, demanded very often in the press releases with 20 mentions, also ranks far up in the newspaper reports with 49 mentions, i.e. among the top five.

The demand for support, (early) reading and language support (a pre-school year for all), support for the gifted (from kindergarten on), reading tests in primary school, and support for at-risk students (vocational school) belongs, with 14 mentions in the press releases, and together with the categories "pre-school measures" (50 mentions), "language courses" (28 mentions), and improvement of reading ability or literacy education (34 mentions) in the newspapers, to the most frequently mentioned measures in both media. Also regularly represented in both media is the demand for changing teacher education and in-service training; less frequent, but still present, is the call for more school autonomy. The implementation of the SPÖ education programme (9 mentions) and of the proposals of the Zukunftskommission (8 mentions), the abolition of the two-thirds majority required for school legislation (6 mentions), and a better connection between work and school (5 mentions) are relatively frequent demands in the press releases, but do not appear in the newspaper reports. Conversely, the improvement of teaching and of teaching quality is mentioned very often in the newspapers, with 56 mentions, but only once, and only in passing, in the press releases.
It is striking here that in both test years the demands for general, joint reforms, for a comprehensive school, and for all-day schooling are among the most frequently mentioned. In addition, after both PISA 2000 and PISA 2003, the demand for reading and language support (literacy through pre-school measures to overcome the at-risk-student phenomenon) appears frequently. Likewise, the improvement of teacher education and in-service training as well as the improvement of teaching and teaching quality are regularly voiced as demands in both test years.

It is also worth noting that after PISA 2000 a change of policy, a larger budget or an end to the austerity measures, and more testing and evaluation are demanded very often. After PISA 2003, however, these demands are of no, or at best subordinate, importance.
3 The PISA results – their education-policy assessments and consequences

Alongside the annotated presentation of the PISA 2000 and 2003 results and the description and discussion of their reflection in the media (cf. chapters 1 and 2), the education-policy assessments reveal the nature and degree of societal agitation on a further level, outside the "scientific community". Setting aside any reflection on what the OECD named as the explicit and implicit goals of the testing, and taking into account the specifically Austrian tradition of "agitation", one can speak in Austria of a political and argumentative dynamic that took on a life of its own and has not really subsided to this day. The basis for a serious discussion about the further development of the school system would have been available, e.g. in the findings of the ministerial Zukunftskommission, which, however, were discussed with only moderate engagement.

Only in recent weeks, prompted by the symposium on which this volume is based, and less on the basis that shaped the discussion in Germany, has there been a cautious pause in the thoughtless singing of PISA's praises. Reflecting on the question of what good teaching is and how it can succeed, and remembering Humboldt, the point is by no means to train "test artists" (Hartmut von Hentig), but to be able to meet the challenges of tomorrow – whatever those may turn out to be.
The PISA results were received differently in the countries examined and were discussed and interpreted, both pedagogically and in terms of education policy, either excitedly or calmly. Like hardly any other study and assessment of student performance, they provided occasion for analyses and conclusions and, at least in Austria and Germany, were soon overlaid by speculation and political reflexes. The editors of the education pages of daily newspapers and weekly magazines lined up "school experts" who focused on various aspects and immediately claimed to know why the respective country had performed as it did and what consequences were to be drawn.

Discussions about the pedagogical or didactic efforts needed to improve teaching, or systematic reassurance grounded in educational science – as the consequence of careful analyses – were largely crowded out by the politics of attention and ultimately remained the exception rather than the rule ...
What was overlooked is that the PISA study "can complement and deepen the respective national perspective by placing national results in a larger context for better interpretation" and allows "the respective strengths and weaknesses to be assessed in the light of the performance of other education systems". PISA, it is said, has created the basis for dialogue and cooperation in defining and implementing educational goals, with the competencies relevant for later life in the foreground (Schleicher n.d., 9). There is no talk of inferences about (general) education (Bildung) as conceived in the comprehensive Central European sense, nor is any causal relationship asserted between PISA results and school system.
3.1 Didactic improvements

Some countries acted decisively in accordance with this guideline. "The alarming and worrying findings of the first PISA study" led in Germany, very soon after the publication of the first results, to an international conference of the Gesellschaft für Fachdidaktik (the umbrella organization of all the academic subject-didactics societies), with the aim of "developing perspectives for improving subject-specific as well as cross-curricular learning and teaching (...)" (Bayrhuber/Vollmer 2004, 7).
Federal Minister Edelgard Bulmahn (ibid., 25 f.) presents approaches to education reform in her programmatic lecture. She calls for more "educational optimism" and, with reference to Finland, concentrates on the principle of individual support; she can also immediately offer additional funds from the German federal government (4 billion euros) to be invested in all-day school programmes, so that, unlike in the old "lockstep pedagogy" (G.B.), teaching can now be done differently on the basis of larger time budgets. The inner form of the "all-day school" remains open, however, since the emphasis is on partnership and cooperation with sports clubs, music schools, parents' initiatives, and the like. 6 The aim, she says, is the early identification of deficits and strengths in children – above all in the area of language, reading, and writing competence – and a focus on subject didactics. "A better quality of teaching in our schools can only be achieved through a didactic transformation" (ibid., 28), i.e. through a coherent combination of training in the subject discipline and in educational science and didactics.
Staatsministerin Karin Wolff, President of the Kultusministerkonferenz, points to the swift, results-oriented action taken after the presentation of the PISA results, which pursues a concentration on supporting children with a migration background, 7 on reading ability, and on better mastery of complex tasks. In all these dimensions, she argues, subject didactics is addressed, since what is at stake is the improvement of teaching quality. Under this aspect, the Kultusministerkonferenz has singled out educational standards as the central means of securing the quality of school education (more precise definitions are provided in the book on pp. 36 ff.).

With the evaluation by Land, however, the education-policy and school-organization debate is correspondingly charged in Germany as well, for the southern Länder, with their tracked school system, perform, roughly speaking, better in PISA than the northern ones.

Peter Bender, University of Paderborn, in a contribution "für die GDM Nr. 81", refers to the PISA comparative studies and to those articles dealing with the criticism of the PISA critics and with consequences for school organization: "The Bavarians were by far the best, (...) the next three places (went to) Baden-Württemberg, Saxony, Thuringia. The integrated comprehensive school had now been removed from the direct comparison of school forms – which in PISA 2000, after all, had turned out badly for it – (...) for statistical reasons. But the PISA weakness of the comprehensive school remains discernible (...). Nor can any honey be sucked from the international PISA and TIMSS figures for a unitary school system. True, the top countries all have one, but – and this is studiously ignored again and again – so do all the countries in the lower half of the table. The few countries with an early-tracked school system (Belgium, Germany, Austria, Switzerland, Slovakia, the Czech Republic), by contrast, are all found in the upper half. The international PISA figures thus likewise tend to speak against the unitary school. In my view, however, they do not speak for or against school systems at all; they are an expression of the cultural-technical level of development, the degree of achievement orientation, and the migration structure of the respective society, essentially independently of the school system" (Bender 2007). In particular, the Scandinavian countries have been lost as role models or, apart from Finland, are merely on a par with Germany; the more favourable migration-policy conditions of Sweden, say (compared with Germany), are likewise pointed out once again.

6 In Austria the difference between the Ganztagesschule and ganztägige Schulformen (all-day school forms) is essential: in the former, lessons, practice periods, and leisure periods alternate throughout the day and attendance all day is compulsory; in the latter, compulsory lessons are essentially scheduled in the morning, while consolidation, practice, leisure, and sport are offered in the afternoon hours – with attendance voluntary.

7 In Hesse, only children who have a command of the German language are enrolled in school. Finland, too, has set up special preparatory language classes for children with a migration background. Sweden and other countries have preparatory classes for children who need to catch up in the language of instruction. Austrian schools offer only the option of extraordinary school attendance (außerordentlicher Schulbesuch), which is legally limited to one year, and supplementary instruction limited to a few hours per week, offered during regular lessons.
As a further aspect of interpretation, Bender brings the issue of (in)equality of opportunity into his statement and points to methodological errors connected with the conclusions drawn from "participation in education" and "economic-social-cultural status".

Direct OECD statements, such as those made by Andreas Schleicher in various places, he argues, bring the motives of the PISA study to light. A few years ago, according to Bender, the school system was linked to growth in the gross national product – in ignorance of other essential factors. 8
Similarly crudely "carved", he says, are the competition-stimulating tables on (increases in) education spending: by their measure, Mexico would sit alone at the top, according to Bender.

In an open letter to the deputy chair of the GEW, Marianne Demmer, Bender also takes a position on the critics of the book PISA & Co – Kritik eines Programms, and on the reflex the criticism triggered in Germany – which in a sense proved that an analysis based on arguments leads (and led) to vilification and persecution. More on this in the contribution by Stefan Hopmann.

In this context, the statement made in Germany in 2006 by Vernor Muñoz of Costa Rica, Special Rapporteur of the UN Commission on Human Rights, is also to be interpreted: in it he criticized the tracked school system and found, in effect, that language is not the decisive factor for the integration of families with a migration background ... 9
3.2 The "Zukunftskommission"

In the Austrian report PISA 2000 – Lernen für das Leben, the responsible Federal Minister, Elisabeth Gehrer, sums up the necessary consequences and steps for action in the foreword to the results report: "The detailed evaluation now available gives important indications of the areas in which efforts to raise the quality of education should be intensified further. Austrian children should, at the latest by the end of the third year of primary school, be able to read reliably and with comprehension. Reading is the cultural competence, even in the age of automation. The Ministry of Education has therefore launched the project 'Lesefit' under the motto 'Being able to read means being able to learn'. With the involvement of parents and the Buchklub, we must ensure that all children leave primary school with excellent reading skills." With reference to the "thematic reports", attention is drawn to the "differing competencies of girls and boys and of German-speaking and non-German-speaking students", and the detailed evaluation is commended overall for its feedback function for raising quality in the education system. The foreword to the PISA 2003 study recalls "Lesefit" and refers to the expansion of the project IMST (Innovations in Mathematics, Science and Technology Teaching). With the initiative "klasse:zukunft", the development of educational standards, and the determined continuation of internal school reform, Austria is on the right path to improving and securing the quality of teaching in a lasting way, according to Gehrer. In closing, however, she also remarks that "performance measurements such as PISA" deliver "important snapshots" but embody only a "part of the achievements (...) produced at our schools".

8 Transferred to Austria, this would permit extremely reassuring conclusions about the school system, since the latest economic data show comparatively very good results.

9 Who here has failed to understand, or simply skipped over, the core of the PISA tests – literacy?
At the request of the Education Minister, a cabinet decision of 1 April 2003 set up the so-called Zukunftskommission under Günter Haider, the official responsible for PISA in Austria, to formulate education-policy consequences from the OECD study (the other members are Christiane Spiel, Ferdinand Eder, Werner Specht, and Manfred Wimmer). In 2005 the commission presented its findings, a paper of analysis and measures whose substance is not so far removed from the German conclusion; it focuses on improving the quality of teaching.
The stated reform goal is: improve school and teaching systematically (emphases by the authors throughout). "The results of recent performance measurements (above all PISA), the deliberations on quality improvement in schools that have been under way for more than a decade, and the analysis of the framework conditions in Austria all suggest placing the teaching and learning processes in the classroom, the content of instruction, and the teaching methods – in short, 'good teaching' – at the centre of the reform measures.

Reform strategy: quality development before structural reform

In its first report, the Zukunftskommission placed the main weight of its proposals on quality assurance, quality development, and the expansion of a dependable school, and less on the reconstruction of structural features and organizational elements. It stays on the same line in this follow-up report. The measures proposed by the Zukunftskommission therefore aim at improving teaching through school development and quality assurance, through teacher education and support systems – and not through a rebuilding of the system.

The overall strategy is oriented to the following four principles:
1. Systematic quality management: promotion of quality development and quality assurance at all levels. (...)
2. More autonomy and more self-responsibility – increased room for action combined with transparent performance and accountability. (...)
3. Professionalization of teachers: criterion-based selection, competence-oriented training, performance-oriented career advancement. (...)
4. More research & development and better support systems. (...)" (cf. Abschlußbericht – Zusammenfassung).

On this basis, the summarizing recommendations formulated and elaborated in detail five fields of action (with individual subfields) as well as priority and overarching fields of research & development.
These activities, quite positively received in public (cf. also the parliamentary debates cited in this contribution), were followed by several more:

On 9 February 2005, in an expert opinion for Federal Minister Gehrer written in the light of the PISA results, Peter Posch discusses "some possible reasons for the weaknesses of the Austrian school system and approaches to overcoming them: What is essential (...) is the insight that improvements can be expected only from a complex ensemble of measures."

Among his ten points he does note the "question of the fragmentation" of the school system and its consequences for weaker students and for those from difficult social backgrounds – in 2000 and 2003 this group performed disturbingly badly, because the performance incentive provided by higher-achieving students was missing – but his first point is quality assurance. "The introduction of an obligation to have a school programme, under which schools are required at certain intervals to give an account of initiatives, and their results, for the further development of teaching quality and of the school's framework conditions, was not anchored in law, although a detailed proposal had already been worked out in 2002 (...)." A further point for Posch is the gaining of school time, i.e. morning instruction alone is not sufficient. Alongside the improvement and professionalization of teacher training, the author urges that everyday teaching be developed further methodologically, so that demanding tasks and thinking performances can be handled better in future (cf. TIMSS). More transparency in the formulation and assessment of performance requirements should result, among other things, from continuous in-service teacher training and from professional cooperation in subject-group teams. Finally, he argues, the qualification of school heads and a strengthening of the management level are needed, and the "supervision vacuum" between school heads (responsible for the school programme) and the school inspectorate (responsible for the quality of self-evaluation) must be eliminated.
As an essential point, Posch highlights the probably detrimental effect of migrant children's inadequate knowledge of German. On this: "Establishment of programmes to secure the German-language skills of children with a migration background, not only alongside their school career but (...) before they enter school." A high concentration of children whose mother tongue is not German should be avoided in the process. He closes by pointing to Finland, where Finnish is taught systematically from kindergarten age on.
3.3 Public reactions

Mirrored in the media reception, the public discussion about consequences from PISA and comparative school-performance tests looks different. Whereas the first PISA report met with much, at times undifferentiated, satisfaction, self-praise, and calm, ranging from "everything in the green zone" to "got away with it once more, and above all did better than Germany" (more on this in chapter 2), reactions to the second PISA report condense into reports of catastrophe.

Finland is often the byword for all that is good in pedagogy; the reluctance to charge it morally and to overstrain it is rapidly declining, and even other international education analyses cannot put this into perspective – e.g. the OECD country report 2005 cited by Education Minister Gehrer, which is quoted for press briefings and related information: "The great variety of school types can be regarded as a particular characteristic of Austria. The high level of vertical and horizontal differentiation shows advantages, but also limitations. The school system offers parents great freedom of choice, especially in Vienna and the large cities, although this could also lead to fragmentation and high costs." After the appearance of the above-mentioned country report, the argument is supported with an analysis in the FAZ: "What is decisive is not the particular school system, but a wise and considered handling of the existing school tradition. After this country comparison there is less reason than ever to sacrifice the three-track school system established in Germany to a unitary school system on the Scandinavian model. Rather, the countries that have relied on a three-track school system with high quality standards in all school types are in the lead (...)" (press information of the BMBWK, most recently of 16 Dec. 2005). In the same release the ministry also points to the successful performance of Bavaria, a Land with a tracked school system, as well as to Austria's low youth unemployment and to the WHO report on well-being at school: on the question "Do you feel comfortable at school?", Austria ranks 3rd, Finland 34th (last place).
All this, however, does not (or no longer) contribute to making the debate more matter-of-fact. It shows only a pedagogically and systematically motivated flicker of a serious attempt to put the present school and its function as a comprehensive "apparatus for producing justice" into perspective, or rather not to overstrain it, and instead to underline the school's contribution to the creation of a humane society as modestly as, on the basis of the current state of research, it may and must be stated ...
Members of the Zukunftskommission, too, take a similar position with regard to a possible restructuring of the school system, for example its chairman Haider, who, above all orally, repeatedly refers to the Austrian tradition and school culture (e.g. Haider in the parliamentary education committee and at a study conference of the ÖVP in Alpbach).
3.4 Parliamentary Resonance
In the course of a so-called Aktuelle Aussprache (topical debate) in the parliamentary education committee (in the committees the legislative work is prepared, discussed and adopted by majority, to be voted on afterwards in public plenary sessions; topical debates are part of the discussions in the committee meetings, to which experts are also invited), "the PISA study and the work of the Zukunftskommission" was up for discussion on 3 July 2003.

352 DOMINIK BOZKURT, GERTRUDE BRINEK, MARTIN RETZL

Günter Haider was also invited as an expert. Minister Gehrer opened by outlining the aim of the Zukunftskommission; the Austrian PISA chairman followed on and stressed that the Zukunftskommission "would concern itself in particular with quality and quality development, with the emphasis placed on the improvement of instruction. According to Haider, 80% of the quality improvement can be achieved in this way, above all by starting with teacher education. Only 20% of the improvements, he estimates, could be achieved through organisational measures. Particular attention would be given to preparation for lifelong learning and to a secure, comprehending reading ability, for without it no independent acquisition of education is possible", something achievable only through the corresponding instruction (Parlamentskorrespondenz No. 536).
According to Haider's remarks, the work of the Zukunftskommission aims at an overall concept, which can only be implemented in the long term; short-term effects are ascribed to the implementation of quality management. As long-term projects the expert named the turn towards an orientation on accountability as well as the realisation of a strengthened autonomy ...
The subsequent debate then revealed "the different weightings applied by the opposition and by the government parties". From the statements of the SPÖ and Green deputies the view could be heard that "organisational measures could certainly exert a stronger influence on the quality and the output of instruction." For Werner Amon of the ÖVP, according to his exemplary contribution, the point was to develop the relatively good school system further through a concentration on the improvement of instruction and the expansion of autonomy. After an extensive debate, Haider also underlined the necessity of extending the all-day school and instruction time and of dividing the curriculum into core and extension material.
On 1 December 2004 the parliamentary education committee discussed, within the topical debate, the results of the second PISA study that had become known shortly before (no final official result was available at that point; until 7 December the results remained the property of the OECD). Minister Gehrer conceded that Austria had slipped in the ranking. She considered it, however, "mistaken at the present moment to make premature attributions of blame and to propagate a particular school system as a remedy" (Parlamentskorrespondenz No. 893). Besides pointing to the intended analysis, which she wanted to commission from Haider, she recalled measures already taken, as well as the situation regarding youth unemployment, favourable compared with Finland, for instance.
In order to do justice to the pending reforms in education, the work of the parliamentary education committee on 20 April 2005 centred on the decision to abolish the two-thirds majority for school legislation, for which the support of the opposition was required. The basis for the discussion in the topical debate was the final report of the Zukunftskommission; Günter Haider was once again available for questions and contributions, and again described the systematic improvement of instruction as the central element of school reform. This included optimally fostering the ability to learn and perform, raising the qualification of teachers, and improving the sustainability of instruction. "Quality development takes precedence over structural reform," he said, and also took the view that early language support should be implemented as quickly as possible (Parlamentskorrespondenz No. 272). As was to be expected, the opposition deputies responded with the wish for a structural reform, just as the ÖVP deputies regarded the qualitative improvement of instruction as the priority. After a further statement by Minister Gehrer, in which she recounted all the improvement and school-development measures taken so far, the discussion was resumed after a short interruption, followed by the debate on new majorities for school legislation. Reforms were to be implemented faster and more simply than before. Reference was further made to the planned and indeed budgeted roughly 300,000 additional remedial lessons and to the likewise envisaged expansion of all-day care.
In the course of the following months, however, the political discussion of PISA, or rather the publicly conducted debate, grows ever narrower in its arguments and concentrates, ideologically charged, on the question of "comprehensive school or tracked school system".
That this reflects, though not only this, the approaching federal election campaign from 2006 onwards becomes obvious, for instance, in the fact that now, among others, the Austrian PISA chief Haider increasingly behaves differently in public: deviating from his substantive line and his original recommendations, overstepping scientific boundaries and passing political judgements, he accuses the Education Minister, in effect, of grave educational-policy failures, immobility and weariness of office.
3.5 In-Depth Analysis
In 2005 a discussion develops in Austria, too, about the evident methodological weaknesses, i.e. the design and the evaluation of the achievement measurement in the PISA survey. Minister Gehrer commissions Erich Neuwirth and his team at the University of Vienna to produce in-depth analyses and contributions on methodology. Scientifically secured statements were meant to help clarify and interpret the differences. The section "Corrected main results" (Neuwirth, 62 ff.) turns to the now possible sound comparison of the PISA results of 2000 and 2003. "It emerges that the reading performance of Austrian students has hardly changed, and that both the reading scores and the mathematics scores lie close to the OECD average. In the natural sciences, by contrast, a clear decline in the Austrian scores is discernible" (Neuwirth, 62). It should also be noted that the mathematics scores are not directly comparable, because the domains covered by the mathematics test were greatly extended in 2003, so that no longer exactly the same field of competence was examined as in 2000.
The analysis of gender differences yields no noteworthy result either in reading or in the natural sciences. Analysing the answer formats in the natural sciences, on the other hand, leads to an instructive finding: on tasks with open, free (verbal) answer formats, Austria's students did worse than in 2000 (cf. Neuwirth, 71, 75). This applies above all to female students in vocational schools and intermediate vocational schools (note 10). "Of the publicly discussed 'PISA crash' in all disciplines and of the drastic divergence of the genders' reading scores, nothing remains once the corrected data are analysed," the author states (Neuwirth, 64).
The team of statisticians comes to the conclusion that the data material does not withstand closer scrutiny with regard to consistency (cf. chapter 1.2.1) and that there can be no talk of an Austrian "crash". For example, in PISA 2000 the share of girls in the Austrian sample was higher than the share of boys, which contradicts any basic demographic knowledge. More detailed analyses using additional data not publicly available then showed that the data of the two PISA surveys of 2000 and 2003 are not directly comparable, especially for Austria ... (Neuwirth, 11).

10 One relation may be articulated in this context, namely with the development of the number of children with a migration background and with the experience gained from integration work. At that time Austrian federal policy was implementing its programme of family reunification, i.e. it was mainly children who were coming to Austria ...
3.6 "PISA bringt allen was" ("PISA Has Something for Everyone")
Neuwirth's analysis is hardly able to change the generally overstrained interpretation of the PISA results in Austria. The political opposition charges the discussion morally and demands less selection and more justice from the school system; the message conveyed is that a comprehensive school system will deliver this. The original intention of PISA, to fuel (economic) competition among ever more countries by means of unmanageable masses of data (the third PISA report presents data from 60 countries), goes unconsidered. Opinion-poll results continue to show a constant assessment of the school system on the part of Austrians:
From the recent past up to the present, the tracked school system has clearly led the comprehensive school in Austrian surveys (mostly between 65 and 75%). Attitudes towards school reform, however, developed into a kind of collective awareness of the necessity of improvements; detailed questions, though, yielded no conclusive picture. In spring 2007 it was rumoured in teachers' circles that Austrians were now voting for the comprehensive school after all, once the question had been preceded by the information that PISA had shown the comprehensive school to be the better school model. The sources cannot be traced, and the resonance remains modest overall ...
In accordance with the SPÖ-ÖVP government programme of January 2007, the Education Minister sets up a reform commission in order to identify model regions for trying out forms of the comprehensive school. The majority of its members are not experts from the fields of educational theory, educational science or school research, but have in the past articulated themselves essentially as opponents of the tracked school system. Günter Haider acts as the minister's education adviser.
Individual federal provinces present their variants of a reformed school of the future. Upper Austria, for example, supports a one-year extension of primary school, at the end of which the decision on the further educational path is to be made. Lower Austria relies on a kind of two-year orientation stage after primary school and invokes high acceptance in doing so: according to the Fessel+GFK institute (July 2007), 78% would want to stay with the existing school system and only 18% would vote for a single school type (3% no answer); 62% consider the above-mentioned orientation stage a good idea. Other provinces have shown their willingness to cooperate in implementing a comprehensive-school model region (e.g. Carinthia, Burgenland). The published results of the past decades' trials with models of the integrated comprehensive school go unconsidered.
At present, though no longer as heatedly as in the years before, PISA stands on the one hand for a kind of national insult with which people have not yet really learned to deal; on the other hand, above all on academic ground, the "OECD machinery" is associated with exactly what it wanted to be, namely a licence for the international shake-up of the worlds of schooling and business. Wolfgang Horvath quotes the OECD itself on this point: PISA serves "(...) a better understanding of the returns to education in the most developed countries and in those still at an earlier stage of economic development" (Horvath, 208) and focuses on "the competences relevant for later life" (Schleicher, 9). In the public discussion, however, no account is taken of this motive, not least because PISA is technically designed in such a way that a result determined in this manner, to vary the slogan of the Austrian Post AG, "has something for everyone" ("bringt allen was").
Beyond this, it has so far remained largely unreflected that behind PISA lies an educational norm, a conception of (general) education that is by no means self-evident in countries shaped by the tradition of Humboldt and Schleiermacher. "This conception of education, in itself thoroughly worth discussing, whose roots are to be sought in a concrete historical and cultural space, is rendered absolute in its claim to validity by the notorious concealment of its history of origin. It is presented as universal, as given by nature, and thereby posited as a norm; or rather, strictly speaking, this norm is precisely not declared as such. It is tacitly presupposed as the only possible one, because the only thinkable one. It thus takes on the character of a law of nature (...). General education is thereby trimmed towards usability; the measured educational value serves as an indicator of economic clout" (Horvath, 210). In this respect the improvement proposals articulated by the Zukunftskommission are internally consistent: comprehensibly couched in a largely management-technical terminology, they aim at improved qualifications and their efficient deployment in the service of students and of the worlds of business and work.
It thus seems downright paradoxical when representatives of the political left want to do even greater justice to the PISA results: with the comprehensive school, and with it the global economic orientation of the education system, rather than with a (necessarily reformed) tracked school system that also realises the Gymnasium. That the opportunities and strengths of the European Union might lie in heeding an enlightened understanding of education is articulated by hardly anyone outside "academia".
That not all countries engage equally intensely with the "ranking spectacle" (Schirlbauer 2007, 6) is shown, he notes, by the OECD's PISA homepage. Two months after the publication of the PISA study, the winner Finland devoted 8 pages of print-media coverage to the event, the UK (with results in the top quarter) 88 pages, France (in the top third) 32 pages, Germany (below average) 774 pages, and Italy (behind Germany) 16 pages. Austria does not appear in the list (cf. Gruber). Schirlbauer does not follow Gruber's judgement (ibid.) that Italy's media public is marked by Berlusconi's politics of suppression; instead he points to another national focus of attention, football. Likewise, in the various countries the ability to deal constructively with international comparative studies in terms of education policy is developed to different degrees, and is thus the reason for public agitation or its absence ...
Schirlbauer's interpretation of the current educational standards, which are also a kind of consequence of the project 'klasse:zukunft' (a proposal of the Zukunftskommission), sees them as oriented towards a necessary but still concealed comeback of the syllabus: resulting on the one hand from the PISA results, on the other from the ever more obvious failure of the educational reform efforts of the 1970s, which sought to dispense with content and knew how to celebrate the methodical with pseudo-emancipatory intent, in order to free students and teachers from the "pressure of subject matter" (cf. Schirlbauer 1992, 27).
"What is good instruction?" is, consistently, also the guiding question and the guiding motive of the Zukunftskommission's reform proposals. The endeavour thus stands in the, as it were, re-emerging tradition of an orientation towards school and instructional quality: more out of a reflection on didactics and the modern art of teaching than as a reaction to the globalisation pressure of the so-called knowledge society (cf. the conclusions of those responsible for education in Germany, in: Bayrhuber et al.).
Ewald Terhart, one of the leading didacticians of the German-speaking world, placed the emphasis on instruction as early as 2002: "The quality of a teacher is decided by the quality of the instruction. School quality arises to a high degree from instructional quality. The school culture as a whole, school as a space of experience, does form an important background of learning and socialisation for students and teachers; nevertheless, the quality of instruction and of instructional development is ultimately the decisive area." The question of 'good instruction', he holds, is decided on three fields: the design and preparation of the context, the conduct of the instruction itself, and the subsequent analysis and evaluation (cf. Terhart, 99 f.).
The perpetually half-formed awareness of the (actual?) telos of the teacher, fashionable marketing expectations, the view that painstaking detail work goes unthanked because it long remains invisible, a legitimation pressure perceived as rising, and a palpable competition among schools (cf. also the declining number of students) have led, and still lead, to the inclination to neglect the "inner school reform" and to bet instead on structural (i.e. presentable) changes. This stands in confrontation with the lasting insight that the "execution of the pedagogical core business" is incumbent upon the teacher, who has to answer for it, however challenging the path may be and however comfortable the (self-)deception may feel.
The harbingers of the 2007 PISA report give reason to suspect that politically schematic interpretations will follow the undifferentiated and one-sided judgements of the past. With every such assumption, however, there also goes the hope that progress towards argumentative sensitivity and intellectual care can still become reality, in the evidence-based knowledge society of twenty-first-century Europe.
4 Conclusion
The results of PISA 2000 and PISA 2003 were perceived by the media public quite in accordance with the published results. Without taking account of the re-analyses by Neuwirth et al., a decline in the performance of Austrian students is to be noted between PISA 2000 and PISA 2003 (see chapter 1). This decline was mirrored in the media, but dramatised and exaggerated in its extent: the results of PISA 2000 are regarded in the media as extremely positive, those of PISA 2003 as predominantly negative. If the re-analyses by Neuwirth et al. are taken into account, Austrian students lie at the OECD average in reading and in mathematics in both assessments. Strictly speaking, then, nothing has changed in these two domains, and the much-cited drastic decline in performance turns out to be a fiction. Only in the natural sciences is there an actually ascertainable decline, to which the corresponding media response has been all but absent.
Interestingly, the subject-specific demands raised in the media after both PISA waves concern almost exclusively the promotion of reading and language. In view of the worryingly high number of students whom PISA classifies as very poor readers, and of the elementary importance of reading in our world, this certainly has its justification; with regard to the far greater decline in the natural sciences, however, it is somewhat surprising.
All other demands after PISA 2000 and after PISA 2003 hardly differ in essence and repeat themselves regardless of the results. They rather mirror those educational-policy debates shaped by familiar convictions (e.g. regarding school structure), concern family- and social-policy choices (the all-day school) and contain calls for reform, but can in no way be derived from the PISA study, nor refuted or justified by it. It is no different with the presumed causes or reasons put forward for the PISA performance: these too stem largely from subjective assessments and can in no way be substantiated by or through the PISA results. Let it be cautioned here once more that the arbitrary voicing of attributions of guilt and responsibility without a sufficient pedagogical foundation must not be allowed to strike the weakest groups of a society.
Overall it remains to be noted that the media attention paid to PISA 2000 differs sharply from that paid to PISA 2003, and that some persons' assessment (of the consequences) shifts under changed temporal and political conditions. While PISA 2000 functioned as a media "side issue", PISA 2003 became a media "spectacle".
In view of these findings it may be concluded that PISA does not constitute a framework that would permit a systematic, rational exploration of causes or the devising of measures in the field of schooling and education, nor does it suggest any particular school-policy conclusion. PISA itself offers no foothold for sensible concepts, measures or interventions in the educational sphere of the individual states. Both PISA 2000 and PISA 2003 served at most to stimulate the "consciousness of competition" among the OECD states and to lend educational-policy positions, convictions and projects a seemingly scientific justification and media-effective circulation in public.
Paradoxically, the fact that everyone can find his or her own notions and convictions confirmed by PISA is the reason for PISA's assertiveness and success, though at the same time the programme's curse. At least one positive aspect of PISA is that education has moved back into the centre of public and political discussion, and with it the chance that pedagogically legitimable developments can be set in train, even if the reactions to PISA so far have largely been marked by the revival of "dusty" educational concepts and foreign "school copies", which can, however, be all the more of an incentive to develop new ideas of one's own on a scientifically secured basis.
References

APA-OTS (Hg.) (2007): Über APA-OTS. Online publication [http://service.ots.at/standard.php?channel=CH0171&document=CMS1096293925986&sc=pt], downloaded 20.7.2007.

APA-OTS (Hg.) (2007a): APA-OTS Empfänger. Online publication [http://service.ots.at/standard.php?channel=CH0171&document=CMS1135843558515&sc=pt&sb=pt3], downloaded 20.7.2007.

Bayrhuber, Horst, Ralle, Bernd, Reiss, Kristina, Schön, Lutz-Helmut, Vollmer, Johannes (Hrsg.) (2004): Konsequenzen aus PISA. Perspektiven der Fachdidaktiken. Innsbruck.

Bender, Peter (2007): Leserbrief. Online publication [www.uni-paderborn.de/bender/LeserbriefPISAkritik.pdf], downloaded 19.7.2007.

Buhlmann, Edelgard (2004): "Konsequenzen aus PISA – Perspektiven der Fachdidaktiken". In: Bayrhuber, Horst, Ralle, Bernd, Reiss, Kristina, Schön, Lutz-Helmut, Vollmer, Johannes (Hrsg.): Konsequenzen aus PISA. Perspektiven der Fachdidaktiken. Innsbruck.

Gruber, Karl Heinz (2004): Bildungsstandards: "World class", PISA-Durchschnitt und österreichische Mindeststandards. In: Erziehung und Unterricht. Jg. 2004.

Haider, G. u. Reiter, C. (2004): PISA 2003. Internationaler Vergleich von Schülerleistungen. Graz: Leykam.

Horvath, Wolfgang (2006): PISA-Studie. In: Dzierzbicka, Agnieszka / Schirlbauer, Alfred (Hrsg.): Pädagogisches Glossar der Gegenwart. Wien.

Mediaanalyse (Hrsg.) (2007): Jahresbericht 2001. Tageszeitungen. Total. Online publication [http://www.media-analyse.at/frmdata2001.html], downloaded 25.6.2007.

Mediaanalyse (Hrsg.) (2007a): Jahresbericht 2004. Tageszeitungen. Total. Online publication [http://www.media-analyse.at/frmdata2004.html], downloaded 25.6.2007.

Neuwirth, E., Ponocny, I., Grossmann, W. (2004): PISA 2000 und 2003: Vertiefende Analysen und Beiträge zur Methodik. Graz: Leykam.

Parlamentskorrespondenz (2007): Online publication [http://www.parlament.gv.at/portal/page?_pageid=607,78669&_dad=portal&_schema], downloaded 24.7.2007.

Posch, Peter (2005): "Einige mögliche Gründe für die Schwächen des österreichischen Schulsystems und Ansätze zu ihrer Überwindung". Unveröffentlichtes Arbeitspapier des Bundesministeriums für Bildung, Wissenschaft und Kultur.

Reiter, C. u. Haider, G. (2002): PISA 2000 – Lernen für das Leben. Österreichische Perspektiven des internationalen Vergleichs. Innsbruck-Wien-München-Bozen: Studien Verlag.

Schirlbauer, Alfred (1992): Junge Bitternis. Eine Kritik der Didaktik. Wien.

Schirlbauer, Alfred (2007): Sollen wir uns vor den Bildungsstandards fürchten, oder dürfen wir uns über sie freuen? (unveröffentlichtes Manuskript). Wien.

Schleicher, Andreas (2004): Vorwort des Leiters der Abteilung für Indikatoren und Analysen im OECD Direktorat für Bildung. In: Neuwirth, E., Ponocny, I., Grossmann, W. (Hrsg.): PISA 2000 und 2003: Vertiefende Analysen und Beiträge zur Methodik. Graz: Leykam.

Schwarzgruber, Manfred (2006): Die PISA-Studie und ihre mediale Darstellung. Eine Inhaltsanalyse der Berichterstattung über die PISA-Studie 2003 in österreichischen Tageszeitungen. Universität Salzburg: Diplomarbeit.

Terhart, Ewald (2002): "Wie können die Ergebnisse von vergleichenden Leistungsstudien systematisch zur Qualitätsverbesserung in Schulen genutzt werden?" In: Zeitschrift für Pädagogik. Jg. 48, Heft 1.

zukunft:schule (2005): Strategien und Maßnahmen zur Qualitätsentwicklung. Abschlußbericht der Zukunftskommission. Wien: Bundesministerium für Bildung, Wissenschaft und Kultur.
Epilogue: No Child, No School, No State Left Behind: Comparative Research in the Age of Accountability

Stefan T. Hopmann
Austria: University of Vienna
Why, and under what conditions, is PISA successful? How is it that PISA results can serve simultaneously as justification for utterly opposed educational-policy options, and that in some countries the whole of education policy falls under the skewed shadow of PISA, while elsewhere PISA is only one voice among many? Starting from the contributions to this volume, and supplemented by findings of historical-comparative research, the following chapter analyses PISA as a case study of the fundamental changes that are gradually taking hold of all public services (such as the health sector, for example). It turns out that there are good historical and current reasons both for the achievements and weaknesses of the PISA project and for their highly divergent uses.
Why is it that a comparative project like PISA can gain so much public attention in so many countries at the same time? What makes some governments tremble, parliaments debate, journalists write, parents nervous, and teachers angry when PISA announces new results? Why are educational administrations and political committees eager to align their curriculum concepts to the one implicit in the PISA tests? Why is PISA big news in some places, and in others news worth only a short notice on page five or in the education section? PISA is not the first project of its kind: what is different about PISA?
Of course there is no single explanation for this mind-boggling success story. PISA has obviously hit something in the public mind, or in the political mind, at least in Western societies. This makes it “knowledge of most worth”. It is unlikely that this success is a result only of the quality and scope of PISA itself. If one accepts at least some of the criticism voiced in this volume, the opposite seems to be the case. What is of “most worth” in the eyes of the public is not the complicated and often overstretched research techniques or the specific design, but the simple messages which front the public appearance of PISA: the league tables and the summaries, which indicate what PISA sees as the weaknesses or strengths of the respective systems of schooling.
But news of this kind has been around before – ever since the IEA began its comparative research in 1959 – without ever gaining a similar impact. Thus it is not enough to look at PISA itself to understand this story. It is necessary to understand, at the same time, how the social environment has changed: schooling, policies, and the public. The question is how PISA and its methodology fit into a larger frame of social transformation and thus could achieve the influence they have now. Moreover, one has to ask whether there is one PISA achieving all this, or whether it is more appropriate to talk about the “multiple realities of PISA”, i.e., the manifold ways in which PISA is enacted and experienced, which have been crucial for its success, and which the methodological mix utilized by PISA allows for.
In my view, the rise of PISA owes much to what I would call the emerging “age of accountability”, i.e., a fundamental transformation ongoing in at least the Western world, which centres on how societies deal with welfare problems like security, health, resurrection and education (see Hopmann 2003, 2006, 2007). PISA fits this context in many different ways, depending on how accountability issues unfold in different societies. In some places the same fit is expressed equally by national policies like the “No Child Left Behind” legislation in the US, or by the development of national education standards, as has been the case in countries as diverse as Sweden, Germany, Switzerland or New Zealand. What is important to note here is that the education sector is rather late in addressing such issues compared to other areas such as health or security. In this perspective PISA is but one example, a kind of collateral damage sparked off by the intrusion of accountability mechanisms into the social fabric of schooling.
To explain this observation, I will first outline how the transformation called “the age of accountability” can be understood (1). In the following section (2) I will try to illuminate three basic modes of accountability, namely the strategies of “no child left behind”, “no school left behind” and “no state left behind”, and how PISA fits into these different settings. In the last section (3) implications of the multiple realities of accountability for current and future school development are discussed. In doing so I rely on the results of the Norwegian research project “Achieving School Accountability in Practice” (ASAP), which I initiated in 2003 and which will present its results in another volume later this year (Langfeldt, Elstad & Hopmann 2007). Other sources are comparative projects with which I have been involved in recent years, like “Organizing Curriculum Change” (OCC), including research projects in Finland, Norway, Switzerland, Germany and the US (cf. e.g. Künzli & Hopmann 1998), and the dialogue project “Didaktik meets Curriculum”, which involved scholars from about twenty countries throughout the 1990s (cf. e.g. Westbury, Hopmann & Riquarts 2000; Gundem & Hopmann 2002), and – last but not least – the research done in preparation of this volume on PISA, in cooperation with colleagues from seven different countries (Austria, Denmark, Germany, Finland, France, Norway, and the UK).
1 The Age of Accountability
Social scientists, economists, politicians, educators and the public seem to agree that something fundamental is going on, something which changes at least in principle the social fabric of Western societies. However, they differ widely in what they see as the core of this transition. To name but a few recent examples:
– “The Modern World System” is, according to Immanuel Wallerstein, characterized by the ever expanding commodification (in Marx’s terms: “Verdinglichung”) of all natural resources, human relations, labour, knowledge, etc., forcing a lasting division of labour in and between nations and turning ordinary citizens into alienated tools of a globalized economy (cf. e.g. Wallerstein 2004).
– Neoinstitutionalists like John W. Meyer speak about globalization as well, however in organizational or structural terms, defining the current transition process as an outcome of the rapidly growing influx of international “institutions”, i.e. common ways of seeing and dealing with society, as provided by international organizations (such as the UN, the World Bank, OECD) and the emerging “world polity”, which supersedes national histories and policies alike (cf. e.g. Meyer 2006).
– Theories of “reflexive modernity”, as provided by Anthony Giddens and others, would agree that the change is global, but pinpoint the special implications for the members of society, e.g. the need to develop a reflexive stance towards the structures of society and the embedded risks for society as a whole as well as for the individual (cf. e.g. Beck, Giddens & Lash 1996; Beck 2006).
– Governmentality theories, drawing on Foucault’s famous 1977/78 Collège de France lectures (cf. Foucault 2004), point similarly to the impact the transition has for the state and its institutions as well as for the public and its members, but they see a growing transfer and diffusion of power relations into self-control mechanisms, making citizens internalize the (more or less alienated) mentality necessary to govern them(selves) (cf. e.g. Bröckling, Krasmann & Lemke 2000; Lange & Schimank 2004; Gottweis 2006).
– New Public Management (NPM) supporters would not disagree that there is a diffusion of power and a change of the habits required, but they rather see it as a positive force, making societies and their institutions and members more effective in a globalized world as customer-centred management and control techniques are introduced (cf. e.g. Buschor & Schedler 1994; Pollitt & Bouckaert 2004).
– More recent theories of the welfare state discuss similar issues, but rather as a question of how the modern “intervention state” is forced to dismantle its traditional comprehensive strategies governing resources, the law and the social sphere in a more and more post-national world, and how welfare is re-modelled within this “unravelling” of the state and its institutions (cf. e.g. Esping-Andersen 1996; Scharpf & Schmidt 2000; Leibfried & Zürn 2006).
– Finally, systems theories based on the work of Niklas Luhmann (cf. e.g. Luhmann 1998) argue that the current transition grows from within, from the need of social systems to deal with an ever-growing complexity and contingency that forces a reflexive re-design of the ways and means of social communication (which constitutes the fabric of social systems according to Luhmann; cf. e.g. Akerstrøm Andersen 2003; Rasmussen 2006).
Of course, this is but a small selection of the staggering number of transition theories, which flourish despite the obviously prematurely proclaimed “end of history” (Fukuyama 1992). Moreover, these approaches vary widely. Some of them see the current change as a late consequence of processes started with the invention of the modern state (e.g. Foucault, Wallerstein), whereas others point to more recent changes, for instance to the crisis of the welfare state or the rapid globalization process (e.g. Leibfried & Zürn, Meyer). Some of them look at it primarily as a top-down process by which global developments overpower local traditions (e.g. Meyer, Wallerstein); others stress the role of intermediate levels such as the nation state and its institutions (e.g. Giddens, Leibfried & Zürn, Pollitt & Bouckaert); whereas some see the main issue at the level of the impact of the transition process on those involved (e.g. Beck, Foucault). Some theories stress institutional patterns or social systems as the prime force (e.g. Meyer, Luhmann); others believe in actors and their policies as defining elements (e.g. Pollitt & Bouckaert; Wallerstein); whereas some try to sketch a third perspective, in which actors and structures are seen as inextricably intertwined (e.g. Giddens, Foucault).
One should not complain about this amazing diversity of approaches: it is an expression of the difficulty of finding more common ground at a time when the transition is still unfolding with growing, but uneven, speed in different places. Additionally, many of these authors and their followers use a similar pool of examples in spite of their differences. Even though they do not agree on all the why-questions, they point to much the same kind of evidence: as, for instance, in examples
– of the redistribution of resources, risks and responsibilities within and across societies,
– of the destabilization, or at least the restructuring, of most public institutions and their relations to, or competition with, the private sector,
– of the re-tooling of legitimation and control patterns within the public as well as the private sphere, and its impact,
– of the pressure on systems and actors to take a reflexive stance towards themselves and to take responsibility for their own “well-being”.
Accountability
Looking at the narrower issue of “accountability”, a similar wealth of models and approaches can be observed. Besides the more or less implicit accountability concepts within the general transition theories mentioned above (mostly constructed as ‘being made responsible’ in one way or another, e.g. by Giddens, Foucault or the NPM theories), different models of accountability have emerged, based on the areas in which accountability is observed:
– In economic theories (e.g. Laffont 2003), where the concept originated, accountability is nowadays often constructed as a means by which a principal (the resource-giver), under conditions of limited information, tries to multiply the ends by giving the agent (the resource-taker) incentives and/or forcing him by other means to account for the efficiency, quality and results of his deliveries.
– In government research (e.g. Hood 1991, 1995; Hood, Rothstein & Baldwin 2004), accountability is often seen as a key tool of the New Public Management movement to ensure that units and persons provide services according to the goals set for them or agreed with them. According to this approach, it unfolds as a combination of risk management and blame avoidance, by which those held accountable try to limit the scope of possible failure.
– In research on social policy, the same phenomenon is described as a “quasi-market revolution” (Bartlett, Roberts & Le Grand 1998), i.e. as the intrusion of market-like mechanisms of distribution and control into the public sector: elements of competition, contractualization and finally auditing are introduced into the rendering of services. As Hood (2004) has pointed out, this often takes the form of a “double whammy”, i.e. the co-existence of traditional bureaucratic modes with the administrative tool kit fostered by NPM.
– Similarly, some educational and health-care researchers see the rise of “the age of accountability” as a “revolutionary” move towards “evidence-based” practice, i.e. the growing expectation that professionals can present data to prove that they have performed professionally and efficiently (for education, cf. e.g. Slavin 2007; for health care, cf. e.g. Muir Gray 2001).
– Generalized beyond the realms of public service, this leads to the concept of the emergence of an “audit society” (e.g. Power 1997), the assumption that more and more areas of social life are being made “verifiable”, i.e. subjected to regimes of counting what can be counted, and thus become part of a measurable accountability.
– Interaction and transaction theories construe the personal costs of such a transition, looking at accountability as an interpersonal relation in which we deal with “accounts”, “excuses” and “apologies”, i.e. strategies to explain ourselves in ways which give a sustainable account of our efforts (e.g. Benoit 1995).
– Finally, psychological approaches to accountability (e.g. Sedikides 2002) look at the personal ways of dealing with accountability: how one develops mechanisms to attribute or to reject accountability embedded in the roles and functions we have to perform.
Like the transition theories, accountability theories provide a wide array of possible causes and implications. Some see this process primarily as an effect of a growing “economization” of all parts of society (e.g. social policy and audit theories), whereas others see accountability as inevitably embedded in the social fabric of modern societies (e.g. the psychological explanations).
Some see accountability primarily as a politically initiated restructuring effort (e.g. quasi-market theories), whereas others see accountability simply as a legitimate means to ensure that customers or clients get what they have paid for (economic and NPM theories). Additionally, accountability is viewed on rather different levels. Utilizing a model developed by Melvin Dubnick (2006), one can discern:
– a first-order accountability, i.e. accountability arising in face-to-face relations (as described by the psychological models);
– a second-order accountability, which is characterized by how well one follows the rules and standards set by a resource-giver (as described by government theories);
– a third-order accountability, which can be seen as “managerial accountability”, i.e. the use of accountability by a principal as a means to achieve better service and effectiveness from the agent; and finally
– a fourth-order accountability, which is based on the one held accountable internalizing the norms, values and expectations of the stakeholders, which then put him or her into action (as pointed out e.g. by theories of governmentality or of professionalism).

In practice, all of these can be intertwined. The dividing line, however, is how this interaction is assumed to come into being and which levels rule compared to the others (if they are not seen, as in economic theories, as an embedded rationale of social actors at all levels).
In addition, accountability concepts change over time and differ from place to place. A good indicator of this is that (1) there is no common translation of the concept available in most non-English-speaking countries, and (2) public agencies and policy makers do not employ similar definitions of the elements and limits of accountability when “accounting for accountability” (cf. Dubnick & Justice 2004, Birkeland 2007). Nevertheless, most accountability analysts agree with the above-mentioned transition theories on some core issues, namely:
– that accountability procedures more and more permeate at least all Western societies, and thereby change the ways and means by which societies deal with themselves;
– that the rapid rise of accountability affects all areas of the public sector, from education to health, and their relation to the private sector;
– that this transition enforces a vast redistribution of resources and responsibilities, and thereby a fundamental change in the interplay between resource-providers and -users, often described as a kind of implicit (values, norms) or explicit (standards, contracts) fixation of what is supposed to shape their relations;
– that this process unfolds at different speeds and with different patterns, depending on what kind of social setting it becomes a part of.
For the purpose of this chapter, it is not necessary to decide which of these theories and models carries the most theoretical or empirical evidence. Rather, the common features shared by most of them should be enough as a starting point, even though this puts some of the “why questions” aside temporarily and moves the focus to the question of how the emergence of the age of accountability can be observed in action. In my view, its common core can be described as a slow but steady transition from what I call a “management of placements” (Verortung) towards a “management of expectations” (Vermessung), by which the ways and means of dealing with “ill-defined” problems, such as health, education, security and resurrection, are changed fundamentally (see Hopmann 2000, 2003, 2006, 2007).
Managing transition
Following in the footsteps of Max Weber (cf. Weber 1923, Breuer 1991), the rise of the modern state can be described as the successive unfolding of a management of placements, by which the risks of being born (e.g., how to get an education, who takes care of me when I am ill or old, who gives me security in my everyday life and my dealings, how to be at peace with myself and my neighbours) were taken care of by institutions run by professionals with a specific education in how to deal with such ill-defined problems. These institutions (such as schools, hospitals, prisons, armies, bureaucracies, churches) had a comprehensive mission in that their professionals needed leeway to define which of these problems required what kind of treatment. The institutionalized problem-sharing allowed for taking on more risks and moving beyond the care for immediate needs. Of course, which problems were considered ill- or well-defined changed over time, as did the resources available. But the internal distribution of resources and the evaluation of outcomes were mostly left to the professionals themselves, or to the emerging professional communities, who defined and controlled the education, licensing and practice of their members (cf. Abbott 1988; Hopmann 2003).
However, this comprehensive institutionalization had no fixed boundaries (which would have required the transformation of ill-defined into well-defined problems), thus opening a continuing process of broadening the scope and differentiating the means whenever new aspects of the problems seemed to become urgent. Thus each and every field underwent a massive expansion, multiplying its tasks and treatments. In the past, for example, a couple of years in schools (and for a few, in universities) was all the public education available. Today we spend twenty and more years of our lives in all kinds of professionalized educational settings, from childcare to elder hostels. Where once we met a doctor at the beginning and at the end of life, and maybe a few other times under extraordinary circumstances, today we spend a good deal of every year with medical doctors, nurses and other health care specialists in waiting, treatment or emergency rooms. In short, the management of placements was extremely successful, so successful that Western societies spend most of their public budgets on dealing with these problems. As long as the differentiation of the institutions did not outspend the resources available, differentiation could go on and on, and with ever-growing speed.
This success story seems to be coming to an end in what social policy theory calls “the crisis of the welfare state”, i.e. as resource limits and boundaries for further expansion become more and more visible (wherever they stem from). The legitimacy of the whole placement strategy relied on its ability to cover new ill-defined problems by expansion and sophistication; but there is now mistrust and anxiety about whether this comprehensive help will be sustainable in the future. A very visible impact of this loss of trust is the rise of welfare patriotism on the right and the left in almost all Western societies, articulating, and maybe misusing, much of the unease citizens feel about the future and security of the inherited places and treatments (our welfare is said to be at risk because of immigrants, globalization, outsourcing, etc.). One of the important responses to this is a stepwise transition from a management of placements towards a management of expectations. Instead of guaranteeing comprehensive institutions, there is an attempt to transform ill-defined problems into better-defined expectations as to what can be achieved with a given amount of resources. Standards, benchmarks, indicator-based budgets etc. are examples of how this transition is managed. In that they do not necessarily imply long-term commitments, expectations can remain transient and volatile in the face of changes in the social fabric of the expectation mix. This allows for more target-oriented management and accountability which, however, comes at the price that whatever does not fit into the expectation regime of the time becomes marginalized.
Comprehensive coverage is replaced by a fragmented system of treatments available under certain conditions. What is left over, not least the still ill-defined general issues – what does it mean to be well-educated, healthy, secure, to feel well etc.? – is either still connected to the former placements and/or transformed into temporary programs seemingly better equipped to address the remains immediately (“the patient in focus”, “fighting crime”, “strengthening social education” etc.).
Take the example of schooling: in earlier times public education was provided by “a place called school” (Goodlad 1983), run by professionals called teachers who decided within sketchy limits, based on professional and local traditions, how to teach and what achievement seemed to be sufficient. There was no external public evaluation of the quality of the services provided, except for extraordinary cases of failure, as long as the normal procedures seemed to be professionally acceptable. Accordingly, good instruction was not defined primarily by its measurable outcomes, but rather by the professional judgement of the adequacy of what was done. Expectation management changes the picture dramatically. The core focus shifts to more or less well-defined expectations of what has to be achieved by whom. Good instruction is that which overlaps with the expectations, and it can be provided outside the traditional institutions and professions; in fact, everybody is welcome to provide it as long as the expectations are met. Of course, there are issues which are not (yet) covered by identifiable expectations; however – in the case of conflicting goals – the balance will always tip towards those expectations which are well-defined enough to become part of the implied accountability of the treatment providers. The rest, that which is not addressed but seems to need to be taken into account (e.g. issues such as mobbing/bullying, gender, migration etc.), is embedded into transient intervention programs of limited scope, enough to assure the public that no ill-defined problem is left behind.
It is important to remember that this is expectation management, and is not about outputs or outcomes or “efficiency” as such – as, for instance, NPM theories see it. Only those results which can be “verified” according to the stakes given and which do not meet expectations become problematic, and only those outcomes which meet the predefined criteria are considered a success. In fact, any care-taker of an ill-defined problem will always produce many more effects than any accountability system can observe and measure. Some of them may be simply by-products or minor collateral damage, but some impact may indeed be a major contribution (e.g. inclusion into society, regulating biographies etc.) which is beyond the short-sighted reach of the management of expectations. The line is drawn by the ever-changing fabric of expectations on the one hand and, on the other, by the simple fact that accountability needs something which can be counted, or where it is at least possible to measure the distance between expectations and results (cf. Slavin 2007).
The emergence and spreading of accountability is a signifying hallmark of
the whole transition process. PISA fits nicely into this transition, as we will see
in the following sections. Seen as part of a management of placements, PISA
would be a disaster: it covers only a few aspects of the place, of schooling
and the curriculum, and even these are covered in ways that account for them
at best very indirectly (cf. the preceding chapters on PISA’s limits). However, as a
tool of expectation management, PISA fosters a transformation of what had been
ill-defined issues (e.g. curriculum contents) into seemingly well-defined attainment
goals. It delivers, at the same time, a parameter for holding schooling
accountable – for delivering according to the expectations embedded into its
questionnaires. It contributes to the fragmentation of the field by transforming
the conditions and constraints of this delivery into independent factors (e.g.
social background, gender, migration etc.), whose impact has to be minimized
by way of teaching if expectations are to be fully met. The best representation
of this is given by the “production functions” by which PISA-using economists
calculate the transaction costs of schooling and the ways and means by which
the principals (parents, the state) might maximize the effectiveness of the chosen
agents (i.e. teachers, schools or school systems; cf. e.g. Bishop & Woessmann 2004;
Fuchs & Woessmann 2004; Micklewright & Schnepf 2004, 2006; Sutherland
& Price 2007).
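The education production functions referred to here typically take a linear form along the following lines. This is a stylized sketch of the specification commonly estimated in that literature, not a formula taken from the studies cited above; the symbols and groupings are illustrative:

```latex
% Stylized education production function, as commonly estimated with
% PISA-type data (illustrative notation, not the cited authors' own):
%   A_{is} = achievement (test score) of student i in school system s
%   B_i    = family-background factors (social background, gender, migration, ...)
%   R_s    = school resources and conditions
%   I_s    = institutional features of the school system
\begin{equation*}
  A_{is} \;=\; \beta^{\top} B_{i} \;+\; \gamma^{\top} R_{s}
         \;+\; \delta^{\top} I_{s} \;+\; \varepsilon_{is}
\end{equation*}
```

The point made above is visible in the very form of such an equation: the conditions and constraints of schooling enter as separable, additive factors whose estimated coefficients are then to be minimized or maximized, one by one.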
Constitutional Mindsets
When it comes to the public sphere, the transition from a management of placements
towards a management of expectations meets different constitutional
mindsets, i.e. deeply engrained ways of understanding the relation between
the public and its institutions (cf. for the basics of the following e.g. Haft &
Hopmann 1990; Hopmann & Wulff 1993; Zweigert & Kötz 1997; Lepsius
2006). For example, the American constitution is constructed as a protection
of the individual against the misuse of power by governments and others. It
sees the rights of individuals as a given and the intervention of government
as limited by these rights, with government obliged to protect citizens against any
infringements of their constitutional freedoms. The First Amendment, for instance,
states: “Congress shall make no law respecting an establishment of religion,
or prohibiting the free exercise thereof; or abridging the freedom of speech, or
of the press; or the right of the people peaceably to assemble, and to petition the
Government for a redress of grievances.” Within the Prussian or the Austrian
tradition, which comes from the opposite direction (not least Roman
law), civil rights are something constituted and limited by the law, i.e. it is
the state and its (more or less enlightened) institutions which create and define
the boundaries of social and individual life. Religious freedom, for instance,
may be granted, but the freedom is closely connected to state supervision of its
organizations and institutions, which can set limits for the conditions for the
‘full’ exercise of a religion (which creates problems for non-institutionalized
traditions such as Islam). The Scandinavian constitutional tradition settled (at
least in its beginnings) somewhere between these fundamentally opposed starting
points: it acknowledges the right of the state to impose a constitution, but
originally limited its reach by making it subsidiary to local and regional law
traditions. That has changed gradually, but there is still no unified code of law,
rather a pragmatic approach to regulating fields of interest based on practical
experience and home-grown traditions. The local constituency is still seen as
the core of the social fabric. Citizens are empowered to define a community
life based on their own traditions within a broad constitutional setting. Thus,
while there are state churches in Norway and Denmark, there was plenty of
leeway to establish new local traditions (e.g. as “free churches”); today any
group of a certain size and permanence and with a discernible creed of its own
can establish itself as a “church” with a right to receive state subsidies (cf.
Repstad 2000²).
Of course, the constitutional and legal structures are much more mixed, the
patterns much more blurred, than these different starting points indicate. But
the mindsets on which they are founded seem to be alive and well, and have
a strong impact on how the public and its institutions conceptualize the legal
and structural implications of social change. At least, this is the case when it
comes to how accountability measures are embedded into the public system
as a whole, and especially into the school system. There the main questions
are: Is accountability about protecting the individual citizen (student) against
bad service-rendering? Or is the primary goal to strengthen the ability of local
communities to run their institutions according to their own needs and aspirations?
Or is it about holding the public system accountable for its contribution
to the state’s welfare? Put in the current educational context, one has to ask
accordingly where the main focus of accountability is situated: at “no child
left behind”, “no school left behind”, or “no state left behind”?
2 The Multiple Realities of Accountability
In a historical perspective, PISA and the like are heavily indebted to the legacy
of the assessment movement in the United States and its internationalization
by the International Association for the Evaluation of Educational Achievement
(IEA), beginning in the late 1950s. Seminal works such as Caswell’s
City School Surveys (1929), the Eight Year Study (1942) or Benjamin Bloom’s
groundbreaking work on the “taxonomy of educational objectives” (1956), the
national spreading of the Scholastic Aptitude Test (SAT) from the 1950s onwards
and the establishment of the National Assessment of Educational Progress
(NAEP) paved the way for an understanding in which student achievements
were seen as the prime indicator of the quality of schooling. The rapid rise of
assessment and evaluation as key tools of educational control was fuelled by a
constant flow of critical works on the poor state of the Nation’s schools. From
Conant’s report, The American High School (1959), Rickover’s American Education
– a National Failure (1963) and Coleman’s report on the “Equality of Educational
Opportunity” (1966) to the national report A Nation at Risk (1983)
and the Nation’s Report Card (Lamar & Thomas, 1987), the basic tenor was
the same: the American system is failing many of its students – as demonstrated
by the test scores achieved in local, state-wide and national testing.

From the late 1980s onwards, this seemingly constant failing led to a
more generalized approach to assessment, testing and “reform”, now called
“standards-based reform” (cf. Achieve 1998; Ahearn 2000; Fuhrman 2001).
State after state introduced state standards for the curriculum and – if not yet
done so – state-wide assessment of student achievement, to assure that these
standards were applied. It would be a wild exaggeration to pretend that this
approach was an immediate success. In some cases the introduction of state
standards obviously spelled disaster (cf., e.g., the Kentucky experience;
Whitford & Jones 2000); in others at best modest gains could be reported, but
their validity was, and is, heavily disputed (cf. e.g. Cannell 1987; Dorn 1998;
Saunders 1999; Linn 2000; Haney 2000; Watson & Supovitz 2001; Amrein &
Berliner 2002; Haney 2002; Ladd & Walsh 2002; Swanson & Stevenson 2002;
Darling-Hammond 2003; Braun 2004). However, despite some 50 years of
mixed experience with assessment and rather shallow results (cf. Cook 1997;
Mehrens 1998; McNeil 2000; Herman & Haertel 2005), the next move was
to introduce national legislation aiming at a unified approach to assessment
and accountability: the “No Child Left Behind Act” of 2001, enacted under the
Bush presidency and supported by an almost united Congress (cf. Peterson &
West 2003).
No Child Left Behind (NCLB)
It is worthwhile to give the provisions of the NCLB act a closer look, as they
are paradigmatic for how accountability is constructed within the American
tradition. The comprehensive “statement of purpose” alone unfolds a wide
array of issues:
“The purpose of this title is to ensure that all children have a fair, equal, and significant
opportunity to obtain a high-quality education and reach, at a minimum, proficiency
on challenging State academic achievement standards and state academic assessments.
This purpose can be accomplished by –

(1) ensuring that high-quality academic assessments, accountability systems, teacher
preparation and training, curriculum, and instructional materials are aligned with challenging
State academic standards so that students, teachers, parents, and administrators
can measure progress against common expectations for student academic achievement;

(2) meeting the educational needs of low-achieving children in our Nation’s highest-poverty
schools, limited English proficient children, migratory children, children with
disabilities, Indian children, neglected or delinquent children, and young children in
need of reading assistance;

(3) closing the achievement gap between high- and low-performing children, especially
the achievement gaps between minority and non-minority students, and between
disadvantaged children and their more advantaged peers;

(4) holding schools, local educational agencies, and States accountable for improving
the academic achievement of all students, and identifying and turning around low-performing
schools that have failed to provide a high-quality education to their students,
while providing alternatives to students in such schools to enable the students
to receive a high-quality education;

(5) distributing and targeting resources sufficiently to make a difference to local educational
agencies and schools where needs are greatest;

(6) improving and strengthening accountability, teaching, and learning by using State
assessment systems designed to ensure that students are meeting challenging State
academic achievement and content standards and increasing achievement overall, but
especially for the disadvantaged;

(7) providing greater decision making authority and flexibility to schools and teachers
in exchange for greater responsibility for student performance;

(8) providing children an enriched and accelerated educational program, including
the use of school-wide programs or additional services that increase the amount and
quality of instructional time;

(9) promoting school-wide reform and ensuring the access of children to effective,
scientifically based instructional strategies and challenging academic content;

(10) significantly elevating the quality of instruction by providing staff in participating
schools with substantial opportunities for professional development;

(11) coordinating services under all parts of this title with each other, with other educational
services, and, to the extent feasible, with other agencies providing services to
youth, children, and families; and

(12) affording parents substantial and meaningful opportunities to participate in the
education of their children.” (Section 1001)
But this complexity is immediately reduced to more specific expectations when
it comes to which goals are in focus and how accountability is supposed to
foster these goals. Academic standards are, according to NCLB, the following:
“Standards under this paragraph shall include—

(i) challenging academic content standards in academic subjects that—
(I) specify what children are expected to know and be able to do;
(II) contain coherent and rigorous content; and
(III) encourage the teaching of advanced skills; and

(ii) challenging student academic achievement standards that—
(I) are aligned with the State’s academic content standards;
(II) describe two levels of high achievement (proficient and advanced) that determine
how well children are mastering the material in the State academic content standards; and
(III) describe a third level of achievement (basic) to provide complete information
about the progress of the lower-achieving children toward mastering the proficient
and advanced levels of achievement.” (Section 1111)
Accountability is then based on these standards:
“Each State plan shall demonstrate that the State has developed and is implementing
a single, statewide State accountability system that will be effective in ensuring
that all local educational agencies, public elementary schools, and public secondary
schools make adequate yearly progress as defined under this paragraph. Each State
accountability system shall –

(i) be based on the academic standards and academic assessments . . . and other academic
indicators consistent . . . , and shall take into account the achievement of all
public elementary school and secondary school students;

(ii) be the same accountability system the State uses for all public elementary schools
and secondary schools or all local educational agencies in the State, except that public
elementary schools, secondary schools, and local educational agencies not participating
under this part . . . and

(iii) include sanctions and rewards, such as bonuses and recognition, the State will
use to hold local educational agencies and public elementary schools and secondary
schools accountable for student achievement and for ensuring that they make adequate
yearly progress . . . ”. (ibid.)
Finally, what is meant by “Adequate Yearly Progress” is defined in the subsequent
paragraph:
“(B) ADEQUATE YEARLY PROGRESS.—Each State plan shall demonstrate, based
on academic assessments described in paragraph (3), and in accordance with this
paragraph, what constitutes adequate yearly progress of the State, and of all public
elementary schools, secondary schools, and local educational agencies in the State,
toward enabling all public elementary school and secondary school students to meet
the State’s student academic achievement standards, while working toward the goal of
narrowing the achievement gaps in the State, local educational agencies, and schools.

(C) DEFINITION.—‘Adequate yearly progress’ shall be defined by the State in a
manner that—
(i) applies the same high standards of academic achievement to all public elementary
school and secondary school students in the State;
(ii) is statistically valid and reliable;
(iii) results in continuous and substantial academic improvement for all students;
(iv) measures the progress of public elementary schools, secondary schools and local
educational agencies and the State based primarily on the academic assessments
described in paragraph (3);
(v) includes separate measurable annual objectives for continuous and substantial
improvement for each of the following:
(I) The achievement of all public elementary school and secondary school students.
(II) The achievement of—
(aa) economically disadvantaged students;
(bb) students from major racial and ethnic groups;
(cc) students with disabilities; and
(dd) students with limited English proficiency;
except that disaggregation of data under subclause (II) shall not be required in a case in
which the number of students in a category is insufficient to yield statistically reliable
information or the results would reveal personally identifiable information about an
individual student.” (ibid.)
I have quoted the NCLB act at such length because it provides a concise definition
of what management of expectations in this perspective is about. The
core of accountability is narrowly focused on student achievements measured
by “academic standards”. Other functions of schooling (such as the role school
plays for local communities or in shaping society) are hardly mentioned, and
if at all, they are constructed as minority problems. At the same time academic
achievement is reduced to that which can be reported as “statistically
valid and reliable”, leaving out any educational or social achievements which
cannot be counted as required. Within this frame, responsibility is passed from
the federal top through intermediate levels such as state and district administrations
to teachers and local school leaders. They are expected to improve
the test results by “evidence-based teaching” or even by “data-driven decision
making” (cf. ECS 2002; Marsh & Hamilton 2006), which collapses the complexities
of classroom work or school leadership into single-minded framesets
of statistically significant achievement gains (cf. the comments by e.g. Koretz
2002; Berliner 2005; Hargreaves 2006; Ingersoll 2006). The founding idea of
the “basic principles of curriculum and instruction” (Tyler 1949), which was
embedded in the methodologically much broader approach of, e.g., the Eight
Year Study (Aikin 1942), and which asked for a comprehensive understanding
of schooling as a social and local institution, has – as it seems – dwindled to a
concept of measurable yearly progress.
The response to NCLB within the education community has been almost
evenly divided. While a majority of politicians and economists, and certain
parts of the public, seem to support NCLB wholeheartedly, or at least its core
concept of accountability, many educators are less enthusiastic. The public response
seems to be fragmented along lines of social class, level of education, and
political orientation (cf. Loveless 2006). The professional reactions have much
to do with the question of whether or not one accepts the narrow focus of NCLB as
reasonable. On the one side are those who see NCLB at least as a starting point
for a possible school revolution, finally solving the American school crisis (cf.
e.g. Ladd & Walsh 2002; Peterson & West 2003; Irons & Harris 2006). Some
economists have even begun to calculate the economic spin-off of NCLB if
modest gains can be sustained (Hanushek 2002, 2006; Hanushek & Raymond
2003, 2005). Others are sceptical, pointing to the “impoverished” scope of the
provisions (cf. e.g. Berliner 2005) or the obvious implementation problems of
the current approach (for a variety of such problems cf. e.g. Eberts, Hollenbeck
& Stone 2002; Mintrop 2003; Chubb 2005; Gorard 2006; Martineau 2006;
Apple 2007; Deretchin & Craig 2007; Zimmer et al. 2007). The construct of
“adequate yearly progress” (AYP) in particular has created a tremendous challenge:
the expectation of what progress could be reached went far beyond the
reality of slowness and instability in school change – and adding the minority criteria
worsened the situation. Some critics fear that almost all American schools
will end up on the watch lists of failing schools (cf. Linn, Baker & Betebenner
2002; Linn & Haug 2002; Herman & Haertel 2005; Linn 2005). And many
researchers and practitioners have pointed out that NCLB will fail as long as schools
and their leadership do not have the required “capacities”, i.e. the ability to
identify their local mix of problems and to deal with these professionally (cf.
e.g. Elmore 2006). However, what is almost never challenged in this debate is
the basic assumption of the whole enterprise, namely that it is students’
academic achievement which best reflects the quality of schooling, and that it
is the poor quality of instruction provided by poor teaching which is to blame
for the fact that children are left behind. Holding states, school districts and
schools accountable is reduced to the requirement to do whatever is necessary to
pass this accountability along to classrooms and teachers and, finally, to the
students themselves.
Similar criticism has been voiced about PISA’s role in the US (cf. e.g.
Bracey 2005). However, even though it uses a similar approach to mapping
school achievements, and even though the US has been one of the driving
forces behind it, PISA does not have much of an American audience in the
shadow of NCLB, and no significant impact on the wider public or the educational
science community. For instance, a recent search of the Education
Resources Information Center (ERIC) brought up fewer than 150 articles and
books about PISA, most of them from outside the US – nothing compared to
the general issue of accountability (more than 18,000 hits) or NCLB (about
2,000 hits). It does not even match the impact of TIMSS (with more than 400
hits) and is far below what the German equivalent of ERIC, the FIS Bildung,
reports on PISA from Germany (more than 2,500 hits). Arguably, this reflects
NCLB’s status as a national law, whereas PISA is an international enterprise
with no direct obligations for the states and the schools participating.

In general, PISA seems to have less impact where a national achievement
control of some kind is already in place (as, e.g., in Sweden), which bodes
ill for the future of PISA as more and more countries introduce accountability
systems of their own making. However, this would not explain why earlier
studies like TIMSS have had more visibility in the US. In my view, this shortcoming
of PISA has much to do with the key fallacy of its design, namely that it
presents itself mainly as a cross-national comparison, even though it shares
with NCLB the same starting point, student achievement, which only to a very
small degree (not least in the case of the U.S. with its manifold states) can be attributed
to a specific “national” fabric of schooling (cf. the critique brought
forward in this volume). As a cross-national comparison PISA is not much of
an eye-opener for the US public; it only confirms what was known from earlier
international studies, i.e. that US students don’t do very well in such tests
compared to the students of many other nations. Given that the “winner” of PISA
2000 and 2003 was tiny Finland, and not an international competitor like Japan
(which succeeded in TIMSS, and did well on PISA too), there is, seemingly,
not much for Americans to learn from successful PISA nations.
No School Left Behind
When the accountability wave hit Nordic shores for the first time, the spontaneous
reaction of the political and educational establishments was almost the
opposite of what had happened in the US. While there accountability became
a tool to centralize important elements of educational control, first at the state,
later at the national level, the spontaneous reaction of the Scandinavians was
decentralization. Although government offices and administrative departments
(in Norway and Sweden) were created to satisfy the discourse of the New Public
Management, and many national reports and white papers were commissioned
on public service rendering and administration, none of this – except
maybe for Finland, which was under much more economic strain following
the fall of the Berlin Wall (cf. Simola 2005; Uljens in this volume) –
led to a sustained and comprehensive accountability reform of the US kind (cf.
Bogason 1996; Irjala & Eikås 1996; Prahl & Olsen 1997; Pollitt & Bouckaert
2004²). The concurrent debate about joining or not joining the EU may have
had a share in this decision making, in that EU participation was often framed in
terms of the risk of more centralization (cf. Karlsen 1994). However, tackling
challenges by way of an issue-focused and pragmatic step-by-step approach,
with special regard as to how lower levels of government, such as districts and
municipalities, could deal with any emerging tool kit, was consistent with what
I am calling their constitutional mindset.
Thus, as a background for understanding the Nordic education sector, one
has to know that schools and their teachers played a pivotal role in the nation-building
processes across the region, and in the shaping of national identities
(cf. e.g. Slagstad 1998; Telhaug & Mediås 2003; Korsgaard 2004; Werler 2004;
Telhaug 2005). Moreover, schools are not only seen as places for the young,
but as the cultural core of the local community – which turns the local and
regional distribution of schooling into an always contested issue.
Until the 1990s the main tool used to govern the school curriculum was
curriculum guidelines, developed mainly by the state administrations by way
of committees largely consisting of experienced teachers and subject matter
specialists (cf. Gundem 1992, 1993, 1997; Sivesind, Bachmann & Afsar
2003; Bachmann, Sivesind & Hopmann 2004). Local schools and teachers
had considerable leeway to pick and choose within this curriculum frame in order
to develop locally adapted teaching programs. There was no regular state-run
evaluation of the outcomes of teaching; indeed, outside research the concept
was not even familiar (cf. Hopmann 2003). In a Nordic perspective
schools were seen as places run by highly educated and esteemed teachers,
who knew best how to do their job. Curriculum change was primarily seen
as a matter of dialogue between local experience and national needs; changes
were typically introduced by way of lengthy try-out periods, and with an often
extraordinary involvement of all levels of schooling and administration. Of
course, this was by no means a paradise of peaceful change: each and every
curriculum reform has had its proponents and opponents, and the interplay between
the school sector, research, politics and the public was at times pretty
contentious (cf. Sivesind, forthcoming). However, this played out within the
context of school systems that enjoyed, for most of the time, broad support at
all levels of society.
In this context, it was no surprise that the first reaction to sharp national and
international criticism of schooling was a re-doing of what had been successful.
In the case of Norway, for instance, the first contemporary criticism of the
school system was voiced by an OECD panel (1988) and by a national committee
commissioned by the parliament (NOU 1988:22). Reflecting the emerging
NPM discourse, both concluded that the weaknesses of the national school
system were, significantly, an outcome of an underperforming school governance
structure, which was not able to ensure that the goals of the curriculum guidelines
were being reached. Two conclusions were drawn: On the one side, a sweeping
reform of the whole school curriculum was launched, beginning with a new
general curriculum frame (L93), followed by new comprehensive guidelines
for the upper secondary sector (R94) and the elementary and lower secondary
schools (L97). The frame stressed the double purpose of schooling as caretaker
of the national and local heritage and as knowledge-promoter (cf. STM
29 1994/95). The subsequent curriculum guidelines received a new structure:
they were to focus on the most important requirements and state these expectations
in terms of goals (a kind of management-by-objectives approach) that
could be reached by average schools. On the other side, a re-make of the governance
structure was inaugurated, constructing a double-faced reform combining
a re-focussing of national steering with a stress on the importance of
local autonomy and responsibility for reaching these goals (cf. STM 37 1990-
1991; STM 47 1995/96; KUF 1997). The reform was supported by numerous
in-service and research programs to help districts, municipalities, schools and
teachers identify the major obstacles and prepare for the enactment of the new
guidelines. This new orientation was complemented by initiatives to develop
school-based and peer-guided school improvement (cf. e.g. Granheim, Kogan
& Lundgren 1990; Karlsen 1993; Ålvik 1994; KUF 1994; Haug & Monsen
2002; Nesje & Hopmann 2003). However, this first take-up of NPM-like measures
was to infuriate many educators, politicians and practitioners alike; these
critics felt that the tool-kit of accountability was an “instrumentalist mistake”
that did not fit the national traditions of schooling and that challenged the former
strategy of placement, i.e. the compulsory comprehensive school (cf. e.g. Hovdenak
2000; Koritzinsky 2000; Lindblad, Johanneson & Simola 2003).
When a new liberal-conservative government felt that these first steps of reform were still not enough to ensure adequate school development, it commissioned a new national report to recommend additional measures. What emerged was a peculiar understanding of school development as the development of “quality”, in which “quality” represents a rather vague and all-encompassing understanding of whatever might affect the outcomes of schooling (cf. STM 30 2007; Birkeland 2007; Sivesind forthcoming). The then-secretary of education, Kristin Clemet, expressed the basic rationale of this approach as follows:
Society’s reasons for having schools, and the community tasks imposed on them, are still relevant today: Education is an institution that binds us together. We all share it. It has its roots in the past and is meant to equip us for the future. It transfers knowledge, culture and values from one generation to the next. It promotes social mobility and ensures the creation of values and welfare for all. For the individual, education is to contribute to cultural and moral growth, mastering social skills and learning self-sufficiency. It passes on values and imparts knowledge and tools that allow everyone to make full use of their abilities and realize their talents. It is meant to cultivate and educate, so that individuals can accept personal responsibility for themselves and their fellows. Education must make it possible for pupils to develop so that they can make well-founded decisions and influence their own futures. At the same time, schools must change when society changes. New knowledge and understanding, new surroundings and new challenges influence schools and the way they carry out the tasks they have been given. Schools must also prepare pupils for looking farther afield than the Norwegian frontiers and being part of a larger, international community. We must nourish and further develop the best aspects of Norwegian schools and at the same time make them better equipped for meeting the challenges of the knowledge society. Our vision is to create a better culture for learning. If we are to succeed, we must be more able and willing to learn. Schools themselves must be learning organizations. Only then can they offer attractive jobs and stimulate pupils’ curiosity and motivation for learning. ... We will equip schools to meet a greater diversity amongst pupils and parents/guardians. Schools are already ideals for the rest of society in the way they include everybody. However, in the future we must increasingly appreciate variety and deal with differences. Schools must have as their ambition to exploit and adapt to this diversity in a positive manner.

If schools are to be able to achieve this, it is necessary to change the system by which schools are administered. National authorities must allow greater diversity in the solutions and working methods chosen, so that these can be adapted and customized to the situation of each individual pupil, teacher and school. The national authorities must define the objectives and contribute with good framework conditions, support and guidance. At the same time, we must have confidence in schools and teachers as professionals. We wish to mobilize for greater creativity and commitment by allowing greater freedom to accept responsibility. ...

All plans for developing and improving schools will fail without competent, committed and ambitious teachers and school administrators. They are the school system’s most important assets. It is therefore an important task to strengthen and further develop the teachers’ professional and pedagogical expertise and to motivate for improvements and changes. This Report heralds comprehensive efforts regarding competence development in schools. Education must be developed through a dialogue with those who have their daily work in and for schools. (Introduction to STM 30 2004)
The difference to the accountability rhetoric of NCLB is striking. Where NCLB is focused solely on “academic standards” and on allocating responsibilities to states, districts, teachers and students, the Norwegians talk about allowing for “greater diversity”, about the core role of the teachers and – above all – about the “confidence in schools and teachers as professionals”. The “comprehensive effort” announced is built around three dimensions: structure, process, and outcomes, and both the minister and the committee stress time and again that one cannot expect better results without improving the structures and the processes, and without considerable help from all sides (cf. STM 30 2004). The committee tried to embed the new tools in a way that is less offensive to traditionalists, by integrating the new into the familiar concepts of local monitoring and school autonomy. The proposals included the establishment of a national testing procedure to ensure that basic competencies are achieved, but stressing the “basic” and seeing this first and foremost as a helping hand to assist schools in diagnosing where they may have a need for improvement (cf. NOU 2003:16).
The introduction of the national testing has been very difficult and is a task not yet finished, disputed by researchers and practitioners alike, and still far from anything resembling NCLB (cf. Langfeldt, Elstad & Hopmann 2007). Nobody speaks about “evidence-based teaching” or “data-driven decision making” as prime tools to make school improvement work; the data are seen as a limited indicator, which has to be embedded in a wider understanding of a school’s program and needs. But even this limited aspiration has put a tremendous stress on both the national test developers and local communities and schools as they try to meet the new expectation regime. Because of their poor technical quality, the first wave of national tests was met by sharp criticism, even from the supporters of their use. This forced the government to take a one-year break and completely redo the tool-kit of assessment (cf. Lie 2005; MMI 2005; Telhaug 2005; Langfeldt 2007b). As a result, many schools and municipalities felt more confused than controlled by the new measures. It seems that it will take some time before a more coherent pattern of working with national monitoring emerges and the different levels find sustainable strategies for dealing with the new tool-kit of expectation management (cf. Møller 2003; Riksrevisjonen 2006; Sivesind, Langfeldt & Skedsmo 2006; Elstad 2007; Elstad & Langfeldt 2007; Engeland, Roald & Langfeldt 2007; Isaksen 2007). However, the prevailing attitude towards what might be expected can be illustrated by what a principal of a top-scoring school said at a national leaders conference: “One shouldn’t put too much into these results”; they reflected, he said, only a small part of his school’s program and did not inform his school about the challenges they faced, not least in relation to special education. In any event, one should not expect his school to be on top next year; the next year’s class wasn’t close to the quality of this one. This was not just a fine display of public Norwegian humbleness (“you shouldn’t believe you are someone”). He seemed genuinely concerned that the unexpected success would divert attentiveness from the more pressing problems of his school and mislead parents and local politicians, with the implication of less support in tackling his school’s problems. This is a reaction similar to the one seen in Finland as the Finns discuss the overwhelming PISA success of their country and its more or less unintended side-effects (cf. e.g. Simola 2005; Kivirauma, Klemelä & Rinne 2006; Uljens in this volume).
PISA and its predecessors like TIMSS played an important, but not a key, role in this development in Norway. The move towards a policy change had started long before PISA came into being. The TIMSS and PISA data underlined that there were some substantial shortcomings to address, but PISA was not taken as a sufficient description of the challenges ahead, either in the relevant committees or in the parliament. Nor did PISA lead to a fundamental change in the course of action, with the one exception that the new generation of curriculum guidelines tries to adapt some of PISA’s competence conceptualizations. But this was not by chance. Most Nordic PISA researchers were scrupulous in outlining the reach of their results, pointing to the limited scope of PISA’s material, admonishing against any attempt to simplify the complexities, and warning against any expectation of comprehensive political solutions based on PISA (cf. e.g. Mejding & Roe 2006). The most substantial criticism of PISA’s reach came from within, from researchers with close connections to the project. They have analysed particularly the match and mismatch of PISA constructs with their nation’s traditions of knowledge culture and schooling (cf. Olsen 2005 and in this volume; Sjøberg in this volume; Dolin in this volume). They have discussed whether and how PISA reflects the social and cultural diversity of student achievements (cf. e.g. Allerup 2005, 2006 and in this volume). In addition, the PISA project tried from its beginning to place a main focus on schools as the decisive units of action. This was not easy: PISA does not provide comprehensive, independently cross-checked school data, but relies instead on the descriptions of school climate and classroom practice provided by the students and the teachers themselves, a weak source because of the well-known variance in the ways students describe the same experienced curriculum (cf. Turmo & Lie 2004).
It would exceed the scope of this chapter to address the subtle differences between the Nordic countries, with their different levels and shades of public debate on PISA and national testing (cf. Langfeldt, Elstad & Hopmann 2007). What is important, however, is another fundamental difference to the NCLB approach which the Nordic countries have in common. As a recent survey of teacher education in the Nordic countries shows (Skågen 2006), they share a fundamental trust in the quality of their teachers and the underlying teacher education. It is not that nothing might be improved. Rather, they feel that teachers are well enough educated to do what is necessary if they are given the means and the challenges to do so. The core issue then becomes how to improve the local communities’ “room to move”, their ability to unleash teachers’ energies and to monitor progress in a supportive way (cf. Engeland, Roald & Langfeldt 2007).
No State Left Behind
What a difference PISA can make was nowhere more visible than in Germany and – with a typical delay – in Austria. In Germany, PISA was from the beginning “big news”, filling newspapers, forcing political responses, engaging everyone interested in school affairs (cf. summarizing Weigel 2004). The Austrian reaction was somewhat slower; Austria seemed to have fared better in PISA 2000, at least better than Germany, which counts for quite a lot in Austria (cf. Bozkurt, Brinek & Retzl in this volume). When it turned out that Austria scored worse in PISA 2003, and that the better results of 2000 might have been an artefact of flawed sampling (cf. Neuwirth 2006; Neuwirth, Ponocny & Grossmann 2006), the discussion climate changed dramatically. Now both school systems were seen to be in a deep crisis, not least a crisis of their traditional school structures and their outworn forms of teaching (cf. summarizing Terhart 2004; Bozkurt, Brinek & Retzl in this volume).
The response pattern as such was no surprise. Both countries have, since the school reforms of the late 18th century (cf. Melton-Horn 1988), had recurrent “big school debates” every 20 to 30 years. Every debate is a struggle about the national curriculum, and about every second debate is more specifically focused on the structures of schooling and their implications (as was the case for Prussia/Germany in the early 19th century, the 1850s, the 1890s, the 1920s, and finally in the 1960s and early 1970s; cf. summarizing Hopmann 1988, 2000).
The important role of school-structural issues within this pattern results from the understanding in both countries that, at least since the reforms of the late 18th century, schools are state-owned and state-run systems – at the national level in Austria, at the state level in the Federal Republic of Germany. Local municipalities have some responsibilities for “outer” school matters, such as buildings and equipment, but the curriculum, the hiring and firing of teachers, the licensing of school books, the day-by-day control of all “internal” school matters etc. are seen as being within the realm of the state’s school administration. Moreover, both countries have stratified school systems, in which secondary schools are divided into different strands for “high” and “low” achievers, providing, e.g., different schools for “academic achievers” (Gymnasium), for more “practically oriented” youth (Realschule, Hauptschule, Berufsschule), and for children with “special needs” (Sonderschule). The decision about which kind of school a student should attend is normally made following 4th or 6th grade (in the 19th century the division was from the first grade). Both countries have a system of vocational education, combining school with on-the-job training, sometimes beginning at the lower secondary level, but more usually covering those who do not attend a Gymnasium or the like for upper-secondary education. However, in both countries the rate of attainment of the highest academic qualification (Abitur, Matura), and thereby of access to universities, is considered the key indicator of social equity (cf. Becker & Lauterbach 2004).
Since it is the state, and the state alone, that regulates schools, school structures can be understood as institutionalized expressions of the state’s view of social class and stratification. The proverbial example of inequality in the school debates of the 1960s was the Catholic working-class girl from a rural area attending a Hauptschule; she has now been replaced by the Muslim daughter of an immigrant family living in a poor inner-city district who also attends a Hauptschule or a Sonderschule (cf. summarizing Berger & Kahlert 2005). Within this frame, school-structure debates tend to become debates on social division; the stratified school system is regularly defended by conservatives and economists, whereas the move towards a comprehensive school system is an affair of the heart for social democrats and the labour movement, without regard to whether one system or the other has a better record in terms of social equity. In both countries the core argument is the assumed, yet unproven, effect stratification might have on human capital – does stratification lead to a structural underperformance of lower-class students, or does a comprehensive school limit the space and speed of development of high achievers, and vice versa (cf. Bozkurt, Brinek & Retzl in this volume).
How PISA fits into this frame is easily understood if one takes its official purpose as stated by its owner, the OECD:

Quality education is the most valuable asset for present and future generations. Achieving it requires a strong commitment from everyone, including governments, teachers, parents and students themselves. The OECD is contributing to this goal through PISA, which monitors results in education within an agreed framework, allowing for valid international comparisons. By showing that some countries succeed in providing both high quality and equitable learning outcomes, PISA sets ambitious goals for others. (Angel Gurría, OECD Secretary-General, as introduction to PISA 2006, 3)
According to the same source, PISA’s “key features” have so far been:

– Its policy orientation, with design and reporting methods determined by the need of governments to draw policy lessons.
– Its innovative “literacy” concept, which is concerned with the capacity of students to apply knowledge and skills in key subject areas and to analyse, reason and communicate effectively as they pose, solve and interpret problems in a variety of situations.
– Its relevance to lifelong learning, which does not limit PISA to assessing students’ curricular and cross-curricular competencies but also asks them to report on their own motivation to learn, their beliefs about themselves and their learning strategies.
– Its regularity, which will enable countries to monitor their progress in meeting key learning objectives.
– Its contextualisation within the system of OECD education indicators, which examine the quality of learning outcomes, the policy levers and contextual factors that shape these outcomes, and the broader private and social returns to investments in education.
– Its breadth of geographical coverage and collaborative nature, with more than 60 countries (covering roughly nine-tenths of the world economy) having participated in PISA assessments to date, including all 30 OECD countries. (ibid., 7)
The “policy orientation, with design and reporting methods determined (sic!) by the need of governments to draw policy lessons” has led to a wealth of national and OECD reports, using PISA data as a means to assess the quality of school structures and schooling, issues of social inequality, gender, migration etc., and, not least, comparisons, again and again, of countries and their PISA performance in relation to other OECD indicators (most of this available online at http://www.pisa.oecd.org).
Of course this approach rests on a number of implicit assumptions, which are anything but self-evident:

– The assumption that what PISA measures is somehow important knowledge for the future: There is no research available that proves this assertion beyond the truism that knowing something is always good and knowing more is always better. There is not even research showing that PISA covers enough to be representative of the school subjects involved or of the general school knowledge base. PISA items are based on the practical reasoning of its researchers and on pre-tests of what works in all or most settings – and not on systematic research on current or future knowledge structures and needs (cf. Dohn 2007; Bodin, Jahnke, Meyerhöfer, Sjøberg in this volume).
– The assumption that the economic future is dependent on the knowledge base monitored by PISA: The little research on this theme – which assumes that there is a direct relation between test scores and future economic development – relies on strong and unproven arguments that have no basis when one compares, for instance, success in PISA’s predecessors with later economic development (cf. Fertig 2004; Heynemann 2006).
– The assumption that PISA measures what is learned in schools: This is not PISA’s own starting point, which is explicitly not to use national curricula as a point of reference (as e.g. TIMSS does; cf. Sjøberg in this volume). The decision to focus on a small number of issues and topics, which can be expected to be present in all involved countries, leaves open the question of how these items represent the school curriculum as a whole (cf. Benner 2002; Fuchs 2003; Ladenthin 2004; Hopmann 2001, 2006; Dolin, Sjøberg, Meyerhöfer in this volume) beyond the fact that those who are successful in school do, on average, better on PISA – which is hardly a surprise inasmuch as PISA requires cognitive and, not least, language skills that are helpful in schools as well. Some even argue that PISA first and foremost monitors whatever intelligence testing monitors (cf. Rindermann 2006), which could lead to the somewhat irritating implication that, according to PISA, e.g. Finns are more “intelligent” than Germans or Austrians.
– The assumption that PISA measures the competitiveness of schooling: One has to keep in mind that at best 5–15 percent of the variance in the PISA results can be attributed to lasting qualities provided by the schools studied (cf. already Watermann et al. 2003; for the principal problems of re-constructing schooling and teaching based on such data see Rauin 2004). Most of the variance can be attributed to factors from the outside that are mostly beyond the reach of schooling (such as social background; cf. Baumert, Stanat & Watermann 2006).
– The assumption that PISA thus measures and compares the quality of national school provision, not least of school structures, teacher quality, the curriculum, etc.: Although school effects as such have a very limited role in the results of PISA, one has to add that a) PISA has considerable sampling and cultural-match problems, which reduce its trustworthiness as an indicator for national systems, at least for systems with the small differences seen between Western countries (see the contributions to this volume), and b) since Coleman’s seminal study (1967) it has been well known that schools have only a very limited impact on the social distribution of educational success when compared to factors such as the social fabric of the surrounding society (cf. Shavit & Blossfeld 1993; Becker & Lauterbach 2004). Moreover, by its very design PISA is forced to drop most of what might indeed indicate specifics of national systems (cf. Dolin, Langfeldt in this volume).

In short: PISA relies on “strong assumptions” (Fertig 2004) based on weak data (cf. e.g. Allerup, Langfeldt, Wuttke in this volume) that appeal to conventional wisdom (“education does matter, doesn’t it?”; “school structures make a difference, don’t they?”), but with almost no empirical or historical research supporting its implied causalities.
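What a claim such as “5–15 percent of the variance can be attributed to schools” means can be made concrete with a small simulation. The sketch below is purely illustrative, not PISA’s actual methodology: all numbers (school counts, standard deviations) are invented, and the between-school variance share is estimated with a simple one-way random-effects ANOVA, the textbook form of the intraclass correlation used in multilevel achievement studies.

```python
import random

random.seed(1)

# Toy model of PISA-like scores (all numbers invented for illustration):
# each student's score = overall mean + a school effect + an individual
# effect. The share of total variance lying between schools, an
# intraclass correlation, is the kind of quantity "5-15 percent" refers to.
N_SCHOOLS, N_STUDENTS = 200, 30
SD_SCHOOL, SD_STUDENT = 25.0, 80.0   # assumed: school effects are small

scores = []
for _ in range(N_SCHOOLS):
    school_effect = random.gauss(0, SD_SCHOOL)
    scores.append([500 + school_effect + random.gauss(0, SD_STUDENT)
                   for _ in range(N_STUDENTS)])

# One-way random-effects ANOVA estimate of the between-school share
grand = sum(sum(c) for c in scores) / (N_SCHOOLS * N_STUDENTS)
means = [sum(c) / N_STUDENTS for c in scores]
ms_within = sum((x - m) ** 2 for c, m in zip(scores, means) for x in c) \
            / (N_SCHOOLS * (N_STUDENTS - 1))
ms_between = N_STUDENTS * sum((m - grand) ** 2 for m in means) / (N_SCHOOLS - 1)
var_between = max((ms_between - ms_within) / N_STUDENTS, 0.0)
icc = var_between / (var_between + ms_within)
print(f"between-school share of variance: {icc:.1%}")
```

With these assumed effect sizes the estimate lands in the single digits: even sizeable-looking differences between school means can amount to a small slice of total score variance, which is why league-table readings of such data overstate what schools themselves contribute.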
But this has not kept either PISA researchers or the public from using PISA as if such causal relations were given. Otherwise one could not explain the two main impacts PISA has had on school administration and policy making in Austria and Germany, both directly referring to this frame of reference, albeit using it somewhat differently. Thus PISA’s approach to competency measurement has been a sweeping success in both countries, in part fuelled by the “national expertise” (Klieme et al. 2003) produced by researchers close to the PISA efforts, who argue that a national monitoring of student achievement based on an approach similar to PISA is both necessary and feasible (cf. Jahnke in this volume). Based on this, the German education ministers of the states have established a process towards such national standards and lent a helping hand to the founding of a National Institute for Progress in Education (Institut zur Qualitätsentwicklung im Bildungswesen) with functions similar to those of the US National Assessment of Educational Progress (NAEP). Both of these accomplishments are significant in Germany, given that curriculum matters are normally considered to be state, not federal, responsibilities, and that there was no previous tradition of state-run outcome controls (except for some standardizations of final exams in a few states). Prior to this point there had been more than 4000 different state curriculum guidelines that were the main road to defining expected results, without any regular control of whether they were achieved (as in the Nordic countries; cf. Hopmann 2003). Similarly, the Austrian government has initiated a not yet finished project to develop and implement national competency standards as an alternative to the former guidelines and to combine this with regular testing (cf. the material collected at the official site http://www.gemeinsamlernen.at). All this in spite of the fact that the impact of national or state assessment on what PISA and similar projects measure is at best weak in either direction (cf. Amrein & Berliner 2002; Bishop & Woessmann 2004; Fuchs & Woessmann 2004), and that the overall importance of meeting the goals which PISA and the corresponding state standards happen to measure is at best a good guess without a solid research foundation.
Whereas this approach seems to have support across the whole political spectrum, the second impact has proven rather divisive: Based on PISA and similar studies, researchers and politicians have – as mentioned above – reopened the debate on school structures. Interestingly, both sides – proponents and opponents of a comprehensive system, proponents and opponents of an early school start, proponents and opponents of an integrated teacher education, etc. – feel encouraged by the very same PISA data that the other faction uses as well. The most prominent example of this is a national report on schooling, written by a number of “leading experts” (i.e. researchers utilizing PISA and the like) on behalf of the Confederation of Bavarian Industry (VBW), which – focused on equity issues – argues that there is ample research evidence for re-organizing the whole school system as a two-tier organization (cf. Aktionsrat Bildung 2007). On the other hand, the leader of the PISA effort at the OECD, Andreas Schleicher, is totally convinced that PISA proves the advantages of a comprehensive system. The leader of the national PISA effort in Austria, Günther Haider, managed first to support a continuation of the current stratified structures, then a transition towards a comprehensive system, in both cases claiming PISA data as evidence for his recommendations (cf. Bozkurt, Brinek & Retzl in this volume).
Public criticism of the empirical evidence provided by <strong>PISA</strong> has been weak<br />
in both countries. The devastating results were all <strong>to</strong>o much in line with the<br />
political needs <strong>to</strong> find good causes at the end of the economically painful reunification<br />
process in Germany and at a time when both countries felt themselves<br />
EPILOGUE: NO CHILD, NO SCHOOL, NO STATE LEFT BEHIND 393<br />
economically underperforming compared to other European countries,<br />
not least those seemingly more successful in <strong>PISA</strong>, such as Finland. Even<br />
the scientific discourse took the economic reasoning backing <strong>PISA</strong> for granted,<br />
arguing that <strong>PISA</strong> reduced “Bildung” to economic necessities and the needs<br />
of globalization, thereby tacitly accepting the unfounded premise of <strong>PISA</strong>’s<br />
ability to monitor and guide the school curriculum (cf. e.g. Huisken 2005;<br />
Lohmann 2007). Except for the obvious case of the Austrian sampling problems<br />
(Neuwirth 2006), the few methodological objections that were voiced<br />
were either ignored or ridiculed by the <strong>PISA</strong> community and its supporters,<br />
and have had – at least up to now – no substantial impact on the public standing<br />
of <strong>PISA</strong> in either Austria or Germany (cf. the introduction to this volume).<br />
The “no state left behind approach” of the OECD and its German and Austrian<br />
consorts leads <strong>to</strong> the somewhat paradoxical effect that <strong>PISA</strong> has the most<br />
impact by way of the by-products of the <strong>PISA</strong> research, which in design and<br />
methodology are most probably the weakest links of the whole enterprise. But<br />
even this is not without precedent. Re-reading Georg Picht’s volume on the<br />
“education catastrophe” (1964), which started the last “big school debate” in<br />
the mid-1960s, one is amazed at how little of his evidence could stand the test<br />
of time and how much of it was simply speculative. However, the book was the<br />
single most important lever for the ensuing debate on how the school system<br />
should adapt <strong>to</strong> the social changes at the end of the post-war reconstruction period,<br />
a process which ended with the biggest expansion of the educational system<br />
and of public expenditure in the his<strong>to</strong>ry of schooling. This process also included<br />
the temporary transfer of a <strong>to</strong>ol-kit, scientific curriculum development,<br />
from the US <strong>–</strong> in spite of its then self-pronounced “moribund” state on its<br />
home turf (cf. Hopmann 1988). At least in Austria and Germany, <strong>PISA</strong> seems<br />
to have achieved something similar: helping politicians, educators and the public<br />
to bring the educational field in touch with the transition processes going on in<br />
the whole public sector by providing them with a sense of what “manageable<br />
expectations” might be, and with tools to monitor their success – or failure.<br />
3 Comparative Accountability<br />
Thus the overall picture of the accountability approaches I have reviewed<br />
shows three very different basic philosophies of what this transition is about<br />
(cf. fig. 1):
394 STEFAN T. HOPMANN<br />
(Columns: the “no child”, “no school” and “no state” strategies, respectively)<br />
Core Data: Student achievement / Aggregated school achievement data / Aggregated national student achievement<br />
Main Tools: Standards controlled by testing / Testing with regard to opportunities to learn (OTL) / Competencies measured by random testing<br />
Stakes: High stakes / Low stakes / No stakes (<strong>PISA</strong>), low or high stakes (standards)<br />
Driving Force: Blame / Community spirit / Competition<br />
Main Levels of Attribution: Classroom & teaching / Local school management / School systems and society at large<br />
Best Practice: Data-driven / Customized / Research-based<br />
Accountability: Bottom up / Bridging the gap / Top down<br />
The Role of <strong>PISA</strong>: Almost none / Supporting act / Main act<br />
Fig. 1: Basics of the No Child, No School, No State Left Behind Strategies<br />
Of course, this table only pinpoints the main assumptions and entry points<br />
of each approach. Given the nature of public schooling, each approach carries elements<br />
of the others. Additional analysis of more countries would show that<br />
there are mixed patterns, combining elements of different modes of accountability<br />
(e.g. the case of Switzerland would probably reveal a mixture of “no<br />
school” and “no can<strong>to</strong>n” strategies; Canada a mixture of “no child” and “no<br />
school” etc.; cf. e.g. BFS 2005; Rhyn 2007; Stack 2006; Klatt, Murphy &<br />
Irvine 2003; Ma & Crocker 2007). And there is a fourth pattern, where “no accountability<br />
has yet arrived”, and where the public sec<strong>to</strong>r is in the early stages<br />
of a transition towards accountability policies, and therefore not yet open to<br />
the influx of international accountability measures (as, for instance, in Italy<br />
where <strong>PISA</strong> has been no real issue, and even the government has treated it as<br />
almost non-existent; cf. Nardi 2004).
Emerging issues<br />
Each strategy has its own strengths and carries its own risks, depending on the<br />
larger concept of expectation management it is a part of:<br />
The “no child” approach has the advantage of a clear focus: Everybody<br />
knows what counts and how it is measured. But the price for this is what David<br />
Berliner (2005) calls “an impoverished view of educational reform”, a system<br />
of accountability checks which places “statistical significance” above all other<br />
ways of looking at individual and institutional achievements. The very narrow<br />
conceptualization lends itself to reduced remedial strategies: “evidence-based”<br />
or “best-practice” models, or “data-driven decision making”, only make sense<br />
if one assumes that assessment data are all that counts and that local conditions<br />
do not play a significant role, or can at least be overcome if one does<br />
as the successful do. But while NCLB’s “blueprint” (US Department of Education<br />
2007) honours a rather naïve empiricism, much of the NCLB-induced<br />
research provides more complex insights into the complexities of school life,<br />
using a whole range of mixed methods and avoiding the fallacies of an<br />
engineering approach to social transition (cf. e.g. Elmore 2006; O’Day 2007).<br />
However, it seems unlikely that the insights produced by this research will<br />
have any lasting impact on the NCLB movement: the prime implication of this<br />
research – the importance of capacity-building in local schooling, with special<br />
regard to the unique mix of challenges at hand – is contrary to any belief that<br />
imposing the same high stakes on everyone, and distributing blame and shame in large<br />
portions, is a reasonable approach to making accountability work.<br />
Moreover, it is this one-sided focus which allows the transformation of<br />
the apparent problems with equity and equality, with minorities, special needs,<br />
gender etc., into individualized liabilities whose impact on achievement has to<br />
be minimized, if not eradicated. If high stakes as the sole approach fail<br />
(and research suggests they will; cf. Linn 2007), there are a number of technical<br />
options to ease the burden, such as lowering the ceilings, adding opt-out<br />
clauses for the worst students and/or schools, inflating the number of stakes<br />
such that everybody can succeed in something, and not least creating more<br />
school choice and vouchers, which leaves the responsibility of choosing the<br />
right school with the parents. All of these options are under consideration in<br />
the current debate on the renewal of NCLB (cf. US Department of Education<br />
2007). Choice is, of course, the core of a strategy of passing the buck on to<br />
the next in line in the accountability chain, i.e. those who seemingly bring<br />
the liabilities to school: the parents, the minorities, the poor, those with special<br />
needs, etc. We can expect more accountability tools, e.g. contractual attainment<br />
goals and/or welfare subsidies or other sanctions connected to them,<br />
to make these families directly responsible for the outcomes. The achievement<br />
gap will not disappear with these moves; rather, what was once considered<br />
a failure of the school system to cope with the diversity of society (cf.<br />
e.g. Coleman et al. 1966) will be turned step by step into a problem of individual<br />
customers failing to meet rising expectations.<br />
The “no school left behind” approach of Norway, and most of the Nordic<br />
countries, is far removed from such reductionism, but also pays a heavy price. The<br />
double task of embedding the new strategies in the traditional tool-kit, and of<br />
doing so in close co-operation with the local level, obviously leaves many wondering<br />
whether there is a real change process going on and, if there is, what it consists<br />
of. No real sense of the new obligations has emerged in schools and municipalities,<br />
and it seems as if they respond to the new accountability expectations<br />
with classic Nordic “muddling-through”: planning, coordinating and reporting<br />
on the local level time and again, with no real stakes and inconclusive outcomes<br />
(cf. Engeland, Roald & Langfeldt 2007; Elstad 2007). That the first national<br />
tests were a technical disaster (cf. Lie et al. 2005) and prompted a break in the<br />
whole process did not really help, nor did the new curriculum guidelines of<br />
2006 which, in spite of much rhetoric, do not require more adjustment<br />
than earlier guidelines, i.e. most teaching does not change significantly as a<br />
result of their adoption, and the prime concern of teachers remains local<br />
adaptation, not national outcomes (cf. Bachmann & Sivesind 2007).<br />
But this will not be the end of the story! The key question is what will<br />
happen if the current effort, which even from a Norwegian perspective is quite expensive,<br />
fails to achieve significant and sustainable gains beyond those which<br />
come as the system gets used to the new tools. Social and economic inequality<br />
are rising rapidly, and knowing how this can affect both schools and<br />
students, growing achievement disparities and gaps will be no surprise. Two<br />
response patterns seem likely: The first would move even more rapidly towards<br />
radical accountability strategies, i.e. raising stakes, adding more<br />
national testing, and most importantly adding sanctions for those who continue<br />
to fail. This would put tremendous pressure on the comprehensive school<br />
system: homogeneous schools without too many non-achievers will succeed<br />
and tell the public that the time of an all-encompassing school has come to an<br />
end. The other strategy would introduce more choice and private options into<br />
the system (as is already the case in Sweden and Denmark), thus allowing<br />
schools to remove their most challenging parents and the most challenged children,<br />
leaving the public school as the main route for ‘average’ people (cf. Kvale<br />
2007). Both strategies would imply a definite end of the “one school for all”<br />
notion. This idea is deeply engrained in the social fabric of Nordic societies,<br />
and the move will be no easy task (and will lead in Norway to a continuing<br />
back and forth over who is allowed to opt out, and why). But there is no other<br />
way to reconcile the former management of places with the new needs of accountability,<br />
even if it takes time before the “muddling through” is forced to<br />
accept this consequence as inevitable.<br />
None of this applies to the two leading examples of the “no state left behind”<br />
strategy, Austria and Germany. Both have fragmented school systems in<br />
which comprehensive schools play no significant role. Moreover, for the moment,<br />
both have an easy way of knowing whether their school improvement works:<br />
All they seemingly have to do is wait for the next <strong>PISA</strong> wave; it will then<br />
be clear who has lost or won the interstate competition (provided, of course,<br />
that one believes <strong>PISA</strong> is indeed able to tell something about that). The main<br />
risk lies in the deeply engrained traditions of how to deal with “big school debates”,<br />
because these traditions transform the achievement problem into one<br />
of school structure and other institutional change. The issues at stake are more<br />
or less the same in both countries (cf. Aktionsrat Bildung 2007; Retzl, Bozkurt<br />
& Brinek in this volume): Comprehensive schools or different tracks? Compulsory<br />
pre-school education, and if so, for whom? Unified teacher education<br />
or different routes for different types of schooling? Special schools for special<br />
needs or inclusive education? Keeping the dual structure of vocational education<br />
(school plus on-the-job training) or integrating vocational education into<br />
some general kind of upper-secondary schooling?<br />
If the attempts to force a comprehensive re-structuring fail (and there is no<br />
empirical or political evidence indicating that this could turn out otherwise),<br />
then at least two outcomes are likely: The first would be a move<br />
towards a more NCLB-like approach to accountability, i.e. adding more stakes<br />
and tests (e.g. unified entrance and exit exams), including all levels and becoming<br />
more all-encompassing than is possible within <strong>PISA</strong>, i.e. requiring<br />
more data on single schools, school districts and the different federal states,<br />
eventually extending the screening beyond student achievement towards indicators<br />
of teaching patterns, teaching materials, teacher qualifications, student-career<br />
data and the like. But, at least in Germany, such an approach faces the<br />
obstacle that schooling is constitutionally a matter for the states, not the Federal<br />
government – which means that there are no means of enforcing alignment beyond<br />
what all states agree upon. In Austria the Federal government has<br />
the necessary constitutional backing for federal involvement; but since the<br />
country has, ever since its reconstruction after World War II, depended on compromise<br />
between the two largest political camps (the Social Democrats and the Conservatives),<br />
each controlling about half of the states, it is unlikely that any lasting<br />
agreement on a comprehensive accountability approach is feasible. This<br />
brings the second option to the forefront, namely to dissolve, or embed, the national<br />
accountability measures in an internal competition between the different<br />
federal states. Those confident of their success would prove the advantages<br />
of their chosen solutions with their own data; those not meeting the standards would<br />
have to respond with their own explanations of why a mismatch was unavoidable.<br />
In the end, there would be an inextricable hodgepodge of testing, controlling,<br />
monitoring etc., with each state having its own tool-kit of accountability measures.<br />
But at least two problems would be left behind by either option: On the one<br />
hand, none of this addresses the problems of sustainable inner-school development<br />
and capacity-building over the long run. A race to match changing expectations<br />
by restructuring the system will not leave energy and resources to address<br />
the tricky problems of “no teaching left behind” (cf. Terhart 2005). On the other,<br />
both approaches would lead to further marginalization of the special-needs<br />
students who are already invisible, or turned into liabilities, in the <strong>PISA</strong> approach<br />
(cf. Hörmann 2007 and in this volume). They promise to become even more<br />
marginalized inasmuch as they do not help to win a competition that has individualized<br />
academic achievement as its basic rationale (cf. Hopmann 2007).<br />
Each and every new round of testing would only reaffirm their “lower” abilities<br />
and the “superiority” of the schools dealing with high achievers, thus petrifying<br />
the hierarchy of schools. Since this hierarchy has always been experienced<br />
as an expression of the social fabric of society, and of the state’s position towards<br />
it, one can only imagine how quickly this will lead to inner tensions in a school<br />
system surrounded by a society with rapidly growing social inequalities.<br />
<strong>PISA</strong> in Transition<br />
Most of the emerging issues stem from inner tensions between the former<br />
management of placement and the new expectation regime. The data-driven<br />
NCLB disintegrates the former “place called school” (Goodlad) in<strong>to</strong> concurrent,<br />
but not intertwined, individual challenges of meeting the standards. The
“no school left behind” approach has difficulties in embracing a coherent set<br />
of expectations, as it dissolves the idea of accountability in its old routines<br />
of institutionalized muddling-through. The “no state left behind” strategy is at<br />
risk of answering the new expectations by functionalizing them for a renewal<br />
of the conventional restructuring game, without really changing what is going<br />
on inside schools and classrooms.<br />
The success of <strong>PISA</strong> within this transition is made possible by a certain<br />
“fuzziness” of design and self-presentation. It treats the links between student,<br />
school and national achievements as evident, thus allowing for a black-box<br />
approach to schooling itself (economists call this a production function) in<br />
which the coincidence of results and factors is transformed into correlations<br />
and causalities, without proving how exactly this linearity comes into being.<br />
Within a management of placements, <strong>PISA</strong> and the national testing inspired by<br />
it would be dysfunctional, in that they cover only a few aspects of schooling, and<br />
these in a way which does not allow for research-based decision-making concerning<br />
the whole school, or even teaching and learning under given conditions.<br />
As a tool of expectation management, however, <strong>PISA</strong> allows each setting to<br />
address problems as they are framed within the respective constitutional mindsets,<br />
using <strong>PISA</strong> as “evidence”. Thus <strong>PISA</strong> refreshes the never-ending dispute<br />
in Germany and Austria on school structures and their relation to social class<br />
and diversity, reinvigorates the co-dependency of national government and local<br />
community in the Nordic countries, and reaffirms the starting point of the<br />
NCLB discourse on failing schools and teachers as the main culprits for the<br />
uneven distribution of knowledge and cultural capital in Western societies.<br />
The irony of this story is, of course, that <strong>PISA</strong> achieves this not in spite of,<br />
but because of, its shortcomings. Although it uses advanced statistical tools,<br />
<strong>PISA</strong> stays methodologically within the frame of a pre-Popper positivism,<br />
which takes item responses for the realities addressed by them. There is no<br />
theory of schooling or of the curriculum which would allow for a non-affirmative<br />
stance towards the policy-driven expectations which, according to the OECD,<br />
“determine” “the design and reporting methods” of <strong>PISA</strong> (OECD 2007). There<br />
is no systematically embedded concept of how as yet unheard voices and non-standardized<br />
needs could be recognized as equally valid expressions of what<br />
schooling is about. Accordingly, none of the newer developments in educational<br />
research – addressing the situatedness, multi-perspectivity, non-linearity<br />
or contingency of social action – plays a significant role in <strong>PISA</strong>’s design (cf.<br />
Hopmann 2007). Even so, there are many quite advanced options to use<br />
<strong>PISA</strong> data within mixed methods or other more comprehensive research designs,<br />
which could address some of <strong>PISA</strong>’s inherent weaknesses as well (cf.<br />
Olsen in this volume). But to incorporate such developments on a large scale<br />
would be close to impossible: They do not lend themselves to such generalized<br />
bottom lines as league tables, and including them in a large-scale study of the size<br />
of <strong>PISA</strong> would require resources far beyond those available even to <strong>PISA</strong>.<br />
As an entry to the commencing accountability transition, <strong>PISA</strong> has done<br />
a significant job in facilitating and illustrating the difficulties any approach to<br />
these issues will have to face. But it might be that the <strong>PISA</strong>-frenzy has already<br />
reached its peak, or is very close to doing so (the next wave of results,<br />
coming in December 2007, will show whether this is the case). If the <strong>PISA</strong>-frenzy<br />
is drawing to a close, however, it will not be because of the technical mishaps and<br />
fallacies discussed in this volume. Such details go unnoticed by politicians<br />
and the public. If <strong>PISA</strong> loses its unique position, it will happen because of<br />
its success, because of the multiplication of <strong>PISA</strong>-like tools in national and state<br />
accountability programmes. If the NCLB experience holds true, <strong>PISA</strong> will be<br />
reduced to being just one voice in the polyphonic concert of assessment results,<br />
and – having no sanctions other than statistical blame – will be displaced by<br />
accountability measures that carry more immediate risks for those involved.<br />
The important question for the future of educational research is how much<br />
of <strong>PISA</strong> will then be left behind – to what extent its methodological reductionism<br />
will prevail as the state of the art of comparative research. But the more<br />
pressing question centers on the long-term effect its conceptualization of student<br />
achievement will have on the public understanding of what schooling is<br />
about. What will happen to the school subjects left out, to the special needs<br />
that are marginalized, to school tasks which have nothing to do with higher-order<br />
academic achievement, to the school functions which move beyond a<br />
one-dimensional kind of knowledge distribution? Perhaps there are new, as<br />
yet unseen possibilities hidden in the multiple realities of the transition from<br />
the management of placement towards the management of expectations, even<br />
some which make research, policy and schooling accountable for not leaving<br />
their social conscience behind on their march into the emerging age of accountability.<br />
References<br />
Abbott, A.: The System of Professions. Chicago (University of Chicago Press)<br />
1988.<br />
Achieve: Aiming higher: 1998 annual report. Cambridge (Achieve, Inc.) 1998.<br />
Ahearn, E.M.: Educational Accountability: A Synthesis of Literature and Review<br />
of a Balanced Model of Accountability. Washing<strong>to</strong>n D.C. (Department<br />
of Education) 2000.<br />
Aikin, W.M. et al.: The Eight Year Study. Vol. I <strong>–</strong> V. New York/London (Harper<br />
& Brothers) 1942.<br />
Akerstrøm Andersen, N.: Borgerens kontraktliggørelse. Kopenhagen (Reitzel)<br />
2003.<br />
Aktionsrat Bildung: Bildungsgerechtigkeit. Jahresgutachten 2007. Wiesbaden<br />
(VS) 2007.<br />
Allerup, P.: <strong>PISA</strong> præstationer – målinger med skæve målestokke? In: Dansk<br />
Pædagogisk Tidsskrift 2005-1, 68-81.<br />
Allerup, P.: <strong>PISA</strong> 2000’s læseskala <strong>–</strong> vurdering af psykometriske egenskaber<br />
for elever med dansk og ikke-dansk sproglig baggrund. (Rockwool<br />
Fondens Forskningsenhed og Syddansk Universitetsforlag) Odense 2006.<br />
Allerup, P.: Identification of Group Differences Using <strong>PISA</strong> Scales <strong>–</strong> Considering<br />
Effects of Inhomogeneous Items. In this volume.<br />
Ålvik, T. (ed.): Skolebasert vurdering <strong>–</strong> en artikkelsamling. Oslo (Ad notam)<br />
1994.<br />
Amrein, A.L. & Berliner, D.C.: High-stakes testing, uncertainty, and student<br />
learning. In: Education Policy Analysis Archives 10-2002-18. Online:<br />
http://epaa.asu.edu/epaa/v10n18 (10.03.2007)<br />
Apple, M.: Ideological Success, Educational Failure? On the Politics of No<br />
Child Left Behind. In: Journal of Teacher Education 58-2007-2, 108-116.<br />
Bachmann, K.; Sivesind, K. & Hopmann, S.T.: Hvordan formidles læreplanen.<br />
Kristiansand (Høyskoleforlag) 2004.<br />
Bachmann, K. & Sivesind, K.: Regn med meg! Evaluering og ansvarligjøring i<br />
skolen. In: Langfeldt, G., Elstad, E. & Hopmann, S.T. (eds.): Ansvarlighet<br />
i skolen. Oslo (Cappelen) 2007 (forthcoming).<br />
Bartlett, W., Roberts, J.A. & Le Grand, J.: A revolution in social policy: Quasi-market<br />
reforms in the 1990s. Bristol (Policy Press) 1998.<br />
Baumert, J., Stanat, P. & Watermann, R. (Hrsg.): Herkunftsbedingte Disparitäten<br />
im Bildungswesen. Vertiefende Analysen im Rahmen von <strong>PISA</strong> 2000.<br />
Wiesbaden (VS) 2004.<br />
Beck, U.: Weltrisikogesellschaft. Frankfurt (Suhrkamp) 2007.<br />
Beck, U., Giddens, A. & Lash, S.: Reflexive Modernisierung. Frankfurt<br />
(Suhrkamp) 1996.<br />
Benner, D.: Die Struktur der Allgemeinbildung im Kerncurriculum moderner<br />
Bildungssysteme. Ein Vorschlag zur bildungstheoretischen Rahmung von<br />
<strong>PISA</strong>. In: Zeitschrift für Pädagogik 48-2002-1, 68-90.<br />
Benoit, W.L.: Accounts, Excuses and Apologies: A Theory of Image Res<strong>to</strong>ration.<br />
Albany (SUNY) 1995.<br />
Berliner, D. C.: Our impoverished view of educational reform. In: Teachers<br />
College Record 2005. Online: http://www.tcrecord.org ID no. 12106<br />
(2007/07/07).<br />
Birkeland, N.: Ansvarlig, jeg? Accountability på norsk. In: Langfeldt, G., Elstad,<br />
E. & Hopmann, S.T. (eds.): Ansvarlighet i skolen. Oslo (Cappelen)<br />
2007 (forthcoming).<br />
Bishop, J.H. & Woessmann, L.: Institutional Effects in a Simple Model of<br />
Educational Production. In: Education Economics 12-2004-1, 17-38.<br />
Bloom, B. S. (ed.): Taxonomy of Educational Objectives, the classification of<br />
educational goals <strong>–</strong> Handbook I: Cognitive Domain. New York (McKay)<br />
1956.<br />
Bodin, A.: What does <strong>PISA</strong> really assess? What it doesn’t? A French view.<br />
Report prepared for Joint Finnish-French conference “Teaching mathematics:<br />
Beyond the <strong>PISA</strong> survey”, Paris 2005.<br />
Bodin, A.: What does <strong>PISA</strong> really assess? What it doesn’t? A French view. In<br />
this volume.<br />
Bogason, P. (ed.): New Modes of Local Political Organization: Local Government<br />
Fragmentation in Scandinavia. Commack (Nova Sciences) 1996.<br />
Bracey, G.W.: Research: Put Out Over <strong>PISA</strong>. In: Phi Delta Kappan 86-2005-<br />
10, 797.<br />
Braun, H. (2004): Reconsidering the impact of high-stakes testing. Education<br />
Policy Analysis Archives 12-1. Online: http://epaa.asu.edu/epaa/v12n1/<br />
(2006/01/20)<br />
Buschor, E. & Schedler, K. (eds.): Perspectives on Performance Measurement<br />
and Public Sec<strong>to</strong>r Accounting. Bern (Haupt) 1994.<br />
Cannell, J.J.: Nationally Normed Elementary Achievement Testing in America’s<br />
Public Schools: How All 50 States are Above National Average.<br />
Daniels (Friends of Education) 1987.
Caswell, H.L.: City School Surveys: An Interpretation and Analysis. New York<br />
(Teacher College) 1929.<br />
Chubb, J.E. (ed.): Within Our Reach: How America Can Educate Every Child.<br />
Lanham (Rowman & Littlefield) 2005.<br />
Coleman, J. S. et al.: Equality of Educational Opportunity. Washington (U. S.<br />
Department of Health, Education and Welfare) 1966.<br />
Conant, J. B.: The American High School Today: A First Report to Interested<br />
Citizens. New York (McGraw Hill) 1959.<br />
Cook, T. D.: Lessons Learned in Evaluation Over the Past 25 Years. In: Chelimsky,<br />
E. & Shadish, W.R. (eds.): Evaluation for the 21st Century. Thousand<br />
Oaks, London, New Delhi (Sage Publications) 1997, 30-52.<br />
Darling-Hammond, L.: Standards and Assessment: Where We Are and What<br />
We Need. Teachers College Record 16-2003-2.<br />
Deretchin, L.F.: Craig, C.J. (eds.): International Research on the Impact of Accountability<br />
Systems (Teacher Education Yearbook XV). Lanham (Rowman<br />
& Littlefield) 2007.<br />
Dewey, J.: Democracy and Education (1916). Online: http://www.ilt.columbia.<br />
edu/publications/dewey.html (2007/01/07)<br />
Dohn, N.B.: Knowledge and Skills for <strong>PISA</strong> <strong>–</strong> Assessing the Assessment. In:<br />
Journal of Philosophy of Education. 41-2007-1, 1-16.<br />
Dolin, J.: <strong>PISA</strong> <strong>–</strong> an Example of the Use and Misuse of Large-scale Comparative<br />
Tests. In this volume.<br />
Dorn, S.: The Political Legacy of School Accountability Systems. In: Education<br />
Policy Analysis Archives 6-1998-1. Online: http://epaa.asu.edu/epaa/<br />
v6n1/ (2007/03/02).<br />
Dubnick, M.J.: Accountability and the Promise of Performance. Paper presented<br />
at the 2003 Annual Metting of the American Poltical Science Association.<br />
Philadelphia<br />
Dubnick, M.J. & Justice, J.B.: Accounting for Accountability. Paper presented<br />
at the Annual Meeting of the American Political Science Association<br />
2004. Online: http://pubpages.unh.edu/dubnick/papers/2004/<br />
dubjusacctg2004.pdf (2007/07/07).<br />
Dubnick, M.J.: Orders of Accountability. Paper presented at the World Ethics<br />
Forum in Oxford 2006. Online: http://pubpages.unh.edu/dubnick/papers/<br />
2006/oxford2006.pdf (2007/07/07).<br />
STEFAN T. HOPMANN
Eberts, R., Hollenbeck, L. & Stone, J.: Teacher Performance Incentives and Student Outcomes. In: The Journal of Human Resources 37-2002-4, 913-927.
Education Commission of the States (ECS): No Child Left Behind Issue Brief: Data-Driven Decisionmaking. 2002. Online: http://www.nsba.org/site/docs/9200/9153.pdf (2007/07/07).
Elmore, R.F.: School Reform From the Inside Out. Cambridge (Harvard University Press) 2006.
Elstad, E.: Hvordan forholder skoler seg til ansvarliggjøring av skolens bidrag til elevenes læringsresultater? In: Langfeldt, G., Elstad, E. & Hopmann, S.T. (eds.): Ansvarlighet i skolen. Oslo (Cappelen) 2007 (forthcoming).
Elstad, E. & Langfeldt, G.: Hvordan forholder skoler seg til målinger av kvalitetsaspekter ved lærernes undervisning og elevenes læringsprosesser? In: Langfeldt, G., Elstad, E. & Hopmann, S.T. (eds.): Ansvarlighet i skolen. Oslo (Cappelen) 2007 (forthcoming).
Engeland, Ø.: Skolen i kommunalt eie – politisk styrt eller profesjonell ledet skoleutvikling? Avhandling til dr.polit.-graden. Oslo (Det utdanningsvitenskapelige fakultet, Universitetet i Oslo) 2000.
Engeland, Ø., Langfeldt, G. & Roald, K.: Kommunalt handlingsrom – hvordan møter norske kommuner ansvarsstyring i skolen? In: Langfeldt, G., Elstad, E. & Hopmann, S.T. (eds.): Ansvarlighet i skolen. Oslo (Cappelen) 2007 (forthcoming).
Esping-Andersen, G. (ed.): Welfare States in Transition. National Adaptations to Global Economies. London (Sage) 1996.
Evers, A., Rauch, U. & Stitz, U. (eds.): Von öffentlichen Einrichtungen zu sozialen Unternehmen. Hybride Organisationsformen im Bereich sozialer Dienstleistungen. Berlin (Edition Sigma) 2002.
Fertig, M.: What Can We Learn From International Student Performance Studies? Some Methodological Remarks. RWI Discussion Papers No. 23. Essen (RWI) 2004.
Foucault, M.: Geschichte der Gouvernementalität 1: Sicherheit, Territorium, Bevölkerung. Vorlesung am Collège de France 1977/1978. Frankfurt (Suhrkamp) 2006.
Fuchs, H.-W.: Auf dem Wege zu einem neuen Weltcurriculum? Zum Grundbildungskonzept von PISA und der Aufgabenzuweisung an die Schule. In: Zeitschrift für Pädagogik 49-2003-2, 161-179.
Fuchs, T. & Woessmann, L.: What Accounts for International Differences in Student Performance? A Re-Examination Using PISA Data. Bonn (IZA) 2004.
Fuhrmann, S. (ed.): From the Capitol to the Classroom: Standards-Based Reform in the States. Vol. I & II. Chicago (National Society for the Study of Education Yearbooks) 2001.
Fukuyama, F.: The End of History and the Last Man. New York (Free Press) 1992.
Goodlad, J.I.: A Place Called School. New York (McGraw-Hill) 1983.
Gorard, S.: Value-Added is of Little Value. In: Journal of Education Policy 21-2006-2, 235-243.
Gottweis, H. et al.: Verwaltete Körper. Strategien der Gesundheitspolitik im internationalen Vergleich. Wien (Böhlau) 2005.
Granheim, M., Kogan, M. & Lundgren, U.P. (eds.): Evaluation as Policymaking: Introducing Evaluation Into a National Decentralized Educational System. London (Jessica Kingsley Publishers) 1990.
Grisay, A. & Monseur, C.: Measuring the Equivalence of Item Difficulty in the Various Versions of an International Test. In: Studies in Educational Evaluation 33-2007-1, 69-86.
Gundem, B.B. & Hopmann, S. (eds.): Didaktik and/or Curriculum: An International Dialogue. New York, Bern etc. (Lang) 2002 (2nd ed.).
Gundem, B.B.: Læreplanadministrering: fremvækst og utvikling i et sentraliserings-desentraliseringsperspektiv. Oslo (UiO/PFI) 1992.
Gundem, B.B.: Mot en ny skolevirkelighet? Læreplanen i et sentraliserings-desentraliseringsperspektiv. Oslo (Ad Notam) 1993.
Gundem, B.B.: Læreplanhistorie – historien om skolens innhold – som forskningsfelt: en innføring og noen eksempler. Oslo (UiO/PFI) 1997.
Haft, H. & Hopmann, S.T. (eds.): Case Studies in Curriculum Administration History. London/New York 1990.
Haney, W.: The Myth of the Texas Miracle in Education. In: Education Policy Analysis Archives 8-2000-41. Online: http://epaa.asu.edu/epaa/v8n41/ (2007/07/07).
Haney, W.: Lake Wobegon Guaranteed. In: Education Policy Analysis Archives 10-2002-24. Online: http://epaa.asu.edu/epaa/v10n24/ (2007/07/07).
Hanushek, E.A. & Raymond, M.E.: Lessons about the Design of State Accountability Systems. In: Peterson, P.E. & West, M.R. (eds.): No Child Left Behind. Washington (Brookings) 2003, 127-151.
Hanushek, E.A. & Raymond, M.E.: Does School Accountability Lead to Improved Student Performance? In: Journal of Policy Analysis and Management 24-2005-2, 297-327.
Hanushek, E.A.: The Failure of Input-based Schooling Policies. Working Paper 9040. Cambridge, MA (National Bureau of Economic Research) 2002.
Hanushek, E.A.: Alternative School Policies and the Benefits of General Cognitive Skills. In: Economics of Education Review 25-2006-4, 447-462.
Hanushek, E.A.: The Long Run Importance of School Quality. NBER Working Paper No. 9071. Cambridge, MA (NBER) 2002.
Hargreaves, A.: Teaching in the Knowledge Society: Education in the Age of Insecurity. New York, NY (Teachers College Press) 2003.
Haug, P. & Monsen, L. (eds.): Skolebasert vurdering: erfaringer og utfordringer. Oslo (Abstrakt) 2002.
Herman, J.L. & Haertel, E.H. (eds.): Uses and Misuses of Data for Educational Accountability and Improvement (The 104th Yearbook of the NSSE, Part 2). Malden (Blackwell) 2005.
Hood, C.: A Public Management for all Seasons. In: Public Administration 69-1991-1, 3-20.
Hood, C.: Contemporary Public Management: A New Global Paradigm? In: Public Policy and Administration 10-1995-2, 104-117.
Hood, C.: Institutions, Blame Avoidance and Negativity Bias: Where Public Management Reform Meets the Blame Culture. Paper presented at the CMPO Conference on Public Organisation and the New Public Management. Bristol 2004.
Hood, C., Rothstein, H. & Baldwin, R.: The Government of Risk: Understanding Risk Regulation Regimes. Oxford (University Press) 2004.
Hopmann, S.T.: Lehrplanarbeit als Verwaltungshandeln (Curriculum making as administration). Kiel (IPN) 1988.
Hopmann, S.T.: Lehrplan des Abendlandes – am Ende seiner Geschichte? Geschichte der Lehrplanarbeit und des Lehrplans seit 1900. (The curriculum of the occident – at the end of its history? Curriculum development and the curriculum since 1900). In: Keck, Rudolf et al. (eds.): Lehrplan des Abendlandes – revisited. Braunschweig (Hohengrefe) 2000a.
Hopmann, S.T.: Die Schule von morgen – Entwicklungsperspektiven für einen nachhaltigen Unterricht. In: Die Schweizer Schule 2000-3, 13-19. (2000b)
Hopmann, S.T.: Von der gutbürgerlichen Küche zu McDonald's: Beabsichtigte und unbeabsichtigte Folgen der Internationalisierung der Erwartungen an Schule und Unterricht. In: Keiner, E. (ed.): Evaluation in den Erziehungswissenschaften. Weinheim (Beltz) 2001, 207-224.
Hopmann, S.T.: On the Evaluation of Curriculum Reforms. In: Journal of Curriculum Studies 2003-4, 459-478.
Hopmann, S.T.: Im Durchschnitt PISA oder: Alles bleibt schlechter. In: Criblez, L. et al. (eds.): Lehrpläne und Bildungsstandards. Bern (hep) 2006, 149-172.
Hopmann, S.T.: Keine Ausnahme für Hottentotten. Methoden der vergleichenden Bildungswissenschaften für die heilpädagogische Forschung. In: Biewer, G. & Schwinge, M. (eds.): Internationale Sonderpädagogik. Bad Heilbrunn (Klinkhardt) (in print).
Hörmann, B.: Die Unsichtbaren in PISA, TIMSS & Co. Diplomarbeit. Wien (Institut für Bildungswissenschaft der Universität Wien) 2007.
Hörmann, B.: Disappearing Students. PISA and students with disabilities. In this volume.
Hovdenak, S.S.: 90-tallsreformene – et instrumentalistisk mistak? Oslo (Gyldendal Akademisk) 2000.
Huisken, F.: Der "PISA-Schock" und seine Bewältigung – Wieviel Dummheit braucht/verträgt die Republik? Hamburg (VSA-Verlag) 2005.
Ingersoll, R.: Who Controls Teachers' Work? Power and Accountability in America's Schools. Cambridge (Harvard University Press) 2006.
Irjala, A. & Eikås, M.: State Culture and Decentralization: a Comparative Study of Decentralization Processes in Nordic Cultural Politics. Sogndal & Helsinki (Western Norway Research Institute/Arts Council of Finland) 1996.
Irons, J.E. & Harris, S.: The Challenges of No Child Left Behind. Blue Ridge Summit (Rowman & Littlefield) 2006.
Isaksen, L.: Skoler i gapestokken. In: Langfeldt, G., Elstad, E. & Hopmann, S.T. (eds.): Ansvarlighet i skolen. Oslo (Cappelen) 2007 (forthcoming).
Jahnke, T.: Deutsche Pisa-Folgen. In this volume.
Jahnke, T. & Meyerhöfer, W. (eds.): PISA & Co – Kritik eines Programms. Hildesheim (Franzbecker) 2006.
Karlsen, G.E.: Desentralisering – løsning eller oppløsning: søkelys på norsk skoleutvikling og utdanningspolitikk. Oslo (Ad notam) 1993.
Karlsen, G.E.: EU, EØS og utdanning. Oslo (Tano) 1994.
Kirke-, utdannings- og forskningsdepartementet (KUF): Underveis: Håndbok i skolebasert vurdering: grunnskole og videregående skole. Oslo (KUF) 1994.
Kivirauma, J., Klemala, K. & Rinne, R.: Segregation, Integration, Inclusion – The Ideology and Reality in Finland. In: European Journal of Special Needs Education 21-2006-2, 117-133.
Klatt, B., Murphy, S. & Irvine, D.: Accountability: Getting a Grip on Results. Calgary (Bow River) 2003 (2nd ed.).
Klieme, E. et al.: Expertise zur Entwicklung nationaler Bildungsstandards. Berlin (BMBF) 2003. Online: http://www.bmbf.de/pub/zur_entwicklung_nationaler_bildungsstandards.pdf (2007/07/07).
Koretz, D.: Limitations in the Use of Achievement Tests as Measures of Educators' Productivity. In: The Journal of Human Resources 37-2002-4, 752-777.
Koritzinsky, T.: Pedagogikk og politikk i L97: Læreplanens innhold og beslutningsprosessene. Oslo (Universitetsforlaget) 2000.
Korsgaard, O.: Kampen om folket: et dannelsesperspektiv på dansk historie gennem 500 år. Copenhagen (Gyldendal) 2004.
KUF: Rapport om nasjonalt vurderingssystem (Moe-utvalget). Forslag fra utvalg oppnevnt av Kirke-, Utdannings- og Forskningsdepartementet. Oslo (KUF) 1997.
Künzli, R. & Hopmann, S.T. (eds.): Lehrpläne: Wie sie entwickelt werden und was von ihnen erwartet wird. Forschungsstand, Zugänge und Ergebnisse aus der Schweiz und der Bundesrepublik Deutschland. Zürich (Ruegger) 1998.
Kvale, G.: "Det er ditt val!" – om fritt skuleval i to norske kommunar. In: Langfeldt, G., Elstad, E. & Hopmann, S.T. (eds.): Ansvarlighet i skolen. Oslo (Cappelen) 2007 (forthcoming).
Ladd, H.E. & Walsh, R.P.: Implementing Value-added Measures of School Effectiveness. In: Economics of Education Review 21-2002-1, 1-17.
Ladenthin, V.: Bildung als Aufgabe der Gesellschaft. In: Studia Comeniana et Historica 34-2004-71/72, 305-319.
Laffont, J.-J. (ed.): The Principal Agent Model: The Economic Theory of Incentives. Cheltenham (Edward Elgar Publishing) 2003.
Lamar, A. & Thomas, J.A.: The Nation's Report Card: Improving the Assessment of Student Achievement. Stanford, CA (National Academy of Education) 1987.
Lange, S. & Schimank, U. (eds.): Governance und gesellschaftliche Integration. Opladen (VS) 2004.
Langfeldt, G.: Resultatstyring som verktøy og ideologi. Statlige styringsstrategier i utdanningssektoren. In: Langfeldt, G., Elstad, E. & Hopmann, S.T. (eds.): Ansvarlighet i skolen. Oslo (Cappelen) 2007 (forthcoming).
Langfeldt, G.: PISA – Undressing the Truth or Dressing Up a Will to Govern. In this volume.
Langfeldt, G., Elstad, E. & Hopmann, S.T. (eds.): Ansvarlighet i skolen. Oslo (Cappelen) 2007 (forthcoming).
Leibfried, S. & Zürn, M.: Transformation des Staates. Frankfurt (Suhrkamp) 2006.
Lie, S. et al.: Nasjonale prøver på ny prøve. Rapport fra en utvalgsundersøkelse for å analysere og vurdere kvaliteten på oppgaver og resultater til nasjonale prøver våren 2005. Oslo (UiO, ILS) 2005.
Lindblad, S., Johannesson, I. & Simola, R.: Education governance in transition. In: Scandinavian Journal of Educational Research 2003-2.
Linn, R.L. & Haug, C.: Stability of School Building Accountability Scores and Gains. In: Educational Evaluation and Policy Analysis 24-2002-1, 29-36.
Linn, R.L.: Assessments and Accountability. In: Educational Researcher 29-2000-2, 4-16.
Linn, R.L., Baker, E.L. & Betebenner, D.W.: Accountability Systems: Implications of Requirements of the No Child Left Behind Act of 2001. In: Educational Researcher 31-2002-6, 3-16.
Linn, R.L.: Issues in the Design of Accountability Systems. In: Herman, J.L. & Haertel, E.H. (eds.): Uses and Misuses of Data for Educational Accountability and Improvement (The 104th Yearbook of the NSSE, Part 2). Malden (Blackwell) 2005, 78-98.
Lohmann, I.: After Neoliberalism. Können nationalstaatliche Bildungssysteme den ‚freien Markt' überleben? 2001. Online: http://www.erzwiss.uni-hamburg.de/Personal/Lohmann/AfterNeo.htm (2007/07/07).
Lohmann, I.: Was bedeutet eigentlich "Humankapital"? GEW Bezirksverband Lüneburg und Universität Lüneburg: Der brauchbare Mensch. Bildung statt Nützlichkeitswahn. Bildungstage 2007. Online: http://www.erzwiss.uni-hamburg.de/Personal/Lohmann/Publik/Humankapital.pdf (2007/07/07).
Loveless, T.: The Peculiar Politics of No Child Left Behind. Washington (Brookings) 2006.
Luhmann, N.: Die Gesellschaft der Gesellschaft. Frankfurt (Suhrkamp) 1998.
Marsh, J., Pane, J. & Hamilton, L.: Making Sense of Data-Driven Decision Making in Education: Evidence from Recent RAND Research. Santa Monica (RAND) 2006.
Martineau, J.A.: Distorting Value Added: The Use of Longitudinal, Vertically Scaled Student Achievement Data for Growth-Based, Value-Added Accountability. In: Journal of Educational and Behavioral Statistics 31-2006-1, 35-62.
McNeil, L.: Contradictions of School Reform: Educational Costs of Standardized Testing. New York (Routledge) 2000.
Mediås, O.A. & Telhaug, A.O.: Fra sentral til desentralisert styring: statlig og regional styring av utdanningen i Skandinavia fram mot år 2000. Steinkjer (Projekt: Utdanning som nasjonsbygging) 2000.
Mehrens, W.A.: Consequences of Assessment: What is the Evidence? In: Education Policy Analysis Archives 6-1998-13. Online: http://epaa.asu.edu/epaa/v6n13.html (2007/07/07).
Mejding, J. & Roe, A. (eds.): Northern Lights on PISA 2003 – A Reflection From the Nordic Countries. Copenhagen (Nordic Council) 2006.
Melton, J. van Horn: Absolutism and the Eighteenth-Century Origins of Compulsory Schooling in Prussia and Austria. Cambridge (University Press) 1988.
Meyer, J.W.: Weltkultur. Wie die westlichen Prinzipien die Welt durchdringen. Frankfurt (Suhrkamp) 2005.
Meyerhöfer, W.: Tests im Test. Das Beispiel PISA. Opladen (Budrich) 2005.
Meyerhöfer, W.: Testfähigkeit – Was ist das? In this volume.
Micklewright, J. & Schnepf, S.S.: Educational Achievement in the English Speaking Countries: Do Different Surveys Tell the Same Story? Bonn (IZA) 2004.
Micklewright, J. & Schnepf, S.S.: Inequality of Learning in Industrialised Countries. Bonn (IZA) 2006.
Mintrop, H.: The Limit of Sanctions in Low-Performing Schools: A Study of Maryland and Kentucky Schools on Probation. In: Education Policy Analysis Archives 11-2003-3. Online: http://epaa.asu.edu/epaa/v11n3.html (2007/07/07).
MMI (Markeds- og Mediainstituttet AS): Evaluering av gjennomføring av de nasjonale prøvene i 2005. Online: http://www.utdanningsdirektoratet.no/eway/library/forms/showmessage.aspx?oid=338 (2007/07/07).
Møller, J.: Coping with Accountability – A Tension between Reason and Emotion. In: Passionate Principalship: Learning from Life Histories of School Leaders. London (Falmer) 2003.
Muir Gray, J.A.: Evidence Based Health Care. Oxford (Elsevier) 2001 (2nd ed.).
National Commission on Excellence in Education: A Nation at Risk: The Imperative for Educational Reform. Washington, DC (U.S. Government Printing Office) 1983.
Nesje, K. & Hopmann, S.T. (eds.): En lærende skole: L97 i Skolepraksis. Oslo (Cappelen) 2002.
Neuwirth, E., Ponocny, I. & Grossmann, W. (eds.): PISA 2000 und PISA 2003. Graz (Leykam) 2006.
Neuwirth, E.: PISA 2000. Sample Weight Problems in Austria. OECD Education Working Papers No. 5. Paris (OECD) 2006.
NOU 1988:22: Med viten og vilje. White paper commissioned by the Norwegian Government. Oslo 1988.
NOU 2002:10: Førsteklasses fra første klasse. White paper commissioned by the Norwegian Government. Oslo 2002.
NOU 2003:16: I første rekke. White paper commissioned by the Norwegian Government. Oslo 2003.
OECD: Reviews of National Policies for Education: Norway. Paris (OECD) 1987.
OECD: Public Management Developments. Paris (OECD) 1995.
OECD: Education at a Glance. Paris (OECD) 2005. Online: http://www.oecd.org/document/34/0,2340,en_2649_34515_35289570_1_1_1_1,00.html (2007/07/07).
OECD: Knowledge and Skills for Life. First Results From PISA 2000. Paris (OECD) 2001.
OECD: The PISA 2003 Assessment Framework. Paris (OECD) 2003.
Olsen, R.V.: Achievement Tests From an Item Perspective. An Exploration of Single Item Data from the PISA and TIMSS Studies. Thesis (University of Oslo) Oslo 2005. Online: http://www.duo.uio.no/publ/realfag/2005/35342/Rolf_Olsen.pdf (2007/07/07).
Olsen, R.: Large-scale international comparative achievement studies in education: Their primary purposes and beyond. In this volume.
Peterson, P.E. & West, M.R. (eds.): No Child Left Behind. The Politics and Practice of School Accountability. Washington (Brookings) 2003.
Picht, G.: Die Deutsche Bildungskatastrophe. Olten (Walter Verlag) 1964.
PISA 2006: PISA – The OECD Programme for International Student Assessment. Leaflet produced by the OECD in 2006. Online: http://www.pisa.oecd.org/dataoecd/51/27/37474503.pdf (2007/07/07).
Pollitt, C. & Bouckaert, G.: Public Management Reform. A Comparative Analysis. Oxford (University Press) 2004 (2nd ed.).
Power, M.: The Audit Society: Rituals of Verification. Oxford (University Press) 1997.
Prahl, A. & Olsen, C.B.: Lokalsamfundet som samarbejdspartner: sammenhænge mellem decentralisering og lokalsamfundsudvikling i de nordiske lande. Copenhagen (Nordisk Ministerråd) 1997.
Prais, S.J.: Cautions on OECD's Recent Educational Survey (PISA): Rejoinder to OECD's Response. In: Oxford Review of Education 30-2004-4, 377-389.
Prais, S.J.: Cautions on OECD's Recent Educational Survey (PISA). In: Oxford Review of Education 29-2003-2, 139-163.
Prais, S.J.: England: Poor Survey Response and no Sampling. In this volume.
Rasmussen, J.: Undervisning i det refleksivt moderne. Copenhagen (Reitzel) 2006.
Rauin, U.: Die Pädagogik im Bann empirischer Mythen – Wie aus empirischen Vermutungen scheinbare pädagogische Gewissheit wird. In: Pädagogische Korrespondenz 2004-32, 39-49.
Rickover, H.G.: American Education – a National Failure: The Problem of Our Schools and What We Can Learn from England. New York (E. P. Dutton) 1963.
Riksrevisjonen: Riksrevisjonens undersøkelse av opplæringen i grunnskolen. Dokument nr. 3:10 (2005-2006). Oslo (Riksrevisjonen) 2006.
Rindermann, H.: Was messen internationale Schulleistungsstudien? Schulleistungen, Schülerfähigkeiten, kognitive Fähigkeiten, Wissen oder allgemeine Intelligenz? In: Psychologische Rundschau 57-2006-1, 69-86.
Robin, S.R. & Sprietsma, M.: Characteristics of Teaching Institutions and Students' Performance: New Empirical Evidence from OECD Data. Lille (CRESGE) 2003.
Saunders, L.: A Brief History of Educational 'Value-Added': How Did We Get to Where We Are? In: School Effectiveness and School Improvement 10-1999-2, 233-256.
Scharpf, F.W. & Schmidt, V. (eds.): Welfare and Work in the Open Economy. Vol. 1 & 2. Oxford (University Press) 2000.
Schedler, K. & Proeller, I.: New Public Management. Bern (Haupt/UTB) 2006 (3rd ed.).
Sedikides, C. et al.: Accountability as a Deterrent to Self-Enhancement: The Search for Mechanisms. In: Journal of Personality and Social Psychology 83-2002-3, 592-605.
Shavit, Y. & Blossfeld, H.-P. (eds.): Persistent Inequality. Boulder (Westview Press) 1993.
Simola, H.: The Finnish miracle of PISA: historical and sociological remarks on teaching and teacher education. In: Comparative Education 45-2005-4, 455-470.
Sivesind, K.: Reformulating Reforms. Oslo (UiO, ILS) (forthcoming).
Sivesind, K.: Task and Themes in the Communication about the Curriculum. The Norwegian Compulsory School Reform in Perspective. In: Rosenmund, M. et al.: Comparing Curriculum Making Processes. Bern (Lang) 2002.
Sivesind, K., Bachmann, K. & Afzar, A.: Nordiske læreplaner. Oslo (Læringssenteret) 2003.
Sivesind, K., Langfeldt, G. & Skedsmo, G.: Utdanningsledelse. Oslo (Cappelen akademisk) 2006.
Slagstad, R.: De nasjonale strateger. Oslo (Pax forlag) 1996.
Slavin, R.E.: Educational Research in an Age of Accountability. Boston (Pearson) 2006.
Stack, M.: Testing, Testing, Read All About It: Canadian Press Coverage of the PISA Results. In: Canadian Journal of Education 29-2006-1, 49-69.
STM Stortingsmelding 37 (1990-1991): Om organisering og styring i utdanningssektoren. Report to the Norwegian Parliament.
STM Stortingsmelding 29 (1994-1995): Om prinsipper og retningslinjer for tiårig grunnskole – ny læreplan. Report to the Norwegian Parliament.
STM Stortingsmelding 47 (1995-1996): Om elevvurdering, skolebasert vurdering og nasjonalt vurderingssystem. Report to the Norwegian Parliament.
STM Stortingsmelding 28 (1998-99): Mot rikare mål. Nasjonalt vurderingssystem for grunnskolen. Report to the Norwegian Parliament.
STM Stortingsmelding 17 (2002-2003): Om statlige tilsyn. Report to the Norwegian Parliament.
STM Stortingsmelding 30 (2003-2004): Kultur for læring. Report to the Norwegian Parliament. (A shortened version in English at: http://www.regjeringen.no/en/dep/kd/Documents/Brochures-and-handbooks/2004/Report-no-30-to-the-Storting-2003-2004.html?id=419442).
Sutherland, D. & Price, R.: Linkages Between Performance and Institutions in the Primary and Secondary Education Sector. OECD Economics Department Working Papers No. 558. Paris (OECD) 2007.
Swanson, C.B. & Stevenson, D.L.: Standards-Based Reform in Practice: Evidence on State Policy and Classroom Instruction from the NAEP State Assessments. In: Educational Evaluation and Policy Analysis 24-2002-1, 1-27.
Swiss Federal Statistical Office (BFS): PISA 2003 – Einflussfaktoren auf die kantonalen Ergebnisse. Neuchâtel (BFS) 2005.
Telhaug, A.O. & Mediås, O.A.: Grunnskolen som nasjonsbygger: fra statspietisme til nyliberalisme. Oslo (Abstrakt) 2003.
Telhaug, A.O.: Kunnskapsløftet – Ny eller Gammel skole? Oslo (Cappelen Akademisk) 2005.
Telhaug, A.O.: Skolen mellom stat og marked: norsk skoletenkning fra år til år 1990-2005. Oslo (Didakta) 2005.
TNS Gallup: Undersøkelse blant rektorer og lærere om gjennomføring av de nasjonale prøvene våren 2005. Rapport. Oslo 2005.
Turmo, A. & Lie, S.: Hva kjennetegner norske skoler som skårer høyt i PISA 2000? Oslo (UiO/ILS) 2004.
Tyler, R.: Basic Principles of Curriculum and Instruction. Chicago (University Press) 1949.
Uljens, M.: The Hidden Curriculum of PISA – The Promotion of Neo-liberal Policy by Educational Assessment. In this volume.
U.S. Department of Education: Building on Results: A Blueprint for Strengthening the No Child Left Behind Act. Washington, D.C. 2007.
Wallerstein, I.: World-Systems Analysis: An Introduction. Durham, North Carolina (Duke University Press) 2004.
Watermann, R. et al.: Schulrückmeldungen im Rahmen von Schulleistungsuntersuchungen: Das Disseminationskonzept von PISA-2000. In: Zeitschrift für Pädagogik 49-2003-1, 92-111.
Watson, S. & Supovitz, J.: Autonomy and Accountability in the Context of Standards-based Reform. In: Education Policy Analysis Archives 9-2001-32. Online: http://epaa.asu.edu/epaa/v9n32.html (2007/07/07).
Weber, M.: Wirtschaft und Gesellschaft (1923). Online: http://www.textlog.de/weber_wirtschaft.html (2007/03/19).
Weigel, T.M.: Die PISA-Studie im bildungspolitischen Diskurs. Eine Untersuchung der Reaktionen auf PISA in Deutschland und im Vereinigten Königreich. Trier (Universität) 2004. Online: http://www.oecd.org/dataoecd/46/23/34805090.pdf (2007/07/07).
Werler, T.: Nation, Gemeinschaft, Bildung: die Evolution des modernen skandinavischen Wohlfahrtsstaates und das Schulsystem. Baltmannsweiler (Schneider Verlag) 2004.
Westbury, I.: Didaktik and Curriculum Studies. In: Gundem, B.B. & Hopmann, S.T. (eds.): Didaktik and/or Curriculum. New York (Lang) 2002 (2nd ed.), 47-78.
Westbury, I., Hopmann, S. & Riquarts, K. (eds.): Teaching as Reflective Practice: The German Didaktik Tradition. Mahwah, NJ (Lawrence Erlbaum Associates) 2000.
Whitford, B.L. & Jones, K.: Accountability, Assessment, and Teacher Commitment: Lessons from Kentucky's Reform Efforts. Albany (SUNY) 2000.
Wuttke, J.: Uncertainties and Bias in PISA. In this volume.
Zimmer, R. et al.: State and Local Implementation of the "No Child Left Behind Act". Washington (Department of Education) 2007.
Über die Autoren/About the Authors
Allerup, Peter Nimmo:
Peter Allerup graduated in Mathematical Statistics from the University of Copenhagen in 1970. Today his preferred fields of interest are mathematical statistics, psychometrics and quantitative research methods in general. From 1994 to 2002 he was Senior Research Scientist at the Royal Danish Institute for Educational Research; since 2002 he has held a professorial chair at the Danish University of Education, later Aarhus University, School of Education. He has been involved in the majority of empirical international studies conducted by the university, OECD's PISA and IEA's comparative investigations in mathematics and science and in civic education, TIMSS and CIVIC. He has specialized experience in the field of IRT (Item Response Theory) models, viz. the Rasch models in particular, with emphasis on applications where psychometric scaling properties are essential. He has long-standing experience with data from multilevel specifications of the research framework.
Contact: nimmo@dpu.dk
Bodin, Antoine:
Graduated in pure mathematics and in mathematics education (didactics of mathematics). Successively, or at the same time, secondary math teacher, teacher trainer, researcher in mathematics education, evaluation specialist, mathematics textbook author, and international consultant (World Bank and other national and international agencies).
Antoine Bodin was much involved in the IREM network (French Institutes of Research in Mathematics Education) and in the APMEP (French Mathematics Teacher Association), where he created the EVAPM observatory and led it for 20 years.
He was a member of the TIMSS Subject Matter Advisory Committee and, for a few months, of the PISA 2003 Math Expert Group. He was also a member of the mathematics curriculum expert group and of the test development unit in the French Ministry of Education.
He has published numerous papers in math education: see his website: http://web.mac.com/antoinebodin/iWeb/Site_Antoine_Bodin/
Contact: antoinebodin@mac.com
Bozkurt, Dominik:<br />
Dominik Bozkurt was born on 10 November 1975 in Wels. He passed the Matura (school-leaving<br />
examination) at the Realgymnasium at Henriettenplatz in Vienna’s 15th district. After the Matura,<br />
in 2001, he began studying education (school and social pedagogy) and Romance<br />
studies (French). In January 2007 he completed his studies with a diploma thesis<br />
on school quality and the Kooperative Mittelschule, followed by a board examination.<br />
From 2005 to 2007 he worked as a student assistant in the Research Unit for School<br />
and Educational Research. Since 2007 he has been working as a social pedagogue in the<br />
prevention department of Aids Hilfe Wien.<br />
Contact: bozkurt@aids.at
EPILOGUE: NO CHILD, NO SCHOOL, NO STATE LEFT BEHIND 417<br />
Brinek, Gertrude:<br />
Taught at Viennese primary and lower secondary schools (1973 to 1983); studied<br />
art history, education and psychology at the University of Vienna (degree:<br />
Dr. phil. in education); student assistant and then university assistant at the<br />
Institute of Educational Sciences at the University of Vienna.<br />
Research, teaching and publications in the fields of school climate/school anxiety,<br />
museum education, and educational/school theory and research, among others.<br />
Since 2003 assistant professor at the Department of Education, Faculty of Philosophy<br />
and Education, University of Vienna (partly on leave to hold a seat in the<br />
federal legislature).<br />
Contact: gertrude.brinek@univie.ac.at<br />
Dolin, Jens:<br />
Head of the Department of Science Education at the University of Copenhagen. He<br />
has done research on the teaching and learning of science (with a focus on dialogical processes,<br />
forms of representation and the development of competencies), general pedagogical<br />
issues (Bildung, competencies, assessment and evaluation) and organizational change<br />
(reform processes, curriculum development, teacher conceptions). He was engaged<br />
in the development and implementation of the new 2005 science curriculum for<br />
the Danish upper secondary school.<br />
He was a member of the PISA Science Forum 2006, which formulated the<br />
science literacy framework for the PISA 2006 science test, and is currently leading<br />
a research project on the validation of PISA science in a Danish context.<br />
Contact: dolin@ind.ku.dk<br />
Hopmann, Stefan Thomas:<br />
University of Vienna. Previously worked in Kiel, Potsdam, Oslo, Trondheim and Kristiansand,<br />
among other places. Research areas:<br />
historical and comparative research on schools and education, in particular with<br />
a view to Didaktik, school development, school administration and teacher education.<br />
Contact: stefan.hopmann@univie.ac.at<br />
Hörmann, Bernadette:<br />
Born in 1983, she studied educational science at the University of Vienna. She currently<br />
works as an assistant at the Department of Educational Science at the University of<br />
Vienna, where she is writing her dissertation. Her core research themes are school accountability<br />
and school structures.<br />
Contact: bernadette.hoermann@univie.ac.at<br />
Jahnke, Thomas:<br />
(b. 1949); Diplom in mathematics 1974 (University of Marburg), doctorate in mathematics<br />
1979 (University of Freiburg), Habilitation in didactics of mathematics 1988<br />
(University of Siegen). Since 1994, chair of didactics of mathematics (University of<br />
Potsdam). Numerous scholarly publications; editor and author<br />
of mathematics textbooks for Gymnasien (academic secondary schools). Fields of work:<br />
subject-matter didactics; critique of didactic ideologies; curriculum development for teacher<br />
education programmes; philosophy, history and culture of mathematics.<br />
Contact: jahnke@math.uni-potsdam.de<br />
Langfeldt, Gjert:<br />
Dr. Gjert Langfeldt is a tenured associate professor at the University of Agder, Norway.<br />
His main research areas are issues linked to efficiency and equity,<br />
and the question of how didactics can be transformed into an empirical discipline. He pursues<br />
these research interests through an empirical approach and an interest in methodological<br />
issues.<br />
Langfeldt is currently engaged in two research projects funded by national authorities:<br />
together with Stefan Hopmann he is charting how schools<br />
and teachers can come to grips with the new logic of accountability in education, and<br />
he is also involved in evaluating the National System of Quality Assurance in Education.<br />
Contact: gjert.langfeldt@uia.no<br />
Meyerhöfer, Wolfram:<br />
1990–1995: teacher-training studies in mathematics and physics at the University of Potsdam<br />
1996–1998: teaching traineeship (Referendariat) at the Studienseminar Potsdam<br />
1998–2007: University of Potsdam, didactics of mathematics<br />
Doctorate, May 2004: Was testen Tests? Objektiv-hermeneutische Analysen am Beispiel<br />
von TIMSS und PISA (What do tests test? Objective-hermeneutic analyses using the examples of TIMSS and PISA)<br />
Since 2007: visiting professor at FU Berlin<br />
Contact: meyerhof@math.uni-potsdam.de<br />
Olechowski, Richard:<br />
Born 7 May 1936 in Vienna. From autumn 1955: studies in psychology at the University of<br />
Vienna; 1962: Dr. phil.; then worked as a psychologist in the justice system (rehabilitation); from 1966:<br />
assistant at the University of Vienna; 1970: Habilitation in education (with special emphasis on<br />
educational psychology); 1972: full professor of education at the University of Salzburg; 1977: full<br />
professor of education at the University of Vienna (with special emphasis on school pedagogy and general didactics);<br />
from 1986 additionally: academic and administrative director of the Ludwig Boltzmann Institute for<br />
School Development and International Comparative School Research. Since 1988: member of the<br />
editorial board of the journal “Erziehung und Unterricht”. 2004: Dr. h.c. and Prof. h.c. of Eötvös Loránd<br />
University, Budapest; autumn 2004: retirement as professor emeritus. Publications: Das alternde<br />
Gedächtnis (1969), Das Sprachlabor (1970; 2nd ed. 1973), translated into Japanese; more than 100<br />
contributions to Austrian and international journals, encyclopedias and handbooks; editor of the series<br />
“Schule–Wissenschaft–Politik”, “Erziehungswissenschaftliche Forschung – Pädagogische Praxis” and<br />
“Schulpädagogik und Pädagogische Psychologie”. Specialty: quantitative empirical educational research<br />
(especially on questions of school organization); from 1992 to 1997: director of a large-scale empirical<br />
project, the longitudinal evaluation of the school model “Kooperative Mittelschule”.<br />
Contact: richard.olechowski@univie.ac.at
Olsen, Rolf V.:<br />
Rolf V. Olsen holds a postdoctoral position at the Department of Teacher Education and<br />
School Development at the University of Oslo, where he also received his PhD in 2005<br />
with a thesis on secondary analyses of the science data in PISA and TIMSS. His current<br />
research activities are extensions of the work presented in his thesis. He is a member<br />
of a research group (Unit for Quantitative Analysis in Education) which, among other<br />
activities, is responsible for the Norwegian participation in a range of similar studies.<br />
Besides the analytical work presented in his publications, he has extensive practical<br />
experience with item development in both international and national studies, and he<br />
previously worked for several years as a teacher of science, physics and mathematics<br />
in upper secondary education.<br />
Contact: r.v.olsen@ils.uio.no<br />
Prais, S. J.:<br />
S. J. Prais (b. 1928) has spent most of his career in economic research, mostly at the<br />
National Institute of Economic and Social Research.<br />
The economic analysis of consumer behaviour formed his initial research field,<br />
followed by a study of the growth of industrial concentration in Britain. Work on international<br />
differences in industrial productivity, and their relation to vocational training<br />
of the workforce, was at the centre of subsequent extended empirical comparisons<br />
of British and German industries. This led to international comparisons of school-leaving<br />
standards, particularly in mathematics, and, in due course, to membership<br />
of the National Curriculum committee in that subject. Schooling standards and teaching<br />
methods have been compared, particularly with Germany and Switzerland. He<br />
was elected Fellow of the British Academy in 1985, and was awarded a D.Litt. (hon.)<br />
by City University in 1989 and a D.Sc. (hon.) by the University of Birmingham in 2006.<br />
Contact: c/o m.ockenden@niesr.ac.uk<br />
Puchhammer, Markus:<br />
Dipl.-Ing. Dr. phil. Dr. techn. Worked as a research assistant and graduated in physics at the<br />
Vienna University of Technology; also graduated in psychology and educational science at the<br />
University of Vienna. He worked for several years as a software developer, systems analyst and<br />
project manager in the telecommunications industry and then in the electronic transaction<br />
processing business; taught EDP training courses at WIFI Vienna and in a vocational<br />
education college (HTL); and was a lecturer at FH Joanneum Graz and then at the University of<br />
Applied Sciences Technikum Wien, working on science and research, statistics and data analysis,<br />
and a teleteaching survey.<br />
Contact: puchhammer@gmx.at<br />
Retzl, Martin:<br />
(b. 1980); assistant at the Department of Educational Science at the University of Vienna<br />
(since 1 September 2007)<br />
– student assistant at the Department of Educational Science at the University of<br />
Vienna (March 2006 to July 2007)<br />
– Master’s degree in educational science (Mag. phil.)<br />
– graduate of a teachers’ college: diploma for teaching at the Hauptschule (lower secondary<br />
school)<br />
– projects and research interests: development of teaching materials, empirical research<br />
(diploma thesis: a study of teachers), school capacity research, and governance of the<br />
school and educational system.<br />
Contact: martin.retzl@univie.ac.at<br />
Sjøberg, Svein:<br />
Professor of science education at the University of Oslo. He was trained as a nuclear physicist<br />
and later in education and social science. Current research interests: social,<br />
cultural and ethical aspects of science education; science education and development;<br />
gender and science education in developing countries; and critical approaches to issues of<br />
scientific literacy and the public understanding of science. He is currently the organizer of ROSE<br />
(The Relevance of Science Education), a comparative project on pupils’ interests, attitudes,<br />
perceptions etc. of importance to science teaching and learning.<br />
Information and articles at http://folk.uio.no/sveinsj/<br />
Contact: svein.sjoberg@ils.uio.no<br />
Uljens, Michael:<br />
Michael Uljens (b. 1962), Prof. Dr., Vice Dean at Åbo Akademi University and Dozent at the<br />
University of Helsinki, has worked on a wide range of educational topics, but his<br />
main field of research over the years has been the theory and philosophy of education<br />
(books: “School Didactics and Learning” and “Allmän pedagogik”). Since 2005 he has been<br />
running a four-year research project (“Bildung and learning in the late-modern society”)<br />
with six doctoral students working full-time. He has been a visiting scholar<br />
at the University of Göteborg with Prof. Marton and at Humboldt University with<br />
Prof. Benner. From 2000 to 2003 he was professor of general education at the University of Helsinki.<br />
Contact: muljens@abo.fi<br />
Wuttke, Joachim:<br />
Joachim Wuttke studied physics in München and Grenoble.<br />
He holds a state certificate for teaching mathematics and physics in secondary<br />
schools, a PhD in physical chemistry, and a Habilitation in experimental physics.<br />
He has worked in the telecommunications industry, as a school teacher, and as a<br />
group leader in academic research. Besides 25 research papers in statistical physics,<br />
he has published on scientific instrumentation and computing.<br />
Joachim Wuttke is a staff scientist at the Munich outstation of Forschungszentrum<br />
Jülich.<br />
Contact: wuttke1@web.de
Schulpädagogik und Pädagogische Psychologie<br />
edited by Univ.-Prof. Dr. Richard Olechowski (University of Vienna)<br />
Rudolf Beer<br />
Bildungsstandards<br />
Einstellungen von Lehrerinnen und Lehrern [Educational Standards: Teachers’ Attitudes]<br />
Educational standards are intended to raise the quality of Austrian schools. Teachers<br />
are seen as the “central hinge” in the implementation of educational standards and<br />
the quality development expected to result from them. To what extent can educational<br />
standards live up to this claim from the teachers’ point of view? This publication reflects<br />
the current state of the discussion and attempts to clarify the terminology, but it also<br />
points out contradictions and risks. The empirical study, conducted in Vienna, addresses<br />
the question of whether the teachers concerned accept such a concept.<br />
Vol. 1, 2006, 256 pp., €24.90, pb., ISBN 3-8258-0104-7<br />
LIT Verlag GmbH & Co. KG Wien – Zürich<br />
Distribution Austria: Medienlogistik Pichler-ÖBZ GmbH & Co KG<br />
IZ-NÖ Süd, Straße 1, Objekt 34, A-2355 Wiener Neudorf, Postfach 133<br />
Tel. +43 (0) 2236 / 63 535 - 236, Fax +43 (0) 2236 / 63 535 - 243, e-Mail: bestellen@medien-logistik.at<br />
Distribution Germany: Fresnostr. 2, 48159 Münster<br />
Tel.: 0251 / 620 32 22 – Fax 0251 / 922 60 99<br />
e-Mail: vertrieb@lit-verlag.de – http://www.lit-verlag.de
Isabella Benischek<br />
Leistungsbeurteilung im österreichischen Schulsystem [Performance Assessment in the Austrian School System]<br />
This book takes up the perennially contentious topic of performance assessment in the Austrian<br />
school system from the viewpoint that assessment helps determine pupils’ further<br />
educational careers. It therefore examines whether and how grading differs between<br />
primary school (Volksschule), lower secondary school (Hauptschule) and academic<br />
secondary school (AHS). An empirical study sheds light on the grading behaviour of<br />
teachers at these three school types. The book also includes a short overview of the<br />
historical development of the Austrian school system and a description of the current<br />
model.<br />
Vol. 2, 2006, 288 pp., €24.90, pb., ISBN 3-8258-0074-1<br />
Roman Lehnert; Justine Scanferla<br />
Zusammenleben in Wien<br />
Ergebnisse einer empirischen Längsschnittstudie an Migrantenkindern [Living Together in Vienna:<br />
Results of an Empirical Longitudinal Study of Migrant Children]<br />
Drawing on an empirical longitudinal study, this volume examines how open young people<br />
in Vienna are to inter-ethnic contacts. It compares the attitudes of adolescents with a<br />
migration background (Turkish and Serbian/Serbo-Croatian mother tongue) with those of<br />
German-speaking adolescents. In addition, the girls and boys of each language group are<br />
considered separately. The study offers insight into how the opinions of the surveyed<br />
pupils change between the ages of 10 and 15.<br />
Vol. 4, 2007, 280 pp., €24.90, pb., ISBN 978-3-8258-0554-8<br />
Christa-Monika Reisinger<br />
Unterrichtsdifferenzierung [Differentiated Instruction]<br />
Research on teaching quality increasingly analyses instruction in terms of its effectiveness.<br />
How can this demand be met in heterogeneous classes? The book discusses selected<br />
determinants of school achievement and shows how teaching methods can be adapted<br />
in school practice to children’s individual preconditions. Using an example from<br />
mathematics, an empirical study sheds light on the conditions under which pupils can<br />
find their personally best way of learning and thus achieve optimal learning outcomes.<br />
Vol. 5, 2007, 336 pp., €29.90, pb., ISBN 978-3-8258-0867-9<br />
Osnabrücker Schriften zur Psychologie<br />
edited by Prof. Dr. Josef Rogner, Prof. Dr. Henning Schöttke<br />
and Prof. Dr. Manfred Tücke<br />
Manfred Tücke, with the collaboration of Ulla Burger<br />
Entwicklungspsychologie des Kindes- und Jugendalters für (zukünftige) Lehrer<br />
[Developmental Psychology of Childhood and Adolescence for (Future) Teachers]<br />
This book was written for student teachers and practising teachers. It aims to give<br />
practice-oriented information on key topics and controversies in the developmental<br />
psychology of childhood and adolescence, insofar as they are important and/or interesting<br />
for school. Wherever possible without a significant loss of precision, everyday<br />
language was preferred to scientific terminology.<br />
Vol. 6, 3rd ed. 2007, 440 pp., €29.90, hb., ISBN 978-3-8258-0157-1<br />
Manfred Tücke<br />
Grundlagen der Psychologie für (zukünftige) Lehrer [Foundations of Psychology for (Future) Teachers]<br />
This book was written for student teachers and practising teachers. It presents important<br />
modes of thinking and findings of psychology, explains them with classic studies, and<br />
relates them to everyday life through many examples. Wherever possible without a<br />
significant loss of precision, everyday language was preferred to scientific terminology.<br />
The following topics are addressed: subject matter and methods of psychology;<br />
conditioning and learning: learning from experience; remembering and forgetting:<br />
human memory; thinking, problem solving and decision making; intelligence and<br />
intelligence testing; emotions, using the examples of happiness, contentment and<br />
anxiety; social processes and social behaviour.<br />
Vol. 8, 2nd ed. 2004, 472 pp., €29.90, hb., ISBN 3-8258-7190-8<br />