Developing valid, reliable and fair exams.pdf - Pearson VUE

Developing valid, reliable and fair exams


Some high-stakes examinations are created to distinguish 

candidates who demonstrate the required knowledge, 

skills and abilities from candidates who do not. Other 

examinations place test-takers along a continuum so 

that valid comparisons can be made. Regardless of 

whether the goal of the examination is to make a pass/fail decision or to provide a ranking of test-takers, the examination must be valid, reliable and fair (1).

Validity, reliability and fairness 

Validity 

The Standards for Educational and Psychological Testing (SEPT) 

(1) describe validity as “the most fundamental consideration in 

developing and evaluating tests”. Simply put, validity is concerned 

with answering the following two questions: 

1. Does the test measure what it is intended to measure?

2. Are the interpretations drawn from the test scores appropriate and justifiable?

Evaluating the processes used to develop test items can provide 

evidence for validity, as can an analysis of the relationships among 

test items and whether these relationships support the intended 

test construct. The validity of a test can also be assessed by 

comparing the test scores to other relevant external variables. For 

example, test scores from a university admissions test could be 

compared to measures of performance at the university to obtain 

validity-related information. 
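The comparison with an external variable can be sketched as a simple correlation. The example below is a minimal illustration, not a description of any particular validation study: the function name `pearson_r` and both data lists are hypothetical, and a real study would also need to consider sample size, range restriction and the quality of the external measure.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)

# Hypothetical data: admissions test scores and later first-year grade averages
# for the same six students, in the same order.
admission_scores = [52, 61, 70, 75, 83, 90]
first_year_gpa = [2.1, 2.6, 2.9, 3.0, 3.4, 3.8]

validity_coefficient = pearson_r(admission_scores, first_year_gpa)
print(round(validity_coefficient, 2))
```

A coefficient close to 1 would suggest that higher test scores go with better later performance; a coefficient near 0 would offer little support for using the scores in that way.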

Reliability 

Reliability refers to the consistency of the measure and the degree 

to which a test is free of random error. The usefulness of testing 

presupposes that there is at least some stability in the test-taker’s 

knowledge level (1). However, some degree of test score variance 

is inevitable. Test-takers’ performance can be affected by many 

factors, such as anxiety or how they are feeling on the day of the 

test. While these factors are outside the control of test sponsors and providers, they have a duty to standardize the factors contributing to irrelevant test score variance that are within their control, such as the testing environment and the test itself.

Anything about the test or testing conditions that causes a test-taker to respond based on something other than knowledge of the correct response is a source of error. Two types of error must be understood. Systematic error reflects differences among test-takers that are irrelevant to the purposes of testing.

For example, quantitative reasoning items that require high levels 

of verbal ability to answer would introduce a systematic source 

of variance into the test scores (verbal ability) that is irrelevant to 

the goal of testing (assessment of quantitative reasoning). Random 

error is caused by temporary, chance influences on the test result, 

such as a test-taker misreading an item or a grading error on an 

essay item. The relationship between score variance, validity and 

reliability is illustrated in the following diagram. 

[Figure: total test score variance divided into real differences among individuals, systematic error and random error, with the portions relevant to validity and to reliability marked.]
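One common way to estimate how much of the score variance is consistent rather than random is an internal-consistency coefficient. The text does not name a specific statistic, so the use of Cronbach's alpha here is an assumption, and the function name and the response matrix are hypothetical. The sketch scores each item 1 (correct) or 0 (incorrect).

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a score matrix.

    item_scores: one row per test-taker, each row a list of item scores.
    """
    n_items = len(item_scores[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across test-takers.
    item_variances = [
        pvariance([row[i] for row in item_scores]) for i in range(n_items)
    ]
    # Variance of the total scores across test-takers.
    total_variance = pvariance([sum(row) for row in item_scores])
    if total_variance == 0:
        raise ValueError("alpha is undefined when all total scores are equal")
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical responses of five test-takers to four items.
responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 2))
```

Alpha only reflects consistency among the items. Systematic error that affects all items in the same way can leave it high, which is why a reliable test is not necessarily a valid one.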


Fairness 

As the SEPT (1) point out, “the term fairness is used in many

different ways and has no single technical meaning.” The Standards 

outline four ways in which the term can be used in relation to 

testing: A) the absence of bias, B) fair treatment with regard to 

test procedures, test scoring, and the use of scores, C) equality of 

outcomes in testing so that test-takers of equivalent ability should 

have equivalent test results regardless of group membership (for 

example, race or ethnicity), and D) equitable opportunities to learn 

the material covered by the test. 

The incorporation of item bias and sensitivity reviews during 

the item development process can be supplemented by the use 

of statistical measures to identify items which have a different 

probability of a correct response for different test-taker subgroups. 

A psychometric analysis of differential item functioning (DIF) identifies items that perform differently across subgroups of test-takers (e.g. gender, ethnicity, sociodemographic status, age) while controlling for the ability of the test-takers. That is, the probability of a correct response differs depending on group membership even for test-takers with the same ability. Items flagged for DIF

are subject to greater content scrutiny to investigate potential bias 

and justify their continued inclusion in the item bank. It should be 

stressed that differential item functioning does not automatically 

equate to bias – there may be legitimate reasons as to why an item 

performs differently for different subgroups. 
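The idea of comparing subgroups while controlling for ability can be sketched with the Mantel-Haenszel common odds ratio. The text does not say which DIF statistic is used, so this choice is an assumption, and the function names, group labels and counts below are hypothetical. Test-takers are first grouped into ability strata (for example, by total score), and the odds of a correct response are compared between a reference group and a focal group within each stratum.

```python
from collections import defaultdict

def mantel_haenszel_odds_ratio(records):
    """Common odds ratio for one item across ability strata.

    records: iterable of (group, stratum, correct) tuples, where group is
    "reference" or "focal" and correct is True or False.
    """
    # counts[stratum] = [ref_correct, ref_incorrect, focal_correct, focal_incorrect]
    counts = defaultdict(lambda: [0, 0, 0, 0])
    for group, stratum, correct in records:
        offset = 0 if group == "reference" else 2
        counts[stratum][offset + (0 if correct else 1)] += 1

    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in counts.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these data")
    return numerator / denominator

def make_records(group, stratum, n_correct, n_incorrect):
    """Helper that expands counts into individual response records."""
    return ([(group, stratum, True)] * n_correct
            + [(group, stratum, False)] * n_incorrect)

# Hypothetical counts for one item in two ability strata.
records = (
    make_records("reference", "low", 8, 12)
    + make_records("focal", "low", 4, 16)
    + make_records("reference", "high", 15, 5)
    + make_records("focal", "high", 10, 10)
)
odds_ratio = mantel_haenszel_odds_ratio(records)
print(round(odds_ratio, 2))
```

A ratio near 1 indicates similar performance for test-takers of matched ability. A ratio well above or below 1 would flag the item for content review; it does not by itself show that the item is biased.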

Standard setting, test assembly and equating are important to 

establishing fairness in test procedures and test scoring. These 

concepts are discussed in the subsequent sections. 

Standard setting 

For examinations that require a pass/fail decision, a passing 

standard must be established. There are three general methods for 

setting a pass mark: 

1. Holistic – An arbitrary fixed percentage pass mark (for example, 

60%) 

2. Norm-referenced – A fixed pass rate (for example, the top 60% 

of test-takers) 

3. Criterion-referenced – Setting the pass mark at an absolute 

standard that denotes the required level of competence 

A holistic pass mark is the least appropriate. Unless there is a 

rationale for using a norm-referenced standard (for example, a 

limited number of placements available for a training program), 

criterion-referenced standards are preferred. For high-stakes examinations, such as licensure or certification tests, criterion-referenced standard setting is widely recognized as the method of choice. Pearson VUE psychometricians employ criterion-referenced

standard-setting procedures that have been developed based upon 

universally-accepted psychometric practices. 
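The three approaches to a pass mark can be contrasted in a short sketch. All function names and numbers are hypothetical. For the criterion-referenced case the sketch assumes an Angoff-style procedure, in which judges estimate the probability that a minimally competent candidate answers each item correctly; the text does not specify which criterion-referenced procedure is used.

```python
from math import ceil

def fixed_percentage_cut(n_items, percent):
    """Holistic: pass mark as a fixed share of the maximum raw score."""
    return ceil(n_items * percent / 100)

def norm_referenced_cut(scores, pass_rate_percent):
    """Norm-referenced: lowest score within the top pass_rate_percent of scores.

    Tied scores at the cut can let slightly more test-takers pass.
    """
    ranked = sorted(scores, reverse=True)
    n_pass = max(1, round(len(ranked) * pass_rate_percent / 100))
    return ranked[n_pass - 1]

def angoff_cut(judge_ratings):
    """Criterion-referenced (Angoff-style, an assumption of this sketch).

    judge_ratings: one list per judge with, for each item, the estimated
    probability that a minimally competent candidate answers correctly.
    The cut score is the sum over items of the mean rating.
    """
    n_judges = len(judge_ratings)
    n_items = len(judge_ratings[0])
    item_means = [
        sum(judge[i] for judge in judge_ratings) / n_judges
        for i in range(n_items)
    ]
    return sum(item_means)

# Hypothetical inputs.
print(fixed_percentage_cut(50, 60))
print(norm_referenced_cut([10, 20, 30, 40, 50, 60, 70, 80, 90, 100], 60))
print(angoff_cut([[0.6, 0.8, 0.5, 0.7], [0.7, 0.7, 0.4, 0.9]]))
```

Only the third function ties the pass mark to a judgment about the competence the examination is meant to certify. The first ignores item difficulty, and the second depends on who happens to sit the test.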

Test assembly 

Pearson VUE has extensive experience with test assembly for 

computer-based examinations. Our content developers and 

psychometricians work with clients to choose an appropriate 

test administration model, including fixed-form, linear-on-the-fly 

(LOFT) or adaptive test design. In the fixed-form design, a specified 

number of content- and statistically-equivalent exam versions 

(forms) are assembled. In a LOFT test design, items are randomly 

selected from the item pool for examination inclusion to fulfil a 

prescribed series of content and statistical rules. In adaptive testing, 

items are selected to test an individual test-taker based upon an understanding of the test-taker’s ability level as identified through his or her responses to previous questions. Thus, the examination “adapts” to an individual test-taker’s ability level and more precisely measures that test-taker’s proficiency. Considerations for optimal test design include the size and quality of the item bank, the number of test-takers and the frequency of test administration.

Glossary of Terms

Classical test theory (CTT)

An exam development and evaluation framework derived from the premise that any test score can be expressed as the sum of two independent components – 1) the test-taker’s true standing on the construct of interest, and 2) random error. Test items are characterized in terms of the proportion of a specified population able to correctly answer the item (item difficulty or p value) and the point-biserial correlation between item score and test score (item-test correlation). Test-takers are scored using some function of the number of items answered correctly.

Computerized adaptive testing (CAT)

A computer-based test in which successive items are selected from a pool of items based on the test-taker’s performance on previous items. Based in item response theory (IRT), this type of testing is intended to select items that are of appropriate difficulty for the test-taker. Good performance by the test-taker leads to more difficult questions; poor performance leads to easier questions. Adaptive tests can be fixed or variable in length.
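The adaptive selection described for computerized adaptive testing can be sketched as a simple loop. This is a deliberately simplified illustration: the ability estimate is moved up or down by a shrinking step instead of being re-estimated with an IRT model, as an operational adaptive test would do, and the function name, item identifiers and difficulty values are hypothetical.

```python
def run_adaptive_test(item_difficulties, answer_fn, test_length,
                      start_ability=0.0, step=1.0):
    """Administer a fixed-length adaptive test.

    item_difficulties: dict mapping item id to a difficulty value on the
        same scale as the ability estimate.
    answer_fn: callable taking an item id and returning True if the
        test-taker answered correctly.
    Returns (list of administered item ids, final ability estimate).
    """
    remaining = dict(item_difficulties)
    ability = start_ability
    administered = []
    for _ in range(min(test_length, len(remaining))):
        # Pick the unused item whose difficulty is closest to the estimate.
        item_id = min(remaining, key=lambda i: abs(remaining[i] - ability))
        del remaining[item_id]
        administered.append(item_id)
        # Correct answers raise the estimate, incorrect answers lower it.
        ability += step if answer_fn(item_id) else -step
        # Shrink the step so the estimate settles as the test proceeds.
        step = max(step / 2, 0.1)
    return administered, ability

# Hypothetical four-item pool and a test-taker who answers everything correctly.
pool = {"easy": -1.0, "medium": 0.0, "hard": 1.0, "very_hard": 2.0}
items, estimate = run_adaptive_test(pool, lambda item_id: True, test_length=3)
print(items, estimate)
```

A variable-length design would replace the fixed loop count with a stopping rule, for example ending the test once the estimate is precise enough.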

Equating and scaling 

Through the procedure known as equating, passing standard values 

are adjusted so that an equivalent level of proficiency is required 

to pass different versions of the examination. The goal of this 

process is to make sure that each test-taker receives a statistically 

equivalent examination: one that is neither statistically easier nor 

harder than that received by any other test-taker. 

A method typically utilized by Pearson VUE employs IRT to 

calibrate items from two or more test forms to the same scale. 

IRT uses statistical models to quantitatively link all item parameters 

to a common benchmark scale. Pearson VUE then develops an 

IRT-calibrated item bank. When items are linked to a common 

scale, test forms can be created and passing standards set such that 

slight differences in test difficulty across forms are accounted for. 
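Once items sit on a common scale, a single ability standard can be translated into a form-specific raw pass mark. The sketch below assumes a one-parameter (Rasch) model and uses the expected raw score at the cut ability as the equated pass mark; the text does not specify the model or procedure, and the function names, item difficulties and cut value are hypothetical.

```python
from math import exp

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + exp(-(ability - difficulty)))

def expected_raw_score(ability, form_difficulties):
    """Expected number of correct answers on a form at a given ability."""
    return sum(rasch_probability(ability, b) for b in form_difficulties)

# Hypothetical calibrated difficulties for two five-item forms on one scale.
form_a = [-1.0, -0.5, 0.0, 0.5, 1.0]
form_b = [-0.5, 0.0, 0.5, 1.0, 1.5]  # slightly harder overall

theta_cut = 0.0  # hypothetical passing standard on the ability scale

raw_cut_a = expected_raw_score(theta_cut, form_a)
raw_cut_b = expected_raw_score(theta_cut, form_b)
print(round(raw_cut_a, 2), round(raw_cut_b, 2))
```

The harder form yields a lower raw pass mark for the same ability standard, which is the sense in which differences in form difficulty are accounted for.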

Test forms are drawn from the calibrated item bank and no item 

appears on a test before it has been trialled and equated to the 

benchmark scale. If a fixed-form test design is used, a fixed number 

of equated forms are prepared and are available for administration. 

A test form is randomly selected for each test-taker. If a LOFT test 

design is used, the test form is assembled as the test-taker begins 

the computer-based test. 
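On-the-fly assembly under content and statistical rules can be sketched as repeated random sampling until a draw satisfies the rules. This is one simple way to meet such rules, not a description of the assembly engine actually used; the function name, the item fields, the blueprint and the tolerance are all hypothetical.

```python
import random

def assemble_loft_form(item_pool, blueprint, target_difficulty, tolerance,
                       rng, max_attempts=1000):
    """Randomly assemble a form that meets content and difficulty rules.

    item_pool: list of dicts with keys "id", "area" and "difficulty".
    blueprint: dict mapping content area to the number of items required.
    The form's mean difficulty must lie within tolerance of target_difficulty.
    """
    by_area = {}
    for item in item_pool:
        by_area.setdefault(item["area"], []).append(item)

    for _ in range(max_attempts):
        form = []
        for area, count in blueprint.items():
            # Content rule: draw the required number of items per area.
            form.extend(rng.sample(by_area[area], count))
        mean_difficulty = sum(i["difficulty"] for i in form) / len(form)
        # Statistical rule: accept only forms close to the target difficulty.
        if abs(mean_difficulty - target_difficulty) <= tolerance:
            return form
    raise RuntimeError("no form met the rules within max_attempts")

# Hypothetical calibrated pool with two content areas.
pool = [
    {"id": f"{area}-{n}", "area": area, "difficulty": d}
    for area in ("algebra", "geometry")
    for n, d in enumerate([-1.0, -0.5, 0.0, 0.0, 0.5, 1.0])
]
form = assemble_loft_form(pool, {"algebra": 2, "geometry": 2},
                          target_difficulty=0.0, tolerance=0.25,
                          rng=random.Random(0))
print([item["id"] for item in form])
```

Because every item carries a difficulty on the common scale, each test-taker can receive a different set of items while the forms remain comparable in content and difficulty.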

Test functionality 

In addition to making decisions on the psychometric properties of 

the tests, test sponsors delivering tests in a CBT environment also 

need to consider which computerized features are appropriate 

for the desired measurement. The variety of functionality allowed 

through CBT means that technical test specifications are necessary 

as a complement to the test plan or test blueprint (the outline of 

the content requirements). Technical specifications may include the 

following details on items and their display in a CBT environment: 

• Navigation between test items 

• Guidelines on the inclusion of graphical images (for example, 

size and format) 

• Ancillary information to display with the test items (such as 

exhibits, instructions, calculators) and the format in which 

these will be displayed 

• Whether or not test-takers are allowed to go back to 

previous screens of the test or previous items 
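The kinds of details listed above could be recorded in a small machine-readable specification. The sketch below is purely illustrative: the class, its field names and the values are hypothetical and do not correspond to any actual delivery system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TechnicalSpec:
    """Hypothetical technical test specification for a CBT exam."""
    allow_item_review: bool       # may test-takers return to previous items?
    allowed_image_formats: tuple  # e.g. ("png", "jpg")
    max_image_size_px: tuple      # (width, height)
    ancillary_tools: tuple = ()   # e.g. ("calculator", "exhibit")

def image_meets_spec(spec, image_format, width, height):
    """Check a graphic against the format and size guidelines."""
    max_width, max_height = spec.max_image_size_px
    return (image_format.lower() in spec.allowed_image_formats
            and width <= max_width
            and height <= max_height)

spec = TechnicalSpec(
    allow_item_review=False,
    allowed_image_formats=("png", "jpg"),
    max_image_size_px=(800, 600),
    ancillary_tools=("calculator", "exhibit"),
)
print(image_meets_spec(spec, "PNG", 640, 480))
```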

When implementing CBT as a testing method, care should be exercised to ensure that its functionality enhances, and does not interfere with, the assessment of test-takers’ ability related to the testing purpose (2).

By working with Pearson VUE and through careful development of tests according to psychometric best practices and the considered use of CBT functionality, test sponsors can create examinations that are valid, reliable, and fair.

References 

1: American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

2: International Test Commission (2005). International Guidelines on Computer-Based and Internet Delivered Testing. http://www.intestcom.org/Downloads/ITC%20Guidelines%20on%20Computer%20-20version%202005%20approved.pdf (Retrieved 8 January 2012).

de Klerk, G. Classical test theory (CTT). In M. Born, C.D. Foxcroft & R. Butter (Eds.), Online Readings in Testing and Assessment, International Test Commission, http://www.intestcom.org/Publications/ORTA.php (Retrieved 5 December 2011).

Construct irrelevant variance 

“The degree to which the test scores are 

affected by processes that are extraneous 

to its intended construct” (AERA, APA, 

NCME, 1999, p. 10) 

Construct underrepresentation 

“The degree to which a test fails to capture 

important aspects of the construct” 

(AERA, APA, NCME, 1999, p. 10) 

Equating 

The process of statistically adjusting the 

scoring of alternate forms of a test so that 

they use the same scoring scale. 

Item response theory (IRT) 

A statistical model for analyzing test-takers’ 

performance on a set of test questions 

(items). Its basic assumption is that the 

probability that a test-taker will answer a 

test question correctly 

depends on one characteristic of the 

test-taker (called “ability”) and on one to 

three characteristics of the test question. 

The three characteristics of the test 

question are indicated by numbers called 

“parameters.”
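The response probability described in this definition can be written as a short function. The glossary does not name the item parameters; the sketch assumes the conventional three-parameter logistic form, in which the item characteristics are difficulty, discrimination and a guessing level, and the function and argument names are hypothetical. With the defaults it reduces to a one-parameter model.

```python
from math import exp

def irt_probability(ability, difficulty, discrimination=1.0, guessing=0.0):
    """Probability of a correct response under a logistic IRT model.

    With the default discrimination and guessing values only difficulty
    varies (one parameter); setting discrimination gives a two-parameter
    model, and setting guessing as well gives a three-parameter model.
    """
    logistic = 1.0 / (1.0 + exp(-discrimination * (ability - difficulty)))
    return guessing + (1.0 - guessing) * logistic

# A test-taker whose ability equals the item difficulty.
print(irt_probability(0.0, 0.0))
# The same item with a hypothetical guessing level of 0.25.
print(irt_probability(0.0, 0.0, guessing=0.25))
```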

Pearson VUE Sales Offices 

Americas 

Global Headquarters 

Minneapolis, MN 

+01 800 837 8969 

pvamericassales@pearson.com 

www.pearsonvue.com 

Philadelphia, PA 

+01 610 617 9300 



Chicago, IL 

+01 800 837 8969 



Asia Pacific 

Delhi, India 

+91 120 4001600 

pvindiasales@pearson.com 


Beijing, China 

+86 10 6849 2066 

pvchinasales@pearson.com 

www.pearsonvue.com.cn 

Tokyo, Japan 

+81 3 5214 0888 

pvjsales@pearson.com 

www.pearsonvue.com/japan 

Europe, Middle East & Africa 

Manchester, United Kingdom 

+44 0 161 855 7000 

vuemarketing@pearson.com 

www.pearsonvue.co.uk 

London, United Kingdom 

+44 0 161 855 7000 


www.pearsonvue.co.uk 

Dubai, United Arab Emirates 

+971 44 535300 


www.pearsonvue.ae 

Committed to developing 

valid, reliable and fair exams 

To learn more, visit www.pearsonvue.com 

PV/3 Test Dev/US/9-12 

Copyright © 2012 Pearson Education, Inc. or its affiliate(s). All rights reserved. 800 837 8969
