Predictive Modelling of Undergraduate Student Intake

Anatoli Lightfoot

Information Analyst, Statistical Services

Outline

• Introduction (brief)

• Theory of regression analysis (not so brief)

• Some possible applications

• What to aim for to obtain reliable predictions

• Limitations of regression models

• One model in detail (acceptance rates)

AAIR Forum 2008


Anatoli Lightfoot – ANU

Why is this important?

• Load management is vital to universities!

So:

• Student load has a major effect on university funding

• The consequences of being under- or over-enrolled are potentially very serious

• We want to get it right; and

• We want to know how likely it is to go wrong


Why should you listen to me?

• You shouldn’t! (necessarily)

• I aim to:

• Explain some important basic statistics

• Offer some food for thought

• But:

• This is not a substitute for a statistics degree

• I am not a professional statistician (yet)


Things this presentation does not cover:

• Setting intake targets

• Modelling continuing load

• Financial outcomes/consequences

And a warning:

• The next 10 slides are statistical theory

• Now is your chance to bail out!


Time for some statistics!


Regression

• Relationship between variables (X and Y)

$Y_i = \alpha + \beta X_i + \varepsilon_i$

• Y is the “response” or “dependent” variable

• X is an “explanatory” or “independent” variable

• α and β are constants

• Equation of a straight line


The regression equation

• What do the i and ε signify?

$Y_i = \alpha + \beta X_i + \varepsilon_i$

• The subscript i indexes observations

• Each i-value represents a data point

• Often omitted for clarity

• $\varepsilon_i$ is an error term

• The “residual” for each observation


The regression equation - example

• Height vs 100m sprint time

$Y_i = \alpha + \beta X_i + \varepsilon_i$

• For the i-th observation (person):

• $Y_i$ is 100m sprint time

• $X_i$ is height

• Determining α and β is “fitting” a model

• This is done using statistical software

• α and β are chosen to minimise $\sum_i \varepsilon_i^2$


The regression equation - example

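| i  | height (cm) | time100 (s) |
|----|-------------|-------------|
| 1  | 140         | 17.6        |
| 2  | 142         | 14.3        |
| 3  | 147         | 16.4        |
| 4  | 150         | 15.1        |
| 5  | 153         | 15.4        |
| 6  | 159         | 15.2        |
| 7  | 163         | 12.7        |
| 8  | 164         | 13.9        |
| 9  | 168         | 14.1        |
| 10 | 170         | 13.7        |

α = 30

β = -0.1

[Figure: Height vs 100m sprint times. Scatter plot of 100m time (s) against height (cm), with the fitted model $Y_i = 30 - 0.1X_i + \varepsilon_i$]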
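The fit on this slide can be reproduced by hand. The sketch below is a minimal illustration in plain Python, not the statistical software used for the presentation; the function name `fit_line` is made up for this example. It applies the standard closed-form least-squares solution to the ten observations in the table, and the estimates round to the slide's values of 30 and -0.1.

```python
# Heights (cm) and 100m sprint times (s) copied from the table on this slide.
heights = [140, 142, 147, 150, 153, 159, 163, 164, 168, 170]
times = [17.6, 14.3, 16.4, 15.1, 15.4, 15.2, 12.7, 13.9, 14.1, 13.7]


def fit_line(xs, ys):
    """Return (alpha, beta) minimising the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Closed-form least-squares estimates for a single explanatory variable.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta = s_xy / s_xx
    alpha = y_bar - beta * x_bar
    return alpha, beta


alpha, beta = fit_line(heights, times)
print(f"alpha = {alpha:.2f}, beta = {beta:.4f}")
```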


The regression equation

• The ε are used in model diagnostics

• They can be used to:

• Check basic assumptions

• Check goodness-of-fit

• Identify outliers

$Y = \alpha + \beta X + \varepsilon$

• They are also used to calculate confidence intervals when using a model to predict
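As a rough illustration of these diagnostics, the sketch below refits the height and sprint-time example and then computes the residuals, R² as one common goodness-of-fit measure, and a crude outlier screen. The cut-off of two residual standard errors is an assumption chosen for this example, not a rule from the presentation.

```python
import math

# Height (cm) and 100m time (s) data from the earlier example slide.
heights = [140, 142, 147, 150, 153, 159, 163, 164, 168, 170]
times = [17.6, 14.3, 16.4, 15.1, 15.4, 15.2, 12.7, 13.9, 14.1, 13.7]

n = len(heights)
x_bar = sum(heights) / n
y_bar = sum(times) / n
s_xx = sum((x - x_bar) ** 2 for x in heights)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, times))
beta = s_xy / s_xx
alpha = y_bar - beta * x_bar

# Residual = observed value minus fitted value, one per observation.
residuals = [y - (alpha + beta * x) for x, y in zip(heights, times)]

# Goodness of fit: share of the variation in Y explained by the model.
sse = sum(e ** 2 for e in residuals)
sst = sum((y - y_bar) ** 2 for y in times)
r_squared = 1 - sse / sst

# Residual standard error, using n - 2 degrees of freedom.
sigma_hat = math.sqrt(sse / (n - 2))

# Crude outlier screen (assumed threshold): residuals beyond 2 standard errors.
flagged = [i + 1 for i, e in enumerate(residuals) if abs(e) > 2 * sigma_hat]

print(f"R^2 = {r_squared:.2f}, sigma = {sigma_hat:.2f}, flagged = {flagged}")
```

A plot of the residuals against X is the usual next step for checking the assumptions, but it needs a plotting library and is left out here.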


Regression – basic assumptions

• The ε are independent

• The ε are identically distributed

• In particular, $\varepsilon \sim N(0, \sigma^2)$, where $\sigma^2$ is a constant

• The sample is representative of the population

• Vital for useful predictions


Transformations

• The variables used need not be “as measured”

• Variables can be transformed:

• Using square, square root, or higher order polynomial

• Using inverse, logarithm, or exponential function

• Using another function

• By multiplying them together (“interaction” terms)

• Transformations are often used on response variables which are not defined on (-∞,∞)


Transformations – logit function

• Maps (0,1) to (-∞,∞)

• Used to transform a response variable which is a binomial proportion

• Model is fitted to transformed Y-variable

$\mathrm{logit}(Y) = \alpha + \beta X + \varepsilon$

• Inverse function used to “un-transform” results

[Figure: The logit function, $y = \ln(x) - \ln(1 - x)$]
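The logit transformation and its inverse are short enough to write out directly. The following is a minimal sketch in plain Python; the function names `logit` and `inv_logit` are chosen for this example. It uses the same formula as the figure, y = ln(x) - ln(1 - x).

```python
import math


def logit(p):
    """Map a proportion in (0, 1) onto the whole real line."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is only defined for 0 < p < 1")
    return math.log(p) - math.log(1.0 - p)


def inv_logit(z):
    """Map any real number back to a proportion in (0, 1)."""
    # Two branches so that exp() is never called on a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


for p in (0.1, 0.5, 0.9):
    z = logit(p)
    print(f"p = {p}: logit = {z:+.3f}, back-transformed = {inv_logit(z):.3f}")
```

Proportions of exactly 0 or 1 have no finite logit, so groups with no acceptances or no rejections need separate handling before the transformation is applied.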

Predictions

• Model is fit on observed (historical) data

• To make predictions:

$Y = \alpha + \beta X + \varepsilon$

• Obtain new data which contains explanatory variables

• Apply model equation to data

• Output is predicted Y-values and confidence intervals

• Make sure new data is from same population!


That’s it for the hard stuff

So why use regression to model student intake?

• You may already be using it!

• Large body of knowledge exists

• Ideally suited to large admissions datasets

• Can provide confidence in predictions, not just an unqualified number!


Applications of regression

• Many and varied

• I will discuss just two:

• Predicting enrolments from TAC preferences

• Predicting enrolments from simulated TAC offers


Applications of regression

• Historical datasets available from UAC are large

• Many possible explanatory variables present

• Bio & demo data (age, gender, location)

• Education data (UAI, prior studies)

• Preference information (which courses, what order)

• What are the observations?

• Hard to tell where to start!


Applications of regression

• Conversion of preferences to offers depends only on type of course (e.g. arts, science, etc.)

• Model equation:

$\mathrm{logit}(Y) = \alpha + \beta_1 X_1 + \varepsilon$

• Results:

• 1st preferences for B Arts will result in the same proportion of enrolments as 5th preferences for B Arts


Applications of regression

• Conversion of preferences to offers depends on both preference number and faculty

• Model equation:

$\mathrm{logit}(Y) = \alpha + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$

• Results:

• 1st and 5th preferences are now treated differently

• What happens if the split between local and non-local applicants changes for arts courses?


Model refinement

• Iterative process

• Add or remove variables and refit model

• Examine model diagnostics

• Compare to previous models

• Rinse and repeat

• Important to revisit basic assumptions

• Can we treat each preference as a separate observation?

• No!


Model refinement

• Preferences as observations is bad

• Outcome of each preference is not independent

• Each applicant as an observation

• Group information from preferences together

• Create additional variables

• Often datasets require modifying in some way
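The regrouping described above might look like the sketch below. Everything in it is hypothetical: the record layout, the field names and the derived variables are invented for illustration and are not the UAC data format or the variables used in the presentation.

```python
from collections import defaultdict

# Hypothetical preference-level records: one row per (applicant, preference).
preferences = [
    {"applicant": "A1", "pref_no": 1, "course_group": "arts"},
    {"applicant": "A1", "pref_no": 3, "course_group": "science"},
    {"applicant": "A2", "pref_no": 2, "course_group": "arts"},
    {"applicant": "A3", "pref_no": 1, "course_group": "science"},
    {"applicant": "A3", "pref_no": 2, "course_group": "science"},
    {"applicant": "A3", "pref_no": 5, "course_group": "arts"},
]


def collapse_to_applicants(rows):
    """Turn preference-level rows into one summary row per applicant."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["applicant"]].append(row)

    applicants = {}
    for applicant, prefs in grouped.items():
        # The applicant's highest-ranked preference for this institution.
        best = min(prefs, key=lambda r: r["pref_no"])
        applicants[applicant] = {
            "n_prefs": len(prefs),
            "highest_pref_no": best["pref_no"],
            "highest_pref_group": best["course_group"],
            "has_first_pref": best["pref_no"] == 1,
        }
    return applicants


for applicant, summary in collapse_to_applicants(preferences).items():
    print(applicant, summary)
```

Each applicant now contributes exactly one observation, so the outcomes of one person's several preferences no longer enter the model as if they were independent data points.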


Reliable models

• Simple models are usually better models

• Modelling is not an exact science

• But the theory behind it is!

• Many different models are possible

• All of them may produce acceptable results

• A model should make intuitive sense

• If it doesn’t, something is probably wrong with it!


Limitations of regression

• There are times when it is not appropriate

• Very small datasets can cause problems

• Some datasets require specialised techniques

• Time series analysis

• Some datasets simply resist analysis

• Other methods available, e.g. non-parametric statistics


Detailed example – acceptance rates

• Model based on historical UAC data (3 years)

• Basic observations are individual offers

• Observations are grouped

• Response is proportion of acceptances

• Each group is weighted when fitting model
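One simple way to fit a model to grouped proportions is weighted least squares on the logit of each group's acceptance rate, with the number of offers in the group as its weight. The sketch below shows that approach on invented numbers. It is an assumption made for illustration: the slides do not say which fitting method or weights were used, and a binomial generalised linear model in statistical software would be a common alternative.

```python
import math

# Hypothetical grouped data: (preference number, offers made, offers accepted).
groups = [(1, 400, 300), (2, 200, 110), (3, 100, 40), (4, 50, 15)]


def logit(p):
    return math.log(p) - math.log(1 - p)


def inv_logit(z):
    return 1 / (1 + math.exp(-z))


xs = [pref for pref, _, _ in groups]
# Response: logit of the acceptance proportion (must lie strictly in (0, 1)).
ys = [logit(accepted / offers) for _, offers, accepted in groups]
# Assumed weighting: larger groups count for more when fitting.
ws = [offers for _, offers, _ in groups]

# Weighted least squares for one explanatory variable.
w_total = sum(ws)
x_bar = sum(w * x for w, x in zip(ws, xs)) / w_total
y_bar = sum(w * y for w, y in zip(ws, ys)) / w_total
s_xx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
s_xy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
beta = s_xy / s_xx
alpha = y_bar - beta * x_bar

# Back-transform the fitted values to acceptance proportions.
predicted = {x: inv_logit(alpha + beta * x) for x in xs}
for pref, p in predicted.items():
    print(f"preference {pref}: predicted acceptance rate {p:.2f}")
```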


Detailed example – acceptance rates

• Simplified model equation:

$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_6 X_1 X_2 + \varepsilon$

$X_1$ is a binary variable identifying ACT school-leavers

$X_2$ represents 3 variables describing preference number

$X_3$ identifies current and prior year school leavers

$X_4$ represents 6 variables for different groups of courses

The last term is an interaction term between preference number and ACT school-leaver
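To make the structure of this equation concrete, the sketch below builds the row of explanatory variables for a single offer. The coding is assumed for illustration only: the slides do not say how the 3 preference variables or the 6 course-group variables are defined, so the indicator coding, the `group_1` to `group_6` labels and the function name `design_row` are all hypothetical.

```python
# Hypothetical labels for the six course-group indicator variables.
COURSE_GROUPS = ["group_1", "group_2", "group_3", "group_4", "group_5", "group_6"]


def design_row(act_leaver, pref_no, recent_leaver, course_group):
    """Return the explanatory variables for one offer as a flat list."""
    # X1: 1 for an ACT school-leaver, otherwise 0.
    x1 = [1.0 if act_leaver else 0.0]
    # X2 (assumed coding): indicators for preference 1, 2 and 3;
    # any later preference is the baseline (all zeros).
    x2 = [1.0 if pref_no == k else 0.0 for k in (1, 2, 3)]
    # X3 (assumed coding): 1 for a current or prior year school leaver.
    x3 = [1.0 if recent_leaver else 0.0]
    # X4: one indicator per course group; unknown groups fall to the baseline.
    x4 = [1.0 if course_group == g else 0.0 for g in COURSE_GROUPS]
    # Interaction: each preference indicator multiplied by the ACT indicator.
    interaction = [x1[0] * v for v in x2]
    return x1 + x2 + x3 + x4 + interaction


row = design_row(act_leaver=True, pref_no=2, recent_leaver=True, course_group="group_4")
print(len(row), row)
```

Under this coding the interaction columns are non-zero only for ACT school-leavers, which lets the effect of preference number differ between that group and everyone else.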


Detailed example – acceptance rates

• Mostly additive model

• Includes one interaction term

• Preference number with ACT school-leaver

• Many iterations to develop

• More refinements are possible


Thank you

• Questions?
