
Predictive Modelling of<br />

Undergraduate Student Intake<br />

Anatoli Lightfoot<br />

Information Analyst, Statistical Services


Outline<br />

• Introduction (brief)<br />

• Theory of regression analysis (not so brief)<br />

• Some possible applications<br />

• What to aim for to obtain reliable predictions<br />

• Limitations of regression models<br />

• One model in detail (acceptance rates)<br />

AAIR Forum 2008<br />


Anatoli Lightfoot – ANU


Why is this important?<br />

• Load management is vital to universities!<br />

So:<br />

• Student load has a major effect on university funding<br />

• The consequences of being under- or over-enrolled are potentially very serious<br />

• We want to get it right; and<br />

• We want to know how likely it is to go wrong<br />



Why should you listen to me?<br />

• You shouldn’t! (necessarily)<br />

• I aim to:<br />

• Explain some important basic statistics<br />

• Offer some food for thought<br />

• But:<br />

• This is not a substitute for a statistics degree<br />

• I am not a professional statistician (yet)<br />



Things this presentation does not cover:<br />

• Setting intake targets<br />

• Modelling continuing load<br />

• Financial outcomes/consequences<br />

And a warning:<br />

• The next 10 slides are statistical theory<br />

• Now is your chance to bail out!<br />



Time for some statistics!<br />



Regression<br />

• Relationship between variables (X and Y)<br />

$Y_i = \alpha + \beta X_i + \varepsilon_i$<br />

• Y is the “response” or “dependent” variable<br />

• X is an “explanatory” or “independent” variable<br />

• α and β are constants<br />

• Equation of a straight line<br />



The regression equation<br />

• What do the i and ε signify?<br />

$Y_i = \alpha + \beta X_i + \varepsilon_i$<br />

• The subscript i indexes observations<br />

• Each i-value represents a data point<br />

• Often omitted for clarity<br />

• $\varepsilon_i$ is an error term<br />

• The “residual” for each observation<br />



The regression equation - example<br />

• Height vs 100m sprint time<br />

$Y_i = \alpha + \beta X_i + \varepsilon_i$<br />

• For the i-th observation (person):<br />

• $Y_i$ is 100m sprint time<br />

• $X_i$ is height<br />

• Determining α and β is “fitting” a model<br />

• This is done using statistical software<br />

• α and β are chosen to minimise $\sum_i \varepsilon_i^2$<br />
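For a single explanatory variable, minimising the sum of squared residuals has a standard closed-form solution, which is what the software computes behind the scenes. Here $\bar{X}$ and $\bar{Y}$ are the sample means and the hats mark fitted values; this notation is added for illustration and is not on the slide.

```latex
\hat{\beta} = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2},
\qquad
\hat{\alpha} = \bar{Y} - \hat{\beta}\,\bar{X}
```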



The regression equation - example<br />

i    height (cm)   time100 (s)<br />
1    140           17.6<br />
2    142           14.3<br />
3    147           16.4<br />
4    150           15.1<br />
5    153           15.4<br />
6    159           15.2<br />
7    163           12.7<br />
8    164           13.9<br />
9    168           14.1<br />
10   170           13.7<br />

α = 30<br />

β = -0.1<br />

[Figure: Height vs 100m sprint times; x-axis: Height (cm), y-axis: 100m time (s); equation shown: $Y_i = 30 - 0.1X_i + \varepsilon_i$]<br />
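As a rough check, the ten observations on this slide can be fitted by ordinary least squares in a few lines of plain Python. This is a minimal sketch rather than the software used for the talk; the function name `fit_line` is made up for illustration. The estimates it produces round to the α = 30 and β = -0.1 quoted on the slide.

```python
# Data from the slide: height in cm, 100m sprint time in seconds.
heights = [140, 142, 147, 150, 153, 159, 163, 164, 168, 170]
times = [17.6, 14.3, 16.4, 15.1, 15.4, 15.2, 12.7, 13.9, 14.1, 13.7]


def fit_line(x, y):
    """Return (alpha, beta) minimising the sum of squared residuals."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sums of cross-products and squares around the means.
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    beta = s_xy / s_xx
    alpha = y_mean - beta * x_mean
    return alpha, beta


alpha, beta = fit_line(heights, times)
print(f"alpha = {alpha:.2f}, beta = {beta:.4f}")
```

The negative slope says that, in this small sample, taller people tend to record shorter sprint times.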



The regression equation<br />

• The ε are used in model diagnostics<br />

• They can be used to:<br />

• Check basic assumptions<br />

• Check goodness-<strong>of</strong>-fit<br />

• Identify outliers<br />

$Y = \alpha + \beta X + \varepsilon$<br />

• They are also used to calculate confidence intervals when using a model to predict<br />
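The diagnostic uses listed here can be sketched with the height and sprint-time data from the earlier example. The code below computes the residuals, a goodness-of-fit measure (R squared) and a crude outlier screen. The cut-off of two residual standard deviations is an assumption chosen for illustration, not a rule from the talk.

```python
import math

# Height (cm) and 100m time (s) from the worked example.
heights = [140, 142, 147, 150, 153, 159, 163, 164, 168, 170]
times = [17.6, 14.3, 16.4, 15.1, 15.4, 15.2, 12.7, 13.9, 14.1, 13.7]

n = len(heights)
x_mean = sum(heights) / n
y_mean = sum(times) / n
s_xx = sum((x - x_mean) ** 2 for x in heights)
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(heights, times))
beta = s_xy / s_xx
alpha = y_mean - beta * x_mean

# Residual = observed value minus fitted value.
residuals = [y - (alpha + beta * x) for x, y in zip(heights, times)]

# Goodness-of-fit: share of the variation in Y explained by the line.
sse = sum(e ** 2 for e in residuals)
sst = sum((y - y_mean) ** 2 for y in times)
r_squared = 1 - sse / sst

# Residual standard deviation (n - 2 because two parameters were fitted).
sigma_hat = math.sqrt(sse / (n - 2))

# Crude outlier screen (assumed threshold): |residual| > 2 * sigma_hat.
flagged = [i + 1 for i, e in enumerate(residuals) if abs(e) > 2 * sigma_hat]

print(f"R^2 = {r_squared:.2f}, sigma_hat = {sigma_hat:.2f}, flagged = {flagged}")
```

Plotting the residuals against the fitted values is the usual next step for checking the assumptions by eye.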



Regression – basic assumptions<br />

• The ε are independent<br />

• The ε are identically distributed<br />

• In particular, $\varepsilon \sim N(0, \sigma^2)$ where $\sigma^2$ is a constant<br />

• The sample is representative of the population<br />

• Vital for useful predictions<br />



Transformations<br />

• The variables used need not be “as measured”<br />

• Variables can be transformed:<br />

• Using square, square root, or higher order polynomial<br />

• Using inverse, logarithm, or exponential function<br />

• Using another function<br />

• By multiplying them together (“interaction” terms)<br />

• Transformations are often used on response variables which are not defined on (-∞,∞)<br />



Transformations – logit function<br />

• Maps (0,1) to (-∞,∞)<br />

• Used to transform a response variable which is a binomial proportion<br />

• Model is fitted to transformed Y-variable<br />

$\operatorname{logit}(Y) = \alpha + \beta X + \varepsilon$<br />

• Inverse function used to “un-transform” results<br />


[Figure: The logit function, $y = \ln(x) - \ln(1-x)$]<br />
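The logit transformation and its inverse are short enough to write out directly. The sketch below uses only the Python standard library; the function names `logit` and `inv_logit` are chosen here for illustration.

```python
import math


def logit(p):
    """Map a proportion in (0, 1) onto the whole real line."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is only defined for 0 < p < 1")
    return math.log(p) - math.log(1.0 - p)


def inv_logit(z):
    """Map any real number back to a proportion in (0, 1)."""
    # Two branches avoid overflow in exp() for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Round trip: transform a proportion, then "un-transform" it.
z = logit(0.2)
print(z, inv_logit(z))
```

Proportions of exactly 0 or 1 have no finite logit, so groups with no acceptances or all acceptances need special handling before this transformation is applied.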



Predictions<br />

• Model is fit on observed (historical) data<br />

• To make predictions:<br />

$Y = \alpha + \beta X + \varepsilon$<br />

• Obtain new data which contains explanatory variables<br />

• Apply model equation to data<br />

• Output is predicted Y-values and confidence intervals<br />

• Make sure new data is from same population!<br />
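The prediction steps above can be sketched with the height and sprint-time example. The code refits the line, then predicts the time for a new person of height 160 cm (a made-up value inside the observed range) together with a 95% interval. Strictly, this is a prediction interval for a single new observation, which is wider than a confidence interval for the mean. The t quantile 2.306 (8 degrees of freedom) is taken from standard tables because the standard library has no t distribution.

```python
import math

# Historical data: height (cm) and 100m time (s) from the worked example.
heights = [140, 142, 147, 150, 153, 159, 163, 164, 168, 170]
times = [17.6, 14.3, 16.4, 15.1, 15.4, 15.2, 12.7, 13.9, 14.1, 13.7]

n = len(heights)
x_mean = sum(heights) / n
y_mean = sum(times) / n
s_xx = sum((x - x_mean) ** 2 for x in heights)
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(heights, times))
beta = s_xy / s_xx
alpha = y_mean - beta * x_mean

# Residual standard deviation from the fitted model.
sse = sum((y - (alpha + beta * x)) ** 2 for x, y in zip(heights, times))
sigma_hat = math.sqrt(sse / (n - 2))

# Tabulated 97.5th percentile of the t distribution with n - 2 = 8 df.
T_CRIT_95_DF8 = 2.306


def predict_with_interval(x_new):
    """Return (prediction, lower, upper) for one new observation."""
    y_hat = alpha + beta * x_new
    # The interval widens as x_new moves away from the mean of the data.
    se_pred = sigma_hat * math.sqrt(1 + 1 / n + (x_new - x_mean) ** 2 / s_xx)
    half_width = T_CRIT_95_DF8 * se_pred
    return y_hat, y_hat - half_width, y_hat + half_width


y_hat, lower, upper = predict_with_interval(160)
print(f"predicted {y_hat:.1f} s, 95% interval ({lower:.1f}, {upper:.1f})")
```

With only ten observations the interval is wide, which is exactly the kind of qualification an unqualified point prediction hides.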



That’s it for the hard stuff<br />

So why use regression to model student intake?<br />

• You may already be using it!<br />

• Large body of knowledge exists<br />

• Ideally suited to large admissions datasets<br />

• Can provide confidence in predictions, not just an unqualified number!<br />



Applications of regression<br />

• Many and varied<br />

• I will discuss just two:<br />

• Predicting enrolments from TAC preferences<br />

• Predicting enrolments from simulated TAC offers<br />



Applications of regression<br />

• Historical datasets available from UAC are large<br />

• Many possible explanatory variables present<br />

• Bio & demo data (age, gender, location)<br />

• Education data (UAI, prior studies)<br />

• Preference information (which courses, what order)<br />

• What are the observations?<br />

• Hard to tell where to start!<br />



Applications of regression<br />

• Conversion of preferences to offers depends only on type of course (e.g. arts, science, etc.)<br />

• Model equation:<br />

$\operatorname{logit}(Y) = \alpha + \beta_1 X_1 + \varepsilon$<br />

• Results:<br />

• 1st preferences for B Arts will result in the same proportion of enrolments as 5th preferences for B Arts<br />



Applications of regression<br />

• Conversion of preferences to offers depends on both preference number and faculty<br />

• Model equation:<br />

$\operatorname{logit}(Y) = \alpha + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$<br />

• Results:<br />

• 1st and 5th preferences are now treated differently<br />

• What happens if the split between local and non-local applicants changes for arts courses?<br />
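Faculty and preference number are categories, so in practice each is usually fed to the model as a set of 0/1 indicator variables rather than a single number. The sketch below shows one common encoding, with the first level of each variable as the baseline. The faculty names, the range of preference numbers and the function names are all hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical category levels; the first level of each is the baseline.
FACULTIES = ["arts", "science", "engineering"]
PREFERENCE_NUMBERS = [1, 2, 3, 4, 5]


def indicators(value, levels):
    """Return 0/1 indicators for every level except the baseline."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels[1:]]


def design_row(faculty, preference_number):
    """Build one row of explanatory variables for the regression."""
    # The leading 1 multiplies the intercept alpha.
    return (
        [1]
        + indicators(faculty, FACULTIES)
        + indicators(preference_number, PREFERENCE_NUMBERS)
    )


# A 1st and a 5th preference for arts now give different rows,
# so the fitted model can treat them differently.
print(design_row("arts", 1))
print(design_row("arts", 5))
```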



Model refinement<br />

• Iterative process<br />

• Add or remove variables and refit model<br />

• Examine model diagnostics<br />

• Compare to previous models<br />

• Rinse and repeat<br />

• Important to revisit basic assumptions<br />

• Can we treat each preference as a separate observation?<br />

• No!<br />



Model refinement<br />

• Preferences as observations is bad<br />

• Outcome of each preference is not independent<br />

• Each applicant as an observation<br />

• Group information from preferences together<br />

• Create additional variables<br />

• Often datasets require modifying in some way<br />
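Regrouping preference-level records into one observation per applicant might look like the sketch below. The record layout, the applicant IDs and the derived variables are all hypothetical; real admissions data will have its own fields and will suggest its own summaries.

```python
from collections import defaultdict

# Hypothetical preference-level records:
# (applicant_id, preference_number, course_type)
preferences = [
    ("A001", 1, "arts"),
    ("A001", 2, "science"),
    ("A001", 3, "arts"),
    ("A002", 1, "science"),
    ("A003", 2, "arts"),
    ("A003", 5, "arts"),
]


def group_by_applicant(records):
    """Collapse preference rows into one observation per applicant."""
    grouped = defaultdict(list)
    for applicant_id, pref_no, course_type in records:
        grouped[applicant_id].append((pref_no, course_type))

    observations = {}
    for applicant_id, prefs in grouped.items():
        prefs.sort()  # lowest preference number (highest priority) first
        observations[applicant_id] = {
            # Additional variables created from the grouped preferences.
            "n_preferences": len(prefs),
            "highest_preference": prefs[0][0],
            "highest_preference_course": prefs[0][1],
            "n_course_types": len({course for _, course in prefs}),
        }
    return observations


for applicant_id, row in group_by_applicant(preferences).items():
    print(applicant_id, row)
```

Each applicant now contributes exactly one row, so the independence assumption is far more plausible than with one row per preference.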



Reliable models<br />

• Simple models are usually better models<br />

• Modelling is not an exact science<br />

• But the theory behind it is!<br />

• Many different models are possible<br />

• All of them may produce acceptable results<br />

• A model should make intuitive sense<br />

• If it doesn’t, something is probably wrong with it!<br />



Limitations of regression<br />

• There are times when it is not appropriate<br />

• Very small datasets can cause problems<br />

• Some datasets require specialised techniques<br />

• Time series analysis<br />

• Some datasets simply resist analysis<br />

• Other methods available – e.g. non-parametric statistics<br />



Detailed example – acceptance rates<br />

• Model based on historical UAC data (3 years)<br />

• Basic observations are individual offers<br />

• Observations are grouped<br />

• Response is proportion of acceptances<br />

• Each group is weighted when fitting model<br />
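The grouping step can be sketched as follows: individual offers are collected into groups, the response for each group is its proportion of acceptances, and each group carries a weight. Using the number of offers in the group as the weight is an assumption made here (a common choice); the slide does not say how the weights were chosen. The records and group keys are invented for illustration.

```python
from collections import defaultdict

# Hypothetical offer-level records:
# ((school-leaver group, preference number), accepted?)
offers = (
    [(("ACT", 1), True)] * 3 + [(("ACT", 1), False)] * 1
    + [(("ACT", 2), True)] * 1 + [(("ACT", 2), False)] * 1
    + [(("non-ACT", 1), True)] * 1 + [(("non-ACT", 1), False)] * 4
)


def summarise_groups(records):
    """Return the acceptance proportion and weight for each group."""
    counts = defaultdict(lambda: [0, 0])  # key -> [offers, acceptances]
    for key, accepted in records:
        counts[key][0] += 1
        counts[key][1] += int(accepted)
    return {
        # Assumed weighting: the number of offers in the group.
        key: {"weight": n_offers, "proportion": n_accepted / n_offers}
        for key, (n_offers, n_accepted) in counts.items()
    }


groups = summarise_groups(offers)
for key, summary in sorted(groups.items()):
    print(key, summary)
```

Weighting by group size stops a proportion based on two offers from counting as much as one based on two hundred.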



Detailed example – acceptance rates<br />

• Simplified model equation:<br />

$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_6 X_1 X_2 + \varepsilon$<br />

$X_1$ is a binary variable identifying ACT school-leavers<br />

$X_2$ represents 3 variables describing preference number<br />

$X_3$ identifies current and prior year school leavers<br />

$X_4$ represents 6 variables for different groups of courses<br />

The last term is an interaction term between preference number and ACT school-leaver<br />
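To see what the interaction term does, it helps to write the model out separately for the two values of the binary variable. For simplicity this treats the preference-number variable as a single variable, which is an assumption made only for this illustration.

```latex
\begin{aligned}
X_1 = 0:\quad & Y = \alpha + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \varepsilon \\
X_1 = 1:\quad & Y = (\alpha + \beta_1) + (\beta_2 + \beta_6) X_2 + \beta_3 X_3 + \beta_4 X_4 + \varepsilon
\end{aligned}
```

So the effect of preference number is $\beta_2$ for other applicants and $\beta_2 + \beta_6$ for ACT school-leavers. Without the interaction term the two groups would be forced to share the same preference effect.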



Detailed example – acceptance rates<br />

• Mostly additive model<br />

• Includes one interaction term<br />

• Preference number with ACT school-leaver<br />

• Many iterations to develop<br />

• More refinements are possible<br />



Thank you<br />

• Questions?<br />
