11.07.2015 Views

Multilevel modelling and time series analysis in ... - ERSO - Swov

Multilevel modelling and time series analysis in ... - ERSO - Swov

Multilevel modelling and time series analysis in ... - ERSO - Swov

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

Deliverable D7.5<strong>Multilevel</strong> <strong>modell<strong>in</strong>g</strong> <strong>and</strong> <strong>time</strong> <strong>series</strong><strong>analysis</strong> <strong>in</strong> traffic safety research –ManualDupont, E. <strong>and</strong> Martensen, H. (Eds.) (2007) <strong>Multilevel</strong> <strong>modell<strong>in</strong>g</strong> <strong>and</strong> <strong>time</strong><strong>series</strong> <strong>analysis</strong> <strong>in</strong> traffic research – Manual. Deliverable D7.5 of the EU FP6project SafetyNet.Contract No: TREN-04-FP6TR-S12.395465/506723Acronym: SafetyNetTitle: Build<strong>in</strong>g the European Road Safety ObservatoryIntegrated Project, Thematic Priority 6.2 “Susta<strong>in</strong>able Surface Transport”Project Co-ord<strong>in</strong>ator:Professor Pete ThomasVehicle Safety Research CentreErgonomics <strong>and</strong> Safety Research InstituteLoughborough UniversityHolywell Build<strong>in</strong>gLoughboroughLE11 3UZOrganisation name of lead contractor for this deliverable:Belgian Road Safety Institute (IBSR)Due Date of Deliverable: 28/02/2007Submission Date:Editors: E. Dupont & H. Martensen (IBSR)Contribut<strong>in</strong>g Authors: A. Angermann (KfV), C. Antoniou (NTUA) R. Bergel(INRETS), E. Berends (SWOV), F. Bijleveld (SWOV), C. Br<strong>and</strong>stätter (KfV),M. Cherfi (INRETS), C. de Blois (SWOV), E. Dupont (IBSR), M. Gatscha(KfV), H. Martensen (IBSR), E. Papadimitriou (NTUA), G. Yannis (NTUA)Project Start Date: 1st May 2004Duration: 4 yearsPUProject co-funded by the European Commission with<strong>in</strong> the Sixth Framework Programme (2002 -2006)Dissem<strong>in</strong>ation LevelPublicP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1


Table of ContentsEXECUTIVE SUMMARY.................................................................................... 4CHAPTER 1 - INTRODUCTION ........................................................................ 5CHAPTER 2 - MULTILEVEL MODELLING....................................................... 82.1 Introduction........................................................................................................... 82.2 <strong>Multilevel</strong> l<strong>in</strong>ear regression models.................................................................... 112.2.1 Basic two level r<strong>and</strong>om <strong>in</strong>tercept <strong>and</strong> r<strong>and</strong>om slope models............................ 112.2.2 Three level models <strong>and</strong> more.......................................................................... 262.3 Discrete response models .................................................................................... 322.3.1 Introduction.................................................................................................... 322.3.2 B<strong>in</strong>ary <strong>and</strong> general b<strong>in</strong>omial responses........................................................... 322.3.3 Mult<strong>in</strong>omial responses.................................................................................... 402.3.4 Counts............................................................................................................ 522.4 Longitud<strong>in</strong>al data................................................................................................ 722.5 Multivariate models ............................................................................................ 852.6 Structural equations models ............................................................................. 1012.7 More complex data structures.......................................................................... 1012.8 Bayesian estimation <strong>in</strong> multivlevel <strong>modell<strong>in</strong>g</strong> .................................................. 101CHAPTER 3 - TIME SERIES ANALYSIS ...................................................... 1023.1 Introduction to <strong>time</strong> <strong>series</strong> models ................................................................... 1023.2 Classical l<strong>in</strong>ear <strong>and</strong> non-l<strong>in</strong>ear regression models........................................... 1053.2.1 Classical l<strong>in</strong>ear regression models ................................................................ 1053.2.2 Generalized l<strong>in</strong>ear models (GLM) ................................................................ 1303.2.3 Non-l<strong>in</strong>ear models........................................................................................ 1303.3 Dedicated <strong>time</strong> <strong>series</strong> <strong>analysis</strong> <strong>in</strong> road safety research .................................... 1303.4 ARMA-type models........................................................................................... 1313.4.1 Introduction.................................................................................................. 1313.4.2 ARIMA models for stationary <strong>series</strong> (simulated data)................................... 1343.4.3 ARIMA models for non seasonal <strong>series</strong> (Norwegian Fatalities) .................... 1343.4.4 ARIMA models for seasonal <strong>series</strong> (UK-KSI Drivers).................................. 1543.4.5 Conclusion ARMA-type models................................................................... 1833.5 DRAG models.................................................................................................... 184


3.6 State space models............................................................................................. 1853.6.1 Introduction.................................................................................................. 1853.6.2 Local level model......................................................................................... 1873.6.3 Local l<strong>in</strong>ear trend model............................................................................... 2163.6.4 Local l<strong>in</strong>ear trend plus seasonal model ......................................................... 2323.6.5 Intervention variables ................................................................................... 2433.6.6 Explanatory variables ................................................................................... 2523.6.7 Conclusion state space models...................................................................... 265CHAPTER 4 - CONCLUSION........................................................................ 266P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 3


Executive SummaryThe SafetyNet project is set up to build a European Road Safety Observatory.The data assembled or gathered for the observatory consist of the Communitydatabase on Accidents on the Roads <strong>in</strong> Europe (CARE); data on road safetyrisk <strong>in</strong>dicators; data on road safety performance <strong>in</strong>dicators <strong>and</strong> <strong>in</strong>-depthaccident data. Potential users will l<strong>in</strong>k data from different data-sets, considerdifferent levels of aggregation jo<strong>in</strong>tly, <strong>and</strong> analyse the development over <strong>time</strong>.Work package 7 (WP7) is set up to deal with statistical <strong>and</strong> conceptual issuesthat come <strong>in</strong>to play when analys<strong>in</strong>g such complex data structures.One of WP7’s ma<strong>in</strong> objectives is to develop a best practice advice for the<strong>analysis</strong> of data structures that require more than the st<strong>and</strong>ard statistical tools.This best practice consists of D7.4 “<strong>Multilevel</strong> <strong>modell<strong>in</strong>g</strong> <strong>and</strong> <strong>time</strong> <strong>series</strong><strong>analysis</strong> <strong>in</strong> traffic research – A methodology” <strong>and</strong> D7.5 “<strong>Multilevel</strong> <strong>modell<strong>in</strong>g</strong> <strong>and</strong><strong>time</strong> <strong>series</strong> <strong>analysis</strong> <strong>in</strong> traffic research – The manual”.The ma<strong>in</strong> goal is to enable the reader to deal with complex data structures thatshow dependencies <strong>in</strong> space (nested data) or <strong>in</strong> <strong>time</strong> (<strong>time</strong> <strong>series</strong> data). At firstit is demonstrated how such dependencies can compromise the applicability ofst<strong>and</strong>ard methods of statistical <strong>in</strong>ferences, because they can lead to anunderestimation of the st<strong>and</strong>ard error <strong>and</strong> consequently of the error <strong>in</strong> statisticaltests.As a solution to this problem, two families of statistical techniques are presentedto deal with these dependencies. <strong>Multilevel</strong> Modell<strong>in</strong>g is dedicated to the<strong>analysis</strong> of data that are structured hierarchically. It offers the possibility to<strong>in</strong>clude hierarchical structures <strong>in</strong>to the model of <strong>analysis</strong>. In road safetyresearch, multilevel analyses allow for the <strong>in</strong>troduction of exposure data <strong>and</strong> ofsafety performance <strong>in</strong>dicators, even if those are not specified at the same levelof disaggregation as the accident data themselves. In this way, multilevelanalyses allow a global <strong>and</strong> detailed approach simultaneously. Time <strong>series</strong>analyses are employed to overcome dependency issues <strong>in</strong> <strong>time</strong>-related data.They allow describ<strong>in</strong>g the development over <strong>time</strong>, relat<strong>in</strong>g the accidentoccurrencesto explanatory factors such as exposure measures or safetyperformance<strong>in</strong>dicators (e.g., speed<strong>in</strong>g, seatbelt-use, alcohol, etc), <strong>and</strong>forecast<strong>in</strong>g the development <strong>in</strong>to the near future.Deliverable 7.5 conta<strong>in</strong>s the manual to support the methodology D7.4, wherethe theoretical background for these two families of analyses is given. For eachtechnique described <strong>in</strong> the methodology, this manual presents the <strong>in</strong>structionsto fit the models on the basis of user friendly software, as well as guidel<strong>in</strong>es for<strong>in</strong>terpret<strong>in</strong>g the results. The aim of the manual is to enable the reader toconduct all analyses described <strong>in</strong> the methodology <strong>and</strong> this way to get h<strong>and</strong>s onexperience <strong>in</strong> the <strong>analysis</strong> of road safety data. To enable the reader to trackevery step presented, the data sets discussed <strong>in</strong> the various sections areavailable.


Chapter 1 - IntroductionHeike Martensen <strong>and</strong> Emmanuelle Dupont (IBSR)This deliverable has been produced <strong>in</strong> Workpackage 7 (WP7) of the SafetyNetproject. WP7 is set up to deal with statistical <strong>and</strong> conceptual issues that come<strong>in</strong>to play when analys<strong>in</strong>g complex data structures as they arise <strong>in</strong> road safetyresearch when comb<strong>in</strong><strong>in</strong>g data from different sources or when consider<strong>in</strong>g datathat have been collected over a long <strong>time</strong>span. One of its ma<strong>in</strong> objectives is thedevelopment of a best practice for the <strong>analysis</strong> of data structures that requiremore than the st<strong>and</strong>ard statistical tools.This best practice consists of D7.4 “<strong>Multilevel</strong> <strong>modell<strong>in</strong>g</strong> <strong>and</strong> <strong>time</strong> <strong>series</strong><strong>analysis</strong> <strong>in</strong> traffic research – A methodology” (subsequently, simply “themethodology-report”) <strong>and</strong> the present deliverable. This document conta<strong>in</strong>s thepractical <strong>in</strong>structions to support the methodology, where it has been describedhow to deal with data that are dependent <strong>in</strong> space (nested data) or <strong>in</strong> <strong>time</strong> (<strong>time</strong><strong>series</strong> data). It has been demonstrated how such dependencies cancompromise the applicability of st<strong>and</strong>ard methods of statistical <strong>in</strong>ferences,because they lead to an underestimation of the st<strong>and</strong>ard error <strong>and</strong>consequently of the probability to classify a result as significant that is <strong>in</strong> factdue to chance.Two families of statistical techniques have been presented to deal with thesedependencies. <strong>Multilevel</strong> Modell<strong>in</strong>g is dedicated to the <strong>analysis</strong> of data that arestructured hierarchically <strong>and</strong> Time Series analyses are employed to overcomedependency issues <strong>in</strong> <strong>time</strong>-related data. The methodology is organized <strong>in</strong> twoma<strong>in</strong> chapters, focuss<strong>in</strong>g on multilevel <strong>modell<strong>in</strong>g</strong> (Chapter 2) <strong>and</strong> <strong>time</strong> <strong>series</strong><strong>analysis</strong> (Chapter 3) respectively.For those sections <strong>in</strong> the methodology where models dedicated to multilevel<strong>analysis</strong> or to <strong>time</strong> <strong>series</strong> <strong>analysis</strong> are presented, this manual presents the<strong>in</strong>structions to fit each model on the basis of user friendly software, as well asguidel<strong>in</strong>es for <strong>in</strong>terpret<strong>in</strong>g the results. The aim of this document is to enable thereader to conduct all analyses described <strong>in</strong> the methodology <strong>and</strong> this way to geth<strong>and</strong>s-on experience <strong>in</strong> the <strong>analysis</strong> of road safety data. To enable the readerto track every step presented, the data sets discussed <strong>in</strong> the various sectionsare available. The data are <strong>in</strong>cluded as a CD <strong>and</strong> will be available at theSafetyNet website (www.erso.en/safetynet.htm).This manual is not a st<strong>and</strong>-alone document. It is <strong>in</strong>timately related to themethodology <strong>and</strong> its sections were written under the assumption that therespective part of the methodology report is known. To allow an easy match<strong>in</strong>gof methodology report <strong>and</strong> manual sections, the number<strong>in</strong>g <strong>in</strong> the manual is thesame as that <strong>in</strong> the methodology report. Some sections <strong>in</strong> the methodologyreport, however, do not conta<strong>in</strong> data examples or the models presented employtraditional techniques rather than multilevel or dedicated <strong>time</strong> <strong>series</strong> models.For these latter sections there is no counterpart <strong>in</strong> this manual. As aconsequence, some sections <strong>in</strong> the manual are rather short, their ma<strong>in</strong> purpose


Chapter 2be<strong>in</strong>g, to allow the number<strong>in</strong>g to cont<strong>in</strong>ue <strong>in</strong> the same way as <strong>in</strong> themethodology report. In Figures 1.1 <strong>and</strong> 1.2, the structure of Chapters 2 <strong>and</strong> 3 ispresented. Sections that are represented by a colourless box are present <strong>in</strong> themethodology report, but there is no correspond<strong>in</strong>g example <strong>in</strong> this manual.Figure 1.1: Structure of multilevel models presented <strong>in</strong> Chapter 2. Note: Sectionsrepresented <strong>in</strong> white are present <strong>in</strong> the methodology report but have no correspond<strong>in</strong>gexample <strong>in</strong> this manual.Chapter 2 starts with a short description of the pr<strong>in</strong>ciples of multilevel <strong>modell<strong>in</strong>g</strong><strong>and</strong> a software overview <strong>in</strong> 2.1. Section 2.2 is dedicated to <strong>modell<strong>in</strong>g</strong> ofcont<strong>in</strong>uous responses <strong>and</strong> section 2.3 to the <strong>modell<strong>in</strong>g</strong> of discrete responses.Section 2.4 presents an example for a multivariate model <strong>and</strong> section 2.5 for amodel for longitud<strong>in</strong>al data.


IntroductionFigure 1.2: Structure of multilevel models presented <strong>in</strong> Chapter 3. Note: Sectionsrepresented <strong>in</strong> white are present <strong>in</strong> the methodology report but have no correspond<strong>in</strong>gexample <strong>in</strong> this manual.Chapter 3 starts with a short <strong>in</strong>troduction to <strong>time</strong> <strong>series</strong> <strong>analysis</strong> <strong>and</strong> a softwareoverview (3.1). Section 3.2 describes traditional regression analyses models.Traditional regression analyses models were chosen because they are probablythe best known type of model, <strong>and</strong> are often used <strong>in</strong> the <strong>time</strong> <strong>series</strong> context.Special attention is paid to diagnostic tools that serve to detect possibleviolations of the assumption when deal<strong>in</strong>g with <strong>time</strong> <strong>series</strong> data <strong>and</strong> thepossibilities to solve these problems with<strong>in</strong> the traditional framework. In sections3.4 to 3.6 of the methodology report models dedicated to <strong>time</strong> <strong>series</strong> analysesare presented. In the end, these models can be categorised <strong>in</strong>to two classes,one group, <strong>in</strong>clud<strong>in</strong>g DRAG type modes, that can be seen as variants of socalledARMA-type models <strong>and</strong> another group of decomposition models that canbe regarded as members of state space models. In this manual there are twoextensive sections on ARMA-type models (3.5) <strong>and</strong> on state space models (3.6)respectively. Both conta<strong>in</strong> many empirical examples <strong>and</strong> detailed <strong>in</strong>structionsfor their implementation.Chapter 4 presents an overview of the methods presented <strong>and</strong> the examplesused.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 7


Chapter 2 - <strong>Multilevel</strong> Modell<strong>in</strong>g2.1 IntroductionHeike Martensen <strong>and</strong> Emmanuelle Dupont (IBSR)As described <strong>in</strong> more detail <strong>in</strong> the methodology report, <strong>in</strong> traditional regressionanalyses a dependent variable (y) is predicted by a comb<strong>in</strong>ation of one or more<strong>in</strong>dependent variables (x 1 , x 2 , …), such that y can be modelled by equation2.1.1.yi= b0 + b1x1+ b2x2+ ... + e i(2.1.1)with i be<strong>in</strong>g the <strong>in</strong>dex of the subjects of study (e.g. accidents, persons, etc.).As examples, <strong>in</strong>jury severity <strong>in</strong> an accident can be predicted by the speed ofcollision or accident frequency can be predicted by the number of alcoholcontrols <strong>and</strong> the number of speed <strong>in</strong>fr<strong>in</strong>gements. Of course, these predictionsare never perfect. Everyth<strong>in</strong>g that is not predicted is assumed to be due to ther<strong>and</strong>omly distributed error ei.One of the most important assumptions upon which the traditional analyses arebased is the <strong>in</strong>dependence assumption, stat<strong>in</strong>g that the residuals, the ei’s, are<strong>in</strong>dependently distributed across all units. Hierarchical structures or nested dataoften cause the <strong>in</strong>dependence assumption to be violated. In hierarchies, thecases with<strong>in</strong> one group are often more similar to each other than the cases <strong>in</strong>another group. These hierarchical structures have to be represented <strong>in</strong> themodel of <strong>analysis</strong>, because otherwise the residuals (the variation that cannot beexpla<strong>in</strong>ed by the model) will show the same structure <strong>and</strong> will therefore not be<strong>in</strong>dependently distributed. Examples for such hierarchies are presented <strong>in</strong> therema<strong>in</strong>der of the document. To name just a few: In section 2.2 speed data arepresented that are collected at a number of r<strong>and</strong>omly selected road sites. Thespeed of cars at the same road site is jo<strong>in</strong>tly <strong>in</strong>fluenced by a large number offactors <strong>and</strong> therefore cars at the same road site are more similar <strong>in</strong> speed thanbetween different road sites. In sections 2.3.2 <strong>and</strong> 2.3.3 data on driv<strong>in</strong>g underalcohol <strong>in</strong>fluence are presented. Aga<strong>in</strong> the probabilities of hav<strong>in</strong>g drunk aremore similar for drivers at the same road site as compared to drivers at differentroad sites. In sections 2.3.4 <strong>and</strong> 2.4 the fatalities for counties <strong>in</strong> Greece arepresented <strong>and</strong> it is demonstrated that the numbers of fatalities as well as theeffect of certa<strong>in</strong> measures (alcohol <strong>and</strong> speed controls) vary across regions.<strong>Multilevel</strong> <strong>modell<strong>in</strong>g</strong> offers the possibility to <strong>in</strong>clude hierarchical structures <strong>in</strong>tothe model of <strong>analysis</strong> by allow<strong>in</strong>g r<strong>and</strong>om variation at each level of the model.<strong>Multilevel</strong> models also allow the effect of predictor variables to vary acrosshigher level units. In the present chapter, multilevel models are presented forcont<strong>in</strong>uous data (section 2.2), dichotomous data (section 2.3.2), <strong>and</strong> count data(section 2.3.4). Moreover it is demonstrated how multilevel models can be used


2.1 Introduction multilevel <strong>modell<strong>in</strong>g</strong>to establish a multivariate data structure (section 2.4) that can also be used formult<strong>in</strong>omial responses (section 2.3.3) <strong>and</strong> repeated measurement data (section2.5).The multilevel models are all implemented with the MLwiN software (Rasbash,Steele, Brown, & Prosser, 2004, www.mlw<strong>in</strong>.com), dedicated to multilevel<strong>modell<strong>in</strong>g</strong>. The reason that this software was chosen is its high educationalvalue. In MLwiN (Rasbash et al., 2004), the model formulation is menu-based<strong>and</strong> can therefore be mastered easily without study<strong>in</strong>g a programm<strong>in</strong>glanguage. The analyses are presented <strong>in</strong> the form of model equations, allow<strong>in</strong>ga good underst<strong>and</strong><strong>in</strong>g of the model built. Another advantage of this software isthe presence of diagnostic methods tailored to multilevel <strong>modell<strong>in</strong>g</strong>. Mostnotably, residuals can be studied at each of the levels <strong>in</strong>cluded <strong>in</strong> the model.The program also has excellent plott<strong>in</strong>g functions with an <strong>in</strong>terface that is easyto use, encourag<strong>in</strong>g a thorough <strong>in</strong>spection of raw data, model predictions, <strong>and</strong>residuals. The output of the <strong>analysis</strong> is also presented <strong>in</strong> the framework ofmodel formulation: the parameters <strong>in</strong> the model equations are simply replacedby their estimates. This presentation allows maximal underst<strong>and</strong><strong>in</strong>g of the roleof each parameter, <strong>and</strong> of its possible <strong>in</strong>terpretation.The downside of this very educational <strong>in</strong>terface is its impracticality: no tablesare provided as output, the text <strong>in</strong> the Equations w<strong>in</strong>dow cannot be copied;there is no way to export the result<strong>in</strong>g estimations but to simply type them over.The program is <strong>in</strong> fact so educational that it forces the user to conducthim/herself many of the calculations necessary for <strong>in</strong>terpretation (variancepartition coefficients, test statistics). This policy of not allow<strong>in</strong>g the user tosimply take some output without underst<strong>and</strong><strong>in</strong>g how it came about, can becomevery tedious once one has passed the <strong>in</strong>itial phase of try<strong>in</strong>g to underst<strong>and</strong> themodels <strong>and</strong> that one simply wants to carry out some rout<strong>in</strong>e analyses.HLM (Bryk, Raudenbush, & Congdon, 1996) is also a special purpose statisticalpackage that will fit many k<strong>in</strong>ds of multilevel models. It has been under activedevelopment s<strong>in</strong>ce the mid 1980s <strong>and</strong> is now distributed by Scientific SoftwareInternational (SSI, www.ssicentral.com).The MIX project (Hedeker & Gibbons, 1996 a, b) is a collection of programs formultilevel techniques, <strong>in</strong>clud<strong>in</strong>g mixed-effects l<strong>in</strong>ear regression, mixed-effectslogistic regression for nom<strong>in</strong>al or ord<strong>in</strong>al outcomes, mixed-effects probitregression for ord<strong>in</strong>al outcomes, mixed-effects Poisson regression, <strong>and</strong> mixedeffectsgrouped-<strong>time</strong> survival <strong>analysis</strong>. The programs can be downloaded fromhttp://tigger.uic.edu/~hedeker/mix.html.WINBUGS is a software that uses Bayesian estimation algorithms (MCMC, seesection 2.7.2 <strong>in</strong> the Methodology report). Models are represented by a flexiblelanguage. Additionally it allows the user to specify their model on a graphical<strong>in</strong>terface.<strong>Multilevel</strong> <strong>modell<strong>in</strong>g</strong> can also be carried out <strong>in</strong> R (http://cran.r-project.org) or itscommercial version S-Plus (www.<strong>in</strong>sightful.com/). These programs allow mostP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 9


Chapter 2of the functions present <strong>in</strong> MLwiN but without its easily accessible user<strong>in</strong>terface.The st<strong>and</strong>ard statistical software packages allow multilevel <strong>modell<strong>in</strong>g</strong> to someextent. Most notably SAS, allows the estimation of all models presented <strong>in</strong> thisdocument (Littell, Milliken, Stroup, <strong>and</strong> Wolf<strong>in</strong>ger, 1996). However, it does notenable the detailed diagnostics tailored to these models. SPSS only allows theestimation of l<strong>in</strong>ear multilevel models. An excellent collection of reviews how toimplement multilevel models <strong>in</strong> a wide variety of statistical software can befound on the MLwiN website (www.mlw<strong>in</strong>.com/softrev/<strong>in</strong>dex.html).With<strong>in</strong> the present chapter on multilevel <strong>modell<strong>in</strong>g</strong>, there is a build-up of<strong>in</strong>formation about the use of MLwiN. The chapters on l<strong>in</strong>ear models form an<strong>in</strong>troduction to MLwiN <strong>and</strong> some of its possibilities as well. They conta<strong>in</strong> verydetailed <strong>in</strong>formation how to address the functions <strong>and</strong> how to <strong>in</strong>terpret theoutput. In later chapters this <strong>in</strong>formation is more compressed. This all said, thefocus of this document is not to learn to deal with MLwiN (for all details thereader is referred to the MLwiN manual by Rasbash, Steel, Brown, & Prosser,2004) but to get a practical <strong>in</strong>troduction to the multilevel <strong>analysis</strong> of road safetydata.


2.2 <strong>Multilevel</strong> l<strong>in</strong>ear regression modelsHeike Martensen <strong>and</strong> Emmanuelle Dupont (IBSR)2.2.1 Basic two level r<strong>and</strong>om <strong>in</strong>tercept <strong>and</strong> r<strong>and</strong>om slopemodelsThe example data used <strong>in</strong> this <strong>and</strong> the follow<strong>in</strong>g section are based on a nationalspeed survey conducted <strong>in</strong> Belgium. The speed of 4994 cars was “measured” at131 r<strong>and</strong>omly selected road sites. Additionally, the length of each car wasrecorded. The question pursued here, is whether there is a relation between thespeed <strong>and</strong> the length of the car (considered here as rough <strong>in</strong>dicator of itseng<strong>in</strong>e power). The data conta<strong>in</strong> the follow<strong>in</strong>g variables:IDlocation Identifies the road site (i.e. the location)IDsubject Identifies the subjects, i.e. the <strong>in</strong>dividual cars with<strong>in</strong> eachlocationSpeed Indicates the speed of that carLength Indicates the length of that carLengthCentred Indicates the length of that car m<strong>in</strong>us the average lengthLengthCat Indicates whether a car is shorter (0) or longer (1) than 4.3mTrafficCount Indicates the number of cars pass<strong>in</strong>g a road site dur<strong>in</strong>gmeasurementTrafficCountCat Indicates whether fewer (0) or more (1) than 100 cars passedProv<strong>in</strong>ce The Belgian Prov<strong>in</strong>ce <strong>in</strong> which measurement has taken placeData loadTo beg<strong>in</strong> with, we will import the data from Excel.• Open MLwiN• Open the data file (SPEED.xls) <strong>in</strong> Excel• Select all columns (ctrl A) <strong>and</strong> copy them (ctrl C) <strong>in</strong> Excel• Go to MLwiN, press Ctrl VThe follow<strong>in</strong>g w<strong>in</strong>dow appears:


Chapter 2• Check “Use First row as names” <strong>in</strong> the lower left corner of the Paste ViewW<strong>in</strong>dow• Click on “Free Columns” (this assigns the first free columns <strong>in</strong> MLwiN to theimported data)• Click on “Paste”• Close the paste w<strong>in</strong>dow• Select “File” <strong>in</strong> the top menu bar <strong>and</strong> save the worksheet you just createdThe centre of MLwiN is the Equations w<strong>in</strong>dow, where the models that you wantto fit to the data are built.• Click on “Model” <strong>in</strong> the menu bar <strong>and</strong> select “Equation”• Click on “Notation” at the bottom of the Equations w<strong>in</strong>dow• Uncheck “General” (this changes the notation from the General-l<strong>in</strong>ear-modelnotation to the l<strong>in</strong>ear-model notation)• Click “Done”2.2.1.1. An “empty” s<strong>in</strong>gle-level modelThe first model built is one that ignores the hierarchical structure <strong>in</strong> the data. It<strong>in</strong>cludes only one level, that of the <strong>in</strong>dividual cars (IDsubject). This model,conta<strong>in</strong><strong>in</strong>g only an <strong>in</strong>tercept <strong>and</strong> no predictors will be used as a po<strong>in</strong>t ofreference.Model formulationDef<strong>in</strong>e the dependent variable:• Click on the “y” <strong>and</strong> Select “Speed” from the drop-down menu as dependentvariable Select “1-i”from the N-levels drop-down menu Select “ID-subject” from the level 1(j) drop-down menu


2.2 L<strong>in</strong>ear multilevel models Click “done”The specified model <strong>in</strong>cludes only one r<strong>and</strong>om-term (e i ) <strong>and</strong> so far only an<strong>in</strong>tercept. If your Equations w<strong>in</strong>dow does not look like this,click “Estimates” at the bottom of the Equations w<strong>in</strong>dow. This changes back <strong>and</strong>forth between three views:- the parameter names (e.g. β 0 )- the parameter names <strong>in</strong> colour cod<strong>in</strong>g- the estimates for each parameter with their St<strong>and</strong>ard Errors <strong>in</strong>parenthesis.Colour cod<strong>in</strong>g of the parameters <strong>in</strong>dicates their status:- red: not yet specified- blue: specified but not yet estimated- green: estimation is complete• Click “Estimates” until you see blue numbers <strong>in</strong> the equation.• Press “Start” <strong>in</strong> the upper left corner of the MLwiN W<strong>in</strong>dow to start theestimation procedureThe Estimation is concluded when the numbers turn from blue to green.Results <strong>and</strong> InterpretationIn this simple model, only two parameters are estimated (you can tell, becausethere are only two green numbers): the <strong>in</strong>tercept, here simply the overall meanof Speed, <strong>and</strong> σ 2 e the variance of the <strong>in</strong>dividual error-term e i . The error e idenotes the derivation of each <strong>in</strong>dividual (i.e. the cars) from the model, heresimply the variance of the complete sample. Beh<strong>in</strong>d each parameter estimate,its st<strong>and</strong>ard error is <strong>in</strong>dicated <strong>in</strong> parenthesis. To be significant, a parameterestimate has to be at least twice as large as its st<strong>and</strong>ard error. (To be moreP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 3


Chapter 2exact, the parameter estimate divided by its st<strong>and</strong>ard error is z-distributed. A Z-value of 1.96 <strong>in</strong>dicates a two-tailed probability of 0.05). The third l<strong>in</strong>e of theoutput shows the deviance of the model (the -2loglikelihood). This value<strong>in</strong>dicates how well the model fits the data. The smaller it is, the better the modelfits. It is used to compare models. We will come back to that later.A practical tip: MLwiN makes build<strong>in</strong>g different models very easy. It does,however, not allow to eyeball the results for two different models <strong>in</strong> parallel. Asa solution we suggest to use a view<strong>in</strong>g software like Irfan (freeware) to produce<strong>and</strong> save screenshots of the models you build: Start IrfanViewer, click onOptions <strong>in</strong> the menu-bar <strong>and</strong> select Capture/Screenshots. Press “Start” <strong>and</strong>close the w<strong>in</strong>dow (but not the program!). Now go to the equations-w<strong>in</strong>dow ofMLwiN <strong>and</strong> press Ctrl-F11. Irfan now keeps a picture of your model <strong>and</strong>overwrites it (unless you save it) when you press Ctrl-F11 aga<strong>in</strong>.2.2.1.2. An “empty” two-level modelValues measured at the same location can be expected to be more similar toeach other than to values measured at different locations. To <strong>in</strong>clude thishierarchical structure <strong>in</strong> the model, we will now def<strong>in</strong>e a two-level model with thecars (ID-subject) constitut<strong>in</strong>g the first level <strong>and</strong> the road sites (ID-location)constitut<strong>in</strong>g the second.Model formulationFirst def<strong>in</strong>e the dependent variable (Speed) as vary<strong>in</strong>g over two r<strong>and</strong>om factors,namely the <strong>in</strong>dividual cars (IDsubject) <strong>and</strong> the location of measurement(IDlocation)• Click on the dependent variable <strong>and</strong> def<strong>in</strong>e IDlocation as the second level asshown below.Then def<strong>in</strong>e a variance component model by allow<strong>in</strong>g the <strong>in</strong>tercept to varyr<strong>and</strong>omly across locations.• Click on the <strong>in</strong>tercept• Check the box j(IDlocation)• Press “Done”• Press “Start” to estimate the parameters


2.2 L<strong>in</strong>ear multilevel modelsResults <strong>and</strong> InterpretationYour Equations w<strong>in</strong>dow should now look like this.The model has two parts; the first equation corresponds to the <strong>in</strong>dividual carlevel (Level 1), the second equation to the level of the locations (Level 2).Instead of an overall <strong>in</strong>tercept β 0 , the <strong>in</strong>tercept β 0j varies across locations. Themean value (68.688), <strong>in</strong>dicated <strong>in</strong> the second equation, gives the mean speedacross all road sites.Two error-variances are estimated: σ 2 u0 is the variance of u oj , the location errorterm <strong>in</strong> the second equation. This location error-term u oj <strong>in</strong>dicates the derivationof each location-<strong>in</strong>tercept from the mean <strong>in</strong>tercept estimated <strong>in</strong> the secondequation. σ 2 e is the variance of e ij , the <strong>in</strong>dividual error term <strong>in</strong> the first equation.σ 2 u0, the variation between locations, is highly significant <strong>and</strong> much larger thanσ 2 e, the variation with<strong>in</strong> locations. The variance partition coefficient ( σ 2 u0/(σ 2 u0+ σ 2 e)) is .75 <strong>in</strong>dicat<strong>in</strong>g that 75% of the total variance is due to variationsbetween the level-two units (here the locations).The deviance is now by a factor 10 smaller than that of the s<strong>in</strong>gle-level model.The difference between the two deviances (<strong>in</strong> this case 464737) is Chi-squaredistributed with the difference <strong>in</strong> numbers of parameters as degrees of freedom(<strong>in</strong> this case one, because the two-level model estimates one parameter morethan the one-level model). As a guidel<strong>in</strong>e, the expected value for a Chi-squareddistribution is equal to the degrees of freedom, so there is little doubt that464737 significantly exceeds this value. For a formal check, click on “DataManipulation” <strong>in</strong> the top menu bar <strong>and</strong> select “Tail Areas”, select “Chi-squared”,fill the X 2 -value (464737) <strong>and</strong> the degrees of freedom (1) <strong>in</strong> <strong>and</strong> press“Calculate”.To conclude, both the variance partition coefficient <strong>and</strong> the deviance test bothstrongly suggest that a two-level structure describes the data more adequatelythan a s<strong>in</strong>gle-level structure.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 5


Chapter 2Graphic <strong>in</strong>spection of the residuals: Level 1To <strong>in</strong>spect the residuals of the model,• Click on Model <strong>in</strong> the top menu bar• Select Residuals• Click “Calc” to calculate the Level 1 residuals• Select “Plot” at the top of the Residuals w<strong>in</strong>dow• Check radio-button <strong>in</strong> front of “st<strong>and</strong>ardised residual <strong>and</strong> normal score”• Click “Apply”Graphic <strong>in</strong>spection of the residuals: Level 2• Go back to the “Sett<strong>in</strong>gs” part of the residuals w<strong>in</strong>dow• Select “2:IDlocation” from the “level” dropdown list


2.2 L<strong>in</strong>ear multilevel models• Click “Calc” to calculate the Level 2 residuals• Select “Plot” at the top of the Residuals w<strong>in</strong>dow• Click “Apply”Normally distributed residuals would result <strong>in</strong> straight l<strong>in</strong>es. Obviously this is notthe case. A log-transformation (lnSpeed) could help normalize the speeddistribution. The transformed value would, however, make the <strong>in</strong>terpretationmore difficult. Therefore, for the sake of clarity of <strong>in</strong>terpretation the nontransformedspeed variable is kept here. As an exercise, the reader is advised,however, to repeat the analyses presented here with the log-transformed speedvariable (lnSpeed) as an exercise.2.2.1.3. A Two-level variance component model with predictor lengthNext, <strong>in</strong>clude the length of a car as a predictor for its speed. Rather than<strong>in</strong>clud<strong>in</strong>g the absolute length, <strong>in</strong>clude LenthCentred, the length of the carcentred to its mean.Model formulation• Click “Add Term” at the bottom of the Equations w<strong>in</strong>dow• Select “LengthCentred” from the “Variable” drop-down w<strong>in</strong>dow• Click “Done”• To estimate this model press “Start”P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 7


Chapter 2Results <strong>and</strong> InterpretationThe <strong>in</strong>tercept β 0j now presents the average speed at LengthCentred = 0 (i.e. fora car of average length) <strong>and</strong> the coefficient <strong>in</strong> front of LengthCentred <strong>in</strong>dicatesits slope: the change <strong>in</strong> speed per unit of length (here meters). The deviance (-2*loglikelihood) decreased by 70, i.e. the <strong>in</strong>troduction of car length as apredictor significantly improved the model.2.2.1.4. Two-level r<strong>and</strong>om <strong>in</strong>tercept, r<strong>and</strong>om slope modelTo <strong>in</strong>vestigate whether the relation between speed <strong>and</strong> the length of a car wasthe same at all measurement locations, the slope of Length will now be allowedto vary r<strong>and</strong>omly across locations too.Model formulation• Click on the Length• Check the box j (IDlocation)• Estimate the parameters by click<strong>in</strong>g on “Start”Results <strong>and</strong> Interpretation


2.2 L<strong>in</strong>ear multilevel modelsThe model now has three parts; the first equation specifies the level of the<strong>in</strong>dividual cars <strong>and</strong> the other two the level of the locations. Both, the <strong>in</strong>tercept β 0j<strong>and</strong> the coefficient of Length β 1j are now vary<strong>in</strong>g across locations with themeans <strong>in</strong>dicated <strong>in</strong> the second <strong>and</strong> third equation.The location variance has now become a variance-covariance matrix Ω u : Theupper left number is σ 2 u0, the variance of the <strong>in</strong>tercepts across locations. It<strong>in</strong>dicates how much the general level of “Speed” varies between groups. Thelower right number is σ 2 u1, the variance of the coefficient for Length acrosslocations. It shows to what extent the relation between “Speed” <strong>and</strong>“LengthCentered” varies between groups. The lower left number is thecovariance between the two, <strong>in</strong>dicat<strong>in</strong>g to what extent there is a relationbetween the <strong>in</strong>tercept (i.e. the general level of “Speed”) <strong>and</strong> the slope (i.e. thestrength of the relation to “LengthCentred”) across locations. While the twovariances can only be positive, the covariance can be positive or negative. Apositive covariance <strong>in</strong>dicates that larger <strong>in</strong>tercepts are associated with largerslopes. The opposite is true for a negative covariance.The deviance decreased by 181 as opposed to the variance component model,<strong>in</strong>dicat<strong>in</strong>g that there is <strong>in</strong>deed substantial variation across locations <strong>in</strong> the effectthat length has on speed. This also becomes apparent <strong>in</strong> the fact that thevariance of the slope is significant. However the covariance between slope <strong>and</strong><strong>in</strong>tercept is not, <strong>in</strong>dicat<strong>in</strong>g that there is no relation between the average level ofspeed (i.e. the <strong>in</strong>tercept) <strong>and</strong> the length effect (i.e. the slope).Graphic <strong>in</strong>spection of the model predictionsTo <strong>in</strong>terpret the results it can often help to make use of MLwiNs great graphicalfunctions. In this case we will plot the predicted speed values for each location.To do so, we have to save the model-predictions as a new variable first.• Select “Model” <strong>in</strong> the top menu bar <strong>and</strong> click on “Predictions”• Click on all parameters <strong>in</strong> the lower half of the appear<strong>in</strong>g dialogue w<strong>in</strong>dow(so that they turn from grey to black)• Select an empty column (e.g. c10) for “output from prediction to”• Click on “Calc”• Now you can close the “Predictions” w<strong>in</strong>dowP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 9


Chapter 2Now we have created a variable that conta<strong>in</strong>s the predictions of the model,which we are go<strong>in</strong>g to plot aga<strong>in</strong>st the length.Open the graph dialogue by click<strong>in</strong>g on “Graph” <strong>in</strong> the top menu bar <strong>and</strong> byselect<strong>in</strong>g “Customized Graphs”. Fill <strong>in</strong> as shown below (all changes have to bemade <strong>in</strong> the drop-down lists on the right-h<strong>and</strong> side <strong>and</strong> appear automatically <strong>in</strong>the left-h<strong>and</strong> side table)• Press “Apply” to create the graph (You can close the Dialogue W<strong>in</strong>dowthen.)The result<strong>in</strong>g graph shows separate regression l<strong>in</strong>es for each location.


2.2 L<strong>in</strong>ear multilevel modelsThe size of σ 2 u0 (the <strong>in</strong>tercept-variance across locations) is reflected byvariations <strong>in</strong> the height of regression l<strong>in</strong>es <strong>and</strong> the size of σ 2 u1 (the slopevarianceacross locations) <strong>in</strong> variation <strong>in</strong> their steepness. The size of σ u01 (thecovariance between <strong>in</strong>tercept <strong>and</strong> slope) would be reflected <strong>in</strong> the fact that l<strong>in</strong>esthat are on a higher level all-together (larger <strong>in</strong>tercept) tend to be more or lesssteep than those at the bottom of the graph. As the covariance is not significanthere, it is however not possible to see such a tendency.2.2.1.5. Add<strong>in</strong>g a categorical predictorTo demonstrate how a categorical predictor can be <strong>in</strong>cluded <strong>in</strong>to the model, thecont<strong>in</strong>uous variable LengthCentred will be replaced by a categorical one(LengthCat) that simply <strong>in</strong>dicates whether the length of a car is below (0) orabove (1) average. The first step is to def<strong>in</strong>e this variable as categorical ratherthan cont<strong>in</strong>uous.Model formulation• Click on “Data Manipulation” <strong>in</strong> the top menu bar <strong>and</strong> select “Names”• Select LengthCat• Click “Categories”• Def<strong>in</strong>e the categories as shown below (simply start typ<strong>in</strong>g after you clickedon each field)• Press “Apply”• Close the “Names” w<strong>in</strong>dowP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 1


Chapter 2• Remove LengthCentred by click<strong>in</strong>g on the term <strong>and</strong> than on “Delete Term”• Include LengthCat with “Add Term” Choose “


2.2 L<strong>in</strong>ear multilevel modelsResults <strong>and</strong> InterpretationContrary to the model <strong>in</strong> 2.2.1.4, with Length as a cont<strong>in</strong>uous predictor, thecovariance between <strong>in</strong>tercept <strong>and</strong> slope is now significant. Its negative value<strong>in</strong>dicates that the difference between long <strong>and</strong> short cars is smaller at locationswith a high overall speed level (i.e. a large <strong>in</strong>tercept) as compared to locationswith a low overall speed level. The decreased deviance value (45218-44963 =255) is highly significant <strong>in</strong>dicat<strong>in</strong>g that the effect of speed is <strong>in</strong>deed notconstant across locations.2.2.1.6. Add<strong>in</strong>g a contextual variableOne of the advantages of multilevel models is the possibility to <strong>in</strong>cludepredictors situated at different levels simultaneously <strong>in</strong> the model. As anexample of a higher level variable, the traffic count for each road site will betaken up <strong>in</strong>to the model as a contextual predictor (that means it does not varyacross Level 1, but only across higher level units, here the locations). Thevariable TrafCountCat takes the value 0 for each road site with fewer than 100cars pass<strong>in</strong>g dur<strong>in</strong>g observation <strong>and</strong> 1 for road sites on which more than 100cars passed.Model formulationFirst go back to model 2.2.1.4, the r<strong>and</strong>om slope model with the cont<strong>in</strong>uouslength variable:• Delete LengthCat from the equation• take up LengthCentered aga<strong>in</strong>• make the effect of LengthCentred vary r<strong>and</strong>omly across locationsNow add the context variable• Include TrafCountCat <strong>in</strong>to the equation with “Add Term”.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 3


Chapter 2o Select 100” <strong>in</strong>dicates that at road sites withmore than 100 cars pass<strong>in</strong>g, cars went on average 33.2 km/h faster than onroad sites with fewer than 100 cars pass<strong>in</strong>g. The coefficient is highly significant.Moreover the deviance is reduced by 385 as opposed to the model <strong>in</strong> 2.2.1.4.Both <strong>in</strong>dicates that the number of cars pass<strong>in</strong>g at a road sites is a goodpredictor of speed at that road site.2.2.1.7. Test<strong>in</strong>g for a cross-level <strong>in</strong>teractionIn order to test whether the context variable trafficCount modifies the lengtheffectat the level of the <strong>in</strong>dividual cars, the <strong>in</strong>teraction between TrafficCountCat<strong>and</strong> LengthCentred is added to the model.Model formulation• Click on “Add term” <strong>and</strong>• Include the <strong>in</strong>teraction between LimitCat <strong>and</strong> LengthCentred as shownbelow Select order 1 (this means it is a first-order <strong>in</strong>teraction) Choose aga<strong>in</strong>


2.2 L<strong>in</strong>ear multilevel modelsResults <strong>and</strong> InterpretationThe coefficient for the <strong>in</strong>teraction (lengthCentred.>100) is clearly not significant,as it is much smaller than its st<strong>and</strong>ard error <strong>in</strong> parenthesis. The negligibleweight for this <strong>in</strong>teraction <strong>in</strong>dicates that the length effect at road sites with morethan 100 cars pass<strong>in</strong>g were not different from those at road sites with fewercars pass<strong>in</strong>g. The same message is conveyed by the likelihood that did notdecrease from the model <strong>in</strong> 2.2.1.6 to the present model, suggest<strong>in</strong>g thatadd<strong>in</strong>g the <strong>in</strong>teraction <strong>in</strong>troduces complexity that does not expla<strong>in</strong> anyth<strong>in</strong>g.To conclude, the speed of cars varies more between road sites than with<strong>in</strong>them. Accord<strong>in</strong>gly, speed is affected by a level-1 variable (length) to someextent, but much more so by a level-2 variable. The effect of length is notmodified by the traffic count.2.2.1.8. ConclusionIn this chapter it was demonstrated how to extend a l<strong>in</strong>ear regression model toa multilevel structure. It was demonstrated how the variance partition coefficient<strong>and</strong> the deviance test can be used to establish the appropriateness of themultilevel structure <strong>and</strong> how predictions <strong>and</strong> residuals at these different levelscan be presented graphically. Moreover, it was expla<strong>in</strong>ed how effects of level-1predictors (here the length of a car) can be considered together with predictorsat higher levels (here the traffic count at the measurement location) <strong>and</strong> how an<strong>in</strong>teraction between variables at different levels can be <strong>in</strong>vestigated.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 5


2.2 L<strong>in</strong>ear multilevel models2.2.2.1. A three-level variance component modelThe first three-level model to test would always be the variance componentmodel <strong>in</strong> which the <strong>in</strong>tercepts but not the slopes vary across the levels.Model formulationDef<strong>in</strong>e the dependent variable (Speed) as vary<strong>in</strong>g over three r<strong>and</strong>om factors,namely the <strong>in</strong>dividual cars (IDsubject), the location of measurement(IDlocation), <strong>and</strong> the prov<strong>in</strong>ce.• Click on the dependent variable <strong>and</strong> def<strong>in</strong>e as shown belowThen def<strong>in</strong>e a variance component model by allow<strong>in</strong>g the <strong>in</strong>tercept to varyr<strong>and</strong>omly across locations <strong>and</strong> take up LengthCentred as a predictor.• Click on the <strong>in</strong>tercept• Check the box j(IDlocation)• Check the box k(Prov<strong>in</strong>ce)• Press “Done”• Include LengthCentred as a predictor with “Add Term”• Press “Start” to estimate the parametersResults <strong>and</strong> InterpretationThe <strong>in</strong>tercept β 0jk now varies across locations <strong>and</strong> prov<strong>in</strong>ces, result<strong>in</strong>g <strong>in</strong> theestimation of three variances: σ 2 v0 is the variance of u ok , the derivation of eachpronv<strong>in</strong>ce’s <strong>in</strong>tercept from the mean <strong>in</strong>tercept. σ 2 u0 estimates the variation of theP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 7


Chapter 2<strong>in</strong>tercept between locations but with<strong>in</strong> the prov<strong>in</strong>ces <strong>and</strong> σ 2 e is the variance ofe ij , the <strong>in</strong>dividual error term <strong>in</strong> the first equation.The deviance of this three-level model as compared to the two-level variancecomponent model presented <strong>in</strong> 2.2.1.3 decreased by 24, which is significant(p


2.2 L<strong>in</strong>ear multilevel modelsResults <strong>and</strong> InterpretationThe variance of the <strong>in</strong>tercept across prov<strong>in</strong>ces σ 2 v0 is still only marg<strong>in</strong>allysignificant (p=.053). For the rest the estimates look very similar to the output ofthe two-level model (see section 2.2.1.4). There is a lot of variation <strong>in</strong> the<strong>in</strong>tercept across locations, a small but significant amount of variation <strong>in</strong> theslope of length across locations <strong>and</strong> no significant covariation between slope<strong>and</strong> <strong>in</strong>tercept. The <strong>in</strong>troduction of a r<strong>and</strong>om slope for speed decreased thedeviance by 290, which is highly significant.2.2.2.3. A fully r<strong>and</strong>om three-level modelAs a last step, the effect of speed will be allowed to vary r<strong>and</strong>omly not onlyacross locations but also across prov<strong>in</strong>ces.Model formulation• Click on the Length• Check the box k (prov<strong>in</strong>ce)• Estimate the parameters by click<strong>in</strong>g on “Start”P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 9


Chapter 2Results <strong>and</strong> <strong>in</strong>terpretationThe results now conta<strong>in</strong> two variance-covariance matrices: Ω v <strong>and</strong> Ω u . Ω v<strong>in</strong>dicates the variance <strong>and</strong> covariance of <strong>in</strong>tercept <strong>and</strong> slope across prov<strong>in</strong>ces<strong>and</strong> Ω u across locations. The variance-covariance matrix for the locations is stillrelatively unchanged as compared to the two-level model (2.2.1.4). Forprov<strong>in</strong>ces, the <strong>in</strong>tercept variance is still only marg<strong>in</strong>ally significant. The slopevariance <strong>and</strong> the slope/<strong>in</strong>tercept covariance that have been added to the modelare clearly not significant (to be significant at the .05-level they would have toexceed twice the size of their st<strong>and</strong>ard error, which is clearly not the case). Theconclusion that this last extension of the model was not necessary is confirmedby the deviance test. Although the present model estimates two extraparameters, the deviance is exactly the same as that of the simpler model withthe length effect only vary<strong>in</strong>g at the level of the locations.To conclude, the speed of cars has shown to vary substantially acrossmeasurement locations <strong>and</strong> only to a limited extent across prov<strong>in</strong>ces. There is arelation between the length of a car <strong>and</strong> its speed <strong>and</strong> this relation varies acrossmeasurement locations but not across prov<strong>in</strong>ces. The <strong>in</strong>troduction of a thirdlevel has proven of limited use. Not only is the variation attributed to this levelvery limited, but moreover the results for the other levels were almost exactlythe same as <strong>in</strong> the two-level models.2.2.2.4. ConclusionIt has been shown how to extend the two-level models presented <strong>in</strong> section2.2.1 to three level models. It has also been demonstrated how to <strong>in</strong>vestigate


2.2 L<strong>in</strong>ear multilevel modelswhether the additional level improves the model <strong>and</strong> <strong>in</strong> the present case it hasbeen concluded that a two-level model would be sufficient.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 3 1


2.3 Discrete response models2.3.1 IntroductionIn the methodology report the section for discrete responses is <strong>in</strong>troduced byoutl<strong>in</strong><strong>in</strong>g the generalised l<strong>in</strong>ear models (GLM) <strong>and</strong> their hierarchical version, themultilevel GLM. As there is no empirical example <strong>in</strong> the <strong>in</strong>troduction of theGLM, there is no correspond<strong>in</strong>g manual section. In the follow<strong>in</strong>g sections, the<strong>analysis</strong> of general b<strong>in</strong>omial responses (2.3.2), mult<strong>in</strong>omial responses (2.3.3)<strong>and</strong> counts (2.3.4) will be presented. All these analyses are <strong>in</strong>stances of themultilevel GLM.2.3.2 B<strong>in</strong>ary <strong>and</strong> general b<strong>in</strong>omial responsesHeike Martensen <strong>and</strong> Emmanuelle Dupont (IBSR)The example data used <strong>in</strong> this section were gathered <strong>in</strong> a Belgian dr<strong>in</strong>k driv<strong>in</strong>groadside survey. At 413 r<strong>and</strong>omly selected road sites 11,186 drivers werestopped, asked to perform an alcohol breath test <strong>and</strong> to answer a number ofquestions. The data conta<strong>in</strong> the follow<strong>in</strong>g variables:Dr<strong>in</strong>kDriv<strong>in</strong>g Was the alcohol concentration of the driver above the legal limitof .05 g/l? Yes=1, No=0.ID_<strong>in</strong>d Identification number of each driver tested.ID_loc Identification number of each test location.Gender Gender of the driver: Male =1, Female =2.Age A categorical variable: 16-25=1, 26-39=2, 40-54=3, 55+=4.Previously Has the driver been tested for alcohol at a roadside controlpreviously? Yes=1, No=0.Probability How high does the driver estimate the probability of be<strong>in</strong>gstopped for an alcohol control? Very Low=1, Low=2, Medium=3,High=4, Very High=5.TrafficCount The average number of cars pass<strong>in</strong>g the test site with<strong>in</strong> 15m<strong>in</strong>utes.Intensity The number of control officers present divided by TrafficCount.Data load• Click on “File” <strong>in</strong> the top menu bar <strong>and</strong> select “Open worksheet”.• Open “ALCOHOL.ws”Two constant variables (“denom” <strong>and</strong> “cons”) are necessary for build<strong>in</strong>g amodel for b<strong>in</strong>ary data. To generate these variables,• Click on “Data Manipulation” <strong>and</strong> select “Generate Vector”• Select “Constant Vector” as type of vector• Select an empty output column (e.g. c30)• Fill <strong>in</strong> the number of cases (11,186) at “Number of Copies”• Fill <strong>in</strong> “1” at “Value”


2.3.2 B<strong>in</strong>ary <strong>and</strong> general b<strong>in</strong>omial responses• Click “Generate”• Select another empty output column (e.g. c31)• Click “Generate” aga<strong>in</strong>• Close the “Generate Vector” dialogue w<strong>in</strong>dow• Click on “Data Manipulation” <strong>and</strong> select “Names”• Select the first variable just generated (i.e. c30)• Type “cons” <strong>in</strong> the field at the top of the w<strong>in</strong>dow <strong>and</strong> press return• Name the second variable just generated (i.e. c31) “denom”Model formulationDef<strong>in</strong>e the dependent variable• Click on the “y” <strong>and</strong> Select “Dr<strong>in</strong>kDriv<strong>in</strong>g” from the drop-down w<strong>in</strong>dow asdependent variable Select “2-ij”from the N-levels drop-down w<strong>in</strong>dow Select “ID_loc” from the level 2(j) drop-down w<strong>in</strong>dow Select “ID_<strong>in</strong>d” from the level 1(i) drop-down w<strong>in</strong>dow Click “done”• Click on the N <strong>in</strong> the Distribution statement for Dr<strong>in</strong>kDriv<strong>in</strong>g <strong>and</strong> Check “B<strong>in</strong>omial” In the appear<strong>in</strong>g l<strong>in</strong>k functions, leave “logit” checked Click “Done”• Click on the red n ij <strong>in</strong> the distribution statement for Dr<strong>in</strong>kDriv<strong>in</strong>g Select “denom” <strong>in</strong> the variable drop down listA b<strong>in</strong>omial distribution is characterised by the proportion π ji <strong>and</strong> thedenom<strong>in</strong>ator n ij stat<strong>in</strong>g the number of <strong>in</strong>stances on which the proportion isbased. In the present study the denom<strong>in</strong>ator is a constant 1, mean<strong>in</strong>g that thedata are b<strong>in</strong>ary. This <strong>and</strong> the choice of the “logit” function as a l<strong>in</strong>k functionmake the model a logistic regression model.Build a two-level r<strong>and</strong>om <strong>in</strong>tercept model• Click “Add Term” <strong>and</strong> select “Cons” from the variable drop down list• Click on “cons” <strong>and</strong> check “j(id_loc)”In the General L<strong>in</strong>ear Model notation, the <strong>in</strong>tercept is not automatically <strong>in</strong>cluded.To do so, a constant variable must be <strong>in</strong>cluded as a predictor.Add predictors• Click on “Add Term”, select “TrafficCount” as a predictor from the variabledrop down list• Click on “Add Term”, select “Intensity” as a predictor• Click on “Add Term”, select “Gender” as a predictor, chose “Male” asreference categoryP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 3 3


Chapter 2• Click on “Add Term”, select “Previously” as a predictor, select “not testedpreviously” as reference category• Click on “Add Term”, select “Probability” as a predictor; select “very low” asreference category• Click on “Add Term”, select “Previously” as a predictor, select “Age16-25” asreference categoryEstimationIn the b<strong>in</strong>omial distribution it is assumed that the variance is equal to the oddsratio,π ji (1- π ji ). To test this assumption, first estimate a model assum<strong>in</strong>g anextra-B<strong>in</strong>omial distribution, where the variance is left free to vary.• Click on “Notation” at the bottom of the “Equations” w<strong>in</strong>dow aga<strong>in</strong>• Select “extra B<strong>in</strong>omial” <strong>in</strong> the top row• Leave the other two options at their default value (1 st order l<strong>in</strong>earization <strong>and</strong>MQL as estimation type)• Click “Done”• Press “Start” to start the estimation procedure.Once the estimation procedure has converged (i.e. all blue numbers turnedgreen)• Click on “Notation” at the bottom of the “Equations” w<strong>in</strong>dow aga<strong>in</strong>• Select “2 nd order” under “L<strong>in</strong>earization”• Select “PQL” under “Estimation type”• Press “Done”• Press “More” to cont<strong>in</strong>ue the estimationIn the estimation procedure, the nonl<strong>in</strong>ear l<strong>in</strong>k function is l<strong>in</strong>earized byapproximat<strong>in</strong>g it with a Taylor Series expansion. A Taylor <strong>series</strong> consists of an<strong>in</strong>f<strong>in</strong>ite number of terms <strong>and</strong> the more of them are used, the closer theapproximation. The first choice is whether only the first (1 st order l<strong>in</strong>earization)or the first two are used (2 nd order l<strong>in</strong>earization). The other choice concerns thevalues that the Taylor <strong>series</strong> expansion is based on: Dur<strong>in</strong>g each iteration theTaylor <strong>series</strong> is calculated on the basis of the currently estimated parametervalues. In the Marg<strong>in</strong>al Quasi Likelihood method (MQL) only the fixedparameters are <strong>in</strong>cluded, <strong>in</strong> the Penalized Quasi Liklihood method (PQL) theresiduals are <strong>in</strong>cluded as well. Generally speak<strong>in</strong>g, 2 nd order l<strong>in</strong>earization <strong>and</strong>PQL are more accurate but computationally <strong>in</strong>tensive <strong>and</strong> more prone toconvergence problems. The 1 st order MQL estimates on the other h<strong>and</strong> areknown to be biased downwards. It is therefore suggested to use 1 st orderl<strong>in</strong>earization <strong>and</strong> MQL to get rough start<strong>in</strong>g values on which the f<strong>in</strong>al estimationus<strong>in</strong>g 2 nd order l<strong>in</strong>earization <strong>and</strong> PQL is based. For more <strong>in</strong>formation seeGoldste<strong>in</strong> (2003) or Hox (2002). With these methods, slight variations <strong>in</strong> theestimated values are to be expected.


2.3.2 B<strong>in</strong>ary <strong>and</strong> general b<strong>in</strong>omial responsesResultsThe first <strong>in</strong>terest is to evaluate whether the assumption of a b<strong>in</strong>omial distributionholds. In that case the theoretically expected value for the variance would be 1.At the bottom of the Equations w<strong>in</strong>dow, the estimated variance is <strong>in</strong>dicated with0.711. As this is very close to the theoretically expected value, it is probablysafe to estimate the model under the more restrictive assumption of a B<strong>in</strong>omialdistribution (rather than an extra-B<strong>in</strong>omial one).Estimate the model aga<strong>in</strong> assum<strong>in</strong>g a B<strong>in</strong>omial distribution.• Click on “Notation” at the bottom of the “Equations” w<strong>in</strong>dow aga<strong>in</strong>• Click on “Use Defaults” (i.e., B<strong>in</strong>omial distribution, 1 st order l<strong>in</strong>earization <strong>and</strong>MQL as estimation type)• Click “Done”• Press “Start” to start the estimation procedure.Once the estimation procedure has converged (i.e. all blue numbers turnedgreen)• Click on “Notation” at the bottom of the “Equations” w<strong>in</strong>dow aga<strong>in</strong>• Select “2 nd order” under “L<strong>in</strong>earization”• Select “PQL” under estimation type• Press “Done”• Press “More” to cont<strong>in</strong>ue the estimationP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 3 5


Chapter 2Results <strong>and</strong> <strong>in</strong>terpretationThe parameter estimates of the B<strong>in</strong>omial model are very similar to that of theextra-B<strong>in</strong>omial one, confirm<strong>in</strong>g that the B<strong>in</strong>omial distribution assumptions holdfor the present <strong>analysis</strong>. The <strong>in</strong>terpretation will therefore be based on this lastmodel.Before <strong>in</strong>terpret<strong>in</strong>g the coefficients, their significance has to be tested. Forcategorical variables with several levels (e.g. probability) there is more than onepredictor (here 4: low, medium, high, <strong>and</strong> very high) which have to be testedjo<strong>in</strong>tly. This is done with the Multivariate Wald test. S<strong>in</strong>gle coefficients forcont<strong>in</strong>uous predictor variables (e.g. TrafficCount) or those with only two levels(e.g. Gender) can be tested with the Z-test.To conduct the Z-test:• Divide the coefficients by their st<strong>and</strong>ard errors• Click on “Basic Statistics” <strong>in</strong> the top menu bar <strong>and</strong> select “Tail Areas”• Check “St<strong>and</strong>ard Normal distribution”• Fill <strong>in</strong> the result of the division• Click “Calc”To conduct a Multivariate Wald test (jo<strong>in</strong>t Chi-square test)• Click on “Model” <strong>in</strong> the top menu bar <strong>and</strong> select “Intervals <strong>and</strong> tests”• Check “fixed” at the bottom of the appear<strong>in</strong>g dialogue w<strong>in</strong>dow• Type a 1 <strong>in</strong> front of every coefficient that you want to test jo<strong>in</strong>tly (e.g. thosefor “low probability”, “medium probability”, “high probability”, <strong>and</strong> “very highprobability”)• Click on “Calc”• Click on “Basic Statistics” <strong>in</strong> the top menu bar <strong>and</strong> select “Tail Areas”• Check “Chi Squared”• Fill <strong>in</strong> the result<strong>in</strong>g Chi-square value from the “Intervals <strong>and</strong> tests” dialogue• Click “Calc”As can be seen <strong>in</strong> Table 2.3.1, all predictor variables are significant.


2.3.2 B<strong>in</strong>ary <strong>and</strong> general b<strong>in</strong>omial responsesPredictor Coefficient SE Z p(Z) Chi2 d.f. p(chi2) e coefficientTrafficCount -0.002 0.0001 -20.00 0.000 0.998Intensity 0.898 0.379 2.37 0.009 2.455Female -1.374 0.206 -6.67 0.000 0.253Previously 0.407 0.14 2.91 0.002 1.502Prob. Low 0.536 0.166 25.46 4 0.000 1.709Prob. Medium 0.743 0.168 2.102Prob. High 0.313 0.277 1.368Prob. Very high 1.431 0.289 4.183Age26-39 0.709 0.241 18.17 3 0.000 2.032Age40-54 1.312 0.233 3.714Age55+ 0.859 0.271 2.361Table 2.3.1: Results of s<strong>in</strong>gle <strong>and</strong> jo<strong>in</strong>t tests for predictorsOne way to <strong>in</strong>terpret the coefficients is to take their exponentials, which ispresented <strong>in</strong> the right-most column of Table 2.3.1. For a one unit <strong>in</strong>crease <strong>in</strong> thepredictor, the odds of the dependent variable have to be multiplied by theexponential of the coefficient.The odds of an event are calculated as the number of events divided by thenumber of non-events. For example, on average 3 drivers <strong>in</strong> every 100 aredrunk, so the odds for any r<strong>and</strong>omly chosen car of hav<strong>in</strong>g a drunk driver are:3/97 = 0.031. The odds for an event that is as likely to happen as not (p=0.5)are 1. While odds have useful mathematical properties, they can producecounter<strong>in</strong>tuitive results because they are similar to probabilities <strong>in</strong> the lowerranges (the odds of p=.01 are .0101) but not at all <strong>in</strong> the higher ranges (theodds of p=.75 are 3 <strong>and</strong> those of p=.99 are 99). As an example: an 80%probability is four <strong>time</strong>s the chance of a 20% probability but the odds are 16<strong>time</strong>s higher.Another way to <strong>in</strong>terpret the coefficients, is to calculate the probability fordifferent values of the predictor. The probability for any chosen value of apredictor x is given by:1π = i1+exp( −(β + β x ))(2.3.1)0 1 iIn Table 2.3.2, the probability for Dr<strong>in</strong>kDriv<strong>in</strong>g is given for each predictor tak<strong>in</strong>gthe value of 1, while all other predictors are 0. In the third column, thisprobability is divided by the probability at the <strong>in</strong>tercept (i.e. for all predictorsbe<strong>in</strong>g zero), <strong>in</strong>dicat<strong>in</strong>g the multiplicative factor on the probability for a one-unit<strong>in</strong>crease. This factor is compared to the exponential of the coefficient, themultiplicative factor on the odds. As can be seen, the two right-most columnsare exactly the same, <strong>in</strong>dicat<strong>in</strong>g that with such a small proportion of dr<strong>in</strong>kdriv<strong>in</strong>g, the difference between probabilities <strong>and</strong> odds are negligible.The <strong>in</strong>terpretation of each coefficient is described elaborately <strong>in</strong> section 2.3.2 ofthe methodology report (D7.4) <strong>and</strong> will not be fully repeated here. As anexample weP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 3 7


Chapter 2Predictor Coefficient π ji \x=1 (π ji \x=0) / (π ji \x=1) e coefficientIntercept -4.746 0.009TrafficCount -0.002 0.009 0.998 0.998Intensity 0.9 0.021 2.460 2.460Female -1.374 0.002 0.253 0.253Previously 0.407 0.013 1.502 1.502Prob. Low 0.536 0.015 1.709 1.709Prob. Medium 0.743 0.018 2.102 2.102Prob. High 0.313 0.012 1.368 1.368Prob. Very high 1.431 0.036 4.183 4.183Age26-39 0.709 0.018 2.032 2.032Age40-54 1.312 0.032 3.714 3.714Age55+ 0.859 0.021 2.361 2.361Table 2.3.2: Probability of Dr<strong>in</strong>kDriv<strong>in</strong>g for each predictor at x=0 <strong>and</strong> x=1.will describe the <strong>in</strong>terpretation of one cont<strong>in</strong>uous variable (TrafficCount) <strong>and</strong> ofa categorical one (Age).TrafficCount has a negative weight, <strong>in</strong>dicat<strong>in</strong>g that for each car pass<strong>in</strong>g, theproportion of dr<strong>in</strong>k driv<strong>in</strong>g decreases. The exponential of the coefficient (.998)<strong>in</strong>dicates that for the decrease <strong>in</strong> one car the odds have to be multiplied by.998, i.e. decrease by 0.2%. Note that the relation between predictor <strong>and</strong>dependent variable is not l<strong>in</strong>ear. To establish the decrease <strong>in</strong> odds for 100 carspass<strong>in</strong>g, one has to multiply the coefficient by 100 before tak<strong>in</strong>g the exponentialwhich results <strong>in</strong> 0.819 or an 18% decrease.The coefficients for the age categories 26-39, 40-54, <strong>and</strong> 55+ are all positive<strong>and</strong> thus result <strong>in</strong> exponential coefficients larger than 1. This means all agegroups show a higher <strong>in</strong>cidence of dr<strong>in</strong>k driv<strong>in</strong>g that the youngest drivers (16-25) who constitute the reference category. Most notably, <strong>in</strong> the age-group of 40-54 year olds the exponential coefficient amounts to 3.714, which <strong>in</strong>dicates thatdr<strong>in</strong>k driv<strong>in</strong>g <strong>in</strong> this age group occurs almost four <strong>time</strong>s as often as among theyoung drivers.The jo<strong>in</strong>t chi-square test reported above <strong>in</strong>dicates that there is a differencebetween the age-groups somewhere. One might also want to test, whether twoparticular age groups differ from each other significantly. To test whether the40-54 year olds differ from the 55+ year olds, follow the same procedure asdescribed above, but put 1 <strong>in</strong> front of “age 40-54” <strong>and</strong> -1 <strong>in</strong> front of “age 55+”.The difference between those two age groups is significant (X 2 (1)=5.58,p=.018). We can therefore conclude that the 40-54 year olds dr<strong>in</strong>k <strong>and</strong> drivesignificantly more often than even the group of people older than 55+ whofeature the second highest rate of dr<strong>in</strong>k driv<strong>in</strong>g.2.3.2.1. ConclusionA multilevel version of a logistic regression <strong>analysis</strong> was presented. Specialcharacteristics of the estimation procedure for b<strong>in</strong>ary variables were discussed.The b<strong>in</strong>omial model was compared to the extra-b<strong>in</strong>omial model <strong>and</strong> it wasconcluded that the b<strong>in</strong>omial distribution holds. It was demonstrated how to use


2.3.2 B<strong>in</strong>ary <strong>and</strong> general b<strong>in</strong>omial responsesthe jo<strong>in</strong>t Wald test to test the significance of categorical variables <strong>and</strong> shownhow the coefficients can be <strong>in</strong>terpreted either by transform<strong>in</strong>g them <strong>in</strong>to oddsratios or calculat<strong>in</strong>g the probability for specific values of the predictors.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 3 9


2.3.3 Mult<strong>in</strong>omial responsesHeike Martensen <strong>and</strong> Emmanuelle Dupont (IBSR)The example data used <strong>in</strong> this section are the Belgian dr<strong>in</strong>k driv<strong>in</strong>g roadsidesurvey data also used <strong>in</strong> section 2.3.2. The data conta<strong>in</strong> the same variables as<strong>in</strong> 2.3.2, with the exception of the dependent variable. Rather than Dr<strong>in</strong>kDriv<strong>in</strong>g(a dichotomous variable) <strong>in</strong> this chapter a variable with three possibleoutcomes, “Breathtest” will be modelled:Breathtest 1 = Safe; blood-alcohol concentration (BAC) is below 0.05 mg/l.2 = Alarm; driver’s BAC is between 0.05 <strong>and</strong> 0.08 mg/l. 3 =Positive; driver’s BAC is above 0.08 mg/l.ID_<strong>in</strong>d Identification number of each driver tested.ID_loc Identification number of each test location.Gender Gender of the driver: Male =1, Female =2.Age A categorical variable: 16-25=1, 26-39=2, 40-54=3, 55+=4.Categorical responses can be perceived <strong>in</strong> two ways: They can either form anordered <strong>series</strong> that is based on some underly<strong>in</strong>g cont<strong>in</strong>uous variable or theyconsist of different categories that are not systematically related. In the presentcase, the three categories (“safe”, “alarm”, “positive”) are clearly related to theunderly<strong>in</strong>g variable blood alcohol concentration (BAC) <strong>and</strong> will therefore bemodelled <strong>in</strong> an ordered proportional odds model. At the end of this section, theunordered category model will be presented, so that the reader can see howsuch a model is fitted <strong>and</strong> how the results differ from an ordered model.Data loadClick on “File” <strong>in</strong> the top menu bar <strong>and</strong> select “Open worksheet”.Open “ALCOHOL.ws”A constant (“cons”) is necessary for build<strong>in</strong>g a model for categorical data. Togenerate this variable (unless you have done it <strong>and</strong> saved it <strong>in</strong> section 2.3.2),• Click on “Data Manipulation” <strong>and</strong> select “Generate Vector”• Select “Constant Vector” as type of vector• Select an empty output column (e.g. c30)• Fill <strong>in</strong> the number of cases (11,186) at “Number of Copies”• Fill <strong>in</strong> “1” at “Value”• Click “Generate”• Close the “Generate Vector” dialogue w<strong>in</strong>dow• Click on “Data Manipulation” <strong>and</strong> select “Names”• Select the first variable just generated (i.e. c30)• Type “cons” <strong>in</strong> the field at the top of the w<strong>in</strong>dow <strong>and</strong> press return2.3.3.1. Ordered proportional odds: empty s<strong>in</strong>gle level model


2.3.3 Mult<strong>in</strong>omial responsesModel formulationDef<strong>in</strong>e the dependent variable• Click on the “y” <strong>and</strong> Select “Breathtest” from the drop-down w<strong>in</strong>dow as dependentvariable Select “2-ij”from the N-levels drop-down menu Select “ID_loc” from the level 2(j) drop-down menu Select “ID_<strong>in</strong>d” from the level 1(i) drop-down menu Click “Done”• Click on the N <strong>in</strong> the Distribution statement for Breathtest <strong>and</strong> Check “Mult<strong>in</strong>omial” In the appear<strong>in</strong>g w<strong>in</strong>dow, leave “logit” checked Select “Ordered proportional odds” Leave “safe” as reference category Click “Done”• Click on the red n ij <strong>in</strong> the distribution statement for Breathtest Select “cons” <strong>in</strong> the variable drop down listWhen a response variable is def<strong>in</strong>ed as mult<strong>in</strong>omial, MLwiN automaticallygenerates a number of new variables. To view these variables select “DataManipulation” <strong>in</strong> the top menu bar, click “view or edit data”, click on “view” <strong>and</strong>select “resp”, “resp_<strong>in</strong>dicator”, <strong>and</strong> “id_<strong>in</strong>d_long”. Click “Ok”.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 4 1


Chapter 2A new response variable has been made (resp), which <strong>in</strong>dicates for each<strong>in</strong>dividual for each response category, whether this category was the givenresponse (1) or not (0). In the present case we have three categories. Notehowever, that the data can be fully described with two <strong>in</strong>dependent categories.If someone was neither <strong>in</strong> the category “alarm” nor <strong>in</strong> “positive”, we know forsure that he is <strong>in</strong> “safe”. Therefore the variable “resp” conta<strong>in</strong>s for each<strong>in</strong>dividual two values <strong>in</strong>dicat<strong>in</strong>g whether or not the response has been “alarm” or“positive”, respectively.The variable “resp_<strong>in</strong>dicator” <strong>in</strong>dicates which response the value <strong>in</strong> the variable“resp” applies to. There are two possible values: (>=alarm) or (>=positive).These labels <strong>in</strong>clud<strong>in</strong>g the “greater-than or equal” relation are automaticallygenerated by MLwiN. They are due to the fact that the estimated model is anordered category model <strong>in</strong> which it is assumed that “positive” is greater than“alarm” <strong>and</strong> “alarm” is greater than “safe”. Therefore, an <strong>in</strong>dividual categorizedas (>=positive) is automatically also categorized as (>=alarm). The reader canverify this by check<strong>in</strong>g the values <strong>in</strong> “resp”. The majority of the <strong>in</strong>dividuals havetwo zeros, <strong>in</strong>dicat<strong>in</strong>g that they were <strong>in</strong> the safe-category. There are some<strong>in</strong>dividuals hav<strong>in</strong>g a 1 at (>=alarm) <strong>and</strong> a zero at (>=positive), who were <strong>in</strong> thealarm category. Note however, that all <strong>in</strong>dividuals with a 1 at (>=positive) alsohave a 1 at (>=alarm). It is easy to underst<strong>and</strong> this when re-translat<strong>in</strong>g thecategory titles to the underly<strong>in</strong>g BAC values: Everybody who is <strong>in</strong> the positivecategory (i.e. his BAC was >=0.08) also has a BAC >=0.05 (which def<strong>in</strong>es thealarm category).A mult<strong>in</strong>omial model has an <strong>in</strong>tercept for each <strong>in</strong>dependent category (i.e. onefor (>=alarm) <strong>and</strong> one for (>=positive). To <strong>in</strong>clude these <strong>in</strong>tercepts <strong>in</strong>to themodel• Click “Add Term” <strong>and</strong> select “Cons” from the variable drop down list• Click “add Separate coefficients”


2.3.3 Mult<strong>in</strong>omial responses• Click “Done”Estimation• Click on “Nonl<strong>in</strong>ear” at the bottom of the Equations w<strong>in</strong>dow• Select “Use defaults” or check “Mult<strong>in</strong>omial”, “1 st order”-L<strong>in</strong>earisation, <strong>and</strong>“MQL”• Click “Done”A couple of remarks about the estimation procedure are necessary. At the <strong>time</strong>of writ<strong>in</strong>g this manual, the estimation of mult<strong>in</strong>omial models proved to beproblematic. As described <strong>in</strong> the previous section on b<strong>in</strong>omial data (2.3.2), it ispr<strong>in</strong>cipally advised to use 2 nd order PQL estimates for good unbiased results.However, for the mult<strong>in</strong>omial model these estimates did not converge, leav<strong>in</strong>gus with 1 st order MQL estimates, which are more stable but downwardly biased.It was decided to <strong>in</strong>clude this chapter nevertheless, because due to the rapiddevelopment <strong>in</strong> estimation techniques (see also section 2.7 <strong>in</strong> the Methodologyreport) these problems might be solved soon. Because the estimation isproblematic at the moment, MLwiN gives a lot of numerical warn<strong>in</strong>gs whilerunn<strong>in</strong>g. As advised <strong>in</strong> the MLwiN manual (Rasbash et al., 2004), we suggest tosuppress them by click<strong>in</strong>g “Estimation control” <strong>and</strong> check<strong>in</strong>g “suppress numericwarn<strong>in</strong>gs”.• Then press “Start” the estimate the modelP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 4 3


Chapter 2Results <strong>and</strong> <strong>in</strong>terpretationThe only th<strong>in</strong>g this empty model does is to estimate an <strong>in</strong>tercept for each of thecategories. As <strong>in</strong>dicated <strong>in</strong> the methodology report, these <strong>in</strong>tercepts can betranslated <strong>in</strong>to probabilities by fill<strong>in</strong>g them <strong>in</strong>to equation 2.3.4 of themethodology report.1γij=(2.3.4)1+exp−(parameter)This way we receive 0.02 as probability to be <strong>in</strong> category “positive” (i.e. BAC>=.08) <strong>and</strong> 0.03 to be <strong>in</strong> category “alarm” or “positive” (i.e. BAC >=.05). Notethat no variance is estimated, as <strong>in</strong> the mult<strong>in</strong>omial distribution the variance isdeterm<strong>in</strong>ed exclusively by the probability of belong<strong>in</strong>g to a particular category(see section 2.3.3 <strong>in</strong> the methodology report).Each component of the model equations is double <strong>in</strong>dexed by ij, <strong>in</strong>dicat<strong>in</strong>g thatwe have a two-level structure. This might be confus<strong>in</strong>g, because conceptually,this model has only one level (the <strong>in</strong>dividuals). Structurally, the <strong>in</strong>dividuals arethe second level <strong>and</strong> therefore <strong>in</strong>dexed j here. The first level, <strong>in</strong>dexed by i, isdef<strong>in</strong>ed by the variable “resp_<strong>in</strong>dicator” that <strong>in</strong>dicates which of two categories((>=alarm) or (>=positive)) the response variable refers to.2.3.3.2. Ordered proportional odds: empty two-level modelIn the previous section on b<strong>in</strong>ary responses (2.3.2), it had been demonstratedthat the probability of dr<strong>in</strong>k driv<strong>in</strong>g varied across measurement locations. Amodel that <strong>in</strong>cludes a location-level is conceptually a two-level model. Toimplement a two-level mult<strong>in</strong>omial model, however, we will build a structuralthree-level model (the first level be<strong>in</strong>g reserved for the response <strong>in</strong>dicator).Model formulationBecause the three response categories are assumed to have one underly<strong>in</strong>gvariable (<strong>in</strong> this case the BAC) it is assumed that r<strong>and</strong>om variation across thelocations is the same for each response category. To create this structure, it isnecessary to <strong>in</strong>clude another constant <strong>in</strong>to the model with a common coefficient


2.3.3 Mult<strong>in</strong>omial responses(as opposed to the constant that is already <strong>in</strong>cluded that has separatecoefficients for (>=alarm) <strong>and</strong> (>=positive)).• Click on “Add Term” at the bottom of the Equations w<strong>in</strong>dow• Select “cons” from the variable drop down list• Click on “Add common coefficient”• Select “Include all”• Click “Done”This common coefficient must be allowed to vary r<strong>and</strong>omly across locations.However, because we already have <strong>in</strong>tercepts we do not want the newly<strong>in</strong>cluded constant to function as a fixed factor. Therefore• Click on the constant just added (cons.23)• Check “k(id_loc_long)”• Uncheck “Fixed Parameter”• Click “Done”Press “Start to estimate the model.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 4 5


Chapter 2ResultsAs can be seen <strong>in</strong> the triple <strong>in</strong>dex ijk, structurally this is a three level model.Conceptually however, this is a two-level model. The second conceptual level isthat of the locations. It is implemented by the jo<strong>in</strong>t <strong>in</strong>tercept h jk . This jo<strong>in</strong>t<strong>in</strong>tercept has been def<strong>in</strong>ed as r<strong>and</strong>om factor only, but not as fixed factor.Consequently there is no mean estimated for it, only its variance Ω v , which<strong>in</strong>dicates how the probability to be either <strong>in</strong> category “alarm” or <strong>in</strong> category“positive” varies across locations. Although the present estimations should be<strong>in</strong>terpreted with caution, we can note that the variation Ω v is quite largecompared to its st<strong>and</strong>ard error, mak<strong>in</strong>g it very likely that there is substantialvariation across locations <strong>in</strong> the probability to have a BAC above .05 or above.08. This is also <strong>in</strong> l<strong>in</strong>e with the results from the b<strong>in</strong>ary model <strong>in</strong> section 2.3.2.2.3.3.3. Ordered proportional odds: the two-level model with predictorsThe next step is to <strong>in</strong>clude predictors <strong>in</strong>to the mult<strong>in</strong>omial model. In the orderedmodel, predictors are usually assumed to apply to all categories <strong>in</strong> the sameway (i.e., if a particular variables is thought to affect the probability to have aBAC above .05 it is also thought to affect the probability to have a BAC above.08). Therefore only one slope is estimated for each predictor. We will take upthe variables “gender” <strong>and</strong> “age”.Model formulation• Click on “Add Term” at the bottom of the Equations w<strong>in</strong>dow• Select “gender” from the variable drop down list• Select “male” as reference category


2.3.3 Mult<strong>in</strong>omial responses• Click on “Add common coefficient”• Select “Include all”• Click “Done”Results <strong>and</strong> <strong>in</strong>terpretationThe common part h jk is now def<strong>in</strong>ed by the r<strong>and</strong>om variation over the locations(v 3k cons.23) but also by the effect of gender. The coefficient for female.23 jk isnegative, <strong>in</strong>dicat<strong>in</strong>g that women, as compared to men who form the base l<strong>in</strong>ecategory, have a lower probability to be <strong>in</strong> categories “alarm” or “positive”.Accord<strong>in</strong>gly the <strong>in</strong>tercepts cons.(>=alarm) ijk <strong>and</strong> cons.(>=positive) ijk have<strong>in</strong>creased (i.e., they have lower negative values) as compared to the modelwithout predictors. These <strong>in</strong>tercepts now represent the probabilities for men<strong>in</strong>stead of represent<strong>in</strong>g the probabilities for the whole group.Now <strong>in</strong>clude the predictor “age”. This is a categorical variable represent<strong>in</strong>g fourage-categories (16-25, 26-39, 40-54, 55+).• Click on “Add Term” at the bottom of the Equations w<strong>in</strong>dowP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 4 7


Chapter 2• Select “age” from the variable drop down list• Select “age 16-25” as reference category• Click on “Add common coefficient”• Select “Include all”• Click “Done”• Click “Start” to estimate the modelResults <strong>and</strong> <strong>in</strong>terpretationThe coefficients for the various age-categories are positive, <strong>in</strong>dicat<strong>in</strong>g that allage-categories listed have a higher probability of be<strong>in</strong>g <strong>in</strong> “alarm” or “positive”than the youngest (16-25) that forms the reference category. The highestcoefficient is estimated for drivers aged 40-54, <strong>in</strong>dicat<strong>in</strong>g that this group isespecially at risk of dr<strong>in</strong>k driv<strong>in</strong>g.2.3.3.4. Unordered categories: the two-level model with predictorsIn the present example we clearly have a variable (the BAC) underly<strong>in</strong>g theresponse categories, mak<strong>in</strong>g the ordered proportional odds model likely to bethe appropriate. In this way, it was assumed that both probabilities (of hav<strong>in</strong>g aBAC >.05 <strong>and</strong> of hav<strong>in</strong>g a BAC >.08) vary <strong>in</strong> the same way across locations <strong>and</strong>that the effects of gender <strong>and</strong> age on both probabilities are the same. However,even if both probabilities are related to one underly<strong>in</strong>g variable, they mightnevertheless show different r<strong>and</strong>om or fixed effects. This can be <strong>in</strong>vestigated <strong>in</strong>an unordered category model, which assumes no relation between the differentcategories to start with. To switch from an ordered to an unordered mult<strong>in</strong>omialmodel <strong>in</strong> MLwiN, one has to build a new model from the start.• Click on “Clear” at the bottom of the Equations w<strong>in</strong>dow• Click on the “y” <strong>and</strong> Select “Breathtest” from the drop-down menu as dependentvariable Select “2-ij”from the N-levels drop-down menu


2.3.3 Mult<strong>in</strong>omial responses Select “ID_loc” from the level 2(j) drop-down menu Select “ID_<strong>in</strong>d” from the level 1(i) drop-down menu Click “Done”• Click on the N <strong>in</strong> the Distribution statement for Breathtest <strong>and</strong> Check “Mult<strong>in</strong>omial” In the appear<strong>in</strong>g w<strong>in</strong>dow, leave “logit” checked Select “Unorderd” Leave “safe” as reference category Click “Done”• Click “Add Term” <strong>and</strong> select “Cons” from the variable drop down list• Click “add Separate coefficients”• Click “Done”To <strong>in</strong>clude the location-level <strong>in</strong>to the model, one has to let the <strong>in</strong>tercept varyr<strong>and</strong>omly across locations. In the ordered model, this was done for a jo<strong>in</strong>t<strong>in</strong>tercept. In the unordered model, we simply let the <strong>in</strong>tercepts that alreadydef<strong>in</strong>e the categories vary across locations.• Click on “cons.alarm ij ”• Check “k(id_loc_long)• Click “Done”• Do the same with “cons.positive ij ”In the unordered model, predictors are not assumed to have the same effect onall categories. Therefore separate coefficients have to be estimated for eachcategory.• Click on “Add Term” at the bottom of the Equations w<strong>in</strong>dow• Select “gender” from the variable drop down list• Select “male” as reference category• Click on “Add separate coefficients”• Click “Done”• Repeat the procedure to <strong>in</strong>clude predictor “age”o Select “age16-25” as reference category• Click “Start” to estimate the modelP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 4 9


Chapter 2Results <strong>and</strong> <strong>in</strong>terpretationIn the unordered category model for three categories, two contrasts function asdependent variables: The contrast between “safe” (the reference category) <strong>and</strong>“alarm”, <strong>and</strong> the contrast between “safe” <strong>and</strong> “positive”. The values that arepredicted by the model are the log-odds of these contrasts (“alarmsafe”:log(π2jk / π 1jk ) <strong>and</strong> “positive-safe”: log(π 3jk / π 1jk )). For each of these logodds, there is a full prediction model <strong>in</strong>clud<strong>in</strong>g fixed factors (age <strong>and</strong> gender)<strong>and</strong> a r<strong>and</strong>om factor (the location effects v 0k <strong>and</strong> v 1k ). The variation acrosslocations is given by Ω v which is now a matrix conta<strong>in</strong><strong>in</strong>g the variation of thecontrast “alarm-safe” across locations (upper left), the variation of the contrast“positive-safe” across locations (lower right), <strong>and</strong> the covariance (lower leftcorner). We can see that there is substantial covariance, <strong>in</strong>dicat<strong>in</strong>g that forlocations with a large contrast “alarm-safe”, the contrast “positive-safe” is alsolarge. The assumption that the probabilities to be <strong>in</strong> “alarm” <strong>and</strong> to be <strong>in</strong>“positive” vary across locations <strong>in</strong> the same way, lay at the basis of assum<strong>in</strong>g acommon r<strong>and</strong>om factor <strong>in</strong> the ordered model. The large covariance <strong>in</strong> thepresent model supports this assumption.The coefficients for gender <strong>and</strong> age show the same pattern for both contrasts:The negative coefficients for “female” <strong>in</strong>dicate that men have a higherprobability to be <strong>in</strong> “alarm” or “positive” respectively as compared to “safe”. Thepositive coefficients for the age categories listed <strong>in</strong>dicate that drivers <strong>in</strong> thoseage categories have higher probabilities to be <strong>in</strong> “alarm” or “positive” than theyoungest drivers. And for both contrasts it is the category of 40-54 year oldsthat received the highest coefficient.Although both sets of coefficients show exactly the same pattern, one mightnote that the coefficients for the contrast “positive-safe” all have somewhathigher values than those for the contrast “alarm-safe”. This would <strong>in</strong>dicate thatbe<strong>in</strong>g a man between 40 <strong>and</strong> 54 years is even a better predictor for be<strong>in</strong>g <strong>in</strong>category “positive” (i.e. hav<strong>in</strong>g a BAC of .08 or higher) than for be<strong>in</strong>g <strong>in</strong> category“alarm” (i.e., .05


2.3.3 Mult<strong>in</strong>omial responses• Click on “Model” <strong>in</strong> the top-menu bar• Select “Intervals <strong>and</strong> tests”• Check the radio-button <strong>in</strong> front of “fixed”• Type “1” beh<strong>in</strong>d “female.alarm”• Type “-1” beh<strong>in</strong>d “female.positive”• Click on “Calc”The result<strong>in</strong>g Chi-square value of 2.792 with 2 degrees of freedom correspondsto a probability of .095. We can conclude that the effect of be<strong>in</strong>g a woman doesnot differ significantly between the contrasts “alarm-safe” <strong>and</strong> “positive-safe”. Asfor the other coefficients the differences between both contrasts are smallerthan the one for the gender-coefficients, we can conclude that there is nosystematic difference between the effects of age <strong>and</strong> gender on the categories“alarm” <strong>and</strong> “positive” respectively. Aga<strong>in</strong>, the results of the unordered modelsupport the assumptions at the basis of the ordered proportional odds model.2.3.3.5. ConclusionIt has been shown how categorical responses can be analysed <strong>in</strong> a multilevelmult<strong>in</strong>omial model. Two different versions were presented. (1) The orderedproportional odds model is based on the assumptions that the responsecategories result from an underly<strong>in</strong>g cont<strong>in</strong>uous variable <strong>and</strong> that fixed <strong>and</strong>r<strong>and</strong>om effects therefore have the same shape for outcome. (2) The unorderedcategories model does not assume any systematic relation between thedifferent outcomes. Independent models are estimated for each outcome (<strong>in</strong>contrast to the reference category). It was shown that even for categorical datathat are expected to have a common underly<strong>in</strong>g variable it can be <strong>in</strong>terest<strong>in</strong>g toanalyse them <strong>in</strong> an unordered categorical model, as compar<strong>in</strong>g the <strong>in</strong>dependentprediction models for each category can <strong>in</strong>dicate whether an orderedproportional odds model is appropriate.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 5 1


2.3.4 CountsGeorge Yannis, Eleonora Papadimitriou <strong>and</strong> Constant<strong>in</strong>os Antoniou (NTUA)In this section, an example for fitt<strong>in</strong>g Poisson multilevel models is presentedus<strong>in</strong>g the MLw<strong>in</strong>N 2.01 software. The example concerns an <strong>in</strong>vestigation of theregional effect of speed<strong>in</strong>g <strong>and</strong> dr<strong>in</strong>k<strong>in</strong>g-<strong>and</strong>-driv<strong>in</strong>g enforcement on the numberof road accidents <strong>in</strong> Greece. The theoretical background, models fit <strong>and</strong> resultswere discussed <strong>in</strong> section 2.3.4 of the Methodology Report.The dataset <strong>in</strong>cludes accidents data, police enforcement data as well as otherdemographic data for the 49 counties <strong>and</strong> 12 regions of Greece for the period1998-2002. More specifically, the variables <strong>and</strong> values used are summarized <strong>in</strong>the follow<strong>in</strong>g Table:Region 1-12 regions of GreeceCounty 1-49 counties of GreeceAccs The number of accidents of each countyalcohol The number of alcohol controls of each county (1000 alcohol controls)Speed The number of speed <strong>in</strong>fr<strong>in</strong>gements of each county (1000 speed <strong>in</strong>fr<strong>in</strong>gements)logepop The natural logarithm of the population of each countyNatroad The proportion of National Roads of the road network of each countyVehown The vehicle ownership of each county (vehicles per 1000 <strong>in</strong>habitants)Cons The constant term (1)Open the dataset PoissonManualData3.ws us<strong>in</strong>g the Open Worksheet optionfrom the Files menu. Open<strong>in</strong>g the Names w<strong>in</strong>dow from the Data Manipulationmenu gives the follow<strong>in</strong>g:


2.3.4 CountsThe response variable (accs) <strong>in</strong> this dataset is the number (counts) of roadaccidents <strong>in</strong> various counties of Greece dur<strong>in</strong>g the period 1998 to 2002. Thedata were collected <strong>in</strong> 49 counties 1 ; these counties are <strong>in</strong>cluded <strong>in</strong>to 12 regions,giv<strong>in</strong>g two levels of data. As expla<strong>in</strong>ed <strong>in</strong> the Methodology Report (section2.3.4), count data are constra<strong>in</strong>ed to be non-negative, therefore we would preferto model the logarithms of the counts. We will therefore fit a Poisson model tothe count data us<strong>in</strong>g a log l<strong>in</strong>k function, which can be specified through thesoftware.In order to work with the accident rates rather than the accident counts, we usean additional parameter known as an offset. The variable logepop reflects theexpected number of accidents <strong>in</strong> each county, which is considered to beproportional to the population of each county, <strong>and</strong> will be used to create anoffset variable. It should be noted that, as a log l<strong>in</strong>k function is used for theresponse variable, the offset term should also be transformed accord<strong>in</strong>gly. Inthis dataset no transformation is required, as the variable logepop alreadycorresponds to the natural logarithm of the population of each county. However,the transformation could be carried out us<strong>in</strong>g the Comm<strong>and</strong> Interface optionfrom the Data Manipulation menu.The variables alcohol, speed, natroad <strong>and</strong> vehown also concern each county<strong>and</strong> shall be used as explanatory variables. These variables have beencentered around their mean, as recommended by the MLwiN users manual, <strong>in</strong>order to avoid computational <strong>in</strong>stabilities.We will start by fitt<strong>in</strong>g a simple (s<strong>in</strong>gle level) model <strong>and</strong> then proceed tomultilevel structure.2.3.4.1. A s<strong>in</strong>gle-level Poisson modelIn order to specify a simple (s<strong>in</strong>gle level) model <strong>in</strong> MLwiN:▪ Open the Equations w<strong>in</strong>dow from the Model menu <strong>and</strong> click on y▪ In the Y variable w<strong>in</strong>dow, select accs from the y: drop down list, select i-1from the N levels: drop down list <strong>and</strong> county from the level 1(i) drop down list,<strong>and</strong> click Done.1The Athens <strong>and</strong> Thessalonica metropolitan areas, where adisproportionally high number of accidents <strong>and</strong> police controls are observed,were not <strong>in</strong>cluded <strong>in</strong> the dataset.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 5 3


Chapter 2▪ Click on the N (ΩΧ, Β) that appears on the first l<strong>in</strong>e of the Equations w<strong>in</strong>dow,select Poisson from the available distributions <strong>and</strong> click Done.▪ Click on the (π i ) that appears on the second l<strong>in</strong>e of the Equations w<strong>in</strong>dow,select logepop as offset term <strong>and</strong> click Done.▪ Click on the Add Term button of the Equations w<strong>in</strong>dow <strong>and</strong> add cons, alcohol<strong>and</strong> speed to the model. By click<strong>in</strong>g on each of the terms <strong>in</strong> the Equationsw<strong>in</strong>dow, we can see that these are entered by default as fixed parameters.▪ Click on the Estimates button of the Equations w<strong>in</strong>dow <strong>and</strong> the parameters tobe estimated will be highlighted <strong>in</strong> blue. The Equations w<strong>in</strong>dow will now lookas follows:


2.3.4 CountsIt should be noted that, the last l<strong>in</strong>e <strong>in</strong> the Equations w<strong>in</strong>dow reflects thePoisson assumption that the variance of the response variable is equal to themean π i .▪ In order to set the estimation procedure, click on the Nonl<strong>in</strong>ear button of theEquations w<strong>in</strong>dow select distributional assumptions Poisson, l<strong>in</strong>earization 1 storder <strong>and</strong> estimation type MQL (Marg<strong>in</strong>al Quasi Likelihood) <strong>and</strong> click Done.▪ In order to run the model, click Start on the toolbar of the ma<strong>in</strong> w<strong>in</strong>dow. Wethen obta<strong>in</strong> the follow<strong>in</strong>g results:P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 5 5


Chapter 2These results are <strong>in</strong>tuitive (i.e. an <strong>in</strong>crease <strong>in</strong> speed<strong>in</strong>g <strong>and</strong> dr<strong>in</strong>k<strong>in</strong>g-<strong>and</strong>-driv<strong>in</strong>gcontrols results <strong>in</strong> a reduction of road accidents). In the next section, we will seehow this effect may vary when add<strong>in</strong>g more structure to the data.2.3.4.2. A two-level Poisson modelWe will now fit a two-level model, <strong>in</strong> order to <strong>in</strong>vestigate the regional variation ofthe effect of enforcement on the number of road accidents. We shall start withthe r<strong>and</strong>om <strong>in</strong>tercept model:▪ Remove the terms alcohol <strong>and</strong> speed from the model.▪ Click on accs <strong>in</strong> the Equations w<strong>in</strong>dow, select j-2 from the N levels: dropdown list <strong>and</strong> region from the level 2(j) drop down list, <strong>and</strong> click Done.▪ Click on the variable cons <strong>in</strong> the Equations w<strong>in</strong>dow <strong>and</strong> set cons to ber<strong>and</strong>om at the j(region) level.As there are only 12 regions at the higher level, it is recommended to use theRIGLS estimation method, which provides less biased estimates of the variancethan the IGLS when there is limited number of higher level units.▪ Select RIGLS from the Estimation menu of the ma<strong>in</strong> w<strong>in</strong>dow▪ Click the Start button on the toolbar of the ma<strong>in</strong> w<strong>in</strong>dow to run the model.The results are as follows:


2.3.4 CountsIn order to graphically represent the average <strong>in</strong>tercept <strong>and</strong> the r<strong>and</strong>om<strong>in</strong>tercepts:▪ Open the Predictions w<strong>in</strong>dow from the Model menu. The elements of themodel are arranged <strong>in</strong> two columns. These columns are <strong>in</strong>itially grayed out.We will build a prediction equation <strong>in</strong> the top of the w<strong>in</strong>dow, by select<strong>in</strong>g theelements we want from the bottom section.▪ Click on β 0 <strong>in</strong> order to select only the fixed part of the model.▪ Select c100 from the Output from prediction drop down list▪ Click the Calc button▪ Open the Customized Graphs w<strong>in</strong>dow from the Graphs menuP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 5 7


Chapter 2▪ In the plot what? tab, select c100 from the y drop down list, alcohol <strong>in</strong> the xdrop down list <strong>and</strong> l<strong>in</strong>e <strong>in</strong> the plot type drop down list.▪Click the Apply button to obta<strong>in</strong> a graph of the average (fixed) <strong>in</strong>terceptAccord<strong>in</strong>gly, <strong>in</strong> order to graphically represent the average <strong>in</strong>tercept <strong>and</strong> ther<strong>and</strong>om <strong>in</strong>tercepts:▪ Open the Predictions w<strong>in</strong>dow from the Model menu.▪ Click on β 0 <strong>and</strong> u 0j <strong>in</strong> order to select both the fixed <strong>and</strong> the r<strong>and</strong>om part of themodel.▪ Select c101 from the Output from prediction drop down list▪ Click the Calc button


2.3.4 Counts▪ Open the Customized Graphs w<strong>in</strong>dow from the Graphs menu▪ In the plot what? tab, select c101 from the y drop down list, alcohol <strong>in</strong> the xdrop down list, l<strong>in</strong>e <strong>in</strong> the plot type drop down list <strong>and</strong> region <strong>in</strong> the groupdrop down list.▪ In the plot style tab, select 16 rotate from the color drop down list.▪ In the other tab, select group code.▪ Click the Apply button to obta<strong>in</strong> a graph of the r<strong>and</strong>om <strong>in</strong>terceptsP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 5 9


Chapter 2We will now add a r<strong>and</strong>om slope to the model:▪▪▪Click on the Add Term button of the Equations w<strong>in</strong>dow <strong>and</strong> add alcohol tothe model.Click on the variable alcohol <strong>in</strong> the Equations w<strong>in</strong>dow <strong>and</strong> set alcohol to ber<strong>and</strong>om at the j(region) level.Click on the Start button to run the model


2.3.4 CountsThe results reveal a significant variance <strong>in</strong> the effect of alcohol controls on thenumber of accidents.However, it has been proved that the 1 st order MQL estimation method tends tooverestimate some of the variance <strong>in</strong> Poisson multilevel models. We willtherefore switch to the 2 nd order PQL (Penalized "Predictive" Quasi Likelihood),which is more accurate.▪ Click on the Nonl<strong>in</strong>ear button of the Equations <strong>and</strong> select l<strong>in</strong>earization 2 ndorder <strong>and</strong> estimation type PQL <strong>and</strong> click Done.▪Click on the More on the toolbar of the ma<strong>in</strong> w<strong>in</strong>dow button to run themodel.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 6 1


Chapter 2All fixed <strong>and</strong> r<strong>and</strong>om effects are statistically significant.In order to graphically represent the r<strong>and</strong>om slopes, we follow the processdescribed above for the r<strong>and</strong>om <strong>in</strong>tercepts. In this case, we should select β 0 , β 1 ,u 0j <strong>and</strong> u 1j <strong>in</strong> the Predictions w<strong>in</strong>dow of the Model menu <strong>and</strong> output fromprediction to another column. We then obta<strong>in</strong> the follow<strong>in</strong>g:In order to explore the residuals of the model, start<strong>in</strong>g by the level-1 residuals:


2.3.4 Counts▪ Open the Residuals w<strong>in</strong>dow from the Model menu▪ In the Sett<strong>in</strong>gs tab, write 300 (or any other appropriate column number) <strong>in</strong> thestart output box <strong>and</strong> click on Set columns. The boxes beneath this button arethen filled <strong>in</strong> gray with the column numbers that will be used for residualscalculations. Additionally, select 1-county fro the level: drop down list.▪ Click on the Calc button <strong>in</strong> the Residuals w<strong>in</strong>dow.▪ In the Plots tab, select the first option st<strong>and</strong>ardized residual * normal scores▪ Click on the Apply button.▪ Then <strong>in</strong> the Plots tab, select the fourth option st<strong>and</strong>ardized residual * fixedpart prediction▪ Click on the Apply button.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 6 3


Chapter 2We can see that the level-1 residuals present no significant deviation from theNormal distribution. Moreover, the residuals are <strong>in</strong>dependent from the predictedvalues.In order to explore the level-2 residuals, we will repeat the process describedabove, except that we will set start output at a different column number <strong>and</strong>select 2-region <strong>in</strong> the level: drop down list, <strong>in</strong> the Sett<strong>in</strong>gs tab of the Residualsw<strong>in</strong>dow.We will then obta<strong>in</strong> the follow<strong>in</strong>g results:


2.3.4 CountsThese results are less satisfactory compared to the level-1 results, as aconsequence of the limited number of higher level units.Accord<strong>in</strong>gly, the effect of speed enforcement on the number of accidents can beseparately exam<strong>in</strong>ed, by remov<strong>in</strong>g the variable alcohol from the model <strong>and</strong>add<strong>in</strong>g the variable speed, also allow<strong>in</strong>g it to r<strong>and</strong>omly vary between regions.The multilevel model fitted should be as follows:All fixed <strong>and</strong> r<strong>and</strong>om effects are statistically significant.In order to exam<strong>in</strong>e the comb<strong>in</strong>ed effect of speed<strong>in</strong>g <strong>and</strong> dr<strong>in</strong>k<strong>in</strong>g-<strong>and</strong>-driv<strong>in</strong>genforcement, we will add alcohol to the model, allow<strong>in</strong>g it to vary amongregions. We will obta<strong>in</strong> the follow<strong>in</strong>g results:P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 6 5


Chapter 2In this case, all fixed effects are significant; however, the covariances related tothe number of speed <strong>in</strong>fr<strong>in</strong>gements are non significant. This is quite surpris<strong>in</strong>g,when consider<strong>in</strong>g that both effects were significant when exam<strong>in</strong>ed separately.It is therefore <strong>in</strong>dicated that there is some bias <strong>in</strong> the model. This is alsoidentified when plott<strong>in</strong>g the predicted values with alcohol <strong>and</strong> speed.In order to plot the effects of alcohol, follow the process described above, butselect <strong>in</strong> the Predictions w<strong>in</strong>dow only the β 0j , β 2j, u 0j <strong>and</strong> u 2j effects (i.e. cons <strong>and</strong>alcohol) <strong>and</strong> another column (e.g. c106) <strong>in</strong> the output from prediction drop downlist. Accord<strong>in</strong>gly, <strong>in</strong> order to plot the effects of speed, follow the same process,but select <strong>in</strong> the Predictions w<strong>in</strong>dow only the β 0j , β 1j, u 0j <strong>and</strong> u 1j effects (i.e. cons<strong>and</strong> speed) <strong>and</strong> another column (e.g. c107) <strong>in</strong> the output from prediction dropdown list. The two plots should be as follows:We can see that there are several regions for which the slopes are counter<strong>in</strong>tuitive.We should exam<strong>in</strong>e whether the two variables are correlated.


2.3.4 Counts▪ Open the Averages <strong>and</strong> correlations w<strong>in</strong>dow from the Basic Statistics menu▪ Select Correlation <strong>in</strong> the Operation tab▪ Click on alcohol <strong>and</strong> speed <strong>in</strong> the variables list▪ Click CalculateThe results will appear <strong>in</strong> the Output w<strong>in</strong>dow as follows:We can see that there is a positive correlation of 0,729 between the variablesspeed <strong>and</strong> alcohol, <strong>in</strong>dicat<strong>in</strong>g multicoll<strong>in</strong>earity. This expla<strong>in</strong>s to some degree theconfus<strong>in</strong>g <strong>modell<strong>in</strong>g</strong> results (see section 2.3.4 of the Methodology Report formore <strong>in</strong>formation on multicoll<strong>in</strong>earity effects).A two-level extra-Poisson model (with overdispersion)P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 6 7


Chapter 2Another issue that needs to be exam<strong>in</strong>ed <strong>in</strong> Poisson models is overdispersion(see section 2.3.4 of the Methodology Report). In order to fit a multilevel modelwith overdispersion, an extra-Poisson distribution is assumed:▪ Create a two-level model <strong>in</strong>clud<strong>in</strong>g only a constant term <strong>in</strong> the Equationsw<strong>in</strong>dow, as described previously▪ Click on the Nonl<strong>in</strong>ear button of the Equations w<strong>in</strong>dow <strong>and</strong> selectdistributional assumptions extra Poisson, l<strong>in</strong>earization 2 nd order <strong>and</strong>estimation type PQL <strong>and</strong> click Done.An additional term to be estimated (i.e. the dispersion parameter) will appear <strong>in</strong>the bottom l<strong>in</strong>e of the Equations w<strong>in</strong>dow, allow<strong>in</strong>g for the mean / variancerelationship to be different than 1. Runn<strong>in</strong>g the model should give the follow<strong>in</strong>gresults:The results <strong>in</strong>dicate that there is overdispersion <strong>in</strong> the data, as the dispersionparameter estimated is highly significant.Add<strong>in</strong>g alcohol to the model gives the follow<strong>in</strong>g results.


2.3.4 CountsIn this case, the dispersion parameter is lower (but also significant), <strong>in</strong>dicat<strong>in</strong>gthat the explanatory variable has accounted for a part of the overdispersion.2.3.4.3. A two-level negative b<strong>in</strong>omial modelAs expla<strong>in</strong>ed <strong>in</strong> the Methodology Report (section 2.3.4), another option fordeal<strong>in</strong>g with overdispersion <strong>in</strong> count data is to assume a Negative B<strong>in</strong>omialdistribution, which <strong>in</strong>cludes a more complex variance structure, allow<strong>in</strong>g thusmore flexibility. In order to fit a Negative B<strong>in</strong>omial model <strong>in</strong> the data:▪ Create a two-level model <strong>in</strong>clud<strong>in</strong>g only a constant term <strong>in</strong> the Equationsw<strong>in</strong>dow, as described previously▪ Click on the N (ΩΧ, Β) that appears on the first l<strong>in</strong>e of the Equations w<strong>in</strong>dow,select Negative B<strong>in</strong>omial from the available distributions <strong>and</strong> click Done.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 6 9


Chapter 2In the bottom l<strong>in</strong>e of the Equations w<strong>in</strong>dow, the mean / variance relationship isdisplayed, <strong>in</strong> which the variance is a quadratic function of the mean. Runn<strong>in</strong>gthe model should give the follow<strong>in</strong>g results:Add<strong>in</strong>g alcohol to the model gives the follow<strong>in</strong>g results.These results are very similar to the Extra-Poisson model, <strong>in</strong> terms of both fixed<strong>and</strong> r<strong>and</strong>om parameter estimates. It is therefore shown that both Extra-Poisson


2.3.4 Counts<strong>and</strong> Negative B<strong>in</strong>omial distributional assumptions can efficiently h<strong>and</strong>leoverdispersion <strong>in</strong> count data.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 7 1


2.4 Longitud<strong>in</strong>al dataHeike Martensen & Emmanuelle Dupont (IBSR)The example data used <strong>in</strong> this section are a set of simulated data for 500beg<strong>in</strong>n<strong>in</strong>g drivers for which a driv<strong>in</strong>g-skill score was simulated for 7 consecutiveyears. Moreover, for each occasion an experience value was generated. Thisvalue was always “0” at the first measurement occasion <strong>and</strong> corresponded tothe cumulative number of km driven for all the others. The variable “<strong>in</strong>itial age”<strong>in</strong>dicates for each driver the age at which they acquired their licences.Data loadOpen the file DRIVING SKILL.xls. Like most repeated measures tables, thedata are coded <strong>in</strong> a format that will be called “wide” here. This means that thereis one row for each subject with all measurements <strong>in</strong> it. Below a section of theorig<strong>in</strong>al wide table is shown. To the right, the table cont<strong>in</strong>ues with exp3/skill3 toexp6/skill6, below there are more subjects than visible here.To analyse these data <strong>in</strong> MLwiN, they have to be imported (see section 2.2 for<strong>in</strong>structions to paste data from Excel <strong>in</strong>to MLwiN) <strong>and</strong> then the wide table mustbe converted <strong>in</strong>to a long table. This means that all measurements are noted <strong>in</strong> along list below each other. Only one variable “experience” <strong>and</strong> one variable“skill” are present <strong>and</strong> the measurement occasion is <strong>in</strong>dicated by a thirdvariable. An example for a long table is given here:


2.4 Longitud<strong>in</strong>al dataTo create a long table <strong>in</strong> MLwiN, click on “Data Manipulation” <strong>in</strong> the top menu<strong>and</strong> select “Split record” <strong>and</strong> fill the dialogue w<strong>in</strong>dow <strong>in</strong> as shown below.Click on “Split” to conduct build the long table. Then click on “DataManipulation” <strong>and</strong> select “Names”. Change the names of C18, C19, <strong>and</strong> C20 byselect<strong>in</strong>g each of these numbers from the list, typ<strong>in</strong>g their new name <strong>in</strong>to the topframe <strong>and</strong> press<strong>in</strong>g return. C18 is the collection of the experience scores. Call it“experience”. C19 is the collection of the skill scores <strong>and</strong> should be called “skill”.C20 conta<strong>in</strong>s an <strong>in</strong>dication from which measurement occasion the data come.This variable, call it “occasion”, has a value from 1 to 7 (1 for skill0, 2 for skill1,… etc.).As described <strong>in</strong> the methodology report, rather than simply <strong>in</strong>dicat<strong>in</strong>g thenumber of the measurements, the <strong>time</strong> elapsed should be coded. This meansthat values should run from 0 to 6 rather than from 1 to 7. In order to do so, clickon “Data Manipulation” <strong>in</strong> the top menu bar <strong>and</strong> select “Calculate”. Fill thedialogue w<strong>in</strong>dow as shown below:P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 7 3


Chapter 2Than open the “Names” w<strong>in</strong>dow <strong>and</strong> rename the newly generated variable <strong>in</strong>to“telaps” (short for “<strong>time</strong> elapsed”).The long table should conta<strong>in</strong> 3500 cases <strong>and</strong> consist of the follow<strong>in</strong>g variables:subID Identification number of the driver tested.<strong>in</strong>itialAge Age at obta<strong>in</strong>ment of drivers licences.iacen Initial age centred to its mean.experience Number of km driven (<strong>in</strong> 1000).skill Number between 0 <strong>and</strong> 15 <strong>in</strong>dicat<strong>in</strong>g the driv<strong>in</strong>g skill.telaps Time elapsed: 0 for first test, 1 – 6 for tests at consecutiveyears.2.4.1.1. The empty two-level modelIn the case of repeated measurements, the simplest model is the empty twolevelmodel. The first level is that of the s<strong>in</strong>gle driv<strong>in</strong>g skill scores. These skillscores are nested with<strong>in</strong> subjects, as there are 7 skill scores from each person.The repeated measures structure is therefore def<strong>in</strong>ed at the second level.Model formulationDef<strong>in</strong>e the dependent variable:• Click on the “y” <strong>and</strong> Select “skill” from the drop-down w<strong>in</strong>dow as dependentvariable Select “1-i”from the N-levels drop-down w<strong>in</strong>dow Select “subID” from the level 1(j) drop-down w<strong>in</strong>dow Click “done”Then def<strong>in</strong>e the dependent variable (skill) as vary<strong>in</strong>g over two r<strong>and</strong>om factors,namely the <strong>in</strong>dividual measurements (telaps) <strong>and</strong> the subject they are takenfrom (subID)


2.4 Longitud<strong>in</strong>al data• Click on the dependent variable <strong>and</strong> def<strong>in</strong>e subID as the second level asshown below.Then def<strong>in</strong>e a variance component model by allow<strong>in</strong>g the <strong>in</strong>tercept to varyr<strong>and</strong>omly across locations.• Click on the <strong>in</strong>tercept• Check the box j(subID)• Press “Done”• Press “Start” to estimate the parametersResults <strong>and</strong> InterpretationYour Equations w<strong>in</strong>dow should now look like this.The with<strong>in</strong>-subject variation is <strong>in</strong>dicated by σ 2 e <strong>and</strong> the variation betweensubjects by σ 2 u0. Note that the results suggest that the differences betweenrepeated driv<strong>in</strong>g tests for each participant are larger than those acrossparticipants. The mean <strong>in</strong>tercept, β 0j , <strong>in</strong>dicates that over all subjects <strong>and</strong> all<strong>time</strong>s of test<strong>in</strong>g the mean skill score is 6.543.Graphic <strong>in</strong>spection of the residuals: Level 1To <strong>in</strong>spect the residuals of the model,• Click on Model <strong>in</strong> the top menu bar• Select ResidualsP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 7 5


Chapter 2• Click “Calc” to calculate the Level 1 residuals• Select “Plot” at the top of the “Residuals” w<strong>in</strong>dow• Check radio-button <strong>in</strong> front of “st<strong>and</strong>ardised residual <strong>and</strong> normal score”• Click “Apply”Graphic <strong>in</strong>spection of the residuals: Level 2• Go back to the “Sett<strong>in</strong>gs” part of the “Residuals” w<strong>in</strong>dow• Select “2:subID” from the “level” dropdown list• Click “Calc” to calculate the Level 2 residuals• Select “Plot” at the top of the Residuals w<strong>in</strong>dow


2.4 Longitud<strong>in</strong>al data• Click “Apply”The residuals are satisfy<strong>in</strong>gly close to a normal distribution (that would havebeen <strong>in</strong>dicated by a straight l<strong>in</strong>e).2.4.1.2. A Two-level variance component model with on predictorIn the empty model there was more variation with<strong>in</strong> subjects (as <strong>in</strong>dicated by σ 2 e) than between subjects. To estimate the proportion of the with<strong>in</strong> subjectsvariation that can be attributed to the <strong>time</strong> elapsed after acquisition of theirdrivers licence, “telaps” is used as a predictor.Model formulation• Click “Add Term” at the bottom of the Equations w<strong>in</strong>dow• Select “telaps” from the “Variable” drop-down w<strong>in</strong>dow• Click “Done”• To estimate this model press “Start”.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 7 7


Chapter 2Results <strong>and</strong> InterpretationBecause “telaps” is coded to be zero at the <strong>time</strong> of acquirement of the driv<strong>in</strong>glicence, the mean <strong>in</strong>tercept β 0j <strong>in</strong>dicates the average skill score at that moment.The coefficient for “telaps” <strong>in</strong>dicates that on average the skill score <strong>in</strong>creases byhalf a po<strong>in</strong>t each year that a driver has his/her licence. Note that the with<strong>in</strong>subject variance σ 2 e is reduced as compared to the null model, suggest<strong>in</strong>g thatthe measurements for each participant changed over <strong>time</strong>. However, thebetween subject variance σ 2 u0 as well, suggest<strong>in</strong>g that participants varied <strong>in</strong> theeffect that telaps had for tham. F<strong>in</strong>ally, the decrease of the deviance(loglikelihood) confirms that the model with “telaps” fits better than the onewithout.2.4.1.3. Two-level r<strong>and</strong>om <strong>in</strong>tercept model with two predictorsThe question “treated” <strong>in</strong> this simulated study is whether it is the number ofyears pass<strong>in</strong>g or rather the <strong>in</strong>crease of experience that make older drivers lessaccident prone than younger ones. To <strong>in</strong>vestigate this, the driv<strong>in</strong>g experience(measured <strong>in</strong> 1000 km driven) is taken up <strong>in</strong>to the model <strong>in</strong> parallel with the <strong>time</strong>elapsed (telaps).Model formulation• Click “Add Term” at the bottom of the Equations w<strong>in</strong>dow• Select “experience” from the “Variable” drop-down w<strong>in</strong>dow• Click “Done”• To estimate this model press “Start”


2.4 Longitud<strong>in</strong>al dataResults <strong>and</strong> InterpretationThe coefficient for the variable “experience” is highly significant as it exceeds itsown st<strong>and</strong>ard error several <strong>time</strong>s. The coefficient for “telaps” however, is notsignificant anymore. This might be confus<strong>in</strong>g, especially when not<strong>in</strong>g thatoverall the model has clearly <strong>in</strong>creased <strong>in</strong> fit (to test the significance of thedifference between the two deviances, 162.43, use “basic statistics”, “tailareas”). The change <strong>in</strong> the predictor for “telaps” is due to the fact that the <strong>time</strong>elapsed s<strong>in</strong>ce one has acquired his driver’s licence <strong>and</strong> the number ofkilometres one has driven are two related measures. A significant coefficient<strong>in</strong>dicates that a proportion of the variance can be attributed to the predictorexclusively. The fact that “telaps” is not significant anymore when taken uptogether with “experience” suggests, that all the variance that “telaps” expla<strong>in</strong>edcan be expla<strong>in</strong>ed by “experience” as well. Note however, that the reverse is notthe case: “experience” is significant even if it is taken up jo<strong>in</strong>tly with “telaps”,<strong>in</strong>dicat<strong>in</strong>g that there is a proportion of variance <strong>in</strong> the driv<strong>in</strong>g skills that can beuniquely attributed to “experience”.2.4.1.4. A Two-level r<strong>and</strong>om <strong>in</strong>tercept model with predictor experienceonlyBecause the predictor “telaps” (i.e. the <strong>time</strong> elapsed s<strong>in</strong>ce acquirement of thedriv<strong>in</strong>g licence) is not significant anymore one “experience” is taken up <strong>in</strong>to theequation, this term can be dropped <strong>and</strong> “experience” rema<strong>in</strong>s <strong>in</strong> the modelequation.Model formulation• Click the term “telaps” <strong>in</strong> the model equation• Click on “delete Term”• To estimate this model press “Start”P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 7 9


Chapter 2Results <strong>and</strong> InterpretationRemov<strong>in</strong>g “telaps” from the model has <strong>in</strong>creased the deviance by only 0.55.With 1 degree of freedom (one parameter that needs to be estimated less), thisCHI-2 value has a probability of 0.76 which is not significant at all. Indeed, allthe variance <strong>in</strong> the “skill” scores that can be expla<strong>in</strong>ed by <strong>time</strong> elapsed (telaps),can be expla<strong>in</strong>ed by “experience” as well.2.4.1.5. A Two-level r<strong>and</strong>om <strong>in</strong>tercept r<strong>and</strong>om slope modelThe coefficient for “experience” <strong>in</strong>dicates that generally driv<strong>in</strong>g skills improvewith an <strong>in</strong>crease of km driven. To test whether this <strong>in</strong>crease is the same for allparticipants we will allow the slope of experience to vary r<strong>and</strong>omly across thelevel-2 units, i.e. the participants.Model formulation• Click on the Length• Check the box j(IDlocation)• Estimate the parameters by click<strong>in</strong>g on “Start”


2.4 Longitud<strong>in</strong>al dataResults <strong>and</strong> InterpretationBoth, the <strong>in</strong>tercept β 0j <strong>and</strong> the coefficient of “Experience”, β 1j , are now vary<strong>in</strong>gacross subjects with the means <strong>in</strong>dicated <strong>in</strong> the second <strong>and</strong> third equation.The between subject variance has now become a variance-covariance matrixΩ u : The upper left number is σ 2 u0, the variance of the <strong>in</strong>tercepts across subjects.It <strong>in</strong>dicates how much the average driv<strong>in</strong>g score varies between them. Thelower right number is σ 2 u1, the variance of the “experience”- coefficient acrosslocations. It shows that the relation between “experience” <strong>and</strong> “skill” variesbetween participants. The lower left number is σ u01 , the covariance between thetwo, <strong>in</strong>dicat<strong>in</strong>g that subjects with a higher <strong>in</strong>tercept (i.e. the average “skill”) havesteeper slopes (i.e. a stronger <strong>in</strong>crease of skills with experience).The deviance decreased by 85 as opposed to the variance component model,<strong>in</strong>dicat<strong>in</strong>g that there is <strong>in</strong>deed substantial variation <strong>in</strong> the size of the experienceeffect,which also is apparent <strong>in</strong> the fact that the variance of the slope, σ 2 u1, issignificant.Graphic <strong>in</strong>spection of the model predictionsTo plot the model predictions, they have to be saved as a new variable first.• Select “Model” <strong>in</strong> the top menu bar <strong>and</strong> click on “Predictions”• Click on all parameters <strong>in</strong> the lower half of the appear<strong>in</strong>g dialogue w<strong>in</strong>dow(so that they turn from grey to black)• Select an empty column (e.g. c20) for “output from prediction to”• Click on “Calc”• Now you can close the “Predictions” w<strong>in</strong>dowP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 8 1


Chapter 2Now a variable that conta<strong>in</strong>s the predictions of the model is created, which canbe plotted aga<strong>in</strong>st the “experience”. Open the graph dialogue by click<strong>in</strong>g on“Graph” <strong>in</strong> the top menu bar <strong>and</strong> by select<strong>in</strong>g “Customized Graphs”. Fill <strong>in</strong> asshown below (all changes have to be made <strong>in</strong> the drop-down lists on the righth<strong>and</strong>side <strong>and</strong> appear automatically <strong>in</strong> the left-h<strong>and</strong> side table)• Press “Apply” to create the graph. (You can close the Dialogue W<strong>in</strong>dowthen.)The result<strong>in</strong>g graph shows separate regression l<strong>in</strong>es for each location.The graph shows a “the rich get richer”effect: Those drivers that start at a highlevel also improve more with additional driv<strong>in</strong>g experience. In the model output,this effect becomes apparent <strong>in</strong> the covariance between <strong>in</strong>tercept <strong>and</strong> slope,σ u01 .2.4.1.6. Add<strong>in</strong>g a level-two predictorThe beg<strong>in</strong>n<strong>in</strong>g level of driv<strong>in</strong>g skills <strong>and</strong> the effect of “experience” varysystematically across participants. It is therefore sensible to search forpredictors at the second level (here the subject level) to expla<strong>in</strong> these


2.4 Longitud<strong>in</strong>al datavariations. As an example we will <strong>in</strong>clude the variable “iacen”, the <strong>in</strong>itial age (ia)centered to its mean (cen).Model formulation• Click “Add Term” at the bottom of the Equations w<strong>in</strong>dow• Select “iacen” from the “Variable” drop-down w<strong>in</strong>dow• Click “Done”• To estimate this model press “Start”Results <strong>and</strong> InterpretationThe coefficient of “iacen” is marg<strong>in</strong>ally significant (Z= 1.88; p=.060). Its positivevalue would <strong>in</strong>dicate that drivers who acquired their driv<strong>in</strong>g licences at a higherage tend to have higher skill scores.2.4.1.7. Add<strong>in</strong>g a cross-level <strong>in</strong>teractionAs the next step it will be tested whether the <strong>in</strong>itial age (iacen) modifies theeffect of “experience”. To do this, the <strong>in</strong>teraction between these two varis<strong>in</strong>cluded <strong>in</strong>to the model.Model formulation• Click on “Add term” <strong>and</strong>• Include the <strong>in</strong>teraction between “iacen” <strong>and</strong> “experience” Select order 1 (this means it’s a first-order <strong>in</strong>teraction) Select “iacen” as first variable Select “experience” as secondP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 8 3


Chapter 2 Click “Done”• Estimate the model by press<strong>in</strong>g “Start”Results <strong>and</strong> InterpretationTak<strong>in</strong>g up the <strong>in</strong>teraction term leaves the already marg<strong>in</strong>al coefficient of “iacen”non-significant. The <strong>in</strong>teraction between <strong>in</strong>itial age (iacen) <strong>and</strong> “experience”itself, however, is significant. Its positive coefficient <strong>in</strong>dicates that drivers whoacquired their licence at a later age improved more per 1000 km driven thanthose who acquired their licence at an earlier age. (Please remember that allthese “conclusions” are based on simulated data <strong>and</strong> generated for thismanual).2.4.1.8. ConclusionIn this chapter it was demonstrated how to use a two-level model to analyserepeated measurements taken from a group of participants. It was shown thatthe first level <strong>in</strong>dicates the variation between measurements taken from thesame subject, while the second level conta<strong>in</strong>s variation between subjects.Accord<strong>in</strong>gly, variables that vary with<strong>in</strong> subject across measurements (e.g. <strong>time</strong>or growth variables) should be <strong>in</strong>cluded at level one, while variables thatcharacterise <strong>in</strong>dividuals should be taken up as level-two variables. In the caseof a repeated measurements <strong>analysis</strong>, a cross-level <strong>in</strong>teraction then <strong>in</strong>dicateshow person-characteristics can modify the effect of <strong>time</strong>- or growth variables.


2.5 Multivariate modelsGeorge Yannis, Eleonora Papadimitriou <strong>and</strong> Costas Antoniou (NTUA)In this section, an example for fitt<strong>in</strong>g multivariate multilevel models is presentedus<strong>in</strong>g the MLw<strong>in</strong>N 2.01 software. The example concerns an <strong>in</strong>vestigation of theregional effect of dr<strong>in</strong>k<strong>in</strong>g-<strong>and</strong>-driv<strong>in</strong>g enforcement on the number of roadaccidents <strong>and</strong> related persons killed <strong>in</strong> Greece. The theoretical background,models fit <strong>and</strong> results were discussed <strong>in</strong> section 2.5 of the Methodology report.It is noted that <strong>in</strong> section 2.5 of the Methodology Report two model formulationswere def<strong>in</strong>ed <strong>and</strong> presented: a normal bivariate multilevel model <strong>and</strong> a hybridnormal-poisson bivariate multilevel model. However, only the latter isdemonstrated <strong>in</strong> this section, as this formulation was proved to be more efficient<strong>in</strong> the estimation of the models. However, apart from the different level-1distributional assumptions (see section 2.5. of the Methodology Report) thesame process would be followed for fitt<strong>in</strong>g the first formulation as well."The dataset <strong>in</strong>cludes data on accidents <strong>and</strong> persons killed, as well as alcohol<strong>and</strong> speed police controls data for the 49 counties <strong>and</strong> 12 regions of Greece forthe period 1998-2002. Part of this dataset was also used <strong>in</strong> section 2.3.4 of theMethodology report (multilevel models for count data) <strong>and</strong> <strong>in</strong> the relateddemonstration for the Manual. In this section, a variable correspond<strong>in</strong>g to thenumber of persons killed was also <strong>in</strong>cluded.More specifically, the variables <strong>and</strong> values used are summarized <strong>in</strong> thefollow<strong>in</strong>g Table:Region 1-12 regions of GreeceCounty 1-49 counties of GreeceAccidents The number of accidents of each countyKilled The number of persons killed of each countyalcohol The number of alcohol controls of each county (1000 alcoholcontrols)Speed The number of speed <strong>in</strong>fr<strong>in</strong>gements of each county (1000 speed<strong>in</strong>fr<strong>in</strong>gements)logepop The natural logarithm of the population of each countyCons The constant term (1)It is rem<strong>in</strong>ded that the counties of Athens <strong>and</strong> Thessalonica (large metropolitanareas with disproportionally high numbers of road accidents, persons killed <strong>and</strong>police controls) are not <strong>in</strong>cluded <strong>in</strong> the dataset. It is also rem<strong>in</strong>ded that theexplanatory variables (alcohol <strong>and</strong> speed controls) are centered around theirmean to avoid numerical problems <strong>in</strong> the estimations.▪ Open the dataset MultivariateManualData2.ws us<strong>in</strong>g the Open Worksheetoption from the Files menu. Open<strong>in</strong>g the Names w<strong>in</strong>dow from the DataManipulation menu gives the follow<strong>in</strong>g:


Chapter 2As described <strong>in</strong> the multivariate multilevel models Methodology report, the tworesponses will be treated as 2 nd level group<strong>in</strong>g <strong>and</strong> the actual values of bothresponses will be treated as 1 st level units. In order to def<strong>in</strong>e the bivariatestructure:▪ Click on the Responses button <strong>in</strong> the Equations w<strong>in</strong>dow▪ In the Specify responses w<strong>in</strong>dow, click on accidents <strong>and</strong> killed <strong>and</strong> then clickDone:The Equations w<strong>in</strong>dow should now look like this:


2.5 Multivariate modelsMoreover, <strong>in</strong> the Names w<strong>in</strong>dow, we can see that two new variables werecreated; the variable the variable resp_<strong>in</strong>dicator, which is a b<strong>in</strong>ary variableseparat<strong>in</strong>g the two responses (i.e. <strong>in</strong>dicat<strong>in</strong>g to which response the current datarow applies to), <strong>and</strong> resp, which conta<strong>in</strong>s the respective actual values of the tworesponses.Note also that the variable resp <strong>in</strong>cludes exactly twice the number of entries ofeach response, i.e. 2*245=490. Moreover, the m<strong>in</strong>imum <strong>and</strong> maximum valuesof this variable are the m<strong>in</strong>imum <strong>and</strong> maximum values of the grouped values ofboth responses.However, so far we have specified a s<strong>in</strong>gle level model. In order to def<strong>in</strong>e themultivariate two-level structure, we should specify that counties are nestedwith<strong>in</strong> responses.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 8 7


Chapter 2▪ Click on resp1 or resp2 <strong>in</strong> the Equations w<strong>in</strong>dow▪ In the Y variable w<strong>in</strong>dow, select 2-ij from the N levels: drop down list <strong>and</strong>county from the level 2(j) drop down list, <strong>and</strong> click Done.If we click on resp1 or resp2 aga<strong>in</strong>, we can see that the county has beenreplaced by a new variable county_long (see below picture on the right).The Equations w<strong>in</strong>dow should now look like this:It is <strong>in</strong>terest<strong>in</strong>g to see how the multilevel <strong>modell<strong>in</strong>g</strong> properties are exploited tobuild the multivariate structure out of the <strong>in</strong>itial dataset.▪ From the Data Manipulation menu on the ma<strong>in</strong> toolbar, select View or EditData.▪ Click on the view button <strong>in</strong> the Data w<strong>in</strong>dow <strong>and</strong> select: county, accidents,killed, resp_<strong>in</strong>dicator, resp <strong>and</strong> county_long.


2.5 Multivariate modelsWe can see how the resp_<strong>in</strong>dicator variable separates the two responses, whiletheir respective values are stored <strong>in</strong> a s<strong>in</strong>gle column (resp). Moreover, the newvariable county_long is the 2 nd level group<strong>in</strong>g variable. This demonstration fullycorresponds to the general theoretical multivariate structure presented <strong>in</strong> Table2.5.1 of section 2.5 of the Methodology report.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 8 9


Chapter 2Before we proceed <strong>in</strong> fitt<strong>in</strong>g the multivariate model, the distributionalassumptions of the two responses should be specified. As discussed <strong>in</strong> section2.3.4, the counts of road accidents <strong>and</strong> persons killed are r<strong>and</strong>om counts ofevents occurr<strong>in</strong>g with<strong>in</strong> a population <strong>and</strong> consequently they can only takepositive <strong>in</strong>teger values. Therefore, a Poisson distribution is assumed <strong>and</strong> a logl<strong>in</strong>k function should be used together with an appropriate offset term.▪ Click on the N (ΩΧ, Β) that appears for each response of the Equationsw<strong>in</strong>dow, select Poisson from the available distributions <strong>and</strong> click Done.▪ Click on the (π i ) that appears for each response <strong>in</strong> the Equations w<strong>in</strong>dow,select logepop as offset term <strong>and</strong> click Done.The Equations w<strong>in</strong>dow should now look like this:


2.5 Multivariate modelsIn section 2.3.4 of the Methodology report, it was shown that overdispersionwas present <strong>in</strong> the accidents data <strong>and</strong> that extra-Poisson or Negative B<strong>in</strong>omialdistributional assumptions would be required <strong>in</strong> order to h<strong>and</strong>le this unexpla<strong>in</strong>edvariation. As Negative B<strong>in</strong>omial responses are not available <strong>in</strong> this latest versionof the software, we will model the two responses by assum<strong>in</strong>g extra-Poissondistributions (for details see section 2.3.4 of the Methodology report).▪ Click on the Nonl<strong>in</strong>ear button of the Equations w<strong>in</strong>dow <strong>and</strong> selectDistributional assumptions extra Poisson.We will now enter variables <strong>in</strong> the model, start<strong>in</strong>g by an <strong>in</strong>tercept term.▪ Click on the Add Term button of the Equations w<strong>in</strong>dow▪ In the Specify term w<strong>in</strong>dow, select cons from the variable drop-down list <strong>and</strong>click on Add Separate coefficients.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 9 1


Chapter 2▪ Click on the Estimates button of the Equations w<strong>in</strong>dow <strong>and</strong> the parameters tobe estimated will be highlighted <strong>in</strong> blue. The Equations w<strong>in</strong>dow will now lookas follows:First of all, we can see that the coefficients β 0 <strong>and</strong> β 1 are fixed by default (i.e.the option of r<strong>and</strong>om variation is not available when click<strong>in</strong>g on the related termof the model). This is due to the fact that, as expla<strong>in</strong>ed <strong>in</strong> section 2.3.4 of theMethodology report, no r<strong>and</strong>om structure can be def<strong>in</strong>ed at the lowest level of aPoisson model, as the level-1 variance is assumed to be equal to the mean,<strong>and</strong> therefore known. In this case, though, the lowest level of the Poissonvariables is level-2 of the multivariate model.Moreover, a covariance matrix for the two responses is created. In this matrix,two dispersion parameters α are to be estimated, one for each response, <strong>in</strong>order to fit extra-Poisson models, <strong>in</strong> which the variance-mean equalityassumption is relaxed. F<strong>in</strong>ally, a covariance ρ between the two responses willbe estimated.▪ Click on the Estimation Control button of the ma<strong>in</strong> Toolbar <strong>and</strong> select RIGLS▪ Click Start to run the model. The results are as follows:


2.5 Multivariate modelsThe <strong>in</strong>tercept terms of the two responses are both highly significant. Moreover,a significant between-response covariance (ρ) <strong>in</strong>dicates that more roadaccidents per county correspond to more persons killed per county. Thesignificant dispersion parameters (α) of the two responses <strong>in</strong>dicate that theextra-Poisson distributional assumptions adopted were reasonable.We will now <strong>in</strong>troduce a (fixed) slope term <strong>in</strong> the model term.▪ Click on the Add Term button of the Equations w<strong>in</strong>dow▪ In the Specify term w<strong>in</strong>dow, select alcohol from the variable drop-down list<strong>and</strong> click on Add Separate coefficients.▪ Click More to run the model. The results are as follows:These results <strong>in</strong>dicate that the effect of alcohol enforcement (alcohol.accidents<strong>and</strong> alcohol.killed) is significant both for the number of accidents <strong>and</strong> for thenumber of persons killed.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 9 3


Chapter 2We will now proceed <strong>in</strong> build<strong>in</strong>g a three-level model, <strong>in</strong> order to <strong>in</strong>vestigateregional effects. This third level is created above the exist<strong>in</strong>g 2 nd level, whichcorresponds to the counts of the two responses at the county-level.▪ Click on resp1 or resp2 <strong>in</strong> the Equations w<strong>in</strong>dow▪ In the Y variable w<strong>in</strong>dow, select 3-ijk from the N levels: drop down list <strong>and</strong>region from the level 3(k) drop down list, <strong>and</strong> click Done.If we click on resp1 or resp2 aga<strong>in</strong>, we can see that the region has beenreplaced by a new variable region_long (see below picture on the right), <strong>in</strong> orderto comply to the multivariate structure specified previously.It is <strong>in</strong>terest<strong>in</strong>g to note that, <strong>in</strong> the Names w<strong>in</strong>dow, several new variables havebeen created (cons.accidents, cons.killed, alcohol.accidents, alcohol.killed), asa result of the previous two-level model<strong>in</strong>g, <strong>in</strong> order to def<strong>in</strong>e the <strong>in</strong>tercept <strong>and</strong>slope terms for each one of the responses. The additions of the 3 rd levelresulted <strong>in</strong> the creation of the variable region_long.


2.5 Multivariate modelsDelete the terms alcohol.accidents <strong>and</strong> alcohol.killed from the models <strong>in</strong> theEquations w<strong>in</strong>dow, <strong>in</strong> order to fit a r<strong>and</strong>om <strong>in</strong>tercept model.▪ Click on cons.accidents <strong>in</strong> the Equations w<strong>in</strong>dow▪ In the X variable w<strong>in</strong>dow, click <strong>in</strong> the k(region_long) box to specify ther<strong>and</strong>om variation among regions▪ Repeat for cons.killedNote that now a 3 rd level covariance matrix is also to be estimated, <strong>in</strong>clud<strong>in</strong>g thelevel-3 variances of the two <strong>in</strong>tercepts <strong>and</strong> their covariance. When runn<strong>in</strong>g themodel, the follow<strong>in</strong>g output is produced:P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 9 5


Chapter 2A significant regional variation of both road accidents <strong>and</strong> road accidentcasualties <strong>and</strong> a significant covariance between the two <strong>in</strong>tercepts are obta<strong>in</strong>ed.However, the regional variation of the <strong>in</strong>tercept is higher for the number ofpersons killed. Moreover, the covariance between responses (ρ) <strong>and</strong> itssignificance is reduced. We may conclude that the variations of accidents <strong>and</strong>persons killed follow the same trend both at national level <strong>and</strong> with<strong>in</strong> differentregions i.e. some of the covariance between accidents <strong>and</strong> persons killed issituated at the regional level.The f<strong>in</strong>al stage of the model<strong>in</strong>g concerns the <strong>in</strong>troduction of a r<strong>and</strong>om slope.▪ Click on the Add Term button of the Equations w<strong>in</strong>dow▪ In the Specify term w<strong>in</strong>dow, select alcohol from the variable drop-down list<strong>and</strong> click on Add Separate coefficients, as shown above.▪ Right-click on the Ων term <strong>in</strong> the Equations w<strong>in</strong>dow <strong>and</strong> select Set diagonalmatrix from the menu displayed.


2.5 Multivariate modelsBy sett<strong>in</strong>g a diagonal matrix, the covariances among <strong>in</strong>tercepts <strong>and</strong>/or slopesare all assumed to be equal to zero. Although a number of limitations might beconsidered to arise from this assumption, <strong>in</strong> the framework of the presentdemonstration it is adopted ma<strong>in</strong>ly for practical reasons (i.e. numerical<strong>in</strong>stabilities <strong>and</strong> convergence problems were encountered <strong>in</strong> the full matrixconsideration).Consequently, the Equations w<strong>in</strong>dow will now look like this:P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 9 7


Chapter 2▪ Click More to run the model. The follow<strong>in</strong>g results are displayed:


2.5 Multivariate modelsIn order to display more decimals <strong>in</strong> the values of the variance matrix:▪ In the Options ma<strong>in</strong> menu, select Numbers(display precision <strong>and</strong> miss<strong>in</strong>gvalue code)▪ In the Sett<strong>in</strong>gs w<strong>in</strong>dow that appears, set digits after decimal po<strong>in</strong>t equal to 5<strong>and</strong> click Apply. Then click Done.The Equations w<strong>in</strong>dow should now look like this:P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 9 9


Chapter 2The fixed effect of enforcement on the number of accidents is higher comparedto the related effect on persons killed. The regional variation of the effect ofalcohol enforcement effects is only significant as far as the number of accidentsis concerned. In particular, the effect of alcohol controls on persons killed doesnot vary significantly among regions. These results are further <strong>in</strong>terpreted <strong>in</strong>section 2.5 of the Methodology report.


Chapter 22.6 Structural equations modelsAs expla<strong>in</strong>ed <strong>in</strong> the methodology report, the application of a structural equationmodel requires large amounts of data <strong>and</strong> also of a certa<strong>in</strong> quality. In practice,these models are only applied <strong>in</strong> studies that have been planned to producedata suitable for this type of <strong>analysis</strong>. Moreover, the conduction of structuralequation <strong>modell<strong>in</strong>g</strong> requires a high level of expertise, which to deliver wouldexceed the scope of the present document. A number of good tutorials,exclusively dedicated to structural equation <strong>modell<strong>in</strong>g</strong> are on the market (c.f.www.ssicentral.com ). Consequently it was refra<strong>in</strong>ed from present<strong>in</strong>g thepractical <strong>in</strong>struction for recapitulat<strong>in</strong>g the example <strong>in</strong> the methodology report.2.7 More complex data structuresIn the respective section of the Methodology report (D7.4), several particularcases of multilevel models, ma<strong>in</strong>ly referred to as "non-hierarchical" models werepresented (i.e. cross-classified structures, multiple membership structures).These issues were ma<strong>in</strong>ly presented for completeness' sake <strong>and</strong> no practicalexamples were provided; non-hierarchical structures are seldom encountered <strong>in</strong>road safety research2.8 Bayesian estimation <strong>in</strong> multivlevel <strong>modell<strong>in</strong>g</strong>Estimation methods based on simulation techniques (i.e. Monte Carlo MarkovCha<strong>in</strong> methods, bootstrap methods) for fitt<strong>in</strong>g these models (<strong>and</strong> multilevelmodels <strong>in</strong> general) were presented. The models for deal<strong>in</strong>g with thesestructures are still under further development. Moreover, a detailed presentationadvanced estimation methods (e.g. simulation techniques) is beyond the scopeof this document. Consequently, no manuals are provided for this section.


Chapter 3 - Time <strong>series</strong> <strong>analysis</strong>3.1 Introduction to <strong>time</strong> <strong>series</strong> modelsEllen Berends <strong>and</strong> Frits Bijleveld (SWOV)In the SafetyNet project, many road traffic data are collected that consist ofrepeated measurements over <strong>time</strong>. Examples are the annual or monthlynumber of road traffic accidents <strong>in</strong> a country, its annual or monthly number ofroad traffic fatalities, its annual or monthly number of vehicle kilometres driven,its annual or monthly values on safety performance <strong>in</strong>dicators, etc., allrepeatedly measured over a certa<strong>in</strong> period of <strong>time</strong>. Whenever one is <strong>in</strong>terested<strong>in</strong> study<strong>in</strong>g <strong>and</strong> analys<strong>in</strong>g such developments of one <strong>and</strong> the samephenomenon over <strong>time</strong>, special issues arise not encountered <strong>in</strong> cross-sectionaldata <strong>analysis</strong>. An important issue is that the residuals, although assumed to be<strong>in</strong>dependent <strong>in</strong> the (cross-sectional) model specification, as demonstrated <strong>in</strong> themethodology report may <strong>in</strong> fact not be <strong>in</strong>dependent of one another. Thisviolation may result <strong>in</strong> unreliable test statistics, <strong>and</strong> thus unreliable <strong>in</strong>ferencesfrom the models.The problem of dependencies between the residuals <strong>in</strong> the traditional l<strong>in</strong>earregression <strong>analysis</strong> of <strong>time</strong> <strong>series</strong> data may some<strong>time</strong>s be solved <strong>in</strong> a numberof different ways:• additional predictor variables can be added to the regression of thedependent variable on <strong>time</strong> such that the dependencies are removed fromthe residuals, <strong>and</strong>/or• the relation between the dependent variable <strong>and</strong> <strong>time</strong> can be analysed withgeneralised l<strong>in</strong>ear models <strong>and</strong>/or non-l<strong>in</strong>ear models, <strong>and</strong>/or• the dependent variable can be analysed with a special family of <strong>analysis</strong>techniques collectively known as <strong>time</strong> <strong>series</strong> models. The most commondedicated <strong>time</strong> <strong>series</strong> <strong>analysis</strong> techniques used <strong>in</strong> road safety <strong>analysis</strong> areARIMA, its special case DRAG <strong>and</strong> state space models.In the manual, we only deal with the dedicated <strong>time</strong> <strong>series</strong> methods. As <strong>in</strong>pr<strong>in</strong>ciple the DRAG method can be regarded as a special case of ARMA-type<strong>modell<strong>in</strong>g</strong>, this approach is also not covered. However, l<strong>in</strong>ear regression modelis <strong>in</strong>cluded because it is used <strong>in</strong> the methodology report to demonstrate theidentification <strong>and</strong> consequences of dependency of residuals <strong>and</strong> it is wellknown. Because l<strong>in</strong>ear regression is so well known, it is assumed to be a betterstart<strong>in</strong>g po<strong>in</strong>t than dedicated <strong>time</strong> <strong>series</strong> <strong>analysis</strong> methods, namely ARMA-typemodels <strong>and</strong> state space models, which are discussed as well.The first part of the chapter on Time Series Analysis shows when l<strong>in</strong>earregression can be used for repeated measurements over <strong>time</strong>. It isrecommended to read this chapter before start<strong>in</strong>g with one of the dedicated<strong>time</strong> <strong>series</strong> methods. Austrian fatalities from the period from 1987 to 2004 wasused as example to show which tests have to be carried out to test the Gauss-Markov assumptions, which are the conditions for l<strong>in</strong>ear regression. L<strong>in</strong>ear


3.1 Introduction <strong>time</strong> <strong>series</strong> modelsregression is not the preferred model for the Austrian fatalities due to heavyviolations of two basic assumptions.In the parts about ARMA-type models <strong>and</strong> state space <strong>analysis</strong> <strong>in</strong> themethodology report, several data <strong>series</strong> were used as examples of us<strong>in</strong>g thededicated <strong>time</strong> <strong>series</strong> <strong>analysis</strong> techniques. We name a few <strong>in</strong>terest<strong>in</strong>gexamples. An ARMA-type <strong>analysis</strong> was conducted on the monthly total numberof French fatalities collected between 1975 <strong>and</strong> 2001. It was shown that next tovarious seasonal <strong>and</strong> economic variables, the number of fatalities is alsoaffected by certa<strong>in</strong> media events. Furthermore, the presidential amnesty that isusually given to traffic offenders dur<strong>in</strong>g the French elections appeared to beassociated to an <strong>in</strong>crease <strong>in</strong> fatalities. A state space type of <strong>analysis</strong> wascarried out on the monthly number of drivers killed or seriously <strong>in</strong>jured (KSI) <strong>in</strong>the United K<strong>in</strong>gdom. It was shown that the number of KSI depends on the<strong>in</strong>troduction of the seat belt law, the petrol price (as a <strong>in</strong>dicator of mobility) <strong>and</strong>seasonal <strong>in</strong>fluences. The <strong>in</strong>troduction of the seat belt law resulted <strong>in</strong> a 21.1%reduction of the number KSI <strong>in</strong> the UK.Numerous software packages can be used to carry out the above mentioned<strong>analysis</strong> techniques. The choice of one software package for this manual doesnot mean that the user, after hav<strong>in</strong>g worked with this manual, could not applythis type of <strong>analysis</strong> <strong>in</strong> other software environments or new versions of thesame software. The software is just used as a means to demonstrate theopportunities that dedicated <strong>time</strong> <strong>series</strong> <strong>modell<strong>in</strong>g</strong> offers to road safety <strong>analysis</strong><strong>and</strong> to <strong>in</strong>struct <strong>in</strong> the design <strong>and</strong> <strong>in</strong>terpretation of classical l<strong>in</strong>ear regression,ARMA-type models or state space models.For l<strong>in</strong>ear regression <strong>and</strong> ARMA-type models, SPSS was reta<strong>in</strong>ed because it isa ma<strong>in</strong>stream statistics program <strong>and</strong> it is very suitable for these <strong>analysis</strong>techniques, <strong>in</strong> addition to be<strong>in</strong>g a user-friendly software (http:///www.spss.com).Other exist<strong>in</strong>g dedicated softwares, such as E-views, R <strong>and</strong> SAS Proc ARIMAare also appropiriate for perform<strong>in</strong>g ARMA-type <strong>analysis</strong>).The three probably best-known software packages which can be applied forstate space <strong>analysis</strong>., i.e. Ox/SsfPack, STAMP, <strong>and</strong> SAS Proc UCM, whilenot<strong>in</strong>g the availability <strong>in</strong> other packages, such as Splus, R, matlab <strong>and</strong>Mathematica, will be shortly compared below. SsfPack conta<strong>in</strong>s a lot of rout<strong>in</strong>esfor state space <strong>analysis</strong>, <strong>and</strong> can be used <strong>in</strong> the Ox programm<strong>in</strong>g environment.By programm<strong>in</strong>g the user has a lot of freedom <strong>in</strong> <strong>modell<strong>in</strong>g</strong>. A disadvantage ofOx/SsfPack is that it requires <strong>in</strong> addition to experience <strong>in</strong> statistics <strong>and</strong><strong>modell<strong>in</strong>g</strong> , some experience <strong>in</strong> programm<strong>in</strong>g. STAMP is a user friendly, menudrivenpackage specifically designed for state space <strong>analysis</strong> <strong>and</strong> is thereforemore appropriate for <strong>in</strong>struct<strong>in</strong>g the possibly <strong>in</strong>experienced road safety analist <strong>in</strong>state space <strong>modell<strong>in</strong>g</strong>. SAS Proc UCM can h<strong>and</strong>le univariate models <strong>and</strong> <strong>in</strong> thefuture multivariate models as well (Yaffee, 2003), whereas STAMP 6.0 h<strong>and</strong>lesboth univariate <strong>and</strong> multivariate models. Yaffee (2003) states that SAS ProcUCM is "powerful <strong>and</strong> easy to use" <strong>and</strong> that "STAMP h<strong>and</strong>les a wide variety ofmodels". Furthermore, STAMP has good graphical options, can displayforecasts with error marg<strong>in</strong>s, <strong>and</strong> its algorithms are fast. Judge <strong>and</strong> N<strong>in</strong>omiya(2000) make the follow<strong>in</strong>g remark, which is very relevant <strong>in</strong> the light of theP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 0 3


Chapter 3manual's objective: "even those who are <strong>in</strong>experienced with structural <strong>time</strong><strong>series</strong> <strong>modell<strong>in</strong>g</strong> can use STAMP to familiarise themselves with this approach".It was this, its relatively small size (<strong>and</strong> price), <strong>and</strong> its dedicatedness to statespace <strong>analysis</strong> which made us choose STAMP for structural <strong>time</strong> <strong>series</strong><strong>modell<strong>in</strong>g</strong>.


3.2 Classical l<strong>in</strong>ear <strong>and</strong> non-l<strong>in</strong>ear regression models3.2.1 Classical l<strong>in</strong>ear regression modelsChristian Br<strong>and</strong>stätter <strong>and</strong> Andrea Angermann (KfV)3.2.1.1. IntroductionThe goal of this chapter is to demonstrate how to apply l<strong>in</strong>ear regressionmodels to Austrian road accident fatalities data <strong>and</strong> how to determ<strong>in</strong>e if a trend<strong>in</strong> the counts of road accident fatalities can be derived. Some practicalcomputations us<strong>in</strong>g SPSS software (version 14.0) on these data are presented.However, the different views on the assumptions underly<strong>in</strong>g data test<strong>in</strong>g areconsidered more important than the result<strong>in</strong>g regression l<strong>in</strong>e with itsparameters.Screenshots <strong>and</strong> descriptions are used to expla<strong>in</strong> the steps <strong>and</strong> results <strong>in</strong> theprocess of data <strong>analysis</strong>; for more detailed <strong>in</strong>formation on the theoreticalbackground, please see the correspond<strong>in</strong>g theorie chapter. For furtherexplanations regard<strong>in</strong>g SPSS, we refer the reader to the SPSS user manuals.3.2.1.2. Dataset descriptionAustrian data from the period from 1987 to 2004 is analyzed <strong>in</strong> this tutorial. Theraw dataset imported to SPSS is shown <strong>in</strong> the follow<strong>in</strong>g table. The <strong>time</strong> variable“Y/M” <strong>and</strong> number of “Fatalities” can be seen.


Chapter 3The option “Variable View” displays all def<strong>in</strong>itions <strong>and</strong> attributes of the usedvariables <strong>in</strong> the data set:3.2.1.3. First heuristic view of dataIn order to get a first impression of the <strong>time</strong> <strong>series</strong> data, it is recommended togenerate a simple scatter plot of fatalities over <strong>time</strong>. In the follow<strong>in</strong>gscreenshots the necessary steps for this procedure <strong>in</strong> SPSS are be<strong>in</strong>g


3.2 Classical l<strong>in</strong>ear regression modelsexpla<strong>in</strong>ed. Simple scatter plot has been chosen because the data only conta<strong>in</strong>sone data <strong>series</strong>.Choose from the menu “Graphs” the option “Scatter/Dot…”. After choos<strong>in</strong>g a“Simple Scatter”, click “Def<strong>in</strong>e”. Mark <strong>and</strong> click the variable “Year/Month” <strong>in</strong> theX-axis, the variable “Fatalities” <strong>in</strong> the Y-axis. After click<strong>in</strong>g “Ok”, the SPSSprocessorstarts with produc<strong>in</strong>g the scatter graph which is shown <strong>in</strong> Output 1.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 0 7


Chapter 3The figure displayed below shows the structure of data at a glance: adecreas<strong>in</strong>g development of counts of fatalities over <strong>time</strong>.A first attempt to describe this development mathematically could be carried outby fitt<strong>in</strong>g a simple l<strong>in</strong>ear regression. The next step to get a better impression ofthe data is to generate a l<strong>in</strong>ear regression l<strong>in</strong>e <strong>in</strong> the scatter plot.


3.2 Classical l<strong>in</strong>ear regression modelsFor this step click on the graph, click the right mouse button <strong>and</strong> open the“SPSS Chart Object”. In the now new opened w<strong>in</strong>dow “Chart Editor”, youchoose “Elements” <strong>and</strong> click “Fit L<strong>in</strong>e at Total”. In the new opened w<strong>in</strong>dow“Properties” you choose “L<strong>in</strong>ear” <strong>and</strong> click “Apply”. As a result you can see thel<strong>in</strong>ear regression l<strong>in</strong>e <strong>in</strong> the chart <strong>and</strong> you can close the “Chart Editor” <strong>in</strong> orderto return to the “Output”-w<strong>in</strong>dow.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 0 9


Chapter 3


3.2 Classical l<strong>in</strong>ear regression modelsThe scatter plot above may be roughly <strong>in</strong>terpreted with the help of a l<strong>in</strong>earregression model. At this stage of data process<strong>in</strong>g there is no reason for notapply<strong>in</strong>g this model; therefore next steps <strong>and</strong> analyses should be started.3.2.1.4. L<strong>in</strong>ear regression:The l<strong>in</strong>ear regression is generated with road accident “Fatalities” <strong>in</strong> Austria asdependent variable <strong>and</strong> “Year/Month” as <strong>in</strong>dependent variable. The <strong>time</strong> <strong>series</strong>starts <strong>in</strong> January 1987 <strong>and</strong> ends <strong>in</strong> December 2004.The next figures show the necessary operations for calculat<strong>in</strong>g a l<strong>in</strong>earregression:Choose from the menu “Analyze” “Regression” <strong>and</strong> click “L<strong>in</strong>ear”.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 1 1


Chapter 3Then mark <strong>and</strong> click the variable “Fatalities” <strong>in</strong> the “Dependent” field, thevariable “Year/Month” you click <strong>in</strong> the “Independent” field. Then click the button“Statistics”:In the “L<strong>in</strong>ear Regression: Statistics”-w<strong>in</strong>dow choose the regression coefficients“Estimates” <strong>and</strong> “Confidence <strong>in</strong>tervals”, as well as the option “Model fit” <strong>and</strong> theresiduals “Dub<strong>in</strong>-Watson” <strong>and</strong> “Casewise diagnostics” [1]. Def<strong>in</strong>e the “Outliersoutside:” with “3” st<strong>and</strong>ard deviations <strong>and</strong> click “Cont<strong>in</strong>ue”.


3.2 Classical l<strong>in</strong>ear regression modelsThe result (see the Output-w<strong>in</strong>dow stated below) shows a highly significantdecrease <strong>in</strong> the count of fatalities; this is expla<strong>in</strong>ed <strong>in</strong> the coefficients-table: The variable “Year/Month” is negative (-0.520 for the st<strong>and</strong>ardizedcoefficient Beta, the unst<strong>and</strong>ardized coefficients can be used on the orig<strong>in</strong>aldata with fomula 3.2.1 <strong>in</strong> the correspond<strong>in</strong>g chapter <strong>in</strong> the methodologyreport) <strong>and</strong> shows a very “high” significance of < 0.0001: a significancevalue of less than 0.05 means that the variation expla<strong>in</strong>ed by the model isnot due to change.Furthermore, the regression model has a reasonable fit: The ANOVA table reports a significant F statistic because its significancevalue is below 0.05. R Square <strong>in</strong> the Model Summary table shows that the regression expla<strong>in</strong>s27.0% of the variance of the data.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 1 3


Chapter 3The residual <strong>analysis</strong> <strong>in</strong> the table Casewise Diagnostics shows one outlier withcase “1” <strong>in</strong> January 1987. In this stage of the calculations/<strong>analysis</strong> noassumptions can be made why case “1” is an exceptional case. Presently, nofurther <strong>analysis</strong> on the cause of this outlier is be<strong>in</strong>g made.


3.2 Classical l<strong>in</strong>ear regression models3.2.1.5. Condition test<strong>in</strong>g (Gauss-Markov Assumptions)The test<strong>in</strong>g of the Gauss-Markov assumptions can be done by click<strong>in</strong>g “Plots” <strong>in</strong>the l<strong>in</strong>ear regression. The underly<strong>in</strong>g assumptions are [1]: The r<strong>and</strong>om errors are distributed normally. The value for the error term associated with any different observations is<strong>in</strong>dependent. The error associated with one value of y has no effect on theerrors associated with other values. This means that all autocorrelations ofthe errors are near 0. The variance of the error term is constant across cases (x) <strong>and</strong> <strong>in</strong>dependentof the variables <strong>in</strong> the model. This is called homoscedasticity, orhomogeneity of the variance of error. An error term with non-constantvariance is said to be heteroscedasticit. The prediction error ε is uncorrelated with x, the <strong>in</strong>dependence assumption.This assumption is fullfilled when deal<strong>in</strong>g with road accident <strong>time</strong> <strong>series</strong>. Aswe are deal<strong>in</strong>g with univariate data <strong>in</strong> this example the problem of col<strong>in</strong>earityis not relevant.For all these assumptions visual <strong>and</strong> numeric representations are be<strong>in</strong>ggenerated (based on statistic <strong>in</strong>ference). The statistical tests give detailed<strong>in</strong>formation whether the statement is valid or not. The advantage of thegraphical <strong>analysis</strong> is that deviations <strong>and</strong> type of deviations from the testedconditions/assumptions can be detected more easily.Normal distribution of r<strong>and</strong>om errorFor a simple overview of the distribution of the variables, the graphicalrepresentation can be used. Choose the l<strong>in</strong>ear regression from the menu“Analyze” <strong>and</strong> choose the dependent <strong>and</strong> <strong>in</strong>dependent variable respectively, asdescribed above. Then click the button “Plots”.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 1 5


Chapter 3A new w<strong>in</strong>dow “L<strong>in</strong>ear Regression: Plots” will be opened. Choose for the“St<strong>and</strong>ardized Residual Plots” the options “Histogram” <strong>and</strong> “Normal ProbabilityPlot”, <strong>and</strong> then click the button “Cont<strong>in</strong>ue”:The results you can see <strong>in</strong> the “Output”-w<strong>in</strong>dow:


3.2 Classical l<strong>in</strong>ear regression modelsThe assumption of a normal distribution of r<strong>and</strong>om error can be confirmed withthe two graphs stated above.For the numeric test<strong>in</strong>g the 1-Sample Kolmogorov-Smirnov test [1] has beenused, this is an <strong>in</strong>ference statistic us<strong>in</strong>g a non-parametric test. If theKolmogorov-Smirnov test is not significant, the assumption of a normaldistribution of r<strong>and</strong>om error can be confirmed.To start this test, choose from the menu “Analyze” the “Nonparametric Test” “1Sample K-S”. You can see the result <strong>in</strong> the Output-W<strong>in</strong>dow.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 1 7


Chapter 3The result stated above shows that also the not significant Kolmogorov-Smirnovtest (Asymp. Sig. (2-tailed) = 0.316) confirms the normal distribution of ther<strong>and</strong>om error.Independency of VariablesThe autocorrelation function (ACF) is used to verify the assumption that theerror term associated with any different observations is <strong>in</strong>dependent of anyother. For these computations both a graphical <strong>and</strong> a numeric method exists.


3.2 Classical l<strong>in</strong>ear regression modelsTo start the computation of the autocorrelation function, choose the option“Time Series” from the “Graphs”-menu <strong>and</strong> click “Autocorrelation”.The variable you choose is the dependent one (“Fatalities”): click <strong>in</strong> the“Autocorrelation”-w<strong>in</strong>dow the button “Options” <strong>and</strong> def<strong>in</strong>e there the maximumnumber of legs to “24”. Close this w<strong>in</strong>dow with “Cont<strong>in</strong>ue” <strong>and</strong> start the <strong>analysis</strong><strong>in</strong> the “Autocorrelation” w<strong>in</strong>dow with the button “OK”.A clear seasonal pattern of the autocorrelations can be identified <strong>in</strong> the diagramstated below with large peaks at 12 <strong>and</strong> 24 month. this pattern also exceeds theconfidence b<strong>and</strong>. Through this test the assumption that the error termassociated with any different observations is <strong>in</strong>dependent of any other, can notbe confirmed.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 1 9


Chapter 3The same result is obta<strong>in</strong>ed with the follow<strong>in</strong>g Box-Ljung-statistics, which is partof the executed ACF <strong>and</strong> is also presented <strong>in</strong> the Output w<strong>in</strong>dow below. TheBox-Ljung test tests the significance of autocorrelation at each lag. All 24 lagsshow a highly significant autocorrelation.


3.2 Classical l<strong>in</strong>ear regression modelsHomoscedasticity AssumptionThe homoscedasticity assumption specifies that the variance of the error term isconstant across cases <strong>and</strong> <strong>in</strong>dependent of the variables <strong>in</strong> the model. This isthe last tested <strong>and</strong> described assumption <strong>in</strong> this chapter; aga<strong>in</strong> a graphical <strong>and</strong><strong>in</strong>ference-statistical method is be<strong>in</strong>g used.To compute the necessary plots, a number of new variables have to be createdwith the regression procedure.Choose the l<strong>in</strong>ear regression model as shown <strong>in</strong> the example above, beforeclick<strong>in</strong>g “OK”, click the button “Save”.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 2 1


Chapter 3In this new opened w<strong>in</strong>dow you choose the predicted values “St<strong>and</strong>ardized” <strong>and</strong>the “Unst<strong>and</strong>ardized” <strong>and</strong> “St<strong>and</strong>ardized” residuals. Go on with the button“Cont<strong>in</strong>ue” <strong>and</strong> then click “OK” <strong>in</strong> the “L<strong>in</strong>ear Regression” w<strong>in</strong>dow.For the three new variables RES_1, ZPR_1, ZRE_1 see the data view <strong>in</strong> theSPSS-data file:


3.2 Classical l<strong>in</strong>ear regression modelsThe three new variables are used to generate the variables “square ofst<strong>and</strong>ardized residuals” <strong>and</strong> “absolute value of st<strong>and</strong>ardized residuals”. Togenerate them, choose from the menu “Transform” “Compute”.In the newly opened w<strong>in</strong>dow “Compute Variable” you name the target variable“ZRE2”. Then you have to enter the numeric expression: Mark <strong>and</strong> click fromthe list of variables the “St<strong>and</strong>ardized Residual” (ZRE_1) <strong>in</strong> the numericexpressions field, <strong>in</strong>sert from the key pad * (for multiplication) <strong>and</strong> put oncemore the ZRE_1 variable <strong>in</strong> the numeric expressions field. Click “OK” forstart<strong>in</strong>g the computation process.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 2 3


Chapter 3To compute the last variable (ZRE_ABS), choose from the menu “Functiongroup” the type “All” <strong>and</strong> double-click “ABS” to put it <strong>in</strong> the field “NumericExpression”. In the given bracket mark <strong>and</strong> click the variable “ZRE_1”(st<strong>and</strong>ardized residual). Click “OK” to compute the variable ZRE_ABS.As a result five new variables have been retrieved.


3.2 Classical l<strong>in</strong>ear regression modelsDescriptions <strong>and</strong> def<strong>in</strong>itions can be seen <strong>in</strong> the “Variable View”-w<strong>in</strong>dow:With these new variables the assumption test<strong>in</strong>g can be started. The scatterplots are derived with the same steps that are already expla<strong>in</strong>ed above.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 2 5


Chapter 3


3.2 Classical l<strong>in</strong>ear regression modelsAll four plots, especially the last two, show a dist<strong>in</strong>ctive pattern that <strong>in</strong>dicatesheteroscedasticity.To complete the <strong>analysis</strong> with an <strong>in</strong>ference-statistical model, the White-ChiSquare test [1] is used. To compute this test, “R square” of the regression of theP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 2 7


Chapter 3<strong>time</strong> <strong>series</strong> on “Squared St<strong>and</strong>ardized Residuals” (ZRE2) is needed. Start thel<strong>in</strong>ear regression as expla<strong>in</strong>ed above with “ZRE2” as dependent <strong>and</strong>“Year/Month” as <strong>in</strong>dependent variable.The output of “Model Summary” conta<strong>in</strong>s the R square with 0.052.The actual test can be easily done <strong>in</strong> Excel. Enter <strong>in</strong> a new excel spread sheetthe computed R-square <strong>and</strong> the number of cases (n=216 <strong>in</strong> our example). TheWhite Chi square is the multiplication of both.To calculate the “Chivert” “1-alpha”, <strong>in</strong>sert the Excel-function “Chivert” <strong>in</strong> the cellB6 <strong>and</strong> put <strong>in</strong> brackets the White Chi square (cell B4) <strong>and</strong>, after the semicolon,“1” for the number of the degrees of freedom.


3.2 Classical l<strong>in</strong>ear regression modelsThe result of “1-alpha” presents a highly significant deviation from thehomoscedatisticity-assumption:3.2.1.6. ConclusionThe conclusion of the results from the presented tests is that l<strong>in</strong>ear regression isnot the preferred model <strong>in</strong> this case. Due to heavy violations of two basicassumptions of the model to fit <strong>time</strong> <strong>series</strong> data, a different <strong>and</strong> more advancedmodel specialised on <strong>time</strong> <strong>series</strong> data should be applied <strong>and</strong> will bedemonstrated <strong>in</strong> the follow<strong>in</strong>g chapters.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 2 9


Chapter 33.2.2 Generalized l<strong>in</strong>ear models (GLM)As expla<strong>in</strong>ed <strong>in</strong> the Introduction, this manual focuses on state-of-the-artdedicated <strong>time</strong> <strong>series</strong> techniques. Generalized L<strong>in</strong>ear Models (GLM) are auseful <strong>and</strong> flexible technique that can be applied for <strong>time</strong> <strong>series</strong> <strong>analysis</strong>,however they do not by def<strong>in</strong>ition take <strong>in</strong>to account the <strong>time</strong> dependencebetween observations; these dependences can only be considered byextend<strong>in</strong>g the GLM approaches with <strong>time</strong> <strong>series</strong> properties. In this sense, GLMare not dedicated <strong>time</strong> <strong>series</strong> techniques, like the ARMA-type <strong>and</strong> state spacetechniques. Consequently, no manuals are provided for GLM <strong>in</strong> <strong>time</strong> <strong>series</strong><strong>analysis</strong>.3.2.3 Non-l<strong>in</strong>ear modelsSimilarly to the Generalized L<strong>in</strong>ear Models (GLM), Non L<strong>in</strong>ear models are alsoa possible technique that can be used for <strong>time</strong> <strong>series</strong> <strong>analysis</strong>. However, theyalso can take <strong>in</strong>to account the <strong>time</strong> dependence between observations only byextend<strong>in</strong>g the models accord<strong>in</strong>gly; therefore they can not be considered as adedicated <strong>time</strong> <strong>series</strong> techniques, like the ARMA-type <strong>and</strong> state spacetechniques. Consequently, no manuals are provided for Non L<strong>in</strong>ear models <strong>in</strong><strong>time</strong> <strong>series</strong> <strong>analysis</strong>.3.3 Dedicated <strong>time</strong> <strong>series</strong> <strong>analysis</strong> <strong>in</strong> road safetyresearchThe respective section <strong>in</strong> the methodology report is an overview section thatdoes not conta<strong>in</strong> empirical examples. Consequently there is no manual part forthis section.


3.4 ARMA-type modelsRuth Bergel <strong>and</strong> Mohamed Cherfi, INRETS3.4.1 IntroductionThe objective of this part of the manual is to <strong>in</strong>troduce technical aspects of <strong>time</strong><strong>series</strong> <strong>modell<strong>in</strong>g</strong> us<strong>in</strong>g ARMA-type models, applied to road safety <strong>analysis</strong>.The section is structured <strong>in</strong> three parts. In Section 3.4.2, it is briefly recalled tothe reader that he should refer to the respective section 3.4.2 of theMethodology Report, <strong>in</strong> which several ARMA models were fitted on simulatedstationary datasets. Section 3.4.3 presents an example of ARIMA modelestimated on non seasonal (yearly) real data, without any use of exogenousvariables. And <strong>in</strong> Section 3.4.4, ARIMA models estimated on seasonal(monthly) real data have been chosen <strong>and</strong> exogenous variables, whether<strong>in</strong>tervention variables or explanatory variables, have been succeed<strong>in</strong>gly<strong>in</strong>troduced <strong>in</strong> the model.Each case of real data reta<strong>in</strong>ed <strong>in</strong> this manual is a non-stationary case: of firstorder (the mean of the process varies over <strong>time</strong>) <strong>and</strong> of second order (thevariance of the process varies over <strong>time</strong>), this be<strong>in</strong>g the general case for risk<strong>in</strong>dicators <strong>in</strong> the road safety field.Because of that non stationarity <strong>in</strong> variance, the dependent data weresystematically Log-transformed. As for the exogenous (<strong>in</strong>dependent) variables,which are always considered as known (non stochastic, or <strong>in</strong> other words, notunder measurement errors), the ones reta<strong>in</strong>ed for measur<strong>in</strong>g the traffic volumeor the petrol/gasol<strong>in</strong>e price were Log-transformed; whereas the <strong>in</strong>terventionvariable was used <strong>in</strong> a l<strong>in</strong>ear form (see the Methodology Report).The detailed specification of the ARMA-type estimated models cannot be found<strong>in</strong> the manual but is described <strong>in</strong> the Methodology Report. Note thatexplanations for the ARMA structure, applied to the stationary datasets derivedfrom the <strong>in</strong>itial ones, are to be found <strong>in</strong> section 3.4.2 of the Methodology Report.For each of the data cases, we first followed the succeed<strong>in</strong>g steps for fitt<strong>in</strong>gpure ARIMA models on the datasets: Data description Model identification Model estimation <strong>and</strong> validation Graphical results <strong>and</strong> additional (normality) testWe then used external <strong>in</strong>formation by means of add<strong>in</strong>g <strong>in</strong>tervention <strong>and</strong>explanatory variables <strong>in</strong>to the pure ARIMA models <strong>in</strong> view of tak<strong>in</strong>g account forcerta<strong>in</strong> risk factors, road safety measures or special events.For each example the parameters of the exogenous variables were <strong>in</strong>terpreted,<strong>and</strong> the ga<strong>in</strong> <strong>in</strong> fit statistics was measured.


Chapter 3SPSS (Version 14.0) was used for this work.Regard<strong>in</strong>g the output obta<strong>in</strong>ed after each model estimation, the follow<strong>in</strong>g resultsare systematically given:- three tables (model fit, model statistics, <strong>and</strong> ARIMA model parameters),- the ACF plot (<strong>and</strong> PACF plot) of the residuals- two graphs (the observed <strong>and</strong> the fitted <strong>series</strong> on the one h<strong>and</strong>, <strong>and</strong> theresiduals on the other h<strong>and</strong>),- <strong>and</strong> the results of additional tests of normality of the residuals.The first of the three tables provides Goodness-of-Fit Measures* (which enableto evaluate the model’s empirical performance): Stationary R-squared R-squared RMSE MAPE MAE MaxAPE MaxAE Normalized BICThe second table provides ma<strong>in</strong>ly the Ljung-Box statistic** (which enables toevaluate the model specification),The third one provides the Model parameters (the estimated model parameters<strong>and</strong> their significance).As for the normality of the residuals test, two plots were systematically given(the histogram <strong>and</strong> the QQ-plot), <strong>and</strong> the non-parametric Kolmogorov-Smirnovstatistic ***----------------------------------------------------------------------------------------------------------(*) The first of the three output tables provides Goodness-of-Fit Measures:Goodness-of-fit statistics are based on the orig<strong>in</strong>al <strong>series</strong> Y(t). Let k= number of parameters <strong>in</strong> the model,n = number of non-miss<strong>in</strong>g residuals. Stationary R-squared. It compares the stationary part of the model to asimple mean model. This measure is preferable to ord<strong>in</strong>ary R-squared when thereis a trend or seasonal component <strong>in</strong> the <strong>series</strong>.2∑(Z(t)− Zˆ(t))2tRS= 1 −2( ∆Z(t)− ∆Z)Where:The sum is over the terms <strong>in</strong> which, both ( Z(t)− Zˆ(t))<strong>and</strong> ∆ Z( t)− ∆Zare not miss<strong>in</strong>g.∑t


3.4 ARMA-type models∆Zis the simple mean model for the differenced transformed <strong>series</strong>, which is equivalent to theunivariate basel<strong>in</strong>e model ARIMA(0,d,0)(0,D,0). R-squared. An estimate of the proportion of the total variation <strong>in</strong> the <strong>series</strong> that isexpla<strong>in</strong>ed by the model.2∑(Y ( t)− Yˆ(t))2R = 1−2( Y ( t)− Y ) RMSE. Root Mean Square Error. The square root of mean square error.∑2( Y ( t)− Yˆ(t))RMSE =n − k MAPE. Mean Absolute Percentage Error.100MAPE = ∑ ( Y ( t)− Yˆ(t)) / Y ( t)n MAE. Mean absolute error.1MAE = ∑ Y ( t)− Yˆ(t)n MaxAPE. Maximum Absolute Percentage Error. The largest forecasted error,expressed as a percentageMaxAPE = 100max((Y ( t)− Yˆ(t)) / Y ( t) ) MaxAE. Maximum Absolute Error. The largest forecasted error, expressed <strong>in</strong> thesame units as the dependent <strong>series</strong>MaxAE = max( Y ( t)− Yˆ(t) )∑ Normalized BIC. Normalized Bayesian Information Criterion.ln( n)NormalizedBIC = ln( MSE)+ kn--------------------------------------------------------------------------------------------------------------(**) The second table provides ma<strong>in</strong>ly the Ljung-Box statitistic (which enables to evaluate themodel specification),K2∑ kk=1Q( K)= n(n + 2) r /( n − k),where rkis the kth lag ACF of residual.Q(K) is approximately distributed as χ2 ( K − m), where m is the number of parameters other than theconstant term <strong>and</strong> predictor related-parameters.(***)The two one-sided Kolmogorov-Smirnov test statistics are given by:D−nD+n= max( F= max( F(x)− Fn( x)− F(x))n( x))where F(x) is the hypothesized distribution.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 3 3


Chapter 33.4.2 ARIMA models for stationary <strong>series</strong> (simulated data)Stationary <strong>series</strong> are usually not found <strong>in</strong> the road safety field. Therefore,simulated stationary data samples were used <strong>in</strong> a first approach, on whichARMA models were fitted. The structure of these simple models is similar to thestructure of the more elaborated models which will be fitted on real road safetydata, as far as h<strong>and</strong>l<strong>in</strong>g their stationary part is required.The reader will therefore refer to the respective section of the MethodologyReport, <strong>in</strong> which the <strong>modell<strong>in</strong>g</strong> stages are described <strong>and</strong> the <strong>modell<strong>in</strong>g</strong> resultsgiven.3.4.3 ARIMA models for non seasonal <strong>series</strong> (NorwegianFatalities)3.4.3.1. Data description1. Start of <strong>analysis</strong> <strong>and</strong> data load First, we start SPSS.Use the menu to open the file 'Norw_Fatalities.sav'.This data file consists of the annual number of people killed <strong>in</strong> road traffic <strong>in</strong>Norway for the years 1970 to 2003 ('Norw.Fatalities') <strong>and</strong> of the logarithm of thelatter <strong>time</strong> <strong>series</strong> ('LNorw.Fatalities'), <strong>and</strong> of two additional variables, labelledYEAR <strong>and</strong> DATE (note that all variables are described <strong>in</strong> the Variable View).2. Graphical diagnosticsThe data are represented graphically <strong>in</strong> a <strong>time</strong> <strong>series</strong> plot. This will help showup important features such as trend <strong>and</strong>, eventually, seasonality. The <strong>time</strong>


3.4 ARMA-type models<strong>series</strong> plot can also help decid<strong>in</strong>g whether a prelim<strong>in</strong>ary transformation(logarithmic transformation, filter<strong>in</strong>g,) is required on the data.• Click on Graphs..Sequence• Move Norw.Fatalities <strong>in</strong>to the Variables list boxThe plot shows the existence of a decreas<strong>in</strong>g trend <strong>in</strong> the <strong>time</strong> <strong>series</strong> (first ordernon stationarity).By tak<strong>in</strong>g the logarithm of these data, we can stabilize the data variance(second order non stationarity).P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 3 5


Chapter 3• Click on Graphs..Sequence• Remove Norvw.Fatalities from the Variables list box• Move LNorw.Fatalities <strong>in</strong>to the Variables list box3.4.3.2. Model identificationThe model identification consists <strong>in</strong> determ<strong>in</strong><strong>in</strong>g the three <strong>in</strong>tegers p, d, <strong>and</strong> q <strong>in</strong>the ARIMA(p,d,q) process generat<strong>in</strong>g the <strong>series</strong>.The ACF plot will be used to detect the presence of non stationarity <strong>in</strong> the data.


3.4 ARMA-type models• Click on Graphs..Time Series..Autocorrelations• Move the variable LNorw.Fatalities <strong>in</strong>to the Variables list box• Click on the Options pushbutton• Replace 16 with 60 <strong>in</strong> the Maximum number of lags text boxP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 3 7


Chapter 3The ACF plot <strong>in</strong>dicates the presence of non stationarity: this is due to the factthat the autocorrelations do not decrease at an exponential rate, after a certa<strong>in</strong>order.We shall therefore now differenciate the <strong>series</strong>, by apply<strong>in</strong>g the difference filterF( B)= 1−B to the data B be<strong>in</strong>g the backshift operator.• Click on the Dialog Recall button , <strong>and</strong> then click on SequenceCharts• Click on the Difference check box, <strong>and</strong> verify that 1 is <strong>in</strong> the Differencetext box.


3.4 ARMA-type modelsThe ACF plot of the filtered <strong>series</strong> will be used once aga<strong>in</strong>, to detect thepresence of nonstationarity <strong>in</strong> this filtered dataset.• Click on Graphs..Time Series..Autocorrelations• Move the variable LNorw.Fatalities <strong>in</strong>to the Variables list box• Click on the Difference check box, <strong>and</strong> verify that 1 is <strong>in</strong> the Differencetext box.• Click on the Options pushbutton• Replace 16 with 60 <strong>in</strong> the Maximum number of lags text boxP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 3 9


Chapter 3The ACF plot of the filtered <strong>series</strong> does not <strong>in</strong>dicate the presence of rema<strong>in</strong><strong>in</strong>gnon stationarity.We shall therefore accept the hypothesis that this filtered <strong>series</strong> is stationary,which enables to reta<strong>in</strong> d=1 for the value of the <strong>in</strong>teger d.Second, the choice of p=0 <strong>and</strong> q=1 is made by exam<strong>in</strong><strong>in</strong>g the ACF <strong>and</strong> thePACF plots taken together: we choose to fit the data with an ARIMA(0,1,1)model.


3.4 ARMA-type models3.4.3.3. Model estimation <strong>and</strong> validation• Click on Analyze..Time Series..Create Models• Move the variable LNorw.Fatalities <strong>in</strong>to the Dependent Variable(s) listbox.• From the Method box, select ARIMA <strong>modell<strong>in</strong>g</strong> method• Click on Criteria then enter values for the three <strong>in</strong>tegers p,d,q.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 4 1


Chapter 3• Click on Statistics then check parameter estimates <strong>in</strong> the Statistics forIndividual Models list.


3.4 ARMA-type models• Click on Plots <strong>and</strong> check Observed Values, Fit values, Residualautocorrelation function (ACF) <strong>in</strong> Plots for Individual Models.• Click on Save <strong>and</strong> check Noise Residuals then Click on OK.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 4 3


Chapter 3Model Description ARIMA(0,1,1)The results obta<strong>in</strong>ed after the estimation procedure (maximum likelihood) arepresented <strong>and</strong> commented below.Model Fit


3.4 ARMA-type modelsDue to the presence of the trend, the stationary R-squared is only 16,7% (themodel expla<strong>in</strong>s 16,7% of the variance of the filtered data, compared to aregression model), <strong>and</strong> much smaller than the R-square which is 78,9% (themodel expla<strong>in</strong>s 78,9% of the variance of the <strong>in</strong>itial data).As for the different measures of the error made:the mean absolute percentage error (MAPE) is 1,36% %, its highest valueobserved be<strong>in</strong>g 3,915%,the mean absolute error (MAPE) is 0,08, its highest value observed be<strong>in</strong>g 0,23,the root mean square error (RMSE) is 0,099, <strong>and</strong> is a little higher than if it werecomputed as an arithmetic mean (0,08),At last, the normalized BIC, which is -4,413, is a goodness of fit measure thattakes account of the parsimony of the model. Note that, as it is the case for theR-squared, its <strong>in</strong>terest lies <strong>in</strong> comparisons between several nested models, <strong>and</strong>not <strong>in</strong> its absolute value.Model StatisticsThe Ljung-Box statistic provides an <strong>in</strong>dication of whether the model is correctlyspecified, <strong>in</strong> the sense it allows test<strong>in</strong>g the global nullity of the autocorrelation ofthe residuals (of each autocorrelation, of order 1 up to order 18).In our case, this hypothesis is accepted, because the 0,510 value of the Ljung-Box statistic is more than 0.05.ARIMA Model ParametersThe ARIMA model parameters table provides estimates of the modelparameters <strong>and</strong> associated significance values (at the usual 95% confidencelevel). A t-value higher than 1,96 <strong>in</strong>dicates that the hypothesis of nullity of theparameter has to be rejected, <strong>and</strong> that the parameter can thus be considered assignificantly different from 0.In this case, the hypothesis of nullity is rejected <strong>in</strong> both cases, <strong>and</strong> the twoparameters of the ARIMA models are to be considered as different from zero.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 4 5


Chapter 3In addition to the preced<strong>in</strong>g Ljung-Box test, of global nonautocorrelation of theresiduals (correct specification of the model), the hypothesis that eachautocorrelation of the residuals is zero can be tested us<strong>in</strong>g the above ACF plot.The computation of confidence regions enable to determ<strong>in</strong>e visually whether itis the case (at the usual 95% confidence level). It is the case <strong>in</strong>deed for thisexample.


3.4 ARMA-type models3.4.3.4. Graphical results <strong>and</strong> additional testGraphical outputsThe preced<strong>in</strong>g plot describes the development of the observed <strong>and</strong> fitted <strong>series</strong>.Note that, due to the use of the filtered state used for comput<strong>in</strong>g the fittedvalues, the fitted data appear to stay one step beh<strong>in</strong>d the observed data.Second, the plot of the estimated residuals - the difference between theobserved <strong>and</strong> the fitted data plotted first - is to be considered:• Click on Graphs..Sequence• Remove LNorw.Fatalities from the variables list box• Move Noise residual <strong>in</strong>to the variables list box•P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 4 7


Chapter 3Note that it is very difficult, on this example, to determ<strong>in</strong>e visually whether theresiduals are a white noise or not.In addition to the nonautocorrelation hypothesis, the Gaussian hypothesis hasto be validated too, which then enables to consider the residuals as an<strong>in</strong>dependent <strong>series</strong> - or white noise.Nevertheless, note that the Gaussian hypothesis is not necessary, <strong>and</strong> that theresiduals can be a white noise even if this condition is not fulfilled. Thehypothesis of <strong>in</strong>dependence is then to be tested directly, which will not be thecase <strong>in</strong> this manual.Normality testWe shall now give the histogram of the residuals (which gives a general idea ofwhether the residuals are Gaussian), the QQ-plot (which is a graphical test of


3.4 ARMA-type modelsthis hypothesis), <strong>and</strong> at last the result of the Kolmogorov-Smirnov test (which isa non-parametric test of this hypothesis).Histogram• Click on Graphs..Histogram…• Move the variable NResidual_LNorv_Model_1 <strong>in</strong>to the Variable list box.• Check Display normal curve, then click on OKP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 4 9


Chapter 3In the case the Residuals have a normal distribution, the histogram looksapproximately like the normal curve, <strong>and</strong> the difference area between the twographs is m<strong>in</strong>imal.QQ-plot• Click on Graphs..QQ..• Move the variable NResidual_LNorv_Model_1 <strong>in</strong>to the Variables list box.• Choose Normal from Test Distribution, then click on OK.


3.4 ARMA-type modelsP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 5 1


Chapter 3The normal Q-Q plot compares the distribution of a given variable to the normaldistribution (represented by a straight l<strong>in</strong>e). The straight l<strong>in</strong>e represents what theresiduals would look like if they were perfectly normally distributed. Theresiduals are represented by the circles plotted along this l<strong>in</strong>e. The closest thecircles are to the l<strong>in</strong>e, the best the normality hypothesis is fulfilled.Kolmogorov-Smirnov Test• Click on Analyze..Non parametric Tests..1-Sample K-S…• Move the variable NResidual_LNorv_1 <strong>in</strong>to the Test Variable List box• Check Normal <strong>in</strong> Test Distribution list, then click on OK.


3.4 ARMA-type modelsIn the case the Kolmogorov-Smirnov test is significant, the normal distribution ofthe residuals hypothesis is to be rejected.In our case, this hypothesis is accepted, because the 0,713 value of theAsymp. Sig. (2-tailed) is more than 0.05 (at the usual 95% confidence level).P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 5 3


Chapter 33.4.4 ARIMA models for seasonal <strong>series</strong> (UK-KSI Drivers)The dataset presented <strong>in</strong> this section is the monthly number of drivers killed <strong>and</strong>seriously <strong>in</strong>jured <strong>in</strong> the United K<strong>in</strong>gdom (UK-KSI), for the period January 1969 -December 1984 (as described <strong>in</strong> Harvey <strong>and</strong> Durb<strong>in</strong>, 1986). A pure ARIMAmodel will first be fitted on these data, <strong>and</strong> an <strong>in</strong>tervention variable <strong>and</strong> twoexplanatory variables will be <strong>in</strong>troduced <strong>in</strong> the model <strong>in</strong> the two follow<strong>in</strong>g steps(see the Methodology Report).3.4.4.1. Data description1. Start of <strong>analysis</strong> <strong>and</strong> data load Use the menu to open the file ‘UK_KSI.sav’.The data file consists of the follow<strong>in</strong>g variables:DRIVERS: The number of drivers killed or seriously <strong>in</strong>jured.Interv: The <strong>in</strong>tervention variableTRKM: The car traffic <strong>in</strong>dexPPRICE: The petrol price.The log transformed variables of the preced<strong>in</strong>g ones, to the exception of the<strong>in</strong>tervention variable, are <strong>in</strong>cluded <strong>in</strong> the data file, <strong>and</strong> three additional variablesYEAR MONTH <strong>and</strong> DATE (see the Variable View, <strong>in</strong> which all variables aredescribed).2. Graphical diagnostics


3.4 ARMA-type models• Click on Graphs..Sequence• Move DRIVERS <strong>in</strong>to the Variables list box• Click on Graphs..Sequence• Remove DRIVERS from the Variables list box• Move DRIVERS <strong>in</strong>to the Variables list boxP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 5 5


Chapter 33.4.4.2. Model identificationThe model identification consists <strong>in</strong> determ<strong>in</strong><strong>in</strong>g the six <strong>in</strong>tegers p, d, q (relatedto the non seasonal part of the model), <strong>and</strong> P,D,Q (related to the seasonal partof the model) <strong>in</strong> the multiplicative ARIMA(p,d,q)(P,D,Q) S formulation.As already mentioned before, the ACF plot will be used to detect nonstationarity <strong>in</strong> the data.• Click on Graphs..Time Series..Autocorrelations• Move the variable LDRIVERS <strong>in</strong>to the Variables list box


3.4 ARMA-type models• Click on the Options pushbutton• Replace 16 with 60 <strong>in</strong> the Maximum number of lags text box• Click on Cont<strong>in</strong>ue, <strong>and</strong> then click on OK to get a plot of the ACF plotP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 5 7


Chapter 3The ACF plot <strong>in</strong>dicates obvious non stationarity, here aga<strong>in</strong> this is due to thefact that the autocorrelations do not decrease at an exponential rate, after acerta<strong>in</strong> order.We shall differenciate the <strong>series</strong>, by apply<strong>in</strong>g the “seasonal” difference filter12F( B)= 1−B to the data, B be<strong>in</strong>g the backshift operator.• Click on the Dialog Recall button , <strong>and</strong> then click on SequenceCharts• Click on the Seasonally difference check box, <strong>and</strong> verify that 1 is <strong>in</strong> theSeasonally difference text box.


3.4 ARMA-type modelsNo <strong>in</strong>dication of rema<strong>in</strong><strong>in</strong>g non stationnarity <strong>in</strong> the seasonally filtered data canbe found <strong>in</strong> the next-com<strong>in</strong>g ACF plot.We shall therefore accept the hypothesis that this filtered <strong>series</strong> is stationary,which enables to reta<strong>in</strong> d=0 for the non seasonal part of the general filter <strong>and</strong>D=1 for its seasonal part (see the Methodology Report)..Second, the ACF <strong>and</strong> the PACF plots, taken together, lead to choose p=2 , q=0for the non seasonal part of the model, <strong>and</strong> P=0, Q=1 for the seasonal part ofthe model, <strong>in</strong>dicat<strong>in</strong>g that the model is an ARIMA(2,0,0)(0,1,1) 12 .P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 5 9


Chapter 33.4.4.3. Model estimation <strong>and</strong> validation• Click on Analyze..Time Series..Create Models• Move the variable LDRIVERS <strong>in</strong>to the Dependent Variable(s) list box.• Choose ARIMA <strong>in</strong> the Method list.• Click on Criteria then specify the ARIMA model


3.4 ARMA-type models• Click on Statistics tab then check parameter estimates <strong>in</strong> the Statisticsfor Individual Models list.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 6 1


Chapter 3• Click on Plots tab <strong>and</strong> check Observed Values, Fit values, Residualautocorrelation function (ACF), <strong>in</strong> Plots for Individual Models.• Click on Save <strong>and</strong> check Noise Residuals then Click on OK.


3.4 ARMA-type modelsIn the output w<strong>in</strong>dow of SPSS, we f<strong>in</strong>d the statistics for the chosen model asfollows:P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 6 3


Chapter 3Model Description ARIMA (2,0,0)(0,1,1)The results obta<strong>in</strong>ed after the estimation procedure (maximum likelihood) arepresented <strong>and</strong> commented below.Model FitThe stationary R-squared is only 53,0% ( the model expla<strong>in</strong>s 53,0% of thevariance of the filtered data, compared to a regression model), <strong>and</strong> muchsmaller than the R-square which is 77,3% (the model expla<strong>in</strong>s 77,3% of thevariance of the <strong>in</strong>itial data).As for the different measures of the error made:The mean absolute percentage error (MAPE) is 0,902% %, its highest valueobserved be<strong>in</strong>g 3,841%,the mean absolute error (MAE) is 0,067, its highest value observed be<strong>in</strong>g0,267,the root mean square error (RMSE) is 0,084, <strong>and</strong> is a little higher than if it werecomputed as an arithmetic mean (0,067),At last, the normalised BIC, which is -4,4850, is a goodness of fit measure thattakes account of the parsimony of the model. Note that, as it is the case for theR-squared, its <strong>in</strong>terest lies <strong>in</strong> comparisons between several models, <strong>and</strong> not <strong>in</strong>its absolute value.


3.4 ARMA-type modelsModel StatisticsIn this case, this hypothesis of correct specification (global nullity of theautocorrelation of the residuals is accepted, as the 0,17 value of the Ljung-Boxstatistic is more than 0.05.ARIMA Model ParametersIn this case, the hypothesis of nullity of each of the four parameters is rejected,<strong>and</strong> all parameters of the ARIMA model are to be considered as different fromzero.The residual ACF plot <strong>in</strong>dicates that no very significant autocorrelation rema<strong>in</strong>s,up to order 24.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 6 5


Chapter 33.4.4.4. Graphical results <strong>and</strong> additional testGraphical outputsAs <strong>in</strong> the preced<strong>in</strong>g case, the first plot describes the development of theobserved <strong>and</strong> fitted <strong>series</strong>, <strong>and</strong> the second plot the development of theirdifference (the estimated residuals)Note that the fitted data do not appear to stay one step beh<strong>in</strong>d the observeddata, as it was the case for the Norwegian fatalities model. The strong seasonalpattern is reproduced, which takes over the adjustment of the trend.


3.4 ARMA-type modelsNormality testHistogramQQ-plotP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 6 7


Chapter 3Kolmogorov-Smirnov TestIn this case, this hypothesis of normality of the residuals is accepted, becausethe 0,66 value of the Asymp. Sig. (2-tailed) is more than 0.05 (at the usual 95%confidence level).3.4.4.5. Intervention variableIn this section, we will add an <strong>in</strong>tervention variable to the model, <strong>in</strong> view ofperform<strong>in</strong>g a so-called <strong>in</strong>tervention <strong>analysis</strong>.Data descriptionThe reason for the <strong>in</strong>troduction of the <strong>in</strong>tervention variable is the <strong>in</strong>troduction ofthe seat belt law <strong>in</strong> February 1983. The variable will therefore be equal to 1,February 1983 onwards, <strong>and</strong> equal to 0 before (see the Methodology Report).Model estimation <strong>and</strong> validation• Click on Analyze..Time Series..Create Models• Move the variable LDRIVERS <strong>in</strong>to the Dependent Variables list box <strong>and</strong>the variable <strong>in</strong>terv <strong>in</strong> the Independent variables list box.• Choose ARIMA <strong>in</strong> the Method list, click on Criteria <strong>and</strong> then specify themodel you want to estimate.


3.4 ARMA-type modelsThe Transfer Function tab (only present if <strong>in</strong>dependent variables are specified)will now be used to def<strong>in</strong>e the call to the <strong>in</strong>tervention variable.The Transfer Function tab allows def<strong>in</strong><strong>in</strong>g transfer functions for the<strong>in</strong>dependent variables specified on the Variables tab. In this case, the<strong>in</strong>tervention variable is the only <strong>in</strong>dependent variable, <strong>and</strong> the <strong>in</strong>dication to begiven is that a seasonal difference filter is used for that variable (see theMethodology Report).P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 6 9


Chapter 3Here are the SPSS results for the specified model:


3.4 ARMA-type modelsModel Description ARIMA(2,0,0)(0,1,1) with <strong>in</strong>tervention variableModel FitThe addition of the <strong>in</strong>tervention to the model has improved the goodness-of-fit:the Stationary R-squared has <strong>in</strong>creased (from 0,53 to 0,563).The R-squared has <strong>in</strong>creased (from 0,773 to 0,788)the MAPE has decreased (from 3,841 to 2,835).the Normalized BIC has decreased (from -4,590 to -4,886).Model StatisticsNote that, <strong>in</strong> this case, the Ljung-Box statistic value of 0,043 is smaller then0,05, which <strong>in</strong>dicates that the hypothesis of global nullity of the autocorrelationP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 7 1


Chapter 3of the residuals, up to order 18, has to be rejected, at the usual 95% confidencelevel.ARIMA Model ParametersIn this case, the hypothesis of nullity of all parameters is rejected: all fourparameters related to the dynamics are to be considered as different from zero<strong>and</strong> the <strong>in</strong>tervention parameter too..The follow<strong>in</strong>g ACF plot of the residuals <strong>in</strong>dicate that the autocorrelation of order5, <strong>and</strong> of order 18, for <strong>in</strong>stance, differ significantly from zero, at this usualconfidence levelGraphical results <strong>and</strong> additional testThe two usual graphical outputs are given below, followed by the normality testresults.


3.4 ARMA-type modelsP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 7 3


Chapter 3HistogramQQ-plot


3.4 ARMA-type modelsKolmogorov-Smirnov TestIn this case, this hypothesis of normality of the residuals is accepted, becausethe 0,554 value of the Asymp. Sig. (2-tailed) is more than 0.05 (at the usual95% confidence level).3.4.4.6. Intervention <strong>and</strong> explanatory variablesIn this section, we will add two explanatory variables, LTRKM, LPPRICE, asdef<strong>in</strong>ed above <strong>in</strong> the database. The seat belt law <strong>in</strong>tervention variable from theprevious section will be kept <strong>in</strong> the model.Model estimation <strong>and</strong> validation• Click on Analyze..Time Series..Create Models• Move the variable LDRIVERS <strong>in</strong>to the Dependent Variables list box <strong>and</strong>the variables <strong>in</strong>terv, LTKRM <strong>and</strong> LPPRICE <strong>in</strong> the Independent variableslist box.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 7 5


Chapter 3As the <strong>in</strong>tervention variable is the only <strong>in</strong>dependent variable, the <strong>in</strong>dication to begiven is that a seasonal difference filter is performed on that variable (see theMethodology Report).


3.4 ARMA-type modelsModel Description ARIMA(2,0,0)(0,1,1) with explanatory variableModel FitP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 7 7


Chapter 3The addition of the two explanatory variables to the model has still improved thegoodness-of-fit:The Stationary R-squared has <strong>in</strong>creased to 0,59.The R-squared has <strong>in</strong>creased to 0,802The MAPE has decreased to 0,86 %.The only exception is that the Normalized BIC has <strong>in</strong>creased a little, from -4,886to -4,881, but rema<strong>in</strong>s smaller than <strong>in</strong> the pure ARIMA model (64,841).Model StatisticsThe hypothesis of global nullity of the autocorrelation of the residuals is stillaccepted, as the statistic is 0,078, <strong>and</strong> higher than 0,05 (at the 95% confidence


3.4 ARMA-type modelslevel).ARIMA Model ParametersIn this case, the hypothesis of nullity of the model parameters is rejected (at the95% confidence level), except for the log of the traffic <strong>in</strong>dex variable parameter:all parameters related to the dynamics are to be considered as different fromzero, the petrol price parameter <strong>and</strong> the <strong>in</strong>tervention parameter too.Note that, <strong>in</strong> case the confidence level is lowered to 70% for <strong>in</strong>stance (t-valuebetween 1 <strong>and</strong> 2), the parameter related to the traffic <strong>in</strong>dex variable would alsobe considered as different from zero too.Graphical results <strong>and</strong> additional testThe two usual graphical outputs are still given below, followed by the normalitytest results.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 7 9


Chapter 3Histogram


3.4 ARMA-type modelsQQ-plotP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 8 1


Chapter 3Kolmogorov-Smirnov TestIn this case, this hypothesis of normality of the residuals is accepted, becausethe 0,761 value of the Asymp. Sig. (2-tailed) is more than 0.05 (at the usual95% confidence level).Note that the statistic value has kept on <strong>in</strong>creas<strong>in</strong>g, <strong>and</strong> has the highest value <strong>in</strong>this very last case.


3.4 ARMA-type models3.4.5 Conclusion ARMA-type modelsIn this manual, the general approach for perform<strong>in</strong>g an ARMA-type <strong>analysis</strong>was given <strong>and</strong> demonstrated on several examples of ARMA-type models.A pure ARIMA model was estimated on non stationary annual data (section3.4.3.) <strong>and</strong> ARIMA models <strong>in</strong>clud<strong>in</strong>g additional variables (<strong>in</strong>tervention <strong>and</strong>explanatory) were estimated on non stationary monthly data (sections 3.4.4).In the case a pure ARIMA model was estimated, for descriptive or forecast<strong>in</strong>gpurposes, without any call to additional variables), the first relevant taskconsists <strong>in</strong> pretransform<strong>in</strong>g the <strong>in</strong>itial dataset <strong>in</strong> order to obta<strong>in</strong> anotherstationary data set. The second relevant task consists <strong>in</strong> identify<strong>in</strong>g aparsimonious ARMA model on this second data set, <strong>and</strong> to test whether thema<strong>in</strong> hypothesis related to the residuals of the model (non correlation, <strong>and</strong>normality) are valid.In the case ARIMA models <strong>in</strong>clud<strong>in</strong>g additional variables were estimated, fordescriptive <strong>and</strong> explanatory purposes, the global model was fitted directly, <strong>and</strong>all parameters - whether related to the dynamics or to the exogenous effects –estimated altogether.The ma<strong>in</strong> hypotheses related to the residuals were tested <strong>in</strong> the same manneras <strong>in</strong> the preced<strong>in</strong>g case.Apart of the added value due to the exogenous variables - <strong>in</strong> terms of<strong>in</strong>terpretation of the exogenous estimated parameters - , this <strong>modell<strong>in</strong>g</strong> <strong>in</strong>successive stages described <strong>in</strong> this manual, highlighted a general <strong>in</strong>crease ofthe model fit, which was observed at each stage (see the Methodology Reportfor more results about parameters <strong>in</strong>terpretation <strong>and</strong> ga<strong>in</strong> <strong>in</strong> the model fit).P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 8 3


Chapter 33.5 DRAG modelsThe DRAG model is an important theoretical contribution to road safety <strong>analysis</strong><strong>and</strong> has for the sake of completeness been described <strong>in</strong> the methodologyreport. In this manual, however, there is no section dedicated to DRAG-models<strong>in</strong> this manual, because they require large amounts of data which, <strong>in</strong> practice,are seldom available <strong>in</strong> road safety research. This manual is written as part ofthe SafetyNet project. The databases this project produces are not meant to beexhaustive enough for such a purpose. Moreover, the software reta<strong>in</strong>ed forperform<strong>in</strong>g <strong>time</strong> <strong>series</strong> <strong>analysis</strong> with<strong>in</strong> the project do not allow for estimation of aDRAG-model ( note that a software: the TRIO program, is dedicated toestimat<strong>in</strong>g DRAG models).


3.6 State space modelsChris de Blois (SWOV)3.6.1 IntroductionThe objective of this manual for state space <strong>analysis</strong> is 1) to demonstrate theopportunities that state space <strong>modell<strong>in</strong>g</strong> offers to road safety <strong>analysis</strong>, 2) to<strong>in</strong>struct the reader <strong>in</strong> sett<strong>in</strong>g up a state space model, <strong>and</strong> 3) to <strong>in</strong>struct thereader how to <strong>in</strong>terpret the model results. The reader does not need to be anexpert <strong>in</strong> statistics, <strong>modell<strong>in</strong>g</strong>, or programm<strong>in</strong>g.This state space <strong>analysis</strong> manual is closely related to Section 3.6 of theMethodology report, which deals with the theory beh<strong>in</strong>d state space <strong>modell<strong>in</strong>g</strong>.This manual section demonstrates how the analyses discussed <strong>in</strong> Section 3.6 ofthe Methodology report are performed with a software package for state space<strong>analysis</strong> called STAMP 6.0 2 . Therefore, the datasets used <strong>in</strong> this manual arethe same as the datasets which are considered <strong>in</strong> Section 3.6. In addition to thetheoretical sections, this manual also describes the general approachrecommended for state space <strong>analysis</strong> of <strong>time</strong> <strong>series</strong>. This approach isillustrated us<strong>in</strong>g the dataset represent<strong>in</strong>g the monthly number of drivers killed orseriously <strong>in</strong>jured (KSI) for the years 1969-1984 <strong>in</strong> the UK, which is one of thedatasets employed <strong>in</strong> the Methodology report.STAMP 6.0, a software package dedicated to state space <strong>modell<strong>in</strong>g</strong>, is powerful<strong>and</strong> easy to use, <strong>and</strong> is therefore also used for the state space analyses <strong>in</strong> thismanual. STAMP 6.0 has <strong>in</strong>dependently been reviewed by several authors:Teyssière (2005), Hallahan (2003), Judge <strong>and</strong> N<strong>in</strong>omiya (2000), <strong>and</strong> Yaffee(2003).Section 3.6.2 first describes <strong>in</strong> detail how to set up a determ<strong>in</strong>istic level model <strong>in</strong>STAMP 6.0 <strong>and</strong> how to <strong>in</strong>terpret the results. Then, the stochastic level model isdescribed less extensively <strong>and</strong> the results of the <strong>analysis</strong> are compared with theresults of the <strong>analysis</strong> with the determ<strong>in</strong>istic model. Section 3.6.3 deals with thelocal l<strong>in</strong>ear trend model. The determ<strong>in</strong>istic variant of this model, thedeterm<strong>in</strong>istic level <strong>and</strong> determ<strong>in</strong>istic slope model, corresponds to a classicall<strong>in</strong>ear regression. So, if this model is applied to the same dataset as used <strong>in</strong> theclassical l<strong>in</strong>ear regression manual (Section 3.2), then the results shouldcorrespond to the results presented <strong>in</strong> that manual. Section 3.6.4 <strong>in</strong>troduces anadditional component: the seasonal. In Section 3.6.5 <strong>and</strong> Section 3.6.6 anothertwo components are added: <strong>in</strong>tervention variables <strong>and</strong> explanatory variables,respectively. Sections 3.6.4 through 3.6.6 demonstrate the recommendedapproach to state space <strong>analysis</strong> of <strong>time</strong> <strong>series</strong>. F<strong>in</strong>ally, Section 3.6.7 conta<strong>in</strong>s2 System requirements for STAMP 6.0 are: W<strong>in</strong>dows XP/2000/NT/98/95. STAMP 6.0 is availablefrom Timberlake Consultants Ltd, Ujit 3, Broomsleigh Bus<strong>in</strong>ess Park, Worsley Bridge Road,London SE26 5BN, United K<strong>in</strong>gdom. Telephone +44 (0)20 86973377, Fax +44 (0)20 86973388.E-mail: <strong>in</strong>fo@timberlake.co.uk. Website: www.timberlake.co.uk. The ma<strong>in</strong> website for STAMPis: www.STAMP-software.com.


Chapter 3a summary of results <strong>and</strong> some general recommendations for the <strong>analysis</strong> ofroad safety data with state space techniques.The <strong>analysis</strong> of each model is subdivided <strong>in</strong>to the follow<strong>in</strong>g steps:1. Start of <strong>analysis</strong> <strong>and</strong> data load;2. Model formulation;3. Model estimation <strong>and</strong> <strong>in</strong>spection of results;4. Graphics of model components;5. Test of model residuals;6. Test of auxiliary residuals;7. Conclusion of <strong>analysis</strong>;8. Forecast<strong>in</strong>g (optional);9. Exercise (optional).Forecast<strong>in</strong>g is discussed for some of the models. Forecasts can be made only ifthe model performs well <strong>and</strong> the residuals satisfy the model assumptions. Theadditional exercise for the reader is optional as well.The follow<strong>in</strong>g conventions concern<strong>in</strong>g notation are used throughout Section 3.6:The basic explanations are <strong>in</strong> st<strong>and</strong>ard pr<strong>in</strong>t, Instructions are preceded by a bullet po<strong>in</strong>t,The model output is pr<strong>in</strong>ted <strong>in</strong> Courier, 10 pnt,More elaborate explanatory texts, which can be skipped without miss<strong>in</strong>g essential <strong>in</strong>formation,are pr<strong>in</strong>ted <strong>in</strong> italics,.


3.6 State space models3.6.2 Local level modelThis section first presents an extensive step-by-step description of the <strong>analysis</strong>of the Norwegian fatalities <strong>time</strong> <strong>series</strong> us<strong>in</strong>g the determ<strong>in</strong>istic level model <strong>in</strong>STAMP. The <strong>analysis</strong> <strong>in</strong>cludes trend description, residual test<strong>in</strong>g, <strong>and</strong> outliertest<strong>in</strong>g. Then, the <strong>analysis</strong> with the stochastic level model, or local level model,is described more succ<strong>in</strong>ctly. The latter <strong>analysis</strong> also <strong>in</strong>cludes forecast<strong>in</strong>g overseven years.3.6.2.1. Determ<strong>in</strong>istic level modelThe above mentioned steps will now be taken one-by-one for the determ<strong>in</strong>isticlevel model.Step 1: Start of <strong>analysis</strong> <strong>and</strong> data loadFirst, we open GiveW<strong>in</strong>, load the data, <strong>and</strong> start STAMP. Start the GiveW<strong>in</strong>2 program. Use the menu to open the file“NorwayFatalities.<strong>in</strong>7”.The data file is loaded <strong>and</strong> displayed <strong>in</strong> a m<strong>in</strong>imized w<strong>in</strong>dow at the bottom of theGiveW<strong>in</strong> ma<strong>in</strong> w<strong>in</strong>dow. To view the data file: Click on the icon with the two overlapp<strong>in</strong>g boxes:P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 8 7


Chapter 3This data file consists of two variables: the annual number of people killed <strong>in</strong>road traffic <strong>in</strong> Norway (see Sections 1.2.2 <strong>and</strong> 3.6.1 of the Methodology report)for the years 1970 through 2003 (“NO_fatalities”) <strong>and</strong> the logarithm of the latter<strong>time</strong> <strong>series</strong> (“Log_NO_fat”). M<strong>in</strong>imize the data file w<strong>in</strong>dow aga<strong>in</strong> <strong>and</strong> use the menu to start the STAMP program. The STAMP w<strong>in</strong>dow appears:Step 2: Model FormulationIn this step, we def<strong>in</strong>e the determ<strong>in</strong>istic level model: In the STAMP w<strong>in</strong>dow choose the menu . In the Data selection w<strong>in</strong>dow select the variable Log_NO_fat.


3.6 State space models Click the Add button. Then click OK. In the Select components w<strong>in</strong>dow, choose a Fixed Level, No slope,Irregular, <strong>and</strong> No seasonal:P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 8 9


Chapter 3 Then click on the F<strong>in</strong>ish button.Step 3: Model estimation <strong>and</strong> <strong>in</strong>spection of resultsThe third step is to estimate the model <strong>and</strong> <strong>in</strong>spect the results. In the Estimate Model w<strong>in</strong>dow, select Maximum Likelihood: Click OK.The model is estimated, <strong>and</strong> the follow<strong>in</strong>g output appears <strong>in</strong> the GiveW<strong>in</strong>Results w<strong>in</strong>dow:---- STAMP 6.30 session started at 13:05:05 on Monday 20 February 2006----Please cite STAMP as:


3.6 State space modelsKoopman S.J., Harvey, A.C., Doornik, J.A. <strong>and</strong> Shephard, N. (2000).Stamp: Structural Time Series Analyser, Modeller <strong>and</strong> Predictor,London: Timberlake Consultants Press.Method of estimation is Maximum likelihoodThe present sample is: 1970 to 2003Equation 1.Log_NO_fat = Level + IrregularEstimation reportModel with 1 parameters ( 1 restrictions).Parameter estimation sample is 1970. 1 - 2003. 1. (T = 34).Log-likelihood kernel is 0.No estimation done.Eq 1 : Diagnostic summary report.Estimation sample is 1970. 1 - 2003. 1. (T = 34, n = 33).Log-Likelihood is 48.1408 (-2 LogL = -96.2816).Prediction error variance is 0.047433Summary statisticsLog_NO_fatStd.Error 0.21779Normality 1.3457H( 11) 3.6612r( 1) 0.58763r( 6) -0.073609DW 0.22639Q( 6, 6) 28.814R^2 0.00000Eq 1 : Estimated variances of disturbances.Component Log_NO_fat (q-ratio)Irr 0.048583 ( 1.0000) In the first part of the output (estimation report <strong>and</strong> above), check the outputon the estimation method (maximum likelihood), sample period (1970-2003),model components (level <strong>and</strong> irregular), the number of parametersestimated (1), <strong>and</strong> the number of observations (T=34).The diagnostic summary report gives some additional <strong>in</strong>formation: number of degrees offreedom (T-1), log-likelihood, <strong>and</strong> prediction error variance. The log-likelihood value given is thelog-likelihood function at its maximum value after estimation. This value is different from thevalue <strong>in</strong> Section 3.6.1.4 of the Methodology report, which is obta<strong>in</strong>ed from the above value byextract<strong>in</strong>g a constant <strong>and</strong> divid<strong>in</strong>g by another constant. Both constants depend on the number ofobservations T. The prediction error variance (PEV) is a basic measure of goodness-of-fit (thesmaller the PEV, the better the fit).Next, the summary of statistics can be used to evaluate model performancewith respect to the diagnostic tests (see Section 3.6.1.4 of the Methodologyreport). For this evaluation, we make a table like Table 3.6.1. A “+” <strong>in</strong> the lastcolumn of Table 3.6.1 means that the assumption is satisfied, a “-” <strong>in</strong>dicatesviolation of the assumption.Statistic Value Critical 5%value aAssumptionsatisfiedP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 9 1


Chapter 3Independence Q(6,6) 28.8 12.59 -r(1) 0.588 0.34 -r(6) -0.0736 0.34 +Homoscedasticity H(11) 3.66 3.47 -Normality N 1.35 5.99 +Table 3.6.1: Diagnostic test results for the determ<strong>in</strong>istic level model applied to the logof Norwegian fatalities. a Probability that statistic exceeds critical value is 0.05.Comparison of Table 3.6.1 <strong>and</strong> the correspond<strong>in</strong>g Table 3.6.1 <strong>in</strong> the Methodology report, showsthat STAMP uses other choices with respect to the statistics than <strong>in</strong> the <strong>analysis</strong> presented <strong>in</strong>the Methodology report, i.e. Q(6) <strong>in</strong>stead of Q(10) <strong>and</strong> r(6) <strong>in</strong>stead of r(4). Below, STAMP'schoices are amplified.Koopman et al. (2000) give more <strong>in</strong>formation on the summary statistics. Here, we restrictourselves to some concise remarks.− For test<strong>in</strong>g normality, the Doornik-Hansen statistic is used, which is, under the nullhypothesis of normally distributed residuals, approximately χ 2 (2) distributed.− H(h) is a heteroscedasticity test, which is approximately F(h,h) distributed. In STAMP, “h” isdeterm<strong>in</strong>ed as the number of degrees of freedom divided by three <strong>and</strong> rounded down to thenearest <strong>in</strong>teger.− r(τ) is the residual autocorrelation at lag τ, distributed approximately as N(0,1/T).− DW is the Durb<strong>in</strong>-Watson statistic, which tests for residual autocorrelation at lag 1, <strong>and</strong> isapproximately N(2,4/T) distributed.− Q(P,d) is the BOX-Ljung Q-statistic based on the first P residual autocorrelations, which isdistributed approximately as χ 2 (d), where d is P-m+1 with m the number of parameters.− R 2 is the coefficient of determ<strong>in</strong>ation, which is a measure of the proportion of observationalvariance which is expla<strong>in</strong>ed by the model <strong>and</strong> as such a measure of goodness-of-fit. At the bottom of the GiveW<strong>in</strong> results w<strong>in</strong>dow, check whether the estimatedvariances of the disturbances are sufficiently large.A near zero variance is an <strong>in</strong>dication of a determ<strong>in</strong>istic component. In this model, the only modelcomponent which can vary, the irregular component (i.e., the observation disturbances), hasunequal to zero. The level is fixed. In the determ<strong>in</strong>istic level model, the estimated variance ofthe irregular component is equal to the variance of the <strong>series</strong>. The variance of the log of thenumber of Norwegian fatalities is therefore equal to 0.048583. The q-ratio (<strong>in</strong> the outputbetween brackets) is the ratio of each variance to the largest <strong>and</strong> is equal to one, because thereis only one variance, which therefore is the largest.Next, we will produce some additional output: In the STAMP w<strong>in</strong>dow choose <strong>in</strong> the menu. Select Additional output, Get steady state, Anti-log <strong>analysis</strong>, <strong>and</strong> State <strong>and</strong>regression output:


3.6 State space models Click OK.In the GiveW<strong>in</strong> Results w<strong>in</strong>dow, the follow<strong>in</strong>g additional results are displayed:Eq 1 : Estimated st<strong>and</strong>ard deviations of disturbances.Component Log_NO_fat (q-ratio)Irr 0.22042 ( 1.0000)Eq 1 : Estimated coefficients of f<strong>in</strong>al state vector.Variable Coefficient R.m.s.e. t-valueLvl 5.9323 0.037801 156.93 [ 0.0000]Anti-log trend <strong>analysis</strong>Trend value at end of period is 377.005.The estimated st<strong>and</strong>ard deviation of the irregular is the square root of the estimated variance ofthe irregular (see above). In the determ<strong>in</strong>istic level model, the estimated st<strong>and</strong>ard deviation ofthe irregular is equal to the st<strong>and</strong>ard deviation of the observations <strong>in</strong> the <strong>series</strong>. Check the values of the estimated coefficients of the f<strong>in</strong>al state vector.The f<strong>in</strong>al state vector conta<strong>in</strong>s the values of the model components for the last <strong>time</strong> step of theobserved <strong>time</strong> <strong>series</strong>. The state only consists of a level component <strong>in</strong> this case, <strong>and</strong> theestimate for the value of the level <strong>in</strong> 2003 equals 5.9323, which is the mean of the log of theNorwegian fatalities <strong>series</strong>. Moreover, s<strong>in</strong>ce the level is treated determ<strong>in</strong>istically <strong>in</strong> this <strong>analysis</strong>,its estimated value is actually 5.9323 for all T=34 <strong>time</strong> po<strong>in</strong>ts of the <strong>series</strong>. The t-statistic iscomputed as the coefficient (5.9323) divided by its root mean square error (0.0378), <strong>and</strong> is usedto test whether the estimated value of the level significantly deviates from zero. Test the significance of the estimated coefficients of the f<strong>in</strong>al state vector.Under the null hypothesis, the t-statistic has a Student's t-distribution with T-1 degrees offreedom. Between square brackets beh<strong>in</strong>d the t-value, the model output gives the probabilitythat the absolute value of a Student's t-distributed variable X exceeds the actual, absolute valueof the t-statistic, i.e. Prob(|X|>|t|). This probability is very small here, so it may seem that thelevel significantly deviates from zero. However, s<strong>in</strong>ce the residuals do not satisfy theassumptions of <strong>in</strong>dependence <strong>and</strong> homoscedasticity (see Table 3.6.1), this t-test is seriouslyflawed, <strong>and</strong> one should be careful not to draw any conclusions from this test.The anti-log trend <strong>analysis</strong> presents the value of the estimated level for the orig<strong>in</strong>al <strong>series</strong> at theend of the <strong>series</strong> (2003). In the determ<strong>in</strong>istic level model, this value is equal to exp(mean of thelog-transformed <strong>series</strong>).P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 9 3


Chapter 3Step 4: Graphics of model componentsGraphics enlarge the <strong>in</strong>sight <strong>in</strong> the data <strong>and</strong> the model results. Therefore, wewill generate figures of the observed <strong>and</strong> log-transformed <strong>time</strong> <strong>series</strong> <strong>and</strong> theestimated trend. In the STAMP w<strong>in</strong>dow choose menu .Select Trend, Irregular <strong>and</strong> Smoothed: Click OK.The STAMP Graphics w<strong>in</strong>dow appears with graphs of the observed logtransformed<strong>time</strong> <strong>series</strong> <strong>and</strong> the modelled trend <strong>and</strong> irregular:


3.6 State space models Exam<strong>in</strong>e the figures <strong>in</strong> the STAMP Graphics w<strong>in</strong>dow <strong>and</strong> visually <strong>in</strong>spect thebottom figure of observation disturbances for possible serial correlation.The top figure shows the log-transformed observations <strong>and</strong> the estimated level, which is just themean of the <strong>series</strong>. The bottom figure shows the irregular component or observationdisturbances, which, <strong>in</strong> this case, are equal to the deviations of the observations from theirmean value. Visual <strong>in</strong>spection of the latter figure clearly reveals serial correlation because apositive disturbance tends to be followed by other positive disturbances while a negativedisturbance tends to be followed by more negative disturbances. This is confirmed by the moreformal tests for <strong>in</strong>dependence presented <strong>in</strong> Table 3.6.1. Use the menu or to save these graphs, e.g. as anEncapsulated Postcript file (*.eps). M<strong>in</strong>imize the STAMP Graphics w<strong>in</strong>dow.Now, we will have a look at the graphs re-expressed <strong>in</strong> terms of the orig<strong>in</strong>aldata (i.e., <strong>in</strong> terms of the orig<strong>in</strong>al number of fatalities <strong>in</strong>stead of their logarithm): Go back to the STAMP w<strong>in</strong>dow <strong>and</strong> choose . Select Trend, Irregular, Smoothed, <strong>and</strong> Anti-log <strong>analysis</strong>. Click OK.The STAMP Graphics w<strong>in</strong>dow appears with graphs of the orig<strong>in</strong>al observed<strong>time</strong> <strong>series</strong> <strong>and</strong> the modelled (anti-logged) level <strong>and</strong> irregular components:P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 9 5


Chapter 3 Shortly exam<strong>in</strong>e the figures <strong>in</strong> the STAMP Graphics w<strong>in</strong>dow <strong>and</strong> checkpossible serial dependence of the observation disturbances.The top figure shows the orig<strong>in</strong>al observations <strong>and</strong> the estimated level, which is equal to theanti-logged mean of the log-transformed <strong>series</strong> (which is not exactly the same as the mean ofthe orig<strong>in</strong>al <strong>series</strong>!). The bottom figure shows the irregular component or observationdisturbances, which, <strong>in</strong> this case, are equal to the deviations from the mean value. Notice thatthe mean of the irregular is around one <strong>and</strong> not zero, because of the anti-logg<strong>in</strong>g. Aga<strong>in</strong> use the menu or to save these graphs <strong>and</strong>m<strong>in</strong>imize the STAMP Graphics w<strong>in</strong>dow.Step 5: Test of model residualsSTAMP provides the most relevant graphical residual tests as well as moreextensive test statistics for normality, goodness-of-fit, <strong>and</strong> serial correlation.The residuals are not the same as the observation disturbances (i.e., the irregular component).In state space <strong>modell<strong>in</strong>g</strong>, the estimation process consists of two ma<strong>in</strong> steps:- filter<strong>in</strong>g, <strong>in</strong> which only the preced<strong>in</strong>g observations are used <strong>and</strong> which leads to the “filteredstate” <strong>and</strong> the “one-step ahead predictions”, <strong>and</strong>- smooth<strong>in</strong>g, <strong>in</strong> which all observations are used <strong>and</strong> which leads to the “smoothed state” <strong>and</strong>the “smoothed predictions”.The residuals correspond to the filtered state, the observation disturbances to the smoothedstate. In fact, the residuals are the st<strong>and</strong>ardized one-step ahead prediction errors, whereas theobservation disturbances are the smoothed prediction errors. For more <strong>in</strong>formation about thefiltered <strong>and</strong> smoothed state, see Harvey (1989) or Durb<strong>in</strong> an Koopman (2001).


3.6 State space modelsTests of the model assumptions are usually applied to the residuals, not to the observationdisturbances. Go back to the STAMP w<strong>in</strong>dow <strong>and</strong> choose . In the Residual graphics w<strong>in</strong>dow select Residuals, Correlogram, with 8,Density, Histogram, Normal, QQ plot, <strong>and</strong> Write diagnostic tests: Click OK.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 9 7


Chapter 3The STAMP graphics w<strong>in</strong>dow <strong>in</strong> GiveW<strong>in</strong> shows the st<strong>and</strong>ardized residuals <strong>and</strong>their correlogram, density function, <strong>and</strong> normal probability plot: see above. In the top left figure of the STAMP Graphics w<strong>in</strong>dow, check how manyresiduals are outside the 95% confidence <strong>in</strong>terval. The two confidencebounds are <strong>in</strong>dicated by straight l<strong>in</strong>es at the level of about 2 <strong>and</strong> -2.Under the assumption of normality, about 95% of the residuals should lie between the twoconfidence bounds. The figure shows that only one residual is located outside the confidencebounds, <strong>in</strong>dicat<strong>in</strong>g that the residuals are acceptable with respect to this test. In the top right figure of the STAMP Graphics w<strong>in</strong>dow, check thecorrelogram for possible serial dependence of the residuals.The correlogram presents the residual autocorrelation for lags 1 to 8. Us<strong>in</strong>g a 95% confidencelevel, the autocorrelation should be between -2/√T=-0.34 <strong>and</strong> 2/√T=0.34 (see also Table 3.6.1).As we can see, for the first three out of eight lags the autocorrelation is outside this range,<strong>in</strong>dicat<strong>in</strong>g serial dependence of the residuals. In the bottom left figure of the STAMP Graphics w<strong>in</strong>dow, compare theestimated density function with the normal density function with the samemean <strong>and</strong> st<strong>and</strong>ard deviation, <strong>in</strong> order to evaluate the degree of normality ofthe residuals.The density diagram shows the distribution of the residuals over discrete <strong>in</strong>tervals <strong>in</strong> thehistogram. The density function is estimated by “a smoothed function of the histogram us<strong>in</strong>g anormal or Gaussian kernel” (Koopman et al., 2000). From this density diagram, we canconclude that normality seems to be ok. This conclusion is <strong>in</strong> agreement with the conclusion onnormality based on the diagnostic tests <strong>in</strong> Table 3.6.1.


3.6 State space models In the bottom right figure of the STAMP Graphics w<strong>in</strong>dow, check the QQ plotto evaluate the degree of normality of the residuals.The QQ plot or normal probability plot is created by rank order<strong>in</strong>g the residuals from small tolarge <strong>and</strong> compar<strong>in</strong>g them with the normal probability values correspond<strong>in</strong>g to the cumulativeprobabilities of 1/(T+1), 2/(T+1), etc. If the residuals are approximately normally distributed, theplot is an (almost) straight l<strong>in</strong>e. From this QQ plot, we can conclude that the residuals areapproximately normally distributed, which is <strong>in</strong> agreement with the conclusions based on thedensity diagram <strong>and</strong> the diagnostic tests (see Table 3.6.1). Use the menu or to save these graphs <strong>and</strong> m<strong>in</strong>imizethe STAMP Graphics w<strong>in</strong>dow.In the Results w<strong>in</strong>dow of GiveW<strong>in</strong> the follow<strong>in</strong>g residual test results (normality,goodness of fit, serial correlation) have been added:Normality test for Residual Log_NO_fatSample Size 33Mean -0.835567Std.Devn. 0.549388Skewness -0.282549Excess Kurtosis -0.763819M<strong>in</strong>imum -2.048279Maximum 0.148184Skewness Chi^2(1) 0.43909 [0.5076]Kurtosis Chi^2(1) 0.8022 [0.3704]Normal-BS Chi^2(2) 1.2413 [0.5376]Normal-DH Chi^2(2) 1.3457 [0.5102]Goodness-of-fit results for Residual Log_NO_fatPrediction error variance (p.e.v) 0.047433Prediction error mean deviation (m.d) 0.040164Ratio p.e.v. / m.d <strong>in</strong> squares 0.887888Coefficient of determ<strong>in</strong>ation R2 -0.005917... based on differences RD2 -3.292629Information criterion of Akaike AIC -2.989614... of Schwartz (Bayes) BIC -2.944721Serial correlation statistics for Residual Log_NO_fat.Durb<strong>in</strong>-Watson test is 0.226385.Asymptotic deviation for correlation is 0.174078.Lag dF SerCorr BoxLjung ProbChi2(dF)1 0 0.58762 1 0.5052 21.9744 [ 0.0000]3 2 0.3724 27.3134 [ 0.0000]4 3 0.1785 28.5818 [ 0.0000]5 4 0.0032 28.5822 [ 0.0000]6 5 -0.0736 28.8140 [ 0.0000]7 6 -0.0734 29.0532 [ 0.0001]8 7 -0.0637 29.2405 [ 0.0001] In the Results w<strong>in</strong>dow of GiveW<strong>in</strong>, check the results of the normality test.Use the Doornik-Hansen statistic. The probability value [between squarebrackets] should be larger than 0.05.The normality test gives the sample size <strong>and</strong> the mean, st<strong>and</strong>ard deviation, skewness, kurtosis,m<strong>in</strong>imum, <strong>and</strong> maximum of the residuals. The values of the skewness <strong>and</strong> kurtosis are testedaga<strong>in</strong>st a χ 2 (1) distribution, whereas the Bowman-Shenton statistic <strong>and</strong> the Doornik-Hansenstatistic are tested aga<strong>in</strong>st a χ 2 (2) distribution. Koopman et al. (2000) note that the first threetests are only suitable when applied to very large samples. In this case, with a sample size ofP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 1 9 9


Chapter 334, we better use the Doornik-Hansen statistic only. The correspond<strong>in</strong>g probability value islarger than 0.05 <strong>and</strong> therefore <strong>in</strong>dicates normality of the residuals. The Doornik-Hansen statisticwas also <strong>in</strong>cluded <strong>in</strong> the summary of statistics (see step 1) <strong>and</strong> <strong>in</strong> the diagnostic test resultstable (Table 3.6.1). Consider the goodness-of-fit results <strong>and</strong> checko whether the ratio between PEV <strong>and</strong> MD <strong>in</strong> squares is close to one;o whether the coefficient of determ<strong>in</strong>ation, R 2 , is positive <strong>and</strong> close toone;o the value of the Akaike Information Criterion (AIC).The goodness-of-fit results <strong>in</strong>clude the prediction error variance (PEV), the prediction errormean deviation (MD), <strong>and</strong> the ratio of their squares computed as 2*PEV 2 /(π*MD 2 ). Theprediction error variance is the variance of the residuals <strong>in</strong> the steady state, i.e. when therecursive computation procedure, known as the “Kalman Filter” (Harvey, 1989; Durb<strong>in</strong> <strong>and</strong>Koopman, 2001), has converged. If the Kalman Filter does not converge, the f<strong>in</strong>ite PEV is used.The prediction error mean deviation is the mean deviation of the residuals <strong>in</strong> the steady state. Ina correctly specified model, the ratio between the squared PEV <strong>and</strong> the squared MD should beclose to one (Koopman et al., 2000).Furthermore, the goodness of fit can be evaluated by means of the coefficient of determ<strong>in</strong>ation,which is a measure of the extent to which the variance of the observations is expla<strong>in</strong>ed by thevariance of the model predictions. Koopman et al. (2000) give three variants <strong>and</strong> def<strong>in</strong>e theirarea of application as <strong>in</strong> Table 3.6.2. They note that the coefficient of determ<strong>in</strong>ation maybecome negative, which is an <strong>in</strong>dication of a worse fit than <strong>in</strong> a simple r<strong>and</strong>om level model (orr<strong>and</strong>om level <strong>and</strong> slope model, or r<strong>and</strong>om level, slope <strong>and</strong> seasonal model).Coefficient of determ<strong>in</strong>ation Appropriate to be applied to <strong>time</strong> <strong>series</strong> with …R 2no slope, no seasonal (stationary)R 2 Dslope, no seasonalslope <strong>and</strong> seasonalR 2 STable 3.6.2: The coefficients of determ<strong>in</strong>ation <strong>and</strong> their application area. Source:(Koopman et al., 2000).F<strong>in</strong>ally, often used goodness-of-fit variables are the Akaike Information Criterion (AIC) <strong>and</strong> theBayes Information Criterion (BIC). The smaller the AIC (or BIC), the better the model. Thesevariables are computed as log(PEV)+c*m/T, where T is the sample size (34), m the number ofparameters estimated (1), <strong>and</strong> c is 2 <strong>in</strong> the AIC <strong>and</strong> log(T) <strong>in</strong> the BIC. Note that the AIC isdef<strong>in</strong>ed differently <strong>in</strong> Section 3.6.1.4 of the Methodology report, where it is based on the loglikelihood.Check on serial correlation by us<strong>in</strong>g the Box-Ljung statistic as computed forlags 1 to 8.The serial correlation statistics <strong>in</strong>clude the Durb<strong>in</strong>-Watson test, the asymptotic deviation forcorrelation, <strong>and</strong> the serial correlation for lags 1 to 8 with the correspond<strong>in</strong>g value of the Box-Ljung statistic <strong>and</strong> the correspond<strong>in</strong>g probability value. These probability values should belarger than 0.05, which is the case for none of the eight lags considered. The Durb<strong>in</strong>-Watson<strong>and</strong> the Box-Ljung statistic were already described above, when deal<strong>in</strong>g with the summary ofstatistics (see Step 1: Start of <strong>analysis</strong> <strong>and</strong> data load).Not all of these residual test results have to be checked always. Werecommend to use the AIC (<strong>and</strong>/or BIC) to test the goodness-of-fit <strong>and</strong> the Box-Ljung statistics to test serial <strong>in</strong>dependence. The Doornik-Hansen test fornormality is already <strong>in</strong>cluded <strong>in</strong> the summary of statistics (see Step 1: Start of<strong>analysis</strong> <strong>and</strong> data load).


3.6 State space modelsStep 6: Test of auxiliary residualsThe so-called auxiliary residuals are very helpful <strong>in</strong> f<strong>in</strong>d<strong>in</strong>g possible outliersamong the observations <strong>and</strong> structural breaks, e.g. caused by <strong>in</strong>terventions. Go to the STAMP w<strong>in</strong>dow aga<strong>in</strong> <strong>and</strong> choose . In the Auxiliary residuals graphics w<strong>in</strong>dow select Irregular, Level residual,Index plot, Density, Histogram, Normal, QQ plot, Write normality tests, <strong>and</strong>Write values exceed<strong>in</strong>g (3.5): Click OK.The STAMP graphics w<strong>in</strong>dow <strong>in</strong> GiveW<strong>in</strong> displays the auxiliary residuals of theirregular <strong>and</strong> their density function <strong>and</strong> normal probability plot:P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 0 1


Chapter 3The auxiliary residuals are st<strong>and</strong>ardised smoothed observation <strong>and</strong> state disturbances. Theauxiliary residuals should be normally distributed with zero mean <strong>and</strong> unity st<strong>and</strong>ard deviation.The auxiliary residuals of the irregular help to detect possible outlier observations, the auxiliaryresiduals of level, slope, <strong>and</strong> seasonal help to identify structural breaks <strong>in</strong> the level, slope, <strong>and</strong>seasonal component, respectively. For example, if for a certa<strong>in</strong> <strong>time</strong> po<strong>in</strong>t the auxiliary residualof the irregular is larger than 2 or smaller than -2, then this <strong>in</strong>dicates a possible outlierobservation. However, one should note that accord<strong>in</strong>g to a normal distribution 5% of theauxiliary residuals are expected to lie outside the 95% confidence <strong>in</strong>terval of ± 2. Us<strong>in</strong>g the figures <strong>in</strong> the STAMP Graphics w<strong>in</strong>dow, check whether theauxiliary residuals are approximately normally distributed.In the top figure, we see that none of the 34 st<strong>and</strong>ardised smoothed observation disturbances,i.e. unmistakably less than 5%, is larger than 2 or smaller than -2. The middle <strong>and</strong> the bottomfigure show that the st<strong>and</strong>ardised smoothed observation disturbances are approximatelynormally distributed.Because the level was assumed determ<strong>in</strong>istic <strong>in</strong> this model, no st<strong>and</strong>ardised smoothed leveldisturbances were estimated. Use the menu or to save these graphs <strong>and</strong> m<strong>in</strong>imizethe STAMP Graphics w<strong>in</strong>dow.In the Results w<strong>in</strong>dow of GiveW<strong>in</strong> the follow<strong>in</strong>g auxiliary residual test results(normality, goodness of fit, serial correlation) have been added:


3.6 State space modelsNormality test for IrrRes Log_NO_fatSample Size 34Mean 0.000000Std.Devn. 1.000000Skewness 0.103696Excess Kurtosis -1.063128M<strong>in</strong>imum -1.800584Maximum 1.822140Skewness Chi^2(1) 0.060933 [0.8050]Kurtosis Chi^2(1) 1.6012 [0.2057]Normal-BS Chi^2(2) 1.6621 [0.4356]Normal-DH Chi^2(2) 2.003 [0.3673] Check the result of the Doornik-Hansen test for normality.The normality test for residuals were already described above. From the Doornik-Hansen test,we can conclude that the hypothesis of normally distributed auxiliary residuals is accepted; theprobability value between square brackets is larger than 0.05.Note that <strong>in</strong> the Results w<strong>in</strong>dow of GiveW<strong>in</strong> no values exceed<strong>in</strong>g 3.5 have beenadded.Under the null hypothesis of normality, the probability of an auxiliary residual whose absolutevalue exceeds 3.5 is very small, about 0.0005. So, when this happens this is a very strong<strong>in</strong>dication of an outlier.Step 7: Conclusion of <strong>analysis</strong>The residuals obta<strong>in</strong>ed with the <strong>analysis</strong> of the log of the annual Norwegianfatalities from 1970 to 2003 with the determ<strong>in</strong>istic level model do not satisfy theimportant model assumptions of <strong>in</strong>dependence <strong>and</strong> homoscedasticity (seeTable 3.6.1). It is therefore not the appropriate model for describ<strong>in</strong>g this <strong>series</strong>.We still discussed all the output that can be obta<strong>in</strong>ed from STAMP <strong>in</strong> this case,so as to make the reader familiar with the possible options, <strong>and</strong> for reasons oflater reference.Step 8: Forecast<strong>in</strong>gBecause the determ<strong>in</strong>istic level model is clearly not appropriate, it does notmake much sense to compute forecasts with this model.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 0 3


Chapter 33.6.2.2 Stochastic level modelThe stochastic level model, which is also known as local level model, will bedescribed more briefly than the determ<strong>in</strong>istic level model, s<strong>in</strong>ce much of theoutput from STAMP has already been discussed <strong>and</strong> expla<strong>in</strong>ed <strong>in</strong> Section3.6.2.1. The focus will be on the new aspects <strong>and</strong> the comparison of the resultsof this model with the results of the determ<strong>in</strong>istic model.Step 1: Start of <strong>analysis</strong> <strong>and</strong> data loadIf you just fitted the determ<strong>in</strong>istic level model, GiveW<strong>in</strong> <strong>and</strong> STAMP havealready been started <strong>and</strong> data is still loaded <strong>in</strong> the GiveW<strong>in</strong> w<strong>in</strong>dow. If you starthere or if you have closed the database, STAMP, or GiveW<strong>in</strong> after the previousexercise, please follow the <strong>in</strong>structions under step 1 of Section 3.6.2.1.Step 2: Model FormulationThe stochastic (or: local) level model can be fitted <strong>in</strong> STAMP as follows: Choose the menu <strong>in</strong> the STAMP w<strong>in</strong>dow. If needed, select the variable Log_NO_fat <strong>in</strong> the Data selection w<strong>in</strong>dow <strong>and</strong>click the Add button. Then click OK. In the Select components w<strong>in</strong>dow, choose a Stochastic Level, No slope,Irregular, <strong>and</strong> No seasonal, as follows: Then click on the F<strong>in</strong>ish button.Step 3: Model estimation <strong>and</strong> <strong>in</strong>spection of results


3.6 State space models In the Estimate Model w<strong>in</strong>dow, select Maximum Likelihood. Click OK.The model is estimated, <strong>and</strong> the follow<strong>in</strong>g output appears <strong>in</strong> the GiveW<strong>in</strong>Results w<strong>in</strong>dow:Method of estimation is Maximum likelihoodThe present sample is: 1970 to 2003MaxLik <strong>in</strong>itialis<strong>in</strong>g...it 1 f= 2.09171 e0= 0.52237 step= 1.00000it 2 f= 2.22405 e0= 0.00514 step= 1.00000MaxLik iterat<strong>in</strong>g...it 2 f= 2.22407 df= 0.00000 e1= 0.00000 e2= 0.00000 step=0.00000Equation 2.Log_NO_fat = Level + IrregularEstimation reportModel with 2 parameters ( 1 restrictions).Parameter estimation sample is 1970. 1 - 2003. 1. (T = 34).Log-likelihood kernel is 2.224067.Very strong convergence <strong>in</strong> 2 iterations.( likelihood cvg 0gradient cvg 2.435829e-007parameter cvg 9.737567e-012 )Eq 2 : Diagnostic summary report.Estimation sample is 1970. 1 - 2003. 1. (T = 34, n = 33).Log-Likelihood is 75.6183 (-2 LogL = -151.237).Prediction error variance is 0.00989161Summary statisticsLog_NO_fatStd.Error 0.099457Normality 1.2746H( 11) 1.7464r( 1) -0.12735r( 7) -0.15301DW 2.0513Q( 7, 6) 5.4955R^2 0.79023Eq 2 : Estimated variances of disturbances.Component Log_NO_fat (q-ratio)Irr 0.0032682 ( 0.6949)Lvl 0.0047030 ( 1.0000) Check the results (sample period, log-likelihood, estimated variances ofdisturbances).The output first reports about the estimation process, which is subdivided <strong>in</strong>to <strong>in</strong>itialisation <strong>and</strong>further maximisation of the log-likelihood kernel f (see Koopman et al., 2000). In thedeterm<strong>in</strong>istic level case, no iterations were needed. Now, however, two parameters have to beestimated: the observation disturbance variance <strong>and</strong> the level disturbance variance.Convergence is reached <strong>in</strong> two <strong>in</strong>terations, <strong>and</strong> is very strong. This implies that all of theP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 0 5


Chapter 3convergence criteria used <strong>in</strong> the iterative process for parameter estimation are satisfied. Theseconvergence criteria are likelihood, gradient, <strong>and</strong> parameter convergence (cvg).At convergence, the value of the log-likelihood function is 75.6 which is larger than <strong>in</strong> thedeterm<strong>in</strong>istic level case (48.1). The prediction error variance (0.00989) is clearly smaller than forthe determ<strong>in</strong>istic level model (0.0474). These results <strong>in</strong>dicate that the stochastic model yields abetter fit than the determ<strong>in</strong>istic model, albeit at the expense of hav<strong>in</strong>g estimated one extraparameter (the variance of the level disturbances).The STAMP output results concern<strong>in</strong>g the summary statistics can aga<strong>in</strong> becondensed <strong>in</strong>to the follow<strong>in</strong>g table (see also Table 3.6.2 <strong>in</strong> the Methodologyreport):Statistic Value Critical 5%value aAssumptionsatisfiedIndependence Q(7,6) 5.50 12.59 +r(1) -0.127 0.34 +r(7) -0.153 0.34 +Homoscedasticity H(11) 1.75 3.47 +Normality N 1.27 5.99 +Table 3.6.3: Diagnostic test results for the stochastic level model applied to the log ofNorwegian fatalities. a Probability that statistic exceeds critical value is 0.05.When we compare the results <strong>in</strong> Table 3.6.3 with those <strong>in</strong> Table 3.6.1, we see that also withrespect to the diagnostic tests the stochastic level model performs better than the determ<strong>in</strong>isticlevel model, because the present model satisfies all assumptions.F<strong>in</strong>ally, the output gives the estimated disturbance variances. The irregular disturbance varianceis about 30% smaller than the level disturbance variance. In the STAMP w<strong>in</strong>dow choose <strong>in</strong> the menu. Select Additional output, Get steady state, Anti-log <strong>analysis</strong>, <strong>and</strong> State <strong>and</strong>regression output. Click OK.


3.6 State space modelsThe GiveW<strong>in</strong> Results w<strong>in</strong>dow will display the follow<strong>in</strong>g additional results:Eq 3 : Estimated st<strong>and</strong>ard deviations of disturbances.Component Log_NO_fat (q-ratio)Irr 0.057168 ( 0.8336)Lvl 0.068578 ( 1.0000)Eq 3 : Estimated coefficients of f<strong>in</strong>al state vector.Variable Coefficient R.m.s.e. t-valueLvl 5.6627 0.047118 120.18 [ 0.0000]Anti-log trend <strong>analysis</strong>Trend value at end of period is 287.92.The value of the level at the end of the period as presented by the anti-log trend <strong>analysis</strong> (288)is considerably smaller than the correspond<strong>in</strong>g value <strong>in</strong> the determ<strong>in</strong>istic model (377).Step 4: Graphics of model components In the STAMP w<strong>in</strong>dow choose menu .Select Trend, Irregular <strong>and</strong> Smoothed. Click OK.The STAMP Graphics w<strong>in</strong>dow appears with graphs of the observed logtransformed<strong>time</strong> <strong>series</strong> <strong>and</strong> the modelled trend <strong>and</strong> irregular. Use the menu or to save these graphs, e.g. as anEncapsulated Postcript file (*.eps).An eps. file can be loaded <strong>in</strong>to a Word document. In the rema<strong>in</strong>der of thismanual we will do so <strong>in</strong>stead of add<strong>in</strong>g a screen pr<strong>in</strong>t as <strong>in</strong> Section 3.6.2.1.Figure 3.6.1 shows the observed log-transformed <strong>time</strong> <strong>series</strong> <strong>and</strong> the stochasticlevel component (top), <strong>and</strong> the irregular component (bottom).P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 0 7


Chapter 36.25Log_NO_fatTrend_Log_NO_fat6.005.751970 1975 1980 1985 1990 1995 2000 20050.05Irr_Log_NO_fat0.00−0.051970 1975 1980 1985 1990 1995 2000 2005Figure 3.6.1: Observed log-transformed <strong>time</strong> <strong>series</strong> <strong>and</strong> the local level <strong>and</strong> irregularcomponents for the log of Norwegian fatalities.In the top part of Figure 3.6.1, we see that the estimated trend is close to the (log-transformed)observations. Furthermore, we can see that the irregular shows a much more r<strong>and</strong>om patternthan <strong>in</strong> the determ<strong>in</strong>istic case. M<strong>in</strong>imize the STAMP Graphics w<strong>in</strong>dow.Step 5: Test of model residuals Go back to the STAMP w<strong>in</strong>dow <strong>and</strong> choose . In the Residual graphics w<strong>in</strong>dow select Residuals, Correlogram, with 8,Density, Histogram, Normal, QQ plot, <strong>and</strong> Write diagnostic tests. Click OK. Use the menu or to save these graphs.Figure 3.6.2 shows the st<strong>and</strong>ardized residuals <strong>and</strong> their correlogram, densityfunction, <strong>and</strong> normal probability plot as depicted by the STAMP graphicsw<strong>in</strong>dow <strong>in</strong> GiveW<strong>in</strong>.The top left graph of Figure 3.6.2 illustrates that none of the 34 residuals is outside the 95%confidence <strong>in</strong>terval, which is very good. From the top right graph, we learn that for none of theeight lags considered the autocorrelation is outside the 95% confidence <strong>in</strong>terval, which isdef<strong>in</strong>ed by the boundaries -2/√T=-0.34 <strong>and</strong> +2/√T=0.34. This is an <strong>in</strong>dication of absence ofserial dependence. The bottom graphs show that the assumption of normality of the residuals isbetter satisfied than <strong>in</strong> the determ<strong>in</strong>istic level model.


3.6 State space models2Residual Log_NO_fat1.0CorrelogramResidual Log_NO_fat10.500.0−1−0.51970 1980 1990 2000DensityN(s=0.956)0.40.30.20.10 5 10QQ plot2normal10−1−3 −2 −1 0 1 2 3−2−2.0 −1.5 −1.0 −0.5 0.0 0.5 1.0 1.5Figure 3.6.2: Residuals <strong>and</strong> residual tests for the stochastic level model applied to thelog of Norwegian fatalities.In the Results w<strong>in</strong>dow of GiveW<strong>in</strong>, the follow<strong>in</strong>g residual test results have beenadded (only part of the results is pr<strong>in</strong>ted):Goodness-of-fit results for Residual Log_NO_fatInformation criterion of Akaike AIC -4.498421... of Schwartz (Bayes) BIC -4.408635Serial correlation statistics for Residual Log_NO_fat.Lag dF SerCorr BoxLjung ProbChi2(dF)1 0 -0.12732 0 -0.01243 1 0.1095 1.0526 [ 0.3049]4 2 -0.1054 1.4951 [ 0.4735]5 3 -0.1382 2.2833 [ 0.5157]6 4 -0.2253 4.4556 [ 0.3478]7 5 -0.1530 5.4955 [ 0.3584]8 6 -0.0478 5.6010 [ 0.4693]The goodness-of-fit results are undoubtedly better than <strong>in</strong> the determ<strong>in</strong>istic case: <strong>in</strong> thestochastic model, the AIC is smaller (-4.50 <strong>in</strong>stead of -3.00), just as the BIC (-4.41 <strong>in</strong>stead of -2.94). For all lags considered, the Box-Ljung test <strong>in</strong>dicates that the most important assumptionof <strong>in</strong>dependence is satisfied.Step 6: Test of auxiliary residualsP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 0 9


Chapter 3 Go to the STAMP w<strong>in</strong>dow aga<strong>in</strong> <strong>and</strong> choose . In the Auxiliary residuals graphics w<strong>in</strong>dow select Irregular, Level residual,Index plot, Density, Histogram, Normal, QQ plot, Write normality tests, <strong>and</strong>Write values exceed<strong>in</strong>g (3.5). Click OK. Use the menu or to save these graphs.The STAMP graphics w<strong>in</strong>dow <strong>in</strong> GiveW<strong>in</strong> displays the auxiliary residuals of theirregular <strong>and</strong> of the level component <strong>and</strong> their density function <strong>and</strong> normalprobability plot: see Figure 3.6.3.2IrrRes Log_NO_fat0.50DensityN(s=0.996)00.25−221970QQ plot1980 1990 2000normal21−3 −2 −1 0 1 2 3LvlRes Log_NO_fat00−1−2−20.750.50DensityN(s=0.905)−1 0 1 22101970QQ plot1980 1990 2000normal0.25−1−2−2 −1 0 1 2−2.0 −1.5 −1.0 −0.5 0.0 0.5 1.0 1.5Figure 3.6.3: Auxiliary residuals <strong>and</strong> correspond<strong>in</strong>g tests for the stochastic level modelapplied to the log of Norwegian fatalities.


3.6 State space modelsThe follow<strong>in</strong>g output describes the auxiliary residual test results (normality,goodness of fit, serial correlation) as can be found <strong>in</strong> the Results w<strong>in</strong>dow ofGiveW<strong>in</strong>.Normality test for IrrRes Log_NO_fatSample Size 34Mean -0.000312Std.Devn. 0.995882Skewness -0.122988Excess Kurtosis -0.405802M<strong>in</strong>imum -2.234331Maximum 1.935788Skewness Chi^2(1) 0.085714 [0.7697]Kurtosis Chi^2(1) 0.23329 [0.6291]Normal-BS Chi^2(2) 0.319 [0.8526]Normal-DH Chi^2(2) 0.12802 [0.9380]Normality test for LvlRes Log_NO_fatSample Size 34Mean -0.386541Std.Devn. 0.904762Skewness 0.485177Excess Kurtosis -0.311715M<strong>in</strong>imum -2.040092Maximum 1.633678Skewness Chi^2(1) 1.3339 [0.2481]Kurtosis Chi^2(1) 0.13765 [0.7106]Normal-BS Chi^2(2) 1.4716 [0.4791]Normal-DH Chi^2(2) 1.9093 [0.3849]Both Figure 3.6.3 <strong>and</strong> the auxiliary residual tests demonstrate that the auxiliary residuals of boththe irregular <strong>and</strong> the level component satisfy the assumption of normality.Note that, just as <strong>in</strong> the determ<strong>in</strong>istic case, <strong>in</strong> the Results w<strong>in</strong>dow of GiveW<strong>in</strong> novalues exceed<strong>in</strong>g 3.5 have been added.Step 7: Conclusion of <strong>analysis</strong>The residuals obta<strong>in</strong>ed with the <strong>analysis</strong> of the log of the annual Norwegianfatalities from 1970 to 2003 with the local level model satisfy all the modelassumptions of <strong>in</strong>dependence, homoscedasticity, <strong>and</strong> normality. It seemstherefore to be the appropriate model for describ<strong>in</strong>g this <strong>series</strong>.Step 8: Forecast<strong>in</strong>gS<strong>in</strong>ce the local level model provides an appropriate description of the log of theNorwegian fatalities <strong>series</strong>, as a f<strong>in</strong>al step <strong>in</strong> the <strong>analysis</strong> we will computeseven-year forecasts for this <strong>series</strong>. Furthermore, by perform<strong>in</strong>g an anti-log<strong>analysis</strong> the forecasts will be re-expressed <strong>in</strong> terms of the orig<strong>in</strong>al count data. Go to the STAMP w<strong>in</strong>dow aga<strong>in</strong> <strong>and</strong> choose . In the Forecast<strong>in</strong>g w<strong>in</strong>dow select 7 as the number of forecasts, Trend, <strong>and</strong>Write forecasts Y:P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 1 1


Chapter 3 Click OK.


3.6 State space models6.0F−Log_NO_fatForecast5.85.61985 1990 1995 2000 2005 20106.0F−Log_NO_fatF−Trend_Log_NO_fat5.85.61985 1990 1995 2000 2005 2010Figure 3.6.4: Seven-years forecasts (2004-2010) of the stochastic level model appliedto the log of annual Norwegian fatalities, 1970-2003.The STAMP graphics w<strong>in</strong>dow <strong>in</strong> GiveW<strong>in</strong> displays the log-transformedobservations extended with the seven-years forecasts with 70% confidence<strong>in</strong>terval (plus <strong>and</strong> m<strong>in</strong>us one estimated st<strong>and</strong>ard deviation) <strong>in</strong> the top figure <strong>and</strong>the log-transformed observations <strong>and</strong> the extrapolated trend <strong>in</strong> the bottomfigure: see Figure 3.6.4. Use the menu or to save these graphs, e.g. as anEncapsulated Postcript file (*.eps).The bottom figure clearly illustrates that the local level model always yields forecasts that areequal to the last value of the level component <strong>in</strong> the <strong>series</strong>. This is <strong>in</strong> complete agreement withthe fact that we are deal<strong>in</strong>g with a local level model.In the Results w<strong>in</strong>dow of GiveW<strong>in</strong> the forecasts for the log-transformed <strong>time</strong><strong>series</strong> have been added:Eq 2 : Forecasts for F-Log_NO_fat.Period Forecast R.m.s.e. - Rmse + Rmse2004. 1 5.6627 0.10095 5.5617 5.76362005. 1 5.6627 0.12204 5.5406 5.78472006. 1 5.6627 0.13999 5.5227 5.80272007. 1 5.6627 0.15589 5.5068 5.81862008. 1 5.6627 0.17030 5.4924 5.83302009. 1 5.6627 0.18359 5.4791 5.84632010. 1 5.6627 0.19598 5.4667 5.8587P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 1 3


Chapter 3The list of forecast results gives for each <strong>time</strong> po<strong>in</strong>t the forecast, its st<strong>and</strong>ard error, <strong>and</strong> thelower <strong>and</strong> upper bound of the 70% confidence <strong>in</strong>terval. Go to the STAMP w<strong>in</strong>dow aga<strong>in</strong> <strong>and</strong> choose . In the Forecast<strong>in</strong>g w<strong>in</strong>dow select 7 as the number of forecasts, Trend,Modified anti-log <strong>analysis</strong>, <strong>and</strong> Write forecasts Y: Click OK.The STAMP graphics w<strong>in</strong>dow <strong>in</strong> GiveW<strong>in</strong> displays the orig<strong>in</strong>al observationsextended with the seven-years forecasts with 70% confidence <strong>in</strong>terval (plus <strong>and</strong>m<strong>in</strong>us one estimated st<strong>and</strong>ard deviation) <strong>in</strong> the top figure <strong>and</strong> the orig<strong>in</strong>alobservations <strong>and</strong> the extrapolated trend <strong>in</strong> the bottom figure: see Figure 3.6.5. Use the menu or to save these graphs.In the Results w<strong>in</strong>dow of GiveW<strong>in</strong> the forecasts for the orig<strong>in</strong>al observed <strong>time</strong><strong>series</strong> have been added:


3.6 State space modelsEq 2 : Forecasts for E_F-Log_NO_fat.Anti-logPeriod Forecast R.m.s.e. - Rmse + Rmse2004. 1 287.92 30.584 257.34 318.502005. 1 287.92 37.373 250.55 325.292006. 1 287.92 43.264 244.66 331.182007. 1 287.92 48.570 239.35 336.492008. 1 287.92 53.457 234.46 341.382009. 1 287.92 58.023 229.90 345.942010. 1 287.92 62.336 225.58 350.26450E_F−Log_NO_fatForecast4003503002501985 1990 1995 2000 2005 2010450 E_F−Log_NO_fat F−Trend_ME_Log_NO_fat4003503001985 1990 1995 2000 2005 2010Figure 3.6.5: Anti-logged seven-year forecasts (2004-2010) of the stochastic levelmodel applied to the log of annual Norwegian fatalities, 1970-2003.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 1 5


Chapter 33.6.3 Local l<strong>in</strong>ear trend modelIn this section, the slope component will be added to the local level model, soas to obta<strong>in</strong> the local l<strong>in</strong>ear trend model. The model will be applied to thenumber of fatalities as observed <strong>in</strong> F<strong>in</strong>l<strong>and</strong> for the period 1970 through 2003.The theory on this model <strong>and</strong> the results of its application to the F<strong>in</strong>nish dataare described <strong>in</strong> Section 3.6.2 of the Methodology report. This section expla<strong>in</strong>show the model is built <strong>in</strong> STAMP.First, this section presents a step-by-step description of the <strong>analysis</strong> of theF<strong>in</strong>nish fatalities <strong>time</strong> <strong>series</strong> us<strong>in</strong>g a l<strong>in</strong>ear trend model with determ<strong>in</strong>istic level<strong>and</strong> determ<strong>in</strong>istic slope, also called the determ<strong>in</strong>istic l<strong>in</strong>ear trend model. As <strong>in</strong>Section 3.6.2 of the Methodology report, we will show that this model, which isequivalent to a classical l<strong>in</strong>ear regression model, does not satisfy importantmodel assumptions. Then, we describe the <strong>analysis</strong> with the l<strong>in</strong>ear trend modelwith stochastic level <strong>and</strong> stochastic slope, which is also known as the stochasticl<strong>in</strong>ear trend model or local l<strong>in</strong>ear trend model. The latter <strong>analysis</strong> also <strong>in</strong>cludesforecast<strong>in</strong>g over seven years.3.6.3.1. Determ<strong>in</strong>istic l<strong>in</strong>ear trend modelStep 1: Start of <strong>analysis</strong> <strong>and</strong> data loadFirst, we open GiveW<strong>in</strong>, load the data, <strong>and</strong> start STAMP. If GiveW<strong>in</strong> is not yet open, then start GiveW<strong>in</strong>2. If GiveW<strong>in</strong> is still open from the previous <strong>analysis</strong>, then close all results,data, <strong>and</strong> graphics w<strong>in</strong>dows <strong>in</strong> GiveW<strong>in</strong> by click<strong>in</strong>g on the cross <strong>in</strong> the topright corner of each w<strong>in</strong>dow. Use the menu to open the file“F<strong>in</strong>l<strong>and</strong>Fatalities.<strong>in</strong>7”.The data file is loaded <strong>and</strong> displayed <strong>in</strong> a m<strong>in</strong>imized w<strong>in</strong>dow at the bottom of theGiveW<strong>in</strong> ma<strong>in</strong> w<strong>in</strong>dow. To view the data file: Click on the icon with the two overlapp<strong>in</strong>g boxes.The data file consists of two variables: the annual number of people killed <strong>in</strong>road traffic <strong>in</strong> F<strong>in</strong>l<strong>and</strong> for the years 1970 through 2003 (“FI_fatalities”) <strong>and</strong> thelogarithm of the latter <strong>time</strong> <strong>series</strong> (“Log_FI_fat”). M<strong>in</strong>imize the data file w<strong>in</strong>dow aga<strong>in</strong> <strong>and</strong> use the menu to start the STAMP program.Step 2: Model FormulationIn this step, we def<strong>in</strong>e the determ<strong>in</strong>istic l<strong>in</strong>ear trend model: In STAMP, choose the menu .


3.6 State space models In the Data selection w<strong>in</strong>dow select the variable Log_FI_fat <strong>and</strong> click Add. Then click OK. In the Select components w<strong>in</strong>dow, choose a Fixed level, Fixed slope,Irregular, <strong>and</strong> No seasonal: Then click on the F<strong>in</strong>ish button.Step 3: Model estimation <strong>and</strong> <strong>in</strong>spection of results In the Estimate Model w<strong>in</strong>dow, select Maximum Likelihood. Click OK.The model is estimated, <strong>and</strong> the follow<strong>in</strong>g output appears <strong>in</strong> the GiveW<strong>in</strong>Results w<strong>in</strong>dow:Eq 1 : Diagnostic summary report.Estimation sample is 1970. 1 - 2003. 1. (T = 34, n = 32).Log-Likelihood is 55.7297 (-2 LogL = -111.459).Prediction error variance is 0.0205839Summary statisticsLog_FI_fatStd.Error 0.14347Normality 3.5468H( 10) 0.63283r( 1) 0.76735r( 6) -0.080113DW 0.38632Q( 6, 6) 49.093Rd^2 -1.2238Eq 1 : Estimated variances of disturbances.ComponentLog_FI_fat (q-ratio)P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 1 7


Chapter 3Irr 0.021360 ( 1.0000) Check the results (sample period, log-likelihood, estimated variance ofdisturbances). The output results concern<strong>in</strong>g the summary statistics can aga<strong>in</strong> becondensed <strong>in</strong>to the follow<strong>in</strong>g table (see also Table 3.6.3 <strong>in</strong> the Methodologyreport):Statistic Value Critical 5% Assumption satisfiedvalue aIndependence Q(6,6) 49.1 12.59 -r(1) 0.767 0.34 -r(6) -0.0801 0.34 +Homoscedasticity H(10) 0.633 3.72 +Normality N 3.55 5.99 +Table 3.6.4: Diagnostic test results for the determ<strong>in</strong>istic l<strong>in</strong>ear trend model applied tothe log of F<strong>in</strong>nish fatalities. a Probability that statistic exceeds critical value is 0.05.Table 3.6.3 <strong>in</strong> the Methodology report gives the reciprocal value of the homoscedasticity test,because both the orig<strong>in</strong>al value <strong>and</strong> its reciprocal should be smaller than the critical 5% value<strong>and</strong>, <strong>in</strong> this case, the reciprocal is larger than the orig<strong>in</strong>al value. To stay close to the STAMPresults, Table 3.6.4 does not present the reciprocal value of the homoscedasticity test statistic.S<strong>in</strong>ce the reciprocal of H(10) equals 1/H(10) = 1/0.633 = 1.580, <strong>and</strong> because this value is stillsmaller than the critical value of 3.72, the assumption of homoscedasticity is satisfied. However,the most important assumption of <strong>in</strong>dependence is clearly not satisfied <strong>in</strong> this <strong>analysis</strong>. In the STAMP w<strong>in</strong>dow choose <strong>in</strong> the menu. Select Additional output, Get steady state, Anti-log <strong>analysis</strong>, <strong>and</strong> State <strong>and</strong>regression output. Click OK.In the GiveW<strong>in</strong> Results w<strong>in</strong>dow, the follow<strong>in</strong>g additional results are displayed:Eq 1 : Estimated st<strong>and</strong>ard deviations of disturbances.Component Log_FI_fat (q-ratio)Irr 0.14615 ( 1.0000)Eq 1 : Estimated coefficients of f<strong>in</strong>al state vector.Variable Coefficient R.m.s.e. t-valueLvl 5.9235 0.049044 120.78 [ 0.0000]Slp -0.028733 0.0025548 -11.246 [ 0.0000]Anti-log trend <strong>analysis</strong>Trend value at end of period is 373.731.Growth rate at end of period is -0.0287328 ( -2.87328 % per “year”).From the estimated coefficients of the f<strong>in</strong>al state vector <strong>and</strong> their t-values, we can conclude that,<strong>in</strong> the f<strong>in</strong>al state, both the level <strong>and</strong> the negative slope component are significantly different fromzero. In classical l<strong>in</strong>ear regression terms, we would say that both the <strong>in</strong>tercept <strong>and</strong> theregression coefficient significantly deviate from zero. However, these tests are flawed becausethe residuals do not satisfy the assumption of <strong>in</strong>dependence (see Table 3.6.4).From the anti-log trend <strong>analysis</strong>, we can see that the estimated trend value at the end of theperiod is 374, with a reduction rate of almost 3% per year.


3.6 State space modelsStep 4: Graphics of model components In the STAMP w<strong>in</strong>dow choose menu .Select (Plot Y <strong>and</strong> …) Trend, Slope, Irregular <strong>and</strong> Smoothed. Click OK.The STAMP Graphics w<strong>in</strong>dow appears with graphs of the observed logtransformed<strong>time</strong> <strong>series</strong> <strong>and</strong> the modelled trend, slope, <strong>and</strong> irregular (see Figure3.6.6).7.0 Log_FI_fat Trend_Log_FI_fat6.56.00.11970 1975 1980 1985 1990 1995 2000 2005Slope_Log_FI_fat0.01970 1975 1980 1985 1990 1995 2000 20050.25 Irr_Log_FI_fat0.00−0.251970 1975 1980 1985 1990 1995 2000 2005Figure 3.6.6: Observed log-transformed <strong>time</strong> <strong>series</strong> <strong>and</strong> the determ<strong>in</strong>istic l<strong>in</strong>ear trend(top graph), slope (middle graph), <strong>and</strong> irregular component (bottom graph) for the log ofF<strong>in</strong>nish fatalities.In the top part of Figure 3.6.6, we see that the estimated trend is a a straight l<strong>in</strong>e; it is <strong>in</strong> fact aclassical l<strong>in</strong>ear regression l<strong>in</strong>e. The middle graph displays the constant, negative slope. Thebottom graph clearly <strong>in</strong>dicates serial dependence <strong>in</strong> the observation disturbances. Use the menu or to save these graphs, e.g. as anEncapsulated Postcript file (*.eps). M<strong>in</strong>imize the STAMP Graphics w<strong>in</strong>dow.Step 5: Test of model residuals Go back to the STAMP w<strong>in</strong>dow <strong>and</strong> choose .P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 1 9


Chapter 3 In the Residual graphics w<strong>in</strong>dow select Residuals, Correlogram, with 8,Density, Histogram, Normal, QQ plot, <strong>and</strong> Write diagnostic tests. Click OK. Use the menu or to save these graphs <strong>and</strong> m<strong>in</strong>imizethe STAMP Graphics w<strong>in</strong>dow.Figure 3.6.7 shows the st<strong>and</strong>ardized residuals <strong>and</strong> their correlogram, densityfunction, <strong>and</strong> normal probability plot as depicted by the STAMP graphicsw<strong>in</strong>dow <strong>in</strong> GiveW<strong>in</strong>.The top left graph of Figure 3.6.7 shows that 1 out of the 34 residuals, i.e. 3%, lies outside the95% confidence <strong>in</strong>terval; this is acceptable s<strong>in</strong>ce we would expect about 1 out of 20 to falloutside this range. From the top right graph, we learn that for the first 3 of the 8 lags consideredthe autocorrelation is outside the 95% confidence <strong>in</strong>terval, which is def<strong>in</strong>ed by the boundaries-2/√T=-0.34 <strong>and</strong> +2/√T=0.34. The bottom graphs show that the assumption of normality of theresiduals is quite well satisfied, which confirms the normality test results <strong>in</strong> Table 3.6.4.3Residual Log_FI_fat1.0CorrelogramResidual Log_FI_fat20.5100.0−1−0.51970 1980 1990 2000Density0.6N(s=0.926)0.40QQ plot5 10normal210.20−1−2 −1 0 1 2 3 4−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5 2.0Figure 3.6.7: Residuals <strong>and</strong> residual tests for the determ<strong>in</strong>istic l<strong>in</strong>ear trend modelapplied to the log of F<strong>in</strong>nish fatalities.In the Results w<strong>in</strong>dow of GiveW<strong>in</strong>, the follow<strong>in</strong>g residual test results can befound (only part of the results are pr<strong>in</strong>ted below):


3.6 State space modelsGoodness-of-fit results for Residual Log_FI_fatInformation criterion of Akaike AIC -3.765598... of Schwartz (Bayes) BIC -3.675812Serial correlation statistics for Residual Log_FI_fat.Lag dF SerCorr BoxLjung ProbChi2(dF)1 0 0.76732 1 0.6601 36.4701 [ 0.0000]3 2 0.5001 45.8532 [ 0.0000]4 3 0.2712 48.7118 [ 0.0000]5 4 0.0528 48.8243 [ 0.0000]6 5 -0.0801 49.0928 [ 0.0000]7 6 -0.2189 51.1785 [ 0.0000]8 7 -0.3337 56.2274 [ 0.0000]For all lags considered, the Box-Ljung test <strong>in</strong>dicates that the most important assumption of<strong>in</strong>dependence is not satisfied, as we already noted on the basis of Figure 3.6.7 <strong>and</strong> Table 3.6.4.Step 6: Test of auxiliary residualsBecause the residuals obta<strong>in</strong>ed with the determ<strong>in</strong>istic l<strong>in</strong>ear trend model<strong>analysis</strong> does not satisfy the important model assumption of <strong>in</strong>dependence, weskipped this <strong>analysis</strong> step.Step 7: Conclusion of <strong>analysis</strong>The residuals obta<strong>in</strong>ed with the <strong>analysis</strong> of the log of the annual F<strong>in</strong>nishfatalities from 1970 to 2003 with the determ<strong>in</strong>istic l<strong>in</strong>ear trend model do notsatisfy the model assumption of <strong>in</strong>dependence. Therefore, it is not theappropriate model for describ<strong>in</strong>g this <strong>time</strong> <strong>series</strong>. This also means that aclassical l<strong>in</strong>ear regression model (without explanatory variables) would not beable to appropriately describe this <strong>series</strong>.Step 8: Forecast<strong>in</strong>gBecause the determ<strong>in</strong>istic l<strong>in</strong>ear trend model is not appropriate, it does notmake much sense to compute forecasts with this model.Step 9: ExerciseApply a determ<strong>in</strong>istic l<strong>in</strong>ear trend model to the dataset which was used <strong>in</strong> thepart of the manual dedicated to classical l<strong>in</strong>ear regression (Section 3.2.1). Usethe annual averages <strong>and</strong> compare the results with the correspond<strong>in</strong>g results <strong>in</strong>Section 3.2.1.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 2 1


Chapter 33.6.3.2. Stochastic l<strong>in</strong>ear trend modelStep 1: Start of <strong>analysis</strong> <strong>and</strong> data loadIf you just fitted the determ<strong>in</strong>istic l<strong>in</strong>ear trend model, GiveW<strong>in</strong> <strong>and</strong> STAMP havealready been started <strong>and</strong> data is still loaded <strong>in</strong> the GiveW<strong>in</strong> w<strong>in</strong>dow. If you starthere or if you have closed the database, STAMP, or GiveW<strong>in</strong> after the previousexercise, please follow the <strong>in</strong>structions under step 1 of Section 3.6.3.1.Step 2: Model FormulationThe stochastic (or: local) l<strong>in</strong>ear trend model can be fitted <strong>in</strong> STAMP as follows: Choose the menu . If needed, select the variable Log_FI_fat <strong>in</strong> the Data selection w<strong>in</strong>dow <strong>and</strong>click the Add button. Then click OK. In the Select components w<strong>in</strong>dow, choose a Stochastic Level, Stochasticslope, Irregular, <strong>and</strong> No seasonal: Then click on the F<strong>in</strong>ish button.Step 3: Model estimation <strong>and</strong> <strong>in</strong>spection of results In the Estimate Model w<strong>in</strong>dow, select Maximum Likelihood. Click OK.


3.6 State space modelsThe model is estimated, <strong>and</strong> the follow<strong>in</strong>g output appears <strong>in</strong> the GiveW<strong>in</strong>Results w<strong>in</strong>dow:Equation 1.Log_FI_fat = Trend + IrregularEstimation reportModel with 3 parameters ( 2 restrictions).Parameter estimation sample is 1970. 1 - 2003. 1. (T = 34).Log-likelihood kernel is 2.121946.Very strong convergence <strong>in</strong> 12 iterations.( likelihood cvg 3.160187e-014gradient cvg 2.331024e-007parameter cvg 1.878562e-008 )Eq 1 : Diagnostic summary report.Estimation sample is 1970. 1 - 2003. 1. (T = 34, n = 32).Log-Likelihood is 72.1462 (-2 LogL = -144.292).Prediction error variance is 0.0100779Summary statisticsLog_FI_fatStd.Error 0.10039Normality 0.39376H( 10) 0.50989r( 1) -0.028431r( 8) -0.15533DW 2.0137Q( 8, 6) 5.8640Rd^2 -0.088796Eq 1 : Estimated variances of disturbances.Component Log_FI_fat (q-ratio)Irr 0.0032008 ( 1.0000)Lvl 0.00000 ( 0.0000)Slp 0.0015332 ( 0.4790)From the estimated variances of disturbances, we see that the variancecorrespond<strong>in</strong>g to the level component (Lvl) is (almost) equal to zero, mean<strong>in</strong>gthat this component does vary over <strong>time</strong>, <strong>and</strong> that we may as well treat itdeterm<strong>in</strong>istically. Therefore, we repeat the <strong>analysis</strong> (steps 2 <strong>and</strong> 3) with adeterm<strong>in</strong>istic <strong>in</strong>stead of a stochastic level component (select Fixed level <strong>in</strong> theSTAMP Select components w<strong>in</strong>dow). This yields the follow<strong>in</strong>g output <strong>in</strong> theGiveW<strong>in</strong> results w<strong>in</strong>dow:Equation 2.Log_FI_fat = Trend + IrregularEstimation reportModel with 2 parameters ( 1 restrictions).Parameter estimation sample is 1970. 1 - 2003. 1. (T = 34).Log-likelihood kernel is 2.121946.Very strong convergence <strong>in</strong> 2 iterations.( likelihood cvg 1.21594e-013gradient cvg 1.811884e-008parameter cvg 4.042691e-008 )Eq 2 : Diagnostic summary report.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 2 3


Chapter 3Estimation sample is 1970. 1 - 2003. 1. (T = 34, n = 32).Log-Likelihood is 72.1462 (-2 LogL = -144.292).Prediction error variance is 0.0100779Summary statisticsLog_FI_fatStd.Error 0.10039Normality 0.39376H( 10) 0.50989r( 1) -0.028427r( 7) -0.059707DW 2.0137Q( 7, 6) 4.7703Rd^2 -0.088796Eq 2 : Estimated variances of disturbances.Component Log_FI_fat (q-ratio)Irr 0.0032008 ( 1.0000)Slp 0.0015331 ( 0.4790) Check the results (sample period, log-likelihood, estimated variance ofdisturbances).The estimation report tells that there was very strong convergence <strong>in</strong> 2 iterations. The estimatedvariances of both the irregular <strong>and</strong> the slope component are unequal to zero. The diagnosticsummary report shows that the value of the log-likelihood function is 72.1, which is larger than <strong>in</strong>the determ<strong>in</strong>istic l<strong>in</strong>ear trend case (55.7). The prediction error variance (0.0101) is clearlysmaller than for the determ<strong>in</strong>istic level model (0.0206). These results <strong>in</strong>dicate that thedeterm<strong>in</strong>istic level <strong>and</strong> stochastic slope model, also known as the “smooth trend model”, isperform<strong>in</strong>g better than the fully determ<strong>in</strong>istic model. The STAMP output results concern<strong>in</strong>g the summary statistics can aga<strong>in</strong> becondensed <strong>in</strong>to the follow<strong>in</strong>g table (see also Table 3.6.4 <strong>in</strong> Section 3.6.2 ofthe Methodology report):Statistic Value Critical 5% Assumption satisfiedvalue aIndependence Q(7,6) 4.77 12.59 +r(1) -0.0284 0.34 +r(7) -0.0597 0.34 +Homoscedasticity H(10) 0.510 3.72 +Normality N 0.394 5.99 +Table 3.6.5: Diagnostic test results for the determ<strong>in</strong>istic level <strong>and</strong> stochastic slopemodel applied to the log of F<strong>in</strong>nish fatalities. a Probability that statistic exceeds criticalvalue is 0.05.When we compare the results <strong>in</strong> Table 3.6.5 with those <strong>in</strong> Table 3.6.4 of the manual, we seethat also with respect to the diagnostic tests the determ<strong>in</strong>istic level <strong>and</strong> stochastic slope modelis better than the determ<strong>in</strong>istic l<strong>in</strong>ear trend model. The stochastic model satisfies all modelassumptions, whereas the detem<strong>in</strong>istic model did not satisfy the most important assumption, i.e.the assumption of <strong>in</strong>dependence. In the STAMP w<strong>in</strong>dow choose <strong>in</strong> the menu. Select Additional output, Get steady state, Anti-log <strong>analysis</strong>, <strong>and</strong> State <strong>and</strong>regression output. Click OK.


3.6 State space modelsThe GiveW<strong>in</strong> Results w<strong>in</strong>dow will display the follow<strong>in</strong>g additional results:Eq 2 : Estimated st<strong>and</strong>ard deviations of disturbances.Component Log_FI_fat (q-ratio)Irr 0.056576 ( 1.0000)Slp 0.039155 ( 0.6921)Eq 2 : Estimated coefficients of f<strong>in</strong>al state vector.Variable Coefficient R.m.s.e. t-valueLvl 5.9689 0.047371 126 [ 0.0000]Slp -0.035603 0.053297 -0.66802 [ 0.5089]Anti-log trend <strong>analysis</strong>Trend value at end of period is 391.056.Growth rate at end of period is -0.0356035 ( -3.56035 % per “year”).From the estimated coefficients of the f<strong>in</strong>al state vector <strong>and</strong> their t-values, we can conclude that,<strong>in</strong> the f<strong>in</strong>al state, the level component is significantly different from zero whereas the negativeslope component is not.The trend value at the end of the period as presented by the anti-log trend <strong>analysis</strong> (391) islarger than the correspond<strong>in</strong>g value <strong>in</strong> the determ<strong>in</strong>istic model (374).Step 4: Graphics of model components In the STAMP w<strong>in</strong>dow choose menu .Select Trend, Slope, Irregular <strong>and</strong> Smoothed. Click OK.The STAMP Graphics w<strong>in</strong>dow appears with graphs of the observed logtransformed<strong>time</strong> <strong>series</strong> <strong>and</strong> the modelled trend, slope, <strong>and</strong> irregular (see Figure3.6.8).P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 2 5


Chapter 37.0 Log_FI_fat Trend_Log_FI_fat6.56.01970 1975 1980 1985 1990 1995 2000 20050.05 Slope_Log_FI_fat0.00−0.05−0.100.100.051970 1975 1980 1985 1990 1995 2000 2005Irr_Log_FI_fat0.00−0.051970 1975 1980 1985 1990 1995 2000 2005Figure 3.6.8: Observed log-transformed <strong>time</strong> <strong>series</strong> <strong>and</strong> the trend of the determ<strong>in</strong>isticlevel stochastic slope model (top graph), slope component (middle graph), <strong>and</strong>irregular component (bottom graph) for the log of F<strong>in</strong>nish fatalities.In the middle graph of Figure 3.6.8, we see that a negative slope component corresponds to adecreas<strong>in</strong>g trend, whereas a positive slope component corresponds to an <strong>in</strong>creas<strong>in</strong>g trend. Theslope is negative <strong>in</strong> 1971-1980, (almost) zero <strong>in</strong> 1981-1983, positive <strong>in</strong> 1984-1988, negative <strong>in</strong>1989-1996, <strong>and</strong> (almost) negative or slightly negative <strong>in</strong> 1997-2003. Note that the negativeslope component <strong>in</strong> the f<strong>in</strong>al state was found to be <strong>in</strong>significantly different from zero (see step3). Use the menu or to save these graphs, e.g. as anEncapsulated Postcript file (*.eps). M<strong>in</strong>imize the STAMP Graphics w<strong>in</strong>dow.Step 5: Test of model residuals Go back to the STAMP w<strong>in</strong>dow <strong>and</strong> choose . In the Residual graphics w<strong>in</strong>dow, select Residuals, Correlogram, with 8,Density, Histogram, Normal, QQ plot, <strong>and</strong> Write diagnostic tests. Click OK.Figure 3.6.9 shows the st<strong>and</strong>ardized residuals <strong>and</strong> their correlogram, densityfunction, <strong>and</strong> normal probability plot as depicted by the STAMP graphicsw<strong>in</strong>dow <strong>in</strong> GiveW<strong>in</strong>.


3.6 State space models2Residual Log_FI_fat1.0CorrelogramResidual Log_FI_fat10.500.0−1−0.5−21970 1980 1990 2000Density0.4 N(s=0.997)0.30.20.10 5 10QQ plot2normal10−1−2−3 −2 −1 0 1 2 3−1 0 1 2Figure 3.6.9: Residuals <strong>and</strong> residual tests for the determ<strong>in</strong>istic level stochastic slopemodel applied to the log of F<strong>in</strong>nish fatalities.The top left graph of Figure 3.6.9 shows that 1 out of the 34 residuals, i.e. 3%, lies outside the95% confidence <strong>in</strong>terval; this is acceptable s<strong>in</strong>ce we would expect about 1 out of 20 to falloutside this range. From the top right graph, we learn that for none of the 38 lags considered theautocorrelation is outside the 95% confidence <strong>in</strong>terval, which is def<strong>in</strong>ed by the boundaries -2/√T=-0.34 <strong>and</strong> +2/√T=0.34. The bottom graphs show that the assumption of normality of theresiduals is quite well satisfied, which confirms the normality test results <strong>in</strong> Table 3.6.5.Use the menu or to save these graphs <strong>and</strong> m<strong>in</strong>imizethe STAMP Graphics w<strong>in</strong>dow.In the Results w<strong>in</strong>dow of GiveW<strong>in</strong>, the follow<strong>in</strong>g residual test results can befound (only part of the results are pr<strong>in</strong>ted below):Goodness-of-fit results for Residual Log_FI_fatInformation criterion of Akaike AIC -4.420941... of Schwartz (Bayes) BIC -4.286262Serial correlation statistics for Residual Log_FI_fat.Lag dF SerCorr BoxLjung ProbChi2(dF)1 0 -0.02842 0 0.06313 1 0.1610 1.1450 [ 0.2846]4 2 -0.0937 1.4865 [ 0.4756]5 3 -0.2734 4.4980 [ 0.2125]6 4 -0.0529 4.6151 [ 0.3291]7 5 -0.0597 4.7703 [ 0.4446]8 6 -0.1553 5.8640 [ 0.4386]P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 2 7


Chapter 3The goodness-of-fit is clearly better than <strong>in</strong> the fully determ<strong>in</strong>istic case: the AIC is smaller (-4.42<strong>in</strong>stead of -3.77) as well as the BIC (-4.29 <strong>in</strong>stead of -3.68).For all lags considered, the Box-Ljung test <strong>in</strong>dicates that the most important model assumption,i.e. <strong>in</strong>dependence, is satisfied, as we already noted on the basis of Figure 3.6.9 <strong>and</strong> <strong>in</strong> Table3.6.5.Step 6: Test of auxiliary residuals Go to the STAMP w<strong>in</strong>dow aga<strong>in</strong> <strong>and</strong> choose . In the Auxiliary residuals graphics w<strong>in</strong>dow select Irregular, Level residual,Index plot, Density, Histogram, Normal, QQ plot, Write normality tests, <strong>and</strong>Write values exceed<strong>in</strong>g (3.5). Click OK. Use the menu or to save these graphs.The STAMP graphics w<strong>in</strong>dow <strong>in</strong> GiveW<strong>in</strong> displays the auxiliary residuals of theirregular <strong>and</strong> of the level component <strong>and</strong> their density function <strong>and</strong> normalprobability plot: see Figure 3.6.10.210−1−220IrrRes Log_FI_fat1970QQ plot1980 1990 2000normal0.40.30.20.110−1DensityN(s=1.03)−4 −3 −2 −1 0 1 2 3SlpRes Log_FI_fat−2−20.4DensityN(s=0.988)−1 0 1 221970QQ plot1980 1990 2000normal0.20−2−3 −2 −1 0 1 2−1 0 1Figure 3.6.10: Auxiliary residuals <strong>and</strong> correspond<strong>in</strong>g tests for the determ<strong>in</strong>istic levelstochastic slope model applied to the log of F<strong>in</strong>nish fatalities.The follow<strong>in</strong>g output describes the auxiliary residual test results for normality ascan be found <strong>in</strong> the Results w<strong>in</strong>dow of GiveW<strong>in</strong>.


3.6 State space modelsNormality test for IrrRes Log_FI_fatSample Size 34Mean -0.034110Std.Devn. 1.028845Skewness -0.165816Excess Kurtosis -1.021443M<strong>in</strong>imum -2.007568Maximum 1.881536Skewness Chi^2(1) 0.15581 [0.6930]Kurtosis Chi^2(1) 1.4781 [0.2241]Normal-BS Chi^2(2) 1.6339 [0.4418]Normal-DH Chi^2(2) 2.0182 [0.3645]Normality test for IrrRes Log_FI_fatSample Size 34Mean -0.034110Std.Devn. 1.028845Skewness -0.165816Excess Kurtosis -1.021443M<strong>in</strong>imum -2.007568Maximum 1.881536Skewness Chi^2(1) 0.15581 [0.6930]Kurtosis Chi^2(1) 1.4781 [0.2241]Normal-BS Chi^2(2) 1.6339 [0.4418]Normal-DH Chi^2(2) 2.0182 [0.3645]Figure 3.6.10 <strong>and</strong> the auxiliary residual tests demonstrate that the auxiliary residuals of both theirregular <strong>and</strong> the slope component satisfy the assumption of normality. Note that none of theauxiliary residuals of the irregular (top left graph <strong>in</strong> Figure 3.6.10) is larger than 2 or smaller than-2, whereas only one of the auxiliary residuals of the slope component (middle left graph)slightly exceeds these bounds. So, there is no <strong>in</strong>dication of outlier observations or structuralbreaks <strong>in</strong> the slope component.Step 7: Conclusion of <strong>analysis</strong>The residuals obta<strong>in</strong>ed with the <strong>analysis</strong> of the log of the annual F<strong>in</strong>nishfatalities from 1970 to 2003 with the determ<strong>in</strong>istic level <strong>and</strong> stochastic slopemodel (or: smooth trend model) satisfy all the model assumptions of<strong>in</strong>dependence, homoscedasticity, <strong>and</strong> normality. Therefore, it is an appropriatemodel for describ<strong>in</strong>g this <strong>series</strong>.Step 8: Forecast<strong>in</strong>gS<strong>in</strong>ce the determ<strong>in</strong>istic level stochastic slope model provides an appropriatedescription of the log of the F<strong>in</strong>nish fatalities <strong>series</strong>, as a f<strong>in</strong>al step <strong>in</strong> the<strong>analysis</strong> we will compute seven-year forecasts for this <strong>series</strong>. By perform<strong>in</strong>g ananti-log <strong>analysis</strong> the forecasts will be re-expressed <strong>in</strong> terms of the orig<strong>in</strong>al countdata. The forecasts are made accord<strong>in</strong>g to the <strong>in</strong>structions <strong>in</strong> step 8 of the<strong>analysis</strong> of the local level model (Section 3.6.2.2). Go to the STAMP w<strong>in</strong>dow aga<strong>in</strong> <strong>and</strong> choose . In the Forecast<strong>in</strong>g w<strong>in</strong>dow select 7 as the number of forecasts, Trend,Modified anti-log <strong>analysis</strong>, <strong>and</strong> Write forecasts Y. Click OK.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 2 9


Chapter 3The STAMP graphics w<strong>in</strong>dow <strong>in</strong> GiveW<strong>in</strong> displays the orig<strong>in</strong>al observationsextended with the seven-years forecasts with 70% confidence <strong>in</strong>terval (i.e., plus<strong>and</strong> m<strong>in</strong>us one estimated st<strong>and</strong>ard deviation) at the top, <strong>and</strong> the orig<strong>in</strong>alobservations <strong>and</strong> the extrapolated trend at the bottom of Figure 3.6.11.E_F−Log_FI_fatForecast6004002001985 1990 1995 2000 2005 2010700E_F−Log_FI_fatF−Trend_ME_Log_FI_fat6005004001985 1990 1995 2000 2005 2010Figure 3.6.11: Anti-logged seven-year forecasts (2004-2010) of the determ<strong>in</strong>istic levelstochastic slope model applied to the log of annual F<strong>in</strong>nish fatalities, 1970-2003.The bottom figure illustrates that a l<strong>in</strong>ear trend model always yields forecasts by extend<strong>in</strong>g thelast value of the trend (i.e., level plus slope) <strong>in</strong> the <strong>series</strong>. This is <strong>in</strong> complete agreement with thefact that we are deal<strong>in</strong>g with a l<strong>in</strong>ear trend model. Use the menu or to save these graphs, e.g. as anEncapsulated Postcript file (*.eps).In the Results w<strong>in</strong>dow of GiveW<strong>in</strong> the forecasts for the orig<strong>in</strong>al observed <strong>time</strong><strong>series</strong> have been added:


3.6 State space modelsEq 1 : Forecasts for E_F-Log_FI_fat.Anti-logPeriod Forecast R.m.s.e. - Rmse + Rmse2004. 1 377.38 41.142 336.24 418.522005. 1 364.18 59.896 304.28 424.082006. 1 351.44 84.011 267.43 435.452007. 1 339.15 112.49 226.66 451.642008. 1 327.29 145.04 182.25 472.322009. 1 315.84 181.72 134.11 497.562010. 1 304.79 222.82 81.973 527.61The list of forecast results gives for each <strong>time</strong> po<strong>in</strong>t the forecast, its st<strong>and</strong>ard error, <strong>and</strong> thelower <strong>and</strong> upper bound of the 70% confidence <strong>in</strong>terval.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 3 1


Chapter 33.6.4 Local l<strong>in</strong>ear trend plus seasonal modelIn this section, the seasonal component will be added to the local l<strong>in</strong>ear trendmodel, so as to obta<strong>in</strong> the local l<strong>in</strong>ear trend plus seasonal model. The model willbe applied to the monthly number of drivers killed or seriously <strong>in</strong>jured (KSI) asobserved <strong>in</strong> the UK for the period January 1969 through December 1984.Modell<strong>in</strong>g a seasonal component only makes sense when we are deal<strong>in</strong>g withseasonal (monthly, quarterly, weekly, etc.) <strong>time</strong> <strong>series</strong> data, <strong>and</strong> this type ofcomponent was therefore not considered <strong>in</strong> the <strong>analysis</strong> of the annual datadiscussed <strong>in</strong> Section 3.6.2 <strong>and</strong> 3.6.3. In addition to the theory on this model <strong>and</strong>the results of its application to the UK data, as presented <strong>in</strong> Section 3.6.3 of theMethodology report, this section expla<strong>in</strong>s how the model is built <strong>in</strong> STAMP <strong>and</strong>how the results can be <strong>in</strong>terpreted.This section presents a step-by-step description of the <strong>analysis</strong> of the UKdrivers KSI <strong>time</strong> <strong>series</strong> us<strong>in</strong>g a local l<strong>in</strong>ear trend plus seasonal model. Contraryto the setup of Sections 3.6.2 <strong>and</strong> 3.6.3, <strong>in</strong> this section all components will beassumed stochastic from the start. As such, this <strong>analysis</strong> example follows therecommended way of analys<strong>in</strong>g <strong>time</strong> <strong>series</strong> data by state space techniques(see Section 3.6.7).Step 1: Start of <strong>analysis</strong> <strong>and</strong> data loadFirst, we open GiveW<strong>in</strong>, load the data, <strong>and</strong> start STAMP. If GiveW<strong>in</strong> is not yet open, then start GiveW<strong>in</strong>2. If GiveW<strong>in</strong> is already open from a previous exercise, then close all results,data, <strong>and</strong> graphics w<strong>in</strong>dows <strong>in</strong> GiveW<strong>in</strong> by click<strong>in</strong>g on the icon with the cross<strong>in</strong> the top right corner of each w<strong>in</strong>dow. Use the menu to open the file “UKdriversKSI.<strong>in</strong>7”.The data file is loaded <strong>and</strong> displayed <strong>in</strong> a m<strong>in</strong>imized w<strong>in</strong>dow at the bottom of theGiveW<strong>in</strong> ma<strong>in</strong> w<strong>in</strong>dow. To view the data file: Click on the icon with the two overlapp<strong>in</strong>g boxes.The data file consists of four variables: the monthly number of drivers KSI <strong>in</strong> theUK for the months January 1969 through December 1984 (UKdriversKSI), thepetrol price <strong>in</strong> the UK <strong>in</strong> the same months (PetrolPrice), <strong>and</strong> the logarithm of thesame <strong>time</strong> <strong>series</strong> (Log_UKdriversKSI <strong>and</strong> Log_PetrolPrice). The variablePetrolPrice will be used <strong>in</strong> Section 3.6.6. M<strong>in</strong>imize the data file w<strong>in</strong>dow aga<strong>in</strong> <strong>and</strong> use the menu to start the STAMP program.Step 2: Model FormulationIn this step, we def<strong>in</strong>e the local l<strong>in</strong>ear trend plus seasonal model:


3.6 State space models In STAMP, choose the menu . In the Data selection w<strong>in</strong>dow select the variable Log_UKdriversKSI. Then click OK. In the Select components w<strong>in</strong>dow, choose a Stochastic Level, Stochasticslope, Irregular, <strong>and</strong> Dummy seasonal: Then click on the F<strong>in</strong>ish button.Step 3: Model estimation <strong>and</strong> <strong>in</strong>spection of results In the Estimate Model w<strong>in</strong>dow, select Maximum Likelihood. Click OK.The model is estimated, <strong>and</strong> the follow<strong>in</strong>g output appears <strong>in</strong> the GiveW<strong>in</strong>Results w<strong>in</strong>dow:Equation 4.Log_UKdriversKSI = Trend + Dummy seasonal + IrregularEstimation reportModel with 4 parameters ( 3 restrictions).Parameter estimation sample is 1969. 1 - 1984.12. (T = 192).Log-likelihood kernel is 2.279365.Very strong convergence <strong>in</strong> 11 iterations.( likelihood cvg 7.253531e-013gradient cvg 7.576162e-008parameter cvg 4.127312e-006 )Eq 4 : Diagnostic summary report.Estimation sample is 1969. 1 - 1984.12. (T = 192, n = 179).Log-Likelihood is 437.638 (-2 LogL = -875.276).P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 3 3


Chapter 3Prediction error variance is 0.00556772Summary statisticsLog_UKdriverStd.Error 0.074617Normality 3.7108H( 59) 1.0905r( 1) 0.026288r(12) 0.029382DW 1.9311Q(12, 9) 12.378Rs^2 0.23046Eq 4 : Estimated variances of disturbances.Component Log_UKdriversKSI (q-ratio)Irr 0.0034678 ( 1.0000)Lvl 0.0010009 ( 0.2886)Slp 0.00000 ( 0.0000)Sea 0.00000 ( 0.0000)Inspect<strong>in</strong>g the estimated variances of the disturbances, we see that thevariances correspond<strong>in</strong>g to the slope component (Slp) <strong>and</strong> the seasonalcomponent (Sea) are (almost) equal to zero. Therefore, we repeat the <strong>analysis</strong>(steps 2 <strong>and</strong> 3) with a determ<strong>in</strong>istic slope <strong>and</strong> seasonal component (bychoos<strong>in</strong>g Stochastic level, Fixed slope, <strong>and</strong> Fixed seasonal <strong>in</strong> the STAMPSelect components w<strong>in</strong>dow). This yields the follow<strong>in</strong>g output <strong>in</strong> the GiveW<strong>in</strong>results w<strong>in</strong>dow:Equation 5.Log_UKdriversKSI = Trend + Fixed seasonal + IrregularEstimation reportModel with 2 parameters ( 1 restrictions).Parameter estimation sample is 1969. 1 - 1984.12. (T = 192).Log-likelihood kernel is 2.279365.Very strong convergence <strong>in</strong> 3 iterations.( likelihood cvg 1.23912e-013gradient cvg 3.566036e-008parameter cvg 3.127122e-007 )Eq 5 : Diagnostic summary report.Estimation sample is 1969. 1 - 1984.12. (T = 192, n = 190).Log-Likelihood is 437.638 (-2 LogL = -875.276).Prediction error variance is 0.00550388Summary statisticsLog_UKdriverStd.Error 0.074188Normality 2.9938H( 63) 1.0297r( 1) 0.029490r(12) -0.0070854DW 1.9343Q(12,11) 12.196Rs^2 0.23928Eq 5 : Estimated variances of disturbances.Component Log_UKdriversKSI (q-ratio)Irr 0.0034678 ( 1.0000)Lvl 0.0010009 ( 0.2886)


3.6 State space models Check the results (sample period, log-likelihood, estimated variance ofdisturbances).The estimation report tells that there was very strong convergence <strong>in</strong> three iterations. Thediagnostic summary report shows that the number of observations is 192, that the value of thelog-likelihood function at convergence is 437, <strong>and</strong> that the prediction error variance is 0.00550.The estimated variance of the level disturbances is unequal to zero. In the STAMP w<strong>in</strong>dow choose <strong>in</strong> the menu. Select Additional output, Get steady state, Anti-log <strong>analysis</strong>, <strong>and</strong> State <strong>and</strong>regression output. Click OK.The GiveW<strong>in</strong> Results w<strong>in</strong>dow will display the follow<strong>in</strong>g additional results:Eq 5 : Estimated st<strong>and</strong>ard deviations of disturbances.Component Log_UKdriversKSI (q-ratio)Irr 0.058888 ( 1.0000)Lvl 0.031637 ( 0.5372)Eq 5 : Estimated coefficients of f<strong>in</strong>al state vector.Variable Coefficient R.m.s.e. t-valueLvl 7.2404 0.038792 186.65 [ 0.0000]Slp -0.00090532 0.0023076 -0.39233 [ 0.6953]Sea_ 1 0.017176 0.016254 1.0567 [ 0.2920]Sea_ 2 -0.10933 0.016219 -6.7409 [ 0.0000]Sea_ 3 -0.070091 0.016192 -4.3288 [ 0.0000]Sea_ 4 -0.14686 0.016171 -9.0817 [ 0.0000]Sea_ 5 -0.055472 0.016157 -3.4333 [ 0.0007]Sea_ 6 -0.092507 0.016150 -5.7279 [ 0.0000]Sea_ 7 -0.043175 0.016150 -2.6734 [ 0.0082]Sea_ 8 -0.032024 0.016157 -1.982 [ 0.0489]Sea_ 9 0.0058909 0.016171 0.36429 [ 0.7160]Sea_10 0.086848 0.016192 5.3637 [ 0.0000]Sea_11 0.19221 0.016219 11.851 [ 0.0000]Anti-log trend <strong>analysis</strong>Trend value at end of period is 1394.63.Growth rate at end of period is -0.000905317 ( -1.08638 % per“year”).From the estimated coefficients of the f<strong>in</strong>al state vector <strong>and</strong> their t-values, we can conclude thatthe (constant) value of -0.00090532 for the slope component does not significantly deviate fromzero. The same applies to the values for the first <strong>and</strong> n<strong>in</strong>th estimates of the seasonalcomponent (correspond<strong>in</strong>g to the months of January <strong>and</strong> September).Because the determ<strong>in</strong>istic slope component is not significantly different fromzero, we drop the slope component from the model <strong>and</strong> repeat the <strong>analysis</strong>(steps 2 <strong>and</strong> 3) with a stochastic level determ<strong>in</strong>istic seasonal model (byselect<strong>in</strong>g Stochastic level <strong>and</strong> Fixed seasonal <strong>in</strong> the STAMP Select componentsw<strong>in</strong>dow). This yields the follow<strong>in</strong>g output <strong>in</strong> the GiveW<strong>in</strong> results w<strong>in</strong>dow:Equation 6.Log_UKdriversKSI = Level + Fixed seasonal + IrregularEstimation reportModel with 2 parameters ( 1 restrictions).P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 3 5


Chapter 3Parameter estimation sample is 1969. 1 - 1984.12. (T = 192).Log-likelihood kernel is 2.313251.Very strong convergence <strong>in</strong> 2 iterations.( likelihood cvg 1.343833e-015gradient cvg 2.082778e-008parameter cvg 3.620022e-009 )Eq 6 : Diagnostic summary report.Estimation sample is 1969. 1 - 1984.12. (T = 192, n = 191).Log-Likelihood is 444.144 (-2 LogL = -888.289).Prediction error variance is 0.00550316Summary statisticsLog_UKdriverStd.Error 0.074183Normality 3.2218H( 63) 1.0686r( 1) 0.041087r(12) -0.00017919DW 1.9155Q(12,11) 12.000Rs^2 0.23938Eq 6 : Estimated variances of disturbances.Component Log_UKdriversKSI (q-ratio)Irr 0.0035140 ( 1.0000)Lvl 0.00094564 ( 0.2691) Check the results (sample period, log-likelihood, estimated variance ofdisturbances).The estimation report tells that there was very strong convergence <strong>in</strong> two iterations. Theestimated variance of the level disturbances is unequal to zero, mean<strong>in</strong>g that the levelcomponent should be treated stochastically. The diagnostic summary report shows that thevalue of the log-likelihood function at convergence is 444, which is a somewhat larger valuethan <strong>in</strong> the stochastic level <strong>and</strong> determ<strong>in</strong>istic slope <strong>and</strong> seasonal model (437). In the STAMP w<strong>in</strong>dow choose <strong>in</strong> the menu. Select Additional output, Get steady state, Anti-log <strong>analysis</strong>, <strong>and</strong> State <strong>and</strong>regression output. Click OK.The GiveW<strong>in</strong> Results w<strong>in</strong>dow will display the follow<strong>in</strong>g additional results:Eq 6 : Estimated st<strong>and</strong>ard deviations of disturbances.Component Log_UKdriversKSI (q-ratio)Irr 0.059279 ( 1.0000)Lvl 0.030751 ( 0.5188)Eq 6 : Estimated coefficients of f<strong>in</strong>al state vector.Variable Coefficient R.m.s.e. t-valueLvl 7.2414 0.038351 188.82 [ 0.0000]Sea_ 1 0.017272 0.016225 1.0646 [ 0.2884]Sea_ 2 -0.10925 0.016192 -6.7474 [ 0.0000]Sea_ 3 -0.070030 0.016165 -4.3321 [ 0.0000]Sea_ 4 -0.14682 0.016146 -9.0933 [ 0.0000]Sea_ 5 -0.055446 0.016132 -3.4369 [ 0.0007]Sea_ 6 -0.092499 0.016126 -5.7361 [ 0.0000]Sea_ 7 -0.043184 0.016126 -2.678 [ 0.0081]Sea_ 8 -0.032050 0.016132 -1.9867 [ 0.0484]Sea_ 9 0.0058471 0.016146 0.36215 [ 0.7176]


3.6 State space modelsSea_10 0.086786 0.016165 5.3686 [ 0.0000]Sea_11 0.19213 0.016192 11.866 [ 0.0000]Anti-log trend <strong>analysis</strong>Trend value at end of period is 1396.04.The values for the first <strong>and</strong> n<strong>in</strong>th estimates of the seasonal component (correspond<strong>in</strong>g to themonths of January <strong>and</strong> September) do not significantly deviate from zero. At the end of theperiod, the trend value as presented by the anti-log trend <strong>analysis</strong> is 1396. The STAMP output results concern<strong>in</strong>g the summary statistics can becondensed <strong>in</strong>to the follow<strong>in</strong>g table (see also Table 3.6.6 <strong>in</strong> the Methodologyreport):Statistic Value Critical 5%value aAssumptionsatisfiedIndependence Q(12,11) 12.0 19.68 +r(1) 0.0411 0.14 +r(12) -0.00018 0.14 +Homoscedasticity H(63) 1.07 1.65 +Normality N 3.22 5.99 +Table 3.6.6: Diagnostic test results for the stochastic level <strong>and</strong> determ<strong>in</strong>istic seasonalmodel applied to the log of the number of UK drivers KSI. a Probability that statisticexceeds critical value is 0.05.From Table 3.6.6, we can conclude that the stochastic level <strong>and</strong> determ<strong>in</strong>isticseasonal model satisfies all model assumptions.Step 4: Graphics of model components In the STAMP w<strong>in</strong>dow choose menu .Select Trend, Seasonal, Irregular, <strong>and</strong> Smoothed. Click OK.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 3 7


Chapter 38.007.757.507.257.000.2Log_UKdriversKSI Trend_Log_UKdriversKSI1970 1975 1980 1985Seas_Log_UKdriversKSI0.01970 1975 1980 19850.1Irr_Log_UKdriversKSI0.0−0.11970 1975 1980 1985Figure 3.6.12: Observed log-transformed <strong>time</strong> <strong>series</strong> <strong>and</strong> the trend of the stochasticlevel <strong>and</strong> determ<strong>in</strong>istic seasonal model (top graph), seasonal component (middlegraph), <strong>and</strong> irregular component (bottom graph) for the log of UK drivers KSI.The STAMP Graphics w<strong>in</strong>dow appears with graphs of the observed logtransformed<strong>time</strong> <strong>series</strong> <strong>and</strong> the modelled trend, seasonal, <strong>and</strong> irregular (seeFigure 3.6.12).The seasonal component is close to zero for the months January <strong>and</strong> September, is negativefor February to August, <strong>and</strong> is positive for October to December. Use the menu or to save these graphs, e.g. as anEncapsulated Postcript file (*.eps). M<strong>in</strong>imize the STAMP Graphics w<strong>in</strong>dow.Step 5: Test of model residuals Go back to the STAMP w<strong>in</strong>dow <strong>and</strong> choose . In the Residual graphics w<strong>in</strong>dow select Residuals, Correlogram, with 14,Density, Histogram, Normal, QQ plot, <strong>and</strong> Write diagnostic tests. Click OK.Figure 3.6.13 shows the st<strong>and</strong>ardized residuals <strong>and</strong> their correlogram, densityfunction, <strong>and</strong> normal probability plot as depicted by the STAMP graphicsw<strong>in</strong>dow <strong>in</strong> GiveW<strong>in</strong>.


3.6 State space models2 Residual Log_UKdriversKSI1.0CorrelogramResidual Log_UKdriversKSI0−20.50.0−0.50.50.41970 1975 1980 1985DensityN(s=0.97)0QQ plot5 10 15normal20.30.20.10−2−4 −2 0 2−2 −1 0 1 2Figure 3.6.13: Residuals <strong>and</strong> residual tests for the stochastic level determ<strong>in</strong>isticseasonal model applied to the log of UK drivers KSI.The top left graph of Figure 3.6.13 shows that only five out of the 192 residuals exceed the 95%confidence limits. Still, the residual correspond<strong>in</strong>g to February 1983 is very extreme: -3.73.Under the assumption of a normal distribution with zero mean <strong>and</strong> unit st<strong>and</strong>ard deviation, theprobability of a value smaller than -3.73 is 0.01%! From the top right graph, we learn that for 1out of the 14 lags considered the autocorrelation is (just) outside the 95% confidence <strong>in</strong>terval,which is def<strong>in</strong>ed by the boundaries -2/√T=-0.14 <strong>and</strong> +2/√T=0.14. The bottom graphs show thatthe assumption of normality of the residuals is satisfied, which confirms the normality test resultdisplayed <strong>in</strong> Table 3.6.6. Use the menu or to save the graphs <strong>and</strong> m<strong>in</strong>imize theSTAMP Graphics w<strong>in</strong>dow.In the Results w<strong>in</strong>dow of GiveW<strong>in</strong>, the follow<strong>in</strong>g residual test results can befound (only part of the results are pr<strong>in</strong>ted below):P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 3 9


Chapter 3Goodness-of-fit results for Residual Log_UKdriversKSIInformation criterion of Akaike AIC -5.067016... of Schwartz (Bayes) BIC -4.846457Serial correlation statistics for Residual Log_UKdriversKSI.Lag dF SerCorr BoxLjung ProbChi2(dF)1 0 0.04112 0 0.03323 1 -0.0618 1.2906 [ 0.2559]4 2 -0.1199 4.1225 [ 0.1273]5 3 0.0452 4.5278 [ 0.2098]6 4 -0.0812 5.8412 [ 0.2113]7 5 -0.0654 6.6993 [ 0.2440]8 6 -0.1413 10.7213 [ 0.0974]9 7 0.0105 10.7438 [ 0.1502]10 8 -0.0667 11.6512 [ 0.1675]11 9 0.0413 12.0002 [ 0.2133]12 10 -0.0002 12.0002 [ 0.2850]13 11 0.1043 14.2549 [ 0.2192]14 12 0.0364 14.5314 [ 0.2681]The value of the AIC is -5.07. Furthermore, the BoxLjung test statistic for theautocorrelations of the first 14 lags shows that the residuals satisfy theassumption of <strong>in</strong>dependence.Step 6: Test of auxiliary residuals Go to the STAMP w<strong>in</strong>dow aga<strong>in</strong> <strong>and</strong> choose . In the Auxiliary residuals graphics w<strong>in</strong>dow select Irregular, Level residual,Index plot, Density, Histogram, Normal, QQ plot, Write normality tests, <strong>and</strong>Write values exceed<strong>in</strong>g (3.5). Click OK.The STAMP graphics w<strong>in</strong>dow <strong>in</strong> GiveW<strong>in</strong> displays the auxiliary residuals of theirregular <strong>and</strong> of the level component <strong>and</strong> their density function <strong>and</strong> normalprobability plot: see Figure 3.6.14. The output below the figure describes theauxiliary residual test results as can be found <strong>in</strong> the Results w<strong>in</strong>dow of GiveW<strong>in</strong>.


3.6 State space models2.5 IrrRes Log_UKdriversKSI0.4DensityN(s=0.999)0.00.2−2.52.50.01970 1975 1980 1985 −4 −3 −2 −1 0 1 2 3 4QQ plot2.5normalLvlRes Log_UKdriversKSI0.0−2.50.4−2 −1 0 1 2DensityN(s=0.993)−2.52.51970QQ plot1975 1980 1985normal0.00.2−2.5−5 −4 −3 −2 −1 0 1 2 3−2 −1 0 1 2Figure 3.6.14: Auxiliary residuals <strong>and</strong> correspond<strong>in</strong>g tests for the stochastic leveldeterm<strong>in</strong>istic seasonal model applied to the log of UK drivers KSI.Normality test for IrrRes Log_UKdriversKSISample Size 192Mean 0.000027Std.Devn. 0.998613Skewness -0.190959Excess Kurtosis -0.163838M<strong>in</strong>imum -2.884350Maximum 2.672501Skewness Chi^2(1) 1.1669 [0.2800]Kurtosis Chi^2(1) 0.21474 [0.6431]Normal-BS Chi^2(2) 1.3816 [0.5012]Normal-DH Chi^2(2) 1.4114 [0.4938]Normality test for LvlRes Log_UKdriversKSISample Size 192Mean -0.059080Std.Devn. 0.992958Skewness -0.669142Excess Kurtosis 1.084877M<strong>in</strong>imum -3.789216Maximum 2.172144Skewness Chi^2(1) 14.328 [0.0002]Kurtosis Chi^2(1) 9.4157 [0.0022]Normal-BS Chi^2(2) 23.744 [0.0000]Normal-DH Chi^2(2) 13.024 [0.0015]Eq 6 : Large values <strong>in</strong> LvlRes Log_UKdriversKSI.Period Value1983. 2 -3.7892 [ 0.0001]Both Figure 3.6.14 <strong>and</strong> the auxiliary residual tests demonstrate that the auxiliary residuals of theirregular component satisfy the assumption of normality, whereas those of the level componentdo not satisfy this assumption. The latter fact means that we must be cautious with theP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 4 1


Chapter 3<strong>in</strong>terpretation of the test statistics of <strong>in</strong>dividual auxiliary residuals <strong>and</strong> it underl<strong>in</strong>es theimportance of look<strong>in</strong>g at the graphical output (Koopman et al, 2000).The auxiliary residual of the level component correspond<strong>in</strong>g to February 1983 is extremely large(-3.79), which is an <strong>in</strong>dication of a structural break <strong>in</strong> the level component. Use the menu or to save the graphs <strong>and</strong> m<strong>in</strong>imize theSTAMP Graphics w<strong>in</strong>dow.Step 7: Conclusion of <strong>analysis</strong>The residuals obta<strong>in</strong>ed with the <strong>analysis</strong> of the log of the monthly UK driversKSI from January 1969 to December 1984 with the stochastic level <strong>and</strong>determ<strong>in</strong>istic seasonal model satisfy all the model assumptions of<strong>in</strong>dependence, homoscedasticity, <strong>and</strong> normality. However, the auxiliary residualof the level component for February 1983 is found to be extremely large (-3.79),which is an <strong>in</strong>dication of a structural break <strong>in</strong> this component. Furthermore, theauxiliary residuals of the level component do not satisfy the assumption ofnormality. For these reasons, we expect that the model can be improved byadd<strong>in</strong>g an <strong>in</strong>tervention variable to the local level <strong>and</strong> determ<strong>in</strong>istic seasonalmodel. In fact, there is a very good reason why February 1983 was a specialmonth for road safety <strong>in</strong> the UK. It was <strong>in</strong> that month that the seatbelt law was<strong>in</strong>troduced, requir<strong>in</strong>g front seat passengers <strong>in</strong> cars to wear a seatbelt. In Section3.6.5 we will therefore <strong>in</strong>vestigate the effect of <strong>modell<strong>in</strong>g</strong> this event by add<strong>in</strong>g alevel shift <strong>in</strong>tervention variable to the the local level <strong>and</strong> determ<strong>in</strong>istic seasonalmodel.Step 8: Forecast<strong>in</strong>gBecause the stochastic level <strong>and</strong> determ<strong>in</strong>istic seasonal model for describ<strong>in</strong>gthe log of the monthly number of UK drivers KSI from January 1969 toDecember 1984 can still be improved, as will be discussed <strong>in</strong> the follow<strong>in</strong>gsections, the issue of obta<strong>in</strong><strong>in</strong>g forecasts from this <strong>series</strong> is postponed untilSection 3.6.6.


3.6 State space models3.6.5 Intervention variablesIn the previous section, we discussed that it could be worthwhile to add an<strong>in</strong>tervention variable to the stochastic level determ<strong>in</strong>istic <strong>and</strong> seasonal model ofthe (log of the) number of UK drivers KSI for the period January 1969 throughDecember 1984. The reason for this recommendation was the extremely largevalue of the February 1983 auxiliary residual of the level component. As stated<strong>in</strong> Section 3.6.4, this po<strong>in</strong>t <strong>in</strong> <strong>time</strong> co<strong>in</strong>cides with the <strong>in</strong>troduction of the seat beltlaw <strong>in</strong> the UK.Because the extremely large value concerns the auxiliary residual of the levelcomponent, <strong>in</strong> this section we will add a level shift variable to the stochasticlevel <strong>and</strong> determ<strong>in</strong>istic seasonal model discussed <strong>in</strong> the previous section.Step 1: Start of <strong>analysis</strong> <strong>and</strong> data loadIn this first step of the <strong>analysis</strong>, we open GiveW<strong>in</strong>, load the data, <strong>and</strong> startSTAMP, if needed. If GiveW<strong>in</strong> is not yet open, then start GiveW<strong>in</strong>2. If GiveW<strong>in</strong> is already open but with another dataset than the UK drivers KSI,then close all results, data, <strong>and</strong> graphics w<strong>in</strong>dows <strong>in</strong> GiveW<strong>in</strong> by click<strong>in</strong>g onthe icon with the cross <strong>in</strong> the top right corner of each w<strong>in</strong>dow. Use the menu to open the file “UKdriversKSI.<strong>in</strong>7”. If GiveW<strong>in</strong> is already open <strong>and</strong> the UK drivers KSI dataset is already loaded,then proceed to the next <strong>in</strong>struction.The data file is loaded <strong>and</strong> displayed <strong>in</strong> a m<strong>in</strong>imized w<strong>in</strong>dow at the bottom of theGiveW<strong>in</strong> ma<strong>in</strong> w<strong>in</strong>dow. To view the data file: Click on the icon with the two overlapp<strong>in</strong>g boxes. M<strong>in</strong>imize the data file w<strong>in</strong>dow aga<strong>in</strong> <strong>and</strong> use the menu to start the STAMP program.Step 2: Model FormulationIn this step, we will add a level shift <strong>in</strong> February 1983 to the stochastic leveldeterm<strong>in</strong>istic seasonal model from the previous section. First, formulate themodel: In STAMP, choose the menu . In the Data selection w<strong>in</strong>dow select the variable Log_UKdriversKSI. Then click OK. In the Select components w<strong>in</strong>dow, choose a Stochastic Level, No slope,Irregular, <strong>and</strong> Fixed seasonal.Next, we add the level shift: Click Next. Select the period <strong>in</strong> sample: year is 1983 <strong>and</strong> period is 2. Click Level.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 4 3


Chapter 3The text "Lvl 1983.2" appears <strong>in</strong> the list of <strong>in</strong>terventions. Then click on the F<strong>in</strong>ish button.Step 3: Model estimation <strong>and</strong> <strong>in</strong>spection of results In the Estimate Model w<strong>in</strong>dow, select Maximum Likelihood. Click OK.The model is estimated, <strong>and</strong> the follow<strong>in</strong>g output appears <strong>in</strong> the GiveW<strong>in</strong>Results w<strong>in</strong>dow:Equation 1.Log_UKdriversKSI = Level + Fixed seasonal + Interv + IrregularEstimation reportModel with 2 parameters ( 1 restrictions).Parameter estimation sample is 1969. 1 - 1984.12. (T = 192).Log-likelihood kernel is 2.339682.Very strong convergence <strong>in</strong> 2 iterations.( likelihood cvg 1.328653e-015gradient cvg 3.952394e-009parameter cvg 5.449141e-014 )Eq 1 : Diagnostic summary report.Estimation sample is 1969. 1 - 1984.12. (T = 192, n = 191).Log-Likelihood is 449.219 (-2 LogL = -898.438).Prediction error variance is 0.00501578Summary statisticsLog_UKdriverStd.Error 0.070822Normality 2.4014H( 63) 0.75510r( 1) 0.079936r(12) 0.058582DW 1.8396Q(12,11) 15.267Rs^2 0.30674Eq 1 : Estimated variances of disturbances.Component Log_UKdriversKSI (q-ratio)Irr 0.0037838 ( 1.0000)Lvl 0.00047358 ( 0.1252) Check the results (sample period, log-likelihood, estimated variance ofdisturbances) <strong>and</strong> compare them with the results from the <strong>analysis</strong> without<strong>in</strong>tervention <strong>in</strong> the previous section.The estimation report tells that there was very strong convergence <strong>in</strong> two iterations. Thediagnostic summary report shows that the addition of the <strong>in</strong>tervention has improved the fit of thestochastic level determ<strong>in</strong>istic seasonal model: the value of the log-likelihood function has<strong>in</strong>creased from 437 to 449 <strong>and</strong> the prediction error variance has decreased from 0.00550 to0.00502 (see the previous section).


3.6 State space models In the STAMP w<strong>in</strong>dow choose <strong>in</strong> the menu. Select Additional output, Get steady state, Anti-log <strong>analysis</strong>, <strong>and</strong> State <strong>and</strong>regression output. Click OK.The GiveW<strong>in</strong> Results w<strong>in</strong>dow will display the follow<strong>in</strong>g additional results:Eq 1 : Estimated st<strong>and</strong>ard deviations of disturbances.Component Log_UKdriversKSI (q-ratio)Irr 0.061513 ( 1.0000)Lvl 0.021762 ( 0.3538)Eq 1 : Estimated coefficients of f<strong>in</strong>al state vector.Variable Coefficient R.m.s.e. t-valueLvl 7.2380 0.033960 213.13 [ 0.0000]Sea_ 1 0.010335 0.015834 0.65272 [ 0.5147]Sea_ 2 -0.10244 0.015814 -6.4775 [ 0.0000]Sea_ 3 -0.064452 0.015770 -4.0869 [ 0.0001]Sea_ 4 -0.14248 0.015736 -9.0543 [ 0.0000]Sea_ 5 -0.052342 0.015711 -3.3315 [ 0.0010]Sea_ 6 -0.090631 0.015696 -5.7741 [ 0.0000]Sea_ 7 -0.042554 0.015691 -2.7119 [ 0.0073]Sea_ 8 -0.032656 0.015696 -2.0805 [ 0.0388]Sea_ 9 0.0040042 0.015711 0.25487 [ 0.7991]Sea_10 0.083707 0.015736 5.3196 [ 0.0000]Sea_11 0.18782 0.015770 11.91 [ 0.0000]Anti-log trend <strong>analysis</strong>Trend value at end of period is 1391.34.Eq 1 : Estimated coefficients of explanatory variables.Variable Coefficient R.m.s.e. t-valueLvl 1983. 2 -0.23981 0.053072 -4.5185 [ 0.0000]Just as <strong>in</strong> the model without <strong>in</strong>tervention, the parameter estimates for the first <strong>and</strong> the n<strong>in</strong>thmonth of the seasonal component do not deviate from zero <strong>in</strong> the f<strong>in</strong>al state. At the end of theperiod, the trend value as presented by the anti-log trend <strong>analysis</strong> is 1391, which is a somewhatlower value than <strong>in</strong> the model without <strong>in</strong>tervention (1396).New <strong>in</strong> the output are the estimated coefficients of the explanatory variables. In this case thereis only one explanatory variable which is the <strong>in</strong>tervention variable for February 1983. Theestimated coefficient for the level shift is -0.240, which corresponds to a 100*(e -0.23981 -1)=-21.3%change <strong>in</strong> the number of UK drivers KSI as a result of the <strong>in</strong>troduction of the seat belt law. Thecoefficient is shown to be significant, but to guarantee that the t-value is reliable, we must firstcheck whether the model assumptions are satisfied. The summary statistics <strong>in</strong> the output of STAMP can be used to set up thefollow<strong>in</strong>g table (see also Table 3.6.8 <strong>in</strong> the Methodology report):Statistic Value Critical 5% Assumption satisfiedvalue aIndependence Q(12,11) 15.3 19.68 +r(1) 0.0799 0.14 +r(12) 0.0586 0.14 +Homoscedasticity H(63) 0.755 1.65 +P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 4 5


Chapter 3Normality N 2.40 5.99 +Table 3.6.7: Diagnostic test results for the stochastic level determ<strong>in</strong>istic seasonalmodel with <strong>in</strong>tervention applied to the log UK drivers KSI. a Probability that statisticexceeds critical value is 0.05.Table 3.6.7 shows that the stochastic level <strong>and</strong> determ<strong>in</strong>istic seasonal modelwith an <strong>in</strong>tervention variable satisfies all model assumptions. This guaranteesthat the t-test for the regression coefficient of the <strong>in</strong>tervention variable (seeoutput above) is reliable.Step 4: Graphics of model components In the STAMP w<strong>in</strong>dow choose menu .Select Trend, Seasonal, Irregular, <strong>and</strong> Smoothed. Click OK.The STAMP Graphics w<strong>in</strong>dow appears with graphs of the observed logtransformed<strong>time</strong> <strong>series</strong> <strong>and</strong> the modelled trend, seasonal, <strong>and</strong> irregular (seeFigure 3.6.15).


3.6 State space models8.007.757.507.257.00Log_UKdriversKSITrend_Log_UKdriversKSI1970 1975 1980 19850.2 Seas_Log_UKdriversKSI0.10.0−0.10.11970 1975 1980 1985Irr_Log_UKdriversKSI0.0−0.11970 1975 1980 1985Figure 3.6.15: Observed log-transformed <strong>time</strong> <strong>series</strong> <strong>and</strong> the trend of the stochasticlevel determ<strong>in</strong>istic seasonal model with <strong>in</strong>tervention (top graph), seasonal component(middle graph), <strong>and</strong> irregular component (bottom graph) for the log of UK drivers KSI.Compared to the level of the model without <strong>in</strong>tervention, the level <strong>in</strong> the present model shows asudden decrease at the start of 1983 (see Figure 3.6.12 <strong>and</strong> 3.6.15, top graphs). This results <strong>in</strong>a value for the irregular component <strong>in</strong> February 1983 which is considerably smaller, <strong>in</strong> absoluteterms, than <strong>in</strong> the model without <strong>in</strong>tervention (bottom graphs). Use the menu or to save these graphs, e.g. as anEncapsulated Postcript file (*.eps). M<strong>in</strong>imize the STAMP Graphics w<strong>in</strong>dow.Step 5: Test of model residuals Go back to the STAMP w<strong>in</strong>dow <strong>and</strong> choose . In the Residual graphics w<strong>in</strong>dow select Residuals, Correlogram, with 14,Density, Histogram, Normal, QQ plot, <strong>and</strong> Write diagnostic tests. Click OK.Figure 3.6.16 shows the st<strong>and</strong>ardized residuals <strong>and</strong> their correlogram, densityfunction, <strong>and</strong> normal probability plot as depicted <strong>in</strong> the STAMP graphics w<strong>in</strong>dowof GiveW<strong>in</strong>.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 4 7


Chapter 32Residual Log_UKdriversKSI1.0CorrelogramResidual Log_UKdriversKSI10.500.0−1−2−0.50.40.30.20.11970 1975 1980 1985DensityN(s=0.968)0 5 10 15QQ plotnormal210−1−2−3 −2 −1 0 1 2 3−2 −1 0 1 2Figure 3.6.16: Residuals <strong>and</strong> residual tests for the stochastic level determ<strong>in</strong>isticseasonal model with <strong>in</strong>tervention applied to the log of UK drivers KSI.When we compare the top left graphs of Figure 3.6.13 <strong>and</strong> 3.6.16, we can see that theextremely large residual <strong>in</strong> February 1993 has disappeared <strong>in</strong> the latter figure because ofadd<strong>in</strong>g the <strong>in</strong>tervention variable. In the model with <strong>in</strong>tervention variable only seven residuals(3.6%) are outside the 95% confidence <strong>in</strong>terval. However, none of them is extremely large.From the top right graph, we learn that for 2 out of the first 14 lags the autocorrelation is (just)outside the 95% confidence <strong>in</strong>terval, which is def<strong>in</strong>ed by the boundaries -2/√T=-0.14 <strong>and</strong>+2/√T=0.14.The bottom graphs show that the assumption of normality of the residuals is very well satisfied,which confirms the normality test results <strong>in</strong> Table 3.6.7. Use the menu or to save the graphs <strong>and</strong> m<strong>in</strong>imize theSTAMP Graphics w<strong>in</strong>dow.In the Results w<strong>in</strong>dow of GiveW<strong>in</strong>, the follow<strong>in</strong>g residual test results can befound (only part of the results are pr<strong>in</strong>ted below):


3.6 State space modelsGoodness-of-fit results for Residual Log_UKdriversKSIInformation criterion of Akaike AIC -5.149332... of Schwartz (Bayes) BIC -4.911807Serial correlation statistics for Residual Log_UKdriversKSI.Lag dF SerCorr BoxLjung ProbChi2(dF)1 0 0.07992 0 0.04493 1 -0.0809 2.9183 [ 0.0876]4 2 -0.1449 7.0597 [ 0.0293]5 3 0.0232 7.1660 [ 0.0668]6 4 -0.0716 8.1885 [ 0.0849]7 5 -0.0489 8.6681 [ 0.1231]8 6 -0.1351 12.3438 [ 0.0547]9 7 -0.0049 12.3487 [ 0.0897]10 8 -0.0541 12.9445 [ 0.1138]11 9 0.0888 14.5601 [ 0.1037]12 10 0.0586 15.2669 [ 0.1226]13 11 0.1458 19.6698 [ 0.0501]14 12 0.0201 19.7541 [ 0.0719]The addition of the <strong>in</strong>tervention to the model has improved the goodness-of-fit: the AIC hasdecreased (from -5.07 to -5.15) as well as the BIC (from -4.85 to -4.91).Step 6: Test of auxiliary residuals Go to the STAMP w<strong>in</strong>dow aga<strong>in</strong> <strong>and</strong> choose . In the Auxiliary residuals graphics w<strong>in</strong>dow select Irregular, Level residual,Index plot, Density, Histogram, Normal, QQ plot, Write normality tests, <strong>and</strong>Write values exceed<strong>in</strong>g (3.5). Click OK.The STAMP graphics w<strong>in</strong>dow <strong>in</strong> GiveW<strong>in</strong> displays the auxiliary residuals of theirregular <strong>and</strong> of the level component <strong>and</strong> their density function <strong>and</strong> normalprobability plot: see Figure 3.6.17. Use the menu or to save these graphs <strong>and</strong> m<strong>in</strong>imizethe STAMP Graphics w<strong>in</strong>dow.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 4 9


Chapter 32IrrRes Log_UKdriversKSI0.4DensityN(s=0.999)00.2−221970QQ plot1975 1980 1985normal2−3 −2 −1 0 1 2 3 4LvlRes Log_UKdriversKSI00−2−20.6−2 −1 0 1 2DensityN(s=0.991)2.51970QQ plot1975 1980 1985normal0.40.20.0−2.5−4 −3 −2 −1 0 1 2 3−2 −1 0 1 2Figure 3.6.17: Auxiliary residuals <strong>and</strong> correspond<strong>in</strong>g tests for the stochastic level <strong>and</strong>determ<strong>in</strong>istic seasonal model with <strong>in</strong>tervention applied to the log of UK drivers KSI.The follow<strong>in</strong>g output describes the auxiliary residual test results for normality ascan be found <strong>in</strong> the Results w<strong>in</strong>dow of GiveW<strong>in</strong>.Normality test for IrrRes Log_UKdriversKSISample Size 192Mean -0.000521Std.Devn. 0.999144Skewness -0.081441Excess Kurtosis -0.412446M<strong>in</strong>imum -2.424978Maximum 2.528211Skewness Chi^2(1) 0.21225 [0.6450]Kurtosis Chi^2(1) 1.3609 [0.2434]Normal-BS Chi^2(2) 1.5731 [0.4554]Normal-DH Chi^2(2) 1.2611 [0.5323]Normality test for LvlRes Log_UKdriversKSISample Size 192Mean 0.039095Std.Devn. 0.991080Skewness -0.656959Excess Kurtosis 0.780304M<strong>in</strong>imum -3.181119Maximum 2.053959Skewness Chi^2(1) 13.811 [0.0002]Kurtosis Chi^2(1) 4.871 [0.0273]Normal-BS Chi^2(2) 18.682 [0.0001]Normal-DH Chi^2(2) 13.017 [0.0015]Both Figure 3.6.17 <strong>and</strong> the auxiliary residual tests demonstrate that the auxiliary residuals of theirregular component satisfy the assumption of normality, whereas those of the level componentdo not satisfy this assumption. This was also the case <strong>in</strong> the model without <strong>in</strong>tervention. Fromthe middle right graph, we see that <strong>in</strong> 1973 <strong>and</strong> 1974 there are too many large negative auxiliaryresiduals of the level component. As mentioned above, this means that we must be careful withthe <strong>in</strong>terpretation of the test statistics for the level component's auxiliary residuals. However, the


3.6 State space modelsextremely large value of the auxiliary residual of the level component for February 1983 that weobserved <strong>in</strong> Figure 3.6.14 has disappeared <strong>in</strong> Figure 3.6.17. This is the result of <strong>in</strong>clud<strong>in</strong>g thelevel shift <strong>in</strong>tervention variable <strong>in</strong> the model.Step 7: Conclusion of <strong>analysis</strong>The residuals obta<strong>in</strong>ed with the <strong>analysis</strong> of the log of the monthly UK driversKSI from January 1969 to December 1984 with the stochastic level <strong>and</strong>determ<strong>in</strong>istic seasonal model with <strong>in</strong>tervention variable satisfy all the modelassumptions of <strong>in</strong>dependence, homoscedasticity, <strong>and</strong> normality. The addition ofthe level shift <strong>in</strong>tervention variable for February 1983 has improved the fit of themodel, <strong>and</strong> the extremely large value of the auxiliary residual of the levelcomponent observed <strong>in</strong> the previous <strong>analysis</strong> has disappeared <strong>in</strong> the presentone. The regression coefficient for the <strong>in</strong>tervention variable is significant, <strong>and</strong><strong>in</strong>dicates that the <strong>in</strong>troduction of the seat belt law resulted <strong>in</strong> a 21.3% reduction<strong>in</strong> the number of UK drivers KSI. However, the auxiliary residuals of the levelcomponent still do not satisfy the assumption of normality. Therefore, we mustbe careful with the <strong>in</strong>terpretation of test statistics with respect to structural levelbreaks.In Section 3.6.6, we will extend the model with yet another component: acont<strong>in</strong>uous explanatory variable.Step 8: Forecast<strong>in</strong>gBecause <strong>in</strong> the next section we will extend the stochastic level <strong>and</strong> determ<strong>in</strong>isticseasonal model <strong>in</strong>clud<strong>in</strong>g an <strong>in</strong>tervention variable with yet another explanatoryvariable, we postpone the forecast<strong>in</strong>g to that section.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 5 1


Chapter 33.6.6 Explanatory variablesIn the previous section, we extended the stochastic level <strong>and</strong> determ<strong>in</strong>isticseasonal model for the <strong>analysis</strong> of the log of the number of UK drivers KSI forthe period January 1969 through December 1984 with an <strong>in</strong>tervention variable.In this section, we will add yet another component: a cont<strong>in</strong>uous explanatoryvariable.This cont<strong>in</strong>uous explanatory variable is the log of the monthly prices of petrol <strong>in</strong>the UK <strong>in</strong> the period January 1969 to December 1984 of which is assumed thatit may have affected car mobility <strong>and</strong> thus also the number of drivers KSI. Theseat belt law <strong>in</strong>tervention from the previous section will be kept <strong>in</strong> the model.Step 1: Start of <strong>analysis</strong> <strong>and</strong> data loadIf needed, we wil first open GiveW<strong>in</strong>, load the data, <strong>and</strong> start STAMP. If GiveW<strong>in</strong> is not yet open, then start GiveW<strong>in</strong>2. If GiveW<strong>in</strong> is already open from a previous exercise but with another datasetthan the UK drivers KSI, then close all results, data, <strong>and</strong> graphics w<strong>in</strong>dows<strong>in</strong> GiveW<strong>in</strong> by click<strong>in</strong>g on the icon with the cross <strong>in</strong> the top right corner ofeach w<strong>in</strong>dow. Use the menu to open the file“UKdriversKSI.<strong>in</strong>7”. If GiveW<strong>in</strong> is already open <strong>and</strong> the UK drivers KSI dataset is already loaded,then proceed with the next <strong>in</strong>struction.The data file is loaded <strong>and</strong> displayed <strong>in</strong> a m<strong>in</strong>imized w<strong>in</strong>dow at the bottom of theGiveW<strong>in</strong> ma<strong>in</strong> w<strong>in</strong>dow. To view the data file: Click on the icon with the two overlapp<strong>in</strong>g boxes. M<strong>in</strong>imize the data file w<strong>in</strong>dow aga<strong>in</strong> <strong>and</strong> use the menu to start the STAMP program.Step 2: Model formulationIn this step, we will add the petrol price variable to the stochastic level <strong>and</strong>determ<strong>in</strong>istic seasonal model with seat belt law <strong>in</strong>tervention from the previoussection: In STAMP, choose the menu . In the Data selection w<strong>in</strong>dow select the variable Log_UKdriversKSI <strong>and</strong> clickAdd. Also select the variable Log_PetrolPrice <strong>and</strong> click Add. Then click OK. In the Select components w<strong>in</strong>dow, choose a Stochastic Level, No slope,Irregular, <strong>and</strong> Fixed seasonal.To add the level shift <strong>in</strong>tervention variable for February 1983: Click Next.


3.6 State space models Select the the po<strong>in</strong>t <strong>in</strong> the <strong>series</strong>: year is 1983 <strong>and</strong> period is 2. Click Level. Then click on the F<strong>in</strong>ish button.Step 3: Model estimation <strong>and</strong> <strong>in</strong>spection of results In the Estimate Model w<strong>in</strong>dow, select Maximum Likelihood. Click OK.The model is estimated, <strong>and</strong> the follow<strong>in</strong>g output appears <strong>in</strong> the GiveW<strong>in</strong>Results w<strong>in</strong>dow:Equation 1.Log_UKdriversKSI = Level + Fixed seasonal + Expl vars + Interv +IrregularEstimation reportModel with 2 parameters ( 1 restrictions).Parameter estimation sample is 1969. 1 - 1984.12. (T = 192).Log-likelihood kernel is 2.342043.Very strong convergence <strong>in</strong> 2 iterations.( likelihood cvg 3.792323e-016gradient cvg 1.065814e-009parameter cvg 7.219832e-009 )Eq 1 : Diagnostic summary report.Estimation sample is 1969. 1 - 1984.12. (T = 192, n = 191).Log-Likelihood is 449.672 (-2 LogL = -899.345).Prediction error variance is 0.00483573Summary statisticsLog_UKdriverStd.Error 0.069539Normality 1.9020H( 63) 0.87770r( 1) 0.10275r(12) 0.052579DW 1.7930Q(12,11) 18.706Rs^2 0.33163Eq 1 : Estimated variances of disturbances.Component Log_UKdriversKSI (q-ratio)Irr 0.0040344 ( 1.0000)Lvl 0.00026772 ( 0.0664) Check the results (sample period, log-likelihood, estimated variance ofdisturbances) <strong>and</strong> compare them with the results from the <strong>analysis</strong> without<strong>in</strong>tervention <strong>in</strong> the previous section.We have very strong convergence <strong>in</strong> two iterations. The diagnostic summary report shows thatthe addition of the petrol price as explanatory variable has improved the stochastic level <strong>and</strong>determ<strong>in</strong>istic seasonal model with seat belt law <strong>in</strong>tervention: the value of the log-likelihoodP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 5 3


Chapter 3function has <strong>in</strong>creased from 449 to 450 <strong>and</strong> the prediction error variance has decreased from0.00502 to 0.00484. In the STAMP w<strong>in</strong>dow choose <strong>in</strong> the menu. Select Additional output, Get steady state, Anti-log <strong>analysis</strong>, <strong>and</strong> State <strong>and</strong>regression output. Click OK.The GiveW<strong>in</strong> Results w<strong>in</strong>dow displays the follow<strong>in</strong>g additional results:Eq 1 : Estimated st<strong>and</strong>ard deviations of disturbances.Component Log_UKdriversKSI (q-ratio)Irr 0.063517 ( 1.0000)Lvl 0.016362 ( 0.2576)Eq 1 : Estimated coefficients of f<strong>in</strong>al state vector.Variable Coefficient R.m.s.e. t-valueLvl 6.6317 0.21428 30.948 [ 0.0000]Sea_ 1 0.0085360 0.015860 0.5382 [ 0.5911]Sea_ 2 -0.10336 0.015836 -6.5267 [ 0.0000]Sea_ 3 -0.064435 0.015801 -4.0778 [ 0.0001]Sea_ 4 -0.14119 0.015783 -8.9458 [ 0.0000]Sea_ 5 -0.052945 0.015757 -3.3601 [ 0.0009]Sea_ 6 -0.088490 0.015764 -5.6135 [ 0.0000]Sea_ 7 -0.039156 0.015787 -2.4803 [ 0.0140]Sea_ 8 -0.031078 0.015754 -1.9727 [ 0.0500]Sea_ 9 0.0039760 0.015756 0.25234 [ 0.8010]Sea_10 0.080770 0.015815 5.1073 [ 0.0000]Sea_11 0.18615 0.015816 11.769 [ 0.0000]Anti-log trend <strong>analysis</strong>Trend value at end of period is 758.765.Eq 1 : Estimated coefficients of explanatory variables.Variable Coefficient R.m.s.e. t-valueLog_PetrolPrice -0.27721 0.098431 -2.8163 [ 0.0054]Lvl 1983. 2 -0.23757 0.046430 -5.1167 [ 0.0000]Just as <strong>in</strong> the model with seat belt law <strong>in</strong>tervention but without petrol price as explanatoryvariable, the parameter estimates for the first <strong>and</strong> the n<strong>in</strong>th month of the seasonal componentdo not deviate from zero <strong>in</strong> the f<strong>in</strong>al state. At the end of the period, the trend value as presentedby the anti-log trend <strong>analysis</strong> is 757, which is smaller than <strong>in</strong> the model with seat belt law<strong>in</strong>tervention but without petrol price variable (1391). This large difference <strong>in</strong> the trend value iscaused by the <strong>in</strong>troduction of the explanatory variable, which expla<strong>in</strong>s part of the trend.The regression coefficients for the seat belt law <strong>in</strong>tervention (-0.23757) <strong>and</strong> for the log of petrolprice (-0.27721) are both significant <strong>in</strong> this model. However, to guarantee that the t-values arereliable, we must first test the model assumptions. The summary statistics <strong>in</strong> the output of STAMP can be used to set up thefollow<strong>in</strong>g table (see also Table 3.6.10 <strong>in</strong> the Methodology report):Statistic Value Critical 5% Assumption satisfiedvalue aIndependence Q(12,11) 18.7 19.68 +r(1) 0.102 0.14 +r(12) 0.0526 0.14 +Homoscedasticity H(63) 0.878 1.65 +


3.6 State space modelsNormality N 1.90 5.99 +Table 3.6.8: Diagnostic test results for the stochastic level <strong>and</strong> determ<strong>in</strong>istic seasonalmodel with <strong>in</strong>tervention <strong>and</strong> explanatory variable applied to the log of UK drivers KSI.a Probability that statistic exceeds critical value is 0.05.Table 3.6.8 shows that the stochastic level <strong>and</strong> determ<strong>in</strong>istic seasonal modelwith <strong>in</strong>tervention <strong>and</strong> explanatory variable satisfies all model assumptions. Thisguarantees that the t-tests for the regression coefficients of the <strong>in</strong>terventionvariable <strong>and</strong> the explanatory variable (see output above) are reliable. Accord<strong>in</strong>gto this <strong>analysis</strong>, the <strong>in</strong>troduction of the seat belt law resulted <strong>in</strong> a 100*(e -0.23757 –1) = -21.1% change <strong>in</strong> the number of UK drivers KSI (which is virtually identicalto what we found <strong>in</strong> the previous section), while the regression coefficient for logpetrol price <strong>in</strong>dicates that a 1% rise <strong>in</strong> petrol price was associated with a 0.28%reduction <strong>in</strong> the number of UK drivers KSI.Step 4: Graphics of model components In the STAMP w<strong>in</strong>dow choose menu .Select Trend plus Xs, Seasonal, Irregular, <strong>and</strong> Smoothed. Click OK.The STAMP Graphics w<strong>in</strong>dow appears with graphs of the observed logtransformed<strong>time</strong> <strong>series</strong> <strong>and</strong> the modelled trend, seasonal, <strong>and</strong> irregular (seeFigure 3.6.18).8.007.757.507.257.00Log_UKdriversKSITrendX_Log_UKdriversKSI1970 1975 1980 19850.2 Seas_Log_UKdriversKSI0.10.0−0.10.11970 1975 1980 1985Irr_Log_UKdriversKSI0.0−0.11970 1975 1980 1985P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 5 5


Chapter 3Figure 3.6.18: Observed log-transformed <strong>time</strong> <strong>series</strong> <strong>and</strong> the trend of the stochasticlevel determ<strong>in</strong>istic seasonal model with <strong>in</strong>tervention <strong>and</strong> explanatory variable (topgraph), seasonal component (middle graph), <strong>and</strong> irregular component (bottom graph)for the log of UK drivers KSI.Comparison of Figures 3.6.15 <strong>and</strong> 3.6.18 leads to the conclusion that there is no differencebetween the trend, seasonal, <strong>and</strong> irregular components of the model with <strong>and</strong> without petrolprice as explanatory variable. The difference between the models lies <strong>in</strong> the composition of thetrend component, which will be shown below. Use the menu or to save these graphs, e.g. as anEncapsulated Postcript file (*.eps). M<strong>in</strong>imize the STAMP Graphics w<strong>in</strong>dow.Next, we will make a graph of the part of trend which is expla<strong>in</strong>ed by the log ofpetrol price: In the STAMP w<strong>in</strong>dow choose menu .Select Trend, Trend plus Xs, <strong>and</strong> Smoothed. Click OK.The STAMP Graphics w<strong>in</strong>dow appears with graphs of the observed logtransformed<strong>time</strong> <strong>series</strong> <strong>and</strong> the modelled trend without <strong>and</strong> with the expla<strong>in</strong>edportion (see Figure 3.6.19).Log_UKdriversKSITrend_Log_UKdriversKSI7.57.01970 1975 1980 19857.75Log_UKdriversKSITrendX_Log_UKdriversKSI7.507.257.001970 1975 1980 1985Figure 3.6.19: Observed log-transformed <strong>time</strong> <strong>series</strong> <strong>and</strong> the trend of the stochasticlevel determ<strong>in</strong>istic seasonal model with <strong>in</strong>tervention <strong>and</strong> explanatory variable for the logof UK drivers KSI, without the expla<strong>in</strong>ed portion (top graph) <strong>and</strong> with the expla<strong>in</strong>edportion (bottom graph).As can be seen <strong>in</strong> Figure 3.6.19, <strong>in</strong> the model with explanatory variable a considerable part ofthe trend is expla<strong>in</strong>ed by the log of petrol price, whereas the rema<strong>in</strong><strong>in</strong>g part of the trend is astochastic level <strong>in</strong>clud<strong>in</strong>g the effect of the seat belt <strong>in</strong>tervention. In the model withoutexplanatory variable, the trend entirely consists of a stochastic level plus seat belt <strong>in</strong>terventioneffect.


3.6 State space models Use the menu or to save these graphs, e.g. as anEncapsulated Postcript file (*.eps). M<strong>in</strong>imize the STAMP Graphics w<strong>in</strong>dow.Step 5: Test of model residuals Go back to the STAMP w<strong>in</strong>dow <strong>and</strong> choose . In the Residual graphics w<strong>in</strong>dow select Residuals, Correlogram, with 14,Density, Histogram, Normal, QQ plot, <strong>and</strong> Write diagnostic tests. Click OK.Figure 3.6.20 shows the st<strong>and</strong>ardized residuals <strong>and</strong> their correlogram, densityfunction, <strong>and</strong> normal probability plot as depicted by the STAMP graphicsw<strong>in</strong>dow <strong>in</strong> GiveW<strong>in</strong>.2Residual Log_UKdriversKSI1.0CorrelogramResidual Log_UKdriversKSI10.500.0−1−2−0.50.40.30.20.11970 1975 1980 1985DensityN(s=0.965)0 5 10 15QQ plotnormal210−1−2−3 −2 −1 0 1 2 3−2 −1 0 1 2Figure 3.6.20: Residuals <strong>and</strong> residual tests for the stochastic level <strong>and</strong> determ<strong>in</strong>isticseasonal model with <strong>in</strong>tervention <strong>and</strong> explanatory variable applied to the log of UKdrivers KSI.When we compare the residuals <strong>and</strong> the autocorrelations for lags 1 to 14 of this model <strong>in</strong>clud<strong>in</strong>g<strong>in</strong>tervention <strong>and</strong> explanatory variable with the residuals <strong>and</strong> autocorrelations of the modelwithout explanatory variable (see the top left graphs of Figures 3.6.16 <strong>and</strong> 3.6.20), we seealmost no differences. However, from the bottom graphs of Figures 3.6.16 <strong>and</strong> 3.6.20 we canconclude that the distribution of the residuals is more close to the normal distribution. Thisconclusion is confirmed by the Doornik-Hansen statistic, whose value is smaller for this model(1.90, see Table 3.6.8) than for the model without explanatory variable (2.40, see Table 3.6.7).P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 5 7


Chapter 3 Use the menu or to save the graphs <strong>and</strong> m<strong>in</strong>imize theSTAMP Graphics w<strong>in</strong>dow.In the Results w<strong>in</strong>dow of GiveW<strong>in</strong>, the follow<strong>in</strong>g residual test results can befound (only part of the results are pr<strong>in</strong>ted below):


3.6 State space modelsGoodness-of-fit results for Residual Log_UKdriversKSIInformation criterion of Akaike AIC -5.175474... of Schwartz (Bayes) BIC -4.920982Serial correlation statistics for Residual Log_UKdriversKSI.Lag dF SerCorr BoxLjung ProbChi2(dF)1 0 0.10282 0 0.05823 1 -0.0805 3.9804 [ 0.0460]4 2 -0.1456 8.1591 [ 0.0169]5 3 0.0203 8.2410 [ 0.0413]6 4 -0.0653 9.0899 [ 0.0589]7 5 -0.0583 9.7707 [ 0.0820]8 6 -0.1682 15.4709 [ 0.0169]9 7 -0.0467 15.9130 [ 0.0259]10 8 -0.0856 17.4054 [ 0.0262]11 9 0.0598 18.1372 [ 0.0336]12 10 0.0526 18.7065 [ 0.0442]13 11 0.1513 23.4503 [ 0.0153]14 12 0.0237 23.5674 [ 0.0233]The addition of the petrol price variable to the model has improved the goodness-of-fit: the AIChas decreased (from -5.15 to -5.17) as well as the BIC (from -4.91 to -4.92).Step 6: Test of auxiliary residuals Go to the STAMP w<strong>in</strong>dow aga<strong>in</strong> <strong>and</strong> choose . In the Auxiliary residuals graphics w<strong>in</strong>dow select Irregular, Level residual,Index plot, Density, Histogram, Normal, Write normality tests, <strong>and</strong> Writevalues exceed<strong>in</strong>g (3.5). Click OK.The STAMP graphics w<strong>in</strong>dow <strong>in</strong> GiveW<strong>in</strong> displays the auxiliary residuals of theirregular <strong>and</strong> of the level component <strong>and</strong> their density function <strong>and</strong> normalprobability plot: see Figure 3.6.21.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 5 9


Chapter 32IrrRes Log_UKdriversKSI0.4DensityN(s=0.999)10.300.2−1−20.1210−1−21970 1975 1980 19850.5LvlRes Log_UKdriversKSI0.40.30.20.1−3 −2 −1 0 1 2 3 4DensityN(s=0.996)1970 1975 1980 1985−4 −3 −2 −1 0 1 2 3Figure 3.6.21: Auxiliary residuals <strong>and</strong> correspond<strong>in</strong>g tests for the stochastic leveldeterm<strong>in</strong>istic seasonal model with <strong>in</strong>tervention <strong>and</strong> explanatory variable applied to thelog of UK drivers KSI.The follow<strong>in</strong>g output describes the auxiliary residual test results for normality ascan be found <strong>in</strong> the Results w<strong>in</strong>dow of GiveW<strong>in</strong>.Normality test for IrrRes Log_UKdriversKSISample Size 192Mean -0.000458Std.Devn. 0.999195Skewness -0.063404Excess Kurtosis -0.402381M<strong>in</strong>imum -2.408254Maximum 2.494782Skewness Chi^2(1) 0.12864 [0.7198]Kurtosis Chi^2(1) 1.2953 [0.2551]Normal-BS Chi^2(2) 1.4239 [0.4907]Normal-DH Chi^2(2) 1.055 [0.5901]Normality test for LvlRes Log_UKdriversKSISample Size 192Mean 0.080397Std.Devn. 0.995941Skewness -0.481094Excess Kurtosis 0.144688M<strong>in</strong>imum -2.756100Maximum 2.427820Skewness Chi^2(1) 7.4065 [0.0065]Kurtosis Chi^2(1) 0.16748 [0.6824]Normal-BS Chi^2(2) 7.5739 [0.0227]Normal-DH Chi^2(2) 8.305 [0.0157]Both Figure 3.6.21 <strong>and</strong> the auxiliary residual tests demonstrate that the auxiliary residuals of theirregular component satisfy the assumption of normality, whereas those of the level componentdo not satisfy this assumption. This was also the case <strong>in</strong> the model with <strong>in</strong>tervention but withoutthe petrol price explanatory variable. From the bottom left graph, we see that <strong>in</strong> 1973 <strong>and</strong> 1974there are quite a number of large negative auxiliary residuals of the level component. As


3.6 State space modelsmentioned <strong>in</strong> the previous section, this means that we must be careful with the <strong>in</strong>terpretation ofthe test statistics for the auxiliary residuals correspond<strong>in</strong>g to the level component.Step 7: Conclusion of <strong>analysis</strong>In this section, we extended the stochastic level determ<strong>in</strong>istic seasonal modelwith the seat belt <strong>in</strong>tervention <strong>and</strong> the log of petrol price as explanatory variable<strong>and</strong> applied this model to the log of the monthly UK drivers KSI from January1969 to December 1984. The residuals obta<strong>in</strong>ed with the <strong>analysis</strong> of this modelsatisfy all the model assumptions of <strong>in</strong>dependence, homoscedasticity, <strong>and</strong>normality. However, the auxiliary residuals of the level component do not satisfythe assumption of normality. Therefore, we must be careful with the<strong>in</strong>terpretation of test statistics with respect to structural level breaks.Accord<strong>in</strong>g to this <strong>analysis</strong>, the <strong>in</strong>troduction of the seat belt law resulted <strong>in</strong> a100*(e -0.23757 – 1) = -21.1% change <strong>in</strong> the number of UK drivers KSI, while a 1%rise <strong>in</strong> petrol price was associated with a 0.28% reduction <strong>in</strong> the number of UKdrivers KSI.Step 8: Forecast<strong>in</strong>gS<strong>in</strong>ce the stochastic level determ<strong>in</strong>istic seasonal model with the seat belt<strong>in</strong>tervention <strong>and</strong> the log of petrol price as explanatory variable provides anappropriate description <strong>and</strong> explanation of the log of the monthly UK drivers KSI<strong>series</strong>, as a f<strong>in</strong>al step <strong>in</strong> the <strong>analysis</strong> we will make forecasts for this <strong>series</strong>. Byperform<strong>in</strong>g an anti-log <strong>analysis</strong>, the forecasts will also be re-expressed <strong>in</strong> termsof the orig<strong>in</strong>al count data.As <strong>in</strong> Section 3.6.6 of the Methodology report, we will make <strong>in</strong>-sample forecastsso as to validate the model. To this end, the stochastic level <strong>and</strong> determ<strong>in</strong>isticseasonal model with the seat belt <strong>in</strong>tervention <strong>and</strong> the log of petrol price asexplanatory variable is fitted to the log of the monthly number of UK drivers KSI,but now only for the period January 1969 to June 1984. So, we now do not<strong>in</strong>clude the last six observations (i.e., the last half year) of the UK drivers KSI<strong>series</strong> <strong>in</strong> the <strong>analysis</strong>. The results of the <strong>analysis</strong> without the months July 1984through December 1984 are very similar to the results presented earlier <strong>in</strong> thissection for the complete <strong>series</strong>. Repeat steps 1 <strong>and</strong> 2 of this section. Then, go to step 3: <strong>in</strong> the Estimate Model w<strong>in</strong>dow, select MaximumLikelihood. Choose Less forecasts 6. Click OK. Go through steps 4 to 7 to check the model assumptions.Next, we use these results to obta<strong>in</strong> forecasts for July 1984 through December1984 <strong>and</strong> compare the forecasts with the observed number of UK drivers KSIP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 6 1


Chapter 3for this period. The actual values of the log of the petrol price are used formak<strong>in</strong>g the forecasts. Go to the STAMP w<strong>in</strong>dow <strong>and</strong> choose . In the Forecast<strong>in</strong>g w<strong>in</strong>dow select 6 as the number of forecasts, PlusXs,Seasonal, Use available database Xs, <strong>and</strong> Write forecasts Y. Click OK.The STAMP graphics w<strong>in</strong>dow <strong>in</strong> GiveW<strong>in</strong> displays the log of the UK drivers KSI<strong>series</strong> from January 1976 until June 1984 extended with the six-monthsforecasts <strong>in</strong>clud<strong>in</strong>g their 70% confidence <strong>in</strong>terval (plus <strong>and</strong> m<strong>in</strong>us one estimatedst<strong>and</strong>ard deviation) <strong>in</strong> the top figure; the log of the UK drivers KSI <strong>and</strong> theextrapolated trend are shown <strong>in</strong> the middle figure, <strong>and</strong> the extrapolatedseasonal is displayed <strong>in</strong> the bottom figure, see Figure 3.6.22.The GiveW<strong>in</strong> Results w<strong>in</strong>dow conta<strong>in</strong>s the forecasted values for July 1984 toDecember 1984, their root mean square errors, <strong>and</strong> the lower <strong>and</strong> upperbounds of the 70% confidence <strong>in</strong>tervals.7.757.507.257.00F−Log_UKdriversKSIForecast7.757.501977 1978 1979 1980 1981 1982 1983 1984 1985F−Log_UKdriversKSIF−TrendX_Log_UKdriversKSI7.257.001977 1978 1979 1980 1981 1982 1983 1984 19850.2 F−Seas_Log_UKdriversKSI0.10.0−0.11977 1978 1979 1980 1981 1982 1983 1984 1985Figure 3.6.22: Six-months forecasts (July-December 1984) of the stochastic level <strong>and</strong>determ<strong>in</strong>istic seasonal model with <strong>in</strong>tervention <strong>and</strong> explanatory variable applied to thelog of UK drivers KSI, January 1969 – June 1984.For the forecast, the fixed seasonal is set through to the forecast range (bottom graph). Thestochastic level part of the trend is set through as a horizontal l<strong>in</strong>e, whereas the expla<strong>in</strong>ed partof the trend is extrapolated by multiply<strong>in</strong>g the log of the petrol price with the correspond<strong>in</strong>gregression coefficient. The total forecast is obta<strong>in</strong>ed by add<strong>in</strong>g up the forecasts for the trend <strong>and</strong>the seasonal. Use the menu or to save these graphs, e.g. as anEncapsulated Postcript file (*.eps). M<strong>in</strong>imize the STAMP Graphics w<strong>in</strong>dow.


3.6 State space modelsIn Figure 3.6.23, the observed number of UK drivers KSI (not their logs) arecompared with the forecasts (also <strong>in</strong> absolute numbers). The figure alsodisplays the 90% confidence limits, determ<strong>in</strong>ed as the forecasted values plus<strong>and</strong> m<strong>in</strong>us 1.64 <strong>time</strong>s the root mean square error (see also Section 3.6.6 of theMethodology report). As can be seen <strong>in</strong> Figure 3.6.23 the observed numbers ofUK drivers KSI for July-December 1984 are all located with<strong>in</strong> the 90%confidence limits of the forecasts. This is a good sign because it means thatnone of the observed values significantly deviates from the forecasts <strong>in</strong> thisperiod.200018001600140090% upper limit120090% low er limitforecasts1000123456789101112log UK drivers KSI1984Figure 3.6.23: Observed number of UK drivers KSI <strong>in</strong> 1984, forecasts of the stochasticlevel <strong>and</strong> determ<strong>in</strong>istic seasonal model with <strong>in</strong>tervention <strong>and</strong> explanatory variable, <strong>and</strong>90% confidence <strong>in</strong>terval.To display the forecasts <strong>in</strong> terms of absolute numbers <strong>in</strong> STAMP, we apply ananti-log <strong>analysis</strong>: Aga<strong>in</strong> go to the STAMP w<strong>in</strong>dow <strong>and</strong> choose . In the Forecast<strong>in</strong>g w<strong>in</strong>dow select 6 as the number of forecasts, PlusXs,Modified anti-log <strong>analysis</strong>, Use available database Xs, <strong>and</strong> Write forecastsY. Click OK.The STAMP graphics w<strong>in</strong>dow <strong>in</strong> GiveW<strong>in</strong> displays the orig<strong>in</strong>al observationsfrom January 1976 until June 1983 extended with the six-months forecasts with70% confidence <strong>in</strong>terval (plus <strong>and</strong> m<strong>in</strong>us one estimated st<strong>and</strong>ard deviation) <strong>in</strong>P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 6 3


Chapter 3the top figure <strong>and</strong> the orig<strong>in</strong>al observations <strong>and</strong> the extrapolated trend <strong>in</strong> thebottom figure: see Figure 3.6.24.In the Results w<strong>in</strong>dow of GiveW<strong>in</strong> the follow<strong>in</strong>g forecasts for the orig<strong>in</strong>alobserved <strong>time</strong> <strong>series</strong> have been added:Period Forecast R.m.s.e. - Rmse + Rmse1984. 7 1262.9 94.991 1167.9 1357.91984. 8 1270.3 97.608 1172.7 1367.91984. 9 1311.2 102.84 1208.3 1414.01984.10 1406.1 112.49 1293.7 1518.61984.11 1565.8 127.67 1438.1 1693.51984.12 1659.1 137.79 1521.3 1796.9The list of forecast results gives for each <strong>time</strong> po<strong>in</strong>t the value of the forecast, itsst<strong>and</strong>ard error, <strong>and</strong> the lower <strong>and</strong> upper limit of the 70% confidence <strong>in</strong>terval.2250 E_F−Log_UKdriversKSI Forecast20001750150012501977 1978 1979 1980 1981 1982 1983 1984 19852250 E_F−Log_UKdriversKSI F−TrendX_ME_Log_UKdriversKSI20001750150012501977 1978 1979 1980 1981 1982 1983 1984 1985Figure 3.6.24: Anti-logged six-months forecasts (July-December 1984) of the stochasticlevel <strong>and</strong> determ<strong>in</strong>istic seasonal model with <strong>in</strong>tervention <strong>and</strong> explanatory variableapplied to the log of UK drivers KSI, January 1969 – June 1984.


3.6 State space models3.6.7 Conclusion state space modelsThis chapter showed examples of the application of state space <strong>analysis</strong> to roadsafety data. State space <strong>analysis</strong> was used to describe the development of roadunsafety, to expla<strong>in</strong> part of this development by add<strong>in</strong>g <strong>in</strong>terventions orexplanatory variables, <strong>and</strong> to make forecasts of road unsafety.For the <strong>analysis</strong> of a new road safety dataset, we recommend to first analysethe data on fatalities, casualties, drivers KSI, etc. by go<strong>in</strong>g through <strong>analysis</strong>steps 1 to 7 as presented <strong>in</strong> this chapter. The most efficient way to f<strong>in</strong>d the bestfitt<strong>in</strong>g model is to start from a stochastic level <strong>and</strong> a stochastic slopecomponent. If there is possible seasonality, e.g. <strong>in</strong> the case of quarterly,monthly, or weekly data, a dummy or trigonometric seasonal should be <strong>in</strong>cluded<strong>in</strong> the model. Then, the estimated variances of disturbances <strong>in</strong>dicate whetherthe components should be treated stochastically or determ<strong>in</strong>istically. Next, if thef<strong>in</strong>al state value of a determ<strong>in</strong>istic component is not significantly different fromzero, then the component can be removed from the model.Always carefully check whether the residuals of the model satisfy the modelassumptions of serial <strong>in</strong>dependence, homoscedasticity, <strong>and</strong> normality. Thestatistics as presented <strong>in</strong> Table 3.6.1, for example, can be used for thatpurpose. However, one should not only rely on these statistical tests, but alsomake use of graphical tests, like the plot of the st<strong>and</strong>ardised residuals, thecorrelogram, the normal density diagram, <strong>and</strong> the normal probability plot.The auxiliary residuals of the irregular component are useful to detect outlierobservations <strong>and</strong> the auxiliary residuals of the level, slope, <strong>and</strong> seasonalcomponents can be employed to f<strong>in</strong>d structural breaks <strong>in</strong> the respective level,slope, <strong>and</strong> seasonal. The auxiliary residuals should be tested for normality, byus<strong>in</strong>g statistical tests (e.g., the Doornik-Hansen statistic) <strong>and</strong> graphical outputas the auxiliary residual plot, the normal density diagram, <strong>and</strong> the normalprobability plot. If the auxiliary residuals of a component are not normallydistributed, then this can be <strong>in</strong>terpreted as a warn<strong>in</strong>g that the tests of outliers orstructural breaks should be <strong>in</strong>terpreted with care.With state space <strong>analysis</strong>, forecasts can be made for short <strong>and</strong> long periodsahead. In do<strong>in</strong>g this, one should be attentive to the fact that the forecasts totallyrely on sett<strong>in</strong>g through the dynamics of the <strong>time</strong> <strong>series</strong> from the past to thefuture, at least if no additional <strong>in</strong>formation regard<strong>in</strong>g the future development ofexplanatory variables is added. A very useful property of state space <strong>analysis</strong> isthat it produces confidence <strong>in</strong>tervals for the forecasts, which provide <strong>in</strong>sight <strong>in</strong>the range of possible future values. In general, these ranges becomeimpressively wide when the <strong>time</strong> span between the forecast <strong>and</strong> the lastobservation grows. This prevents the user of state space <strong>analysis</strong> from draw<strong>in</strong>gtoo firm conclusions <strong>and</strong> helps to put the forecast results <strong>in</strong>to perspective.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 6 5


Chapter 4 - ConclusionThe present document constitutes the practical part of the best practice advicefor the <strong>analysis</strong> of complex data structures by Work Package 7 of the SafetyNetproject. This manual is <strong>in</strong>timately l<strong>in</strong>ked to D7.4, “<strong>Multilevel</strong> <strong>modell<strong>in</strong>g</strong> <strong>and</strong> <strong>time</strong><strong>series</strong> <strong>analysis</strong> <strong>in</strong> traffic research – A methodology”. While <strong>in</strong> the methodologyreport the emphasis is on theoretical background <strong>in</strong>formation, the manual givespractical <strong>in</strong>structions for the conduction of multilevel <strong>and</strong> <strong>time</strong> <strong>series</strong> analyseson the basis of user friendly software. Like the methodology report, this manualis divided <strong>in</strong>to two ma<strong>in</strong> chapters each dedicated to one broad family ofanalyses, multilevel <strong>modell<strong>in</strong>g</strong> <strong>and</strong> <strong>time</strong> <strong>series</strong> <strong>analysis</strong>.In Chapter 2, first an overview is given over the various multilevel modelspresented <strong>in</strong> this deliverable. Moreover the software used <strong>in</strong> this deliverable <strong>and</strong>other available multilevel software is discussed (Section 2.1). <strong>Multilevel</strong><strong>modell<strong>in</strong>g</strong> is then <strong>in</strong>troduced with the simplest case (Section 2.2): A l<strong>in</strong>earvariable (speed of a car) was predicted by another l<strong>in</strong>ear variable (length). Inthat example the <strong>in</strong>dividual cars constituted the first level, a second level wasgiven by the road sites at which the speed was measured. It was also<strong>in</strong>vestigated whether the regions had an effect on the speed measurements (i.e.whether there was a third level), however, this was not the case.Often <strong>in</strong> road safety research, variables are not l<strong>in</strong>ear. Therefore the l<strong>in</strong>earmodels are put <strong>in</strong>to the framework of the Generalised L<strong>in</strong>ear Model that allowsto model variables from other distributions, for example b<strong>in</strong>ary responsevariables, as well. As an example of a b<strong>in</strong>ary response variable, the data froman alcohol study are presented, <strong>in</strong>dicat<strong>in</strong>g whether a driver had a BAC abovethe legal limit or not (Section 2.3.2). The <strong>in</strong>dividual drivers were the first level,<strong>and</strong> aga<strong>in</strong>, the road site at which the alcohol level was tested constituted thesecond level <strong>and</strong> it was demonstrated that first-level as well as second-levelvariables could expla<strong>in</strong> some of the variation <strong>in</strong> dr<strong>in</strong>k-driv<strong>in</strong>g. This data couldalso be analysed as mult<strong>in</strong>omial-response data with 3 categories (BACBAC>.08, <strong>and</strong> BAC>.08) as demonstrated <strong>in</strong> Section 2.3.3. In this section itis demonstrated how this type of data can be considered as multivariateresponse structures <strong>and</strong> can be implemented <strong>in</strong> a multilevel model.The number of accidents can be assumed to be Poisson-distributed <strong>and</strong> <strong>in</strong>Section 2.3.4 it is demonstrated how the numbers of fatal accidents can bepredicted by law-enforcement measures. The first level <strong>in</strong> this <strong>analysis</strong> wasgiven by the counties <strong>in</strong> which the number of fatalities had been established <strong>and</strong>it was shown that this number varied across regions (the second level) <strong>and</strong> thatthe effect that alcohol <strong>and</strong> speed controls had also varied across regions.Often <strong>in</strong> road safety research, the same <strong>in</strong>dividual or unit is measured a numberof <strong>time</strong>s subsequently. <strong>Multilevel</strong> <strong>modell<strong>in</strong>g</strong> can be used for such longitud<strong>in</strong>aldata. This was demonstrated <strong>in</strong> Section 2.5 with a simulated data set of driv<strong>in</strong>gscores taken over 6 consecutive measurements. In this case, the <strong>in</strong>dividualmeasurements constituted the first level <strong>and</strong> the <strong>in</strong>dividuals from whom the


3.6 State space modelsmeasurements were taken formed the second level. It was demonstrated howthis technique allows the <strong>in</strong>clusion of predictors at the measurement level (i.e.number of km driven at <strong>time</strong> of measurement) as well as at the <strong>in</strong>dividual level(e.g. age at acquirement of driver’s licence).Similarly to a structure of repeated measurements, multilevel models can alsoserve to analyse multiple responses or measurements from the same <strong>in</strong>dividualunit. This was demonstrated <strong>in</strong> Section 2.4, where accident numbers <strong>and</strong> fatalitynumbers were predicted simultaneously by law-enforcement measures <strong>in</strong> amultivariate model. The multivariate response structure was set up, so that thenumber of fatalities <strong>and</strong> the number of accidents jo<strong>in</strong>tly formed the data vector.In this case the <strong>in</strong>dicator for the lowest level does not correspond to the unit ofmeasurement (the counties) but to an <strong>in</strong>dicator specify<strong>in</strong>g the type of responsegiven (number of accidents or number of fatalities).The topic of Chapter 3 is the <strong>analysis</strong> of road safety <strong>time</strong> <strong>series</strong> data. In the<strong>in</strong>troduction (Section 3.1) a short overview is given over the different methodsas well as the software used <strong>in</strong> this manual <strong>and</strong> other available software.The first <strong>time</strong> <strong>series</strong> approach discussed is the well known l<strong>in</strong>ear regressionapproach. Although technically not a specific <strong>time</strong> <strong>series</strong> <strong>analysis</strong> method, dueto the fact that it is well known, it appeared this method is suitable todemonstrate the key issues with <strong>time</strong> <strong>series</strong> data (<strong>in</strong> road safety) <strong>in</strong> anenvironment familiar to many readers. It is demonstrated that the ord<strong>in</strong>ary l<strong>in</strong>earregression model is not suitable for the number of fatalities <strong>in</strong> Austria. In fact,both the heteroscedasticity <strong>and</strong> the <strong>in</strong>dependence assumption have to bequestioned.Next, the ARMA-type section demonstrates how the examples discussed <strong>in</strong> themethodology report can be fit us<strong>in</strong>g SPSS. Details about the Norway fatalitiesdataset, UK-KSI drivers <strong>and</strong> the French fatalities examples are given. F<strong>in</strong>ally, <strong>in</strong>the state space section, the model properties are discussed us<strong>in</strong>g examplesbased on Norwegian <strong>and</strong> F<strong>in</strong>nish fatality data. F<strong>in</strong>ally, more extensive modelsare developed based on the UK-KSI data.To conclude, a wide range of examples of road safety data analyses waspresented <strong>in</strong> a very detailed way, allow<strong>in</strong>g the reader to underst<strong>and</strong> thenecessary decisions about distributional assumptions, variables <strong>in</strong>cluded, <strong>and</strong>estimation methods chosen. The data files used are <strong>in</strong>cluded so that eachaction <strong>and</strong> output can be traced by the reader. The <strong>in</strong>terpretations are directlyl<strong>in</strong>ked to the output, so that they can serve as examples enabl<strong>in</strong>g the reader to<strong>in</strong>terpret the output for the same type of <strong>analysis</strong> with a different set of data.Together with the methodology report, D7.4, this manual forms the best practicefor the <strong>analysis</strong> of complex data structures. The reader should have understoodthe necessity to check the assumptions underly<strong>in</strong>g the statistical analyses used,<strong>and</strong> if necessary to use methods like multilevel <strong>modell<strong>in</strong>g</strong> <strong>and</strong> <strong>time</strong> <strong>series</strong>analyses that explicitly represent complex data structures <strong>and</strong> thus allowP r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 6 7


Chapter 4researchers conduct valid analyses <strong>and</strong> ga<strong>in</strong> more <strong>in</strong>formation about thestructure itself.


ReferencesBox, G. E. P. & Jenk<strong>in</strong>s, G.M. (1976). Time <strong>series</strong> <strong>analysis</strong>: forecast<strong>in</strong>g <strong>and</strong>control. Revised Edition. Oakl<strong>and</strong>, CA: Holden-Day.Brockwell P.J., Davis R.A. (1998) Introduction to <strong>time</strong> <strong>series</strong> <strong>and</strong> forecast<strong>in</strong>g,Spr<strong>in</strong>ger VerlagBryk, A. S., S. W. Raudenbush, <strong>and</strong> R. Congdon (1996). Hierarchical L<strong>in</strong>ear<strong>and</strong> nonl<strong>in</strong>ear model<strong>in</strong>g with the HLM/2L <strong>and</strong> HLM/3L Programs. Chicago:Scientific Software International.Centre for <strong>Multilevel</strong> Modell<strong>in</strong>g, www.mlw<strong>in</strong>.com. Last accessed on 27.02.2007Durb<strong>in</strong>, J. <strong>and</strong> Koopman, S.J. (2001), Time <strong>series</strong> <strong>analysis</strong> by state spacemethods. Oxford University Press.Enders W. Applied econometric <strong>time</strong> <strong>series</strong>, Wiley, 1995The European Road Safety Observatory. www.erso.en/safetynet.htm. Lastaccessed on 27.02.2007.Gourieroux, C. & Monfort, A. (1990). Séries temporelles et modèlesdynamiques. Economica, 1990.Hallahan, C. (2003), "STAMP 6.0", International Journal of Forecast<strong>in</strong>g, Vol. 19,Issue 2, pp. 319-325.Harvey, A.C. (1989), Forecast<strong>in</strong>g, structural <strong>time</strong> <strong>series</strong> models <strong>and</strong> the Kalmanfilter. Cambridge University Press.Hedeker, D. & Gibbons, R. D. (1996a). MIXOR: a computer program for mixedeffectsord<strong>in</strong>al regression <strong>analysis</strong>. Computer Methods <strong>and</strong> Programs <strong>in</strong>Biomedic<strong>in</strong>e, 49:157-176.Hedeker, D. & Gibbons, R. D. (1996b). MIXREG: a computer program formixed-effects regression <strong>analysis</strong> with autocorrelated errors. ComputerMethods <strong>and</strong> Programs <strong>in</strong> Biomedic<strong>in</strong>e, 49:229-252.HLM - Hierarchical L<strong>in</strong>ear <strong>and</strong> Nonl<strong>in</strong>ear Model<strong>in</strong>g. Scientific SoftwareInternational, www.ssicentral.com/hlm/<strong>in</strong>dex.html. Last accessed on27.02.2007.Judge, G. <strong>and</strong> N<strong>in</strong>omiya, Y. (2000), " STAMP 6.0: Structural Time SeriesAnalyser, Modeller <strong>and</strong> Predictor", CHEER, Vol. 14, Issue 1 (virtual edition).Littell, R.C., Miliken, G.A., Stroup, W.W., & Wolf<strong>in</strong>ger, R.D. (1996). SAS systemfor mixed models. Cary: NC: SAS Institute, Inc.P r o j e c t c o - f i n a n c e d b y t h e E u r o p e a n C o m m i s s i o n , D i r e c t o r a t e - G e n e r a l T r a n s p o r t a n d E n e r g yP a g e 2 6 9


ReferencesLISREL for W<strong>in</strong>dows. Scientific Software International.www.ssicentral.com/lisrel/<strong>in</strong>dex.html. Last accessed on 27.02.2007Koopman, S. J., Harvey, A. C., Doornik, J. A. <strong>and</strong> Shephard, N. (2000),STAMP: Structural Time Series Analyser Modeller <strong>and</strong> Predictor, TimberlakeConsultants.MIX project. http://tigger.uic.edu/~hedeker/mix.html. Last accessed on27.02.2007.The R project. (http://cran.r-project.org). Last accessed on 13 July, 2006Rasbash, J., Steele, F., Browne, W, Prosser, B. (2004). A User’s Guide toMLwiN. Version 2.1e, Centre for <strong>Multilevel</strong> Model<strong>in</strong>g, Institute of Education,University of London, UK.S-Plus 7 – Deliver<strong>in</strong>g the power of predicative analytics across the enterprises(www.<strong>in</strong>sightful.com/products/splus/default.asp). Last accessed on 27.02.2007Teyssière, G. (2005), "Structural <strong>time</strong> <strong>series</strong> <strong>modell<strong>in</strong>g</strong> with STAMP 6.02",Journal of Applied Econometrics, Vol. 20, Issue 4, pp. 571-577.WINBUGS, The bugs project. www.mrc-bsu.cam.ac.uk/bugs/. Last accessed on27.02.2007.Yaffee, R. (2003), "Structural Time Series Model<strong>in</strong>g with SAS Proc UCM <strong>and</strong>STAMP", Social Science, Statistics & Mapp<strong>in</strong>g, Spr<strong>in</strong>g 2003 (virtual edition).

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!