UNCORRECTED PROOF


J. Brynjarsdóttir, G. Stefánsson / Fisheries Research xxx (2004) xxx–xxx

3. Methods

The gamma and log-normal distributions share some characteristics which often make it difficult to choose between them. Both distributions have positive probability mass only for positive values and can describe data sets for which the majority of the probability mass is at low values but there is a heavy tail to the right. They also share the same relationship between the mean and variance, i.e. the variance function:

$$\mathrm{var}(Y) = \phi\, E(Y)^2 \tag{1}$$

where $\phi$ is a constant. This relationship differs from that for other distributions such as the normal, Poisson and negative binomial distributions and can therefore be used to distinguish these two distributions from others. A common approach to check this relationship is to examine a plot of log(sample variance) versus log(sample mean) for homogeneous groups of data; see, for example, McCullagh and Nelder (1989, p. 306). If the points lie on a straight line with a slope close to 2, the gamma and log-normal distributions with fixed scale parameters cannot be rejected as the true underlying distribution. Such an investigation cannot, however, distinguish between these two distributions.

Data from one sub-rectangle and one year of the Icelandic groundfish survey can be thought of as realizations of i.i.d. variables, since the environmental conditions are fairly homogeneous within a sub-rectangle. The drawback is that there are few observations for each sub-rectangle; the highest number is 7 observations, resulting in high uncertainty in the estimated means and variances for the sub-rectangles.
The statistical rectangles, which have up to 16 observations per rectangle, are therefore also considered, but since they are four times the size of the sub-rectangles, the assumption of homogeneity is not as reliable.

A goodness-of-fit test based on a generalized linear model (GLM) is used to distinguish between the two proposed probability distributions, the gamma and log-normal distributions. Following the approach of Stefánsson and Pálsson (1997) and Stefánsson (1988), this was done by scaling the observations with the fitted values from a GLM and then performing a Kolmogorov–Smirnov test on the scaled data. Let $Y_{yji}$ be a random variable that represents the number of cod caught in year $y$, sub-rectangle $j$ and tow $i$. It is assumed that either

$$Y_{yji} \sim G\!\left(r, \frac{\mu_{yj}}{r}\right) \quad \text{or} \quad Z_{yji} = \log(Y_{yji}) \sim N(a_{yj}, b^2) \tag{2}$$

where $N(a, b^2)$ is the normal distribution with mean $a$ and variance $b^2$, and $G(r, \mu/r)$ is the gamma distribution with mean $\mu$, variance $\mu^2/r$ and density function

$$f(y) = \frac{y^{r-1} e^{-yr/\mu}}{(\mu/r)^r\, \Gamma(r)}, \quad y > 0 \tag{3}$$

where $\Gamma$ is the gamma function, $\Gamma(r) = \int_0^\infty x^{r-1} e^{-x}\, dx$. The effects of sub-rectangles and years are assumed to be multiplicative on the original scale of number of cod and hence additive on the log scale. This leads to the log link if $Y_{yji}$ is gamma distributed and the identity link if $\log(Y_{yji})$ is normally distributed. We fit the models:

$$\log(\mu_{yj}) = \beta_0 + \alpha_y + \beta_j + \gamma_{yj} \quad \text{and} \quad a_{yj} = \beta_0 + \alpha_y + \beta_j + \gamma_{yj} \tag{4}$$

where $\beta_0$ is the grand mean, $\alpha_y$ is the year effect, $\beta_j$ is the spatial effect of sub-rectangles and $\gamma_{yj}$ is the interaction. The error is assumed to be gamma distributed with unit mean, $G(r, 1/r)$, in the first model but normally distributed, $N(0, b^2)$, in the second.
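As a quick numerical check of the parameterization in Eq. (3), the sketch below evaluates the density of $G(r, \mu/r)$ and integrates it with a simple midpoint rule, to confirm that the mean is $\mu$ and the variance is $\mu^2/r$, so that Eq. (1) holds with $\phi = 1/r$. The function names, the truncation point of the integral and the example values $r = 2$, $\mu = 3$ are illustrative assumptions and do not come from the paper.

```python
import math


def gamma_density(y, r, mu):
    """Density (3) of G(r, mu/r): shape r and scale mu/r, hence mean mu."""
    if y <= 0:
        return 0.0
    scale = mu / r
    # Work on the log scale to avoid overflow in y**(r-1) and Gamma(r).
    log_f = (r - 1) * math.log(y) - y / scale - r * math.log(scale) - math.lgamma(r)
    return math.exp(log_f)


def numeric_moments(r, mu, upper_factor=50.0, steps=200_000):
    """Midpoint-rule estimates of total mass, mean and variance of G(r, mu/r).

    The integral is truncated at upper_factor * mu, an assumption that is
    adequate for moderate r, where the tail beyond that point is negligible.
    """
    h = upper_factor * mu / steps
    mass = mean = second = 0.0
    for k in range(steps):
        y = (k + 0.5) * h
        w = gamma_density(y, r, mu) * h
        mass += w
        mean += y * w
        second += y * y * w
    return mass, mean, second - mean ** 2


# Illustrative values only (not from the paper).
mass, mean, var = numeric_moments(r=2.0, mu=3.0)
phi = var / mean ** 2  # variance function (1): var(Y) = phi * E(Y)^2
print(mass, mean, var, phi)
```

Under these assumptions the estimated `phi` should agree with $1/r$ up to integration error, which is the sense in which the gamma distribution satisfies the variance function (1).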
For a fixed year $y$, the models become:

$$\log(\mu_{yj}) = \beta_0 + \beta_j \quad \text{and} \quad a_{yj} = \beta_0 + \beta_j \tag{5}$$

The goodness-of-fit test is based on the following. Firstly, it is a known fact that if $X \sim G(r, \mu/r)$ then $X/\mu \sim G(r, 1/r)$. If $\mu_{yj}$ and $r$ were known, we could test whether $Y_{yji}/\mu_{yj} \sim G(r, 1/r)$ using the Kolmogorov–Smirnov test. This is done here by assuming that the fitted values $\hat{\mu}_{yj}$ and the estimated dispersion parameter $1/\hat{r}$ obtained from model (4), with gamma distributed errors, are the true parameters. Secondly, it is another known fact that if $X \sim N(a, b^2)$ then $e^{X-a} \sim LN(0, b^2)$. If the true parameters were known, this could be tested via the Kolmogorov–Smirnov test. This is done here by assuming that the fitted values $\hat{a}_{yj}$ and the estimated dispersion parameter $\hat{b}^2$ obtained from model (4), with normally distributed errors, are the true parameters. The Kolmogorov test statistic $D_n$ measures the distance between the empirical and hypothesized distributions, so we compare these test statistics to see which distribution better represents the data.
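The scaling-and-testing idea above can be sketched for the fixed-year model (5), where the fitted value for each sub-rectangle reduces to a group mean (of the counts under the gamma model, of the log counts under the log-normal model). This is a minimal illustration and not the authors' code: the function names are hypothetical, the dispersion $1/\hat{r}$ is taken to be the Pearson estimate, the gamma CDF is computed from a plain series expansion that is only suitable for moderate arguments, all observations are assumed to be strictly positive, and the data are simulated.

```python
import math
import random


def gamma_cdf_unit_mean(x, r):
    """CDF of G(r, 1/r) via the series for the regularized lower incomplete gamma."""
    if x <= 0:
        return 0.0
    z = r * x
    term = 1.0 / r
    total = term
    n = 0
    while term > 1e-16 * total and n < 100_000:
        n += 1
        term *= z / (r + n)
        total += term
    return min(1.0, total * math.exp(-z + r * math.log(z) - math.lgamma(r)))


def lognormal_cdf(x, b):
    """CDF of LN(0, b^2)."""
    if x <= 0:
        return 0.0
    return 0.5 * (1.0 + math.erf(math.log(x) / (b * math.sqrt(2.0))))


def ks_statistic(sample, cdf):
    """Kolmogorov distance D_n between the empirical CDF and a hypothesized CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def scaled_ks(groups):
    """Return (D_gamma, D_lognormal) for positive observations grouped by cell."""
    n = sum(len(ys) for ys in groups.values())
    p = len(groups)  # one fitted parameter per group
    gam, logn, pearson, rss = [], [], 0.0, 0.0
    for ys in groups.values():
        mu_hat = sum(ys) / len(ys)                      # gamma model, log link
        a_hat = sum(math.log(y) for y in ys) / len(ys)  # normal model on log scale
        for y in ys:
            gam.append(y / mu_hat)
            logn.append(y / math.exp(a_hat))
            pearson += (y / mu_hat - 1.0) ** 2
            rss += (math.log(y) - a_hat) ** 2
    r_hat = (n - p) / pearson        # assumed Pearson estimate of 1 / dispersion
    b_hat = math.sqrt(rss / (n - p))
    d_gamma = ks_statistic(gam, lambda x: gamma_cdf_unit_mean(x, r_hat))
    d_logn = ks_statistic(logn, lambda x: lognormal_cdf(x, b_hat))
    return d_gamma, d_logn


# Simulated example: five hypothetical sub-rectangles with gamma distributed catches.
rng = random.Random(1)
groups = {j: [rng.gammavariate(1.5, mu / 1.5) for _ in range(30)]
          for j, mu in enumerate([5.0, 20.0, 80.0, 10.0, 40.0])}
d_gamma, d_logn = scaled_ks(groups)
print(d_gamma, d_logn)
```

The smaller of the two distances indicates which distribution is closer to the scaled data. Because the parameters are estimated rather than known, the usual Kolmogorov–Smirnov critical values are only approximate here, which is why the sketch compares the two statistics rather than reporting P-values.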
A temperature surface over the survey area is fitted using locally weighted regression (loess) on bottom temperature (which is measured at each station in the survey) to obtain estimates of temperature gradients. The temperature estimates on a fine grid are then used to obtain an estimate of the magnitude of the temperature gradient vector at each station, which is then tested for a significant relation to cod catch in a GLM.

A grid containing the entire survey area was constructed by distributing 101 points equally over the latitude range and 101 points over the longitude range (a total of 10 201 points). A second-order loess smoother in latitude and longitude was used to obtain a bottom temperature surface over the survey area, a different one for each year. The model for a given year is an additive model:

$$T_i = g(\mathrm{lat}_i, \mathrm{lon}_i) + \epsilon_i \tag{6}$$

where $T_i$ represents the bottom temperature, $\epsilon_i$ is assumed to be normally distributed with zero mean and constant variance and $g$ is the loess smoother. The span parameter $f$ was set to 0.2. For each year of the survey, model (6) was fitted to a dataset containing the recorded bottom temperatures, latitudes and longitudes for that year along with the grid points, which were given the temperature value of 0 °C. A point $(t_i, \mathrm{lat}_i, \mathrm{lon}_i)$ in these datasets was given a weight of 1 if it was a record from the survey data but a weight of $10^{-10}$ if it was a grid point. The effect of these weights is that they are simply multiplied by the built-in weights of the loess smoother. The grid points therefore have a negligible effect on the fitted values, except at points that are far from the survey data, where the fitted temperatures are not needed anyway.
On the other hand, the survey data dominate the fitted values at the grid points, and hence the fitted values at the grid points provide a smooth surface of the temperature over the survey area for every year.

Once the estimated temperature $\hat{t}_i = \hat{g}(\mathrm{lat}_i, \mathrm{lon}_i)$ has been obtained for the grid points for each year of the survey, the length of the gradient vector is estimated using:

$$\|\widehat{\nabla g}(\mathrm{lat}_i, \mathrm{lon}_i)\| = \sqrt{\left(\hat{g}(\mathrm{lat}_{i+1}, \mathrm{lon}_i) - \hat{g}(\mathrm{lat}_i, \mathrm{lon}_i)\right)^2 + \left(\hat{g}(\mathrm{lat}_i, \mathrm{lon}_{i+1}) - \hat{g}(\mathrm{lat}_i, \mathrm{lon}_i)\right)^2} \tag{7}$$

where $i$ and $i + 1$ are adjacent points on the grid. Finally, a tow in the survey data is assigned the gradient length value of the grid point that is nearest to the position of the tow. This gradient value is then used as a covariate in a continuous GLM.

The analyses presented here are partly performed in R, a free statistical software package (http://www.r-project.org), and partly in S-PLUS (Venables and Ripley, 2002). R is available from the Comprehensive R Archive Network, http://cran.r-project.org.

4. Results

4.1. The variance function

In order to check assumption (1), the log sample variances and log sample means are calculated for homogeneous groups of data (Fig. 2). As noted above, each sub-rectangle/year combination contains at most seven observations and many of these contain only one or two observations. Only those sub-rectangle/year combinations that contain five or more observations ($n_j \geq 5$) are included in this part of the analysis, to reduce the variances of the estimates of the means and variances. This reduced data set contains 1289 observations and 241 different sub-rectangle/year combinations.

A weighted linear regression, with weights $1/n_j$, of the log sample variance on the log sample mean gives a slope of $\beta = 2.23 \pm 0.05$ (mean ± standard error) and an intercept of $\alpha = -1.6 \pm 0.2$.
The points are close to a straight line and the regression has an $R^2$ value of 0.88, but the hypothesis $H_0: \beta = 2$ is rejected by a two-sided t-test at the 5% level of significance, since $(\hat{\beta} - 2)/\hat{\sigma}_{\hat{\beta}} = (2.23 - 2.00)/0.05 = 4.60$ (P-value less than $10^{-5}$). The results for statistical rectangles were similar: the regression gives a slope of $\beta = 2.23 \pm 0.05$ and an intercept of $\alpha = -1.1 \pm 0.3$, and the hypothesis $H_0: \beta = 2$ is rejected at the 5% level of significance.

In the following analysis it will nevertheless be assumed that the mean-variance relationship (1) is valid for the data at hand, and the gamma and log-normal distributions are proposed for describing the data. The sample means and variances are estimated from small groups of data and have associated uncertainty, which is not accounted for in the regression above.
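The slope test above can be reproduced in outline as follows. This is a sketch under stated assumptions: `weighted_line_fit` and `slope_test` are hypothetical helper names, the weights are whatever the caller supplies (the text uses $1/n_j$), and the P-value uses a normal approximation to the t distribution, which is reasonable with 241 groups but is not necessarily how the authors computed it.

```python
import math


def weighted_line_fit(x, y, w):
    """Weighted least-squares fit of y = alpha + beta * x.

    Returns (alpha, beta, se_beta); se_beta uses the weighted residual
    variance with n - 2 degrees of freedom.
    """
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    beta = sxy / sxx
    alpha = ybar - beta * xbar
    rss = sum(wi * (yi - alpha - beta * xi) ** 2 for wi, xi, yi in zip(w, x, y))
    se_beta = math.sqrt(rss / (len(x) - 2) / sxx)
    return alpha, beta, se_beta


def slope_test(beta_hat, se_beta, beta_null=2.0):
    """t statistic and two-sided P-value (normal approximation) for H0: beta = beta_null."""
    t = (beta_hat - beta_null) / se_beta
    p = math.erfc(abs(t) / math.sqrt(2.0))
    return t, p


# Reported sub-rectangle estimates from the text: slope 2.23, standard error 0.05.
t_stat, p_value = slope_test(2.23, 0.05)
print(t_stat, p_value)
```

In use, `x` and `y` would hold the log sample means and log sample variances of the sub-rectangle/year groups, and the returned slope and standard error would be passed to `slope_test`.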