21.01.2014 Views

Xiao Liu PhD Thesis.pdf - Faculty of Information and Communication ...

Xiao Liu PhD Thesis.pdf - Faculty of Information and Communication ...

Xiao Liu PhD Thesis.pdf - Faculty of Information and Communication ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

4.2.2 Problem Analysis<br />

In cloud workflow systems, most workflow activities are data <strong>and</strong>/or computation<br />

intensive which require constant processing <strong>of</strong> specific resources such as network,<br />

I/O facility, <strong>and</strong> computing units for a long period <strong>of</strong> time. For workflow activities<br />

in data/computation intensive scientific applications, two problems <strong>of</strong> their duration<br />

series are limited sample size <strong>and</strong> frequent turning points. These two problems are<br />

very common in most scientific applications. Therefore, an effective forecasting<br />

strategy is required to tackle these two problems which are explained as follows.<br />

1) Limited sample size. Current work on the prediction <strong>of</strong> CPU load such as<br />

AR(16) dem<strong>and</strong>s a large sample size to fit the model <strong>and</strong> at least 16 prior points to<br />

make one prediction. However, this becomes a real issue for scientific workflow<br />

activity duration series. Unlike CPU load which reflects an instant state <strong>and</strong> the<br />

observations can be measured at any time with high sampling rates such as 1Hz<br />

[103], scientific workflow activity durations denote discrete long-term intervals<br />

since activity instances occur less frequently <strong>and</strong> take minutes or even hours to be<br />

completed. Meanwhile, due to the heavy workload brought by these activities <strong>and</strong><br />

the limits <strong>of</strong> available underlying resources (given requirements on the capability<br />

<strong>and</strong> performance), the number <strong>of</strong> concurrent activity instances is usually constrained<br />

in the systems. Therefore, the duration sample size is <strong>of</strong>ten limited <strong>and</strong> linear timeseries<br />

models cannot be fitted effectively.<br />

2) Frequent turning points. A serious problem which deteriorates the<br />

effectiveness <strong>of</strong> linear time-series models is the frequent occurrence <strong>of</strong> turning<br />

points where large deviations from the previous duration sequences take place [103].<br />

This problem resides not only in the prediction <strong>of</strong> CPU load, but also in the<br />

forecasting <strong>of</strong> activity durations as observed <strong>and</strong> discussed in many literature [32,<br />

103]. Actually, due to the large durations <strong>and</strong> the uneven distribution <strong>of</strong> activity<br />

instances, the states <strong>of</strong> the underlying system environment are usually <strong>of</strong> great<br />

differences even between neighbouring duration-series segments. For example, the<br />

average number <strong>of</strong> concurrent activity instances in the observation unit <strong>of</strong><br />

(10:00am~11:00am) may well be bigger than that <strong>of</strong> (9:00am~10:00am) due to the<br />

normal schedule <strong>of</strong> working hours. Therefore, activity instances are unevenly<br />

distributed <strong>and</strong> frequent turning points are hence <strong>of</strong>ten expected. This further<br />

45

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!