
Hashing Methods for Temporal Data - Computer Science - Boston ...


10 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. XXX, NO. XX, XXXXXXX 2001

MAXTIME is set to 50,000 for all workloads. The value used in the R-trees for now was 100,000. Each workload is described by the distribution used to generate the object lifespans, the number of different oids, the total number of changes in the evolution $n$ (object additions and deletions), the total number of object additions $N_B$, and the total number of queries.

4.3 Experiments

First, the behavior of all implementations was tested using a basic Uniform workload. The number of lifespans per object follows a uniform distribution between 20 and 40. The total number of distinct oids was $|U| = 8{,}000$, the number of real changes was $n = 466{,}854$, and there were $N_B = 237{,}606$ object additions. Hence, the average number of lifespans per oid was approximately 30 (we refer to this workload as Uniform-30). The number of queries was 115,878.
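As a quick check of the quoted average, using only the figures reported for this workload, and assuming that each object addition starts exactly one lifespan (so that the number of lifespans equals $N_B$, with $|U|$ the number of distinct oids):

```latex
\[
\frac{N_B}{|U|} = \frac{237{,}606}{8{,}000} \approx 29.7 \approx 30
\]
```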

Fig. 3a presents the average number of pages accessed per query by all methods. The PPLH methods have the best performance, about two pages per query. The ALH approach uses more query I/O (about 1.5 times in this example) because of the larger buckets it creates. The MVBT also uses more I/O (about 1.75 times) than the PPLH approaches since a tree path is traversed per query. The $R^i$ uses more I/Os per query than the MVBT (about 11.5 I/Os), mainly due to tree node overlapping and larger tree height (its height relates to the total number of oid lifespans, while the MVBT's height corresponds to the alive oids at the time specified by the query). The $R^p$ tree has the worst query performance (an average of 28.3 I/Os per query). The performance of the R-tree methods has been truncated in Fig. 3a to fit the graph. While using a separate dimension for the two endpoints of a lifespan interval allows for better clustering (see also the space usage in Fig. 3c), it makes it more difficult to check whether an interval contains a query time instant.
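To illustrate the last point, the following minimal sketch contrasts the containment test on an interval with the same test on a two-dimensional point (start, end). It is not the paper's implementation: the half-open convention `start <= t < end`, the function names, and the linear scan standing in for an index search are all assumptions made for illustration.

```python
from typing import List, Tuple

# A lifespan is (start, end); assumed half-open, i.e. alive for start <= t < end.
Lifespan = Tuple[int, int]


def contains(lifespan: Lifespan, t: int) -> bool:
    """Interval view: a single one-dimensional containment test."""
    start, end = lifespan
    return start <= t < end


def alive_as_points(points: List[Lifespan], t: int, max_time: int) -> List[Lifespan]:
    """Point view: each lifespan is a 2D point (x=start, y=end).

    The same query becomes a two-sided range search over the region
    0 <= x <= t and t < y <= max_time, which grows with the time domain.
    A linear scan stands in for the index search here.
    """
    return [(x, y) for (x, y) in points if 0 <= x <= t and t < y <= max_time]


lifespans = [(0, 10), (5, 7), (8, 20)]
query_time = 6
by_interval = [ls for ls in lifespans if contains(ls, query_time)]
by_point = alive_as_points(lifespans, query_time, max_time=50_000)
```

Both views return the same lifespans; the difference is that the point view turns a stabbing query into a range query over a large rectangle of the (start, end) plane.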

Fig. 3b shows the average number of I/Os per update. The best update performance was given by the PPLH-s method. In PPLH-l, the NT array implementation inside each page limits the actual page area assigned for storing oids and, thus, increases the number of pages used per bucket. The MVBT update takes longer than that of PPLH-s since the MVBT traverses a tree for each update (instead of quickly finding the location of the updated element through hashing). The update of $R^i$ follows; it is costlier than that of the MVBT since the size of the tree traversed is related to all oid lifespans (while the size of the MVBT structure traversed is related to the number of alive oids at the time of the update). The $R^p$ tree requires more update processing than the $R^i$ because of the overhead to store an interval as two points. The ALH had the worst update processing since all lifespans with the same oid are placed in the same bucket, creating large buckets that must be searched serially.

The space consumed by each method appears in Fig. 3c. The ALH approach uses the smallest space since it stores a single record per oid lifespan and uses "controlled" splits with high utilization. The PPLH-s method also has very good space utilization, very close to ALH. The R-tree methods follow; $R^p$ uses slightly less space than the $R^i$ because paginating intervals (putting them into bounding rectangles) is more demanding than with points. Note that, similarly to ALH, both R* methods use a single record per oid lifespan; the additional space is mainly because the average R-tree page utilization is about 65 percent. PPLH-l uses more space than PPLH-s because the NT array implementation reduces page utilization. The MVBT has the largest space requirements, about twice the space of the ALH and PPLH-s methods.

To consider the effect of lifespan distribution, all approaches were compared using five additional workloads (called the exponential, step, normal, Poisson, and uniform-consecutive). These workloads had the same number of distinct oids ($|U| = 8{,}000$), the same number of queries (115,878), and similar $n$ ($\approx 0.5$M) and average number of lifespans per oid ($\approx 30$). The Exponential workload generated the number of lifespans per oid using an exponential distribution with probability density function $f(x) = \beta \exp(-\beta x)$ and mean $1/\beta = 30$. The total number of changes was $n = 487{,}774$, the total number of object additions was $N_B = 245{,}562$, and the average number of lifespans per oid was 30.7. In the Step workload, the number of lifespans per oid follows a step function. The first 500 oids have four lifespans, the next 500 have eight lifespans, and so on, i.e., for every 500 oids, the number of lifespans advances by four. In this workload, we had $n = 540{,}425$, $N_B = 272{,}064$, and an average of 34 lifespans per oid. The Normal workload used a normal distribution with $\mu = 30$ and $\sigma^2 = 25$. Here, the parameters were $n = 470{,}485$, $N_B = 237{,}043$, and an average of 29.6 lifespans per oid.
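The step function described for the Step workload can be sketched as follows. This is only an illustration of the rule stated in the text (four more lifespans for every further group of 500 oids); the function and parameter names are hypothetical, and the sketch does not reproduce the authors' generator.

```python
from typing import List


def step_lifespan_counts(num_oids: int = 8000, group_size: int = 500, step: int = 4) -> List[int]:
    """Number of lifespans for each oid 0..num_oids-1 under the step rule.

    oids 0..499 get `step` lifespans, oids 500..999 get 2*step, and so on.
    """
    return [step * (oid // group_size + 1) for oid in range(num_oids)]


counts = step_lifespan_counts()
total_lifespans = sum(counts)           # idealized count of object additions
average_per_oid = total_lifespans / len(counts)
```

With 8,000 oids the rule yields 16 groups with 4, 8, ..., 64 lifespans, for an idealized total of 272,000 and an average of 34 per oid, which is close to the reported total of 272,064 object additions.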

For the Poisson workload, the first lifespan for every oid was generated randomly between time instants 1 and 500. The length of a lifespan was generated using a Poisson distribution with mean 1,100. Each next start time for a given oid was also generated by a Poisson distribution with mean value 500. For this workload, we had $n = 498{,}914$, $N_B = 251{,}404$, and an average of 31 lifespans per oid. The main characteristic of the Poisson workload is that the number of alive oids over time can vary from a very small number to a large proportion of $|U|$, i.e., there are time instants where the number of alive oids is some hundreds and other time instants where almost all distinct oids are alive.
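The variation in the number of alive oids can be measured with a simple sweep over lifespan endpoints. The sketch below is an assumption-laden illustration, not part of the paper: it treats lifespans as half-open intervals `[start, end)`, uses a hypothetical function name, and runs on a tiny hand-made input rather than the actual workload.

```python
from typing import Dict, List, Tuple


def alive_profile(lifespans: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return sorted (time, alive_count) pairs, one per start or end instant.

    Each lifespan (start, end) adds +1 at start and -1 at end, so the running
    sum at time t is the number of lifespans alive from t onward.
    """
    deltas: Dict[int, int] = {}
    for start, end in lifespans:
        deltas[start] = deltas.get(start, 0) + 1
        deltas[end] = deltas.get(end, 0) - 1

    profile = []
    alive = 0
    for t in sorted(deltas):
        alive += deltas[t]
        profile.append((t, alive))
    return profile


profile = alive_profile([(1, 5), (2, 6), (5, 9)])
peak_alive = max(count for _, count in profile)
```

Applied to a generated workload, the minimum and maximum of the alive counts show how far the number of alive oids swings over time.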

The special characteristic of the Uniform-consecutive workload is that it contains objects with multiple but consecutive lifespans. This scenario occurs when objects are updated frequently during their lifetime. Each update is seen as the deletion of the object followed by the insertion of the updated object at the same time. Since the object retains its oid through updates, this process creates consecutive lifespans for the same object (the end of one lifespan is the start of the next lifespan). This workload was based on the Uniform-30 workload and had $n = 468{,}715$, $N_B = 236{,}155$, and an average of 30 lifespans per oid. An object has a single lifetime which is cut into consecutive lifespans. The start_times of an object's lifespans are chosen uniformly.
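Cutting a single lifetime into consecutive lifespans can be sketched as below. This is a minimal illustration under stated assumptions: the function name, the integer time domain, and the choice of distinct cut points drawn uniformly without replacement are all hypothetical, since the text only says that the start times are chosen uniformly.

```python
import random
from typing import List, Tuple


def consecutive_lifespans(birth: int, death: int, num_lifespans: int,
                          rng: random.Random) -> List[Tuple[int, int]]:
    """Cut the lifetime [birth, death) into num_lifespans consecutive lifespans.

    Interior cut points are distinct integers drawn uniformly from
    (birth, death); each cut ends one lifespan and starts the next.
    Requires death - birth >= num_lifespans.
    """
    cuts = sorted(rng.sample(range(birth + 1, death), num_lifespans - 1))
    bounds = [birth] + cuts + [death]
    return list(zip(bounds[:-1], bounds[1:]))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
spans = consecutive_lifespans(birth=100, death=1000, num_lifespans=30, rng=rng)
```

Each returned pair is one lifespan, and the end of every lifespan equals the start of the next, matching the deletion-then-insertion view of an update described above.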

Fig. 4 presents the query, update, and space performance under the new workloads. The results resemble those of the Uniform-30 workload. For brevity, we have excluded the R-tree-based methods from the remaining discussion as they consistently had much worse query performance (the interested reader can find the detailed performance in [17]). As before, the PPLH-s approach has the best overall performance, using slightly more space than the "minimal" space of ALH. PPLH-l has the same query performance as PPLH-s, but has higher update cost and uses more space. Note that
