Hashing Methods for Temporal Data - Computer Science - Boston ...

10 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. XXX, NO. XX, XXXXXXX 2001 

MAXTIME is set to 50,000 for all workloads. The value used in the R-trees for now was 100,000. Each workload is described by the distribution used to generate the object lifespans, the number of different oids, the total number of changes in the evolution n (object additions and deletions), the total number of object additions NB, and the total number of queries.

4.3 Experiments 

First, the behavior of all implementations was tested using a basic Uniform workload. The number of lifespans per object follows a uniform distribution between 20 and 40. The total number of distinct oids was |U| = 8,000, the number of real changes was n = 466,854, and there were NB = 237,606 object additions. Hence, the average number of lifespans per oid was approximately 30 (we refer to this workload as Uniform-30). The number of queries was 115,878.
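
The relation between the reported figures can be checked with a few lines of arithmetic. This is only a sketch: it assumes that every object addition starts exactly one lifespan, so that the average number of lifespans per oid is NB divided by |U|, and that n counts additions plus deletions, as the workload description states. The variable names are illustrative.

```python
# Figures reported for the Uniform-30 workload.
num_oids = 8_000         # |U|, distinct oids
num_changes = 466_854    # n, object additions plus deletions
num_additions = 237_606  # NB, object additions

# Assumption: one addition opens one lifespan.
avg_lifespans_per_oid = num_additions / num_oids
num_deletions = num_changes - num_additions

print(round(avg_lifespans_per_oid, 1))  # 29.7
print(num_deletions)                    # 229248
```

Under these assumptions, the difference between additions and deletions (8,358) would correspond to lifespans that had not been closed by a deletion at the end of the evolution.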

Fig. 3a presents the average number of pages accessed per query by all methods. The PPLH methods have the best performance, about two pages per query. The ALH approach uses more query I/O (about 1.5 times in this example) because of the larger buckets it creates. The MVBT also uses more I/O (about 1.75 times) than the PPLH approaches since a tree path is traversed per query. The R_i uses more I/Os per query than the MVBT (about 11.5 I/Os), mainly due to tree node overlapping and larger tree height (its height relates to the total number of oid lifespans, while the MVBT's height corresponds to the alive oids at the time specified by the query). The R_p tree has the worst query performance (an average of 28.3 I/Os per query). The performance of the R-tree methods has been truncated in Fig. 3a to fit the graph. While using a separate dimension for each of the two endpoints of a lifespan interval allows for better clustering (see also the space usage in Fig. 3c), it makes it more difficult to check whether an interval contains a query time instant.
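
One way to read the last remark is sketched below. When a lifespan is stored as a point (start, end), the test "does the interval contain time t" turns into a two-sided range query over a large region of the (start, end) plane. The sketch assumes integer time instants and half-open lifespans [start, end); the function names are hypothetical and no R-tree is built here.

```python
NOW = 100_000  # value the text reports for "now" in the R-trees

def contains(lifespan, t):
    """Interval form: does [start, end) contain time instant t?"""
    start, end = lifespan
    return start <= t < end

def query_region(t, now=NOW):
    """Point form: ranges of start and end that a matching point must fall in."""
    return (0, t), (t + 1, now)

def in_region(point, region):
    """Two-dimensional range test on a (start, end) point."""
    (s_lo, s_hi), (e_lo, e_hi) = region
    start, end = point
    return s_lo <= start <= s_hi and e_lo <= end <= e_hi

lifespans = [(10, 40), (35, NOW), (50, 60), (0, 35)]
t = 35
by_interval = [contains(ls, t) for ls in lifespans]
by_point = [in_region(ls, query_region(t)) for ls in lifespans]
print(by_interval == by_point)  # True
```

Both forms give the same answer, but the point form searches a region that spans starts from 0 to t and ends from t + 1 to now, which may intersect many tree nodes.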

Fig. 3b shows the average number of I/Os per update. The best update performance was given by the PPLH-s method. In PPLH-l, the NT array implementation inside each page limits the actual page area assigned for storing oids and, thus, increases the number of pages used per bucket. The MVBT update is more expensive than that of PPLH-s since the MVBT traverses a tree for each update (instead of quickly finding the location of the updated element through hashing). The update of R_i follows; it is more expensive than that of the MVBT since the size of the tree traversed is related to all oid lifespans (while the size of the MVBT structure traversed is related to the number of alive oids at the time of the update). The R_p tree requires more update processing than the R_i because of the overhead of storing an interval as two points. The ALH had the worst update processing since all lifespans with the same oid are placed in the same bucket, creating large buckets that must be searched serially.
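
The effect described for ALH can be illustrated with a toy model. This is not the ALH implementation: the fixed bucket count, the modulo hash, and the helper names are assumptions made for illustration (a real linear-hashing scheme grows its buckets). The sketch only shows that hashing on oid alone puts every lifespan of an oid in one bucket, which then has to be scanned record by record.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical fixed bucket count

def build_buckets(records, num_buckets=NUM_BUCKETS):
    """Hash each (oid, start, end) record on oid only."""
    buckets = defaultdict(list)
    for oid, start, end in records:
        buckets[oid % num_buckets].append((oid, start, end))
    return buckets

def lookup(buckets, oid, t, num_buckets=NUM_BUCKETS):
    """Return (found, records_scanned) for 'was oid alive at time t?'."""
    scanned = 0
    for rec_oid, start, end in buckets[oid % num_buckets]:
        scanned += 1
        if rec_oid == oid and start <= t < end:
            return True, scanned
    return False, scanned

# 8 oids, each with 30 lifespans of length 10.
records = [(oid, 10 * k, 10 * k + 10) for oid in range(8) for k in range(30)]
buckets = build_buckets(records)
found, scanned = lookup(buckets, oid=5, t=295)
print(found, scanned)  # True 60
```

In this toy setting the lookup scans all 60 records of the bucket shared by oids 1 and 5 before reaching the matching lifespan.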

The space consumed by each method appears in Fig. 3c. The ALH approach uses the smallest space since it stores a single record per oid lifespan and uses "controlled" splits with high utilization. The PPLH-s method also has very good space utilization, very close to ALH. The R-tree methods follow; R_p uses slightly less space than the R_i because paginating intervals (putting them into bounding rectangles) is more demanding than paginating points. Note that, similarly to ALH, both R* methods use a single record per oid lifespan; the additional space is mainly because the average R-tree page utilization is about 65 percent. PPLH-l uses more space than PPLH-s because the NT array implementation reduces page utilization. The MVBT has the largest space requirements, about twice the space of the ALH and PPLH-s methods.

To consider the effect of lifespan distribution, all approaches were compared using five additional workloads (called the Exponential, Step, Normal, Poisson, and Uniform-consecutive). These workloads had the same number of distinct oids (|U| = 8,000), number of queries (115,878), and similar n (≈ 0.5M) and average number of lifespans per oid (≈ 30). The Exponential workload generated the n_k lifespans per oid using an exponential distribution with probability density function f(x) = β exp(−βx) and mean 1/β = 30. The total number of changes was n = 487,774, the total number of object additions was NB = 245,562, and the average number of lifespans per oid was 30.7. In the Step workload, the number of lifespans per oid follows a step function. The first 500 oids have four lifespans, the next 500 have eight lifespans, and so on, i.e., for every 500 oids, the number of lifespans advances by four. In this workload, we had n = 540,425, NB = 272,064, and an average of 34 lifespans per oid. The Normal workload used a normal distribution with μ = 30 and σ² = 25. Here, the parameters were n = 470,485, NB = 237,043, and an average of 29.6 lifespans per oid.
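
The three distributions above can be sketched as generators of per-oid lifespan counts. This is a minimal sketch, not the authors' workload generator: it assumes the distributions describe the number of lifespans per oid, and the rounding to integers, the floor of one lifespan, the seed, and the function names are all assumptions.

```python
import random

def step_counts(num_oids=8_000, group_size=500, step=4):
    """Step workload: every block of group_size oids gets step more lifespans."""
    return [step * (i // group_size + 1) for i in range(num_oids)]

def exponential_counts(num_oids, mean, rng):
    """Counts drawn from an exponential distribution with the given mean."""
    # Rounding and the floor of 1 are assumptions; the text gives only the pdf.
    return [max(1, round(rng.expovariate(1.0 / mean))) for _ in range(num_oids)]

def normal_counts(num_oids, mu, variance, rng):
    """Counts drawn from a normal distribution with mean mu and the given variance."""
    sigma = variance ** 0.5
    return [max(1, round(rng.gauss(mu, sigma))) for _ in range(num_oids)]

rng = random.Random(0)  # arbitrary seed, for repeatability only
step_wl = step_counts()
expo_wl = exponential_counts(8_000, 30, rng)
norm_wl = normal_counts(8_000, 30, 25, rng)

print(sum(step_wl), sum(step_wl) / len(step_wl))  # 272000 34.0
```

The step function alone yields 272,000 lifespans and an average of exactly 34, close to the reported NB = 272,064; the text does not explain the small difference.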

For the Poisson workload, the first lifespan for every oid was generated randomly between time instants 1 and 500. The length of a lifespan was generated using a Poisson distribution with mean 1,100. Each next start time for a given oid was also generated by a Poisson distribution with mean value 500. For this workload, we had n = 498,914, NB = 251,404, and an average of 31 lifespans per oid. The main characteristic of the Poisson workload is that the number of alive oids over time can vary from a very small number to a large proportion of |U|, i.e., there are time instants where the number of alive oids is some hundreds and other time instants where almost all distinct oids are alive.

The special characteristic of the Uniform-consecutive workload is that it contains objects with multiple but consecutive lifespans. This scenario occurs when objects are updated frequently during their lifetime. Each update is seen as the deletion of the object followed by the insertion of the updated object at the same time. Since the object retains its oid through updates, this process creates consecutive lifespans for the same object (the end of one lifespan is the start of the next lifespan). This workload was based on the Uniform-30 workload and had n = 468,715, NB = 236,155, and an average of 30 lifespans per oid. An object has a single lifetime, which is cut into consecutive lifespans. The start_times of an object's lifespans are chosen uniformly.
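
Cutting one lifetime into consecutive lifespans can be sketched as follows. The sketch assumes integer time instants and half-open lifespans, and it draws distinct interior cut points uniformly from the lifetime; the function name and the example lifetime are hypothetical, and the actual generator may differ.

```python
import random

def consecutive_lifespans(birth, death, k, rng):
    """Cut the lifetime [birth, death) into k back-to-back lifespans."""
    # k - 1 distinct interior start times, chosen uniformly.
    cuts = sorted(rng.sample(range(birth + 1, death), k - 1))
    bounds = [birth] + cuts + [death]
    return list(zip(bounds[:-1], bounds[1:]))

spans = consecutive_lifespans(100, 1_000, 30, random.Random(0))
print(len(spans), spans[0][0], spans[-1][1])  # 30 100 1000
```

Each lifespan ends exactly where the next one starts, which matches the update-as-delete-then-insert reading given above.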

Fig. 4 presents the query, update, and space performance under the new workloads. The results resemble those of the Uniform-30 workload. (For brevity, we have excluded the R-tree-based methods from the remaining discussion as they consistently had much worse query performance; the interested reader can find the detailed performance in [17]). As before, the PPLH-s approach has the best overall performance, using slightly more space than the "minimal" space of ALH. PPLH-l has the same query performance as PPLH-s, but requires more update processing and space. Note that
