Hashing Methods for Temporal Data - Computer Science - Boston ...


10 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. XXX, NO. XX, XXXXXXX 2001

MAXTIME is set to 50,000 for all workloads. The value used in the R-trees for now was 100,000. Each workload is described by the distribution used to generate the object lifespans, the number of different oids, the total number of changes in the evolution n (object additions and deletions), the total number of object additions NB, and the total number of queries.

4.3 Experiments

First, the behavior of all implementations was tested using a basic Uniform workload. The number of lifespans per object follows a uniform distribution between 20 and 40. The total number of distinct oids was |U| = 8,000, the number of real changes was n = 466,854, and there were NB = 237,606 object additions. Hence, the average number of lifespans per oid was NB/|U| ≈ 30 (we refer to this workload as Uniform-30). The number of queries was 115,878.

Fig. 3a presents the average number of pages accessed per query by all methods. The PPLH methods have the best performance, about two pages per query. The ALH approach uses more query I/O (about 1.5 times in this example) because of the larger buckets it creates. The MVBT also uses more I/O (about 1.75 times) than the PPLH approaches since a tree path is traversed per query. The Ri tree uses more I/Os per query than the MVBT (about 11.5 I/Os), mainly due to tree node overlapping and larger tree height (its height relates to the total number of oid lifespans, while the MVBT's height corresponds to the alive oids at the time specified by the query). The Rp tree has the worst query performance (an average of 28.3 I/Os per query). The performance of the R-tree methods has been truncated in Fig. 3a to fit the graph. While using a separate dimension for the two endpoints of a lifespan interval allows for better clustering (see also the space usage in Fig. 3c), it makes it more difficult to check whether an interval contains a query time instant.

Fig. 3b shows the average number of I/Os per update.
The best update performance was given by the PPLH-s method. In PPLH-l, the NT array implementation inside each page limits the actual page area assigned for storing oids and, thus, increases the number of pages used per bucket. The MVBT update is costlier than that of PPLH-s since the MVBT traverses a tree for each update (instead of quickly finding the location of the updated element through hashing). The update of Ri follows; it is costlier than the MVBT update since the size of the tree traversed is related to all oid lifespans (while the size of the MVBT structure traversed is related to the number of alive oids at the time of the update). The Rp tree requires more update processing than the Ri tree because of the overhead of storing an interval as two points. The ALH had the worst update processing since all lifespans with the same oid are placed in the same bucket, creating large buckets that must be searched serially.

The space consumed by each method appears in Fig. 3c. The ALH approach uses the smallest space since it stores a single record per oid lifespan and uses "controlled" splits with high utilization. The PPLH-s method also has very good space utilization, very close to ALH. The R-tree methods follow; Rp uses slightly less space than Ri because paginating intervals (putting them into bounding rectangles) is more demanding than paginating points. Note that, similarly to ALH, both R* methods use a single record per oid lifespan; the additional space is mainly because the average R-tree page utilization is about 65 percent. PPLH-l uses more space than PPLH-s because the NT array implementation reduces page utilization. The MVBT has the largest space requirements, about twice the space of the ALH and PPLH-s methods.

To consider the effect of lifespan distribution, all approaches were compared using five additional workloads (called the exponential, step, normal, Poisson, and uniform-consecutive).
These workloads had the same number of distinct oids (|U| = 8,000) and number of queries (115,878), and similar values for n (≈ 0.5M) and for the average number of lifespans per oid (≈ 30).

The Exponential workload generated the n_k lifespans per oid using an exponential distribution with probability density function f(x) = β exp(−βx) and mean 1/β = 30. The total number of changes was n = 487,774, the total number of object additions was NB = 245,562, and the average number of lifespans per oid was 30.7.

In the Step workload, the number of lifespans per oid follows a step function. The first 500 oids have four lifespans, the next 500 have eight lifespans, and so on, i.e., for every 500 oids, the number of lifespans advances by four. In this workload, we had n = 540,425, NB = 272,064, and an average of 34 lifespans per oid.

The Normal workload used a normal distribution with μ = 30 and σ² = 25. Here, the parameters were n = 470,485, NB = 237,043, and an average of 29.6 lifespans per oid.

For the Poisson workload, the first lifespan for every oid was generated randomly between time instants 1 and 500. The length of a lifespan was generated using a Poisson distribution with mean 1,100. Each next start time for a given oid was also generated by a Poisson distribution with mean value 500. For this workload, we had n = 498,914, NB = 251,404, and an average of 31 lifespans per oid. The main characteristic of the Poisson workload is that the number of alive oids over time can vary from a very small number to a large proportion of |U|, i.e., there are time instants where the number of alive oids is some hundreds and other time instants where almost all distinct oids are alive.

The special characteristic of the Uniform-consecutive workload is that it contains objects with multiple but consecutive lifespans. This scenario occurs when objects are updated frequently during their lifetime. Each update is seen as the deletion of the object followed by the insertion of the updated object at the same time.
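As a quick sanity check on the Step workload described above, the following minimal sketch computes the per-oid lifespan counts implied by the text (500 oids per step, four more lifespans per step, 8,000 oids in total, hence 16 steps). The function name and constants are illustrative and are not taken from the paper's generator. The sketch gives a total of 272,000 lifespans, close to but not identical to the reported NB = 272,064; it does not attempt to account for that difference.

```python
# Sketch of the Step workload's lifespan counts.
# Assumption: oids are numbered 0..7999 and grouped in consecutive blocks of 500.
NUM_OIDS = 8_000    # |U|, from the text
GROUP_SIZE = 500    # oids per step, from the text
STEP = 4            # extra lifespans per step, from the text


def step_lifespan_counts(num_oids=NUM_OIDS, group_size=GROUP_SIZE, step=STEP):
    """Return the number of lifespans assigned to each oid, in oid order."""
    # Group 0 gets `step` lifespans, group 1 gets 2 * step, and so on.
    return [step * (oid // group_size + 1) for oid in range(num_oids)]


counts = step_lifespan_counts()
total_lifespans = sum(counts)           # implied number of object additions
average = total_lifespans / NUM_OIDS    # implied lifespans per oid

print(total_lifespans, average)  # prints: 272000 34.0
```

The implied average of 34 lifespans per oid agrees with the value reported for this workload.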
Since the object retains its oid through updates, this process creates consecutive lifespans for the same object (the end of one lifespan is the start of the next lifespan). This workload was based on the Uniform-30 workload and had n = 468,715, NB = 236,155, and an average of 30 lifespans per oid. An object has a single lifetime which is cut into consecutive lifespans. The start_times of an object's lifespans are chosen uniformly.

Fig. 4 presents the query, update, and space performance under the new workloads. The results resemble those of the Uniform-30 workload. For brevity, we have excluded the R-tree-based methods from the remaining discussion (as they consistently had much worse query performance; the interested reader can find the detailed performance in [17]). As before, the PPLH-s approach has the best overall performance, using slightly more space than the "minimal" space of ALH. PPLH-l has the same query performance as PPLH-s, but requires more update processing and space. Note that
KOLLIOS AND TSOTRAS: HASHING METHODS FOR TEMPORAL DATA 11

Fig. 3. (a) Query, (b) update, and (c) space performance for all implementations on a uniform workload with 8K oids, n ≈ 0.5M, and about 30 lifespans per oid on average.
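The Uniform-consecutive workload in the text cuts a single object lifetime into back-to-back lifespans whose start times are chosen uniformly. A minimal sketch of that idea follows. It assumes integer time instants, half-open intervals [start, end), and distinct cut points; the function name `consecutive_lifespans` and its parameters are hypothetical, and this is not the authors' generator.

```python
import random


def consecutive_lifespans(birth, death, num_lifespans, rng):
    """Cut the lifetime [birth, death) into num_lifespans back-to-back intervals.

    Assumption: cut points are distinct integers drawn uniformly from the
    interior of the lifetime, so every lifespan has positive length.
    """
    # rng.sample draws without replacement, giving distinct cut points.
    cuts = sorted(rng.sample(range(birth + 1, death), num_lifespans - 1))
    bounds = [birth] + cuts + [death]
    # Pair each boundary with the next one: the end of one lifespan
    # is the start of the next.
    return list(zip(bounds[:-1], bounds[1:]))


rng = random.Random(0)  # fixed seed so the example is repeatable
spans = consecutive_lifespans(100, 1_000, 5, rng)
for start, end in spans:
    print(start, end)
```

Each update of the object then corresponds to one interior boundary: a deletion at that time instant immediately followed by an insertion under the same oid.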
