Hashing Methods for Temporal Data - Computer Science - Boston ...
10 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. XXX, NO. XX, XXXXXXX 2001

MAXTIME is set to 50,000 for all workloads. The value
used in the R-trees for now was 100,000.
Each workload is described by the distribution used to
generate the object lifespans, the number of different oids,
the total number of changes in the evolution n (object
additions and deletions), the total number of object
additions NB, and the total number of queries.

4.3 Experiments

First, the behavior of all implementations was tested using a
basic Uniform workload. The number of lifespans per object
follows a uniform distribution between 20 and 40. The total
number of distinct oids was |U| = 8,000, the number of real
changes n = 466,854, and NB = 237,606 object additions.
Hence, the average number of lifespans per oid was
NB/|U| ≈ 30 (we refer to this workload as Uniform-30). The
number of queries was 115,878.

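The relation between the per-object distribution and the average NB/|U| can be checked with a small simulation. The following is only a sketch under stated assumptions: it draws integer lifespan counts uniformly from 20 to 40 (inclusive) and does not reproduce the authors' workload generator. The names `lifespans_per_oid`, `NUM_OIDS`, and `REPORTED_NB` are hypothetical.

```python
import random

NUM_OIDS = 8_000        # |U|, taken from the text
REPORTED_NB = 237_606   # object additions reported for Uniform-30


def lifespans_per_oid(num_oids, low=20, high=40, seed=0):
    """Draw one lifespan count per oid, uniformly from [low, high] (assumed inclusive)."""
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(num_oids)]


counts = lifespans_per_oid(NUM_OIDS)
simulated_nb = sum(counts)  # every lifespan starts with one object addition

print("simulated NB/|U|:", simulated_nb / NUM_OIDS)
print("reported  NB/|U|:", REPORTED_NB / NUM_OIDS)
```

Both ratios should land close to 30, the midpoint of the range, which is where the name Uniform-30 comes from.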
Fig. 3a presents the average number of pages accessed
per query by all methods. The PPLH methods have the best
performance, about two pages per query. The ALH approach
uses more query I/O (about 1.5 times in this example)
because of the larger buckets it creates. The MVBT also uses
more I/O (about 1.75 times) than the PPLH approaches since
a tree path is traversed per query. The R_i uses more I/Os
per query than the MVBT (about 11.5 I/Os), mainly due to
tree node overlapping and larger tree height (its height
relates to the total number of oid lifespans, while the MVBT's
height corresponds to the alive oids at the time specified by
the query). The R_p tree has the worst query performance (an
average of 28.3 I/Os per query). The performance of the
R-tree methods has been truncated in Fig. 3a to fit the graph.
While using a separate dimension for each of the two endpoints
of a lifespan interval allows for better clustering (see also the
space usage in Fig. 3c), it makes it more difficult to check
whether an interval contains a query time instant.

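One way to see the difficulty with the two-endpoint representation is to write the containment test in both views. The sketch below assumes half-open lifespans [start, end) over integer time instants. It uses 100,000 (the value the text gives for now in the R-trees) as the upper bound of the end dimension. The function names are hypothetical and this is not the authors' implementation.

```python
END_UPPER = 100_000  # assumed upper bound on the end dimension (value of "now" in the text)


def contains(lifespan, t):
    """Interval view: does the half-open lifespan [start, end) contain instant t?"""
    start, end = lifespan
    return start <= t < end


def query_region(t, end_upper=END_UPPER):
    """Point view: rectangle of (start, end) points whose lifespan contains t."""
    # start may be anywhere in [0, t]; end must be strictly greater than t.
    return (0, t), (t + 1, end_upper)


def in_region(point, region):
    """Check whether a (start, end) point falls inside the query rectangle."""
    (s_lo, s_hi), (e_lo, e_hi) = region
    start, end = point
    return s_lo <= start <= s_hi and e_lo <= end <= e_hi


lifespans = [(10, 40), (40, 90), (5, 25), (30, 31)]
t = 30
by_interval = [ls for ls in lifespans if contains(ls, t)]
by_point = [ls for ls in lifespans if in_region(ls, query_region(t))]
print(by_interval, by_point)
```

Both views return the same lifespans, but in the point view a single time instant turns into a rectangle that extends to the boundary of the space, so a search may have to visit many tree nodes.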
Fig. 3b shows the average number of I/Os per update.
The best update performance was given by the PPLH-s
method. In PPLH-l, the NT array implementation inside
each page limits the actual page area assigned for storing
oids and, thus, increases the number of pages used per
bucket. An MVBT update costs more than a PPLH-s update since
the MVBT traverses a tree for each update (instead of quickly
finding the location of the updated element through
hashing). The update of R_i follows; it costs more than the
MVBT since the size of the tree traversed is related to all oid
lifespans (while the size of the MVBT structure traversed is
related to the number of alive oids at the time of the update).
The R_p tree requires more update processing than the R_i
because of the overhead of storing an interval as two points.
The ALH had the worst update processing since all
lifespans with the same oid are placed in the same bucket,
creating large buckets that must be searched serially.

The space consumed by each method appears in Fig. 3c.
The ALH approach uses the smallest space since it stores a
single record per oid lifespan and uses "controlled" splits
with high utilization. The PPLH-s method also has very
good space utilization, very close to ALH. The R-tree
methods follow; R_p uses slightly less space than the R_i
because paginating intervals (putting them into bounding
rectangles) is more demanding than with points. Note that,
similarly to ALH, both R* methods use a single record per
oid lifespan; the additional space is mainly because the
average R-tree page utilization is about 65 percent. PPLH-l
uses more space than PPLH-s because the NT array
implementation reduces page utilization. The MVBT has
the largest space requirements, about twice the space of
the ALH and PPLH-s methods.

To consider the effect of lifespan distribution, all
approaches were compared using five additional workloads
(called the Exponential, Step, Normal, Poisson, and
Uniform-consecutive). These workloads had the same number of
distinct oids (|U| = 8,000), number of queries (115,878), and
similar n (≈ 0.5M) and NB/|U| (≈ 30) parameters. The
Exponential workload generated the n_k lifespans per oid
using an exponential distribution with probability density
function f(x) = β exp(−βx) and mean 1/β = 30. The total
number of changes was n = 487,774, the total number of
object additions was NB = 245,562, and NB/|U| ≈ 30.7. In the
Step workload, the number of lifespans per oid follows a
step function. The first 500 oids have four lifespans, the next
500 have eight lifespans, and so on, i.e., for every 500 oids,
the number of lifespans advances by four. In this workload,
we had n = 540,425, NB = 272,064, and NB/|U| ≈ 34. The
Normal workload used a normal distribution with μ = 30
and σ² = 25. Here, the parameters were n = 470,485,
NB = 237,043, and NB/|U| ≈ 29.6.

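The step function described for the Step workload can be written out directly, which also serves as a sanity check on the reported average. This is a minimal sketch that assumes oids are numbered from 0 and grouped in blocks of 500. The names `step_lifespan_count`, `GROUP_SIZE`, and `STEP` are hypothetical.

```python
NUM_OIDS = 8_000   # |U|, taken from the text
GROUP_SIZE = 500   # oids per step
STEP = 4           # lifespans added at each step


def step_lifespan_count(oid_index):
    """Number of lifespans for the oid at 0-based position oid_index."""
    return STEP * (oid_index // GROUP_SIZE + 1)


counts = [step_lifespan_count(i) for i in range(NUM_OIDS)]
total = sum(counts)

print(total, total / NUM_OIDS)  # 272000 34.0
```

The 16 groups carry 4, 8, ..., 64 lifespans each, giving 272,000 lifespans in total and an average of 34 per oid. This agrees with the reported ratio and is close to the reported NB = 272,064; the section does not explain the small difference.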
For the Poisson workload, the first lifespan for every oid
was generated randomly between time instants 1 and 500.
The length of a lifespan was generated using a Poisson
distribution with mean 1,100. Each next start time for a
given oid was also generated by a Poisson distribution with
mean value 500. For this workload, we had n = 498,914,
NB = 251,404, and NB/|U| ≈ 31. The main characteristic of the
Poisson workload is that the number of alive oids over time
can vary from a very small number to a large proportion of
|U|, i.e., there are time instants where the number of alive
oids is some hundreds and other time instants where almost
all distinct oids are alive.

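The variation in the number of alive oids can be measured from a list of lifespans with a simple sweep over start and end events. The sketch below is illustrative only: it assumes half-open lifespans [start, end) and uses a handful of made-up lifespans, each belonging to a different oid, rather than data from the paper. The function names are hypothetical.

```python
def alive_count(lifespans, t):
    """Count lifespans [start, end) that contain time instant t."""
    return sum(1 for start, end in lifespans if start <= t < end)


def alive_profile(lifespans):
    """Return (time, alive) pairs giving the alive count after each event time."""
    events = {}
    for start, end in lifespans:
        events[start] = events.get(start, 0) + 1   # an oid becomes alive
        events[end] = events.get(end, 0) - 1       # an oid is deleted
    alive, profile = 0, []
    for time in sorted(events):
        alive += events[time]
        profile.append((time, alive))
    return profile


# Made-up lifespans, one per oid, for illustration only.
lifespans = [(1, 1100), (200, 1300), (450, 1500), (1600, 2700)]
profile = alive_profile(lifespans)
print(profile)
print(max(alive for _, alive in profile))
```

Applied to a generated workload, the minimum and maximum of such a profile show how far the alive population swings over time.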
The special characteristic of the Uniform-consecutive
workload is that it contains objects with multiple but
consecutive lifespans. This scenario occurs when objects are
updated frequently during their lifetime. Each update is
seen as the deletion of the object followed by the insertion of
the updated object at the same time. Since the object retains
its oid through updates, this process creates consecutive
lifespans for the same object (the end of one lifespan is the
start of the next lifespan). This workload was based on the
Uniform-30 workload and had n = 468,715, NB = 236,155,
and NB/|U| ≈ 30. An object has a single lifetime, which is cut
into consecutive lifespans. The start_times of an object's
lifespans are chosen uniformly.

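Cutting one lifetime into consecutive lifespans can be sketched as follows. The sketch assumes integer time instants, half-open lifespans, and distinct cut points drawn uniformly from the interior of the lifetime. The function `cut_lifetime` and its parameters are hypothetical and are not taken from the authors' generator.

```python
import random


def cut_lifetime(start, end, num_lifespans, seed=0):
    """Split [start, end) into num_lifespans consecutive lifespans."""
    rng = random.Random(seed)
    # Choose num_lifespans - 1 distinct interior cut points uniformly at random.
    cuts = sorted(rng.sample(range(start + 1, end), num_lifespans - 1))
    bounds = [start] + cuts + [end]
    # Each lifespan ends exactly where the next one starts.
    return list(zip(bounds[:-1], bounds[1:]))


pieces = cut_lifetime(100, 10_000, 30)
print(pieces[:3])
```

Each cut corresponds to an update of the object: the old version is deleted and the new version is inserted at the same time instant.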
Fig. 4 presents the query, update, and space performance
under the new workloads. The results resemble the
Uniform-30 workload. For brevity, we have excluded the
R-tree-based methods from the remaining discussion (as
they consistently had much worse query performance; the
interested reader can find the detailed performance in
[17]). As before, the PPLH-s approach has the best overall
performance, using slightly more space than the "minimal"
space of ALH. PPLH-l has the same query performance
as PPLH-s but uses more update processing and space. Note that