
JOURNAL OF COMPUTERS, VOL. 8, NO. 6, JUNE 2013

large datasets. UFP-Growth [30] also builds the UFP-Tree in the same manner as FP-Growth, and the UFP-Tree is as compact as the original FP-Tree. However, UFP-Mine generates candidates and identifies frequent itemsets by an additional scan of the dataset.

Our idea, therefore, is to build a tree that is as compact as the original FP-Tree while avoiding candidate generation. To achieve this goal, we propose a tree structure named AT-Tree (Array based Tail node Tree) and a new algorithm named AT-Mine. AT-Mine needs just two scans of the dataset. In the first scan, it finds the frequent items and arranges them in descending order of support number. In the second scan, it constructs an AT-Tree from the transaction itemsets in the manner of FP-Growth, while maintaining the probability information of each transaction in a tail node and an array. AT-Mine can then mine frequent itemsets directly from the AT-Tree without any additional scan of the dataset. The experimental results show that AT-Mine is more efficient than the algorithms MBP, UF-Growth and CUFP-Mine.

The contributions of this paper are summarized as follows:

(1) We propose a new tree structure named AT-Tree (Array based Tail node Tree) for maintaining important information related to an uncertain transaction dataset;

(2) We also give an algorithm named AT-Mine for FIM over uncertain transaction datasets based on AT-Tree;

(3) Both sparse and dense datasets are used in our experiments to compare the performance of the proposed algorithm against state-of-the-art algorithms based on the level-wise approach and the pattern-growth approach, respectively.

The rest of this paper is organized as follows: Section 2 describes the problem and gives definitions; Section 3 describes related work; Section 4 describes our algorithm AT-Mine; Section 5 shows the experimental results; and Section 6 gives the conclusion and discussion.

II. PROBLEM DEFINITIONS

Let D = {T1, T2, …, Tn} be an uncertain transaction dataset which contains n transaction itemsets and m distinct items, i.e. I = {i1, i2, …, im}. Each transaction itemset is represented as {i1:p1, i2:p2, …, iv:pv}, where {i1, i2, …, iv} is a subset of I, and pu (1 ≤ u ≤ v) is the existential probability of item iu in the transaction itemset. The size of dataset D is the number of transaction itemsets and is denoted |D|. An itemset X = {i1, i2, …, ik}, which contains k distinct items, is called a k-itemset, and k is the length of the itemset X.

We adopt definitions similar to those presented in previous works [1, 23, 28, 29, 30, 31, 32].

Definition 1: The support number (SN) of an itemset X in a transaction dataset is defined as the number of transaction itemsets containing X.

Definition 2: The probability of an item iu in transaction Td is denoted as p(iu, Td) and is defined by

p(iu, Td) = pu    (1)

For example, in Table 1, p({a},T1) = 0.8, p({b},T1) = 0.7, p({d},T1) = 0.9, p({f},T1) = 0.5.

Definition 3: The probability of an itemset X in a transaction Td is denoted as p(X, Td) and is defined by

p(X, Td) = ∏_{iu∈X, X⊆Td} p(iu, Td)    (2)

For example, in Table 1, p({a, b},T1) = 0.8×0.7 = 0.56, p({a, b},T4) = 0.9×0.85 = 0.765, p({a, b},T5) = 0.95×0.7 = 0.665.
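Definition 3 can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the helper name `itemset_prob` are our own, and the probabilities are the Table 1 values quoted above for T1, T4 and T5.

```python
from math import prod

# Table 1 values quoted in the text; each transaction maps an item to
# its existential probability in that transaction (layout is our own).
T1 = {"a": 0.8, "b": 0.7, "d": 0.9, "f": 0.5}
T4 = {"a": 0.9, "b": 0.85}
T5 = {"a": 0.95, "b": 0.7}

def itemset_prob(X, Td):
    """p(X, Td) of Eq. (2): product of item probabilities if X ⊆ Td, else 0."""
    if not set(X) <= Td.keys():
        return 0.0
    return prod(Td[i] for i in X)

print(itemset_prob({"a", "b"}, T1))  # ≈ 0.56
```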

Definition 4: The expected support number (expSN) of an itemset X in an uncertain transaction dataset is denoted as expSN(X) and is defined by

expSN(X) = ∑_{Td⊇X, Td∈D} p(X, Td)    (3)

For example, expSN({a, b}) = p({a, b},T1) + p({a, b},T4) + p({a, b},T5) = 0.56 + 0.765 + 0.665 = 1.99.
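Definition 4 then follows by summation over the transactions containing X. A minimal sketch, again assuming the Table 1 probabilities quoted in the text (only T1, T4 and T5 contain both a and b):

```python
from math import prod

# Transactions of Table 1 that contain {a, b}, as quoted in the text.
D = [
    {"a": 0.8, "b": 0.7, "d": 0.9, "f": 0.5},  # T1
    {"a": 0.9, "b": 0.85},                     # T4
    {"a": 0.95, "b": 0.7},                     # T5
]

def exp_sn(X, dataset):
    """expSN(X) of Eq. (3): sum of p(X, Td) over transactions containing X."""
    return sum(prod(Td[i] for i in X)
               for Td in dataset if set(X) <= Td.keys())

print(exp_sn({"a", "b"}, D))  # ≈ 1.99
```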

Definition 5: Given a dataset D, the minimum expected support threshold η is a predefined percentage of |D|; correspondingly, the minimum expected support number (minExpSN) is defined by

minExpSN = |D| × η    (4)

In the papers [23, 25, 26, 29, 30, 31], an itemset X is called a frequent itemset if its expected support number is not less than minExpSN. Mining frequent itemsets from an uncertain transaction dataset means discovering all itemsets whose expected support numbers are not less than minExpSN.

Definition 6: The minimum support threshold λ is a predefined percentage of |D|; correspondingly, the minimum support number (minSN) in a dataset D is defined by

minSN = |D| × λ    (5)
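Definitions 5 and 6, together with the frequent-itemset criterion above, amount to simple threshold arithmetic; a short sketch (the function names and the example numbers are ours):

```python
def min_exp_sn(dataset_size, eta):
    """minExpSN = |D| × η (Eq. 4)."""
    return dataset_size * eta

def min_sn(dataset_size, lam):
    """minSN = |D| × λ (Eq. 5)."""
    return dataset_size * lam

def is_frequent(exp_support, dataset_size, eta):
    """X is frequent iff expSN(X) is not less than minExpSN."""
    return exp_support >= min_exp_sn(dataset_size, eta)

# E.g. with |D| = 5 transactions and η = 30%, minExpSN = 1.5,
# so an itemset with expSN = 1.99 (such as {a, b} above) is frequent.
print(is_frequent(1.99, 5, 0.3))  # True
```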

III. RELATED WORK<br />

Most algorithms for FIM on uncertain datasets can be classified into two main categories: the level-wise approach and the pattern-growth approach. The main idea of the level-wise algorithms comes from Apriori [1], which is the first level-wise algorithm for FIM: iteratively generate candidate (k+1)-itemsets from combinations of frequent k-itemsets (k ≥ 1), and calculate the expected support numbers of the candidates by one scan of the dataset. Its main shortcoming is that it needs multiple scans of the dataset and generates candidate itemsets.
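The level-wise procedure just described can be sketched as follows. This is our own illustrative simplification (the join step unites any two frequent k-itemsets differing in one item, and the Apriori subset-pruning step is omitted), using expected support per Equation (3):

```python
from itertools import combinations
from math import prod

def expected_support(X, dataset):
    # expSN(X): sum of p(X, Td) over transactions containing X (Eq. 3).
    return sum(prod(Td[i] for i in X)
               for Td in dataset if set(X) <= Td.keys())

def level_wise(dataset, min_exp_sn):
    """Apriori-style mining: each pass joins frequent k-itemsets into
    candidate (k+1)-itemsets, then counts them in one scan."""
    items = {i for Td in dataset for i in Td}
    frequent = [frozenset([i]) for i in sorted(items)
                if expected_support([i], dataset) >= min_exp_sn]
    result = list(frequent)
    while frequent:
        # Join step: unite two frequent k-itemsets that differ in one item.
        candidates = {a | b for a, b in combinations(frequent, 2)
                      if len(a | b) == len(a) + 1}
        frequent = [c for c in candidates
                    if expected_support(c, dataset) >= min_exp_sn]
        result.extend(frequent)
    return result

# With the Table 1 transactions quoted earlier and minExpSN = 1.5:
D = [{"a": 0.8, "b": 0.7, "d": 0.9, "f": 0.5},
     {"a": 0.9, "b": 0.85},
     {"a": 0.95, "b": 0.7}]
print(sorted(map(sorted, level_wise(D, 1.5))))  # [['a'], ['a', 'b'], ['b']]
```

Note how every level requires a fresh pass over the dataset to count the candidates, which is exactly the multiple-scan cost criticized above.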

The main idea of the pattern-growth approach comes from the algorithm FP-Growth [2], which is the first pattern-growth algorithm. It is also an iterative approach, but it does not mine frequent itemsets by the combination method of Apriori. Instead, it finds all frequent items under the condition of a frequent k-itemset X, and generates frequent (k+1)-itemsets by the union of X with each of those frequent items (k ≥ 1). It maintains all transaction itemsets in an FP-Tree [2] with two scans of the dataset, and generates a conditional tree (also called a prefix FP-Tree or sub FP-Tree) for each frequent itemset X. It can thus find all frequent items under the condition of X by scanning this conditional tree instead of the whole dataset.
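To make the conditional-mining idea concrete without building the tree itself, here is a hedged sketch that replaces the FP-Tree with explicit projected datasets: for each frequent item it recurses on the transactions containing that item, which is the same "grow a frequent k-itemset by one item" scheme. For simplicity it uses plain support counts (Definition 1) rather than expected support, and all names are our own, not FP-Growth's.

```python
from collections import Counter

def pattern_growth(transactions, min_sn, suffix=frozenset()):
    """Return all frequent itemsets of a list of item sets."""
    counts = Counter(i for t in transactions for i in t)
    result = []
    for item in sorted(counts):
        if counts[item] >= min_sn:
            itemset = suffix | {item}
            result.append(itemset)
            # Conditional ("projected") dataset for `item`: keep only the
            # transactions containing it, restricted to later items so that
            # every itemset is generated exactly once.
            projected = [{j for j in t if j > item}
                         for t in transactions if item in t]
            result.extend(pattern_growth(projected, min_sn, itemset))
    return result

D = [{"a", "b", "d"}, {"a", "b"}, {"b", "c"}]
print(sorted(map(sorted, pattern_growth(D, 2))))  # [['a'], ['a', 'b'], ['b']]
```

An FP-Tree compresses these projected datasets by sharing prefixes, but the recursion structure is the same: each call only examines transactions relevant to the current conditional itemset, never the whole dataset.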

© 2013 ACADEMY PUBLISHER
