Incrementally Mining Frequent Itemsets in Update Distorted Databases

SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases ⋆

Jinlong Wang, Congfu Xu ⋆⋆ , Hongwei Dan, and Yunhe Pan 

Institute of Artificial Intelligence, Zhejiang University 

Hangzhou, 310027, China 

zjupaper@yahoo.com xucongfu@cs.zju.edu.cn 

danhow2008@hotmail.com panyh@sun.zju.edu.cn 

Abstract. The issue of maintaining privacy in frequent itemset mining has attracted considerable attention. In most of this work, only distorted data are available, which complicates the data mining process. In particular, in a dynamically updated distorted database environment, it is nontrivial to mine frequent itemsets incrementally because of the high counting overhead of recomputing support counts for itemsets. This paper investigates this problem and develops an efficient algorithm, SA-IFIM, for incrementally mining frequent itemsets in update distorted databases. In this algorithm, additional information is stored during the earlier mining process to support efficient incremental computation. In particular, by introducing the supporting aggregate and representing it with a bit vector, the transaction database is transformed into a machine-oriented model that allows fast support computation. The performance studies show the efficiency of our algorithm.

1 Introduction 

Recently, privacy has become one of the prime concerns in data mining. To avoid compromising privacy, most approaches apply distortion or randomization techniques to the original data set, and only the disguised data are shared for data mining [1–3].

Mining frequent itemset models from distorted databases with reconstruction methods incurs expensive overheads compared with directly mining the original data sets [2]. In [3, 4], basic formulas from set theory are used to eliminate these counting overheads. In many real applications, however, the database is dynamic: transactions are deleted and added over time. The changes to the data set may invalidate some existing frequent itemsets and introduce new ones, so incremental algorithms [5, 6] were proposed to address the problem. However, it is not efficient to use these incremental algorithms directly on an update distorted database, because of the high counting overhead of recomputing the support of itemsets. Although [7] has proposed an algorithm for incremental updating, its efficiency is still insufficient in practice.

⋆ Supported by the Natural Science Foundation of China (No. 60402010), Zhejiang Provincial Natural Science Foundation of China (Y105250) and the Science-Technology Program of Zhejiang Province of China (No. 2004C31098).

⋆⋆ Congfu Xu is the corresponding author.

This paper investigates the problem of incremental frequent itemset mining in update distorted databases. We first develop an efficient incremental updating computation method that quickly reconstructs an itemset’s support by using additional information stored during the earlier mining process. Then, a new concept, the supporting aggregate (SA), is introduced and represented with a bit vector. In this way, the transaction database is transformed into a machine-oriented model that allows fast support computation. Finally, an efficient algorithm, SA-IFIM (Supporting Aggregate based Incremental Frequent Itemset Mining in update distorted databases), is presented to describe the process. The performance studies show the efficiency of our algorithm.

The remainder of this paper is organized as follows. Section 2 presents the 

SA-IFIM algorithm step by step. The performance studies are reported in Section 

3. Finally, Section 4 concludes this paper. 

2 The SA-IFIM Algorithm 

In this section, the SA-IFIM algorithm is introduced step by step. Before mining, each data set is distorted using the EMASK method [3]. In the following, we first describe the preliminaries of incremental frequent itemset mining, then investigate the essence of the updating technique and use additional information recorded during the earlier mining, together with set theory, for quick updating computation. Next, we introduce the supporting aggregate and represent it with a bit vector to transform the database into a machine-oriented model that speeds up computation. Finally, the SA-IFIM algorithm is summarized.

2.1 Preliminaries 

In this subsection, some preliminaries about the concept of incremental frequent 

itemset mining are presented, summarizing the formal description in [5, 6]. 

Let $D$ be a set of transactions and $I = \{i_1, i_2, \ldots, i_m\}$ a set of distinct literals (items). For a dynamic database, old transactions $\Delta^-$ are deleted from the database $D$ and new transactions $\Delta^+$ are added. Naturally, $\Delta^- \subseteq D$. Denote the updated database by $D'$, so that $D' = (D - \Delta^-) \cup \Delta^+$, and the unchanged transactions by $D^- = D - \Delta^-$. Let $Fp$ denote the frequent itemsets in the original database $D$, and $Fp_k$ the frequent $k$-itemsets. The problem of incremental mining is to find the frequent itemsets (denoted by $Fp'$) in $D'$, given $\Delta^-$, $D^-$, $\Delta^+$, and the mining result $Fp$, with respect to the same user-specified minimum support $s$. Furthermore, the incremental approach needs to take advantage of previously obtained information to avoid rerunning the mining algorithm on the whole database when the database is updated. For clarity, we present $s$ as a relative support value, but $\delta_c^+$, $\delta_c^-$, $\sigma_c$, and $\sigma'_c$ as absolute support counts of an itemset $c$ in $\Delta^+$, $\Delta^-$, $D$, and $D'$, respectively. Let $\delta_c$ be the change in the support count of itemset $c$. Then $\delta_c = \delta_c^+ - \delta_c^-$ and $\sigma'_c = \sigma_c + \delta_c^+ - \delta_c^-$.
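The bookkeeping above can be checked on a toy example. The following is a minimal sketch on undistorted data; the transactions, the itemset, and the helper name `support_count` are hypothetical and chosen only for illustration. It verifies that the updated count obtained from the stored count and the two deltas agrees with a full recount over the updated database.

```python
def support_count(transactions, itemset):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


# Hypothetical toy data; transactions are distinct here, so removing the
# deleted ones by equality is safe (a real database would use identifiers).
D = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
delta_minus = [{"a", "c"}]          # deleted transactions, a subset of D
delta_plus = [{"a", "b"}, {"c"}]    # newly added transactions

D_unchanged = [t for t in D if t not in delta_minus]   # D^- = D - delta^-
D_prime = D_unchanged + delta_plus                     # D' = D^- plus delta^+

c = {"a", "b"}
sigma_c = support_count(D, c)                  # count in the original database
d_plus = support_count(delta_plus, c)          # count in the added part
d_minus = support_count(delta_minus, c)        # count in the deleted part
sigma_prime = sigma_c + d_plus - d_minus       # incremental update

# The incremental result matches a full recount over D'.
assert sigma_prime == support_count(D_prime, c)
```

The point of the identity is that only the two small update sets need to be counted when the stored count in the original database is available.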


2.2 Efficient incremental computation 

Generally, in a dynamically updated environment, the key questions in mining are how to handle the itemsets that are frequent in $D$ (recorded in $Fp$) and how to find the itemsets that are non-frequent in $D$ (not in $Fp$) but frequent in $D'$. In the following, for simplicity, we define $|\cdot|$ as the number of tuples in a transaction database.

1. For the frequent itemsets in $Fp$, determine which become non-frequent and which remain frequent in the updated database $D'$.

Lemma 1 If $c \in Fp$ (i.e., $\sigma_c \ge |D| \times s$) and $\delta_c \ge (|\Delta^+| - |\Delta^-|) \times s$, then $c \in Fp'$.

Proof. $\sigma'_c = \sigma_c + \delta_c^+ - \delta_c^- \ge |D| \times s + |\Delta^+| \times s - |\Delta^-| \times s = (|D| + |\Delta^+| - |\Delta^-|) \times s = |D'| \times s$. ⊓⊔

Property 1. When $c \in Fp$ and $\delta_c < (|\Delta^+| - |\Delta^-|) \times s$, then $c \in Fp'$ if and only if $\sigma'_c \ge |D'| \times s$.

2. For itemsets which are non-frequent in $D$, mine the frequent itemsets in the changed database $\Delta^+ - \Delta^-$ and recompute their support counts by scanning $D^-$.

Lemma 2 If $c \notin Fp$ and $\delta_c < (|\Delta^+| - |\Delta^-|) \times s$, then $c \notin Fp'$.

Proof. Refer to Lemma 1. ⊓⊔

Property 2. When $c \notin Fp$ and $\delta_c \ge (|\Delta^+| - |\Delta^-|) \times s$, then $c \in Fp'$ if and only if $\sigma'_c \ge |D'| \times s$.
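The four cases above amount to a small decision rule. The sketch below is one way to write it down; the function name `classify` and the returned labels are hypothetical, not part of the paper. It only decides which case applies and leaves the actual counting to the caller.

```python
def classify(in_fp, delta_c, size_plus, size_minus, s):
    """Decide how an itemset c must be handled after an update.

    in_fp      -- True if c was frequent in the original database D
    delta_c    -- change of the support count of c (delta_c^+ - delta_c^-)
    size_plus  -- number of added transactions
    size_minus -- number of deleted transactions
    s          -- relative minimum support
    """
    threshold = (size_plus - size_minus) * s
    if in_fp and delta_c >= threshold:
        return "frequent"              # Lemma 1: no further check needed
    if not in_fp and delta_c < threshold:
        return "infrequent"            # Lemma 2: c can be discarded
    if in_fp:
        return "check_stored_count"    # Property 1: sigma_c is already stored
    return "scan_unchanged"            # Property 2: count c in the unchanged part


# Example with 100 added and 50 deleted transactions and s = 5%,
# so the threshold on delta_c is 2.5.
print(classify(True, 5, 100, 50, 0.05))
print(classify(False, 5, 100, 50, 0.05))
```

Only the last case requires a pass over the unchanged transactions, which is why that case dominates the cost of incremental mining on distorted data.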

Under the symbol-specific distortion framework of [3], each ‘1’ and each ‘0’ in the original database is flipped with probability $(1-p)$ and $(1-q)$, respectively. In incremental frequent itemset mining, the goal is to mine frequent itemsets from the distorted databases using the information obtained during the earlier process. To test the condition of Property 2 for an itemset not in $Fp$, we need to reconstruct the itemset’s support in the unchanged database $D^-$ by scanning its distorted version $D^{-*}$. Not only the distorted support of the itemset itself, but also other counts related to it must be tracked. This makes the support count computation in Property 2 both difficult and critically important in incremental mining, and it is nontrivial to apply traditional incremental algorithms to it directly. To address the problem, an efficient incremental updating operation is first developed, based on computation with the supports in the distorted database; another method that improves the efficiency of support computation is then presented in Section 2.3.
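To make the reconstruction step concrete, here is a minimal sketch for the simplest case, a single item. It assumes only the distortion model stated above: a ‘1’ is kept with probability p and a ‘0’ is kept with probability q, so the expected number of 1s observed among n distorted transactions is p·c1 + (1−q)·(n−c1), where c1 is the true count. Solving for c1 gives the estimate below. The function name is hypothetical, and larger itemsets need the additional counts mentioned in the text, which this sketch does not cover.

```python
def reconstruct_item_support(observed_ones, n, p, q):
    """Estimate the true support count of a single item.

    observed_ones -- number of 1s seen for the item in the distorted data
    n             -- number of transactions
    p, q          -- probabilities of keeping a '1' and a '0', respectively
    """
    denom = p + q - 1
    if abs(denom) < 1e-12:
        raise ValueError("p + q = 1 makes the distorted counts uninformative")
    # observed = p*c1 + (1-q)*(n-c1)  =>  c1 = (observed - (1-q)*n) / (p+q-1)
    return (observed_ones - (1 - q) * n) / denom


# With p = 0.5 and q = 0.97, a true count of 200 out of 1000 transactions
# yields 0.5*200 + 0.03*800 = 124 observed 1s in expectation.
estimate = reconstruct_item_support(124, 1000, 0.5, 0.97)
print(round(estimate, 6))
```

The estimate is exact only in expectation; on real distorted data it carries sampling error that depends on n, p and q.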

In distorted databases, computing the supports of frequent itemsets is tedious. Motivated by [3], a similar support computation method is used in incremental mining. With this method, computing an itemset’s support requires the support counts of all its subsets in the distorted database. However, saving the support counts of all itemsets would be impractical: it would greatly increase the cost and degrade indexing efficiency. Thus, in incremental mining, when the frequent itemsets and their support counts are recorded, the corresponding counts in each distorted database are registered at the same time. In this way, for a $k$-itemset not in $Fp$, since all its subsets are frequent in the database, we can use the existing support counts in each distorted database to reconstruct its support in the updated database quickly. Thus, the efficiency is improved.

2.3 Supporting aggregate and database transformation 

In order to improve the efficiency, we introduce the concept of a supporting aggregate and use a bit vector to represent it. By virtue of the elementary supporting aggregates based on bit vectors, the database is transformed into a machine-oriented data model, which improves the efficiency of computing the supports of itemsets.

In the following, for a transaction database $D$, let $U$ denote a set of objects (the universe) that serve as unique identifiers of the transactions. For simplicity, we refer to the elements of $U$ simply as the transactions. For an itemset $A \subseteq I$, a transaction $u \in U$ is said to contain $A$ if $A \subseteq u$.

Definition 1. Supporting aggregate (SA). For an attribute itemset $A \subseteq I$, denote $S(A) = \{u \in U \mid A \subseteq u\}$ as its supporting aggregate, i.e., the set of transactions that include the attribute itemset $A$. Generally, $S(A) \subseteq U$. The supporting aggregate of a single attribute item is called an elementary supporting aggregate (ESA).

Using the ESAs, the original transaction database is vertically inverted and transformed into an attribute-transaction list. From the ESAs, the SA of an itemset can be obtained quickly by set intersection, and the support of the itemset can be computed efficiently. To further improve processing speed, each SA (ESA) is represented as a BV-SA (BV-ESA), a binary vector of $|U|$ dimensions ($|U|$ is the number of transactions in $U$). If an itemset’s SA contains the $i$th transaction, the $i$th dimension of its binary vector is set to 1; otherwise, it is set to 0. With this representation, the support count of each attribute item can be computed efficiently.
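The bit-vector representation described above can be sketched with Python integers used as arbitrary-length bit vectors. This is an illustrative sketch only: the function names `build_bv_esa`, `bv_sa` and `count_ones` and the toy transactions are hypothetical, and the data here are not distorted.

```python
def build_bv_esa(transactions):
    """Map each item to an integer whose i-th bit is 1 iff transaction i contains it."""
    vectors = {}
    for i, transaction in enumerate(transactions):
        for item in transaction:
            vectors[item] = vectors.get(item, 0) | (1 << i)
    return vectors


def bv_sa(bv_esa, itemset):
    """BV-SA of a non-empty itemset: bitwise AND of the BV-ESAs of its items."""
    items = list(itemset)
    if not items:
        raise ValueError("itemset must not be empty")
    result = bv_esa.get(items[0], 0)
    for item in items[1:]:
        result &= bv_esa.get(item, 0)   # set intersection as a bitwise AND
    return result


def count_ones(vector):
    """Support count = number of 1 bits in the vector."""
    return bin(vector).count("1")


transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
esa = build_bv_esa(transactions)
print(count_ones(bv_sa(esa, {"a", "b"})))
```

Representing set intersection as a bitwise AND is what makes the model machine-oriented: one word-level instruction handles many transactions at once.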

With the vertical database representation, where each row is the BV-ESA of one attribute, attribute items can be removed successively by the downward closure property [8], which efficiently reduces the size of the data set. On the other hand, the whole set of BV-ESAs sometimes cannot be loaded into memory at once because of memory constraints. Our approach addresses this scalability problem by horizontally partitioning the transaction data set into subsets, each composed of a portion of the objects (transactions), and then loading them partition by partition. With this method, the partitions are mutually disjoint, which makes the approach suitable for parallel and distributed processing. Furthermore, in practice, an optimized memory swap strategy can be adopted to reduce the I/O cost.
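The partition-by-partition processing can be sketched as follows. Because the partitions are disjoint, the support count of an itemset over the whole data set is the sum of its counts in the individual partitions, so only one partition's bit vectors need to be in memory at a time. The function name and the in-memory list standing in for disk-resident data are assumptions made for illustration.

```python
def support_by_partition(transactions, itemset, partition_size):
    """Sum the support of `itemset` over disjoint horizontal partitions."""
    total = 0
    for start in range(0, len(transactions), partition_size):
        part = transactions[start:start + partition_size]

        # Build the bit vectors for this partition only.
        vectors = {}
        for i, transaction in enumerate(part):
            for item in transaction:
                vectors[item] = vectors.get(item, 0) | (1 << i)

        # Start from "all transactions of the partition" and intersect.
        sa = (1 << len(part)) - 1
        for item in itemset:
            sa &= vectors.get(item, 0)

        total += bin(sa).count("1")
    return total


transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
# The result does not depend on the partition size.
print(support_by_partition(transactions, {"a", "b"}, 3))
print(support_by_partition(transactions, {"a", "b"}, 1))
```

Since each partition is processed independently, the loop body could also be distributed across workers and the partial counts added afterwards.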


2.4 The process of SA-IFIM algorithm 

In this subsection, the algorithm SA-IFIM is summarized as Algorithm 1. When the distorted data sets $D^{-*}$, $\Delta^{-*}$ and $\Delta^{+*}$ are first scanned, they are transformed partition by partition into the corresponding vertical bit vector representations $BV(D^{-*})$, $BV(\Delta^{-*})$ and $BV(\Delta^{+*})$, which are saved to hard disk. From these representations, the frequent $k$-itemsets $Fp_k$ can be obtained level by level. Following the candidate generation-and-test approach, the candidate frequent $k$-itemsets ($C_k$) are generated from the frequent $(k-1)$-itemsets ($Fp_{k-1}$).
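The candidate generation step can be illustrated with an Apriori-style join and prune in the spirit of [8]. The text does not spell out how the candidates are generated, so the sketch below is an assumption about that step rather than the authors' exact procedure; the function name `generate_candidates` is hypothetical.

```python
from itertools import combinations


def generate_candidates(fp_prev):
    """Build candidate k-itemsets from the frequent (k-1)-itemsets.

    Two (k-1)-itemsets are joined when their union has exactly k items, and
    a candidate is kept only if all of its (k-1)-subsets are frequent
    (downward closure).
    """
    prev = {frozenset(itemset) for itemset in fp_prev}
    if not prev:
        return set()
    k = len(next(iter(prev))) + 1
    candidates = set()
    for a, b in combinations(prev, 2):
        union = a | b
        if len(union) != k:
            continue
        if all(frozenset(sub) in prev for sub in combinations(union, k - 1)):
            candidates.add(union)
    return candidates


fp2 = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"b", "d"}]
# Only {a, b, c} survives: {a, b, d} lacks {a, d} and {b, c, d} lacks {c, d}.
print(generate_candidates(fp2))
```

Each surviving candidate would then be tested by computing its support from the bit vector representations.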

Algorithm 1: Algorithm SA-IFIM

Input: $D^{-*}$, $\Delta^{+*}$, $\Delta^{-*}$, $Fp$ (frequent itemsets and their support counts in $D$), $Fp^*$ (the frequent itemsets of $Fp$ and their corresponding support counts in $D^*$), minimum support $s$, and distortion parameters $p$, $q$ as in EMASK [3].

Output: $Fp'$ (frequent itemsets and their support counts in $D'$)

Method: As shown in Fig. 1. In the algorithm, we use temporary files to store the support counts in the distorted database for efficiency.

Fig. 1. SA-IFIM algorithm diagram.


3 Performance Evaluation 

We performed comprehensive experiments to compare SA-IFIM with EMASK, whose implementation was provided by its authors [9]. For a better performance evaluation, we also implemented the algorithm IFIM (similar to IPPFIM [7]). All programs were coded in C++ using Cygwin with gcc 2.9.5. The experiments were run on a 3 GHz P4 processor with 1 GB of memory. SA-IFIM and IFIM yield the same itemsets as EMASK given the same data set and the same minimum support parameters.

Our experiments were performed on synthetic data sets produced by the IBM synthetic market-basket data generator [8]. In the following, we use the notation D (number of transactions), T (average size of the transactions), I (average size of the maximal potentially large itemsets), and N (number of items), and set N = 1000. In our method, the sizes $|\Delta^+|$ and $|\Delta^-|$ are not required to be the same. Without loss of generality, we let $|d| = |\Delta^+| = |\Delta^-|$ for simplicity. For the sake of clarity, TxIyDmdn denotes an original database together with an update database, where the parameters T = x and I = y are the same for both, the original transaction database has |D| = m transactions, and the update transaction database has |d| = n transactions.

In the following, we use the distorted benchmark data sets as the input databases of the algorithms. The distortion parameters are the same as in EMASK [3], with p = 0.5 and q = 0.97. In the experiments, for a fair comparison of the algorithms and to reflect scalability requirements, SA-IFIM is run with only 5K transactions loaded into main memory at a time.

3.1 Different support analysis 

In Fig. 2, the relative performance of SA-IFIM, IFIM and EMASK is compared on two different data sets, T25I4D100Kd10K (sparse) and T40I10D100Kd10K (dense), with respect to various minimum supports. As shown in Fig. 2, SA-IFIM leads to a prominent performance improvement. Specifically, on the sparse data set (T25I4D100Kd10K), IFIM is close to EMASK, and SA-IFIM is orders of magnitude faster than both; on the dense data set (T40I10D100Kd10K), IFIM is faster than EMASK, but SA-IFIM also outperforms IFIM, and the margin grows as the minimum support decreases.

3.2 Effect of the update size 

Experiments were run on two data sets, T25I4D100Kdm and T40I10D100Kdm, and the results are shown in Fig. 3. As expected, when the same number of transactions are deleted and added, the time for rerunning EMASK remains constant, whereas that of IFIM increases sharply and quickly exceeds that of EMASK. In Fig. 3, the execution time of SA-IFIM is much less than that of EMASK, and SA-IFIM still significantly outperforms EMASK even when the update size is large.


Fig. 2. Extensive analysis for different support: (a) T25I4D100Kd10K; (b) T40I10D100Kd10K.

Fig. 3. Different updating tuples analysis: (a) T25I4D100Kdm (s=0.6%); (b) T40I10D100Kdm (s=1.25%).

3.3 Scale up performance 

Finally, to assess the scalability of SA-IFIM, two experiments, T25I4Dmd(m/10) at s = 0.6% and T40I10Dmd(m/10) at s = 1.25%, were conducted to examine the scale-up performance as the size of the mined data set grows. The scale-up results for the two data sets are given in Fig. 4, which shows the impact of |D| and |d| on the algorithms SA-IFIM and EMASK. In the experiments, the size of the update database is 10% of that of the original database, and the size m of the transaction database was increased from 100K to 1000K. As shown in Fig. 4, EMASK is very sensitive to the number of updated tuples whereas SA-IFIM is not, and the execution time of SA-IFIM increases linearly with the database size. This shows that the algorithm scales well and can be applied to very large databases.


Fig. 4. Scale up performance analysis: (a) T25I4Dmd(m/10) (s=0.6%); (b) T40I10Dmd(m/10) (s=1.25%).

4 Conclusions 

In this paper, we explore the issue of frequent itemset mining in a dynamically updated distorted database environment. We first develop an efficient incremental updating computation method that quickly reconstructs an itemset’s support. By introducing the supporting aggregate represented with a bit vector, the databases are transformed into representations that are more accessible to and more easily processed by a computer, so that support counts can be computed efficiently. The experiments show that SA-IFIM significantly outperforms EMASK rerun on the whole updated database, and that it also has an advantage over incremental algorithms based only on EMASK.

References 

1. Agrawal, R., and Srikant, R.: Privacy-preserving data mining. In: Proceedings of SIGMOD. (2000) 439-450

2. Rizvi, S., and Haritsa, J.: Maintaining data privacy in association rule mining. In: Proceedings of VLDB. (2002) 682-693

3. Agrawal, S., Krishnan, V., and Haritsa, J.: On addressing efficiency concerns in privacy-preserving mining. In: Proceedings of DASFAA. (2004) 113-124

4. Xu, C., Wang, J., Dan, H., and Pan, Y.: An improved EMASK algorithm for privacy-preserving frequent pattern mining. In: Proceedings of CIS. (2005) 752-757

5. Cheung, D., Han, J., Ng, V., and Wong, C.: Maintenance of discovered association rules in large databases: An incremental updating technique. In: Proceedings of ICDE. (1996) 104-114

6. Cheung, D., Lee, S., and Kao, B.: A general incremental technique for updating discovered association rules. In: Proceedings of DASFAA. (1997) 106-114

7. Wang, J., Xu, C., and Pan, Y.: An Incremental Algorithm for Mining Privacy-Preserving Frequent Itemsets. In: Proceedings of ICMLC. (2006)

8. Agrawal, R., and Srikant, R.: Fast algorithms for mining association rules. In: Proceedings of VLDB. (1994) 487-499

9. http://dsl.serc.iisc.ernet.in/projects/software/software.html.
