
Comparative Study of Techniques to Discover Frequent ... - IRD India



Comparative Study of Techniques to Discover Frequent Patterns of Web Usage Mining

Mona S. Kamat 1, J. W. Bakal 2 & Madhu Nashipudi 3
1,3 Information Technology Department, Pillai Institute Of Information Technology (PIIT), Panvel, Navi Mumbai, India
*2 Shivajirao S. Jondhale College Of Engineering (SSJCOE), Dombivali, Thane, India
Email : 1 monakamat@gmail.com, 2 bakaljw@gmail.com & 3 madhu.nashipudi@yahoo.in

Abstract – This paper outlines the Web Usage Mining process and the stages involved in it, followed by a detailed study of three algorithms to discover frequent patterns, namely FAP (Frequent Access Pattern) mining, GSP (Generalized Sequential Pattern) and DFS (Depth First Search). DFS and FAP mining perform better than GSP, since GSP requires repeated session scans to find the patterns. DFS and FAP mining are more effective because they depend on the construction of a pattern lattice and a FAP tree respectively. A few applications of Web usage mining are outlined in the last section. This paper gives an overview of the Web Usage Mining process but mainly focuses on the comparative study of the three algorithms GSP, FAP mining and DFS and their performance.

Keywords: DFS (Depth First Search), FAP (Frequent Access Pattern), Frequent Patterns, GSP (Generalized Sequential Pattern)

I. INTRODUCTION

Web Usage Mining is the process in which user access patterns are discovered and analyzed by mining the log files and related data associated with a certain website. It is a kind of web mining which automatically discovers user usage patterns and is helpful in studying and analyzing user interests. Web usage mining consists mainly of three stages, namely data preprocessing, pattern discovery and pattern analysis, as shown in Fig. 1.

Fig.1. Stages in Web Usage Mining

Data for web mining is collected from multiple sources. The data comes in different formats following different conventions, with many duplicates and inconsistencies, and is sometimes incomplete. Hence data preprocessing is vital and is also the most complex of all the stages. It reduces the data size radically, thus improving the efficiency of mining. Pattern discovery applies various techniques to the preprocessed data to discover frequent patterns, such as statistical analysis, clustering, association rule mining, sequential pattern mining, classification and so on. In the next stage, pattern analysis, all the patterns discovered in the previous phase are analysed to retain only the interesting patterns and rules and to sieve out the useless ones.

This paper gives a brief overview of the web usage mining process and explains a few mining algorithms to find frequent patterns, which constitute the basic information source for intelligent web-based systems. It also makes a comparative study of their performance.

II. LITERATURE SURVEY

Xidong Wang et al. propose a method that can discover users' frequent access patterns underlying users' web browsing behaviour. They introduce the concept of an access pattern according to a user's access path, and based on it put forward a revised algorithm (FAP-Mining), derived from the FP-tree algorithm, to mine frequent access patterns. The new algorithm first constructs a frequent access pattern tree and then mines users' frequent access patterns on the tree. The algorithm is accurate and scalable for mining frequent access patterns with different lengths [7].

Murat Ali Bayir et al. have explained web mining and its categories.
Further, after briefly explaining data preprocessing, they have presented a few algorithms for searching frequent patterns and a few algorithms based on the construction of a pattern lattice. They have also presented the results of the experiments they carried out to observe and analyse the performance of the algorithms they have listed [5].

International Journal on Advanced Computer Theory and Engineering (IJACTE), ISSN (Print) : 2319 – 2526, Volume-2, Issue-3, 2013

Theint Theint Aye explains the importance of data preprocessing to improve the ease and efficiency of the mining process. The work mainly focuses on the data preprocessing stages and explains field extraction and data cleaning algorithms, which respectively separate the fields from a single line of the log file and eliminate inconsistent or unnecessary items in the analysed data [1].

Robert Cooley et al. have presented a survey of the research in the area of Web usage mining and have briefly discussed the pattern discovery techniques.

III. STAGES IN WEB USAGE MINING

Briefly, the stages are as follows:
i. Obtain data from various sources
ii. Data preprocessing
iii. Pattern Discovery
iv. Pattern Analysis

Fig.2. General Architecture of Web Usage Mining[14]

A. Data Preprocessing

The data should be preprocessed to improve the efficiency and ease of the mining process. The main task of data preprocessing is to prune noisy and irrelevant data, and to reduce the data volume for the pattern discovery phase. Field extraction and data cleaning algorithms parse the web log records, separating the fields and purging inconsistent or unnecessary items.

B. Pattern Discovery

A few techniques to discover patterns from preprocessed data are: converting IP addresses to domain names, filtering, dynamic site analysis, cookies, path analysis, association rules, sequential patterns, clustering, decision trees etc.

C. Pattern Analysis

The end products of analysis include statistics such as the frequency of visits per document, the most recent visit per document, who is visiting which documents, the frequency of use of each hyperlink, and the most recent use of each hyperlink. The common techniques used for pattern analysis are visualization techniques, OLAP techniques, data and knowledge querying, and usability analysis.

IV. TECHNIQUES TO DISCOVER FREQUENT PATTERNS

A. Frequent Access Pattern Mining (FAP Mining)

The FAP algorithm is rooted in the FP-growth algorithm, and it constructs a FAP tree similar to the FP tree. The FAP algorithm is divided into two steps. In the first step, it constructs the frequent access pattern tree (FAP tree) from the access paths obtained from the user sessions and records the access count of each page. In the second step, the FAP-growth function is used to mine both long and short access patterns on the FAP tree.

FP-growth shows good functionality when used in mining association rules and sequential patterns. However, association rule mining imposes no order among the elements of an itemset, whereas access pattern mining requires sequential page access. FP-growth has therefore been revised by X. Wang et al. in [7].

Table 1. Episode of user access path in users session files

Fig.3. Flowchart for algorithm to construct frequent access pattern
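The first step described above, building a tree from access paths while recording access counts per page, can be illustrated with a small prefix tree. This is only a sketch under assumptions: the names `Node`, `build_access_tree`, `header` and `page_counts` are hypothetical, the sample paths are made up, and the actual FAP-tree construction in [7] may order or merge nodes differently.

```python
from collections import defaultdict


class Node:
    """One page on an access path, with the number of paths passing through it."""

    def __init__(self, page, parent=None):
        self.page = page
        self.parent = parent
        self.count = 0
        self.children = {}


def build_access_tree(paths):
    """Insert every access path from the root so that shared prefixes share nodes."""
    root = Node(None)
    header = defaultdict(list)      # page -> tree nodes holding that page (assumed header table)
    page_counts = defaultdict(int)  # page -> total number of accesses over all paths
    for path in paths:
        node = root
        for page in path:
            child = node.children.get(page)
            if child is None:
                child = Node(page, node)
                node.children[page] = child
                header[page].append(child)
            child.count += 1
            page_counts[page] += 1
            node = child
    return root, header, page_counts


# Hypothetical access paths, not the ones from Table 1.
paths = [["A", "B", "C"], ["A", "B", "D"], ["B", "C"]]
root, header, page_counts = build_access_tree(paths)
```

Here the two paths starting with A and B share the nodes for that prefix, while the header table lets a later mining step jump to every node of a given page without walking the whole tree.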


The Construction of FAP-Tree

Algorithm: FAP_Tree(tree, p). Construct frequent access pattern tree [b]
Input: The set of user access paths p.
Output: The set of user access patterns.

The flowchart in Fig. 3 shows the FAP tree construction algorithm. To facilitate frequent access pattern generation and FAP tree traversal, a page header table is built and associated with the tree in such a way that each page points to its position in the tree, and the table entries are sorted in ascending order of page count. Table 1 shows an episode of the access path of a certain user contained in the user session file. Accordingly, the FAP-Tree function constructs the frequent access tree shown in Fig. 4.

Fig.4. FAP tree[7]

FAP Growth

Fig.5. Flowchart of FAP-growth algorithm

Algorithm: FAP-growth(tree, α), mine frequent access patterns
Input: FAP tree, min_support
Output: α, the set of all the access patterns

Table 2. Mining result of the FAP Tree

B. Generalised Sequential Pattern (GSP)

R. Srikant and R. Agrawal proposed the basic structure of the GSP algorithm for finding sequential patterns. The algorithm makes multiple passes over the session set. At the end of the first pass, the algorithm yields the frequent items, i.e. the 1-element frequent itemsets. Each following pass begins with the frequent itemsets yielded in the previous pass and uses them to generate potential itemsets called candidates. The candidates are then pruned based on the minimum support, which gives the set of frequent n-element itemsets in pass n. The algorithm is iterative and ends when there are no more frequent sets. The three main steps carried out in each pass are shown in Fig. 6.

Fig.6. Steps in each pass of GSP algorithm (Candidate Generation, Counting Candidates, Prune based on support)

Candidate Generation: Candidate sequences are generated by joining C_{k-1} with C_{k-1}. Sequence s1 joins with s2 if dropping the first item of s1 and the last item of s2 yields the same subsequence [12]. This creates the candidates for the current pass.

Counting Candidates: In each pass, one log-sequence is read at a time and the access count of every candidate contained in that log-sequence is incremented.

Pruning: Only those candidates whose access count is equal to or higher than the minimum support are retained and passed on as the frequent itemsets of the iteration.
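The three steps of a GSP pass described above can be sketched as follows. This is a simplified illustration, not the full algorithm of [12]: it assumes every element of a sequence is a single page, treats a candidate as contained in a session when its pages occur in the same order (not necessarily adjacent), and uses made-up sessions. The function names `contains`, `generate_candidates` and `gsp` are hypothetical.

```python
def contains(session, candidate):
    """True if the candidate's pages occur in the session in the same order."""
    remaining = iter(session)
    return all(page in remaining for page in candidate)


def generate_candidates(frequent_prev):
    """Join s1 with s2 when s1 without its first item equals s2 without its last item."""
    return {s1 + s2[-1:]
            for s1 in frequent_prev
            for s2 in frequent_prev
            if s1[1:] == s2[:-1]}


def gsp(sessions, min_sup):
    """Repeat count / prune / generate until a pass yields no frequent sequence."""
    candidates = {(page,) for session in sessions for page in session}
    frequent = {}
    while candidates:
        # Counting: one scan of the session set per pass.
        counts = {c: sum(contains(s, c) for s in sessions) for c in candidates}
        # Pruning: keep candidates that reach the minimum support count.
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
        # Candidate generation for the next pass.
        candidates = generate_candidates(set(level))
    return frequent


# Hypothetical sessions, not the ones behind Table 3.
sessions = [("A", "B", "C"), ("A", "B", "C"), ("A", "C"), ("B",)]
result = gsp(sessions, min_sup=2)
```

Note that each iteration of the `while` loop rescans all sessions, which is the repeated session scan that the comparison with FAP mining and DFS refers to.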


GSP Algorithm

Fig.7. Flowchart for GSP algorithm

Input: S (sessions), F1 (frequent 1-sequences), min_sup (the minimal access count that satisfies the support threshold).
Output: F, the set of all the access patterns.

Table 3. Mining result for GSP algorithm
1-Frequent access patterns generated: {A}:4, {B}:5, {C}:6, {D}:3, {E}:3, {G}:2, {I}:4
2-Frequent access patterns generated: {AB}:3, {BC}:3, {BE}:2, {CD}:3
3-Frequent access patterns generated: {ABC}:2, {ABE}:2, {BCD}:2
4-Frequent access patterns generated: {ABCD}:2

C. Depth First Search (DFS)

In this algorithm, the patterns are categorized according to their length and arranged on a lattice model. The patterns form a lattice based on pattern length and pattern frequency, and using this lattice, frequent patterns are searched depth first.

Lattice Construction: The basic element of the lattice is an atom, i.e. a single page. Each atom or page stands for a length-1 prefix equivalence class. Beginning from the bottom elements, the frequency of an upper element of length n can be calculated by using two patterns of length n-1 belonging to the same class [5]. Example: Consider three atoms corresponding to three web pages P1, P2 and P3. A sample lattice containing all length-2 patterns and some length-3 patterns is shown in Fig. 8.

Fig.8. Pattern Lattice[5]

Session id-timestamp list: After data preprocessing, the series of web pages visited in each session is obtained. The session id-timestamp list is a list which keeps the session id and timestamp information for a pattern across all sessions. For patterns with length > 1, the timestamp information keeps the timestamp value of the last atom. Example: 4 pages and 3 sessions are given below.

S1 = Page1 → Page2 → Page4 → Page1 → Page3
S2 = Page4 → Page3 → Page1 → Page2
S3 = Page3 → Page4 → Page1

Table 4. Session id-timestamp list

Table 5. Session id-timestamp list for Page3Page1

The support of the pattern Page3Page1 is 2/3, since it occurs in two of the three sessions. Patterns are then pruned based on the minimum support. Fig. 9 shows the flowchart depicting the working of the DFS algorithm.
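The session id-timestamp lists for the three example sessions above, and the 2/3 support of Page3Page1, can be reproduced with a short sketch. It assumes that the position of a page within its session stands in for the timestamp, and that a length-2 pattern is obtained by joining the lists of its two atoms; the function names `atom_id_lists`, `join` and `support` are hypothetical and not taken from [5].

```python
from collections import defaultdict

sessions = {
    1: ["Page1", "Page2", "Page4", "Page1", "Page3"],
    2: ["Page4", "Page3", "Page1", "Page2"],
    3: ["Page3", "Page4", "Page1"],
}


def atom_id_lists(sessions):
    """Map each page (atom) to its list of (session id, position) pairs."""
    lists = defaultdict(list)
    for sid, pages in sessions.items():
        for position, page in enumerate(pages, start=1):
            lists[page].append((sid, position))
    return lists


def join(prefix_list, atom_list):
    """Extend a pattern by one atom: keep atom occurrences that come after
    an occurrence of the prefix in the same session."""
    return sorted({(sid_a, t_a)
                   for sid_p, t_p in prefix_list
                   for sid_a, t_a in atom_list
                   if sid_a == sid_p and t_a > t_p})


def support(id_list, n_sessions):
    """Fraction of sessions that contain the pattern at least once."""
    return len({sid for sid, _ in id_list}) / n_sessions


lists = atom_id_lists(sessions)
page3_page1 = join(lists["Page3"], lists["Page1"])
```

The joined list keeps the position of the last atom (Page1), matching the description of what the timestamp stores for patterns longer than one page. Only sessions 2 and 3 remain in `page3_page1`, so `support(page3_page1, 3)` evaluates to 2/3 without rescanning the sessions.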


Fig.9. Flowchart for DFS algorithm

V. PERFORMANCE COMPARISON

The FAP mining algorithm mines the complete set of frequent itemsets without generating a candidate set. FAP-growth works in a divide-and-conquer way. The first pass over the database derives a list of frequent items arranged in descending order of frequency, which is used to generate a frequent-access-pattern tree, or FAP-tree. The algorithm searches for shorter patterns recursively and then concatenates the suffix to find the long frequent patterns. Performance studies demonstrate that the method substantially reduces search time [6].

GSP can suffer from two nontrivial costs: (1) generating a huge number of candidate sets, and (2) repeatedly scanning the database and checking the candidates by pattern matching.

DFS gives good performance since it eliminates infrequent patterns at each level and keeps fewer patterns in memory at each step.

Murat Ali Bayir et al. have run experiments with the GSP and DFS algorithms on the web logs of a departmental web server. The log files of the web server of the Computer department (www.ceng.metu.edu.tr) are used.

In the experiments, GSP gave the worst results because it does not use the pattern lattice structure and has to perform a session scan at each step. DFS is better because it eliminates infrequent patterns at each level and keeps fewer patterns in memory at each step. The session id-timestamp list structure prevents unnecessary database scans for evaluating the frequency of length-1 and length-2 patterns. Hence DFS gives better performance. Fig. 10 compares the discovery time of all frequent patterns of different lengths for the DFS and GSP algorithms.

Fig.10. Comparison of performance of GSP and DFS[5]

Comprehensive Study Comparison of FAP-growth, DFS and GSP Algorithms:

The GSP algorithm is based on the apriori property: if an itemset α is not frequent, then none of its supersets can be frequent either. GSP uses a bottom-up search, implying that in order to produce a frequent sequence of length n, all 2^n subsequences have to be generated. This exponential complexity limits it to discovering only short patterns, since it prunes any candidate sequence in which there is a subsequence which is infrequent. Table 4 gives a comparison between the three algorithms based on various factors.

Table 4. Performance comparison of FAP, GSP and DFS

Algorithm:
  FAP-growth: Constructs FP tree. Finds long frequent patterns by searching shorter ones and concatenating suffix.
  GSP: Candidate generation. Count. Prune.
  DFS: Construct pattern lattice. Search patterns by creating session id-timestamp list. Prune.
Candidate generation:
  FAP-growth: No
  GSP: Yes
  DFS: No
Based on:
  FAP-growth: FP tree
  GSP: Candidate generation
  DFS: Pattern lattice
Scan entire dataset:
  FAP-growth: Only twice to construct FP tree and later works on FP tree
  GSP: Scan repeatedly for pattern matching
  DFS: Scans to create session id-timestamp list
Memory:
  FAP-growth: Holds FP-tree in memory
  GSP: All candidates generated as well as datasets
  DFS: Very few patterns as patterns are pruned at each level
When large dataset / low support:
  FAP-growth: Bushy FP tree may not fit in the main memory
  GSP: Exponential number of candidates
  DFS: Pattern lattice becomes huge
Scalable:
  FAP-growth: Not when support is very low, else yes
  GSP: Only when support is fairly high
  DFS: Not when support is very low, else yes
Execution time:
  FAP-growth: Lower
  GSP: Fairly high
  DFS: Lower
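The apriori-based pruning that GSP relies on, described just before Table 4, can be illustrated with a small check. This is a sketch under assumptions: sequences are tuples of single pages, a subsequence is formed by dropping one item, the frequent length-2 set is made-up data, and the function names are hypothetical.

```python
def one_shorter_subsequences(candidate):
    """All subsequences obtained by dropping exactly one item from the candidate."""
    return {candidate[:i] + candidate[i + 1:] for i in range(len(candidate))}


def survives_apriori(candidate, frequent_prev):
    """A candidate can only be frequent if each one-item-shorter
    subsequence is already known to be frequent."""
    return one_shorter_subsequences(candidate) <= frequent_prev


# Hypothetical frequent length-2 sequences.
frequent_2 = {("A", "B"), ("A", "C"), ("B", "C")}

keep = survives_apriori(("A", "B", "C"), frequent_2)
drop = survives_apriori(("A", "B", "D"), frequent_2)
```

With this data, the candidate A, B, C is kept because all three of its length-2 subsequences are frequent, while A, B, D is discarded without being counted because A, D and B, D are not frequent.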


Fig.11. Scalability with threshold[8]

Table 4 gives a comprehensive comparison between the three algorithms GSP, FAP-growth and DFS. Fig. 11 shows the scalability of the GSP and FP-growth algorithms.

FP-growth works in a divide-and-conquer way. The first scan of the database derives a list of frequent items, ordered by descending frequency. According to this list, the database is represented as a frequent-pattern tree, or FP-tree, which shows the association between items. Mining the FP-tree starts from each frequent length-1 pattern (as a suffix pattern), constructs its conditional pattern base (a “subdatabase” consisting of the prefix paths in the FP-tree that contain the suffix pattern), then forms its conditional FP-tree, and performs mining recursively on this tree. It uses the least frequent items as a suffix, offering good selectivity. Performance studies demonstrate that the method substantially reduces search time.

DFS algorithms take a more incremental approach, as they generate possible frequent sequences and use a divide-and-conquer approach. The algorithm mainly attempts to reduce the search space.

VI. ACKNOWLEDGMENT

I would like to thank Dr. J.W. Bakal Sir and Madhu Madam for facilitating all the necessary inputs, study material and resources and guiding me with their rich experience. I would especially like to thank my parents, in-laws and my husband for their unconditional support.

VII. REFERENCES

[1] Theint Aye, “Web cleaning for mining of web usage patterns”, International Conference on Computer Research and Development (ICCRD), pages 490-494, Vol. 2, May 2011
[2] K. R. Suneetha, Dr. K. R. Krishnamoorthy, “Identifying User Behavior by Analyzing Web Server Access Log”, International Journal of Computer Science and Network Security (IJCSNS), pages 327-331, Vol. 9 No. 4, April 2009
[3] Guangyuan Li, Qin Xiao, Qinbin Hu, Changan Yuan, “An Efficient Algorithm for Mining Frequent Sequences in Dynamic Environment”, in Granular Computing, 2009, GRC '09, IEEE International Conference, pages 329-333, Aug. 2009
[4] Jiawei Han, Hong Cheng, Dong Xin, Xifeng Yan, “Frequent pattern mining: current status and future directions”, In Proceedings of International Conference on Data Mining Knowledge Discovery Journal (DATAMINE), pages 55-86, Vol. 15 No. 1, March 2007
[5] Murat Ali Bayir, Ismail H. Toroslu, Ahmet Cosar, “Performance Comparison of Pattern Discovery Methods on Web Log Data”, Computer Systems and Applications, IEEE International Conference, pages 445-451, April 2006
[6] Osmar R. Zaïane, Mohammad El-Hajj, “Pattern Lattice Traversal by Selective Jumps”, in Proc. 2005 Int'l Conf. on Knowledge Discovery and Data Mining (ACM SIGKDD), pp. 729-735, Chicago, August 2005
[7] Xidong Wang, Yiming Ouyang, Xuegang Hu, Yan Zhang, “Discovery of User Frequent Access Patterns on Web Usage Mining”, Computer Supported Cooperative Work in Design Proceedings, 8th IEEE International Conference, pages 765-769, Vol. 1, November 2004
[8] Jiawei Han, Jian Pei, Yiwen Yin, Runying Mao, “Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach”, In Proceeding of International Conference on Data Mining and Knowledge Discovery, 8, pp. 53-87, 2004
[9] Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Jianyong Wang, Helen Pinto, Qiming Chen, Umeshwar Dayal, Mei-Chun Hsu, “Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach”, IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 11, November 2004
[10] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan, “Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data”, In proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97), pages 558-567, 1997
[11] R. Srikant and R. Agrawal, “Mining Sequential Patterns,” Proceedings of Fifth International Conference Extending Database Technology (EDBT '96), pp. 3-17, Mar. 1996
[12] Ramakrishnan Srikant, Rakesh Agrawal, “Mining Sequential Patterns: Generalizations and Performance Improvements”, In Proceedings of the 11th International Conference on Data Engineering, pages 3-14, 1995.
