Comparative Study of Techniques to Discover Frequent ... - IRD India

Comparative Study of Techniques to Discover Frequent Patterns of Web Usage Mining

Mona S. Kamat 1, J. W. Bakal 2 & Madhu Nashipudi 3
1,3 Information Technology Department, Pillai Institute of Information Technology (PIIT), Panvel, Navi Mumbai, India
2 Shivajirao S. Jondhale College of Engineering (SSJCOE), Dombivali, Thane, India
Email: 1 monakamat@gmail.com, 2 bakaljw@gmail.com & 3 madhu.nashipudi@yahoo.in

International Journal on Advanced Computer Theory and Engineering (IJACTE), ISSN (Print): 2319 – 2526, Volume-2, Issue-3, 2013

Abstract: This paper briefly describes the Web usage mining process and the stages involved in it, followed by a detailed study of three algorithms for discovering frequent patterns, namely FAP (Frequent Access Pattern) mining, GSP (Generalized Sequential Pattern) and DFS (Depth First Search). DFS and FAP mining perform better than GSP, since GSP requires repeated session scans to find the patterns. DFS and FAP mining are more effective because they rely on the construction of a pattern lattice and a FAP tree, respectively. A few applications of Web usage mining are outlined in the last section. The paper gives an overview of the Web usage mining process but mainly focuses on the comparative study of the three algorithms GSP, FAP mining and DFS and their performance.

Keywords: DFS (Depth First Search), FAP (Frequent Access Pattern), Frequent Patterns, GSP (Generalized Sequential Pattern)

I. INTRODUCTION

Web usage mining is the process in which user access patterns are discovered and analyzed by mining the log files and related data associated with a certain website. It is a kind of web mining that automatically discovers user usage patterns and is helpful in studying and analyzing user interests. Web usage mining consists of three main stages, namely data preprocessing, pattern discovery and pattern analysis, as shown in Fig.1.

Fig.1. Stages in Web Usage Mining

Data for web mining is collected from multiple sources. The data comes in different formats following different conventions, with many duplicates and inconsistencies, and is sometimes incomplete. Hence data preprocessing is vital and is also the most complex of all stages. It reduces the data size radically, thus improving the efficiency of mining. Pattern discovery applies various techniques to the preprocessed data to discover frequent patterns, such as statistical analysis, clustering, association rule mining, sequential pattern mining, classification and so on. In the next stage, pattern analysis, all the patterns discovered in the previous phase are analysed to keep only the interesting patterns and rules and to sieve out the useless ones.

This paper gives a brief overview of the web usage mining process and explains a few mining algorithms for finding frequent patterns, which constitute the basic information source for intelligent web-based systems. It also makes a comparative study of their performance.

II. LITERATURE SURVEY

Xidong Wang et al. propose a method that can discover users' frequent access patterns underlying their web browsing behaviour. They introduce the concept of an access pattern according to a user's access path, and based on it put forward a revised algorithm (FAP-Mining), derived from the FP-tree algorithm, to mine frequent access patterns. The new algorithm first constructs a frequent access pattern tree and then mines users' frequent access patterns on the tree. The algorithm is accurate and scalable for mining frequent access patterns of different lengths.[7]

Murat Ali Bayir et al. have explained web mining and its categories. After briefly explaining data preprocessing, they present a few algorithms for searching frequent patterns and a few algorithms based on the construction of a pattern lattice. They also present the results of the experiments they carried out to observe and analyse the performance of the algorithms they list.[5]

Theint Theint Aye explains the importance of data preprocessing for improving the ease and efficiency of the mining process. The work mainly focuses on the data preprocessing stages and explains algorithms for field extraction and data cleaning, which respectively separate the fields from a single line of the log file and eliminate inconsistent or unnecessary items from the analysed data.[1]

Robert Cooley et al. have presented a survey of the research in the area of Web usage mining and have briefly discussed the pattern discovery techniques.

III. STAGES IN WEB USAGE MINING

Briefly, the stages are as follows:
i. Obtain data from various sources
ii. Data preprocessing
iii. Pattern Discovery
iv. Pattern Analysis

Fig.2. General Architecture of Web Usage Mining[14]

A. Data Preprocessing

The data should be preprocessed to improve the efficiency and ease of the mining process. The main task of data preprocessing is to prune noisy and irrelevant data and to reduce the data volume for the pattern discovery phase. Field extraction and data cleaning algorithms parse the web log records, separating the fields and purging inconsistent or unnecessary items.

B. Pattern Discovery

A few techniques to discover patterns from preprocessed data are converting IP addresses to domain names, filtering, dynamic site analysis, cookies, path analysis, association rules, sequential patterns, clustering, decision trees etc.

C. Pattern Analysis

The end products of analysis include statistics such as the frequency of visits per document, the most recent visit per document, who is visiting which documents, the frequency of use of each hyperlink, and the most recent use of each hyperlink. The common techniques used for pattern analysis are visualization techniques, OLAP techniques, Data & Knowledge Querying and Usability Analysis.

IV. TECHNIQUES TO DISCOVER FREQUENT PATTERNS

A. Frequent Access Pattern Mining (FAP Mining)

The FAP algorithm is rooted in the FP-growth algorithm and constructs a FAP tree similar to an FP tree. The FAP algorithm is divided into two steps. In the first step, it constructs the frequent access pattern tree (FAP tree) from the access paths obtained from the user sessions and records the access count of each page. In the second step, the FAP-growth function is used to mine both long and short access patterns on the FAP tree.

FP-growth shows good functionality when used for mining association rules and sequential patterns. However, association rule mining imposes no order on the elements of an itemset, whereas access pattern mining requires the sequential order of page accesses. FP-growth has therefore been revised by X. Wang et al. in [7].

Table 1. Episode of user access path in users session files

Fig.3. Flowchart for algorithm to construct frequent access pattern
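As a rough illustration of the first FAP step (building a tree from access paths while recording page access counts), the following minimal sketch builds a plain prefix tree over ordered access paths together with a page header table. It is not the exact FAP_Tree procedure of [7]; the names `Node`, `build_access_tree` and `page_counts`, and the three sample paths, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class Node:
    """One page on an access path, with the number of paths passing through it."""

    def __init__(self, page, parent=None):
        self.page = page
        self.parent = parent
        self.count = 0
        self.children = {}


def build_access_tree(paths):
    """Insert each ordered access path into a prefix tree.

    Returns the root and a header table mapping each page to the
    tree nodes that carry it (an assumption modelled on FP-tree headers).
    """
    root = Node(None)
    header = defaultdict(list)
    for path in paths:
        node = root
        for page in path:
            child = node.children.get(page)
            if child is None:
                child = Node(page, parent=node)
                node.children[page] = child
                header[page].append(child)
            child.count += 1  # one more path passes through this node
            node = child
    return root, header


def page_counts(header):
    """Total access count per page, summed over all of its tree nodes."""
    return {page: sum(n.count for n in nodes) for page, nodes in header.items()}


# Hypothetical access paths, page order preserved.
paths = [["A", "B", "C"], ["A", "B", "E"], ["B", "C", "D"]]
root, header = build_access_tree(paths)
counts = page_counts(header)
print(counts)
```

Because page order is preserved along each branch, the paths A→B→C and B→C→D end up on different branches, which is the property that distinguishes access paths from unordered itemsets.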

The Construction of FAP-Tree

Algorithm: FAP_Tree(tree, p). Construct frequent access pattern tree [b]
Input: The set of user access paths p.
Output: The set of user access patterns.

The flowchart in Fig.3 shows the algorithm that constructs the FAP tree. To facilitate frequent access pattern generation and FAP tree traversal, a page header table is built and associated with the tree so that each page points to its position in the tree; the table entries are sorted in ascending order of page count. Table 1 shows an episode of the access path of a certain user contained in the user session file. From it, the FAP-Tree function constructs the frequent access tree shown in Fig.4.

Fig.4. FAP tree[7]

FAP Growth

Fig.5. Flowchart of FAP-growth algorithm

Algorithm: FAP-growth(tree, α), mine frequent access patterns
Input: FAP tree, min_support
Output: α, the set of all the access patterns

Table 2. Mining result of the FAP Tree

B. Generalised Sequential Pattern (GSP)

R. Srikant and R. Agrawal proposed the basic structure of the GSP algorithm for finding sequential patterns. The algorithm makes multiple passes over the session set. The first pass yields the frequent items, i.e. the frequent 1-element itemsets. Each following pass starts from the frequent itemsets found in the previous pass and uses them to generate potential itemsets called candidates. The candidates are then pruned against the minimum support, which gives the frequent n-element itemsets in pass n. The algorithm is iterative and ends when there are no more frequent sets. The three main steps carried out in each pass are shown in Fig.6.

Fig.6. Steps in each pass of GSP algorithm (Candidate Generation, Counting Candidates, Prune based on support)

Candidate Generation: Candidate sequences are generated by joining C_{k-1} with C_{k-1}. Sequence s1 joins with s2 if dropping the first item of s1 and dropping the last item of s2 yield the same subsequence.[12] This creates the candidates for the current pass.

Counting Candidates: In each pass, one log sequence is read at a time and the access count of every candidate contained in that log sequence is incremented.

Pruning: Only those candidates whose access count is equal to or higher than the minimum support are kept and passed on as the frequent itemsets of the iteration.
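The three steps of a GSP pass can be sketched as follows. This is a simplified illustration, assuming each session is an ordered tuple of single pages and that a pattern is contained in a session when it is an order-preserving (not necessarily contiguous) subsequence. It omits the additional pruning of candidates with infrequent subsequences that full GSP performs before counting. The function names and the sample sessions are hypothetical.

```python
def is_subsequence(pattern, session):
    """True if the pages of pattern occur in session in the same order."""
    it = iter(session)
    return all(page in it for page in pattern)


def generate_candidates(frequent):
    """Join step: s1 joins s2 when s1 without its first item equals
    s2 without its last item; the candidate is s1 plus the last item of s2."""
    return {s1 + (s2[-1],) for s1 in frequent for s2 in frequent
            if s1[1:] == s2[:-1]}


def count_candidates(candidates, sessions):
    """Counting step: read one session at a time, increment contained candidates."""
    counts = {c: 0 for c in candidates}
    for session in sessions:
        for c in candidates:
            if is_subsequence(c, session):
                counts[c] += 1
    return counts


def gsp(sessions, min_sup):
    """Return all frequent sequences with their access counts."""
    counts = {}
    for session in sessions:
        for page in set(session):  # count each page once per session
            counts[(page,)] = counts.get((page,), 0) + 1
    frequent = {p: c for p, c in counts.items() if c >= min_sup}
    result = dict(frequent)
    while frequent:
        candidates = generate_candidates(frequent)
        counted = count_candidates(candidates, sessions)
        # Pruning step: keep candidates that reach the minimum support.
        frequent = {p: c for p, c in counted.items() if c >= min_sup}
        result.update(frequent)
    return result


# Hypothetical sessions, minimum access count of 2.
sessions = [("A", "B", "C"), ("A", "C"), ("B", "C"), ("A", "B", "C", "D")]
result = gsp(sessions, min_sup=2)
for pattern in sorted(result, key=lambda p: (len(p), p)):
    print("".join(pattern), result[pattern])
```

Note that every pass calls `count_candidates`, which rescans all sessions; this is the repeated session scan that the comparison with DFS and FAP mining refers to.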

GSP Algorithm

Fig.7. Flowchart for GSP algorithm

Input: S (sessions), F1 (frequent 1-sequences), min_sup (the minimal access count that satisfies the support threshold).
Output: F, the set of all the access patterns.

Table 3. Mining result for GSP algorithm

1-frequent access patterns generated: {A}:4, {B}:5, {C}:6, {D}:3, {E}:3, {G}:2, {I}:4
2-frequent access patterns generated: {AB}:3, {BC}:3, {BE}:2, {CD}:3
3-frequent access patterns generated: {ABC}:2, {ABE}:2, {BCD}:2
4-frequent access patterns generated: {ABCD}:2

C. Depth First Search (DFS)

In this algorithm, patterns are categorized by length on a lattice model. The patterns form a lattice based on pattern length and pattern frequency, and frequent patterns are searched depth first on this lattice.

Lattice Construction: The basic element of the lattice is an atom, i.e. a single page. Each atom or page stands for a length-1 prefix equivalence class. Beginning from the bottom elements, the frequency of an upper element of length n can be calculated from two patterns of length n-1 belonging to the same class.[5] Example: Consider three atoms corresponding to three web pages P1, P2 and P3. Fig.8 shows a sample lattice with all length-2 patterns and some length-3 patterns.

Fig.8. Pattern Lattice[5]

Session id-timestamp list: After data preprocessing, the series of web pages visited in each session is obtained. The session id-timestamp list keeps the session id and timestamp information of a pattern over all sessions. For patterns with length > 1, the timestamp information keeps the timestamp value of the last atom. Example: 4 pages and 3 sessions are given below.

S1 = Page1 → Page2 → Page4 → Page1 → Page3
S2 = Page4 → Page3 → Page1 → Page2
S3 = Page3 → Page4 → Page1

Table 4. Session id-timestamp list

Table 5. Session id-timestamp list for Page3Page1

The support of the pattern Page3Page1 is 2/3, since it occurs in two of the three sessions. Patterns are then pruned based on the minimum support. Fig.9 shows the flowchart depicting the working of the DFS algorithm.
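The session id-timestamp list for the three example sessions can be worked through with a short sketch. It assumes that a timestamp is the 1-based position of a page within its session and that a length-2 pattern is obtained by joining the list of the first page with the list of the second page, keeping the timestamp of the last atom. The helper names `atom_lists`, `join` and `support` are hypothetical.

```python
from fractions import Fraction


def atom_lists(sessions):
    """Session id-timestamp list of every single page (atom)."""
    lists = {}
    for sid, pages in sessions.items():
        for t, page in enumerate(pages, start=1):  # 1-based position as timestamp
            lists.setdefault(page, []).append((sid, t))
    return lists


def join(prefix_list, atom_list):
    """List for the pattern prefix followed by atom.

    Keeps each (session id, timestamp) of the atom that comes after
    some occurrence of the prefix in the same session.
    """
    earliest = {}
    for sid, t in prefix_list:
        earliest[sid] = min(t, earliest.get(sid, t))
    return [(sid, t) for sid, t in atom_list
            if sid in earliest and earliest[sid] < t]


def support(id_list, n_sessions):
    """Fraction of sessions in which the pattern occurs at least once."""
    return Fraction(len({sid for sid, _ in id_list}), n_sessions)


sessions = {
    1: ["Page1", "Page2", "Page4", "Page1", "Page3"],
    2: ["Page4", "Page3", "Page1", "Page2"],
    3: ["Page3", "Page4", "Page1"],
}
atoms = atom_lists(sessions)
page3_page1 = join(atoms["Page3"], atoms["Page1"])
print(page3_page1)                          # [(2, 3), (3, 3)]
print(support(page3_page1, len(sessions)))  # 2/3
```

In S1 every visit to Page1 precedes the visit to Page3, so the pattern Page3Page1 is found only in S2 and S3, which agrees with the support of 2/3 stated above. Once the atom lists exist, longer patterns are evaluated by joining lists rather than rescanning the sessions.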

Fig.9. Flowchart for DFS algorithm

V. PERFORMANCE COMPARISON

The FAP mining algorithm mines the complete set of frequent itemsets without generating a candidate set. FAP-growth works in a divide-and-conquer way. The first pass over the database derives a list of frequent items arranged in descending order of frequency, which is used to generate a frequent access pattern tree, or FAP-tree. The algorithm searches for shorter patterns recursively and then concatenates the suffix to find the long frequent patterns. Performance studies demonstrate that the method substantially reduces search time.[6]

GSP can suffer from two nontrivial costs: (1) generating a huge number of candidate sets, and (2) repeatedly scanning the database and checking the candidates by pattern matching.

DFS performs well since it eliminates infrequent patterns at each level and keeps fewer patterns in memory at each step.

Murat Ali Bayir et al. have reported experimental results for the GSP and DFS algorithms on the log files of the web server of their computer department (www.ceng.metu.edu.tr).

In these experiments GSP gave the worst results because it does not use a pattern lattice structure and has to perform a session scan at each step. DFS is better because it eliminates infrequent patterns at each level and keeps fewer patterns in memory at each step. The session id-timestamp list structure prevents unnecessary database scans when evaluating the frequency of length-1 and length-2 patterns. Hence DFS gives better performance. Fig.10 compares the discovery time of all frequent patterns of different lengths for the DFS and GSP algorithms.

Fig.10. Comparison of performance of GSP and DFS[5]

Comprehensive Comparison of FAP-growth, DFS and GSP Algorithms:

The GSP algorithm is based on the Apriori property: if an itemset α is not frequent, then none of its supersets can be frequent either. GSP uses a bottom-up search, which implies that in order to produce a frequent sequence of length n, all 2^n subsequences have to be generated. In practice this exponential complexity limits GSP to discovering short patterns. GSP prunes any candidate sequence that contains an infrequent subsequence. Table 6 compares the three algorithms on various factors.

Table 6. Performance comparison of FAP, GSP and DFS

Factor | FAP-growth | GSP | DFS
Algorithm | Constructs FP tree; finds long frequent patterns by searching shorter ones and concatenating suffix | Candidate generation; count; prune | Construct pattern lattice; search patterns by creating session id-timestamp list; prune
Candidate generation | No | Yes | No
Based on | FP tree | Candidate generation | Pattern lattice
Scan entire dataset | Only twice to construct FP tree and later works on FP tree | Scans repeatedly for pattern matching | Scans to create session id-timestamp list
Memory | Holds FP-tree in memory | All candidates generated as well as datasets | Very few patterns, as patterns are pruned at each level
When large dataset / low support | Bushy FP tree may not fit in the main memory | Exponential number of candidates | Pattern lattice becomes huge
Scalable | Not when support is very low, else yes | Only when support is fairly high | Not when support is very low, else yes
Execution time | Lower | Fairly high | Lower
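To give a feel for the 2^n growth attributed to GSP's bottom-up search, the sketch below enumerates every subsequence of a short pattern, counting the empty subsequence as well. The helper `subsequences` is hypothetical and serves only to illustrate the count; it is not part of GSP itself.

```python
from itertools import combinations


def subsequences(seq):
    """All order-preserving subsequences of seq, including the empty one."""
    n = len(seq)
    return [tuple(seq[i] for i in idx)
            for k in range(n + 1)
            for idx in combinations(range(n), k)]


# The number of subsequences doubles with every extra page in the pattern.
for n in range(1, 6):
    pattern = "ABCDE"[:n]
    print(n, len(subsequences(pattern)))
```

A pattern of 20 pages already has over a million subsequences, which is why candidate-based search becomes expensive for long patterns or low support thresholds.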

Fig.11. Scalability with threshold[8]

The performance comparison table above gives a comprehensive comparison of the three algorithms GSP, FAP-growth and DFS. Fig.11 shows the scalability of the GSP and FP-growth algorithms.

FP-growth works in a divide-and-conquer way. The first scan of the database derives a list of frequent items, ordered by descending frequency. According to this list, the database is represented as a frequent-pattern tree, or FP-tree, which shows the association between items. Mining starts from each frequent length-1 pattern (as a suffix pattern), constructs its conditional pattern base (a "sub-database" consisting of the prefix paths in the FP-tree that contain the suffix pattern), then forms its conditional FP-tree and performs mining recursively on this tree. It uses the least frequent items as suffixes, which offers good selectivity. Performance studies demonstrate that the method substantially reduces search time.

DFS takes a more incremental approach: it generates possible frequent sequences and uses a divide-and-conquer strategy. The algorithm mainly attempts to reduce the search space.

VI. ACKNOWLEDGMENT

I would like to thank Dr. J. W. Bakal Sir and Madhu Madam for facilitating all the necessary inputs, study material and resources and for guiding me with their rich experience. I would especially like to thank my parents, in-laws and my husband for their unconditional support.

VII. REFERENCES

[1] Theint Aye, "Web cleaning for mining of web usage patterns", International Conference on Computer Research and Development (ICCRD), pages 490-494, Vol. 2, May 2011
[2] K. R. Suneetha, Dr. K. R. Krishnamoorthy, "Identifying User Behavior by Analyzing Web Server Access Log", International Journal of Computer Science and Network Security (IJCSNS), pages 327-331, Vol. 9 No. 4, April 2009
[3] Guangyuan Li, Qin Xiao, Qinbin Hu, Changan Yuan, "An Efficient Algorithm for Mining Frequent Sequences in Dynamic Environment", in Granular Computing, 2009, GRC '09, IEEE International Conference, pages 329-333, Aug. 2009
[4] Jiawei Han, Hong Cheng, Dong Xin, Xifeng Yan, "Frequent pattern mining: current status and future directions", In Proceedings of International Conference on Data Mining Knowledge Discovery Journal (DATAMINE), pages 55-86, Vol. 15 No. 1, March 2007
[5] Murat Ali Bayir, Ismail H. Toroslu, Ahmet Cosar, "Performance Comparison of Pattern Discovery Methods on Web Log Data", Computer Systems and Applications, IEEE International Conference, pages 445-451, April 2006
[6] Osmar R. Zaïane, Mohammad ElHajj, "Pattern Lattice Traversal by Selective Jumps", in Proc. 2005 Int'l Conf. on Knowledge Discovery and Data Mining (ACM SIGKDD), pp. 729-735, Chicago, August 2005
[7] Xidong Wang, Yiming Ouyang, Xuegang Hu, Yan Zhang, "Discovery of User Frequent Access Patterns on Web Usage Mining", Computer Supported Cooperative Work in Design, Proceedings 8th IEEE International Conference, pages 765-769, Vol. 1, November 2004
[8] Jiawei Han, Jian Pei, Yiwen Yin, Runying Mao, "Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach", In Proceeding of International Conference on Data Mining and Knowledge Discovery, 8, pp. 53-87, 2004
[9] Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Jianyong Wang, Helen Pinto, Qiming Chen, Umeshwar Dayal, Mei-Chun Hsu, "Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach", IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 11, November 2004
[10] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan, "Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data", In proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97), pages 558-567, 1997
[11] R. Srikant and R. Agrawal, "Mining Sequential Patterns", Proceedings of Fifth International Conference Extending Database Technology (EDBT '96), pp. 3-17, Mar. 1996
[12] Ramakrishnan Srikant, Rakesh Agrawal, "Mining Sequential Patterns: Generalizations and Performance Improvements", In Proceedings of the 11th International Conference on Data Engineering, pages 3-14, 1995
