12.07.2015 Views

A brief overview on Data mining - International Journal of Computer ...



ISSN 2249-6343
International Journal of Computer Technology and Electronics Engineering (IJCTEE)
Volume 1, Issue 3

A Brief Overview on Data Mining Survey

Hemlata Sahu, Shalini Shrma, Seema Gondhalakar

Abstract- This paper provides an introduction to the basic concepts of data mining. Data mining is used to extract meaningful information and to develop significant relationships among variables stored in a large data set or data warehouse. In the case study reported in this paper, a data mining approach is applied to extract knowledge from a data set. Data mining is the process of discovering potentially useful, interesting, and previously unknown patterns from a large collection of data.

Data mining is a multidisciplinary field, drawing work from areas including database technology, machine learning, statistics, pattern recognition, information retrieval, neural networks, knowledge-based systems, artificial intelligence, high-performance computing, and data visualization. We present techniques for the discovery of patterns hidden in large data sets, focusing on issues relating to their feasibility, usefulness, effectiveness, and scalability.

The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Data mining is the use of automated data analysis techniques to uncover previously undetected relationships among data items. Data mining often involves the analysis of data stored in a data warehouse. Three of the major data mining techniques are regression, classification and clustering.

Data mining, also popularly known as Knowledge Discovery in Databases (KDD), refers to the nontrivial extraction of implicit, previously unknown and potentially useful information from data in databases. While data mining and knowledge discovery in databases are frequently treated as synonyms, data mining is actually one part of the knowledge discovery process. Fig.1 shows data mining as a step in an iterative knowledge discovery process.

Keywords: Data mining; Association rules; Clustering; k-means; Decision tree.

I. INTRODUCTION

Data mining is the process of extracting implicit, potentially useful information and knowledge, not known in advance, from massive, incomplete, noisy, fuzzy and random data [2].

The essential difference between data mining and traditional data analysis (such as query, reporting and on-line analysis applications) is that data mining mines information and discovers knowledge without starting from a clear prior assumption [1].

In addition to industry-driven demand for standards and interoperability, professional and academic activity has also made considerable contributions to the evolution of the methods and models; an article published in a 2008 issue of the International Journal of Information Technology and Decision Making summarizes the results of a literature survey which traces and analyzes this evolution [8].

Fig.1: Data mining is the core of Knowledge Discovery Process

The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge.


The iterative process consists of the following steps:

Data cleaning: also known as data cleansing, this is the phase in which noisy and irrelevant data are removed from the collection.
Data integration: at this stage, multiple data sources, often heterogeneous, may be combined into a common source.
Data selection: at this step, the data relevant to the analysis is decided on and retrieved from the data collection.
Data transformation: also known as data consolidation, this is the phase in which the selected data is transformed into forms appropriate for the mining procedure.
Data mining: the crucial step, in which clever techniques are applied to extract potentially useful patterns.
Pattern evaluation: in this step, strictly interesting patterns representing knowledge are identified based on given measures.
Knowledge representation: the final phase, in which the discovered knowledge is visually represented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.

It is common to combine some of these steps. For instance, data cleaning and data integration can be performed together as a pre-processing phase to generate a data warehouse. Data selection and data transformation can also be combined, where the consolidation of the data is the result of the selection or, as in the case of data warehouses, the selection is done on transformed data.

Data mining commonly involves four classes of tasks: [3]

Clustering - the task of discovering groups and structures in the data that are in some way "similar", without using known structures in the data.
Clustering is a data mining (machine learning) technique used to place data elements into related groups without advance knowledge of the group definitions. Popular clustering techniques include k-means clustering and expectation maximization (EM) clustering.

Classification - the task of generalizing known structure to apply to new data. For example, an email program might attempt to classify an email as legitimate or spam. Common algorithms include decision tree learning, nearest neighbor, naive Bayesian classification, neural networks and support vector machines.
Classification analysis is well suited to categorical data or to a mixture of continuous numeric and categorical data. It can process a wider variety of data than regression and is growing in popularity.

Regression - attempts to find a function which models the data with the least error.
Regression is the oldest and best-known statistical technique that the data mining community uses. Regression takes a numerical dataset and develops a mathematical formula that fits the data. To predict future behavior, new data is plugged into the fitted formula to obtain a prediction. The major limitation of this technique is that it only works well with continuous quantitative data (like weight, speed or age). For categorical data where order is not significant (like color, name or gender), another technique is a better choice.
Regression is a data mining (machine learning) technique used to fit an equation to a dataset. The simplest form of regression, linear regression, uses the formula of a straight line (y = mx + b) and determines the appropriate values for m and b to predict the value of y based upon a given value of x. Advanced techniques, such as multiple regression, allow the use of more than one input variable and allow for the fitting of more complex models, such as a quadratic equation.
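To make the straight-line case concrete, the following is a minimal sketch of ordinary least-squares fitting of y = mx + b. It is an illustration only and not part of the original paper: the function names `fit_line` and `predict` and the sample data are assumptions chosen for the example, and least squares is one common way to define "least error".

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b.

    Illustrative helper (hypothetical name); expects two equal-length
    sequences of numbers with at least two distinct x values.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with 2+ points")
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx            # slope
    b = y_bar - m * x_bar    # intercept
    return m, b


def predict(m, b, x):
    """Plug a new x into the fitted formula."""
    return m * x + b


if __name__ == "__main__":
    # Made-up sample data for illustration.
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [3.0, 5.0, 7.0, 9.0]
    m, b = fit_line(xs, ys)
    print("slope:", m, "intercept:", b)
    print("prediction for x = 5:", predict(m, b, 5.0))
```

Multiple regression generalizes the same idea to several input variables; in practice a numerical library would normally be used rather than hand-written sums.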


Association rule learning - searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.

II. DATA MINING: CONVERGENCE OF THREE TECHNOLOGIES

Fig.2: Convergence of three technologies

• Increasing Computing Power
‣ Under Moore's law, computing power doubles roughly every 18 months
‣ Powerful workstations became common
‣ Cost-effective servers (SMPs) provide parallel processing to the mass market
‣ Interesting tradeoff: a small number of large analyses vs. a large number of small analyses

• Improved Data Collection
‣ Data Collection, Access, Navigation, Mining
‣ The more data the better (usually)

• Improved Algorithms
‣ Techniques have often been waiting for computing technology to catch up
‣ Statisticians were already doing "manual data mining"
‣ Good machine learning is just the intelligent application of statistical processes
‣ A lot of data mining research has focused on tweaking existing techniques to get small percentage gains

The Data Mining Process

Generally, the data mining process is composed of data preparation, data mining, and information expression and analysis (decision-making) phases; the specific process is shown in Fig.3 [5].

Fig.3: General process of Data Mining
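As a small illustration of the market basket analysis described under association rule learning, the sketch below computes two measures commonly used to score a rule "A implies B": support (the fraction of baskets containing both A and B) and confidence (the fraction of baskets containing A that also contain B). The paper does not name these measures; they are standard in the association rule literature. The basket contents and the function names are assumptions made up for this example.

```python
def count_containing(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    needed = frozenset(itemset)
    return sum(1 for basket in transactions if needed <= basket)


def support(transactions, itemset):
    """Fraction of all transactions that contain the itemset."""
    return count_containing(transactions, itemset) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from transaction counts."""
    n_antecedent = count_containing(transactions, antecedent)
    if n_antecedent == 0:
        return 0.0  # rule never fires; report zero confidence
    n_both = count_containing(
        transactions, frozenset(antecedent) | frozenset(consequent)
    )
    return n_both / n_antecedent


# Hypothetical supermarket baskets (illustrative data only).
BASKETS = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
    frozenset({"bread", "milk", "eggs"}),
]

if __name__ == "__main__":
    print("support(bread, milk):", support(BASKETS, {"bread", "milk"}))
    print("confidence(bread -> milk):",
          confidence(BASKETS, {"bread"}, {"milk"}))
```

A full association rule miner such as Apriori would search over many candidate itemsets and keep only those above chosen support and confidence thresholds; this sketch only scores a single rule.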


Each data mining algorithm can be decomposed into four components:
1. Model or pattern structure
2. Interestingness measure (score function)
3. Search method
4. Data management strategy

In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data but not decisions; rather, the resulting classification tree can be an input for decision making.

A decision support system (DSS) is a computer-based information system that supports business or organizational decision-making activities. DSSs serve the management, operations, and planning levels of an organization and help to make decisions, which may be rapidly changing and not easily specified in advance.

DSSs include knowledge-based systems. A properly designed DSS is an interactive software-based system intended to help decision makers compile useful information from a combination of raw data, documents, personal knowledge, or business models to identify and solve problems and make decisions.

Data mining requires data preparation, which can uncover information or patterns that may compromise confidentiality and privacy obligations. A common way for this to occur is through data aggregation, in which data are accrued, possibly from various sources, and put together so that they can be analyzed. [38] This is not data mining per se, but a result of the preparation of data before and for the purposes of the analysis. The threat to an individual's privacy comes into play when the compiled data allow the data miner, or anyone who has access to the newly compiled data set, to identify specific individuals, especially when the data were originally anonymous.

Data mining based on neural network

Data mining based on a neural network is composed of three phases: data preparation, rule extraction and rule assessment, as shown in Fig.6.

Fig.5: Algorithm process

Fig.6: Data mining process on neural network

Data mining based on decision tree

Decision tree learning, used in statistics, data mining and machine learning, uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value. More descriptive names for such tree models are classification trees or regression trees. In these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels.
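The structure of a classification tree can be sketched in a few lines. The example below is a hand-built tree, not one learned from data, and it is not taken from the paper: the feature names `known_sender` and `contains_link`, the class labels, and the `Leaf`/`Node`/`classify` names are all assumptions for illustration. It shows how an observation is mapped to a class label, and how the path taken is a conjunction of feature tests.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class Leaf:
    label: str  # class label stored at a leaf


@dataclass
class Node:
    feature: str                       # feature tested at this node
    branches: Dict[object, "Tree"]     # feature value -> subtree


Tree = Union[Leaf, Node]


def classify(tree: Tree, item: dict) -> Tuple[str, List[Tuple[str, object]]]:
    """Walk from the root to a leaf.

    Returns the leaf's class label and the list of (feature, value)
    tests passed on the way, i.e. the conjunction leading to the label.
    """
    path = []
    node = tree
    while isinstance(node, Node):
        value = item[node.feature]
        path.append((node.feature, value))
        node = node.branches[value]
    return node.label, path


# Hypothetical spam filter tree (hand-built for illustration).
TREE = Node("known_sender", {
    True: Leaf("legitimate"),
    False: Node("contains_link", {
        True: Leaf("spam"),
        False: Leaf("legitimate"),
    }),
})

if __name__ == "__main__":
    email = {"known_sender": False, "contains_link": True}
    label, path = classify(TREE, email)
    print(label, path)
```

A decision tree learner would choose the features and splits automatically from labeled training data, typically by optimizing a purity measure at each node; only the prediction step is shown here.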


V. CONCLUSION

Data mining has been a hot topic in computer science research in recent years, and it has extensive applications in various fields. Data mining technology is an application-oriented technology. It is not merely a simple search, query and transfer on a particular database; it also analyzes, integrates and reasons over the data to guide the solution of practical problems, to find relations between events, and even to predict future activities from existing data.

Data mining brings many benefits to businesses, society and governments as well as individuals. However, privacy, security and misuse of information are major problems if they are not addressed correctly.

Table Description:

Figure    Name of Figure
Fig.1     Data mining is the core of Knowledge Discovery Process
Fig.2     Convergence of three technologies
Fig.3     General process of Data Mining
Fig.4     Architecture of Data mining
Fig.5     Algorithm process
Fig.6     Data mining process on neural network

REFERENCES

[1] Ming-Syan Chen, Jiawei Han, Philip S. Yu. Data Mining: An Overview from a Database Perspective [J]. IEEE Transactions on Knowledge and Data Engineering, 1996, 8(6): 866-883.
[2] R. Agrawal, T. Imielinski, A. Swami. Database Mining: A Performance Perspective [J]. IEEE Transactions on Knowledge and Data Engineering, 1993, 12: 914-925.
[3] Fayyad, Usama; Gregory Piatetsky-Shapiro, and Padhraic Smyth (1996). "From Data Mining to Knowledge Discovery in Databases". http://www.kdnuggets.com/gpspubs/aimag-kdd-overview-1996-Fayyad.pdf Retrieved 2008-12-17.
[4] Y. Peng, G. Kou, Y. Shi, Z. Chen (2008). "A Descriptive Framework for the Field of Data Mining and Knowledge Discovery". International Journal of Information Technology and Decision Making, Volume 7, Issue 4: 639-682. doi:10.1142/S0219622008003204.
[5] Data mining: Ford, C.W.; Chia-Chu Chiang; Hao Wu; Chilka, R.R.; Talburt, J.R.; Information Technology: Coding and Computing, 2005. ITCC 2005 International Conference. Digital Object Identifier: 10.1109/ITCC.2005.270. Publication Year: 2005, Page(s): 122-127, Vol. 1.
[6] Han, J. & M. Kamber, Data mining: concepts and techniques, San Francisco: Morgan Kaufmann (2001).
[7] "Data mining tools", by Ralf Mikut, Markus Reischl, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2011.
[8] "Data mining and ware housing". Electronics Computer Technology (ICECT), 2011 3rd International Conference on, Volume: 1, Publication Year: 2011, Page(s): 1-5.
[9] "The applied research on data mining in the financial analysis of university with the analysis of college students' arrears as an example", Chen Hongfei; Wang Xiaoyan; Business Management and Electronic Information (BMEI), 2011 International Conference on, Volume: 2. Digital Object Identifier: 10.1109/ICBMEI.2011.5917992. Publication Year: 2011, Page(s): 633-636.

AUTHOR'S PROFILE:

Miss. Hemlata P. Sahu was born in Nagpur, Maharashtra in 1988. She is pursuing a Post Graduate Degree (M.E.), III sem, in Computer Science & Engineering from P.I.E.S., RGPU University, in the year 2010-2012.

Mrs. Shalini Shrma was born in Gwaliyar, M.P. She is pursuing a Post Graduate Degree (M.E.), III sem, in Computer Science & Engineering from P.I.E.S., RGPU University, in the year 2010-2012.

Mrs. Seema Gondhalakar was born in Nashik, Maharashtra. She is pursuing a Post Graduate Degree (M.E.), III sem, in Computer Science & Engineering from P.I.E.S., RGPU University, in the year 2010-2012.
