12.07.2015 Views

A brief overview on Data mining - International Journal of Computer ...

A brief overview on Data mining - International Journal of Computer ...

A brief overview on Data mining - International Journal of Computer ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

ISSN 2249-6343Internati<strong>on</strong>al <strong>Journal</strong> <strong>of</strong> <strong>Computer</strong> Technology and Electr<strong>on</strong>ics Engineering (IJCTEE)Volume 1, Issue 3The iterative process c<strong>on</strong>sists <strong>of</strong> the following steps:<strong>Data</strong> cleaning: also known as data cleansing, it is a phase inwhich noise data and irrelevant data are removed from thecollecti<strong>on</strong>.<strong>Data</strong> integrati<strong>on</strong>: at this stage, multiple data sources, <strong>of</strong>tenheterogeneous, may be combined in a comm<strong>on</strong> source.<strong>Data</strong> selecti<strong>on</strong>: at this step, the data relevant to the analysis isdecided <strong>on</strong> and retrieved from the data collecti<strong>on</strong>.<strong>Data</strong> transformati<strong>on</strong>: also known as data c<strong>on</strong>solidati<strong>on</strong>, it is aphase in which the selected data is transformed into formsappropriate for the <strong>mining</strong> procedure.<strong>Data</strong> <strong>mining</strong>: it is the crucial step in which clever techniquesare applied to extract patterns potentially useful.Pattern evaluati<strong>on</strong>: in this step, strictly interesting patternsrepresenting knowledge are identified based <strong>on</strong> given measures.Knowledge representati<strong>on</strong>: is the final phase in which thediscovered knowledge is visually represented to the user. Thisessential step uses visualizati<strong>on</strong> techniques to help usersunderstand and interpret the data <strong>mining</strong> results.It is comm<strong>on</strong> to combine some <strong>of</strong> these steps together. Forinstance, data cleaning and data integrati<strong>on</strong> can be performedtogether as a pre-processing phase to generate a data warehouse.<strong>Data</strong> selecti<strong>on</strong> and data transformati<strong>on</strong> can also be combinedwhere the c<strong>on</strong>solidati<strong>on</strong> <strong>of</strong> the data is the result <strong>of</strong> the selecti<strong>on</strong>,or, as for the case <strong>of</strong> data warehouses, the selecti<strong>on</strong> is d<strong>on</strong>e <strong>on</strong>transformed data.<strong>Data</strong> Mining is…..<strong>Data</strong> <strong>mining</strong> comm<strong>on</strong>ly involves four classes <strong>of</strong> tasks: [3]Clustering - is the task <strong>of</strong> discovering groups and structures inthe data that are in some way or another "similar", without usingknown structures in the data.Clustering is a data <strong>mining</strong> (machine learning) technique used toplace data elements into related groups without advanceknowledge <strong>of</strong> the group definiti<strong>on</strong>s. Popular clusteringtechniques include k-means clustering and expectati<strong>on</strong>maximizati<strong>on</strong> (EM) clustering.Classificati<strong>on</strong> - is the task <strong>of</strong> generalizing known structure toapply to new data. For example, an email program mightattempt to classify an email as legitimate or spam. Comm<strong>on</strong>algorithms include decisi<strong>on</strong> tree learning, nearest neighbor,naive Bayesian classificati<strong>on</strong>, neural networks and supportvector machines.Working with categorical data or a mixture <strong>of</strong> c<strong>on</strong>tinuousnumeric and categorical data? Classificati<strong>on</strong> analysis might suityour needs well. This technique is capable <strong>of</strong> processing a widervariety <strong>of</strong> data than regressi<strong>on</strong> and is growing in popularity.Regressi<strong>on</strong> - Attempts to find a functi<strong>on</strong> which models the datawith the least error.Regressi<strong>on</strong> is the oldest and most well-known statisticaltechnique that the data <strong>mining</strong> community utilizes. Basically,regressi<strong>on</strong> takes a numerical dataset and develops amathematical formula that fits the data. When you're ready touse the results to predict future behavior, you simply take yournew data, plug it into the developed formula and you've got apredicti<strong>on</strong>! The major limitati<strong>on</strong> <strong>of</strong> this technique is that it <strong>on</strong>lyworks well with c<strong>on</strong>tinuous quantitative data (like weight, speedor age). If you're working with categorical data where order isnot significant (like color, name or gender) you're better <strong>of</strong>fchoosing another technique.Regressi<strong>on</strong> is a data <strong>mining</strong> (machine learning) techniqueused to fit an equati<strong>on</strong> to a dataset. The simplest form <strong>of</strong>regressi<strong>on</strong>, linear regressi<strong>on</strong>, uses the formula <strong>of</strong> a straight line(y = mx + b) and determines the appropriate values for m and bto predict the value <strong>of</strong> y based up<strong>on</strong> a given value <strong>of</strong> x.Advanced techniques, such as multiple regressi<strong>on</strong>, allow the use<strong>of</strong> more than <strong>on</strong>e input variable and allow for the fitting <strong>of</strong> morecomplex models, such as a quadratic equati<strong>on</strong>.115

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!