13.07.2015 Views

An Abstract.pdf - DSpace@UM



An Abstract

Data preparation is an essential part of data mining, which consists of preparing, surveying and modelling data. It prepares the data as well as the miner so that when the prepared data is used, better and faster models are produced. Much of this important step in data mining can be automated, which led to the development of a data preparation tool (the DP Tool) for data mining.

Data preparation involves looking at the data variables individually as well as looking at the set of data variables as a whole. Certain variable features cause problems in data mining. They include “sparse” variables, “compact” variables, monotonic variables, and outliers. For some modelling methods, these problems may affect the speed of modelling and/or the value of the model. Fortunately, techniques are available to solve them before the data is mined, and some are used when performing simple data transformations on a data set using the DP Tool.

When preparing a data set, two areas need attention: getting enough data and exposing its information content. Getting enough data is known as capturing data set variability. Estimated confidence measures of each variable are compared to the computed ones to ensure that a particular data collection set has enough data to build useful models. In the process, a variable status report is prepared. The data collection set may contain very complex relationships, which are often known beforehand by the business expert. Giving the mining tool such knowledge to begin with speeds up its process. One such case is the aggregation of transaction details to the customer level, which is performed when building a data set.

The DP Tool is based on a visual mining project carried out by a cellular phone company. The project aimed to identify the customer churn rate and to determine what actions would reduce it. Descriptive models provide not only the trend of customer churn but also the profiles of churned customers. The project data sets serve as test data for the data preparation tool.

Before any data can be prepared, it has to be extracted by downloading it from its sources into an exploratory database. The DP Tool provides a module to extract online data from different database servers, both local and remote. Another module provides scrollable editing for different data “types”, such as first-load data, which are reloaded after corrections. Table records can be edited, added or deleted. When the collection data are cleaned and verified, a data set is created. The data set then undergoes several kinds of data transformation, which are categorised into discrete items, continuous items and computed items. A housekeeping module known as database maintenance is also provided.

A client/server implementation of a two-tier “plus many” architecture is used to develop the data preparation tool. The client and server reside on the same host, a laptop. The main server is linked to other server instances for data access. SQL Server 2000 provides high reliability, high security, and a powerful SQL programming language, which is used to implement all the data preparation tasks. Another development tool used is JBuilder (Borland), which provides a visual programming environment to build the user-friendly interface, consisting of frames and dialogs. The Java user-interface classes reside in the client while the data preparation stored procedures reside in the server database.
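
The aggregation of transaction details to the customer level mentioned above can be illustrated with a minimal sketch. It is not the DP Tool's stored procedure: the abstract does not give the schema, so the table names (`call_transactions`, `customer_summary`), the columns, and the toy rows are all hypothetical, and an in-memory SQLite database stands in for SQL Server 2000 so that the example is self-contained.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a few made-up transaction rows.

    The table and column names are assumptions for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE call_transactions (customer_id INTEGER, minutes REAL)"
    )
    conn.executemany(
        "INSERT INTO call_transactions VALUES (?, ?)",
        [(1, 10.0), (1, 20.0), (2, 5.0)],
    )
    return conn


def aggregate_to_customer_level(conn):
    """Roll transaction-level rows up to one summary row per customer."""
    conn.execute("DROP TABLE IF EXISTS customer_summary")
    conn.execute(
        """
        CREATE TABLE customer_summary AS
        SELECT customer_id,
               COUNT(*)     AS n_calls,        -- number of transactions
               SUM(minutes) AS total_minutes,  -- total usage
               AVG(minutes) AS avg_minutes     -- mean usage per transaction
        FROM call_transactions
        GROUP BY customer_id
        """
    )
    return conn.execute(
        "SELECT customer_id, n_calls, total_minutes, avg_minutes "
        "FROM customer_summary ORDER BY customer_id"
    ).fetchall()


if __name__ == "__main__":
    for row in aggregate_to_customer_level(build_demo_db()):
        print(row)
```

Doing the roll-up before mining hands the modelling step one row per customer, which is the kind of relationship a business expert already knows and the mining tool would otherwise have to rediscover.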
