24.11.2014 Views

Database Marketing, Business Intelligence and Knowledge Discovery


Note: Using material from Tan / Steinbach / Kumar (2005), Introduction to Data Mining, Addison Wesley; and Cios / Pedrycz / Swiniarski / Kurgan (2007), Data Mining: A Knowledge Discovery Approach, Springer.

Engineering and Technology Management


Database Marketing

Database marketing is a form of direct marketing that uses databases of customers or potential customers to generate personalized communications promoting a product or service.

The distinction between direct and database marketing stems primarily from the attention paid to the analysis of data. Database marketing emphasizes the use of statistical techniques to develop models of customer behavior, which are then used to select customers for communications.
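The idea of modeling customer behavior and then selecting customers can be illustrated with a minimal Python sketch. It is not taken from the slides: the segment names, the past-campaign records, and the 0.3 threshold are all hypothetical, and a per-segment response rate stands in for the statistical model.

```python
from collections import defaultdict

# Hypothetical past-campaign records: (customer segment, responded?)
past_campaign = [
    ("frequent", True), ("frequent", True), ("frequent", False),
    ("occasional", True), ("occasional", False), ("occasional", False),
    ("lapsed", False), ("lapsed", False), ("lapsed", False), ("lapsed", True),
]

def response_rates(records):
    """Estimate the response rate of each segment from past data."""
    sent = defaultdict(int)
    responded = defaultdict(int)
    for segment, did_respond in records:
        sent[segment] += 1
        responded[segment] += int(did_respond)
    return {seg: responded[seg] / sent[seg] for seg in sent}

def select_customers(customers, rates, min_rate):
    """Keep customers whose segment's estimated response rate is at least min_rate."""
    return [cid for cid, seg in customers if rates.get(seg, 0.0) >= min_rate]

rates = response_rates(past_campaign)
customers = [(101, "frequent"), (102, "lapsed"), (103, "occasional"), (104, "frequent")]
selected = select_customers(customers, rates, min_rate=0.3)
```

In practice the rate table would be replaced by a fitted model (for example a regression on customer attributes), but the select-by-score step stays the same.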


Database Marketing

Classic database marketing
– Customer list (in-house or bought)
– Simple model based on past data
– E-mails, coupons, offers

Database marketing 2.0
– Integrated data sources (internal, external) and warehouses
– Complex models (data mining, social network analysis)
– Communication channels include social media, direct web interactions (recommender systems), and many more


Business Intelligence

Encompasses architectures, tools, applications, databases, and methodologies for the collection, integration, analysis, and presentation of business information.

The purpose of business intelligence is to support better business decision making.


BI Components and Architecture

[Figure: diagram of BI components and architecture]


Transactional vs. Analytical Data Processing

Transactional processing takes place in operational systems that provide the organization with the capability to perform business transactions and produce transaction reports. These systems are built primarily for fast and efficient processing of routine, repetitive data.

A supplementary activity to transaction processing is analytical processing, which involves the analysis of accumulated data. Analytical processing, sometimes referred to as business intelligence, includes data mining, decision support systems (DSS), querying, and other analysis activities. These analyses place strategic information in the hands of decision makers to enhance productivity and support better decisions, leading to greater competitive advantage.
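The contrast between the two kinds of processing can be sketched with Python's built-in sqlite3 module. The `sales` table and its rows are hypothetical, and a real deployment would normally keep the analytical workload in a separate warehouse rather than in the operational database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, product TEXT, amount REAL)"
)

# Transactional processing: many small, routine writes, one per business event.
orders = [
    ("East", "widget", 20.0),
    ("East", "gadget", 35.0),
    ("West", "widget", 20.0),
    ("West", "widget", 20.0),
]
with conn:  # commits on success, rolls back on error
    conn.executemany(
        "INSERT INTO sales (region, product, amount) VALUES (?, ?, ?)", orders
    )

# Analytical processing: a read-only query over the accumulated data.
revenue_by_region = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
conn.close()
```

The inserts touch one record at a time, while the aggregate query scans the whole history to answer a business question.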


Business Analytics

Business analytics is how organizations gather and interpret data in order to make better business decisions and to optimize business processes. In businesses, analytics (alongside data access and reporting) represents a subset of business intelligence (BI).

Analytics are defined as the extensive use of data, statistical and quantitative analysis, explanatory and predictive modeling, and fact-based decision-making.

Analytics may be used as input for human decisions, but there are also examples of fully automated decisions that require minimal human intervention.




Knowledge Discovery

The process of automatically searching large volumes of data for patterns that can be considered knowledge about the data.

Evolutionary stages (business question, enabling technologies, characteristic)

Data collection (1980s)
– Business question: What was my total revenue in the last 5 years?
– Enabling technologies: computers, tapes, disks
– Characteristic: retrospective, static data delivery

Data access (1980s)
– Business question: What were unit sales in New England last March?
– Enabling technologies: relational databases (RDBMS), structured query language (SQL)
– Characteristic: retrospective, dynamic data delivery at record level

Data warehousing and decision support (early 1990s)
– Business question: What were the sales in region A by product, by salesperson?
– Enabling technologies: OLAP, multidimensional databases, data warehouses
– Characteristic: retrospective, proactive data delivery at multiple levels

Intelligent data mining (late 1990s)
– Business question: What's likely to happen to the Boston unit's sales next month? Why?
– Enabling technologies: advanced algorithms, multiprocessor computers, massive databases
– Characteristic: prospective, proactive information delivery

Advanced intelligent systems; complete integration (2000-2004)
– Business question: What is the best plan to follow? How did we perform compared to metrics?
– Enabling technologies: neural computing, advanced AI models, complex optimization, web services
– Characteristic: proactive, integrative; multiple business partners


Data Mining

Non-trivial extraction of implicit, previously unknown, and potentially useful information from data.

Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.

– Prediction methods: use some variables to predict unknown or future values of other variables.
– Description methods: find human-interpretable patterns that describe the data.
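The difference between prediction and description methods can be shown with a small Python sketch. Everything in it is an illustrative assumption: the spend/sales history, the shopping baskets, and the choice of a least-squares line for prediction and item-pair counting for description.

```python
from collections import Counter
from itertools import combinations
from statistics import mean

# Prediction: use one variable (spend) to estimate an unknown value of another (sales).
history = [(1.0, 12.0), (2.0, 14.0), (3.0, 16.0), (4.0, 18.0)]  # hypothetical (spend, sales)

def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    xs, ys = zip(*points)
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in points)
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

slope, intercept = fit_line(history)
predicted_sales = slope * 5.0 + intercept

# Description: find a human-interpretable pattern, here the item pair bought together most often.
baskets = [{"bread", "milk"}, {"bread", "butter", "milk"}, {"butter", "jam"}]
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)
most_common_pair, support = pair_counts.most_common(1)[0]
```

The first half produces a number for an unseen case; the second half produces a statement about the data that a person can read and act on.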


Text Mining

The application of data mining to non-structured or less-structured text files.

Text mining helps organizations to (1) find the "hidden" content of documents, including additional useful relationships, and (2) group documents by common themes (e.g., identify all the customers of an insurance firm who have similar complaints).
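Grouping documents by common themes could look like the following Python sketch. It is deliberately simplistic and rests on assumptions not in the slides: the stop-word list, the sample complaints, the Jaccard similarity measure, and the 0.3 threshold are all hypothetical choices.

```python
import re

STOP_WORDS = {"the", "a", "my", "was", "is", "to", "for", "and", "of", "on"}

def keywords(text):
    """Lower-case the text and keep the words that are not stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}

def jaccard(a, b):
    """Share of keywords two documents have in common (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def group_by_theme(docs, threshold=0.3):
    """Greedy grouping: a document joins the first group whose first member is similar enough."""
    groups = []  # each group is a list of document indices
    for i, doc in enumerate(docs):
        for group in groups:
            if jaccard(keywords(doc), keywords(docs[group[0]])) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups

complaints = [
    "My claim payment was late",
    "Late payment for my claim",
    "The agent was rude on the phone",
    "Rude agent on the phone",
]
groups = group_by_theme(complaints)
```

Real text mining tools use richer representations (stemming, term weighting, proper clustering), but the shape of the task is the same: turn text into features, then group by similarity.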


Web Mining

The application of data mining techniques to discover actionable and meaningful patterns, profiles, and trends from web resources.

Web mining is used in the following areas: information filtering, mining of web-access logs for analyzing usage, assisted browsing, ...
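Mining web-access logs for usage analysis can be sketched in a few lines of Python. The log lines below are invented and assumed to follow the Common Log Format; the regular expression covers only that layout, and counting successful page requests is just one simple usage statistic.

```python
import re
from collections import Counter

# Assumes Common Log Format: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*" (\d{3}) \S+$'
)

log_lines = [  # hypothetical sample entries
    '10.0.0.1 - - [24/Nov/2014:10:00:00 +0000] "GET /products HTTP/1.1" 200 512',
    '10.0.0.2 - - [24/Nov/2014:10:00:05 +0000] "GET /products HTTP/1.1" 200 512',
    '10.0.0.1 - - [24/Nov/2014:10:00:09 +0000] "GET /cart HTTP/1.1" 200 128',
    '10.0.0.3 - - [24/Nov/2014:10:00:12 +0000] "GET /missing HTTP/1.1" 404 0',
]

def page_hits(lines):
    """Count successful (status 200) requests per requested path."""
    hits = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and match.group(4) == "200":
            hits[match.group(3)] += 1
    return hits

hits = page_hits(log_lines)
```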


Data Life Cycle Process

[Figure: diagram of the data life cycle process]


Knowledge Discovery Process

The knowledge discovery process (KDP) forms the overall process for extracting new knowledge from data.
– a sequence of steps (with feedback loops) that should be followed to discover new knowledge (e.g., patterns)

A well-defined KDP model is a logical, cohesive, well-thought-out structure and approach that can be presented to decision-makers who may have difficulty understanding the need, value, and mechanics behind a KDP.
– it helps to ensure that the end product is useful for the user/owner of the data
– KD projects require a significant project management effort that needs to be grounded in a solid framework
– KD should follow other disciplines that have established models


Knowledge Discovery Process

KDP is defined as the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. It
– consists of many steps (one is Data Mining), each attempting to complete a particular discovery task and each accomplished by the application of a DM method
– concerns the entire KD process, including how the data is stored and accessed, how to use efficient and scalable algorithms to analyze large datasets, how to interpret and visualize the results, and how to model and support interaction between human and machine
– concerns support for learning and analyzing the application domain


Overview of the Knowledge Discovery Process

– consists of multiple steps, which are executed in a sequence
– the next step is initiated upon successful completion of the previous step, and requires the result generated by the previous step as its input
– it stretches from the task of understanding the project domain and data, through data preparation and analysis, to evaluation, understanding, and application of the generated results
– it is iterative, i.e., it includes feedback loops that are triggered by revisions

[Figure: Input data (database, images, video, semi-structured data, etc.) -> STEP 1 -> STEP 2 -> ... -> STEP n-1 -> STEP n -> Knowledge (patterns, rules, clusters, classification, associations, etc.)]
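The sequential part of this overview, where each step consumes the result of the previous one, can be sketched in Python as a chain of functions. The three step functions are hypothetical stand-ins, and the sketch leaves out the feedback loops that make the real process iterative.

```python
from functools import reduce

def run_kdp(steps, input_data):
    """Run the steps in order; each step receives the previous step's result."""
    return reduce(lambda result, step: step(result), steps, input_data)

# Hypothetical stand-ins for real KDP steps.
def select_valid(records):
    """Drop missing records."""
    return [r for r in records if r is not None]

def scale(records):
    """Rescale values relative to the largest one."""
    top = max(records)
    return [r / top for r in records]

def find_pattern(records):
    """Produce a tiny piece of 'knowledge': how many values exceed 0.5."""
    return {"above_half": sum(r > 0.5 for r in records)}

knowledge = run_kdp([select_valid, scale, find_pattern], [4, None, 10, 8])
```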


Knowledge Discovery Process Models

Popular KDP models include
– Nine-step model by Fayyad and colleagues
  • academic
– CRISP-DM (CRoss-Industry Standard Process for Data Mining) model
  • industrial
– Six-step KDP model by Cios and colleagues
  • hybrid (academic/industrial)


Knowledge Discovery Process Models

Nine-step model by Fayyad and colleagues
– Developing an Understanding of the Application Domain
  It includes learning the relevant prior knowledge and the goals of the end-user of the discovered knowledge.
– Creating a Target Data Set
  It selects a subset of variables (attributes) and data points (examples) that will be used to perform discovery tasks. It usually includes querying the existing data to select the desired subset.
– Data Cleaning and Preprocessing
  It consists of removing outliers, dealing with noise and missing values in the data, and accounting for time sequence information and known changes.
– Data Reduction and Projection
  It consists of finding useful attributes by applying dimension reduction and transformation methods, and finding an invariant representation of the data.
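The Data Cleaning and Preprocessing step can be illustrated with a short Python sketch. The readings are invented, and the two rules used here (median imputation for missing values, then dropping points outside 1.5 interquartile ranges of the quartiles) are common conventions chosen for illustration, not something the nine-step model prescribes.

```python
from statistics import median, quantiles

def clean(values, k=1.5):
    """Fill missing values with the median, then drop points outside the IQR fences."""
    known = [v for v in values if v is not None]
    fill = median(known)
    filled = [fill if v is None else v for v in values]
    q1, _, q3 = quantiles(known, n=4)  # quartiles of the observed values
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in filled if low <= v <= high]

# Hypothetical readings with one missing value (None) and one outlier (250.0).
raw = [10.0, 12.0, None, 11.0, 13.0, 250.0, 12.0, 11.0, 12.0, 13.0]
cleaned = clean(raw)
```

Whether an extreme value is an error or a genuine observation is a domain question, so in a real project the fences would be reviewed with domain experts.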


Knowledge Discovery Process Models

– Choosing the Data Mining Task
  It matches the goals defined in step 1 with a particular DM method, such as classification, regression, clustering, etc.
– Choosing the Data Mining Algorithm
  It selects methods for searching for patterns in the data, and decides which models and parameters of the chosen methods may be appropriate.
– Data Mining
  It generates patterns in a particular representational form, such as classification rules, decision trees, regression models, trends, etc.
– Interpreting Mined Patterns
  It usually involves visualization of the extracted patterns and models, and visualization of the data based on the extracted models.
– Consolidating Discovered Knowledge
  It consists of incorporating the discovered knowledge into the performance system, and documenting and reporting it to the interested parties. It may also include checking and resolving potential conflicts with previously believed knowledge.


Knowledge Discovery Process Models

CRISP-DM (CRoss-Industry Standard Process for Data Mining) model
– designed in the late 1990s by four companies: Integral Solutions Ltd. (provider of commercial data mining solutions), NCR (database provider), DaimlerChrysler (automobile manufacturer), and OHRA (insurance company)
– the CRISP-DM Special Interest Group was created to support the developed process model
  • it includes over 300 users and tool/service providers
– the model consists of six steps


Knowledge Discovery Process Models

CRISP-DM model
– Business Understanding
  It focuses on understanding objectives and requirements from a business perspective. It also converts them into a DM problem definition, and designs a preliminary project plan to achieve the objectives.
  It is further broken into several sub-steps:
  – determination of business objectives
  – assessment of situation
  – determination of DM goals, and
  – generation of project plan.
– Data Understanding
  It starts with an initial data collection and familiarization with the data. Specific aims include identification of data quality problems, discovery of initial insights into the data, and detection of interesting data subsets.
  It is further broken down into:
  – collection of initial data
  – description of data
  – exploration of data, and
  – verification of data quality.


Knowledge Discovery Process Models

CRISP-DM model
– Data Preparation
  It covers all activities needed to construct the final dataset, which constitutes the data that will be fed into DM tool(s) in the next step. It includes table, record, and attribute selection, data cleaning, construction of new attributes, and data transformation.
  This step is divided into the following sub-steps:
  – selection of data
  – cleansing of data
  – construction of data
  – integration of data, and
  – formatting of data.


Knowledge Discovery Process Models

CRISP-DM model
– Modeling
  It selects and applies various modeling techniques. It usually involves the use of several methods for the same DM problem type, and calibration of their parameters to optimal values. Since some methods may require a specific format for input data, a return to the previous step is often necessary.
  This step is subdivided into:
  – selection of modeling technique(s)
  – generation of test design
  – creation of models, and
  – assessment of generated models.
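The Modeling sub-steps (test design, model creation, parameter calibration, assessment) can be sketched in Python with a toy classifier. The data, the hold-out split, the one-dimensional k-nearest-neighbor method, and the candidate values of k are all hypothetical choices made for illustration; CRISP-DM itself does not prescribe any of them.

```python
def knn_predict(train, x, k):
    """Majority label among the k training points nearest to x (1-D feature)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

def accuracy(train, test, k):
    """Fraction of test records whose label is predicted correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)

# Hypothetical labelled data: (feature value, class label). One training point is noisy.
train = [(1.0, "low"), (1.2, "low"), (1.4, "low"), (1.15, "high"),
         (3.0, "high"), (3.2, "high"), (3.4, "high")]

# Test design: a small hold-out set kept apart from training.
test = [(1.1, "low"), (3.1, "high")]

# Calibration and assessment: try several parameter values, keep the best-scoring one.
scores = {k: accuracy(train, test, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
```

With k = 1 the noisy training point misleads the classifier, which is why assessing several parameter settings on held-out data matters.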


Knowledge Discovery Process Models

CRISP-DM model
– Evaluation
  After one or more models of high quality from a data analysis perspective have been built, the model is evaluated from a business objective perspective. The model is thoroughly evaluated, and the steps executed to construct it are reviewed. A key objective is to determine whether there are important business issues that have not been sufficiently considered. At the end of this phase, a decision on the use of the DM results should be reached.
  The key sub-steps in this step include:
  – evaluation of the results
  – process review, and
  – determination of the next step.


Knowledge Discovery Process Models

CRISP-DM model
– Deployment
  It involves organization and presentation of the discovered knowledge in a way that the customer can use. Depending on the requirements, this can be as simple as generating a report or as complex as implementing a repeatable KDP.
  This step is further divided into the following sub-steps:
  – planning of the deployment
  – planning of the monitoring and maintenance
  – generation of final report, and
  – review of the process.


Knowledge Discovery Process Models

CRISP-DM model
– is characterized by an easy-to-understand vocabulary and good documentation
– acknowledges the strong iterative nature of the process, with loops between several of the steps
– is a successful and extensively applied model, mainly because of its grounding in practical, industrial, real-world Knowledge Discovery experience


Knowledge Discovery Process Models

Six-step model by Cios and colleagues
– developed based on the CRISP-DM model by adapting it to academic research; main differences and extensions include:
  • providing a more general, research-oriented description of the steps
  • introducing the Data Mining step instead of the Modeling step
  • introducing several new explicit feedback mechanisms. The CRISP-DM model has only three major feedback sources, while this model has more detailed feedback mechanisms
  • modification of the last step; the knowledge discovered for a particular domain may be applied in other domains
– includes six steps


Knowledge Discovery Process Models

Six-step model

[Figure: flow diagram of the six-step model. Boxes, in order: Understanding of the Problem Domain; Understanding of the Data; Preparation of the Data; Data Mining; Evaluation of the Discovered Knowledge; Use of the Discovered Knowledge. Input: input data (database, images, video, semi-structured data, etc.). Output: knowledge (patterns, rules, clusters, classification, associations, etc.). Additional branch: Extend knowledge to other domains.]


Knowledge Discovery Process Models

Six-step model by Cios and colleagues
– Understanding of the Problem Domain
  It involves working closely with domain experts to define the problem and determine the project goals, identifying key people, and learning about current solutions to the problem. It also involves learning domain-specific terminology. A description of the problem, including its restrictions, is prepared. Finally, project goals are translated into DM goals, and an initial selection of the DM tools to be used later in the process is performed.
– Understanding of the Data
  It includes collection of sample data and deciding which data, including its format and size, will be needed. Background knowledge can be used to guide these efforts. Data is checked for completeness, redundancy, missing values, plausibility of attribute values, etc. Finally, the step includes verification of the usefulness of the data with respect to the DM goals.


Knowledge Discovery Process Models

– Preparation of the Data
  It concerns deciding which data will be used as input for DM methods in the next step. It involves sampling, running correlation and significance tests, and data cleaning, which includes checking the completeness of data records, removing or correcting for noise and missing values, etc. The cleaned data may be further processed by feature selection and extraction algorithms (to reduce dimensionality), by derivation of new attributes (say, by discretization), and by summarization of data (data granularization). The end results are data that meet the specific input requirements of the DM tools selected in step 1.
– Data Mining
  It involves using various DM methods to derive knowledge from preprocessed data.
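Derivation of a new attribute by discretization, mentioned in the Preparation of the Data step, can be sketched in Python as equal-width binning. The `ages` values and the choice of three bins are hypothetical, and equal-width binning is only one of several discretization schemes.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index 0..n_bins-1 using bins of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:  # all values identical: everything falls into one bin
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]

ages = [18, 25, 31, 47, 52, 66, 78]  # hypothetical attribute values
age_bins = equal_width_bins(ages, n_bins=3)
```

The resulting bin index can be stored as a new categorical attribute for DM methods that expect discrete input.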


Knowledge Discovery Process Models

– Evaluation of the Discovered Knowledge
  It includes understanding the results, checking whether the discovered knowledge is novel and interesting, interpretation of the results by domain experts, and checking the impact of the discovered knowledge. Only the approved models are retained, and the entire process is revisited to identify which alternative actions could have been taken to improve the results. A list of errors made in the process is prepared.
– Use of the Discovered Knowledge
  It consists of planning where and how the discovered knowledge will be used. The application area in the current domain may be extended to other domains. A plan to monitor the implementation of the discovered knowledge is created, and the entire project is documented. Finally, the discovered knowledge is deployed.


Knowledge Discovery Process Models

Six-step model by Cios and colleagues
– this model identifies and describes explicit feedback loops
  • from the Understanding of the Data to the Understanding of the Problem Domain step; the loop is caused by the need for additional domain knowledge to better understand the data
  • from the Preparation of the Data to the Understanding of the Data step; the loop is caused by the need for additional or more specific information about the data to guide the choice of data preprocessing algorithms
  • from the Data Mining to the Understanding of the Problem Domain step; the reason could be unsatisfactory results generated by the selected DM methods, requiring modification of the project's goals
  • from the Data Mining to the Understanding of the Data step; the most common reason is poor understanding of the data, which results in incorrect selection of a DM method and its subsequent failure


Knowledge Discovery Process Models

  • from the Data Mining to the Preparation of the Data step; the loop is caused by the need to improve data preparation. This is often caused by the specific requirements of the DM method used, which may not have been known during the Preparation of the Data step
  • from the Evaluation of the Discovered Knowledge to the Understanding of the Problem Domain step; the most common cause is invalidity of the discovered knowledge. Possible reasons include incorrect understanding or interpretation of the domain, and incorrect design or understanding of problem restrictions, requirements, or goals
  • from the Evaluation of the Discovered Knowledge to the Data Mining step; this loop is executed when the discovered knowledge is not novel, interesting, or useful. The least expensive solution is to choose a different DM tool and repeat the DM step
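The seven feedback loops listed on these two slides can be written down as a small transition table in Python. This is only one possible encoding, assumed here for illustration: steps are numbered 0 to 5 in the model's order, and `allowed_next` is a hypothetical helper name.

```python
STEPS = [
    "Understanding of the Problem Domain",
    "Understanding of the Data",
    "Preparation of the Data",
    "Data Mining",
    "Evaluation of the Discovered Knowledge",
    "Use of the Discovered Knowledge",
]

# Feedback loops as (from step, back to step) index pairs into STEPS.
FEEDBACK = {(1, 0), (2, 1), (3, 0), (3, 1), (3, 2), (4, 0), (4, 3)}

def allowed_next(step):
    """Steps reachable from `step`: the next step forward plus any feedback targets."""
    forward = [step + 1] if step + 1 < len(STEPS) else []
    back = sorted(target for source, target in FEEDBACK if source == step)
    return forward + back

for i, name in enumerate(STEPS):
    print(name, "->", [STEPS[j] for j in allowed_next(i)])
```

Data Mining has the most ways back, which matches the slides: poor results at that step can point to problems in any of the three earlier steps.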


Comparison of Knowledge Discovery Process Models

Fayyad et al.
– Domain of origin: academic
– Number of steps: 9
– Steps: 1. Developing an Understanding of the Application Domain; 2. Creating a Target Data Set; 3. Data Cleaning and Preprocessing; 4. Data Reduction and Projection; 5. Choosing the Data Mining Task; 6. Choosing the Data Mining Algorithm; 7. Data Mining; 8. Interpreting Mined Patterns; 9. Consolidating Discovered Knowledge
– Notes: the most popular model; provides a detailed technical description with respect to data analysis, but lacks business aspects
– Supporting software: commercial system MineSet™
– Reported application domains: medicine, engineering, production, e-business, software

Cios et al.
– Domain of origin: hybrid (academic/industry)
– Number of steps: 6
– Steps: 1. Understanding of the Problem Domain; 2. Understanding of the Data; 3. Preparation of the Data; 4. Data Mining; 5. Evaluation of the Discovered Knowledge; 6. Use of the Discovered Knowledge
– Notes: draws from both academic and industrial models; emphasizes iterative aspects; identifies and describes explicit feedback loops
– Supporting software: N/A
– Reported application domains: medicine, software

CRISP-DM
– Domain of origin: industry
– Number of steps: 6
– Steps: 1. Business Understanding; 2. Data Understanding; 3. Data Preparation; 4. Modeling; 5. Evaluation; 6. Deployment
– Notes: uses easy-to-understand vocabulary; has good documentation
– Supporting software: commercial system Clementine®
– Reported application domains: medicine, engineering, marketing, sales


Comparison of the Knowledge Discovery Process Models

A very important aspect of the KDP is the relative time spent to complete each of the steps
– it enables precise scheduling
– estimates proposed by both researchers and practitioners are shown below
  • specific estimated values depend on many factors, such as existing knowledge about the considered project domain, skill level of human resources, complexity of the problem, etc.
  • the data preparation step is by far the most time-consuming step

[Figure: bar chart of relative effort [%] (axis from 0 to 70) for each KDDM step (Understanding of Domain, Understanding of Data, Preparation of Data, Data Mining, Evaluation of Results, Deployment of Results), comparing the Cabena et al., Shearer, and Cios and Kurgan estimates]
