27.06.2013 Views

Volume Two - Academic Conferences

Cristina Wanzeller and Orlando Belo

Table 2 shows the main extracted and specified values for the example problem description. The clickstream data used was a server log file, describing data at page view or access level. Concerning the explicit requisites: (i) for goals, we used all the ones regarding relationships among pages; (ii) we selected two sub-areas of the “quality of service” application area, since both are relevant and the closest ones to the intended actions; (iii) we gave the maximum relative importance to the Interpretability evaluation criterion (followed by the other criteria items).

Table 2: Main values for the features of problem description

Problem description: category, features (and values)

(D) Metadata at dataset level:
    Number of lines (7128) and variables (8); granularity (PageView)
    % of numeric (0.375), categorical (0.625), temporal (0.125) and binary (0) columns
    Type of visitor’s identification (not available) and information recording (not available)
    Access order (true) and access repetition availability (true)
    Access date and hour availability (true)

(D) Metadata at variable level:
    Data type: 3 Numeric (1 DateTime; 2 Integer); 5 Categorical (String)
    Number of distinct values (…) and number of null values (0 for all)
    Semantic category (…)

(T) Problem type:
    Goals (Discover relationships among pages and items; Determine access order of pages and items)
    Application areas (Impact analysis; Content and structure optimization)

(P) Evaluation criteria (value; relative importance):
    Precision (5; 4); Time of reply (5; 2); Interpretability (5; 5); Resources requirements (5; 1); Implementation simplicity (5; 3)
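As a quick check of the dataset-level proportions in Table 2, the following minimal sketch recomputes them from the variable-level data types. It assumes that the DateTime column is counted both as numeric and as temporal, which is the reading that makes the listed values consistent with 8 variables. The column order, the `Boolean` type label and the names `column_types` and `type_proportions` are illustrative assumptions, not part of the MPS system.

```python
from collections import Counter

# Data types as counted in Table 2: 1 DateTime, 2 Integer, 5 String.
# Only the counts come from the table; the order is an assumption.
column_types = ["DateTime", "Integer", "Integer"] + ["String"] * 5

# Assumed grouping of types into the categories of the dataset-level metadata.
NUMERIC = {"DateTime", "Integer"}   # DateTime is counted among the numeric columns
CATEGORICAL = {"String"}
TEMPORAL = {"DateTime"}
BINARY = {"Boolean"}                # hypothetical label; no such column in the example


def type_proportions(types):
    """Return the fraction of columns falling into each type category."""
    counts = Counter(types)
    total = len(types)

    def share(group):
        return sum(counts[t] for t in group) / total

    return {
        "numeric": share(NUMERIC),
        "categorical": share(CATEGORICAL),
        "temporal": share(TEMPORAL),
        "binary": share(BINARY),
    }


proportions = type_proportions(column_types)
print(proportions)
```

Because the temporal column is also counted as numeric, the proportions do not need to sum to one.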

The solution proposed by the MPS system is presented in Figure 2. Using a similarity threshold of 0.5, the retrieve step finds eight cases, grouped into three model categories. Reading the figure from right to left, the proposed solution includes the mining plans of the three DM functions and model categories (columns IV and V). The average values of the evaluation criteria are presented in column III. Each model category is instantiated with its most similar case, providing: in (II), its similarity to the target; on the left side of (I), the hyperlinked number of the most similar case, giving access to its detailed information; on the right side of (I), combo boxes giving access to further information about the other (less similar) cases of the model category. Each combo box may be expanded to see a case’s number and its similarity to the target, as well as to access further information about the case.
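The retrieve-and-group behaviour described above can be sketched as follows. This is only an illustration under stated assumptions, not the MPS implementation: the case identifiers, category names and similarity values are hypothetical, the similarity scores are taken as already computed, and the threshold is assumed to be inclusive.

```python
from collections import defaultdict


def retrieve(cases, threshold=0.5):
    """Keep cases at or above the threshold, grouped by model category.

    Each group is sorted from the most to the least similar case.
    """
    groups = defaultdict(list)
    for case in cases:
        if case["similarity"] >= threshold:   # inclusive threshold is an assumption
            groups[case["category"]].append(case)
    for members in groups.values():
        members.sort(key=lambda c: c["similarity"], reverse=True)
    return dict(groups)


def propose(groups):
    """For each category, return (most similar case, remaining cases)."""
    return {cat: (members[0], members[1:]) for cat, members in groups.items()}


# Hypothetical case base; none of these values come from the paper.
cases = [
    {"id": "c1", "category": "association rules", "similarity": 0.82},
    {"id": "c2", "category": "association rules", "similarity": 0.64},
    {"id": "c3", "category": "sequence analysis", "similarity": 0.77},
    {"id": "c4", "category": "clustering", "similarity": 0.55},
    {"id": "c5", "category": "clustering", "similarity": 0.41},  # below threshold
]

proposal = propose(retrieve(cases))
for category, (best, others) in proposal.items():
    print(category, best["id"], [c["id"] for c in others])
```

The first element of each pair plays the role of the case shown with its hyperlink in (I) and (II), and the second element corresponds to the less similar cases reachable through the combo boxes.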

[Figure 2 image not reproduced; its columns are labelled I to V, from left to right.]

Figure 2: A problem solving solution description excerpt

The proposed plans are all appropriate. Association rules are frequently used to discover relations among learning activities, sequential analysis to extract interesting patterns in the sequences of on-line activities, and clustering to group similar access behaviours (Zaiane 2001). The hierarchical clustering model (case 6) has a substantially lower similarity value and is a different kind of clustering. By default and as intended, the system gives emphasis to the similarity between datasets. Based on the experimental work, this is considered a good result, since dataset characteristics are the most important features in the choice of mining methods. The analyses from cases 8 and 9 were performed using the datasets most similar to the target. The selection of the goal “Determine access order of pages and items” also favours the sequential model. The association rules model represents a good option, given its better compromise between precision and coverage. Still, the inclusion of the hierarchical clustering plan may be useful, although the model is less suited to the problem than association rules, which are more informative and accurate (i.e. they provide rules with support and confidence).
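To make the two rule measures concrete, the sketch below computes support and confidence for a rule over a handful of page-view sessions, using the standard definitions. The sessions and page names are invented for illustration and are not taken from the example dataset.

```python
def support(sessions, items):
    """Fraction of sessions that contain every page in `items`."""
    items = set(items)
    return sum(items <= session for session in sessions) / len(sessions)


def confidence(sessions, antecedent, consequent):
    """Support of the whole rule divided by the support of its antecedent."""
    antecedent_support = support(sessions, antecedent)
    if antecedent_support == 0:
        return 0.0  # rule never applies; returning 0.0 is a convention chosen here
    both = set(antecedent) | set(consequent)
    return support(sessions, both) / antecedent_support


# Hypothetical sessions, each reduced to the set of pages it visited.
sessions = [
    {"home", "courses", "forum"},
    {"home", "courses"},
    {"home", "forum"},
    {"courses", "quiz"},
]

rule_support = support(sessions, {"home", "courses"})
rule_confidence = confidence(sessions, {"home"}, {"courses"})
print(rule_support, rule_confidence)
```

Here the rule "home → courses" holds in two of the four sessions, and in two of the three sessions that contain "home".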

4.3 Exemplifying learning<br />

The description of WUM processes encompasses data from several dispersed and heterogeneous sources. Sources include databases or files, containing “unlimited” datasets, pre-processing and KD tools,
