
WWW/Internet - Portal do Software Público Brasileiro

IADIS International Conference WWW/Internet 2010

1990). Generally speaking, clustering methods for numerical data have been viewed in opposition to the conceptual clustering methods developed in Artificial Intelligence.

Referring to a specific CoP, a cluster is a collection of users that share similar ratings on items of the same workspace. For example, let SP1, SP2, ..., SPk be the k workspaces used by a community A. We build an array X of size n×n (n is the number of users), where the cell Xij denotes the correlation between user i and user j. The correlation can be either sim′(ij) or sim′′(ij), which results in a symmetric undirected or a directed array, respectively. After the construction of these arrays, a unified array for the whole set of workspaces is built by calculating the average value of each cell.

Regarding the clustering procedure, and depending on the nature of the data gathered, two different approaches are followed with respect to the nature of the arrays: one concerning symmetric undirected arrays and one concerning directed arrays.

In the first case (symmetric undirected arrays), an algorithm for hierarchical clustering is applied. Hierarchical clustering partitions objects into optimally homogeneous groups (Johnson, 1967). There are two categories of hierarchical algorithms: those that repeatedly merge two smaller clusters into a larger one, and those that split a larger cluster into smaller ones. In the MinMax Cut algorithm (Ding et al., 2001), given n data objects and the pairwise similarity matrix W = (wi,j), where wi,j is the similarity weight between i and j, the main goal is to partition the data into two clusters A and B. The principle of this algorithm is to minimize the similarity between clusters and to maximize the similarity within each cluster.
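As a minimal sketch of these two steps, the per-workspace similarity arrays can be averaged into the unified array, and the min-max cut principle can be evaluated for a candidate two-way partition. The code below uses NumPy; the function names, the boolean-mask representation of the partition, and the toy matrices are our own illustration, not part of the framework:

```python
import numpy as np

def unified_similarity(workspace_matrices):
    """Average the per-workspace n x n user-similarity arrays
    into the unified array described above (cell-wise mean)."""
    return np.mean(np.stack(workspace_matrices), axis=0)

def mmcut(W, in_a):
    """Min-max cut objective for a two-way partition of W.

    W: pairwise similarity matrix (w_ij).
    in_a: boolean mask, True for users assigned to cluster A.
    Returns s(A,B)/s(A,A) + s(A,B)/s(B,B), where s(.,.) sums
    the similarity weights between the two index sets.
    """
    A = np.flatnonzero(in_a)
    B = np.flatnonzero(~in_a)
    s_ab = W[np.ix_(A, B)].sum()
    s_aa = W[np.ix_(A, A)].sum()
    s_bb = W[np.ix_(B, B)].sum()
    return s_ab / s_aa + s_ab / s_bb
```

A partition that cuts few, weak links between two internally dense groups yields a small objective value, which is exactly what the algorithm tries to achieve.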
The similarity between clusters A and B is defined as the cut size

s(A,B) = Σ_{i∈A, j∈B} w_{i,j}.

The similarity (self-similarity) within a cluster A is the sum of all similarity weights within A: s(A,A). Consequently, the algorithm has to minimize s(A,B) and maximize s(A,A) and s(B,B), which is formulated by the min-max cut function MMcut(A,B):

MMcut(A,B) = s(A,B)/s(A,A) + s(A,B)/s(B,B).

The linkage l(A,B) is a closeness or similarity measure between clusters A and B; it quantifies cluster similarity more effectively than the raw weight, since it normalizes the cluster weight:

l(A,B) = s(A,B) / (s(A,A) × s(B,B)).

For a single user i, his linkage to cluster A is l(A,i) = s(A,i)/s(A,A), where s(A,i) = s(A,B) with B = {i}. According to this equation, users close to the cut can be found. If a user i belongs to a cluster, his linkage with this cluster will be high. When a user is near the cut, the linkage difference ∆l(i) = l(i,A) − l(i,B) can be used: a user with small ∆l is near the cut and is a possible candidate to be moved to the other cluster.
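The linkage-difference test above can be sketched directly from these definitions. In the snippet below the threshold value is purely illustrative (the paper does not specify one), and taking the absolute value of ∆l is our reading of "small ∆l"; NumPy is assumed:

```python
import numpy as np

def delta_l(W, in_a, i):
    """Linkage difference Δl(i) = l(i,A) − l(i,B) for user i,
    with l(i,C) = s(C,{i}) / s(C,C) as defined above."""
    A = np.flatnonzero(in_a)
    B = np.flatnonzero(~in_a)
    l_a = W[A, i].sum() / W[np.ix_(A, A)].sum()
    l_b = W[B, i].sum() / W[np.ix_(B, B)].sum()
    return l_a - l_b

def move_candidates(W, in_a, threshold=0.05):
    """Users whose |Δl| falls below the (illustrative) threshold
    lie near the cut and are candidates for reassignment."""
    return [i for i in range(W.shape[0])
            if abs(delta_l(W, in_a, i)) < threshold]
```

A user firmly inside cluster A has l(i,A) much larger than l(i,B), so ∆l is large; only users straddling the cut produce values near zero.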
In the second case (directed arrays), the clustering algorithm presented in (Chakrabarti et al., 1998), which is based on Kleinberg's link analysis algorithm (Kleinberg, 1999), is adopted. Initially, this analysis was applied to documents related to each other through directed relationships (as in the World Wide Web); for every document, the authority and hub scores are calculated as the sum of the hub scores of the documents pointing to it and the sum of the authority scores of the documents it points to, respectively. More specifically, having set up the user similarity matrix X of size n×n as before, a weighted directed graph of users is constructed and the non-principal eigenvectors of X^T × X are computed. The components of each non-principal eigenvector are assigned to the corresponding users. By ordering the values of each eigenvector in increasing order, a partition among the users is declared at the largest gap between them. The entries of X^T × X represent authority scores and those of X × X^T refer to hub scores. The result is groups of users that are close to each other under the authority or hub score value. The algorithm can create up to 2×n different groups of users, but experimental results have shown that groups whose users have large coordinates in the first few eigenvectors tend to recur; therefore, a first set of user collections can satisfy the clustering procedure (e.g. (Kleinberg, 1999)).

3.2 Social Network Analysis

After the clustering procedure, groups of users that share the same or closely related preferences with regard to their documents' ratings are revealed. More formally, the members of every CoP are classified into a set of clusters C1, C2, ..., Cm. For each cluster, the proposed framework computes the values of each of
