View - Ecole Centrale Paris

DB2SNA: an All-in-one Tool for Extraction andAggregation of underlying Social Networks fromRelational DatabasesRania Soussi 1,3 , Etienne Cuvelier 1 , Marie-Aude Aufaure 1,2 ,Amine Louati 2,3 , and Yves Lechevallier 21 Ecole Centrale Paris, MAS Laboratory, Business Intelligence Team,Grande Voie des Vignes 92 295 Chatenay-Malabry{rania.soussi, etienne.cuvelier, Marie-Aude.Aufaure}@ecp.fr2 INRIA Paris-Rocquencourt, Axis Team, Domaine de Voluceau78150 Rocquencourt{Yves.Lechevallier, Marie-Aude.Aufaure, Amine.Louati}@inria.fr3 ENSI, RIADI-GDL Laboratory, Campus Universitaire de la Manouba,2010, TunisiaAmine.louati@riadi.rnu.tnAbstract. In the enterprise context, People need to visualize differenttypes of interactions between heterogeneous objects (e.g. product andsite, customers and product, people interaction (social network)...). Theexisting approaches focus on social networks extraction using web document.However a considerable amount of information is stored in relationaldatabases. Therefore, relational databases can be seen as richsources for extracting a social network. The extracted network has ingeneral a huge size which makes it difficult to analyze and visualize.An aggregation step is needed in order to have more understandablegraphs. In this chapter, we propose a heterogeneous object graph extractionapproach from a relational database and we present its applicationto extract social network. This step is followed by an aggregation stepin order to improve the visualisation and the analyse of the extractedsocial network. Then, we aggregate the resulting network using the k-SNAP algorithm which produces a summarized graph.Keywords: relational database, graph database, social network analysis,graph extraction and aggregation1 IntroductionThe data manipulated in an enterprise context are structured data as well asunstructured data such as e-mails, documents, etc. Graphs are a natural wayof representing and modeling such data in a unified manner (structured semistructuredand unstructured ones). The main advantage of such structure relieson (or resides in) its dynamic aspect and its capability to represent relations,

2 Authors Suppressed Due to Excessive Lengtheven multiple ones, between objects. It also facilitates data query using graphoperations. Explicit graphs and graph operations allow a user to express a queryat a very high level of abstraction.People need to visualize different types of interactions between heterogeneousobjects (e.g. product and site, customers and product, people interaction likesocial networks, etc.).Inordertoanalyzetheseinteractionsandfacilitatetheirqueryingusinggraphquery languages and social network analysis methods, it is relevant to modulatesuch interaction by using a graph structure.Indeed, these different graphs can help enterprises sending product recommendations(using the graph of Products and client), finding experts (usingsocial network), etc.Nevertheless, in a business context, important expertise information is storedin files, databases and especially relational databases. Relational database pervadesalmost all businesses. Many kinds of data, from e-mails and contact informationto financial data and sales records, are stored in databases. Also,databases used in business contain information about all people, objects andprocesses related to the enterprise. This data source is a rich one to extractobject interaction.However, a relational database is not the most suited data structure to storethe”graph-like”informationaboutasocialnetwork.Bynature,agraphdatabaseis more preferable, because its structure is close to the structure of a socialnetwork.Then, this chapter presents a new approach of social network extractionfrom relational database which allows discovering hidden relationships betweenentities. This approachhas beengeneralized toextract differentkindofheterogeneousobjects graphs: such graph contains several kinds of relations and objects.Each object owns a set of characteristics which can be different from object toanother. In order to facilitate the visualization and data interpretation, it seemsinteresting to aggregate the extracted graphs. This aggregation should use notonly the relations between nodes but also the characteristics of each one andvery few algorithms do that. In this context, we use the aggregation algorithmK-SNAP.Then, the structure of this chapter is the following. First, we propose insection 2, as a pre-treatment, to migrate the social network information storedin a classical relational database towards a graph database model . Startingfrom this new representation of the social network information contained in theoriginaldatabases,weextractthesocialnetworkstructure,butitwouldbeapityto lose a lot of information using the classical graph representation of a socialnetwork. Indeed, in the graph theory used to model such networks, all nodes areof the same type, and all relationships are of the same kind. But, in real life,all people in a social network do not play the same role, and all relationshipsdot not necessary share the same type. In other words, we are not all friends,or only friends, with our neighbors or colleagues. The enterprise framework isa perfect example for this: accountants do not have the same relationship with

DB2SNA: Extraction and Aggregation of Social Networks from DB 3theiraccountantcolleagues,withthesalesmen,withtheworkersand,finallywiththeir manager. And all these people play, of course, different roles in the businessof their enterprise. That is why we propose in section 3 to use heterogeneousgraphs to model and extract such complex social networks, by working directlyon the resulting graph database. The extraction of the social network is madeusing graph transformations. In a last step, we propose a visualization processbecause the ”raw data” of a network can have, in general, a huge size whichmakes its difficult to analyze and visualize. In order to ease the latter tasks wepropose to use an aggregation process in section 4. After a study of the existinggraph aggregation methods, we propose to use an existing technique which takesinto account the heterogeneity of our networks. A global view of our all-in-onesolution is given in Figure 1. Finally, we conclude and give some perspectives tothis work.Fig.1. Graph extraction approach.2 Graph database Models and Graph AggregationAlgorithm2.1 Graph database ModelsA graph database is defined [13] as a “database where the data structures forthe schema and/or instances are modeled as a (labeled) (directed) graph, orgeneralizationsofthegraphdatastructure,wheredatamanipulationisexpressedbygraph-orientedoperationsandtypeconstructors,andhasintegrityconstraintsappropriate for the graph structure. There is a variety of models for a graphdatabase(formoredetailssee[13]).Allthesemodelshavetheirformalfoundation

4 Authors Suppressed Due to Excessive Lengthas variations of the basic mathematical definition of a graph. The structure, usedfor modeling entities and relations, influences the way that data is queried andvisualized. In this section, we compare existing models in order to find the onemost suitable one for storing and representing a social network. We will focuson the representation of entities and relations in these models. In the following,we present some models classified according to the data structure used to modelentities and relations.Models based on simple node Data are represented in these models by a(directedorundirected)graphwithsimplenodesandedges.Mostofthesemodels(GOOD [14]), GMOD [15], etc.) represent both schema and instance databaseas a labeled directed graph. Moreover, LDM [16] represents the graph schema asa directed graph where leaves represent data and whose internal nodes representconnections between the data. LDM instances consist of two-column tables, onefor each node of the schema. Entities in these model are represented by nodeslabeled with type name and also with type value or object identifier (in thecase of instance graph). Some models have nodes for explicit representation oftuples and sets (PaMaL[17], GDM [18]), and n-ary relations (GDM). Relations(attributes, relations between entities) are generally represented in these modelsby means of labeled edges. LDM and PaMaL use tuple nodes to describe aset of attributes which are used to define an entity. GOOD defines edges todistinguish between mono-valued (functional edge) and multi-valued attributes(nonfunctional edge). Nevertheless, these models do not allow the presentationof nested relations and are not very well suited for complex objects modeling.Models based on complex node. In these models, the basic structure of agraph (node and edge) and the presentation of entities and relations are based onhypernodes (and hypergraphs). Indeed, a hypernode is a directed graph in whichnodes themselves can be graphs (or hypernodes). Hypernodes can be used torepresent simple (flat) and complex objects (hierarchical, composite, and cyclic)as well as mappings and records. A hypergraph is a generalized notion of graphwhere the notion of edge is extended to hyperedge, which relates to an arbitraryset of nodes. The Hypernode Model [19] and GGL[20] emphasize the use ofhypernodesforrepresentingnestedcomplex objects.GROOVY[21]isfocusedonthe use of hypergraphs. The hypernode model is characterized by using nestedgraphs at the schema and instance levels. GGL introduces, in addition to itssupport for hypernodes (called Master-nodes), the notion of Master-edge forthe encapsulation of paths. It uses hypernodes as an abstraction mechanismconsisting in packaging other graphs as an encapsulated vertex, whereas, theHypernode model additionally uses hypernodes to represent other abstractionslike complex objects and relations. Most models have explicit labels on edges. Inthe hypernode model and GROOVY, labeling can be obtained by encapsulatingedges, that represent the same relation, within one hypernode (or hyperedge)labeled with the relation name.

DB2SNA: Extraction and Aggregation of Social Networks from DB 5Discussion The purpose of this review of graph database models is to find themost suited one to model many complex data objects and their relationships,such as social networks. Social Network is an explicit representation of relationshipsbetween people, groups, organizations, computers or other entities [22]. Asother networks, it can be represented as a complex graph [23].Indeed, the social network structure can contain one or more types of relations,oneormoretypesorlevelsofentitiesandmanyattributesovertheentities.This structure is dynamic due to growth of the volume, change of attributes andrelations.Then, we compare the previous graph database models using some characteristicsrelated to social network: the ability to present dynamic and complexobjects, nested and neighborhood relations and the ability to give a good visualizationof social network. We resume the comparison on Table 1 where “+”indicates the graph model support, “-” indicates that the graph model does notsupport and “+/-” partial support. From this comparison, we have concludedthat models based on hypernodes can be very appropriate to represent complexand dynamic objects. In particular, the hypernode model with its nested graphscanprovideanefficientsupporttorepresenteveryreal-worldobjectasaseparatedatabase entity. Moreover, models based on a simple graph are unsuitable forcomplex networks where entities have many attributes and multiple relations.Entity Relation VisualizationComplex Dynamic Nested NeighborhoodHypernode + + + + +Groovy + + + + -GGL + + + + -GOOD - + - - +GMOD - + - - +PaMaL + + - + +/-GDM + + - - +LDM + + - - -Table 1. Graph database model comparison2.2 Graph Aggregation AlgorithmsWhen graphs of extracted social networks are large, effective graph aggregationand visualization methods are helpful for the user to understand the underlyinginformation and structure. Graph aggregations produce small and understandablesummaries and can highlights communities in the network, which greatlyfacilitates the interpretation.The automatic detection of communities in a social network, can providethis kind of graph aggregation. The community detection is a clustering task,where a community is a cluster of nodes in a graph [4] [31], such the nodes of

6 Authors Suppressed Due to Excessive Lengththe cluster must be more connected with inside nodes, than with nodes outsideof the cluster (see found [5] and [6] for extended reviews).The first class of clustering algorithms are the partitional algorithms, whichtry to find a partition of a set of data, with a given number of clusters, usingjointly, most of the times, similarity or a dissimilarity measures and a qualitycriterion of the found partition. The most popular partitional algorithm (withseveral variants), the k-means clustering [7], tries to find a partition of the setof objects which minimizes the sum-of-square criterion which adds the dissimilaritiesfrom each object to the centre of its own cluster. Several (di)similaritymeasures can be defined in the social network context, like those based on theJaccard index, which measures similarity between the sets of neighbours of thetwo nodes, but other measures can be defined ([5] and [6]).Hierarchical clustering algorithms try to organize data into a hierarchicalstructure, and are divided into agglomerative and divisive algorithms, dependingon whether the partition is coarsened, or refined, at each iteration. The basicidea beyond agglomerative algorithms is simple: at the starting point, the objectsto cluster are their own classes, and then at each stage we merge the twomore similar clusters. Of course a dissimilarity measure between two clustersis mandatory, and for a given dissimilarity measure d between objects, severalcluster-dissimilarities exist. The result of the clustering process is a dendrogram,which can be cut to give one single partition. Divisive clustering algorithms,split the dataset iteratively or recursively into smaller and smaller clusters, withrespect to a quality criterion. The most popular method for divisive hierarchicalclustering of social networks uses the notion of edge betweenness [8], becausefinding the connecting edges between communities is also finding these communities.The algorithm given in [4] splits the network into clusters by removing,stepafterstep,theedgewiththehigherbetweennessvalue.Theuseofastoppingcriterion which measures the improvement at each step should permit to stopwhen no improvement is gained with an iteration. In most cases the modularity[9] is used. SuperGraph [28] employs hierarchical graph partitioning to visualizelarge graphs.Specially designed for graphs, spectral algorithms [10] are based on the notionof connected components. These algorithms work with a Laplacian, matrixbased onthe adjacency (or weight) matrix [11] [12]. If the graph of the socialnetwork contains k , completely disjoints communities (i.e. without any link betweenthem), called connected components, then the k eigenvectors associatedto the eigenvalue 0 are indicator vectors of the k connected components . Ifthe clusters of the social network do not contain “clean” connected components(i.e. if there are links between existing communities), then a simple clusteringon the k eigenvectors associated to the k least eigenvalues, can retrieve the kcommunities.Some other algorithms works on graph aggregation use statistical methods tostudy graph characteristics, such as degree distributions [26], hop-plots [27] andclustering coefficients [30]. The results are often useful but difficult to controland especially to exploit. Methods for mining frequent graph patterns [32] are

DB2SNA: Extraction and Aggregation of Social Networks from DB 7also used to understand the characteristics of large graphs. Washio and Motoda[26] provide an elegant review on this topic.However, all the previous algorithms use only on links between nodes of thegraph of the network, and do not take into account the internal values containedin each node, while classical clustering algorithms applied on tables of values,work only on these values ignoring completely the possible link between individual.Analgorithmwhichcantakeintoaccountbothkindofinformationwouldbevery valuable. Designed for graphical graph aggregation the k-SNAP algorithm[29], in its divisive version, begins with a grouping based on attributes of thenodes, and then tries to divide the existing groups thanks to their neighboursgroups, trying minimizing a loss information measure.3 Social Network extraction from relational databasesIn this section we describe our approach of graph extraction which is based ontwo steps. The first step to perform is to transform relational databases intograph databases according to a graph model.This transformation allows the extraction of all the entities in the relationaldatabase on the form of nodes and outlines the relations between them whichfacilitate, in further steps, the selection of the desired entities. Also, nodes ingraph database are more complex than a simple graph which can encapsulate allthe attribute of entities in the same node and give us a simple graph of entities.The second step is to define a method to transform the graph according tochosen entities. This method has to deal with the identification of entities ofinterest for a particular user and to reorganize the graph – nodes and relationships– according to this point of view. Then, we applied this approach to extractsocial network.3.1 Graph extractionThe graph extraction approach is based on two main steps: (1) converting therelational database into graph database and (2) Extracting the heterogeneousgraph (with chosen entities) from the graph database.Converting relational database into hypernode database Havingagraphdatabase instead of a relational database will provide a clearer view of existingobjects in the initial database. Indeed, all these objects will be presented in theform of nodes, and the relations between them will be outlined thus facilitatingthe selection of the desired objects of interest in a further step. In addition,nodes in a graph database can encapsulate all the attributes of objects in thesame node and give us a simple graph of objects.Using the comparison between existing graph database models (Table 1),we have chosen to work with the hypernode model [19] because the hypernodedatabase with its nested graphs can provide an efficient support to representeach real-world object as a separate database entity.

DB2SNA: Extraction and Aggregation of Social Networks from DB 13Pr := 〈r,h ,null〉 Find all the existing relationsInitial Relation Pattern Process and descriptionR 1 :=h s or h d instances are added to〈”IS −A”,h s,h d 〉O Iwhere h s or h dinstances ∈ O IR 2 := 〈r,h s,h d 〉stances ∈ O I then add them to R Owhere h s and h d in-1 s,h dbetween h s and h d instancesPr 2 :=whereo Ii = h is , o Ij = h is andi! = jwhich have relations with asame h d instances and linkthem with a new relation“Same (hd.name)”R 3 := 〈r,h s,h d 〉where h s instances ∈OI (finite set of entities)and h d /∈ O IR4 := 〈r,h s,h d 〉where h s /∈ O I andh d ∈ O IPr 3 :=whereo Ii = h is , o Ij = h is,i! = j and r! = “Part-of”Pr 4 :=whereo I1 = h s , o I2 =h s, r =”Part-of” andh j ∈ {h|h has the relationR h :=< ”Part-of”,h j, h d >}Pr 5 :=whereo I1 = h d and o Ij ∈{o I|o I has relation withh s}.Table 2. Pattern used to extract relationsFind all the h s instances whichhave relations with a same h dinstances and link them with anew relation-Find all the hypernodes h jhaving a “Part-of” relationwith h d such as R :=- add new node to hs containingthe name of h j- then the pattern Pr 4 is applied:Find all the h d instanceswhich have relations with asame h j instances and linkthem with a new relation-add a new node on h s containingh d-if h s have relations with otherentities h j :we link each h j witheach h d instance if they are inrelation with the same h s.

14 Authors Suppressed Due to Excessive LengthWe have applied our approach to extract an underlying social network from arelational database. In this section, we describe this process using the Hypernodedatabase depicted in Fig. 3 and Fig. 4.Person identification The chosen objects in this case are the hypernode representingpersons. Then, in this step, we describe the process to identify people.The hypernode database schema is used to extract candidate hypernodes (thosewhich may represent persons). Then, the instances are used to deeply analyzethe candidates and detect those containing people.Hypernodes candidates detection. A person has a number of characteristicslike name, surname, birthday, address, email, etc. Some of these characteristicsare used when designing databases containing persons. Based on characteristicsfrom various ontologies such FOAF ontology and person ontology (schemaWeb),we have manually designed a person ontology (PO) containing all these characteristicsand their synonyms (collected from WordNet). Figure 5 shows anexcerpt of this ontology. Using the person ontology, the set of nodes related toeach hypernode in the hypernode database is analyzed.Fig.5. Part of the Person Ontology– If the node’s name is one of the PO concepts, the number of characteristicsfor this hypernode is incremented.– If the number of characteristics for the hypernode >= 1 and one of themcontains a name, the hypernode h is a candidate to contain persons.Candidate hypernodes Analysis. Each candidate hypernode has a set of instancehypernodes h i . In order to analyze the name found in each instance (we takejust the 10 first instances), the name is sent to a web search engine. The top10 returned documents are downloaded and form the set D i of the consideredinstance, and this set is parsed using DOM 4 . Each document in D i is analyzed4 http://www.w3.org/DOM/

DB2SNA: Extraction and Aggregation of Social Networks from DB 15using the NER tool (Named Entity Recognition) developed in Stanford 5 andwhich is able to put three kinds of tags (Person, Location or Organization). Wegive for each document d, a weight w d . If the name is tagged in the documentby Person, the document is weighted by w d = 1, otherwise w d = 0. The meanassigned to the name found in the hypernode instance h i (h i ) counts how manytimes it is considered as a person name in the documents (where the tag of thisname is Person): ∑d∈Dh i = iw d(1)|D i |The mean assigned to the hypernode (H) calculates the mean where the namesfound in its instances are considered as a person name:H =∑h ih i|{h i }|NER has a precision around 90% for finding Person entities; so, a hypernodeis considered as representative of a person if more than 60% of its instancescontain a person name (we take only 60% as a threshold due to problems suchas wrongly written names, use of abbreviations, etc. which lower the precisionof NER).(2)Relation construction From the previous step, the entities identification processidentifies the Hypernodes “Student “and “Director thesis” as person entities.Then, all their instance hypernodes are added to the entities set.The relation construction process is then performed using the identified entities.We start by identifying the relation R 1 in order to search hidden entities.In our example, the process identifies “Foreign-Student” as an entity due therelation R 1 =< ′′ IS −A ′′ ,Foreign−Student,Student >.We will detail the identified relation in what follows using the set of patterndescribed in Table. 2.From the relation R h1 :=< “ ′′ ,Director thesis,Student >, we identify twopatterns (Table 3):– Pr 1 :< “ ′′ ,Director thesis,Student,null >:inthedatabaseschema,Studentand Director thesis share the relation R h1 . Using Pr 1 and the value of theforeign key St−id, we search on the instance hypernode database for eachStudent the corresponding Director thesis. R h1 relates directly Studentand Director thesis then we have no mediator (null).– Pr 2 :=< Same Student,Director thesis i ,Director thesis j ,Student >:twothesis director may have the same Student (same value of St−id). Then, wesearchontheinstancehypernodedatabasealltheinstancesofDirector thesiswhich have the same Student (mediator for this pattern) in order to add betweenthem the relation Same Student.5 http://nlp.stanford.edu/ner/index.shtml

16 Authors Suppressed Due to Excessive LengthRelation and identified patternRelation:R h1 :=< ′′′′ ,Director thesis,Student,>Patterns:Pr 1 :=< ′′′′ ,Director thesis,Student,null >Example of extracted relationsPr 2 :=< Same Student, Director thesis i,Director thesis j, Student >Table 3. Relation R h1 PatternFrom the relation R h2 :=< ′′′′ ,Director thesis,Laboratory >, we identify onepattern (Table 4): Laboratory is not an entity (of interest) then its instances arenot included in the final graph:– Pr 3 :=< Same Laboratory,Director thesis i ,Director thesis j ,Laboratory >:using the value of the foreign key Lab id in each hypernode instance of theentity Director thesis, we will link those having the same value of Lab idby the relation Same Laboratory.Relation and identified patternRelation:R h2 :=< ′′′′ ,Director thesis,Laboratory >Pattern:Pr 3 :=< Same Laboratory,Director thesis i,Director thesis j,Laboratory >Example of extracted relationsLaboratory is not connected to other entitiesTable 4. Relation R h2 PatternFromtherelation R h3 :=< ′′ Part−of ′′ ,Student,thesis hasStudent >,weidentifyone pattern (Table 5) and we add some information:

DB2SNA: Extraction and Aggregation of Social Networks from DB 17– Thesis hasStudentsharestworelations“Part-of”withStudentandThesis.We add a new node on the hypernode Student < Thesis, Thesis i >, correspondingto his Thesis. Then, we can apply the pattern Pr 4 .– Pr 4 :=< Same Thesis,Student i ,Student j , Thesis hasStudent >, by thispattern we search all the students which share the same thesis. We did notfind such relation which is semantically inexact.Relation and identified PatternRelation:R h :=< ′′ Part−of ′′ ,Student,thesis hasStudent >Pattern:- Thesis hasStudent shares two relations “Partof”with Student an Thesis- Add the node n :< Thesis,Thesis i > toeach instance of Student- Pr 4 :=< Same Thesis,Student i,Student j,Thesis hasStudent >Example of extracted relationTable 5. Relation Rh3 Pattern.From the relation R h4 :=< ′′′′ ,Thesis,Director thesis >, there are no identifiedpatterns because Thesis is not related to other entities (Table 6). ConsideringRelation and identified PatternRelation:R h4 :=< ′′′′ ,Thesis,Director thesis >Pattern:- Thesis has no relations with other entitiesthen no pattern detected.- Add the node n :< Thesis,Thesis i > toeach instance of StudentExample of extracted relationTable 6. Relation R h4 Patternthe identified patterns and the hypernode database instance (Fig.4), a first socialnetwork is extracted by applying the set of patterns to the instance hypernodes.

18 Authors Suppressed Due to Excessive LengthIn order to obtain a more sophisticated social network, we merge the entitieswhich share a relation “IS-A”. For example, we merge the entity “Foreignstudent”with the entity “Student” by adding all the information found in thishypernode to the hypernode “Student”. Additionally, if “Foreign-student” hasrelations with other entities, these relations will be added to “Student”. Finally,we obtain the social network depicted in Fig. 6. In the previous steps, we haveFig.6. The corresponding Social Networkextracted a social network from a relational database. However, a relationaldatabase can contain hundreds or thousands of tuple then we will obtain a verylarge social network.From the resultant social network in Figure Fig. 6, a graph composed withhomogeneoushypernodeshasbeenextracted.Thetypeoftheselectedhypernodeis specified by the user (for example Director-Thesis) in order to perform K-SNAP algorithm. Then we will have an aggregated view of this graph which willfacilitate its visualization and analysis.The extracted graph has the hypernode instances of “Director thesis” asvertex and their relations as edges.4 Visualizing the social network using the k-SNAP graphaggregationIn the previous section, we were able to extract a social network from a graphdatabase. However, as the actual relational databases contain hundreds or eventhousands of records, we obtain as a result, the large social network seen infigure 7. An efficient graph aggregation will be valuable to visualize such graphin highlighting its underlying structure. And even if any classical communitydetection could be used to aggregate such a graph, do not use the inner valuesof the extracted nodes would be a wasting of efforts and information. Then we

DB2SNA: Extraction and Aggregation of Social Networks from DB 19chose to aggregate our graphs with the k-SNAP algorithm because it permits anaggregation on the both kind of information : links and attributes.Let use recall here the basic concepts of graph theory. A graph G is definedas a pair of sets: G = (V,E) , where V is the set of vertices or nodes, and Ethe set of edges or links which connect vertices. The two vertices connected byan edge are called endpoints of this latter. Conversely, the edges connecting avertice to the other vertices are called incident edges of the node. A graph canbe directed or undirected. In the first case, also called digraph, an edge is denotedas pair (v,w), where v is the origin and w the target, and in the social networksframework it means: “v is in relation with w”, the opposite being true if andonly if there exists an edge (w,v). If the graph G is undirected, then an edge isdenoted as the set of its vertices: {v,w}. Most of the times, vertices are labelled,but it can also be the case for edges, then G is called a labelled graph. Moreover,if exists a function ω : E → R that assigns a weight for each edge, then G is aweighted graph. Two vertices v and u are called neighbors or adjacent, if they areconnected by an edge. The set of neighbors of a node v, denoted Γ(v), is calledits neighborhood. The topology of the graph can be captured in the adjacencymatrixA, where the element a i,j is equal to one when (v i ,v j ) ∈ E, and zerootherwise.From here, when we denote a graph by G = (V,E) with V the set of nodes,the set of edges will be E = {E 1 ,E 2 ,.....,E r } the set of edge types, with eachE i ⊆ V ×V representing the set of edges of a particular type.And each node of V will be characterized by a set of attributes Λ(G). For aset of attributes A ⊆ Λ(G), a function Φ defined on V which is called AttributesCompatible Grouping or simply A-compatible, if it satisfies the following condition:∀u,v ∈ V,ifΦ(u) = Φ(v) then ∀a i ∈ A,a i (u) = a i (v), it will be simplydenoted by Φ A . This function induce a partition {g 1 ,g 2 ,...g k } on V where ineach group g i , every node has exactly the same values for the set of attributesA.In fact, The A-compatible Φ A only considers the node attributes. However,nodes do not just have attributes, but also participate in pairwise relationshipsrepresented by the edges.Forthat, weconsidernowrelationships whengroupingnodes.Foragrouping,we denote the neighbor-groups of node v in E i as NeighborGroups Φ,Ei (v) ={Φ(u)|(u,v) ∈ E i } which represents the set of groups on the partition associatedwith Φ where at least one element is connected to v by the relation E i .4.1 Attributes and Relationships Compatible GroupingForasetofattributesA ⊆ Λ(G)andasetofrelationsR,agroupingiscompatiblewiththesetofattributesAandrelationshiptypesRorsimply(A,R)-compatible,if Φ satisfies the following conditions:1. Φ is A-compatible,2. ∀u,v ∈ V if Φ(u) = Φ(v),

20 Authors Suppressed Due to Excessive Lengththen ∀E i ∈ R, NeighborGroups Φ,Ei (u) ≡ NeighborGroups Φ,Ei (v).In each group of an (A,R)-compatible grouping, all the nodes are homogeneousin terms of both the set of attributes A and the set of relations R. In otherwords, every node inside a group has exactly the same attributes of A, and isadjacent to nodes in the same set of groups for all the relations in R.Now, we can formally define the graph aggregation algorithm k-SNAP. Notethat all calculations will be made from the incidence matrix A t = (a t i,j ) 1

DB2SNA: Extraction and Aggregation of Social Networks from DB 212. On the other hand, if the group relationship is strong (p t i,k> 0.5) , it countsthe differences between this strong relationship and a 100% participationratiogroup relationship (p t i,k = 1).Finally we define a matrix W Et = (δ t ij ) 1≤i,j≤|Φ A| from equation (5), that evaluatesthe part of the ∆ value contributed by a group g i with one of its neighborsg j in a group relationship of type E t .Given k the desired number of groups, the k-SNAP operation produces an(A,R)-compatible grouping with the minimum ∆ value, starting from a A-compatible grouping and ∆ initialized to zero, the procedure is to look for eachiteration the group to split. For this, we introduce a heuristic that choosesthe group that makes the most contribution to ∆ with one of its neighbourgroups. More formally, for each group g i , we denote CT(g i ) as follows: CT(g i ) =max { δ t ij}.Then, at each iterative step, we always choose the group with the maximumCT value to split, based on whether nodes in this group g i which have relationshipswith nodes in its neighbour group g t , where: g t = argmax gj {δ t ij } andthen split it into two sub-groups according to the following strategy: one of thesegroups contains all nodes participating in the relationship with the group g t andthe other contains the rest, i.e. the nodes that have no relation with the groupg t .Now we will apply the algorithm on the graph extracted from the social networkof thesis director (Fig. 7). In this experiment, we are interested in analysingFig.7. The complete graphhow thesis directors in the database interact with each other based on two relationsSame Laboratory and Same student. Each node in this graph has oneattribute called grade, a direct visualization highlights our inability to interpretthis graph without further treatment.In order to explain the process of classification of k-SNAP, we will analyzethe result of this classification on a sample of the real graph.

22 Authors Suppressed Due to Excessive LengthAtfirst,k-SNAPgeneratesasummaryformedbythreegroups(A-CompatibleGrouping) in accordance with the modalities HDR , co-supervisor and prof ofthe attribute grade as shown in Fig 8.a. The first iteration (Fig. 8.b) leads to theFig.8. The A-Compatible Grouping and the 2 first iterationssubdivision of the HDR group into two subgroups according to the relationshipSame Student, because it maximizes the contribution of ∆. This iteration givesrise to two groups: HDR 1 group consists of the HDRs that supervise a studentwith at least one professor or “co-supervisor”.However, HDR 2 group consists of the HDRs who supervise students thathave only as director, HDRs. After the second iteration (Fig.8.c), the group ofFig.9. Overview of the graph after the second iterationprofessorsisdividedintotwosubgroupsaccordingtotherelationshipSame Laboratory(Fig 9); the professor of prof 1 group shares the lab with at least one of the othergroups (HDR and co-supervisor), i.e, the laboratory to which they belong hasmembers with various degrees. Even if our aggregation algorithm is rather of asemantic nature (takes into account the contents of the node), it’s interestingto apply some of SNA metrics and try to compare its results with our criterion.We have chosen degree centrality as a measure and it distribution within thenetwork is shown in the figure below (Fig 10). According to the subdivision ofthe HDR group (Fig 8.b), we can notice that nodes belonging to the HDR 1group have globally a high degree of centrality. This is in accordance with our

DB2SNA: Extraction and Aggregation of Social Networks from DB 23Fig.10. Degree centrality of the graphprinciple of division. In fact, all nodes of HDR 1 group supervises a studentwith at least one professor or “co-supervisor” thus, this creates a link whichconsequently increases the degree of centrality. It’s the same interpretation forprofessors in the second iteration (Fig 8.c).We can say that our results are consistent with the centrality measure.However, if two nodes (highlighted and written in italic (Fig 10)) have thesame median index of centrality, they may do not belong to the same groupbecause their relationships with others are of different types. In this case, themeasure of centrality is not adequate to our criterion; it depends on the natureof the relationship.By changing the resolutions of summaries, users can better understand thecharacteristics of the original graph data and also explore the differences andsimilarities across different iterative steps.5 ConclusionIn this paper, we have presented an approach to extract a social network from arelational database, then the aggregation method of the resulted social networkusing K-SNAP algorithm.The social network extraction process is an application of the graph extractionapproach from a relational database. This process allows having differentgraphs using as input the same relational database and the type of entities chosenby the user. The extraction approach is based on two steps: (1) translation ofa relational database into a graph database, and (2) graph transformation whichis realized after a process of objects identification then a graph rearrangement(nodes and relations). We have applied our approach using a real database.The main interest to use K-SNAP was to show that algorithms designed forbeing applied on other kind of data sources can be use without any adaptationafter applying our method for extracting social networks.In our future work, we will focus on how to improve the extracting methodby the use of an automatically built ontology describing the relations between

24 Authors Suppressed Due to Excessive Lengththe entities on the relational database. Then, we will try to define a storagesystem based on the hypernode model and a graph query language better suitedto the social network structure. We will also improve the aggregation methodby combining it with conceptual clustering.AcknowledgmentsThis work is partially financed by the ARSA project (Social Networks Analysisfor Public Administrations) and by the STIC INRIA-Tunisia project “Socialnetwork exploration for recommender systems”. The Academic chair in BusinessIntelligence is funded by SAP.References[1] Culotta, R., Bekkerman, McCallum, A.: Extracting social networks and contactinformation from email and the web. In: CEAS-1, (2004)[2] Pentland, A.: Socially Aware computation and communication. In: Computer, Vol.38 (2005) 33–40[3] Mika, P.: Flink: Semantic web technology for the extraction and analysis of socialnetworks. Journal of Web Semantics, 3 (2005) 211–223[4] Girvan, M., Newman, M. E. J.: Community structure in social and biological networks.Proc. Natl. Acad. Sci. USA, 99 (2002)[5] Schaeffer, S. A.: Graph clustering. Computer Science Review, 1 (2007) 27–64[6] Santo, F.: Community detection in graphs. Physics Reports, 486 (2010) 75–174[7] MacQueen, J.: Some methods for classification and analysis of multivariate observations.Proc. 5th Berkeley Symp. Math. Stat. Probab., Univ. Calif. 1965/66 bf 1(1967) 281–297[8] Freeman, L.C.: A set of measures of centrality based upon betweenness. Sociometry,40 (1977) 35–41[9] Newman, M. E. J.: Detecting community structure in networks. The EuropeanPhysical Journal B, 38 (2004)321–330[10] von Luxburg, U.: A tutorial on spectral clustering. Technical Report TechnicalReport, Max Planck Institute for Biological Cybernetics 149 (2006)[11] Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactionson Pattern Analysis and Machine Intelligence, 22 (2000)888–905[12] Ng, A. , Jordan, M., Weiss, Y., Dietterich , T., Becker, S., Ghahramani Z.: Advancesin Neural Information Processing Systems, 14, chapter On spectral clustering:analysis and an algorithm. MIT Press, (2002)[13] Angles, R.,Gutierrez, C. : Survey of graph database models. ACM Comput. Surv.40 (2008) 1-39[14] Gyssens, M., Paredaens, J., Gucht, D. V.: A graph-oriented object model fordatabase end-user interfaces. SIGMOD Rec. 19 (1990) 24–33[15] Andries, M., Gemis, M., Paredaens J., Thyssens, I., Bussche, J. D.: Conceptsfor Graph-Oriented Object Manipulation. In Proceedings of the 3rd internationalConference on Extending Database Technology: Advances in Database Technology,580 Springer-Verlag, London, (1992) 21–38.[16] Kuper, G. M., Vardi, M .Y.: The Logical Data Model. ACM Trans. Database Syst,(1993) 379–413

DB2SNA: Extraction and Aggregation of Social Networks from DB 25[17] Gemis, M., Paredaens, J.: An object-oriented pattern matching language. In:JSSST, Springer-Verlag, 742 (1993) 339–355.[18] Hidders, J.: Typing Graph-Manipulation Operations. In: Proceedings of the 9thInternational Conference on Database Theory (ICDT). Springer- Verlag. (2002) 394–409.[19] Levene, M., Loizou, G. : A Graph-Based Data Model and its Ramifications. IEEETrans. on Knowl. and Data Eng. 7 (1995) 809-823[20] Graves, M., Bergeman, E. R., Lawrence, C. B.: Graph database systems for genomics.IEEE Eng.Medicine Biol. 14 (1995) 737–745.[21] Levene, M., Poulovassilis, A.: An object-oriented data model formalized throughhyper-graphs. Data Knowl. Eng. 6 (1991) 205–224.[22] Barnes, J. A.: Class and committees in a Norwegian island parish. Hum. Relat(1954) 39–58.[23] Xu, X., Zhan, J., Zhu, H.: Using Social Networks to Organize Researcher Community.In: Proceedings of the IEEE ISI 2008 Paisi, Paccf, and SOCO internationalWorkshops on intelligence and Security informatics. Springer-Verlag, Heidelberg,(2008) 421–427[24] Maatuk, A., Akhtar, M., Rossiter, B.N.: Relational Database Migration: A Perspective.In: DEXA’08, (2008) 676–683[25] Soussi,R.,Aufaure,M.A.,Baazaoui,H.:TowardsSocialNetworkExtractionUsinga Graph Database. In: Proceedings of Second International Conference on Advancesin Databases, Knowledge, and Data Applications, (2010) 28–34[26] Newman, M. E. J.: The structure and function of complex networks. SIAM RE-VIEW , 45 (2003) 167–256.[27] Chakrabarti, D., Faloutsos, Zhan, C. Y.: Visualization of large networks with mincutplots,a-plots and r-mat. Int. J. Hum.-Comput. Stud. 655 (2007) 434–445.[28] Rodrigues, Jr., J. F., A. J. M. Traina, C. Faloutsos, Traina, C. Jr.: Supergraphvisualization. In ISM ’06 : Proceedings of the Eighth IEEE International Symposiumon Multimedia, Washington, DC, USA, IEEE Computer Society. (2006) 227–234[29] Tian, Y., R. A. Hankins, J. M. Patel: Efficient aggregation for graph summarization.In SIGMOD ’08 : Proceedings of the 2008 ACM SIGMOD internationalconference on Management of data, New York, NY, USA, (2008) 567–580[30] Watts, D. J. , Strogatz , S. H.: Collective dynamics of ’small-world’ networks.Nature 393(6684) (1998) 440–442.[31] Newman M.E.J, Girvan, M.: Finding and evaluating community structure in networks.Phys. Rev. E, 69 (2004).[32] Yan X., Han J.: gSpan: Graph-based substructure pattern mining. In Proceedingsof ICDM’02, (2002) pages 721-724.[33] Washio T., Motoda, H.: State of the art of graph-based data mining. SIGKDDExplor. Newsl. 5 (2003) 59–68.

View - Ecole Centrale Paris

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?