Graph Analytics in Big Data


John Feo
Pacific Northwest National Laboratory

A changing world

The breadth of problems requiring graph analytics is growing rapidly:
- Large network systems
- Social networks
- Packet inspection
- Natural language understanding
- Semantic search and knowledge discovery
- Cybersecurity

Graphs are not grids

Graphs arising in informatics are very different from the grids used in scientific computing.

Scientific grids:
- Static or slowly evolving
- Planar
- Nearest neighbor communication
- Work performed per cell or node
- Work modifies local data

Graphs for data informatics:
- Dynamic
- Non-planar
- Communications are non-local and dynamic
- Work performed by crawlers or autonomous agents
- Work modifies data in many places

Challenges

- Problem size
  - Ton of bytes, not ton of flops
- Little data locality
  - Have only parallelism to tolerate latencies
- Low computation to communication ratio
  - Single word access
  - Threads limited by loads and stores
- Frequent synchronization
  - Node, edge, record
- Work tends to be dynamic and imbalanced
  - Let any processor execute any thread

System requirements

- Global shared memory
  - No simple data partitions
  - Local storage for thread private data
- Network support for single word accesses
  - Transfer multiple words when locality exists
- Multi-threaded processors
  - Hide latency with parallelism
  - Single cycle context switching
  - Multiple outstanding loads and stores per thread
- Full-and-empty bits
  - Efficient synchronization
  - Wait in memory
- Message driven operations
  - Dynamic work queues
  - Hardware support for thread migration

Cray XMT
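To make the full-and-empty bit requirement concrete, here is a minimal software sketch in C11. It only emulates the semantics (a write succeeds only on an empty word and marks it full; a read succeeds only on a full word and marks it empty). The names `SyncWord`, `sync_init`, `try_write_ef`, and `try_read_fe` are hypothetical and are not the Cray XMT interface; on hardware with full-and-empty bits a failed access would wait in memory instead of returning false.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical emulation of a memory word tagged with a full/empty bit. */
enum { WORD_EMPTY = 0, WORD_FULL = 1, WORD_BUSY = 2 };

typedef struct {
    atomic_int state;   /* WORD_EMPTY, WORD_FULL, or WORD_BUSY (being accessed) */
    long value;         /* payload, only touched while the state is WORD_BUSY */
} SyncWord;

/* Start with an empty word. */
void sync_init(SyncWord *w)
{
    assert(w != NULL);
    atomic_init(&w->state, WORD_EMPTY);
    w->value = 0;
}

/* Write v if the word is empty, then mark it full. Returns false otherwise. */
bool try_write_ef(SyncWord *w, long v)
{
    int expected = WORD_EMPTY;
    if (!atomic_compare_exchange_strong(&w->state, &expected, WORD_BUSY))
        return false;
    w->value = v;
    atomic_store(&w->state, WORD_FULL);
    return true;
}

/* Read the value if the word is full, then mark it empty. Returns false otherwise. */
bool try_read_fe(SyncWord *w, long *v)
{
    int expected = WORD_FULL;
    if (!atomic_compare_exchange_strong(&w->state, &expected, WORD_BUSY))
        return false;
    *v = w->value;
    atomic_store(&w->state, WORD_EMPTY);
    return true;
}
```

A producer thread would call `try_write_ef` and a consumer `try_read_fe`, each retrying on failure; the word itself then acts as a one-slot synchronized channel, which is the kind of per-node, per-edge synchronization the slide refers to.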

Our type of problems

- Return all triads such that (A -> B), (A -> C), (C -> B).
- Return all paths of length three with link types {T1, T2, T3} such that the timestamps of consecutive links overlap by at least 0.5 seconds.
- From Facebook, return the connected subgraph G(V, E) such that G includes all the friends of John, the cardinality of V is minimum, and Σ NetWorth(v_i ∈ V) is maximum.
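The overlap condition in the second problem can be sketched as a small predicate. This is an illustration only: it assumes each link carries a closed time interval with a start and an end in seconds, which the slide does not state, and the names `Link`, `overlap_seconds`, and `links_overlap` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical link record: active from start to end, in seconds. */
typedef struct {
    double start;
    double end;
} Link;

/* Length of the time both links are active; 0 if they do not overlap. */
double overlap_seconds(Link a, Link b)
{
    assert(a.start <= a.end && b.start <= b.end);
    double lo = a.start > b.start ? a.start : b.start;  /* later start */
    double hi = a.end < b.end ? a.end : b.end;          /* earlier end */
    return hi > lo ? hi - lo : 0.0;
}

/* True when two consecutive links overlap by at least min_overlap seconds. */
bool links_overlap(Link a, Link b, double min_overlap)
{
    return overlap_seconds(a, b) >= min_overlap;
}
```

A path of three links l1, l2, l3 would then qualify when both `links_overlap(l1, l2, 0.5)` and `links_overlap(l2, l3, 0.5)` hold.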

Triads

[Figure: triad on nodes A, B, and C with edges A to B, A to C, and C to B]

    SELECT ?A ?B ?C
    WHERE {
      ?A ?a ?B .
      ?A ?b ?C .
      ?C ?c ?B .
    }

Simple C code !?!?!

    for each node A {
      for each out_edge I of A {
        for each out_edge J of A {
          B = tail of I;
          C = tail of J;
          for each out_edge K of C
            if tail of K == B { … write answer … }
        } } }

- No memory explosion
- 15 secs
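The loop nest above can be written as compilable C along the following lines. This is a sketch under assumptions, not the author's graph API: it stores the graph in a hypothetical compressed sparse row layout (`Graph`, `row`, `dst`), and `find_triads`, `Triad`, and the example data are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CSR graph: the out-edges of node v are
   dst[row[v]] .. dst[row[v + 1] - 1]. */
typedef struct {
    int n;            /* number of nodes */
    const int *row;   /* n + 1 offsets into dst */
    const int *dst;   /* edge targets */
} Graph;

/* One answer: edges A->B, A->C, and C->B all exist. */
typedef struct { int a, b, c; } Triad;

/* Writes up to cap triads into out and returns the total number found. */
size_t find_triads(const Graph *g, Triad *out, size_t cap)
{
    size_t found = 0;
    assert(g != NULL && g->n >= 0);
    for (int a = 0; a < g->n; a++) {
        for (int i = g->row[a]; i < g->row[a + 1]; i++) {
            for (int j = g->row[a]; j < g->row[a + 1]; j++) {
                int b = g->dst[i];
                int c = g->dst[j];
                /* Close the triad: look for an edge C->B. */
                for (int k = g->row[c]; k < g->row[c + 1]; k++) {
                    if (g->dst[k] == b) {
                        if (found < cap) out[found] = (Triad){a, b, c};
                        found++;
                    }
                }
            }
        }
    }
    return found;
}

/* Example data: edges 0->1, 0->2, 1->3, 2->1. */
static const int example_row[] = {0, 2, 3, 4, 4};
static const int example_dst[] = {1, 2, 3, 1};
static const Graph example_graph = {4, example_row, example_dst};
```

On the example graph the only answer is A = 0, B = 1, C = 2. No intermediate join table is materialized; each candidate is checked and written as it is found, which is one reading of the "no memory explosion" remark.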

SP2 Benchmarks

- We have written the 12 SP2B queries in C using our graph API
- Execution on Cray XMT/2 is one to three orders of magnitude faster than Virtuoso on a 3 GHz Xeon server
- Now porting sdb0 to x86 server and cluster systems
- C code is simple, but
  - Can we generate it automatically from a high level query language?
  - Can we provide some other more appropriate query interface?

Query 5

Return the names of all persons that occur as author of at least one inproceeding and at least one article.

[Figure: example graph with persons John and Bill, publications "pub 1", "pub 2", and "pub 3", and type nodes PERSON, ARTICLE, and InPROC]

Data parallel code for Query 5

    int PERSON_index = get_Vertex_Index(person);
    int ARTICLE_index = get_Vertex_Index(article);
    int INPROC_index = get_Vertex_Index(inproc);
    int nmbr_Edges = inDegree(PERSON_index);
    in_edge_iterator PERSON_edges = get_InEdges(PERSON_index);
    for (i = 0; i < nmbr_Edges; i++) {
      int person = PERSON_edges[i].head;
      int nmbr_Publ = number_edges(person, creator);
      in_edge_iterator Publ_edges = get_InEdges(person, creator);
      int flag = 0;  /* bit 0: has an article, bit 1: has an inproceeding */
      for (j = 0; j < nmbr_Publ; j++) {
        int publ_type = edge_Head_Index(Publ_edges[j]);
        if (publ_type == ARTICLE_index) flag |= 1;
        else if (publ_type == INPROC_index) flag |= 2;
        if (flag == 3) { print person; break; }
      }
    }

1.29 secs vs. 21 secs in Virtuoso
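For readers without access to the graph API used on the slide, the same per-person flag technique can be sketched in self-contained C. Everything here is an assumption made for illustration: the graph is a plain array of vertex types plus a list of creator edges, and `VType`, `CreatorEdge`, `query5`, and the example data (loosely modeled on the Query 5 figure) are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vertex types. */
typedef enum { T_PERSON, T_ARTICLE, T_INPROC } VType;

/* Hypothetical edge: publication pub has person as a creator. */
typedef struct { int pub; int person; } CreatorEdge;

/* Stores in out the ids of persons who authored at least one article and
   at least one inproceeding; returns how many were found.
   out must have room for n_vertices entries. */
size_t query5(const VType *type, int n_vertices,
              const CreatorEdge *edges, size_t n_edges, int *out)
{
    size_t found = 0;
    assert(type != NULL && edges != NULL && out != NULL);
    for (int p = 0; p < n_vertices; p++) {
        if (type[p] != T_PERSON) continue;
        int flag = 0;   /* bit 0: article seen, bit 1: inproceeding seen */
        for (size_t e = 0; e < n_edges; e++) {
            if (edges[e].person != p) continue;
            VType t = type[edges[e].pub];
            if (t == T_ARTICLE) flag |= 1;
            else if (t == T_INPROC) flag |= 2;
            if (flag == 3) { out[found++] = p; break; }
        }
    }
    return found;
}

/* Example data: 0 = John, 1 = Bill, 2 = pub 1, 3 = pub 2, 4 = pub 3. */
static const VType example_type[] =
    {T_PERSON, T_PERSON, T_ARTICLE, T_INPROC, T_INPROC};
static const CreatorEdge example_edges[] = {{2, 0}, {3, 0}, {4, 1}};
```

With this data only John (vertex 0) is returned, since Bill has an inproceeding but no article. Each iteration of the outer loop keeps its own `flag`, so the iterations are independent and could run in parallel; in a parallel version only the shared `found` counter would need synchronization.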

Conclusions

- Big data graph analytics is fundamentally different from big data science
  - Different algorithms
  - Different challenges
  - Different hardware requirements
- Conventional database systems based on tables and join operations are insufficient
- Data parallel graph crawls can be orders of magnitude faster
- Need new query languages capable of expressing graph analytics operations and compiling to data parallel operations
