Social and Information - SNAP - Stanford University


CS 322: (Social and Information) Network Analysis

Jure Leskovec

Stanford University


Probabilistic models of network diffusion<br />

How cascades spread in real life:<br />

Viral marketing<br />

Blogs<br />

Group membership

Last 20 minutes: mid-term course evaluation

10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis 2


How do viruses/rumors propagate?<br />

Will a flu‐like virus linger, or will it become extinct?<br />

(Virus) birth rate β: probability that an infected neighbor attacks

(Virus) death rate δ: probability that an infected node heals

[Figure: node N and its neighbors N1, N2, N3, each either infected or healthy; an infected neighbor attacks with prob. β, and an infected node heals with prob. δ]


General scheme for epidemic models:<br />

S…susceptible<br />

E…exposed<br />

I…infected<br />

R…recovered

Z…immune


Assuming perfect mixing, i.e., the network is a complete graph

The model dynamics:

[Figure: number of nodes in the Susceptible, Infected, and Recovered states over time]


Susceptible-Infective-Susceptible (SIS) model

Cured nodes immediately become susceptible

Virus “strength”: s = β / δ

[Figure: state diagram. Susceptible → Infective: infected by neighbor with prob. β. Infective → Susceptible: cured internally with prob. δ]
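The SIS dynamics on a graph can be sketched as a discrete-time simulation. This is a minimal illustration and not code from the lecture: the names `sis_step` and `run_sis`, the adjacency-dict representation, and the synchronous update (a cured node is susceptible again from the next step on) are all assumptions.

```python
import random

def sis_step(adj, infected, beta, delta, rng):
    """One synchronous SIS step on an undirected graph.

    adj: dict mapping node -> iterable of neighbors (assumed format)
    infected: set of currently infected nodes
    """
    next_infected = set()
    for v in adj:
        if v in infected:
            # An infected node is cured with prob. delta; otherwise it stays infected.
            if rng.random() >= delta:
                next_infected.add(v)
        else:
            # Each infected neighbor attacks independently with prob. beta.
            if any(u in infected and rng.random() < beta for u in adj[v]):
                next_infected.add(v)
    return next_infected

def run_sis(adj, seeds, beta, delta, steps, seed=0):
    """Return the number of infected nodes after each step (index 0 = start)."""
    rng = random.Random(seed)
    infected = set(seeds)
    history = [len(infected)]
    for _ in range(steps):
        infected = sis_step(adj, infected, beta, delta, rng)
        history.append(len(infected))
    return history
```

Calling `run_sis` with different values of β and δ on the same graph gives a rough feel for how the strength s = β / δ decides whether the infection lingers or dies out.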


Assuming perfect mixing (complete graph):

\[
\frac{dS}{dt} = -\beta S I + \delta I, \qquad \frac{dI}{dt} = \beta S I - \delta I
\]

[Figure: number of nodes in the Susceptible and Infected states over time]
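Under perfect mixing, the standard SIS rate equations are dS/dt = -βSI + δI and dI/dt = βSI - δI. The sketch below integrates them with a forward Euler step. It is an illustration only: the function name `simulate_sis`, the step size, and the choice to treat S and I as fractions of the population (so S + I = 1) are assumptions, not part of the slides.

```python
def simulate_sis(beta, delta, i0=0.01, dt=0.01, steps=100_000):
    """Integrate the perfect-mixing SIS equations with forward Euler.

    S and I are fractions of the population (assumption), so S + I = 1.
    Returns the final (S, I).
    """
    s, i = 1.0 - i0, i0
    for _ in range(steps):
        new_infections = beta * s * i   # flow S -> I
        cures = delta * i               # flow I -> S
        s += dt * (cures - new_infections)
        i += dt * (new_infections - cures)
    return s, i
```

In this normalized form the infected fraction settles at 1 - δ/β when β > δ and decays to zero when β < δ, which is the complete-graph analogue of asking whether the virus lingers or becomes extinct.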


The epidemic threshold of a graph is a value t such that:

If strength s = β / δ < t, an epidemic cannot happen

(it eventually dies out)

Given a graph, compute its epidemic threshold


What should t depend on?<br />

avg. degree? and/or highest degree?

and/or variance of degree?

and/or third moment of degree?

and/or diameter?


We have no epidemic if:

\[
\beta / \delta < \tau = 1 / \lambda_{1,A}
\]

β: (virus) birth rate; δ: (virus) death rate; τ: epidemic threshold; λ1,A: largest eigenvalue of adj. matrix A

► λ1,A alone captures the property of the graph!

[Wang et al. 2003]
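The threshold τ = 1/λ1,A can be computed for a small graph with power iteration. The sketch below is a hedged, pure-Python illustration: the function names and the adjacency-dict format are assumptions, and iterating on A + I (rather than A) is an implementation choice made here so the iteration also converges on bipartite graphs. It assumes a connected undirected graph with at least one edge.

```python
import math

def largest_eigenvalue(adj, iters=200):
    """Largest adjacency eigenvalue of a connected undirected graph.

    adj: dict mapping node -> set of neighbors (assumed format).
    Power iteration on A + I; the eigenvalue of A is read off afterwards.
    """
    nodes = list(adj)
    x = {v: 1.0 for v in nodes}
    for _ in range(iters):
        # y = (A + I) x
        y = {v: x[v] + sum(x[u] for u in adj[v]) for v in nodes}
        norm = math.sqrt(sum(val * val for val in y.values()))
        x = {v: val / norm for v, val in y.items()}
    # Rayleigh quotient x^T A x for the unit vector x.
    return sum(x[v] * sum(x[u] for u in adj[v]) for v in nodes)

def epidemic_threshold(adj):
    """tau = 1 / lambda_1 of the adjacency matrix."""
    return 1.0 / largest_eigenvalue(adj)

def below_threshold(adj, beta, delta):
    """True if beta/delta < tau, i.e. the condition for no epidemic."""
    return beta / delta < epidemic_threshold(adj)
```

For larger graphs a sparse eigensolver from a numerical library would be the more usual choice; the loop above is only meant to make the definition concrete.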


[Figure: number of infected nodes (0 to 500) vs. time (0 to 1000) on the Oregon graph (10,900 nodes and 31,180 edges) with β = 0.001, for δ = 0.05, 0.06, 0.07, corresponding respectively to β/δ > τ (above threshold), β/δ = τ (at the threshold), and β/δ < τ (below threshold)]

[Wang et al. 2003]


Does it matter how many people are<br />

initially infected?<br />



[Leskovec et al., SDM ’07]<br />

Bloggers write posts and refer (link) to other posts and the information propagates


Data – Blogs:

We crawled 45,000 blogs for 1 year

10 million posts and 350,000 cascades

[Figure: blogs and their posts connected by time-ordered hyperlinks, forming an information cascade]


[Leskovec et al., TWEB ’07]<br />

Senders and followers of recommendations receive discounts on products

[Figure labels: 10% credit, 10% off]

• Data – Incentivized Viral Marketing program

• 16 million recommendations

• 4 million people

• 500,000 products


[Backstrom et al., KDD ’06]<br />

Use social networks where people belong to<br />

explicitly defined groups<br />

Each group defines a behavior that diffuses<br />

Data – LiveJournal:<br />

On-line blogging community with friendship links and user-defined groups

Over a million users update content each month

Over 250,000 groups to join


Prob. of adoption depends on the number of friends who have adopted [Bass ‘69, Granovetter ’78]

What is the shape?

Distinction has consequences for models and algorithms

[Figure: two candidate curves of prob. of adoption vs. k = number of friends adopting. Diminishing returns? Critical mass?]


[Figure: probability of purchasing (0 to 0.1) vs. # recommendations received (0 to 40), DVD recommendations (8.2 million observations)]

[Leskovec et al., TWEB ’07]


LiveJournal community membership

[Figure: prob. of joining vs. k (number of friends in the community)]

[Backstrom et al., KDD ’06]


Sending email:<br />

Email network of large university<br />

Prob. of a link as a function of # of common friends<br />

[Figure: prob. of email vs. k (number of common friends)]

[Kossinets-Watts ‘06]


For viral marketing:<br />

We see node v receive the i-th recommendation and then purchase the product

For communities:

At time t we see the behavior of node v’s friends

Questions:

When did v become aware of recommendations or friends’ behavior?

When did it translate into a decision by v to act?

How long after this decision did v act?


Dependence on number of friends<br />

Consider: connectedness of friends<br />

x and y have three friends in the group

x’s friends are independent

y’s friends are all connected

Who is more likely to join?


Competing sociological theories

Information argument [Granovetter ‘73]

Social capital argument [Coleman ’88]

Information argument:

Unconnected friends give independent support

Social capital argument:

Safety/trust advantage in having friends who know each other


LiveJournal: 1 million users, 250,000 groups

Social capital argument wins!

Prob. of joining increases with the number of adjacent members.

[Backstrom et al. KDD ‘06]
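One way to make "connectedness of friends" concrete is to count the pairs of a node's in-group friends who are themselves linked. The sketch below is only an illustration of that idea: the function name `adjacent_friend_pairs`, the adjacency-dict format, and the use of this particular count as the measure are assumptions, not the definition used in the cited study.

```python
from itertools import combinations

def adjacent_friend_pairs(adj, node, group):
    """Count pairs of `node`'s friends inside `group` that are linked to each other.

    adj: dict mapping node -> set of friends (assumed undirected friendship graph)
    group: set of nodes that already belong to the group
    """
    friends_in_group = [u for u in adj[node] if u in group]
    return sum(1 for a, b in combinations(friends_in_group, 2) if b in adj[a])

# x has three independent friends in the group; y's three friends all know each other.
example = {
    "x": {"a", "b", "c"}, "a": {"x"}, "b": {"x"}, "c": {"x"},
    "y": {"d", "e", "f"},
    "d": {"y", "e", "f"}, "e": {"y", "d", "f"}, "f": {"y", "d", "e"},
}
members = {"a", "b", "c", "d", "e", "f"}
x_pairs = adjacent_friend_pairs(example, "x", members)
y_pairs = adjacent_friend_pairs(example, "y", members)
```

Under the social capital argument, the node with more linked friend pairs (y in this toy graph) would be the one more likely to join.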


Large anonymous online retailer<br />

(June 2001 to May 2003)<br />

15,646,121 recommendations

3,943,084 distinct customers

548,523 products recommended

Products belonging to 4 product groups:

books

DVDs

music

VHS


[Figure: recommendation cascade with time-ordered edges, t1 < t2 < … < tn. Legend: bought but didn’t receive a discount; bought and received a discount; received a recommendation but didn’t buy]


There are relatively few DVD titles, but DVDs account for ~ 50% of recommendations.<br />

Recommendations per person

DVD: 10

books and music: 2

VHS: 1

Recommendations per purchase

books: 69

DVDs: 108

music: 136

VHS: 203

Overall there are 3.69 recommendations per node on 3.85 different products.

Music recommendations reached about the same number of people as DVDs but used only 1/5 as many recommendations.

Book recommendations reached by far the most people: 2.8 million.

All networks have a very small number of unique edges. For books, videos and music the number of unique edges is smaller than the number of nodes: the networks are highly disconnected.


What role does the product category play?<br />

        products   customers   recommendations   edges       buy + get discount   buy + no discount
Book    103,161    2,863,977   5,741,611         2,097,809   65,344               17,769
DVD     19,829     805,285     8,180,393         962,341     17,232               58,189
Music   393,598    794,148     1,443,847         585,738     7,837                2,739
Video   26,131     239,583     280,270           160,683     909                  467
Full    542,719    3,943,084   15,646,121        3,153,676   91,322               79,164


Some products are easier to recommend than<br />

others<br />

product category   number of buy bits   forward recommendations   percent
Book               65,391               15,769                    24.2
DVD                16,459               7,336                     44.6
Music              7,843                1,824                     23.3
Video              909                  250                       27.6
Total              90,602               25,179                    27.8


Does sending more recommendations<br />

influence more purchases?<br />

[Figure: number of purchases (0 to 7) vs. outgoing recommendations (20 to 140)]


What is the effectiveness of subsequent<br />

recommendations?<br />

[Figure: probability of buying (0.02 to 0.07) vs. exchanged recommendations (5 to 40)]


consider successful recommendations in terms of

av. # senders of recommendations per book category

av. # of recommendations accepted

books overall have a 3% success rate

(2% with discount, 1% without)

lower than average success rate (significant at p=0.01 level)

fiction

romance (1.78), horror (1.81)

teen (1.94), children’s books (2.06)

comics (2.30), sci-fi (2.34), mystery and thrillers (2.40)

nonfiction

sports (2.26)

home & garden (2.26)

travel (2.39)

higher than average success rate (statistically significant)

professional & technical

medicine (5.68)

professional & technical (4.54)

engineering (4.10), science (3.90), computers & internet (3.61)

law (3.66), business & investing (3.62)


47,000 customers responsible for 2.5 out of 16 million recommendations in the system

29% success rate per recommender of an anime DVD

Giant component covers 19% of the nodes

Overall, recommendations for DVDs are more likely to result in a purchase (7%), but the anime community stands out


Variable            transformation   Coefficient
const                                -0.940 ***
# recommendations   ln(r)             0.426 ***
# senders           ln(n_s)          -0.782 ***
# recipients        ln(n_r)          -1.307 ***
product price       ln(p)             0.128 ***
# reviews           ln(v)            -0.011 ***
avg. rating         ln(t)            -0.027 *
R^2                                   0.74

significance at the 0.01 (***), 0.05 (**) and 0.1 (*) levels


94% of users make first recommendation without having received one previously

Size of giant connected component increases from 1% to 2.5% of the network (100,420 users) – small!

Some sub-communities are better connected

24% out of 18,000 users for westerns on DVD

26% of 25,000 for classics on DVD

19% of 47,000 for anime (Japanese animated film) on DVD

Others are just as disconnected

3% of 180,000 home and gardening

2-7% for children’s and fitness DVDs


Products suited for Viral Marketing:<br />

small and tightly knit community

few reviews, senders, and recipients

but sending more recommendations helps

pricey products

rating doesn’t play as much of a role

Observations for future diffusion models:

purchase decision more complex than threshold or simple infection

influence saturates as the number of contacts expands

links lose effectiveness if they are overused

Conditions for successful recommendations:

professional and organizational contexts

discounts on expensive items

small, tightly knit communities


How big are cascades?<br />

What are the building<br />

blocks of cascades?<br />

[Figure: example cascades for a medical guide book and a DVD]


Given a (social) network<br />

A process spreading over the network creates a graph (a tree)

Let’s count cascades

[Figure: a social network and the cascade (propagation graph) created on it]


[subgraph] is the most common cascade subgraph

It accounts for ~75% of cascades in books, CD and VHS, only 12% of DVD cascades

[subgraph] is 6 (1.2 for DVD) times more frequent than [subgraph]

For DVDs [subgraph] is more frequent than [subgraph]

Chains ([subgraph]) are more frequent than [subgraph]

[subgraph] is more frequent than a collision ([subgraph]) (but collision has fewer edges)

Late split ([subgraph]) is more frequent than [subgraph]


Stars (“no propagation”)

Bipartite cores (“common friends”)

Nodes having same friends


[Figure: books, log-log plot of count vs. x = cascade size (number of nodes); fit: count = 1.8e6 x^-4.98; steep drop-off, very few large cascades]
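A fit of the form count = c · x^a, like the 1.8e6 x^-4.98 reported for book cascades, can be obtained by least squares on the log-log values. The sketch below shows that idea only; the function name `fit_power_law` is hypothetical, the slides do not say which fitting method was used, and the demo data are synthetic points placed exactly on the reported curve rather than real measurements.

```python
import math

def fit_power_law(sizes, counts):
    """Least-squares fit of counts ~ c * sizes**a on log-log axes.

    Returns (c, a). All sizes and counts must be positive.
    """
    xs = [math.log(x) for x in sizes]
    ys = [math.log(y) for y in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx                       # slope = power-law exponent
    c = math.exp(mean_y - a * mean_x)   # intercept = log of the prefactor
    return c, a

# Synthetic check: points generated from the reported curve (not real data).
demo_sizes = [1, 2, 4, 8, 16, 32]
demo_counts = [1.8e6 * x ** -4.98 for x in demo_sizes]
demo_c, demo_a = fit_power_law(demo_sizes, demo_counts)
```

On noisy empirical histograms a log-log regression can be biased, so the result is best read as a rough estimate of the exponent.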


DVD cascades can grow large<br />

Possibly as a result of websites where people<br />

sign up to exchange recommendations<br />

[Figure: log-log plot of count vs. x = cascade size (number of nodes); fit: count ~ x^-1.56; shallow drop-off, fat tail, a number of large cascades]


The probability of observing a cascade on x nodes follows: p(x) ~ x^-2

[Figure: count vs. x = cascade size (number of nodes)]


Cascade sizes follow a heavy-tailed distribution

Viral marketing:

Books: steep drop-off: power-law exponent -5

DVDs: larger cascades: exponent -1.55

Blogs:

Power-law exponent -2

However, it is not a simple branching process

A simple branching process (on a k-ary tree):

Every node infects each of k of its neighbors with prob. p

gives exponential cascade size distribution

Questions:

What role does the underlying social network play?

Can we make a step towards a more realistic cascade generation (propagation) model?
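The simple branching process on a k-ary tree can be sketched as follows. This is an illustrative simulation, not code from the lecture: the function name `branching_cascade_size` and the `max_depth` cutoff (added here so the loop always terminates) are assumptions.

```python
import random

def branching_cascade_size(k, p, rng, max_depth=20):
    """Size of one cascade in a branching process on a k-ary tree.

    Every infected node infects each of its k children independently
    with prob. p. max_depth is a cutoff added for this sketch.
    """
    size = 1       # the root is infected
    frontier = 1   # number of nodes infected at the current depth
    for _ in range(max_depth):
        # Each of the frontier * k children is infected with prob. p.
        newly_infected = sum(1 for _ in range(frontier * k) if rng.random() < p)
        if newly_infected == 0:
            break
        size += newly_infected
        frontier = newly_infected
    return size

def sample_sizes(k, p, runs, seed=0):
    """Draw `runs` cascade sizes with a fixed seed for repeatability."""
    rng = random.Random(seed)
    return [branching_cascade_size(k, p, rng) for _ in range(runs)]
```

Collecting many samples, for example with `sample_sizes(2, 0.3, 10_000)`, and plotting a histogram of the sizes is one way to compare this model's size distribution against the heavy-tailed distributions observed in the data.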


1) Randomly pick a blog to infect, add to cascade.

2) Infect each in-linked neighbor with probability β.

3) Add infected neighbors to cascade.

4) Set node infected in (1) to uninfected.

[Figure: the four steps illustrated on a small network of blogs B1, B2, B3, B4]
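The four steps of the cascade generation model can be sketched as below. This is a hedged illustration rather than the authors' implementation: the function name `generate_cascade` and the `in_links` dictionary are assumptions, and the sketch lets each blog join the cascade at most once, a simplification introduced here so that the loop is guaranteed to terminate.

```python
import random

def generate_cascade(in_links, beta, rng, start=None):
    """Grow one cascade on a blog network.

    in_links: dict mapping blog -> list of blogs that link to it (assumed format)
    Returns (cascade_nodes, cascade_edges) with edges as (linking_blog, linked_blog).
    """
    # Step 1: randomly pick a blog to infect and add it to the cascade.
    if start is None:
        start = rng.choice(sorted(in_links))
    cascade_nodes = {start}
    cascade_edges = []
    infected = [start]
    while infected:
        # Step 4: once processed, the blog is no longer infected.
        u = infected.pop()
        for v in in_links.get(u, ()):
            # Step 2: infect each in-linked neighbor with probability beta.
            if rng.random() < beta:
                # Step 3: add the infected neighbor to the cascade.
                cascade_edges.append((v, u))
                if v not in cascade_nodes:  # simplification: join at most once
                    cascade_nodes.add(v)
                    infected.append(v)
    return cascade_nodes, cascade_edges
```

Running this many times with a small β and tallying the sizes and shapes of the resulting cascades is how one would compare the generated cascades with the observed ones.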


Generative model produces realistic cascades

β=0.025

[Figure: count distributions for cascade size, cascade node in-degree, size of star cascade, and size of chain cascade, plus the most frequent cascades]


Blogs – information epidemics<br />

Which are the influential/infectious blogs?<br />

Viral marketing<br />

Who are the trendsetters?<br />

Influential people?<br />

Disease spreading<br />

Where to place monitoring stations to detect epidemics?
