19.08.2013 Views

Social and Information - SNAP - Stanford University


CS 322: (Social and Information) Network Analysis

Jure Leskovec

Stanford University


Probabilistic models of network diffusion

How cascades spread in real life:

Viral marketing

Blogs

Group membership

Last 20 minutes: mid-term course evaluation

10/27/2009 Jure Leskovec, Stanford CS322: Network Analysis


How do viruses/rumors propagate?

Will a flu-like virus linger, or will it become extinct?

(Virus) birth rate β: probability that an infected neighbor attacks

(Virus) death rate δ: probability that an infected node heals

[Figure: node N and its neighbors N1, N2, N3, with labels Infected, Healthy, Prob. β, Prob. δ]
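The birth and death rates above can be tried out with a small discrete-time simulation. This is only a sketch under stated assumptions: the synchronous update, the adjacency-dict graph format, and the names `step` and `simulate` are illustrative choices, not something given in the slides.

```python
import random


def step(adj, infected, beta, delta, rng):
    """One synchronous time step; returns the new set of infected nodes."""
    new_infected = set()
    for node in adj:
        if node in infected:
            # An infected node heals with probability delta.
            if rng.random() >= delta:
                new_infected.add(node)
        else:
            # Each infected neighbor attacks independently with probability beta.
            for nb in adj[node]:
                if nb in infected and rng.random() < beta:
                    new_infected.add(node)
                    break
    return new_infected


def simulate(adj, seeds, beta, delta, steps, seed=0):
    """Run the process and return the number of infected nodes at each step."""
    rng = random.Random(seed)
    infected = set(seeds)
    history = [len(infected)]
    for _ in range(steps):
        infected = step(adj, infected, beta, delta, rng)
        history.append(len(infected))
    return history


# Hypothetical toy graph: a triangle.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print(simulate(triangle, seeds=[0], beta=0.5, delta=0.2, steps=10))
```

Healed nodes become susceptible again in this sketch, which matches the SIS model rather than SIR.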


General scheme for epidemic models:

S…susceptible

E…exposed

I…infected

R…recovered

Z…immune


[Figure: number of nodes over time for the Susceptible, Infected, and Recovered populations]

Assuming perfect mixing, i.e., the network is a complete graph

The model dynamics:
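The equations for this slide are not present in the text. As a point of reference only, and assuming the same β (infection) and δ (recovery) rates used elsewhere in these slides, the textbook SIR dynamics under perfect mixing are usually written as below. This is the standard form, not necessarily the lecture's exact notation.

```latex
\begin{aligned}
\frac{dS}{dt} &= -\beta S I \\
\frac{dI}{dt} &= \beta S I - \delta I \\
\frac{dR}{dt} &= \delta I
\end{aligned}
```

The three right-hand sides sum to zero, so S + I + R stays constant over time.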


Susceptible-Infective-Susceptible (SIS) model

Cured nodes immediately become susceptible

Virus “strength”: s = β / δ

[Figure: Susceptible → Infective when infected by a neighbor with prob. β; Infective → Susceptible when cured internally with prob. δ]


[Figure: number of nodes over time for the Susceptible and Infected populations]

Assuming perfect mixing (complete graph):

$$\frac{dS}{dt} = -\beta S I + \delta I$$

$$\frac{dI}{dt} = \beta S I - \delta I$$
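A minimal numerical sketch of the SIS dynamics under perfect mixing, using forward Euler steps. It assumes S and I are fractions of the population (so S + I = 1); the function name `sis_euler` and the step size are illustrative choices, not taken from the slides.

```python
def sis_euler(beta, delta, i0, dt=0.01, steps=20000):
    """Integrate dS/dt = -beta*S*I + delta*I and dI/dt = beta*S*I - delta*I."""
    s, i = 1.0 - i0, i0
    for _ in range(steps):
        new_infections = beta * s * i   # flow from S to I
        cures = delta * i               # flow from I back to S
        s += dt * (-new_infections + cures)
        i += dt * (new_infections - cures)
    return s, i


# With these normalized fractions the infection persists only when beta > delta.
print(sis_euler(beta=0.5, delta=0.1, i0=0.01))  # settles near I = 1 - delta/beta
print(sis_euler(beta=0.1, delta=0.5, i0=0.01))  # I decays towards 0
```

Under this normalization the nonzero fixed point is I = 1 - δ/β, which follows from setting dI/dt = 0 with S = 1 - I.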


The epidemic threshold of a graph is a value t such that:

If strength s = β / δ < t, an epidemic cannot happen (it eventually dies out)

Given a graph, compute its epidemic threshold


What should t depend on?

avg. degree? and/or highest degree?

and/or variance of degree?

and/or third moment of degree?

and/or diameter?


We have no epidemic if:

$$\beta / \delta < \tau = 1 / \lambda_{1,A}$$

where β is the (virus) birth rate, δ is the (virus) death rate, τ is the epidemic threshold, and $\lambda_{1,A}$ is the largest eigenvalue of the adjacency matrix A

► $\lambda_{1,A}$ alone captures the property of the graph!

[Wang et al. 2003]
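The threshold condition can be checked numerically on small graphs. The sketch below estimates the largest adjacency eigenvalue with power iteration in plain Python. The dense list-of-lists matrix format, the shift by the identity, and the function names are assumptions made for illustration, and the code expects an undirected graph with at least one edge.

```python
def largest_eigenvalue(adj, iters=1000):
    """Estimate the largest eigenvalue of a symmetric 0/1 adjacency matrix."""
    n = len(adj)
    v = [1.0] * n
    for _ in range(iters):
        # Multiply by (A + I); the shift avoids oscillation on bipartite graphs.
        w = [v[i] + sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient v^T A v for the unit vector v.
    return sum(v[i] * adj[i][j] * v[j] for i in range(n) for j in range(n))


def epidemic_threshold(adj):
    """tau = 1 / lambda_1 for the given adjacency matrix."""
    return 1.0 / largest_eigenvalue(adj)


def no_epidemic(adj, beta, delta):
    """True when the virus strength beta/delta is below the threshold."""
    return beta / delta < epidemic_threshold(adj)


# Complete graph on 4 nodes: lambda_1 = 3, so tau = 1/3.
k4 = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
print(epidemic_threshold(k4))
```

For large sparse graphs a sparse eigenvalue solver would be the more practical choice; the dense loop here is only meant to make the definition concrete.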


[Figure: number of infected nodes (0 to 500) versus time (0 to 1000) on the Oregon graph (10,900 nodes and 31,180 edges), with β = 0.001 and δ = 0.05, 0.06, 0.07; the curves are labeled β/δ > τ (above threshold), β/δ = τ (at the threshold), and β/δ < τ (below threshold)]

[Wang et al. 2003]


Does it matter how many people are initially infected?


[Leskovec et al., SDM ’07]

Bloggers write posts and refer (link) to other posts, and the information propagates


[Figure: blogs and their posts connected by time-ordered hyperlinks, forming an information cascade]

Data – Blogs:

We crawled 45,000 blogs for 1 year

10 million posts and 350,000 cascades


[Leskovec et al., TWEB ’07]

Senders and followers of recommendations receive discounts on products

[Figure labels: 10% credit, 10% off]

• Data – Incentivized Viral Marketing program

• 16 million recommendations

• 4 million people

• 500,000 products


[Backstrom et al., KDD ’06]

Use social networks where people belong to explicitly defined groups

Each group defines a behavior that diffuses

Data – LiveJournal:

On-line blogging community with friendship links and user-defined groups

Over a million users update content each month

Over 250,000 groups to join


Prob. of adoption depends on the number of friends who have adopted [Bass ’69, Granovetter ’78]

What is the shape?

Distinction has consequences for models and algorithms

[Figure: two candidate curves of prob. of adoption versus k = number of friends adopting: diminishing returns? critical mass?]


DVD recommendations (8.2 million observations)

[Figure: probability of purchasing (0 to 0.1) versus # recommendations received (0 to 40)]

[Leskovec et al., TWEB ’07]


LiveJournal community membership

[Figure: prob. of joining versus k (number of friends in the community)]

[Backstrom et al., KDD ’06]


Sending email:

Email network of large university

Prob. of a link as a function of # of common friends

[Figure: prob. of email versus k (number of common friends)]

[Kossinets-Watts ’06]


For viral marketing:

We see that node v received the i-th recommendation and then purchased the product

For communities:

At time t we see the behavior of node v’s friends

Questions:

When did v become aware of recommendations or friends’ behavior?

When did it translate into a decision by v to act?

How long after this decision did v act?


Dependence on number of friends

Consider: connectedness of friends

x and y have three friends in the group

x’s friends are independent

y’s friends are all connected

Who is more likely to join?


Competing sociological theories

Information argument [Granovetter ’73]

Social capital argument [Coleman ’88]

Information argument:

Unconnected friends give independent support

Social capital argument:

Safety/trust advantage in having friends who know each other


LiveJournal: 1 million users, 250,000 groups

Social capital argument wins!

Prob. of joining increases with the number of adjacent members (friends in the group who know each other).

[Backstrom et al., KDD ’06]
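To make the notion of connected friends concrete, the sketch below counts how many pairs of a person's in-group friends are themselves linked. The exact measure used in the study is not given in these slides, so the function `adjacent_friend_pairs` and the toy data are assumptions for illustration, modeled on the earlier x and y example.

```python
from itertools import combinations


def adjacent_friend_pairs(friends_in_group, edges):
    """Count pairs of the given friends that are directly connected."""
    edge_set = {frozenset(e) for e in edges}
    return sum(
        1
        for a, b in combinations(list(friends_in_group), 2)
        if frozenset((a, b)) in edge_set
    )


# Hypothetical friendship edges among group members.
edges = [("d", "e"), ("e", "f"), ("d", "f")]
x_friends = ["a", "b", "c"]   # x's friends are independent
y_friends = ["d", "e", "f"]   # y's friends are all connected

print(adjacent_friend_pairs(x_friends, edges))  # 0
print(adjacent_friend_pairs(y_friends, edges))  # 3
```

Under the social capital argument, y (three adjacent pairs) would be the more likely of the two to join.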


Large anonymous online retailer (June 2001 to May 2003)

15,646,121 recommendations

3,943,084 distinct customers

548,523 products recommended

Products belonging to 4 product groups:

books

DVDs

music

VHS


[Figure: a recommendation cascade whose edges are ordered in time, t1 < t2 < … < tn]

Legend:

bought but didn’t receive a discount

bought and received a discount

received a recommendation but didn’t buy


There are relatively few DVD titles, but DVDs account for ~50% of recommendations.

Recommendations per person

DVD: 10

books and music: 2

VHS: 1

Recommendations per purchase

books: 69

DVDs: 108

music: 136

VHS: 203

Overall there are 3.69 recommendations per node on 3.85 different products.

Music recommendations reached about the same number of people as DVDs but used only 1/5 as many recommendations.

Book recommendations reached by far the most people: 2.8 million.

All networks have a very small number of unique edges. For books, videos and music the number of unique edges is smaller than the number of nodes: the networks are highly disconnected.


What role does the product category play?

         products   customers   recommendations   edges       buy + get discount   buy + no discount
Book     103,161    2,863,977   5,741,611         2,097,809   65,344               17,769
DVD      19,829     805,285     8,180,393         962,341     17,232               58,189
Music    393,598    794,148     1,443,847         585,738     7,837                2,739
Video    26,131     239,583     280,270           160,683     909                  467
Full     542,719    3,943,084   15,646,121        3,153,676   91,322               79,164

[Slide annotations: high, low, people, recommendations]
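The per-person and per-purchase figures quoted earlier can be reproduced from this table. The sketch below assumes that the number of purchases is the sum of the two "buy" columns; that reading is an assumption, although the truncated ratios it yields (69, 108, 136, 203) match the recommendations-per-purchase numbers on the earlier slide.

```python
# Rows copied from the product category table:
# (customers, recommendations, buy + get discount, buy + no discount)
TABLE = {
    "Book": (2_863_977, 5_741_611, 65_344, 17_769),
    "DVD": (805_285, 8_180_393, 17_232, 58_189),
    "Music": (794_148, 1_443_847, 7_837, 2_739),
    "Video": (239_583, 280_270, 909, 467),
}


def recs_per_person(category):
    """Recommendations divided by distinct customers."""
    customers, recs, _, _ = TABLE[category]
    return recs / customers


def recs_per_purchase(category):
    """Recommendations divided by purchases (assumed: sum of both buy columns)."""
    _, recs, with_discount, without_discount = TABLE[category]
    return recs / (with_discount + without_discount)


for name in TABLE:
    print(name, round(recs_per_person(name), 2), int(recs_per_purchase(name)))
```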


Some products are easier to recommend than others

product category   number of buy bits   forward recommendations   percent
Book               65,391               15,769                    24.2
DVD                16,459               7,336                     44.6
Music              7,843                1,824                     23.3
Video              909                  250                       27.6
Total              90,602               25,179                    27.8


Does sending more recommendations influence more purchases?

[Figure: number of purchases (0 to 7) versus outgoing recommendations (20 to 140)]


What is the effectiveness of subsequent recommendations?

[Figure: probability of buying (0.02 to 0.07) versus exchanged recommendations (5 to 40)]


consider successful recommendations in terms of

av. # senders of recommendations per book category

av. # of recommendations accepted

books overall have a 3% success rate (2% with discount, 1% without)

lower than average success rate (significant at p=0.01 level)

fiction

romance (1.78), horror (1.81)

teen (1.94), children’s books (2.06)

comics (2.30), sci-fi (2.34), mystery and thrillers (2.40)

nonfiction

sports (2.26)

home & garden (2.26)

travel (2.39)

higher than average success rate (statistically significant)

professional & technical

medicine (5.68)

professional & technical (4.54)

engineering (4.10), science (3.90), computers & internet (3.61)

law (3.66), business & investing (3.62)


47,000 customers are responsible for 2.5 million of the 16 million recommendations in the system

29% success rate per recommender of an anime DVD

Giant component covers 19% of the nodes

Overall, recommendations for DVDs are more likely to result in a purchase (7%), but the anime community stands out


Variable            transformation   Coefficient
const                                -0.940 ***
# recommendations   ln(r)            0.426 ***
# senders           ln(n_s)          -0.782 ***
# recipients        ln(n_r)          -1.307 ***
product price       ln(p)            0.128 ***
# reviews           ln(v)            -0.011 ***
avg. rating         ln(t)            -0.027 *
R^2                                  0.74

significance at the 0.01 (***), 0.05 (**) and 0.1 (*) levels


94% of users make first recommendation without having received one previously

Size of giant connected component increases from 1% to 2.5% of the network (100,420 users) – small!

Some sub-communities are better connected

24% out of 18,000 users for westerns on DVD

26% of 25,000 for classics on DVD

19% of 47,000 for anime (Japanese animated film) on DVD

Others are just as disconnected

3% of 180,000 home and gardening

2-7% for children’s and fitness DVDs


Products suited for Viral Marketing:

small and tightly knit community

few reviews, senders, and recipients

but sending more recommendations helps

pricey products

rating doesn’t play as much of a role

Observations for future diffusion models:

purchase decision more complex than threshold or simple infection

influence saturates as the number of contacts expands

links lose effectiveness if they are overused

Conditions for successful recommendations:

professional and organizational contexts

discounts on expensive items

small, tightly knit communities


How big are cascades?

What are the building blocks of cascades?

[Figure: example cascades for a medical guide book and a DVD; labels 973 and 938]


Given a (social) network

A process, by spreading over the network, creates a graph (a tree)

Let’s count cascades

[Figure: a social network and the cascade (propagation graph) spreading over it]


[subgraph] is the most common cascade subgraph

It accounts for ~75% of cascades in books, CD and VHS, but only 12% of DVD cascades

[subgraph] is 6 (1.2 for DVD) times more frequent than [subgraph]

For DVDs [subgraph] is more frequent than [subgraph]

Chains ([subgraph]) are more frequent than [subgraph]

[subgraph] is more frequent than a collision ([subgraph]) (but collision has fewer edges)

Late split ([subgraph]) is more frequent than [subgraph]


Stars (“no propagation”)

Bipartite cores (“common friends”)

Nodes having same friends


[Figure: books, count (10^0 to 10^6) versus x = cascade size (number of nodes, 10^0 to 10^2) on log-log axes, with fit count = 1.8e6 x^{-4.98}]

steep drop-off

very few large cascades


DVD cascades can grow large

Possibly as a result of websites where people sign up to exchange recommendations

[Figure: count (10^0 to 10^4) versus x = cascade size (number of nodes, 10^0 to 10^3) on log-log axes, with fit count ~ x^{-1.56}]

shallow drop-off, fat tail

a number of large cascades


The probability of observing a cascade on x nodes follows: p(x) ~ x^{-2}

[Figure: count versus x = cascade size (number of nodes)]
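One rough way to read an exponent off a cascade size histogram is a least-squares line on log-log axes. The slides do not say how their exponents were fitted, so this is only a sketch; the function `fit_power_law` is an illustrative name and the data below is synthetic. Least squares on binned counts is known to be a crude estimator, and maximum-likelihood fits are generally preferred for real data.

```python
import math


def fit_power_law(sizes, counts):
    """Fit count = c * size**alpha by least squares in log-log space."""
    xs = [math.log(x) for x in sizes]
    ys = [math.log(y) for y in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx                         # slope = exponent
    c = math.exp(mean_y - alpha * mean_x)     # intercept = log(c)
    return c, alpha


# Synthetic counts that follow an exact power law with exponent -2.
sizes = list(range(1, 21))
counts = [1000.0 * x ** -2 for x in sizes]
c, alpha = fit_power_law(sizes, counts)
print(c, alpha)
```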


Cascade sizes follow a heavy-tailed distribution

Viral marketing:

Books: steep drop-off: power-law exponent -5

DVDs: larger cascades: exponent -1.5

Blogs:

Power-law exponent -2

However, it is not a simple branching process

A simple branching process (on a k-ary tree):

Every node infects each of its k neighbors with prob. p

gives exponential cascade size distribution

Questions:

What role does the underlying social network play?

Can we make a step towards a more realistic cascade generation (propagation) model?
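The simple branching process can be simulated directly. The sketch below assumes each infected node has k fresh children and infects each independently with probability p; the size cap `max_size` and the function name are illustrative additions so that supercritical runs terminate.

```python
import random


def branching_cascade_size(k, p, rng, max_size=10_000):
    """Size of one cascade grown level by level on a k-ary tree."""
    size = 1        # the root is infected
    frontier = 1    # nodes infected in the previous generation
    while frontier and size < max_size:
        newly_infected = 0
        for _ in range(frontier * k):
            if rng.random() < p:
                newly_infected += 1
        size += newly_infected
        frontier = newly_infected
    return size


rng = random.Random(42)
sizes = [branching_cascade_size(k=2, p=0.25, rng=rng) for _ in range(10_000)]
print(sum(sizes) / len(sizes))  # mean size is 1 / (1 - k*p) when k*p < 1
```

When k·p < 1 the cascade dies out quickly and large sizes are rare, which is the contrast with the heavy-tailed distributions observed in the data.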


1) Randomly pick blog to infect, add to cascade.

2) Infect each in-linked neighbor with probability β

3) Add infected neighbors to cascade.

4) Set node infected in (1) to uninfected.

[Figure: the four steps illustrated on a small network of blogs B1 to B4]
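The four steps can be sketched as a short loop. Several details here are assumptions made for illustration: the graph is a dict mapping each blog to the blogs that link to it, the process repeats for every newly infected blog, cascade edges point from the linking blog to the blog it cites, and `max_edges` is a hypothetical safety cap, since a blog that has become uninfected can be infected again.

```python
import random


def generate_cascade(in_links, beta, rng, start=None, max_edges=10_000):
    """Grow one cascade; returns (nodes, edges) of the propagation graph."""
    if start is None:
        start = rng.choice(sorted(in_links))      # step 1: pick a random blog
    nodes = [start]
    edges = []
    infected = [start]
    while infected and len(edges) < max_edges:
        node = infected.pop()                     # step 4: node becomes uninfected
        for neighbor in in_links.get(node, []):
            if rng.random() < beta:               # step 2: infect with prob. beta
                edges.append((neighbor, node))    # step 3: add to the cascade
                if neighbor not in nodes:
                    nodes.append(neighbor)
                infected.append(neighbor)
    return nodes, edges


# Hypothetical blog graph: B2 links to B1, B3 and B4 link to B2.
in_links = {"B1": ["B2"], "B2": ["B3", "B4"], "B3": [], "B4": []}
print(generate_cascade(in_links, beta=0.5, rng=random.Random(0), start="B1"))
```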


Generative model produces realistic cascades

β=0.025

[Figure: count distributions for cascade size, cascade node in-degree, size of star cascade, and size of chain cascade, plus the most frequent cascades]


Blogs – information epidemics

Which are the influential/infectious blogs?

Viral marketing

Who are the trendsetters?

Influential people?

Disease spreading

Where to place monitoring stations to detect epidemics?
