Social and Information - SNAP - Stanford University
Social and Information - SNAP - Stanford University
Social and Information - SNAP - Stanford University
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
CS 322: (<strong>Social</strong> <strong>and</strong> <strong>Information</strong>) Network Analysis<br />
Jure Leskovec<br />
<strong>Stanford</strong> <strong>University</strong>
Probabilistic models of network diffusion<br />
How cascades spread in real life:<br />
Viral marketing<br />
Blogs<br />
GGroup membership b hi<br />
Last 20 minutes: mid‐term mid term course evaluation<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis 2
How do viruses/rumors propagate?<br />
Will a flu‐like virus linger, or will it become extinct?<br />
(Virus) birth rate β: probability than an infected<br />
neighbor attacks<br />
(Virus) death rate δ: probability that an infected node<br />
heals<br />
10/27/2009<br />
N 1<br />
Infected<br />
Prob. β<br />
Prob. δ<br />
N<br />
N 2<br />
Healthy<br />
Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis 3<br />
N 3
General scheme for epidemic models:<br />
S…susceptible<br />
E…exposed<br />
I…infected<br />
R…recoveredd<br />
Z…immune<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
4
Number<br />
of noodes<br />
time<br />
Susceptible Infected Recovered<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
Assuming g pperfect<br />
mixing, i.e., a<br />
network is a<br />
complete graph<br />
The model dynamics:<br />
5
Susceptible‐Infective‐Susceptible Susceptible Infective Susceptible (SIS) model<br />
Cured nodes immediately become susceptible<br />
Virus “strength”: strength : s = β / δ<br />
10/27/2009<br />
Infected by neighbor<br />
with prob. β<br />
Susceptible Infective<br />
Cured internally<br />
with prob. δ<br />
Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis 6
Number N of nodes n<br />
time<br />
SSusceptible sceptible Infected<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
Assuming perfect f<br />
mixing (complete<br />
graph):<br />
dS<br />
dt<br />
dI<br />
dt<br />
<br />
SI <br />
<br />
<br />
SI <br />
<br />
I<br />
I<br />
7
Epidemic threshold of a graph is a value of t, t<br />
such that:<br />
If strength s = β / δ < t epidemic can not happen<br />
(it eventually dies out)<br />
Given a graph compute its epidemic threshold<br />
10/27/2009<br />
Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis 8
What should t depend on?<br />
avg. degree? <strong>and</strong>/or highest degree?<br />
<strong>and</strong>/or variance of degree?<br />
<strong>and</strong>/or third moment of degree?<br />
<strong>and</strong>/or diameter?<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
9
We have no epidemic if:<br />
(Virus) ( ) Death<br />
rate<br />
Epidemic threshold<br />
β/δ β < τ = 1/ λ 1,A<br />
(Virus) Birth rate largest eigenvalue<br />
of adj. matrix A<br />
► λ λ1,AA alone captures the property of the graph!<br />
[Wang et al. 2003]<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
10
N umber off<br />
Infectedd<br />
Nodes<br />
500<br />
400<br />
300<br />
200<br />
100<br />
0<br />
Oregon<br />
β = 0.001 0001<br />
0 250 500 750 1000<br />
Time<br />
δ: 0.05 0.06 0.07<br />
[Wang et al. 2003]<br />
10,900 nodes <strong>and</strong><br />
3, 31,180 edges g<br />
β/δ > τ<br />
(above threshold)<br />
β/δ = τ<br />
(at the threshold)<br />
β/δ β < τ<br />
(below threshold)<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
11
Does it matter how many people are<br />
initially infected?<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
12
[Leskovec et al., SDM ’07]<br />
Bloggers write posts <strong>and</strong> refer (link) to other<br />
posts <strong>and</strong> the information propagates<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
13
Posts<br />
Blogs<br />
Time<br />
ordered<br />
hyperlinks<br />
Dt Data – Bl Blogs:<br />
We crawled 45,000 blogs for 1 year<br />
10 million posts <strong>and</strong> 350,000 cascades<br />
<strong>Information</strong><br />
cascade<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
14
[Leskovec et al., TWEB ’07]<br />
Senders <strong>and</strong> followers of recommendations<br />
receive discounts on products<br />
10% credit 10% off<br />
• Data – Incentivized Viral Marketing program<br />
• 16 million recommendations<br />
• 4 million illi people l<br />
• 500,000 products<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
15
[Backstrom et al., KDD ’06]<br />
Use social networks where people belong to<br />
explicitly defined groups<br />
Each group defines a behavior that diffuses<br />
Data – LiveJournal:<br />
On‐line blogging community with friendship links<br />
<strong>and</strong> user‐defined groups g p<br />
Over a million users update content each month<br />
Over 250,000 groups to join<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
16
Prob. Prob of adoption depends on the number of<br />
friends who have adopted [Bass ‘69, Granovetter ’78]<br />
Prob. of o adoptioon<br />
What is the shape?<br />
Distinction has consequences for models <strong>and</strong> algorithms<br />
k = number of friends adopting k = number of friends adopting<br />
Prob. of o adoptioon<br />
Diminishing returns? Critical mass?<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis 17
Probbability<br />
of o purchaasing<br />
0.1<br />
0.09<br />
0.08<br />
0.07<br />
0.06<br />
005 0.05<br />
0.04<br />
0.03<br />
002 0.02<br />
0.01<br />
0<br />
DVD recommendations<br />
(8.2 million observations)<br />
0 10 20 30 40<br />
# recommendations received<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
[Leskovec et al., TWEB ’07]<br />
18
LiveJournal community membership<br />
Prrob.<br />
of jooining<br />
k ( (number b of f ffriends i d i in th the community) it )<br />
[Backstrom et al., KDD ’06]<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
19
Sending email:<br />
Email network of large university<br />
Prob. of a link as a function of # of common friends<br />
mail<br />
ob. of em<br />
Pro<br />
k (number of common friends)<br />
[Kossinets‐Watts ‘06]<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
20
For viral marketing:<br />
We see that node v receiving the i‐th<br />
recommendation <strong>and</strong> then purchased p the pproduct<br />
For communities:<br />
At time t we see the behavior of node v’s friends<br />
Questions:<br />
When did v become aware of recommendations or<br />
fi friends’ d’bh behavior? i ?<br />
When did it translate into a decision by v to act?<br />
How long after this decision did v act?<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
21
Dependence on number of friends<br />
Consider: connectedness of friends<br />
x <strong>and</strong> y have three friends in the group<br />
x’s friends are independent<br />
’ fi d ll td x y<br />
y’s friends are all connected<br />
Who is more likely to join?<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
22
Competing sociological theories x y<br />
Competing sociological theories<br />
<strong>Information</strong> argument [Granovetter ‘73]<br />
S<strong>Social</strong> i lcapital it l argument t [C [Coleman l ’88]<br />
<strong>Information</strong> argument:<br />
Unconnected friends give independent support<br />
<strong>Social</strong> capital argument:<br />
Safety/truest advantage in having friends who<br />
know each other<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
23
LiveJournal: 1 million users, 250,000 groups<br />
<strong>Social</strong> capital p argument g wins!<br />
Prob. of joining increases with<br />
adjacent members.<br />
[Backstrom et al. KDD ‘06]<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis 24
Large anonymous online retailer<br />
(June 2001 to May 2003)<br />
15 15,646,121 646 121 recommendations dti<br />
3,943,084 distinct customers<br />
548 548,523 523 products recommended<br />
Products belonging to 4 product groups:<br />
books<br />
DVDs<br />
music<br />
VHS<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
25
t 1 < t 2 < … < t n<br />
t 1<br />
t 4<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
t 3<br />
t 2<br />
legend<br />
bought but didn’t<br />
receive i a di discountt<br />
bought <strong>and</strong><br />
received a discount<br />
received a recommendation<br />
but didn’t buy<br />
26
There are relatively few DVD titles, but DVDs account for ~ 50% of recommendations.<br />
Recommendations pper pperson<br />
DVD: 10<br />
books <strong>and</strong> music: 2<br />
VHS: 1<br />
Recommendations per purchase<br />
books: 69<br />
DVDs: 108<br />
music: 136<br />
VHS: 203<br />
Overall there are 3.69 recommendations per node on 3.85 different products.<br />
Music recommendations reached about the same number of people p p as DVDs but used only y<br />
1/5 as many recommendations<br />
<br />
Book recommendations reached by far the most people –2.8 million.<br />
All networks have a very small number of unique edges. For books, videos <strong>and</strong> music the<br />
number of unique edges is smaller than the number of nodes – the networks are highly<br />
disconnected<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
27
What role does the product category play?<br />
products customers<br />
recommenda-<br />
buy + get buy + no<br />
edges<br />
tions discount discount<br />
Book 103,161 2,863,977 5,741,611 2,097,809 65,344 17,769<br />
DVD 19,829 805,285 8,180,393 962,341 17,232 58,189<br />
Music 393,598 794,148 1,443,847 585,738 7,837 2,739<br />
Video 26,131 239,583 280,270 160,683 909 467<br />
FFull ll 542 542,719 719 33,943,084 943 084 15 15,646,121 646 121 33,153,676 153 676 91 91,322 322 79 79,164 164<br />
high<br />
low<br />
people<br />
recommendations<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
28
Some products are easier to recommend than<br />
others<br />
product d t category t<br />
number of buy forward<br />
bits recommendations<br />
Book 65,391 15,769 24.2<br />
DVD 16,459 7,336 44.6<br />
Music 7,843 1,824 23.3<br />
Video 909 250 27.6<br />
Total 90 90,602 602 25 25,179 179 27 27.8 8<br />
percentt<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
29
Does sending more recommendations<br />
influence more purchases?<br />
Number<br />
of Purchaases<br />
7<br />
6<br />
5<br />
4<br />
3<br />
2<br />
1<br />
0<br />
20 40 60 80 100 120 140<br />
Outgoing Recommendations<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
30
What is the effectiveness of subsequent<br />
recommendations?<br />
Probbability<br />
of buying<br />
007 0.07<br />
0.06<br />
0.05<br />
0.04<br />
0.03<br />
002 0.02<br />
5 10 15 20 25 30 35 40<br />
Exchanged recommendations<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
31
consider successful recommendations in terms of<br />
av. av # senders of recommendations per book category<br />
av. # of recommendations accepted<br />
books overall have a 3% success rate<br />
(2% with discount, 1% without)<br />
lower than average success rate (significant at p=0.01 p=0 01 level)<br />
fiction<br />
romance (1.78), horror (1.81)<br />
teen (1.94), children’s books (2.06)<br />
comics i (2 (2.30), 30) sci‐fi i fi (2 (2.34), 34) mystery t <strong>and</strong> d th thrillers ill (2 (2.40) 40)<br />
nonfiction<br />
sports (2.26)<br />
home & garden (2.26)<br />
travel (2.39) (2 39)<br />
higher than average success rate (statistically significant)<br />
professional & technical<br />
medicine (5.68)<br />
professional & technical (4 (4.54) 54)<br />
engineering (4.10), science (3.90), computers & internet (3.61)<br />
law (3.66), business & investing (3.62)<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
32
47,000 47 000 customers responsible for the 2.5 25out out of<br />
16 million recommendations in the system<br />
29% success rate per recommender of an anime<br />
DVD<br />
Giant component covers 19% of the nodes<br />
Overall, recommendations for DVDs are more<br />
likely to result in a purchase (7%), but the anime<br />
community i st<strong>and</strong>s d out<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
33
Variable transformation Coefficient<br />
const -0.940 ***<br />
# recommendations ln(r) 00.426 426 ***<br />
# senders ln(n s) -0.782 ***<br />
# recipients ln(n ln(nr) ) -11.307 307 ***<br />
product price ln(p) 0.128 ***<br />
# reviews ln(v) -0 -0.011 011***<br />
avg. rating ln(t) -0.027 *<br />
R2 R 074<br />
2 0.74<br />
significance at the 0.01 (***), 0.05 (**) <strong>and</strong> 0.1 (*) levels<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis 34
94% of users make first recommendation without having g<br />
received one previously<br />
Size of giant connected component increases from 1% to<br />
2 2.5% % of f the h network k (100,420 (100 20 users) ) – small! ll!<br />
Some sub‐communities are better connected<br />
24% out of f 18,000 18 000 users ffor westerns on DVD V<br />
26% of 25,000 for classics on DVD<br />
19% of 47,000 for anime (Japanese animated film) on DVD<br />
Others are just as disconnected<br />
3% of 180,000 home <strong>and</strong> gardening<br />
2‐7% for children’s <strong>and</strong> fitness DVDs<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
35
Products suited for Viral Marketing:<br />
small <strong>and</strong> tightly knit community<br />
few reviews, senders, <strong>and</strong> recipients<br />
but sending more recommendations helps<br />
pricey products<br />
rating doesn’t play as much of a role<br />
Observations for future diffusion models:<br />
purchase decision more complex than threshold or simple infection<br />
influence saturates as the number of contacts exp<strong>and</strong>s<br />
links user effectiveness if they are overused<br />
Conditions for successful recommendations:<br />
professional <strong>and</strong> organizational contexts<br />
discounts on expensive items<br />
small small, tightly knit communities<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
36
How big are cascades?<br />
What are the building<br />
blocks of cascades?<br />
973<br />
938<br />
Medical guide book DVD<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
37
Given a (social) network<br />
A process by spreading over the network<br />
creates a graph (a tree)<br />
<strong>Social</strong> network<br />
Let’s count cascades<br />
Cascade<br />
(propagation graph)<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
38
is the most common cascade subgraph<br />
It accounts for ~75% cascades in books, CD<br />
<strong>and</strong> VHS, , only y 12% of DVD cascades<br />
is 6 (1.2 for DVD) times more frequent<br />
than<br />
For DVDs is more frequent than<br />
Chains ( ) are more frequent than<br />
iis more ffrequent tth than a collision lli i<br />
( ) (but collision has less edges)<br />
Late split ( ) is more frequent than<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
39
Stars ( (“no no propagation propagation”) )<br />
Bipartite cores (“common friends”)<br />
Nodes having same friends<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
40
Coount<br />
10 6<br />
10 4<br />
10 2<br />
10 0 10 0<br />
= 1.8e6 x -4.98<br />
10 1<br />
steep t drop‐off d ff books<br />
very few large cascades<br />
10 2<br />
x = Cascade size (number of nodes)<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
41
DVD cascades can grow large<br />
Possibly as a result of websites where people<br />
sign up to exchange recommendations<br />
Count<br />
10 4<br />
10 2<br />
10 0 10 0<br />
~ x -1.56<br />
10 1<br />
10 2<br />
shallow drop off –fat tail<br />
a number of large cascades<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
42<br />
10 3<br />
x = Cascade size (number of nodes)
The pprobability yof observing g a cascade on x<br />
nodes follows: p(x) ~ x ‐2<br />
Count<br />
x = Cascade size (number of nodes)<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
43
Cascade sizes follow a heavy‐tailed y distribution<br />
Viral marketing:<br />
Books: steep drop‐off: power‐law exponent ‐5<br />
DVDs: s larger agecascades: cascades exponent epoe ‐1.55<br />
Blogs:<br />
Power‐law exponent ‐2<br />
However, o e e , it is s not o a ssimple peba branching c gpocess process<br />
A simple branching process (a on k‐ary tree):<br />
Every node infects each of k of its neighbors with prob. p<br />
gives exponential cascade size distribution<br />
Questions:<br />
What role does the underlying social network play?<br />
CCan make k a step t towards t d more realistic li ti cascade d generation ti<br />
(propagation) model?<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
44
1) R<strong>and</strong>omly pick blog to<br />
infect, infect add to cascade. cascade<br />
1<br />
B1 1 B2 1 3<br />
B 3<br />
1<br />
B 4<br />
3) Add infected neighbors<br />
to cascade.<br />
1<br />
B1 1 B2 1 3<br />
B 3<br />
1<br />
B 4<br />
2<br />
2<br />
B 1<br />
2) Infect each in‐linked<br />
neighbor with probability <br />
1<br />
B1 1 B2 1 3<br />
B 3<br />
1<br />
B 4<br />
2<br />
4) Set node infected in (i) to<br />
uninfected.<br />
1<br />
B1 1 B2 B 1 B 1<br />
B4 B4 1 B3 B 3<br />
4<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
45<br />
1<br />
2<br />
B 1
Generative model<br />
produces realistic<br />
cascades<br />
β=0.025<br />
Count<br />
Count<br />
Cascade size<br />
Count<br />
Coount<br />
Cascade node in‐degree<br />
Most frequent cascades Size of star cascade Size of chain cascade<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
46
Blogs – information epidemics<br />
Which are the influential/infectious blogs?<br />
Viral marketing<br />
Who are the trendsetters?<br />
Influential people?<br />
Disease spreading<br />
Where to place p monitoring g stations to detect<br />
epidemics?<br />
10/27/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis<br />
47