CS224W: Social and Information Network Analysis

Jure Leskovec, Stanford University

http://cs224w.stanford.edu

Most influential set of size k: the set S of k nodes producing the largest expected cascade size f(S) if activated. [Domingos-Richardson ’01]

[Figure: example influence graph with nodes a-i and edge activation probabilities between 0.2 and 0.4.]

Optimization problem:

  max f(S)  over sets S of size k

10/20/2010, Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

Hill-climbing:

  Start with S_0 = {}

  For i = 1 … k:

    Choose the node v that maximizes f(S_{i-1} ∪ {v})

    Let S_i = S_{i-1} ∪ {v}

Hill-climbing produces a solution S where f(S) ≥ (1 − 1/e) of the optimal value (~63%). [Nemhauser, Fisher, Wolsey ’78; Kempe, Kleinberg, Tardos ’03]
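The greedy loop above can be sketched in Python; the toy coverage objective below stands in for the (expensive) expected cascade size f(S) and is illustrative only:

```python
def hill_climbing(nodes, f, k):
    """Greedy hill-climbing: repeatedly add the node with the largest
    marginal gain. For monotone submodular f this achieves at least a
    (1 - 1/e) fraction of the optimal value."""
    S = set()
    for _ in range(k):
        base = f(S)
        # node maximizing the marginal gain f(S ∪ {v}) − f(S)
        v = max((u for u in nodes if u not in S),
                key=lambda u: f(S | {u}) - base)
        S.add(v)
    return S

# Toy objective: a coverage function (monotone and submodular).
sets = {'a': {1, 2, 3}, 'b': {3, 4}, 'c': {4, 5, 6}, 'd': {1}}
f = lambda S: len(set().union(*(sets[u] for u in S)))
print(hill_climbing(sets, f, 2))  # picks {'a', 'c'}, covering all 6 elements
```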

The claim holds for functions f with two properties:

  f is monotone: if S ⊆ T then f(S) ≤ f(T), and f({}) = 0

  f is submodular: adding an element to a set gives less improvement than adding it to one of its subsets

[Figure: bar chart of marginal gains f(S_{i-1} ∪ {v}) for candidate nodes a-e.]

Next, we must show that f is submodular:

  For S ⊆ T:  f(S ∪ {u}) − f(S) ≥ f(T ∪ {u}) − f(T)

  (gain of adding a node to a small set) ≥ (gain of adding a node to a large set)

Basic fact: if f_1, …, f_k are submodular and c_1, …, c_k ≥ 0, then ∑_i c_i f_i is also submodular.

A simple submodular function: given sets S_1, …, S_m,

  f(S) = |∪_{i ∈ S} S_i|  (the size of the union of the sets S_i, i ∈ S)

[Figure: sets S ⊆ T and a new element u; u's marginal coverage is smaller for T than for S.]
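A brute-force numerical check of the submodularity inequality for such a coverage function (toy sets chosen for illustration):

```python
from itertools import chain, combinations

sets = {'a': {1, 2, 3}, 'b': {3, 4}, 'c': {4, 5, 6}, 'd': {1}}
f = lambda S: len(set().union(*(sets[u] for u in S)))

def subsets(xs):
    """All subsets of xs, as tuples."""
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

# Check f(S ∪ {u}) − f(S) ≥ f(T ∪ {u}) − f(T) for every S ⊆ T and u ∉ T
ok = all(f(set(S) | {u}) - f(set(S)) >= f(set(T) | {u}) - f(set(T))
         for T in subsets(sets) for S in subsets(T)
         for u in sets if u not in T)
print(ok)  # True: coverage functions are submodular
```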

Principle of deferred decisions: generate the randomness ahead of time.

  Flip a coin for each edge to decide whether it will succeed when (if ever) it attempts to transmit.

  Edges on which activation will succeed are "live".

  f_i(S) = size of the set reachable by live-edge paths from the nodes in S (e.g., S = {a, d})

[Figure: example graph with nodes a-i; live edges highlighted.]

Fix an outcome i of the coin flips:

  f_i(S) = size of the cascade from S given these coin flips

  f_i(v) = set of nodes reachable from v on live-edge paths

  f_i(S) = |∪_{v ∈ S} f_i(v)|  →  f_i(S) is submodular (a coverage function)

Expected influence set size:

  f(S) = ∑_i Prob(i) · f_i(S)  →  f is submodular!


Independent cascade model (with prob. 1%): [demo figure]

Independent cascade model (with prob. 10%): [demo figure]

What do we know about optimizing submodular reward functions?

  Hill-climbing is near optimal: (1 − 1/e) (~63%) of OPT.

But the hill-climbing algorithm is slow:

  At each iteration we need to re-evaluate the marginal gains of all nodes, then add the node with the highest marginal gain.

  It scales as O(n·k).

In round i+1: so far we have picked S_i = {s_1, …, s_i}; now pick

  s_{i+1} = argmax_x F(S_i ∪ {x}) − F(S_i)

i.e., maximize the "marginal benefit" δ_x(S_i) = F(S_i ∪ {x}) − F(S_i).

Observation: submodularity implies that for i ≤ j, δ_x(S_i) ≥ δ_x(S_j): marginal benefits δ_x never increase! [Leskovec et al., KDD ’07]

Lazy hill-climbing [Leskovec et al., KDD ’07]:

  Keep an ordered list of the marginal benefits b_i from the previous iteration.

  Re-evaluate b_i only for the top node.

  Re-sort and prune.

This is safe because submodularity guarantees f(S ∪ {u}) − f(S) ≥ f(T ∪ {u}) − f(T) for S ⊆ T: cached marginal benefits are upper bounds that can only shrink.

[Figure: bar chart of marginal gains for nodes a-e, updated lazily across iterations.]

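A sketch of the lazy strategy with a max-heap of cached marginal benefits (the coverage objective below is an illustrative stand-in for the real influence function):

```python
import heapq

def lazy_greedy(nodes, f, k):
    """Lazy hill-climbing: cached marginal benefits are valid upper
    bounds under submodularity, so only the heap's top node is
    re-evaluated each round ("re-sort and prune")."""
    S = set()
    base = f(S)
    # max-heap (via negation) of cached marginal benefits
    heap = [(-(f({u}) - base), u) for u in nodes]
    heapq.heapify(heap)
    while len(S) < k and heap:
        _, u = heapq.heappop(heap)
        fresh = f(S | {u}) - base              # re-evaluate the top node only
        if not heap or fresh >= -heap[0][0]:
            S.add(u)                           # still the best: take it
            base += fresh
        else:
            heapq.heappush(heap, (-fresh, u))  # push back with updated gain
    return S

sets = {'a': {1, 2, 3}, 'b': {3, 4}, 'c': {4, 5, 6}, 'd': {1}}
f = lambda S: len(set().union(*(sets[u] for u in S)))
print(lazy_greedy(sets, f, 2))  # same answer as plain hill-climbing
```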

Given a real city water distribution network, and data on how contaminants spread in the network. Problem posed by the US Environmental Protection Agency. [Leskovec et al., KDD ’07]

Utility of placing sensors:

  Model the water flow dynamics, demands of households, …

  For each subset A ⊆ V, where V is the set of all network junctions, compute the utility F(A).

  The model predicts high-, medium-, and low-impact contamination locations; a sensor reduces the impact through early detection!

  Example: high sensing quality F(A) = 0.9, low sensing quality F(A) = 0.01.

[Figure: water network with candidate sensor locations S1-S4 and contamination impact levels.]

Given a graph G(V, E), a budget B for sensors, and data on how contaminations spread over the network (for each contamination i we know the time T(i, u) at which it contaminated node u):

Select a subset of nodes A that maximizes the expected reward,

  max_A ∑_i Prob(i) · R_i(A)   subject to cost(A) < B

where R_i(A) is the reward for detecting contamination i. [Leskovec et al., KDD ’07]

Observation: diminishing returns.

  Placement A = {s_1, s_2}: adding a new sensor s' helps a lot.

  Placement A' = {s_1, s_2, s_3, s_4}: adding s' helps very little.

[Figure: sensor placements A and A' with the new sensor s'.] [Leskovec et al., KDD ’07]

Claim: the reward function is submodular.

  Consider cascade i:

    R_i(u_k) = set of nodes saved if the cascade is detected at u_k

    R_i(A) = size of the union of the R_i(u_k), u_k ∈ A

    → R_i is submodular

  Global optimization:

    R(A) = ∑_i Prob(i) · R_i(A)  →  R(A) is submodular

[Figure: cascade/outbreak i with detection nodes u_1, u_2.] [Leskovec et al., KDD ’07]

Consider hill-climbing ignoring the cost: ignore the sensor cost and repeatedly select the sensor with the highest marginal gain.

Example: n sensors, budget B:

  s_1: reward r, cost B

  s_2 … s_n: reward r − ε, cost 1

Hill-climbing always prefers the more expensive sensor with reward r to a cheaper sensor with reward r − ε.

→ For variable costs it can fail arbitrarily badly. [Leskovec et al., KDD ’07]

Idea: what if we optimize the benefit-cost ratio instead?

But the benefit-cost ratio can also fail arbitrarily badly. Consider a budget B and 2 sensors s_1 and s_2:

  Costs: c(s_1) = ε, c(s_2) = B

  Only 1 cascade, with rewards R(s_1) = 2ε, R(s_2) = B

Then the benefit-cost ratios are bc(s_1) = 2 and bc(s_2) = 1. So we first select s_1, and then cannot afford s_2. → We get reward 2ε instead of B. Now send ε to 0 and we get an arbitrarily bad solution. [Leskovec et al., KDD ’07]

CELF (cost-effective lazy forward-selection): a two-pass greedy algorithm [Leskovec et al., KDD ’07]:

  Solution A'': use benefit-cost greedy

  Solution A': use unit-cost greedy

  Final solution: argmax(R(A''), R(A'))

How far is CELF from the (unknown) optimal solution?

Theorem: CELF is near optimal; it achieves a ½(1 − 1/e) factor approximation.
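A compact sketch of CELF's two-pass structure (lazy evaluation omitted for brevity; `R`, `cost`, and the toy instance are illustrative assumptions, not the paper's code):

```python
def budgeted_greedy(items, R, cost, B, use_ratio):
    """One greedy pass: pick by marginal gain (unit-cost pass) or by
    marginal gain per unit cost (benefit-cost pass), within budget B."""
    A, spent = set(), 0.0
    while True:
        best, best_score = None, 0.0
        for u in items:
            if u in A or spent + cost(u) > B:
                continue
            gain = R(A | {u}) - R(A)
            score = gain / cost(u) if use_ratio else gain
            if score > best_score:
                best, best_score = u, score
        if best is None:
            return A
        A.add(best)
        spent += cost(best)

def celf(items, R, cost, B):
    """CELF: run both passes, keep whichever solution has higher reward."""
    a_bc = budgeted_greedy(items, R, cost, B, use_ratio=True)
    a_uc = budgeted_greedy(items, R, cost, B, use_ratio=False)
    return max(a_bc, a_uc, key=R)

# Toy instance from the benefit-cost failure example:
eps, B = 0.01, 1.0
cost = lambda u: {'s1': eps, 's2': B}[u]
R = lambda A: (2 * eps if 's1' in A else 0) + (B if 's2' in A else 0)
print(celf({'s1', 's2'}, R, cost, B))  # {'s2'}: reward B beats 2ε
```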

Real metropolitan area water network:

  V = 21,000 nodes

  E = 25,000 pipes

Using a cluster of 50 machines for a month, we simulate 3.6 million epidemic scenarios (152 GB of epidemic data). By exploiting sparsity we fit the data into main memory (16 GB). [Leskovec et al., KDD ’07]

[Figure: population protected F(A) (higher is better) vs. number of sensors placed on the water network: the greedy solution compared against the offline (Nemhauser) bound.]

The (1 − 1/e) bound is quite loose… can we get better bounds?

Suppose S is some solution to argmax_S f(S) s.t. |S| ≤ k, where f(S) is monotone & submodular, and let T = {t_1, …, t_k} be the OPT solution.

CLAIM: For each x ∈ V\S, let δ_x = f(S ∪ {x}) − f(S), and order the δ's so that δ_1 ≥ δ_2 ≥ … ≥ δ_n. Then:

  f(T) ≤ f(S) + ∑_{i=1}^{k} δ_i

Proof: Suppose S is some candidate solution to argmax f(S) s.t. |S| ≤ k, and let T = {t_1, …, t_k} be an optimal solution. Then, by monotonicity:

  f(T) ≤ f(S ∪ T)
       = f(S) + ∑_i [ f(S ∪ {t_1, …, t_i}) − f(S ∪ {t_1, …, t_{i-1}}) ]   (telescoping sum)
       ≤ f(S) + ∑_i [ f(S ∪ {t_i}) − f(S) ]   (by submodularity)
       = f(S) + ∑_i δ_{t_i}

where, for each s ∈ V\S, δ_s = f(S ∪ {s}) − f(S). Order the δ's so that δ_1 ≥ δ_2 ≥ … ≥ δ_n. Since the sum above has at most k terms:

  f(T) ≤ f(S) + ∑_{i=1}^{k} δ_i
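The claim gives a bound that is easy to compute in practice; a sketch, again with an illustrative coverage objective:

```python
def data_dependent_bound(nodes, f, S, k):
    """Upper-bounds f(T) for the (unknown) optimal size-k set T:
    f(T) <= f(S) + sum of the k largest marginal gains w.r.t. S."""
    base = f(S)
    deltas = sorted((f(S | {x}) - base for x in nodes if x not in S),
                    reverse=True)
    return base + sum(deltas[:k])

sets = {'a': {1, 2, 3}, 'b': {3, 4}, 'c': {4, 5, 6}, 'd': {1}}
f = lambda S: len(set().union(*(sets[u] for u in S)))
S = {'a'}  # a partial greedy solution with f(S) = 3
print(data_dependent_bound(sets, f, S, 2))  # 3 + (3 + 1) = 7
```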

[Figure: sensing quality F(A) (higher is better) vs. number of sensors placed: the data-dependent bound on the greedy solution is much tighter than the offline (Nemhauser) bound.]

Submodularity gives data-dependent bounds on the performance of any algorithm.

Heuristic placements perform much worse: one really needs to consider the spread of epidemics. [Leskovec et al., KDD ’07]


CELF is an order of magnitude faster than hill-climbing. [Leskovec et al., KDD ’07]

I have 10 minutes. Which blogs should I read to be most up to date? Equivalently: who are the most influential bloggers?

Tradeoff: one placement detects the blue & yellow stories soon but misses the red one; another detects all stories, but late. We want to read things before others do.

The online bound is much tighter: 13% instead of 37%. [Leskovec et al., KDD ’07]

[Figure: CELF's performance compared against the old (offline) bound and our (online) bound.]

Unit cost: the algorithm picks large popular blogs: instapundit.com, michellemalkin.com.

Variable cost (proportional to the number of posts): we can do much better when considering costs. [Leskovec et al., KDD ’07]

[Figure: solution quality under unit cost vs. variable cost.]

But then the algorithm picks lots of small blogs that participate in few cascades. We pick the best solution that interpolates between the costs: we can get good solutions with both few blogs and few posts. [Leskovec et al., KDD ’07]

[Figure: each curve represents solutions with the same final reward.]

Heuristics perform much worse: one really needs to perform the optimization. [Leskovec et al., KDD ’07]

CELF runs 700 times faster than the simple hill-climbing algorithm. [Leskovec et al., KDD ’07]