Big Data: Science or Gimmick?<br />

Xueqi Cheng<br />
CAS Key Lab of Network Data Science and Technology (NDST)<br />
Institute of Computing Technology, Chinese Academy of Sciences

Is Big Data Big Value?<br />
Sensing the World<br />
Monitoring development index<br />
(Environment, Health, Economy…)<br />
Crime Detection on the Web<br />
• Social media data and log data up to PB size<br />
• Structured and unstructured data<br />
• Historical data and streaming data<br />
• PB-size log data<br />
• EB-size monitoring data<br />
• Unpredictable criminal activity<br />
Challenges: Large volume, variety, complex connections, and heavy noise in the data make measurement, fusion, and pattern mining difficult.

Is Big Data Big Value?<br />
Predicting Tomorrow<br />
• Election prediction based on Twitter<br />
• Hedge fund based on sentiment analysis of Twitter<br />
• UN Global Pulse: predicting unemployment rate and disease<br />
Challenges: Strong interaction, real-time, and dynamic properties fragment the life cycle of data, make it hard to balance latency and accuracy, and make trends difficult to predict.

Is Big Data Big Value? Meltdown Modelling<br />
Buchanan, M. (2009), Meltdown modelling, Nature 460, 680-682.<br />
• One morning in 2016, the orange warning light on an electronic display suddenly starts flashing nonstop: US government experts have detected an early-warning signal that bears on national security. The display is linked to some of the world's largest financial institutions, including banks, governments, hedge funds, and online syndicates. The flashing orange light indicates that US hedge funds have crowded into the same financial assets. If one fund suddenly liquidates its position at this point, the warning signal appears; the resulting price drop forces other funds to follow and sell, which accelerates the fall in asset prices. Many funds could go bankrupt within just 30 minutes, posing a severe threat to the entire financial system.<br />
• Based on Big Data, can we prevent another financial crisis?

Big Challenges Exist in Big Data<br />
System Complexity<br />
• How to design the big data system?<br />
• How to optimize?<br />
Computation Complexity<br />
• Is that computable?<br />
• How to compute?<br />
Data Complexity<br />
• How to represent the data?<br />
• How to measure the data complexity?<br />
[Diagram labels: Big Data; Processing Architecture for Data Life Cycle; New Computing Paradigm for Core Data; Structure Regularity; towards Global Data]

Data Garbage or Data Goldmine?<br />
Can we address these big scientific challenges in Big Data?<br />
No → Data Garbage<br />
Yes → Data Goldmine

Topics in this talk<br />
Data Complexity<br />
• How to represent the data?<br />
– Finding the semantic representations<br />
Computation Complexity<br />
• How to compute with Web-scale data?<br />
– Bring order to big data<br />
– Predict the dynamic structure<br />
System Complexity<br />
– Web Data Engine System: ADA System

Data Complexity<br />
Finding the semantic representations of large-scale Web documents

Topic Modeling<br />
Topic modeling provides methods for automatically organizing, understanding, searching, and summarizing large electronic archives.<br />
1. Discover the hidden themes that pervade the collection.<br />
2. Annotate the documents according to those themes.<br />
3. Use the annotations to organize, summarize, and search the texts.

Topic Modeling meets Big Data<br />
[Diagram: Topic Modeling meets “Big Data”; labels: Semantic Sparseness, Topic Sparseness, Feature Sparseness, Doc space, Topic space]<br />


Understanding the Topic (Topic Sparseness):<br />
Group Sparse Topical Coding: From Code to Topic<br />


Sparse and Meaningful Topics<br />
• Lots of data and lots of potential topics (global), but relatively sparse topics in each individual document (local)<br />
• Conventional idea: control the sparseness of the topics at the document level (Doc space → Topic space)<br />
Probabilistic model (Blei et al., ’03)<br />
• Meaningful interpretation<br />
• Hard to control sparsity<br />
Non-probabilistic model (Lin., ’08)<br />
• Effective sparsity control<br />
• Lacks clear semantic meaning<br />
Can we find a better way to enjoy the merits of both? Yes!

Basic Idea<br />
• A document is composed of words<br />
• The topics of a document are thus composed of the topics of its words<br />
• If we directly control the sparseness of the topics of words with proper methods<br />
– We can in turn control the sparseness of the topics of the document (controllable sparsity: the topic sparsity of a document is controlled through the topic distributions of its words)<br />
– We can in turn recover the topic proportion of the document (semantic interpretability: the topic distribution of the document can be recovered)

Group Sparse Topical Coding<br />
Words to Codes (words → codes, via Group Lasso)<br />
Objective function:<br />
$$\operatorname*{argmin}_{s,\beta} \sum_{d\in D}\Big(\sum_{n\in d} L\big(w_{d,n\cdot};\, s_{d,n\cdot}^{T}\beta_{n\cdot}\big) + \lambda\sum_{k=1}^{K}\lVert s_{d,\cdot k}\rVert_{2} + \lambda\sum_{n\in d}\lVert s_{d,n\cdot}\rVert_{1}\Big)$$<br />
Lasso → adds sparse control on individual variables<br />
Group Lasso → adds sparse control on groups of variables<br />
$$L\big(w_{d,n\cdot};\, s_{d,n\cdot}^{T}\beta_{n\cdot}\big) = \sum_{k=1}^{K} s_{d,nk}\beta_{nk} - w_{d,n\cdot}\ln\Big(\sum_{k=1}^{K} s_{d,nk}\beta_{nk}\Big) + C$$
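
As a rough illustration of the objective on this slide, the sketch below evaluates it for a single document: a Poisson-style loss per word plus a group (L2) penalty per topic column and an L1 penalty over all codes. The function name `gstc_document_objective`, the list-of-lists layout (`s[n][k]`, `beta[n][k]`), and the use of one shared `lam` for both penalties are assumptions made for illustration; the constant C is dropped, and this is not the authors' implementation.

```python
import math

def gstc_document_objective(w, s, beta, lam):
    """Evaluate a single-document, GSTC-style objective (illustrative only).

    w[n]       : observed count of the n-th word in the document
    s[n][k]    : code of word n on topic k (assumed non-negative)
    beta[n][k] : weight linking word n and topic k (assumed layout)
    lam        : regularization weight shared by both penalties (assumption)
    """
    num_words = len(w)
    num_topics = len(s[0])

    # Poisson-style loss: rate - count * ln(rate), additive constant omitted.
    loss = 0.0
    for n in range(num_words):
        rate = sum(s[n][k] * beta[n][k] for k in range(num_topics))
        loss += rate - w[n] * math.log(rate)  # requires rate > 0

    # Group penalty: L2 norm of each topic column of the code matrix.
    group = sum(
        math.sqrt(sum(s[n][k] ** 2 for n in range(num_words)))
        for k in range(num_topics)
    )

    # Lasso penalty: L1 norm of every word's code vector.
    l1 = sum(abs(s[n][k]) for n in range(num_words) for k in range(num_topics))

    return loss + lam * group + lam * l1
```

The group term can drive a whole topic column to zero for a document, which is how sparsity at the topic level follows from codes defined at the word level.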

Group Sparse Topical Coding<br />
Code to Topics (words → codes → topics, via Group Lasso)<br />
$$\theta_k = E\left[\frac{\sum_{n=1}^{|I|} w_{nk}}{\sum_{n=1}^{|I|}\sum_{k=1}^{K} w_{nk}} \,\middle|\, w\right] = \frac{\sum_{n=1}^{|I|} E[w_{nk}\mid w]}{\sum_{n=1}^{|I|} w_{n}} = \frac{\sum_{n=1}^{|I|} s_{nk}\beta_{kn}}{\sum_{n=1}^{|I|}\sum_{k=1}^{K} s_{nk}\beta_{kn}}$$<br />
Moran’s Property; Sums of Poisson
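
A minimal sketch of the code-to-topic step, reading the final expression on this slide as θ_k = Σ_n s_nk β_kn / Σ_n Σ_k s_nk β_kn. The function name `topic_proportions` and the array layout (`s[n][k]`, `beta[k][n]`) are assumptions for illustration only.

```python
def topic_proportions(s, beta):
    """Recover document-level topic proportions from word-level codes.

    s[n][k]    : code of word n on topic k
    beta[k][n] : weight of word n under topic k (assumed layout)
    Returns a list theta with one proportion per topic, summing to 1.
    """
    num_words = len(s)
    num_topics = len(beta)
    # Mass contributed to each topic, summed over the words of the document.
    mass = [
        sum(s[n][k] * beta[k][n] for n in range(num_words))
        for k in range(num_topics)
    ]
    total = sum(mass)
    return [m / total for m in mass]
```

A topic whose codes are all zero for a document gets proportion zero, so sparse codes translate directly into a sparse topic proportion.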

Results<br />
• Dataset: 20-newsgroup<br />
– 18,846 documents<br />
– 26,214 distinct words<br />
– 20 related categories<br />
• Baseline methods<br />
– Probabilistic model: LDA<br />
– Non-probabilistic model: NMF<br />
– Sparse topic model: STC<br />
[Figure panels: Time Efficiency, Topic Sparsity, Classification Accuracy]

Understanding the Topic (Feature Sparseness):<br />
Biterm Topic Model for Short Text (WWW’13)

Short texts are prevalent<br />
Uncovering the topics of short texts is crucial for a wide range of content analysis tasks<br />
Data | Source | Average word count (stop words removed)<br />
Weibo | Sina Weibo | ~9<br />
Questions | Baidu Zhidao | ~6<br />
Web page titles | Logs | ~5<br />
Query | Query log | ~3

The limitation of conventional topic models<br />
Bag-of-words Assumption<br />
• Word occurrences play a less discriminative role<br />
– Not enough word counts to know how words are related<br />
• The limited contexts in short texts<br />
– More difficult to identify the senses of ambiguous words in short documents

Key idea of our approach<br />
• Topics are basically groups of correlated words, and the correlation is revealed by word co-occurrence patterns in documents<br />
– Why not directly model the word co-occurrences for topic learning?<br />
• Topic models on short texts suffer from severely sparse patterns in short documents<br />
– Why not use the rich global word co-occurrence patterns to better reveal topics?

Biterm Topic Model (BTM)<br />
• Biterm: a pair of words co-occurring in a short text<br />
– "visit apple store" -> "visit apple", "visit store", "apple store"<br />
• Model the generation of biterms with latent topic structure<br />
– a topic ~ a probability distribution over words<br />
– a corpus ~ a mixture of topics<br />
– a biterm ~ two i.i.d. samples drawn from one topic
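
The biterm extraction step described above can be sketched in a few lines of Python. The function names `extract_biterms` and `corpus_biterm_counts` are hypothetical, and treating the whole short text as a single co-occurrence window is an assumption; this covers only the extraction step, not the topic inference.

```python
from collections import Counter
from itertools import combinations

def extract_biterms(tokens):
    """Return every word pair of a short text, in order of appearance."""
    return list(combinations(tokens, 2))

def corpus_biterm_counts(documents):
    """Pool biterms from all short texts into one corpus-level count table."""
    counts = Counter()
    for tokens in documents:
        counts.update(extract_biterms(tokens))
    return counts

biterms = extract_biterms(["visit", "apple", "store"])
# biterms == [("visit", "apple"), ("visit", "store"), ("apple", "store")]
```

Pooling the pairs over the whole corpus is what lets the model draw on global co-occurrence patterns rather than on the few words of one document.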

Comparison between different models<br />
LDA<br />
• Document-level topic distribution<br />
– Suffers from the sparsity of the document<br />
• Models the generation of each word<br />
– Ignores context<br />
Mixture of Unigrams<br />
• Corpus-level topic distribution<br />
– Alleviates document sparsity<br />
• Single-topic assumption in each document<br />
– Too strong an assumption<br />
BTM<br />
• Corpus-level topic distribution<br />
– Alleviates document sparsity<br />
• Models the generation of word pairs<br />
– Leverages context

Evaluation on Tweets<br />
• Dataset: Tweets2011<br />
– Sample 50 hashtags with clear topics<br />
– Extract tweets with these hashtags<br />
• Evaluation metric: H score<br />
– IntraDis: average distance between docs under the same hashtag<br />
– InterDis: average distance between docs under different hashtags<br />
– The smaller the H score, the better the topic representation
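
The slide does not spell out how IntraDis and InterDis are combined. The sketch below assumes H = IntraDis / InterDis, which is consistent with "smaller is better", and uses Euclidean distance between document topic vectors as a stand-in distance. The names `h_score` and `euclidean` are hypothetical, and at least one hashtag must contain two or more documents.

```python
import math
from itertools import combinations

def euclidean(p, q):
    """Stand-in distance between two document topic vectors (assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def h_score(clusters, dist=euclidean):
    """clusters maps each hashtag to a list of document vectors.

    Assumes H = IntraDis / InterDis.
    """
    # IntraDis: average pairwise distance inside each hashtag, averaged over hashtags.
    intra_terms = []
    for docs in clusters.values():
        pairs = list(combinations(docs, 2))
        if pairs:
            intra_terms.append(sum(dist(a, b) for a, b in pairs) / len(pairs))
    intra = sum(intra_terms) / len(intra_terms)

    # InterDis: average cross-hashtag distance, averaged over hashtag pairs.
    inter_terms = []
    for docs_a, docs_b in combinations(list(clusters.values()), 2):
        cross = [dist(a, b) for a in docs_a for b in docs_b]
        inter_terms.append(sum(cross) / len(cross))
    inter = sum(inter_terms) / len(inter_terms)

    return intra / inter
```

A representation that places tweets with the same hashtag close together and tweets with different hashtags far apart yields a small ratio.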

Evaluation on Baidu Zhidao<br />
• Dataset: Baidu Zhidao Q&A<br />
– Question classification according to their tags

Computation Complexity<br />
Bring Order to Big Data<br />
• Ranking is a central problem in many applications<br />

Ranking <br />

Recommendation <br />

Web Search <br />

Information<br />


Ranking meets Big Data<br />
High computation cost, especially because ranking is a more complex task<br />
We can save computation cost if we find the core data of the ranking problem

Ranking (Algorithm):<br />
Top-k Learning to Rank: Labeling, Ranking and Evaluation (SIGIR’12)

Top-k Learning to Rank<br />
Ranking with Big Data: n is usually 10^5 to 10^6<br />
[Diagram labels: Multi-level ratings (1 2 3 4); Full-order ground truth (1 2 3 4 5 … n); Top-k ground truth (1 2 3 4 … k); Global learning; Local learning; Global prediction; WSDM]<br />
Users mainly care about the top-k ranking; k is usually 10-20

Local = Global<br />
[Figure: NDCG@5 and NDCG@10 compared with NDCG@5-full and NDCG@10-full for RankNet and ListMLE on MQ2007list and MQ2008list; x-axis ticks from 0 to 100]<br />
Theoretical analysis was submitted to NIPS 2013
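
For readers unfamiliar with the NDCG@k measure reported on this slide, here is a small sketch. It uses one common convention (exponential gain 2^rel - 1 and a log2 position discount); the slides do not say which variant was used, so that choice, along with the function names, is an assumption.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions of a ranked list."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)
        for position, rel in enumerate(relevances[:k])
    )

def ndcg_at_k(relevances, k):
    """DCG@k normalized by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Only the first k positions of the predicted ranking contribute to the score, which matches the observation that users mainly care about the top of the list.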

Top-k Ranking Framework<br />
• Top-k Labeling: an efficient labeling strategy to get the top-k ground truth<br />
• Top-k Ranking: more powerful ranking algorithms in the new scenario<br />
• Top-k Evaluation: new evaluation measures for the new scenario

Computation Complexity<br />
Predict the Dynamic Structure<br />
• Percolation provides a theoretical underpinning for modeling the structural dynamics of networks (nodes = individuals, edges = relations)<br />
• Examples: rumor diffusion, epidemic spreading, online information diffusion, currency circulation

Problems<br />
• Predict the percolation transition point (Predictability)<br />
– Purpose:<br />
• Epidemiology and rumor spreading: know when to vaccinate in order to avoid a large-scale epidemic or rumor outbreak.<br />
• Currency circulation: help governments make policies that avoid inflation.<br />
• Online information diffusion: understand the rules of online information diffusion.<br />
• Predict how fast the outbreak is at the percolation transition point, or whether the transition is controllable (Controllability)<br />
– Purpose:<br />
• Epidemiology: estimate how many people need to be vaccinated.<br />
• Online information diffusion: estimate the strength of a piece of information.

An equivalent description of predictability & controllability<br />
• Erdős-Rényi model: continuous transition → predictable & controllable<br />
• BFW model: discontinuous transition (huge gap) → unpredictable & uncontrollable
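
To make the continuous transition of the Erdős-Rényi model concrete, the sketch below adds uniformly random edges to n isolated nodes and tracks the fraction of nodes in the largest component with a union-find structure. The function name `er_largest_component_curve` is hypothetical; self-loops and repeated edges are allowed for simplicity, and the BFW model is not implemented here.

```python
import random

def er_largest_component_curve(n, num_edges, seed=0):
    """Fraction of nodes in the largest component after each random edge."""
    rng = random.Random(seed)
    parent = list(range(n))
    size = [1] * n

    def find(x):
        # Path halving keeps the union-find trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    largest = 1
    curve = []
    for _ in range(num_edges):
        a, b = rng.randrange(n), rng.randrange(n)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Union by size: attach the smaller tree under the larger one.
            if size[root_a] < size[root_b]:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a
            size[root_a] += size[root_b]
            largest = max(largest, size[root_a])
        curve.append(largest / n)
    return curve

curve = er_largest_component_curve(2000, 2000, seed=1)
```

In this process the largest component stays tiny while the number of edges is well below n/2 and then grows smoothly past that point, which is the sense in which the transition is continuous.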


Challenges in the era of Big Data<br />
• Large number of multi-type individuals (heterogeneous nodes)<br />
• Large number of multiplex relations (heterogeneous edges)<br />
• Complex and complicated mechanisms for the structure dynamics<br />
• Lack of early signs or critical features for the percolation transition<br />
• Information spreads very quickly in a very short time interval<br />
→ Hard to predict when the percolation transition occurs (Predictability)<br />
→ Hard to predict how fast the information outbreak is at the percolation transition<br />


Phys. Rev. E, 87, 052130, 2013<br />
We find some typical features of discontinuous percolation transitions that help us quickly recognize when information diffusion is unpredictable and uncontrollable<br />
Lack of finite-size scaling<br />
EPL, 100 (6), 66006<br />
Multiple giant components may appear<br />
A second percolation transition may occur<br />
Phys. Rev. Lett. 106, 115701, 2011.

Can be useful on Social Networks<br />
• Model information diffusion in online social networks.<br />
• Predict the explosive phase transition of online information diffusion.<br />
• Investigate ways to make an unpredictable and uncontrollable system more predictable and controllable.

ADA System:<br />
GoLaxy Advanced Data Analytics System<br />
• Goal: Discover, query, and infer the relationships between objects<br />
– Multi-type objects: virtual people, real people, event, organization, etc.<br />
– Relation discovery:<br />
• People to People: social, interaction, co-occurrence, action<br />
• People to Event: initiate, participate, involve<br />
• Event to Event: causality, sequence, contain<br />
• Other: People to Org, Event to Org, Org to Org<br />
– Relation query:<br />
• Retrieve the relations between objects<br />
– Relation inference:<br />
• Inference/prediction<br />
• Reasoning<br />
• Virtual identity recognition

Case 1: Query the Real People on the Web

Case 2: Event Analysis and Tracking

Summary<br />
• Three Scientific Challenges for Big Data<br />
– Data complexity<br />
– Computation complexity<br />
– System complexity<br />
• Our research on big data<br />
– Finding the semantic representations<br />
– Bring order to big data<br />
– Predict the structure dynamics<br />
• A practical big data system<br />
– ADA: Discover, query, and infer the relationships between objects<br />


Some of the Researchers in our Lab<br />
Jiafeng Guo, Huawei Shen, Yanyan Lan, Wei Chen, etc.<br />
Thanks for your attention!<br />


