PPT - 数据工程与知识工程教育部重点实验室 (Key Laboratory of Data Engineering and Knowledge Engineering, MOE)

Big Data: Science or Gimmick?

程学旗 (Xueqi Cheng)
中科院网络数据科学与技术重点实验室 (CAS Key Lab of Network Data Science and Technology, NDST)
中国科学院计算技术研究所 (Institute of Computing Technology, Chinese Academy of Sciences)


Is Big Data Big Value? — Sensing the World

Monitoring development indices (environment, health, economy, …)

Crime detection on the Web:
• Social media data and log data up to PB scale
• Structured and unstructured data
• Historical data and streaming data
• PB-scale log data
• EB-scale monitoring data
• Unpredictable criminal activity

Challenges: large volume, variety, complex connections, and heavy noise in the data make measurement, fusion, and pattern mining difficult.


Is Big Data Big Value? — Predicting Tomorrow

• Election prediction based on Twitter
• Hedge funds based on sentiment analysis of Twitter
• UN Global Pulse: predicting unemployment rates and disease

Challenges: strong interaction, real-time and dynamic properties fragment the data life cycle, make it hard to balance latency and accuracy, and make trends difficult to predict.


Is Big Data Big Value? — Meltdown Modelling

Buchanan, M. (2009), Meltdown modelling, Nature 460, 680-682.

• One morning in 2016, an orange warning light on an electronic display suddenly starts flashing: US government experts have detected a warning signal concerning national security. The display is connected to some of the world's largest financial institutions, including banks, governments, hedge funds, and online banking syndicates. The flashing orange light indicates that US hedge funds have accumulated positions in the same financial assets; if one fund suddenly liquidates, the warning signal fires, and the falling prices force other funds to sell in turn, accelerating the slide in asset prices. Many funds could go bankrupt within just 30 minutes, posing an enormous threat to the entire financial system.

• Based on big data, can we prevent another financial crisis?


Big Challenges Exist in Big Data

Big Data sits at the center of three complexities:

• System complexity
  – How to design the big data system?
  – How to optimize it?
  → A processing architecture for the data life cycle

• Computation complexity
  – Is it computable?
  – How to compute over global data?
  → A new computing paradigm for core data

• Data complexity
  – How to represent the data?
  – How to measure the data complexity?
  → Structure and regularity


Data Garbage or Data Goldmine?

Can we address these big scientific challenges in big data?
• No → data garbage
• Yes → data goldmine


Topics in this talk

• Data complexity — how to represent the data?
  ✓ Finding the semantic representations
• Computation complexity — how to compute with Web-scale data?
  ✓ Bringing order to big data
  ✓ Predicting the dynamic structure
• System complexity
  ✓ ADA: a Web data engine system


Data Complexity: Finding the semantic representations of large-scale Web documents

Topic Modeling

Topic modeling provides methods for automatically organizing, understanding, searching, and summarizing large electronic archives:
1. Discover the hidden themes that pervade the collection.
2. Annotate the documents according to those themes.
3. Use the annotations to organize, summarize, and search the texts.
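As a concrete (and generic) illustration of the three steps, here is a minimal sketch using scikit-learn's LatentDirichletAllocation — a standard LDA, not the models proposed later in this talk; the corpus and topic count are toy assumptions.

```python
# Minimal illustration of the three-step topic-modeling pipeline above,
# using scikit-learn's LatentDirichletAllocation (a generic LDA, not the
# models proposed in this talk). Corpus and topic count are toy choices.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "stock market fund price trading",
    "fund price market investor bank",
    "virus epidemic vaccine disease spread",
    "epidemic disease spread infection vaccine",
]

# Bag-of-words counts
X = CountVectorizer().fit_transform(docs)

# Step 1: discover the hidden themes
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)      # (n_docs, n_topics), rows sum to 1

# Step 2: annotate each document with its dominant theme
labels = doc_topics.argmax(axis=1)

# Step 3: use the annotations to organize the collection
groups = {k: [i for i, l in enumerate(labels) if l == k] for k in range(2)}
```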


Topic Modeling meets Big Data

When topic modeling meets "Big Data", three kinds of sparseness arise across the topic space and the document space:
• Semantic sparseness
• Topic sparseness
• Feature sparseness


Understanding the Topic (Topic Sparseness):
Group Sparse Topical Coding: From Code to Topic (WSDM'13)


Sparse and Meaningful Topics

• Lots of data and lots of potential topics globally, but relatively sparse topics within each individual document locally.

• Conventional idea: control the sparseness of the topics at the document level.

Probabilistic models (Blei et al., '03):
• Meaningful interpretation
• Hard to control sparsity

Non-probabilistic models (Lin, '08):
• Effective sparsity control
• Lack clear semantic meanings

Can we find a better way to enjoy both merits? Yes!


Basic Idea

• A document is composed of words, so the topics of a document are composed of the topics of its words.
• If we directly control the sparseness of the word-level topics with proper methods:
  – We can in turn control the sparseness of the document's topics (controllable sparsity: the document's topic sparsity is controlled through the word-level topic distributions).
  – We can in turn recover the topic proportions of the document (semantic interpretability: the document's topic distribution can be recovered).


Group Sparse Topical Coding

Words → codes, with a group lasso penalty (words to codes).

Objective function:

$$\operatorname*{argmin}_{s,\beta}\; \sum_{d\in D}\sum_{n\in d} \mathcal{L}\!\big(w_{d,n};\, s_{d,n\cdot}^{\top}\beta_{n\cdot}\big) \;+\; \lambda\sum_{k=1}^{K}\big\lVert s_{d,\cdot k}\big\rVert_{2} \;+\; \lambda\sum_{n\in d}\big\lVert s_{d,n\cdot}\big\rVert_{1}$$

with $s_{d,n\cdot}^{\top}\beta_{n\cdot}=\sum_{k=1}^{K} s_{d,nk}\,\beta_{nk}$ and the Poisson loss

$$\mathcal{L}(w;\,s) = s - w\ln(s) + C.$$

Lasso → adds sparsity control on individual variables.
Group Lasso → adds sparsity control on groups of variables.
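The objective above can be sketched numerically as follows; this is only an evaluation of the loss plus penalties for fixed s and β (toy shapes and toy regularization weights), not the alternating optimization used in the WSDM'13 paper.

```python
# Evaluating the group-sparse objective above for fixed s and beta
# (illustration only; the WSDM'13 optimization itself is not reproduced).
# Shapes, values, and the two lambda weights are toy assumptions.
import numpy as np

def gstc_objective(w, s, beta, lam_group=0.1, lam_l1=0.01):
    """w: (N,) word counts; s, beta: (N, K) codes and per-word topic weights."""
    rate = np.sum(s * beta, axis=1)          # s_{n.}^T beta_{n.}, one rate per word
    loss = np.sum(rate - w * np.log(rate))   # Poisson loss, constant C dropped
    group = lam_group * np.sum(np.linalg.norm(s, axis=0))  # L2 norm per topic column
    l1 = lam_l1 * np.sum(np.abs(s))          # plain lasso on individual codes
    return loss + group + l1

rng = np.random.default_rng(0)
w = np.array([2.0, 1.0, 3.0])
s = rng.uniform(0.1, 1.0, size=(3, 4))
beta = rng.uniform(0.1, 1.0, size=(3, 4))
val = gstc_objective(w, s, beta)
```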


Group Sparse Topical Coding

Words → codes → topics, with group lasso applied at both levels (codes to topics).

Recovering the document's topic proportions uses Moran's property of sums of Poisson variables:

$$\theta_k \;=\; E\!\left[\left.\frac{\sum_{n=1}^{I} w_{nk}}{\sum_{n=1}^{I}\sum_{k'=1}^{K} w_{nk'}}\,\right|\, w\right] \;=\; \frac{\sum_{n=1}^{I} E[w_{nk}\mid w_n]}{\sum_{n=1}^{I} w_n} \;=\; \frac{\sum_{n=1}^{I} s_{nk}\,\beta_{kn}}{\sum_{n=1}^{I}\sum_{k'=1}^{K} s_{nk'}\,\beta_{k'n}}$$
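The recovery formula above amounts to normalizing the per-topic mass Σ_n s_nk β_kn; a minimal sketch with toy values:

```python
# Topic-proportion recovery as in the formula above: normalize the
# per-topic mass sum_n s_nk * beta_kn for one document. Toy values.
import numpy as np

def recover_theta(s, beta):
    """s, beta: (I, K) arrays for the I words of one document."""
    mass = (s * beta).sum(axis=0)   # sum over words n, one entry per topic k
    return mass / mass.sum()        # normalize over topics

s = np.array([[0.5, 0.0],
              [0.2, 0.3]])
beta = np.ones((2, 2))              # uniform dictionary, for illustration
theta = recover_theta(s, beta)      # -> [0.7, 0.3]
```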


Results

• Dataset: 20-newsgroups
  – 18,846 documents
  – 26,214 distinct words
  – 20 related categories
• Baseline methods
  – Probabilistic model: LDA
  – Non-probabilistic model: NMF
  – Sparse topic model: STC
• Compared on: time efficiency, topic sparsity, classification accuracy


Understanding the Topic (Feature Sparseness):
Biterm Topic Model for Short Text (WWW'13)


Short texts are prevalent

Uncovering the topics of short texts is crucial for a wide range of content analysis tasks.

  Data              Source         Average word count (stop words removed)
  Weibo             Sina Weibo     ~9
  Questions         Baidu Zhidao   ~6
  Web page titles   Logs           ~5
  Queries           Query log      ~3


The limitation of conventional topic models

Bag-of-words assumption:
• Word occurrences play a less discriminative role — there are not enough word counts to tell how words are related.
• Contexts in short texts are limited — it is harder to identify the senses of ambiguous words in short documents.


Key idea of our approach

• Topics are basically groups of correlated words, and the correlation is revealed by word co-occurrence patterns in documents.
  – Why not directly model the word co-occurrences for topic learning?
• Topic models on short texts suffer from severely sparse patterns in short documents.
  – Why not use the rich global word co-occurrence patterns to better reveal topics?


Biterm Topic Model (BTM)

• Biterm: a pair of words co-occurring in a short text
  – "visit apple store" → "visit apple", "visit store", "apple store"
• Model the generation of biterms with a latent topic structure:
  – a topic ~ a probability distribution over words
  – a corpus ~ a mixture of topics
  – a biterm ~ two i.i.d. samples drawn from one topic
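Biterm extraction as described above can be sketched in a few lines; this is only the preprocessing step, not BTM's inference:

```python
# Biterm extraction: every unordered pair of words in a short text
# becomes a biterm (preprocessing only, not BTM's inference).
from itertools import combinations

def biterms(text):
    return list(combinations(text.split(), 2))

bts = biterms("visit apple store")
# -> [('visit', 'apple'), ('visit', 'store'), ('apple', 'store')]
```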


Comparison between different models

LDA:
• Document-level topic distribution — suffers from doc sparsity
• Models the generation of each word — ignores context

Mixture of Unigrams:
• Corpus-level topic distribution — alleviates doc sparsity
• Single-topic assumption per document — too strong an assumption

BTM:
• Corpus-level topic distribution — alleviates doc sparsity
• Models the generation of word pairs — leverages context


Evaluation on Tweets

• Dataset: Tweets2011
  – Sample 50 hashtags with clear topics
  – Extract the tweets carrying these hashtags
• Evaluation metric: H score
  – IntraDis: average distance between docs under the same hashtag
  – InterDis: average distance between docs under different hashtags
  – The smaller the H score, the better the topic representation
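A sketch of the H score, under the assumption that H = IntraDis / InterDis with Euclidean distances over document vectors grouped by hashtag (the paper's exact distance measure may differ):

```python
# H score sketch, assuming H = IntraDis / InterDis with Euclidean
# distances over document vectors grouped by hashtag (the paper's exact
# distance may differ). The 2-D vectors are toy data.
import numpy as np
from itertools import combinations

def h_score(groups):
    """groups: list of (n_i, d) arrays, one array of doc vectors per hashtag."""
    intra = [np.linalg.norm(a - b)
             for g in groups for a, b in combinations(g, 2)]
    inter = [np.linalg.norm(a - b)
             for g1, g2 in combinations(groups, 2) for a in g1 for b in g2]
    return np.mean(intra) / np.mean(inter)

# Two tight, well-separated hashtag clusters -> small H (good representation)
clusters = [np.array([[0.0, 0.0], [0.1, 0.0]]),
            np.array([[5.0, 0.0], [5.1, 0.0]])]
val = h_score(clusters)
```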


Evaluation on Baidu Zhidao

• Dataset: Baidu Zhidao Q&A
  – Question classification according to their tags


Computation Complexity: Bringing Order to Big Data

• Ranking is a central problem in many applications:
  – Web search, recommendation, information filtering


Ranking meets Big Data

• High computation cost, especially because ranking is a complex task.
• We can save computation cost if we find the core data of the ranking problem.


Ranking (Algorithm):
Top-k Learning to Rank: Labeling, Ranking and Evaluation (SIGIR'12)


Top-k Learning to Rank

Ranking with big data (WSDM): n is usually 10^5-10^6 candidate documents per query, while users mainly care about the top-k ranking, where k is usually 10-20.

• Global learning: multi-level ratings (1 2 3 4) yield a full-order ground truth over all n documents (1 2 3 4 5 … n) and a global prediction.
• Local learning: a top-k ground truth over only the first k positions (1 2 3 4 … k).
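Since the results that follow are reported in NDCG@5/NDCG@10, here is a sketch of the standard NDCG@k definition (with 2^rel − 1 gains and log2 discounts; the SIGIR'12 paper's exact variant may differ):

```python
# Standard NDCG@k (gain 2^rel - 1, log2 discount); the SIGIR'12 paper's
# exact top-k variant may differ. Ratings are toy multi-level labels.
import math

def dcg_at_k(ratings, k):
    return sum((2 ** r - 1) / math.log2(i + 2)
               for i, r in enumerate(ratings[:k]))

def ndcg_at_k(ratings, k):
    """ratings: relevance labels in the order the model ranked the docs."""
    ideal = dcg_at_k(sorted(ratings, reverse=True), k)
    return dcg_at_k(ratings, k) / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k([3, 2, 1, 0], 5)    # ideal order -> 1.0
reversed_ = ndcg_at_k([0, 1, 2, 3], 5)  # worst order -> below 1.0
```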


Local = Global

[Figure: NDCG@5 and NDCG@10 curves (top-k vs. full-order training) for RankNet and ListMLE on MQ2007list and MQ2008list, with k ranging from 0 to 100; the top-k (local) curves approach the full-order (global) curves as k grows.]

Theoretical analysis was submitted to NIPS 2013.


Top-k Ranking Framework

• Top-k labeling: an efficient labeling strategy to get the top-k ground truth
• Top-k ranking: more powerful ranking algorithms in the new scenario
• Top-k evaluation: new evaluation measures for the new scenario


Computation Complexity: Predicting the Dynamic Structure

• Percolation is a theoretical underpinning for modeling the structural dynamics of networks (nodes = individuals, edges = relations).
• Examples: rumor diffusion, epidemic spreading, online information diffusion, currency circulation.
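Percolation dynamics like those above can be simulated directly; here is a sketch of bond percolation on an Erdos-Renyi random graph (the classic continuous-transition case), using union-find for speed. Graph size and edge counts are toy choices.

```python
# Bond percolation on an Erdos-Renyi random graph, the continuous-
# transition baseline: add random edges one by one and track the giant
# component with union-find. Graph size and edge counts are toy choices.
import random

def er_giant_fraction(n, n_edges, seed=0):
    rng = random.Random(seed)
    parent = list(range(n))
    size = [1] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    largest = 1
    for _ in range(n_edges):
        a, b = find(rng.randrange(n)), find(rng.randrange(n))
        if a != b:
            if size[a] < size[b]:           # union by size
                a, b = b, a
            parent[b] = a
            size[a] += size[b]
            largest = max(largest, size[a])
    return largest / n

# Mean degree 2*m/n: subcritical below 1, supercritical above it.
below = er_giant_fraction(10_000, 2_500)    # c = 0.5: giant fraction near 0
above = er_giant_fraction(10_000, 10_000)   # c = 2.0: macroscopic giant component
```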


Problems

• Predict the percolation transition point (predictability, 可预测)
  – Purpose:
    • Epidemiology and rumor spreading: know when to vaccinate to avoid large-scale epidemics or rumor spreading.
    • Currency circulation: help governments make policies to avoid inflation.
    • Online information diffusion: master the rules of online information diffusion.
• Predict how fast the outbreak is at the percolation transition point, and whether the transition is controllable (controllability, 可调控)
  – Purpose:
    • Epidemiology: estimate how many people need to be vaccinated.
    • Online information diffusion: estimate the strength of a piece of information.


An equivalent description of predictability & controllability

• Continuous transition (e.g., the Erdos-Renyi model) → predictable & controllable
• Discontinuous transition, with a huge gap at the transition point (e.g., the BFW model) → unpredictable & uncontrollable


Challenges in the era of Big Data

• Large numbers of multi-type individuals (heterogeneous nodes) and multiplex relations (heterogeneous edges)
• Complex and complicated mechanisms behind the structural dynamics
• Lack of early signs or critical features for the percolation transition; information spreads very quickly within a very short time interval
• Hence it is hard to predict when the percolation transition occurs (predictability), and how fast the information outbreak is at the transition (controllability)


We found typical features of discontinuous percolation transitions that help us quickly recognize that a piece of information is unpredictable & uncontrollable:

• Lack of finite-size scaling (Phys. Rev. E, 87, 052130, 2013)
• Multiple giant components may appear (EPL, 100 (6), 66006)
• A second percolation transition may occur (Phys. Rev. Lett. 106, 115701, 2011)


Can be useful on social networks

• Model information diffusion in online social networks.
• Predict the explosive phase transition of online information diffusion.
• Investigate ways to make an unpredictable and uncontrollable system more predictable and controllable.


ADA System:
GoLaxy Advanced Data Analytics System

• Goal: discover, query, and infer the relationships between objects
  – Multi-type objects: virtual people, real people, events, organizations, etc.
  – Relation discovery:
    • People to people: social, interaction, co-occurrence, action
    • People to event: initiate, participate, involve
    • Event to event: causality, sequence, containment
    • Others: people to org, event to org, org to org
  – Relation query:
    • Retrieve the relations between objects
  – Relation inference:
    • Inference/prediction
    • Reasoning
    • Virtual identity recognition


Case 1: Query the Real People on the Web


Case 2: Event Analysis and Tracking


Summary

• Three scientific challenges for big data
  – Data complexity
  – Computation complexity
  – System complexity
• Our research on big data
  – Finding the semantic representations
  – Bringing order to big data
  – Predicting the structure dynamics
• A practical big data system
  – ADA: discover, query, and infer the relationships between objects


Some of the Researchers in our Lab

郭嘉丰、沈华伟、兰艳艳、陈巍, etc.

Thanks for your attention!
cxq@ict.ac.cn
