PPT - Key Laboratory of Data Engineering and Knowledge Engineering, Ministry of Education
Big Data: Science or Gimmick?
Xueqi Cheng (程学旗)
CAS Key Lab of Network Data Science and Technology (NDST)
Institute of Computing Technology, Chinese Academy of Sciences
Is Big Data Big Value? Sensing the World
Monitoring development indices (environment, health, economy, ...)
Crime detection on the Web
• Social media data and log data up to PB scale
• Structured and unstructured data
• Historical data and streaming data
• PB-scale log data
• EB-scale monitoring data
• Unpredictable criminal activity
Challenges: large volume, variety, complex connections, and heavy noise in the data make measurement, fusion, and pattern mining difficult.
Is Big Data Big Value? Predicting Tomorrow
• Election prediction based on Twitter
• Hedge funds trading on Twitter sentiment analysis
• UN Global Pulse: predicting unemployment rates and disease
Challenges: strong interaction, real-time and dynamic properties fragment the data life cycle, making it hard to balance latency against accuracy and difficult to predict trends.
Is Big Data Big Value? Meltdown Modelling
Buchanan, M. (2009), Meltdown modelling, Nature 460, 680-682.
• One morning in 2016, an orange warning light suddenly flashes insistently on an electronic display, and US government experts detect a warning signal concerning national security. The display is wired to some of the world's largest financial institutions: banks, governments, hedge funds, online banking syndicates, and so on. The flashing orange light indicates that US hedge funds have crowded into the same financial assets; at that point, if one fund suddenly liquidates, the warning signal fires, and this price-depressing behavior forces other funds to sell in its wake, accelerating the fall in asset prices. Many funds could go bankrupt within a mere 30 minutes, posing an enormous threat to the entire financial system.
• Based on big data, can we prevent another financial crisis?
Big Challenges Exist in Big Data
Data complexity
• How to represent the data?
• How to measure the data complexity?
• Structure regularity
Computation complexity
• Is it computable?
• How to compute over global data?
System complexity
• How to design the big data system?
• How to optimize it?
[Diagram: "Big Data" at the center, linked to a processing architecture for the data life cycle and a new computing paradigm for core data.]
Data Garbage or Data Goldmine?
Can we address these big scientific challenges in big data?
No -> data garbage. Yes -> data goldmine.
Topics in this talk
Data complexity: how to represent the data?
✓ Finding the semantic representations
Computation complexity: how to compute with Web-scale data?
✓ Bring order to big data
✓ Predict the dynamic structure
System complexity
✓ Web data engine system: the ADA system
Data complexity: finding the semantic representations of large-scale Web documents
Topic Modeling
Topic modeling provides methods for automatically organizing, understanding, searching, and summarizing large electronic archives:
1. Discover the hidden themes that pervade the collection.
2. Annotate the documents according to those themes.
3. Use annotations to organize, summarize, and search the texts.
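The three steps above can be sketched with an off-the-shelf topic model; the talk does not prescribe a library, so this minimal example uses scikit-learn's LatentDirichletAllocation on a toy corpus (the documents and topic count are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "stock market trading prices fell",
    "genome dna sequencing biology",
    "market prices investors stock",
    "dna genes biology sequencing",
]

# Step 1: discover hidden themes by fitting a 2-topic LDA on word counts.
vec = CountVectorizer()
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# Step 2: annotate each document with its topic mixture (rows sum to 1).
doc_topics = lda.fit_transform(X)

# Step 3: the annotations can now drive organization, summarization, search.
for doc, mix in zip(docs, doc_topics):
    print(doc, "->", mix.round(2))
```

Each row of `doc_topics` is the document's distribution over the discovered themes, which is exactly the annotation used downstream.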
Topic Modeling meets Big Data
When topic modeling meets big data, sparseness arises at three levels: semantic sparseness, topic sparseness (in the topic space), and feature sparseness (in the document space).
Understanding the Topic (Topic Sparseness):
Group Sparse Topical Coding: From Code to Topic (WSDM'13)
Sparse and Meaningful Topics
• Lots of data means lots of potential topics globally, but each individual document covers relatively few topics locally.
• Conventional idea: control the sparseness of the topics at the document level.
Probabilistic models (Blei et al., '03): meaningful interpretation, but sparsity is hard to control.
Non-probabilistic models (Lin, '08): effective sparsity control, but lack clear semantic meaning.
Can we find a way to enjoy both merits? Yes!
Basic Idea
• A document is composed of words, so the topics of a document are composed of the topics of its words.
• If we directly control the sparseness of word-level topics with proper methods:
– we can in turn control the sparseness of the document's topics (controllable sparsity: document-level topic sparsity is controlled through the word-level topic distributions);
– we can in turn recover the topic proportions of the document (interpretable semantics: the document's topic distribution can be recovered).
Group Sparse Topical Coding
Words to codes (group lasso on the codes).
Objective function:

$$\operatorname*{argmin}_{s,\beta}\ \sum_{d\in D}\sum_{n\in d} L\!\left(w_{d,n};\ s_{d,n\cdot}^{T}\beta_{n\cdot}\right) \;+\; \lambda\sum_{k=1}^{K}\left\|s_{d,\cdot k}\right\|_{2} \;+\; \lambda\sum_{n\in d}\left\|s_{d,n\cdot}\right\|_{1}$$

Lasso adds sparsity control on individual variables; group lasso adds sparsity control on groups of variables.

$$L\!\left(w_{d,n};\ s_{d,n\cdot}^{T}\beta_{n\cdot}\right) \;=\; \sum_{k=1}^{K}s_{d,n,k}\,\beta_{n,k} \;-\; w_{d,n}\ln\!\Big(\sum_{k=1}^{K}s_{d,n,k}\,\beta_{n,k}\Big) \;+\; C$$
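A numerical sanity check of this objective fits in a few lines of NumPy; the shapes, penalty weight, and the positivity floor on the codes are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 5, 3                                # words in one document, topics
w = rng.integers(1, 4, size=N)             # observed word counts w_{d,n}
s = np.abs(rng.normal(size=(N, K))) + 0.1  # nonnegative word codes s_{d,n,k}
beta = rng.dirichlet(np.ones(K), size=N)   # per-word topic weights beta_{n,k}

def objective(w, s, beta, lam=0.1):
    """Poisson log-loss plus group-lasso (per topic) and lasso penalties."""
    rate = (s * beta).sum(axis=1)                  # s_{d,n.}^T beta_{n.}
    loss = (rate - w * np.log(rate)).sum()         # Poisson loss, constant C dropped
    group = lam * np.linalg.norm(s, axis=0).sum()  # sum_k ||s_{d,.k}||_2
    l1 = lam * np.abs(s).sum()                     # sum_n ||s_{d,n.}||_1
    return loss + group + l1

val = objective(w, s, beta)
```

Zeroing out an entire column of `s` removes that topic's group penalty at once, which is how the group term drives topic-level sparsity.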
Group Sparse Topical Coding
Codes to topics (group lasso linking words, codes, and topics). Using Moran's property of sums of Poisson variables, the topic proportions are recovered from the codes:

$$\theta_{k} \;=\; E\!\left[\left.\frac{\sum_{n=1}^{I} w_{nk}}{\sum_{n=1}^{I}\sum_{k=1}^{K} w_{nk}}\ \right|\ w\right] \;=\; \frac{\sum_{n=1}^{I} E\!\left[w_{nk}\mid w_{n}\right]}{\sum_{n=1}^{I} w_{n}} \;=\; \frac{\sum_{n=1}^{I} s_{nk}\,\beta_{kn}}{\sum_{n=1}^{I}\sum_{k=1}^{K} s_{nk}\,\beta_{kn}}$$
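The recovery itself is just an elementwise product followed by a normalization over topics; a toy sketch with invented codes and dictionary values:

```python
import numpy as np

# Toy codes s_{n,k} and dictionary weights beta_{k,n} for I=3 words, K=2 topics.
s = np.array([[0.8, 0.0],
              [0.5, 0.3],
              [0.0, 0.9]])
beta = np.array([[0.4, 0.2],
                 [0.3, 0.6],
                 [0.5, 0.1]])

contrib = s * beta                            # expected count of word n from topic k
theta = contrib.sum(axis=0) / contrib.sum()   # normalize over all topics
print(theta)                                  # roughly [0.635, 0.365]
```

Note how the zero entries of `s` (the sparse codes) directly suppress the corresponding topic contributions, so document-level sparsity follows from word-level sparsity.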
Results
• Dataset: 20-newsgroups — 18,846 documents, 26,214 distinct words, 20 related categories.
• Baseline methods: probabilistic model (LDA), non-probabilistic model (NMF), sparse topic model (STC).
[Figures: time efficiency, topic sparsity, and classification accuracy.]
Understanding the Topic (Feature Sparseness):
Biterm Topic Model for Short Text (WWW'13)
Short texts are prevalent
Uncovering the topics of short texts is crucial for a wide range of content-analysis tasks.

Text type       | Data source  | Avg. word count (stop words removed)
Weibo           | Sina Weibo   | ~9
Questions       | Baidu Zhidao | ~6
Web page titles | Logs         | ~5
Queries         | Query log    | ~3
The limitation of conventional topic models
Bag-of-words assumption:
• The occurrences of words play a less discriminative role — there are not enough word counts to learn how words are related.
• Contexts in short texts are limited — it is harder to identify the senses of ambiguous words in short documents.
Key idea of our approach
• Topics are basically groups of correlated words, and the correlation is revealed by word co-occurrence patterns in documents — why not directly model the word co-occurrences for topic learning?
• Topic models on short texts suffer from severely sparse patterns in short documents — why not use the rich global word co-occurrence patterns to better reveal topics?
Biterm Topic Model (BTM)
• Biterm: an unordered pair of words co-occurring in a short text.
– "visit apple store" -> "visit apple", "visit store", "apple store"
• Model the generation of biterms with a latent topic structure:
– a topic ~ a probability distribution over words
– a corpus ~ a mixture of topics
– a biterm ~ two i.i.d. samples drawn from one topic
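The biterm extraction step is simple enough to sketch directly; this follows the "visit apple store" example above (tokenization by whitespace is a simplifying assumption):

```python
from itertools import combinations

def biterms(text):
    """All unordered word pairs co-occurring in one short text."""
    words = text.split()
    return [tuple(sorted(pair)) for pair in combinations(words, 2)]

print(biterms("visit apple store"))
# -> [('apple', 'visit'), ('store', 'visit'), ('apple', 'store')]
```

Because biterms are pooled over the whole corpus rather than per document, BTM learns from the rich global co-occurrence statistics instead of the sparse counts inside each short text.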
Comparison between different models
LDA: document-level topic distribution (suffers from document sparsity); models the generation of each word (ignores context).
Mixture of Unigrams: corpus-level topic distribution (alleviates document sparsity); single-topic assumption in each document (too strong an assumption).
BTM: corpus-level topic distribution (alleviates document sparsity); models the generation of word pairs (leverages context).
Evaluation on Tweets
• Dataset: Tweets2011
– Sampled 50 hashtags with clear topics and extracted the tweets carrying them.
• Evaluation metric: H score
– IntraDis: average distance between docs under the same hashtag
– InterDis: average distance between docs under different hashtags
– The smaller the H score, the better the topic representation.
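From the two quantities above, the H score can be sketched as IntraDis divided by InterDis; the slide does not give the exact distance function, so Euclidean distance between document topic vectors is assumed here, and the example vectors are invented:

```python
import numpy as np
from itertools import combinations

def h_score(docs, labels):
    """IntraDis / InterDis over document topic vectors (Euclidean assumed)."""
    intra, inter = [], []
    for (xi, li), (xj, lj) in combinations(zip(docs, labels), 2):
        d = np.linalg.norm(np.asarray(xi) - np.asarray(xj))
        (intra if li == lj else inter).append(d)
    return np.mean(intra) / np.mean(inter)

# Two tight clusters of topic vectors, labeled by hashtag: H well below 1.
h = h_score([[0, 1.0], [0, 0.9], [1.0, 0], [0.9, 0]], [0, 0, 1, 1])
```

A score below 1 means same-hashtag documents sit closer together than cross-hashtag ones, i.e. the topic representation separates the hashtags well.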
Evaluation on Baidu Zhidao
• Dataset: Baidu Zhidao Q&A
– Questions classified according to their tags
Computation complexity: bring order to big data
• Ranking is a central problem in many applications: Web search, recommendation, information filtering.
Ranking meets Big Data
Computation cost is high, especially because ranking is a comparatively complex task. We can save computation cost if we find the core data of the ranking problem.
Ranking (Algorithm):
Top-k Learning to Rank: Labeling, Ranking and Evaluation (SIGIR'12)
Top-k Learning to Rank
Ranking with big data (e.g. at WSDM scale): n is usually 10^5 ~ 10^6 documents per query, but users mainly care about the top-k ranking, where k is usually 10-20.
Conventional pipeline: multi-level ratings -> full-order ground truth over positions 1..n -> global learning -> global prediction.
Top-k pipeline: top-k ground truth over positions 1..k -> local learning.
Local = Global
[Figures: NDCG@5 and NDCG@10 curves for RankNet and ListMLE trained with top-k ground truth, k from 0 to 100, against the NDCG@5-full and NDCG@10-full baselines, on MQ2007list and MQ2008list; local top-k learning matches global learning.]
Theoretical analysis was submitted to NIPS 2013.
Top-k Ranking Framework
• Top-k labeling: an efficient labeling strategy to get top-k ground truth.
• Top-k ranking: more powerful ranking algorithms in the new scenario.
• Top-k evaluation: new evaluation measures for the new scenario.
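The truncated measure behind these experiments, NDCG@k, can be sketched as follows; the exponential-gain, log2-discount convention is assumed, since the slides do not spell it out:

```python
import math

def dcg_at_k(rels, k):
    """Discounted cumulative gain over the first k relevance labels."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(ranked_rels, k):
    """DCG@k of the predicted order, normalized by the ideal order's DCG@k."""
    idcg = dcg_at_k(sorted(ranked_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / idcg if idcg > 0 else 0.0

perfect = ndcg_at_k([3, 2, 1, 0], k=2)   # a perfectly ordered list scores 1.0
```

Because the discount decays with position, only the head of the ranking matters, which is why labeling and learning can be restricted to the top k without hurting the measured quality.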
Computation complexity: predict the dynamic structure
• Percolation is a theoretical underpinning for modeling the dynamic structure of networks (nodes = individuals, edges = relations).
• Examples: rumor diffusion, epidemic spreading, online information diffusion, currency circulation.
Problems
• Predict the percolation transition point (predictability)
– Purpose:
• Epidemiology and rumor spreading: know when to vaccinate to avoid large-scale epidemics or rumor spreading.
• Currency circulation: help governments make policies to avoid inflation.
• Online information diffusion: master the rules of online information diffusion.
• Predict how fast the outbreak is at the percolation transition point, and whether the transition is controllable (controllability)
– Purpose:
• Epidemiology: estimate how many people need to be vaccinated.
• Online information diffusion: estimate the strength of a piece of information.
An equivalent description of predictability & controllability
• Continuous transition (e.g. the Erdos-Renyi model): predictable & controllable.
• Discontinuous transition (e.g. the BFW model): a huge gap at the transition point makes the system unpredictable & uncontrollable.
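The continuous Erdos-Renyi transition can be seen in a few lines of simulation; this union-find sketch grows a random graph and reports the largest-component fraction (the graph size, seed, and degree values are illustrative, and edges are sampled with replacement for brevity):

```python
import random
from collections import Counter

def largest_component_fraction(n, avg_degree, seed=0):
    """Fraction of nodes in the largest component of a random graph
    with the given average degree, found via union-find."""
    rng = random.Random(seed)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for _ in range(int(avg_degree * n / 2)):   # expected number of edges
        a, b = rng.randrange(n), rng.randrange(n)
        parent[find(a)] = find(b)              # union the two clusters

    sizes = Counter(find(x) for x in range(n))
    return max(sizes.values()) / n

# Below the critical mean degree 1 the largest cluster stays tiny; above it,
# a giant component emerges smoothly -- the continuous (predictable) case.
print(largest_component_fraction(2000, 0.5), largest_component_fraction(2000, 2.0))
```

In a discontinuous (e.g. BFW-style) process the same curve would jump abruptly, which is precisely why such transitions are hard to anticipate and steer.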
Challenges in the era of Big Data
• Large numbers of multi-type individuals (heterogeneous nodes) and multiplex relations (heterogeneous edges), with complex and complicated mechanisms driving the structure dynamics.
• Lack of early signs or critical features for the percolation transition, while information spreads very quickly within a very short time interval.
• Hence it is hard to predict when the percolation transition occurs (predictability) and how fast the information outbreak is at the transition (controllability).
Phys. Rev. E 87, 052130, 2013
We find typical features of discontinuous percolation transitions which help us quickly recognize that information diffusion is unpredictable & uncontrollable:
• Lack of finite-size scaling (EPL 100(6), 66006)
• Multiple giant components may appear
• A second percolation transition may occur (Phys. Rev. Lett. 106, 115701, 2011)
Can be useful on social networks
• Model information diffusion in online social networks.
• Predict the explosive phase transition of online information diffusion.
• Investigate ways to make an unpredictable and uncontrollable system more predictable and controllable.
ADA System:
GoLaxy Advanced Data Analytics System
• Goal: discover, query, and infer the relationships between objects
– Multi-type objects: virtual people, real people, events, organizations, etc.
– Relation discovery:
• People to people: social, interaction, co-occurrence, action
• People to event: initiate, participate, involve
• Event to event: causality, sequence, containment
• Others: people to org, event to org, org to org
– Relation query:
• Retrieve the relations between objects
– Relation inference:
• Inference/prediction
• Reasoning
• Virtual identity recognition
Case 1: Query Real People on the Web
Case 2: Event Analysis and Tracking
Summary
• Three scientific challenges for big data
– Data complexity
– Computation complexity
– System complexity
• Our research on big data
– Finding the semantic representations
– Bringing order to big data
– Predicting the structure dynamics
• A practical big data system
– ADA: discover, query, and infer the relationships between objects
Some of the Researchers in our Lab
Jiafeng Guo (郭嘉丰), Huawei Shen (沈华伟), Yanyan Lan (兰艳艳), Wei Chen (陈巍), etc.
Thanks for your attention!
cxq@ict.ac.cn